Linear Regression is a supervised machine learning algorithm where the predicted output is continuous and has a constant slope. It is used to predict values within a continuous range (e.g. sales, price) rather than to classify them into categories (e.g. cat, dog). It assumes a linear relationship between the input features and the output target.
- The foundation of many complex models
  - Neural Networks are essentially many linear-regression units composed together (with nonlinear activations in between)
- A very good baseline model
  - In practice, always run a Linear Regression first
  - then check whether a more complex model really delivers an improvement
- Extremely strong interpretability
  - the effect of each feature on the result can be explained directly to people
  - very important in finance, risk control, and regulated environments
Simple Linear Regression
Simple linear regression uses the traditional slope-intercept form, where x represents our input data and y represents our prediction:
y = mx + b
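As a quick sketch, the slope m and intercept b can be estimated from data by a least-squares fit. Here `np.polyfit` (a standard NumPy routine) recovers parameters close to the ones used to build the toy data; the data itself is made up for illustration:

```python
import numpy as np

# Toy data with a known linear trend: y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(0, 0.01, size=x.shape)

# np.polyfit with degree 1 returns [m, b] for y = m*x + b
m, b = np.polyfit(x, y, deg=1)
print(m, b)  # close to 2.0 and 1.0
```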
Ordinary Least Squares (OLS)
Ordinary Least Squares (OLS) is a method for estimating the unknown parameters in a linear regression model. It minimizes the sum of the squares of the differences between the observed dependent variable (values of the variable being predicted) and those predicted by the linear function.
The goal of OLS is to find the best-fitting line through the data by minimizing the sum of the squares of the vertical distances of the points from the line. These distances are known as the residuals.
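A minimal sketch of OLS in NumPy, under the closed-form view β = (XᵀX)⁻¹Xᵀy; `np.linalg.lstsq` solves the same least-squares problem more stably than inverting XᵀX directly. The toy data is made up so the answer is known in advance:

```python
import numpy as np

# Design matrix with an intercept column; true model: y = 3 + 2*x
X = np.column_stack([np.ones(5), np.array([0., 1., 2., 3., 4.])])
y = np.array([3., 5., 7., 9., 11.])

# OLS minimizes the sum of squared residuals ||y - X @ beta||^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # ≈ [3., 2.]
```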
Multiple Linear Regression
Multiple Linear Regression is a statistical technique used to predict the outcome of a dependent variable based on two or more independent variables. It is an extension of simple linear regression to multiple predictors and is used to understand the relationship between several independent variables and a dependent variable.
A multi-variable linear equation might look like this, where w represents the coefficients, or weights, that our model will try to learn:
f(x, y, z) = w₁x + w₂y + w₃z

1. Assumptions

| Assumption | Violation Leads To | Common Diagnostic |
| --- | --- | --- |
| Linearity | Model misspecification | Residual plots |
| Independence | Inflated significance | Durbin–Watson |
| Homoscedasticity | Invalid standard errors | Breusch–Pagan |
| Normality | Invalid inference | Q–Q plot |
| No multicollinearity | Unstable coefficients | VIF |
1. Linearity
Assumption
There is a linear relationship between the independent variables ( X ) and the dependent variable ( y ).
Why it matters
If the true relationship is nonlinear, the model will be misspecified, leading to biased estimates and poor predictive performance.
How to check
- Scatter plots of ( y ) vs. predictors
- Residuals vs. fitted values plot (should show no systematic pattern)
Typical fixes
- Feature transformation (log, square, interaction terms)
- Polynomial regression
- Use non-linear models if appropriate
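To illustrate the residuals-vs-fitted check, here is a small sketch: fitting a straight line to purely quadratic (made-up) data leaves a clearly systematic, U-shaped residual pattern, which is the signature of a violated linearity assumption:

```python
import numpy as np

# Quadratic data fit with a straight line
x = np.linspace(-3, 3, 50)
y = x ** 2
m, b = np.polyfit(x, y, deg=1)
residuals = y - (m * x + b)

# Systematic pattern: residuals are positive at both ends
# and negative in the middle, instead of randomly scattered
print(residuals[0] > 0, residuals[25] < 0, residuals[-1] > 0)
```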
2. No Autocorrelation (Independence)
Watch out for:
- Time-series data
- The same customer appearing repeatedly
- Data with "inertia"
What happens if this is violated?
👉 The model becomes overconfident, believing it is more accurate than it really is.
Assumption
Residuals are independent across observations (no autocorrelation).
Why it matters
Violation leads to underestimated standard errors, inflating t-statistics and causing false significance.
Common violation scenarios
- Time-series data
- Panel / longitudinal data
- Repeated measurements
How to check
- Durbin–Watson test (the DW statistic ranges over 0 ≤ DW ≤ 4)

| DW range | Interpretation |
| --- | --- |
| 1.5 – 2.5 | usually taken as no obvious autocorrelation |
| < 1.5 | suspected positive autocorrelation |
| > 2.5 | suspected negative autocorrelation |
- Residuals plotted over time
Typical fixes
- Add lag variables
- Use GLS / AR models
- Clustered or robust standard errors
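As a sketch of the Durbin–Watson check, statsmodels provides `durbin_watson`. The residual series below are synthetic, purely for illustration: independent noise lands near DW ≈ 2, while positively autocorrelated AR(1) residuals fall far below:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)

# Independent residuals: DW should land near 2
independent = rng.normal(size=500)

# Positively autocorrelated residuals (AR(1) with rho = 0.9):
# DW drops well below 1.5
ar1 = np.zeros(500)
for t in range(1, 500):
    ar1[t] = 0.9 * ar1[t - 1] + rng.normal()

print(durbin_watson(independent))  # roughly 2
print(durbin_watson(ar1))          # far below 1.5
```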
3. Homoscedasticity (Constant Variance)
Assumption
The variance of residuals is constant across all levels of predictors.
Whether X is large or small, the magnitude of the model's errors stays roughly the same.
Why it matters
Heteroscedasticity does not bias coefficients, but it invalidates standard errors and hypothesis tests.
How to check
- Residuals vs. fitted values plot (no funnel shape)
- Breusch–Pagan test
- White test
Typical fixes
- Log-transform the target variable
- Use heteroscedasticity-robust standard errors
- Weighted least squares (WLS)
4. Normality of Residuals
Assumption
Residuals are normally distributed with mean zero.
Most prediction errors should be close to 0, with very few extreme ones.
- The impact of violations shrinks as the sample grows
- Especially important for small samples
Why it matters
- Required for valid confidence intervals and hypothesis testing
- Less critical for large samples (Central Limit Theorem)
How to check
- Histogram of residuals
- Q–Q plot
- Shapiro–Wilk test (small samples)
Typical fixes
- Transform dependent variable
- Use bootstrap inference
- Rely on asymptotic results for large ( n )
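A sketch of the Shapiro–Wilk check with `scipy.stats.shapiro`, comparing made-up normal residuals against clearly skewed (exponential) ones; a low p-value is evidence against normality:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_resid = rng.normal(size=100)
skewed_resid = rng.exponential(size=100)

# Shapiro–Wilk null hypothesis: the sample is normally distributed
_, p_normal = stats.shapiro(normal_resid)
_, p_skewed = stats.shapiro(skewed_resid)

# The exponential sample should be rejected decisively
print(p_normal, p_skewed)
```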
5. No Multicollinearity
Assumption
Independent variables are not highly correlated with each other.
Why it matters
- Inflates variance of coefficients
- Makes estimates unstable and hard to interpret
- Small data changes → large coefficient swings
How to check
- Correlation matrix
- Variance Inflation Factor (VIF)
Rule of thumb
- VIF > 5 (moderate concern)
- VIF > 10 (serious issue)
Typical fixes
- Drop redundant variables
- Feature selection
- PCA / regularization (Ridge, Lasso)
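A sketch of the VIF diagnostic with statsmodels' `variance_inflation_factor`, on synthetic features where x2 is deliberately an almost exact copy of x1, while x3 is independent:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.05, size=n)  # nearly a duplicate of x1
x3 = rng.normal(size=n)                # independent feature

X = np.column_stack([x1, x2, x3])
# VIF for column i: regress it on the other columns; high R^2 → high VIF
vifs = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
print(vifs)  # x1 and x2 have very large VIFs; x3 stays near 1
```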
2. Cost function
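A minimal sketch, assuming the standard mean-squared-error formulation J(w, b) = (1/n) Σᵢ (w·xᵢ + b − yᵢ)², which is exactly zero at the true parameters of a noiseless linear dataset and positive everywhere else:

```python
import numpy as np

def mse_cost(w, b, x, y):
    """Mean squared error cost for the simple linear model y_hat = w*x + b."""
    y_hat = w * x + b
    return np.mean((y_hat - y) ** 2)

x = np.array([0., 1., 2., 3.])
y = np.array([1., 3., 5., 7.])   # exactly y = 2x + 1

print(mse_cost(2.0, 1.0, x, y))  # 0.0 at the true parameters
print(mse_cost(1.0, 0.0, x, y))  # 7.5 away from them
```

Training amounts to searching (analytically with OLS, or iteratively with gradient descent) for the (w, b) that minimizes this cost.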
3. Evaluation
The performance of a multiple linear regression model can be evaluated using metrics like R-squared, Mean Squared Error (MSE), or Mean Absolute Error (MAE). These metrics provide information on how well your model is explaining the variability of the data or how close the predictions are to the actual values.
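These metrics are all available in scikit-learn; a small sketch with made-up true values and predictions:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

print("R^2:", r2_score(y_true, y_pred))              # 0.975
print("MSE:", mean_squared_error(y_true, y_pred))    # 0.125
print("MAE:", mean_absolute_error(y_true, y_pred))   # 0.25
```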
Polynomial Regression
The essence of the approach:
apply nonlinear transformations to the input features, then fit an ordinary Linear Regression on the transformed features.

👉 The model remains linear in its parameters.
Why is Polynomial Regression useful?
Because it can capture nonlinear relationships between the features and the target while keeping all the machinery (OLS fitting, interpretability) of linear models.
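A sketch with scikit-learn's `PolynomialFeatures`: the quadratic toy data is made up for illustration, and a degree-2 expansion turns [x] into [1, x, x²] so a plain linear model can fit the curve exactly:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = (x ** 2).ravel()   # a purely quadratic target

# The features become nonlinear in x, but the model
# is still linear in its learned parameters
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.score(x, y))  # ≈ 1.0: the expanded features capture the curve
```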
How it relates to overfitting and underfitting
Overfitting
When a model performs well on training data but poorly on unseen data, it is a sign of overfitting. The model learns the noise in the data along with the real patterns, essentially memorising the training set, which makes it specific rather than general.
To prevent overfitting, we can use:
- Feature selection: keep only the attributes that contribute to the final decision and remove unnecessary ones.
- Cross-validation: divide the data into two or more subsets; use one subset for training and the others for validation and testing. This can be achieved with k-fold cross-validation.
- Regularisation: add a penalty term during training that discourages the model from fitting the noise. Lasso and ridge regularisation are widely used.
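As a sketch of regularisation, ridge regression adds an L2 penalty that keeps coefficients small and stable when features are nearly collinear; the duplicated feature below is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 30
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.01, size=n)   # nearly a duplicate of x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(0, 0.1, size=n)     # true combined effect is 1.0

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# OLS may split the weight erratically between the near-duplicate
# features; the L2 penalty spreads it evenly and keeps it small
print(np.abs(ols.coef_).max(), np.abs(ridge.coef_).max())
print(ridge.coef_.sum())  # close to the true combined effect of 1.0
```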
Underfitting
When a model performs poorly on both training and testing data, it has failed to learn the underlying patterns from the training data, and in turn gives bad results on the test set.
To prevent underfitting, we can:
- Remove noise: get rid of data points that could stem from measurement or sampling errors in the training data.
- Add more features: include additional feature attributes that contribute to the prediction.
Pros and Cons
Pros
- Easy to implement; the theory is not complex, and it requires little computational power compared to other algorithms.
- Coefficients are easy to interpret for analysis.
- Performs very well when the relationship between features and target really is linear.
- Susceptible to overfitting, but this can be mitigated with dimensionality reduction techniques, cross-validation, and regularization methods.
Cons
- Real-world datasets are rarely perfectly linear; the model often underfits in real-world scenarios or is outperformed by other ML and Deep Learning algorithms.
- Parametric: it makes many assumptions that the data needs to meet regarding its distribution, and it assumes a linear relationship between the dependent and independent variables.
- Parametric: assumes statistical distributions in the data; several conditions need to be met (e.g. Linear Regression)
- Examples of assumptions:
  - There is a linear relationship between the dependent variable and the independent variables.
  - The independent variables aren't too highly correlated with each other.
  - The observations of the dependent variable are selected independently and at random.
  - Regression residuals are normally distributed.
- Non-Parametric: makes no assumptions about the data's distribution or conditions it must meet.
Python Implementation
Simple Linear Regression
```python
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Getting data
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Training and testing data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

# Training the Simple Linear Regression model on the training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Print coefficients
print('Coefficients: \n', regressor.coef_)

# Predicting test data
y_pred = regressor.predict(X_test)

# Evaluating the model
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

# Residuals
sns.histplot(y_test - y_pred, bins=50, kde=True)

# Visualising the training set results
plt.scatter(X_train, y_train, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

# Visualising the test set results
plt.scatter(X_test, y_test, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
```
Multiple Linear Regression
This particular dataset holds data from 50 startups in New York, California, and Florida. The features in this dataset are R&D Spending, Administration Spending, Marketing Spending, and a location feature (State), while the target variable is Profit.
1. R&D spending: The amount which startups are spending on Research and development.
2. Administration spending: The amount which startups are spending on the Admin panel.
3. Marketing spending: The amount which startups are spending on marketing strategies.
4. State: To which state that particular startup belongs.
5. Profit: How much profit that particular startup is making.
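Since the startup CSV itself isn't included here, the sketch below uses a tiny made-up stand-in with the same column structure. It one-hot encodes the categorical State feature and passes the numeric columns through before fitting, which is one common way to handle a mixed numeric/categorical regression in scikit-learn:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Tiny synthetic stand-in for the 50-startups data (the real CSV is not shown)
df = pd.DataFrame({
    "RD_Spend":       [160e3, 150e3, 100e3, 90e3, 60e3, 20e3],
    "Administration": [130e3, 120e3, 110e3, 100e3, 90e3, 80e3],
    "Marketing":      [470e3, 440e3, 400e3, 380e3, 200e3, 0.0],
    "State":          ["New York", "California", "Florida",
                       "New York", "California", "Florida"],
})
# Profit is built as an exact linear function of the numeric columns
df["Profit"] = 0.8 * df["RD_Spend"] + 0.05 * df["Marketing"] + 50e3

# One-hot encode State (dropping one dummy to avoid collinearity),
# pass the numeric columns through unchanged
preprocess = ColumnTransformer(
    [("state", OneHotEncoder(drop="first"), ["State"])],
    remainder="passthrough",
)
model = make_pipeline(preprocess, LinearRegression())
model.fit(df.drop(columns="Profit"), df["Profit"])
print(model.score(df.drop(columns="Profit"), df["Profit"]))  # ≈ 1.0 here
```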


