Linear Regression is a supervised machine learning algorithm where the predicted output is continuous and has a constant slope. It is used to predict values within a continuous range (e.g. sales, price) rather than to classify them into categories (e.g. cat, dog). It assumes a linear relationship between the input features and the output target.
- The foundation of many complex models
  - Neural Networks are essentially many linear-regression units composed together (with nonlinear activations in between)
- A very good baseline model
  - In practice, always run a Linear Regression first
  - then check whether a more complex model really delivers an improvement
- Extremely strong interpretability
  - the effect of each feature on the result can be explained directly to people
  - very important in finance, risk control, and regulated environments
Simple Linear Regression
Simple linear regression uses the traditional slope-intercept form, where x represents our input data and y represents our prediction:
y = mx + b
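As a quick sketch, the slope m and intercept b can be estimated from data by a least-squares fit. Here `np.polyfit` (a standard NumPy routine) recovers parameters close to the ones used to build the toy data; the data itself is made up for illustration:

```python
import numpy as np

# Toy data with a known linear trend: y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(0, 0.01, size=x.shape)

# np.polyfit with degree 1 returns [m, b] for y = m*x + b
m, b = np.polyfit(x, y, deg=1)
print(m, b)  # close to 2.0 and 1.0
```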
Ordinary Least Squares (OLS)
Ordinary Least Squares (OLS) is a method for estimating the unknown parameters in a linear regression model. It minimizes the sum of the squares of the differences between the observed dependent variable (values of the variable being predicted) and those predicted by the linear function.
The goal of OLS is to find the best-fitting line through the data by minimizing the sum of the squares of the vertical distances of the points from the line. These distances are known as the residuals.
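A minimal sketch of OLS in NumPy, under the closed-form view β = (XᵀX)⁻¹Xᵀy; `np.linalg.lstsq` solves the same least-squares problem more stably than inverting XᵀX directly. The toy data is made up so the answer is known in advance:

```python
import numpy as np

# Design matrix with an intercept column; true model: y = 3 + 2*x
X = np.column_stack([np.ones(5), np.array([0., 1., 2., 3., 4.])])
y = np.array([3., 5., 7., 9., 11.])

# OLS minimizes the sum of squared residuals ||y - X @ beta||^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # ≈ [3., 2.]
```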
Multiple Linear Regression
Multiple Linear Regression is a statistical technique used to predict the outcome of a dependent variable based on two or more independent variables. It is an extension of simple linear regression to multiple predictors and is used to understand the relationship between several independent variables and a dependent variable.
A multi-variable linear equation might look like this, where w represents the coefficients, or weights, that our model will try to learn:
f(x, y, z) = w₁x + w₂y + w₃z

1. Assumptions

| Assumption | Violation Leads To | Common Diagnostic |
| --- | --- | --- |
| Linearity | Model misspecification | Residual plots |
| Independence | Inflated significance | Durbin–Watson |
| Homoscedasticity | Invalid standard errors | Breusch–Pagan |
| Normality | Invalid inference | Q–Q plot |
| No multicollinearity | Unstable coefficients | VIF |
1. Linearity
Assumption
There is a linear relationship between the independent variables ( X ) and the dependent variable ( y ).
Why it matters
If the true relationship is nonlinear, the model will be misspecified, leading to biased estimates and poor predictive performance.
How to check
- Scatter plots of ( y ) vs. predictors
- Residuals vs. fitted values plot (should show no systematic pattern)
Typical fixes
- Feature transformation (log, square, interaction terms)
- Polynomial regression
- Use non-linear models if appropriate
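To illustrate the residuals-vs-fitted check, here is a small sketch: fitting a straight line to purely quadratic (made-up) data leaves a clearly systematic, U-shaped residual pattern, which is the signature of a violated linearity assumption:

```python
import numpy as np

# Quadratic data fit with a straight line
x = np.linspace(-3, 3, 50)
y = x ** 2
m, b = np.polyfit(x, y, deg=1)
residuals = y - (m * x + b)

# Systematic pattern: residuals are positive at both ends
# and negative in the middle, instead of randomly scattered
print(residuals[0] > 0, residuals[25] < 0, residuals[-1] > 0)
```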
2. No Autocorrelation (Independence)
Watch out for:
- Time-series data
- The same customer appearing repeatedly
- Data with "inertia"
What happens if this is violated?
👉 The model becomes overconfident, believing it is more accurate than it really is.
Assumption
Residuals are independent across observations (no autocorrelation).
Why it matters
Violation leads to underestimated standard errors, inflating t-statistics and causing false significance.
Common violation scenarios
- Time-series data
- Panel / longitudinal data
- Repeated measurements
How to check
- Durbin–Watson test (the DW statistic ranges over 0 ≤ DW ≤ 4)

| DW range | Interpretation |
| --- | --- |
| 1.5 – 2.5 | usually taken as no obvious autocorrelation |
| < 1.5 | suspected positive autocorrelation |
| > 2.5 | suspected negative autocorrelation |
- Residuals plotted over time
Typical fixes
- Add lag variables
- Use GLS / AR models
- Clustered or robust standard errors
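As a sketch of the Durbin–Watson check, statsmodels provides `durbin_watson`. The residual series below are synthetic, purely for illustration: independent noise lands near DW ≈ 2, while positively autocorrelated AR(1) residuals fall far below:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)

# Independent residuals: DW should land near 2
independent = rng.normal(size=500)

# Positively autocorrelated residuals (AR(1) with rho = 0.9):
# DW drops well below 1.5
ar1 = np.zeros(500)
for t in range(1, 500):
    ar1[t] = 0.9 * ar1[t - 1] + rng.normal()

print(durbin_watson(independent))  # roughly 2
print(durbin_watson(ar1))          # far below 1.5
```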
3. Homoscedasticity (Constant Variance)
Assumption
The variance of residuals is constant across all levels of predictors.
Whether X is large or small, the magnitude of the model's errors stays roughly the same.
Why it matters
Heteroscedasticity does not bias coefficients, but it invalidates standard errors and hypothesis tests.
How to check
- Residuals vs. fitted values plot (no funnel shape)
- Breusch–Pagan test
- White test
Typical fixes
- Log-transform the target variable
- Use heteroscedasticity-robust standard errors
- Weighted least squares (WLS)
4. Normality of Residuals
Assumption
Residuals are normally distributed with mean zero.
Most prediction errors should be close to 0, with very few extreme ones.
- The impact of violations shrinks as the sample grows
- Especially important for small samples
Why it matters
- Required for valid confidence intervals and hypothesis testing
- Less critical for large samples (Central Limit Theorem)
How to check
- Histogram of residuals
- Q–Q plot
- Shapiro–Wilk test (small samples)
Typical fixes
- Transform dependent variable
- Use bootstrap inference
- Rely on asymptotic results for large ( n )
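A sketch of the Shapiro–Wilk check with `scipy.stats.shapiro`, comparing made-up normal residuals against clearly skewed (exponential) ones; a low p-value is evidence against normality:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_resid = rng.normal(size=100)
skewed_resid = rng.exponential(size=100)

# Shapiro–Wilk null hypothesis: the sample is normally distributed
_, p_normal = stats.shapiro(normal_resid)
_, p_skewed = stats.shapiro(skewed_resid)

# The exponential sample should be rejected decisively
print(p_normal, p_skewed)
```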
5. No Multicollinearity
Assumption
Independent variables are not highly correlated with each other.
Why it matters
- Inflates variance of coefficients
- Makes estimates unstable and hard to interpret
- Small data changes → large coefficient swings
How to check
- Correlation matrix
- Variance Inflation Factor (VIF)
Rule of thumb
- VIF > 5 (moderate concern)
- VIF > 10 (serious issue)
Typical fixes
- Drop redundant variables
- Feature selection
- PCA / regularization (Ridge, Lasso)
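A sketch of the VIF diagnostic with statsmodels' `variance_inflation_factor`, on synthetic features where x2 is deliberately an almost exact copy of x1, while x3 is independent:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.05, size=n)  # nearly a duplicate of x1
x3 = rng.normal(size=n)                # independent feature

X = np.column_stack([x1, x2, x3])
# VIF for column i: regress it on the other columns; high R^2 → high VIF
vifs = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
print(vifs)  # x1 and x2 have very large VIFs; x3 stays near 1
```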
2. Cost function
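A minimal sketch, assuming the standard mean-squared-error formulation J(w, b) = (1/n) Σᵢ (w·xᵢ + b − yᵢ)², which is exactly zero at the true parameters of a noiseless linear dataset and positive everywhere else:

```python
import numpy as np

def mse_cost(w, b, x, y):
    """Mean squared error cost for the simple linear model y_hat = w*x + b."""
    y_hat = w * x + b
    return np.mean((y_hat - y) ** 2)

x = np.array([0., 1., 2., 3.])
y = np.array([1., 3., 5., 7.])   # exactly y = 2x + 1

print(mse_cost(2.0, 1.0, x, y))  # 0.0 at the true parameters
print(mse_cost(1.0, 0.0, x, y))  # 7.5 away from them
```

Training amounts to searching (analytically with OLS, or iteratively with gradient descent) for the (w, b) that minimizes this cost.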
3. Evaluation
The performance of a multiple linear regression model can be evaluated using metrics like R-squared, Mean Squared Error (MSE), or Mean Absolute Error (MAE). These metrics provide information on how well your model is explaining the variability of the data or how close the predictions are to the actual values.
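These metrics are all available in scikit-learn; a small sketch with made-up true values and predictions:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

print("R^2:", r2_score(y_true, y_pred))              # 0.975
print("MSE:", mean_squared_error(y_true, y_pred))    # 0.125
print("MAE:", mean_absolute_error(y_true, y_pred))   # 0.25
```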
Polynomial Regression
The essence of the approach:
apply nonlinear transformations to the input features, then fit an ordinary Linear Regression on the transformed features.

👉 The model remains linear in its parameters.
Why is Polynomial Regression useful?
Because it can capture nonlinear relationships between the features and the target while keeping all the machinery (OLS fitting, interpretability) of linear models.
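A sketch with scikit-learn's `PolynomialFeatures`: the quadratic toy data is made up for illustration, and a degree-2 expansion turns [x] into [1, x, x²] so a plain linear model can fit the curve exactly:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = (x ** 2).ravel()   # a purely quadratic target

# The features become nonlinear in x, but the model
# is still linear in its learned parameters
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.score(x, y))  # ≈ 1.0: the expanded features capture the curve
```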
How it relates to overfitting and underfitting
Overfitting
When a model performs well on training data but poorly on unseen data, it is a sign of overfitting. The model learns the noise in the data along with the real patterns, essentially memorising the training set, which makes it specific rather than general.
To prevent overfitting, we can use:
- Feature selection: keep only the attributes that contribute to the final decision and remove unnecessary ones.
- Cross-validation: divide the data into two or more subsets; use one subset for training and the others for validation and testing. This can be achieved with k-fold cross-validation.
- Regularisation: add a penalty term during training that discourages the model from fitting the noise. Lasso and ridge regularisation are widely used.
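As a sketch of regularisation, ridge regression adds an L2 penalty that keeps coefficients small and stable when features are nearly collinear; the duplicated feature below is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 30
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.01, size=n)   # nearly a duplicate of x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(0, 0.1, size=n)     # true combined effect is 1.0

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# OLS may split the weight erratically between the near-duplicate
# features; the L2 penalty spreads it evenly and keeps it small
print(np.abs(ols.coef_).max(), np.abs(ridge.coef_).max())
print(ridge.coef_.sum())  # close to the true combined effect of 1.0
```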
Underfitting
When a model performs poorly on both training and testing data, it has failed to learn the underlying patterns from the training data, and in turn gives bad results on the test set.
To prevent underfitting, we can:
- Remove noise: get rid of data points that could stem from measurement or sampling errors in the training data.
- Add more features: include additional feature attributes that contribute to the prediction.
Pros and Cons
Pros
- Easy to implement; the theory is not complex, and it requires little computational power compared to other algorithms.
- Coefficients are easy to interpret for analysis.
- Performs very well when the relationship between features and target really is linear.
- Susceptible to overfitting, but this can be mitigated with dimensionality reduction techniques, cross-validation, and regularization methods.
Cons
- Real-world datasets are rarely perfectly linear; the model often underfits in real-world scenarios or is outperformed by other ML and Deep Learning algorithms.
- Parametric: it makes many assumptions that the data needs to meet regarding its distribution, and it assumes a linear relationship between the dependent and independent variables.
- Parametric: assumes statistical distributions in the data; several conditions need to be met (e.g. Linear Regression)
- Examples of assumptions:
  - There is a linear relationship between the dependent variable and the independent variables.
  - The independent variables aren't too highly correlated with each other.
  - The observations of the dependent variable are selected independently and at random.
  - Regression residuals are normally distributed.
- Non-Parametric: makes no assumptions about the data's distribution or conditions it must meet.
Python Implementation
Simple Linear Regression
```python
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Getting data
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Training and testing data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

# Training the Simple Linear Regression model on the training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Print coefficients
print('Coefficients: \n', regressor.coef_)

# Predicting test data
y_pred = regressor.predict(X_test)

# Evaluating the model
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

# Residuals
sns.histplot(y_test - y_pred, bins=50, kde=True)

# Visualising the training set results
plt.scatter(X_train, y_train, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

# Visualising the test set results
plt.scatter(X_test, y_test, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
```
Multiple Linear Regression
This particular dataset holds data from 50 startups in New York, California, and Florida. The features in this dataset are R&D Spending, Administration Spending, Marketing Spending, and a location feature (State), while the target variable is Profit.
1. R&D spending: The amount which startups are spending on Research and development.
2. Administration spending: The amount which startups are spending on the Admin panel.
3. Marketing spending: The amount which startups are spending on marketing strategies.
4. State: To which state that particular startup belongs.
5. Profit: How much profit that particular startup is making.
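Since the startup CSV itself isn't included here, the sketch below uses a tiny made-up stand-in with the same column structure. It one-hot encodes the categorical State feature and passes the numeric columns through before fitting, which is one common way to handle a mixed numeric/categorical regression in scikit-learn:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Tiny synthetic stand-in for the 50-startups data (the real CSV is not shown)
df = pd.DataFrame({
    "RD_Spend":       [160e3, 150e3, 100e3, 90e3, 60e3, 20e3],
    "Administration": [130e3, 120e3, 110e3, 100e3, 90e3, 80e3],
    "Marketing":      [470e3, 440e3, 400e3, 380e3, 200e3, 0.0],
    "State":          ["New York", "California", "Florida",
                       "New York", "California", "Florida"],
})
# Profit is built as an exact linear function of the numeric columns
df["Profit"] = 0.8 * df["RD_Spend"] + 0.05 * df["Marketing"] + 50e3

# One-hot encode State (dropping one dummy to avoid collinearity),
# pass the numeric columns through unchanged
preprocess = ColumnTransformer(
    [("state", OneHotEncoder(drop="first"), ["State"])],
    remainder="passthrough",
)
model = make_pipeline(preprocess, LinearRegression())
model.fit(df.drop(columns="Profit"), df["Profit"])
print(model.score(df.drop(columns="Profit"), df["Profit"]))  # ≈ 1.0 here
```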


