Boosting is an ensemble method that combines several weak classifiers, trained sequentially, into one strong classifier: each new learner focuses on what the previous learners got wrong.
| Model | One-line summary | Analogy |
| --- | --- | --- |
| AdaBoost | Patches earlier mistakes step by step | An ordinary student repeatedly redoing the problems they got wrong |
| XGBoost | A professional crew that corrects errors systematically, with regularization and optimizations | A well-organized professional renovation crew |
| LightGBM | A speed-tuned version of XGBoost: faster, more compact, built for large-scale data | The renovation crew, plus an excavator and file compression |
AdaBoost
XGBoost (Extreme Gradient Boosting)
AdaBoost has weaknesses, which is why Gradient Boosting (GBDT) appeared later:
it uses the residuals (errors) to guide what the next tree learns.
XGBoost is the "professional engineering crew" version of GBDT, with many enhancements:
Core idea
Each tree fits the errors/residuals left by the previous trees.
For example, predicting default:
Tree 1:
predictions are inaccurate → leaves behind a set of residuals
Tree 2:
learns those residuals specifically
Tree 3:
keeps learning whatever residual is still left over
Finally:
many small trees sum up to a very strong model.
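The residual-chasing loop above can be sketched by hand with a few shallow scikit-learn trees (a minimal illustration of the idea only, not XGBoost's actual implementation; the data here is made up):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

# Start from a constant prediction (the mean), then let each
# small tree fit the residuals left by the ensemble so far.
pred = np.full_like(y, y.mean())
trees = []
for _ in range(50):
    residual = y - pred                # what is still unexplained
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residual)              # this tree learns the residual
    pred += 0.1 * tree.predict(X)      # shrinkage: each tree adds a little
    trees.append(tree)

print("training MSE:", np.mean((y - pred) ** 2))
```

The training MSE shrinks as more residual-fitting trees are stacked on, which is exactly the Tree 1 → Tree 2 → Tree 3 progression described above.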
Why is XGBoost strong? (simple enough to grasp on first hearing)
① Regularization (L1/L2)
Keeps trees from growing too many leaves → prevents overfitting.
② Shrinkage (learning rate)
Each tree contributes only a little, so the ensemble is more stable.
③ Column subsampling
Each tree uses only a subset of the features → more diversity among trees → better generalization.
④ Parallelism (tree-node splits can be computed in parallel)
Very fast training.
⑤ Sparsity-aware (missing values are skipped automatically)
No need to impute missing values by hand.
📌 XGBoost = GBDT upgraded with regularization, engineering optimizations, careful details, and safety rails.
Whole XGBoost pipeline
1. Import Necessary Libraries
```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
import xgboost as xgb
```
2. Load and Prepare Data
```python
# Example data loading
data = {
    'LotFrontage': [80, 81, 82, None, 84, 83, 85, 87, None, 89],
    'OverallQual': [7, 6, 7, 8, 5, 6, 8, 9, 7, 5],
    'YearBuilt': [2003, 1976, 2001, 1915, 2000, 2002, 1999, 1980, 2005, 1998],
    'SalePrice': [200000, 150000, 180000, 130000, 175000,
                  165000, 210000, 220000, 190000, 160000]
}
df = pd.DataFrame(data)

# Features and target
X = df.drop('SalePrice', axis=1)
y = df['SalePrice']

# Fill missing values
X['LotFrontage'] = X['LotFrontage'].fillna(0)
```
3. Define and Fit the Pipeline
```python
# Set up the pipeline steps
steps = [
    ("ohe_onestep", DictVectorizer(sparse=False)),
    ("xgb_model", xgb.XGBRegressor(objective='reg:squarederror', random_state=42))
]

# Create the pipeline
xgb_pipeline = Pipeline(steps)

# Convert the DataFrame to a list of record dictionaries
X_dict = X.to_dict("records")

# Fit the pipeline
xgb_pipeline.fit(X_dict, y)
```
4. Model Evaluation
```python
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Convert training and testing sets to dictionary format
X_train_dict = X_train.to_dict("records")
X_test_dict = X_test.to_dict("records")

# Fit the pipeline on the training data
xgb_pipeline.fit(X_train_dict, y_train)

# Predict on the test data
y_pred = xgb_pipeline.predict(X_test_dict)

# Calculate and print the Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
```
5. Hyperparameter Tuning Using Grid Search
```python
# Define the parameter grid
param_grid = {
    'xgb_model__n_estimators': [50, 100, 200],
    'xgb_model__learning_rate': [0.01, 0.1, 0.2],
    'xgb_model__max_depth': [3, 5, 7],
    'xgb_model__subsample': [0.6, 0.8, 1.0],
    'xgb_model__colsample_bytree': [0.6, 0.8, 1.0]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=xgb_pipeline,
    param_grid=param_grid,
    cv=3,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=2
)

# Fit GridSearchCV
grid_search.fit(X_train_dict, y_train)

# Print the best parameters and score
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", -grid_search.best_score_)
```
Summary
- Data Preparation: handled missing values in the `LotFrontage` column.
- Pipeline Setup: created a pipeline with `DictVectorizer` for one-hot encoding and `XGBRegressor` for regression.
- Model Training: split the data into training and testing sets, trained the model, and evaluated it using Mean Squared Error.
- Hyperparameter Tuning: used `GridSearchCV` to find the best hyperparameters for the XGBoost model.
LightGBM
LightGBM has exactly one goal: maximum speed with minimum memory.
Core idea
Like XGBoost, it is gradient boosting, but it adds two key tricks:
Key trick 1: Histogram Binning
Instead of comparing every value of a continuous feature point by point:
→ first compress the values into at most 255 bins
→ then search for split points over the bins
→ tens of times faster
→ far less memory
📌 In one sentence:
Compress a long file of continuous values into a small archive so it can be read quickly.
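The binning step can be sketched with NumPy (an illustration of the idea only; LightGBM's actual binning also handles sample weights, sparse values, and categorical features):

```python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.normal(size=10_000)          # a continuous feature: ~10k distinct values

# Compress the feature into at most 255 bins using quantile edges.
n_bins = 255
edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
binned = np.searchsorted(edges, feature)   # each value becomes a small bin index

# Split search now only has to consider ~255 candidate thresholds
# (the bin boundaries) instead of ~10,000 raw values.
print(binned.min(), binned.max(), np.unique(binned).size)
```

The memory win comes from the same step: a bin index fits in one byte, while the raw float took eight.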
Key trick 2: Leaf-wise growth
XGBoost:
- grows trees level by level (one full level at a time)
- each level keeps the left and right branches balanced
LightGBM:
- finds the leaf with the highest information gain → grows that leaf directly
- the tree ends up "uneven," but more accurate
📌 In one sentence:
LightGBM keeps digging wherever it is most valuable, so the model is stronger but overfits more easily.
Other LightGBM advantages
- Native support for categorical features
- The larger the dataset, the bigger its advantage
- Very low memory usage
- Very fast training (often 5–20× faster than XGBoost)
