Machine Learning

Machine learning study notes (purchased on Xiaohongshu)

Machine Learning Overview & Basic Concepts

  • Business understanding
  • Data collection
  • Data exploration and wrangling
  • Feature engineering
  • Building and training a model
  • Evaluating model performance
  • Fine-tuning the model
  • Model deployment

Machine learning Basics

Types of Data

1. What Is Data?

Data

  • Data is information or characteristics collected through observation, and it can usually be represented in numeric form.
  • Almost anything can be "numericized," so almost anything can become data.

Forms of Data

  • Numerical data: lengths, weights, temperatures, etc.
  • Text data: words, sentences, documents
  • Image data: made up of pixel values
  • Audio data: sound-wave signals
  • Video data: a sequence of images + a time dimension
👉 Key idea: at the lowest level, all data is numbers
2. Relationships in Data

Spatial Relationships

  • Data points have spatial "distances" or positional relationships to one another
  • Examples: geographic location, the relative positions of pixels in an image

Temporal Relationships

  • Data points are related through time
  • Points closer together in time are usually more strongly correlated
3. Types of Data by Structure

Structured Data

  • Data with a fixed structure and predefined fields
  • Usually stored as rows (records) + columns (fields)
  • Commonly found in:
    • Excel
    • Relational Databases
📌 Advantages:
  • Easy to store, query, and analyze
  • Usable directly even by non-technical users

Unstructured Data

  • No predefined fields or fixed length
  • Common types:
    • Text
    • Images
    • Audio
    • Video
📌 Characteristics:
  • Requires specialized tools and algorithms (NLP, CV, Deep Learning)
  • Makes up roughly 80% of enterprise data
4. Types of Data by Value Nature

Continuous Data

  • Continuous numeric variables
  • Between any two values there are infinitely many possible values
📌 Examples:
  • Temperature
  • Height / Weight
  • Time

Categorical Data

  • Variables that fall into a finite set of categories
  • Some are ordered (ordinal), some are not (nominal)
📌 Examples:
  • Gender
  • Major
  • Color
  • Material Type

Discrete Data

  • Numeric, but only countable values are possible
  • In machine learning, often treated as categorical data
📌 Examples:
  • Age
  • Number of parts
  • Year built
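These three value types map directly onto column dtypes in practice. Below is a minimal sketch (the toy columns are hypothetical, invented for illustration) of how pandas can represent them:

```python
import pandas as pd

# Hypothetical toy dataset illustrating the three value types
df = pd.DataFrame({
    "temperature": [21.5, 19.8, 25.1],   # continuous: any value in a range
    "color": ["red", "blue", "red"],     # categorical (nominal): finite labels
    "num_parts": [3, 7, 5],              # discrete: countable integers
})

# Pandas can make the categorical nature explicit
df["color"] = df["color"].astype("category")

print(df.dtypes)  # temperature float64, color category, num_parts int64
```

Marking `color` as `category` matters later: many models need categorical columns encoded, while continuous columns can be used (or scaled) directly.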
5. Time Series Data

Time Series Data

  • Data arranged in time order
  • Observations are usually equally spaced in time
📌 Examples:
  • Stock prices (15-minute / daily frequency)
  • Sensor readings
  • Smart meter usage

Key Assumptions

  1. Time is one-directional
      • Time only moves forward; it cannot be reversed.
  2. Temporal proximity matters
      • Points closer together in time are usually more strongly related.
6. Structured Data Terminology (Machine Learning Context)
The following uses house price prediction as a running example:

Observation / Instance / Example

  • A row in the data
  • Represents one concrete object or sample
  • In this example: one house

Feature

  • A column in the data
  • Describes one attribute of a sample
📌 Common synonyms:
  • Factor
  • Predictor
  • Independent Variable
  • X variable
  • Attribute
  • Dimension
📌 Examples:
  • Square footage
  • Number of bedrooms
  • Year built

Target

  • The variable the model is meant to predict
  • Usually the last column of the dataset
📌 Common synonyms:
  • Label
  • Response
  • Y variable
  • Dependent Variable
  • Annotation
📌 Example:
  • House sale price
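The terminology above maps cleanly onto code: the features form the X matrix, the target forms the y vector. A small sketch with a hypothetical house-price table (the column names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical house-price table: each row is one observation (a house),
# each column except the target is a feature
houses = pd.DataFrame({
    "sqft":       [1500, 2100, 900],
    "bedrooms":   [3, 4, 2],
    "year_built": [1995, 2008, 1972],
    "sale_price": [310000, 455000, 180000],  # target (label / y variable)
})

X = houses.drop(columns="sale_price")  # feature matrix: one column per feature
y = houses["sale_price"]               # target vector: one value per observation

print(X.shape, y.shape)  # (3, 3) (3,)
```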

What Is a Model?

Definition of a Model

Model

  • A model is essentially an approximation of the relationship between two or more variables.
  • A model is not the "real world"; it is a simplified representation of the true relationship.

Input Variables (X)

  • The model's input variables
  • Also called features / predictors / independent variables
👉 X represents the known information we use to explain or predict.

Output Variables (Y)

  • The model's output variables
  • Also called the target / label / dependent variable
👉 Y is the outcome the model tries to predict or estimate.

Error Term (ε)

  • Represents the part the model cannot explain
👉 Core idea (very important):
The real world contains randomness; no model can predict Y perfectly, so the existence of error must be acknowledged.

Mathematical Form

Y = f(X) + ε
Four Things Needed to Build a Model (very important)

1️⃣ Feature Selection

  • Choosing which variables to use as model inputs
👉 In other words:
  • Which information has predictive value?
  • Which variables might introduce noise or bias?

2️⃣ Algorithm

  • The overall structure or form of the model
  • Examples:
    • Linear Regression
    • Decision Tree
    • Random Forest
    • Neural Network
👉 The algorithm determines:
  • whether the model is linear or nonlinear
  • how complex a relationship it can capture

3️⃣ Hyperparameters

  • Parameters that control model complexity
  • Set before training
👉 A common metaphor:
Hyperparameters are knobs you can turn.
👉 They control whether the model is:
  • too simple → underfitting
  • too complex → overfitting

4️⃣ Loss Function

  • A function that quantifies model error
👉 In other words:
  • The loss function defines what "a good model" means.
  • The goal of training: minimize the loss.
Model Training

Training Data

  • Historical X (features)
  • Historical Y (targets)

Training Process

  • Given:
    • an algorithm
    • hyperparameters
    • a loss function
  • training learns a set of model parameters that minimize the loss.
👉 Key point:
Training ≠ memorizing the data.
Training = finding the parameters that best approximate the relationship between X and Y.

Bias-Variance Trade-Off

1. Why is model complexity challenging?
Model complexity is the degree to which a model can capture patterns in the data.
One of the core difficulties of machine-learning modeling is finding the "right model complexity" for the specific problem:
Too simple → it cannot learn the patterns.
Too complex → it learns the noise.
2. Model complexity is jointly determined by three factors

1️⃣ Number of Features

  • The more features there are,
  • the higher-dimensional the fitting problem becomes,
  • and the higher the complexity naturally is.
📌 This is why feature selection directly affects overfitting / underfitting.

2️⃣ Algorithm Choice

  • Different algorithms have different expressive power.
Examples:
  • Linear Regression → simple
  • Decision Tree → moderate
  • Neural Network → highly complex
📌 The algorithm itself implies an upper bound on complexity.

3️⃣ Hyperparameters

Hyperparameters
Tunable knobs that control model complexity for a given algorithm.
  • Hyperparameters are not learned by the model;
  • they are complexity controls set by a human.
Examples:
  • Tree depth
  • Number of layers
  • Regularization strength
  • Learning rate
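As an illustration of "knobs you can turn," the sketch below (synthetic data, assuming scikit-learn) shows how one hyperparameter, tree depth, bounds the model's capacity:

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic (hypothetical) regression data
X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)

# Same algorithm, two settings of the complexity knob
shallow = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
deep = DecisionTreeRegressor(max_depth=None, random_state=0).fit(X, y)

# A depth-2 tree has at most 4 leaves; an unbounded tree keeps splitting
# until it can fit the training data (noise included) almost perfectly
print(shallow.get_n_leaves(), deep.get_n_leaves())
print(shallow.score(X, y), deep.score(X, y))  # training R^2
```

The unbounded tree scores near 1.0 on its own training data, which is exactly the overfitting risk the hyperparameter is there to control.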
3. Error Decomposition: Bias & Variance

Model Error can be decomposed into:

Total Error = Bias² + Variance + Irreducible Error

① Bias

Bias
Error caused by overly simplistic assumptions in the model.
  • The model is too simple.
  • It cannot capture the complexity of the real world.
  • Predictions are systematically off from the true values.
📌 High-bias models:
  • A linear model fit to a strongly nonlinear relationship
  • Too few features
  • Overly strong regularization

② Variance

Variance
Error caused by sensitivity to small fluctuations in training data.
  • The model is overly sensitive to the training data.
  • It treats noise as signal.
  • Given a different batch of data, its predictions swing wildly.
📌 High-variance models:
  • Very complex models
  • Too many features
  • Insufficient regularization

③ Irreducible Error

  • Randomness inherent in the data
  • Measurement error
  • Unobserved external factors
📌 This is the part no model can eliminate.
4. Bias–Variance Tradeoff(偏差–方差权衡)
There is a natural tradeoff between bias and variance.
  • Simpler model → bias ↑, variance ↓
  • More complex model → bias ↓, variance ↑
📌 The goal is not to minimize either term on its own,
📌 but to minimize the total error.
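A small experiment makes the tradeoff concrete. The sketch below (hypothetical 1-D data, assuming scikit-learn) compares a degree-1 model (high bias) with a degree-5 model whose complexity roughly matches the true signal:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical 1-D problem with a nonlinear ground truth: y = sin(4x) + noise
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(300, 1))
y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.2, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for degree in (1, 5):
    # Polynomial degree is the complexity knob here
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    scores[degree] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

# Degree 1 underfits (high bias): low R^2 on both train and test.
# Degree 5 matches the true complexity: much higher test R^2.
print(scores)
```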
5. Underfitting vs Overfitting
Underfitting
Model is too simple to capture underlying patterns.

Symptoms:

  • High bias
  • Low variance
  • High total error

Common causes:

  • Model too simple
  • Too few features
  • Overly strong regularization
Overfitting
Model is too complex and fits noise instead of signal.

Symptoms:

  • Low bias
  • High variance
  • High total error (out-of-sample)

Common causes:

  • Model too complex
  • Redundant features
  • Not enough training data

Remedies:

Increase data / data augmentation (reduces variance at the root)
More diverse and representative data reduces variance and improves generalization.
Why it helps:
  • Overfitting often happens with a small sample size.
  • The more data there is, the harder it is for the model to memorize each sample.
Key ideas:
  • More representative data
  • Better coverage of the data distribution
Proper train / validation / test split + cross-validation
Cross-validation helps detect overfitting by evaluating model performance across multiple unseen subsets.
Core purpose:
  • Prevents the model from "happening to" fit one particular training split
  • Checks model stability
What is k-fold CV doing?
  • Repeatedly testing on different subsamples
  • Checking whether the model depends too heavily on one particular slice of the data
Feature selection / dimensionality reduction
Removing irrelevant or noisy features reduces model variance and improves generalization.
Why do many features cause overfitting?
  • Every added feature = one more degree of freedom
  • The model can mistake noisy features for signal
Common techniques:
  • Correlation filtering
  • Mutual information
  • Lasso (embedded)
  • PCA (⚠️ interpretability trade-off)
Regularization (one of the core methods)
Regularization discourages overly complex models by penalizing large coefficients, helping control variance.
In essence:
  • It makes "model complexity" costly.
Effects:
  • L1 → sparsity + feature selection
  • L2 → coefficient shrinkage + stability
  • Elastic Net → a mix of both
Use simpler models
Simpler models tend to generalize better when data is limited.
Occam's Razor:
  • If a simple model solves the problem, do not reach for a complex one.
Examples:
  • Logistic regression > deep NN
  • Shallow tree > deep tree
Early stopping (specific to NNs / boosting)
Early stopping prevents the model from memorizing noise by monitoring validation performance.
What happens:
  • Late in training: training error ↓ while validation error ↑
  • This means the model has started memorizing noise.
What early stopping does:
  • Stops training at the point of best generalization.
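The L1/L2 effects described above show up directly in the coefficients. A sketch on synthetic data, assuming scikit-learn; the alpha values are illustrative, not tuned:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Hypothetical setting: 100 samples, 50 features, only 5 truly informative
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # L2: shrinks all coefficients
lasso = Lasso(alpha=1.0).fit(X, y)    # L1: drives many coefficients to exactly 0

print("nonzero OLS coefs:  ", int(np.sum(np.abs(ols.coef_) > 1e-8)))
print("nonzero Lasso coefs:", int(np.sum(np.abs(lasso.coef_) > 1e-8)))
print("max |coef|, OLS vs Ridge:",
      np.abs(ols.coef_).max(), np.abs(ridge.coef_).max())
```

Lasso's sparse solution is why it doubles as an embedded feature-selection method, while Ridge keeps every feature but tempers its influence.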

Test Set vs Validation Set

1. Why do we need both Test and Validation sets?

Core Motivation

The goal of machine learning is to perform well on unseen data.
The goal is not "to perform well on data we already have,"
but to perform well on new data never seen before (generalization).
Since we cannot obtain future data in advance, we simulate "the future" by splitting the data we do have.
2. Dataset Splitting Overview

Three main datasets

Dataset        | Purpose                  | Role
Training Set   | Train the model          | Used to fit the model parameters
Validation Set | Model selection & tuning | Used to compare models and tune hyperparameters
Test Set       | Final evaluation         | Used to assess the final model's generalization
3. Test Set(测试集)

Test Set

A held-out dataset used only once to evaluate the final model.
The test set is fully independent data that exists to answer one question:
👉 "How well will this final model perform on genuinely new data?"

Key Rules for the Test Set (must be followed)

  • Must not be used for:
    • Feature selection
    • Algorithm comparison
    • Hyperparameter tuning
  • May only be used for:
    • Final model evaluation
    • Once, after everything else is done
📌 The test set's core property: unbiased

Typical size

  • Usually 10–20% of the total data
4. Validation Set(验证集)

Validation Set

A dataset used during model development to compare models and tune hyperparameters.
The validation set is part of the model-development process, used to:
  • compare different algorithms
  • tune hyperparameters
  • judge whether the model is overfitting / underfitting
📌 The validation set is NOT the final evaluation tool.

Typical size

  • Usually 10–20%
  • Split off from the training set

5. Why NOT use the Test Set for model selection?

The core problem: Data Leakage

Data Leakage
Occurs when test data is used during model building.
If you use the test set to select models or tune parameters:
  • the test set has effectively "been seen" by the model
  • it no longer represents truly unseen data
  • the performance estimate becomes over-optimistic
📌 In model risk management (MRM) and audits, this is a serious problem.
6. Correct Workflow

Step-by-step Process

  1. Split the full data into Training + Test.
  2. Split Training into Training + Validation.
  3. Train models on Training.
  4. Compare models using Validation.
  5. Select the final model.
  6. Retrain the final model on Training + Validation.
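The six-step workflow can be sketched as follows, assuming scikit-learn; the Ridge candidates and alpha grid are hypothetical stand-ins for whatever models are actually being compared:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Step 1: split the full data into Training + Test (the test set stays untouched)
X, y = load_diabetes(return_X_y=True)
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 2: split Training again into Training + Validation
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=0)

# Steps 3-5: train candidates on Training, compare them on Validation
candidates = {alpha: Ridge(alpha=alpha).fit(X_train, y_train)
              for alpha in (0.1, 1.0, 10.0)}
best_alpha = max(candidates, key=lambda a: candidates[a].score(X_val, y_val))

# Step 6: retrain the chosen model on Training + Validation,
# then touch the Test set exactly once for the final, unbiased estimate
final_model = Ridge(alpha=best_alpha).fit(X_train_full, y_train_full)
print("test R^2:", final_model.score(X_test, y_test))
```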

Outcomes vs Outputs

What are Outcome Metrics?
Outcome metrics measure:
👉 the impact a model or product has in the real business world.
They answer the question:
once this model goes live, does it deliver real value to the business?

Core characteristics of Outcome Metrics

  • Aimed at business / customers / management
  • Focus on end results, not the model itself
  • Usually expressed in terms of:
    • money (cost / revenue)
    • risk (risk / loss)
    • time (time saved)
    • safety (incidents)
    • emissions or compliance
  • They are not technical model metrics

Typical examples of Outcome Metrics

  • Did costs fall? (cost savings)
  • Did revenue rise? (revenue increase)
  • Did risk events decrease? (risk reduction)
  • Did incident counts fall? (fewer incidents)
  • Did emissions per unit of output drop? (lower emissions)
📌 Outcome metrics are what customers truly care about.
Output Metrics (model-level metrics)

What are Output Metrics?

Output metrics measure:
👉 how good the model's predictions themselves are.
They answer the question:
are the predictions accurate? stable? low-error?

Core characteristics of Output Metrics

  • Aimed at data scientists / model developers
  • Technical, model-level indicators
  • Usually used only inside the modeling team
  • Rarely shown directly to customers

Typical examples of Output Metrics

  • Accuracy
  • Precision / Recall
  • AUC / F1-score
  • RMSE / MAE
  • Log Loss
📌 Output metrics are tools, not goals.
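Computing output metrics is a one-liner per metric with scikit-learn. The labels below are hypothetical toy values, chosen only to show the calls:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical true labels and predictions from a binary classifier
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Output metrics quantify prediction quality; by themselves
# they say nothing about business value (the outcome)
print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
```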

The Relationship Between Outcome Metrics and Output Metrics

The most important principle

Outcome metrics determine the output metrics, not the other way around.
In other words:
  1. First clarify what business result (outcome) must be achieved.
  2. Then judge how well the model needs to predict in order to support it.
  3. Finally choose the appropriate output metrics to evaluate the model.
📌 If the order is reversed (chase high accuracy first, think about the business later),
👉 that is the classic data-science-perspective mistake.
Airline Turbulence Prediction Case Study

Business goal

Use weather data to predict turbulence in advance, in order to optimize routes and improve flight safety.

Outcome Metrics (business level)

  • Does the annual number of safety incidents fall?
  • Do safety-related claim amounts fall?
  • Does passenger safety improve?
What the airline really cares about is:
  • whether flights are safer
  • whether risk is lower
  • whether costs are lower
They do not care whether the model's accuracy is 92% or 94%.

Output Metrics (model level)

  • Classification metrics (such as precision and recall)
The model is a classifier, and the output metrics judge whether its turbulence predictions have:
  • too many missed events
  • too many false alarms
But these metrics exist only to support improvement in the outcome metrics.

Loss Function

The loss function measures how well a machine learning algorithm fits the underlying data.
Classification
1. Cross-Entropy
Cross-entropy loss measures the performance of a binary classification model by comparing the predicted probabilities to the observed labels.
Used for classification tasks, especially in neural networks, to measure the gap between the model's predictions and the actual labels.
L = −[ y log(p) + (1 − y) log(1 − p) ]
where y is the target (data label), and p is the predicted probability.

2. Hinge Loss
Hinge loss measures the amount by which the model's prediction is incorrect. It is used in SVMs.
Typically used in models such as Support Vector Machines (SVMs) to maximize the margin of the classification boundary.
L = max(0, 1 − y·ŷ)
where y ∈ {−1, +1} is the true label and ŷ is the model's predicted output.
Regression
1. Mean Squared Error
Used for regression tasks; the average of the squared differences between predicted and actual values. The smaller the MSE, the closer the predictions are to the actual values.
MSE = (1/n) Σ (yᵢ − ŷᵢ)²

2. Mean Absolute Error
The average of the absolute differences between predicted and actual values.
MAE = (1/n) Σ |yᵢ − ŷᵢ|

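The four losses above are straightforward to implement from their formulas. A sketch in NumPy (the sample values are illustrative):

```python
import numpy as np

def cross_entropy(y, p):
    # L = -[y log p + (1 - y) log(1 - p)], averaged over samples
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def hinge(y, yhat):
    # L = max(0, 1 - y * yhat), with labels y in {-1, +1}
    return float(np.mean(np.maximum(0.0, 1.0 - y * yhat)))

def mse(y, yhat):
    # Mean of squared differences
    return float(np.mean((y - yhat) ** 2))

def mae(y, yhat):
    # Mean of absolute differences
    return float(np.mean(np.abs(y - yhat)))

y_cls = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.6])
print(cross_entropy(y_cls, p))  # small when the model is confident and correct

y_reg = np.array([3.0, -1.0])
print(mse(y_reg, np.array([2.5, -0.5])), mae(y_reg, np.array([2.5, -0.5])))
```
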
How to Select a Machine Learning Algorithm

1. What is an Algorithm?

Machine Learning Algorithm

A template that defines the form of the relationship between inputs (X) and outputs (y).
An algorithm can be understood as a model template: it determines what the relationship between the input variables X and the output variable y "looks like."
📌 Algorithm ≠ model parameters.
📌 The algorithm defines the structure; training learns the parameters.

2. Algorithm Types by Task

Task-oriented Algorithms

  • Regression algorithms
  • Classification algorithms
Different algorithms suit different task types:
  • Predicting continuous values → regression
  • Predicting classes / labels → classification

3. Parametric vs Non-Parametric Algorithms

Dimension           | Parametric Algorithms                          | Non-parametric Algorithms
Definition          | Assume a fixed functional form between X and y | Do not assume a fixed functional form
Model form          | Predefined mathematical equation               | Learned directly from data
What training does  | Estimate a fixed set of parameters             | Learn structure + parameters
Model complexity    | Fixed once algorithm is chosen                 | Grows with data
Data requirement    | Relatively small                               | Usually large
Flexibility         | Low–Medium                                     | High
Risk of overfitting | Lower                                          | Higher (if not controlled)

Category        | Parametric                             | Non-parametric
Linear models   | Linear Regression, Logistic Regression | —
Tree-based      | —                                      | Decision Tree, Random Forest
Distance-based  | —                                      | k-NN
Kernel-based    | Linear SVM (partially)                 | Kernel SVM
Neural networks | —                                      | Neural Networks
Ensembles       | —                                      | Boosting, Bagging

Parametric Algorithms

Algorithms defined by a fixed set of mathematical equations.
  • The model's form is known before training.
  • Training learns the parameters / coefficients in those equations.
Typical examples:
  • Linear Regression
  • Logistic Regression
📌 Characteristics:
  • Simple structure
  • Strong assumptions
  • Highly interpretable
  • Relatively modest data requirements

Non-Parametric Algorithms

Algorithms that do not assume a fixed functional form.
  • There is no pre-specified formula.
  • The model's structure "grows out of" the data.
Typical examples:
  • Decision Tree
  • k-NN
  • Random Forest
  • Neural Networks (usually also classed as non-parametric)
📌 Characteristics:
  • Flexible
  • Highly expressive
  • Usually need more data
  • Less interpretable (with some exceptions, such as a single tree)

4. No Free Lunch Theorem (very important)

No Free Lunch Theorem

No single algorithm performs best across all possible machine learning problems.
  • There is no "universal" algorithm.
  • An algorithm that performs well on one task
    • 👉 will not necessarily perform well on every task.
📌 The best algorithm depends on:
  • the specific problem
  • the data
  • the business constraints

Practical Implication

Standard practice for algorithm selection:
  1. Try several algorithms.
  2. Train a model with each.
  3. Compare the results.
  4. Choose the most suitable one, not the most complex one.
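The "try several and compare" practice can be sketched with cross-validation, assuming scikit-learn; the three candidate algorithms are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Several candidate algorithms; which wins depends on this problem and data
models = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "tree": DecisionTreeClassifier(random_state=0),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    cv_means[name] = scores.mean()
    print(f"{name}: {cv_means[name]:.3f}")
```

The ranking here is specific to this dataset; on a different problem a different candidate may win, which is exactly the point of the No Free Lunch theorem.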

5. Three Key Criteria for Algorithm Selection

Algorithm selection ≠ only accuracy

In practice, we usually weigh three core dimensions.

1️⃣ Model Performance / Accuracy

  • Are the predictions accurate?
  • Do the metrics meet their targets (RMSE, AUC, etc.)?
📌 This is the most obvious criterion, but not the only one.

2️⃣ Interpretability

How easy it is to understand why a model makes a prediction.
  • Can we explain why the model predicts what it does?
  • Can we explain it to customers / users / regulators?
Highly interpretable algorithms:
  • Linear Regression
  • Decision Tree
Less interpretable algorithms:
  • Neural Networks
  • Complex ensembles
📌 In finance, healthcare, and risk management, this is extremely important.

3️⃣ Computational Efficiency

  • How long does training take?
  • How expensive is prediction?
  • How much compute does the model demand?
Efficient algorithms:
  • Linear models
  • Simple trees
Computationally expensive algorithms:
  • Deep neural networks
  • Large ensembles
📌 Some models:
  • take days or weeks to train
  • are also slow at inference
  • and are unsuitable for real-time systems
Aspect                         | Parametric | Non-parametric
Interpretability               | High       | Low–Medium
Easy to explain to a regulator | Yes        | ❌ (requires XAI)
Transparency                   | Strong     | Weak
Documentation effort           | Low        | High
Model risk                     | Lower      | Higher

6. Case Study: Netflix Prize

Netflix Prize

  • When: early 2000s
  • Goal: predict users' ratings of movies
  • Winning condition: beat Netflix's original model by ≥10% accuracy

Competition Outcome

  • The winning model:
    • a complex ensemble
    • combining many different algorithms
📌 From an academic standpoint:
→ a great success

Netflix Engineering Decision (the key point)

After evaluating the winner, Netflix's engineers found:
  • Competition data: millions of reviews
  • Netflix's internal data: billions of reviews
👉 Scaling the complex model to real-world size would mean:
  • very high engineering cost
  • enormous compute consumption
  • an actual benefit (the 10% improvement) that was not worth it

Final Decision

Netflix decided to stick with their original, simpler model.
In real systems:
  • engineering complexity
  • compute cost
  • maintainability
all matter more than squeezing out the last bit of accuracy.

Troubleshooting Model Performance

When a model performs poorly, do not rush to change algorithms. 90% of problems arise before the algorithm.
When model performance is poor, check these five things, in order:
1. Problem Framing & Metrics (the most fatal, and the most often overlooked)

Core questions (ask yourself)

  • Am I really solving the problem the business actually cares about?
  • Is my chosen metric really a measure of "success"?

Common mistakes

  • Using regression to solve what is fundamentally a classification problem
  • Measuring business success with a technical metric (such as accuracy)

Real case (power outage prediction)

❌ The initial, mistaken approach
  • Problem framing: predict the number of outages per town (regression)
  • Result: the model was never accurate
✅ The later, correct approach
  • Revisit the business need
  • What the business really wanted:
    • the overall severity of an event (a 1–5 rating)
  • Reframe as classification
  • Value appeared immediately
📌 Key takeaway
If the problem is framed incorrectly,
even the best model is just doing the wrong thing precisely.
2. Data Quantity & Quality (no algorithm can rescue bad data)

What to check

  • Data quantity
    • Is there enough data?
    • Does it cover enough scenarios?
  • Data quality
    • How many missing values are there?
    • Are there abnormal outliers?
    • Is the data noisy?

An essential principle (remember this)

Garbage in, garbage out.
If the data itself is:
  • incomplete
  • inaccurate
  • heavily biased
👉 then model performance has a hard ceiling.
3. Feature Definition (missing features = certain failure)

Core questions

  • Do the features include the factors that actually determine the outcome?
  • Is anything missing that is "obvious in the domain but never made it into the data"?

Why this is hard

  • Features are not a "technical" problem;
  • they are a domain-knowledge problem.

Common failure mode

  • All the features are the "convenient to obtain" ones
  • The "causally meaningful" ones are missing
📌 Key takeaway
The model is not failing to learn;
you have not given it the information it needs to know.
4. Model Fit & Complexity (now, finally, the algorithm)

The question is not "is this algorithm fancy?"

It is:
  • Is the model underfitting (too simple)?
  • Is the model overfitting (too complex)?
  • Have the hyperparameters been tuned?
  • Have multiple algorithms been compared?

The right approach

  • Try several algorithms
  • Tune the hyperparameters
  • Use validation / cross-validation
  • Find the bias–variance balance point
📌 Reminder
Algorithm problems are usually the 4th cause,
not the 1st.
5. Inherent Error (a truth many people resist)

What inherent error is

  • The real world itself is:
    • noisy
    • random
    • not fully predictable

What this means

  • No model is 100% accurate.
  • For many problems, 99% was never attainable in the first place.

The right mindset

A model is not magic.
It can only approximate reality, not reconstruct it.

Maximum Likelihood Estimation (MLE)

You have already seen the data; the data are an accomplished fact.
MLE turns the question around:
if the world really were generated by some parameter θ, how probable would it be to see this particular data?
Then it does one thing:
pick the θ that makes that probability largest.
That is "maximum likelihood."
Maximum Likelihood Estimation estimates model parameters by maximizing the likelihood of the observed data under a specified probabilistic model.
Intuitively, it chooses parameters that make the observed outcomes most probable.
OLS can be viewed as a special case of MLE under Gaussian errors, while logistic regression relies entirely on MLE due to its Bernoulli likelihood.
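A minimal numeric illustration of "pick the θ that makes the data most probable," using a Bernoulli (coin-flip) model on hypothetical data; a grid search stands in for proper optimization:

```python
import numpy as np

# Observed coin flips (hypothetical data): 7 heads out of 10
data = np.array([1, 1, 0, 1, 1, 1, 0, 1, 0, 1])

def log_likelihood(theta, x):
    # log P(data | theta) under a Bernoulli(theta) model
    return float(np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta)))

# "Choose the theta that makes the observed data most probable"
grid = np.linspace(0.01, 0.99, 981)
theta_hat = grid[np.argmax([log_likelihood(t, data) for t in grid])]

print(theta_hat)  # ~0.7, matching the closed-form Bernoulli MLE: the sample mean
```

For the Bernoulli model the MLE has a closed form (the sample mean); the grid search is only there to make the "maximize the likelihood" step visible.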

Machine learning step by step

Real life example

In the real world, data scientists follow a structured workflow to apply machine learning to solve business problems. While the specific steps can vary based on the project, industry, and data, the general process is as follows:

1. Understanding the Problem

  • Define the business problem: Data scientists start by understanding the business goal or problem they need to solve.
    • Example: Predict customer churn, detect fraud, recommend products.
  • Ask the right questions: What is the desired outcome? What metrics should be optimized (e.g., accuracy, precision, recall, revenue)?
  • Stakeholder collaboration: Communicate with domain experts and stakeholders to gather requirements and context.

2. Collecting and Understanding the Data

  • Gather data: Pull data from internal databases, external APIs, or third-party sources. This may involve SQL queries, REST API requests, or working with data engineers.
  • Explore the data:
    • Use tools like pandas, NumPy, or SQL to understand the dataset.
    • Perform exploratory data analysis (EDA) to uncover patterns, distributions, and relationships between variables.
    • Create visualizations using tools like matplotlib, seaborn, or Plotly to identify trends and outliers.
  • Assess data quality:
    • Check for missing values, duplicates, inconsistencies, and biases.
    • Example real-world data sources:
    • CRM systems (customer data).
    • IoT devices (sensor data).
    • Transaction logs (e.g., for e-commerce or banking).
    • External APIs like OpenWeather or social media platforms.

3. Data Preprocessing and Cleaning

  • Data scientists spend a significant amount of time preparing data. Common steps include:
    • Handling missing data: Impute missing values, drop rows, or infer values based on other features.
    • Handling outliers: Remove or transform outliers.
    • Encoding categorical data: Convert categorical variables to numeric (e.g., one-hot encoding or label encoding).
    • Scaling or normalizing: Ensure all features are on a similar scale (e.g., using MinMaxScaler or StandardScaler).
    • Combining data: Merge multiple datasets if required (e.g., joining transaction data with customer profiles).
    • Feature engineering:
      • Create new features from existing ones (e.g., extract day_of_week from a timestamp).
      • Transform non-linear relationships to linear ones.
    • Data transformation: Apply techniques like log transformations, polynomial transformations, etc., depending on the data distribution.
    • Real-world challenge: Real-world data is often messy, incomplete, and unstructured, requiring significant effort to clean and preprocess.

4. Exploratory Data Analysis (EDA)

  • Perform EDA to:
    • Identify relationships between variables.
    • Detect correlations and multicollinearity.
    • Visualize the target variable and its relationship with features.
    • Generate hypotheses about which features may be useful for the model.
  • Tools commonly used:
    • Python: pandas, seaborn, matplotlib, plotly.
    • Visualization dashboards: Tableau, Power BI.

5. Feature Selection and Engineering

  • Feature selection: Identify which features are most relevant to the problem using techniques like:
    • Correlation analysis, mutual information, or statistical tests.
    • Model-based feature importance (e.g., from Random Forest or XGBoost).
  • Feature engineering: Transform data to make it more informative for the model:
    • Create ratios (e.g., revenue per customer).
    • Aggregate data (e.g., customer purchase behavior over time).
    • Decompose timestamps into day, month, season, etc.
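The ratio, aggregation, and timestamp ideas above can be sketched in pandas (the table and column names are hypothetical):

```python
import pandas as pd

# Hypothetical transactions table
df = pd.DataFrame({
    "customer": ["a", "a", "b"],
    "revenue": [120.0, 80.0, 200.0],
    "ts": pd.to_datetime(["2024-01-05", "2024-03-17", "2024-06-02"]),
})

# Decompose timestamps into model-friendly parts
df["day_of_week"] = df["ts"].dt.dayofweek
df["month"] = df["ts"].dt.month

# Aggregate behavior per customer (e.g., total and average revenue per customer)
per_customer = df.groupby("customer")["revenue"].agg(["sum", "mean"])
print(per_customer)
```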

6. Model Selection and Training

  • Select a baseline model: Start with simple models like linear regression, logistic regression, or decision trees to establish a baseline.
  • Choose advanced algorithms:
    • For structured data: Random Forests, Gradient Boosting (XGBoost, LightGBM, CatBoost).
    • For unstructured data: Convolutional Neural Networks (CNNs) for images, Transformers for text (e.g., BERT).
  • Split the data: Divide into training, validation, and test sets (e.g., 70-20-10 split).
  • Train the model: Train the model using the training set and tune hyperparameters using the validation set.
    • Key tools:
      • Python libraries: scikit-learn, XGBoost, LightGBM, TensorFlow, PyTorch.
      • Automated machine learning (AutoML): Tools like H2O.ai, Google AutoML, or Azure ML automate model selection and tuning.

7. Model Evaluation

  • Metrics: Evaluate the model on the test set using appropriate metrics:
    • Regression: RMSE, MAE, R².
    • Classification: Accuracy, precision, recall, F1-score, ROC-AUC.
  • Error analysis:
    • Identify patterns in errors (e.g., is the model consistently wrong for a specific group?).
    • Check for bias in predictions.

8. Hyperparameter Tuning

  • Optimize hyperparameters to improve model performance:
    • Use grid search or random search for hyperparameter optimization.
    • Advanced techniques like Bayesian optimization (e.g., Optuna, Hyperopt) or genetic algorithms.
  • Example:

        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import GridSearchCV

        param_grid = {'n_estimators': [100, 200], 'max_depth': [3, 5, None]}
        grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
        grid.fit(X_train, y_train)

9. Deployment

  • Once a model is trained and evaluated, it needs to be deployed into production.
    • Model deployment frameworks:
      • Flask or FastAPI to expose models as REST APIs.
      • Tools like MLflow, Docker, and cloud platforms (AWS, GCP, Azure) to deploy and scale models.
    • Batch vs. real-time inference:
      • Batch: Process data in bulk (e.g., nightly predictions).
      • Real-time: Serve predictions via APIs (e.g., for fraud detection).
    • Monitoring: Continuously monitor model performance in production for drift or degraded accuracy.

10. Monitoring and Iteration

  • Monitor model performance:
    • Check for data drift (e.g., if the data distribution changes in production).
    • Track key metrics (e.g., accuracy, latency) using tools like Prometheus or Grafana.
  • Retrain the model:
    • Periodically retrain the model with new data to ensure it remains accurate and up-to-date.

Common Challenges in Real Work

  1. Dirty Data: Real-world data often has missing values, errors, and inconsistencies.
  2. Data Access: Getting access to relevant data, especially in large organizations, can take time due to security and bureaucracy.
  3. Balancing complexity: Choosing the right trade-off between a complex model (e.g., deep learning) and a simpler interpretable model.
  4. Stakeholder communication: Explaining machine learning results and trade-offs to non-technical stakeholders.
  5. Model interpretability: Explaining how the model makes decisions, especially in sensitive areas like healthcare and finance.

Tools Used in the Industry

  • Data Collection and Cleaning: SQL, pandas, NumPy, PySpark.
  • Exploratory Data Analysis: Jupyter Notebooks, matplotlib, seaborn, Tableau, Power BI.
  • Machine Learning: scikit-learn, TensorFlow, PyTorch, XGBoost, LightGBM.
  • Big Data: Apache Spark, Hadoop.
  • Model Deployment: Flask, FastAPI, Docker, Kubernetes, AWS SageMaker, GCP AI Platform, Azure ML.
  • Experiment Tracking: MLflow, Weights & Biases.
  • Monitoring: Grafana, Prometheus, custom logging frameworks.

Defining the problem

Defining a problem is a crucial first step in any machine learning project. Properly defining the problem helps ensure that your project is well-focused, addresses a real need, and guides your efforts throughout the project lifecycle. Here are the key steps to define a machine learning problem effectively:
  1. Understand the Domain:
      • Begin by gaining a deep understanding of the domain in which you are working. Whether it's healthcare, finance, image recognition, or any other field, understanding the context is essential.
  2. Identify the Business or Research Objective:
      • Determine the primary goal or objective of your machine learning project. What problem are you trying to solve, and why is it important? For example, are you trying to improve customer retention, predict disease outcomes, or classify images?
  3. Formulate the Problem as a Question or Task:
      • Translate the business or research objective into a specific question or task that can be addressed with machine learning. This question/task should be clear and well-defined. For example, "Can we predict customer churn based on historical data?" or "Is it possible to classify images of cats and dogs accurately?"
  4. Define the Scope:
      • Clearly outline the scope of your problem. What are the boundaries and limitations? What data sources are available, and what data is accessible? Are there any constraints or regulations to consider?
  5. Determine the Data Requirements:
      • Specify the data needed to solve the problem. What types of data are required (structured, unstructured, text, images, etc.)? What features or attributes should be collected? Ensure data quality and availability.
  6. Select an Evaluation Metric:
      • Choose an appropriate evaluation metric that aligns with your problem's objectives. For example, if you're working on a classification problem, you might use accuracy, precision, recall, or F1-score as your evaluation metric.
  7. Consider Ethical and Privacy Concerns:
      • Be mindful of ethical and privacy considerations, especially when dealing with sensitive or personal data. Ensure compliance with legal regulations and ethical guidelines.
  8. Explore Existing Solutions:
      • Research existing solutions or approaches to similar problems. Understanding what others have done can provide valuable insights and help you avoid reinventing the wheel.
  9. Define Success Criteria:
      • Clearly define what success looks like for your machine learning project. What level of performance or accuracy are you aiming for? What constitutes a successful outcome?
  10. Create a Project Plan:
      • Develop a project plan that outlines the project's timeline, tasks, responsibilities, and milestones. A well-structured plan helps keep the project on track.

EDA

Exploratory Data Analysis (EDA) is a fundamental process in data analysis and data science where the main purposes are to:
  1. Understand the Data: EDA helps in getting familiar with the data, understanding its structure, the type of data it includes, and the range of values it encompasses. This involves identifying the number and types of variables, checking for missing values, and understanding the data distribution.
  2. Discover Patterns and Relationships: By using statistical summaries and visualizations, EDA allows analysts to discover patterns, trends, correlations, and potential relationships between variables. This can involve plotting data points on various types of charts, which can reveal underlying structures or unexpected insights.
  3. Spot Anomalies and Outliers: EDA is crucial for detecting anomalies and outliers that could indicate data errors or special cases. These findings are essential as they can affect the overall analysis and predictive modeling later on.
  4. Formulate Hypotheses and Assumptions: Based on the initial findings, analysts can formulate hypotheses about the data, which can be tested with more sophisticated statistical models. EDA also helps in making assumptions based on data distributions and relationships.
  5. Inform Feature Engineering: Insights gained from EDA can guide the transformation and creation of new variables (features) that might be more relevant for predictive modeling. This includes identifying which features might need normalization, encoding, or binning.
  6. Prepare for Advanced Analysis: EDA primes the data for further analysis and modeling. By understanding the data's characteristics and issues early, analysts can better preprocess the data, choose appropriate models, and apply the right techniques for further analysis.
  7. Support Decision Making: EDA provides a factual basis for making decisions about how to handle data and what analytical techniques to deploy, ensuring that subsequent steps are data-driven.
  8. Enhance Data Quality: Through the identification of issues such as missing data, duplicate data, and incorrect values, EDA helps in enhancing the overall quality of the data, which is crucial for reliable analysis.
EDA is not just about making preliminary assessments but also about making the data exploration stage an integral part of the overall data analysis process, ensuring that any subsequent steps, like machine learning or deep statistical analysis, are grounded in a solid understanding of the dataset's characteristics and quirks.
Single Variable Plots
Relationships & Multi-variable plots
Histograms:
1. Understanding Distribution: Histograms provide a quick visual grasp of:
Shape: Whether the data is normal (bell-shaped), skewed (lopsided), multimodal (multiple peaks), or uniform (evenly distributed).
Centrality: Identification of the central tendency (mean, median, mode) of the distribution.
Spread: Assessing variability, range, and how much the data is spread out.

2. Identifying Outliers: Histograms help pinpoint unusual or potentially erroneous data points that lie far outside the normal data distribution.

3. Spotting Potential Skewness: Skewness helps us understand:
Right-skewed: Many data points clustered towards the lower end of the range (e.g., income distributions tend to be right-skewed).
Left-skewed: Many data points clustered towards the higher end of the range.
Scatterplots:
Show the relationship between two variables, making trends, clusters, and outliers visible.
Box Plots:
  • Provide a visualization of the summary statistics
  • Minimum, maximum, outliers, IQR, and median
  • Allow us to confirm assumptions about skewness and extreme values
Correlation Matrices (Heatmap):
Show pairwise correlations between numeric variables at a glance.
Distributions:
Show us the underlying frequency distribution of a variable:
  1. The shape of the distribution gives us information about the data. Is it normally distributed?
  2. Is it skewed?
  3. How many peaks?
Bar charts:
  • Very useful for visualizing categorical variables
  • Show us the frequency of data for each category
  • An essential part of EDA; we need to get an idea of the common and uncommon values for our features
Pie chart
Univariate analysis
Univariate analysis is the simplest form of analyzing data. As the name implies, it deals with analyzing data within a single column or variable and is mostly used to describe data. There are different kinds of univariate analyses.
  1. Basic Plots for Categorical Features
(1)Countplot
One of the most basic plots for categorical features is the use of the countplot function. This allows us to see the count of each unique categorical value in the feature.
    # Plot the count of the categories in the faculty column, ordered by frequency
    import seaborn as sns

    countplot = sns.countplot(data=df_grades, x='faculty',
                              order=df_grades['faculty'].value_counts().index)
(2) Pie Chart
Another way of displaying categorical values is with the use of pie charts.
# Plotting via pie chart
pie_data = df_grades['faculty'].value_counts()
plt.pie(pie_data.values, labels=pie_data.index, autopct="%.1f%%")
plt.show()
notion image
 
 
  2. Basic Plots for Numeric Features
(1) Histogram
We will first go over a histogram, which displays the distribution of a numeric column. Histograms allow us to see the shape of the data to easily see insights such as where the data is most concentrated, how spread out the data is, and how skewed the data is.
# Show a basic histplot of the classes_skipped column
histplot = sns.histplot(data=df_grades, x='classes_skipped', binwidth=1)
notion image
 
(2) Box plot
One other simple yet effective way to show a distribution and to identify outliers is through a box plot. Box plots allow us to see the quartiles of a distribution and to easily spot outliers: we normally consider anything outside the whiskers, defined by the IQR rule, an outlier.
# Plot a box plot for the office hours participated column
boxplot = sns.boxplot(data=df_grades, x='OH_participated')
notion image
One key note when looking for outliers based on the whiskers of a box plot: the whiskers always assume outliers based on the IQR rule.
We can also combine the two above visuals into one by sharing their axes. For example, we will combine the x axis of both charts to neatly visualize the distributions.
# Plot a histogram and box plot together and make them share the same axis
fig, (hist, box) = plt.subplots(2, 1, sharex=True)
histplot = sns.histplot(data=df_grades, x='classes_skipped', discrete=True, ax=hist)
boxplot = sns.boxplot(data=df_grades, x='classes_skipped', orient='h', ax=box)
notion image
 
3. Descriptive Stats for Numeric Features
The describe function shows a summary of stats for a numeric column. We initially used this function to clean our data, but it is also a very effective way to look at the distribution of a numeric column.
# Summary of statistics
df_grades['classes_skipped'].describe()
count    30.000000
mean      4.733333
std       3.027840
min       0.000000
25%       2.250000
50%       4.000000
75%       7.000000
max      10.000000
Name: classes_skipped, dtype: float64
What if we want to single out a specific statistical measure? We can use functions such as the mean function and others to get the specific statistical measure for each numerical feature.
# Manually return a statistic of interest
print(df_grades['tuition'].mean())
print(df_grades['tuition'].quantile(0.25))
print(df_grades['tuition'].std())
print(df_grades['tuition'].var())
40307.066666666666
34824.75
5792.681196447648
33555155.44367816
 
Bivariate analysis
Bivariate analysis involves analyzing data with two variables or columns. This is usually a way to explore the relationships between these variables and how they influence each other, if at all.
A bivariate analysis could take one of three different forms: numeric-numeric, numeric-categorical and categorical-categorical.
  1. Numeric-Numeric:
(1) Scatter Plot
Scatter plots are a common way to compare two numeric variables. Let’s investigate the relationship between “annual_mileage” and “speeding_violations”.
# Create a scatter plot to show the relationship between "annual_mileage" and "speeding_violations"
plt.figure(figsize=[8, 5])
plt.scatter(data=df, x="annual_mileage", y="speeding_violations")
plt.title("Annual Mileage vs Speeding Violations")
plt.ylabel("Speeding Violations")
plt.xlabel("Annual Mileage")
plt.show()
notion image
From the graph, we can infer a negative correlation between annual mileage and the number of speeding violations. This means the more miles a client drives per year, the fewer speeding violations they commit.
(2) Correlation matrix
We could also use a correlation matrix to get more specific information about the relationship between these two variables. A correlation matrix is useful for identifying the relationships between several variables. As an example, let's create a matrix using the "speeding_violations", "DUIs", and "past_accidents" columns.
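The matrix itself comes from pandas' corr() method. Since the insurance DataFrame isn't reproduced here, the sketch below builds a small synthetic stand-in with the same column names:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the insurance df used in this section
rng = np.random.default_rng(0)
past_accidents = rng.poisson(2, 500)
df = pd.DataFrame({
    'past_accidents': past_accidents,
    'speeding_violations': past_accidents + rng.poisson(1, 500),  # correlated by construction
    'DUIs': rng.poisson(1, 500),                                  # independent
})

# Pairwise Pearson correlations between the selected columns
corr_matrix = df[['speeding_violations', 'DUIs', 'past_accidents']].corr()
print(corr_matrix)
```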
notion image
 
Generally speaking, a correlation coefficient between 0.5 and 0.7 indicates variables that can be considered moderately correlated, while a correlation coefficient whose magnitude is between 0.3 and 0.5 indicates variables that exhibit weak correlation, as is the case with most of our variables. This means a moderate, positive correlation exists between the number of past accidents and speeding violations, while a weak, positive correlation exists between the number of past accidents and DUIs.
(3) Heatmap
We can easily create one by passing the correlation matrix into the heatmap() function in Seaborn.
# Create a heatmap to visualize correlation
plt.figure(figsize=[8, 5])
sns.heatmap(corr_matrix, annot=True, cmap='Reds')
plt.title("Correlation between Selected Variables")
plt.show()
notion image
  2. Numeric-Categorical: Here, we analyze data using one set of numeric variables and another set of categorical variables. Analysis can be done by using the mean and median as in the example below. We first group by “outcome” and then calculate the mean “annual_mileage” for each group.
# Check the mean annual mileage per category in the outcome column
df.groupby('outcome')['annual_mileage'].mean()

outcome
False    11375.549735
True     12401.574221
Name: annual_mileage, dtype: float64
Using this method, we could return the minimum, maximum, or median annual mileage for each category by using the min(), max(), and median() methods respectively. However, we can better visualize the difference in dispersion or variability between two variables by using box plots. Box plots display a five-number summary of a set of data; the minimum, first quartile, median, third quartile, and maximum.
# Plot two boxplots to compare dispersion
sns.boxplot(data=df, x='outcome', y='annual_mileage')
plt.title("Distribution of Annual Mileage per Outcome")
plt.show()
notion image
Both variables have similar medians (denoted by the middle line that runs through the box) though clients who made a claim have slightly higher median annual mileage than clients who didn’t. The same can be said for the first and third quartiles (denoted by the lower and upper borders of the box respectively).
Similarly, we can compare the distributions of the two categories in “outcome” based on their credit scores, but this time we’ll make use of a bivariate histogram by setting the “hue” argument in the histplot() function to “outcome”.
# Create histograms to compare distribution
sns.histplot(df, x="credit_score", hue="outcome", element="step", stat="density")
plt.title("Distribution of Credit Score per Outcome")
plt.show()
notion image
  3. Categorical-Categorical: As you may have guessed by now, this involves a set of two categorical variables. As an example, we will explore how the “outcome” variable relates to categories like age and vehicle year. To begin, we will convert the labels in the outcome column from True and False to 1s and 0s respectively. This will allow us to calculate the claim rate for any group of clients.
# Create a new "claim rate" column
df['claim_rate'] = np.where(df['outcome']==True, 1, 0)
df['claim_rate'].value_counts()

0    6867
1    3133
Name: claim_rate, dtype: int64
Roughly half as many clients made a claim in the past year as those who didn't. Now let's check how the claim rate is distributed across the age categories.
# Plot the average claim rate per age group
plt.figure(figsize=[8, 5])
df.groupby('age')['claim_rate'].mean().plot(kind="bar")
plt.title("Claim Rate by Age Group")
plt.show()
notion image
From the above, it is clear that younger people are more likely to make an insurance claim. We can do the same for “vehicle_year”.
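A minimal sketch of that same group-by for "vehicle_year" (the category labels below are hypothetical stand-ins for the real dataset's values):

```python
import pandas as pd

# Hypothetical slice of the insurance data
df = pd.DataFrame({
    'vehicle_year': ['before 2015', 'after 2015', 'before 2015', 'after 2015',
                     'before 2015', 'after 2015'],
    'claim_rate':   [1, 0, 1, 0, 0, 0],
})

# Average claim rate per vehicle-year category
rates = df.groupby('vehicle_year')['claim_rate'].mean()
print(rates)
# rates.plot(kind='bar')  # uncomment to draw the bar chart
```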
# Create an empty figure object
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Plot two probability graphs for education and income
for i, col in enumerate(["education", "income"]):
    sns.histplot(df, ax=axes[i], x=col, hue="outcome", stat="probability",
                 multiple="fill", shrink=.8, alpha=0.7)
    axes[i].set(title="Claim Probability by " + col, ylabel=" ", xlabel=" ")
notion image
Clients with no education are more likely to file a claim compared to high school and university graduates, while clients in the “poverty” income group are more likely to file a claim, followed by clients in the “working class” and “middle class” categories, in that order.
 
Multivariate analysis
notion image
Multivariate analysis comprises data analysis involving more than two variables.
Heatmap
A common type of multivariate analysis is the heatmap. Heatmaps provide a fast and simple way for visual recognition of patterns and trends. We can easily check the relationship between variables in our data set like “education” and “income” by using a third variable, claim rate. First, we will create a pivot table.
 
# Create a pivot table for education and income with average claim rate as values
edu_income = pd.pivot_table(data=df, index='education', columns='income',
                            values='claim_rate', aggfunc='mean')
edu_income
notion image
# Create a heatmap to visualize income, education and claim rate
plt.figure(figsize=[8, 5])
sns.heatmap(edu_income, annot=True, cmap='coolwarm', center=0.117)
plt.title("Education Level and Income Class")
plt.show()
notion image
High school graduates in the poverty income class have the highest claim rate, followed by university graduates in the poverty income class. Clients in the upper class income category with no education have the lowest claim rates.
Let’s do the same for driving experience and marital status.
# Create pivot table for driving experience and marital status with average claim rate as values
driv_married = pd.pivot_table(data=df, index='driving_experience', columns='married', values='claim_rate')

# Create a heatmap to visualize driving experience, marital status and claim rate
plt.figure(figsize=[8, 5])
sns.heatmap(driv_married, annot=True, cmap='coolwarm', center=0.117)
plt.title("Driving Experience and Marital Status")
plt.show()
notion image
 
 
 
 
 
 
 

Data Preprocessing for Machine Learning

Remove Unwanted Observations

The first step to data cleaning is removing unwanted observations from your dataset. Specifically, you’ll want to remove duplicate or irrelevant observations.
Duplicate observations
Duplicate observations are important to remove because you don’t want them to bias your results or models. Duplicates most frequently arise during data collection, such as when you:
  • Combine datasets from multiple places
  • Scrape data
  • Receive data from clients/other departments
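In pandas, duplicate rows can usually be dropped in one line; a minimal sketch on toy data:

```python
import pandas as pd

# Two identical rows, e.g. after combining data from multiple sources
df = pd.DataFrame({'id': [1, 2, 2, 3], 'price': [100, 150, 150, 200]})

# Keep the first occurrence of each duplicated row
df_clean = df.drop_duplicates()
print(len(df), '->', len(df_clean))  # 4 -> 3
```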
 
Irrelevant observations
Irrelevant observations are those that don’t actually fit the specific problem that you’re trying to solve. For example, if you were building a model for Single-Family homes only, you wouldn’t want observations for Apartments in there.
 

Handle Missing Data

Identify missing values
Simple ways to check for missing values:
  • isnull()
  • notnull()
df.isnull()

       A      B      C
0  False  False  False
1  False   True  False
2   True   True  False

df.notnull()

       A      B      C
0   True   True   True
1   True  False   True
2  False  False   True
df.isnull().sum()

Unnamed: 0             0
id                     0
age                 2446
gender                 0
income                 0
days_on_platform     141
city                   0
purchases              0
lifetime_value         0
dtype: int64
nulls_summary_table
def nulls_summary_table(df):
    """
    Returns a summary table showing null value counts and percentage

    Parameters:
        df (DataFrame): Dataframe to check
    Returns:
        null_values (DataFrame)
    """
    null_values = pd.DataFrame(df.isnull().sum())
    null_values[1] = null_values[0] / len(df)
    null_values.columns = ['null_count', 'null_pct']
    return null_values

nulls_summary_table(df)
 
Dealing with missing values
Handling missing values is a crucial step in the data preprocessing phase. In Python, there are several ways to deal with missing values.
Univariate
  • dropna()
  • SimpleImputer
  • fillna with the mean, median, or mode
1. Dropping Missing Values:
  • Drop rows or columns with missing values using the dropna() method.
import pandas as pd

df = pd.read_csv('your_dataset.csv')
df_cleaned = df.dropna()  # Drops rows with any missing values
 
2. SimpleImputer
  • Use SimpleImputer from scikit-learn to fill missing values with mean, median, or a constant.
  • If "mean", then replace missing values using the mean along each column. Can only be used with numeric data.
  • If "median", then replace missing values using the median along each column. Can only be used with numeric data.
  • If "most_frequent", then replace missing values using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.
  • If "constant", then replace missing values with fill_value. Can be used with strings or numeric data.
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')  # Other strategies: 'median', 'most_frequent', 'constant'
X_imputed = imputer.fit_transform(X)
3. Imputation with Mean, Median, or Mode:
notion image
  • Fill missing values with the mean, median, or mode of the respective column using fillna()
mean_fill = df.fillna(df.mean())
X_train_m.loc[:, 'age'] = X_train_m['age'].fillna(np.mean(X_train_m['age']))

median_fill = df.fillna(df.median())
m_df.loc[:, 'age'] = df['age'].fillna(np.median(m_df['age']))

mode_fill = df.fillna(df.mode().iloc[0])
m_df.loc[:, 'age'] = m_df['age'].fillna(stats.mode(m_df['age'])[0][0])
Multivariate
1. K-Nearest Neighbors Imputation:
  • Use KNNImputer for imputing missing values based on k-nearest neighbors.
from sklearn.impute import KNNImputer

knn_imputer = KNNImputer(n_neighbors=2)
X_imputed = knn_imputer.fit_transform(X)
 
2. Iterative Imputation (MICE)
  • Use IterativeImputer for iterative imputation based on regression.
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Subset numeric features
numeric_cols = loan_data.select_dtypes(include=[np.number])

# Iteratively impute
imp_iter = IterativeImputer(max_iter=3, random_state=123)
loans_imp_iter = imp_iter.fit_transform(numeric_cols)

# Convert returned array to DataFrame
loans_imp_iterDF = pd.DataFrame(loans_imp_iter, columns=numeric_cols.columns)

# Check the DataFrame's info
print(loans_imp_iterDF.info())
Imputation in a ColumnTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Example data
data = pd.DataFrame({
    'age': [25, np.nan, 35, 40],
    'income': [50000, 60000, np.nan, 80000],
    'city': ['New York', 'Paris', np.nan, 'London']
})

# Define transformers
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))
])

# Apply ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', num_transformer, ['age', 'income']),
    ('cat', cat_transformer, ['city'])
])

data_transformed = preprocessor.fit_transform(data)
print(data_transformed)

Deal with Outliers

Outliers can cause problems with certain types of models. For example, linear regression models are less robust to outliers than decision tree models. In general, if you have a legitimate reason to remove an outlier, it will help your model’s performance.
However, outliers are innocent until proven guilty. You should never remove an outlier just because it’s a “big number.” That big number could be very informative for your model.
Reasons for outliers
  • Data entry errors (human error)
  • Measurement errors (instrument error)
  • Experimental errors (data extraction, or errors in experiment planning/execution)
  • Intentional (dummy outliers inserted to test outlier-detection methods)
  • Data processing errors (data manipulation or unintended mutations of the dataset)
  • Sampling errors (extracting or mixing data from wrong or disparate sources)
  • Natural (not an error: novel values arising from genuine diversity in the data)
 
Outlier Detection

Univariate

1. boxplot:
Use visualization tools like box plots, scatter plots, or histograms to identify potential outliers. Seaborn and Matplotlib are commonly used libraries for this purpose.
import seaborn as sns
import matplotlib.pyplot as plt

# Create a box plot to identify outliers
sns.boxplot(df['purchases'])
plt.show()
notion image
def extract_outliers_from_boxplot(array):
    # Get the quartiles
    q1 = np.quantile(array, 0.25)
    q3 = np.quantile(array, 0.75)

    # Find the IQR region
    iqr = q3 - q1

    # Find the upper and lower whiskers
    upper_bound = q3 + (1.5 * iqr)
    lower_bound = q1 - (1.5 * iqr)

    outliers = array[(array <= lower_bound) | (array >= upper_bound)]
    print('Outliers within the box plot are: {}'.format(outliers))
    return outliers

extract_outliers_from_boxplot(df['purchases'])
 
2. Z-Score Method:
The Z-score is a measure of how many standard deviations a data point is from the mean. You can use a threshold (e.g., Z-score greater than 3 or less than -3) to identify and filter outliers.
When we compute z-scores we are effectively centering the data and looking for points that lie far from zero; those points are outliers. As a rule of thumb, a z-score greater than 3 or less than -3 marks a data point as an outlier.
from scipy.stats import zscore

z_scores = zscore(data['feature'])
data_no_outliers = data[(z_scores < 3) & (z_scores > -3)]
 
3. IQR (Interquartile Range) Method:
The IQR is the range between the first quartile (25th percentile) and the third quartile (75th percentile). Data points outside a certain multiple of the IQR can be considered outliers.
Q1 = data['feature'].quantile(0.25)
Q3 = data['feature'].quantile(0.75)
IQR = Q3 - Q1
data_no_outliers = data[~((data['feature'] < Q1 - 1.5 * IQR) | (data['feature'] > Q3 + 1.5 * IQR))]
 

Multivariate

1. DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
DBSCAN is a density-based clustering algorithm that identifies clusters based on the density of data points. It defines clusters as continuous regions of high-density points separated by areas of lower density.
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=1.0, min_samples=5)
labels = dbscan.fit_predict(data[['feature']])
data_no_outliers = data[labels != -1]
 
2. IsolationForest
Isolation forest uses the number of tree splits to identify anomalies or minority classes in an imbalanced dataset. The idea is that anomaly data points take fewer splits because the density around the anomalies is low. Python’s sklearn library has an implementation for the isolation forest model.
Isolation forest is an unsupervised algorithm, where the actual labels of normal vs. anomaly data points are not used in model training.
Isolation forest identifies anomalies by isolating outliers using trees. The steps are:
  1. For a tree, randomly select features and randomly split for each feature.
  2. For each data point, there is a splitting path from the root node to the leaf node. Calculate the path length for each data point.
  3. Repeat step 1 and step 2 for each tree.
  4. Get the average path length across all trees.
  5. The anomalies have a shorter average path length than normal data points.
Step 1: Import Libraries/Step 2: Create Imbalanced Dataset
# Step 1: Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.datasets import make_classification

# Step 2: Create an imbalanced dataset
X, y = make_classification(
    n_samples=100000,        # Total number of samples in the dataset
    n_features=2,            # Number of features for each sample
    n_informative=2,         # Number of informative features
    n_redundant=0,           # Number of redundant features
    n_repeated=0,            # Number of duplicated features
    n_classes=2,             # Number of classes
    n_clusters_per_class=1,  # Number of clusters per class
    weights=[0.995, 0.005],  # Weights assigned to each class (imbalanced)
    class_sep=0.5,           # Separation between classes
    random_state=0           # Random seed for reproducibility
)

# Convert the data from numpy array to a pandas dataframe
df = pd.DataFrame({'feature1': X[:, 0], 'feature2': X[:, 1], 'target': y})

# Check the target distribution
target_distribution = df['target'].value_counts(normalize=True)
print(target_distribution)
The output shows that we have about 1% of the data in the minority class and 99% in the majority class.
0    0.9897
1    0.0103
Name: target, dtype: float64
Step 3: Train Test Split /Step 4: Train Isolation Forest Model
In this step, we split the dataset into 80% training data and 20% validation data. random_state ensures that we have the same train test split every time. The seed number for random_state does not have to be 42, and it can be any number.
# Step 3: Train test split
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the number of records
print('The number of records in the training dataset is', X_train.shape[0])
print('The number of records in the test dataset is', X_test.shape[0])
print(f"The training dataset has {sorted(Counter(y_train).items())[0][1]} records for the majority class "
      f"and {sorted(Counter(y_train).items())[1][1]} records for the minority class.")

# Step 4: Train the isolation forest model
if_model = IsolationForest(n_estimators=100, random_state=0).fit(X_train)

# Predict the anomalies
if_prediction = if_model.predict(X_test)

# Change the anomalies' values to make them consistent with the true values
if_prediction = [1 if i == -1 else 0 for i in if_prediction]

# Check the model performance
print(classification_report(y_test, if_prediction))
The number of records in the training dataset is 80000
The number of records in the test dataset is 20000
The training dataset has 79183 records for the majority class and 817 records for the minority class.
We train the isolation forest model using the training dataset and make the predictions on the testing dataset. By default, isolation forest labels the normal data points as 1s and anomalies as -1s. To compare the labels with the ground truth in the testing dataset, we changed the anomalies’ labels from -1 to 1, and the normal labels from 1 to 0.
notion image
The model has a recall of 38%, meaning that it captures 38% of the anomaly data points.
3. LocalOutlierFactor
The LocalOutlierFactor (LOF) algorithm is an unsupervised outlier detection algorithm in scikit-learn. It measures the local density deviation of a data point with respect to its neighbors. LOF is useful for identifying anomalies in datasets where normal instances have higher local density than outliers.
Core Concepts of Local Outlier Factor (LOF):
  1. Local Density:
      • LOF focuses on the local density of data points within the dataset. It computes the density of a data point relative to its neighbors. Outliers are often expected to have lower local density compared to their neighbors.
  2. LOF Score:
      • The LOF algorithm assigns a score to each data point, indicating its degree of "outlierness." A higher LOF score suggests a higher likelihood of the point being an outlier.
  3. K-Nearest Neighbors:
      • LOF relies on the concept of nearest neighbors. For each data point, it identifies its k-nearest neighbors and computes the local reachability density based on the distances between the point and its neighbors.
  4. Local Reachability Density:
      • Local reachability density measures how close a data point is to its neighbors. Points with significantly lower density compared to their neighbors are likely to be outliers.
  5. LOF Calculation:
      • The LOF for each data point is calculated by comparing its local reachability density with the densities of its neighbors. Anomalies are identified as points with a substantially lower density compared to their neighbors.
  6. Contamination Parameter:
      • The contamination parameter is crucial in LOF, representing the expected proportion of outliers in the dataset. It influences the threshold for classifying points as outliers.
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import make_blobs

# Create synthetic data (normal and anomaly)
X_normal, _ = make_blobs(n_samples=300, centers=1, cluster_std=0.6, random_state=0)
X_anomaly, _ = make_blobs(n_samples=20, centers=1, cluster_std=2, random_state=0)

# Combine normal and anomaly data
X = np.vstack([X_normal, X_anomaly])

# Apply LocalOutlierFactor
lof = LocalOutlierFactor(contamination=0.05)  # contamination is an important parameter
labels = lof.fit_predict(X)

# Visualize the results
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('Local Outlier Factor')
plt.show()
Handling Outliers
Removal
The first method is simply removing our outliers, typically via z-score removal. Specify the z-score or percentile cutoff you want for your outliers, then remove any point that falls above or below that threshold.
import scipy.stats

def z_score_removal(df, column, lower_z_score, upper_z_score):
    col_df = df[column]
    z_scores = scipy.stats.zscore(col_df)
    outliers = (z_scores > upper_z_score) | (z_scores < lower_z_score)
    return df[~outliers]

def percentile_removal(df, column, lower_bound_perc, upper_bound_perc):
    col_df = df[column]
    upper_bound = np.percentile(col_df, upper_bound_perc)
    lower_bound = np.percentile(col_df, lower_bound_perc)
    outliers = (col_df > upper_bound) | (col_df < lower_bound)
    return df[~outliers]

filtered_df = z_score_removal(df, 'purchases', -1.96, 1.96)
percentile_removal(df, 'purchases', lower_bound_perc=1, upper_bound_perc=99)
 
Winsorize
Dropping outliers is the crudest approach. If you feel those rows are valuable, we can winsorize, also known as "capping" our outliers. Rather than keep the outlier value, if the value falls above a specific threshold, we can replace the outlier with that threshold value. Here, we've written a function for you:
# Print: before winsorize
print(loan_data['Monthly Debt'].mean())
print(loan_data['Monthly Debt'].median())
print(loan_data['Monthly Debt'].max())

# Winsorize numeric columns
debt_win = mstats.winsorize(loan_data['Monthly Debt'], limits=[0.05, 0.05])

# Convert to DataFrame, reassign column name
debt_out = pd.DataFrame(debt_win, columns=['Monthly Debt'])

# Print: after winsorize
print(debt_out.mean())
print(debt_out.median())
print(debt_out.max())
Instead of deleting outliers, winsorizing pulls the most extreme values back into a normal range,
for example raising the smallest values to the 1st percentile and capping the largest at the 99th percentile.
The idea: the point stays in the data, but it no longer gets to distort the model.
Why do it
  • Extreme values influence model parameters disproportionately
  • Especially with linear models, normalization, and MSE-based losses
  • Capping preserves the overall shape of the data while limiting the weight of extremes
When to use it
  • The extreme values are real but very rare
  • You don't want to delete data (the samples themselves are valuable)
  • You're using linear models, logistic regression, or other regression models (sensitive to outliers)
  • You understand the quantile structure of the distribution
Pros
  • No data is deleted
  • Fast and effective
  • Clear improvement in model stability
Cons
  • The quantile cutoffs are set manually (requires judgment)
  • If the data is genuinely heavy-tailed, capping may be too aggressive
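A minimal percentile-capping sketch using NumPy alone (an alternative to scipy's mstats.winsorize), on made-up heavy-tailed data:

```python
import numpy as np

# Heavy-tailed sample with one extreme point
x = np.concatenate([np.arange(1, 100, dtype=float), [10000.0]])

# Cap values at the 1st and 99th percentiles instead of dropping them
lo, hi = np.percentile(x, [1, 99])
x_capped = np.clip(x, lo, hi)
print(x.max(), '->', x_capped.max())  # the 10000 outlier is pulled way down
```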
 

Scaling

Scaling normalizes magnitudes to ensure stable gradients, fair regularization, and reliable distance or similarity calculations.
In plain terms, scaling makes features of very different magnitudes comparable in size. Otherwise:
  • some variables are elephants
  • some variables are ants
  • and the model can't tell which ones actually matter.

Why scale?

  • Prevent large-valued features from dominating the gradients
  • Keep distance-based models (SVM, KNN) from being misled
  • Distribute the regularization penalty fairly across features
  • Help neural networks converge stably
  • Keep PCA from favoring features with large units

1. StandardScaler
What it is (plain terms)
Transforms each feature to mean 0 and standard deviation 1, putting all features on the same scale.
Why use it
Some models assume all features have similar scales; otherwise:
  • Large-scale features dominate the gradient direction
  • Regularization penalizes features unevenly
  • The model converges more slowly
  • Distance metrics get inflated (SVM/KNN)
When it fits best
  • Data is roughly normally distributed
  • Linear models (Linear Regression / Logistic Regression)
  • Distance- or kernel-based models (SVM, KNN)
  • PCA (keeps the first principal component from being dominated by large-scale features)
Pros
  • The most general-purpose scaler
  • Works especially well with near-normal data
  • A good default for most traditional ML models
Cons
Very sensitive to outliers, because both the mean and the standard deviation are pulled by extreme values.
One-sentence summary
StandardScaler standardizes features to mean 0 and variance 1, and suits scale-sensitive algorithms such as linear models, SVM, KNN, and PCA.
2. MinMaxScaler
What it is (plain terms)
Scales each feature into the interval [0, 1].
The simplest form of scaling.
Formula:
(x - min) / (max - min)
Why use it
Neural networks are highly sensitive to input scale; activation functions (sigmoid / tanh / ReLU) react strongly to the input range, and MinMaxScaler keeps gradient updates stable.
When it fits best
  • Neural networks (the most common scaling choice)
  • Input layers of deep learning pipelines
  • Values that are fairly evenly distributed
  • Data without pronounced outliers
Pros
  • Preserves the shape of the original distribution
  • All values land in [0, 1], which suits activation functions
  • Works for models that require strictly identical scales
Cons
Extremely sensitive to outliers: a single outlier can distort the entire scaling.
One-sentence summary
MinMaxScaler rescales data into the [0, 1] range and is the most common scaling method for neural networks.
3. RobustScaler
What it is (plain terms)
Scales using the median and IQR (interquartile range) instead of the mean and standard deviation, so it is insensitive to outliers.
Formula:
(x - median) / IQR
Why use it
The mean and standard deviation are badly skewed by outliers; the median and IQR are not.
If your data contains many extreme values, this is the only stable choice.
When it fits best
  • Strongly skewed distributions
  • Data with many outliers
  • Heavy-tailed data such as financial, transaction, or click-count data
  • You don't want outliers to distort the scale of all the other data
Pros
  • Extremely robust
  • Insensitive to outliers
  • Performs best on heavy-tailed distributions
Cons
  • May distort the original distribution's characteristics
  • Slightly less intuitive than Standard or MinMax
  • Output is not guaranteed to fall in [0, 1]
One-sentence summary
RobustScaler scales with the median and IQR, is especially robust to outliers, and is the first choice for heavy-tailed or noisy data.
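To see the three scalers side by side, here is a sketch on a single feature containing one outlier; note how MinMaxScaler squeezes the normal points together while RobustScaler keeps them spread out:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# One feature with an outlier to show how each scaler reacts
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

X_std = StandardScaler().fit_transform(X)  # mean 0, std 1; pulled by the outlier
X_mm = MinMaxScaler().fit_transform(X)     # squeezed into [0, 1] by the outlier
X_rob = RobustScaler().fit_transform(X)    # centered on the median, scaled by IQR

print(X_std.ravel().round(2))
print(X_mm.ravel().round(2))
print(X_rob.ravel().round(2))
```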

Handling Imbalanced data

Random Under Sampling (RUS): throws away data; computationally efficient.
Random Over Sampling (ROS): straightforward and simple, but trains your model on many duplicates.
Synthetic Minority Oversampling Technique (SMOTE): a more sophisticated and realistic dataset, but you are training on "fake" data.
Undersampling
notion image
Oversampling
notion image
from imblearn.over_sampling import RandomOverSampler

method = RandomOverSampler()
X_resampled, y_resampled = method.fit_resample(X, y)
compare_plots(X_resampled, y_resampled, X, y)
Synthetic Minority Oversampling Technique (SMOTE)
notion image
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Define resampling method and split into train and test
method = SMOTE()
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

# Apply resampling to the training data only
X_resampled, y_resampled = method.fit_resample(X_train, y_train)

# Continue fitting the model and obtain predictions
model = LogisticRegression()
model.fit(X_resampled, y_resampled)

# Get your performance metrics
predicted = model.predict(X_test)
print(classification_report(y_test, predicted))
 
 
 

Feature Engineering

notion image
Feature Engineering = turning raw data into features the model can understand better.
Feature engineering is about creating new input features from your existing ones. In general, you can think of data cleaning as a process of subtraction and feature engineering as a process of addition.
An effective feature engineering process implies:
  • Higher efficiency of the model
  • Algorithms that fit the data more easily
  • Easier pattern detection for the algorithms
  • Greater flexibility of the features

Variable Encoding

Unsupervised Encoding

Unsupervised encoding requires no label information: the raw categorical variable is encoded directly. This section introduces the three most common unsupervised encodings: one-hot encoding, dummy-variable encoding, and label encoding.
One-hot encoding
One-hot encoding is also known as one-of-M encoding. If a categorical variable has M distinct values, one-hot encoding uses an M-bit state register in which each possible value gets its own bit and exactly one bit is set. In essence, each value of the variable is represented in binary, and the result is an M-dimensional sparse matrix.
One-hot encoding is a very effective scheme: it maps an unorderable categorical variable into Euclidean space, where each value of the variable becomes a point. Distance comparisons and similarity measures then become computable, and the original variable's equidistance property is preserved. For example, after encoding a gender variable, the pairwise Euclidean distance between male, female, and unknown is √2 in every case, so the values are equidistant. After one-hot encoding, each dimension of the categorical variable can be treated as a continuous variable whose values already lie in [0, 1], matching the effect of normalization.
For convenience, one-hot encoded variables can be normalized together with the continuous variables. Normalization is a crucial step. Common methods are min-max normalization and z-score normalization; their purpose is to remove the influence of differing units: income values are far larger than ages, so without normalization the smaller-valued variable contributes nothing to a distance computation. More importantly, normalization regularizes the model's optimization search space and speeds up convergence.
Because one-hot encoding can produce very high dimensionality, values are often merged first and then one-hot encoded. Merging values is exactly a binning process; binning methods are introduced later.
Dummy variable
Like one-hot encoding, dummy-variable encoding is an unsupervised scheme that represents values in binary. The difference is that it uses fewer dimensions: with M distinct values, only M-1 dimensions are needed to represent all M possibilities. The idea is that once the training set is fixed, the variable's state space is fixed. For a gender variable with values {male, female, unknown}, two bits suffice to encode male and female; the third state must then be unknown.
Compared with one-hot encoding, dummy-variable encoding represents the same values in a smaller space, but when the variable is sparse, the resulting encoding matrix is just as sparse as with one-hot encoding.
Label encoding
When converting an orderable categorical variable to numbers, if you want to preserve the rank ordering, use label encoding. Take education level, with values {high school, bachelor, master, PhD}: clearly high school < bachelor < master < PhD, and the gaps between levels differ, i.e., the distance from bachelor to high school is not the same as to master or PhD.
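The three unsupervised encodings can be sketched with pandas (the category values and the ordinal mapping below are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    'gender': ['male', 'female', 'unknown'],
    'degree': ['high school', 'bachelor', 'master'],
})

# One-hot: M columns for M categories
onehot = pd.get_dummies(df['gender'])

# Dummy-variable: M-1 columns; the dropped level is implied by all zeros
dummy = pd.get_dummies(df['gender'], drop_first=True)

# Label encoding for an ordered variable: make the ranking explicit
order = {'high school': 0, 'bachelor': 1, 'master': 2, 'phd': 3}
df['degree_code'] = df['degree'].map(order)

print(onehot.shape[1], dummy.shape[1])  # 3 2
print(df['degree_code'].tolist())       # [0, 1, 2]
```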

Supervised Encoding

The encodings above are all unsupervised: the input variable is converted to numbers by itself. If the target variable is taken into account, the encoding of a categorical variable becomes more directed; this is supervised encoding.
WOE encoding
WOE (Weight of Evidence) encoding is the supervised encoding most commonly used in scorecards. It can encode categorical variables as well as binned continuous variables. The WOE formula is:
notion image
Here M is the number of possible values of the categorical variable (for a continuous variable, the number of bins). Bad_i is the number of bad samples taking the variable's i-th value and Bad_total is the number of bad samples overall; Good_i is the number of good samples taking the i-th value and Good_total is the number of good samples overall. In other words, for each value i:

WOE_i = ln( (Bad_i / Bad_total) / (Good_i / Good_total) )

so WOE encoding is the log of the ratio between the bad-sample distribution and the good-sample distribution.
notion image
For a binary classification problem with given parameters, the model's predicted probability for a sample is:
notion image
notion image
WOE encoding first computes a WOE value for each possible value of the variable (each category for a categorical variable, each bin for a binned continuous variable); summing them gives the WOE value of the whole variable.
Benefits of WOE encoding:
• Improves interpretability, down to the level of each possible value of the variable.
• Indicates the predictive power of the independent variable (model input) for the dependent variable (model target); a sample's probability is closely tied to its WOE value.
• Handles missing values: treating missingness as a feature of its own and WOE-encoding it avoids the uncertainty introduced by imputation.
When WOE-encoding, we usually want the encoded WOE values to be linear; if they are not, they should at least be monotonic.
This expectation rests on one premise: the scorecard model is logistic regression. Logistic regression is in essence a log-linear model, so for it to express the data well, the modeled variables should satisfy a linear or at least a monotonic relationship.
Variables whose encoding turns out neither linear nor monotonic can be handled in two ways:
• Intervene manually to make the variable linear or monotonic: merge categories for a categorical variable, or re-bin a continuous variable until the binning satisfies the condition.
• Drop the variable: if manual intervention still fails, delete the variable to protect the logistic regression model. In a nonlinear or non-monotonic variable space, logistic regression alone cannot learn good rules; that is a limitation of the model, not evidence that the variable lacks predictive power. Age is a typical example: its binned WOE values often form a U shape, with small WOE in the middle (lower credit risk, lower default probability) and large WOE at both ends (younger or older customers are more likely to default). That pattern makes business sense; logistic regression simply cannot capture it well, and a model with stronger nonlinear capacity can be chosen instead. In practice, weigh the trade-offs and choose accordingly.
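A minimal WOE computation on toy binary-target data, following the WOE definition (the column names and counts are made up):

```python
import numpy as np
import pandas as pd

# Toy binary-target data: y=1 marks a "bad" sample
df = pd.DataFrame({
    'grade': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
    'y':     [0,   0,   1,   1,   1,   0,   1,   0],
})

# WOE_i = ln( (Bad_i / Bad_total) / (Good_i / Good_total) )
bad = df.groupby('grade')['y'].sum()
good = df.groupby('grade')['y'].count() - bad
woe = np.log((bad / bad.sum()) / (good / good.sum()))
print(woe)  # negative for the "good-heavy" grade A, positive for B
```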

Variable Binning

How binning benefits the model

(1) 降低异常值的影响,增加模型的稳定性。数据中存在异常值时会使模型产生一定的偏差,从而影响预测效果。通过分箱方法可以降低异常值的噪声特性,使模型更稳健。树模型对异常值不敏感,但Logistic回归模型和神经网络对异常值敏感。
(2) 缺失值作为特殊变量参与分箱,减少缺失值填补的不确定性。通常由于某些原因造成某些特征 (字段) ,训练数据出现缺失值的情况,如用户录入错误、操作人员的失误或数据存储的问题。而大部分机器学习模型都是无法处理缺失值的。树模型可以处理缺失值,但对实际有缺失值的变量不会起到很大作用,比如决策树模型在树的构建过程中,如果遇到有缺失值的变量,在计算信息增益或Gini系数时会忽略缺失值,用非缺失值计算一个值,再将缺失值的比例作为一个影响因子考虑进来,但是对有缺失值的样本做出的判断并不准确。所以无论是树模型还是其他模型,缺失值填补的工作必不可少。缺失值造成的原因不可追溯,插补方法也不尽相同,但如果能将缺失值作为一种特征,则会免去主观填充带来的不确定性问题,可以增加模型的稳定性。而分箱方法可以将缺失值作为特殊值参与分箱处理。通常的做法是,离散特征将缺失值转为字符串作为特殊字符即可,而连续特征将缺失值作为特殊值即可,这样,缺失值将作为一个特征参与分箱。
(3) 增加变量的可解释性。分箱的方法往往要配合变量编码使用,这就大大提高了变量的可解释性。通常采用的编码方式为WOE编码,具体编码方法已在第5章中详细介绍过。本章将介绍的分箱方法有Chi-merge方法、Best-KS方法、IV最优分箱方法和基于树的最优分箱方法。
(4) 增加变量的非线性。由于分箱后会采用编码操作,常用的编码方式有WOE编码、哑变量编码和One-hot编码。对于WOE编码,编码的计算结果与目标变量相关,会产生非线性的效果;而采用哑变量编码或One-hot编码,会使编码后的变量比原始变量获得更多的权重,并且这些权重是不同的,因此增加了模型的非线性。注意:WOE编码和One-hot编码在Logistic回归模型中本质上是相同的,WOE编码并不会显著提高模型的预测效果,它只是将分箱结果数值化的一种方式。当然,也可以直接用坏样本的比率作为每个箱的数值化结果。
(5) 增加模型的预测效果。从统计学角度考虑,机器学习模型在训练时会将数据划分为训练集和测试集,通常假设训练集与测试集的样本是服从同分布的,分箱操作使连续变量离散化,使得训练集和测试集更容易满足这种假设。因此,分箱会增加模型预测效果的稳定性,即会减少模型在训练集上的表现与测试集上的偏差。

变量分箱流程

变量分箱的目的是增加变量的预测能力或减少变量的自身冗余。当预测能力不再提升或冗余性不再降低时,则分箱完毕。因此,分箱过程是一个优化过程,所有满足上述要求的指标都可以用于变量分箱,这个指标也可叫作目标函数,可以终止或改变分箱的限制就是优化过程的约束条件。
优化的目标函数可以是卡方值、KS值、IV值、WOE值、信息熵和Gini值等,只要是可以提高变量的预测能力或减少变量自身冗余的指标,都可以作为目标函数使用。优化的约束条件可以是分箱数限制 (一般不要大于10箱) 、每组内最小样本数 (WOE值计算要求好坏样本数不能为0) 、每组内最多样本数 (不希望分箱后样本分布非常不均衡) 、组间距离限制 (切分点不能太接近) 、Early stopping策略的限制 (继续切分已经不会带来目标函数的更大提升而停止继续切分) 、WOE单调或bad rate单调 (可以不设定,与算法有关,在上一章中已经讲过) 。
notion image
(1) 选择优化指标,可以选择卡方值、KS值和IV值等作为优化指标,以判断最优切分点;同时初始化分箱数nbins=1,开始认为只有1箱,通过切分的方式得到最优分箱结果。
(2) 初始化切分点,对于连续变量采用等距离初始化方法,并且要满足组间距离不能太小的约束条件;对于离散变量可以将变量的取值作为切分点,如果变量非常稀疏,则可以先用坏样本比率数值化,然后按照连续变量分箱操作。切分点即为分箱合并的候选集,可以初始化100个切分点,然后分别计算在切分点处的目标函数值,通过切分点分裂的方式从初始化箱数为1逐步达到最优分箱结果。注意:分箱可以消除异常值的影响,但是异常值会影响初始化切分点的选择。例如,初始化分箱数是100,采用等距离初始化方法。如果存在异常值,则会出现切分点的间隔较大,数据分布不均,即靠近异常值的箱内样本分布较少,而在某些箱内样本分布较多的情况。因此,初始化切分点前要做异常值处理。
(3) 初始化切分点后,要判断不同切分点间的最小样本数是否小于最小样本数约束。如果小于最小样本数,则重新进行切分点选择,即采用切分点合并的方式,将临近的切分点合并,以满足最小样本数的约束。
(4) 随后在最大分箱数的约束下进行切分点选择,如果大于最大分箱数,则分箱结束。
(5) 如果小于最大分箱数,则先计算最优切分点,以每个初始化切分点为边界,分别计算在此分箱后的优化指标值,选择最优的指标值对应的切分点作为最优切分点。
(6) 得到最优切分点后要计算增益值,判断当前分箱策略下的优化指标是否优于前一次分箱得到的优化指标值,如果分箱已经无法得到更优的指标值或指标值增加的速度明显变缓,则分箱结束,满足Early stopping约束条件。如果增益明显增加则分箱数增加,直到满足最大分箱约束条件后分箱结束。

最优Chi-merge卡方分箱方法

Chi-merge卡方分箱方法是一种自底向上的分箱方法,其思想是将原始数据初始化为多个数据区间,并对相邻区间的样本进行合并,计算合并后的卡方值,用卡方值的大小衡量相邻区间中类分布的差异情况。如果卡方值较小,表明该相邻区间的类分布情况非常相似,可以进行区间合并;反之,卡方值越大,则表明该相邻区间的类分布情况不同,不能进行区间合并操作。卡方值计算公式如下:
notion image
其中,n表示区间数,因为是相邻区间进行合并,所以n=2;m表示类别数,如二分类m=2。E_ij表示第i个区间第j个类别的期望值,计算方式是第j个类别在总体样本中的占比乘以第i个区间的全部样本数。Bin_ij表示第i个区间第j类样本的个数。
notion image
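按上面的公式,相邻两区间的卡方值可以这样计算(好坏样本数为虚构样例):

```python
import numpy as np

# 相邻两区间卡方值示意:n=2 个区间,m=2 个类别(好/坏)
# bins[i][j]:第 i 个区间中第 j 类的样本数
bins = np.array([[10, 2],    # 区间1:10 个好样本,2 个坏样本
                 [8,  6]])   # 区间2:8 个好样本,6 个坏样本
col_ratio = bins.sum(axis=0) / bins.sum()          # 每类在总体中的占比
expected = np.outer(bins.sum(axis=1), col_ratio)   # E_ij = 区间样本数 × 类占比
chi2 = ((bins - expected) ** 2 / expected).sum()
print(round(chi2, 4))  # 卡方值越小,两区间类分布越接近,越应合并
```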

Best-KS分箱

Best-KS分箱方法是一种自顶向下的分箱方法。与卡方分箱相比,Best-KS分箱方法只是目标函数采用了KS统计量,其余分箱步骤没有差别。这里重点介绍KS统计量的计算方法及与目标变量之间的关系。
KS统计量是柯尔莫哥洛夫与斯米尔诺夫两个人提出来的,用于评估模型对好坏样本的区分能力。首先介绍一下K-S曲线。K-S曲线的绘制方法:做变量排序,以变量的某个值为阈值点,统计样本中好样本与坏样本占总样本中好样本与坏样本的比例。显然,随着变量阈值点的变化,这两个比值也随之改变,这两个比值之差的最大值就是KS统计量
notion image
KS统计量的计算过程如下:
(1) 将变量进行升序排序,确定初始化阈值点候选集:离散变量可以直接采用变量的可能取值作为候选集,连续变量可以先进行分组 (如分10组) ,将每组的边界作为阈值点。
(2) 分别计算在小于阈值点的样本中,好坏样本数与总体样本中好坏样本数的比值。
(3) 将好坏样本的比值做差,就得到各个切分点处的KS统计值。
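上述三步可以直接用 numpy 写成一个最小示意(score 与 label 为虚构样例):

```python
import numpy as np

# KS 统计量示意:在每个候选阈值处,比较累计坏样本占比与累计好样本占比之差
score = np.array([1, 2, 3, 4, 5, 6, 7, 8])   # 变量取值
label = np.array([1, 1, 1, 0, 1, 0, 0, 0])   # 1 = 坏样本
order = np.argsort(score)                     # (1) 升序排序
label = label[order]
cum_bad = np.cumsum(label) / label.sum()             # (2) 累计坏样本占比
cum_good = np.cumsum(1 - label) / (1 - label).sum()  #     累计好样本占比
ks = np.abs(cum_bad - cum_good).max()                # (3) 差值的最大值
print(ks)  # 0.75
```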

最优IV分箱方法

最优IV分箱方法也是自顶向下的分箱方式,其目标函数为IV (Information Value) 值。
IV值的本质是对称化的K-L距离,即在切分点处分裂得到的两部分数据中,选择好坏样本的分布差异最大点作为最优切分点。分箱结束后,将每个箱内的IV值求和得到变量的IV值,可以用来刻画变量对目标值的预测能力。变量的IV值越大,则对目标变量的区分能力越强,因此,IV值还可以用来做变量选择。
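IV 的逐箱计算可以写成如下示意(每箱好坏样本数为虚构样例):

```python
import numpy as np

# IV 值示意:IV = Σ_i (Bad_i/Bad_total − Good_i/Good_total) × WOE_i
bad = np.array([10, 20, 30])     # 每个箱内的坏样本数
good = np.array([100, 80, 40])   # 每个箱内的好样本数
bad_dist = bad / bad.sum()
good_dist = good / good.sum()
woe = np.log(bad_dist / good_dist)
iv = ((bad_dist - good_dist) * woe).sum()
print(round(iv, 4))  # 变量的 IV 值越大,对目标变量的区分能力越强
```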

基于树的最优分箱方法

基于树的分箱方法借鉴了决策树在树生成的过程中特征选择 (最优分裂点) 的目标函数来完成变量分箱过程,可以理解为单变量的决策树模型。决策树采用自顶向下递归的方法进行树的生成,每个节点的选择目标是为了分类结果的纯度更高,也就是样本的分类效果更好。因此,不同的损失函数有不同的决策树,ID3采用信息增益方法,C4.5采用信息增益比,CART采用基尼系数 (Gini) 指标。本节将重点介绍采用信息增益作为目标函数进行变量分箱的过程。
概率是表示随机变量确定性的度量,而信息是随机变量不确定性的度量。信息熵是不确定性度量的平均值,就是信息的平均值。
H(Y) = −Σ_i p_i · log p_i
则条件熵计算如下:
H(Y|X) = Σ_x p(x) · H(Y | X = x)
信息增益计算如下:
g(Y, X) = H(Y) − H(Y|X)
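以“按某切分点把样本分成两箱”为例,信息增益的计算可以写成如下示意(样本数为虚构样例):

```python
import numpy as np

# 信息增益示意:g(Y, X) = H(Y) − H(Y|X)
def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 约定 0·log0 = 0
    return -(p * np.log2(p)).sum()

# 总体:8 好 8 坏;按切分点分成两箱各占一半:(6好,2坏) 与 (2好,6坏)
h_y = entropy([8 / 16, 8 / 16])                                   # H(Y) = 1.0
h_y_given_x = 0.5 * entropy([6 / 8, 2 / 8]) + 0.5 * entropy([2 / 8, 6 / 8])
gain = h_y - h_y_given_x
print(round(gain, 4))  # 增益越大,该切分点的分箱效果越好
```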

Feature selection

| 方法 | 核心特征 | 关键词 |
| --- | --- | --- |
| Filter | 不用模型 | correlation, chi-square |
| Wrapper | 把模型当黑箱反复跑 | train & test subsets |
| Embedded | 模型内部完成 | L1, tree importance |
The process of identifying the most relevant set or subset of features used to train a model.
Features are defined as the intersection of:
  1. What factors influence the problem
    哪些因素在逻辑上会影响我要预测的目标?
      房价预测中的潜在影响因素:
      • 房屋本身(size, age, rooms)
      • 时间因素(year built, year sold)
      • 区域 / 城市特征
      • 学区、配套设施
  2. What data we can realistically collect
    并不是所有“有影响的因素”都能拿到数据:
      • 有些数据容易收集(房屋面积、建成年份)
      • 有些数据非常难获取(购房者心理、谈判能力)
      👉 现实中的特征选择 = 影响力 × 可获取性

Filter Methods

1. Correlation Analysis(相关性分析)
是什么(大白话)
计算特征之间、以及特征与目标之间的相关性,筛掉冗余或无关特征。
用途 1:特征与目标的相关性
  • 线性关系强的变量可优先保留(适用于线性模型)
用途 2:特征之间的相关性
  • 检测 multicollinearity
  • 若两个特征高度相关(如 >0.9),通常只保留一个
什么时候用
  • 线性模型(线性回归、逻辑回归)
  • 数据量大时快速预筛
  • 非线性模型前的辅助检查
优点
  • 直观、快速
  • 适合处理 multicollinearity
  • 完全模型无关
缺点
  • 只能检测线性关系
  • 对非线性关系无效
  • 对类别特征需转换为数值
面试官喜欢听的点
Correlation is useful for identifying redundant features and reducing multicollinearity before fitting linear models.
一句话总结
Correlation-based selection removes redundant variables and highlights those with strong linear relationships to the target.
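一个常见的相关性预筛写法如下(数据为随机生成的虚构样例,0.9 为常用经验阈值,并非硬性标准):

```python
import numpy as np
import pandas as pd

# 相关性预筛示意:剔除两两相关系数 > 0.9 的冗余特征
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 2 + rng.normal(scale=0.01, size=200),  # 与 x1 高度相关
    "x3": rng.normal(size=200),
})
corr = df.corr().abs()
# 只看上三角,避免把每对特征统计两次
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
print(to_drop)  # ['x2']:与 x1 高度相关,通常只保留其一
```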
2. Chi-Square Test(卡方检验)
是什么(大白话)
衡量“一个类别特征与目标变量之间是否有显著关系”的统计方法。
本质:检查类别分布是否依赖于目标变量。
适用于:
  • 类别特征 vs 目标变量(也是类别)
典型例子:
某个类别变量(如州、省份、行业)是否影响违约率、转化率、点击率。
什么时候用
  • 类别变量
  • 目标是分类问题
  • 想找“类别差异是否显著”的特征
优点
  • 对类别数据非常有效
  • 快速、无需模型
  • 能直接判断类别特征的显著性
缺点
  • 适用于分类目标,不适合回归
  • 要求样本数足够,否则分布不稳定
面试官喜欢听的点
Chi-square evaluates dependency between categorical features and the target, making it suitable for selecting informative categorical predictors.
一句话总结
Chi-square tests the dependency between categorical features and the target, helping identify which categories contribute useful signal.
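用 sklearn 的 chi2 配合 SelectKBest 可以快速做类别目标下的筛选,下面用 iris 数据集做示意(chi2 要求特征值非负,k=2 为示例取值):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# 卡方检验筛选示意:保留与目标最相关的 k=2 个特征
X, y = load_iris(return_X_y=True)
selector = SelectKBest(chi2, k=2).fit(X, y)
print(selector.get_support())  # 布尔掩码,True 表示被保留的特征
```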
3. Mutual Information(互信息)
是什么(大白话)
衡量“特征知道多少目标的信息量”。
互信息越大,说明该特征对目标越有帮助。
不同于相关性,互信息可以捕捉非线性关系。
为什么好用
它并不假设线性关系,能捕捉任意形式的依赖关系,因此:
  • 适用于非线性模型
  • 适用于类别和连续变量
  • 比相关性更一般化
什么时候用
  • 特征与目标之间存在非线性关系
  • 混合数据类型(连续+类别)
  • 想做快速预筛但不想限制线性模型假设
优点
  • 可检测非线性
  • 适用连续和类别特征
  • 不依赖模型
缺点
  • 计算较相关性更复杂
  • 解释性不如相关系数直观
  • 容易受噪声影响
面试官喜欢听的点
Mutual information captures both linear and non-linear relationships, making it a robust filter method for mixed-type features.
一句话总结
Mutual information measures how much information a feature provides about the target, capturing non-linear dependencies that correlation misses
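sklearn 提供了现成的互信息估计,下面用 iris 做示意(互信息基于近邻估计,数值会随 random_state 略有波动):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

# 互信息筛选示意:分数越高,特征对目标提供的信息越多
X, y = load_iris(return_X_y=True)
scores = mutual_info_classif(X, y, random_state=0)
print(scores.round(3))  # petal 两个特征的分数明显高于 sepal
```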

Wrapper Methods

Filter 方法只看“特征本身”。
Wrapper 方法则是:直接让模型参与特征选择,通过训练模型来评估特征的质量。就像把模型当作“评审”,让它告诉你哪些特征好、哪些特征拖后腿。
Wrapper 的最大优势:
基于模型性能做选择,比 Filter 更可靠。

为什么要用 Wrapper 方法

  • 特征不是孤立存在的
  • 两个特征可能单独不强,但组合后非常强
  • Wrapper 方法考虑了特征之间的交互作用
  • 最终选出来的特征更适合模型

Wrapper 缺点(必须知道的)

  • 非常耗时(需要训练无数次模型)
  • 数据量大时难以使用
  • 容易过拟合(因为一直在训练模型)
  • 通常只适合轻量模型(Logistic / Linear / Tree)
RFE(Recursive Feature Elimination)
RFE 的思路就是:
越不重要的越先淘汰,逐层筛选出特征最优子集。“递归地把不重要的特征剔除掉。”
步骤如下:
  1. 训练模型
  2. 根据模型权重/重要性排序特征
  3. 去掉最不重要的若干个
  4. 重复训练 → 继续去掉特征
  5. 最终保留前 k 个最重要的特征
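上述步骤对应 sklearn 的 RFE,下面是一个最小示意(数据为合成样例,保留 5 个特征为示例取值):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# RFE 示意:每轮按系数绝对值剔除最不重要的特征,直到只剩 5 个
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=42)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5, step=1)
rfe.fit(X, y)
print(rfe.support_)   # 被保留特征的布尔掩码
print(rfe.ranking_)   # 1 表示被保留,数字越大表示越早被剔除
```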

为什么好用

  • 考虑了特征之间的相关性
  • 基于模型的重要性排序,更接近真实表现
  • 效果通常优于简单的 Filter

什么时候用

  • 使用线性模型(L1/L2 Logistic Regression)
  • 使用 Tree 模型(RF / XGBoost)
  • 特征数较多但不超过几千
  • 想要“排序级”的特征重要性

优点

  • 基于模型性能,可靠性高
  • 能自动处理 feature interactions
  • 可以和任意模型结合

缺点

  • 训练成本高
  • 容易过拟合
  • 结果依赖于所选模型(不同模型可能给不同结果)
Forward/Backward Selection
从“一个特征都没有”开始,逐个往模型里加特征。
步骤:
  1. 从所有特征中选一个效果最好的
  2. 加入模型
  3. 再从剩余特征中选能带来最大增益的
  4. 不断重复
  5. 直到加入更多特征不再提升性能
本质:
从小到大逐步构建最优特征组合。
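sklearn 的 SequentialFeatureSelector 实现了这种贪心式的逐步选择,下面用 iris 做最小示意(保留 2 个特征为示例取值):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Forward selection 示意:从空集开始,每轮加入使交叉验证得分提升最大的特征
X, y = load_iris(return_X_y=True)
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=2, direction="forward")
sfs.fit(X, y)
print(sfs.get_support())  # 被选中特征的布尔掩码
```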

为什么要用

  • 不会一次性加入太多不必要特征
  • 比 Backward 更适合特征数多、样本数少的情况
  • 易解释,便于做面试讲解

什么时候用

  • 特征数量较多
  • 不希望模型训练太慢
  • 想理解“哪些特征最先起作用”

优点

  • 计算量低于 backward
  • 易于理解和解释
  • 不容易被 multicollinearity 影响

缺点

  • 贪心(greedy)策略,不保证全局最优
  • 可能忽略“需要组合才起作用”的特征
  • 仍然需要多次训练模型

Embedded Methods

Embedded 方法的核心思想是:模型在训练的同时就能自动判断哪些特征重要,哪些不重要。特征选择是训练过程本身的一部分,而不是训练后另做处理。
Lasso
Lasso 是一种加了 L1 正则化的线性模型,会让某些特征的系数变成零。也就是说:特征被自动“压成 0” → 被丢弃。这是最经典、最干净的嵌入式特征选择方法。

为什么能做特征选择

L1 惩罚项会逼迫模型参数稀疏,也就是让不重要的特征系数变成零,从而留下最重要的特征。数学细节不需要说,面试只需要讲:L1 会产生稀疏系数,零系数对应特征不重要。
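下面是一个最小的 Lasso 特征选择示意(数据为合成样例,alpha=1.0 为示例取值;实际中通常先标准化,并用 LassoCV 交叉验证选 alpha):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Lasso 示意:L1 正则把不重要特征的系数压成 0
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5, random_state=42)
X = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=1.0).fit(X, y)
kept = [i for i, c in enumerate(lasso.coef_) if abs(c) > 1e-6]
print(kept)  # 非零系数对应被保留的特征
```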

什么时候用

  • 特征数量大
  • 线性模型
  • 数据存在 multicollinearity
  • 希望获得可解释的特征子集
  • 想做 feature shrinkage(权重缩小)

优点

  • 能自动选特征
  • 可解释性强
  • 对多重共线性特征有良好处理能力
  • 简单、稳定、工业界常用

缺点

  • 有时会偏向选择一个特征、将其他高度相关的特征全部扔掉
  • 不适用于强非线性关系
  • 对噪声敏感
Tree-based feature importance
树模型在每次分裂特征时都会计算“这个特征带来了多少纯度提升”。纯度贡献越大,特征越重要。换句话说:树模型天然会告诉你:哪个特征更常被用来做重要的分裂点。

为什么能做特征选择

树模型的结构本身就是逐层分裂特征,因此:
  • 分裂次数多 = 特征更有用
  • 纯度提升大 = 特征更重要
这就是内置的 feature importance。
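用随机森林读取内置 importance 的最小示意如下(iris 数据集;importance 按不纯度下降汇总并归一化,各特征之和为 1):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# 树模型 feature importance 示意
data = load_iris()
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(data.data, data.target)
for name, imp in zip(data.feature_names, rf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```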

两种常见 importance

  • Gini importance(基于纯度减少)
  • Permutation importance(模型性能下降)

什么时候用

  • 数据有非线性关系
  • 特征之间存在高阶交互
  • 树模型已经是你的核心模型
  • 对维度要求不敏感

优点

  • 能捕捉非线性
  • 能捕捉特征交互(比 Lasso 强)
  • 对缩放不敏感
  • 能输出 feature importance 排序

缺点

  • 对高基数类别偏好(容易被误导)
  • 重要性是模型相关的,不是全局统计意义
  • Feature importance 不是因果意义
XGBoost feature importance
XGBoost 会在每一棵树、每一次分裂中记录:
  • 某个特征被用的次数
  • 带来的增益
  • 带来的覆盖样本量
因此它能提供多种特征重要性指标。

常见的重要性类型(面试官爱问)

  1. Gain(最推荐的指标)
      每个特征带来的损失下降量,越大越重要。最可靠。
  2. Cover
      该特征在分裂中影响了多少样本。比 gain 次一级。
  3. Frequency(weight)
      特征作为分裂点出现的次数,可能被误导,所以不如 gain 可靠。

为什么 XGBoost 的重要性更强

  • 基于 boosting 的集成结构
  • 反复学习残差,能捕捉复杂关系
  • 系统地利用梯度信息
  • 能捕捉高阶交互

什么时候用

  • 使用 XGBoost/LightGBM/CatBoost
  • 想要非线性 + 交互的特征排序
  • 数据集较复杂时

优点

  • 高度可靠的 feature ranking
  • 支持分类、回归
  • 适用于工业界大量模型

缺点

  • 容易偏向高基数类别
  • 同一个特征在不同模型中可能排序不同
  • 重要性不等于因果关系

Dimensionality Reduction

PCA
PCA 通过线性组合,把高维特征压成几个“最有信息量的新特征”。
这些新特征叫“主成分”,按信息量(方差)从大到小排序。
理解方式:
PCA = 在不看目标变量的情况下,尽量保留数据的最大方差。

为什么要用

  • 去除特征之间的冗余和高相关性
  • 消除多重共线性
  • 减少维度、降低噪声
  • 提升模型稳定性

什么时候用

  • 特征高度相关的线性模型
  • 数据规模大,需要降维加速训练
  • 可解释性不是首要目标
  • 对方差保留率有要求(例如保留 95% 变化)

优点

  • 快速、稳定、可解释性中等
  • 能处理共线性
  • 适合连续变量

缺点

  • 只能捕捉线性关系
  • 新特征(主成分)难以直接解释
  • 不利用目标变量(无监督)
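“保留 95% 方差”的用法在 sklearn 中可以直接写成下面的示意(iris 数据集;PCA 前通常先做标准化):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA 示意:n_components=0.95 表示保留累计方差 ≥ 95% 所需的最少主成分
X, _ = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95).fit(X)
print(pca.n_components_)                       # 所需主成分个数
print(pca.explained_variance_ratio_.round(3))  # 各主成分解释的方差比例
```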
Factor Analysis
Factor Analysis 假设观测到的特征由少数几个“潜在因子”驱动,并试图找出这些隐藏因子。
对 PCA 的大白话比较:
PCA 是为了压缩信息量,
Factor Analysis 是试图解释“共同原因”。

为什么要用

  • 希望从众多变量中提炼“潜在维度”
  • 常用于心理学、问卷、金融因子模型
  • 用于解释性强的场景

什么时候用

  • 想要解释特征背后的“共同驱动力”
  • 明确假设特征由少量 latent factors 影响
  • 处理变量间共性结构

优点

  • 可解释性比 PCA 强
  • 能体现变量间的共性结构
  • 适合发现潜在维度

缺点

  • 假设较强(特征由因子驱动)
  • 模型更复杂
  • 不如 PCA 稳定、快速

面试官喜欢听的点

Factor analysis models the covariance structure by assuming that observed variables are generated by a few latent factors.
UMAP
UMAP 尝试保持原始数据的“拓扑结构”,让距离关系在低维空间保持一致。
理解方式:
UMAP = 让高维空间的邻近关系在低维中尽量保持不变。

为什么要用

  • 比 t-SNE 快得多
  • 可以保持全局结构
  • 支持监督和无监督
  • 非线性能力强

什么时候用

  • 高维复杂数据(文本、嵌入、图像)
  • 想要既看局部结构又看全局结构
  • 想比 t-SNE 更快、更稳定的替代方案
  • 数据规模从中等到大型

优点

  • 快速、可扩展、支持监督
  • 捕捉局部和全局结构
  • 比 t-SNE 更适合建模前降维

缺点

  • 参数较多(邻居数、最小距离)
  • 解释性较弱
  • 可视化结果可能受参数影响

面试官喜欢听的点

UMAP preserves both local and global manifold structure, making it a scalable non-linear dimensionality reduction method.
t-SNE(用于可视化)
t-SNE把高维空间中“距离近的点”尽量放在低维空间中靠近,把远的点放远。
它更关注局部结构,用来做可视化非常有用。

为什么要用

  • 能把高维数据(如 embeddings)投射到 2D 或 3D
  • 聚类结构非常清晰
  • 用于理解 latent space(例如 NLP、图像)

什么时候用

  • 可视化高维嵌入、特征、分类簇
  • 探索数据结构
  • 用于论文、展示、EDA
    • 不适合作为训练前降维

优点

  • 可视化效果好
  • 能发现 clusters、patterns
  • 对非线性结构敏感

缺点

  • 非常慢
  • 不保留全局结构
  • 难以用于建模前的降维
  • 不可用于外推(不能给新样本轻松降维)

面试官喜欢听的点

t-SNE is excellent for visualizing high-dimensional data by preserving local neighborhoods but is not suitable for modeling pipelines.

Data Splitting

  • Training Data: Used to train the model on input-output pairs.
  • Validation Data: Used to optimize model hyperparameters during training.
  • Testing Data: Evaluates the model's performance on unseen data after training.
1. The training set is usually the largest, used for building the machine learning model.
2. The validation and test sets are smaller and roughly of similar size.
3. The model is not allowed to use examples from the validation and test sets during training, which is why they are often called "hold-out sets".
4. In the past, a common rule of thumb was 70% for training, 15% for validation, and 15% for testing.
Purpose of Validation Set:
The validation set serves two main purposes:
1. It helps choose the appropriate learning algorithm.
2. It assists in finding the best values for hyperparameters
Purpose of Test Set:
1. The test set is used to assess the model's performance objectively.
2. It ensures that the model performs well on data it hasn't seen during training.
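The 70/15/15 rule of thumb can be implemented with two calls to train_test_split (a minimal sketch on toy data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 70/15/15 split: carve out 70% for training, then halve the remaining 30%
X, y = np.arange(200).reshape(100, 2), np.arange(100)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)
print(len(X_train), len(X_val), len(X_test))  # 70 15 15
```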
 

Pick ML Algorithms

| Aspect | Supervised Learning | Unsupervised Learning |
| --- | --- | --- |
| Data Type | Labeled: input-output pairs | Unlabeled: only input data |
| Objective | Predict or classify outputs | Discover patterns, structures |
| Examples | Classification, Regression | Clustering, Dimensionality Reduction |
| Evaluation | Based on prediction accuracy | Often subjective, hard to quantify |
| Use Cases | Predictive modeling, Classification | Anomaly detection, Data exploration |
| Challenges | Requires labeled data; model may overfit | Harder to evaluate; subject to bias |

Supervised Machine Learning

Supervised learning is a type of machine learning where the model is trained on a labeled dataset, meaning that the input data is paired with corresponding output labels. The goal is for the model to learn the mapping from inputs to outputs so that it can make predictions or classifications on new, unseen data.
💡
Key Characteristics:
  1. Labeled Data: The training dataset includes input-output pairs.
  2. Objective: The model aims to learn the relationship between inputs and corresponding labels.
  3. Types: Common types include classification (predicting labels) and regression (predicting numerical values).
 
Linear Regression
Logistic Regression
KNN
Tree-Based Model
SVM
Naive Bayes
Softmax Regression(多分类逻辑回归)

Unsupervised Machine Learning

Definition:
Unsupervised learning is a type of machine learning where the model is given input data without explicit output labels. The goal is for the model to discover patterns, structures, or relationships within the data without explicit guidance on what to look for.
💡
Key Characteristics:
  1. Unlabeled Data: The training dataset consists of input data without corresponding output labels.
  2. Objective: The model aims to identify inherent patterns or groupings in the data.
  3. Types: Common types include clustering (grouping similar data points) and dimensionality reduction (simplifying data while preserving its key features).
 
Common unsupervised machine learning types:
  • Clustering: the process of segmenting the dataset into groups based on the patterns found in the data. Used to segment customers and products, for example.
  • Association: the goal is to find patterns between the variables, not the entries. It's frequently used for market basket analysis, for instance.
  • Anomaly detection: this kind of algorithm tries to identify when a particular data point is completely off the rest of the dataset pattern. Frequently used for fraud detection.

Clustering Algorithm

Nowadays used:
  • For market segmentation (types of customers, loyalty)
  • To merge close points on a map
  • For image compression
  • To analyze and label new data
  • To detect abnormal behavior
Hierarchical Clustering
K-Means Clustering

Dimensionality Reduction

Principal Component Analysis (PCA)

Optimization Algorithm

Gradient Descent

Model Evaluation

Model evaluation overview

评估指标的主要用途是用来对比模型的性能,以便选择最好的模型或者调优模型的效果。简单来说,你会用这些指标来回答以下问题:
  1. 哪个模型表现更好?
      • 比如你训练了多个模型(如决策树、随机森林、支持向量机等),评估指标可以帮助你比较这些模型的性能。
  2. 同一个模型在不同参数配置下,哪个配置效果更好?
      • 例如,你训练了一个随机森林模型,但尝试了不同的超参数(如树的数量、深度等)。通过评估指标可以确定哪个超参数配置效果最佳。
  3. 模型在训练集和测试集上的表现是否一致?
      • 如果模型在训练集上表现很好,但在测试集上表现很差(过拟合),评估指标可以帮助你发现这个问题。
  4. 模型是否符合业务需求?
      • 比如在分类任务中,业务可能更关心召回率(recall),因为漏掉一个正例(比如癌症病人)会有很大影响,而不是仅仅关注整体准确率。

对比的两个对象:模型或配置

通常来说,评估指标是用来对比以下两种情况:
1. 不同模型之间的对比

场景:选择最优模型

当你训练了多个不同的模型(如逻辑回归、随机森林、XGBoost、神经网络),你需要通过评估指标对这些模型的效果进行比较,来决定哪个模型适合你的任务。
例子:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Train Logistic Regression
logistic_model = LogisticRegression()
logistic_model.fit(X_train_cls, y_train_cls)
y_pred_logistic = logistic_model.predict(X_test_cls)

# Train Random Forest
rf_model = RandomForestClassifier()
rf_model.fit(X_train_cls, y_train_cls)
y_pred_rf = rf_model.predict(X_test_cls)

# Train SVM
svm_model = SVC(probability=True)
svm_model.fit(X_train_cls, y_train_cls)
y_pred_svm = svm_model.predict(X_test_cls)

# Compare accuracy of models
print("Logistic Regression Accuracy:", accuracy_score(y_test_cls, y_pred_logistic))
print("Random Forest Accuracy:", accuracy_score(y_test_cls, y_pred_rf))
print("SVM Accuracy:", accuracy_score(y_test_cls, y_pred_svm))

输出示例:

Logistic Regression Accuracy: 0.85
Random Forest Accuracy: 0.90
SVM Accuracy: 0.88
在这个例子中,随机森林的准确率最高,因此它在这个场景下是表现最好的模型。
 
2. 同一个模型的不同超参数配置之间的对比

场景:调优超参数

当你使用同一个模型(如随机森林),但尝试了不同的超参数(如树的数量、深度等),评估指标可以帮助你决定哪个超参数配置效果更好。
例子:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Train Random Forest with different numbers of trees
rf_10 = RandomForestClassifier(n_estimators=10, random_state=42)
rf_50 = RandomForestClassifier(n_estimators=50, random_state=42)
rf_100 = RandomForestClassifier(n_estimators=100, random_state=42)

rf_10.fit(X_train_cls, y_train_cls)
rf_50.fit(X_train_cls, y_train_cls)
rf_100.fit(X_train_cls, y_train_cls)

# Predictions
y_pred_rf_10 = rf_10.predict(X_test_cls)
y_pred_rf_50 = rf_50.predict(X_test_cls)
y_pred_rf_100 = rf_100.predict(X_test_cls)

# Compare F1-scores for different configurations
print("Random Forest (10 trees) F1-Score:", f1_score(y_test_cls, y_pred_rf_10))
print("Random Forest (50 trees) F1-Score:", f1_score(y_test_cls, y_pred_rf_50))
print("Random Forest (100 trees) F1-Score:", f1_score(y_test_cls, y_pred_rf_100))

输出示例:

Random Forest (10 trees) F1-Score: 0.82
Random Forest (50 trees) F1-Score: 0.88
Random Forest (100 trees) F1-Score: 0.90
在这个例子中,100 棵树的随机森林模型在 F1 分数上表现最好,因此这个超参数配置更优。
具体对比指标如何选择?

分类任务中:

选择指标主要取决于你的任务目标:
  • 准确率 (Accuracy):
    • 当你的类别是平衡的时候(即每个类别的样本数量差不多)使用。
    • 例子:图像分类中,每个类别样本数量均衡。
  • 精确率 (Precision):
    • 当误报(False Positives)代价较高时使用。
    • 例子:垃圾邮件分类中,不想把正常邮件误认为垃圾邮件。
  • 召回率 (Recall):
    • 当漏报(False Negatives)代价较高时使用。
    • 例子:癌症检测中,不能漏掉患癌病人。
  • F1-Score:
    • 平衡精确率和召回率时使用。
    • 例子:任务中同样关注误报和漏报的情况。

回归任务中:

选择指标主要取决于你的目标和数据的特性:
  • MAE (Mean Absolute Error):
    • 当你更关心预测值与真实值之间的绝对差距时使用。
    • 例子:预测房价时,绝对误差更直观(比如 5000 美元的误差)。
  • MSE (Mean Squared Error) 或 RMSE:
    • 当你更关心大误差时使用,因为它会对较大的误差赋予更高的权重。
    • 例子:天气预测中,过大的误差可能会导致灾难性后果。
  • R² (R-Squared):
    • 衡量模型对目标值变化的解释能力,通常在对模型整体性能有直观评估时使用。
 
Classification Metrics
1. Accuracy: The proportion of correctly classified instances. It's suitable for balanced datasets but can be misleading in imbalanced datasets.
2. Precision: The ratio of true positives to the total predicted positives. It measures the model's ability to avoid false positives.
3. Recall (Sensitivity or True Positive Rate): The ratio of true positives to the total actual positives. It quantifies the model's ability to find all positive instances.
4. F1-Score: The harmonic mean of precision and recall, which balances both metrics. It's useful when there's an imbalance between classes.
5. ROC Curve (Receiver Operating Characteristic Curve): A graphical representation of the trade-off between true positive rate (recall) and false positive rate at different thresholds. The area under the ROC curve (AUC-ROC) is often used as a summary metric.
6. Precision-Recall Curve: A graph of precision against recall for different threshold values. It's useful when dealing with imbalanced datasets.
7. Confusion Matrix: A table that summarizes the model's performance by showing true positives, true negatives, false positives, and false negatives.
Regression Metrics
1. Mean Absolute Error (MAE): The average absolute difference between predicted and actual values. It's less sensitive to outliers.
2. Mean Squared Error (MSE): The average squared difference between predicted and actual values. It amplifies the impact of larger errors.
3. Root Mean Squared Error (RMSE): The square root of MSE, which is in the same units as the target variable.
4. R-squared (Coefficient of Determination): Measures the proportion of the variance in the target variable explained by the model. It ranges from 0 to 1, with higher values indicating a better fit.
5. Mean Absolute Percentage Error (MAPE): Measures the percentage difference between predicted and actual values, making it interpretable.
 

Regression Metrics

Mean Absolute Error (MAE)
Mean Absolute Error 的计算方式是:
  • 对每一个样本,计算预测值与真实值之间的差
  • 取绝对值
  • 对所有样本求平均

MAE 的核心特性

  • 不包含平方项
  • outliers 更稳健
  • 每一个误差的惩罚是线性的
  • 数值与预测目标处于相同量级
  • 更容易解释和沟通

MAE 与 MSE 的核心差异

  • MSE:强调避免极端错误
  • MAE:强调整体预测的平均偏差水平
📌 MAE 更像是“日常预测误差感受”

Mean Squared Error (MSE)
Mean Squared Error 是通过以下方式计算的:
  • 对每一个样本,计算预测值与真实值之间的差
  • 将差值进行平方
  • 对所有样本的平方误差求平均

MSE 的核心特性

  • outliers(极端误差)非常敏感
  • 因为使用了平方项:
    • 单次大误差会被严重放大
  • 与数据本身的 scale(量纲)强相关
  • 不同问题之间的 MSE 数值不可比较

什么时候适合使用 MSE

  • 一次严重错误是不可接受的
  • 当业务场景对极端偏差非常敏感
  • 希望模型尽量避免任何一次“灾难性预测”
Mean Absolute Percentage Error (MAPE)
Mean Absolute Percentage Error 的计算方式是:
  • 计算预测误差的绝对值
  • 再除以真实值
  • 将误差转化为百分比
  • 对所有样本求平均

MAPE 的优势

  • 误差以 percentage 表达
  • 对非技术人员非常友好
  • 常用于:
    • 商业汇报
    • 客户沟通
    • KPI 展示

MAPE 的主要问题

  • 当真实值 接近 0 或非常小 时:
    • 即使绝对误差很小
    • 百分比误差也会被放大
  • 容易产生误导性的高误差
📌 MAPE 在低基数场景下需要非常谨慎使用
Root Mean Squared Error (RMSE)
Root Mean Squared Error 是:
  • MSE 取平方根得到的指标

RMSE 的意义

  • 数值回到原始预测变量的量纲
  • 比 MSE 更容易直观理解
  • 但依然:
    • 对 outliers 非常敏感
    • 与数据 scale 相关
📌 本质上,RMSE 只是 MSE 的可解释版本
R-squared (Coefficient of Determination)
R-squared 表示:模型解释的变异性占总变异性的比例,用来衡量模型对数据的拟合程度。注意:R² 本身并不惩罚变量数量,对“变量用得过多”的惩罚是下文 Adjusted R² 的职责。
计算方式可以表示为:
  • SSR / SST
  • 或 1 − SSE / SST

R-squared 的取值含义

  • R² = 1
    • 模型完美解释所有目标变量的变异
  • R² = 0
    • 模型完全没有解释能力
  • 通常取值范围在 0 到 1 之间

但它有一个致命问题:

只要你加变量,R² 一定不会下降
哪怕你加的是:
  • 随机数
  • 噪声
  • 完全没业务意义的变量
R² 都可能“假装变好了”。
Adjusted R-squared
如果加入一个新变量后,R² 上升但 Adjusted R² 下降,说明这个变量“看起来有用”,但“实际上没什么价值”,可以考虑剔除。
Adjusted R² 在 R² 的基础上 加了一条规则
我允许你加变量,但你必须“值回票价”
  • ✔ 变量带来明显解释力 → 我奖励你
  • ❌ 变量贡献太小 → 我惩罚你

一个非常好理解的比喻(重点)

想象你在写论文 / 简历

  • R-squared
    • 页数越多,看起来越厉害
  • Adjusted R-squared
    • 我不看页数,我看内容密度
你加了一页废话:
  • 页数 ↑(R² ↑)
  • 含金量 ↓(Adjusted R² ↓)
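Adjusted R² 的常用公式是 R²_adj = 1 − (1 − R²)(n − 1)/(n − p − 1),其中 n 为样本数、p 为特征数。下面的数值示意说明“同样的 R²,变量越多惩罚越重”:

```python
# Adjusted R² 示意:惩罚项随特征数 p 增大而加重
def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(adjusted_r2(0.90, n=100, p=5), 4))   # 0.8947
print(round(adjusted_r2(0.90, n=100, p=50), 4))  # 0.798:同样的 R²,惩罚更重
```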
Python Implementation
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error, r2_score)
import numpy as np

# === 2. Regression Metrics ===
# Generate synthetic regression dataset
X_reg, y_reg = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42)

# Train a Random Forest Regressor
reg = RandomForestRegressor(random_state=42)
reg.fit(X_train_reg, y_train_reg)
y_pred_reg = reg.predict(X_test_reg)

# Regression metrics
print("\nRegression Metrics:")
print("Mean Absolute Error (MAE):", mean_absolute_error(y_test_reg, y_pred_reg))
print("Mean Squared Error (MSE):", mean_squared_error(y_test_reg, y_pred_reg))
print("Root Mean Squared Error (RMSE):", np.sqrt(mean_squared_error(y_test_reg, y_pred_reg)))
print("Mean Absolute Percentage Error (MAPE):", mean_absolute_percentage_error(y_test_reg, y_pred_reg))
print("R² Score:", r2_score(y_test_reg, y_pred_reg))
对比
| 指标 | 英文全称 | 直观含义(中文) | 对极端误差敏感? | 是否容易解释 | 典型使用场景 | 取值范围 & 判断方向 |
| --- | --- | --- | --- | --- | --- | --- |
| MSE | Mean Squared Error | 把每次预测误差平方后求平均,一次大错会被放大惩罚 | ✅ 非常敏感 | ❌ 不直观 | 风险、监管、一次大错不能接受的场景 | ≥ 0,越小越好 |
| RMSE | Root Mean Squared Error | MSE 开平方,量纲回到原始尺度,但本质还是重罚大错 | ✅ 非常敏感 | ⚠️ 一般 | 金融预测、工程预测、需要“可读误差量级” | ≥ 0,越小越好 |
| MAE | Mean Absolute Error | 每次误差取绝对值后求平均,看整体平均偏差 | ❌ 不敏感 | ✅ 直观 | 日常业务预测、运营指标、稳定性优先 | ≥ 0,越小越好 |
| MAPE | Mean Absolute Percentage Error | 把误差转成百分比,方便给人看 | ⚠️ 对小值敏感 | ✅ 非常直观 | 商业汇报、客户沟通、KPI 展示 | ≥ 0%,越小越好 |
| R-squared | Coefficient of Determination | 模型解释了目标变量多少比例的波动 | ❌ 不反映误差大小 | ⚠️ 易被误解 | 辅助解释能力,不用于单独决策 | 理论上 ≤ 1,越接近 1 越好 |

Classification Metrics

notion image
The confusion matrix
A confusion matrix is a table used in classification problems to evaluate the performance of a machine learning model. It provides a summary of the model's predictions compared to the actual outcomes. The confusion matrix is especially useful when dealing with binary classification problems, where there are two classes: positive and negative. However, it can be adapted for multi-class classification as well.
notion image
Accuracy
The ratio between the number of all correctly predicted samples and the number of all samples.
notion image
notion image
Precision(你认为positive里多少真的positive)
The ratio between the number of true positives and the number of all samples classified as positive.
想象一个图书馆找书的场景,你要找所有的计算机类书籍(正例)。
● 精确率:你找出来的一堆书中,真正是计算机类的书占这堆书的比例就是精确率。如果找出来10本,只有6本是计算机类的,精确率就是60%,说明你找的这堆书里有不少“误抓”的非计算机类书。
notion image
Recall(你认为positive里有多少被你找到了)
Recall, also known as sensitivity or the true positive rate, is a performance metric used in classification tasks to measure a model's ability to identify all relevant instances of a positive class within a dataset. It quantifies the proportion of actual positive instances that the model correctly predicted as positive. Recall is particularly important when the consequences of missing positive instances (false negatives) are significant or costly.
召回率:图书馆里所有计算机类书是一个固定数量,你找出来的计算机类书占图书馆里实际计算机类书总数的比例就是召回率。假如图书馆有20本计算机类书,你只找出12本,召回率就是60%,意味着还有一些计算机类书没被你找到。
notion image
 
F1 score
The F1 score, also known as the F1 measure or F1 score, is a commonly used metric in classification tasks, especially when dealing with imbalanced datasets. It combines two essential performance metrics: precision and recall, into a single value that balances the trade-off between them.
In a binary classification model, a large F1 score of 1 indicates excellent precision and recall, while a low score indicates poor model performance.
Interpreting the F1 score depends on the specific problem and context at hand. In general, a higher F1 score suggests better model performance

High F1 score

A high F1 score indicates the strong overall performance of a binary classification model. It signifies that the model can effectively identify positive cases while minimizing false positives and false negatives.
You can achieve a high F1 score using the following techniques:
  1. High-quality training data: A high-quality dataset that accurately represents the problem being solved can significantly improve the model’s performance.
  2. Appropriate model selection: Selecting a model architecture well-suited for the specific problem can enhance the model’s ability to learn and identify patterns within the data.
  3. Effective feature engineering: Choosing or creating informative features that capture relevant information from the data can enhance the model’s learning capabilities and generalization.
  4. Hyperparameter tuning: Optimizing the model’s hyperparameters through careful tuning can improve its performance.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
ROC-AUC Curve
The Receiver Operating Characteristic (ROC) curve is a graphical representation used to evaluate the performance of a binary classification model across different thresholds for classifying positive and negative instances. The ROC curve plots the trade-off between the true positive rate (sensitivity) and the false positive rate (1-specificity) at various threshold settings.
notion image
当 threshold 从 0.5 降到 0.3,Recall / True Positive Rate 一定会上升(It goes up)
Threshold ↓ → 更容易判正 → 抓到更多正样本 → Recall ↑
| Threshold 下降 | 变化方向 |
| --- | --- |
| Recall / TPR | ✅ 上升 |
| False Positive Rate | ✅ 上升 |
| Precision | ❌ 下降(通常) |
| 判正数量 | ✅ 增加 |
ROC Curve
在不同 threshold 下,
True Positive Rate (TPR) 与 False Positive Rate (FPR) 之间的关系

ROC Curve 的构建过程(直觉版)

  1. 选择一个 threshold
  2. 根据 threshold,把概率转成预测结果
  3. 计算:
      • TPR
      • FPR
  4. 在图上画一个点
  5. 改变 threshold
  6. 重复上述过程
  7. 把所有点连起来,形成 ROC Curve

ROC Curve 的坐标含义

  • 横轴:False Positive Rate (FPR)
  • 纵轴:True Positive Rate (TPR)
AUROC(Area Under the ROC Curve)

AUROC 是什么

AUROC 表示 ROC Curve 下方的面积,用来衡量模型整体区分能力。

AUROC 的取值含义

  • AUROC = 1
    • 完美分类模型
    • 不论 threshold 怎么选,都能完美区分正负类
  • AUROC = 0.5
    • 等同于随机猜测
  • 0.5 < AUROC < 1
    • 实际模型常见区间
    • 数值越大,区分能力越强
📌 AUROC 衡量的是“排序能力”,不是具体某个 threshold 的表现
PR Curve(Precision-Recall Curve)
PR Curve 是一条曲线,用来展示:在不同 threshold 下,Precision 与 Recall 之间的关系

PR Curve 的构建逻辑

和 ROC 类似,只是关注点不同:
  • 不再使用 FPR
  • 只关心:
    • Precision
    • Recall

PR Curve 的坐标含义

  • 横轴:Recall
  • 纵轴:Precision

PR Curve 的关键特性

  • 不使用 True Negative
  • 完全聚焦在:
    • 正类是否被找出来
    • 找出来的正类靠不靠谱

在 Class Imbalance 场景下的优势

当 negative 极多时:
  • ROC 会“被 True Negative 冲好看”
  • PR Curve 不受这个影响
  • 能真实反映模型对正类的能力
📌 PR Curve 更“关心少数重要样本”
Classification_report
The classification_report from scikit-learn is a very useful tool to evaluate the performance of a classification model. It provides a summary of the precision, recall, F1-score, and support for each class.
confusion_matrix 和 classification_report 使用的是最终的分类结果(即 y_pred),因为它们评估的是模型在给定阈值下的性能。
  • Confusion Matrix: 显示模型的真阳性、假阳性、真阴性和假阴性的数量。
  • Classification Report: 提供每个类别的精确度、召回率和F1分数。
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Generate a synthetic dataset for demonstration
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict the test set labels
y_pred = model.predict(X_test)

# Generate the classification report
report = classification_report(y_test, y_pred, target_names=['Class 0', 'Class 1'])
print(report)
              precision    recall  f1-score   support

     Class 0       0.89      0.87      0.88       150
     Class 1       0.86      0.89      0.87       150

    accuracy                           0.88       300
   macro avg       0.88      0.88      0.88       300
weighted avg       0.88      0.88      0.88       300
Python Implementation
  • (X_test, y_test): This pair is used with knn.score to evaluate the model's performance directly using the test features (X_test) and the true test labels (y_test). The method uses the model to predict the labels for X_test and then compares these predictions to y_test to compute the score.
  • (y_test, y_pred): This pair is typically used when you manually evaluate the model's predictions. y_pred is the set of predictions made by the model on X_test, and y_test are the true labels. Metrics like accuracy, precision, recall, F1-score, or R² for regression can be calculated using these.
# Import necessary libraries
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             roc_curve, roc_auc_score, precision_recall_curve,
                             confusion_matrix)
import matplotlib.pyplot as plt

# Sample true labels and predicted labels (replace with your data)
true_labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
predicted_labels = [1, 0, 0, 1, 1, 1, 0, 1, 0, 1]

# Accuracy
accuracy = accuracy_score(true_labels, predicted_labels)
print("Accuracy:", accuracy)

# Precision
precision = precision_score(true_labels, predicted_labels)
print("Precision:", precision)

# Recall (Sensitivity)
recall = recall_score(true_labels, predicted_labels)
print("Recall (Sensitivity):", recall)

# F1-Score
f1 = f1_score(true_labels, predicted_labels)
print("F1-Score:", f1)

# ROC Curve and AUC-ROC
# Note: ROC and PR curves are normally computed from predicted probabilities
# (e.g. model.predict_proba(X)[:, 1]); hard 0/1 labels are used here only to
# keep the example self-contained.
fpr, tpr, thresholds = roc_curve(true_labels, predicted_labels)
roc_auc = roc_auc_score(true_labels, predicted_labels)
plt.figure()
plt.plot(fpr, tpr, label='ROC curve (AUC = {:.2f})'.format(roc_auc))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.title('ROC Curve')
plt.show()

# Precision-Recall Curve
precision, recall, thresholds = precision_recall_curve(true_labels, predicted_labels)
plt.figure()
plt.plot(recall, precision, label='Precision-Recall curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.title('Precision-Recall Curve')
plt.show()

# Confusion Matrix
conf_matrix = confusion_matrix(true_labels, predicted_labels)
print("Confusion Matrix:")
print(conf_matrix)
应用案例(如何根据不同的应用场景选择 Precision 还是 Recall 以及阈值的调整)
1. 垃圾邮件检测(Spam Email Detection)
情景描述: 识别垃圾邮件以减少用户收件箱中的无用邮件。
选择重点:
  • Precision: 更加重要。高 Precision 意味着大多数标记为垃圾邮件的邮件确实是垃圾邮件,从而避免将重要邮件错误地分类为垃圾邮件(即避免 False Positive)。
  • Recall: 也重要,但相比 Precision 次要。因为漏掉一些垃圾邮件(False Negative)虽然会让用户多看到一些垃圾邮件,但影响不如误判正常邮件为垃圾邮件大。
阈值调整:
  • 提高阈值可以增加 Precision,但可能会降低 Recall。
  • 降低阈值可以增加 Recall,但可能会降低 Precision。
 
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

# 假设 y_true 是真实标签,y_scores 是模型输出的概率分数
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# 计算 F1 分数(precision/recall 比 thresholds 多一个元素,
# 最后一个点对应 recall=0,需要去掉后再与 thresholds 对齐)
f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1])
best_threshold_index = f1_scores.argmax()
best_threshold = thresholds[best_threshold_index]

print(f"最佳阈值: {best_threshold}")
print(f"对应的 Precision: {precision[best_threshold_index]}")
print(f"对应的 Recall: {recall[best_threshold_index]}")

# 绘制 Precision-Recall 曲线
plt.plot(recall, precision, marker='.')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.show()
2. 疾病检测(Disease Detection)
情景描述: 识别患者是否患有某种疾病,以便及时治疗。
选择重点:
  • Recall: 更加重要。高 Recall 意味着大多数患病患者能够被检测出来,从而减少漏诊的风险(即减少 False Negative)。
  • Precision: 也重要,但相比 Recall 次要。因为误诊(False Positive)虽然会导致一些健康人接受不必要的进一步检查,但不如漏诊危害大。
阈值调整:
  • 降低阈值可以增加 Recall,但可能会降低 Precision。
  • 提高阈值可以增加 Precision,但可能会降低 Recall。
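按上述原则,下面给出一个示意性的阈值选择片段(y_true / y_scores 为随意构造的假设数据,仅作演示):在满足 Recall ≥ 0.9 的所有阈值中,选 Precision 最高的一个。

```python
# 示意代码:在疾病检测这类 Recall 优先的场景下选阈值
# y_true / y_scores 为构造的假设数据,实际中应替换为真实标签与模型概率输出
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_scores = np.array([0.1, 0.3, 0.35, 0.8, 0.2, 0.6, 0.9, 0.4, 0.7, 0.05])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
# precision/recall 比 thresholds 多一个元素,去掉最后一项对齐
precision, recall = precision[:-1], recall[:-1]

# 在满足 Recall >= 0.9 的阈值中,选 Precision 最高的那个
mask = recall >= 0.9
best_idx = int(np.argmax(np.where(mask, precision, -1.0)))
print("选定阈值:", thresholds[best_idx])
print("Recall:", recall[best_idx], "Precision:", precision[best_idx])
```

如果业务对漏诊的容忍度更低,可以把约束从 0.9 提到 0.95 甚至更高,代价是 Precision 进一步下降。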
 
3. 网络欺诈检测(Cyber Fraud Detection)
情景描述: 识别欺诈性交易以保护用户和金融机构免受损失。
选择重点:
  • Recall: 更加重要。高 Recall 意味着大多数欺诈交易能够被检测出来,减少潜在的损失(即减少 False Negative)。
  • Precision: 也重要,因为误判正常交易为欺诈(False Positive)会影响用户体验,但相比 Recall 次要。
阈值调整:
  • 降低阈值可以增加 Recall,但可能会降低 Precision。
  • 提高阈值可以增加 Precision,但可能会降低 Recall。

如何选择阈值(Threshold)

选择阈值的目的是在 Precision 和 Recall 之间找到最佳平衡。具体方法如下:
  1. 绘制 Precision-Recall 曲线: 通过不同阈值下的 Precision 和 Recall 计算结果绘制曲线。
  2. 选择最佳 F1 分数: F1 分数是 Precision 和 Recall 的调和平均数。在某些场景中,通过选择最大化 F1 分数的阈值可以实现较好的平衡。
  3. 根据业务需求调整: 在一些特定业务场景中,可以根据实际需求手动调整阈值。例如,在疾病检测中,更倾向于 Recall 较高的阈值。
 

Improve Your Model’s Performance

Choose only the most relevant features/attributes
There are many other factors that people say are crucial to success, such as diet, where a person was born, whether she had a summer job, whether she learned to be independent while young, and so on. But let's face it: only a few of those factors actually contribute to the positive results. It's similar to the 80/20 principle popular in business and management: roughly 20% of the input is responsible for 80% of the output (not exactly, but you get the point).
It's a similar case with machine learning. People collect massive amounts of data that result in thousands of feature columns. Crunching those numbers is a huge task that even computers struggle with.
Let us say we're able to accomplish that gargantuan task. Gathering more data may not significantly increase the model's accuracy. We might also use ensemble learning methods, but there's a limit to how much they can help.
The 80/20 principle might help again here: we select only the most relevant features, the ones likely to have the largest effect on the output.
The pressing question then is this: how do we figure out which features are the most important? These are the common approaches to finding out and making the right selection:
● Use expertise in the subject (domain knowledge)
● Test the features one by one or in combination and find out which produce the best results
Let us start with domain knowledge. Many companies prefer hiring data scientists and machine learning developers who are also experts in certain fields (e.g. agriculture, manufacturing, logistics, social media, physics). That's because strong background knowledge is often essential for making sense of the data (and taking advantage of it).
For example, the following factors are often attributed to startup success:
● Timing (or just plain luck)
● Potential market evolution (how the market will grow over the years)
● Proportion of investment to research & development
● The founders (can they execute and sustain the plan?)
● The marketing (starting niche then go broad, or go broad in the first place)
● Competitors
● Current economic, social, & political climate (will the society embrace the idea & startup?)
At first, it seems each of those factors can have a huge effect on a startup's success. However, a few experienced investors might only consider 2 or 3 factors, and even fewer will consider just one (timing, or plain luck). With that in mind, they spread their investments across different startups and wait for a winner.
Experience and domain knowledge play a role in that decision (determining which are the most relevant factors and deciding where to invest the bulk of their money). Without those, they’ll be spreading themselves too thin and perhaps get results which are far from optimal.
Use dimensionality reduction
Dimensionality reduction simply means we’ll work only on fewer features as a result of removing redundancies. It’s similar to mapping data from a higher dimensional space into a lower dimensional space.
For example, suppose we have 2 features and one output. Plotting the points may require a 3-dimensional space, but after dimensionality reduction the plot may only require 2D.
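As a rough sketch of this idea, the snippet below uses scikit-learn's PCA on the Iris dataset (chosen purely for illustration) to map 4 features down to 2 dimensions:

```python
# Sketch: reduce 4 features to 2 with PCA (Iris is just a convenient demo dataset)
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                           # 150 samples, 4 features
X_scaled = StandardScaler().fit_transform(X)   # PCA is usually applied after scaling

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                              # now 2-dimensional
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

For Iris, the first two principal components keep most of the variance, so little information is lost in the 4 → 2 mapping.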
Use Cross-validation (用于模型选择)
在机器学习中,评估模型性能是非常重要的。常见的做法是将数据分成训练集和测试集,用训练集训练模型,用测试集评估模型的性能。然而,仅仅依赖一次划分(如 80% 训练集和 20% 测试集)可能会导致以下问题:
  1. 测试结果不稳定:由于数据随机划分的不同,评估结果可能有较大的波动。
  2. 数据分布偏差:一次划分可能无法代表全部数据的分布。
  3. 数据浪费:单次划分可能会浪费一些数据,因为部分数据只用于测试。
交叉验证通过将数据集划分成多个子集,并多次训练和验证模型,从而更稳定、更全面地评估模型性能。
交叉验证的基本工作原理

这种方法被称为 K-Fold Cross-Validation

  1. 将数据集分成 K 个相等大小的子集(folds)。
  2. 进行 K 次训练和测试:
      • 每次选择一个子集作为测试集,剩余的 K−1 个子集作为训练集。
  3. 计算每次的评估指标(如准确率、F1-Score 等)。
  4. 最终取 K 次评估结果的平均值,作为模型的整体性能。
在机器学习中,交叉验证(Cross-Validation)通常在以下几个方面实施:
Model evaluation(模型评估)
问题背景:
  • 单一 train–test split 可能是“运气好 / 运气差”
  • 评估结果对数据切分非常敏感
Cross-validation 在做什么?
  • 多次训练、 多次验证
  • 对结果取平均
本质作用:
Reduce evaluation variance caused by a single random split.
面试表达升级版:
Cross-validation provides a more robust and unbiased estimate of out-of-sample performance by averaging results across multiple folds.
Overfitting detection & stability assessment(过拟合与稳定性)
为什么 CV 能发现过拟合?
  • 如果模型在不同 fold 上表现波动很大
  • 说明模型高度依赖特定数据子集
你要抓住两个指标:
  • Mean performance(整体好不好)
  • Variance across folds(稳不稳)
面试金句:
High variability across folds is often a sign of overfitting or model instability.
Understanding the bias–variance tradeoff(偏差–方差诊断)
Cross-validation 不只是给你一个分数,而是给你结构性信息
现象 → 解释:
  • Mean 很低、Variance 低 → High bias(模型太简单)
  • Mean 高、Variance 高 → High variance(过拟合)
  • Mean 高、Variance 低 → 理想状态
一句话总结:
Cross-validation helps diagnose whether performance issues come from bias or variance.
Model selection & hyperparameter tuning(模型选择)
为什么不用 test set 直接选模型?
  • Test set 应该只用一次
  • 否则会造成 information leakage
CV 的角色:
  • 在 training data 内部完成模型 / 参数选择
  • 选“平均表现最好、最稳定”的方案
面试标准句:
Cross-validation enables fair comparison of models and hyperparameters without overfitting to a single validation split.
 
交叉验证的优点
  1. 减少过拟合风险:因为模型在每次验证时使用了不同的训练和测试数据。
  2. 更稳定的性能评估:避免了一次划分数据带来的偶然性结果。
  3. 更高效利用数据:所有数据都可以被用作训练集或测试集。
交叉验证的种类
K-Fold Cross Validation
K-Fold is we’re essentially dividing our dataset into multiple datasets, then running train-test-split multiple times, across these subsets.
Import parameters we should keep in mind:
n_splits: This is the number of splits we want to make within our dataset.
shuffle: This tells us whether we should shuffle our data before splitting into folds.
random_state: This is the random seed we're setting, similar to train-test-split.
from sklearn.model_selection import KFold

kf = KFold(n_splits=2, shuffle=True, random_state=42)
kf.get_n_splits(X)

folds = {}
fold_number = 1  # Initialize outside the loop so it increments across folds
for train, test in kf.split(X):
    # Store each fold's train/test rows by fold number
    folds[fold_number] = (df.iloc[train], df.iloc[test])
    print('train: %s, test: %s' % (df.iloc[train], df.iloc[test]))
    fold_number += 1
After completing K-Fold Cross-Validation, we'll usually want a cross-validation score: we get the score for each fold, then take the average:
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
scores = cross_val_score(model, X, y, scoring='accuracy', cv=kf, n_jobs=-1)
print(np.mean(scores))
k-Fold gives a more stable and trustworthy result since training and testing are performed on several different parts of the dataset. We can make the overall score even more robust by increasing the number of folds, testing the model on more sub-datasets.
Still, the k-Fold method has a disadvantage: increasing k means training more models, and the training process can become expensive and time-consuming.
 
 
Leave-One-Out Cross Validation(LOOCV)
与标准的K折交叉验证不同,LOOCV的特点是每次只保留一个样本作为测试集,其余样本用于训练模型。这意味着对于包含N个样本的数据集,LOOCV会执行N次训练和测试,每次都选择一个不同的样本作为测试集,其余N-1个样本用于训练。
from sklearn.model_selection import LeaveOneOut
from sklearn.ensemble import RandomForestClassifier

loo = LeaveOneOut()
loo.get_n_splits(X)

all_preds = []
# 为了演示速度,这里只在前 100 个样本上做 LOOCV
for train_index, test_index in loo.split(X[:100]):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    model = RandomForestClassifier()
    model.fit(X_train, y_train)
    y_preds = model.predict(X_test)

    # 每次测试集只有一个样本,记录该预测是否正确
    correct = y_preds[0] == y_test.values[0]
    all_preds.append(correct)
LOOCV的优点包括:
  • 提供了对模型性能的高度可信度估计,因为每个样本都在测试集上出现一次,尽可能地利用了数据。
  • 可以检测模型是否过拟合,因为每个模型都在留出的单个样本上进行测试,能够更敏感地捕捉到模型对个别样本的过拟合情况。
然而,LOOCV也有一些缺点:
  • 计算成本较高,因为需要执行N次训练和测试,对于大型数据集来说可能会非常耗时。
  • 对于小样本数据集,LOOCV的估计可能具有高方差,因为每次只有一个样本用于测试,评估结果可能不稳定。
Time Series KFold
Time Series K-Fold Cross-Validation is a variation of K-Fold Cross-Validation designed specifically for time series data. Unlike standard K-Fold CV, where data points can be randomly shuffled, time series data has a temporal order that must be preserved during cross-validation. Time Series K-Fold CV addresses this issue by creating folds in a way that respects the chronological order of the data.
Time Series K-Fold Cross-Validation is suitable for time series forecasting tasks, where you want to assess how well your model generalizes to future time periods. It helps avoid data leakage and provides a more realistic estimate of a model's performance in production.
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score

tscv = TimeSeriesSplit()
all_scores = []
for train_index, test_index in tscv.split(X):
    # print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    model = RandomForestClassifier()
    model.fit(X_train, y_train)
    y_preds = model.predict(X_test)

    # 注意参数顺序:先真实标签,再预测分数
    pr_auc = average_precision_score(y_test, y_preds)
    all_scores.append(pr_auc)

print(all_scores)
交叉验证的常见问题和注意事项

1. 数据泄露

  • 在做预处理(如特征缩放、缺失值填充)时,要确保操作只基于训练集,不能泄露测试集信息。
  • 解决方法:使用 scikit-learn 的 Pipeline
示例
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 创建一个带预处理的流水线
pipeline = make_pipeline(StandardScaler(), RandomForestClassifier())

# 在交叉验证中使用 Pipeline:缩放只在每折的训练部分拟合,避免数据泄露
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')

2. 不平衡数据

  • 对于不平衡数据集,最好使用 StratifiedKFold,保证每个 fold 的类别分布一致。
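一个最小示例(数据为随意构造的假设数据,仅作演示),展示 StratifiedKFold 如何让每个 fold 的正类比例保持一致:

```python
# 示意:不平衡数据用 StratifiedKFold,保证每个 fold 的类别分布一致
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape(-1, 1)      # 构造的假设数据
y = np.array([1] * 4 + [0] * 16)      # 正类占 20%,负类占 80%

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
fold_stats = []
for train_idx, test_idx in skf.split(X, y):
    # 分层抽样保证每个 test fold 恰好有 1 个正样本、共 5 个样本
    fold_stats.append((int(y[test_idx].sum()), len(test_idx)))
    print("test 正类数:", y[test_idx].sum(), "test 大小:", len(test_idx))
```

普通 KFold 在这种比例下可能出现某个 fold 完全没有正样本,导致该折的指标无法计算或严重失真。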

3. 模型选择和调参

  • 交叉验证不仅用于评估模型,也可以用在模型选择和调参过程中。
  • 与 GridSearchCV 或 RandomizedSearchCV 结合使用。
示例:交叉验证 + 超参数调优
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# 定义超参数搜索范围
param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [None, 10, 20]}

# 使用 GridSearchCV 结合交叉验证
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X, y)

print("最佳参数:", grid_search.best_params_)
print("最佳交叉验证得分:", grid_search.best_score_)
 
L1 and L2 regularization
Hyperparameter tuning
Hyperparameter tuning is a critical step in optimizing the performance of machine learning models. It involves finding the best combination of hyperparameters for a given algorithm. Here’s an overview of how to perform hyperparameter tuning, specifically using techniques like Grid Search and Random Search, along with an example using a popular machine learning library, scikit-learn.

Methods of Hyperparameter Tuning

  1. Grid Search:
      • Grid Search involves exhaustively searching through a manually specified subset of the hyperparameter space.
      • It tests all possible combinations of the hyperparameters provided.
  2. Random Search:
      • Random Search samples a fixed number of hyperparameter settings from a specified range.
      • It does not test all combinations but can cover a larger hyperparameter space given the same computational budget as Grid Search.

Example Using Scikit-Learn

Let's use Grid Search and Random Search to tune hyperparameters for a Random Forest classifier.

1. Grid Search

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the model
rf = RandomForestClassifier()

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, scoring='accuracy')

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Best hyperparameters
print("Best Hyperparameters:", grid_search.best_params_)

# Evaluate the model with the best hyperparameters
best_rf = grid_search.best_estimator_
accuracy = best_rf.score(X_test, y_test)
print("Test Set Accuracy:", accuracy)

2. Random Search

from sklearn.model_selection import RandomizedSearchCV
import numpy as np

# Initialize the model
rf = RandomForestClassifier()

# Define the hyperparameter space
param_dist = {
    'n_estimators': [int(x) for x in np.linspace(start=50, stop=200, num=10)],
    'max_depth': [None] + [int(x) for x in np.linspace(10, 30, num=5)],
    'min_samples_split': [2, 5, 10],
    'bootstrap': [True, False]
}

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_dist, n_iter=100,
                                   cv=5, n_jobs=-1, scoring='accuracy', random_state=42)

# Fit RandomizedSearchCV
random_search.fit(X_train, y_train)

# Best hyperparameters
print("Best Hyperparameters:", random_search.best_params_)
print("Tuned Random Forest Best Accuracy Score:", random_search.best_score_)

Summary

  • Grid Search: Exhaustively tests all specified combinations of hyperparameters.
  • Random Search: Samples random combinations of hyperparameters, allowing for a broader search in the hyperparameter space within a fixed computational budget.

Advantages:

  • Grid Search: Ensures that the best combination of the provided hyperparameters is found.
  • Random Search: More efficient in exploring a large hyperparameter space.

Disadvantages:

  • Grid Search: Can be computationally expensive, especially with a large number of hyperparameters and values.
  • Random Search: Might miss the optimal combination but generally provides good results with less computation.

机器学习算法超参数调优指南

超参数调优是机器学习模型优化的重要步骤,合理的超参数设置可以显著提升模型性能。以下是一份针对常见机器学习算法的超参数调优指南,涵盖常用的算法及其关键超参数。

1. 通用超参数调优方法

在调优之前,了解以下通用方法:
  • 网格搜索(Grid Search):遍历所有可能的超参数组合,适合小规模搜索空间。
  • 随机搜索(Random Search):随机采样超参数组合,适合大规模搜索空间。
  • 贝叶斯优化(Bayesian Optimization):基于概率模型选择最优超参数,适合高维空间。
  • 进化算法(Evolutionary Algorithms):通过模拟自然选择优化超参数。
  • 早停法(Early Stopping):在验证集性能不再提升时停止训练,防止过拟合。

2. 常见算法及其超参数

2.1 线性回归(Linear Regression)

  • 超参数
    • alpha(正则化强度):用于 Ridge 或 Lasso 回归的正则化参数。
    • l1_ratio(L1/L2 混合比例):用于 ElasticNet 回归,控制 L1 和 L2 正则化的比例。
  • 调优方法
    • 使用网格搜索或随机搜索优化 alpha 和 l1_ratio。
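一个示意性的例子(数据用 make_regression 随机生成,仅作演示),用网格搜索同时优化 ElasticNet 的 alpha 和 l1_ratio:

```python
# 示意:用 GridSearchCV 搜索 ElasticNet 的 alpha 和 l1_ratio
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# 随机生成的回归数据,仅作演示
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=42)

param_grid = {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]}
search = GridSearchCV(ElasticNet(max_iter=10000), param_grid, cv=5, scoring="r2")
search.fit(X, y)

print("最佳参数:", search.best_params_)
print("最佳 R^2:", search.best_score_)
```

l1_ratio=1 即退化为 Lasso,l1_ratio=0 即退化为 Ridge,网格搜索本质上是在这两个极端之间找平衡。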

2.2 逻辑回归(Logistic Regression)

  • 超参数
    • C(正则化强度的倒数):较小的值表示更强的正则化。
    • penalty(正则化类型):l1 或 l2。
    • solver(优化算法):如 liblinear、lbfgs、sag 等。
  • 调优方法
    • 使用网格搜索优化 C 和 penalty。

2.3 决策树(Decision Tree)

  • 超参数
    • max_depth(树的最大深度):控制树的复杂度。
    • min_samples_split(节点分裂的最小样本数):防止过拟合。
    • min_samples_leaf(叶节点的最小样本数):防止过拟合。
    • criterion(分裂标准):如 gini 或 entropy。
  • 调优方法
    • 使用网格搜索或随机搜索优化 max_depth、min_samples_split 和 min_samples_leaf。

2.4 随机森林(Random Forest)

  • 超参数
    • n_estimators(树的数量):树越多,模型越稳定,但计算成本越高。
    • max_depth(树的最大深度):控制单棵树的复杂度。
    • min_samples_split(节点分裂的最小样本数)。
    • min_samples_leaf(叶节点的最小样本数)。
    • max_features(每棵树分裂时考虑的最大特征数):如 sqrt 或 log2。
  • 调优方法
    • 使用网格搜索或随机搜索优化 n_estimators、max_depth 和 max_features。

2.5 梯度提升树(Gradient Boosting, 如 XGBoost、LightGBM、CatBoost)

  • 超参数
    • n_estimators(树的数量)。
    • learning_rate(学习率):控制每棵树的贡献。
    • max_depth(树的最大深度)。
    • subsample(样本采样比例):防止过拟合。
    • colsample_bytree(特征采样比例):防止过拟合。
    • lambda 和 alpha(正则化参数):用于 XGBoost。
  • 调优方法
    • 使用贝叶斯优化或网格搜索优化 learning_rate、max_depth 和 n_estimators。

2.6 支持向量机(SVM)

  • 超参数
    • C(正则化参数):控制间隔与分类错误的权衡。
    • kernel(核函数):如 linear、rbf、poly。
    • gamma(核函数系数):影响模型的复杂度。
  • 调优方法
    • 使用网格搜索优化 C、kernel 和 gamma。

2.7 K近邻(K-Nearest Neighbors, KNN)

  • 超参数
    • n_neighbors(邻居数量):控制模型的复杂度。
    • weights(权重函数):如 uniform 或 distance。
    • p(距离度量):如 1(曼哈顿距离)或 2(欧氏距离)。
  • 调优方法
    • 使用网格搜索优化 n_neighbors 和 weights。

2.8 神经网络(Neural Networks)

  • 超参数
    • learning_rate(学习率):控制权重更新的步长。
    • batch_size(批量大小):影响训练速度和稳定性。
    • num_layers(网络层数):控制模型复杂度。
    • hidden_units(每层的神经元数量)。
    • activation(激活函数):如 relu、sigmoid、tanh。
    • dropout(丢弃率):防止过拟合。
  • 调优方法
    • 使用随机搜索或贝叶斯优化来优化 learning_rate、batch_size 和 hidden_units。

2.9 K均值聚类(K-Means Clustering)

  • 超参数
    • n_clusters(聚类数量):控制聚类的数量。
    • init(初始化方法):如 k-means++ 或 random。
    • max_iter(最大迭代次数)。
  • 调优方法
    • 使用肘部法(Elbow Method)或轮廓系数(Silhouette Score)确定 n_clusters。
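一个示意性的用轮廓系数选 n_clusters 的例子(数据用 make_blobs 生成,真实聚类数设为 4,仅作演示):

```python
# 示意:用轮廓系数(Silhouette Score)辅助选择 K-Means 的 n_clusters
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# 生成 4 个分得比较开的簇,仅作演示
centers = [(-5, -5), (-5, 5), (5, -5), (5, 5)]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.8, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"n_clusters={k}, silhouette={scores[k]:.3f}")

print("轮廓系数最高的聚类数:", max(scores, key=scores.get))
```

轮廓系数越接近 1,说明簇内紧凑、簇间分离;肘部法则是观察惯性(inertia)随 k 下降的拐点,两者可以互相印证。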

3. 调优工具

  • Scikit-learn:提供 GridSearchCVRandomizedSearchCV
  • Optuna:基于贝叶斯优化的超参数调优库。
  • Hyperopt:支持分布式超参数优化。
  • Ray Tune:支持分布式和高效的超参数调优。

4. 调优建议

  • 从小范围开始:先在小范围内搜索,再逐步扩大范围。
  • 交叉验证:使用交叉验证评估模型性能,避免过拟合。
  • 并行化:利用多核 CPU 或分布式计算加速调优过程。
  • 记录结果:保存每次调优的结果,便于分析和比较。

通过以上指南,你可以针对不同算法进行有效的超参数调优,从而提升模型性能。
机器学习模型可解释性
  1. Feature Importance(特征重要性)
      • 概念:特征重要性通常指模型内部对各个输入特征贡献度的评估。
      • 作用:帮助我们快速了解模型在做决策时,哪些特征的影响最大。
      • 常见方法:基于树模型(如随机森林、XGBoost)可以直接输出特征重要性,也有一些基于扰动(permutation importance)的算法。
  2. SHAP(Shapley Additive Explanations)
      • 概念:SHAP 借鉴了博弈论中的 Shapley 值,计算每个特征在预测中“分担”贡献的大小。
      • 作用:能为单个预测提供详细解释,量化每个特征对预测结果的正负影响。
      • 优点:具有理论保证(Shapley 值的公平性)和一致性,适用于各种模型。
  3. LIME(Local Interpretable Model-Agnostic Explanations)
      • 概念:LIME 通过在目标数据点附近构造一个简单的局部代理模型(如线性模型)来解释复杂模型的局部行为。
      • 作用:为单个预测提供解释,展示该预测附近哪些特征起了关键作用。
      • 特点:模型无关性(model-agnostic),适用于任何黑箱模型,但只解释局部区域,可能不反映全局特征影响。
  4. Surrogate Models
      • Definition: Simpler, interpretable models (e.g., linear regression, decision trees) that approximate the behavior of a complex AI system.
      • Example: LIME (Local Interpretable Model-agnostic Explanations) approximates a complex model's prediction locally using an interpretable model.
      • Advantages: Provides a simple way to understand the local behavior of complex models.
      • Applications: Explaining individual predictions of black-box models.
总的来说,特征重要性更多的是从全局角度给出各特征的重要程度;而 SHAP 和 LIME 则侧重于解释单个样本的预测,提供更细致的局部解释。选择哪种方法可以根据具体任务、模型类型和解释需求来定。
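下面是一个基于 permutation importance 的全局特征重要性示意(用 Iris 数据集和随机森林,仅作演示):

```python
# 示意:用 sklearn.inspection.permutation_importance 计算全局特征重要性
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# 在测试集上逐个打乱特征,观察分数下降幅度:下降越多,特征越重要
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=42)
for name, imp in zip(data.feature_names, result.importances_mean):
    print(f"{name}: {imp:.3f}")
```

与树模型自带的 feature_importances_ 相比,permutation importance 是模型无关的,且不偏向高基数特征。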
Optuna(自动化超参数优化框架)
Optuna 是一个专门为机器学习和深度学习设计的 自动超参数搜索框架
核心优势是:轻量、自动化、支持动态搜索空间、搜索效率高、可视化强。
相比 Grid Search / Random Search,Optuna 的搜索方式“更智能”。

1. Optuna 主要概念(大白话)

Study

一次“完整的超参数优化任务”。
可以理解为:
一个搜索 session,负责管理所有实验。

Trial

每一次模型训练就是一个 trial。
Study 会包含多个 trial。
一句话:
Study = 多次 Trial 的集合
Trial = 一次尝试、一次训练

Objective Function(目标函数)

你定义一个函数,告诉 Optuna:
“怎么训练模型 + 怎么返回评价指标”。
Optuna 会自动反复调用这个函数来搜索最优参数。

2. Optuna 的关键优势(与 Grid Search / Random Search 对比)

1) 动态搜索空间(Data-Driven)

Grid Search:提前写死搜索空间
Random Search:随机
Optuna:根据历史结果自动决定下一个参数范围
大白话:
试过某一块很差,之后自动减少探索;试过某块效果好,就多搜索那一块。

2) 贝叶斯优化式的搜索(更智能、更快)

使用 TPE(Tree-structured Parzen Estimator)算法,
能在少量 trial 下快速找到好的参数。

3) 支持 pruning(提前停止不好的 trial)

如果某个 trial 在中途表现明显很差,Optuna 会提前终止它,加速训练。

4) 原生支持并行

多 GPU、多 CPU、多机器都支持。

3. 最常考的问题:

Optuna 与 Grid Search / Random Search 的区别?
你可以这样回答:
Grid Search 穷举搜索,慢,而且高维会爆炸。
Random Search 快一些,但利用过去信息的能力弱。
Optuna 则使用 TPE 算法,能根据历史结果动态调整搜索分布,并且支持提前终止差的 trial,因此搜索效率更高。
一句话总结:
Grid Search 是盲目搜索;Optuna 是数据驱动的智能搜索。

4. Optuna 的核心组件(面试常问)

1) Trial

负责:
  • trial.suggest_*() 采样超参数
  • 把参数传给 objective function
  • 返回当前结果
Trial 可以做:
  • suggest_int
  • suggest_float
  • suggest_categorical
  • suggest_float(..., log=True)(旧接口 suggest_loguniform 已弃用)
  • 建立搜索空间

2) Pruners(剪枝器)

常见:Median Pruner、Successive Halving
作用:提前停止不好的 trial

3) Samplers(采样器)

常见:TPE
作用:决定下一组 hyperparameter 的分布

5. Optuna 的使用流程(大白话)

  1. 定义一个 objective function
      • 在里面写训练模型的步骤
      • 用 trial.suggest_*() 取参数
      • 返回 metric(accuracy / auc / loss)
  2. 建立 Study
      • Study = 整个搜索过程
  3. 运行 study.optimize(objective, n_trials=50)
  4. 用 best_trial.params 取最佳参数
你只需要写 objective,其他都自动化。

6. 什么时候应该用 Optuna?

  • 搜索空间大(三五个以上超参数)
  • 使用 XGBoost、LightGBM、CatBoost(最佳搭配)
  • 使用深度学习模型(学习率、层数、节点数)
  • 想要比 Random Search 更快
  • 训练成本高,希望用 pruning 节省时间

7. Optuna 的优缺点

优点

  • 搜索效率高
  • 搜索空间可以动态调整
  • 支持提前停止
  • 支持并行
  • API 简单
  • 支持可视化(优化曲线、参数重要度等)

缺点

  • 超参数空间太小的话没必要
  • 学习成本比 GridSearchCV 稍高
  • 结果与随机种子、采样器有关
  • 不保证全局最优(但一般很快找到近似最优)