Deep Learning

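Foundations

Origins of Neural Networks

Biological inspiration

Neural networks are inspired by the networks of neurons in the human brain

A single neuron has limited capability

Many neurons working together → can carry out complex computation

McCulloch & Pitts (1943)

Proposed a model of the artificial neuron:

What a single artificial neuron does:

Receives multiple inputs (signals)

Computes a weighted sum

Compares the sum with a threshold

Above the threshold → activates (fires)

👉 In essence: weighted sum + non-linear activation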
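The four steps above can be sketched in a few lines. This is a minimal illustration, not the original 1943 formulation: the function name `threshold_neuron`, the AND-gate weights, and the choice to fire when the sum is greater than or equal to the threshold are all assumptions made for the example.

```python
def threshold_neuron(inputs, weights, threshold):
    """Return 1 (fire) when the weighted sum of the inputs reaches the threshold."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    # Assumption: ">=" counts as reaching the threshold.
    return 1 if weighted_sum >= threshold else 0

# Hypothetical example: weights (1, 1) and threshold 2 behave like a logical AND.
and_outputs = [threshold_neuron([a, b], [1, 1], 2) for a in (0, 1) for b in (0, 1)]
print(and_outputs)
```

The step function here plays the role of the non-linear activation; modern networks swap it for smooth or piecewise-linear functions.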

From Neural Networks to Deep Learning

What does "deep" mean?

A neural network that contains multiple layers

More layers → a "deeper" network

Deep Learning = Deep Neural Networks

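Historical timeline

1940s–1950s

The artificial neuron is proposed

Rosenblatt: Perceptron

Widrow & Hoff: an early working neural network

❌ Limitations:

No good training method

Insufficient computing power

Very little data

1980s: the key breakthrough

Backpropagation

Makes it possible to train multi-layer networks

❌ Still limited by:

Insufficient data

Weak computing resources

2000s onward: Deep Learning takes off

Three conditions are met at the same time:

Massive data

Powerful compute (GPUs)

Algorithmic improvements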

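Why did Deep Learning succeed only in roughly the last decade?

🔑 Three core drivers

1️⃣ Data explosion

The internet

Sensors

Mobile devices

Much of the data is already labeled

Deep learning depends heavily on data scale

2️⃣ More computing power

Parallel computing

Cloud computing

3️⃣ Algorithmic progress

More stable training

Mitigation of vanishing gradients

Deeper architectures become feasible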

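Typical applications of Deep Learning

📷 Image recognition

Face recognition (Facebook automatic tagging)

Image classification

🌍 Neural machine translation

Automatic multilingual translation

Text → text

🏥 Healthcare

Predicting sepsis in the ICU

Uses sensor time-series data

🍕 Industrial vision

Pizza quality inspection

Automatic checks of toppings, proportions, and counts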

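What is Deep Learning good at? (key summary)

✅ It does especially well in the following settings

1️⃣ Massive data

More data → generally a stronger model

2️⃣ High-dimensional features

Images: each pixel = one feature

Text: large numbers of word vectors

Video: time × space

3️⃣ Strongly non-linear relationships

Relationships between features are complex

Hard for traditional models to capture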

Neural Network Basics


Input Layer
    ↓
Hidden Layer(s)
(Linear + Activation)
    ↓
Output Layer

Neural network structure (layers, neurons, weights, biases)

A neural network is composed of layers of neurons, where each neuron applies a weighted linear transformation followed by a non-linear activation.

The core parameters are the weights and biases; the model learns them in order to approximate complex functions.

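Forward pass

The forward pass computes the output by passing the input through all layers in turn until the final output is produced; that output is the prediction.

In math, for one layer:


h = σ(Wx + b)

where x is the layer input, W the weights, b the biases, and σ the activation function.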
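To make h = σ(Wx + b) concrete, here is a small sketch of a forward pass through one hidden layer and one output layer. The helper `dense`, the layer sizes, and all weight values are made up for illustration, and sigmoid is used as σ only as an example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(x, W, b, activation):
    """One layer: activation(W x + b), with W given as a list of rows."""
    return [activation(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]

# Hypothetical network: 2 inputs, 2 hidden units, 1 output.
x = [1.0, 2.0]
W1, b1 = [[0.5, -0.5], [0.25, 0.25]], [0.0, 0.0]
W2, b2 = [[1.0, -1.0]], [0.0]

h = dense(x, W1, b1, sigmoid)      # hidden layer: h = σ(W1 x + b1)
y_hat = dense(h, W2, b2, sigmoid)  # output layer, the prediction
print(h, y_hat)
```

Nothing is updated here: the forward pass only evaluates the network with its current weights.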

Loss functions (MSE, BCE, Cross-entropy)

The loss function measures how far the predictions are from the true labels; minimizing it is the objective the model is trained on.

Typical losses:

MSE (mean squared error): regression

BCE (binary cross-entropy): binary classification

Cross-entropy: multi-class classification

Activation Functions

Common activations

ReLU (the most important)

ReLU outputs max(0, x). It is cheap to compute, its gradient does not shrink for positive inputs, and it is currently the most widely used activation function in deep learning.

Sigmoid

Sigmoid maps inputs to (0, 1), which makes it useful as a probability output for binary classification, but it saturates for large |x| and is prone to vanishing gradients.

Tanh

Tanh outputs values between -1 and 1 and is zero-centered, which usually makes optimization better behaved than with sigmoid, but it can still cause vanishing gradients.
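The three activations are short enough to write out directly. The sketch below uses only the Python standard library; the sample inputs are arbitrary.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

# Compare the three functions on a negative, a zero, and a positive input.
for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), round(sigmoid(x), 3), round(tanh(x), 3))
```

ReLU is unbounded above, sigmoid stays inside (0, 1), and tanh stays inside (-1, 1) and is symmetric around zero.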

Why use ReLU?

Mitigates vanishing gradients (compared with sigmoid/tanh)

For positive inputs the ReLU gradient is exactly 1, so it does not shrink toward 0 the way sigmoid and tanh gradients do when they saturate. This mitigates, although it does not fully remove, the vanishing gradient problem that makes deep sigmoid/tanh networks hard to train.

Sparse activation

ReLU outputs zero for every negative input (roughly half of the units when pre-activations are centered around zero), so part of the neurons are switched off. The resulting sparsity can act as a mild regularizer and may help generalization.

Highly efficient

ReLU is extremely simple to compute, so its cost is very low and training is faster.

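Optimization (the core of training)

Optimizers

SGD (stochastic gradient descent)

SGD, as used in practice, updates the parameters using mini-batches, which makes training faster and more scalable than full-batch gradient descent and lets it handle very large datasets.

Drawback: the gradient estimates are noisy, so the updates tend to oscillate.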

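Momentum

Momentum accelerates SGD by accumulating past gradients and smoothing the update direction. It adds "inertia": the model moves faster along a consistent direction and oscillates less.

Intuition for the formula:


Consecutive updates in the same direction → speed up;
a change of direction → slow down.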
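The speed-up and slow-down intuition can be checked with a small sketch. It assumes one common form of the update, v ← βv + g followed by w ← w − lr·v; other texts fold the learning rate into v. The function name and the hyperparameter values are illustrative.

```python
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    v = beta * v + grad   # accumulate past gradients ("inertia")
    w = w - lr * v        # move along the smoothed direction
    return w, v

# Three gradients in the same direction, then one reversed gradient.
v = 0.0
velocities = []
for grad in (1.0, 1.0, 1.0, -1.0):
    _, v = momentum_step(0.0, v, grad)
    velocities.append(v)
print(velocities)

# Toy problem: minimize f(w) = w**2, whose gradient is 2*w.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, 2.0 * w)
print(w)
```

The velocity grows while the gradient keeps its sign and drops as soon as the sign flips, which is exactly the behavior described above.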

RMSProp

RMSProp adapts the learning rate for each parameter by dividing by the square root of a running average of squared gradients, which keeps the updates stable when features and gradients differ widely in scale.

Adam (the most commonly used)

Adam combines Momentum and RMSProp: a first-moment estimate smooths the updates and a second-moment estimate adapts the learning rate per parameter, which typically gives fast and stable convergence. It is the most commonly used optimizer.
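A single Adam update for one scalar parameter might look like the sketch below. The function name `adam_step` is hypothetical; the default values for `lr`, `beta1`, `beta2`, and `eps` are the commonly quoted ones and are an assumption here, since the notes do not list them.

```python
import math

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first moment (Momentum part)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (RMSProp part)
    m_hat = m / (1 - beta1 ** t)                # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# First step on f(w) = w**2 starting from w = 1.0 (gradient 2*w = 2.0).
w, m, v = adam_step(1.0, 0.0, 0.0, grad=2.0, t=1)
print(w)
```

On the first step the bias-corrected ratio is close to 1 in magnitude, so the parameter moves by roughly `lr` whatever the raw gradient scale is; that is the "adaptive learning rate" effect in miniature.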

Learning Rate

Too high → divergence

If the learning rate is too large, each update overshoots the minimum; the loss oscillates or even diverges.

Too low → slow or stuck

Small learning rates converge slowly and may stall in flat regions.

Schedulers

step decay: the learning rate is lowered in steps as training progresses

cosine decay: smooth decay

warm-up: start with a small learning rate, then ramp up to the normal value (standard practice for Transformers)

Summary:

Learning-rate scheduling = "large, then small" or "small, then large, then small"; it improves stability.
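The "small, then large, then small" pattern corresponds to warm-up followed by decay. Below is a minimal sketch of linear warm-up plus cosine decay; the function name `lr_at`, the linear warm-up shape, and the default values are assumptions chosen for the example, not a specific library's scheduler.

```python
import math

def lr_at(step, total_steps, base_lr=1e-3, warmup_steps=100):
    """Learning rate at a given step: linear warm-up, then cosine decay to 0."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

for step in (0, 50, 99, 100, 550, 1000):
    print(step, lr_at(step, total_steps=1000))
```

The rate climbs to `base_lr` over the first `warmup_steps` steps and then follows half a cosine wave down to zero at `total_steps`.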

Regularization (preventing overfitting)

Core regularization methods

L1 / L2 Regularization

L2 shrinks weights smoothly toward zero, keeping them small and stable; L1 can drive weights to exactly 0, which yields sparsity and acts as feature selection.

Dropout

Dropout randomly disables a fraction of the neurons during training to prevent co-adaptation, forcing the network to learn more robust features.
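As a rough illustration, the sketch below implements "inverted" dropout, the variant in which surviving activations are scaled by 1 / (1 − p) during training so that nothing needs to change at inference time. The scaling convention, the function signature, and the seed are assumptions; the notes only describe the random switching-off.

```python
import random

def dropout(activations, p_drop, training, rng):
    """Inverted dropout: zero each unit with probability p_drop, rescale the rest."""
    if not training or p_drop == 0.0:
        return list(activations)          # inference: all neurons are kept
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                    # fixed seed so the run is repeatable
h = [1.0] * 10000
train_out = dropout(h, 0.5, True, rng)
eval_out = dropout(h, 0.5, False, rng)
print(sum(train_out) / len(train_out))
```

With p_drop = 0.5, each unit is either 0 or doubled, so the average activation stays close to its original value.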

Early stopping

Stop training when the validation loss stops improving, so the model does not go on memorizing the training data.

Batch Normalization

Normalizes the intermediate activations in each layer, stabilizing gradients and speeding up training; it also allows a larger learning rate.

Data augmentation

Artificially expands the training data (flips, crops, added noise) to improve generalization and reduce overfitting.

How BatchNorm works

BatchNorm normalizes each feature over the mini-batch so that its mean is ~0 and its variance is ~1, which keeps the distribution of activations in each layer stable.

It was motivated as a way to reduce internal covariate shift.

Advantages:

Reduces exploding/vanishing gradients

Speeds up training

Makes deep networks trainable

Can partly replace dropout
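The normalization step itself is simple. The sketch below handles one feature across a mini-batch using training-mode statistics only; `gamma` and `beta` stand for the usual learnable scale and shift, which the notes above do not mention and which are included here as an assumption. Running averages for inference are left out.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

out = batch_norm([10.0, 12.0, 14.0, 16.0])
out_mean = sum(out) / len(out)
out_var = sum((x - out_mean) ** 2 for x in out) / len(out)
print(out, out_mean, out_var)
```

Whatever the scale of the incoming values, the output has mean ~0 and variance ~1 before `gamma` and `beta` are applied.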

Convolutional Neural Networks (CNN)

CNN Concepts

Convolution & Filters

Filters (kernels) slide over the image and extract local features such as edges, corners, and textures.
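A sliding filter can be sketched with plain lists. The example assumes stride 1 and no padding, and it computes cross-correlation, which is what deep learning libraries usually call convolution. The tiny image and the hand-written vertical-edge kernel are made up for illustration; in a CNN the kernel values are learned.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# Dark left half, bright right half: there is a vertical edge in the middle.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
vertical_edge = [[-1, 1],
                 [-1, 1]]
feature_map = conv2d(image, vertical_edge)
print(feature_map)
```

The output is non-zero only in the column where brightness changes, so the feature map marks where the pattern occurs.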

Feature maps

Feature maps are the outputs of convolutions; they show where in the image a given pattern was detected.

Pooling

Pooling (max/avg) reduces the spatial size, which makes the model more robust and cuts computation.
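Max pooling can be sketched as follows, assuming non-overlapping windows (stride equal to the window size). The input values are arbitrary.

```python
def max_pool2d(fmap, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    return [[max(fmap[i + di][j + dj] for di in range(size) for dj in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 1, 5, 6],
        [2, 2, 7, 8]]
pooled = max_pool2d(fmap)
print(pooled)
```

A 4 × 4 map shrinks to 2 × 2, and small shifts of a strong activation inside a window leave the output unchanged.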

Translation invariance

CNNs can recognize the same object even when it appears at a different position in the image. (Strictly speaking, convolution is translation-equivariant; pooling adds approximate invariance to small shifts.)

CNN Strengths

CNNs are efficient due to parameter sharing and local connectivity.

They naturally capture hierarchical features: early layers learn edges and textures, deeper layers learn object-level shapes and semantics. This makes CNNs one of the best architectures for image data.

RNN / LSTM / GRU (sequence models)

RNN Basics

RNNs process a sequence one time step at a time and maintain a hidden state that carries information about the past.

However, they suffer from vanishing gradients and struggle to capture long-range dependencies.
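The recurrence can be illustrated with a scalar RNN cell, h_t = tanh(w_x · x_t + w_h · h_{t-1} + b). This is a deliberately tiny sketch: real RNNs use weight matrices, and the weight values below are made up.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """Scalar RNN cell: the new hidden state mixes the input with the old state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    h = 0.0                      # initial hidden state
    states = []
    for x_t in sequence:         # one step at a time, in order
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states

# A single "event" at the first step, followed by silence.
states = run_rnn([1.0, 0.0, 0.0, 0.0])
print(states)
```

The trace of the first input fades a little at every step, which gives a feel for why plain RNNs have trouble with long-range dependencies.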

LSTM / GRU (why were they introduced?)

LSTM and GRU introduce gating mechanisms that control what information to keep, update, or forget.

This lets them model long-term dependencies and mitigates vanishing gradients.

As a result they are widely used in NLP, speech, and time-series tasks.

Loss Functions (must know)

Classification Losses

Cross-entropy (the usual choice for multi-class)

Cross-entropy measures the difference between the predicted class probabilities and the true label distribution, which makes it the standard choice for multi-class classification with softmax.

It penalizes confident but wrong predictions heavily, which pushes the model to learn to separate the classes quickly.

BCE (Binary Cross-Entropy)

BCE is used for binary classification (0/1 labels) and measures how well the predicted probabilities match the labels. It is the standard loss for logistic regression and for binary-classification neural networks.

Softmax + CE (softmax combined with cross-entropy)

Softmax converts the logits of the last layer into a probability distribution; cross-entropy then penalizes incorrect predictions, training the model to put more probability on the correct class.
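Softmax followed by cross-entropy can be written in a few lines. The sketch subtracts the largest logit before exponentiating, a standard trick for numerical stability; the logits and the helper names are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)                              # subtract the max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the true class index."""
    return -math.log(softmax(logits)[target])

logits = [5.0, 0.0, 0.0]                         # model is confident in class 0
probs = softmax(logits)
confident_right = cross_entropy(logits, 0)       # true class is 0
confident_wrong = cross_entropy(logits, 1)       # true class is actually 1
print(probs, confident_right, confident_wrong)
```

The same confident prediction costs very little when it is right and a lot when it is wrong, which is the "heavy penalty for confident mistakes" described above.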

Regression Losses

MSE (Mean Squared Error)

MSE penalizes large errors heavily and is smooth, which suits gradient-based optimization (e.g. linear regression, neural-network regression).

MAE (Mean Absolute Error)

MAE is more robust to outliers because the penalty grows only linearly with the error, so outliers do not dominate the loss.
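A small numeric comparison shows the difference in outlier sensitivity. The data below are made up: every prediction is off by 0.5, and in the second case one prediction is off by 10.

```python
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1.0, 2.0, 3.0, 4.0]
clean = [1.5, 2.5, 3.5, 4.5]          # every error is 0.5
with_outlier = [1.5, 2.5, 3.5, 14.0]  # last error is 10

print(mse(y_true, clean), mae(y_true, clean))
print(mse(y_true, with_outlier), mae(y_true, with_outlier))
```

The single outlier multiplies MSE by roughly a hundred but MAE by less than six.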

Advanced Loss: Focal Loss (handling class imbalance)

Focal loss is a common way to handle severe class imbalance. It down-weights easy, well-classified examples so that training focuses on hard examples, which are often minority-class samples (e.g. fraud or default prediction).
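A sketch of the binary form, assuming the usual definition FL = −(1 − p_t)^γ · log(p_t), where p_t is the probability the model assigns to the true class. The class-balancing factor α is omitted and γ = 2 is only a typical choice; both are assumptions, since the notes give no formula.

```python
import math

def bce(p, y):
    """Binary cross-entropy for one example; p is the predicted P(y = 1)."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, y, gamma=2.0):
    p_t = p if y == 1 else 1.0 - p
    return ((1.0 - p_t) ** gamma) * bce(p, y)   # down-weight easy examples

easy_ratio = focal_loss(0.95, 1) / bce(0.95, 1)  # well-classified positive
hard_ratio = focal_loss(0.30, 1) / bce(0.30, 1)  # poorly classified positive
print(easy_ratio, hard_ratio)
```

The easy example keeps only a tiny fraction of its cross-entropy loss while the hard one keeps about half, so the hard examples dominate the gradient. With gamma = 0 the function reduces to plain BCE.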

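Training Deep Networks (core techniques)

Training Neural Networks

Step 1: Forward Propagation

For one sample:

1️⃣ Input → input layer

2️⃣ Multiply by weights → hidden layer

3️⃣ Apply the activation function

4️⃣ Multiply by weights again → output layer

5️⃣ Obtain the prediction ŷ

📌 This step only computes; it does not update any parameters

Step 2: Compute Loss

Compare ŷ with y

Compute the cost / loss

Step 3: Backpropagation

Core idea:

Starting from the output layer, work backward layer by layer to compute each weight's contribution to the error

What it does:

Uses the chain rule

Assigns "responsibility" to every layer and every weight

📌 No need to memorize the formulas, just the logic:

"Whoever contributed to the error gets adjusted"

Step 4: Gradient descent

Gradient descent updates the weights in the direction that lowers the loss; it is the core of optimization.

Formula:

w ← w − η · ∂L/∂w   (η is the learning rate)

Step 5: Repeat until convergence

Move on to the next sample

or to the next mini-batch

until the loss no longer decreases noticeably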
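The five steps can be traced end to end on the smallest possible network: one input, one tanh hidden unit, one linear output, squared-error loss. Everything concrete here (the architecture, the initial weights, the single training pair, the learning rate) is an assumption picked so the chain rule stays readable.

```python
import math

def train_step(params, x, y, lr=0.1):
    w1, b1, w2, b2 = params
    # Step 1: forward propagation
    h = math.tanh(w1 * x + b1)
    y_hat = w2 * h + b2
    # Step 2: compute the loss
    loss = (y_hat - y) ** 2
    # Step 3: backpropagation (chain rule, from the output back to the input)
    d_y_hat = 2.0 * (y_hat - y)
    d_w2, d_b2 = d_y_hat * h, d_y_hat
    d_z = d_y_hat * w2 * (1.0 - h * h)      # tanh'(z) = 1 - tanh(z)**2
    d_w1, d_b1 = d_z * x, d_z
    # Step 4: gradient descent update, w <- w - lr * dL/dw
    new_params = (w1 - lr * d_w1, b1 - lr * d_b1,
                  w2 - lr * d_w2, b2 - lr * d_b2)
    return new_params, loss

params = (0.3, 0.0, -0.2, 0.0)              # hypothetical initial weights
losses = []
for _ in range(200):                        # Step 5: repeat
    params, loss = train_step(params, x=0.5, y=0.8)
    losses.append(loss)
print(losses[0], losses[-1])
```

Each weight's gradient is the product of the factors on the path from that weight to the loss, which is the "responsibility" assignment described in Step 3.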

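Three ways to run Gradient Descent

1. Batch Gradient Descent

Uses the entire training dataset to compute the gradient for each parameter update.

Characteristics

Most stable gradient

Slowest to compute

Practically unusable on large datasets

One-line summary

Stable, but too slow

2. Stochastic Gradient Descent (SGD)

Uses one sample at a time to update the parameters.

Characteristics

Very frequent updates

Noisiest gradient

Prone to oscillation, but explores well

One-line summary

Fast, but unstable

3. Mini-batch Gradient Descent (the default in practice)

Uses a small batch of samples (e.g., 32 or 64) for each parameter update.

Characteristics

Fast

Good stability

Can be computed in parallel

One-line summary (memorize this)

The de facto standard for training deep learning models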
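The three variants differ only in the batch size. The sketch below counts parameter updates per epoch for a hypothetical dataset of 1,000 samples; the helper name `iterate_batches` is made up.

```python
def iterate_batches(n_samples, batch_size):
    """Yield (start, end) index ranges that cover one epoch."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

n = 1000  # hypothetical dataset size
updates_per_epoch = {
    "batch": len(list(iterate_batches(n, n))),         # whole dataset at once
    "sgd": len(list(iterate_batches(n, 1))),           # one sample per update
    "mini-batch": len(list(iterate_batches(n, 32))),   # 32 samples per update
}
print(updates_per_epoch)
```

Batch gradient descent makes one expensive update per epoch, SGD makes a thousand cheap noisy ones, and mini-batch sits in between; the last mini-batch is smaller when the dataset size is not a multiple of the batch size.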

Weight Initialization

Xavier Initialization

Xavier initialization suits sigmoid/tanh activations; it keeps the variance stable across layers in both the forward and the backward pass.

He Initialization (best for ReLU)

He initialization is designed for ReLU: it compensates for ReLU zeroing out roughly half of its inputs, which keeps gradients more stable. It is a common default for modern ReLU networks.
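In their normal-distribution forms, the two schemes differ only in the standard deviation used to draw the weights. The sketch assumes the commonly cited formulas, std = sqrt(2 / (fan_in + fan_out)) for Xavier and std = sqrt(2 / fan_in) for He; uniform variants also exist, and the layer sizes below are arbitrary.

```python
import math
import random

def xavier_std(fan_in, fan_out):
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    return math.sqrt(2.0 / fan_in)

def init_weights(fan_in, fan_out, std, rng):
    """Draw a fan_out x fan_in weight matrix from N(0, std**2)."""
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

rng = random.Random(0)
W_tanh = init_weights(256, 256, xavier_std(256, 256), rng)  # for tanh/sigmoid layers
W_relu = init_weights(256, 256, he_std(256), rng)           # for ReLU layers
print(xavier_std(256, 256), he_std(256))
```

For a square layer, He initialization uses a variance twice as large as Xavier's, which offsets the half of the activations that ReLU sets to zero.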

Overfitting Detection

Train/val curves

If the training loss keeps decreasing while the validation loss increases, the model is clearly overfitting.

Early stopping

Stop training once validation performance stops improving, so the model does not end up memorizing the training set.
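A typical way to implement this is with a "patience" counter: training stops after the validation loss has failed to improve for a set number of consecutive epochs. The function below is a sketch under that assumption; the name, the patience value, and the loss curve are made up.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, bad_epochs = loss, 0     # improvement: reset the counter
        else:
            bad_epochs += 1                # no improvement this epoch
            if bad_epochs >= patience:
                return epoch
    return len(val_losses) - 1             # never triggered: ran to the end

val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]
stop_epoch = early_stopping_epoch(val_losses)
print(stop_epoch)
```

In practice the weights from the best epoch (index 2 here) are the ones to keep, not those from the epoch where training stopped.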

Dropout

Randomly dropping neurons reduces co-adaptation and forces the model to learn more robust representations, which suppresses overfitting by design.

Vanishing / Exploding Gradients

Why do they occur?

In a deep network, backpropagation multiplies gradient factors layer after layer. If the factors are smaller than 1 the product vanishes; if they are larger than 1 it explodes.
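The effect of repeated multiplication is easy to see numerically. The sketch is a deliberate simplification: it treats every layer as contributing the same scalar factor and ignores the weights. The value 0.25 is the maximum of the sigmoid derivative; the depth of 20 and the factor 1.5 are arbitrary.

```python
def gradient_scale(per_layer_factor, n_layers):
    """Size of a gradient after passing back through n_layers identical factors."""
    return per_layer_factor ** n_layers

sigmoid_like = gradient_scale(0.25, 20)   # best case for sigmoid: derivative 0.25
relu_like = gradient_scale(1.0, 20)       # ReLU on positive inputs: derivative 1
exploding = gradient_scale(1.5, 20)       # factors above 1 blow up instead
print(sigmoid_like, relu_like, exploding)
```

Even in the best case, twenty sigmoid layers shrink the gradient by about twelve orders of magnitude, while a factor of 1 passes it through unchanged.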

How activation functions help (ReLU)

ReLU does not squash gradients toward tiny values the way sigmoid/tanh do, so it substantially mitigates vanishing gradients.

How LSTM/GRU help

They use gating mechanisms and additive state updates that let gradients propagate over long distances in a sequence, which addresses the long-term dependency problem.

Evaluation Metrics

Metrics

Accuracy (unsuitable for imbalanced data)

Accuracy is misleading when classes are imbalanced (e.g. fraud): always predicting the majority class can give high accuracy while being useless in practice.

Precision / Recall / F1

Precision measures how well false positives (FP) are controlled; recall measures the ability to capture positives, i.e. how well false negatives (FN) are controlled; F1 is the harmonic mean of the two and is commonly used for imbalanced classification.
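The following sketch computes the three metrics from raw labels and shows why accuracy misleads on imbalanced data. The dataset (5 positives among 100 samples) and both prediction vectors are made up.

```python
def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1] * 5 + [0] * 95                       # 5% positives
always_negative = [0] * 100                       # "predict the majority class"
detector = [1, 1, 1, 1, 0] + [1] * 4 + [0] * 91   # 4 TP, 1 FN, 4 FP

print(accuracy(y_true, always_negative), precision_recall_f1(y_true, always_negative))
print(accuracy(y_true, detector), precision_recall_f1(y_true, detector))
```

The trivial model reaches 95% accuracy with zero recall, the same accuracy as the detector that actually finds four of the five positives.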

AUC

AUC (ROC-AUC) measures the model's ranking ability and is threshold-independent, so it is more suitable than accuracy for imbalanced datasets.
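"Ranking ability" has a direct reading: ROC-AUC equals the probability that a randomly chosen positive gets a higher score than a randomly chosen negative, with ties counted as one half. The sketch below computes it by comparing all pairs, which is fine for small examples but quadratic in the data size; the labels and scores are made up.

```python
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)
```

No threshold appears anywhere in the computation: only the ordering of the scores matters.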

PR-AUC

Under heavy imbalance PR-AUC is more informative than ROC-AUC because it focuses on the positive (minority) class.

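Deep Learning Applications

Computer Vision

Computer Vision = using deep learning models so that machines can "understand" the content of images and videos

📌 The hard part is not "seeing" but understanding

The four core tasks (be sure to tell them apart)

1️⃣ Image Classification

What does it do?

Assigns one label to the whole image

Input → output:

Input: one image

Output: one class

Typical examples:

Face recognition (who is it?)

OCR (digits / letters)

Medical imaging → disease type

📌 Answers only: "what is this?"

2️⃣ Object Detection

What does it do?

It has to know not only "what" the object is

but also "where" it is

The output has two parts:

Class

Bounding Box (location)

Typical examples:

Autonomous driving: pedestrians / bicycles

Security surveillance

📌 Answers: "what + where"

3️⃣ Semantic Segmentation

What does it do?

Classifies every single pixel

Difference from Object Detection:

Object detection: draws boxes

Segmentation: traces outlines, cuts out regions

Typical examples:

Medical imaging (bone vs. tissue)

Autonomous driving (road / pedestrians / vehicles)

📌 Answers: "what does each pixel belong to?"

4️⃣ Image Generation

What does it do?

Uses a model to "generate" images

Common models:

GAN (Generative Adversarial Network)

Typical examples:

Deepfake

Synthetic faces

📌 Core risks: authenticity & ethics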

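How does an image become something a model can consume?

Pixels = features

An image is made up of pixels

Each pixel = 3 values (RGB)

RGB encoding:

Red / Green / Blue

Each ∈ [0, 255]

A concrete example with real numbers

Image size: 1080 × 1920

3 channels per pixel

👉 Number of features = 1080 × 1920 × 3 = 6,220,800 ≈ 6.2 million

📌 This is why a plain fully connected network cannot cope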

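Why not just use a Fully Connected Network?

Problems with fully connected layers

Every input → every neuron

Images have too many features

The number of parameters explodes

Training becomes nearly impossible

👉 Hence CNNs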
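The parameter explosion can be put in numbers for the 1080 × 1920 RGB image used above. The hidden-layer width of 1,000 units and the 64-filter 3 × 3 convolution are hypothetical sizes chosen only for comparison.

```python
height, width, channels = 1080, 1920, 3
n_inputs = height * width * channels            # one feature per pixel value

# Fully connected: every input connects to every hidden unit, plus one bias each.
hidden_units = 1000                             # hypothetical layer width
fc_params = n_inputs * hidden_units + hidden_units

# Convolution: 64 filters of size 3 x 3 over 3 channels, shared across positions.
n_filters, k = 64, 3                            # hypothetical conv layer
conv_params = k * k * channels * n_filters + n_filters

print(n_inputs, fc_params, conv_params)
```

A single fully connected layer needs over six billion weights, while the convolutional layer needs fewer than two thousand, because its filters are reused at every image position.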

ImageNet & Transfer Learning (the industry standard)

What is ImageNet?

~14 million images

~20,000 classes

The "bible" of pre-training for computer vision