- Define the features X
- Construct the linear combination z = XW
- Apply the Sigmoid Function to obtain probabilities
- Define the Log Loss
- Minimize the loss with Gradient Descent
- Obtain the optimal weights
- Convert probabilities into class labels with a threshold
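The forward-pass part of these steps can be sketched in a few lines of NumPy. The feature matrix and weights below are made-up toy values, not from any real dataset:

```python
import numpy as np

def sigmoid(z):
    # Map any real-valued input into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Toy feature matrix X (3 samples, 2 features) and weight vector W
X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])
W = np.array([0.4, -0.2])

z = X @ W                        # linear combination z = XW
p = sigmoid(z)                   # probabilities in (0, 1)
y_hat = (p >= 0.5).astype(int)   # threshold probabilities into class labels
print(y_hat)
```

The loss definition and gradient-descent steps are covered in the sections below.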
Definition
Logistic regression is a statistical method for predicting binary classes. The outcome or target variable is dichotomous in nature, meaning there are only two possible classes. For example, it can be used for cancer detection problems. It computes the probability of an event occurring.
Properties of Logistic Regression:
- The dependent variable in logistic regression follows Bernoulli Distribution.
- Estimation is done through maximum likelihood.
- There is no R-squared; model fitness is assessed through Concordance and KS-statistics.
- Key idea: predict a "probability", not a "class" directly
- Logistic Regression = Linear Model + Sigmoid mapping
Linear Regression Vs. Logistic Regression
Linear regression gives you a continuous output, but logistic regression gives a discrete output. Examples of continuous outputs are house prices and stock prices. Examples of discrete outputs are predicting whether a patient has cancer and predicting whether a customer will churn. Linear regression is estimated using Ordinary Least Squares (OLS), while logistic regression is estimated using the Maximum Likelihood Estimation (MLE) approach.

Pros and Cons
Pros
- Simple algorithm that is easy to implement, does not require high computation power.
- Performs extremely well when the data/response variable is linearly separable.
- Less prone to over-fitting, with low-dimensional data.
- Very easy to interpret, can give a measure of how relevant a predictor is and the association (positive or negative impact on response variable).
Cons
- Logistic regression has a linear decision surface separating the classes it predicts; in the real world it is extremely rare that your data is linearly separable.
- Careful data exploration is required: logistic regression suffers on datasets with high multicollinearity between variables, where repeated information can lead to poorly trained parameters.
- Requires that the independent variables are linearly related to the log odds, log(p/(1-p)).
- The algorithm is sensitive to outliers.
- Hard to capture complex relationships; deep learning and classifiers such as Random Forest can outperform it on more realistic datasets.
| Name | What it does |
| --- | --- |
| Logit function | probability → real number (log-odds) |
| Sigmoid / Logistic function | real number → probability |
Logit function
What are odds, really?
Odds compare how likely an event is to happen versus not happen: odds = p / (1 - p).

The logit function essentially converts a "probability" into a number that can be modeled linearly: logit(p) = log(p / (1 - p)).
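As a quick numerical sketch, here are odds and logit computed for an illustrative probability of 0.8 (the value is arbitrary, chosen only for the example):

```python
import numpy as np

def logit(p):
    # Log-odds: maps a probability in (0, 1) to any real number
    return np.log(p / (1.0 - p))

p = 0.8
odds = p / (1 - p)   # 0.8 / 0.2 = 4.0 -> event is 4x as likely to happen as not
z = logit(p)         # log(4), an unbounded real number suitable for linear modeling
```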


Sigmoid Function
The sigmoid function, also known as the logistic function, is a crucial component of logistic regression. It maps any real-valued number to the range [0, 1], making it suitable for modeling the probability of an event occurring.
If the output of the sigmoid function is more than 0.5, we can classify the outcome as 1 or YES, and if it is less than 0.5, we can classify it as 0 or NO. The output cannot fall outside the range [0, 1]. For example: if the output is 0.75, we can say in terms of probability that there is a 75 percent chance the patient will suffer from cancer.
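A minimal sigmoid implementation makes both properties visible: even extreme inputs stay inside (0, 1), and the 0.5 threshold converts a probability into a class label (the input values are illustrative):

```python
import numpy as np

def sigmoid(z):
    # Squash any real value into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Even very large or very negative inputs stay within (0, 1)
for z in (-100.0, 0.0, 1.0986, 100.0):
    print(z, sigmoid(z))

# sigmoid(1.0986) is roughly 0.75 -> above 0.5 -> classify as 1 / YES
label = 1 if sigmoid(1.0986) > 0.5 else 0
```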

Cost Function
- Uses Log Loss / Cross-Entropy Loss
- Goal: make the model's predicted probabilities match the true labels
- Key property:
👉 predictions that are "very confident but wrong" are penalized extremely heavily
❌ confidently wrong = heavy penalty
⚠️ uncertainly wrong = light penalty
✅ confidently right = rewarded (near-zero loss)
- Reason:
👉 in classification, being "confidently wrong" is far more dangerous than being "hesitantly wrong"
| Actual y | Predicted probability | Model confidence | Correct? | Log Loss penalty |
| --- | --- | --- | --- | --- |
| 1 | 0.99 | very confident | ✅ | very small |
| 1 | 0.55 | not very confident | ✅ | small |
| 1 | 0.45 | not very confident | ❌ | moderate |
| 1 | 0.01 | very confident | ❌ | extremely large |
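The pattern in the table can be verified with a minimal log-loss implementation (the probabilities mirror the table rows, all with true label y = 1):

```python
import numpy as np

def log_loss_single(y, p):
    # Cross-entropy for one example: -[y*log(p) + (1-y)*log(1-p)]
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# True label is 1 in every row of the table above
for p in (0.99, 0.55, 0.45, 0.01):
    print(p, log_loss_single(1, p))
```

The loss grows slowly for confident correct predictions and explodes for confident wrong ones, which is exactly the asymmetry the table describes.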
Gradient Descent
Gradient descent is an iterative optimization method that updates model parameters by moving in the opposite direction of the gradient to minimize the cost function.
In Logistic Regression (or any model):
- We have a Cost Function
- It measures:
👉 how "bad" the model is under the current parameters
Gradient Descent has exactly one goal:
👉 drive the Cost Function to its minimum
Goal of Gradient Descent: find a set of weights that minimizes the Cost Function
1️⃣ Initialize the weights (randomly)
2️⃣ Compute the gradient of the current Cost Function
3️⃣ Update the weights by the rule:
new weights = old weights − learning rate × gradient
What is the learning rate?
- Controls "how far each step goes"
- Too large:
- may overshoot the optimum
- Too small:
- convergence is too slow
📌 This is the core foundation of Neural Networks later on
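The three steps above can be sketched as a complete training loop for logistic regression. The toy dataset, learning rate, and iteration count below are illustrative choices, not tuned values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy linearly separable data: label is 1 when the single feature is positive
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

rng = np.random.default_rng(0)
w = rng.normal(size=1)   # 1) initialize weights randomly
b = 0.0
lr = 0.5                 # learning rate: controls how far each step goes

for _ in range(500):
    p = sigmoid(X @ w + b)            # current predicted probabilities
    # 2) gradient of the log loss with respect to w and b
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    # 3) new weights = old weights - learning rate * gradient
    w -= lr * grad_w
    b -= lr * grad_b

preds = (sigmoid(X @ w + b) >= 0.5).astype(int)
```

With too large a learning rate the updates can overshoot and oscillate; with too small a rate the same 500 iterations may not be enough to converge.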
Hyperparameters
Important Hyperparameters for Logistic Regression
penalty: Specifies the norm used in the penalization (regularization). Common values are 'l1' (Lasso), 'l2' (Ridge), 'elasticnet', or None.
C: Inverse of regularization strength; smaller values specify stronger regularization.
solver: Algorithm to use in the optimization problem. Choices include 'newton-cg', 'lbfgs', 'liblinear', 'sag', and 'saga'.
max_iter: Maximum number of iterations taken for the solvers to converge.
l1_ratio: The Elastic-Net mixing parameter, with 0 <= l1_ratio <= 1. Only used if penalty='elasticnet'.
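A minimal scikit-learn example setting some of these hyperparameters; the dataset is a synthetic one generated with `make_classification` purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification dataset (200 samples, 5 features)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

clf = LogisticRegression(
    penalty="l2",    # ridge-style regularization
    C=1.0,           # inverse regularization strength: smaller = stronger
    solver="lbfgs",  # supports the 'l2' penalty
    max_iter=1000,   # raise this if the solver warns it did not converge
)
clf.fit(X, y)
print(clf.score(X, y))  # mean accuracy on the training data
```

Note that not every solver supports every penalty (for example, 'lbfgs' does not support 'l1'), so these two settings must be chosen together.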
