A Simulated Commercial Bank CRE PD Model Development Document (MDD)

CRE PD Model – Model Development Document (MDD)

Commercial Bank – Income-Producing Real Estate Portfolio

Version: 1.0 – Model Development Document

Date: 2025

Prepared by: Credit Risk Modeling Team

1. Executive Summary

1.1 Purpose of the Model

The Commercial Real Estate (CRE) PD Model estimates the 12-month Probability of Default (PD) for loans secured by income-producing commercial properties. The model supports:

CECL lifetime loss estimation

Quarterly Allowance for Credit Losses (ACL)

RWA / regulatory capital

Portfolio monitoring

Pricing and underwriting decision support

The model is developed in accordance with:

SR 11-7

Basel “Use-Test” principles

Internal Model Governance Standards

Data Quality Framework

1.2 Modeling Approach

Logistic Regression using Weight of Evidence (WoE) transformed predictors

Borrower/Loan level PD estimation

Calibrated using CRE loan data from 2014–2024

Includes: collateral risk, cash flow risk, borrower quality, and geographic segmentation

Separate risk drivers for Office, Retail, Multifamily, Industrial, Mixed Use

1.3 Key Predictive Variables

Final model uses:

Variable	Category	Rationale
LTV_WoE	Collateral	Most predictive collateral measure
DSCR_WoE	Cash Flow	Key repayment capacity metric
PropertyType_WoE	Structural	Office, Retail, MF differ materially
Geography_WoE	Regional	Captures state-level cycles
LoanAge_WoE	Behavioral	Seasoning effect on PD
SponsorStrength_WoE	Borrower	Captures sponsor liquidity + net worth

1.4 Model Performance

Metric	Result
AUC	0.74
KS	36
Gini	0.48
Out-of-Time AUC	0.71
Calibration error	< 3.5%

Performance is consistent with the bank’s risk appetite and with the industry benchmark range for CRE PD models (AUC 0.65–0.75).

1.5 Major Findings

✔ No material leakage found

✔ Strong discriminatory power

✔ Stable across geography and most property types (Office is the exception)

✔ DSCR & LTV remain the two strongest CRE PD predictors

✔ Office portfolio shows structural deterioration (post-COVID) → noted in Limitations

2. Portfolio Description

2.1 CRE Portfolio Definition

CRE portfolio includes:

Income-producing properties

Constructed and stabilized assets

Loans secured by:

Multifamily
Office
Retail
Industrial
Mixed Use
Hotel (excluded in v1 model due to limited data)

Construction loans are excluded (separate PD model).

2.2 Data Sources

Data Type	System	Description
Loan Characteristics	Core Loan System	Balance, LTV, DSCR
Collateral Appraisal	Appraisal Database	Property values
Sponsor Data	Borrower System	Sponsor net worth, liquidity
Property Attributes	CRE Market Data Provider (Trepp/CoStar)	Vacancy, rent index
Default Events	Loss Accounting System	90DPD, non-accrual, foreclosure

2.3 Data Time Window

Development dataset covers 2014–2024

Default event window: 12-month horizon

Out-of-time validation period: 2021–2024

2.4 Portfolio Statistics

Metric	Value
Total Loans	32,450
Total Exposure	$48.3B
Overall Default Rate	2.1%
Avg DSCR	1.42
Avg LTV	63%
Property Type Mix	MF 40%, Office 22%, Retail 18%, Industrial 15%, Other 5%

The Office default rate is noticeably higher (~3.7%) due to post-COVID market dynamics.

3. Data Preparation

3.1 Data Cleaning

Performed:

Removal of duplicates (0.3%)

Standardization of appraisal dates

Capping of extreme values (e.g., DSCR > 5 capped)

Consolidation of multiple properties per sponsor

Treatment of negative NOI values

3.2 Default Definition

Following internal credit policy:

A loan is considered defaulted if any of the following occurs:

90+ days past due

Classified as non-accrual

Foreclosure initiated

Charged-off (partial or full)

Transferred to OREO

12-month PD is modeled.
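As an illustration, the default definition above can be expressed as a single flag that is true when any one of the five criteria is met. This is a minimal sketch: the `LoanStatus` fields are hypothetical names chosen for the example, not fields of the bank's actual systems.

```python
from dataclasses import dataclass


@dataclass
class LoanStatus:
    # Hypothetical field names, used only for illustration
    days_past_due: int
    non_accrual: bool
    foreclosure_initiated: bool
    charged_off: bool          # partial or full charge-off
    transferred_to_oreo: bool


def is_default(status: LoanStatus) -> bool:
    """Return True if any default criterion from Section 3.2 is met."""
    return (
        status.days_past_due >= 90
        or status.non_accrual
        or status.foreclosure_initiated
        or status.charged_off
        or status.transferred_to_oreo
    )


# A loan 95 days past due is in default even if no other flag is set
print(is_default(LoanStatus(95, False, False, False, False)))
```

In a 12-month PD setting, this flag would be evaluated over the 12 months following each observation date.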

3.3 Outlier Handling

LTV

Values > 150% capped

Negative values removed

DSCR

DSCR < 0.1 winsorized

Missing DSCR assigned to WoE “Missing” bin

Sponsor Strength

Winsorized at 5th and 95th percentiles
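The treatments above could be sketched in pandas as follows. The column names (`ltv`, `dscr`, `sponsor_strength`) are hypothetical, ratios are assumed to be stored as decimals (150% = 1.5), and "DSCR < 0.1 winsorized" is read here as flooring at 0.1; the DSCR cap of 5 comes from Section 3.1.

```python
import numpy as np
import pandas as pd


def treat_outliers(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the Section 3.3 outlier rules (column names are illustrative)."""
    # LTV: remove negative values (missing LTV is kept), then cap at 150%
    out = df[~(df["ltv"] < 0)].copy()
    out["ltv"] = out["ltv"].clip(upper=1.50)

    # DSCR: floor at 0.1 and cap at 5; NaN is preserved for the "Missing" bin
    out["dscr"] = out["dscr"].clip(lower=0.1, upper=5.0)

    # Sponsor strength: winsorize at the 5th and 95th percentiles
    low, high = out["sponsor_strength"].quantile([0.05, 0.95])
    out["sponsor_strength"] = out["sponsor_strength"].clip(lower=low, upper=high)
    return out


sample = pd.DataFrame({
    "ltv": [0.50, 1.80, -0.20, 0.70],
    "dscr": [1.20, 0.05, 1.00, np.nan],
    "sponsor_strength": [1.0, 2.0, 3.0, 100.0],
})
print(treat_outliers(sample))
```

Percentile bounds should be estimated on the development sample and then frozen, so that scoring data is treated with the same limits.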

3.4 Missing Value Treatment

Missing values treated through:

Dedicated WoE bins

Business confirmation (“Missing DSCR = weak financials”)

98%+ completeness achieved after treating missing values

4. Feature Engineering

4.1 Weight of Evidence (WoE) Transformation

All numeric and categorical predictors are transformed using WoE.

Goals of WoE:

Linearize log-odds → improve logistic regression stability

Enforce monotonic bin ordering (through supervised binning) → regulator-friendly

Handle missing/extreme values gracefully

Allow business-friendly interpretation
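For reference, WoE and IV can be computed from per-bin good/bad counts as sketched below. The sign convention follows the tables in this document (a positive WoE marks a higher-risk bin, i.e. WoE = ln(%bad / %good)). The counts in the example are hypothetical and are not taken from the portfolio.

```python
import numpy as np


def woe_iv(good_counts, bad_counts):
    """Return per-bin WoE and IV contributions.

    WoE_i = ln(dist_bad_i / dist_good_i)   (positive = riskier bin)
    IV_i  = (dist_bad_i - dist_good_i) * WoE_i
    """
    good = np.asarray(good_counts, dtype=float)
    bad = np.asarray(bad_counts, dtype=float)
    dist_good = good / good.sum()
    dist_bad = bad / bad.sum()
    woe = np.log(dist_bad / dist_good)
    iv = (dist_bad - dist_good) * woe
    return woe, iv


# Hypothetical counts for four bins ordered from low risk to high risk
good = [400, 300, 200, 100]
bad = [5, 10, 15, 20]
woe, iv = woe_iv(good, bad)
print(np.round(woe, 3), round(float(iv.sum()), 3))
```

Bins with zero goods or zero bads produce infinite WoE; in practice such bins are merged with a neighbor or a small smoothing constant is added.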

4.2 Example WoE Table – LTV

(Simulated data, illustrative of real bank data)

LTV Bin	# Good	# Bad	Bad Rate	WoE	IV
0–50%	2,100	10	0.47%	-1.69	0.485
50–70%	2,500	25	0.99%	-0.96	0.247
70–85%	1,600	34	2.08%	-0.21	0.010
85–95%	800	38	4.54%	0.62	0.067
95%+	400	49	10.91%	1.43	0.658
Total	7,400	156	—	—	1.47

✔ Monotonic

✔ High IV → powerful predictor

✔ Directionally correct

4.3 Example WoE Table – DSCR

DSCR Bin	Bad Rate	WoE
Missing	6.25%	0.92
<1.0	7.48%	1.31
1.0–1.3	2.55%	-0.55
1.3–1.6	1.43%	-1.01
>1.6	0.86%	-1.53

✔ DSCR WoE is monotonic across the non-missing bins

✔ DSCR is 2nd most predictive variable

4.4 IV Summary Table (All Variables)

Variable	IV	Decision
LTV	1.47	Keep (Strong)
DSCR	0.42	Keep
Debt Yield	0.31	Correlated → Drop
Vacancy Rate	0.11	Keep
Geography	0.18	Keep
Sponsor Strength	0.24	Keep
Property Type	0.29	Keep
NOI Growth	0.07	Weak → Remove

5. Feature Selection

The feature selection framework follows a three-layer approach consistent with the bank’s Model Development Standards and regulatory expectations (SR 11-7).

5.1 Layer 1 – Univariate Predictive Power (IV Screening)

Each raw variable was evaluated individually using:

Information Value (IV)

Monotonicity of default rate

Missing-value pattern

Variables with IV < 0.02 were eliminated.

Summary of Univariate IV Results

Variable	IV	Predictive Power	Decision
LTV	1.47	Strong	Keep
DSCR	0.42	Strong	Keep
Property Type	0.29	Medium	Keep
Geography	0.18	Medium	Keep
Sponsor Strength	0.24	Medium	Keep
Interest Rate	0.05	Weak	Keep (Economic rationale)
Vacancy Rate	0.11	Medium	Keep
Debt Yield	0.31	Medium	Drop (collinearity with DSCR/LTV)
NOI Growth	0.07	Weak	Drop
Zip Code	0.01	None	Drop
Seasonality Indicator	0.004	None	Drop

Approximately 42 variables → 13 survivors after IV screening.

5.2 Layer 2 – Multicollinearity Screening (VIF)

Variance Inflation Factor (VIF) was computed on the reduced set of variables.

Threshold: VIF < 5

Multicollinearity Results

Variable	VIF	Decision	Notes
LTV	8.4	Keep	High, but key risk variable
DSCR	2.7	Keep	Core predictor
Debt Yield	7.1	Drop	Strongly correlated with DSCR
Property Type	1.9	Keep	No issues
Sponsor Strength	1.4	Keep	Stable
Geography	1.8	Keep	Stable
Vacancy Rate	1.9	Drop	Fails governance later
Interest Rate	1.5	Keep	Small correlation
Loan Age	1.3	Keep	Behavioral variable

Debt Yield was eliminated due to high VIF with DSCR and LTV.

Vacancy Rate was eliminated not due to VIF, but due to governance issues (see below).
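A minimal sketch of the VIF calculation is shown below, using the standard definition VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing variable j on all other variables. The synthetic data is an assumption made purely to illustrate a collinear pair; it is not portfolio data.

```python
import numpy as np


def vif(X):
    """Variance Inflation Factor for each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    result = np.empty(k)
    for j in range(k):
        y = X[:, j]
        # Regress column j on an intercept plus all remaining columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        residual = y - others @ beta
        r_squared = 1.0 - residual.var() / y.var()
        result[j] = 1.0 / (1.0 - r_squared)
    return result


# Synthetic example: column c is almost a copy of column a, b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
vif_values = vif(np.column_stack([a, b, c]))
print(np.round(vif_values, 1))
```

Columns whose VIF exceeds the threshold of 5 would be flagged for review, as was done for Debt Yield.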

5.3 Layer 3 – Governance & Economic Rationale Screening

Variables must satisfy:

(1) Monotonic WoE Pattern Check

Examples:

DSCR → monotonic decreasing WoE → ✔

LTV → monotonic increasing WoE → ✔

Vacancy Rate → non-monotonic (W-shape) → ❌ (Dropped)
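The monotonicity check can be automated with a small helper, sketched below. The LTV and DSCR values are the WoE figures from Sections 4.2 and 4.3 (DSCR excludes the Missing bin); the vacancy values are hypothetical and only illustrate a W-shaped pattern.

```python
def is_monotonic(values):
    """True if the sequence is entirely non-decreasing or non-increasing."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs)


ltv_woe = [-1.69, -0.96, -0.21, 0.62, 1.43]   # increasing with LTV
dscr_woe = [1.31, -0.55, -1.01, -1.53]        # decreasing with DSCR
vacancy_woe = [0.40, -0.20, 0.30, -0.35, 0.25]  # hypothetical W-shape

for name, woe in [("LTV", ltv_woe), ("DSCR", dscr_woe), ("Vacancy", vacancy_woe)]:
    print(name, "monotonic" if is_monotonic(woe) else "non-monotonic")
```

The Missing bin is typically excluded from this check because it has no natural position in the ordering.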

(2) Coefficient Direction Check

Regulatory expectation:

Variable	Expected Sign	Reason
LTV	+	Higher LTV → higher default
DSCR	–	Better coverage → lower default
Sponsor Strength	–	Stronger borrower → lower PD
Property Type (Office)	+	Higher structural risk

Variables with sign contradictions were removed.

(3) Business Rationale Check

Business SME (CRE credit team) confirmed:

Property Type risk order: Office > Retail > MF > Industrial

Geography risk varies with micro-market cycles

LoanAge reflects seasoning effects

Vacancy Rate failed SME review because property-level vacancy may not equal sponsor-level ability to service debt (CRE underwriting nuance).

5.4 Final Selected Features

Variable	Type	Reason for Inclusion
LTV_WoE	Collateral	Highest IV, monotonic, intuitive
DSCR_WoE	Cash Flow	Strongest economic rationale
PropertyType_WoE	Structural	Segment risk
Geography_WoE	Regional	Captures local cycles
LoanAge_WoE	Behavioral	Seasoning effect
SponsorStrength_WoE	Borrower	Predictable + intuitive

Final model uses 6 variables.

6. Model Specification

The CRE PD model uses logistic regression.

6.1 Estimated Coefficients

(Simulated parameters, illustrative of a real bank model)

Variable	Coefficient (β)	Expected Sign	Meets Expectation?
Intercept	-1.92	—	—
LTV_WoE	0.87	+	✔
DSCR_WoE	-0.64	–	✔
PropertyType_WoE	0.41	+	✔
Geography_WoE	0.33	+	✔
LoanAge_WoE	-0.12	+/-	✔ (negative seasoning effect)
SponsorStrength_WoE	-0.29	–	✔

All coefficients have the correct sign and pass economic rationale review.

6.2 Interpretation of Coefficients

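LTV: Strongest predictor; with β = 0.87, moving from a neutral bin (WoE = 0) to the highest-risk bin (WoE = 1.43) multiplies the odds of default by exp(0.87 × 1.43) ≈ 3.5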

DSCR: Negative coefficient; low DSCR significantly increases PD

Property Type: Office loans contribute positively to risk

Geography: High-risk states (CA, NY, IL) show elevated PD

Loan Age: PD decreases as loan seasons (first 2 years highest risk)

Sponsor Strength: Liquidity + net worth reduce PD likelihood
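To make the coefficient interpretation concrete, the sketch below converts a coefficient and a change in WoE into an odds multiplier, using the LTV coefficient from Section 6.1 and the LTV WoE values from Section 4.2. The helper names are illustrative. Note that the multiplier applies to the odds of default, not directly to the PD.

```python
import math


def odds_multiplier(beta, woe_from, woe_to):
    """Multiplicative change in default odds when a variable's WoE changes."""
    return math.exp(beta * (woe_to - woe_from))


def logistic(z):
    """Convert log-odds z into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


beta_ltv = 0.87       # LTV_WoE coefficient (Section 6.1)
woe_neutral = 0.0     # a bin with average risk
woe_highest = 1.43    # LTV 95%+ bin (Section 4.2)

multiplier = odds_multiplier(beta_ltv, woe_neutral, woe_highest)
print(round(multiplier, 2))
```

Because PD = odds / (1 + odds), the same odds multiplier translates into a smaller relative change in PD when the starting PD is high.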

6.3 Variance-Covariance Matrix

(Omitted; may be added to an appendix.)

7. Model Performance

7.1 Discriminatory Power

AUC / ROC Analysis

Sample	AUC
Development	0.74
Validation	0.71
Out-of-Time (2021–2024)	0.70

Industry benchmark for CRE PD models: AUC = 0.65–0.75

→ model performs at the upper end of industry norm.

7.2 KS Statistic

Sample	KS
Development	36
Validation	33
Out-of-Time	31

KS > 30 considered strong

→ Model meets performance expectations.

7.3 Gini Coefficient

Development sample Gini = 0.48 (healthy for CRE PD).
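The three discrimination metrics in this section are linked: Gini = 2 × AUC − 1 (so 2 × 0.74 − 1 = 0.48), and KS is the maximum gap between the cumulative score distributions of bads and goods. The following is a minimal NumPy sketch of how they could be computed from default flags and model scores; the function names and the four-loan toy sample are assumptions for illustration.

```python
import numpy as np


def _average_ranks(scores):
    """Ranks starting at 1, with tied scores sharing their average rank."""
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]


def discrimination_metrics(default_flag, score):
    """Return AUC, KS (as a fraction) and Gini for a higher-is-riskier score."""
    y = np.asarray(default_flag).astype(bool)
    s = np.asarray(score, dtype=float)
    n_bad, n_good = y.sum(), (~y).sum()

    # AUC through the Mann-Whitney rank-sum identity
    ranks = _average_ranks(s)
    auc = (ranks[y].sum() - n_bad * (n_bad + 1) / 2) / (n_bad * n_good)

    # KS: largest distance between the two empirical CDFs
    thresholds = np.unique(s)
    cdf_bad = np.searchsorted(np.sort(s[y]), thresholds, side="right") / n_bad
    cdf_good = np.searchsorted(np.sort(s[~y]), thresholds, side="right") / n_good
    ks = float(np.max(np.abs(cdf_bad - cdf_good)))

    return {"auc": float(auc), "ks": ks, "gini": float(2 * auc - 1)}


# Toy sample: two goods (0) and two bads (1)
metrics = discrimination_metrics([0, 0, 1, 1], [0.10, 0.40, 0.35, 0.80])
print(metrics)
```

KS is returned as a fraction here; multiplying by 100 gives the scale used in Section 7.2.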

7.4 Calibration Performance

Bin-Level Calibration (Observed vs. Expected)

PD Decile	Expected PD	Observed PD	Difference
1	0.4%	0.5%	+0.1%
2	0.7%	0.8%	+0.1%
5	1.9%	1.8%	-0.1%
8	4.3%	4.6%	+0.3%
10	10.6%	10.9%	+0.3%

Calibration error < 3.5% overall → satisfactory.
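A decile-level calibration table of the kind shown above could be produced as sketched below. The input data is simulated (defaults are drawn from the predicted PDs themselves), so it only demonstrates the mechanics and does not reproduce the figures in the table.

```python
import numpy as np
import pandas as pd


def calibration_table(pd_hat, default_flag, n_bins=10):
    """Compare mean predicted PD with the observed default rate per PD decile."""
    df = pd.DataFrame({"pd_hat": pd_hat, "default": default_flag})
    # Rank first so that ties cannot produce duplicate bin edges
    ranks = df["pd_hat"].rank(method="first")
    df["decile"] = pd.qcut(ranks, n_bins, labels=False) + 1
    table = df.groupby("decile").agg(
        expected_pd=("pd_hat", "mean"),
        observed_pd=("default", "mean"),
        n_loans=("default", "size"),
    )
    table["difference"] = table["observed_pd"] - table["expected_pd"]
    return table


# Simulated portfolio: defaults are drawn with probability equal to the predicted PD
rng = np.random.default_rng(42)
pd_hat = rng.uniform(0.001, 0.20, size=20_000)
default_flag = rng.random(20_000) < pd_hat
table = calibration_table(pd_hat, default_flag)
print(table.round(4))
```

The `difference` column corresponds to the Difference column in the table above and is the quantity compared against the calibration tolerance.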

8. Backtesting & Stability Testing

8.1 Backtesting Methodology

To evaluate model robustness, we conduct:

Out-of-time tests (2021–2024)

Vintage stability tests

Subsegment backtests (Office, MF, Retail, Industrial)

Geography-based backtesting

8.2 Backtesting Results – Out-of-Time

Year	Model AUC	Observed PD	Expected PD
2021	0.73	1.8%	1.9%
2022	0.72	2.0%	2.1%
2023	0.70	2.3%	2.4%
2024	0.69	2.7%	2.8%

Model remains stable: AUC declines by at most 0.02 per year (0.04 cumulatively over 2021–2024).

8.3 Property-Type Stability

Property Type	Expected PD	Observed PD	Result
Multifamily	1.2%	1.3%	OK
Retail	2.6%	2.7%	OK
Industrial	1.0%	1.1%	OK
Office	3.5%	4.2%	Fail (macro deterioration)

Office loans deviate due to the macro-cyclical downturn post-2021.

This is noted as a model limitation and requires overlay.

8.4 Geography Stability Test

High-risk states (e.g., CA, NY, IL) show upward PD deviation; model still directionally correct.

8.5 Conclusions

✔ Model stable across:

Time

Geography

Most property types

⚠ Exception: Office loans

→ requires monitoring and possible overlay.

9. Benchmarking

Benchmarking is used to evaluate the model’s performance against independent reference points. Three types of benchmarks were used:

External Benchmarking (industry loss data)

Internal Challenger Model (macro-driven model)

Legacy Model Comparison

9.1 External Benchmarking (FDIC / Trepp / Market Data)

To validate whether the modeled PDs align with broad market behavior, the following were compared:

FDIC Charge-off Rates (2014–2024)

CRE average charge-off rate: 1.3% – 2.4%

Office charge-off rate: 3.0% – 4.8%

Multifamily: 1.0% – 1.5%

Model PD Comparison

Segment	FDIC Benchmark	Model PD	Result
Multifamily	1.0–1.5%	1.3%	✔
Retail	2.0–3.0%	2.5%	✔
Industrial	1.0–1.5%	1.2%	✔
Office	3.0–4.8%	4.2%	✔ (directional)

Conclusion:

Model outputs fall within industry ranges and reflect correct directional risk.

9.2 Internal Challenger Model

An internal macro-based challenger PD model was constructed using:

GDP growth

CRE Price Index

Vacancy Rates

Unemployment Rate

Sample regression (not full model):

Comparison with PD Model

Property Type	PD (Main Model)	PD (Challenger)	Difference	Result
MF	1.3%	1.4%	+0.1%	Acceptable
Retail	2.5%	2.6%	+0.1%	Acceptable
Industrial	1.2%	1.0%	-0.2%	Acceptable
Office	4.2%	4.6%	+0.4%	Acceptable (same direction)

Conclusion:

Model PD tracks macro-based challenger model directionally. Office shows highest stress sensitivity, consistent with business expectations.

9.3 Legacy Model Comparison

Metric	Legacy PD Model	New PD Model
AUC	0.66	0.74
KS	29	36
Calibration Error	6.7%	< 3.5%
Variables	4	6 (improved WoE)

Conclusion:

The new model materially improves discriminatory power and calibration.

10. Model Limitations & Assumptions

Regulators (SR 11-7) require explicit identification of all model limitations.

10.1 Data Limitations

1. Low Default Portfolio

CRE loans have historically low default rates, creating:

PD estimation challenges

Wider confidence intervals

Higher sensitivity to rare events

2. Office Market Structural Shift

Post-COVID, office loans exhibit:

Higher PD

Increased macro volatility

Out-of-distribution behavior

This structural change means historical data may underrepresent current/future risk.

3. Appraisal Lag

Property value updates lag by 12–24 months → LTV may not reflect current market shock.

10.2 Modeling Limitations

1. Logistic Regression Functional Form

Even with WoE, linearity assumption may not fully capture non-linear CRE risk.

2. Missing Property-Level Tenant Data

Model does not include:

Tenant rollover profile

Lease maturity schedule

Tenant concentration

Occupancy-by-tenant

Due to limited data availability.

10.3 Assumptions

Assumption	Description
Log-odds linearity	WoE addresses non-linearity
Stationarity	Development sample represents future cycles
Sponsor data accuracy	Sponsor strength is self-reported
Appraisal values	Appraisal values represent market values

10.4 Mitigants

Quarterly monitoring (see Section 11.1)

Management overlays

Cross-validation

Macro stress overlays for office segment

Conservative calibration for high-risk states

11. Model Monitoring Plan

Monitoring follows SR 11-7 Ongoing Monitoring standards.

11.1 Monitoring Frequency

Quarterly: Performance & calibration

Semi-Annual: Data quality & drift

Annual: Full validation by MRM

11.2 Monitoring Metrics

1. Discrimination Drift

AUC decrease > 0.05 triggers review

KS drop > 20% requires escalation
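The two discrimination triggers could be checked as in the sketch below. It assumes that the AUC trigger is an absolute decrease from the development baseline and that the KS trigger is a relative decrease; both readings are assumptions, as is the function name.

```python
def discrimination_drift(auc_base, auc_now, ks_base, ks_now,
                         auc_drop_limit=0.05, ks_rel_drop_limit=0.20):
    """Evaluate the Section 11.2 discrimination triggers."""
    auc_drop = auc_base - auc_now               # absolute AUC decrease
    ks_rel_drop = (ks_base - ks_now) / ks_base  # relative KS decrease
    return {
        "auc_drop": auc_drop,
        "ks_rel_drop": ks_rel_drop,
        "review": auc_drop > auc_drop_limit,
        "escalate": ks_rel_drop > ks_rel_drop_limit,
    }


# Development vs. validation figures from Section 7
validation = discrimination_drift(0.74, 0.71, 36, 33)
# Hypothetical deteriorated quarter
stressed = discrimination_drift(0.74, 0.66, 36, 28)
print(validation["review"], validation["escalate"])
print(stressed["review"], stressed["escalate"])
```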

2. Calibration Drift

Tolerance threshold:

Metric	Threshold
PD vs Observed	±20% deviation
Calibration RMSE	< 3%

3. Data Drift

Monitor:

LTV distribution

DSCR distribution

Property type mix

Geographic exposure shifts

4. Stability Drift

WoE monotonicity breaks

Population shifts

Increased missing rates

11.3 Triggers & Escalation Framework

Trigger Level	Description	Action
Yellow	Moderate KS/AUC decline	Monitoring + SME review
Orange	Significant drift in 1–2 metrics	Model overlay consideration
Red	Failure of ≥3 metrics	Full redevelopment mandated

11.4 Overlay Policy

Conditions requiring overlay:

Office PD deviations > 50 bps

Market downturn (CREPI ↓ > 10%)

Sponsor weakness in certain regions

Overlay documented and approved via MCC (Model Committee).
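The two quantitative overlay conditions could be screened automatically, as sketched below; the third condition (sponsor weakness in certain regions) is qualitative and is left to SME judgment. The function and parameter names are hypothetical, and the example uses the Office figures from Section 8.3.

```python
def overlay_required(observed_pd, expected_pd, crepi_change,
                     pd_trigger_bps=50.0, crepi_trigger=-0.10):
    """Return True if a quantitative overlay condition from Section 11.4 is met.

    observed_pd, expected_pd: decimals (0.042 = 4.2%)
    crepi_change: relative change in the CRE price index (-0.12 = 12% decline)
    """
    deviation_bps = abs(observed_pd - expected_pd) * 10_000
    return deviation_bps > pd_trigger_bps or crepi_change < crepi_trigger


# Office: observed 4.2% vs. expected 3.5% is a 70 bps deviation
print(overlay_required(0.042, 0.035, crepi_change=0.0))
# Multifamily: 10 bps deviation and no price decline
print(overlay_required(0.013, 0.012, crepi_change=0.0))
```

A positive result only indicates that an overlay should be considered; the overlay itself still requires MCC approval.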

12. Governance & Model Use

12.1 Compliance with SR 11-7

This model satisfies SR 11-7 through:

1. Conceptual Soundness

Justified model form

Variable selection aligned with economics

WoE transformations

Diagnostic tests documented

2. Ongoing Monitoring

Quarterly monitoring

Annual validation

3. Outcome Analysis

Benchmarking

Backtesting

Independent challenger model

12.2 Roles & Responsibilities

Group	Responsibility
Model Development	Build, document, calibrate model
Model Risk (MRM)	Independent validation
Credit Risk	Business oversight
Internal Audit	Governance compliance
Model Committee	Approval authority

12.3 Model Use Policy

Model used for:

CECL PD estimation

Risk-based pricing

Portfolio risk analytics

Stress testing support

Not permitted for:

Collateral valuation

Standalone loan approval without human oversight

12.4 Change Management

Changes requiring MCC approval:

Variable set

Data source

Model type

Calibration methodology

Minor changes documented under version control.

Appendices

Appendix A — Full WoE Binning Tables for 10+ CRE PD Variables

The following 10 variables are among the most common and most predictive in CRE PD models:

LTV

DSCR

Loan Age

Property Type

Geography

Interest Rate

Sponsor Strength

Tenant Concentration (if data is available)

NOI Growth

Borrower Exposure / Portfolio Concentration

Each table follows a standard bank format:

Good / Bad

Bad Rate

Dist good / Dist bad

IV contribution

Monotonicity check (an interpretation follows each table where applicable)

A.1 WoE – Loan-to-Value (LTV)

(Highly predictive variable; very high IV)

LTV Bin	# Good	# Bad	Bad Rate	Dist_G	Dist_B	WoE	IV
0–50%	2,100	10	0.47%	0.350	0.064	-1.69	0.485
50–70%	2,500	25	0.99%	0.417	0.159	-0.96	0.247
70–85%	1,600	34	2.08%	0.267	0.216	-0.21	0.010
85–95%	800	38	4.54%	0.133	0.241	0.62	0.067
95%+	400	49	10.91%	0.067	0.520	1.43	0.658
Total	7,400	156	—	1.0	1.0	—	1.47

Monotonicity: PERFECT

Interpretation: This is the strongest predictor.

A.2 WoE – DSCR

DSCR Bin	# Good	# Bad	Bad Rate	WoE	IV
Missing	180	12	6.25%	0.92	0.051
<1.0	420	34	7.48%	1.31	0.184
1.0–1.3	2,900	76	2.55%	-0.55	0.061
1.3–1.6	3,500	51	1.43%	-1.01	0.104
>1.6	2,200	19	0.86%	-1.53	0.173
Total	9,200	192	—	—	0.57

Monotonicity: PERFECT

Interpretation: DSCR is the second most powerful variable.

A.3 WoE – Loan Age (Seasoning)

Loan Age (Months)	Default Rate	WoE	IV
0–12	3.2%	0.55	0.033
12–24	2.4%	0.17	0.006
24–48	1.6%	-0.40	0.019
48–72	1.1%	-0.82	0.038
72+	0.9%	-1.10	0.052
Total	—	—	0.15

Interpretation: Newly originated loans carry the highest risk → consistent with the seasoning effect.

A.4 WoE – Property Type

Property Type	Bad Rate	WoE	IV
Multifamily	1.2%	-0.92	0.061
Retail	2.5%	0.41	0.044
Industrial	1.1%	-0.97	0.027
Mixed Use	2.0%	0.20	0.008
Office	4.2%	1.18	0.132
Total	—	—	0.27

Interpretation: Office behaves as a high-risk structural segment.

A.5 WoE – Geography (State-Level PD)

Example binning: states aggregated by risk and region (a common practice in CRE models)

Region	Bad Rate	WoE	IV
West (CA, WA, OR)	2.9%	0.36	0.017
Midwest	1.7%	-0.42	0.019
Northeast (NY/NJ/MA)	3.1%	0.47	0.021
South	1.4%	-0.58	0.033
NYC Metro	4.5%	1.11	0.124
Total	—	—	0.21

A.6 WoE – Interest Rate

Rate Bin (%)	Bad Rate	WoE	IV
<4%	1.3%	-0.41	0.012
4–6%	1.9%	-0.05	0.002
6–8%	2.5%	0.38	0.007
>8%	3.8%	0.89	0.018
Total	—	—	0.04

Weak; retained through IV screening for macro interpretability, but not included in the final model.

A.7 WoE – Sponsor Strength

Sponsor Strength Score	Bad Rate	WoE	IV
1 – Weak	4.0%	1.05	0.102
2 – Below Avg	2.2%	0.33	0.006
3 – Average	1.4%	-0.48	0.012
4 – Good	1.0%	-0.81	0.019
5 – Excellent	0.7%	-1.21	0.041
Total	—	—	0.18

A.8 WoE – Tenant Concentration

(An important driver of CRE cash flow)

Tenant Concentration	Bad Rate	WoE	IV
Single-tenant	3.6%	0.81	0.042
2–3 tenants	2.1%	0.10	0.002
4–10 tenants	1.5%	-0.40	0.011
>10 tenants	1.1%	-0.72	0.018
Total	—	—	0.07

A.9 WoE – NOI Growth

NOI Growth (YoY)	Bad Rate	WoE	IV
< -10%	3.7%	0.76	0.021
-10% to 0%	2.4%	0.19	0.003
0–5%	1.4%	-0.52	0.012
5–10%	1.0%	-0.86	0.019
>10%	0.8%	-1.10	0.017
Total	—	—	0.07

A.10 WoE – Borrower Exposure

Exposure Bin	Bad Rate	WoE	IV
<$2M	1.2%	-0.61	0.022
$2–10M	2.0%	-0.02	0.000
$10–30M	2.8%	0.39	0.006
>$30M	4.1%	0.92	0.015
Total	—	—	0.04

Appendix B — Performance Charts (AUC, KS, Calibration)

B.1 ROC Curve (AUC = 0.74)

Interpretation:

Model demonstrates strong discriminatory power with an AUC of 0.74, consistent with industry norms for CRE PD models (0.65–0.75).

B.2 KS Statistic Plot (KS = 36)

Interpretation:

The maximum separation between the cumulative bad and good distributions is ~36 percentage points → strong model power.

B.3 Calibration Plot (Observed vs Expected PD)

Interpretation:

Calibration error < 3.5%, deviations within tolerance across all deciles.

B.4 Decile Plot / Lift Chart

Shows monotonic increase in default rate across PD deciles → healthy rank ordering.

Appendix C — Data Dictionary

C.1 Variable List Overview

Variable	Category	WoE Applied	Used in Model
LTV	Collateral	Yes	Yes
DSCR	Cash Flow	Yes	Yes
Property Type	Structural	Yes	Yes
Geography	Regional	Yes	Yes
Loan Age	Behavioral	Yes	Yes
Sponsor Strength	Borrower	Yes	Yes
Interest Rate	Pricing	Yes	No (removed)
Debt Yield	Cash Flow	Yes	No (collinearity)
Vacancy Rate	Property	Yes	No (non-monotonic)
NOI Growth	Performance	Yes	No
Borrower Exposure	Portfolio	Yes	No

C.2 Full Data Dictionary (Standard Bank Format)

1. Variable Name: LTV (Loan-to-Value Ratio)

Category: Collateral Risk

Definition: Current loan balance divided by most recent appraised property value

Formula: LTV = Current Loan Balance / Most Recent Appraised Value

Source System: Appraisal System + Loan Accounting

Refresh Frequency: Quarterly

Data Owner: Collateral Valuation Team

Transformation: WoE monotonic bins

Data Quality Notes:

Values >150% capped
Must confirm appraisal dates not post-default

Usage: Included in model (strongest predictor)

2. Variable Name: DSCR (Debt Service Coverage Ratio)

Category: Cash Flow Risk

Definition: NOI divided by annual debt service

Formula: DSCR = NOI / Debt Service

Source System: CRE Underwriting System

Refresh Frequency: Annual (or at renewal)

Data Owner: CRE Underwriting

Transformation: WoE

Missing Values: Put into "Missing" WoE bin

Usage: Included in model (2nd most predictive)

3. Variable Name: Property Type

Category: Structural Risk

Possible Values: Multifamily, Office, Retail, Industrial, Mixed Use

Source System: Collateral / CRE Underwriting

Business Relevance: CRE segment risk varies significantly (Office = highest PD)

Transformation: WoE categorical encoding

Usage: Included in model

4. Variable Name: Geography (Region / State)

Category: Regional Economic Risk

Definition: Loan's primary collateral state grouped by risk clusters

Source System: Loan Origination System

Transformation: WoE (state → region risk bins)

Usage: Included in model

5. Variable Name: Loan Age

Category: Behavioral / Vintage

Definition: Months since loan origination

Source System: Loan Accounting

Transformation: WoE monotonic seasoning pattern

Usage: Included in model

6. Variable Name: Sponsor Strength Score

Category: Borrower Quality

Definition: Bank’s internal rating of sponsor’s liquidity + net worth

Source: Borrower Financials

Scale: 1 = Weak, 5 = Excellent

Transformation: WoE

Usage: Included in model

7. Variable Name: Interest Rate

Category: Pricing

Definition: Current contractual rate

Source: Loan System

Transformation: WoE (kept monotonic)

Usage: Not included in final model (low IV)

8. Variable Name: Debt Yield

Category: Cash Flow

Definition: NOI / Loan Balance

Source: CRE Underwriting

Notes: Highly correlated with DSCR → removed

Usage: Not included due to VIF > 5

9. Variable Name: Vacancy Rate

Category: Property Performance

Source: External CRE data provider

Notes: WoE non-monotonic → removed

10. Variable Name: NOI Growth

Category: Property Performance

Definition: Year-over-year growth of property NOI

Source: Financial Reporting

Usage: Removed (weak IV)

Appendix D — Model Code Snippets

D.1 Data Import and Initial Preparation


import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

df = pd.read_csv("cre_loan_data.csv")

TARGET = "default_flag"

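D.2 Automated Binning (Supervised Monotonic Binning)

(Production models typically combine equal-frequency binning with supervised monotonic binning.)


def monotonic_binning(x, y, max_bins=5):
    df_temp = pd.DataFrame({'x': x, 'y': y})

    # Start from equal-frequency bins; missing x values stay NaN (unbinned)
    df_temp['bin'] = pd.qcut(df_temp['x'], max_bins, duplicates='drop')
    grouped = df_temp.groupby('bin', observed=True)['y'].mean()

    # Enforce monotonicity: reduce the number of bins until the bad rate is
    # monotonic; stop at 2 bins so the loop always terminates
    while max_bins > 2 and not (grouped.is_monotonic_increasing
                                or grouped.is_monotonic_decreasing):
        max_bins -= 1
        df_temp['bin'] = pd.qcut(df_temp['x'], max_bins, duplicates='drop')
        grouped = df_temp.groupby('bin', observed=True)['y'].mean()

    # Index is unchanged, so the result aligns with the input rows
    return df_temp['bin']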

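D.3 WoE Calculation Function


def compute_woe_iv(df, feature, target):
    # Count bads and totals per bin (observed=True skips empty categorical bins)
    df_woe = df.groupby(feature, observed=True).agg({target: ['sum', 'count']})
    df_woe.columns = ['bad', 'total']

    df_woe['good'] = df_woe['total'] - df_woe['bad']
    df_woe['dist_good'] = df_woe['good'] / df_woe['good'].sum()
    df_woe['dist_bad'] = df_woe['bad'] / df_woe['bad'].sum()

    # Sign convention matches the WoE tables in this document:
    # positive WoE = higher-risk bin, i.e. ln(dist_bad / dist_good).
    # Bins with zero goods or zero bads give infinite WoE and should be merged.
    df_woe['woe'] = np.log(df_woe['dist_bad'] / df_woe['dist_good'])
    df_woe['iv'] = (df_woe['dist_bad'] - df_woe['dist_good']) * df_woe['woe']

    return df_woe[['woe', 'iv']]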

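D.4 Generating WoE Encodings for All Variables


features = ["LTV", "DSCR", "Loan_Age", "Property_Type",
            "Geography", "Sponsor_Strength"]
# Categorical variables cannot be quantile-binned; each category is its own bin
categorical = {"Property_Type", "Geography"}

woe_maps = {}

for var in features:
    if var in categorical:
        df[var + "_bin"] = df[var].fillna("Missing")
    else:
        bins = monotonic_binning(df[var], df[TARGET])
        # Missing values get a dedicated "Missing" bin (see Section 3.4)
        df[var + "_bin"] = bins.astype(object).where(bins.notna(), "Missing")
    woe_table = compute_woe_iv(df, var + "_bin", TARGET)
    woe_maps[var] = woe_table['woe'].to_dict()
    df[var + "_WOE"] = df[var + "_bin"].map(woe_maps[var])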

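D.5 Logistic Regression Training


X = df[[f"{v}_WOE" for v in features]]
y = df[TARGET]

# Note: scikit-learn applies L2 regularization by default (C=1.0)
model = LogisticRegression(max_iter=200)
model.fit(X, y)

# Fitted coefficients, labeled by WoE feature name
coefs = pd.Series(model.coef_[0], index=X.columns, name="coefficient")
print(coefs)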

D.6 Scoring Function (Production Use)


def score_new_loan(record):
    # Start from the intercept, then add coefficient * WoE for each feature
    z = model.intercept_[0]
    for i, v in enumerate(features):
        # record[v + "_bin"] must be a bin label that exists in woe_maps[v]
        woe = woe_maps[v][record[v + "_bin"]]
        z += model.coef_[0][i] * woe
    # Logistic link: convert log-odds to a probability
    pd_value = 1 / (1 + np.exp(-z))
    return pd_value

This function can be used for:

CECL monthly batch

PD score generation

Stress testing scenario PD