
NLP Fundamentals: The Big Picture
- Key topics:
  - What NLP is, and where it is applied (sentiment analysis, named entity recognition, chatbots, regulatory document processing, etc.)
  - How NLP differs from computer vision and tabular ML
  - Traditional NLP vs deep learning NLP (BoW/TF-IDF vs Word2Vec/BERT)
| Dimension | Question | Key Points to Know |
| --- | --- | --- |
| What is NLP? | What is NLP? | Technology that enables computers to understand and generate natural (human) language. |
| Where is NLP used? | What are NLP use cases in finance? | Annual report analysis, sentiment monitoring, regulatory review, compliance text extraction, customer service automation |
| Text preprocessing | Tokenization vs Lemmatization? | Tokenization: splitting text into tokens; Lemmatization: reducing words to their dictionary form |
| Representation | BoW vs TF-IDF vs Word2Vec — the essential difference? | BoW: raw word counts only; TF-IDF: also weights by rarity; Word2Vec: captures semantic relationships |
| Model evolution | Why does the Transformer outperform the LSTM? | Parallel computation + long-range dependencies + no sequential processing bottleneck |
| NLP vs Tabular | How does NLP differ from modeling structured data? | NLP features are raw text that must be preprocessed and vectorized, unlike tables with directly usable numeric features |
| BERT vs GPT | What is the core difference between BERT and GPT? | BERT is a bidirectional encoder; GPT is a unidirectional decoder used for text generation |
Core Text Preprocessing Methods
- Key topics:
  - Tokenization
  - Stopwords removal
  - Stemming vs Lemmatization
  - Lowercasing, punctuation removal, handling HTML and emoji
- Python practice (with nltk/spacy):
  - Clean and tokenize financial news with nltk
  - Produce a word cloud of a financial report's most frequent terms
- Deliverables:
  - A clean token list `cleaned_tokens`
  - A financial word cloud showing the most frequent keywords in the news (e.g. fed, market, rate, pause, inflation)
Core Text Preprocessing Methods (NLP Basics)
| Technique | Description | Example / Method |
| --- | --- | --- |
| Tokenization | Split text into individual words/subwords | nltk.word_tokenize() |
| Stopwords Removal | Remove high-frequency words with little semantic content, e.g. "the", "is", "and" | from nltk.corpus import stopwords |
| Lowercasing | Convert to lowercase for consistency | 'Finance' → 'finance' |
| Punctuation Removal | Strip punctuation | re.sub(r'[^\w\s]', '', text) |
| HTML Cleaning | Remove tags such as <p> from web text | from bs4 import BeautifulSoup |
| Emoji Removal | Strip emoji characters | re.sub(emoji_pattern, '', text) |
| Stemming | Chop words down to a stem (e.g. dealing → deal) | from nltk.stem import PorterStemmer |
| Lemmatization | Reduce words to their dictionary form (e.g. better → good, given a POS hint) | from nltk.stem import WordNetLemmatizer |
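The difference between the last two rows is easy to see in code. A minimal sketch using NLTK's `PorterStemmer` (rule-based, no corpus download required); note that stems need not be real dictionary words, which is exactly what lemmatization avoids:

```python
from nltk.stem import PorterStemmer

# PorterStemmer applies suffix-stripping rules only, so the
# output can be a non-word — unlike lemmatization, which maps
# to an actual dictionary entry.
stemmer = PorterStemmer()

for word in ["dealing", "running", "studies"]:
    print(word, "->", stemmer.stem(word))
# dealing -> deal
# running -> run
# studies -> studi  (not a dictionary word)
```

Lemmatizing the same words with `WordNetLemmatizer` requires the `wordnet` corpus download and, for cases like better → good, an explicit part-of-speech hint (`pos='a'`).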
Python Practice: Cleaning + Tokenization (using nltk)
🔧 1. Install the required packages (first run only)

```
pip install nltk wordcloud beautifulsoup4
```
📥 2. Download the nltk resource packages

```python
import nltk

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# Newer NLTK releases may additionally require:
# nltk.download('punkt_tab')  # for word_tokenize
# nltk.download('omw-1.4')    # for WordNetLemmatizer
```
📰 3. Cleaning and tokenizing financial news

```python
import re

from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Sample financial text (replace with real news data)
text = """<p>Stocks rallied today as the Federal Reserve signaled a potential
pause in interest rate hikes. Markets reacted positively to the dovish tone.</p> 🚀💵"""

# Step 1: strip HTML, punctuation, digits, and emoji
soup = BeautifulSoup(text, "html.parser")
text = soup.get_text()
text = re.sub(r'[^\w\s]', '', text)        # remove punctuation
text = re.sub(r'\d+', '', text)            # remove digits
text = text.lower()                        # lowercase
text = re.sub(r'[^\x00-\x7F]+', '', text)  # remove non-ASCII characters (emoji)

# Step 2: tokenize, drop stopwords, lemmatize
tokens = word_tokenize(text)
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
cleaned_tokens = [lemmatizer.lemmatize(w) for w in tokens
                  if w not in stop_words and len(w) > 2]

print("🧹 Clean Tokens:\n", cleaned_tokens)
```
☁️ 4. Plot a word cloud of the most frequent terms (WordCloud)

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Join the cleaned tokens into a single string
clean_text = ' '.join(cleaned_tokens)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400,
                      background_color='white').generate(clean_text)

# Visualize
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("💼 Financial News Word Cloud", fontsize=16)
plt.show()
```
✅ Final result
You will get:
- A clean token list `cleaned_tokens`
- A financial word cloud highlighting the most frequent keywords (e.g. fed, market, rate, pause, inflation)
A Regex Cheat Sheet for NLP (with Financial Use Cases)

| Purpose | Regex | Description | Example |
| --- | --- | --- | --- |
| Remove HTML tags | <.*?> | Delete tags such as <p>, <div> | re.sub(r'<.*?>', '', text) |
| Remove punctuation | [^\w\s] | Keep letters, digits, whitespace | re.sub(r'[^\w\s]', '', text) |
| Remove digits | \d+ | Match all digit runs | re.sub(r'\d+', '', text) |
| Remove non-ASCII | [^\x00-\x7F]+ | Handle emoji, CJK text, special symbols | re.sub(r'[^\x00-\x7F]+', '', text) |
| Keep alphanumeric tokens | \b[A-Za-z0-9]+\b | Extract tokens such as risk123 | re.findall(r'\b[A-Za-z0-9]+\b', text) |
| Detect emails | [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,} | Extract email addresses | re.findall(...) |
| Detect URLs | https?://[^\s]+ | Extract web links | re.findall(...) |
| Extract dates (simplified) | \d{4}-\d{2}-\d{2} | Match dates like 2023-10-01 | re.findall(...) |
| Match specific financial terms | \b(risk\|loan\|default)\b | Flag target keywords such as risk, loan, default | re.findall(r'\b(risk\|loan\|default)\b', text) |
| Collapse whitespace | \s+ | Replace runs of spaces/tabs with one space | re.sub(r'\s+', ' ', text) |
| Keep letters only | [^a-zA-Z\s] | Drop digits and symbols, keep English words | re.sub(r'[^a-zA-Z\s]', '', text) |
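A few rows of the table applied to one sentence, as a quick sanity check (the sentence itself is invented for illustration):

```python
import re

# Hypothetical loan-review sentence, used only for illustration.
text = "Visit https://bank.example.com: the loan shows default risk as of 2023-10-01."

# Match specific financial terms (lowercase the text first)
terms = re.findall(r'\b(risk|loan|default)\b', text.lower())
print(terms)   # ['loan', 'default', 'risk']

# Extract simplified ISO dates
dates = re.findall(r'\d{4}-\d{2}-\d{2}', text)
print(dates)   # ['2023-10-01']

# Detect URLs — note [^\s]+ is greedy, so trailing punctuation
# (the colon here) is captured too; trim it in post-processing.
urls = re.findall(r'https?://[^\s]+', text)
print(urls)
```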
Financial Text Cleaning Function Library

```python
import re

import pandas as pd


class FinancialTextCleaner:
    def clean_html(self, text):
        return re.sub(r'<.*?>', '', text)

    def remove_non_ascii(self, text):
        return re.sub(r'[^\x00-\x7F]+', '', text)

    def remove_digits(self, text):
        return re.sub(r'\d+', '', text)

    def remove_punctuation(self, text):
        return re.sub(r'[^\w\s]', '', text)

    def remove_multiple_spaces(self, text):
        return re.sub(r'\s+', ' ', text).strip()

    def lower_case(self, text):
        return text.lower()

    def clean_text(self, text):
        text = self.clean_html(text)
        text = self.remove_non_ascii(text)
        text = self.remove_digits(text)
        text = self.remove_punctuation(text)
        text = self.remove_multiple_spaces(text)
        return self.lower_case(text)

    def clean_dataframe_column(self, df, column_name, new_column_name=None):
        if new_column_name is None:
            new_column_name = column_name + '_cleaned'
        df[new_column_name] = df[column_name].astype(str).apply(self.clean_text)
        return df

    def extract_keywords(self, text, keywords):
        # r'\b', not r'\\b' — a double backslash in a raw string
        # would match a literal backslash followed by 'b'
        pattern = r'\b(' + '|'.join(map(re.escape, keywords)) + r')\b'
        return re.findall(pattern, text.lower())

    def extract_money_amounts(self, text):
        # Non-capturing groups (?:...) so findall returns the full
        # match instead of a tuple of group fragments
        return re.findall(r'\$\d+(?:\.\d+)?(?:\s?(?:million|billion))?', text.lower())

    def extract_dates(self, text):
        return re.findall(r'\d{4}-\d{2}-\d{2}', text)

    def extract_ratings(self, text):
        return re.findall(r'(AAA|AA\+?|A\+?|BBB\+?|BB\+?|B\+?)', text)


# Example usage:
if __name__ == "__main__":
    cleaner = FinancialTextCleaner()
    sample_text = ("<p>As of 2023-12-31, the Company had $25.3 million "
                   "in credit risk exposure. 📉</p>")
    print("Cleaned Text:", cleaner.clean_text(sample_text))
    print("Keywords:", cleaner.extract_keywords(sample_text, ['credit', 'risk', 'exposure']))
    print("Amounts:", cleaner.extract_money_amounts(sample_text))
    print("Dates:", cleaner.extract_dates(sample_text))
    print("Ratings:", cleaner.extract_ratings("The firm was rated AAA and later downgraded to BBB+"))
```
Text Representation Methods (BoW, TF-IDF, Word Embeddings)
Core Knowledge Summary

| Method | Characteristics | Pros / Cons | Typical Scenarios |
| --- | --- | --- | --- |
| BoW (Bag of Words) | Represents each document as a word-count vector (sparse matrix) | Simple and easy to implement; ignores word order and semantics | Text classification, simple rule engines |
| TF-IDF | Weights term frequency (TF) by inverse document frequency (IDF), down-weighting common uninformative words | Preserves keyword importance; still sparse and non-semantic | News summarization, keyword extraction |
| Word2Vec (CBOW / Skip-Gram) | Maps words to dense vectors that capture semantic relationships. CBOW: predict the target word from its context; Skip-Gram: predict context words from the target | Carries semantic information; supports word-similarity computation; higher training cost | Sentiment analysis, text generation, financial semantic tasks |
In-Depth Comparison

| Dimension | BoW / TF-IDF | Word2Vec |
| --- | --- | --- |
| Vector dimensionality | High | Low (e.g. 100–300) |
| Sparsity | Sparse | Dense |
| Semantic information | None (frequency only) | Yes (captured from context) |
| Suitable models | Linear models, SVM, tree-based models | Deep learning (LSTM, Transformer) |
| Storage cost | High | Low |
Hands-On Task Design (Financial Text)
Apply the following steps to a batch of financial news (or annual report passages) and compare the resulting vector representations:
1. Prepare the text

```python
docs = [
    "The interest rate will rise next quarter due to inflation pressure.",
    "The central bank is expected to keep the interest rate unchanged.",
    "Stock market volatility increased following policy changes.",
]
```
2. BoW representation

```python
from sklearn.feature_extraction.text import CountVectorizer

bow_vectorizer = CountVectorizer()
bow_matrix = bow_vectorizer.fit_transform(docs)

print("BoW Feature Names:\n", bow_vectorizer.get_feature_names_out())
print("BoW Matrix:\n", bow_matrix.toarray())
```
3. TF-IDF representation

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(docs)

print("TF-IDF Feature Names:\n", tfidf_vectorizer.get_feature_names_out())
print("TF-IDF Matrix:\n", tfidf_matrix.toarray())
```
4. Word2Vec representation (average word vectors)

```python
import gensim.downloader as api
import numpy as np
from gensim.utils import simple_preprocess

# Downloads ~1.6 GB on first use; prefer a locally cached model
model = api.load("word2vec-google-news-300")

def get_avg_vector(sentence):
    tokens = simple_preprocess(sentence)
    vectors = [model[word] for word in tokens if word in model]
    return np.mean(vectors, axis=0) if vectors else np.zeros(300)

word2vec_matrix = np.array([get_avg_vector(doc) for doc in docs])
print("Word2Vec Shape:", word2vec_matrix.shape)
```
Questions You Should Be Able to Answer
1. Why are BoW and TF-IDF sparse? What are their limitations?
- Sparse nature:
BoW and TF-IDF create high-dimensional vectors where each dimension represents a unique word in the vocabulary. Since each document only contains a small subset of all words, most elements in the vector are zeros — making them sparse.
- Limitations:
- No semantic understanding: “bank” and “finance” are treated as completely unrelated.
- No contextual information: Word order and co-occurrence are ignored.
- High memory usage: Large vocabulary → high-dimensional vectors.
- Poor performance on deep learning models, which favor dense inputs.
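The sparsity claim is easy to verify with a toy corpus. A minimal pure-Python sketch (no sklearn needed; the documents are illustrative):

```python
# Toy illustration of BoW sparsity.
docs = [
    "the interest rate will rise next quarter",
    "the central bank kept the interest rate unchanged",
    "stock market volatility increased sharply",
]

# Vocabulary = every unique word across the corpus
vocab = sorted({w for d in docs for w in d.split()})

# One count vector per document, one dimension per vocabulary word
bow = [[d.split().count(w) for w in vocab] for d in docs]

zeros = sum(v == 0 for row in bow for v in row)
total = len(docs) * len(vocab)
print(f"{len(vocab)} dims, {zeros}/{total} entries are zero")
# Even with 3 tiny documents, well over half the matrix is zeros;
# with a realistic vocabulary (tens of thousands of words) the
# zero fraction approaches 100%.
```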
2. How does Word2Vec learn word relationships through context?
- Mechanism:
- CBOW (Continuous Bag of Words): Predicts a target word based on its surrounding context words.
- Skip-Gram: Predicts surrounding context words given a target word.
- Learning process:
- Trained using a shallow neural network.
- Adjusts word vectors such that words appearing in similar contexts have similar vector representations.
- Captures semantic similarity: e.g.,
vec("king") - vec("man") + vec("woman") ≈ vec("queen").
3. Why do deep neural networks need dense vector inputs?
- Sparse vectors (like BoW or TF-IDF) are:
- Memory inefficient.
- Contain lots of zero values → make training harder.
- Do not capture semantic or contextual information.
- Dense vectors (like Word2Vec or BERT embeddings) are:
- Low-dimensional and continuous.
- Capture meaning, relationships, and structure.
- Enable gradient-based learning in neural networks to be more efficient and effective.
4. Which is better for small datasets: CBOW or Skip-Gram?
- Skip-Gram is better for small datasets:
- Skip-Gram is more robust with limited data because it generates more training pairs (one per target–context combination) from each sentence.
- It learns better rare word representations.
- CBOW tends to average context, which may lose nuance in small datasets.
5. In financial NLP tasks (e.g., credit scoring, annual report analysis), which representation is more suitable?
| Scenario | Recommended Representation | Reason |
| --- | --- | --- |
| Credit Scoring / Fraud Detection | TF-IDF + Word2Vec hybrid | TF-IDF identifies keywords, Word2Vec captures meaning |
| Annual Report Analysis / Risk Disclosure | Word2Vec or Transformer-based embeddings (BERT) | Better at capturing domain-specific context |
| Sentiment Analysis (e.g., earnings calls) | Word2Vec / FinBERT | Captures tone, sentiment, and subtle nuances |
| Rule-based keyword triggers | TF-IDF / BoW | Simple, interpretable, low-latency logic |
⚠️ For modern NLP applications in finance, dense vector models like Word2Vec or BERT variants (e.g., FinBERT) are typically preferred due to their semantic power and ability to generalize across contexts.
Classic NLP Models
Classic NLP models are the algorithms that were widely used before deep learning, built mainly on statistical learning and rule-based inference. They remain useful today for baselines, feature engineering, and resource-constrained settings.
Application Pipeline
Input text → text cleaning → vector representation (BoW / TF-IDF) → train a classifier (Naive Bayes / SVM / LR) → predict sentiment / topic
| Model | Core Idea | Strengths | Weaknesses | Typical Use Cases |
| --- | --- | --- | --- | --- |
| Bag of Words (BoW) | Word counts, order ignored | Simple and efficient, suits small datasets | Ignores word order and context | Text classification, sentiment analysis |
| TF-IDF | Weighted counts that penalize common words | Balances global and local term frequency | Still a sparse vector | Information retrieval, keyword extraction |
| Naive Bayes | Probabilistic classifier under a conditional-independence assumption | Fast to train, highly interpretable | Independence assumption is unrealistic | Text classification, spam detection |
| Logistic Regression | Learns a linear relationship between words and labels | Stable and robust | Cannot model complex non-linear relationships | Sentiment analysis, opinion monitoring |
| SVM | Finds the optimal decision boundary | Strong on high-dimensional data | Slow on large datasets | Text classification, small-sample tasks |
| Decision Tree / Random Forest | Rule-based tree models | Interpretable; random forests resist overfitting | Handle sparse high-dimensional text features poorly | Label prediction, multi-class tasks |
| LDA (Latent Dirichlet Allocation) | Topic modeling that uncovers latent topics in documents | Good interpretability, suited to text clustering | Strong model assumptions, slow training | News clustering, topic analysis |
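The pipeline above (text → counts → classifier → label) can be sketched end to end with a tiny multinomial Naive Bayes. The training sentences and labels below are invented for illustration:

```python
import math
from collections import Counter

# Invented toy training data
train = [
    ("stocks rallied on strong earnings", "pos"),
    ("profits rose and markets rallied", "pos"),
    ("the bank reported heavy losses", "neg"),
    ("default risk and losses increased", "neg"),
]

# Per-class word counts (the "bag of words" step)
word_counts = {"pos": Counter(), "neg": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def predict(text):
    scores = {}
    for label in word_counts:
        # log prior P(class)
        score = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.split():
            # log likelihood P(word|class) with add-one (Laplace) smoothing
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("markets rallied today"))   # pos
print(predict("rising default losses"))   # neg
```

The conditional-independence assumption shows up directly in the code: each word contributes its log-likelihood independently of the others.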
Sentiment Analysis
Sentiment Analysis (also known as Opinion Mining) is a Natural Language Processing (NLP) technique used to determine whether a piece of text expresses a positive, negative, or neutral sentiment.
Why is it Important?
| Industry | Application |
| --- | --- |
| Finance | Gauge customer sentiment toward credit cards, banking services, annual reports |
| Marketing | Track brand reputation and customer satisfaction |
| Politics | Public opinion monitoring and voter-leaning prediction |
| Customer Service | Automatically triage user feedback (complaints vs praise) |
Key Components
1. Text Preprocessing
Before classification, text is cleaned and normalized:
- Lowercasing
- Removing punctuation / stopwords
- Tokenization
- Lemmatization or Stemming
2. Text Representation
Convert text into numerical features:
- Bag of Words (BoW)
- TF-IDF
- Word2Vec / GloVe
- Transformer Embeddings (BERT)
3. Classification Models
Models are trained on labeled data to predict sentiment:
- Naive Bayes: Fast, good baseline
- Logistic Regression: Interpretable, stable
- SVM: Strong performance with sparse features
- LSTM / RNN: Sequence-sensitive deep models
- BERT: Context-aware transformer model
Sentiment Analysis Pipeline (Example)
Raw Text → Cleaned Text → Feature Vectors → Model → Sentiment Label
Example:
Input: "The credit card support was terrible and slow."
Output: Negative 😡
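A toy lexicon-based scorer in the spirit of VADER reproduces this example. The lexicon and its weights below are invented for illustration; real tools ship lexicons with thousands of scored terms plus rules for negation and intensifiers:

```python
# Invented mini-lexicon: word -> sentiment weight
LEXICON = {
    "great": 2.0, "excellent": 3.0, "helpful": 1.5,
    "terrible": -3.0, "slow": -1.0, "awful": -2.5,
}

def score(text):
    # Sum the weights of known words, ignoring the rest
    return sum(LEXICON.get(w.strip(".,!?"), 0.0)
               for w in text.lower().split())

def label(text):
    s = score(text)
    return "Positive" if s > 0 else "Negative" if s < 0 else "Neutral"

print(label("The credit card support was terrible and slow."))  # Negative
```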
Evaluation Metrics
| Metric | Meaning |
| --- | --- |
| Accuracy | Overall fraction of correct predictions |
| Precision | Fraction of positive predictions that are actually positive |
| Recall | Fraction of actual positives that are found (true positive rate) |
| F1 Score | Harmonic mean of precision and recall |
| ROC-AUC | Area under the ROC curve for binary classifiers |
| Confusion Matrix | TP, FP, TN, FN breakdown |
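The first four metrics follow directly from the confusion-matrix counts. A quick check with invented counts:

```python
# Invented confusion-matrix counts, for illustration only
tp, fp, tn, fn = 40, 10, 35, 15

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)               # of predicted positives, how many are right
recall = tp / (tp + fn)                  # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.75 precision=0.80 recall=0.73 f1=0.76
```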
Financial Applications
| Use Case | Example |
| --- | --- |
| Credit Card Review Classification | Classify user reviews as satisfied / dissatisfied |
| Stock News Sentiment | Judge whether news is bullish or bearish for a stock |
| 10-K Report Sentiment | Quantify the concentration of negative sentiment in risk sections |
| Financial Customer-Service Automation | Emotion detection + suggested automatic responses |
Tools for Sentiment Analysis
| Library | Feature |
| --- | --- |
| NLTK | Basic preprocessing and classifiers |
| TextBlob | Simple sentiment polarity scoring |
| Scikit-learn | Pipeline + classifier integration |
| SpaCy | Efficient tokenization & POS tagging |
| HuggingFace Transformers | Pretrained BERT, RoBERTa, etc. |
| VADER (for social media) | Lexicon-based scoring for short texts |
| FinBERT (for finance) | Fine-tuned BERT on financial news/reports |