
NLP Fundamentals: The Big Picture
- Key topics:
  - What NLP is, and where it is applied (sentiment analysis, named entity recognition, chatbots, regulatory document processing, etc.)
  - How NLP differs from computer vision and tabular ML
  - Traditional NLP vs deep learning NLP (BoW/TF-IDF vs Word2Vec/BERT)
| Dimension | Question | Key Points to Know |
| --- | --- | --- |
| What is NLP? | What is NLP? | Technology that enables computers to understand and generate natural (human) language. |
| Where is NLP used? | What are NLP use cases in finance? | Annual report analysis, sentiment monitoring, regulatory review, compliance text extraction, customer service automation |
| Text preprocessing | Tokenization vs Lemmatization? | Tokenization: splitting text into tokens; Lemmatization: reducing words to their dictionary form |
| Representation | BoW vs TF-IDF vs Word2Vec — the essential difference? | BoW: raw word counts only; TF-IDF: also weights by rarity; Word2Vec: captures semantic relationships |
| Model evolution | Why does the Transformer outperform the LSTM? | Parallel computation + long-range dependencies + no sequential processing bottleneck |
| NLP vs Tabular | How does NLP differ from modeling structured data? | NLP features are raw text that must be preprocessed and vectorized, unlike tables with directly usable numeric features |
| BERT vs GPT | What is the core difference between BERT and GPT? | BERT is a bidirectional encoder; GPT is a unidirectional decoder used for text generation |
Core Text Preprocessing Methods
- Key topics:
  - Tokenization
  - Stopwords removal
  - Stemming vs Lemmatization
  - Lowercasing, punctuation removal, handling HTML and emoji
- Python practice (with nltk/spacy):
  - Clean and tokenize financial news with nltk
  - Produce a word cloud of a financial report's most frequent terms
- Deliverables:
  - A clean token list `cleaned_tokens`
  - A financial word cloud showing the most frequent keywords in the news (e.g. fed, market, rate, pause, inflation)
Core Text Preprocessing Methods (NLP Basics)
| Technique | Description | Example / Method |
| --- | --- | --- |
| Tokenization | Split text into individual words/subwords | nltk.word_tokenize() |
| Stopwords Removal | Remove high-frequency words with little semantic content, e.g. "the", "is", "and" | from nltk.corpus import stopwords |
| Lowercasing | Convert to lowercase for consistency | 'Finance' → 'finance' |
| Punctuation Removal | Strip punctuation | re.sub(r'[^\w\s]', '', text) |
| HTML Cleaning | Remove tags such as <p> from web text | from bs4 import BeautifulSoup |
| Emoji Removal | Strip emoji characters | re.sub(emoji_pattern, '', text) |
| Stemming | Chop words down to a stem (e.g. dealing → deal) | from nltk.stem import PorterStemmer |
| Lemmatization | Reduce words to their dictionary form (e.g. better → good, given a POS hint) | from nltk.stem import WordNetLemmatizer |
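The difference between the last two rows is easy to see in code. A minimal sketch using NLTK's `PorterStemmer` (rule-based, no corpus download required); note that stems need not be real dictionary words, which is exactly what lemmatization avoids:

```python
from nltk.stem import PorterStemmer

# PorterStemmer applies suffix-stripping rules only, so the
# output can be a non-word — unlike lemmatization, which maps
# to an actual dictionary entry.
stemmer = PorterStemmer()

for word in ["dealing", "running", "studies"]:
    print(word, "->", stemmer.stem(word))
# dealing -> deal
# running -> run
# studies -> studi  (not a dictionary word)
```

Lemmatizing the same words with `WordNetLemmatizer` requires the `wordnet` corpus download and, for cases like better → good, an explicit part-of-speech hint (`pos='a'`).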
Python Practice: Cleaning + Tokenization (using nltk)
🔧 1. Install the required packages (first run only)

```
pip install nltk wordcloud beautifulsoup4
```
📥 2. Download the nltk resource packages

```python
import nltk

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# Newer NLTK releases may additionally require:
# nltk.download('punkt_tab')  # for word_tokenize
# nltk.download('omw-1.4')    # for WordNetLemmatizer
```
📰 3. Cleaning and tokenizing financial news

```python
import re

from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Sample financial text (replace with real news data)
text = """<p>Stocks rallied today as the Federal Reserve signaled a potential
pause in interest rate hikes. Markets reacted positively to the dovish tone.</p> 🚀💵"""

# Step 1: strip HTML, punctuation, digits, and emoji
soup = BeautifulSoup(text, "html.parser")
text = soup.get_text()
text = re.sub(r'[^\w\s]', '', text)        # remove punctuation
text = re.sub(r'\d+', '', text)            # remove digits
text = text.lower()                        # lowercase
text = re.sub(r'[^\x00-\x7F]+', '', text)  # remove non-ASCII characters (emoji)

# Step 2: tokenize, drop stopwords, lemmatize
tokens = word_tokenize(text)
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
cleaned_tokens = [lemmatizer.lemmatize(w) for w in tokens
                  if w not in stop_words and len(w) > 2]

print("🧹 Clean Tokens:\n", cleaned_tokens)
```
☁️ 4. Plot a word cloud of the most frequent terms (WordCloud)

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Join the cleaned tokens into a single string
clean_text = ' '.join(cleaned_tokens)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400,
                      background_color='white').generate(clean_text)

# Visualize
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("💼 Financial News Word Cloud", fontsize=16)
plt.show()
```
✅ Final result
You will get:
- A clean token list `cleaned_tokens`
- A financial word cloud highlighting the most frequent keywords (e.g. fed, market, rate, pause, inflation)
A Regex Cheat Sheet for NLP (with Financial Use Cases)

| Purpose | Regex | Description | Example |
| --- | --- | --- | --- |
| Remove HTML tags | <.*?> | Delete tags such as <p>, <div> | re.sub(r'<.*?>', '', text) |
| Remove punctuation | [^\w\s] | Keep letters, digits, whitespace | re.sub(r'[^\w\s]', '', text) |
| Remove digits | \d+ | Match all digit runs | re.sub(r'\d+', '', text) |
| Remove non-ASCII | [^\x00-\x7F]+ | Handle emoji, CJK text, special symbols | re.sub(r'[^\x00-\x7F]+', '', text) |
| Keep alphanumeric tokens | \b[A-Za-z0-9]+\b | Extract tokens such as risk123 | re.findall(r'\b[A-Za-z0-9]+\b', text) |
| Detect emails | [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,} | Extract email addresses | re.findall(...) |
| Detect URLs | https?://[^\s]+ | Extract web links | re.findall(...) |
| Extract dates (simplified) | \d{4}-\d{2}-\d{2} | Match dates like 2023-10-01 | re.findall(...) |
| Match specific financial terms | \b(risk\|loan\|default)\b | Flag target keywords such as risk, loan, default | re.findall(r'\b(risk\|loan\|default)\b', text) |
| Collapse whitespace | \s+ | Replace runs of spaces/tabs with one space | re.sub(r'\s+', ' ', text) |
| Keep letters only | [^a-zA-Z\s] | Drop digits and symbols, keep English words | re.sub(r'[^a-zA-Z\s]', '', text) |
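A few rows of the table applied to one sentence, as a quick sanity check (the sentence itself is invented for illustration):

```python
import re

# Hypothetical loan-review sentence, used only for illustration.
text = "Visit https://bank.example.com: the loan shows default risk as of 2023-10-01."

# Match specific financial terms (lowercase the text first)
terms = re.findall(r'\b(risk|loan|default)\b', text.lower())
print(terms)   # ['loan', 'default', 'risk']

# Extract simplified ISO dates
dates = re.findall(r'\d{4}-\d{2}-\d{2}', text)
print(dates)   # ['2023-10-01']

# Detect URLs — note [^\s]+ is greedy, so trailing punctuation
# (the colon here) is captured too; trim it in post-processing.
urls = re.findall(r'https?://[^\s]+', text)
print(urls)
```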
Financial Text Cleaning Function Library

```python
import re

import pandas as pd


class FinancialTextCleaner:
    def clean_html(self, text):
        return re.sub(r'<.*?>', '', text)

    def remove_non_ascii(self, text):
        return re.sub(r'[^\x00-\x7F]+', '', text)

    def remove_digits(self, text):
        return re.sub(r'\d+', '', text)

    def remove_punctuation(self, text):
        return re.sub(r'[^\w\s]', '', text)

    def remove_multiple_spaces(self, text):
        return re.sub(r'\s+', ' ', text).strip()

    def lower_case(self, text):
        return text.lower()

    def clean_text(self, text):
        text = self.clean_html(text)
        text = self.remove_non_ascii(text)
        text = self.remove_digits(text)
        text = self.remove_punctuation(text)
        text = self.remove_multiple_spaces(text)
        return self.lower_case(text)

    def clean_dataframe_column(self, df, column_name, new_column_name=None):
        if new_column_name is None:
            new_column_name = column_name + '_cleaned'
        df[new_column_name] = df[column_name].astype(str).apply(self.clean_text)
        return df

    def extract_keywords(self, text, keywords):
        # r'\b', not r'\\b' — a double backslash in a raw string
        # would match a literal backslash followed by 'b'
        pattern = r'\b(' + '|'.join(map(re.escape, keywords)) + r')\b'
        return re.findall(pattern, text.lower())

    def extract_money_amounts(self, text):
        # Non-capturing groups (?:...) so findall returns the full
        # match instead of a tuple of group fragments
        return re.findall(r'\$\d+(?:\.\d+)?(?:\s?(?:million|billion))?', text.lower())

    def extract_dates(self, text):
        return re.findall(r'\d{4}-\d{2}-\d{2}', text)

    def extract_ratings(self, text):
        return re.findall(r'(AAA|AA\+?|A\+?|BBB\+?|BB\+?|B\+?)', text)


# Example usage:
if __name__ == "__main__":
    cleaner = FinancialTextCleaner()
    sample_text = ("<p>As of 2023-12-31, the Company had $25.3 million "
                   "in credit risk exposure. 📉</p>")
    print("Cleaned Text:", cleaner.clean_text(sample_text))
    print("Keywords:", cleaner.extract_keywords(sample_text, ['credit', 'risk', 'exposure']))
    print("Amounts:", cleaner.extract_money_amounts(sample_text))
    print("Dates:", cleaner.extract_dates(sample_text))
    print("Ratings:", cleaner.extract_ratings("The firm was rated AAA and later downgraded to BBB+"))
```
Text Representation Methods (BoW, TF-IDF, Word Embeddings)
Core Knowledge Summary

| Method | Characteristics | Pros / Cons | Typical Scenarios |
| --- | --- | --- | --- |
| BoW (Bag of Words) | Represents each document as a word-count vector (sparse matrix) | Simple and easy to implement; ignores word order and semantics | Text classification, simple rule engines |
| TF-IDF | Weights term frequency (TF) by inverse document frequency (IDF), down-weighting common uninformative words | Preserves keyword importance; still sparse and non-semantic | News summarization, keyword extraction |
| Word2Vec (CBOW / Skip-Gram) | Maps words to dense vectors that capture semantic relationships. CBOW: predict the target word from its context; Skip-Gram: predict context words from the target | Carries semantic information; supports word-similarity computation; higher training cost | Sentiment analysis, text generation, financial semantic tasks |
In-Depth Comparison

| Dimension | BoW / TF-IDF | Word2Vec |
| --- | --- | --- |
| Vector dimensionality | High | Low (e.g. 100–300) |
| Sparsity | Sparse | Dense |
| Semantic information | None (frequency only) | Yes (captured from context) |
| Suitable models | Linear models, SVM, tree-based models | Deep learning (LSTM, Transformer) |
| Storage cost | High | Low |
Hands-On Task Design (Financial Text)
Apply the following steps to a batch of financial news (or annual report passages) and compare the resulting vector representations:
1. Prepare the text

```python
docs = [
    "The interest rate will rise next quarter due to inflation pressure.",
    "The central bank is expected to keep the interest rate unchanged.",
    "Stock market volatility increased following policy changes.",
]
```
2. BoW representation

```python
from sklearn.feature_extraction.text import CountVectorizer

bow_vectorizer = CountVectorizer()
bow_matrix = bow_vectorizer.fit_transform(docs)

print("BoW Feature Names:\n", bow_vectorizer.get_feature_names_out())
print("BoW Matrix:\n", bow_matrix.toarray())
```
3. TF-IDF representation

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(docs)

print("TF-IDF Feature Names:\n", tfidf_vectorizer.get_feature_names_out())
print("TF-IDF Matrix:\n", tfidf_matrix.toarray())
```
4. Word2Vec representation (average word vectors)

```python
import gensim.downloader as api
import numpy as np
from gensim.utils import simple_preprocess

# Downloads ~1.6 GB on first use; prefer a locally cached model
model = api.load("word2vec-google-news-300")

def get_avg_vector(sentence):
    tokens = simple_preprocess(sentence)
    vectors = [model[word] for word in tokens if word in model]
    return np.mean(vectors, axis=0) if vectors else np.zeros(300)

word2vec_matrix = np.array([get_avg_vector(doc) for doc in docs])
print("Word2Vec Shape:", word2vec_matrix.shape)
```
Questions You Should Be Able to Answer
1. Why are BoW and TF-IDF sparse? What are their limitations?
- Sparse nature:
BoW and TF-IDF create high-dimensional vectors where each dimension represents a unique word in the vocabulary. Since each document only contains a small subset of all words, most elements in the vector are zeros — making them sparse.
- Limitations:
- No semantic understanding: “bank” and “finance” are treated as completely unrelated.
- No contextual information: Word order and co-occurrence are ignored.
- High memory usage: Large vocabulary → high-dimensional vectors.
- Poor performance on deep learning models, which favor dense inputs.
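The sparsity claim is easy to verify with a toy corpus. A minimal pure-Python sketch (no sklearn needed; the documents are illustrative):

```python
# Toy illustration of BoW sparsity.
docs = [
    "the interest rate will rise next quarter",
    "the central bank kept the interest rate unchanged",
    "stock market volatility increased sharply",
]

# Vocabulary = every unique word across the corpus
vocab = sorted({w for d in docs for w in d.split()})

# One count vector per document, one dimension per vocabulary word
bow = [[d.split().count(w) for w in vocab] for d in docs]

zeros = sum(v == 0 for row in bow for v in row)
total = len(docs) * len(vocab)
print(f"{len(vocab)} dims, {zeros}/{total} entries are zero")
# Even with 3 tiny documents, well over half the matrix is zeros;
# with a realistic vocabulary (tens of thousands of words) the
# zero fraction approaches 100%.
```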
2. How does Word2Vec learn word relationships through context?
- Mechanism:
- CBOW (Continuous Bag of Words): Predicts a target word based on its surrounding context words.
- Skip-Gram: Predicts surrounding context words given a target word.
- Learning process:
- Trained using a shallow neural network.
- Adjusts word vectors such that words appearing in similar contexts have similar vector representations.
- Captures semantic similarity: e.g.,
vec("king") - vec("man") + vec("woman") ≈ vec("queen").
3. Why do deep neural networks need dense vector inputs?
- Sparse vectors (like BoW or TF-IDF) are:
- Memory inefficient.
- Contain lots of zero values → make training harder.
- Do not capture semantic or contextual information.
- Dense vectors (like Word2Vec or BERT embeddings) are:
- Low-dimensional and continuous.
- Capture meaning, relationships, and structure.
- Enable gradient-based learning in neural networks to be more efficient and effective.
4. Which is better for small datasets: CBOW or Skip-Gram?
- Skip-Gram is better for small datasets:
- Skip-Gram is more robust with limited data because it generates more training pairs (one per target–context combination) from each sentence.
- It learns better rare word representations.
- CBOW tends to average context, which may lose nuance in small datasets.
5. In financial NLP tasks (e.g., credit scoring, annual report analysis), which representation is more suitable?
| Scenario | Recommended Representation | Reason |
| --- | --- | --- |
| Credit Scoring / Fraud Detection | TF-IDF + Word2Vec hybrid | TF-IDF identifies keywords, Word2Vec captures meaning |
| Annual Report Analysis / Risk Disclosure | Word2Vec or Transformer-based embeddings (BERT) | Better at capturing domain-specific context |
| Sentiment Analysis (e.g., earnings calls) | Word2Vec / FinBERT | Captures tone, sentiment, and subtle nuances |
| Rule-based keyword triggers | TF-IDF / BoW | Simple, interpretable, low-latency logic |
⚠️ For modern NLP applications in finance, dense vector models like Word2Vec or BERT variants (e.g., FinBERT) are typically preferred due to their semantic power and ability to generalize across contexts.
Classic NLP Models
Classic NLP models are the algorithms that were widely used before deep learning, built mainly on statistical learning and rule-based inference. They remain useful today for baselines, feature engineering, and resource-constrained settings.
Application Pipeline
Input text → text cleaning → vector representation (BoW / TF-IDF) → train a classifier (Naive Bayes / SVM / LR) → predict sentiment / topic
| Model | Core Idea | Strengths | Weaknesses | Typical Use Cases |
| --- | --- | --- | --- | --- |
| Bag of Words (BoW) | Word counts, order ignored | Simple and efficient, suits small datasets | Ignores word order and context | Text classification, sentiment analysis |
| TF-IDF | Weighted counts that penalize common words | Balances global and local term frequency | Still a sparse vector | Information retrieval, keyword extraction |
| Naive Bayes | Probabilistic classifier under a conditional-independence assumption | Fast to train, highly interpretable | Independence assumption is unrealistic | Text classification, spam detection |
| Logistic Regression | Learns a linear relationship between words and labels | Stable and robust | Cannot model complex non-linear relationships | Sentiment analysis, opinion monitoring |
| SVM | Finds the optimal decision boundary | Strong on high-dimensional data | Slow on large datasets | Text classification, small-sample tasks |
| Decision Tree / Random Forest | Rule-based tree models | Interpretable; random forests resist overfitting | Handle sparse high-dimensional text features poorly | Label prediction, multi-class tasks |
| LDA (Latent Dirichlet Allocation) | Topic modeling that uncovers latent topics in documents | Good interpretability, suited to text clustering | Strong model assumptions, slow training | News clustering, topic analysis |
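The pipeline above (text → counts → classifier → label) can be sketched end to end with a tiny multinomial Naive Bayes. The training sentences and labels below are invented for illustration:

```python
import math
from collections import Counter

# Invented toy training data
train = [
    ("stocks rallied on strong earnings", "pos"),
    ("profits rose and markets rallied", "pos"),
    ("the bank reported heavy losses", "neg"),
    ("default risk and losses increased", "neg"),
]

# Per-class word counts (the "bag of words" step)
word_counts = {"pos": Counter(), "neg": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def predict(text):
    scores = {}
    for label in word_counts:
        # log prior P(class)
        score = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.split():
            # log likelihood P(word|class) with add-one (Laplace) smoothing
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("markets rallied today"))   # pos
print(predict("rising default losses"))   # neg
```

The conditional-independence assumption shows up directly in the code: each word contributes its log-likelihood independently of the others.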
Sentiment Analysis
Sentiment Analysis (also known as Opinion Mining) is a Natural Language Processing (NLP) technique used to determine whether a piece of text expresses a positive, negative, or neutral sentiment.
Why is it Important?
| Industry | Application |
| --- | --- |
| Finance | Gauge customer sentiment toward credit cards, banking services, annual reports |
| Marketing | Track brand reputation and customer satisfaction |
| Politics | Public opinion monitoring and voter-leaning prediction |
| Customer Service | Automatically triage user feedback (complaints vs praise) |
Key Components
1. Text Preprocessing
Before classification, text is cleaned and normalized:
- Lowercasing
- Removing punctuation / stopwords
- Tokenization
- Lemmatization or Stemming
2. Text Representation
Convert text into numerical features:
- Bag of Words (BoW)
- TF-IDF
- Word2Vec / GloVe
- Transformer Embeddings (BERT)
3. Classification Models
Models are trained on labeled data to predict sentiment:
- Naive Bayes: Fast, good baseline
- Logistic Regression: Interpretable, stable
- SVM: Strong performance with sparse features
- LSTM / RNN: Sequence-sensitive deep models
- BERT: Context-aware transformer model
Sentiment Analysis Pipeline (Example)
Raw Text → Cleaned Text → Feature Vectors → Model → Sentiment Label
Example:
Input: "The credit card support was terrible and slow."
Output: Negative 😡
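A toy lexicon-based scorer in the spirit of VADER reproduces this example. The lexicon and its weights below are invented for illustration; real tools ship lexicons with thousands of scored terms plus rules for negation and intensifiers:

```python
# Invented mini-lexicon: word -> sentiment weight
LEXICON = {
    "great": 2.0, "excellent": 3.0, "helpful": 1.5,
    "terrible": -3.0, "slow": -1.0, "awful": -2.5,
}

def score(text):
    # Sum the weights of known words, ignoring the rest
    return sum(LEXICON.get(w.strip(".,!?"), 0.0)
               for w in text.lower().split())

def label(text):
    s = score(text)
    return "Positive" if s > 0 else "Negative" if s < 0 else "Neutral"

print(label("The credit card support was terrible and slow."))  # Negative
```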
Evaluation Metrics
| Metric | Meaning |
| --- | --- |
| Accuracy | Overall fraction of correct predictions |
| Precision | Fraction of positive predictions that are actually positive |
| Recall | Fraction of actual positives that are found (true positive rate) |
| F1 Score | Harmonic mean of precision and recall |
| ROC-AUC | Area under the ROC curve for binary classifiers |
| Confusion Matrix | TP, FP, TN, FN breakdown |
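The first four metrics follow directly from the confusion-matrix counts. A quick check with invented counts:

```python
# Invented confusion-matrix counts, for illustration only
tp, fp, tn, fn = 40, 10, 35, 15

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)               # of predicted positives, how many are right
recall = tp / (tp + fn)                  # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.75 precision=0.80 recall=0.73 f1=0.76
```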
Financial Applications
| Use Case | Example |
| --- | --- |
| Credit Card Review Classification | Classify user reviews as satisfied / dissatisfied |
| Stock News Sentiment | Judge whether news is bullish or bearish for a stock |
| 10-K Report Sentiment | Quantify the concentration of negative sentiment in risk sections |
| Financial Customer-Service Automation | Emotion detection + suggested automatic responses |
Tools for Sentiment Analysis
| Library | Feature |
| --- | --- |
| NLTK | Basic preprocessing and classifiers |
| TextBlob | Simple sentiment polarity scoring |
| Scikit-learn | Pipeline + classifier integration |
| SpaCy | Efficient tokenization & POS tagging |
| HuggingFace Transformers | Pretrained BERT, RoBERTa, etc. |
| VADER (for social media) | Lexicon-based scoring for short texts |
| FinBERT (for finance) | Fine-tuned BERT on financial news/reports |