NLP


Overview of Core NLP Concepts

  • Key topics:
    • What NLP is, and its application scenarios (sentiment analysis, named entity recognition, chatbots, regulatory document processing, etc.)
    • NLP vs computer vision vs tabular ML
    • Traditional NLP vs deep-learning NLP (BoW/TF-IDF vs Word2Vec/BERT)
| Dimension | Question | Key points to master |
| --- | --- | --- |
| What NLP is | What is NLP? | Technology that lets computers understand and generate natural (human) language. |
| Where NLP is used | What are NLP use cases in finance? | Annual-report analysis, opinion monitoring, regulatory review, compliance text extraction, customer-service automation |
| Text preprocessing | What is the difference between tokenization and lemmatization? | Tokenization splits text into tokens; lemmatization reduces words to their dictionary form |
| Representation methods | What is the essential difference between BoW, TF-IDF, and Word2Vec? | BoW looks only at word frequency; TF-IDF also accounts for rarity; Word2Vec introduces semantic relationships |
| Model evolution | Why do Transformers outperform LSTMs? | Parallel computation + long-range dependencies + no sequential-processing constraint |
| NLP vs tabular | How does NLP differ from modeling structured data? | NLP features are text and need preprocessing and vectorization, unlike tables whose values can be used numerically as-is |
| BERT vs GPT | What is the core difference between BERT and GPT? | BERT is a bidirectional encoder; GPT is a unidirectional decoder used for text generation |

Core Text-Preprocessing Methods

  • Key topics:
    • Tokenization
    • Stopword removal
    • Lemmatization vs Stemming
    • Lowercasing, punctuation removal, HTML cleaning, emoji handling
  • Hands-on Python (using nltk/spacy)
    • Clean and tokenize financial news with nltk
    • Generate a word cloud of high-frequency terms from a financial report

    • Core text-preprocessing techniques (NLP basics)

      | Technique | Description | Example / Method |
      | --- | --- | --- |
      | Tokenization | Split text into individual words/subwords | `nltk.word_tokenize()` |
      | Stopwords Removal | Remove high-frequency words with little semantic content, such as "the", "is", "and" | `from nltk.corpus import stopwords` |
      | Lowercasing | Convert to lowercase for a uniform format | 'Finance' → 'finance' |
      | Punctuation Removal | Strip punctuation | `re.sub(r'[^\w\s]', '', text)` |
      | HTML Cleaning | Remove tags such as `<p>` from web text | `from bs4 import BeautifulSoup` |
      | Emoji Removal | Strip emoji characters | `re.sub(emoji_pattern, '', text)` |
      | Stemming | Reduce words to their stem (e.g. dealing → deal) | `from nltk.stem import PorterStemmer` |
      | Lemmatization | Reduce words to their dictionary form (e.g. better → good) | `from nltk.stem import WordNetLemmatizer` |
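The techniques above chain naturally into a single cleaning function. A minimal pure-Python sketch (the tiny stopword list here is illustrative only; real pipelines load `nltk.corpus.stopwords`):

```python
import re

# Illustrative stopword list; real pipelines use nltk.corpus.stopwords
STOPWORDS = {"the", "is", "and", "a", "of", "to", "in", "was"}

def preprocess(text):
    text = re.sub(r'<.*?>', '', text)    # strip HTML tags
    text = text.lower()                  # lowercase
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    tokens = text.split()                # naive whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("<p>The Fed is raising rates, and markets fell.</p>"))
# ['fed', 'raising', 'rates', 'markets', 'fell']
```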

      Hands-on Python: cleaning + tokenization (using nltk)

      🔧 1. Install required packages (first run only)

      pip install nltk wordcloud beautifulsoup4

      📥 2. Download nltk resource packages

      import nltk
      nltk.download('punkt')
      nltk.download('stopwords')
      nltk.download('wordnet')

      📰 3. Cleaning and tokenizing financial news

      import re
      from nltk.corpus import stopwords
      from nltk.tokenize import word_tokenize
      from nltk.stem import WordNetLemmatizer
      from bs4 import BeautifulSoup

      # Sample financial text (replace with real news data)
      text = """<p>Stocks rallied today as the Federal Reserve signaled a potential pause
      in interest rate hikes. Markets reacted positively to the dovish tone.</p> 🚀💵"""

      # Step 1: remove HTML and emoji
      soup = BeautifulSoup(text, "html.parser")
      text = soup.get_text()
      text = re.sub(r'[^\w\s]', '', text)        # remove punctuation
      text = re.sub(r'\d+', '', text)            # remove digits
      text = text.lower()                        # lowercase
      text = re.sub(r'[^\x00-\x7F]+', '', text)  # remove non-ASCII characters (emoji)

      # Step 2: tokenize + remove stopwords + lemmatize
      tokens = word_tokenize(text)
      stop_words = set(stopwords.words('english'))
      lemmatizer = WordNetLemmatizer()
      cleaned_tokens = [lemmatizer.lemmatize(w) for w in tokens
                        if w not in stop_words and len(w) > 2]

      print("🧹 Clean Tokens:\n", cleaned_tokens)

      ☁️ 4. Generate a word-frequency cloud (WordCloud)

      from wordcloud import WordCloud
      import matplotlib.pyplot as plt

      # Join all tokens into one string
      clean_text = ' '.join(cleaned_tokens)

      # Generate the word cloud
      wordcloud = WordCloud(width=800, height=400, background_color='white').generate(clean_text)

      # Visualize
      plt.figure(figsize=(10, 5))
      plt.imshow(wordcloud, interpolation='bilinear')
      plt.axis('off')
      plt.title("💼 Financial News Word Cloud", fontsize=16)
      plt.show()

      ✅ Final result

      You end up with:
      • A clean token list, cleaned_tokens
      • A financial word-cloud image showing the most frequent keywords in the news (e.g. fed, market, rate, pause, inflation)

      Common Regular Expressions for NLP (with financial use cases)

      | Purpose | Regex | Description | Example |
      | --- | --- | --- | --- |
      | Remove HTML tags | `<.*?>` | Deletes tags such as `<p>`, `<div>` | `re.sub(r'<.*?>', '', text)` |
      | Remove punctuation | `[^\w\s]` | Keeps letters, digits, and whitespace | `re.sub(r'[^\w\s]', '', text)` |
      | Remove digits | `\d+` | Matches all digits | `re.sub(r'\d+', '', text)` |
      | Remove non-ASCII characters | `[^\x00-\x7F]+` | Handles emoji, Chinese, special symbols | `re.sub(r'[^\x00-\x7F]+', '', text)` |
      | Keep alphanumeric tokens | `\b[A-Za-z0-9]+\b` | Extracts words such as risk123 | `re.findall(r'\b[A-Za-z0-9]+\b', text)` |
      | Detect emails | `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}` | Extracts email addresses | `re.findall(...)` |
      | Detect URLs | `https?://[^\s]+` | Extracts web links | `re.findall(...)` |
      | Extract dates (simplified) | `\d{4}-\d{2}-\d{2}` | Matches e.g. 2023-10-01 | `re.findall(...)` |
      | Match specific finance terms | `\b(risk\|loan\|default)\b` | Matches domain keywords such as risk, loan, default | `re.findall(...)` |
      | Collapse multiple spaces | `\s+` | Replaces runs of spaces/tabs with one space | `re.sub(r'\s+', ' ', text)` |
      | Keep letters only | `[^a-zA-Z\s]` | Removes digits and symbols, keeping only English words | `re.sub(r'[^a-zA-Z\s]', '', text)` |
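As a quick sanity check, several patterns from the table can be applied to a made-up financial sentence (the sample text and variable names are purely illustrative):

```python
import re

text = "Contact ir@bank.com by 2023-10-01. Details: https://bank.com/10k  📉"

emails = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', text)
dates  = re.findall(r'\d{4}-\d{2}-\d{2}', text)
urls   = re.findall(r'https?://[^\s]+', text)

# Strip non-ASCII (emoji), then collapse runs of whitespace
ascii_only = re.sub(r'[^\x00-\x7F]+', '', text)
one_space  = re.sub(r'\s+', ' ', ascii_only).strip()

print(emails)  # ['ir@bank.com']
print(dates)   # ['2023-10-01']
print(urls)    # ['https://bank.com/10k']
```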

      A Financial-Text Cleaning Utility Library

      import re
      import pandas as pd

      class FinancialTextCleaner:
          def clean_html(self, text):
              return re.sub(r'<.*?>', '', text)

          def remove_non_ascii(self, text):
              return re.sub(r'[^\x00-\x7F]+', '', text)

          def remove_digits(self, text):
              return re.sub(r'\d+', '', text)

          def remove_punctuation(self, text):
              return re.sub(r'[^\w\s]', '', text)

          def remove_multiple_spaces(self, text):
              return re.sub(r'\s+', ' ', text).strip()

          def lower_case(self, text):
              return text.lower()

          def clean_text(self, text):
              text = self.clean_html(text)
              text = self.remove_non_ascii(text)
              text = self.remove_digits(text)
              text = self.remove_punctuation(text)
              text = self.remove_multiple_spaces(text)
              return self.lower_case(text)

          def clean_dataframe_column(self, df, column_name, new_column_name=None):
              if new_column_name is None:
                  new_column_name = column_name + '_cleaned'
              df[new_column_name] = df[column_name].astype(str).apply(self.clean_text)
              return df

          def extract_keywords(self, text, keywords):
              pattern = r'\b(' + '|'.join(map(re.escape, keywords)) + r')\b'
              return re.findall(pattern, text.lower())

          def extract_money_amounts(self, text):
              # Non-capturing groups so findall returns whole matches, not group tuples
              return re.findall(r'\$\d+(?:\.\d+)?\s?(?:million|billion)?', text.lower())

          def extract_dates(self, text):
              return re.findall(r'\d{4}-\d{2}-\d{2}', text)

          def extract_ratings(self, text):
              return re.findall(r'(AAA|AA\+?|A\+?|BBB\+?|BB\+?|B\+?)', text)

      # Example usage:
      if __name__ == "__main__":
          cleaner = FinancialTextCleaner()
          sample_text = "<p>As of 2023-12-31, the Company had $25.3 million in credit risk exposure. 📉</p>"
          print("Cleaned Text:", cleaner.clean_text(sample_text))
          print("Keywords:", cleaner.extract_keywords(sample_text, ['credit', 'risk', 'exposure']))
          print("Amounts:", cleaner.extract_money_amounts(sample_text))
          print("Dates:", cleaner.extract_dates(sample_text))
          print("Ratings:", cleaner.extract_ratings("The firm was rated AAA and later downgraded to BBB+"))
 

Text Representation Methods (BoW, TF-IDF, Word Embeddings)

Core Knowledge Summary

| Representation | Characteristics | Pros / Cons | Typical scenarios |
| --- | --- | --- | --- |
| BoW (Bag of Words) | Represents each text as a word-count vector (sparse matrix) | Simple and easy to implement; ignores word order and semantics | Text classification, simple rule engines |
| TF-IDF | Weights term frequency (TF) by inverse document frequency (IDF), down-weighting frequent but uninformative words | Preserves keyword importance; still sparse and semantics-free | News summarization, keyword extraction |
| Word2Vec (CBOW / Skip-Gram) | Maps words to dense vectors that capture semantic relations; CBOW predicts the current word, Skip-Gram predicts the context words | Carries semantic information; supports word-similarity computation; higher training cost | Sentiment analysis, text generation, financial semantic recognition |

Deeper Comparison

| Comparison dimension | BoW / TF-IDF | Word2Vec |
| --- | --- | --- |
| Vector dimensionality | High | Low (e.g. 100-300) |
| Sparsity | Sparse | Dense |
| Semantic information | None (frequency only) | Yes (captured from context) |
| Suitable models | Linear models, SVM, tree-based models | Deep learning (LSTM, Transformer) |
| Storage cost | High (vocabulary-sized vectors) | Low |

Hands-on Task Design (Financial Text)

Apply the following steps to a batch of financial news articles (or annual-report paragraphs) and compare the resulting vector representations:

1. Prepare the texts

docs = [
    "The interest rate will rise next quarter due to inflation pressure.",
    "The central bank is expected to keep the interest rate unchanged.",
    "Stock market volatility increased following policy changes."
]

2. BoW representation

from sklearn.feature_extraction.text import CountVectorizer

bow_vectorizer = CountVectorizer()
bow_matrix = bow_vectorizer.fit_transform(docs)
print("BoW Feature Names:\n", bow_vectorizer.get_feature_names_out())
print("BoW Matrix:\n", bow_matrix.toarray())

3. TF-IDF representation

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(docs)
print("TF-IDF Feature Names:\n", tfidf_vectorizer.get_feature_names_out())
print("TF-IDF Matrix:\n", tfidf_matrix.toarray())

4. Word2Vec representation (averaged word vectors)

import gensim.downloader as api
from gensim.utils import simple_preprocess
import numpy as np

model = api.load("word2vec-google-news-300")  # requires internet unless the model is cached locally

def get_avg_vector(sentence):
    tokens = simple_preprocess(sentence)
    vectors = [model[word] for word in tokens if word in model]
    return np.mean(vectors, axis=0) if vectors else np.zeros(300)

word2vec_matrix = np.array([get_avg_vector(doc) for doc in docs])
print("Word2Vec Shape:", word2vec_matrix.shape)

Key Questions You Should Master

1. Why are BoW and TF-IDF sparse? What are their limitations?

  • Sparse nature:
    • BoW and TF-IDF create high-dimensional vectors where each dimension represents a unique word in the vocabulary. Since each document only contains a small subset of all words, most elements in the vector are zeros — making them sparse.
  • Limitations:
    • No semantic understanding: “bank” and “finance” are treated as completely unrelated.
    • No contextual information: Word order and co-occurrence are ignored.
    • High memory usage: Large vocabulary → high-dimensional vectors.
    • Poor performance on deep learning models, which favor dense inputs.
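To make the sparsity concrete, a from-scratch BoW over three short documents (a toy corpus for illustration) shows that most entries are already zero:

```python
from collections import Counter

docs = [
    "the interest rate will rise next quarter",
    "the central bank kept the interest rate unchanged",
    "stock market volatility increased after policy changes",
]

# Build the shared vocabulary, then one count vector per document
vocab = sorted({w for d in docs for w in d.split()})
counts = [Counter(d.split()) for d in docs]
vectors = [[c.get(w, 0) for w in vocab] for c in counts]

zeros = sum(v.count(0) for v in vectors)
total = len(vocab) * len(docs)
print(f"vocabulary size: {len(vocab)}")                      # 18
print(f"zero entries: {zeros}/{total} ({zeros/total:.0%})")  # 33/54 (61%)
```

Even with only three sentences, over 60% of the matrix is zeros; with a realistic vocabulary of tens of thousands of words, the fraction approaches 100%.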

2. How does Word2Vec learn word relationships through context?

  • Mechanism:
    • CBOW (Continuous Bag of Words): Predicts a target word based on its surrounding context words.
    • Skip-Gram: Predicts surrounding context words given a target word.
  • Learning process:
    • Trained using a shallow neural network.
    • Adjusts word vectors such that words appearing in similar contexts have similar vector representations.
    • Captures semantic similarity: e.g., vec("king") - vec("man") + vec("woman") ≈ vec("queen").
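The king/queen arithmetic can be sketched with hand-crafted toy vectors; real Word2Vec embeddings are 100-300-dimensional and learned from data, but the cosine-similarity mechanics are the same:

```python
import math

# Toy hand-crafted 3-d "embeddings" (invented for illustration only;
# real Word2Vec vectors are learned, not designed by hand)
vec = {
    "king":  [0.9,  0.9, 0.0],
    "queen": [0.9, -0.9, 0.0],
    "man":   [0.1,  0.9, 0.0],
    "woman": [0.1, -0.9, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# king - man + woman ≈ queen
result = [k - m + w for k, m, w in zip(vec["king"], vec["man"], vec["woman"])]
print(cosine(result, vec["queen"]))  # close to 1.0
```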

3. Why do deep neural networks need dense vector inputs?

  • Sparse vectors (like BoW or TF-IDF) are:
    • Memory inefficient.
    • Contain lots of zero values → make training harder.
    • Do not capture semantic or contextual information.
  • Dense vectors (like Word2Vec or BERT embeddings) are:
    • Low-dimensional and continuous.
    • Capture meaning, relationships, and structure.
    • Enable gradient-based learning in neural networks to be more efficient and effective.

4. Which is better for small datasets: CBOW or Skip-Gram?

  • Skip-Gram is better for small datasets:
    • Skip-Gram is more robust with fewer data because it samples more context pairs from each sentence.
    • It learns better rare word representations.
    • CBOW tends to average context, which may lose nuance in small datasets.
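The "samples more context pairs" claim is easy to see by generating Skip-Gram training pairs explicitly (a minimal sketch; the window size of 2 is an arbitrary choice):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs as Skip-Gram does."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the interest rate will rise".split()
pairs = skipgram_pairs(tokens, window=2)
print(len(pairs))   # 14 pairs from a single 5-word sentence
print(pairs[:4])    # [('the', 'interest'), ('the', 'rate'), ('interest', 'the'), ('interest', 'rate')]
```

Each sentence thus yields many (target, context) examples, which is why Skip-Gram squeezes more training signal out of a small corpus.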

5. In financial NLP tasks (e.g., credit scoring, annual report analysis), which representation is more suitable?

| Scenario | Recommended Representation | Reason |
| --- | --- | --- |
| Credit Scoring / Fraud Detection | TF-IDF + Word2Vec hybrid | TF-IDF identifies keywords, Word2Vec captures meaning |
| Annual Report Analysis / Risk Disclosure | Word2Vec or Transformer-based embeddings (BERT) | Better at capturing domain-specific context |
| Sentiment Analysis (e.g., earnings calls) | Word2Vec / FinBERT | Captures tone, sentiment, and subtle nuances |
| Rule-based keyword triggers | TF-IDF / BoW | Simple, interpretable, low-latency logic |
⚠️ For modern NLP applications in finance, dense vector models like Word2Vec or BERT variants (e.g., FinBERT) are typically preferred due to their semantic power and ability to generalize across contexts.
 

Classic NLP Models

Classic NLP models are the natural-language-processing algorithms that were widely used before deep learning, based mainly on statistical learning and rule-based inference. They are still used today for baseline experiments, feature-engineering tasks, and resource-constrained settings.

Application Pipeline

Input text → text cleaning → vector representation (BoW / TF-IDF) → train a classifier (Naive Bayes / SVM / LR) → predict sentiment / topic
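This pipeline maps directly onto scikit-learn's Pipeline. A minimal sketch with made-up labeled headlines (toy data, far too small for a real model):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny illustrative training set (real tasks need far more labeled data)
texts = [
    "stocks rallied on strong earnings",
    "record profits lifted the market",
    "shares plunged after the default warning",
    "the bank reported heavy losses",
]
labels = ["positive", "positive", "negative", "negative"]

# Vectorize (TF-IDF) then classify (Naive Bayes) in one object
clf = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
clf.fit(texts, labels)

print(clf.predict(["earnings were strong and stocks rallied"])[0])  # positive
```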
| Model | Core Idea | Strengths | Weaknesses | Typical Use Cases |
| --- | --- | --- | --- | --- |
| Bag of Words (BoW) | Counts word frequencies, ignores order | Simple and efficient; suits small datasets | Ignores word order and context | Text classification, sentiment analysis |
| TF-IDF | Weighted word frequencies that penalize common words | Balances global and local term frequency | Still a sparse vector | Information retrieval, keyword extraction |
| Naive Bayes | Probabilistic classifier under a conditional-independence assumption | Fast to train, highly interpretable | Independence assumption is unrealistic | Text classification, spam detection |
| Logistic Regression | Learns a linear relation between words and labels | Stable and robust | Cannot model complex non-linear relations | Sentiment analysis, opinion forecasting |
| SVM (Support Vector Machine) | Finds the optimal decision boundary | Performs well on high-dimensional data | Slow on large datasets | Text classification, small-sample tasks |
| Decision Tree / Random Forest | Rule-based tree models | Interpretable; (forests) resist overfitting | Handle sparse data poorly | Label prediction, multi-class tasks |
| LDA (Latent Dirichlet Allocation) | Topic modeling that discovers latent topics in documents | Interpretable; well suited to text clustering | Strong model assumptions, slow to train | News clustering, opinion topic analysis |

Sentiment Analysis

Sentiment Analysis (also known as Opinion Mining) is a Natural Language Processing (NLP) technique used to determine whether a piece of text expresses a positive, negative, or neutral sentiment.

Why is it Important?

| Industry | Application |
| --- | --- |
| Finance | Gauge user sentiment toward credit cards, banking services, annual reports |
| Marketing | Track brand reputation, analyze user satisfaction |
| Politics | Opinion monitoring and voter-leaning prediction |
| Customer Service | Automatically classify user feedback (complaints vs praise) |

Key Components

1. Text Preprocessing

Before classification, text is cleaned and normalized:
  • Lowercasing
  • Removing punctuation / stopwords
  • Tokenization
  • Lemmatization or Stemming

2. Text Representation

Convert text into numerical features:
  • Bag of Words (BoW)
  • TF-IDF
  • Word2Vec / GloVe
  • Transformer Embeddings (BERT)

3. Classification Models

Models are trained on labeled data to predict sentiment:
  • Naive Bayes: Fast, good baseline
  • Logistic Regression: Interpretable, stable
  • SVM: Strong performance with sparse features
  • LSTM / RNN: Sequence-sensitive deep models
  • BERT: Context-aware transformer model

Sentiment Analysis Pipeline (Example)

Raw Text → Cleaned Text → Feature Vectors → Model → Sentiment Label
Example:
Input: "The credit card support was terrible and slow."
Output: Negative 😡
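A lexicon-based scorer makes the idea concrete in a few lines (the word lists here are toy examples; production systems use VADER, FinBERT, or a trained classifier):

```python
# Toy sentiment lexicons, for illustration only
POSITIVE = {"great", "fast", "helpful", "rallied", "strong"}
NEGATIVE = {"terrible", "slow", "bad", "plunged", "weak"}

def sentiment(text):
    # Normalize, tokenize, then count positive vs negative hits
    tokens = text.lower().replace(".", "").split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "Positive" if score > 0 else "Negative" if score < 0 else "Neutral"

print(sentiment("The credit card support was terrible and slow."))  # Negative
print(sentiment("The agent was great and helpful"))                 # Positive
```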

Evaluation Metrics

| Metric | Meaning |
| --- | --- |
| Accuracy | Overall correctness |
| Precision | Share of positive predictions that are actually positive |
| Recall | True positive rate |
| F1 Score | Harmonic mean of precision and recall |
| ROC-AUC | Area under the ROC curve for binary classifiers |
| Confusion Matrix | TP, FP, TN, FN breakdown |
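Given confusion-matrix counts, the metrics follow directly (toy numbers for illustration):

```python
# Toy confusion-matrix counts for a binary sentiment classifier
TP, FP, TN, FN = 40, 10, 35, 15

accuracy  = (TP + TN) / (TP + FP + TN + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.75 precision=0.80 recall=0.73 f1=0.76
```

Note how F1 sits between precision and recall but is pulled toward the lower of the two, which is exactly why it is preferred over accuracy on imbalanced sentiment datasets.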

Financial Applications

| Use Case | Example |
| --- | --- |
| Credit Card Review Classification | Classify user reviews as satisfied vs dissatisfied |
| Stock News Sentiment | Judge whether a news item is bullish or bearish for a stock |
| 10-K Annual Report Sentiment | Quantify the concentration of negative sentiment in risk paragraphs |
| Automated Financial Customer Service | Emotion recognition + automatic response suggestions |

Tools for Sentiment Analysis

| Library | Feature |
| --- | --- |
| NLTK | Basic preprocessing and classifiers |
| TextBlob | Simple sentiment polarity scoring |
| Scikit-learn | Pipeline + classifier integration |
| SpaCy | Efficient tokenization & POS tagging |
| HuggingFace Transformers | Pretrained BERT, RoBERTa, etc. |
| VADER (for social media) | Lexicon-based scoring for short texts |
| FinBERT (for finance) | Fine-tuned BERT on financial news/reports |