
feat: refine 3.2.2.1, introducing OOV and semantic expression

xyliu3 7 months ago
parent
commit
55fc4a3cf5

+ 6 - 3
docs/chapter3/Chapter3-Fundamentals-of-Large-Language-Models.md

@@ -663,10 +663,13 @@ We know that computers essentially can only understand numbers. Therefore, befor
 
 Early natural language processing tasks might adopt simple tokenization strategies:
 
-- **Word-based**: Directly split sentences into words using spaces or punctuation. This method is intuitive but faces the problem of "vocabulary explosion." A language's vocabulary is huge; if each word is treated as an independent token, the vocabulary becomes difficult to manage. Worse, the model will be unable to handle any words not appearing in the vocabulary, such as "DatawhaleAgent."
-- **Character-based**: Split text into individual characters. This method has a very small vocabulary (e.g., English letters, numbers, and punctuation) and no OOV (Out-Of-Vocabulary) problem. But its disadvantage is that most individual characters don't have independent semantics, and the model needs to spend more effort learning how to combine characters into meaningful words, leading to low learning efficiency.
+-   **Word-based**: Directly splits sentences into words using spaces or punctuation. This method is intuitive but faces significant challenges:
+    -   **Vocabulary Explosion and OOV**: A language's vocabulary is vast. If each word is treated as an independent token, the vocabulary becomes difficult to manage. Worse, the model cannot handle any word that does not appear in its vocabulary (e.g., "DatawhaleAgent"). This phenomenon is known as the "Out-Of-Vocabulary" (OOV) problem.
+    -   **Lack of Semantic Association**: The model struggles to capture the semantic relationships between morphologically similar words. For instance, "look," "looks," and "looking" are treated as three completely different tokens, despite sharing a common core meaning. Similarly, the semantics of low-frequency words in the training data cannot be fully learned due to their rare occurrences.
 
-To balance vocabulary size and semantic expression, modern large language models generally adopt **Subword Tokenization** algorithms. The core idea is: keep common words (such as "agent") as complete tokens while splitting uncommon words (such as "Tokenization") into multiple meaningful subword fragments (such as "Token" and "ization"). This both controls vocabulary size and allows the model to understand and generate new words by combining subwords.
+-   **Character-based**: Splits text into individual characters. This method has a very small vocabulary (e.g., English letters, numbers, and punctuation) and thus avoids the OOV problem. However, its disadvantage is that individual characters mostly lack independent semantic meaning. The model must expend more effort learning to combine characters into meaningful words, leading to inefficient learning.
+
+To balance vocabulary size and semantic expression, modern large language models widely adopt **Subword Tokenization** algorithms. The core idea is to keep common words (like "agent") as single, complete tokens while breaking down uncommon words (like "Tokenization") into meaningful subword pieces (such as "Token" and "ization"). This approach not only controls the size of the vocabulary but also enables the model to understand and generate new words by combining subwords.
 
 **3.2.2.2 Byte-Pair Encoding Algorithm Analysis**
 

+ 3 - 1
docs/chapter3/第三章 大语言模型基础.md

@@ -669,7 +669,9 @@ How are you?
 
 Early natural language processing tasks might adopt simple tokenization strategies:
 
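-- <strong>Word-based</strong>: Directly split sentences into words using spaces or punctuation. This method is intuitive but faces the problem of "vocabulary explosion." A language's vocabulary is huge; if each word is treated as an independent token, the vocabulary becomes difficult to manage. Worse, the model will be unable to handle any word that does not appear in the vocabulary, such as "DatawhaleAgent."
+- <strong>Word-based</strong>: Directly split sentences into words using spaces or punctuation. This method is intuitive but also faces challenges:
+  - Vocabulary explosion and out-of-vocabulary words: A language's vocabulary is huge; if each word is treated as an independent token, the vocabulary becomes difficult to manage. Worse, the model will be unable to handle any word that does not appear in the vocabulary (e.g., "DatawhaleAgent"). We call this phenomenon the "Out-Of-Vocabulary" (OOV) problem.
+  - Lack of semantic association: The model struggles to capture the semantic relationships between morphologically similar words. For example, "look," "looks," and "looking" are treated as three completely different tokens, even though they share a common core meaning. Similarly, low-frequency words in the training data occur so rarely that the model cannot fully learn their semantics.
 - <strong>Character-based</strong>: Split text into individual characters. This method has a very small vocabulary (e.g., English letters, numbers, and punctuation) and no OOV problem. Its disadvantage is that most individual characters carry no independent meaning, so the model must spend more effort learning how to combine characters into meaningful words, which leads to low learning efficiency.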
 
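 To balance vocabulary size and semantic expression, modern large language models generally adopt <strong>Subword Tokenization</strong> algorithms. The core idea is to keep common words (such as "agent") as complete tokens while splitting uncommon words (such as "Tokenization") into multiple meaningful subword fragments (such as "Token" and "ization"). This both controls vocabulary size and lets the model understand and generate new words by combining subwords.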
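
The three strategies compared in this change can be illustrated with a small sketch. Everything here is an assumption made for illustration and is not part of the commit: the toy vocabularies, the helper names, and the `<unk>` placeholder are hypothetical, and the subword step uses a simple greedy longest-match segmentation rather than the Byte-Pair Encoding algorithm that section 3.2.2.2 covers.

```python
import re

UNK = "<unk>"  # hypothetical placeholder for out-of-vocabulary tokens


def word_tokenize(text):
    """Word-based: split on whitespace, keeping punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokenize(text):
    """Character-based: every character becomes its own token."""
    return list(text)


def map_to_vocab(tokens, vocab):
    """Replace any token missing from the vocabulary with UNK (the OOV problem)."""
    return [tok if tok in vocab else UNK for tok in tokens]


def subword_tokenize(word, vocab):
    """Greedy longest-match segmentation of one word into known subword pieces.

    This is a simplified stand-in for a real subword algorithm, not BPE itself.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # no known piece starts here
            return [UNK]
        pieces.append(word[start:end])
        start = end
    return pieces


# Toy vocabularies, invented for this example.
word_vocab = {"the", "agent", "looks", "at"}
subword_vocab = {"agent", "Token", "ization", "Data", "whale", "Agent",
                 "look", "s", "ing"}

sentence = "the agent looks at DatawhaleAgent"
word_level = map_to_vocab(word_tokenize(sentence), word_vocab)
char_level = char_tokenize("look")
rare_word = subword_tokenize("Tokenization", subword_vocab)
new_word = subword_tokenize("DatawhaleAgent", subword_vocab)
inflected = [subword_tokenize(w, subword_vocab) for w in ("looks", "looking")]
```

In this sketch the word-level vocabulary maps "DatawhaleAgent" to `<unk>`, while the subword vocabulary recovers it from known pieces, and "looks" and "looking" both begin with the shared piece "look", which is the kind of semantic association a word-level vocabulary cannot express.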