7 months ago · 92cf78bd3a
--- a/docs/chapter3/Chapter3-Fundamentals-of-Large-Language-Models.md
+++ b/docs/chapter3/Chapter3-Fundamentals-of-Large-Language-Models.md
@@ -386,7 +386,7 @@ The key to this name is "position-wise." It means this feed-forward network acts
 
 $$\mathrm{FFN}(x)=\max\left(0, xW_{1}+b_{1}\right) W_{2}+b_{2}$$
 
-Where $x$ is the output of the attention sublayer. $W_1,b_1,W_2,b_2$ are learnable parameters. Typically, the output dimension `d_ff` of the first linear layer is much larger than the input dimension `d_model` (for example, `d_ff = 4 * d_model`), then after ReLU activation, it is mapped back to `d_model` dimension through the second linear layer. This "expand then shrink" pattern, also called a bottleneck structure, is believed to help the model learn richer feature representations.
+Where $x$ is the output of the attention sublayer. $W_1,b_1,W_2,b_2$ are learnable parameters. Typically, the output dimension `d_ff` of the first linear layer is much larger than the input dimension `d_model` (for example, `d_ff = 4 * d_model`), then after ReLU activation, it is mapped back to `d_model` dimension through the second linear layer. This "expand then shrink" design is believed to help the model learn richer feature representations.
 
 In our PyTorch skeleton, we can implement this module with the following code:
 
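A side note on the hunk above: the formula and its "expand then shrink" description can be made concrete with a small worked example. The sketch below is not the chapter's PyTorch implementation, which this diff does not show. It is a dependency-free illustration in plain Python, and the helper names `linear`, `ffn`, and `ffn_sequence`, the sizes `d_model = 4` and `d_ff = 16`, and the random weights are all assumptions chosen for illustration.

```python
import random


def linear(x, W, b):
    """Compute x W + b for one vector x (length n_in), W (n_in x n_out), b (length n_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]


def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position vector x."""
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]  # expand to d_ff, then apply ReLU
    return linear(hidden, W2, b2)  # project back to d_model


def ffn_sequence(xs, W1, b1, W2, b2):
    """Apply the same FFN, with the same weights, to each position independently."""
    return [ffn(x, W1, b1, W2, b2) for x in xs]


# Illustrative sizes (assumed): d_ff = 4 * d_model, as in the text's example.
d_model, d_ff = 4, 16
rng = random.Random(0)
W1 = [[rng.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
b1 = [0.0] * d_ff
W2 = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
b2 = [0.0] * d_model

# A toy "sequence" of 3 positions, each a d_model-dimensional vector.
seq = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(3)]
out = ffn_sequence(seq, W1, b1, W2, b2)
```

Because `ffn_sequence` applies the same weights to each position independently, reordering the input positions simply reorders the outputs, which is what "position-wise" refers to.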
@@ -553,8 +553,6 @@ Sentiment: Positive
 
 **One-shot Prompting** We provide the model with one complete example, showing it the task format and expected output style.
 
-We provide the model with one complete example, showing it the task format and expected output style.
-
 Case: We first give the model a complete "question-answer" pair as a demonstration, then pose our new question.
 
 ```Python
--- a/docs/chapter3/第三章大语言模型基础.md
+++ b/docs/chapter3/第三章大语言模型基础.md
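@@ -1,4 +1,4 @@
-# Chapter 3 Fundamentals of Large Language Models
+# Chapter 3 Fundamentals of Large Language Models
 
 The previous two chapters introduced the definition and development history of agents. This chapter focuses entirely on large language models themselves to answer a key question: how do modern agents work? Starting from the basic definition of a language model, we study these principles to lay a solid foundation for understanding how LLMs acquire their powerful knowledge reserves and reasoning abilities.
 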
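@@ -388,7 +388,7 @@ class MultiHeadAttention(nn.Module):
 
 $$\mathrm{FFN}(x)=\max\left(0, xW_{1}+b_{1}\right) W_{2}+b_{2}$$
 
-Where $x$ is the output of the attention sublayer. $W_1,b_1,W_2,b_2$ are learnable parameters. Typically, the output dimension `d_ff` of the first linear layer is much larger than the input dimension `d_model` (for example, `d_ff = 4 * d_model`); after ReLU activation, the result is mapped back to the `d_model` dimension through the second linear layer. This "expand then shrink" pattern, also called a bottleneck structure, is believed to help the model learn richer feature representations.
+Where $x$ is the output of the attention sublayer. $W_1,b_1,W_2,b_2$ are learnable parameters. Typically, the output dimension `d_ff` of the first linear layer is much larger than the input dimension `d_model` (for example, `d_ff = 4 * d_model`); after ReLU activation, the result is mapped back to the `d_model` dimension through the second linear layer. This "expand then shrink" pattern is believed to help the model learn richer feature representations.
 
 In our PyTorch skeleton, we can implement this module with the following code:
 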
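@@ -559,8 +559,6 @@ The working mode of the Decoder-Only architecture is called <strong>Autoregressive</s
 
 <strong>One-shot Prompting</strong> We provide the model with one complete example, showing it the task format and expected output style.
 
-We provide the model with one complete example, showing it the task format and expected output style.
-
 Case: We first give the model a complete "question-answer" pair as a demonstration, then pose our new question.
 
 ```Python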