1 month ago · bff822a1d8
--- a/docs/chapter1/Chapter1-Introduction-to-Agents.md
+++ b/docs/chapter1/Chapter1-Introduction-to-Agents.md
@@ -296,7 +296,7 @@ def get_weather(city: str) -> str:
 
 (3) Tool 2: Search and Recommend Tourist Attractions
 
-We will define a new tool `search_attraction` that searches the internet for suitable attractions based on city and weather conditions:
+We will define a new tool `get_attraction` that searches the internet for suitable attractions based on city and weather conditions:
 
 ```python
 import os
--- a/docs/chapter1/第一章初识智能体.md
+++ b/docs/chapter1/第一章初识智能体.md
@@ -300,7 +300,7 @@ def get_weather(city: str) -> str:
 
 （3）工具 2：搜索并推荐旅游景点
 
-我们将定义一个新工具 `search_attraction`，它会根据城市和天气状况，互联网上搜索合适的景点：
+我们将定义一个新工具 `get_attraction`，它会根据城市和天气状况，在互联网上搜索合适的景点：
 
 ```python
 import os
--- a/docs/chapter11/Chapter11-Agentic-RL.md
+++ b/docs/chapter11/Chapter11-Agentic-RL.md
@@ -40,7 +40,7 @@ Before diving into Agentic RL, we need to first understand the complete process
   <p>Figure 11.1 LLM Training Landscape</p>
 </div>
 
-**Pretraining Stage** is the first stage of LLM training, with the goal of making the model learn basic language patterns and world knowledge. This stage uses massive amounts of text data (usually TB-level) and trains the model through self-supervised learning. The most common pretraining task is Causal Language Modeling, also known as Next Token Prediction.
+**Pretraining Stage** is the first stage of LLM training, with the goal of making the model learn basic language patterns and world knowledge. This stage uses massive amounts of text data (usually TB-level) and trains the model through self-supervised learning, where the training signal is constructed from the text itself, such as predicting the next word from the previous context. The most common pretraining task is Causal Language Modeling, also known as Next Token Prediction.
 
 Given a text sequence $x_1, x_2, ..., x_t$, the model needs to predict the next word $x_{t+1}$:
 
@@ -50,7 +50,7 @@ $$
 
 Where $\theta$ is the model parameters, $P(x_t | x_1, ..., x_{t-1}; \theta)$ is the probability distribution of the next word predicted by the model, and the goal is to minimize negative log-likelihood, i.e., maximize the probability of predicting the correct word. For example, given the text "The cat sat on the", the model needs to predict that the next word is most likely "mat". Through training on massive amounts of text, the model gradually learns grammar rules (what word sequences are legal), semantic knowledge (relationships between words), world knowledge (factual information about the world), and basic reasoning abilities.
 
-The characteristics of the pretraining stage are: massive data volume, high computational cost, learning general language understanding and generation capabilities, and using unsupervised learning.
+The characteristics of the pretraining stage are: massive data volume, high computational cost, learning general language understanding and generation capabilities, and using self-supervised objectives constructed from unlabeled text rather than manually labeled task data.
 
 **Post-training Stage** aims to address the shortcomings of pretrained models. Although pretrained models have powerful language capabilities, they are just "next word prediction" models and don't know how to follow human instructions, generate helpful, harmless, and honest answers, refuse inappropriate requests, and interact with humans in a conversational manner. The post-training stage aims to solve these problems and align the model with human preferences and values.
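
As an illustration of the pretraining objective this hunk describes (minimizing the negative log-likelihood of the correct next word), here is a minimal Python sketch. The probability table and the helper names `token_nll` and `sequence_nll` are hypothetical and not part of the chapter; a real model would compute the distribution from its parameters $\theta$, and whether the per-step losses are summed or averaged is an assumption made here.

```python
import math

# Hypothetical next-word distribution for the context "The cat sat on the".
# A real LLM would compute these probabilities from its parameters theta.
next_word_probs = {"mat": 0.6, "sofa": 0.3, "moon": 0.1}


def token_nll(probs: dict, target: str) -> float:
    """Negative log-likelihood of one target word under a distribution."""
    return -math.log(probs[target])


def sequence_nll(step_probs: list) -> float:
    """Sum of per-step NLLs, given the probability assigned to each correct word."""
    return -sum(math.log(p) for p in step_probs)


likely = token_nll(next_word_probs, "mat")     # lower loss for a likely word
unlikely = token_nll(next_word_probs, "moon")  # higher loss for an unlikely word
total = sequence_nll([0.5, 0.25, 0.125])       # equals 6 * ln(2)
```

The loss is smaller when the model puts more probability on the correct word, which is why minimizing it is equivalent to maximizing the probability of the correct prediction.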
			
 
				 
			
@@ -90,7 +90,7 @@ Let's understand this difference through a specific example. In the PBRFT scenar
 
 As can be seen, key features of Agentic RL are multi-step interaction, each action changes environment state, each step can receive feedback, and optimizing overall task completion quality.
 
-Reinforcement learning is formalized based on the Markov Decision Process (MDP) framework. MDP is defined by a five-tuple $(S, A, P, R, \gamma)$: state space $S$, action space $A$, state transition function $P(s'|s,a)$, reward function $R(s,a)$, discount factor $\gamma$. Let's compare PBRFT and Agentic RL from the MDP perspective, as shown in Table 11.1.
+Reinforcement learning is commonly formalized with the Markov Decision Process (MDP) framework. An MDP is defined by a five-tuple $(S, A, P, R, \gamma)$: state space $S$, action space $A$, state transition function $P(s'|s,a)$, reward function $R(s,a)$, and discount factor $\gamma$. The table below compares PBRFT and Agentic RL from the MDP perspective.
 
 <div align="center">
   <p>Table 11.1 Comparison of PBRFT and Agentic RL</p>
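
The five-tuple $(S, A, P, R, \gamma)$ named in this hunk can be sketched as a small Python data structure. This is only an illustration under stated assumptions: the `MDP` class, the `discounted_return` helper, and the toy "search then answer" task are all hypothetical and do not come from the chapter.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class MDP:
    """Container for the five-tuple (S, A, P, R, gamma)."""
    states: List[str]                                      # S
    actions: List[str]                                     # A
    transition: Dict[Tuple[str, str], Dict[str, float]]    # P(s'|s,a)
    reward: Callable[[str, str], float]                    # R(s,a)
    gamma: float                                           # discount factor


def discounted_return(rewards: List[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over one episode's rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Hypothetical toy task: the agent may search repeatedly, then answer.
toy = MDP(
    states=["working", "done"],
    actions=["search", "answer"],
    transition={
        ("working", "search"): {"working": 1.0},
        ("working", "answer"): {"done": 1.0},
    },
    reward=lambda s, a: 1.0 if (s, a) == ("working", "answer") else 0.0,
    gamma=0.9,
)

# Two searches followed by an answer: rewards 0, 0, 1.
episode = [("working", "search"), ("working", "search"), ("working", "answer")]
rewards = [toy.reward(s, a) for s, a in episode]
episode_return = discounted_return(rewards, toy.gamma)  # 0.9**2 * 1.0
```

With a multi-step episode like this one, the discount factor $\gamma$ determines how much a reward that arrives only at the final step contributes to the return.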
			
--- a/docs/chapter11/第十一章
+++ b/docs/chapter11/第十一章
@@ -40,7 +40,7 @@ duck egg. How much in dollars does she make every day at the farmers' market?
   <p>图 11.1 LLM 训练全景图</p>
 </div>
 
-<strong>预训练阶段</strong>是 LLM 训练的第一阶段，目标是让模型学习语言的基本规律和世界知识。这个阶段使用海量的文本数据(通常是数 TB 级别)，通过自监督学习的方式训练模型。最常见的预训练任务是因果语言建模(Causal Language Modeling)，也称为下一个词预测(Next Token Prediction)。
+<strong>预训练阶段</strong>是 LLM 训练的第一阶段，目标是让模型学习语言的基本规律和世界知识。这个阶段使用海量的文本数据(通常是数 TB 级别)，通过自监督学习的方式训练模型：训练信号来自文本本身，例如根据上文预测下一个词。最常见的预训练任务是因果语言建模(Causal Language Modeling)，也称为下一个词预测(Next Token Prediction)。
 
 给定一个文本序列 $x_1, x_2, ..., x_t$，模型需要预测下一个词 $x_{t+1}$:
 
@@ -50,7 +50,7 @@ $$
 
 其中 $\theta$ 是模型参数，$P(x_t | x_1, ..., x_{t-1}; \theta)$ 是模型预测的下一个词的概率分布，目标是最小化负对数似然，即最大化预测正确词的概率。例如，给定文本"The cat sat on the"，模型需要预测下一个词最可能是"mat"。通过在海量文本上进行这样的训练，模型逐渐学会语法规则(什么样的词序是合法的)、语义知识(词与词之间的关系)、世界知识(关于世界的事实性信息)以及基础的推理能力。
 
-预训练阶段的特点是数据量巨大、计算成本高、学到的是通用的语言理解和生成能力、采用无监督学习。
+预训练阶段的特点是数据量巨大、计算成本高，学到的是通用的语言理解和生成能力，训练过程通常采用从未标注文本中自动构造监督信号的自监督目标。
 
 <strong>后训练阶段</strong>则是要解决预训练模型的不足。预训练后的模型虽然具备了强大的语言能力，但它只是一个"预测下一个词"的模型，并不知道如何遵循人类的指令、生成有帮助无害诚实的回答、拒绝不当的请求，以及以对话的方式与人交互。后训练阶段就是要解决这些问题，让模型对齐人类的偏好和价值观。
 
@@ -90,7 +90,7 @@ $$
 
 可以看到，Agentic RL 的关键特征是多步交互、每一步的行动都会改变环境状态、每一步都可以获得反馈、优化整个任务的完成质量。
 
-强化学习是基于马尔可夫决策过程(Markov Decision Process， MDP)框架进行形式化的。MDP 由五元组 $(S, A, P, R, \gamma)$ 定义:状态空间$S$、行动空间$A$、状态转移函数$P(s'|s,a)$、奖励函数$R(s,a)$、折扣因子$\gamma$。让我们从 MDP 的角度对比 PBRFT 和 Agentic RL，如表 11.1 所示。
+强化学习常用马尔可夫决策过程(Markov Decision Process，MDP)框架进行形式化。MDP 由五元组 $(S, A, P, R, \gamma)$ 定义：状态空间 $S$、行动空间 $A$、状态转移函数 $P(s'|s,a)$、奖励函数 $R(s,a)$ 和折扣因子 $\gamma$。下表从 MDP 角度对比 PBRFT 和 Agentic RL。
 
 <div align="center">
   <p>表 11.1 PBRFT 与 Agentic RL 对比</p>