7 months ago · 919494d0b1
--- a/docs/chapter12/Chapter12-Agent-Performance-Evaluation.md
+++ b/docs/chapter12/Chapter12-Agent-Performance-Evaluation.md
@@ -1865,7 +1865,7 @@ In our implementation, LLM Judge evaluates AIME problem quality from four key di
 
				 
			
 
				 <div align="center">
			
 
				   <p>Table 12.5 LLM Judge Evaluation Dimensions for AIME Problems</p>
			
 
				-  <img src="https://raw.githubusercontent.com/datawhalechina/Hello-Agents/main/docs/images/12-figures/12-table-4.png" alt="" width="85%"/>
			
 
				+  <img src="https://raw.githubusercontent.com/datawhalechina/Hello-Agents/main/docs/images/12-figures/12-table-5.png" alt="" width="85%"/>
			
 
				 </div>
			
 
				 
			
 
				 After obtaining scores from four dimensions, we need to aggregate these scores into overall evaluation metrics. We define three key metrics to measure the quality level of generated problems:
			
--- a/docs/chapter12/第十二章智能体性能评估.md
+++ b/docs/chapter12/第十二章智能体性能评估.md
@@ -1853,7 +1853,7 @@ AIME is a medium-difficulty mathematics competition run by the Mathematical Association of America (MAA), falling between AM
 
				 
			
 
				 <div align="center">
			
 
				   <p>Table 12.5 LLM Judge Evaluation Dimensions for AIME Problems</p>
			
 
				-  <img src="https://raw.githubusercontent.com/datawhalechina/Hello-Agents/main/docs/images/12-figures/12-table-4.png" alt="" width="85%"/>
			
 
				+  <img src="https://raw.githubusercontent.com/datawhalechina/Hello-Agents/main/docs/images/12-figures/12-table-5.png" alt="" width="85%"/>
			
 
				 </div>
			
 
				 
			
 
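				 With scores for the four dimensions in hand, we need to aggregate them into overall evaluation metrics. We define three key metrics to measure the quality of the generated problems: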