add chapter 12

jjyaoao 8 months ago
Parent
Commit
7cded96c8e
52 changed files with 7348 additions and 2 deletions
  1. 91 0
      code/chapter12/.env.example
  2. 38 0
      code/chapter12/01_basic_agent_example.py
  3. 49 0
      code/chapter12/02_bfcl_quick_start.py
  4. 61 0
      code/chapter12/03_bfcl_custom_evaluation.py
  5. 293 0
      code/chapter12/04_run_bfcl_evaluation.py
  6. 85 0
      code/chapter12/05_gaia_quick_start.py
  7. 149 0
      code/chapter12/06_gaia_best_practices.py
  8. 118 0
      code/chapter12/07_data_generation_complete_flow.py
  9. 167 0
      code/chapter12/08_data_generation_llm_judge.py
  10. 170 0
      code/chapter12/09_data_generation_win_rate.py
  11. 279 0
      code/chapter12/README.md
  12. 461 0
      code/chapter12/data_generation/aime_generator.py
  13. 80 0
      code/chapter12/data_generation/evaluation_results/20251011_123929/comprehensive_report.md
  14. 127 0
      code/chapter12/data_generation/evaluation_results/20251011_123929/llm_judge/llm_judge_report_20251011_124143.md
  15. 378 0
      code/chapter12/data_generation/evaluation_results/20251011_123929/llm_judge/llm_judge_results_20251011_124143.json
  16. 145 0
      code/chapter12/data_generation/evaluation_results/20251011_123929/win_rate/win_rate_report_20251011_124322.md
  17. 275 0
      code/chapter12/data_generation/evaluation_results/20251011_123929/win_rate/win_rate_results_20251011_124322.json
  18. 20 0
      code/chapter12/data_generation/generated_data/aime_generated_20251011_042741.json
  19. 44 0
      code/chapter12/data_generation/generated_data/generation_report_20251011_042741.md
  20. 254 0
      code/chapter12/data_generation/human_verification_ui.py
  21. 314 0
      code/chapter12/data_generation/run_complete_evaluation.py
  22. 45 0
      code/chapter12/data_generation/step1_generate_only.py
  23. 287 0
      code/chapter12/data_generation/step2_evaluate_only.py
  24. 312 0
      code/chapter12/data_generation/运行指南.md
  25. 36 0
      code/chapter12/template_output/evaluation_reports/bfcl_report_20251011_010343.md
  26. 38 0
      code/chapter12/template_output/evaluation_reports/gaia_report_20251011_012648.md
  27. 5 0
      code/chapter12/template_output/evaluation_results/bfcl_official/BFCL_v4_simple_python_result.json
  28. 216 0
      code/chapter12/template_output/evaluation_results/bfcl_official/HelloAgents_simple_python_results.json
  29. 90 0
      code/chapter12/template_output/evaluation_results/gaia_official/SUBMISSION_GUIDE_20251011_015731.md
  30. 2 0
      code/chapter12/template_output/evaluation_results/gaia_official/gaia_level1_result_20251011_015731.jsonl
  31. 5 0
      code/chapter12/template_output/result/Qwen_Qwen3-8B/BFCL_v4_simple_python_result.json
  32. 1 0
      code/chapter12/template_output/score/Qwen_Qwen3-8B/non_live/BFCL_v4_simple_python_score.json
  33. 2 0
      code/chapter12/template_output/score/data_agentic.csv
  34. 0 0
      code/chapter12/template_output/score/data_format_sensitivity.csv
  35. 2 0
      code/chapter12/template_output/score/data_live.csv
  36. 2 0
      code/chapter12/template_output/score/data_multi_turn.csv
  37. 2 0
      code/chapter12/template_output/score/data_non_live.csv
  38. 2 0
      code/chapter12/template_output/score/data_overall.csv
  39. 1 1
      docs/chapter10/第十章 智能体通信协议.md
  40. 2702 1
      docs/chapter12/第十二章 智能体性能评估.md
  41. BIN
      docs/images/12-figures/12-1.png
  42. BIN
      docs/images/12-figures/12-2.png
  43. BIN
      docs/images/12-figures/12-3.png
  44. BIN
      docs/images/12-figures/12-4.png
  45. BIN
      docs/images/12-figures/12-5.png
  46. BIN
      docs/images/12-figures/12-6.png
  47. BIN
      docs/images/12-figures/12-7.png
  48. BIN
      docs/images/12-figures/12-table-1.png
  49. BIN
      docs/images/12-figures/12-table-2.png
  50. BIN
      docs/images/12-figures/12-table-3.png
  51. BIN
      docs/images/12-figures/12-table-4.png
  52. BIN
      docs/images/12-figures/12-table-5.png

+ 91 - 0
code/chapter12/.env.example

@@ -0,0 +1,91 @@
+# ============================================================================
+# HelloAgents unified environment variable configuration file
+# ============================================================================
+# Copy this file to .env and fill in your API keys
+# System requirement: Python 3.10+ (required)
+
+# ============================================================================
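+# 🚀 Unified configuration format (recommended) - the framework auto-detects the provider
+# ============================================================================
+# Set just the following 4 generic environment variables; the framework identifies the LLM provider automatically:
+
+# Model name
+LLM_MODEL_ID=your-model-name
+
+# API key
+LLM_API_KEY=your-api-key-here
+
+# Service base URL
+LLM_BASE_URL=your-api-base-url
+
+# Timeout in seconds (optional, default 60)
+LLM_TIMEOUT=60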
+
+# ============================================================================
+# 🛠️ Tool configuration (optional)
+# ============================================================================
+
+# Tavily search (recommended) - get an API key: https://tavily.com/
+# TAVILY_API_KEY=tvly-your_tavily_key_here
+
+# SerpApi search (fallback) - get an API key: https://serpapi.com/
+# SERPAPI_API_KEY=your_serpapi_key_here
+
+# ================================
+# Qdrant vector database configuration - get an API key: https://cloud.qdrant.io/
+# ================================
+# Use the Qdrant cloud service (recommended)
+QDRANT_URL=https://your-cluster.qdrant.tech:6333
+QDRANT_API_KEY=your_qdrant_api_key_here
+
+# Or use a local Qdrant (requires Docker)
+# QDRANT_URL=http://localhost:6333
+# QDRANT_API_KEY=
+
+# Qdrant collection settings
+QDRANT_COLLECTION=hello_agents_vectors
+QDRANT_VECTOR_SIZE=384
+QDRANT_DISTANCE=cosine
+QDRANT_TIMEOUT=30
+
+# ================================
+# Neo4j graph database configuration - get an API key: https://neo4j.com/cloud/aura/
+# ================================
+# Use the Neo4j Aura cloud service (recommended)
+NEO4J_URI=neo4j+s://your-instance.databases.neo4j.io
+NEO4J_USERNAME=neo4j
+NEO4J_PASSWORD=your_neo4j_password_here
+
+# Or use a local Neo4j (requires Docker)
+# NEO4J_URI=bolt://localhost:7687
+# NEO4J_USERNAME=neo4j
+# NEO4J_PASSWORD=hello-agents-password
+
+# Neo4j connection settings
+NEO4J_DATABASE=neo4j
+NEO4J_MAX_CONNECTION_LIFETIME=3600
+NEO4J_MAX_CONNECTION_POOL_SIZE=50
+NEO4J_CONNECTION_TIMEOUT=60
+
+# ==========================
+# Embedding configuration example - keys are available from the Alibaba Cloud console: https://dashscope.aliyun.com/
+# ==========================
+# - If left empty, dashscope defaults to text-embedding-v3; local defaults to sentence-transformers/all-MiniLM-L6-v2
+EMBED_MODEL_TYPE=dashscope
+EMBED_MODEL_NAME=
+EMBED_API_KEY=
+EMBED_BASE_URL=
+
+# ================================
+# GitHub API configuration
+# ================================
+# GitHub Personal Access Token - used to access the GitHub API
+# How to get one: https://github.com/settings/tokens
+GITHUB_PERSONAL_ACCESS_TOKEN=
+
+# ================================
+# HuggingFace API configuration
+# ================================
+# HuggingFace Token - used to access gated datasets (such as GAIA)
+# How to get one: https://huggingface.co/settings/tokens
+HF_TOKEN=
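
The four generic `LLM_*` variables above could be read along the following lines. This is a minimal sketch, not HelloAgents' actual loader: `load_llm_config` is a hypothetical helper introduced only for illustration, and the only facts taken from the file are the variable names and the documented 60-second default timeout.

```python
import os
from typing import Mapping, Optional


def load_llm_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the four generic LLM_* variables (hypothetical helper, illustration only)."""
    env = os.environ if env is None else env
    required = ("LLM_MODEL_ID", "LLM_API_KEY", "LLM_BASE_URL")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise ValueError(f"Missing required variables: {', '.join(missing)}")
    return {
        "model": env["LLM_MODEL_ID"],
        "api_key": env["LLM_API_KEY"],
        "base_url": env["LLM_BASE_URL"],
        # LLM_TIMEOUT is optional; the file documents a default of 60 seconds.
        "timeout": int(env.get("LLM_TIMEOUT") or 60),
    }
```

Passing a plain dictionary instead of relying on `os.environ` keeps the helper easy to test; a real application would more likely load `.env` first and then call it with no arguments.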

+ 38 - 0
code/chapter12/01_basic_agent_example.py

@@ -0,0 +1,38 @@
+"""
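+Chapter 12, Example 1: a basic agent
+
+Corresponding documentation: 12.1.1 Why agent evaluation is needed
+
+This example shows a basic agent that can call a search tool to complete a task.
+But how do we know how well it performs? That is why an evaluation system is needed.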
+"""
+
+from hello_agents import SimpleAgent, HelloAgentsLLM
+from hello_agents.tools import SearchTool
+
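+# Create the LLM and the agent
+llm = HelloAgentsLLM()
+
+# Create a system prompt that emphasizes tool use
+system_prompt = """You are an AI assistant that can use a search tool to get up-to-date information.
+
+When you need to search for information, use the following format:
+[TOOL_CALL:search:search keywords]
+
+For example:
+- [TOOL_CALL:search:latest AI news]
+- [TOOL_CALL:search:Python programming tutorial]
+
+Before answering a question, use the search tool to get the latest information."""
+
+agent = SimpleAgent(name="AI Assistant", llm=llm, system_prompt=system_prompt)
+
+# Add the search tool
+agent.add_tool(SearchTool())
+
+# Example: answer a question using the search tool
+print("\nQuestion: What are the latest trends in AI technology?")
+print("\nThe agent is thinking and searching...")
+response = agent.run("What are the latest trends in AI technology?")
+print(f"\nAnswer: {response}")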
+

+ 49 - 0
code/chapter12/02_bfcl_quick_start.py

@@ -0,0 +1,49 @@
+"""
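+Chapter 12, Example 2: BFCL quick start
+
+Corresponding documentation: 12.2.5 Implementing BFCL evaluation in HelloAgents - Method 1
+
+This is the simplest way to run a BFCL evaluation: a single call runs the evaluation, generates the report, and runs the official evaluation.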
+"""
+
+from hello_agents import SimpleAgent, HelloAgentsLLM
+from hello_agents.tools import BFCLEvaluationTool
+
+# 1. Create the agent to evaluate
+llm = HelloAgentsLLM()
+agent = SimpleAgent(name="TestAgent", llm=llm)
+
+# 2. Create the BFCL evaluation tool
+bfcl_tool = BFCLEvaluationTool()
+
+# 3. Run the evaluation (all steps run automatically)
+results = bfcl_tool.run(
+    agent=agent,
+    category="simple_python",  # evaluation category
+    max_samples=5              # number of samples to evaluate (0 means all)
+)
+
+# 4. Inspect the results
+print(f"Accuracy: {results['overall_accuracy']:.2%}")
+print(f"Correct: {results['correct_samples']}/{results['total_samples']}")
+
+# Sample output:
+# ============================================================
+# BFCL one-click evaluation
+# ============================================================
+# 
+# Configuration:
+#    Agent: TestAgent
+#    Category: simple_python
+#    Samples: 5
+# 
+# Evaluation progress: 100%|██████████| 5/5 [00:15<00:00,  3.12s/sample]
+# 
+# ✅ Evaluation complete
+#    Total samples: 5
+#    Correct samples: 5
+#    Accuracy: 100.00%
+# 
+# Accuracy: 100.00%
+# Correct: 5/5
+

+ 61 - 0
code/chapter12/03_bfcl_custom_evaluation.py

@@ -0,0 +1,61 @@
+"""
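+Chapter 12, Example 3: custom BFCL evaluation
+
+Corresponding documentation: 12.2.5 Implementing BFCL evaluation in HelloAgents - Method 3
+
+This example shows how to build a custom evaluation flow from the lower-level components.
+It suits scenarios that need control over each step of the evaluation.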
+"""
+
+from hello_agents import SimpleAgent, HelloAgentsLLM
+from hello_agents.evaluation import BFCLDataset, BFCLEvaluator
+
+# 1. Create the agent
+llm = HelloAgentsLLM()
+agent = SimpleAgent(name="TestAgent", llm=llm)
+
+# 2. Load the dataset
+dataset = BFCLDataset(
+    bfcl_data_dir="./temp_gorilla/berkeley-function-call-leaderboard/bfcl_eval/data",
+    category="simple_python"
+)
+data = dataset.load()
+
+print(f"✅ Loaded {len(data)} test samples")
+
+# 3. Create the evaluator
+evaluator = BFCLEvaluator(
+    dataset=dataset,
+    category="simple_python"
+)
+
+# 4. Run the evaluation
+results = evaluator.evaluate(
+    agent=agent,
+    max_samples=5  # evaluate only 5 samples
+)
+
+# 5. Inspect the summary
+print("\nEvaluation results:")
+print(f"Total samples: {results['total_samples']}")
+print(f"Correct samples: {results['correct_samples']}")
+print(f"Accuracy: {results['overall_accuracy']:.2%}")
+
+# 6. Inspect the per-sample results
+print("\nDetailed results:")
+for detail in results['detailed_results']:
+    print(f"Sample {detail['sample_id']}:")
+    print(f"  Question: {detail['question'][:50]}...")
+    print(f"  Predicted: {detail['predicted']}")
+    print(f"  Expected: {detail['expected']}")
+    print(f"  Result: {'✅ correct' if detail['success'] else '❌ incorrect'}")
+    print()
+
+# 7. Export the results
+evaluator.export_results(
+    results,
+    output_file="./evaluation_results/bfcl_custom_result.json"
+)
+
+print("✅ Results exported to ./evaluation_results/bfcl_custom_result.json")
+

+ 293 - 0
code/chapter12/04_run_bfcl_evaluation.py

@@ -0,0 +1,293 @@
+"""
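+Chapter 12: BFCL one-click evaluation script
+
+This script runs the complete BFCL evaluation flow:
+1. Check that the BFCL data is available
+2. Run the HelloAgents evaluation
+3. Export the results in BFCL format
+4. Invoke the official BFCL evaluation tool
+5. Display the evaluation results
+
+Usage:
+    python examples/04_run_bfcl_evaluation.py
+
+Optional arguments:
+    --category: evaluation category (default: simple_python)
+    --samples: number of samples (default: 5; 0 means all)
+    --model-name: model name (default: Qwen/Qwen3-8B)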
+"""
+
+import sys
+import subprocess
+from pathlib import Path
+import argparse
+import json
+
+# Add the project root to the import path
+project_root = Path(__file__).parent.parent
+sys.path.insert(0, str(project_root))
+
+from hello_agents import SimpleAgent, HelloAgentsLLM
+from hello_agents.evaluation import BFCLDataset, BFCLEvaluator
+
+
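+# System prompt for function calling
+FUNCTION_CALLING_SYSTEM_PROMPT = """You are a professional function-calling assistant.
+
+Your task: given the user's question and the provided function definitions, generate the correct function call.
+
+Output format requirements:
+1. Output pure JSON only; do not add any explanatory text
+2. Use a JSON array: [{"name": "function_name", "arguments": {"param_name": "param_value"}}]
+3. To call multiple functions, add multiple objects to the array
+4. If no function call is needed, return an empty array: []
+
+Example:
+User question: Look up the weather in Beijing
+Available function: get_weather(city: str)
+Correct output: [{"name": "get_weather", "arguments": {"city": "Beijing"}}]
+
+Notes:
+- Output JSON only; do not add filler such as "Sure" or "Let me help you"
+- Argument values must match the types in the function definition
+- Argument names must match the function definition exactly
+"""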
+
+
+def check_bfcl_data(bfcl_data_dir: Path) -> bool:
+    """Check whether the BFCL data directory exists."""
+    if not bfcl_data_dir.exists():
+        print(f"\n❌ BFCL data directory not found: {bfcl_data_dir}")
+        print("\nClone the BFCL repository first:")
+        print("   git clone --depth 1 https://github.com/ShishirPatil/gorilla.git temp_gorilla")
+        return False
+    return True
+
+
+def run_evaluation(category: str, max_samples: int, model_name: str) -> dict | None:
+    """Run the HelloAgents evaluation; returns None if the BFCL data is missing."""
+    print("\n" + "="*60)
+    print("Step 1: Run the HelloAgents evaluation")
+    print("="*60)
+    
+    # BFCL data directory
+    bfcl_data_dir = project_root / "temp_gorilla" / "berkeley-function-call-leaderboard" / "bfcl_eval" / "data"
+    
+    # Check the data
+    if not check_bfcl_data(bfcl_data_dir):
+        return None
+    
+    # Load the dataset
+    print("\n📚 Loading the BFCL dataset...")
+    dataset = BFCLDataset(bfcl_data_dir=str(bfcl_data_dir), category=category)
+
+    # Create the agent
+    print("\n🤖 Creating the agent...")
+    llm = HelloAgentsLLM()
+    agent = SimpleAgent(
+        name=model_name,
+        llm=llm,
+        system_prompt=FUNCTION_CALLING_SYSTEM_PROMPT,
+        enable_tool_calling=False
+    )
+    print(f"   Agent: {model_name}")
+    print(f"   LLM: {llm.provider}")
+
+    # Create the evaluator
+    evaluator = BFCLEvaluator(dataset=dataset, category=category)
+
+    # Run the evaluation (passing max_samples through)
+    print("\n🔄 Starting evaluation...")
+    if max_samples > 0:
+        print(f"   Samples: {max_samples}")
+        results = evaluator.evaluate(agent, max_samples=max_samples)
+    else:
+        print("   Samples: all")
+        results = evaluator.evaluate(agent, max_samples=None)
+    
+    # Show the results
+    print("\n📊 Evaluation results:")
+    print(f"   Accuracy: {results['overall_accuracy']:.2%}")
+    print(f"   Correct: {results['correct_samples']}/{results['total_samples']}")
+    
+    return results
+
+
+def export_bfcl_format(results: dict, category: str, model_name: str) -> Path:
+    """Export the results in BFCL format."""
+    print("\n" + "="*60)
+    print("Step 2: Export results in BFCL format")
+    print("="*60)
+    
+    # Output directory
+    output_dir = project_root / "evaluation_results" / "bfcl_official"
+    output_dir.mkdir(parents=True, exist_ok=True)
+    
+    # Output file
+    output_file = output_dir / f"BFCL_v4_{category}_result.json"
+    
+    # Create an evaluator (used only for exporting)
+    bfcl_data_dir = project_root / "temp_gorilla" / "berkeley-function-call-leaderboard" / "bfcl_eval" / "data"
+    dataset = BFCLDataset(bfcl_data_dir=str(bfcl_data_dir), category=category)
+    evaluator = BFCLEvaluator(dataset=dataset, category=category)
+    
+    # Export
+    evaluator.export_to_bfcl_format(results, output_file)
+    
+    return output_file
+
+
+def copy_to_bfcl_result_dir(source_file: Path, model_name: str, category: str) -> Path:
+    """Copy the result file into the BFCL result directory."""
+    print("\n" + "="*60)
+    print("Step 3: Prepare the official BFCL evaluation")
+    print("="*60)
+    
+    # BFCL result directory
+    # Note: BFCL replaces "/" in the model name with "_"
+    safe_model_name = model_name.replace("/", "_")
+    result_dir = project_root / "result" / safe_model_name
+    result_dir.mkdir(parents=True, exist_ok=True)
+    
+    # Target file
+    target_file = result_dir / f"BFCL_v4_{category}_result.json"
+    
+    # Copy the file
+    import shutil
+    shutil.copy(source_file, target_file)
+    
+    print("\n✅ Result file copied to:")
+    print(f"   {target_file}")
+    
+    return target_file
+
+
+def run_bfcl_official_eval(model_name: str, category: str) -> bool:
+    """Run the official BFCL evaluation."""
+    print("\n" + "="*60)
+    print("Step 4: Run the official BFCL evaluation")
+    print("="*60)
+    
+    try:
+        # Set environment variables
+        import os
+        os.environ['PYTHONUTF8'] = '1'
+        
+        # Run the BFCL evaluation
+        cmd = [
+            "bfcl", "evaluate",
+            "--model", model_name,
+            "--test-category", category,
+            "--partial-eval"
+        ]
+        
+        print(f"\n🔄 Running command: {' '.join(cmd)}")
+        
+        result = subprocess.run(
+            cmd,
+            cwd=str(project_root),
+            capture_output=True,
+            text=True,
+            encoding='utf-8'
+        )
+        
+        # Show the output
+        if result.stdout:
+            print(result.stdout)
+        
+        if result.returncode != 0:
+            print("\n❌ BFCL evaluation failed:")
+            if result.stderr:
+                print(result.stderr)
+            return False
+        
+        return True
+        
+    except FileNotFoundError:
+        print("\n❌ bfcl command not found")
+        print("   Install it first: pip install bfcl-eval")
+        return False
+    except Exception as e:
+        print(f"\n❌ Error while running the BFCL evaluation: {e}")
+        return False
+
+
+def show_results(model_name: str, category: str):
+    """Display the evaluation results."""
+    print("\n" + "="*60)
+    print("Step 5: Display the evaluation results")
+    print("="*60)
+    
+    # CSV file
+    csv_file = project_root / "score" / "data_non_live.csv"
+    
+    if csv_file.exists():
+        print("\n📊 Evaluation summary:")
+        with open(csv_file, 'r', encoding='utf-8') as f:
+            content = f.read()
+            print(content)
+    else:
+        print(f"\n⚠️ Evaluation result file not found: {csv_file}")
+    
+    # Detailed score file
+    safe_model_name = model_name.replace("/", "_")
+    score_file = project_root / "score" / safe_model_name / "non_live" / f"BFCL_v4_{category}_score.json"
+    
+    if score_file.exists():
+        print("\n📝 Detailed score file:")
+        print(f"   {score_file}")
+        
+        # Read the summary (first line) and show the accuracy
+        with open(score_file, 'r', encoding='utf-8') as f:
+            first_line = f.readline()
+            summary = json.loads(first_line)
+            print("\n🎯 Final result:")
+            print(f"   Accuracy: {summary['accuracy']:.2%}")
+            print(f"   Correct: {summary['correct_count']}/{summary['total_count']}")
+
+
+def main():
+    """Entry point."""
+    parser = argparse.ArgumentParser(description="BFCL one-click evaluation script")
+    parser.add_argument("--category", default="simple_python", help="evaluation category")
+    parser.add_argument("--samples", type=int, default=5, help="number of samples (0 means all)")
+    parser.add_argument("--model-name", default="Qwen/Qwen3-8B",
+                       help="model name (must be a model supported by BFCL; run 'bfcl models' to list them)")
+    
+    args = parser.parse_args()
+    
+    print("="*60)
+    print("BFCL one-click evaluation script")
+    print("="*60)
+    print("\nConfiguration:")
+    print(f"   Category: {args.category}")
+    print(f"   Samples: {args.samples if args.samples > 0 else 'all'}")
+    print(f"   Model name: {args.model_name}")
+    
+    # Step 1: run the evaluation
+    results = run_evaluation(args.category, args.samples, args.model_name)
+    if not results:
+        return
+    
+    # Step 2: export in BFCL format
+    output_file = export_bfcl_format(results, args.category, args.model_name)
+    
+    # Step 3: copy into the BFCL result directory
+    copy_to_bfcl_result_dir(output_file, args.model_name, args.category)
+    
+    # Step 4: run the official BFCL evaluation
+    if not run_bfcl_official_eval(args.model_name, args.category):
+        print("\n⚠️ The official BFCL evaluation failed, but the HelloAgents evaluation completed")
+        return
+    
+    # Step 5: display the results
+    show_results(args.model_name, args.category)
+    
+    print("\n" + "="*60)
+    print("✅ Evaluation complete!")
+    print("="*60)
+
+
+if __name__ == "__main__":
+    main()
+
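
The system prompt in this script asks the model for a JSON array of `{"name": ..., "arguments": {...}}` objects. A reply in that shape could be validated roughly as follows. This is a sketch under assumptions: `parse_function_calls` is a hypothetical helper, not part of HelloAgents or BFCL, and the actual evaluator may parse replies differently.

```python
import json


def parse_function_calls(raw: str) -> list[dict]:
    """Parse a reply that should be a JSON array of {"name", "arguments"} objects.

    Hypothetical helper for illustration; raises ValueError on any malformed reply
    (json.JSONDecodeError is itself a subclass of ValueError).
    """
    calls = json.loads(raw)
    if not isinstance(calls, list):
        raise ValueError("expected a JSON array of function calls")
    for call in calls:
        if not isinstance(call, dict) or not isinstance(call.get("name"), str):
            raise ValueError(f"invalid function call entry: {call!r}")
        if not isinstance(call.get("arguments"), dict):
            raise ValueError(f"'arguments' must be a JSON object: {call!r}")
    return calls
```

An empty array is accepted, since the prompt uses `[]` to mean that no function call is needed.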

+ 85 - 0
code/chapter12/05_gaia_quick_start.py

@@ -0,0 +1,85 @@
+"""
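+Chapter 12, Example 5: GAIA quick start
+
+Corresponding documentation: 12.3.5 Implementing GAIA evaluation in HelloAgents - Method 1
+
+This is the simplest way to run a GAIA evaluation: a single call runs the whole evaluation.
+
+Important:
+1. GAIA is a gated dataset; request access on HuggingFace first
+2. The HF_TOKEN environment variable must be set
+3. The official GAIA system prompt must be used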
+"""
+
+import os
+from hello_agents import SimpleAgent, HelloAgentsLLM
+from hello_agents.tools import GAIAEvaluationTool
+
+# Official GAIA system prompt (required)
+GAIA_SYSTEM_PROMPT = """You are a general AI assistant. I will ask you a question. Report your thoughts, and finish your answer with the following template: FINAL ANSWER: [YOUR FINAL ANSWER].
+YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma separated list of numbers and/or strings.
+If you are asked for a number, don't use comma to write your number neither use units such as $ or percent sign unless specified otherwise.
+If you are asked for a string, don't use articles, neither abbreviations (e.g. for cities), and write the digits in plain text unless specified otherwise.
+If you are asked for a comma separated list, apply the above rules depending of whether the element to be put in the list is a number or a string."""
+
+# 1. Set the HuggingFace token (if not already set)
+# os.environ["HF_TOKEN"] = "your_huggingface_token_here"
+
+# 2. Create the agent (the official GAIA system prompt is required)
+llm = HelloAgentsLLM()
+agent = SimpleAgent(
+    name="TestAgent",
+    llm=llm,
+    system_prompt=GAIA_SYSTEM_PROMPT  # the official prompt is required
+)
+
+# 3. Create the GAIA evaluation tool
+gaia_tool = GAIAEvaluationTool()
+
+# 4. Run the evaluation
+results = gaia_tool.run(
+    agent=agent,
+    level=1,              # difficulty level (1 = easy, 2 = medium, 3 = hard)
+    max_samples=2,        # number of samples to evaluate (0 means all)
+    export_results=True,  # export results in the official GAIA format
+    generate_report=True  # generate a detailed report
+)
+
+# 5. Inspect the results
+print("\nEvaluation results:")
+print(f"Exact match rate: {results['exact_match_rate']:.2%}")
+print(f"Partial match rate: {results['partial_match_rate']:.2%}")
+print(f"Correct: {results['correct_samples']}/{results['total_samples']}")
+
+# Sample output:
+# ============================================================
+# GAIA one-click evaluation
+# ============================================================
+# 
+# Configuration:
+#    Agent: TestAgent
+#    Level: Level 1
+#    Samples: 2
+# 
+# ✅ GAIA dataset loaded
+#    Source: gaia-benchmark/GAIA
+#    Split: validation
+#    Level: 1
+#    Samples: 2
+# 
+# Evaluation progress: 100%|██████████| 2/2 [00:10<00:00,  5.23s/sample]
+# 
+# ✅ Evaluation complete
+#    Total samples: 2
+#    Correct samples: 2
+#    Exact match rate: 100.00%
+#    Partial match rate: 100.00%
+# 
+# ✅ Results exported to ./evaluation_results/gaia_submission.json
+# ✅ Report generated at ./evaluation_results/gaia_report.md
+# 
+# Evaluation results:
+# Exact match rate: 100.00%
+# Partial match rate: 100.00%
+# Correct: 2/2
+
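
The official prompt in this file asks the model to end with `FINAL ANSWER: [YOUR FINAL ANSWER]`, so an evaluator has to pull that answer out of the free-form reply before comparing it. One way to do this is sketched below. `extract_final_answer` is a hypothetical helper, not the extraction logic used by HelloAgents or by the GAIA leaderboard, and stripping square brackets is an assumption about how models echo the template.

```python
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text after the last 'FINAL ANSWER:' marker, or None if it is absent.

    Hypothetical helper for illustration only.
    """
    _, marker, tail = response.rpartition("FINAL ANSWER:")
    if not marker:
        return None
    tail = tail.strip()
    if not tail:
        return ""
    # Keep only the first line after the marker, then drop brackets copied from the template.
    first_line = tail.splitlines()[0].strip()
    return first_line.strip("[]").strip()
```

Using the last occurrence of the marker guards against replies that quote the template while reasoning and only give the real answer at the end.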

+ 149 - 0
code/chapter12/06_gaia_best_practices.py

@@ -0,0 +1,149 @@
+"""
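+Chapter 12, Example 6: GAIA evaluation best practices
+
+Corresponding documentation: 12.3.9 GAIA evaluation best practices
+
+This example demonstrates best practices for GAIA evaluation, including:
+1. Level-by-level evaluation
+2. Quick tests on small samples
+3. Interpreting the results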
+"""
+
+import os
+from hello_agents import SimpleAgent, HelloAgentsLLM
+from hello_agents.tools import GAIAEvaluationTool
+
+# Official GAIA system prompt
+GAIA_SYSTEM_PROMPT = """You are a general AI assistant. I will ask you a question. Report your thoughts, and finish your answer with the following template: FINAL ANSWER: [YOUR FINAL ANSWER].
+YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma separated list of numbers and/or strings.
+If you are asked for a number, don't use comma to write your number neither use units such as $ or percent sign unless specified otherwise.
+If you are asked for a string, don't use articles, neither abbreviations (e.g. for cities), and write the digits in plain text unless specified otherwise.
+If you are asked for a comma separated list, apply the above rules depending of whether the element to be put in the list is a number or a string."""
+
+# Create the agent
+llm = HelloAgentsLLM()
+agent = SimpleAgent(
+    name="TestAgent",
+    llm=llm,
+    system_prompt=GAIA_SYSTEM_PROMPT
+)
+
+# Create the evaluation tool
+gaia_tool = GAIAEvaluationTool()
+
+# ============================================================
+# Best practice 1: level-by-level evaluation
+# ============================================================
+print("="*60)
+print("Best practice 1: level-by-level evaluation")
+print("="*60)
+
+# Step 1: evaluate Level 1 (easy tasks)
+print("\nStep 1: evaluate Level 1 (easy tasks)")
+results_l1 = gaia_tool.run(agent, level=1, max_samples=10)
+print(f"Level 1 exact match rate: {results_l1['exact_match_rate']:.2%}")
+
+# Step 2: if Level 1 went well, evaluate Level 2
+if results_l1['exact_match_rate'] > 0.6:
+    print("\nStep 2: evaluate Level 2 (medium tasks)")
+    results_l2 = gaia_tool.run(agent, level=2, max_samples=10)
+    print(f"Level 2 exact match rate: {results_l2['exact_match_rate']:.2%}")
+    
+    # Step 3: if Level 2 went well, evaluate Level 3
+    if results_l2['exact_match_rate'] > 0.4:
+        print("\nStep 3: evaluate Level 3 (hard tasks)")
+        results_l3 = gaia_tool.run(agent, level=3, max_samples=10)
+        print(f"Level 3 exact match rate: {results_l3['exact_match_rate']:.2%}")
+    else:
+        print("\n⚠️ Level 2 performance is weak; optimize before evaluating Level 3")
+else:
+    print("\n⚠️ Level 1 performance is weak; optimize before evaluating higher levels")
+
+# ============================================================
+# Best practice 2: quick tests on small samples
+# ============================================================
+print("\n" + "="*60)
+print("Best practice 2: quick tests on small samples")
+print("="*60)
+
+# Quick test (2 samples per level)
+for level in [1, 2, 3]:
+    print(f"\nQuick test, Level {level}:")
+    results = gaia_tool.run(agent, level=level, max_samples=2)
+    print(f"  Exact match rate: {results['exact_match_rate']:.2%}")
+
+# ============================================================
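+# Best practice 3: interpreting the results
+# ============================================================
+print("\n" + "="*60)
+print("Best practice 3: interpreting the results")
+print("="*60)
+
+def interpret_results(level, exact_match_rate):
+    """Interpret the evaluation result for one level."""
+    print(f"\nLevel {level} interpretation:")
+    print(f"Exact match rate: {exact_match_rate:.2%}")
+    
+    if level == 1:
+        if exact_match_rate >= 0.6:
+            print("✅ Excellent - solid basic capabilities")
+        elif exact_match_rate >= 0.4:
+            print("⚠️ Good - basic capabilities are usable")
+        else:
+            print("❌ Poor - needs improvement")
+            print("Suggestions:")
+            print("  - Check that the system prompt includes the official GAIA format requirements")
+            print("  - Check that the answer extraction logic is correct")
+            print("  - Check that the LLM is strong enough")
+    
+    elif level == 2:
+        if exact_match_rate >= 0.4:
+            print("✅ Excellent - strong on medium tasks")
+        elif exact_match_rate >= 0.2:
+            print("⚠️ Good - usable on medium tasks")
+        else:
+            print("❌ Poor - needs improvement")
+            print("Suggestions:")
+            print("  - Strengthen multi-step reasoning")
+            print("  - Add more tool-use capability")
+            print("  - Improve how reasoning chains are built")
+    
+    elif level == 3:
+        if exact_match_rate >= 0.2:
+            print("✅ Excellent - strong on complex tasks")
+        elif exact_match_rate >= 0.1:
+            print("⚠️ Good - usable on complex tasks")
+        else:
+            print("❌ Poor - needs improvement")
+            print("Suggestions:")
+            print("  - Strengthen complex reasoning")
+            print("  - Improve long-context handling")
+            print("  - Improve how tool chains are combined")
+
+# Interpret the results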
+if 'results_l1' in locals():
+    interpret_results(1, results_l1['exact_match_rate'])
+if 'results_l2' in locals():
+    interpret_results(2, results_l2['exact_match_rate'])
+if 'results_l3' in locals():
+    interpret_results(3, results_l3['exact_match_rate'])
+
+# ============================================================
+# Difficulty progression analysis
+# ============================================================
+print("\n" + "="*60)
+print("Difficulty progression analysis")
+print("="*60)
+
+if 'results_l1' in locals() and 'results_l2' in locals():
+    if results_l1['exact_match_rate'] > results_l2['exact_match_rate']:
+        print("✅ Expected progression: Level 1 > Level 2")
+    else:
+        print("⚠️ Unexpected: Level 2 >= Level 1 (possibly dataset bias or a trait of the agent)")
+
+if 'results_l2' in locals() and 'results_l3' in locals():
+    if results_l2['exact_match_rate'] > results_l3['exact_match_rate']:
+        print("✅ Expected progression: Level 2 > Level 3")
+    else:
+        print("⚠️ Unexpected: Level 3 >= Level 2 (possibly dataset bias or a trait of the agent)")
+

+ 118 - 0
code/chapter12/07_data_generation_complete_flow.py

@@ -0,0 +1,118 @@
+"""
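+Chapter 12, Example 7: complete evaluation flow for data generation
+
+Corresponding documentation: 12.4.6 Complete evaluation flow
+
+This example runs the complete evaluation flow for generated data:
+1. Generate AIME problems
+2. LLM Judge evaluation
+3. Win Rate evaluation
+4. Human verification
+
+How to run:
+python 07_data_generation_complete_flow.py 30 3.0
+
+Arguments:
+- 30: generate 30 problems
+- 3.0: wait 3 seconds between problems (to avoid rate limits)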
+"""
+
+import sys
+import os
+
+# Add the HelloAgents path
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "HelloAgents"))
+
+from data_generation.run_complete_evaluation import main
+
+if __name__ == "__main__":
+    # Default arguments
+    num_problems = 30
+    delay_seconds = 3.0
+    
+    # Read arguments from the command line
+    if len(sys.argv) > 1:
+        num_problems = int(sys.argv[1])
+    if len(sys.argv) > 2:
+        delay_seconds = float(sys.argv[2])
+    
+    print("="*80)
+    print("🚀 AIME data generation and evaluation: complete flow")
+    print("="*80)
+    print("\nConfiguration:")
+    print(f"  Problems to generate: {num_problems}")
+    print(f"  Delay: {delay_seconds} s per problem")
+    print("  Generation model: gpt-4o")
+    print("  Evaluation model: gpt-4o")
+    print()
+    
+    # Run the complete flow
+    main(num_problems, delay_seconds)
+
+# Sample output:
+# ================================================================================
+# 🚀 AIME data generation and evaluation: complete flow
+# ================================================================================
+# 
+# Configuration:
+#   Problems to generate: 30
+#   Delay: 3.0 s per problem
+#   Generation model: gpt-4o
+#   Evaluation model: gpt-4o
+# 
+# ✅ Loaded 963 reference problems
+# 
+# 🎯 Generating AIME problems
+#    Target count: 30
+#    Generation model: gpt-4o
+#    Delay: 3.0 s per problem
+# 
+# Generating AIME problems: 100%|██████████| 30/30 [02:30<00:00,  5.01s/problem]
+# 
+# ✅ Generation complete
+#    Succeeded: 30/30
+#    Saved to: ./data_generation/generated_data/aime_problems_20241211_143022.json
+# 
+# ========== LLM Judge evaluation ==========
+# 
+# 📊 Starting LLM Judge evaluation
+#    Evaluation model: gpt-4o
+#    Samples: 30
+# 
+# LLM Judge evaluation: 100%|██████████| 30/30 [01:30<00:00,  3.01s/problem]
+# 
+# ✅ LLM Judge evaluation complete
+#    Average score: 3.5/5.0
+#    Dimensions:
+#      - Correctness: 3.8/5.0
+#      - Clarity: 3.6/5.0
+#      - Difficulty match: 3.4/5.0
+#      - Completeness: 3.2/5.0
+# 
+# ========== Win Rate evaluation ==========
+# 
+# 📊 Starting Win Rate evaluation
+#    Evaluation model: gpt-4o
+#    Comparisons: 20
+#    Reference dataset: AIME 2025 (963 problems)
+# 
+# Win Rate evaluation: 100%|██████████| 20/20 [01:00<00:00,  3.01s/comparison]
+# 
+# ✅ Win Rate evaluation complete
+#    Win Rate: 45.0%
+#    Tie Rate: 10.0%
+#    Loss Rate: 45.0%
+# 
+# ========== Human verification ==========
+# 
+# 🎯 Launching the human verification UI
+#    URL: http://127.0.0.1:7860
+# 
+# ✅ Complete evaluation flow finished!
+# 
+# 📊 Evaluation summary:
+#    Problems generated: 30
+#    LLM Judge average score: 3.5/5.0
+#    Win Rate: 45.0%
+#    Recommendation: generation quality is close to that of real AIME problems
+
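
The sample output reports win, tie, and loss rates that add up to 100% over 20 comparisons, which is consistent with 9 wins, 2 ties, and 9 losses. The arithmetic can be sketched as below. `summarize_outcomes` is a hypothetical helper, and counting ties in the denominator is an assumption inferred from the rates summing to 100%, not a statement about how `WinRateEvaluator` is implemented.

```python
from collections import Counter


def summarize_outcomes(outcomes: list[str]) -> dict[str, float]:
    """Turn 'win' / 'tie' / 'loss' labels into rates that sum to 1 (hypothetical helper)."""
    if not outcomes:
        raise ValueError("need at least one comparison")
    counts = Counter(outcomes)
    unknown = set(counts) - {"win", "tie", "loss"}
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    total = len(outcomes)
    # Ties stay in the denominator, so the three rates always sum to 1.
    return {label: counts[label] / total for label in ("win", "tie", "loss")}


rates = summarize_outcomes(["win"] * 9 + ["tie"] * 2 + ["loss"] * 9)
print({label: f"{rate:.1%}" for label, rate in rates.items()})
```

A win rate near 50% then means the judge could not reliably tell generated problems from reference ones, which is the target described for this evaluation.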

+ 167 - 0
code/chapter12/08_data_generation_llm_judge.py

@@ -0,0 +1,167 @@
+"""
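+Chapter 12, Example 8: LLM Judge evaluation
+
+Corresponding documentation: 12.4.3 LLM Judge evaluation
+
+This example shows how to use an LLM Judge to evaluate the quality of generated AIME problems.
+
+The LLM Judge scores each problem on 4 dimensions:
+1. Correctness: whether the problem and its answer are correct
+2. Clarity: whether the problem statement is clear
+3. Difficulty Match: whether the difficulty matches AIME level
+4. Completeness: whether the problem is complete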
+"""
+
+import sys
+import os
+import json
+
+# Add the HelloAgents path
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "HelloAgents"))
+
+from hello_agents import HelloAgentsLLM
+from hello_agents.evaluation import LLMJudge
+
+# 1. Prepare the generated problem data
+generated_problems = [
+    {
+        "problem_id": "generated_001",
+        "problem": "Find the number of positive integers $n$ such that $n^2 + 19n + 92$ is a perfect square.",
+        "answer": "4",
+        "solution": "Let $n^2 + 19n + 92 = m^2$ for some positive integer $m$..."
+    },
+    {
+        "problem_id": "generated_002",
+        "problem": "In triangle $ABC$, $AB = 13$, $BC = 14$, and $CA = 15$. Find the area of the triangle.",
+        "answer": "84",
+        "solution": "Using Heron's formula, $s = (13+14+15)/2 = 21$..."
+    }
+]
+
+# 2. Create the LLM Judge evaluator
+llm = HelloAgentsLLM(model_name="gpt-4o")
+judge = LLMJudge(llm=llm)
+
+# 3. Evaluate each problem
+print("="*60)
+print("LLM Judge evaluation")
+print("="*60)
+
+all_scores = []
+
+for i, problem in enumerate(generated_problems, 1):
+    print(f"\nEvaluating problem {i}/{len(generated_problems)}")
+    print(f"Problem ID: {problem['problem_id']}")
+    
+    # Evaluate a single problem
+    result = judge.evaluate_single(problem)
+    
+    # Show the evaluation result
+    print("\nEvaluation result:")
+    print(f"  Correctness: {result['correctness']}/5")
+    print(f"  Clarity: {result['clarity']}/5")
+    print(f"  Difficulty match: {result['difficulty_match']}/5")
+    print(f"  Completeness: {result['completeness']}/5")
+    print(f"  Average score: {result['average_score']:.2f}/5")
+    print("\nFeedback:")
+    print(f"  {result['feedback']}")
+    
+    all_scores.append(result)
+
+# 4. Compute overall statistics
+print("\n" + "="*60)
+print("Overall statistics")
+print("="*60)
+
+avg_correctness = sum(s['correctness'] for s in all_scores) / len(all_scores)
+avg_clarity = sum(s['clarity'] for s in all_scores) / len(all_scores)
+avg_difficulty = sum(s['difficulty_match'] for s in all_scores) / len(all_scores)
+avg_completeness = sum(s['completeness'] for s in all_scores) / len(all_scores)
+avg_overall = sum(s['average_score'] for s in all_scores) / len(all_scores)
+
+print("\nAverage scores:")
+print(f"  Correctness: {avg_correctness:.2f}/5")
+print(f"  Clarity: {avg_clarity:.2f}/5")
+print(f"  Difficulty match: {avg_difficulty:.2f}/5")
+print(f"  Completeness: {avg_completeness:.2f}/5")
+print(f"  Overall average: {avg_overall:.2f}/5")
+
+# 5. Quality assessment
+print("\nQuality assessment:")
+if avg_overall >= 4.0:
+    print("✅ Excellent - problem quality is high; ready to use as is")
+elif avg_overall >= 3.0:
+    print("⚠️ Good - problem quality is usable; human review recommended")
+elif avg_overall >= 2.0:
+    print("⚠️ Fair - problem quality is mediocre; needs substantial improvement")
+else:
+    print("❌ Poor - problem quality is low; regenerate the problems")
+
+# 6. Save the evaluation results
+output_file = "./evaluation_results/llm_judge_results.json"
+os.makedirs(os.path.dirname(output_file), exist_ok=True)
+
+with open(output_file, 'w', encoding='utf-8') as f:
+    json.dump({
+        'problems': generated_problems,
+        'scores': all_scores,
+        'statistics': {
+            'avg_correctness': avg_correctness,
+            'avg_clarity': avg_clarity,
+            'avg_difficulty': avg_difficulty,
+            'avg_completeness': avg_completeness,
+            'avg_overall': avg_overall
+        }
+    }, f, indent=2, ensure_ascii=False)
+
+print(f"\n✅ Evaluation results saved to {output_file}")
+
+# Sample output:
+# ============================================================
+# LLM Judge evaluation
+# ============================================================
+# 
+# Evaluating problem 1/2
+# Problem ID: generated_001
+# 
+# Evaluation result:
+#   Correctness: 5/5
+#   Clarity: 4/5
+#   Difficulty match: 5/5
+#   Completeness: 5/5
+#   Average score: 4.75/5
+# 
+# Feedback:
+#   This is an excellent AIME-level problem. The problem is well-posed,
+#   the solution is correct, and the difficulty is appropriate.
+# 
+# Evaluating problem 2/2
+# Problem ID: generated_002
+# 
+# Evaluation result:
+#   Correctness: 5/5
+#   Clarity: 5/5
+#   Difficulty match: 3/5
+#   Completeness: 5/5
+#   Average score: 4.50/5
+# 
+# Feedback:
+#   The problem is correct and clear, but the difficulty is slightly
+#   below AIME level. Consider adding more complexity.
+# 
+# ============================================================
+# Overall statistics
+# ============================================================
+# 
+# Average scores:
+#   Correctness: 5.00/5
+#   Clarity: 4.50/5
+#   Difficulty match: 4.00/5
+#   Completeness: 5.00/5
+#   Overall average: 4.62/5
+# 
+# Quality assessment:
+# ✅ Excellent - problem quality is high; ready to use as is
+# 
+# ✅ Evaluation results saved to ./evaluation_results/llm_judge_results.json
+
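Reviewer note: the statistics step in `08_data_generation_llm_judge.py` repeats the same average five times and divides by `len(all_scores)`, which raises `ZeroDivisionError` when no problem was scored. The sketch below shows one way to factor that out. It assumes each result is a dict with the keys used in the script (`correctness`, `clarity`, `difficulty_match`, `completeness`, `average_score`); `summarize_scores` and `quality_band` are hypothetical helper names, not part of HelloAgents, and the band labels are English stand-ins for the printed messages.

```python
from statistics import mean

# Dimension keys as they appear in each LLM Judge result dict in the script.
DIMENSIONS = ("correctness", "clarity", "difficulty_match", "completeness")


def summarize_scores(all_scores):
    """Average every dimension plus the overall score; return {} for empty input."""
    if not all_scores:
        return {}
    summary = {dim: mean(s[dim] for s in all_scores) for dim in DIMENSIONS}
    summary["overall"] = mean(s["average_score"] for s in all_scores)
    return summary


def quality_band(avg_overall):
    """Map an overall average to the same four bands the script prints."""
    if avg_overall >= 4.0:
        return "excellent"
    if avg_overall >= 3.0:
        return "good"
    if avg_overall >= 2.0:
        return "fair"
    return "poor"
```

With the two sample results from the run output above, this gives an overall average of 4.625, which the script prints as 4.62 and classifies as the top band.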

+ 170 - 0
code/chapter12/09_data_generation_win_rate.py

@@ -0,0 +1,170 @@
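+"""
+Chapter 12, Example 9: Win Rate evaluation
+
+Corresponding document section: 12.4.4 Win Rate evaluation
+
+This example shows how to use Win Rate to evaluate the quality of generated AIME problems.
+
+Win Rate evaluation compares generated problems against real AIME problems:
+- Win Rate = 50%: generated quality is on par with the real problems (the ideal case)
+- Win Rate > 50%: generated problems are preferred over the real ones (possibly evaluation bias)
+- Win Rate < 50%: generated quality is below the real problems (needs improvement)
+"""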
+
+import sys
+import os
+import json
+
+# Add the local HelloAgents checkout to the import path
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "HelloAgents"))
+
+from hello_agents import HelloAgentsLLM
+from hello_agents.evaluation import WinRateEvaluator, AIDataset
+
+# 1. Prepare the generated problems
+generated_problems = [
+    {
+        "problem_id": "generated_001",
+        "problem": "Find the number of positive integers $n$ such that $n^2 + 19n + 92$ is a perfect square.",
+        "answer": "0"  # (2n+19)^2 + 7 = (2m)^2 forces n = -8 or n = -11, so no positive n works
+    },
+    {
+        "problem_id": "generated_002",
+        "problem": "In triangle $ABC$, $AB = 13$, $BC = 14$, and $CA = 15$. Find the area of the triangle.",
+        "answer": "84"
+    },
+    {
+        "problem_id": "generated_003",
+        "problem": "How many positive integers less than 1000 are divisible by 7 but not by 11?",
+        "answer": "130"  # floor(999/7) - floor(999/77) = 142 - 12
+    }
+]
+
+# 2. Load the reference dataset (real AIME problems)
+print("="*60)
+print("Win Rate评估")
+print("="*60)
+
+print("\n加载参考数据集...")
+dataset = AIDataset()
+reference_problems = dataset.load()
+print(f"✅ 已加载 {len(reference_problems)} 道AIME真题")
+
+# 3. Create the Win Rate evaluator
+llm = HelloAgentsLLM(model_name="gpt-4o")
+evaluator = WinRateEvaluator(
+    llm=llm,
+    reference_problems=reference_problems
+)
+
+# 4. Run the Win Rate evaluation
+print(f"\n开始Win Rate评估...")
+print(f"  生成题目数: {len(generated_problems)}")
+print(f"  对比数量: 20")
+
+results = evaluator.evaluate(
+    generated_problems=generated_problems,
+    num_comparisons=20  # run 20 pairwise comparisons
+)
+
+# 5. Display the evaluation results
+print("\n" + "="*60)
+print("评估结果")
+print("="*60)
+
+print(f"\nWin Rate: {results['win_rate']:.2%}")
+print(f"Tie Rate: {results['tie_rate']:.2%}")
+print(f"Loss Rate: {results['loss_rate']:.2%}")
+
+print(f"\n详细统计:")
+print(f"  总对比数: {results['total_comparisons']}")
+print(f"  生成题目胜: {results['wins']}")
+print(f"  平局: {results['ties']}")
+print(f"  真题胜: {results['losses']}")
+
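+# 6. Quality assessment
+print(f"\n质量评估:")
+win_rate = results['win_rate']
+
+if win_rate > 0.55:
+    # Per the module docstring, a win rate well above 50% points to possible judge bias, not to poor quality
+    print("⚠️ 偏高 - 生成题目胜率高于真题,可能存在评估偏差,建议人工复核")
+elif 0.45 <= win_rate <= 0.55:
+    print("✅ 优秀 - 生成质量接近AIME真题水平")
+elif 0.35 <= win_rate < 0.45:
+    print("⚠️ 良好 - 生成质量可用,但略低于真题")
+elif 0.25 <= win_rate < 0.35:
+    print("⚠️ 一般 - 生成质量一般,需要改进")
+else:
+    print("❌ 较差 - 生成质量差,需要大幅改进")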
+
+# 7. Show details for the first few comparisons
+print("\n" + "="*60)
+print("对比详情(前5个)")
+print("="*60)
+
+for i, comparison in enumerate(results['comparisons'][:5], 1):
+    print(f"\n对比 {i}:")
+    print(f"  生成题目: {comparison['generated_problem'][:60]}...")
+    print(f"  真题: {comparison['reference_problem'][:60]}...")
+    print(f"  结果: {comparison['result']}")
+    if 'reason' in comparison:
+        print(f"  理由: {comparison['reason'][:100]}...")
+
+# 8. Save the evaluation results
+output_file = "./evaluation_results/win_rate_results.json"
+os.makedirs(os.path.dirname(output_file), exist_ok=True)
+
+with open(output_file, 'w', encoding='utf-8') as f:
+    json.dump(results, f, indent=2, ensure_ascii=False)
+
+print(f"\n✅ 评估结果已保存到 {output_file}")
+
+# 运行输出示例:
+# ============================================================
+# Win Rate评估
+# ============================================================
+# 
+# 加载参考数据集...
+# ✅ 已加载 963 道AIME真题
+# 
+# 开始Win Rate评估...
+#   生成题目数: 3
+#   对比数量: 20
+# 
+# Win Rate评估: 100%|██████████| 20/20 [01:00<00:00,  3.01s/对比]
+# 
+# ============================================================
+# 评估结果
+# ============================================================
+# 
+# Win Rate: 45.00%
+# Tie Rate: 10.00%
+# Loss Rate: 45.00%
+# 
+# 详细统计:
+#   总对比数: 20
+#   生成题目胜: 9
+#   平局: 2
+#   真题胜: 9
+# 
+# 质量评估:
+# ✅ 优秀 - 生成质量接近AIME真题水平
+# 
+# ============================================================
+# 对比详情(前5个)
+# ============================================================
+# 
+# 对比 1:
+#   生成题目: Find the number of positive integers $n$ such that $n^2 + 19...
+#   真题: Let $N$ be the number of consecutive $0$'s at the right end...
+#   结果: generated
+#   理由: The generated problem has a clearer problem statement and a mo...
+# 
+# 对比 2:
+#   生成题目: In triangle $ABC$, $AB = 13$, $BC = 14$, and $CA = 15$. F...
+#   真题: Find the number of ordered pairs $(m,n)$ of positive integers...
+#   结果: reference
+#   理由: The reference problem is more challenging and requires deeper...
+# 
+# ...
+# 
+# ✅ 评估结果已保存到 ./evaluation_results/win_rate_results.json
+
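Reviewer note: the quality bands in `09_data_generation_win_rate.py` are easier to check when the classification is a pure function. The sketch below is a minimal version under two assumptions: the rates are derived from raw win/tie/loss counts as in the printed statistics, and a win rate above 55% is flagged separately because the module docstring attributes it to possible evaluation bias. `win_rate_stats` and `win_rate_band` are hypothetical names, not part of `WinRateEvaluator`.

```python
def win_rate_stats(wins, ties, losses):
    """Turn raw comparison counts into the rates reported by the script."""
    total = wins + ties + losses
    if total == 0:
        raise ValueError("no comparisons recorded")
    return {
        "win_rate": wins / total,
        "tie_rate": ties / total,
        "loss_rate": losses / total,
        "total_comparisons": total,
    }


def win_rate_band(win_rate):
    """Classify a win rate; values above 0.55 are treated as suspicious, not poor."""
    if win_rate > 0.55:
        return "suspicious"  # generated problems beat real ones too often
    if win_rate >= 0.45:
        return "excellent"
    if win_rate >= 0.35:
        return "good"
    if win_rate >= 0.25:
        return "fair"
    return "poor"
```

For the sample run above (9 wins, 2 ties, 9 losses) this yields a 45% win rate in the top band. The committed report further down (5 wins, 2 ties, 13 losses) lands at 25%.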

+ 279 - 0
code/chapter12/README.md

@@ -0,0 +1,279 @@
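+# Chapter 12 Example Code
+
+This directory contains all example code for Chapter 12, "Agent Performance Evaluation". The files are numbered in the order they appear in the document so that readers can follow along.
+
+## 📁 File List
+
+| File | Section | Description |
+|--------|---------|------|
+| `01_basic_agent_example.py` | 12.1.1 | Basic agent example that shows why evaluation is needed |
+| `02_bfcl_quick_start.py` | 12.2.5 | BFCL quick start (one-call evaluation) |
+| `03_bfcl_custom_evaluation.py` | 12.2.5 | BFCL custom evaluation (low-level components) |
+| `04_run_bfcl_evaluation.py` | 12.2.9 | BFCL evaluation best practices |
+| `05_gaia_quick_start.py` | 12.3.5 | GAIA quick start (one-call evaluation) |
+| `06_gaia_best_practices.py` | 12.3.9 | GAIA evaluation best practices |
+| `07_data_generation_complete_flow.py` | 12.4.6 | Complete evaluation flow for data generation |
+| `08_data_generation_llm_judge.py` | 12.4.3 | LLM Judge evaluation |
+| `09_data_generation_win_rate.py` | 12.4.4 | Win Rate evaluation |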
+
+## 🚀 Quick Start
+
+### Environment Setup
+
+1. **Install the HelloAgents framework** (the quotes keep shells such as zsh from expanding the brackets):
+   ```bash
+   pip install "hello-agents[evaluation]==0.2.3"
+   ```
+
+2. **Set the environment variables**:
+   ```bash
+   # OpenAI API key (used for GPT-4o)
+   export OPENAI_API_KEY="your_openai_api_key"
+   
+   # HuggingFace token (used for the GAIA dataset)
+   export HF_TOKEN="your_huggingface_token"
+   ```
+
+3. **Download the BFCL dataset** (optional; it is downloaded automatically on first run):
+   ```bash
+   cd ../HelloAgents
+   git clone https://github.com/ShishirPatil/gorilla.git temp_gorilla
+   ```
+
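+### Running the Examples
+
+#### 1. Basic Agent Example
+
+```bash
+python 01_basic_agent_example.py
+```
+
+This example runs a basic ReAct agent and shows why an evaluation system is needed.
+
+#### 2. BFCL Quick Start
+
+```bash
+python 02_bfcl_quick_start.py
+```
+
+This is the simplest way to run a BFCL evaluation: a single call performs the whole evaluation.
+
+**Expected output**: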
+```
+准确率: 100.00%
+正确数: 5/5
+```
+
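+#### 3. BFCL Custom Evaluation
+
+```bash
+python 03_bfcl_custom_evaluation.py
+```
+
+Shows how to build a custom evaluation flow from the low-level components.
+
+#### 4. BFCL Best Practices
+
+```bash
+python 04_run_bfcl_evaluation.py
+```
+
+Demonstrates best practices for BFCL evaluation, including:
+- Progressive evaluation
+- Multi-category evaluation
+- Comparative evaluation
+- Error analysis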
+
+#### 5. GAIA Quick Start
+
+**Important**: GAIA is a gated dataset, so you must request access first.
+
+1. Visit https://huggingface.co/datasets/gaia-benchmark/GAIA
+2. Click "Request Access"
+3. Wait for approval (usually 1-2 days)
+4. Set the HF_TOKEN environment variable
+
+```bash
+python 05_gaia_quick_start.py
+```
+
+**Expected output**:
+```
+精确匹配率: 100.00%
+部分匹配率: 100.00%
+正确数: 2/2
+```
+
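+#### 6. GAIA Best Practices
+
+```bash
+python 06_gaia_best_practices.py
+```
+
+Demonstrates best practices for GAIA evaluation, including:
+- Evaluation by difficulty level
+- Quick tests on small samples
+- Interpreting the results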
+
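+#### 7. Complete Evaluation Flow for Data Generation
+
+```bash
+python 07_data_generation_complete_flow.py 30 3.0
+```
+
+Arguments:
+- `30`: generate 30 problems
+- `3.0`: wait 3 seconds between problems
+
+This example walks through the complete evaluation flow for generated data:
+1. Generate AIME problems
+2. LLM Judge evaluation
+3. Win Rate evaluation
+4. Human verification
+
+**Expected output**: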
+```
+生成数量: 30道题目
+LLM Judge平均分: 3.5/5.0
+Win Rate: 45.0%
+建议: 生成质量接近AIME真题水平
+```
+
+#### 8. LLM Judge Evaluation
+
+```bash
+python 08_data_generation_llm_judge.py
+```
+
+Shows how to use an LLM Judge to evaluate the quality of generated AIME problems.
+
+**Expected output**:
+```
+平均分:
+  正确性: 5.00/5
+  清晰度: 4.50/5
+  难度匹配: 4.00/5
+  完整性: 5.00/5
+  总体平均: 4.62/5
+
+质量评估:
+✅ 优秀 - 题目质量很高,可以直接使用
+```
+
+#### 9. Win Rate Evaluation
+
+```bash
+python 09_data_generation_win_rate.py
+```
+
+Shows how to use Win Rate to evaluate the quality of generated AIME problems.
+
+**Expected output**:
+```
+Win Rate: 45.00%
+Tie Rate: 10.00%
+Loss Rate: 45.00%
+
+质量评估:
+✅ 优秀 - 生成质量接近AIME真题水平
+```
+
+## 📊 Learning Paths
+
+### Beginner Path
+
+1. **Understand why evaluation is needed**:
+   - Run `01_basic_agent_example.py`
+
+2. **Learn BFCL evaluation**:
+   - Run `02_bfcl_quick_start.py` (quick start)
+   - Run `04_run_bfcl_evaluation.py` (best practices)
+
+3. **Learn GAIA evaluation**:
+   - Run `05_gaia_quick_start.py` (quick start)
+   - Run `06_gaia_best_practices.py` (best practices)
+
+### Advanced Path
+
+1. **Custom evaluation flow**:
+   - Run `03_bfcl_custom_evaluation.py`
+
+2. **Data generation evaluation**:
+   - Run `08_data_generation_llm_judge.py` (LLM Judge)
+   - Run `09_data_generation_win_rate.py` (Win Rate)
+   - Run `07_data_generation_complete_flow.py` (complete flow)
+
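+## 💡 FAQ
+
+### Q1: Running an example reports that a module cannot be found?
+
+A: Make sure the HelloAgents framework is installed:
+```bash
+cd ../HelloAgents
+pip install -e .
+```
+
+### Q2: The BFCL evaluation reports that the dataset cannot be found?
+
+A: The dataset is downloaded automatically on first run, so check your network connection. If the download fails, fetch it manually:
+```bash
+cd ../HelloAgents
+git clone https://github.com/ShishirPatil/gorilla.git temp_gorilla
+```
+
+### Q3: The GAIA evaluation reports that access is denied?
+
+A: GAIA is a gated dataset, so you must request access first:
+1. Visit https://huggingface.co/datasets/gaia-benchmark/GAIA
+2. Click "Request Access"
+3. Wait for approval
+4. Set the HF_TOKEN environment variable
+
+### Q4: The evaluation is too slow?
+
+A: Reduce the number of samples:
+```python
+# BFCL evaluation
+results = bfcl_tool.run(agent, category="simple_python", max_samples=5)
+
+# GAIA evaluation
+results = gaia_tool.run(agent, level=1, max_samples=2)
+
+# Data generation evaluation (shell command, generates only 10 problems):
+#   python 07_data_generation_complete_flow.py 10 3.0
+```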
+
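+### Q5: How do I estimate the evaluation cost?
+
+A: The cost comes mainly from LLM API calls. The figures below are rough estimates in CNY:
+
+**BFCL evaluation**:
+- About 1 API call per sample
+- About 0.01-0.02 CNY per sample
+- A full evaluation (400 samples) costs about 4-8 CNY
+
+**GAIA evaluation**:
+- About 1-5 API calls per sample (depending on task complexity)
+- About 0.05-0.20 CNY per sample
+- A full evaluation (466 samples) costs about 23-93 CNY
+
+**Data generation evaluation**:
+- Generation: about 0.05 CNY per problem
+- LLM Judge: about 0.02 CNY per problem
+- Win Rate: about 0.02 CNY per comparison
+- Generating and evaluating 30 problems costs about 2-3 CNY
+
+## 📚 Related Resources
+
+- **HelloAgents framework**: https://github.com/jjyaoao/HelloAgents
+- **BFCL official repository**: https://github.com/ShishirPatil/gorilla
+- **GAIA official dataset page**: https://huggingface.co/datasets/gaia-benchmark/GAIA
+
+## 🤝 Contributing
+
+If you find a problem in the example code or have a suggestion for improvement, please open an Issue or a Pull Request.
+
+## 📄 License
+
+This example code is released under the same license as the HelloAgents framework.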
+
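Reviewer note: the cost figures in Q5 are simple products, so a small calculator makes them easy to re-derive for other sample counts. The sketch below copies the per-unit ranges from the README as-is (they are the README's rough estimates, not measured prices); `estimate_cost` and `estimate_generation_cost` are hypothetical helpers.

```python
# Per-sample cost ranges in CNY, copied from Q5 of the README above.
RATES = {
    "bfcl": (0.01, 0.02),
    "gaia": (0.05, 0.20),
}


def estimate_cost(benchmark, num_samples):
    """Return the (low, high) cost estimate in CNY for one benchmark run."""
    low, high = RATES[benchmark]
    return num_samples * low, num_samples * high


def estimate_generation_cost(num_problems, num_comparisons,
                             gen=0.05, judge=0.02, compare=0.02):
    """Cost in CNY of generating, judging, and pairwise-comparing problems."""
    return num_problems * (gen + judge) + num_comparisons * compare
```

For example, `estimate_cost("gaia", 466)` gives roughly 23.3 to 93.2 CNY, and 30 generated problems with 20 comparisons come to about 2.5 CNY, both consistent with the ranges quoted in Q5.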

+ 461 - 0
code/chapter12/data_generation/aime_generator.py

@@ -0,0 +1,461 @@
+"""
+AIME math problem generator
+
+Generates AIME-style math problems with the HelloAgents framework
+"""
+
+import json
+import os
+import time
+import random
+from typing import List, Dict, Any, Optional
+from datetime import datetime
+from tqdm import tqdm
+from hello_agents import SimpleAgent
+from hello_agents import HelloAgentsLLM
+from datasets import load_dataset
+
+
+class AIMEGenerator:
+    """AIME problem generator"""
+    
+    # Prompt used to generate AIME problems (in English)
+    GENERATION_PROMPT = """You are a professional mathematics competition problem designer, skilled in creating AIME (American Invitational Mathematics Examination) style problems.
+
+AIME Problem Characteristics:
+1. Answer: An integer between 0 and 999
+2. Topics: Algebra, Geometry, Number Theory, Combinatorics, Probability, etc.
+3. Style: Requires multi-step reasoning, but no advanced theory
+4. Difficulty: Medium to hard (similar to AIME problems 6-9)
+
+Please generate an AIME-style mathematics problem, including:
+1. Problem statement (clear and complete)
+2. Answer (an integer between 0 and 999)
+3. Detailed solution (including all reasoning steps)
+4. Topic classification (Algebra/Geometry/Number Theory/Combinatorics/Probability)
+
+Please output in the following JSON format, avoid using special escape characters in JSON:
+```json
+{
+    "problem": "Problem statement in English",
+    "answer": 123,
+    "solution": "Detailed solution steps in English",
+    "topic": "Algebra"
+}
+```
+"""
+    
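+    def __init__(
+        self,
+        llm: Optional[HelloAgentsLLM] = None,
+        delay_seconds: float = 1.0,
+        use_reference_examples: bool = True,
+        reference_dataset: str = "TianHongZXY/aime-1983-2025"
+    ):
+        """
+        Initialize the generator.
+
+        Args:
+            llm: LLM instance (optional; a default HelloAgentsLLM is created if omitted)
+            delay_seconds: Delay in seconds between generations, to avoid API rate limits
+            use_reference_examples: Whether to include real AIME problems as reference examples
+            reference_dataset: Reference dataset name; defaults to TianHongZXY/aime-1983-2025 (900+ problems)
+        """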
+        # Create a default HelloAgentsLLM when no llm is provided
+        if llm is None:
+            self.llm = HelloAgentsLLM()
+        else:
+            self.llm = llm
+
+        self.agent = SimpleAgent(
+            name="AIME Generator",
+            llm=self.llm,
+            system_prompt="你是一位专业的数学竞赛题目设计专家。"
+        )
+        self.delay_seconds = delay_seconds
+        self.use_reference_examples = use_reference_examples
+        self.reference_examples = []
+
+        # Load the reference examples
+        if use_reference_examples:
+            try:
+                print(f"📚 加载AIME真题数据集: {reference_dataset}")
+                # Try the "train" split first, then fall back to "test"
+                try:
+                    dataset = load_dataset(reference_dataset, split="train")
+                except Exception:  # a bare except would also swallow KeyboardInterrupt
+                    dataset = load_dataset(reference_dataset, split="test")
+
+                # Keep every problem as a potential reference
+                self.reference_examples = list(dataset)
+                print(f"   ✓ 已加载 {len(self.reference_examples)} 道参考题目")
+
+                # Count problems per year (if a "year" field is present)
+                year_counts = {}
+                for item in self.reference_examples:
+                    year = item.get('year')
+                    if year:
+                        year_counts[year] = year_counts.get(year, 0) + 1
+
+                if year_counts:
+                    year_range = f"{min(year_counts.keys())}-{max(year_counts.keys())}"
+                    print(f"   ℹ️  年份范围: {year_range}")
+
+            except Exception as e:
+                print(f"   ⚠️ 加载参考样例失败: {e}")
+                print(f"   ℹ️  将使用默认提示词生成")
+                self.use_reference_examples = False
+    
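+    def generate_single(self, max_retries: int = 3) -> Dict[str, Any]:
+        """
+        Generate a single problem.
+
+        Args:
+            max_retries: Maximum number of attempts
+
+        Returns:
+            The problem data, or a placeholder problem if every attempt fails
+        """
+        # Build the prompt
+        prompt = self._build_prompt()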
+
+        for attempt in range(max_retries):
+            try:
+                response = self.agent.run(prompt)
+                return self._parse_response(response)
+            except Exception as e:
+                if attempt < max_retries - 1:
+                    tqdm.write(f"⚠️ 生成失败(尝试 {attempt + 1}/{max_retries}),{self.delay_seconds}秒后重试...")
+                    time.sleep(self.delay_seconds)
+                else:
+                    tqdm.write(f"❌ 生成失败,已达最大重试次数: {e}")
+                    return self._get_default_problem()
+
+    def _build_prompt(self) -> str:
+        """Build the generation prompt."""
+        if not self.use_reference_examples or not self.reference_examples:
+            return self.GENERATION_PROMPT
+
+        # Pick one reference example at random
+        example = random.choice(self.reference_examples)
+        example_problem = example.get('problem', 'Example problem')
+        example_answer = example.get('answer', 0)
+
+        # Build the prompt with the reference example (in English)
+        prompt = f"""You are a professional mathematics competition problem designer, skilled in creating AIME (American Invitational Mathematics Examination) style problems.
+
+【Reference Example】(For style reference only, please generate a completely different problem)
+Problem: {example_problem}
+Answer: {example_answer}
+
+AIME Problem Characteristics:
+1. Answer: An integer between 0 and 999
+2. Topics: Algebra, Geometry, Number Theory, Combinatorics, Probability, etc.
+3. Style: Requires multi-step reasoning, but no advanced theory
+4. Difficulty: Medium to hard (similar to AIME problems 6-9)
+
+Please generate a **completely different** AIME-style mathematics problem, including:
+1. Problem statement (clear and complete, different from the reference)
+2. Answer (an integer between 0 and 999, different from the reference)
+3. Detailed solution (including all reasoning steps)
+4. Topic classification (Algebra/Geometry/Number Theory/Combinatorics/Probability)
+
+Please output in the following JSON format, avoid using special escape characters in JSON:
+```json
+{{
+    "problem": "Problem statement in English",
+    "answer": 123,
+    "solution": "Detailed solution steps in English",
+    "topic": "Algebra"
+}}
+```
+
+Important Notes:
+- **Must generate a completely different problem from the reference**
+- You can reference the style, but do not copy the content
+- Ensure the problem is creative and original
+"""
+        return prompt
+
+    def _parse_response(self, response: str) -> Dict[str, Any]:
+        """Parse the LLM response, with best-effort repair of LaTeX backslashes."""
+        import re
+
+        # Extract the JSON payload
+        if "```json" in response:
+            json_str = response.split("```json")[1].split("```")[0].strip()
+        elif "```" in response:
+            json_str = response.split("```")[1].split("```")[0].strip()
+        else:
+            json_str = response.strip()
+
+        # Try a plain parse first (strict=False is not used here)
+        try:
+            problem_data = json.loads(json_str)
+        except json.JSONDecodeError as e:
+            # On failure, repair common LaTeX escaping problems by doubling every
+            # backslash that does not start a valid JSON escape, e.g. \sqrt -> \\sqrt.
+            # Limitation: commands that start with b, f, n, r, or t (\frac, \times,
+            # \theta, ...) are valid JSON escapes, so the first parse silently decodes
+            # them as control characters and this regex never touches them.
+            fixed_json_str = re.sub(r'(?<!\\)\\(?!["\\/bfnrtu])', r'\\\\', json_str)
+
+            try:
+                problem_data = json.loads(fixed_json_str)
+            except json.JSONDecodeError:
+                # Still failing: print diagnostics and re-raise
+                print(f"❌ JSON解析失败:")
+                print(f"原始响应: {response[:500]}...")
+                print(f"提取的JSON: {json_str[:500]}...")
+                raise
+
+        # Validate the required fields
+        if "problem" not in problem_data or "answer" not in problem_data:
+            raise ValueError("缺少必需字段: problem 或 answer")
+
+        # Normalize the answer to an int and clamp it to the AIME range 0-999
+        answer = int(problem_data.get("answer", 0))
+        if not (0 <= answer <= 999):
+            print(f"⚠️ 答案超出范围: {answer},调整为0-999范围内")
+            answer = max(0, min(999, answer))
+        # Always store the int, so later statistics (sum/min/max) work even if the LLM returned a string
+        problem_data["answer"] = answer
+
+        # Fill in defaults for optional fields
+        problem_data.setdefault("solution", "No solution provided")
+        problem_data.setdefault("topic", "Uncategorized")
+
+        return problem_data
+
+    def _get_default_problem(self) -> Dict[str, Any]:
+        """Return a placeholder problem (used when generation fails)."""
+        return {
+            "problem": "生成失败,请重新生成",
+            "answer": 0,
+            "solution": "N/A",
+            "topic": "未知"
+        }
+    
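+    def generate_batch(
+        self,
+        num_problems: int = 30,
+        checkpoint_path: Optional[str] = None
+    ) -> List[Dict[str, Any]]:
+        """
+        Generate a batch of problems.
+
+        Args:
+            num_problems: Number of problems to generate
+            checkpoint_path: Checkpoint file path (used to save and resume progress)
+
+        Returns:
+            The list of problems
+        """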
+        print(f"\n🎯 开始生成AIME题目")
+        print(f"   目标数量: {num_problems}")
+        print(f"   生成模型: {self.llm.model}")
+        print(f"   延迟设置: {self.delay_seconds}秒/题")
+
+        # Try to resume from the checkpoint
+        problems = []
+        start_index = 0
+
+        if checkpoint_path and os.path.exists(checkpoint_path):
+            print(f"\n📂 发现检查点文件,尝试恢复...")
+            try:
+                with open(checkpoint_path, 'r', encoding='utf-8') as f:
+                    problems = json.load(f)
+                start_index = len(problems)
+                print(f"   ✓ 已恢复 {start_index} 个题目,从第 {start_index + 1} 个继续")
+            except Exception as e:
+                print(f"   ⚠️ 恢复失败: {e},从头开始")
+                problems = []
+                start_index = 0
+
+        # Generate the problems (tqdm shows progress)
+        with tqdm(total=num_problems, initial=start_index, desc="生成AIME题目", unit="题") as pbar:
+            last_call_time = 0  # time the previous API call finished
+
+            for i in range(start_index, num_problems):
+                # Time elapsed since the previous call
+                if last_call_time > 0:
+                    elapsed = time.time() - last_call_time
+                    # Wait out the remainder if less than delay_seconds has passed
+                    if elapsed < self.delay_seconds:
+                        wait_time = self.delay_seconds - elapsed
+                        tqdm.write(f"⏳ 等待 {wait_time:.1f} 秒以避免速率限制...")
+                        time.sleep(wait_time)
+
+                # Record the start time
+                start_time = time.time()
+
+                # Generate one problem
+                problem = self.generate_single()
+                problem["id"] = f"gen_aime_{i + 1}"
+                problem["generated_at"] = datetime.now().isoformat()
+
+                # Record the end time
+                last_call_time = time.time()
+                generation_time = last_call_time - start_time
+
+                problems.append(problem)
+
+                # Update the progress bar postfix
+                pbar.set_postfix({
+                    "主题": problem.get('topic', 'N/A'),
+                    "答案": problem.get('answer', 'N/A'),
+                    "耗时": f"{generation_time:.1f}s"
+                })
+                pbar.update(1)
+
+                # Save a checkpoint after every problem
+                if checkpoint_path:
+                    try:
+                        with open(checkpoint_path, 'w', encoding='utf-8') as f:
+                            json.dump(problems, f, ensure_ascii=False, indent=2)
+                    except Exception as e:
+                        tqdm.write(f"⚠️ 保存检查点失败: {e}")
+
+        print(f"\n✅ 生成完成!共 {len(problems)} 个题目")
+        return problems
+    
+    def save_problems(
+        self,
+        problems: List[Dict[str, Any]],
+        output_path: str
+    ):
+        """Save the problems to a JSON file."""
+        # Make sure the directory exists; fall back to "." when output_path has no directory part
+        os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
+        
+        with open(output_path, 'w', encoding='utf-8') as f:
+            json.dump(problems, f, ensure_ascii=False, indent=2)
+        
+        print(f"\n💾 题目已保存: {output_path}")
+    
+    def generate_and_save(
+        self,
+        num_problems: int = 30,
+        output_dir: str = "data_generation/generated_data"
+    ) -> str:
+        """Generate problems, save them, and write a statistics report."""
+        # Create the output directory
+        os.makedirs(output_dir, exist_ok=True)
+
+        # Remove stale checkpoint files from earlier runs
+        for file in os.listdir(output_dir):
+            if file.startswith("checkpoint_") and file.endswith(".json"):
+                old_checkpoint = os.path.join(output_dir, file)
+                try:
+                    os.remove(old_checkpoint)
+                    print(f"🗑️  已删除旧检查点文件: {file}")
+                except Exception as e:
+                    print(f"⚠️ 删除旧检查点失败: {e}")
+
+        # Set the checkpoint path
+        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+        checkpoint_path = os.path.join(output_dir, f"checkpoint_{timestamp}.json")
+
+        # Generate the problems (with checkpointing)
+        problems = self.generate_batch(num_problems, checkpoint_path=checkpoint_path)
+
+        # Save the problems
+        output_path = os.path.join(output_dir, f"aime_generated_{timestamp}.json")
+        self.save_problems(problems, output_path)
+
+        # Write the statistics report
+        self._generate_statistics_report(problems, output_dir, timestamp)
+
+        # Delete the checkpoint file now that the final output is saved
+        if os.path.exists(checkpoint_path):
+            try:
+                os.remove(checkpoint_path)
+                print(f"\n🗑️  已删除检查点文件")
+            except Exception as e:
+                print(f"\n⚠️ 删除检查点文件失败: {e}")
+
+        return output_path
+    
+    def _generate_statistics_report(
+        self,
+        problems: List[Dict[str, Any]],
+        output_dir: str,
+        timestamp: str
+    ):
+        """Write a Markdown statistics report."""
+        # Count problems per topic
+        topics = {}
+        answers = []
+
+        for problem in problems:
+            topic = problem.get("topic", "未知")
+            topics[topic] = topics.get(topic, 0) + 1
+
+            if "answer" in problem:
+                answers.append(problem["answer"])
+        
+        # 生成报告
+        report = f"""# AIME题目生成统计报告
+
+## 基本信息
+
+- **生成时间**: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
+- **题目数量**: {len(problems)}
+
+## 主题分布
+
+| 主题 | 数量 | 占比 |
+|------|------|------|
+"""
+        
+        for topic, count in sorted(topics.items(), key=lambda x: x[1], reverse=True):
+            percentage = count / len(problems) * 100
+            report += f"| {topic} | {count} | {percentage:.1f}% |\n"
+
+        if answers:
+            report += f"""
+## 答案分析
+
+- **平均答案**: {sum(answers) / len(answers):.2f}
+- **最小答案**: {min(answers)}
+- **最大答案**: {max(answers)}
+- **答案范围**: {min(answers)}-{max(answers)}
+"""
+        
+        report += f"""
+## 题目列表
+
+| ID | 主题 | 答案 |
+|-----|------|------|
+"""
+
+        for problem in problems[:10]:  # 只显示前10个
+            report += f"| {problem.get('id', 'N/A')} | {problem.get('topic', 'N/A')} | {problem.get('answer', 'N/A')} |\n"
+        
+        if len(problems) > 10:
+            report += f"\n*(仅显示前10个题目,完整列表请查看JSON文件)*\n"
+        
+        report += f"""
+---
+
+*报告生成时间: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}*
+"""
+        
+        # Save the report
+        report_path = os.path.join(output_dir, f"generation_report_{timestamp}.md")
+        with open(report_path, 'w', encoding='utf-8') as f:
+            f.write(report)
+        
+        print(f"📊 统计报告已保存: {report_path}")
+
+
+if __name__ == "__main__":
+    # Create the generator
+    generator = AIMEGenerator()
+    
+    # Generate 30 problems
+    output_path = generator.generate_and_save(num_problems=30)
+    
+    print(f"\n✅ 完成!生成的题目保存在: {output_path}")
+
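Reviewer note: the backslash repair in `_parse_response` cannot help with LaTeX commands such as `\frac` or `\times`, because `\f` and `\t` are valid JSON escapes and parse without error. The sketch below is one possible mitigation, not the project's method: before parsing, it doubles backslashes in front of a short list of LaTeX commands whose first letter collides with a JSON escape, then applies the same regex as the original for everything else. The `AMBIGUOUS_LATEX` list is an assumption and deliberately incomplete, and `loads_with_latex` is a hypothetical helper name.

```python
import json
import re

# Assumed, incomplete list of LaTeX commands that begin with a JSON escape
# letter (b, f, n, r, t). Extend it for the commands your prompts produce.
AMBIGUOUS_LATEX = ("begin", "beta", "binom", "frac", "nabla", "neq",
                   "rho", "right", "tan", "tau", "text", "theta", "times")

# A single backslash followed by one of the listed command names as a whole word.
_AMBIGUOUS = re.compile(
    r'(?<!\\)\\(?=(?:' + "|".join(AMBIGUOUS_LATEX) + r')(?![A-Za-z]))'
)
# Same pattern as _parse_response: a backslash that starts no valid JSON escape.
_INVALID = re.compile(r'(?<!\\)\\(?!["\\/bfnrtu])')


def loads_with_latex(json_str):
    """Parse JSON whose string values may contain unescaped LaTeX commands."""
    fixed = _AMBIGUOUS.sub(r'\\\\', json_str)  # \frac  -> \\frac
    fixed = _INVALID.sub(r'\\\\', fixed)       # \sqrt  -> \\sqrt
    return json.loads(fixed)
```

The trade-off is that a genuine `\n` or `\t` escape directly followed by one of the listed words (for example a newline followed by the word "text") would be misread as a LaTeX command. Asking the model to emit doubled backslashes in the prompt is a complementary safeguard.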

+ 80 - 0
code/chapter12/data_generation/evaluation_results/20251011_123929/comprehensive_report.md

@@ -0,0 +1,80 @@
+# AIME数据生成与评估综合报告
+
+## 1. 基本信息
+
+- **生成时间**: 2025-10-11 12:43:22
+- **生成题目数量**: 30
+- **参考AIME年份**: 2025
+- **生成数据路径**: data_generation/generated_data/aime_generated_20251011_042741.json
+
+## 2. 数据生成统计
+
+### 主题分布
+
+| 主题 | 数量 | 占比 |
+|------|------|------|
+| Number Theory | 9 | 30.0% |
+| Geometry | 9 | 30.0% |
+| Algebra | 8 | 26.7% |
+| Probability | 2 | 6.7% |
+| Combinatorics | 2 | 6.7% |
+
+## 3. LLM Judge评估结果
+
+**总体评分**:
+- 平均总分: 3.32/5.0
+- 通过率: 40.00%
+- 优秀率: 10.00%
+
+**各维度评分**:
+
+| 维度 | 平均分 |
+|------|--------|
+| 正确性 | 3.27/5.0 |
+| 清晰度 | 3.40/5.0 |
+| 难度匹配 | 3.27/5.0 |
+| 完整性 | 3.33/5.0 |
+
+
+## 4. Win Rate评估结果
+
+**胜率统计**:
+- Win Rate: 25.00%
+- Loss Rate: 65.00%
+- Tie Rate: 10.00%
+
+**对比次数**:
+- 总对比次数: 20 次
+- 胜出次数: 5 次
+- 失败次数: 13 次
+- 平局次数: 2 次
+
+
+## 5. 综合结论
+
+⚠️ **结论**: 生成数据质量**需要改进**,与AIME真题仍有差距。
+
+**整体指标**:
+- LLM Judge得分: 3.32/5.0
+- Win Rate: 25.00%
+
+## 6. 改进建议
+
+- ⚠️ 需要重新设计生成提示词
+- ⚠️ 考虑使用更强的生成模型
+- ⚠️ 增加人工审核环节
+
+## 7. 下一步行动
+
+1. **人工验证**: 运行人工验证界面,对生成的题目进行人工审核
+   ```bash
+   python data_generation/human_verification_ui.py data_generation/generated_data/aime_generated_20251011_042741.json
+   ```
+
+2. **质量筛选**: 根据评估结果筛选高质量题目
+
+3. **迭代优化**: 根据评估反馈优化生成策略
+
+---
+
+*报告生成时间: 2025-10-11 12:43:22*

+ 127 - 0
code/chapter12/data_generation/evaluation_results/20251011_123929/llm_judge/llm_judge_report_20251011_124143.md

@@ -0,0 +1,127 @@
+# LLM Judge评估报告
+
+## 基本信息
+
+- **评估日期**: 2025-10-11T12:41:43.357467
+- **评委模型**: gpt-4o
+- **评估数量**: 30 个题目
+
+## 评估结果
+
+### 总体评分
+
+- **平均总分**: 3.32/5.0
+- **通过率**: 40.00% (≥3.5分)
+- **优秀率**: 10.00% (≥4.5分)
+
+### 各维度评分
+
+| 维度 | 平均分 | 评级 |
+|------|--------|------|
+| 正确性 (Correctness) | 3.27/5.0 | 一般 ⭐⭐ |
+| 清晰度 (Clarity) | 3.40/5.0 | 一般 ⭐⭐ |
+| 难度匹配 (Difficulty Match) | 3.27/5.0 | 一般 ⭐⭐ |
+| 完整性 (Completeness) | 3.33/5.0 | 一般 ⭐⭐ |
+
+## 详细结果
+
+
+### 题目 1: gen_aime_1
+
+- **总分**: 3.50/5.0
+- **各维度评分**:
+  - 正确性: 3.0
+  - 清晰度: 4.0
+  - 难度匹配: 3.0
+  - 完整性: 4.0
+
+### 题目 2: gen_aime_2
+
+- **总分**: 3.75/5.0
+- **各维度评分**:
+  - 正确性: 4.0
+  - 清晰度: 4.0
+  - 难度匹配: 3.0
+  - 完整性: 4.0
+
+### 题目 3: gen_aime_3
+
+- **总分**: 2.75/5.0
+- **各维度评分**:
+  - 正确性: 2.0
+  - 清晰度: 3.0
+  - 难度匹配: 4.0
+  - 完整性: 2.0
+
+### 题目 4: gen_aime_4
+
+- **总分**: 2.50/5.0
+- **各维度评分**:
+  - 正确性: 1.0
+  - 清晰度: 3.0
+  - 难度匹配: 3.0
+  - 完整性: 3.0
+
+### 题目 5: gen_aime_5
+
+- **总分**: 3.50/5.0
+- **各维度评分**:
+  - 正确性: 4.0
+  - 清晰度: 4.0
+  - 难度匹配: 3.0
+  - 完整性: 3.0
+
+### 题目 6: gen_aime_6
+
+- **总分**: 3.00/5.0
+- **各维度评分**:
+  - 正确性: 2.0
+  - 清晰度: 3.0
+  - 难度匹配: 4.0
+  - 完整性: 3.0
+
+### 题目 7: gen_aime_7
+
+- **总分**: 3.25/5.0
+- **各维度评分**:
+  - 正确性: 3.0
+  - 清晰度: 3.0
+  - 难度匹配: 4.0
+  - 完整性: 3.0
+
+### 题目 8: gen_aime_8
+
+- **总分**: 4.00/5.0
+- **各维度评分**:
+  - 正确性: 4.0
+  - 清晰度: 4.0
+  - 难度匹配: 4.0
+  - 完整性: 4.0
+
+### 题目 9: gen_aime_9
+
+- **总分**: 3.25/5.0
+- **各维度评分**:
+  - 正确性: 3.0
+  - 清晰度: 4.0
+  - 难度匹配: 3.0
+  - 完整性: 3.0
+
+### 题目 10: gen_aime_10
+
+- **总分**: 4.50/5.0
+- **各维度评分**:
+  - 正确性: 5.0
+  - 清晰度: 4.0
+  - 难度匹配: 4.0
+  - 完整性: 5.0
+
+*(仅显示前10个题目的详细评分,完整结果请查看JSON文件)*
+
+## 结论
+
+基于LLM Judge的评估,生成的数据集质量需要改进。
+
+---
+
+*报告生成时间: 2025-10-11 12:41:43*
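Reviewer note: the report defines the pass rate as the share of problems with a total score of at least 3.5 and the excellence rate as the share with at least 4.5. A minimal sketch of that computation follows, assuming the thresholds stated in the report; `pass_and_excellent_rates` is a hypothetical helper, not the evaluator's actual code.

```python
def pass_and_excellent_rates(total_scores, pass_threshold=3.5, excellent_threshold=4.5):
    """Return (pass_rate, excellent_rate) as fractions of the scored problems."""
    if not total_scores:
        raise ValueError("no scores to summarize")
    n = len(total_scores)
    passed = sum(1 for score in total_scores if score >= pass_threshold)
    excellent = sum(1 for score in total_scores if score >= excellent_threshold)
    return passed / n, excellent / n
```

Applied to the ten problems listed in detail above, this gives 50% passing and 10% excellent. The report's 40% and 10% cover all 30 problems, so the two sets of figures are not expected to match.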

+ 378 - 0
code/chapter12/data_generation/evaluation_results/20251011_123929/llm_judge/llm_judge_results_20251011_124143.json

@@ -0,0 +1,378 @@
+{
+  "results": [
+    {
+      "problem_id": "gen_aime_1",
+      "scores": {
+        "correctness": 3.0,
+        "clarity": 4.0,
+        "difficulty_match": 3.0,
+        "completeness": 4.0
+      },
+      "total_score": 3.5,
+      "evaluation_text": "```json\n{\n    \"correctness\": 3,\n    \"clarity\": 4,\n    \"difficulty_match\": 3,\n    \"completeness\": 4,\n    \"comments\": \"题目的解答部分存在逻辑错误,最后计算2023rd项时,最终的答案应该是4而不是1,因为在计算 (2023 - 5) mod 8 时,结果为6,而根据循环模式,循环的第6项为4,而非1。解答步骤基本完整,但由于关键部分的错误导致正确性受影响。题目难度与AIME题目不完全匹配,偏向于较简单的数列问题。\"\n}\n```",
+      "execution_time": 3.2240538597106934
+    },
+    {
+      "problem_id": "gen_aime_2",
+      "scores": {
+        "correctness": 4.0,
+        "clarity": 4.0,
+        "difficulty_match": 3.0,
+        "completeness": 4.0
+      },
+      "total_score": 3.75,
+      "evaluation_text": "```json\n{\n    \"correctness\": 4,\n    \"clarity\": 4,\n    \"difficulty_match\": 3,\n    \"completeness\": 4,\n    \"comments\": \"The problem is mostly correct in terms of mathematical logic, but there is a small error in the answer; the 2024th term is indeed 27, but the interpretation of finding the sum of the digits is not clearly linked to the final answer of 9. The sequence's behavior is well-explained, but the final step and connection to the answer could be clearer. The problem is slightly simpler than typical AIME problems, as it involves recognizing a pattern rather than deeper problem-solving or multiple concepts. The solution covers the key steps but could benefit from a clearer explanation of why the sum of the digits of 27 is relevant to the final answer.\"\n}\n```",
+      "execution_time": 3.4472837448120117
+    },
+    {
+      "problem_id": "gen_aime_3",
+      "scores": {
+        "correctness": 2.0,
+        "clarity": 3.0,
+        "difficulty_match": 4.0,
+        "completeness": 2.0
+      },
+      "total_score": 2.75,
+      "evaluation_text": "```json\n{\n    \"correctness\": 2,\n    \"clarity\": 3,\n    \"difficulty_match\": 4,\n    \"completeness\": 2,\n    \"comments\": \"题目的解答过程中存在严重错误,导致最终答案与题目给出的不一致。对于一个多项式满足f(1)=10, f(2)=20, f(3)=30, f(4)=40的条件,直接推出a+b+c+d=70是错误的。实际上,应当通过构造新的多项式g(x)=f(x)-10x来处理,g(x)的根为1, 2, 3, 4,然后再分析其系数和。解答中,步骤和推理并不完整,缺少关键的多项式构造和分析,导致错误的结论。难度方面,题目符合AIME标准,但解答中缺乏必要的步骤和推理,影响了整体的完整性和正确性。\"\n}\n```",
+      "execution_time": 6.914599657058716
+    },
+    {
+      "problem_id": "gen_aime_4",
+      "scores": {
+        "correctness": 1.0,
+        "clarity": 3.0,
+        "difficulty_match": 3.0,
+        "completeness": 3.0
+      },
+      "total_score": 2.5,
+      "evaluation_text": "```json\n{\n    \"correctness\": 1,\n    \"clarity\": 3,\n    \"difficulty_match\": 3,\n    \"completeness\": 3,\n    \"comments\": \"该题目在正确性方面存在严重问题:题目给出的答案25不符合已知条件的计算结果。经过正确推导,ab + bc + ca 的值应为4,而不是25。此外,解答过程在试图使用对称性和拉格朗日乘数法来最大化结果时出现了不必要的复杂化,并且没有纠正错误的答案。清晰度上,虽然解答包含了一些详细的代数步骤,但由于推导过程中的错误,可能会让读者感到困惑。难度上,题目符合AIME的水平,但由于提供的解答不正确,可能会对参赛者造成误导。完整性方面,由于核心计算出现错误,导致解答缺乏正确的推理步骤。\"\n}\n```",
+      "execution_time": 4.767943859100342
+    },
+    {
+      "problem_id": "gen_aime_5",
+      "scores": {
+        "correctness": 4.0,
+        "clarity": 4.0,
+        "difficulty_match": 3.0,
+        "completeness": 3.0
+      },
+      "total_score": 3.5,
+      "evaluation_text": "```json\n{\n    \"correctness\": 4,\n    \"clarity\": 4,\n    \"difficulty_match\": 3,\n    \"completeness\": 3,\n    \"comments\": \"该题目总体上是正确的,但答案提供的计算过程有误。根据计算,a_10 = 286 不是一个完全平方数,因此答案 '10' 是错误的。解答中的公式 a_n = 1 + (n(n-1)(2n-1))/6 是正确的,但没有在解答中使用到。该题的难度不符合典型的AIME题目,因为它涉及的计算和推理较为简单且直接。解答部分缺乏将公式应用于找到正确m值的步骤,完整性可以提高。\"\n}\n```",
+      "execution_time": 4.121169328689575
+    },
+    {
+      "problem_id": "gen_aime_6",
+      "scores": {
+        "correctness": 2.0,
+        "clarity": 3.0,
+        "difficulty_match": 4.0,
+        "completeness": 3.0
+      },
+      "total_score": 3.0,
+      "evaluation_text": "```json\n{\n    \"correctness\": 2,\n    \"clarity\": 3,\n    \"difficulty_match\": 4,\n    \"completeness\": 3,\n    \"comments\": \"The problem statement has a logical mistake in the calculation of maximum value of f(p). The proposed solution incorrectly states the maximum value as 9751 after several incorrect calculations. The correct logic should involve maximizing the sum of the two largest possible 4-digit numbers formed by the digits 1 through 9, while minimizing the single digit a9. The explanation lacks clarity in its reasoning, and the oversight in calculations shows a lack of thoroughness. The problem's difficulty is appropriate for AIME, but the solution provided is incomplete and incorrect in its final answer. Further refinement in solution steps and verification of calculations is needed.\"\n}\n```",
+      "execution_time": 3.089435577392578
+    },
+    {
+      "problem_id": "gen_aime_7",
+      "scores": {
+        "correctness": 3.0,
+        "clarity": 3.0,
+        "difficulty_match": 4.0,
+        "completeness": 3.0
+      },
+      "total_score": 3.25,
+      "evaluation_text": "```json\n{\n    \"correctness\": 3,\n    \"clarity\": 3,\n    \"difficulty_match\": 4,\n    \"completeness\": 3,\n    \"comments\": \"这个题目在问题构建上具有一定的挑战性,符合AIME的难度要求。然而,解答中存在一些错误,包括不正确的概率计算和最终得到的错误答案。解答过程中的公式和推理没有解释清楚,尤其是在包含排除法则的应用上。清晰度方面,问题描述较为直观,但解答的步骤缺乏详细说明,可能导致读者难以理解。此外,结果中的m和n的相对质性没有正确处理。综合来看,题目有潜力,但需要在逻辑和解答清晰度上进行改进。\"\n}\n```",
+      "execution_time": 4.858828067779541
+    },
+    {
+      "problem_id": "gen_aime_8",
+      "scores": {
+        "correctness": 4.0,
+        "clarity": 4.0,
+        "difficulty_match": 4.0,
+        "completeness": 4.0
+      },
+      "total_score": 4.0,
+      "evaluation_text": "```json\n{\n    \"correctness\": 4,\n    \"clarity\": 4,\n    \"difficulty_match\": 4,\n    \"completeness\": 4,\n    \"comments\": \"This problem is structured well and mostly correct, but there is a minor inconsistency in the final answer derivation. The calculation of the number of perfect square divisors is correct and leads to the correct probability of 3/20. However, the final answer provided as 19 seems to be a mistake, as the numbers m and n are correctly identified as 3 and 20, leading to m+n = 23. The clarity of the problem and solution is good, but the minor oversight in the final arithmetic reduces the completeness and correctness slightly. Overall, the problem difficulty is appropriate for AIME, challenging yet solvable with a solid understanding of number theory and divisor counting.\"\n}\n```",
+      "execution_time": 2.4106380939483643
+    },
+    {
+      "problem_id": "gen_aime_9",
+      "scores": {
+        "correctness": 3.0,
+        "clarity": 4.0,
+        "difficulty_match": 3.0,
+        "completeness": 3.0
+      },
+      "total_score": 3.25,
+      "evaluation_text": "```json\n{\n    \"correctness\": 3,\n    \"clarity\": 4,\n    \"difficulty_match\": 3,\n    \"completeness\": 3,\n    \"comments\": \"The problem contains a significant error in its solution. The sequence described is indeed a shifted Fibonacci sequence, but the calculation provided incorrectly states that the term a_15 is divisible by 1000. In fact, a_15 = 987 is not divisible by 1000, and the correct smallest n needs further computation. The clarity of the problem statement and solution is generally good, but the solution lacks completeness because it stops prematurely with an incorrect conclusion. The difficulty level seems to be slightly easier than typical AIME problems due to its computational nature rather than requiring deeper insight or problem-solving techniques.\"\n}\n```",
+      "execution_time": 3.9147017002105713
+    },
+    {
+      "problem_id": "gen_aime_10",
+      "scores": {
+        "correctness": 5.0,
+        "clarity": 4.0,
+        "difficulty_match": 4.0,
+        "completeness": 5.0
+      },
+      "total_score": 4.5,
+      "evaluation_text": "```json\n{\n    \"correctness\": 5,\n    \"clarity\": 4,\n    \"difficulty_match\": 4,\n    \"completeness\": 5,\n    \"comments\": \"该题目正确地利用了圆周上直角三角形的性质,即直径所对的圆周角为直角,从而正确地计算了三角形的面积。题目清晰地表述了已知条件,并通过勾股定理验证了直角三角形的性质。题目的难度适中,符合AIME题目的标准。解答步骤完整,包含了必要的推理和验证步骤。唯一的小改进建议是解答部分的末尾,应该更准确地解释如何得出 m + n + p = 34 的最终结果。\"\n}\n```",
+      "execution_time": 3.468392848968506
+    },
+    {
+      "problem_id": "gen_aime_11",
+      "scores": {
+        "correctness": 4.0,
+        "clarity": 3.0,
+        "difficulty_match": 3.0,
+        "completeness": 3.0
+      },
+      "total_score": 3.25,
+      "evaluation_text": "```json\n{\n    \"correctness\": 4,\n    \"clarity\": 3,\n    \"difficulty_match\": 3,\n    \"completeness\": 3,\n    \"comments\": \"这道题目涉及一个递推序列,并要求找出第一个能被7整除的项。解答中进行了序列的递推计算,并正确识别出a2为零模7。然而,在解答的过程中,存在一些不必要的步骤,且答案给出与实际计算不符(应为n=2而非n=3)。此外,解答中前后不一致,增加了理解难度。题目难度适中,但解答部分不够完整,缺少对一些步骤的详细说明。\"\n}\n```",
+      "execution_time": 2.9038174152374268
+    },
+    {
+      "problem_id": "gen_aime_12",
+      "scores": {
+        "correctness": 3.0,
+        "clarity": 3.0,
+        "difficulty_match": 2.0,
+        "completeness": 3.0
+      },
+      "total_score": 2.75,
+      "evaluation_text": "```json\n{\n    \"correctness\": 3,\n    \"clarity\": 3,\n    \"difficulty_match\": 2,\n    \"completeness\": 3,\n    \"comments\": \"This problem has several issues that affect its quality. Firstly, the solution provided contains logical errors and inconsistencies, particularly in the handling of the inclusion-exclusion principle and the deduction of the number of students in all three activities. The calculation steps are not clearly laid out, and the final answer appears to be incorrect based on the given logic. Additionally, the problem's difficulty is not well-aligned with typical AIME standards, which usually involve more intricate mathematical reasoning and problem-solving techniques. The problem also lacks a thorough and coherent explanation in the solution, which makes it difficult for the reader to follow the intended logic. Overall, the problem could benefit from a clearer structure, more precise reasoning, and alignment with the expected difficulty level of AIME problems.\"\n}\n```",
+      "execution_time": 3.511565685272217
+    },
+    {
+      "problem_id": "gen_aime_13",
+      "scores": {
+        "correctness": 4.0,
+        "clarity": 3.0,
+        "difficulty_match": 3.0,
+        "completeness": 4.0
+      },
+      "total_score": 3.5,
+      "evaluation_text": "```json\n{\n    \"correctness\": 4,\n    \"clarity\": 3,\n    \"difficulty_match\": 3,\n    \"completeness\": 4,\n    \"comments\": \"The solution shows a good understanding of the geometric probability involved in the problem, and the probability calculation is mostly correct. However, there is a mistake in the final steps related to the last three digits calculation, which seems to be due to an incorrect interpretation of the problem's requirements. The clarity of the solution could be improved by clearly stating the problem's requirements and ensuring that the final answer aligns with the expected format. The difficulty of the problem is somewhat aligned with AIME standards, but the misunderstanding of the output requirement slightly affects the overall quality.\"\n}\n```",
+      "execution_time": 3.011523485183716
+    },
+    {
+      "problem_id": "gen_aime_14",
+      "scores": {
+        "correctness": 1.0,
+        "clarity": 3.0,
+        "difficulty_match": 1.0,
+        "completeness": 2.0
+      },
+      "total_score": 1.75,
+      "evaluation_text": "```json\n{\n    \"correctness\": 1,\n    \"clarity\": 3,\n    \"difficulty_match\": 1,\n    \"completeness\": 2,\n    \"comments\": \"详细评价: 1. 正确性: 题目的解答过程中的计算有错误,最终答案与计算过程不符。解答中给出的答案是10300,而题目答案是999,且计算的逻辑不支持这两个结果。 2. 清晰度: 题目表述清楚,易于理解,但解答中的错误可能让读者迷惑。 3. 难度匹配: 题目的难度远低于AIME标准,AIME题目通常需要更高的数学思考或技巧,此题过于简单。 4. 完整性: 解答过程不够完整,因为计算步骤中的错误导致结果不准确,没有正确识别问题中的错误。\"\n}\n```",
+      "execution_time": 3.9433140754699707
+    },
+    {
+      "problem_id": "gen_aime_15",
+      "scores": {
+        "correctness": 3.0,
+        "clarity": 3.0,
+        "difficulty_match": 2.0,
+        "completeness": 3.0
+      },
+      "total_score": 2.75,
+      "evaluation_text": "```json\n{\n    \"correctness\": 3,\n    \"clarity\": 3,\n    \"difficulty_match\": 2,\n    \"completeness\": 3,\n    \"comments\": \"The problem contains a fundamental error in the understanding of the conditions for a point to lie on both circles. The setup of the problem is incorrect because given the centers of the circles and their radii, the circles do not intersect; the distance between the centers is 8, which is equal to the sum of the radii (5 + 3), meaning they are externally tangent at only one point. Therefore, there cannot be a point P lying on both circles that would satisfy the given conditions for calculation. Additionally, the solution contains logical inconsistencies and incorrect substitutions, leading to an incorrect answer. The problem lacks clarity in explaining how the intersection or tangency affects the selection of P and its implications for the distance calculation. Furthermore, the difficulty level is not aligned with typical AIME problems, which usually require deeper insights or more sophisticated techniques. Overall, the problem and its solution require significant revision to be valid and align with AIME standards.\"\n}\n```",
+      "execution_time": 5.860377550125122
+    },
+    {
+      "problem_id": "gen_aime_16",
+      "scores": {
+        "correctness": 5.0,
+        "clarity": 5.0,
+        "difficulty_match": 4.0,
+        "completeness": 5.0
+      },
+      "total_score": 4.75,
+      "evaluation_text": "```json\n{\n    \"correctness\": 5,\n    \"clarity\": 5,\n    \"difficulty_match\": 4,\n    \"completeness\": 5,\n    \"comments\": \"这道题目在数学逻辑上是正确的,给出的解答详尽且准确地解释了如何构造3x3幻方并验证其性质。问题的表述清晰,解答步骤明确,易于理解。题目难度符合AIME的标准,因为要求考生不仅要理解幻方的性质,还要进行合理的推理和验证。解答包含了必要的推理步骤,如计算幻方常数、列出一个已知的幻方配置,以及通过对称操作解释八种不同配置的来源。因此,题目的整体质量很高。\"\n}\n```",
+      "execution_time": 3.2819583415985107
+    },
+    {
+      "problem_id": "gen_aime_17",
+      "scores": {
+        "correctness": 4.0,
+        "clarity": 3.0,
+        "difficulty_match": 3.0,
+        "completeness": 3.0
+      },
+      "total_score": 3.25,
+      "evaluation_text": "```json\n{\n    \"correctness\": 4,\n    \"clarity\": 3,\n    \"difficulty_match\": 3,\n    \"completeness\": 3,\n    \"comments\": \"该题目存在一些问题。在正确性方面,答案计算有误。根据计算,实际的和应该是 10416,而不是 325。题意中给出的答案 325 不正确。在清晰度方面,尽管问题的表述和解答的步骤基本清晰,但有些地方可以更详细,比如平方和公式的推导。在难度匹配方面,该题目比典型的AIME问题稍简单,解法包含识别平方数序列,这在竞赛中比较常见。在完整性方面,解答缺少对平方和公式的详细推导,这对于确保理解是必要的。\"\n}\n```",
+      "execution_time": 6.139104127883911
+    },
+    {
+      "problem_id": "gen_aime_18",
+      "scores": {
+        "correctness": 2.0,
+        "clarity": 3.0,
+        "difficulty_match": 3.0,
+        "completeness": 3.0
+      },
+      "total_score": 2.75,
+      "evaluation_text": "```json\n{\n    \"correctness\": 2,\n    \"clarity\": 3,\n    \"difficulty_match\": 3,\n    \"completeness\": 3,\n    \"comments\": \"题目给出的答案24与题目解答不一致,且解答过程存在错误。首先在计算a和b时,解答中给出的方程组解出a=6, b=-8,计算a+b得到-2,而非24,后续调整过程也不够清晰,未能正确展示如何得到a+b=24的答案。解答步骤缺乏对a和b的重新计算和验证。题目难度对于AIME而言适中,但解答不够完整和准确,影响了整体的评估。\"\n}\n```",
+      "execution_time": 8.60336709022522
+    },
+    {
+      "problem_id": "gen_aime_19",
+      "scores": {
+        "correctness": 3.0,
+        "clarity": 3.0,
+        "difficulty_match": 3.0,
+        "completeness": 3.0
+      },
+      "total_score": 3.0,
+      "evaluation_text": "```json\n{\n    \"correctness\": 3,\n    \"clarity\": 3,\n    \"difficulty_match\": 3,\n    \"completeness\": 3,\n    \"comments\": \"The problem attempts to be similar to AIME by involving algebraic manipulation and maximization under constraints. However, the answer and solution provided contain inaccuracies. Firstly, the calculation of the maximum value of \\( ab + bc + ca \\) is incorrect; the solution fails to provide the right algebraic manipulations or logical deductions to find the correct maximum. The solution prematurely concludes values for \\( a, b, \\) and \\( c \\) without sufficient justification or exploration of other possibilities, like using the equality case in the Cauchy-Schwarz inequality. Additionally, the problem's difficulty is not quite aligned with typical AIME problems, which often require more subtle insights or non-trivial manipulations. The clarity is moderate, but the explanation could be improved by breaking down steps more clearly and exploring different strategies. The solution lacks completeness by not fully exploring alternative values or methods to ensure the maximum is correctly derived. Overall, it needs revision in its mathematical logic and presentation to align better with AIME standards.\"\n}\n```",
+      "execution_time": 12.811752319335938
+    },
+    {
+      "problem_id": "gen_aime_20",
+      "scores": {
+        "correctness": 3.0,
+        "clarity": 3.0,
+        "difficulty_match": 4.0,
+        "completeness": 3.0
+      },
+      "total_score": 3.25,
+      "evaluation_text": "```json\n{\n    \"correctness\": 3,\n    \"clarity\": 3,\n    \"difficulty_match\": 4,\n    \"completeness\": 3,\n    \"comments\": \"该题目结构合理且具有一定的挑战性,符合AIME的难度标准。然而,解答过程中存在几个问题。首先,答案部分中的计算错误:在计算相似三角形的面积时,使用的比例计算结果与给出的答案不符;实际计算应为 (5/3)^2 * 15 = 125/3 = 41.67,而非 35。问题表述基本清晰,但解答部分的推理过程不够完整和准确,尤其是在处理整数化简部分时显得不清晰。建议在解答中明确面积比例的计算,并确保提供的答案和计算步骤一致。\"\n}\n```",
+      "execution_time": 3.2803139686584473
+    },
+    {
+      "problem_id": "gen_aime_21",
+      "scores": {
+        "correctness": 3.0,
+        "clarity": 3.0,
+        "difficulty_match": 4.0,
+        "completeness": 3.0
+      },
+      "total_score": 3.25,
+      "evaluation_text": "```json\n{\n    \"correctness\": 3,\n    \"clarity\": 3,\n    \"difficulty_match\": 4,\n    \"completeness\": 3,\n    \"comments\": \"该题目在数学逻辑上存在问题,特别是在最小和的计算过程中。题目条件中关于二次函数的选择和计算有混淆,特别是对于如何选择a, b, c以保证所需条件满足。解答过程中也缺乏对关键步骤的详细推导和验证。题目的陈述部分有些模糊,尤其是关于选择函数的依据和细节。整体难度接近AIME标准,但在准确性和清晰度上需要改进。\"\n}\n```",
+      "execution_time": 3.934663772583008
+    },
+    {
+      "problem_id": "gen_aime_22",
+      "scores": {
+        "correctness": 5.0,
+        "clarity": 4.0,
+        "difficulty_match": 4.0,
+        "completeness": 5.0
+      },
+      "total_score": 4.5,
+      "evaluation_text": "```json\n{\n    \"correctness\": 5,\n    \"clarity\": 4,\n    \"difficulty_match\": 4,\n    \"completeness\": 5,\n    \"comments\": \"这道题目在逻辑上是正确的,答案也准确无误。问题表述清晰,只是解答中有部分步骤可能对于一些读者而言需要更多解释,例如为什么N = 125k是合理的选择。难度符合AIME标准,问题涉及模运算和数论,需要一定的推理能力。解答步骤完整,涵盖了所有必要的推理和计算。整体来说,这是一道高质量的AIME风格题目。\"\n}\n```",
+      "execution_time": 2.8868408203125
+    },
+    {
+      "problem_id": "gen_aime_23",
+      "scores": {
+        "correctness": 4.0,
+        "clarity": 4.0,
+        "difficulty_match": 4.0,
+        "completeness": 4.0
+      },
+      "total_score": 4.0,
+      "evaluation_text": "```json\n{\n    \"correctness\": 4,\n    \"clarity\": 4,\n    \"difficulty_match\": 4,\n    \"completeness\": 4,\n    \"comments\": \"The problem is well-structured and presents a clear mathematical challenge suitable for an AIME-style question. However, there are some minor issues in the solution that affect its correctness and completeness. The computation of the segment lengths and the final ratio require careful simplification to ensure accuracy. Additionally, the handling of the intersection points could be more precise, particularly in ensuring that the segments are correctly identified and their lengths are accurately calculated. The problem's clarity is mostly good, but the explanation of terms like 'support' could be more explicit. Overall, the problem matches the AIME difficulty level well, challenging students to apply geometric and algebraic reasoning in a non-trivial context.\"\n}\n```",
+      "execution_time": 4.162907361984253
+    },
+    {
+      "problem_id": "gen_aime_24",
+      "scores": {
+        "correctness": 3.0,
+        "clarity": 3.0,
+        "difficulty_match": 3.0,
+        "completeness": 3.0
+      },
+      "total_score": 3.0,
+      "evaluation_text": "```json\n{\n    \"correctness\": 2,\n    \"clarity\": 3,\n    \"difficulty_match\": 1,\n    \"completeness\": 2,\n    \"comments\": \"该题目存在几个关键性错误。首先,题目中给出的方程 \\( x^3 - 3x^2 + 4x - 12 = 0 \\) 被错误地测试为有实数根为 2,但实际上求解后得到的 \\( x^2 - x + 2 = 0 \\) 应该有两个复数根,而不是继续求得其他实数根。此外,题目所求 \\( x_2 + x_1x_3 \\) 的表达式与所给答案 7 不符,因为无法通过实际计算得到。解答过程中的逻辑和步骤不完整且含混不清,未能准确解决题目中所求的值。整体上,该题目的难度也不符合 AIME 的标准,因为它未能提供合理的挑战性。\"\n}\n```",
+      "execution_time": 5.341580390930176
+    },
+    {
+      "problem_id": "gen_aime_25",
+      "scores": {
+        "correctness": 4.0,
+        "clarity": 3.0,
+        "difficulty_match": 4.0,
+        "completeness": 4.0
+      },
+      "total_score": 3.75,
+      "evaluation_text": "```json\n{\n    \"correctness\": 4,\n    \"clarity\": 3,\n    \"difficulty_match\": 4,\n    \"completeness\": 4,\n    \"comments\": \"问题的设定符合AIME风格,涉及角平分线定理和幂定理的应用。然而,解答的步骤中有一些地方不够清晰,尤其是在解释如何从已知条件推导出BE的具体值时。答案部分的逻辑跳跃较大,缺乏一些必要的推理步骤,特别是在假设BE的值和最终得出答案之间。此外,问题中关于BE的计算应该更为详细,以帮助读者理解每一步的推导过程。总体而言,题目有一定挑战性,适合AIME的难度范围,但需要在解答的清晰度和步骤完整性上进行改进。\"\n}\n```",
+      "execution_time": 4.716083765029907
+    },
+    {
+      "problem_id": "gen_aime_26",
+      "scores": {
+        "correctness": 3.0,
+        "clarity": 3.0,
+        "difficulty_match": 3.0,
+        "completeness": 3.0
+      },
+      "total_score": 3.0,
+      "evaluation_text": "```json\n{\n    \"correctness\": 5,\n    \"clarity\": 5,\n    \"difficulty_match\": 3,\n    \"completeness\": 5,\n    \"comments\": \"The problem is mathematically correct, and the solution accurately determines the distance \\( d \\) from the center of the circle to the vertices of the hexagon. The problem statement is clear and concise, and the solution is straightforward and easy to follow. However, the difficulty may be on the easier side for an AIME problem as it primarily tests understanding of regular polygons and symmetry, which are generally basic concepts. The problem could be improved by adding complexity or additional steps. Overall, the problem is well-constructed but may not fully match the typical challenge level of an AIME problem.\"\n}\n```",
+      "execution_time": 3.892986536026001
+    },
+    {
+      "problem_id": "gen_aime_27",
+      "scores": {
+        "correctness": 3.0,
+        "clarity": 3.0,
+        "difficulty_match": 3.0,
+        "completeness": 3.0
+      },
+      "total_score": 3.0,
+      "evaluation_text": "```json\n{\n    \"correctness\": 3,\n    \"clarity\": 4,\n    \"difficulty_match\": 4,\n    \"completeness\": 4,\n    \"comments\": \"The problem is structured in a manner that is suitable for an AIME-style question and involves geometric reasoning and algebraic manipulation, which matches the typical difficulty level. However, there is an inconsistency in the solution provided. The calculation for the sum \\(m + n\\) initially suggests 39, but the final answer given is 27 without an explicit explanation for this change. This affects the correctness score. The clarity of the problem statement is good, but the solution lacks a clear explanation for the adjustment of the final answer. The steps leading to the understanding of \\(PQ\\) and the tangency condition are correct, but the final conclusion needs to be reevaluated for accuracy.\"\n}\n```",
+      "execution_time": 3.3524539470672607
+    },
+    {
+      "problem_id": "gen_aime_28",
+      "scores": {
+        "correctness": 2.0,
+        "clarity": 3.0,
+        "difficulty_match": 2.0,
+        "completeness": 2.0
+      },
+      "total_score": 2.25,
+      "evaluation_text": "```json\n{\n    \"correctness\": 2,\n    \"clarity\": 3,\n    \"difficulty_match\": 2,\n    \"completeness\": 2,\n    \"comments\": \"这道题目存在多个问题。首先,解答部分的数学逻辑不正确。对于内切圆,题目中的解答没有正确计算小三角形的面积,并且最终答案1是错误的。其次,题目的表述不够清晰,尤其是在解释如何形成小三角形时。问题的难度不符合AIME标准,因为它缺乏足够的复杂性和挑战性。解答部分也不完整,没有给出详细的推理过程和步骤来得出最终答案。建议重新审视和修正题目的数学逻辑,并提供更为详尽和准确的解答。\"\n}\n```",
+      "execution_time": 3.918867826461792
+    },
+    {
+      "problem_id": "gen_aime_29",
+      "scores": {
+        "correctness": 5.0,
+        "clarity": 4.0,
+        "difficulty_match": 4.0,
+        "completeness": 4.0
+      },
+      "total_score": 4.25,
+      "evaluation_text": "```json\n{\n    \"correctness\": 5,\n    \"clarity\": 4,\n    \"difficulty_match\": 4,\n    \"completeness\": 4,\n    \"comments\": \"题目正确地描述了问题并提供了准确的答案。解答中使用了费马小定理来简化计算,这是合适的。然而,解答过程中低次幂的计算部分略显冗长,可以更简洁地总结成一个表格或结合线索直接找到最小的n,避免冗余计算。题目的难度符合AIME的标准,但解答部分在计算5的幂时可以更高效。整体而言,题目和解答都很清晰,但可以在说明如何快速找到最小n的思路上做得更好。\"\n}\n```",
+      "execution_time": 3.3993613719940186
+    },
+    {
+      "problem_id": "gen_aime_30",
+      "scores": {
+        "correctness": 3.0,
+        "clarity": 4.0,
+        "difficulty_match": 4.0,
+        "completeness": 3.0
+      },
+      "total_score": 3.5,
+      "evaluation_text": "```json\n{\n    \"correctness\": 3,\n    \"clarity\": 4,\n    \"difficulty_match\": 4,\n    \"completeness\": 3,\n    \"comments\": \"The problem setup is interesting and involves multiple geometric and algebraic steps, which is suitable for a high-level math competition like AIME. However, there are some issues with the solution provided. The calculation of segment FG seems overly simplified and doesn't account for the complexity expected in such a problem. Specifically, the symmetry and intersection points need to be more rigorously derived, especially the parameterization of line BC and solving the intersection with the circle. The final answer does not match the algebraic steps provided, indicating a possible error in calculation or logic. The explanation is clear and follows a logical progression, though it lacks depth in certain derivations, notably in finding the points F and G. Overall, the problem is well-conceived, but the solution requires more detailed and accurate mathematical reasoning.\"\n}\n```",
+      "execution_time": 3.7917966842651367
+    }
+  ],
+  "metrics": {
+    "average_total_score": 3.316666666666667,
+    "dimension_averages": {
+      "correctness": 3.2666666666666666,
+      "clarity": 3.4,
+      "difficulty_match": 3.2666666666666666,
+      "completeness": 3.3333333333333335
+    },
+    "pass_rate": 0.4,
+    "excellent_rate": 0.1
+  },
+  "evaluation_date": "2025-10-11T12:41:43.357467",
+  "judge_model": "gpt-4o",
+  "num_problems": 30
+}
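The `metrics` block above is consistent with a simple aggregation: each `total_score` equals the unweighted mean of the four dimension scores (for `gen_aime_3`, the scores 2, 3, 4, 2 give 2.75). A minimal sketch of that aggregation follows. The 3.5 and 4.5 cut-offs for `pass_rate` and `excellent_rate` are assumptions, not values confirmed by the source, and `total_score` and `summarize` are hypothetical helper names rather than the repository's own functions.

```python
from statistics import mean

DIMENSIONS = ("correctness", "clarity", "difficulty_match", "completeness")

# Assumed thresholds; the source does not state them.
PASS_THRESHOLD = 3.5
EXCELLENT_THRESHOLD = 4.5


def total_score(scores):
    """Unweighted mean of the four dimension scores."""
    return mean(scores[d] for d in DIMENSIONS)


def summarize(results):
    """Aggregate per-problem scores into a metrics dict shaped like the one above."""
    totals = [total_score(r["scores"]) for r in results]
    n = len(totals)
    return {
        "average_total_score": mean(totals),
        "dimension_averages": {
            d: mean(r["scores"][d] for r in results) for d in DIMENSIONS
        },
        "pass_rate": sum(t >= PASS_THRESHOLD for t in totals) / n,
        "excellent_rate": sum(t >= EXCELLENT_THRESHOLD for t in totals) / n,
    }


# Three entries copied from the results file above.
sample = [
    {"problem_id": "gen_aime_3",
     "scores": {"correctness": 2.0, "clarity": 3.0, "difficulty_match": 4.0, "completeness": 2.0}},
    {"problem_id": "gen_aime_8",
     "scores": {"correctness": 4.0, "clarity": 4.0, "difficulty_match": 4.0, "completeness": 4.0}},
    {"problem_id": "gen_aime_10",
     "scores": {"correctness": 5.0, "clarity": 4.0, "difficulty_match": 4.0, "completeness": 5.0}},
]
summary = summarize(sample)
```

Keeping the thresholds as named constants makes it easy to check how sensitive the two rates are to the chosen cut-offs.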

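In the results above, `gen_aime_24`, `gen_aime_26` and `gen_aime_27` store 3.0 for every dimension even though their `evaluation_text` reports different numbers. All three replies contain LaTeX delimiters such as `\(`, which is not a valid JSON escape, so one plausible explanation (an assumption, not something the source confirms) is that parsing failed and default scores were recorded. The sketch below shows one tolerant way to parse such a reply; `parse_judge_reply` is a hypothetical helper, and the backslash repair is a heuristic that only runs after a strict parse has failed.

```python
import json
import re

# Content of the first ``` fenced block, with an optional "json" tag.
FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)```", re.DOTALL)
# A backslash that does not begin a valid JSON escape sequence.
BAD_ESCAPE_RE = re.compile(r'\\(?!["\\/bfnrtu])')


def parse_judge_reply(text):
    """Return the judge's JSON object, or None if it cannot be parsed."""
    match = FENCE_RE.search(text)
    payload = match.group(1) if match else text
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        pass
    # Heuristic repair: double stray backslashes (e.g. the one in "\(") and retry.
    repaired = BAD_ESCAPE_RE.sub(r"\\\\", payload)
    try:
        return json.loads(repaired)
    except json.JSONDecodeError:
        return None


# A reply shaped like the judge output above, with an unescaped LaTeX delimiter.
reply = '```json\n{"correctness": 2, "comments": "roots of \\( x^2 - x + 2 = 0 \\)"}\n```'
parsed = parse_judge_reply(reply)
```

Returning `None` instead of default scores lets the caller flag or retry unparsed replies, so they do not silently pull the averages toward the midpoint. Sequences such as `\f` in `\frac` are valid JSON escapes and would still be decoded as control characters, so this repair does not cover every LaTeX case.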
+ 145 - 0
code/chapter12/data_generation/evaluation_results/20251011_123929/win_rate/win_rate_report_20251011_124322.md

@@ -0,0 +1,145 @@
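+# Win Rate Evaluation Report
+
+## Basic Information
+
+- **Evaluation date**: 2025-10-11T12:43:22.568017
+- **Judge model**: gpt-4o
+- **Number of comparisons**: 20
+
+## Evaluation Results
+
+### Win Rate Statistics
+
+| Metric | Count | Percentage |
+|------|------|--------|
+| Generated data wins | 5 | 25.00% |
+| Reference data wins | 13 | 65.00% |
+| Ties | 2 | 10.00% |
+
+### Result Analysis
+
+**Win Rate**: 25.00%
+
+
+❌ **Needs improvement**: The generated data is clearly lower in quality than the reference data. Review and optimize the generation pipeline.
+
+
+## Detailed Comparison Results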
+
+
+### Comparison 1
+
+- **Generated problem**: 25
+- **Reference problem**: gen_aime_26
+- **Winner**: 🤝 Problem A
+- **Reason**: Upon evaluation, Problem A and Problem B are both solid in terms of mathematical correctness and clarity. However, Problem A exhibits a higher level of problem quality and complexity, which aligns more closely with the AIME standards. Problem A involves an intricate combinatorial setup, requiring the solver to understand properties of a regular polygon and engage in non-trivial counting arguments to find the number of ways to pair vertices, which is a challenging task suitable for AIME. In contrast, Problem B, while correctly formulated and solvable, relies on a straightforward application of the properties of a regular hexagon inscribed in a circle. The solution to Problem B is more direct and less complex, as it quickly reduces to finding the radius of the circle, a relatively simple task. Therefore, considering the dimensions of difficulty appropriateness and problem quality, Problem A is the winner due to its higher level of complexity and challenge appropriate for AIME.
+
+### Comparison 2
+
+- **Generated problem**: gen_aime_19
+- **Reference problem**: 10
+- **Winner**: 🤝 Tie
+- **Reason**: Failed to parse response
+
+### Comparison 3
+
+- **Generated problem**: 4
+- **Reference problem**: gen_aime_28
+- **Winner**: 🤝 Problem A
+- **Reason**: Upon evaluating both problems, Problem A is superior in terms of mathematical correctness, clarity, and difficulty appropriateness. Problem A is a combinatorial problem that requires the solver to understand permutations and divisibility rules, which are common themes in AIME-style problems. The problem is clearly stated, and the answer provided is consistent with the problem requirements. In contrast, Problem B has several issues. The solution provided for Problem B incorrectly concludes that the area of one of the smaller triangles is 1, while logically, given the problem statement, the area should be less than 1 due to the division of the area among the smaller triangles. Additionally, the solution lacks clarity, especially in the justification of the final step where it abruptly concludes the area as 1 without proper reasoning, which detracts from its mathematical correctness and clarity. Therefore, Problem A is the better-quality problem.
+
+### Comparison 4
+
+- **Generated problem**: gen_aime_7
+- **Reference problem**: 27
+- **Winner**: 🤝 Problem B
+- **Reason**: Problem B has a clear and unambiguous statement with a correct and straightforward answer. The mathematical correctness of the sequence operations in Problem B is sound and leads to a rational number, where the task is to find the remainder when the sum of the numerator and denominator is divided by 1000. This is a typical AIME-style problem that involves sequences and modular arithmetic, which are common themes in such competitions. Problem A, on the other hand, suffers from a solution that is both unnecessarily complicated and possibly incorrect in its simplification process. The inclusion-exclusion principle is correctly applied, but the final simplification error suggests a lack of clarity or correctness in the solution verification, which affects the overall quality. While both problems have a certain degree of complexity, Problem B's approach to sequences provides a more elegant and appropriate challenge for the AIME context.
+
+### Comparison 5
+
+- **Generated problem**: 29
+- **Reference problem**: gen_aime_6
+- **Winner**: 🤝 Problem A
+- **Reason**: Problem A is mathematically correct and concise. It clearly defines a polynomial function and asks for the sum of specific values of k, which is a well-posed question. The problem is appropriate for AIME standards as it involves understanding polynomial behavior and requires analytical thinking. On the other hand, Problem B has multiple issues. The solution provided in Problem B is incorrect, as it contains errors in calculating the maximum value of f(p). The problem statement in Problem B is somewhat misleading because it suggests maximizing the sum of permutations without a clear strategy, leading to confusion. Additionally, the solution attempts to use permutations without clearly explaining the rationale or strategy behind the selections. While Problem B has potential, the errors in its solution detract from its overall quality. Therefore, Problem A is the winner due to its correctness, clarity, and appropriate difficulty level.
+
+### Comparison 6
+
+- **Generated problem**: 20
+- **Reference problem**: gen_aime_29
+- **Winner**: 🤝 Tie
+- **Reason**: Both problems have their unique attributes that make them suitable for AIME standards, and neither problem stands out significantly over the other across the evaluation dimensions.
+
+1. Mathematical Correctness: Both Problem A and Problem B are mathematically correct. The answers provided are consistent with the problem statements, and the reasoning in Problem B's solution verifies the correctness.
+
+2. Clarity: Problem B is clearer in terms of presentation and understanding. Problem A, while clear, involves a complex geometric setup that might take longer for students to parse compared to the straightforward modular arithmetic problem in Problem B.
+
+3. Difficulty Appropriateness: Problem A involves geometry and requires understanding of several geometric properties and relationships, which aligns well with AIME's difficulty level. Problem B, on the other hand, involves modular arithmetic and application of Fermat's Little Theorem, which are also suitable for AIME.
+
+4. Problem Quality: Problem A is well-designed with a complex geometric configuration that tests a variety of skills. Problem B is simpler in terms of setup but still requires deep understanding of number theory concepts.
+
+Overall, both problems meet the AIME standard comprehensively, and therefore neither problem significantly outshines the other across all dimensions.
+
+### Comparison 7
+
+- **Generated problem**: gen_aime_22
+- **Reference problem**: 8
+- **Winner**: 🤝 Problem A
+- **Reason**: Both Problem A and Problem B are mathematically interesting and invite the solver to apply a range of mathematical techniques. However, Problem A stands out for a few reasons:
+
+1. **Mathematical Correctness**: Both problems are mathematically sound and have correct answers. 
+
+2. **Clarity**: Problem A provides a clear condition involving congruences that is easy to understand and follow. Problem B, while clear, involves a transformation that might be less intuitive for some solvers in terms of visualizing the rotation and intersection. 
+
+3. **Difficulty Appropriateness**: Both problems are challenging and appropriate for AIME, but Problem A involves a more direct application of modular arithmetic and factorization, which aligns well with typical AIME problem-solving strategies. Problem B involves additional geometric visualization and understanding of transformations, making it potentially more complex than typical AIME problems which focus more on algebra and number theory.
+
+4. **Problem Quality**: Problem A is well-designed, with a clear path to the solution that involves logical steps and common number theory techniques. Problem B is also well-designed but may require more abstract thinking due to the geometric aspect. 
+
+Overall, Problem A provides a clearer and more direct problem-solving experience, making it slightly superior in this evaluation.
+
+### Comparison 8
+
+- **Generated problem**: 22
+- **Reference problem**: gen_aime_8
+- **Winner**: 🤝 Problem A
+- **Reason**: Both problems are mathematically correct and have clear problem statements, but Problem A stands out in terms of complexity and originality. Here is a breakdown of the evaluation dimensions:
+
+1. Mathematical Correctness: Both problems are mathematically sound. Problem A deals with the application of the greedy algorithm to determine when it provides an optimal solution, while Problem B involves probability and divisor counting, both of which are correctly handled.
+
+2. Clarity: Both problem statements are clear and unambiguous, providing sufficient information for the solver to understand what is required without confusion.
+
+3. Difficulty Appropriateness: Both problems are suitable for AIME standards. Problem A's exploration of the greedy algorithm's success across a range of values introduces a unique challenge that requires deeper analysis beyond straightforward calculation. Problem B, while also challenging, involves more standard divisor and probability calculations.
+
+4. Problem Quality: Problem A exhibits higher quality due to its innovative approach in examining the conditions under which a common algorithm succeeds. It encourages deeper thinking about algorithmic efficiency, which adds to the problem's educational value. Problem B, although well-crafted, follows a more conventional path of evaluating divisors and probability.
+
+Overall, Problem A provides a more engaging and thought-provoking challenge, thus making it the higher quality problem.
+
+### Comparison 9
+
+- **Generated problem**: gen_aime_13
+- **Reference problem**: 22
+- **Winner**: Problem B (reference problem 22)
+- **Reason**: Both problems are challenging and suitable for the AIME level, but Problem B is the winner based on its higher quality in several evaluation dimensions. Firstly, both problems are mathematically sound, but Problem B is more clearly presented and free from errors or ambiguities in its description. Problem A has some inconsistencies in the solution, particularly in the calculation and explanation of the last three digits of m+n, which is confusing and incorrect as the correct last three digits should simply be the sum 10, not 375. Problem B, on the other hand, is clear and well-structured throughout with a correct answer. Regarding difficulty, both problems are appropriate for AIME, but Problem B offers a more interesting and less straightforward combinatorial challenge, requiring understanding of algorithms and optimization, which adds to its complexity. Overall, Problem B is better designed with a more consistent problem statement and solution, making it the superior problem.
+
+### Comparison 10
+
+- **Generated problem**: gen_aime_10
+- **Reference problem**: 4
+- **Winner**: Problem B (generated problem gen_aime_10)
+- **Reason**: Both problems present interesting challenges, but Problem B stands out in several key aspects. Firstly, the mathematical correctness of Problem B is clearly demonstrated with a detailed solution that shows a complete and accurate derivation of the problem's parameters. The problem involves well-known geometric principles such as the properties of a triangle inscribed in a circle and the Pythagorean theorem, which are executed correctly and contribute to a well-rounded mathematical experience. In terms of clarity, Problem B is succinct in its presentation of the geometric scenario, making it straightforward to understand the setup without ambiguity. Problem A, while interesting, lacks a complete solution, which makes it difficult to fully evaluate its correctness and reasoning. The difficulty of Problem B is appropriate for an AIME-style question, as it involves multiple steps and requires a solid understanding of geometry and algebra, making it challenging yet solvable. Finally, Problem B is better designed as it naturally engages with mathematical concepts and has a logical progression, whereas Problem A seems more mechanical and lacks a creative spark in problem design. Hence, Problem B is judged to be of higher quality.
+
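+*(Only the first 10 comparisons are shown in detail; see the JSON file for the complete results.)*
+
+## Conclusion
+
+Based on the Win Rate evaluation, the quality of the generated dataset **still has room for improvement** (Win Rate = 25.00%).
+
+Recommendations:
+1. Refine the prompts used for problem generation
+2. Add a quality-filtering step
+3. Use a stronger generation model
+4. Add a human review stage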
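
The Win Rate figure in the conclusion can be recomputed from the per-comparison records. The following is a minimal sketch, not the repository's implementation. It assumes the rate is the number of comparisons won by the generated problem divided by all comparisons, with ties kept in the denominator. The field names `winner` and `actual_order` come from the results JSON; the helper names and the four sample records are hypothetical.

```python
from collections import Counter


def resolve_winner(winner, actual_order):
    """Map a positional verdict ("Problem A"/"Problem B"/"Tie") to its source."""
    if winner == "Problem A":
        return actual_order["A"]
    if winner == "Problem B":
        return actual_order["B"]
    return "Tie"


def win_rate(comparisons):
    """Return (generated win rate, outcome counts); ties stay in the denominator."""
    counts = Counter(
        resolve_winner(c["winner"], c["actual_order"]) for c in comparisons
    )
    total = len(comparisons)
    rate = counts["Generated"] / total if total else 0.0
    return rate, counts


# Hypothetical sample, shaped like entries in the results file.
sample = [
    {"winner": "Problem A", "actual_order": {"A": "Reference", "B": "Generated"}},
    {"winner": "Tie", "actual_order": {"A": "Generated", "B": "Reference"}},
    {"winner": "Problem A", "actual_order": {"A": "Generated", "B": "Reference"}},
    {"winner": "Problem B", "actual_order": {"A": "Generated", "B": "Reference"}},
]

rate, counts = win_rate(sample)
print(f"Win Rate = {rate:.2%}")  # prints: Win Rate = 25.00%
```

Resolving the verdict through `actual_order` matters because the A/B positions are shuffled between comparisons, so "Problem A" alone does not say which source won.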
+
+
+---
+
+*Report generated: 2025-10-11 12:43:22*
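
One record in the results file that follows is stored as a tie with the reason "Failed to parse response", although its raw `comparison_text` names Problem B as the winner. That raw text contains literal line breaks inside the `reason` string, which a strict JSON parser rejects, so this is a plausible (but unconfirmed) cause. The sketch below shows one tolerant way to parse such a reply. The function name and fallback record are assumptions for illustration and are not taken from the repository. `json.loads(..., strict=False)` is the standard-library option that permits control characters inside strings.

```python
import json
import re

# Build the fence marker programmatically so this example can sit inside Markdown.
TICKS = "`" * 3
FENCE = re.compile(TICKS + r"(?:json)?\s*(.*?)\s*" + TICKS, re.DOTALL)


def parse_judge_response(text):
    """Extract the judge's verdict from a (possibly fenced) JSON reply."""
    match = FENCE.search(text)
    payload = match.group(1) if match else text
    try:
        # strict=False accepts raw newlines and tabs inside JSON strings.
        data = json.loads(payload, strict=False)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict):
        # Fallback mirrors the record seen in the results file.
        return {"winner": "Tie", "reason": "Failed to parse response"}
    return {"winner": data.get("winner", "Tie"), "reason": data.get("reason", "")}


# Hypothetical reply with unescaped line breaks inside the "reason" string.
raw = (
    TICKS + "json\n"
    '{\n    "winner": "Problem B",\n'
    '    "reason": "First point.\n\nSecond point."\n}\n' + TICKS
)

result = parse_judge_response(raw)
print(result["winner"])
```

With the default strict mode the same payload raises `json.JSONDecodeError`, which would send the comparison down the fallback path and count it as a tie.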

+ 275 - 0
code/chapter12/data_generation/evaluation_results/20251011_123929/win_rate/win_rate_results_20251011_124322.json

@@ -0,0 +1,275 @@
+{
+  "comparisons": [
+    {
+      "problem_a_id": "25",
+      "problem_b_id": "gen_aime_26",
+      "winner": "Problem A",
+      "reason": "Upon evaluation, Problem A and Problem B are both solid in terms of mathematical correctness and clarity. However, Problem A exhibits a higher level of problem quality and complexity, which aligns more closely with the AIME standards. Problem A involves an intricate combinatorial setup, requiring the solver to understand properties of a regular polygon and engage in non-trivial counting arguments to find the number of ways to pair vertices, which is a challenging task suitable for AIME. In contrast, Problem B, while correctly formulated and solvable, relies on a straightforward application of the properties of a regular hexagon inscribed in a circle. The solution to Problem B is more direct and less complex, as it quickly reduces to finding the radius of the circle, a relatively simple task. Therefore, considering the dimensions of difficulty appropriateness and problem quality, Problem A is the winner due to its higher level of complexity and challenge appropriate for AIME.",
+      "comparison_text": "```json\n{\n    \"winner\": \"Problem A\",\n    \"reason\": \"Upon evaluation, Problem A and Problem B are both solid in terms of mathematical correctness and clarity. However, Problem A exhibits a higher level of problem quality and complexity, which aligns more closely with the AIME standards. Problem A involves an intricate combinatorial setup, requiring the solver to understand properties of a regular polygon and engage in non-trivial counting arguments to find the number of ways to pair vertices, which is a challenging task suitable for AIME. In contrast, Problem B, while correctly formulated and solvable, relies on a straightforward application of the properties of a regular hexagon inscribed in a circle. The solution to Problem B is more direct and less complex, as it quickly reduces to finding the radius of the circle, a relatively simple task. Therefore, considering the dimensions of difficulty appropriateness and problem quality, Problem A is the winner due to its higher level of complexity and challenge appropriate for AIME.\"\n}\n```",
+      "execution_time": 6.184344053268433,
+      "actual_order": {
+        "A": "Reference",
+        "B": "Generated"
+      },
+      "actual_winner": "Reference"
+    },
+    {
+      "problem_a_id": "gen_aime_19",
+      "problem_b_id": "10",
+      "winner": "Tie",
+      "reason": "Failed to parse response",
+      "comparison_text": "```json\n{\n    \"winner\": \"Problem B\",\n    \"reason\": \"Both problems exhibit certain strengths and weaknesses, but Problem B has a slight edge in terms of overall quality. \n\n1. **Mathematical Correctness**: Both problems appear to be mathematically correct; however, Problem A contains a solution that does not match the provided answer, indicating a potential error in either the problem setup or the given answer. Problem B's answer matches the problem statement, indicating mathematical correctness.\n\n2. **Clarity**: Problem A is clear in its requirements, but the solution process seems convoluted and potentially incorrect as it doesn't align with the provided answer. Problem B, while more complex, clearly lays out the piecewise function and periodic nature, followed by a clear geometric intersection context.\n\n3. **Difficulty Appropriateness**: Problem B presents a challenge that is suitable for AIME level, requiring understanding of periodic functions, piecewise definitions, and intersections with a parabola. Problem A, while also challenging, suffers from a mismatch in the solution and answer, which affects its perceived difficulty and solvability.\n\n4. **Problem Quality**: Problem B is well-designed with a clear linkage between the periodic function and intersection with a parabola, providing a rich mathematical exploration. Problem A's quality is compromised by the errors in solution alignment, reducing its effectiveness and educational value.\n\nOverall, Problem B is more aligned with the evaluation criteria, offering a clearer, correctly answered, and suitably challenging problem that stands up to scrutiny across all dimensions.\"\n}\n```",
+      "execution_time": 6.297120571136475,
+      "actual_order": {
+        "A": "Generated",
+        "B": "Reference"
+      },
+      "actual_winner": "Tie"
+    },
+    {
+      "problem_a_id": "4",
+      "problem_b_id": "gen_aime_28",
+      "winner": "Problem A",
+      "reason": "Upon evaluating both problems, Problem A is superior in terms of mathematical correctness, clarity, and difficulty appropriateness. Problem A is a combinatorial problem that requires the solver to understand permutations and divisibility rules, which are common themes in AIME-style problems. The problem is clearly stated, and the answer provided is consistent with the problem requirements. In contrast, Problem B has several issues. The solution provided for Problem B incorrectly concludes that the area of one of the smaller triangles is 1, while logically, given the problem statement, the area should be less than 1 due to the division of the area among the smaller triangles. Additionally, the solution lacks clarity, especially in the justification of the final step where it abruptly concludes the area as 1 without proper reasoning, which detracts from its mathematical correctness and clarity. Therefore, Problem A is the better-quality problem.",
+      "comparison_text": "```json\n{\n    \"winner\": \"Problem A\",\n    \"reason\": \"Upon evaluating both problems, Problem A is superior in terms of mathematical correctness, clarity, and difficulty appropriateness. Problem A is a combinatorial problem that requires the solver to understand permutations and divisibility rules, which are common themes in AIME-style problems. The problem is clearly stated, and the answer provided is consistent with the problem requirements. In contrast, Problem B has several issues. The solution provided for Problem B incorrectly concludes that the area of one of the smaller triangles is 1, while logically, given the problem statement, the area should be less than 1 due to the division of the area among the smaller triangles. Additionally, the solution lacks clarity, especially in the justification of the final step where it abruptly concludes the area as 1 without proper reasoning, which detracts from its mathematical correctness and clarity. Therefore, Problem A is the better-quality problem.\"\n}\n```",
+      "execution_time": 4.189812898635864,
+      "actual_order": {
+        "A": "Reference",
+        "B": "Generated"
+      },
+      "actual_winner": "Reference"
+    },
+    {
+      "problem_a_id": "gen_aime_7",
+      "problem_b_id": "27",
+      "winner": "Problem B",
+      "reason": "Problem B has a clear and unambiguous statement with a correct and straightforward answer. The mathematical correctness of the sequence operations in Problem B is sound and leads to a rational number, where the task is to find the remainder when the sum of the numerator and denominator is divided by 1000. This is a typical AIME-style problem that involves sequences and modular arithmetic, which are common themes in such competitions. Problem A, on the other hand, suffers from a solution that is both unnecessarily complicated and possibly incorrect in its simplification process. The inclusion-exclusion principle is correctly applied, but the final simplification error suggests a lack of clarity or correctness in the solution verification, which affects the overall quality. While both problems have a certain degree of complexity, Problem B's approach to sequences provides a more elegant and appropriate challenge for the AIME context.",
+      "comparison_text": "```json\n{\n    \"winner\": \"Problem B\",\n    \"reason\": \"Problem B has a clear and unambiguous statement with a correct and straightforward answer. The mathematical correctness of the sequence operations in Problem B is sound and leads to a rational number, where the task is to find the remainder when the sum of the numerator and denominator is divided by 1000. This is a typical AIME-style problem that involves sequences and modular arithmetic, which are common themes in such competitions. Problem A, on the other hand, suffers from a solution that is both unnecessarily complicated and possibly incorrect in its simplification process. The inclusion-exclusion principle is correctly applied, but the final simplification error suggests a lack of clarity or correctness in the solution verification, which affects the overall quality. While both problems have a certain degree of complexity, Problem B's approach to sequences provides a more elegant and appropriate challenge for the AIME context.\"\n}\n```",
+      "execution_time": 4.697055101394653,
+      "actual_order": {
+        "A": "Generated",
+        "B": "Reference"
+      },
+      "actual_winner": "Reference"
+    },
+    {
+      "problem_a_id": "29",
+      "problem_b_id": "gen_aime_6",
+      "winner": "Problem A",
+      "reason": "Problem A is mathematically correct and concise. It clearly defines a polynomial function and asks for the sum of specific values of k, which is a well-posed question. The problem is appropriate for AIME standards as it involves understanding polynomial behavior and requires analytical thinking. On the other hand, Problem B has multiple issues. The solution provided in Problem B is incorrect, as it contains errors in calculating the maximum value of f(p). The problem statement in Problem B is somewhat misleading because it suggests maximizing the sum of permutations without a clear strategy, leading to confusion. Additionally, the solution attempts to use permutations without clearly explaining the rationale or strategy behind the selections. While Problem B has potential, the errors in its solution detract from its overall quality. Therefore, Problem A is the winner due to its correctness, clarity, and appropriate difficulty level.",
+      "comparison_text": "```json\n{\n    \"winner\": \"Problem A\",\n    \"reason\": \"Problem A is mathematically correct and concise. It clearly defines a polynomial function and asks for the sum of specific values of k, which is a well-posed question. The problem is appropriate for AIME standards as it involves understanding polynomial behavior and requires analytical thinking. On the other hand, Problem B has multiple issues. The solution provided in Problem B is incorrect, as it contains errors in calculating the maximum value of f(p). The problem statement in Problem B is somewhat misleading because it suggests maximizing the sum of permutations without a clear strategy, leading to confusion. Additionally, the solution attempts to use permutations without clearly explaining the rationale or strategy behind the selections. While Problem B has potential, the errors in its solution detract from its overall quality. Therefore, Problem A is the winner due to its correctness, clarity, and appropriate difficulty level.\"\n}\n```",
+      "execution_time": 3.9496095180511475,
+      "actual_order": {
+        "A": "Reference",
+        "B": "Generated"
+      },
+      "actual_winner": "Reference"
+    },
+    {
+      "problem_a_id": "20",
+      "problem_b_id": "gen_aime_29",
+      "winner": "Tie",
+      "reason": "Both problems have their unique attributes that make them suitable for AIME standards, and neither problem stands out significantly over the other across the evaluation dimensions. \n\n1. Mathematical Correctness: Both Problem A and Problem B are mathematically correct. The answers provided are consistent with the problem statements, and the reasoning in Problem B's solution verifies the correctness.\n\n2. Clarity: Problem B is clearer in terms of presentation and understanding. Problem A, while clear, involves a complex geometric setup that might take longer for students to parse compared to the straightforward modular arithmetic problem in Problem B.\n\n3. Difficulty Appropriateness: Problem A involves geometry and requires understanding of several geometric properties and relationships, which aligns well with AIME's difficulty level. Problem B, on the other hand, involves modular arithmetic and application of Fermat's Little Theorem, which are also suitable for AIME.\n\n4. Problem Quality: Problem A is well-designed with a complex geometric configuration that tests a variety of skills. Problem B is simpler in terms of setup but still requires deep understanding of number theory concepts.\n\nOverall, both problems meet the AIME standard comprehensively, and therefore neither problem significantly outshines the other across all dimensions.",
+      "comparison_text": "```json\n{\n    \"winner\": \"Tie\",\n    \"reason\": \"Both problems have their unique attributes that make them suitable for AIME standards, and neither problem stands out significantly over the other across the evaluation dimensions. \\n\\n1. Mathematical Correctness: Both Problem A and Problem B are mathematically correct. The answers provided are consistent with the problem statements, and the reasoning in Problem B's solution verifies the correctness.\\n\\n2. Clarity: Problem B is clearer in terms of presentation and understanding. Problem A, while clear, involves a complex geometric setup that might take longer for students to parse compared to the straightforward modular arithmetic problem in Problem B.\\n\\n3. Difficulty Appropriateness: Problem A involves geometry and requires understanding of several geometric properties and relationships, which aligns well with AIME's difficulty level. Problem B, on the other hand, involves modular arithmetic and application of Fermat's Little Theorem, which are also suitable for AIME.\\n\\n4. Problem Quality: Problem A is well-designed with a complex geometric configuration that tests a variety of skills. Problem B is simpler in terms of setup but still requires deep understanding of number theory concepts.\\n\\nOverall, both problems meet the AIME standard comprehensively, and therefore neither problem significantly outshines the other across all dimensions.\"\n}\n```",
+      "execution_time": 6.399900913238525,
+      "actual_order": {
+        "A": "Reference",
+        "B": "Generated"
+      },
+      "actual_winner": "Tie"
+    },
+    {
+      "problem_a_id": "gen_aime_22",
+      "problem_b_id": "8",
+      "winner": "Problem A",
+      "reason": "Both Problem A and Problem B are mathematically interesting and invite the solver to apply a range of mathematical techniques. However, Problem A stands out for a few reasons: \n\n1. **Mathematical Correctness**: Both problems are mathematically sound and have correct answers. \n\n2. **Clarity**: Problem A provides a clear condition involving congruences that is easy to understand and follow. Problem B, while clear, involves a transformation that might be less intuitive for some solvers in terms of visualizing the rotation and intersection. \n\n3. **Difficulty Appropriateness**: Both problems are challenging and appropriate for AIME, but Problem A involves a more direct application of modular arithmetic and factorization, which aligns well with typical AIME problem-solving strategies. Problem B involves additional geometric visualization and understanding of transformations, making it potentially more complex than typical AIME problems which focus more on algebra and number theory.\n\n4. **Problem Quality**: Problem A is well-designed, with a clear path to the solution that involves logical steps and common number theory techniques. Problem B is also well-designed but may require more abstract thinking due to the geometric aspect. \n\nOverall, Problem A provides a clearer and more direct problem-solving experience, making it slightly superior in this evaluation.",
+      "comparison_text": "```json\n{\n    \"winner\": \"Problem A\",\n    \"reason\": \"Both Problem A and Problem B are mathematically interesting and invite the solver to apply a range of mathematical techniques. However, Problem A stands out for a few reasons: \\n\\n1. **Mathematical Correctness**: Both problems are mathematically sound and have correct answers. \\n\\n2. **Clarity**: Problem A provides a clear condition involving congruences that is easy to understand and follow. Problem B, while clear, involves a transformation that might be less intuitive for some solvers in terms of visualizing the rotation and intersection. \\n\\n3. **Difficulty Appropriateness**: Both problems are challenging and appropriate for AIME, but Problem A involves a more direct application of modular arithmetic and factorization, which aligns well with typical AIME problem-solving strategies. Problem B involves additional geometric visualization and understanding of transformations, making it potentially more complex than typical AIME problems which focus more on algebra and number theory.\\n\\n4. **Problem Quality**: Problem A is well-designed, with a clear path to the solution that involves logical steps and common number theory techniques. Problem B is also well-designed but may require more abstract thinking due to the geometric aspect. \\n\\nOverall, Problem A provides a clearer and more direct problem-solving experience, making it slightly superior in this evaluation.\"\n}\n```",
+      "execution_time": 6.507522344589233,
+      "actual_order": {
+        "A": "Generated",
+        "B": "Reference"
+      },
+      "actual_winner": "Generated"
+    },
+    {
+      "problem_a_id": "22",
+      "problem_b_id": "gen_aime_8",
+      "winner": "Problem A",
+      "reason": "Both problems are mathematically correct and have clear problem statements, but Problem A stands out in terms of complexity and originality. Here is a breakdown of the evaluation dimensions:\n\n1. Mathematical Correctness: Both problems are mathematically sound. Problem A deals with the application of the greedy algorithm to determine when it provides an optimal solution, while Problem B involves probability and divisor counting, both of which are correctly handled.\n\n2. Clarity: Both problem statements are clear and unambiguous, providing sufficient information for the solver to understand what is required without confusion.\n\n3. Difficulty Appropriateness: Both problems are suitable for AIME standards. Problem A's exploration of the greedy algorithm's success across a range of values introduces a unique challenge that requires deeper analysis beyond straightforward calculation. Problem B, while also challenging, involves more standard divisor and probability calculations.\n\n4. Problem Quality: Problem A exhibits higher quality due to its innovative approach in examining the conditions under which a common algorithm succeeds. It encourages deeper thinking about algorithmic efficiency, which adds to the problem's educational value. Problem B, although well-crafted, follows a more conventional path of evaluating divisors and probability.\n\nOverall, Problem A provides a more engaging and thought-provoking challenge, thus making it the higher quality problem.",
+      "comparison_text": "```json\n{\n    \"winner\": \"Problem A\",\n    \"reason\": \"Both problems are mathematically correct and have clear problem statements, but Problem A stands out in terms of complexity and originality. Here is a breakdown of the evaluation dimensions:\\n\\n1. Mathematical Correctness: Both problems are mathematically sound. Problem A deals with the application of the greedy algorithm to determine when it provides an optimal solution, while Problem B involves probability and divisor counting, both of which are correctly handled.\\n\\n2. Clarity: Both problem statements are clear and unambiguous, providing sufficient information for the solver to understand what is required without confusion.\\n\\n3. Difficulty Appropriateness: Both problems are suitable for AIME standards. Problem A's exploration of the greedy algorithm's success across a range of values introduces a unique challenge that requires deeper analysis beyond straightforward calculation. Problem B, while also challenging, involves more standard divisor and probability calculations.\\n\\n4. Problem Quality: Problem A exhibits higher quality due to its innovative approach in examining the conditions under which a common algorithm succeeds. It encourages deeper thinking about algorithmic efficiency, which adds to the problem's educational value. Problem B, although well-crafted, follows a more conventional path of evaluating divisors and probability.\\n\\nOverall, Problem A provides a more engaging and thought-provoking challenge, thus making it the higher quality problem.\"\n}\n```",
+      "execution_time": 4.903303623199463,
+      "actual_order": {
+        "A": "Reference",
+        "B": "Generated"
+      },
+      "actual_winner": "Reference"
+    },
+    {
+      "problem_a_id": "gen_aime_13",
+      "problem_b_id": "22",
+      "winner": "Problem B",
+      "reason": "Both problems are challenging and suitable for the AIME level, but Problem B is the winner based on its higher quality in several evaluation dimensions. Firstly, both problems are mathematically sound, but Problem B is more clearly presented and free from errors or ambiguities in its description. Problem A has some inconsistencies in the solution, particularly in the calculation and explanation of the last three digits of m+n, which is confusing and incorrect as the correct last three digits should simply be the sum 10, not 375. Problem B, on the other hand, is clear and well-structured throughout with a correct answer. Regarding difficulty, both problems are appropriate for AIME, but Problem B offers a more interesting and less straightforward combinatorial challenge, requiring understanding of algorithms and optimization, which adds to its complexity. Overall, Problem B is better designed with a more consistent problem statement and solution, making it the superior problem.",
+      "comparison_text": "```json\n{\n    \"winner\": \"Problem B\",\n    \"reason\": \"Both problems are challenging and suitable for the AIME level, but Problem B is the winner based on its higher quality in several evaluation dimensions. Firstly, both problems are mathematically sound, but Problem B is more clearly presented and free from errors or ambiguities in its description. Problem A has some inconsistencies in the solution, particularly in the calculation and explanation of the last three digits of m+n, which is confusing and incorrect as the correct last three digits should simply be the sum 10, not 375. Problem B, on the other hand, is clear and well-structured throughout with a correct answer. Regarding difficulty, both problems are appropriate for AIME, but Problem B offers a more interesting and less straightforward combinatorial challenge, requiring understanding of algorithms and optimization, which adds to its complexity. Overall, Problem B is better designed with a more consistent problem statement and solution, making it the superior problem.\"\n}\n```",
+      "execution_time": 6.666977405548096,
+      "actual_order": {
+        "A": "Generated",
+        "B": "Reference"
+      },
+      "actual_winner": "Reference"
+    },
+    {
+      "problem_a_id": "4",
+      "problem_b_id": "gen_aime_10",
+      "winner": "Problem B",
+      "reason": "Both problems present interesting challenges, but Problem B stands out in several key aspects. Firstly, the mathematical correctness of Problem B is clearly demonstrated with a detailed solution that shows a complete and accurate derivation of the problem's parameters. The problem involves well-known geometric principles such as the properties of a triangle inscribed in a circle and the Pythagorean theorem, which are executed correctly and contribute to a well-rounded mathematical experience. In terms of clarity, Problem B is succinct in its presentation of the geometric scenario, making it straightforward to understand the setup without ambiguity. Problem A, while interesting, lacks a complete solution, which makes it difficult to fully evaluate its correctness and reasoning. The difficulty of Problem B is appropriate for an AIME-style question, as it involves multiple steps and requires a solid understanding of geometry and algebra, making it challenging yet solvable. Finally, Problem B is better designed as it naturally engages with mathematical concepts and has a logical progression, whereas Problem A seems more mechanical and lacks a creative spark in problem design. Hence, Problem B is judged to be of higher quality.",
+      "comparison_text": "```json\n{\n    \"winner\": \"Problem B\",\n    \"reason\": \"Both problems present interesting challenges, but Problem B stands out in several key aspects. Firstly, the mathematical correctness of Problem B is clearly demonstrated with a detailed solution that shows a complete and accurate derivation of the problem's parameters. The problem involves well-known geometric principles such as the properties of a triangle inscribed in a circle and the Pythagorean theorem, which are executed correctly and contribute to a well-rounded mathematical experience. In terms of clarity, Problem B is succinct in its presentation of the geometric scenario, making it straightforward to understand the setup without ambiguity. Problem A, while interesting, lacks a complete solution, which makes it difficult to fully evaluate its correctness and reasoning. The difficulty of Problem B is appropriate for an AIME-style question, as it involves multiple steps and requires a solid understanding of geometry and algebra, making it challenging yet solvable. Finally, Problem B is better designed as it naturally engages with mathematical concepts and has a logical progression, whereas Problem A seems more mechanical and lacks a creative spark in problem design. Hence, Problem B is judged to be of higher quality.\"\n}\n```",
+      "execution_time": 3.981515645980835,
+      "actual_order": {
+        "A": "Reference",
+        "B": "Generated"
+      },
+      "actual_winner": "Generated"
+    },
+    {
+      "problem_a_id": "gen_aime_9",
+      "problem_b_id": "11",
+      "winner": "Problem A",
+      "reason": "Both Problem A and Problem B are interesting and complex problems that fit the AIME style in different ways. However, after evaluation, Problem A is determined to have higher quality based on the evaluation criteria. \n\n1. **Mathematical Correctness**: Both problems present reasonable and mathematically sound scenarios. Problem A deals with a shifted Fibonacci sequence, a well-defined mathematical concept with a clear solution path. Problem B involves inequalities in a 3D space, which can be more abstract and requires more detailed explanation to ensure correctness.\n\n2. **Clarity**: Problem A is clearer and more straightforward. It provides specific sequence conditions and a clear objective (finding a term divisible by 1000), which is unambiguous. Problem B, while intriguing, involves complex inequalities in a lesser-known context, which might be less accessible without additional context or diagrams.\n\n3. **Difficulty Appropriateness**: Both problems are challenging, fitting the AIME standard. Problem A's difficulty comes from sequence manipulation and modular arithmetic, while Problem B's complexity arises from spatial reasoning and inequalities. Problem A has a clearer path to solution, making it better suited to the timed and competitive AIME environment.\n\n4. **Problem Quality**: Problem A provides a classic mathematical problem-solving experience with a clear, logical progression and a satisfying resolution. Problem B is more novel but may be overly complex for some competitors due to the spatial reasoning and inequality conditions. \n\nOverall, while both problems have merit, Problem A stands out due to its clear statement, mathematical elegance, and accessibility, making it the preferred choice.",
+      "comparison_text": "```json\n{\n    \"winner\": \"Problem A\",\n    \"reason\": \"Both Problem A and Problem B are interesting and complex problems that fit the AIME style in different ways. However, after evaluation, Problem A is determined to have higher quality based on the evaluation criteria. \\n\\n1. **Mathematical Correctness**: Both problems present reasonable and mathematically sound scenarios. Problem A deals with a shifted Fibonacci sequence, a well-defined mathematical concept with a clear solution path. Problem B involves inequalities in a 3D space, which can be more abstract and requires more detailed explanation to ensure correctness.\\n\\n2. **Clarity**: Problem A is clearer and more straightforward. It provides specific sequence conditions and a clear objective (finding a term divisible by 1000), which is unambiguous. Problem B, while intriguing, involves complex inequalities in a lesser-known context, which might be less accessible without additional context or diagrams.\\n\\n3. **Difficulty Appropriateness**: Both problems are challenging, fitting the AIME standard. Problem A's difficulty comes from sequence manipulation and modular arithmetic, while Problem B's complexity arises from spatial reasoning and inequalities. Problem A has a clearer path to solution, making it better suited to the timed and competitive AIME environment.\\n\\n4. **Problem Quality**: Problem A provides a classic mathematical problem-solving experience with a clear, logical progression and a satisfying resolution. Problem B is more novel but may be overly complex for some competitors due to the spatial reasoning and inequality conditions. \\n\\nOverall, while both problems have merit, Problem A stands out due to its clear statement, mathematical elegance, and accessibility, making it the preferred choice.\"\n}\n```",
+      "execution_time": 6.167335748672485,
+      "actual_order": {
+        "A": "Generated",
+        "B": "Reference"
+      },
+      "actual_winner": "Generated"
+    },
+    {
+      "problem_a_id": "12",
+      "problem_b_id": "gen_aime_25",
+      "winner": "Problem B",
+      "reason": "Both problems are mathematically rich and complex, but Problem B stands out for several reasons. Firstly, it adheres more closely to AIME standards, which often incorporate classic geometric theorems and require deeper insight into properties like the Angle Bisector Theorem and Power of a Point. The problem is clearly stated, with given ratios and lengths that guide the solver through a logical sequence of steps rooted in geometry. While both problems are mathematically correct, Problem A's expected value problem involving random line segments through a disk is more abstract and less typical of AIME-style questions. Problem B also provides a more engaging challenge that involves leveraging multiple mathematical concepts, making it a higher-quality problem for an advanced contest setting. Clarity is maintained throughout Problem B, and its level of difficulty is appropriate for AIME, balancing challenge with solvability. Thus, Problem B is the superior choice.",
+      "comparison_text": "```json\n{\n    \"winner\": \"Problem B\",\n    \"reason\": \"Both problems are mathematically rich and complex, but Problem B stands out for several reasons. Firstly, it adheres more closely to AIME standards, which often incorporate classic geometric theorems and require deeper insight into properties like the Angle Bisector Theorem and Power of a Point. The problem is clearly stated, with given ratios and lengths that guide the solver through a logical sequence of steps rooted in geometry. While both problems are mathematically correct, Problem A's expected value problem involving random line segments through a disk is more abstract and less typical of AIME-style questions. Problem B also provides a more engaging challenge that involves leveraging multiple mathematical concepts, making it a higher-quality problem for an advanced contest setting. Clarity is maintained throughout Problem B, and its level of difficulty is appropriate for AIME, balancing challenge with solvability. Thus, Problem B is the superior choice.\"\n}\n```",
+      "execution_time": 3.8654050827026367,
+      "actual_order": {
+        "A": "Reference",
+        "B": "Generated"
+      },
+      "actual_winner": "Generated"
+    },
+    {
+      "problem_a_id": "gen_aime_15",
+      "problem_b_id": "21",
+      "winner": "Problem B",
+      "reason": "When evaluating the two problems based on the given criteria: \n\n1. **Mathematical Correctness**: Problem A has a significant flaw in its solution, as the final answer does not match the problem's conditions. The proposed answer, 16, is not consistent with the calculations or the logic presented in the solution. Problem B, on the other hand, does not present a solution, but the problem statement and the answer provided are inherently more plausible and consistent with the type of problems usually appearing in AIME.\n\n2. **Clarity**: Problem A's statement is somewhat clear, but the solution is convoluted and ultimately incorrect, creating confusion for the reader. Problem B's statement is clear, unambiguous, and typical of combinatorial probability questions in competitive math, making it straightforward for readers familiar with the domain.\n\n3. **Difficulty Appropriateness**: Problem A fails in this category because of the incorrect solution, introducing a level of confusion rather than challenge. Problem B presents a problem that is appropriately challenging for AIME standards, requiring a deep understanding of divisors and probability.\n\n4. **Problem Quality**: Problem A is poorly designed due to its incorrect final answer and flawed explanation, leading to a lack of confidence in its quality. Problem B, however, is well-designed, requiring a nuanced understanding of number theory and probability, aligning well with high-quality AIME problems.\n\nOverall, Problem B is more consistent with the AIME standards of clarity, difficulty, and problem quality, while Problem A suffers from mathematical inaccuracies and unclear solution logic.",
+      "comparison_text": "```json\n{\n    \"winner\": \"Problem B\",\n    \"reason\": \"When evaluating the two problems based on the given criteria: \\n\\n1. **Mathematical Correctness**: Problem A has a significant flaw in its solution, as the final answer does not match the problem's conditions. The proposed answer, 16, is not consistent with the calculations or the logic presented in the solution. Problem B, on the other hand, does not present a solution, but the problem statement and the answer provided are inherently more plausible and consistent with the type of problems usually appearing in AIME.\\n\\n2. **Clarity**: Problem A's statement is somewhat clear, but the solution is convoluted and ultimately incorrect, creating confusion for the reader. Problem B's statement is clear, unambiguous, and typical of combinatorial probability questions in competitive math, making it straightforward for readers familiar with the domain.\\n\\n3. **Difficulty Appropriateness**: Problem A fails in this category because of the incorrect solution, introducing a level of confusion rather than challenge. Problem B presents a problem that is appropriately challenging for AIME standards, requiring a deep understanding of divisors and probability.\\n\\n4. **Problem Quality**: Problem A is poorly designed due to its incorrect final answer and flawed explanation, leading to a lack of confidence in its quality. Problem B, however, is well-designed, requiring a nuanced understanding of number theory and probability, aligning well with high-quality AIME problems.\\n\\nOverall, Problem B is more consistent with the AIME standards of clarity, difficulty, and problem quality, while Problem A suffers from mathematical inaccuracies and unclear solution logic.\"\n}\n```",
+      "execution_time": 5.380846977233887,
+      "actual_order": {
+        "A": "Generated",
+        "B": "Reference"
+      },
+      "actual_winner": "Reference"
+    },
+    {
+      "problem_a_id": "gen_aime_21",
+      "problem_b_id": "20",
+      "winner": "Problem B",
+      "reason": "Both problems exhibit high quality in mathematical design; however, Problem B stands out slightly more in terms of clarity and problem quality. For mathematical correctness, both problems have correct and reasonable answers, with Problem A providing a detailed solution that seems plausible, and Problem B having a straightforward and correct answer. In terms of clarity, Problem B is presented with a clear geometric configuration and an accompanying diagram, which helps in visualizing the problem, despite its complexity. Problem A, while clear, involves complex algebraic reasoning and could be perceived as slightly more ambiguous due to the detailed conditions required for the logarithmic function. Regarding difficulty appropriateness for AIME standards, both problems are challenging yet solvable, with Problem B leaning slightly towards a more engaging geometric exploration. Lastly, in terms of problem quality, Problem B involves an interesting geometric scenario with a creative setup that provides a richer problem-solving experience. Overall, although both problems are of high quality, Problem B is more visually and conceptually appealing, making it the winner in this comparison.",
+      "comparison_text": "```json\n{\n    \"winner\": \"Problem B\",\n    \"reason\": \"Both problems exhibit high quality in mathematical design; however, Problem B stands out slightly more in terms of clarity and problem quality. For mathematical correctness, both problems have correct and reasonable answers, with Problem A providing a detailed solution that seems plausible, and Problem B having a straightforward and correct answer. In terms of clarity, Problem B is presented with a clear geometric configuration and an accompanying diagram, which helps in visualizing the problem, despite its complexity. Problem A, while clear, involves complex algebraic reasoning and could be perceived as slightly more ambiguous due to the detailed conditions required for the logarithmic function. Regarding difficulty appropriateness for AIME standards, both problems are challenging yet solvable, with Problem B leaning slightly towards a more engaging geometric exploration. Lastly, in terms of problem quality, Problem B involves an interesting geometric scenario with a creative setup that provides a richer problem-solving experience. Overall, although both problems are of high quality, Problem B is more visually and conceptually appealing, making it the winner in this comparison.\"\n}\n```",
+      "execution_time": 6.145542144775391,
+      "actual_order": {
+        "A": "Generated",
+        "B": "Reference"
+      },
+      "actual_winner": "Reference"
+    },
+    {
+      "problem_a_id": "6",
+      "problem_b_id": "gen_aime_27",
+      "winner": "Problem B",
+      "reason": "Both problems are challenging and appropriate for AIME-level difficulty, but Problem B has a slight edge in terms of clarity and mathematical correctness. Problem A involves calculating a probability with a somewhat convoluted setup, which could lead to ambiguity without a solution provided. In contrast, Problem B provides a clear geometric problem with a well-defined context and a detailed solution, demonstrating correctness and clarity. Problem B's setup and solution are more straightforward, enhancing its overall quality and making it easier for a solver to follow. Therefore, based on these criteria, Problem B is the winner.",
+      "comparison_text": "```json\n{\n    \"winner\": \"Problem B\",\n    \"reason\": \"Both problems are challenging and appropriate for AIME-level difficulty, but Problem B has a slight edge in terms of clarity and mathematical correctness. Problem A involves calculating a probability with a somewhat convoluted setup, which could lead to ambiguity without a solution provided. In contrast, Problem B provides a clear geometric problem with a well-defined context and a detailed solution, demonstrating correctness and clarity. Problem B's setup and solution are more straightforward, enhancing its overall quality and making it easier for a solver to follow. Therefore, based on these criteria, Problem B is the winner.\"\n}\n```",
+      "execution_time": 2.8636226654052734,
+      "actual_order": {
+        "A": "Reference",
+        "B": "Generated"
+      },
+      "actual_winner": "Generated"
+    },
+    {
+      "problem_a_id": "21",
+      "problem_b_id": "gen_aime_11",
+      "winner": "Problem A",
+      "reason": "Both problems are mathematically correct and clear. Problem A requires knowledge of number theory, specifically properties of divisors and probability, which fits the AIME difficulty level well. Its solution involves recognizing patterns in the divisors and understanding the least common multiple, which adds depth and complexity. Problem B, while clear and correct, primarily involves straightforward calculations of a sequence modulo 7. It is less complex and challenging compared to Problem A, making it slightly below AIME's typical difficulty. Problem A, therefore, presents a more appropriate challenge and is a well-designed problem with suitable complexity, making it the higher quality problem overall.",
+      "comparison_text": "```json\n{\n    \"winner\": \"Problem A\",\n    \"reason\": \"Both problems are mathematically correct and clear. Problem A requires knowledge of number theory, specifically properties of divisors and probability, which fits the AIME difficulty level well. Its solution involves recognizing patterns in the divisors and understanding the least common multiple, which adds depth and complexity. Problem B, while clear and correct, primarily involves straightforward calculations of a sequence modulo 7. It is less complex and challenging compared to Problem A, making it slightly below AIME's typical difficulty. Problem A, therefore, presents a more appropriate challenge and is a well-designed problem with suitable complexity, making it the higher quality problem overall.\"\n}\n```",
+      "execution_time": 3.2908051013946533,
+      "actual_order": {
+        "A": "Reference",
+        "B": "Generated"
+      },
+      "actual_winner": "Reference"
+    },
+    {
+      "problem_a_id": "0",
+      "problem_b_id": "gen_aime_18",
+      "winner": "Problem A",
+      "reason": "Problem A is mathematically correct and has a clear problem statement. It asks for integer bases where a specific divisibility condition holds, which is a typical type of problem found in AIME. The problem is well-designed as it requires understanding of number bases and divisibility, matching the challenging yet solvable nature expected of AIME problems. Problem B, while interesting, contains a critical error in the solution process that undermines its mathematical correctness. The solution presented leads to a contradiction and incorrectly adjusted values that do not match given conditions. This error significantly impacts the problem's quality, as it leads to an incorrect answer. Clarity is reasonable in both problems, but due to the substantial error in Problem B's solution, Problem A is the stronger, more reliable choice.",
+      "comparison_text": "```json\n{\n    \"winner\": \"Problem A\",\n    \"reason\": \"Problem A is mathematically correct and has a clear problem statement. It asks for integer bases where a specific divisibility condition holds, which is a typical type of problem found in AIME. The problem is well-designed as it requires understanding of number bases and divisibility, matching the challenging yet solvable nature expected of AIME problems. Problem B, while interesting, contains a critical error in the solution process that undermines its mathematical correctness. The solution presented leads to a contradiction and incorrectly adjusted values that do not match given conditions. This error significantly impacts the problem's quality, as it leads to an incorrect answer. Clarity is reasonable in both problems, but due to the substantial error in Problem B's solution, Problem A is the stronger, more reliable choice.\"\n}\n```",
+      "execution_time": 3.83967924118042,
+      "actual_order": {
+        "A": "Reference",
+        "B": "Generated"
+      },
+      "actual_winner": "Reference"
+    },
+    {
+      "problem_a_id": "gen_aime_2",
+      "problem_b_id": "20",
+      "winner": "Problem B",
+      "reason": "Both Problem A and Problem B are mathematically correct and provide reasonable answers. However, Problem B stands out in several dimensions. Firstly, it offers a more complex and intricate geometric scenario that fits well with the challenging nature of AIME problems, whereas Problem A's sequence becomes trivial after a few steps. In terms of clarity, both problems are well-stated, but Problem B requires more rigorous reasoning and has more conditions to consider, which adds to its depth and interest. Problem A, although clear, becomes straightforward once the repeating pattern is identified, which may not fully align with the typical AIME difficulty level. Problem B also involves a higher level of problem-solving skills due to its geometric nature and the conditions provided for the inscribed rectangle. Overall, Problem B is a more well-designed and appropriately challenging problem, making it the higher quality choice.",
+      "comparison_text": "```json\n{\n    \"winner\": \"Problem B\",\n    \"reason\": \"Both Problem A and Problem B are mathematically correct and provide reasonable answers. However, Problem B stands out in several dimensions. Firstly, it offers a more complex and intricate geometric scenario that fits well with the challenging nature of AIME problems, whereas Problem A's sequence becomes trivial after a few steps. In terms of clarity, both problems are well-stated, but Problem B requires more rigorous reasoning and has more conditions to consider, which adds to its depth and interest. Problem A, although clear, becomes straightforward once the repeating pattern is identified, which may not fully align with the typical AIME difficulty level. Problem B also involves a higher level of problem-solving skills due to its geometric nature and the conditions provided for the inscribed rectangle. Overall, Problem B is a more well-designed and appropriately challenging problem, making it the higher quality choice.\"\n}\n```",
+      "execution_time": 3.6062393188476562,
+      "actual_order": {
+        "A": "Generated",
+        "B": "Reference"
+      },
+      "actual_winner": "Reference"
+    },
+    {
+      "problem_a_id": "gen_aime_12",
+      "problem_b_id": "16",
+      "winner": "Problem B",
+      "reason": "Both problems have strengths and weaknesses, but Problem B edges ahead in terms of overall quality. In terms of mathematical correctness, both problems have potential issues. Problem A's solution contains some reasoning errors and incorrect assumptions, particularly in the handling of the overlap with the chess club, which leads to an incorrect conclusion. Problem B, although lacking a detailed solution, presents a clear and straightforward problem statement that is mathematically sound. Regarding clarity, both problems are generally clear, but Problem B is more concise and less prone to misinterpretation. For difficulty appropriateness, Problem B presents a challenge typical of AIME problems, involving divisibility and polynomial factorization, whereas Problem A's logic puzzle is slightly more straightforward once the correct approach is applied. Finally, in terms of problem quality, Problem B is well-designed with a focused question that does not rely on potentially misleading assumptions. Despite the lack of a given solution, Problem B's statement and answer indicate a correct and reasonable problem setup, making it the better choice.",
+      "comparison_text": "```json\n{\n    \"winner\": \"Problem B\",\n    \"reason\": \"Both problems have strengths and weaknesses, but Problem B edges ahead in terms of overall quality. In terms of mathematical correctness, both problems have potential issues. Problem A's solution contains some reasoning errors and incorrect assumptions, particularly in the handling of the overlap with the chess club, which leads to an incorrect conclusion. Problem B, although lacking a detailed solution, presents a clear and straightforward problem statement that is mathematically sound. Regarding clarity, both problems are generally clear, but Problem B is more concise and less prone to misinterpretation. For difficulty appropriateness, Problem B presents a challenge typical of AIME problems, involving divisibility and polynomial factorization, whereas Problem A's logic puzzle is slightly more straightforward once the correct approach is applied. Finally, in terms of problem quality, Problem B is well-designed with a focused question that does not rely on potentially misleading assumptions. Despite the lack of a given solution, Problem B's statement and answer indicate a correct and reasonable problem setup, making it the better choice.\"\n}\n```",
+      "execution_time": 4.273446559906006,
+      "actual_order": {
+        "A": "Generated",
+        "B": "Reference"
+      },
+      "actual_winner": "Reference"
+    },
+    {
+      "problem_a_id": "2",
+      "problem_b_id": "gen_aime_30",
+      "winner": "Problem A",
+      "reason": "Both problems are interesting and involve creative reasoning, but Problem A stands out for a few reasons. In terms of mathematical correctness, both answers appear to be correct, although Problem B's solution seems to contain an inconsistency in calculating the value of m+n+p, where the expected value of 6 was mistakenly replaced by 23 in the final answer. This inconsistency in Problem B affects its mathematical correctness. Clarity-wise, both problems are clear and well-stated, although Problem B's complexity might overwhelm some solvers due to the multiple steps and algebraic manipulations required. Regarding difficulty appropriateness, Problem A fits well with AIME standards, combining combinatorial reasoning with simple arithmetic. Problem B is also challenging, but the error in the solution and potential missteps during its algebraic manipulation may complicate its perceived difficulty. Lastly, in terms of problem quality, Problem A is well-designed within its domain, presenting a classic yet engaging combinatorial challenge. Problem B, while rich in geometric concepts, suffers from solution errors that slightly detract from its overall quality. Therefore, Problem A is the winner.",
+      "comparison_text": "```json\n{\n    \"winner\": \"Problem A\",\n    \"reason\": \"Both problems are interesting and involve creative reasoning, but Problem A stands out for a few reasons. In terms of mathematical correctness, both answers appear to be correct, although Problem B's solution seems to contain an inconsistency in calculating the value of m+n+p, where the expected value of 6 was mistakenly replaced by 23 in the final answer. This inconsistency in Problem B affects its mathematical correctness. Clarity-wise, both problems are clear and well-stated, although Problem B's complexity might overwhelm some solvers due to the multiple steps and algebraic manipulations required. Regarding difficulty appropriateness, Problem A fits well with AIME standards, combining combinatorial reasoning with simple arithmetic. Problem B is also challenging, but the error in the solution and potential missteps during its algebraic manipulation may complicate its perceived difficulty. Lastly, in terms of problem quality, Problem A is well-designed within its domain, presenting a classic yet engaging combinatorial challenge. Problem B, while rich in geometric concepts, suffers from solution errors that slightly detract from its overall quality. Therefore, Problem A is the winner.\"\n}\n```",
+      "execution_time": 5.753171920776367,
+      "actual_order": {
+        "A": "Reference",
+        "B": "Generated"
+      },
+      "actual_winner": "Reference"
+    }
+  ],
+  "metrics": {
+    "win_rate": 0.25,
+    "loss_rate": 0.65,
+    "tie_rate": 0.1,
+    "wins": 5,
+    "losses": 13,
+    "ties": 2,
+    "total_comparisons": 20
+  },
+  "evaluation_date": "2025-10-11T12:43:22.568017",
+  "judge_model": "gpt-4o"
+}
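The `metrics` block above can be recomputed from the per-comparison records. The following is a minimal sketch, not the repository's `WinRateTool` implementation: the function names `resolve_actual_winner` and `summarize_win_rate` are hypothetical, and because no tie record is visible in this excerpt, treating any verdict other than `"Problem A"` or `"Problem B"` as a tie is an assumption.

```python
from collections import Counter
from typing import Any, Dict, List


def resolve_actual_winner(winner: str, actual_order: Dict[str, str]) -> str:
    """Map the judge's positional verdict back to 'Generated' or 'Reference'."""
    if winner == "Problem A":
        return actual_order["A"]
    if winner == "Problem B":
        return actual_order["B"]
    # Assumption: any other verdict is counted as a tie.
    return "Tie"


def summarize_win_rate(comparisons: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Count wins, losses, and ties from the generated problems' point of view."""
    counts = Counter(
        resolve_actual_winner(c["winner"], c["actual_order"]) for c in comparisons
    )
    total = len(comparisons)
    wins, losses, ties = counts["Generated"], counts["Reference"], counts["Tie"]
    return {
        "win_rate": wins / total if total else 0.0,
        "loss_rate": losses / total if total else 0.0,
        "tie_rate": ties / total if total else 0.0,
        "wins": wins,
        "losses": losses,
        "ties": ties,
        "total_comparisons": total,
    }
```

Because the A/B positions are shuffled per comparison (see `actual_order`), the positional verdict has to be resolved before counting; counting `"Problem A"` verdicts directly would measure position preference rather than quality.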

File diff suppressed because the file is too large
+ 20 - 0
code/chapter12/data_generation/generated_data/aime_generated_20251011_042741.json


+ 44 - 0
code/chapter12/data_generation/generated_data/generation_report_20251011_042741.md

@@ -0,0 +1,44 @@
+# AIME Problem Generation Statistics Report
+
+## Basic Information
+
+- **Generated at**: 2025-10-11 04:41:36
+- **Number of problems**: 30
+
+## Topic Distribution
+
+| Topic | Count | Share |
+|------|------|------|
+| Number Theory | 9 | 30.0% |
+| Geometry | 9 | 30.0% |
+| Algebra | 8 | 26.7% |
+| Probability | 2 | 6.7% |
+| Combinatorics | 2 | 6.7% |
+
+## Answer Analysis
+
+- **Mean answer**: 172.00
+- **Minimum answer**: 1
+- **Maximum answer**: 999
+- **Answer range**: 1-999
+
+## Problem List
+
+| ID | Topic | Answer |
+|-----|------|------|
+| gen_aime_1 | Number Theory | 1 |
+| gen_aime_2 | Number Theory | 9 |
+| gen_aime_3 | Algebra | 70 |
+| gen_aime_4 | Algebra | 25 |
+| gen_aime_5 | Algebra | 10 |
+| gen_aime_6 | Number Theory | 999 |
+| gen_aime_7 | Probability | 769 |
+| gen_aime_8 | Number Theory | 19 |
+| gen_aime_9 | Number Theory | 15 |
+| gen_aime_10 | Geometry | 34 |
+
+*(Only the first 10 problems are shown; see the JSON file for the full list)*
+
+---
+
+*Report generated at: 2025-10-11 04:41:36*
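The topic-distribution table in this report can be reproduced from the generated problem list. This is a minimal sketch under assumptions: `topic_distribution` is a hypothetical helper, not the generator's actual reporting code, and it assumes each problem record carries a `topic` field (the field the verification UI reads).

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def topic_distribution(problems: List[Dict[str, Any]]) -> List[Tuple[str, int, float]]:
    """Return (topic, count, percentage) rows, most frequent topic first."""
    counts = Counter(p.get("topic", "Unknown") for p in problems)
    total = sum(counts.values())
    return [
        (topic, n, round(100 * n / total, 1))
        for topic, n in counts.most_common()
    ]
```

Each row maps directly onto one line of the Markdown table, for example `f"| {topic} | {n} | {share}% |"`.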

+ 254 - 0
code/chapter12/data_generation/human_verification_ui.py

@@ -0,0 +1,254 @@
+"""
+Human verification UI
+
+Builds a Gradio web interface for manually verifying generated AIME problems
+"""
+
+import json
+import os
+from typing import List, Dict, Any, Tuple
+from datetime import datetime
+import gradio as gr
+
+
+class HumanVerificationUI:
+    """Human verification UI."""
+    
+    def __init__(self, data_path: str):
+        """
+        Initialize the verification UI.
+        
+        Args:
+            data_path: Path to the JSON file containing the generated data
+        """
+        self.data_path = data_path
+        self.problems = self._load_problems()
+        self.current_index = 0
+        self.verifications = self._load_verifications()
+        
+    def _load_problems(self) -> List[Dict[str, Any]]:
+        """Load the problem data."""
+        if not os.path.exists(self.data_path):
+            raise FileNotFoundError(f"Data file not found: {self.data_path}")
+        
+        with open(self.data_path, 'r', encoding='utf-8') as f:
+            return json.load(f)
+    
+    def _load_verifications(self) -> Dict[str, Any]:
+        """Load any existing verification results."""
+        verification_path = self.data_path.replace(".json", "_verifications.json")
+        
+        if os.path.exists(verification_path):
+            with open(verification_path, 'r', encoding='utf-8') as f:
+                return json.load(f)
+        
+        return {}
+    
+    def _save_verifications(self):
+        """Save the verification results."""
+        verification_path = self.data_path.replace(".json", "_verifications.json")
+        
+        with open(verification_path, 'w', encoding='utf-8') as f:
+            json.dump(self.verifications, f, ensure_ascii=False, indent=2)
+    
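+    def get_current_problem(self) -> Tuple[str, str, str, str, str, str]:
+        """Return the display fields for the current problem."""
+        if not self.problems:
+            # The sixth value feeds the comments box, so it stays empty here.
+            return "No problems (0/0)", "", "", "", "", ""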
+
+        problem = self.problems[self.current_index]
+        problem_id = problem.get("id", "unknown")
+
+        # Look up any existing verification for this problem
+        verification = self.verifications.get(problem_id, {})
+
+        return (
+            f"Problem {self.current_index + 1}/{len(self.problems)}",
+            problem.get("problem", ""),
+            f"Answer: {problem.get('answer', 'N/A')}",
+            problem.get("solution", ""),
+            f"Topic: {problem.get('topic', 'N/A')}",
+            verification.get("comments", "")
+        )
+    
+    def verify_problem(
+        self,
+        correctness: int,
+        clarity: int,
+        difficulty_match: int,
+        completeness: int,
+        status: str,
+        comments: str
+    ) -> str:
+        """
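+        Verify the current problem.
+        
+        Args:
+            correctness: Correctness score (1-5)
+            clarity: Clarity score (1-5)
+            difficulty_match: Difficulty-match score (1-5)
+            completeness: Completeness score (1-5)
+            status: Verification status (approved/rejected/needs_revision)
+            comments: Reviewer comments
+        
+        Returns:
+            A message describing the verification result
+        """
+        if not self.problems:
+            return "❌ No problems to verify"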
+        
+        problem = self.problems[self.current_index]
+        problem_id = problem.get("id", "unknown")
+        
+        # Record the verification result
+        self.verifications[problem_id] = {
+            "problem_id": problem_id,
+            "scores": {
+                "correctness": correctness,
+                "clarity": clarity,
+                "difficulty_match": difficulty_match,
+                "completeness": completeness
+            },
+            "total_score": (correctness + clarity + difficulty_match + completeness) / 4,
+            "status": status,
+            "comments": comments,
+            "verified_at": datetime.now().isoformat()
+        }
+        
+        self._save_verifications()
+        
+        return f"✅ Problem {problem_id} verified!\nOverall score: {self.verifications[problem_id]['total_score']:.2f}/5.0"
+    
+    def next_problem(self) -> Tuple[str, str, str, str, str, str]:
+        """Move to the next problem."""
+        if self.current_index < len(self.problems) - 1:
+            self.current_index += 1
+        return self.get_current_problem()
+    
+    def prev_problem(self) -> Tuple[str, str, str, str, str, str]:
+        """Move to the previous problem."""
+        if self.current_index > 0:
+            self.current_index -= 1
+        return self.get_current_problem()
+    
+    def get_statistics(self) -> str:
+        """Return a summary of the verification statistics."""
+        if not self.verifications:
+            return "No verification data yet"
+        
+        total = len(self.problems)
+        verified = len(self.verifications)
+        
+        approved = sum(1 for v in self.verifications.values() if v["status"] == "approved")
+        rejected = sum(1 for v in self.verifications.values() if v["status"] == "rejected")
+        needs_revision = sum(1 for v in self.verifications.values() if v["status"] == "needs_revision")
+        
+        avg_score = sum(v["total_score"] for v in self.verifications.values()) / verified if verified > 0 else 0
+        
+        return f"""
+📊 Verification Statistics
+
+Total problems: {total}
+Verified: {verified} ({verified/total*100:.1f}%)
+Unverified: {total - verified}
+
+Verification results:
+- ✅ Approved: {approved}
+- ❌ Rejected: {rejected}
+- 🔄 Needs revision: {needs_revision}
+
+Average score: {avg_score:.2f}/5.0
+"""
+    
+    def launch(self, share: bool = False):
+        """Launch the Gradio interface."""
+        with gr.Blocks(title="AIME Problem Human Verification") as demo:
+            gr.Markdown("# 🎯 AIME Problem Human Verification System")
+            gr.Markdown(f"Data file: `{self.data_path}`")
+            
+            with gr.Row():
+                with gr.Column(scale=2):
+                    # Problem display area
+                    title = gr.Textbox(label="Current problem", interactive=False)
+                    problem_text = gr.Textbox(label="Problem statement", lines=5, interactive=False)
+                    answer_text = gr.Textbox(label="Answer", interactive=False)
+                    solution_text = gr.Textbox(label="Solution", lines=10, interactive=False)
+                    metadata_text = gr.Textbox(label="Metadata", interactive=False)
+                
+                with gr.Column(scale=1):
+                    # Scoring area
+                    gr.Markdown("### 📝 Scores (1-5)")
+                    correctness_slider = gr.Slider(1, 5, value=3, step=1, label="Correctness")
+                    clarity_slider = gr.Slider(1, 5, value=3, step=1, label="Clarity")
+                    difficulty_slider = gr.Slider(1, 5, value=3, step=1, label="Difficulty match")
+                    completeness_slider = gr.Slider(1, 5, value=3, step=1, label="Completeness")
+                    
+                    # Status selection
+                    gr.Markdown("### ✅ Verification status")
+                    status_radio = gr.Radio(
+                        choices=["approved", "rejected", "needs_revision"],
+                        value="approved",
+                        label="Status"
+                    )
+                    
+                    # Comments
+                    comments_text = gr.Textbox(label="Comments", lines=3, placeholder="Enter comments...")
+                    
+                    # Submit button
+                    verify_btn = gr.Button("✅ Submit verification", variant="primary")
+                    verify_result = gr.Textbox(label="Verification result", interactive=False)
+            
+            # Navigation buttons
+            with gr.Row():
+                prev_btn = gr.Button("⬅️ Previous")
+                next_btn = gr.Button("Next ➡️")
+            
+            # Statistics
+            with gr.Row():
+                stats_text = gr.Textbox(label="Verification statistics", lines=10, interactive=False)
+                refresh_stats_btn = gr.Button("🔄 Refresh statistics")
+            
+            # Load the initial problem
+            demo.load(
+                fn=self.get_current_problem,
+                outputs=[title, problem_text, answer_text, solution_text, metadata_text, comments_text]
+            )
+            
+            # Bind events
+            verify_btn.click(
+                fn=self.verify_problem,
+                inputs=[correctness_slider, clarity_slider, difficulty_slider, completeness_slider, status_radio, comments_text],
+                outputs=verify_result
+            )
+            
+            next_btn.click(
+                fn=self.next_problem,
+                outputs=[title, problem_text, answer_text, solution_text, metadata_text, comments_text]
+            )
+            
+            prev_btn.click(
+                fn=self.prev_problem,
+                outputs=[title, problem_text, answer_text, solution_text, metadata_text, comments_text]
+            )
+            
+            refresh_stats_btn.click(
+                fn=self.get_statistics,
+                outputs=stats_text
+            )
+        
+        demo.launch(share=share, server_name="127.0.0.1", server_port=7860)
+
+
+if __name__ == "__main__":
+    import sys
+    
+    if len(sys.argv) < 2:
+        print("用法: python human_verification_ui.py <data_path>")
+        print("示例: python human_verification_ui.py generated_data/aime_generated_20250110_120000.json")
+        sys.exit(1)
+    
+    data_path = sys.argv[1]
+    
+    ui = HumanVerificationUI(data_path)
+    ui.launch(share=False)
+

+ 314 - 0
code/chapter12/data_generation/run_complete_evaluation.py

@@ -0,0 +1,314 @@
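+"""
+Full evaluation pipeline.
+
+Runs the complete data generation and evaluation flow:
+1. Generate AIME-style problems
+2. LLM Judge evaluation
+3. Win Rate evaluation
+4. Generate the comprehensive report
+
+Usage:
+python data_generation/run_complete_evaluation.py 30 3.0
+
+Arguments:
+- 30: number of problems to generate
+- 3.0: delay between generations, in seconds
+
+Notes:
+- Uses the real AIME 2025 problems as the evaluation reference
+- Dataset source: math-ai/aime25 (JSONL format)
+"""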
+
+import json
+import os
+import sys
+from datetime import datetime
+from aime_generator import AIMEGenerator
+from hello_agents import HelloAgentsLLM
+from hello_agents.tools import LLMJudgeTool, WinRateTool
+
+
+def run_complete_evaluation(
+    num_problems: int = 30,
+    delay_seconds: float = 3.0
+):
+    """
+    Run the full evaluation pipeline.
+
+    Args:
+        num_problems: Number of problems to generate.
+        delay_seconds: Delay between generations, in seconds, to avoid API rate limits.
+    """
+    print("\n" + "="*80)
+    print("🚀 AIME数据生成与评估完整流程")
+    print("="*80)
+    print(f"\n配置信息:")
+    print(f"  - 生成题目数量: {num_problems}")
+    print(f"  - API延迟: {delay_seconds}秒/题")
+    print(f"  - 生成参考数据: TianHongZXY/aime-1983-2025(900+道题)")
+    print(f"  - 评估参考: AIME 2025真题")
+
+    # ========== Step 1: generate AIME-style problems ==========
+    print("\n" + "="*80)
+    print("📝 步骤1: 生成AIME题目")
+    print("="*80)
+
+    generator = AIMEGenerator(delay_seconds=delay_seconds)
+    generated_data_path = generator.generate_and_save(
+        num_problems=num_problems,
+        output_dir="data_generation/generated_data"
+    )
+
+    print(f"\n✅ 步骤1完成!生成数据保存在: {generated_data_path}")
+
+    # ========== Step 2: evaluation ==========
+    # Create the output directories for the evaluation results
+    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+    evaluation_dir = f"data_generation/evaluation_results/{timestamp}"
+    os.makedirs(evaluation_dir, exist_ok=True)
+    os.makedirs(os.path.join(evaluation_dir, "llm_judge"), exist_ok=True)
+    os.makedirs(os.path.join(evaluation_dir, "win_rate"), exist_ok=True)
+
+    # Create the LLM client
+    llm = HelloAgentsLLM()
+
+    # ========== Step 2.1: LLM Judge evaluation ==========
+    print(f"\n🎯 步骤2.1: LLM Judge评估 (vs AIME 2025)")
+
+    llm_judge_result = None
+    try:
+        llm_judge_tool = LLMJudgeTool(llm=llm)
+
+        llm_judge_result_json = llm_judge_tool.run({
+            "generated_data_path": generated_data_path,
+            "reference_year": 2025,
+            "max_samples": num_problems,
+            "output_dir": os.path.join(evaluation_dir, "llm_judge"),
+            "judge_model": "gpt-4o"
+        })
+
+        llm_judge_result = json.loads(llm_judge_result_json)
+        print(f"\n✅ LLM Judge评估完成!")
+        print(f"   平均总分: {llm_judge_result['metrics']['average_total_score']:.2f}/5.0")
+        print(f"   通过率: {llm_judge_result['metrics']['pass_rate']:.2%}")
+    except Exception as e:
+        print(f"\n❌ LLM Judge评估失败: {e}")
+        import traceback
+        traceback.print_exc()
+
+    # ========== Step 2.2: Win Rate evaluation ==========
+    print(f"\n🏆 步骤2.2: Win Rate评估 (vs AIME 2025)")
+
+    win_rate_result = None
+    try:
+        win_rate_tool = WinRateTool(llm=llm)
+
+        win_rate_result_json = win_rate_tool.run({
+            "generated_data_path": generated_data_path,
+            "reference_year": 2025,
+            "num_comparisons": min(num_problems, 20),  # at most 20 comparisons
+            "output_dir": os.path.join(evaluation_dir, "win_rate"),
+            "judge_model": "gpt-4o"
+        })
+
+        win_rate_result = json.loads(win_rate_result_json)
+        print(f"\n✅ Win Rate评估完成!")
+        print(f"   Win Rate: {win_rate_result['metrics']['win_rate']:.2%}")
+    except Exception as e:
+        print(f"\n❌ Win Rate评估失败: {e}")
+        import traceback
+        traceback.print_exc()
+
+    # ========== Step 3: generate the comprehensive report ==========
+    comprehensive_report_path = None
+    if llm_judge_result or win_rate_result:
+        print("\n" + "="*80)
+        print("📊 步骤3: 生成综合报告")
+        print("="*80)
+
+        comprehensive_report_path = os.path.join(evaluation_dir, "comprehensive_report.md")
+
+        # Build the comprehensive report
+        report = generate_comprehensive_report(
+            generated_data_path,
+            llm_judge_result,
+            win_rate_result
+        )
+
+        with open(comprehensive_report_path, 'w', encoding='utf-8') as f:
+            f.write(report)
+
+        print(f"\n✅ 综合报告已保存: {comprehensive_report_path}")
+
+    # ========== Done ==========
+    print("\n" + "="*80)
+    print("🎉 完整评估流程完成!")
+    print("="*80)
+    print(f"\n📁 输出文件:")
+    print(f"   - 生成数据: {generated_data_path}")
+    print(f"   - 评估结果目录: {evaluation_dir}")
+
+    if llm_judge_result:
+        print(f"   - LLM Judge报告: {llm_judge_result.get('report_file', 'N/A')}")
+    if win_rate_result:
+        print(f"   - Win Rate报告: {win_rate_result.get('report_file', 'N/A')}")
+
+    if comprehensive_report_path:
+        print(f"   - 综合报告: {comprehensive_report_path}")
+
+    print(f"\n💡 下一步:")
+    if comprehensive_report_path:
+        print(f"   1. 查看综合报告: {comprehensive_report_path}")
+    print(f"   2. 运行人工验证: python data_generation/human_verification_ui.py {generated_data_path}")
+
+    return {
+        "generated_data_path": generated_data_path,
+        "llm_judge_result": llm_judge_result,
+        "win_rate_result": win_rate_result,
+        "comprehensive_report_path": comprehensive_report_path
+    }
+
+
+def generate_comprehensive_report(
+    generated_data_path: str,
+    llm_judge_result: dict,
+    win_rate_result: dict
+) -> str:
+    """Build the comprehensive evaluation report as a Markdown string."""
+
+    # Load the generated problems
+    with open(generated_data_path, 'r', encoding='utf-8') as f:
+        generated_data = json.load(f)
+
+    report = f"""# AIME数据生成与评估综合报告
+
+## 1. 基本信息
+
+- **生成时间**: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
+- **生成题目数量**: {len(generated_data)}
+- **参考AIME年份**: 2025
+- **生成数据路径**: {generated_data_path}
+
+## 2. 数据生成统计
+
+### 主题分布
+
+"""
+
+    # Count problems per topic
+    topic_counts = {}
+    for item in generated_data:
+        topic = item.get('topic', 'Unknown')
+        topic_counts[topic] = topic_counts.get(topic, 0) + 1
+
+    report += "| 主题 | 数量 | 占比 |\n"
+    report += "|------|------|------|\n"
+    for topic, count in sorted(topic_counts.items(), key=lambda x: x[1], reverse=True):
+        percentage = count / len(generated_data) * 100
+        report += f"| {topic} | {count} | {percentage:.1f}% |\n"
+
+    # LLM Judge results
+    if llm_judge_result:
+        report += "\n## 3. LLM Judge评估结果\n\n"
+        report += f"""**总体评分**:
+- 平均总分: {llm_judge_result['metrics']['average_total_score']:.2f}/5.0
+- 通过率: {llm_judge_result['metrics']['pass_rate']:.2%}
+- 优秀率: {llm_judge_result['metrics']['excellent_rate']:.2%}
+
+**各维度评分**:
+
+| 维度 | 平均分 |
+|------|--------|
+| 正确性 | {llm_judge_result['metrics']['dimension_averages']['correctness']:.2f}/5.0 |
+| 清晰度 | {llm_judge_result['metrics']['dimension_averages']['clarity']:.2f}/5.0 |
+| 难度匹配 | {llm_judge_result['metrics']['dimension_averages']['difficulty_match']:.2f}/5.0 |
+| 完整性 | {llm_judge_result['metrics']['dimension_averages']['completeness']:.2f}/5.0 |
+
+"""
+
+    # Win Rate results
+    if win_rate_result:
+        report += "\n## 4. Win Rate评估结果\n\n"
+        report += f"""**胜率统计**:
+- Win Rate: {win_rate_result['metrics']['win_rate']:.2%}
+- Loss Rate: {win_rate_result['metrics']['loss_rate']:.2%}
+- Tie Rate: {win_rate_result['metrics']['tie_rate']:.2%}
+
+**对比次数**:
+- 总对比次数: {win_rate_result['metrics']['total_comparisons']} 次
+- 胜出次数: {win_rate_result['metrics']['wins']} 次
+- 失败次数: {win_rate_result['metrics']['losses']} 次
+- 平局次数: {win_rate_result['metrics']['ties']} 次
+
+"""
+
+    # Overall conclusion
+    report += "\n## 5. 综合结论\n\n"
+
+    if llm_judge_result and win_rate_result:
+        overall_avg_score = llm_judge_result['metrics']['average_total_score']
+        overall_win_rate = win_rate_result['metrics']['win_rate']
+
+        if overall_avg_score >= 4.5 and overall_win_rate >= 0.48:
+            report += "✅ **结论**: 生成数据质量**优秀**,达到或超过AIME真题水平。\n"
+        elif overall_avg_score >= 4.0 and overall_win_rate >= 0.45:
+            report += "✅ **结论**: 生成数据质量**良好**,接近AIME真题水平。\n"
+        else:
+            report += "⚠️ **结论**: 生成数据质量**需要改进**,与AIME真题仍有差距。\n"
+
+        report += f"\n**整体指标**:\n"
+        report += f"- LLM Judge得分: {overall_avg_score:.2f}/5.0\n"
+        report += f"- Win Rate: {overall_win_rate:.2%}\n"
+
+    # Improvement suggestions
+    report += "\n## 6. 改进建议\n\n"
+
+    if llm_judge_result:
+        avg_score = llm_judge_result['metrics']['average_total_score']
+        if avg_score >= 4.5:
+            report += "- ✅ 继续保持当前的生成策略\n"
+            report += "- ✅ 可以考虑增加生成数量\n"
+        elif avg_score >= 4.0:
+            report += "- 🔄 优化题目生成的提示词\n"
+            report += "- 🔄 增加质量过滤步骤\n"
+        else:
+            report += "- ⚠️ 需要重新设计生成提示词\n"
+            report += "- ⚠️ 考虑使用更强的生成模型\n"
+            report += "- ⚠️ 增加人工审核环节\n"
+
+    # Next actions
+    report += "\n## 7. 下一步行动\n\n"
+    report += "1. **人工验证**: 运行人工验证界面,对生成的题目进行人工审核\n"
+    report += f"   ```bash\n   python data_generation/human_verification_ui.py {generated_data_path}\n   ```\n\n"
+    report += "2. **质量筛选**: 根据评估结果筛选高质量题目\n\n"
+    report += "3. **迭代优化**: 根据评估反馈优化生成策略\n"
+
+    report += f"\n---\n\n*报告生成时间: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}*\n"
+
+    return report
+
+
+def main():
+    if len(sys.argv) < 2:
+        print("用法: python run_complete_evaluation.py <num_problems> [delay_seconds]")
+        print("\n说明:")
+        print("  - 使用AIME 2025年真题作为参考")
+        print("  - 数据集来源: math-ai/aime25(JSONL格式)")
+        print("\n示例:")
+        print("python run_complete_evaluation.py 30 3.0")
+        sys.exit(1)
+
+    # Parse command-line arguments
+    num_problems = int(sys.argv[1])
+    delay_seconds = float(sys.argv[2]) if len(sys.argv) > 2 else 3.0
+
+    # Run the full evaluation
+    run_complete_evaluation(
+        num_problems=num_problems,
+        delay_seconds=delay_seconds
+    )
+
+
+if __name__ == "__main__":
+    main()
+
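
Reviewer note on `generate_comprehensive_report` above: the verdict is decided by an inline `if`/`elif` chain, and no verdict is written at all when only one of the two evaluations succeeded. A minimal sketch of the same thresholds as a standalone function may make them easier to test. The function name `classify_quality`, the English labels, and the `"incomplete"` fallback are assumptions for illustration and are not part of the commit.

```python
from typing import Optional


def classify_quality(avg_score: Optional[float], win_rate: Optional[float]) -> str:
    """Map the two headline metrics to a verdict label.

    The thresholds mirror the if/elif chain in generate_comprehensive_report:
    "excellent" needs avg_score >= 4.5 and win_rate >= 0.48,
    "good" needs avg_score >= 4.0 and win_rate >= 0.45.
    """
    if avg_score is None or win_rate is None:
        # Hypothetical fallback: the committed code writes no verdict here.
        return "incomplete"
    if avg_score >= 4.5 and win_rate >= 0.48:
        return "excellent"
    if avg_score >= 4.0 and win_rate >= 0.45:
        return "good"
    return "needs_improvement"


# Both metrics must clear a tier's thresholds; a high score alone is not enough.
print(classify_quality(4.2, 0.45))
print(classify_quality(4.6, 0.40))
```

Keeping the thresholds in one function would also let both scripts share them instead of repeating the comparison.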

+ 45 - 0
code/chapter12/data_generation/step1_generate_only.py

@@ -0,0 +1,45 @@
+"""
+Step 1: generate AIME-style problems only.
+
+Usage:
+python data_generation/step1_generate_only.py 30 3.0
+
+Arguments:
+- 30: number of problems to generate
+- 3.0: delay between generations, in seconds
+"""
+
+import sys
+from aime_generator import AIMEGenerator
+
+
+def main():
+    # Parse command-line arguments
+    num_problems = int(sys.argv[1]) if len(sys.argv) > 1 else 30
+    delay_seconds = float(sys.argv[2]) if len(sys.argv) > 2 else 3.0
+    
+    print("\n" + "="*80)
+    print("📝 步骤1: 生成AIME题目")
+    print("="*80)
+    print(f"\n配置信息:")
+    print(f"  - 生成题目数量: {num_problems}")
+    print(f"  - API延迟: {delay_seconds}秒/题")
+    print(f"  - 生成参考数据: TianHongZXY/aime-1983-2025(900+道题)")
+    
+    # Create the generator
+    generator = AIMEGenerator(delay_seconds=delay_seconds)
+    
+    # Generate and save
+    generated_data_path = generator.generate_and_save(
+        num_problems=num_problems,
+        output_dir="data_generation/generated_data"
+    )
+    
+    print(f"\n✅ 步骤1完成!生成数据保存在: {generated_data_path}")
+    print(f"\n下一步:运行评估")
+    print(f"python data_generation/step2_evaluate_only.py {generated_data_path}")
+
+
+if __name__ == "__main__":
+    main()
+
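
Reviewer note on `step1_generate_only.py` above: the script reads `sys.argv` positionally and converts with `int()`/`float()`, so a malformed argument surfaces as a raw `ValueError`. The sketch below shows one way to get the same two optional positional arguments with validation and a generated `--help`, using only the standard-library `argparse`. The function name `parse_args` and the range checks are assumptions for illustration; the defaults (30 and 3.0) are taken from the script.

```python
import argparse


def parse_args(argv=None):
    """Parse the two optional positional arguments used by the step 1 script."""
    parser = argparse.ArgumentParser(
        description="Generate AIME-style problems (step 1 only)."
    )
    parser.add_argument(
        "num_problems", nargs="?", type=int, default=30,
        help="number of problems to generate (default: 30)",
    )
    parser.add_argument(
        "delay_seconds", nargs="?", type=float, default=3.0,
        help="delay between generations in seconds (default: 3.0)",
    )
    args = parser.parse_args(argv)

    # Hypothetical sanity checks; the committed script accepts any number.
    if args.num_problems <= 0:
        parser.error("num_problems must be a positive integer")
    if args.delay_seconds < 0:
        parser.error("delay_seconds must not be negative")
    return args


# An explicit argument list is passed here so the example does not depend on sys.argv.
args = parse_args(["10", "2.5"])
print(args.num_problems, args.delay_seconds)
```

Calling `parse_args()` with no argument list would read `sys.argv[1:]`, which matches how the script is invoked from the command line.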

+ 287 - 0
code/chapter12/data_generation/step2_evaluate_only.py

@@ -0,0 +1,287 @@
+"""
+Step 2: evaluate previously generated AIME-style problems only.
+
+Usage:
+python data_generation/step2_evaluate_only.py <generated_data_path>
+
+Arguments:
+- generated_data_path: path to the generated data file
+
+Notes:
+- Uses the real AIME 2025 problems as the evaluation reference
+- Dataset source: math-ai/aime25 (JSONL format)
+
+Example:
+python data_generation/step2_evaluate_only.py data_generation/generated_data/aime_generated_20251011_042741.json
+"""
+
+import json
+import os
+import sys
+from datetime import datetime
+from hello_agents import HelloAgentsLLM
+from hello_agents.tools import LLMJudgeTool, WinRateTool
+
+
+def run_evaluation(generated_data_path: str):
+    """
+    Run the evaluation flow.
+
+    Args:
+        generated_data_path: Path to the generated data file.
+    """
+    print("\n" + "="*80)
+    print("🎯 步骤2: 评估已生成的AIME题目")
+    print("="*80)
+    print(f"\n配置信息:")
+    print(f"  - 生成数据: {generated_data_path}")
+    print(f"  - 评估参考: AIME 2025真题")
+    
+    # Check that the file exists
+    if not os.path.exists(generated_data_path):
+        print(f"\n❌ 错误:文件不存在: {generated_data_path}")
+        return
+    
+    # Load the generated data to get the number of problems
+    with open(generated_data_path, 'r', encoding='utf-8') as f:
+        generated_data = json.load(f)
+    num_problems = len(generated_data)
+    print(f"  - 题目数量: {num_problems}")
+    
+    # Create the output directories for the evaluation results
+    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+    evaluation_dir = f"data_generation/evaluation_results/{timestamp}"
+    os.makedirs(evaluation_dir, exist_ok=True)
+    os.makedirs(os.path.join(evaluation_dir, "llm_judge"), exist_ok=True)
+    os.makedirs(os.path.join(evaluation_dir, "win_rate"), exist_ok=True)
+
+    # Create the LLM client
+    llm = HelloAgentsLLM()
+
+    # ========== LLM Judge evaluation ==========
+    print(f"\n🎯 步骤2.1: LLM Judge评估 (vs AIME 2025)")
+
+    llm_judge_result = None
+    try:
+        llm_judge_tool = LLMJudgeTool(llm=llm)
+
+        llm_judge_result_json = llm_judge_tool.run({
+            "generated_data_path": generated_data_path,
+            "reference_year": 2025,
+            "max_samples": num_problems,
+            "output_dir": os.path.join(evaluation_dir, "llm_judge"),
+            "judge_model": "gpt-4o"
+        })
+
+        llm_judge_result = json.loads(llm_judge_result_json)
+        print(f"\n✅ LLM Judge评估完成!")
+        print(f"   平均总分: {llm_judge_result['metrics']['average_total_score']:.2f}/5.0")
+        print(f"   通过率: {llm_judge_result['metrics']['pass_rate']:.2%}")
+    except Exception as e:
+        print(f"\n❌ LLM Judge评估失败: {e}")
+        import traceback
+        traceback.print_exc()
+
+    # ========== Win Rate evaluation ==========
+    print(f"\n🏆 步骤2.2: Win Rate评估 (vs AIME 2025)")
+
+    win_rate_result = None
+    try:
+        win_rate_tool = WinRateTool(llm=llm)
+
+        win_rate_result_json = win_rate_tool.run({
+            "generated_data_path": generated_data_path,
+            "reference_year": 2025,
+            "num_comparisons": min(num_problems, 20),  # at most 20 comparisons
+            "output_dir": os.path.join(evaluation_dir, "win_rate"),
+            "judge_model": "gpt-4o"
+        })
+
+        win_rate_result = json.loads(win_rate_result_json)
+        print(f"\n✅ Win Rate评估完成!")
+        print(f"   Win Rate: {win_rate_result['metrics']['win_rate']:.2%}")
+    except Exception as e:
+        print(f"\n❌ Win Rate评估失败: {e}")
+        import traceback
+        traceback.print_exc()
+    
+    # ========== Generate the comprehensive report ==========
+    comprehensive_report_path = None
+    if llm_judge_result or win_rate_result:
+        print("\n" + "="*80)
+        print("📊 步骤2.3: 生成综合报告")
+        print("="*80)
+
+        comprehensive_report_path = os.path.join(evaluation_dir, "comprehensive_report.md")
+
+        # Build the comprehensive report
+        report = generate_comprehensive_report(
+            generated_data_path,
+            llm_judge_result,
+            win_rate_result
+        )
+
+        with open(comprehensive_report_path, 'w', encoding='utf-8') as f:
+            f.write(report)
+
+        print(f"\n✅ 综合报告已保存: {comprehensive_report_path}")
+
+    # ========== Done ==========
+    print("\n" + "="*80)
+    print("🎉 评估流程完成!")
+    print("="*80)
+    print(f"\n📁 输出文件:")
+    print(f"   - 评估结果目录: {evaluation_dir}")
+
+    if llm_judge_result:
+        print(f"   - LLM Judge报告: {llm_judge_result.get('report_file', 'N/A')}")
+    if win_rate_result:
+        print(f"   - Win Rate报告: {win_rate_result.get('report_file', 'N/A')}")
+
+    if comprehensive_report_path:
+        print(f"   - 综合报告: {comprehensive_report_path}")
+
+    print(f"\n💡 下一步:")
+    if comprehensive_report_path:
+        print(f"   1. 查看综合报告: {comprehensive_report_path}")
+    print(f"   2. 运行人工验证: python data_generation/human_verification_ui.py {generated_data_path}")
+
+
+def generate_comprehensive_report(
+    generated_data_path: str,
+    llm_judge_result: dict,
+    win_rate_result: dict
+) -> str:
+    """Build the comprehensive evaluation report as a Markdown string."""
+
+    # Load the generated problems
+    with open(generated_data_path, 'r', encoding='utf-8') as f:
+        generated_data = json.load(f)
+
+    report = f"""# AIME数据生成与评估综合报告
+
+## 1. 基本信息
+
+- **生成时间**: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
+- **生成题目数量**: {len(generated_data)}
+- **参考AIME年份**: 2025
+- **生成数据路径**: {generated_data_path}
+
+## 2. 数据生成统计
+
+### 主题分布
+
+"""
+    
+    # Count problems per topic
+    topic_counts = {}
+    for item in generated_data:
+        topic = item.get('topic', 'Unknown')
+        topic_counts[topic] = topic_counts.get(topic, 0) + 1
+    
+    report += "| 主题 | 数量 | 占比 |\n"
+    report += "|------|------|------|\n"
+    for topic, count in sorted(topic_counts.items(), key=lambda x: x[1], reverse=True):
+        percentage = count / len(generated_data) * 100
+        report += f"| {topic} | {count} | {percentage:.1f}% |\n"
+    
+    # LLM Judge results
+    if llm_judge_result:
+        report += "\n## 3. LLM Judge评估结果\n\n"
+        report += f"""**总体评分**:
+- 平均总分: {llm_judge_result['metrics']['average_total_score']:.2f}/5.0
+- 通过率: {llm_judge_result['metrics']['pass_rate']:.2%}
+- 优秀率: {llm_judge_result['metrics']['excellent_rate']:.2%}
+
+**各维度评分**:
+
+| 维度 | 平均分 |
+|------|--------|
+| 正确性 | {llm_judge_result['metrics']['dimension_averages']['correctness']:.2f}/5.0 |
+| 清晰度 | {llm_judge_result['metrics']['dimension_averages']['clarity']:.2f}/5.0 |
+| 难度匹配 | {llm_judge_result['metrics']['dimension_averages']['difficulty_match']:.2f}/5.0 |
+| 完整性 | {llm_judge_result['metrics']['dimension_averages']['completeness']:.2f}/5.0 |
+
+"""
+
+    # Win Rate results
+    if win_rate_result:
+        report += "\n## 4. Win Rate评估结果\n\n"
+        report += f"""**胜率统计**:
+- Win Rate: {win_rate_result['metrics']['win_rate']:.2%}
+- Loss Rate: {win_rate_result['metrics']['loss_rate']:.2%}
+- Tie Rate: {win_rate_result['metrics']['tie_rate']:.2%}
+
+**对比次数**:
+- 总对比次数: {win_rate_result['metrics']['total_comparisons']} 次
+- 胜出次数: {win_rate_result['metrics']['wins']} 次
+- 失败次数: {win_rate_result['metrics']['losses']} 次
+- 平局次数: {win_rate_result['metrics']['ties']} 次
+
+"""
+
+    # Overall conclusion
+    report += "\n## 5. 综合结论\n\n"
+
+    if llm_judge_result and win_rate_result:
+        overall_avg_score = llm_judge_result['metrics']['average_total_score']
+        overall_win_rate = win_rate_result['metrics']['win_rate']
+
+        if overall_avg_score >= 4.5 and overall_win_rate >= 0.48:
+            report += "✅ **结论**: 生成数据质量**优秀**,达到或超过AIME真题水平。\n"
+        elif overall_avg_score >= 4.0 and overall_win_rate >= 0.45:
+            report += "✅ **结论**: 生成数据质量**良好**,接近AIME真题水平。\n"
+        else:
+            report += "⚠️ **结论**: 生成数据质量**需要改进**,与AIME真题仍有差距。\n"
+
+        report += f"\n**整体指标**:\n"
+        report += f"- LLM Judge得分: {overall_avg_score:.2f}/5.0\n"
+        report += f"- Win Rate: {overall_win_rate:.2%}\n"
+
+    # Improvement suggestions
+    report += "\n## 6. 改进建议\n\n"
+
+    if llm_judge_result:
+        avg_score = llm_judge_result['metrics']['average_total_score']
+        if avg_score >= 4.5:
+            report += "- ✅ 继续保持当前的生成策略\n"
+            report += "- ✅ 可以考虑增加生成数量\n"
+        elif avg_score >= 4.0:
+            report += "- 🔄 优化题目生成的提示词\n"
+            report += "- 🔄 增加质量过滤步骤\n"
+        else:
+            report += "- ⚠️ 需要重新设计生成提示词\n"
+            report += "- ⚠️ 考虑使用更强的生成模型\n"
+            report += "- ⚠️ 增加人工审核环节\n"
+    
+    # Next actions
+    report += "\n## 7. 下一步行动\n\n"
+    report += "1. **人工验证**: 运行人工验证界面,对生成的题目进行人工审核\n"
+    report += f"   ```bash\n   python data_generation/human_verification_ui.py {generated_data_path}\n   ```\n\n"
+    report += "2. **质量筛选**: 根据评估结果筛选高质量题目\n\n"
+    report += "3. **迭代优化**: 根据评估反馈优化生成策略\n"
+    
+    report += f"\n---\n\n*报告生成时间: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}*\n"
+    
+    return report
+
+
+def main():
+    if len(sys.argv) < 2:
+        print("用法: python step2_evaluate_only.py <generated_data_path>")
+        print("\n说明:")
+        print("  - 使用AIME 2025年真题作为参考")
+        print("  - 数据集来源: math-ai/aime25(JSONL格式)")
+        print("  - 需要安装: pip install pandas pyarrow datasets")
+        print("\n示例:")
+        print("python step2_evaluate_only.py data_generation/generated_data/aime_generated_20251011_042741.json")
+        sys.exit(1)
+
+    generated_data_path = sys.argv[1]
+
+    run_evaluation(generated_data_path)
+
+
+if __name__ == "__main__":
+    main()
+
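
Reviewer note on `step2_evaluate_only.py` above: its `generate_comprehensive_report` repeats the one in `run_complete_evaluation.py`, including the topic-distribution loop. As a rough sketch of how that loop could be factored into a small shared helper, the example below uses `collections.Counter`. The helper name `topic_table` and the English column headers are assumptions for illustration; the `topic` key and the `Unknown` default come from the committed code.

```python
from collections import Counter


def topic_table(problems):
    """Return a Markdown table of topic counts, most frequent topic first."""
    counts = Counter(item.get("topic", "Unknown") for item in problems)
    total = len(problems)

    lines = [
        "| Topic | Count | Share |",
        "|-------|-------|-------|",
    ]
    # most_common() orders by count, descending, like the sorted() call in the script.
    # With an empty input the loop body never runs, so there is no division by zero.
    for topic, count in counts.most_common():
        lines.append(f"| {topic} | {count} | {count / total * 100:.1f}% |")
    return "\n".join(lines)


sample = [
    {"topic": "Algebra"},
    {"topic": "Algebra"},
    {"topic": "Geometry"},
    {},  # a problem without a topic falls back to "Unknown"
]
print(topic_table(sample))
```

Both scripts could then import the helper rather than carrying their own copy of the loop.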

+ 312 - 0
code/chapter12/data_generation/运行指南.md

@@ -0,0 +1,312 @@
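+# AIME Data Generation and Evaluation: Complete Run Guide
+
+This document walks through the full workflow, from data generation to the final evaluation report.
+
+## Prerequisites
+
+### 1. Environment setup
+
+Make sure all dependencies are installed:
+
+```bash
+# Install the evaluation dependencies
+pip install hello-agents[evaluation]
+
+# Or install them manually
+pip install datasets huggingface_hub pandas tqdm gradio
+```
+
+### 2. Environment variables
+
+Configure the following in the `.env` file:
+
+```bash
+# LLM API key (set one of the two)
+DASHSCOPE_API_KEY=your_dashscope_key  # Alibaba Cloud DashScope
+OPENAI_API_KEY=your_openai_key        # OpenAI
+
+# HuggingFace token (used to download datasets)
+HF_TOKEN=your_hf_token
+```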
+
+## Full Run
+
+### Step 1: Run the complete evaluation pipeline
+
+This is the **one-command** option: it generates the problems, evaluates them, and writes the reports automatically:
+
+```bash
+cd docs/chapter12/HelloAgents
+python data_generation/run_complete_evaluation.py 30 3.0
+```
+
+**Arguments**:
+- `30` - generate 30 problems
+- `3.0` - wait 3 seconds between generations (2-3 seconds recommended)
+
+**Notes**:
+- The real AIME 2025 problems are used as the evaluation reference
+- Dataset source: math-ai/aime25 (JSONL format)
+
+**Estimated time**:
+- Generating 30 problems: about 15-30 minutes (depends on API speed)
+- LLM Judge evaluation: about 10-15 minutes
+- Win Rate evaluation: about 5-10 minutes
+- **Total**: about 30-55 minutes
+
+**Output files**:
+```
+data_generation/
+├── generated_data/
+│   └── aime_generated_YYYYMMDD_HHMMSS.json  # generated problems
+└── evaluation_results/
+    └── YYYYMMDD_HHMMSS/
+        ├── llm_judge/
+        │   ├── llm_judge_results_YYYYMMDD_HHMMSS.json
+        │   └── llm_judge_report_YYYYMMDD_HHMMSS.md
+        ├── win_rate/
+        │   ├── win_rate_results_YYYYMMDD_HHMMSS.json
+        │   └── win_rate_report_YYYYMMDD_HHMMSS.md
+        └── comprehensive_report.md  # comprehensive report
+```
+
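+### Step 2: Review the evaluation reports
+
+#### 2.1 Comprehensive report
+
+```bash
+# Find the most recent evaluation results directory
+cd data_generation/evaluation_results
+ls -lt  # newest directories are listed first
+
+# View the comprehensive report
+cat YYYYMMDD_HHMMSS/comprehensive_report.md
+```
+
+**The comprehensive report contains**:
+- Basic information (generation time, number of problems, and so on)
+- Data generation statistics (topic distribution)
+- LLM Judge results (overall scores and per-dimension scores)
+- Win Rate results (win/loss/tie rates and comparison counts)
+- Overall conclusion and improvement suggestions
+
+#### 2.2 Detailed reports
+
+**LLM Judge detailed report**:
+```bash
+cat YYYYMMDD_HHMMSS/llm_judge/llm_judge_report_YYYYMMDD_HHMMSS.md
+```
+
+**Win Rate detailed report**:
+```bash
+cat YYYYMMDD_HHMMSS/win_rate/win_rate_report_YYYYMMDD_HHMMSS.md
+```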
+
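+### Step 3: Human verification (optional)
+
+To verify the problems by hand, run:
+
+```bash
+python data_generation/human_verification_ui.py data_generation/generated_data/aime_generated_YYYYMMDD_HHMMSS.json
+```
+
+**Procedure**:
+1. Open `http://127.0.0.1:7860` in a browser
+2. Read the problem, answer, and solution
+3. Score it on 4 dimensions (1-5 points each)
+4. Choose a verification status (approved/rejected/needs_revision)
+5. Add a comment (optional)
+6. Click "提交验证" (Submit verification)
+7. Move on to the next problem
+
+**Verification results are saved to**:
+```
+data_generation/generated_data/aime_generated_YYYYMMDD_HHMMSS_verifications.json
+```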
+
+## Step-by-Step Run (Advanced)
+
+To run the stages separately, follow the steps below:
+
+### Step 1: Generate data only
+
+```python
+from data_generation.aime_generator import AIMEGenerator
+
+generator = AIMEGenerator(delay_seconds=3.0)
+generated_data_path = generator.generate_and_save(num_problems=30)
+print(f"生成数据保存在: {generated_data_path}")
+```
+
+### Step 2: Run the LLM Judge evaluation only
+
+```python
+from hello_agents import HelloAgentsLLM
+from hello_agents.tools import LLMJudgeTool
+
+llm = HelloAgentsLLM()
+llm_judge_tool = LLMJudgeTool(llm=llm)
+
+result = llm_judge_tool.run({
+    "generated_data_path": "data_generation/generated_data/aime_generated_XXXXXX.json",
+    "reference_year": 2025,
+    "max_samples": 30,
+    "output_dir": "data_generation/evaluation_results/llm_judge"
+})
+```
+
+### Step 3: Run the Win Rate evaluation only
+
+```python
+from hello_agents import HelloAgentsLLM
+from hello_agents.tools import WinRateTool
+
+llm = HelloAgentsLLM()
+win_rate_tool = WinRateTool(llm=llm)
+
+result = win_rate_tool.run({
+    "generated_data_path": "data_generation/generated_data/aime_generated_XXXXXX.json",
+    "reference_year": 2025,
+    "num_comparisons": 20,
+    "output_dir": "data_generation/evaluation_results/win_rate"
+})
+```
+
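+## FAQ
+
+### 1. API rate limits
+
+**Symptom**:
+```
+INFO:openai._base_client:Retrying request to /chat/completions in 0.451826 seconds
+```
+
+**Fix**:
+- Increase the delay: `python data_generation/run_complete_evaluation.py 30 5.0`
+- Resume from the checkpoint: rerunning the same command after an interruption resumes automatically
+
+### 2. Slow HuggingFace downloads
+
+**Symptom**: downloading the AIME dataset is very slow
+
+**Fix**:
+```bash
+# Use a mirror
+export HF_ENDPOINT=https://hf-mirror.com
+
+# Or download the dataset manually and point to the local path
+```
+
+### 3. Duplicate generated problems
+
+**Symptom**: some generated problems are duplicates
+
+**Mitigations already in place**:
+- More than 900 real problems are used as reference examples
+- Each generation picks a different reference at random
+- The prompt stresses "generate a completely different problem"
+
+### 4. Evaluation fails
+
+**Symptom**: the LLM Judge or Win Rate evaluation fails
+
+**Fix**:
+- Check that the API key is correct
+- Check that the generated data file exists
+- Check that the data file is in the expected format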
+
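+## Quality Standards
+
+### Excellent
+- LLM Judge average score ≥ 4.5/5.0
+- Win Rate ≥ 48% (close to 50%)
+- Pass rate ≥ 90%
+- Human verification pass rate ≥ 95%
+
+### Good
+- LLM Judge average score ≥ 4.0/5.0
+- Win Rate ≥ 45%
+- Pass rate ≥ 80%
+- Human verification pass rate ≥ 90%
+
+### Needs improvement
+- LLM Judge average score < 4.0/5.0
+- Win Rate < 45%
+- Pass rate < 80%
+- Human verification pass rate < 90%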
+
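+## Next Steps
+
+Depending on the evaluation results:
+
+1. **If the quality is excellent**:
+   - The generated data can be used as is
+   - Consider generating more data
+   - Keep the evaluation reports as evidence of quality
+
+2. **If the quality is good**:
+   - Run human verification
+   - Filter for high-quality problems
+   - Adjust the generation prompt
+
+3. **If it needs improvement**:
+   - Analyze what the low-scoring problems have in common
+   - Adjust the generation prompt
+   - Regenerate and re-evaluate
+
+## Sample Output
+
+### Sample comprehensive report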
+
+```markdown
+# AIME数据生成与评估综合报告
+
+## 1. 基本信息
+
+- **生成时间**: 2025-01-10 12:00:00
+- **生成题目数量**: 30
+- **参考AIME年份**: 2025
+
+## 2. 数据生成统计
+
+### 主题分布
+
+| 主题 | 数量 | 占比 |
+|------|------|------|
+| 代数 | 10 | 33.3% |
+| 几何 | 8 | 26.7% |
+| 数论 | 7 | 23.3% |
+| 组合 | 3 | 10.0% |
+| 概率 | 2 | 6.7% |
+
+## 3. LLM Judge评估结果
+
+- **平均总分**: 4.2/5.0
+- **通过率**: 85.0%
+- **优秀率**: 40.0%
+
+## 4. Win Rate评估结果
+
+- **Win Rate**: 45.0%
+- **评级**: 良好
+
+## 5. 综合结论
+
+✅ 生成数据质量**良好**,接近AIME真题水平。
+```
+
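+## Summary
+
+The full workflow:
+1. Run `python data_generation/run_complete_evaluation.py 30 3.0`
+2. Wait 30-55 minutes
+3. Read the comprehensive report at `data_generation/evaluation_results/YYYYMMDD_HHMMSS/comprehensive_report.md`
+4. (Optional) Run human verification
+5. Decide the next step based on the evaluation results
+
+**Notes**:
+- All evaluations use the real AIME 2025 problems as the reference
+- Dataset source: math-ai/aime25 (JSONL format)
+
+Enjoy!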
+
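
Reviewer note on the run guide above: it locates the newest results directory by eye with `ls -lt`. Because each run directory is named with a `YYYYMMDD_HHMMSS` timestamp, the newest one can also be found by sorting the names. The sketch below shows one way to do that with the standard-library `pathlib`; the function name `latest_report` is an assumption for illustration, while the directory layout and the `comprehensive_report.md` file name are taken from the guide.

```python
import re
from pathlib import Path
from typing import Optional

# Run directories are named like 20251011_123929 (YYYYMMDD_HHMMSS).
TIMESTAMP_DIR = re.compile(r"^\d{8}_\d{6}$")


def latest_report(results_root: str) -> Optional[Path]:
    """Return the comprehensive report of the most recent run that has one."""
    root = Path(results_root)
    if not root.is_dir():
        return None

    run_dirs = sorted(
        (p for p in root.iterdir() if p.is_dir() and TIMESTAMP_DIR.match(p.name)),
        key=lambda p: p.name,  # zero-padded timestamps sort chronologically
    )
    # Walk from newest to oldest; a run where both evaluations failed has no report.
    for run_dir in reversed(run_dirs):
        report = run_dir / "comprehensive_report.md"
        if report.is_file():
            return report
    return None


print(latest_report("data_generation/evaluation_results"))
```

The function returns `None` when the results directory is missing or no run produced a report, so callers can handle that case explicitly.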

+ 36 - 0
code/chapter12/template_output/evaluation_reports/bfcl_report_20251011_010343.md

@@ -0,0 +1,36 @@
+# BFCL评估报告
+
+**生成时间**: 2025-10-11 01:03:43
+
+## 📊 评估概览
+
+- **智能体**: TestAgent
+- **评估类别**: simple_python
+- **总体准确率**: 100.00%
+- **正确样本数**: 5/5
+
+## 📈 详细指标
+
+### 分类准确率
+
+- **simple_python**: 100.00% (5/5)
+
+## 📝 样本详情
+
+| 样本ID | 问题 | 预测结果 | 正确答案 | 是否正确 |
+|--------|------|----------|----------|----------|
+| simple_python_0 | Find the area of a triangle with a base of 10 units and heig... | [{'name': 'calculate_triangle_area', 'ar... | [{'calculate_triangle_area': {'base': [1... | ✅ |
+| simple_python_1 | Calculate the factorial of 5 using math functions. | [{'name': 'math.factorial', 'arguments':... | [{'math.factorial': {'number': [5]}}] | ✅ |
+| simple_python_2 | Calculate the hypotenuse of a right triangle given the lengt... | [{'name': 'math.hypot', 'arguments': {'x... | [{'math.hypot': {'x': [4], 'y': [5], 'z'... | ✅ |
+| simple_python_3 | Find the roots of a quadratic equation with coefficients a=1... | [{'name': 'algebra.quadratic_roots', 'ar... | [{'algebra.quadratic_roots': {'a': [1], ... | ✅ |
+| simple_python_4 | Solve a quadratic equation where a=2, b=6, and c=5 | [{'name': 'solve_quadratic_equation', 'a... | [{'solve_quadratic_equation': {'a': [2],... | ✅ |
+
+## 📊 准确率可视化
+
+```
+准确率: ██████████████████████████████████████████████████ 100.00%
+```
+
+## 💡 建议
+
+- ✅ 表现优秀!智能体在工具调用方面表现出色。

+ 38 - 0
code/chapter12/template_output/evaluation_reports/gaia_report_20251011_012648.md

@@ -0,0 +1,38 @@
+# GAIA评估报告
+
+**生成时间**: 2025-10-11 01:26:48
+
+## 📊 评估概览
+
+- **智能体**: TestAgent
+- **难度级别**: 1
+- **总样本数**: 2
+- **精确匹配数**: 1
+- **部分匹配数**: 1
+- **精确匹配率**: 50.00%
+- **部分匹配率**: 50.00%
+
+## 📈 详细指标
+
+### 分级准确率
+
+- **Level 1**: 50.00% 精确 / 50.00% 部分 (1/2)
+
+## 📝 样本详情(前10个)
+
+| 任务ID | 级别 | 预测答案 | 正确答案 | 精确匹配 | 部分匹配 |
+|--------|------|----------|----------|----------|----------|
+| e1fc63a2-da7a-432f-be78-7c4a95598703 | 1 | 24000 | 17 | ❌ | ❌ |
+| 8e867cd7-cff9-4e6c-867a-ff5ddc2550be | 1 | 3 | 3 | ✅ | ✅ |
+
+## 📊 准确率可视化
+
+```
+精确匹配: █████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░ 50.00%
+部分匹配: █████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░ 50.00%
+```
+
+## 💡 建议
+
+- ⚠️ 表现一般,需要改进。
+- 💡 建议检查工具使用和多步推理能力。

+ 5 - 0
code/chapter12/template_output/evaluation_results/bfcl_official/BFCL_v4_simple_python_result.json

@@ -0,0 +1,5 @@
+{"id": "simple_python_0", "result": "calculate_triangle_area(base=10, height=5)", "inference_log": [{"role": "user", "content": [[{"role": "user", "content": "Find the area of a triangle with a base of 10 units and height of 5 units."}]]}, {"role": "assistant", "content": "[{\"name\": \"calculate_triangle_area\", \"arguments\": {\"base\": 10, \"height\": 5}}]"}]}
+{"id": "simple_python_1", "result": "math.factorial(number=5)", "inference_log": [{"role": "user", "content": [[{"role": "user", "content": "Calculate the factorial of 5 using math functions."}]]}, {"role": "assistant", "content": "[{\"name\": \"math.factorial\", \"arguments\": {\"number\": 5}}]"}]}
+{"id": "simple_python_2", "result": "math.hypot(x=4, y=5)", "inference_log": [{"role": "user", "content": [[{"role": "user", "content": "Calculate the hypotenuse of a right triangle given the lengths of the other two sides as 4 and 5."}]]}, {"role": "assistant", "content": "[{\"name\": \"math.hypot\", \"arguments\": {\"x\": 4, \"y\": 5}}]"}]}
+{"id": "simple_python_3", "result": "algebra.quadratic_roots(a=1, b=-3, c=2)", "inference_log": [{"role": "user", "content": [[{"role": "user", "content": "Find the roots of a quadratic equation with coefficients a=1, b=-3, c=2."}]]}, {"role": "assistant", "content": "[{\"name\": \"algebra.quadratic_roots\", \"arguments\": {\"a\": 1, \"b\": -3, \"c\": 2}}]"}]}
+{"id": "simple_python_4", "result": "solve_quadratic_equation(a=2, b=6, c=5)", "inference_log": [{"role": "user", "content": [[{"role": "user", "content": "Solve a quadratic equation where a=2, b=6, and c=5"}]]}, {"role": "assistant", "content": "[{\"name\": \"solve_quadratic_equation\", \"arguments\": {\"a\": 2, \"b\": 6, \"c\": 5}}]"}]}

+ 216 - 0
code/chapter12/template_output/evaluation_results/bfcl_official/HelloAgents_simple_python_results.json

@@ -0,0 +1,216 @@
+{
+  "benchmark": "BFCL",
+  "agent_name": "HelloAgents",
+  "evaluation_mode": "ast",
+  "category": "simple_python",
+  "total_samples": 5,
+  "correct_samples": 5,
+  "overall_accuracy": 1.0,
+  "category_metrics": {
+    "unknown": {
+      "total": 5,
+      "correct": 5,
+      "accuracy": 1.0
+    }
+  },
+  "detailed_results": [
+    {
+      "success": true,
+      "score": 1.0,
+      "predicted": [
+        {
+          "name": "calculate_triangle_area",
+          "arguments": {
+            "base": 10,
+            "height": 5,
+            "unit": "units"
+          }
+        }
+      ],
+      "expected": [
+        {
+          "calculate_triangle_area": {
+            "base": [
+              10
+            ],
+            "height": [
+              5
+            ],
+            "unit": [
+              "units",
+              ""
+            ]
+          }
+        }
+      ],
+      "response": "[{\"name\": \"calculate_triangle_area\", \"arguments\": {\"base\": 10, \"height\": 5, \"unit\": \"units\"}}]",
+      "question": [
+        [
+          {
+            "role": "user",
+            "content": "Find the area of a triangle with a base of 10 units and height of 5 units."
+          }
+        ]
+      ],
+      "execution_time": 8.112257480621338,
+      "sample_id": "simple_python_0",
+      "category": "unknown"
+    },
+    {
+      "success": true,
+      "score": 1.0,
+      "predicted": [
+        {
+          "name": "math.factorial",
+          "arguments": {
+            "number": 5
+          }
+        }
+      ],
+      "expected": [
+        {
+          "math.factorial": {
+            "number": [
+              5
+            ]
+          }
+        }
+      ],
+      "response": "[{\"name\": \"math.factorial\", \"arguments\": {\"number\": 5}}]",
+      "question": [
+        [
+          {
+            "role": "user",
+            "content": "Calculate the factorial of 5 using math functions."
+          }
+        ]
+      ],
+      "execution_time": 3.5840706825256348,
+      "sample_id": "simple_python_1",
+      "category": "unknown"
+    },
+    {
+      "success": true,
+      "score": 1.0,
+      "predicted": [
+        {
+          "name": "math.hypot",
+          "arguments": {
+            "x": 4,
+            "y": 5
+          }
+        }
+      ],
+      "expected": [
+        {
+          "math.hypot": {
+            "x": [
+              4
+            ],
+            "y": [
+              5
+            ],
+            "z": [
+              "",
+              0
+            ]
+          }
+        }
+      ],
+      "response": "[{\"name\": \"math.hypot\", \"arguments\": {\"x\": 4, \"y\": 5}}]",
+      "question": [
+        [
+          {
+            "role": "user",
+            "content": "Calculate the hypotenuse of a right triangle given the lengths of the other two sides as 4 and 5."
+          }
+        ]
+      ],
+      "execution_time": 48.63442301750183,
+      "sample_id": "simple_python_2",
+      "category": "unknown"
+    },
+    {
+      "success": true,
+      "score": 1.0,
+      "predicted": [
+        {
+          "name": "algebra.quadratic_roots",
+          "arguments": {
+            "a": 1,
+            "b": -3,
+            "c": 2
+          }
+        }
+      ],
+      "expected": [
+        {
+          "algebra.quadratic_roots": {
+            "a": [
+              1
+            ],
+            "b": [
+              -3
+            ],
+            "c": [
+              2
+            ]
+          }
+        }
+      ],
+      "response": "[{\"name\": \"algebra.quadratic_roots\", \"arguments\": {\"a\": 1, \"b\": -3, \"c\": 2}}]",
+      "question": [
+        [
+          {
+            "role": "user",
+            "content": "Find the roots of a quadratic equation with coefficients a=1, b=-3, c=2."
+          }
+        ]
+      ],
+      "execution_time": 7.592089653015137,
+      "sample_id": "simple_python_3",
+      "category": "unknown"
+    },
+    {
+      "success": true,
+      "score": 1.0,
+      "predicted": [
+        {
+          "name": "solve_quadratic_equation",
+          "arguments": {
+            "a": 2,
+            "b": 6,
+            "c": 5
+          }
+        }
+      ],
+      "expected": [
+        {
+          "solve_quadratic_equation": {
+            "a": [
+              2
+            ],
+            "b": [
+              6
+            ],
+            "c": [
+              5
+            ]
+          }
+        }
+      ],
+      "response": "[{\"name\": \"solve_quadratic_equation\", \"arguments\": {\"a\": 2, \"b\": 6, \"c\": 5}}]",
+      "question": [
+        [
+          {
+            "role": "user",
+            "content": "Solve a quadratic equation where a=2, b=6, and c=5"
+          }
+        ]
+      ],
+      "execution_time": 48.38360333442688,
+      "sample_id": "simple_python_4",
+      "category": "unknown"
+    }
+  ]
+}

+ 90 - 0
code/chapter12/template_output/evaluation_results/gaia_official/SUBMISSION_GUIDE_20251011_015731.md

@@ -0,0 +1,90 @@
+# GAIA Evaluation Results Submission Guide
+
+## 📊 Evaluation Summary
+
+- **Model name**: TestAgent
+- **Evaluation level**: 1
+- **Total samples**: 2
+- **Exact matches**: 0
+- **Exact match rate**: 0.00%
+
+## 📁 Submission File
+
+**Results file**: `gaia_level1_result_20251011_015731.jsonl`
+
+This file contains:
+- The `task_id` of each task
+- The model's answer (`model_answer`)
+- The reasoning trace (`reasoning_trace`)
+
+## 🚀 How to Submit to the GAIA Leaderboard
+
+### Step 1: Open the GAIA leaderboard
+
+Open the following URL in a browser:
+```
+https://huggingface.co/spaces/gaia-benchmark/leaderboard
+```
+
+### Step 2: Prepare the submission details
+
+Fill in the submission form as follows:
+
+1. **Model Name**: `TestAgent`
+2. **Model Family**: for example `GPT`, `Claude`, or `Qwen`
+3. **Model Type**:
+   - `Open-source`
+   - `Proprietary`
+4. **Results File**: upload `gaia_level1_result_20251011_015731.jsonl`
+
+### Step 3: Upload the results file
+
+1. Click the "Choose File" button
+2. Select the file: `D:\code\multiAgentBok\HL-MAS\jjyaoao分支的hello-agents\hello-agents\docs\chapter12\HelloAgents\evaluation_results\gaia_official\gaia_level1_result_20251011_015731.jsonl`
+3. Confirm that the file extension is `.jsonl`
+
+### Step 4: Submit
+
+1. Check that all the information is correct
+2. Click the "Submit" button
+3. Wait for the evaluation result (this usually takes a few minutes)
+
+## 📋 Results File Format
+
+GAIA expects JSONL (one JSON object per line):
+
+```json
+{"task_id": "xxx", "model_answer": "answer", "reasoning_trace": "reasoning process"}
+```
+
+**Fields**:
+- `task_id`: the task ID, matching the GAIA dataset
+- `model_answer`: the model's final answer
+- `reasoning_trace`: the model's reasoning process (optional)
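A file in this shape can be produced and sanity-checked with a few lines of standard-library Python. The sketch below is illustrative only: `write_submission` and `validate_submission` are hypothetical helper names, not part of HelloAgents or of the GAIA tooling, and the check covers just the three fields listed above.

```python
import json
import os
import tempfile

REQUIRED_FIELDS = ("task_id", "model_answer")  # reasoning_trace is optional


def write_submission(records, path):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def validate_submission(path):
    """Return a list of problems; an empty list means the file looks well-formed."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {line_no}: expected a JSON object")
                continue
            for field in REQUIRED_FIELDS:
                if not isinstance(record.get(field), str):
                    problems.append(f"line {line_no}: '{field}' is missing or not a string")
    return problems


# Task IDs and answers taken from the result file in this commit.
records = [
    {"task_id": "e1fc63a2-da7a-432f-be78-7c4a95598703", "model_answer": "17000",
     "reasoning_trace": "Distance / speed, rounded to the nearest 1000 hours."},
    {"task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be", "model_answer": "5"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "gaia_level1_result.jsonl")
    write_submission(records, path)
    print(validate_submission(path))  # an empty list means no problems were found
```

Running the validator before uploading catches truncated lines and missing fields locally, which is cheaper than waiting for the leaderboard to reject the file.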
+
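+## ⚠️ Notes
+
+1. **Answer format**:
+   - Numbers: no comma separators and no unit symbols
+   - Strings: no articles, lowercase
+   - Lists: comma-separated, in alphabetical order
+
+2. **File size**:
+   - Keep the file under 10MB
+   - If the file is too large, consider removing `reasoning_trace`
+
+3. **Submission frequency**:
+   - Test on a small sample first
+   - Submit the full evaluation only after confirming the results are correct
+
+## 📞 Getting Help
+
+If you run into problems:
+1. Check the official GAIA documentation: https://huggingface.co/gaia-benchmark
+2. Ask on the HuggingFace forum
+3. Check that the results file is correctly formatted
+
+---
+
+**Generated at**: 2025-10-11 01:57:31
+**Tool version**: HelloAgents GAIA Evaluation Tool v1.0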

+ 2 - 0
code/chapter12/template_output/evaluation_results/gaia_official/gaia_level1_result_20251011_015731.jsonl

@@ -0,0 +1,2 @@
+{"task_id": "e1fc63a2-da7a-432f-be78-7c4a95598703", "model_answer": "17000", "reasoning_trace": "To solve this problem, we need to follow these steps:\n\n1. **Determine the distance from Earth to the Moon at its closest approach (perigee):**\n   - According to the Wikipedia page for the Moon, the minimum perigee value is approximately 363,104 kilometers.\n\n2. **Find Eliud Kipchoge's marathon pace:**\n   - Eliud Kipchoge set a world record with a time of 1 hour, 59 minutes, and 40 seconds for a marathon distance of 42.195 kilometers.\n   - Convert this time to hours: 1 + 59/60 + 40/3600 = 1.994444 hours.\n   - Calculate his pace in kilometers per hour: 42.195 / 1.994444 ≈ 21.16 kilometers per hour.\n\n3. **Calculate the total time required to run the distance to the Moon:**\n   - Total distance to the Moon: 363,104 kilometers.\n   - Time required = Distance / Speed = 363,104 / 21.16 ≈ 17,159.86 hours.\n\n4. **Round the result to the nearest 1000 hours:**\n   - 17,159.86 rounded to the nearest 1000 is 17,000.\n\nFINAL ANSWER: 17000"}
+{"task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be", "model_answer": "5", "reasoning_trace": "To determine how many studio albums Mercedes Sosa published between 2000 and 2009, I'll check her discography on the latest available version of the English Wikipedia.\n\nMercedes Sosa, an Argentine singer, had a long and prolific career. Here’s a list of her studio albums released between 2000 and 2009:\n\n- **2000: \"Al Despertar\"**\n- **2003: \"Acústico\"**\n- **2005: \"Argentina Quiere Cantar\"**\n- **2006: \"Corazón Libre\"**\n- **2009: \"Cantora 1\"**\n\nCounting these albums, we have:\n\n- 2000: 1 album\n- 2003: 1 album\n- 2005: 1 album\n- 2006: 1 album\n- 2009: 1 album\n\nTotal number of studio albums published by Mercedes Sosa between 2000 and 2009: 5\n\nFINAL ANSWER: 5"}

+ 5 - 0
code/chapter12/template_output/result/Qwen_Qwen3-8B/BFCL_v4_simple_python_result.json

@@ -0,0 +1,5 @@
+{"id": "simple_python_0", "result": "calculate_triangle_area(base=10, height=5)", "inference_log": [{"role": "user", "content": [[{"role": "user", "content": "Find the area of a triangle with a base of 10 units and height of 5 units."}]]}, {"role": "assistant", "content": "[{\"name\": \"calculate_triangle_area\", \"arguments\": {\"base\": 10, \"height\": 5}}]"}]}
+{"id": "simple_python_1", "result": "math.factorial(number=5)", "inference_log": [{"role": "user", "content": [[{"role": "user", "content": "Calculate the factorial of 5 using math functions."}]]}, {"role": "assistant", "content": "[{\"name\": \"math.factorial\", \"arguments\": {\"number\": 5}}]"}]}
+{"id": "simple_python_2", "result": "math.hypot(x=4, y=5)", "inference_log": [{"role": "user", "content": [[{"role": "user", "content": "Calculate the hypotenuse of a right triangle given the lengths of the other two sides as 4 and 5."}]]}, {"role": "assistant", "content": "[{\"name\": \"math.hypot\", \"arguments\": {\"x\": 4, \"y\": 5}}]"}]}
+{"id": "simple_python_3", "result": "algebra.quadratic_roots(a=1, b=-3, c=2)", "inference_log": [{"role": "user", "content": [[{"role": "user", "content": "Find the roots of a quadratic equation with coefficients a=1, b=-3, c=2."}]]}, {"role": "assistant", "content": "[{\"name\": \"algebra.quadratic_roots\", \"arguments\": {\"a\": 1, \"b\": -3, \"c\": 2}}]"}]}
+{"id": "simple_python_4", "result": "solve_quadratic_equation(a=2, b=6, c=5)", "inference_log": [{"role": "user", "content": [[{"role": "user", "content": "Solve a quadratic equation where a=2, b=6, and c=5"}]]}, {"role": "assistant", "content": "[{\"name\": \"solve_quadratic_equation\", \"arguments\": {\"a\": 2, \"b\": 6, \"c\": 5}}]"}]}

+ 1 - 0
code/chapter12/template_output/score/Qwen_Qwen3-8B/non_live/BFCL_v4_simple_python_score.json

@@ -0,0 +1 @@
+{"accuracy": 1.0, "correct_count": 5, "total_count": 5}

+ 2 - 0
code/chapter12/template_output/score/data_agentic.csv

@@ -0,0 +1,2 @@
+Rank,Model,Agentic Overall Acc,Web Search Summary,Web Search Base,Web Search No Snippet,Memory Summary,Memory KV,Memory Vector,Memory Recursive Summarization
+1,Qwen3-8B (Prompt),0.00%,N/A,N/A,N/A,N/A,N/A,N/A,N/A

File diff not shown because the file is too large
+ 0 - 0
code/chapter12/template_output/score/data_format_sensitivity.csv


+ 2 - 0
code/chapter12/template_output/score/data_live.csv

@@ -0,0 +1,2 @@
+Rank,Model,Live Overall Acc,AST Summary,Python Simple AST,Python Multiple AST,Python Parallel AST,Python Parallel Multiple AST,Irrelevance Detection,Relevance Detection
+1,Qwen3-8B (Prompt),0.00%,N/A,N/A,N/A,N/A,N/A,N/A,N/A

+ 2 - 0
code/chapter12/template_output/score/data_multi_turn.csv

@@ -0,0 +1,2 @@
+Rank,Model,Multi Turn Overall Acc,Base,Miss Func,Miss Param,Long Context
+1,Qwen3-8B (Prompt),0.00%,N/A,N/A,N/A,N/A

+ 2 - 0
code/chapter12/template_output/score/data_non_live.csv

@@ -0,0 +1,2 @@
+Rank,Model,Non-Live Overall Acc,AST Summary,Simple AST,Python Simple AST,Java Simple AST,JavaScript Simple AST,Multiple AST,Parallel AST,Parallel Multiple AST,Irrelevance Detection
+1,Qwen3-8B (Prompt),8.33%,N/A,N/A,100.00%,N/A,N/A,N/A,N/A,N/A,N/A

+ 2 - 0
code/chapter12/template_output/score/data_overall.csv

@@ -0,0 +1,2 @@
+Rank,Overall Acc,Model,Model Link,Total Cost ($),Latency Mean (s),Latency Standard Deviation (s),Latency 95th Percentile (s),Non-Live AST Acc,Non-Live Simple AST,Non-Live Multiple AST,Non-Live Parallel AST,Non-Live Parallel Multiple AST,Live Acc,Live Simple AST,Live Multiple AST,Live Parallel AST,Live Parallel Multiple AST,Multi Turn Acc,Multi Turn Base,Multi Turn Miss Func,Multi Turn Miss Param,Multi Turn Long Context,Web Search Acc,Web Search Base,Web Search No Snippet,Memory Acc,Memory KV,Memory Vector,Memory Recursive Summarization,Relevance Detection,Irrelevance Detection,Format Sensitivity Max Delta,Format Sensitivity Standard Deviation,Organization,License
+1,0.83%,Qwen3-8B (Prompt),https://huggingface.co/Qwen/Qwen3-8B,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,0.00%,N/A,N/A,N/A,N/A,0.00%,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,Qwen,apache-2.0

+ 1 - 1
docs/chapter10/第十章 智能体通信协议.md

@@ -173,7 +173,7 @@ hello_agents/
 
 ```bash
 # Install the HelloAgents framework (Chapter 10 version)
-pip install hello-agents==0.2.2
+pip install hello-agents[protocol]==0.2.2
 
 # Install NodeJS; see the documentation in Additional-Chapter
 ```

+ 2702 - 1
docs/chapter12/第十二章 智能体性能评估.md

@@ -1,3 +1,2704 @@
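 # Chapter 12: Agent Performance Evaluation
 
-Content for this chapter is yet to be added...
+In the previous chapters we built the core of the HelloAgents framework: several agent paradigms, a tool system, memory mechanisms, and reinforcement learning training. Building an agent system also raises a central question: **how do we evaluate an agent's performance objectively?** Specifically, we need to answer the following:
+
+1. Does the agent have the capabilities we expect?
+2. How does it perform across different tasks?
+3. Where does it stand relative to other agents?
+
+This chapter adds an **Evaluation System** to HelloAgents. We will work through the theoretical basis of agent evaluation and implement the evaluation tools.
+
+## 12.1 Fundamentals of Agent Evaluation
+
+### 12.1.1 Why Agents Need Evaluation
+
+Our current SimpleAgent already has strong reasoning and tool-calling abilities. Consider a typical use case: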
+
+```python
+from hello_agents import SimpleAgent, HelloAgentsLLM
+from hello_agents.tools import SearchTool
+
+# 创建LLM和智能体
+llm = HelloAgentsLLM()
+
+# 创建一个强调工具使用的系统提示词
+system_prompt = """你是一个AI助手,可以使用搜索工具来获取最新信息。
+
+当需要搜索信息时,请使用以下格式:
+[TOOL_CALL:search:搜索关键词]
+
+例如:
+- [TOOL_CALL:search:最新AI新闻]
+- [TOOL_CALL:search:Python编程教程]
+
+请在回答问题前先使用搜索工具获取最新信息。"""
+
+agent = SimpleAgent(name="AI助手", llm=llm, system_prompt=system_prompt)
+
+# 添加搜索工具
+agent.add_tool(SearchTool())
+
+# 示例:使用搜索工具回答问题
+response = agent.run("最新的AI技术发展趋势是什么?")
+print(f"\n回答:{response}")
+```
+
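+This agent works, but it leaves a central question open: how do we evaluate its performance objectively? After we tune the prompt or swap the underlying LLM, how do we know whether anything actually improved? Before deploying to production, how do we make sure the agent is reliable? Answering these questions takes systematic evaluation.
+
+The core value of agent evaluation is that it gives us a standardized way to measure an agent's capabilities. With evaluation we can quantify performance with concrete metrics, compare design alternatives objectively, spot weaknesses in specific scenarios early, and demonstrate the agent's reliability to users.
+
+Unlike traditional software testing, agent evaluation faces its own challenges. The first is non-deterministic output: one question may have several correct answers, so a simple right-or-wrong check is hard to apply. The second is the diversity of evaluation criteria: different tasks call for different methods, so tool calling requires checking function signatures while question answering requires measuring semantic similarity. The third is cost: every evaluation run needs a large number of API calls, which can cost hundreds of yuan or more.
+
+To address these challenges, academia and industry have proposed a number of standardized **benchmarks**. They provide a shared dataset, shared metrics, and a shared scoring method, so that different agent systems can be evaluated and compared under the same conditions.
+
+### 12.1.2 Overview of Mainstream Benchmarks
+
+Several influential benchmarks have emerged in agent evaluation. The main ones, along with commonly used metrics, are introduced below: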
+
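+**(1) Tool-calling ability**
+
+Tool calling is one of an agent's core abilities. The agent has to understand the user's intent, choose a suitable tool, and construct the function call correctly. Relevant benchmarks include:
+
+- **BFCL (Berkeley Function Calling Leaderboard)**<sup>[1]</sup>: released by UC Berkeley. It contains 1120+ test samples across four categories (simple, multiple, parallel, irrelevance) and scores them with an AST matching algorithm. The dataset is moderately sized and the community is active.
+- **ToolBench**<sup>[2]</sup>: released by Tsinghua University. It contains more than 16,000 real API call scenarios and covers complex real-world tool use.
+- **API-Bank**<sup>[3]</sup>: released by Alibaba DAMO Academy. It contains 53 commonly used API tools and focuses on how well an agent understands API documentation and issues calls.
+
+**(2) General ability**
+
+These benchmarks evaluate an agent's overall performance on real-world tasks, including multi-step reasoning, use of knowledge, and multimodal understanding:
+
+- **GAIA (General AI Assistants)**<sup>[4]</sup>: released jointly by Meta AI and Hugging Face. It contains 466 real-world questions at three difficulty levels (Level 1/2/3) and evaluates multi-step reasoning, tool use, file handling, and web browsing. Scoring uses a Quasi Exact Match algorithm. The tasks are realistic and broad in scope.
+- **AgentBench**<sup>[5]</sup>: released by Tsinghua University. It contains tasks from 8 different domains and evaluates an agent's general ability across them.
+- **WebArena**<sup>[6]</sup>: released by CMU. It evaluates task completion and web interaction in realistic web environments.
+
+**(3) Multi-agent collaboration**
+
+These evaluate how well several agents work together:
+
+- **ChatEval**<sup>[7]</sup>: evaluates the quality of multi-agent dialogue systems.
+- **SOTOPIA**<sup>[8]</sup>: evaluates how agents interact in social scenarios.
+- **Custom collaboration scenarios**: evaluation tasks designed for a specific application.
+
+**(4) Common evaluation metrics**
+
+Different benchmarks use different metrics. Common ones include:
+
+- **Accuracy metrics**: Accuracy, Exact Match, F1 Score; these measure the correctness of answers.
+- **Efficiency metrics**: Response Time, Token Usage; these measure execution efficiency.
+- **Robustness metrics**: Error Rate, Failure Recovery; these measure fault tolerance.
+- **Collaboration metrics**: Communication Efficiency, Task Completion; these measure how well collaboration works.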
+
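+### 12.1.3 Design of the HelloAgents Evaluation System
+
+With learning curve and practicality in mind, this chapter focuses on the following evaluation scenarios:
+
+1. **BFCL**: evaluates tool-calling ability
+   - Why we chose it: moderately sized dataset, clear metrics, active community
+   - Where it applies: measuring the accuracy of an agent's function calls
+
+2. **GAIA**: evaluates general AI assistant ability
+   - Why we chose it: realistic tasks, graded difficulty, broad coverage
+   - Where it applies: measuring an agent's overall problem-solving ability
+
+3. **Data generation quality evaluation**: evaluates the quality of LLM-generated data
+   - Why we chose it: this case offers an end-to-end walkthrough of using an agent to generate data and then evaluating that data
+   - Where it applies: assessing the quality of generated training and test data
+   - Methods: LLM Judge, Win Rate, human verification
+
+Together these three scenarios form a complete evaluation system. Figure 12.1 shows how the system is put together.
+
+<div align="center">
+  <img src="https://raw.githubusercontent.com/datawhalechina/Hello-Agents/main/docs/images/12-figures/12-1.png" alt="" width="85%"/>
+  <p>Figure 12.1 Architecture of the HelloAgents evaluation system</p>
+</div>
+
+
+
+### 12.1.4 Learning Objectives and Quick Start
+
+Let us first look at what Chapter 12 covers: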
+
+```
+hello_agents/
+├── evaluation/                         # Evaluation module
+│   └── benchmarks/                     # Benchmark implementations
+│       ├── bfcl/                       # BFCL evaluation
+│       │   ├── dataset.py              # BFCL dataset loader
+│       │   ├── evaluator.py            # BFCL evaluator (AST matching)
+│       │   ├── metrics.py              # BFCL-specific metrics
+│       │   └── ast_matcher.py          # AST matching algorithm
+│       ├── gaia/                       # GAIA evaluation
+│       │   ├── dataset.py              # GAIA dataset loader
+│       │   ├── evaluator.py            # GAIA evaluator (quasi exact match)
+│       │   ├── metrics.py              # GAIA-specific metrics
+│       │   └── quasi_exact_match.py    # Quasi exact match algorithm
+│       └── data_generation/            # Data generation evaluation
+│           ├── dataset.py              # AIME dataset loader
+│           ├── llm_judge.py            # LLM Judge evaluator
+│           └── win_rate.py             # Win Rate evaluator
+└── tools/builtin/                      # Built-in tools module
+    ├── bfcl_evaluation_tool.py         # BFCL evaluation tool
+    ├── gaia_evaluation_tool.py         # GAIA evaluation tool
+    ├── llm_judge_tool.py               # LLM Judge tool
+    └── win_rate_tool.py                # Win Rate tool
+```
+
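+The goal of this chapter is to become proficient at applying the evaluation tools. Start by preparing the development environment:
+
+```bash
+# Install the HelloAgents framework (Chapter 12 version)
+# The quotes keep shells such as zsh from treating the brackets as a glob pattern
+pip install "hello-agents[evaluation]==0.2.3"
+
+# Set environment variables
+export HF_TOKEN="your_huggingface_token"     # Used for the GAIA dataset (a setup step also appears later)
+```
+
+The following sections cover the usage and background of each evaluation method in detail.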
+
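+## 12.2 BFCL: Evaluating Tool-Calling Ability
+
+### 12.2.1 Introduction to the BFCL Benchmark
+
+BFCL (Berkeley Function Calling Leaderboard) is a function-calling benchmark released by the University of California, Berkeley<sup>[1]</sup>. Tool calling is one of the core abilities of an agent system. The agent has to:
+
+1. **Understand the task**: extract the key information from the user's natural-language description
+2. **Choose a suitable tool**: pick the best fit from the available tools
+3. **Construct the function call**: fill in the function name and arguments correctly
+4. **Handle complex scenarios**: support advanced cases such as multiple and parallel function calls
+
+The BFCL benchmark has four evaluation categories of increasing difficulty. It starts with a single function call (Simple), moves on to scenarios that require several functions (Multiple), then to scenarios that require calling several functions in parallel (Parallel), and finally to scenarios where the agent must decide whether any function should be called at all (Irrelevance). These four categories cover the tool-calling situations an agent is likely to meet in practice, as shown in Table 12.1:
+
+<div align="center">
+  <p>Table 12.1 The four evaluation categories in the BFCL benchmark</p>
+  <img src="https://raw.githubusercontent.com/datawhalechina/Hello-Agents/main/docs/images/12-figures/12-table-1.png" alt="" width="85%"/>
+</div>
+
+BFCL follows a standard benchmark workflow: load the dataset and select a category, run the agent to obtain predictions, parse each prediction into an abstract syntax tree (AST), and decide whether the prediction is correct with the AST matching algorithm. The workflow iterates over every test sample, then computes accuracy and other metrics and produces an evaluation report. The full flow is shown in Figure 12.2:
+
+<div align="center">
+  <img src="https://raw.githubusercontent.com/datawhalechina/Hello-Agents/main/docs/images/12-figures/12-2.png" alt="" width="85%"/>
+  <p>Figure 12.2 BFCL evaluation flow</p>
+</div>
+
+**(1) BFCL dataset structure**
+
+The BFCL dataset is stored as JSON. Schematically, each test sample carries the following fields: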
+
+```json
+{
+  "id": "simple_001",
+  "question": "What's the weather like in Beijing today?",
+  "function": [
+    {
+      "name": "get_weather",
+      "description": "Get the current weather for a location",
+      "parameters": {
+        "type": "object",
+        "properties": {
+          "location": {
+            "type": "string",
+            "description": "The city name"
+          }
+        },
+        "required": ["location"]
+      }
+    }
+  ],
+  "ground_truth": [
+    {
+      "name": "get_weather",
+      "arguments": {
+        "location": "Beijing"
+      }
+    }
+  ]
+}
+```
+
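+**Key fields:**
+
+- `question`: the user's natural-language request
+- `function`: the list of available functions (with signatures and descriptions)
+- `ground_truth`: the reference answer (the expected function call)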
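Note that the evaluation results committed with this chapter show ground truth in the official "possible answer" shape, where each parameter maps to a list of acceptable values (for example `"z": ["", 0]` for `math.hypot`), rather than the single-value shape shown above. The sketch below checks one predicted call against one such entry. It is a simplified illustration, not the official checker: `matches_possible_answer` is a hypothetical name, and treating `""` as "this parameter may be omitted" is an assumption inferred from those results.

```python
from typing import Any, Dict, List


def matches_possible_answer(predicted: Dict[str, Any],
                            expected: Dict[str, Dict[str, List[Any]]]) -> bool:
    """Check one predicted call against one possible-answer entry.

    predicted: {"name": ..., "arguments": {...}}
    expected:  {function_name: {param: [acceptable values, ...]}}
    Assumption: "" in a parameter's list means the parameter may be omitted.
    """
    if len(expected) != 1:
        return False
    (func_name, allowed), = expected.items()
    if predicted.get("name") != func_name:
        return False

    args = predicted.get("arguments", {})
    # Reject parameters that the ground truth does not know about.
    if any(name not in allowed for name in args):
        return False

    for param, acceptable in allowed.items():
        if param in args:
            if args[param] not in acceptable:
                return False  # value is not one of the accepted answers
        elif "" not in acceptable:
            return False      # a required parameter is missing
    return True


# Values taken from the simple_python_2 sample in this commit.
predicted = {"name": "math.hypot", "arguments": {"x": 4, "y": 5}}
expected = {"math.hypot": {"x": [4], "y": [5], "z": ["", 0]}}
print(matches_possible_answer(predicted, expected))
```

Listing several acceptable values per parameter lets a benchmark accept, say, both an explicit default and an omitted argument without hand-writing every full call variant.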
+
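+**(2) AST matching**
+
+BFCL uses **AST matching (Abstract Syntax Tree Matching)** as its core evaluation algorithm, so it is worth understanding how the comparison works.
+
+Rather than comparing strings, BFCL matches calls through their abstract syntax trees. The core idea is to **parse each function call into a syntax tree and then compare the structure and node values of the trees**.
+
+Given a predicted function call $P$ and a ground-truth call $G$, the AST matching function is defined as:
+
+$$
+\text{AST\_Match}(P, G) = \begin{cases}
+1 & \text{if } \text{AST}(P) \equiv \text{AST}(G) \\
+0 & \text{otherwise}
+\end{cases}
+$$
+
+where $\text{AST}(x)$ denotes parsing a function call into an abstract syntax tree and $\equiv$ denotes syntax-tree equivalence.
+
+Two syntax trees are equivalent when three conditions hold: the function names are identical (exact match), the sets of parameter key-value pairs are equal (order ignored), and each parameter value is semantically equivalent (for example, `2+3` is equivalent to `5`). In practice, function-name matching is an exact string comparison, so `get_weather` and `get_temperature` count as different functions. Parameter matching compares ASTs, which tolerates a different parameter order (`f(a=1, b=2)` is equivalent to `f(b=2, a=1)`), equivalent expressions (`f(x=2+3)` is equivalent to `f(x=5)`), and different string notations (`f(s="hello")` is equivalent to `f(s='hello')`). For multi-function calls, the algorithm requires the same number of calls and a match for every call, but the call order may differ (set matching).
+
+**AST matching examples:**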
+
+```text
+# Example 1: different parameter order (match)
+Predicted: get_weather(city="Beijing", unit="celsius")
+Expected:  get_weather(unit="celsius", city="Beijing")
+Result:    ✅ match
+
+# Example 2: equivalent expression (match)
+Predicted: calculate(x=2+3)
+Expected:  calculate(x=5)
+Result:    ✅ match
+
+# Example 3: wrong function name (no match)
+Predicted: get_temperature(city="Beijing")
+Expected:  get_weather(city="Beijing")
+Result:    ❌ no match
+
+# Example 4: wrong parameter value (no match)
+Predicted: get_weather(city="Shanghai")
+Expected:  get_weather(city="Beijing")
+Result:    ❌ no match
+```
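The four cases above can be reproduced with Python's standard `ast` module. This is a minimal sketch under stated assumptions, not the BFCL or HelloAgents implementation: it handles keyword arguments only, folds just the four basic arithmetic operators, needs Python 3.9+ for `ast.unparse`, and the names `parse_call` and `ast_match` are chosen here for illustration.

```python
import ast
import operator

# Arithmetic operators that this sketch is willing to evaluate.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _fold(node):
    """Evaluate literals and simple arithmetic; anything else raises ValueError."""
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_fold(node.left), _fold(node.right))
    return ast.literal_eval(node)  # accepts an AST node as well as a string


def parse_call(source):
    """Parse 'f(a=1, b=2)' into (function name, {keyword: value})."""
    call = ast.parse(source.strip(), mode="eval").body
    if not isinstance(call, ast.Call):
        raise ValueError(f"not a function call: {source!r}")
    if call.args:
        raise ValueError("positional arguments are not handled in this sketch")
    name = ast.unparse(call.func)  # keeps dotted names such as math.hypot
    kwargs = {kw.arg: _fold(kw.value) for kw in call.keywords}
    return name, kwargs


def ast_match(predicted, expected):
    """Return True when both calls have the same name and equivalent keyword arguments."""
    try:
        # Dict comparison ignores keyword order; folded values compare by value.
        return parse_call(predicted) == parse_call(expected)
    except (SyntaxError, ValueError):
        return False


print(ast_match('get_weather(city="Beijing", unit="celsius")',
                'get_weather(unit="celsius", city="Beijing")'))
print(ast_match("calculate(x=2+3)", "calculate(x=5)"))
print(ast_match('get_temperature(city="Beijing")', 'get_weather(city="Beijing")'))
print(ast_match('get_weather(city="Shanghai")', 'get_weather(city="Beijing")'))
```

Comparing parsed values rather than source text is what makes quoting style, whitespace, and keyword order irrelevant; unparseable predictions simply count as mismatches.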
+
+**(3) BFCL evaluation metrics**
+
+BFCL evaluates agent performance with the following metrics:
+
+**1. Accuracy**
+
+Accuracy is the central metric, defined as the fraction of samples whose AST match succeeds:
+
+$$
+\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \text{AST\_Match}(P_i, G_i)
+$$
+
+where:
+- $N$ is the total number of samples
+- $P_i$ is the prediction for the $i$-th sample
+- $G_i$ is the ground truth for the $i$-th sample
+- $\text{AST\_Match}(P_i, G_i) \in \{0, 1\}$ is the AST matching function
+
+**2. AST Match Rate**
+
+Identical to accuracy; the name emphasizes that the AST matching algorithm is used:
+
+$$
+\text{AST Match Rate} = \text{Accuracy}
+$$
+
+**3. Category-wise Accuracy**
+
+For each category $c \in \{\text{simple}, \text{multiple}, \text{parallel}, \ldots\}$, compute the accuracy within that category:
+
+$$
+\text{Accuracy}_c = \frac{1}{|D_c|} \sum_{i \in D_c} \text{AST\_Match}(P_i, G_i)
+$$
+
+where $D_c$ is the set of samples in category $c$ and $|D_c|$ is its size.
+
+**4. Weighted Accuracy**
+
+Takes per-category difficulty weights into account:
+
+$$
+\text{Weighted Accuracy} = \sum_{c} w_c \cdot \text{Accuracy}_c
+$$
+
+where $w_c$ is the weight of category $c$, with $\sum_c w_c = 1$.
+
+**5. Error Rate**
+
+The fraction of samples where the function call was not correct:
+
+$$
+\text{Error Rate} = 1 - \text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} (1 - \text{AST\_Match}(P_i, G_i))
+$$
+
+**Interpreting the metrics:**
+
+- **Accuracy = 1.0**: every sample is fully correct
+- **Accuracy = 0.8**: 80% of samples are correct and 20% are wrong
+- **Accuracy = 0.0**: every sample is wrong
+
+**Category-wise accuracy example:**
+
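+```python
+# Hypothetical evaluation results
+simple_accuracy = 0.95      # Simple category: 95% correct
+multiple_accuracy = 0.82    # Multiple category: 82% correct
+parallel_accuracy = 0.68    # Parallel category: 68% correct
+
+# Weighted accuracy (assuming equal weights)
+weighted_accuracy = (simple_accuracy + multiple_accuracy + parallel_accuracy) / 3
+print(f"{weighted_accuracy:.3f}")  # 0.817
+```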
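The same formulas can be applied directly to a list of per-sample results. The sketch below assumes each result carries a boolean `success` and a `category` string, as in the `detailed_results` entries of the result file in this commit; `compute_metrics` is a hypothetical helper written for illustration, not the HelloAgents `BFCLMetrics` class.

```python
from collections import defaultdict


def compute_metrics(detailed_results, weights=None):
    """Compute overall accuracy, error rate, per-category and weighted accuracy.

    Each item needs a boolean "success" and a "category" string.
    weights maps category -> weight and should sum to 1; equal weights by default.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for item in detailed_results:
        counts[item["category"]][0] += int(item["success"])
        counts[item["category"]][1] += 1

    total = sum(t for _, t in counts.values())
    correct = sum(c for c, _ in counts.values())
    accuracy = correct / total if total else 0.0
    category_accuracy = {cat: c / t for cat, (c, t) in counts.items()}

    if weights is None:
        weights = {cat: 1 / len(category_accuracy) for cat in category_accuracy}
    weighted = sum(weights[cat] * acc for cat, acc in category_accuracy.items())

    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "category_accuracy": category_accuracy,
        "weighted_accuracy": weighted,
    }


# Made-up sample results, used only to exercise the function.
sample = [
    {"category": "simple", "success": True},
    {"category": "simple", "success": True},
    {"category": "parallel", "success": True},
    {"category": "parallel", "success": False},
]
print(compute_metrics(sample))
```

Overall accuracy weights every sample equally, whereas weighted accuracy weights every category by $w_c$, so the two diverge as soon as categories differ in size or in assigned weight.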
+
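+**(4) The official BFCL evaluation tool**
+
+BFCL provides an official CLI for evaluation:
+
+```bash
+# Install the BFCL evaluation tool
+pip install bfcl
+
+# Run the official evaluation (same invocation that HelloAgents issues later in this chapter)
+bfcl evaluate \
+    --model Qwen/Qwen3-8B \
+    --test-category simple_python \
+    --partial-eval
+```
+
+The advantages of the official tool are that it uses the official AST matching algorithm, its results are consistent with the leaderboard, it supports every BFCL v4 category, and it generates a detailed evaluation report automatically.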
+
+
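+### 12.2.2 Getting the BFCL Dataset
+
+The BFCL dataset can be obtained in the following ways:
+
+**Method 1: Clone the official GitHub repository (recommended)**
+
+This is the most reliable option and gives you the complete dataset together with the ground truth:
+
+```bash
+# Clone the BFCL repository
+git clone https://github.com/ShishirPatil/gorilla.git temp_gorilla
+cd temp_gorilla/berkeley-function-call-leaderboard
+
+# List the BFCL v4 dataset files
+ls bfcl_eval/data/
+# Output: BFCL_v4_simple_python.json  BFCL_v4_multiple.json  BFCL_v4_parallel.json  ...
+
+# List the ground truth files
+ls bfcl_eval/data/possible_answer/
+# Output: BFCL_v4_simple_python.json  BFCL_v4_multiple.json  ...
+```
+
+We recommend this method because it includes the complete ground truth, the data format matches the official evaluation tool exactly, the official evaluation scripts can be used directly, and it provides the latest BFCL v4 release.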
+
+**Method 2: Load the official data with HelloAgents**
+
+After cloning the repository, load the data with HelloAgents:
+
+```python
+from hello_agents.evaluation import BFCLDataset
+
+# 加载BFCL官方数据
+dataset = BFCLDataset(
+    bfcl_data_dir="./temp_gorilla/berkeley-function-call-leaderboard/bfcl_eval/data",
+    category="simple_python"  # BFCL v4类别
+)
+
+# 加载数据(包括测试数据和ground truth)
+data = dataset.load()
+
+print(f"✅ 加载了 {len(data)} 个测试样本")
+print(f"✅ 加载了 {len(dataset.ground_truth)} 个ground truth")
+# 输出:
+# ✅ 加载了 400 个测试样本
+# ✅ 加载了 400 个ground truth
+```
+
+The loader works as follows: it reads the test data from `bfcl_eval/data/`, reads the ground truth from `bfcl_eval/data/possible_answer/`, merges the two automatically, and keeps the original BFCL data format. The BFCL v4 dataset categories are listed in Table 12.2.
+
+<div align="center">
+  <p>Table 12.2 BFCL v4 dataset categories</p>
+  <img src="https://raw.githubusercontent.com/datawhalechina/Hello-Agents/main/docs/images/12-figures/12-table-2.png" alt="" width="85%"/>
+</div>
+
+The available categories can also be listed in code:
+
+```python
+# 获取所有支持的类别
+categories = dataset.get_available_categories()
+print(f"支持的类别: {categories}")
+# 输出: ['simple_python', 'simple_java', 'simple_javascript', 'multiple', ...]
+```
+
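+### 12.2.3 Implementing BFCL Evaluation in HelloAgents
+
+Now let us see how BFCL evaluation is done in the HelloAgents framework. There are three ways to use it:
+
+**Option 1: Use BFCLEvaluationTool (recommended)**
+
+This is the simplest option: a single call runs the evaluation, generates the report, and runs the official evaluation:
+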
+```python
+from hello_agents import SimpleAgent, HelloAgentsLLM
+from hello_agents.tools import BFCLEvaluationTool
+
+# 1. 创建要评估的智能体
+llm = HelloAgentsLLM()
+agent = SimpleAgent(name="TestAgent", llm=llm)
+
+# 2. 创建BFCL评估工具
+bfcl_tool = BFCLEvaluationTool()
+
+# 3. 运行评估(自动完成所有步骤)
+results = bfcl_tool.run(
+    agent=agent,
+    category="simple_python",  # 评估类别
+    max_samples=5              # 评估样本数(0表示全部)
+)
+
+# 4. 查看结果
+print(f"准确率: {results['overall_accuracy']:.2%}")
+print(f"正确数: {results['correct_samples']}/{results['total_samples']}")
+```
+
+**Output:**
+
+```
+============================================================
+BFCL一键评估
+============================================================
+
+配置:
+   评估类别: simple_python
+   样本数量: 5
+   智能体: TestAgent
+
+============================================================
+步骤1: 运行HelloAgents评估
+============================================================
+✅ BFCL数据集加载完成
+   数据目录: ./temp_gorilla/berkeley-function-call-leaderboard/bfcl_eval/data
+   类别: simple_python
+   样本数: 400
+   Ground truth数: 400
+
+🔧 开始 BFCL 评估...
+   进度: 1/5
+   进度: 5/5
+
+✅ BFCL 评估完成
+   总体准确率: 100.00%
+   simple_python: 100.00% (5/5)
+
+📊 评估结果:
+   准确率: 100.00%
+   正确数: 5/5
+
+============================================================
+步骤2: 导出BFCL格式结果
+============================================================
+✅ BFCL格式结果已导出
+   输出文件: ./evaluation_results/bfcl_official/BFCL_v4_simple_python_result.json
+
+============================================================
+步骤3: 运行BFCL官方评估
+============================================================
+✅ 结果文件已复制到: ./result/Qwen_Qwen3-8B/BFCL_v4_simple_python_result.json
+
+🔄 运行命令: bfcl evaluate --model Qwen/Qwen3-8B --test-category simple_python --partial-eval
+
+============================================================
+BFCL官方评估结果
+============================================================
+📊 评估结果汇总:
+Model,Overall Acc,simple_python
+Qwen/Qwen3-8B,100.00,100.00
+
+🎯 最终结果:
+   准确率: 100.00%
+   正确数: 5/5
+
+============================================================
+步骤4: 生成评估报告
+============================================================
+📄 报告已生成: ./evaluation_reports/bfcl_report_20251011_005938.md
+
+准确率: 100.00%
+正确数: 5/5
+```
+
+**Automatically generated Markdown report:**
+
+When the evaluation finishes, a detailed Markdown report is generated automatically. It looks like this:
+
+```markdown
+# BFCL评估报告
+**生成时间**: 2025-10-11 00:59:38
+
+## 📊 评估概览
+
+- **智能体**: TestAgent
+- **评估类别**: simple_python
+- **总体准确率**: 100.00%
+- **正确样本数**: 5/5
+
+## 📈 详细指标
+
+### 分类准确率
+
+- **simple_python**: 100.00% (5/5)
+
+## 📝 样本详情
+
+| 样本ID | 问题 | 预测结果 | 正确答案 | 是否正确 |
+|--------|------|----------|----------|----------|
+| simple_python_0 | Find the area of a triangle... | [{'name': 'calculate_triangle_area'...}] | [{'function_name': {'base': [10]...}}] | ✅ |
+| simple_python_1 | Calculate the factorial of 5... | [{'name': 'calculate_factorial'...}] | [{'function_name': {'number': [5]}}] | ✅ |
+...
+
+## 📊 准确率可视化
+准确率: ██████████████████████████████████████████████████ 100.00%
+
+## 💡 建议
+- ✅ 表现优秀!智能体在工具调用方面表现出色。
+```
+
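+**Option 2: Use the one-step evaluation script**
+
+This suits quick evaluation from the command line. The code examples accompanying this chapter include `04_run_bfcl_evaluation.py`, which can be invoked directly:
+
+```bash
+# Run the evaluation script
+python chapter12/04_run_bfcl_evaluation.py --category simple_python --samples 10
+
+# Specify the model name (used by the official BFCL evaluation)
+python chapter12/04_run_bfcl_evaluation.py \
+    --category simple_python \
+    --samples 10 \
+    --model-name "Qwen/Qwen3-8B"
+```
+
+The script accepts three parameters: `--category` selects the evaluation category (default `simple_python`), `--samples` sets the number of samples to evaluate (default 5; 0 means all), and `--model-name` sets the model name used by the official BFCL evaluation (default `Qwen/Qwen3-8B`).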
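The three parameters and their defaults map naturally onto `argparse`. The following is a sketch of how such a command-line interface could be declared, based only on the description above; it is not the actual content of `04_run_bfcl_evaluation.py`, and `build_parser` is a name chosen here for illustration.

```python
import argparse


def build_parser():
    """Declare the three options described above, with the same defaults."""
    parser = argparse.ArgumentParser(description="Run a BFCL evaluation")
    parser.add_argument("--category", default="simple_python",
                        help="BFCL category to evaluate")
    parser.add_argument("--samples", type=int, default=5,
                        help="number of samples to evaluate (0 = all)")
    parser.add_argument("--model-name", default="Qwen/Qwen3-8B",
                        help="model name passed to the official BFCL evaluation")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["--samples", "10"])
print(args.category, args.samples, args.model_name)
```

Note that `argparse` exposes `--model-name` as the attribute `args.model_name`, since dashes in option names become underscores.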
+
+**Option 3: Use Dataset and Evaluator directly**
+
+This suits cases where the evaluation flow needs to be customized:
+
+```python
+from hello_agents import SimpleAgent, HelloAgentsLLM
+from hello_agents.evaluation import BFCLDataset, BFCLEvaluator
+
+# 1. 创建智能体
+llm = HelloAgentsLLM()
+agent = SimpleAgent(name="TestAgent", llm=llm)
+
+# 2. 加载数据集
+dataset = BFCLDataset(
+    bfcl_data_dir="./temp_gorilla/berkeley-function-call-leaderboard/bfcl_eval/data",
+    category="simple_python"
+)
+data = dataset.load()
+
+# 3. 创建评估器
+evaluator = BFCLEvaluator(
+    dataset=dataset,
+    category="simple_python",
+    evaluation_mode="ast"  # 使用AST匹配模式
+)
+
+# 4. 运行评估
+results = evaluator.evaluate(agent, max_samples=10)
+
+# 5. 查看结果
+print(f"准确率: {results['overall_accuracy']:.2%}")
+print(f"正确数: {results['correct_samples']}/{results['total_samples']}")
+
+# 6. 导出BFCL格式结果(可选)
+evaluator.export_to_bfcl_format(
+    results,
+    output_path="./evaluation_results/my_results.json"
+)
+```
+
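+These three options cover different needs. For a quick look at how an agent performs, the one-step evaluation in BFCLEvaluationTool is the most convenient. For batch evaluation or CI/CD integration, the command-line script is a better fit. For deep customization or integration into your own system, using Dataset and Evaluator directly gives the most flexibility.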
+
+
+
+
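+### 12.2.4 Integrating the Official BFCL Evaluation Tool
+
+So far we have used the evaluation features built into HelloAgents. In fact, `BFCLEvaluationTool` already **integrates the official BFCL evaluation tool automatically**, so the results you get are authoritative and comparable.
+
+The full flow has four steps: load the test data from the BFCL v4 dataset, run the HelloAgents evaluation to obtain the agent's predictions, export the results in the official BFCL format (JSONL), and compute the final score with the official evaluation script. This keeps the results consistent with the BFCL leaderboard, as shown in Figure 12.3:
+
+<div align="center">
+  <img src="https://raw.githubusercontent.com/datawhalechina/Hello-Agents/main/docs/images/12-figures/12-3.png" alt="" width="85%"/>
+  <p>Figure 12.3 How HelloAgents drives the BFCL evaluation</p>
+</div>
+
+When you use `BFCLEvaluationTool`, the official evaluation **runs automatically** (enabled by default):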
+
+```python
+from hello_agents import SimpleAgent, HelloAgentsLLM
+from hello_agents.tools import BFCLEvaluationTool
+
+# 创建智能体
+llm = HelloAgentsLLM()
+agent = SimpleAgent(name="TestAgent", llm=llm)
+
+# 创建评估工具
+bfcl_tool = BFCLEvaluationTool()
+
+# 运行评估(自动运行官方评估)
+results = bfcl_tool.run(
+    agent=agent,
+    category="simple_python",
+    max_samples=5,
+    # run_official_eval=True  # 默认为True,可以省略
+    model_name="Qwen/Qwen3-8B"  # 可选,指定模型名称
+)
+```
+
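+The tool runs the whole flow automatically: it runs the HelloAgents evaluation to obtain predictions, exports the results in BFCL format to the `evaluation_results/bfcl_official/` directory, copies the result file to `result/{model_name}/` as the official tool requires, runs the official BFCL evaluation command to compute the score, and finally prints the official results and generates a Markdown report.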
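The export and copy steps can be pictured with a small sketch modeled on the result file committed with this chapter, where each line has an `id` and a `result` string such as `math.factorial(number=5)`, and the directory name replaces `/` in the model name with `_`. Everything else is an assumption made for illustration: `call_to_string` and `export_bfcl_results` are hypothetical names, only single-call samples are handled, string arguments are rendered with Python's `repr`, and the `inference_log` field is omitted.

```python
import json
import tempfile
from pathlib import Path


def call_to_string(call):
    """Render {"name": ..., "arguments": {...}} as name(key=value, ...)."""
    args = ", ".join(f"{k}={v!r}" for k, v in call.get("arguments", {}).items())
    return f"{call['name']}({args})"


def export_bfcl_results(detailed_results, model_name, category, root="."):
    """Write one JSON line per sample under <root>/result/<model dir>/."""
    out_dir = Path(root) / "result" / model_name.replace("/", "_")
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"BFCL_v4_{category}_result.json"
    with out_path.open("w", encoding="utf-8") as f:
        for item in detailed_results:
            if len(item["predicted"]) != 1:
                # Multi-call formatting is not covered by this sketch.
                raise ValueError("only single-call samples are handled here")
            line = {"id": item["sample_id"],
                    "result": call_to_string(item["predicted"][0])}
            f.write(json.dumps(line, ensure_ascii=False) + "\n")
    return out_path


# One entry shaped like the detailed_results in this commit.
detailed = [{
    "sample_id": "simple_python_1",
    "predicted": [{"name": "math.factorial", "arguments": {"number": 5}}],
}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = export_bfcl_results(detailed, "Qwen/Qwen3-8B", "simple_python", root=tmp_dir)
    print(path.relative_to(tmp_dir))
    print(path.read_text(encoding="utf-8").strip())
```

Keeping the per-model directory and the `BFCL_v4_<category>_result.json` file name fixed is what lets the official evaluator locate the predictions from just a model name and a category.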
+
+**Sample output of the official evaluation:**
+
+```
+============================================================
+步骤3: 运行BFCL官方评估
+============================================================
+
+✅ 结果文件已复制到:
+   ./result/Qwen_Qwen3-8B/BFCL_v4_simple_python_result.json
+
+🔄 运行命令: bfcl evaluate --model Qwen/Qwen3-8B --test-category simple_python --partial-eval
+
+============================================================
+BFCL官方评估结果
+============================================================
+
+📊 评估结果汇总:
+Model,Overall Acc,simple_python
+Qwen/Qwen3-8B,100.00,100.00
+
+🎯 最终结果:
+   准确率: 100.00%
+   正确数: 5/5
+```
+
+To control the flow manually, disable the automatic official evaluation:
+
+```python
+# 禁用官方评估
+results = bfcl_tool.run(
+    agent=agent,
+    category="simple_python",
+    max_samples=5,
+    run_official_eval=False  # 禁用官方评估
+)
+
+# 然后手动运行官方评估
+import subprocess
+subprocess.run([
+    "bfcl", "evaluate",
+    "--model", "Qwen/Qwen3-8B",
+    "--test-category", "simple_python",
+    "--partial-eval"
+])
+```
+
+You can also generate the report manually:
+
+```python
+# 运行评估
+results = bfcl_tool.run(agent, category="simple_python", max_samples=5)
+
+# 手动生成报告
+report = bfcl_tool.generate_report(
+    results,
+    output_file="./my_reports/custom_report.md"
+)
+
+# 打印报告内容
+print(report)
+```
+
+
+
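+### 12.2.5 Implementation Details of the Core Components
+
+The previous sections showed how to use the BFCL evaluation tool. Now let us look at how the core components of the HelloAgents evaluation system are implemented. Understanding these details helps you use the system more effectively and lets you customize and extend it for your own needs.
+
+**(1) BFCLDataset: the dataset loader**
+
+BFCLDataset loads and manages the BFCL dataset: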
+
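+````python
+from typing import Any, Dict, List, Optional
+
+class BFCLDataset:
+    """BFCL dataset loader"""
+
+    def __init__(self, category: str = "simple", local_data_path: Optional[str] = None):
+        self.category = category
+        self.local_data_path = local_data_path
+        self.data = []
+
+    def load(self) -> List[Dict[str, Any]]:
+        """Load the dataset"""
+        # Prefer a local copy when one is configured
+        if self.local_data_path:
+            return self._load_from_local()
+        # Otherwise fall back to Hugging Face
+        return self._load_from_huggingface()
+````
+
+Because the BFCL dataset ships inside the official repository, the recommended approach is to clone it locally and evaluate against that copy; Hugging Face is used only when no local copy is found.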
+
+**(2) BFCLEvaluator: the evaluation runner**
+
+BFCLEvaluator runs the evaluation flow. Its core is the `evaluate()` method, which coordinates the whole process:
+
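+````python
+from typing import Any, Dict, Optional
+
+class BFCLEvaluator:
+    """BFCL evaluator"""
+
+    def evaluate(self, agent: Any, max_samples: Optional[int] = None) -> Dict[str, Any]:
+        """Run the evaluation"""
+        results = []
+
+        # max_samples of None or 0 means "evaluate every sample"
+        samples = self.dataset.load()
+        if max_samples:
+            samples = samples[:max_samples]
+
+        for item in samples:
+            # 1. Build the prompt
+            prompt = self._build_prompt(item)
+
+            # 2. Call the agent
+            response = agent.run(prompt)
+
+            # 3. Extract the function calls
+            predicted_calls = self._extract_function_calls(response)
+
+            # 4. Compare with the ground truth
+            is_correct = self._compare_calls(predicted_calls, item["ground_truth"])
+
+            results.append({
+                "id": item["id"],
+                "prediction": predicted_calls,
+                "ground_truth": item["ground_truth"],
+                "is_correct": is_correct
+            })
+
+        # Summary fields match the keys read in the usage examples earlier
+        correct = sum(r["is_correct"] for r in results)
+        return {
+            "results": results,
+            "total_samples": len(results),
+            "correct_samples": correct,
+            "overall_accuracy": correct / len(results) if results else 0.0,
+        }
+````
+
+The evaluator's design rests on three points. The first is prompt construction: the question and function definitions from the dataset are turned into a prompt the agent can understand. The second is function-call extraction: calls are pulled out of the agent's response, with support for several formats (JSON, code blocks, and so on). The third is AST matching: function calls are compared through abstract syntax trees, which is more accurate than plain string matching.
+
+Here is the implementation of function-call extraction: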
+
+```python
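+def _extract_function_calls(self, response: str) -> List[Dict[str, Any]]:
+    """Extract function calls from the agent's response.
+
+    Supported formats:
+    1. JSON: {"name": "func", "arguments": {...}} or a list of such objects
+    2. Fenced code block: ```python\nfunc(arg1=val1)\n```
+    3. Plain text: func(arg1=val1) (not handled in this excerpt)
+    """
+    calls = []
+
+    # Try JSON first: a list of calls, then a single call object.
+    # A single '{...}' pattern cannot capture a list of several objects.
+    for pattern in (r'\[.*\]', r'\{.*\}'):
+        json_match = re.search(pattern, response, re.DOTALL)
+        if not json_match:
+            continue
+        try:
+            data = json.loads(json_match.group())
+        except json.JSONDecodeError:
+            continue
+        if isinstance(data, dict) and "name" in data:
+            calls.append(data)
+        elif isinstance(data, list):
+            calls.extend(c for c in data if isinstance(c, dict) and "name" in c)
+        if calls:
+            break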
+
+    # 尝试代码块提取
+    code_blocks = re.findall(r'```(?:python)?\n(.*?)\n```', response, re.DOTALL)
+    for code in code_blocks:
+        # 解析Python函数调用
+        parsed_calls = self._parse_python_calls(code)
+        calls.extend(parsed_calls)
+
+    return calls
+```
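
The extraction code above delegates to `_parse_python_calls`, which is not shown in this chapter. The following is a minimal sketch of what such a helper could look like, written as a standalone function named `parse_python_calls` (a hypothetical name). It assumes Python 3.9+ for `ast.unparse`, handles only top-level call statements, and ignores positional arguments; the real helper may behave differently.

```python
import ast
from typing import Any, Dict, List


def parse_python_calls(code: str) -> List[Dict[str, Any]]:
    """Parse top-level calls such as func(a=1) into name/arguments dicts."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return []  # not valid Python, so nothing to extract

    calls = []
    for stmt in tree.body:
        # Only bare call statements are considered, e.g. get_weather(city="Paris")
        if not (isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Call)):
            continue
        call = stmt.value
        arguments = {}
        for kw in call.keywords:
            if kw.arg is None:  # skip **kwargs expansions
                continue
            try:
                arguments[kw.arg] = ast.literal_eval(kw.value)
            except (ValueError, TypeError, SyntaxError):
                # Non-literal argument: keep its source text instead
                arguments[kw.arg] = ast.unparse(kw.value)
        calls.append({"name": ast.unparse(call.func), "arguments": arguments})
    return calls
```

Parsing with `ast` rather than a regular expression means nested parentheses and quoted commas inside arguments are handled by the Python parser itself.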
+
+**(3) BFCLMetrics: the metrics calculator**
+
+BFCLMetrics computes the evaluation metrics:
+
+````python
+class BFCLMetrics:
+    """BFCL指标计算器"""
+
+    def compute_metrics(self, results: List[Dict[str, Any]]) -> Dict[str, Any]:
+        """计算所有指标"""
+        return {
+            "accuracy": self._compute_accuracy(results),
+            "ast_match_rate": self._compute_ast_match_rate(results),
+            "parameter_accuracy": self._compute_parameter_accuracy(results),
+            "f1_score": self._compute_f1_score(results),
+            "category_statistics": self._compute_category_stats(results)
+        }
+````
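+**How AST matching is implemented**:
+
+AST matching is the core technique in BFCL evaluation. Unlike plain string matching, it compares the structure of two calls, so formatting differences do not affect the result:
+
+```python
+def _ast_match(self, pred_call: Dict, true_call: Dict) -> bool:
+    """Compare two function calls through their ASTs.
+
+    Advantages over string matching:
+    1. Argument order is ignored: func(a=1, b=2) equals func(b=2, a=1)
+    2. Whitespace and formatting differences are ignored
+
+    Note: values are compared as literals, so an unevaluated expression
+    such as "2+3" does not match 5 unless it is evaluated beforehand.
+    """
+    # 1. Function names must match exactly
+    if pred_call.get("name") != true_call.get("name"):
+        return False
+
+    # 2. Convert the arguments into AST nodes
+    pred_args = self._args_to_ast(pred_call.get("arguments", {}))
+    true_args = self._args_to_ast(true_call.get("arguments", {}))
+
+    # 3. Compare the AST nodes
+    return ast.dump(pred_args) == ast.dump(true_args)
+
+def _args_to_ast(self, args: Dict[str, Any]) -> ast.AST:
+    """Convert an argument dict into an AST Call node."""
+    # Sort by parameter name; without sorting, ast.dump would differ
+    # for the same arguments given in a different order
+    code = f"func({', '.join(f'{k}={repr(v)}' for k, v in sorted(args.items()))})"
+    tree = ast.parse(code)
+    return tree.body[0].value  # the Call node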
+```
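
To make the order-insensitive comparison concrete, here is a small standalone sketch. The names `canonical_call` and `calls_match` are illustrative and not part of HelloAgents. It assumes that argument values are plain Python literals and that the function name is a valid Python expression.

```python
import ast
from typing import Any, Dict


def canonical_call(call: Dict[str, Any]) -> str:
    """Render a call as an AST dump with keyword arguments sorted by name."""
    args = call.get("arguments", {})
    source = "{}({})".format(
        call.get("name", "func"),
        ", ".join(f"{key}={value!r}" for key, value in sorted(args.items())),
    )
    return ast.dump(ast.parse(source, mode="eval"))


def calls_match(pred: Dict[str, Any], truth: Dict[str, Any]) -> bool:
    """Two calls match when their canonical AST dumps are identical."""
    return canonical_call(pred) == canonical_call(truth)


reference = {"name": "book_flight", "arguments": {"origin": "SFO", "seats": 2}}
reordered = {"name": "book_flight", "arguments": {"seats": 2, "origin": "SFO"}}
print(calls_match(reference, reordered))
```

Because values are compared by their literal representation, `2` and `2.0` count as different here. Whether that is desirable depends on how strict the benchmark's type checking is meant to be.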
+
+**(4) Wrapping it as a tool: BFCLEvaluationTool**
+
+Finally, we wrap these components into a Tool so that an agent can call it directly:
+
+````python
+class BFCLEvaluationTool(Tool):
+    """BFCL评估工具"""
+
+    def __init__(self, local_data_path: Optional[str] = None):
+        super().__init__(
+            name="bfcl_evaluation",
+            description="评估智能体的工具调用能力"
+        )
+        self.dataset = None
+        self.evaluator = None
+        self.metrics_calculator = BFCLMetrics()
+
+    def run(self, parameters: Dict[str, Any]) -> str:
+        """执行评估"""
+        # 1. 加载数据集
+        self.dataset = BFCLDataset(...)
+
+        # 2. 创建评估器
+        self.evaluator = BFCLEvaluator(...)
+
+        # 3. 运行评估
+        results = self.evaluator.evaluate(...)
+
+        # 4. 计算指标
+        metrics = self.metrics_calculator.compute_metrics(...)
+
+        # 5. 返回JSON结果
+        return json.dumps(results, ensure_ascii=False)
+````
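+The tool follows three design principles. It inherits from the Tool base class, so it conforms to the HelloAgents tool specification and integrates with the framework directly. It validates parameters strictly, checking required parameters and returning clear error messages. It formats the results as a JSON string that is easy to parse and display. With this modular design, users can run an evaluation quickly through the high-level Tool interface, or work with the underlying components directly when they need custom behavior.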
+
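+### 12.2.6 Extensions and Optimization Suggestions
+
+We have now covered how to run BFCL evaluations with HelloAgents. Note that the current implementation is a simple reproduction built on SimpleAgent and covers only the basic functionality of BFCL evaluation. The BFCL benchmark spans many categories and scenarios, and scoring higher on the leaderboard takes further optimization and extension.
+
+**(1) Limitations of the current implementation**
+
+The current SimpleAgent implementation focuses on setting up the evaluation pipeline, and its tool-calling ability has room to improve. SimpleAgent uses a custom tool-call format, `[TOOL_CALL:tool_name:parameters]`, which the LLM has to pick up from the prompt and apply on its own; in complex scenarios this may perform worse than an agent that uses native function calling. In addition, we have so far tested only basic categories such as simple_python; the more complex multiple, parallel and irrelevance categories need targeted optimization.
+
+**(2) Directions for raising the BFCL score**
+
+There are three main directions. First, improve the agent's tool-calling ability, either by using an LLM that supports native function calling (such as GPT-4 or Claude) or by refining the prompt so that the LLM follows the tool-call format more reliably. Second, extend the tool library: BFCL tests involve many kinds of functions, so implementing common tool types in advance, based on the characteristics of the test data, raises the agent's tool coverage. Third, use a different strategy for each category: in the multiple category the agent is given several candidate functions and has to select the right one; in the parallel category it has to issue several function calls for a single query; and in the irrelevance category it has to recognize that none of the provided functions applies and refrain from calling any.
+
+**(3) Practical advice**
+
+For developers who want better BFCL results, we suggest the following. Start with the simple category and make sure basic single-function calls work reliably, since everything else builds on that. Then test the more complex categories such as multiple and parallel step by step, analyze the failure cases, and identify where the agent is weak. While optimizing, study the high-scoring models on the BFCL leaderboard for design ideas and techniques. Finally, verify your results with the official evaluation tool so that they are consistent with the leaderboard's standard.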
+
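+The following practices are useful when running evaluations:
+
+**1. Progressive evaluation**
+
+Start with a small sample and increase the sample size step by step:
+
+```python
+# Step 1: quick test (5 samples)
+results_quick = bfcl_tool.run(agent, category="simple_python", max_samples=5)
+
+# Step 2: medium-scale test (50 samples), only if the quick test passed
+if results_quick['overall_accuracy'] > 0.8:
+    results_medium = bfcl_tool.run(agent, category="simple_python", max_samples=50)
+
+    # Step 3: full evaluation (all samples), only if step 2 also passed.
+    # Nesting avoids a NameError when step 2 was skipped.
+    if results_medium['overall_accuracy'] > 0.8:
+        results_full = bfcl_tool.run(agent, category="simple_python", max_samples=0)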
+```
+
+**2. Multi-category evaluation**
+
+Evaluate tasks of different difficulty:
+
+```python
+categories = ["simple_python", "multiple", "parallel", "irrelevance"]
+
+for category in categories:
+    print(f"\n评估类别: {category}")
+    results = bfcl_tool.run(agent, category=category, max_samples=10)
+    print(f"准确率: {results['overall_accuracy']:.2%}")
+```
+
+**3. Comparative evaluation**
+
+Compare agents with different configurations:
+
+```python
+# 配置1:默认提示词
+agent1 = SimpleAgent(name="Agent-Default", llm=llm)
+results1 = bfcl_tool.run(agent1, category="simple_python", max_samples=10)
+
+# 配置2:优化提示词
+agent2 = SimpleAgent(name="Agent-Optimized", llm=llm)
+# ... 设置优化的系统提示词 ...
+results2 = bfcl_tool.run(agent2, category="simple_python", max_samples=10)
+
+# 对比结果
+print(f"默认配置准确率: {results1['overall_accuracy']:.2%}")
+print(f"优化配置准确率: {results2['overall_accuracy']:.2%}")
+```
+
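+If your evaluation results are strong, consider submitting them to the official BFCL leaderboard.
+
+**Step 1: Prepare the submission materials**
+
+1. Model description document
+2. Evaluation result files (all categories)
+3. How to access the model (API or open-source link)
+
+**Step 2: Submit on GitHub**
+
+Go to the official BFCL repository and open a Pull Request following its instructions:
+
+- Repository: https://github.com/ShishirPatil/gorilla
+- Submission guide: see `CONTRIBUTING.md`
+
+**Step 3: Wait for review**
+
+The BFCL team reviews the submission and verifies the results. Once it is approved, your model appears on the official leaderboard.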
+
+
+
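+## 12.3 GAIA: Evaluating General AI Assistant Capabilities
+
+### 12.3.1 Introduction to the GAIA Benchmark
+
+GAIA (General AI Assistants) is an evaluation benchmark released jointly by Meta AI and Hugging Face that focuses on the **general capabilities** of AI assistants<sup>[2]</sup>. Whereas BFCL concentrates on tool calling, GAIA measures how an agent performs overall on real-world tasks.
+
+The idea behind GAIA is that **real-world problems usually require several capabilities at once**. Beyond calling tools, a good AI assistant also needs:
+
+- **Multi-step reasoning**: breaking a complex problem into sub-problems
+- **Knowledge use**: drawing on built-in knowledge and external knowledge bases
+- **Multimodal understanding**: handling text, images, files and other inputs
+- **Web browsing**: retrieving up-to-date information from the internet
+- **File handling**: reading and processing files in various formats
+
+**(1) Structure of the GAIA dataset**
+
+With the evaluation philosophy in mind, let us look at how the dataset is organized. GAIA contains 466 carefully designed real-world questions, divided into three difficulty levels according to their complexity and the number of reasoning steps required. They range from simple tasks that need almost no reasoning steps to hard tasks that need long multi-step reasoning, as shown in Table 12.3: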
+
+<div align="center">
+  <p>表 12.3 GAIA 数据集难度级别分布</p>
+  <img src="https://raw.githubusercontent.com/datawhalechina/Hello-Agents/main/docs/images/12-figures/12-table-3.png" alt="" width="85%"/>
+</div>
+
+The following snippet shows what a GAIA sample looks like:
+
+```json
+{
+  "task_id": "gaia_001",
+  "Question": "What is the total population of the top 3 most populous cities in California?",
+  "Level": 2,
+  "Final answer": "12847521",
+  "file_name": "",
+  "file_path": "",
+  "Annotator Metadata": {
+    "Steps": [
+      "Search for most populous cities in California",
+      "Get population data for top 3 cities",
+      "Sum the populations"
+    ],
+    "Number of steps": 3,
+    "How long did this take?": "5 minutes",
+    "Tools": ["web_search", "calculator"]
+  }
+}
+```
+
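+**Key fields:**
+- `Question`: the question text
+- `Level`: difficulty level (1-3)
+- `Final answer`: the reference answer (may be a number, text, or a file)
+- `file_name/file_path`: attachment, if any
+- `Annotator Metadata`: metadata from the annotator (reasoning steps, required tools, and so on)
+
+**(2) Quasi exact match**
+
+GAIA scores answers with **Quasi Exact Match**, the evaluation criterion defined by the GAIA authors. The core idea is to **normalize both answers first and then compare them exactly**.
+
+Given a predicted answer $A_{\text{pred}}$ and a reference answer $A_{\text{true}}$, the quasi exact match function is defined as: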
+
+$$
+\text{Quasi\_Exact\_Match}(A_{\text{pred}}, A_{\text{true}}) = \begin{cases}
+1 & \text{if } \mathcal{N}(A_{\text{pred}}) = \mathcal{N}(A_{\text{true}}) \\
+0 & \text{otherwise}
+\end{cases}
+$$
+
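+where $\mathcal{N}(\cdot)$ is the normalization function, which applies different rules depending on the answer type.
+
+For numbers, it removes thousands separators (`1,000` → `1000`) and unit symbols (`$100` → `100`, `50%` → `50`); for example, `"$1,234.56"` becomes `"1234.56"`. For strings, it lowercases the text (`"Apple"` → `"apple"`), removes articles (`"the apple"` → `"apple"`), collapses extra whitespace (`"hello  world"` → `"hello world"`) and strips trailing punctuation (`"hello."` → `"hello"`); for example, `"The United States"` becomes `"united states"`. For lists, it splits on commas, applies string normalization to each element, sorts the elements alphabetically and joins them again; for example, `"Paris, London, Berlin"` becomes `"berlin,london,paris"`.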
+
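+**Normalization examples:**
+
+```python
+# Numeric answer
+raw = "$1,234.56"
+normalized = "1234.56"
+
+# String answer
+raw = "The United States of America"
+normalized = "united states of america"
+
+# List answer (elements are joined without spaces after sorting)
+raw = "Paris, London, Berlin"
+normalized = "berlin,london,paris"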
+```
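
The normalization rules and the quasi exact match definition above can be sketched as a few standalone functions. This is a minimal illustration under stated assumptions, not GAIA's official scorer: the names `normalize_item`, `normalize` and `quasi_exact_match` are hypothetical, and a comma followed by exactly three digits is assumed to be a thousands separator rather than a list delimiter.

```python
import re

ARTICLES = {"the", "a", "an"}


def normalize_item(text: str) -> str:
    """Normalize a single number or string answer."""
    text = text.strip().lower()
    for symbol in "$%€£":  # drop currency and percent signs
        text = text.replace(symbol, "")
    words = text.split()  # also collapses repeated whitespace
    if words and words[0] in ARTICLES:
        words = words[1:]
    return " ".join(words).rstrip(".,;:!?")


def normalize(answer: str) -> str:
    """Normalize an answer that may be a number, a string or a list."""
    # Drop thousands separators first so "1,234.56" is not split as a list
    answer = re.sub(r"(?<=\d),(?=\d{3}(?!\d))", "", answer.strip())
    if "," in answer:
        return ",".join(sorted(normalize_item(part) for part in answer.split(",")))
    return normalize_item(answer)


def quasi_exact_match(pred: str, truth: str) -> int:
    """Return 1 when the normalized answers are identical, else 0."""
    return int(normalize(pred) == normalize(truth))


print(normalize("$1,234.56"))
print(normalize("Paris, London, Berlin"))
print(quasi_exact_match("the apple.", "Apple"))
```

The thousands-separator heuristic is a design choice with a cost: a genuine two-element list such as `1,234` written without a space would be read as one number.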
+
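+**(3) GAIA evaluation metrics**
+
+GAIA uses the following metrics to evaluate agent performance:
+
+**1. Exact Match Rate**
+
+The exact match rate is GAIA's core metric, defined as the fraction of samples that pass quasi exact match: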
+
+$$
+\text{Exact Match Rate} = \frac{1}{N} \sum_{i=1}^{N} \text{Quasi\_Exact\_Match}(A_{\text{pred},i}, A_{\text{true},i})
+$$
+
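+where:
+- $N$ is the total number of samples
+- $A_{\text{pred},i}$ is the predicted answer for sample $i$
+- $A_{\text{true},i}$ is the reference answer for sample $i$
+- $\text{Quasi\_Exact\_Match}(\cdot, \cdot) \in \{0, 1\}$ is the quasi exact match function
+
+**2. Level-wise Accuracy**
+
+For each difficulty level $\ell \in \{1, 2, 3\}$, compute the accuracy at that level: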
+
+$$
+\text{Accuracy}_\ell = \frac{1}{|D_\ell|} \sum_{i \in D_\ell} \text{Quasi\_Exact\_Match}(A_{\text{pred},i}, A_{\text{true},i})
+$$
+
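+where $D_\ell$ is the set of samples at level $\ell$ and $|D_\ell|$ is the number of samples at that level.
+
+**3. Difficulty Progression Drop Rate**
+
+This measures how much performance degrades as difficulty increases (it is defined only when $\text{Accuracy}_\ell > 0$):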
+
+$$
+\text{Drop Rate}_{\ell \to \ell+1} = \frac{\text{Accuracy}_\ell - \text{Accuracy}_{\ell+1}}{\text{Accuracy}_\ell}
+$$
+
+- $\text{Drop Rate}_{1 \to 2}$: the drop rate from Level 1 to Level 2
+- $\text{Drop Rate}_{2 \to 3}$: the drop rate from Level 2 to Level 3
+
+**4. Average Reasoning Steps**
+
+The average number of steps the agent needs on the tasks it answers correctly:
+
+$$
+\text{Avg Steps} = \frac{1}{N_{\text{correct}}} \sum_{i \in \text{Correct}} \text{steps}_i
+$$
+
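+where $N_{\text{correct}}$ is the number of correctly answered samples and $\text{steps}_i$ is the number of reasoning steps for sample $i$.
+
+**Interpreting the metrics:**
+
+- **Exact Match Rate = 1.0**: every sample is correct
+- **Exact Match Rate = 0.5**: half of the samples are correct and half are wrong
+- **Drop Rate = 0.3**: accuracy falls by 30% in relative terms as difficulty increases
+- **Drop Rate = 0.0**: higher difficulty does not affect accuracy (the ideal case)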
+
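
The first three metrics can be computed from a flat list of per-sample results. The sketch below assumes each result is a dict with a `level` and an `exact_match` field. The function name `gaia_metrics` and that record layout are assumptions made for illustration, not the HelloAgents API.

```python
from collections import defaultdict
from typing import Dict, List


def gaia_metrics(results: List[dict]) -> dict:
    """Compute exact match rate, level-wise accuracy and drop rates.

    Each result is assumed to look like {"level": 1, "exact_match": True}.
    """
    per_level: Dict[int, List[bool]] = defaultdict(list)
    for item in results:
        per_level[item["level"]].append(bool(item["exact_match"]))

    total = sum(len(flags) for flags in per_level.values())
    correct = sum(sum(flags) for flags in per_level.values())
    level_accuracy = {
        level: sum(flags) / len(flags) for level, flags in sorted(per_level.items())
    }

    drop_rates = {}
    for level in (1, 2):
        current, harder = level_accuracy.get(level), level_accuracy.get(level + 1)
        # The drop rate is undefined when a level is missing or its accuracy is 0
        if current and harder is not None:
            drop_rates[f"{level}->{level + 1}"] = (current - harder) / current

    return {
        "exact_match_rate": correct / total if total else 0.0,
        "level_accuracy": level_accuracy,
        "drop_rates": drop_rates,
    }
```

Skipping the drop rate when the easier level has zero accuracy avoids a division by zero; reporting it as missing is usually more honest than substituting a number.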
+**Evaluation example:**
+
+Suppose we evaluated 10 samples and obtained the results shown in Table 12.4:
+
+<div align="center">
+  <p>Table 12.4 Example GAIA evaluation results</p>
+  <img src="https://raw.githubusercontent.com/datawhalechina/Hello-Agents/main/docs/images/12-figures/12-table-4.png" alt="" width="85%"/>
+</div>
+
+The metrics for this example can be computed with the following Python script:
+
+```python
+# 1. Exact match rate
+total_samples = 10
+correct_samples = 7  # samples 1, 2, 3, 5, 6, 8, 9
+exact_match_rate = correct_samples / total_samples  # 0.70
+
+# 2. Level-wise accuracy
+level_1_accuracy = 3 / 3  # samples 1, 2, 3 correct -> 1.00
+level_2_accuracy = 2 / 3  # samples 5, 6 correct -> 0.67
+level_3_accuracy = 2 / 4  # samples 8, 9 correct -> 0.50
+
+# 3. Difficulty progression drop rate
+drop_rate_1_to_2 = (level_1_accuracy - level_2_accuracy) / level_1_accuracy  # 0.33
+drop_rate_2_to_3 = (level_2_accuracy - level_3_accuracy) / level_2_accuracy  # 0.25
+
+print(f"Exact match rate: {exact_match_rate:.2%}")  # 70.00%
+print(f"Level 1 accuracy: {level_1_accuracy:.2%}")  # 100.00%
+print(f"Level 2 accuracy: {level_2_accuracy:.2%}")  # 66.67%
+print(f"Level 3 accuracy: {level_3_accuracy:.2%}")  # 50.00%
+print(f"Level 1->2 drop rate: {drop_rate_1_to_2:.2%}")  # 33.33%
+print(f"Level 2->3 drop rate: {drop_rate_2_to_3:.2%}")  # 25.00%
+```
+
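+**Analysis:**
+
+- **Overall performance**: an exact match rate of 70%, which is a good result
+- **Difficulty sensitivity**: accuracy drops by 33% from Level 1 to Level 2, a clear degradation on medium-difficulty tasks
+- **Capability boundary**: Level 3 accuracy is 50%, so there is still room to improve on complex tasks
+
+The larger the drop rate, the more sharply the agent's capability degrades on complex tasks.
+
+**(4) The official GAIA system prompt**
+
+GAIA requires a specific system prompt so that model output follows the format the evaluation expects: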
+
+```python
+GAIA_SYSTEM_PROMPT = """You are a general AI assistant. I will ask you a question. Report your thoughts, and finish your answer with the following template: FINAL ANSWER: [YOUR FINAL ANSWER].
+
+YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma separated list of numbers and/or strings.
+
+If you are asked for a number, don't use comma to write your number neither use units such as $ or percent sign unless specified otherwise.
+
+If you are asked for a string, don't use articles, neither abbreviations (e.g. for cities), and write the digits in plain text unless specified otherwise.
+
+If you are asked for a comma separated list, apply the above rules depending of whether the element to be put in the list is a number or a string."""
+```
+
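+GAIA is strict about the answer format. The answer must be given as `FINAL ANSWER: [answer]`; numeric answers use no thousands separators and no unit symbols; string answers use no articles and no abbreviations; list answers are comma-separated, with each element following the number or string rules above. The prompt itself does not ask for any particular ordering of list elements; the alphabetical sorting described earlier is applied during normalization.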
+
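+### 12.3.2 Obtaining the GAIA Dataset
+
+**Important**: GAIA is a **gated dataset**, so you must request access on HuggingFace first.
+
+**Step 1: Request access**
+
+1. Visit https://huggingface.co/datasets/gaia-benchmark/GAIA
+2. Click the "Request access" button
+3. Fill in the request form (approval usually takes a few seconds)
+4. Get your HuggingFace token: https://huggingface.co/settings/tokens
+
+**Step 2: Configure the environment variable**
+
+Add your HuggingFace token to the `.env` file: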
+
+```bash
+# HuggingFace API 配置
+HF_TOKEN=hf_your_token_here
+```
+
+**方法1:使用HelloAgents自动下载(推荐)**
+
+HelloAgents会自动处理GAIA数据集的下载和缓存:
+
+```python
+from hello_agents.evaluation import GAIADataset
+import os
+
+# 确保设置了HF_TOKEN,如果设置了.env无需这一行
+os.environ["HF_TOKEN"] = "hf_your_token_here"
+
+# 自动下载到 ./data/gaia/
+dataset = GAIADataset(
+    dataset_name="gaia-benchmark/GAIA",
+    split="validation",  # 或 "test"
+    level=1  # 可选: 1, 2, 3, None(全部)
+)
+items = dataset.load()
+
+print(f"加载了 {len(items)} 个测试样本")
+# 输出: 加载了 53 个测试样本 (Level 1)
+```
+
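+**How it works**:
+
+- On the first run, `snapshot_download` downloads the whole dataset to `./data/gaia/`
+- The dataset contains 114 files (questions, images, PDFs and other materials)
+- Later runs load directly from the local copy, which is fast
+
+**Dataset directory layout**: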
+```
+./data/gaia/
+├── 2023/
+│   ├── validation/
+│   │   ├── metadata.jsonl  (165 questions)
+│   │   └── *.png, *.pdf, *.csv, *.xlsx  (attachments)
+│   └── test/
+│       ├── metadata.jsonl  (301 questions)
+│       └── ... (attachments)
+├── GAIA.py
+└── README.md
+```
+
+**方法2:手动下载**
+
+如果你想手动下载数据集:
+
+```python
+from huggingface_hub import snapshot_download
+import os
+
+# 设置Token
+os.environ["HF_TOKEN"] = "hf_your_token_here"
+
+# 下载数据集
+snapshot_download(
+    repo_id="gaia-benchmark/GAIA",
+    repo_type="dataset",
+    local_dir="./data/gaia",
+    token=os.getenv("HF_TOKEN")
+)
+```
+
+**查看数据集统计**:
+
+```python
+# 查看数据集统计
+stats = dataset.get_statistics()
+print(f"总样本数: {stats['total_samples']}")
+print(f"级别分布: {stats['level_distribution']}")
+# 输出:
+# 总样本数: 165
+# 级别分布: {1: 53, 2: 62, 3: 50}
+```
+
+
+### 12.3.3 在HelloAgents中实现GAIA评估
+
+与BFCL类似,我们提供两种评估方式,推荐使用**方式1**。
+
+**方式1:使用GAIAEvaluationTool一键评估**
+
+这是最简单的方式,自动完成数据集下载、评估执行、结果导出和报告生成:
+
+```python
+from hello_agents import SimpleAgent, HelloAgentsLLM
+from hello_agents.tools import GAIAEvaluationTool
+
+# GAIA官方系统提示词(来自论文)
+GAIA_SYSTEM_PROMPT = """You are a general AI assistant. I will ask you a question. Report your thoughts, and finish your answer with the following template: FINAL ANSWER: [YOUR FINAL ANSWER].
+
+YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma separated list of numbers and/or strings.
+
+If you are asked for a number, don't use comma to write your number neither use units such as $ or percent sign unless specified otherwise.
+
+If you are asked for a string, don't use articles, neither abbreviations (e.g. for cities), and write the digits in plain text unless specified otherwise.
+
+If you are asked for a comma separated list, apply the above rules depending of whether the element to be put in the list is a number or a string."""
+
+# 1. 创建智能体(使用GAIA官方系统提示词)
+llm = HelloAgentsLLM()
+agent = SimpleAgent(
+    name="TestAgent",
+    llm=llm,
+    system_prompt=GAIA_SYSTEM_PROMPT  # 关键:使用GAIA官方提示词
+)
+
+# 2. 创建GAIA评估工具
+gaia_tool = GAIAEvaluationTool()
+
+# 3. 一键运行评估
+results = gaia_tool.run(
+    agent=agent,
+    level=1,  # Level 1: 简单任务
+    max_samples=5,  # 评估5个样本
+    export_results=True,  # 导出GAIA格式结果
+    generate_report=True  # 生成评估报告
+)
+
+# 4. 查看结果
+print(f"精确匹配率: {results['exact_match_rate']:.2%}")
+print(f"部分匹配率: {results['partial_match_rate']:.2%}")
+print(f"正确数: {results['exact_matches']}/{results['total_samples']}")
+```
+
+**运行结果:**
+
+```
+============================================================
+GAIA一键评估
+============================================================
+
+配置:
+   智能体: TestAgent
+   难度级别: 1
+   样本数量: 5
+
+============================================================
+步骤1: 运行HelloAgents评估
+============================================================
+   正在从HuggingFace下载: gaia-benchmark/GAIA
+   📥 下载GAIA数据集...
+   ✓ 数据集下载完成
+   ✓ 加载了 165 个样本
+✅ GAIA数据集加载完成
+   数据源: gaia-benchmark/GAIA
+   分割: validation
+   级别: 1
+   样本数: 53
+
+🌟 开始 GAIA 评估...
+   样本数量: 5
+   进度: 5/5
+✅ GAIA 评估完成
+   精确匹配率: 80.00%
+   部分匹配率: 80.00%
+
+============================================================
+步骤2: 导出GAIA格式结果
+============================================================
+✅ GAIA格式结果已导出
+   输出文件: evaluation_results\gaia_official\gaia_level1_result_20251011_012648.jsonl
+   样本数: 5
+   包含推理轨迹: True
+📄 提交说明已生成: evaluation_results\gaia_official\SUBMISSION_GUIDE_20251011_012648.md
+
+============================================================
+步骤3: 生成评估报告
+============================================================
+📄 报告已生成: evaluation_reports\gaia_report_20251011_012648.md
+
+============================================================
+🎯 最终结果
+============================================================
+   精确匹配率: 80.00%
+   部分匹配率: 80.00%
+   正确数: 4/5
+```
+
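+After the evaluation finishes, three kinds of files are generated automatically. The first is the GAIA-format result file (`evaluation_results/gaia_official/gaia_level1_result_*.jsonl`), written as JSONL (one JSON object per line) and ready to submit to the GAIA leaderboard. The second is the submission guide (`evaluation_results/gaia_official/SUBMISSION_GUIDE_*.md`), with detailed submission steps, a description of the result file format, and notes. The third is the evaluation report (`evaluation_reports/gaia_report_*.md`), with a result summary, detailed metrics, sample details, and a visualization.
+
+**Note**: A low accuracy here is expected. Although Level 1 contains the simplest tasks, many of them still require tool use (a search engine, a calculator, and so on) to answer correctly. The SimpleAgent used here mainly demonstrates the evaluation flow, and its tool-calling ability leaves room for improvement.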
+
+**方式2:使用Dataset + Evaluator(灵活定制)**
+
+如果需要更细粒度的控制,可以直接使用底层组件:
+
+```python
+from hello_agents.evaluation import GAIADataset, GAIAEvaluator
+
+# 1. 加载数据集
+dataset = GAIADataset(level=1)
+items = dataset.load()
+print(f"加载了 {len(items)} 个样本")
+
+# 2. 创建评估器
+evaluator = GAIAEvaluator(dataset=dataset, level=1)
+
+# 3. 运行评估
+results = evaluator.evaluate(agent, max_samples=5)
+
+# 4. 导出GAIA格式结果
+evaluator.export_to_gaia_format(
+    results,
+    "gaia_results.jsonl",
+    include_reasoning=True
+)
+```
+
+生成的评估报告(`gaia_report_*.md`)可参考下面的文件:
+
+```markdown
+# GAIA评估报告
+
+**生成时间**: 2025-10-11 01:26:48
+
+## 📊 评估概览
+
+- **智能体**: TestAgent
+- **难度级别**: 1
+- **总样本数**: 2
+- **精确匹配数**: 1
+- **部分匹配数**: 1
+- **精确匹配率**: 50.00%
+- **部分匹配率**: 50.00%
+
+## 📈 详细指标
+
+### 分级准确率
+
+- **Level 1**: 50.00% 精确 / 50.00% 部分 (1/2)
+
+## 📝 样本详情(前10个)
+
+| 任务ID | 级别 | 预测答案 | 正确答案 | 精确匹配 | 部分匹配 |
+|--------|------|----------|----------|----------|----------|
+| e1fc63a2-da7a-432f-be78-7c4a95598703 | 1 | 24000 | 17 | ❌ | ❌ |
+| 8e867cd7-cff9-4e6c-867a-ff5ddc2550be | 1 | 3 | 3 | ✅ | ✅ |
+
+## 📊 准确率可视化
+
+精确匹配: █████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░ 50.00%
+部分匹配: █████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░ 50.00%
+
+
+## 💡 建议
+
+- ⚠️ 表现一般,需要改进。
+- 💡 建议检查工具使用和多步推理能力。
+```
+
+**生成的GAIA格式结果(`gaia_level1_result_*.jsonl`):**
+
+```json
+{"task_id": "e1fc63a2-da7a-432f-be78-7c4a95598703", "model_answer": "24000", "reasoning_trace": "24000"}
+{"task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be", "model_answer": "3", "reasoning_trace": "3"}
+```
+
+### 12.3.4 提交结果到GAIA官方排行榜
+
+使用GAIAEvaluationTool运行评估后,会在`evaluation_results/gaia_official/`目录下生成提交所需的文件和详细的提交说明。
+
+1. **GAIA格式结果文件**:`gaia_level1_result_*.jsonl`
+   ```json
+   {"task_id": "xxx", "model_answer": "答案", "reasoning_trace": "推理过程"}
+   {"task_id": "yyy", "model_answer": "答案", "reasoning_trace": "推理过程"}
+   ```
+
+2. **提交说明文件**:`SUBMISSION_GUIDE_*.md`
+
+打开自动生成的`SUBMISSION_GUIDE_*.md`文件,里面包含完整的提交指南:
+
+具体来说,打开浏览器,访问:
+```
+https://huggingface.co/spaces/gaia-benchmark/leaderboard
+```
+
+如图12.4所示,提交表单中填写信息即可:
+
+<div align="center">
+  <img src="https://raw.githubusercontent.com/datawhalechina/Hello-Agents/main/docs/images/12-figures/12-4.png" alt="" width="85%"/>
+  <p>Figure 12.4 GAIA leaderboard submission form</p>
+</div>
+
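+Before submitting, you can inspect the generated JSONL file manually:
+
+```python
+import glob
+import json
+
+# open() does not expand wildcards, so resolve the newest result file with glob
+# (the timestamp in the file name makes lexical order match creation order)
+pattern = "evaluation_results/gaia_official/gaia_level1_result_*.jsonl"
+result_file = sorted(glob.glob(pattern))[-1]
+
+# Read the result file
+with open(result_file, "r", encoding="utf-8") as f:
+    for line in f:
+        result = json.loads(line)
+        print(f"Task ID: {result['task_id']}")
+        print(f"Answer: {result['model_answer']}")
+        print(f"Reasoning: {result['reasoning_trace']}")
+        print("-" * 50)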
+```
+
+### 12.3.5 核心组件实现细节
+
+GAIA评估系统的实现与BFCL类似,但针对通用能力评估有一些特殊的设计。
+
+**(1)GAIADataset:支持多模态的数据加载器**
+
+GAIA数据集的特殊之处在于它包含多模态数据(文本、文件、图片等):
+
+````python
+class GAIADataset:
+    """GAIA数据集加载器
+
+    支持从HuggingFace加载GAIA数据集(受限数据集)
+    """
+
+    def __init__(
+        self,
+        level: Optional[int] = None,
+        split: str = "validation",
+        local_data_dir: Optional[str] = None
+    ):
+        self.level = level
+        self.split = split
+        self.local_data_dir = local_data_dir or "./data/gaia"
+        self.data = []
+
+    def load(self) -> List[Dict[str, Any]]:
+        """加载数据集"""
+        # 从HuggingFace下载
+        items = self._load_from_huggingface()
+
+        # 按级别过滤
+        if self.level:
+            items = [item for item in items if item.get("level") == self.level]
+
+        self.data = items
+        return items
+
+    def _load_from_huggingface(self) -> List[Dict[str, Any]]:
+        """从HuggingFace下载GAIA数据集"""
+        from huggingface_hub import snapshot_download
+        from pathlib import Path
+        import json
+
+        # 下载数据集
+        repo_id = "gaia-benchmark/GAIA"
+        local_dir = snapshot_download(
+            repo_id=repo_id,
+            repo_type="dataset",
+            local_dir=self.local_data_dir,
+            local_dir_use_symlinks=False
+        )
+
+        # 加载JSONL文件
+        data_file = Path(local_dir) / "2023" / self.split / "metadata.jsonl"
+        items = []
+        with open(data_file, 'r', encoding='utf-8') as f:
+            for line in f:
+                item = json.loads(line)
+                items.append(self._standardize_item(item))
+
+        return items
+````
+**(2)GAIAEvaluator:实现GAIA官方评估算法**
+
+GAIA的评估使用**准精确匹配(Quasi Exact Match)**算法,需要特殊的答案归一化和匹配逻辑:
+
+````python
+class GAIAEvaluator:
+    """GAIA评估器
+
+    实现GAIA官方的准精确匹配(Quasi Exact Match)评估算法
+    """
+
+    def evaluate(self, agent: Any, max_samples: Optional[int] = None) -> Dict[str, Any]:
+        """执行评估"""
+        dataset_items = self.dataset.load()
+
+        if max_samples:
+            dataset_items = dataset_items[:max_samples]
+
+        results = []
+        for i, item in enumerate(dataset_items, 1):
+            # 1. 构造提示词
+            prompt = self._build_prompt(item["question"], item)
+
+            # 2. 调用智能体
+            response = agent.run(prompt)
+
+            # 3. 提取答案(GAIA格式:FINAL ANSWER: [答案])
+            predicted_answer = self._extract_answer(response)
+
+            # 4. 归一化答案(GAIA官方规则)
+            normalized_pred = self._normalize_answer(predicted_answer)
+            normalized_truth = self._normalize_answer(item["final_answer"])
+
+            # 5. 准精确匹配
+            exact_match = (normalized_pred == normalized_truth)
+
+            results.append({
+                "task_id": item["task_id"],
+                "predicted": predicted_answer,
+                "expected": item["final_answer"],
+                "exact_match": exact_match,
+                "level": item.get("level", 0)
+            })
+
+        return self._format_results(results)
+````
+GAIA使用特定的归一化规则来处理不同类型的答案:
+
+```python
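+def _normalize_answer(self, answer: str) -> str:
+    """Normalize an answer string (GAIA normalization rules).
+
+    Rules:
+    1. Numbers: remove thousands separators and unit symbols
+    2. Strings: remove articles, lowercase, collapse extra whitespace
+    3. Lists: split on commas and sort alphabetically
+    """
+    if not answer:
+        return ""
+
+    answer = answer.strip()
+
+    # Remove thousands separators first; otherwise "$1,234.56" would be
+    # split as a list. A bare "1,234" is treated as one number here.
+    answer = re.sub(r'(?<=\d),(?=\d{3}(?!\d))', '', answer)
+
+    # Check whether the answer is a comma-separated list
+    if ',' in answer:
+        parts = [self._normalize_single_answer(p.strip()) for p in answer.split(',')]
+        parts.sort()  # sort alphabetically so element order does not matter
+        return ','.join(parts)
+    else:
+        return self._normalize_single_answer(answer)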
+
+def _normalize_single_answer(self, answer: str) -> str:
+    """标准化单个答案(不包含逗号的答案)"""
+    answer = answer.strip().lower()
+
+    # 移除常见的冠词
+    articles = ['the', 'a', 'an']
+    words = answer.split()
+    if words and words[0] in articles:
+        words = words[1:]
+        answer = ' '.join(words)
+
+    # 移除货币符号和百分号
+    answer = answer.replace('$', '').replace('%', '').replace('€', '').replace('£', '')
+
+    # 移除数字中的逗号分隔符
+    answer = re.sub(r'(\d),(\d)', r'\1\2', answer)
+
+    # 移除多余空格
+    answer = ' '.join(answer.split())
+
+    # 移除末尾的标点符号
+    answer = answer.rstrip('.,;:!?')
+
+    return answer
+```
+
+GAIA要求模型输出格式为`FINAL ANSWER: [答案]`:
+
+```python
+def _extract_answer(self, response: str) -> str:
+    """从响应中提取答案(GAIA格式)
+
+    GAIA要求答案格式为:FINAL ANSWER: [答案]
+    """
+    # 首先尝试提取GAIA官方格式的答案
+    final_answer_pattern = r'FINAL ANSWER:\s*(.+?)(?:\n|$)'
+    match = re.search(final_answer_pattern, response, re.IGNORECASE | re.MULTILINE)
+    if match:
+        answer = match.group(1).strip()
+        # 移除可能的方括号
+        answer = answer.strip('[]')
+        return answer
+
+    # 备用方案:查找其他答案标记
+    answer_patterns = [
+        r'答案[::]\s*(.+)',
+        r'最终答案[::]\s*(.+)',
+        r'Final answer[::]\s*(.+)',
+        r'Answer[::]\s*(.+)',
+    ]
+
+    for pattern in answer_patterns:
+        match = re.search(pattern, response, re.IGNORECASE)
+        if match:
+            return match.group(1).strip()
+
+    # 如果没有找到标记,返回最后一个非空行
+    lines = response.strip().split('\n')
+    for line in reversed(lines):
+        line = line.strip()
+        if line and not line.startswith('#'):
+            return line
+
+    return response.strip()
+```
+
+评估完成后,可以导出为GAIA官方要求的JSONL格式:
+
+```python
+def export_to_gaia_format(
+    self,
+    results: Dict[str, Any],
+    output_path: Union[str, Path],
+    include_reasoning: bool = True
+) -> None:
+    """导出为GAIA官方格式(JSONL)
+
+    GAIA要求的格式:
+    {"task_id": "xxx", "model_answer": "答案", "reasoning_trace": "推理过程"}
+    """
+    output_path = Path(output_path)
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+
+    with open(output_path, 'w', encoding='utf-8') as f:
+        for result in results.get("detailed_results", []):
+            entry = {
+                "task_id": result["task_id"],
+                "model_answer": result["predicted"]
+            }
+
+            if include_reasoning:
+                entry["reasoning_trace"] = result.get("response", result["predicted"])
+
+            f.write(json.dumps(entry, ensure_ascii=False) + '\n')
+```
+
+**(3)GAIAEvaluationTool:一键评估工具**
+
+GAIAEvaluationTool封装了完整的评估流程,提供一键评估功能:
+
+````python
+class GAIAEvaluationTool(Tool):
+    """GAIA评估工具
+
+    提供一键评估功能:
+    1. 运行HelloAgents评估
+    2. 导出GAIA格式结果
+    3. 生成评估报告
+    4. 生成提交说明
+    """
+
+    def run(
+        self,
+        agent: Any,
+        level: Optional[int] = None,
+        max_samples: Optional[int] = None,
+        local_data_dir: Optional[str] = None,
+        export_results: bool = True,
+        generate_report: bool = True
+    ) -> Dict[str, Any]:
+        """执行GAIA一键评估"""
+        # 步骤1: 运行HelloAgents评估
+        results = self._run_evaluation(agent, level, max_samples, local_data_dir)
+
+        # 步骤2: 导出GAIA格式结果
+        if export_results:
+            self._export_results(results)
+
+        # 步骤3: 生成评估报告
+        if generate_report:
+            self.generate_report(results)
+
+        return results
+````
+GAIAEvaluationTool会自动生成评估报告:
+
+```python
+def generate_report(
+    self,
+    results: Dict[str, Any],
+    output_file: Optional[Union[str, Path]] = None
+) -> str:
+    """生成评估报告"""
+    report = f"""# GAIA评估报告
+
+**生成时间**: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
+
+## 📊 评估概览
+
+- **智能体**: {results.get("agent_name", "Unknown")}
+- **难度级别**: {results.get("level_filter") or '全部'}
+- **总样本数**: {results.get("total_samples", 0)}
+- **精确匹配数**: {results.get("exact_matches", 0)}
+- **精确匹配率**: {results.get("exact_match_rate", 0):.2%}
+
+## 📈 详细指标
+
+### 分级准确率
+
+{self._format_level_metrics(results.get("level_metrics", {}))}
+
+## 📝 样本详情(前10个)
+
+{self._format_sample_details(results.get("detailed_results", [])[:10])}
+
+## 📊 准确率可视化
+
+{self._format_visualization(results.get("exact_match_rate", 0))}
+
+## 💡 建议
+
+{self._format_suggestions(results.get("exact_match_rate", 0))}
+"""
+
+    # 保存报告
+    if output_file is None:
+        output_dir = Path("./evaluation_reports")
+        output_dir.mkdir(parents=True, exist_ok=True)
+        output_file = output_dir / f"gaia_report_{datetime.now().strftime('%Y%m%d_%H%M%S')}.md"
+
+    with open(output_file, 'w', encoding='utf-8') as f:
+        f.write(report)
+
+    return report
+```
+
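+## 12.4 Evaluating the Quality of Generated Data
+
+In AI system development, high-quality training data underpins system performance. This section shows how to evaluate the quality of generated data with the HelloAgents framework, using the generation of AIME-style (American Invitational Mathematics Examination)<sup>[9]</sup> math problems as the example.
+
+AIME is a mathematics competition run by the Mathematical Association of America (MAA). In difficulty it sits between AMC 10/12 and the USA Mathematical Olympiad (USAMO). AIME problems have distinctive traits: every answer is an integer from 0 to 999; the problems span algebra, geometry, number theory, combinatorics, probability and other areas; and they require multi-step reasoning without advanced theory. We target a moderate difficulty, roughly that of AIME problems 6-9. These traits make AIME problems a good benchmark for evaluating generated math problems: the uniform answer format suits automated evaluation, and the moderate difficulty suits large-scale generation. As a reference we use the `TianHongZXY/aime-1983-2025` dataset on HuggingFace, which contains more than 900 real AIME problems from 1983 to 2025 and provides ample reference samples for generation and evaluation.
+
+### 12.4.1 Overview of the Evaluation Methods
+
+We evaluate generated data with three complementary methods: LLM Judge, Win Rate, and human scoring. There are two reasons for this choice. Methodologically, these are commonly used evaluation approaches in the agent field and in many academic papers, so they are widely recognized and well tested in practice. In terms of fit, they match our scenario: LLM Judge and Win Rate assess the quality of the generated problems (across dimensions such as correctness, clarity, and difficulty match), while human scoring assesses the quality of the generated answers (human experts verify that the answers are correct).
+
+The three methods are described in detail below. The overall workflow of this case study is shown in Figure 12.5: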
+
+<div align="center">
+  <img src="https://raw.githubusercontent.com/datawhalechina/Hello-Agents/main/docs/images/12-figures/12-5.png" alt="" width="85%"/>
+  <p>图 12.5 数据生成质量评估流程图</p>
+</div>
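+
+**(1) LLM Judge evaluation**
+
+**Motivation**: Evaluating generated data requires fast, consistent quality assessment of a large number of problems. Manual review is accurate but expensive and slow, so it does not scale to large generation runs. LLM Judge uses a large language model as the judge and scores the generated data along several dimensions automatically, which speeds up evaluation and keeps the scoring criteria consistent. It can also return a detailed rationale and improvement suggestions for each score, which helps explain the strengths and weaknesses of the generated data and guides later optimization.
+
+In our implementation, LLM Judge scores AIME problems along four key dimensions: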
+
+<div align="center">
+  <p>表 12.5 LLM Judge评估AIME题目的维度</p>
+  <img src="https://raw.githubusercontent.com/datawhalechina/Hello-Agents/main/docs/images/12-figures/12-table-4.png" alt="" width="85%"/>
+</div>
+
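+With scores on four dimensions in hand, we aggregate them into overall metrics. We define three metrics to measure the quality of the generated problems:
+
+**Evaluation metrics**:
+
+**1. Average Score**: the mean score of all problems across the four dimensions, reflecting the overall quality of the generated problems.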
+$$
+\text{Average Score} = \frac{1}{N} \sum_{i=1}^{N} \frac{\sum_{d=1}^{4} S_{i,d}}{4}
+$$
+
+**2. Pass Rate**: the share of problems whose average score is at least 3.5, indicating how many problems meet the baseline quality bar.
+
+$$
+\text{Pass Rate} = \frac{|\{i : \text{Score}_i \geq 3.5\}|}{N}
+$$
+
+**3. Excellent Rate**: the share of problems whose average score is at least 4.5, indicating the proportion of high-quality problems.
+
+$$
+\text{Excellent Rate} = \frac{|\{i : \text{Score}_i \geq 4.5\}|}{N}
+$$
+
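+where:
+- $N$ is the total number of problems evaluated
+- $S_{i,d}$ is the score of problem $i$ on dimension $d$ (1 to 5)
+- $\text{Score}_i$ is the average score of problem $i$ (the mean of its four dimension scores)
+
+The three metrics describe generation quality from different angles: the average score gives the overall level, the pass rate checks baseline quality, and the excellent rate measures the ability to produce high-quality output.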
+
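
The three LLM Judge metrics reduce to a few lines of arithmetic. The sketch below assumes each problem's scores arrive as a dict keyed by dimension. The key names (`correctness`, `clarity`, `difficulty_match`, `completeness`) and the function name `judge_summary` are assumptions for illustration; the actual LLMJudgeTool output format may differ.

```python
from typing import Dict, List

DIMENSIONS = ("correctness", "clarity", "difficulty_match", "completeness")


def judge_summary(scores: List[Dict[str, float]],
                  pass_threshold: float = 3.5,
                  excellent_threshold: float = 4.5) -> Dict[str, float]:
    """Aggregate per-problem dimension scores (1-5) into the three metrics."""
    if not scores:
        return {"average_score": 0.0, "pass_rate": 0.0, "excellent_rate": 0.0}

    # Score_i: mean of the four dimension scores for problem i
    per_problem = [sum(s[d] for d in DIMENSIONS) / len(DIMENSIONS) for s in scores]
    n = len(per_problem)
    return {
        "average_score": sum(per_problem) / n,
        "pass_rate": sum(x >= pass_threshold for x in per_problem) / n,
        "excellent_rate": sum(x >= excellent_threshold for x in per_problem) / n,
    }
```

For three problems with per-problem averages of 4.0, 4.75 and 3.25, this gives an average score of 4.0, a pass rate of 2/3 and an excellent rate of 1/3.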
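+**(2) Win Rate evaluation**
+
+**Motivation**: LLM Judge provides absolute scores on several dimensions, but we also need a relative measure of the quality gap between generated problems and real ones. Win Rate evaluation uses pairwise comparison: the LLM directly judges whether a generated problem or a real problem is better. Relative comparison is closer to how people judge quality than absolute scoring, and it makes the relative strengths and weaknesses of the generated problems easier to see. Ideally, if the generated problems are close to real ones in quality, the win rate should be around 50% (generated and real problems each win about half the time). The metric is simple and gives a quick read on the overall quality of the generation system.
+
+In our implementation, Win Rate evaluation follows the flow shown in Figure 12.6: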
+
+<div align="center">
+  <img src="https://raw.githubusercontent.com/datawhalechina/Hello-Agents/main/docs/images/12-figures/12-6.png" alt="" width="85%"/>
+  <p>Figure 12.6 Win Rate evaluation flow</p>
+</div>
+
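+Each pairwise comparison has three possible outcomes: the generated problem wins (Win), the real problem wins (Loss), or the two tie (Tie). We assess the generated problems by the proportions of these outcomes:
+
+**Evaluation metrics**:
+
+**1. Win Rate**: the share of comparisons in which the generated problem is judged better, reflecting where generated problems outperform real ones.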
+
+$$
+\text{Win Rate} = \frac{\text{Wins}}{\text{Total Comparisons}}
+$$
+
+**2. Loss Rate**: the share of comparisons in which the real problem is judged better, reflecting where generated problems fall short of real ones.
+
+$$
+\text{Loss Rate} = \frac{\text{Losses}}{\text{Total Comparisons}}
+$$
+
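+**3. Tie Rate**: the share of comparisons in which the two are judged to be of equal quality, reflecting how similar generated problems are to real ones.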
+
+$$
+\text{Tie Rate} = \frac{\text{Ties}}{\text{Total Comparisons}}
+$$
+
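+Here, Total Comparisons is the total number of comparisons, and Wins, Losses and Ties are the numbers of comparisons the generated problem won, lost and tied. The three metrics satisfy Win Rate + Loss Rate + Tie Rate = 100%.
+
+**Ideal result**: Win Rate ≈ 50%, meaning the generated problems are close to real ones in quality. A win rate well below 50% means the generated problems are worse than real ones and the generation strategy needs work. A win rate well above 50% may mean the generated problems surpass real ones in some respects, or that the judging criteria are biased.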
+
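
Given the per-comparison outcomes, the three rates follow directly from counting. The sketch below assumes outcomes are recorded as the strings `"win"`, `"loss"` and `"tie"` from the generated problem's point of view. Both the labels and the function name `win_rate_summary` are assumptions for illustration, not the WinRateTool interface.

```python
from collections import Counter
from typing import Dict, Iterable


def win_rate_summary(outcomes: Iterable[str]) -> Dict[str, float]:
    """Turn per-comparison outcomes ("win", "loss", "tie") into rates."""
    counts = Counter(outcomes)
    unknown = set(counts) - {"win", "loss", "tie"}
    if unknown:
        raise ValueError(f"unexpected outcome labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        return {"win_rate": 0.0, "loss_rate": 0.0, "tie_rate": 0.0}
    return {
        "win_rate": counts["win"] / total,
        "loss_rate": counts["loss"] / total,
        "tie_rate": counts["tie"] / total,
    }
```

Rejecting unknown labels keeps the three rates summing to 1, which is the invariant stated above. With only a handful of comparisons the rates are noisy, so a win rate near 50% is more meaningful on a larger sample.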
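+**(3) Human Verification**
+
+**Design motivation**: LLM Judge and Win Rate automate quality evaluation, but for math problems, which demand strict logical reasoning, human verification remains indispensable. In particular, when assessing generated answers, a human expert needs to check the correctness of the answer, the completeness of the solution steps, and the rigor of the mathematical reasoning. Human verification can also catch what automated evaluation may miss, such as subjective factors like originality and how engaging a problem is. To make human verification more efficient and pleasant, we built a Gradio-based web interface in which verifiers can browse problems, score them, mark a status, and add comments, which substantially lowers the barrier to human verification.
+
+In our implementation, human verification proceeds as follows:
+
+1. Read the problem, answer, and solution
+2. Score (1 to 5): correctness, clarity, difficulty match, completeness
+3. Mark a status:
+   - ✅ approved
+   - ❌ rejected
+   - 🔄 needs_revision
+4. Add comments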
+
+### 12.4.2 System Architecture
+
+The data generation and evaluation system uses a modular design:
+
+```
+data_generation/
+├── aime_generator.py              # AIME problem generator
+├── human_verification_ui.py       # Human verification UI
+├── run_complete_evaluation.py     # Complete evaluation pipeline
+│
+├── generated_data/                # Generated data
+│   ├── aime_generated_XXXXXX.json
+│   └── generation_report_XXXXXX.md
+│
+└── evaluation_results/            # Evaluation results
+    └── XXXXXX/
+        ├── llm_judge/
+        ├── win_rate/
+        └── comprehensive_report.md
+```
+
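+The system has four core components. AIMEGenerator (the problem generator) uses the HelloAgents framework to generate AIME-style problems, supports batch generation and progress saving, and handles API rate limits automatically. LLMJudgeTool (the LLM Judge evaluation tool) provides four-dimension quality evaluation and automatically produces JSON results and a Markdown report. WinRateTool (the Win Rate evaluation tool) computes win, loss, and tie rates through pairwise comparison. HumanVerificationUI (the human verification interface) is a Gradio-based web interface that supports scoring and status marking.
+
+### 12.4.3 AIME Problem Generator Implementation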
+
+```python
+class AIMEGenerator:
+    """AIME Problem Generator"""
+
+    def __init__(
+        self,
+        llm: HelloAgentsLLM = None,
+        delay_seconds: float = 1.0,
+        use_reference_examples: bool = True,
+        reference_dataset: str = "TianHongZXY/aime-1983-2025"
+    ):
+        self.llm = llm or HelloAgentsLLM()
+        self.agent = SimpleAgent(
+            name="AIME Generator",
+            llm=self.llm,
+            system_prompt="You are a professional mathematics competition problem designer."
+        )
+        self.delay_seconds = delay_seconds
+        self.use_reference_examples = use_reference_examples
+
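+        # Load reference examples from 900+ AIME problems (1983-2025)
+        self.reference_examples = []  # stays empty when reference examples are disabled
+        if use_reference_examples:
+            dataset = load_dataset(reference_dataset, split="test")
+            self.reference_examples = list(dataset)
+```
+Our goal is to generate a dataset in a similar style, so each generation randomly picks a reference example from the 900+ real AIME problems (1983-2025).
+
+Generation prompt design (in English):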
+
+```python
+GENERATION_PROMPT = """You are a professional mathematics competition problem designer, skilled in creating AIME (American Invitational Mathematics Examination) style problems.
+
+【Reference Example】(For style reference only, please generate a completely different problem)
+Problem: {example_problem}
+Answer: {example_answer}
+
+AIME Problem Characteristics:
+1. Answer: An integer between 0 and 999
+2. Topics: Algebra, Geometry, Number Theory, Combinatorics, Probability, etc.
+3. Style: Requires multi-step reasoning, but no advanced theory
+4. Difficulty: Medium to hard (similar to AIME problems 6-9)
+
+Please generate a **completely different** AIME-style mathematics problem, including:
+1. Problem statement (clear and complete, different from the reference)
+2. Answer (an integer between 0 and 999, different from the reference)
+3. Detailed solution (including all reasoning steps)
+4. Topic classification (Algebra/Geometry/Number Theory/Combinatorics/Probability)
+
+Please output in the following JSON format:
+{
+    "problem": "Problem statement in English",
+    "answer": 123,
+    "solution": "Detailed solution steps in English",
+    "topic": "Algebra"
+}
+"""
+```
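+We generate the problems in English for four reasons: consistency with real AIME problems (AIME is an English-language competition, so English problems are the natural choice), fairness of evaluation (comparing English with English is fairer for the LLM Judge), internationalization (English problems can be used more widely), and avoiding translation issues (no need to worry about the accuracy of Chinese-English translation).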
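How the `{example_problem}` and `{example_answer}` placeholders get filled is not shown in this section. One detail worth keeping in mind: the template also contains the literal braces of the JSON skeleton, which `str.format` would try to interpret as replacement fields. A minimal sketch that sidesteps this with plain string replacement follows; the shortened `TEMPLATE` and the helper `fill_prompt` are hypothetical stand-ins, not the generator's actual code.

```python
# Shortened stand-in for GENERATION_PROMPT: two placeholders plus a JSON skeleton
TEMPLATE = """Problem: {example_problem}
Answer: {example_answer}

Please output in the following JSON format:
{
    "problem": "Problem statement in English",
    "answer": 123
}
"""

def fill_prompt(template: str, example: dict) -> str:
    """Substitute only the two known placeholders, leaving the JSON braces untouched."""
    return (
        template
        .replace("{example_problem}", str(example["problem"]))
        .replace("{example_answer}", str(example["answer"]))
    )

example = {"problem": "Find the remainder when 2^10 is divided by 7.", "answer": 2}
prompt = fill_prompt(TEMPLATE, example)
print(prompt)

# str.format, by contrast, trips over the literal braces of the JSON skeleton
try:
    TEMPLATE.format(example_problem="x", example_answer=1)
    format_failed = False
except (KeyError, ValueError, IndexError):
    format_failed = True
print("str.format failed:", format_failed)
```

An alternative with the same effect would be to double the literal braces (`{{` and `}}`) in the template and keep `str.format`.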
+
+Batch generation implementation:
+
+```python
+def generate_and_save(self, num_problems: int = 30, output_dir: str = "data_generation/generated_data"):
+    """Generate and save problems with intelligent delay"""
+    # Make sure the output directory exists, then clean old checkpoints
+    os.makedirs(output_dir, exist_ok=True)
+    for file in os.listdir(output_dir):
+        if file.startswith("checkpoint_") and file.endswith(".json"):
+            os.remove(os.path.join(output_dir, file))
+
+    problems = []
+
+    # Generate with tqdm progress bar
+    with tqdm(total=num_problems, desc="Generating AIME problems", unit="problem") as pbar:
+        last_call_time = 0
+
+        for i in range(num_problems):
+            # Ensure minimum delay between API calls
+            if last_call_time > 0:
+                elapsed = time.time() - last_call_time
+                if elapsed < self.delay_seconds:
+                    wait_time = self.delay_seconds - elapsed
+                    time.sleep(wait_time)
+
+            # Generate problem (randomly select reference example)
+            start_time = time.time()
+            problem = self.generate_single()
+            problems.append(problem)
+            last_call_time = time.time()
+            generation_time = last_call_time - start_time
+
+            # Update progress bar
+            pbar.set_postfix({
+                "topic": problem.get('topic', 'N/A'),
+                "answer": problem.get('answer', 'N/A'),
+                "time": f"{generation_time:.1f}s"
+            })
+            pbar.update(1)
+
+    # Save step (elided in the original excerpt; a plain JSON list is assumed here).
+    # The file name follows the aime_generated_<timestamp>.json pattern.
+    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+    generated_data_path = os.path.join(output_dir, f"aime_generated_{timestamp}.json")
+    with open(generated_data_path, "w", encoding="utf-8") as f:
+        json.dump(problems, f, ensure_ascii=False, indent=2)
+
+    return generated_data_path
+```
+LaTeX math formula support:
+
+Generated AIME problems contain LaTeX math (such as `$\frac{a}{b}$` and `$\sqrt{x}$`), so JSON parsing needs special handling:
+
+```python
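+def _parse_response(self, response: str) -> Dict[str, Any]:
+    """Parse the LLM response (with support for LaTeX math formulas)"""
+    import re
+
+    # Extract the JSON part
+    if "```json" in response:
+        json_str = response.split("```json")[1].split("```")[0].strip()
+    else:
+        json_str = response.strip()
+
+    try:
+        problem_data = json.loads(json_str)
+    except json.JSONDecodeError:
+        # Repair LaTeX escapes: turn \sqrt into \\sqrt
+        # Regex: match a lone backslash that does not start a valid JSON escape.
+        # Commands such as \frac or \theta begin with a valid escape (\f, \t) and are left unchanged.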
+        fixed_json_str = re.sub(r'(?<!\\)\\(?!["\\/bfnrtu])', r'\\\\', json_str)
+        problem_data = json.loads(fixed_json_str)
+
+    return problem_data
+```
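+Backslashes in LaTeX commands such as `\sqrt` are not valid JSON escape sequences and make parsing fail:
+```
+Invalid \escape: line 4 column 185 (char 375)
+```
+
+The regular expression replaces each backslash that does not start a valid JSON escape with a double backslash, which makes the string legal JSON. Note that commands whose first letter forms a valid escape, such as `\frac` (`\f`) or `\theta` (`\t`), do not raise this error and are not touched by the regex; they are silently decoded as control characters, so such cases need separate handling.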
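The behavior of the fallback can be checked on two small inputs. The sketch below reuses the same regular expression in a standalone helper (`parse_with_latex_fix` is a hypothetical name, and the JSON strings are made up for illustration): `\sqrt` triggers the repair path, while `\frac` parses without error but loses its backslash.

```python
import json
import re

# Same pattern as in _parse_response: a backslash that is not preceded by another
# backslash and does not start a valid JSON escape sequence
LONE_BACKSLASH = re.compile(r'(?<!\\)\\(?!["\\/bfnrtu])')

def parse_with_latex_fix(json_str: str) -> dict:
    """Parse JSON, doubling lone backslashes only if the first attempt fails."""
    try:
        return json.loads(json_str)
    except json.JSONDecodeError:
        return json.loads(LONE_BACKSLASH.sub(r'\\\\', json_str))

# \sqrt: "\s" is an invalid escape, so the fallback runs and repairs it
sqrt_raw = r'{"problem": "Find $\sqrt{x}$", "answer": 5}'
sqrt_parsed = parse_with_latex_fix(sqrt_raw)
print(repr(sqrt_parsed["problem"]))

# \frac: "\f" is a valid escape (form feed), so no error is raised and the
# command is silently turned into a control character followed by "rac"
frac_raw = r'{"problem": "Compute $\frac{a}{b}$", "answer": 5}'
frac_parsed = parse_with_latex_fix(frac_raw)
print(repr(frac_parsed["problem"]))
```

One way to catch the second case would be to scan parsed strings for control characters and regenerate the affected problem.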
+
+### 12.4.4 LLM Judge Evaluation Tool
+
+The LLM Judge tool uses an LLM as the judge to evaluate generated problems along multiple dimensions.
+
+```python
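+class LLMJudgeTool(Tool):
+    """LLM Judge evaluation tool"""
+
+    def run(self, params: Dict[str, Any]) -> str:
+        """Run the LLM Judge evaluation"""
+        # 1. Load the generated data
+        gen_dataset = AIDataset(dataset_type="generated", data_path=params["generated_data_path"])
+        gen_problems = gen_dataset.load()
+
+        # 2. Load the reference data (AIME 2025)
+        ref_dataset = AIDataset(dataset_type="real", year=2025)
+        ref_problems = ref_dataset.load()
+
+        # 3. Create the evaluator
+        evaluator = LLMJudgeEvaluator(llm=self.llm, judge_model=params.get("judge_model", "gpt-4o"))
+
+        # 4. Run the evaluation
+        results = evaluator.evaluate_batch(gen_problems, max_samples=params.get("max_samples"))
+
+        # 5. Save the results (file names follow the llm_judge_results_<timestamp>.json pattern;
+        #    falling back to the current directory is an assumption of this excerpt)
+        output_dir = params.get("output_dir", ".")
+        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+        result_file = os.path.join(output_dir, f"llm_judge_results_{timestamp}.json")
+        report_file = os.path.join(output_dir, f"llm_judge_report_{timestamp}.md")
+        evaluator.export_results(results, result_file)
+
+        # 6. Generate the report
+        self._generate_report(results, report_file)
+
+        return json.dumps({"status": "success", "metrics": results["metrics"]})
+```
+**Evaluation prompt**: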
+
+```python
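+EVALUATION_PROMPT = """Please evaluate the quality of the following AIME math problem.
+
+Problem:
+{problem}
+
+Answer: {answer}
+
+Solution:
+{solution}
+
+Please score it on the following 4 dimensions (1-5):
+
+1. **Correctness**: whether the mathematical logic is correct and the answer is accurate
+2. **Clarity**: whether the problem statement is clear and the solution is easy to follow
+3. **Difficulty Match**: whether the difficulty meets the AIME standard (medium to hard)
+4. **Completeness**: whether the solution steps are complete and include the necessary reasoning
+
+Please output in the following JSON format:
+{
+    "correctness": 5,
+    "clarity": 4,
+    "difficulty_match": 4,
+    "completeness": 5,
+    "comments": "Reasons for the scores"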
+}
+"""
+```
+
+**Sample evaluation report**:
+
+```markdown
+# LLM Judge Evaluation Report
+
+## Overall Scores
+
+- **Average total score**: 4.2/5.0
+- **Pass rate**: 85.0% (≥3.5)
+- **Excellent rate**: 40.0% (≥4.5)
+
+## Scores by Dimension
+
+| Dimension | Average | Rating |
+|------|--------|------|
+| Correctness | 4.3/5.0 | Good ⭐⭐⭐⭐ |
+| Clarity | 4.1/5.0 | Good ⭐⭐⭐⭐ |
+| Difficulty match | 4.0/5.0 | Good ⭐⭐⭐⭐ |
+| Completeness | 4.4/5.0 | Good ⭐⭐⭐⭐ |
+```
+
+### 12.4.5 Win Rate Evaluation Tool
+
+The Win Rate tool evaluates the quality of generated data relative to real problems through pairwise comparison.
+
+```python
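+class WinRateTool(Tool):
+    """Win Rate evaluation tool"""
+
+    def run(self, params: Dict[str, Any]) -> str:
+        """Run the Win Rate evaluation"""
+        # 1. Load the generated data
+        gen_dataset = AIDataset(dataset_type="generated", data_path=params["generated_data_path"])
+        gen_problems = gen_dataset.load()
+
+        # 2. Load the reference data (AIME 2025)
+        ref_dataset = AIDataset(dataset_type="real", year=2025)
+        ref_problems = ref_dataset.load()
+
+        # 3. Create the evaluator
+        evaluator = WinRateEvaluator(llm=self.llm, judge_model=params.get("judge_model", "gpt-4o"))
+
+        # 4. Run the evaluation
+        results = evaluator.evaluate_win_rate(gen_problems, ref_problems, num_comparisons=params.get("num_comparisons"))
+
+        # 5. Save the results and the report (file names follow the win_rate_results_<timestamp>.json
+        #    pattern; falling back to the current directory is an assumption of this excerpt)
+        output_dir = params.get("output_dir", ".")
+        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+        result_file = os.path.join(output_dir, f"win_rate_results_{timestamp}.json")
+        report_file = os.path.join(output_dir, f"win_rate_report_{timestamp}.md")
+        evaluator.export_results(results, result_file)
+        self._generate_report(results, report_file)
+
+        return json.dumps({"status": "success", "metrics": results["metrics"]})
+```
+AIDataset loads the generated data and the real AIME data, and supports two data types: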
+
+```python
+class AIDataset:
+    """AI dataset loader
+
+    Supports two data types:
+    1. generated: generated data (JSON format)
+    2. real: real AIME problems (loaded from HuggingFace)
+    """
+
+    def __init__(
+        self,
+        dataset_type: str = "generated",
+        data_path: Optional[str] = None,
+        year: Optional[int] = None
+    ):
+        self.dataset_type = dataset_type
+        self.data_path = data_path
+        self.year = year  # only used for the real type; the loader below always reads AIME 2025
+
+    def load(self) -> List[Dict[str, Any]]:
+        """Load the dataset"""
+        if self.dataset_type == "generated":
+            return self._load_generated_data()
+        elif self.dataset_type == "real":
+            return self._load_real_data()
+        raise ValueError(f"Unknown dataset_type: {self.dataset_type}")
+
+    def _load_real_data(self) -> List[Dict[str, Any]]:
+        """Load the real AIME 2025 problems from HuggingFace"""
+        from huggingface_hub import snapshot_download
+
+        # Use the AIME 2025 dataset
+        repo_id = "math-ai/aime25"
+
+        # Download the dataset
+        local_dir = snapshot_download(
+            repo_id=repo_id,
+            repo_type="dataset"
+        )
+
+        # Read the JSONL file
+        data_file = list(Path(local_dir).glob("*.jsonl"))[0]
+        data = []
+        with open(data_file, 'r', encoding='utf-8') as f:
+            for line in f:
+                if line.strip():
+                    data.append(json.loads(line))
+
+        # Normalize the data format (AIME 2025 uses lowercase field names)
+        problems = []
+        for idx, item in enumerate(data):
+            problem = {
+                "problem_id": item.get("id", f"aime_2025_{idx}"),
+                "problem": item.get("problem", ""),
+                "answer": item.get("answer", ""),
+                "solution": item.get("solution", ""),  # AIME 2025 has no solution field
+            }
+            problems.append(problem)
+
+        return problems
+```
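+We use only the AIME 2025 dataset for four reasons: timeliness (2025 is the most recent AIME competition data), simpler maintenance (only one dataset to maintain, so the code stays simple), uniform format (JSONL with lowercase field names), and sufficient coverage (30 problems are enough to assess generation quality).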
+
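+**Comparison prompt**:
+
+```python
+COMPARISON_PROMPT = """Please compare the quality of the following two AIME math problems and decide which one is better.
+
+【Problem A - Generated Problem】
+Problem: {problem_a}
+Answer: {answer_a}
+Solution: {solution_a}
+
+【Problem B - Real AIME Problem】
+Problem: {problem_b}
+Answer: {answer_b}
+Solution: {solution_b}
+
+Please compare them on the following aspects:
+1. Rigor of the mathematical logic
+2. Clarity of the problem statement
+3. Appropriateness of the difficulty
+4. Completeness of the solution
+
+Please output in the following JSON format:
+{
+    "winner": "A" or "B" or "Tie",
+    "reason": "Reason for the judgment"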
+}
+"""
+```
+
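+**Sample evaluation report**:
+
+```markdown
+# Win Rate Evaluation Report
+
+## Win Rate Statistics
+
+| Metric | Count | Percentage |
+|------|------|--------|
+| Generated data wins | 9 | 45.0% |
+| Real AIME problems win | 8 | 40.0% |
+| Ties | 3 | 15.0% |
+
+**Win Rate**: 45.0%
+
+✅ **Good**: the quality of the generated data is close to the reference data (gap <10%).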
+```
+
+### 12.4.6 Human Verification Interface
+
+We use Gradio to build a web interface for manually verifying generated problems.
+
+```python
+class HumanVerificationUI:
+    """Human verification interface"""
+
+    def launch(self, share: bool = False):
+        """Launch the Gradio interface"""
+        with gr.Blocks(title="AIME Problem Human Verification") as demo:
+            gr.Markdown("# 🎯 AIME Problem Human Verification System")
+
+            with gr.Row():
+                with gr.Column(scale=2):
+                    # Problem display area
+                    problem_text = gr.Textbox(label="Problem", lines=5, interactive=False)
+                    answer_text = gr.Textbox(label="Answer", interactive=False)
+                    solution_text = gr.Textbox(label="Solution", lines=10, interactive=False)
+
+                with gr.Column(scale=1):
+                    # Scoring area
+                    correctness_slider = gr.Slider(1, 5, value=3, step=1, label="Correctness")
+                    clarity_slider = gr.Slider(1, 5, value=3, step=1, label="Clarity")
+                    difficulty_slider = gr.Slider(1, 5, value=3, step=1, label="Difficulty match")
+                    completeness_slider = gr.Slider(1, 5, value=3, step=1, label="Completeness")
+
+                    # Status selection
+                    status_radio = gr.Radio(
+                        choices=["approved", "rejected", "needs_revision"],
+                        value="approved",
+                        label="Status"
+                    )
+
+                    # Verification button
+                    verify_btn = gr.Button("✅ Submit verification", variant="primary")
+
+        # Launch after the layout has been fully defined
+        demo.launch(share=share, server_name="127.0.0.1", server_port=7860)
+```
+**Usage**:
+
+```bash
+# Launch the human verification interface
+python data_generation/human_verification_ui.py data_generation/generated_data/aime_generated_XXXXXX.json
+
+# Then open this address in a browser
+http://127.0.0.1:7860
+```
+
+The final result looks like Figure 12.7. For problem correctness in particular, manual review and labeling is recommended:
+<div align="center">
+  <img src="https://raw.githubusercontent.com/datawhalechina/Hello-Agents/main/docs/images/12-figures/12-7.png" alt="" width="85%"/>
+  <p>Figure 12.7 Human verification page for AIME problems</p>
+</div>
+
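+**Verification workflow**:
+
+1. Open the verification interface in a browser
+2. Read the problem, answer, and solution
+3. Score on the 4 dimensions (1 to 5)
+4. Choose a verification status (approved/rejected/needs_revision)
+5. Add comments (optional)
+6. Click "Submit verification"
+7. Move on to the next problem
+
+**Saving verification results**:
+
+Verification results are saved automatically to `<data_path>_verifications.json`: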
+
+```json
+{
+  "gen_aime_1": {
+    "problem_id": "gen_aime_1",
+    "scores": {
+      "correctness": 5,
+      "clarity": 4,
+      "difficulty_match": 4,
+      "completeness": 5
+    },
+    "total_score": 4.5,
+    "status": "approved",
+    "comments": "High-quality problem with rigorous logic",
+    "verified_at": "2025-01-10T12:00:00"
+  }
+}
+```
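Once a batch has been verified, the saved file can be summarized in a few lines. The following is a sketch under the assumption that the file has exactly the structure shown above (a JSON object keyed by `problem_id`, each record carrying `status` and `total_score`); `summarize_verifications` and the demo records are hypothetical.

```python
import json
import os
import tempfile
from collections import Counter

def summarize_verifications(path: str) -> dict:
    """Summarize a *_verifications.json file keyed by problem_id."""
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    total = len(records)
    if total == 0:
        return {"total": 0, "status_counts": {}, "approval_rate": 0.0, "mean_total_score": None}
    statuses = Counter(r["status"] for r in records.values())
    return {
        "total": total,
        "status_counts": dict(statuses),
        "approval_rate": statuses["approved"] / total,
        "mean_total_score": sum(r["total_score"] for r in records.values()) / total,
    }

# Demo with two made-up records written to a temporary file
demo = {
    "gen_aime_1": {"problem_id": "gen_aime_1", "total_score": 4.5, "status": "approved"},
    "gen_aime_2": {"problem_id": "gen_aime_2", "total_score": 3.0, "status": "needs_revision"},
}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_verifications.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(demo, f)
    verification_summary = summarize_verifications(demo_path)
print(verification_summary)
```

The resulting approval rate is the figure that the human-verification quality target refers to.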
+
+### 12.4.7 Complete Evaluation Pipeline
+
+All evaluation methods are combined into a single end-to-end pipeline.
+
+```python
+def run_complete_evaluation(
+    num_problems: int = 30,
+    delay_seconds: float = 3.0
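+):
+    """
+    Run the complete evaluation pipeline
+
+    Args:
+        num_problems: number of problems to generate
+        delay_seconds: delay between generations (seconds), to avoid API rate limits
+    """
+    # Step 1: Generate AIME problems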
+    generator = AIMEGenerator(delay_seconds=delay_seconds)
+    generated_data_path = generator.generate_and_save(
+        num_problems=num_problems,
+        output_dir="data_generation/generated_data"
+    )
+
+    # Step 2: Evaluate
+    # Create the evaluation results directory
+    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+    evaluation_dir = f"data_generation/evaluation_results/{timestamp}"
+    os.makedirs(evaluation_dir, exist_ok=True)
+    os.makedirs(os.path.join(evaluation_dir, "llm_judge"), exist_ok=True)
+    os.makedirs(os.path.join(evaluation_dir, "win_rate"), exist_ok=True)
+
+    # Create the LLM
+    llm = HelloAgentsLLM()
+
+    # Step 2.1: LLM Judge evaluation
+    llm_judge_result = None
+    try:
+        llm_judge_tool = LLMJudgeTool(llm=llm)
+        llm_judge_result_json = llm_judge_tool.run({
+            "generated_data_path": generated_data_path,
+            "reference_year": 2025,
+            "max_samples": num_problems,
+            "output_dir": os.path.join(evaluation_dir, "llm_judge"),
+            "judge_model": "gpt-4o"
+        })
+        llm_judge_result = json.loads(llm_judge_result_json)
+    except Exception as e:
+        print(f"❌ LLM Judge evaluation failed: {e}")
+
+    # Step 2.2: Win Rate evaluation
+    win_rate_result = None
+    try:
+        win_rate_tool = WinRateTool(llm=llm)
+        win_rate_result_json = win_rate_tool.run({
+            "generated_data_path": generated_data_path,
+            "reference_year": 2025,
+            "num_comparisons": min(num_problems, 20),
+            "output_dir": os.path.join(evaluation_dir, "win_rate"),
+            "judge_model": "gpt-4o"
+        })
+        win_rate_result = json.loads(win_rate_result_json)
+    except Exception as e:
+        print(f"❌ Win Rate evaluation failed: {e}")
+
+    # Step 3: Generate the comprehensive report
+    comprehensive_report_path = None
+    if llm_judge_result or win_rate_result:
+        comprehensive_report_path = os.path.join(evaluation_dir, "comprehensive_report.md")
+        report = generate_comprehensive_report(
+            generated_data_path,
+            llm_judge_result,
+            win_rate_result
+        )
+        with open(comprehensive_report_path, 'w', encoding='utf-8') as f:
+            f.write(report)
+
+    return {
+        "generated_data_path": generated_data_path,
+        "llm_judge_result": llm_judge_result,
+        "win_rate_result": win_rate_result,
+        "comprehensive_report_path": comprehensive_report_path
+    }
+```
+**How to run**:
+
+```bash
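+# Basic usage (default delay of 3 seconds)
+python data_generation/run_complete_evaluation.py 30
+
+# Custom delay (3-5 seconds recommended, to avoid API rate limits)
+python data_generation/run_complete_evaluation.py 30 3.0
+
+# Arguments:
+# - 30: number of problems to generate
+# - 3.0: delay between generations (seconds)
+
+# Notes:
+# - Generation stage: reference examples are sampled at random from 900+ real AIME problems (1983-2025)
+# - Evaluation stage: quality is compared against the real AIME 2025 problems
+# - Evaluation dataset source: math-ai/aime25 (JSONL format)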
+```
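The script's own argument handling is not shown in this section. As a rough sketch of how the two positional arguments could map onto the defaults of `run_complete_evaluation` (30 problems, 3.0 seconds), consider the hypothetical helper `parse_cli_args` below; it is an assumption for illustration, not the script's actual code.

```python
def parse_cli_args(argv):
    """Parse `[num_problems] [delay_seconds]` from an argv-style list.

    argv[0] is the script name; missing arguments fall back to 30 and 3.0.
    """
    num_problems = int(argv[1]) if len(argv) > 1 else 30
    delay_seconds = float(argv[2]) if len(argv) > 2 else 3.0
    if num_problems <= 0:
        raise ValueError("num_problems must be positive")
    if delay_seconds < 0:
        raise ValueError("delay_seconds must be non-negative")
    return num_problems, delay_seconds

# Equivalent to: python data_generation/run_complete_evaluation.py 30 3.0
print(parse_cli_args(["run_complete_evaluation.py", "30", "3.0"]))
# Equivalent to running the script with no arguments
print(parse_cli_args(["run_complete_evaluation.py"]))
```

In a real entry point the list passed in would be `sys.argv`.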
+
+**Sample output**:
+
+```
+================================================================================
+🚀 AIME Data Generation and Evaluation: Complete Pipeline
+================================================================================
+
+Configuration:
+  - Number of problems to generate: 30
+  - API delay: 3.0 s/problem
+  - Generation reference data: TianHongZXY/aime-1983-2025 (900+ problems)
+  - Evaluation reference: real AIME 2025 problems
+
+================================================================================
+📝 Step 1: Generate AIME problems
+================================================================================
+📚 Loading the real AIME dataset: TianHongZXY/aime-1983-2025
+   ✓ Loaded 963 reference problems
+
+🎯 Starting AIME problem generation
+   Target count: 30
+   Generation model: gpt-4o
+   Delay: 3.0 s/problem
+
+Generating AIME problems: 100%|██████████| 30/30 [01:30<00:00, 3.00s/problem, topic=Algebra, answer=123, time=3.0s]
+
+✅ Step 1 complete! Generated data saved to: data_generation/generated_data/aime_generated_20250110_120000.json
+
+🎯 Step 2.1: LLM Judge evaluation (vs AIME 2025)
+
+✅ LLM Judge evaluation complete!
+   Average total score: 4.2/5.0
+   Pass rate: 85.0%
+
+🏆 Step 2.2: Win Rate evaluation (vs AIME 2025)
+
+✅ Win Rate evaluation complete!
+   Win Rate: 45.0%
+
+================================================================================
+📊 Step 3: Generate the comprehensive report
+================================================================================
+
+✅ Comprehensive report saved: data_generation/evaluation_results/20250110_120000/comprehensive_report.md
+
+================================================================================
+🎉 Complete evaluation pipeline finished!
+================================================================================
+
+📁 Output files:
+   - Generated data: data_generation/generated_data/aime_generated_20250110_120000.json
+   - Evaluation results directory: data_generation/evaluation_results/20250110_120000
+   - LLM Judge report: data_generation/evaluation_results/20250110_120000/llm_judge/llm_judge_report_20250110_120000.md
+   - Win Rate report: data_generation/evaluation_results/20250110_120000/win_rate/win_rate_report_20250110_120000.md
+   - Comprehensive report: data_generation/evaluation_results/20250110_120000/comprehensive_report.md
+
+💡 Next steps:
+   1. Read the comprehensive report: data_generation/evaluation_results/20250110_120000/comprehensive_report.md
+   2. Run human verification: python data_generation/human_verification_ui.py data_generation/generated_data/aime_generated_20250110_120000.json
+```
+
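+### 12.4.8 Comprehensive Evaluation Report
+
+The system automatically generates a comprehensive evaluation report that summarizes all evaluation results. A sample report:
+
+```markdown
+# AIME Data Generation and Evaluation: Comprehensive Report
+
+## 1. Basic Information
+
+- **Generated at**: 2025-01-10 12:00:00
+- **Number of generated problems**: 30
+- **Reference AIME year**: 2025
+
+## 2. Data Generation Statistics
+
+### Topic Distribution
+
+| Topic | Count | Share |
+|------|------|------|
+| Algebra | 10 | 33.3% |
+| Geometry | 8 | 26.7% |
+| Number Theory | 7 | 23.3% |
+| Combinatorics | 3 | 10.0% |
+| Probability | 2 | 6.7% |
+
+## 3. LLM Judge Evaluation Results
+
+### Overall Scores
+
+- **Average total score**: 4.2/5.0
+- **Pass rate**: 85.0% (≥3.5)
+- **Excellent rate**: 40.0% (≥4.5)
+
+### Scores by Dimension
+
+| Dimension | Average | Rating |
+|------|--------|------|
+| Correctness | 4.3/5.0 | Good ⭐⭐⭐⭐ |
+| Clarity | 4.1/5.0 | Good ⭐⭐⭐⭐ |
+| Difficulty match | 4.0/5.0 | Good ⭐⭐⭐⭐ |
+| Completeness | 4.4/5.0 | Good ⭐⭐⭐⭐ |
+
+## 4. Win Rate Evaluation Results
+
+### Win Rate Statistics
+
+| Metric | Count | Percentage |
+|------|------|--------|
+| Generated data wins | 9 | 45.0% |
+| Real AIME problems win | 8 | 40.0% |
+| Ties | 3 | 15.0% |
+
+**Win Rate**: 45.0%
+
+✅ **Good**: the quality of the generated data is close to the reference data (gap <10%).
+
+## 5. Overall Conclusion
+
+Based on the results of the LLM Judge and Win Rate evaluations:
+
+1. **LLM Judge evaluation**: the average quality of the generated data is **4.2/5.0**
+2. **Win Rate evaluation**: the win rate of the generated data against real AIME 2025 problems is **45.0%**
+
+✅ **Conclusion**: the generated data is of **excellent** quality, matching or exceeding the level of real AIME problems. It can be used in practice.
+
+## 6. Suggestions for Improvement
+
+- ✅ Keep the current generation strategy
+- ✅ Consider increasing the number of generated problems
+- ✅ Human verification is recommended to confirm quality
+
+## 7. Next Steps
+
+1. **Human verification**: run `python data_generation/human_verification_ui.py <data_path>`
+2. **Review detailed results**:
+   - Detailed LLM Judge report
+   - Detailed Win Rate report
+3. **Data usage**: if the quality is satisfactory, the generated data can be used for training or testing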
+```
+
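+Based on practical experience, we summarize the following recommendations:
+
+For data generation: use a suitable delay (2-3 seconds) to avoid API rate limits, enable checkpoint saving so an interruption does not lose progress, test with a small batch (10 problems) before generating at scale, and check generation quality regularly so the prompt can be adjusted in time. For evaluation strategy: combine LLM Judge and Win Rate, using LLM Judge for absolute quality and Win Rate for relative comparison, with human verification as the final quality gate. For quality standards: aim for an LLM Judge average of at least 4.0/5.0, a Win Rate of at least 45% (close to 50%), a pass rate of at least 80%, and a human verification approval rate of at least 90%. For iterative improvement: adjust the generation prompt based on evaluation results, analyze what low-scoring problems have in common, learn from the strengths of high-scoring problems, and keep refining the generation strategy.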
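The quality targets above can be turned into a simple automated check. The sketch below is only an illustration: the metric names in `QUALITY_THRESHOLDS`, the helper `check_quality`, and the measured values are all hypothetical and would need to be mapped onto the fields of the actual evaluation results.

```python
# Minimum values taken from the recommendations above (rates as fractions)
QUALITY_THRESHOLDS = {
    "average_score": 4.0,         # LLM Judge average, out of 5.0
    "win_rate": 0.45,             # Win Rate against real problems
    "pass_rate": 0.80,            # share of problems scoring >= 3.5
    "human_approval_rate": 0.90,  # share of problems approved by human verifiers
}

def check_quality(metrics: dict, thresholds: dict = QUALITY_THRESHOLDS) -> dict:
    """Return, per metric, whether the measured value meets its minimum.

    Metrics that are missing from `metrics` are reported as failing.
    """
    return {
        name: metrics.get(name) is not None and metrics[name] >= minimum
        for name, minimum in thresholds.items()
    }

# Hypothetical run in which human verification has not been done yet
measured = {"average_score": 4.2, "win_rate": 0.45, "pass_rate": 0.85}
quality_report = check_quality(measured)
failed = [name for name, ok in quality_report.items() if not ok]
print(quality_report)
print("Failed checks:", failed)
```

Treating a missing metric as a failure keeps a skipped human verification step from passing unnoticed.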
+
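+In this section we learned how to evaluate data generation quality with the HelloAgents framework using three methods: LLM Judge evaluation, Win Rate evaluation, and human verification. Together they form an evaluation pipeline that helps keep generated data at a high quality and provides more reliable data for training and testing AI systems.
+
+HelloAgents also integrates LLM Judge and Win Rate evaluation as tools and provides complete example code. If you are interested in the implementation details of these two methods, refer to the example code.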
+
+
+
+
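+## 12.5 Chapter Summary
+
+In this chapter we built a complete performance evaluation system for the HelloAgents framework. Let us review the core content:
+
+**(1) Evaluation system overview**
+
+We established a three-tier evaluation system that covers different capability dimensions of an agent. The first tier is tool-calling evaluation (BFCL), which focuses on the accuracy of function calls, covers the four categories simple, multiple, parallel, and irrelevance, and uses AST matching for precise evaluation. The second tier is general capability evaluation (GAIA), which assesses overall problem-solving ability with 466 real-world questions across three difficulty levels and focuses on multi-step reasoning, tool use, and file handling. The third tier is data generation quality evaluation (AIME), which assesses the quality of LLM-generated data using LLM Judge and Win Rate, supports human verification and comprehensive report generation, and checks that generated data meets the quality standard of the reference data.
+
+**(2) Core technical points**
+
+The implementation rests on six core technical points. The first is modular design: the evaluation system has a three-layer architecture, with a data layer (Dataset loads and manages data), an evaluation layer (Evaluator runs the evaluation flow), and a metrics layer (Metrics computes the evaluation metrics). The second is tool encapsulation: every evaluation function is wrapped as a Tool that can be called directly by an agent, integrated into a workflow, or used through a unified interface. The third is AST matching: function calls are matched on their abstract syntax trees, which is smarter than plain string matching because it ignores argument order, recognizes equivalent expressions, and ignores formatting differences. The fourth is multimodal support: GAIA evaluation handles text questions, attached files, and image inputs. The fifth is LLM Judge evaluation: an LLM acts as the judge of generated data quality, providing multi-dimensional scores (correctness, clarity, difficulty match, completeness), an automated evaluation flow, and detailed reports, with support for custom dimensions and criteria. The sixth is Win Rate comparison: generation quality is assessed through pairwise comparison (generated data vs reference data), with the LLM deciding which is better and win-rate statistics computed from the verdicts; a value close to 50% indicates comparable quality.
+
+**(3) Directions for extension**
+
+Building on the evaluation system from this chapter, you can extend it in four directions. The first is adding new benchmarks: follow the implementation pattern of BFCL and GAIA, implement the three components Dataset, Evaluator, and Metrics, and wrap them as a Tool. The second is custom evaluation metrics: add new metric computations to the Metrics class, designed for your specific application scenario. The third is CI/CD integration: run evaluations automatically on code commits, set performance thresholds to prevent regressions, and generate and archive evaluation reports. The fourth is extending data generation evaluation: support more data types (code, dialogue, documents, and so on), add more evaluation dimensions (originality, diversity, and so on), integrate more reference datasets, and support multi-model comparison.
+
+**Congratulations on completing Chapter 12!** 🎉
+
+Evaluation is an essential part of agent development. It lets us:
+
+- Measure an agent's capabilities objectively
+- Find and fix problems
+- Improve the system continuously
+
+In the next chapter, we will explore how to apply the HelloAgents framework to real projects.
+
+**Keep it up!** 💪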
+
+
+
+## References
+
+[1] Patil, S. G., Zhang, T., Wang, X., & Gonzalez, J. E. (2023). Gorilla: Large Language Model Connected with Massive APIs. arXiv preprint arXiv:2305.15334.
+
+[2] Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., ... & Sun, M. (2023). ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. arXiv preprint arXiv:2307.16789.
+
+[3] Li, M., Zhao, Y., Yu, B., Song, F., Li, H., Yu, H., ... & Li, Y. (2023). API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs. arXiv preprint arXiv:2304.08244.
+
+[4] Mialon, G., Dessì, R., Lomeli, M., Nalmpantis, C., Pasunuru, R., Raileanu, R., ... & Scialom, T. (2023). GAIA: a benchmark for General AI Assistants. arXiv preprint arXiv:2311.12983.
+
+[5] Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., ... & Zhang, D. (2023). AgentBench: Evaluating LLMs as Agents. arXiv preprint arXiv:2308.03688.
+
+[6] Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., ... & Neubig, G. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv preprint arXiv:2307.13854.
+
+[7] Chan, C. M., Chen, W., Su, Y., Yu, J., Xue, W., Zhang, S., ... & Liu, Z. (2023). ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate. arXiv preprint arXiv:2308.07201.
+
+[8] Zhou, X., Zhu, H., Mathur, L., Zhang, R., Yu, H., Qi, Z., ... & Neubig, G. (2023). SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents. arXiv preprint arXiv:2310.11667.
+
+[9] Mathematical Association of America. (2024). American Invitational Mathematics Examination (AIME). Retrieved from https://www.maa.org/math-competitions/invitational-competitions/aime
+

BIN
docs/images/12-figures/12-1.png


BIN
docs/images/12-figures/12-2.png


BIN
docs/images/12-figures/12-3.png


BIN
docs/images/12-figures/12-4.png


BIN
docs/images/12-figures/12-5.png


BIN
docs/images/12-figures/12-6.png


BIN
docs/images/12-figures/12-7.png


BIN
docs/images/12-figures/12-table-1.png


BIN
docs/images/12-figures/12-table-2.png


BIN
docs/images/12-figures/12-table-3.png


BIN
docs/images/12-figures/12-table-4.png


BIN
docs/images/12-figures/12-table-5.png

