AI-powered incident triage, root cause investigation, and post-mortem generation
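Summary: This project builds an AI-driven SRE on-call assistant that automates alert triage, root cause investigation, and post-mortem report generation. Its three-stage agent pipeline (Plan-and-Solve → ReAct → Reflection) shows how the three classic paradigms from Chapter 4 can be chained in a real operations scenario. It is the community's first project in the SRE/operations domain.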
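When a production alert fires at 3am, an on-call SRE must triage the incident, investigate root cause across logs and metrics, consult runbooks, and write a post-mortem, all under pressure. This project automates that workflow using a three-stage AI agent pipeline: triage (Plan-and-Solve) generates an investigation plan, investigation (ReAct) runs a tool loop over logs, metrics, and runbooks to find the root cause, and post-mortem (Reflection) drafts, critiques, and revises the report.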
This is the first SRE/operations domain project in the Hello-Agents community, and demonstrates all three agent paradigms from Chapter 4 in a single coherent system.
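To make the stage hand-offs concrete, here is a minimal sketch of how the three stages could be chained. It is not the project's implementation: the function names, the canned tool outputs, and the hard-coded plan are all hypothetical, and no LLM is called. In particular, the investigation stage below simply executes the plan step by step, whereas a real ReAct loop would let the model choose each next action from the previous observation.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for the real tools; each returns canned text.
TOOLS: Dict[str, Callable[[str], str]] = {
    "log_search": lambda q: f"log entries matching '{q}'",
    "metric_query": lambda q: f"time series for '{q}'",
    "runbook_lookup": lambda q: f"runbook steps for '{q}'",
}


def triage(alert: str) -> List[Tuple[str, str]]:
    """Stage 1: return an ordered plan of (tool, query) steps.

    Hard-coded here; the real stage would ask an LLM to produce it.
    """
    return [
        ("log_search", alert),
        ("metric_query", "db_pool"),
        ("runbook_lookup", alert),
    ]


def investigate(plan: List[Tuple[str, str]]) -> dict:
    """Stage 2: run each planned tool call and record the observation."""
    evidence = []
    for tool, query in plan:
        observation = TOOLS[tool](query)
        evidence.append(f"{tool}[{query}] -> {observation}")
    # A real stage would reason over the evidence to name a root cause.
    return {"root_cause": "undetermined (stub)", "evidence": evidence}


def write_postmortem(findings: dict) -> str:
    """Stage 3: draft a Markdown report (critique and revision omitted)."""
    lines = ["# Post-mortem", f"Root cause: {findings['root_cause']}", "", "## Evidence"]
    lines += [f"- {item}" for item in findings["evidence"]]
    return "\n".join(lines)


def run_pipeline_sketch(alert: str) -> dict:
    """Chain the three stages; the result keys mirror 'findings' and 'report'."""
    findings = investigate(triage(alert))
    return {"findings": findings, "report": write_postmortem(findings)}
```

Calling `run_pipeline_sketch("pool exhausted")` returns a dict whose `report` value is a short Markdown document listing one evidence line per planned step.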
Investigation tools: log_search, metric_query, runbook_lookup
pip install -r requirements.txt
cp .env.example .env
# Edit .env and set LLM_API_KEY, LLM_BASE_URL, LLM_MODEL_ID
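A startup check for these three settings can catch a half-edited `.env` early. The sketch below is an illustration, not the project's own loader: `load_llm_settings` is a hypothetical helper, and it assumes the variables have already been exported into the process environment (for example by whatever `.env` loader the project uses).

```python
import os
from typing import Dict, Mapping

# The three variable names come from the setup step above.
REQUIRED_KEYS = ("LLM_API_KEY", "LLM_BASE_URL", "LLM_MODEL_ID")


def load_llm_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the three LLM settings, raising if any is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing LLM settings: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}
```

Passing a plain dict instead of `os.environ` makes the helper easy to exercise in tests without touching the real environment.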
Free LLM options:
- https://aihubmix.com/v1 (free tier, OpenAI-compatible)
- https://api-inference.modelscope.cn/v1 (2000 free calls/day)
jupyter lab
# Open main.ipynb and run all cells
uvicorn src.api.main:app --reload --port 8000
API endpoints:
| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Liveness check |
| GET | /incidents/fixtures | List sample incident IDs |
| POST | /incidents/investigate | Run the full 3-agent pipeline |
| GET | /incidents/{id}/report | Retrieve a generated report |
from src.agents.pipeline import run_pipeline
result = run_pipeline("db_pool_exhaustion")
print(result["report"]) # Markdown RCA report
print(result["findings"]) # Root cause + evidence dict
# List available incidents
curl http://localhost:8000/incidents/fixtures
# Run the pipeline
curl -X POST http://localhost:8000/incidents/investigate \
-H "Content-Type: application/json" \
-d '{"incident_id": "db_pool_exhaustion"}'
# Get the generated report
curl http://localhost:8000/incidents/db_pool_exhaustion/report
🚨 STAGE 1: TRIAGE — Generating investigation plan
1. [log_search] pool exhausted — Find DB pool error log entries
2. [metric_query] db_pool — Check connection pool saturation over time
3. [metric_query] latency — Quantify request latency degradation
4. [runbook_lookup] DB pool exhausted — Get remediation steps
🔍 STAGE 2: INVESTIGATION — ReAct tool loop
Step 1 — log_search[pool exhausted] → 3 matching entries found
Step 2 — metric_query[db_pool] → pool maxed at 10/10 from 14:01 onward
Step 3 — runbook_lookup[DB pool exhausted] → runbook steps retrieved
✅ Root cause: Missing index on orders.user_id causing full table scan...
📝 STAGE 3: POST-MORTEM — Reflection (draft → critique → revise)
Quality score: 9/10 — no revision needed.
✅ Final post-mortem ready.
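The draft → critique → revise step in Stage 3 can be pictured as a small loop that stops once the critic's score clears a threshold. The sketch below is an assumption-laden illustration rather than the project's code: `reflect`, the stub critic and reviser, the threshold of 8, and the cap of two rounds are all hypothetical. In the real pipeline the critique and revision would be LLM calls.

```python
from typing import Callable, Tuple


def reflect(
    draft: str,
    critique: Callable[[str], Tuple[int, str]],
    revise: Callable[[str, str], str],
    threshold: int = 8,   # assumed cut-off on a 0-10 quality score
    max_rounds: int = 2,  # assumed cap so the loop always terminates
) -> Tuple[str, int]:
    """Revise the draft until its score reaches the threshold or rounds run out."""
    text = draft
    score, feedback = critique(text)
    rounds = 0
    while score < threshold and rounds < max_rounds:
        text = revise(text, feedback)
        score, feedback = critique(text)
        rounds += 1
    return text, score


def stub_critique(text: str) -> Tuple[int, str]:
    """Toy critic: a report with a timeline section scores 9, otherwise 5."""
    if "## Timeline" in text:
        return 9, "no changes needed"
    return 5, "add a timeline section"


def stub_revise(text: str, feedback: str) -> str:
    """Toy reviser: append the section the critic asked for."""
    return text + "\n## Timeline\n(to be filled in)"
```

With these stubs, a draft that already contains a timeline section is returned unchanged, while a draft without one goes through one revision round.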
Results on the 3 sample incident fixtures, tested with Llama-3.3-70b via Groq (the pipeline works with any OpenAI-compatible API):
| Incident | Root Cause Identified | Pipeline Time |
|---|---|---|
| DB pool exhaustion | ✅ Missing index on orders.user_id | ~30s |
| Memory leak OOM | ✅ Session cache with no TTL/eviction | ~25s |
| External API rate limit | ✅ Retry storm from no exponential backoff | ~28s |
Root cause accuracy: 3/3 (100%) on sample fixtures
Issues and PRs welcome! See the Hello-Agents contributing guide.
MIT License — see LICENSE.txt for details.
Thanks to the Datawhale Hello-Agents team for the excellent curriculum, and to Chapter 4's ReAct, Plan-and-Solve, and Reflection examples which this project builds on directly.