Merge pull request #534 from jijiezzhou/feature/sre-oncall-agent

[Graduation Project] SREOnCallAgent - AI-driven automated SRE incident analysis and post-mortem report generation
jjyaoao 1 month ago
parent
commit
8cafa0e044
24 files changed, with 2016 additions and 0 deletions
  1. 20 0
      Co-creation-projects/zjzhou-SREOnCallAgent/.env.example
  2. 11 0
      Co-creation-projects/zjzhou-SREOnCallAgent/.gitignore
  3. 166 0
      Co-creation-projects/zjzhou-SREOnCallAgent/README.md
  4. 39 0
      Co-creation-projects/zjzhou-SREOnCallAgent/data/incidents/db_pool_exhaustion.json
  5. 34 0
      Co-creation-projects/zjzhou-SREOnCallAgent/data/incidents/external_api_ratelimit.json
  6. 31 0
      Co-creation-projects/zjzhou-SREOnCallAgent/data/incidents/memory_leak_oom.json
  7. 29 0
      Co-creation-projects/zjzhou-SREOnCallAgent/data/runbooks/checkout-service.yaml
  8. 30 0
      Co-creation-projects/zjzhou-SREOnCallAgent/data/runbooks/payment-service.yaml
  9. 855 0
      Co-creation-projects/zjzhou-SREOnCallAgent/main.ipynb
  10. 12 0
      Co-creation-projects/zjzhou-SREOnCallAgent/requirements.txt
  11. 0 0
      Co-creation-projects/zjzhou-SREOnCallAgent/src/__init__.py
  12. 0 0
      Co-creation-projects/zjzhou-SREOnCallAgent/src/agents/__init__.py
  13. 190 0
      Co-creation-projects/zjzhou-SREOnCallAgent/src/agents/investigation_agent.py
  14. 62 0
      Co-creation-projects/zjzhou-SREOnCallAgent/src/agents/pipeline.py
  15. 155 0
      Co-creation-projects/zjzhou-SREOnCallAgent/src/agents/postmortem_agent.py
  16. 94 0
      Co-creation-projects/zjzhou-SREOnCallAgent/src/agents/triage_agent.py
  17. 0 0
      Co-creation-projects/zjzhou-SREOnCallAgent/src/api/__init__.py
  18. 88 0
      Co-creation-projects/zjzhou-SREOnCallAgent/src/api/main.py
  19. 0 0
      Co-creation-projects/zjzhou-SREOnCallAgent/src/core/__init__.py
  20. 55 0
      Co-creation-projects/zjzhou-SREOnCallAgent/src/core/llm_client.py
  21. 0 0
      Co-creation-projects/zjzhou-SREOnCallAgent/src/tools/__init__.py
  22. 42 0
      Co-creation-projects/zjzhou-SREOnCallAgent/src/tools/log_search_tool.py
  23. 38 0
      Co-creation-projects/zjzhou-SREOnCallAgent/src/tools/metric_query_tool.py
  24. 65 0
      Co-creation-projects/zjzhou-SREOnCallAgent/src/tools/runbook_tool.py

+ 20 - 0
Co-creation-projects/zjzhou-SREOnCallAgent/.env.example

@@ -0,0 +1,20 @@
+# LLM Configuration — any OpenAI-compatible API works
+
+# Option A: Groq (recommended — free tier, fast inference)
+#   Get your key at: https://console.groq.com → API Keys
+LLM_API_KEY=your_groq_api_key_here
+LLM_BASE_URL=https://api.groq.com/openai/v1
+LLM_MODEL_ID=llama-3.3-70b-versatile
+
+# Option B: ModelScope / Qwen (free, 2000 calls/day)
+# LLM_API_KEY=your_modelscope_sdk_token
+# LLM_BASE_URL=https://api-inference.modelscope.cn/v1
+# LLM_MODEL_ID=Qwen/Qwen2.5-72B-Instruct
+
+# Option C: OpenAI directly
+# LLM_API_KEY=sk-...
+# LLM_BASE_URL=https://api.openai.com/v1
+# LLM_MODEL_ID=gpt-4o-mini
+
+# Optional: LLM request timeout in seconds (default: 60)
+# LLM_TIMEOUT=60
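
Reviewer sketch (not part of this commit): one way the four `LLM_*` variables above could be read and validated. The `LLMSettings` and `load_llm_settings` names are hypothetical; the real loading logic lives in `src/core/llm_client.py` and may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LLMSettings:
    api_key: str
    base_url: str
    model_id: str
    timeout: float


def load_llm_settings(env: Optional[Mapping[str, str]] = None) -> LLMSettings:
    """Read the LLM_* variables, failing fast when a required one is missing."""
    env = os.environ if env is None else env
    required = ("LLM_API_KEY", "LLM_BASE_URL", "LLM_MODEL_ID")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError(f"Missing required settings: {', '.join(missing)}")
    return LLMSettings(
        api_key=env["LLM_API_KEY"],
        base_url=env["LLM_BASE_URL"],
        model_id=env["LLM_MODEL_ID"],
        # LLM_TIMEOUT is optional; 60 seconds matches the default documented above.
        timeout=float(env.get("LLM_TIMEOUT", "60")),
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to test without touching the process environment.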

+ 11 - 0
Co-creation-projects/zjzhou-SREOnCallAgent/.gitignore

@@ -0,0 +1,11 @@
+.env
+__pycache__/
+*.pyc
+*.pyo
+.venv/
+venv/
+env/
+.DS_Store
+outputs/
+*.egg-info/
+.ipynb_checkpoints/

+ 166 - 0
Co-creation-projects/zjzhou-SREOnCallAgent/README.md

@@ -0,0 +1,166 @@
+# SRE On-Call Agent
+
+> AI-powered incident triage, root cause investigation, and post-mortem generation
+
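+> **Summary**: This project builds an AI-driven SRE on-call assistant that automates alert triage, root cause investigation, and post-mortem report generation. Its three-stage agent pipeline (Plan-and-Solve → ReAct → Reflection) demonstrates how the three classic paradigms from Chapter 4 chain together in a realistic operations scenario. It is the community's first project in the SRE/operations domain.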
+
+## 📝 Project Introduction
+
+When a production alert fires at 3am, an on-call SRE must triage the incident, investigate root cause across logs and metrics, consult runbooks, and write a post-mortem — all under pressure. This project automates that workflow using a three-stage AI agent pipeline:
+
+- **Stage 1 — TriageAgent** (Plan-and-Solve): converts a raw alert JSON into an ordered investigation plan
+- **Stage 2 — InvestigationAgent** (ReAct): iterates through log search, metric queries, and runbook lookups to identify root cause
+- **Stage 3 — PostmortemAgent** (Reflection): drafts a structured RCA report, self-critiques it against quality criteria, and revises
+
+This is the **first SRE/operations domain project** in the Hello-Agents community, and demonstrates all three agent paradigms from Chapter 4 in a single coherent system.
+
+## ✨ Core Features
+
+- [x] Three incident fixtures: DB pool exhaustion, memory leak OOM, external API rate limit cascade
+- [x] ReAct investigation loop with 3 tools: `log_search`, `metric_query`, `runbook_lookup`
+- [x] Reflection-based post-mortem with draft → critique → revise cycle
+- [x] FastAPI REST backend — CORS-enabled and ready for frontend integration
+- [x] Structured RCA reports: timeline, 5-whys, impact assessment, action items
+
+## 🛠️ Technology Stack
+
+- **Agent paradigms**: Plan-and-Solve, ReAct, Reflection (Chapter 4)
+- **LLM**: Any OpenAI-compatible API (Groq, AIHubmix, ModelScope/Qwen, OpenAI)
+- **Backend**: FastAPI + Uvicorn
+- **Data**: JSON incident fixtures + YAML runbooks (no external services needed)
+
+## 🚀 Quick Start
+
+### Environment Requirements
+
+- Python 3.10+
+
+### Install Dependencies
+
+```bash
+pip install -r requirements.txt
+```
+
+### Configure API Keys
+
+```bash
+cp .env.example .env
+# Edit .env and set LLM_API_KEY, LLM_BASE_URL, LLM_MODEL_ID
+```
+
+Free LLM options:
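+- **Groq** (recommended; the default in `.env.example`): `https://api.groq.com/openai/v1`, free tier, fast inference
+- **AIHubmix**: `https://aihubmix.com/v1`, free tier, OpenAI-compatible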
+- **ModelScope/Qwen**: `https://api-inference.modelscope.cn/v1` — 2000 free calls/day
+
+### Run in Jupyter Notebook
+
+```bash
+jupyter lab
+# Open main.ipynb and run all cells
+```
+
+### Run the FastAPI Server
+
+```bash
+uvicorn src.api.main:app --reload --port 8000
+```
+
+API endpoints:
+
+| Method | Endpoint | Description |
+|---|---|---|
+| `GET` | `/health` | Liveness check |
+| `GET` | `/incidents/fixtures` | List sample incident IDs |
+| `POST` | `/incidents/investigate` | Run the full 3-agent pipeline |
+| `GET` | `/incidents/{id}/report` | Retrieve a generated report |
+
+## 📖 Usage Examples
+
+### Via Python (notebook or script)
+
+```python
+from src.agents.pipeline import run_pipeline
+
+result = run_pipeline("db_pool_exhaustion")
+print(result["report"])    # Markdown RCA report
+print(result["findings"])  # Root cause + evidence dict
+```
+
+### Via API
+
+```bash
+# List available incidents
+curl http://localhost:8000/incidents/fixtures
+
+# Run the pipeline
+curl -X POST http://localhost:8000/incidents/investigate \
+  -H "Content-Type: application/json" \
+  -d '{"incident_id": "db_pool_exhaustion"}'
+
+# Get the generated report
+curl http://localhost:8000/incidents/db_pool_exhaustion/report
+```
+
+### Sample Output
+
+```
+🚨 STAGE 1: TRIAGE — Generating investigation plan
+   1. [log_search] pool exhausted — Find DB pool error log entries
+   2. [metric_query] db_pool — Check connection pool saturation over time
+   3. [metric_query] latency — Quantify request latency degradation
+   4. [runbook_lookup] DB pool exhausted — Get remediation steps
+
+🔍 STAGE 2: INVESTIGATION — ReAct tool loop
+   Step 1 — log_search[pool exhausted] → 3 matching entries found
+   Step 2 — metric_query[db_pool] → pool maxed at 10/10 from 14:02 onward
+   Step 3 — runbook_lookup[DB pool exhausted] → runbook steps retrieved
+   ✅ Root cause: Missing index on orders.user_id causing full table scan...
+
+📝 STAGE 3: POST-MORTEM — Reflection (draft → critique → revise)
+   Quality score: 9/10 — no revision needed.
+   ✅ Final post-mortem ready.
+```
+
+## 🎯 Project Highlights
+
+- **Three agent paradigms in one system**: most co-creation projects use a single paradigm; this chains Plan-and-Solve → ReAct → Reflection into a coherent pipeline
+- **Domain novelty**: SRE/operations is not covered by any other project in this community
+- **Production-realistic fixtures**: log entries, metric time-series, and runbook YAML match real incident patterns (DB pool exhaustion, memory leak, rate limit cascade)
+- **Upgrade path built in**: FastAPI backend is CORS-enabled; SSE streaming and a frontend can be added without changing the agent code
+
+## 📊 Performance Evaluation
+
+On 3 incident fixtures (tested with Llama-3.3-70b via Groq; other OpenAI-compatible APIs should also work):
+
+| Incident | Root Cause Identified | Pipeline Time |
+|---|---|---|
+| DB pool exhaustion | ✅ Missing index on orders.user_id | ~30s |
+| Memory leak OOM | ✅ Session cache with no TTL/eviction | ~25s |
+| External API rate limit | ✅ Retry storm from no exponential backoff | ~28s |
+
+Root cause accuracy: **3/3 (100%)** on sample fixtures
+
+## 🔮 Future Plans
+
+- [ ] SSE streaming: stream agent reasoning steps to frontend in real-time
+- [ ] Vue/React frontend: incident selector + live trace + markdown viewer
+- [ ] Real log ingestion: connect to Loki / CloudWatch / Datadog
+- [ ] Vector memory: embed past RCA reports for faster future investigations
+- [ ] Safe runbook execution: let the agent run low-risk remediation commands
+
+## 🤝 Contribution Guidelines
+
+Issues and PRs welcome! See the [Hello-Agents contributing guide](../../README.md).
+
+## 📄 License
+
+MIT License — see [LICENSE.txt](../../LICENSE.txt) for details.
+
+## 👤 Author
+
+- **GitHub**: [@zjzhou](https://github.com/zjzhou)
+- **Email**: jijiezhou@gmail.com
+
+## 🙏 Acknowledgments
+
+Thanks to the [Datawhale Hello-Agents](https://github.com/datawhalechina/hello-agents) team for the excellent curriculum, and to Chapter 4's ReAct, Plan-and-Solve, and Reflection examples which this project builds on directly.
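
Reviewer sketch (not part of this commit): the draft → critique → revise cycle described in the README, reduced to its control flow. The `reflect` function, the pass score of 8, and the cap of 2 revision rounds are assumptions for illustration; `src/agents/postmortem_agent.py` may use different names and thresholds. The stub callables stand in for LLM calls so the loop can run offline.

```python
from typing import Callable, Tuple


def reflect(
    draft_fn: Callable[[], str],
    critique_fn: Callable[[str], Tuple[int, str]],
    revise_fn: Callable[[str, str], str],
    pass_score: int = 8,   # assumed threshold, not taken from the commit
    max_rounds: int = 2,   # assumed cap on revision rounds
) -> Tuple[str, int]:
    """Draft once, then critique and revise until the score passes or rounds run out."""
    report = draft_fn()
    score, feedback = critique_fn(report)
    rounds = 0
    while score < pass_score and rounds < max_rounds:
        report = revise_fn(report, feedback)
        score, feedback = critique_fn(report)
        rounds += 1
    return report, score


# Stubs standing in for the LLM-backed draft, critique, and revise steps.
def fake_draft() -> str:
    return "## Root cause\nMissing index on orders.user_id"


def fake_critique(report: str) -> Tuple[int, str]:
    if "## Action items" in report:
        return 9, "ok"
    return 6, "add an action items section"


def fake_revise(report: str, feedback: str) -> str:
    return report + "\n## Action items\n- Add index idx_orders_user_id"


final_report, final_score = reflect(fake_draft, fake_critique, fake_revise)
print(final_score)
```

Capping the number of rounds keeps the cost bounded when the critic never returns a passing score.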

+ 39 - 0
Co-creation-projects/zjzhou-SREOnCallAgent/data/incidents/db_pool_exhaustion.json

@@ -0,0 +1,39 @@
+{
+  "incident_id": "INC-001",
+  "service": "checkout-service",
+  "severity": "P1",
+  "alert": {
+    "metric": "http_request_duration_p99",
+    "value": 8.3,
+    "threshold": 1.0,
+    "unit": "seconds",
+    "timestamp": "2024-01-15T14:03:22Z",
+    "description": "P99 request latency critically exceeds SLO threshold"
+  },
+  "logs": [
+    {"timestamp": "2024-01-15T13:58:10Z", "level": "INFO",     "message": "Deployment: checkout-service v2.3.1 rolled out successfully (0 errors during rollout)"},
+    {"timestamp": "2024-01-15T13:59:02Z", "level": "WARN",     "message": "Redis cache miss rate elevated: 34% (threshold 30%) — cache warming in progress"},
+    {"timestamp": "2024-01-15T14:00:01Z", "level": "INFO",     "message": "Processing checkout request user_id=12345 order_id=98001"},
+    {"timestamp": "2024-01-15T14:00:03Z", "level": "WARN",     "message": "DB connection pool: 8/10 connections in use"},
+    {"timestamp": "2024-01-15T14:00:20Z", "level": "WARN",     "message": "JVM GC pause: 420ms stop-the-world collection (heap 78% utilised) — within acceptable range"},
+    {"timestamp": "2024-01-15T14:00:45Z", "level": "ERROR",    "message": "Health check /actuator/health returned 503 from upstream load balancer — retrying (attempt 1/3)"},
+    {"timestamp": "2024-01-15T14:01:00Z", "level": "INFO",     "message": "Health check recovered after retry — upstream responding normally"},
+    {"timestamp": "2024-01-15T14:01:15Z", "level": "WARN",     "message": "DB connection pool: 10/10 connections in use, queuing requests"},
+    {"timestamp": "2024-01-15T14:01:30Z", "level": "WARN",     "message": "Stripe webhook delivery delayed 1200ms (SLA=500ms) — possible downstream congestion"},
+    {"timestamp": "2024-01-15T14:01:45Z", "level": "ERROR",    "message": "DB connection pool exhausted — waiting for available connection (timeout=30s)"},
+    {"timestamp": "2024-01-15T14:02:00Z", "level": "ERROR",    "message": "Query timeout: SELECT * FROM orders WHERE user_id=12345 (full table scan, duration=30s)"},
+    {"timestamp": "2024-01-15T14:02:01Z", "level": "ERROR",    "message": "DB connection pool exhausted — waiting for available connection (timeout=30s)"},
+    {"timestamp": "2024-01-15T14:02:15Z", "level": "WARN",     "message": "Datadog agent: metric submission backlog 850 points — host under load"},
+    {"timestamp": "2024-01-15T14:02:30Z", "level": "ERROR",    "message": "SQLSTATE[HY000]: General error: too many connections"},
+    {"timestamp": "2024-01-15T14:03:00Z", "level": "CRITICAL", "message": "checkout-service: request queue depth=450, dropping new requests"},
+    {"timestamp": "2024-01-15T14:03:22Z", "level": "ALERT",    "message": "P1 FIRED: http_request_duration_p99=8.3s exceeds threshold=1.0s"}
+  ],
+  "metrics": {
+    "http_request_duration_p99": {"14:00": 0.3, "14:01": 1.2, "14:02": 4.7, "14:03": 8.3},
+    "db_pool_active_connections": {"14:00": 3, "14:01": 8, "14:02": 10, "14:03": 10},
+    "db_query_duration_p99_ms":   {"14:00": 45, "14:01": 280, "14:02": 12000, "14:03": 30000},
+    "request_queue_depth":        {"14:00": 12, "14:01": 85, "14:02": 220, "14:03": 450}
+  },
+  "root_cause": "Missing index on orders.user_id causing full table scan. At high load each scan holds a DB connection for 30s+, exhausting the pool of 10 connections.",
+  "affected_users": 2847
+}
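
Reviewer sketch (not part of this commit): how a keyword/regex search over the `logs` array of a fixture like the one above might work. `search_logs` is a hypothetical helper, not the `LogSearchTool` shipped in `src/tools/log_search_tool.py`; the output format mirrors the notebook's smoke-test output. The inline `incident` holds three entries copied from the fixture.

```python
import re
from typing import Any, Dict, List


def search_logs(incident: Dict[str, Any], pattern: str) -> List[str]:
    """Return formatted log lines whose message matches `pattern` (case-insensitive)."""
    try:
        matcher = re.compile(pattern, re.IGNORECASE)
    except re.error:
        # Fall back to a literal match when the query is not a valid regex.
        matcher = re.compile(re.escape(pattern), re.IGNORECASE)
    return [
        f"[{entry['timestamp']}] [{entry['level']:<8}] {entry['message']}"
        for entry in incident.get("logs", [])
        if matcher.search(entry["message"])
    ]


incident = {
    "logs": [
        {"timestamp": "2024-01-15T14:00:03Z", "level": "WARN",
         "message": "DB connection pool: 8/10 connections in use"},
        {"timestamp": "2024-01-15T14:02:00Z", "level": "ERROR",
         "message": "Query timeout: SELECT * FROM orders WHERE user_id=12345 (full table scan, duration=30s)"},
        {"timestamp": "2024-01-15T14:02:30Z", "level": "ERROR",
         "message": "SQLSTATE[HY000]: General error: too many connections"},
    ]
}

for line in search_logs(incident, "pool|timeout"):
    print(line)
```

The literal fallback matters because an LLM-generated query such as `SQLSTATE[HY000` is not a valid regular expression and would otherwise raise.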

+ 34 - 0
Co-creation-projects/zjzhou-SREOnCallAgent/data/incidents/external_api_ratelimit.json

@@ -0,0 +1,34 @@
+{
+  "incident_id": "INC-003",
+  "service": "payment-service",
+  "severity": "P2",
+  "alert": {
+    "metric": "payment_success_rate",
+    "value": 23.4,
+    "threshold": 95.0,
+    "unit": "percent",
+    "timestamp": "2024-01-17T16:22:10Z",
+    "description": "Payment success rate critically below SLO"
+  },
+  "logs": [
+    {"timestamp": "2024-01-17T16:10:00Z", "level": "INFO",     "message": "Payment processing started for order_id=98765"},
+    {"timestamp": "2024-01-17T16:15:00Z", "level": "WARN",     "message": "Stripe API slow response: 2.3s for POST /v1/charges"},
+    {"timestamp": "2024-01-17T16:18:00Z", "level": "ERROR",    "message": "Stripe API 429 Too Many Requests: retry_after=60s"},
+    {"timestamp": "2024-01-17T16:18:01Z", "level": "ERROR",    "message": "Retry 1/3 for order_id=99001 (no backoff configured — immediate retry)"},
+    {"timestamp": "2024-01-17T16:18:02Z", "level": "ERROR",    "message": "Stripe API 429 Too Many Requests (retry 1 failed immediately)"},
+    {"timestamp": "2024-01-17T16:18:02Z", "level": "ERROR",    "message": "Retry 2/3 for order_id=99001 (no backoff — immediate retry)"},
+    {"timestamp": "2024-01-17T16:18:03Z", "level": "ERROR",    "message": "Stripe API 429 Too Many Requests (retry 2 failed)"},
+    {"timestamp": "2024-01-17T16:18:03Z", "level": "ERROR",    "message": "Retry 3/3 for order_id=99001 (final retry, no backoff)"},
+    {"timestamp": "2024-01-17T16:18:04Z", "level": "ERROR",    "message": "All retries exhausted for order_id=99001 — payment FAILED"},
+    {"timestamp": "2024-01-17T16:20:00Z", "level": "CRITICAL", "message": "Stripe API rate limit cascade: 347 failed payments in 2 minutes, amplified by retry storm"},
+    {"timestamp": "2024-01-17T16:22:10Z", "level": "ALERT",    "message": "P2 FIRED: payment_success_rate=23.4% below threshold=95.0%"}
+  ],
+  "metrics": {
+    "payment_success_rate":       {"16:10": 99.1, "16:15": 88.3, "16:18": 45.2, "16:20": 27.8, "16:22": 23.4},
+    "stripe_api_calls_per_min":   {"16:10": 45,   "16:15": 280,  "16:18": 890,  "16:20": 1240, "16:22": 1380},
+    "stripe_api_error_rate":      {"16:10": 0.0,  "16:15": 3.2,  "16:18": 62.4, "16:20": 88.1, "16:22": 94.6},
+    "failed_payments_count":      {"16:10": 0,    "16:15": 12,   "16:18": 89,   "16:20": 347,  "16:22": 521}
+  },
+  "root_cause": "Promotional flash sale caused 30x payment request spike. Retry logic has no exponential backoff — each 429 triggers immediate retries, creating a retry storm that amplifies the rate limit violation.",
+  "affected_users": 521
+}
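
Reviewer sketch (not part of this commit): the root cause above is immediate retries on HTTP 429, so this shows one common alternative, exponential backoff with full jitter. `backoff_delay` is a hypothetical helper. The 1-second base echoes `STRIPE_RETRY_INITIAL_DELAY_MS=1000` from the payment-service runbook; the 60-second cap is an assumption.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,  # assumed upper bound, not taken from the commit
    retry_after: Optional[float] = None,
    rng: Optional[random.Random] = None,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Full jitter: a uniform draw from [0, min(cap, base * 2**attempt)].
    A server-supplied retry_after value acts as a floor on the result.
    """
    rng = rng or random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    delay = rng.uniform(0.0, ceiling)
    if retry_after is not None:
        delay = max(delay, retry_after)
    return delay


# Example: three retries after a 429 that carried retry_after=60s.
seeded = random.Random(42)
delays = [backoff_delay(n, retry_after=60.0, rng=seeded) for n in range(3)]
print(delays)
```

Randomizing the delay spreads retries from many clients over time, which is what prevents the synchronized bursts seen in the 16:18 log entries.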

+ 31 - 0
Co-creation-projects/zjzhou-SREOnCallAgent/data/incidents/memory_leak_oom.json

@@ -0,0 +1,31 @@
+{
+  "incident_id": "INC-002",
+  "service": "auth-service",
+  "severity": "P1",
+  "alert": {
+    "metric": "pod_memory_usage_percent",
+    "value": 96.8,
+    "threshold": 85.0,
+    "unit": "percent",
+    "timestamp": "2024-01-16T09:47:33Z",
+    "description": "Pod memory usage approaching OOM limit"
+  },
+  "logs": [
+    {"timestamp": "2024-01-16T07:30:00Z", "level": "INFO",     "message": "auth-service started, heap=128MB, session_cache=enabled"},
+    {"timestamp": "2024-01-16T08:00:00Z", "level": "INFO",     "message": "Session cache initialized, entries=0, ttl=not_configured"},
+    {"timestamp": "2024-01-16T08:30:00Z", "level": "WARN",     "message": "Memory usage 45%, heap=460MB, session_cache_entries=15420"},
+    {"timestamp": "2024-01-16T09:00:00Z", "level": "WARN",     "message": "Memory usage 62%, heap=634MB, session_cache_entries=31200"},
+    {"timestamp": "2024-01-16T09:30:00Z", "level": "ERROR",    "message": "Memory usage 82%, heap=838MB, session_cache_entries=45800, GC pressure rising"},
+    {"timestamp": "2024-01-16T09:45:00Z", "level": "ERROR",    "message": "GC pause 2100ms — full GC running every 2s, application threads stalled"},
+    {"timestamp": "2024-01-16T09:47:20Z", "level": "CRITICAL", "message": "OOM imminent: heap=991MB/1024MB, session_cache_entries=52340, no eviction policy"},
+    {"timestamp": "2024-01-16T09:47:33Z", "level": "ALERT",    "message": "P1 FIRED: pod_memory_usage_percent=96.8 exceeds threshold=85.0"}
+  ],
+  "metrics": {
+    "pod_memory_usage_percent":  {"07:30": 12.5, "08:00": 22.1, "08:30": 45.0, "09:00": 62.0, "09:30": 82.0, "09:47": 96.8},
+    "session_cache_entries":     {"07:30": 0, "08:00": 0, "08:30": 15420, "09:00": 31200, "09:30": 45800, "09:47": 52340},
+    "gc_pause_ms":               {"07:30": 5, "08:00": 8, "08:30": 45, "09:00": 120, "09:30": 890, "09:47": 2100},
+    "auth_requests_per_sec":     {"07:30": 120, "08:00": 380, "08:30": 520, "09:00": 510, "09:30": 495, "09:47": 380}
+  },
+  "root_cause": "In-memory session cache has no TTL or eviction policy. Sessions accumulate indefinitely causing unbounded heap growth.",
+  "affected_users": 0
+}
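
Reviewer sketch (not part of this commit): the fixture above blames a session cache with no TTL or eviction policy. A minimal bounded alternative could look like the following. `BoundedTTLCache` is a hypothetical class and is not thread-safe; the clock is injectable only so that expiry can be tested deterministically.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional, Tuple


class BoundedTTLCache:
    """In-memory cache with a per-entry TTL and an LRU size limit."""

    def __init__(
        self,
        max_entries: int,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: "OrderedDict[str, Tuple[float, Any]]" = OrderedDict()

    def put(self, key: str, value: Any) -> None:
        self._data[key] = (self._clock() + self._ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # evict the least recently used entry

    def get(self, key: str) -> Optional[Any]:
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._data[key]  # expired: dropped lazily on access
            return None
        self._data.move_to_end(key)  # mark as recently used
        return value

    def __len__(self) -> int:
        return len(self._data)
```

Expired entries are only removed when read, so the size limit is what guarantees bounded memory: the cache never holds more than `max_entries` items regardless of access pattern.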

+ 29 - 0
Co-creation-projects/zjzhou-SREOnCallAgent/data/runbooks/checkout-service.yaml

@@ -0,0 +1,29 @@
+service: checkout-service
+runbook_version: "1.2"
+procedures:
+  - pattern: "DB pool exhausted"
+    severity: P1
+    steps:
+      - "Identify slow queries: run SHOW PROCESSLIST on the primary DB; look for queries >5s"
+      - "Run EXPLAIN on the slowest query to identify full table scans or missing indexes"
+      - "Temporary mitigation: increase pool_size from 10 to 20 via config hot-reload"
+      - "Permanent fix: add the missing index. Example: CREATE INDEX idx_orders_user_id ON orders(user_id)"
+      - "Validate index creation: run EXPLAIN again to confirm index is used"
+      - "Restart pods once index is verified: kubectl rollout restart deployment/checkout-service"
+      - "Monitor db_pool_active_connections and http_request_duration_p99 to confirm recovery"
+
+  - pattern: "high_p99_latency"
+    severity: P1
+    steps:
+      - "Check downstream dependencies: DB query time, Redis cache hit rate, external API latency"
+      - "Review request_queue_depth metric — if >100, consider pod autoscaling"
+      - "Enable slow query logging if not active: SET GLOBAL slow_query_log = 'ON'"
+      - "Consider circuit breaker activation to shed load and protect downstream services"
+      - "Check recent deployments: git log --since='2 hours ago' on checkout-service"
+
+  - pattern: "connection timeout"
+    severity: P2
+    steps:
+      - "Check network connectivity between checkout-service pods and DB"
+      - "Verify DB security group rules allow connections from checkout-service subnet"
+      - "Check for DB maintenance windows or failover events in RDS console"
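
Reviewer sketch (not part of this commit): one way a free-text query such as `pool exhausted` could be matched against the `pattern` fields of a runbook like the one above. `lookup_procedures` and its case-insensitive substring rule are assumptions; the shipped `RunbookLookupTool` in `src/tools/runbook_tool.py` may match differently. The runbook is inlined as a dict with abbreviated steps so the example needs no YAML parser.

```python
from typing import Any, Dict, List

# Same shape as checkout-service.yaml, with steps shortened for the example.
RUNBOOK: Dict[str, Any] = {
    "service": "checkout-service",
    "procedures": [
        {"pattern": "DB pool exhausted", "severity": "P1",
         "steps": ["Identify slow queries", "Add the missing index"]},
        {"pattern": "high_p99_latency", "severity": "P1",
         "steps": ["Check downstream dependencies"]},
        {"pattern": "connection timeout", "severity": "P2",
         "steps": ["Check network connectivity"]},
    ],
}


def lookup_procedures(runbook: Dict[str, Any], query: str) -> List[Dict[str, Any]]:
    """Return procedures whose pattern contains the query, or is contained in it."""
    q = query.strip().lower()
    if not q:
        return []
    matches = []
    for proc in runbook.get("procedures", []):
        pattern = proc["pattern"].lower()
        if q in pattern or pattern in q:
            matches.append(proc)
    return matches


for proc in lookup_procedures(RUNBOOK, "pool exhausted"):
    print(proc["pattern"], proc["severity"])
```

Checking containment in both directions lets a short query hit a longer pattern and a full log line hit a short pattern.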

+ 30 - 0
Co-creation-projects/zjzhou-SREOnCallAgent/data/runbooks/payment-service.yaml

@@ -0,0 +1,30 @@
+service: payment-service
+runbook_version: "2.0"
+procedures:
+  - pattern: "Stripe API 429"
+    severity: P2
+    steps:
+      - "Immediately enable request queue with rate limiting: set STRIPE_MAX_RPS=40 in env and redeploy"
+      - "Enable exponential backoff on retries: set STRIPE_RETRY_BACKOFF=exponential, STRIPE_RETRY_INITIAL_DELAY_MS=1000"
+      - "Check Stripe dashboard for current rate limit quota and usage"
+      - "If a promotional event is active, implement request queuing to smooth burst traffic"
+      - "Add circuit breaker: open circuit if stripe_api_error_rate > 50% for 30s; queue requests locally"
+      - "For sustained high volume: contact Stripe to raise rate limits; consider async Payment Intents API"
+      - "Monitor stripe_api_error_rate and payment_success_rate until both return to baseline"
+
+  - pattern: "payment_success_rate low"
+    severity: P2
+    steps:
+      - "Identify root cause category: check stripe_api_error_rate vs internal payment errors"
+      - "For Stripe rate limit issues: follow 'Stripe API 429' procedure above"
+      - "For authentication failures: verify API key is valid and not expired in Stripe dashboard"
+      - "For network issues: check VPC NAT gateway health and egress bandwidth utilization"
+      - "Communicate status to customer support team with estimated recovery time"
+
+  - pattern: "retry storm"
+    severity: P1
+    steps:
+      - "Immediately disable retries in payment-service config to stop amplification"
+      - "Wait for upstream rate limit window to reset (check retry_after header in logs)"
+      - "Re-enable retries only after adding exponential backoff with jitter"
+      - "Post-incident: add retry budget enforcement to prevent retry storm recurrence"
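
Reviewer sketch (not part of this commit): the "Stripe API 429" procedure above says to open a circuit when `stripe_api_error_rate` stays above 50% for 30s. The rule can be expressed as a small state machine. `ErrorRateBreaker` is a hypothetical class; closing again on the first healthy sample is a simplification, since production breakers usually add a half-open probing state.

```python
from typing import Optional


class ErrorRateBreaker:
    """Opens once the error rate has stayed above a threshold for a hold period."""

    def __init__(self, threshold_pct: float = 50.0, hold_seconds: float = 30.0) -> None:
        self.threshold_pct = threshold_pct
        self.hold_seconds = hold_seconds
        self._breach_started: Optional[float] = None
        self.is_open = False

    def observe(self, now: float, error_rate_pct: float) -> bool:
        """Record one sample (timestamp in seconds); return whether the circuit is open."""
        if error_rate_pct > self.threshold_pct:
            if self._breach_started is None:
                self._breach_started = now
            if now - self._breach_started >= self.hold_seconds:
                self.is_open = True
        else:
            # A healthy sample resets the breach window and closes the circuit.
            self._breach_started = None
            self.is_open = False
        return self.is_open


breaker = ErrorRateBreaker()
# Error rates taken from the external_api_ratelimit fixture; timestamps are illustrative.
states = [breaker.observe(t, rate) for t, rate in [(0, 62.4), (15, 88.1), (30, 94.6)]]
print(states)
```

While the circuit is open, the caller would queue requests locally instead of sending them upstream, as the runbook step describes.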

+ 855 - 0
Co-creation-projects/zjzhou-SREOnCallAgent/main.ipynb

@@ -0,0 +1,855 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# SRE On-Call Agent\n",
+    "> AI-powered incident triage, root cause investigation, and post-mortem generation\n",
+    "\n",
+    "## Project Introduction\n",
+    "\n",
+    "When a production alert fires at 3am, an SRE on-call engineer must:\n",
+    "1. **Triage** — assess severity and plan the investigation\n",
+    "2. **Investigate** — search logs, query metrics, consult runbooks\n",
+    "3. **Write a post-mortem** — root cause, timeline, action items\n",
+    "\n",
+    "This project automates that workflow with a **three-agent AI pipeline**:\n",
+    "\n",
+    "```\n",
+    "Alert JSON\n",
+    "    │\n",
+    "    ▼\n",
+    "┌─────────────────────────────┐\n",
+    "│  Stage 1: TriageAgent       │  Plan-and-Solve\n",
+    "│  \"What should I investigate?\"│  → ordered investigation plan\n",
+    "└─────────────┬───────────────┘\n",
+    "              │\n",
+    "              ▼\n",
+    "┌─────────────────────────────┐\n",
+    "│  Stage 2: InvestigationAgent│  ReAct loop\n",
+    "│  log_search / metric_query  │  → root cause hypothesis\n",
+    "│  / runbook_lookup           │\n",
+    "└─────────────┬───────────────┘\n",
+    "              │\n",
+    "              ▼\n",
+    "┌─────────────────────────────┐\n",
+    "│  Stage 3: PostmortemAgent   │  Reflection\n",
+    "│  draft → critique → revise  │  → final RCA report\n",
+    "└─────────────────────────────┘\n",
+    "```\n",
+    "\n",
+    "## Author Information\n",
+    "- **GitHub**: [@zjzhou](https://github.com/zjzhou)\n",
+    "- **Date**: 2026-04-21\n",
+    "- **HelloAgents Chapter**: 16 (Graduation Project)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "## Part 2: Environment Setup"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Install dependencies into the running kernel's environment\n",
+    "%pip install -q openai fastapi uvicorn pyyaml pydantic python-dotenv"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "✅ Environment loaded\n",
+      "   LLM_MODEL_ID: llama-3.3-70b-versatile\n",
+      "   LLM_BASE_URL: https://api.groq.com/openai/v1\n",
+      "   LLM_API_KEY:  SET\n"
+     ]
+    }
+   ],
+   "source": [
+    "import os\n",
+    "import sys\n",
+    "import json\n",
+    "import time\n",
+    "import re\n",
+    "import ast\n",
+    "import glob\n",
+    "import yaml\n",
+    "from pathlib import Path\n",
+    "from typing import Dict, Any, List\n",
+    "from dotenv import load_dotenv\n",
+    "from openai import OpenAI\n",
+    "\n",
+    "# Add project root to path so src/ imports work\n",
+    "project_root = Path().resolve()\n",
+    "if str(project_root) not in sys.path:\n",
+    "    sys.path.insert(0, str(project_root))\n",
+    "\n",
+    "load_dotenv()\n",
+    "print(\"✅ Environment loaded\")\n",
+    "print(f\"   LLM_MODEL_ID: {os.getenv('LLM_MODEL_ID', 'NOT SET')}\")\n",
+    "print(f\"   LLM_BASE_URL: {os.getenv('LLM_BASE_URL', 'NOT SET')}\")\n",
+    "print(f\"   LLM_API_KEY:  {'SET' if os.getenv('LLM_API_KEY') else 'NOT SET'}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "## Part 3: Tool Definitions\n",
+    "\n",
+    "Three tools give the InvestigationAgent eyes into the incident:\n",
+    "\n",
+    "| Tool | Input | What it does |\n",
+    "|---|---|---|\n",
+    "| `LogSearchTool` | keyword/regex | Searches incident log entries |\n",
+    "| `MetricQueryTool` | metric name keyword | Returns time-series metric data |\n",
+    "| `RunbookLookupTool` | error pattern | Fetches runbook remediation steps |"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "🔧 LogSearchTool — searching for 'pool':\n",
+      "Found 4 log entries matching 'pool':\n",
+      "[2024-01-15T14:00:03Z] [WARN    ] DB connection pool: 8/10 connections in use\n",
+      "[2024-01-15T14:01:15Z] [WARN    ] DB connection pool: 10/10 connections in use, queuing requests\n",
+      "[2024-01-15T14:01:45Z] [ERROR   ] DB connection pool exhausted — waiting for available connection (timeout=30s)\n",
+      "[2024-01-15T14:02:01Z] [ERROR   ] DB connection pool exhausted — waiting for available connection (timeout=30s)\n",
+      "\n",
+      "📊 MetricQueryTool — querying 'db_pool':\n",
+      "Metrics for checkout-service matching 'db_pool':\n",
+      "  db_pool_active_connections: [14:00: 3 | 14:01: 8 | 14:02: 10 | 14:03: 10]\n",
+      "\n",
+      "📖 RunbookLookupTool — pattern 'pool exhausted':\n",
+      "Runbook: checkout-service (v1.2)\n",
+      "\n",
+      "Pattern: 'DB pool exhausted' | Severity: P1\n",
+      "Remediation steps:\n",
+      "  1. Identify slow queries: run SHOW PROCESSLIST on the primary DB; look for queries >5s\n",
+      "  2. Run EXPLAIN on the slowest query to identify full table scans or missing indexes\n",
+      "  3. Temporary mitigation: increase pool_size from 10 to 20 via config hot-reload\n",
+      "  4. Permanent fix: add the missing index. Example: CREATE INDEX idx_orders_user_id ON orders(user_id)\n",
+      "  5. Validate index creation: run EXPLAIN again to confirm index is used\n",
+      "  6. Restart pods once index is verified: kubectl rollout restart deployment/checkout-service\n",
+      "  7. Monitor db_pool_active_connections and http_request_duration_p99 to confirm recovery\n"
+     ]
+    }
+   ],
+   "source": [
+    "from src.tools.log_search_tool import LogSearchTool\n",
+    "from src.tools.metric_query_tool import MetricQueryTool\n",
+    "from src.tools.runbook_tool import RunbookLookupTool\n",
+    "\n",
+    "# --- Quick tool smoke test ---\n",
+    "incident_path = project_root / \"data\" / \"incidents\" / \"db_pool_exhaustion.json\"\n",
+    "with open(incident_path) as f:\n",
+    "    sample_incident = json.load(f)\n",
+    "\n",
+    "log_tool = LogSearchTool(sample_incident)\n",
+    "metric_tool = MetricQueryTool(sample_incident)\n",
+    "runbook_tool = RunbookLookupTool(\n",
+    "    service=\"checkout-service\",\n",
+    "    runbooks_dir=str(project_root / \"data\" / \"runbooks\")\n",
+    ")\n",
+    "\n",
+    "print(\"🔧 LogSearchTool — searching for 'pool':\")\n",
+    "print(log_tool.run(\"pool\"))\n",
+    "print()\n",
+    "print(\"📊 MetricQueryTool — querying 'db_pool':\")\n",
+    "print(metric_tool.run(\"db_pool\"))\n",
+    "print()\n",
+    "print(\"📖 RunbookLookupTool — pattern 'pool exhausted':\")\n",
+    "print(runbook_tool.run(\"pool exhausted\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "## Part 4: Agent Construction\n",
+    "\n",
+    "Three agents implementing different paradigms from the Hello-Agents curriculum:\n",
+    "\n",
+    "- **TriageAgent** — Plan-and-Solve (Chapter 4): LLM creates a structured investigation plan\n",
+    "- **InvestigationAgent** — ReAct (Chapter 4): iterative Reason-Act-Observe loop with tools\n",
+    "- **PostmortemAgent** — Reflection (Chapter 4): draft → critique → revise"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "🧠 Calling llama-3.3-70b-versatile...\n",
+      "LLM OK\n",
+      "\n",
+      "LLM connection test: LLM OK\n",
+      "\n",
+      "Available incident fixtures: ['db_pool_exhaustion', 'memory_leak_oom', 'external_api_ratelimit']\n"
+     ]
+    }
+   ],
+   "source": [
+    "from src.core.llm_client import HelloAgentsLLM\n",
+    "from src.agents.triage_agent import TriageAgent\n",
+    "from src.agents.investigation_agent import InvestigationAgent\n",
+    "from src.agents.postmortem_agent import PostmortemAgent\n",
+    "from src.agents.pipeline import run_pipeline, list_incidents, load_incident\n",
+    "\n",
+    "# Verify LLM connection\n",
+    "llm = HelloAgentsLLM(verbose=True)\n",
+    "test_response = llm.think([{\"role\": \"user\", \"content\": \"Reply with exactly: LLM OK\"}])\n",
+    "print(f\"\\nLLM connection test: {test_response.strip()}\")\n",
+    "\n",
+    "print(f\"\\nAvailable incident fixtures: {list_incidents()}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "## Part 5: Demo — Full Pipeline\n",
+    "\n",
+    "### Demo 1: DB Connection Pool Exhaustion (INC-001)\n",
+    "\n",
+    "**Alert:** `checkout-service` P99 latency 8.3s > threshold 1.0s  \n",
+    "**Root cause (ground truth):** Missing index on `orders.user_id` → full table scan → DB pool exhaustion"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "📣 Incident Alert:\n",
+      "{\n",
+      "  \"incident_id\": \"INC-001\",\n",
+      "  \"service\": \"checkout-service\",\n",
+      "  \"severity\": \"P1\",\n",
+      "  \"alert\": {\n",
+      "    \"metric\": \"http_request_duration_p99\",\n",
+      "    \"value\": 8.3,\n",
+      "    \"threshold\": 1.0,\n",
+      "    \"unit\": \"seconds\",\n",
+      "    \"timestamp\": \"2024-01-15T14:03:22Z\",\n",
+      "    \"description\": \"P99 request latency critically exceeds SLO threshold\"\n",
+      "  }\n",
+      "}\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Show the incident alert\n",
+    "incident_1 = load_incident(\"db_pool_exhaustion\")\n",
+    "print(\"📣 Incident Alert:\")\n",
+    "print(json.dumps({\n",
+    "    \"incident_id\": incident_1[\"incident_id\"],\n",
+    "    \"service\": incident_1[\"service\"],\n",
+    "    \"severity\": incident_1[\"severity\"],\n",
+    "    \"alert\": incident_1[\"alert\"]\n",
+    "}, indent=2))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "============================================================\n",
+      "🚨 STAGE 1: TRIAGE — Generating investigation plan\n",
+      "============================================================\n",
+      "🧠 Calling llama-3.3-70b-versatile...\n",
+      "```python\n",
+      "[\n",
+      "  {\"tool\": \"metric_query\", \"query\": \"http_request_duration_p99\", \"reason\": \"Verify the current value of the metric that triggered the alert\"},\n",
+      "  {\"tool\": \"log_search\", \"query\": \"checkout-service ERROR\", \"reason\": \"Find all error-level log entries from the checkout-service to identify the failure pattern\"},\n",
+      "  {\"tool\": \"metric_query\", \"query\": \"http_request_count\", \"reason\": \"Check if there's an unusual increase in request volume that could be contributing to the latency\"},\n",
+      "  {\"tool\": \"log_search\", \"query\": \"checkout-service WARNING|ERROR\", \"reason\": \"Find all warning and error-level log entries from the checkout-service to identify potential issues\"},\n",
+      "  {\"tool\": \"runbook_lookup\", \"query\": \"high latency checkout-service\", \"reason\": \"Get standard remediation steps for high latency issues in the checkout-service\"}\n",
+      "]\n",
+      "```\n",
+      "\n",
+      "✅ Investigation plan (5 steps):\n",
+      "   1. [metric_query] http_request_duration_p99 — Verify the current value of the metric that triggered the alert\n",
+      "   2. [log_search] checkout-service ERROR — Find all error-level log entries from the checkout-service to identify the failure pattern\n",
+      "   3. [metric_query] http_request_count — Check if there's an unusual increase in request volume that could be contributing to the latency\n",
+      "   4. [log_search] checkout-service WARNING|ERROR — Find all warning and error-level log entries from the checkout-service to identify potential issues\n",
+      "   5. [runbook_lookup] high latency checkout-service — Get standard remediation steps for high latency issues in the checkout-service\n",
+      "\n",
+      "============================================================\n",
+      "🔍 STAGE 2: INVESTIGATION — ReAct tool loop\n",
+      "============================================================\n",
+      "\n",
+      "--- Step 1 ---\n",
+      "🧠 Calling llama-3.3-70b-versatile...\n",
+      "Thought: The incident is related to high latency in the checkout-service, with a P99 request latency of 8.3 seconds, which is critically exceeding the SLO threshold. To start investigating, I need to verify the current value of the metric that triggered the alert to confirm if the issue is still ongoing. This will help me understand if the problem is persistent or if it was a temporary spike.\n",
+      "\n",
+      "Action: metric_query[http_request_duration_p99]\n",
+      "💭 Thought: The incident is related to high latency in the checkout-service, with a P99 request latency of 8.3 seconds, which is critically exceeding the SLO threshold. To start investigating, I need to verify the current value of the metric that triggered the alert to confirm if the issue is still ongoing. This will help me understand if the problem is persistent or if it was a temporary spike.\n",
+      "🔧 Action: metric_query[http_request_duration_p99]\n",
+      "👀 Observation: Metrics for checkout-service matching 'http_request_duration_p99':\n",
+      "  http_request_duration_p99: [14:00: 0.3 | 14:01: 1.2 | 14:02: 4.7 | 14:03: 8.3]\n",
+      "\n",
+      "--- Step 2 ---\n",
+      "🧠 Calling llama-3.3-70b-versatile...\n",
+      "Thought: The current value of http_request_duration_p99 has been verified, and it indeed exceeded the threshold, triggering the alert. The next step is to investigate if there are any error-level log entries from the checkout-service that could indicate the cause of the high latency. This will help identify any failure patterns or potential issues.\n",
+      "\n",
+      "Action: log_search[checkout-service ERROR]\n",
+      "💭 Thought: The current value of http_request_duration_p99 has been verified, and it indeed exceeded the threshold, triggering the alert. The next step is to investigate if there are any error-level log entries from the checkout-service that could indicate the cause of the high latency. This will help identify any failure patterns or potential issues.\n",
+      "🔧 Action: log_search[checkout-service ERROR]\n",
+      "👀 Observation: No log entries found matching 'checkout-service ERROR' in checkout-service logs.\n",
+      "\n",
+      "--- Step 3 ---\n",
+      "🧠 Calling llama-3.3-70b-versatile...\n",
+      "Thought: We have verified that the http_request_duration_p99 metric is currently high, and we didn't find any error-level log entries in the checkout-service logs. The next step is to check if there's an unusual increase in request volume that could be contributing to the latency. This will help us understand if the issue is related to the service's capacity to handle requests or if it's a problem with the service itself.\n",
+      "\n",
+      "Action: metric_query[http_request_count]\n",
+      "💭 Thought: We have verified that the http_request_duration_p99 metric is currently high, and we didn't find any error-level log entries in the checkout-service logs. The next step is to check if there's an unusual increase in request volume that could be contributing to the latency. This will help us understand if the issue is related to the service's capacity to handle requests or if it's a problem with the service itself.\n",
+      "🔧 Action: metric_query[http_request_count]\n",
+      "👀 Observation: No metrics found matching 'http_request_count' for checkout-service.\n",
+      "Available metrics: http_request_duration_p99, db_pool_active_connections, db_query_duration_p99_ms, request_queue_depth\n",
+      "\n",
+      "--- Step 4 ---\n",
+      "🧠 Calling llama-3.3-70b-versatile...\n",
+      "Thought: We have already verified the current value of the metric that triggered the alert, which is the http_request_duration_p99, and it's indeed higher than the threshold. We also searched for error-level log entries from the checkout-service but found none. Additionally, we checked for an unusual increase in request volume by querying the http_request_count metric, but it seems that metric is not available. However, we did find other available metrics such as db_pool_active_connections, db_query_duration_p99_ms, and request_queue_depth. Our next step should be to find all warning and error-level log entries from the checkout-service to identify potential issues.\n",
+      "\n",
+      "Action: log_search[checkout-service WARNING|ERROR]\n",
+      "💭 Thought: We have already verified the current value of the metric that triggered the alert, which is the http_request_duration_p99, and it's indeed higher than the threshold. We also searched for error-level log entries from the checkout-service but found none. Additionally, we checked for an unusual increase in request volume by querying the http_request_count metric, but it seems that metric is not available. However, we did find other available metrics such as db_pool_active_connections, db_query_duration_p99_ms, and request_queue_depth. Our next step should be to find all warning and error-level log entries from the checkout-service to identify potential issues.\n",
+      "🔧 Action: log_search[checkout-service WARNING|ERROR]\n",
+      "👀 Observation: Found 4 log entries matching 'checkout-service WARNING|ERROR':\n",
+      "[2024-01-15T14:01:45Z] [ERROR   ] DB connection pool exhausted — waiting for available connection (timeout=30s)\n",
+      "[2024-01-15T14:02:00Z] [ERROR   ] Query timeout: SELECT * FROM orders WHERE user_id=12345 (full table scan, duration=30s)\n",
+      "[20...\n",
+      "\n",
+      "--- Step 5 ---\n",
+      "🧠 Calling llama-3.3-70b-versatile...\n",
+      "Thought: We have investigated the current value of the metric that triggered the alert, searched for error-level log entries, checked for unusual increases in request volume, and searched for warning and error-level log entries. The error-level log entries suggest a DB connection pool exhaustion issue, which could be contributing to the high latency. To further investigate, we should look up standard remediation steps for high latency issues in the checkout-service, specifically related to DB connection pool exhaustion.\n",
+      "\n",
+      "Action: runbook_lookup[high latency checkout-service DB connection pool exhausted]\n",
+      "💭 Thought: We have investigated the current value of the metric that triggered the alert, searched for error-level log entries, checked for unusual increases in request volume, and searched for warning and error-level log entries. The error-level log entries suggest a DB connection pool exhaustion issue, which could be contributing to the high latency. To further investigate, we should look up standard remediation steps for high latency issues in the checkout-service, specifically related to DB connection pool exhaustion.\n",
+      "🔧 Action: runbook_lookup[high latency checkout-service DB connection pool exhausted]\n",
+      "👀 Observation: Runbook: checkout-service (v1.2)\n",
+      "\n",
+      "Pattern: 'DB pool exhausted' | Severity: P1\n",
+      "Remediation steps:\n",
+      "  1. Identify slow queries: run SHOW PROCESSLIST on the primary DB; look for queries >5s\n",
+      "  2. Run EXPLAIN on the slowest query to identify full table scans or missing indexes\n",
+      "  3. Temporary mitigation: i...\n",
+      "\n",
+      "--- Step 6 ---\n",
+      "🧠 Calling llama-3.3-70b-versatile...\n",
+      "Thought: We have already investigated the current value of the metric that triggered the alert, searched for error-level log entries, checked for unusual increases in request volume, and searched for warning and error-level log entries. The error-level log entries revealed a pattern of DB connection pool exhaustion and query timeouts, which could be contributing to the high latency. The runbook lookup provided remediation steps for high latency issues in the checkout-service, including identifying slow queries and increasing the pool size. However, we still need to verify if there's an unusual increase in request volume that could be contributing to the latency, but since the metric_query for http_request_count didn't return any results, we should try to find an alternative metric that can give us insight into the request volume.\n",
+      "\n",
+      "Action: metric_query[request_queue_depth]\n",
+      "💭 Thought: We have already investigated the current value of the metric that triggered the alert, searched for error-level log entries, checked for unusual increases in request volume, and searched for warning and error-level log entries. The error-level log entries revealed a pattern of DB connection pool exhaustion and query timeouts, which could be contributing to the high latency. The runbook lookup provided remediation steps for high latency issues in the checkout-service, including identifying slow queries and increasing the pool size. However, we still need to verify if there's an unusual increase in request volume that could be contributing to the latency, but since the metric_query for http_request_count didn't return any results, we should try to find an alternative metric that can give us insight into the request volume.\n",
+      "🔧 Action: metric_query[request_queue_depth]\n",
+      "👀 Observation: Metrics for checkout-service matching 'request_queue_depth':\n",
+      "  request_queue_depth: [14:00: 12 | 14:01: 85 | 14:02: 220 | 14:03: 450]\n",
+      "\n",
+      "--- Step 7 ---\n",
+      "🧠 Calling llama-3.3-70b-versatile...\n",
+      "Thought: We have already investigated the current value of the metric that triggered the alert, searched for error-level log entries, checked for unusual increases in request volume, and searched for warning and error-level log entries. The error-level log entries revealed a pattern of DB connection pool exhaustion and query timeouts, which could be contributing to the high latency. The request queue depth metric also showed a significant increase, which could be related to the DB connection pool exhaustion. We should now look up standard remediation steps for high latency issues in the checkout-service, specifically for the DB connection pool exhausted error pattern.\n",
+      "\n",
+      "Action: runbook_lookup[high latency checkout-service DB connection pool exhausted]\n",
+      "💭 Thought: We have already investigated the current value of the metric that triggered the alert, searched for error-level log entries, checked for unusual increases in request volume, and searched for warning and error-level log entries. The error-level log entries revealed a pattern of DB connection pool exhaustion and query timeouts, which could be contributing to the high latency. The request queue depth metric also showed a significant increase, which could be related to the DB connection pool exhaustion. We should now look up standard remediation steps for high latency issues in the checkout-service, specifically for the DB connection pool exhausted error pattern.\n",
+      "🔧 Action: runbook_lookup[high latency checkout-service DB connection pool exhausted]\n",
+      "👀 Observation: Runbook: checkout-service (v1.2)\n",
+      "\n",
+      "Pattern: 'DB pool exhausted' | Severity: P1\n",
+      "Remediation steps:\n",
+      "  1. Identify slow queries: run SHOW PROCESSLIST on the primary DB; look for queries >5s\n",
+      "  2. Run EXPLAIN on the slowest query to identify full table scans or missing indexes\n",
+      "  3. Temporary mitigation: i...\n",
+      "\n",
+      "--- Step 8 ---\n",
+      "🧠 Calling llama-3.3-70b-versatile...\n",
+      "Thought: We have already investigated the current value of the metric that triggered the alert, searched for error-level log entries, checked for unusual increases in request volume, and searched for warning and error-level log entries. The error-level log entries revealed a pattern of DB connection pool exhaustion and query timeouts. The request_queue_depth metric is also increasing, which could be contributing to the latency. We should now look up standard remediation steps for high latency issues in the checkout-service, specifically focusing on the DB connection pool exhaustion pattern.\n",
+      "\n",
+      "Action: runbook_lookup[high latency checkout-service DB connection pool exhausted]\n",
+      "💭 Thought: We have already investigated the current value of the metric that triggered the alert, searched for error-level log entries, checked for unusual increases in request volume, and searched for warning and error-level log entries. The error-level log entries revealed a pattern of DB connection pool exhaustion and query timeouts. The request_queue_depth metric is also increasing, which could be contributing to the latency. We should now look up standard remediation steps for high latency issues in the checkout-service, specifically focusing on the DB connection pool exhaustion pattern.\n",
+      "🔧 Action: runbook_lookup[high latency checkout-service DB connection pool exhausted]\n",
+      "👀 Observation: Runbook: checkout-service (v1.2)\n",
+      "\n",
+      "Pattern: 'DB pool exhausted' | Severity: P1\n",
+      "Remediation steps:\n",
+      "  1. Identify slow queries: run SHOW PROCESSLIST on the primary DB; look for queries >5s\n",
+      "  2. Run EXPLAIN on the slowest query to identify full table scans or missing indexes\n",
+      "  3. Temporary mitigation: i...\n",
+      "\n",
+      "--- Step 9 ---\n",
+      "🧠 Calling llama-3.3-70b-versatile...\n",
+      "Thought: We have already completed several steps in the investigation plan, including verifying the current value of http_request_duration_p99, searching for error-level log entries, checking for unusual increases in request volume, and searching for warning and error-level log entries. The log search revealed error messages related to DB connection pool exhaustion and query timeouts, which suggests that the database may be the bottleneck. The request_queue_depth metric also showed a significant increase, indicating a potential overload. To further investigate, we should follow the next step in the plan, which is to get standard remediation steps for high latency issues in the checkout-service.\n",
+      "\n",
+      "Action: runbook_lookup[high latency checkout-service]\n",
+      "💭 Thought: We have already completed several steps in the investigation plan, including verifying the current value of http_request_duration_p99, searching for error-level log entries, checking for unusual increases in request volume, and searching for warning and error-level log entries. The log search revealed error messages related to DB connection pool exhaustion and query timeouts, which suggests that the database may be the bottleneck. The request_queue_depth metric also showed a significant increase, indicating a potential overload. To further investigate, we should follow the next step in the plan, which is to get standard remediation steps for high latency issues in the checkout-service.\n",
+      "🔧 Action: runbook_lookup[high latency checkout-service]\n",
+      "👀 Observation: Runbook: checkout-service (v1.2)\n",
+      "\n",
+      "Pattern: 'DB pool exhausted' | Severity: P1\n",
+      "Remediation steps:\n",
+      "  1. Identify slow queries: run SHOW PROCESSLIST on the primary DB; look for queries >5s\n",
+      "  2. Run EXPLAIN on the slowest query to identify full table scans or missing indexes\n",
+      "  3. Temporary mitigation: i...\n",
+      "\n",
+      "--- Step 10 ---\n",
+      "🧠 Calling llama-3.3-70b-versatile...\n",
+      "Thought: We have already investigated the current value of the metric that triggered the alert, which is http_request_duration_p99, and found that it has increased significantly. We also searched for error-level log entries from the checkout-service and found none initially, but when we searched for warning and error-level log entries, we found several entries indicating a DB connection pool exhaustion issue. Additionally, we checked the request count metric but found that it was not available, however, we did find the request_queue_depth metric which showed a significant increase. Given this information, our next step should be to look up standard remediation steps for high latency issues in the checkout-service, specifically for the DB connection pool exhausted error pattern.\n",
+      "\n",
+      "Action: runbook_lookup[high latency checkout-service DB connection pool exhausted]\n",
+      "💭 Thought: We have already investigated the current value of the metric that triggered the alert, which is http_request_duration_p99, and found that it has increased significantly. We also searched for error-level log entries from the checkout-service and found none initially, but when we searched for warning and error-level log entries, we found several entries indicating a DB connection pool exhaustion issue. Additionally, we checked the request count metric but found that it was not available, however, we did find the request_queue_depth metric which showed a significant increase. Given this information, our next step should be to look up standard remediation steps for high latency issues in the checkout-service, specifically for the DB connection pool exhausted error pattern.\n",
+      "🔧 Action: runbook_lookup[high latency checkout-service DB connection pool exhausted]\n",
+      "👀 Observation: Runbook: checkout-service (v1.2)\n",
+      "\n",
+      "Pattern: 'DB pool exhausted' | Severity: P1\n",
+      "Remediation steps:\n",
+      "  1. Identify slow queries: run SHOW PROCESSLIST on the primary DB; look for queries >5s\n",
+      "  2. Run EXPLAIN on the slowest query to identify full table scans or missing indexes\n",
+      "  3. Temporary mitigation: i...\n",
+      "\n",
+      "--- Step 11 ---\n",
+      "🧠 Calling llama-3.3-70b-versatile...\n",
+      "Thought: We have already investigated the current value of the metric that triggered the alert, which is http_request_duration_p99, and found that it has increased significantly. We also searched for error-level log entries from the checkout-service but found none initially. However, when we searched for both warning and error-level log entries, we found several entries indicating a DB connection pool exhaustion issue. Additionally, we checked the request count metric but found that the actual metric available is request_queue_depth, which is also showing an unusual increase. Given this information, our next step should be to follow the investigation plan and check for unusual increases in request volume that could be contributing to the latency by querying the request_queue_depth metric, which we have already done. Now, we should look up standard remediation steps for high latency issues in the checkout-service, specifically focusing on the DB connection pool exhausted pattern.\n",
+      "\n",
+      "Action: runbook_lookup[high latency checkout-service DB connection pool exhausted]\n",
+      "💭 Thought: We have already investigated the current value of the metric that triggered the alert, which is http_request_duration_p99, and found that it has increased significantly. We also searched for error-level log entries from the checkout-service but found none initially. However, when we searched for both warning and error-level log entries, we found several entries indicating a DB connection pool exhaustion issue. Additionally, we checked the request count metric but found that the actual metric available is request_queue_depth, which is also showing an unusual increase. Given this information, our next step should be to follow the investigation plan and check for unusual increases in request volume that could be contributing to the latency by querying the request_queue_depth metric, which we have already done. Now, we should look up standard remediation steps for high latency issues in the checkout-service, specifically focusing on the DB connection pool exhausted pattern.\n",
+      "🔧 Action: runbook_lookup[high latency checkout-service DB connection pool exhausted]\n",
+      "👀 Observation: Runbook: checkout-service (v1.2)\n",
+      "\n",
+      "Pattern: 'DB pool exhausted' | Severity: P1\n",
+      "Remediation steps:\n",
+      "  1. Identify slow queries: run SHOW PROCESSLIST on the primary DB; look for queries >5s\n",
+      "  2. Run EXPLAIN on the slowest query to identify full table scans or missing indexes\n",
+      "  3. Temporary mitigation: i...\n",
+      "\n",
+      "--- Step 12 ---\n",
+      "🧠 Calling llama-3.3-70b-versatile...\n",
+      "Thought: The incident context indicates a P1 severity issue with the checkout-service, where the P99 request latency has exceeded the SLO threshold. The current value of http_request_duration_p99 has been verified to be 8.3 seconds, which is above the threshold. The next step is to find all error-level log entries from the checkout-service to identify the failure pattern.\n",
+      "\n",
+      "Action: log_search[checkout-service ERROR]\n",
+      "💭 Thought: The incident context indicates a P1 severity issue with the checkout-service, where the P99 request latency has exceeded the SLO threshold. The current value of http_request_duration_p99 has been verified to be 8.3 seconds, which is above the threshold. The next step is to find all error-level log entries from the checkout-service to identify the failure pattern.\n",
+      "🔧 Action: log_search[checkout-service ERROR]\n",
+      "👀 Observation: No log entries found matching 'checkout-service ERROR' in checkout-service logs.\n",
+      "\n",
+      "============================================================\n",
+      "📝 STAGE 3: POST-MORTEM — Reflection (draft → critique → revise)\n",
+      "============================================================\n",
+      "🧠 Calling llama-3.3-70b-versatile...\n",
+      "## Executive Summary\n",
+      "On January 15, 2024, at 14:03:22Z, the checkout-service experienced a critical incident with P99 request latency exceeding the SLO threshold of 1.0 seconds, reaching 8.3 seconds. This incident affected 2847 users and was resolved by identifying and mitigating a DB connection pool exhaustion issue. The root cause was traced back to a missing index on the orders table, leading to slow queries and eventual pool exhaustion.\n",
+      "\n",
+      "## Incident Timeline\n",
+      "* 14:00:00Z - http_request_duration_p99 metric at 0.3 seconds, within normal range\n",
+      "* 14:01:00Z - http_request_duration_p99 metric increases to 1.2 seconds\n",
+      "* 14:01:45Z - [ERROR] DB connection pool exhausted — waiting for available connection (timeout=30s)\n",
+      "* 14:02:00Z - [ERROR] Query timeout: SELECT * FROM orders WHERE user_id=12345 (full table scan, duration=30s)\n",
+      "* 14:02:01Z - [ERROR] DB connection pool exhausted — waiting for available connection (timeout=30s)\n",
+      "* 14:02:30Z - [ERROR] SQLSTATE[HY000]: General error: too many connections\n",
+      "* 14:03:22Z - Alert triggered for P99 request latency exceeding SLO threshold\n",
+      "\n",
+      "## Root Cause Analysis\n",
+      "1. Symptom: P99 request latency exceeded the SLO threshold of 1.0 seconds.\n",
+      "Why? The http_request_duration_p99 metric was increasing over time, indicating a performance issue.\n",
+      "2. Why was the http_request_duration_p99 metric increasing? The request_queue_depth metric was also increasing, reaching 450, indicating a backlog of requests.\n",
+      "3. Why was the request_queue_depth metric increasing? The DB connection pool was exhausted, causing requests to wait for available connections.\n",
+      "4. Why was the DB connection pool exhausted? Slow queries, such as the one at 14:02:00Z, were causing the pool to be depleted.\n",
+      "5. Why were slow queries occurring? A missing index on the orders table was leading to full table scans, resulting in slow query performance.\n",
+      "\n",
+      "## Impact Assessment\n",
+      "* Severity: P1\n",
+      "* Affected users: 2847\n",
+      "* Duration estimate: Approximately 3 minutes (from 14:00:00Z to 14:03:22Z)\n",
+      "* Business impact: The incident resulted in a significant increase in request latency, potentially leading to a poor user experience and lost sales.\n",
+      "\n",
+      "## Immediate Remediation Steps\n",
+      "1. Increase the DB connection pool size from 10 to 20 via config hot-reload to provide temporary relief.\n",
+      "2. Identify slow queries using SHOW PROCESSLIST on the primary DB and look for queries >5s.\n",
+      "3. Run EXPLAIN on the slowest query to identify full table scans or missing indexes.\n",
+      "4. Add a missing index on the orders table, such as CREATE INDEX idx_orders_user_id ON orders(user_id).\n",
+      "5. Validate index creation by running EXPLAIN again to confirm the index is used.\n",
+      "\n",
+      "## Action Items\n",
+      "| Action | Owner | Due Date | Priority |\n",
+      "| --- | --- | --- | --- |\n",
+      "| Add missing index on orders table | Database Team | 2024-01-16 | High |\n",
+      "| Implement pod autoscaling for checkout-service | Infrastructure Team | 2024-01-17 | Medium |\n",
+      "| Enable slow query logging on DB | Database Team | 2024-01-18 | Low |\n",
+      "| Review and optimize DB queries | Database Team | 2024-01-19 | Medium |\n",
+      "\n",
+      "## Lessons Learned\n",
+      "* The importance of monitoring DB connection pool usage and request queue depth to detect potential performance issues.\n",
+      "* The need for regular review and optimization of DB queries to prevent slow query performance.\n",
+      "* The value of having a runbook with clear remediation steps for common issues, such as DB connection pool exhaustion.\n",
+      "* The importance of implementing pod autoscaling to handle increased load and prevent request queue buildup.\n",
+      "\n",
+      "✍️  Draft post-mortem written.\n",
+      "🧠 Calling llama-3.3-70b-versatile...\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Run the full pipeline and time it with a monotonic clock\n",
+    "start = time.perf_counter()\n",
+    "result_1 = run_pipeline(\"db_pool_exhaustion\", verbose=True)\n",
+    "elapsed_1 = round(time.perf_counter() - start, 1)\n",
+    "print(f\"\\n⏱️  Pipeline completed in {elapsed_1}s\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 38,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/markdown": [],
+      "text/plain": [
+       "<IPython.core.display.Markdown object>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "# Display the generated post-mortem report\n",
+    "from IPython.display import Markdown, display\n",
+    "display(Markdown(result_1[\"report\"]))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Demo 2: External API Rate Limit Cascade (INC-003)\n",
+    "\n",
+    "**Alert:** `payment-service` success rate 23.4% < threshold 95%  \n",
+    "**Root cause (ground truth):** Promotional traffic spike + no exponential backoff → retry storm → Stripe 429 cascade"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 39,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "============================================================\n",
+      "🚨 STAGE 1: TRIAGE — Generating investigation plan\n",
+      "============================================================\n",
+      "🧠 Calling gpt-4o-mini...\n",
+      "❌ LLM API error: Error code: 403 - {'error': {'message': 'Your account balance is insufficient. Please recharge your account to continue using the API. (tid: 2026042706261177927272665835401)', 'type': 'Aihubmix_api_error', 'param': '', 'code': 'insufficient_user_quota'}}\n",
+      "⚠️  Could not parse structured plan; using fallback plan.\n",
+      "\n",
+      "============================================================\n",
+      "🔍 STAGE 2: INVESTIGATION — ReAct tool loop\n",
+      "============================================================\n",
+      "\n",
+      "--- Step 1 ---\n",
+      "🧠 Calling gpt-4o-mini...\n",
+      "❌ LLM API error: Error code: 403 - {'error': {'message': 'Your account balance is insufficient. Please recharge your account to continue using the API. (tid: 2026042706261185271520637880320)', 'type': 'Aihubmix_api_error', 'param': '', 'code': 'insufficient_user_quota'}}\n",
+      "⚠️  Empty LLM response, stopping.\n",
+      "\n",
+      "============================================================\n",
+      "📝 STAGE 3: POST-MORTEM — Reflection (draft → critique → revise)\n",
+      "============================================================\n",
+      "🧠 Calling gpt-4o-mini...\n",
+      "❌ LLM API error: Error code: 403 - {'error': {'message': 'Your account balance is insufficient. Please recharge your account to continue using the API. (tid: 2026042706261192673164517328253)', 'type': 'Aihubmix_api_error', 'param': '', 'code': 'insufficient_user_quota'}}\n",
+      "\n",
+      "✍️  Draft post-mortem written.\n",
+      "🧠 Calling gpt-4o-mini...\n",
+      "❌ LLM API error: Error code: 403 - {'error': {'message': 'Your account balance is insufficient. Please recharge your account to continue using the API. (tid: 2026042706261199902531542463148)', 'type': 'Aihubmix_api_error', 'param': '', 'code': 'insufficient_user_quota'}}\n",
+      "\n",
+      "🔍 Critique (revision 1):\n",
+      "...\n",
+      "   Quality score: 7/10\n",
+      "   Revising post-mortem (score 7 < 8)...\n",
+      "🧠 Calling gpt-4o-mini...\n",
+      "❌ LLM API error: Error code: 403 - {'error': {'message': 'Your account balance is insufficient. Please recharge your account to continue using the API. (tid: 202604270626127809566451572748)', 'type': 'Aihubmix_api_error', 'param': '', 'code': 'insufficient_user_quota'}}\n",
+      "   Revision 1 complete.\n",
+      "\n",
+      "✅ Final post-mortem ready.\n",
+      "\n",
+      "⏱️  Pipeline completed in 0.4s\n"
+     ]
+    },
+    {
+     "data": {
+      "text/markdown": [],
+      "text/plain": [
+       "<IPython.core.display.Markdown object>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "start = time.time()\n",
+    "result_3 = run_pipeline(\"external_api_ratelimit\", verbose=True)\n",
+    "elapsed_3 = round(time.time() - start, 1)\n",
+    "print(f\"\\n⏱️  Pipeline completed in {elapsed_3}s\")\n",
+    "display(Markdown(result_3[\"report\"]))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "## Part 6: Performance Evaluation\n",
+    "\n",
+    "Run all 3 incident fixtures and measure pipeline speed and root-cause accuracy (a keyword match against a ground-truth phrase)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 40,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "============================================================\n",
+      "🚨 STAGE 1: TRIAGE — Generating investigation plan\n",
+      "============================================================\n",
+      "❌ LLM API error: Error code: 403 - {'error': {'message': 'Your account balance is insufficient. Please recharge your account to continue using the API. (tid: 2026042706261223482086656223927)', 'type': 'Aihubmix_api_error', 'param': '', 'code': 'insufficient_user_quota'}}\n",
+      "⚠️  Could not parse structured plan; using fallback plan.\n",
+      "\n",
+      "============================================================\n",
+      "🔍 STAGE 2: INVESTIGATION — ReAct tool loop\n",
+      "============================================================\n",
+      "\n",
+      "--- Step 1 ---\n",
+      "❌ LLM API error: Error code: 403 - {'error': {'message': 'Your account balance is insufficient. Please recharge your account to continue using the API. (tid: 2026042706261231052236455543539)', 'type': 'Aihubmix_api_error', 'param': '', 'code': 'insufficient_user_quota'}}\n",
+      "⚠️  Empty LLM response, stopping.\n",
+      "\n",
+      "============================================================\n",
+      "📝 STAGE 3: POST-MORTEM — Reflection (draft → critique → revise)\n",
+      "============================================================\n",
+      "❌ LLM API error: Error code: 403 - {'error': {'message': 'Your account balance is insufficient. Please recharge your account to continue using the API. (tid: 2026042706261240102403213861415)', 'type': 'Aihubmix_api_error', 'param': '', 'code': 'insufficient_user_quota'}}\n",
+      "\n",
+      "✍️  Draft post-mortem written.\n",
+      "❌ LLM API error: Error code: 403 - {'error': {'message': 'Your account balance is insufficient. Please recharge your account to continue using the API. (tid: 2026042706261247825417010034927)', 'type': 'Aihubmix_api_error', 'param': '', 'code': 'insufficient_user_quota'}}\n",
+      "\n",
+      "🔍 Critique (revision 1):\n",
+      "...\n",
+      "   Quality score: 7/10\n",
+      "   Revising post-mortem (score 7 < 8)...\n",
+      "❌ LLM API error: Error code: 403 - {'error': {'message': 'Your account balance is insufficient. Please recharge your account to continue using the API. (tid: 2026042706261256116126829506230)', 'type': 'Aihubmix_api_error', 'param': '', 'code': 'insufficient_user_quota'}}\n",
+      "   Revision 1 complete.\n",
+      "\n",
+      "✅ Final post-mortem ready.\n",
+      "\n",
+      "============================================================\n",
+      "🚨 STAGE 1: TRIAGE — Generating investigation plan\n",
+      "============================================================\n",
+      "❌ LLM API error: Error code: 403 - {'error': {'message': 'Your account balance is insufficient. Please recharge your account to continue using the API. (tid: 2026042706261273255223804179156)', 'type': 'Aihubmix_api_error', 'param': '', 'code': 'insufficient_user_quota'}}\n",
+      "⚠️  Could not parse structured plan; using fallback plan.\n",
+      "\n",
+      "============================================================\n",
+      "🔍 STAGE 2: INVESTIGATION — ReAct tool loop\n",
+      "============================================================\n",
+      "\n",
+      "--- Step 1 ---\n",
+      "❌ LLM API error: Error code: 403 - {'error': {'message': 'Your account balance is insufficient. Please recharge your account to continue using the API. (tid: 2026042706261281273660804723994)', 'type': 'Aihubmix_api_error', 'param': '', 'code': 'insufficient_user_quota'}}\n",
+      "⚠️  Empty LLM response, stopping.\n",
+      "\n",
+      "============================================================\n",
+      "📝 STAGE 3: POST-MORTEM — Reflection (draft → critique → revise)\n",
+      "============================================================\n",
+      "❌ LLM API error: Error code: 403 - {'error': {'message': 'Your account balance is insufficient. Please recharge your account to continue using the API. (tid: 2026042706261289003520739111729)', 'type': 'Aihubmix_api_error', 'param': '', 'code': 'insufficient_user_quota'}}\n",
+      "\n",
+      "✍️  Draft post-mortem written.\n",
+      "❌ LLM API error: Error code: 403 - {'error': {'message': 'Your account balance is insufficient. Please recharge your account to continue using the API. (tid: 2026042706261297094600607657398)', 'type': 'Aihubmix_api_error', 'param': '', 'code': 'insufficient_user_quota'}}\n",
+      "\n",
+      "🔍 Critique (revision 1):\n",
+      "...\n",
+      "   Quality score: 7/10\n",
+      "   Revising post-mortem (score 7 < 8)...\n",
+      "❌ LLM API error: Error code: 403 - {'error': {'message': 'Your account balance is insufficient. Please recharge your account to continue using the API. (tid: 202604270626134677146634593157)', 'type': 'Aihubmix_api_error', 'param': '', 'code': 'insufficient_user_quota'}}\n",
+      "   Revision 1 complete.\n",
+      "\n",
+      "✅ Final post-mortem ready.\n",
+      "\n",
+      "============================================================\n",
+      "🚨 STAGE 1: TRIAGE — Generating investigation plan\n",
+      "============================================================\n",
+      "❌ LLM API error: Error code: 403 - {'error': {'message': 'Your account balance is insufficient. Please recharge your account to continue using the API. (tid: 2026042706261320092757163126141)', 'type': 'Aihubmix_api_error', 'param': '', 'code': 'insufficient_user_quota'}}\n",
+      "⚠️  Could not parse structured plan; using fallback plan.\n",
+      "\n",
+      "============================================================\n",
+      "🔍 STAGE 2: INVESTIGATION — ReAct tool loop\n",
+      "============================================================\n",
+      "\n",
+      "--- Step 1 ---\n",
+      "❌ LLM API error: Error code: 403 - {'error': {'message': 'Your account balance is insufficient. Please recharge your account to continue using the API. (tid: 2026042706261328430175304565034)', 'type': 'Aihubmix_api_error', 'param': '', 'code': 'insufficient_user_quota'}}\n",
+      "⚠️  Empty LLM response, stopping.\n",
+      "\n",
+      "============================================================\n",
+      "📝 STAGE 3: POST-MORTEM — Reflection (draft → critique → revise)\n",
+      "============================================================\n",
+      "❌ LLM API error: Error code: 403 - {'error': {'message': 'Your account balance is insufficient. Please recharge your account to continue using the API. (tid: 2026042706261336639858400171735)', 'type': 'Aihubmix_api_error', 'param': '', 'code': 'insufficient_user_quota'}}\n",
+      "\n",
+      "✍️  Draft post-mortem written.\n",
+      "❌ LLM API error: Error code: 403 - {'error': {'message': 'Your account balance is insufficient. Please recharge your account to continue using the API. (tid: 2026042706261344133585616970293)', 'type': 'Aihubmix_api_error', 'param': '', 'code': 'insufficient_user_quota'}}\n",
+      "\n",
+      "🔍 Critique (revision 1):\n",
+      "...\n",
+      "   Quality score: 7/10\n",
+      "   Revising post-mortem (score 7 < 8)...\n",
+      "❌ LLM API error: Error code: 403 - {'error': {'message': 'Your account balance is insufficient. Please recharge your account to continue using the API. (tid: 2026042706261351413276818741033)', 'type': 'Aihubmix_api_error', 'param': '', 'code': 'insufficient_user_quota'}}\n",
+      "   Revision 1 complete.\n",
+      "\n",
+      "✅ Final post-mortem ready.\n",
+      "\n",
+      "=== Evaluation Results ===\n",
+      "  [❌] db_pool_exhaustion             |   0.5s | ...\n",
+      "  [❌] memory_leak_oom                |   0.5s | ...\n",
+      "  [❌] external_api_ratelimit         |   0.5s | ...\n",
+      "\n",
+      "Accuracy: 0%   Avg pipeline time: 0.5s\n"
+     ]
+    }
+   ],
+   "source": [
+    "GROUND_TRUTH = {\n",
+    "    \"db_pool_exhaustion\":   \"missing index\",\n",
+    "    \"memory_leak_oom\":      \"no TTL\",\n",
+    "    \"external_api_ratelimit\": \"retry storm\",\n",
+    "}\n",
+    "\n",
+    "eval_results = []\n",
+    "for incident_id in list_incidents():\n",
+    "    t0 = time.time()\n",
+    "    res = run_pipeline(incident_id, verbose=False)\n",
+    "    elapsed = round(time.time() - t0, 1)\n",
+    "\n",
+    "    ground_truth_keyword = GROUND_TRUTH.get(incident_id, \"\")\n",
+    "    root_cause_text = res[\"findings\"].get(\"root_cause\", \"\").lower()\n",
+    "    correct = ground_truth_keyword.lower() in root_cause_text\n",
+    "\n",
+    "    eval_results.append({\n",
+    "        \"incident_id\": incident_id,\n",
+    "        \"service\": res[\"service\"],\n",
+    "        \"severity\": res[\"severity\"],\n",
+    "        \"elapsed_s\": elapsed,\n",
+    "        \"root_cause_correct\": \"✅\" if correct else \"❌\",\n",
+    "        \"root_cause_found\": res[\"findings\"].get(\"root_cause\", \"\")[:80] + \"...\",\n",
+    "    })\n",
+    "\n",
+    "print(\"\\n=== Evaluation Results ===\")\n",
+    "for r in eval_results:\n",
+    "    print(f\"  [{r['root_cause_correct']}] {r['incident_id']:30s} \"\n",
+    "          f\"| {r['elapsed_s']:5.1f}s | {r['root_cause_found']}\")\n",
+    "\n",
+    "accuracy = sum(1 for r in eval_results if r[\"root_cause_correct\"] == \"✅\") / len(eval_results)\n",
+    "avg_time = sum(r[\"elapsed_s\"] for r in eval_results) / len(eval_results)\n",
+    "print(f\"\\nAccuracy: {accuracy:.0%}   Avg pipeline time: {avg_time:.1f}s\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "## Part 7: Summary and Outlook\n",
+    "\n",
+    "### What We Built\n",
+    "- **TriageAgent (Plan-and-Solve)**: converts a raw alert into an ordered investigation plan, with the aim of reducing hallucinated tool calls in the ReAct stage\n",
+    "- **InvestigationAgent (ReAct)**: iterates through log search → metric queries → runbook lookup to arrive at a root cause with evidence\n",
+    "- **PostmortemAgent (Reflection)**: produces a structured RCA report and self-critiques it against quality criteria before finalizing\n",
+    "- **FastAPI backend**: exposes the pipeline as a REST API, ready for frontend integration\n",
+    "\n",
+    "### Agent Paradigms Demonstrated\n",
+    "| Agent | Paradigm | Chapter Reference |\n",
+    "|---|---|---|\n",
+    "| TriageAgent | Plan-and-Solve | Chapter 4 |\n",
+    "| InvestigationAgent | ReAct | Chapter 4 |\n",
+    "| PostmortemAgent | Reflection | Chapter 4 |\n",
+    "\n",
+    "### Challenges and Lessons\n",
+    "1. **Structured output parsing**: LLMs don't always return valid JSON/Python lists. Robust parsing with fallback logic is essential.\n",
+    "2. **Tool design**: Tool descriptions must be precise — ambiguous descriptions cause the ReAct agent to call the wrong tool.\n",
+    "3. **Context window management**: Passing the full investigation history in each ReAct turn grows quickly. In production, use a sliding window or summarization.\n",
+    "4. **Reflection threshold**: The Reflection pattern only adds value if the critique is rigorous. Vague critique prompts produce vague improvements.\n",
+    "\n",
+    "### Future Plans\n",
+    "- [ ] **SSE streaming**: stream agent reasoning steps to a frontend in real-time\n",
+    "- [ ] **Vue/React frontend**: incident selector UI + live reasoning trace + markdown report viewer\n",
+    "- [ ] **Real log ingestion**: connect to actual log aggregators (Loki, CloudWatch, Datadog)\n",
+    "- [ ] **Memory across incidents**: vector store for past RCA reports to accelerate future investigations\n",
+    "- [ ] **Automated runbook execution**: give the agent permission to run safe remediation commands (kubectl, SQL index creation)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
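
The evaluation cell in the notebook above marks a run as correct when a ground-truth keyword appears in the `root_cause` string. Below is a minimal sketch of that scoring, isolated from the pipeline so it can be checked without an LLM. The helper name `score_root_causes` is hypothetical and not part of this PR. It also counts empty root causes separately, on the assumption that an empty result means the pipeline produced no conclusion (as in the recorded run, where every API call returned 403) rather than a wrong one.

```python
from typing import Dict, Tuple

# Ground-truth keywords copied from the notebook's evaluation cell.
GROUND_TRUTH = {
    "db_pool_exhaustion": "missing index",
    "memory_leak_oom": "no TTL",
    "external_api_ratelimit": "retry storm",
}


def score_root_causes(root_causes: Dict[str, str]) -> Tuple[float, int]:
    """Return (accuracy, empty_count) for a mapping of incident_id -> root cause text.

    Hypothetical helper: a result is correct when the ground-truth keyword
    occurs (case-insensitively) in the root cause text.
    """
    correct = 0
    empty = 0
    for incident_id, keyword in GROUND_TRUTH.items():
        text = root_causes.get(incident_id, "").strip().lower()
        if not text:
            empty += 1  # no conclusion at all, e.g. the LLM call failed
        elif keyword.lower() in text:
            correct += 1
    return correct / len(GROUND_TRUTH), empty
```

Reporting the empty count next to the accuracy makes it easier to tell a failed run apart from a run that reached the wrong diagnosis.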

+ 12 - 0
Co-creation-projects/zjzhou-SREOnCallAgent/requirements.txt

@@ -0,0 +1,12 @@
+# LLM — pick one or keep both; the client auto-selects based on env vars
+openai>=1.0.0          # for AIHubmix / ModelScope / OpenAI
+anthropic>=0.40.0      # for Anthropic API (claude-opus-4-7, etc.)
+
+# Web framework (API server)
+fastapi>=0.109.0
+uvicorn>=0.27.0
+
+# Data / config
+pyyaml>=6.0.0
+pydantic>=2.0.0
+python-dotenv>=1.0.0

+ 0 - 0
Co-creation-projects/zjzhou-SREOnCallAgent/src/__init__.py


+ 0 - 0
Co-creation-projects/zjzhou-SREOnCallAgent/src/agents/__init__.py


+ 190 - 0
Co-creation-projects/zjzhou-SREOnCallAgent/src/agents/investigation_agent.py

@@ -0,0 +1,190 @@
+import re
+import json
+import os
+from typing import Dict, Any, List
+
+from src.core.llm_client import HelloAgentsLLM
+from src.tools.log_search_tool import LogSearchTool
+from src.tools.metric_query_tool import MetricQueryTool
+from src.tools.runbook_tool import RunbookLookupTool
+
+REACT_PROMPT = """
+You are an expert SRE investigating a production incident using the ReAct (Reason + Act) pattern.
+
+INCIDENT CONTEXT:
+{incident_summary}
+
+INVESTIGATION PLAN (steps to follow):
+{plan}
+
+AVAILABLE TOOLS:
+{tools}
+
+INVESTIGATION HISTORY SO FAR:
+{history}
+
+INSTRUCTIONS:
+Follow the investigation plan step by step. For each step:
+1. Thought: Reason about what you know and what to investigate next.
+2. Action: Call ONE tool using this EXACT format: tool_name[query string]
+   - log_search[keyword or regex]
+   - metric_query[metric name keyword]
+   - runbook_lookup[error pattern keyword]
+3. After all plan steps are complete, write: Finish[your root cause conclusion]
+
+Your Finish conclusion should be 2-3 sentences summarizing:
+- The root cause
+- The evidence that confirmed it
+- The recommended immediate action
+
+Respond with exactly:
+Thought: <your reasoning>
+Action: <tool_name>[<query>]
+
+OR when done:
+Thought: <final reasoning>
+Action: Finish[<root cause conclusion>]
+"""
+
+
+class InvestigationAgent:
+    """
+    ReAct agent: executes the investigation plan using log/metric/runbook tools.
+
+    This is the second stage of the pipeline. It iterates through the triage plan,
+    calls tools, observes results, and concludes with a root cause hypothesis.
+    """
+
+    def __init__(self, llm: HelloAgentsLLM, incident: Dict[str, Any], runbooks_dir: str):
+        self.llm = llm
+        self.incident = incident
+
+        log_tool = LogSearchTool(incident)
+        metric_tool = MetricQueryTool(incident)
+        runbook_tool = RunbookLookupTool(
+            service=incident["service"],
+            runbooks_dir=runbooks_dir,
+        )
+
+        self.tools = {
+            "log_search": log_tool,
+            "metric_query": metric_tool,
+            "runbook_lookup": runbook_tool,
+        }
+
+    def run(self, plan: List[Dict[str, str]], max_steps: int = 12) -> Dict[str, Any]:
+        history: List[str] = []
+        called_actions: set = set()  # deduplicate repeated tool calls
+        findings: Dict[str, Any] = {
+            "evidence": [],
+            "root_cause": "",
+            "runbook_steps": [],
+        }
+
+        incident_summary = json.dumps(
+            {
+                "incident_id": self.incident["incident_id"],
+                "service": self.incident["service"],
+                "severity": self.incident["severity"],
+                "alert": self.incident["alert"],
+                "affected_users": self.incident.get("affected_users", 0),
+            },
+            indent=2,
+        )
+
+        tools_desc = "\n".join(
+            f"- {t.name}: {t.description}" for t in self.tools.values()
+        )
+
+        plan_text = "\n".join(
+            f"{i+1}. [{s['tool']}] {s['query']} — {s['reason']}"
+            for i, s in enumerate(plan)
+        )
+
+        print("\n" + "=" * 60)
+        print("🔍 STAGE 2: INVESTIGATION — ReAct tool loop")
+        print("=" * 60)
+
+        for step_num in range(1, max_steps + 1):
+            print(f"\n--- Step {step_num} ---")
+
+            prompt = REACT_PROMPT.format(
+                incident_summary=incident_summary,
+                plan=plan_text,
+                tools=tools_desc,
+                history="\n".join(history) if history else "(none yet)",
+            )
+            messages = [{"role": "user", "content": prompt}]
+
+            response = self.llm.think(messages=messages)
+            if not response:
+                print("⚠️  Empty LLM response, stopping.")
+                break
+
+            thought, action = self._parse_react_output(response)
+            if thought:
+                print(f"💭 Thought: {thought}")
+            if not action:
+                print("⚠️  No action parsed, stopping.")
+                break
+
+            if action.lower().startswith("finish"):
+                conclusion = self._extract_action_input(action)
+                print(f"\n✅ Root cause identified: {conclusion}")
+                findings["root_cause"] = conclusion
+                break
+
+            tool_name, query = self._parse_tool_call(action)
+            if not tool_name:
+                history.append(f"Observation: Invalid action format '{action}'")
+                continue
+
+            action_key = f"{tool_name}[{query}]"
+            if action_key in called_actions:
+                hint = "You already called this tool with this query. Use Finish[<conclusion>] to state your root cause."
+                print(f"⚠️  Duplicate action skipped — {hint}")
+                history.append(f"Action: {action}")
+                history.append(f"Observation: {hint}")
+                continue
+
+            called_actions.add(action_key)
+            print(f"🔧 Action: {tool_name}[{query}]")
+            observation = self._execute_tool(tool_name, query, findings)
+            print(f"👀 Observation: {observation[:300]}{'...' if len(observation) > 300 else ''}")
+
+            history.append(f"Action: {action}")
+            history.append(f"Observation: {observation}")
+
+        return findings
+
+    def _execute_tool(self, tool_name: str, query: str, findings: Dict) -> str:
+        tool = self.tools.get(tool_name)
+        if not tool:
+            return f"Unknown tool '{tool_name}'. Available: {list(self.tools.keys())}"
+
+        result = tool.run(query)
+
+        # Accumulate evidence for the postmortem
+        if tool_name in ("log_search", "metric_query") and "No " not in result[:10]:
+            findings["evidence"].append({"tool": tool_name, "query": query, "result": result})
+        elif tool_name == "runbook_lookup":
+            findings["runbook_steps"].append(result)
+
+        return result
+
+    def _parse_react_output(self, text: str):
+        thought_match = re.search(r"Thought:\s*(.*?)(?=\nAction:|$)", text, re.DOTALL)
+        action_match = re.search(r"Action:\s*(.*?)$", text, re.DOTALL)
+        thought = thought_match.group(1).strip() if thought_match else None
+        action = action_match.group(1).strip() if action_match else None
+        return thought, action
+
+    def _parse_tool_call(self, action_text: str):
+        match = re.match(r"(\w+)\[(.*)\]", action_text, re.DOTALL)
+        if match:
+            return match.group(1).strip(), match.group(2).strip()
+        return None, None
+
+    def _extract_action_input(self, action_text: str) -> str:
+        match = re.match(r"\w+\[(.*)\]", action_text, re.DOTALL)
+        return match.group(1).strip() if match else action_text
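
The agent above extracts `Thought:` and `Action:` from free-form model output. Below is a minimal sketch of a stricter parser that reads only the `tool[query]` call on the `Action:` line, so that any text the model appends afterwards (for example an invented `Observation:`) does not leak into the query. The function name `parse_react` is hypothetical and not part of this PR, and the sketch assumes the action fits on a single line, which may not hold for a multi-line `Finish[...]` conclusion.

```python
import re
from typing import Optional, Tuple

# Matches a single-line action such as "Action: log_search[timeout]".
ACTION_RE = re.compile(r"^Action:\s*(\w+)\[(.*?)\][ \t]*$", re.MULTILINE)
# Captures the thought up to the next "Action:" line or the end of the text.
THOUGHT_RE = re.compile(r"Thought:\s*(.*?)(?=\nAction:|\Z)", re.DOTALL)


def parse_react(text: str) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """Return (thought, tool_name, query); missing parts are None."""
    thought_match = THOUGHT_RE.search(text)
    action_match = ACTION_RE.search(text)
    thought = thought_match.group(1).strip() if thought_match else None
    if not action_match:
        return thought, None, None
    return thought, action_match.group(1), action_match.group(2).strip()
```

Because the query group is lazy and anchored to the end of the line, a query that itself contains brackets is still captured up to the last `]` on that line.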

+ 62 - 0
Co-creation-projects/zjzhou-SREOnCallAgent/src/agents/pipeline.py

@@ -0,0 +1,62 @@
+"""
+SRE pipeline orchestrator — wires TriageAgent → InvestigationAgent → PostmortemAgent.
+"""
+import json
+import os
+from pathlib import Path
+from typing import Dict, Any
+
+from src.core.llm_client import HelloAgentsLLM
+from src.agents.triage_agent import TriageAgent
+from src.agents.investigation_agent import InvestigationAgent
+from src.agents.postmortem_agent import PostmortemAgent
+
+DATA_DIR = Path(__file__).resolve().parents[2] / "data"
+INCIDENTS_DIR = DATA_DIR / "incidents"
+RUNBOOKS_DIR = DATA_DIR / "runbooks"
+
+
+def list_incidents():
+    return [p.stem for p in INCIDENTS_DIR.glob("*.json")]
+
+
+def load_incident(incident_id: str) -> Dict[str, Any]:
+    path = INCIDENTS_DIR / f"{incident_id}.json"
+    if not path.exists():
+        raise FileNotFoundError(
+            f"Incident '{incident_id}' not found. "
+            f"Available: {list_incidents()}"
+        )
+    with open(path, encoding="utf-8") as f:
+        return json.load(f)
+
+
+def run_pipeline(incident_id: str, verbose: bool = True) -> Dict[str, Any]:
+    """
+    Full three-stage SRE pipeline for a given incident ID.
+
+    Returns a dict with: incident_id, service, severity, plan, findings, report
+    """
+    incident = load_incident(incident_id)
+    llm = HelloAgentsLLM(verbose=verbose)
+
+    # Stage 1: Triage — Plan-and-Solve
+    triage = TriageAgent(llm)
+    plan = triage.run(incident)
+
+    # Stage 2: Investigation — ReAct
+    investigator = InvestigationAgent(llm, incident, str(RUNBOOKS_DIR))
+    findings = investigator.run(plan)
+
+    # Stage 3: Post-mortem — Reflection
+    postmortem = PostmortemAgent(llm)
+    report = postmortem.run(incident, findings)
+
+    return {
+        "incident_id": incident_id,
+        "service": incident["service"],
+        "severity": incident["severity"],
+        "plan": plan,
+        "findings": findings,
+        "report": report,
+    }
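
As written, `run_pipeline` runs all three stages even when the model never answers. Below is a minimal sketch of a fail-fast wrapper that a caller could use to abort early instead. The names `LLMUnavailableError` and `require_response` are hypothetical and not part of this PR. The sketch assumes the client's `think` method returns an empty string or `None` on failure, which is inferred from the "Empty LLM response" branch in `InvestigationAgent` rather than confirmed from `llm_client.py`.

```python
from typing import Callable, Dict, List, Optional

Messages = List[Dict[str, str]]


class LLMUnavailableError(RuntimeError):
    """Raised when the model returns no usable text (hypothetical error type)."""


def require_response(
    think: Callable[[Messages], Optional[str]],
    messages: Messages,
) -> str:
    """Call a think-style function and raise if it yields nothing.

    `think` stands in for a method like `llm.think`; its exact signature in
    this project is an assumption.
    """
    response = think(messages)
    if not response or not response.strip():
        raise LLMUnavailableError("LLM returned an empty response")
    return response
```

A caller could catch `LLMUnavailableError` around the first stage and return an error to the user rather than an empty report.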

+ 155 - 0
Co-creation-projects/zjzhou-SREOnCallAgent/src/agents/postmortem_agent.py

@@ -0,0 +1,155 @@
+import json
+from typing import Dict, Any
+
+from src.core.llm_client import HelloAgentsLLM
+
+DRAFT_PROMPT = """
+You are a senior SRE writing a post-mortem report for a production incident.
+Write a clear, structured post-mortem in Markdown using the incident data and investigation
+findings provided below.
+
+INCIDENT DATA:
+{incident_json}
+
+INVESTIGATION FINDINGS:
+Root Cause: {root_cause}
+
+Evidence Collected:
+{evidence}
+
+Runbook Consulted:
+{runbook}
+
+Write the post-mortem with these EXACT sections:
+1. ## Executive Summary (2-3 sentences: what happened, impact, resolution)
+2. ## Incident Timeline (bullet list of timestamped events from the logs)
+3. ## Root Cause Analysis (5-Whys: start with the symptom and ask 'why?' 5 times)
+4. ## Impact Assessment (severity, affected users, duration estimate, business impact)
+5. ## Immediate Remediation Steps (numbered list of actions to take RIGHT NOW)
+6. ## Action Items (table with columns: Action | Owner | Due Date | Priority)
+7. ## Lessons Learned (2-4 bullet points about what this incident taught us)
+
+Be specific — reference actual log messages, metric values, and timestamps from the data.
+"""
+
+CRITIQUE_PROMPT = """
+You are a post-mortem review board member. Critically evaluate the draft post-mortem below
+against these quality criteria:
+
+1. ROOT CAUSE: Is the root cause clearly and specifically stated (not vague)?
+2. TIMELINE: Are all key events from the logs included with accurate timestamps?
+3. ACTION ITEMS: Are they specific, measurable, and assigned with due dates?
+4. 5-WHYS: Does it reach the true systemic root cause (not stop at symptoms)?
+5. LESSONS LEARNED: Are they actionable (not generic platitudes)?
+
+Draft Post-Mortem:
+{draft}
+
+Respond with a JSON object:
+{{
+  "score": <1-10>,
+  "issues": ["<issue 1>", "<issue 2>", ...],
+  "suggestions": ["<improvement 1>", "<improvement 2>", ...]
+}}
+"""
+
+REVISE_PROMPT = """
+Revise the post-mortem draft below to address the reviewer's feedback.
+Apply ALL suggestions and fix ALL identified issues.
+
+ORIGINAL DRAFT:
+{draft}
+
+REVIEWER FEEDBACK:
+{critique}
+
+Output the complete revised post-mortem in Markdown. No preamble — start directly with
+the post-mortem content.
+"""
+
+
+class PostmortemAgent:
+    """
+    Reflection agent: draft → critique → revise post-mortem report.
+
+    This is the third stage of the pipeline. It uses the Reflection pattern:
+    first drafting an RCA report, then self-critiquing it against quality criteria,
+    then producing a final revised version.
+    """
+
+    def __init__(self, llm: HelloAgentsLLM, max_revisions: int = 1):
+        self.llm = llm
+        self.max_revisions = max_revisions
+
+    def run(self, incident: Dict[str, Any], findings: Dict[str, Any]) -> str:
+        print("\n" + "=" * 60)
+        print("📝 STAGE 3: POST-MORTEM — Reflection (draft → critique → revise)")
+        print("=" * 60)
+
+        evidence_text = "\n\n".join(
+            f"[{e['tool']}({e['query']})]:\n{e['result']}"
+            for e in findings.get("evidence", [])
+        ) or "No structured evidence collected."
+
+        runbook_text = "\n\n".join(findings.get("runbook_steps", [])) or "No runbook consulted."
+
+        draft = self._draft(incident, findings, evidence_text, runbook_text)
+        print("\n✍️  Draft post-mortem written.")
+
+        for revision in range(1, self.max_revisions + 1):
+            critique = self._critique(draft)
+            print(f"\n🔍 Critique (revision {revision}):\n{critique[:500]}...")
+            score = self._extract_score(critique)
+            print(f"   Quality score: {score}/10")
+
+            if score >= 8:
+                print("✅ Quality threshold met — no revision needed.")
+                break
+
+            print(f"   Revising post-mortem (score {score} < 8)...")
+            draft = self._revise(draft, critique)
+            print(f"   Revision {revision} complete.")
+
+        print("\n✅ Final post-mortem ready.")
+        return draft
+
+    def _draft(
+        self,
+        incident: Dict[str, Any],
+        findings: Dict[str, Any],
+        evidence_text: str,
+        runbook_text: str,
+    ) -> str:
+        incident_json = json.dumps(
+            {
+                "incident_id": incident["incident_id"],
+                "service": incident["service"],
+                "severity": incident["severity"],
+                "alert": incident["alert"],
+                "affected_users": incident.get("affected_users", 0),
+            },
+            indent=2,
+        )
+        prompt = DRAFT_PROMPT.format(
+            incident_json=incident_json,
+            root_cause=findings.get("root_cause", "Unknown"),
+            evidence=evidence_text,
+            runbook=runbook_text,
+        )
+        messages = [{"role": "user", "content": prompt}]
+        return self.llm.think(messages=messages)
+
+    def _critique(self, draft: str) -> str:
+        prompt = CRITIQUE_PROMPT.format(draft=draft)
+        messages = [{"role": "user", "content": prompt}]
+        return self.llm.think(messages=messages)
+
+    def _revise(self, draft: str, critique: str) -> str:
+        prompt = REVISE_PROMPT.format(draft=draft, critique=critique)
+        messages = [{"role": "user", "content": prompt}]
+        return self.llm.think(messages=messages)
+
+    def _extract_score(self, critique: str) -> int:
+        import re
+        match = re.search(r'"score"\s*:\s*(\d+)', critique)
+        return int(match.group(1)) if match else 7  # fallback 7 is also used when the critique is empty (e.g. the LLM call failed)
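
`_extract_score` above returns 7 whenever no score can be found, so an empty critique looks the same as a mediocre one. Below is a minimal sketch of a parser that returns `None` in that case, letting the caller decide what to do. The function name `parse_critique_score` is hypothetical and not part of this PR. It tries the JSON object that `CRITIQUE_PROMPT` asks for first, then falls back to the same regex, and clamps the result to the 1-10 range.

```python
import json
import re
from typing import Optional


def parse_critique_score(critique: str) -> Optional[int]:
    """Return the critique score clamped to 1..10, or None if none is found."""
    if not critique:
        return None

    # First attempt: parse the outermost {...} span as JSON.
    start, end = critique.find("{"), critique.rfind("}")
    if start != -1 and end > start:
        try:
            score = json.loads(critique[start:end + 1]).get("score")
            if isinstance(score, (int, float)) and not isinstance(score, bool):
                return max(1, min(10, int(score)))
        except (json.JSONDecodeError, AttributeError):
            pass  # fall through to the regex below

    # Second attempt: the regex used by the original helper.
    match = re.search(r'"score"\s*:\s*(\d+)', critique)
    return max(1, min(10, int(match.group(1)))) if match else None
```

With a `None` result the revision loop could skip revising, or stop and report that the critique step failed, instead of printing a score that was never produced.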

+ 94 - 0
Co-creation-projects/zjzhou-SREOnCallAgent/src/agents/triage_agent.py

@@ -0,0 +1,94 @@
+import ast
+import json
+from typing import Dict, Any, List
+
+from src.core.llm_client import HelloAgentsLLM
+
+TRIAGE_PROMPT = """
+You are a senior Site Reliability Engineer (SRE) responding to a production incident.
+Your job is to create a concise, ordered investigation plan from the alert details below.
+
+Each plan step should be ONE specific investigation action using ONE of these tools:
+- log_search: search logs by keyword or regex
+- metric_query: query a time-series metric by name
+- runbook_lookup: look up remediation steps by error pattern
+
+Alert Details:
+{alert_json}
+
+Output a Python list of 4-6 investigation steps. Each step must be a dict with:
+  {{"tool": "<tool_name>", "query": "<specific query string>", "reason": "<why this step>"}}
+
+Output ONLY the Python list, wrapped in ```python ... ```. No other text.
+
+Example:
+```python
+[
+  {{"tool": "log_search", "query": "ERROR", "reason": "Find all error-level log entries to identify the failure pattern"}},
+  {{"tool": "metric_query", "query": "latency", "reason": "Quantify the latency degradation over time"}},
+  {{"tool": "runbook_lookup", "query": "high latency", "reason": "Get standard remediation steps"}}
+]
+```
+"""
+
+
+class TriageAgent:
+    """
+    Plan-and-Solve agent: generates an ordered investigation plan from an incident alert.
+
+    This is the first stage of the pipeline. It takes the raw alert and produces
+    a structured plan that the InvestigationAgent will execute step-by-step.
+    """
+
+    def __init__(self, llm: HelloAgentsLLM):
+        self.llm = llm
+
+    def run(self, incident: Dict[str, Any]) -> List[Dict[str, str]]:
+        alert_summary = {
+            "incident_id": incident["incident_id"],
+            "service": incident["service"],
+            "severity": incident["severity"],
+            "alert": incident["alert"],
+        }
+        prompt = TRIAGE_PROMPT.format(alert_json=json.dumps(alert_summary, indent=2))
+        messages = [{"role": "user", "content": prompt}]
+
+        print("\n" + "=" * 60)
+        print("🚨 STAGE 1: TRIAGE — Generating investigation plan")
+        print("=" * 60)
+
+        response = self.llm.think(messages=messages)
+        plan = self._parse_plan(response)
+
+        if plan:
+            print(f"\n✅ Investigation plan ({len(plan)} steps):")
+            for i, step in enumerate(plan, 1):
+                print(f"   {i}. [{step['tool']}] {step['query']} — {step['reason']}")
+        else:
+            print("⚠️  Could not parse structured plan; using fallback plan.")
+            plan = self._fallback_plan(incident)
+
+        return plan
+
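+    def _parse_plan(self, response: str) -> List[Dict[str, str]]:
+        """Extract the plan list from the first fenced python block; return [] if unusable."""
+        required = {"tool", "query", "reason"}
+        try:
+            block = response.split("```python")[1].split("```")[0].strip()
+            plan = ast.literal_eval(block)
+        except (IndexError, ValueError, SyntaxError, TypeError):
+            # No fenced block, or its content is not a valid Python literal
+            return []
+        # Every step needs the keys that run() prints, otherwise run() raises KeyError
+        if isinstance(plan, list) and all(
+            isinstance(s, dict) and required.issubset(s) for s in plan
+        ):
+            return plan
+        return []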
+
+    def _fallback_plan(self, incident: Dict[str, Any]) -> List[Dict[str, str]]:
+        # Generic plan used when the LLM reply cannot be parsed
+        return [
+            {"tool": "log_search", "query": "ERROR",
+             "reason": "Find all error-level log entries"},
+            {"tool": "log_search", "query": "CRITICAL",
+             "reason": "Find critical-severity log entries"},
+            {"tool": "metric_query", "query": "error",
+             "reason": "Check error rate trend over time"},
+            {"tool": "runbook_lookup", "query": incident["alert"].get("description", "incident"),
+             "reason": "Retrieve standard remediation steps"},
+        ]
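
The parsing step in `_parse_plan` can be tried without calling a model. The following is a minimal sketch, assuming the reply follows the fenced-list format requested in the prompt. `parse_plan`, `REQUIRED_KEYS` and `FENCE` are illustrative names that do not appear in the PR, and the required keys are taken from the prompt example.

```python
import ast
from typing import Dict, List

FENCE = "`" * 3  # fence marker built at runtime so this example can itself sit in a fence
REQUIRED_KEYS = {"tool", "query", "reason"}  # assumption: keys from the prompt example


def parse_plan(response: str) -> List[Dict[str, str]]:
    """Return the plan found in the first python fence, or [] if it is unusable."""
    try:
        block = response.split(FENCE + "python")[1].split(FENCE)[0].strip()
        plan = ast.literal_eval(block)  # accepts Python literals only, never runs code
    except (IndexError, ValueError, SyntaxError, TypeError):
        return []
    if isinstance(plan, list) and all(
        isinstance(step, dict) and REQUIRED_KEYS.issubset(step) for step in plan
    ):
        return plan
    return []


reply = (
    "Here is the plan:\n"
    + FENCE + "python\n"
    + '[{"tool": "log_search", "query": "ERROR", "reason": "find errors"}]\n'
    + FENCE
)
good = parse_plan(reply)                    # one well-formed step
bad = parse_plan("no fenced block here")    # falls through to []
```

Returning an empty list for every failure mode keeps the caller simple: a falsy result is the single signal to switch to the fallback plan.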

+ 0 - 0
Co-creation-projects/zjzhou-SREOnCallAgent/src/api/__init__.py


+ 88 - 0
Co-creation-projects/zjzhou-SREOnCallAgent/src/api/main.py

@@ -0,0 +1,88 @@
+"""
+FastAPI server for the SRE On-Call Agent.
+
+Run with: uvicorn src.api.main:app --reload --port 8000
+
+Endpoints:
+  GET  /health                       — liveness check
+  GET  /incidents/fixtures           — list available sample incident IDs
+  POST /incidents/investigate        — run the full 3-agent pipeline
+  GET  /incidents/{id}/report        — retrieve a previously generated report
+"""
+import time
+from typing import Dict, Any
+
+from fastapi import FastAPI, HTTPException
+from fastapi.middleware.cors import CORSMiddleware
+from pydantic import BaseModel
+
+from src.agents.pipeline import run_pipeline, list_incidents, load_incident
+
+app = FastAPI(
+    title="SRE On-Call Agent",
+    description="AI-powered incident triage and post-mortem generation",
+    version="1.0.0",
+)
+
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],  # Tighten this when adding a specific frontend origin
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+
+# In-memory store for generated reports (replace with Redis/DB for production)
+_report_store: Dict[str, Any] = {}
+
+
+class InvestigateRequest(BaseModel):
+    incident_id: str
+
+
+@app.get("/health")
+def health():
+    return {"status": "ok"}
+
+
+@app.get("/incidents/fixtures")
+def get_fixtures():
+    """List all available sample incident IDs."""
+    return {"incidents": list_incidents()}
+
+
+@app.post("/incidents/investigate")
+def investigate(req: InvestigateRequest):
+    """
+    Run the full triage → investigation → post-mortem pipeline for an incident.
+
+    This runs synchronously (suitable for demo; upgrade to background task + SSE for prod).
+    """
+    try:
+        load_incident(req.incident_id)  # Validate early
+    except FileNotFoundError as e:
+        raise HTTPException(status_code=404, detail=str(e)) from e
+
+    # Monotonic clock: elapsed time is unaffected by system clock adjustments
+    start = time.monotonic()
+    try:
+        result = run_pipeline(req.incident_id, verbose=False)
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=f"Pipeline error: {e}") from e
+
+    elapsed = round(time.monotonic() - start, 1)
+    result["elapsed_seconds"] = elapsed
+    _report_store[req.incident_id] = result
+    return result
+
+
+@app.get("/incidents/{incident_id}/report")
+def get_report(incident_id: str):
+    """Retrieve a previously generated post-mortem report."""
+    if incident_id not in _report_store:
+        raise HTTPException(
+            status_code=404,
+            detail=f"No report found for '{incident_id}'. Call POST /incidents/investigate first.",
+        )
+    return {
+        "incident_id": incident_id,
+        "report": _report_store[incident_id]["report"],
+    }
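
A client for the endpoints above might look like the following sketch, which uses only the standard library. It assumes the server runs locally on port 8000 as in the module docstring, and that fixture IDs match the file names under `data/incidents/` (for example `db_pool_exhaustion`), which the diff does not confirm. `BASE_URL`, `build_investigate_request` and `investigate` are hypothetical names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: matches --port 8000 in the docstring


def build_investigate_request(incident_id: str) -> urllib.request.Request:
    """Build the POST /incidents/investigate request without sending it."""
    body = json.dumps({"incident_id": incident_id}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/incidents/investigate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def investigate(incident_id: str, timeout: float = 300.0) -> dict:
    """Send the request and decode the JSON reply; requires a running server."""
    request = build_investigate_request(incident_id)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


# Building the request is side-effect free, so it can be inspected offline.
req = build_investigate_request("db_pool_exhaustion")
```

Because the endpoint runs the whole pipeline synchronously, the client timeout is set generously here. The right value depends on the model and the number of plan steps.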

+ 0 - 0
Co-creation-projects/zjzhou-SREOnCallAgent/src/core/__init__.py


+ 55 - 0
Co-creation-projects/zjzhou-SREOnCallAgent/src/core/llm_client.py

@@ -0,0 +1,55 @@
+import os
+from openai import OpenAI
+from dotenv import load_dotenv
+from typing import Dict, List, Optional
+
+load_dotenv()
+
+
+class HelloAgentsLLM:
+    """OpenAI-compatible LLM client (works with AIHubmix, ModelScope, OpenAI)."""
+
+    def __init__(
+        self,
+        model: Optional[str] = None,
+        api_key: Optional[str] = None,
+        base_url: Optional[str] = None,
+        timeout: Optional[int] = None,
+        verbose: bool = True,
+    ):
+        self.model = model or os.getenv("LLM_MODEL_ID")
+        api_key = api_key or os.getenv("LLM_API_KEY")
+        base_url = base_url or os.getenv("LLM_BASE_URL")
+        timeout = timeout or int(os.getenv("LLM_TIMEOUT", "60"))
+        self.verbose = verbose
+
+        if not all([self.model, api_key, base_url]):
+            raise ValueError(
+                "LLM_MODEL_ID, LLM_API_KEY, and LLM_BASE_URL must be set "
+                "(via constructor args or .env file)."
+            )
+
+        self.client = OpenAI(api_key=api_key, base_url=base_url, timeout=timeout)
+
+    def think(self, messages: List[Dict[str, str]], temperature: float = 0) -> str:
+        if self.verbose:
+            print(f"🧠 Calling {self.model}...")
+        try:
+            response = self.client.chat.completions.create(
+                model=self.model,
+                messages=messages,
+                temperature=temperature,
+                stream=True,
+            )
+            collected = []
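+            for chunk in response:
+                # Some OpenAI-compatible backends emit chunks with an empty choices list
+                if not chunk.choices:
+                    continue
+                content = chunk.choices[0].delta.content or ""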
+                if self.verbose:
+                    print(content, end="", flush=True)
+                collected.append(content)
+            if self.verbose:
+                print()
+            return "".join(collected)
+        except Exception as e:
+            print(f"❌ LLM API error: {e}")
+            return ""
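
The streaming loop in `think` can be illustrated without network access. The sketch below assumes chunks expose `choices[0].delta.content` as in the code above, and stands in for them with `SimpleNamespace` objects. `collect_stream` and `fake_chunk` are hypothetical helpers, not part of the PR or of the `openai` package.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # skip chunks that carry no choices at all
            continue
        # delta.content can be None (for example on the final chunk), so default to ""
        parts.append(chunk.choices[0].delta.content or "")
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for one streamed chunk; text=None mimics an empty delta."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


stream = [
    fake_chunk("Hel"),
    fake_chunk(None),
    fake_chunk("lo"),
    SimpleNamespace(choices=[]),  # chunk without choices
]
text = collect_stream(stream)
```

Skipping chunks without choices matters because an `IndexError` raised mid-stream would be caught by the broad `except` in `think`, and the text collected so far would be discarded.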

+ 0 - 0
Co-creation-projects/zjzhou-SREOnCallAgent/src/tools/__init__.py


+ 42 - 0
Co-creation-projects/zjzhou-SREOnCallAgent/src/tools/log_search_tool.py

@@ -0,0 +1,42 @@
+import re
+from typing import Dict, Any, List
+
+
+class LogSearchTool:
+    """
+    Searches incident log entries by keyword or regex pattern.
+
+    Input to run(): a keyword or regex string
+    Returns: matching log lines as a formatted string
+    """
+
+    name = "log_search"
+    description = (
+        "Search incident logs by keyword or regex pattern. "
+        "Input: a keyword or regex string (e.g. 'pool exhausted', 'ERROR', '429'). "
+        "Returns matching log entries with timestamps and severity levels."
+    )
+
+    def __init__(self, incident_data: Dict[str, Any]):
+        self.logs: List[Dict] = incident_data.get("logs", [])
+        self.service: str = incident_data.get("service", "unknown")
+
+    def run(self, query: str) -> str:
+        try:
+            pattern = re.compile(query, re.IGNORECASE)
+        except re.error:
+            # Fall back to plain substring match if regex is invalid
+            pattern = re.compile(re.escape(query), re.IGNORECASE)
+
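+        # Use .get() when formatting too, so an entry missing a field cannot raise KeyError
+        matches = [
+            f"[{e.get('timestamp', '?')}] [{e.get('level', ''):8s}] {e.get('message', '')}"
+            for e in self.logs
+            if pattern.search(e.get("message", "")) or pattern.search(e.get("level", ""))
+        ]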
+
+        if not matches:
+            return f"No log entries found matching '{query}' in {self.service} logs."
+        return (
+            f"Found {len(matches)} log entries matching '{query}':\n"
+            + "\n".join(matches)
+        )
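
The matching behaviour of `LogSearchTool.run`, including the fallback for invalid patterns, can be seen in a small standalone sketch. The log entries are made up for illustration, and `search_logs` is a hypothetical function that mirrors the logic above rather than the class itself.

```python
import re
from typing import Dict, List


def search_logs(logs: List[Dict[str, str]], query: str) -> List[str]:
    """Return formatted entries whose message or level matches the query."""
    try:
        pattern = re.compile(query, re.IGNORECASE)
    except re.error:
        # Invalid regex: treat the query as a literal substring instead
        pattern = re.compile(re.escape(query), re.IGNORECASE)
    return [
        f"[{e['timestamp']}] [{e['level']:8s}] {e['message']}"
        for e in logs
        if pattern.search(e["message"]) or pattern.search(e["level"])
    ]


sample_logs = [  # made-up entries
    {"timestamp": "10:00:01", "level": "INFO", "message": "request served (200)"},
    {"timestamp": "10:00:05", "level": "ERROR", "message": "pool exhausted (max=20"},
    {"timestamp": "10:00:09", "level": "WARN", "message": "retrying upstream call"},
]

by_level = search_logs(sample_logs, "error")        # matches the level field
by_regex = search_logs(sample_logs, r"pool|retry")  # regex alternation
by_literal = search_logs(sample_logs, "(max=20")    # invalid regex, literal fallback
```

The literal fallback is what lets a model pass raw log fragments containing characters such as an unbalanced `(` without the tool failing.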

+ 38 - 0
Co-creation-projects/zjzhou-SREOnCallAgent/src/tools/metric_query_tool.py

@@ -0,0 +1,38 @@
+from typing import Dict, Any
+
+
+class MetricQueryTool:
+    """
+    Queries simulated time-series metrics for the incident's service.
+
+    Input to run(): a metric name substring (e.g. 'db_pool', 'latency', 'memory')
+    Returns: time-series data as a formatted string
+    """
+
+    name = "metric_query"
+    description = (
+        "Query time-series metrics for the incident service. "
+        "Input: a metric name or keyword (e.g. 'db_pool', 'memory', 'latency', 'error_rate'). "
+        "Returns time-series values showing how the metric changed over time."
+    )
+
+    def __init__(self, incident_data: Dict[str, Any]):
+        self.metrics: Dict[str, Dict] = incident_data.get("metrics", {})
+        self.service: str = incident_data.get("service", "unknown")
+
+    def run(self, metric_name: str) -> str:
+        query = metric_name.lower().strip()
+        matched = {k: v for k, v in self.metrics.items() if query in k.lower()}
+
+        if not matched:
+            available = ", ".join(self.metrics.keys())
+            return (
+                f"No metrics found matching '{metric_name}' for {self.service}.\n"
+                f"Available metrics: {available}"
+            )
+
+        lines = [f"Metrics for {self.service} matching '{metric_name}':"]
+        for name, values in matched.items():
+            series = " | ".join(f"{t}: {v}" for t, v in sorted(values.items()))
+            lines.append(f"  {name}: [{series}]")
+        return "\n".join(lines)
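
The substring matching and time ordering in `MetricQueryTool.run` can be sketched as follows. The metric names and values are invented for illustration, and `match_metrics` is a hypothetical helper that mirrors the logic above.

```python
from typing import Dict


def match_metrics(metrics: Dict[str, Dict[str, float]], query: str) -> Dict[str, str]:
    """Map each metric whose name contains the query to its time-ordered series."""
    needle = query.lower().strip()
    return {
        name: " | ".join(f"{t}: {v}" for t, v in sorted(values.items()))
        for name, values in metrics.items()
        if needle in name.lower()
    }


sample_metrics = {  # made-up data; timestamps deliberately out of order
    "db_pool_active": {"10:02": 20, "10:00": 12, "10:01": 18},
    "p99_latency_ms": {"10:00": 180, "10:01": 950},
}

pool = match_metrics(sample_metrics, " DB_Pool ")  # case and whitespace are normalised
none = match_metrics(sample_metrics, "memory")     # no metric name contains "memory"
```

Sorting is done on the timestamp strings, so the order is chronological only when timestamps use a fixed-width format such as zero-padded `HH:MM` or ISO 8601.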

+ 65 - 0
Co-creation-projects/zjzhou-SREOnCallAgent/src/tools/runbook_tool.py

@@ -0,0 +1,65 @@
+import os
+import glob
+import yaml
+from typing import List, Dict, Any
+
+
+class RunbookLookupTool:
+    """
+    Fetches runbook remediation procedures by service and error pattern.
+
+    Input to run(): an error pattern keyword (e.g. 'DB pool exhausted', 'Stripe 429')
+    Returns: ordered remediation steps from the service's runbook
+    """
+
+    name = "runbook_lookup"
+    description = (
+        "Look up runbook remediation steps for a service and error pattern. "
+        "Input: an error pattern keyword (e.g. 'DB pool exhausted', 'Stripe API 429', 'memory leak'). "
+        "Returns ordered remediation steps from the on-call runbook."
+    )
+
+    def __init__(self, service: str, runbooks_dir: str):
+        self.service = service
+        self.runbooks_dir = runbooks_dir
+
+    def _load_runbook(self) -> Dict[str, Any]:
+        path = os.path.join(self.runbooks_dir, f"{self.service}.yaml")
+        if os.path.exists(path):
+            with open(path) as f:
+                return yaml.safe_load(f) or {}
+        # Fallback: search all runbooks
+        for rb_path in glob.glob(os.path.join(self.runbooks_dir, "*.yaml")):
+            with open(rb_path) as f:
+                data = yaml.safe_load(f) or {}
+            rb_service = data.get("service", "")
+            # Skip runbooks without a service name: "" is a substring of every string
+            if rb_service and (rb_service in self.service or self.service in rb_service):
+                return data
+        return {}
+
+    def run(self, error_pattern: str) -> str:
+        runbook = self._load_runbook()
+        if not runbook:
+            return f"No runbook found for service '{self.service}'."
+
+        query = error_pattern.lower()
+        procedures: List[Dict] = runbook.get("procedures", [])
+
+        matching = [p for p in procedures if query in p.get("pattern", "").lower()]
+        if not matching:
+            matching = procedures  # Return all if no specific match
+
+        if not matching:
+            return f"No runbook procedures found for pattern '{error_pattern}'."
+
+        lines = [
+            f"Runbook: {runbook.get('service', 'unknown')} "
+            f"(v{runbook.get('runbook_version', '?')})"
+        ]
+        for proc in matching:
+            lines.append(
+                f"\nPattern: '{proc['pattern']}' | Severity: {proc.get('severity', 'unknown')}"
+            )
+            lines.append("Remediation steps:")
+            for i, step in enumerate(proc.get("steps", []), 1):
+                lines.append(f"  {i}. {step}")
+        return "\n".join(lines)
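
The procedure selection in `RunbookLookupTool.run` can be sketched with a plain dictionary in place of a YAML file. The field names (`service`, `runbook_version`, `procedures`, `pattern`, `severity`, `steps`) are the ones the code above reads. The content is made up, and `select_procedures` is a hypothetical helper.

```python
from typing import Any, Dict, List

sample_runbook: Dict[str, Any] = {  # made-up content in the shape the tool expects
    "service": "checkout-service",
    "runbook_version": "1.2",
    "procedures": [
        {"pattern": "DB pool exhausted", "severity": "P1",
         "steps": ["Check active connections", "Restart the worker pods"]},
        {"pattern": "High latency", "severity": "P2",
         "steps": ["Inspect the p99 latency dashboard"]},
    ],
}


def select_procedures(runbook: Dict[str, Any], error_pattern: str) -> List[Dict[str, Any]]:
    """Return procedures whose pattern contains the query, or all of them if none do."""
    query = error_pattern.lower()
    procedures = runbook.get("procedures", [])
    matching = [p for p in procedures if query in p.get("pattern", "").lower()]
    return matching or procedures


exact = select_procedures(sample_runbook, "db pool")                    # one match
broad = select_procedures(sample_runbook, "pool exhausted on replica")  # falls back to all
```

The test is whether the query is contained in the pattern, not the reverse. A long query, such as a full alert description, is therefore unlikely to match any single pattern and will usually return every procedure.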