---
id: agent-coding-benchmark
title: Agent Coding Benchmark
version: 1.0.0
status: Published
authors: [Vilius Vystartas]
date: 2026-05-13
---

# Agent Coding Benchmark Methodology

## Overview

We test LLMs on 10 real-world agent coding tasks — the things agents actually do: parse files, query SQLite, fix bugs, extract with regex, validate JSON schemas, fetch concurrently, monitor processes, and recover from errors.

## Tasks

| # | Task | What it tests |
|---|------|---------------|
| 1 | File Parse | Navigate a directory tree, read formats |
| 2 | Shell Find | Filesystem search + shell composition |
| 3 | Error Recovery | Detect and fix tool errors autonomously |
| 4 | CSV Stats | Aggregate + filter structured data |
| 5 | Fix Bug | Patch a broken Python function |
| 6 | SQL Query | Write correct queries against real data |
| 7 | Regex Extract | Pattern matching + structured output |
| 8 | Process Monitor | Inspect running processes, parse output |
| 9 | JSON Schema Validate | Validate against schema, report violations |
| 10 | Concurrent Fetch | Parallel HTTP with rate limiting |

## Scoring

- **1.0** — Correct, clean, no intervention needed
- **0.83** — Correct result but verbose/unnecessary steps
- **0.67** — Plausible approach, partial success
- **0.5** — Right direction, wrong result
- **0.33** — Attempted but fundamentally off
- **0.0** — Failed or refused

## Token Budget

All tests run at a **1 RB token budget** (approx 40K tokens per task). This is intentionally tight — it measures efficiency, not just capability. A model that needs 200K tokens to fix a bug isn't a practical agent.

## Runner

The benchmark runner is open source at [github.com/workswithagents/works-with-agents](https://github.com/workswithagents/works-with-agents) under `skills/mlops/llm-agent-benchmark/`.

## Updates

New models are added as they become available. The benchmark runs automatically, and results are published to this page within minutes.
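
## Appendix: Illustrative Sketch of Task 10

For readers unfamiliar with what these tasks look like in practice, below is a minimal sketch of the kind of solution Task 10 (Concurrent Fetch) calls for: parallel HTTP requests under a simple rate limit. This is not the benchmark's reference solution or grading harness; the URLs, limits, and helper names are illustrative assumptions.

```python
import time
import threading
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


class RateLimiter:
    """Allow at most `max_per_second` acquisitions per second (illustrative)."""

    def __init__(self, max_per_second: float):
        self.interval = 1.0 / max_per_second
        self.lock = threading.Lock()
        self.next_slot = time.monotonic()

    def acquire(self) -> None:
        # Reserve the next available time slot, then sleep outside the lock.
        with self.lock:
            now = time.monotonic()
            wait = self.next_slot - now
            self.next_slot = max(self.next_slot, now) + self.interval
        if wait > 0:
            time.sleep(wait)


def fetch(url: str, limiter: RateLimiter, timeout: float = 10.0) -> tuple[str, int]:
    """Fetch one URL after waiting for a rate-limit slot; return (url, status)."""
    limiter.acquire()
    with urlopen(url, timeout=timeout) as resp:
        return url, resp.status


def fetch_all(urls: list[str], workers: int = 5,
              max_per_second: float = 4.0) -> list[tuple[str, int]]:
    """Fetch URLs concurrently with a thread pool, bounded by a shared rate limiter."""
    limiter = RateLimiter(max_per_second)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: fetch(u, limiter), urls))


if __name__ == "__main__":
    # Hypothetical target list; the real task supplies its own URLs.
    results = fetch_all([f"https://example.com/?page={i}" for i in range(10)])
    for url, status in results:
        print(status, url)
```

A real attempt would likely also handle timeouts and failed requests; the core pattern sketched here is a shared limiter combined with a bounded worker pool.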