---
id: agent-coding-benchmark
title: Agent Coding Benchmark
version: 1.0.0
status: Published
authors: [Vilius Vystartas]
date: 2026-05-13
---

# Agent Coding Benchmark Methodology

## Overview

We test LLMs on 10 real-world agent coding tasks — the things agents actually do: parse files, query SQLite, fix bugs, extract with regex, validate JSON schemas, fetch concurrently, monitor processes, and recover from errors.

## Tasks

| # | Task | What it tests |
|---|------|---------------|
| 1 | File Parse | Navigate a directory tree, read formats |
| 2 | Shell Find | Filesystem search + shell composition |
| 3 | Error Recovery | Detect and fix tool errors autonomously |
| 4 | CSV Stats | Aggregate + filter structured data |
| 5 | Fix Bug | Patch a broken Python function |
| 6 | SQL Query | Write correct queries against real data |
| 7 | Regex Extract | Pattern matching + structured output |
| 8 | Process Monitor | Inspect running processes, parse output |
| 9 | JSON Schema Validate | Validate against schema, report violations |
| 10 | Concurrent Fetch | Parallel HTTP with rate limiting |

## Scoring

- **1.0** — Correct, clean, no intervention needed
- **0.83** — Correct result but verbose/unnecessary steps
- **0.67** — Plausible approach, partial success
- **0.5** — Right direction, wrong result
- **0.33** — Attempted but fundamentally off
- **0.0** — Failed or refused

## Token Budget

All tests run at a **1 RB token budget** (approx 40K tokens per task). This is intentionally tight — it measures efficiency, not just capability. A model that needs 200K tokens to fix a bug isn't a practical agent.

## Runner

The benchmark runner is open source at [github.com/workswithagents/works-with-agents](https://github.com/workswithagents/works-with-agents) under `skills/mlops/llm-agent-benchmark/`.

## Updates

New models are added as they become available. The benchmark runs automatically, and results are published to this page within minutes.
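
## Appendix: Illustrative Sketch of Task 10

For readers unfamiliar with what these tasks look like in practice, below is a minimal sketch of the kind of solution Task 10 (Concurrent Fetch) calls for: parallel HTTP requests under a simple rate limit. This is not the benchmark's reference solution or grading harness; the URLs, limits, and helper names are illustrative assumptions.

```python
import time
import threading
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


class RateLimiter:
    """Allow at most `max_per_second` acquisitions per second (illustrative)."""

    def __init__(self, max_per_second: float):
        self.interval = 1.0 / max_per_second
        self.lock = threading.Lock()
        self.next_slot = time.monotonic()

    def acquire(self) -> None:
        # Reserve the next available time slot, then sleep outside the lock.
        with self.lock:
            now = time.monotonic()
            wait = self.next_slot - now
            self.next_slot = max(self.next_slot, now) + self.interval
        if wait > 0:
            time.sleep(wait)


def fetch(url: str, limiter: RateLimiter, timeout: float = 10.0) -> tuple[str, int]:
    """Fetch one URL after waiting for a rate-limit slot; return (url, status)."""
    limiter.acquire()
    with urlopen(url, timeout=timeout) as resp:
        return url, resp.status


def fetch_all(urls: list[str], workers: int = 5,
              max_per_second: float = 4.0) -> list[tuple[str, int]]:
    """Fetch URLs concurrently with a thread pool, bounded by a shared rate limiter."""
    limiter = RateLimiter(max_per_second)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: fetch(u, limiter), urls))


if __name__ == "__main__":
    # Hypothetical target list; the real task supplies its own URLs.
    results = fetch_all([f"https://example.com/?page={i}" for i in range(10)])
    for url, status in results:
        print(status, url)
```

A real attempt would likely also handle timeouts and failed requests; the core pattern sketched here is a shared limiter combined with a bounded worker pool.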