I Ran 5 LLMs Through 10 Real Agent Coding Tasks. The Cheapest One Won.

*By Vilius Vystartas, May 2026*

What I Tested

I gave 5 models the same 10 coding tasks — not LeetCode, not trivia. Tasks an autonomous agent actually does: parse a JSON config, find large files with a shell one-liner, fix a buggy merge function, write a concurrent HTTP fetcher. The kind of things my agents ask for at 3 AM.
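
To give a flavor, here's roughly what the merge-function task looks like. This is a reconstruction for illustration, not the exact prompt or code I used; the specific bug shown (dropping the leftover tail) is an assumption.

```python
# Hypothetical example of the "fix a buggy merge function" task.
# The prompt hands the model something like the first function and asks for a fix.

def merge_sorted(a, b):
    # Buggy: stops as soon as either list runs out.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out  # leftover elements of a or b are silently dropped


def merge_sorted_fixed(a, b):
    # The kind of answer that scores a pass.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])  # append whatever remains of either input
    out.extend(b[j:])
    return out
```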

Each task was scored on pattern matching: does the output contain the right function names, error handling, edge cases? Pass (75%+), partial (50-74%), or fail.
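
In practice that scoring is a handful of regexes per task. A minimal sketch, with hypothetical pattern lists; the 75% / 50% thresholds are the real ones:

```python
import re

# Each task carries a list of regex patterns the output should hit:
# function names, error handling, edge cases. These lists are illustrative.
TASKS = {
    "parse_json_config": [r"json\.load", r"except", r"FileNotFoundError"],
    "merge_function":    [r"def merge", r"extend|append", r"\[i:\]|\[j:\]"],
}

def score(task_id: str, output: str) -> tuple[float, str]:
    patterns = TASKS[task_id]
    hits = sum(bool(re.search(p, output)) for p in patterns)
    pct = 100 * hits / len(patterns)
    if pct >= 75:
        return pct, "pass"
    if pct >= 50:
        return pct, "partial"
    return pct, "fail"
```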

No model knew it was being benchmarked. Same prompt, same 500-token limit, temperature 0.1. OpenRouter for all calls.
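
OpenRouter speaks the OpenAI-compatible chat completions API, so each call looked roughly like the sketch below. The model slug and helper name are illustrative; the 500-token limit and temperature 0.1 are the actual settings.

```python
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

def run_task(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,  # e.g. "google/gemini-2.5-flash" (slug is illustrative)
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,    # the 500-token limit from the setup
        temperature=0.1,
    )
    return resp.choices[0].message.content
```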

The Results

| Model | Score | Passes | Cost (10 tasks) | Time (10 tasks) |
|---|---|---|---|---|
| Claude Sonnet 4 | 89.4% | 8/10 | $0.063 | 54s |
| Gemini 2.5 Flash | 88.9% | 10/10 | $0.008 | 17s |
| GPT-5.4 | 86.5% | 9/10 | $0.027 | 26s |
| GPT-5.5 | 53.8% | 5/10 | $0.104 | 109s |
| DeepSeek V3 | 0.0% | 0/10 | $0.000 | 1s |

DeepSeek returned HTTP 400 on every call — OpenRouter compatibility issue, not a model problem. I excluded it rather than pretend it scored zero.

What Surprised Me

Gemini 2.5 Flash scored 10/10 passes. Not a single task fell below 75%. It cost $0.008 total, less than a single GPT-5.5 call, and at 17 seconds it was roughly 6x faster than GPT-5.5.

GPT-5.5 failed 4 tasks. The pattern was consistent: it over-explained. On the shell one-liner task, it returned a 500-token essay about `find` instead of the actual command. On the CSV stats task, it discussed three approaches and never wrote the code. GPT-5.5 is the smartest model I've ever used for reasoning, but for concise code generation the verbosity hurts.

Claude Sonnet 4 was the most reliable. 8/10 perfect passes, 2 partials, zero fails. The 2 partials were on shell tasks where it used different syntax — correct, but didn't match my expected patterns. At $0.063 for 10 tasks, it's the premium pick for production agents.
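
That's the known weakness of pattern-matching scoring: two equally correct shell answers, and only one hits the expected patterns. A hypothetical illustration, not the actual patterns from my harness:

```python
import re

# Hypothetical expected patterns for the "find large files" task, centered on `find`.
expected = [r"\bfind\b", r"-size", r"\+\d+M"]

answer_a = "find . -type f -size +100M"        # matches all three patterns
answer_b = "du -ah . | sort -rh | head -n 20"  # equally valid, matches none

for ans in (answer_a, answer_b):
    hits = sum(bool(re.search(p, ans)) for p in expected)
    print(f"{ans!r}: {hits}/{len(expected)} patterns matched")
```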

What This Means for Agent Builders

If you're building agents that generate code, my takeaways:

- Default to Gemini 2.5 Flash: 10/10 passes, $0.008 for the whole run, and the fastest of the models that completed.
- Pay for Claude Sonnet 4 when reliability matters most: zero fails, at roughly 8x Gemini's cost.
- Treat GPT-5.5 as a reasoning model, not a snippet generator; the verbosity cost it 4 tasks here.
- GPT-5.4 sits in the middle on cost and speed, with 9/10 passes.

I'm not claiming this is a comprehensive benchmark. 10 tasks, one run each, pattern-matching scoring. But it's real — the same tasks my agents run every day. Not a synthetic benchmark designed for a paper.

What I'll Test Next

Error recovery. All 5 models handled the happy path. I want to know how they handle partial failures, contradictory instructions, and corrupted inputs. The benchmark that matters for agents isn't "can you sort a list" — it's "can you recover when the filesystem is read-only and the config is missing."
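
If the next round ends up looking like this one, each recovery case will pair a prompt with a sabotaged precondition and patterns a graceful answer should hit. A sketch of the plan; nothing here has been run yet, and the cases are hypothetical:

```python
# Hypothetical error-recovery cases for the next round: each pairs a prompt
# with a deliberately broken precondition and patterns a *graceful* answer
# should hit (detect the failure, fall back sensibly).
RECOVERY_TASKS = [
    {
        "prompt": "Load settings from config.json and print the 'port' value.",
        "sabotage": "config.json does not exist",
        "expect": [r"FileNotFoundError|os\.path\.exists", r"default|fallback"],
    },
    {
        "prompt": "Write the results to /data/out.csv.",
        "sabotage": "/data is mounted read-only",
        "expect": [r"PermissionError|OSError", r"tempfile|alternate"],
    },
]
```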

Total cost of this experiment: $0.20.

Full results: benchmarks.workswithagents.dev


Originally published on dev.to. More posts at workswithagents.dev/blog.
