I Ran 5 LLMs Through 10 Real Agent Coding Tasks. The Cheapest One Won.

*By Vilius Vystartas, May 2026*

What I Tested

I gave 5 models the same 10 coding tasks — not LeetCode, not trivia. Tasks an autonomous agent actually does: parse a JSON config, find large files with a shell one-liner, fix a buggy merge function, write a concurrent HTTP fetcher. The kind of things my agents ask for at 3 AM.
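
To give a flavor, here's roughly what the merge-function task looks like. This is a reconstruction for illustration, not the exact prompt or code I used; the specific bug shown (dropping the leftover tail) is an assumption.

```python
# Hypothetical example of the "fix a buggy merge function" task.
# The prompt hands the model something like the first function and asks for a fix.

def merge_sorted(a, b):
    # Buggy: stops as soon as either list runs out.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out  # leftover elements of a or b are silently dropped


def merge_sorted_fixed(a, b):
    # The kind of answer that scores a pass.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])  # append whatever remains of either input
    out.extend(b[j:])
    return out
```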

Each task was scored on pattern matching: does the output contain the right function names, error handling, edge cases? Pass (75%+), partial (50-74%), or fail.
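
In practice that scoring is a handful of regexes per task. A minimal sketch, with hypothetical pattern lists; the 75% / 50% thresholds are the real ones:

```python
import re

# Each task carries a list of regex patterns the output should hit:
# function names, error handling, edge cases. These lists are illustrative.
TASKS = {
    "parse_json_config": [r"json\.load", r"except", r"FileNotFoundError"],
    "merge_function":    [r"def merge", r"extend|append", r"\[i:\]|\[j:\]"],
}

def score(task_id: str, output: str) -> tuple[float, str]:
    patterns = TASKS[task_id]
    hits = sum(bool(re.search(p, output)) for p in patterns)
    pct = 100 * hits / len(patterns)
    if pct >= 75:
        return pct, "pass"
    if pct >= 50:
        return pct, "partial"
    return pct, "fail"
```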

No model knew it was being benchmarked. Same prompt, same 500-token limit, temperature 0.1. OpenRouter for all calls.
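
OpenRouter speaks the OpenAI-compatible chat completions API, so each call looked roughly like the sketch below. The model slug and helper name are illustrative; the 500-token limit and temperature 0.1 are the actual settings.

```python
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

def run_task(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,  # e.g. "google/gemini-2.5-flash" (slug is illustrative)
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,    # the 500-token limit from the setup
        temperature=0.1,
    )
    return resp.choices[0].message.content
```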

The Results

| Model | Score | Passes | Cost (10 tasks) | Time (10 tasks) |
|---|---|---|---|---|
| Claude Sonnet 4 | 89.4% | 8/10 | $0.063 | 54s |
| Gemini 2.5 Flash | 88.9% | 10/10 | $0.008 | 17s |
| GPT-5.4 | 86.5% | 9/10 | $0.027 | 26s |
| GPT-5.5 | 53.8% | 5/10 | $0.104 | 109s |
| DeepSeek V3 | 0.0% | 0/10 | $0.000 | 1s |

DeepSeek returned HTTP 400 on every call — OpenRouter compatibility issue, not a model problem. I excluded it rather than pretend it scored zero.

What Surprised Me

Gemini 2.5 Flash scored 10/10 passes. Not a single task fell below 75%. It cost $0.008 total, less than a single GPT-5.5 call, and at 17 seconds it was roughly 6x faster than GPT-5.5.

GPT-5.5 failed 4 tasks. The pattern was consistent: it over-explained. On the shell one-liner task, it returned a 500-token essay about `find` instead of the actual command. On the CSV stats task, it discussed three approaches and never wrote the code. GPT-5.5 is the smartest model I've ever used for reasoning, but for concise code generation the verbosity hurts.

Claude Sonnet 4 was the most reliable. 8/10 perfect passes, 2 partials, zero fails. The 2 partials were on shell tasks where it used different syntax — correct, but didn't match my expected patterns. At $0.063 for 10 tasks, it's the premium pick for production agents.
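
That's the known weakness of pattern-matching scoring: two equally correct shell answers, and only one hits the expected patterns. A hypothetical illustration, not the actual patterns from my harness:

```python
import re

# Hypothetical expected patterns for the "find large files" task, centered on `find`.
expected = [r"\bfind\b", r"-size", r"\+\d+M"]

answer_a = "find . -type f -size +100M"        # matches all three patterns
answer_b = "du -ah . | sort -rh | head -n 20"  # equally valid, matches none

for ans in (answer_a, answer_b):
    hits = sum(bool(re.search(p, ans)) for p in expected)
    print(f"{ans!r}: {hits}/{len(expected)} patterns matched")
```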

What This Means for Agent Builders

If you're building agents that generate code, my takeaways:

- Default to Gemini 2.5 Flash: 10/10 passes, $0.008 for the whole run, and the fastest of the models that completed.
- Pay for Claude Sonnet 4 when reliability matters most: zero fails, at roughly 8x Gemini's cost.
- Treat GPT-5.5 as a reasoning model, not a snippet generator; the verbosity cost it 4 tasks here.
- GPT-5.4 sits in the middle on cost and speed, with 9/10 passes.

I'm not claiming this is a comprehensive benchmark. 10 tasks, one run each, pattern-matching scoring. But it's real — the same tasks my agents run every day. Not a synthetic benchmark designed for a paper.

What I'll Test Next

Error recovery. All 5 models handled the happy path. I want to know how they handle partial failures, contradictory instructions, and corrupted inputs. The benchmark that matters for agents isn't "can you sort a list" — it's "can you recover when the filesystem is read-only and the config is missing."
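
If the next round ends up looking like this one, each recovery case will pair a prompt with a sabotaged precondition and patterns a graceful answer should hit. A sketch of the plan; nothing here has been run yet, and the cases are hypothetical:

```python
# Hypothetical error-recovery cases for the next round: each pairs a prompt
# with a deliberately broken precondition and patterns a *graceful* answer
# should hit (detect the failure, fall back sensibly).
RECOVERY_TASKS = [
    {
        "prompt": "Load settings from config.json and print the 'port' value.",
        "sabotage": "config.json does not exist",
        "expect": [r"FileNotFoundError|os\.path\.exists", r"default|fallback"],
    },
    {
        "prompt": "Write the results to /data/out.csv.",
        "sabotage": "/data is mounted read-only",
        "expect": [r"PermissionError|OSError", r"tempfile|alternate"],
    },
]
```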

Total cost of this experiment: $0.20.

Full results: benchmarks.workswithagents.dev


Originally published on dev.to. More posts at workswithagents.dev/blog.
