Yesterday I promised to benchmark 10 LLMs that have never been tested on real agent coding tasks. I ran all 10 overnight. Some surprised me. Some embarrassed themselves.
10 models. 10 tasks each. Tasks are real agent work: parse JSON, write regex, fix a bug, query SQL, handle errors. Full pass requires correct, working code.
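The harness boils down to something simple. Here's a minimal sketch of that loop; the task prompts, the check functions, and the model stub are all illustrative assumptions, not the benchmark's actual code (which also awards partial credit, omitted here for brevity):

```python
# Minimal benchmark-harness sketch. Real task prompts, grading, and the
# model client are assumptions; this only shows the pass-rate loop.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # True if the model's output passes

def run_benchmark(call_model: Callable[[str], str], tasks: list[Task]) -> float:
    """Run every task through the model and return the pass rate in percent."""
    passed = 0
    for task in tasks:
        try:
            if task.check(call_model(task.prompt)):
                passed += 1
        except Exception:
            pass  # a crash (e.g. a provider 502) counts as a fail
    return 100.0 * passed / len(tasks)

# Stubbed model so the sketch runs offline; real runs call a provider API.
tasks = [
    Task("Write a regex matching an ISO date.", lambda out: "\\d{4}" in out),
    Task("Parse this JSON and return the 'id' field.", lambda out: "json" in out),
]
score = run_benchmark(lambda prompt: "import json; r'\\d{4}-\\d{2}-\\d{2}'", tasks)
print(f"{score:.1f}%")  # → 100.0%
```

Swap the stub for a real API client and the same loop produces the pass/fail columns below.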
| Model | Score | Pass/Partial/Fail | Cost/task |
|---|---|---|---|
| Grok 4.20 | 75.0% | 6/3/1 | $0.0003 |
| Grok 4.1 Fast | 74.9% | 6/2/2 | $0.0009 |
| Xiaomi MiMo V2.5 Pro | 68.2% | 7/0/3 | $0.001 |
| Ring 2.6 (free) | 65.0% | 6/1/3 | free |
| DeepSeek V4 Flash | 60.0% | 4/3/3 | $0.0001 |
| GPT-5.4 Pro | 51.6% | 5/1/4 | $0.06 |
| GPT-5.5 Pro | 43.3% | 4/1/5 | $0.065 |
| DeepSeek V4 Pro | 38.3% | 4/0/6 | $0.001 |
| Google Lyria 3 Pro | 8.3% | 1/0/9 | free (preview) |
| Google Lyria 3 Clip | 0.0% | 0/0/10 | free (preview) |
Total cost: $1.37 for the entire run.
Grok 4.20 won, though not by enough to call it dominant: the gap to Grok 4.1 Fast is 0.1 points. What sets it apart is speed. It finished all 10 tasks in 14.5 seconds, while Grok 4.1 Fast scored nearly identically and took 225 seconds. Same family, wildly different speed profiles.
The "Pro" suffix is a trap. GPT-5.4 Pro scored 51.6%. Regular GPT-5.4 scored 76.6% on the same tasks. GPT-5.5 Pro scored 43.3%. Regular GPT-5.5 scored 60%. The Pro variants are slower, more expensive, and worse at this specific workload. If you're building agents, the base models are better.
DeepSeek V4 Flash beat DeepSeek V4 Pro — 60% vs 38%. Flash is also cheaper. For agent coding, smaller/faster beats bigger/slower again.
Ring 2.6 is free and beats paid models: six passes, one partial, $0.00. It outperforms both GPT Pro variants, DeepSeek V4 Pro, and both Lyria previews.
Google Lyria 3 is not ready. Clip failed every single task with 502 errors. Pro barely scored. Both are marked "preview" on OpenRouter. Fair enough — but worth knowing before you build on them.
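If you do benchmark (or build on) preview endpoints, transient 5xx responses are worth retrying before counting a task as failed. A hedged sketch of that pattern; the `HTTPError` shape and the retry limits are my assumptions, not the benchmark's actual error handling:

```python
# Retry transient 5xx errors (like Lyria's 502s) with exponential backoff.
# The HTTPError class and max_attempts value are illustrative assumptions.
import time

class HTTPError(Exception):
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status

def call_with_retries(fn, max_attempts: int = 3, base_delay: float = 0.01):
    """Call fn(); on a 5xx HTTPError, back off and retry up to max_attempts."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except HTTPError as e:
            if e.status < 500 or attempt == max_attempts - 1:
                raise  # non-transient, or out of retries: surface the error
            time.sleep(base_delay * 2 ** attempt)

# Simulated endpoint: fails twice with 502, succeeds on the third attempt.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise HTTPError(502)
    return "ok"

print(call_with_retries(flaky))  # → ok
```

In Lyria Clip's case even retries wouldn't have helped, since every call 502'd, but the pattern keeps one transient hiccup from tanking a model's score.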
For comparison, here's where these new models land against the existing leaderboard, and what I'd pick if I were building an agent today:
For reliability: Claude Sonnet 4 (85%) or Mistral Large 3 (79.6%). These aren't new — they've been at the top since the first benchmark.
For speed at good quality: Grok 4.20. A 75% score in 14.5 seconds, about 1.5 seconds per task.
For free: Ring 2.6 if you qualify for OpenRouter's free tier. 65% at $0 is hard to beat.
What to avoid: the "Pro" suffix on GPT models. Google's Lyria previews. DeepSeek V4 Pro, since Flash is both cheaper and better.
All results are live at workswithagents.dev/benchmarks — updated daily. Full interactive dashboard with local models at benchmarks.workswithagents.dev.
The Pro variants of GPT-5.4 and GPT-5.5 should theoretically be better. They're not. This might mean OpenAI optimized these for something other than quick-turn agent coding. Or it might mean the base models are just better tuned. Either way — don't assume Pro means better. Test it.
Originally published on dev.to. More posts at workswithagents.dev/blog.