Agent Benchmarks

Code quality and tool-calling reliability. 272 models tested. Updated daily.

How we test → 10 tasks · scoring · methodology · Add your model →

The best
SmolLM3 3B · 93
Local · 9/0/1
Highest score: any model, any cost.

The daily driver
IBM Granite 4.1 8B · 90
Cloud · 9/0/1
Best balance of score and cost.

The steal
Amazon Nova Micro v1 · 85
Cloud · 7/3/0
Most bang for buck: score ÷ cost.

The local
SmolLM3 3B · 93
Local · 9/0/1
Runs on your hardware at zero API cost.
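
The "score ÷ cost" ranking behind "The steal" can be sketched as below. This is a minimal illustration, not the site's actual selection code: the `Model` class and the function names are hypothetical, the rows are copied from the All Models table, and excluding local and free models from the ratio (they have no per-token price to divide by) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Model:
    name: str
    score: float                 # benchmark score, 0-100
    cost_per_m: Optional[float]  # USD per million tokens; None for local or free


# A few rows copied from the All Models table.
MODELS = [
    Model("SmolLM3 3B", 93, None),
    Model("IBM Granite 4.1 8B", 90, 1.00),
    Model("Amazon Nova Micro v1", 85, 0.14),
    Model("EssentialAI RNJ-1", 85, 0.15),
    Model("Claude Sonnet 4", 85, 6.60),
]


def value(m: Model) -> float:
    """Score divided by cost; only defined for models with a positive price."""
    if m.cost_per_m is None or m.cost_per_m <= 0:
        raise ValueError(f"{m.name} has no per-token price")
    return m.score / m.cost_per_m


def the_best(models: List[Model]) -> Model:
    """Highest score, regardless of cost."""
    return max(models, key=lambda m: m.score)


def the_steal(models: List[Model]) -> Model:
    """Highest score-per-dollar among priced models (assumed rule)."""
    priced = [m for m in models if m.cost_per_m]
    return max(priced, key=value)


print(the_best(MODELS).name)
print(the_steal(MODELS).name)
```

On these five rows the ratio favors Amazon Nova Micro v1 (85 ÷ 0.14 ≈ 607) over EssentialAI RNJ-1 (85 ÷ 0.15 ≈ 567), which agrees with the card above.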

Score vs Speed

Each dot is a model. Color = score (green good, red bad). Size = relative model size.

Region: 🇺🇸 🇪🇺 🇨🇳 🇯🇵🇰🇷 🇦🇪 🌍 · Bigger dot = score ≥ 70%

Hover/tap a dot for details. X-axis is log scale.

All Models

Filter by type, quantization, or name. Click a column header to sort. Updated: 2026-06-17T16:25:38

Model (flag = region) | Type | Score | P/P/F | Cost
🇺🇸 SmolLM3 3B (4-bit 1.8GB) | local | 93 | 9/0/1 | local
🇺🇸 Nemotron 3 Nano 30B A3B | cloud | 90 | 8/2/0 | $1.00/M
🇪🇺 Codestral 2508 | cloud | 90 | 8/2/0 | $1.00/M
🇨🇳 MiniMax M2 Her | cloud | 90 | 8/2/0 | $1.20/M
🇨🇳 DeepSeek Chat | cloud | 90 | 8/2/0 | $1.00/M
🇨🇳 Qwen3 Coder 30B A3B | cloud | 90 | 8/2/0 | $1.00/M
🇪🇺 Mistral Large 2411 | cloud | 90 | 8/2/0 | $1.00/M
🇨🇳 DeepSeek Chat V3-0324 | cloud | 90 | 8/2/0 | $1.00/M
🇺🇸 Amazon Nova 2 Lite | cloud | 90 | 8/2/0 | $1.00/M
🇺🇸 Granite 4.0 Micro | cloud | 90 | 8/2/0 | $1.00/M
🇨🇳 Tencent Hunyuan A13B | cloud | 90 | 8/2/0 | $1.00/M
🇪🇺 Ministral 3 3B 2512 | cloud | 90 | 8/2/0 | $1.00/M
🇺🇸 GPT-OSS 20B (free) | cloud | 90 | 8/2/0 | $1.00/M
🇨🇳 Qwen3.5 9B 5-bit (Q5_K_M 6.1GB) | local | 90 | 9/0/1 | local
🇺🇸 Phi-4-mini (4-bit 2.3GB) | local | 90 | 9/0/1 | local
🇺🇸 IBM Granite 4.1 8B | cloud | 90 | 9/0/1 | $1.00/M
🇪🇺 Falcon3-7B-Instruct-4bit (4-bit 3.8GB) | local | 88 | 8/1/1 | local
🇺🇸 Claude Sonnet 4 | cloud | 85 | 7/3/0 | $6.60/M
🇨🇳 Qwen2.5 1.5B (4-bit 0.9GB) | local | 85 | 7/3/0 | local
🇨🇳 Qwen2.5 3B (4-bit 1.8GB) | local | 85 | 7/3/0 | local
🌍 Seed 2.0 Mini | cloud | 85 | 7/3/0 | free
🇨🇳 MiMo V2 Flash | cloud | 85 | 7/3/0 | free
🇺🇸 Gemini 2.5 Flash Lite | cloud | 85 | 7/3/0 | $0.40/M
🇨🇳 Qwen3 Coder Flash | cloud | 85 | 7/3/0 | $1.50/M
🌍 Cogito V2.1 671B | cloud | 85 | 7/3/0 | $1.50/M
🇺🇸 Claude Opus 4.6 | cloud | 85 | 7/3/0 | free
🇨🇳 Qwen3 Max | cloud | 85 | 7/3/0 | $1.00/M
🌍 EssentialAI RNJ-1 | cloud | 85 | 7/3/0 | $0.15/M
🇺🇸 Amazon Nova Micro v1 | cloud | 85 | 7/3/0 | $0.14/M
🇪🇺 Mistral Small 3.2 | cloud | 85 | 7/3/0 | $1.00/M
🇺🇸 LFM 2 24B A2B | cloud | 85 | 7/3/0 | $1.00/M
🇨🇳 Qwen3.7 Max | cloud | 85 | 7/3/0 | $1.00/M
🇨🇳 Qwen3.5 Plus (2026-04-20) | cloud | 85 | 7/3/0 | $1.00/M
🇪🇺 Mistral Medium 3.5 | cloud | 85 | 7/3/0 | $1.00/M
🇪🇺 Ministral 8B | cloud | 85 | 7/3/0 | $1.00/M
🇨🇳 DeepSeek V3.1 Terminus | cloud | 85 | 7/3/0 | $1.00/M
🇺🇸 Gemma 3N E4B | cloud | 85 | 7/3/0 | $1.00/M
🇨🇳 Qwen3 Max Thinking | cloud | 85 | 7/3/0 | $1.00/M
🇺🇸 GPT-5.1 Chat | cloud | 85 | 7/3/0 | $1.00/M
🇺🇸 LFM2.5 1.2B Instruct (free) | cloud | 85 | 7/3/0 | $1.00/M
🇺🇸 Phi-4 | cloud | 85 | 7/3/0 | $1.00/M
🇨🇳 MiniMax-01 | cloud | 85 | 7/3/0 | $1.00/M
🇺🇸 GPT-OSS 120B (free) | cloud | 85 | 7/3/0 | $1.00/M
🇪🇺 Mistral Small 24B 2501 | cloud | 85 | 7/3/0 | $1.00/M
🇨🇳 Qwen Plus | cloud | 85 | 8/1/1 | free
🌍 L3.1 Euryale 70B | cloud | 85 | 8/1/1 | $1.00/M
🌍 L3 Lunaris 8B | cloud | 85 | 8/1/1 | $1.00/M
🌍 Anthracite Magnum V4 72B | cloud | 85 | 8/1/1 | $1.00/M
🇺🇸 Gemma 3 27B | cloud | 85 | 8/1/1 | $1.00/M
🌍 Skyfall 36B v2 | cloud | 85 | 8/1/1 | $1.00/M
🇪🇺 Mistral Devstral 2 | cloud | 85 | 8/1/1 | $1.30/M
🇺🇸 Granite 3.2 2B (4-bit 1.5GB) | local | 83 | 7/2/1 | local
🇪🇺 Ministral 3B (4-bit 2.0GB) | local | 82 | 8/1/1 | local
🌍 AionLabs: Aion-2.0 | cloud | 82 | 7/1/2 | $1.52/M
🇺🇸 GPT-4.1 | cloud | 82 | 8/1/1 | $7.60/M
🌍 Kimi K2.5 | cloud | 80 | 6/4/0 | free
🇺🇸 Claude Sonnet 4.6 | cloud | 80 | 6/4/0 | $15.00/M
🇺🇸 Gemini 2.5 Pro | cloud | 80 | 6/4/0 | $10.00/M
🌍 Kat Coder Pro V2 | cloud | 80 | 6/4/0 | $1.00/M
🇨🇳 Qwen3 Coder | cloud | 80 | 6/4/0 | $1.80/M
🇺🇸 Claude Opus 4.6 Fast | cloud | 80 | 6/4/0 | free
🇪🇺 Ministral 14B 2512 | cloud | 80 | 6/4/0 | $1.00/M
🌍 Kimi K2 0905 | cloud | 80 | 6/4/0 | $1.00/M
🇨🇳 Qwen3 30B A3B 2507 | cloud | 80 | 6/4/0 | $1.00/M
🌍 Nex-AGI N1 | cloud | 80 | 6/4/0 | $0.50/M
🇨🇳 DeepSeek V3.2 | cloud | 80 | 6/4/0 | $1.00/M
🌍 Cydonia 24B V4.1 | cloud | 80 | 6/4/0 | $1.00/M
🇨🇳 Qwen3 8B | cloud | 80 | 6/4/0 | $1.00/M
🇨🇳 Qwen3.6 Flash | cloud | 80 | 6/4/0 | $1.00/M
🇺🇸 Nemotron 3 Super (free) | cloud | 80 | 6/4/0 | $1.00/M
🌍 Owl Alpha | cloud | 80 | 6/4/0 | $1.00/M
🇨🇳 DeepSeek V3.2 Exp | cloud | 80 | 6/4/0 | $1.00/M
🇨🇳 Qwen 3.7 Plus | cloud | 80 | 6/4/0 | $1.00/M
🇨🇳 Qwen 3.5 Flash (v2) | cloud | 80 | 6/4/0 | $1.00/M
🇯🇵🇰🇷 Solar Pro v3 | cloud | 80 | 6/4/0 | $1.00/M
🇪🇺 Mistral Medium 3.1 | cloud | 80 | 6/4/0 | $1.00/M
🇺🇸 Amazon Nova Premier | cloud | 80 | 7/2/1 | $1.00/M
🇺🇸 Claude Sonnet 4.5 | cloud | 80 | 7/2/1 | $15.00/M
🇺🇸 Claude Opus 4.7 Fast | cloud | 80 | 7/2/1 | $150.00/M
🇺🇸 Amazon Nova Pro v1 | cloud | 80 | 7/2/1 | free
🌍 Seed 1.6 Flash | cloud | 80 | 7/2/1 | $0.30/M
🌍 Seed 1.6 | cloud | 80 | 7/2/1 | $1.00/M
🇨🇳 Qwen3 235B A22B 2507 | cloud | 80 | 7/2/1 | $1.00/M
🇺🇸 Cohere Command A | cloud | 80 | 7/2/1 | $10.00/M
🇺🇸 Claude Haiku 4.5 | cloud | 80 | 7/2/1 | $25.00/M
🇨🇳 MiniMax M2*[30] | cloud | 80 | 7/2/1 | $2.00/M
🇺🇸 Claude Opus 4.1 | cloud | 80 | 7/2/1 | $25.00/M
🇺🇸 Claude Opus 4 | cloud | 80 | 7/2/1 | $1.00/M
🌍 Mancer Weaver | cloud | 80 | 7/2/1 | $1.00/M
🇨🇳 Qwen Plus (2025-07-28) | cloud | 80 | 7/2/1 | $1.00/M
🇺🇸 GPT-5.4 Nano | cloud | 80 | 7/2/1 | $1.00/M
🇺🇸 Claude Haiku Latest | cloud | 80 | 7/2/1 | $1.00/M
🌍 Perceptron Mk1 | cloud | 80 | 7/2/1 | $1.00/M
🇺🇸 GPT-5.2 Chat | cloud | 80 | 7/2/1 | $1.00/M
🌍 Kimi K2 Thinking | cloud | 80 | 7/2/1 | $1.00/M
🌍 Voxtral Small 24B | cloud | 80 | 7/2/1 | $1.00/M
🇺🇸 GPT-OSS-120B | cloud | 80 | 7/2/1 | $1.00/M
🇪🇺 Mistral Medium 3 | cloud | 80 | 7/2/1 | $1.00/M
🇺🇸 Gemma 3 12B IT | cloud | 80 | 7/2/1 | $1.00/M
🇪🇺 Mistral Saba | cloud | 80 | 7/2/1 | $1.00/M
🇪🇺 Falcon3-Mamba-7B-4bits (4-bit 3.7GB) | local | 80 | 8/0/2 | local
🇺🇸 Grok Code Fast 1 | cloud | 80 | 8/0/2 | free
🌍 Seed 2.0 Lite | cloud | 80 | 8/0/2 | free
🇺🇸 Llama 4 Scout | cloud | 80 | 8/0/2 | $0.30/M
🌍 Aion 1.0 | cloud | 80 | 8/0/2 | $1.00/M
🇺🇸 Anthropic: Claude Opus 4.8 (Fast) | cloud | 80 | 8/1/1 | $31.00/M
🇪🇺 Mistral Large 3 | cloud | 80 | 7/2/1 | $0.80/M
🇪🇺 Falcon3-10B-Instruct-4bit (4-bit 5.3GB) | local | 79 | 7/0/1 | local
🇪🇺 Falcon3-3B-Instruct-4bit (4-bit 1.7GB) | local | 79 | 7/0/1 | local
🇺🇸 Gemma 4 26B A4B*[2] | cloud | 78 | 6/4/0 | $0.14/M
🇺🇸 Gemma 4 31B*[1] | cloud | 78 | 6/3/1 | $0.20/M
🌍 Palmyra X5 | cloud | 78 | 7/2/1 | $1.00/M
🇺🇸 Anthropic: Claude Opus 4.8 | cloud | 78 | 6/3/1 | $15.50/M
🇺🇸 Anthropic: Claude Opus Latest | cloud | 78 | 6/3/1 | $15.50/M
🇨🇳 Qwen3 8B 5-bit (Q5_K_M 5.5GB) | local | 78 | 7/1/2 | local
🇺🇸 Nemotron 3 Super | cloud | 78 | 7/3/0 | $0.90/M
🇺🇸 GPT-4.1 Mini | cloud | 78 | 7/3/0 | $0.60/M
🇺🇸 Llama 4 Maverick | cloud | 78 | 7/2/1 | $0.80/M
🇺🇸 Claude Opus 4.7 | cloud | 78 | 7/2/1 | $15.00/M
🇨🇳 Qwen: Qwen3.5 Plus 2026-02-15 | cloud | 78 | 7/2/1 | $1.54/M
🇨🇳 Qwen3 Coder Plus | cloud | 78 | 7/2/1 | $1.50/M
🇺🇸 Grok 3 Mini | cloud | 78 | 7/3/0 | $0.60/M
🇺🇸 inclusionAI Ling 2.6 | cloud | 77 | 4/6/0 | $1.30/M
🇪🇺 Mistral Nemo 12B 4-bit (4-bit 6.3GB) | local | 77 | 7/2/1 | local
🇺🇸 Google: Gemma 4 31B (free) | cloud | 77 | 6/3/1 | free
🇺🇸 Gemma 3n 2B (4-bit 2.6GB) | local | 77 | 7/1/2 | local
🇨🇳 Qwen 3.6 Plus*[3] | cloud | 77 | 6/4/0 | $0.81/M
🇺🇸 GPT-5.4 | cloud | 77 | 6/3/1 | $6.25/M
🇺🇸 GPT-4.1 Nano | cloud | 77 | 7/2/1 | $1.00/M
🇺🇸 OpenAI GPT Mini Latest | cloud | 77 | 6/3/1 | $2.69/M
🇺🇸 Anthropic Claude Sonnet Latest | cloud | 77 | 6/3/1 | $9.72/M
🇺🇸 Anthropic: Claude Opus 4.7 | cloud | 77 | 7/1/2 | $15.80/M
🇺🇸 Gemini 2.5 Flash*[4] | cloud | 76 | 5/5/0 | $0.96/M
🌍 Kimi K2.6*[5] | cloud | 75 | 5/5/0 | $1.57/M
🇺🇸 Ling 2.6 Flash | cloud | 75 | 5/5/0 | free
🇨🇳 Qwen3 Coder Next | cloud | 75 | 5/5/0 | $1.00/M
🇨🇳 Qwen3 Next 80B A3B | cloud | 75 | 5/5/0 | $1.00/M
🇺🇸 Claude Opus 4.5 | cloud | 75 | 5/5/0 | $25.00/M
🇺🇸 Grok 4.3 | cloud | 75 | 5/5/0 | $1.00/M
🇺🇸 Grok Build 0.1 | cloud | 75 | 5/5/0 | $1.00/M
🇺🇸 Grok 4.20 | cloud | 75 | 6/3/1 | $1.63/M
🇺🇸 Gemini 3 Flash | cloud | 75 | 6/3/1 | $1.00/M
🇺🇸 Gemini 2.0 Flash Lite | cloud | 75 | 6/3/1 | $0.30/M
🇺🇸 GPT-5.1 | cloud | 75 | 6/3/1 | $10.00/M
🇪🇺 Mistral Small 2603 | cloud | 75 | 6/3/1 | $0.60/M
🇨🇳 Devstral Medium | cloud | 75 | 6/3/1 | $2.00/M
🌍 Remm Slerp L2 13B | cloud | 75 | 6/3/1 | $1.00/M
🌍 GPT Chat Latest | cloud | 75 | 6/3/1 | $1.00/M
🇺🇸 Phi-4 Mini | cloud | 75 | 6/3/1 | $1.00/M
🇺🇸 GPT-5 Chat | cloud | 75 | 6/3/1 | $1.00/M
🇺🇸 Hermes 4 70B | cloud | 75 | 6/3/1 | $1.00/M
🇯🇵🇰🇷 Solar Pro 3 | cloud | 75 | 6/3/1 | $1.00/M
🌍 Kimi K2 | cloud | 75 | 6/3/1 | $1.00/M
🇨🇳 GLM 4 32B | cloud | 75 | 6/3/1 | $1.00/M
🇺🇸 Nemotron 3 Ultra | cloud | 75 | 6/3/1 | $1.00/M
🇨🇳 Qwen3 235B A22B | cloud | 75 | 6/3/1 | $1.00/M
🌍 Morph V3 Large | cloud | 75 | 7/1/2 | free
🌍 Jamba Large 1.7 | cloud | 75 | 7/1/2 | $3.00/M
🌍 Mercury 2 | cloud | 75 | 7/1/2 | $1.00/M
🇺🇸 GPT-5.2*[19] | cloud | 75 | 7/1/2 | free
🇨🇳 Devstral Small | cloud | 75 | 7/1/2 | $0.30/M
🌍 Morph V3 Fast | cloud | 75 | 7/1/2 | $1.20/M
🌍 Inflection 3 Productivity | cloud | 75 | 7/1/2 | $1.00/M
🇺🇸 Command R7B (Cohere) | cloud | 75 | 7/1/2 | $1.00/M
🇺🇸 Grok 4 Fast | cloud | 75 | 6/3/1 | $0.55/M
🇺🇸 Grok 4.1 Fast | cloud | 75 | 6/2/2 | $0.29/M
🇺🇸 OpenAI: GPT-5.3 Chat | cloud | 75 | 6/2/2 | $10.97/M
🇺🇸 Gemini 3.1 Flash Lite | cloud | 75 | 5/5/0 | free
🇺🇸 Grok 4.20 Multi-Agent | cloud | 74 | 5/5/0 | $15.00/M
🇨🇳 Qwen2.5 0.5B (4-bit 0.4GB) | local | 74 | 7/1/2 | local
🇺🇸 OpenAI: GPT-5.4 Mini | cloud | 73 | 6/3/1 | $2.68/M
🇺🇸 OpenAI: GPT-5.4 Image 2 | cloud | 73 | 6/2/2 | $9.03/M
🇺🇸 Llama 3.2 1B (4-bit 0.8GB) | local | 73 | 6/1/3 | local
🇺🇸 SmolLM2 1.7B (4-bit 1.0GB) | local | 71 | 6/0/4 | local
🇺🇸 Gemini 2.0 Flash | cloud | 70 | 5/4/1 | $0.40/M
🇺🇸 Hermes 4 405B | cloud | 70 | 5/4/1 | $1.00/M
🇨🇳 Qwen Plus 0728 (thinking) | cloud | 70 | 5/4/1 | $1.00/M
🇺🇸 GPT-5.2 Codex*[18] | cloud | 70 | 6/2/2 | free
🇺🇸 GPT-5.1 Codex | cloud | 70 | 6/2/2 | $10.00/M
🇨🇳 DeepSeek Chat V3.1*[33] | cloud | 70 | 6/2/2 | $0.79/M
🇨🇳 Qwen3 30B A3B Thinking 2507 | cloud | 70 | 6/2/2 | $1.00/M
🇨🇳 MiniMax M2.7*[6] | cloud | 70 | 7/1/2 | $0.50/M
🇨🇳 DeepSeek R1 0528 | cloud | 68 | 6/2/2 | $8.00/M
🇨🇳 Xiaomi MiMo V2.5 Pro | cloud | 68 | 7/0/3 | $1.60/M
🇨🇳 MiniMax M2.5 | cloud | 65 | 5/3/2 | free
🇺🇸 GPT-5.1 Codex Mini*[20] | cloud | 65 | 5/3/2 | free
🇨🇳 GLM 4.6 | cloud | 65 | 5/3/2 | $1.00/M
🇨🇳 Qwen3 Next 80B | cloud | 65 | 5/3/2 | $1.00/M
🇺🇸 inclusionAI Ring 2.6 | cloud | 65 | 6/1/3 | free
🇺🇸 Nemotron 3 Nano 30B (free) | cloud | 65 | 6/1/3 | $1.00/M
🇺🇸 Gemma 4 E4B 5-bit*[15] (5-bit 5.1GB) | local | 64 | 6/1/3 | local
🇪🇺 Falcon3-1B-Instruct-4bit (4-bit 0.9GB) | local | 62 | 6/0/2 | local
🇺🇸 GPT-5.5*[7] | cloud | 60 | 5/2/3 | $12.50/M
🇨🇳 DeepSeek V4 Flash | cloud | 60 | 4/3/3 | $0.18/M
🇨🇳 MiniMax M2.1*[29] | cloud | 60 | 5/2/3 | $2.00/M
🇺🇸 OpenAI GPT Latest | cloud | 60 | 5/2/3 | $21.08/M
🇪🇺 Mistral Small 3.1 24B | cloud | 60 | 5/2/3 | $1.00/M
🌍 Reka Edge | cloud | 60 | 5/2/3 | $0.10/M
🇺🇸 GPT-5.3 Codex*[17] | cloud | 55 | 5/1/4 | free
🇺🇸 GPT-OSS-20B | cloud | 55 | 5/1/4 | $1.00/M
🇺🇸 GPT-5.4 Pro | cloud | 52 | 5/1/4 | $75.00/M
🌍 Ring 2.6 1T*[26] | cloud | 50 | 4/2/4 | $0.62/M
🇺🇸 Gemma 4 26B A4B (free) | cloud | 50 | 4/2/4 | $1.00/M
🌍 Laguna XS.2 (free) | cloud | 50 | 4/2/4 | $1.00/M
🇨🇳 GLM 4.5 Air | cloud | 50 | 5/0/5 | $1.00/M
🌍 DeltaCoder 9B 5-bit*[16] (Q5_K_M 6.1GB) | local | 47 | 4/1/5 | local
🌍 Kimi K2.7 Code | cloud | 45 | 4/1/5 | $1.00/M
🇺🇸 GPT-5.5 Pro*[8] | cloud | 43 | 4/1/5 | $75.00/M
🇨🇳 Xiaomi: MiMo-V2-Omni | cloud | 42 | 4/1/5 | $1.22/M
🇨🇳 Qwen: Qwen3.5 397B A17B | cloud | 40 | 2/3/5 | $2.85/M
🇨🇳 Qwen 3 32B | cloud | 40 | 3/2/5 | $1.00/M
🇨🇳 MiniMax M3 | cloud | 40 | 4/0/6 | $1.00/M
🌍 Nex-N2-Pro (free) | cloud | 40 | 4/0/6 | $1.00/M
🇨🇳 Qwen: Qwen3.5-27B | cloud | 38 | 2/3/5 | $1.68/M
🇺🇸 Grok 4 | cloud | 38 | 3/2/5 | $1.00/M
🇨🇳 DeepSeek V4 Pro | cloud | 38 | 4/0/6 | $0.57/M
🇨🇳 MiniMax M1 | cloud | 35 | 2/3/5 | free
🇺🇸 Google Gemini Flash Latest | cloud | 33 | 0/3/7 | $8.03/M
🌍 MoonshotAI: Kimi K2.6 (free) | cloud | 33 | 3/1/6 | free
🇨🇳 Qwen 3.6 35B A3B*[27] | cloud | 30 | 2/2/6 | $1.00/M
🇨🇳 MiMo-V2.5 | cloud | 30 | 2/2/6 | $1.00/M
🇺🇸 GPT-5 Codex*[23] | cloud | 30 | 3/0/7 | $10.00/M
🌍 MoonshotAI Kimi Latest | cloud | 30 | 3/0/7 | $3.69/M
🇺🇸 Gemini 3 Pro Image*[36] | cloud | 30 | 3/0/7 | $1.00/M
🇨🇳 GLM 4.5 | cloud | 30 | 3/0/7 | $1.00/M
🇨🇳 DeepSeek-R1 1.5B (4-bit 1.0GB) | local | 28 | 2/1/7 | local
🇺🇸 Google Gemini Pro Latest | cloud | 27 | 0/2/8 | $10.70/M
🇨🇳 Qwen3.5 0.8B (4-bit 0.5GB) | local | 26 | 2/1/7 | local
🇨🇳 Tencent HY3 | cloud | 25 | 2/1/7 | $1.00/M
🇺🇸 Gemini 3.5 Flash*[28] | cloud | 25 | 2/1/7 | $9.00/M
🇨🇳 Qwen 3 30B A3B | cloud | 25 | 2/1/7 | $1.00/M
🇨🇳 Xiaomi MiMo V2 Pro*[12] | cloud | 24 | 2/1/7 | $2.50/M
🇨🇳 Qwen: Qwen3.5-122B-A10B | cloud | 22 | 1/2/7 | $2.43/M
🌍 StepFun 3.5 Flash | cloud | 20 | 2/0/8 | $0.60/M
🌍 Pareto Code Router | cloud | 18 | 1/1/8 | $1.37/M
🇨🇳 GLM 5.1 | cloud | 15 | 1/1/8 | free
🇨🇳 GLM 4.7*[32] | cloud | 15 | 1/1/8 | $2.00/M
🇨🇳 Qwen: Qwen3.6 27B | cloud | 13 | 0/2/8 | $2.77/M
🇺🇸 GPT-5.1 Codex Max*[22] | cloud | 10 | 0/2/8 | $10.00/M
🇨🇳 DeepSeek V3.2 Speciale*[11] | cloud | 10 | 1/0/9 | $1.50/M
🌍 Intellect 3 | cloud | 10 | 1/0/9 | $2.00/M
🇺🇸 Nemotron 3 Nano Omni (free) | cloud | 10 | 1/0/9 | $1.00/M
🌍 StepFun: Step 3.7 Flash | cloud | 10 | 1/0/9 | $0.97/M
🇨🇳 Z.ai: GLM 5 Turbo | cloud | 10 | 1/0/9 | $3.61/M
🇨🇳 GLM-5 | cloud | 10 | 1/0/9 | $1.00/M
🇺🇸 LFM2.5 1.2B Thinking (free) | cloud | 10 | 1/0/9 | $1.00/M
🌍 Perplexity Sonar Reasoning Pro | cloud | 10 | 1/0/9 | $1.00/M
🇺🇸 OpenAI o3 | cloud | 10 | 1/0/9 | $1.00/M
🇺🇸 GPT-5 Mini*[25] | cloud | 5 | 0/1/9 | $2.00/M
🇨🇳 Qwen 3.5 35B MoE*[34] | cloud | 5 | 0/1/9 | $1.00/M
🇨🇳 Qwen3 235B A22B Thinking 2507 | cloud | 5 | 0/1/9 | $1.00/M
🌍 OLMo 3 32B Think | cloud | 0 | 0/0/10 | $1.50/M
🌍 Reka Flash 3 | cloud | 0 | 0/0/10 | $1.00/M
🇺🇸 GPT-5 Nano*[21] | cloud | 0 | 0/0/10 | free
🇺🇸 GPT-5*[24] | cloud | 0 | 0/0/10 | $10.00/M
🇨🇳 GLM 4.7 Flash*[31] | cloud | 0 | 0/0/10 | $2.00/M
🇨🇳 Qwen3 14B | cloud | 0 | 0/0/10 | $1.00/M
🌍 Trinity Large Thinking | cloud | 0 | 0/0/10 | $1.00/M
🇨🇳 GLM 5V Turbo | cloud | 0 | 0/0/10 | $1.00/M
🇨🇳 DeepSeek V4 Flash (free) | cloud | 0 | 0/0/10 | $1.00/M
🌍 Laguna M.1 (free) | cloud | 0 | 0/0/10 | $1.00/M
🇨🇳 Qwen 3.5 9B | cloud | 0 | 0/0/10 | $1.00/M
🌍 Arcee Trinity Mini | cloud | 0 | 0/0/10 | $1.00/M
🇺🇸 Nemotron Nano 9B v2 | cloud | 0 | 0/0/10 | $1.00/M
🇺🇸 Nemotron Super 49B*[35] | cloud | 0 | 0/0/10 | $1.00/M
🌍 Arcee Coder Large | cloud | 0 | 0/0/10 | $1.00/M
🇺🇸 Dolphin Mistral 24B (free) | cloud | 0 | 0/0/10 | $1.00/M
🌍 Arcee Virtuoso Large | cloud | 0 | 0/0/10 | $1.00/M
🇨🇳 Qwen3 Next 80B (free) | cloud | 0 | 0/0/10 | $1.00/M
🇨🇳 DeepSeek R1 Distill Qwen 32B | cloud | 0 | 0/0/10 | $1.00/M
🇺🇸 OpenAI o4 Mini High | cloud | 0 | 0/0/10 | $1.00/M
🇨🇳 Qwen3 Coder 480B (free) | cloud | 0 | 0/0/10 | $1.00/M

Agent Readiness

Tool-calling reliability — can the model actually function as an agent? 6 tests: single tool, multi-tool, required mode, false positive avoidance, multi-turn chaining, argument correctness.

Model | Score | Single | Multi | Required | No FP | Chain | Args
SmolLM3-3B | 50%
Phi-4-mini | 17%
gemma4:e4b | 100%
Qwen3-4B-Function-Calling-xLAM | 83%

SmolLM3-3B and Phi-4-mini score 90%+ on code quality, yet manage only 50% and 17% here. Code quality ≠ agent capability.
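
With six tests, each readiness percentage corresponds to a whole number of passes out of six. The sketch below shows that arithmetic; the test keys and the `readiness` function are hypothetical names, equal weighting of the six tests is an assumption, and the example pass/fail patterns are made up for illustration (the page does not say which tests each model passed).

```python
# Hypothetical keys for the six tests listed above.
TESTS = ("single", "multi", "required", "no_fp", "chain", "args")


def readiness(results: dict) -> int:
    """Percentage of the six tool-calling tests passed, rounded to an integer."""
    missing = [t for t in TESTS if t not in results]
    if missing:
        raise ValueError(f"missing test results: {missing}")
    passed = sum(1 for t in TESTS if results[t])
    return round(100 * passed / len(TESTS))


# Illustrative pass/fail patterns only, not measured data.
three_of_six = dict(single=True, multi=True, required=True,
                    no_fp=False, chain=False, args=False)
one_of_six = dict(single=True, multi=False, required=False,
                  no_fp=False, chain=False, args=False)

print(readiness(three_of_six))  # 3 of 6 passed
print(readiness(one_of_six))    # 1 of 6 passed
```

Under this assumption, 50% is 3 of 6, 17% is 1 of 6, 83% is 5 of 6, and 100% is all six.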

Caveats

  1. Gemma 4 31B, Gemma 4 26B A4B: Requires instruct-tuned variant on API calls.
  2. Qwen 3.6 Plus: OpenRouter rate-limiting adds wait time; per-request latency is lower.
  3. Gemini 2.5 Flash: Free-tier; may be rate-limited during concurrent runs.
  4. Kimi K2.6: Performs better when reasoning overhead is disabled.
  5. MiniMax M2.7: Mandatory reasoning step adds overhead before each task output.
  6. GPT-5.5, GPT-5.5 Pro: Scores below GPT-5.4; not recommended for reliable agent work yet.
  7. Google Lyria 3 Pro, Google Lyria 3 Clip: Preview/experimental models.
  8. DeepSeek V3.2 Speciale: Output capped at ~400 tokens — structural limitation, not parameter-sensitive. Can't do agent coding at tight budgets.
  9. Xiaomi MiMo V2 Pro: Unreliable API — 7/10 tasks hit 500 errors. Retested, same result. Not parameter-sensitive.
  10. Gemini 2.5 Pro Preview: Preview model — 65.0% at 2000 max_tokens with thinking_mode:False. May improve at higher token budgets.
  11. Gemini 2.5 Pro Preview 05-06: Preview model — 60.0% at 2000 max_tokens with thinking_mode:False. May improve at higher token budgets.
  12. Gemma 4 E4B 5-bit: Local model. 100% agent readiness. Best local for tool calling.
  13. DeltaCoder 9B 5-bit: Code score via llama-server methodology differs from Ollama — DeltaCoder defaults to explanatory answers under enable_thinking=False, scoring lower on pattern-matching despite correct solutions.
  14. GPT-5.3 Codex: 3/10 empty responses (30%); a regression from GPT-5.2 Codex (55% vs 70%).
  15. GPT-5.2 Codex, GPT-5.1 Codex Mini: 2/10 empty responses
  16. GPT-5.2: 1/10 empty responses
  17. GPT-5 Nano: 100% empty responses — not viable for agent coding
  18. GPT-5.1 Codex Max: 8/10 empty responses at 300 max_tokens. The -max variant underperforms base GPT-5.1 (75%) at tight budgets.
  19. GPT-5 Codex: 7/10 empty or failed responses at 300 tokens. Worse than GPT-5.1 Codex (70%) and GPT-5.1 (75%).
  20. GPT-5: 10/10 empty responses at 300 max_tokens — not viable for agent coding at tight budgets. Different league from GPT-5.1 (75%).
  21. GPT-5 Mini: 9/10 nearly empty responses at 300 max_tokens — only sql-query partially completed.
  22. Ring 2.6 1T: 50% score — 4 hard fails (fix-bug, regex-extract, process-monitor, json-schema-validate). Less reliable than Ring 2.6 base (65%).
  23. Qwen 3.6 35B A3B: MoE variant — 30% vs 76.6% for Qwen 3.6 Plus. enable_thinking: False applied, but MoE routing may degrade at 300 tokens.
  24. Gemini 3.5 Flash: Very verbose — output cut off at 600 max_tokens. May need 1000+ tokens.
  25. MiniMax M2.1, MiniMax M2: Mandatory reasoning; needs 2000+ tokens, 40% failure rate.
  26. GLM 4.7 Flash, GLM 4.7: Most tasks returned empty responses at 300 max_tokens.
  27. DeepSeek Chat V3.1: Worse than experimental (85%) and Terminus (80%) variants — skip for benchmarks
  28. Qwen 3.5 35B MoE: API: empty response on 9/10 tasks with no error — model ID may be deprecated on OpenRouter.
  29. Nemotron Super 49B: API: empty response on all 10 tasks with no error — model ID may be deprecated on OpenRouter.
  30. Gemini 3 Pro Image: Image-preview model — poor at text-only coding tasks. 7/10 tasks returned empty. Expected behaviour for vision-focused model.
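
Several caveats above report counts such as "8/10 empty responses". A tally like that could be produced along these lines. This is a sketch under assumptions, not the benchmark's harness: the function names are hypothetical, and treating `None` or whitespace-only output as "empty" is a guess at the definition.

```python
from typing import List, Optional, Tuple


def empty_count(responses: List[Optional[str]]) -> Tuple[int, int]:
    """Return (empty, total), counting None or whitespace-only text as empty."""
    empty = sum(1 for r in responses if r is None or not r.strip())
    return empty, len(responses)


def describe(responses: List[Optional[str]]) -> str:
    """Format the tally in the same style the caveats use."""
    empty, total = empty_count(responses)
    return f"{empty}/{total} empty responses"


# Made-up run: two tasks produced code, eight came back empty.
run = ["def add(a, b):\n    return a + b", "SELECT 1;"] + [""] * 7 + [None]
print(describe(run))
```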
