*By Vilius Vystartas | May 2026*
Three models scored under 15% in my first benchmark run. Kimi K2.5: 10%. MiniMax M2.5: 15%. Gemma 4: HTTP 400 on every call. I almost excluded them as broken. They weren't broken — I was calling them wrong.
Here's what happened and how to avoid it when benchmarking your own models.
Kimi K2.5 (10%): Every response was empty. The model returned exactly 300 tokens of nothing. `finish_reason: length`: it ran out of budget before producing any visible output.
MiniMax M2.5 (15%): Same pattern. One task ran for 88 minutes and consumed 98,000 tokens before I killed it.
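A per-task wall-clock cap would have saved those 88 minutes. A minimal sketch, assuming each task is a synchronous API call; `call_model` is a placeholder for your own client, not part of any real harness:

```python
import concurrent.futures

def run_with_timeout(call_model, task, timeout_s=300):
    """Run one benchmark task, giving up after timeout_s seconds."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(call_model, task)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The API call keeps running in its worker thread,
        # but the harness records a failure and moves on.
        return {"error": "timeout", "task": task}
    finally:
        pool.shutdown(wait=False, cancel_futures=True)
```

Recording the timeout as its own failure mode (rather than a zero score) also makes runs like this one easier to diagnose afterwards.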
Gemma 4: Every request returned HTTP 400. Wrong model ID, wrong parameter name: `include_thinking` doesn't exist for Gemma.
These models enable internal chain-of-thought reasoning by default. Every request burns tokens thinking silently before producing output. At 300 `max_tokens`, there's nothing left for the actual answer.
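This failure mode is easy to detect mechanically. A minimal sketch, assuming an OpenAI-compatible chat completions response shape; the field names follow that convention, not any specific vendor's:

```python
def starved_by_thinking(response: dict) -> bool:
    """True when the model hit its token cap with no visible output:
    the signature of reasoning tokens eating the whole budget."""
    choice = response["choices"][0]
    content = choice["message"].get("content") or ""
    return choice["finish_reason"] == "length" and content.strip() == ""

# The empty runs described above look like this:
empty_run = {"choices": [{"finish_reason": "length",
                          "message": {"content": ""}}]}
```

Flagging these separately from genuine wrong answers is what turns "10%, must be broken" into "10%, misconfigured".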
The fix parameters are different for each model family:
- Kimi K2.5: `reasoning: {"effort": "none"}` disables internal reasoning entirely, 0 reasoning tokens.
- MiniMax M2.5: `include_reasoning: false` hides the thinking from the output, but the model still burns ~400 tokens internally. Needs a 2000 `max_tokens` budget.
- Gemma 4: `include_reasoning: false`, and the model ID needs the `-it` suffix (`-a4b` for the 26B variant).

Kimi K2.5 went from 10% to 75%. MiniMax M2.5 from 15% to 60%. Gemma 4 31B from "HTTP 400" to 80%: second place overall.
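Wired into a harness, the per-family fixes above amount to a small table of request overrides. A sketch, assuming OpenAI-compatible request payloads; the family keys and the `build_request` helper are mine, and only the parameter names come from the runs above:

```python
# Per-family overrides; the parameter names are the ones that worked above.
REASONING_OVERRIDES = {
    "kimi":    {"reasoning": {"effort": "none"}},  # 0 reasoning tokens
    "minimax": {"include_reasoning": False,
                "max_tokens": 2000},               # still burns ~400 internally
    "gemma":   {"include_reasoning": False},       # plus the -it model ID suffix
}

def build_request(family: str, model: str, messages: list) -> dict:
    payload = {"model": model, "messages": messages, "max_tokens": 300}
    payload.update(REASONING_OVERRIDES.get(family, {}))  # may raise the budget
    return payload
```

Keeping the overrides in one table per family, rather than scattered through the harness, also makes the next misbehaving model a one-line fix.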
On the 6 tasks it does complete, M2.5 scores 97.2%, higher than Claude Sonnet 4. It is the best model on this benchmark when it works. The problem: it fails 40% of the time. Mandatory internal reasoning can't be disabled, so the output budget gets consumed before anything appears. It's a brilliant model you can't rely on.
Two symptoms to check before writing a model off:

- `finish_reason: length` plus empty content → thinking mode is eating your token budget. Try `reasoning: {"effort": "none"}` or `include_reasoning: false`.
- HTTP 400 → check the model ID for missing `-it`, `-a4b`, or `-preview` suffixes.

I wasted a morning debugging parameters that should be documented. If you're benchmarking models for your own agent stack, save yourself the time: check the reasoning config first.
_Full benchmark results with all 18 models at benchmarks.workswithagents.dev. Updated nightly._