Agent Benchmarks

Real agent coding benchmarks + MCP server quality. Updated hourly.

Cloud LLM Coding Benchmark (10 agent tasks)

Updated: 2026-05-09T13:57:41

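| Model | Tier | Score | Results | Cost | Time |
| --- | --- | --- | --- | --- | --- |
| Claude Sonnet 4 | premium | 83.30 | 7/10 passed | $0.0191 | 23s |
| Gemma 4 31B | premium | 80.00 | 6/10 passed | $0.0005 | 118s |
| Gemma 4 26B A4B | premium | 78.30 | 6/10 passed | $0.0005 | 66s |
| Mistral Large 3 | premium | 78.00 | 6/10 passed | $0.0018 | 18s |
| Qwen 3.6 Plus | budget | 76.60 | 6/10 passed | $0.0609 | 574s |
| Gemini 2.5 Flash | free-tier | 76.40 | 5/10 passed | $0.0037 | 12s |
| Kimi K2.6 | premium | 75.00 | 5/10 passed | $0.0051 | 40s |
| GPT-5.4 | premium | 74.90 | 6/10 passed | $0.0153 | 19s |
| MiniMax M2.7 | premium | 60.00 | 6/10 passed | $0.0190 | 137s |
| GPT-5.5 | premium | 58.30 | 5/10 passed | $0.0655 | 67s |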

MCP Server Quality

Updated: 2026-05-09T14:16:05

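| Server | Status | Score | Stars | Issues | Updated |
| --- | --- | --- | --- | --- | --- |
| playwright-mcp | ● live | 80 | 32,252 | 4 | 2026-05-09 |
| mcp-git | ● live | 73 | 8,048 | 67 | 2026-05-09 |
| github-mcp-server | ● live | 60 | 29,649 | 329 | 2026-05-09 |
| fastmcp | ● live | 60 | 25,089 | 248 | 2026-05-09 |
| awesome-mcp-servers | ● live | 60 | 86,537 | 1130 | 2026-05-09 |
| mcp-pandoc | ● live | 43 | 534 | 6 | 2026-05-07 |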

Bonsai 1-bit Quantization (local, same 10 tasks)

Updated: 2026-05-09T09:35:55

| Model | Bits | Size | Score | Results | Time |
| --- | --- | --- | --- | --- | --- |
| Qwen 3.5 9B (4-bit) | 4-bit | ~5GB | 83.0 | 7/10 passed | 190s |
| AgenticQwen 8B (4-bit) | 4-bit | ~5GB | 81.5 | 8/10 passed | 189s |
| Bonsai 4B (1-bit) | 1-bit | 545MB | 79.9 | 7/10 passed | 18s |
| Ternary Bonsai 1.7B (2-bit) | 2-bit (ternary) | 442MB | 79.9 | 7/10 passed | 10s |
| Bonsai 8B (1-bit) | 1-bit | 1.1GB | 79.8 | 8/10 passed | 15s |
| Ternary Bonsai 4B (2-bit) | 2-bit (ternary) | 1.0GB | 79.6 | 7/10 passed | 20s |
| Ternary Bonsai 8B (2-bit) | 2-bit (ternary) | 2.1GB | 78.2 | 7/10 passed | 22s |
| Bonsai 1.7B (1-bit) | 1-bit | 237MB | 73.4 | 4/10 passed | 8s |
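
One way to read the local results is to ask which model is the smallest that still clears a given pass count. The sketch below is illustrative only: the rows are transcribed from the table above, while `size_mb`, `smallest_passing`, and the 1 GB = 1000 MB conversion are assumptions made for this example and are not part of the benchmark harness.

```python
import re

# Rows transcribed from the local benchmark table:
# (model, size, score, tasks passed out of 10, time in seconds).
ROWS = [
    ("Qwen 3.5 9B (4-bit)", "~5GB", 83.0, 7, 190),
    ("AgenticQwen 8B (4-bit)", "~5GB", 81.5, 8, 189),
    ("Bonsai 4B (1-bit)", "545MB", 79.9, 7, 18),
    ("Ternary Bonsai 1.7B (2-bit)", "442MB", 79.9, 7, 10),
    ("Bonsai 8B (1-bit)", "1.1GB", 79.8, 8, 15),
    ("Ternary Bonsai 4B (2-bit)", "1.0GB", 79.6, 7, 20),
    ("Ternary Bonsai 8B (2-bit)", "2.1GB", 78.2, 7, 22),
    ("Bonsai 1.7B (1-bit)", "237MB", 73.4, 4, 8),
]


def size_mb(text):
    """Convert a size string such as '~5GB' or '545MB' to megabytes.

    Assumes 1 GB = 1000 MB; the table does not say which convention it uses.
    """
    match = re.fullmatch(r"~?(\d+(?:\.\d+)?)(MB|GB)", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    value, unit = float(match.group(1)), match.group(2)
    return value * 1000 if unit == "GB" else value


def smallest_passing(rows, min_passed):
    """Return the smallest row (by size) that passed at least min_passed tasks."""
    candidates = [row for row in rows if row[3] >= min_passed]
    return min(candidates, key=lambda row: size_mb(row[1]))


if __name__ == "__main__":
    for min_passed in (7, 8):
        model, size, score, passed, seconds = smallest_passing(ROWS, min_passed)
        print(f">= {min_passed}/10: {model} ({size}, score {score}, {seconds}s)")
```

With these rows, the smallest model passing at least 7 tasks is Ternary Bonsai 1.7B (2-bit), and the smallest passing at least 8 is Bonsai 8B (1-bit).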

Benchmarks refresh hourly via cron.
