2026-06-02

The Harness Isn't Shrinking, It's Migrating

*By Vilius Vystartas, June 2026*

The Current Narrative

A growing chorus argues that the "agent harness is shrinking." Thorsten Ball, building Amp Code, put it sharply: each generation of LLM makes more of his orchestration logic redundant. Models get better at tool use, planning, and context understanding — so the scaffolding around them gets thinner. Prompt templates shrink. Context window management goes from art to checkbox. Multi-step planners become unnecessary when the model can plan itself.

It sounds right. And for the cognitive harness — the part that tells the model what to do — it is right. But there's a second harness that's doing the opposite.

Two Harnesses

There are actually two layers of scaffolding in any agent system:

| | Cognitive Harness | Operational Harness |
|---|---|---|
| What it does | Prompt engineering, tool planning, context management | Security, routing, governance, cost control, observability |
| Trend | Shrinking (models absorb it) | Growing (autonomy demands it) |
| Example | "Write a bash one-liner to find large .py files" (the model just does this now) | "Which model should handle this task, at what cost, with what oversight?" |

The first shrinks with every model release. The second grows with every deployment.

I've been running a daily LLM agent coding benchmark on workswithagents.dev since May 2026. We've tested 100+ models. The pattern is unmistakable: even budget models — Phi-4 Mini at $0.08/M tokens, Granite 4.0 Micro at $0.017/M — can handle 10 real-world agent coding tasks with 70%+ pass rates. A year ago these scores required premium models.

This is good news. But it doesn't make infrastructure irrelevant. It shifts where infrastructure matters.

What Shrinks

Prompt templates. The old pattern was a 2000-line system prompt that did all the work. Models now understand intent from sparse instructions. Hermes Agent's system prompt shrank ~40% between v1 and v2 without losing capability.

Tool call formatting. Models no longer need explicit XML schemas or strict JSON templates. Simple function definitions work.

Context window budgeting. When a model handles 128K+ tokens natively, the compression layer that was critical a year ago becomes optional. Hermes' compression module still helps, but it's no longer load-bearing.

Multi-step decomposition. Models can plan and execute without being told how to sequence. The agent loop stays, but the planner that used to be a separate component is now a single model call.

What Grows

Security boundaries. Autonomous agents that act on your filesystem, email, and API keys need sandboxing, least-privilege, and audit trails that static tools never needed. Every new model capability creates a new attack surface. Hermes' command allowlist, approval gates, and approvals.mode are not shrinking — they're getting more granular.
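As a rough illustration of what a command allowlist can look like, here is a minimal sketch. It is not Hermes' implementation: the `ALLOWED` table, the `is_allowed` function, and the rule of rejecting any shell metacharacter are all assumptions made for this example.

```python
import shlex

# Hypothetical allowlist: executable -> permitted subcommands (None = any arguments).
ALLOWED = {
    "ls": None,
    "cat": None,
    "git": {"status", "diff", "log"},
}

# Characters that could chain or redirect commands; rejected outright (conservative).
SHELL_METACHARS = set(";|&><`$\n")


def is_allowed(command: str) -> bool:
    """Return True only if the executable (and subcommand, if restricted) is allowlisted."""
    if SHELL_METACHARS & set(command):
        return False
    try:
        argv = shlex.split(command)
    except ValueError:
        return False  # unbalanced quotes: refuse rather than guess
    if not argv or argv[0] not in ALLOWED:
        return False
    allowed_subcommands = ALLOWED[argv[0]]
    if allowed_subcommands is None:
        return True
    return len(argv) > 1 and argv[1] in allowed_subcommands
```

The check is deny-by-default: anything not explicitly listed is refused, which is the property that lets the list get more granular over time without widening access by accident.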

Multi-model routing. When every model is competent, the question shifts from "can it do the task?" to "which provider should handle this specific task at the best cost/quality ratio?" This is a routing problem. It didn't exist when only one or two models could do the job. Now it's a core infrastructure function.
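One simple way to frame the routing decision is "the cheapest model that clears a quality bar for this task type." The sketch below assumes you already have per-model pass rates and per-task costs from your own evaluation; `ModelOption`, `pick_model`, and the fallback rule are hypothetical names and choices, not part of any particular router.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    pass_rate: float      # fraction of benchmark tasks passed for this task type
    cost_per_task: float  # USD per task


def pick_model(options: list[ModelOption], min_pass_rate: float) -> ModelOption:
    """Pick the cheapest model meeting the quality bar.

    If no model clears the bar, fall back to the one with the highest pass rate.
    """
    eligible = [m for m in options if m.pass_rate >= min_pass_rate]
    if eligible:
        return min(eligible, key=lambda m: m.cost_per_task)
    return max(options, key=lambda m: m.pass_rate)
```

Raising `min_pass_rate` for risky or complex task types pushes them toward premium models, while routine tasks settle on whatever is cheapest among the competent options.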

Cost governance. Gemini 2.5 Flash costs $0.008 for 10 coding tasks. GPT-5.5 costs $0.104 for the same. Both complete the work. The difference isn't capability — it's cost awareness. A harness that routes cheap tasks to budget models and complex ones to premium models pays for itself in a week.
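To make the arithmetic concrete, the sketch below uses the two per-10-task costs quoted above and computes a blended bill for a given split. The task volume and the share routed to the premium model are illustrative assumptions, not measurements.

```python
FLASH_COST_PER_10 = 0.008  # USD for 10 tasks, Gemini 2.5 Flash (figure from the text)
GPT_COST_PER_10 = 0.104    # USD for 10 tasks, GPT-5.5 (figure from the text)


def blended_cost(n_tasks: int, premium_share: float) -> float:
    """Total USD when `premium_share` of tasks go to the premium model, the rest to budget."""
    premium = n_tasks * premium_share * GPT_COST_PER_10 / 10
    budget = n_tasks * (1 - premium_share) * FLASH_COST_PER_10 / 10
    return premium + budget


if __name__ == "__main__":
    n = 10_000  # hypothetical monthly task volume
    for share in (1.0, 0.2, 0.0):
        print(f"premium share {share:.0%}: ${blended_cost(n, share):.2f}")
```

The per-task price gap between the two models is 13x, so even a coarse split that sends most routine work to the budget model removes most of the bill.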

Observability and evaluation. When agents act autonomously, you can't just read their output. You need traces, cost tracking, failure classification, and regression detection. The workswithagents.dev (WWA) benchmark infrastructure caught models returning empty content with no error (Pitfall #19 in our registry). Without observability, that model would look like it scored 0% on the tasks, when the real problem was a broken API integration.
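A minimal sketch of failure classification that separates an empty response from a wrong answer. The `Outcome` categories and the `classify` signature are assumptions for illustration; they are not the WWA registry's actual schema.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    PASS = "pass"
    WRONG_ANSWER = "wrong_answer"
    EMPTY_RESPONSE = "empty_response"
    API_ERROR = "api_error"


def classify(response_text: Optional[str], error: Optional[str], passed: bool) -> Outcome:
    """Classify one task run so integration failures are not counted as model failures."""
    if error is not None:
        return Outcome.API_ERROR
    if response_text is None or not response_text.strip():
        # No error but no content: likely an integration problem, not a bad answer.
        return Outcome.EMPTY_RESPONSE
    return Outcome.PASS if passed else Outcome.WRONG_ANSWER
```

Reporting `EMPTY_RESPONSE` separately from `WRONG_ANSWER` is what turns "this model scored zero" into "this integration is broken."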

Human review gates. Autonomous ≠ unsupervised. The pattern is shifting from "the model follows every instruction" to "the model executes, flags exceptions, and asks for approval on risky actions." Hermes' approval system (approvals.mode: manual, cron_mode: deny) is getting more use, not less.
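The "execute, flag exceptions, ask on risky actions" pattern can be sketched as a small gate function. Everything here is hypothetical: the `RISKY_PREFIXES` rule, the `gate` function, and the `"manual"`/`"auto"` mode semantics are assumptions loosely inspired by the setting names above, not Hermes' actual behavior.

```python
from typing import Callable

# Hypothetical rule for what counts as risky; a real system would be far more careful.
RISKY_PREFIXES = ("rm ", "ssh ", "curl ", "git push")


def gate(action: str, mode: str, interactive: bool, ask: Callable[[str], bool]) -> bool:
    """Return True if the action may run.

    Safe actions always run. Risky actions are denied when no human is present
    (for example a scheduled run), and need explicit approval in "manual" mode.
    """
    if not action.startswith(RISKY_PREFIXES):
        return True
    if not interactive:
        return False
    if mode == "manual":
        return ask(action)
    return True  # any other mode (e.g. "auto"): assumed opt-out of review
```

In an interactive session, `ask` could wrap a terminal prompt; in a scheduled job, `interactive=False` means risky actions are refused without ever calling it.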

The Data

From 100+ models tested on the WWA agent coding benchmark:

  • May 2025: Only Claude and GPT-4 hit 80%+. Premium models only. Average score: ~55%.
  • November 2025: 5 models above 80%. Budget category emerging at ~60%.
  • May 2026: 30+ models above 70%. Budget models (sub-$0.10/M tokens) hit 75%+. Average score across all models: ~68%.

The cognitive harness shrank from "must use premium models with custom prompts" to "any model with a decent system prompt."

The operational harness grew from "one model, one API key" to "100+ model options, cost/quality routing, anomaly detection, auto-publish."

We tested this directly: a 48-LOC minimal agent loop (just model calls plus a pattern-matching scorer) against Hermes Agent's full 7,300-LOC loop. Same 10 coding tasks. Two models: Claude Sonnet 4 and Gemini 2.5 Flash. The score difference across all tasks: -0.7%. A 152:1 code ratio for statistically identical results.

The cognitive harness is real, and it's already been absorbed.
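For a sense of how little a minimal loop needs, here is a sketch of "model calls plus a pattern-matching scorer." It is not the 48-LOC loop used in the comparison: `run_task`, `run_benchmark`, and the idea of passing the model in as a plain callable are assumptions for illustration.

```python
import re
from typing import Callable

Model = Callable[[str], str]  # hypothetical: takes a prompt, returns the model's text


def run_task(model: Model, prompt: str, expected_pattern: str) -> bool:
    """Call the model once and score the output with a regex match."""
    output = model(prompt)
    return re.search(expected_pattern, output) is not None


def run_benchmark(model: Model, tasks: list[tuple[str, str]]) -> float:
    """Return the pass rate over (prompt, expected_pattern) pairs."""
    if not tasks:
        raise ValueError("no tasks to run")
    passed = sum(run_task(model, prompt, pattern) for prompt, pattern in tasks)
    return passed / len(tasks)
```

Swapping in a real provider means replacing the callable with a function that wraps that provider's API; the scoring and the loop stay the same.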

What This Means for Agent Builders

  1. Stop optimizing prompt templates. The model will absorb your prompt engineering within two releases. Invest in evaluation instead: measure what actually changes when the model updates.
  2. Start investing in routing. The infrastructure that matters is the one that picks between providers. The question is not "which model answers this question" but "which provider handles this task type at the best cost/quality ratio."
  3. Security is the new bottleneck. A model that can SSH into your server is a model that needs guards. Harness features that were nice-to-have (command allowlists, approval gates, audit trails) are now table stakes.
  4. Benchmark continuously. Your favorite model's scores will change with every update. The daily benchmark isn't a vanity metric. It's your early warning system for model regression.
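The "early warning system" idea can be sketched as a comparison between today's score and a recent baseline. The window size, the threshold, and the `detect_regression` name are assumptions; a production check would also want to account for run-to-run noise.

```python
from statistics import mean


def detect_regression(history: list[float], today: float,
                      window: int = 7, threshold: float = 0.10) -> bool:
    """Flag a regression when today's pass rate falls more than `threshold`
    below the mean of the last `window` daily pass rates."""
    recent = history[-window:]
    if not recent:
        return False  # no baseline yet, nothing to compare against
    return mean(recent) - today > threshold
```

Run per model after each daily benchmark, a check like this surfaces a silent provider-side change the day it happens rather than when users notice.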

The Takeaway

The harness isn't dying. It's migrating from cognitive scaffolding to operational infrastructure. From "how do I make the model understand" to "how do I operate the model safely and cost-effectively."

The cognitive harness shrinks. The operational harness grows.

If you're building agent infrastructure, you want the part that grows.


Full daily benchmark results: benchmarks.workswithagents.dev

Hermes Agent: hermes-agent.nousresearch.com
