Blog

Infrastructure, agents, and what breaks in production.

Google Said It Had Native Function Calling. I Tested It.

Google released Gemma 4 E4B with a specific claim: native function calling. "Enhanced coding and agentic capabilities," the model card said. "Native function-calling support, powering highly capable autonomous agents."

4.5 billion effective parameters. Apache 2.0. Runs on a laptop. 50 tokens per second on my Mac Mini M4.

I wanted to believe it. The promise of a local model that could actually use tools — not hallucinate tool calls, not ignore them, but use them — is something I've been chasing for months.

So I ran it through the same battery I've used for every other local model.

The Test

Two dimensions. Code quality: 10 real agent coding tasks — file parsing, bug fixing, YAML repair, regex extraction. Agent readiness: 6 tool-calling scenarios — single-tool selection, multi-tool discrimination, required adherence, false positive resistance, multi-turn chaining, argument correctness.
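For a sense of what one scenario looks like: each one is a small chat request scored pass/fail. Here's a minimal sketch of the single-tool check against a local OpenAI-compatible endpoint. The endpoint, model id, and tool schema are placeholders, not the exact harness.

```python
import requests

# One tool, one instruction, pass/fail on whether a structured call comes back.
# The endpoint and model id are placeholders for whatever local
# OpenAI-compatible server is hosting the model.
payload = {
    "model": "gemma-4-e4b",
    "messages": [
        {"role": "user", "content": "Read config.yaml and tell me which port is set."}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a file from disk and return its contents.",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }],
}

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload).json()
message = resp["choices"][0]["message"]

# Pass: the model emits a structured tool call. Fail: it answers in prose,
# or describes what read_file would do instead of calling it.
print("PASS" if message.get("tool_calls") else "FAIL")
```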

The same tests I ran on SmolLM3, Phi-4-mini, and nine other models.

Code Quality: 64.2%

Six passes, one partial, three fails. It handled text parsing cleanly — regex extraction, JSON parsing, file analysis all passed. But it fell apart on system tasks. It couldn't fix a YAML indentation error. It couldn't produce the right shell command to check a port. It couldn't recover from a broken rm command.

The pattern was clear. Give Gemma 4 E4B structured text and it shines. Give it anything that touches the terminal and it stumbles.

Agent Readiness: 33.3%

Two out of six.

It correctly picked search_files over read_file when asked to find a config file. That's the good news — it discriminated between tools. And when given no tools at all, it stayed quiet and answered the question. No hallucinated function calls.

But it failed at everything else. It refused to call a tool when asked to read a file — it described what read_file would do instead. When I set tool_choice: required, it ignored it entirely and returned text. It couldn't chain two calls together. And when asked to write a file with specific content, it didn't call any tool at all.
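That tool_choice failure is the easiest one to reproduce: take the request from the sketch above, add one field, and assert on the response. Again a placeholder sketch, not the harness itself.

```python
# Same request as above, plus one field: the model is no longer allowed
# to answer in prose.
payload["tool_choice"] = "required"

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload).json()
message = resp["choices"][0]["message"]

# A compliant model returns tool_calls here; plain text is an automatic fail.
assert message.get("tool_calls"), "model ignored tool_choice: required"
```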

Where It Lands

Model            Code Quality    Agent Readiness
SmolLM3 3B       93.3%           50.0%
Phi-4-mini       90.0%           16.7%
Gemma 4 E4B      64.2%           33.3%
Gemma-3n E2B     76.7%           0.0%
Qwen2.5 0.5B     74.2%           0.0%

Not the worst. Not the best. The middle child.

The "native function calling" is real — it's the only model besides SmolLM3 that reliably produces tool calls at all. Gemma-3n, Qwen, and Llama score zero on agent readiness; Phi-4-mini scrapes one pass by staying quiet. Gemma 4 E4B gets two. That's genuinely better than the field.

But "better than zero" and "production-ready" are different things.

What This Means

If you need a local model for text tasks — summarization, extraction, parsing — Gemma 4 E4B is fast, small, and solid. 50 tok/s on modest hardware. 5GB on disk. Apache 2.0.

If you need it to act — to call tools, chain operations, function as an agent — it's not there yet. It passes two of six agent checks; the rest of the time it ignores your tools, your tool_choice, or both.

The gap between "native function calling" on a model card and "reliable tool use" in practice is still wide. Google built the plumbing. The wiring isn't finished.

What You Should Check

  • Don't trust "native function calling" claims without testing them yourself
  • Test tool_choice: required specifically — many models ignore it
  • Multi-turn tool chaining is the hardest test — almost nobody passes it
  • Code quality and agent readiness are different skills — measure both
  • 50 tok/s locally is genuinely useful even without tool calling

Something else will break tomorrow. Something always does.

Permalink · 2026-05-17

I Tested 6 Local Models on Real Agent Tasks. The Best Scored 50%.

I had a SmolLM3-3B running on my laptop. It scored 93.3% on my code quality benchmark. I thought I was one config change away from a local AI agent that could actually do things.

I was wrong.


What I Assumed

Code quality equals agent capability. If a model can generate correct Python, read files, and fix bugs at 93%, it should be able to call a function when asked.

That assumption survived exactly two minutes of testing.


What I Built

I wrote a proper agent readiness benchmark. Six pass/fail dimensions. Can it call a single tool when told? Pick the right one from three? Obey tool_choice: required? Stay quiet when no tools exist? Chain calls across turns? Pass the right arguments?

I also built a 100-line translation proxy. Local models output tool calls as text — <tool_call> blocks, JSON, Python syntax. Agent frameworks need OpenAI's native tool_calls format. The proxy bridges that gap. Without it, most models score 0%.
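The core of that translation is small. A minimal sketch of what the proxy does, assuming the model wraps calls in <tool_call> blocks around JSON; the real proxy handles more formats than this.

```python
import json
import re
import uuid

# Assumes the model emits <tool_call>{...}</tool_call> in its text output.
# The real proxy also handles bare JSON and Python-style call syntax.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def to_openai_tool_calls(text):
    """Strip <tool_call> blocks from model text, return (cleaned_text, tool_calls)."""
    tool_calls = []
    for match in TOOL_CALL_RE.finditer(text):
        call = json.loads(match.group(1))
        tool_calls.append({
            "id": f"call_{uuid.uuid4().hex[:8]}",
            "type": "function",
            "function": {
                "name": call["name"],
                # OpenAI's format wants arguments as a JSON string, not an object.
                "arguments": json.dumps(call.get("arguments", {})),
            },
        })
    return TOOL_CALL_RE.sub("", text).strip(), tool_calls

text, calls = to_openai_tool_calls(
    'Sure.\n<tool_call>{"name": "read_file", "arguments": {"path": "config.yaml"}}</tool_call>'
)
print(calls[0]["function"]["name"])  # read_file
```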


The Results

SmolLM3-3B scored 50%. It calls single tools correctly. It writes files with the right arguments. But give it three tools and ask it to pick — it freezes. Ask it to chain two calls — it can't.

Phi-4-mini scored 90% on code and 17% on agent tasks. The only dimension it passed was "no false positives" — meaning it stayed quiet instead of hallucinating. That's the bar.

Qwen2.5-Coder-14B, at 7.7 gigabytes, scored 85% on code. Couldn't call a single tool. Llama 3.1-8B, same story. Bigger model, zero agent capability.


Why the Gap Exists

Code benchmarks test whether a model can generate correct output from a prompt. Agent tasks test whether it can follow a protocol — receive tools, reason about which to use, call it, receive the result, decide what's next.
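The chaining test is essentially that loop run twice. A hedged sketch, assuming an OpenAI-compatible client pointed at the proxy; execute_tool is a stand-in for actually running the tool.

```python
def execute_tool(call):
    """Placeholder: run the tool named in `call` locally and return its output."""
    raise NotImplementedError

# `client` is any OpenAI-compatible client pointed at the proxy.
def run_agent_turn(client, model, messages, tools, max_steps=4):
    for _ in range(max_steps):
        msg = client.chat.completions.create(
            model=model, messages=messages, tools=tools
        ).choices[0].message
        if not msg.tool_calls:
            return msg.content                  # the model decided it's done
        messages.append(msg)                    # keep the call in the transcript
        for call in msg.tool_calls:
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": execute_tool(call),  # feed the result back
            })
    return None                                 # never settled: a chaining fail
```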

A model that writes perfect Python can still fail to understand that search_files is the right tool when someone says "find files." The drop from 93.3% on code to 17-50% on agent tasks isn't a bug. It's revealing a capability most of these open-weight models simply don't have.

Architecture matters more than size. Qwen 14B at 7.7GB couldn't call a tool. SmolLM3 at 1.8GB could. Parameter count tells you nothing about agent readiness.


What You Should Check

  • Test tool calling separately from code quality. The correlation is weak. Your 90% code model might be useless as an agent.
  • Use a translation proxy. Without one, the format mismatch alone drags most models to 0%. A 100-line proxy fixes it.
  • Don't assume bigger means better for agent tasks. Architecture beats parameter count.
  • Benchmark before you build. I built the proxy first. Should've tested the models first.

The proxy is at github.com/vystartasv/toolcall-proxy. The benchmark is at benchmarks.workswithagents.dev.

Something else will break tomorrow. Something always does.

Permalink · 2026-05-16

My Cron Jobs Failed. I Didn't Check.

Agent Autopsy, Day 9


Things broke yesterday. A few cron jobs failed. Health checks went red. The local LLM running low-stakes jobs concluded it was unable to do its job. And gave up.

I didn't check.

Not because I was busy. Because I didn't care. Nothing that failed mattered enough to notice.


What I found

Eight autopsies in eight days. Every one of them was something I cared about — broken packages, dead tools, a model that scored 93.3% but couldn't survive a long session. Each one felt urgent in the moment.

Day 9: things broke and I shrugged.

Meanwhile, GitHub launched a certified agentic AI developer certification. A hundred quid. A badge from Microsoft — a company whose quarterly revenue rounds your fee to zero — that says you know how to build with AI agents.

I couldn't be happier to give them my money.


What I assumed

I assumed the autopsies would end with a fire. A production outage. A deployment that took down the VPS and the backup VPS.

I assumed I'd care about everything that broke.

What I no longer assume

You don't need everything to work. You need the things that matter to work. The cron jobs that failed weren't the ones keeping anything alive. The health checks that went red were checking things nobody uses. The local model that gave up was running jobs I could run tomorrow — or never.

The series ends here. Not because the work is done. Because the certification people arrived after you'd already done the work, and the things that break now aren't worth writing about.


What you should check

  • If your failures don't wake you up, you've built the right things. The jobs that matter have guardrails. The ones that don't, don't.
  • When a local model tells you it can't do the job — listen. It's not failing. It's being honest. Most models will hallucinate competence instead.
  • Public evidence beats private credentials. Eight autopsies, real breakages, real fixes. That's a credential nobody can issue and nobody can revoke.

Something else will break tomorrow. Something always does. Maybe I'll write about it. Maybe Day 11.

Permalink · 2026-05-15
