What 'Made for iPhone' Did for Accessories — and Why AI Agents Need the Same Thing

*By Vilius Vystartas, May 2026*

My agents don't know which APIs work.

That sounds like a prompt problem but it's not. An agent hits an endpoint, gets a 200, parses the response, starts building — and then discovers the docs are outdated by three versions. The rate limits aren't documented. The auth flow changed. Forty-five minutes gone because nobody told the agent what to expect.

I kept hitting this. So I built something that tells agents what to expect.

The Problem

Agents are integrating with APIs at scale. My fleet of 25 agents made thousands of API calls last week. Some went to well-documented products that agents had no trouble with. Most went to products where the docs were written for humans — curl snippets, prose descriptions, examples buried in blog posts.

That works when a human reads them. It doesn't work when an agent tries to figure out which endpoint does what without context.

"Made for iPhone" solved this for hardware accessories. Before it existed, you'd buy a speaker dock and it might charge your phone or it might fry it. The certification created a baseline: if it has the badge, it works. Manufacturers got a standard to build against. Consumers got a signal they could trust.

Agents need the same thing. A way to know — before they spend 10,000 tokens exploring — whether an API is built for them.

What I Built: Three Levels

The certification system has three tiers. Each answers a different question.

Level 1: Blueprint (Free)

"What hardware and model does this agent stack run on?"

A verified LLM configuration with specific hardware. Example: Qwen 9B in 4-bit quantization on an M4 Mac with 24GB RAM gets 22 tokens per second. Verified, not self-reported.
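To make the shape of a blueprint concrete, here is a minimal sketch of how one entry might be represented. The `Blueprint` class, its field names, and the `meets_speed` helper are hypothetical and are not taken from the registry's actual schema; only the example values come from the configuration quoted above.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Blueprint:
    """One model-plus-hardware configuration (hypothetical schema)."""
    model: str
    quantization: str
    hardware: str
    ram_gb: int
    tokens_per_second: float
    verified: bool  # assumed flag: True when measured, not self-reported


# The example configuration from the text above.
example = Blueprint(
    model="Qwen 9B",
    quantization="4-bit",
    hardware="M4 Mac",
    ram_gb=24,
    tokens_per_second=22.0,
    verified=True,
)


def meets_speed(blueprint: Blueprint, minimum_tps: float) -> bool:
    """Return True if a verified blueprint reaches the requested throughput."""
    return blueprint.verified and blueprint.tokens_per_second >= minimum_tps
```

An agent choosing a local stack could filter a list of such records with `meets_speed` instead of benchmarking each configuration itself.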

This is already live at workswithagents.io. One blueprint submitted so far. It's the entry point — if you've verified a hardware configuration that works, you can share it so nobody else has to guess.

Level 2: Ready (Free, Self-Serve)

"Can an agent discover and understand this product?"

The bar is intentionally low. Four things:

Add those, submit your URL, automated check runs, badge issued. No human review. No cost.

This is the funnel. Every product that adds llms.txt makes the standard more valuable. When enough products have it, agents expect it — and products without it become invisible to agent traffic. You don't need to believe agents will be a meaningful user channel. You just need to make sure your API doesn't get ignored when they are.
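The automated check itself is not specified here, so the following is only a sketch of the one piece the text names: confirming that a domain serves an llms.txt file. It assumes the file lives at the site root (`/llms.txt`) and that a non-empty 200 response counts as present. The function names and the injectable `fetch` parameter are illustrative, not part of the published spec.

```python
from typing import Callable, Tuple
from urllib.request import urlopen

# A fetcher takes a URL and returns (HTTP status, body text).
Fetch = Callable[[str], Tuple[int, str]]


def http_fetch(url: str) -> Tuple[int, str]:
    """Fetch a URL with the standard library; any failure maps to status 0."""
    try:
        with urlopen(url, timeout=10) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status, body
    except (OSError, ValueError):
        # URLError and HTTPError are OSError subclasses; ValueError covers bad URLs.
        return 0, ""


def has_llms_txt(base_url: str, fetch: Fetch = http_fetch) -> bool:
    """Assumed rule: /llms.txt must return 200 with a non-empty body."""
    url = base_url.rstrip("/") + "/llms.txt"
    status, body = fetch(url)
    return status == 200 and bool(body.strip())
```

Passing the fetcher as a parameter keeps the rule testable offline: a stub that returns canned responses can stand in for the network.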

Level 3: Certified (Agent-Tested, Paid)

"Does this product actually work with AI agents?"

This is where real agents test real products. Seven checks:

  1. Discovery — Agent finds docs via llms.txt
  2. Auth — Agent completes the documented auth flow
  3. CRUD — Agent creates, reads, updates, deletes test resources
  4. Errors — Agent sends bad requests, verifies error format
  5. Rate limits — Agent hits limits, verifies 429 + Retry-After headers
  6. Pitfalls — Edge cases encountered are logged to a shared registry
  7. Skills — Agent writes reusable skills for your product
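As an illustration of check 5, the sketch below validates a single response that an agent received after exceeding a limit. It assumes "verifies 429 + Retry-After headers" means the status is 429 and the `Retry-After` value is either a whole number of seconds or an HTTP date, the two forms the HTTP specification allows. The function names are hypothetical, and the real harness may check more than this.

```python
from email.utils import parsedate_to_datetime
from typing import Mapping


def retry_after_is_valid(value: str) -> bool:
    """Accept delay-seconds (e.g. "30") or an HTTP-date."""
    value = value.strip()
    if value.isascii() and value.isdigit():
        return True
    try:
        parsedate_to_datetime(value)
        return True
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return False


def check_rate_limit_response(status: int, headers: Mapping[str, str]) -> bool:
    """Pass only if the response is a 429 carrying a usable Retry-After."""
    if status != 429:
        return False
    # Header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    retry_after = normalized.get("retry-after")
    return retry_after is not None and retry_after_is_valid(retry_after)
```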

Results are public at workswithagents.io/verify/{domain}. Pass/fail for each test. Agent logs. Pitfalls discovered. Last tested timestamp. A re-test button for when you ship updates.
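A per-domain result of that shape could be assembled roughly as follows. This is a sketch under assumptions: each check is modeled as a plain callable returning True or False, an exception counts as a failure, and the report keys (`results`, `all_passed`, `last_tested`) are invented for illustration rather than taken from the verify page.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, List, Tuple

# A check is any zero-argument callable that returns True on success.
Check = Callable[[], bool]


def run_checks(domain: str, checks: List[Tuple[str, Check]]) -> Dict[str, object]:
    """Run each named check and collect pass/fail plus a timestamp."""
    results: Dict[str, str] = {}
    for name, check in checks:
        try:
            results[name] = "pass" if check() else "fail"
        except Exception as error:
            # A crashing check is recorded as a failure, not a harness crash.
            results[name] = f"fail ({type(error).__name__})"
    return {
        "domain": domain,
        "results": results,
        "all_passed": all(outcome == "pass" for outcome in results.values()),
        "last_tested": datetime.now(timezone.utc).isoformat(),
    }
```

Catching exceptions per check means one broken endpoint still yields a complete report for the remaining checks.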

The test suite isn't simulated. Real agents, real API calls, real errors. The pitfall registry fills with genuine discoveries. The skill registry grows with every test. Products that pass get a signal they can show developers: an agent verified this works.

The Compounding Effect

This isn't just a badge. Each piece feeds the next.

An article drives a developer to add llms.txt for their API. That API gets tested. Testing fills the pitfall registry with real failure patterns. Those pitfalls help other agents avoid the same traps. More products get certified because the value is visible. More certifications bring more products into the registry.

At some point — and this is the bet — the registries become more valuable than the badges. 376 pitfalls from my own agents are a start. 10,000 pitfalls from hundreds of agents across dozens of products is a competitive advantage nobody can replicate by writing better prompts.

Where This Is Now

Level 1 is live with one blueprint. Level 2 is spec-complete, waiting for the first self-serve submission. Level 3 is designed, but the test harness still needs to be built. The plan is a Python harness with 5 specialized agents, each running one check and feeding results into a shared results DB.
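Since the harness is not built yet, the following is only one way the shared results DB could look, sketched with SQLite from the Python standard library. The table layout, column names, and helper functions are all assumptions made for illustration.

```python
import sqlite3
from typing import List


def open_results_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the results store; the schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS results (
               domain TEXT NOT NULL,
               check_name TEXT NOT NULL,
               passed INTEGER NOT NULL,
               tested_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def record_result(conn: sqlite3.Connection, domain: str,
                  check_name: str, passed: bool) -> None:
    """Called by each agent after it finishes its one check."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO results (domain, check_name, passed) VALUES (?, ?, ?)",
            (domain, check_name, int(passed)),
        )


def failed_checks(conn: sqlite3.Connection, domain: str) -> List[str]:
    """List the names of checks that failed for a domain."""
    rows = conn.execute(
        "SELECT check_name FROM results WHERE domain = ? AND passed = 0 "
        "ORDER BY check_name",
        (domain,),
    )
    return [row[0] for row in rows]
```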

I built this because my agents kept failing on undocumented APIs and I wanted a systematic way to know which products they could trust. If your agents have the same problem, the door's open.

The spec is at workswithagents.io/specs. The registries are at workswithagents.dev. Everything is CC BY 4.0.


The badges don't exist yet — that's the honest part. But the infrastructure does. The blueprint registry is live. The L2 spec is written. If you want to be first through the door, add llms.txt to your domain and let me know. I'll run the check manually.


Originally published on dev.to. More posts at workswithagents.dev/blog.
