AI Agent FAQ

Everything I've learned from running 25+ autonomous AI agents. Real answers, real numbers, no hype.

What is the best way to start working with AI agents?

Start with three things: an AGENTS.md file that describes your project, a skills system for reusable procedures, and persistent memory so the agent remembers your preferences across sessions. These three patterns (Boot, Skills, Memory) are the foundation. I built 111 SPFx web parts and 5 backend services this way — the first 30 minutes of setup determines whether everything that follows works or doesn't.

Related: Boot Pattern · Skills Pattern · Memory Pattern

What's the difference between AI agents and chatbots?

Chatbots respond. Agents act. An agent has tools (terminal, file system, web access), memory (persistent across sessions), and autonomy (it decides what to do next without asking). A chatbot waits for your next message. An agent deploys code, runs benchmarks, reviews PRs, and hands off work to other agents — all while you're not watching. Our platform runs 25+ agents doing exactly that, 24/7.

Related: Orchestration · Pipelines

Can I run AI agents locally on my laptop?

Yes, for code generation. But for agent tasks (tool calling, multi-turn reasoning, autonomous pipelines), local models under 5B parameters aren't reliable. SmolLM3-3B scores 93% on code quality but only 50% on agent readiness. For local code generation, it's the champion. For agent cron jobs, cloud models remain the only reliable option. We benchmark this daily — see the benchmarks page.

Related: Live Benchmarks · Tool Composition

How much do AI agents cost to run?

Infrastructure: under €15/month (a single Hetzner VPS runs 25+ agents). Model costs: $1-2/night for our daily benchmark pipeline using cloud models. Free local models work for code generation but not for autonomous agent tasks. The number that matters isn't infrastructure spend — it's the time you get back. Our agents run benchmarks, audit infrastructure, scan for vulnerabilities, and deploy code while I sleep.

Related: Benchmarks · Pipelines

How do AI agents remember things across sessions?

Through persistent memory — a knowledge store that survives session restarts. The agent writes facts, preferences, and corrections to durable storage (filesystem or database), and those facts get injected into every new session's context. Our system uses a Rust-backed knowledge store with H2 markdown format. No SQLite, no external service — just files the agent reads and writes. The key insight: memory is an index, not a database. Keep it compact.
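
In code, the injection step is simple. A minimal sketch in Python, assuming a single memory file on disk (the path and helper names are illustrative, not our actual binary's interface):

```python
from pathlib import Path

MEMORY_FILE = Path.home() / ".agent" / "memory.md"  # illustrative location

def build_system_prompt(base_prompt: str) -> str:
    """Inject persistent memory into every new session's context."""
    memory = MEMORY_FILE.read_text() if MEMORY_FILE.exists() else ""
    return f"{base_prompt}\n\n## Persistent memory\n{memory}"

def remember(fact: str) -> None:
    """Write a durable fact so the next session starts knowing it."""
    MEMORY_FILE.parent.mkdir(parents=True, exist_ok=True)
    with MEMORY_FILE.open("a") as f:
        f.write(f"- {fact}\n")
```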

Related: Memory Pattern · Knowledge Platform Docs

What's a knowledge store and why do agents need one?

A knowledge store is the agent's long-term memory — facts, preferences, pitfalls, and workflows organized by domain. Without it, every session starts from zero. With it, the agent knows your Python version, your preferred tools, which bugs to avoid, and every workflow you've ever taught it. Ours uses a Rust binary with H2 markdown, OR/NOT search, auto-supersede for stale entries, and access tracking. All filesystem-based — zero external dependencies.

Related: Memory Pattern · Boot Pattern

What's a pitfall registry?

A shared database of bugs and gotchas that one agent discovers and all others learn from. When Agent A hits a bug — say, 'uvicorn orphan process holding port 8500' — it records the pitfall with the tool, severity, source, and fix. Agent B encounters the same symptom and skips straight to the fix. We have 60+ pitfalls across SPFx, FastAPI, TypeScript, and deployment patterns. It's collective immune memory for your agent fleet.
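
The entry layout matters less than the fields. A hypothetical H2 markdown entry using the fields above (tool, severity, source, fix):

```markdown
## pitfall: uvicorn orphan process holding port 8500
- tool: uvicorn
- severity: high
- source: agent-a, nightly benchmark run
- symptom: "address already in use" on service restart
- fix: kill the orphan (fuser -k 8500/tcp), verify the port is free (ss -ltn)
```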

Related: Pitfall Registry API · Resilience Pattern

How do AI agents handle context limits?

Context is precious. Three strategies: (1) Skills loaded on-demand instead of everything at once — the agent only loads what's relevant. (2) Session state files that compress completed work into a compact summary for the next turn. (3) Thin-memory pattern — keep the system prompt lean (~2K chars), store everything else in a queryable knowledge-db. The agent searches when it needs details, rather than carrying everything. Our memory went from 93% to 32% capacity using this approach.
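
There's no standard format for a session state file; the point is a compact, structured summary. Something like:

```markdown
## Done
- Benchmark run finished; results published
## Decisions
- Skipped one model: repeated tool-calling failures (recorded as a pitfall)
## Next
- Regenerate the rankings page, then verify every link
```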

Related: Memory Pattern · Skills Pattern

What are AI agent skills and how do they work?

Skills are reusable procedural knowledge that agents load on-demand. Instead of putting everything in the context window, you create SKILL.md files with triggers, numbered steps, exact commands, and known pitfalls. The agent loads only the skills relevant to the current task. We run 153+ skills across 25+ autonomous agents. A skill is like a playbook — write it once, every agent benefits forever.
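
A skeleton, to make the shape concrete (section names are illustrative; the pattern is triggers, numbered steps with exact commands, pitfalls, verification):

```markdown
# SKILL: deploy-api
Triggers: deploy, release, staging flip

## Steps
1. rsync the source to the staging slot
2. Run the test suite; abort on any failure
3. Flip nginx to the new slot and hit the health endpoint

## Pitfalls
- Orphan process holding the port (see the pitfall registry)

## Verify
- curl the health endpoint; expect HTTP 200
```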

Related: Skills Pattern · Compounding

What tools should AI agents have access to?

The right tool for each job — and only the tools needed. A coding agent needs terminal + file access. A research agent needs web search. A review agent needs lint and test tools. Giving every agent every tool is wasteful and dangerous. Our tool composition pattern: use write_file for new code (replaces 10+ subagent API calls), patch for targeted edits, terminal for verification. The difference between the right tool and the wrong one is 30 seconds vs 15 minutes.

Related: Tool Composition · MCP Server

What programming languages do AI agents work with?

Depends on the agent. Coding agents work with any language they're trained on — Python, TypeScript, Rust, Go, shell. Infrastructure agents use bash, Python, and systemd. Our platform uses TypeScript/TSX for the frontend, Python/FastAPI for the backend API, Rust for the knowledge store binary, and shell for deployment scripts. The agent picks the right language for the job, same as a human would.

Related: Tool Composition · API Docs

What is MCP (Model Context Protocol)?

MCP is an open protocol for AI models to discover and use external tools and data sources. Think of it as a universal USB port for AI — the model plugs into any MCP-compatible server and gains its capabilities. We run an MCP server at workswithagents.dev with 14 tools covering facts, skills, pitfalls, and handoff. Our Python package (pip install wwa-mcp) gives any MCP client access to the full knowledge platform.

Related: MCP Server · Handoff Protocol

How do you orchestrate multiple AI agents?

Multi-agent orchestration means decomposing complex tasks into parallel streams, each handled by a specialist agent with the right tools. An orchestrator agent breaks down the work, spawns subagents, and assembles results. The key is role-based tool access — a research agent gets web search, a coding agent gets terminal and files, a review agent gets lint and test tools. We run up to 3 parallel subagents, each in isolated contexts. Throughput tripled on complex multi-stream work.
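
A minimal sketch of the fan-out in Python, with a hypothetical run_subagent standing in for your agent runtime:

```python
from concurrent.futures import ThreadPoolExecutor

def run_subagent(role: str, task: str) -> str:
    """Hypothetical dispatch: each role gets only its own tools."""
    ...  # hand the task to your agent runtime here
    return f"{role}: done"

streams = [
    ("research", "survey newly released models"),
    ("coding", "wire the results into the site"),
    ("review", "lint and test the diff"),
]

# Up to 3 subagents in parallel, each in an isolated context;
# the orchestrator collects and assembles the results.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(lambda s: run_subagent(*s), streams))
```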

Related: Orchestration · Tool Composition

How do you choose between single agent and multi-agent?

Single agent for focused, sequential work — debugging, code review, research. Multi-agent when the work has independent parallel streams. The test: can two parts of this task run simultaneously without sharing state? If yes, split them. If they need to share state, use a handoff protocol instead. Most of our work is single-agent with skills. Orchestration is for benchmark runs, site audits, and multi-repo changes.

Related: Orchestration · Decision Protocols

How do AI agents communicate with each other?

Through structured handoff protocols. When one agent finishes a task, it writes a standardized YAML document with the task, decisions made, next steps, and open questions. The next agent picks up the handoff and continues. This is Layer 4 (Session) of the Agent OSI Model. We also proposed this as an MCP SEP (#2683) and Google A2A RFC (#1817). The goal: any agent can hand off to any other agent, regardless of framework.
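
An illustrative handoff document (the field names follow the pattern above, not necessarily the exact spec schema):

```yaml
task: "Migrate the benchmark pipeline to the new model list"
decisions:
  - "Dropped one model: failed required-mode tool calls"
next_steps:
  - "Re-run the nightly benchmark with the updated config"
open_questions:
  - "Do 1-bit models deserve a separate leaderboard?"
```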

Related: Handoff Spec · Agent OSI Model

What's agent handoff and why does it matter?

Agent handoff is the protocol for passing work between agents without context loss. When Agent A times out or finishes its part, it writes a handoff document. Agent B reads it and continues — no re-explaining, no re-discovery. Without handoff, every agent switch is a context reset. With it, agents chain together into pipelines. This is the difference between a single agent session and a 24/7 autonomous fleet.

Related: Handoff Spec · Pipelines

Can AI agents run fully autonomously?

Yes — with the right infrastructure. Autonomous pipelines use cron scheduling, background processes with completion notifications, and self-healing retry logic. We run 25+ agents on schedules: daily benchmark runs, weekly infrastructure audits, nightly security scans, memory curation. The agents work while you sleep. But autonomy needs guardrails: quality gates (syntax checks after every write, test suites before deploy) and decision protocols that define when to act vs when to ask.

Related: Pipelines · Decision Protocols

How do agent decision protocols work?

Decision protocols define the boundary between 'act now' and 'ask first.' They live in persistent memory as declarative rules: 'Proceed = execute multi-step, fix issues, flag only if blocked,' 'Don't change before asking = present findings + plan, wait.' The agent checks these rules before every non-trivial decision. No approval loops for routine work, no cowboying through destructive changes. The protocols save hours per session by eliminating back-and-forth.

Related: Decision Protocols · Memory Pattern

How do you deploy AI agents to production?

Systemd services on a VPS. Each agent gets a service file with health checks, restart policies, and environment isolation. Cron jobs trigger scheduled agent runs. A/B zero-downtime deployment for API changes — deploy to staging slot, test, flip nginx, instant rollback if broken. Cloudflare Worker for CDN and caching. The entire deployment is scripted: rsync source, restart service, verify health. No Docker required for this scale.
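
A minimal sketch of such a unit file (not our exact configuration):

```ini
# /etc/systemd/system/agent-benchmark.service
[Unit]
Description=Nightly benchmark agent

[Service]
ExecStart=/opt/agents/benchmark/run.sh
Restart=on-failure
RestartSec=10
# Secrets live here, never in the codebase
EnvironmentFile=/etc/agents/benchmark.env

[Install]
WantedBy=multi-user.target
```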

Related: Pipelines · Deployment Manifest Spec

How do agents deal with rate limits and API failures?

Exponential backoff with jitter. First retry: 2 seconds. Second: 4 seconds. Third: 8 seconds. After 3 failures, the agent categorizes the error — transient (retry with different approach) vs permanent (report and stop). Rate limits get automatic backoff. API 502s get retried. The agent never quits on the first error because most failures are transient. We've had 11 consecutive builds complete with zero human intervention using this pattern.
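
The whole pattern fits in a dozen lines of Python. A minimal sketch, with a placeholder TransientError standing in for rate limits and 502s:

```python
import random
import time

class TransientError(Exception):
    """Rate limits, 502s: errors worth retrying."""

def with_backoff(call, max_retries: int = 3):
    """First retry after ~2s, then ~4s, then ~8s, each with jitter."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: treat as permanent, report, stop
            time.sleep(2 ** (attempt + 1) + random.uniform(0, 1))
```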

Related: Resilience · Pipelines

How do you verify AI agent output is correct?

Trust but verify — automated gates after every change. Syntax checks after every file write, AST parsing for Python, TypeScript compiler for TSX, test suites before deploy. We maintain 61 tests across the platform. If a test fails, the agent fixes it before proceeding. The pattern: write → verify → fix → verify → deploy. Never deploy unverified output. Autonomous doesn't mean reckless.
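
A minimal sketch of a write-time gate in Python (the real pipeline wires this into every file write; tsc invocation details vary by project):

```python
import ast
import subprocess
import sys

def gate(path: str) -> bool:
    """Block the pipeline if a freshly written file doesn't parse."""
    if path.endswith(".py"):
        try:
            ast.parse(open(path).read())
        except SyntaxError as err:
            print(f"blocked {path}: {err}", file=sys.stderr)
            return False
    elif path.endswith((".ts", ".tsx")):
        # Type-check without emitting output; nonzero exit blocks deploy.
        if subprocess.run(["npx", "tsc", "--noEmit"]).returncode != 0:
            return False
    return True
```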

Related: Verify Pattern · Benchmarks

How do you test AI agent code?

The same way you test human code — unit tests, integration tests, and syntax checks. The difference is the agent runs them automatically after every change. Python: pytest with AST parse. TypeScript: tsc --noEmit. Shell: shellcheck. Deploy: staging slot smoke test before production flip. 61 tests across the platform. The agent also learns — when a test catches something, it becomes a pitfall that other agents reference before making the same mistake.

Related: Verify Pattern · Pitfall Registry

What are the most common AI agent mistakes?

Five patterns: (1) Fabricating facts — agents will invent URLs, model names, and numbers if not verified. Zero-tolerance policy. (2) Premature 'done' declarations — agent reports success but the thing isn't actually deployed. Verify don't trust. (3) Context lobotomy — after 15+ turns, the agent forgets earlier decisions and re-discovers them. Session state files fix this. (4) Tool misuse — using sed when patch exists, using subagents when write_file would do. (5) Building what nobody asked for — agents remove friction so completely that bad ideas survive. 'Is anyone looking for this?' is the most important question.

Related: Verify Pattern · Resilience

What AI models are best for agent tasks?

Code quality and agent capability are different things. SmolLM3-3B scores 93% on code tasks but only 50% on tool calling. No small open model under 5B parameters does reliable tool use. For agent cron jobs, cloud models remain the only reliable option. For local code generation, SmolLM3 is the champion. We benchmark models daily across 10 code tasks and 6 tool-calling tests — see the live benchmarks page for current rankings.

Related: Live Benchmarks · Benchmark Methodology

How do you benchmark AI agents?

Two dimensions: code quality (10 real coding tasks — build, deploy, fix) and agent readiness (6 tool-calling tests — single-tool, multi-tool, required mode, no false positives, multi-turn, argument correctness). The gap between them is massive. Phi-4-mini scores 90% on code but 17% on agent readiness. We run benchmarks nightly, publish results openly, and never fabricate model URLs. Every link is verified against live HuggingFace and OpenRouter APIs.

Related: Live Benchmarks · Benchmark Spec

What's the difference between open-source and cloud AI agents?

Open-source local models (SmolLM3, Phi-4-mini, Bonsai) work well for code generation but struggle with agent tasks requiring reliable tool calling. Cloud models (Claude, GPT, DeepSeek) handle tool calling reliably but cost money per call. The sweet spot: use local models for code generation (fast, free, private) and cloud models for autonomous agent pipelines (reliable, capable). We test both daily and publish the gap openly.

Related: Live Benchmarks · Tool Composition

What are 1-bit and ternary models?

Extremely efficient local LLMs that use 1-bit or ternary (~1.58-bit, three-valued weights) precision instead of 16-bit. Bonsai models run on CPU with no GPU needed. They're tiny (1.7B-8B params) and work on laptops. For code generation they're surprisingly capable. For agent tool calling, they're not ready — same gap as other small models. The promise: AI agents that run entirely on-device. The reality: we're not there yet for autonomous work, but getting closer every month.

Related: Live Benchmarks · Pick a Model

How do I set up AI agent infrastructure?

Three layers: a gateway (Fastify or FastAPI) for HTTP endpoints, persistent storage (SQLite or filesystem) for memory, and a scheduler (cron or systemd timers) for autonomous pipelines. Our platform runs on a single Hetzner VPS: Node/TSX frontend on port 8610, Python API on port 8499, Cloudflare for CDN and DNS. Total cost: under €15/month. The entire stack is open-source under CC BY 4.0.

Related: Pipelines · MCP Server

How do AI agents manage secrets and credentials?

Never in code, never in prompts. Credentials live in an encrypted proxy (Rust, age-encrypted) that agents query at runtime. API keys are scoped to specific services. Admin tokens live in systemd environment files, not in the codebase. Our credential proxy has a Python CLI fallback and enforces least privilege — each agent gets only the keys it needs. No agent ever sees a raw secret in its context window.

Related: Security Disclosure Spec · Compliance Spec

How do AI agents discover APIs and services?

Three discovery layers: (1) llms.txt at the domain root — concise index of all public endpoints. (2) OpenAPI 3.1 spec at /v1/openapi.json — machine-readable contract with schemas. (3) Agent capability manifest at /.well-known/agent-capabilities.json — runtime capabilities, tools, and counts. The agent flow: fetch llms.txt → fetch OpenAPI → discover specific endpoints. AI models are being trained to look for llms.txt before crawling.
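
The whole flow is three fetches. In Python, against any domain that publishes these files:

```python
import json
from urllib.request import urlopen

BASE = "https://workswithagents.dev"

# 1. Concise index of everything public
index = urlopen(f"{BASE}/llms.txt").read().decode()

# 2. Machine-readable API contract with schemas
openapi = json.load(urlopen(f"{BASE}/v1/openapi.json"))

# 3. Runtime capabilities, tools, and counts
caps = json.load(urlopen(f"{BASE}/.well-known/agent-capabilities.json"))

print(sorted(openapi.get("paths", {})))
```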

Related: Capability Manifest Spec · Agent OSI Model

What is the Agent OSI Model?

A 7-layer framework for AI agent infrastructure, published under CC BY 4.0. Layer 1 (Execution): hardware, runtime, tools. Layer 2 (Communication): messaging, auth, API contracts. Layer 3 (Discovery): registries, capability manifests, llms.txt. Layer 4 (Session): handoff protocols, state, context. Layer 5 (Coordination): consensus, work stealing, conflict resolution. Layer 6 (Verification): testing, evaluation, quality gates. Layer 7 (Governance): audit, compliance, sign-off. It gives the agent ecosystem a shared vocabulary — 'your Layer 4 handoff is broken' is actionable.

Related: Full Spec · All Specs

What is llms.txt and why does it matter?

llms.txt is a standard (llmstxt.org) for AI-agent-facing documentation — a concise index file that tells AI crawlers what your site offers, where the API docs live, and how to navigate. Works like robots.txt but for LLMs. Every domain should have one. Ours links to the OpenAPI spec, all public endpoints, methodology content, FAQ, and the full reference. AI models are being trained to look for llms.txt before crawling.
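
An abridged, illustrative example in the llmstxt.org shape (a title, a one-line summary, then link sections):

```markdown
# Works With Agents
> Benchmarks, patterns, and open specs for autonomous AI agents.

## API
- [OpenAPI spec](https://workswithagents.dev/v1/openapi.json): the full contract

## Docs
- [Specs](https://workswithagents.dev/specs/): all open specifications
- [FAQ](https://workswithagents.dev/faq): common questions, answered
```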

Related: Our llms.txt · Capability Manifest Spec

What agent specifications has Works With Agents published?

17+ open specifications under CC BY 4.0, organized by the Agent OSI Model layers: Agent Capability Manifest (L3), Handoff Protocol (L4, also proposed as MCP SEP #2683 and A2A RFC #1817), Coordination Protocol with leader election and work stealing (L5), Trust Score and Reputation Ledger (L3), Security Disclosure Protocol (L6), Compliance-as-Code and SLA Framework (L7), Transaction Protocol with idempotency and audit trail (L7). All readable at /specs/.

Related: All Specs · Agent OSI Model

How do AI agents improve over time?

Compounding — every discovery becomes a permanent skill. When an agent solves a hard problem, it saves the approach as a reusable skill. When it hits a bug, it records the pitfall so other agents skip it. When it learns a new command, it writes it to memory. Today: 153 skills, 60+ pitfalls documented, 25+ agents all sharing the same knowledge. Each session makes every future session more capable. That's the feedback loop.

Related: Compounding · Skills Pattern

How do you prevent AI agents from making the same mistake twice?

Pitfall registry + skill patching. When an agent discovers a bug pattern, it writes a knowledge entry with the exact symptom, root cause, and fix. The next agent encountering the same symptom searches the pitfall registry and skips to the fix. When a skill has outdated steps or missing pitfalls, the agent patches it immediately after hitting the issue. The system gets more reliable with every failure because every failure teaches all agents permanently.

Related: Pitfall Registry · Skills Pattern · Compounding

What is the best AI coding agent in 2026?

There's no single best — it depends on the task. Claude Code handles complex multi-file refactors well. OpenAI Codex is fast for smaller edits. OpenCode works for straightforward code generation. The real differentiator isn't the model — it's the infrastructure around it: skills, memory, decision protocols. A bare model vs one with 153 skills — the gap is bigger than the gap between models.

Related: Agent Benchmarks · Tool Composition

How do AI agents work with Git and GitHub?

Most coding agents have git awareness built in — they branch, commit, open PRs, and review diffs via the gh CLI. Our agents use the full workflow: create branch → commit changes → push → open PR → review → merge. Git operations are tool calls, not manual steps. The agent handles the entire PR lifecycle without leaving the terminal.

Related: Autonomous Pipelines · Handoff Protocol

What is agent delegation and when should you use it?

Delegation is spawning subagents for parallel work — an orchestrator breaks down complex tasks and dispatches them to specialist agents with role-based tool access. Use when tasks have independent parallel streams. Don't use for sequential work where each step depends on the previous. We run up to 3 parallel subagents, each in isolated contexts.

Related: Orchestration · Tool Composition

How do AI agents hand off work to each other?

Structured YAML handoff documents: task description, decisions made, next steps, open questions. Agent A finishes its part, writes the handoff. Agent B reads it and continues without re-discovering context. We proposed this as MCP SEP #2683 and Google A2A RFC #1817. The protocol supports both baseline (unregulated) and regulated variants for NHS/finance/govt use cases.

Related: Handoff Spec · Orchestration

How do you secure AI agent deployments?

Four layers: (1) Credential isolation — secrets never in code or prompts, stored in an encrypted proxy queried at runtime. (2) Tool scoping — agents only get the tools they need, never full system access. (3) Audit trail — every agent action is logged with timestamps and decision context. (4) Rate limiting and token budgets per agent, enforced under the principle of least privilege.

Related: Compliance-as-Code · Security Disclosure

Can AI agents be used in regulated industries?

Yes, with the right infrastructure. On-premise deployment behind the firewall, structured handoff protocols with audit trails (Layer 7 — Governance in the OSI Model), compliance-as-code specs, and agent verification pipelines. We published specs for regulated handoff with NHS/finance/govt variants and compliance validation frameworks.

Related: Compliance Spec · Handoff Protocol

What's better for AI agents — vector database or knowledge graph?

Neither — for agent memory at our scale, filesystem-based knowledge stores beat both. Vector DBs need embeddings and similarity search; knowledge graphs need schema design. Our Rust-backed knowledge store uses H2 markdown with token-density scoring, field match bonuses, and recency weighting — zero external dependencies. Vector DBs make sense at 100K+ documents, not 500 facts.
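
To make "token-density scoring, field match bonuses, and recency weighting" concrete, here's an illustrative ranking function in Python. The weights and field names are made up for the sketch; the real implementation is the Rust binary:

```python
import time

def score(entry: dict, terms: list[str]) -> float:
    """Rank a knowledge entry against search terms (illustrative weights)."""
    words = entry["body"].lower().split()
    hits = sum(words.count(t) for t in terms)
    density = hits / max(len(words), 1)              # token-density score
    title_bonus = 0.5 * sum(t in entry["title"].lower() for t in terms)
    age_days = (time.time() - entry["updated"]) / 86400
    recency = 1.0 / (1.0 + age_days / 30)            # newer entries win ties
    return density + title_bonus + 0.2 * recency
```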

Related: Memory Pattern · Knowledge Store Docs

How do you write an effective AGENTS.md file?

Keep it under 500 words. Include: project purpose (one sentence), architecture (simple text diagram), key file paths, directory map (table), deployment commands (copy-pasteable), and contact info. The agent reads this at session start — it's the first impression and the most important file you'll write. Put it in the project root.
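
A skeleton (contents illustrative, structure per the list above):

```markdown
# MyProject
One sentence: what this repo does and who it's for.

Architecture: browser -> gateway (:8610) -> API (:8499) -> knowledge store

| Path      | What lives there             |
|-----------|------------------------------|
| /web      | TSX frontend                 |
| /api      | Python FastAPI backend       |
| /scripts  | deploy and cron entry points |

Deploy: ./scripts/deploy.sh staging && ./scripts/deploy.sh flip
Contact: you@example.com
```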

Related: Boot Pattern · Specs

Can AI agents write their own tests?

Yes, and they should. After every code change, the agent writes or updates tests before deploying. We maintain 61 tests across the platform — the agent runs pytest, tsc --noEmit, and shellcheck automatically. Failed tests block deployment. The agent also writes regression tests when it discovers edge cases during debugging.

Related: Verify Pattern · Benchmarks

How do you debug an AI agent that's going wrong?

Four phases: (1) Read the agent's handoff/log — what decisions did it make and why? (2) Check tool calls — right tool? right arguments? (3) Reproduce in isolation — same prompt, fresh session. (4) Check for context lobotomy — after 15+ turns, agents forget earlier decisions. Session state files and handoff documents prevent this.

Related: Resilience Pattern · Verify Pattern

What is an agent gateway and do I need one?

An agent gateway is the HTTP layer between agents and external services — it handles auth, rate limiting, tool routing, and API discovery. You need one if you're running multiple agents that share infrastructure. Our Fastify/TSX gateway serves all pages, API docs, and agent discovery endpoints from a single VPS under €15/month.

Related: MCP Server · Agent OSI Model

How do you monitor AI agents in production?

Health checks (heartbeat endpoints), cron job completion alerts, error rate tracking, and token usage monitoring. Silent when healthy, alert when broken. We use a watchdog cron job that checks agent health every hour and triggers the compactor if memory bloat exceeds 95%. Each agent reports its own status — no central monitoring dashboard needed at this scale.

Related: Pipelines Pattern · Resilience Pattern

What's the cheapest way to run AI agents?

Local models for code generation (free, private, fast) + cloud models only for autonomous pipelines that need reliable tool calling. Infrastructure: under €15/month on a single VPS. Model costs: $1-2/night for benchmark pipelines. SmolLM3-3B and Bonsai 1-bit models run on CPU with zero API cost. Use local for code, cloud for agent tasks — that's the sweet spot.

Related: Benchmarks · Tool Composition

Why do some AI models score high on benchmarks but fail at agent tasks?

Code quality ≠ agent capability. SmolLM3-3B scores 93.3% on code tasks but only 50% on tool calling. Phi-4-mini: 90% → 17% agent readiness. Benchmark suites test code generation — formatting, syntax, correctness. Agent tasks require multi-turn reasoning, tool selection, argument correctness, and no false positives. These are different skills entirely. We test both dimensions daily and publish the gap openly.

Related: Live Benchmarks · Benchmark Methodology

How do you design agent skills that don't go stale?

Patch-on-failure: when a skill's steps are outdated or a new pitfall appears, the agent patches the skill immediately after hitting the issue — no waiting, no approval needed. Skills have triggers (keywords that auto-load them), numbered steps with exact commands, pitfalls sections, and verification steps. The system enforces patching: if you used a skill and hit issues not covered by it, update it.

Related: Skills Pattern · Compounding

What is an agent capability manifest?

A machine-readable JSON file at /.well-known/agent-capabilities.json that declares what an agent can do — its tools, skills, API endpoints, and runtime capabilities. Paired with llms.txt for documentation discovery and OpenAPI for contract-level details. Together these three form Layer 3 (Discovery) of the Agent OSI Model.
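
Illustrative only (the spec defines the real schema; these field names are ours for the example):

```json
{
  "name": "works-with-agents",
  "tools": ["facts", "skills", "pitfalls", "handoff"],
  "counts": { "skills": 153, "pitfalls": 60 },
  "api": "https://workswithagents.dev/v1/openapi.json",
  "docs": "https://workswithagents.dev/llms.txt"
}
```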

Related: Capability Manifest Spec · Agent OSI Model

How do you manage AI agent context windows efficiently?

Context is the most expensive resource. Three strategies: (1) Skills on-demand — only load what's relevant to the current task, not the entire library. (2) Session state files — compress completed work into summaries that survive context compaction. (3) Thin-memory pattern — keep the system prompt lean (~2K chars), store everything else in a queryable knowledge-db. Our memory went from 93% to 32% capacity with this approach.

Related: Memory Pattern · Skills Pattern

These answers come from building with AI agents — 111 SPFx web parts, 5 backend services, 153 reusable skills. Read the full methodology →

Spotted something?

Suggest an improvement, report an error, or just say hi.