Agent Coordination Protocol (ACP) — Layer 5

Version: 1.0.0 Status: Published Layer: 5 — Coordination (Agent OSI Model) License: CC BY 4.0

1. Purpose

The Agent Coordination Protocol defines how multiple AI agents work together simultaneously. It handles four concerns:

Leader Election — which agent coordinates the fleet?
Work Distribution — which agent gets which task?
Conflict Resolution — when agents disagree, who wins?
Liveness — is this agent still alive?

2. Design Principles

Eventual consistency over strong consistency. Agents operate in different sessions with different context windows. Perfect real-time consensus is impossible. Good-enough coordination with recovery is achievable.
Push, don't poll. Agents notify each other of state changes. No polling loops burning tokens.
Graceful degradation. If the coordinator dies, the fleet continues with reduced coordination. No single point of failure that halts all work.
Agent-agnostic. Works with any agent framework (Hermes, Claude Code, Codex, Copilot). No framework-specific dependencies.
Minimal overhead. Coordination messages are small (under 500 tokens). The protocol doesn't burn the context it's trying to save.

3. Protocol Components

3.1 Leader Election

Problem: When multiple agents can make coordination decisions, who's in charge? Without a leader, agents either duplicate work or deadlock waiting for decisions.

Design: Raft-inspired but simplified. No log replication — just leadership.


State: LEADER | FOLLOWER | CANDIDATE

Heartbeat interval: 30s
Election timeout: 60-90s (randomized per agent to prevent split votes)
Leader lease: 120s (leader must refresh via heartbeat)

Flow:
1. FOLLOWER → No heartbeat for election_timeout → becomes CANDIDATE
2. CANDIDATE → Broadcasts "elect me" → 
   a. Receives majority votes → becomes LEADER
   b. Receives heartbeat from another leader → becomes FOLLOWER
   c. Timeout with no majority → new election
3. LEADER → Sends heartbeat every 30s → 
   Followers reset their election timeout on each heartbeat

Leader responsibilities:

Maintain work queue and agent status
Assign tasks to agents
Resolve conflicts when agents disagree
Detect dead/stuck agents

What the leader does NOT do:

Execute other agents' tasks
Override L6 verification or L7 governance
Own shared resources (locking is separate)

Leader failure: If the leader dies, followers detect heartbeat timeout (60-90s) and trigger a new election. During the election gap, agents continue executing assigned tasks but new assignments pause. Maximum coordination downtime: ~90 seconds.

3.2 Work Distribution

Problem: A task arrives. Which agent handles it?

Design: Capability-weighted work queue with work stealing.


Task lifecycle:
1. SUBMITTED — task created, not yet assigned
2. ASSIGNED — assigned to a specific agent
3. IN_PROGRESS — agent acknowledged and started working
4. COMPLETED — agent finished successfully
5. FAILED — agent reported failure
6. TIMED_OUT — agent didn't respond within deadline
7. STOLEN — another agent took over after timeout

Assignment algorithm:


Input: Task with required capabilities
Output: Agent ID or null (no capable agent available)

1. Filter agents by:
   a. Capability match (agent can do this type of work)
   b. Health (agent is alive and responsive)
   c. Load (agent has capacity — not at max_concurrent_tasks)

2. Rank remaining agents by:
   a. Success rate for this capability (higher = better)
   b. Current load (lower = better)
   c. Trust score (higher = better, from Layer 3 discovery)

3. Assign to top-ranked agent

4. If no agent available → queue task, retry on next agent heartbeat

Work stealing:


When: Agent A is idle (no tasks in queue)
Action: Agent A queries registry for overloaded agents (load > 80%)
Target: Steals the oldest pending task from the most overloaded agent
Constraint: Agent A must have matching capabilities for the stolen task

3.3 Conflict Resolution

Problem: Two agents make conflicting decisions about the same resource. Agent A writes to file X. Agent B writes to file X at the same time. Who wins?

Design: Last-write-wins with audit trail. Conflicts are rare. Perfect resolution is expensive. Logging the conflict is mandatory.


Conflict types and resolution:

FILE CONFLICT (two agents write to same file):
  Last write wins. Both writes logged.
  Previous version preserved in .agent-conflicts/ directory.
  Alert raised if file differs by >10% (possible corruption).

DECISION CONFLICT (two agents disagree on approach):
  If leader exists → leader decides.
  If no leader → first decision wins, second is logged as dissent.
  Dissent logged to audit trail for human review.

RESOURCE CONFLICT (two agents claim same resource):
  Leader arbitrates → assigns to agent with higher trust score.
  If no leader → first claim wins, second is queued.
  Resource released after task completes or timeout.

DEPENDENCY CONFLICT (Agent B starts before Agent A finishes):
  Agent B detects dependency unmet → waits or steals.
  If deadline would be missed → escalates to leader.
  Leader may reassign or extend deadline.

3.4 Liveness & Heartbeat

Problem: How does the fleet know an agent is still working?

Design: Heartbeat with grace period.


Heartbeat payload:
{
  "agent_id": "hermes-spfx-builder",
  "status": "healthy",          // healthy | degraded | busy | stuck
  "load": 0.67,                 // 0.0 to 1.0
  "current_tasks": ["task-42"],
  "last_completed": "task-41",
  "token_budget_remaining": 35000,
  "timestamp": "2026-05-05T21:00:00Z"
}

Heartbeat interval: 30s
Missed heartbeat threshold: 3 consecutive (90s)
Missed heartbeat action:
  1. Mark agent as "unresponsive"
  2. Reassign its IN_PROGRESS tasks (work stealing)
  3. Alert via Cron Guard
  4. If agent returns → reconcile state (agent reports what it actually completed)

4. Protocol Schema

4.1 Heartbeat Message


acp_version: "1.0.0-draft"
message_type: "heartbeat"
agent_id: "hermes-spfx-builder"
fleet_id: "works-with-agents-fleet-1"
timestamp: "2026-05-05T21:00:00Z"
status: "healthy"
load: 0.67
current_tasks:
  - id: "task-42"
    status: "IN_PROGRESS"
    started_at: "2026-05-05T20:55:00Z"
    deadline: "2026-05-05T21:10:00Z"
capabilities:
  - action: "build"
    target: "spfx"
    success_rate: 0.94
token_budget_remaining: 35000

4.2 Task Assignment


acp_version: "1.0.0-draft"
message_type: "task_assign"
task_id: "task-43"
from_agent: "hermes-leader"
to_agent: "hermes-spfx-builder"
task:
  action: "build"
  target: "spfx"
  description: "Build web part 'ProjectDashboard' from source"
  context:
    repo: "/Users/vilius/origami-spfx-webparts-lab"
    branch: "feature/project-dashboard"
  deadline: "2026-05-05T21:30:00Z"
  priority: "normal"            # low | normal | high | critical
  idempotency_key: "task-43-v1"

4.3 Leader Election


acp_version: "1.0.0-draft"
message_type: "election_request"
candidate_id: "hermes-leader"
fleet_id: "works-with-agents-fleet-1"
term: 7                        # monotonically increasing
timestamp: "2026-05-05T21:00:00Z"

4.4 Work Steal


acp_version: "1.0.0-draft"
message_type: "work_steal"
from_agent: "hermes-idle-worker"
from_agent_load: 0.0
target_agent: "hermes-spfx-builder"
target_agent_load: 0.95
stolen_tasks: ["task-42"]
timestamp: "2026-05-05T21:00:00Z"

5. Registry Dependencies

The Coordination Protocol depends on Layer 3 (Discovery):

Agent Capability Manifest — needed to match tasks to capable agents
Agent Registry — needed to know which agents exist and their status
Heartbeat — serves both L3 (status broadcast) and L5 (liveness check)

6. Implementation Notes

Minimum viable implementation

For a small fleet (2-5 agents), you don't need full leader election with Raft-style consensus. A simpler model works:

Static leader — designate one agent as coordinator in config
Direct assignment — leader assigns tasks, no work stealing needed
Heartbeat-only liveness — if an agent misses 3 heartbeats, leader reassigns
Manual conflict resolution — conflicts are rare with small fleets; log and alert

Upgrade to full protocol when fleet exceeds 5 agents or when coordination downtime becomes expensive.

Coordination Registry

The protocol requires a shared state store accessible to all agents. Options:

Store	Pros	Cons
SQLite + WAL (shared filesystem)	Zero setup, fast	Single machine only
FactBase API (workswithagents.dev)	Network-accessible, queryable	Adds latency
Redis	Battle-tested, pub/sub	Additional infrastructure
Custom HTTP API	Tailored to ACP	Build effort

Recommended: SQLite+WAL for local fleets, FactBase API for distributed fleets.

7. Status & Roadmap

Component	Status
Protocol spec	Draft 1.0.0
Reference implementation	Not yet built
Leader election	Spec complete
Work distribution	Spec complete
Conflict resolution	Spec complete
Liveness / heartbeat	Spec complete
Work stealing	Spec complete
Registry integration	Depends on L3 Agent Registry

Next: Reference implementation in Python for Hermes agent fleet. Target: 2-3 weeks.

This document is part of the Agent OSI Model framework. See Layer 3 (Agent Capability Manifest) for the discovery mechanism this protocol depends on.

Agent Coordination Protocol (ACP) — Layer 5

Agent Coordination Protocol (ACP) — Layer 5

1. Purpose

2. Design Principles

3. Protocol Components

3.1 Leader Election

3.2 Work Distribution

3.3 Conflict Resolution

3.4 Liveness & Heartbeat

4. Protocol Schema

4.1 Heartbeat Message

4.2 Task Assignment

4.3 Leader Election

4.4 Work Steal

5. Registry Dependencies

6. Implementation Notes

Minimum viable implementation

Coordination Registry

7. Status & Roadmap

Spotted something?

About

Resources

Community

Legal