Why Multi-Agent Systems Need a Coordination Layer

The AI agent landscape has a gap. We have incredible LLMs, solid frameworks for building individual agents (LangChain, CrewAI, AutoGen), and growing demand for multi-agent systems that can tackle complex workflows. What we don't have is good coordination infrastructure: the layer that routes tasks between agents, shares state, and shows what each agent is doing.

Most teams building multi-agent systems end up reinventing the same coordination plumbing: task queues on Redis, state management on a database, custom WebSocket servers for real-time updates, and bespoke logging for observability. This is undifferentiated heavy lifting that distracts from the actual agent logic.

The Problem with Direct Orchestration

The simplest multi-agent pattern is direct orchestration: a coordinator calls agents in sequence, passing outputs from one to the next. It looks clean in a demo.

// The naive approach: direct function calls
const research = await researchAgent.run(query);
const draft = await draftAgent.run(research);
const review = await reviewAgent.run(draft);
const final = await publishAgent.run(review);

// Problems:
// - Sequential, no parallelism
// - No visibility into what's happening
// - No way to retry a single step
// - No shared state between agents
// - No human approval before publish
// - If the orchestrator crashes, everything is lost

This approach breaks down as soon as you need any of the following: parallel execution across multiple workers, visibility into what each agent is doing, the ability to retry a single failed step, shared state that all agents can read and write, human approval gates before high-stakes actions, or resilience when the orchestrator process crashes.

These aren't edge cases. They're table stakes for production multi-agent systems.
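To make one of these gaps concrete, here is a minimal sketch of what "retry a single failed step" takes when you bolt it onto direct orchestration yourself. The names `Step`, `runWithRetry`, and `flakyDraft` are hypothetical and not part of any framework mentioned here. The helper retries immediately, with no backoff and no persistence, so it covers only one item on the list and does nothing for crashes, visibility, or shared state.

```typescript
// Hypothetical helper: retry one async step without re-running earlier steps.
type Step<I, O> = (input: I) => Promise<O>;

async function runWithRetry<I, O>(
  step: Step<I, O>,
  input: I,
  maxAttempts: number,
): Promise<O> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await step(input);
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  // Every attempt failed: surface the most recent error to the caller.
  throw lastError;
}

// Example: a made-up draft step that fails twice before succeeding.
let calls = 0;
const flakyDraft: Step<string, string> = async (research) => {
  calls += 1;
  if (calls < 3) throw new Error("transient failure");
  return `draft based on ${research}`;
};

runWithRetry(flakyDraft, "research notes", 5).then((draft) => {
  console.log(draft, `(after ${calls} calls)`);
});
```

Even this small helper keeps its retry count in process memory, so a crash of the orchestrator still loses everything. That is the part a coordination layer has to own.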

Why Pub/Sub Is the Right Primitive

Publish/subscribe messaging solves the coordination problem at the right level of abstraction. Here's why:

Decoupling. Publishers don't know about subscribers. An orchestrator dispatches a task to a topic. Any worker subscribed to that topic with matching capabilities picks it up. You can add, remove, or restart workers without changing the orchestrator.

Fan-out. A single event can reach multiple interested parties: the worker that processes it, the dashboard that displays it, the logger that records it, and the human who might need to approve the result.

Persistence. With QoS 2 (exactly-once delivery) and message retention, the system survives process crashes. Tasks aren't lost. State is recoverable. Workers can reconnect and pick up where they left off.

Presence. Built-in presence tracking means you always know which agents are online, which are idle, and which are overloaded. No custom health check infrastructure needed.
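Decoupling and fan-out are easy to see in a toy, in-memory broker. The sketch below is illustrative only: `TinyBroker` and the topic name `tasks/research` are made up for this example, and it has none of the persistence, QoS, or presence features described above.

```typescript
type Handler<T> = (message: T) => void;

// A toy in-memory broker: topics map to sets of subscriber callbacks.
class TinyBroker<T> {
  private subscribers = new Map<string, Set<Handler<T>>>();

  // Register a handler and return a function that unsubscribes it.
  subscribe(topic: string, handler: Handler<T>): () => void {
    const handlers = this.subscribers.get(topic) ?? new Set<Handler<T>>();
    this.subscribers.set(topic, handlers);
    handlers.add(handler);
    return () => {
      handlers.delete(handler);
    };
  }

  // Deliver to every current subscriber; returns how many were reached.
  publish(topic: string, message: T): number {
    const handlers = this.subscribers.get(topic);
    if (!handlers) return 0;
    handlers.forEach((handler) => handler(message));
    return handlers.size;
  }
}

// The publisher only knows the topic name, never who is listening.
const broker = new TinyBroker<{ taskId: string }>();
const seen: string[] = [];
broker.subscribe("tasks/research", (m) => seen.push(`worker:${m.taskId}`));
broker.subscribe("tasks/research", (m) => seen.push(`dashboard:${m.taskId}`));
const stopLogging = broker.subscribe("tasks/research", (m) => seen.push(`logger:${m.taskId}`));

broker.publish("tasks/research", { taskId: "t1" }); // fan-out to all three
stopLogging(); // removing a subscriber needs no change on the publisher side
broker.publish("tasks/research", { taskId: "t2" }); // now reaches two
```

Note that plain fan-out delivers every message to every subscriber. Handing a task to exactly one capable worker needs an extra routing rule on top of this.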

Six Patterns as Coordination Primitives

We identified six patterns that cover the coordination needs of multi-agent systems:

Handoff - Task dispatch and result collection with capability-based routing
Inbox - Direct agent-to-agent messaging for point-to-point communication
Blackboard - Shared state with versioning so all agents see the same world
Observe - Real-time event stream for monitoring, logging, and debugging
Approve - Human-in-the-loop gates for high-stakes decisions
Tools - Remote tool invocation so agents can share capabilities

Each pattern maps to specific topics, QoS levels, and message flows. But you interact with them through a high-level SDK that hides the pub/sub mechanics.
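As an illustration of what "shared state with versioning" can mean for the Blackboard pattern, here is a minimal sketch of optimistic concurrency: a write succeeds only if the writer saw the latest version. This is one common way to version shared state, and it is an assumption for illustration, not a description of how @nolag/agents implements it. `VersionedBoard` and its three-argument `set` are hypothetical names.

```typescript
interface Versioned<T> {
  value: T;
  version: number;
}

// Hypothetical in-memory blackboard with compare-and-set writes.
class VersionedBoard<T> {
  private entries = new Map<string, Versioned<T>>();

  get(key: string): Versioned<T> | undefined {
    return this.entries.get(key);
  }

  // Succeeds only if expectedVersion matches the stored version
  // (0 means "I expect this key not to exist yet").
  set(key: string, value: T, expectedVersion: number): boolean {
    const current = this.entries.get(key);
    const currentVersion = current ? current.version : 0;
    if (currentVersion !== expectedVersion) return false; // stale writer loses
    this.entries.set(key, { value, version: currentVersion + 1 });
    return true;
  }
}

const board = new VersionedBoard<{ current: number }>();
board.set("plan", { current: 0 }, 0); // first write creates version 1

// Two agents read the same version, then both try to advance the plan.
const readByA = board.get("plan");
const readByB = board.get("plan");
const aWon = readByA ? board.set("plan", { current: 1 }, readByA.version) : false;
const bWon = readByB ? board.set("plan", { current: 1 }, readByB.version) : false;
console.log({ aWon, bWon }); // only the first writer succeeds
```

The losing agent has to re-read and decide again, which is what keeps all agents looking at the same world instead of silently overwriting each other.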

What Coordination Actually Looks Like

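Here's the same content pipeline from earlier, but with proper coordination infrastructure. Only the research dispatch and the approval gate are shown; the draft and review steps are not spelled out here: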

import { NoLagAgents } from "@nolag/agents";

// TOKEN (credentials) and query (the user's request) are assumed to be defined elsewhere
const agents = new NoLagAgents(TOKEN);
await agents.connect();
const room = agents.joinRoom("content-pipeline");

// Shared state visible to all agents
await room.blackboard.set("plan", {
  steps: ["research", "draft", "review", "publish"],
  current: 0,
});

// Dispatch to any capable worker (parallel-ready)
await room.handoff.dispatch({
  type: "research",
  payload: { query },
  capabilities: ["research"],
});

// Human gate before publish
room.handoff.onResult(async (result) => {
  if (result.taskType === "review") {
    const approval = await room.approve.request({
      action: "publish",
      content: result.data,
    });
    if (approval.approved) {
      await room.handoff.dispatch({
        type: "publish", payload: result.data,
        capabilities: ["publishing"],
      });
    }
  }
});

// Full observability: log every event in the room
room.observe.on("*", (e) => console.log(e));

The difference: workers can be scaled independently, every event is observable, shared state is accessible to all agents, human approval gates are built in, and the system survives crashes because messages are persisted.
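The `capabilities` field in the dispatch calls above implies some matching step between tasks and workers. The post does not say how that matching works, so the following is only a guess at the simplest possible rule: pick the first idle worker whose capabilities cover everything the task asks for. `TaskRequest`, `WorkerInfo`, and `pickWorker` are hypothetical names, not SDK types.

```typescript
interface TaskRequest {
  type: string;
  capabilities: string[]; // everything the task requires
}

interface WorkerInfo {
  id: string;
  capabilities: Set<string>; // everything the worker offers
  busy: boolean;
}

// Return the first idle worker that offers every required capability.
// A task with an empty capability list matches any idle worker.
function pickWorker(
  task: TaskRequest,
  workers: WorkerInfo[],
): WorkerInfo | undefined {
  return workers.find(
    (w) => !w.busy && task.capabilities.every((c) => w.capabilities.has(c)),
  );
}

const pool: WorkerInfo[] = [
  { id: "w1", capabilities: new Set(["research"]), busy: true },
  { id: "w2", capabilities: new Set(["research", "draft"]), busy: false },
  { id: "w3", capabilities: new Set(["review"]), busy: false },
];

const chosen = pickWorker({ type: "research", capabilities: ["research"] }, pool);
console.log(chosen ? chosen.id : "no capable worker online");
```

Because the orchestrator names capabilities rather than workers, adding a second research worker to the pool changes who gets picked without touching the dispatch code.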

NoLag as the Substrate

NoLag was built as real-time messaging infrastructure for chat, notifications, and IoT. It turns out that the same primitives - topics, rooms, presence, QoS, and access control - map directly to multi-agent coordination needs.

@nolag/agents is a high-level SDK that wraps these primitives into the six patterns above. Under the hood, it uses the same battle-tested WebSocket infrastructure that already carries millions of real-time messages. You get persistent queues, exactly-once delivery, per-topic ACLs, and sub-50ms latency without building any of it yourself.

The key insight: agent coordination is a real-time messaging problem. And real-time messaging is a solved problem - if you use the right infrastructure.
