Hybrid LLM Orchestration: Mixing Local and Proprietary Models

Henco Burger
May 21, 2026

Running everything through a single LLM API is simple until you hit a wall. Privacy policies require certain data to stay on-premises. Latency budgets demand sub-100ms classification, which a round trip to a cloud API rarely delivers. Cost projections show that routing every request through GPT-4-class models burns budget on tasks a 7B-parameter model handles fine.

The answer is hybrid orchestration: use local models (Ollama, vLLM, llama.cpp) for tasks where speed, privacy, or cost matter, and proprietary APIs (Claude, GPT-4, Gemini) for tasks that need frontier reasoning. The hard part isn't calling two different APIs. It's coordinating them.

Why Hybrid Is Hard

The naive approach hardcodes model assignments into the orchestrator:

// The tightly-coupled approach
// (localLlama, localMistral, openai, and anthropic stand in for model client wrappers)
async function processDocument(doc: string) {
  // Classification - fast, no data leaves your network
  const category = await localLlama.classify(doc);

  // Summarization - needs strong reasoning
  const summary = await openai.chat(doc);

  // Entity extraction - privacy-sensitive, run locally
  const entities = await localMistral.extract(doc);

  // Final synthesis - best model wins
  const report = await anthropic.messages(summary, entities);

  return report;
}

// Problems:
// - Every model is hardcoded in the orchestrator
// - Swapping a model means changing orchestration logic
// - Can't scale local and cloud independently
// - No visibility into which model handled what
// - If the local model server goes down, everything stops

This works in a prototype. In production it falls apart:

  • Model coupling -- swapping Llama for Mistral means changing orchestration code, not just a config value
  • No independent scaling -- your local GPU box and your cloud API calls are locked to the same process
  • No fault isolation -- if Ollama crashes, the entire pipeline stops, even for tasks that use cloud models
  • No visibility -- you can't see which model handled which task, what the latency was, or where errors occurred
  • No dynamic routing -- you can't shift traffic between local and cloud based on load, latency, or error rate

The Decoupled Architecture

The fix is to separate what needs doing from who does it. The orchestrator dispatches tasks by capability ("classify", "summarize", "extract-entities"). Workers advertise their capabilities via presence. NoLag routes tasks to capable workers automatically.

import { NoLagAgents, Handoff } from "@nolag/agents";

// Orchestrator - dispatches by capability, not by model
const orchestrator = new NoLagAgents(TOKEN, {
  appName: "doc-pipeline",
  agentId: "orchestrator",
  presence: { name: "orchestrator", role: "orchestrator" },
});
await orchestrator.connect();
const room = orchestrator.room("processing");
const handoff = new Handoff(room);

// Service discovery: see what's available
const classifiers = room.findAgents("classify");
console.log("Classifiers online:", classifiers.length);

// Dispatch tasks by what needs doing, not how.
// `doc` is the document text, loaded earlier in the pipeline.
// The SDK checks presence to verify a capable agent is connected.
const result = await handoff.dispatch("classify", { document: doc }, {
  priority: "high",
  waitForResult: true,
  timeout: 30000,
});

// A local Llama worker picks up "classify" tasks.
// A Claude worker picks up "synthesize" tasks.
// Workers self-register with their capabilities via presence.

The orchestrator never imports ollama or @anthropic-ai/sdk. It doesn't know which model runs each task. Workers are independent processes -- they can run on a GPU server in your data center, a cloud VM, or a serverless function.
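To make the idea of dispatching by capability concrete, here is a minimal in-memory sketch. It is not the NoLag SDK: `CapabilityRegistry`, `WorkerEntry`, and the synchronous `handle` functions are hypothetical stand-ins for presence and handoff, and there is no networking involved. It only shows that the caller names a capability and never names a model or a worker.

```typescript
// Hypothetical in-memory stand-in for presence-based capability routing.
// None of these names come from the NoLag SDK.
type Handler = (payload: unknown) => string;

interface WorkerEntry {
  id: string;
  capabilities: string[];
  handle: Handler;
}

class CapabilityRegistry {
  private workers: WorkerEntry[] = [];

  // A worker "comes online" by registering what it can do.
  register(worker: WorkerEntry): void {
    this.workers.push(worker);
  }

  // A worker "goes offline"; its capabilities disappear with it.
  unregister(id: string): void {
    this.workers = this.workers.filter((w) => w.id !== id);
  }

  // Every registered worker that advertises the given capability.
  findAgents(capability: string): WorkerEntry[] {
    return this.workers.filter((w) => w.capabilities.includes(capability));
  }

  // Dispatch by capability. The caller never names a model or a worker.
  dispatch(capability: string, payload: unknown): string {
    const [worker] = this.findAgents(capability);
    if (!worker) {
      throw new Error(`No agent online for capability "${capability}"`);
    }
    return worker.handle(payload);
  }
}

const registry = new CapabilityRegistry();
registry.register({
  id: "local-classifier",
  capabilities: ["classify", "extract-entities"],
  handle: () => "handled by local-classifier",
});
registry.register({
  id: "cloud-synthesizer",
  capabilities: ["summarize", "synthesize"],
  handle: () => "handled by cloud-synthesizer",
});
```

Swapping the model behind "classify" in this sketch means registering a different worker. The `dispatch` call site does not change.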

The Local Worker

A local worker runs on your infrastructure, connects to NoLag, and processes tasks using a local model. It advertises its capabilities via presence so the orchestrator knows it exists.

import { NoLagAgents, Handoff } from "@nolag/agents";
import { Ollama } from "ollama";

const ollama = new Ollama({ host: "http://localhost:11434" });

const worker = new NoLagAgents(TOKEN, {
  appName: "doc-pipeline",
  agentId: "local-classifier",
  presence: {
    name: "local-classifier",
    role: "agent",
    capabilities: ["classify", "extract-entities"],
    metadata: { model: "llama3.2", provider: "local" },
  },
});
await worker.connect();
const room = worker.room("processing");
const handoff = new Handoff(room);

// The SDK auto-filters: only tasks matching these capabilities arrive
handoff.onTask(["classify", "extract-entities"], async (task, respond) => {
  const response = await ollama.chat({
    model: "llama3.2",
    messages: [{ role: "user", content: task.payload.document }],
  });

  respond("success", {
    result: response.message.content,
    model: "llama3.2",
    provider: "local",
    latencyMs: Date.now() - task.createdAt,
  });
});

Key properties of this worker:

  • Inference stays on your network -- the worker sends the document to Ollama on localhost, not to an external model API
  • Presence-based discovery -- the orchestrator (and the portal dashboard) see this worker appear the moment it connects, with its model and capabilities visible
  • Selective task claiming -- the worker only processes tasks matching its capabilities, ignoring others
  • Result metadata -- every result includes the model, provider, and latency for observability
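The worker above handles two capabilities but sends the raw document to the model with no instruction, so the model has no way to tell a "classify" task from an "extract-entities" task. One way to close that gap is a small prompt builder keyed by capability. This is a sketch under assumptions: `buildMessages`, the `Capability` type, and the prompt wording are hypothetical, and how the capability name reaches the handler depends on the SDK's task object, which is not shown here.

```typescript
// Hypothetical helper. The capability names match the worker above, but the
// prompt wording and the way the capability is passed in are assumptions.
type Capability = "classify" | "extract-entities";

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// One instruction per capability, so the same model does the right job.
const SYSTEM_PROMPTS: Record<Capability, string> = {
  classify:
    "Classify the document into exactly one category. Reply with the category name only.",
  "extract-entities":
    "List the named entities (people, organizations, places) in the document as a JSON array of strings.",
};

// Builds the message list for a chat-style API: instruction first, document second.
function buildMessages(capability: Capability, document: string): ChatMessage[] {
  return [
    { role: "system", content: SYSTEM_PROMPTS[capability] },
    { role: "user", content: document },
  ];
}
```

The returned array has the `{ role, content }` shape that chat-style APIs such as `ollama.chat` accept for their `messages` field.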

The Cloud Worker

A cloud worker does the same thing, but calls a proprietary API. It can run anywhere -- it just needs network access to both NoLag and the API provider.

import { NoLagAgents, Handoff } from "@nolag/agents";
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

const worker = new NoLagAgents(TOKEN, {
  appName: "doc-pipeline",
  agentId: "cloud-synthesizer",
  presence: {
    name: "cloud-synthesizer",
    role: "agent",
    capabilities: ["summarize", "synthesize"],
    metadata: { model: "claude-sonnet-4-6", provider: "anthropic" },
  },
});
await worker.connect();
const room = worker.room("processing");
const handoff = new Handoff(room);

// The SDK auto-filters: only tasks matching these capabilities arrive
handoff.onTask(["summarize", "synthesize"], async (task, respond) => {
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 4096,
    messages: [{ role: "user", content: task.payload.document }],
  });

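  // content is a union of block types, so narrow to a text block before reading .text
  const first = response.content[0];
  const text = first && first.type === "text" ? first.text : "";

  respond("success", {
    result: text,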
    model: "claude-sonnet-4-6",
    provider: "anthropic",
    latencyMs: Date.now() - task.createdAt,
  });
});

The cloud worker and local worker are structurally identical. The only difference is the inference call. This is the point: the coordination layer doesn't care where inference happens.
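Since only the inference call differs, the shared part can be factored into one function. The sketch below is an illustration, not SDK code: `Task`, `Respond`, `makeTaskHandler`, and the `"error"` status are assumptions modeled on the snippets above. It also times the inference call on the worker itself, which avoids clock skew between the machine that created the task and the one that ran it.

```typescript
// Hypothetical shapes modeled on the worker snippets above; not SDK types.
interface Task {
  payload: { document: string };
}

interface TaskResult {
  result: string;
  model: string;
  provider: string;
  latencyMs: number;
}

type Respond = (status: "success" | "error", body: TaskResult | { error: string }) => void;

// The only part that differs between a local and a cloud worker.
type Infer = (document: string) => Promise<string>;

function makeTaskHandler(model: string, provider: string, infer: Infer) {
  return async (task: Task, respond: Respond): Promise<void> => {
    const startedAt = Date.now(); // measured on this machine only
    try {
      const result = await infer(task.payload.document);
      respond("success", { result, model, provider, latencyMs: Date.now() - startedAt });
    } catch (err) {
      // Reply on failure too, so the dispatcher is not left waiting for its timeout.
      respond("error", { error: err instanceof Error ? err.message : String(err) });
    }
  };
}
```

A local worker would pass an `infer` that calls Ollama, and a cloud worker one that calls a hosted API. The handler body stays the same in both.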

When to Use Which Model

Here's a practical routing guide:

Task Type | Model | Why
Classification / tagging | Local (Llama 3, Mistral) | Fast, cheap, accurate enough for categorical outputs
Entity extraction | Local (fine-tuned) | Privacy-sensitive data stays on-prem
Summarization | Cloud (Claude, GPT-4) | Needs strong reasoning and coherent long-form output
Code generation | Cloud (Claude, Codex) | Frontier models significantly outperform local for complex code
Translation | Local (NLLB, Madlad) | Specialized translation models are fast and good
Embedding / search | Local (Nomic, BGE) | Embedding models are small and latency-sensitive
Complex reasoning | Cloud (Claude Opus, o1) | Multi-step reasoning is where frontier models shine
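The routing guide can live in configuration rather than in orchestration code. The sketch below encodes the table as a lookup. The `TaskType` names, `pickTier`, and the rule that sensitive payloads are always pinned to the local tier are assumptions added for illustration.

```typescript
type Tier = "local" | "cloud";

// Hypothetical task-type names, one per row of the routing guide.
type TaskType =
  | "classification"
  | "entity-extraction"
  | "summarization"
  | "code-generation"
  | "translation"
  | "embedding"
  | "complex-reasoning";

// Default tier per task type, mirroring the guide.
const DEFAULT_TIER: Record<TaskType, Tier> = {
  classification: "local",
  "entity-extraction": "local",
  summarization: "cloud",
  "code-generation": "cloud",
  translation: "local",
  embedding: "local",
  "complex-reasoning": "cloud",
};

interface RoutingInput {
  taskType: TaskType;
  containsSensitiveData: boolean;
}

// Assumed policy: sensitive payloads stay local regardless of the table.
function pickTier({ taskType, containsSensitiveData }: RoutingInput): Tier {
  if (containsSensitiveData) return "local";
  return DEFAULT_TIER[taskType];
}
```

Changing a routing decision then means editing one entry in `DEFAULT_TIER`, with no change to the code that dispatches tasks.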

Shared State: The Blackboard

In a hybrid system, you need shared state that all agents can read and write. Which models are performing well? What's the current error rate? Has a document already been classified?

The NoLag blackboard pattern gives every agent a shared key-value store, updated in real time:

import { Blackboard } from "@nolag/agents";

const blackboard = new Blackboard(room);

// Orchestrator tracks model performance on the blackboard
room.on("result", (result) => {
  const { model, provider, latencyMs } = result.payload;
  const key = `model-stats:${provider}:${model}`;

  // Publish to shared state (visible to all agents in real time)
  blackboard.set(key, {
    latencyMs,
    provider,
    model,
    lastUsed: Date.now(),
  });
});

// Any agent can read the blackboard to make routing decisions:
// "If local model latency > 500ms, fall back to cloud"
// "If cloud error rate > 5%, route to backup provider"

This enables dynamic routing decisions. If your local Llama instance starts responding slowly (GPU thermal throttling, memory pressure), the orchestrator can read the blackboard and temporarily route classification tasks to a cloud worker that also advertises the classify capability. When the recorded local latency drops back under the threshold, the same check shifts traffic back.
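A routing check of this kind needs smoothed statistics rather than the latency of the last request alone. The sketch below keeps an exponentially weighted average per model and picks a tier from it. Everything here is hypothetical: `ModelStats`, `updateStats`, `chooseTier`, and the smoothing factor are illustrative, and the 500ms and 5% thresholds are borrowed from the comments in the snippet above. In a real system these values would be what gets written to and read from the blackboard.

```typescript
interface ModelStats {
  avgLatencyMs: number; // exponentially weighted moving average
  errorRate: number;    // exponentially weighted, between 0 and 1
  samples: number;
}

const ALPHA = 0.2; // weight of the newest sample (assumed value)

// Folds one task result into the running statistics.
function updateStats(prev: ModelStats | undefined, latencyMs: number, ok: boolean): ModelStats {
  const errorSample = ok ? 0 : 1;
  if (!prev) {
    return { avgLatencyMs: latencyMs, errorRate: errorSample, samples: 1 };
  }
  return {
    avgLatencyMs: ALPHA * latencyMs + (1 - ALPHA) * prev.avgLatencyMs,
    errorRate: ALPHA * errorSample + (1 - ALPHA) * prev.errorRate,
    samples: prev.samples + 1,
  };
}

const MAX_LOCAL_LATENCY_MS = 500;
const MAX_ERROR_RATE = 0.05;

// Prefer the local tier unless its recent stats look unhealthy.
function chooseTier(local: ModelStats | undefined): "local" | "cloud" {
  if (!local) return "local"; // no data yet
  const unhealthy =
    local.avgLatencyMs > MAX_LOCAL_LATENCY_MS || local.errorRate > MAX_ERROR_RATE;
  return unhealthy ? "cloud" : "local";
}
```

One caveat with this design: once traffic moves to the cloud tier, the local stats stop updating. Sending an occasional probe task to the local worker is one way to let it show that it has recovered.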

Observability: See Everything

With hybrid orchestration, visibility matters more than ever. You need to answer: which model handled this task? How long did it take? Did it succeed? Is the local model server healthy?

// The observer (portal dashboard) sees everything in real time:
// - Which models are online (presence)
// - Task dispatch and completion (events)
// - Model performance metrics (blackboard)
// - Approval requests for sensitive operations

// In the NoLag portal, the Agents tab shows:
//
//   [orchestrator] ---handoff---> [local-classifier]
//         |                           (llama3.2)
//         |
//         +--------handoff---> [cloud-synthesizer]
//                                   (claude-sonnet-4-6)
//
// Metrics: 142 tasks | 97% success | avg 340ms local | avg 1.2s cloud

The NoLag portal's agent dashboard shows the full topology in real time. You see every connected worker, their model and provider, task flow between agents, and aggregate performance metrics. No custom dashboards, no separate monitoring stack.
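The aggregate line in that view (task count, success rate, average latency per tier) is a simple fold over result events. The sketch below shows the arithmetic only. `ResultEvent`, `ProviderSummary`, and `summarizeByProvider` are hypothetical names, and the event shape is modeled on the result metadata the workers above attach to each response.

```typescript
// Hypothetical event shape, modeled on the workers' result metadata.
interface ResultEvent {
  provider: string;
  model: string;
  latencyMs: number;
  ok: boolean;
}

interface ProviderSummary {
  tasks: number;
  successRate: number;  // between 0 and 1
  avgLatencyMs: number;
}

// Groups events by provider and computes count, success rate, and mean latency.
function summarizeByProvider(events: ResultEvent[]): Record<string, ProviderSummary> {
  const totals: Record<string, { tasks: number; ok: number; latency: number }> = {};
  for (const e of events) {
    const t = totals[e.provider] ?? { tasks: 0, ok: 0, latency: 0 };
    t.tasks += 1;
    if (e.ok) t.ok += 1;
    t.latency += e.latencyMs;
    totals[e.provider] = t;
  }

  const summary: Record<string, ProviderSummary> = {};
  for (const provider of Object.keys(totals)) {
    const t = totals[provider];
    summary[provider] = {
      tasks: t.tasks,
      successRate: t.ok / t.tasks,
      avgLatencyMs: t.latency / t.tasks,
    };
  }
  return summary;
}
```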

Scaling Independently

Because workers are independent processes that connect to NoLag, you can scale each tier independently:

  • Scale local horizontally -- add more GPU boxes, each running a worker. NoLag's load-balanced subscriptions distribute tasks across them automatically.
  • Scale cloud vertically -- increase API rate limits and concurrency. Cloud workers can run as serverless functions that auto-scale with demand.
  • Scale the orchestrator to zero -- since it's just dispatching tasks (not running inference), the orchestrator is lightweight. It can run on a small VM or even a serverless function.
  • Mix providers freely -- run three workers with different cloud providers (Anthropic, OpenAI, Google) for the same capability. If one provider has an outage, the others keep processing.
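As an illustration of the last point, the sketch below rotates across several workers that share a capability and skips any that are marked unhealthy. It is an in-memory model of the behavior, not a description of how NoLag's load-balanced subscriptions are implemented. `ProviderWorker`, `RoundRobin`, and the `healthy` flag are hypothetical.

```typescript
// Hypothetical worker record; `healthy` would be driven by presence or error stats.
interface ProviderWorker {
  id: string;
  provider: string;
  healthy: boolean;
}

class RoundRobin {
  private next = 0;
  private readonly workers: ProviderWorker[];

  constructor(workers: ProviderWorker[]) {
    this.workers = workers;
  }

  // Returns the next healthy worker in rotation, or undefined if all are down.
  pick(): ProviderWorker | undefined {
    for (let i = 0; i < this.workers.length; i++) {
      const index = (this.next + i) % this.workers.length;
      const candidate = this.workers[index];
      if (candidate.healthy) {
        this.next = (index + 1) % this.workers.length;
        return candidate;
      }
    }
    return undefined;
  }
}

// Three workers advertising the same capability through different providers.
const summarizers: ProviderWorker[] = [
  { id: "summarize-anthropic", provider: "anthropic", healthy: true },
  { id: "summarize-openai", provider: "openai", healthy: true },
  { id: "summarize-google", provider: "google", healthy: true },
];
const rotation = new RoundRobin(summarizers);
```

If one provider has an outage and its worker is marked unhealthy, `pick()` passes over it and the other two keep receiving tasks.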

The Privacy Boundary

For regulated industries (healthcare, finance, legal), hybrid orchestration solves a real compliance problem. Sensitive data stays on your infrastructure:

  1. Documents arrive at your on-prem worker via NoLag (encrypted WebSocket)
  2. The local worker extracts entities, classifies, and redacts PII
  3. Only the redacted summary is sent to a cloud model for synthesis
  4. The cloud model never sees raw PII, medical records, or financial data
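The redaction in step 2 could look something like the sketch below. Treat it strictly as an illustration: `PII_PATTERNS` and `redactPii` are hypothetical, the patterns cover only email addresses and US-style SSN and phone formats, and regexes alone miss names, addresses, and many other identifiers. A production pipeline would need a dedicated PII detector and a compliance review.

```typescript
// Illustrative patterns only. These miss names, addresses, and many
// identifier formats, so they are not sufficient for compliance on their own.
const PII_PATTERNS: Array<{ label: string; pattern: RegExp }> = [
  { label: "EMAIL", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { label: "SSN", pattern: /\b\d{3}-\d{2}-\d{4}\b/g },
  { label: "PHONE", pattern: /\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b/g },
];

// Replaces each match with a [LABEL] placeholder and counts what was removed.
function redactPii(text: string): { redacted: string; counts: Record<string, number> } {
  const counts: Record<string, number> = {};
  let redacted = text;
  for (const { label, pattern } of PII_PATTERNS) {
    redacted = redacted.replace(pattern, () => {
      counts[label] = (counts[label] ?? 0) + 1;
      return `[${label}]`;
    });
  }
  return { redacted, counts };
}
```

Only the `redacted` string would be dispatched to the cloud tier. The counts can be logged locally as an audit trail without recording the values themselves.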

NoLag's access scopes enforce this boundary at the infrastructure level. Scoped actors can only see data within their tenant, and webhook payloads include scope info so your backend can route accordingly.

Getting Started

  1. Install the agents SDK: npm install @nolag/agents
  2. Create an app in the NoLag dashboard with a room for your workflow
  3. Write an orchestrator that dispatches tasks by capability
  4. Write a local worker with Ollama for fast, private tasks
  5. Write a cloud worker with your preferred API for reasoning tasks
  6. Open the Agents tab in the portal to watch the topology live

Each worker is a standalone script. Deploy them wherever makes sense: the local worker on your GPU server, the cloud worker on a VM or Lambda, the orchestrator anywhere. They find each other through NoLag presence -- no separate service registry, no hardcoded addresses.

What's Next

Hybrid orchestration is just one pattern enabled by decoupled coordination. Once your agents communicate through NoLag instead of direct function calls, you can add approval gates for sensitive operations, A/B test models by routing a percentage of traffic to each, implement automatic fallback chains, and build custom dashboards on the event stream.
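For the A/B testing case, a percentage split is usually done by hashing a stable ID so the same task always lands in the same variant. The sketch below is one way to do that, with assumed names (`fnv1a`, `assignVariant`) and an FNV-1a style hash computed over UTF-16 code units. It is independent of any SDK.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units: small and deterministic,
// which is all that bucketing needs. Not suitable for security purposes.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep as unsigned 32-bit
  }
  return hash >>> 0;
}

// Returns "B" for roughly `percentToB` percent of task IDs, "A" otherwise.
// The same taskId always maps to the same variant.
function assignVariant(taskId: string, percentToB: number): "A" | "B" {
  const bucket = fnv1a(taskId) % 100; // 0..99
  return bucket < percentToB ? "B" : "A";
}
```

Raising `percentToB` only ever moves tasks from A to B, so a gradual rollout does not reshuffle tasks that were already assigned to the new model.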

The models will keep getting better. The coordination infrastructure shouldn't need to change every time they do.