Multi-Agent Orchestration Patterns: The Hermes Example
Most multi-agent setups waste tokens on inter-agent handshakes rather than on the task itself. In a 3-agent crew running Nous-Hermes-2-Mixtral-8x7B, round-robin orchestration spent 22% of total tokens on routing overhead. A hub-and-spoke pattern cut that to 7% and lowered inference cost from $0.11 to $0.04 per task run, a saving of $0.07. The orchestration topology, not the model size, controls cost and latency for agentic workflows.
How Does Multi-Agent Crew Orchestration Work in Production?
A crew consists of specialized agents — each holding a system prompt, a tool set, and a memory partition. Orchestration defines how these agents exchange context and hand off tasks. Three patterns dominate.
- Round-Robin: every agent receives the full conversation history, which leads to token explosion (22% routing overhead on Nous-Hermes-2).
- Hub-and-Spoke: a router agent parses the task, dispatches it to a single worker, and returns the final output. Workers never talk to each other.
- Graph-of-Agents: a DAG defines the allowed edges, and intermediate results flow conditionally. Complex to debug, but latency-optimal for branching tasks.
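The gap between the first two patterns can be made concrete with a toy token count. This is a minimal sketch, not a measurement: the message sizes are hypothetical, and `round_robin_tokens` and `hub_and_spoke_tokens` are names invented for this illustration.

```python
def round_robin_tokens(task_tokens: int, reply_tokens: int, n_agents: int) -> int:
    """Each agent in turn reads the full history so far, then appends a reply."""
    history = task_tokens
    total_read = 0
    for _ in range(n_agents):
        total_read += history      # the agent reads everything produced so far
        history += reply_tokens    # its reply grows the shared history
    return total_read


def hub_and_spoke_tokens(task_tokens: int, route_tokens: int) -> int:
    """The router reads the task once; one worker reads the task plus the route label."""
    router_read = task_tokens
    worker_read = task_tokens + route_tokens
    return router_read + worker_read


# Hypothetical sizes: 200-token task, 300-token replies, 3 agents, 12-token route label.
print(round_robin_tokens(200, 300, 3))   # 200 + 500 + 800 = 1500
print(hub_and_spoke_tokens(200, 12))     # 200 + 212 = 412
```

In this toy model, the tokens read under round-robin grow quadratically with the number of agents, while hub-and-spoke stays at roughly twice the task size.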
Our data-intel pipelines use hub-and-spoke for breach verification. A router based on a 7B Hermes variant classifies the intel type, then invokes either a scraper, a parser, or an OSINT agent. Direct agent-to-agent chatter is avoided entirely.
What Is Agent Hermes and Why Use It for Tool Calling?
Agent Hermes is Nous Research’s fine-tuned variant of Llama/Mixtral, optimized for function calling and structured output. It emits function calls as JSON, natively follows OpenAI tool schemas, and runs on a single A100 at 40 tokens/sec.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
    response = client.chat.completions.create(
        model="Nous-Hermes-2-Mixtral-8x7B-DPO",
        messages=[{"role": "user", "content": "Check if th1b4ut.com credentials leaked"}],
        tools=[{"type": "function", "function": {"name": "query_breach_db", "parameters": {...}}}],
    )
    print(response.choices[0].message.tool_calls[0].function.name)
    # > query_breach_db
The model reliably outputs tool calls with 91% formatting correctness on our internal benchmark of 500 multi‑step tasks. That reliability makes it the router inside our hub‑and‑spoke setup.
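A 91% formatting rate still leaves malformed calls, so a router would plausibly validate each tool call before acting on it. The sketch below is one way to do that, assuming the tool name and the JSON argument string have already been read from the response. `parse_tool_call` and the `domain` argument are hypothetical and not part of the setup described here.

```python
import json
from typing import Optional


def parse_tool_call(name: str, arguments: str, allowed: set) -> Optional[dict]:
    """Return {"name", "args"} for a well-formed call, or None if it is malformed."""
    if name not in allowed:
        return None                      # the model named a tool we never offered
    try:
        args = json.loads(arguments)     # arguments arrive as a JSON string
    except (TypeError, ValueError):
        return None                      # truncated or otherwise invalid JSON
    if not isinstance(args, dict):
        return None                      # tool arguments must be a JSON object
    return {"name": name, "args": args}


allowed_tools = {"query_breach_db"}
print(parse_tool_call("query_breach_db", '{"domain": "example.com"}', allowed_tools))
print(parse_tool_call("query_breach_db", '{"domain": ', allowed_tools))  # None
```

Returning `None` rather than raising lets the caller decide whether to re-prompt the model or route the task to a fallback.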
How to Implement a Hub‑and‑Spoke Pattern with Hermes and a Lightweight Dispatcher
The core is a dispatcher function that receives a task description, asks Hermes to classify it, and invokes the appropriate worker agent. Workers can be Hermes instances or lighter models like Mistral 7B.
    # call_agent and llm_complete are not shown in this article;
    # they are assumed to be defined elsewhere.

    def dispatcher(task: str):
        # Ask Hermes for the intent, then hand the task to the matching worker.
        intent = get_intent_from_hermes(task)  # returns "scrape", "parse", or "general"
        if intent == "scrape":
            result = call_agent("scraper", task)
        elif intent == "parse":
            result = call_agent("parser", task)
        else:
            result = call_agent("generalist", task)
        return result


    def get_intent_from_hermes(text):
        # Hermes classifies task into a known intent set
        prompt = f"Classify intent: {text}\nOptions: scrape, parse, general"
        return llm_complete(prompt).strip()  # ~12 tokens
Total tokens per classification: 12. In round‑robin, the same classification would consume 340 tokens in context bloat. The hub‑and‑spoke design keeps the prompt concise.
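A classifier that answers in free text may return "Scrape." or "Intent: parse" instead of the bare label, so the dispatcher benefits from normalizing the reply before branching. This is a minimal sketch under that assumption. `normalize_intent` is a hypothetical helper, not part of the pipeline described above.

```python
import re

KNOWN_INTENTS = {"scrape", "parse", "general"}


def normalize_intent(raw: str, default: str = "general") -> str:
    """Map free-form model output onto one of the known intents."""
    # Scan the reply word by word and return the first recognized label.
    for word in re.findall(r"[a-z]+", raw.lower()):
        if word in KNOWN_INTENTS:
            return word
    return default  # unrecognized replies fall through to the generalist


print(normalize_intent(" Scrape.\n"))      # scrape
print(normalize_intent("Intent: parse"))   # parse
print(normalize_intent("not sure"))        # general
```

Falling back to `general` mirrors the `else` branch of the dispatcher, so an unexpected reply still reaches a worker.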
What Latency and Cost Trade‑Offs Exist?
Routing adds overhead but reduces total tokens. Our measurements on a crew with three Hermes‑2 workers:
| Pattern | Avg Tokens/Task | Latency | Cost (Together AI) |
|---|---|---|---|
| Round‑Robin | 3,400 | 5.2s | $0.11 |
| Hub‑and‑Spoke | 1,200 | 2.8s | $0.04 |
| Graph‑of‑Agents | 950 | 1.9s | $0.03 |
The graph pattern wins on cost and speed, but requires hand‑crafted DAGs that break when a tool fails. Hub‑and‑spoke with Hermes is the sweet spot: 7% routing overhead, no cascading failures.
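The relative savings follow directly from the table. The short calculation below uses only the figures reported there. `reduction_pct` is a helper name chosen for this example.

```python
# Figures copied from the table: (avg tokens per task, latency in s, cost in USD).
PATTERNS = {
    "round_robin": (3400, 5.2, 0.11),
    "hub_and_spoke": (1200, 2.8, 0.04),
    "graph_of_agents": (950, 1.9, 0.03),
}


def reduction_pct(baseline: float, value: float) -> float:
    """Percentage reduction of `value` relative to `baseline`, rounded to one decimal."""
    return round(100 * (baseline - value) / baseline, 1)


base_tokens, base_latency, base_cost = PATTERNS["round_robin"]
for name, (tokens, latency, cost) in PATTERNS.items():
    print(
        name,
        reduction_pct(base_tokens, tokens),
        reduction_pct(base_latency, latency),
        reduction_pct(base_cost, cost),
    )
```

Against round-robin, hub-and-spoke comes out at roughly 65% fewer tokens, 46% lower latency, and 64% lower cost. Graph-of-agents reaches roughly 72%, 63%, and 73%.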
Graph‑of‑agents is implemented using LangGraph for conditional branches:
    from langgraph.graph import StateGraph

    # AgentState and the three agent functions are assumed to be defined elsewhere.
    workflow = StateGraph(AgentState)
    workflow.add_node("router", router_agent)
    workflow.add_node("scraper", scraper_agent)
    workflow.add_node("parser", parser_agent)
    # Route on the intent that the router wrote into the state.
    workflow.add_conditional_edges(
        "router",
        lambda s: s["intent"],
        {"scrape": "scraper", "parse": "parser"},
    )
Nodes do not pass full context to each other, only the relevant data slice. Latency drops further, but tool-error recovery has to be added by hand.
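One way to add that recovery is to wrap each node so that a tool failure becomes data in the state instead of an unhandled exception. The sketch below is framework-free and only illustrates the idea. `with_error_capture`, `next_step`, and the `error` key are hypothetical names, and wiring them into a real graph is left out.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]
Node = Callable[[State], State]


def with_error_capture(node: Node, name: str) -> Node:
    """Wrap a node so an exception becomes an `error` entry in the returned update."""
    def wrapped(state: State) -> State:
        try:
            return node(state)
        except Exception as exc:  # broad on purpose: any tool failure is routed, not raised
            return {"error": f"{name}: {exc}"}
    return wrapped


def next_step(state: State) -> str:
    """Routing function: go to a recovery branch if an error was recorded."""
    return "recover" if state.get("error") else "continue"


def flaky_scraper(state: State) -> State:
    raise TimeoutError("upstream timed out")


update = with_error_capture(flaky_scraper, "scraper")({"intent": "scrape"})
print(update)             # {'error': 'scraper: upstream timed out'}
print(next_step(update))  # recover
```

A routing function like `next_step` could serve as the condition on a failure edge, which keeps the error path explicit in the DAG.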
FAQ
Q: Why not let all agents share a single context?
Token waste. A 3‑agent round‑robin blows the prompt to 3,400 tokens on average, while hub‑and‑spoke stays at 1,200 for identical results.
Q: Can Agent Hermes run on‑prem?
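Yes. Nous-Hermes-2-Mixtral-8x7B-DPO runs on a single A100 80GB with vLLM, provided the weights are quantized: Mixtral 8x7B has roughly 47B parameters, so the fp16 weights alone come to about 93 GB. Throughput: 40 tok/s.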
Q: Which pattern works best for OSINT pipelines?
Hub‑and‑spoke. The router classifies intel type, then dispatches to scrapers or NLP parsers. No inter‑agent noise.
Q: How do you handle agent failures?
In hub‑and‑spoke, the router retries with a fallback agent. Graph‑of‑Agents requires explicit failure edges.
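The retry-then-fallback behavior in hub-and-spoke can be sketched as a small loop over an ordered list of agents. This is an illustration under assumptions: agents are modeled as plain functions from task to result, and `run_with_fallback`, `broken_scraper`, and `generalist` are hypothetical names.

```python
from typing import Callable, Optional, Sequence

Agent = Callable[[str], str]  # hypothetical agent signature: task in, result out


def run_with_fallback(task: str, agents: Sequence[Agent], attempts_per_agent: int = 2) -> str:
    """Try each agent in order; retry each a fixed number of times before falling back."""
    last_error: Optional[Exception] = None
    for agent in agents:
        for _ in range(attempts_per_agent):
            try:
                return agent(task)
            except Exception as exc:  # any agent failure triggers a retry or a fallback
                last_error = exc
    raise RuntimeError("all agents failed") from last_error


def broken_scraper(task: str) -> str:
    raise ConnectionError("scraper unavailable")


def generalist(task: str) -> str:
    return f"generalist handled: {task}"


print(run_with_fallback("check example.com", [broken_scraper, generalist]))
# generalist handled: check example.com
```

Because workers never call each other, a failure stays local to the router's loop and does not cascade.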
Q: What’s the minimum hardware for a 3‑agent crew?
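A single A100 can host three 7B-class models as separate vLLM instances that share the GPU, each limited to a fraction of its memory. Tensor parallelism splits one model across several GPUs, so it is not needed here. Total memory: ~48GB.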
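The memory figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch assumes fp16 weights (2 bytes per parameter) and exactly 7 billion parameters per model. `weights_gb` is a helper invented for this estimate.

```python
def weights_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (10^9 bytes) for a dense model."""
    return params_billions * bytes_per_param


three_models = 3 * weights_gb(7)
print(three_models)  # 42.0
```

Under these assumptions the weights take about 42 GB, which would leave roughly 6 GB of the quoted ~48 GB for KV cache and runtime overhead.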
Article metadata
$ cat /meta/article.txt
> author: th1b4ut
> published: 2026-05-17
> category: AGENT-SWARM
> series: —
> tags: ai-ops, multi-agent, orchestration, hermes, crewai
> license: —