
A pattern repeats itself across Australian mid-market businesses. A company builds a single AI agent, it works brilliantly for the demo, and then reality sets in. The agent that handled customer inquiries beautifully starts hallucinating when asked about billing. The document processor that extracted data flawlessly chokes when it encounters a new form type.
The uncomfortable truth? A single AI agent will fail you when tasks cross domains, require specialisation, or demand parallel processing.
According to recent industry analysis, organisations using multi-agent architectures achieve 45% faster problem resolution and 60% more accurate outcomes compared to single-agent systems. The AI agents market is projected to grow from $5.25 billion in 2024 to $52.62 billion by 2030, with multi-agent systems representing the fastest-growing segment.
This article is a technical deep dive for decision-makers who need to understand when and how to architect multi-agent systems. It uses a 7-phase document processing architecture (the "Carbonly" pattern) as a concrete example throughout.
Before discussing multi-agent systems, you need to understand precisely where single agents break down. Three failure modes appear consistently across deployments.
Single-agent systems bottleneck as task volume and data grow. They effectively work on a single thread, limited to one task at a time. In environments that demand rapid multitasking or high-volume processing, this becomes crippling.
Consider a Melbourne logistics company with a single agent handling shipment tracking queries. At 50 concurrent users, response times are acceptable. At 200 users during peak season, the system collapses to 45-second response times.
Research from Google in late 2024 revealed a critical insight: there is a potential trade-off within single models between strong memorisation (needed for precise tool use) and effective in-context learning (needed for adapting to novel situations). You cannot optimise for both in a single agent.
IBM experts summarise it bluntly: "You are going to hit a limit on what single agents can do, and then you are going to go back to multi-agent collaboration again."
Every prompt you send to an LLM has a finite context window. When a single agent must understand billing systems, customer history, product catalogues, and compliance rules simultaneously, you exhaust that window rapidly. The agent loses critical context and starts making errors.
The Rule of Thumb: If your task requires expertise in more than two distinct domains, or if you need to process multiple requests concurrently, a single agent will fail you.
Here's a multi-agent architecture that works for complex document processing. The Carbonly pattern emerged from work with a carbon accounting firm processing thousands of supplier invoices.
Phase 1 - Intake Agent: Receives documents, classifies type, extracts metadata. Uses a lightweight model (GPT-4o-mini equivalent) for speed.
Phase 2 - Validation Agent: Checks document completeness, identifies missing fields, flags anomalies. Specialised prompts for each document type.
Phase 3 - Extraction Agent: Deep extraction using a more capable model. OCR integration, table parsing, entity recognition.
Phase 4 - Enrichment Agent: Cross-references extracted data with external systems (ABN lookup, supplier databases, pricing catalogues).
Phase 5 - Compliance Agent: Checks against business rules, regulatory requirements, approval thresholds.
Phase 6 - Review Agent: Confidence scoring, exception flagging, human escalation decisions.
Phase 7 - Integration Agent: Formats output for downstream systems (MYOB, Xero, custom ERPs), handles API calls.
```
Document Input
      |
      v
[Phase 1: Intake Agent] --classify--> [Phase 2: Validation Agent]
      |                                         |
      |                                         v
      |                                [Phase 3: Extraction Agent]
      |                                         |
      |                                         v
      |                                [Phase 4: Enrichment Agent]
      |                                         |
      |                                         v
      |                                [Phase 5: Compliance Agent]
      |                                         |
      | escalate                                v
      |                                [Phase 6: Review Agent]
      |                                         |
      |                                         v
      |                                [Phase 7: Integration Agent]
      v                                         |
[Exception Queue]                               v
                                       [Completed Output]
```
Each agent has a single responsibility, uses a model optimised for its task, and communicates through structured handoffs. The system processes 400% more documents per hour than the single-agent version it replaced.
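In code, the core sequential flow reduces to a pipeline of phase functions passing a structured handoff from one to the next. The stub agents and `Handoff` structure below are an illustrative sketch, not the production implementation:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Handoff:
    """Structured state passed between phases."""
    document: str
    context: dict = field(default_factory=dict)

# Each phase is a function that consumes and returns a Handoff (stubs here).
def intake_agent(h: Handoff) -> Handoff:
    h.context["document_type"] = "invoice"   # classify + extract metadata
    return h

def validation_agent(h: Handoff) -> Handoff:
    h.context["validated"] = True            # completeness checks
    return h

PIPELINE: list[Callable[[Handoff], Handoff]] = [intake_agent, validation_agent]

def run_pipeline(document: str) -> Handoff:
    h = Handoff(document=document)
    for phase in PIPELINE:                   # strict ordering: each phase
        h = phase(h)                         # consumes the previous handoff
    return h

result = run_pipeline("supplier_invoice.pdf")
```

Extending to all seven phases is just a longer `PIPELINE` list; the single-responsibility boundary stays at the function level.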
According to Microsoft's AI architecture guidance, there are five primary orchestration patterns. Choosing the wrong one is the most common mistake.
Structure: Linear pipeline where each agent processes the previous agent's output.
Best For: linear workflows with strict stage dependencies, where each step consumes the previous step's output.
Avoid When: subtasks are independent of one another; a linear chain serialises work that could run in parallel and makes end-to-end latency the sum of every stage.
The Carbonly architecture uses sequential orchestration for its core flow because document processing has inherent dependencies. You cannot validate what you have not classified.
Structure: Multiple agents run simultaneously on the same task, then aggregate results.
Best For: independent analyses of the same input, where breadth of perspective and wall-clock speed matter more than strict ordering.
Example: A stock analysis system runs fundamental, technical, sentiment, and ESG analysis agents concurrently, then aggregates recommendations.
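The fan-out/aggregate shape of that example can be sketched with `asyncio`; the analyser stubs and the majority-vote aggregation rule are assumptions for illustration:

```python
import asyncio

# Stub analysers: in practice each would call its own model and data sources.
async def fundamental(ticker: str) -> dict:
    return {"agent": "fundamental", "recommendation": "buy"}

async def technical(ticker: str) -> dict:
    return {"agent": "technical", "recommendation": "hold"}

async def sentiment(ticker: str) -> dict:
    return {"agent": "sentiment", "recommendation": "buy"}

async def analyse(ticker: str) -> str:
    # Fan out: all agents work on the same task concurrently.
    results = await asyncio.gather(
        fundamental(ticker), technical(ticker), sentiment(ticker)
    )
    # Aggregate: simple majority vote across recommendations.
    votes = [r["recommendation"] for r in results]
    return max(set(votes), key=votes.count)

decision = asyncio.run(analyse("BHP"))  # "buy": two of three agents agree
```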
Structure: Dynamic delegation where agents assess tasks and transfer to specialists.
Best For: triage-style workflows where the right specialist cannot be known up front and must be determined from the request itself.
Critical Warning: Microsoft's guidance specifically recommends limiting group chat patterns to 3 or fewer agents to prevent infinite loops and maintain control.
Hub-and-spoke uses a central orchestrator managing all interactions. Predictable, but creates a bottleneck and single point of failure.
Mesh architectures let agents communicate directly. More resilient (agents route around failures), but harder to debug and monitor.
Recommendation: Start with hub-and-spoke for simplicity. Move to hybrid patterns (high-level orchestrators with local mesh networks for tactical execution) only when you have the observability infrastructure to support it.
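A minimal hub-and-spoke skeleton makes the trade-off concrete: every message passes through one orchestrator, which is easy to log and monitor but is also the bottleneck and single point of failure noted above. The registry API here is illustrative:

```python
class Hub:
    """Central orchestrator: all agent-to-agent traffic passes through it."""

    def __init__(self):
        self.agents = {}  # name -> handler function

    def register(self, name, handler):
        self.agents[name] = handler

    def send(self, target, message):
        # Single choke point: one place to add logging, metrics, and
        # access control -- and one place whose failure stops everything.
        if target not in self.agents:
            raise KeyError(f"unknown agent: {target}")
        return self.agents[target](message)

hub = Hub()
hub.register("billing", lambda msg: f"billing handled: {msg}")
hub.register("shipping", lambda msg: f"shipping handled: {msg}")

reply = hub.send("billing", "invoice #123 query")
```

A mesh variant would let agents hold references to each other directly, trading that central visibility for resilience.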
This is where implementations succeed or fail. The technical details matter enormously.
An agentic handoff occurs when one agent directly and dynamically passes control to another after finishing its work. The critical element is context transfer: the receiving agent must have sufficient state to act appropriately.
In technical terms, handoffs involve:
```python
# Conceptual handoff structure
handoff_payload = {
    "source_agent": "validation_agent",
    "target_agent": "extraction_agent",
    "context": {
        "document_type": "invoice",
        "confidence": 0.94,
        "validated_fields": ["vendor_name", "date", "total"],
        "flagged_anomalies": []
    },
    "instructions": "Extract line items and payment terms"
}
```
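On the receiving side, the target agent should refuse a handoff whose context is incomplete rather than guess. A minimal guard, with the required keys and the 0.8 acceptance threshold assumed for illustration:

```python
REQUIRED_CONTEXT = {"document_type", "confidence", "validated_fields"}

def accept_handoff(payload: dict) -> bool:
    """Accept only handoffs that carry enough state to act on."""
    context = payload.get("context", {})
    if REQUIRED_CONTEXT - context.keys():
        # Reject rather than guess: a handoff with missing state is a
        # common root cause of silent multi-agent failures.
        return False
    return context["confidence"] >= 0.8  # illustrative acceptance threshold

# A complete, high-confidence handoff passes the guard...
ok = accept_handoff({"context": {
    "document_type": "invoice", "confidence": 0.94,
    "validated_fields": ["vendor_name", "date", "total"]}})
# ...while one missing its document_type is rejected.
bad = accept_handoff({"context": {"confidence": 0.94, "validated_fields": []}})
```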
Four major protocols have emerged for agent communication:
| Protocol | Purpose | Use Case |
|---|---|---|
| MCP (Model Context Protocol) | Tool and context sharing | Agents sharing access to databases, APIs |
| A2A (Agent-to-Agent) | Direct agent negotiation | Peer-to-peer workflows without central orchestration |
| ACP (Agent Communication Protocol) | Structured message passing | Enterprise systems with strict data contracts |
| AG-UI | Agent-user interaction | Handling human-in-the-loop touchpoints |
For most Australian mid-market implementations, MCP provides the right balance of standardisation and flexibility.
Here is what vendors will not tell you: accumulated context across multiple agents can exhaust token budgets rapidly. The Microsoft architecture guide explicitly warns about "growing context windows" leading to "token exhaustion."
Practical Solutions: summarise accumulated context at each handoff rather than forwarding full transcripts; pass references (document IDs, record keys) instead of raw payloads; give each agent only the context slice its task requires; and track context size per handoff so growth is caught before the budget is exhausted.
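A simple mitigation, trimming accumulated history to a fixed token budget before each handoff, can be sketched as follows. The four-characters-per-token ratio is a rough heuristic, not a real tokeniser:

```python
MAX_CONTEXT_TOKENS = 2000
CHARS_PER_TOKEN = 4  # rough heuristic; use a real tokeniser in production

def trim_context(history: list[str]) -> list[str]:
    """Keep the most recent messages that fit within the token budget."""
    budget = MAX_CONTEXT_TOKENS * CHARS_PER_TOKEN
    kept, used = [], 0
    for message in reversed(history):       # newest first
        if used + len(message) > budget:
            break
        kept.append(message)
        used += len(message)
    return list(reversed(kept))             # restore chronological order

history = ["old note " * 3000, "recent validation summary", "extraction result"]
trimmed = trim_context(history)
# The oversized old message is dropped; the recent messages survive.
```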
When multiple agents analyse the same problem, they will sometimes disagree. This is not a bug; it is often a feature. But you need mechanisms to resolve conflicts.
Each agent provides a confidence score with its output. The system weights votes by confidence.
```python
# Simplified confidence-weighted voting mechanism
agent_outputs = [
    {"agent": "fundamental", "recommendation": "buy", "confidence": 0.82},
    {"agent": "technical", "recommendation": "hold", "confidence": 0.71},
    {"agent": "sentiment", "recommendation": "buy", "confidence": 0.68},
]

def calculate_weighted_consensus(outputs):
    # Sum confidence per recommendation, pick the heaviest option,
    # then report the average confidence of that option's supporters.
    totals = {}
    for o in outputs:
        totals[o["recommendation"]] = totals.get(o["recommendation"], 0.0) + o["confidence"]
    winner = max(totals, key=totals.get)
    supporters = [o["confidence"] for o in outputs if o["recommendation"] == winner]
    return winner, sum(supporters) / len(supporters)

weighted_vote = calculate_weighted_consensus(agent_outputs)
# Result: "buy" with aggregated confidence 0.75
```
Designate certain agents as authoritative for specific domains. If the compliance agent flags a risk, it overrides the efficiency recommendations from other agents.
When agent confidence falls below a threshold, or when agents disagree beyond a tolerance level, escalate to human review. This is not failure; it is appropriate system design.
In the Carbonly implementation, documents where agents disagree by more than 20% on extracted values automatically route to a human reviewer. This typically catches approximately 3% of documents and prevents costly errors.
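A relative-disagreement guard like this takes only a few lines; the 20% threshold matches the figure above, while the function and example values are illustrative:

```python
ESCALATION_THRESHOLD = 0.20  # 20% relative disagreement, as above

def needs_human_review(values: list[float]) -> bool:
    """Escalate when agents' extracted values disagree by more than 20%."""
    lo, hi = min(values), max(values)
    if hi == 0:
        return lo != hi
    return (hi - lo) / hi > ESCALATION_THRESHOLD

# Two agents extract the same invoice total: 1480 vs 1100 is ~26% apart,
# so the document routes to a human reviewer.
flag = needs_human_review([1480.0, 1100.0])
```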
Errors in multi-agent systems are fundamentally different from traditional software errors. Because agents develop dynamic, context-dependent relationships, a single agent's failure cascades unpredictably: other agents have come to depend on its specific knowledge and decision-making patterns, and state synchronisation becomes nearly impossible at scale.
Traditional circuit breakers assume stateless services. AI agents violate this assumption. Deploy circuit breakers at the cluster level rather than individual connections:
```
Agent Cluster A (Intake + Validation)
        |
[Circuit Breaker - monitors cluster health]
        |
Agent Cluster B (Extraction + Enrichment)
        |
[Circuit Breaker - monitors cluster health]
        |
Agent Cluster C (Compliance + Review + Integration)
```
Use adaptive triggers monitoring interaction success rates, response times, and behavioural anomalies rather than fixed thresholds.
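A cluster-level breaker driven by a rolling success rate, rather than a fixed failure count, might look like this sketch (the window size and trip threshold are illustrative):

```python
from collections import deque

class ClusterCircuitBreaker:
    """Opens when a cluster's recent success rate drops below a floor."""

    def __init__(self, window: int = 50, min_success_rate: float = 0.8):
        self.results = deque(maxlen=window)   # rolling window, adapts over time
        self.min_success_rate = min_success_rate

    def record(self, success: bool) -> None:
        self.results.append(success)

    @property
    def open(self) -> bool:
        # Breaker opens (blocks traffic to the cluster) once the rolling
        # success rate across the whole cluster falls below the floor.
        if len(self.results) < 10:            # not enough signal yet
            return False
        rate = sum(self.results) / len(self.results)
        return rate < self.min_success_rate

breaker = ClusterCircuitBreaker()
for outcome in [True] * 8 + [False] * 4:      # 8 successes, then 4 failures
    breaker.record(outcome)
# 8/12 success rate is below the 0.8 floor, so the breaker opens.
```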
A common early mistake: using average response times for timeout configuration. LLM inference varies dramatically. Use 95th percentile response times to capture realistic worst-case behaviour. This prevents premature timeouts and false failure signals.
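Deriving that timeout from observed latencies is straightforward; the sample values and the 1.5x headroom multiplier below are illustrative:

```python
import statistics

# Observed end-to-end latencies (seconds) for one agent, e.g. from tracing.
latencies = [1.2, 1.4, 1.1, 9.8, 1.3, 1.5, 2.0, 1.2, 7.5, 1.6,
             1.3, 1.4, 1.8, 1.2, 8.9, 1.5, 1.3, 1.7, 1.4, 1.6]

# statistics.quantiles with n=20 returns 19 cut points at 5% steps;
# the last one (index 18) is the 95th percentile.
p95 = statistics.quantiles(latencies, n=20)[18]

# Timeout = p95 plus headroom, not the mean: the slow outliers that
# dominate the p95 barely move the average.
timeout = p95 * 1.5
mean = statistics.mean(latencies)  # misleadingly low next to the p95
```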
For GPT-4 class models, typical configurations include:
When systems fail, you cannot simply restart everything. Map explicit dependencies (data flow) and implicit ones (learned coordination patterns). Implement staged recovery: restore foundational agents first, verify their outputs against known-good baselines, then bring dependent clusters back online one at a time.
Traditional APM tools will not tell you why your multi-agent system is misbehaving. You need specialised observability.
LangSmith and similar platforms provide nested spans for fine-grained debugging across multi-agent environments. Every agent-level decision and sub-action, including LLM generations, tool calls, and data retrievals, gets captured.
LangSmith's tracing operates asynchronously, adding virtually no measurable overhead, which makes it suitable for performance-critical production environments.
| Metric | Why It Matters |
|---|---|
| Per-agent latency (p50, p95, p99) | Identifies bottleneck agents |
| Handoff success rate | Detects communication failures |
| Context size per handoff | Warns of token exhaustion |
| Agent confidence distributions | Catches model degradation |
| Error rate by agent type | Focuses debugging effort |
| Human escalation rate | Measures system confidence |
LangSmith offers enterprise-grade alerting via PagerDuty and webhooks. Configure alerts for: error-rate spikes by agent type, p95/p99 latency regressions, handoff failures, and sustained rises in the human escalation rate.
For Australian enterprises requiring data sovereignty, LangSmith offers self-hosted deployments on your Kubernetes cluster where data never leaves your environment.
Based on implementations across Australian businesses, here is practical guidance.
Google research found that if a task is sequential and a single agent could perform it accurately at least 45% of the time, using multiple agents actually reduced performance by 39% to 70%. The coordination overhead overwhelms the benefits.
Only introduce multi-agent complexity when you have: tasks spanning more than two distinct domains, genuine concurrency requirements, measured evidence that a single agent is hitting its limits, and the observability infrastructure to debug agent interactions.
Microsoft's architecture guidance recommends limiting agent groups to 3 or fewer to maintain control. Start there. You can always add complexity; removing it is much harder.
Multi-agent systems multiply your inference costs. The Carbonly implementation counters this with a tiered model strategy: a lightweight model (GPT-4o-mini class) for high-volume phases such as intake, with a more capable model reserved for the phases that need deep reasoning, such as extraction.
This tiered approach reduced inference costs by 65% compared to using capable models throughout.
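The exact saving depends on your document mix and model prices; with hypothetical per-phase costs the arithmetic looks like this (both prices and the phase split are assumptions, not vendor quotes):

```python
# Hypothetical per-document inference costs (illustrative, not vendor pricing).
CAPABLE_COST = 0.050      # capable model, per document, per phase
LIGHT_COST = 0.005        # lightweight model, per document, per phase

PHASES = 7
HEAVY_PHASES = 2          # phases that keep the capable model (assumed split)

# Baseline: the capable model in every phase.
baseline = PHASES * CAPABLE_COST

# Tiered: the capable model only where it earns its keep.
tiered = HEAVY_PHASES * CAPABLE_COST + (PHASES - HEAVY_PHASES) * LIGHT_COST

saving = 1 - tiered / baseline
# Roughly 64% with these assumptions, in line with the reduction above.
```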
For Australian businesses processing sensitive data: keep inference and observability data onshore (or self-hosted, as LangSmith supports), maintain audit trails of every agent decision and handoff, and confirm your deployment meets Australian Privacy Principles obligations.
Multi-agent AI is not about having more agents; it is about having the right agents, doing the right things, communicating effectively.
Use multi-agent architectures when: tasks cross more than two domains, demand parallel processing, require deep specialisation, or exhaust a single agent's context window.
Stay with single agents when: the workflow is sequential, narrow in scope, and already handled accurately; Google's research above shows that adding agents to such tasks reduces performance.
The Carbonly 7-phase architecture works because each agent has clear responsibility, uses an appropriate model, and communicates through well-defined handoffs. The orchestration layer handles failures gracefully. The monitoring infrastructure provides visibility into every decision.
Start small. Measure ruthlessly. Add complexity only when the data demands it.
Ready to evaluate multi-agent architecture for your business? Book a technical consultation with our engineering team. We will assess your specific workflows and recommend whether multi-agent complexity is justified for your use case.
Sources: Research synthesised from Microsoft Azure AI Agent Design Patterns, Towards Data Science on Agent Handoffs, Galileo Multi-Agent Failure Recovery, LangChain LangSmith Observability, and IBM AI Agents 2025, with Australian enterprise implementation experience.