
# How to Debug a Multi-Agent Loop That Won't Terminate

You deploy your multi-agent system on Friday afternoon. Monday morning you find an agent conversation that's been running for 38 hours, has consumed $47 in tokens, and is still going. Two agents are passing the same task back and forth, each one deciding the other should handle it.

I've seen this exact scenario at least a dozen times. Infinite loops in multi-agent systems are one of the most common production failures, and they're surprisingly tricky to debug because the system is doing exactly what you told it to do. It's just that what you told it creates a cycle.

## Why Loops Happen

There are three root causes, and they account for about 95% of the infinite loops I've encountered.

**Circular handoff dependencies** are the most common. Agent A handles billing questions. Agent B handles refund questions. A customer asks about a billing error that resulted in an overcharge they want refunded. Agent A sees "overcharge" and routes to Agent B (it's a refund). Agent B sees "billing error" and routes to Agent A (it's billing). Neither agent has clear ownership of the intersection, so they bounce the task forever.

This happens because handoff rules are usually written for the clean cases. "If billing, go to A. If refund, go to B." The messy cases, where a request touches multiple domains, aren't covered. And LLM-based routing makes it worse because the model picks up on different keywords each time, creating non-deterministic ping-pong.

**Missing exit conditions** are the second cause. You design a workflow where Agent A drafts a response, Agent B reviews it, and if it's not good enough, Agent B sends it back to A for revision. But "not good enough" is subjective. Agent B always finds something to improve. Agent A always dutifully revises. They'll do this until your token budget runs out.

This is a design flaw, not a bug. Any iterative refinement loop without a hard stop condition is an infinite loop waiting to happen. "Good enough" needs to be a measurable threshold, not a vibe.

**Ambiguous handoff criteria** are the third cause, and the hardest to diagnose. Your routing conditions overlap. Both Agent A and Agent B think they should handle the request, but neither fully commits. Agent A starts processing, realizes it needs information from B's domain, hands off. Agent B starts, realizes A already did part of the work, hands back to A for completion. A sees an incomplete state and starts over.

The underlying problem is that your agents' responsibility boundaries are fuzzy. In a well-designed system, given any input, exactly one agent should be the clear owner.

## How to Diagnose the Loop

When you find a looping conversation, resist the urge to immediately fix the routing logic. First, understand the loop.

**Step 1: Pull the conversation transcript.** Look at the last 10-15 messages. Identify which agents are involved and what they're saying to each other. Usually the loop becomes obvious when you read the handoff messages in sequence.

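**Step 2: Identify the loop pattern.** Is it a straight ping-pong, A to B to A to B (circular dependency)? Do both agents start the work and then hand it back, A to B and back to A, then B to A and back to B (ambiguous ownership)? Or does the task keep returning to A through some intermediate step, such as a reviewer (refinement loop)?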

**Step 3: Check the routing decisions.** For each handoff, look at what information the routing agent had and why it made the decision it did. Often you'll find that the router is technically correct each time. The problem is that being correct locally doesn't prevent a cycle globally.

**Step 4: Find the trigger input.** What about this specific request caused the loop? Usually it's an input that sits at the boundary between two agents' domains. Save this input. It's your first test case for the fix.
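If your transcripts can be reduced to an ordered list of which agent held the task, steps 1 and 2 can be partly automated. The sketch below is one way to do that, not a feature of any particular framework: `find_repeating_cycle` is a hypothetical helper, and it assumes you have already extracted the agent names from the log.

```python
from typing import Optional


def find_repeating_cycle(agents: list[str], min_repeats: int = 2) -> Optional[list[str]]:
    """Return the shortest agent sequence that repeats at the end of `agents`.

    `agents` is the ordered list of agents that held the task,
    e.g. ["triage", "billing", "refund", "billing", "refund"].
    Returns None if the tail of the conversation does not repeat.
    """
    n = len(agents)
    for cycle_len in range(1, n // min_repeats + 1):
        tail = agents[n - cycle_len:]
        repeats = 1
        pos = n - 2 * cycle_len
        # Walk backwards one cycle at a time while the same block keeps appearing.
        while pos >= 0 and agents[pos:pos + cycle_len] == tail:
            repeats += 1
            pos -= cycle_len
        if repeats >= min_repeats:
            return tail
    return None


if __name__ == "__main__":
    path = ["triage", "billing", "refund", "billing", "refund"]
    print(find_repeating_cycle(path))
```

A two-agent result points at a circular dependency or ambiguous ownership. Reading the handoff messages, as in step 3, is still what tells those two apart.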

ClawVortex's visual trace makes steps 1-3 much faster. Instead of reading raw conversation logs, you see the conversation path drawn on your agent topology. Loops show up as cycles in the visual graph, and you can click each handoff to see the routing decision.

## How to Fix It

The fix depends on the root cause, but there are patterns that work across all three.

**Add a conversation-level turn limit.** This is your safety net, not your fix. Set a maximum number of agent-to-agent handoffs per conversation. 10 is a reasonable starting point. When the limit is hit, escalate to a human with the full conversation context. This doesn't solve the loop, but it stops a $47 runaway.
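A minimal sketch of that safety net, assuming handoffs pass through a single chokepoint you control. `Conversation`, `record_handoff`, and `HandoffLimitExceeded` are hypothetical names for illustration, not part of any orchestration library.

```python
from dataclasses import dataclass, field

MAX_HANDOFFS = 10  # the starting point suggested above; tune for your system


class HandoffLimitExceeded(Exception):
    """Raised when a conversation should be escalated to a human."""


@dataclass
class Conversation:
    conversation_id: str
    handoffs: list[tuple[str, str]] = field(default_factory=list)

    def record_handoff(self, source: str, target: str) -> None:
        # Refuse the handoff once the budget is spent. The caller is expected
        # to catch this and escalate with the full conversation context.
        if len(self.handoffs) >= MAX_HANDOFFS:
            raise HandoffLimitExceeded(
                f"{self.conversation_id}: {len(self.handoffs)} handoffs, escalating"
            )
        self.handoffs.append((source, target))
```

Raising an exception rather than silently dropping the handoff keeps the escalation path explicit: whatever catches it is where the human handover happens.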

**Implement a handoff history check.** Before any agent hands off to another agent, check if this exact handoff (same source agent, same target agent) has already happened in this conversation. If so, don't hand off. Instead, the current agent should either handle the task itself or escalate. This is a simple, effective circuit breaker.
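The check itself is only a few lines. This sketch assumes you keep one set of `(source, target)` pairs per conversation. `should_hand_off` is a hypothetical helper name.

```python
def should_hand_off(history: set[tuple[str, str]], source: str, target: str) -> bool:
    """Allow a handoff only if this exact source-to-target edge is new.

    `history` holds the edges already taken in this conversation and is
    updated in place when the handoff is allowed.
    """
    edge = (source, target)
    if edge in history:
        # Same handoff again: the caller should handle the task or escalate.
        return False
    history.add(edge)
    return True


if __name__ == "__main__":
    seen: set[tuple[str, str]] = set()
    print(should_hand_off(seen, "billing", "refund"))
    print(should_hand_off(seen, "refund", "billing"))
    print(should_hand_off(seen, "billing", "refund"))
```

With two agents this permits one full round trip (A to B, then B to A) and blocks the second A to B, so the loop breaks after three hops instead of running until the turn limit.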

**Fix the ownership boundaries.** For circular dependencies, you need a tiebreaker. Define which agent owns the intersection cases. "If a request involves both billing and refunds, the billing agent owns it and can invoke the refund agent as a sub-task, not a handoff." The distinction between a handoff (transferring ownership) and a sub-task (delegating while retaining ownership) prevents cycles.
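In code, the handoff versus sub-task distinction can be as plain as a function call that returns. The sketch below uses hypothetical stand-ins (`Task`, `billing_agent`, `refund_agent`) in place of real LLM-backed agents. The point is only that `owner` never changes while the refund work is done.

```python
from dataclasses import dataclass


@dataclass
class Task:
    text: str
    owner: str


def refund_agent(text: str) -> str:
    """Hypothetical refund specialist: returns a result, never takes ownership."""
    return f"refund approved for: {text}"


def billing_agent(task: Task) -> str:
    """Hypothetical billing agent that owns billing-plus-refund requests."""
    notes = ["billing error confirmed"]
    if "refund" in task.text.lower():
        # Sub-task: call the specialist and get control back.
        # task.owner is not reassigned, so there is nothing to bounce.
        notes.append(refund_agent(task.text))
    return "; ".join(notes)


if __name__ == "__main__":
    task = Task("I was overcharged and want a refund", owner="billing")
    print(billing_agent(task))
    print(task.owner)
```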

**Add hard stop conditions to refinement loops.** "Agent B reviews Agent A's draft. If it needs revision, send it back. Maximum 3 revisions. After 3, use the current version regardless." Simple and brutal, but it works. You can also add a quality score threshold: "If the score is above 7/10, accept. If it's from 4 to 7, revise. If it's below 4, escalate to a human."
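Here is one way to combine the revision cap and the score thresholds in a single loop. It is a sketch: `review` and `revise` are placeholders for your reviewer and drafting agents, and the returned status strings are made up for illustration.

```python
from typing import Callable

MAX_REVISIONS = 3   # after this many revisions, use the current version
ACCEPT_ABOVE = 7    # score above this: accept
ESCALATE_BELOW = 4  # score below this: escalate to a human


def refine(
    draft: str,
    review: Callable[[str], int],
    revise: Callable[[str], str],
) -> tuple[str, str]:
    """Run a bounded review/revise loop and return (status, final_draft)."""
    revisions = 0
    while True:
        score = review(draft)
        if score > ACCEPT_ABOVE:
            return "accepted", draft
        if score < ESCALATE_BELOW:
            return "escalated", draft
        if revisions >= MAX_REVISIONS:
            # Hard stop: the reviewer no longer gets a veto.
            return "accepted_at_limit", draft
        draft = revise(draft)
        revisions += 1
```

Even with a reviewer that always scores in the "revise" band, the loop calls `revise` at most `MAX_REVISIONS` times before returning.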

**Make routing deterministic for known edge cases.** If you've identified inputs that cause loops, add explicit routing rules for them. "If the message contains both billing and refund keywords, route to billing." Don't rely on the LLM to figure out the right destination for ambiguous cases. Hard-code the answer.
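A sketch of what hard-coding the answer can look like: a deterministic rule runs first, and the LLM router is only consulted when no rule matches. The keyword sets and the `llm_route` callable are assumptions for illustration. Your real trigger inputs should drive the rules.

```python
from typing import Callable

# Illustrative keyword lists; replace with terms from your own trigger inputs.
BILLING_KEYWORDS = {"billing", "invoice", "overcharge", "charged"}
REFUND_KEYWORDS = {"refund", "reimburse", "money back"}


def route(message: str, llm_route: Callable[[str], str]) -> str:
    """Return the name of the agent that should own `message`."""
    text = message.lower()
    has_billing = any(keyword in text for keyword in BILLING_KEYWORDS)
    has_refund = any(keyword in text for keyword in REFUND_KEYWORDS)
    if has_billing and has_refund:
        # Known boundary case: billing owns the intersection.
        return "billing"
    # Everything else still goes through the model-based router.
    return llm_route(message)
```

Substring matching is crude and will produce false positives. It is used here because a predictable wrong answer is easier to debug than a non-deterministic one.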

## Prevention Beats Debugging

After you've fixed a loop, put these three safeguards in place:

Test every new agent topology with adversarial inputs before deploying. Specifically, test inputs that sit on the boundary between agents' domains. ClawVortex's stress testing does this automatically, which is one of the reasons I think visual orchestration tools earn their keep.
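If you want a boundary test you can run in plain CI, a rough harness is enough to start. The sketch below is not how ClawVortex does it. It models each agent as a hypothetical function that returns the next agent's name, or `None` when it handles the task itself, and checks that a given input settles within a handoff budget.

```python
from typing import Callable, Optional

# An agent takes the message and returns the next agent, or None if it handles the task.
Agent = Callable[[str], Optional[str]]


def terminates(agents: dict[str, Agent], start: str, message: str,
               max_handoffs: int = 10) -> bool:
    """Return True if some agent handles `message` within `max_handoffs` handoffs."""
    current = start
    for _ in range(max_handoffs + 1):
        next_agent = agents[current](message)
        if next_agent is None:
            return True
        current = next_agent
    return False


def billing(message: str) -> Optional[str]:
    # Naive rule from the clean-case design: anything about refunds goes to refund.
    return "refund" if "refund" in message.lower() else None


def refund(message: str) -> Optional[str]:
    # Naive rule: anything about billing goes to billing.
    return "billing" if "billing" in message.lower() else None


if __name__ == "__main__":
    topology = {"billing": billing, "refund": refund}
    for text in ["I want a refund", "Billing error, I want a refund"]:
        print(text, "->", terminates(topology, "billing", text))
```

The second input is the boundary case from earlier in this post, and the naive rules fail it. Saved trigger inputs from past incidents belong in the same list.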

Monitor handoff counts per conversation in production. Set an alert if any conversation exceeds your expected maximum. A conversation that involves 8 handoffs when your median is 2 is worth investigating even if it eventually terminates.
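The alert logic can start very small. This sketch assumes you can export a mapping of conversation ID to handoff count from your logs. `conversations_to_investigate` and the default threshold of 6 are placeholders, not recommendations.

```python
from statistics import median


def conversations_to_investigate(
    handoff_counts: dict[str, int], expected_max: int = 6
) -> list[str]:
    """Return IDs of conversations whose handoff count exceeds `expected_max`."""
    return sorted(
        conversation_id
        for conversation_id, count in handoff_counts.items()
        if count > expected_max
    )


if __name__ == "__main__":
    counts = {"c1": 2, "c2": 1, "c3": 8, "c4": 2}
    print("median handoffs:", median(counts.values()))
    print("investigate:", conversations_to_investigate(counts))
```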

Document your handoff rules in your AGENTS.md with explicit notes about boundary cases. Future you will thank current you when the next loop shows up at 2 AM on a Saturday.
