How to Pinpoint the Culprit: A Step-by-Step Guide to Automated Failure Attribution in LLM Multi-Agent Systems
Introduction
When an LLM multi-agent system fails, developers often face a daunting task: sifting through hundreds of lines of interaction logs to determine which agent caused the failure and at what point the mistake occurred. This manual debugging process is time-consuming, error-prone, and scales poorly as systems grow in complexity. To address this, researchers from Penn State University, Duke University, Google DeepMind, and other institutions introduced the problem of automated failure attribution and created the first dedicated benchmark dataset, Who&When, accepted as a Spotlight presentation at ICML 2025. This guide translates their research into a practical, step-by-step workflow you can follow to implement automated failure attribution in your own multi-agent systems.

What You Need
Before you begin, ensure you have the following:
- Interaction Logs from your LLM multi-agent system (e.g., agent messages, task assignments, outputs, failure signals).
- Access to the Who&When Dataset (available on Hugging Face) – includes labeled failures with ground-truth agent and step IDs.
- Python Environment (3.8+) with libraries: `pandas` and `transformers`, plus an LLM client such as `openai` (Python's built-in `json` module handles log parsing).
- LLM API Key for the attribution model (e.g., GPT-4, LLaMA).
- Basic Knowledge of multi-agent architectures and prompt engineering.
Step-by-Step Instructions
Step 1: Understand the Failure Attribution Task
Automated failure attribution asks two questions per failed task: “Which agent?” and “When?” (i.e., at which step in the interaction). The Who&When dataset formalizes this as a benchmark. Familiarize yourself with the dataset’s structure: each failure case includes a task description, a full interaction log, and ground-truth labels for the responsible agent and the step index. This will guide your approach.
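To make the task concrete, here is what a single failure case might look like. The field names below are illustrative stand-ins, not the dataset's exact schema; check the Who&When dataset card on Hugging Face for the real field names.

```python
# Illustrative shape of one failure case (field names are assumptions,
# not the exact Who&When schema -- consult the dataset card).
failure_case = {
    "task": "Find the population of the capital of France in 2020.",
    "messages": [
        {"agent": "Planner", "step": 0, "text": "Delegate the lookup to WebSurfer."},
        {"agent": "WebSurfer", "step": 1, "text": "The capital of France is Lyon."},  # the decisive mistake
        {"agent": "Writer", "step": 2, "text": "Lyon's 2020 population was ~516,000."},
    ],
    "ground_truth": {"agent": "WebSurfer", "step": 1},
}

# Attribution asks: which agent, and at which step, made the decisive error?
predicted = {"agent": "WebSurfer", "step": 1}
correct = predicted == failure_case["ground_truth"]
```

Note that the later `Writer` message is also wrong, but only as a downstream consequence: attribution targets the first decisive mistake, not every incorrect utterance.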
Step 2: Collect and Preprocess Your System’s Logs
If you are working with your own system, extract logs in a consistent format. For each task attempt that ended in failure (e.g., incorrect final output, loop, timeout), assemble:
- Task prompt (what the system was asked to do).
- Sequence of agent utterances – each message should include the agent name, timestamp, and content.
- Failure signal – how the system knows the task failed (e.g., manual flag, error metric).
Normalize logs into JSON lines with fields: `task`, `messages` (list of dicts with `agent`, `step`, `text`), and `failure_type`. If using Who&When, download the dataset and load it directly.
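The normalization step can be sketched as follows, assuming your raw logs are already parsed into Python dicts with `agent` and `content` keys (those input field names are assumptions; adapt them to your logging format):

```python
import json

def normalize_case(task, raw_messages, failure_type):
    """Convert one failed task attempt into the JSON-lines record
    described above: task, messages (agent/step/text), failure_type."""
    messages = [
        {"agent": m["agent"], "step": i, "text": m["content"]}
        for i, m in enumerate(raw_messages)
    ]
    return {"task": task, "messages": messages, "failure_type": failure_type}

def write_jsonl(cases, path):
    # One JSON object per line, so cases can be streamed later.
    with open(path, "w") as f:
        for case in cases:
            f.write(json.dumps(case) + "\n")
```

Usage on a two-message failure:

```python
record = normalize_case(
    "Summarize the report",
    [{"agent": "Planner", "content": "Assign to Writer"},
     {"agent": "Writer", "content": "(empty output)"}],
    "incorrect_final_output",
)
```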
Step 3: Choose an Attribution Method
The researchers explored several methods. For simplicity, start with Direct Prompting – feed the entire log to an LLM and ask it to identify the culpable agent and step. Example prompt:
“Given this multi-agent conversation that resulted in a failure, which agent made the critical mistake, and at which step (0-indexed)? Output in JSON: {"agent": "...", "step": integer}.”
More advanced options include contrastive prompting (compare with successful runs) and agent-level vs. step-level decomposition. The Who&When paper provides baselines – you can replicate them using their open-source code.
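A prompt builder for the direct-prompting baseline might look like the sketch below; the wording is an illustration based on the example prompt above, not the paper's exact prompt:

```python
def build_attribution_prompt(case):
    """Render a failed run (the JSON-lines record from Step 2)
    as a single direct-prompting query."""
    transcript = "\n".join(
        f'[step {m["step"]}] {m["agent"]}: {m["text"]}'
        for m in case["messages"]
    )
    return (
        f"Task: {case['task']}\n\n"
        f"Conversation:\n{transcript}\n\n"
        "Given this multi-agent conversation that resulted in a failure, "
        "which agent made the critical mistake, and at which step (0-indexed)? "
        'Output in JSON: {"agent": "...", "step": integer}.'
    )
```

Labeling each line with its step index in the transcript matters: it lets the model answer "when" by pointing at an index you can score directly, rather than paraphrasing the offending message.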
Step 4: Run Attribution on Your Logs
Implement a script that loops over each failure case, builds the prompt, calls the LLM API, and parses the response. For the Who&When dataset, compare your predictions against the ground-truth labels to compute accuracy (agent attribution, step attribution, and joint accuracy). For your own logs, you may need to manually verify a sample to establish a baseline.
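The loop can be sketched as below. `call_llm` is a placeholder for your API client, `build_prompt` is a minimal stand-in for the Step 3 prompt, and the three accuracy definitions follow this step:

```python
import json

def build_prompt(case):
    # Minimal stand-in for the direct prompt sketched in Step 3.
    transcript = "\n".join(f'[{m["step"]}] {m["agent"]}: {m["text"]}'
                           for m in case["messages"])
    return (f'Task: {case["task"]}\n{transcript}\n'
            'Which agent made the critical mistake, and at which step '
            '(0-indexed)? Output JSON: {"agent": "...", "step": integer}')

def evaluate(cases, call_llm):
    """Score agent, step, and joint attribution accuracy.
    `call_llm(prompt) -> str` is a placeholder for your LLM client;
    each case needs a `ground_truth` dict with `agent` and `step`."""
    agent_hits = step_hits = joint_hits = 0
    for case in cases:
        try:
            pred = json.loads(call_llm(build_prompt(case)))
        except (json.JSONDecodeError, TypeError):
            continue  # a malformed response counts as a miss
        gt = case["ground_truth"]
        agent_ok = pred.get("agent") == gt["agent"]
        step_ok = pred.get("step") == gt["step"]
        agent_hits += agent_ok
        step_hits += step_ok
        joint_hits += agent_ok and step_ok
    n = len(cases)
    return {"agent_acc": agent_hits / n,
            "step_acc": step_hits / n,
            "joint_acc": joint_hits / n}
```

Joint accuracy is the strictest metric (both the agent and the step must match), so expect it to be the lowest of the three.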
Step 5: Analyze and Iterate
When attribution fails, examine misclassifications. Common pitfalls include:
- Insufficient context – the log may be too long; try truncating or summarizing steps.
- Blaming the wrong agent – the first mistake may not be the most obvious one; use chain-of-thought prompting to reason step by step.
- Temporal ambiguity – multiple agents contribute close in time; consider step-level attribution first.
Adjust your prompts or method accordingly. The Who&When benchmark is designed to help you compare different approaches systematically.
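For the long-context pitfall, one simple mitigation is to keep only the last few steps of the log while preserving the original step indices, so the model's answer still refers to positions in the full log. The window size below is a tunable assumption, not a value from the paper:

```python
def truncate_log(case, keep_last=20):
    """Keep only the final `keep_last` messages, prepending a marker
    so the model knows earlier steps were dropped. Original step
    indices are preserved on the kept messages."""
    messages = case["messages"]
    if len(messages) <= keep_last:
        return case
    omitted = len(messages) - keep_last
    note = {"agent": "system", "step": -1,
            "text": f"[{omitted} earlier steps omitted]"}
    return {**case, "messages": [note] + messages[-keep_last:]}
```

If the decisive mistake tends to occur early in your failures, summarizing the dropped prefix (rather than discarding it) is the safer variant of this trick.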
Tips
The following insights from the original research can improve your success rate:
- Manual log archaeology is still useful for small-scale validation but does not scale. Automated attribution is essential for complex systems.
- Use the Who&When dataset as a testbed before running on your own logs – it provides ground truth and allows you to tune prompts.
- Consider the trade-off between agent-level and step-level attribution – some methods perform better at one than the other. The paper proposes combined approaches.
- Leverage open-source resources – the code and dataset are fully available; modify the evaluation scripts to suit your needs.
- Remember that failures can stem from multiple agents – the attribution task assumes a single primary culprit; for complex failures, you might need to adapt to multi-label attribution.