How to Pinpoint the Culprit: A Step-by-Step Guide to Automated Failure Attribution in LLM Multi-Agent Systems
Introduction
When an LLM multi-agent system fails, developers often face a daunting task: sifting through hundreds of lines of interaction logs to determine which agent caused the failure and at what point the mistake occurred. This manual debugging process is time-consuming, error-prone, and scales poorly as systems grow in complexity. To address this, researchers from Penn State University, Duke University, Google DeepMind, and other institutions introduced the problem of automated failure attribution and created the first dedicated benchmark dataset, Who&When, accepted as a Spotlight presentation at ICML 2025. This guide translates their research into a practical, step-by-step workflow you can follow to implement automated failure attribution in your own multi-agent systems.

What You Need
Before you begin, ensure you have the following:
- Interaction Logs from your LLM multi-agent system (e.g., agent messages, task assignments, outputs, failure signals).
- Access to the Who&When Dataset (available on Hugging Face) – includes labeled failures with ground-truth agent and step IDs.
- Python Environment (3.8+) with libraries: `pandas` and `transformers`, plus an LLM client such as `openai` (Python's built-in `json` module handles log parsing).
- LLM API Key for the attribution model (e.g., GPT-4, LLaMA).
- Basic Knowledge of multi-agent architectures and prompt engineering.
Step-by-Step Instructions
Step 1: Understand the Failure Attribution Task
Automated failure attribution asks two questions per failed task: “Which agent?” and “When?” (i.e., at which step in the interaction). The Who&When dataset formalizes this as a benchmark. Familiarize yourself with the dataset’s structure: each failure case includes a task description, a full interaction log, and ground-truth labels for the responsible agent and the step index. This will guide your approach.
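To make the task concrete, here is what a single failure case might look like. The field names below are illustrative stand-ins, not the dataset's exact schema; check the Who&When dataset card on Hugging Face for the real field names.

```python
# Illustrative shape of one failure case (field names are assumptions,
# not the exact Who&When schema -- consult the dataset card).
failure_case = {
    "task": "Find the population of the capital of France in 2020.",
    "messages": [
        {"agent": "Planner", "step": 0, "text": "Delegate the lookup to WebSurfer."},
        {"agent": "WebSurfer", "step": 1, "text": "The capital of France is Lyon."},  # the decisive mistake
        {"agent": "Writer", "step": 2, "text": "Lyon's 2020 population was ~516,000."},
    ],
    "ground_truth": {"agent": "WebSurfer", "step": 1},
}

# Attribution asks: which agent, and at which step, made the decisive error?
predicted = {"agent": "WebSurfer", "step": 1}
correct = predicted == failure_case["ground_truth"]
```

Note that the later `Writer` message is also wrong, but only as a downstream consequence: attribution targets the first decisive mistake, not every incorrect utterance.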
Step 2: Collect and Preprocess Your System’s Logs
If you are working with your own system, extract logs in a consistent format. For each task attempt that ended in failure (e.g., incorrect final output, loop, timeout), assemble:
- Task prompt (what the system was asked to do).
- Sequence of agent utterances – each message should include the agent name, timestamp, and content.
- Failure signal – how the system knows the task failed (e.g., manual flag, error metric).
Normalize logs into JSON lines with fields: `task`, `messages` (list of dicts with `agent`, `step`, `text`), and `failure_type`. If using Who&When, download the dataset and load it directly.
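The normalization step can be sketched as follows, assuming your raw logs are already parsed into Python dicts with `agent` and `content` keys (those input field names are assumptions; adapt them to your logging format):

```python
import json

def normalize_case(task, raw_messages, failure_type):
    """Convert one failed task attempt into the JSON-lines record
    described above: task, messages (agent/step/text), failure_type."""
    messages = [
        {"agent": m["agent"], "step": i, "text": m["content"]}
        for i, m in enumerate(raw_messages)
    ]
    return {"task": task, "messages": messages, "failure_type": failure_type}

def write_jsonl(cases, path):
    # One JSON object per line, so cases can be streamed later.
    with open(path, "w") as f:
        for case in cases:
            f.write(json.dumps(case) + "\n")
```

Usage on a two-message failure:

```python
record = normalize_case(
    "Summarize the report",
    [{"agent": "Planner", "content": "Assign to Writer"},
     {"agent": "Writer", "content": "(empty output)"}],
    "incorrect_final_output",
)
```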
Step 3: Choose an Attribution Method
The researchers explored several methods. For simplicity, start with Direct Prompting – feed the entire log to an LLM and ask it to identify the culpable agent and step. Example prompt:
“Given this multi-agent conversation that resulted in a failure, which agent made the critical mistake, and at which step (0-indexed)? Output in JSON: {"agent": "...", "step": integer}.”
More advanced options include contrastive prompting (compare with successful runs) and agent-level vs. step-level decomposition. The Who&When paper provides baselines – you can replicate them using their open-source code.
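A prompt builder for the direct-prompting baseline might look like the sketch below; the wording is an illustration based on the example prompt above, not the paper's exact prompt:

```python
def build_attribution_prompt(case):
    """Render a failed run (the JSON-lines record from Step 2)
    as a single direct-prompting query."""
    transcript = "\n".join(
        f'[step {m["step"]}] {m["agent"]}: {m["text"]}'
        for m in case["messages"]
    )
    return (
        f"Task: {case['task']}\n\n"
        f"Conversation:\n{transcript}\n\n"
        "Given this multi-agent conversation that resulted in a failure, "
        "which agent made the critical mistake, and at which step (0-indexed)? "
        'Output in JSON: {"agent": "...", "step": integer}.'
    )
```

Labeling each line with its step index in the transcript matters: it lets the model answer "when" by pointing at an index you can score directly, rather than paraphrasing the offending message.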
Step 4: Run Attribution on Your Logs
Implement a script that loops over each failure case, builds the prompt, calls the LLM API, and parses the response. For the Who&When dataset, compare your predictions against the ground-truth labels to compute accuracy (agent attribution, step attribution, and joint accuracy). For your own logs, you may need to manually verify a sample to establish a baseline.
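The loop can be sketched as below. `call_llm` is a placeholder for your API client, `build_prompt` is a minimal stand-in for the Step 3 prompt, and the three accuracy definitions follow this step:

```python
import json

def build_prompt(case):
    # Minimal stand-in for the direct prompt sketched in Step 3.
    transcript = "\n".join(f'[{m["step"]}] {m["agent"]}: {m["text"]}'
                           for m in case["messages"])
    return (f'Task: {case["task"]}\n{transcript}\n'
            'Which agent made the critical mistake, and at which step '
            '(0-indexed)? Output JSON: {"agent": "...", "step": integer}')

def evaluate(cases, call_llm):
    """Score agent, step, and joint attribution accuracy.
    `call_llm(prompt) -> str` is a placeholder for your LLM client;
    each case needs a `ground_truth` dict with `agent` and `step`."""
    agent_hits = step_hits = joint_hits = 0
    for case in cases:
        try:
            pred = json.loads(call_llm(build_prompt(case)))
        except (json.JSONDecodeError, TypeError):
            continue  # a malformed response counts as a miss
        gt = case["ground_truth"]
        agent_ok = pred.get("agent") == gt["agent"]
        step_ok = pred.get("step") == gt["step"]
        agent_hits += agent_ok
        step_hits += step_ok
        joint_hits += agent_ok and step_ok
    n = len(cases)
    return {"agent_acc": agent_hits / n,
            "step_acc": step_hits / n,
            "joint_acc": joint_hits / n}
```

Joint accuracy is the strictest metric (both the agent and the step must match), so expect it to be the lowest of the three.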
Step 5: Analyze and Iterate
When attribution fails, examine misclassifications. Common pitfalls include:
- Insufficient context – the log may be too long; try truncating or summarizing steps.
- Blaming the wrong agent – the first mistake may not be the most obvious one; use chain-of-thought prompting to reason step by step.
- Temporal ambiguity – multiple agents contribute close in time; consider step-level attribution first.
Adjust your prompts or method accordingly. The Who&When benchmark is designed to help you compare different approaches systematically.
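For the long-context pitfall, one simple mitigation is to keep only the last few steps of the log while preserving the original step indices, so the model's answer still refers to positions in the full log. The window size below is a tunable assumption, not a value from the paper:

```python
def truncate_log(case, keep_last=20):
    """Keep only the final `keep_last` messages, prepending a marker
    so the model knows earlier steps were dropped. Original step
    indices are preserved on the kept messages."""
    messages = case["messages"]
    if len(messages) <= keep_last:
        return case
    omitted = len(messages) - keep_last
    note = {"agent": "system", "step": -1,
            "text": f"[{omitted} earlier steps omitted]"}
    return {**case, "messages": [note] + messages[-keep_last:]}
```

If the decisive mistake tends to occur early in your failures, summarizing the dropped prefix (rather than discarding it) is the safer variant of this trick.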
Tips
The following insights from the original research can improve your success rate:
- Manual log archaeology is still useful for small-scale validation but does not scale. Automated attribution is essential for complex systems.
- Use the Who&When dataset as a testbed before running on your own logs – it provides ground truth and allows you to tune prompts.
- Consider the trade-off between agent-level and step-level attribution – some methods perform better at one than the other. The paper proposes combined approaches.
- Leverage open-source resources – the code and dataset are fully available; modify the evaluation scripts to suit your needs.
- Remember that failures can stem from multiple agents – the attribution task assumes a single primary culprit; for complex failures, you might need to adapt to multi-label attribution.