Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

Asaf Yehudai*, Lilach Eden*, Michal Shmueli-Scheuer
IBM Research | ACL 2026

TL;DR

The problem

Agent observability platforms capture execution traces but lack meaningful evaluation. Developers must manually inspect large numbers of traces to identify systemic failures. Research-driven error taxonomies are static and require extensive human annotation.

What Agentic CLEAR does

An open-source Python package that automatically evaluates agent traces at three levels of granularity (system, node, trace), dynamically surfaces recurring failure patterns without predefined taxonomies, and presents them in an interactive dashboard. Built on top of CLEAR's LLM-as-a-Judge methodology.

Getting Started

pip install clear-eval

# Run on sample traces (3 traces, ~2 minutes)
run-clear-agentic-eval \
    --data-dir src/clear_eval/sample_data/agentic/research_agent_traces/mlflow \
    --results-dir my_results \
    --from-raw-traces true \
    --agent-framework langgraph \
    --observability-framework mlflow \
    --max-files 3 \
    --eval-model-name gpt-4o \
    --provider openai

# Launch the dashboard
run-clear-agentic-dashboard

Supported Frameworks

Agent Framework	Observability Platform	Status
LangGraph	MLflow	Supported
LangGraph	Langfuse	Supported
CrewAI	Langfuse	Supported
Custom	Any (via CSV)	Supported

Pipeline

Trace Preprocessing

OpenTelemetry-compatible traces from MLflow, Langfuse, or CSV are converted to a unified representation.
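A minimal sketch of what a unified representation could look like, assuming span-like dictionaries with `trace_id`, `name`, `start_time`, `input`, and `output` keys. The `Step` class, the `to_steps` function, and every field name are hypothetical and are not taken from the clear-eval package.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Step:
    """One agent step in a hypothetical unified trace format."""
    trace_id: str
    node: str    # name of the agent/node that produced the step
    index: int   # position of the step inside its trace
    input: str
    output: str


def to_steps(spans: list[dict[str, Any]]) -> list[Step]:
    """Group span-like dicts by trace and order each group by start time."""
    by_trace: dict[str, list[dict[str, Any]]] = {}
    for span in spans:
        by_trace.setdefault(span["trace_id"], []).append(span)

    steps: list[Step] = []
    for trace_id, trace_spans in by_trace.items():
        ordered = sorted(trace_spans, key=lambda s: s["start_time"])
        for index, span in enumerate(ordered):
            steps.append(Step(trace_id, span["name"], index,
                              str(span.get("input", "")),
                              str(span.get("output", ""))))
    return steps


# Assumed input shape; real MLflow or Langfuse exports carry more fields.
spans = [
    {"trace_id": "t1", "name": "executor", "start_time": 2, "input": "plan", "output": "done"},
    {"trace_id": "t1", "name": "planner", "start_time": 1, "input": "task", "output": "plan"},
]
steps = to_steps(spans)
print([(s.node, s.index) for s in steps])  # [('planner', 0), ('executor', 1)]
```

Once every source is mapped to one record type, the later evaluation stages do not need to know which observability platform produced the trace.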

Multi-Level Evaluation

An LLM judge evaluates each trace in three modes: step-wise, trace-level, and rubric-based.

CLEAR Aggregation

Cluster feedback into system-wide and node-specific recurring issues.
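The two aggregation levels can be illustrated with a small counting sketch. It assumes the clustering has already been done, so each judged step carries a set of issue labels. The record layout and the labels are invented for illustration, and the clustering of free-text judge feedback itself is not shown.

```python
from collections import Counter, defaultdict

# Hypothetical judged steps: (trace_id, node, issue labels assigned to the step).
judged_steps = [
    ("t1", "planner", {"missing parameters"}),
    ("t1", "executor", {"no fallback"}),
    ("t2", "planner", {"missing parameters", "wrong app"}),
    ("t2", "executor", set()),
]

system_counts = Counter()            # issue -> count across the whole system
node_counts = defaultdict(Counter)   # node -> issue -> count

for _trace_id, node, issues in judged_steps:
    system_counts.update(issues)
    node_counts[node].update(issues)

print(system_counts.most_common(1))  # [('missing parameters', 2)]
```

Keeping both tallies separates issues that recur everywhere from issues concentrated in a single node.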

Dashboard

System, Node, and Trace views with filtering, path analysis, and score prediction.

Dashboard

Workflow View

Interactive graph of agents and transitions with call counts
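As a rough illustration of the data behind such a graph, the sketch below counts node calls and node-to-node transitions. It assumes each trajectory is available as an ordered list of node names; the node names are placeholders.

```python
from collections import Counter

# Hypothetical trajectories: the ordered node names visited in each trace.
trajectories = [
    ["planner", "executor", "planner", "executor"],
    ["planner", "executor"],
]

# Number of times each node was called.
call_counts = Counter(node for path in trajectories for node in path)

# Number of times each directed transition (src -> dst) occurred.
edge_counts = Counter(
    (src, dst) for path in trajectories for src, dst in zip(path, path[1:])
)

print(call_counts["planner"])                # 3
print(edge_counts[("planner", "executor")])  # 3
print(edge_counts[("executor", "planner")])  # 1
```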

Node Analysis

Per-agent CLEAR issues, score distributions, and drill-down

Trajectory Explorer

Browse trajectories filtered by length, agent, or score

Path Analysis

Common path patterns, success vs. failure analysis
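One simple way to compare paths by outcome is to count how often each path occurs and what share of its runs succeeded. The sketch assumes each run is reduced to a tuple of node names plus a success flag; the data is invented.

```python
from collections import Counter

# Hypothetical runs: (path as a tuple of node names, whether the run succeeded).
runs = [
    (("planner", "executor"), True),
    (("planner", "executor"), True),
    (("planner", "executor", "planner", "executor"), False),
    (("planner", "executor"), False),
]

totals = Counter(path for path, _ok in runs)
successes = Counter(path for path, ok in runs if ok)

# Success rate per distinct path; a Counter returns 0 for paths that never succeeded.
success_rate = {path: successes[path] / count for path, count in totals.items()}

for path, count in totals.most_common():
    print(" -> ".join(path), count, round(success_rate[path], 2))
```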

Temporal Analysis

Agent position and score progression across steps
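Score progression can be approximated by averaging judge scores at each step position. This sketch assumes per-step scores in [0, 1], which is an assumption about the scale rather than a documented property of the tool.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-step judge scores, one list per trajectory.
step_scores = [
    [0.9, 0.8, 0.4],
    [1.0, 0.6],
]

by_position = defaultdict(list)
for scores in step_scores:
    for position, score in enumerate(scores):
        by_position[position].append(score)

# Mean score at step 0, step 1, ...; later positions average over fewer trajectories.
progression = [mean(by_position[p]) for p in sorted(by_position)]
print([round(value, 2) for value in progression])
```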

Score Prediction

ROC curves for predicting trajectory success

Results

Experimental Setup

Benchmark	Agent	Model	# Traces	Source
AppWorld	CUGA	GPT-4o	417	Leaderboard
GAIA	Generalist Agent	Claude 4.5 Sonnet	165	HAL
GAIA	Generalist Agent	GPT-4.1	165	HAL
GAIA	HF DeepResearch	Claude 4.5 Sonnet	165	HAL
GAIA	HF DeepResearch	OpenAI o3	117	TRAIL
SWE-bench Verified	Generalist Agent	Claude 4.5 Sonnet	50	HAL
TAU-bench	Generalist Agent	Claude 3.7 Sonnet	50	HAL

Discovered Issues

Agentic CLEAR surfaced 195 unique recurring issues across all configurations. Three comparison views follow.

CUGA agent on AppWorld (GPT-4o backbone, GPT-5 judge). Issues sorted by frequency.

System-Level Issues	TaskDecompositionAgent (Node-Level)
Execution flow management flaws: inefficient execution and incomplete coverage	Assumes unsupported app capabilities; violates strict app constraints
Validation and preconditions gaps: skips critical pre/post-action checks	Workflow coherence errors: reasoning, tasks, and ordering don't align
Blockage handling failures: no fallbacks, declares failure prematurely	Fails to return the user's intent verbatim for single-app tasks
Incomplete execution: stops short of finishing core task (checkout, send, save)	App boundary and handoff mistakes: wrong app assignment
Entity resolution weaknesses: brittle matching and poor disambiguation	Insufficient disambiguation criteria for selecting among items
API/framework contract noncompliance: schema/role/output violations	Poor handling of absent data and capability limits; lacks fallbacks
Intent and channel selection errors: misinterprets request, wrong app/medium	Missing required parameters or constraints (time, recipients, labels)
Shopping cart and selection integrity issues: contaminated carts, mishandled items	Insufficient handling of edge cases and input/format variations
Edge-case robustness deficiencies: time zones, date boundaries, unit normalization	Missing finalization: fails to perform the final action or deliver answer
Data integrity in inputs: hard-coded or unverified identifiers instead of retrieved data	Adds unsupported assumptions or extra details not provided by the user

Same agent (HAL Generalist), same model (Claude 4.5 Sonnet), GPT-5 judge. Domain-specific issues emerge without benchmark-specific prompting.

GAIA (Research Tasks)	SWE-bench Verified (Code Tasks)
Lack of cross-verification across independent sources	Broken patch output: malformed diffs and missing hunks
Sourcing failures: unreliable or unverifiable references	Monkey-patching instead of maintainable overrides
Premature conclusion without exhausting search avenues	Missing regression tests for the fix
Failure to adhere to output formatting specifications	Environment prerequisites not checked before applying fix
Contradictory or self-conflicting statements	Incomplete understanding of codebase dependencies
Redundant tool calls and inefficient search patterns	Fails to isolate the minimal reproducible case
Incomplete data processing: partial extraction from documents	Patch applies to wrong file or function scope
Unreliable numerical or date parsing from web sources	Hard-coded values instead of dynamic resolution from context
Fails to synthesize findings into coherent final answer	Missing import statements or dependency declarations
Over-reliance on a single information source	Does not verify fix compiles or passes existing tests

HAL Generalist Agent on GAIA, GPT-5 judge. Shared issues dominate, but each model shows unique tendencies.

GPT-4.1 Backbone	Claude 4.5 Sonnet Backbone
Source verification gaps (shared)	Source verification gaps (shared)
Tool misuse and redundant calls (shared)	Tool misuse and redundant calls (shared)
Noncompliance with required output formats (shared)	Failure to adhere to output formatting specifications (shared)
Incomplete task execution: stops before delivering final answer (shared)	Incomplete task execution: abandons subtasks mid-stream (shared)
Insufficient error recovery after tool failures (shared)	Insufficient error recovery after tool failures (shared)
Unique: Prematurely giving up after errors instead of retrying or pivoting	Unique: Contradictory or self-conflicting statements; inconsistent interpretation
Unique: Over-reliance on single source without triangulation	Unique: Excessive tool calls before synthesizing available information
Unique: Fails to decompose complex queries into sub-questions	Unique: Hallucinates tool capabilities that don't exist
Unique: Rigid sequential execution; does not parallelize independent searches	Unique: Verbose reasoning chains that lose track of the original goal
Unique: Misinterprets ambiguous instructions as impossible tasks	Unique: Overconfident conclusions from insufficient evidence

Alignment with Human Taxonomies (TRAIL)

Table 1. Error category prediction against the TRAIL taxonomy. We validate our automatically generated issues against TRAIL's human-annotated error taxonomy (12 reasoning/planning categories). GPT-5 with full+partial matching achieves the best coverage, outperforming frequency baselines on macro F1 while covering all 12 categories without predefined definitions.
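For readers unfamiliar with the metric, the sketch below computes standard multi-label macro F1. It does not reproduce the full+partial matching procedure, which is not specified on this page, and the category names are placeholders rather than TRAIL's actual categories.

```python
def macro_f1(gold: list[set[str]], pred: list[set[str]], categories: list[str]) -> float:
    """Average per-category F1, so rare categories weigh as much as common ones."""
    f1_scores = []
    for cat in categories:
        tp = sum(1 for g, p in zip(gold, pred) if cat in g and cat in p)
        fp = sum(1 for g, p in zip(gold, pred) if cat not in g and cat in p)
        fn = sum(1 for g, p in zip(gold, pred) if cat in g and cat not in p)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the category never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


categories = ["planning", "tool use", "formatting"]  # placeholder names
gold = [{"planning"}, {"tool use"}, {"planning", "formatting"}]
pred = [{"planning"}, {"planning"}, {"formatting"}]

print(macro_f1(gold, pred, categories))  # 0.5
```

Here the per-category F1 values are 0.5, 0.0, and 1.0, so a single missed category pulls the macro average down regardless of how often it occurs.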

Score Prediction (AUC)

Table 2. AUC heatmap for predicting trajectory success using three complementary methods: step-wise (average step score), trace-level (holistic judgment), and rubric-based (proportion of criteria met). Darker green indicates higher AUC. Trace-level evaluation is generally strongest, peaking at 0.890 on AppWorld. The best method varies by benchmark and agent, suggesting the three modes provide complementary signals.
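The step-wise and rubric-based scores defined above, and the AUC used to compare them, can be sketched as follows. The trace records are invented, the function names are hypothetical, and the trace-level score is omitted because it comes directly from the judge's holistic verdict. AUC is computed here as the probability that a randomly chosen successful trajectory scores higher than a randomly chosen failed one, with ties counted as one half.

```python
def step_wise_score(step_scores: list[float]) -> float:
    """Average of the per-step judge scores."""
    return sum(step_scores) / len(step_scores)


def rubric_score(criteria_met: list[bool]) -> float:
    """Proportion of rubric criteria that were met."""
    return sum(criteria_met) / len(criteria_met)


def auc(scores: list[float], labels: list[bool]) -> float:
    """Rank-based AUC over all (success, failure) pairs."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one success and one failure")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical judged traces.
traces = [
    {"steps": [0.9, 0.8], "rubric": [True, True, False], "success": True},
    {"steps": [0.4, 0.6], "rubric": [True, False, False], "success": False},
    {"steps": [0.7, 0.7], "rubric": [True, True, True], "success": True},
    {"steps": [0.8, 0.7], "rubric": [False, False, False], "success": False},
]
labels = [t["success"] for t in traces]
step_auc = auc([step_wise_score(t["steps"]) for t in traces], labels)
rubric_auc = auc([rubric_score(t["rubric"]) for t in traces], labels)
print(step_auc, rubric_auc)  # 0.75 1.0
```

In this toy data the rubric score separates successes from failures perfectly while the step-wise average misranks one pair, which mirrors the point that different methods can be strongest on different data.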

CLEAR: Error Analysis via LLM-as-a-Judge Made Easy

TL;DR

The problem

LLM evaluation typically produces only scalar scores. Developers manually inspect outputs to understand why their system fails, what recurring patterns exist, and which aspects need improvement.

What CLEAR does

CLEAR automatically discovers and quantifies recurring issues in your LLM outputs using an LLM-as-a-Judge approach combined with Key Point Analysis. Provide a CSV with prompts and responses and get back structured, actionable feedback, along with an interactive dashboard for exploration.

Getting Started

pip install clear-eval

# Run on sample GSM8K data
run-clear-eval-analysis --provider openai --eval-model-name gpt-4o

# Launch the Streamlit dashboard
run-clear-eval-dashboard
# Upload the ZIP from results/gsm8k/sample_output/

Pipeline

Generate Responses

A dataset D and a target system s produce responses R. Alternatively, provide pre-existing outputs.

LLM Judge (J)

Scores each response with a textual critique and a numeric quality score.

Key Point Analysis (K)

Clusters critiques into recurring issues and quantifies their frequency.

Dashboard

Explore issues, filter by type or score, compare subsets, drill down.

Dashboard

CLEAR Dashboard. The Streamlit-based interface provides: an **Issues View** showing error distribution and frequency; a **Filtering Mechanism** to isolate examples by issue type or score range; a **Comparison View** contrasting filtered vs. full-dataset distributions; and a **Model Behavior** panel for instance-level inspection of judge feedback.
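The filter-then-compare idea behind the Comparison View can be sketched in a few lines. The record layout, the score scale, and the 0.5 threshold are all assumptions made for illustration; this is not the dashboard's code.

```python
from collections import Counter

# Hypothetical judged instances: a judge score in [0, 1] and the issues assigned.
instances = [
    {"score": 0.2, "issues": {"calculation error"}},
    {"score": 0.3, "issues": {"calculation error", "missing steps"}},
    {"score": 0.9, "issues": set()},
    {"score": 0.8, "issues": {"missing steps"}},
]


def issue_distribution(rows):
    """Share of rows tagged with each issue."""
    counts = Counter(issue for row in rows for issue in row["issues"])
    return {issue: count / len(rows) for issue, count in counts.items()}


low_scoring = [row for row in instances if row["score"] < 0.5]  # filter by score range
full = issue_distribution(instances)
subset = issue_distribution(low_scoring)

for issue in sorted(full):
    print(issue, full[issue], subset.get(issue, 0.0))
```

An issue whose share is much higher in the filtered subset than in the full dataset, such as "calculation error" here, is a likely driver of the low scores.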

Example: Discovered Issues

CLEAR adapts issue discovery to the task, dataset, and model. Three example views follow.

Mixtral 8x7B on GSM8K, task-specific mode, GPT-4o judge.

Issue	Frequency
Mathematical errors in calculations, including rounding and final steps	13.2%
Incorrect understanding of problem statements leading to flawed reasoning	11.8%
Failure to fully consider or correctly interpret all given information	5.8%
Incomplete answers due to missing necessary steps or calculations	5.5%
Logical errors despite clear reasoning	4.3%
Misunderstanding or incorrect application of mathematical concepts	3.3%
Unnecessary complexity or inclusion of irrelevant details	2.6%
Incorrect handling of units or conversions	0.6%

Mixtral 8x7B on TechQA (RAG), general mode, GPT-4o judge. 48.1% of instances flagged.

Issue	Frequency
Omission of necessary details or steps	36.3%
Lack of specificity and completeness in responses	31.2%
Omission of relevant links or references	9.2%
Inaccurate or irrelevant information	8.6%
Failure to provide actionable insights or solutions	8.3%
Misinterpretation or misuse of context	4.5%
Lack of clarity in explaining technical details	3.5%
Incomplete or abrupt ending of the response	3.5%

Phi-4 on TechQA (RAG), general mode, GPT-4o judge. Only 23.4% of instances flagged (vs. 48.1% for Mixtral).

Issue	Frequency
Lacks completeness and necessary details	10.9%
Lacks context-specific information	9.9%
Lacks specificity in technical details	6.0%
Fails to mention unsupported features or limitations	5.1%
Inaccurate or fabricated information	2.6%
Does not directly answer the question	1.9%
Assumes unsupported or incorrect context	1.9%
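In both TechQA tables the issue frequencies sum to more than the share of flagged instances (for Mixtral, 105.1% against 48.1%), which is consistent with a single instance carrying several issues. The sketch below shows that relationship on invented data, assuming each frequency is the share of all instances tagged with that issue.

```python
from collections import Counter

# Hypothetical per-instance issue sets; an empty set means nothing was flagged.
instance_issues = [
    {"omits details", "lacks specificity"},
    {"omits details"},
    set(),
    set(),
]

n = len(instance_issues)

# Share of instances with at least one issue.
flagged_rate = sum(1 for issues in instance_issues if issues) / n

# Share of instances tagged with each issue.
issue_counts = Counter(issue for issues in instance_issues for issue in issues)
frequency = {issue: count / n for issue, count in issue_counts.items()}

print(flagged_rate)             # 0.5
print(sum(frequency.values()))  # 0.75
```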

Evaluation Modes

General: Issues are discovered dynamically, giving a broad, exploratory assessment without prior knowledge.
Task-specific: Users provide evaluation criteria; the judge is guided by them but can still discover additional issues.
Static: Predefined issues are mapped directly to instances, with no dynamic discovery.

User Study

We conducted a user study with 12 AI practitioners and researchers who used CLEAR on three datasets and provided feedback via structured questionnaires.

Finding	Result
Ability to detect issues they would have overlooked (Likert 1-5)	4.33
Time-saving and better than existing practices	4.25
Would take action based on the output	74%
Currently rely on manual inspection for error analysis	75%

Participants appreciated the automation of error detection, the visual exploration interface, and the ability to identify common failure modes at scale (4.16 average). The tool was seen as actionable and time-saving compared to manual workflows.

Citation

Agentic Analysis (ACL 2026)

@article{yehudai2026agentic,
  title={Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents},
  author={Yehudai, Asaf and Eden, Lilach and Shmueli-Scheuer, Michal},
  journal={arXiv preprint arXiv:2605.22608},
  year={2026}
}

LLM Analysis (AAAI 2026)

@inproceedings{yehudai2026clear,
  title={CLEAR: Error analysis via llm-as-a-judge made easy},
  author={Yehudai, Asaf and Eden, Lilach and Perlitz, Yotam and Bar-Haim, Roy and Shmueli-Scheuer, Michal},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={40},
  number={48},
  pages={41736--41738},
  year={2026}
}