Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

Asaf Yehudai*, Lilach Eden*, Michal Shmueli-Scheuer
IBM Research  |  ACL 2026

TL;DR

The problem

Agent observability platforms capture execution traces but lack meaningful evaluation. Developers must manually inspect large numbers of traces to identify systemic failures. Research-driven error taxonomies are static and require extensive human annotation.

What Agentic CLEAR does

An open-source Python package that automatically evaluates agent traces at three levels of granularity (system, node, trace), dynamically surfaces recurring failure patterns without predefined taxonomies, and presents them in an interactive dashboard. Built on top of CLEAR's LLM-as-a-Judge methodology.

Getting Started

pip install clear-eval
# Run on sample traces (3 traces, ~2 minutes)
run-clear-agentic-eval \
    --data-dir src/clear_eval/sample_data/agentic/research_agent_traces/mlflow \
    --results-dir my_results \
    --from-raw-traces true \
    --agent-framework langgraph \
    --observability-framework mlflow \
    --max-files 3 \
    --eval-model-name gpt-4o \
    --provider openai

# Launch the dashboard
run-clear-agentic-dashboard

Supported Frameworks

Agent Framework | Observability Platform | Status
LangGraph | MLflow | Supported
LangGraph | Langfuse | Supported
CrewAI | Langfuse | Supported
Custom | Any (via CSV) | Supported

Pipeline

Agentic CLEAR Pipeline. From raw execution traces through multi-level evaluation and aggregation to interactive insights.
Trace Preprocessing

OpenTelemetry-compatible traces from MLflow, Langfuse, or CSV are converted to a unified representation.

Multi-Level Evaluation

LLM judge evaluates each trace: step-wise, trace-wise, and rubric-based.

CLEAR Aggregation

Cluster feedback into system-wide and node-specific recurring issues.

Dashboard

System, Node, and Trace views with filtering, path analysis, and score prediction.
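
The three evaluation modes in the Multi-Level Evaluation step can be illustrated with a small sketch. It assumes the step-wise score is the average of per-step judge scores, the trace-level score is a single holistic judgment, and the rubric score is the proportion of criteria met. The `Step` and `Trace` classes, their field names, and the example values are hypothetical and are not the `clear-eval` API; in the actual pipeline these values come from the LLM judge.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Step:
    node: str           # hypothetical: name of the agent/node that ran this step
    judge_score: float  # hypothetical: per-step judge score, assumed in [0, 1]


@dataclass
class Trace:
    trace_id: str
    steps: list[Step]
    holistic_score: float  # judge's single whole-trace judgment
    rubric: dict[str, bool] = field(default_factory=dict)  # criterion -> met?


def step_wise_score(trace: Trace) -> float:
    """Average of the per-step judge scores (assumes at least one step)."""
    return mean(s.judge_score for s in trace.steps)


def trace_level_score(trace: Trace) -> float:
    """Holistic judgment of the whole trace, passed through unchanged."""
    return trace.holistic_score


def rubric_score(trace: Trace) -> float:
    """Proportion of rubric criteria met (assumes a non-empty rubric)."""
    return sum(trace.rubric.values()) / len(trace.rubric)


# Made-up example trace, for illustration only.
example = Trace(
    trace_id="t1",
    steps=[Step("planner", 1.0), Step("searcher", 0.5), Step("writer", 0.75)],
    holistic_score=0.5,
    rubric={
        "cites_sources": True,
        "answers_question": True,
        "follows_format": False,
        "no_contradictions": True,
    },
)

scores = {
    "step_wise": step_wise_score(example),
    "trace_level": trace_level_score(example),
    "rubric": rubric_score(example),
}
print(scores)
```

Keeping the three scores separate, rather than merging them into one number, mirrors the idea that each mode can carry a different signal about the same trace.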

Dashboard

Agentic CLEAR Dashboard. A hierarchical UI with three views: System View reconstructs the multi-agent topology and surfaces global issues; Node View enables per-component error analysis with filtering; Trace View provides step-level inspection with rubric evaluation and dimension scores.
Workflow View

Interactive graph of agents and transitions with call counts

Node Analysis

Per-agent CLEAR issues, score distributions, and drill-down

Trajectory Explorer

Browse trajectories filtered by length, agent, or score

Path Analysis

Common path patterns, success vs. failure analysis

Temporal Analysis

Agent position and score progression across steps

Score Prediction

ROC curves for predicting trajectory success
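
The Path Analysis view compares common path patterns across successful and failed trajectories. A minimal sketch of that kind of tally is shown below, assuming each trace reduces to an ordered tuple of node names plus a success flag. The node names and data are made up, and the dashboard's own computation may differ.

```python
from collections import Counter

# Hypothetical traces: (ordered node names, did the task succeed?)
traces = [
    (("planner", "searcher", "writer"), True),
    (("planner", "searcher", "searcher", "writer"), False),
    (("planner", "searcher", "writer"), True),
    (("planner", "writer"), False),
    (("planner", "searcher", "writer"), False),
]


def path_stats(traces):
    """Return (path, count, success_rate) tuples, most common path first."""
    totals, successes = Counter(), Counter()
    for path, ok in traces:
        totals[path] += 1
        successes[path] += int(ok)
    return [(path, n, successes[path] / n) for path, n in totals.most_common()]


stats = path_stats(traces)
for path, n, rate in stats:
    print(" -> ".join(path), f"count={n}", f"success_rate={rate:.2f}")
```

Paths that are frequent but have a low success rate are natural candidates for closer inspection in the Node and Trace views.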

Results

Experimental Setup

Benchmark | Agent | Model | # Traces | Source
AppWorld | CUGA | GPT-4o | 417 | Leaderboard
GAIA | Generalist Agent | Claude 4.5 Sonnet | 165 | HAL
GAIA | Generalist Agent | GPT-4.1 | 165 | HAL
GAIA | HF DeepResearch | Claude 4.5 Sonnet | 165 | HAL
GAIA | HF DeepResearch | OpenAI o3 | 117 | TRAIL
SWE-bench Verified | Generalist Agent | Claude 4.5 Sonnet | 50 | HAL
TAU-bench | Generalist Agent | Claude 3.7 Sonnet | 50 | HAL

Discovered Issues

Agentic CLEAR surfaced 195 unique recurring issues across all configurations. Three comparison views follow.

CUGA agent on AppWorld (GPT-4o backbone, GPT-5 judge). Issues sorted by frequency.

System-Level Issues | TaskDecompositionAgent (Node-Level)
Execution flow management flaws: inefficient execution and incomplete coverage | Assumes unsupported app capabilities; violates strict app constraints
Validation and preconditions gaps: skips critical pre/post-action checks | Workflow coherence errors: reasoning, tasks, and ordering don't align
Blockage handling failures: no fallbacks, declares failure prematurely | Fails to return the user's intent verbatim for single-app tasks
Incomplete execution: stops short of finishing core task (checkout, send, save) | App boundary and handoff mistakes: wrong app assignment
Entity resolution weaknesses: brittle matching and poor disambiguation | Insufficient disambiguation criteria for selecting among items
API/framework contract noncompliance: schema/role/output violations | Poor handling of absent data and capability limits; lacks fallbacks
Intent and channel selection errors: misinterprets request, wrong app/medium | Missing required parameters or constraints (time, recipients, labels)
Shopping cart and selection integrity issues: contaminated carts, mishandled items | Insufficient handling of edge cases and input/format variations
Edge-case robustness deficiencies: time zones, date boundaries, unit normalization | Missing finalization: fails to perform the final action or deliver answer
Data integrity in inputs: hard-coded or unverified identifiers instead of retrieved data | Adds unsupported assumptions or extra details not provided by the user

Same agent (HAL Generalist), same model (Claude 4.5 Sonnet), GPT-5 judge. Domain-specific issues emerge without benchmark-specific prompting.

GAIA (Research Tasks) | SWE-bench Verified (Code Tasks)
Lack of cross-verification across independent sources | Broken patch output: malformed diffs and missing hunks
Sourcing failures: unreliable or unverifiable references | Monkey-patching instead of maintainable overrides
Premature conclusion without exhausting search avenues | Missing regression tests for the fix
Failure to adhere to output formatting specifications | Environment prerequisites not checked before applying fix
Contradictory or self-conflicting statements | Incomplete understanding of codebase dependencies
Redundant tool calls and inefficient search patterns | Fails to isolate the minimal reproducible case
Incomplete data processing: partial extraction from documents | Patch applies to wrong file or function scope
Unreliable numerical or date parsing from web sources | Hard-coded values instead of dynamic resolution from context
Fails to synthesize findings into coherent final answer | Missing import statements or dependency declarations
Over-reliance on a single information source | Does not verify fix compiles or passes existing tests

HAL Generalist Agent on GAIA, GPT-5 judge. Shared issues dominate, but each model shows unique tendencies.

GPT-4.1 Backbone | Claude 4.5 Sonnet Backbone
Source verification gaps (shared) | Source verification gaps (shared)
Tool misuse and redundant calls (shared) | Tool misuse and redundant calls (shared)
Noncompliance with required output formats (shared) | Failure to adhere to output formatting specifications (shared)
Incomplete task execution: stops before delivering final answer (shared) | Incomplete task execution: abandons subtasks mid-stream (shared)
Insufficient error recovery after tool failures (shared) | Insufficient error recovery after tool failures (shared)
Unique: Prematurely giving up after errors instead of retrying or pivoting | Unique: Contradictory or self-conflicting statements; inconsistent interpretation
Unique: Over-reliance on single source without triangulation | Unique: Excessive tool calls before synthesizing available information
Unique: Fails to decompose complex queries into sub-questions | Unique: Hallucinates tool capabilities that don't exist
Unique: Rigid sequential execution; does not parallelize independent searches | Unique: Verbose reasoning chains that lose track of the original goal
Unique: Misinterprets ambiguous instructions as impossible tasks | Unique: Overconfident conclusions from insufficient evidence

Alignment with Human Taxonomies (TRAIL)

Error category prediction against TRAIL taxonomy
Table 1. We validate our automatically generated issues against TRAIL's human-annotated error taxonomy (12 reasoning/planning categories). GPT-5 with full+partial matching achieves the best coverage, outperforming frequency baselines on macro F1 while covering all 12 categories without predefined definitions.
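
Macro F1, the metric named in the caption above, averages per-category F1 so that rare error categories weigh as much as frequent ones. The sketch below shows the computation for multi-label predictions. The category names `A`, `B`, `C` and the gold/predicted sets are placeholders rather than TRAIL's actual categories or the paper's data, and scoring a category with no gold and no predicted instances as 0 is an assumption of this sketch.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from true/false positives and false negatives; 0.0 if undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold, pred, categories) -> float:
    """Unweighted mean of per-category F1 over all listed categories."""
    per_category = []
    for c in categories:
        tp = sum(1 for g, p in zip(gold, pred) if c in g and c in p)
        fp = sum(1 for g, p in zip(gold, pred) if c not in g and c in p)
        fn = sum(1 for g, p in zip(gold, pred) if c in g and c not in p)
        per_category.append(f1(tp, fp, fn))
    return sum(per_category) / len(per_category)


# Placeholder data: one set of error categories per trace.
categories = ["A", "B", "C"]
gold = [{"A"}, {"A", "B"}, {"C"}, {"B"}]       # human annotations
pred = [{"A"}, {"A"}, {"C", "A"}, {"B"}]       # categories matched from discovered issues

print(f"macro F1 = {macro_f1(gold, pred, categories):.3f}")
```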

Score Prediction (AUC)

AUC heatmap for predicting trajectory success
Table 2. AUC for predicting trajectory success using three complementary methods: step-wise (average step score), trace-level (holistic judgment), and rubric-based (proportion of criteria met). Darker green = higher AUC. Trace-level evaluation is generally strongest, peaking at 0.890 on AppWorld. The best method varies by benchmark and agent, suggesting the three modes provide complementary signals.
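
For readers who want to reproduce this kind of analysis on their own traces, AUC can be computed directly from per-trace scores and success labels. The sketch below uses the pairwise (Mann-Whitney) formulation: the probability that a randomly chosen successful trace scores higher than a randomly chosen failed one, with ties counted as half. The scores and labels are made up for illustration and are not results from the paper.

```python
def auc(scores, labels) -> float:
    """Pairwise AUC: P(score of a success > score of a failure), ties = 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one successful and one failed trace")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical per-trace data: did the task succeed, and two score types.
succeeded = [True, True, False, True, False, False]
step_wise = [0.9, 0.7, 0.8, 0.6, 0.4, 0.3]     # average step score per trace
rubric = [1.0, 0.75, 0.5, 0.75, 0.5, 0.25]     # proportion of criteria met

print(f"step-wise AUC = {auc(step_wise, succeeded):.3f}")
print(f"rubric AUC    = {auc(rubric, succeeded):.3f}")
```

The quadratic pairwise loop is fine for a few hundred traces; for larger sets a rank-based implementation such as scikit-learn's `roc_auc_score` would be the usual choice.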

CLEAR: Error Analysis via LLM-as-a-Judge Made Easy

TL;DR

The problem

LLM evaluation typically produces only scalar scores. Developers manually inspect outputs to understand why their system fails, what recurring patterns exist, and which aspects need improvement.

What CLEAR does

CLEAR automatically discovers and quantifies recurring issues in your LLM outputs using an LLM-as-a-Judge approach combined with Key Point Analysis. Provide a CSV with prompts and responses — get back structured, actionable feedback with an interactive dashboard for exploration.

Getting Started

pip install clear-eval
# Run on sample GSM8K data
run-clear-eval-analysis --provider openai --eval-model-name gpt-4o

# Launch the Streamlit dashboard
run-clear-eval-dashboard
# Upload the ZIP from results/gsm8k/sample_output/

Pipeline

CLEAR Pipeline. (a) From dataset and target system through judge evaluation and key point analysis to (b) interactive exploration.
Generate Responses

Dataset D + target system s produce responses R. Or provide pre-existing outputs.

LLM Judge (J)

Scores each response with textual critique and numeric quality score.

Key Point Analysis (K)

Cluster critiques into recurring issues, quantifying frequency.

Dashboard

Explore issues, filter by type or score, compare subsets, drill down.
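
The quantification part of the Key Point Analysis step can be sketched as follows, assuming each response has already been matched to zero or more discovered issues and that an issue's frequency is the share of all instances matched to it. The issue labels and matches are made up, and this is not the package's internal implementation.

```python
from collections import Counter

# Hypothetical result of the judge + key point analysis stages:
# one set of matched issue labels per response (empty set = nothing flagged).
matched = [
    {"calculation error"},
    set(),
    {"calculation error", "missing steps"},
    set(),
    {"missing steps"},
    set(),
    set(),
    {"calculation error"},
]


def issue_frequencies(matched):
    """Return ({issue: % of instances}, % of instances with any issue)."""
    n = len(matched)
    counts = Counter(issue for issues in matched for issue in issues)
    freqs = {issue: 100 * c / n for issue, c in counts.most_common()}
    flagged = 100 * sum(1 for issues in matched if issues) / n
    return freqs, flagged


freqs, flagged = issue_frequencies(matched)
for issue, pct in freqs.items():
    print(f"{issue}: {pct:.1f}%")
print(f"instances flagged: {flagged:.1f}%")
```

Because one response can match several issues, the per-issue percentages can sum to more than the share of flagged instances.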

Dashboard

CLEAR Dashboard. The Streamlit-based interface provides: an Issues View showing error distribution and frequency; a Filtering Mechanism to isolate examples by issue type or score range; a Comparison View contrasting filtered vs. full-dataset distributions; and a Model Behavior panel for instance-level inspection of judge feedback.

Example: Discovered Issues

CLEAR adapts issue discovery to the task, dataset, and model, as the three examples below show.

Mixtral 8x7B on GSM8K, task-specific mode, GPT-4o judge.

Issue | Frequency
Mathematical errors in calculations, including rounding and final steps | 13.2%
Incorrect understanding of problem statements leading to flawed reasoning | 11.8%
Failure to fully consider or correctly interpret all given information | 5.8%
Incomplete answers due to missing necessary steps or calculations | 5.5%
Logical errors despite clear reasoning | 4.3%
Misunderstanding or incorrect application of mathematical concepts | 3.3%
Unnecessary complexity or inclusion of irrelevant details | 2.6%
Incorrect handling of units or conversions | 0.6%

Mixtral 8x7B on TechQA (RAG), general mode, GPT-4o judge. 48.1% of instances flagged.

Issue | Frequency
Omission of necessary details or steps | 36.3%
Lack of specificity and completeness in responses | 31.2%
Omission of relevant links or references | 9.2%
Inaccurate or irrelevant information | 8.6%
Failure to provide actionable insights or solutions | 8.3%
Misinterpretation or misuse of context | 4.5%
Lack of clarity in explaining technical details | 3.5%
Incomplete or abrupt ending of the response | 3.5%

Phi-4 on TechQA (RAG), general mode, GPT-4o judge. Only 23.4% of instances flagged (vs. 48.1% for Mixtral).

Issue | Frequency
Lacks completeness and necessary details | 10.9%
Lacks context-specific information | 9.9%
Lacks specificity in technical details | 6.0%
Fails to mention unsupported features or limitations | 5.1%
Inaccurate or fabricated information | 2.6%
Does not directly answer the question | 1.9%
Assumes unsupported or incorrect context | 1.9%

Evaluation Modes

  • General: Issues discovered dynamically — broad, exploratory assessment without prior knowledge.
  • Task-specific: Users provide evaluation criteria; judge is guided but can discover additional issues.
  • Static: Predefined issues mapped directly to instances without dynamic discovery.

User Study

We conducted a user study with 12 AI practitioners and researchers who used CLEAR on three datasets and provided feedback via structured questionnaires.

  • 4.33: Ability to detect issues they would have overlooked (Likert 1-5)
  • 4.25: Time-saving and better than existing practices
  • 74%: Would take action based on the output
  • 75%: Currently rely on manual inspection for error analysis

Participants appreciated the automation of error detection, the visual exploration interface, and the ability to identify common failure modes at scale (4.16 average). The tool was seen as actionable and time-saving compared to manual workflows.

Citation

Agentic Analysis (ACL 2026)

@article{yehudai2026agentic,
  title={Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents},
  author={Yehudai, Asaf and Eden, Lilach and Shmueli-Scheuer, Michal},
  journal={arXiv preprint arXiv:2605.22608},
  year={2026}
}

LLM Analysis (AAAI 2026)

@inproceedings{yehudai2026clear,
  title={CLEAR: Error analysis via llm-as-a-judge made easy},
  author={Yehudai, Asaf and Eden, Lilach and Perlitz, Yotam and Bar-Haim, Roy and Shmueli-Scheuer, Michal},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={40},
  number={48},
  pages={41736--41738},
  year={2026}
}