Agent observability platforms capture execution traces but lack meaningful evaluation. Developers must manually inspect large numbers of traces to identify systemic failures. Research-driven error taxonomies are static and require extensive human annotation.
Agentic CLEAR is an open-source Python package that automatically evaluates agent traces at three levels of granularity (system, node, trace), dynamically surfaces recurring failure patterns without predefined taxonomies, and presents them in an interactive dashboard. It is built on top of CLEAR's LLM-as-a-Judge methodology.
| Agent Framework | Observability Platform | Status |
|---|---|---|
| LangGraph | MLflow | Supported |
| LangGraph | Langfuse | Supported |
| CrewAI | Langfuse | Supported |
| Custom | Any (via CSV) | Supported |
OpenTelemetry-compatible traces from MLflow, Langfuse, or CSV are converted to a unified representation.
An LLM judge evaluates each trace in three ways: step-wise, trace-wise, and rubric-based.
Judge feedback is clustered into system-wide and node-specific recurring issues.
The dashboard provides System, Node, and Trace views with filtering, path analysis, and score prediction.
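The conversion and grouping steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the package's actual API: the CSV column names (`trace_id`, `step`, `node`, `output`), the `Step` dataclass, and the `load_traces` and `feedback_by_node` helpers are all assumptions made for the example, and the LLM judge is replaced by a hard-coded stub.

```python
import csv
import io
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Step:
    """One agent step in a hypothetical unified trace representation."""
    trace_id: str
    index: int
    node: str
    output: str


def load_traces(csv_text: str) -> dict[str, list[Step]]:
    """Group CSV rows into traces, ordered by step index (assumed columns)."""
    traces: dict[str, list[Step]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        traces[row["trace_id"]].append(
            Step(row["trace_id"], int(row["step"]), row["node"], row["output"])
        )
    for steps in traces.values():
        steps.sort(key=lambda s: s.index)
    return dict(traces)


def stub_judge(step: Step) -> list[str]:
    """Stand-in for an LLM judge: returns critique strings for one step."""
    return ["empty output"] if not step.output.strip() else []


def feedback_by_node(traces: dict[str, list[Step]]) -> dict[str, list[str]]:
    """Collect step-wise critiques per node, ready for clustering."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for steps in traces.values():
        for step in steps:
            grouped[step.node].extend(stub_judge(step))
    return dict(grouped)


# Invented sample data; rows are deliberately out of order for trace t1.
SAMPLE_CSV = """trace_id,step,node,output
t1,1,Executor,called search API
t1,0,Planner,plan: search then answer
t2,0,Planner,
"""

traces = load_traces(SAMPLE_CSV)
feedback = feedback_by_node(traces)
print([s.node for s in traces["t1"]])
print(feedback)
```

In a real pipeline the stub would be replaced by a call to an LLM judge, and the per-node critique lists would be passed to a clustering step rather than printed.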
Interactive graph of agents and transitions with call counts
Per-agent CLEAR issues, score distributions, and drill-down
Browse trajectories filtered by length, agent, or score
Common path patterns, success vs. failure analysis
Agent position and score progression across steps
ROC curves for predicting trajectory success
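Several of these views reduce to counting over trajectories. The sketch below is illustrative and does not use the package's API: it assumes each trajectory is an ordered list of agent names with a judge score and a success flag (the `TRAJECTORIES` data is invented), then derives transition counts for the graph view, per-path success statistics, and a rank-based ROC AUC for how well the score separates successful from failed trajectories.

```python
from collections import Counter

# Invented example data: (ordered agent names, judge score, task succeeded?)
TRAJECTORIES = [
    (["Planner", "Executor", "Answer"], 0.9, True),
    (["Planner", "Executor", "Executor", "Answer"], 0.6, True),
    (["Planner", "Answer"], 0.7, False),
    (["Planner", "Executor", "Answer"], 0.3, False),
]


def transition_counts(trajectories):
    """Count how often each agent hands off to the next (graph edge weights)."""
    counts = Counter()
    for path, _score, _ok in trajectories:
        counts.update(zip(path, path[1:]))
    return counts


def path_success(trajectories):
    """Map each distinct path to (times seen, times successful)."""
    stats = {}
    for path, _score, ok in trajectories:
        seen, wins = stats.get(tuple(path), (0, 0))
        stats[tuple(path)] = (seen + 1, wins + int(ok))
    return stats


def roc_auc(trajectories):
    """AUC as the probability that a successful trace outscores a failed one.

    Ties count as half a win. Requires at least one success and one failure.
    """
    pos = [s for _p, s, ok in trajectories if ok]
    neg = [s for _p, s, ok in trajectories if not ok]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


edges = transition_counts(TRAJECTORIES)
paths = path_success(TRAJECTORIES)
auc = roc_auc(TRAJECTORIES)
print(edges[("Planner", "Executor")], paths[("Planner", "Answer")], auc)
```

The pairwise AUC is quadratic in the number of trajectories, which is acceptable for a sketch; a library routine would be a better fit for large trace collections.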
| Benchmark | Agent | Model | # Traces | Source |
|---|---|---|---|---|
| AppWorld | CUGA | GPT-4o | 417 | Leaderboard |
| GAIA | Generalist Agent | Claude 4.5 Sonnet | 165 | HAL |
| GAIA | Generalist Agent | GPT-4.1 | 165 | HAL |
| GAIA | HF DeepResearch | Claude 4.5 Sonnet | 165 | HAL |
| GAIA | HF DeepResearch | OpenAI o3 | 117 | TRAIL |
| SWE-bench Verified | Generalist Agent | Claude 4.5 Sonnet | 50 | HAL |
| TAU-bench | Generalist Agent | Claude 3.7 Sonnet | 50 | HAL |
Agentic CLEAR surfaced 195 unique recurring issues across all configurations. The comparisons below contrast system-level and node-level issues, two benchmarks, and two backbone models.
CUGA agent on AppWorld (GPT-4o backbone, GPT-5 judge). Issues sorted by frequency.
| System-Level Issues | TaskDecompositionAgent (Node-Level) |
|---|---|
| Execution flow management flaws: inefficient execution and incomplete coverage | Assumes unsupported app capabilities; violates strict app constraints |
| Validation and preconditions gaps: skips critical pre/post-action checks | Workflow coherence errors: reasoning, tasks, and ordering don't align |
| Blockage handling failures: no fallbacks, declares failure prematurely | Fails to return the user's intent verbatim for single-app tasks |
| Incomplete execution: stops short of finishing core task (checkout, send, save) | App boundary and handoff mistakes: wrong app assignment |
| Entity resolution weaknesses: brittle matching and poor disambiguation | Insufficient disambiguation criteria for selecting among items |
| API/framework contract noncompliance: schema/role/output violations | Poor handling of absent data and capability limits; lacks fallbacks |
| Intent and channel selection errors: misinterprets request, wrong app/medium | Missing required parameters or constraints (time, recipients, labels) |
| Shopping cart and selection integrity issues: contaminated carts, mishandled items | Insufficient handling of edge cases and input/format variations |
| Edge-case robustness deficiencies: time zones, date boundaries, unit normalization | Missing finalization: fails to perform the final action or deliver answer |
| Data integrity in inputs: hard-coded or unverified identifiers instead of retrieved data | Adds unsupported assumptions or extra details not provided by the user |
Same agent (HAL Generalist), same model (Claude 4.5 Sonnet), GPT-5 judge. Domain-specific issues emerge without benchmark-specific prompting.
| GAIA (Research Tasks) | SWE-bench Verified (Code Tasks) |
|---|---|
| Lack of cross-verification across independent sources | Broken patch output: malformed diffs and missing hunks |
| Sourcing failures: unreliable or unverifiable references | Monkey-patching instead of maintainable overrides |
| Premature conclusion without exhausting search avenues | Missing regression tests for the fix |
| Failure to adhere to output formatting specifications | Environment prerequisites not checked before applying fix |
| Contradictory or self-conflicting statements | Incomplete understanding of codebase dependencies |
| Redundant tool calls and inefficient search patterns | Fails to isolate the minimal reproducible case |
| Incomplete data processing: partial extraction from documents | Patch applies to wrong file or function scope |
| Unreliable numerical or date parsing from web sources | Hard-coded values instead of dynamic resolution from context |
| Fails to synthesize findings into coherent final answer | Missing import statements or dependency declarations |
| Over-reliance on a single information source | Does not verify fix compiles or passes existing tests |
HAL Generalist Agent on GAIA, GPT-5 judge. Shared issues dominate, but each model shows unique tendencies.
| GPT-4.1 Backbone | Claude 4.5 Sonnet Backbone |
|---|---|
| Source verification gaps (shared) | Source verification gaps (shared) |
| Tool misuse and redundant calls (shared) | Tool misuse and redundant calls (shared) |
| Noncompliance with required output formats (shared) | Failure to adhere to output formatting specifications (shared) |
| Incomplete task execution: stops before delivering final answer (shared) | Incomplete task execution: abandons subtasks mid-stream (shared) |
| Insufficient error recovery after tool failures (shared) | Insufficient error recovery after tool failures (shared) |
| Unique: Prematurely giving up after errors instead of retrying or pivoting | Unique: Contradictory or self-conflicting statements; inconsistent interpretation |
| Unique: Over-reliance on single source without triangulation | Unique: Excessive tool calls before synthesizing available information |
| Unique: Fails to decompose complex queries into sub-questions | Unique: Hallucinates tool capabilities that don't exist |
| Unique: Rigid sequential execution; does not parallelize independent searches | Unique: Verbose reasoning chains that lose track of the original goal |
| Unique: Misinterprets ambiguous instructions as impossible tasks | Unique: Overconfident conclusions from insufficient evidence |
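
The shared/unique split in this table amounts to set operations once the issues from both runs are mapped to common labels. The sketch below assumes that mapping has already been done; some shared issues are worded differently per model in the table, so exact string matching on raw issue text would not be enough. The short labels are paraphrases chosen for the example, not identifiers produced by Agentic CLEAR.

```python
# Hypothetical canonical labels, paraphrased from the table above.
gpt41_issues = {
    "source verification gaps",
    "tool misuse and redundant calls",
    "output format noncompliance",
    "gives up after errors",
    "no query decomposition",
}
claude_issues = {
    "source verification gaps",
    "tool misuse and redundant calls",
    "output format noncompliance",
    "self-contradictory statements",
    "hallucinated tool capabilities",
}

# Set intersection and difference give the shared and model-specific issues.
shared = sorted(gpt41_issues & claude_issues)
only_gpt41 = sorted(gpt41_issues - claude_issues)
only_claude = sorted(claude_issues - gpt41_issues)

print("shared:", shared)
print("GPT-4.1 only:", only_gpt41)
print("Claude 4.5 Sonnet only:", only_claude)
```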
LLM evaluation typically produces only scalar scores. Developers manually inspect outputs to understand why their system fails, what recurring patterns exist, and which aspects need improvement.
CLEAR automatically discovers and quantifies recurring issues in your LLM outputs using an LLM-as-a-Judge approach combined with Key Point Analysis. Provide a CSV with prompts and responses, and get back structured, actionable feedback with an interactive dashboard for exploration.
A dataset D and a target system s produce responses R. Alternatively, provide pre-existing outputs.
The judge assigns each response a textual critique and a numeric quality score.
Cluster critiques into recurring issues, quantifying frequency.
Explore issues, filter by type or score, compare subsets, drill down.
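The frequency quantification in the clustering step can be illustrated with a small sketch. It assumes each judged response has already been assigned zero or more issue labels; the `labels_per_response` data and the `issue_frequencies` helper are invented for illustration and are not CLEAR's API. Under the further assumption that one response can carry several issues, per-issue frequencies can sum to more than the share of flagged responses.

```python
from collections import Counter


def issue_frequencies(labels_per_response):
    """Return (share of responses showing each issue, share flagged at all)."""
    total = len(labels_per_response)
    counts = Counter()
    for labels in labels_per_response:
        counts.update(set(labels))  # count each issue once per response
    freqs = {issue: n / total for issue, n in counts.most_common()}
    flagged = sum(1 for labels in labels_per_response if labels) / total
    return freqs, flagged


# Invented labels for four responses; two of them have no issues.
labels_per_response = [
    ["omits details", "lacks specificity"],
    ["omits details"],
    [],
    [],
]

freqs, flagged = issue_frequencies(labels_per_response)
print(freqs)
print(flagged)
```

Here the two issue frequencies add up to 0.75 while only half of the responses are flagged, because the first response contributes to both issues.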
CLEAR adapts issue discovery to the task, dataset, and model, as the three examples below show.
Mixtral 8x7B on GSM8K, task-specific mode, GPT-4o judge.
| Issue | Frequency |
|---|---|
| Mathematical errors in calculations, including rounding and final steps | 13.2% |
| Incorrect understanding of problem statements leading to flawed reasoning | 11.8% |
| Failure to fully consider or correctly interpret all given information | 5.8% |
| Incomplete answers due to missing necessary steps or calculations | 5.5% |
| Logical errors despite clear reasoning | 4.3% |
| Misunderstanding or incorrect application of mathematical concepts | 3.3% |
| Unnecessary complexity or inclusion of irrelevant details | 2.6% |
| Incorrect handling of units or conversions | 0.6% |
Mixtral 8x7B on TechQA (RAG), general mode, GPT-4o judge. 48.1% of instances flagged.
| Issue | Frequency |
|---|---|
| Omission of necessary details or steps | 36.3% |
| Lack of specificity and completeness in responses | 31.2% |
| Omission of relevant links or references | 9.2% |
| Inaccurate or irrelevant information | 8.6% |
| Failure to provide actionable insights or solutions | 8.3% |
| Misinterpretation or misuse of context | 4.5% |
| Lack of clarity in explaining technical details | 3.5% |
| Incomplete or abrupt ending of the response | 3.5% |
Phi-4 on TechQA (RAG), general mode, GPT-4o judge. Only 23.4% of instances flagged (vs. 48.1% for Mixtral).
| Issue | Frequency |
|---|---|
| Lacks completeness and necessary details | 10.9% |
| Lacks context-specific information | 9.9% |
| Lacks specificity in technical details | 6.0% |
| Fails to mention unsupported features or limitations | 5.1% |
| Inaccurate or fabricated information | 2.6% |
| Does not directly answer the question | 1.9% |
| Assumes unsupported or incorrect context | 1.9% |
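
The Mixtral and Phi-4 comparison on TechQA can be checked with a few lines of arithmetic. The sketch below only reuses the percentages from the two tables above; the variable names are chosen for the example. Reading the per-issue sums as evidence that a response can carry more than one issue is an interpretation, not something the tables state.

```python
# Percentages copied from the two TechQA tables above.
mixtral = {"flagged": 48.1, "issues": [36.3, 31.2, 9.2, 8.6, 8.3, 4.5, 3.5, 3.5]}
phi4 = {"flagged": 23.4, "issues": [10.9, 9.9, 6.0, 5.1, 2.6, 1.9, 1.9]}

# Gap in percentage points and as a share of Mixtral's flagged rate.
point_gap = round(mixtral["flagged"] - phi4["flagged"], 1)
relative_drop = round(100 * point_gap / mixtral["flagged"], 1)

# Per-issue frequencies summed for each model.
mixtral_issue_sum = round(sum(mixtral["issues"]), 1)
phi4_issue_sum = round(sum(phi4["issues"]), 1)

print(point_gap, relative_drop)
print(mixtral_issue_sum, phi4_issue_sum)
```

Phi-4's flagged rate is 24.7 percentage points lower, roughly half of Mixtral's. For both models the per-issue frequencies sum to more than the flagged rate, which is consistent with issues overlapping on the same responses.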
We conducted a user study with 12 AI practitioners and researchers who used CLEAR on three datasets and provided feedback via structured questionnaires.
Participants appreciated the automation of error detection, the visual exploration interface, and the ability to identify common failure modes at scale (4.16 average). The tool was seen as actionable and time-saving compared to manual workflows.
@article{yehudai2026agentic,
title={Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents},
author={Yehudai, Asaf and Eden, Lilach and Shmueli-Scheuer, Michal},
journal={arXiv preprint arXiv:2605.22608},
year={2026}
}
@inproceedings{yehudai2026clear,
title={CLEAR: Error Analysis via {LLM}-as-a-Judge Made Easy},
author={Yehudai, Asaf and Eden, Lilach and Perlitz, Yotam and Bar-Haim, Roy and Shmueli-Scheuer, Michal},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={40},
number={48},
pages={41736--41738},
year={2026}
}