EvalAssist

LLM-as-a-Judge Simplified — Start Small, Refine Fast, Scale Smart

EvalAssist is an application that simplifies using large language models as evaluators (LLM-as-a-Judge) of the output of other large language models. It supports users in iteratively refining evaluation criteria in a web-based user experience. EvalAssist is built on the open-source Unitxt evaluation library and uses a multi-step prompt-chaining approach to conduct evaluations. Once users are satisfied with their criteria, they can generate a Jupyter notebook that runs Unitxt evaluations of those criteria at scale.
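
To make the prompt-chaining idea concrete, the sketch below shows what a two-step chain for direct assessment could look like: the judge first writes an assessment of the response against the criterion, then commits to one of the allowed options. The helper `call_judge` and the prompt wording are placeholders for illustration, not EvalAssist's actual prompts or API.

```python
# Illustrative sketch of a multi-step prompt chain for direct assessment.
# `call_judge` and the prompt wording are placeholders, not EvalAssist's
# actual prompts or API.

def call_judge(prompt: str) -> str:
    """Send a prompt to the judge model and return its reply (placeholder)."""
    raise NotImplementedError("Connect this to your LLM provider of choice.")

def direct_assessment(criterion: str, options: list[str], response: str) -> dict:
    # Step 1: ask the judge to reason about the response against the criterion.
    assessment = call_judge(
        f"Criterion: {criterion}\n"
        f"Response to evaluate: {response}\n"
        "Write a brief assessment of how well the response satisfies the criterion."
    )
    # Step 2: ask the judge to commit to exactly one allowed option,
    # conditioned on its own assessment from step 1.
    verdict = call_judge(
        f"Assessment: {assessment}\n"
        f"Choose exactly one option from {options} and reply with that option only."
    )
    return {"verdict": verdict.strip(), "explanation": assessment}
```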


Overview

EvalAssist Teaser Image

Key Features

Flexible Methods

Supports custom criteria development for both direct assessment (rating a single output against a rubric) and pairwise comparison (choosing the better of two candidate outputs), giving users the flexibility to choose the method that best fits their use case.
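
As a rough illustration of how the two methods differ, the dataclasses below sketch a direct assessment criterion (a rubric with named options applied to one output) next to a pairwise comparison criterion (a single comparative question applied to two outputs). The field names are illustrative assumptions, not the actual EvalAssist or Unitxt schema.

```python
from dataclasses import dataclass, field

# Illustrative data shapes only; the field names are assumptions,
# not the actual EvalAssist or Unitxt criterion schema.

@dataclass
class DirectAssessmentCriterion:
    """Rubric applied to a single model output."""
    name: str
    description: str
    options: dict[str, str] = field(default_factory=dict)  # option -> meaning

@dataclass
class PairwiseComparisonCriterion:
    """Comparative question applied to two candidate outputs."""
    name: str
    description: str  # e.g. "Which response answers the question more faithfully?"

conciseness = DirectAssessmentCriterion(
    name="conciseness",
    description="Is the response free of unnecessary repetition and filler?",
    options={"Yes": "No redundant content", "No": "Contains redundant content"},
)
```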

AI-Assisted Criteria Refinement

Allows users to iteratively design and refine evaluation criteria with AI assistance. Synthetic data generation produces edge cases so you can test your criteria against multiple data samples at once.

Trustworthy Results

Empowers users to inspect outcomes with built-in trustworthiness signals such as positional-bias detection and model-generated explanations.
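
Positional bias, for instance, shows up when a pairwise judge's verdict flips merely because the two candidate responses are presented in the opposite order. A minimal sketch of that check, with a placeholder `pairwise_judge` standing in for the real judge call:

```python
from typing import Callable

# Minimal sketch of a positional-bias check for pairwise comparison.
# `pairwise_judge` is a placeholder for the real judge call; it returns
# "A" or "B" for whichever of the two presented responses it prefers.

def has_positional_bias(
    pairwise_judge: Callable[[str, str, str], str],
    criterion: str,
    response_1: str,
    response_2: str,
) -> bool:
    # Ask once in the original order, once with the responses swapped.
    first = pairwise_judge(criterion, response_1, response_2)   # "A" = response_1
    second = pairwise_judge(criterion, response_2, response_1)  # "A" = response_2
    # A consistent judge prefers the same underlying response both times,
    # so the labels should differ once the order is swapped; identical
    # labels mean the verdict followed the position, not the content.
    return first == second
```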

Scalable Toolkit

Built on top of the open-source Unitxt library, EvalAssist lets you export your criteria to Jupyter notebooks that run bulk Unitxt evaluations, making it easy to scale your evaluations programmatically.
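
Conceptually, the exported notebook applies a fixed criterion to every record in a dataset and aggregates the verdicts. The plain-Python sketch below shows only that shape; the actual notebooks use Unitxt's LLM-as-a-judge metrics rather than the placeholder `judge_one` used here.

```python
from collections import Counter
from typing import Callable

# Conceptual sketch of a bulk evaluation loop; `judge_one` is a placeholder.
# Exported notebooks rely on Unitxt's own LLM-as-a-judge metrics instead.

def bulk_evaluate(
    judge_one: Callable[[str, str], str],  # (criterion, response) -> verdict
    criterion: str,
    responses: list[str],
) -> dict:
    if not responses:
        return {"verdicts": [], "distribution": {}}
    verdicts = [judge_one(criterion, response) for response in responses]
    counts = Counter(verdicts)
    return {
        "verdicts": verdicts,
        "distribution": {v: n / len(verdicts) for v, n in counts.items()},
    }
```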

Diverse Models

Integrates a range of general-purpose and specialized LLM judges, including IBM Granite Guardian, Llama 3, Mixtral, and GPT-4, using a chained prompting process to improve the consistency and accuracy of evaluations.

Test Case Catalog

Includes a small catalog of test cases to get you started and allows you to save your own test cases. If you develop criteria that could benefit the larger community, consider contributing them to Unitxt.

Publications

  • Aligning Human and LLM Judgments: Insights from EvalAssist on Task-Specific Evaluations and AI-assisted Assessment Strategy Preferences
    Preprint
  • EvalAssist: LLM-as-a-Judge Simplified
    AAAI 2025
  • Black-box Uncertainty Quantification Method for LLM-as-a-Judge
    NeurIPS 2024
  • Human-Centered Design Recommendations for LLM-as-a-Judge
    ACL 2024 Workshop HuCLLM
  • EvaluLLM: LLM Assisted Evaluation of Generative Outputs
    IUI ’24 Companion

Team

  • Zahra Ashktorab, IBM Research Yorktown
  • Elizabeth Daly, IBM Research Dublin
  • Michael Desmond, IBM Research Yorktown
  • Gina Do, IBM Research Cambridge
  • Werner Geyer, IBM Research Cambridge
  • Erik Miehling, IBM Research Dublin
  • Rahul Nair, IBM Research Dublin
  • Qian Pan, IBM Research Cambridge
  • Tejaswini Pedapati, IBM Research Yorktown
  • Martin Santillán Cooper, IBM Research Argentina