EvalAssist

LLM-as-a-Judge Simplified — Start Small, Refine Fast, Scale Smart

EvalAssist is an application that simplifies using large language models as evaluators (LLM-as-a-Judge) of the output of other large language models. It supports users in iteratively refining evaluation criteria through a web-based user experience, and it is built on the open source Unitxt evaluation library, which uses a multi-step prompt-chaining approach to conduct evaluations. Once users are satisfied with their criteria, they can generate a Jupyter notebook to run Unitxt evaluations at scale.
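
The prompt-chaining idea can be pictured roughly as follows. This is a conceptual sketch only, not EvalAssist's or Unitxt's actual API; `call_llm` and `judge_direct` are hypothetical placeholders for the judge model call and the chained steps it performs.

    # Conceptual sketch of a prompt-chained direct assessment (illustrative only).
    def call_llm(prompt: str) -> str:
        # Stand-in for a real judge model call (e.g., through your provider's SDK).
        return "The response satisfies the criterion. Option: Yes"

    def judge_direct(criterion: str, options: list[str], response: str) -> dict:
        # Step 1: ask the judge model for a free-text assessment of the response.
        assessment = call_llm(
            f"Criterion: {criterion}\nResponse: {response}\n"
            "Assess how well the response satisfies the criterion."
        )
        # Step 2: ask it to commit to one of the criterion's options,
        # conditioned on its own assessment.
        verdict = call_llm(
            f"Assessment: {assessment}\nChoose exactly one option from {options}."
        )
        # Step 3: ask for a short explanation to display next to the verdict.
        explanation = call_llm(f"Explain briefly why '{verdict}' was chosen.")
        return {"verdict": verdict, "explanation": explanation}

    print(judge_direct(
        "Faithfulness to the provided context",
        ["Yes", "Partially", "No"],
        "The report states that revenue grew 12% in 2023.",
    ))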


Overview

[Animated GIF showcasing the overall judging workflow]

Key Features

Flexible Methods

Supports custom criteria development for both direct assessment and pairwise comparison evaluation methods, giving users the flexibility to choose what fits their use case best.
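
To make the distinction concrete, the snippet below sketches the shape of the two methods. The dictionaries are illustrative only and do not mirror EvalAssist's internal criterion schema.

    # Direct assessment: a single output is rated against named options.
    direct_criterion = {
        "name": "conciseness",
        "definition": "Is the response free of unnecessary repetition and filler?",
        "options": ["Yes", "Partially", "No"],
    }

    # Pairwise comparison: two (or more) outputs are ranked against each other.
    pairwise_criterion = {
        "name": "helpfulness",
        "definition": "Which response better resolves the user's request?",
    }

    candidate_a = "Restart the router, then re-run the setup wizard."
    candidate_b = "Have you tried turning it off and on again?"
    # A pairwise judge is asked: given the criterion, is candidate A or B better?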

AI-Assisted Criteria Refinement

Allows users to iteratively design and refine evaluation criteria with AI assistance. Synthetic data generation produces edge cases so you can test your criteria against multiple data samples simultaneously.

Trustworthy Results

Empowers users to inspect outcomes with built-in trustworthiness metrics such as positional bias and model-generated explanations.
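
One way to picture the positional-bias check is to run the same pairwise comparison twice with the candidate order swapped and flag any disagreement. The sketch below illustrates the idea with a stubbed `judge_pair` function standing in for a real judge model; it is not EvalAssist's implementation.

    def judge_pair(criterion: str, first: str, second: str) -> str:
        # Stand-in judge that always prefers whatever it sees first,
        # i.e., an extreme case of positional bias.
        return "first"

    def has_positional_bias(criterion: str, a: str, b: str) -> bool:
        verdict_ab = judge_pair(criterion, a, b)   # a shown first
        verdict_ba = judge_pair(criterion, b, a)   # b shown first
        # A consistent judge should pick the same underlying candidate both times.
        picked_ab = a if verdict_ab == "first" else b
        picked_ba = b if verdict_ba == "first" else a
        return picked_ab != picked_ba

    print(has_positional_bias("helpfulness", "Answer A ...", "Answer B ..."))  # True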

Scalable Toolkit

Built on top of the open source Unitxt library, EvalAssist lets you export your criteria to Jupyter notebooks that run bulk evaluations with Unitxt, making it easy to scale your evaluations programmatically.
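
Conceptually, the exported notebook applies your criterion to every row of a dataset. The loop below sketches that idea with a hypothetical `judge` stub rather than the actual Unitxt calls a generated notebook would contain.

    # Conceptual sketch of bulk evaluation; an exported notebook would use
    # Unitxt's own APIs instead of this stubbed judge function.
    def judge(criterion: str, response: str) -> str:
        return "Yes"  # stand-in verdict from a judge model

    criterion = "Is the summary faithful to the source document?"
    dataset = [
        {"id": 1, "response": "Sales rose 12% year over year."},
        {"id": 2, "response": "The company was founded on the moon."},
    ]

    results = [
        {"id": row["id"], "verdict": judge(criterion, row["response"])}
        for row in dataset
    ]
    print(results)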

Diverse Models

Integrates a range of general and specialized LLM judges, including IBM Granite Guardian, Llama 3, Mixtral, and GPT-4, using a chained prompting process to ensure consistent and accurate evaluations.

Test Case Catalog

Includes a small catalog of test cases to get you started and allows you to save your own. If you develop criteria that could benefit the larger community, consider contributing them to Unitxt.

Getting Started

Install and launch EvalAssist with any one of the following options:

  • pip with venv (macOS/Linux):
      python3 -m venv venv
      source venv/bin/activate
      pip install evalassist
      eval-assist serve
  • pip with venv (Windows):
      python3 -m venv venv
      venv\Scripts\activate.bat
      pip install evalassist
      eval-assist serve
  • uvx:
      uvx --python 3.11 --from evalassist eval-assist serve
  • conda:
      conda create -n evalassist python=3.11
      conda activate evalassist
      pip install evalassist
      eval-assist serve

Each option installs all necessary dependencies and starts EvalAssist. After installation, you can access EvalAssist at http://localhost:8000. For more detailed instructions and tutorials, check out our Installation Guide.
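
Once the server is running, a quick way to confirm the UI is reachable from Python (assuming the default port 8000 noted above; this snippet is an illustrative check, not part of EvalAssist):

    import urllib.request

    # Simple reachability check against a locally running EvalAssist server.
    with urllib.request.urlopen("http://localhost:8000", timeout=5) as resp:
        print("EvalAssist is up, HTTP status:", resp.status)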

Publications

  • Aligning Human and LLM Judgments: Insights from EvalAssist on Task-Specific Evaluations and AI-assisted Assessment Strategy Preferences
    Preprint
  • EvalAssist: A Human-Centered Tool for LLM-as-a-Judge
    HEAL @ CHI 2025 (Workshop on Human-centered Evaluation and Auditing of Language Models)
  • EvalAssist: LLM-as-a-Judge Simplified
    AAAI 2025
  • Black-box Uncertainty Quantification Method for LLM-as-a-Judge
    NeurIPS 2024
  • Human-Centered Design Recommendations for LLM-as-a-Judge
    ACL 2024 Workshop HuCLLM
  • EvaluLLM: LLM Assisted Evaluation of Generative Outputs
    IUI ’24 Companion

Team

Zahra Ashktorab

IBM Research Yorktown

Elizabeth Daly

IBM Research Dublin

Michael Desmond

IBM Research Yorktown

Gina Do

IBM Research Cambridge

Werner Geyer

IBM Research Cambridge

Erik Miehling

IBM Research Dublin

Rahul Nair

IBM Research Dublin

Qian Pan

IBM Research Cambridge

Tejaswini Pedapati

IBM Research Yorktown

Martin Santillán Cooper

IBM Research Argentina