EvalAssist

LLM-as-a-Judge Simplified — Start Small, Refine Fast, Scale Smart

EvalAssist is an application that simplifies using large language models as evaluators (LLM-as-a-Judge) of the output of other large language models. It supports users in iteratively refining evaluation criteria in a web-based user experience. EvalAssist is built on the open-source Unitxt evaluation library and uses a multi-step prompt-chaining approach to conduct evaluations. Once users are satisfied with their criteria, they can generate a Jupyter notebook that runs Unitxt evaluations of those criteria at scale.
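
To make the prompt-chaining idea concrete, the sketch below shows what a two-step chain for direct assessment could look like: the judge first writes an assessment of the response against the criterion, then commits to one of the allowed options. The helper `call_judge` and the prompt wording are placeholders for illustration, not EvalAssist's actual prompts or API.

```python
# Illustrative sketch of a multi-step prompt chain for direct assessment.
# `call_judge` and the prompt wording are placeholders, not EvalAssist's
# actual prompts or API.

def call_judge(prompt: str) -> str:
    """Send a prompt to the judge model and return its reply (placeholder)."""
    raise NotImplementedError("Connect this to your LLM provider of choice.")

def direct_assessment(criterion: str, options: list[str], response: str) -> dict:
    # Step 1: ask the judge to reason about the response against the criterion.
    assessment = call_judge(
        f"Criterion: {criterion}\n"
        f"Response to evaluate: {response}\n"
        "Write a brief assessment of how well the response satisfies the criterion."
    )
    # Step 2: ask the judge to commit to exactly one allowed option,
    # conditioned on its own assessment from step 1.
    verdict = call_judge(
        f"Assessment: {assessment}\n"
        f"Choose exactly one option from {options} and reply with that option only."
    )
    return {"verdict": verdict.strip(), "explanation": assessment}
```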


Overview

EvalAssist Teaser Image

Key Features

Flexible Methods

Supports custom criteria development for both direct assessment (rating a single output against a rubric) and pairwise comparison (choosing the better of two candidate outputs), giving users the flexibility to choose the method that best fits their use case.
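
As a rough illustration of how the two methods differ, the dataclasses below sketch a direct assessment criterion (a rubric with named options applied to one output) next to a pairwise comparison criterion (a single comparative question applied to two outputs). The field names are illustrative assumptions, not the actual EvalAssist or Unitxt schema.

```python
from dataclasses import dataclass, field

# Illustrative data shapes only; the field names are assumptions,
# not the actual EvalAssist or Unitxt criterion schema.

@dataclass
class DirectAssessmentCriterion:
    """Rubric applied to a single model output."""
    name: str
    description: str
    options: dict[str, str] = field(default_factory=dict)  # option -> meaning

@dataclass
class PairwiseComparisonCriterion:
    """Comparative question applied to two candidate outputs."""
    name: str
    description: str  # e.g. "Which response answers the question more faithfully?"

conciseness = DirectAssessmentCriterion(
    name="conciseness",
    description="Is the response free of unnecessary repetition and filler?",
    options={"Yes": "No redundant content", "No": "Contains redundant content"},
)
```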

AI-Assisted Criteria Refinement

Allows users to iteratively design and refine evaluation criteria with AI assistance. Synthetic data generation produces edge cases so you can test your criteria against multiple data samples at once.

Trustworthy Results

Empowers users to inspect outcomes with built-in trustworthiness signals such as positional-bias detection and model-generated explanations.
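
Positional bias, for instance, shows up when a pairwise judge's verdict flips merely because the two candidate responses are presented in the opposite order. A minimal sketch of that check, with a placeholder `pairwise_judge` standing in for the real judge call:

```python
from typing import Callable

# Minimal sketch of a positional-bias check for pairwise comparison.
# `pairwise_judge` is a placeholder for the real judge call; it returns
# "A" or "B" for whichever of the two presented responses it prefers.

def has_positional_bias(
    pairwise_judge: Callable[[str, str, str], str],
    criterion: str,
    response_1: str,
    response_2: str,
) -> bool:
    # Ask once in the original order, once with the responses swapped.
    first = pairwise_judge(criterion, response_1, response_2)   # "A" = response_1
    second = pairwise_judge(criterion, response_2, response_1)  # "A" = response_2
    # A consistent judge prefers the same underlying response both times,
    # so the labels should differ once the order is swapped; identical
    # labels mean the verdict followed the position, not the content.
    return first == second
```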

Scalable Toolkit

Built on top of the open-source Unitxt library, EvalAssist lets you export your criteria to Jupyter notebooks that run bulk Unitxt evaluations, making it easy to scale your evaluations programmatically.
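
Conceptually, the exported notebook applies a fixed criterion to every record in a dataset and aggregates the verdicts. The plain-Python sketch below shows only that shape; the actual notebooks use Unitxt's LLM-as-a-judge metrics rather than the placeholder `judge_one` used here.

```python
from collections import Counter
from typing import Callable

# Conceptual sketch of a bulk evaluation loop; `judge_one` is a placeholder.
# Exported notebooks rely on Unitxt's own LLM-as-a-judge metrics instead.

def bulk_evaluate(
    judge_one: Callable[[str, str], str],  # (criterion, response) -> verdict
    criterion: str,
    responses: list[str],
) -> dict:
    if not responses:
        return {"verdicts": [], "distribution": {}}
    verdicts = [judge_one(criterion, response) for response in responses]
    counts = Counter(verdicts)
    return {
        "verdicts": verdicts,
        "distribution": {v: n / len(verdicts) for v, n in counts.items()},
    }
```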

Diverse Models

Integrates a range of general-purpose and specialized LLM judges, including IBM Granite Guardian, Llama 3, Mixtral, and GPT-4, using a chained prompting process to improve the consistency and accuracy of evaluations.

Test Case Catalog

Includes a small catalog of test cases to get you started and allows you to save your own test cases. If you develop criteria that could benefit the larger community, consider contributing them to Unitxt.

Publications

  • Aligning Human and LLM Judgments: Insights from EvalAssist on Task-Specific Evaluations and AI-assisted Assessment Strategy Preferences
    Preprint
  • EvalAssist: LLM-as-a-Judge Simplified
    AAAI 2025
  • Black-box Uncertainty Quantification Method for LLM-as-a-Judge
    NeurIPS 2024
  • Human-Centered Design Recommendations for LLM-as-a-Judge
    ACL 2024 Workshop HuCLLM
  • EvaluLLM: LLM Assisted Evaluation of Generative Outputs
    IUI ’24 Companion

Team

  • Zahra Ashktorab, IBM Research Yorktown
  • Elizabeth Daly, IBM Research Dublin
  • Michael Desmond, IBM Research Yorktown
  • Gina Do, IBM Research Cambridge
  • Werner Geyer, IBM Research Cambridge
  • Erik Miehling, IBM Research Dublin
  • Rahul Nair, IBM Research Dublin
  • Qian Pan, IBM Research Cambridge
  • Tejaswini Pedapati, IBM Research Yorktown
  • Martin Santillán Cooper, IBM Research Argentina