Evaluation Configuration

The evaluation node defines how ARES assesses the performance of the target model/app under adversarial attack. Evaluators analyze the model’s responses and determine whether they meet safety, robustness, or refusal criteria.

Basic Configuration

Here is an example using a keyword-based evaluator:

evaluation:
  type: ares.evals.keyword_eval.KeywordEval
  keyword_list_or_path: 'assets/advbench_refusal_keywords.json'  # path to refusal keywords
  input_path: 'assets/ares_attacks.json'  # path to generated attack prompts
  output_path: 'assets/evaluation.json'  # path to save evaluation results

Supported evaluator types can be found in the evals package. These include keyword matchers, regex filters, and scoring-based evaluators.

LLM-Based Evaluation

ARES supports evaluation using external language models (LLMs) to judge responses. These evaluators use a connector to query a model and ask whether the response is harmful, evasive, or aligned with expected behavior. This is particularly useful for subjective or nuanced evaluations.

For implementation details, see:

ares.evals.llm_eval.LLMEval — uses a language model to judge responses based on predefined criteria.

This approach allows ARES to simulate human-like judgment and scale evaluations across diverse attack strategies.

Example use cases:

Using a model to classify responses as harmful or safe
Asking an LLM to rate the quality or safety of a response
Comparing outputs against expected refusal behavior

Agentic Evaluation

ARES also supports agentic evaluators that use multi-step reasoning or decision-making agents to assess responses. These agents may simulate user behavior or apply complex logic to determine whether the model failed under attack.

To use LLM or agentic evaluators, configure the evaluation node to invoke a connector and specify the evaluation logic.

Note

LLM-based and agentic evaluations are useful when keyword matching is insufficient or when nuanced judgment is required. These evaluators can be plugged in via ARES plugins or custom connectors.

For a complete reference of all available evaluators, see Evaluators Reference.