Evaluation Configuration

The evaluation node defines how ARES assesses the performance of the target model/app under adversarial attack. Evaluators analyze the model’s responses and determine whether they meet safety, robustness, or refusal criteria.

ARES supports evaluation for both single-turn and multi-turn attack strategies.

Basic Configuration

Single-Turn Evaluation Example:

evaluation:
  type: ares.evals.keyword_eval.KeywordEval
  keyword_list_or_path: 'assets/advbench_refusal_keywords.json'
  input_path: 'assets/ares_attacks.json'
  output_path: 'assets/evaluation.json'

Multi-Turn Evaluation Example:

evaluation:
  type: ares.evals.multi_turn_eval.MultiTurnEval
  name: multi_turn_eval
  input_path: 'results/multi_turn_attacks.json'
  output_path: 'results/multi_turn_evaluation.json'
  mode: conversation  # Options: 'prompt', 'conversation', 'goal'

Supported evaluator types can be found in the evals package. These include keyword matchers, regex filters, scoring-based evaluators, and multi-turn evaluators.

Multi-Turn Evaluation Modes

The MultiTurnEval evaluator supports three evaluation modes:

  • prompt: Evaluates each individual prompt-response pair in the conversation

  • conversation: Evaluates the entire conversation as a whole

  • goal: Evaluates whether the attack goal was achieved across the conversation

Note

Multi-turn strategies using the MultiTurn base class automatically provide an attack_successful field with “Yes”/”No” values based on the strategy’s success criteria.

LLM-Based Evaluation

ARES supports evaluation using external language models (LLMs) to judge responses. These evaluators use a connector to query a model and ask whether the response is harmful, evasive, or aligned with expected behavior. This is particularly useful for subjective or nuanced evaluations.

For implementation details, see:

  • ares.evals.llm_eval.LLMEval — uses a language model to judge responses based on predefined criteria

  • ares.evals.multi_turn_eval.MultiTurnEval — evaluates multi-turn conversations with flexible scoring

This approach allows ARES to simulate human-like judgment and scale evaluations across diverse attack strategies.

Example use cases:

  • Using a model to classify responses as harmful or safe

  • Asking an LLM to rate the quality or safety of a response

  • Comparing outputs against expected refusal behavior

  • Evaluating multi-turn conversation success

Agentic Evaluation

ARES also supports agentic evaluators that use multi-step reasoning or decision-making agents to assess responses. These agents may simulate user behavior or apply complex logic to determine whether the model failed under attack.

To use LLM or agentic evaluators, configure the evaluation node to invoke a connector and specify the evaluation logic.

Note

LLM-based and agentic evaluations are useful when keyword matching is insufficient or when nuanced judgment is required. These evaluators can be plugged in via ARES plugins or custom connectors.

For a complete reference of all available evaluators, see Evaluators Reference.