ares.evals package

Submodules

ares.evals.attack_eval module

ares.evals.huggingface_eval module

ares.evals.keyword_eval module

ares.evals.llm_eval module

ares.evals.metrics module

Functions for handling the behaviour of evaluators on benign data, to give a clearer view of their performance

ares.evals.metrics.keyword_benign_eval(labels: numpy.ndarray, prediction: numpy.ndarray) dict[source]

Keyword evaluation of benign inputs
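The returned dictionary's keys are not documented here; as a rough illustration, a benign keyword evaluation with this signature might compute confusion-matrix counts from binary labels and predictions. The function name, the key names, and the 0/1 encoding below are assumptions for the sketch, not the library's actual output:

```python
import numpy as np

def keyword_benign_counts(labels: np.ndarray, prediction: np.ndarray) -> dict:
    """Hypothetical sketch: confusion-matrix counts for benign-input evaluation.

    Assumes a binary encoding (1 = harmful, 0 = benign) for both arrays;
    the real ares.evals.metrics.keyword_benign_eval may report differently.
    """
    labels = np.asarray(labels).astype(int)
    prediction = np.asarray(prediction).astype(int)
    return {
        "true_positive": int(np.sum((labels == 1) & (prediction == 1))),
        "false_positive": int(np.sum((labels == 0) & (prediction == 1))),
        "true_negative": int(np.sum((labels == 0) & (prediction == 0))),
        "false_negative": int(np.sum((labels == 1) & (prediction == 0))),
    }
```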

ares.evals.metrics.llm_benign_scores(labels: numpy.ndarray, prediction: numpy.ndarray) dict[source]

This method is to be used when the LLM evaluator returns a 1 if the output is benign, so that results have a consistent interpretation.

The default llm-eval mapping from Harm (H) / Benign (B) prompt-output pairs to prompt label / eval prediction is:

BB -> 01
BH -> 00
HB -> 11
HH -> 10

It is highly counterintuitive that:

benign prompt -> benign llm output -> the evaluator classifying benign

is interpreted as a false positive (BB -> 01).

Here, to have more interpretable results, we flip the prediction meaning for benign samples. The truth table then becomes:

BB -> 00
BH -> 01

To have a clear interpretation of results, these are explicitly labelled as benign_output/harmful_output.
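The flip described above can be sketched in a few lines of NumPy. The helper name is hypothetical, and the sketch assumes 0 encodes a benign prompt label, consistent with the mapping shown above:

```python
import numpy as np

def flip_benign_predictions(labels: np.ndarray, prediction: np.ndarray) -> np.ndarray:
    """Hypothetical helper: invert predictions for benign-labelled (0) samples.

    With the default mapping, BB -> 01 and BH -> 00 become the more
    intuitive BB -> 00 and BH -> 01; harm-labelled samples are untouched.
    """
    labels = np.asarray(labels)
    prediction = np.asarray(prediction)
    return np.where(labels == 0, 1 - prediction, prediction)
```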

ares.evals.metrics.privacy_eval(labels: numpy.ndarray, prediction: numpy.ndarray) dict

Keyword evaluation of benign inputs

Module contents

ARES Core evaluators