ares.evals package

Submodules

ares.evals.attack_eval module

ares.evals.huggingface_eval module

ares.evals.keyword_eval module

ares.evals.llm_eval module

ares.evals.metrics module

Functions for handling the behaviour of evaluators on benign data, to give a clearer view of their performance

ares.evals.metrics.keyword_benign_eval(labels: numpy.ndarray, prediction: numpy.ndarray) dict[source]

Keyword evaluation of benign inputs
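The returned dictionary's keys are not documented here; as a rough illustration, a benign keyword evaluation with this signature might compute confusion-matrix counts from binary labels and predictions. The function name, the key names, and the 0/1 encoding below are assumptions for the sketch, not the library's actual output:

```python
import numpy as np

def keyword_benign_counts(labels: np.ndarray, prediction: np.ndarray) -> dict:
    """Hypothetical sketch: confusion-matrix counts for benign-input evaluation.

    Assumes a binary encoding (1 = harmful, 0 = benign) for both arrays;
    the real ares.evals.metrics.keyword_benign_eval may report differently.
    """
    labels = np.asarray(labels).astype(int)
    prediction = np.asarray(prediction).astype(int)
    return {
        "true_positive": int(np.sum((labels == 1) & (prediction == 1))),
        "false_positive": int(np.sum((labels == 0) & (prediction == 1))),
        "true_negative": int(np.sum((labels == 0) & (prediction == 0))),
        "false_negative": int(np.sum((labels == 1) & (prediction == 0))),
    }
```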

ares.evals.metrics.llm_benign_scores(labels: numpy.ndarray, prediction: numpy.ndarray) dict[source]

This method is to be used when the LLM evaluator returns a 1 if the output is benign, so that results have a consistent interpretation.

The default llm-eval mapping from Harm (H) / Benign (B) prompt-output pairs to prompt label / eval prediction is:

BB -> 01
BH -> 00
HB -> 11
HH -> 10

It is highly counterintuitive that:

benign prompt -> benign llm output -> the evaluator classifying benign

is interpreted as a false positive (BB -> 01).

Here, to have more interpretable results, we flip the prediction meaning for benign samples. The truth table then becomes:

BB -> 00
BH -> 01

To have a clear interpretation of results, these are explicitly labelled as benign_output/harmful_output.
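The flip described above can be sketched in a few lines of NumPy. The helper name is hypothetical, and the sketch assumes 0 encodes a benign prompt label, consistent with the mapping shown above:

```python
import numpy as np

def flip_benign_predictions(labels: np.ndarray, prediction: np.ndarray) -> np.ndarray:
    """Hypothetical helper: invert predictions for benign-labelled (0) samples.

    With the default mapping, BB -> 01 and BH -> 00 become the more
    intuitive BB -> 00 and BH -> 01; harm-labelled samples are untouched.
    """
    labels = np.asarray(labels)
    prediction = np.asarray(prediction)
    return np.where(labels == 0, 1 - prediction, prediction)
```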

ares.evals.metrics.privacy_eval(labels: numpy.ndarray, prediction: numpy.ndarray) dict

Keyword evaluation of benign inputs

Module contents

ARES Core evaluators