Adding your own metric

Evaluation metrics are intended to be consumed by use cases. This guide illustrates how to add new metrics. Broadly, metrics fall into two categories:

  • Generic metrics: metrics that can be called from any use case.
  • Custom metrics: metrics that are intended to be called from a specific use case (e.g., question answering).

Depending on the metric category, structure your files in aisteer360/evaluation/metrics as follows:

aisteer360/
└── evaluation/
    └── metrics/
        ├── custom/
        │   └── <my_use_case>/
        │       └── <custom_metric_name>.py
        └── generic/
            └── <generic_metric_name>.py

Implementation of a new metric is the same regardless of the metric's category. Both generic and custom metrics can be one of two types:

  • standard: subclasses Metric from aisteer360.evaluation.metrics.base
  • LLM-as-a-judge: subclasses LLMJudgeMetric from aisteer360.evaluation.metrics.base_judge

All metrics compute scores from, at minimum, a response; a prompt can optionally be supplied. Any other necessary arguments can be passed to the metric's compute method via kwargs.
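
A minimal sketch of this interface is shown below (the ExactMatch name and the references keyword argument are hypothetical, for illustration only):

from typing import Any

from aisteer360.evaluation.metrics.base import Metric


class ExactMatch(Metric):
    """Illustrative metric: fraction of responses that exactly match a reference."""

    def compute(
        self,
        responses: list[str],
        prompts: list[str] | None = None,
        **kwargs: Any,
    ) -> dict[str, float]:
        # any extra arguments (here, a hypothetical list of references) arrive via kwargs
        references: list[str] = kwargs["references"]
        matches = sum(response == reference for response, reference in zip(responses, references))
        return {"exact_match": matches / len(responses) if responses else 0.0}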

Implementing a standard metric

A standard metric is any metric that requires fully custom compute logic, implemented as a function of responses, prompts, and kwargs. Any necessary parameter initialization should be added to the metric's constructor (__init__).

Below is an example implementation of a DistinctN metric, which computes the fraction of distinct n-grams (unigrams, bigrams, etc.) across responses.

from itertools import islice
from typing import Any

from aisteer360.evaluation.metrics.base import Metric


class DistinctN(Metric):
    """Corpus-level Distinct-n (Li et al., 2015).

    Distinct-n = (# unique n-grams) / (# total n-grams)

    Args:
        n (int, optional): Size of the n-gram.

    Li, J., Galley, M., Brockett, C., Gao, J. and Dolan, B., 2015.
    A diversity-promoting objective function for neural conversation models.
    arXiv preprint arXiv:1510.03055.
    """

    def __init__(self, n: int = 2):
        super().__init__()
        self.n = n

    def _ngrams(self, tokens: list[str]):
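        # create n staggered views of the token list; zipping them yields successive n-grams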
        return zip(*(islice(tokens, i, None) for i in range(self.n)))

    def compute(
        self,
        responses: list[str],
        prompts: list[str] | None = None,
        **kwargs: Any,
    ) -> dict[str, float]:
        total_ngrams = 0
        unique_ngrams: set[tuple[str, ...]] = set()

        for response in responses:
            response = response.lower()
            tokens = response.split()
            grams = list(self._ngrams(tokens))
            total_ngrams += len(grams)
            unique_ngrams.update(grams)

        score = len(unique_ngrams) / total_ngrams if total_ngrams else 0.0
        return {
            f"distinct_{self.n}": score
        }

The above metric is called as follows:

from aisteer360.evaluation.metrics.generic.distinct_n import DistinctN

responses = [
    "I love exploring new places.",
    "I love exploring new places.",
    "Traveling is my passion."
]

unigram = DistinctN(n=1)

unigrams = unigram.compute(responses=responses)
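
print(unigrams)  # {'distinct_1': 0.6428571428571429} (9 unique unigrams / 14 total)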

Implementing an LLM-as-a-judge metric

To facilitate evaluation of more complex quantities, we have implemented a base class for LLM-as-a-judge metrics (LLMJudgeMetric) that extends the Metric class.

Implementing an LLM-as-a-judge metric only requires specifying a prompt template (_PROMPT) and a scale interval (scale). The prompt template must contain placeholders for the response ({response}) and for the lower and upper bounds of the scale interval ({lower_bound} and {upper_bound}). Optionally, the template can also contain a placeholder for the prompt ({prompt}). For instance, the Factuality metric below requires both the response (the model's answer) and the prompt (the question).

from aisteer360.evaluation.metrics.base_judge import LLMJudgeMetric


_PROMPT = """\
You are a careful fact-checker.

Considering only verifiable facts, rate the response’s factual accuracy with respect to the prompt on a scale from
{lower_bound} (completely incorrect) to {upper_bound} (fully correct).

PROMPT:
{prompt}

RESPONSE:
{response}

What is your score?
"""


class Factuality(LLMJudgeMetric):
    """
    Judge factual correctness of an answer to a question.
    """

    def __init__(self, *args, **kwargs):
        super().__init__(
            *args,
            prompt_template=_PROMPT,
            scale=(1, 5),
            **kwargs,
        )

LLM-as-a-judge metrics are initialized by specifying the judge model (via model_or_id) and any generation parameters (via gen_kwargs). Note that the judge can be run multiple times on a given input, as dictated by num_return_sequences. For example, using the generic Relevance metric:

from aisteer360.evaluation.metrics.generic.relevance import Relevance

# metric parameters
judge_model = "meta-llama/Llama-3.2-3B-Instruct"
judge_gen_kwargs = {
    "temperature": 0.8,
    "num_return_sequences": 3,
    "do_sample": True
}

# initialize metric
answer_relevance = Relevance(
    model_or_id=judge_model,
    gen_kwargs=judge_gen_kwargs
)

# run the metric
questions = ["What is the capital of Ireland?"]
answers = ["Dublin."]
relevance = answer_relevance(responses=answers, prompts=questions)

To see how metrics are called within an evaluation, see the tutorial on adding your own use case.