Instruction following metrics
aisteer360.evaluation.metrics.custom.instruction_following
Evaluation metrics for the InstructionFollowing use case.
helpers
We have omitted the documentation details on the IFEval functions (located in helpers/) from our API reference. For details, please see the IFEval repo: https://github.com/google-research/google-research/tree/master/instruction_following_eval.
strict_instruction
StrictInstruction
Bases: Metric
Evaluation wrapper around IFEval's official implementation from Google Research (https://github.com/google-research/google-research/tree/master/instruction_following_eval). Measures how well models follow explicit instructions embedded within prompts, using strict binary evaluation criteria.
Source code in aisteer360/evaluation/metrics/custom/instruction_following/strict_instruction.py
extras = extras (instance attribute)
name = self.__class__.__name__ (instance attribute)
compute(responses=None, prompts=None, **kwargs)
Computes strict instruction-following metrics using IFEval evaluation.
Evaluates model responses against structured instructions using the official IFEval framework. Each response is assessed both at the prompt level (whether ALL instructions were followed) and at the individual instruction level.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
responses | `list[dict] \| None` | List of response dictionaries, each containing: | `None` |
prompts | `list[str] \| None` | List of question prompts (unused, for interface compatibility). | `None` |
**kwargs | | Additional arguments (unused). | `{}` |
Returns:
Type | Description |
---|---|
`dict[str, Any]` | Dictionary of instruction-following metrics with values: |
Note:
- Returns zero accuracies and empty list if responses is None or empty.
- The evaluation uses strict binary criteria (partial compliance counts as failure).
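To make the difference between prompt-level and instruction-level scoring concrete, here is a minimal sketch of how per-instruction pass/fail flags could be aggregated under strict binary criteria. It is not the library's implementation: the function name `aggregate_strict`, its input shape, and the result keys (`prompt_level_accuracy`, `instruction_level_accuracy`, `per_prompt`) are all hypothetical and chosen only for illustration. In practice the flags would come from the IFEval checkers.

```python
from typing import Any, Optional


def aggregate_strict(
    per_prompt_results: Optional[list[list[bool]]],
) -> dict[str, Any]:
    """Aggregate per-instruction pass/fail flags into two accuracies.

    Hypothetical helper for illustration; not part of aisteer360 or IFEval.
    Each inner list holds one flag per instruction attached to a prompt.
    """
    if not per_prompt_results:
        # Mirrors the documented behaviour for None or empty input:
        # zero accuracies and an empty list.
        return {
            "prompt_level_accuracy": 0.0,
            "instruction_level_accuracy": 0.0,
            "per_prompt": [],
        }

    # Strict criterion: a prompt passes only if ALL of its instructions pass,
    # so partial compliance counts as failure at the prompt level.
    per_prompt = [all(flags) for flags in per_prompt_results]

    total_instructions = sum(len(flags) for flags in per_prompt_results)
    followed = sum(sum(flags) for flags in per_prompt_results)

    return {
        "prompt_level_accuracy": sum(per_prompt) / len(per_prompt),
        "instruction_level_accuracy": (
            followed / total_instructions if total_instructions else 0.0
        ),
        "per_prompt": per_prompt,
    }


# Three prompts: one fully compliant, one partially compliant, one failing.
results = aggregate_strict([[True, True], [True, False], [False]])
empty_results = aggregate_strict(None)
```

In this example the second prompt follows one of its two instructions, so it contributes to the instruction-level accuracy but still counts as a failure at the prompt level.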
Source code in aisteer360/evaluation/metrics/custom/instruction_following/strict_instruction.py