Commonsense MCQA metrics
aisteer360.evaluation.metrics.custom.commonsense_mcqa
Evaluation metrics for the CommonsenseMCQA use case.
mcqa_accuracy
MCQAAccuracy
Bases: Metric
Exact-match accuracy for multiple-choice QA.
Source code in aisteer360/evaluation/metrics/custom/commonsense_mcqa/mcqa_accuracy.py
extras = extras (instance attribute)
name = self.__class__.__name__ (instance attribute)
compute(responses, prompts=None, reference_answers=None, question_ids=None, **kwargs)
Computes trial-level and question-level accuracy metrics.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| responses | list[str] | List of predicted answer choices (e.g., 'A', 'B', 'C', 'D'). | required |
| prompts | list[str] \| None | List of question prompts (unused, for interface compatibility). | None |
| reference_answers | list[str] \| None | List of correct answer choices. | None |
| question_ids | list[str] \| None | Optional question IDs for grouping responses by question. | None |
| **kwargs | | Additional arguments (unused). | {} |

Returns:

| Type | Description |
|---|---|
| dict[str, float] | Dictionary of accuracy score statistics. |

Raises:

| Type | Description |
|---|---|
| ValueError | If reference_answers is None or length mismatches occur. |
Source code in aisteer360/evaluation/metrics/custom/commonsense_mcqa/mcqa_accuracy.py
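A minimal usage sketch (assumptions: the class is imported from the module path shown above and can be constructed without arguments; the exact keys of the returned dictionary are implementation-specific):

```python
from aisteer360.evaluation.metrics.custom.commonsense_mcqa.mcqa_accuracy import MCQAAccuracy

# Two questions ("q1", "q2"), each answered in two trials.
responses = ["A", "B", "C", "C"]          # predicted choice per trial
reference_answers = ["A", "A", "C", "C"]  # correct choice per trial
question_ids = ["q1", "q1", "q2", "q2"]   # groups trials by question

metric = MCQAAccuracy()  # assumes a no-argument constructor
scores = metric.compute(
    responses=responses,
    reference_answers=reference_answers,
    question_ids=question_ids,
)
# scores is a dict[str, float] of trial-level and question-level accuracy statistics
```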
mcqa_calibration
MCQACalibration
Bases: Metric
Calibration metrics for multiple-choice QA.
Measures how well model confidence scores align with actual performance using Expected Calibration Error (ECE) and related metrics.
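For orientation, the sketch below shows the textbook binned ECE computation: bucket predictions by confidence, then take the frequency-weighted average of the gap between accuracy and mean confidence in each bucket. It illustrates the concept only and is not necessarily the exact binning used by MCQACalibration; the n_bins argument mirrors the n_bins attribute listed below.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Textbook binned ECE: weighted average of |accuracy - confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)  # bin index in 0 .. n_bins - 1
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        bin_acc = correct[mask].mean()        # empirical accuracy in the bin
        bin_conf = confidences[mask].mean()   # mean predicted confidence in the bin
        ece += mask.mean() * abs(bin_acc - bin_conf)
    return float(ece)
```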
Source code in aisteer360/evaluation/metrics/custom/commonsense_mcqa/mcqa_calibration.py
extras = extras (instance attribute)
n_bins = n_bins (instance attribute)
name = self.__class__.__name__ (instance attribute)
compute(responses, reference_answers=None, confidence_scores=None, question_ids=None, **kwargs)
Computes calibration metrics for model predictions.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| responses | list[str] | List of predicted answer choices (e.g., 'A', 'B', 'C', 'D'). | required |
| reference_answers | list[str] | List of correct answer choices. | None |
| confidence_scores | list[float] | List of model confidence scores (0.0 to 1.0). | None |
| question_ids | list[str] \| None | Optional question IDs (unused, for interface compatibility). | None |
| **kwargs | | Additional arguments (unused). | {} |

Returns:

| Type | Description |
|---|---|
| dict[str, float] | Dictionary of calibration metrics. |

Raises:

| Type | Description |
|---|---|
| ValueError | If reference_answers or confidence_scores is None. |
Source code in aisteer360/evaluation/metrics/custom/commonsense_mcqa/mcqa_calibration.py
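A minimal usage sketch (assumptions: the class is imported from the module path shown above, and n_bins is accepted by the constructor, inferred from the n_bins instance attribute; returned keys are implementation-specific):

```python
from aisteer360.evaluation.metrics.custom.commonsense_mcqa.mcqa_calibration import MCQACalibration

responses = ["A", "C", "B", "D"]              # predicted choices
reference_answers = ["A", "C", "C", "D"]      # correct choices
confidence_scores = [0.92, 0.85, 0.60, 0.71]  # model confidence per prediction (0.0 to 1.0)

metric = MCQACalibration(n_bins=10)  # n_bins as a constructor argument is an assumption
results = metric.compute(
    responses=responses,
    reference_answers=reference_answers,
    confidence_scores=confidence_scores,
)
# results is a dict[str, float] of calibration metrics (ECE and related statistics)
```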
mcqa_positional_bias
MCQAPositionalBias
Bases: Metric
Positional bias metrics for multiple-choice QA.
Measures whether the model exhibits bias toward selecting certain answer positions.
Source code in aisteer360/evaluation/metrics/custom/commonsense_mcqa/mcqa_positional_bias.py
extras = extras (instance attribute)
name = self.__class__.__name__ (instance attribute)
compute(responses, prompts=None, question_ids=None, **kwargs)
Computes positional bias metrics for model predictions.
Calculates how much the model's choice frequencies deviate from a uniform distribution across answer positions. For K answer choices, each position should ideally be selected 1/K of the time (see the usage sketch at the end of this section).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| responses | list[str] | List of predicted answer choices (e.g., 'A', 'B', 'C', 'D'). | required |
| prompts | list[str] \| None | List of question prompts (unused, for interface compatibility). | None |
| question_ids | list[str] \| None | Optional question IDs for computing per-question bias variance. | None |
| **kwargs | | Additional arguments (unused). | {} |

Returns:

| Type | Description |
|---|---|
| dict[str, float] | Dictionary of positional bias metrics. |
Note:
- If question_ids is None, per-question analysis is skipped and std will be 0.0.
Source code in aisteer360/evaluation/metrics/custom/commonsense_mcqa/mcqa_positional_bias.py
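A minimal usage sketch, as referenced in the compute() description above (assumptions: no-argument constructor; returned keys are implementation-specific). With four answer choices, an unbiased model would select each position roughly 1/4 of the time, so the over-selected 'A' below should register as positional bias:

```python
from aisteer360.evaluation.metrics.custom.commonsense_mcqa.mcqa_positional_bias import MCQAPositionalBias

# Eight single-trial questions; the model over-selects position 'A'.
responses = ["A", "A", "B", "A", "C", "A", "D", "A"]
question_ids = ["q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8"]

metric = MCQAPositionalBias()  # assumes a no-argument constructor
results = metric.compute(responses=responses, question_ids=question_ids)
# results is a dict[str, float] quantifying deviation from the uniform 1/K baseline
```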