CommonsenseMCQA
aisteer360.evaluation.use_cases.commonsense_mcqa
Use case class for the commonsense multiple-choice question answering (MCQA) task.
use_case
CommonsenseMCQA
Bases: UseCase
Commonsense MCQA evaluation use case.
Evaluates a model's ability to answer commonsense questions, measured by accuracy on the CommonsenseQA dataset (https://huggingface.co/datasets/tau/commonsense_qa). Supports answer-choice shuffling across multiple runs to reduce position bias and improve evaluation robustness.
The evaluation data should contain questions with multiple choice options where models are asked to respond with only the letter (A, B, C, etc.) corresponding to their chosen answer.
Attributes:

| Name | Type | Description |
|---|---|---|
| `num_shuffling_runs` | `int` | Number of times to shuffle answer choices for each question to mitigate position bias effects. |
Source code in aisteer360/evaluation/use_cases/commonsense_mcqa/use_case.py
evaluation_data = [(json.loads(line)) for line in f] if path.suffix == '.jsonl' else json.load(f)
instance-attribute
evaluation_metrics = evaluation_metrics
instance-attribute
num_shuffling_runs
instance-attribute
evaluate(generations)
Evaluates generated responses against reference answers using configured metrics.
Extracts responses and reference answers from generations and computes scores using all evaluation metrics specified during initialization.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `generations` | `list[dict[str, Any]]` | List of generation dictionaries returned by the `generate` method. | *required* |
Returns:

| Type | Description |
|---|---|
| `dict[str, dict[str, Any]]` | Dictionary of scores keyed by metric name. |
Source code in aisteer360/evaluation/use_cases/commonsense_mcqa/use_case.py
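A minimal sketch of this scoring pattern (the `response`/`reference` keys and the metric callable interface are assumptions for illustration, not the library's exact schema):

```python
from typing import Any, Callable


def evaluate(generations: list[dict[str, Any]],
             metrics: dict[str, Callable]) -> dict[str, dict[str, Any]]:
    """Score parsed responses against reference answers with each configured metric."""
    responses = [g["response"] for g in generations]
    references = [g["reference"] for g in generations]
    return {name: fn(responses, references) for name, fn in metrics.items()}


def accuracy(responses: list[str], references: list[str]) -> dict[str, float]:
    """Fraction of responses that exactly match their reference letter."""
    correct = sum(r == ref for r, ref in zip(responses, references))
    return {"accuracy": correct / len(references)}
```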
export(profiles, save_dir)
Exports evaluation profiles to (tabbed) JSON format.
Source code in aisteer360/evaluation/use_cases/commonsense_mcqa/use_case.py
generate(model_or_pipeline, tokenizer, gen_kwargs=None, runtime_overrides=None)
Generates model responses for multiple-choice questions with shuffled answer orders.
Creates prompts for each question with shuffled answer choices, generates model responses, and parses the outputs to extract letter choices. Repeats the process multiple times with different answer orderings to reduce positional bias.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_or_pipeline` | | Either a HuggingFace model or `SteeringPipeline` instance to use for generation. | *required* |
| `tokenizer` | | Tokenizer for encoding/decoding text. | *required* |
| `gen_kwargs` | `dict \| None` | Optional generation parameters. | `None` |
| `runtime_overrides` | `dict[tuple[str, str], str] \| None` | Optional runtime parameter overrides for steering controls, structured as `{(pipeline_name, param_name): value}`. | `None` |
Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]]` | List of generation dictionaries, one per shuffled question instance. |
Note:
    The number of returned generations will be `len(evaluation_data) * num_shuffling_runs` due to answer choice shuffling.
Source code in aisteer360/evaluation/use_cases/commonsense_mcqa/use_case.py
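The shuffling step can be sketched as follows: each question is expanded into `num_shuffling_runs` copies with the choices in a different order, and each copy records which letter the correct answer maps to after shuffling (the dictionary keys and the seeding here are illustrative assumptions):

```python
import random
import string


def shuffled_instances(question: str, choices: list[str], answer: str,
                       num_shuffling_runs: int, seed: int = 0) -> list[dict]:
    """Expand one question into num_shuffling_runs copies with shuffled choices.

    Each copy tracks the letter the correct answer lands on after shuffling,
    so positional bias averages out across runs.
    """
    rng = random.Random(seed)
    instances = []
    for _ in range(num_shuffling_runs):
        order = list(choices)
        rng.shuffle(order)
        letter = string.ascii_uppercase[order.index(answer)]
        instances.append({"question": question, "choices": order, "reference": letter})
    return instances
```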
validate_evaluation_data(evaluation_data)
Validates that evaluation data contains required fields for MCQA evaluation.
Ensures each data instance has the necessary keys and non-null values for the evaluation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `evaluation_data` | `dict[str, Any]` | Dictionary containing a single evaluation instance with question, answer choices, and correct answer information. | *required* |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If required keys (`'id'`, `'question'`, `'answer'`, `'choices'`) are missing or if any required fields contain null/NaN values. |
Source code in aisteer360/evaluation/use_cases/commonsense_mcqa/use_case.py
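A minimal sketch of this validation contract, assuming the four required keys named above (the helper name is hypothetical):

```python
import math
from typing import Any

REQUIRED_KEYS = ("id", "question", "answer", "choices")


def validate_instance(instance: dict[str, Any]) -> None:
    """Raise ValueError if an evaluation instance is missing keys or has null/NaN values."""
    missing = [k for k in REQUIRED_KEYS if k not in instance]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    for key in REQUIRED_KEYS:
        value = instance[key]
        if value is None or (isinstance(value, float) and math.isnan(value)):
            raise ValueError(f"field {key!r} contains a null/NaN value")
```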