lm-evaluation-harness CLI usage

The recommended way to run benchmarks is through the CLI, in a Python environment with 'ibm-generative-ai[lm-eval]' installed.
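If the extra is not yet installed, a standard pip invocation adds it (the quotes keep the shell from interpreting the brackets):

pip install 'ibm-generative-ai[lm-eval]'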

Example:

python -m genai.extensions.lm_eval \
      --model="ibm_genai" \
      --model_args="model_id=tiiuae/falcon-40b,temperature=0" \
      --task="hellaswag" \
      --num_fewshot=10 \
      --output_path="falcon-40b_hellaswag.json"
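
Once the run finishes, the file written via --output_path can be inspected programmatically. A minimal sketch, assuming the harness's usual results layout (a top-level "results" mapping of task name to metrics):

import json

# Load the results file produced by --output_path in the example above.
with open("falcon-40b_hellaswag.json") as f:
    results = json.load(f)

# lm-evaluation-harness output typically keeps per-task metrics under a
# top-level "results" key; print them for a quick look.
for task, metrics in results.get("results", {}).items():
    print(task, metrics)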
See lm_eval_cli.py on GitHub.
import subprocess

# Invoke the lm-eval CLI as a subprocess; this mirrors the command above.
subprocess.run(
    [
        "python",
        "-m",
        "genai.extensions.lm_eval",
        "--model=ibm_genai",
        "--model_args=model_id=tiiuae/falcon-40b,temperature=0",
        "--task=hellaswag",
        "--num_fewshot=10",
        "--limit=10",  # WARNING: only for debug purposes, remove for full testing dataset
    ],
    check=True,  # raise CalledProcessError if the run exits non-zero
    text=True,
    capture_output=False,
)
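
Because the runner is an ordinary module invocation, the same pattern extends to batch runs. A sketch reusing the flags shown above; the task list is illustrative, not prescribed by the extension:

import subprocess

# Repeat the same CLI invocation for several tasks, writing one results
# file per task. Task names below are examples only.
for task in ["hellaswag", "arc_easy"]:
    subprocess.run(
        [
            "python",
            "-m",
            "genai.extensions.lm_eval",
            "--model=ibm_genai",
            "--model_args=model_id=tiiuae/falcon-40b,temperature=0",
            f"--task={task}",
            "--num_fewshot=10",
            f"--output_path=falcon-40b_{task}.json",
        ],
        check=True,
    )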