There is an increasing body of work using Large Language Models (LLMs) as agents for orchestrating workflows and making decisions in domains that require planning and multi-step reasoning. It is therefore imperative to evaluate LLMs on the core skills required for planning. To that end, we distill 7 atomic reasoning tasks that are critical for reliable planning and create datasets for them. These tasks focus on reasoning about Actions, Change (transitions), and Planning; hence, we call our benchmark ACPBench. The tasks include single-step reasoning, such as evaluating whether an action can be performed in a described state, as well as multi-step reasoning, such as deciding whether a sequence of actions is a valid plan for a described state and goal.

Performance of a few representative LLMs on the different ACPBench tasks. While the largest LLMs achieve more than 80% accuracy on a few tasks, the variance in performance across tasks and across LLMs remains large. This indicates that there is still a long way to go before LLMs can be reliably used in practical planning scenarios.

ACPBench Overview

ACPBench consists of 7 reasoning tasks over 13 domains. The 13 domains include 11 classical planning domains, ALFWorld, and a novel Swap domain. The 7 tasks are Action Applicability (app), Progression (prog), Atom Reachability (reach), Validation (val), Action Reachability (areach), Justification (just), and Landmarks (land). The ACPBench dev and test sets are available for download from this GitHub Repo as well as from the Hugging Face Dataset (see the loading sketch after the list below).

  • Dev Set: The ACPBench dev set consists of 40 examples for each task, with answers.
  • Test Set: The ACPBench test set features 10 boolean (Bool) and 10 multiple-choice (MCQ) questions for each domain and each task, without answers.
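For programmatic access, the data can be loaded with the Hugging Face datasets library. The sketch below is illustrative only: the dataset id "ibm/ACPBench" and the config name "app" are assumptions, so consult the Hugging Face dataset page for the exact identifiers.

```python
# Minimal sketch of loading ACPBench via the Hugging Face `datasets` library.
# NOTE: the dataset id and config name below are assumptions for illustration;
# check the Hugging Face dataset page for the exact identifiers.
from datasets import load_dataset

ds = load_dataset("ibm/ACPBench", "app")  # hypothetical id and config

# Inspect the available splits and the fields of one example.
for split_name, split in ds.items():
    print(split_name, len(split))
    print(split[0])
    break
```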

Example boolean questions:
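(The rendered example is not reproduced here; below is a hypothetical illustration of a boolean Action Applicability question, modeled on the classical Blocksworld domain. The actual dataset questions may be phrased differently.)

```
Context: You are playing with a set of blocks. Block B is on top of
block A, block A is on the table, and your hand is empty.
Question: Is the action "pick up block A" applicable in this state?
Answer: No. Block A is not clear, because block B is on top of it.
```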

Example multiple-choice questions:
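(Likewise, a hypothetical multiple-choice variant of the same scenario; the benchmark's actual questions may differ in wording and answer format.)

```
Context: You are playing with a set of blocks. Block B is on top of
block A, block A is on the table, and your hand is empty.
Question: Which of the following actions is applicable in this state?
A. pick up block A
B. unstack block B from block A
C. put down block B
D. stack block A on block B
Answer: B
```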

BibTeX

@misc{kokel2024ACP,
  title={ACPBench: Reasoning about Action, Change, and Planning},
  author={Kokel, Harsha and Katz, Michael and Srinivas, Kavitha and Sohrabi, Shirin},
  year={2024},
  eprint={2410.05669},
  archivePrefix={arXiv}
}
To evaluate your LLM on ACPBench, please consult the README for instructions. If you have any questions or feedback, feel free to contact Harsha Kokel at harsha.kokel@ibm.com.