Recent research increasingly positions Large Language Models (LLMs) as autonomous agents capable of orchestrating complex workflows and making decisions in domains that demand sophisticated planning and multi‑step reasoning. This trend underscores the need for systematic evaluation of LLMs on the foundational skills that underpin effective planning. To address this gap, we have distilled eight core reasoning tasks that are essential for reliable planning and assembled dedicated datasets around them. These tasks center on three interrelated concepts---Actions, Change (transitions), and Planning---and collectively form our new benchmark, ACPBench.
ACPBench encompasses both single‑step reasoning (e.g., determining whether a particular action is executable in a given state) and multi‑step reasoning (e.g., assessing whether a sequence of actions constitutes a valid plan to achieve a specified goal). In the standard edition of ACPBench, we focus on boolean and multiple‑choice questions, while the more challenging ACPBench‑Hard edition presents generative questions that require the model to produce full, coherent responses. By providing a structured, academically rigorous framework for evaluating these critical planning competencies, ACPBench aims to advance the development of LLMs that can reliably navigate complex, real‑world decision‑making scenarios.
Performance of leading LLMs and LRMs across the different tasks in ACPBench (left) and ACPBench-Hard (right). While the largest LLMs achieve more than 80% accuracy on a few ACPBench tasks (boolean and multiple-choice questions), ACPBench-Hard (generative questions) remains quite challenging.
Comparison of a recent open-weight reasoning model, GPT-OSS-120B, on all ACPBench and ACPBench-Hard tasks. While its performance on the boolean and multiple-choice questions was decent, it still struggled with the generative tasks.
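To make the two question styles concrete, here is a purely illustrative Python sketch of what a boolean ACPBench-style question and a generative ACPBench-Hard-style question might look like in a Blocksworld setting. The examples and field names are hypothetical and are not drawn from the released datasets.

```python
# Purely illustrative (hypothetical) examples of the two question styles.
# These records are NOT taken from the released datasets; field names are assumptions.
boolean_example = {
    "context": "Blocksworld: block A is on the table, block B is on block A, the hand is empty.",
    "question": "Is the action 'unstack block B from block A' applicable in this state?",  # Action Applicability
    "answer": "yes",
}

generative_example = {
    "context": "Blocksworld: block A is on the table, block B is on block A, the hand is empty. "
               "Goal: block A is on block B.",
    "question": "Give a next action that makes progress towards the goal.",  # Next Action (ACPBench-Hard)
    "answer": "unstack block B from block A",
}

for ex in (boolean_example, generative_example):
    print(ex["question"], "->", ex["answer"])
```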
ACPBench-Hard consists of 8 reasoning tasks over the same 13 domains as ACPBench. In addition to the 7 ACPBench tasks, ACPBench-Hard introduces the task of generating a next action that makes progress towards the goal. The ACPBench-Hard dataset is also released on Hugging Face and is integrated with the LM Eval framework for evaluation; the evaluation scripts are available in the GitHub repo (a minimal loading sketch follows the task list below). Read more about the performance of LLMs on the ACPBench-Hard dataset here.
Action Applicability
Progression
Atom Reachability
Validation
Action Reachability
Justification
Landmarks
Next Action
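As a rough illustration of how one might access the Hugging Face release mentioned above, here is a minimal Python sketch using the `datasets` library. The dataset identifier, split name, and the LM Eval task name in the trailing comment are assumptions; consult the GitHub repo and the Hugging Face page for the exact names.

```python
# Minimal sketch: inspecting the ACPBench-Hard release on Hugging Face.
# The dataset identifier and split name below are assumptions, not verified names.
from datasets import load_dataset

DATASET_ID = "ibm/acp-bench-hard"  # hypothetical identifier; check the Hugging Face page

ds = load_dataset(DATASET_ID, split="test")  # split name is an assumption
print(ds)     # shows the number of examples and the actual column names
print(ds[0])  # prints one record to inspect the real schema

# The page also notes an integration with the LM Eval framework; with the harness
# installed, an evaluation run would look roughly like this (task name assumed):
#   lm_eval --model hf --model_args pretrained=<model-id> --tasks acp_bench_hard
```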
ACPBench consists of 7 reasoning tasks over 13 domains. The 13 domains include 11 classical planning domains, ALFWorld, and a novel Swap domain. The 7 tasks are Action Applicability (app), Progression (prog), Atom Reachability (reach), Validation (val), Action Reachability (areach), Justification (just), and Landmarks (land). The ACPBench dev and test sets are available for download in this GitHub Repo as well as on Hugging Face (a simple scoring sketch follows the task list below). Read more about the performance of LLMs on ACPBench here.
Action Applicability
Progression
Atom Reachability
Validation
Action Reachability
Justification
Landmarks
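Since the standard ACPBench questions are boolean or multiple-choice, scoring essentially reduces to exact-match accuracy against the gold answer. The snippet below is a minimal sketch of that idea under the assumption of a single gold label per example; the evaluation scripts in the GitHub repo should be used for reported numbers.

```python
# Minimal sketch of exact-match accuracy for boolean / multiple-choice predictions,
# assuming one gold label per example (an assumption about the data format).
from typing import Sequence

def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold answer (case-insensitive)."""
    assert len(predictions) == len(gold), "prediction/gold length mismatch"
    if not gold:
        return 0.0
    correct = sum(p.strip().lower() == g.strip().lower() for p, g in zip(predictions, gold))
    return correct / len(gold)

# Hypothetical usage:
print(accuracy(["yes", "no", "B"], ["yes", "yes", "B"]))  # -> 0.666...
```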
@inproceedings{kokel2024acp,
author = {Harsha Kokel and
Michael Katz and
Kavitha Srinivas and
Shirin Sohrabi},
title = {ACPBench: Reasoning about Action, Change, and Planning},
booktitle = {{AAAI}},
publisher = {{AAAI} Press},
year = {2025}
}
@misc{kokel2025acphard,
title={ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning},
author={Harsha Kokel and Michael Katz and Kavitha Srinivas and Shirin Sohrabi},
year={2025},
eprint={2503.24378},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2503.24378},
}