There is an increasing body of work using Large Language Models (LLMs) as agents for orchestrating workflows and making decisions in domains that require planning and multi-step reasoning. As a result, it is imperative to evaluate LLMs on the core skills required for planning. To this end, we distill 7 atomic reasoning tasks that are critical for reliable planning and create datasets for them. These tasks focus on reasoning about Actions, Change (transitions), and Planning; hence, we call our benchmark ACPBench. The tasks include single-step reasoning, such as evaluating whether an action can be performed in a described state, as well as multi-step reasoning, such as deciding whether a sequence of actions is a valid plan for a described state and goal (see the sketch below).
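To make the two kinds of reasoning concrete, here is a minimal sketch in Python, assuming a STRIPS-style representation where states are sets of ground atoms. The `Action` class and the atom names in the toy example are hypothetical illustrations, not ACPBench's actual encoding.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset  # atoms that must hold before the action
    add_effects: frozenset    # atoms the action makes true
    del_effects: frozenset    # atoms the action makes false

def is_applicable(action: Action, state: frozenset) -> bool:
    """Single-step reasoning: can the action be performed in this state?"""
    return action.preconditions <= state

def progress(state: frozenset, action: Action) -> frozenset:
    """Apply an applicable action to a state (progression)."""
    return (state - action.del_effects) | action.add_effects

def is_valid_plan(state: frozenset, plan: list, goal: frozenset) -> bool:
    """Multi-step reasoning: does the action sequence reach the goal?"""
    for action in plan:
        if not is_applicable(action, state):
            return False
        state = progress(state, action)
    return goal <= state

# Toy blocks-like example (names are made up for illustration):
pick_up_a = Action(
    name="pick-up(a)",
    preconditions=frozenset({"clear(a)", "on-table(a)", "hand-empty"}),
    add_effects=frozenset({"holding(a)"}),
    del_effects=frozenset({"clear(a)", "on-table(a)", "hand-empty"}),
)
state = frozenset({"clear(a)", "on-table(a)", "hand-empty"})
print(is_applicable(pick_up_a, state))                               # True
print(is_valid_plan(state, [pick_up_a], frozenset({"holding(a)"})))  # True
```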
Performance of a few SOTA LLMs and OpenAI o1 across the tasks in ACPBench. While the largest LLMs achieve more than 80% accuracy on a few tasks, the variance in performance across tasks and across LLMs remains large, signifying how far these models still are from being reliably usable in practical planning scenarios.
ACPBench consists of 7 reasoning tasks over 13 domains. The 13 domains comprise 11 classical planning domains, ALFWorld, and a novel Swap domain. The 7 tasks are Action Applicability (app), Progression (prog), Atom Reachability (reach), Validation (val), Action Reachability (areach), Justification (just), and Landmarks (land). The ACPBench dev and test sets are available for download in this GitHub Repo as well as on the Hugging Face Hub.
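For readers who want to explore the data programmatically, here is a minimal sketch using the Hugging Face `datasets` library. The dataset identifier `ibm/ACPBench` and the available splits are assumptions; check the dataset card for the exact names.

```python
from datasets import load_dataset

# "ibm/ACPBench" is an assumed identifier -- see the dataset card for the
# exact name and any task-specific configurations.
dataset = load_dataset("ibm/ACPBench")
print(dataset)  # inspect the available splits and fields
```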
Action Applicability
Progression
Atom Reachability
Validation
Action Reachability
Justification
Landmarks
@article{kokel2024ACP,
  title={{ACPBench}: Reasoning about Action, Change, and Planning},
  author={Kokel, Harsha and Katz, Michael and Srinivas, Kavitha and Sohrabi, Shirin},
  journal={arXiv preprint arXiv:2410.05669},
  year={2024}
}