ACPBench-Hard

Unrestrained Reasoning About Action, Change, And Planning

Harsha Kokel, Michael Katz, Kavitha Srinivas, Shirin Sohrabi

ACPBench-Hard consists of generative questions covering the following 8 reasoning tasks.

  • Action Applicability (app) : The first task deals with identifying which actions are applicable in a state.

  • Progression (prog) : The next task evaluates an LLM's ability to understand how the world state changes when an action is applied.

  • Atom Reachability (reach) : The reachability task evaluates whether a specific fact can eventually become true by taking (possibly multiple consecutive) actions starting from the given state.

  • Action Reachability (areach) : The action reachability task is closely related to atom reachability; it checks whether there is a reachable state in which the action is applicable.

  • Validation (val) : The validation task aims at checking whether the specified sequence of actions π is a plan.

  • Justification (just) : The Justification task asks to simplify the plan by removing one or two consecutive actions and to produce the resulting simplified plan.

  • Landmarks (land) : The Landmarks task tests an LLM's ability to identify subgoals that are necessary to achieve the goal.

  • Next Action (nexta) : An additional task that does not appear in the original ACPBench is the next action task. This generative question asks what is the next action that takes us towards the goal. This task is closely related to optimal planning, since optimal plans can be produced by iteratively obtaining such actions.
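
Several of the tasks above (applicability, progression, validation, and the two reachability tasks) can be made concrete with a small sketch. It assumes a STRIPS-style encoding in which a state is a set of atoms and each action has preconditions, add effects, and delete effects. The Action type, the function names, and the two-room toy domain are hypothetical illustrations, not the benchmark's data format or its evaluation code.

```python
from collections import deque
from typing import NamedTuple


class Action(NamedTuple):
    # Hypothetical STRIPS-style action; not the benchmark's own representation.
    name: str
    pre: frozenset     # atoms that must hold before the action
    add: frozenset     # atoms the action makes true
    delete: frozenset  # atoms the action makes false


def applicable(state, action):
    """Action Applicability: every precondition holds in the state."""
    return action.pre <= state


def progress(state, action):
    """Progression: the state that results from applying an applicable action."""
    if not applicable(state, action):
        raise ValueError(f"{action.name} is not applicable")
    return (state - action.delete) | action.add


def is_plan(state, goal, sequence):
    """Validation: each action is applicable in turn and the goal holds at the end."""
    for action in sequence:
        if not applicable(state, action):
            return False
        state = progress(state, action)
    return goal <= state


def reachable_states(state, actions):
    """Breadth-first enumeration of all states reachable from `state`."""
    seen, queue = {state}, deque([state])
    while queue:
        current = queue.popleft()
        yield current
        for action in actions:
            if applicable(current, action):
                successor = progress(current, action)
                if successor not in seen:
                    seen.add(successor)
                    queue.append(successor)


def atom_reachable(state, actions, atom):
    """Atom Reachability: some reachable state contains the atom."""
    return any(atom in s for s in reachable_states(state, actions))


def action_reachable(state, actions, action):
    """Action Reachability: the action is applicable in some reachable state."""
    return any(applicable(s, action) for s in reachable_states(state, actions))


# Toy domain (an assumption for illustration): a robot in room a, a ball in room b.
move_ab = Action("move-a-b", frozenset({"at-a"}), frozenset({"at-b"}), frozenset({"at-a"}))
pick_b = Action("pick-b", frozenset({"at-b", "ball-at-b"}), frozenset({"holding"}), frozenset({"ball-at-b"}))
ACTIONS = [move_ab, pick_b]
INIT = frozenset({"at-a", "ball-at-b"})
GOAL = frozenset({"holding"})
```

In this toy domain, pick_b is not applicable in INIT but becomes applicable after move_ab, so is_plan(INIT, GOAL, [move_ab, pick_b]) holds while is_plan(INIT, GOAL, [pick_b]) does not. Exhaustive breadth-first search is only practical for tiny state spaces; it is used here because it keeps the definitions easy to read.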

LLMs' Performance Evaluation

A domain-wise analysis of GPT-4o's performance on all ACPBench-Hard tasks.

BibTeX

@misc{kokel2025acphard,
      title={ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning}, 
      author={Harsha Kokel and Michael Katz and Kavitha Srinivas and Shirin Sohrabi},
      year={2025},
      eprint={2503.24378},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2503.24378}, 
}

To evaluate your LLM on ACPBench, please consult the Readme for instructions. If you have any questions or feedback, feel free to contact Harsha Kokel at harsha.kokel@ibm.com.