In recent years, large language models have made tremendous strides in natural language question answering tasks. To take this progress to the next level, researchers have developed various question-answering style reasoning datasets to further assess and improve natural language reasoning. These datasets evaluate a range of reasoning skills, including common-sense reasoning, arithmetic reasoning, multi-modal reasoning, logical deduction, and tabular QA, among others. While there are many more, the figure below highlights some of the most commonly used benchmarks in the field. These benchmarks play a crucial role in advancing the research goal of complex natural language reasoning, pushing the boundaries of what is possible with language models and paving the way for even more sophisticated AI approaches.


Word Cloud of Natural Language Reasoning Datasets.

As large language models (LLMs) continue to excel on reasoning tasks and benchmarks, researchers are now exploring their potential to take on a more ambitious role: serving as agents that can orchestrate workflows and make decisions in complex domains that require planning. This emerging area of research holds tremendous promise, with potential applications spanning multiple fields. Imagine LLMs capable of autonomously managing business processes, planning a trip, or analyzing data to provide timely insights. However, despite the excitement surrounding this development, there is a significant gap in our understanding of LLMs' planning capabilities.

To this end, we introduce ACPBench, a question-answering-style dataset that evaluates an AI model's ability to reason about Action, Change, and Planning. To fully harness LLMs' potential for planning, ACPBench performs a more systematic evaluation of their strengths and limitations on tasks that are central to planning; indeed, not being able to perform these tasks reliably precludes the ability to formulate plans.

ACPBench Tasks

ACPBench consists of boolean and multi-choice questions covering the following 7 reasoning tasks; a short code sketch after the list illustrates the planning notions behind them.

  • Action Applicability (app): The first, basic requirement for efficient planning is to determine the valid, available actions in a given situation. Several existing works have observed that LLMs fall short of this basic ability. For example, when using GPT-4 Turbo for travel planning, prior work found that more than 30% of the failed plans contained an invalid-action dead loop; that is, the model kept repeating invalid actions even when it was informed that the action is invalid.

    Action Applicability

  • Progression (prog): The next task evaluates LLMs' ability to understand the outcome of an action or change. This ability is important for tracking information across transitions. The subpar performance of LLMs on the Tracking Shuffled Objects task in the Big Bench Hard dataset suggests a significant limitation in their ability to reason about the consequences of actions or changes. Further, a few papers have proposed to use LLMs to execute a plan. For example, Plan-and-Solve Prompting asks the LLM to devise a plan and execute it step by step to reach the goal. To faithfully execute a plan, it is important for LLMs to demonstrate an understanding of progression: how the world state is changed by an action.

    Progression

  • Atom Reachability (reach): The reachability task evaluates whether a specific sub-goal can eventually be reached from the given state by taking (possibly multiple) actions. This is a multi-step reasoning task that can help avoid exploring infeasible options. To maximize the efficiency of LLMs, it is crucial to detect unreachable (sub)goals early on. This avoids unnecessary prompting and wasteful exploration, ensuring that LLMs are utilized effectively, especially when used during search.

    Atom Reachability

  • Validation (val) : A body of research has advocated the use of LLMs for validation and refinement. In line with this research, we propose a Validation task. Here, given an initial state and a goal condition, the objective is to assess whether the specified sequence of actions is valid, applicable, and successfully achieves the intended goal.

    Validation

  • Action Reachability (areach): In API-driven workflows, the objective is typically presented as an instruction to execute a specific function (as in ToolLLM). In these scenarios, an LLM must identify the necessary prerequisites for execution and formulate a strategy to meet them. Therefore, it is essential for LLMs to assess whether a given instruction is executable from the provided starting point. We formulate this ability as the action reachability task.

    Action Reachability

  • Justification (just): A major criterion for a plan to be considered reasonable is that it includes no unnecessary actions. In the realm of LLMs and API workflows, it is desirable to avoid calling unnecessary APIs as well as to reduce wasteful exploration. Hence, it would be of immense value if LLMs were able to identify whether an action is necessary. This corresponds to the justification task in the planning literature.

    Justification

  • Landmarks (land): LLMs have been shown to hallucinate or deviate from the task when the trajectory is long. To alleviate this problem, various works have proposed to use LLMs to decompose the goal into subgoals and achieve each of these subgoals separately. To do this faithfully, it is crucial for LLMs to be able to identify subgoals that are necessary to achieve the goal. In the planning literature such subgoals are called landmarks: facts that must become true at some point along every plan. So, the last task in ACPBench evaluates LLMs' ability to recognize landmarks.

    Landmarks
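
To make the task definitions above concrete, here is a minimal, illustrative Python sketch of the planning notions behind them. It is not part of ACPBench or its code: the toy robot-and-package domain, the action names, and all helper functions are invented for illustration. A state is a set of true facts, and each action has precondition, add, and delete sets.

from collections import deque

# Toy domain (hypothetical): a robot moves a package from room A to room B.
ACTIONS = {
    "pick-up":  {"pre": {"robot-at-A", "pkg-at-A"}, "add": {"holding"},    "del": {"pkg-at-A"}},
    "move-A-B": {"pre": {"robot-at-A"},             "add": {"robot-at-B"}, "del": {"robot-at-A"}},
    "move-B-A": {"pre": {"robot-at-B"},             "add": {"robot-at-A"}, "del": {"robot-at-B"}},
    "drop":     {"pre": {"robot-at-B", "holding"},  "add": {"pkg-at-B"},   "del": {"holding"}},
}

def applicable(state, name):
    """Action Applicability (app): all preconditions of the action hold in the state."""
    return ACTIONS[name]["pre"] <= state

def progress(state, name):
    """Progression (prog): the state resulting from applying the action."""
    a = ACTIONS[name]
    return (state - a["del"]) | a["add"]

def reachable_states(state):
    """All states reachable from `state` by some action sequence (breadth-first search)."""
    seen, queue = {frozenset(state)}, deque([frozenset(state)])
    while queue:
        s = queue.popleft()
        for name in ACTIONS:
            if applicable(s, name):
                t = frozenset(progress(set(s), name))
                if t not in seen:
                    seen.add(t)
                    queue.append(t)
    return seen

def atom_reachable(state, atom):
    """Atom Reachability (reach): some reachable state makes `atom` true."""
    return any(atom in s for s in reachable_states(state))

def action_reachable(state, name):
    """Action Reachability (areach): some reachable state makes the action applicable."""
    return any(applicable(s, name) for s in reachable_states(state))

def valid_plan(state, plan, goal):
    """Validation (val): each action is applicable in turn and the goal holds at the end."""
    for name in plan:
        if not applicable(state, name):
            return False
        state = progress(state, name)
    return goal <= state

def justified(state, plan, goal, i):
    """Justification (just): a simple test -- removing the i-th action breaks the plan."""
    return not valid_plan(state, plan[:i] + plan[i + 1:], goal)

def landmark(state, atom, goal):
    """Landmarks (land): `atom` must become true along every plan. Checked here by
    forbidding states that contain `atom` and testing whether the goal is still reachable."""
    if atom in state:
        return True
    seen, queue = {frozenset(state)}, deque([frozenset(state)])
    while queue:
        s = queue.popleft()
        if goal <= s:
            return False  # goal reachable while avoiding `atom`, so it is not a landmark
        for name in ACTIONS:
            if applicable(s, name):
                t = frozenset(progress(set(s), name))
                if atom not in t and t not in seen:
                    seen.add(t)
                    queue.append(t)
    return True

init, goal = {"robot-at-A", "pkg-at-A"}, {"pkg-at-B"}
plan = ["pick-up", "move-A-B", "drop"]
print(applicable(init, "drop"))           # False: cannot drop before picking up
print(atom_reachable(init, "pkg-at-B"))   # True
print(action_reachable(init, "drop"))     # True
print(valid_plan(init, plan, goal))       # True
print(justified(init, plan, goal, 1))     # True: the move action is necessary
print(landmark(init, "holding", goal))    # True: every plan must pick up the package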

LLMs' Performance Evaluation


Accuracy of 22 leading LLMs on the 7 ACPBench tasks (boolean as well as multi-choice questions). The best results are boldfaced, the second best are underlined, and the best among the small, open-sourced models are highlighted with *. All models were evaluated with two in-context examples and a Chain-of-Thought prompt. The right-most column is the mean across tasks.

Upon evaluating leading LLMs with CoT-style prompting and two in-context examples, we found that LLAMA-3.1 405B and GPT-4o consistently outperform other models on these tasks, although they do not always achieve the top performance. Among the smaller open-sourced models, Codestral 22B stands out for its exceptional performance on boolean questions, while Mixtral 8x7B excels at multi-choice questions. However, both of them lag significantly behind GPT-4o, the best performer on these tasks. Action Reachability and Validation are the most challenging tasks for LLMs. Surprisingly, the GPT family models are not even among the top 3 for the action reachability task. Across all tasks, GPT-4o performs best on boolean questions and LLAMA-3.1 405B performs best on multi-choice questions.
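
For concreteness, below is a rough Python sketch of the 2-shot Chain-of-Thought setup described above. It is not the official ACPBench evaluation harness: the example questions, the `generate` callable, and the yes/no extraction heuristic are placeholders. Two solved examples are prepended to each boolean question, the model is prompted to reason step by step, and the final yes/no is read off its completion.

import re

# Two hand-written, invented in-context examples in the spirit of the app task.
EXAMPLES = [
    {"context": "The robot is in room A and holds nothing. The package is in room A.",
     "question": "Is the action 'drop the package in room A' applicable in this state?",
     "cot": "To drop the package, the robot must be holding it. The robot holds nothing, "
            "so the action is not applicable. The answer is no."},
    {"context": "The robot is in room B and is holding the package.",
     "question": "Is the action 'drop the package in room B' applicable in this state?",
     "cot": "Dropping requires being in room B and holding the package. Both conditions hold, "
            "so the action is applicable. The answer is yes."},
]

def build_prompt(context, question):
    """Assemble a 2-shot CoT prompt: two solved examples, then the query left open."""
    parts = [f"Context: {ex['context']}\nQuestion: {ex['question']}\n"
             f"Answer: Let's think step by step. {ex['cot']}" for ex in EXAMPLES]
    parts.append(f"Context: {context}\nQuestion: {question}\nAnswer: Let's think step by step.")
    return "\n\n".join(parts)

def extract_boolean(completion):
    """Take the last 'yes'/'no' mentioned in the completion as the model's final answer."""
    hits = re.findall(r"\b(yes|no)\b", completion.lower())
    return hits[-1] if hits else None

def accuracy(questions, generate):
    """`questions`: dicts with 'context', 'question', and gold 'answer' ('yes' or 'no');
    `generate`: any function mapping a prompt string to the model's completion string."""
    correct = sum(
        extract_boolean(generate(build_prompt(q["context"], q["question"]))) == q["answer"]
        for q in questions)
    return correct / len(questions)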


Comparison of the 8 top-performing LLMs on multi-choice questions across the 13 domains of ACPBench. The mean performance across these top-8 models is shown as a dotted black line, indicating that none of the domains are exceptionally easy.

We also performed a domain-wise analysis of LLM performance on multi-choice questions, showcasing the top 8 performing models. The average performance of these top-8 models, shown as the dotted black line, indicates that across models no specific domain is too easy. However, the Rovers, FloorTile, Blocksworld, Alfworld, and Satellite domains pose the greatest challenges to LLMs, in that particular order.

While the largest LLMs achieve more than 80% accuracy on a few tasks, the variance in performance across tasks and across LLMs is still large. This signifies that there is a long way to go before they can be reliably used in practical scenarios.

OpenAI o1 Performance Evaluation

Recently, OpenAI released o1, a series of LLM-based reasoning models that show a significant boost over GPT-4o on benchmarks that require reasoning. A comparison of o1 with LLMs is not even-handed, as o1 uses multiple interactions and generates many more tokens than the maximum limit provided to the LLMs (see Sec. 4.4 in the paper). So we do not include it in the LLMs' performance evaluation above. Instead, we present the performance difference of the OpenAI o1 models (with zero-shot IO and 2-shot CoT prompts) from the best-performing LLMs. Our results indicate that the o1 models fail to yield performance gains on boolean questions, but demonstrate notable improvements on multi-choice questions. Specifically, o1-preview consistently performs better than or on par with the best-performing model on multi-choice questions.


Comparing the OpenAI o1 models with the best LLM. A positive difference indicates the o1 model performs better than the best of the LLMs; a negative difference indicates the o1 model lags behind the best LLM.

Our findings with OpenAI o1, a multi-turn reasoning model, reveal significant gains in performance on multi-choice questions, yet surprisingly, no notable progress on boolean questions.

BibTeX

@article{kokel2024ACP,
  title={ACPBench: Reasoning about Action, Change, and Planning},
  author={Kokel, Harsha and Katz, Michael and Srinivas, Kavitha and Sohrabi, Shirin},
  journal={arXiv preprint arXiv:2410.05669},
  year={2024}
}

To evaluate your LLM on ACPBench, please consult the README for instructions. If you have any questions or feedback, feel free to contact Harsha Kokel at harsha.kokel@ibm.com.