In recent years, large language models have made tremendous strides in natural language question answering tasks. To take this progress to the next level, researchers have developed various question-answering style reasoning datasets to further assess and improve natural language reasoning. These datasets evaluate a range of reasoning skills, including common-sense reasoning, arithmetic reasoning, multi-modal reasoning, logical deduction, and tabular QA, among others. While there are many more, the figure below highlights some of the most commonly used benchmarks in the field. These benchmarks play a crucial role in advancing the research goal of complex natural language reasoning, pushing the boundaries of what is possible with language models and paving the way for even more sophisticated AI approaches.
Word Cloud of Natural Language Reasoning Datasets.
As large language models (LLMs) continue to excel on reasoning tasks and benchmarks, researchers are now exploring their potential to take on a more ambitious role: serving as agents that can orchestrate workflows and make decisions in complex domains that require planning. This emerging area of research holds tremendous promise, with potential applications spanning multiple fields. Imagine LLMs capable of autonomously managing business processes, planning a trip, or managing and analyzing data to provide timely insights. However, despite the excitement surrounding this development, there is a significant gap in our understanding of LLMs' planning capabilities.
To this end, we introduce ACPBench, a question-answering-style dataset that evaluates an AI model's ability to reason about Action, Change, and Planning. To fully harness LLMs' potential for planning, ACPBench systematically evaluates their strengths and limitations on tasks that are central to planning; indeed, a model that cannot perform these tasks reliably cannot be expected to formulate plans.
ACPBench consists of boolean and multi-choice questions for the following seven reasoning tasks.
{
"id": -8342636639526456067,
"group": "applicable_actions_bool",
"context": "This is a ferry domain, where the task is to transport cars from their start to their goal locations, using a ferry. Each location is accessible by ferry from each other location. The cars can be debarked or boarded, and the ferry can carry only one car at a time. There are 2 locations and 20 cars, numbered consecutively. Currently, the ferry is at l1 location and it is empty. The cars are at locations as follows: c7, c11, c2, c16, c14, c19, c5, c4, c12, c17, and c1 are at l1; c13, c8, c6, c18, c0, c3, c9, c10, and c15 are at l0.",
"question": "Is the following action applicable in this state: travel by sea from location l1 to location l0?"
}
Action Applicability
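To make the applicability semantics concrete, here is a minimal Python sketch (our own illustration, not ACPBench's generator): an action is applicable in a state exactly when all of its preconditions hold there. The state encoding and action names below are assumptions made for this example.

def applicable(state, action):
    """An action is applicable iff all of its preconditions hold in the state."""
    kind, *args = action
    if kind == "sail":                       # sail(src, dst): ferry must be at src
        src, dst = args
        return state["ferry_at"] == src and src != dst
    if kind == "board":                      # board(car, loc): ferry empty and co-located with car
        car, loc = args
        return (state["ferry_at"] == loc and state["on_ferry"] is None
                and state["car_at"].get(car) == loc)
    if kind == "debark":                     # debark(car, loc): car must be on the ferry at loc
        car, loc = args
        return state["ferry_at"] == loc and state["on_ferry"] == car
    return False

# The state from the example above: ferry at l1 and empty.
state = {"ferry_at": "l1", "on_ferry": None,
         "car_at": {f"c{i}": "l1" for i in [7, 11, 2, 16, 14, 19, 5, 4, 12, 17, 1]}}
state["car_at"].update({f"c{i}": "l0" for i in [13, 8, 6, 18, 0, 3, 9, 10, 15]})

print(applicable(state, ("sail", "l1", "l0")))   # True: the ferry is at l1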
{
"id": -8215166616105943671,
"group": "progression_bool",
"context": "This is a ferry domain, where the task is to transport cars from their start to their goal locations, using a ferry. Each location is accessible by ferry from each other location. The cars can be debarked or boarded, and the ferry can carry only one car at a time. There are 2 locations and 5 cars, numbered consecutively. Currently, the ferry is at l0 location and it is empty. The cars are at locations as follows: c1, c0, c3, and c2 are at l0; c4 is at l1.",
"question": "Will the fact \"Car c4 is on the ferry\" hold after performing the action \"sail from location l0 to location l1\" in the current state?"
}
Progression
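The progression task asks what holds after an action is applied: the action's effects overwrite the facts they touch, and every other fact persists (the frame assumption). A toy sketch of this, under the same assumed encoding as above:

import copy

def progress(state, action):
    """Return the successor state after applying an (assumed applicable) action."""
    s = copy.deepcopy(state)
    kind, *args = action
    if kind == "sail":
        s["ferry_at"] = args[1]              # only the ferry's location changes
    elif kind == "board":
        s["on_ferry"] = args[0]
        del s["car_at"][args[0]]
    elif kind == "debark":
        s["car_at"][args[0]] = args[1]
        s["on_ferry"] = None
    return s

# The state from the example: ferry at l0, c4 at l1, the rest at l0.
state = {"ferry_at": "l0", "on_ferry": None,
         "car_at": {"c0": "l0", "c1": "l0", "c2": "l0", "c3": "l0", "c4": "l1"}}
nxt = progress(state, ("sail", "l0", "l1"))
# Sailing moves only the ferry, so "Car c4 is on the ferry" does not hold.
print(nxt["on_ferry"] == "c4")   # False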
{
"id": -2426698749034015429,
"group": "reachable_atom_bool",
"context": "This is a ferry domain, where the task is to transport cars from their start to their goal locations, using a ferry. Each location is accessible by ferry from each other location. The cars can be debarked or boarded, and the ferry can carry only one car at a time. There are 2 locations and 10 cars, numbered consecutively. Currently, the ferry is at l0 location and it is empty. The cars are at locations as follows: c2, c7, and c5 are at l1; c3, c4, c6, c9, c1, c0, and c8 are at l0.",
"question": "Is it possible to transition to a state where the following holds: Car c2 is at location c0?"
}
Atom Reachability
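Atom reachability asks whether some sequence of actions leads to a state in which a given fact holds. On small instances this can be decided by plain breadth-first search over the state space, as in the toy sketch below (a 2-location, 2-car instance with an assumed tuple encoding; not how ACPBench computes its labels). Note that the atom queried in the example above mixes types: c0 is a car, not a location, so no action can ever make it true.

from collections import deque

LOCS = ("l0", "l1")

def successors(state):
    """All states reachable in one action. State = (ferry_loc, carried_index, car_locs)."""
    ferry, carried, cars = state
    for dst in LOCS:
        if dst != ferry:                                    # sail
            yield (dst, carried, cars)
    if carried is None:
        for i, loc in enumerate(cars):
            if loc == ferry:                                # board car i
                yield (ferry, i, cars[:i] + (None,) + cars[i + 1:])
    else:                                                   # debark the carried car
        yield (ferry, None, cars[:carried] + (ferry,) + cars[carried + 1:])

def reachable(init, holds):
    """BFS: does some reachable state satisfy the predicate `holds`?"""
    seen, queue = {init}, deque([init])
    while queue:
        s = queue.popleft()
        if holds(s):
            return True
        for t in successors(s):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return False

init = ("l0", None, ("l0", "l1"))    # ferry at l0 and empty; c0 at l0, c1 at l1
print(reachable(init, lambda s: s[2][0] == "l1"))   # True: c0 can be ferried to l1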
{
"id": -2339048290501167365,
"group": "validation_bool",
"context": "This is a ferry domain, where the task is to transport cars from their start to their goal locations, using a ferry. Each location is accessible by ferry from each other location. The cars can be debarked or boarded, and the ferry can carry only one car at a time. There are 2 locations and 2 cars, numbered consecutively. Currently, the ferry is at l0 location and it is empty. The cars are at locations as follows: c0 and c1 are at l0. The goal is to reach a state where the following facts hold: Car c0 is at location l1 and Car c1 is at location l1.",
"question": "Is the following sequence of actions \"board car c0 at location l0, debark car c0 to location l0 from the ferry, travel by sea from location l0 to location l1, travel by sea from location l1 to location l0, board car c1 at location l0, travel by sea from location l0 to location l1, debark car c1 to location l1 from the ferry, board car c1 at location l1, debark car c1 to location l1 from the ferry, travel by sea from location l1 to location l0, board car c0 at location l0, debark car c0 to location l0 from the ferry, board car c0 at location l0, travel by sea from location l0 to location l1, debark car c0 to location l1 from the ferry\" valid in this problem?"
}
Validation
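Plan validation composes the two checks above: a sequence of actions is a valid plan iff each action is applicable in the state reached so far and the final state satisfies the goal. Redundant but legal detours, like the board/debark pairs in the example, do not invalidate a plan. A compact sketch under the same assumptions:

def apply(state, action):
    """Return the successor state, or None if the action is inapplicable."""
    ferry, carried, cars = state
    kind, *args = action
    if kind == "sail" and args[0] == ferry != args[1]:
        return (args[1], carried, cars)
    if kind == "board" and carried is None and cars.get(args[0]) == ferry == args[1]:
        return (ferry, args[0], {c: l for c, l in cars.items() if c != args[0]})
    if kind == "debark" and carried == args[0] and ferry == args[1]:
        return (ferry, None, {**cars, args[0]: ferry})
    return None

def valid_plan(state, plan, goal):
    for action in plan:
        state = apply(state, action)
        if state is None:                    # an inapplicable step breaks the plan
            return False
    return all(state[2].get(c) == l for c, l in goal)

init = ("l0", None, {"c0": "l0", "c1": "l0"})
plan = [("board", "c0", "l0"), ("sail", "l0", "l1"), ("debark", "c0", "l1"),
        ("sail", "l1", "l0"), ("board", "c1", "l0"), ("sail", "l0", "l1"),
        ("debark", "c1", "l1")]
print(valid_plan(init, plan, [("c0", "l1"), ("c1", "l1")]))   # True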
{
"id": -1990152005808638716,
"group": "reachable_action_bool",
"context": "This is a ferry domain, where the task is to transport cars from their start to their goal locations, using a ferry. Each location is accessible by ferry from each other location. The cars can be debarked or boarded, and the ferry can carry only one car at a time. There are 2 locations and 20 cars, numbered consecutively. Currently, the ferry is at l0 location and it is empty. The cars are at locations as follows: c12, c19, c4, c11, c5, c7, c16, and c1 are at l1; c15, c18, c14, c0, c8, c3, c2, c9, c6, c10, c13, and c17 are at l0.",
"question": "Is it possible to transition to a state where the action \"board the car c19 at location l1\" can be applied?"
}
Action Reachability
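Action reachability reduces to state reachability: an action is reachable iff some reachable state satisfies all of its preconditions, so the same breadth-first search used for atom reachability applies, with the precondition set as the target. A toy version on an assumed 2-location, 2-car instance:

from collections import deque

def successors(state):
    ferry, carried, cars = state
    yield ("l1" if ferry == "l0" else "l0", carried, cars)      # sail to the other location
    if carried is None:
        for i, loc in enumerate(cars):
            if loc == ferry:                                    # board car i
                yield (ferry, i, cars[:i] + (None,) + cars[i + 1:])
    else:                                                       # debark the carried car
        yield (ferry, None, cars[:carried] + (ferry,) + cars[carried + 1:])

def action_reachable(init, pre):
    """BFS until some state satisfying all preconditions is found."""
    seen, queue = {init}, deque([init])
    while queue:
        s = queue.popleft()
        if pre(s):
            return True
        for t in successors(s):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return False

# Preconditions of "board c1 at l1": ferry at l1, ferry empty, c1 at l1.
init = ("l0", None, ("l0", "l1"))
print(action_reachable(init, lambda s: s[0] == "l1" and s[1] is None and s[2][1] == "l1"))   # True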
{
"id": -3115201149135125328,
"group": "action_justification_bool",
"context": "This is a ferry domain, where the task is to transport cars from their start to their goal locations, using a ferry. Each location is accessible by ferry from each other location. The cars can be debarked or boarded, and the ferry can carry only one car at a time. There are 3 locations and 2 cars, numbered consecutively. Currently, the ferry is at l1 location and it is empty. The cars are at locations as follows: c1 and c0 are at l1. The goal is to reach a state where the following facts hold: Car c0 is at location l0 and Car c1 is at location l2.",
"question": "Given the plan: \"load the car c1 at location l1 on to the ferry, unload the car c1 from the ferry to location l1, load the car c1 at location l1 on to the ferry, sail from location l1 to location l2, unload the car c1 from the ferry to location l2, load the car c1 at location l2 on to the ferry, unload the car c1 from the ferry to location l2, sail from location l2 to location l1, load the car c0 at location l1 on to the ferry, sail from location l1 to location l0, unload the car c0 from the ferry to location l0\"; can the following action be removed from this plan and still have a valid plan: load the car c1 at location l1 on to the ferry?"
}
Justification
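One natural way to test whether a single action occurrence is redundant (assumed semantics for illustration; see the paper for the precise definition ACPBench uses) is to delete it and re-run plan validation. Note that mutually justifying pairs, like the load/unload detours in the plan above, cannot be dropped one at a time: deleting only the load leaves the following unload inapplicable. A sketch reusing the toy validation logic from above:

def apply(state, action):                    # same toy model as the Validation sketch
    ferry, carried, cars = state
    kind, *args = action
    if kind == "sail" and args[0] == ferry != args[1]:
        return (args[1], carried, cars)
    if kind == "load" and carried is None and cars.get(args[0]) == ferry == args[1]:
        return (ferry, args[0], {c: l for c, l in cars.items() if c != args[0]})
    if kind == "unload" and carried == args[0] and ferry == args[1]:
        return (ferry, None, {**cars, args[0]: ferry})
    return None

def valid_plan(state, plan, goal):
    for a in plan:
        state = apply(state, a)
        if state is None:
            return False
    return all(state[2].get(c) == l for c, l in goal)

def removable(init, plan, goal, action):
    """True if deleting some single occurrence of `action` keeps the plan valid."""
    return any(a == action and valid_plan(init, plan[:i] + plan[i + 1:], goal)
               for i, a in enumerate(plan))

init, goal = ("l1", None, {"c0": "l1"}), [("c0", "l0")]
plan = [("load", "c0", "l1"), ("sail", "l1", "l0"),
        ("unload", "c0", "l0"), ("sail", "l0", "l1")]
print(removable(init, plan, goal, ("sail", "l0", "l1")))   # True: the final sail is redundant
print(removable(init, plan, goal, ("load", "c0", "l1")))   # False: the load is justified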
{
"id": 1263458375528833442,
"group": "landmarks_bool",
"context": "This is a ferry domain, where the task is to transport cars from their start to their goal locations, using a ferry. Each location is accessible by ferry from each other location. The cars can be debarked or boarded, and the ferry can carry only one car at a time. There are 2 locations and 20 cars, numbered consecutively. Currently, the ferry is at l1 location and it is empty. The cars are at locations as follows: c14, c8, c3, c2, c10, c0, c6, c13, c11, c16, c9, c15, c18, and c17 are at l0; c7, c12, c19, c1, c4, and c5 are at l1. The goal is to reach a state where the following facts hold: Car c11 is at location l1, Car c8 is at location l1, Car c3 is at location l0, Car c7 is at location l1, Car c10 is at location l0, Car c0 is at location l0, Car c12 is at location l1, Car c19 is at location l1, Car c13 is at location l0, Car c17 is at location l1, Car c1 is at location l1, Car c9 is at location l0, Car c15 is at location l0, Car c14 is at location l1, Car c2 is at location l1, Car c4 is at location l1, Car c16 is at location l1, Car c6 is at location l1, Car c18 is at location l1, and Car c5 is at location l1.",
"question": "Is the following fact a landmark (must hold at some point along any plan) for the current state? Car c12 is at location l0"
}
Landmarks
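For a STRIPS-style task, a fact that is false in the initial state is a landmark iff the goal becomes unreachable once every action that achieves the fact is removed. The toy sketch below applies this check with the same BFS machinery as the reachability sketches (1 car, 2 locations; ferry-position facts are ignored for brevity, and this is not ACPBench's landmark generator):

from collections import deque

def successors(state, forbidden):
    """One-step successors, skipping any action that would achieve `forbidden`."""
    ferry, carried, cars = state
    yield ("l1" if ferry == "l0" else "l0", carried, cars)      # sail
    if carried is None:
        for i, loc in enumerate(cars):
            if loc == ferry and forbidden != ("on-ferry", i):   # board achieves (on-ferry, i)
                yield (ferry, i, cars[:i] + (None,) + cars[i + 1:])
    elif forbidden != ("at", carried, ferry):                   # debark achieves (at, car, loc)
        yield (ferry, None, cars[:carried] + (ferry,) + cars[carried + 1:])

def goal_reachable(init, goal, forbidden=None):
    seen, queue = {init}, deque([init])
    while queue:
        s = queue.popleft()
        if all(s[2][c] == l for c, l in goal):
            return True
        for t in successors(s, forbidden):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return False

# Is "c0 is on the ferry" a landmark for moving c0 from l0 to l1?
init, goal = ("l0", None, ("l0",)), [(0, "l1")]
print(not goal_reachable(init, goal, forbidden=("on-ferry", 0)))   # True: it is a landmark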
Accuracy of 22 leading LLMs on the 7 ACPBench tasks (boolean as well as multi-choice questions). The best results are boldfaced, the second best are underlined, and the best among the small open-source models are highlighted with *. All models were evaluated with two in-context examples and a Chain-of-Thought prompt. The right-most column is the mean across tasks.
Evaluating leading LLMs with CoT-style prompting and two in-context examples, we found that LLAMA-3.1 405B and GPT-4o consistently outperform the other models on these tasks, although they do not always achieve the top performance. Among the smaller open-source models, Codestral 22B stands out for its exceptional performance on boolean questions, while Mixtral 8x7B excels on multi-choice questions; both, however, lag significantly behind GPT-4o, the best performer on these tasks. Action Reachability and Validation are the most challenging tasks for LLMs. Surprisingly, the GPT-family models are not even among the top 3 on the Action Reachability task. Across all tasks, GPT-4o performs best on boolean questions and LLAMA-3.1 405B performs best on multi-choice questions.
Comparison of the 8 top-performing LLMs on multi-choice questions across the 13 domains of ACPBench. The mean performance of the top-8 models is shown as a black dotted line; it indicates that none of the domains is exceptionally easy.
A domain-wise analysis of LLM performance on multi-choice questions, shown for the top 8 performing models. The average performance of these top-8 models (the black dotted line) indicates that no specific domain is too easy across models. The Rovers, FloorTile, Blocksworld, Alfworld, and Satellite domains pose the greatest challenges to LLMs, in that order.
Recently, OpenAI released o1, a series of LLM-based reasoning models that show a significant boost over GPT-4o on benchmarks that require reasoning. A comparison of o1 with the LLMs above would not be even-handed, as o1 uses multiple interactions and generates far more tokens than the limit given to the other models (see Section 4.4 in the paper), so we do not include it in the main performance evaluation. Instead, we present the performance difference of the OpenAI o1 models (with zero-shot IO and 2-shot CoT prompts) from the best-performing LLM. Our results indicate that the o1 models fail to yield performance gains on boolean questions but demonstrate notable improvements on multi-choice questions. Specifically, o1-preview consistently performs at or above the best-performing model on multi-choice questions.
Comparison of the OpenAI o1 models with the best LLM. A positive difference means the o1 model performs better than the best of the LLMs; a negative difference means it lags behind.
@inproceedings{kokel2024acp,
author = {Harsha Kokel and
Michael Katz and
Kavitha Srinivas and
Shirin Sohrabi},
title = {ACPBench: Reasoning about Action, Change, and Planning},
booktitle = {{AAAI}},
publisher = {{AAAI} Press},
year = {2024}
}