
AutoPDL Tutorial

The following sections show how to use the AutoPDL optimizer, introduced by Spiess et al. (2025) in "AutoPDL: Automatic Prompt Optimization for LLM Agents" (arXiv), to produce optimized PDL programs for specific tasks. Please ensure PDL was installed with all extras, e.g.:

pip install 'prompt-declaration-language[all]'
# or from source
git clone git@github.com:IBM/prompt-declaration-language.git
cd prompt-declaration-language
pip install -e '.[all]'

To optimize a PDL program, we need the program itself, an optimizer configuration, a dataset, and an evaluator. An evaluator is a Python subclass of OptimizerEvaluator that evaluates a candidate, i.e. a generated configuration instance consisting of, for example, fewshot examples. The evaluator class follows this structure:

src/pdl/optimize/optimizer_evaluator.py
class OptimizerEvaluator(Thread):
    """Evaluates a candidate (configuration, i.e. fewshots, style) against **one** test example."""

    def __init__(
        self,
        pdl_program: Program,
        example: dict,
        candidate: dict,
        index: int,
        timeout: int,
        yield_output: bool,
        config: OptimizationConfig,
        cwd: Path,
        answer_key: str = "answer",
    ) -> None:
        super().__init__()
        self.pdl_program = pdl_program
        ...

    def get_scope(self) -> ScopeType:
        """
        Constructs a PDL scope for the candidate,
        can take self.candidate and self.config into account
        """

    def extract_answer(self, document: str) -> Any:
        """
        Extracts the final answer from the PDL result document,
        i.e. the string the PDL program returns
        """

    def answer_correct(self, document: str, answer: Any, truth: Any) -> bool:
        """
        Checks the extracted answer against the groundtruth value,
        in self.example[self.answer_key]
        """

Let's go through an example for GSM8K. Our PDL program uses different prompt patterns from the prompt library, and the variables prompt_pattern, question, model, and demonstrations are inserted at runtime by the evaluator.

examples/optimizer/gsm8k.pdl
description: Demo of template
defs:
  cot:
    import: ../../contrib/prompt_library/CoT
  react:
    import: ../../contrib/prompt_library/ReAct
  rewoo:
    import: ../../contrib/prompt_library/ReWoo
  tools:
    import: ../../contrib/prompt_library/tools
match: ${ prompt_pattern }
with:
  # CoT
  - case: cot
    then:
      text:
        - "Answer the questions to the best of your abilities.\n\n"
        - call: ${ cot.chain_of_thought }
          def: ANSWER
          contribute: []
          args:
            examples: "${ demonstrations }"
            question: "${ question|trim }"
            model: "${ model }"
        - "\nThe answer is ${ ANSWER.answer|trim }"

  # ReAct
  - case: react
    then:
      text:
        call: ${ react.react }
        args:
          task: "Question: ${ question|trim }"
          model: ${ model }
          tool_schema: ${ tools.tool_schema }
          tools: ${ tools.tools }
          trajectories: ${ demonstrations }

  # ReWOO
  - case: rewoo
    then:
      text:
        call: ${ rewoo.rewoo }
        args:
          task: ${ question|trim }
          model: ${ model }
          tool_schema: ${ tools.tool_schema }
          tools: ${ tools.tools }
          trajectories: ${ demonstrations }
          show_plans: false

We write a configuration file for the optimizer and save it as gsm8k_optimizer_config.yml. See src/pdl/optimize/config_parser.py for all fields. Note that this example uses the watsonx inference service, which requires an API key; you can also use a local model or any other inference service.

examples/optimizer/gsm8k_optimizer_config.yml
pdl_path: gsm8k.pdl # Path to the PDL file to optimize
dataset: gsm8k # Name our benchmark
budget: null # Set a budget, can be number of iterations, or a duration string e.g. "2h"
budget_growth: double # double validation set size each iteration
# or to_max: reach max_test_set_size by final iteration
initial_test_set_size: 2 # size of test set in first iteration
max_test_set_size: 10 # maximum test set size
num_candidates: 100 # how many candidates to evaluate
num_demonstrations: 5 # how many demonstrations to include per candidate
parallelism: 1 # how many threads to run evaluations across
shuffle_test: false # shuffling of test set
test_set_name: test # name of test set
train_set_name: train # name of train set
validation_set_name: validation # name of validation set
demonstrations_variable_name: demonstrations # variable name to insert demonstrations into
demonstration_columns:
  - question # column name for the question in the dataset
  - reasoning
  - answer
  - traj_keys
  - traj_values
  - rewoo_traj_keys
  - rewoo_traj_values

instance_columns:
  - question
  - reasoning

groundtruth_column: answer # column name for the ground truth in the dataset

variables: # define discrete options to sample from
  model: # set ${ model } variable
    - watsonx/meta-llama/llama-3-2-3b-instruct
  prompt_pattern: # set ${ prompt_pattern } variable to one of these
    - cot
    - react
    - rewoo
  num_demonstrations: # overrides num demonstrations above
    - 0
    - 3
    - 5
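The `variables` section defines a discrete search space, and each candidate is one assignment of these variables plus a sampled set of demonstrations. As a rough sketch of how `num_candidates` configurations might be drawn (hypothetical code, not the optimizer's actual sampling logic in src/pdl/optimize):

```python
# Hypothetical sketch of candidate sampling over the `variables` search space.
import itertools
import random

variables = {
    "model": ["watsonx/meta-llama/llama-3-2-3b-instruct"],
    "prompt_pattern": ["cot", "react", "rewoo"],
    "num_demonstrations": [0, 3, 5],
}

# All discrete combinations: 1 * 3 * 3 = 9, matching the
# "Config combinations" row in the optimizer's output table below.
combinations = [
    dict(zip(variables, values))
    for values in itertools.product(*variables.values())
]
print(len(combinations))  # → 9

rng = random.Random(0)
train_indices = list(range(6449))  # stand-in for the GSM8K train split


def sample_candidate() -> dict:
    candidate = dict(rng.choice(combinations))
    # Each candidate also carries its own sampled demonstration indices.
    candidate["demonstrations_indices"] = rng.sample(
        train_indices, candidate["num_demonstrations"]
    )
    return candidate


candidates = [sample_candidate() for _ in range(100)]  # num_candidates: 100
print(len(candidates))  # → 100
```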
Next, we implement the evaluator for GSM8K by subclassing OptimizerEvaluator:

examples/optimizer/gsm8k_evaluator.py
from typing import Any

from pdl.optimize.optimizer_evaluator import OptimizerEvaluator
from pdl.optimize.parse_number import extract_math_answer
from pdl.pdl_ast import ScopeType
from pdl.pdl_interpreter import empty_scope


class Gsm8kEvaluator(OptimizerEvaluator):
    def __init__(
        self,
        *args,
        **kwargs,
    ) -> None:
        super().__init__(*args, **kwargs)

    def get_scope(self) -> ScopeType:
        demo_var = self.config.demonstrations_variable_name

        scope = {}

        for k in self.config.variables:
            if k in self.candidate:
                scope[k] = self.candidate[k]

        match self.candidate.get("prompt_pattern", None):
            case "cot":
                scope[demo_var] = [
                    {
                        "question": q["question"],
                        "reasoning": q["reasoning"],
                        "answer": str(q["answer"]),
                    }
                    for q in self.candidate[demo_var]
                ]
            case "react":
                scope[demo_var] = [
                    [
                        {key: value}
                        for key, value in zip(
                            q["traj_keys"],
                            q["traj_values"],
                            strict=True,
                        )
                    ]
                    for q in self.candidate[demo_var]
                ]
            case "rewoo":
                scope[demo_var] = [
                    [
                        {key: value}
                        for key, value in zip(
                            q["rewoo_traj_keys"],
                            q["rewoo_traj_values"],
                            strict=True,
                        )
                    ]
                    for q in self.candidate[demo_var]
                ]
            case _:
                pass

        scope["question"] = self.example["question"]
        scope["reasoning"] = self.example["reasoning"]
        return empty_scope | scope

    def score(self, document: str, ground_truth: Any) -> float:
        answer = extract_math_answer(document)
        return float(answer == ground_truth or document.endswith(f" {ground_truth}"))

An example script that runs the optimization process is provided in examples/optimizer/optimize.py. Usage:

python optimize.py optimize -h
usage: optimize.py optimize [-h] --config CONFIG --dataset-path DATASET_PATH [--experiments-path EXPERIMENTS_PATH]
                            [--yield_output | --no-yield_output] [--dry | --no-dry]
                            pdl_file

We also need a dataset to optimize against, with train, test, and validation splits. To produce such a dataset, we can use Hugging Face Datasets' load_dataset and save_to_disk. This example requires the dataset to have columns question, reasoning, and answer, which can be created from the original openai/gsm8k dataset.
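In the original openai/gsm8k dataset, the answer field holds the reasoning followed by `#### ` and the final answer in a single string. A hedged sketch of how the reasoning and answer columns can be split out (the function name is illustrative; the actual processing lives in process_gsm8k.py and may differ):

```python
def split_gsm8k_answer(raw_answer: str) -> dict:
    """Split a raw GSM8K answer field into reasoning and final answer.

    GSM8K stores both in one string, separated by '#### '. Illustrative
    helper only; see examples/optimizer/process_gsm8k.py for the real logic.
    """
    reasoning, _, answer = raw_answer.rpartition("#### ")
    return {"reasoning": reasoning.strip(), "answer": answer.strip()}


raw = "Natalia sold 48/2 = 24 clips in May.\n#### 24"
record = split_gsm8k_answer(raw)
print(record["answer"])     # → 24
print(record["reasoning"])  # → Natalia sold 48/2 = 24 clips in May.
```

A mapping like this can be applied with `datasets.load_dataset("openai/gsm8k", "main")` followed by `.map(...)` and `.save_to_disk(...)`.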

We provide three scripts in examples/optimizer to create datasets, including the rule-based agentic trajectories: process_gsm8k.py, process_fever.py, and process_mbpp.py. They load the original datasets, process them, and save them to disk in the required format. Dataset-specific instructions may be found in the respective script files. Note that the scripts create a folder named var in the current directory, which contains the processed dataset in a format the optimizer can use. Therefore, they should be run in the root of the PDL repository.

Let's run the GSM8K dataset processing script:

python examples/optimizer/process_gsm8k.py

This should save the processed dataset to var/gsm8k_trajectified and print output like:

Saving the dataset (1/1 shards): 100%|█████████████████████████████████████████████████████████████████| 6449/6449 [00:00<00:00, 557195.73 examples/s]
Saving the dataset (1/1 shards): 100%|█████████████████████████████████████████████████████████████████| 1319/1319 [00:00<00:00, 363559.64 examples/s]
Saving the dataset (1/1 shards): 100%|█████████████████████████████████████████████████████████████████| 1024/1024 [00:00<00:00, 271472.56 examples/s]
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 6449/6449 [00:00<00:00, 71242.31 examples/s]
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [00:00<00:00, 68826.30 examples/s]
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 6449/6449 [00:00<00:00, 22520.85 examples/s]
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 6449/6449 [00:00<00:00, 18186.53 examples/s]
Saving the dataset (1/1 shards): 100%|█████████████████████████████████████████████████████████████████| 6449/6449 [00:00<00:00, 698328.77 examples/s]
Saving the dataset (1/1 shards): 100%|█████████████████████████████████████████████████████████████████| 1319/1319 [00:00<00:00, 232468.57 examples/s]
Saving the dataset (1/1 shards): 100%|█████████████████████████████████████████████████████████████████| 1024/1024 [00:00<00:00, 413375.10 examples/s]
DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'reasoning', 'raw_answer', 'answer_part', 'traj_keys', 'traj_values', 'rewoo_traj_keys', 'rewoo_traj_values'],
        num_rows: 6449
    })
    test: Dataset({
        features: ['question', 'answer', 'reasoning', 'raw_answer', 'answer_part'],
        num_rows: 1319
    })
    validation: Dataset({
        features: ['question', 'answer', 'reasoning', 'raw_answer', 'answer_part'],
        num_rows: 1024
    })
})

Finally, we can run the example like so:

cd examples/optimizer
python optimize.py optimize --config gsm8k_optimizer_config.yml --dataset-path ../../var/gsm8k_trajectified

This will report details about the optimization process, such as the number of candidates evaluated. The output will look something like this:

                                           PDL Optimizer                                  pdl_optimizer.py:336
           ┌──────────────────────────────┬─────────────────────────────────────────────┐
           │ Config combinations          │ 9                                           │
           │ Max candidates               │ 100                                         │
           │ Num. candidates              │ 100                                         │
           │ Starting validation set size │ 2                                           │
           │ Max validation set size      │ 10                                          │
           │ Num. iterations              │ 7                                           │
           │ Total evaluations            │ 1,200                                       │
           │ Num. threads                 │ 1                                           │
           │ Validation set multiplier    │ 2                                           │
           │ Shuffle validation set       │ False                                       │
           │ Budget policy                │ None                                        │
           ├──────────────────────────────┼─────────────────────────────────────────────┤
           │ model                        │ ['watsonx/meta-llama/llama-3-2-3b-instruct… │
           │ prompt_pattern               │ ['cot', 'react', 'rewoo']                   │
           │ num_demonstrations           │ [0, 3, 5]                                   │
           └──────────────────────────────┴─────────────────────────────────────────────┘
                     Iteration                                                            pdl_optimizer.py:419
           ┌─────────────────────┬─────┐
           │ Index               │ 0   │
           │ Validation set size │ 2   │
           │ Num. candidates     │ 100 │
           └─────────────────────┴─────┘
                                        Evaluation                                        pdl_optimizer.py:601
           ┌────────────────────────┬──────────────────────────────────────────┐
           │ Test set size          │ 2                                        │
           ├────────────────────────┼──────────────────────────────────────────┤
           │ model                  │ watsonx/meta-llama/llama-3-2-3b-instruct │
           │ prompt_pattern         │ cot                                      │
           │ num_demonstrations     │ 0                                        │
           │ uuid                   │ enl0ertp                                 │
           │ demonstrations_indices │ 0                                        │
           │ demonstrations         │ 0                                        │
           └────────────────────────┴──────────────────────────────────────────┘
           Running without parallelism                                                              util.py:74
   0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/1,200  [ 0:00:01 < -:--:-- , ? it/s ]

Note that it is not unusual to observe PDL exceptions during the optimization process.

[15:44:14] Type errors during spec checking:
../../contrib/prompt_library/ReAct.pdl:0 -  should be an object
../../contrib/prompt_library/ReAct.pdl:0 - Type errors during spec checking:
../../contrib/prompt_library/ReAct.pdl:0 -  should be an object
Retrying:  False
Runtime FAILED and took seconds: 10.21

Such exceptions, here for example in ReAct.pdl, are caused by the typed model call at ReAct.pdl:98. If the model output cannot be parsed as JSON matching the expected type { name: string, arguments: object }, the PDL interpreter raises an exception.
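This failure mode can be reproduced in miniature: the model's raw text must parse as JSON and conform to the declared shape. The following illustrative check (not PDL's actual spec checker) shows why free-form model output fails:

```python
import json


def matches_tool_call_spec(output: str) -> bool:
    """Illustrative check: does `output` parse as JSON of shape
    { name: string, arguments: object }?"""
    try:
        value = json.loads(output)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(value, dict)
        and isinstance(value.get("name"), str)
        and isinstance(value.get("arguments"), dict)
    )


print(matches_tool_call_spec('{"name": "Search", "arguments": {"query": "x"}}'))  # → True
print(matches_tool_call_spec("I will use the Search tool."))                      # → False
```

Smaller models fail this check more often, which is why such exceptions are expected noise during optimization rather than a bug in the PDL program.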

Once the process is complete, a file optimized_gsm8k.pdl is written to the same directory as the source PDL file. This file contains the optimal configuration and is directly executable by the standard PDL interpreter. A log of the optimization process is written to experiments/ by default.