Optimization Components

Search space preparation and RAG optimization functions for the AutoRAG pipeline.

Language Detection

search_space_preparation

Attributes

LANGUAGE_MAP module-attribute

LANGUAGE_MAP: dict[str, str] = {
    "ja": "Japanese",
    "ko": "Korean",
    "zh-cn": "Chinese",
    "zh-tw": "Chinese",
    "en": "English",
    "de": "German",
    "fr": "French",
    "es": "Spanish",
    "pt": "Portuguese",
    "it": "Italian",
    "ru": "Russian",
    "ar": "Arabic",
    "hi": "Hindi",
    "th": "Thai",
    "vi": "Vietnamese",
    "pl": "Polish",
    "nl": "Dutch",
    "sv": "Swedish",
    "cs": "Czech",
    "tr": "Turkish",
}
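
How the module consumes this map is not shown on this page. As a minimal sketch, assuming a detector that returns codes such as "zh-cn", a lookup could normalize the code and return None for unmapped languages. The helper name `language_name` and the abbreviated copy of the map are illustrative and not part of the package.

```python
from typing import Optional

# Abbreviated copy of LANGUAGE_MAP for illustration; the real map has more entries.
LANGUAGE_MAP: dict[str, str] = {
    "ja": "Japanese",
    "zh-cn": "Chinese",
    "zh-tw": "Chinese",
    "en": "English",
}


def language_name(code: str) -> Optional[str]:
    """Return the display name for a language code, or None if it is unmapped.

    Hypothetical helper: strips whitespace and lower-cases the code before the lookup.
    """
    return LANGUAGE_MAP.get(code.strip().lower())


# Both Chinese variants resolve to the same display name.
print(language_name("zh-TW"))
print(language_name("zh-cn"))
```

Because "zh-cn" and "zh-tw" share one display name, the name alone cannot distinguish simplified from traditional Chinese; keep the code if that distinction matters.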

Search Space Preparation

search_space_preparation

Classes

SearchSpaceReport dataclass

SearchSpaceReport(
    search_space: dict[str, Any], selected_models: dict[str, list], detected_language: dict[str, str] | None
)

Result of the search-space preparation step.

Attributes:

  • search_space (dict[str, Any]) –

    Verbose representation of the search space, including selected model lists and non-model parameter ranges.

  • selected_models (dict[str, list]) –

    Foundation and embedding model lists that survived pre-selection.

  • detected_language (dict[str, str] | None) –

    Detected language code and name, or None when the benchmark language is English or when detection was not performed.

Functions

save_yaml

save_yaml(path: str | Path) -> None

Serialize the report to a YAML file.

The file is suitable as input for the RAG optimization step.

Parameters:

  • path (str | Path) –

    Destination file path.

Source code in ai4rag/components/optimization/search_space_preparation.py
def save_yaml(self, path: str | Path) -> None:
    """Serialize the report to a YAML file.

    The file is suitable as input for the RAG optimization step.

    Parameters
    ----------
    path
        Destination file path.
    """
    report = dict(self.search_space)
    if self.detected_language:
        report["detected_language"] = self.detected_language

    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        yml.safe_dump(report, f)
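
The file written by save_yaml is a flat mapping: the search-space entries sit at the top level, with an optional detected_language entry beside them. The sketch below mirrors that merge, and the matching pop performed by run_rag_optimization after loading, using plain dictionaries instead of YAML. The helper names are hypothetical, and the keys inside the language dictionary ("code", "name") are an assumption based on the attribute description.

```python
from typing import Any, Optional


def build_report(
    search_space: dict[str, Any],
    detected_language: Optional[dict[str, str]],
) -> dict[str, Any]:
    """Shallow-copy the search space and add the language entry only when it is set."""
    report = dict(search_space)
    if detected_language:
        report["detected_language"] = detected_language
    return report


def split_report(report: dict[str, Any]) -> tuple[dict[str, Any], Optional[dict[str, str]]]:
    """Separate the language entry from the parameter entries after loading."""
    remaining = dict(report)
    detected = remaining.pop("detected_language", None)
    return remaining, detected


# Illustrative values only; the model identifiers are made up.
space = {"foundation_model": ["model-a"], "embedding_model": ["embed-a"]}
report = build_report(space, {"code": "ja", "name": "Japanese"})
params, language = split_report(report)
```

Since every remaining top-level key is treated as a search-space parameter on the consuming side, extra keys added to the file by hand would likely be interpreted as parameters too.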

Functions

prepare_search_space_report

prepare_search_space_report(
    test_data_path: str | Path,
    extracted_text_path: str | Path,
    ogx_client: OgxClient,
    embedding_models: list[str] | None = None,
    generation_models: list[str] | None = None,
    metric: str = _DEFAULT_METRIC,
    top_n_generation: int = _DEFAULT_TOP_N_GENERATION,
    top_k_embedding: int = _DEFAULT_TOP_K_EMBEDDING,
    sample_size: int = _DEFAULT_SAMPLE_SIZE,
    random_seed: int = _DEFAULT_SEED,
) -> SearchSpaceReport

Run model pre-selection and prepare a search-space report.

Builds an AI4RAGSearchSpace from the given model lists, runs ModelsPreSelector when either model list exceeds its configured cap, detects the benchmark language, and returns a structured report.

Parameters:

  • test_data_path (str | Path) –

    Path to a JSON file containing benchmark questions and expected answers.

  • extracted_text_path (str | Path) –

    Path to a single DoclingDocument JSON file or a directory of such files.

  • ogx_client (OgxClient) –

    An authenticated OgxClient instance.

  • embedding_models (list[str] | None, default: None ) –

    Embedding model identifiers. None uses the server defaults.

  • generation_models (list[str] | None, default: None ) –

    Generation model identifiers. None uses the server defaults.

  • metric (str, default: _DEFAULT_METRIC ) –

    Quality metric for intermediate pattern evaluation. Must be one of "faithfulness", "answer_correctness", or "context_correctness".

  • top_n_generation (int, default: _DEFAULT_TOP_N_GENERATION ) –

    Maximum number of generation models to retain.

  • top_k_embedding (int, default: _DEFAULT_TOP_K_EMBEDDING ) –

    Maximum number of embedding models to retain.

  • sample_size (int, default: _DEFAULT_SAMPLE_SIZE ) –

    Number of benchmark records sampled for model pre-selection.

  • random_seed (int, default: _DEFAULT_SEED ) –

    Seed for reproducible sampling.

Returns:

  • SearchSpaceReport

    Structured report containing the verbose search space, selected models, and detected language.

Raises:

  • ValueError

    If metric is not one of the supported values.

  • TypeError

    If embedding_models or generation_models contain invalid entries.
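
The trigger for pre-selection described above reduces to one comparison: it runs when either model list is longer than its cap. A minimal sketch of that rule, with a hypothetical helper name and made-up model identifiers:

```python
def needs_preselection(
    foundation_models: list[str],
    embedding_models: list[str],
    top_n_generation: int,
    top_k_embedding: int,
) -> bool:
    """Return True when either list is strictly longer than its cap."""
    return len(foundation_models) > top_n_generation or len(embedding_models) > top_k_embedding


# Three generation models against a cap of two: pre-selection is needed,
# even though the single embedding model is within its cap.
over_cap = needs_preselection(["gen-a", "gen-b", "gen-c"], ["embed-a"], 2, 2)

# Lists exactly at their caps are kept as they are.
at_cap = needs_preselection(["gen-a", "gen-b"], ["embed-a", "embed-b"], 2, 2)

print(over_cap, at_cap)
```

When the rule fires, both lists are passed to the selector, including the one that was already within its cap.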

Source code in ai4rag/components/optimization/search_space_preparation.py
def prepare_search_space_report(  # pylint: disable=too-many-locals,too-many-arguments,too-many-positional-arguments
    test_data_path: str | Path,
    extracted_text_path: str | Path,
    ogx_client: OgxClient,
    embedding_models: list[str] | None = None,
    generation_models: list[str] | None = None,
    metric: str = _DEFAULT_METRIC,
    top_n_generation: int = _DEFAULT_TOP_N_GENERATION,
    top_k_embedding: int = _DEFAULT_TOP_K_EMBEDDING,
    sample_size: int = _DEFAULT_SAMPLE_SIZE,
    random_seed: int = _DEFAULT_SEED,
) -> SearchSpaceReport:
    """Run model pre-selection and prepare a search-space report.

    Builds an :class:`AI4RAGSearchSpace` from the given model lists, runs
    :class:`ModelsPreSelector` when the number of models exceeds the
    configured caps, detects the benchmark language, and returns a
    structured report.

    Parameters
    ----------
    test_data_path
        Path to a JSON file containing benchmark questions and expected
        answers.
    extracted_text_path
        Path to a single DoclingDocument JSON file or a directory of such
        files.
    ogx_client
        An authenticated :class:`OgxClient` instance.
    embedding_models
        Embedding model identifiers.  ``None`` uses the server defaults.
    generation_models
        Generation model identifiers.  ``None`` uses the server defaults.
    metric
        Quality metric for intermediate pattern evaluation.  Must be one
        of ``"faithfulness"``, ``"answer_correctness"``, or
        ``"context_correctness"``.
    top_n_generation
        Maximum number of generation models to retain.
    top_k_embedding
        Maximum number of embedding models to retain.
    sample_size
        Number of benchmark records sampled for model pre-selection.
    random_seed
        Seed for reproducible sampling.

    Returns
    -------
    SearchSpaceReport
        Structured report containing the verbose search space, selected
        models, and detected language.

    Raises
    ------
    ValueError
        If *metric* is not one of the supported values.
    TypeError
        If *embedding_models* or *generation_models* contain invalid entries.
    """
    if metric not in SUPPORTED_METRICS:
        raise ValueError(f"Metric {metric!r} is not supported. Supported metrics are {list(SUPPORTED_METRICS)}.")

    _validate_model_list(embedding_models, "embedding_models")
    _validate_model_list(generation_models, "generation_models")

    # Build payload and create search space via OGX
    payload: dict[str, list[dict[str, str]]] = {}
    if generation_models:
        payload["foundation_models"] = [{"model_id": gm} for gm in generation_models]
    if embedding_models:
        payload["embedding_models"] = [{"model_id": em} for em in embedding_models]

    search_space = prepare_search_space_with_ogx(payload, client=ogx_client)

    # Load benchmark data and documents
    benchmark_df = pd.read_json(Path(test_data_path))
    detected_language = _detect_benchmark_language(
        benchmark_df, llm_client=ogx_client, generation_models=generation_models
    )

    benchmark_data = BenchmarkData(benchmark_df)
    documents = load_docling_documents(extracted_text_path)

    # Run model pre-selection when the number of models exceeds the caps
    fm_values = search_space["foundation_model"].values
    em_values = search_space["embedding_model"].values

    if len(fm_values) > top_n_generation or len(em_values) > top_k_embedding:
        mps = ModelsPreSelector(
            benchmark_data=benchmark_data.get_random_sample(n_records=sample_size, random_seed=random_seed),
            documents=documents,
            foundation_models=search_space._search_space["foundation_model"].values,  # pylint: disable=protected-access
            embedding_models=search_space._search_space["embedding_model"].values,  # pylint: disable=protected-access
            metric=metric,
        )
        mps.evaluate_patterns()
        selected = mps.select_models(
            n_embedding_models=top_k_embedding,
            n_foundation_models=top_n_generation,
        )
        selected_models = {
            "foundation_model": selected["foundation_models"],
            "embedding_model": selected["embedding_models"],
        }
    else:
        selected_models = {
            "foundation_model": list(fm_values),
            "embedding_model": list(em_values),
        }

    # Build verbose representation
    verbose_repr: dict[str, Any] = {
        k: v.all_values()
        for k, v in search_space._search_space.items()  # pylint: disable=protected-access
        if k not in ("foundation_model", "embedding_model")
    }
    verbose_repr.update(selected_models)

    return SearchSpaceReport(
        search_space=verbose_repr,
        selected_models=selected_models,
        detected_language=detected_language,
    )
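
The payload construction at the top of the function can be isolated as a small helper. This sketch copies that logic into a standalone function (the name `build_payload` is hypothetical) to show that None and an empty list behave identically: the key is left out entirely, so the server-side defaults described in the parameter list apply.

```python
from typing import Optional


def build_payload(
    generation_models: Optional[list[str]] = None,
    embedding_models: Optional[list[str]] = None,
) -> dict[str, list[dict[str, str]]]:
    """Build the model payload; None or an empty list omits the corresponding key."""
    payload: dict[str, list[dict[str, str]]] = {}
    if generation_models:
        payload["foundation_models"] = [{"model_id": gm} for gm in generation_models]
    if embedding_models:
        payload["embedding_models"] = [{"model_id": em} for em in embedding_models]
    return payload


# Model identifiers below are placeholders.
print(build_payload(generation_models=["gen-a", "gen-b"]))
print(build_payload(embedding_models=[]))
```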

RAG Optimization

rag_templates_optimization

Classes

OptimizationResult dataclass

OptimizationResult(patterns: list[dict], evaluations: list)

Output of a complete RAG optimization run.

Attributes:

  • patterns (list[dict]) –

    Pattern definitions for each evaluated RAG configuration.

  • evaluations (list) –

    Raw evaluation result objects from the experiment.

Functions

run_rag_optimization

run_rag_optimization(
    extracted_text_path: str | Path,
    test_data_path: str | Path,
    search_space_report_path: str | Path,
    output_dir: str | Path,
    ogx_client: OgxClient,
    vector_io_provider_id: str,
    test_data_key: str = "",
    input_data_key: str = "",
    optimization_settings: dict | None = None,
) -> OptimizationResult

Run a full AI4RAG optimization experiment and generate output artefacts.

Orchestrates the end-to-end workflow: load documents, reconstruct the search space from a YAML report, run the experiment, then generate per-pattern outputs (pattern.json, notebooks, scripts, evaluation results).

Parameters:

  • extracted_text_path (str | Path) –

    Path to a folder of DoclingDocument JSON files (or a single file).

  • test_data_path (str | Path) –

    Path to a benchmark JSON file with questions and expected answers.

  • search_space_report_path (str | Path) –

    Path to the YAML report produced by the search-space preparation step.

  • output_dir (str | Path) –

    Root directory where per-pattern output folders are written.

  • ogx_client (OgxClient) –

    An authenticated OgxClient instance.

  • vector_io_provider_id (str) –

    Vector I/O provider identifier registered in OGX.

  • test_data_key (str, default: '' ) –

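    Object-storage key for the test data file, embedded into generated notebooks. Must be a non-empty string ending in ".json" (case-insensitive); the default empty string fails validation with a ValueError.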

  • input_data_key (str, default: '' ) –

    Object-storage key for the documents directory, embedded into generated notebooks.

  • optimization_settings (dict | None, default: None ) –

    Optional dictionary with "metric" and/or "max_number_of_rag_patterns" overrides.

Returns:

  • OptimizationResult

    Contains the list of pattern definitions and the raw evaluations.

Raises:

  • ValueError

    If test_data_key is empty or does not end with ".json", vector_io_provider_id is empty, or the optimization metric is not supported.

  • TypeError

    If optimization_settings has invalid types.
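
Two of the input rules are easy to miss: test_data_key is validated even though it has a default, and "max_number_of_rag_patterns" may arrive as a string. The sketch below reproduces both checks from the function body as standalone helpers. The helper names are hypothetical, and the helper that validates optimization_settings itself is not shown on this page, so it is not modeled here.

```python
from typing import Union


def validate_test_data_key(test_data_key: str) -> str:
    """Reject non-strings, blank strings, and keys that do not end in .json (case-insensitive)."""
    if (
        not isinstance(test_data_key, str)
        or not test_data_key.strip()
        or not test_data_key.lower().endswith(".json")
    ):
        raise ValueError("test_data_key must point to a JSON file.")
    return test_data_key


def coerce_max_patterns(value: Union[int, str]) -> int:
    """Convert a string setting such as ' 8 ' to an int; pass other values through unchanged."""
    if isinstance(value, str):
        return int(value.strip())
    return value


# Example settings dictionary; the value is given as a string on purpose.
settings = {"max_number_of_rag_patterns": " 8 "}
max_patterns = coerce_max_patterns(settings["max_number_of_rag_patterns"])

# The default empty key is rejected.
try:
    validate_test_data_key("")
    default_key_accepted = True
except ValueError:
    default_key_accepted = False
```

In practice this means callers always need to pass test_data_key explicitly, for example "benchmarks/test_data.json".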

Source code in ai4rag/components/optimization/rag_templates_optimization.py
def run_rag_optimization(  # pylint: disable=too-many-locals,too-many-arguments,too-many-positional-arguments
    extracted_text_path: str | Path,
    test_data_path: str | Path,
    search_space_report_path: str | Path,
    output_dir: str | Path,
    ogx_client: OgxClient,
    vector_io_provider_id: str,
    test_data_key: str = "",
    input_data_key: str = "",
    optimization_settings: dict | None = None,
) -> OptimizationResult:
    """Run a full AI4RAG optimization experiment and generate output artefacts.

    Orchestrates the end-to-end workflow: load documents, reconstruct the
    search space from a YAML report, run the experiment, then generate
    per-pattern outputs (``pattern.json``, notebooks, scripts, evaluation
    results).

    Parameters
    ----------
    extracted_text_path
        Path to a folder of DoclingDocument JSON files (or a single file).
    test_data_path
        Path to a benchmark JSON file with questions and expected answers.
    search_space_report_path
        Path to the YAML report produced by the search-space preparation step.
    output_dir
        Root directory where per-pattern output folders are written.
    ogx_client
        An authenticated :class:`OgxClient` instance.
    vector_io_provider_id
        Vector I/O provider identifier registered in OGX.
    test_data_key
        Object-storage key for the test data file, embedded into generated
        notebooks.
    input_data_key
        Object-storage key for the documents directory, embedded into
        generated notebooks.
    optimization_settings
        Optional dictionary with ``"metric"`` and/or
        ``"max_number_of_rag_patterns"`` overrides.

    Returns
    -------
    OptimizationResult
        Contains the list of pattern definitions and the raw evaluations.

    Raises
    ------
    ValueError
        If ``test_data_key`` does not point to a JSON file,
        ``vector_io_provider_id`` is empty, or the optimization metric is
        not supported.
    TypeError
        If ``optimization_settings`` has invalid types.
    """
    # --- Input validation ---
    if not isinstance(test_data_key, str) or not test_data_key.strip() or not test_data_key.lower().endswith(".json"):
        raise ValueError("test_data_key must point to a JSON file.")

    if not isinstance(vector_io_provider_id, str) or not vector_io_provider_id.strip():
        raise ValueError("vector_io_provider_id must be a non-empty string.")
    vector_io_provider_id = vector_io_provider_id.strip()

    settings = _validate_optimization_settings(optimization_settings)
    optimization_metric = settings.get("metric") or DEFAULT_METRIC
    if optimization_metric not in SUPPORTED_OPTIMIZATION_METRICS:
        raise ValueError(
            f"Optimization metric {optimization_metric} is not supported. "
            f"Select one of {SUPPORTED_OPTIMIZATION_METRICS}."
        )

    _register_model_yaml_constructor(ogx_client)

    documents = load_docling_documents(extracted_text_path)

    with open(search_space_report_path, "r", encoding="utf-8") as f:
        search_space_raw = yml.safe_load(f)

    detected_language: dict[str, str] | None = search_space_raw.pop("detected_language", None)

    search_space = AI4RAGSearchSpace(
        params=[Parameter(param, "C", values=values) for param, values in search_space_raw.items()]
    )

    # --- Configure experiment ---
    max_rag_patterns = settings.get("max_number_of_rag_patterns", DEFAULT_MAX_RAG_PATTERNS)
    if isinstance(max_rag_patterns, str):
        max_rag_patterns = int(max_rag_patterns.strip())
    optimizer_settings = GAMOptSettings(max_evals=max_rag_patterns)

    event_handler = KFPEventHandler()

    benchmark_data = pd.read_json(Path(test_data_path))

    if detected_language:
        _inject_language_instructions(search_space, detected_language)

    rag_exp = AI4RAGExperiment(
        client=ogx_client,
        event_handler=event_handler,
        optimizer_settings=optimizer_settings,
        search_space=search_space,
        benchmark_data=benchmark_data,
        vector_store_type="ogx",
        documents=documents,
        optimization_metric=optimization_metric,
        ogx_vector_io_provider_id=vector_io_provider_id,
    )

    # --- Run the optimization loop ---
    rag_exp.search()

    # --- Generate output artefacts ---
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    evaluations_list = list(rag_exp.results.evaluations)
    ogx_base_url = (os.environ.get("OGX_CLIENT_BASE_URL") or "").strip()

    patterns: list[dict] = []
    for pattern in event_handler.patterns:
        patt_dir = output_dir / pattern.get("payload").get("name")
        patt_dir.mkdir(parents=True, exist_ok=True)

        pattern_data = build_pattern_json(
            pattern=pattern.get("payload"),
            detected_language=detected_language,
        )

        # Generate notebooks
        generate_notebook_from_template(
            "ogx_indexing",
            pattern_data,
            patt_dir / "indexing.ipynb",
            input_data_key=input_data_key,
            ogx_base_url=ogx_base_url,
        )
        generate_notebook_from_template(
            "ogx_inference",
            pattern_data,
            patt_dir / "inference.ipynb",
            test_data_key=test_data_key,
            ogx_base_url=ogx_base_url,
        )

        # Write pattern.json
        with (patt_dir / "pattern.json").open("w+", encoding="utf-8") as f:
            json_dump(pattern_data, f, indent=2)

        # Write evaluation results
        evaluation_result_list = pattern.get("evaluation_results", [])
        with (patt_dir / "evaluation_results.json").open("w+", encoding="utf-8") as f:
            json_dump(evaluation_result_list, f, indent=2)

        patterns.append(pattern_data)

    return OptimizationResult(
        patterns=patterns,
        evaluations=evaluations_list,
    )
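
The artefact loop above produces one folder per pattern, named after the pattern. The following sketch reproduces only the JSON part of that layout with the standard library, so the resulting directory tree can be inspected without running an experiment. The event dictionaries are stand-ins for event_handler.patterns, the payload is written as-is instead of going through build_pattern_json, and the two notebooks (indexing.ipynb, inference.ipynb) are omitted.

```python
import json
import tempfile
from pathlib import Path

# Stand-in events; payload fields other than "name" are made up.
events = [
    {"payload": {"name": "Pattern1"}, "evaluation_results": [{"question": "q1", "score": 0.5}]},
    {"payload": {"name": "Pattern2"}},  # no evaluation results recorded
]


def write_pattern_outputs(events: list[dict], output_dir: Path) -> list[Path]:
    """Write pattern.json and evaluation_results.json into one folder per pattern."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    for event in events:
        payload = event["payload"]
        patt_dir = output_dir / payload["name"]
        patt_dir.mkdir(parents=True, exist_ok=True)
        with (patt_dir / "pattern.json").open("w", encoding="utf-8") as f:
            json.dump(payload, f, indent=2)
        # A missing "evaluation_results" key yields an empty list, as in the loop above.
        with (patt_dir / "evaluation_results.json").open("w", encoding="utf-8") as f:
            json.dump(event.get("evaluation_results", []), f, indent=2)
        written.append(patt_dir)
    return written


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    dirs = write_pattern_outputs(events, root)
    layout = sorted(p.relative_to(root).as_posix() for d in dirs for p in d.iterdir())

print(layout)
```

Because folder names come straight from the pattern name, two patterns with the same name would write into the same folder and overwrite each other's files.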