qbiocode.utils package#

Submodules#

qbiocode.utils.combine_evals_results module#

Utilities for tracking progress and combining results from interrupted jobs.

This module provides functions to help manage and combine results when computational jobs are interrupted and need to be restarted. These are generic utilities that can be used with any pipeline that produces CSV output files in subdirectories.

combine_results(prev_results_dir, recent_results_dir, eval_file_prefix='Raw', results_file_prefix='Model', output_eval_file='RawDataEvaluation_Combined.csv', output_results_file='ModelResults_Combined.csv', save_intermediate=True, verbose=True)[source]#

Combine results from interrupted and resumed computational jobs.

This function merges CSV files from a previous (interrupted) job run with files from a recent (resumed) job run. It’s useful when a long-running computational job needs to be restarted and you want to combine all results.

Parameters:
  • prev_results_dir (str) – Path to the directory where the previous job stopped prematurely. Should contain subdirectories with individual result files.

  • recent_results_dir (str) – Path to the directory where the job was resumed and ran to completion. Should contain combined result files.

  • eval_file_prefix (str, optional) – Prefix of evaluation/assessment files to combine. Default is ‘Raw’.

  • results_file_prefix (str, optional) – Prefix of model results files to combine. Default is ‘Model’.

  • output_eval_file (str, optional) – Name of the combined evaluation output file. Default is ‘RawDataEvaluation_Combined.csv’.

  • output_results_file (str, optional) – Name of the combined results output file. Default is ‘ModelResults_Combined.csv’.

  • save_intermediate (bool, optional) – If True, saves intermediate combined files from previous run. Default is True.

  • verbose (bool, optional) – If True, prints shape information during processing. Default is True.

Return type:

Tuple[DataFrame, DataFrame]

Returns:

  • combined_eval_df (pd.DataFrame) – Combined dataframe of all evaluation/assessment data.

  • combined_results_df (pd.DataFrame) – Combined dataframe of all model results.

Examples

>>> from qbiocode.utils import combine_results
>>> eval_df, results_df = combine_results(
...     prev_results_dir='results/run1_interrupted',
...     recent_results_dir='results/run2_resumed'
... )
>>> print(f"Combined {len(eval_df)} evaluation records")
>>> print(f"Combined {len(results_df)} result records")
>>> # Custom file prefixes and output names
>>> eval_df, results_df = combine_results(
...     prev_results_dir='results/old',
...     recent_results_dir='results/new',
...     eval_file_prefix='Evaluation',
...     results_file_prefix='Results',
...     output_eval_file='AllEvaluations.csv',
...     output_results_file='AllResults.csv'
... )

Notes

The function expects:

  • prev_results_dir to contain subdirectories, each with individual CSV files

  • recent_results_dir to contain combined CSV files at the top level

  • Files are identified by their prefix (eval_file_prefix, results_file_prefix)

track_progress(input_dataset_dir, current_results_dir, completion_marker='RawDataEvaluation.csv', prefix_length=8, input_extension='csv', verbose=True)[source]#

Track progress of a computational job by checking for completed datasets.

This function scans the results directory for completed datasets (identified by the presence of a specific marker file) and compares against the total number of input datasets to determine how many remain to be processed.

Parameters:
  • input_dataset_dir (str) – Path to the directory containing input datasets.

  • current_results_dir (str) – Path to the directory containing outputs of the current job.

  • completion_marker (str, optional) – Name of the file that indicates a dataset has been fully processed. Default is ‘RawDataEvaluation.csv’.

  • prefix_length (int, optional) – Number of characters to skip from the beginning of directory names when extracting dataset identifiers. Default is 8 (e.g., skips ‘dataset_’ prefix).

  • input_extension (str, optional) – File extension of input datasets (without dot). Default is ‘csv’.

  • verbose (bool, optional) – If True, prints progress information. Default is True.

Return type:

Tuple[List[str], int, int]

Returns:

  • completed_datasets (List[str]) – List of dataset identifiers that have been completed.

  • num_completed (int) – Number of completed datasets.

  • num_remaining (int) – Number of datasets remaining to be processed.

Examples

>>> from qbiocode.utils import track_progress
>>> completed, done, remaining = track_progress(
...     input_dataset_dir='data/inputs',
...     current_results_dir='results/run1'
... )
The completed datasets are: ['dataset1', 'dataset2']
You have finished running program on 2 out of a total of 10 input datasets.
You have 8 input datasets left before program finishes.
>>> # Custom completion marker
>>> completed, done, remaining = track_progress(
...     input_dataset_dir='data/inputs',
...     current_results_dir='results/run1',
...     completion_marker='final_output.csv',
...     prefix_length=0  # No prefix to skip
... )

qbiocode.utils.dataset_checkpoint module#

Checkpoint and restart utilities for resuming interrupted batch processing jobs.

This module provides functions to identify completed datasets from previous runs, enabling efficient restart of interrupted batch processing workflows.

checkpoint_restart(previous_results_dir, completion_marker='RawDataEvaluation.csv', prefix_length=8, verbose=False)[source]#

Identify completed datasets from a previous run to enable checkpoint restart.

This function scans a results directory to find which datasets were fully processed in a previous run by checking for the presence of a completion marker file. This allows you to resume interrupted batch processing jobs without reprocessing completed datasets.

The function assumes that each dataset has its own subdirectory in the results directory, and that a specific file (completion marker) is created when processing completes successfully.

Parameters:
  • previous_results_dir (str) – Path to the directory containing results from the previous (interrupted) run. Each subdirectory should correspond to one dataset.

  • completion_marker (str, optional) – Name of the file that indicates successful completion of a dataset. Default is ‘RawDataEvaluation.csv’ (used by QProfiler).

  • prefix_length (int, optional) – Number of characters to strip from the beginning of directory names to get the dataset name. Default is 8 (strips ‘dataset_’ prefix used by QProfiler). Set to 0 to use the full directory name.

  • verbose (bool, optional) – If True, print the list of completed datasets and count. Default is False.

Returns:

List of dataset names that were fully processed in the previous run. These can be excluded when restarting the batch job.

Return type:

List[str]

Examples

Basic usage with QProfiler default settings:

>>> completed = checkpoint_restart('/path/to/previous_results')
>>> print(f"Found {len(completed)} completed datasets")

Resume processing only incomplete datasets:

>>> import os
>>> all_datasets = [f for f in os.listdir('/path/to/data') if f.endswith('.csv')]
>>> completed = checkpoint_restart('/path/to/previous_results')
>>> remaining = [d for d in all_datasets if d not in completed]
>>> print(f"Need to process {len(remaining)} more datasets")

Custom completion marker and no prefix stripping:

>>> completed = checkpoint_restart(
...     '/path/to/results',
...     completion_marker='ModelResults.csv',
...     prefix_length=0,
...     verbose=True
... )

Integration with QProfiler batch processing:

>>> from qbiocode.utils.dataset_checkpoint import checkpoint_restart
>>>
>>> # Get list of completed datasets from previous run
>>> completed_datasets = checkpoint_restart(
...     previous_results_dir='./previous_run_results',
...     verbose=True
... )
>>>
>>> # Get all datasets to process
>>> all_datasets = [f.replace('.csv', '') for f in os.listdir('./data')
...                 if f.endswith('.csv')]
>>>
>>> # Filter to only incomplete datasets
>>> datasets_to_process = [d for d in all_datasets if d not in completed_datasets]
>>>
>>> # Run QProfiler only on remaining datasets
>>> # (use datasets_to_process in your batch processing loop)

Notes

  • The function only checks for the presence of the completion marker file, not its contents or validity

  • When restarting, you may need to manually combine results from the previous and current runs

  • Directory names are expected to have a consistent prefix (e.g., ‘dataset_’) that can be stripped using the prefix_length parameter

  • Non-directory entries in previous_results_dir are ignored

See also

qbiocode.evaluation.model_run

Main QProfiler batch processing function

qbiocode.utils.find_duplicates module#

File duplicate detection utilities for identifying identical files in directories.

This module provides functions to find duplicate files based on content comparison, useful for cleaning up redundant configuration files or identifying duplicate datasets.

find_duplicate_files(directory, file_pattern=None, ignore_empty_lines=True, case_sensitive=True, verbose=False)[source]#

Find files with identical content in a directory.

Scans the specified directory for files and compares their content line by line. Identifies files that have identical content, even if they have different names. Optionally filters files by pattern and provides various comparison options.

This is particularly useful for:

  • Finding duplicate configuration files (e.g., YAML, JSON)

  • Identifying redundant experiment configurations

  • Cleaning up duplicate datasets before batch processing

  • Validating file uniqueness in automated workflows

Parameters:
  • directory (str) – Path to the directory to search for duplicate files.

  • file_pattern (str, optional) – File extension or pattern to filter (e.g., ‘.yaml’, ‘.csv’, ‘.txt’). If None, all files are compared. Default is None.

  • ignore_empty_lines (bool, optional) – If True, empty lines are ignored during comparison. Default is True.

  • case_sensitive (bool, optional) – If True, comparison is case-sensitive. Default is True.

  • verbose (bool, optional) – If True, print progress information during comparison. Default is False.

Returns:

List of tuples, where each tuple contains paths of two duplicate files. Returns empty list if no duplicates are found.

Return type:

List[Tuple[str, str]]

Raises:
  • FileNotFoundError – If the specified directory does not exist.

  • NotADirectoryError – If the specified path is not a directory.

  • PermissionError – If files cannot be read due to permission issues.

Examples

Find all duplicate files in a directory:

>>> duplicates = find_duplicate_files("configs/")
>>> if duplicates:
...     print(f"Found {len(duplicates)} duplicate pairs")

Find duplicate YAML configuration files:

>>> duplicates = find_duplicate_files(
...     "configs/qml_gridsearch/",
...     file_pattern='.yaml',
...     verbose=True
... )
>>> for file1, file2 in duplicates:
...     print(f"Duplicate: {file1} == {file2}")

Case-insensitive comparison:

>>> duplicates = find_duplicate_files(
...     "data/",
...     file_pattern='.txt',
...     case_sensitive=False
... )

Integration with QProfiler workflow:

>>> # Check for duplicate configs before batch processing
>>> config_dir = "configs/experiments/"
>>> duplicates = find_duplicate_files(config_dir, file_pattern='.yaml')
>>>
>>> if duplicates:
...     print("Warning: Duplicate configurations found!")
...     for f1, f2 in duplicates:
...         print(f"  {os.path.basename(f1)} == {os.path.basename(f2)}")
...     # Optionally remove duplicates or warn user

Notes

  • Files are compared line by line after sorting (order-independent)

  • Binary files are not supported; use for text files only

  • Large files may consume significant memory during comparison

  • Symbolic links are followed and treated as regular files

  • Hidden files (starting with ‘.’) are included in comparison

See also

find_string_in_files

Search for specific strings across multiple files

checkpoint_restart

Resume interrupted batch processing jobs

qbiocode.utils.find_string module#

String search utilities for finding specific content across multiple files.

This module provides functions to search for strings or patterns in files within a directory, useful for auditing configurations, finding specific parameters, or validating file contents.

find_string_in_files(directory, search_string, file_pattern=None, case_sensitive=True, return_lines=False, verbose=True)[source]#

Search for a specific string in all files within a directory.

Scans files in the specified directory and identifies which files contain the search string. Optionally returns the matching lines with line numbers. Useful for auditing configurations, finding specific parameters, or validating settings across multiple files.

Parameters:
  • directory (str) – Path to the directory containing files to search.

  • search_string (str) – The string to search for in the files.

  • file_pattern (str, optional) – File extension or pattern to filter (e.g., ‘.yaml’, ‘.csv’, ‘.txt’). If None, all files are searched. Default is None.

  • case_sensitive (bool, optional) – If True, search is case-sensitive. Default is True.

  • return_lines (bool, optional) – If True, return matching lines with line numbers. Default is False.

  • verbose (bool, optional) – If True, print progress and results. Default is True.

Returns:

Dictionary mapping file paths to list of (line_number, line_content) tuples for files containing the search string. If return_lines is False, the list contains empty tuples.

Return type:

Dict[str, List[Tuple[int, str]]]

Raises:
  • FileNotFoundError – If the specified directory does not exist.

  • NotADirectoryError – If the specified path is not a directory.

Examples

Basic search for a string:

>>> results = find_string_in_files(
...     'configs/',
...     'embeddings: none'
... )
>>> print(f"Found in {len(results)} files")

Search with line numbers returned:

>>> results = find_string_in_files(
...     'configs/qml_gridsearch/',
...     'n_qubits: 4',
...     file_pattern='.yaml',
...     return_lines=True
... )
>>> for filepath, matches in results.items():
...     print(f"{filepath}:")
...     for line_num, line_content in matches:
...         print(f"  Line {line_num}: {line_content.strip()}")

Case-insensitive search:

>>> results = find_string_in_files(
...     'logs/',
...     'error',
...     file_pattern='.log',
...     case_sensitive=False
... )

Integration with QProfiler workflow:

>>> # Find all configs using a specific embedding
>>> config_dir = "configs/experiments/"
>>> results = find_string_in_files(
...     config_dir,
...     'embeddings: pca',
...     file_pattern='.yaml',
...     verbose=True
... )
>>>
>>> if results:
...     print(f"Found {len(results)} configs using PCA embedding")
...     for config_file in results.keys():
...         print(f"  - {os.path.basename(config_file)}")

Notes

  • Only text files are supported; binary files will be skipped

  • Large files may consume significant memory if return_lines=True

  • Symbolic links are followed and treated as regular files

  • Hidden files (starting with ‘.’) are included in search

See also

find_duplicate_files

Find files with identical content

checkpoint_restart

Resume interrupted batch processing jobs

qbiocode.utils.helper_fn module#

Helper Functions for Data Preprocessing and Model Evaluation#

This module provides utility functions for data preprocessing, feature encoding, and result presentation in machine learning workflows.

feature_encoding(feature1, sparse_output=False, feature_encoding='None')[source]#

Encode categorical features using various encoding strategies.

Transforms categorical features into numerical representations suitable for machine learning algorithms. Supports one-hot encoding, ordinal encoding, or no encoding.

Parameters:
  • feature1 (array-like of shape (n_samples,)) – Input categorical feature to be encoded. Should be a 1D array.

  • sparse_output (bool, default=False) – If True and feature_encoding=’OneHotEncoder’, returns a sparse matrix. If False, returns a dense array. Ignored for other encoding methods.

  • feature_encoding ({'None', 'OneHotEncoder', 'OrdinalEncoder'}, default='None') –

    Encoding method to apply:

    • ’None’: No encoding, returns original feature

    • ’OneHotEncoder’: Create binary columns for each category

    • ’OrdinalEncoder’: Map categories to integer values

Returns:

feature1_encoded – Encoded feature. Shape depends on encoding method:

  • ’None’: shape (n_samples, 1)

  • ’OrdinalEncoder’: shape (n_samples, 1)

  • ’OneHotEncoder’: shape (n_samples, n_categories)

Return type:

array-like

Notes

One-hot encoding creates a binary column for each unique category, useful when categories have no ordinal relationship. Ordinal encoding assigns integer values, suitable when categories have a natural order.

The function automatically reshapes the input to (-1, 1) format required by scikit-learn encoders.
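
As a plain-NumPy illustration of what the two encodings do to a 1D categorical array (the actual function delegates to scikit-learn encoders; this sketch only mirrors the behaviour described above):

```python
import numpy as np

def sketch_encode(feature, method='None'):
    arr = np.asarray(feature).ravel()
    if method == 'None':
        return arr.reshape(-1, 1)
    # Map each category to its index among the sorted unique values.
    categories, ordinal = np.unique(arr, return_inverse=True)
    ordinal = np.asarray(ordinal).ravel()
    if method == 'OrdinalEncoder':
        return ordinal.reshape(-1, 1).astype(float)
    if method == 'OneHotEncoder':
        # One binary column per category: row i selects column ordinal[i].
        return np.eye(len(categories))[ordinal]
    raise ValueError(f'unknown encoding: {method}')
```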

Examples

>>> import numpy as np
>>> from qbiocode.utils import feature_encoding
>>> categories = np.array(['A', 'B', 'C', 'A', 'B'])
>>> # One-hot encoding
>>> encoded_onehot = feature_encoding(categories, feature_encoding='OneHotEncoder')
>>> # Ordinal encoding
>>> encoded_ordinal = feature_encoding(categories, feature_encoding='OrdinalEncoder')

See also

sklearn.preprocessing.OneHotEncoder

Encode categorical features as one-hot

sklearn.preprocessing.OrdinalEncoder

Encode categorical features as integers

print_results(model, accuracy, f1, compile_time, params)[source]#

Print formatted machine learning model evaluation results.

Displays model performance metrics and parameters in a consistent, readable format. Useful for comparing multiple models during experimentation and benchmarking.

Parameters:
  • model (str) – Name or identifier of the machine learning model.

  • accuracy (float) – Accuracy score of the model, typically in range [0, 1].

  • f1 (float) – F1 score of the model, harmonic mean of precision and recall.

  • compile_time (float) – Time taken to train/compile the model, in seconds.

  • params (dict) – Dictionary of model hyperparameters and configuration settings.

Returns:

Prints results to stdout.

Return type:

None

Notes

The function formats floating-point numbers to 4 decimal places for consistency. All metrics are printed with descriptive labels.

Examples

>>> from qbiocode.utils import print_results
>>> params = {'n_estimators': 100, 'max_depth': 10}
>>> print_results('RandomForest', 0.9234, 0.9156, 2.345, params)
RandomForest Model Accuracy score: 0.9234
RandomForest Model F1 score: 0.9156
Time taken for RandomForest Model (secs): 2.3450
RandomForest Model Params:  {'n_estimators': 100, 'max_depth': 10}

See also

sklearn.metrics.accuracy_score

Compute accuracy

sklearn.metrics.f1_score

Compute F1 score

scaler_fn(X, scaling='None')[source]#

Apply scaling transformation to input data.

Scales the input data using one of three methods: no scaling, standard scaling (z-score normalization), or min-max scaling to [0, 1] range.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input data to be scaled.

  • scaling ({'None', 'StandardScaler', 'MinMaxScaler'}, default='None') –

    Scaling method to apply:

    • ’None’: No scaling, returns original data

    • ’StandardScaler’: Standardize features by removing mean and scaling to unit variance

    • ’MinMaxScaler’: Scale features to [0, 1] range

Returns:

X_scaled – Scaled data. If scaling=’None’, returns original data unchanged.

Return type:

array-like of shape (n_samples, n_features)

Notes

StandardScaler transforms data to have mean=0 and variance=1:

\[z = \frac{x - \mu}{\sigma}\]

MinMaxScaler transforms data to [0, 1] range:

\[x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}\]
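
The two formulas above can be written out directly in NumPy; this sketch applies them column-wise, as scikit-learn’s scalers do, and is an illustration rather than the package implementation:

```python
import numpy as np

def sketch_scale(X, scaling='None'):
    X = np.asarray(X, dtype=float)
    if scaling == 'StandardScaler':
        # z = (x - mu) / sigma, per feature column
        return (X - X.mean(axis=0)) / X.std(axis=0)
    if scaling == 'MinMaxScaler':
        # (x - x_min) / (x_max - x_min), per feature column
        mn, mx = X.min(axis=0), X.max(axis=0)
        return (X - mn) / (mx - mn)
    return X
```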

Examples

>>> import numpy as np
>>> from qbiocode.utils import scaler_fn
>>> X = np.array([[1, 2], [3, 4], [5, 6]])
>>> X_scaled = scaler_fn(X, scaling='StandardScaler')
>>> X_minmax = scaler_fn(X, scaling='MinMaxScaler')

See also

sklearn.preprocessing.StandardScaler

Standardize features

sklearn.preprocessing.MinMaxScaler

Scale features to a range

qbiocode.utils.ibm_account module#

get_creds(args)[source]#

This function determines the user’s IBM Quantum channel, instance, and token, using values provided within the config.yaml file or as defined within the user’s qiskit configuration file at the qiskit_json_path specified in config.yaml, and parses its contents. It returns the main items in this JSON file, such as the instance and API token, which can then be passed into the QML functions when using a real hardware backend. The function returns a dictionary with the keys ‘channel’, ‘instance’, ‘token’, and ‘url’, which can be used to instantiate the QiskitRuntimeService. If qiskit_json_path is provided, it will attempt to read the credentials from that file.

Parameters:
  • args (dict) – Arguments from the config.yaml file, including the path to the qiskit-ibm.json file (qiskit_json_path) and the credentials defined in that file: ibm_channel, ibm_instance, ibm_token, ibm_url.

Returns:

A dictionary containing the IBM Quantum credentials, including ‘channel’, ‘instance’, ‘token’, and ‘url’.

Return type:

rval (dict)

instantiate_runtime_service(args)[source]#

This function provides a quick way to instantiate QiskitRuntimeService in one place, so that a single call suffices anywhere else. It uses the get_creds function to retrieve the necessary credentials from the qiskit-ibm.json file, whose path is specified in the config.yaml file. It returns an instance of the QiskitRuntimeService class, which can be used to interact with IBM Quantum services.

Parameters:
  • args (dict) – Arguments from the config.yaml file, including the path to the qiskit-ibm.json file (qiskit_json_path) and the credentials defined in that file.

Returns:

An instance of the QiskitRuntimeService class, initialized with the credentials from the qiskit-ibm.json file or the provided arguments.

Return type:

QiskitRuntimeService

qbiocode.utils.qc_winner_finder module#

qml_winner(results_df, rawevals_df, output_dir, tag)[source]#

This function finds datasets where QML was beneficial (higher F1 scores than CML) and creates new .csv files with the relevant evaluation and performance metrics for these specific datasets, for further analysis. It also computes the best results per method across all splits and the best results per dataset. It returns two DataFrames: one with the datasets where QML methods outperformed CML methods, and another with the evaluation scores for the best QML method for each of these datasets. Both DataFrames are also saved as .csv files in the specified output directory.

Parameters:
  • results_df (pandas.DataFrame) – Dataset in pandas corresponding to ‘ModelResults.csv’

  • rawevals_df (pandas.DataFrame) – Dataset in pandas corresponding to ‘RawDataEvaluation.csv’

  • output_dir (str) – Directory in which the output .csv files are saved.

  • tag (str) – Label used when naming the saved output files.

Return type:

Tuple[DataFrame, DataFrame]

Returns:

  • qml_winners (pandas.DataFrame) – Contains the input datasets for which at least one QML method performed better than CML, with the scores of all the methods.

  • winner_eval_score (pandas.DataFrame) – Contains the input datasets, their evaluation, and the scores for the specific QML method that yielded the best score.
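
The selection logic can be sketched with pandas: take the best F1 per (dataset, method family), then keep the datasets where the best QML score beats the best CML score. The column names used here (dataset, is_qml, f1) are illustrative assumptions, not the actual ModelResults.csv schema:

```python
import pandas as pd

def sketch_qml_winners(results):
    """Return rows for datasets where the best QML F1 exceeds the best CML F1.

    `results` is assumed to have columns: dataset, is_qml (bool), f1.
    """
    # Best F1 per dataset, split by method family (columns: False=CML, True=QML).
    best = (results.groupby(['dataset', 'is_qml'])['f1']
                   .max()
                   .unstack('is_qml'))
    winners = best[best[True] > best[False]].index
    return results[results['dataset'].isin(winners)]
```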

qbiocode.utils.qutils module#

get_ansatz(ansatz_type, feat_dimension, reps=1, entanglement='linear')[source]#

This function returns an ansatz based on the specified type and parameters. It supports ‘esu2’, ‘amp’, and ‘twolocal’ ansatz types, constructing it using the specified feature dimension, number of repetitions, and entanglement type.

Parameters:
  • ansatz_type (str) – Type of the ansatz (‘esu2’, ‘amp’, or ‘twolocal’).

  • feat_dimension (int) – Number of qubits for the ansatz.

  • reps (int) – Number of repetitions for the ansatz.

  • entanglement (str) – Type of entanglement for the ansatz.

Returns:

An instance of the specified ansatz type.

Return type:

ansatz

get_backend_session(args, primitive, num_qubits)[source]#

This function gets the backend and session for the specified primitive.

Parameters:
  • args (dict) – Dictionary containing backend and other parameters.

  • primitive (str) – The type of primitive to instantiate (‘sampler’ or ‘estimator’).

  • num_qubits (int) – Number of qubits for the backend.

Returns:

  • backend – The backend instance.

  • session – The session instance.

  • prim – The instantiated primitive (Sampler or Estimator).

get_estimator(mode=None, shots=1024, resil_level=2, dd=True, dd_seq='XpXm', PT=True)[source]#

This function creates an Estimator instance with specified options.

Parameters:
  • mode (Session) – The session mode for the estimator.

  • shots (int) – Number of shots for estimation.

  • resil_level (int) – Resilience level for error suppression.

  • dd (bool) – Whether to enable dynamical decoupling.

  • dd_seq (str) – Sequence type for dynamical decoupling.

  • PT (bool) – Whether to enable pulse twirling.

Returns:

An instance of the Estimator with the specified options.

Return type:

Estimator

get_feature_map(feature_map, feat_dimension, reps=1, entanglement='linear', data_map_func=None)[source]#

This function returns a feature map based on the specified type and parameters. It supports ‘Z’, ‘ZZ’, and ‘P’ feature maps, constructing the circuit using the specified feature dimension, number of repetitions, entanglement type, and data mapping function.

Parameters:
  • feature_map (str) – Type of the feature map (‘Z’, ‘ZZ’, or ‘P’).

  • feat_dimension (int) – Number of qubits for the feature map.

  • reps (int) – Number of repetitions for the feature map.

  • entanglement (str) – Type of entanglement for the feature map.

  • data_map_func (callable, optional) – Function to map data to the feature map parameters.

Returns:

  • feature_map – An instance of the specified feature map type.

  • feat_dimension (int) – The number of qubits in the feature map.

get_observable(circuit, backend)[source]#
get_optimizer(type='COBYLA', max_iter=100, learning_rate_a=None, perturbation_gamma=None, prior_iter=0)[source]#

This function returns an optimizer based on the specified type and parameters. It supports ‘SPSA’, ‘COBYLA’, ‘GradientDescent’, and ‘L_BFGS_B’ optimizer types, constructing it using the specified maximum iterations, learning rate, perturbation gamma, and prior iterations.

Parameters:
  • type (str) – Type of the optimizer (‘SPSA’, ‘COBYLA’, ‘GradientDescent’, or ‘L_BFGS_B’).

  • max_iter (int) – Maximum number of iterations for the optimizer.

  • learning_rate_a (float, optional) – Initial learning rate for SPSA.

  • perturbation_gamma (float, optional) – Perturbation gamma for SPSA.

  • prior_iter (int) – Number of prior iterations to consider.

Returns:

An instance of the specified optimizer type.

Return type:

optimizer

get_sampler(mode=None, shots=1024, dd=True, dd_seq='XpXm', PT=True)[source]#

This function creates a Sampler instance with specified options.

Parameters:
  • mode (Session) – The session mode for the sampler.

  • shots (int) – Number of shots for sampling.

  • dd (bool) – Whether to enable dynamical decoupling.

  • dd_seq (str) – Sequence type for dynamical decoupling.

  • PT (bool) – Whether to enable pulse twirling.

Returns:

An instance of the Sampler with the specified options.

Return type:

Sampler

transpile_circuit(circuit, opt_level, backend, initial_layout, PT=False, dd_sequence='XpXm')[source]#

This function transpiles the given quantum circuit based on the optimization level and backend.

Parameters:
  • circuit (QuantumCircuit) – The quantum circuit to be transpiled.

  • opt_level (int or str) – Optimization level for transpilation.

  • backend (Backend) – The backend to which the circuit will be transpiled.

  • initial_layout (Layout) – Initial layout for the transpilation.

  • PT (bool) – Whether to apply pulse twirling. Defaults to False.

  • dd_sequence (str) – Sequence for dynamical decoupling. Defaults to ‘XpXm’.

Returns:

The transpiled quantum circuit.

Return type:

t_qc (QuantumCircuit)

Module contents#

Utilities Module for QBioCode#

This module provides helper functions and utilities for data preprocessing, model management, IBM Quantum account handling, and result analysis.

Available Functions#

  • scaler_fn: Data scaling and normalization

  • feature_encoding: Encode features for quantum circuits

  • qml_winner: Identify best performing quantum model

  • checkpoint_restart: Identify completed datasets to resume interrupted runs

  • track_progress: Track progress of dataset processing

  • combine_results: Combine evaluation results from multiple runs

  • find_duplicate_files: Find files with identical content in a directory

  • find_string_in_files: Search for strings in files

  • generate_qml_experiment_configs: Generate config files for QML grid search

  • get_creds: Get IBM Quantum credentials

  • instantiate_runtime_service: Instantiate Qiskit Runtime Service

  • get_backend_session: Get backend session for quantum execution

  • get_sampler: Get sampler primitive

  • get_estimator: Get estimator primitive

  • get_ansatz: Get quantum ansatz circuit

  • get_feature_map: Get quantum feature map

  • get_optimizer: Get classical optimizer

Usage#

>>> from qbiocode.utils import scaler_fn, feature_encoding
>>> # Scale data
>>> X_scaled = scaler_fn(X, scaling='StandardScaler')
>>> # Encode features for quantum circuits
>>> X_encoded = feature_encoding(X, feature_encoding='OneHotEncoder')

Parameters:
  • prev_results_dir (str) – Path to the directory where the previous job stopped prematurely. Should contain subdirectories with individual result files.

  • recent_results_dir (str) – Path to the directory where the job was resumed and ran to completion. Should contain combined result files.

  • eval_file_prefix (str, optional) – Prefix of evaluation/assessment files to combine. Default is ‘Raw’.

  • results_file_prefix (str, optional) – Prefix of model results files to combine. Default is ‘Model’.

  • output_eval_file (str, optional) – Name of the combined evaluation output file. Default is ‘RawDataEvaluation_Combined.csv’.

  • output_results_file (str, optional) – Name of the combined results output file. Default is ‘ModelResults_Combined.csv’.

  • save_intermediate (bool, optional) – If True, saves intermediate combined files from previous run. Default is True.

  • verbose (bool, optional) – If True, prints shape information during processing. Default is True.

Return type:

Tuple[DataFrame, DataFrame]

Returns:

  • combined_eval_df (pd.DataFrame) – Combined dataframe of all evaluation/assessment data.

  • combined_results_df (pd.DataFrame) – Combined dataframe of all model results.

Examples

>>> from qbiocode.utils import combine_results
>>> eval_df, results_df = combine_results(
...     prev_results_dir='results/run1_interrupted',
...     recent_results_dir='results/run2_resumed'
... )
>>> print(f"Combined {len(eval_df)} evaluation records")
>>> print(f"Combined {len(results_df)} result records")
>>> # Custom file prefixes and output names
>>> eval_df, results_df = combine_results(
...     prev_results_dir='results/old',
...     recent_results_dir='results/new',
...     eval_file_prefix='Evaluation',
...     results_file_prefix='Results',
...     output_eval_file='AllEvaluations.csv',
...     output_results_file='AllResults.csv'
... )

Notes

The function expects:

  • prev_results_dir to contain subdirectories, each with individual CSV files

  • recent_results_dir to contain combined CSV files at the top level

  • Files to be identified by their prefix (eval_file_prefix, results_file_prefix)

feature_encoding(feature1, sparse_output=False, feature_encoding='None')[source]#

Encode categorical features using various encoding strategies.

Transforms categorical features into numerical representations suitable for machine learning algorithms. Supports one-hot encoding, ordinal encoding, or no encoding.

Parameters:
  • feature1 (array-like of shape (n_samples,)) – Input categorical feature to be encoded. Should be a 1D array.

  • sparse_output (bool, default=False) – If True and feature_encoding=’OneHotEncoder’, returns a sparse matrix. If False, returns a dense array. Ignored for other encoding methods.

  • feature_encoding ({'None', 'OneHotEncoder', 'OrdinalEncoder'}, default='None') –

    Encoding method to apply:

    • ’None’: No encoding, returns original feature

    • ’OneHotEncoder’: Create binary columns for each category

    • ’OrdinalEncoder’: Map categories to integer values

Returns:

feature1_encoded – Encoded feature. Shape depends on encoding method:

  • ’None’: shape (n_samples, 1)

  • ’OrdinalEncoder’: shape (n_samples, 1)

  • ’OneHotEncoder’: shape (n_samples, n_categories)

Return type:

array-like

Notes

One-hot encoding creates a binary column for each unique category, useful when categories have no ordinal relationship. Ordinal encoding assigns integer values, suitable when categories have a natural order.

The function automatically reshapes the input to (-1, 1) format required by scikit-learn encoders.

Examples

>>> import numpy as np
>>> from qbiocode.utils import feature_encoding
>>> categories = np.array(['A', 'B', 'C', 'A', 'B'])
>>> # One-hot encoding
>>> encoded_onehot = feature_encoding(categories, feature_encoding='OneHotEncoder')
>>> # Ordinal encoding
>>> encoded_ordinal = feature_encoding(categories, feature_encoding='OrdinalEncoder')

See also

sklearn.preprocessing.OneHotEncoder

Encode categorical features as one-hot

sklearn.preprocessing.OrdinalEncoder

Encode categorical features as integers

find_duplicate_files(directory, file_pattern=None, ignore_empty_lines=True, case_sensitive=True, verbose=False)[source]#

Find files with identical content in a directory.

Scans the specified directory for files and compares their content line by line. Identifies files that have identical content, even if they have different names. Optionally filters files by pattern and provides various comparison options.

This is particularly useful for:

  • Finding duplicate configuration files (e.g., YAML, JSON)

  • Identifying redundant experiment configurations

  • Cleaning up duplicate datasets before batch processing

  • Validating file uniqueness in automated workflows

Parameters:
  • directory (str) – Path to the directory to search for duplicate files.

  • file_pattern (str, optional) – File extension or pattern to filter (e.g., ‘.yaml’, ‘.csv’, ‘.txt’). If None, all files are compared. Default is None.

  • ignore_empty_lines (bool, optional) – If True, empty lines are ignored during comparison. Default is True.

  • case_sensitive (bool, optional) – If True, comparison is case-sensitive. Default is True.

  • verbose (bool, optional) – If True, print progress information during comparison. Default is False.

Returns:

List of tuples, where each tuple contains paths of two duplicate files. Returns empty list if no duplicates are found.

Return type:

List[Tuple[str, str]]

Raises:
  • FileNotFoundError – If the specified directory does not exist.

  • NotADirectoryError – If the specified path is not a directory.

  • PermissionError – If files cannot be read due to permission issues.

Examples

Find all duplicate files in a directory:

>>> duplicates = find_duplicate_files("configs/")
>>> if duplicates:
...     print(f"Found {len(duplicates)} duplicate pairs")

Find duplicate YAML configuration files:

>>> duplicates = find_duplicate_files(
...     "configs/qml_gridsearch/",
...     file_pattern='.yaml',
...     verbose=True
... )
>>> for file1, file2 in duplicates:
...     print(f"Duplicate: {file1} == {file2}")

Case-insensitive comparison:

>>> duplicates = find_duplicate_files(
...     "data/",
...     file_pattern='.txt',
...     case_sensitive=False
... )

Integration with QProfiler workflow:

>>> # Check for duplicate configs before batch processing
>>> config_dir = "configs/experiments/"
>>> duplicates = find_duplicate_files(config_dir, file_pattern='.yaml')
>>>
>>> if duplicates:
...     print("Warning: Duplicate configurations found!")
...     for f1, f2 in duplicates:
...         print(f"  {os.path.basename(f1)} == {os.path.basename(f2)}")
...     # Optionally remove duplicates or warn user

Notes

  • Files are compared line by line after sorting (order-independent)

  • Binary files are not supported; use for text files only

  • Large files may consume significant memory during comparison

  • Symbolic links are followed and treated as regular files

  • Hidden files (starting with ‘.’) are included in comparison

See also

find_string_in_files

Search for specific strings across multiple files

checkpoint_restart

Resume interrupted batch processing jobs

find_string_in_files(directory, search_string, file_pattern=None, case_sensitive=True, return_lines=False, verbose=True)[source]#

Search for a specific string in all files within a directory.

Scans files in the specified directory and identifies which files contain the search string. Optionally returns the matching lines with line numbers. Useful for auditing configurations, finding specific parameters, or validating settings across multiple files.

Parameters:
  • directory (str) – Path to the directory containing files to search.

  • search_string (str) – The string to search for in the files.

  • file_pattern (str, optional) – File extension or pattern to filter (e.g., ‘.yaml’, ‘.csv’, ‘.txt’). If None, all files are searched. Default is None.

  • case_sensitive (bool, optional) – If True, search is case-sensitive. Default is True.

  • return_lines (bool, optional) – If True, return matching lines with line numbers. Default is False.

  • verbose (bool, optional) – If True, print progress and results. Default is True.

Returns:

Dictionary mapping file paths to list of (line_number, line_content) tuples for files containing the search string. If return_lines is False, the list contains empty tuples.

Return type:

Dict[str, List[Tuple[int, str]]]

Raises:
  • FileNotFoundError – If the specified directory does not exist.

  • NotADirectoryError – If the specified path is not a directory.

Examples

Basic search for a string:

>>> results = find_string_in_files(
...     'configs/',
...     'embeddings: none'
... )
>>> print(f"Found in {len(results)} files")

Search with line numbers returned:

>>> results = find_string_in_files(
...     'configs/qml_gridsearch/',
...     'n_qubits: 4',
...     file_pattern='.yaml',
...     return_lines=True
... )
>>> for filepath, matches in results.items():
...     print(f"{filepath}:")
...     for line_num, line_content in matches:
...         print(f"  Line {line_num}: {line_content.strip()}")

Case-insensitive search:

>>> results = find_string_in_files(
...     'logs/',
...     'error',
...     file_pattern='.log',
...     case_sensitive=False
... )

Integration with QProfiler workflow:

>>> # Find all configs using a specific embedding
>>> config_dir = "configs/experiments/"
>>> results = find_string_in_files(
...     config_dir,
...     'embeddings: pca',
...     file_pattern='.yaml',
...     verbose=True
... )
>>>
>>> if results:
...     print(f"Found {len(results)} configs using PCA embedding")
...     for config_file in results.keys():
...         print(f"  - {os.path.basename(config_file)}")

Notes

  • Only text files are supported; binary files will be skipped

  • Large files may consume significant memory if return_lines=True

  • Symbolic links are followed and treated as regular files

  • Hidden files (starting with ‘.’) are included in search

See also

find_duplicate_files

Find files with identical content

checkpoint_restart

Resume interrupted batch processing jobs

generate_qml_experiment_configs(template_config_path, output_dir, data_dirs, qmethods=None, reps=None, optimizers=None, entanglements=None, feature_maps=None, ansatz_types=None, n_components=None, Cs=None, max_iters=None, embeddings=None, data_sample_fraction=1.0, used_files_path=None, random_seed=None)[source]#

Generate YAML configuration files for quantum ML hyperparameter grid search.

This function creates multiple configuration files by combining different hyperparameter values for quantum machine learning models (QNN, VQC, QSVC). Each configuration file can be used with QProfiler to run systematic experiments.

Parameters:
  • template_config_path (str) – Path to the template YAML configuration file.

  • output_dir (str) – Directory where generated config files will be saved.

  • data_dirs (List[str]) – List of directories containing CSV dataset files.

  • qmethods (List[str], optional) – Quantum methods to test. Default: [‘qnn’, ‘vqc’, ‘qsvc’]

  • reps (List[int], optional) – Number of repetitions for ansatz layers. Default: [1, 2]

  • optimizers (List[str], optional) – Optimizers to use. Default: [‘COBYLA’, ‘SPSA’]

  • entanglements (List[str], optional) – Entanglement patterns. Default: [‘linear’, ‘full’]

  • feature_maps (List[str], optional) – Feature map encodings. Default: [‘Z’, ‘ZZ’]

  • ansatz_types (List[str], optional) – Ansatz types for QNN/VQC. Default: [‘amp’, ‘esu2’]

  • n_components (List[int], optional) – Number of components for dimensionality reduction. Default: [5, 10]

  • Cs (List[float], optional) – Regularization parameters for QSVC. Default: [0.1, 1, 10]

  • max_iters (List[int], optional) – Maximum iterations for optimization. Default: [100, 500]

  • embeddings (List[str], optional) – Embedding methods. Default: [‘none’, ‘pca’, ‘lle’, ‘isomap’, ‘spectral’, ‘umap’, ‘nmf’]

  • data_sample_fraction (float, optional) – Fraction of data files to use (0.0-1.0). Default: 1.0

  • used_files_path (str, optional) – Path to CSV file tracking previously used data files.

  • random_seed (int, optional) – Random seed for reproducible file sampling.

Returns:

Number of configuration files generated and path to used files CSV.

Return type:

Tuple[int, str]

Examples

>>> from qbiocode.utils import generate_qml_experiment_configs
>>>
>>> # Generate configs for quantum model grid search
>>> num_configs, used_files = generate_qml_experiment_configs(
...     template_config_path='configs/config.yaml',
...     output_dir='configs/qml_gridsearch',
...     data_dirs=['data/tutorial_test_data/lower_dim_datasets'],
...     qmethods=['qnn', 'vqc'],
...     reps=[1, 2],
...     n_components=[5, 10],
...     data_sample_fraction=0.1  # Use 10% of files for testing
... )
>>> print(f"Generated {num_configs} configuration files")

Notes

  • Quantum models (QNN, VQC, QSVC) don’t support automated grid search

  • This function generates separate config files for each hyperparameter combination

  • Run QProfiler separately for each generated config file

  • The function automatically handles model-specific constraints:
    • QSVC uses only ‘amp’ ansatz and ‘COBYLA’ optimizer

    • QNN/VQC don’t use the C parameter

  • Embedding is set to ‘none’ when n_components >= original feature count

See also

qbiocode.apps.qprofiler

Main profiling application

get_ansatz(ansatz_type, feat_dimension, reps=1, entanglement='linear')[source]#

Return an ansatz based on the specified type and parameters. Supports ‘esu2’, ‘amp’, and ‘twolocal’ ansatz types, constructed with the specified feature dimension, number of repetitions, and entanglement type.

Parameters:
  • ansatz_type (str) – Type of the ansatz (‘esu2’, ‘amp’, or ‘twolocal’).

  • feat_dimension (int) – Number of qubits for the ansatz.

  • reps (int) – Number of repetitions for the ansatz.

  • entanglement (str) – Type of entanglement for the ansatz.

Returns:

An instance of the specified ansatz type.

Return type:

ansatz
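
Examples

A usage sketch (argument values are illustrative):

>>> from qbiocode.utils import get_ansatz
>>> ansatz = get_ansatz('twolocal', feat_dimension=4, reps=2, entanglement='full')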

get_backend_session(args, primitive, num_qubits)[source]#

Get the backend and session for the specified primitive, and instantiate that primitive.

Parameters:
  • args (dict) – Dictionary containing backend and other parameters.

  • primitive (str) – The type of primitive to instantiate (‘sampler’ or ‘estimator’).

  • num_qubits (int) – Number of qubits for the backend.

Returns:

  • backend – The backend instance.

  • session – The session instance.

  • prim – The instantiated primitive (Sampler or Estimator).
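
Examples

A usage sketch (assumes args was loaded from the config.yaml file; argument values are illustrative):

>>> backend, session, sampler = get_backend_session(args, primitive='sampler', num_qubits=4)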

get_creds(args)[source]#

Determine the user’s IBM Quantum channel, instance, and token, using values provided in the config.yaml file or defined in the user’s qiskit configuration file at the qiskit_json_path specified in config.yaml, and parse its contents. The function returns a dictionary with the keys ‘channel’, ‘instance’, ‘token’, and ‘url’, which can be used to instantiate the QiskitRuntimeService and passed into the QML functions when using a real hardware backend. If qiskit_json_path is provided, the credentials are read from that file.

Parameters:
  • args (dict) – Arguments from the config.yaml file, including the path to the qiskit-ibm.json file (qiskit_json_path) and the credentials defined in this json file: ibm_channel, ibm_instance, ibm_token, ibm_url.

Returns:

A dictionary containing the IBM Quantum credentials, including ‘channel’, ‘instance’, ‘token’, and ‘url’.

Return type:

rval (dict)
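
Examples

A usage sketch (assumes args was loaded from the config.yaml file; QiskitRuntimeService is imported from qiskit_ibm_runtime):

>>> from qiskit_ibm_runtime import QiskitRuntimeService
>>> creds = get_creds(args)
>>> service = QiskitRuntimeService(channel=creds['channel'],
...                                instance=creds['instance'],
...                                token=creds['token'])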

get_estimator(mode=None, shots=1024, resil_level=2, dd=True, dd_seq='XpXm', PT=True)[source]#

This function creates an Estimator instance with specified options.

Parameters:
  • mode (Session) – The session mode for the estimator.

  • shots (int) – Number of shots for estimation.

  • resil_level (int) – Resilience level for error suppression.

  • dd (bool) – Whether to enable dynamical decoupling.

  • dd_seq (str) – Sequence type for dynamical decoupling.

  • PT (bool) – Whether to enable pulse twirling.

Returns:

An instance of the Estimator with the specified options.

Return type:

Estimator
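
Examples

A usage sketch (assumes an open Session from qiskit_ibm_runtime; argument values are illustrative):

>>> estimator = get_estimator(mode=session, shots=2048, resil_level=1, dd=True)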

get_feature_map(feature_map, feat_dimension, reps=1, entanglement='linear', data_map_func=None)[source]#

Return a feature map based on the specified type and parameters. Supports ‘Z’, ‘ZZ’, and ‘P’ feature maps, constructed with the specified feature dimension, number of repetitions, entanglement type, and data mapping function.

Parameters:
  • feature_map (str) – Type of the feature map (‘Z’, ‘ZZ’, or ‘P’).

  • feat_dimension (int) – Number of qubits for the feature map.

  • reps (int) – Number of repetitions for the feature map.

  • entanglement (str) – Type of entanglement for the feature map.

  • data_map_func (callable, optional) – Function to map data to the feature map parameters.

Returns:

  • feature_map – An instance of the specified feature map type.

  • feat_dimension (int) – The number of qubits in the feature map.
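
Examples

A usage sketch (argument values are illustrative):

>>> fmap, n_qubits = get_feature_map('ZZ', feat_dimension=4, reps=2, entanglement='linear')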

get_optimizer(type='COBYLA', max_iter=100, learning_rate_a=None, perturbation_gamma=None, prior_iter=0)[source]#

Return an optimizer based on the specified type and parameters. Supports ‘SPSA’, ‘COBYLA’, ‘GradientDescent’, and ‘L_BFGS_B’ optimizer types, constructed with the specified maximum iterations, learning rate, perturbation gamma, and prior iterations.

Parameters:
  • type (str) – Type of the optimizer (‘SPSA’, ‘COBYLA’, ‘GradientDescent’, or ‘L_BFGS_B’).

  • max_iter (int) – Maximum number of iterations for the optimizer.

  • learning_rate_a (float, optional) – Initial learning rate for SPSA.

  • perturbation_gamma (float, optional) – Perturbation gamma for SPSA.

  • prior_iter (int) – Number of prior iterations to consider.

Returns:

An instance of the specified optimizer type.

Return type:

optimizer
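
Examples

A usage sketch (SPSA-specific parameters shown; argument values are illustrative):

>>> opt = get_optimizer(type='SPSA', max_iter=200, learning_rate_a=0.1,
...                     perturbation_gamma=0.101)
>>> cobyla = get_optimizer(type='COBYLA', max_iter=500)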

get_sampler(mode=None, shots=1024, dd=True, dd_seq='XpXm', PT=True)[source]#

This function creates a Sampler instance with specified options.

Parameters:
  • mode (Session) – The session mode for the sampler.

  • shots (int) – Number of shots for sampling.

  • dd (bool) – Whether to enable dynamical decoupling.

  • dd_seq (str) – Sequence type for dynamical decoupling.

  • PT (bool) – Whether to enable pulse twirling.

Returns:

An instance of the Sampler with the specified options.

Return type:

Sampler
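
Examples

A usage sketch (assumes an open Session from qiskit_ibm_runtime; argument values are illustrative):

>>> sampler = get_sampler(mode=session, shots=4096, dd=True, dd_seq='XpXm')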

instantiate_runtime_service(args)[source]#

This function provides a single place to instantiate QiskitRuntimeService; a basic call to it can then be made anywhere else. It uses the get_creds function to retrieve the necessary credentials from the qiskit-ibm.json file, whose path is specified in the config.yaml file. It returns an instance of the QiskitRuntimeService class, which can be used to interact with IBM Quantum services.

Parameters:
  • args (dict) – Arguments from the config.yaml file, including the path to the qiskit-ibm.json file (qiskit_json_path) and the credentials defined in that json file.

Returns:

An instance of the QiskitRuntimeService class, initialized with the credentials from the qiskit-ibm.json file or the provided arguments.

Return type:

QiskitRuntimeService
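
Examples

A usage sketch (assumes args was loaded from the config.yaml file; least_busy is a standard QiskitRuntimeService method):

>>> service = instantiate_runtime_service(args)
>>> backend = service.least_busy(operational=True)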

qml_winner(results_df, rawevals_df, output_dir, tag)[source]#

This function finds datasets where QML was beneficial (higher F1 scores than CML) and creates new .csv files with the relevant evaluation and performance data for these specific datasets, for further analysis. It also computes the best results per method across all splits and the best results per dataset. It returns two DataFrames: one with the datasets where QML methods outperformed CML methods, and another with the evaluation scores for the best QML method for each of these datasets. Both DataFrames are also saved as .csv files in the specified output directory.

Parameters:
  • results_df (pandas.DataFrame) – Dataset in pandas corresponding to ‘ModelResults.csv’

  • rawevals_df (pandas.DataFrame) – Dataset in pandas corresponding to ‘RawDataEvaluation.csv’

  • output_dir (str) – Directory where the output .csv files are saved.

  • tag (str) – Label used in the names of the output files.

Returns:

  • qml_winners (pandas.DataFrame) – Contains the input datasets for which at least one QML method performed better than CML, with the scores of all the methods.

  • winner_eval_score (pandas.DataFrame) – Contains the input datasets, their evaluation, and the scores for the specific QML method that yielded the best score.
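
Examples

A usage sketch (file names follow the QProfiler outputs described above; paths are illustrative):

>>> import pandas as pd
>>> results_df = pd.read_csv('results/ModelResults.csv')
>>> rawevals_df = pd.read_csv('results/RawDataEvaluation.csv')
>>> winners, winner_eval = qml_winner(results_df, rawevals_df,
...                                   output_dir='results/qml_winners', tag='run1')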

scaler_fn(X, scaling='None')[source]#

Apply scaling transformation to input data.

Scales the input data using one of three methods: no scaling, standard scaling (z-score normalization), or min-max scaling to [0, 1] range.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input data to be scaled.

  • scaling ({'None', 'StandardScaler', 'MinMaxScaler'}, default='None') –

    Scaling method to apply:

    • ’None’: No scaling, returns original data

    • ’StandardScaler’: Standardize features by removing mean and scaling to unit variance

    • ’MinMaxScaler’: Scale features to [0, 1] range

Returns:

X_scaled – Scaled data. If scaling=’None’, returns original data unchanged.

Return type:

array-like of shape (n_samples, n_features)

Notes

StandardScaler transforms data to have mean=0 and variance=1:

\[z = \frac{x - \mu}{\sigma}\]

MinMaxScaler transforms data to [0, 1] range:

\[x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}\]

Examples

>>> import numpy as np
>>> from qbiocode.utils import scaler_fn
>>> X = np.array([[1, 2], [3, 4], [5, 6]])
>>> X_scaled = scaler_fn(X, scaling='StandardScaler')
>>> X_minmax = scaler_fn(X, scaling='MinMaxScaler')

See also

sklearn.preprocessing.StandardScaler

Standardize features

sklearn.preprocessing.MinMaxScaler

Scale features to a range

track_progress(input_dataset_dir, current_results_dir, completion_marker='RawDataEvaluation.csv', prefix_length=8, input_extension='csv', verbose=True)[source]#

Track progress of a computational job by checking for completed datasets.

This function scans the results directory for completed datasets (identified by the presence of a specific marker file) and compares against the total number of input datasets to determine how many remain to be processed.

Parameters:
  • input_dataset_dir (str) – Path to the directory containing input datasets.

  • current_results_dir (str) – Path to the directory containing outputs of the current job.

  • completion_marker (str, optional) – Name of the file that indicates a dataset has been fully processed. Default is ‘RawDataEvaluation.csv’.

  • prefix_length (int, optional) – Number of characters to skip from the beginning of directory names when extracting dataset identifiers. Default is 8 (e.g., skips ‘dataset_’ prefix).

  • input_extension (str, optional) – File extension of input datasets (without dot). Default is ‘csv’.

  • verbose (bool, optional) – If True, prints progress information. Default is True.

Return type:

Tuple[List[str], int, int]

Returns:

  • completed_datasets (List[str]) – List of dataset identifiers that have been completed.

  • num_completed (int) – Number of completed datasets.

  • num_remaining (int) – Number of datasets remaining to be processed.

Examples

>>> from qbiocode.utils import track_progress
>>> completed, done, remaining = track_progress(
...     input_dataset_dir='data/inputs',
...     current_results_dir='results/run1'
... )
The completed datasets are: ['dataset1', 'dataset2']
You have finished running program on 2 out of a total of 10 input datasets.
You have 8 input datasets left before program finishes.
>>> # Custom completion marker
>>> completed, done, remaining = track_progress(
...     input_dataset_dir='data/inputs',
...     current_results_dir='results/run1',
...     completion_marker='final_output.csv',
...     prefix_length=0  # No prefix to skip
... )