qbiocode.utils package#
Submodules#
qbiocode.utils.combine_evals_results module#
Utilities for tracking progress and combining results from interrupted jobs.
This module provides functions to help manage and combine results when computational jobs are interrupted and need to be restarted. These are generic utilities that can be used with any pipeline that produces CSV output files in subdirectories.
- combine_results(prev_results_dir, recent_results_dir, eval_file_prefix='Raw', results_file_prefix='Model', output_eval_file='RawDataEvaluation_Combined.csv', output_results_file='ModelResults_Combined.csv', save_intermediate=True, verbose=True)[source]#
Combine results from interrupted and resumed computational jobs.
This function merges CSV files from a previous (interrupted) job run with files from a recent (resumed) job run. It’s useful when a long-running computational job needs to be restarted and you want to combine all results.
- Parameters:
prev_results_dir (str) – Path to the directory where the previous job stopped prematurely. Should contain subdirectories with individual result files.
recent_results_dir (str) – Path to the directory where the job was resumed and ran to completion. Should contain combined result files.
eval_file_prefix (str, optional) – Prefix of evaluation/assessment files to combine. Default is ‘Raw’.
results_file_prefix (str, optional) – Prefix of model results files to combine. Default is ‘Model’.
output_eval_file (str, optional) – Name of the combined evaluation output file. Default is ‘RawDataEvaluation_Combined.csv’.
output_results_file (str, optional) – Name of the combined results output file. Default is ‘ModelResults_Combined.csv’.
save_intermediate (bool, optional) – If True, saves intermediate combined files from previous run. Default is True.
verbose (bool, optional) – If True, prints shape information during processing. Default is True.
- Return type:
Tuple[DataFrame, DataFrame]
- Returns:
combined_eval_df (pd.DataFrame) – Combined dataframe of all evaluation/assessment data.
combined_results_df (pd.DataFrame) – Combined dataframe of all model results.
Examples
>>> from qbiocode.utils import combine_results
>>> eval_df, results_df = combine_results(
...     prev_results_dir='results/run1_interrupted',
...     recent_results_dir='results/run2_resumed'
... )
>>> print(f"Combined {len(eval_df)} evaluation records")
>>> print(f"Combined {len(results_df)} result records")
>>> # Custom file prefixes and output names
>>> eval_df, results_df = combine_results(
...     prev_results_dir='results/old',
...     recent_results_dir='results/new',
...     eval_file_prefix='Evaluation',
...     results_file_prefix='Results',
...     output_eval_file='AllEvaluations.csv',
...     output_results_file='AllResults.csv'
... )
Notes
The function expects:
- prev_results_dir to contain subdirectories, each with individual CSV files
- recent_results_dir to contain combined CSV files at the top level
- Files are identified by their prefix (eval_file_prefix, results_file_prefix)
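The merge step described above can be sketched in plain pandas. This is an illustrative sketch, not the library implementation; the helper name and file names are hypothetical, but the file layout matches the expectations listed in the Notes.

```python
# Illustrative sketch (not the library code): concatenate per-subdirectory
# CSVs from an interrupted run with the combined CSV from the resumed run.
import os
import pandas as pd

def combine_csvs_sketch(prev_dir, recent_file, prefix="Raw"):
    """Concatenate CSVs matching `prefix` from each subdirectory of
    prev_dir, then append the resumed run's combined CSV."""
    frames = []
    for sub in sorted(os.listdir(prev_dir)):
        sub_path = os.path.join(prev_dir, sub)
        if not os.path.isdir(sub_path):
            continue
        for fname in os.listdir(sub_path):
            if fname.startswith(prefix) and fname.endswith(".csv"):
                frames.append(pd.read_csv(os.path.join(sub_path, fname)))
    frames.append(pd.read_csv(recent_file))
    return pd.concat(frames, ignore_index=True)
```

Using ignore_index=True gives the combined frame a clean, contiguous index, which is what you want when the pieces come from unrelated runs.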
- track_progress(input_dataset_dir, current_results_dir, completion_marker='RawDataEvaluation.csv', prefix_length=8, input_extension='csv', verbose=True)[source]#
Track progress of a computational job by checking for completed datasets.
This function scans the results directory for completed datasets (identified by the presence of a specific marker file) and compares against the total number of input datasets to determine how many remain to be processed.
- Parameters:
input_dataset_dir (str) – Path to the directory containing input datasets.
current_results_dir (str) – Path to the directory containing outputs of the current job.
completion_marker (str, optional) – Name of the file that indicates a dataset has been fully processed. Default is ‘RawDataEvaluation.csv’.
prefix_length (int, optional) – Number of characters to skip from the beginning of directory names when extracting dataset identifiers. Default is 8 (e.g., skips ‘dataset_’ prefix).
input_extension (str, optional) – File extension of input datasets (without dot). Default is ‘csv’.
verbose (bool, optional) – If True, prints progress information. Default is True.
- Return type:
Tuple[List[str], int, int]
- Returns:
completed_datasets (List[str]) – List of dataset identifiers that have been completed.
num_completed (int) – Number of completed datasets.
num_remaining (int) – Number of datasets remaining to be processed.
Examples
>>> from qbiocode.utils import track_progress
>>> completed, done, remaining = track_progress(
...     input_dataset_dir='data/inputs',
...     current_results_dir='results/run1'
... )
The completed datasets are: ['dataset1', 'dataset2']
You have finished running program on 2 out of a total of 10 input datasets.
You have 8 input datasets left before program finishes.
>>> # Custom completion marker
>>> completed, done, remaining = track_progress(
...     input_dataset_dir='data/inputs',
...     current_results_dir='results/run1',
...     completion_marker='final_output.csv',
...     prefix_length=0  # No prefix to skip
... )
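The scan described above can be sketched in plain Python. This is a sketch of the documented behavior, not the library implementation; the function name is hypothetical.

```python
# Illustrative sketch (not the library code): a completed dataset is a
# results subdirectory containing the completion marker file; its
# identifier is the directory name with the first prefix_length
# characters stripped (e.g. 'dataset_').
import os

def track_progress_sketch(input_dir, results_dir,
                          marker="RawDataEvaluation.csv",
                          prefix_length=8, ext="csv"):
    inputs = [f[:-(len(ext) + 1)] for f in os.listdir(input_dir)
              if f.endswith("." + ext)]
    completed = []
    for sub in sorted(os.listdir(results_dir)):
        sub_path = os.path.join(results_dir, sub)
        if os.path.isdir(sub_path) and marker in os.listdir(sub_path):
            completed.append(sub[prefix_length:])  # strip e.g. 'dataset_'
    return completed, len(completed), len(inputs) - len(completed)
```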
qbiocode.utils.dataset_checkpoint module#
Checkpoint and restart utilities for resuming interrupted batch processing jobs.
This module provides functions to identify completed datasets from previous runs, enabling efficient restart of interrupted batch processing workflows.
- checkpoint_restart(previous_results_dir, completion_marker='RawDataEvaluation.csv', prefix_length=8, verbose=False)[source]#
Identify completed datasets from a previous run to enable checkpoint restart.
This function scans a results directory to find which datasets were fully processed in a previous run by checking for the presence of a completion marker file. This allows you to resume interrupted batch processing jobs without reprocessing completed datasets.
The function assumes that each dataset has its own subdirectory in the results directory, and that a specific file (completion marker) is created when processing completes successfully.
- Parameters:
previous_results_dir (str) – Path to the directory containing results from the previous (interrupted) run. Each subdirectory should correspond to one dataset.
completion_marker (str, optional) – Name of the file that indicates successful completion of a dataset. Default is ‘RawDataEvaluation.csv’ (used by QProfiler).
prefix_length (int, optional) – Number of characters to strip from the beginning of directory names to get the dataset name. Default is 8 (strips ‘dataset_’ prefix used by QProfiler). Set to 0 to use the full directory name.
verbose (bool, optional) – If True, print the list of completed datasets and count. Default is False.
- Returns:
List of dataset names that were fully processed in the previous run. These can be excluded when restarting the batch job.
- Return type:
List[str]
Examples
Basic usage with QProfiler default settings:
>>> completed = checkpoint_restart('/path/to/previous_results')
>>> print(f"Found {len(completed)} completed datasets")
Resume processing only incomplete datasets:
>>> import os
>>> all_datasets = [f for f in os.listdir('/path/to/data') if f.endswith('.csv')]
>>> completed = checkpoint_restart('/path/to/previous_results')
>>> remaining = [d for d in all_datasets if d not in completed]
>>> print(f"Need to process {len(remaining)} more datasets")
Custom completion marker and no prefix stripping:
>>> completed = checkpoint_restart(
...     '/path/to/results',
...     completion_marker='ModelResults.csv',
...     prefix_length=0,
...     verbose=True
... )
Integration with QProfiler batch processing:
>>> from qbiocode.utils.dataset_checkpoint import checkpoint_restart
>>>
>>> # Get list of completed datasets from previous run
>>> completed_datasets = checkpoint_restart(
...     previous_results_dir='./previous_run_results',
...     verbose=True
... )
>>>
>>> # Get all datasets to process
>>> all_datasets = [f.replace('.csv', '') for f in os.listdir('./data')
...                 if f.endswith('.csv')]
>>>
>>> # Filter to only incomplete datasets
>>> datasets_to_process = [d for d in all_datasets if d not in completed_datasets]
>>>
>>> # Run QProfiler only on remaining datasets
>>> # (use datasets_to_process in your batch processing loop)
Notes
The function only checks for the presence of the completion marker file, not its contents or validity
When restarting, you may need to manually combine results from the previous and current runs
Directory names are expected to have a consistent prefix (e.g., ‘dataset_’) that can be stripped using the prefix_length parameter
Non-directory entries in previous_results_dir are ignored
See also
qbiocode.evaluation.model_run : Main QProfiler batch processing function
qbiocode.utils.find_duplicates module#
File duplicate detection utilities for identifying identical files in directories.
This module provides functions to find duplicate files based on content comparison, useful for cleaning up redundant configuration files or identifying duplicate datasets.
- find_duplicate_files(directory, file_pattern=None, ignore_empty_lines=True, case_sensitive=True, verbose=False)[source]#
Find files with identical content in a directory.
Scans the specified directory for files and compares their content line by line. Identifies files that have identical content, even if they have different names. Optionally filters files by pattern and provides various comparison options.
This is particularly useful for:
Finding duplicate configuration files (e.g., YAML, JSON)
Identifying redundant experiment configurations
Cleaning up duplicate datasets before batch processing
Validating file uniqueness in automated workflows
- Parameters:
directory (str) – Path to the directory to search for duplicate files.
file_pattern (str, optional) – File extension or pattern to filter (e.g., ‘.yaml’, ‘.csv’, ‘.txt’). If None, all files are compared. Default is None.
ignore_empty_lines (bool, optional) – If True, empty lines are ignored during comparison. Default is True.
case_sensitive (bool, optional) – If True, comparison is case-sensitive. Default is True.
verbose (bool, optional) – If True, print progress information during comparison. Default is False.
- Returns:
List of tuples, where each tuple contains paths of two duplicate files. Returns empty list if no duplicates are found.
- Return type:
List[Tuple[str, str]]
- Raises:
FileNotFoundError – If the specified directory does not exist.
NotADirectoryError – If the specified path is not a directory.
PermissionError – If files cannot be read due to permission issues.
Examples
Find all duplicate files in a directory:
>>> duplicates = find_duplicate_files("configs/")
>>> if duplicates:
...     print(f"Found {len(duplicates)} duplicate pairs")
Find duplicate YAML configuration files:
>>> duplicates = find_duplicate_files(
...     "configs/qml_gridsearch/",
...     file_pattern='.yaml',
...     verbose=True
... )
>>> for file1, file2 in duplicates:
...     print(f"Duplicate: {file1} == {file2}")
Case-insensitive comparison:
>>> duplicates = find_duplicate_files(
...     "data/",
...     file_pattern='.txt',
...     case_sensitive=False
... )
Integration with QProfiler workflow:
>>> # Check for duplicate configs before batch processing
>>> config_dir = "configs/experiments/"
>>> duplicates = find_duplicate_files(config_dir, file_pattern='.yaml')
>>>
>>> if duplicates:
...     print("Warning: Duplicate configurations found!")
...     for f1, f2 in duplicates:
...         print(f"  {os.path.basename(f1)} == {os.path.basename(f2)}")
...     # Optionally remove duplicates or warn user
Notes
Files are compared line by line after sorting (order-independent)
Binary files are not supported; use for text files only
Large files may consume significant memory during comparison
Symbolic links are followed and treated as regular files
Hidden files (starting with ‘.’) are included in comparison
See also
find_string_in_files : Search for specific strings across multiple files
checkpoint_restart : Resume interrupted batch processing jobs
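The order-independent comparison described in the Notes can be sketched in a few lines of plain Python. This is an illustrative sketch, not the library implementation; the function name is hypothetical.

```python
# Illustrative sketch (not the library code): normalize each file's lines
# (optionally dropping empty lines and folding case), sort them so the
# comparison is order-independent, then compare normalized contents
# pairwise.
import os
from itertools import combinations

def find_duplicates_sketch(directory, pattern=None,
                           ignore_empty_lines=True, case_sensitive=True):
    contents = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or (pattern and not name.endswith(pattern)):
            continue
        with open(path) as fh:
            lines = fh.read().splitlines()
        if ignore_empty_lines:
            lines = [ln for ln in lines if ln.strip()]
        if not case_sensitive:
            lines = [ln.lower() for ln in lines]
        contents[path] = tuple(sorted(lines))  # order-independent signature
    return [(a, b) for a, b in combinations(contents, 2)
            if contents[a] == contents[b]]
```

For large directories a real implementation would hash the normalized content instead of keeping every file's lines in memory, which is why the Notes warn about memory use.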
qbiocode.utils.find_string module#
String search utilities for finding specific content across multiple files.
This module provides functions to search for strings or patterns in files within a directory, useful for auditing configurations, finding specific parameters, or validating file contents.
- find_string_in_files(directory, search_string, file_pattern=None, case_sensitive=True, return_lines=False, verbose=True)[source]#
Search for a specific string in all files within a directory.
Scans files in the specified directory and identifies which files contain the search string. Optionally returns the matching lines with line numbers. Useful for auditing configurations, finding specific parameters, or validating settings across multiple files.
- Parameters:
directory (str) – Path to the directory containing files to search.
search_string (str) – The string to search for in the files.
file_pattern (str, optional) – File extension or pattern to filter (e.g., ‘.yaml’, ‘.csv’, ‘.txt’). If None, all files are searched. Default is None.
case_sensitive (bool, optional) – If True, search is case-sensitive. Default is True.
return_lines (bool, optional) – If True, return matching lines with line numbers. Default is False.
verbose (bool, optional) – If True, print progress and results. Default is True.
- Returns:
Dictionary mapping file paths to list of (line_number, line_content) tuples for files containing the search string. If return_lines is False, the list contains empty tuples.
- Return type:
Dict[str, List[Tuple[int, str]]]
- Raises:
FileNotFoundError – If the specified directory does not exist.
NotADirectoryError – If the specified path is not a directory.
Examples
Basic search for a string:
>>> results = find_string_in_files(
...     'configs/',
...     'embeddings: none'
... )
>>> print(f"Found in {len(results)} files")
Search with line numbers returned:
>>> results = find_string_in_files(
...     'configs/qml_gridsearch/',
...     'n_qubits: 4',
...     file_pattern='.yaml',
...     return_lines=True
... )
>>> for filepath, matches in results.items():
...     print(f"{filepath}:")
...     for line_num, line_content in matches:
...         print(f"  Line {line_num}: {line_content.strip()}")
Case-insensitive search:
>>> results = find_string_in_files(
...     'logs/',
...     'error',
...     file_pattern='.log',
...     case_sensitive=False
... )
Integration with QProfiler workflow:
>>> # Find all configs using a specific embedding
>>> config_dir = "configs/experiments/"
>>> results = find_string_in_files(
...     config_dir,
...     'embeddings: pca',
...     file_pattern='.yaml',
...     verbose=True
... )
>>>
>>> if results:
...     print(f"Found {len(results)} configs using PCA embedding")
...     for config_file in results.keys():
...         print(f"  - {os.path.basename(config_file)}")
Notes
Only text files are supported; binary files will be skipped
Large files may consume significant memory if return_lines=True
Symbolic links are followed and treated as regular files
Hidden files (starting with ‘.’) are included in search
See also
find_duplicate_files : Find files with identical content
checkpoint_restart : Resume interrupted batch processing jobs
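The search described above amounts to a line-by-line substring scan per file. The following is an illustrative sketch, not the library implementation; the function name is hypothetical.

```python
# Illustrative sketch (not the library code): scan each matching file
# line by line and record (line_number, line) for every hit.
import os

def find_string_sketch(directory, search_string, pattern=None,
                       case_sensitive=True):
    needle = search_string if case_sensitive else search_string.lower()
    results = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or (pattern and not name.endswith(pattern)):
            continue
        matches = []
        with open(path) as fh:
            for num, line in enumerate(fh, start=1):
                hay = line if case_sensitive else line.lower()
                if needle in hay:
                    matches.append((num, line.rstrip("\n")))
        if matches:
            results[path] = matches
    return results
```

Iterating over the open file handle keeps only one line in memory at a time, which matters for the large-file caveat in the Notes.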
qbiocode.utils.helper_fn module#
Helper Functions for Data Preprocessing and Model Evaluation#
This module provides utility functions for data preprocessing, feature encoding, and result presentation in machine learning workflows.
- feature_encoding(feature1, sparse_output=False, feature_encoding='None')[source]#
Encode categorical features using various encoding strategies.
Transforms categorical features into numerical representations suitable for machine learning algorithms. Supports one-hot encoding, ordinal encoding, or no encoding.
- Parameters:
feature1 (array-like of shape (n_samples,)) – Input categorical feature to be encoded. Should be a 1D array.
sparse_output (bool, default=False) – If True and feature_encoding=’OneHotEncoder’, returns a sparse matrix. If False, returns a dense array. Ignored for other encoding methods.
feature_encoding ({'None', 'OneHotEncoder', 'OrdinalEncoder'}, default='None') –
Encoding method to apply:
’None’: No encoding, returns original feature
’OneHotEncoder’: Create binary columns for each category
’OrdinalEncoder’: Map categories to integer values
- Returns:
feature1_encoded – Encoded feature. Shape depends on encoding method:
’None’: shape (n_samples, 1)
’OrdinalEncoder’: shape (n_samples, 1)
’OneHotEncoder’: shape (n_samples, n_categories)
- Return type:
array-like
Notes
One-hot encoding creates a binary column for each unique category, useful when categories have no ordinal relationship. Ordinal encoding assigns integer values, suitable when categories have a natural order.
The function automatically reshapes the input to (-1, 1) format required by scikit-learn encoders.
Examples
>>> import numpy as np
>>> from qbiocode.utils import feature_encoding
>>> categories = np.array(['A', 'B', 'C', 'A', 'B'])
>>> # One-hot encoding
>>> encoded_onehot = feature_encoding(categories, feature_encoding='OneHotEncoder')
>>> # Ordinal encoding
>>> encoded_ordinal = feature_encoding(categories, feature_encoding='OrdinalEncoder')
See also
sklearn.preprocessing.OneHotEncoder : Encode categorical features as one-hot
sklearn.preprocessing.OrdinalEncoder : Encode categorical features as integers
- print_results(model, accuracy, f1, compile_time, params)[source]#
Print formatted machine learning model evaluation results.
Displays model performance metrics and parameters in a consistent, readable format. Useful for comparing multiple models during experimentation and benchmarking.
- Parameters:
model (str) – Name or identifier of the machine learning model.
accuracy (float) – Accuracy score of the model, typically in range [0, 1].
f1 (float) – F1 score of the model, harmonic mean of precision and recall.
compile_time (float) – Time taken to train/compile the model, in seconds.
params (dict) – Dictionary of model hyperparameters and configuration settings.
- Returns:
Prints results to stdout.
- Return type:
None
Notes
The function formats floating-point numbers to 4 decimal places for consistency. All metrics are printed with descriptive labels.
Examples
>>> from qbiocode.utils import print_results
>>> params = {'n_estimators': 100, 'max_depth': 10}
>>> print_results('RandomForest', 0.9234, 0.9156, 2.345, params)
RandomForest Model Accuracy score: 0.9234
RandomForest Model F1 score: 0.9156
Time taken for RandomForest Model (secs): 2.3450
RandomForest Model Params: {'n_estimators': 100, 'max_depth': 10}
See also
sklearn.metrics.accuracy_score : Compute accuracy
sklearn.metrics.f1_score : Compute F1 score
- scaler_fn(X, scaling='None')[source]#
Apply scaling transformation to input data.
Scales the input data using one of three methods: no scaling, standard scaling (z-score normalization), or min-max scaling to [0, 1] range.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input data to be scaled.
scaling ({'None', 'StandardScaler', 'MinMaxScaler'}, default='None') –
Scaling method to apply:
’None’: No scaling, returns original data
’StandardScaler’: Standardize features by removing mean and scaling to unit variance
’MinMaxScaler’: Scale features to [0, 1] range
- Returns:
X_scaled – Scaled data. If scaling=’None’, returns original data unchanged.
- Return type:
array-like of shape (n_samples, n_features)
Notes
StandardScaler transforms data to have mean=0 and variance=1:
\[z = \frac{x - \mu}{\sigma}\]
MinMaxScaler transforms data to [0, 1] range:
\[x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}\]
Examples
>>> import numpy as np
>>> from qbiocode.utils import scaler_fn
>>> X = np.array([[1, 2], [3, 4], [5, 6]])
>>> X_scaled = scaler_fn(X, scaling='StandardScaler')
>>> X_minmax = scaler_fn(X, scaling='MinMaxScaler')
See also
sklearn.preprocessing.StandardScaler : Standardize features
sklearn.preprocessing.MinMaxScaler : Scale features to a range
qbiocode.utils.ibm_account module#
- get_creds(args)[source]#
This function determines the user’s IBM Quantum channel, instance, and token, using values provided in the config.yaml file or defined in the user’s qiskit configuration at the qiskit_json_path specified in the config.yaml file, and then parses its contents. It returns the main items in this JSON file, such as the instance and API token, which can then be passed into the QML functions when using a real hardware backend. The function returns a dictionary with the keys ‘channel’, ‘instance’, ‘token’, and ‘url’, which can be used to instantiate the QiskitRuntimeService. If qiskit_json_path is provided, it will attempt to read the credentials from that file.
- Parameters:
args (dict) – The arguments from the config.yaml file, including the path to the qiskit-ibm.json file (qiskit_json_path) and the credentials defined in that JSON file (ibm_channel, ibm_instance, ibm_token, ibm_url).
- Returns:
A dictionary containing the IBM Quantum credentials, including ‘channel’, ‘instance’, ‘token’, and ‘url’.
- Return type:
rval (dict)
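The lookup described above can be sketched with the standard json module. This is a hedged sketch, not the library implementation: the helper name is hypothetical, and the assumption that the JSON file maps an account name to a dict holding the same four keys is for illustration only.

```python
# Hedged sketch (not the library code): start from any ibm_* values in
# the config args, then let entries read from the qiskit-ibm.json file
# override them when a path is given.
import json

def get_creds_sketch(args):
    creds = {k: args.get("ibm_" + k)
             for k in ("channel", "instance", "token", "url")}
    path = args.get("qiskit_json_path")
    if path:
        with open(path) as fh:
            stored = json.load(fh)
        # Assumption: the file holds one account entry whose dict uses
        # the same four key names.
        account = next(iter(stored.values()))
        for k in creds:
            creds[k] = account.get(k, creds[k])
    return creds
```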
- instantiate_runtime_service(args)[source]#
This function provides a quick way to instantiate QiskitRuntimeService in one place; a basic call to this function can then be made from anywhere else. It uses the get_creds function to retrieve the necessary credentials from the qiskit-ibm.json file, with the file path specified in the config.yaml file. It returns an instance of the QiskitRuntimeService class, which can be used to interact with IBM Quantum services.
- Parameters:
args (dict) – The arguments from the config.yaml file, including the path to the qiskit-ibm.json file (qiskit_json_path) and the credentials defined in that JSON file.
- Returns:
An instance of the QiskitRuntimeService class, initialized with the credentials from the qiskit-ibm.json file or the provided arguments.
- Return type:
QiskitRuntimeService
qbiocode.utils.qc_winner_finder module#
- qml_winner(results_df, rawevals_df, output_dir, tag)[source]#
This function finds datasets where QML was beneficial (higher F1 scores than CML) and creates new .csv files with the relevant evaluation and performance metrics for these specific datasets, for further analysis. It also computes the best results per method across all splits and the best results per dataset. It returns two DataFrames: one with the datasets where QML methods outperformed CML methods, and another with the evaluation scores for the best QML method for each of these datasets. Both DataFrames are also saved as .csv files in the specified output directory.
- Parameters:
results_df (pandas.DataFrame) – Dataset in pandas corresponding to ‘ModelResults.csv’
rawevals_df (pandas.DataFrame) – Dataset in pandas corresponding to ‘RawDataEvaluation.csv’
- Returns:
qml_winners (pandas.DataFrame) – Contains the input datasets for which at least one QML method performed better than CML; includes the scores of all the methods.
winner_eval_score (pandas.DataFrame) – Contains the input datasets, their evaluation, and the scores for the specific QML method that yielded the best score.
- Return type:
Tuple[pandas.DataFrame, pandas.DataFrame]
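The selection step described above ("at least one QML method performed better than CML") can be sketched in pandas. This is an illustrative sketch, not the library implementation; the column names ('dataset', 'method_type', 'f1') and the helper name are hypothetical.

```python
# Hedged sketch (not the library code): keep datasets whose best QML F1
# beats their best CML F1, then return all rows for those datasets.
import pandas as pd

def qml_winners_sketch(results_df):
    # Best F1 per (dataset, method type), pivoted to columns CML / QML.
    best = (results_df.groupby(["dataset", "method_type"])["f1"]
            .max().unstack())
    winners = best[best["QML"] > best["CML"]].index
    return results_df[results_df["dataset"].isin(winners)]
```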
qbiocode.utils.qutils module#
- get_ansatz(ansatz_type, feat_dimension, reps=1, entanglement='linear')[source]#
This function returns an ansatz based on the specified type and parameters. It supports ‘esu2’, ‘amp’, and ‘twolocal’ ansatz types, constructing it using the specified feature dimension, number of repetitions, and entanglement type.
- Parameters:
ansatz_type (str) – Type of the ansatz (‘esu2’, ‘amp’, or ‘twolocal’).
feat_dimension (int) – Number of qubits for the ansatz.
reps (int) – Number of repetitions for the ansatz.
entanglement (str) – Type of entanglement for the ansatz.
- Returns:
An instance of the specified ansatz type.
- Return type:
ansatz
- get_backend_session(args, primitive, num_qubits)[source]#
This function gets the backend and session for the specified primitive.
- Parameters:
args (dict) – Dictionary containing backend and other parameters.
primitive (str) – The type of primitive to instantiate (‘sampler’ or ‘estimator’).
num_qubits (int) – Number of qubits for the backend.
- Returns:
backend – The backend instance.
session – The session instance.
prim – The instantiated primitive (Sampler or Estimator).
- get_estimator(mode=None, shots=1024, resil_level=2, dd=True, dd_seq='XpXm', PT=True)[source]#
This function creates an Estimator instance with specified options.
- Parameters:
mode (Session) – The session mode for the estimator.
shots (int) – Number of shots for estimation.
resil_level (int) – Resilience level for error suppression.
dd (bool) – Whether to enable dynamical decoupling.
dd_seq (str) – Sequence type for dynamical decoupling.
PT (bool) – Whether to enable pulse twirling.
- Returns:
An instance of the Estimator with the specified options.
- Return type:
Estimator
- get_feature_map(feature_map, feat_dimension, reps=1, entanglement='linear', data_map_func=None)[source]#
This function returns a feature map based on the specified type and parameters. It supports ‘Z’, ‘ZZ’, and ‘P’ feature maps, constructing the map using the specified feature dimension, number of repetitions, entanglement type, and data mapping function.
- Parameters:
feature_map (str) – Type of the feature map (‘Z’, ‘ZZ’, or ‘P’).
feat_dimension (int) – Number of qubits for the feature map.
reps (int) – Number of repetitions for the feature map.
entanglement (str) – Type of entanglement for the feature map.
data_map_func (callable, optional) – Function to map data to the feature map parameters.
- Returns:
feature_map – An instance of the specified feature map type.
feat_dimension (int) – The number of qubits in the feature map.
- get_optimizer(type='COBYLA', max_iter=100, learning_rate_a=None, perturbation_gamma=None, prior_iter=0)[source]#
This function returns an optimizer based on the specified type and parameters. It supports ‘SPSA’, ‘COBYLA’, ‘GradientDescent’, and ‘L_BFGS_B’ optimizer types, constructing it using the specified maximum iterations, learning rate, perturbation gamma, and prior iterations.
- Parameters:
type (str) – Type of the optimizer (‘SPSA’, ‘COBYLA’, ‘GradientDescent’, or ‘L_BFGS_B’).
max_iter (int) – Maximum number of iterations for the optimizer.
learning_rate_a (float, optional) – Initial learning rate for SPSA.
perturbation_gamma (float, optional) – Perturbation gamma for SPSA.
prior_iter (int) – Number of prior iterations to consider.
- Returns:
An instance of the specified optimizer type.
- Return type:
optimizer
- get_sampler(mode=None, shots=1024, dd=True, dd_seq='XpXm', PT=True)[source]#
This function creates a Sampler instance with specified options.
- Parameters:
mode (Session) – The session mode for the sampler.
shots (int) – Number of shots for sampling.
dd (bool) – Whether to enable dynamical decoupling.
dd_seq (str) – Sequence type for dynamical decoupling.
PT (bool) – Whether to enable pulse twirling.
- Returns:
An instance of the Sampler with the specified options.
- Return type:
Sampler
- transpile_circuit(circuit, opt_level, backend, initial_layout, PT=False, dd_sequence='XpXm')[source]#
This function transpiles the given quantum circuit based on the optimization level and backend.
- Parameters:
circuit (QuantumCircuit) – The quantum circuit to be transpiled.
opt_level (int or str) – Optimization level for transpilation.
backend (Backend) – The backend to which the circuit will be transpiled.
initial_layout (Layout) – Initial layout for the transpilation.
PT (bool) – Whether to apply pulse twirling. Defaults to False.
dd_sequence (str) – Sequence for dynamical decoupling. Defaults to ‘XpXm’.
- Returns:
The transpiled quantum circuit.
- Return type:
t_qc (QuantumCircuit)
Module contents#
Utilities Module for QBioCode#
This module provides helper functions and utilities for data preprocessing, model management, IBM Quantum account handling, and result analysis.
Available Functions#
scaler_fn: Data scaling and normalization
feature_encoding: Encode features for quantum circuits
qml_winner: Identify datasets where QML outperformed CML
checkpoint_restart: Identify completed datasets from a previous run
track_progress: Track progress of dataset processing
combine_results: Combine evaluation results from multiple runs
find_duplicate_files: Find files with identical content in a directory
find_string_in_files: Search for strings in files
generate_qml_experiment_configs: Generate config files for QML grid search
get_creds: Get IBM Quantum credentials
instantiate_runtime_service: Instantiate Qiskit Runtime Service
get_backend_session: Get backend session for quantum execution
get_sampler: Get sampler primitive
get_estimator: Get estimator primitive
get_ansatz: Get quantum ansatz circuit
get_feature_map: Get quantum feature map
get_optimizer: Get classical optimizer
Usage#
>>> from qbiocode.utils import scaler_fn, feature_encoding
>>> # Scale data
>>> X_scaled = scaler_fn(X, scaling='StandardScaler')
>>> # Encode features for quantum circuits
>>> X_encoded = feature_encoding(X, feature_encoding='OneHotEncoder')
- checkpoint_restart(previous_results_dir, completion_marker='RawDataEvaluation.csv', prefix_length=8, verbose=False)[source]#
Identify completed datasets from a previous run to enable checkpoint restart.
This function scans a results directory to find which datasets were fully processed in a previous run by checking for the presence of a completion marker file. This allows you to resume interrupted batch processing jobs without reprocessing completed datasets.
The function assumes that each dataset has its own subdirectory in the results directory, and that a specific file (completion marker) is created when processing completes successfully.
- Parameters:
previous_results_dir (str) – Path to the directory containing results from the previous (interrupted) run. Each subdirectory should correspond to one dataset.
completion_marker (str, optional) – Name of the file that indicates successful completion of a dataset. Default is ‘RawDataEvaluation.csv’ (used by QProfiler).
prefix_length (int, optional) – Number of characters to strip from the beginning of directory names to get the dataset name. Default is 8 (strips ‘dataset_’ prefix used by QProfiler). Set to 0 to use the full directory name.
verbose (bool, optional) – If True, print the list of completed datasets and count. Default is False.
- Returns:
List of dataset names that were fully processed in the previous run. These can be excluded when restarting the batch job.
- Return type:
List[str]
Examples
Basic usage with QProfiler default settings:
>>> completed = checkpoint_restart('/path/to/previous_results') >>> print(f"Found {len(completed)} completed datasets")
Resume processing only incomplete datasets:
>>> import os >>> all_datasets = [f for f in os.listdir('/path/to/data') if f.endswith('.csv')] >>> completed = checkpoint_restart('/path/to/previous_results') >>> remaining = [d for d in all_datasets if d not in completed] >>> print(f"Need to process {len(remaining)} more datasets")
Custom completion marker and no prefix stripping:
>>> completed = checkpoint_restart( ... '/path/to/results', ... completion_marker='ModelResults.csv', ... prefix_length=0, ... verbose=True ... )
Integration with QProfiler batch processing:
>>> from qbiocode.utils.dataset_checkpoint import checkpoint_restart >>> >>> # Get list of completed datasets from previous run >>> completed_datasets = checkpoint_restart( ... previous_results_dir='./previous_run_results', ... verbose=True ... ) >>> >>> # Get all datasets to process >>> all_datasets = [f.replace('.csv', '') for f in os.listdir('./data') ... if f.endswith('.csv')] >>> >>> # Filter to only incomplete datasets >>> datasets_to_process = [d for d in all_datasets if d not in completed_datasets] >>> >>> # Run QProfiler only on remaining datasets >>> # (use datasets_to_process in your batch processing loop)
Notes
The function only checks for the presence of the completion marker file, not its contents or validity
When restarting, you may need to manually combine results from the previous and current runs
Directory names are expected to have a consistent prefix (e.g., ‘dataset_’) that can be stripped using the prefix_length parameter
Non-directory entries in previous_results_dir are ignored
See also
qbiocode.evaluation.model_run – Main QProfiler batch processing function
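The completion check described above can be sketched in a few lines of standard-library Python. The function name below is illustrative, not the packaged implementation:

```python
import os

def checkpoint_restart_sketch(previous_results_dir,
                              completion_marker='RawDataEvaluation.csv',
                              prefix_length=8):
    """Return names of datasets whose completion marker file exists.

    Illustrative sketch only: checks for the marker's presence,
    not its contents or validity.
    """
    completed = []
    for entry in sorted(os.listdir(previous_results_dir)):
        subdir = os.path.join(previous_results_dir, entry)
        if not os.path.isdir(subdir):
            continue  # non-directory entries are ignored
        if os.path.exists(os.path.join(subdir, completion_marker)):
            completed.append(entry[prefix_length:])  # strip e.g. 'dataset_'
    return completed
```
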
- combine_results(prev_results_dir, recent_results_dir, eval_file_prefix='Raw', results_file_prefix='Model', output_eval_file='RawDataEvaluation_Combined.csv', output_results_file='ModelResults_Combined.csv', save_intermediate=True, verbose=True)[source]#
Combine results from interrupted and resumed computational jobs.
This function merges CSV files from a previous (interrupted) job run with files from a recent (resumed) job run. It’s useful when a long-running computational job needs to be restarted and you want to combine all results.
- Parameters:
prev_results_dir (str) – Path to the directory where the previous job stopped prematurely. Should contain subdirectories with individual result files.
recent_results_dir (str) – Path to the directory where the job was resumed and ran to completion. Should contain combined result files.
eval_file_prefix (str, optional) – Prefix of evaluation/assessment files to combine. Default is ‘Raw’.
results_file_prefix (str, optional) – Prefix of model results files to combine. Default is ‘Model’.
output_eval_file (str, optional) – Name of the combined evaluation output file. Default is ‘RawDataEvaluation_Combined.csv’.
output_results_file (str, optional) – Name of the combined results output file. Default is ‘ModelResults_Combined.csv’.
save_intermediate (bool, optional) – If True, saves intermediate combined files from previous run. Default is True.
verbose (bool, optional) – If True, prints shape information during processing. Default is True.
- Return type:
Tuple[DataFrame, DataFrame]
- Returns:
combined_eval_df (pd.DataFrame) – Combined dataframe of all evaluation/assessment data.
combined_results_df (pd.DataFrame) – Combined dataframe of all model results.
Examples
>>> from qbiocode.utils import combine_results
>>> eval_df, results_df = combine_results(
...     prev_results_dir='results/run1_interrupted',
...     recent_results_dir='results/run2_resumed'
... )
>>> print(f"Combined {len(eval_df)} evaluation records")
>>> print(f"Combined {len(results_df)} result records")
>>> # Custom file prefixes and output names
>>> eval_df, results_df = combine_results(
...     prev_results_dir='results/old',
...     recent_results_dir='results/new',
...     eval_file_prefix='Evaluation',
...     results_file_prefix='Results',
...     output_eval_file='AllEvaluations.csv',
...     output_results_file='AllResults.csv'
... )
Notes
The function expects:
- prev_results_dir to contain subdirectories, each with individual CSV files
- recent_results_dir to contain combined CSV files at the top level
- Files are identified by their prefix (eval_file_prefix, results_file_prefix)
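Given that layout, the core of the merge reduces to concatenating the per-subdirectory frames with the resumed run's combined frame. A minimal pandas sketch; the function and file names are illustrative, not the actual implementation:

```python
import glob
import os

import pandas as pd

def combine_prefixed_csvs(prev_results_dir, recent_results_dir, prefix):
    """Concatenate prefixed CSVs from the interrupted run's subdirectories
    with the combined CSV(s) from the resumed run (illustrative only)."""
    frames = []
    # Individual files live one level down in the interrupted run.
    for path in sorted(glob.glob(os.path.join(prev_results_dir, '*',
                                              prefix + '*.csv'))):
        frames.append(pd.read_csv(path))
    # The resumed run already wrote combined files at the top level.
    for path in sorted(glob.glob(os.path.join(recent_results_dir,
                                              prefix + '*.csv'))):
        frames.append(pd.read_csv(path))
    return pd.concat(frames, ignore_index=True)
```
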
- feature_encoding(feature1, sparse_output=False, feature_encoding='None')[source]#
Encode categorical features using various encoding strategies.
Transforms categorical features into numerical representations suitable for machine learning algorithms. Supports one-hot encoding, ordinal encoding, or no encoding.
- Parameters:
feature1 (array-like of shape (n_samples,)) – Input categorical feature to be encoded. Should be a 1D array.
sparse_output (bool, default=False) – If True and feature_encoding=’OneHotEncoder’, returns a sparse matrix. If False, returns a dense array. Ignored for other encoding methods.
feature_encoding ({'None', 'OneHotEncoder', 'OrdinalEncoder'}, default='None') –
Encoding method to apply:
’None’: No encoding, returns original feature
’OneHotEncoder’: Create binary columns for each category
’OrdinalEncoder’: Map categories to integer values
- Returns:
feature1_encoded – Encoded feature. Shape depends on encoding method:
’None’: shape (n_samples, 1)
’OrdinalEncoder’: shape (n_samples, 1)
’OneHotEncoder’: shape (n_samples, n_categories)
- Return type:
array-like
Notes
One-hot encoding creates a binary column for each unique category, useful when categories have no ordinal relationship. Ordinal encoding assigns integer values, suitable when categories have a natural order.
The function automatically reshapes the input to (-1, 1) format required by scikit-learn encoders.
Examples
>>> import numpy as np
>>> from qbiocode.utils import feature_encoding
>>> categories = np.array(['A', 'B', 'C', 'A', 'B'])
>>> # One-hot encoding
>>> encoded_onehot = feature_encoding(categories, feature_encoding='OneHotEncoder')
>>> # Ordinal encoding
>>> encoded_ordinal = feature_encoding(categories, feature_encoding='OrdinalEncoder')
See also
sklearn.preprocessing.OneHotEncoder – Encode categorical features as one-hot
sklearn.preprocessing.OrdinalEncoder – Encode categorical features as integers
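The two encodings can be mimicked without scikit-learn; a pure-Python sketch of the same logic (the actual function delegates to the sklearn encoders):

```python
def ordinal_encode(values):
    """Map each category to an integer, in sorted category order."""
    mapping = {c: i for i, c in enumerate(sorted(set(values)))}
    return [[mapping[v]] for v in values]  # shape (n_samples, 1)

def one_hot_encode(values):
    """One binary column per category, in sorted category order."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]
```
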
- find_duplicate_files(directory, file_pattern=None, ignore_empty_lines=True, case_sensitive=True, verbose=False)[source]#
Find files with identical content in a directory.
Scans the specified directory for files and compares their content line by line. Identifies files that have identical content, even if they have different names. Optionally filters files by pattern and provides various comparison options.
This is particularly useful for:
Finding duplicate configuration files (e.g., YAML, JSON)
Identifying redundant experiment configurations
Cleaning up duplicate datasets before batch processing
Validating file uniqueness in automated workflows
- Parameters:
directory (str) – Path to the directory to search for duplicate files.
file_pattern (str, optional) – File extension or pattern to filter (e.g., ‘.yaml’, ‘.csv’, ‘.txt’). If None, all files are compared. Default is None.
ignore_empty_lines (bool, optional) – If True, empty lines are ignored during comparison. Default is True.
case_sensitive (bool, optional) – If True, comparison is case-sensitive. Default is True.
verbose (bool, optional) – If True, print progress information during comparison. Default is False.
- Returns:
List of tuples, where each tuple contains paths of two duplicate files. Returns empty list if no duplicates are found.
- Return type:
List[Tuple[str, str]]
- Raises:
FileNotFoundError – If the specified directory does not exist.
NotADirectoryError – If the specified path is not a directory.
PermissionError – If files cannot be read due to permission issues.
Examples
Find all duplicate files in a directory:
>>> duplicates = find_duplicate_files("configs/")
>>> if duplicates:
...     print(f"Found {len(duplicates)} duplicate pairs")
Find duplicate YAML configuration files:
>>> duplicates = find_duplicate_files(
...     "configs/qml_gridsearch/",
...     file_pattern='.yaml',
...     verbose=True
... )
>>> for file1, file2 in duplicates:
...     print(f"Duplicate: {file1} == {file2}")
Case-insensitive comparison:
>>> duplicates = find_duplicate_files(
...     "data/",
...     file_pattern='.txt',
...     case_sensitive=False
... )
Integration with QProfiler workflow:
>>> # Check for duplicate configs before batch processing
>>> config_dir = "configs/experiments/"
>>> duplicates = find_duplicate_files(config_dir, file_pattern='.yaml')
>>>
>>> if duplicates:
...     print("Warning: Duplicate configurations found!")
...     for f1, f2 in duplicates:
...         print(f"  {os.path.basename(f1)} == {os.path.basename(f2)}")
...     # Optionally remove duplicates or warn user
Notes
Files are compared line by line after sorting (order-independent)
Binary files are not supported; use for text files only
Large files may consume significant memory during comparison
Symbolic links are followed and treated as regular files
Hidden files (starting with ‘.’) are included in comparison
See also
find_string_in_files – Search for specific strings across multiple files
checkpoint_restart – Resume interrupted batch processing jobs
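The order-independent, line-by-line comparison described in the Notes can be sketched with the standard library alone (illustrative only, not the packaged implementation):

```python
import os
from itertools import combinations

def normalized_lines(path, ignore_empty_lines=True, case_sensitive=True):
    """Order-independent content signature: the sorted list of lines."""
    with open(path) as fh:
        lines = [ln.rstrip('\n') for ln in fh]
    if ignore_empty_lines:
        lines = [ln for ln in lines if ln.strip()]
    if not case_sensitive:
        lines = [ln.lower() for ln in lines]
    return sorted(lines)

def find_duplicates_sketch(directory, file_pattern=None):
    """Return pairs of files whose normalized content matches."""
    paths = [os.path.join(directory, f) for f in sorted(os.listdir(directory))
             if file_pattern is None or f.endswith(file_pattern)]
    paths = [p for p in paths if os.path.isfile(p)]
    return [(a, b) for a, b in combinations(paths, 2)
            if normalized_lines(a) == normalized_lines(b)]
```

Note that sorting the lines makes files with the same lines in a different order compare as equal, matching the "order-independent" behavior noted above.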
- find_string_in_files(directory, search_string, file_pattern=None, case_sensitive=True, return_lines=False, verbose=True)[source]#
Search for a specific string in all files within a directory.
Scans files in the specified directory and identifies which files contain the search string. Optionally returns the matching lines with line numbers. Useful for auditing configurations, finding specific parameters, or validating settings across multiple files.
- Parameters:
directory (str) – Path to the directory containing files to search.
search_string (str) – The string to search for in the files.
file_pattern (str, optional) – File extension or pattern to filter (e.g., ‘.yaml’, ‘.csv’, ‘.txt’). If None, all files are searched. Default is None.
case_sensitive (bool, optional) – If True, search is case-sensitive. Default is True.
return_lines (bool, optional) – If True, return matching lines with line numbers. Default is False.
verbose (bool, optional) – If True, print progress and results. Default is True.
- Returns:
Dictionary mapping file paths to lists of (line_number, line_content) tuples for files containing the search string. If return_lines is False, the lists are empty.
- Return type:
Dict[str, List[Tuple[int, str]]]
- Raises:
FileNotFoundError – If the specified directory does not exist.
NotADirectoryError – If the specified path is not a directory.
Examples
Basic search for a string:
>>> results = find_string_in_files(
...     'configs/',
...     'embeddings: none'
... )
>>> print(f"Found in {len(results)} files")
Search with line numbers returned:
>>> results = find_string_in_files(
...     'configs/qml_gridsearch/',
...     'n_qubits: 4',
...     file_pattern='.yaml',
...     return_lines=True
... )
>>> for filepath, matches in results.items():
...     print(f"{filepath}:")
...     for line_num, line_content in matches:
...         print(f"  Line {line_num}: {line_content.strip()}")
Case-insensitive search:
>>> results = find_string_in_files(
...     'logs/',
...     'error',
...     file_pattern='.log',
...     case_sensitive=False
... )
Integration with QProfiler workflow:
>>> # Find all configs using a specific embedding
>>> config_dir = "configs/experiments/"
>>> results = find_string_in_files(
...     config_dir,
...     'embeddings: pca',
...     file_pattern='.yaml',
...     verbose=True
... )
>>>
>>> if results:
...     print(f"Found {len(results)} configs using PCA embedding")
...     for config_file in results.keys():
...         print(f"  - {os.path.basename(config_file)}")
Notes
Only text files are supported; binary files will be skipped
Large files may consume significant memory if return_lines=True
Symbolic links are followed and treated as regular files
Hidden files (starting with ‘.’) are included in search
See also
find_duplicate_files – Find files with identical content
checkpoint_restart – Resume interrupted batch processing jobs
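A minimal standard-library sketch of the search logic (illustrative, not the packaged implementation):

```python
import os

def find_string_sketch(directory, search_string, file_pattern=None,
                       case_sensitive=True, return_lines=False):
    """Map file paths to (line_number, line) matches for search_string."""
    needle = search_string if case_sensitive else search_string.lower()
    results = {}
    for name in sorted(os.listdir(directory)):
        if file_pattern is not None and not name.endswith(file_pattern):
            continue
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        matches = []
        try:
            with open(path) as fh:
                for num, line in enumerate(fh, start=1):
                    hay = line if case_sensitive else line.lower()
                    if needle in hay:
                        matches.append((num, line))
        except UnicodeDecodeError:
            continue  # skip binary files
        if matches:
            results[path] = matches if return_lines else []
    return results
```
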
- generate_qml_experiment_configs(template_config_path, output_dir, data_dirs, qmethods=None, reps=None, optimizers=None, entanglements=None, feature_maps=None, ansatz_types=None, n_components=None, Cs=None, max_iters=None, embeddings=None, data_sample_fraction=1.0, used_files_path=None, random_seed=None)[source]#
Generate YAML configuration files for quantum ML hyperparameter grid search.
This function creates multiple configuration files by combining different hyperparameter values for quantum machine learning models (QNN, VQC, QSVC). Each configuration file can be used with QProfiler to run systematic experiments.
- Parameters:
template_config_path (str) – Path to the template YAML configuration file.
output_dir (str) – Directory where generated config files will be saved.
data_dirs (List[str]) – List of directories containing CSV dataset files.
qmethods (List[str], optional) – Quantum methods to test. Default: [‘qnn’, ‘vqc’, ‘qsvc’]
reps (List[int], optional) – Number of repetitions for ansatz layers. Default: [1, 2]
optimizers (List[str], optional) – Optimizers to use. Default: [‘COBYLA’, ‘SPSA’]
entanglements (List[str], optional) – Entanglement patterns. Default: [‘linear’, ‘full’]
feature_maps (List[str], optional) – Feature map encodings. Default: [‘Z’, ‘ZZ’]
ansatz_types (List[str], optional) – Ansatz types for QNN/VQC. Default: [‘amp’, ‘esu2’]
n_components (List[int], optional) – Number of components for dimensionality reduction. Default: [5, 10]
Cs (List[float], optional) – Regularization parameters for QSVC. Default: [0.1, 1, 10]
max_iters (List[int], optional) – Maximum iterations for optimization. Default: [100, 500]
embeddings (List[str], optional) – Embedding methods. Default: [‘none’, ‘pca’, ‘lle’, ‘isomap’, ‘spectral’, ‘umap’, ‘nmf’]
data_sample_fraction (float, optional) – Fraction of data files to use (0.0-1.0). Default: 1.0
used_files_path (str, optional) – Path to CSV file tracking previously used data files.
random_seed (int, optional) – Random seed for reproducible file sampling.
- Returns:
Number of configuration files generated and path to used files CSV.
- Return type:
Tuple[int, str]
Examples
>>> from qbiocode.utils import generate_qml_experiment_configs
>>>
>>> # Generate configs for quantum model grid search
>>> num_configs, used_files = generate_qml_experiment_configs(
...     template_config_path='configs/config.yaml',
...     output_dir='configs/qml_gridsearch',
...     data_dirs=['data/tutorial_test_data/lower_dim_datasets'],
...     qmethods=['qnn', 'vqc'],
...     reps=[1, 2],
...     n_components=[5, 10],
...     data_sample_fraction=0.1  # Use 10% of files for testing
... )
>>> print(f"Generated {num_configs} configuration files")
Notes
Quantum models (QNN, VQC, QSVC) don’t support automated grid search
This function generates separate config files for each hyperparameter combination
Run QProfiler separately for each generated config file
- The function automatically handles model-specific constraints:
QSVC uses only ‘amp’ ansatz and ‘COBYLA’ optimizer
QNN/VQC don’t use the C parameter
Embedding is set to ‘none’ when n_components >= original feature count
See also
qbiocode.apps.qprofiler – Main profiling application
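The grid expansion itself is a plain Cartesian product. A simplified sketch of how the combinations might be enumerated, applying the QSVC constraints noted above; the parameter subset and function name are illustrative, not the actual implementation:

```python
from itertools import product

def enumerate_combinations(qmethods=('qnn', 'vqc', 'qsvc'),
                           reps=(1, 2), optimizers=('COBYLA', 'SPSA'),
                           ansatz_types=('amp', 'esu2'), Cs=(0.1, 1, 10)):
    """Yield one dict per valid hyperparameter combination."""
    for qm, r, opt, ansatz, C in product(qmethods, reps, optimizers,
                                         ansatz_types, Cs):
        if qm == 'qsvc' and (ansatz != 'amp' or opt != 'COBYLA'):
            continue  # QSVC uses only the 'amp' ansatz and COBYLA
        combo = {'qmethod': qm, 'reps': r, 'optimizer': opt, 'ansatz': ansatz}
        if qm == 'qsvc':
            combo['C'] = C  # the C parameter applies to QSVC only
            yield combo
        elif C == Cs[0]:
            yield combo  # emit QNN/VQC combos once, not once per C value
```

In the real function, each yielded combination would be written into a copy of the template YAML as its own config file.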
- get_ansatz(ansatz_type, feat_dimension, reps=1, entanglement='linear')[source]#
This function returns an ansatz based on the specified type and parameters. It supports the 'esu2', 'amp', and 'twolocal' ansatz types, constructing the circuit with the specified feature dimension, number of repetitions, and entanglement type.
- Parameters:
ansatz_type (str) – Type of the ansatz (‘esu2’, ‘amp’, or ‘twolocal’).
feat_dimension (int) – Number of qubits for the ansatz.
reps (int) – Number of repetitions for the ansatz.
entanglement (str) – Type of entanglement for the ansatz.
- Returns:
An instance of the specified ansatz type.
- Return type:
ansatz
- get_backend_session(args, primitive, num_qubits)[source]#
This function gets the backend, session, and primitive instance for the specified primitive type.
- Parameters:
args (dict) – Dictionary containing backend and other parameters.
primitive (str) – The type of primitive to instantiate (‘sampler’ or ‘estimator’).
num_qubits (int) – Number of qubits for the backend.
- Returns:
backend – The backend instance.
session – The session instance.
prim – The instantiated primitive (Sampler or Estimator).
- Return type:
Tuple
- get_creds(args)[source]#
This function determines the user’s IBM Quantum channel, instance, and token, using values provided in the config.yaml file or read from the user’s qiskit configuration file (qiskit-ibm.json) at the qiskit_json_path specified in config.yaml, and parses its contents. It returns the main items in this json file, such as the instance and API token, which can then be passed into the QML functions when using a real hardware backend. The function returns a dictionary with the keys ‘channel’, ‘instance’, ‘token’, and ‘url’, which can be used to instantiate the QiskitRuntimeService. If qiskit_json_path is provided, it will attempt to read the credentials from that file.
- Parameters:
args (dict) – Arguments from the config.yaml file, in particular the path to the qiskit-ibm.json file (qiskit_json_path) and the credentials defined in that file: ibm_channel, ibm_instance, ibm_token, ibm_url.
- Returns:
A dictionary containing the IBM Quantum credentials, including ‘channel’, ‘instance’, ‘token’, and ‘url’.
- Return type:
rval (dict)
- get_estimator(mode=None, shots=1024, resil_level=2, dd=True, dd_seq='XpXm', PT=True)[source]#
This function creates an Estimator instance with specified options.
- Parameters:
mode (Session) – The session mode for the estimator.
shots (int) – Number of shots for estimation.
resil_level (int) – Resilience level for error suppression.
dd (bool) – Whether to enable dynamical decoupling.
dd_seq (str) – Sequence type for dynamical decoupling.
PT (bool) – Whether to enable pulse twirling.
- Returns:
An instance of the Estimator with the specified options.
- Return type:
Estimator
- get_feature_map(feature_map, feat_dimension, reps=1, entanglement='linear', data_map_func=None)[source]#
This function returns a feature map based on the specified type and parameters. It supports ‘Z’, ‘ZZ’, and ‘P’ feature maps, constructed with the specified feature dimension, number of repetitions, entanglement type, and data mapping function.
- Parameters:
feature_map (str) – Type of the feature map (‘Z’, ‘ZZ’, or ‘P’).
feat_dimension (int) – Number of qubits for the feature map.
reps (int) – Number of repetitions for the feature map.
entanglement (str) – Type of entanglement for the feature map.
data_map_func (callable, optional) – Function to map data to the feature map parameters.
- Returns:
feature_map – An instance of the specified feature map type.
feat_dimension (int) – The number of qubits in the feature map.
- Return type:
Tuple
- get_optimizer(type='COBYLA', max_iter=100, learning_rate_a=None, perturbation_gamma=None, prior_iter=0)[source]#
This function returns an optimizer based on the specified type and parameters. It supports ‘SPSA’, ‘COBYLA’, ‘GradientDescent’, and ‘L_BFGS_B’ optimizer types, constructing it using the specified maximum iterations, learning rate, perturbation gamma, and prior iterations.
- Parameters:
type (str) – Type of the optimizer (‘SPSA’, ‘COBYLA’, ‘GradientDescent’, or ‘L_BFGS_B’).
max_iter (int) – Maximum number of iterations for the optimizer.
learning_rate_a (float, optional) – Initial learning rate for SPSA.
perturbation_gamma (float, optional) – Perturbation gamma for SPSA.
prior_iter (int) – Number of prior iterations to consider.
- Returns:
An instance of the specified optimizer type.
- Return type:
optimizer
- get_sampler(mode=None, shots=1024, dd=True, dd_seq='XpXm', PT=True)[source]#
This function creates a Sampler instance with specified options.
- Parameters:
mode (Session) – The session mode for the sampler.
shots (int) – Number of shots for sampling.
dd (bool) – Whether to enable dynamical decoupling.
dd_seq (str) – Sequence type for dynamical decoupling.
PT (bool) – Whether to enable pulse twirling.
- Returns:
An instance of the Sampler with the specified options.
- Return type:
Sampler
- instantiate_runtime_service(args)[source]#
This function provides a single place to instantiate QiskitRuntimeService; a basic call to it can then be made from anywhere else. It uses the get_creds function to retrieve the necessary credentials from the qiskit-ibm.json file, whose path is specified in the config.yaml file. It returns an instance of the QiskitRuntimeService class, which can be used to interact with IBM Quantum services.
- Parameters:
args (dict) – Arguments from the config.yaml file, in particular the path to the qiskit-ibm.json file (qiskit_json_path) and the credentials defined in that file.
- Returns:
An instance of the QiskitRuntimeService class, initialized with the credentials from the qiskit-ibm.json file or the provided arguments.
- Return type:
QiskitRuntimeService
- qml_winner(results_df, rawevals_df, output_dir, tag)[source]#
This function finds datasets where QML was beneficial (higher F1 scores than CML) and creates new .csv files with the relevant evaluation and performance results for these specific datasets, for further analysis. It also computes the best results per method across all splits and the best results per dataset. It returns two DataFrames: one with the datasets where QML methods outperformed CML methods, and another with the evaluation scores for the best QML method for each of these datasets. Both DataFrames are also saved as .csv files in the specified output directory.
- Parameters:
results_df (pandas.DataFrame) – Dataset in pandas corresponding to ‘ModelResults.csv’
rawevals_df (pandas.DataFrame) – Dataset in pandas corresponding to ‘RawDataEvaluation.csv’
output_dir (str) – Directory where the output .csv files are saved.
tag (str) – Tag used in the output file names.
- Returns:
qml_winners (pandas.DataFrame) – Contains the input datasets for which at least one QML method performed better than CML, with the scores of all the methods.
winner_eval_score (pandas.DataFrame) – Contains the input datasets, their evaluation, and the scores for the specific QML method that yielded the best score.
- Return type:
Tuple[pandas.DataFrame, pandas.DataFrame]
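The core selection, picking datasets whose best QML F1 beats the best CML F1, can be sketched as below. The column names ('dataset', 'method', 'f1') and the method labels are assumptions for illustration, not the actual schema of ‘ModelResults.csv’:

```python
import pandas as pd

QML_METHODS = {'qnn', 'vqc', 'qsvc'}  # assumed QML method labels

def qml_beats_cml(results_df):
    """Return datasets whose best QML F1 exceeds the best CML F1."""
    # Best score per (dataset, method) across all splits.
    best = results_df.groupby(['dataset', 'method'])['f1'].max().reset_index()
    winners = []
    for ds, grp in best.groupby('dataset'):
        is_qml = grp['method'].isin(QML_METHODS)
        if is_qml.any() and (~is_qml).any():
            if grp.loc[is_qml, 'f1'].max() > grp.loc[~is_qml, 'f1'].max():
                winners.append(ds)
    return winners
```
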
- scaler_fn(X, scaling='None')[source]#
Apply scaling transformation to input data.
Scales the input data using one of three methods: no scaling, standard scaling (z-score normalization), or min-max scaling to [0, 1] range.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input data to be scaled.
scaling ({'None', 'StandardScaler', 'MinMaxScaler'}, default='None') –
Scaling method to apply:
’None’: No scaling, returns original data
’StandardScaler’: Standardize features by removing mean and scaling to unit variance
’MinMaxScaler’: Scale features to [0, 1] range
- Returns:
X_scaled – Scaled data. If scaling=’None’, returns original data unchanged.
- Return type:
array-like of shape (n_samples, n_features)
Notes
StandardScaler transforms data to have mean=0 and variance=1:
\[z = \frac{x - \mu}{\sigma}\]
MinMaxScaler transforms data to [0, 1] range:
\[x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}\]
Examples
>>> import numpy as np
>>> from qbiocode.utils import scaler_fn
>>> X = np.array([[1, 2], [3, 4], [5, 6]])
>>> X_scaled = scaler_fn(X, scaling='StandardScaler')
>>> X_minmax = scaler_fn(X, scaling='MinMaxScaler')
See also
sklearn.preprocessing.StandardScaler – Standardize features
sklearn.preprocessing.MinMaxScaler – Scale features to a range
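The two formulas above can be applied directly; a dependency-free sketch for a single feature column (the actual function delegates to scikit-learn):

```python
def standard_scale(x):
    """z = (x - mean) / std, using the population standard deviation."""
    mu = sum(x) / len(x)
    sigma = (sum((v - mu) ** 2 for v in x) / len(x)) ** 0.5
    return [(v - mu) / sigma for v in x]

def minmax_scale(x):
    """Map values linearly onto the [0, 1] range."""
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]
```
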
- track_progress(input_dataset_dir, current_results_dir, completion_marker='RawDataEvaluation.csv', prefix_length=8, input_extension='csv', verbose=True)[source]#
Track progress of a computational job by checking for completed datasets.
This function scans the results directory for completed datasets (identified by the presence of a specific marker file) and compares against the total number of input datasets to determine how many remain to be processed.
- Parameters:
input_dataset_dir (str) – Path to the directory containing input datasets.
current_results_dir (str) – Path to the directory containing outputs of the current job.
completion_marker (str, optional) – Name of the file that indicates a dataset has been fully processed. Default is ‘RawDataEvaluation.csv’.
prefix_length (int, optional) – Number of characters to skip from the beginning of directory names when extracting dataset identifiers. Default is 8 (e.g., skips ‘dataset_’ prefix).
input_extension (str, optional) – File extension of input datasets (without dot). Default is ‘csv’.
verbose (bool, optional) – If True, prints progress information. Default is True.
- Return type:
Tuple[List[str], int, int]
- Returns:
completed_datasets (List[str]) – List of dataset identifiers that have been completed.
num_completed (int) – Number of completed datasets.
num_remaining (int) – Number of datasets remaining to be processed.
Examples
>>> from qbiocode.utils import track_progress
>>> completed, done, remaining = track_progress(
...     input_dataset_dir='data/inputs',
...     current_results_dir='results/run1'
... )
The completed datasets are: ['dataset1', 'dataset2']
You have finished running program on 2 out of a total of 10 input datasets.
You have 8 input datasets left before program finishes.
>>> # Custom completion marker
>>> completed, done, remaining = track_progress(
...     input_dataset_dir='data/inputs',
...     current_results_dir='results/run1',
...     completion_marker='final_output.csv',
...     prefix_length=0  # No prefix to skip
... )
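The bookkeeping described above can be sketched with the standard library; the function name is illustrative, not the actual implementation:

```python
import os

def track_progress_sketch(input_dataset_dir, current_results_dir,
                          completion_marker='RawDataEvaluation.csv',
                          prefix_length=8, input_extension='csv'):
    """Return (completed dataset names, number completed, number remaining)."""
    total = [f for f in os.listdir(input_dataset_dir)
             if f.endswith('.' + input_extension)]
    completed = []
    for entry in sorted(os.listdir(current_results_dir)):
        subdir = os.path.join(current_results_dir, entry)
        if os.path.isdir(subdir) and os.path.exists(
                os.path.join(subdir, completion_marker)):
            completed.append(entry[prefix_length:])  # strip e.g. 'dataset_'
    return completed, len(completed), len(total) - len(completed)
```
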