QProfiler#

Automated ML Benchmarking with Data Complexity Analysis

QProfiler is a comprehensive tool that goes beyond simple model evaluation. It provides:

🔬 Dual Analysis Approach
  • Model Performance: Benchmarks both classical and quantum machine learning models

  • Data Complexity: Computes intrinsic dataset characteristics to understand model behavior

📊 What QProfiler Does
  1. Runs Multiple Models: Evaluates classical (RF, SVM, LR, etc.) and quantum (QSVC, PQK, VQC) algorithms

  2. Analyzes Data Complexity: Computes 15+ complexity measures before model training

  3. Correlates Results: Links model performance to data characteristics

  4. Automates Workflows: Handles data splitting, scaling, encoding, and evaluation

../_images/QProfiler.png

QProfiler workflow: Automated benchmarking pipeline combining data complexity analysis with model performance evaluation.#

Important

Key Feature: QProfiler computes data complexity metrics before running models, providing insights into why certain models perform better on specific datasets. This helps identify which quantum or classical approaches are most suitable for your data.

Warning

Binary Classification Focus: QProfiler is currently optimized for binary classification tasks (2 classes). While multi-class classification is supported experimentally, results may vary and some metrics may not be fully optimized. For best results, use datasets with exactly 2 classes.

Note

Before you start, make sure that you have installed QBioCode correctly by following the Installation guide.

Usage#

QProfiler can be used in two ways: as a command-line tool or as a Python library in your scripts and notebooks.

Command-Line Interface#

After installing QBioCode with the apps extras (pip install qbiocode[apps]), you can run QProfiler from the command line:

Basic Usage (Default Config)

qprofiler

This runs QProfiler with the default configuration file installed with the package.

Using Custom Configuration

You can override the default configuration in several ways:

# Use a custom config file from a specific directory (absolute path)
qprofiler --config-dir=/absolute/path/to/configs --config-name=my_config

# Use a config file from current directory
cd /path/to/your/project
qprofiler --config-dir=$(pwd)/configs --config-name=my_config

# Override specific parameters from command line
qprofiler model=[svc,rf,qsvc] n_jobs=5 backend=simulator

# Combine custom config with parameter overrides
qprofiler --config-dir=$(pwd)/configs --config-name=experiment1 seed=123 test_size=0.2

Batch Mode

For processing multiple datasets in parallel:

qprofiler-batch

This will process all CSV files in the configured input directory and generate combined results.

Configuration File Location

QProfiler uses Hydra for configuration management. By default, it uses the config file installed with the package at:

<site-packages>/apps/qprofiler/configs/config.yaml

To use your own configuration:

  1. Create a custom config file in your project directory:

    # my_config.yaml
    file_dataset: "my_dataset.csv"
    folder_path: "data/"
    model: [rf, svc, qsvc]
    embeddings: [none, pca]
    n_components: 3
    backend: simulator
    
  2. Run with custom config:

    qprofiler --config-dir=$(pwd) --config-name=my_config
    

Tip

Hydra Configuration Override Syntax:

  • --config-dir: Absolute path to directory containing config files

  • --config-name: Config filename (without .yaml extension)

  • key=value: Override individual parameters

  • key=[item1,item2]: Override list parameters

Example combining all:

qprofiler --config-dir=$(pwd)/configs --config-name=base \
          model=[rf,svc] backend=simulator seed=42

See the full configuration guide for all available options.

Python Library Usage#

You can also import and use QProfiler components directly in Python:

from qbiocode import model_run, evaluate
from qbiocode import scaler_fn, feature_encoding
from sklearn.model_selection import train_test_split
import pandas as pd

# Load your data
data = pd.read_csv('dataset.csv')
X = data.drop('target', axis=1)
y = data['target']

# Evaluate data complexity
complexity_metrics = evaluate(X, y, 'my_dataset')

# Split the data before modeling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Run a specific model
results = model_run(
    X_train, X_test, y_train, y_test,
    model_type='rf',
    dataset_name='my_dataset'
)

For complete examples, see the QProfiler Tutorial.

Data Complexity Measures#

In data mining and machine learning, we can distinguish between two fundamental types of complexity that affect model performance:

Intrinsic Complexity

Inherent structure of the data that makes it difficult to learn, independent of the algorithm:

  • Class distribution: Imbalanced or overlapping classes

  • Non-linear decision boundaries: Complex separating surfaces

  • Higher-order correlations: Interactions between multiple features

  • Noise: Random variations in the data

Extrinsic Complexity

Complexity arising from external factors dependent on the algorithm or preprocessing:

  • Preprocessing issues: Inadequate feature scaling or transformation

  • Misalignment between model and data: Model assumptions don’t match data structure

  • Learning limitations of models: Insufficient capacity or inappropriate inductive bias

../_images/IEComplexity.png

Intrinsic vs. Extrinsic Complexity: Understanding the sources of learning difficulty in machine learning tasks.#

QProfiler automatically computes the following complexity measures for each dataset to characterize its intrinsic properties and predict model performance.

Dimensionality Metrics#

Number of Features, Samples, and Feature-to-Sample Ratio

Basic dataset dimensions that characterize the problem scale. High feature-to-sample ratios (\(p/n > 1\)) indicate high-dimensional problems prone to overfitting, known as the “curse of dimensionality.”

\[\text{Ratio} = \frac{p}{n}\]

where \(p\) = number of features, \(n\) = number of samples.

Reference: Bellman, R. (1961). Adaptive Control Processes. Princeton University Press.

Intrinsic Dimension

Estimates the true dimensionality of data embedded in high-dimensional space. While data may have \(p\) features, it often lies on a lower-dimensional manifold of dimension \(d \ll p\). Lower intrinsic dimension suggests the data structure is simpler than the ambient dimension implies.

\[d_{\text{intrinsic}} \ll p\]

Reference: Fukunaga, K., & Olsen, D. R. (1971). “An algorithm for finding intrinsic dimensionality of data.” IEEE Transactions on Computers, C-20(2), 176-183.
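As a rough illustration (not QProfiler's estimator), one common proxy counts the PCA components needed to explain most of the variance; the 95% threshold below is an arbitrary choice:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# 1000 samples lying exactly on a 2-D plane embedded in 10-D ambient space
latent = rng.normal(size=(1000, 2))
X = latent @ rng.normal(size=(2, 10))

# Proxy: smallest number of principal components explaining 95% of variance
pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
d_est = int(np.searchsorted(cumvar, 0.95) + 1)
print(d_est)  # close to the latent dimension 2, far below the ambient 10
```

Here the ambient dimension is 10 but nearly all variance sits in two components, so the proxy recovers the latent structure.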

Fractal Dimension

Measures the self-similarity and geometric complexity of the data structure. For point sets in the plane, values range from 1 (a simple curve) to 2 (space-filling); higher values indicate greater complexity and irregularity in the data manifold.

\[D_f = \lim_{\epsilon \to 0} \frac{\log N(\epsilon)}{\log(1/\epsilon)}\]

where \(N(\epsilon)\) is the number of boxes of size \(\epsilon\) needed to cover the data.

../_images/Sierpinski_triangle.png

Sierpinski Triangle: A classic example of a fractal with self-similar structure at multiple scales, illustrating the concept of fractal dimension.#

Reference: Higuchi, T. (1988). “Approach to an irregular time series on the basis of the fractal theory.” Physica D: Nonlinear Phenomena, 31(2), 277-283.
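The box-counting formula above can be sketched in a few lines (an illustrative estimate only; QProfiler's implementation may instead follow the Higuchi method cited above):

```python
import numpy as np

def box_counting_dimension(points, epsilons):
    """Estimate fractal dimension via box counting.

    points: (n, d) array; epsilons: array of box sizes.
    Fits the slope of log N(eps) against log(1/eps).
    """
    counts = []
    for eps in epsilons:
        # Assign each point to a grid cell of side eps, count occupied cells
        cells = np.unique(np.floor(points / eps), axis=0)
        counts.append(len(cells))
    slope, _ = np.polyfit(np.log(1.0 / epsilons), np.log(counts), 1)
    return slope

# Points on a line segment in 2-D: expected dimension close to 1
line = np.column_stack([np.linspace(0, 1, 2000), np.linspace(0, 1, 2000)])
eps = np.array([0.2, 0.1, 0.05, 0.02, 0.01])
print(round(box_counting_dimension(line, eps), 2))
```

A space-filling cloud of 2-D points would instead yield an estimate near 2.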

Statistical Properties#

Variance

Measures data spread across features. Low variance features (\(\sigma^2 \approx 0\)) may not contribute to discrimination; high variance may indicate noise or important signal variation.

\[\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2\]
Coefficient of Variation (CV)

Normalized measure of dispersion that enables comparison across features with different scales. Expressed as percentage of the mean.

\[CV = \frac{\sigma}{\mu} \times 100\%\]

Reference: Abdi, H. (2010). “Coefficient of variation.” Encyclopedia of Research Design, 1, 169-171.

Skewness

Third statistical moment measuring distribution asymmetry. Negative skewness indicates left-tailed distributions, zero indicates symmetry (as in the normal distribution), and positive skewness indicates right-tailed distributions.

\[\text{Skewness} = \frac{E[(X-\mu)^3]}{\sigma^3}\]
../_images/skew.png

Distribution skewness: negative skew (left-tailed), zero skew (symmetric), and positive skew (right-tailed). Skewness quantifies the asymmetry of probability distributions.#

Kurtosis

Fourth statistical moment measuring tail heaviness and peakedness of distributions. Higher kurtosis indicates heavier tails and more outliers; lower kurtosis indicates lighter tails. Normal distribution has kurtosis of 3 (excess kurtosis of 0).

\[\text{Kurtosis} = \frac{E[(X-\mu)^4]}{\sigma^4}\]
../_images/kurt.jpg

Distribution kurtosis: platykurtic (light tails, kurtosis < 3), mesokurtic (normal, kurtosis = 3), and leptokurtic (heavy tails, kurtosis > 3). Kurtosis quantifies tail behavior and outlier propensity.#

Reference: Joanes, D. N., & Gill, C. A. (1998). “Comparing measures of sample skewness and kurtosis.” Journal of the Royal Statistical Society: Series D, 47(1), 183-189.
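Both moments are available in SciPy; note that scipy.stats.kurtosis reports *excess* kurtosis by default (fisher=True), so pass fisher=False for the raw fourth-moment definition above:

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
normal = rng.normal(size=10_000)
exponential = rng.exponential(size=10_000)

# Symmetric sample: skewness near 0; right-tailed sample: positive skewness
print(round(skew(normal), 2), round(skew(exponential), 2))

# Raw kurtosis of a normal sample is near 3 (excess kurtosis near 0)
print(round(kurtosis(normal, fisher=False), 2))
```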

Nonzero Value Count

Measures data sparsity. High sparsity (many zeros) indicates sparse representations that may benefit from specialized algorithms or dimensionality reduction.

Low Variance Feature Count

Number of features below the 25th percentile of the variance distribution. Identifies potentially uninformative features that contribute little to model discrimination.
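A sketch of both sparsity and the low-variance count as defined above (the exact thresholds QProfiler applies may differ):

```python
import numpy as np

rng = np.random.default_rng(1)
# 8 features with increasing spread
X = rng.normal(scale=np.arange(1, 9), size=(500, 8))

# Sparsity: fraction of exact zeros (essentially 0 for continuous data)
sparsity = 1 - np.count_nonzero(X) / X.size

# Features strictly below the 25th percentile of the variance distribution
variances = X.var(axis=0)
threshold = np.percentile(variances, 25)
low_var_count = int((variances < threshold).sum())
print(sparsity, low_var_count)
```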

Separability Measures#

Fisher Discriminant Ratio (FDR)

Quantifies class separability as the ratio of between-class to within-class scatter. Higher values indicate better linear separability. Only defined for binary classification.

\[\text{FDR} = \frac{\text{tr}(\mathbf{S}_B)}{\text{tr}(\mathbf{S}_W)}\]

where \(\mathbf{S}_B\) is between-class scatter and \(\mathbf{S}_W\) is within-class scatter.

Reference: Fisher, R. A. (1936). “The use of multiple measurements in taxonomic problems.” Annals of Eugenics, 7(2), 179-188.
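The trace-ratio form of the FDR can be sketched directly (illustrative only; the helper below is not QProfiler's code):

```python
import numpy as np

def fisher_discriminant_ratio(X, y):
    """Trace ratio of between-class to within-class scatter (binary y)."""
    mu = X.mean(axis=0)
    S_B = np.zeros((X.shape[1], X.shape[1]))
    S_W = np.zeros_like(S_B)
    for c in np.unique(y):
        Xc = X[y == c]
        diff = (Xc.mean(axis=0) - mu)[:, None]
        S_B += len(Xc) * diff @ diff.T          # between-class scatter
        centered = Xc - Xc.mean(axis=0)
        S_W += centered.T @ centered            # within-class scatter
    return np.trace(S_B) / np.trace(S_W)

rng = np.random.default_rng(0)
X0 = rng.normal(loc=0.0, size=(100, 3))
X1_far = rng.normal(loc=5.0, size=(100, 3))
X1_near = rng.normal(loc=0.5, size=(100, 3))
y = np.array([0] * 100 + [1] * 100)

# Well-separated classes score much higher than overlapping ones
print(round(fisher_discriminant_ratio(np.vstack([X0, X1_far]), y), 2))
print(round(fisher_discriminant_ratio(np.vstack([X0, X1_near]), y), 2))
```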

Mutual Information

Measures statistical dependence between features and class labels. Higher values indicate features are more informative for classification. Mutual information can be expressed in terms of entropy:

\[\begin{split}I(X;Y) &= \sum_{x,y} p(x,y) \log\frac{p(x,y)}{p(x)p(y)} \\ &= H(X) + H(Y) - H(X,Y) \\ &= H(X) - H(X|Y) \\ &= H(Y) - H(Y|X)\end{split}\]

where \(H(X)\) and \(H(Y)\) are the marginal entropies, \(H(X,Y)\) is the joint entropy, and \(H(X|Y)\) and \(H(Y|X)\) are the conditional entropies.

../_images/MutualInformation.png

Relationship between entropy, mutual information, and relative entropy (KL divergence). Mutual information quantifies the reduction in uncertainty about one variable given knowledge of another.#

Reference: Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory. Wiley-Interscience.
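In practice, mutual information between features and class labels can be estimated with scikit-learn (a sketch; QProfiler's estimator may differ):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] > 0).astype(int)  # label depends only on feature 0

# Per-feature MI estimates (in nats): feature 0 informative, feature 1 near 0
mi = mutual_info_classif(X, y, random_state=0)
print(np.round(mi, 3))
```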

Total Correlation

Sum of absolute correlations between all feature pairs (excluding self-correlation). Indicates feature redundancy and multicollinearity in the dataset.

\[TC = \sum_{i \neq j} |\rho_{ij}|\]

where \(\rho_{ij}\) is the correlation between features \(i\) and \(j\).

Reference: Watanabe, S. (1960). “Information theoretical analysis of multivariate correlation.” IBM Journal of Research and Development, 4(1), 66-82.
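The definition translates directly into a correlation-matrix computation (a sketch, not QProfiler's code):

```python
import numpy as np

def total_correlation(X):
    """Sum of |rho_ij| over all feature pairs i != j."""
    rho = np.corrcoef(X, rowvar=False)
    return np.abs(rho).sum() - np.trace(np.abs(rho))  # drop self-correlation

rng = np.random.default_rng(0)
independent = rng.normal(size=(500, 4))
redundant = independent.copy()
redundant[:, 3] = redundant[:, 0] + 0.01 * rng.normal(size=500)  # near-duplicate

# Redundant features inflate the total correlation
print(round(total_correlation(independent), 2))
print(round(total_correlation(redundant), 2))
```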

Log Kernel Density

Mean log-likelihood of data under Gaussian kernel density estimation. Indicates data concentration and distribution smoothness. Higher (less negative) values suggest more concentrated data.

\[\overline{\log \hat{p}} = \frac{1}{n}\sum_{i=1}^{n} \log\left(\frac{1}{nh^d}\sum_{j=1}^{n}K\left(\frac{x_i-x_j}{h}\right)\right)\]

where \(K\) is the kernel function, \(h\) is the bandwidth, and \(d\) is the data dimension.

../_images/KernelDensityGaussianAnimation.gif

Gaussian kernel density estimation: Animation showing how individual kernels (dashed lines) combine to form the overall density estimate (solid line). The bandwidth parameter controls the smoothness of the estimate.#

Reference: Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall.
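A sketch using scikit-learn's KernelDensity (the bandwidth of 0.5 is an arbitrary illustration, not QProfiler's default):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def mean_log_density(X, bandwidth=0.5):
    """Average log-likelihood per sample under a Gaussian KDE."""
    kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth).fit(X)
    return kde.score_samples(X).mean()

rng = np.random.default_rng(0)
concentrated = rng.normal(scale=0.1, size=(300, 2))
spread = rng.normal(scale=10.0, size=(300, 2))

# Concentrated data yields a higher (less negative) mean log density
print(round(mean_log_density(concentrated), 2))
print(round(mean_log_density(spread), 2))
```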

Matrix Properties#

Condition Number

Ratio of largest to smallest singular value of the data matrix. Measures numerical stability and sensitivity to perturbations. High values (\(\kappa > 10^3\)) indicate ill-conditioned problems with potential numerical instability.

\[\kappa(\mathbf{X}) = \frac{\sigma_{\max}(\mathbf{X})}{\sigma_{\min}(\mathbf{X})} = \|\mathbf{X}\| \cdot \|\mathbf{X}^{-1}\|\]

where \(\sigma_{\max}\) and \(\sigma_{\min}\) are the largest and smallest singular values; the norm identity on the right holds when \(\mathbf{X}\) is square and invertible.

Reference: Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press.
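NumPy computes the singular-value ratio directly via np.linalg.cond:

```python
import numpy as np

rng = np.random.default_rng(0)
well = rng.normal(size=(100, 3))                      # independent columns
ill = well.copy()
ill[:, 2] = ill[:, 0] + 1e-6 * rng.normal(size=100)   # near-collinear column

print(round(np.linalg.cond(well), 1))  # modest: well-conditioned
print(f"{np.linalg.cond(ill):.1e}")    # very large: ill-conditioned
```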

Key References

  • Comprehensive Overview: Data Complexity slides from ISMB 2025 tutorial

  • Meta-Learning Context: Lorena, A. C., et al. (2019). “How Complex is your classification problem? A survey on measuring classification complexity.” ACM Computing Surveys, 52(5), 1-34.

Configuration#

QProfiler uses a YAML configuration file to manage all experimental parameters, ensuring reproducibility and enabling batch processing of multiple datasets and models.

Quick Start#

Basic Usage:

qprofiler --config-name=config

Get Help:

qprofiler --help

Configuration Overview#

The config.yaml file controls all aspects of the QProfiler workflow:

📁 Data Configuration
  • Input dataset paths

  • File selection (all or specific)

  • Random seeds for reproducibility

⚛️ Quantum Backend
  • Backend selection (simulator/hardware)

  • IBM Quantum credentials

  • Shots and error mitigation

🔄 Preprocessing
  • Dimensionality reduction methods

  • Feature scaling options

  • Train/test split ratios

🤖 Model Selection
  • Classical models (SVC, RF, LR, etc.)

  • Quantum models (QSVC, VQC, PQK)

  • Hyperparameter grids

Key Configuration Sections#

1. Dataset Configuration

# Process all CSV files in folder
folder_path: 'test_data/'
file_dataset: 'ALL'

# Or select specific files
file_dataset: ['dataset1.csv', 'dataset2.csv']

# Handle datasets with row names/IDs in first column
index_col: False  # Set to True if first column contains row identifiers

# Reproducibility seeds
seed: 42        # Classical algorithms
q_seed: 42      # Quantum algorithms

Important

Row Names Support: If your CSV files have row identifiers (sample IDs, patient IDs, etc.) in the first column, set index_col: True. This will:

  • Use the first column as row index (excluded from features)

  • Properly handle datasets with sample identifiers

  • Maintain data integrity for downstream analysis

Example dataset with row names:

SampleID,Feature1,Feature2,Feature3,Label
Patient001,0.5,0.3,0.8,0
Patient002,0.2,0.7,0.4,1
Patient003,0.9,0.1,0.6,0

With index_col: True, only Feature1, Feature2, and Feature3 are used as features (Label is the target).

Default behavior (index_col: False):

Feature1,Feature2,Feature3,Label
0.5,0.3,0.8,0
0.2,0.7,0.4,1
0.9,0.1,0.6,0

All columns except the last are treated as features.
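The two settings correspond to pandas' index_col behavior, which can be checked in isolation (illustrative; QProfiler's internal reading logic may differ):

```python
import io
import pandas as pd

csv_with_ids = """SampleID,Feature1,Feature2,Feature3,Label
Patient001,0.5,0.3,0.8,0
Patient002,0.2,0.7,0.4,1
Patient003,0.9,0.1,0.6,0
"""

# index_col: True corresponds to treating the first column as the row index
data = pd.read_csv(io.StringIO(csv_with_ids), index_col=0)
X = data.drop('Label', axis=1)  # Feature1..Feature3 only
y = data['Label']
print(X.shape)  # (3, 3): sample IDs are excluded from the feature matrix
```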

2. Quantum Backend Setup

# Use simulator
backend: 'simulator'

# Or use IBM Quantum hardware
backend: 'ibm_least'  # Least busy device
shots: 1024
resil_level: 1        # Error mitigation (1-3)

# IBM credentials
qiskit_json_path: '~/.qiskit/qiskit-ibm.json'
name: 'account_qbc'   # Account alias
ibm_instance: 'hub/group/project'  # Optional

3. Dimensionality Reduction

# Apply embeddings
embeddings: ['pca', 'nmf', 'autoencoder']
n_components: 3

# No embedding
embeddings: ['none']

4. Train/Test Split

test_size: 0.3           # 70:30 train:test ratio

# Stratified sampling - maintains class distribution in splits
stratify: ['y']          # Enable stratification
# stratify: []           # Disable stratification

scaling: ['True']        # Enable MinMaxScaler feature scaling

Note

Stratified Sampling:

Stratification ensures that the class distribution in your training and test sets matches the original dataset distribution. This is particularly important for:

  • Imbalanced datasets: When one class has significantly fewer samples than others

  • Small datasets: To ensure all classes are represented in both train and test sets

  • Reproducible results: Consistent class proportions across different random splits

When to use stratification:

  • Enable (stratify: ['y']): For classification tasks with imbalanced classes

  • Enable: When you have small datasets with multiple classes

  • Disable (stratify: []): When you have perfectly balanced classes and large datasets

Example:

If your dataset has 90% class 0 and 10% class 1, stratification ensures both train and test sets maintain this 90:10 ratio, preventing scenarios where the test set might have no class 1 samples.
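A standalone sketch of stratified splitting with scikit-learn shows the ratio being preserved (the synthetic 90:10 data below is purely illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([0] * 900 + [1] * 100)  # 90:10 imbalance

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Both splits keep the 90:10 class ratio; the test set retains minority samples
print(np.bincount(y_te))
```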

5. Model Selection

Available models:

  • Classical: svc, dt, lr, nb, rf, mlp

  • Quantum: qsvc, vqc, qnn, pqk

# Run all models
model: ['svc', 'dt', 'lr', 'nb', 'rf', 'mlp',
        'qsvc', 'vqc', 'qnn', 'pqk']

# Or select specific models
model: ['rf', 'qsvc', 'pqk']

6. Hyperparameter Configuration

Each model can have standard parameters and grid search parameters:

# Standard parameters
svc_args:
  C: 0.01
  gamma: 0.1
  kernel: 'linear'

# Grid search parameters
gridsearch_svc_args:
  C: [0.1, 1, 10, 100]
  gamma: [0.001, 0.01, 0.1, 1]
  kernel: ['linear', 'rbf', 'poly', 'sigmoid']
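For classical models, a grid like the one above maps naturally onto scikit-learn's GridSearchCV (a sketch, not QProfiler's internal code; the reduced grid and synthetic data keep it fast):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Reduced version of the gridsearch_svc_args block above
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
}
search = GridSearchCV(SVC(), param_grid, cv=3).fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```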

Important

For quantum models: Grid search requires generating separate config files for each parameter combination. Use the qbiocode.utils.generate_qml_experiment_configs() utility function:

from qbiocode.utils import generate_qml_experiment_configs

# Generate config files for quantum model hyperparameter tuning
num_configs, used_files = generate_qml_experiment_configs(
    template_config_path='configs/config.yaml',
    output_dir='configs/qml_gridsearch',
    data_dirs=['data/tutorial_test_data/lower_dim_datasets'],
    qmethods=['qnn', 'vqc', 'qsvc'],
    reps=[1, 2],
    n_components=[5, 10],
    embeddings=['none', 'pca', 'isomap']  # Subset of available: none, pca, lle, isomap, spectral, umap, nmf
)

print(f"Generated {num_configs} configuration files")

# Then run QProfiler for each generated config

Running Generated Configs

After generating config files, you have several options to execute them:

Option 1: Manual execution (for small numbers of configs)

qprofiler --config-dir=$(pwd)/configs/qml_gridsearch --config-name=exp_1
qprofiler --config-dir=$(pwd)/configs/qml_gridsearch --config-name=exp_2
# ... etc.

Option 2: Bash loop (for local execution)

# Run all configs sequentially
for i in {1..100}; do
    qprofiler --config-dir=$(pwd)/configs/qml_gridsearch --config-name=exp_${i}
done

Option 3: SLURM array job (for HPC clusters)

#!/bin/bash
#SBATCH --array=1-100%10    # Run 100 jobs, max 10 concurrent
#SBATCH -c 9                # 9 CPUs per job
#SBATCH --mem=12000         # 12GB memory
#SBATCH -p your_partition

qprofiler --config-dir=$(pwd)/configs/qml_gridsearch --config-name=exp_${SLURM_ARRAY_TASK_ID}

Tip

For large hyperparameter grids (hundreds of configs), use SLURM array jobs on HPC clusters for parallel execution. Adjust --array range to match your number of configs and %N to control concurrent jobs based on cluster resources.

See qbiocode.utils.generate_qml_experiment_configs() for full documentation and all available parameters.

Detailed Configuration Reference#

For comprehensive documentation of all configuration parameters, see:

Example Configuration#

A complete example config.yaml is available at: apps/qprofiler/configs/config.yaml

Tip

Best Practices:

  • Always set random seeds for reproducibility

  • Start with a small subset of models to test configuration

  • Use grid search for classical models, separate configs for quantum models

  • Monitor quantum backend availability before large runs

  • Save configurations with descriptive names for different experiments

Troubleshooting#

Common Issues:

  • Missing config file: Ensure config.yaml is in the correct directory

  • YAML syntax errors: Validate YAML format (indentation, colons, quotes)

  • IBM Quantum access: Verify credentials and instance permissions

  • Memory errors: Reduce number of parallel jobs (n_jobs parameter)

  • Quantum backend errors: Check backend availability and queue status

Note

The project will fail if the config.yaml file is missing, incorrectly formatted, or contains invalid parameter values. Always validate your configuration before running large experiments.

Checkpoint and Restart#

When processing large batches of datasets, jobs may be interrupted due to time limits, system failures, or resource constraints. QProfiler provides a checkpoint restart utility to resume processing without recomputing completed datasets.

How It Works#

The checkpoint_restart function scans a previous results directory and identifies which datasets were fully processed by checking for completion marker files (e.g., RawDataEvaluation.csv). You can then filter your dataset list to process only the remaining incomplete datasets.

Basic Usage#

from qbiocode.utils.dataset_checkpoint import checkpoint_restart
import os

# Identify completed datasets from previous run
completed = checkpoint_restart(
    previous_results_dir='./previous_run_results',
    verbose=True
)

# Get all datasets to process
all_datasets = [f.replace('.csv', '') for f in os.listdir('./data')
                if f.endswith('.csv')]

# Filter to only incomplete datasets
remaining = [d for d in all_datasets if d not in completed]

print(f"Previously completed: {len(completed)} datasets")
print(f"Remaining to process: {len(remaining)} datasets")

# Continue processing with remaining datasets
# (use 'remaining' list in your batch processing loop)

Advanced Usage#

Custom Settings

from qbiocode.utils.dataset_checkpoint import checkpoint_restart

# Custom completion marker and directory naming
completed = checkpoint_restart(
    previous_results_dir='./results_2024_01_15',
    completion_marker='ModelResults.csv',  # Different marker file
    prefix_length=0,  # No prefix to strip from directory names
    verbose=True
)

Integration with Batch Processing

import os
from qbiocode.utils.dataset_checkpoint import checkpoint_restart
from qbiocode.evaluation import model_run

# Step 1: Check for previous results
if os.path.exists('./previous_results'):
    completed_datasets = checkpoint_restart(
        previous_results_dir='./previous_results',
        verbose=True
    )
else:
    completed_datasets = []

# Step 2: Get list of all datasets
data_dir = './datasets'
all_datasets = [f.replace('.csv', '') for f in os.listdir(data_dir)
                if f.endswith('.csv')]

# Step 3: Filter to incomplete datasets
datasets_to_process = [d for d in all_datasets
                       if d not in completed_datasets]

# Step 4: Process remaining datasets
for dataset_name in datasets_to_process:
    print(f"Processing {dataset_name}...")
    # Your QProfiler processing code here
    # ...

Parameters#

  • previous_results_dir (str): Path to directory with previous run results

  • completion_marker (str, optional): Filename indicating completion (default: 'RawDataEvaluation.csv')

  • prefix_length (int, optional): Characters to strip from directory names (default: 8 for 'dataset_' prefix)

  • verbose (bool, optional): Print progress information (default: False)

Returns#

List of dataset names that were successfully completed in the previous run.

Tip

Best Practices for Checkpoint Restart:

  • Always use verbose=True to verify which datasets are being skipped

  • Keep previous results directories until you confirm the new run completed successfully

  • Manually combine CSV results from previous and current runs if needed

  • Consider using unique output directories for each run (e.g., with timestamps)

Note

The checkpoint restart function only checks for the presence of the completion marker file, not its contents. Ensure your previous run actually completed successfully for the identified datasets.

Utility Functions#

QBioCode provides several utility functions that can help with batch processing workflows, configuration management, and file organization.

Finding Duplicate Files#

When generating multiple experiment configurations (e.g., via grid search), you may accidentally create duplicate configuration files. The find_duplicate_files function helps identify these duplicates before running batch jobs.

from qbiocode.utils.find_duplicates import find_duplicate_files

# Find duplicate YAML configs
duplicates = find_duplicate_files(
    'configs/qml_gridsearch/',
    file_pattern='.yaml',
    verbose=True
)

if duplicates:
    print(f"Warning: Found {len(duplicates)} duplicate configuration pairs")
    for file1, file2 in duplicates:
        print(f"  {file1}")
        print(f"  {file2}")
        # Optionally remove one of the duplicates
        # os.remove(file2)

Use Cases:

  • Validate experiment configurations before batch processing

  • Clean up redundant config files from grid search generation

  • Identify duplicate datasets in data directories

  • Audit configuration consistency across projects

Searching for Strings in Files#

The find_string_in_files function helps you quickly locate specific parameters or settings across multiple configuration files.

from qbiocode.utils.find_string import find_string_in_files

# Find all configs using PCA embedding
results = find_string_in_files(
    'configs/experiments/',
    'embeddings: pca',
    file_pattern='.yaml',
    return_lines=True
)

print(f"Found in {len(results)} configuration files:")
for filepath, matches in results.items():
    print(f"\n{filepath}:")
    for line_num, line_content in matches:
        print(f"  Line {line_num}: {line_content.strip()}")

Use Cases:

  • Find all configs using a specific model or embedding

  • Audit parameter settings across experiments

  • Locate configurations with specific quantum backend settings

  • Validate consistency of hyperparameters

Tracking Job Progress#

The track_progress function monitors computational job progress by comparing completed datasets against total input datasets. Unlike checkpoint_restart (which returns completed dataset names for filtering), track_progress provides progress statistics and remaining work estimates.

from qbiocode.utils.combine_evals_results import track_progress

# Monitor progress of current job
completed, num_done, num_remaining = track_progress(
    input_dataset_dir='data/inputs',
    current_results_dir='results/current_run',
    completion_marker='RawDataEvaluation.csv',
    verbose=True
)

print(f"\nProgress: {num_done}/{num_done + num_remaining} datasets completed")
if num_remaining == 0:
    print("Job complete! Ready to combine results.")

Key Differences from checkpoint_restart:

  • checkpoint_restart: Returns list of completed names → Use for resuming jobs

  • track_progress: Returns list + progress counts → Use for monitoring status

Use Cases:

  • Monitor progress of long-running batch jobs

  • Verify job completion before combining results

  • Estimate remaining computation time

  • Generate progress reports

Combining Results from Interrupted Jobs#

When a job is interrupted and resumed, combine_results merges results from both runs into unified output files.

from qbiocode.utils.combine_evals_results import combine_results

# Combine results from interrupted and resumed runs
eval_df, results_df = combine_results(
    prev_results_dir='results/run1_interrupted',
    recent_results_dir='results/run2_resumed',
    verbose=True
)

print(f"Combined {len(eval_df)} evaluation records")
print(f"Combined {len(results_df)} result records")

Customizing File Patterns:

# Custom file prefixes and output names
eval_df, results_df = combine_results(
    prev_results_dir='results/old',
    recent_results_dir='results/new',
    eval_file_prefix='Evaluation',
    results_file_prefix='Results',
    output_eval_file='AllEvaluations.csv',
    output_results_file='AllResults.csv'
)

Use Cases:

  • Merge results after job interruption and restart

  • Combine results from multiple batch runs

  • Consolidate distributed computation results

  • Create unified datasets for downstream analysis

Combined Workflow Example#

Here’s a complete example combining these utilities with QProfiler batch processing:

import os
from qbiocode.utils.dataset_checkpoint import checkpoint_restart
from qbiocode.utils.find_duplicates import find_duplicate_files
from qbiocode.utils.find_string import find_string_in_files

# Step 1: Check for duplicate configs
config_dir = "configs/experiments/"
duplicates = find_duplicate_files(config_dir, file_pattern='.yaml')

if duplicates:
    print(f"Warning: {len(duplicates)} duplicate configs found")
    # Handle duplicates (remove or rename)

# Step 2: Verify all configs use correct settings
results = find_string_in_files(
    config_dir,
    'n_splits: 5',  # Ensure all use 5-fold CV
    file_pattern='.yaml'
)

yaml_configs = [f for f in os.listdir(config_dir) if f.endswith('.yaml')]
if len(results) != len(yaml_configs):
    print("Warning: Not all configs use 5-fold cross-validation")

# Step 3: Check for previous results and resume
if os.path.exists('./previous_results'):
    completed = checkpoint_restart('./previous_results', verbose=True)
else:
    completed = []

# Step 4: Process remaining datasets
all_datasets = [f.replace('.csv', '') for f in os.listdir('./data')
                if f.endswith('.csv')]
remaining = [d for d in all_datasets if d not in completed]

print(f"\nReady to process {len(remaining)} datasets")
# Continue with QProfiler batch processing...