API Overview#

Quick overview of the Methods and Datasets available in qbiocode.

Methods#

The methods in qbiocode are grouped into embeddings, evaluation, model computation, and visualisation.

Embeddings#

A collection of common embedding functions (qbiocode.embeddings).

get_embeddings(embedding, X_train, X_test[, ...])

This function applies the specified embedding technique to the training and test datasets.

Evaluation#

The qbiocode.evaluation submodule computes evaluation metrics for the input datasets and for the trained models.

Data Evaluation#

Data evaluation summarises the statistical properties of an input dataset.

evaluate(df, y, file)

This function evaluates a dataset and returns a transposed summary DataFrame of statistical measures derived from that dataset.

Model Evaluation#

modeleval(y_test, y_predicted, beg_time, ...)

Evaluates the model performance using accuracy, F1 score, and AUC.
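The three metrics named above can be illustrated with a minimal, pure-Python sketch for binary labels. This is not the modeleval implementation; the helper names (accuracy, f1_binary, auc_binary) and the sample values are assumptions made for illustration only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc_binary(y_true, scores, positive=1):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels, hard predictions, and predicted scores.
y_test = [1, 0, 1, 1, 0, 0]
y_predicted = [1, 0, 0, 1, 0, 1]
y_scores = [0.9, 0.2, 0.4, 0.8, 0.3, 0.6]

print("accuracy:", accuracy(y_test, y_predicted))
print("F1:", f1_binary(y_test, y_predicted))
print("AUC:", auc_binary(y_test, y_scores))
```

Accuracy and F1 are computed from hard predictions, whereas AUC needs a continuous score per sample, which is why the sketch takes a separate y_scores list.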

Model Computation#

qbiocode brings together a number of established classical and quantum machine learning models (qbiocode.learning). Multiple models can be run with the following function:

model_run(X_train, X_test, y_train, y_test, ...)

This function runs the ML methods, with or without a grid search, as specified in the config.yaml file.

Classical Models#

QBioCode provides classical machine learning models from scikit-learn for baseline comparisons and benchmarking against quantum models.

compute_dt(X_train, X_test, y_train, y_test, ...)

This function generates a model using a Decision Tree (DT) Classifier method as implemented in scikit-learn.

compute_lr(X_train, X_test, y_train, y_test, ...)

This function generates a model using a Logistic Regression (LR) method as implemented in scikit-learn.

compute_mlp(X_train, X_test, y_train, ...[, ...])

This function generates a model using a Multi-layer Perceptron (MLP), a neural network method, as implemented in scikit-learn.

compute_nb(X_train, X_test, y_train, y_test, ...)

This function generates a model using a Gaussian Naive Bayes (NB) Classifier method as implemented in scikit-learn.

compute_rf(X_train, X_test, y_train, y_test, ...)

This function generates a model using a Random Forest (RF) Classifier method as implemented in scikit-learn.

compute_svc(X_train, X_test, y_train, ...[, ...])

This function generates a model using a Support Vector Classifier (SVC) method as implemented in scikit-learn.

Each model has an alternative function with grid search parameters for hyperparameter optimization. Details can be found in the specific qbiocode.learning submodules.

Quantum Models#

QBioCode provides quantum machine learning models that leverage quantum computing capabilities for classification and regression tasks. These models can be run on quantum simulators or real quantum hardware.

compute_qnn(X_train, X_test, y_train, ...[, ...])

This function computes a Quantum Neural Network (QNN) model on the provided training data and evaluates it on the test data.

compute_qsvc(X_train, X_test, y_train, ...)

This function computes a quantum support vector classifier (QSVC) using the Qiskit Machine Learning library.

compute_vqc(X_train, X_test, y_train, ...[, ...])

This function computes a Variational Quantum Classifier (VQC) using the Qiskit Machine Learning library.

compute_pqk(X_train, X_test, y_train, ...[, ...])

This function generates quantum circuits, computes projections of the data onto these circuits, and evaluates the performance of classical machine learning models on the projected data.

Each quantum model has an alternative function where grid search parameters and quantum-specific configurations can be provided as input. Details can be found in the specific qbiocode.learning submodules.

Visualisation#

The plotting module (qbiocode.visualization) enables the user to visualise the data and provides out-of-the-box plots for some of the metrics.

compute_results_correlation(results_df[, ...])

This function takes as input a pandas DataFrame containing the results and data evaluations for a given dataset.

plot_results_correlation(correlations_df[, ...])

This function plots the Spearman correlation dot plots using the previously generated correlations_df DataFrame.
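For readers unfamiliar with the statistic, Spearman correlation is the Pearson correlation of the ranks of two variables. The sketch below is a minimal pure-Python illustration, not the code used by qbiocode.visualization; the helper names and the sample values (a dataset property against a model score) are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values: one dataset property and one model score per dataset.
noise_level = [0.1, 0.2, 0.3, 0.4, 0.5]
model_accuracy = [0.95, 0.90, 0.92, 0.70, 0.60]
rho = spearman(noise_level, model_accuracy)
print(f"Spearman rho = {rho:.2f}")
```

Because only ranks matter, the statistic captures any monotonic relationship between a data characteristic and a performance metric, not just a linear one.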

Datasets#

QBioCode provides a comprehensive suite of synthetic dataset generators for testing and benchmarking machine learning algorithms. These datasets are particularly useful for:

  • Algorithm Development: Test new quantum and classical ML algorithms

  • Benchmarking: Compare model performance across different data characteristics

  • Educational Purposes: Demonstrate ML concepts with controlled data properties

  • Reproducibility: Generate consistent datasets with fixed random seeds

Data Generation Module#

The qbiocode.data_generation module provides functions to generate various types of synthetic datasets with configurable parameters. QBioCode supports a diverse collection of dataset types, each designed to test specific aspects of machine learning algorithms, from simple 2D geometric patterns to complex high-dimensional manifolds.

_images/ArtificialDatatypes.png

Overview of artificial dataset types available in QBioCode: The figure showcases the variety of synthetic datasets that can be generated, including 2D geometric patterns (Circles, Moons), 3D manifolds (S-Curve, Swiss Roll), and high-dimensional datasets (Spheres, Spirals, Classification). Each dataset type is designed to challenge different aspects of machine learning algorithms, from handling non-linear decision boundaries to learning complex manifold structures.

Main Generator Function#

generate_data([type_of_data, save_path, ...])

Generate synthetic datasets for machine learning benchmarking.

The generate_data function serves as a unified interface to generate multiple dataset types with customizable parameters including sample size, noise levels, dimensionality, and class balance.

Available Dataset Types#

QBioCode supports the following synthetic dataset generators:

2D Geometric Patterns

generate_circles_datasets([n_samples, ...])

Generate multiple concentric circles datasets with varying parameters.

generate_moons_datasets([n_samples, noise, ...])

Generate multiple two-moons datasets with varying parameters.

  • Circles: Concentric circles with adjustable noise and separation

  • Moons: Interleaving half-circles (two moons) with controllable noise

_images/NoiseLevels.png

Effect of the noise parameter on 2D geometric patterns: Circles (top row) and Moons (bottom row) datasets with increasing noise levels from left to right. The noise parameter controls the standard deviation of Gaussian noise added to the data points, affecting class separability.
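The effect of the noise parameter can be sketched without qbiocode by sampling two concentric circles and perturbing each coordinate with Gaussian noise. The function names, the inner-circle factor of 0.5, and the radial-error measure below are assumptions chosen for illustration, not the library's implementation.

```python
import math
import random


def noisy_circles(n_samples, noise, factor=0.5, seed=42):
    """Sample points on two concentric circles and add Gaussian noise.

    Class 0 lies on the outer circle (radius 1) and class 1 on the inner
    circle (radius `factor`). `noise` is the standard deviation of the
    Gaussian perturbation applied to each coordinate.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for i in range(n_samples):
        label = i % 2  # alternate classes so both are equally represented
        radius = 1.0 if label == 0 else factor
        angle = rng.uniform(0.0, 2.0 * math.pi)
        x = radius * math.cos(angle) + rng.gauss(0.0, noise)
        y = radius * math.sin(angle) + rng.gauss(0.0, noise)
        points.append((x, y))
        labels.append(label)
    return points, labels


def mean_radius_error(points, labels, factor=0.5):
    """Average gap between each point's radius and its ideal circle radius."""
    total = 0.0
    for (x, y), label in zip(points, labels):
        ideal = 1.0 if label == 0 else factor
        total += abs(math.hypot(x, y) - ideal)
    return total / len(points)


for noise in (0.0, 0.1, 0.3):
    pts, lbls = noisy_circles(200, noise)
    print(f"noise={noise}: mean radial error = {mean_radius_error(pts, lbls):.3f}")
```

As the noise grows towards the gap between the two radii, points from the two classes start to overlap, which is the loss of separability visible in the figure.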

3D Manifolds

generate_s_curve_datasets([n_samples, ...])

Generate multiple 3D S-curve datasets with varying parameters.

generate_swiss_roll_datasets([n_samples, ...])

Generate multiple 3D Swiss roll datasets with varying parameters.

  • S-Curve: S-shaped 3D manifold for testing manifold learning

  • Swiss Roll: Classic 3D manifold with spiral structure
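To make the "spiral structure" concrete, the following sketch samples a Swiss roll using the standard parametrisation, in which the distance from the central axis grows with the angle. It is an illustration only; the function name, the height range of 21, and the returned position list are assumptions and do not describe generate_swiss_roll_datasets.

```python
import math
import random


def swiss_roll(n_samples, noise=0.0, seed=42):
    """Sample 3D points on a Swiss roll.

    Returns the points and, for each point, its angle t along the spiral.
    In the noise-free case the distance from the y-axis equals t.
    """
    rng = random.Random(seed)
    points, positions = [], []
    for _ in range(n_samples):
        t = 1.5 * math.pi * (1.0 + 2.0 * rng.random())  # angle in [1.5*pi, 4.5*pi)
        height = 21.0 * rng.random()                     # position along the roll's axis
        x = t * math.cos(t) + rng.gauss(0.0, noise)
        y = height + rng.gauss(0.0, noise)
        z = t * math.sin(t) + rng.gauss(0.0, noise)
        points.append((x, y, z))
        positions.append(t)
    return points, positions


roll, t_values = swiss_roll(300)
print("first point:", roll[0])
```

Two points can be close in 3D space yet far apart along the spiral (different t), which is why this shape is a common test for manifold learning methods.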

High-Dimensional Datasets

generate_spheres_datasets([n_s, dim, ...])

Generate multiple concentric n-dimensional spheres datasets with varying parameters.

generate_spirals_datasets([n_s, n_c, n_n, ...])

Generate multiple n-dimensional spiral datasets with varying parameters.

generate_classification_datasets(n_samples, ...)

Generate multiple high-dimensional classification datasets with varying parameters.

  • Spheres: Concentric n-dimensional spheres for high-dimensional classification

  • Spirals: Intertwined spiral patterns in n-dimensional space

  • Classification: Customizable high-dimensional datasets with:

      - Configurable number of features, informative features, and redundant features

      - Multiple classes with adjustable separation

      - Class imbalance through weight parameters

      - Cluster structure within classes
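As an illustration of class imbalance through weights, the sketch below turns a list of class weights into per-class sample counts using largest-remainder rounding. How QBioCode interprets its weight parameters internally is not stated here, so the function name and the rounding rule are assumptions.

```python
def class_counts(n_samples, weights):
    """Split n_samples across classes in proportion to weights.

    Uses largest-remainder rounding so the counts always sum to n_samples.
    """
    total = sum(weights)
    exact = [n_samples * w / total for w in weights]
    counts = [int(e) for e in exact]
    # Hand out the samples lost to truncation, largest fractional part first.
    leftovers = n_samples - sum(counts)
    by_fraction = sorted(
        range(len(weights)), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in by_fraction[:leftovers]:
        counts[i] += 1
    return counts


# Hypothetical configuration: 500 samples with a 90/10 class split.
counts = class_counts(500, [0.9, 0.1])
print(counts)
```

With a strongly skewed split like this, accuracy alone becomes a weak metric, which is one reason F1 and AUC are reported alongside it.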

Configurable Parameters#

All dataset generators support extensive parameter customization:

Sample Configuration
  • n_samples: Number of data points (default: 100-300)

  • n_classes: Number of classes (default: 2)

  • weights: Class balance ratios

Noise and Complexity
  • noise: Noise level (0.0-1.0)

  • n_informative: Number of informative features

  • n_redundant: Number of redundant features

  • n_clusters_per_class: Cluster structure

Dimensionality
  • n_features: Total number of features

  • dim: Dimensionality for manifold datasets

  • rad: Radius for geometric patterns

Reproducibility
  • random_state: Random seed for reproducible dataset generation (default: 42)

  • Ensures consistent results across multiple runs

  • Can be customized for different random variations

Output
  • save_path: Directory to save generated datasets

  • Datasets saved as CSV files with metadata JSON
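The role of random_state described above can be demonstrated with Python's standard random module: the same seed reproduces the same draws, and a different seed gives a different variation. The helper sample_noise is hypothetical and stands in for any seeded generator.

```python
import random


def sample_noise(n, seed):
    """Draw n standard-normal values from a generator seeded with `seed`."""
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


first = sample_noise(5, seed=42)
second = sample_noise(5, seed=42)
other = sample_noise(5, seed=123)

print("same seed, same values:", first == second)
print("different seed, same values:", first == other)
```

Using a local generator instance rather than seeding the global one keeps the result independent of any other code that draws random numbers in the same process.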

Example Usage#

from qbiocode.data_generation import generate_data

# Generate circles dataset with custom parameters and fixed random seed
generate_data(
    type_of_data='circles',
    n_samples=[100, 200, 300],
    noise=[0.1, 0.3, 0.5],
    save_path='./my_datasets/circles',
    random_state=42  # For reproducibility
)

# Generate high-dimensional classification data with custom seed
generate_data(
    type_of_data='classes',
    n_samples=[500],
    n_features=[20, 50, 100],
    n_informative=[5, 10, 20],
    n_classes=[2, 3],
    save_path='./my_datasets/classification',
    random_state=123  # Different seed for variation
)

# Generate reproducible moons dataset
from qbiocode.data_generation import generate_moons_datasets

generate_moons_datasets(
    n_samples=[200, 400],
    noise=[0.2, 0.4],
    save_path='./moons_data',
    random_state=42  # Same seed produces identical results
)

Dataset Characteristics#

Each dataset type is designed to test specific ML capabilities:

Dataset         Dimensionality  Tests
--------------  --------------  -------------------------------------------
Circles         2D              Non-linear separability, kernel methods
Moons           2D              Non-linear boundaries, noise robustness
S-Curve         3D manifold     Manifold learning, dimensionality reduction
Swiss Roll      3D manifold     Unrolling algorithms, local structure
Spheres         3D              Radial patterns, distance-based methods
Spirals         3D              Complex non-linear patterns
Classification  High-D          Feature selection, curse of dimensionality

Batch Generation#

The generator supports batch creation of multiple dataset configurations:

from qbiocode.data_generation import generate_data

# Generate multiple configurations automatically
N_SAMPLES = [100, 200, 300]
NOISE = [0.1, 0.3, 0.5, 0.7]

generate_data(
    type_of_data='moons',
    n_samples=N_SAMPLES,
    noise=NOISE,
    save_path='./batch_datasets'
)

# This creates len(N_SAMPLES) × len(NOISE) = 12 datasets
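The count of 12 follows from taking every combination of the two parameter lists. A minimal sketch with itertools.product makes the grid explicit; whether generate_data enumerates configurations in this order is an assumption.

```python
from itertools import product

N_SAMPLES = [100, 200, 300]
NOISE = [0.1, 0.3, 0.5, 0.7]

# One configuration per (n_samples, noise) pair: 3 x 4 combinations.
configs = [
    {"n_samples": n, "noise": level}
    for n, level in product(N_SAMPLES, NOISE)
]

print(len(configs))
print(configs[0], configs[-1])
```

Adding a third parameter list multiplies the total again, so large grids grow quickly and are worth sizing before writing them to disk.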

Output Format#

Generated datasets are saved with:

  • CSV files: Feature matrix and labels

  • JSON metadata: Configuration parameters used

  • Naming convention: {type}_{config_id}.csv
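The layout above can be pictured with a short sketch that writes one CSV file and one JSON metadata file per configuration. The column names, the metadata keys, and the choice to give the JSON file the same stem as the CSV are assumptions made for illustration; only the {type}_{config_id}.csv pattern comes from the list above.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_dataset(rows, config, out_dir, data_type, config_id):
    """Write rows to {type}_{config_id}.csv and config to a matching .json."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = f"{data_type}_{config_id}"
    csv_path = out_dir / f"{stem}.csv"
    json_path = out_dir / f"{stem}.json"  # assumed metadata file name
    with csv_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["x0", "x1", "label"])  # hypothetical column names
        writer.writerows(rows)
    with json_path.open("w") as handle:
        json.dump(config, handle, indent=2)
    return csv_path, json_path


# Hypothetical two-sample dataset and the configuration that produced it.
rows = [(0.1, 0.9, 0), (0.8, 0.2, 1)]
config = {"type_of_data": "moons", "n_samples": 2, "noise": 0.1, "random_state": 42}

with tempfile.TemporaryDirectory() as tmp:
    csv_path, json_path = save_dataset(rows, config, tmp, "moons", 0)
    saved_name = csv_path.name
    reloaded = json.loads(json_path.read_text())
    with csv_path.open(newline="") as handle:
        table = list(csv.reader(handle))

print(saved_name, reloaded["noise"], len(table))
```

Storing the configuration next to the data means a dataset can be regenerated or audited later without relying on file names alone.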

See also

For a complete tutorial on data generation, see the Artificial Data Generation Tutorial.

References#