qbiocode.data_generation package#

Submodules#

qbiocode.data_generation.generator module#

Main data generation interface for QBioCode.

This module provides a unified interface to generate various types of synthetic datasets for machine learning benchmarking and evaluation.

generate_data(type_of_data=None, save_path=None, n_samples=[100, 120, 140, 160, 180, 200, 220, 240, 260, 280], noise=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], hole=[True, False], n_classes=[2], dim=[3, 6, 9, 12], rad=[3, 6, 9, 12], n_features=[10, 30, 50], n_informative=[2, 6], n_redundant=[2, 6], n_clusters_per_class=[1], weights=[[0.3, 0.7], [0.4, 0.6], [0.5, 0.5]], random_state=42)

Generate synthetic datasets for machine learning benchmarking.

Unified interface to generate various types of synthetic datasets with configurable parameters. Each dataset type creates multiple configurations by varying the specified parameters.

Parameters:

type_of_data (str) – Type of dataset to generate. Options: ‘circles’, ‘moons’, ‘classes’, ‘s_curve’, ‘spheres’, ‘spirals’, ‘swiss_roll’.
save_path (str) – Directory path where datasets will be saved.
n_samples (list of int, default=range(100, 300, 20)) – Sample sizes for dataset configurations.
noise (list of float, default=[0.1, 0.2, ..., 0.9]) – Noise levels to apply.
hole (list of bool, default=[True, False]) – Whether to include a hole (for swiss_roll only).
n_classes (list of int, default=[2]) – Number of classes (for spirals and classes).
dim (list of int, default=[3, 6, 9, 12]) – Dimensionalities (for spheres and spirals).
rad (list of float, default=[3, 6, 9, 12]) – Radii (for spheres only).
n_features (list of int, default=range(10, 60, 20)) – Feature counts (for classes only).
n_informative (list of int, default=range(2, 8, 4)) – Informative feature counts (for classes only).
n_redundant (list of int, default=range(2, 8, 4)) – Redundant feature counts (for classes only).
n_clusters_per_class (list of int, default=[1]) – Clusters per class (for classes only).
weights (list of list of float, default=[[0.3, 0.7], [0.4, 0.6], [0.5, 0.5]]) – Class weight distributions (for classes only).
random_state (int, default=42) – Random seed for reproducibility.

Returns:

Saves generated datasets to the specified path.

Return type:

None

Raises:

ValueError – If type_of_data is not one of the supported types.
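The check behind this ValueError can be pictured as a lookup from type_of_data to a generator. The sketch below is illustrative only: the mapping is inferred from the function names documented on this page, and `GENERATOR_NAMES` and `resolve_generator` are hypothetical names, not part of qbiocode.

```python
# Hypothetical sketch: maps each documented type_of_data value to the name of
# the generator documented on this page. The real dispatch logic may differ.
GENERATOR_NAMES = {
    "circles": "generate_circles_datasets",
    "moons": "generate_moons_datasets",
    "classes": "generate_classification_datasets",
    "s_curve": "generate_s_curve_datasets",
    "spheres": "generate_spheres_datasets",
    "spirals": "generate_spirals_datasets",
    "swiss_roll": "generate_swiss_roll_datasets",
}


def resolve_generator(type_of_data):
    """Return the generator name for a dataset type, or raise ValueError."""
    if type_of_data not in GENERATOR_NAMES:
        supported = ", ".join(sorted(GENERATOR_NAMES))
        raise ValueError(
            f"Unsupported type_of_data {type_of_data!r}; expected one of: {supported}"
        )
    return GENERATOR_NAMES[type_of_data]


print(resolve_generator("moons"))
```

Under this reading, the default type_of_data=None is rejected as well, because None is not one of the seven listed options.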

Examples

>>> from qbiocode.data_generation import generate_data
>>> generate_data(type_of_data='circles', save_path='data/circles')
Generating circles dataset...
Dataset generation complete.
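To get a feel for how many datasets a default call might write, the following sketch expands the default parameter lists from the signature. It assumes one dataset per element of the Cartesian product of the relevant lists; the page states this explicitly only for the blob generator, so treat the counts as an estimate for the other types.

```python
from itertools import product

# Default parameter lists copied from the generate_data signature.
n_samples = list(range(100, 300, 20))
noise = [round(0.1 * k, 1) for k in range(1, 10)]
hole = [True, False]

# Assumption: one dataset per element of the Cartesian product.
circles_grid = list(product(n_samples, noise))
swiss_roll_grid = list(product(n_samples, noise, hole))

print(len(circles_grid))     # 90
print(len(swiss_roll_grid))  # 180
print(circles_grid[0])       # (100, 0.1)
```

Passing shorter lists for n_samples or noise is the simplest way to keep the number of generated files small.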

qbiocode.data_generation.make_blobs module#

Generate synthetic blob (Gaussian cluster) datasets.

This module creates multiple configurations of blob datasets with varying numbers of samples, features, centers, and cluster standard deviations, useful for testing clustering and classification algorithms.

generate_blobs_datasets(n_samples, n_features, centers, cluster_std, save_path=None, random_state=42)

Generate multiple blob (Gaussian cluster) datasets with varying parameters.

Creates a series of synthetic datasets consisting of isotropic Gaussian blobs for clustering and classification tasks. Each configuration varies the number of samples, features, cluster centers, and cluster spread.

Parameters:

n_samples (list of int) – List of sample sizes to generate for each configuration. Example: [100, 200, 300]
n_features (list of int) – List of feature dimensions to generate. Example: [2, 4, 8]
centers (list of int) – List of numbers of cluster centers (classes). Example: [2, 3, 4]
cluster_std (list of float) – List of standard deviations of the clusters. Example: [0.5, 1.0, 1.5, 2.0]
save_path (str, optional) – Directory path to save generated datasets. If None, datasets are not saved. Default: None
random_state (int, optional) – Random seed for reproducibility. Default: 42

Returns:

Dictionary containing generated datasets with keys as configuration strings and values as tuples of (X, y) where:

X (pd.DataFrame, shape (n_samples, n_features)) – Feature matrix
y (pd.Series, shape (n_samples,)) – Target labels

Return type:

dict

Notes

Generates all combinations of input parameters
Each blob is an isotropic Gaussian distribution
Useful for testing classification and clustering algorithms
Blobs are well-separated when cluster_std is small relative to center distances

Examples

>>> from qbiocode.data_generation import generate_blobs_datasets
>>>
>>> # Generate simple blob datasets
>>> datasets = generate_blobs_datasets(
...     n_samples=[100, 200],
...     n_features=[2, 4],
...     centers=[2, 3],
...     cluster_std=[1.0, 1.5]
... )
>>>
>>> # Access a specific configuration
>>> X, y = datasets['n_samples_100_n_features_2_centers_2_cluster_std_1.0']
>>> print(f"Shape: {X.shape}, Classes: {y.nunique()}")
Shape: (100, 2), Classes: 2

>>> # Save datasets to disk
>>> datasets = generate_blobs_datasets(
...     n_samples=[100],
...     n_features=[2],
...     centers=[3],
...     cluster_std=[1.0],
...     save_path='./data/blobs'
... )
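Because every combination of the input lists is generated, the first call above would yield 2 × 2 × 2 × 2 = 16 configurations. The sketch below enumerates the corresponding dictionary keys, assuming they all follow the pattern of the single key shown in the example; that pattern is inferred, not confirmed by the source.

```python
from itertools import product

n_samples = [100, 200]
n_features = [2, 4]
centers = [2, 3]
cluster_std = [1.0, 1.5]

# Assumption: every key follows the pattern of the one key shown in the example.
keys = [
    f"n_samples_{n}_n_features_{f}_centers_{c}_cluster_std_{s}"
    for n, f, c, s in product(n_samples, n_features, centers, cluster_std)
]

print(len(keys))  # 16
print(keys[0])    # n_samples_100_n_features_2_centers_2_cluster_std_1.0
```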

qbiocode.data_generation.make_circles module#

Generate synthetic concentric circles datasets for binary classification tasks.

This module creates multiple configurations of 2D concentric circles datasets with varying sample sizes and noise levels, useful for testing machine learning algorithms on non-linearly separable data.

generate_circles_datasets(n_samples=[100, 120, 140, 160, 180, 200, 220, 240, 260, 280], noise=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], save_path=None, random_state=42)

Generate multiple concentric circles datasets with varying parameters.

Creates a series of 2D datasets where samples form two concentric circles, providing a classic non-linearly separable binary classification problem. Each configuration varies the number of samples and noise level.

Parameters:

n_samples (list of int, default=range(100, 300, 20)) – List of sample sizes to generate for each configuration.
noise (list of float, default=[0.1, 0.2, ..., 0.9]) – List of noise standard deviations to apply to the data.
save_path (str, optional) – Directory path where datasets and configuration files will be saved. The signature default is None, although the docstring names 'circles_data'.
random_state (int, default=42) – Random seed for reproducibility.

Returns:

Saves CSV files for each dataset configuration and a JSON file with all configuration parameters.

Return type:

None

Notes

Each dataset is saved as ‘circles_data-{i}.csv’ where i is the configuration number
Configuration parameters are saved in ‘dataset_config.json’
The last column ‘class’ contains binary labels (0 or 1)

Examples

>>> from qbiocode.data_generation import generate_circles_datasets
>>> generate_circles_datasets(n_samples=[100, 200], noise=[0.1, 0.3])
Generating circles dataset...
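A saved file can be read back with the standard library alone. The sketch below relies only on the layout described in the Notes (label in a last column named 'class'). The feature column names, the presence of a header row, and the index 1 in the filename are assumptions made for the stand-in file, and `load_dataset` is a hypothetical helper.

```python
import csv
import tempfile
from pathlib import Path


def load_dataset(csv_path):
    """Split a generated CSV into feature rows and the trailing 'class' column."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        if header[-1] != "class":
            raise ValueError("expected the last column to be named 'class'")
        rows = [row for row in reader if row]
    features = [[float(value) for value in row[:-1]] for row in rows]
    labels = [int(float(row[-1])) for row in rows]
    return header[:-1], features, labels


# Stand-in file so the sketch runs without qbiocode; the feature column names
# and the index 1 in the filename are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "circles_data-1.csv"
    path.write_text("x0,x1,class\n0.98,0.12,0\n-0.41,0.33,1\n")
    names, X, y = load_dataset(path)

print(names, X, y)
```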

qbiocode.data_generation.make_class module#

Generate synthetic high-dimensional classification datasets.

This module creates multiple configurations of multi-class classification datasets with varying dimensionality, feature characteristics, and class distributions, useful for testing machine learning algorithms on high-dimensional data.

generate_classification_datasets(n_samples, n_features, n_informative, n_redundant, n_classes, n_clusters_per_class, weights, save_path=None, random_state=42)

Generate multiple high-dimensional classification datasets with varying parameters.

Creates a series of synthetic datasets for multi-class classification problems with configurable feature characteristics including informative features, redundant features, and class distributions.

Parameters:

n_samples (list of int) – List of sample sizes to generate for each configuration.
n_features (list of int) – List of total feature counts (must be >= n_informative + n_redundant).
n_informative (list of int) – List of informative feature counts that are useful for prediction.
n_redundant (list of int) – List of redundant feature counts (linear combinations of informative features).
n_classes (list of int) – List of class counts for multi-class classification.
n_clusters_per_class (list of int) – List of cluster counts per class.
weights (list of list of float) – List of class weight distributions (each distribution must sum to 1.0).
save_path (str, optional) – Directory path where datasets and configuration files will be saved.
random_state (int, default=42) – Random seed for reproducibility.

Returns:

Saves CSV files for each dataset configuration and a JSON file with all configuration parameters.

Return type:

None

Notes

Each dataset is saved as ‘class_data-{i}.csv’ where i is the configuration number
Configuration parameters are saved in ‘dataset_config.json’
The last column ‘class’ contains class labels
Only valid configurations where (n_informative + n_redundant) <= n_features are generated

Examples

>>> from qbiocode.data_generation import generate_classification_datasets
>>> generate_classification_datasets(
...     n_samples=[100], n_features=[20], n_informative=[5],
...     n_redundant=[2], n_classes=[2], n_clusters_per_class=[1],
...     weights=[[0.5, 0.5]], save_path='data'
... )
Generating classes dataset...
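The validity rule from the Notes, (n_informative + n_redundant) <= n_features, can be checked before generating anything. The sketch below applies it to the default lists that generate_data uses for the 'classes' type; how qbiocode applies the filter internally is not shown on this page, so this is an independent illustration.

```python
import math
from itertools import product

# Default lists for the 'classes' type, copied from the generate_data signature.
n_features = [10, 30, 50]
n_informative = [2, 6]
n_redundant = [2, 6]
weights = [[0.3, 0.7], [0.4, 0.6], [0.5, 0.5]]

candidates = list(product(n_features, n_informative, n_redundant))
valid = [
    (features, informative, redundant)
    for features, informative, redundant in candidates
    if informative + redundant <= features
]
skipped = [combo for combo in candidates if combo not in valid]

print(len(candidates), len(valid))  # 12 11
print(skipped)                      # [(10, 6, 6)]

# Each weight distribution should sum to 1.0 (compared with a float tolerance).
print(all(math.isclose(sum(w), 1.0) for w in weights))  # True
```

With the defaults, only the combination of 10 features, 6 informative, and 6 redundant is dropped, because 6 + 6 exceeds 10.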

qbiocode.data_generation.make_moons module#

Generate synthetic two-moons datasets for binary classification tasks.

This module creates multiple configurations of 2D two-moons datasets with varying sample sizes and noise levels, useful for testing machine learning algorithms on non-linearly separable data with interleaving classes.

generate_moons_datasets(n_samples=[100, 120, 140, 160, 180, 200, 220, 240, 260, 280], noise=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], save_path=None, random_state=42)

Generate multiple two-moons datasets with varying parameters.

Creates a series of 2D datasets where samples form two interleaving half-circles (moons), providing a challenging non-linearly separable binary classification problem. Each configuration varies the number of samples and noise level.

Parameters:

n_samples (list of int, default=range(100, 300, 20)) – List of sample sizes to generate for each configuration.
noise (list of float, default=[0.1, 0.2, ..., 0.9]) – List of noise standard deviations to apply to the data.
save_path (str, optional) – Directory path where datasets and configuration files will be saved.
random_state (int, default=42) – Random seed for reproducibility.

Returns:

Saves CSV files for each dataset configuration and a JSON file with all configuration parameters.

Return type:

None

Notes

Each dataset is saved as ‘moons_data-{i}.csv’ where i is the configuration number
Configuration parameters are saved in ‘dataset_config.json’
The last column ‘class’ contains binary labels (0 or 1)
Two-moons datasets are commonly used to evaluate algorithms on interleaving patterns

Examples

>>> from qbiocode.data_generation import generate_moons_datasets
>>> generate_moons_datasets(n_samples=[100, 200], noise=[0.1, 0.3], save_path='data')
Generating moons dataset...

qbiocode.data_generation.make_s_curve module#

Generate synthetic 3D S-curve datasets for manifold learning tasks.

This module creates multiple configurations of 3D S-curve datasets with varying sample sizes and noise levels, useful for testing dimensionality reduction and manifold learning algorithms.

generate_s_curve_datasets(n_samples=[100, 120, 140, 160, 180, 200, 220, 240, 260, 280], noise=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], save_path=None, random_state=42)

Generate multiple 3D S-curve datasets with varying parameters.

Creates a series of 3D datasets where samples lie on an S-shaped manifold, a classic benchmark for manifold learning and dimensionality reduction algorithms. Each configuration varies the number of samples and noise level.

Parameters:

n_samples (list of int, default=range(100, 300, 20)) – List of sample sizes to generate for each configuration.
noise (list of float, default=[0.1, 0.2, ..., 0.9]) – List of noise standard deviations to apply to the data.
save_path (str, optional) – Directory path where datasets and configuration files will be saved.
random_state (int, default=42) – Random seed for reproducibility.

Returns:

Saves CSV files for each dataset configuration and a JSON file with all configuration parameters.

Return type:

None

Notes

Each dataset is saved as ‘s_curve_data-{i}.csv’ where i is the configuration number
Configuration parameters are saved in ‘dataset_config.json’
The last column ‘class’ contains the position along the manifold (continuous values)
S-curve is a standard benchmark for testing manifold learning algorithms

Examples

>>> from qbiocode.data_generation import generate_s_curve_datasets
>>> generate_s_curve_datasets(n_samples=[200], noise=[0.1], save_path='data')
Generating S Curve dataset...

qbiocode.data_generation.make_spheres module#

Generate synthetic concentric n-dimensional spheres datasets for binary classification.

This module creates multiple configurations of high-dimensional concentric spheres datasets with varying sample sizes, dimensionality, and radii, useful for testing machine learning algorithms on high-dimensional non-linearly separable data.

generate_points_in_nd_sphere(n_s, dim=3, radius=1, thresh=0.9)

Generate random points within an n-dimensional spherical shell.

Parameters:

n_s (int) – Number of points to generate.
dim (int, default=3) – Dimensionality of the sphere.
radius (float, default=1) – Outer radius of the spherical shell.
thresh (float, default=0.9) – Inner radius threshold as fraction of outer radius (creates shell).

Returns:

points – Generated points within the spherical shell.

Return type:

ndarray of shape (n_s, dim)
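One common way to draw points inside a spherical shell is to pick a uniformly random direction and then a radial distance between the inner and outer radius. The sketch below illustrates that idea with the standard library only. It is not the qbiocode implementation: `sample_shell` is a hypothetical helper, the uniform radial draw is an assumption, and it returns nested lists rather than an ndarray.

```python
import math
import random


def sample_shell(n_points, dim=3, radius=1.0, thresh=0.9, seed=42):
    """Sample points whose norm lies in [thresh * radius, radius]."""
    rng = random.Random(seed)
    points = []
    for _ in range(n_points):
        # A normalised Gaussian vector is uniformly distributed in direction.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in direction))
        # Assumption: the radial distance is drawn uniformly inside the shell.
        r = rng.uniform(thresh * radius, radius)
        points.append([r * c / norm for c in direction])
    return points


pts = sample_shell(5, dim=3, radius=2.0, thresh=0.9)
norms = [math.sqrt(sum(c * c for c in p)) for p in pts]
print(all(1.8 - 1e-9 <= n <= 2.0 + 1e-9 for n in norms))  # True
```

A thresh close to 1 gives a thin shell near the surface, while a smaller thresh lets points fall deeper inside the sphere.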

generate_spheres_datasets(n_s=[100, 125, 150, 175, 200, 225, 250, 275], dim=[5, 10], radius=[5, 10, 15], save_path=None, random_state=42)

Generate multiple concentric n-dimensional spheres datasets with varying parameters.

Creates a series of high-dimensional datasets where samples form two concentric spherical shells, providing a challenging non-linearly separable binary classification problem in high dimensions. Each configuration varies the number of samples, dimensionality, and sphere radii.

Parameters:

n_s (list of int, default=range(100, 300, 25)) – List of sample sizes per class to generate for each configuration.
dim (list of int, default=range(5, 15, 5)) – List of dimensionalities for the spheres.
radius (list of float, default=range(5, 20, 5)) – List of outer sphere radii (inner sphere is 0.5 * outer radius).
save_path (str, optional) – Directory path where datasets and configuration files will be saved.
random_state (int, default=42) – Random seed for reproducibility.

Returns:

Saves CSV files for each dataset configuration and a JSON file with all configuration parameters.

Return type:

None

Notes

Each dataset is saved as ‘spheres_data-{i}.csv’ where i is the configuration number
Configuration parameters are saved in ‘dataset_config.json’
The last column ‘class’ contains binary labels (0 for outer, 1 for inner sphere)
Samples are generated in spherical shells (not solid spheres) for better separation

Examples

>>> from qbiocode.data_generation import generate_spheres_datasets
>>> generate_spheres_datasets(n_s=[100], dim=[5], radius=[10], save_path='data')
Generating spheres dataset...

qbiocode.data_generation.make_spirals module#

Generate synthetic n-dimensional spiral datasets for multi-class classification.

This module creates multiple configurations of high-dimensional spiral datasets with varying sample sizes, noise levels, and dimensionality, useful for testing machine learning algorithms on complex non-linearly separable patterns.

generate_spirals_datasets(n_s=[100, 150, 200, 250], n_c=[2], n_n=[0.3, 0.6, 0.9], n_d=[3, 6, 9, 12], save_path=None, random_state=42)

Generate multiple n-dimensional spiral datasets with varying parameters.

Creates a series of high-dimensional datasets where samples form intertwined spiral patterns, providing challenging non-linearly separable multi-class classification problems. Each configuration varies the number of samples, classes, noise level, and dimensionality.

Parameters:

n_s (list of int, default=range(100, 300, 50)) – List of sample sizes to generate for each configuration.
n_c (list of int, default=[2]) – List of class counts (number of spiral arms).
n_n (list of float, default=[0.3, 0.6, 0.9]) – List of noise standard deviations to apply to the data.
n_d (list of int, default=[3, 6, 9, 12]) – List of dimensionalities (each must be 3, 6, 9, or 12).
save_path (str, optional) – Directory path where datasets and configuration files will be saved.
random_state (int, default=42) – Random seed for reproducibility.

Returns:

Saves CSV files for each dataset configuration and a JSON file with all configuration parameters.

Return type:

None

Notes

Each dataset is saved as ‘spirals_data-{i}.csv’ where i is the configuration number
Configuration parameters are saved in ‘dataset_config.json’
The last column ‘class’ contains class labels
Spiral patterns become increasingly complex in higher dimensions

Examples

>>> from qbiocode.data_generation import generate_spirals_datasets
>>> generate_spirals_datasets(n_s=[200], n_c=[2], n_n=[0.3], n_d=[3], save_path='data')
Generating spirals dataset...

make_spirals(n_samples=5000, n_classes=2, noise=0.3, dim=3)

Generate an n-dimensional dataset of intertwined spirals.

Creates spiral patterns in n-dimensional space where each class forms a distinct spiral arm. Supports dimensions 3, 6, 9, and 12.

Parameters:

n_samples (int, default=5000) – Total number of samples to generate.
n_classes (int, default=2) – Number of spiral arms (classes).
noise (float, default=0.3) – Standard deviation of Gaussian noise added to each dimension.
dim (int, default=3) – Dimensionality of the output space (must be 3, 6, 9, or 12).

Returns:

X (ndarray of shape (n_samples, dim)) – Generated spiral data points.
y (ndarray of shape (n_samples,)) – Class labels for each sample.

qbiocode.data_generation.make_swiss_roll module#

Generate synthetic 3D Swiss roll datasets for manifold learning tasks.

This module creates multiple configurations of 3D Swiss roll datasets with varying sample sizes, noise levels, and hole configurations, useful for testing dimensionality reduction and manifold learning algorithms.

generate_swiss_roll_datasets(n_samples=[100, 120, 140, 160, 180, 200, 220, 240, 260, 280], noise=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], hole=[True, False], save_path=None, random_state=42)

Generate multiple 3D Swiss roll datasets with varying parameters.

Creates a series of 3D datasets where samples lie on a Swiss roll manifold, a classic benchmark for manifold learning and dimensionality reduction algorithms. Each configuration varies the number of samples, noise level, and whether the roll has a hole in the center.

Parameters:

n_samples (list of int, default=range(100, 300, 20)) – List of sample sizes to generate for each configuration.
noise (list of float, default=[0.1, 0.2, ..., 0.9]) – List of noise standard deviations to apply to the data.
hole (list of bool, default=[True, False]) – List of boolean values indicating whether to generate the Swiss roll with a hole.
save_path (str, optional) – Directory path where datasets and configuration files will be saved.
random_state (int, default=42) – Random seed for reproducibility.

Returns:

Saves CSV files for each dataset configuration and a JSON file with all configuration parameters.

Return type:

None

Notes

Each dataset is saved as ‘swiss_roll_data-{i}.csv’ where i is the configuration number
Configuration parameters are saved in ‘dataset_config.json’
The last column ‘class’ contains the position along the manifold (continuous values)
Swiss roll is a standard benchmark for testing manifold learning algorithms

Examples

>>> from qbiocode.data_generation import generate_swiss_roll_datasets
>>> generate_swiss_roll_datasets(n_samples=[200], noise=[0.1], hole=[False], save_path='data')
Generating swiss roll dataset...

Module contents#

Data Generation Module for QBioCode.

This module provides functions to generate synthetic datasets for testing machine learning algorithms. Each function creates multiple dataset configurations with varying parameters, useful for benchmarking and evaluation.

Available dataset generators:

generate_blobs_datasets: Isotropic Gaussian blobs (clusters)
generate_circles_datasets: 2D concentric circles
generate_moons_datasets: 2D interleaving half-circles
generate_classification_datasets: High-dimensional multi-class data
generate_s_curve_datasets: 3D S-shaped manifold
generate_spheres_datasets: N-dimensional concentric spheres
generate_spirals_datasets: N-dimensional intertwined spirals
generate_swiss_roll_datasets: 3D Swiss roll manifold

generate_blobs_datasets(n_samples, n_features, centers, cluster_std, save_path=None, random_state=42)

Generate multiple blob (Gaussian cluster) datasets with varying parameters.

Creates a series of synthetic datasets consisting of isotropic Gaussian blobs for clustering and classification tasks. Each configuration varies the number of samples, features, cluster centers, and cluster spread.

Parameters:

n_samples (list of int) – List of sample sizes to generate for each configuration. Example: [100, 200, 300]
n_features (list of int) – List of feature dimensions to generate. Example: [2, 4, 8]
centers (list of int) – List of numbers of cluster centers (classes). Example: [2, 3, 4]
cluster_std (list of float) – List of standard deviations of the clusters. Example: [0.5, 1.0, 1.5, 2.0]
save_path (str, optional) – Directory path to save generated datasets. If None, datasets are not saved. Default: None
random_state (int, optional) – Random seed for reproducibility. Default: 42

Returns:

Dictionary containing generated datasets with keys as configuration strings and values as tuples of (X, y) where:

X (pd.DataFrame, shape (n_samples, n_features)) – Feature matrix
y (pd.Series, shape (n_samples,)) – Target labels

Return type:

dict

Notes

Generates all combinations of input parameters
Each blob is an isotropic Gaussian distribution
Useful for testing classification and clustering algorithms
Blobs are well-separated when cluster_std is small relative to center distances

Examples

>>> from qbiocode.data_generation import generate_blobs_datasets
>>>
>>> # Generate simple blob datasets
>>> datasets = generate_blobs_datasets(
...     n_samples=[100, 200],
...     n_features=[2, 4],
...     centers=[2, 3],
...     cluster_std=[1.0, 1.5]
... )
>>>
>>> # Access a specific configuration
>>> X, y = datasets['n_samples_100_n_features_2_centers_2_cluster_std_1.0']
>>> print(f"Shape: {X.shape}, Classes: {y.nunique()}")
Shape: (100, 2), Classes: 2

>>> # Save datasets to disk
>>> datasets = generate_blobs_datasets(
...     n_samples=[100],
...     n_features=[2],
...     centers=[3],
...     cluster_std=[1.0],
...     save_path='./data/blobs'
... )

See also

generate_circles_datasets: Generate concentric circles
generate_moons_datasets: Generate interleaving half-circles
generate_classification_datasets: Generate high-dimensional classification data

generate_circles_datasets(n_samples=[100, 120, 140, 160, 180, 200, 220, 240, 260, 280], noise=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], save_path=None, random_state=42)

Generate multiple concentric circles datasets with varying parameters.

Creates a series of 2D datasets where samples form two concentric circles, providing a classic non-linearly separable binary classification problem. Each configuration varies the number of samples and noise level.

Parameters:

n_samples (list of int, default=range(100, 300, 20)) – List of sample sizes to generate for each configuration.
noise (list of float, default=[0.1, 0.2, ..., 0.9]) – List of noise standard deviations to apply to the data.
save_path (str, optional) – Directory path where datasets and configuration files will be saved. The signature default is None, although the docstring names 'circles_data'.
random_state (int, default=42) – Random seed for reproducibility.

Returns:

Saves CSV files for each dataset configuration and a JSON file with all configuration parameters.

Return type:

None

Notes

Each dataset is saved as ‘circles_data-{i}.csv’ where i is the configuration number
Configuration parameters are saved in ‘dataset_config.json’
The last column ‘class’ contains binary labels (0 or 1)

Examples

>>> from qbiocode.data_generation import generate_circles_datasets
>>> generate_circles_datasets(n_samples=[100, 200], noise=[0.1, 0.3])
Generating circles dataset...

generate_classification_datasets(n_samples, n_features, n_informative, n_redundant, n_classes, n_clusters_per_class, weights, save_path=None, random_state=42)

Generate multiple high-dimensional classification datasets with varying parameters.

Creates a series of synthetic datasets for multi-class classification problems with configurable feature characteristics including informative features, redundant features, and class distributions.

Parameters:

n_samples (list of int) – List of sample sizes to generate for each configuration.
n_features (list of int) – List of total feature counts (must be >= n_informative + n_redundant).
n_informative (list of int) – List of informative feature counts that are useful for prediction.
n_redundant (list of int) – List of redundant feature counts (linear combinations of informative features).
n_classes (list of int) – List of class counts for multi-class classification.
n_clusters_per_class (list of int) – List of cluster counts per class.
weights (list of list of float) – List of class weight distributions (each distribution must sum to 1.0).
save_path (str, optional) – Directory path where datasets and configuration files will be saved.
random_state (int, default=42) – Random seed for reproducibility.

Returns:

Saves CSV files for each dataset configuration and a JSON file with all configuration parameters.

Return type:

None

Notes

Each dataset is saved as ‘class_data-{i}.csv’ where i is the configuration number
Configuration parameters are saved in ‘dataset_config.json’
The last column ‘class’ contains class labels
Only valid configurations where (n_informative + n_redundant) <= n_features are generated

Examples

>>> from qbiocode.data_generation import generate_classification_datasets
>>> generate_classification_datasets(
...     n_samples=[100], n_features=[20], n_informative=[5],
...     n_redundant=[2], n_classes=[2], n_clusters_per_class=[1],
...     weights=[[0.5, 0.5]], save_path='data'
... )
Generating classes dataset...

generate_default_blobs_datasets(save_path=None, random_state=42)

Generate blob datasets with default parameter configurations.

Convenience function that generates a standard set of blob datasets using predefined parameter ranges suitable for most testing scenarios.

Parameters:

save_path (str, optional) – Directory path to save generated datasets. If None, datasets are not saved. Default: None
random_state (int, optional) – Random seed for reproducibility. Default: 42

Returns:

Dictionary containing generated datasets.

Return type:

dict

Examples

>>> from qbiocode.data_generation import generate_default_blobs_datasets
>>> datasets = generate_default_blobs_datasets()
>>> print(f"Generated {len(datasets)} dataset configurations")
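When iterating over the returned dictionary, it can help to recover the parameters from each key. The sketch below parses a key of the form shown in the generate_blobs_datasets example. That generate_default_blobs_datasets uses the same key format is an assumption, and `KEY_PATTERN` and `parse_config_key` are hypothetical names.

```python
import re

# Assumption: keys look like the one shown in the generate_blobs_datasets example.
KEY_PATTERN = re.compile(
    r"n_samples_(?P<n_samples>\d+)"
    r"_n_features_(?P<n_features>\d+)"
    r"_centers_(?P<centers>\d+)"
    r"_cluster_std_(?P<cluster_std>\d+(?:\.\d+)?)$"
)


def parse_config_key(key):
    """Turn a configuration key back into a dict of typed parameters."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"unrecognised configuration key: {key!r}")
    return {
        "n_samples": int(match["n_samples"]),
        "n_features": int(match["n_features"]),
        "centers": int(match["centers"]),
        "cluster_std": float(match["cluster_std"]),
    }


print(parse_config_key("n_samples_100_n_features_2_centers_2_cluster_std_1.0"))
```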

generate_moons_datasets(n_samples=[100, 120, 140, 160, 180, 200, 220, 240, 260, 280], noise=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], save_path=None, random_state=42)

Generate multiple two-moons datasets with varying parameters.

Creates a series of 2D datasets where samples form two interleaving half-circles (moons), providing a challenging non-linearly separable binary classification problem. Each configuration varies the number of samples and noise level.

Parameters:

n_samples (list of int, default=range(100, 300, 20)) – List of sample sizes to generate for each configuration.
noise (list of float, default=[0.1, 0.2, ..., 0.9]) – List of noise standard deviations to apply to the data.
save_path (str, optional) – Directory path where datasets and configuration files will be saved.
random_state (int, default=42) – Random seed for reproducibility.

Returns:

Saves CSV files for each dataset configuration and a JSON file with all configuration parameters.

Return type:

None

Notes

Each dataset is saved as ‘moons_data-{i}.csv’ where i is the configuration number
Configuration parameters are saved in ‘dataset_config.json’
The last column ‘class’ contains binary labels (0 or 1)
Two-moons datasets are commonly used to evaluate algorithms on interleaving patterns

Examples

>>> from qbiocode.data_generation import generate_moons_datasets
>>> generate_moons_datasets(n_samples=[100, 200], noise=[0.1, 0.3], save_path='data')
Generating moons dataset...

generate_s_curve_datasets(n_samples=[100, 120, 140, 160, 180, 200, 220, 240, 260, 280], noise=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], save_path=None, random_state=42)

Generate multiple 3D S-curve datasets with varying parameters.

Creates a series of 3D datasets where samples lie on an S-shaped manifold, a classic benchmark for manifold learning and dimensionality reduction algorithms. Each configuration varies the number of samples and noise level.

Parameters:

n_samples (list of int, default=range(100, 300, 20)) – List of sample sizes to generate for each configuration.
noise (list of float, default=[0.1, 0.2, ..., 0.9]) – List of noise standard deviations to apply to the data.
save_path (str, optional) – Directory path where datasets and configuration files will be saved.
random_state (int, default=42) – Random seed for reproducibility.

Returns:

Saves CSV files for each dataset configuration and a JSON file with all configuration parameters.

Return type:

None

Notes

Each dataset is saved as ‘s_curve_data-{i}.csv’ where i is the configuration number
Configuration parameters are saved in ‘dataset_config.json’
The last column ‘class’ contains the position along the manifold (continuous values)
S-curve is a standard benchmark for testing manifold learning algorithms

Examples

>>> from qbiocode.data_generation import generate_s_curve_datasets
>>> generate_s_curve_datasets(n_samples=[200], noise=[0.1], save_path='data')
Generating S Curve dataset...

generate_spheres_datasets(n_s=[100, 125, 150, 175, 200, 225, 250, 275], dim=[5, 10], radius=[5, 10, 15], save_path=None, random_state=42)

Generate multiple concentric n-dimensional spheres datasets with varying parameters.

Creates a series of high-dimensional datasets where samples form two concentric spherical shells, providing a challenging non-linearly separable binary classification problem in high dimensions. Each configuration varies the number of samples, dimensionality, and sphere radii.

Parameters:

n_s (list of int, default=range(100, 300, 25)) – List of sample sizes per class to generate for each configuration.
dim (list of int, default=range(5, 15, 5)) – List of dimensionalities for the spheres.
radius (list of float, default=range(5, 20, 5)) – List of outer sphere radii (inner sphere is 0.5 * outer radius).
save_path (str, optional) – Directory path where datasets and configuration files will be saved.
random_state (int, default=42) – Random seed for reproducibility.

Returns:

Saves CSV files for each dataset configuration and a JSON file with all configuration parameters.

Return type:

None

Notes

Each dataset is saved as ‘spheres_data-{i}.csv’ where i is the configuration number
Configuration parameters are saved in ‘dataset_config.json’
The last column ‘class’ contains binary labels (0 for outer, 1 for inner sphere)
Samples are generated in spherical shells (not solid spheres) for better separation

Examples

>>> from qbiocode.data_generation import generate_spheres_datasets
>>> generate_spheres_datasets(n_s=[100], dim=[5], radius=[10], save_path='data')
Generating spheres dataset...

generate_spirals_datasets(n_s=[100, 150, 200, 250], n_c=[2], n_n=[0.3, 0.6, 0.9], n_d=[3, 6, 9, 12], save_path=None, random_state=42)

Generate multiple n-dimensional spiral datasets with varying parameters.

Creates a series of high-dimensional datasets where samples form intertwined spiral patterns, providing challenging non-linearly separable multi-class classification problems. Each configuration varies the number of samples, classes, noise level, and dimensionality.

Parameters:

n_s (list of int, default=range(100, 300, 50)) – List of sample sizes to generate for each configuration.
n_c (list of int, default=[2]) – List of class counts (number of spiral arms).
n_n (list of float, default=[0.3, 0.6, 0.9]) – List of noise standard deviations to apply to the data.
n_d (list of int, default=[3, 6, 9, 12]) – List of dimensionalities (each must be 3, 6, 9, or 12).
save_path (str, optional) – Directory path where datasets and configuration files will be saved.
random_state (int, default=42) – Random seed for reproducibility.

Returns:

Saves CSV files for each dataset configuration and a JSON file with all configuration parameters.

Return type:

None

Notes

Each dataset is saved as ‘spirals_data-{i}.csv’ where i is the configuration number
Configuration parameters are saved in ‘dataset_config.json’
The last column ‘class’ contains class labels
Spiral patterns become increasingly complex in higher dimensions

Examples

>>> from qbiocode.data_generation import generate_spirals_datasets
>>> generate_spirals_datasets(n_s=[200], n_c=[2], n_n=[0.3], n_d=[3], save_path='data')
Generating spirals dataset...

generate_swiss_roll_datasets(n_samples=[100, 120, 140, 160, 180, 200, 220, 240, 260, 280], noise=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], hole=[True, False], save_path=None, random_state=42)

Generate multiple 3D Swiss roll datasets with varying parameters.

Creates a series of 3D datasets where samples lie on a Swiss roll manifold, a classic benchmark for manifold learning and dimensionality reduction algorithms. Each configuration varies the number of samples, noise level, and whether the roll has a hole in the center.

Parameters:

n_samples (list of int, default=range(100, 300, 20)) – List of sample sizes to generate for each configuration.
noise (list of float, default=[0.1, 0.2, ..., 0.9]) – List of noise standard deviations to apply to the data.
hole (list of bool, default=[True, False]) – List of boolean values indicating whether to generate the Swiss roll with a hole.
save_path (str, optional) – Directory path where datasets and configuration files will be saved.
random_state (int, default=42) – Random seed for reproducibility.

Returns:

Saves CSV files for each dataset configuration and a JSON file with all configuration parameters.

Return type:

None

Notes

Each dataset is saved as ‘swiss_roll_data-{i}.csv’ where i is the configuration number
Configuration parameters are saved in ‘dataset_config.json’
The last column ‘class’ contains the position along the manifold (continuous values)
Swiss roll is a standard benchmark for testing manifold learning algorithms

Examples

>>> from qbiocode.data_generation import generate_swiss_roll_datasets
>>> generate_swiss_roll_datasets(n_samples=[200], noise=[0.1], hole=[False], save_path='data')
Generating swiss roll dataset...