Generate synthetic datasets for machine learning benchmarking.
Unified interface to generate various types of synthetic datasets with
configurable parameters. Each dataset type creates multiple configurations
by varying the specified parameters.
Parameters:
type_of_data (str) – Type of dataset to generate. Options: ‘circles’, ‘moons’, ‘classes’,
‘s_curve’, ‘spheres’, ‘spirals’, ‘swiss_roll’.
save_path (str) – Directory path where datasets will be saved.
n_samples (list of int, default=range(100, 300, 20)) – Sample sizes for dataset configurations.
noise (list of float, default=[0.1, 0.2, ..., 0.9]) – Noise levels to apply.
hole (list of bool, default=[True, False]) – Whether to include hole (for swiss_roll only).
n_classes (list of int, default=[2]) – Number of classes (for spirals and classes).
dim (list of int, default=[3, 6, 9, 12]) – Dimensionalities (for spheres and spirals).
rad (list of float, default=[3, 6, 9, 12]) – Radii (for spheres only).
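The way parameter lists expand into configurations can be sketched as follows. This is a minimal illustration that assumes one configuration per element of the Cartesian product of the lists (the blob generator's notes say it generates all combinations; for the other generators this is an assumption). The dictionary layout is hypothetical.

```python
from itertools import product

# Default parameter lists as documented for the 2D generators.
n_samples = list(range(100, 300, 20))              # 100, 120, ..., 280
noise = [round(0.1 * k, 1) for k in range(1, 10)]  # 0.1, 0.2, ..., 0.9

# Assumption: one configuration per combination of the listed values.
configs = [
    {"n_samples": n, "noise": z}
    for n, z in product(n_samples, noise)
]

print(len(configs))  # 10 sample sizes x 9 noise levels
print(configs[0])
```

Under this assumption the default lists alone produce 90 datasets per call, so passing shorter lists is a sensible way to keep a benchmark run small.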
This module creates multiple configurations of blob datasets with varying
numbers of samples, features, centers, and cluster standard deviations,
useful for testing clustering and classification algorithms.
Generate multiple blob (Gaussian cluster) datasets with varying parameters.
Creates a series of synthetic datasets consisting of isotropic Gaussian blobs
for clustering and classification tasks. Each configuration varies the number
of samples, features, cluster centers, and cluster spread.
Parameters:
n_samples (list of int) – List of sample sizes to generate for each configuration.
Example: [100, 200, 300]
n_features (list of int) – List of feature dimensions to generate.
Example: [2, 4, 8]
centers (list of int) – List of numbers of cluster centers (classes).
Example: [2, 3, 4]
cluster_std (list of float) – List of standard deviations of the clusters.
Example: [0.5, 1.0, 1.5, 2.0]
save_path (str, optional) – Directory path to save generated datasets. If None, datasets are not saved.
Default: None
random_state (int, optional) – Random seed for reproducibility.
Default: 42
Returns:
Dictionary containing generated datasets with keys as configuration strings
and values as tuples of (X, y) where:
- X : pd.DataFrame, shape (n_samples, n_features)
Feature matrix
- y : pd.Series, shape (n_samples,)
Target labels
Return type:
dict
Notes
- Generates all combinations of input parameters
- Each blob is an isotropic Gaussian distribution
- Useful for testing classification and clustering algorithms
- Blobs are well-separated when cluster_std is small relative to center distances
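What "isotropic Gaussian blob" means can be sketched with the standard library alone. This is not the library's implementation: the function name `make_blobs_sketch`, the uniform placement of centers in [-10, 10], and the plain-list return type (instead of the documented pd.DataFrame and pd.Series) are all assumptions made for illustration.

```python
import random

def make_blobs_sketch(n_samples, n_features, centers, cluster_std, random_state=42):
    """Draw isotropic Gaussian blobs: every feature of a point is an
    independent normal draw around its cluster center with the same std."""
    rng = random.Random(random_state)
    # Hypothetical choice: centers drawn uniformly from [-10, 10] per feature.
    center_points = [
        [rng.uniform(-10.0, 10.0) for _ in range(n_features)]
        for _ in range(centers)
    ]
    X, y = [], []
    for i in range(n_samples):
        label = i % centers  # spread samples evenly over the clusters
        X.append([rng.gauss(c, cluster_std) for c in center_points[label]])
        y.append(label)
    return X, y

X, y = make_blobs_sketch(n_samples=100, n_features=2, centers=3, cluster_std=0.5)
print(len(X), len(X[0]), sorted(set(y)))
```

Because the same standard deviation is used along every feature axis, increasing `cluster_std` widens each blob uniformly until neighbouring clusters overlap.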
Generate synthetic concentric circles datasets for binary classification tasks.
This module creates multiple configurations of 2D concentric circles datasets
with varying sample sizes and noise levels, useful for testing machine learning
algorithms on non-linearly separable data.
Generate multiple concentric circles datasets with varying parameters.
Creates a series of 2D datasets where samples form two concentric circles,
providing a classic non-linearly separable binary classification problem.
Each configuration varies the number of samples and noise level.
Parameters:
n_samples (list of int, default=range(100, 300, 20)) – List of sample sizes to generate for each configuration.
noise (list of float, default=[0.1, 0.2, ..., 0.9]) – List of noise standard deviations to apply to the data.
save_path (str, default='circles_data') – Directory path where datasets and configuration files will be saved.
random_state (int, default=42) – Random seed for reproducibility.
Returns:
Saves CSV files for each dataset configuration and a JSON file with
all configuration parameters.
Return type:
None
Notes
- Each dataset is saved as ‘circles_data-{i}.csv’ where i is the configuration number
- Configuration parameters are saved in ‘dataset_config.json’
- The last column ‘class’ contains binary labels (0 or 1)
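The output layout described in the notes above can be sketched as follows. Only the file names and the trailing ‘class’ column come from the documentation; the feature column names `x1` and `x2`, numbering from 1, the inner-circle radius `factor=0.5`, and the structure of the JSON file are hypothetical choices for this sketch.

```python
import csv
import json
import math
import os
import random
import tempfile
from itertools import product

def circles_sketch(n, noise, rng, factor=0.5):
    """Return rows [x1, x2, class]: class 0 on the unit circle, class 1 on
    an inner circle of radius `factor` (a hypothetical choice)."""
    rows = []
    for i in range(n):
        label = i % 2
        radius = factor if label else 1.0
        angle = rng.uniform(0.0, 2.0 * math.pi)
        rows.append([
            radius * math.cos(angle) + rng.gauss(0.0, noise),
            radius * math.sin(angle) + rng.gauss(0.0, noise),
            label,
        ])
    return rows

def save_circles(n_samples, noise, save_path, random_state=42):
    """Write one CSV per (n_samples, noise) pair plus a JSON config file."""
    os.makedirs(save_path, exist_ok=True)
    rng = random.Random(random_state)
    config = {}
    for i, (n, z) in enumerate(product(n_samples, noise), start=1):
        name = f"circles_data-{i}.csv"
        with open(os.path.join(save_path, name), "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(["x1", "x2", "class"])  # label goes last
            writer.writerows(circles_sketch(n, z, rng))
        config[name] = {"n_samples": n, "noise": z}
    with open(os.path.join(save_path, "dataset_config.json"), "w") as fh:
        json.dump(config, fh, indent=2)
    return config

out_dir = tempfile.mkdtemp()
written = save_circles([100, 120], [0.1, 0.2], out_dir)
print(sorted(written))
```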
This module creates multiple configurations of multi-class classification datasets
with varying dimensionality, feature characteristics, and class distributions,
useful for testing machine learning algorithms on high-dimensional data.
Generate multiple high-dimensional classification datasets with varying parameters.
Creates a series of synthetic datasets for multi-class classification problems
with configurable feature characteristics including informative features,
redundant features, and class distributions.
Parameters:
n_samples (list of int) – List of sample sizes to generate for each configuration.
n_features (list of int) – List of total feature counts (must be >= n_informative + n_redundant).
n_informative (list of int) – List of informative feature counts that are useful for prediction.
n_redundant (list of int) – List of redundant feature counts (linear combinations of informative features).
n_classes (list of int) – List of class counts for multi-class classification.
n_clusters_per_class (list of int) – List of cluster counts per class.
weights (list of list of float) – List of class weight distributions (each distribution must sum to 1.0).
save_path (str, optional) – Directory path where datasets and configuration files will be saved.
random_state (int, default=42) – Random seed for reproducibility.
Returns:
Saves CSV files for each dataset configuration and a JSON file with
all configuration parameters.
Return type:
None
Notes
- Each dataset is saved as ‘class_data-{i}.csv’ where i is the configuration number
- Configuration parameters are saved in ‘dataset_config.json’
- The last column ‘class’ contains class labels
- Only valid configurations where (n_informative + n_redundant) <= n_features are generated
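The validity rule in the last note can be illustrated with a short filter. This is a sketch of the stated constraint only; the example values and the dictionary layout are hypothetical, and the other parameters (classes, clusters per class, weights) are left out for brevity.

```python
from itertools import product

# Hypothetical example values for three of the parameter lists.
n_features = [4, 8]
n_informative = [2, 6]
n_redundant = [2, 4]

# Keep only combinations where the informative and redundant features
# fit inside the total feature count.
valid = [
    {"n_features": f, "n_informative": i, "n_redundant": r}
    for f, i, r in product(n_features, n_informative, n_redundant)
    if i + r <= f
]

total = len(n_features) * len(n_informative) * len(n_redundant)
print(f"{len(valid)} of {total} combinations are valid")
```

Here half of the eight combinations are dropped, so the number of files written can be noticeably smaller than the product of the list lengths.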
Generate synthetic two-moons datasets for binary classification tasks.
This module creates multiple configurations of 2D two-moons datasets with
varying sample sizes and noise levels, useful for testing machine learning
algorithms on non-linearly separable data with interleaving classes.
Generate multiple two-moons datasets with varying parameters.
Creates a series of 2D datasets where samples form two interleaving half-circles
(moons), providing a challenging non-linearly separable binary classification problem.
Each configuration varies the number of samples and noise level.
Parameters:
n_samples (list of int, default=range(100, 300, 20)) – List of sample sizes to generate for each configuration.
noise (list of float, default=[0.1, 0.2, ..., 0.9]) – List of noise standard deviations to apply to the data.
save_path (str, optional) – Directory path where datasets and configuration files will be saved.
random_state (int, default=42) – Random seed for reproducibility.
Returns:
Saves CSV files for each dataset configuration and a JSON file with
all configuration parameters.
Return type:
None
Notes
- Each dataset is saved as ‘moons_data-{i}.csv’ where i is the configuration number
- Configuration parameters are saved in ‘dataset_config.json’
- The last column ‘class’ contains binary labels (0 or 1)
- Two-moons datasets are commonly used to evaluate algorithms on interleaving patterns
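The geometry of the two interleaving half-circles can be sketched as below. The construction mirrors the widely used scikit-learn `make_moons` layout (an upper unit half-circle and a lower half-circle shifted to the right and down); whether this library uses the same construction is an assumption, and `moons_sketch` is a hypothetical name.

```python
import math
import random

def moons_sketch(n_samples, noise, random_state=42):
    """Return (x, y, class) tuples for two interleaving half-circles."""
    rng = random.Random(random_state)
    n_upper = n_samples // 2
    n_lower = n_samples - n_upper
    rows = []
    for i in range(n_samples):
        if i < n_upper:
            # Class 0: upper half of the unit circle.
            t = math.pi * i / max(n_upper - 1, 1)
            x, y, label = math.cos(t), math.sin(t), 0
        else:
            # Class 1: lower half-circle, shifted so the two arcs interleave.
            t = math.pi * (i - n_upper) / max(n_lower - 1, 1)
            x, y, label = 1.0 - math.cos(t), 0.5 - math.sin(t), 1
        # Gaussian noise with standard deviation `noise` on both coordinates.
        rows.append((x + rng.gauss(0.0, noise), y + rng.gauss(0.0, noise), label))
    return rows

rows = moons_sketch(200, 0.1)
print(rows[0])
```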
Generate synthetic 3D S-curve datasets for manifold learning tasks.
This module creates multiple configurations of 3D S-curve datasets with
varying sample sizes and noise levels, useful for testing dimensionality
reduction and manifold learning algorithms.
Generate multiple 3D S-curve datasets with varying parameters.
Creates a series of 3D datasets where samples lie on an S-shaped manifold,
a classic benchmark for manifold learning and dimensionality reduction algorithms.
Each configuration varies the number of samples and noise level.
Parameters:
n_samples (list of int, default=range(100, 300, 20)) – List of sample sizes to generate for each configuration.
noise (list of float, default=[0.1, 0.2, ..., 0.9]) – List of noise standard deviations to apply to the data.
save_path (str, optional) – Directory path where datasets and configuration files will be saved.
random_state (int, default=42) – Random seed for reproducibility.
Returns:
Saves CSV files for each dataset configuration and a JSON file with
all configuration parameters.
Return type:
None
Notes
- Each dataset is saved as ‘s_curve_data-{i}.csv’ where i is the configuration number
- Configuration parameters are saved in ‘dataset_config.json’
- The last column ‘class’ contains the position along the manifold (continuous values)
- S-curve is a standard benchmark for testing manifold learning algorithms
Examples
>>> from qbiocode.data_generation import generate_s_curve_datasets
>>> generate_s_curve_datasets(n_samples=[200], noise=[0.1], save_path='data')
Generating S Curve dataset...
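To show what "position along the manifold" means for the last column, here is a minimal sketch of an S-curve sampler. It follows the parameterization used by scikit-learn's `make_s_curve`; that this library relies on the same construction is an assumption, and `s_curve_sketch` is a hypothetical name.

```python
import math
import random

def s_curve_sketch(n_samples, noise, random_state=42):
    """Return (x, y, z, t) tuples; t is the position along the S-curve."""
    rng = random.Random(random_state)
    rows = []
    for _ in range(n_samples):
        t = 3.0 * math.pi * (rng.random() - 0.5)  # t in [-1.5*pi, 1.5*pi)
        x = math.sin(t)
        y = 2.0 * rng.random()                    # width of the sheet
        z = math.copysign(1.0, t) * (math.cos(t) - 1.0)
        rows.append((
            x + rng.gauss(0.0, noise),
            y + rng.gauss(0.0, noise),
            z + rng.gauss(0.0, noise),
            t,  # continuous target, stored last like the 'class' column
        ))
    return rows

points = s_curve_sketch(200, 0.1)
print(points[0])
```

Since the target is continuous rather than categorical, these datasets suit regression or embedding-quality checks better than classification accuracy.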
Generate synthetic concentric n-dimensional spheres datasets for binary classification.
This module creates multiple configurations of high-dimensional concentric spheres
datasets with varying sample sizes, dimensionality, and radii, useful for testing
machine learning algorithms on high-dimensional non-linearly separable data.
Generate multiple concentric n-dimensional spheres datasets with varying parameters.
Creates a series of high-dimensional datasets where samples form two concentric
spherical shells, providing a challenging non-linearly separable binary classification
problem in high dimensions. Each configuration varies the number of samples,
dimensionality, and sphere radii.
Parameters:
n_s (list of int, default=range(100, 300, 25)) – List of sample sizes per class to generate for each configuration.
dim (list of int, default=range(5, 15, 5)) – List of dimensionalities for the spheres.
radius (list of float, default=range(5, 20, 5)) – List of outer sphere radii (inner sphere is 0.5 * outer radius).
save_path (str, optional) – Directory path where datasets and configuration files will be saved.
random_state (int, default=42) – Random seed for reproducibility.
Returns:
Saves CSV files for each dataset configuration and a JSON file with
all configuration parameters.
Return type:
None
Notes
- Each dataset is saved as ‘spheres_data-{i}.csv’ where i is the configuration number
- Configuration parameters are saved in ‘dataset_config.json’
- The last column ‘class’ contains binary labels (0 for outer, 1 for inner sphere)
- Samples are generated in spherical shells (not solid spheres) for better separation
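One common way to place points on an n-dimensional spherical shell is to normalize a Gaussian vector and scale it to the desired radius. The sketch below uses that technique with the documented 0.5 ratio between inner and outer radius and the documented label convention. The infinitely thin shell and the function names are assumptions; the library may give its shells some thickness.

```python
import math
import random

def shell_points(n, dim, radius, rng):
    """Sample n points uniformly on the surface of a dim-dimensional sphere."""
    points = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        points.append([radius * c / norm for c in v])
    return points

def spheres_sketch(n_s, dim, radius, random_state=42):
    """Return rows of dim features plus a label: 0 = outer, 1 = inner."""
    rng = random.Random(random_state)
    outer = [p + [0] for p in shell_points(n_s, dim, radius, rng)]
    inner = [p + [1] for p in shell_points(n_s, dim, 0.5 * radius, rng)]
    return outer + inner

rows = spheres_sketch(n_s=100, dim=5, radius=10.0)
print(len(rows), len(rows[0]))
```

Because `n_s` is a per-class count, each dataset in this sketch holds `2 * n_s` rows.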
Generate synthetic n-dimensional spiral datasets for multi-class classification.
This module creates multiple configurations of high-dimensional spiral datasets
with varying sample sizes, noise levels, and dimensionality, useful for testing
machine learning algorithms on complex non-linearly separable patterns.
Generate multiple n-dimensional spiral datasets with varying parameters.
Creates a series of high-dimensional datasets where samples form intertwined
spiral patterns, providing challenging non-linearly separable multi-class
classification problems. Each configuration varies the number of samples,
classes, noise level, and dimensionality.
Parameters:
n_s (list of int, default=range(100, 300, 50)) – List of sample sizes to generate for each configuration.
n_c (list of int, default=[2]) – List of class counts (number of spiral arms).
n_n (list of float, default=[0.3, 0.6, 0.9]) – List of noise standard deviations to apply to the data.
n_d (list of int, default=[3, 6, 9, 12]) – List of dimensionalities (must be 3, 6, 9, or 12).
save_path (str, optional) – Directory path where datasets and configuration files will be saved.
random_state (int, default=42) – Random seed for reproducibility.
Returns:
Saves CSV files for each dataset configuration and a JSON file with
all configuration parameters.
Return type:
None
Notes
- Each dataset is saved as ‘spirals_data-{i}.csv’ where i is the configuration number
- Configuration parameters are saved in ‘dataset_config.json’
- The last column ‘class’ contains class labels
- Spiral patterns become increasingly complex in higher dimensions
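The restriction on `n_d` and the size of the default grid can be checked before generating anything. The helper below is a hypothetical pre-flight check, not part of the library; in particular, raising `ValueError` for unsupported dimensionalities is an assumption about how such input might be handled.

```python
from itertools import product

ALLOWED_DIMS = {3, 6, 9, 12}  # dimensionalities the documentation permits

def spiral_configs(n_s, n_c, n_n, n_d):
    """Validate n_d and enumerate every (samples, classes, noise, dim) combination."""
    bad = sorted(set(n_d) - ALLOWED_DIMS)
    if bad:
        raise ValueError(f"unsupported dimensionalities: {bad}")
    return [
        {"n_samples": s, "n_classes": c, "noise": z, "dim": d}
        for s, c, z, d in product(n_s, n_c, n_n, n_d)
    ]

# Documented defaults: 4 sample sizes x 1 class count x 3 noise levels x 4 dims.
configs = spiral_configs(range(100, 300, 50), [2], [0.3, 0.6, 0.9], [3, 6, 9, 12])
print(len(configs))
```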
Generate synthetic 3D Swiss roll datasets for manifold learning tasks.
This module creates multiple configurations of 3D Swiss roll datasets with
varying sample sizes, noise levels, and hole configurations, useful for testing
dimensionality reduction and manifold learning algorithms.
Generate multiple 3D Swiss roll datasets with varying parameters.
Creates a series of 3D datasets where samples lie on a Swiss roll manifold,
a classic benchmark for manifold learning and dimensionality reduction algorithms.
Each configuration varies the number of samples, noise level, and whether the
roll has a hole in the center.
Parameters:
n_samples (list of int, default=range(100, 300, 20)) – List of sample sizes to generate for each configuration.
noise (list of float, default=[0.1, 0.2, ..., 0.9]) – List of noise standard deviations to apply to the data.
hole (list of bool, default=[True, False]) – List of boolean values indicating whether to generate Swiss roll with hole.
save_path (str, optional) – Directory path where datasets and configuration files will be saved.
random_state (int, default=42) – Random seed for reproducibility.
Returns:
Saves CSV files for each dataset configuration and a JSON file with
all configuration parameters.
Return type:
None
Notes
- Each dataset is saved as ‘swiss_roll_data-{i}.csv’ where i is the configuration number
- Configuration parameters are saved in ‘dataset_config.json’
- The last column ‘class’ contains the position along the manifold (continuous values)
- Swiss roll is a standard benchmark for testing manifold learning algorithms
Examples
>>> from qbiocode.data_generation import generate_swiss_roll_datasets
>>> generate_swiss_roll_datasets(n_samples=[200], noise=[0.1], hole=[False], save_path='data')
Generating swiss roll dataset...
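For intuition about the Swiss roll manifold and its continuous target, here is a minimal sampler for the `hole=False` case. It follows the parameterization used by scikit-learn's `make_swiss_roll`; that this library uses the same construction is an assumption, `swiss_roll_sketch` is a hypothetical name, and the hole variant is deliberately left out.

```python
import math
import random

def swiss_roll_sketch(n_samples, noise, random_state=42):
    """Return (x, y, z, t) tuples; t is the position along the roll."""
    rng = random.Random(random_state)
    rows = []
    for _ in range(n_samples):
        t = 1.5 * math.pi * (1.0 + 2.0 * rng.random())  # t in [1.5*pi, 4.5*pi)
        x = t * math.cos(t)
        y = 21.0 * rng.random()                          # depth of the roll
        z = t * math.sin(t)
        rows.append((
            x + rng.gauss(0.0, noise),
            y + rng.gauss(0.0, noise),
            z + rng.gauss(0.0, noise),
            t,  # continuous target, stored last like the 'class' column
        ))
    return rows

points = swiss_roll_sketch(200, 0.1)
print(points[0])
```

In the noise-free case the distance of each point from the roll's axis equals `t`, which is why `t` serves as a natural ground truth for unrolling.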
This module provides functions to generate synthetic datasets for testing
machine learning algorithms. Each function creates multiple dataset configurations
with varying parameters, useful for benchmarking and evaluation.
Available dataset generators:
- generate_blobs_datasets: Isotropic Gaussian blobs (clusters)
- generate_circles_datasets: 2D concentric circles
- generate_moons_datasets: 2D interleaving half-circles
- generate_classification_datasets: High-dimensional multi-class data
- generate_s_curve_datasets: 3D S-shaped manifold
- generate_spheres_datasets: N-dimensional concentric spheres
- generate_spirals_datasets: N-dimensional intertwined spirals
- generate_swiss_roll_datasets: 3D Swiss roll manifold