Artificial Data Generation Tutorial#
This tutorial demonstrates how to use QBioCode’s data generation utilities to create synthetic datasets for machine learning experiments.
Overview#
QBioCode provides several built-in data generators for creating artificial datasets with controlled properties:
2D Manifolds: circles, moons, spirals
3D Manifolds: swiss_roll, s_curve, spheres
High-dimensional: classes (customizable features and complexity)
These datasets are useful for:
Testing machine learning algorithms
Benchmarking quantum vs classical models
Understanding how data complexity affects model performance
Creating reproducible experiments
Setup#
Import QBioCode and configure the environment.
[1]:
%load_ext autoreload
%autoreload 2
import sys, os, re
dir_home = re.sub( 'QBioCode.*', 'QBioCode', os.getcwd() )
sys.path.append( dir_home )
import qbiocode as qbc
Examples of how to generate artificial datasets#
1. Swiss Roll Dataset#
The Swiss Roll is a classic manifold dataset: a 2D sheet rolled up in 3D space, commonly used to test dimensionality reduction algorithms.
Parameters:
n_samples: Number of data points (100, 120, 140)
noise: Noise level added to the data (0.1, 0.2, 0.3)
hole: Whether to create a hole in the center (True/False)
This generates 3 × 3 × 2 = 18 different dataset variations.
[ ]:
N_SAMPLES = list(range(100, 160, 20))
NOISE = [0.1, 0.2, 0.3]
HOLE = [True, False]
type_of_data = 'swiss_roll'
qbc.generate_data(
    type_of_data=type_of_data,
    save_path=os.path.join('data', type_of_data),
    n_samples=N_SAMPLES,
    noise=NOISE,
    hole=HOLE,
    random_state=42
)
Generating swiss roll dataset...
Dataset generation complete.
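The count of 18 variations follows from taking every combination of the three parameter lists. The sketch below reproduces that arithmetic with the standard library only; it does not call QBioCode, and it assumes (as the stated count suggests) that one dataset is produced per combination.

```python
from itertools import product

# Same parameter lists as in the Swiss Roll cell.
N_SAMPLES = list(range(100, 160, 20))  # range excludes the stop value 160
NOISE = [0.1, 0.2, 0.3]
HOLE = [True, False]

# Assumption: one dataset per (n_samples, noise, hole) combination.
grid = list(product(N_SAMPLES, NOISE, HOLE))

print(N_SAMPLES)   # [100, 120, 140]
print(len(grid))   # 18
print(grid[0])     # (100, 0.1, True)
```

The same calculation gives 3 × 3 = 9 for generators that take only n_samples and noise.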
2. Circles Dataset#
Concentric circles - a classic non-linearly separable 2D dataset.
Parameters:
n_samples: Number of data points per circle
noise: Gaussian noise standard deviation
Generates 3 × 3 = 9 dataset variations.
[ ]:
N_SAMPLES = list(range(100, 160, 20))
NOISE = [0.1, 0.2, 0.3]
type_of_data = 'circles'
qbc.generate_data(
    type_of_data=type_of_data,
    save_path=os.path.join('data', type_of_data),
    n_samples=N_SAMPLES,
    noise=NOISE,
    random_state=42
)
Generating circles dataset...
Dataset generation complete.
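To make the idea of concentric circles with Gaussian noise concrete, here is a minimal standard-library sketch. It is an illustration only and not QBioCode's implementation: the function name make_circles_sketch, the radii, and the seed handling are all hypothetical choices.

```python
import math
import random

def make_circles_sketch(n_per_circle, noise, radii=(0.5, 1.0), seed=42):
    """Return (points, labels) for concentric circles with Gaussian jitter.

    Hypothetical helper for illustration; not part of QBioCode.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for label, radius in enumerate(radii):
        for i in range(n_per_circle):
            # Spread points evenly around the circle, then add noise per axis.
            angle = 2 * math.pi * i / n_per_circle
            x = radius * math.cos(angle) + rng.gauss(0, noise)
            y = radius * math.sin(angle) + rng.gauss(0, noise)
            points.append((x, y))
            labels.append(label)
    return points, labels

points, labels = make_circles_sketch(n_per_circle=100, noise=0.1)
print(len(points), sorted(set(labels)))  # 200 [0, 1]
```

No straight line separates the two classes, but the distance from the origin does, which is why this dataset is a common test for non-linear classifiers.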
3. Moons Dataset#
Two interleaving half-circles - tests non-linear classification.
Parameters:
n_samples: Total number of points (split between two moons)
noise: Noise level
Generates 3 × 3 = 9 dataset variations.
[ ]:
N_SAMPLES = list(range(100, 160, 20))
NOISE = [0.1, 0.2, 0.3]
type_of_data = 'moons'
qbc.generate_data(
    type_of_data=type_of_data,
    save_path=os.path.join('data', type_of_data),
    n_samples=N_SAMPLES,
    noise=NOISE,
    random_state=42
)
Generating moons dataset...
Dataset generation complete.
4. High-Dimensional Classification Data#
Generate complex, high-dimensional datasets with controlled properties.
Parameters:
n_samples: Number of samples (100, 150)
n_features: Total number of features (100 to 900 in steps of 100)
n_informative: Number of informative features (200, 600)
n_redundant: Number of redundant features (200, 600)
n_classes: Number of classes (2 in this example)
n_clusters_per_class: Clusters per class (1 in this example)
weights: Class balance ratios
This creates high-dimensional datasets suitable for testing feature selection and dimensionality reduction.
[ ]:
N_SAMPLES = list(range(100, 160, 50))      # [100, 150]
N_FEATURES = list(range(100, 1000, 100))   # 100, 200, ..., 900
N_INFORMATIVE = list(range(200, 800, 400)) # [200, 600]
N_REDUNDANT = list(range(200, 800, 400))   # [200, 600]
N_CLASSES = [2]             # two classes, matching the two-element WEIGHTS entries
N_CLUSTERS_PER_CLASS = [1]
WEIGHTS = [[0.3, 0.7], [0.4, 0.6], [0.5, 0.5]]
type_of_data = 'classes'
qbc.generate_data(
    type_of_data=type_of_data,
    save_path=os.path.join('data', 'hd_data'),
    n_samples=N_SAMPLES,
    n_features=N_FEATURES,
    n_informative=N_INFORMATIVE,
    n_redundant=N_REDUNDANT,
    n_classes=N_CLASSES,
    n_clusters_per_class=N_CLUSTERS_PER_CLASS,
    weights=WEIGHTS,
    random_state=42
)
Generating classes dataset...
Dataset generation complete.
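The lists in this cell are built with Python's range(start, stop, step), which excludes stop and treats the third argument as a step size rather than as a third value. A quick standard-library check of what such calls evaluate to (the variable names here are illustrative only):

```python
# range(start, stop, step): stop is excluded, step is the increment.
n_samples = list(range(100, 160, 50))
n_features = list(range(100, 1000, 100))
n_informative = list(range(200, 800, 400))
single_class_count = list(range(2, 4, 6))   # step 6 overshoots stop at once
single_cluster_count = list(range(1, 2, 3))

print(n_samples)             # [100, 150]
print(n_features)            # [100, 200, 300, 400, 500, 600, 700, 800, 900]
print(n_informative)         # [200, 600]
print(single_class_count)    # [2]
print(single_cluster_count)  # [1]
```

To sweep an explicit set of values such as 2, 4, and 6, a literal list like [2, 4, 6] is the unambiguous choice. Note that the two-element weight lists in this cell only describe a two-class problem.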
5. Low-Dimensional Classification Data#
Similar to the high-dimensional data, but with far fewer features (10, 30, or 50 in this example).
These datasets are more suitable for:
Quantum machine learning (limited qubit counts)
Quick prototyping
Visualization
[ ]:
N_SAMPLES = list(range(100, 160, 50))  # [100, 150]
N_FEATURES = list(range(10, 60, 20))   # [10, 30, 50]
N_INFORMATIVE = list(range(2, 8, 4))   # [2, 6]
N_REDUNDANT = list(range(2, 8, 4))     # [2, 6]
N_CLASSES = [2]             # two classes, matching the two-element WEIGHTS entries
N_CLUSTERS_PER_CLASS = [1]
WEIGHTS = [[0.3, 0.7], [0.4, 0.6], [0.5, 0.5]]
type_of_data = 'classes'
qbc.generate_data(
    type_of_data=type_of_data,
    save_path=os.path.join('data', 'ld_data'),
    n_samples=N_SAMPLES,
    n_features=N_FEATURES,
    n_informative=N_INFORMATIVE,
    n_redundant=N_REDUNDANT,
    n_classes=N_CLASSES,
    n_clusters_per_class=N_CLUSTERS_PER_CLASS,
    weights=WEIGHTS,
    random_state=42
)
Generating classes dataset...
Dataset generation complete.
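When feature counts are swept in combination, some combinations may ask for more informative and redundant features than the dataset has in total. This tutorial does not say how QBioCode treats such cases. The sketch below assumes the same rule that scikit-learn's make_classification enforces (informative plus redundant features must not exceed the total) and simply counts which combinations would pass; the helper is_valid is hypothetical.

```python
from itertools import product

# Values produced by the range() calls in the low-dimensional cell.
N_FEATURES = [10, 30, 50]
N_INFORMATIVE = [2, 6]
N_REDUNDANT = [2, 6]

def is_valid(n_features, n_informative, n_redundant):
    # Assumed rule (borrowed from scikit-learn's make_classification):
    # informative + redundant features must fit within the total.
    return n_informative + n_redundant <= n_features

combos = list(product(N_FEATURES, N_INFORMATIVE, N_REDUNDANT))
valid = [c for c in combos if is_valid(*c)]
invalid = [c for c in combos if not is_valid(*c)]

print(len(combos), len(valid))  # 12 11
print(invalid)                  # [(10, 6, 6)]
```

Running a check like this before a large sweep shows how many datasets to expect and which settings to drop.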
6. Spheres Dataset#
Concentric spheres in arbitrary dimensions.
Parameters:
n_samples: Points per sphere
n_classes: Number of concentric spheres
noise: Noise level
dim: Dimensionality (3D, 6D, etc.)
Tests algorithms on radially symmetric data.
[ ]:
N_SAMPLES = list(range(100, 160, 20))
N_CLASSES = [2]
NOISE = [0.3, 0.6]
DIM = [3, 6]
type_of_data = 'spheres'
qbc.generate_data(
    type_of_data=type_of_data,
    save_path=os.path.join('data', type_of_data),
    n_samples=N_SAMPLES,
    noise=NOISE,
    n_classes=N_CLASSES,
    dim=DIM,
    random_state=42
)
Generating spheres dataset...
Dataset generation complete.
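One standard way to place points on a sphere in any dimension is to draw a Gaussian vector and rescale it to the desired radius. The sketch below uses that technique to illustrate radially symmetric classes; it is not QBioCode's implementation, and the names sample_on_sphere and radius_of are hypothetical.

```python
import math
import random

def radius_of(point):
    """Euclidean distance of a point from the origin."""
    return math.sqrt(sum(x * x for x in point))

def sample_on_sphere(n_points, dim, radius, noise, rng):
    """Sample points near a sphere of the given radius in `dim` dimensions.

    Hypothetical helper for illustration; not part of QBioCode.
    """
    points = []
    for _ in range(n_points):
        # A normalized Gaussian vector is uniformly distributed on the unit sphere.
        v = [rng.gauss(0, 1) for _ in range(dim)]
        norm = radius_of(v)
        points.append([radius * x / norm + rng.gauss(0, noise) for x in v])
    return points

rng = random.Random(42)
inner = sample_on_sphere(100, dim=6, radius=1.0, noise=0.0, rng=rng)  # class 0
outer = sample_on_sphere(100, dim=6, radius=2.0, noise=0.0, rng=rng)  # class 1

print(len(inner), len(inner[0]))  # 100 6
```

With noise set to 0 the two classes are separated exactly by their radius; increasing the noise blurs the shells into each other.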
7. S-Curve Dataset#
An S-shaped 2D surface embedded in 3D space.
Parameters:
n_samples: Number of points
noise: Noise level
Similar to the Swiss Roll, but the sheet bends into an S shape instead of rolling up into a spiral.
[ ]:
N_SAMPLES = list(range(100, 160, 20))
NOISE = [0.1, 0.2, 0.3]
type_of_data = 's_curve'
qbc.generate_data(
    type_of_data=type_of_data,
    save_path=os.path.join('data', type_of_data),
    n_samples=N_SAMPLES,
    noise=NOISE,
    random_state=42
)
Generating S Curve dataset...
Dataset generation complete.
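An S-curve can be described by two latent coordinates mapped into three dimensions. The sketch below uses a common parametrization, similar to the one in scikit-learn's make_s_curve; whether QBioCode uses the same formula is an assumption, and s_curve_point is a hypothetical name.

```python
import math
import random

def s_curve_point(u, v):
    """Map latent coordinates (u, v) in [0, 1]^2 onto an S-shaped surface in 3D.

    Hypothetical helper for illustration; not part of QBioCode.
    """
    t = 3 * math.pi * (u - 0.5)        # position along the S
    x = math.sin(t)
    y = 2.0 * v                        # position across the width of the sheet
    z = math.copysign(1.0, t) * (math.cos(t) - 1)
    return (x, y, z)

rng = random.Random(42)
points = [s_curve_point(rng.random(), rng.random()) for _ in range(100)]

print(len(points), len(points[0]))  # 100 3
```

Each point has three coordinates but is fully determined by the two values (u, v), which is what dimensionality reduction methods try to recover.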
8. Spirals Dataset#
Multi-dimensional spiral patterns.
Parameters:
n_samples: Points per spiral
dim: Dimensionality (5, 10)
rad: Radius values (5, 10, 15)
noise: Noise level
Creates complex non-linear patterns in higher dimensions.
[ ]:
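N_SAMPLES = list(range(100, 160, 20))
NOISE = [0.1, 0.2, 0.3]      # defined here so the cell does not rely on an earlier one
DIM = list(range(5, 15, 5))  # [5, 10]
RAD = list(range(5, 20, 5))  # [5, 10, 15]
type_of_data = 'spirals'
qbc.generate_data(
    type_of_data=type_of_data,
    save_path=os.path.join('data', type_of_data),
    n_samples=N_SAMPLES,
    noise=NOISE,
    dim=DIM,  # keyword names assumed from the parameter list above
    rad=RAD,
    random_state=42
)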
Generating spirals dataset...
Dataset generation complete.
Summary#
This tutorial demonstrated how to generate various types of artificial datasets using QBioCode:
2D datasets (circles, moons) - for visualization and simple tests
3D manifolds (swiss_roll, s_curve, spheres) - for dimensionality reduction
High-dimensional data (classes) - for realistic ML scenarios
Spirals - for complex non-linear patterns
Next Steps#
Use these datasets with QProfiler to benchmark models
Experiment with different complexity parameters
Analyze how data properties affect model performance
See Also#
QProfiler Tutorial - Benchmark models on generated data
API Documentation - Full parameter reference