Artificial Data Generation Tutorial#

This tutorial demonstrates how to use QBioCode’s data generation utilities to create synthetic datasets for machine learning experiments.

Overview#

QBioCode provides several built-in data generators for creating artificial datasets with controlled properties:

2D Manifolds: circles, moons, spirals
3D Manifolds: swiss_roll, s_curve, spheres
High-dimensional: classes (customizable features and complexity)

These datasets are useful for:

Testing machine learning algorithms
Benchmarking quantum vs classical models
Understanding how data complexity affects model performance
Creating reproducible experiments

Setup#

Import QBioCode and configure the environment.

[1]:

%load_ext autoreload
%autoreload 2

import sys, os, re

dir_home = re.sub( 'QBioCode.*', 'QBioCode', os.getcwd() )

sys.path.append( dir_home )

import qbiocode as qbc

Examples of how to generate artificial datasets#

1. Swiss Roll Dataset#

The Swiss Roll is a classic dataset of points on a 2D sheet rolled up in 3D space, commonly used to test dimensionality reduction algorithms.

Parameters:

n_samples: Number of data points (100, 120, 140)
noise: Noise level added to the data (0.1, 0.2, 0.3)
hole: Whether to create a hole in the center (True/False)

This generates 3 × 3 × 2 = 18 different dataset variations.

[ ]:

N_SAMPLES = list(range(100, 160, 20))
NOISE = [0.1, 0.2, 0.3]
HOLE = [True, False]

type_of_data = 'swiss_roll'

qbc.generate_data(
    type_of_data=type_of_data,
    save_path= os.path.join( 'data', type_of_data ),
    n_samples=N_SAMPLES,
    noise = NOISE,
    hole = HOLE,
    random_state=42
)

Generating swiss roll dataset...
Dataset generation complete.
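
The count of 18 follows from taking every combination of the three parameter lists. The sketch below reproduces that arithmetic with `itertools.product`. It assumes `generate_data` expands list-valued arguments into a full Cartesian grid, which the variation counts in this tutorial suggest but do not confirm.

```python
from itertools import product

# Parameter lists copied from the Swiss Roll cell.
n_samples = list(range(100, 160, 20))   # [100, 120, 140]
noise = [0.1, 0.2, 0.3]
hole = [True, False]

# Assumption: one dataset is written per combination of values.
grid = list(product(n_samples, noise, hole))

print(len(grid))   # 18
print(grid[0])     # (100, 0.1, True)
```

The same product rule gives the 3 × 3 = 9 variations quoted for the circles and moons datasets.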

2. Circles Dataset#

Concentric circles - a classic non-linearly separable 2D dataset.

Parameters:

n_samples: Number of data points per circle
noise: Gaussian noise standard deviation

Generates 3 × 3 = 9 dataset variations.

[ ]:

N_SAMPLES = list(range(100, 160, 20))
NOISE = [0.1, 0.2, 0.3]

type_of_data = 'circles'

qbc.generate_data(
    type_of_data=type_of_data,
    save_path= os.path.join( 'data', type_of_data ),
    n_samples=N_SAMPLES,
    noise = NOISE,
    random_state=42
)

Generating circles dataset...
Dataset generation complete.

3. Moons Dataset#

Two interleaving half-circles - tests non-linear classification.

Parameters:

n_samples: Total number of points (split between two moons)
noise: Noise level

Generates 3 × 3 = 9 dataset variations.

[ ]:

N_SAMPLES = list(range(100, 160, 20))
NOISE = [0.1, 0.2, 0.3]

type_of_data = 'moons'

qbc.generate_data(
    type_of_data=type_of_data,
    save_path= os.path.join( 'data', type_of_data ),
    n_samples=N_SAMPLES,
    noise = NOISE,
    random_state=42
)

Generating moons dataset...
Dataset generation complete.

4. High-Dimensional Classification Data#

Generate complex, high-dimensional datasets with controlled properties.

Parameters:

n_samples: Number of samples
n_features: Total number of features (100-900 in steps of 100)
n_informative: Number of informative features (200, 600)
n_redundant: Number of redundant features (200, 600)
n_classes: Number of classes (2 in the cell below)
n_clusters_per_class: Clusters per class (1 in the cell below)
weights: Class balance ratios

This creates high-dimensional datasets suitable for testing feature selection and dimensionality reduction.

[ ]:

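N_SAMPLES = list(range(100, 160, 50))        # [100, 150]
N_FEATURES = list(range(100, 1000, 100))     # 100, 200, ..., 900
N_INFORMATIVE = list(range(200, 800, 400))   # [200, 600]
N_REDUNDANT = list(range(200, 800, 400))     # [200, 600]
# range(2, 4, 6) and range(1, 2, 3) each yield a single value; write
# [2, 4, 6] or [1, 2, 3] explicitly to sweep more settings.
N_CLASSES = list(range(2, 4, 6))             # [2]
N_CLUSTERS_PER_CLASS = list(range(1, 2, 3))  # [1]
# Each two-entry weight list describes a two-class split.
WEIGHTS = [[0.3, 0.7], [0.4, 0.6], [0.5, 0.5]]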

type_of_data = 'classes'

qbc.generate_data(
    type_of_data=type_of_data,
    save_path= os.path.join( 'data', 'hd_data' ),
    n_samples=N_SAMPLES,
    n_features=N_FEATURES,
    n_informative=N_INFORMATIVE,
    n_redundant=N_REDUNDANT,
    n_classes=N_CLASSES,
    n_clusters_per_class=N_CLUSTERS_PER_CLASS,
    weights=WEIGHTS,
    random_state=42
)

Generating classes dataset...
Dataset generation complete.
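
The parameter names in this cell match those of scikit-learn's `make_classification`, which raises an error whenever `n_informative + n_redundant + n_repeated` exceeds `n_features`. Whether QBioCode wraps that function, or skips infeasible combinations rather than failing, is an assumption here and not something this tutorial states. Under that assumption, the sketch below counts how many feature combinations from the grid above would be feasible; `is_feasible` is a hypothetical helper, not a QBioCode function.

```python
from itertools import product

# Feature-related lists copied from the high-dimensional cell.
n_features = list(range(100, 1000, 100))     # 100, 200, ..., 900
n_informative = list(range(200, 800, 400))   # [200, 600]
n_redundant = list(range(200, 800, 400))     # [200, 600]


def is_feasible(features, informative, redundant):
    """Hypothetical check: informative plus redundant features must fit in the total."""
    return informative + redundant <= features


all_combos = list(product(n_features, n_informative, n_redundant))
feasible = [combo for combo in all_combos if is_feasible(*combo)]

print(len(all_combos), len(feasible))   # 36 10
```

If only a fraction of the grid is feasible, the number of datasets written to `save_path` may be smaller than the product of the list lengths.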

5. Low-Dimensional Classification Data#

Similar to the high-dimensional data but with fewer features (10, 30, 50).

These datasets are more suitable for:

Quantum machine learning (limited qubit counts)
Quick prototyping
Visualization

[ ]:

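N_SAMPLES = list(range(100, 160, 50))        # [100, 150]
N_FEATURES = list(range(10, 60, 20))         # [10, 30, 50]
N_INFORMATIVE = list(range(2, 8, 4))         # [2, 6]
N_REDUNDANT = list(range(2, 8, 4))           # [2, 6]
# range(2, 4, 6) and range(1, 2, 3) each yield a single value; write
# [2, 4, 6] or [1, 2, 3] explicitly to sweep more settings.
N_CLASSES = list(range(2, 4, 6))             # [2]
N_CLUSTERS_PER_CLASS = list(range(1, 2, 3))  # [1]
# Each two-entry weight list describes a two-class split.
WEIGHTS = [[0.3, 0.7], [0.4, 0.6], [0.5, 0.5]]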

type_of_data = 'classes'

qbc.generate_data(
    type_of_data=type_of_data,
    save_path= os.path.join( 'data', 'ld_data' ),
    n_samples=N_SAMPLES,
    n_features=N_FEATURES,
    n_informative=N_INFORMATIVE,
    n_redundant=N_REDUNDANT,
    n_classes=N_CLASSES,
    n_clusters_per_class=N_CLUSTERS_PER_CLASS,
    weights=WEIGHTS,
    random_state=42
)

Generating classes dataset...
Dataset generation complete.
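
Python's `range(start, stop, step)` excludes `stop`, so several lists in these cells are shorter than their arguments suggest: `range(2, 4, 6)`, for example, yields only `[2]`. The helper below, `inclusive_range`, is a hypothetical convenience (not part of QBioCode) for building sweeps that include the upper bound. It assumes a positive step.

```python
def inclusive_range(start, stop, step):
    """Return start, start + step, ... including stop when a step lands on it.

    Assumes step is a positive integer.
    """
    return list(range(start, stop + 1, step))


print(list(range(2, 4, 6)))        # [2]
print(list(range(10, 60, 20)))     # [10, 30, 50]
print(inclusive_range(2, 6, 2))    # [2, 4, 6]
print(inclusive_range(1, 3, 1))    # [1, 2, 3]
```

Writing short sweeps as explicit lists, such as `[2, 4, 6]`, avoids the ambiguity altogether.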

6. Spheres Dataset#

Concentric spheres in arbitrary dimensions.

Parameters:

n_samples: Points per sphere
n_classes: Number of concentric spheres
noise: Noise level
dim: Dimensionality (3D, 6D, etc.)

Tests algorithms on radially symmetric data.

[ ]:

N_SAMPLES = list(range(100, 160, 20))
N_CLASSES = [2]
NOISE = [0.3, 0.6]
DIM = [3, 6]

type_of_data = 'spheres'

qbc.generate_data(
    type_of_data=type_of_data,
    save_path= os.path.join( 'data', type_of_data ),
    n_samples=N_SAMPLES,
    noise = NOISE,
    n_classes=N_CLASSES,
    dim=DIM,
    random_state=42
)

Generating spheres dataset...
Dataset generation complete.
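
To make the structure of this dataset concrete, here is a minimal standard-library sketch of sampling points on concentric spheres in `dim` dimensions. It is not QBioCode's implementation: the function name `sample_spheres`, the choice of radii 1, 2, ..., and the way noise is added are all assumptions made for illustration.

```python
import math
import random


def sample_spheres(n_samples, n_classes, noise, dim, seed=42):
    """Illustrative only: n_samples points per sphere, class k at radius k + 1."""
    rng = random.Random(seed)
    points, labels = [], []
    for label in range(n_classes):
        radius = label + 1.0
        for _ in range(n_samples):
            # A normalised Gaussian vector is uniformly distributed on the unit sphere.
            direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            norm = math.sqrt(sum(x * x for x in direction))
            # Scale to the class radius, then perturb each coordinate with Gaussian noise.
            point = [radius * x / norm + rng.gauss(0.0, noise) for x in direction]
            points.append(point)
            labels.append(label)
    return points, labels


X, y = sample_spheres(n_samples=100, n_classes=2, noise=0.0, dim=3)
print(len(X), len(X[0]), sorted(set(y)))   # 200 3 [0, 1]
```

With `noise=0.0` every point lies exactly on its sphere, so the classes are separable by distance from the origin alone; raising `noise` blurs that boundary.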

7. S-Curve Dataset#

A 2D S-shaped manifold embedded in 3D space.

Parameters:

n_samples: Number of points
noise: Noise level

Similar to the Swiss Roll, but the sheet bends into an S shape instead of a spiral.

[ ]:

N_SAMPLES = list(range(100, 160, 20))
NOISE = [0.1, 0.2, 0.3]

type_of_data = 's_curve'

qbc.generate_data(
    type_of_data=type_of_data,
    save_path= os.path.join( 'data', type_of_data ),
    n_samples=N_SAMPLES,
    noise = NOISE,
    random_state=42
)

Generating S Curve dataset...
Dataset generation complete.

8. Spirals Dataset#

Multi-dimensional spiral patterns.

Parameters:

n_samples: Points per spiral
dim: Dimensionality (5, 10)
rad: Radius (5, 10, 15)
noise: Noise level

Creates complex non-linear patterns in higher dimensions.

[ ]:

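N_SAMPLES = list(range(100, 160, 20))
DIM = list(range(5, 15, 5))    # [5, 10]
RAD = list(range(5, 20, 5))    # [5, 10, 15]
NOISE = [0.1, 0.2, 0.3]        # defined here so the cell does not depend on an earlier one

type_of_data = 'spirals'

qbc.generate_data(
    type_of_data=type_of_data,
    save_path= os.path.join( 'data', type_of_data ),
    n_samples=N_SAMPLES,
    noise = NOISE,
    dim=DIM,    # keyword names follow the parameter list above
    rad=RAD,
    random_state=42
)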

Generating spirals dataset...
Dataset generation complete.

Summary#

This tutorial demonstrated how to generate various types of artificial datasets using QBioCode:

2D datasets (circles, moons) - for visualization and simple tests
3D manifolds (swiss_roll, s_curve, spheres) - for dimensionality reduction
High-dimensional data (classes) - for realistic ML scenarios
Spirals - for complex non-linear patterns

Next Steps#

Use these datasets with QProfiler to benchmark models
Experiment with different complexity parameters
Analyze how data properties affect model performance
