Artificial Data Generation Tutorial#

This tutorial demonstrates how to use QBioCode’s data generation utilities to create synthetic datasets for machine learning experiments.

Overview#

QBioCode provides several built-in data generators for creating artificial datasets with controlled properties:

  • 2D Manifolds: circles, moons, spirals

  • 3D Manifolds: swiss_roll, s_curve, spheres

  • High-dimensional: classes (customizable features and complexity)

These datasets are useful for:

  • Testing machine learning algorithms

  • Benchmarking quantum vs classical models

  • Understanding how data complexity affects model performance

  • Creating reproducible experiments

Setup#

Import QBioCode and configure the environment.

[1]:
%load_ext autoreload
%autoreload 2

import sys, os, re

dir_home = re.sub( 'QBioCode.*', 'QBioCode', os.getcwd() )

sys.path.append( dir_home )

import qbiocode as qbc

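The `re.sub` call in the setup cell cuts the working directory off right after the first occurrence of `QBioCode`, so the notebook can be launched from any subfolder of the repository. A minimal illustration, where the helper name `repo_root` and the example paths are hypothetical:

```python
import re

def repo_root(cwd: str, repo_name: str = "QBioCode") -> str:
    """Truncate `cwd` right after the first occurrence of `repo_name`."""
    return re.sub(repo_name + ".*", repo_name, cwd)

# Hypothetical working directory somewhere inside the repository
print(repo_root("/home/user/QBioCode/docs/tutorials"))

# If the repository name does not occur, the path comes back unchanged
print(repo_root("/home/user/other_project"))
```

Note that a path without `QBioCode` in it is left as-is, so `sys.path` would then receive the current directory rather than the repository root.
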
Examples of how to generate artificial datasets#

1. Swiss Roll Dataset#

The Swiss Roll is a classic manifold dataset: a 2D sheet rolled up in 3D space, commonly used to test dimensionality reduction algorithms.

Parameters:

  • n_samples: Number of data points (100, 120, 140)

  • noise: Noise level added to the data (0.1, 0.2, 0.3)

  • hole: Whether to create a hole in the center (True/False)

This generates 3 × 3 × 2 = 18 different dataset variations.
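
The count above follows from taking every combination of the listed values. The sketch below enumerates that grid with the standard library only; it assumes `generate_data` expands its list arguments as a Cartesian product, which the stated total implies but the tutorial does not spell out.

```python
from itertools import product

N_SAMPLES = list(range(100, 160, 20))  # [100, 120, 140]; the stop value 160 is excluded
NOISE = [0.1, 0.2, 0.3]
HOLE = [True, False]

# Every (n_samples, noise, hole) combination, one per dataset variation
grid = list(product(N_SAMPLES, NOISE, HOLE))
print(len(grid))

# Show the first few combinations
for n_samples, noise, hole in grid[:3]:
    print(n_samples, noise, hole)
```

The same counting applies to the later sections: multiply the lengths of the parameter lists to get the number of variations.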

[ ]:
N_SAMPLES = list(range(100, 160, 20))
NOISE = [0.1, 0.2, 0.3]
HOLE = [True, False]

type_of_data = 'swiss_roll'

qbc.generate_data(
    type_of_data=type_of_data,
    save_path= os.path.join( 'data', type_of_data ),
    n_samples=N_SAMPLES,
    noise = NOISE,
    hole = HOLE,
    random_state=42
)
Generating swiss roll dataset...
Dataset generation complete.

2. Circles Dataset#

Concentric circles - a classic non-linearly separable 2D dataset.

Parameters:

  • n_samples: Number of data points per circle

  • noise: Gaussian noise standard deviation

Generates 3 × 3 = 9 dataset variations.
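
To illustrate why concentric circles count as non-linearly separable, the sketch below builds two noisy circles with the standard library and separates them with a single threshold on the distance from the origin. This is not QBioCode's generator: the radii, noise level, and threshold are illustrative assumptions.

```python
import math
import random

rng = random.Random(42)  # fixed seed for reproducibility

def sample_circle(radius, n_points, noise):
    """Sample n_points on a circle of the given radius, with Gaussian jitter."""
    points = []
    for _ in range(n_points):
        angle = rng.uniform(0, 2 * math.pi)
        x = radius * math.cos(angle) + rng.gauss(0, noise)
        y = radius * math.sin(angle) + rng.gauss(0, noise)
        points.append((x, y))
    return points

outer = sample_circle(radius=1.0, n_points=100, noise=0.05)  # class 0
inner = sample_circle(radius=0.5, n_points=100, noise=0.05)  # class 1

# A threshold on the distance from the origin separates the two classes,
# whereas a straight line in the (x, y) plane cannot.
threshold = 0.75
correct = sum(math.hypot(x, y) > threshold for x, y in outer)
correct += sum(math.hypot(x, y) <= threshold for x, y in inner)
accuracy = correct / (len(outer) + len(inner))
print(f"accuracy of the radial threshold: {accuracy:.2f}")
```

Raising the noise level makes the two rings overlap, which is how the `noise` parameter controls the difficulty of the task.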

[ ]:
N_SAMPLES = list(range(100, 160, 20))
NOISE = [0.1, 0.2, 0.3]

type_of_data = 'circles'

qbc.generate_data(
    type_of_data=type_of_data,
    save_path= os.path.join( 'data', type_of_data ),
    n_samples=N_SAMPLES,
    noise = NOISE,
    random_state=42
)
Generating circles dataset...
Dataset generation complete.

3. Moons Dataset#

Two interleaving half-circles - tests non-linear classification.

Parameters:

  • n_samples: Total number of points (split between two moons)

  • noise: Noise level

Generates 3 × 3 = 9 dataset variations.

[ ]:
N_SAMPLES = list(range(100, 160, 20))
NOISE = [0.1, 0.2, 0.3]

type_of_data = 'moons'

qbc.generate_data(
    type_of_data=type_of_data,
    save_path= os.path.join( 'data', type_of_data ),
    n_samples=N_SAMPLES,
    noise = NOISE,
    random_state=42
)
Generating moons dataset...
Dataset generation complete.

4. High-Dimensional Classification Data#

Generate complex, high-dimensional datasets with controlled properties.

Parameters:

  • n_samples: Number of samples

  • n_features: Total number of features (100-900 in steps of 100)

  • n_informative: Number of informative features (200, 600)

  • n_redundant: Number of redundant features (200, 600)

  • n_classes: Number of classes (only 2 in the cell below, because range(2, 4, 6) yields [2])

  • n_clusters_per_class: Clusters per class (only 1 in the cell below, because range(1, 2, 3) yields [1])

  • weights: Class balance ratios

This creates high-dimensional datasets suitable for testing feature selection and dimensionality reduction.
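
Not every combination of these values is necessarily meaningful. The sketch below assumes the `classes` generator is subject to the same rule as scikit-learn's `make_classification`, namely that informative plus redundant features cannot exceed the total feature count. Whether QBioCode skips, adjusts, or rejects the remaining combinations is not stated in this tutorial.

```python
from itertools import product

N_FEATURES = list(range(100, 1000, 100))     # 100, 200, ..., 900
N_INFORMATIVE = list(range(200, 800, 400))   # [200, 600]
N_REDUNDANT = list(range(200, 800, 400))     # [200, 600]

def is_feasible(n_features, n_informative, n_redundant):
    """Assumed rule: informative + redundant features must fit in the total."""
    return n_informative + n_redundant <= n_features

combos = list(product(N_FEATURES, N_INFORMATIVE, N_REDUNDANT))
feasible = [c for c in combos if is_feasible(*c)]

print(f"{len(feasible)} of {len(combos)} feature combinations satisfy the rule")
for n_features, n_informative, n_redundant in feasible[:3]:
    print(n_features, n_informative, n_redundant)
```

A check like this is a quick way to estimate how many datasets a parameter sweep can actually produce before running it.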

[ ]:

N_SAMPLES = list(range(100, 160, 50))          # [100, 150]
N_FEATURES = list(range(100, 1000, 100))       # 100, 200, ..., 900
N_INFORMATIVE = list(range(200, 800, 400))     # [200, 600]
N_REDUNDANT = list(range(200, 800, 400))       # [200, 600]
N_CLASSES = list(range(2, 4, 6))               # [2]
N_CLUSTERS_PER_CLASS = list(range(1, 2, 3))    # [1]
WEIGHTS = [[0.3, 0.7], [0.4, 0.6], [0.5, 0.5]] # one weight per class

type_of_data = 'classes'

qbc.generate_data(
    type_of_data=type_of_data,
    save_path= os.path.join( 'data', 'hd_data' ),
    n_samples=N_SAMPLES,
    n_features=N_FEATURES,
    n_informative=N_INFORMATIVE,
    n_redundant=N_REDUNDANT,
    n_classes=N_CLASSES,
    n_clusters_per_class=N_CLUSTERS_PER_CLASS,
    weights=WEIGHTS,
    random_state=42
)
Generating classes dataset...
Dataset generation complete.

5. Low-Dimensional Classification Data#

Similar to the high-dimensional data, but with fewer features (10, 30, 50).

These datasets are more suitable for:

  • Quantum machine learning (limited qubit counts)

  • Quick prototyping

  • Visualization

[ ]:

N_SAMPLES = list(range(100, 160, 50))          # [100, 150]
N_FEATURES = list(range(10, 60, 20))           # [10, 30, 50]
N_INFORMATIVE = list(range(2, 8, 4))           # [2, 6]
N_REDUNDANT = list(range(2, 8, 4))             # [2, 6]
N_CLASSES = list(range(2, 4, 6))               # [2]
N_CLUSTERS_PER_CLASS = list(range(1, 2, 3))    # [1]
WEIGHTS = [[0.3, 0.7], [0.4, 0.6], [0.5, 0.5]] # one weight per class

type_of_data = 'classes'

qbc.generate_data(
    type_of_data=type_of_data,
    save_path= os.path.join( 'data', 'ld_data' ),
    n_samples=N_SAMPLES,
    n_features=N_FEATURES,
    n_informative=N_INFORMATIVE,
    n_redundant=N_REDUNDANT,
    n_classes=N_CLASSES,
    n_clusters_per_class=N_CLUSTERS_PER_CLASS,
    weights=WEIGHTS,
    random_state=42
)
Generating classes dataset...
Dataset generation complete.

6. Spheres Dataset#

Concentric spheres in arbitrary dimensions.

Parameters:

  • n_samples: Points per sphere

  • n_classes: Number of concentric spheres

  • noise: Noise level

  • dim: Dimensionality (3D, 6D, etc.)

Tests algorithms on radially symmetric data.
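
For intuition about what "concentric spheres in arbitrary dimensions" means, here is a minimal standard-library sketch that normalizes Gaussian vectors to obtain uniformly distributed directions and then scales them to a radius. It is an illustration under assumed names (`sample_sphere`, the radii 1.0 and 2.0), not QBioCode's implementation.

```python
import math
import random

def sample_sphere(n_samples, dim, radius, noise=0.0, seed=42):
    """Sample points near a sphere of the given radius in `dim` dimensions."""
    rng = random.Random(seed)
    points = []
    for _ in range(n_samples):
        # A normalized Gaussian vector is uniformly distributed on the unit sphere
        direction = [rng.gauss(0, 1) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in direction))
        # Scale to the target radius and add independent Gaussian jitter
        points.append([radius * v / norm + rng.gauss(0, noise) for v in direction])
    return points

# Two classes: an inner and an outer sphere in 6 dimensions
inner = sample_sphere(100, dim=6, radius=1.0, noise=0.3)
outer = sample_sphere(100, dim=6, radius=2.0, noise=0.3, seed=43)
print(len(inner), len(outer), len(inner[0]))
```

With `noise=0.0` every point lies exactly on its sphere, so the classes differ only in their distance from the origin.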

[ ]:
N_SAMPLES = list(range(100, 160, 20))
N_CLASSES = [2]
NOISE = [0.3, 0.6]
DIM = [3, 6]

type_of_data = 'spheres'

qbc.generate_data(
    type_of_data=type_of_data,
    save_path= os.path.join( 'data', type_of_data ),
    n_samples=N_SAMPLES,
    noise = NOISE,
    n_classes=N_CLASSES,
    dim=DIM,
    random_state=42
)
Generating spheres dataset...
Dataset generation complete.

7. S-Curve Dataset#

An S-shaped 2D surface embedded in 3D space.

Parameters:

  • n_samples: Number of points

  • noise: Noise level

Similar to the Swiss Roll, but the sheet is bent into an S shape instead of being rolled up.

[ ]:
N_SAMPLES = list(range(100, 160, 20))
NOISE = [0.1, 0.2, 0.3]

type_of_data = 's_curve'

qbc.generate_data(
    type_of_data=type_of_data,
    save_path= os.path.join( 'data', type_of_data ),
    n_samples=N_SAMPLES,
    noise = NOISE,
    random_state=42
)
Generating S Curve dataset...
Dataset generation complete.

8. Spirals Dataset#

Multi-dimensional spiral patterns.

Parameters:

  • n_samples: Points per spiral

  • dim: Dimensionality (5, 10)

  • rad: Radius range (5, 10, 15)

  • noise: Noise level

Creates complex non-linear patterns in higher dimensions.

[ ]:
N_SAMPLES = list(range(100, 160, 20))
DIM = list(range(5, 15, 5))    # [5, 10]
RAD = list(range(5, 20, 5))    # [5, 10, 15]
NOISE = [0.1, 0.2, 0.3]        # defined here so the cell does not depend on the previous one

type_of_data = 'spirals'

qbc.generate_data(
    type_of_data=type_of_data,
    save_path= os.path.join( 'data', type_of_data ),
    n_samples=N_SAMPLES,
    noise = NOISE,
    dim=DIM,
    rad=RAD,
    random_state=42
)
Generating spirals dataset...
Dataset generation complete.

Summary#

This tutorial demonstrated how to generate various types of artificial datasets using QBioCode:

  1. 2D datasets (circles, moons) - for visualization and simple tests

  2. 3D manifolds (swiss_roll, s_curve, spheres) - for dimensionality reduction

  3. High-dimensional data (classes) - for realistic ML scenarios

  4. Spirals - for complex non-linear patterns

Next Steps#

  • Use these datasets with QProfiler to benchmark models

  • Experiment with different complexity parameters

  • Analyze how data properties affect model performance

See Also#