qbiocode.data_generation.generator module

Main data generation interface for QBioCode.

This module provides a unified interface to generate various types of synthetic datasets for machine learning benchmarking and evaluation.

Summary

Functions:

generate_data

Generate synthetic datasets for machine learning benchmarking.

Reference

generate_data(type_of_data=None, save_path=None, n_samples=[100, 120, 140, 160, 180, 200, 220, 240, 260, 280], noise=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], hole=[True, False], n_classes=[2], dim=[3, 6, 9, 12], rad=[3, 6, 9, 12], n_features=[10, 30, 50], n_informative=[2, 6], n_redundant=[2, 6], n_clusters_per_class=[1], weights=[[0.3, 0.7], [0.4, 0.6], [0.5, 0.5]], random_state=42)

Generate synthetic datasets for machine learning benchmarking.

Unified interface to generate various types of synthetic datasets with configurable parameters. Each dataset type creates multiple configurations by varying the specified parameters.
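To make "multiple configurations" concrete, the following is a minimal sketch that assumes each configuration is one element of the Cartesian product of the parameter lists relevant to a dataset type. The helper enumerate_configs is hypothetical and not part of QBioCode, and the pairing of n_samples with noise is chosen only for illustration; the actual generator may combine parameters differently.

```python
from itertools import product


def enumerate_configs(**param_lists):
    """Hypothetical helper: yield one dict per combination of parameter values."""
    names = list(param_lists)
    for values in product(*(param_lists[name] for name in names)):
        yield dict(zip(names, values))


# Default n_samples and noise lists, copied from the generate_data signature.
n_samples = [100, 120, 140, 160, 180, 200, 220, 240, 260, 280]
noise = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# Assumption: every n_samples value is paired with every noise value.
configs = list(enumerate_configs(n_samples=n_samples, noise=noise))
print(len(configs))  # 10 sample sizes x 9 noise levels = 90
print(configs[0])    # {'n_samples': 100, 'noise': 0.1}
```

Under this assumption the default lists grow multiplicatively, so passing shorter lists is the simplest way to keep the number of generated datasets small.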

Parameters:
  • type_of_data (str) – Type of dataset to generate. Options: 'circles', 'moons', 'classes', 's_curve', 'spheres', 'spirals', 'swiss_roll'.

  • save_path (str) – Directory path where datasets will be saved.

  • n_samples (list of int, default=range(100, 300, 20)) – Sample sizes for dataset configurations.

  • noise (list of float, default=[0.1, 0.2, ..., 0.9]) – Noise levels to apply.

  • hole (list of bool, default=[True, False]) – Whether to include a hole (for swiss_roll only).

  • n_classes (list of int, default=[2]) – Number of classes (for spirals and classes).

  • dim (list of int, default=[3, 6, 9, 12]) – Dimensionalities (for spheres and spirals).

  • rad (list of float, default=[3, 6, 9, 12]) – Radii (for spheres only).

  • n_features (list of int, default=range(10, 60, 20)) – Feature counts (for classes only).

  • n_informative (list of int, default=range(2, 8, 4)) – Informative feature counts (for classes only).

  • n_redundant (list of int, default=range(2, 8, 4)) – Redundant feature counts (for classes only).

  • n_clusters_per_class (list of int, default=range(1, 2, 3)) – Clusters per class (for classes only).

  • weights (list of list of float, default=[[0.3, 0.7], [0.4, 0.6], [0.5, 0.5]]) – Class weight distributions (for classes only).

  • random_state (int, default=42) – Random seed for reproducibility.
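
Several defaults above are written in range(...) shorthand, while the signature shows explicit lists. The short check below expands each shorthand so the two can be compared; it uses only built-in Python and does not call QBioCode.

```python
# Expand the range(...) shorthand used in the parameter descriptions.
defaults = {
    "n_samples": list(range(100, 300, 20)),
    "n_features": list(range(10, 60, 20)),
    "n_informative": list(range(2, 8, 4)),
    "n_redundant": list(range(2, 8, 4)),
    "n_clusters_per_class": list(range(1, 2, 3)),
}

for name, values in defaults.items():
    print(f"{name}: {values}")

# n_samples: [100, 120, 140, 160, 180, 200, 220, 240, 260, 280]
# n_features: [10, 30, 50]
# n_informative: [2, 6]
# n_redundant: [2, 6]
# n_clusters_per_class: [1]
```

Note that range(1, 2, 3) yields the single value 1, which matches n_clusters_per_class=[1] in the signature.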

Returns:

Saves generated datasets to the specified path.

Return type:

None

Raises:

ValueError – If type_of_data is not one of the supported types.
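
The documented error can be anticipated before calling the generator. The sketch below is a hypothetical pre-check, not QBioCode's own validation code: check_type_of_data and SUPPORTED_TYPES are illustrative names, and the tuple simply repeats the options listed for type_of_data.

```python
# Options listed for type_of_data in the parameter description.
SUPPORTED_TYPES = (
    "circles", "moons", "classes", "s_curve",
    "spheres", "spirals", "swiss_roll",
)


def check_type_of_data(type_of_data):
    """Hypothetical check mirroring the documented ValueError."""
    if type_of_data not in SUPPORTED_TYPES:
        raise ValueError(
            f"Unsupported type_of_data {type_of_data!r}; "
            f"expected one of {SUPPORTED_TYPES}"
        )
    return type_of_data


check_type_of_data("moons")  # passes and returns "moons"

try:
    check_type_of_data("blobs")  # not in the supported list
except ValueError as exc:
    print(exc)
```

In this sketch the signature default None is also rejected, since it is not one of the listed options; whether generate_data treats None the same way is an assumption.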

Examples

>>> from qbiocode.data_generation import generate_data
>>> generate_data(type_of_data='circles', save_path='data/circles')
Generating circles dataset...
Dataset generation complete.