Configure the YAML file

To run multiple experiments reproducibly, using the terminal xxx is suggested. This requires a YAML configuration file. An example config file can be found in config.yaml.

The major components of the file are explained below.

Define the inputs

Input files can be placed in a single folder (e.g. test_data). To analyze all files in the folder, set:

config_file_name: 'basic_config'

# specify the directory where the input datasets are located
folder_path: 'test_data/'
file_dataset: 'ALL'

or use a list, as below, to select only specific files:

file_dataset: ['file1', 'file2', 'file3']
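The two forms of file_dataset can be illustrated with a minimal Python sketch. The helper name resolve_input_files and the rule of matching list entries against file names with or without their extension are assumptions made for illustration, not the tool's actual behaviour:

```python
from pathlib import Path


def resolve_input_files(folder_path, file_dataset):
    """Return the sorted list of input files selected by the config.

    Hypothetical helper: 'ALL' selects every file in the folder, while a
    list selects only the files whose name (with or without extension)
    appears in it. The matching rule is an assumption.
    """
    files = sorted(p for p in Path(folder_path).iterdir() if p.is_file())
    if file_dataset == "ALL":
        return files
    wanted = set(file_dataset)
    return [p for p in files if p.name in wanted or p.stem in wanted]
```

For example, resolve_input_files('test_data/', 'ALL') would return every file in test_data, while passing ['file1', 'file2', 'file3'] would keep only the files with those names.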

To ensure deterministic and reproducible outcomes, the seeds should be fixed as:

seed: 42
q_seed: 42

q_seed refers to the seed used by the quantum algorithms.

Quantum backend

Define which backend to use for the quantum computation. For instance, to use a simulator:

# choose a backend for the QML methods
backend: 'simulator'

If you have access to an IBM quantum device, you can specify its name. For instance, to use the least busy device:

# choose a backend for the QML methods
backend: 'ibm_least'

as well as the number of shots:

shots: 1024

The level of error mitigation for the QML methods (ranging from 1 to 3) can be set as:

resil_level: 1

IBM runtime credentials should be stored on your device. If they are stored in a JSON file, its path can be specified, for instance, as:

qiskit_json_path: '~/.qiskit/qiskit-ibm.json'

If multiple accounts are stored in your JSON file, the account name alias (used when saving your runtime credentials) can be specified as:

name: account_qbc

Optionally, you can also specify the instance under your account with which to run your job. If it is not included and you have multiple instances, an available instance under your account is searched for.

ibm_instance: instance_name
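Before submitting a job, it can help to sanity-check the backend settings described above. The following is a minimal sketch that operates on an already parsed config dictionary. The function check_backend_settings is hypothetical, and treating every key as optional is an assumption:

```python
import os


def check_backend_settings(config):
    """Return a list of problems found in the backend-related keys.

    Hypothetical helper, not part of the tool. Keys that are absent
    are skipped (assumption: all of them are optional).
    """
    problems = []
    if "backend" in config and not isinstance(config["backend"], str):
        problems.append("backend must be a string such as 'simulator'")
    if "shots" in config:
        shots = config["shots"]
        if not isinstance(shots, int) or shots <= 0:
            problems.append("shots must be a positive integer")
    if "resil_level" in config and config["resil_level"] not in (1, 2, 3):
        problems.append("resil_level must be 1, 2 or 3")
    if "qiskit_json_path" in config:
        # Expand '~' so that paths like '~/.qiskit/qiskit-ibm.json' resolve.
        path = os.path.expanduser(config["qiskit_json_path"])
        if not os.path.isfile(path):
            problems.append("qiskit_json_path does not point to an existing file")
    return problems
```

An empty list means that no problem was detected. The check does not contact IBM services, so it cannot confirm that a device name, account alias or instance actually exists.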

Embedding

To specify the embedding methods used to reduce the dimensionality (number of features) of your data set, provide a list such as:

embeddings: ['pca', 'nmf']

while the number of dimensions (features) to which your data is reduced needs to be set as:

n_components: 3

If no embedding needs to be applied, specify:

embeddings: ['none']

Train/Test parameters

The train:test ratio into which the data is split is set via the test_size option. You can also specify whether the data needs to be scaled and stratified. For instance, for a 70:30 train:test ratio with stratification and scaling:

test_size: 0.3
stratify: ['y']
scaling: ['True']
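To make the effect of test_size and stratification concrete, here is a minimal pure-Python sketch of a stratified split. It is an illustration only: the function stratified_split is hypothetical, and the tool itself may rely on a library routine instead.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.3, seed=42):
    """Split sample indices into train/test sets, class by class.

    Each class contributes roughly test_size of its samples to the test
    set, so the class proportions are preserved in both sets.
    """
    rng = random.Random(seed)  # fixed seed gives a reproducible split
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# 20 samples, two balanced classes
labels = [0] * 10 + [1] * 10
train_idx, test_idx = stratified_split(labels, test_size=0.3, seed=42)
```

With 10 samples per class and test_size 0.3, each class sends 3 samples to the test set, giving 14 training and 6 test samples in total.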

Model selection

ML models are selected as a list from the following options: svc (Support Vector Classifier), dt (Decision Tree), lr (Logistic Regression), nb (Naive Bayes), rf (Random Forest), mlp (Multi-layer Perceptron), qsvc (Quantum SVC), vqc (Variational Quantum Classifier), qnn (Quantum Neural Network), pqk (Projection Quantum Kernel). For instance, to run all models, set:

model: ['svc', 'dt', 'lr', 'nb', 'rf', 'mlp', 'qsvc', 'vqc', 'qnn', 'pqk']

The parameters to use for each ML method can also be set. Each classical ML (CML) method has a grid search set of arguments and a standard set of arguments. The standard set usually has the same keys as the grid search set, but with single values rather than lists of values to iterate over.

For instance, for svc:

svc_args: {'C': 0.01, 'gamma': 0.1, 'kernel': 'linear'}

gridsearch_svc_args: {'C': [0.1, 1, 10, 100],
                      'gamma': [0.001, 0.01, 0.1, 1],
                      'kernel': ['linear', 'rbf', 'poly', 'sigmoid']
                     }
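The relation between the two argument sets can be sketched in Python: a grid search iterates over every combination of the listed values, and each combination has the same shape as svc_args. The helper expand_grid is hypothetical and only illustrates the idea; it is not how the tool necessarily performs the search.

```python
from itertools import product

gridsearch_svc_args = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1],
    "kernel": ["linear", "rbf", "poly", "sigmoid"],
}


def expand_grid(grid):
    """Return one dictionary of single values per parameter combination."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


combinations = expand_grid(gridsearch_svc_args)
print(len(combinations))  # 4 * 4 * 4 = 64 candidate settings
print(combinations[0])    # {'C': 0.1, 'gamma': 0.001, 'kernel': 'linear'}
```

Because the number of combinations is the product of the list lengths, adding values to any one parameter multiplies the number of candidate settings to evaluate.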

Descriptions of the parameter names and value ranges can be found in the corresponding scikit-learn documentation page for each CML method.

Note

For QML grid search, one must use generate_experiments.ipynb in tutorial_notebooks/analyses/qml_experiment_generators/, which will generate individual config.yaml files for each combination of parameters.