Working with AutoAI class and optimizer

The AutoAI experiment class is responsible for creating experiments and scheduling training. All experiment results are stored automatically in the user-specified Cloud Object Storage (COS). The AutoAI client can then fetch those results and return them directly to you for further use.

Configure optimizer with one data source

To initialize an AutoAI object, you need watsonx.ai credentials (with your API key and URL) and either a project_id or a space_id.

Hint

You can copy the project_id from the Project’s Manage tab (Project -> Manage -> General -> Details).
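
For example, the credentials object can be built from your API key and service URL (a minimal sketch; the URL, API key, and the wx_credentials variable name are placeholders to replace with your own values):

from ibm_watsonx_ai import Credentials

# Hypothetical values - substitute your own API key and region URL.
wx_credentials = Credentials(
    url='https://us-south.ml.cloud.ibm.com',
    api_key='***'
)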

from ibm_watsonx_ai.experiment import AutoAI

experiment = AutoAI(wx_credentials,
    space_id='76g53e0-0b32-4a0e-9152-3d50324855ddb'
)

pipeline_optimizer = experiment.optimizer(
            name='test name',
            desc='test description',
            prediction_type=AutoAI.PredictionType.BINARY,
            prediction_column='y',
            scoring=AutoAI.Metrics.ACCURACY_SCORE,
            test_size=0.1,
            max_num_daub_ensembles=1,
            train_sample_rows_test_size=1.,
            daub_include_only_estimators=[
                 AutoAI.ClassificationAlgorithms.XGB,
                 AutoAI.ClassificationAlgorithms.LGBM
                 ],
            cognito_transform_names=[
                 AutoAI.Transformers.SUM,
                 AutoAI.Transformers.MAX
                 ]
        )

Configure optimizer for time series forecasting

Note: Supported for IBM Cloud Pak® for Data 4.0 and later.

Time series forecasting is a special AutoAI prediction scenario with specific parameters used to configure forecasting. These parameters include: prediction_columns, timestamp_column_name, backtest_num, lookback_window, forecast_window, and backtest_gap_length.

from ibm_watsonx_ai.experiment import AutoAI
from ibm_watsonx_ai.utils.autoai.enums import TShirtSize

experiment = AutoAI(wx_credentials,
    space_id='76g53e0-0b32-4a0e-9152-3d50324855ddb'
)

pipeline_optimizer = experiment.optimizer(
    name='forecasting optimizer',
    desc='description',
    prediction_type=experiment.PredictionType.FORECASTING,
    prediction_columns=['value'],
    timestamp_column_name='timestamp',
    backtest_num=4,
    lookback_window=5,
    forecast_window=2,
    holdout_size=0.05,
    max_number_of_estimators=1,
    include_only_estimators=[AutoAI.ForecastingAlgorithms.ENSEMBLER],
    t_shirt_size=TShirtSize.L
)

The optimizer fitting and deployment procedures are the same as in the basic scenario above.

Configure optimizer for time series forecasting with supporting features

Note: Supported for IBM Cloud and IBM Cloud Pak® for Data version 4.5 and later.

Additional parameters can be passed to run time series forecasting scenarios with supporting features: feature_columns, pipeline_types, and supporting_features_at_forecast.

For more information about supporting features, refer to time series documentation.

from ibm_watsonx_ai.experiment import AutoAI
from ibm_watsonx_ai.utils.autoai.enums import ForecastingPipelineTypes

experiment = AutoAI(wx_credentials,
    space_id='76g53e0-0b32-4a0e-9152-3d50324855ddb'
)

pipeline_optimizer = experiment.optimizer(
    name='forecasting optimizer',
    desc='description',
    prediction_type=experiment.PredictionType.FORECASTING,
    prediction_columns=['value'],
    timestamp_column_name='week',
    feature_columns=['a', 'b', 'value'],
    pipeline_types=[ForecastingPipelineTypes.FlattenEnsembler] + ForecastingPipelineTypes.get_exogenous(),
    supporting_features_at_forecast=True
)

Predicting for time series forecasting scenario with supporting features:

# Example data:
#   new_observations:
#       week       a   b  value
#       14.0       0   0  134
#       15.0       1   4  96
#       ...
#
#   supporting_features:
#       week       a   b
#       16.0       1   3
#       ...

# with DataFrame or np.array:
pipeline_optimizer.predict(new_observations, supporting_features=supporting_features)

Online scoring for time series forecasting scenario with supporting features:

# with DataFrame:
web_service.score(payload={'observations': new_observations_df, 'supporting_features': supporting_features_df})

Batch scoring for time series forecasting scenario with supporting features:

# with DataFrame:
batch_service.run_job(payload={'observations': new_observations_df, 'supporting_features': supporting_features_df})

# with DataConnection:
batch_service.run_job(payload={'observations': new_observations_data_connection, 'supporting_features': supporting_features_data_connection})

Get configuration parameters

To see the current configuration parameters, call the get_params() method.

config_parameters = pipeline_optimizer.get_params()
print(config_parameters)
{
    'name': 'test name',
    'desc': 'test description',
    'prediction_type': 'classification',
    'prediction_column': 'y',
    'scoring': 'roc_auc',
    'test_size': 0.1,
    'max_num_daub_ensembles': 1
}

Fit optimizer

To schedule an AutoAI experiment, call the fit() method. This triggers a training and an optimization process on watsonx.ai. The fit() method can be synchronous (background_mode=False) or asynchronous (background_mode=True). If you don’t want to wait for the fit to finish, invoke the asynchronous version; it immediately returns only the fit/run details. If you invoke the synchronous version, you see a progress bar with information about the learning/optimization process.

fit_details = pipeline_optimizer.fit(
        training_data_references=[training_data_connection],
        training_results_reference=results_connection,
        background_mode=True)

# OR

fit_details = pipeline_optimizer.fit(
        training_data_references=[training_data_connection],
        training_results_reference=results_connection,
        background_mode=False)

To run an AutoAI experiment with separate holdout data, use the fit() method with the test_data_references parameter, as in the example below:

fit_details = pipeline_optimizer.fit(
        training_data_references=[training_data_connection],
        test_data_references=[test_data_connection],
        training_results_reference=results_connection)

Get the run status and run details

If you use the fit() method asynchronously, you can monitor the run/fit details and status using the following two methods:

status = pipeline_optimizer.get_run_status()
print(status)
'running'

# OR

'completed'

run_details = pipeline_optimizer.get_run_details()
print(run_details)
{'entity': {'pipeline': {'href': '/v4/pipelines/5bfeb4c5-90df-48b8-9e03-ba232d8c0838'},
        'results_reference': {'connection': { 'id': ...},
                              'location': {'bucket': '...',
                                           'logs': '53c8cb7b-c8b5-44aa-8b52-6fde3c588462',
                                           'model': '53c8cb7b-c8b5-44aa-8b52-6fde3c588462/model',
                                           'path': '.',
                                           'pipeline': './33825fa2-5fca-471a-ab1a-c84820b3e34e/pipeline.json',
                                           'training': './33825fa2-5fca-471a-ab1a-c84820b3e34e',
                                           'training_status': './33825fa2-5fca-471a-ab1a-c84820b3e34e/training-status.json'},
                              'type': 'connected_asset'},
        'space': {'href': '/v4/spaces/71ab11ea-bb77-4ae6-b98a-a77f30ade09d'},
        'status': {'completed_at': '2020-02-17T10:46:32.962Z',
                   'message': {'level': 'info',
                               'text': 'Training job '
                                       '33825fa2-5fca-471a-ab1a-c84820b3e34e '
                                       'completed'},
                   'state': 'completed'},
        'training_data_references': [{'connection': {'id': '...'},
                                      'location': {'bucket': '...',
                                                   'path': '...'},
                                      'type': 'connected_asset'}]},
 'metadata': {'created_at': '2020-02-17T10:44:22.532Z',
              'guid': '33825fa2-5fca-471a-ab1a-c84820b3e34e',
              'href': '/v4/trainings/33825fa2-5fca-471a-ab1a-c84820b3e34e',
              'id': '33825fa2-5fca-471a-ab1a-c84820b3e34e',
              'modified_at': '2020-02-17T10:46:32.987Z'}}

Get data connections

The data_connections list contains all the training connections that you referenced while calling the fit() method.

data_connections = pipeline_optimizer.get_data_connections()
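
For example, you can read the training data back as a pandas.DataFrame (a sketch; it assumes the first referenced connection points to readable training data):

# Read the referenced training data into a DataFrame.
train_df = data_connections[0].read()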

Pipeline summary

It is possible to get a ranking of all the computed pipeline models, sorted based on a scoring metric supplied when configuring the optimizer (scoring parameter). The output type is a pandas.DataFrame with pipeline names, computation timestamps, machine learning metrics, and the number of enhancements implemented in each of the pipelines.

results = pipeline_optimizer.summary()
print(results)
               Number of enhancements  ...  training_f1
Pipeline Name                          ...
Pipeline_4                          3  ...     0.555556
Pipeline_3                          2  ...     0.554978
Pipeline_2                          1  ...     0.503175
Pipeline_1                          0  ...     0.529928
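
Because the summary is sorted by the scoring metric, you can, for example, pick the name of the top-ranked pipeline directly from the DataFrame index (a sketch; it assumes the best pipeline is listed first):

# The DataFrame index holds the pipeline names, best pipeline first.
best_pipeline_name = results.index[0]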

Get pipeline details

To see the composition steps and nodes of a pipeline, use the get_pipeline_details() method. If you leave pipeline_name empty, the method returns the details of the best computed pipeline.

pipeline_params = pipeline_optimizer.get_pipeline_details(pipeline_name='Pipeline_1')
print(pipeline_params)
{
    'composition_steps': [
        'TrainingDataset_full_199_16', 'Split_TrainingHoldout',
        'TrainingDataset_full_179_16', 'Preprocessor_default', 'DAUB'
        ],
    'pipeline_nodes': [
        'PreprocessingTransformer', 'LogisticRegressionEstimator'
        ]
}

Get pipeline

Use the get_pipeline() method to load a specific pipeline. By default, get_pipeline() returns a Lale pipeline. For information on Lale pipelines, refer to the Lale library.

pipeline = pipeline_optimizer.get_pipeline(pipeline_name='Pipeline_4')
print(type(pipeline))
'lale.operators.TrainablePipeline'
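
To inspect the returned Lale pipeline you can, for example, print its equivalent Python source (a sketch; it assumes the lale package installed alongside the client provides the pretty_print() method):

# Print the pipeline definition as Python source code.
print(pipeline.pretty_print())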

You can also load a pipeline as a scikit-learn (sklearn) pipeline model.

pipeline = pipeline_optimizer.get_pipeline(pipeline_name='Pipeline_4', astype=AutoAI.PipelineTypes.SKLEARN)
print(type(pipeline))
# <class 'sklearn.pipeline.Pipeline'>
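
Once loaded, the sklearn pipeline can be used like any other fitted scikit-learn estimator (a sketch; it assumes the stored pipeline is already fitted, and test_X is a hypothetical array or DataFrame with the same feature columns as the training data):

# Score new records with the retrieved pipeline.
predictions = pipeline.predict(test_X)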

Working with deployments

This section describes classes that enable you to work with watsonx.ai deployments.

Web Service

Web Service is an online type of deployment. With Web Service, you can upload and deploy your model to score it through an online web service. You must pass the location where the training was performed using source_space_id or source_project_id. You can deploy the model to any space or project by providing the target_space_id or the target_project_id.

Note: WebService supports only the AutoAI deployment type.

from ibm_watsonx_ai.deployment import WebService

service = WebService(wx_credentials,
     source_space_id='76g53e0-0b32-4a0e-9152-3d50324855ddb',
     target_space_id='1234abc1234abc1234abc1234abc1234abcd'
)

service.create(
    experiment_run_id="...",
    model=model,
    deployment_name='My new deployment'
)
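
After the deployment is created, you can score new records against the online web service (a sketch; test_df is a hypothetical pandas.DataFrame with the same feature columns as the training data):

# Score new records against the deployed web service.
predictions = service.score(payload=test_df)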

Batch

Batch manages the batch type of deployment. With Batch, you can upload and deploy a model and run a batch deployment job. As with Web Service, you must pass the location where the training was performed using the source_space_id or the source_project_id. You can deploy the model to any space or project by providing the target_space_id or the target_project_id.

You can provide the input data as a pandas.DataFrame, a data-asset, or a Cloud Object Storage (COS) file.

Note: Batch supports only the AutoAI deployment type.

Example of a batch deployment creation:

from ibm_watsonx_ai.deployment import Batch

service_batch = Batch(wx_credentials, source_space_id='76g53e0-0b32-4a0e-9152-3d50324855ddb')
service_batch.create(
        experiment_run_id="6ce62a02-3e41-4d11-89d1-484c2deaed75",
        model="Pipeline_4",
        deployment_name='Batch deployment')

Example of a batch job creation with inline data as pandas.DataFrame type:

scoring_params = service_batch.run_job(
            payload=test_X_df,
            background_mode=False)

Example of batch job creation with a COS object:

from ibm_watsonx_ai.helpers.connections import S3Location, DataConnection

connection_details = client.connections.create({
    client.connections.ConfigurationMetaNames.NAME: "Connection to COS",
    client.connections.ConfigurationMetaNames.DATASOURCE_TYPE: client.connections.get_datasource_type_id_by_name('bluemixcloudobjectstorage'),
    client.connections.ConfigurationMetaNames.PROPERTIES: {
        'bucket': 'bucket_name',
        'access_key': 'COS access key id',
        'secret_key': 'COS secret access key',
        'iam_url': 'COS iam url',
        'url': 'COS endpoint url'
    }
})

connection_id = client.connections.get_uid(connection_details)

payload_reference = DataConnection(
        connection_asset_id=connection_id,
        location=S3Location(bucket='bucket_name',   # note: COS bucket name where deployment payload dataset is located
                            path='my_path'  # note: path within bucket where your deployment payload dataset is located
                            )
    )

results_reference = DataConnection(
        connection_asset_id=connection_id,
        location=S3Location(bucket='bucket_name',   # note: COS bucket name where deployment output should be located
                            path='my_path_where_output_will_be_saved'  # note: path within bucket where your deployment output should be located
                            )
    )
payload_reference.write("local_path_to_the_batch_payload_csv_file", remote_name="batch_payload_location.csv")

scoring_params = service_batch.run_job(
    payload=[payload_reference],
    output_data_reference=results_reference,
    background_mode=False)   # If background_mode is False, the job runs synchronously. Otherwise, the job status needs to be monitored.
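
When the job has finished, the output written to the results location can be read back, for example as a pandas.DataFrame (a sketch; it assumes the job completed successfully and the output CSV is readable through the same connection):

# Read the batch output from the results location.
output_df = results_reference.read()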

Example of a batch job creation with a data-asset object:

from ibm_watsonx_ai.helpers.connections import DataConnection, CloudAssetLocation, DeploymentOutputAssetLocation

payload_reference = DataConnection(location=CloudAssetLocation(asset_id=asset_id))
results_reference = DataConnection(
        location=DeploymentOutputAssetLocation(name="batch_output_file_name.csv"))

scoring_params = service_batch.run_job(
    payload=[payload_reference],
    output_data_reference=results_reference,
    background_mode=False)