Working with DataConnection
===========================

Before you start an AutoAI experiment, you need to specify where your training dataset is located. AutoAI supports Cloud Object Storage (COS) and data assets on Cloud.

IBM Cloud - DataConnection Initialization
-----------------------------------------

There are three types of connections: Connection Asset, Data Asset, and Container. To upload your experiment dataset, you must initialize ``DataConnection`` with your COS credentials.

.. _working-with-connection-asset:

Connection Asset
~~~~~~~~~~~~~~~~

.. code-block:: python

    from ibm_watsonx_ai.helpers.connections import DataConnection, S3Location

    connection_details = client.connections.create(
        {
            client.connections.ConfigurationMetaNames.NAME: "Connection to COS",
            client.connections.ConfigurationMetaNames.DATASOURCE_TYPE: client.connections.get_datasource_type_id_by_name(
                "bluemixcloudobjectstorage"
            ),
            client.connections.ConfigurationMetaNames.PROPERTIES: {
                "bucket": "bucket_name",
                "access_key": "COS access key id",
                "secret_key": "COS secret access key",
                "iam_url": "COS iam url",
                "url": "COS endpoint url",
            },
        }
    )
    connection_id = client.connections.get_id(connection_details)

    # note: this DataConnection is used as a reference to where your training dataset is located
    training_data_references = DataConnection(
        connection_asset_id=connection_id,
        location=S3Location(
            bucket="bucket_name",  # note: COS bucket name where training dataset is located
            path="my_path",  # note: path within bucket where your training dataset is located
        ),
    )

    # note: this DataConnection is used as a reference to where all of the AutoAI experiment results are saved
    results_connection = DataConnection(
        connection_asset_id=connection_id,
        # note: the bucket name and path can be the same as or different from those in training_data_references
        location=S3Location(bucket="bucket_name", path="my_path"),
    )

Data Asset
~~~~~~~~~~

.. code-block:: python

    from ibm_watsonx_ai.helpers.connections import DataConnection

    data_location = "./your_dataset.csv"
    asset_details = client.data_assets.create(
        name=data_location.split("/")[-1], file_path=data_location
    )
    asset_id = client.data_assets.get_id(asset_details)

    training_data_references = DataConnection(data_asset_id=asset_id)

Container
~~~~~~~~~

.. code-block:: python

    from ibm_watsonx_ai.helpers.connections import DataConnection, ContainerLocation

    training_data_references = DataConnection(
        location=ContainerLocation(path="your_dataset.csv")
    )

IBM watsonx.ai software - DataConnection Initialization
-------------------------------------------------------

There are three types of connections: Connection Asset, Data Asset, and FS. FS is used only for saving result references. To upload your experiment dataset, you must initialize ``DataConnection`` with your service credentials.

Connection Asset - DatabaseLocation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    from ibm_watsonx_ai.helpers.connections import DataConnection, DatabaseLocation

    connection_details = client.connections.create(
        {
            client.connections.ConfigurationMetaNames.NAME: f"Connection to Database - {your_database_name}",
            client.connections.ConfigurationMetaNames.DATASOURCE_TYPE: client.connections.get_datasource_type_id_by_name(
                "your_database_name"
            ),
            client.connections.ConfigurationMetaNames.PROPERTIES: {
                "database": "database_name",
                "password": "database_password",
                "port": "port_number",
                "host": "host_name",
                "username": "database_username",  # e.g. "postgres"
            },
        }
    )
    connection_id = client.connections.get_id(connection_details)

    training_data_references = DataConnection(
        connection_asset_id=connection_id,
        location=DatabaseLocation(
            schema_name=schema_name,
            table_name=table_name,
        ),
    )

Connection Asset - S3Location
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For a Connection Asset with ``S3Location``, a ``connection_id`` to the S3 storage is required.

.. code-block:: python

    from ibm_watsonx_ai.helpers.connections import DataConnection, S3Location

    training_data_references = DataConnection(
        connection_asset_id=connection_id,
        location=S3Location(
            bucket="bucket_name",  # note: COS bucket name where training dataset is located
            path="my_path",  # note: path within bucket where your training dataset is located
        ),
    )

Connection Asset - NFSLocation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Before establishing a connection, you need to create and start a ``volume`` where the dataset will be stored.

.. code-block:: python

    from ibm_watsonx_ai.helpers.connections import DataConnection, NFSLocation

    connection_details = client.connections.create(
        {
            client.connections.ConfigurationMetaNames.NAME: "Client NFS Volume Connection from SDK",
            client.connections.ConfigurationMetaNames.DATASOURCE_TYPE: client.connections.get_datasource_type_id_by_name(
                "volumes"
            ),
            client.connections.ConfigurationMetaNames.DESCRIPTION: "NFS volume connection from python client",
            client.connections.ConfigurationMetaNames.PROPERTIES: {
                "instance_id": volume_id,
                "pvc": existing_pvc_volume_name,
                "volume": volume_name,
                "inherit_access_token": "true",
            },
            "flags": ["personal_credentials"],
        }
    )
    connection_id = client.connections.get_id(connection_details)

    training_data_references = DataConnection(
        connection_asset_id=connection_id, location=NFSLocation(path=f"/{filename}")
    )

Connection Asset - RemoteFileStorageLocation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    from ibm_watsonx_ai.helpers import DataConnection, RemoteFileStorageLocation

    connection_details = client.connections.create(
        {
            client.connections.ConfigurationMetaNames.NAME: "Connection to MS Azure Blob Storage",
            client.connections.ConfigurationMetaNames.DATASOURCE_TYPE: client.connections.get_datasource_type_id_by_name(
                "azureblobstorage"
            ),
            client.connections.ConfigurationMetaNames.PROPERTIES: {
                "container": container_name,
                "connection_string": connection_string,
            },
        }
    )
    connection_id = client.connections.get_id(connection_details)

    training_data_references = DataConnection(
        connection_asset_id=connection_id,
        location=RemoteFileStorageLocation(path=filename, container=container_name),
    )

Data Asset
~~~~~~~~~~

.. code-block:: python

    from ibm_watsonx_ai.helpers.connections import DataConnection

    data_location = "./your_dataset.csv"
    asset_details = client.data_assets.create(
        name=data_location.split("/")[-1], file_path=data_location
    )
    asset_id = client.data_assets.get_id(asset_details)

    training_data_references = DataConnection(data_asset_id=asset_id)

FSLocation
~~~~~~~~~~

After running ``fit()``, you can read your results from a dedicated place in the cluster's filesystem using ``FSLocation``.

.. code-block:: python

    from ibm_watsonx_ai.helpers.connections import DataConnection, FSLocation

    training_result_reference = DataConnection(
        location=FSLocation(path="path_to_directory")
    )

Batch DataConnection
--------------------

If you use a Batch type of deployment, you can store the output of the Batch deployment using ``DataConnection``. For more information and usage instructions, see :ref:`working-with-batch`.

.. code-block:: python

    from ibm_watsonx_ai.helpers.connections import (
        DataConnection,
        DeploymentOutputAssetLocation,
    )
    from ibm_watsonx_ai.deployment import Batch

    service_batch = Batch(wml_credentials, source_space_id=space_id)
    service_batch.create(
        experiment_run_id="id_of_your_experiment_run",
        model="chosen_pipeline",
        deployment_name="Batch deployment",
    )

    # note: the training DataConnection can be reused as the scoring payload reference
    payload_reference = training_data_references
    results_reference = DataConnection(
        location=DeploymentOutputAssetLocation(name="batch_output_file_name.csv")
    )

    scoring_params = service_batch.run_job(
        payload=[payload_reference],
        output_data_reference=results_reference,
        background_mode=False,
    )

Upload your training dataset
----------------------------

An AutoAI experiment needs access to your training data. If your training dataset is not stored yet, you can store it by invoking the ``write()`` method of the ``DataConnection`` object.

.. code-block:: python

    training_data_references.set_client(client)
    training_data_references.write(
        data="local_path_to_the_dataset", remote_name="training_dataset.csv"
    )

Download your training dataset
------------------------------

To download a stored dataset, use the ``read()`` method of the ``DataConnection`` object.

.. code-block:: python

    training_data_references.set_client(client)
    dataset = training_data_references.read()  # note: returns a pandas DataFrame
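Because ``read()`` returns a plain ``pandas.DataFrame``, the usual pandas workflow applies to the downloaded dataset. The sketch below shows a typical next step; note that the in-memory frame and its column names (``age``, ``income``, ``label``) are illustrative stand-ins for a real ``read()`` result, which requires a live client.

```python
import pandas as pd

# stand-in for: dataset = training_data_references.read()
# (a real read() call needs a live client; this frame only illustrates the shape)
dataset = pd.DataFrame(
    {
        "age": [25, 32, 47, 51],
        "income": [40000, 52000, 61000, 58000],
        "label": [0, 1, 1, 0],
    }
)

# separate features from the prediction column before inspecting or modelling
X = dataset.drop(columns=["label"])
y = dataset["label"]

print(X.shape)  # (4, 2)
print(y.tolist())  # [0, 1, 1, 0]
```

The same split applies unchanged to whatever frame ``read()`` actually returns, with your own prediction column name in place of ``label``.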