DataConnection Modules

DataConnection

class ibm_watsonx_ai.helpers.connections.connections.DataConnection(location=None, connection=None, data_asset_id=None, connection_asset_id=None, **kwargs)[source]

Bases: BaseDataConnection

A data storage connection class used in the service training metadata to reference input data.

Parameters:
  • connection (NFSConnection or ConnectionAsset, optional) – connection parameters of a specific type

  • location (Union[S3Location, FSLocation, AssetLocation]) – required location parameters of a specific type

  • data_asset_id (str, optional) – data asset ID, if the DataConnection should point to a data asset

  • connection_asset_id (str, optional) – connection asset ID, if the DataConnection should point to a connection asset

download(filename)[source]

Download a dataset stored in a remote data storage and save to a file.

Parameters:

filename (str) – path to the file where data will be downloaded

Examples

document_reference = DataConnection(
    connection_asset_id="<connection_id>",
    location=S3Location(bucket="<bucket_name>", path="path/to/file"),
)
document_reference.download(filename='results.json')

classmethod from_dict(connection_data)[source]

Create a DataConnection object from a dictionary.

Parameters:

connection_data (dict) – dictionary data structure with information about the data connection reference

Returns:

DataConnection object

Return type:

DataConnection

classmethod from_studio(path)[source]

Create DataConnection objects from credentials stored (connected) in Watson Studio. Supported only for COS.

Parameters:

path (str) – path in the COS bucket to the training dataset

Returns:

list with DataConnection objects

Return type:

list[DataConnection]

Example:

data_connections = DataConnection.from_studio(path='iris_dataset.csv')

read(with_holdout_split=False, csv_separator=',', excel_sheet=None, encoding='utf-8', raw=False, binary=False, read_to_file=None, number_of_batch_rows=None, sampling_type=None, sample_size_limit=None, sample_rows_limit=None, sample_percentage_limit=None, **kwargs)[source]

Download a dataset stored in a remote data storage. Returns a batch of up to 1 GB.

Parameters:
  • with_holdout_split (bool, optional) – if True, the data will be split into train and holdout datasets, as it was during the AutoAI experiment

  • csv_separator (str, optional) – separator/delimiter for the CSV file

  • excel_sheet (str, optional) – name of the Excel sheet to use; applies only when an xlsx file is the input (referring to a sheet by number is deprecated)

  • encoding (str, optional) – encoding type of the CSV file

  • raw (bool, optional) – if False, the data is lightly preprocessed (the same as in the backend); if True, the data is not preprocessed

  • binary (bool, optional) – indicates to retrieve data in binary mode, the result will be a python binary type variable

  • read_to_file (str, optional) – stream the read data to a file at the path given as the value of this parameter; use this parameter to avoid keeping data in memory

  • number_of_batch_rows (int, optional) – number of rows to read in each batch when reading from the flight connection

  • sampling_type (str, optional) – a sampling strategy on how to read the data

  • sample_size_limit (int, optional) – upper limit for the overall data to be downloaded in bytes, default: 1 GB

  • sample_rows_limit (int, optional) – upper limit for the overall data to be downloaded in a number of rows

  • sample_percentage_limit (float, optional) – upper limit for the overall data to be downloaded, as a fraction of the whole dataset; must be a float between 0 and 1; this parameter is ignored when sampling_type is set to first_n_records

Note

If more than one of sample_size_limit, sample_rows_limit, and sample_percentage_limit is set, the downloaded data is limited to the lowest threshold.

Returns:

one of the following:

  • pandas.DataFrame that contains the dataset from remote data storage : Xy_train

  • Tuple[pandas.DataFrame, pandas.DataFrame, pandas.DataFrame, pandas.DataFrame] : X_train, X_holdout, y_train, y_holdout, the automatic holdout split from the backend when with_holdout_split=True (only train data provided)

  • Tuple[pandas.DataFrame, pandas.DataFrame] : X_test, y_test that contains the test (holdout) data from remote storage

  • bytes object, when binary=True

Examples

train_data_connections = optimizer.get_data_connections()

data = train_data_connections[0].read()  # all train data

# or

X_train, X_holdout, y_train, y_holdout = train_data_connections[0].read(with_holdout_split=True)  # train and holdout data

Your train and test data:

optimizer.fit(training_data_reference=[DataConnection],
              training_results_reference=DataConnection,
              test_data_reference=DataConnection)

test_data_connection = optimizer.get_test_data_connections()
X_test, y_test = test_data_connection.read()  # only holdout data

# and

train_data_connections = optimizer.get_data_connections()
data = train_data_connections[0].read()  # only train data

set_client(api_client=None, **kwargs)[source]

Set an initialized API client in the connection to enable read/write operations with the remote service.

Parameters:

api_client (APIClient) – API client to connect to a service

Example:

DataConnection.set_client(api_client=api_client)

to_dict()[source]

Convert a DataConnection object to a dictionary representation.

Returns:

DataConnection dictionary representation

Return type:

dict

write(data, remote_name=None, **kwargs)[source]

Upload a file to a remote data storage.

Parameters:
  • data (str) – local path to the dataset or pandas.DataFrame with data

  • remote_name (str, optional) – name under which the dataset will be stored in the remote data storage
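A hedged sketch of uploading data, assuming an APIClient already initialized with your credentials (the connection ID, bucket, paths, and file names below are placeholders); write accepts either a local file path or a pandas.DataFrame:

```python
import pandas as pd

from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai.helpers.connections import DataConnection, S3Location

# api_client is assumed to be an APIClient initialized with your credentials.
api_client: APIClient = ...  # placeholder

results_reference = DataConnection(
    connection_asset_id="<connection_id>",
    location=S3Location(bucket="<bucket_name>", path="path/to/output/"),
)
results_reference.set_client(api_client=api_client)

# Upload a local file ...
results_reference.write(data="local_data.csv", remote_name="data.csv")

# ... or a pandas.DataFrame directly.
results_reference.write(data=pd.DataFrame({"feature": [1, 2, 3]}),
                        remote_name="data.csv")
```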

S3Location

class ibm_watsonx_ai.helpers.connections.connections.S3Location(bucket, path, **kwargs)[source]

Bases: BaseLocation

Connection class to a COS data storage in S3 format.

Parameters:
  • bucket (str) – COS bucket name

  • path (str) – COS data path in the bucket

  • excel_sheet (str, optional) – name of the Excel sheet, if the chosen dataset uses an Excel file for batch deployment scoring

  • model_location (str, optional) – path to the pipeline model in the COS

  • training_status (str, optional) – path to the training status JSON in the COS

get_location()[source]

CloudAssetLocation

class ibm_watsonx_ai.helpers.connections.connections.CloudAssetLocation(asset_id)[source]

Bases: AssetLocation

Connection class to data assets as input data references to a batch deployment job on Cloud.

Parameters:

asset_id (str) – asset ID of the file loaded on space on Cloud

DeploymentOutputAssetLocation

class ibm_watsonx_ai.helpers.connections.connections.DeploymentOutputAssetLocation(name, description='')[source]

Bases: BaseLocation

Connection class to data assets where the output of a batch deployment will be stored.

Parameters:
  • name (str) – name of the CSV file to be saved as a data asset

  • description (str, optional) – description of the data asset