DataConnection Modules

DataConnection

class ibm_watsonx_ai.helpers.connections.connections.DataConnection(location=None, connection=None, data_asset_id=None, connection_asset_id=None, **kwargs)[source]

Bases: BaseDataConnection

Data storage connection class, used to provide training metadata (input data) for a service.

Parameters:
  • connection (NFSConnection or ConnectionAsset, optional) – connection parameters of a specific type

  • location (Union[S3Location, FSLocation, AssetLocation]) – required location parameters of a specific type

  • data_asset_id (str, optional) – data asset ID, if the DataConnection should point to a data asset

download(filename)[source]

Download a dataset stored in a remote data storage and save to a file.

Parameters:

filename (str) – path to the file where data will be downloaded

Examples

document_reference = DataConnection(
    connection_asset_id="<connection_id>",
    location=S3Location(bucket="<bucket_name>", path="path/to/file"),
    )
document_reference.download(filename='results.json')

download_folder(local_dir=None)[source]

Download files from a folder and subfolders stored in a remote data storage and save to a local directory.

Parameters:

local_dir (str, optional) – path to the local directory where data will be downloaded, download to current working directory if not provided

Examples

folder_reference = DataConnection(
    connection_asset_id="<connection_id>",
    location=S3Location(bucket="<bucket_name>", path="path/to/folder"),
    )
folder_reference.download_folder(local_dir="./data")

classmethod from_dict(connection_data)[source]

Create a DataConnection object from a dictionary.

Parameters:

connection_data (dict) – dictionary data structure with information about the data connection reference

Returns:

DataConnection object

Return type:

DataConnection
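
The dictionary passed to from_dict() mirrors the structure produced by to_dict(). A minimal sketch of the expected shape, assuming connection-asset and S3-style keys (the exact schema may differ; inspect the to_dict() output of your own objects):

```python
# Hypothetical dictionary shape for DataConnection.from_dict(); the key
# names below are an assumption based on the documented constructor
# parameters, not a guaranteed schema.
connection_data = {
    "connection": {"id": "<connection_id>"},
    "location": {"bucket": "<bucket_name>", "path": "path/to/file"},
}

# With ibm_watsonx_ai installed, the reference can be rebuilt like this:
# from ibm_watsonx_ai.helpers.connections import DataConnection
# data_connection = DataConnection.from_dict(connection_data)
```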

classmethod from_studio(path)[source]

Create DataConnection objects from credentials stored (connected) in Watson Studio. COS only.

Parameters:

path (str) – path in the COS bucket to the training dataset

Returns:

list with DataConnection objects

Return type:

list[DataConnection]

Example:

data_connections = DataConnection.from_studio(path='iris_dataset.csv')

read(with_holdout_split=False, csv_separator=',', excel_sheet=None, encoding='utf-8', raw=False, binary=False, read_to_file=None, number_of_batch_rows=None, sampling_type=None, sample_size_limit=None, sample_rows_limit=None, sample_percentage_limit=None, **kwargs)[source]

Download a dataset stored in remote data storage. Returns a batch of up to 1 GB.

Parameters:
  • with_holdout_split (bool, optional) – if True, data will be split to train and holdout dataset as it was by AutoAI

  • csv_separator (str, optional) – separator/delimiter for the CSV file

  • excel_sheet (str, optional) – Excel sheet name to use; applies only when the input is an xlsx file (support for passing the sheet number is deprecated)

  • encoding (str, optional) – encoding type of the CSV file

  • raw (bool, optional) – if False, data is preprocessed (the same way as in the backend); if True, data is not preprocessed

  • binary (bool, optional) – if True, data is retrieved in binary mode and returned as a Python bytes object

  • read_to_file (str, optional) – stream read data to a file under the path specified as the value of this parameter, use this parameter to prevent keeping data in-memory

  • number_of_batch_rows (int, optional) – number of rows to read in each batch when reading from the flight connection

  • sampling_type (str, optional) – a sampling strategy on how to read the data

  • sample_size_limit (int, optional) – upper limit for the overall data to be downloaded in bytes, default: 1 GB

  • sample_rows_limit (int, optional) – upper limit for the overall data to be downloaded in a number of rows

  • sample_percentage_limit (float, optional) – upper limit for the overall data to be downloaded, as a fraction of the whole dataset; must be a float between 0 and 1; ignored when the sampling_type parameter is set to first_n_records

Note

If more than one of: sample_size_limit, sample_rows_limit, sample_percentage_limit are set, then downloaded data is limited to the lowest threshold.
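
The interaction of these limits can be illustrated in plain Python. The byte-to-row conversion below is an assumption made for the sketch only; it is not the library's internal logic:

```python
# Illustration of the documented rule: when several sampling limits are set,
# the download is capped by the most restrictive (lowest) threshold.
def effective_row_limit(total_rows, avg_row_bytes,
                        sample_size_limit=None,
                        sample_rows_limit=None,
                        sample_percentage_limit=None):
    candidates = [total_rows]
    if sample_size_limit is not None:
        # Assumed conversion: a byte budget maps to rows via average row size.
        candidates.append(sample_size_limit // avg_row_bytes)
    if sample_rows_limit is not None:
        candidates.append(sample_rows_limit)
    if sample_percentage_limit is not None:
        candidates.append(int(total_rows * sample_percentage_limit))
    return min(candidates)

# 1 MB size cap, 5000-row cap, and a 10% cap on 100_000 rows of ~100 B each:
effective_row_limit(100_000, 100,
                    sample_size_limit=1_000_000,
                    sample_rows_limit=5_000,
                    sample_percentage_limit=0.1)  # -> 5000 (lowest threshold)
```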

Returns:

one of the following:

  • pandas.DataFrame – the dataset from remote data storage (Xy_train)

  • Tuple[pandas.DataFrame, pandas.DataFrame, pandas.DataFrame, pandas.DataFrame] – X_train, X_holdout, y_train, y_holdout, with the holdout split performed automatically by the backend (when with_holdout_split=True)

  • Tuple[pandas.DataFrame, pandas.DataFrame] – X_test, y_test, test data from remote storage

  • bytes object (when binary=True)

Examples

train_data_connections = optimizer.get_data_connections()

data = train_data_connections[0].read() # all train data

# or

X_train, X_holdout, y_train, y_holdout = train_data_connections[0].read(with_holdout_split=True) # train and holdout data

When both training and test data references are passed to the optimizer:

optimizer.fit(training_data_reference=[DataConnection],
              training_results_reference=DataConnection,
              test_data_reference=DataConnection)

test_data_connection = optimizer.get_test_data_connections()
X_test, y_test = test_data_connection.read() # only holdout data

# and

train_data_connections = optimizer.get_data_connections()
data = train_data_connections[0].read() # only train data

set_client(api_client=None, **kwargs)[source]

To enable write/read operations with a connection to a service, set an initialized service client in the connection.

Parameters:

api_client (APIClient) – API client to connect to a service

Example:

data_connection.set_client(api_client=api_client)

to_dict()[source]

Convert a DataConnection object to a dictionary representation.

Returns:

DataConnection dictionary representation

Return type:

dict

write(data, remote_name=None, **kwargs)[source]

Upload a file to a remote data storage.

Parameters:
  • data (str or pandas.DataFrame) – local path to the dataset or a pandas.DataFrame with the data

  • remote_name (str) – name of dataset to be stored in the remote data storage
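
A hedged sketch of uploading a local file with write(). The file preparation below is runnable with the standard library alone; the DataConnection setup and the upload call are assumptions shown commented out, with placeholder identifiers:

```python
import csv
import os
import tempfile

# Prepare a small local dataset (stdlib only, runnable anywhere).
local_path = os.path.join(tempfile.mkdtemp(), "dataset.csv")
with open(local_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sepal_length", "species"])
    writer.writerow([5.1, "setosa"])

# With ibm_watsonx_ai installed and an initialized APIClient (assumed usage):
# from ibm_watsonx_ai.helpers.connections import DataConnection, S3Location
# results_reference = DataConnection(
#     connection_asset_id="<connection_id>",
#     location=S3Location(bucket="<bucket_name>", path="path/to/folder"),
# )
# results_reference.set_client(api_client=api_client)
# results_reference.write(data=local_path, remote_name="dataset.csv")
```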

S3Location

class ibm_watsonx_ai.helpers.connections.connections.S3Location(bucket, path, **kwargs)[source]

Bases: BaseLocation

Connection class to a COS data storage in S3 format.

Parameters:
  • bucket (str) – COS bucket name

  • path (str) – COS data path in the bucket

  • excel_sheet (str, optional) – name of the Excel sheet, used when the dataset chosen for batch deployment scoring is an Excel file

  • model_location (str, optional) – path to the pipeline model in the COS

  • training_status (str, optional) – path to the training status JSON in the COS

get_location()[source]

CloudAssetLocation

class ibm_watsonx_ai.helpers.connections.connections.CloudAssetLocation(asset_id)[source]

Bases: AssetLocation

Connection class to data assets as input data references to a batch deployment job on Cloud.

Parameters:

asset_id (str) – asset ID of the file uploaded to the space on Cloud

DeploymentOutputAssetLocation

class ibm_watsonx_ai.helpers.connections.connections.DeploymentOutputAssetLocation(name, description='')[source]

Bases: BaseLocation

Connection class to data assets where output of batch deployment will be stored.

Parameters:
  • name (str) – name of CSV file to be saved as a data asset

  • description (str, optional) – description of the data asset
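
A hedged sketch of wiring batch deployment output to a new data asset. The keyword values are placeholders, and the library calls are assumptions shown commented out:

```python
# Placeholder values for the output data asset; only the name is required.
output_location_kwargs = {
    "name": "batch_output.csv",
    "description": "Scoring results from the batch deployment",
}

# With ibm_watsonx_ai installed (assumed usage):
# from ibm_watsonx_ai.helpers.connections import (
#     DataConnection, DeploymentOutputAssetLocation,
# )
# output_reference = DataConnection(
#     location=DeploymentOutputAssetLocation(**output_location_kwargs),
# )
```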

ContainerLocation

class ibm_watsonx_ai.helpers.connections.connections.ContainerLocation(path=None, **kwargs)[source]

Bases: BaseLocation

Connection class to default COS in user Project/Space.

get_location()[source]
prepend_container_id_to_path(container_id)[source]

Prepend project / space ID to path. For projects and spaces stored in shared buckets, their ID must be prepended to the path. The assignment is skipped if the path already starts with container_id.

Parameters:

container_id (str) – id of project / space
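
The skip-if-already-prefixed behavior described above can be illustrated in plain Python; this mirrors the documented semantics, not the library's actual implementation:

```python
def prepend_container_id(path, container_id):
    """Prepend the project/space ID unless the path already starts with it."""
    prefix = container_id + "/"
    if path == container_id or path.startswith(prefix):
        return path  # assignment skipped, as documented
    return prefix + path

prepend_container_id("data/train.csv", "abc123")         # -> "abc123/data/train.csv"
prepend_container_id("abc123/data/train.csv", "abc123")  # unchanged
```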

to_dict()[source]

Return a JSON dictionary representing this model.

GithubLocation

class ibm_watsonx_ai.helpers.connections.connections.GithubLocation(secret_manager_url, secret_id, path)[source]

Bases: BaseLocation

Connection class to a GitHub repository.

Parameters:
  • secret_manager_url (str) – URL of the Secrets Manager service where the GitHub PAT and repository URL are stored

  • secret_id (str) – ID of the secret containing the GitHub PAT and URL in the Secrets Manager

  • path (str) – path to the file within the GitHub repository
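
A hedged construction sketch; every identifier below is a placeholder, and the library calls are assumptions shown commented out:

```python
# Placeholder arguments for GithubLocation (values are illustrative only).
github_location_kwargs = {
    "secret_manager_url": "https://<secrets-manager-host>",
    "secret_id": "<secret_id>",
    "path": "data/train.csv",
}

# With ibm_watsonx_ai installed (assumed usage):
# from ibm_watsonx_ai.helpers.connections import DataConnection, GithubLocation
# github_reference = DataConnection(
#     location=GithubLocation(**github_location_kwargs),
# )
```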

to_dict()[source]

Return a JSON dictionary representing this model.