DataConnection Modules¶
DataConnection¶
- class ibm_watsonx_ai.helpers.connections.connections.DataConnection(location=None, connection=None, data_asset_id=None, connection_asset_id=None, **kwargs)[source]¶
Bases:
BaseDataConnection
Data storage connection class used to describe service training metadata (input data).
- Parameters:
connection (NFSConnection or ConnectionAsset, optional) – connection parameters of a specific type
location (Union[S3Location, FSLocation, AssetLocation]) – required location parameters of a specific type
data_asset_id (str, optional) – data asset ID, if the DataConnection should point to a data asset
- download(filename)[source]¶
Download a dataset stored in a remote data storage and save it to a file.
- Parameters:
filename (str) – path to the file where data will be downloaded
Examples
document_reference = DataConnection(
    connection_asset_id="<connection_id>",
    location=S3Location(bucket="<bucket_name>", path="path/to/file"),
)
document_reference.download(filename='results.json')
- download_folder(local_dir=None)[source]¶
Download files from a folder and its subfolders stored in a remote data storage and save them to a local directory.
- Parameters:
local_dir (str, optional) – path to the local directory where data will be downloaded, download to current working directory if not provided
Examples
folder_reference = DataConnection(
    connection_asset_id="<connection_id>",
    location=S3Location(bucket="<bucket_name>", path="path/to/folder"),
)
folder_reference.download_folder(local_dir="./data")
- classmethod from_dict(connection_data)[source]¶
Create a DataConnection object from a dictionary.
- Parameters:
connection_data (dict) – dictionary data structure with information about the data connection reference
- Returns:
DataConnection object
- Return type:
DataConnection
- classmethod from_studio(path)[source]¶
Create DataConnection objects from the credentials stored (connected) in Watson Studio. Applicable only to COS.
- Parameters:
path (str) – path in the COS bucket to the training dataset
- Returns:
list with DataConnection objects
- Return type:
list[DataConnection]
Example:
data_connections = DataConnection.from_studio(path='iris_dataset.csv')
- read(with_holdout_split=False, csv_separator=',', excel_sheet=None, encoding='utf-8', raw=False, binary=False, read_to_file=None, number_of_batch_rows=None, sampling_type=None, sample_size_limit=None, sample_rows_limit=None, sample_percentage_limit=None, **kwargs)[source]¶
Download a dataset stored in a remote data storage. Returns a batch of up to 1 GB of data.
- Parameters:
with_holdout_split (bool, optional) – if True, data will be split into train and holdout datasets, as it was during the AutoAI run
csv_separator (str, optional) – separator/delimiter for the CSV file
excel_sheet (str, optional) – name of the Excel sheet to use; applies only when an xlsx file is the input (referencing sheets by number is deprecated)
encoding (str, optional) – encoding type of the CSV file
raw (bool, optional) – if False, data is preprocessed (the same way as in the backend); if True, data is not preprocessed
binary (bool, optional) – retrieve data in binary mode; the result will be a Python bytes object
read_to_file (str, optional) – stream the data to a file under the path given as the value of this parameter; use this parameter to avoid keeping data in memory
number_of_batch_rows (int, optional) – number of rows to read in each batch when reading from the Flight connection
sampling_type (str, optional) – sampling strategy for reading the data
sample_size_limit (int, optional) – upper limit for the overall data to be downloaded, in bytes, default: 1 GB
sample_rows_limit (int, optional) – upper limit for the overall data to be downloaded, in number of rows
sample_percentage_limit (float, optional) – upper limit for the overall data to be downloaded, as a fraction of the whole dataset; must be a float between 0 and 1; ignored when the sampling_type parameter is set to first_n_records
Note
If more than one of sample_size_limit, sample_rows_limit, and sample_percentage_limit is set, the downloaded data is limited to the lowest of the thresholds.
- Returns:
one of the following:
pandas.DataFrame that contains the dataset from remote data storage: Xy_train
Tuple[pandas.DataFrame, pandas.DataFrame, pandas.DataFrame, pandas.DataFrame]: X_train, X_holdout, y_train, y_holdout
Tuple[pandas.DataFrame, pandas.DataFrame]: X_test, y_test, containing training and holdout data from remote storage (automatic holdout split from the backend, when only train data is provided)
bytes object, when binary=True
Examples
train_data_connections = optimizer.get_data_connections()

data = train_data_connections[0].read()  # all train data

# or
X_train, X_holdout, y_train, y_holdout = train_data_connections[0].read(with_holdout_split=True)  # train and holdout data
With user-defined train and test data:
optimizer.fit(
    training_data_reference=[DataConnection],
    training_results_reference=DataConnection,
    test_data_reference=DataConnection,
)

test_data_connection = optimizer.get_test_data_connections()
X_test, y_test = test_data_connection.read()  # only holdout data

# and
train_data_connections = optimizer.get_data_connections()
data = train_data_connections[0].read()  # only train data
- set_client(api_client=None, **kwargs)[source]¶
To enable write/read operations with a connection to a service, set an initialized service client in the connection.
- Parameters:
api_client (APIClient) – API client to connect to a service
Example:
DataConnection.set_client(api_client=api_client)
S3Location¶
- class ibm_watsonx_ai.helpers.connections.connections.S3Location(bucket, path, **kwargs)[source]¶
Bases:
BaseLocation
Connection class to a COS data storage in S3 format.
- Parameters:
bucket (str) – COS bucket name
path (str) – COS data path in the bucket
excel_sheet (str, optional) – name of the Excel sheet, if the chosen dataset uses an Excel file for batch deployment scoring
model_location (str, optional) – path to the pipeline model in the COS
training_status (str, optional) – path to the training status JSON in the COS
CloudAssetLocation¶
DeploymentOutputAssetLocation¶
- class ibm_watsonx_ai.helpers.connections.connections.DeploymentOutputAssetLocation(name, description='')[source]¶
Bases:
BaseLocation
Connection class to data assets where output of batch deployment will be stored.
- Parameters:
name (str) – name of CSV file to be saved as a data asset
description (str, optional) – description of the data asset
ContainerLocation¶
- class ibm_watsonx_ai.helpers.connections.connections.ContainerLocation(path=None, **kwargs)[source]¶
Bases:
BaseLocation
Connection class to the default COS in a user's project or space.
- prepend_container_id_to_path(container_id)[source]¶
Prepend the project / space ID to the path. For projects and spaces stored in shared buckets, their ID must be prepended to the path. The assignment is skipped if the path already starts with container_id.
- Parameters:
container_id (str) – ID of the project / space
GithubLocation¶
- class ibm_watsonx_ai.helpers.connections.connections.GithubLocation(secret_manager_url, secret_id, path)[source]¶
Bases:
BaseLocation
Connection class to a GitHub repository.
- Parameters:
secret_manager_url (str) – URL of the Secrets Manager service where the GitHub PAT and repository URL are stored
secret_id (str) – ID of the secret with the GitHub PAT and URL in the Secrets Manager
path (str) – path to the file within the GitHub repository