IterableDatasets Modules
Note
Added in 1.1.x release
TabularIterableDataset
- class ibm_watsonx_ai.data_loaders.datasets.tabular.TabularIterableDataset(connection, experiment_metadata=None, enable_sampling=True, sample_size_limit=1073741824, sampling_type='first_n_records', binary_data=False, number_of_batch_rows=None, stop_after_first_batch=False, total_size_limit=1073741824, total_nrows_limit=None, total_percentage_limit=1.0, apply_literal_eval=False, **kwargs)
Bases: IterableDataset
Iterable class that downloads data in batches.
- Parameters:
connection (DataConnection) – connection to the dataset
experiment_metadata (dict, optional) – metadata retrieved from the experiment that created the model
enable_sampling (bool, optional) – if set to True, will enable sampling, default: True
sample_size_limit (int, optional) – upper limit for the overall data to be downloaded in bytes, default: 1 GB
sampling_type (str, optional) – a sampling strategy for reading the data; check the SamplingTypes enum class for more options
binary_data (bool, optional) – if set to True, the downloaded data will be treated as binary data
number_of_batch_rows (int, optional) – number of rows to read in each batch when reading from the flight connection
stop_after_first_batch (bool, optional) – if set to True, the loading will stop after downloading the first batch
total_size_limit (int, optional) – upper limit for the overall data to be downloaded, in bytes, default: 1 GB. If more than one of total_size_limit, total_nrows_limit, and total_percentage_limit is set, the data is limited to the lowest threshold; if None, all data is downloaded in batches in the iterable_read method
total_nrows_limit (int, optional) – upper limit for the overall data to be downloaded, in number of rows. If more than one of total_size_limit, total_nrows_limit, and total_percentage_limit is set, the data is limited to the lowest threshold
total_percentage_limit (float, optional) – upper limit for the overall data to be downloaded, as a fraction of the whole dataset; must be a float between 0 and 1. If more than one of total_size_limit, total_nrows_limit, and total_percentage_limit is set, the data is limited to the lowest threshold
apply_literal_eval (bool, optional) – if set to True, ast.literal_eval will be applied to all string columns, as illustrated below
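For reference, ast.literal_eval parses a string containing a Python literal into the corresponding object, so string cells holding serialized lists or dicts come back as real Python values. A minimal standard-library illustration, independent of this class:

import ast

# Strings holding Python literals become the corresponding objects
ast.literal_eval("[1, 2, 3]")    # -> [1, 2, 3]
ast.literal_eval("{'a': 1}")     # -> {'a': 1}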
Example:
experiment_metadata = {
    "prediction_column": 'species',
    "prediction_type": "classification",
    "project_id": os.environ.get('PROJECT_ID'),
    'credentials': credentials
}
connection = DataConnection(data_asset_id='5d99c11a-2060-4ef6-83d5-dc593c6455e2')
Example: default sampling - read first 1 GB of data
iterable_dataset = TabularIterableDataset(connection=connection,
                                          enable_sampling=True,
                                          sampling_type='first_n_records',
                                          sample_size_limit=1073741824,  # 1 GB
                                          experiment_metadata=experiment_metadata)
Example: read all data records in batches/no subsampling
iterable_dataset = TabularIterableDataset(connection=connection,
                                          enable_sampling=False,
                                          experiment_metadata=experiment_metadata)
Example: stratified/random sampling
iterable_dataset = TabularIterableDataset(connection=connection,
                                          enable_sampling=True,
                                          sampling_type='stratified',
                                          sample_size_limit=1073741824,  # 1 GB
                                          experiment_metadata=experiment_metadata)
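Whichever configuration is used, the resulting object is consumed by iterating over it, which yields the data batch by batch. A minimal sketch, assuming each iteration yields a pandas DataFrame (the yield type is inferred from the tabular context, not stated above):

for batch in iterable_dataset:
    # each batch is assumed to be a pandas DataFrame chunk of the dataset
    print(batch.shape)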
- write(data=None, file_path=None)
Writes data into the data source connection.
- Parameters:
data (DataFrame, optional) – structured data to be saved in the data source connection; either 'data' or 'file_path' must be provided
file_path (str, optional) – path to the local file to be saved in the data source connection (binary transfer); either 'data' or 'file_path' must be provided
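A short usage sketch of both call styles; the DataFrame contents and the file name below are placeholders:

import pandas as pd

# Write an in-memory DataFrame back to the data source connection
iterable_dataset.write(data=pd.DataFrame({"species": ["setosa", "virginica"]}))

# ...or upload a local file as a binary transfer (hypothetical file name)
iterable_dataset.write(file_path="local_training_data.csv")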
DocumentsIterableDataset
- class ibm_watsonx_ai.data_loaders.datasets.documents.DocumentsIterableDataset(*, connections, enable_sampling=True, sample_size_limit=1073741824, sampling_type='random', total_size_limit=1073741824, total_ndocs_limit=None, benchmark_dataset=None, error_callback=None, **kwargs)
Bases: IterableDataset
This dataset is an iterable stream of documents backed by the underlying Flight Service. It downloads documents asynchronously and serves them from a generator.
- Supported types of documents:
text/plain (“.txt” file extension) - plain structured text
docx (“.docx” file extension) - standard Word file
pdf (“.pdf” file extension) - standard PDF document
html (“.html” file extension) - saved HTML page
markdown (“.md” file extension) - plain text formatted with Markdown
- Parameters:
connections (list[DataConnection]) – list of connections to the documents
enable_sampling (bool) – if set to True, will enable sampling, default: True
sample_size_limit (int) – upper limit for documents to be downloaded in bytes, default: 1 GB
sampling_type (str) – a sampling strategy on how to read the data, check the DocumentsSamplingTypes enum class for more options
total_size_limit (int) – upper limit for documents to be downloaded, in bytes, default: 1 GB. If both total_size_limit and total_ndocs_limit are set, the data is limited to the lower threshold
total_ndocs_limit (int, optional) – upper limit for documents to be downloaded, in number of documents. If both total_size_limit and total_ndocs_limit are set, the data is limited to the lower threshold
benchmark_dataset (pd.DataFrame, optional) – dataset of benchmarking data with IDs in the document_ids column corresponding to the names of documents in the connections list
error_callback (function (str, Exception) -> None, optional) – callback invoked when a document fails to load; it receives the document_id and the raised exception as arguments (see the sketch after this parameter list)
api_client (APIClient, optional) – initialized APIClient object with a project or space ID set. If a DataConnection object in the connections list does not have an API client set, this api_client object is used for reading data
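A minimal error callback matching the signature above might simply log the failure and let loading continue; the function name and the print-based logging are illustrative, not part of the API:

def log_document_error(document_id: str, exception: Exception) -> None:
    # called once per document that fails to load
    print(f"Failed to load document {document_id}: {exception}")

# pass it via: DocumentsIterableDataset(connections=connections, error_callback=log_document_error)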
Example: default sampling - read up to 1 GB of random documents
connections = [DataConnection(data_asset_id='5d99c11a-2060-4ef6-83d5-dc593c6455e2')]

iterable_dataset = DocumentsIterableDataset(connections=connections,
                                            enable_sampling=True,
                                            sampling_type='random',
                                            sample_size_limit=1073741824)  # 1 GB
Example: read all documents/no subsampling
connections = [DataConnection(data_asset_id='5d99c11a-2060-4ef6-83d5-dc593c6455e2')]

iterable_dataset = DocumentsIterableDataset(connections=connections,
                                            enable_sampling=False)
Example: context-based sampling
connections = [DataConnection(data_asset_id='5d99c11a-2060-4ef6-83d5-dc593c6455e2')]

benchmark_dataset = pd.DataFrame(
    data={
        "question": ["What foundation models are available in watsonx.ai ?"],
        "correct_answers": [["The following models are available in watsonx.ai: ..."]],
        "correct_answer_document_ids": ["sample_pdf_file.pdf"],
    })

iterable_dataset = DocumentsIterableDataset(connections=connections,
                                            enable_sampling=True,
                                            sampling_type='benchmark_driven',
                                            sample_size_limit=1073741824,  # 1 GB
                                            benchmark_dataset=benchmark_dataset)
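As with the tabular dataset, iterating over the instance streams the downloaded documents one by one. A minimal sketch; the exact type of each yielded item is assumed here, not specified above:

for document in iterable_dataset:
    # each item is assumed to carry the parsed content of one document
    print(document)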