IterableDatasets Modules

Note

Added in 1.1.x release

TabularIterableDataset

class ibm_watsonx_ai.data_loaders.datasets.tabular.TabularIterableDataset(connection, experiment_metadata=None, enable_sampling=True, sample_size_limit=1073741824, sampling_type='first_n_records', binary_data=False, number_of_batch_rows=None, stop_after_first_batch=False, total_size_limit=1073741824, total_nrows_limit=None, total_percentage_limit=1.0, apply_literal_eval=False, **kwargs)

Bases: IterableDataset

Iterable dataset class that downloads data in batches.

Parameters:
  • connection (DataConnection) – connection to the dataset

  • experiment_metadata (dict, optional) – metadata retrieved from the experiment that created the model

  • enable_sampling (bool, optional) – if set to True, will enable sampling, default: True

  • sample_size_limit (int, optional) – upper limit for the overall data to be downloaded in bytes, default: 1 GB

  • sampling_type (str, optional) – a sampling strategy on how to read the data, check SamplingTypes enum class for more options

  • binary_data (bool, optional) – if set to True, the downloaded data will be treated as binary data

  • number_of_batch_rows (int, optional) – number of rows to read in each batch when reading from the flight connection

  • stop_after_first_batch (bool, optional) – if set to True, the loading will stop after downloading the first batch

  • total_size_limit (int, optional) – upper limit for the overall data to be downloaded, in bytes, default: 1 GB. If more than one of total_size_limit, total_nrows_limit, and total_percentage_limit is set, the data are limited to the lowest threshold. If None, all data are downloaded in batches in the iterable_read method

  • total_nrows_limit (int, optional) – upper limit for the overall data to be downloaded, in number of rows. If more than one of total_size_limit, total_nrows_limit, and total_percentage_limit is set, the data are limited to the lowest threshold

  • total_percentage_limit (float, optional) – upper limit for the overall data to be downloaded, as a fraction of the whole dataset; must be a float between 0 and 1. If more than one of total_size_limit, total_nrows_limit, and total_percentage_limit is set, the data are limited to the lowest threshold

  • apply_literal_eval (bool, optional) – if set to True, ast.literal_eval is applied to all string columns, default: False
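
The rule that the lowest of the configured limits wins can be sketched as follows. This is an illustration only, not the library's code: effective_row_limit, total_rows, and avg_row_bytes are hypothetical names, and converting the byte limit to rows through an average row size is an assumption made so that the three limits can be compared in the same unit.

```python
from typing import Optional


def effective_row_limit(total_rows: int,
                        avg_row_bytes: int,
                        total_size_limit: Optional[int] = 1024 ** 3,
                        total_nrows_limit: Optional[int] = None,
                        total_percentage_limit: float = 1.0) -> int:
    """Return the number of rows allowed by the strictest limit (hypothetical helper)."""
    if not 0 <= total_percentage_limit <= 1:
        raise ValueError("total_percentage_limit must be between 0 and 1")

    # The dataset size itself is always an upper bound.
    candidates = [total_rows]
    if total_size_limit is not None:
        # Assumption: approximate the byte limit as a row count.
        candidates.append(total_size_limit // avg_row_bytes)
    if total_nrows_limit is not None:
        candidates.append(total_nrows_limit)
    candidates.append(int(total_rows * total_percentage_limit))

    # When several limits are set, the lowest threshold applies.
    return min(candidates)


limit = effective_row_limit(total_rows=1_000_000,
                            avg_row_bytes=200,
                            total_size_limit=100 * 1024 ** 2,  # 100 MB
                            total_nrows_limit=400_000,
                            total_percentage_limit=0.5)
```

In this example the size limit allows about 524,000 rows and the percentage limit allows 500,000 rows, so the row limit of 400,000 is the one that takes effect.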

Example:

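import os

from ibm_watsonx_ai.helpers import DataConnection
from ibm_watsonx_ai.data_loaders.datasets.tabular import TabularIterableDataset

# `credentials` is assumed to be defined earlier
experiment_metadata = {
    "prediction_column": "species",
    "prediction_type": "classification",
    "project_id": os.environ.get("PROJECT_ID"),
    "credentials": credentials,
}

connection = DataConnection(data_asset_id='5d99c11a-2060-4ef6-83d5-dc593c6455e2')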

Example: default sampling - read first 1 GB of data

iterable_dataset = TabularIterableDataset(connection=connection,
                                          enable_sampling=True,
                                          sampling_type='first_n_records',
                                          sample_size_limit=1024 ** 3,  # 1 GB
                                          experiment_metadata=experiment_metadata)

Example: read all data records in batches/no subsampling

iterable_dataset = TabularIterableDataset(connection=connection,
                                          enable_sampling=False,
                                          experiment_metadata=experiment_metadata)

Example: stratified/random sampling

iterable_dataset = TabularIterableDataset(connection=connection,
                                          enable_sampling=True,
                                          sampling_type='stratified',
                                          sample_size_limit=1024 ** 3,  # 1 GB
                                          experiment_metadata=experiment_metadata)
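
To illustrate what number_of_batch_rows and stop_after_first_batch control, here is a minimal stand-alone sketch of batched reading over an in-memory list. It neither uses nor reproduces the library's implementation; read_in_batches and the sample rows are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_in_batches(rows: Iterable[dict],
                    number_of_batch_rows: int,
                    stop_after_first_batch: bool = False) -> Iterator[List[dict]]:
    """Yield lists of at most `number_of_batch_rows` rows (hypothetical helper)."""
    if number_of_batch_rows <= 0:
        raise ValueError("number_of_batch_rows must be positive")
    iterator = iter(rows)
    while True:
        # Take the next chunk of rows from the underlying stream.
        batch = list(islice(iterator, number_of_batch_rows))
        if not batch:
            return
        yield batch
        if stop_after_first_batch:
            # Stop once the first batch has been delivered.
            return


rows = [{"id": i} for i in range(5)]
all_batches = list(read_in_batches(rows, number_of_batch_rows=2))
first_only = list(read_in_batches(rows, number_of_batch_rows=2,
                                  stop_after_first_batch=True))
```

With five rows and a batch size of two, the first call produces three batches (the last one holding a single row), while the second call stops after the first batch.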

write(data=None, file_path=None)

Writes data into the data source connection.

Parameters:
  • data (DataFrame, optional) – structured data to be saved in the data source connection. Either ‘data’ or ‘file_path’ must be provided

  • file_path (str, optional) – path to the local file to be saved in the data source connection (binary transfer). Either ‘data’ or ‘file_path’ must be provided

DocumentsIterableDataset

class ibm_watsonx_ai.data_loaders.datasets.documents.DocumentsIterableDataset(*, connections, enable_sampling=True, sample_size_limit=1073741824, sampling_type='random', total_size_limit=1073741824, total_ndocs_limit=None, benchmark_dataset=None, error_callback=None, **kwargs)

Bases: IterableDataset

This dataset is an iterable stream of documents backed by the underlying Flight Service. It can download documents asynchronously and serve them to you from a generator.

Supported types of documents:
  • text/plain (“.txt” file extension) - plain structured text

  • docx (“.docx” file extension) - standard Word document

  • pdf (“.pdf” file extension) - standard PDF document

  • html (“.html” file extension) - saved HTML page

  • markdown (“.md” file extension) - plain text formatted with markdown
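
As a quick pre-check before building the connections list, the file extensions above can be matched against file names. This is only a sketch: is_supported_document and SUPPORTED_EXTENSIONS are hypothetical names, and the case-insensitive comparison is an assumption; the library may detect document types differently.

```python
from pathlib import PurePosixPath

# Extensions taken from the list of supported document types.
SUPPORTED_EXTENSIONS = {".txt", ".docx", ".pdf", ".html", ".md"}


def is_supported_document(name: str) -> bool:
    """Return True if the file name ends with a supported extension (hypothetical helper)."""
    # Assumption: extensions are compared case-insensitively.
    return PurePosixPath(name).suffix.lower() in SUPPORTED_EXTENSIONS


names = ["report.pdf", "notes.MD", "image.png", "index.html"]
supported = [name for name in names if is_supported_document(name)]
```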

Parameters:
  • connections (list[DataConnection]) – list of connections to the documents

  • enable_sampling (bool) – if set to True, will enable sampling, default: True

  • sample_size_limit (int) – upper limit for documents to be downloaded in bytes, default: 1 GB

  • sampling_type (str) – a sampling strategy on how to read the data, check the DocumentsSamplingTypes enum class for more options

  • total_size_limit (int) – upper limit for documents to be downloaded, in bytes, default: 1 GB. If both total_size_limit and total_ndocs_limit are set, the data are limited to the lower threshold

  • total_ndocs_limit (int, optional) – upper limit for documents to be downloaded, in number of documents. If both total_size_limit and total_ndocs_limit are set, the data are limited to the lower threshold

  • benchmark_dataset (pd.DataFrame, optional) – dataset of benchmarking data with IDs in the document_ids column corresponding to the names of documents in the connections list

  • error_callback (function (str, Exception) -> None, optional) – callback function for handling exceptions raised during document loading; it is called with the document_id and the exception as arguments

  • api_client (APIClient, optional) – initialized APIClient object with a project or space ID set. If a DataConnection object in the connections list does not have an API client set, the api_client object is used for reading data
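
An error_callback only needs to accept a document ID and an exception. The sketch below shows one possible shape that records failures in a dictionary instead of raising; make_error_collector is a hypothetical helper, and the final call merely simulates what would happen when a document fails to load.

```python
from typing import Callable, Dict, Tuple


def make_error_collector() -> Tuple[Callable[[str, Exception], None], Dict[str, Exception]]:
    """Create a callback with the (str, Exception) -> None shape and the dict it fills."""
    failures: Dict[str, Exception] = {}

    def error_callback(document_id: str, exception: Exception) -> None:
        # Remember which document failed and why, then continue.
        failures[document_id] = exception

    return error_callback, failures


error_callback, failures = make_error_collector()

# Simulated failure, for illustration only.
error_callback("sample_pdf_file.pdf", ValueError("could not parse document"))
```

The resulting function would be passed as error_callback=error_callback when constructing the dataset, and failures could be inspected after iteration.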

Example: default sampling - read up to 1 GB of random documents

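from ibm_watsonx_ai.helpers import DataConnection
from ibm_watsonx_ai.data_loaders.datasets.documents import DocumentsIterableDataset

connections = [DataConnection(data_asset_id='5d99c11a-2060-4ef6-83d5-dc593c6455e2')]

iterable_dataset = DocumentsIterableDataset(connections=connections,
                                            enable_sampling=True,
                                            sampling_type='random',
                                            sample_size_limit=1024 ** 3)  # 1 GB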

Example: read all documents/no subsampling

connections = [DataConnection(data_asset_id='5d99c11a-2060-4ef6-83d5-dc593c6455e2')]

iterable_dataset = DocumentsIterableDataset(connections=connections,
                                            enable_sampling=False)

Example: context-based sampling

import pandas as pd

connections = [DataConnection(data_asset_id='5d99c11a-2060-4ef6-83d5-dc593c6455e2')]

iterable_dataset = DocumentsIterableDataset(connections=connections,
                                            enable_sampling=True,
                                            sampling_type='benchmark_driven',
                                            sample_size_limit=1024 ** 3,  # 1 GB
                                            benchmark_dataset=pd.DataFrame(
                                                data={
                                                    "question": [
                                                        "What foundation models are available in watsonx.ai ?"
                                                    ],
                                                    "correct_answers": [
                                                        [
                                                            "The following models are available in watsonx.ai: ..."
                                                        ]
                                                    ],
                                                    "correct_answer_document_ids": ["sample_pdf_file.pdf"],
                                                }))