IterableDatasets Modules

Note

Added in the 1.1.x release

Warning

Deprecated Encryption Algorithms for PDF Files

When working with PDF files using the ibm_watsonx_ai package, attempting to use outdated encryption algorithms, such as ARC4, might fail. This is because old algorithms that are considered weak ciphers are not loaded by default. For more information, see the cryptography library documentation.

Manually decrypt and encrypt PDF files

If your PDF file uses an outdated encryption algorithm like ARC4, you need to decrypt it before processing. You can later re-encrypt it using a newer algorithm, like AES-256-R5.

1. Clear the CRYPTOGRAPHY_OPENSSL_NO_LEGACY environment variable before importing pypdf. This ensures the legacy OpenSSL provider can be loaded and older encryption algorithms are available.

import os

# Remove the variable if it is set; pop() avoids a KeyError when it is not.
os.environ.pop('CRYPTOGRAPHY_OPENSSL_NO_LEGACY', None)
2. Decrypt the PDF file.

from pypdf import PdfReader, PdfWriter

reader = PdfReader("example.pdf")
if reader.is_encrypted:
    reader.decrypt("")  # if there was no password

writer = PdfWriter(clone_from=reader)

with open("decrypted-pdf.pdf", "wb") as f:
    writer.write(f)
3. Optional: Encrypt the PDF file again using AES-256-R5.

writer.encrypt("", algorithm="AES-256-R5")  # first argument is the (empty) user password

with open("example-encrypted.pdf", "wb") as f:
    writer.write(f)
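
To confirm the result, you can open the re-encrypted file again with pypdf; a quick optional check:

from pypdf import PdfReader

check = PdfReader("example-encrypted.pdf")
if check.is_encrypted:
    check.decrypt("")  # the same empty user password used above
print(len(check.pages))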

TabularIterableDataset

class ibm_watsonx_ai.data_loaders.datasets.tabular.TabularIterableDataset(connection, experiment_metadata=None, enable_sampling=True, sample_size_limit=1073741824, sampling_type='first_n_records', binary_data=False, number_of_batch_rows=None, stop_after_first_batch=False, total_size_limit=1073741824, total_nrows_limit=None, total_percentage_limit=1.0, apply_literal_eval=False, cast_strings=True, **kwargs)[source]

Bases: object

Iterable class that downloads data in batches.

Parameters:
  • connection (DataConnection) – connection to the dataset

  • experiment_metadata (dict, optional) – metadata retrieved from the experiment that created the model

  • enable_sampling (bool, optional) – if set to True, will enable sampling, default: True

  • sample_size_limit (int, optional) – upper limit for the overall data to be downloaded in bytes, default: 1 GB

  • sampling_type (str, optional) – a sampling strategy on how to read the data, check SamplingTypes enum class for more options

  • binary_data (bool, optional) – if set to True, the downloaded data will be treated as binary data

  • number_of_batch_rows (int, optional) – number of rows to read in each batch when reading from the flight connection

  • stop_after_first_batch (bool, optional) – if set to True, the loading will stop after downloading the first batch

  • total_size_limit (int, optional) – upper limit for the overall data to be downloaded, in bytes, default: 1 GB. If more than one of total_size_limit, total_nrows_limit, total_percentage_limit is set, data are limited to the lower threshold (see the sketch after this list). If None, all data are downloaded in batches in the iterable_read method.

  • total_nrows_limit (int, optional) – upper limit for the overall data to be downloaded, as a number of rows. If more than one of total_size_limit, total_nrows_limit, total_percentage_limit is set, data are limited to the lower threshold.

  • total_percentage_limit (float, optional) – upper limit for the overall data to be downloaded, as a fraction of the whole dataset; must be a float between 0 and 1. If more than one of total_size_limit, total_nrows_limit, total_percentage_limit is set, data are limited to the lower threshold.

  • apply_literal_eval (bool, optional) – if set to True, ast.literal_eval is applied to all string columns.

  • cast_strings (bool, optional) – if set to True, all string columns are cast to float or bool where applicable.
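
When more than one of these limits is set, the download stops at whichever threshold is reached first. A minimal sketch (connection and experiment_metadata are assumed to be defined as in the examples below):

iterable_dataset = TabularIterableDataset(connection=connection,
                                          total_size_limit=512 * 1024 * 1024,  # 512 MB
                                          total_nrows_limit=100_000,  # whichever limit is hit first wins
                                          experiment_metadata=experiment_metadata)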

Example:

experiment_metadata = {
    "prediction_column": "species",
    "prediction_type": "classification",
    "project_id": os.environ.get("PROJECT_ID"),
    "credentials": credentials
}

connection = DataConnection(data_asset_id='5d99c11a-2060-4ef6-83d5-dc593c6455e2')

Example: default sampling - read first 1 GB of data

iterable_dataset = TabularIterableDataset(connection=connection,
                                          enable_sampling=True,
                                          sampling_type='first_n_records',
                                          sample_size_limit=1024 * 1024 * 1024,  # 1 GB
                                          experiment_metadata=experiment_metadata)

Example: read all data records in batches/no subsampling

iterable_dataset = TabularIterableDataset(connection=connection,
                                          enable_sampling=False,
                                          experiment_metadata=experiment_metadata)

Example: stratified/random sampling

iterable_dataset = TabularIterableDataset(connection=connection,
                                          enable_sampling=True,
                                          sampling_type='stratified',
                                          sample_size_limit=1024 * 1024 * 1024,  # 1 GB
                                          experiment_metadata=experiment_metadata)
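
Once constructed, the dataset is consumed by iterating over it, which yields the downloaded data batch by batch. A minimal sketch, assuming each batch arrives as a pandas DataFrame:

n_rows = 0
for batch in iterable_dataset:
    # process one downloaded batch at a time
    n_rows += len(batch)
print(f"Read {n_rows} rows in total")
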
property connection

Get data connection.

Returns:

connection used in data operations

Return type:

FlightConnection | LocalBatchReader

Example:

dataset = TabularIterableDataset(...)
conn = dataset.connection

try:
    ...  # your code here
finally:
    conn.close()  # FlightConnection instances must be closed after use
write(data=None, file_path=None)[source]

Writes data into the data source connection.

Parameters:
  • data (DataFrame, optional) – structured data to be saved in the data source connection; either 'data' or 'file_path' must be provided

  • file_path (str, optional) – path to a local file to be saved in the source data connection (binary transfer); either 'data' or 'file_path' must be provided
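
A brief usage sketch of write; dataset is a TabularIterableDataset created as above, and the DataFrame contents and file name are illustrative only:

import pandas as pd

# write structured data back to the data source...
dataset.write(data=pd.DataFrame({"species": ["setosa", "virginica"]}))

# ...or upload a local file as a binary transfer
dataset.write(file_path="local_copy.csv")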

DocumentsIterableDataset

class ibm_watsonx_ai.data_loaders.datasets.documents.DocumentsIterableDataset(*, connections, enable_sampling=True, include_subfolders=False, sample_size_limit=1073741824, sampling_type='random', total_size_limit=1073741824, total_ndocs_limit=None, benchmark_dataset=None, error_callback=None, **kwargs)[source]

Bases: BaseDocumentsIterableDataset

This dataset is an iterable stream of documents backed by an underlying Flight Service. It downloads documents asynchronously and serves them from a generator.

Supported types of documents:
  • text/plain (".txt" file extension) - plain structured text

  • docx (".docx" file extension) - standard Word style file

  • pdf (".pdf" file extension) - standard pdf document

  • html (".html" file extension) - saved HTML page

  • markdown (".md" file extension) - plain text formatted with markdown

  • pptx (".pptx" file extension) - standard PowerPoint style file

  • json (".json" file extension) - standard json file

  • yaml (".yaml" file extension) - standard yaml file

  • xml (".xml" file extension) - standard xml file

  • csv (".csv" file extension) - standard csv file

  • excel (".xlsx" file extension) - standard Excel file

Parameters:
  • connections (list[DataConnection]) – list of connections to the documents

  • enable_sampling (bool) – if set to True, will enable sampling, default: True

  • include_subfolders (bool, optional) – if set to True, all documents in subfolders of connections locations will be included, default: False

  • sample_size_limit (int) – upper limit for documents to be downloaded in bytes, default: 1 GB

  • sampling_type (str) – a sampling strategy on how to read the data, check the DocumentsSamplingTypes enum class for more options

  • total_size_limit (int) – upper limit for documents to be downloaded, in bytes, default: 1 GB. If both total_size_limit and total_ndocs_limit are set, data are limited to the lower threshold.

  • total_ndocs_limit (int, optional) – upper limit for documents to be downloaded, as a number of documents. If both total_size_limit and total_ndocs_limit are set, data are limited to the lower threshold.

  • benchmark_dataset (pd.DataFrame, optional) – dataset of benchmarking data with IDs in the document_ids column corresponding to the names of documents in the connections list

  • error_callback (function (str, Exception) -> None, optional) – error callback function for handling exceptions raised during document loading; the document_id and the exception are passed as arguments (see the sketch after this list)

  • api_client (APIClient, optional) – initialized APIClient object with a project or space ID set. If a DataConnection object in the connections list does not have an API client set, the api_client object is used for reading data.
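
For example, an error callback matching the (str, Exception) -> None signature can log the failure and let loading continue. A minimal sketch (the handling shown is illustrative; connections is defined as in the examples below):

def on_document_error(document_id, exception):
    # called once per document that fails to load
    print(f"Skipping {document_id}: {exception}")

iterable_dataset = DocumentsIterableDataset(connections=connections,
                                            error_callback=on_document_error)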

Example: default sampling - read up to 1 GB of random documents

connections = [DataConnection(data_asset_id='5d99c11a-2060-4ef6-83d5-dc593c6455e2')]

iterable_dataset = DocumentsIterableDataset(connections=connections,
                                            enable_sampling=True,
                                            sampling_type='random',
                                            sample_size_limit=1024 * 1024 * 1024)  # 1 GB

Example: read all documents/no subsampling

connections = [DataConnection(data_asset_id='5d99c11a-2060-4ef6-83d5-dc593c6455e2')]

iterable_dataset = DocumentsIterableDataset(connections=connections,
                                            enable_sampling=False)

Example: context-based sampling

connections = [DataConnection(data_asset_id='5d99c11a-2060-4ef6-83d5-dc593c6455e2')]

iterable_dataset = DocumentsIterableDataset(connections=connections,
                                            enable_sampling=True,
                                            sampling_type='benchmark_driven',
                                            sample_size_limit=1024 * 1024 * 1024,  # 1 GB
                                            benchmark_dataset=pd.DataFrame(
                                                data={
                                                    "question": [
                                                        "What foundation models are available in watsonx.ai ?"
                                                    ],
                                                    "correct_answers": [
                                                        [
                                                            "The following models are available in watsonx.ai: ..."
                                                        ]
                                                    ],
                                                    "correct_answer_document_ids": ["sample_pdf_file.pdf"],
                                                }))
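
As with TabularIterableDataset, documents are consumed by iterating over the dataset, which serves each document from its generator as the asynchronous downloads complete. A minimal sketch:

n_documents = 0
for document in iterable_dataset:
    # handle one loaded document at a time
    n_documents += 1
print(f"Loaded {n_documents} documents")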