IterableDatasets Modules¶
Note
Added in 1.1.x release
Warning
Deprecated Encryption Algorithms for PDF Files
When working with PDF files using the ibm_watsonx_ai package, attempting to use outdated encryption algorithms, such as ARC4, might fail. This is because older algorithms that are considered weak ciphers are not loaded by default. For more information, see the cryptography library.
Manually decrypt and encrypt PDF files¶
If your PDF file uses an outdated encryption algorithm like ARC4, you need to decrypt it before processing. You can later re-encrypt it using a newer algorithm, like AES-256-R5.
1. Clear the CRYPTOGRAPHY_OPENSSL_NO_LEGACY environment variable before importing pypdf. This ensures the legacy OpenSSL provider can be loaded and older encryption algorithms are available.
import os

# remove the variable if it is set, so the legacy OpenSSL provider can be loaded
os.environ.pop('CRYPTOGRAPHY_OPENSSL_NO_LEGACY', None)
2. Decrypt the PDF file.
from pypdf import PdfReader, PdfWriter

reader = PdfReader("example.pdf")
if reader.is_encrypted:
    reader.decrypt("")  # use an empty string if there was no password

writer = PdfWriter(clone_from=reader)
with open("decrypted-pdf.pdf", "wb") as f:
    writer.write(f)
3. Optional: Encrypt the PDF file again using AES-256-R5.
writer.encrypt("", algorithm="AES-256-R5")
with open("example-encrypted.pdf", "wb") as f:
    writer.write(f)
TabularIterableDataset¶
- class ibm_watsonx_ai.data_loaders.datasets.tabular.TabularIterableDataset(connection, experiment_metadata=None, enable_sampling=True, sample_size_limit=1073741824, sampling_type='first_n_records', binary_data=False, number_of_batch_rows=None, stop_after_first_batch=False, total_size_limit=1073741824, total_nrows_limit=None, total_percentage_limit=1.0, apply_literal_eval=False, cast_strings=True, **kwargs)[source]¶
Bases:
object
Iterable class that downloads data in batches.
- Parameters:
connection (DataConnection) – connection to the dataset
experiment_metadata (dict, optional) – metadata retrieved from the experiment that created the model
enable_sampling (bool, optional) – if set to True, will enable sampling, default: True
sample_size_limit (int, optional) – upper limit for the overall data to be downloaded in bytes, default: 1 GB
sampling_type (str, optional) – a sampling strategy on how to read the data, check SamplingTypes enum class for more options
binary_data (bool, optional) – if set to True, the downloaded data will be treated as binary data
number_of_batch_rows (int, optional) – number of rows to read in each batch when reading from the flight connection
stop_after_first_batch (bool, optional) – if set to True, the loading will stop after downloading the first batch
total_size_limit (int, optional) – upper limit for overall data to be downloaded, in bytes, default: 1 GB; if more than one of total_size_limit, total_nrows_limit, total_percentage_limit is set, the data are limited to the lower threshold; if None, all data are downloaded in batches in the iterable_read method
total_nrows_limit (int, optional) – upper limit for overall data to be downloaded, in number of rows; if more than one of total_size_limit, total_nrows_limit, total_percentage_limit is set, the data are limited to the lower threshold
total_percentage_limit (float, optional) – upper limit for overall data to be downloaded, as a fraction of the whole dataset; must be a float between 0 and 1; if more than one of total_size_limit, total_nrows_limit, total_percentage_limit is set, the data are limited to the lower threshold
apply_literal_eval (bool, optional) – if set to True, ast.literal_eval is applied to all string columns
cast_strings (bool, optional) – if set to True, all string columns are cast to float or bool where applicable
Example:
experiment_metadata = {
    "prediction_column": 'species',
    "prediction_type": "classification",
    "project_id": os.environ.get('PROJECT_ID'),
    'credentials': credentials
}

connection = DataConnection(data_asset_id='5d99c11a-2060-4ef6-83d5-dc593c6455e2')
Example: default sampling - read first 1 GB of data
iterable_dataset = TabularIterableDataset(connection=connection,
                                          enable_sampling=True,
                                          sampling_type='first_n_records',
                                          sample_size_limit=1024 * 1024 * 1024,  # 1 GB
                                          experiment_metadata=experiment_metadata)
Example: read all data records in batches/no subsampling
iterable_dataset = TabularIterableDataset(connection=connection,
                                          enable_sampling=False,
                                          experiment_metadata=experiment_metadata)
Example: stratified/random sampling
iterable_dataset = TabularIterableDataset(connection=connection,
                                          enable_sampling=True,
                                          sampling_type='stratified',
                                          sample_size_limit=1024 * 1024 * 1024,  # 1 GB
                                          experiment_metadata=experiment_metadata)
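Example: iterate over the downloaded batches. This is a minimal sketch rather than part of the package documentation: it assumes that iterating a TabularIterableDataset yields the data batch by batch as pandas DataFrames, and the batch size is illustrative.
iterable_dataset = TabularIterableDataset(connection=connection,
                                          enable_sampling=False,
                                          number_of_batch_rows=10000,
                                          experiment_metadata=experiment_metadata)

for batch_df in iterable_dataset:
    # each yielded batch is assumed to be a pandas DataFrame
    print(batch_df.shape)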
- property connection¶
Get data connection.
- Returns:
connection used in data operations
- Return type:
FlightConnection | LocalBatchReader
Example:
dataset = TabularIterableDataset(...)
conn = dataset.connection

# Your code here...

conn.close()  # FlightConnection instances must be closed after use
- write(data=None, file_path=None)[source]¶
Writes data into the data source connection.
- Parameters:
data (DataFrame, optional) – structured data to be saved in the data source connection; either ‘data’ or ‘file_path’ must be provided
file_path (str, optional) – path to the local file to be saved in the data source connection (binary transfer); either ‘data’ or ‘file_path’ must be provided
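Example: a minimal sketch of both write modes; my_df and "training_data.csv" are illustrative names, not part of the API.
dataset = TabularIterableDataset(...)

# write structured data (a pandas DataFrame) to the data source connection
dataset.write(data=my_df)

# or upload a local file as binary data
dataset.write(file_path="training_data.csv")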
DocumentsIterableDataset¶
- class ibm_watsonx_ai.data_loaders.datasets.documents.DocumentsIterableDataset(*, connections, enable_sampling=True, include_subfolders=False, sample_size_limit=1073741824, sampling_type='random', total_size_limit=1073741824, total_ndocs_limit=None, benchmark_dataset=None, error_callback=None, **kwargs)[source]¶
Bases:
BaseDocumentsIterableDataset
This dataset is an iterable stream of documents that uses an underlying Flight Service. It downloads documents asynchronously and serves them from a generator.
- Supported types of documents:
text/plain (“.txt” file extension) - plain structured text
docx (“.docx” file extension) - standard Word style file
pdf (“.pdf” file extension) - standard pdf document
html (“.html” file extension) - saved HTML page
markdown (“.md” file extension) - plain text formatted with markdown
pptx (“.pptx” file extension) - standard PowerPoint style file
json (“.json” file extension) - standard json file
yaml (“.yaml” file extension) - standard yaml file
xml (“.xml” file extension) - standard xml file
csv (“.csv” file extension) - standard csv file
excel (“.xlsx” file extension) - standard Excel file
- Parameters:
connections (list[DataConnection]) – list of connections to the documents
enable_sampling (bool) – if set to True, will enable sampling, default: True
include_subfolders (bool, optional) – if set to True, all documents in subfolders of connections locations will be included, default: False
sample_size_limit (int) – upper limit for documents to be downloaded in bytes, default: 1 GB
sampling_type (str) – a sampling strategy on how to read the data, check the DocumentsSamplingTypes enum class for more options
total_size_limit (int) – upper limit for documents to be downloaded, in bytes, default: 1 GB; if more than one of total_size_limit, total_ndocs_limit is set, the data are limited to the lower threshold.
total_ndocs_limit (int, optional) – upper limit for documents to be downloaded, in number of documents; if more than one of total_size_limit, total_ndocs_limit is set, the data are limited to the lower threshold.
benchmark_dataset (pd.DataFrame, optional) – dataset of benchmarking data with IDs in the document_ids column corresponding to the names of documents in the connections list
error_callback (function (str, Exception) -> None, optional) – callback used to handle exceptions raised while loading documents; it is called with the document_id and the exception as arguments
api_client (APIClient, optional) – initialized APIClient object with a project or space ID set. If a DataConnection object in the connections list does not have an API client set, the api_client object is used for reading data.
Example: default sampling - read up to 1 GB of random documents
connections = [DataConnection(data_asset_id='5d99c11a-2060-4ef6-83d5-dc593c6455e2')]

iterable_dataset = DocumentsIterableDataset(connections=connections,
                                            enable_sampling=True,
                                            sampling_type='random',
                                            sample_size_limit=1024 * 1024 * 1024)  # 1 GB
Example: read all documents/no subsampling
connections = [DataConnection(data_asset_id='5d99c11a-2060-4ef6-83d5-dc593c6455e2')]

iterable_dataset = DocumentsIterableDataset(connections=connections,
                                            enable_sampling=False)
Example: context based sampling
connections = [DataConnection(data_asset_id='5d99c11a-2060-4ef6-83d5-dc593c6455e2')]

benchmark_dataset = pd.DataFrame(
    data={
        "question": ["What foundation models are available in watsonx.ai ?"],
        "correct_answers": [["The following models are available in watsonx.ai: ..."]],
        "correct_answer_document_ids": ["sample_pdf_file.pdf"],
    })

iterable_dataset = DocumentsIterableDataset(connections=connections,
                                            enable_sampling=True,
                                            sampling_type='benchmark_driven',
                                            sample_size_limit=1024 * 1024 * 1024,  # 1 GB
                                            benchmark_dataset=benchmark_dataset)
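Example: consume documents with an error callback. This is a minimal sketch: the callback signature follows the error_callback parameter described above, and it assumes that iterating the dataset yields the loaded documents one by one; the print calls are illustrative.
def on_document_error(document_id, exception):
    # called for each document that fails to load
    print(f"Failed to load {document_id}: {exception}")

connections = [DataConnection(data_asset_id='5d99c11a-2060-4ef6-83d5-dc593c6455e2')]

iterable_dataset = DocumentsIterableDataset(connections=connections,
                                            enable_sampling=False,
                                            error_callback=on_document_error)

for document in iterable_dataset:
    # process each document as it is served from the generator
    print(type(document))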