Text Extractions

class ibm_watsonx_ai.foundation_models.extractions.TextExtractions(credentials=None, project_id=None, space_id=None, api_client=None)[source]

Bases: WMLResource

Instantiate the Text Extraction service.

Parameters:
  • credentials (Credentials, optional) – credentials to the Watson Machine Learning instance

  • project_id (str, optional) – ID of the Watson Studio project, defaults to None

  • space_id (str, optional) – ID of the Watson Studio space, defaults to None

  • api_client (APIClient, optional) – initialized APIClient object with a set project ID or space ID. If passed, credentials and project_id/space_id are not required, defaults to None

Raises:
  • InvalidMultipleArguments – raised if both space_id and project_id, or both credentials and api_client, are provided simultaneously

  • WMLClientError – raised if the Cloud Pak for Data (CPD) version is less than 5.0

Example:

from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models.extractions import TextExtractions

extraction = TextExtractions(
    credentials=Credentials(
        api_key="***",
        url="https://us-south.ml.cloud.ibm.com",
    ),
    project_id="*****",
)
delete_job(extraction_id)[source]

Delete a text extraction job.

Parameters:

extraction_id (str) – ID of the text extraction job

Returns:

"SUCCESS" if the deletion succeeds

Return type:

str

Example:

extraction.delete_job(extraction_id="<extraction_id>")
static get_id(extraction_details)[source]

Get the unique ID of a stored extraction request.

Parameters:

extraction_details (dict) – metadata of the stored extraction

Returns:

unique ID of the stored extraction request

Return type:

str

Example:

extraction_details = extraction.get_job_details(extraction_id)
extraction_id = extraction.get_id(extraction_details)
get_job_details(extraction_id=None, limit=None)[source]

Return text extraction job details. If extraction_id is None, returns the details of all text extraction jobs.

Parameters:
  • extraction_id (str | None, optional) – ID of the text extraction job, defaults to None

  • limit (int | None, optional) – limit number of fetched records, defaults to None

Returns:

details of the text extraction job

Return type:

dict

Example:

extraction.get_job_details(extraction_id="<extraction_id>")
get_results_reference(extraction_id)[source]

Get a DataConnection instance that is a reference to the results stored in IBM Cloud Object Storage (COS).

Parameters:

extraction_id (str) – ID of the text extraction job

Returns:

data connection referencing the location of the text extraction job results

Return type:

DataConnection

Example:

results_reference = extraction.get_results_reference(extraction_id="<extraction_id>")
list_jobs(limit=None)[source]

List text extraction jobs. If limit is None, all jobs will be listed.

Parameters:

limit (int | None, optional) – limit number of fetched records, defaults to None

Returns:

pandas DataFrame with text extraction job information

Return type:

pandas.DataFrame

Example:

extraction.list_jobs()
run_job(document_reference, results_reference, steps=None, results_format='json')[source]

Start a request to extract text and metadata from a document.

Parameters:
  • document_reference (DataConnection) – reference to the document in the bucket from which text will be extracted

  • results_reference (DataConnection) – reference to the location in the bucket where the results will be saved

  • steps (dict | None, optional) – steps for the text extraction pipeline, defaults to None

  • results_format (Literal["json", "markdown"], optional) – results format for the text extraction, defaults to "json"

Returns:

raw response from the server with the text extraction job details

Return type:

dict

Example:

from ibm_watsonx_ai.metanames import TextExtractionsMetaNames
from ibm_watsonx_ai.helpers import DataConnection, S3Location

document_reference = DataConnection(
    connection_asset_id="<connection_id>",
    location=S3Location(bucket="<bucket_name>", path="path/to/file"),
)

results_reference = DataConnection(
    connection_asset_id="<connection_id>",
    location=S3Location(bucket="<bucket_name>", path="path/to/file"),
)

response = extraction.run_job(
    document_reference=document_reference,
    results_reference=results_reference,
    steps={
        TextExtractionsMetaNames.OCR: {
            "process_image": True,
            "languages_list": ["en", "fr"],
        },
        TextExtractionsMetaNames.TABLE_PROCESSING: {"enabled": True},
    },
    results_format="markdown",
)

Enums

class metanames.TextExtractionsMetaNames[source]

Set of MetaNames for Text Extraction Steps.

Available MetaNames:

MetaName         | Type | Required | Example value
-----------------|------|----------|---------------------------------------------------
OCR              | dict | N        | {'process_images': True, 'languages_list': ['en']}
TABLE_PROCESSING | dict | N        | {'enabled': True}

Note

For more details about Text Extraction Steps, see https://cloud.ibm.com/apidocs/watsonx-ai