Text Extraction Service

The TextExtractionService provides functionality to extract text from documents stored in IBM Cloud Object Storage (COS) using IBM watsonx.ai. It converts business documents into simpler formats (Markdown, JSON, HTML, plain text) suitable for AI pipelines, and can optionally extract structured key-value pair data from documents.

Quick Start

TextExtractionService service = TextExtractionService.builder()
    .apiKey(WATSONX_API_KEY)
    .projectId(WATSONX_PROJECT_ID)
    .baseUrl(CloudRegion.DALLAS)
    .cosUrl(CLOUD_OBJECT_STORAGE_URL)
    .documentReference(INPUT_CONNECTION_ID, INPUT_BUCKET_NAME)
    .resultReference(OUTPUT_CONNECTION_ID, OUTPUT_BUCKET_NAME)
    .build();

TextExtractionParameters parameters = TextExtractionParameters.builder()
    .requestedOutputs(Type.MD)
    .mode(Mode.HIGH_QUALITY)
    .languages(Language.ENGLISH)
    .build();

String text = service.uploadExtractAndFetch(new File("path/to/file.pdf"), parameters);
System.out.println(text);
// → # Contract
// → ...

Overview

The TextExtractionService enables you to:

  • Extract text from documents into Markdown, JSON, HTML, or plain text formats.
  • Upload local files or input streams directly to COS before extraction.
  • Run extraction synchronously or asynchronously.
  • Extract structured key-value pair data with pre-defined or custom schemas.
  • Configure OCR settings for language, rotation correction, and processing mode.
  • Control output format, DPI, embedded images, and token output.
  • Automatically clean up uploaded input and/or output files after processing.

Service Configuration

Basic Setup

TextExtractionService service = TextExtractionService.builder()
    .apiKey(WATSONX_API_KEY)
    .projectId(WATSONX_PROJECT_ID)
    .baseUrl("https://us-south.ml.cloud.ibm.com") // or use CloudRegion
    .cosUrl("https://s3.us-south.cloud-object-storage.appdomain.cloud") // or use CosUrl
    .documentReference(INPUT_CONNECTION_ID, INPUT_BUCKET_NAME)
    .resultReference(OUTPUT_CONNECTION_ID, OUTPUT_BUCKET_NAME)
    .build();

Using a Separate COS Authenticator

If your Cloud Object Storage uses different credentials than your watsonx.ai service, provide a dedicated cosAuthenticator:

TextExtractionService service = TextExtractionService.builder()
    .apiKey(WATSONX_API_KEY)
    .cosAuthenticator(IBMCloudAuthenticator.withKey(COS_API_KEY))
    .projectId(WATSONX_PROJECT_ID)
    .baseUrl(WATSONX_URL)
    .cosUrl(CLOUD_OBJECT_STORAGE_URL)
    .documentReference(INPUT_CONNECTION_ID, INPUT_BUCKET_NAME)
    .resultReference(OUTPUT_CONNECTION_ID, OUTPUT_BUCKET_NAME)
    .build();

Builder Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| apiKey | String | Conditional | API key for IBM Cloud authentication |
| authenticator | Authenticator | Conditional | Custom authentication (alternative to apiKey) |
| cosAuthenticator | Authenticator | No | Separate authenticator for COS operations (defaults to main authenticator) |
| projectId | String | Conditional | Project ID where extraction will be performed |
| spaceId | String | Conditional | Space ID (alternative to projectId) |
| baseUrl | String/CloudRegion | Yes | watsonx.ai service base URL |
| cosUrl | String/CosUrl | Yes | Cloud Object Storage base URL |
| documentReference | CosReference | Yes | Connection ID and bucket containing input documents |
| resultReference | CosReference | Yes | Connection ID and bucket where extracted results are stored |
| timeout | Duration | No | Request timeout (default: 60 seconds) |
| logRequests | Boolean | No | Enable request logging (default: false) |
| logResponses | Boolean | No | Enable response logging (default: false) |
| httpClient | HttpClient | No | Custom HTTP client |
| verifySsl | Boolean | No | SSL certificate verification (default: true) |
| version | String | No | API version override |

Either apiKey or authenticator must be provided. Either projectId or spaceId must be specified.
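
The rules above can be checked before the builder is called, so that every missing setting is reported in one pass. The following is a minimal sketch in plain Java with no SDK types. The environment variable names and the class name ExtractionConfigCheck are assumptions made for illustration, and the sketch covers only the apiKey path, not a custom authenticator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ExtractionConfigCheck {

    // Hypothetical variable names for the settings marked "Yes" in the table above.
    static final List<String> REQUIRED = List.of(
            "WATSONX_URL", "CLOUD_OBJECT_STORAGE_URL",
            "INPUT_CONNECTION_ID", "INPUT_BUCKET_NAME",
            "OUTPUT_CONNECTION_ID", "OUTPUT_BUCKET_NAME");

    /** Lists every configuration problem so they can all be reported at once. */
    static List<String> problems(Map<String, String> env) {
        List<String> problems = new ArrayList<>();
        // Conditional rule 1: apiKey or authenticator (only apiKey is modelled here).
        if (isBlank(env.get("WATSONX_API_KEY"))) {
            problems.add("WATSONX_API_KEY is missing");
        }
        // Conditional rule 2: at least one of projectId or spaceId.
        if (isBlank(env.get("WATSONX_PROJECT_ID")) && isBlank(env.get("WATSONX_SPACE_ID"))) {
            problems.add("set WATSONX_PROJECT_ID or WATSONX_SPACE_ID");
        }
        for (String name : REQUIRED) {
            if (isBlank(env.get(name))) {
                problems.add(name + " is missing");
            }
        }
        return problems;
    }

    private static boolean isBlank(String value) {
        return value == null || value.trim().isEmpty();
    }

    public static void main(String[] args) {
        List<String> problems = problems(System.getenv());
        if (problems.isEmpty()) {
            System.out.println("Configuration looks complete.");
        } else {
            problems.forEach(System.err::println);
        }
    }
}
```

Running the check at start-up keeps a missing setting from surfacing later as a less specific builder or HTTP error.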


Examples

Synchronous Extraction

uploadExtractAndFetch uploads a document, runs extraction, and returns the text content in one call. If no output format is specified, it defaults to Markdown.

From a local file:

String text = service.uploadExtractAndFetch(new File("path/to/file.pdf"));

With parameters:

var parameters = TextExtractionParameters.builder()
    .requestedOutputs(Type.MD)
    .mode(Mode.HIGH_QUALITY)
    .languages(Language.ENGLISH)
    .build();

String text = service.uploadExtractAndFetch(new File("path/to/file.pdf"), parameters);

From an InputStream — useful for documents from web uploads or streaming sources:

TextExtractionParameters parameters = TextExtractionParameters.builder()
    .requestedOutputs(Type.MD)
    .mode(Mode.HIGH_QUALITY)
    .build();

String text = service.uploadExtractAndFetch(inputStream, "fileName.pdf", parameters);

From a file already in COS — skip the upload step entirely:

String text = service.extractAndFetch("path/to/cosFile.pdf");

Automatic file cleanup — use removeUploadedFile and removeOutputFile to delete COS files asynchronously after extraction:

var parameters = TextExtractionParameters.builder()
    .requestedOutputs(Type.MD)
    .removeUploadedFile(true)   // delete input file after extraction
    .removeOutputFile(true)     // delete output file after reading
    .build();

String text = service.uploadExtractAndFetch(new File("path/to/file.pdf"), parameters);

Note: removeUploadedFile and removeOutputFile are only supported with the synchronous variants (uploadExtractAndFetch / extractAndFetch). They cannot be used with uploadAndStartExtraction.

Asynchronous Extraction

For long-running operations, start the job and poll until it completes:

var parameters = TextExtractionParameters.builder()
    .requestedOutputs(Type.MD)
    .mode(Mode.HIGH_QUALITY)
    .languages(Language.ENGLISH)
    .build();

TextExtractionResponse response = service.uploadAndStartExtraction(new File("path/to/file.pdf"), parameters);

String requestId = response.metadata().id();
String status = response.entity().results().status();

while (!status.equals(Status.COMPLETED.value()) && !status.equals(Status.FAILED.value())) {
    Thread.sleep(2000);
    response = service.fetchExtractionRequest(requestId);
    status = response.entity().results().status();
}

if (status.equals(Status.COMPLETED.value())) {
    String outputPath = response.entity().resultsReference().location().fileName();
    String text = service.readFile(OUTPUT_BUCKET_NAME, outputPath);
    System.out.println(text);
} else {
    System.err.println("Failed: " + response.entity().results().error().message());
}

Note: Extraction results are retained for 2 days. After that, fetchExtractionRequest will no longer return results for the given ID.
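
The loop above polls at a fixed two-second interval and has no upper bound on how long it waits. One possible refinement, sketched here in plain Java with no SDK types, adds a deadline and exponential backoff. The helper name pollUntilTerminal is an assumption of this sketch, and the Supplier<String> stands in for a call such as fetchExtractionRequest(requestId).entity().results().status().

```java
import java.time.Duration;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

final class PollingSketch {

    // Terminal states, taken from the status values listed for ExtractionResult.
    static final Set<String> TERMINAL = Set.of("completed", "failed");

    /**
     * Calls fetchStatus until it returns a terminal status, doubling the wait
     * between attempts up to maxDelay. Throws IllegalStateException if the
     * deadline passes first or the thread is interrupted.
     */
    static String pollUntilTerminal(Supplier<String> fetchStatus, Duration initialDelay,
                                    Duration maxDelay, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        long delayMs = Math.max(1, initialDelay.toMillis());
        long maxDelayMs = Math.max(1, maxDelay.toMillis());
        while (true) {
            String status = fetchStatus.get();
            if (status != null && TERMINAL.contains(status)) {
                return status;
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new IllegalStateException("Timed out; last status: " + status);
            }
            try {
                Thread.sleep(delayMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                throw new IllegalStateException("Interrupted while polling", e);
            }
            delayMs = Math.min(delayMs * 2, maxDelayMs);
        }
    }

    public static void main(String[] args) {
        // A canned sequence of statuses stands in for the remote service.
        Iterator<String> fake = List.of("submitted", "running", "completed").iterator();
        String status = pollUntilTerminal(fake::next,
                Duration.ofMillis(1), Duration.ofMillis(4), Duration.ofSeconds(5));
        System.out.println(status);
    }
}
```

Passing the status lookup as a Supplier keeps the retry policy separate from the service call, which also makes the loop testable without network access.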

Multiple Output Formats

Request multiple output formats in a single extraction using uploadAndStartExtraction. Set outputFileName to a directory path ending with / to group all outputs together:

var parameters = TextExtractionParameters.builder()
    .requestedOutputs(Type.PLAIN_TEXT, Type.JSON, Type.HTML)
    .mode(Mode.HIGH_QUALITY)
    .outputFileName("output/")
    .build();

TextExtractionResponse response = service.uploadAndStartExtraction(new File("path/to/file.pdf"), parameters);

// Wait for completion (see Asynchronous Extraction), then read each output file
// Files will be: output/plain.txt, output/assembly.json, output/assembly.html
String plainText = service.readFile(OUTPUT_BUCKET_NAME, "output/plain.txt");
String json      = service.readFile(OUTPUT_BUCKET_NAME, "output/assembly.json");
String html      = service.readFile(OUTPUT_BUCKET_NAME, "output/assembly.html");

OCR from Images

Use Mode.HIGH_QUALITY for best OCR results on image files:

var parameters = TextExtractionParameters.builder()
    .mode(Mode.HIGH_QUALITY)
    .requestedOutputs(Type.PLAIN_TEXT)
    .build();

String text = service.uploadExtractAndFetch(new File("path/to/image.png"), parameters);

Managing Requests

Use deleteRequest to cancel or remove an extraction job. Pass hardDelete(true) to also remove the job metadata:

TextExtractionResponse response = service.uploadAndStartExtraction(new File("invoice.pdf"));

boolean deleted = service.deleteRequest(
    response.metadata().id(),
    TextExtractionDeleteParameters.builder()
        .hardDelete(true)
        .build()
);

System.out.println("Deleted: " + deleted); // → true

Output Files

File Naming Conventions

The output file name is derived from the input file name. When requesting a single output, the extension is replaced automatically:

| Output Type | Output File Name |
|---|---|
| Type.MD | <input_name>.md |
| Type.HTML | <input_name>.html |
| Type.PLAIN_TEXT | <input_name>.txt |
| Type.JSON | <input_name>.json |
| Type.PAGE_IMAGES | page_images/<page>.png |

When requesting multiple outputs, set outputFileName to a directory path ending with /. All output files are written into that directory using their default names. For example, with outputFileName("results/") and outputs PLAIN_TEXT, JSON, HTML:

results/plain.txt
results/assembly.json
results/assembly.html
results/embedded_images_assembly/*.png   (if embedded images are enabled)
results/page_images/*.png                (if PAGE_IMAGES is requested)

If outputFileName is not set, output files are written to the root of the resultReference bucket.
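
The naming conventions above can be captured in a small helper that predicts which path to pass to readFile. This is only a sketch of the documented rules: the OutputType enum is a local stand-in rather than the SDK's Type, both method names are hypothetical, and how the service treats edge cases such as input names without an extension is an assumption.

```java
final class OutputNamingSketch {

    /** Local stand-in for the SDK's Type enum; only the file extension is modelled. */
    enum OutputType {
        MD("md"), HTML("html"), PLAIN_TEXT("txt"), JSON("json");

        final String extension;

        OutputType(String extension) {
            this.extension = extension;
        }
    }

    /** Single output: the input extension is replaced by the output type's extension. */
    static String singleOutputName(String inputName, OutputType type) {
        int slash = inputName.lastIndexOf('/');
        int dot = inputName.lastIndexOf('.');
        // Only treat the dot as an extension separator if it sits inside the file name.
        String stem = dot > slash + 1 ? inputName.substring(0, dot) : inputName;
        return stem + "." + type.extension;
    }

    /** Multiple outputs: default file names inside a directory that ends with '/'. */
    static String directoryOutputName(String directory, OutputType type) {
        if (!directory.endsWith("/")) {
            throw new IllegalArgumentException("directory must end with '/'");
        }
        switch (type) {
            case PLAIN_TEXT:
                return directory + "plain.txt";
            case JSON:
                return directory + "assembly.json";
            case HTML:
                return directory + "assembly.html";
            default:
                // The page lists default names only for the three types above.
                throw new IllegalArgumentException("no documented default name for " + type);
        }
    }

    public static void main(String[] args) {
        System.out.println(singleOutputName("invoice.pdf", OutputType.MD));
        System.out.println(directoryOutputName("results/", OutputType.JSON));
    }
}
```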

Output Types

| Value | API String | Description |
|---|---|---|
| Type.JSON | assembly | Full structured JSON output including KVP data. Required for key-value pair results |
| Type.MD | md | Markdown (default) |
| Type.HTML | html | HTML |
| Type.PLAIN_TEXT | plain_text | Plain text |
| Type.PAGE_IMAGES | page_images | Individual page images. Cannot be used with uploadExtractAndFetch |

Embedded Images

The createEmbeddedImages parameter controls how images embedded in the document are handled in the extracted output. It applies to the Markdown and JSON formats.

| Value | Image in output | Markdown output | JSON output |
|---|---|---|---|
| DISABLED | No | None | None |
| ENABLED_PLACEHOLDER | Yes | Link to image location | Image in pictures structure; picture.text empty; generic placeholder token IDs in picture.children_ids |
| ENABLED_TEXT | Yes | Text extracted directly from the image | Image in pictures; OCR text in picture.text; token IDs in picture.children_ids |
| ENABLED_VERBALIZATION | Yes | Link + textual description of the image | Image in pictures; natural language description in picture.verbalization (only for verbalized images); token IDs in picture.children_ids |
| ENABLED_VERBALIZATION_ALL | Yes | Link + textual description of the image | Same as ENABLED_VERBALIZATION, but all embedded images are verbalized, not just graphs, charts, and screenshots |

Images extracted in any enabled mode are stored as .png files in the embedded_images_assembly/ folder within the output location.


Key-Value Pair Extraction

KVP extraction pulls structured field data out of documents alongside the text. It requires kvpMode(KvpMode.GENERIC_WITH_SEMANTIC) and Type.JSON as output format — results are only included in the JSON output.

Basic KVP Extraction

Define a schema, attach it to a TextExtractionSemanticConfig, and pass it to the parameters:

KvpFields fields = KvpFields.builder()
    .add("invoice_date",   KvpField.of("The date when the invoice was issued.", "2024-07-10"))
    .add("invoice_number", KvpField.of("The unique invoice identifier.", "INV-2024-001"))
    .add("total_amount",   KvpField.of("The total amount due.", "1250.50"))
    .build();

Schema schema = Schema.builder()
    .documentType("Invoice")
    .documentDescription("A vendor-issued invoice listing purchased items, prices, and payment information.")
    .fields(fields)
    .build();

TextExtractionSemanticConfig semanticConfig = TextExtractionSemanticConfig.builder()
    .enableSchemaKvp(true)
    .schemasMergeStrategy(SchemaMergeStrategy.REPLACE)
    .schemas(schema)
    .build();

TextExtractionParameters parameters = TextExtractionParameters.builder()
    .mode(Mode.HIGH_QUALITY)
    .requestedOutputs(Type.JSON)
    .kvpMode(KvpMode.GENERIC_WITH_SEMANTIC)
    .languages(Language.ENGLISH)
    .semanticConfig(semanticConfig)
    .build();

String json = service.uploadExtractAndFetch(new File("invoice.pdf"), parameters);

Extraction Methods

Two methods can be enabled independently or together when using GENERIC_WITH_SEMANTIC:

| Method | Parameter | Behaviour |
|---|---|---|
| Schema-based | enableSchemaKvp(true) | Classifies each page into a schema type and extracts only the defined fields. Higher accuracy for known document types |
| Generic | enableGenericKvp(true) | Broad sweep: extracts any labelled data regardless of schema. Useful for unknown document formats |

Both are active by default. If you only want schema-based results, set enableGenericKvp(false) to avoid duplicate extractions.

Schema Merge Strategy

The schemasMergeStrategy setting controls how custom schemas interact with the built-in pre-defined ones:

| Strategy | Behaviour | When to use |
|---|---|---|
| SchemaMergeStrategy.REPLACE | Only your custom schemas are used; all pre-defined schemas are ignored | You have a known document format with unique fields, or your custom schema conflicts with a pre-defined one |
| SchemaMergeStrategy.MERGE | Your custom schemas are combined with the pre-defined ones | You want to supplement pre-defined document types with additional custom schemas |
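
To make the difference between the two strategies concrete, the sketch below models schemas purely as sets of document-type names. It is an illustration of the behaviour described in the table, not the service's implementation: the Strategy enum is a local stand-in, the pre-defined type names in main are invented, and how the service resolves a name clash under MERGE is not stated on this page, so the sketch only detects clashes.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class MergeStrategySketch {

    /** Local stand-in for SchemaMergeStrategy. */
    enum Strategy { REPLACE, MERGE }

    /** Document types that are candidates for classification under the given strategy. */
    static Set<String> effectiveTypes(Strategy strategy, Set<String> predefined, Set<String> custom) {
        Set<String> result = new LinkedHashSet<>();
        if (strategy == Strategy.MERGE) {
            result.addAll(predefined); // MERGE keeps the built-in schemas
        }
        result.addAll(custom);         // custom schemas are used in both cases
        return result;
    }

    /** Custom type names that clash with a pre-defined one; a hint to prefer REPLACE. */
    static Set<String> conflicts(Set<String> predefined, Set<String> custom) {
        Set<String> result = new LinkedHashSet<>(custom);
        result.retainAll(predefined);
        return result;
    }

    public static void main(String[] args) {
        // Invented names; the actual pre-defined schema list is not given here.
        Set<String> predefined = new LinkedHashSet<>(List.of("Invoice", "Receipt"));
        Set<String> custom = new LinkedHashSet<>(List.of("Invoice", "Timesheet"));

        System.out.println("REPLACE:   " + effectiveTypes(Strategy.REPLACE, predefined, custom));
        System.out.println("MERGE:     " + effectiveTypes(Strategy.MERGE, predefined, custom));
        System.out.println("Conflicts: " + conflicts(predefined, custom));
    }
}
```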

Using a Custom Foundation Model

Override the default model (mistral-small-3-1-24b-instruct-2503) globally with defaultModelName, or per pipeline task with taskModelNameOverride:

TextExtractionSemanticConfig semanticConfig = TextExtractionSemanticConfig.builder()
    .defaultModelName("mistral-large-2512")
    .taskModelNameOverride(Map.of(
        "extraction", "meta-llama/llama-4-maverick-17b-128e-instruct-fp8",
        "create_schema", "mistral-large-2512"
    ))
    .enableSchemaKvp(true)
    .schemasMergeStrategy(SchemaMergeStrategy.REPLACE)
    .schemas(schema)
    .build();

TextExtractionParameters parameters = TextExtractionParameters.builder()
    .requestedOutputs(Type.JSON)
    .kvpMode(KvpMode.GENERIC_WITH_SEMANTIC)
    .semanticConfig(semanticConfig)
    .build();

String json = service.uploadExtractAndFetch(new File("invoice.pdf"), parameters);

Supported keys for taskModelNameOverride: classification_exact, extraction, create_schema, create_schema_page_merger, improve_schema_description, cluster_schemas, merge_schemas.
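
Because the override map is keyed by plain strings, a typo in a task name is easy to miss. A small pre-flight check along the following lines could catch it before the request is sent. The class and method names are hypothetical, the key list is copied from the sentence above, and whether the service itself rejects unknown keys is not stated here.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

final class TaskOverrideCheck {

    // Keys listed as supported for taskModelNameOverride.
    static final Set<String> SUPPORTED_KEYS = Set.of(
            "classification_exact", "extraction", "create_schema",
            "create_schema_page_merger", "improve_schema_description",
            "cluster_schemas", "merge_schemas");

    /** Returns the override keys that are not in the supported list, sorted (keys must be non-null). */
    static List<String> unsupportedKeys(Map<String, String> overrides) {
        return overrides.keySet().stream()
                .filter(key -> !SUPPORTED_KEYS.contains(key))
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, String> overrides = Map.of(
                "extraction", "meta-llama/llama-4-maverick-17b-128e-instruct-fp8",
                "create-schema", "mistral-large-2512"); // hyphen instead of underscore

        List<String> unknown = unsupportedKeys(overrides);
        if (!unknown.isEmpty()) {
            System.err.println("Unsupported task keys: " + unknown);
        }
    }
}
```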


Extraction Parameters

TextExtractionParameters controls how extraction is performed per request.

Builder Reference

| Parameter | Type | Description |
|---|---|---|
| requestedOutputs | Type | Output format(s) to generate. Defaults to MD if not set |
| mode | Mode | Processing quality: STANDARD (faster) or HIGH_QUALITY (preserves all data structures, slower) |
| ocrMode | OcrMode | OCR mode: DISABLED, ENABLED, FORCED, or AUTO (service decides) |
| autoRotationCorrection | Boolean | Automatically correct document rotation before OCR |
| languages | Language | Expected languages in the document (ISO 639) |
| kvpMode | KvpMode | Key-value pair extraction mode. Disabled by default. Results are only included in Type.JSON output |
| semanticConfig | TextExtractionSemanticConfig | Semantic configuration for schema-based KVP extraction |
| createEmbeddedImages | EmbeddedImageMode | How images embedded in the document are handled in output |
| outputDpi | Integer | DPI for extracted page images |
| outputTokens | Boolean | Include token bounding boxes in the output |
| outputFileName | String | Name or directory prefix for the output file in COS |
| removeUploadedFile | Boolean | Delete the input file from COS after extraction (synchronous only) |
| removeOutputFile | Boolean | Delete the output file from COS after reading (synchronous only) |
| documentReference | CosReference | Override the default input COS location for this request |
| resultReference | CosReference | Override the default output COS location for this request |
| timeout | Duration | Override the service-level timeout for this request |
| addCustomProperty | String, Object | Add arbitrary key-value metadata to the request |
| projectId | String | Override the default project ID |
| spaceId | String | Override the default space ID |
| transactionId | String | Request tracking ID |

Processing Modes

| Value | Description |
|---|---|
| Mode.STANDARD | Faster processing with standard accuracy |
| Mode.HIGH_QUALITY | Slower processing with higher accuracy |

KVP Modes

| Value | Description |
|---|---|
| KvpMode.DISABLED | Key-value pair extraction is disabled (default) |
| KvpMode.GENERIC_WITH_SEMANTIC | Extract generic and schema-based KVP data. Use with semanticConfig to configure the extraction pipeline |

OCR Modes

| Value | Sent to API | Description |
|---|---|---|
| OcrMode.AUTO | (not sent) | Service automatically selects the best OCR option |
| OcrMode.DISABLED | "disabled" | OCR is disabled; document must contain native text |
| OcrMode.ENABLED | "enabled" | OCR is applied when the service determines it is needed |
| OcrMode.FORCED | "forced" | OCR is always applied regardless of document content |
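
The "Sent to API" column means that AUTO is expressed by leaving the value out of the request rather than by sending a string. One way to model that distinction is an Optional, as in the sketch below. OcrChoice is a local stand-in for OcrMode and apiValue is a hypothetical helper; the SDK's own serialization may be implemented differently.

```java
import java.util.Locale;
import java.util.Optional;

final class OcrModeSketch {

    /** Local stand-in for the SDK's OcrMode enum. */
    enum OcrChoice { AUTO, DISABLED, ENABLED, FORCED }

    /** Value to send for the given choice; empty means "omit it and let the service decide". */
    static Optional<String> apiValue(OcrChoice choice) {
        if (choice == OcrChoice.AUTO) {
            return Optional.empty();
        }
        return Optional.of(choice.name().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        for (OcrChoice choice : OcrChoice.values()) {
            System.out.println(choice + " -> " + apiValue(choice).orElse("(not sent)"));
        }
    }
}
```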

TextExtractionResponse

Returned by startExtraction, uploadAndStartExtraction, and fetchExtractionRequest.

| Field | Type | Description |
|---|---|---|
| metadata().id() | String | Unique identifier for the extraction request |
| metadata().createdAt() | String | Timestamp when the request was created |
| metadata().modifiedAt() | String | Timestamp of the last update |
| metadata().projectId() | String | Project ID associated with the request |
| entity().results() | ExtractionResult | The current extraction result |
| entity().documentReference() | DataReference | Reference to the input document |
| entity().resultsReference() | DataReference | Reference to the output file(s) |
| entity().parameters() | Parameters | Parameters used for this extraction |
| entity().custom() | Map<String, Object> | User-defined custom properties |

ExtractionResult

| Field | Type | Description |
|---|---|---|
| status() | String | Current status: submitted, queued, running, completed, or failed |
| runningAt() | String | Timestamp when processing started |
| completedAt() | String | Timestamp when processing completed or failed |
| numberPagesProcessed() | Integer | Number of pages processed so far |
| totalPages() | Integer | Total number of pages to process |
| location() | List<String> | Paths of the output files produced in COS |
| error() | Error | Error details if status is failed |
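
While a job is running, numberPagesProcessed() and totalPages() can be combined into a progress figure for logging or a UI. The helper below is a hypothetical sketch in plain Java. It assumes either value may be null or zero before processing has started, which this page does not confirm, and returns an empty result in that case.

```java
import java.util.OptionalInt;

final class ProgressSketch {

    /** Whole-number percentage of pages processed, or empty when it cannot be computed. */
    static OptionalInt percentComplete(Integer processed, Integer total) {
        if (processed == null || total == null || total <= 0) {
            return OptionalInt.empty();
        }
        // Clamp so an inconsistent response never yields less than 0 or more than 100.
        int clamped = Math.max(0, Math.min(processed, total));
        return OptionalInt.of((int) (100L * clamped / total));
    }

    public static void main(String[] args) {
        OptionalInt progress = percentComplete(5, 20);
        System.out.println(progress.isPresent() ? progress.getAsInt() + "%" : "progress unknown");
    }
}
```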


Copyright 2025 IBM Corporation. Licensed under the Apache License 2.0.
