Text Extraction Service
The TextExtractionService provides functionality to extract text from documents stored in IBM Cloud Object Storage (COS) using IBM watsonx.ai. It converts business documents into simpler formats (Markdown, JSON, HTML, plain text) suitable for AI pipelines, and can optionally extract structured key-value pair data from documents.
Quick Start
TextExtractionService service = TextExtractionService.builder()
.apiKey(WATSONX_API_KEY)
.projectId(WATSONX_PROJECT_ID)
.baseUrl(CloudRegion.DALLAS)
.cosUrl(CLOUD_OBJECT_STORAGE_URL)
.documentReference(INPUT_CONNECTION_ID, INPUT_BUCKET_NAME)
.resultReference(OUTPUT_CONNECTION_ID, OUTPUT_BUCKET_NAME)
.build();
TextExtractionParameters parameters = TextExtractionParameters.builder()
.requestedOutputs(Type.MD)
.mode(Mode.HIGH_QUALITY)
.languages(Language.ENGLISH)
.build();
String text = service.uploadExtractAndFetch(new File("path/to/file.pdf"), parameters);
System.out.println(text);
// → # Contract
// → ...
Overview
The TextExtractionService enables you to:
- Extract text from documents into Markdown, JSON, HTML, or plain text formats.
- Upload local files or input streams directly to COS before extraction.
- Run extraction synchronously or asynchronously.
- Extract structured key-value pair data with pre-defined or custom schemas.
- Configure OCR settings for language, rotation correction, and processing mode.
- Control output format, DPI, embedded images, and token output.
- Automatically clean up uploaded input and/or output files after processing.
Service Configuration
Basic Setup
TextExtractionService service = TextExtractionService.builder()
.apiKey(WATSONX_API_KEY)
.projectId(WATSONX_PROJECT_ID)
.baseUrl("https://us-south.ml.cloud.ibm.com") // or use CloudRegion
.cosUrl("https://s3.us-south.cloud-object-storage.appdomain.cloud") // or use CosUrl
.documentReference(INPUT_CONNECTION_ID, INPUT_BUCKET_NAME)
.resultReference(OUTPUT_CONNECTION_ID, OUTPUT_BUCKET_NAME)
.build();
Using a Separate COS Authenticator
If your Cloud Object Storage uses different credentials than your watsonx.ai service, provide a dedicated cosAuthenticator:
TextExtractionService service = TextExtractionService.builder()
.apiKey(WATSONX_API_KEY)
.cosAuthenticator(IBMCloudAuthenticator.withKey(COS_API_KEY))
.projectId(WATSONX_PROJECT_ID)
.baseUrl(WATSONX_URL)
.cosUrl(CLOUD_OBJECT_STORAGE_URL)
.documentReference(INPUT_CONNECTION_ID, INPUT_BUCKET_NAME)
.resultReference(OUTPUT_CONNECTION_ID, OUTPUT_BUCKET_NAME)
.build();
Builder Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| apiKey | String | Conditional | API key for IBM Cloud authentication |
| authenticator | Authenticator | Conditional | Custom authentication (alternative to apiKey) |
| cosAuthenticator | Authenticator | No | Separate authenticator for COS operations (defaults to main authenticator) |
| projectId | String | Conditional | Project ID where extraction will be performed |
| spaceId | String | Conditional | Space ID (alternative to projectId) |
| baseUrl | String/CloudRegion | Yes | watsonx.ai service base URL |
| cosUrl | String/CosUrl | Yes | Cloud Object Storage base URL |
| documentReference | CosReference | Yes | Connection ID and bucket containing input documents |
| resultReference | CosReference | Yes | Connection ID and bucket where extracted results are stored |
| timeout | Duration | No | Request timeout (default: 60 seconds) |
| logRequests | Boolean | No | Enable request logging (default: false) |
| logResponses | Boolean | No | Enable response logging (default: false) |
| httpClient | HttpClient | No | Custom HTTP client |
| verifySsl | Boolean | No | SSL certificate verification (default: true) |
| version | String | No | API version override |
Either apiKey or authenticator must be provided. Either projectId or spaceId must be specified.
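The two conditional rules above can be checked before the builder is called. The following is a minimal sketch, not part of the SDK: `BuilderInputCheck` and `problems` are hypothetical names, and the sketch assumes that "either A or B" means at least one of the two is present.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the SDK: checks the conditional builder
// inputs before a TextExtractionService is constructed.
final class BuilderInputCheck {

    // Returns a list of problems; an empty list means both rules are satisfied.
    // Assumption: "either A or B" is read as "at least one of A and B is present".
    static List<String> problems(String apiKey, boolean hasAuthenticator,
                                 String projectId, String spaceId) {
        List<String> problems = new ArrayList<>();
        if (isBlank(apiKey) && !hasAuthenticator) {
            problems.add("apiKey or authenticator must be provided");
        }
        if (isBlank(projectId) && isBlank(spaceId)) {
            problems.add("projectId or spaceId must be specified");
        }
        return problems;
    }

    private static boolean isBlank(String value) {
        return value == null || value.isBlank();
    }

    public static void main(String[] args) {
        // An API key plus a space ID satisfies both rules.
        System.out.println(problems("my-key", false, null, "my-space")); // []
        // Nothing provided: both rules are reported.
        System.out.println(problems(null, false, "", null));
    }
}
```

Failing early with a readable message is usually easier to diagnose than an authentication error returned by the first request.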
Examples
Synchronous Extraction
uploadExtractAndFetch uploads a document, runs extraction, and returns the text content in one call. If no output format is specified, it defaults to Markdown.
From a local file:
String text = service.uploadExtractAndFetch(new File("path/to/file.pdf"));
With parameters:
var parameters = TextExtractionParameters.builder()
.requestedOutputs(Type.MD)
.mode(Mode.HIGH_QUALITY)
.languages(Language.ENGLISH)
.build();
String text = service.uploadExtractAndFetch(new File("path/to/file.pdf"), parameters);
From an InputStream — useful for documents from web uploads or streaming sources:
TextExtractionParameters parameters = TextExtractionParameters.builder()
.requestedOutputs(Type.MD)
.mode(Mode.HIGH_QUALITY)
.build();
String text = service.uploadExtractAndFetch(inputStream, "fileName.pdf", parameters);
From a file already in COS — skip the upload step entirely:
String text = service.extractAndFetch("path/to/cosFile.pdf");
Automatic file cleanup — use removeUploadedFile and removeOutputFile to delete COS files asynchronously after extraction:
var parameters = TextExtractionParameters.builder()
.requestedOutputs(Type.MD)
.removeUploadedFile(true) // delete input file after extraction
.removeOutputFile(true) // delete output file after reading
.build();
String text = service.uploadExtractAndFetch(new File("path/to/file.pdf"), parameters);
Note: removeUploadedFile and removeOutputFile are only supported with the synchronous variants (uploadExtractAndFetch / extractAndFetch). They cannot be used with uploadAndStartExtraction.
Asynchronous Extraction
For long-running operations, start the job and poll until it completes:
var parameters = TextExtractionParameters.builder()
.requestedOutputs(Type.MD)
.mode(Mode.HIGH_QUALITY)
.languages(Language.ENGLISH)
.build();
TextExtractionResponse response = service.uploadAndStartExtraction(new File("path/to/file.pdf"), parameters);
String requestId = response.metadata().id();
String status = response.entity().results().status();
while (!status.equals(Status.COMPLETED.value()) && !status.equals(Status.FAILED.value())) {
Thread.sleep(2000);
response = service.fetchExtractionRequest(requestId);
status = response.entity().results().status();
}
if (status.equals(Status.COMPLETED.value())) {
String outputPath = response.entity().resultsReference().location().fileName();
String text = service.readFile(OUTPUT_BUCKET_NAME, outputPath);
System.out.println(text);
} else {
System.err.println("Failed: " + response.entity().results().error().message());
}
Note: Extraction results are retained for 2 days. After that, fetchExtractionRequest will no longer return results for the given ID.
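The polling loop above runs until the job reaches a terminal status, so a job that never finishes would block forever. One way to bound the wait is sketched below. `StatusPoller` is a hypothetical helper, not part of the SDK. It takes the status source as a `Supplier<String>` so it can run without a live service, and it assumes the terminal status strings are `completed` and `failed`.

```java
import java.time.Duration;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical helper, not part of the SDK: polls a status source until it
// reports a terminal status or the timeout elapses.
final class StatusPoller {

    // Assumed terminal status strings, matching the statuses listed for ExtractionResult.
    private static final Set<String> TERMINAL = Set.of("completed", "failed");

    static String pollUntilTerminal(Supplier<String> statusSource,
                                    Duration interval,
                                    Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        String status = statusSource.get();
        while (!TERMINAL.contains(status)) {
            if (System.nanoTime() >= deadline) {
                throw new IllegalStateException("Timed out while status was: " + status);
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag for callers
                throw new IllegalStateException("Interrupted while waiting for extraction", e);
            }
            status = statusSource.get();
        }
        return status;
    }

    public static void main(String[] args) {
        // Stand-in for repeated status lookups: a fixed sequence of statuses.
        Iterator<String> fake = List.of("submitted", "running", "completed").iterator();
        String result = pollUntilTerminal(fake::next, Duration.ofMillis(10), Duration.ofSeconds(5));
        System.out.println(result); // completed
    }
}
```

With the real service, the supplier would presumably wrap the same call chain used in the loop above, for example `() -> service.fetchExtractionRequest(requestId).entity().results().status()`.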
Multiple Output Formats
Request multiple output formats in a single extraction using uploadAndStartExtraction. Set outputFileName to a directory path ending with / to group all outputs together:
var parameters = TextExtractionParameters.builder()
.requestedOutputs(Type.PLAIN_TEXT, Type.JSON, Type.HTML)
.mode(Mode.HIGH_QUALITY)
.outputFileName("output/")
.build();
TextExtractionResponse response = service.uploadAndStartExtraction(new File("path/to/file.pdf"), parameters);
// Wait for completion, then read each output file
// Files will be: output/plain.txt, output/assembly.json, output/assembly.html
String plainText = service.readFile(OUTPUT_BUCKET_NAME, "output/plain.txt");
String json = service.readFile(OUTPUT_BUCKET_NAME, "output/assembly.json");
String html = service.readFile(OUTPUT_BUCKET_NAME, "output/assembly.html");
OCR from Images
Use Mode.HIGH_QUALITY for best OCR results on image files:
var parameters = TextExtractionParameters.builder()
.mode(Mode.HIGH_QUALITY)
.requestedOutputs(Type.PLAIN_TEXT)
.build();
String text = service.uploadExtractAndFetch(new File("path/to/image.png"), parameters);
Managing Requests
Use deleteRequest to cancel or remove an extraction job. Pass hardDelete(true) to also remove the job metadata:
TextExtractionResponse response = service.uploadAndStartExtraction(new File("invoice.pdf"));
boolean deleted = service.deleteRequest(
response.metadata().id(),
TextExtractionDeleteParameters.builder()
.hardDelete(true)
.build()
);
System.out.println("Deleted: " + deleted); // → true
Output Files
File Naming Conventions
The output file name is derived from the input file name. When requesting a single output, the extension is replaced automatically:
| Output Type | Output File Name |
|---|---|
| Type.MD | <input_name>.md |
| Type.HTML | <input_name>.html |
| Type.PLAIN_TEXT | <input_name>.txt |
| Type.JSON | <input_name>.json |
| Type.PAGE_IMAGES | page_images/<page>.png |
When requesting multiple outputs, set outputFileName to a directory path ending with /. All output files are written into that directory using their default names. For example, with outputFileName("results/") and outputs PLAIN_TEXT, JSON, HTML:
results/plain.txt
results/assembly.json
results/assembly.html
results/embedded_images_assembly/*.png (if embedded images are enabled)
results/page_images/*.png (if PAGE_IMAGES is requested)
If outputFileName is not set, output files are written to the root of the resultReference bucket.
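The naming rules above can be captured in a small helper for working out which file to read after an extraction. This is only a sketch: `OutputNames` is a hypothetical class, not part of the SDK, and the type names are passed as plain strings. It covers only the names listed in this section; whether the service keeps any path prefix of the input name is not stated here, so the helper changes only the extension.

```java
import java.util.Map;

// Hypothetical helper, not part of the SDK: derives expected output file names
// from the naming conventions described in this section.
final class OutputNames {

    // Extensions used when a single output is requested.
    private static final Map<String, String> SINGLE_EXTENSION = Map.of(
            "MD", ".md",
            "HTML", ".html",
            "PLAIN_TEXT", ".txt",
            "JSON", ".json");

    // Default file names used inside an output directory (multi-output requests).
    private static final Map<String, String> DIRECTORY_NAME = Map.of(
            "PLAIN_TEXT", "plain.txt",
            "JSON", "assembly.json",
            "HTML", "assembly.html");

    // Replaces the extension of the input name with the one for the output type.
    static String singleOutput(String inputName, String type) {
        String extension = SINGLE_EXTENSION.get(type);
        if (extension == null) {
            throw new IllegalArgumentException("No single-file name known for " + type);
        }
        int slash = inputName.lastIndexOf('/');
        int dot = inputName.lastIndexOf('.');
        // Only treat the dot as an extension separator if it is inside the file name.
        String base = dot > slash + 1 ? inputName.substring(0, dot) : inputName;
        return base + extension;
    }

    // Builds the path of a default-named output inside a directory ending with '/'.
    static String inDirectory(String directory, String type) {
        if (!directory.endsWith("/")) {
            throw new IllegalArgumentException("Directory must end with '/'");
        }
        String name = DIRECTORY_NAME.get(type);
        if (name == null) {
            throw new IllegalArgumentException("No default name known for " + type);
        }
        return directory + name;
    }

    public static void main(String[] args) {
        System.out.println(singleOutput("invoice.pdf", "MD"));  // invoice.md
        System.out.println(inDirectory("results/", "JSON"));    // results/assembly.json
    }
}
```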
Output Types
| Value | API String | Description |
|---|---|---|
| Type.JSON | assembly | Full structured JSON output including KVP data. Required for key-value pair results |
| Type.MD | md | Markdown (default) |
| Type.HTML | html | HTML |
| Type.PLAIN_TEXT | plain_text | Plain text |
| Type.PAGE_IMAGES | page_images | Individual page images. Cannot be used with uploadExtractAndFetch |
Embedded Images
Controls how images embedded in the document are handled in the extracted output. Applies to Markdown and JSON formats.
| Value | Image in output | Markdown output | JSON output |
|---|---|---|---|
| DISABLED | No | None | None |
| ENABLED_PLACEHOLDER | Yes | Link to image location | Image in pictures structure; picture.text empty; generic placeholder token IDs in picture.children_ids |
| ENABLED_TEXT | Yes | Text extracted directly from the image | Image in pictures; OCR text in picture.text; token IDs in picture.children_ids |
| ENABLED_VERBALIZATION | Yes | Link + textual description of the image | Image in pictures; natural language description in picture.verbalization (only for verbalized images); token IDs in picture.children_ids |
| ENABLED_VERBALIZATION_ALL | Yes | Link + textual description of the image | Same as ENABLED_VERBALIZATION, but all embedded images are verbalized, not just graphs, charts, and screenshots |
Images extracted in any mode are stored as .png files in the embedded_images_assembly/ folder within the output location.
Key-Value Pair Extraction
KVP extraction pulls structured field data out of documents alongside the text. It requires kvpMode(KvpMode.GENERIC_WITH_SEMANTIC) and Type.JSON as output format — results are only included in the JSON output.
Basic KVP Extraction
Define a schema, attach it to a TextExtractionSemanticConfig, and pass it to the parameters:
KvpFields fields = KvpFields.builder()
.add("invoice_date", KvpField.of("The date when the invoice was issued.", "2024-07-10"))
.add("invoice_number", KvpField.of("The unique invoice identifier.", "INV-2024-001"))
.add("total_amount", KvpField.of("The total amount due.", "1250.50"))
.build();
Schema schema = Schema.builder()
.documentType("Invoice")
.documentDescription("A vendor-issued invoice listing purchased items, prices, and payment information.")
.fields(fields)
.build();
TextExtractionSemanticConfig semanticConfig = TextExtractionSemanticConfig.builder()
.enableSchemaKvp(true)
.schemasMergeStrategy(SchemaMergeStrategy.REPLACE)
.schemas(schema)
.build();
TextExtractionParameters parameters = TextExtractionParameters.builder()
.mode(Mode.HIGH_QUALITY)
.requestedOutputs(Type.JSON)
.kvpMode(KvpMode.GENERIC_WITH_SEMANTIC)
.languages(Language.ENGLISH)
.semanticConfig(semanticConfig)
.build();
String json = service.uploadExtractAndFetch(new File("invoice.pdf"), parameters);
Extraction Methods
Two methods can be enabled independently or together when using GENERIC_WITH_SEMANTIC:
| Method | Parameter | Behaviour |
|---|---|---|
| Schema-based | enableSchemaKvp(true) | Classifies each page into a schema type and extracts only the defined fields. Higher accuracy for known document types |
| Generic | enableGenericKvp(true) | Broad sweep: extracts any labelled data regardless of schema. Useful for unknown document formats |
Both are active by default. If you only want schema-based results, set enableGenericKvp(false) to avoid duplicate extractions.
Schema Merge Strategy
Controls how custom schemas interact with the built-in pre-defined ones:
| Strategy | Behaviour | When to use |
|---|---|---|
| SchemaMergeStrategy.REPLACE | Only your custom schemas are used; all pre-defined schemas are ignored | You have a known document format with unique fields, or your custom schema conflicts with a pre-defined one |
| SchemaMergeStrategy.MERGE | Your custom schemas are combined with the pre-defined ones | You want to supplement pre-defined document types with additional custom schemas |
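The difference between the two strategies can be illustrated with plain sets of document type names. This is a conceptual sketch only, not the service's implementation. `MergeStrategySketch` is hypothetical, the pre-defined type names used here are invented for illustration, and the way the service resolves a custom schema that clashes with a pre-defined one is not modelled.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Conceptual sketch, not part of the SDK: shows which document types are in
// play under each merge strategy.
final class MergeStrategySketch {

    enum Strategy { REPLACE, MERGE }

    static Set<String> effectiveDocumentTypes(Strategy strategy,
                                              List<String> predefined,
                                              List<String> custom) {
        Set<String> result = new LinkedHashSet<>();
        if (strategy == Strategy.MERGE) {
            result.addAll(predefined); // MERGE keeps the pre-defined types
        }
        result.addAll(custom);         // custom types are used under both strategies
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical pre-defined types, for illustration only.
        List<String> predefined = List.of("Invoice", "Receipt");
        List<String> custom = List.of("Contract");

        System.out.println(effectiveDocumentTypes(Strategy.REPLACE, predefined, custom)); // [Contract]
        System.out.println(effectiveDocumentTypes(Strategy.MERGE, predefined, custom));   // [Invoice, Receipt, Contract]
    }
}
```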
Using a Custom Foundation Model
Override the default model (mistral-small-3-1-24b-instruct-2503) globally with defaultModelName, or per pipeline task with taskModelNameOverride:
TextExtractionSemanticConfig semanticConfig = TextExtractionSemanticConfig.builder()
.defaultModelName("mistral-large-2512")
.taskModelNameOverride(Map.of(
"extraction", "meta-llama/llama-4-maverick-17b-128e-instruct-fp8",
"create_schema", "mistral-large-2512"
))
.enableSchemaKvp(true)
.schemasMergeStrategy(SchemaMergeStrategy.REPLACE)
.schemas(schema)
.build();
TextExtractionParameters parameters = TextExtractionParameters.builder()
.requestedOutputs(Type.JSON)
.kvpMode(KvpMode.GENERIC_WITH_SEMANTIC)
.semanticConfig(semanticConfig)
.build();
String json = service.uploadExtractAndFetch(new File("invoice.pdf"), parameters);
Supported keys for taskModelNameOverride: classification_exact, extraction, create_schema, create_schema_page_merger, improve_schema_description, cluster_schemas, merge_schemas.
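Because the override map is keyed by free-form strings, a typo in a task name is easy to miss. The sketch below checks a map against the supported keys listed above before it is passed to taskModelNameOverride. `TaskOverrideCheck` is a hypothetical helper, not part of the SDK, and how the service itself reacts to an unknown key is not covered here.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of the SDK: reports override keys that are not
// in the list of supported task names.
final class TaskOverrideCheck {

    static final Set<String> SUPPORTED_KEYS = Set.of(
            "classification_exact",
            "extraction",
            "create_schema",
            "create_schema_page_merger",
            "improve_schema_description",
            "cluster_schemas",
            "merge_schemas");

    // Returns the keys in the map that are not supported, in sorted order.
    static Set<String> unsupportedKeys(Map<String, String> overrides) {
        Set<String> unknown = new TreeSet<>(overrides.keySet());
        unknown.removeAll(SUPPORTED_KEYS);
        return unknown;
    }

    public static void main(String[] args) {
        Map<String, String> overrides = Map.of(
                "extraction", "mistral-large-2512",
                "summarise", "mistral-large-2512"); // deliberately invalid key
        System.out.println(unsupportedKeys(overrides)); // [summarise]
    }
}
```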
Extraction Parameters
TextExtractionParameters controls how extraction is performed per request.
Builder Reference
| Parameter | Type | Description |
|---|---|---|
| requestedOutputs | Type | Output format(s) to generate. Defaults to MD if not set |
| mode | Mode | Processing quality: STANDARD (faster) or HIGH_QUALITY (preserves all data structures, slower) |
| ocrMode | OcrMode | OCR mode: DISABLED, ENABLED, FORCED, or AUTO (service decides) |
| autoRotationCorrection | Boolean | Automatically correct document rotation before OCR |
| languages | Language | Expected languages in the document (ISO 639) |
| kvpMode | KvpMode | Key-value pair extraction mode. Disabled by default. Results are only included in Type.JSON output |
| semanticConfig | TextExtractionSemanticConfig | Semantic configuration for schema-based KVP extraction |
| createEmbeddedImages | EmbeddedImageMode | How images embedded in the document are handled in output |
| outputDpi | Integer | DPI for extracted page images |
| outputTokens | Boolean | Include token bounding boxes in the output |
| outputFileName | String | Name or directory prefix for the output file in COS |
| removeUploadedFile | Boolean | Delete the input file from COS after extraction (synchronous only) |
| removeOutputFile | Boolean | Delete the output file from COS after reading (synchronous only) |
| documentReference | CosReference | Override the default input COS location for this request |
| resultReference | CosReference | Override the default output COS location for this request |
| timeout | Duration | Override the service-level timeout for this request |
| addCustomProperty | String, Object | Add arbitrary key-value metadata to the request |
| projectId | String | Override the default project ID |
| spaceId | String | Override the default space ID |
| transactionId | String | Request tracking ID |
Processing Modes
| Value | Description |
|---|---|
| Mode.STANDARD | Faster processing with standard accuracy |
| Mode.HIGH_QUALITY | Slower processing with higher accuracy |
KVP Modes
| Value | Description |
|---|---|
| KvpMode.DISABLED | Key-value pair extraction is disabled (default) |
| KvpMode.GENERIC_WITH_SEMANTIC | Extract generic and schema-based KVP data. Use with semanticConfig to configure the extraction pipeline |
OCR Modes
| Value | Sent to API | Description |
|---|---|---|
| OcrMode.AUTO | (not sent) | Service automatically selects the best OCR option |
| OcrMode.DISABLED | "disabled" | OCR is disabled; document must contain native text |
| OcrMode.ENABLED | "enabled" | OCR is applied when the service determines it is needed |
| OcrMode.FORCED | "forced" | OCR is always applied regardless of document content |
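The "(not sent)" entry for AUTO means the OCR field is left out of the request, so the service can pick the mode itself. One way to model that mapping is an enum whose API value is optional. The sketch below is illustrative only: `OcrModeSketch` is a hypothetical stand-in and does not claim to reflect how the SDK's own OcrMode is implemented.

```java
import java.util.Optional;

// Hypothetical stand-in for illustration; not the SDK's OcrMode.
enum OcrModeSketch {
    AUTO(null),           // no value: the field is omitted from the request
    DISABLED("disabled"),
    ENABLED("enabled"),
    FORCED("forced");

    private final String apiValue;

    OcrModeSketch(String apiValue) {
        this.apiValue = apiValue;
    }

    // Empty for AUTO, otherwise the string sent to the API.
    Optional<String> apiValue() {
        return Optional.ofNullable(apiValue);
    }

    public static void main(String[] args) {
        for (OcrModeSketch mode : values()) {
            System.out.println(mode + " -> " + mode.apiValue().orElse("(not sent)"));
        }
    }
}
```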
TextExtractionResponse
Returned by startExtraction, uploadAndStartExtraction, and fetchExtractionRequest.
| Field | Type | Description |
|---|---|---|
| metadata().id() | String | Unique identifier for the extraction request |
| metadata().createdAt() | String | Timestamp when the request was created |
| metadata().modifiedAt() | String | Timestamp of the last update |
| metadata().projectId() | String | Project ID associated with the request |
| entity().results() | ExtractionResult | The current extraction result |
| entity().documentReference() | DataReference | Reference to the input document |
| entity().resultsReference() | DataReference | Reference to the output file(s) |
| entity().parameters() | Parameters | Parameters used for this extraction |
| entity().custom() | Map<String, Object> | User-defined custom properties |
ExtractionResult
| Field | Type | Description |
|---|---|---|
| status() | String | Current status: submitted, queued, running, completed, or failed |
| runningAt() | String | Timestamp when processing started |
| completedAt() | String | Timestamp when processing completed or failed |
| numberPagesProcessed() | Integer | Number of pages processed so far |
| totalPages() | Integer | Total number of pages to process |
| location() | List<String> | Paths of the output files produced in COS |
| error() | Error | Error details if status is failed |
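The page counters can be combined with the status to report progress while polling. The following is a minimal sketch: `ExtractionProgress` is a hypothetical helper, not part of the SDK. It takes the values as plain arguments and assumes the counters may be null or zero before processing starts.

```java
// Hypothetical helper, not part of the SDK: formats a progress line from the
// status and page counters of an extraction result.
final class ExtractionProgress {

    static String describe(String status, Integer processed, Integer total) {
        // Assumption: counters may be missing until the job is running.
        if (processed == null || total == null || total <= 0) {
            return status;
        }
        long percent = Math.round(100.0 * processed / total);
        return String.format("%s: %d/%d pages (%d%%)", status, processed, total, percent);
    }

    public static void main(String[] args) {
        System.out.println(describe("running", 3, 12));   // running: 3/12 pages (25%)
        System.out.println(describe("queued", null, null)); // queued
    }
}
```

In a polling loop, the arguments would presumably come from status(), numberPagesProcessed(), and totalPages() on the result returned by entity().results().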