Batch Conversion

Convert large document collections from cloud storage with Docling for IBM watsonx

Batch conversion is built for large-scale ingestion. Like multi-document conversion, it uses the batch endpoint (/v1/convert/source/batch). It can also read documents directly from cloud storage (S3), including every object under a bucket and prefix, and deliver the converted results to a cloud storage location you choose.

Use batch conversion when you want to:

convert a large collection of documents in one request,
read documents straight from cloud storage instead of sending each one, or
have results delivered to your own cloud storage rather than downloading them individually.

The batch endpoint is asynchronous: submit the request, poll for status, and then retrieve the result.

Where your results go

How the converted results are delivered depends on where your documents come from:

All documents from web URLs — you can get back temporary download links (one per document). Use the presigned_url target.
Any document from cloud storage (S3) — the converted results are delivered to a cloud storage location you specify. This can be any bucket you choose; it does not have to be the one the documents came from, and the original documents are never modified. Use the s3 target.

If any of your documents come from cloud storage, you must provide a cloud storage destination for the results. A web-only batch can use either delivery method.
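The delivery rule above can be checked on the client before a request is submitted. The following is a minimal sketch, not part of the service or SDK: check_delivery is a hypothetical helper name, and it assumes the request is built as plain dictionaries using the same "kind" values shown in the request examples on this page.

```python
from __future__ import annotations

from typing import Any


def check_delivery(sources: list[dict[str, Any]], target: dict[str, Any]) -> None:
    """Raise ValueError if the source/target pairing breaks the delivery rule.

    Hypothetical client-side check: any cloud storage ("s3") source
    requires an "s3" target; a web-only batch may use either target.
    """
    has_s3_source = any(source["kind"] == "s3" for source in sources)
    if has_s3_source and target["kind"] != "s3":
        raise ValueError(
            "A batch that reads from cloud storage needs an s3 target "
            f"(got target kind {target['kind']!r})."
        )


# A web-only batch may use temporary download links.
check_delivery(
    [{"kind": "http", "url": "https://arxiv.org/pdf/2408.09869"}],
    {"kind": "presigned_url"},
)
```

Running a check like this locally only saves a round trip; the service remains the authority on what it accepts.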

Reading from cloud storage

A single cloud storage source can stand in for many documents: point it at a bucket and prefix, and every matching object is converted. Use max_num_elements to cap how many objects are processed.
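To picture how one source expands into many documents, here is a small local simulation of prefix matching with a cap. It is only an illustration: select_keys is a hypothetical function, the object keys are made up, and which objects are kept when the cap applies is an assumption (this sketch simply keeps the first matches in listing order).

```python
from __future__ import annotations


def select_keys(
    keys: list[str],
    key_prefix: str,
    max_num_elements: int | None = None,
) -> list[str]:
    """Return the keys that start with key_prefix, capped at max_num_elements.

    Assumption: the cap keeps the first matches in listing order. The
    service may choose differently.
    """
    matching = [key for key in keys if key.startswith(key_prefix)]
    if max_num_elements is None:
        return matching
    return matching[:max_num_elements]


# Made-up bucket listing, for illustration only.
bucket_keys = ["incoming/a.pdf", "incoming/2024/b.pdf", "archive/c.pdf"]
print(select_keys(bucket_keys, "incoming/"))
print(select_keys(bucket_keys, "incoming/", max_num_elements=1))
```

Note that a prefix matches on the start of the object key, so "incoming/" also covers keys in nested "folders" such as incoming/2024/.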

API Endpoint Usage

Convert from web URLs

Convert several web documents and get back temporary download links:

curl -X POST "${DOCLING_SERVICE_URL}/v1/convert/source/batch" \
  -H "X-Api-Key: ${DOCLING_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "sources": [
      { "kind": "http", "url": "https://arxiv.org/pdf/2408.09869" },
      { "kind": "http", "url": "https://arxiv.org/pdf/2501.17887" }
    ],
    "target": { "kind": "presigned_url" },
    "options": { "to_formats": ["md"] }
  }'

Convert from cloud storage

Read every object under a bucket and prefix, and deliver the results to another bucket:

curl -X POST "${DOCLING_SERVICE_URL}/v1/convert/source/batch" \
  -H "X-Api-Key: ${DOCLING_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "sources": [
      {
        "kind": "s3",
        "endpoint": "s3.us-east-2.amazonaws.com",
        "access_key": "YOUR_ACCESS_KEY",
        "secret_key": "YOUR_SECRET_KEY",
        "bucket": "my-input-bucket",
        "key_prefix": "incoming/",
        "max_num_elements": 500
      }
    ],
    "target": {
      "kind": "s3",
      "endpoint": "s3.us-east-2.amazonaws.com",
      "access_key": "YOUR_ACCESS_KEY",
      "secret_key": "YOUR_SECRET_KEY",
      "bucket": "my-output-bucket",
      "key_prefix": "converted/"
    },
    "options": { "to_formats": ["md", "json"] }
  }'

Both requests return a task object with a task_id. Poll /v1/status/poll/{task_id} until the status is success, then retrieve the result from /v1/result/{task_id}.
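The submit, poll, retrieve flow can be sketched as a small polling loop. This is a sketch under stated assumptions rather than the official client: wait_for_task and http_status_getter are hypothetical names, the status endpoint is assumed to answer GET requests with a JSON body whose status field is called task_status, and "failure" is an assumed terminal value alongside the documented success.

```python
import json
import time
import urllib.request
from typing import Callable, Sequence


def wait_for_task(
    get_status: Callable[[], str],
    interval: float = 5.0,
    timeout: float = 3600.0,
    terminal: Sequence[str] = ("success", "failure"),  # "failure" is an assumption
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_status until it returns a terminal status or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task still {status!r} after {timeout} seconds")
        sleep(interval)


def http_status_getter(service_url: str, api_key: str, task_id: str) -> Callable[[], str]:
    """Build a get_status callable that queries /v1/status/poll/{task_id}."""

    def get_status() -> str:
        request = urllib.request.Request(
            f"{service_url}/v1/status/poll/{task_id}",
            headers={"X-Api-Key": api_key},
        )
        with urllib.request.urlopen(request) as response:
            # Assumed field name; check the status response schema.
            return json.load(response)["task_status"]

    return get_status
```

A caller would pass http_status_getter(service_url, api_key, task_id) to wait_for_task and fetch /v1/result/{task_id} only when the returned status is success. Keeping the HTTP call behind a callable makes the loop easy to exercise without a live service.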

For a web-only batch with the presigned_url target, the result lists one entry per document with download URLs (see multi-document conversion and Get Results). For a cloud storage (s3) target, the converted files are written to your destination bucket and the result is a summary of counts:

{
  "num_converted": 312,
  "num_succeeded": 310,
  "num_partially_succeeded": 1,
  "num_failed": 1,
  "processing_time": 842.5
}
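A summary like this is easy to turn into a pass/fail signal for a pipeline. The helpers below are a hypothetical sketch that uses only the field names from the example above. Treating partially succeeded documents as needing attention is a design choice of this sketch, not documented service behavior.

```python
from typing import Any, Dict


def needs_attention(summary: Dict[str, Any]) -> int:
    """Count documents that failed or only partially succeeded."""
    return summary["num_failed"] + summary["num_partially_succeeded"]


def describe(summary: Dict[str, Any]) -> str:
    """Format the counts as a one-line report."""
    return (
        f"{summary['num_succeeded']}/{summary['num_converted']} succeeded, "
        f"{summary['num_partially_succeeded']} partially succeeded, "
        f"{summary['num_failed']} failed"
    )


# The example summary from above.
summary = {
    "num_converted": 312,
    "num_succeeded": 310,
    "num_partially_succeeded": 1,
    "num_failed": 1,
    "processing_time": 842.5,
}
print(describe(summary))
if needs_attention(summary):
    print(f"{needs_attention(summary)} document(s) need a closer look")
```

Because the summary carries counts only, identifying which documents failed means inspecting the destination bucket or the progress updates.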

Python SDK Usage

Convert from web URLs

import os

from docling.service_client import DoclingServiceClient
from docling.datamodel.service.requests import AnyHttpSourceRequest
from docling.datamodel.service.targets import PresignedUrlTarget

# Read the service URL and API key from the environment.
SERVICE_URL = os.getenv("DOCLING_SERVICE_URL")
API_KEY = os.getenv("DOCLING_API_KEY")

with DoclingServiceClient(url=SERVICE_URL, api_key=API_KEY) as client:
    job = client.submit_batch(
        sources=[
            AnyHttpSourceRequest(url="https://arxiv.org/pdf/2408.09869"),
            AnyHttpSourceRequest(url="https://arxiv.org/pdf/2501.17887"),
        ],
        target=PresignedUrlTarget(),
        output_formats=["md"],
    )

    response = job.result()
    for doc in response.documents:
        print(doc.filename, doc.status)
        for artifact in doc.artifacts:
            print("  ", artifact.artifact_type, "->", artifact.uri)

Convert from cloud storage

import os

from docling.service_client import DoclingServiceClient
from docling.datamodel.service.requests import S3SourceRequest
from docling.datamodel.service.targets import S3Target

# Read the service URL and API key from the environment.
SERVICE_URL = os.getenv("DOCLING_SERVICE_URL")
API_KEY = os.getenv("DOCLING_API_KEY")

with DoclingServiceClient(url=SERVICE_URL, api_key=API_KEY) as client:
    job = client.submit_batch(
        sources=[
            S3SourceRequest(
                endpoint="s3.us-east-2.amazonaws.com",
                access_key="YOUR_ACCESS_KEY",
                secret_key="YOUR_SECRET_KEY",
                bucket="my-input-bucket",
                key_prefix="incoming/",
                max_num_elements=500,
            ),
        ],
        target=S3Target(
            endpoint="s3.us-east-2.amazonaws.com",
            access_key="YOUR_ACCESS_KEY",
            secret_key="YOUR_SECRET_KEY",
            bucket="my-output-bucket",
            key_prefix="converted/",
        ),
        output_formats=["md", "json"],
    )

    response = job.result()
    print(f"{response.num_succeeded}/{response.num_converted} written to your bucket")

Progress Callbacks

For long-running batches, you can have the service notify your own endpoint as documents are converted, instead of polling. Add one or more callbacks to the request, each with a url the service will POST progress updates to:

{
  "sources": [ /* ... */ ],
  "target": { "kind": "s3", "endpoint": "...", "access_key": "...", "secret_key": "...", "bucket": "my-output-bucket" },
  "callbacks": [
    { "url": "https://your-app.example.com/docling/progress" }
  ]
}

Progress updates report how many documents have been processed and the per-document status as the batch runs. See Convert Batch for the callback payload details.
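On the receiving side, a callback endpoint only needs to accept POST requests. The following is a minimal sketch using the Python standard library. It assumes the update body is JSON and hands the parsed payload to your own function without interpreting any fields, because the payload schema is described elsewhere. make_handler and make_server are hypothetical names, and replying with 204 is an assumption about what the service accepts.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Any, Callable


def make_handler(on_update: Callable[[Any], None]) -> type:
    """Build a request handler that passes each JSON update to on_update."""

    class ProgressHandler(BaseHTTPRequestHandler):
        def do_POST(self) -> None:
            length = int(self.headers.get("Content-Length", 0))
            raw = self.rfile.read(length)
            try:
                payload = json.loads(raw)  # assumes a JSON body
            except ValueError:
                # Covers malformed JSON and undecodable bytes.
                self.send_response(400)
                self.end_headers()
                return
            on_update(payload)
            self.send_response(204)  # assumed to be an acceptable reply
            self.end_headers()

        def log_message(self, format: str, *args: Any) -> None:
            pass  # silence the default per-request logging

    return ProgressHandler


def make_server(
    on_update: Callable[[Any], None],
    host: str = "127.0.0.1",
    port: int = 8000,
) -> HTTPServer:
    """Create (but do not start) a server for progress updates."""
    return HTTPServer((host, port), make_handler(on_update))


# To run it: make_server(print).serve_forever()
```

For production use, the endpoint must be reachable by the service over HTTPS, as in the example URL, so a handler like this would normally sit behind a TLS-terminating proxy or be replaced by your web framework of choice.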
