When you have hundreds or thousands of documents to process, looping POST /v1/extract/file works but isn’t the right tool: every call holds an HTTP connection open for the duration of the extraction, and orchestrating retries on the client side gets tedious. The async batch lane is built for this — you stage all the files first, hand us a list of file_ids as one batch, and poll a single endpoint for per-document status. Same engine. Same response schema. Same per-page billing. The only difference is how you submit and how you fetch results.

How it works

1. POST /v1/files         (per file)   →  file_id + presigned PUT URL
2. PUT <upload.url>       (per file)   →  bytes go directly to our S3
3. POST /v1/batches       (once)       →  batch_id, status="pending"
4. GET  /v1/batches/{id}  (poll loop)  →  per-item status as workers complete them
5. GET  /v1/batches/{batch_id}/items/{item_id}/result  →  302 to a presigned S3 GET
Three properties worth knowing up front:
  • Bytes never transit our API. POST /v1/files returns a presigned S3 PUT URL; you PUT the file bytes directly to S3. We never see your upload bandwidth.
  • 24h retention. Both uploaded inputs and result blobs auto-expire after 24 hours. Plan to fetch results within that window. (Need longer? Email hello@extract.page.)
  • Idempotency. Pass an Idempotency-Key header on POST /v1/batches; re-submitting the same key within a 24h window returns the same batch_id instead of creating a duplicate batch. Safe to retry without double-billing (see the sketch below).
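
For example, a submit wrapper can retry network failures freely as long as it reuses one key per logical submission. A minimal sketch (the wrapper name, retry count, and backoff are illustrative, not part of the API):

import os, time, uuid

import httpx

API = "https://api.extract.page"
HEADERS = {"X-API-KEY": os.environ["EXTRACT_API_KEY"]}

def submit_batch(file_ids: list[str]) -> dict:
    key = f"batch-{uuid.uuid4()}"  # one key per logical submission, reused on every retry
    for attempt in range(5):
        try:
            resp = httpx.post(
                f"{API}/v1/batches",
                headers={**HEADERS, "Idempotency-Key": key},
                json={"source": {"type": "files", "file_ids": file_ids}},
                timeout=30,
            )
            resp.raise_for_status()
            return resp.json()  # same batch_id on every retry with this key
        except httpx.HTTPError:
            if attempt == 4:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff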

End-to-end example

This example loops over a local directory, uploads everything in parallel, submits one batch, polls until all items reach a terminal state, and writes each result to disk.
import asyncio, os
from pathlib import Path

import httpx

API = "https://api.extract.page"
HEADERS = {"X-API-KEY": os.environ["EXTRACT_API_KEY"]}


async def upload(client: httpx.AsyncClient, path: Path) -> str:
    # Register the file: returns a file_id plus a presigned S3 PUT URL.
    resp = await client.post(
        f"{API}/v1/files",
        json={"filename": path.name, "size_bytes": path.stat().st_size},
    )
    resp.raise_for_status()
    meta = resp.json()
    # PUT the bytes with a bare client so the API key header never reaches S3.
    async with httpx.AsyncClient() as raw:
        put = await raw.put(
            meta["upload"]["url"],
            content=path.read_bytes(),
            headers={"Content-Type": "application/octet-stream"},
            timeout=600,
        )
        put.raise_for_status()
    return meta["id"]


async def main(input_dir: str, output_dir: str) -> None:
    files = sorted(p for p in Path(input_dir).rglob("*.pdf") if p.is_file())
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    async with httpx.AsyncClient(headers=HEADERS, timeout=60) as client:
        # Limit concurrent uploads rather than opening one connection per file.
        sem = asyncio.Semaphore(10)

        async def bound_upload(p: Path) -> str:
            async with sem:
                return await upload(client, p)

        file_ids = await asyncio.gather(*(bound_upload(p) for p in files))

        # Use an Idempotency-Key that stays stable across retries of this run;
        # a fresh timestamp per attempt would defeat the de-duplication.
        resp = await client.post(
            f"{API}/v1/batches",
            headers={"Idempotency-Key": f"my-run-{input_dir}"},
            json={"source": {"type": "files", "file_ids": file_ids}},
        )
        resp.raise_for_status()
        batch = resp.json()
        print("submitted batch", batch["id"], "with", batch["total_items"], "items")

        while True:
            state = (await client.get(f"{API}/v1/batches/{batch['id']}")).json()
            print(state["status"], state["counts"])
            if state["status"] in {"completed", "partially_failed", "failed", "cancelled", "expired"}:
                break
            await asyncio.sleep(3)

        # items is paginated; this assumes the whole batch fits in one page.
        # For larger batches, walk next_cursor (see "Polling cursor" below).
        for item in state["items"]:
            if item["status"] != "succeeded":
                continue
            # result_url 302-redirects to a presigned S3 GET for the JSON blob.
            r = await client.get(
                f"{API}{item['result_url']}", follow_redirects=True
            )
            (Path(output_dir) / f"{item['id']}.json").write_bytes(r.content)


asyncio.run(main("./pdfs", "./extracted"))

Polling cursor

GET /v1/batches/{id} returns paginated items in (updated_at, item_id) ascending order. To poll incrementally — i.e. only fetch items that have changed since your last call — pass back the next_cursor from the previous response as ?cursor=.... Cursors are opaque base64; treat them as strings.
# First page
curl ".../v1/batches/batch_yyy?limit=100"
# → { "items": [...], "next_cursor": "MjAyNi0wNS0wN1QxMjowMDowMC..." }

# Next page (or "what's changed since I last polled")
curl ".../v1/batches/batch_yyy?limit=100&cursor=MjAyNi0wNS0wN1QxMjowMDowMC..."
A typical client polls every 2–5 seconds without a cursor (always seeing the full current state) until terminal, then walks the cursor to drain the final list. For very large batches (10k+ items), pass a cursor so you only get the items that changed.
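
The same drain in Python, reusing API and the client from the end-to-end example above (a sketch; it assumes next_cursor is absent or null on the final page):

async def drain_items(client: httpx.AsyncClient, batch_id: str) -> list[dict]:
    # Walk next_cursor until the server stops returning one.
    items, cursor = [], None
    while True:
        params = {"limit": 100}
        if cursor:
            params["cursor"] = cursor  # opaque base64; pass it back verbatim
        page = (await client.get(f"{API}/v1/batches/{batch_id}", params=params)).json()
        items.extend(page["items"])
        cursor = page.get("next_cursor")
        if not cursor:
            return items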

Status lifecycle

A batch moves through:
pending → running → completed | partially_failed | failed | cancelled | expired
An item moves through:
pending → running → succeeded | failed | cancelled
partially_failed means at least one item failed and at least one succeeded; treat it the same as completed and inspect items[].error.code for the failures. Items don’t get retried for terminal errors — if a document is unsupported (unsupported_input) or too large (page_limit_exceeded / document_too_large), the same item won’t succeed on a re-poll. Submit a new batch with the fixed inputs.

Item error codes

When item.status == "failed", item.error.code is one of:
Code                 Meaning
payment_required     Customer is over their page quota. The whole batch will hit this once it triggers. Top up and re-submit a fresh batch.
unsupported_input    The bytes aren’t a supported format. The filename is advisory; magic bytes drive the decision.
document_too_large   Source bigger than 150 MB.
page_limit_exceeded  Source has more than 1,000 pages.
extraction_failed    Generic extraction error (corrupted PDF, missing fonts, etc.).
ocr_provider_error   Underlying OCR provider was unavailable; we retried up to 3 times before failing.
upload_missing       The presigned PUT URL was never used and the 24h window expired.
internal_error       Unexpected server error. Re-submit the item; contact support if it repeats.
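
After a partially_failed batch, one way to triage failures is to split them into items worth re-submitting as-is and items whose input needs fixing first. A minimal sketch (the grouping is our reading of the descriptions above, not an API-defined property):

RETRYABLE = {"internal_error", "ocr_provider_error"}   # a later run may succeed
FIX_INPUT = {"unsupported_input", "document_too_large",
             "page_limit_exceeded", "upload_missing"}  # same bytes will fail again

def triage(items: list[dict]) -> tuple[list[dict], list[dict]]:
    failed = [i for i in items if i["status"] == "failed"]
    retry = [i for i in failed if i["error"]["code"] in RETRYABLE]
    fix = [i for i in failed if i["error"]["code"] in FIX_INPUT]
    return retry, fix  # payment_required is excluded: top up first, then re-submit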

Cancelling

POST /v1/batches/{id}/cancel flips remaining pending items to cancelled. Items already in running finish on their own (we don’t kill in-flight work). Cancelled items are not billed. The batch’s terminal status will be cancelled if no items succeeded; partially_failed or completed if some did.
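
In the style of the end-to-end example (a sketch; we assume the endpoint returns the updated batch object, so check the response shape against the API reference):

async def cancel_batch(client: httpx.AsyncClient, batch_id: str) -> dict:
    # Remaining pending items flip to cancelled (not billed); running items finish.
    resp = await client.post(f"{API}/v1/batches/{batch_id}/cancel")
    resp.raise_for_status()
    return resp.json()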

Concurrency

Items in a batch run on a worker fleet. By default we cap each customer at 8 concurrent items so one large batch can’t starve other customers. Throughput per customer at this cap is bounded but predictable: ~5s/page on the dots.ocr path × 8 in parallel ≈ ~1.6 pages/sec. Email hello@extract.page if you need higher concurrency for sustained workloads. You don’t manage worker concurrency yourself — the cap is server-side; from your perspective, items simply sit in pending until a worker slot frees up.
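
For rough capacity planning, the quoted figures give a back-of-envelope estimate (a sketch; real throughput varies with document mix):

def estimated_batch_seconds(total_pages: int, secs_per_page: float = 5.0, slots: int = 8) -> float:
    # Ideal pipeline: slots / secs_per_page pages per second (~1.6 at the defaults).
    return total_pages * secs_per_page / slots

# e.g. a 4,000-page batch: 4000 * 5 / 8 = 2,500 s, roughly 42 minutes.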

Limits

Limit                                                     Default
Max files per batch                                       10,000
Max page count per file                                   1,000
Max file size                                             150 MB
Result + upload retention                                 24 hours
Idempotency key dedup window                              24 hours
Submission rate limit (POST /v1/batches, POST /v1/files)  60/min per key

Picking sync vs async

Use sync when:
  • You need the result inline with the request (interactive agent loops, screenshots).
  • The document is small (under 10 pages) and a multi-second response is fine.
  • You’re already at human-perceptible latency on the user side.
Use async batch when:
  • You’re processing more than ~50 documents in a single workflow.
  • You’d otherwise have to write a retry loop around POST /v1/extract/file.
  • The documents are large (long PDFs) and you’d rather poll than hold a connection.
  • You want a billing error to fail fast at submit time instead of after compute has been spent.