When you have hundreds or thousands of documents to process, looping POST /v1/extract/file works but isn’t the right tool: every call holds an HTTP connection open for the duration of the extraction, and orchestrating retries on the client side gets tedious. The async batch lane is built for this — you stage all the files first, hand us a list of file_ids as one batch, and poll a single endpoint for per-document status. Same engine. Same response schema. Same per-page billing. The only difference is how you submit and how you fetch results.

How it works

1. POST /v1/files         (per file)   →  file_id + presigned PUT URL
2. PUT <upload.url>       (per file)   →  bytes go directly to our S3
3. POST /v1/batches       (once)       →  batch_id, status="pending"
4. GET  /v1/batches/{id}  (poll loop)  →  per-item status as workers complete them
5. GET  /v1/batches/{id}/items/{id}/result  →  302 to a presigned S3 GET
A few properties worth knowing up front:
  • Bytes never transit our API. POST /v1/files returns a presigned S3 PUT URL; you PUT the file bytes directly to S3. We never see your upload bandwidth.
  • No separate “confirm upload” step. Hand the file_ids straight to POST /v1/batches once your PUTs return — batch creation re-checks S3 and marks any just-uploaded files ready, so a finished upload never trips a spurious file_not_uploaded. (Want to confirm a single upload landed first? GET /v1/files/{file_id} returns its status.)
  • 3-day retention. Uploaded inputs and result blobs both auto-expire after 3 days. The clock starts at upload (for inputs) or at item completion (for results). (Need longer? Email hello@extract.page — happy to bump it to a week or more on request.)
    • The deadline that matters for inputs is “uploaded → fetched by a worker”, not “uploaded → batch completed”. As soon as a worker pulls the file from S3 it’s safely in memory; the processing itself has no time limit. So a file uploaded Friday morning and submitted Monday morning is fine — but a batch submitted at hour 72:59 may race the S3 lifecycle and fail with upload_missing if the worker doesn’t pick it up in time.
  • Idempotency. Pass an Idempotency-Key header on POST /v1/batches. Re-submitting the same key within the batch’s 3-day window returns the same batch_id instead of creating a duplicate batch, so retries are safe and never double-bill.

End-to-end example

This loops over a local directory, uploads everything in parallel, submits one batch, polls until all items reach a terminal state, and writes each result to disk.
import asyncio, os, time
from pathlib import Path

import httpx

API = "https://api.extract.page"
HEADERS = {"X-API-KEY": os.environ["EXTRACT_API_KEY"]}


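async def upload(client: httpx.AsyncClient, path: Path) -> str:
    # Step 1: register the file and get back a file id plus a presigned PUT URL.
    resp = await client.post(
        f"{API}/v1/files",
        json={"filename": path.name, "size_bytes": path.stat().st_size},
    )
    resp.raise_for_status()
    meta = resp.json()
    # Step 2: PUT the bytes directly to S3. The separate client carries no
    # default headers, so X-API-KEY is not sent with the presigned request.
    async with httpx.AsyncClient() as raw:
        put = await raw.put(
            meta["upload"]["url"],
            content=path.read_bytes(),
            headers={"Content-Type": "application/octet-stream"},
            timeout=600,
        )
        put.raise_for_status()
    return meta["id"]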


async def main(input_dir: str, output_dir: str) -> None:
    files = sorted(p for p in Path(input_dir).rglob("*.pdf") if p.is_file())
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    async with httpx.AsyncClient(headers=HEADERS, timeout=60) as client:
        sem = asyncio.Semaphore(10)
        async def _bound(p: Path) -> str:
            # The semaphore caps the number of uploads in flight at 10.
            async with sem:
                return await upload(client, p)
        file_ids = await asyncio.gather(*[_bound(p) for p in files])

        batch = (await client.post(
            f"{API}/v1/batches",
            headers={"Idempotency-Key": f"my-run-{int(time.time())}"},
            json={"source": {"type": "files", "file_ids": file_ids}},
        )).json()
        print("submitted batch", batch["id"], "with", batch["total_items"], "items")

        while True:
            state = (await client.get(f"{API}/v1/batches/{batch['id']}")).json()
            print(state["status"], state["counts"])
            if state["status"] in {"completed", "partially_failed", "failed", "cancelled", "expired"}:
                break
            await asyncio.sleep(3)

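        # state["items"] is a single page of items; for batches larger than one
        # page, follow next_cursor as described under "Polling cursor".
        for item in state["items"]:
            if item["status"] != "succeeded":
                continue
            # result_url is a relative path; the API answers with a 302 to a presigned S3 GET.
            r = await client.get(
                f"{API}{item['result_url']}", follow_redirects=True
            )
            r.raise_for_status()
            (Path(output_dir) / f"{item['id']}.json").write_bytes(r.content)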


asyncio.run(main("./pdfs", "./extracted"))
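
The example above builds its Idempotency-Key from the current time, so re-running the script after a crash sends a new key and creates a new batch. Below is a minimal sketch of one alternative, assuming the same set of local inputs should always map to the same batch. The helper name idempotency_key and the key format are illustrative choices, not part of the API.

```python
import hashlib
from pathlib import Path


def idempotency_key(paths: list[Path], prefix: str = "my-run") -> str:
    """Derive a stable key from the sorted file paths and sizes (hypothetical helper)."""
    digest = hashlib.sha256()
    for p in sorted(paths):
        # Path plus size is cheap to compute; hash the file bytes instead if
        # same-size edits must also produce a new key.
        digest.update(f"{p.as_posix()}:{p.stat().st_size}\n".encode())
    return f"{prefix}-{digest.hexdigest()[:32]}"
```

Passing headers={"Idempotency-Key": idempotency_key(files)} on POST /v1/batches would then make a re-run within the 3-day window return the original batch_id.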

Polling cursor

GET /v1/batches/{id} returns paginated items in (updated_at, item_id) ascending order. To poll incrementally — i.e. only fetch items that have changed since your last call — pass back the next_cursor from the previous response as ?cursor=.... Cursors are opaque base64; treat them as strings.
# First page
curl ".../v1/batches/batch_yyy?limit=100"
# → { "items": [...], "next_cursor": "MjAyNi0wNS0wN1QxMjowMDowMC..." }

# Next page (or "what's changed since I last polled")
curl ".../v1/batches/batch_yyy?limit=100&cursor=MjAyNi0wNS0wN1QxMjowMDowMC..."
A typical client polls every 2–5 seconds without a cursor (always seeing the full current state) until terminal, then walks the cursor to drain the final list. For very large batches (10k+ items), pass a cursor so you only get the items that changed.
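
Walking the cursor after the batch is terminal can be sketched as a small generator. This is a sketch under assumptions: fetch_page is a hypothetical callable you supply that performs GET /v1/batches/{id} with the given cursor and returns the parsed JSON, and the loop assumes the final page is signalled by an empty items list or a missing next_cursor, which the text above does not specify.

```python
from typing import Callable, Iterator, Optional

# fetch_page(cursor) is expected to call GET /v1/batches/{id}?limit=...&cursor=...
# (omitting cursor when it is None) and return the decoded JSON body.
FetchPage = Callable[[Optional[str]], dict]


def drain_items(fetch_page: FetchPage) -> Iterator[dict]:
    """Yield every item of a terminal batch by following next_cursor page by page."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page.get("next_cursor")
        # Assumption: an empty page or an absent cursor means nothing is left.
        if not page["items"] or not cursor:
            break
```

Because the cursor is opaque, the generator only passes it back unchanged and never inspects it.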

Status lifecycle

A batch moves through:
pending → running → completed | partially_failed | failed | cancelled | expired
An item moves through:
pending → running → succeeded | failed | cancelled
partially_failed means at least one item failed and at least one succeeded; treat it the same as completed and inspect items[].error.code for the failures. Items don’t get retried for terminal errors — if a document is unsupported (unsupported_input) or too large (page_limit_exceeded / document_too_large), the same item won’t succeed on a re-poll. Submit a new batch with the fixed inputs.
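
Treating partially_failed like completed comes down to checking for a terminal batch status and then sorting items by their own status. A minimal sketch, using only the fields shown on this page (status, items, id, error.code); the function name summarize is illustrative.

```python
TERMINAL_BATCH = {"completed", "partially_failed", "failed", "cancelled", "expired"}


def summarize(state: dict) -> dict:
    """Split the items of one batch-status payload into successes and failures."""
    succeeded = [i["id"] for i in state["items"] if i["status"] == "succeeded"]
    failed = {
        i["id"]: i["error"]["code"]  # only failed items carry an error object
        for i in state["items"]
        if i["status"] == "failed"
    }
    return {
        "terminal": state["status"] in TERMINAL_BATCH,
        "succeeded": succeeded,
        "failed": failed,
    }
```

The failed mapping is the list of inputs to fix and submit in a new batch, since re-polling will not change their outcome.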

Errors

Two error surfaces, two shapes — both documented here so batch error handling has a single home. (The global status-code table just indexes the number and links back to this section.)

Request errors

The call was rejected outright — bad file_id, file not uploaded, result not ready. error is a flat string code you can switch on:
{ "error": "file_not_uploaded", "file_ids": ["file_abc123"] }
error | Status | Endpoint | Meaning + fix
file_not_found | 404 | POST /v1/batches | A file_id doesn’t exist for your account, or its 3-day TTL lapsed. The offending ids come back in file_ids.
file_not_uploaded | 409 | POST /v1/batches | A file_id’s bytes aren’t in storage. We re-check S3 on submit, so a finished upload won’t hit this; it means the PUT never completed or the upload expired. Re-upload the ids in file_ids and resubmit.
result_not_ready | 409 | GET …/items/{id}/result | The item hasn’t reached succeeded yet (the status field tells you the current state). Keep polling GET /v1/batches/{id}.

Item errors

These are item failures: the batch call itself succeeded, but a document failed during processing. The shape is nested: item.error is an object, distinct from the flat request-error string above:
{ "error": { "code": "unsupported_input", "message": "…" } }
When item.status == "failed", item.error.code is one of:
Code | Meaning
payment_required | Customer is over their page quota. The whole batch will hit this once it triggers. Top up and re-submit a fresh batch.
unsupported_input | The bytes aren’t a supported format. The filename is advisory; magic bytes drive the decision.
document_too_large | Source bigger than 150 MB.
page_limit_exceeded | Source has more than 1,000 pages.
extraction_failed | Generic extraction error (corrupted PDF, missing fonts, etc.).
ocr_provider_error | Underlying OCR provider was unavailable; we retried up to 3 times before failing.
upload_missing | The file was never uploaded, or it had been uploaded but the 3-day retention window passed before a worker fetched it. Re-upload and re-submit.
internal_error | Unexpected server error. Re-submit the item; contact support if it repeats.
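
Because the two error surfaces use different shapes, a client may want one helper that reads the code from either. A minimal sketch: error_code and the NEXT_STEP labels are illustrative names, and the follow-up suggested for extraction_failed and ocr_provider_error is an assumption, since the table above does not prescribe one.

```python
from typing import Optional

# Follow-ups paraphrased from the table above; entries marked "assumption"
# are not stated there.
NEXT_STEP = {
    "payment_required": "top up, then submit a fresh batch",
    "unsupported_input": "fix the input, then submit a new batch",
    "document_too_large": "fix the input, then submit a new batch",
    "page_limit_exceeded": "fix the input, then submit a new batch",
    "extraction_failed": "inspect the document",          # assumption
    "ocr_provider_error": "submit again later",           # assumption
    "upload_missing": "re-upload, then re-submit",
    "internal_error": "re-submit; contact support if it repeats",
}


def error_code(payload: dict) -> Optional[str]:
    """Return the code from a flat request error or a nested item error."""
    err = payload.get("error")
    if isinstance(err, dict):   # item error: {"error": {"code": ..., "message": ...}}
        return err.get("code")
    return err                  # request error: {"error": "file_not_uploaded", ...}
```

A caller can then look up NEXT_STEP.get(error_code(item)) for failed items and handle request errors such as file_not_uploaded by re-uploading the ids listed in file_ids.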

Cancelling

POST /v1/batches/{id}/cancel flips remaining pending items to cancelled. Items already in running finish on their own (we don’t kill in-flight work). Cancelled items are not billed. The batch’s terminal status will be cancelled if no items succeeded; partially_failed or completed if some did.

Concurrency

Items in a batch run on a worker fleet. By default we cap each customer at 8 concurrent items so one large batch can’t starve other customers. Email hello@extract.page if you need higher concurrency for sustained workloads. You don’t manage worker concurrency yourself; the cap is server-side. From your perspective, items just sit in pending until a worker slot frees up.
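
The cap gives a rough way to reason about how long a batch will take. The arithmetic below is a back-of-envelope sketch, assuming every item takes about the same time and all 8 slots stay busy; the average per-item duration is a number you would have to measure yourself, and the function names are illustrative.

```python
import math

CONCURRENCY_CAP = 8  # default per-customer cap stated above


def estimated_waves(n_items: int, cap: int = CONCURRENCY_CAP) -> int:
    """Number of rounds needed if items run `cap` at a time."""
    return math.ceil(n_items / cap)


def rough_wall_time_s(n_items: int, avg_item_s: float, cap: int = CONCURRENCY_CAP) -> float:
    """Rough batch duration in seconds under the uniform-duration assumption."""
    return estimated_waves(n_items, cap) * avg_item_s
```

For example, 1,000 items at an assumed 12 seconds each come out to 125 waves, or about 25 minutes; real batches will vary with document size.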

Limits

Limit | Default
Max files per batch | 10,000
Max page count per file | 1,000
Max file size | 150 MB
Result + upload retention | 3 days (input clock starts at upload; output clock starts at item completion)
Idempotency key dedup window | 3 days
Submission rate limit (POST /v1/batches, POST /v1/files) | 60/min per key
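
Several of these limits can be checked on the client before anything is uploaded. A minimal preflight sketch, with assumptions stated: "150 MB" is read conservatively as 150,000,000 bytes, the 60/min limit is treated as applying to POST /v1/files calls on their own, and all helper names are illustrative. Page counts are not checked here because that requires parsing the document.

```python
from pathlib import Path
from typing import Iterator, Sequence, TypeVar

MAX_FILES_PER_BATCH = 10_000
MAX_FILE_BYTES = 150 * 1000 * 1000   # assumption: decimal megabytes
SUBMISSIONS_PER_MINUTE = 60

T = TypeVar("T")


def oversized(paths: Sequence[Path]) -> list[Path]:
    """Files that would fail with document_too_large."""
    return [p for p in paths if p.stat().st_size > MAX_FILE_BYTES]


def chunked(items: Sequence[T], size: int = MAX_FILES_PER_BATCH) -> Iterator[Sequence[T]]:
    """Split a list of file_ids into slices that each fit in one batch."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def min_upload_minutes(n_files: int) -> float:
    """Lower bound on time spent calling POST /v1/files at the default rate limit."""
    return n_files / SUBMISSIONS_PER_MINUTE
```

Under these assumptions, registering a full 10,000-file batch takes at least about 167 minutes at the default rate, so a client uploading in parallel may need to pace its POST /v1/files calls rather than rely on a concurrency cap alone.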

Picking sync vs async

Use sync when:
  • You need the result inline with the request (interactive agent loops, screenshots).
  • The document is small (under 10 pages) and a multi-second response is fine.
  • You’re already at human-perceptible latency on the user side.
Use async batch when:
  • You’re processing more than ~50 documents in a single workflow.
  • You’d otherwise have to write a retry loop around POST /v1/extract/file.
  • The documents are large (long PDFs) and you’d rather poll than hold a connection.