When you have hundreds or thousands of documents to process, looping POST /v1/extract/file works but isn’t the right tool: every call holds an HTTP connection open for the duration of the extraction, and orchestrating retries on the client side gets tedious. The async batch lane is built for this — you stage all the files first, hand us a list of file_ids as one batch, and poll a single endpoint for per-document status. Same engine. Same response schema. Same per-page billing. The only difference is how you submit and how you fetch results.

How it works

1. POST /v1/files         (per file)   →  file_id + presigned PUT URL
2. PUT <upload.url>       (per file)   →  bytes go directly to our S3
3. POST /v1/batches       (once)       →  batch_id, status="pending"
4. GET  /v1/batches/{id}  (poll loop)  →  per-item status as workers complete them
5. GET  /v1/batches/{batch_id}/items/{item_id}/result  →  302 to a presigned S3 GET
Three properties worth knowing up front:
  • Bytes never transit our API. POST /v1/files returns a presigned S3 PUT URL; you PUT the file bytes directly to S3. We never see your upload bandwidth.
  • 24h retention. Both uploaded inputs and result blobs auto-expire after 24 hours. Plan to fetch results within that window. (Need longer? Email hello@extract.page.)
  • Idempotency. Pass an Idempotency-Key header on POST /v1/batches; re-submitting the same key within a 24h window returns the same batch_id instead of creating a duplicate batch. Safe to retry without double-billing (see the sketch below).
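
For example, a submit wrapper can retry network failures freely as long as it reuses one key per logical submission. A minimal sketch (the wrapper name, retry count, and backoff are illustrative, not part of the API):

import os, time, uuid

import httpx

API = "https://api.extract.page"
HEADERS = {"X-API-KEY": os.environ["EXTRACT_API_KEY"]}

def submit_batch(file_ids: list[str]) -> dict:
    key = f"batch-{uuid.uuid4()}"  # one key per logical submission, reused on every retry
    for attempt in range(5):
        try:
            resp = httpx.post(
                f"{API}/v1/batches",
                headers={**HEADERS, "Idempotency-Key": key},
                json={"source": {"type": "files", "file_ids": file_ids}},
                timeout=30,
            )
            resp.raise_for_status()
            return resp.json()  # same batch_id on every retry with this key
        except httpx.HTTPError:
            if attempt == 4:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff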

End-to-end example

This example loops over a local directory, uploads everything in parallel, submits one batch, polls until all items reach a terminal state, and writes each result to disk.
import asyncio, os
from pathlib import Path

import httpx

API = "https://api.extract.page"
HEADERS = {"X-API-KEY": os.environ["EXTRACT_API_KEY"]}


async def upload(client: httpx.AsyncClient, path: Path) -> str:
    # Register the file: returns a file_id plus a presigned S3 PUT URL.
    resp = await client.post(
        f"{API}/v1/files",
        json={"filename": path.name, "size_bytes": path.stat().st_size},
    )
    resp.raise_for_status()
    meta = resp.json()
    # PUT the bytes with a bare client so the API key header never reaches S3.
    async with httpx.AsyncClient() as raw:
        put = await raw.put(
            meta["upload"]["url"],
            content=path.read_bytes(),
            headers={"Content-Type": "application/octet-stream"},
            timeout=600,
        )
        put.raise_for_status()
    return meta["id"]


async def main(input_dir: str, output_dir: str) -> None:
    files = sorted(p for p in Path(input_dir).rglob("*.pdf") if p.is_file())
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    async with httpx.AsyncClient(headers=HEADERS, timeout=60) as client:
        # Limit concurrent uploads rather than opening one connection per file.
        sem = asyncio.Semaphore(10)

        async def bound_upload(p: Path) -> str:
            async with sem:
                return await upload(client, p)

        file_ids = await asyncio.gather(*(bound_upload(p) for p in files))

        # Use an Idempotency-Key that stays stable across retries of this run;
        # a fresh timestamp per attempt would defeat the de-duplication.
        resp = await client.post(
            f"{API}/v1/batches",
            headers={"Idempotency-Key": f"my-run-{input_dir}"},
            json={"source": {"type": "files", "file_ids": file_ids}},
        )
        resp.raise_for_status()
        batch = resp.json()
        print("submitted batch", batch["id"], "with", batch["total_items"], "items")

        while True:
            state = (await client.get(f"{API}/v1/batches/{batch['id']}")).json()
            print(state["status"], state["counts"])
            if state["status"] in {"completed", "partially_failed", "failed", "cancelled", "expired"}:
                break
            await asyncio.sleep(3)

        # items is paginated; this assumes the whole batch fits in one page.
        # For larger batches, walk next_cursor (see "Polling cursor" below).
        for item in state["items"]:
            if item["status"] != "succeeded":
                continue
            # result_url 302-redirects to a presigned S3 GET for the JSON blob.
            r = await client.get(
                f"{API}{item['result_url']}", follow_redirects=True
            )
            (Path(output_dir) / f"{item['id']}.json").write_bytes(r.content)


asyncio.run(main("./pdfs", "./extracted"))

Polling cursor

GET /v1/batches/{id} returns paginated items in (updated_at, item_id) ascending order. To poll incrementally — i.e. only fetch items that have changed since your last call — pass back the next_cursor from the previous response as ?cursor=.... Cursors are opaque base64; treat them as strings.
# First page
curl ".../v1/batches/batch_yyy?limit=100"
# → { "items": [...], "next_cursor": "MjAyNi0wNS0wN1QxMjowMDowMC..." }

# Next page (or "what's changed since I last polled")
curl ".../v1/batches/batch_yyy?limit=100&cursor=MjAyNi0wNS0wN1QxMjowMDowMC..."
A typical client polls every 2–5 seconds without a cursor (always seeing the full current state) until terminal, then walks the cursor to drain the final list. For very large batches (10k+ items), pass a cursor so you only get the items that changed.
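
The same drain in Python, reusing API and the client from the end-to-end example above (a sketch; it assumes next_cursor is absent or null on the final page):

async def drain_items(client: httpx.AsyncClient, batch_id: str) -> list[dict]:
    # Walk next_cursor until the server stops returning one.
    items, cursor = [], None
    while True:
        params = {"limit": 100}
        if cursor:
            params["cursor"] = cursor  # opaque base64; pass it back verbatim
        page = (await client.get(f"{API}/v1/batches/{batch_id}", params=params)).json()
        items.extend(page["items"])
        cursor = page.get("next_cursor")
        if not cursor:
            return items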

Status lifecycle

A batch moves through:
pending → running → completed | partially_failed | failed | cancelled | expired
An item moves through:
pending → running → succeeded | failed | cancelled
partially_failed means at least one item failed and at least one succeeded; treat it the same as completed and inspect items[].error.code for the failures. Items don’t get retried for terminal errors — if a document is unsupported (unsupported_input) or too large (page_limit_exceeded / document_too_large), the same item won’t succeed on a re-poll. Submit a new batch with the fixed inputs.

Item error codes

When item.status == "failed", item.error.code is one of:
Code                 Meaning
payment_required     Customer is over their page quota. The whole batch will hit this once it triggers. Top up and re-submit a fresh batch.
unsupported_input    The bytes aren’t a supported format. The filename is advisory; magic bytes drive the decision.
document_too_large   Source bigger than 150 MB.
page_limit_exceeded  Source has more than 1,000 pages.
extraction_failed    Generic extraction error (corrupted PDF, missing fonts, etc.).
ocr_provider_error   Underlying OCR provider was unavailable; we retried up to 3 times before failing.
upload_missing       The presigned PUT URL was never used and the 24h window expired.
internal_error       Unexpected server error. Re-submit the item; contact support if it repeats.
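
After a partially_failed batch, one way to triage failures is to split them into items worth re-submitting as-is and items whose input needs fixing first. A minimal sketch (the grouping is our reading of the descriptions above, not an API-defined property):

RETRYABLE = {"internal_error", "ocr_provider_error"}   # a later run may succeed
FIX_INPUT = {"unsupported_input", "document_too_large",
             "page_limit_exceeded", "upload_missing"}  # same bytes will fail again

def triage(items: list[dict]) -> tuple[list[dict], list[dict]]:
    failed = [i for i in items if i["status"] == "failed"]
    retry = [i for i in failed if i["error"]["code"] in RETRYABLE]
    fix = [i for i in failed if i["error"]["code"] in FIX_INPUT]
    return retry, fix  # payment_required is excluded: top up first, then re-submit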

Cancelling

POST /v1/batches/{id}/cancel flips remaining pending items to cancelled. Items already in running finish on their own (we don’t kill in-flight work). Cancelled items are not billed. The batch’s terminal status will be cancelled if no items succeeded; partially_failed or completed if some did.
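
In the style of the end-to-end example (a sketch; we assume the endpoint returns the updated batch object, so check the response shape against the API reference):

async def cancel_batch(client: httpx.AsyncClient, batch_id: str) -> dict:
    # Remaining pending items flip to cancelled (not billed); running items finish.
    resp = await client.post(f"{API}/v1/batches/{batch_id}/cancel")
    resp.raise_for_status()
    return resp.json()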

Concurrency

Items in a batch run on a worker fleet. By default we cap each customer at 8 concurrent items so one large batch can’t starve other customers. Throughput per customer at this cap is bounded but predictable: ~5s/page on the dots.ocr path × 8 in parallel ≈ ~1.6 pages/sec. Email hello@extract.page if you need higher concurrency for sustained workloads. You don’t manage worker concurrency yourself — the cap is server-side; from your perspective, items simply sit in pending until a worker slot frees up.
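
For rough capacity planning, the quoted figures give a back-of-envelope estimate (a sketch; real throughput varies with document mix):

def estimated_batch_seconds(total_pages: int, secs_per_page: float = 5.0, slots: int = 8) -> float:
    # Ideal pipeline: slots / secs_per_page pages per second (~1.6 at the defaults).
    return total_pages * secs_per_page / slots

# e.g. a 4,000-page batch: 4000 * 5 / 8 = 2,500 s, roughly 42 minutes.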

Limits

Limit                                                     Default
Max files per batch                                       10,000
Max page count per file                                   1,000
Max file size                                             150 MB
Result + upload retention                                 24 hours
Idempotency key dedup window                              24 hours
Submission rate limit (POST /v1/batches, POST /v1/files)  60/min per key

Picking sync vs async

Use sync when:
  • You need the result inline with the request (interactive agent loops, screenshots).
  • The document is small (under 10 pages) and a multi-second response is fine.
  • You’re already at human-perceptible latency on the user side.
Use async batch when:
  • You’re processing more than ~50 documents in a single workflow.
  • You’d otherwise have to write a retry loop around POST /v1/extract/file.
  • The documents are large (long PDFs) and you’d rather poll than hold a connection.
  • You want a billing error to fail fast at submit time instead of after compute has been spent.