When you have hundreds or thousands of documents to process, looping POST /v1/extract/file works but isn't the right tool: every call holds an HTTP connection open for the duration of the extraction, and orchestrating retries on the client side gets tedious. The async batch lane is built for this: you stage all the files first, hand us a list of file_ids as one batch, and poll a single endpoint for per-document status.
Same engine. Same response schema. Same per-page billing. The only difference is how you submit and how you fetch results.
How it works
- Bytes never transit our API. POST /v1/files returns a presigned S3 PUT URL; you PUT the file bytes directly to S3. We never see your upload bandwidth.
- 24h retention. Both uploaded inputs and result blobs auto-expire after 24 hours. Plan to fetch results within that window. (Need longer? Email hello@extract.page.)
- Idempotency. Pass an Idempotency-Key header on POST /v1/batches; re-submitting the same key within a 24h window returns the same batch_id instead of creating a duplicate batch. Safe to retry without double-billing.
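Any unique string works as the key; one convenient pattern is deriving it from the submission contents, so a blind retry of the same submission reuses the same key automatically. A sketch (this helper is ours, not part of the API):

```python
import hashlib

def idempotency_key(file_ids: list[str]) -> str:
    """Derive a stable Idempotency-Key from the batch contents.

    Sorting first means the same set of file_ids always yields the same
    key, regardless of upload order. The "batch-" prefix is cosmetic.
    """
    digest = hashlib.sha256("\n".join(sorted(file_ids)).encode()).hexdigest()
    return f"batch-{digest[:32]}"
```

Send the result as the Idempotency-Key header on POST /v1/batches; within the 24h dedup window a retry gets the original batch_id back.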
End-to-end example
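A sketch of the full flow in Python with requests. The endpoint paths come from this page; response field names (upload_url, file_id, batch_id, result_url) and the item success status name "succeeded" are assumptions:

```python
import concurrent.futures
import pathlib
import time

import requests

API = "https://api.extract.page/v1"            # assumed base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Terminal item states; "succeeded" is an assumed name, the other two
# appear on this page.
TERMINAL = {"succeeded", "failed", "cancelled"}

def is_terminal(status: str) -> bool:
    return status in TERMINAL

def upload(path: pathlib.Path) -> str:
    """Stage one file: POST /v1/files, then PUT the bytes straight to S3."""
    r = requests.post(f"{API}/files", headers=HEADERS, json={"filename": path.name})
    r.raise_for_status()
    meta = r.json()                            # assumed: file_id, upload_url
    requests.put(meta["upload_url"], data=path.read_bytes()).raise_for_status()
    return meta["file_id"]

def run(src_dir: str, out_dir: str) -> None:
    paths = [p for p in pathlib.Path(src_dir).iterdir() if p.is_file()]
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        file_ids = list(pool.map(upload, paths))   # parallel uploads

    r = requests.post(f"{API}/batches", headers=HEADERS,
                      json={"file_ids": file_ids})
    r.raise_for_status()
    batch_id = r.json()["batch_id"]

    done: dict[str, dict] = {}
    while len(done) < len(file_ids):
        time.sleep(5)
        # Single-page poll for brevity; large batches should thread
        # ?cursor= between polls (see Polling cursor below).
        page = requests.get(f"{API}/batches/{batch_id}", headers=HEADERS).json()
        for item in page["items"]:
            if is_terminal(item["status"]):
                done[item["item_id"]] = item

    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for item_id, item in done.items():
        if item["status"] == "succeeded":
            blob = requests.get(item["result_url"]).content
            (out / f"{item_id}.json").write_bytes(blob)
```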
This loops over a local directory, uploads everything in parallel, submits one batch, polls until all items reach a terminal state, and writes each result to disk.
Polling cursor
GET /v1/batches/{id} returns paginated items in (updated_at, item_id) ascending order. To poll incrementally — i.e. only fetch items that have changed since your last call — pass back the next_cursor from the previous response as ?cursor=.... Cursors are opaque base64; treat them as strings.
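An incremental polling helper can thread the cursor for you. The HTTP call is injected so the pagination logic stays testable; the page shape {"items": [...], "next_cursor": "..."} is assumed from the description above:

```python
from typing import Callable, Optional

def fetch_changes(fetch: Callable[[Optional[str]], dict],
                  cursor: Optional[str] = None) -> tuple[list[dict], Optional[str]]:
    """Drain every item changed since `cursor`, returning the new cursor.

    `fetch(cursor)` should GET /v1/batches/{id} (adding ?cursor=... when
    cursor is not None) and return the parsed JSON page. Keep the returned
    cursor and pass it back on your next poll to fetch only newer changes.
    """
    items: list[dict] = []
    while True:
        page = fetch(cursor)
        items.extend(page["items"])
        cursor = page.get("next_cursor", cursor)   # opaque string; never parse it
        if not page["items"]:                      # empty page: nothing newer yet
            return items, cursor
```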
Status lifecycle
A batch moves through pending → running → a terminal state (completed, partially_failed, or cancelled). partially_failed means at least one item failed and at least one succeeded; treat it the same as completed and inspect items[].error.code for the failures. Items don't get retried for terminal errors: if a document is unsupported (unsupported_input) or too large (page_limit_exceeded / document_too_large), the same item won't succeed on a re-poll. Submit a new batch with the fixed inputs.
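Handling a terminal batch, treating partially_failed the same as completed, can look like this (the item success status name "succeeded" is an assumption; only "failed" is confirmed on this page):

```python
def split_results(items: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition a finished batch's items into successes and failures.

    Works identically for completed and partially_failed batches; each
    failed item carries item["error"]["code"] for triage (fix the input,
    submit a new batch).
    """
    ok = [i for i in items if i["status"] == "succeeded"]   # name assumed
    failed = [i for i in items if i["status"] == "failed"]
    return ok, failed
```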
Item error codes
When item.status == "failed", item.error.code is one of:
| Code | Meaning |
|---|---|
| payment_required | Customer is over their page quota. The whole batch will hit this once it triggers. Top up and re-submit a fresh batch. |
| unsupported_input | The bytes aren't a supported format. The filename is advisory; magic bytes drive the decision. |
| document_too_large | Source bigger than 150 MB. |
| page_limit_exceeded | Source has more than 1,000 pages. |
| extraction_failed | Generic extraction error (corrupted PDF, missing fonts, etc.). |
| ocr_provider_error | Underlying OCR provider was unavailable; we retried up to 3 times before failing. |
| upload_missing | The presigned PUT URL was never used and the 24h window expired. |
| internal_error | Unexpected server error. Re-submit the item; contact support if it repeats. |
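One way to triage these codes in client code, separating failures that may clear on a plain re-submit from failures that need a changed input. The grouping is our reading of the table, not an API guarantee:

```python
# Won't succeed on a re-submit of the same bytes: fix the input first.
NEEDS_NEW_INPUT = {
    "unsupported_input", "document_too_large",
    "page_limit_exceeded", "extraction_failed", "upload_missing",
}
# May clear on a fresh batch with identical inputs: payment_required after
# a quota top-up, the other two because they are transient.
MAY_RETRY_AS_IS = {"payment_required", "ocr_provider_error", "internal_error"}

def retryable_as_is(code: str) -> bool:
    """True if re-submitting the same bytes in a new batch is worth a try."""
    return code in MAY_RETRY_AS_IS
```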
Cancelling
POST /v1/batches/{id}/cancel flips remaining pending items to cancelled. Items already in running finish on their own (we don’t kill in-flight work). Cancelled items are not billed. The batch’s terminal status will be cancelled if no items succeeded; partially_failed or completed if some did.
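The terminal-status rule above can be mirrored client-side to predict where a cancelled batch will settle. A sketch; item status names (including "succeeded") are assumptions, and mapping succeeded-plus-cancelled with no failures to completed is our reading of the rule:

```python
def expected_batch_status(items: list[dict]) -> str:
    """Predict the batch's terminal status once a cancel settles."""
    statuses = {i["status"] for i in items}
    if "succeeded" not in statuses:
        return "cancelled"            # no items succeeded
    return "partially_failed" if "failed" in statuses else "completed"
```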
Concurrency
Items in a batch run on a worker fleet. By default we cap each customer at 8 concurrent items so one large batch can't starve other customers. Throughput per customer at this cap is bounded but predictable: ~5s/page on the dots.ocr path × 8 in parallel ≈ ~1.6 pages/sec. Email if you need higher concurrency for sustained workloads. You don't manage worker concurrency yourself; the cap is server-side. From your perspective, items just sit in pending until a worker slot frees up.
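The estimate above is just concurrency divided by per-page latency, which is handy for sizing a batch's wall-clock time (the defaults are the figures from this page; the helpers are ours):

```python
def est_pages_per_sec(concurrency: int = 8, sec_per_page: float = 5.0) -> float:
    # 8 concurrent items x 1 page per ~5 s each ~= 1.6 pages/sec sustained.
    return concurrency / sec_per_page

def est_batch_seconds(total_pages: int, **kwargs) -> float:
    """Rough wall-clock time for a batch, ignoring queueing and upload time."""
    return total_pages / est_pages_per_sec(**kwargs)
```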
Limits
| Limit | Default |
|---|---|
| Max files per batch | 10,000 |
| Max page count per file | 1,000 |
| Max file size | 150 MB |
| Result + upload retention | 24 hours |
| Idempotency key dedup window | 24 hours |
| Submission rate limit (POST /v1/batches, POST /v1/files) | 60/min per key |
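Two of these limits are cheap to check client-side before you burn uploads. Whether the API counts MB as 10^6 or 2^20 bytes is an assumption here, and page count can only be verified server-side unless you parse the file locally:

```python
MAX_FILE_BYTES = 150 * 1024 * 1024     # 150 MB; binary megabytes assumed
MAX_FILES_PER_BATCH = 10_000

def preflight(sizes_by_name: dict[str, int]) -> None:
    """Raise before staging anything the batch API would reject anyway."""
    if len(sizes_by_name) > MAX_FILES_PER_BATCH:
        raise ValueError(
            f"{len(sizes_by_name)} files exceeds {MAX_FILES_PER_BATCH} per batch")
    for name, size in sizes_by_name.items():
        if size > MAX_FILE_BYTES:
            raise ValueError(f"{name}: {size} bytes exceeds {MAX_FILE_BYTES}")
```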
Picking sync vs async
Use sync when:
- You need the result inline with the request (interactive agent loops, screenshots).
- The document is small (under 10 pages) and a multi-second response is fine.
- You're already at human-perceptible latency on the user side.

Use async when:
- You're processing more than ~50 documents in a single workflow.
- You'd otherwise have to write a retry loop around POST /v1/extract/file.
- The documents are large (long PDFs) and you'd rather poll than hold a connection.
- You want a single billing error to fail fast at submit time instead of after spending compute.
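The checklist reduces to a small heuristic if you want to encode it. The ~50-document and ~10-page thresholds are from the bullets above; the helper itself is ours:

```python
def pick_lane(doc_count: int, max_pages: int, need_inline_result: bool) -> str:
    """Sync vs async per the checklist: an inline requirement forces sync;
    scale (many docs) or size (long docs) pushes async; small interactive
    jobs stay sync."""
    if need_inline_result:
        return "sync"
    if doc_count > 50 or max_pages > 10:
        return "async"
    return "sync"
```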