POST /v1/extract/file works but isn’t the right tool: every call holds an HTTP connection open for the duration of the extraction, and orchestrating retries on the client side gets tedious. The async batch lane is built for this — you stage all the files first, hand us a list of file_ids as one batch, and poll a single endpoint for per-document status.
Same engine. Same response schema. Same per-page billing. The only difference is how you submit and how you fetch results.
How it works
- Bytes never transit our API. POST /v1/files returns a presigned S3 PUT URL; you PUT the file bytes directly to S3. We never see your upload bandwidth.
- No separate “confirm upload” step. Hand the file_ids straight to POST /v1/batches once your PUTs return: batch creation re-checks S3 and marks any just-uploaded files ready, so a finished upload never trips a spurious file_not_uploaded. (Want to confirm a single upload landed first? GET /v1/files/{file_id} returns its status.)
- 3-day retention. Uploaded inputs and result blobs both auto-expire after 3 days. The clock starts at upload (for inputs) or at item completion (for results). (Need longer? Email hello@extract.page and we can bump it to a week or more on request.)
  - The deadline that matters for inputs is “uploaded → fetched by a worker”, not “uploaded → batch completed”. As soon as a worker pulls the file from S3 it’s safely in memory; the processing itself has no time limit. So a file uploaded Friday morning and submitted Monday morning is fine, but a batch submitted at hour 72:59 may race the S3 lifecycle and fail with upload_missing if the worker doesn’t pick it up in time.
- Idempotency. Pass an Idempotency-Key header on POST /v1/batches; re-submitting the same key within the batch’s 3-day window returns the same batch_id instead of creating a duplicate batch. Safe to retry without double-billing.
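One way to make retries safe without tracking keys by hand is to derive the Idempotency-Key from the batch contents. The sketch below assumes the request body carries the ids as a JSON list under file_ids (this guide confirms that field name only for error responses) and that any opaque string is accepted as a key. batch_request is a hypothetical helper, not part of the API.

```python
import hashlib
import json


def batch_request(file_ids):
    """Build headers and body for POST /v1/batches (hypothetical helper)."""
    # Sort before hashing so the same set of files always yields the same key,
    # regardless of the order in which the uploads finished.
    joined = "\n".join(sorted(file_ids))
    key = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    headers = {
        "Idempotency-Key": key,
        "Content-Type": "application/json",
    }
    # Assumption: the ids are sent as a JSON list under "file_ids".
    body = json.dumps({"file_ids": list(file_ids)})
    return headers, body
```

A content-derived key means a crashed client can rebuild the identical request and get the original batch_id back. If you intentionally want to reprocess the same files within the 3-day window, mix a run identifier into the hashed string.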
End-to-end example
This loops over a local directory, uploads everything in parallel, submits one batch, polls until all items reach a terminal state, and writes each result to disk.
Polling cursor
GET /v1/batches/{id} returns paginated items in (updated_at, item_id) ascending order. To poll incrementally — i.e. only fetch items that have changed since your last call — pass back the next_cursor from the previous response as ?cursor=.... Cursors are opaque base64; treat them as strings.
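Incremental polling could be sketched as follows. The HTTP call is left to a caller-supplied fetch_page(cursor) function, which should perform GET /v1/batches/{id} (adding ?cursor=... when a cursor is given) and return the decoded JSON. The items and next_cursor field names come from this guide; treating an empty page or a missing next_cursor as "caught up" is an assumption, and drain_changes is a hypothetical name.

```python
def drain_changes(fetch_page, cursor=None):
    """Collect every item that changed since `cursor`.

    Returns (changed_items, new_cursor). Store new_cursor and pass it back
    on the next poll so only newer changes are fetched.
    """
    changed = []
    while True:
        page = fetch_page(cursor)
        items = page.get("items", [])
        changed.extend(items)
        next_cursor = page.get("next_cursor")
        if next_cursor:
            cursor = next_cursor  # opaque string: never parse or modify it
        # Assumption: an empty page or no next_cursor means we are caught up.
        if not items or not next_cursor:
            break
    return changed, cursor
```

Between polls, keep only the returned cursor; the item list itself can be merged into a local dict keyed by item_id.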
Status lifecycle
A batch moves through:
partially_failed means at least one item failed and at least one succeeded; treat it the same as completed and inspect items[].error.code for the failures. Items don’t get retried for terminal errors: if a document is unsupported (unsupported_input) or too large (page_limit_exceeded / document_too_large), the same item won’t succeed on a re-poll. Submit a new batch with the fixed inputs.
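Treating partially_failed like completed might look like the sketch below. It assumes the batch response exposes a top-level status next to items, and that each item carries item_id, status, and (when failed) error.code. The three terminal batch statuses are the ones this guide names; failed_items is a hypothetical helper.

```python
TERMINAL_BATCH_STATUSES = {"completed", "partially_failed", "cancelled"}


def failed_items(batch):
    """Return (item_id, error_code) pairs once the batch is done, else None."""
    if batch["status"] not in TERMINAL_BATCH_STATUSES:
        return None  # still in progress: keep polling
    return [
        (item["item_id"], item["error"]["code"])
        for item in batch["items"]
        if item["status"] == "failed"
    ]
```

Because items are paginated, a real caller would run this over the full item list gathered across pages rather than a single response.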
Errors
Two error surfaces, two shapes, both documented here so batch error handling has a single home. (The global status-code table just indexes the number and links back to this section.)
Request errors
The call was rejected outright: bad file_id, file not uploaded, result not ready. error is a flat string code you can switch on:
| error | Status | Endpoint | Meaning + fix |
|---|---|---|---|
| file_not_found | 404 | POST /v1/batches | A file_id doesn’t exist for your account, or its 3-day TTL lapsed. The offending ids come back in file_ids. |
| file_not_uploaded | 409 | POST /v1/batches | A file_id’s bytes aren’t in storage. We re-check S3 on submit, so a finished upload won’t hit this; it means the PUT never completed or the upload expired. Re-upload the ids in file_ids and resubmit. |
| result_not_ready | 409 | GET …/items/{id}/result | The item hasn’t reached succeeded yet (the status field tells you the current state). Keep polling GET /v1/batches/{id}. |
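Switching on the flat error string could be sketched like this. The error and file_ids fields are the ones described in the table; the action labels and the next_step name are hypothetical, and "restage" for file_not_found (staging the file again via POST /v1/files) is this sketch's reading of the fix rather than something the table states.

```python
def next_step(body):
    """Map a rejected request's JSON body to (action, affected_file_ids)."""
    code = body.get("error")
    if code == "file_not_found":
        # Unknown or expired ids: stage and upload those files again.
        return "restage", body.get("file_ids", [])
    if code == "file_not_uploaded":
        # The PUT never completed or expired: re-upload, then resubmit.
        return "reupload", body.get("file_ids", [])
    if code == "result_not_ready":
        return "keep_polling", []
    # Anything else is outside this table; surface it to the caller.
    return "raise", []
```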
Item errors
These are item failures: the batch call itself succeeded, but a document failed during processing. The shape is nested: item.error is an object, distinct from the flat request-error string above.
When item.status == "failed", item.error.code is one of:
| Code | Meaning |
|---|---|
| payment_required | Customer is over their page quota. The whole batch will hit this once it triggers. Top up and re-submit a fresh batch. |
| unsupported_input | The bytes aren’t a supported format. The filename is advisory; magic bytes drive the decision. |
| document_too_large | Source bigger than 150 MB. |
| page_limit_exceeded | Source has more than 1,000 pages. |
| extraction_failed | Generic extraction error (corrupted PDF, missing fonts, etc.). |
| ocr_provider_error | Underlying OCR provider was unavailable; we retried up to 3 times before failing. |
| upload_missing | The file was never uploaded, or it had been uploaded but the 3-day retention window passed before a worker fetched it. Re-upload and re-submit. |
| internal_error | Unexpected server error. Re-submit the item; contact support if it repeats. |
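After a batch finishes, the failed items can be grouped by what to do next. The codes below are the ones in the table; the action labels are this sketch's own grouping, and the table does not prescribe a fix for extraction_failed or ocr_provider_error, so those two mappings are assumptions. triage is a hypothetical helper.

```python
ACTIONS = {
    "payment_required": "top_up_then_resubmit",
    "unsupported_input": "fix_input",
    "document_too_large": "fix_input",
    "page_limit_exceeded": "fix_input",
    "extraction_failed": "inspect_document",  # assumption: no fix is documented
    "ocr_provider_error": "resubmit_later",   # assumption: no fix is documented
    "upload_missing": "reupload_then_resubmit",
    "internal_error": "resubmit",
}


def triage(items):
    """Group failed item ids by suggested follow-up action."""
    groups = {}
    for item in items:
        if item.get("status") != "failed":
            continue
        code = item["error"]["code"]  # nested object, not a flat string
        groups.setdefault(ACTIONS.get(code, "unknown"), []).append(item["item_id"])
    return groups
```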
Cancelling
POST /v1/batches/{id}/cancel flips remaining pending items to cancelled. Items already in running finish on their own (we don’t kill in-flight work). Cancelled items are not billed. The batch’s terminal status will be cancelled if no items succeeded; partially_failed or completed if some did.
Concurrency
Items in a batch run on a worker fleet. By default we cap each customer at 8 concurrent items so one large batch can’t starve other customers. Email if you need higher concurrency for sustained workloads. You don’t manage worker concurrency yourself; the cap is server-side. From your perspective, items just sit in pending until a worker slot frees up.
Limits
| Limit | Default |
|---|---|
| Max files per batch | 10,000 |
| Max page count per file | 1,000 |
| Max file size | 150 MB |
| Result + upload retention | 3 days (input clock starts at upload; output clock starts at item completion) |
| Idempotency key dedup window | 3 days |
| Submission rate limit (POST /v1/batches, POST /v1/files) | 60/min per key |
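A client-side pre-flight check against these limits can save a round trip. This is a minimal sketch, assuming "150 MB" means 150,000,000 bytes (the conservative reading). plan_batches is a hypothetical helper, and page counts are not checked here because that would require parsing each document.

```python
MAX_FILES_PER_BATCH = 10_000
MAX_FILE_BYTES = 150 * 1000 * 1000  # assumption: decimal megabytes


def plan_batches(files):
    """Split (name, size_in_bytes) pairs into submit-ready batches.

    Returns (batches, rejected): batches is a list of name lists, each within
    the per-batch file limit; rejected holds names over the size limit.
    """
    accepted, rejected = [], []
    for name, size in files:
        if size > MAX_FILE_BYTES:
            rejected.append(name)
        else:
            accepted.append(name)
    batches = [
        accepted[i:i + MAX_FILES_PER_BATCH]
        for i in range(0, len(accepted), MAX_FILES_PER_BATCH)
    ]
    return batches, rejected
```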
Picking sync vs async
Use sync when:
- You need the result inline with the request (interactive agent loops, screenshots).
- The document is small (under 10 pages) and a multi-second response is fine.
- You’re already at human-perceptible latency on the user side.
Use async when:
- You’re processing more than ~50 documents in a single workflow.
- You’d otherwise have to write a retry loop around POST /v1/extract/file.
- The documents are large (long PDFs) and you’d rather poll than hold a connection.
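The criteria above can be folded into a small routing function. This is one possible reading, not an official rule: the thresholds (50 documents, 10 pages) come from the lists, while the order in which the checks apply and the pick_lane name are assumptions.

```python
def pick_lane(doc_count, max_pages, needs_inline_result):
    """Return "sync" or "async" for a workload (hypothetical helper)."""
    if needs_inline_result:
        return "sync"   # interactive callers need the result in the response
    if doc_count > 50:
        return "async"  # large workflows are what the batch lane is built for
    if max_pages < 10:
        return "sync"   # small documents: a multi-second response is fine
    return "async"      # longer documents: poll instead of holding a connection
```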