# Batch processing

> Async batch extraction — upload many files, submit one batch, poll for results.

When you have hundreds or thousands of documents to process, looping `POST /v1/extract/file` works but isn't the right tool: every call holds an HTTP connection open for the duration of the extraction, and orchestrating retries on the client side gets tedious. The **async batch lane** is built for this — you stage all the files first, hand us a list of `file_id`s as one batch, and poll a single endpoint for per-document status.

Same engine. Same response schema. Same per-page billing. The only difference is *how* you submit and *how* you fetch results.

## How it works

```
1. POST /v1/files         (per file)   →  file_id + presigned PUT URL
2. PUT <upload.url>       (per file)   →  bytes go directly to our S3
3. POST /v1/batches       (once)       →  batch_id, status="pending"
4. GET  /v1/batches/{id}  (poll loop)  →  per-item status as workers complete them
5. GET  /v1/batches/{id}/items/{id}/result  →  302 to a presigned S3 GET
```

A few properties worth knowing up front:

* **Bytes never transit our API.** `POST /v1/files` returns a presigned S3 PUT URL; you PUT the file bytes directly to S3. We never see your upload bandwidth.
* **No separate "confirm upload" step.** Hand the `file_id`s straight to `POST /v1/batches` once your PUTs return — batch creation re-checks S3 and marks any just-uploaded files ready, so a finished upload never trips a spurious `file_not_uploaded`. (Want to confirm a single upload landed first? `GET /v1/files/{file_id}` returns its `status`.)
* **3-day retention.** Uploaded inputs and result blobs both auto-expire after **3 days**. The clock starts at upload (for inputs) or at item completion (for results). (Need longer? Email [hello@extract.page](mailto:hello@extract.page) — happy to bump it to a week or more on request.)
  * The deadline that matters for inputs is **"uploaded → fetched by a worker"**, not "uploaded → batch completed". As soon as a worker pulls the file from S3 it's safely in memory; the processing itself has no time limit. So a file uploaded Friday morning and submitted Monday morning is fine, but a batch submitted in the last minutes of the 72-hour window may race the S3 lifecycle and fail with `upload_missing` if a worker doesn't pick it up in time.
* **Idempotency.** Pass an `Idempotency-Key` header on `POST /v1/batches`; re-submitting the same key within the 3-day dedup window returns the same `batch_id` instead of creating a duplicate batch. Safe to retry without double-billing.
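
For the idempotency guarantee to help, a retry has to send exactly the same key as the first attempt. One way to get that, sketched here as an assumption rather than a documented convention, is to derive the key from the `file_id`s you are about to submit. `stable_idempotency_key` and the `run` prefix are hypothetical names, not part of the API.

```python
import hashlib


def stable_idempotency_key(file_ids: list[str], prefix: str = "run") -> str:
    """Derive a key that is identical for the same set of file_ids.

    Hypothetical helper: the API only needs retries to reuse the same
    Idempotency-Key string; how you build that string is up to you.
    """
    # Sort first so the key does not depend on upload completion order.
    joined = "\n".join(sorted(file_ids))
    digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()[:32]
    return f"{prefix}-{digest}"


# Example ids are placeholders.
key = stable_idempotency_key(["file_b", "file_a"])
headers = {"Idempotency-Key": key}
```

This only covers retries of the `POST /v1/batches` call itself. If you re-upload the files, you get new `file_id`s and therefore a new key.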

## End-to-end example

This loops over a local directory, uploads everything in parallel, submits one batch, polls until all items reach a terminal state, and writes each result to disk.

<CodeGroup>
  ```python python theme={null}
  import asyncio, os, time
  from pathlib import Path

  import httpx

  API = "https://api.extract.page"
  HEADERS = {"X-API-KEY": os.environ["EXTRACT_API_KEY"]}


  async def upload(client: httpx.AsyncClient, path: Path) -> str:
      meta = (await client.post(
          f"{API}/v1/files",
          json={"filename": path.name, "size_bytes": path.stat().st_size},
      )).json()
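      # Separate client so the API key header is never sent to S3.
      async with httpx.AsyncClient() as raw:
          resp = await raw.put(
              meta["upload"]["url"],
              content=path.read_bytes(),
              headers={"Content-Type": "application/octet-stream"},
              timeout=600,
          )
          resp.raise_for_status()  # fail loudly if the upload did not land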
      return meta["id"]


  async def main(input_dir: str, output_dir: str) -> None:
      files = sorted(p for p in Path(input_dir).rglob("*.pdf") if p.is_file())
      Path(output_dir).mkdir(parents=True, exist_ok=True)

      async with httpx.AsyncClient(headers=HEADERS, timeout=60) as client:
          sem = asyncio.Semaphore(10)
          async def _bound(p: Path) -> str:
              # The semaphore caps concurrent uploads at 10.
              async with sem:
                  return await upload(client, p)
          file_ids = await asyncio.gather(*[_bound(p) for p in files])

          batch = (await client.post(
              f"{API}/v1/batches",
              # Placeholder key: reuse one stable value when retrying the same run.
              headers={"Idempotency-Key": f"my-run-{int(time.time())}"},
              json={"source": {"type": "files", "file_ids": file_ids}},
          )).json()
          print("submitted batch", batch["id"], "with", batch["total_items"], "items")

          while True:
              state = (await client.get(f"{API}/v1/batches/{batch['id']}")).json()
              print(state["status"], state["counts"])
              if state["status"] in {"completed", "partially_failed", "failed", "cancelled", "expired"}:
                  break
              await asyncio.sleep(3)

          for item in state["items"]:
              if item["status"] != "succeeded":
                  continue
              r = await client.get(
                  f"{API}{item['result_url']}", follow_redirects=True
              )
              (Path(output_dir) / f"{item['id']}.json").write_bytes(r.content)


  asyncio.run(main("./pdfs", "./extracted"))
  ```

  ```javascript javascript theme={null}
  import { readFile, stat, mkdir, writeFile, readdir } from "node:fs/promises";
  import { join } from "node:path";

  const API = "https://api.extract.page";
  const HEADERS = { "X-API-KEY": process.env.EXTRACT_API_KEY };

  async function upload(path) {
    const size = (await stat(path)).size;
    const meta = await (await fetch(`${API}/v1/files`, {
      method: "POST",
      headers: { ...HEADERS, "Content-Type": "application/json" },
      body: JSON.stringify({ filename: path.split("/").pop(), size_bytes: size }),
    })).json();
    await fetch(meta.upload.url, { method: "PUT", body: await readFile(path) });
    return meta.id;
  }

  async function main(inputDir, outputDir) {
    await mkdir(outputDir, { recursive: true });
    const entries = await readdir(inputDir);
    const pdfs = entries.filter(f => f.endsWith(".pdf")).map(f => join(inputDir, f));
    const fileIds = await Promise.all(pdfs.map(upload));

    const batch = await (await fetch(`${API}/v1/batches`, {
      method: "POST",
      headers: { ...HEADERS, "Content-Type": "application/json", "Idempotency-Key": `my-run-${Date.now()}` },
      body: JSON.stringify({ source: { type: "files", file_ids: fileIds } }),
    })).json();
    console.log("submitted batch", batch.id, "with", batch.total_items, "items");

    let state = batch;
    const terminal = new Set(["completed", "partially_failed", "failed", "cancelled", "expired"]);
    while (!terminal.has(state.status)) {
      await new Promise(r => setTimeout(r, 3000));
      state = await (await fetch(`${API}/v1/batches/${batch.id}`, { headers: HEADERS })).json();
      console.log(state.status, state.counts);
    }

    for (const item of state.items) {
      if (item.status !== "succeeded") continue;
      const r = await fetch(`${API}${item.result_url}`, { headers: HEADERS, redirect: "follow" });
      await writeFile(join(outputDir, `${item.id}.json`), Buffer.from(await r.arrayBuffer()));
    }
  }

  main("./pdfs", "./extracted");
  ```

  ```bash curl theme={null}
  # 1. Reserve a slot, get a presigned PUT URL.
  curl -sS -X POST https://api.extract.page/v1/files \
    -H "X-API-KEY: $EXTRACT_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"filename": "chart.pdf", "size_bytes": 18342}'
  # → { "id": "file_xxx", "upload": { "url": "https://...?X-Amz-Signature=...", ... } }

  # 2. Upload bytes directly to S3 (bypasses our API).
  #    UPLOAD_URL is the upload.url value returned in step 1.
  curl -X PUT --data-binary @chart.pdf "$UPLOAD_URL"

  # 3. Submit the batch (one or many file_ids).
  curl -sS -X POST https://api.extract.page/v1/batches \
    -H "X-API-KEY: $EXTRACT_API_KEY" \
    -H "Idempotency-Key: my-run-2026-05-07" \
    -H "Content-Type: application/json" \
    -d '{"source": {"type": "files", "file_ids": ["file_xxx"]}}'
  # → { "id": "batch_yyy", "status": "pending", "total_items": 1 }

  # 4. Poll until terminal.
  curl -sS "https://api.extract.page/v1/batches/batch_yyy" -H "X-API-KEY: $EXTRACT_API_KEY"

  # 5. Fetch a successful item's result JSON.
  curl -L "https://api.extract.page/v1/batches/batch_yyy/items/item_zzz/result" \
    -H "X-API-KEY: $EXTRACT_API_KEY"
  ```
</CodeGroup>

## Polling cursor

`GET /v1/batches/{id}` returns paginated items in `(updated_at, item_id)` ascending order. To poll incrementally — i.e. only fetch items that have changed since your last call — pass back the `next_cursor` from the previous response as `?cursor=...`. Cursors are opaque base64; treat them as strings.

```bash theme={null}
# First page
curl ".../v1/batches/batch_yyy?limit=100"
# → { "items": [...], "next_cursor": "MjAyNi0wNS0wN1QxMjowMDowMC..." }

# Next page (or "what's changed since I last polled")
curl ".../v1/batches/batch_yyy?limit=100&cursor=MjAyNi0wNS0wN1QxMjowMDowMC..."
```

A typical client polls every 2–5 seconds without a cursor (always seeing the full current state) until terminal, then walks the cursor to drain the final list. For very large batches (thousands of items, up to the 10,000-file limit), pass a cursor while polling so each call returns only the items that changed.
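
A minimal sketch of walking the cursor, kept independent of any HTTP library: `fetch_page` is a hypothetical callable you supply that performs `GET /v1/batches/{id}?cursor=...` and returns the parsed JSON. The stop condition is an assumption (an empty `items` list or a missing `next_cursor` means nothing more to read right now); check it against real responses before relying on it.

```python
from typing import Callable, Optional


def drain_items(
    fetch_page: Callable[[Optional[str]], dict],
    cursor: Optional[str] = None,
) -> tuple[dict, Optional[str]]:
    """Walk pages from `cursor`; return ({item_id: latest item}, resume_cursor)."""
    latest: dict = {}
    while True:
        page = fetch_page(cursor)
        items = page.get("items", [])
        for item in items:
            # Later pages carry newer updates, so they overwrite earlier ones.
            latest[item["id"]] = item
        next_cursor = page.get("next_cursor")
        if next_cursor:
            cursor = next_cursor
        # Assumed stop condition: nothing new, or no cursor to continue with.
        if not items or not next_cursor:
            return latest, cursor


# Fake pages standing in for real API responses.
fake_pages = {
    None: {"items": [{"id": "a", "status": "running"}], "next_cursor": "c1"},
    "c1": {
        "items": [{"id": "a", "status": "succeeded"}, {"id": "b", "status": "failed"}],
        "next_cursor": "c2",
    },
    "c2": {"items": [], "next_cursor": "c2"},
}
latest, resume_cursor = drain_items(lambda c: fake_pages[c])
```

Keep `resume_cursor` between polls and pass it back in as `cursor` to pick up only what changed since.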

## Status lifecycle

A **batch** moves through:

```
pending → running → completed | partially_failed | failed | cancelled | expired
```

An **item** moves through:

```
pending → running → succeeded | failed | cancelled
```

`partially_failed` means at least one item failed and at least one succeeded; treat it the same as `completed` and inspect `items[].error.code` for the failures. Items don't get retried for terminal errors — if a document is unsupported (`unsupported_input`) or too large (`page_limit_exceeded` / `document_too_large`), the same item won't succeed on a re-poll. Submit a new batch with the fixed inputs.
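
The two lifecycles above can be encoded as plain sets, which keeps the poll loop and the post-processing in agreement about what "done" means. This is a sketch: `summarize` is a hypothetical helper, and it only sees whatever items are in the response you pass it (one page, if the batch is paginated).

```python
from collections import Counter

# Terminal states, copied from the lifecycle diagrams above.
BATCH_TERMINAL = {"completed", "partially_failed", "failed", "cancelled", "expired"}
ITEM_TERMINAL = {"succeeded", "failed", "cancelled"}


def summarize(state: dict) -> dict:
    """Condense a batch response into: done?, counts per status, failure codes."""
    items = state.get("items", [])
    return {
        "done": state["status"] in BATCH_TERMINAL,
        "by_status": dict(Counter(item["status"] for item in items)),
        "failed_codes": {
            item["id"]: item["error"]["code"]
            for item in items
            if item["status"] == "failed"
        },
    }


# Hand-written example response, not real API output.
example_state = {
    "status": "partially_failed",
    "items": [
        {"id": "item_1", "status": "succeeded"},
        {
            "id": "item_2",
            "status": "failed",
            "error": {"code": "unsupported_input", "message": "..."},
        },
    ],
}
summary = summarize(example_state)
```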

## Errors

Two error surfaces, two shapes — both documented here so batch error handling has a single home. (The global [status-code table](/introduction#errors) just indexes the number and links back to this section.)

### Request errors

The call was rejected outright — bad `file_id`, file not uploaded, result not ready. `error` is a flat **string** code you can switch on:

```json theme={null}
{ "error": "file_not_uploaded", "file_ids": ["file_abc123"] }
```

| `error`             | Status | Endpoint                  | Meaning + fix                                                                                                                                                                                                  |
| ------------------- | ------ | ------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `file_not_found`    | `404`  | `POST /v1/batches`        | A `file_id` doesn't exist for your account, or its 3-day TTL lapsed. The offending ids come back in `file_ids`.                                                                                                |
| `file_not_uploaded` | `409`  | `POST /v1/batches`        | A `file_id`'s bytes aren't in storage. We re-check S3 on submit, so a *finished* upload won't hit this — it means the PUT never completed or the upload expired. Re-upload the ids in `file_ids` and resubmit. |
| `result_not_ready`  | `409`  | `GET …/items/{id}/result` | The item hasn't reached `succeeded` yet (the `status` field tells you the current state). Keep polling `GET /v1/batches/{id}`.                                                                                 |
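
Because the request-error code is a flat string, client handling can be a simple switch. The sketch below maps each code in the table to a next step. The action labels and the `next_step` name are made up for illustration, and the `file_not_found` action is one reading of the table, not a documented recovery path.

```python
def next_step(body: dict) -> tuple[str, list[str]]:
    """Map a request-error body to (action, affected file_ids)."""
    code = body.get("error")
    file_ids = body.get("file_ids", [])
    if not isinstance(code, str):
        # A nested object here is an item error, not a request error.
        return ("not_a_request_error", [])
    if code == "file_not_uploaded":
        return ("reupload_and_resubmit", file_ids)
    if code == "file_not_found":
        # Unknown id or lapsed TTL: register and upload these files again.
        return ("reregister_files", file_ids)
    if code == "result_not_ready":
        return ("keep_polling", [])
    return ("unknown", file_ids)


action, affected = next_step(
    {"error": "file_not_uploaded", "file_ids": ["file_abc123"]}
)
```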

### Item errors

These are *item* failures: the batch call itself succeeded, but a document failed during processing. The shape is **nested** — `item.error` is an object, distinct from the flat request-error string above:

```json theme={null}
{ "error": { "code": "unsupported_input", "message": "…" } }
```

When `item.status == "failed"`, `item.error.code` is one of:

| Code                  | Meaning                                                                                                                                         |
| --------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| `payment_required`    | Your account is over its page quota. The whole batch will hit this once it triggers. Top up and re-submit a fresh batch.                        |
| `unsupported_input`   | The bytes aren't a supported format. The filename is advisory; magic bytes drive the decision.                                                  |
| `document_too_large`  | Source bigger than 150 MB.                                                                                                                      |
| `page_limit_exceeded` | Source has more than 1,000 pages.                                                                                                               |
| `extraction_failed`   | Generic extraction error (corrupted PDF, missing fonts, etc.).                                                                                  |
| `ocr_provider_error`  | Underlying OCR provider was unavailable; we retried up to 3 times before failing.                                                               |
| `upload_missing`      | The file was never uploaded, or it had been uploaded but the 3-day retention window passed before a worker fetched it. Re-upload and re-submit. |
| `internal_error`      | Unexpected server error. Re-submit the item; contact support if it repeats.                                                                     |
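
After a batch finishes, it can help to group failed items by what you would do about them. The grouping below is a sketch built from the table above. The action names are hypothetical, and treating `ocr_provider_error` as worth re-submitting is an assumption (the table only says the provider was unavailable after three retries).

```python
# Suggested follow-up per item error code. Labels are illustrative only.
ACTION_BY_CODE = {
    "payment_required": "top_up_then_resubmit",
    "unsupported_input": "fix_input",
    "document_too_large": "fix_input",
    "page_limit_exceeded": "fix_input",
    "extraction_failed": "inspect_document",
    "ocr_provider_error": "resubmit",  # assumption: likely transient
    "upload_missing": "reupload_then_resubmit",
    "internal_error": "resubmit",
}


def triage(items: list[dict]) -> dict[str, list[str]]:
    """Group failed item ids by suggested action; other items are skipped."""
    plan: dict[str, list[str]] = {}
    for item in items:
        if item.get("status") != "failed":
            continue
        code = item.get("error", {}).get("code")
        action = ACTION_BY_CODE.get(code, "unknown")
        plan.setdefault(action, []).append(item["id"])
    return plan


# Hand-written example items, not real API output.
plan = triage([
    {"id": "item_1", "status": "succeeded"},
    {"id": "item_2", "status": "failed", "error": {"code": "internal_error"}},
    {"id": "item_3", "status": "failed", "error": {"code": "page_limit_exceeded"}},
    {"id": "item_4", "status": "failed", "error": {"code": "upload_missing"}},
])
```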

## Cancelling

`POST /v1/batches/{id}/cancel` flips remaining `pending` items to `cancelled`. Items already in `running` finish on their own (we don't kill in-flight work). Cancelled items are not billed. The batch's terminal status will be `cancelled` if no items succeeded; `partially_failed` or `completed` if some did.
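
A small sketch of building the cancel call with only the standard library. It assumes the endpoint takes no request body and uses the same `X-API-KEY` header as the other examples on this page; `build_cancel_request` is a hypothetical helper name.

```python
import urllib.request
from urllib.parse import quote

API = "https://api.extract.page"


def build_cancel_request(batch_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST that cancels a batch."""
    # Percent-encode the id so it cannot alter the URL path.
    url = f"{API}/v1/batches/{quote(batch_id, safe='')}/cancel"
    return urllib.request.Request(
        url, method="POST", headers={"X-API-KEY": api_key}
    )


# Placeholder id and key.
req = build_cancel_request("batch_yyy", "test-key")
```

Send it with `urllib.request.urlopen(req)`, then keep polling `GET /v1/batches/{id}`: items that were already `running` still finish on their own.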

## Concurrency

Items in a batch run on a worker fleet. By default we cap each customer at **8 concurrent items** so one large batch can't starve other customers. Email [hello@extract.page](mailto:hello@extract.page) if you need higher concurrency for sustained workloads.

You don't manage worker concurrency yourself — the cap is server-side. From your perspective, items just sit in `pending` until a worker slot frees up.
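
For rough planning, the cap turns into simple arithmetic: items run in waves of at most 8. The sketch below assumes every item takes about the same time, which real documents won't, and the per-item duration is a number you have to measure yourself. Treat the result as a ballpark, not a guarantee.

```python
import math


def estimated_wall_seconds(
    n_items: int, seconds_per_item: float, concurrency: int = 8
) -> float:
    """Ballpark batch duration if items run in equal-length waves."""
    waves = math.ceil(n_items / concurrency)
    return waves * seconds_per_item


# 1,000 items at a made-up 12 seconds each: 125 waves of 8.
estimate = estimated_wall_seconds(1000, 12)
```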

## Limits

| Limit                                                        | Default                                                                       |
| ------------------------------------------------------------ | ----------------------------------------------------------------------------- |
| Max files per batch                                          | 10,000                                                                        |
| Max page count per file                                      | 1,000                                                                         |
| Max file size                                                | 150 MB                                                                        |
| Result + upload retention                                    | 3 days (input clock starts at upload; output clock starts at item completion) |
| Idempotency key dedup window                                 | 3 days                                                                        |
| Submission rate limit (`POST /v1/batches`, `POST /v1/files`) | 60/min per key                                                                |
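
Several of these limits can be checked locally before you upload anything. The sketch below is a hypothetical pre-flight helper; it treats "150 MB" as 150,000,000 bytes (an assumption, chosen as the stricter reading) and does not check page counts, since that needs a PDF parser. The second function estimates how long registering files takes if every `POST /v1/files` call counts against the 60/min limit.

```python
import math

MAX_FILES_PER_BATCH = 10_000
MAX_FILE_BYTES = 150_000_000  # assumption: "150 MB" read as decimal megabytes
SUBMISSIONS_PER_MINUTE = 60


def preflight(sizes_by_name: dict[str, int]) -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []
    if len(sizes_by_name) > MAX_FILES_PER_BATCH:
        problems.append(
            f"{len(sizes_by_name)} files exceeds the {MAX_FILES_PER_BATCH}-file batch limit"
        )
    for name, size in sizes_by_name.items():
        if size > MAX_FILE_BYTES:
            problems.append(f"{name}: {size} bytes exceeds the 150 MB file limit")
    return problems


def min_registration_minutes(n_files: int) -> int:
    """Lower bound on minutes needed to call POST /v1/files n_files times."""
    return math.ceil(n_files / SUBMISSIONS_PER_MINUTE)


# File names and sizes are placeholders.
problems = preflight({"small.pdf": 18_342, "huge.pdf": 200_000_000})
minutes_for_full_batch = min_registration_minutes(10_000)
```

At the default rate limit, registering a full 10,000-file batch takes at least 167 minutes, so for large runs it may be worth pacing the upload step to that limit rather than only capping concurrency.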

## Picking sync vs async

Use **sync** when:

* You need the result inline with the request (interactive agent loops, screenshots).
* The document is small (under 10 pages) and a multi-second response is fine.
* You're already at human-perceptible latency on the user side.

Use **async batch** when:

* You're processing more than \~50 documents in a single workflow.
* You'd otherwise have to write a retry loop around `POST /v1/extract/file`.
* The documents are large (long PDFs) and you'd rather poll than hold a connection.
