# Extract: Parse documents into structured data

> Text, tables, and figures in one call, at least 2x faster than other parsers. Parsing \$3 per 1,000 pages; schema extraction \$12 per 1,000 pages.

Extract turns documents into text chunks (with page numbers and bounding boxes), tables (structured cells plus a markdown rendering), and any figures on the page. Two surfaces, one pipeline:

* **Sync** — `POST /v1/extract` and `POST /v1/extract/file`. One document in, results back in seconds. Best for real-time agent loops, screenshots, and small batches.
* **Async batch** — `POST /v1/files` + `POST /v1/batches`. Upload many files, submit them as one batch, poll for results. Best for bulk extraction workloads where you don't want to manage thousands of synchronous requests. See **[Batch processing](/guides/batch)**.

Both surfaces share the same extraction engine and the same response schema, so a chunk you get from a sync call looks identical to a chunk you get from a batch item.

## Sync quickstart

Create an API key at [extract.page/dashboard](https://extract.page/dashboard), then pick the ingress path that matches where your document lives:

<Tabs>
  <Tab title="By URL">
    `POST /v1/extract` — JSON body with a `url`. Use this when the document is already reachable over HTTP (S3 presigned URL, public doc, CDN).

    <CodeGroup>
      ```bash curl theme={null}
      curl -X POST https://api.extract.page/v1/extract \
        -H "X-API-KEY: $EXTRACT_API_KEY" \
        -H "Content-Type: application/json" \
        -d '{
          "url": "https://arxiv.org/pdf/1706.03762.pdf"
        }'
      ```

      ```python python theme={null}
      import os, requests

      r = requests.post(
          "https://api.extract.page/v1/extract",
          headers={"X-API-KEY": os.environ["EXTRACT_API_KEY"]},
          json={"url": "https://arxiv.org/pdf/1706.03762.pdf"},
          timeout=120,
      )
      r.raise_for_status()
      doc = r.json()
      print(len(doc["chunks"]), "chunks")
      ```

      ```javascript javascript theme={null}
      const r = await fetch("https://api.extract.page/v1/extract", {
        method: "POST",
        headers: {
          "X-API-KEY": process.env.EXTRACT_API_KEY,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({ url: "https://arxiv.org/pdf/1706.03762.pdf" }),
      });
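      // Fail loudly on non-2xx responses instead of reading `chunks` off an error body.
      if (!r.ok) throw new Error(`Extract request failed: HTTP ${r.status}`);
      const doc = await r.json();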
      console.log(doc.chunks.length, "chunks");
      ```
    </CodeGroup>
  </Tab>

  <Tab title="By upload">
    `POST /v1/extract/file` — `multipart/form-data` upload. Use this when you already have the document bytes (agent output, local file, webhook payload) — no need to stage them at a public URL first.

    <CodeGroup>
      ```bash curl theme={null}
      curl -X POST https://api.extract.page/v1/extract/file \
        -H "X-API-KEY: $EXTRACT_API_KEY" \
        -F "file=@paper.pdf"
      ```

      ```python python theme={null}
      import os, requests

      with open("paper.pdf", "rb") as f:
          r = requests.post(
              "https://api.extract.page/v1/extract/file",
              headers={"X-API-KEY": os.environ["EXTRACT_API_KEY"]},
              files={"file": ("paper.pdf", f, "application/pdf")},
              timeout=120,
          )
      r.raise_for_status()
      doc = r.json()
      print(len(doc["chunks"]), "chunks")
      ```

      ```javascript javascript theme={null}
      import { readFileSync } from "node:fs";

      const form = new FormData();
      form.set("file", new Blob([readFileSync("paper.pdf")]), "paper.pdf");

      const r = await fetch("https://api.extract.page/v1/extract/file", {
        method: "POST",
        headers: { "X-API-KEY": process.env.EXTRACT_API_KEY },
        body: form,
      });
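      // Fail loudly on non-2xx responses instead of reading `chunks` off an error body.
      if (!r.ok) throw new Error(`Extract request failed: HTTP ${r.status}`);
      const doc = await r.json();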
      console.log(doc.chunks.length, "chunks");
      ```
    </CodeGroup>

    The uploaded filename is advisory — dispatch is driven by the file's magic bytes. A `.docx` filename with PDF bytes is treated as a PDF. The multipart form accepts the same `extract_text`, `extract_images`, and `ocr` fields as the JSON route (send them as individual form fields).
  </Tab>
</Tabs>

## Response shape

```json theme={null}
{
  "chunks": [
    {
      "page_content": "Attention Is All You Need",
      "page_no": 1,
      "bbox": [176.6, 88.7, 438.3, 107.2],
      "chunk_type": "text"
    },
    {
      "page_content": "| Model | BLEU |\n|---|---|\n| Transformer | 28.4 |",
      "page_no": 3,
      "bbox": [110.0, 200.4, 500.0, 320.1],
      "chunk_type": "table",
      "n_rows": 2,
      "n_cols": 2,
      "cells": [
        { "text": "Model", "row": 0, "col": 0 },
        { "text": "BLEU", "row": 0, "col": 1 },
        { "text": "Transformer", "row": 1, "col": 0 },
        { "text": "28.4", "row": 1, "col": 1 }
      ]
    },
    {
      "page_content": "",
      "page_no": 4,
      "bbox": [108.0, 281.4, 504.0, 531.4],
      "chunk_type": "image",
      "image_url": "https://...",
      "image_mime": "image/webp",
      "image_width": 1188,
      "image_height": 750
    }
  ]
}
```

`chunks` is a flat array in reading order. Each chunk is one of:

* **`text`** — a contiguous run of glyphs with the same font and size. Not a paragraph; a single paragraph typically spans several text chunks.
* **`table`** — a table. `cells` is the structured representation (0-based `row`/`col`, with `row_span`/`col_span` for merged cells); `page_content` carries a markdown rendering so plain-text consumers still get readable output.
* **`image`** — a figure extracted from the page, delivered as a URL or inline base64.

Fields:

* `bbox` — `[x0, y0, x1, y1]` in PDF user-space points.
* `confidence` — 0–100, present on OCR'd content (text chunks and table cells); native-text spans omit it.
* `cells` / `n_rows` / `n_cols` — populated on table chunks. Each cell is `{ text, row, col, row_span, col_span, bbox, confidence, page_no }`.
* `merged_from_pages` — present when a table spanning a page break was assembled into one chunk (1-based page numbers, ascending); each cell's `page_no` carries its own source page.
* `image_url` / `image_mime` — populated on image chunks (URL backend).
* `image_b64` — populated on image chunks (inline backend).
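
As a rough illustration of consuming this shape, the sketch below groups chunks by `chunk_type` and rebuilds a row-major grid from a table chunk's `cells`. The helper names (`group_by_type`, `table_to_grid`) are hypothetical and not part of any Extract SDK; the sample data is trimmed from the response above. Merged cells are handled naively here: only the anchor position receives the text.

```python
from collections import defaultdict


def group_by_type(chunks):
    """Bucket chunks by their `chunk_type` ("text", "table", "image")."""
    grouped = defaultdict(list)
    for chunk in chunks:
        grouped[chunk["chunk_type"]].append(chunk)
    return dict(grouped)


def table_to_grid(chunk):
    """Rebuild a row-major grid of cell text from a table chunk.

    Assumption: for merged cells only the anchor (row, col) is filled;
    positions covered by row_span/col_span stay empty strings.
    """
    grid = [[""] * chunk["n_cols"] for _ in range(chunk["n_rows"])]
    for cell in chunk["cells"]:
        grid[cell["row"]][cell["col"]] = cell["text"]
    return grid


# Sample trimmed from the response shown above.
doc = {
    "chunks": [
        {"page_content": "Attention Is All You Need", "page_no": 1, "chunk_type": "text"},
        {
            "page_content": "| Model | BLEU |\n|---|---|\n| Transformer | 28.4 |",
            "page_no": 3,
            "chunk_type": "table",
            "n_rows": 2,
            "n_cols": 2,
            "cells": [
                {"text": "Model", "row": 0, "col": 0},
                {"text": "BLEU", "row": 0, "col": 1},
                {"text": "Transformer", "row": 1, "col": 0},
                {"text": "28.4", "row": 1, "col": 1},
            ],
        },
    ]
}

grouped = group_by_type(doc["chunks"])
grid = table_to_grid(grouped["table"][0])
print(grid)  # [['Model', 'BLEU'], ['Transformer', '28.4']]
```

Plain-text consumers can skip `cells` entirely and read the markdown rendering in `page_content`.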

Billing is per page: **\$3 per 1,000 pages** for parsing (`/v1/extract`, `/v1/extract/file`), and **\$12 per 1,000 pages** for [schema extraction](/guides/schema-extraction) (`/v1/extract/schema`). Schema extraction is a single all-in charge that **includes the parse** — it is **\$12 total, not billed on top of** the \$3 parse. Remaining balance is visible in your [dashboard](https://extract.page/dashboard).
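
To make the arithmetic concrete, here is a small sketch that estimates cost from a page count using the two rates above. It assumes charges scale linearly with pages and ignores any rounding or minimums, which this page does not specify; `estimate_cost_usd` is a hypothetical helper.

```python
from decimal import Decimal

# Rates quoted above, in USD per 1,000 pages.
PARSE_RATE = Decimal("3")
SCHEMA_RATE = Decimal("12")  # all-in: already includes the parse


def estimate_cost_usd(pages, schema=False):
    """Estimate cost for `pages` pages, assuming simple linear proration."""
    rate = SCHEMA_RATE if schema else PARSE_RATE
    return rate * pages / 1000


print(estimate_cost_usd(2500))               # 7.5
print(estimate_cost_usd(2500, schema=True))  # 30
```

Note that the schema figure replaces the parse figure; the two are never added together for the same pages.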

## Request options

For the URL route (`POST /v1/extract`), `url` is required; everything else is optional. For the upload route (`POST /v1/extract/file`), `file` is required and the remaining fields arrive as individual form fields instead of JSON.

| Field            | Type                | Default  | Description                                                                                                              |
| ---------------- | ------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------ |
| `url`            | string              | —        | URL route only. HTTP(S) URL to a PDF, PPTX, or DOCX. Input type is auto-detected from the extension.                     |
| `file`           | binary              | —        | Upload route only. Multipart file field. Input type is auto-detected from the first few bytes; the filename is advisory. |
| `extract_text`   | boolean             | `true`   | Set `false` to skip text spans (image chunks still returned if `extract_images` is true).                                |
| `extract_images` | boolean             | `true`   | Set `false` to skip figure extraction; the response contains no image chunks.                                            |
| `ocr`            | `"auto" \| "never"` | `"auto"` | *Deprecated — accepted but ignored.* This field no longer changes behavior.                                              |
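
The sketch below shows one way to pass these options on each route: as JSON keys for `POST /v1/extract`, and as individual form fields for `POST /v1/extract/file`. The helper names are hypothetical, and encoding booleans as the strings `"true"`/`"false"` in the multipart form is an assumption; this page only states that they are sent as individual form fields.

```python
def json_body(url, extract_text=True, extract_images=True):
    """Build the JSON body for POST /v1/extract."""
    return {
        "url": url,
        "extract_text": extract_text,
        "extract_images": extract_images,
    }


def form_fields(extract_text=True, extract_images=True):
    """Build the extra form fields for POST /v1/extract/file.

    Assumption: booleans are sent as lowercase "true"/"false" strings.
    """
    return {
        "extract_text": str(extract_text).lower(),
        "extract_images": str(extract_images).lower(),
    }


# Text and tables only: skip figure extraction.
body = json_body("https://arxiv.org/pdf/1706.03762.pdf", extract_images=False)
fields = form_fields(extract_images=False)
print(body)
print(fields)

# With `requests` (not executed here), these would be sent roughly as:
#   requests.post(".../v1/extract", headers=headers, json=body, timeout=120)
#   requests.post(".../v1/extract/file", headers=headers,
#                 data=fields, files={"file": f}, timeout=120)
```

The deprecated `ocr` field is left out on purpose, since it no longer changes behavior.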

Server-side limits, not user-configurable but good to know:

* **Max 1,000 pages per document.** Documents with more pages fail with `413`.
* **Max 150 MB per document.** Larger files fail with `413`.

## How pages are counted

Billing is per page. What counts as a page depends on the input type:

* **PDF:** one page per PDF page.
* **PPTX:** one page per slide.
* **DOCX:** paginated on render; typically 250–400 words per page.

## Large documents and bulk workloads

For documents over 1,000 pages or 150 MB, split client-side, extract each part, and concatenate the `chunks` arrays in order. Page numbers restart in each part, so add each part's page offset to `page_no` to restore the original numbering.
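
A minimal sketch of that merge step, assuming you have already split the document yourself and know how many pages each part contains. `shift_pages` and `merge_split_results` are hypothetical helpers, and the sketch assumes `page_no` is 1-based within each part. It also shifts the per-cell `page_no` and `merged_from_pages` values on table chunks when they are present.

```python
import copy


def shift_pages(chunk, offset):
    """Return a copy of `chunk` with every page number moved by `offset`."""
    shifted = copy.deepcopy(chunk)
    shifted["page_no"] += offset
    for cell in shifted.get("cells", []):
        if "page_no" in cell:
            cell["page_no"] += offset
    if "merged_from_pages" in shifted:
        shifted["merged_from_pages"] = [p + offset for p in shifted["merged_from_pages"]]
    return shifted


def merge_split_results(parts):
    """Concatenate responses from consecutive splits of one document.

    `parts` is a list of (page_count, response) pairs in original order,
    where page_count is the number of pages in that split.
    """
    merged, offset = [], 0
    for page_count, response in parts:
        merged.extend(shift_pages(c, offset) for c in response["chunks"])
        offset += page_count
    return {"chunks": merged}


# Two splits: pages 1-2 and pages 3-5 of the original document.
first = {"chunks": [
    {"page_content": "Intro", "page_no": 1, "chunk_type": "text"},
    {"page_content": "Body", "page_no": 2, "chunk_type": "text"},
]}
second = {"chunks": [
    {"page_content": "Appendix", "page_no": 1, "chunk_type": "text"},
]}

merged = merge_split_results([(2, first), (3, second)])
print([c["page_no"] for c in merged["chunks"]])  # [1, 2, 3]
```

One limitation of splitting this way: a table that straddles a split boundary arrives as two separate table chunks, because each part is parsed independently.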

For bulk workloads (thousands of documents), use the **[async batch endpoints](/guides/batch)** instead of looping `POST /v1/extract/file` — you upload each file once with a presigned PUT URL (bytes go straight to S3, never through our API), submit the whole set as one batch, and poll for completion. Same response schema, no per-doc HTTP round-trip overhead.

Need support for individual documents larger than 1,000 pages? Email [hello@extract.page](mailto:hello@extract.page).

## Authentication

Every request needs an `X-API-KEY` header. Keys are created and revoked from the [dashboard](https://extract.page/dashboard). Each key:

* is bound to one customer account
* carries a quota expressed in pages (default **1,000 pages** on the Free plan)
* has its quota decremented atomically per request; the cost equals the number of pages in the extracted document

You can rotate a key at any time; the new one is returned once on creation and never shown again.

## Errors

| Status | Meaning                                                   | What to do                                                                                                                |
| ------ | --------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------- |
| `400`  | Unsupported input or extraction failed                    | Check the `url` points to a supported format (PDF, PPTX, or DOCX) and that the document isn't corrupted                   |
| `401`  | Missing or invalid `X-API-KEY`                            | Check the header is set; re-create the key if revoked                                                                     |
| `402`  | Quota exceeded                                            | Top up from the dashboard or wait for plan refresh                                                                        |
| `404`  | File or batch not found (async batch only)                | Verify the `file_id` / `batch_id` is correct and not expired (3-day TTL)                                                  |
| `409`  | File not uploaded, or result not ready (async batch only) | See [Batch errors](/guides/batch#errors)                                                                                  |
| `413`  | Page limit or size limit exceeded                         | Split the document client-side                                                                                            |
| `422`  | Request body invalid                                      | Read `detail[*].loc` + `detail[*].msg` (FastAPI validation error shape); usually a missing `url` or a bad value for `ocr` |
| `429`  | Rate limit exceeded                                       | Back off and retry                                                                                                        |
| `500`  | Server error                                              | Retry with exponential backoff; contact support if persistent                                                             |
| `503`  | Billing service unavailable                               | Retry; we fail closed on billing to avoid silent overspend                                                                |

Async-batch endpoints (`/v1/files`, `/v1/batches`) return machine-readable `error` codes in the response body — see **[Batch errors](/guides/batch#errors)** for the full list, response shapes, and fixes.

Pass your own `X-Request-Id` header if you want to correlate your logs with ours; the same ID is recorded on our side.
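
The retry guidance in the table above can be sketched as a small transport-agnostic wrapper. It retries only the statuses the table marks as retryable (`429`, `500`, `503`) with exponential backoff plus jitter, and returns every other response to the caller untouched. The helper names, attempt count, and delay values are assumptions, not documented recommendations. `describe_validation_error` flattens a `422` body using the `detail[*].loc` and `detail[*].msg` fields mentioned in the table.

```python
import random
import time

RETRYABLE = {429, 500, 503}


def call_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `send()` until it returns a non-retryable status or attempts run out.

    `send` is any zero-argument callable returning an object with a
    `.status_code` attribute (for example a `requests.Response`).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        response = send()
        last_attempt = attempt == max_attempts - 1
        if response.status_code not in RETRYABLE or last_attempt:
            return response
        # Exponential backoff with a little jitter: about 1s, 2s, 4s, ...
        sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))


def describe_validation_error(body):
    """Flatten a 422 body into "loc: msg" strings."""
    return [
        ".".join(str(part) for part in item["loc"]) + ": " + item["msg"]
        for item in body.get("detail", [])
    ]


# Usage sketch with `requests` (not executed here):
#   request_id = str(uuid.uuid4())
#   headers = {"X-API-KEY": api_key, "X-Request-Id": request_id}
#   resp = call_with_retry(lambda: requests.post(
#       "https://api.extract.page/v1/extract",
#       headers=headers, json={"url": url}, timeout=120))
#   if resp.status_code == 422:
#       print(describe_validation_error(resp.json()))
```

Reusing the same `X-Request-Id` across retries of one logical request keeps all attempts grouped under a single ID in your own logs.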

See the **API Reference** in the sidebar for a live playground and the full request/response schema.
