Extract: Parse documents into structured data

Extract turns documents into text chunks (with page numbers and bounding boxes), tables (structured cells plus a markdown rendering), and any figures on the page. Two surfaces, one pipeline:

Sync: POST /v1/extract and POST /v1/extract/file. One document in, results back in seconds. Best for real-time agent loops, screenshots, and small batches.
Async batch: POST /v1/files + POST /v1/batches. Upload many files, submit them as one batch, poll for results. Best for bulk extraction workloads where you don’t want to manage thousands of synchronous requests. See Batch processing.

Both surfaces share the same extraction engine and the same response schema, so a chunk you get from a sync call looks identical to a chunk you get from a batch item.

Sync quickstart

Create an API key at extract.page/dashboard, then pick the ingress path that matches where your document lives:

By URL
By upload

POST /v1/extract — JSON body with a url. Use this when the document is already reachable over HTTP (S3 presigned URL, public doc, CDN).

curl -X POST https://api.extract.page/v1/extract \
  -H "X-API-KEY: $EXTRACT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://arxiv.org/pdf/1706.03762.pdf"
  }'
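
For callers who prefer Python to curl, the same request can be sketched with the standard library alone. The endpoint, header name, and body field come from the curl example above; the function names and the 60-second timeout are illustrative assumptions, not part of the API.

```python
import json
import urllib.request

API_BASE = "https://api.extract.page"  # base URL taken from the curl example


def build_extract_request(document_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/extract request for a document URL."""
    body = json.dumps({"url": document_url}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/v1/extract",
        data=body,
        method="POST",
        headers={"X-API-KEY": api_key, "Content-Type": "application/json"},
    )


def extract_by_url(document_url: str, api_key: str, timeout: float = 60.0) -> dict:
    """Send the request and return the parsed JSON response.

    The timeout value is an assumption; tune it to your documents.
    """
    request = build_extract_request(document_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `extract_by_url("https://arxiv.org/pdf/1706.03762.pdf", os.environ["EXTRACT_API_KEY"])` (after `import os`) would mirror the curl call.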

POST /v1/extract/file — multipart/form-data upload. Use this when you already have the document bytes (agent output, local file, webhook payload) — no need to stage them at a public URL first.

curl -X POST https://api.extract.page/v1/extract/file \
  -H "X-API-KEY: $EXTRACT_API_KEY" \
  -F "file=@paper.pdf"

The uploaded filename is advisory — dispatch is driven by the file’s magic bytes. A .docx filename with PDF bytes is treated as a PDF. The multipart form accepts the same extract_text, extract_images, and ocr fields as the JSON route (send them as individual form fields).
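
If you are not using a client library that builds multipart bodies for you, the upload can be assembled by hand. The sketch below is a minimal standard-library encoder; the helper name is hypothetical, and sending booleans as the strings "true" / "false" is an assumption about how the form fields are parsed.

```python
import uuid
from typing import Dict, Optional, Tuple


def encode_multipart(
    filename: str,
    file_bytes: bytes,
    fields: Optional[Dict[str, str]] = None,
) -> Tuple[str, bytes]:
    """Return (content_type, body) for a multipart/form-data upload.

    The document goes in the "file" part; options such as extract_images
    are sent as individual form fields, as described above.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in (fields or {}).items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode("utf-8")
        )
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return f"multipart/form-data; boundary={boundary}", b"".join(parts)
```

Pass the returned body as the request data and the returned content type as the Content-Type header, alongside X-API-KEY, when posting to /v1/extract/file.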

Response shape

{
  "chunks": [
    {
      "page_content": "Attention Is All You Need",
      "page_no": 1,
      "bbox": [176.6, 88.7, 438.3, 107.2],
      "chunk_type": "text"
    },
    {
      "page_content": "| Model | BLEU |\n|---|---|\n| Transformer | 28.4 |",
      "page_no": 3,
      "bbox": [110.0, 200.4, 500.0, 320.1],
      "chunk_type": "table",
      "n_rows": 2,
      "n_cols": 2,
      "cells": [
        { "text": "Model", "row": 0, "col": 0 },
        { "text": "BLEU", "row": 0, "col": 1 },
        { "text": "Transformer", "row": 1, "col": 0 },
        { "text": "28.4", "row": 1, "col": 1 }
      ]
    },
    {
      "page_content": "",
      "page_no": 4,
      "bbox": [108.0, 281.4, 504.0, 531.4],
      "chunk_type": "image",
      "image_url": "https://...",
      "image_mime": "image/webp",
      "image_width": 1188,
      "image_height": 750
    }
  ]
}

chunks is a flat array in reading order. Each chunk is one of:

text — a contiguous run of glyphs with the same font and size. Not a paragraph; a single paragraph typically spans several text chunks.
table — a table. cells is the structured representation (0-based row/col, with row_span/col_span for merged cells); page_content carries a markdown rendering so plain-text consumers still get readable output.
image — a figure extracted from the page, delivered as a URL or inline base64.
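
Because text chunks are font runs rather than paragraphs, most consumers regroup them before use. The following is a minimal sketch of two common steps: bucketing chunks by chunk_type and joining the text chunks of each page in array order. Joining with a single space is an assumption; real documents may need smarter whitespace handling.

```python
from collections import defaultdict
from typing import Dict, List


def split_by_type(chunks: List[dict]) -> Dict[str, List[dict]]:
    """Bucket chunks by their chunk_type, keeping reading order within each bucket."""
    buckets: Dict[str, List[dict]] = {"text": [], "table": [], "image": []}
    for chunk in chunks:
        buckets.setdefault(chunk["chunk_type"], []).append(chunk)
    return buckets


def text_by_page(chunks: List[dict]) -> Dict[int, str]:
    """Join the text chunks of each page into one string, in array order."""
    pages: Dict[int, List[str]] = defaultdict(list)
    for chunk in chunks:
        if chunk["chunk_type"] == "text":
            pages[chunk["page_no"]].append(chunk["page_content"])
    return {page_no: " ".join(parts) for page_no, parts in pages.items()}
```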

Fields:

bbox — [x0, y0, x1, y1] in PDF user-space points.
confidence — 0–100, present on OCR’d content (text chunks and table cells); native-text spans omit it.
cells / n_rows / n_cols — populated on table chunks. Each cell is { text, row, col, row_span, col_span, bbox, confidence, page_no }.
merged_from_pages — present when a table spanning a page break was assembled into one chunk (1-based page numbers, ascending); each cell’s page_no carries its own source page.
image_url / image_mime — populated on image chunks (URL backend).
image_b64 — populated on image chunks (inline backend).
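
To work with a table as rows and columns rather than a flat cell list, the cells can be placed into a grid. This sketch assumes row_span and col_span default to 1 when a cell omits them (as the cells in the sample response do) and repeats a merged cell's text in every position it covers; both are illustrative choices, not documented behavior.

```python
from typing import List


def cells_to_grid(table: dict) -> List[List[str]]:
    """Expand a table chunk's cells into an n_rows x n_cols grid of strings."""
    n_rows, n_cols = table["n_rows"], table["n_cols"]
    grid = [[""] * n_cols for _ in range(n_rows)]
    for cell in table["cells"]:
        # Assumption: a missing or null span means the cell covers one row/column.
        row_span = cell.get("row_span") or 1
        col_span = cell.get("col_span") or 1
        # Clamp to the declared table size so a bad span cannot raise IndexError.
        last_row = min(cell["row"] + row_span, n_rows)
        last_col = min(cell["col"] + col_span, n_cols)
        for r in range(cell["row"], last_row):
            for c in range(cell["col"], last_col):
                grid[r][c] = cell["text"]
    return grid
```

Applied to the table chunk in the sample response, this yields `[["Model", "BLEU"], ["Transformer", "28.4"]]`.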

Billing is per page: $3 per 1,000 pages for parsing (/v1/extract, /v1/extract/file), and $12 per 1,000 pages for schema extraction (/v1/extract/schema). Schema extraction is a single all-in charge that includes the parse: the total is $12 per 1,000 pages, not $12 on top of the $3 parse. Remaining balance is visible in your dashboard.
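
As a quick worked example of the rates above, the sketch below estimates the charge for a given page count. The function and constant names are hypothetical, and it assumes cost scales linearly with pages; any rounding the billing system applies is not modeled.

```python
from decimal import Decimal

# Rates quoted above, in US dollars per 1,000 pages.
RATES_PER_1000_PAGES = {"parse": Decimal("3"), "schema": Decimal("12")}


def estimate_cost(pages: int, kind: str = "parse") -> Decimal:
    """Estimate the charge in dollars for `pages` pages of the given kind."""
    return RATES_PER_1000_PAGES[kind] * pages / 1000


print(estimate_cost(250))            # 0.75
print(estimate_cost(250, "schema"))  # 3
```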

Request options

For the URL route (POST /v1/extract), url is required; everything else is optional. For the upload route (POST /v1/extract/file), file is required and the remaining fields arrive as individual form fields instead of JSON.

Field	Type	Default	Description
`url`	string	—	URL route only. HTTP(S) URL to a PDF, PPTX, or DOCX. Input type is auto-detected from the extension.
`file`	binary	—	Upload route only. Multipart file field. Input type is auto-detected from the first few bytes; the filename is advisory.
`extract_text`	boolean	`true`	Set `false` to skip text spans (image chunks still returned if `extract_images` is true).
`extract_images`	boolean	`true`	Set `false` to skip figure extraction; the response then contains no image chunks.
`ocr`	`"auto" \| "never"`	`"auto"`	Deprecated — accepted but ignored. This field no longer changes behavior.

Server-side limits, not user-configurable but good to know:

Max 1,000 pages per document. Larger docs fail with 413.
Max 150 MB per document. Larger downloads fail with 413.

How pages are counted

Billing is per page. What counts as a page depends on the input type:

PDF: one page per PDF page.
PPTX: one page per slide.
DOCX: paginated on render; typically 250–400 words per page.
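
Since DOCX page counts are only known after rendering, a rough range can be estimated up front from the word count. This is a back-of-the-envelope sketch using the 250–400 words-per-page figure above; the function name is hypothetical and the billed count may differ.

```python
import math
from typing import Tuple


def estimate_docx_pages(word_count: int, low_wpp: int = 250, high_wpp: int = 400) -> Tuple[int, int]:
    """Return (fewest, most) pages a DOCX with `word_count` words is likely to render to."""
    if word_count <= 0:
        return (0, 0)
    fewest = math.ceil(word_count / high_wpp)  # dense pages
    most = math.ceil(word_count / low_wpp)     # sparse pages
    return (fewest, most)


print(estimate_docx_pages(10_000))  # (25, 40)
```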

Large documents and bulk workloads

For documents over 1,000 pages or 150 MB, split client-side and concatenate the chunks arrays. The page_no field lets you offset page numbers across splits.

For bulk workloads (thousands of documents), use the async batch endpoints instead of looping over POST /v1/extract/file. You upload each file once with a presigned PUT URL (bytes go straight to S3, never through our API), submit the whole set as one batch, and poll for completion. The response schema is the same, and there is no per-document HTTP round-trip overhead.

Need support for individual documents larger than 1,000 pages? Email hello@extract.page.
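
For the client-side split described above, the per-split responses can be stitched back together as follows. This sketch assumes each split is extracted as its own document, so its page numbers start again at 1, and that you know how many original pages precede each split. The function name is hypothetical. Tables that straddle a split boundary will stay as two separate chunks.

```python
from typing import List, Tuple


def merge_split_results(parts: List[Tuple[int, dict]]) -> dict:
    """Concatenate split responses, shifting page numbers back to the original document.

    `parts` is a list of (page_offset, response) in document order, where
    page_offset is the number of original pages that precede that split.
    """
    merged = []
    for page_offset, response in parts:
        for chunk in response["chunks"]:
            shifted = dict(chunk)  # shallow copy so the input is left untouched
            shifted["page_no"] = chunk["page_no"] + page_offset
            if "merged_from_pages" in chunk:
                shifted["merged_from_pages"] = [p + page_offset for p in chunk["merged_from_pages"]]
            if "cells" in chunk:
                shifted["cells"] = [
                    {**cell, "page_no": cell["page_no"] + page_offset} if "page_no" in cell else dict(cell)
                    for cell in chunk["cells"]
                ]
            merged.append(shifted)
    return {"chunks": merged}
```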

Authentication

Every request needs an X-API-KEY header. Keys are created and revoked from the dashboard. Each key:

is bound to one customer account
carries a quota expressed in pages (default 1,000 pages on the Free plan)
has its quota decremented atomically per request; the cost equals the number of pages in the extracted document

You can rotate a key at any time; the new one is returned once on creation and never shown again.

Errors

Status	Meaning	What to do
`400`	Unsupported input or extraction failed	Check the `url` points to a supported format (PDF, PPTX, or DOCX) and that the document isn’t corrupted
`401`	Missing or invalid `X-API-KEY`	Check the header is set; re-create the key if revoked
`402`	Quota exceeded	Top up from the dashboard or wait for plan refresh
`404`	File or batch not found (async batch only)	Verify the `file_id` / `batch_id` is correct and not expired (3-day TTL)
`409`	File not uploaded, or result not ready (async batch only)	See Batch errors
`413`	Page limit or size limit exceeded	Split the document client-side
`422`	Request body invalid	Read `detail[].loc` + `detail[].msg` (FastAPI validation error shape); usually a missing `url` or a bad value for `ocr`
`429`	Rate limit exceeded	Back off and retry
`500`	Server error	Retry with exponential backoff; contact support if persistent
`503`	Billing service unavailable	Retry; we fail closed on billing to avoid silent overspend
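
The table above suggests a simple client policy: retry 429, 500, and 503 with backoff, and surface everything else immediately. The sketch below is one way to do that with the standard library; the retry count, delays, and full-jitter strategy are assumptions rather than documented requirements. It also includes a small formatter for 422 bodies, which follow the standard FastAPI validation shape.

```python
import random
import time
import urllib.error
import urllib.request
from typing import List

RETRYABLE_STATUSES = {429, 500, 503}  # the statuses the table says to retry


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def send_with_retry(request, max_attempts: int = 5, opener=urllib.request.urlopen, sleep=time.sleep) -> bytes:
    """Send `request`, retrying retryable HTTP errors; other errors propagate at once."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            with opener(request) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            if err.code not in RETRYABLE_STATUSES or attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")


def describe_validation_error(body: dict) -> List[str]:
    """Turn a 422 body into readable lines such as 'body.url: field required'."""
    lines = []
    for item in body.get("detail", []):
        location = ".".join(str(part) for part in item["loc"])
        lines.append(f"{location}: {item['msg']}")
    return lines
```

The `opener` and `sleep` parameters exist only so the policy can be exercised without network access or real waiting.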

Async-batch endpoints (/v1/files, /v1/batches) return machine-readable error codes in the response body; see Batch errors for the full list, response shapes, and fixes.

Pass your own X-Request-Id header if you want to correlate logs with us; it shows up on our side too.

See the API Reference in the sidebar for a live playground and the full request/response schema.
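
One way to use X-Request-Id is to generate an ID per call and log it on your side before sending. The sketch below uses a random UUID; the ID format is an assumption, since the text above does not specify one, and the helper name is hypothetical.

```python
import uuid
from typing import Dict, Optional, Tuple


def with_request_id(headers: Dict[str, str], request_id: Optional[str] = None) -> Tuple[Dict[str, str], str]:
    """Return a copy of `headers` with X-Request-Id set, plus the ID for your own logs."""
    rid = request_id or uuid.uuid4().hex
    return {**headers, "X-Request-Id": rid}, rid
```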
