Two HTTP endpoints, one pipeline. Pass a URL when the document is already hosted somewhere; upload the bytes directly when it’s sitting in memory. Either way you get back text chunks (with page numbers and bounding boxes) plus any figures extracted from the pages.

Quickstart

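Create an API key at extract.page/dashboard, then pick the ingress path that matches where your document lives:
  • POST /v1/extract: JSON body with a url. Use this when the document is already reachable over HTTP (S3 presigned URL, public doc, CDN).
  • POST /v1/extract/file: multipart upload with a file field. Use this when you already hold the document's bytes.
For example, the URL route: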
curl -X POST https://api.extract.page/v1/extract \
  -H "X-API-KEY: $EXTRACT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://arxiv.org/pdf/1706.03762.pdf"
  }'
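
The same call can be made from Python. The sketch below uses only the standard library. The helper names build_extract_request and extract_url are illustrative and not part of any official client, and it assumes the key lives in the EXTRACT_API_KEY environment variable, as in the curl example.

```python
import json
import os
import urllib.request

API_URL = "https://api.extract.page/v1/extract"


def build_extract_request(doc_url, api_key, **options):
    """Build (but do not send) a POST /v1/extract request.

    `options` may carry the optional request fields, for example ocr="never".
    """
    body = json.dumps({"url": doc_url, **options}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={"X-API-KEY": api_key, "Content-Type": "application/json"},
    )


def extract_url(doc_url, api_key=None, **options):
    """Send the request and return the decoded JSON response."""
    api_key = api_key or os.environ["EXTRACT_API_KEY"]
    req = build_extract_request(doc_url, api_key, **options)
    # The timeout is an arbitrary choice for this sketch, not a documented value.
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)
```

Calling extract_url("https://arxiv.org/pdf/1706.03762.pdf") would then return a dict with a chunks list. Building the request separately from sending it keeps the payload easy to inspect or log.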

Response shape

{
  "chunks": [
    {
      "page_content": "Attention Is All You Need",
      "page_no": 1,
      "bbox": [176.6, 88.7, 438.3, 107.2],
      "chunk_type": "text"
    },
    {
      "page_content": "",
      "page_no": 2,
      "bbox": [108.0, 281.4, 504.0, 531.4],
      "chunk_type": "image",
      "image_url": "https://...",
      "image_mime": "image/webp",
      "image_width": 1188,
      "image_height": 750
    }
  ]
}
chunks is a flat array in reading order. Each chunk is one of:
  • text — a contiguous run of glyphs with the same font and size. Not a paragraph; a single paragraph typically spans several text chunks.
  • image — a figure extracted from the page, delivered as a URL or inline base64.
Fields:
  • bbox: [x0, y0, x1, y1] in PDF user-space points.
  • confidence: 0–100, present on OCR’d text chunks only; native-text spans omit it.
  • image_url / image_mime: populated on image chunks (URL backend).
  • image_b64: populated on image chunks (inline backend).
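
Because a text chunk is a font run rather than a paragraph, consumers usually regroup chunks before using them. A minimal sketch, assuming the chunks arrive in the reading order described above and that joining runs with a single space is good enough for your use case; text_by_page and image_chunks are hypothetical helper names:

```python
from collections import defaultdict


def text_by_page(chunks):
    """Join the text chunks of each page, preserving response order."""
    pages = defaultdict(list)
    for chunk in chunks:
        # Skip image chunks and empty text runs.
        if chunk.get("chunk_type") == "text" and chunk.get("page_content"):
            pages[chunk["page_no"]].append(chunk["page_content"])
    return {page: " ".join(parts) for page, parts in sorted(pages.items())}


def image_chunks(chunks):
    """Return only the image chunks (figures)."""
    return [c for c in chunks if c.get("chunk_type") == "image"]
```

Joining on a space loses line and paragraph breaks. If layout matters, the bbox values can be used to detect vertical gaps instead.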
Billing is one credit per page in the source document. Remaining credits are visible in your dashboard.

Request options

For the URL route (POST /v1/extract), url is required; everything else is optional. For the upload route (POST /v1/extract/file), file is required and the remaining fields arrive as individual form fields instead of JSON.
| Field | Type | Default | Description |
|---|---|---|---|
| url | string | none (required) | URL route only. HTTP(S) URL to a PDF, PPTX, or DOCX. Input type is auto-detected from the extension. |
| file | binary | none (required) | Upload route only. Multipart file field. Input type is auto-detected from the first few bytes; the filename is advisory. |
| extract_text | boolean | true | Set false to skip text spans (image chunks still returned if extract_images is true). |
| extract_images | boolean | true | Set false to skip figure extraction; the response contains only text chunks. |
| ocr | "auto" or "never" | "auto" | auto runs OCR only on pages with no native text or dominated by images. never skips OCR entirely. |
Server-side limits, not user-configurable but good to know:
  • Max 1,000 pages per document. Larger docs fail with 413.
  • Max 150 MB per document. Larger downloads fail with 413.
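
For the upload route, the options travel as form fields next to the file part. The sketch below builds such a request with the standard library only. Several details are assumptions rather than documented behavior: booleans are serialized as the strings "true" and "false", the file part is labeled application/octet-stream, and the helper names encode_multipart and build_upload_request are hypothetical.

```python
import urllib.request
import uuid

UPLOAD_URL = "https://api.extract.page/v1/extract/file"


def encode_multipart(fields, filename, file_bytes):
    """Encode form fields plus one `file` part as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode("utf-8")
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n".encode("utf-8")
    )
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_upload_request(file_bytes, api_key, filename="document.pdf", **options):
    """Build (but do not send) a POST /v1/extract/file request."""
    # Assumption: booleans are sent as the strings "true" / "false".
    fields = {
        k: (str(v).lower() if isinstance(v, bool) else v)
        for k, v in options.items()
    }
    body, content_type = encode_multipart(fields, filename, file_bytes)
    return urllib.request.Request(
        UPLOAD_URL,
        data=body,
        method="POST",
        headers={"X-API-KEY": api_key, "Content-Type": content_type},
    )
```

The returned request can be sent with urllib.request.urlopen. Checking len(file_bytes) against the size limit before sending avoids uploading a document that would be rejected with 413.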

How pages are counted

Billing is per page. What counts as a page depends on the input type:
  • PDF: one page per PDF page.
  • PPTX: one page per slide.
  • DOCX: paginated on render; typically 250–400 words per page.

Large documents

For documents over 1,000 pages or 150 MB, split client-side and concatenate the chunks arrays. The page_no field lets you offset page numbers across splits. Need support for larger documents? Email hello@extract.page.
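
A minimal sketch of the merge step, assuming each split's page_no restarts at 1 (the natural result of extracting each split as its own document, though not stated above) and that you know how many pages went into each split. merge_split_results is a hypothetical helper name:

```python
def merge_split_results(results, pages_per_split):
    """Concatenate `chunks` arrays from consecutive splits of one document.

    results: response dicts, in document order.
    pages_per_split: number of pages in each split, in the same order.
    """
    if len(results) != len(pages_per_split):
        raise ValueError("need one page count per split")
    merged, offset = [], 0
    for result, page_count in zip(results, pages_per_split):
        for chunk in result["chunks"]:
            # Copy the chunk so the original response is left untouched.
            merged.append({**chunk, "page_no": chunk["page_no"] + offset})
        # The next split starts after all pages of this one.
        offset += page_count
    return {"chunks": merged}
```

The offset uses the page count of each split rather than the highest page_no seen, because trailing pages with no chunks would otherwise shift every later page number.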

Authentication

Every request needs an X-API-KEY header. Keys are created and revoked from the dashboard. Each key:
  • is bound to one customer account
  • carries a quota expressed in pages (default 1,000 pages on the Free plan)
  • has its quota decremented atomically per request; the cost equals the number of pages in the extracted document
You can rotate a key at any time; the new one is returned once on creation and never shown again.

Errors

| Status | Meaning | What to do |
|---|---|---|
| 400 | Unsupported input or extraction failed | Check the url points to a supported format (PDF, PPTX, or DOCX) and that the document isn’t corrupted |
| 401 | Missing or invalid X-API-KEY | Check the header is set; re-create the key if revoked |
| 402 | Quota exceeded | Top up from the dashboard or wait for plan refresh |
| 413 | Page limit or size limit exceeded | Split the document client-side |
| 422 | Request body invalid | Read detail[*].loc + detail[*].msg (FastAPI validation error shape); usually a missing url or a bad value for ocr |
| 429 | Rate limit exceeded | Back off and retry |
| 500 | Server error | Retry with exponential backoff; contact support if persistent |
| 503 | Billing service unavailable | Retry; we fail closed on billing to avoid silent overspend |
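
One way to act on these statuses from client code is sketched below. It treats 429, 500, and 503 as retryable and everything else as final, and flattens a 422 body into readable strings. The delay values, the attempt count, and all function names are assumptions for illustration; nothing here reflects documented rate-limit parameters.

```python
import time

RETRYABLE = {429, 500, 503}


def is_retryable(status):
    """True for statuses the table above says to retry."""
    return status in RETRYABLE


def backoff_delays(attempts, base=1.0, cap=30.0):
    """Exponential delays in seconds: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]


def call_with_retry(send, attempts=4, sleep=time.sleep):
    """Call `send()` -> (status, body) until it is final or attempts run out.

    `attempts` must be at least 1. `sleep` is injectable for testing.
    """
    delays = backoff_delays(attempts)
    for i in range(attempts):
        status, body = send()
        if not is_retryable(status) or i == attempts - 1:
            return status, body
        sleep(delays[i])


def describe_validation_error(body):
    """Flatten a 422 body (FastAPI shape) into 'loc: msg' strings."""
    return [
        ".".join(str(p) for p in item.get("loc", [])) + ": " + item.get("msg", "")
        for item in body.get("detail", [])
    ]
```

Here send is any zero-argument callable that performs one request and returns the status code with the parsed body. Adding random jitter to each delay is a common refinement when many clients retry at once.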
Pass your own X-Request-Id header if you want to correlate logs with us; it shows up on our side too. See the API Reference in the sidebar for a live playground and the full request/response schema.