Schema extraction

POST /v1/extract gives you the whole document as structured chunks. Schema extraction is the other direction: you hand us a JSON schema describing the fields you want — invoice_number, total, line_items[] — and get back just those fields, filled from the document, with a citation (page, bounding box, and the verbatim source text) for every value.

Request

POST /v1/extract/schema (JSON body with a url) or POST /v1/extract/schema/file (multipart upload — required for PHI keys). The schema is a standard JSON Schema; descriptions double as per-field hints.

{
  "invoice_number": { "type": "string", "description": "Invoice ID as printed." },
  "total":          { "type": "number", "description": "Amount due, in dollars." },
  "line_items": {
    "type": "array",
    "items": { "type": "object", "properties": {
      "description": { "type": "string" },
      "amount":      { "type": "number" }
    }}
  }
}

Supported subset: string / number / integer / boolean, object (with properties), array (with items), plus enum and description on leaves. Nesting up to 5 levels. Wide schemas are capped (a few hundred fields) and return 422 with a “split your schema” message — keep schemas focused.

Response

values is your schema, filled (null where the document genuinely lacks a field). evidence is a parallel map keyed by field path — each value’s source page, box (0–1000, page-relative), and verified text.

{
  "values": {
    "invoice_number": "INV-42",
    "total": 100.0,
    "line_items": [{ "description": "Widget A", "amount": 80.0 }]
  },
  "evidence": {
    "invoice_number":      [{ "page": 1, "bbox": [120, 80, 240, 96], "text": "Invoice #INV-42" }],
    "total":               [{ "page": 1, "bbox": [410, 540, 470, 556], "text": "Total due $100.00" }],
    "line_items[0].amount":[{ "page": 1, "bbox": [410, 300, 470, 316], "text": "Widget A ... 80.00" }]
  },
  "ungrounded_fields": [],
  "page_count": 1
}

strict (default true): a non-null value whose quote can’t be verified is nulled out and listed in ungrounded_fields (the safe default — a fabrication never reaches you). Set strict: false to keep the value but still see it flagged.
bbox is [x0, y0, x1, y1] normalized to 0–1000, page-relative, top-left origin — overlay it directly without fetching page dimensions.

Example

import json, os, httpx

API = "https://api.extract.page"
schema = {
    "invoice_number": {"type": "string", "description": "Invoice ID as printed."},
    "total":          {"type": "number", "description": "Amount due, in dollars."},
}

with open("invoice.pdf", "rb") as f:
    r = httpx.post(
        f"{API}/v1/extract/schema/file",
        headers={"X-API-KEY": os.environ["EXTRACT_API_KEY"]},
        files={"file": ("invoice.pdf", f, "application/pdf")},
        data={"schema": json.dumps(schema)},
        timeout=300,
    )
r.raise_for_status()
out = r.json()
print(out["values"])                       # {'invoice_number': 'INV-42', 'total': 100.0}
print(out["evidence"]["total"][0])         # {'page': 1, 'bbox': [...], 'text': 'Total due $100.00'}

Billing

Metered per page at $12 per 1,000 pages, all-in — we don’t charge separately for document parsing. See Pricing.

​Request

​Response

​Example

​Billing

Request

Response

Example

Billing