> ## Documentation Index
> Fetch the complete documentation index at: https://docs.extract.page/llms.txt
> Use this file to discover all available pages before exploring further.

# Schema extraction

> Fill an arbitrary JSON schema from a document, with a per-field citation for every value.

`POST /v1/extract` gives you the whole document as structured chunks. **Schema extraction** is the other direction: you hand us a JSON schema describing the fields you want — `invoice_number`, `total`, `line_items[]` — and get back **just those fields, filled from the document**, with a citation (page, bounding box, and the verbatim source text) for every value.

## Request

`POST /v1/extract/schema` (JSON body with a `url`) or `POST /v1/extract/schema/file` (multipart upload — required for PHI keys). The `schema` is a standard JSON Schema; descriptions double as per-field hints.

```json theme={null}
{
  "invoice_number": { "type": "string", "description": "Invoice ID as printed." },
  "total":          { "type": "number", "description": "Amount due, in dollars." },
  "line_items": {
    "type": "array",
    "items": { "type": "object", "properties": {
      "description": { "type": "string" },
      "amount":      { "type": "number" }
    }}
  }
}
```

**Supported subset:** `string` / `number` / `integer` / `boolean`, `object` (with `properties`), `array` (with `items`), plus `enum` and `description` on leaves. Nesting up to **5 levels**. Wide schemas are capped (a few hundred fields) and return `422` with a "split your schema" message — keep schemas focused.

## Response

`values` is your schema, filled (`null` where the document genuinely lacks a field). `evidence` is a parallel map keyed by **field path** — each value's source page, box (0–1000, page-relative), and verified text.

```json theme={null}
{
  "values": {
    "invoice_number": "INV-42",
    "total": 100.0,
    "line_items": [{ "description": "Widget A", "amount": 80.0 }]
  },
  "evidence": {
    "invoice_number":      [{ "page": 1, "bbox": [120, 80, 240, 96], "text": "Invoice #INV-42" }],
    "total":               [{ "page": 1, "bbox": [410, 540, 470, 556], "text": "Total due $100.00" }],
    "line_items[0].amount":[{ "page": 1, "bbox": [410, 300, 470, 316], "text": "Widget A ... 80.00" }]
  },
  "ungrounded_fields": [],
  "page_count": 1
}
```

* **`strict`** (default `true`): a non-null value whose quote can't be verified is **nulled out** and listed in `ungrounded_fields` (the safe default — a fabrication never reaches you). Set `strict: false` to keep the value but still see it flagged.
* **`bbox`** is `[x0, y0, x1, y1]` normalized to **0–1000**, page-relative, top-left origin — overlay it directly without fetching page dimensions.

## Example

<CodeGroup>
  ```python python theme={null}
  import json, os, httpx

  API = "https://api.extract.page"
  schema = {
      "invoice_number": {"type": "string", "description": "Invoice ID as printed."},
      "total":          {"type": "number", "description": "Amount due, in dollars."},
  }

  with open("invoice.pdf", "rb") as f:
      r = httpx.post(
          f"{API}/v1/extract/schema/file",
          headers={"X-API-KEY": os.environ["EXTRACT_API_KEY"]},
          files={"file": ("invoice.pdf", f, "application/pdf")},
          data={"schema": json.dumps(schema)},
          timeout=300,
      )
  r.raise_for_status()
  out = r.json()
  print(out["values"])                       # {'invoice_number': 'INV-42', 'total': 100.0}
  print(out["evidence"]["total"][0])         # {'page': 1, 'bbox': [...], 'text': 'Total due $100.00'}
  ```

  ```bash curl theme={null}
  curl -X POST https://api.extract.page/v1/extract/schema/file \
    -H "X-API-KEY: $EXTRACT_API_KEY" \
    -F "file=@invoice.pdf" \
    -F 'schema={"invoice_number":{"type":"string"},"total":{"type":"number"}}'
  ```
</CodeGroup>

## Billing

Metered per page at **\$12 per 1,000 pages**, all-in — we don't charge separately for document parsing. See [Pricing](https://extract.page/#pricing).
