# Extract Structured Data from Websites

> Extract structured data from any website with a single JSON Schema.

Context.dev's Extract API takes a starting URL and a JSON Schema describing the data you want, crawls the most relevant pages on the site (following internal links and parsing PDFs along the way), and returns a typed response matching your schema.

It's an alternative to building a pipeline that scrapes multiple pages into Markdown and then runs an LLM over the result.

<Prompt description="Integrate Context.dev's Extract API in your app" icon="sparkles" actions={["copy", "cursor"]}>
  I'm integrating Context.dev's Extract API (`POST /web/extract`) into my app to extract structured data from websites. Help me:

  1. Install the official SDK for my language (`context.dev` on npm / PyPI / RubyGems, `github.com/context-dot-dev/context-go-sdk` for Go).
  2. Read the API key from the `CONTEXT_DEV_API_KEY` environment variable. Never hardcode it.
  3. Call `client.web.extract({ url, schema })` with a starting URL and a JSON Schema describing the data I want back. If my project already has a schema library, generate the JSON Schema from it (Zod's `z.toJSONSchema()` in TypeScript, Pydantic's `model_json_schema()` in Python) and add `description`s to fields to tell the model what to look for.
  4. Optionally pass `instructions` to steer the crawl, `maxPages` (1–50, default 5) and `maxDepth` to bound it, `factCheck: true` to forbid inferred values, and `timeoutMS` (1000–300000) to bound the request.
  5. Read the result from `response.data` (it matches my schema) and the crawled pages from `response.urls_analyzed`. Validate `data` on the way out with the same schema I sent.

  Docs: [https://docs.context.dev/guides/extract-structured-data-from-websites](https://docs.context.dev/guides/extract-structured-data-from-websites)
</Prompt>

## Prerequisites

* **A Context.dev API key.** Sign up at [context.dev/signup](https://context.dev/signup), copy the key from the [dashboard](https://context.dev/dashboard) (prefix `ctxt_secret_`), and export it:

  ```bash theme={null}
  export CONTEXT_DEV_API_KEY="ctxt_secret_..."
  ```

* **An SDK (optional).** Install for your language, or skip the install and call directly with `curl`:

  <CodeGroup>
    ```bash TypeScript theme={null}
    npm install context.dev
    ```

    ```bash Python theme={null}
    pip install context.dev
    ```

    ```bash Ruby theme={null}
    gem install context.dev
    ```

    ```bash Go theme={null}
    go get github.com/context-dot-dev/context-go-sdk
    ```
  </CodeGroup>

## Extract data

You describe the result you want as one JSON Schema. Property names become the keys of the response's `data` object, and each property's `description` tells the model what to look for:

<CodeGroup>
  ```bash cURL highlight={6-16} theme={null}
  curl -X POST https://api.context.dev/v1/web/extract \
    -H "Authorization: Bearer $CONTEXT_DEV_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "url": "https://stripe.com",
      "schema": {
        "type": "object",
        "properties": {
          "founded_year": {
            "type": "number",
            "description": "The year the company was founded."
          }
        },
        "required": ["founded_year"],
        "additionalProperties": false
      }
    }'
  ```

  ```typescript TypeScript highlight={7-17} theme={null}
  import ContextDev from "context.dev";

  const client = new ContextDev({ apiKey: process.env.CONTEXT_DEV_API_KEY });

  const response = await client.web.extract({
    url: "https://stripe.com",
    schema: {
      type: "object",
      properties: {
        founded_year: {
          type: "number",
          description: "The year the company was founded.",
        },
      },
      required: ["founded_year"],
      additionalProperties: false,
    },
  });

  console.log(response.data);
  ```

  ```python Python highlight={8-18} theme={null}
  import os
  from context.dev import ContextDev

  client = ContextDev(api_key=os.environ["CONTEXT_DEV_API_KEY"])

  response = client.web.extract(
      url="https://stripe.com",
      schema={
          "type": "object",
          "properties": {
              "founded_year": {
                  "type": "number",
                  "description": "The year the company was founded.",
              }
          },
          "required": ["founded_year"],
          "additionalProperties": False,
      },
  )

  print(response.data)
  ```

  ```ruby Ruby highlight={7-17} theme={null}
  require "context_dev"

  client = ContextDev::Client.new(api_key: ENV.fetch("CONTEXT_DEV_API_KEY"))

  response = client.web.extract(
    url: "https://stripe.com",
    schema: {
      type: "object",
      properties: {
        founded_year: {
          type: "number",
          description: "The year the company was founded.",
        },
      },
      required: ["founded_year"],
      additionalProperties: false,
    },
  )

  puts response.data
  ```

  ```go Go highlight={19-29} theme={null}
  package main

  import (
      "context"
      "fmt"
      "os"

      contextdev "github.com/context-dot-dev/context-go-sdk"
      "github.com/context-dot-dev/context-go-sdk/option"
  )

  func main() {
      client := contextdev.NewClient(
          option.WithAPIKey(os.Getenv("CONTEXT_DEV_API_KEY")),
      )

      response, err := client.Web.Extract(context.TODO(), contextdev.WebExtractParams{
          URL: "https://stripe.com",
          Schema: map[string]any{
              "type": "object",
              "properties": map[string]any{
                  "founded_year": map[string]any{
                      "type":        "number",
                      "description": "The year the company was founded.",
                  },
              },
              "required":             []string{"founded_year"},
              "additionalProperties": false,
          },
      })
      if err != nil {
          panic(err)
      }

      fmt.Printf("%+v\n", response.Data)
  }
  ```
</CodeGroup>

<Badge color="blue" icon="coins">10 credits per successful call</Badge>

### Request parameters

| Parameter          | Type    | Description                                                                                                                                          |
| ------------------ | ------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- |
| `url`              | string  | **Required.** Starting URL to crawl. Must include `http://` or `https://`.                                                                           |
| `schema`           | object  | **Required.** A JSON Schema describing the structure of the data you want back. Add `description`s to properties to tell the model what to look for. |
| `instructions`     | string  | Plain-language guidance for the crawl and extraction (max 2000 chars), e.g. `"Focus on the pricing page."`                                           |
| `factCheck`        | boolean | When `true`, only values stated on the crawled pages are returned. When `false` (default), the model may make reasonable inferences.                 |
| `followSubdomains` | boolean | Follow links to subdomains of the starting domain. Default `false`.                                                                                  |
| `maxPages`         | integer | Number of pages to analyze, `1`–`50`. Default `5`.                                                                                                   |
| `maxDepth`         | integer | Maximum link depth from the starting URL. Unlimited by default.                                                                                      |
| `pdf`              | object  | PDF handling: `shouldParse` (default `true`), plus `start` / `end` to limit parsing to a 1-based page range.                                         |
| `includeFrames`    | boolean | Include iframe contents in extraction. Default `false`.                                                                                              |
| `maxAgeMs`         | integer | Maximum age of cached page content to serve, `0`–`2592000000` ms (30 days). Default `604800000` (7 days).                                           |
| `waitForMs`        | integer | Extra browser wait after page load, in milliseconds.                                                                                                 |
| `stopAfterMs`      | integer | Soft time budget for the crawl, `10000`–`110000` ms. Default `80000`.                                                                                |
| `timeoutMS`        | integer | Abort the request with a 408 if it exceeds this many milliseconds. Range `1000`–`300000` (5 min max).                                                |
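
The ranges in the table can be checked before a request leaves your app. The sketch below is one way to do that, assuming a hypothetical helper named `build_extract_payload` (it is not part of the SDK). It only assembles the JSON body and enforces the limits documented above; the actual call is still made through the SDK or `curl`.

```python
from typing import Any

# Ranges copied from the parameter table above.
_RANGES = {
    "maxPages": (1, 50),
    "maxAgeMs": (0, 2_592_000_000),
    "stopAfterMs": (10_000, 110_000),
    "timeoutMS": (1_000, 300_000),
}
MAX_INSTRUCTIONS_CHARS = 2000


def build_extract_payload(url: str, schema: dict[str, Any], **options: Any) -> dict[str, Any]:
    """Hypothetical helper: assemble and range-check a /web/extract request body."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must include http:// or https://")

    instructions = options.get("instructions")
    if instructions is not None and len(instructions) > MAX_INSTRUCTIONS_CHARS:
        raise ValueError("instructions must be at most 2000 characters")

    for name, (low, high) in _RANGES.items():
        value = options.get(name)
        if value is not None and not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}")

    # Drop unset options so the API applies its own defaults.
    payload: dict[str, Any] = {"url": url, "schema": schema}
    payload.update({k: v for k, v in options.items() if v is not None})
    return payload


# Example: a bounded crawl that forbids inferred values and parses PDF pages 1-5.
payload = build_extract_payload(
    "https://stripe.com",
    {"type": "object", "properties": {}},
    maxPages=10,
    factCheck=True,
    pdf={"shouldParse": True, "start": 1, "end": 5},
)
```

Failing fast on the client side avoids spending a round trip on a request that would come back as `INPUT_VALIDATION_ERROR`.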

## Use your schema library

Because `schema` is standard JSON Schema, you don't have to write it by hand: generate it from the schema library you already use, and validate `response.data` on the way out with the same model. Nested structures like arrays of objects come for free:

<CodeGroup>
  ```typescript TypeScript (Zod) theme={null}
  import ContextDev from "context.dev";
  import { z } from "zod";

  const PricingPage = z.object({
    pricing_tiers: z
      .array(
        z.object({
          name: z.string(),
          monthly_price_usd: z
            .number()
            .describe("Monthly USD price. Use 0 when pricing is custom."),
          is_custom_pricing: z.boolean(),
        }),
      )
      .describe("Every pricing tier listed on the pricing page."),
  });

  const client = new ContextDev({ apiKey: process.env.CONTEXT_DEV_API_KEY });

  const response = await client.web.extract({
    url: "https://stripe.com/pricing",
    schema: z.toJSONSchema(PricingPage),
    instructions: "Focus on the pricing page and capture every tier, including enterprise plans.",
  });

  const pricing = PricingPage.parse(response.data);
  console.log(pricing.pricing_tiers);
  ```

  ```python Python (Pydantic) theme={null}
  import os
  from context.dev import ContextDev
  from pydantic import BaseModel, Field

  class PricingTier(BaseModel):
      name: str
      monthly_price_usd: float = Field(description="Monthly USD price. Use 0 when pricing is custom.")
      is_custom_pricing: bool

  class PricingPage(BaseModel):
      pricing_tiers: list[PricingTier] = Field(description="Every pricing tier listed on the pricing page.")

  client = ContextDev(api_key=os.environ["CONTEXT_DEV_API_KEY"])

  response = client.web.extract(
      url="https://stripe.com/pricing",
      schema=PricingPage.model_json_schema(),
      instructions="Focus on the pricing page and capture every tier, including enterprise plans.",
  )

  pricing = PricingPage.model_validate(response.data)
  print(pricing.pricing_tiers)
  ```
</CodeGroup>

<Tip>
  `z.toJSONSchema()` is built into Zod 4. On Zod 3, use the `zod-to-json-schema` package instead. The same pattern works with anything that emits JSON Schema in other languages, such as `dry-schema` in Ruby or `invopop/jsonschema` in Go.
</Tip>
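
For orientation, here is a hand-written JSON Schema roughly equivalent to the pricing models above, plus a minimal standard-library check for projects that do not use a schema library. Treat it as a sketch: generated schemas differ in detail (Pydantic, for example, emits nested models under `$defs` with `$ref` pointers), and `missing_keys` is a hypothetical helper that only checks `required` keys on objects and array items, not types or formats.

```python
from typing import Any

# Hand-written approximation of the PricingPage schema (not generator output).
PRICING_SCHEMA: dict[str, Any] = {
    "type": "object",
    "properties": {
        "pricing_tiers": {
            "type": "array",
            "description": "Every pricing tier listed on the pricing page.",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "monthly_price_usd": {
                        "type": "number",
                        "description": "Monthly USD price. Use 0 when pricing is custom.",
                    },
                    "is_custom_pricing": {"type": "boolean"},
                },
                "required": ["name", "monthly_price_usd", "is_custom_pricing"],
            },
        }
    },
    "required": ["pricing_tiers"],
}


def missing_keys(schema: dict[str, Any], data: Any, path: str = "data") -> list[str]:
    """Hypothetical helper: list paths of required keys that are absent from data."""
    problems: list[str] = []
    if schema.get("type") == "object" and isinstance(data, dict):
        for key in schema.get("required", []):
            if key not in data:
                problems.append(f"{path}.{key}")
        # Recurse into the properties that are present.
        for key, sub_schema in schema.get("properties", {}).items():
            if key in data:
                problems += missing_keys(sub_schema, data[key], f"{path}.{key}")
    elif schema.get("type") == "array" and isinstance(data, list):
        for index, item in enumerate(data):
            problems += missing_keys(schema.get("items", {}), item, f"{path}[{index}]")
    return problems


incomplete = {"pricing_tiers": [{"name": "Integrated"}]}
problems = missing_keys(PRICING_SCHEMA, incomplete)
```

A full validator from your schema library remains the better choice when one is available; this check only catches missing fields.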

## Understand the response

A successful call returns the starting URL, the URLs the crawler actually used, your data in the shape of your schema, and crawl statistics:

```json sample response expandable theme={null}
{
  "status": "ok",
  "url": "https://stripe.com/pricing",
  "urls_analyzed": [
    "https://stripe.com/pricing",
    "https://stripe.com/payments"
  ],
  "data": {
    "pricing_tiers": [
      { "name": "Integrated", "monthly_price_usd": 0, "is_custom_pricing": false },
      { "name": "Customized", "monthly_price_usd": 0, "is_custom_pricing": true }
    ]
  },
  "metadata": {
    "numUrls": 2,
    "maxCrawlDepth": 1,
    "numSucceeded": 2,
    "numFailed": 0,
    "numSkipped": 0
  }
}
```

| Field                    | Type      | Description                                                |
| ------------------------ | --------- | ---------------------------------------------------------- |
| `status`                 | string    | `"ok"` on success.                                         |
| `url`                    | string    | The starting URL that was analyzed.                        |
| `urls_analyzed`          | string\[] | Every URL the crawler actually used to produce the answer. |
| `data`                   | object    | The extracted data. Matches the `schema` you sent.         |
| `metadata.numUrls`       | integer   | Total URLs attempted during the crawl.                     |
| `metadata.maxCrawlDepth` | integer   | Deepest link depth reached.                                |
| `metadata.numSucceeded`  | integer   | Pages fetched and analyzed successfully.                   |
| `metadata.numFailed`     | integer   | Pages that failed to fetch.                                |
| `metadata.numSkipped`    | integer   | Pages skipped as irrelevant to the schema.                 |
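
As a rough illustration of reading these fields, the sketch below condenses a response that has already been decoded into a Python dict. `summarize_crawl` is a hypothetical helper, and the input is the sample response above trimmed to the fields it uses.

```python
from typing import Any

# The sample response above, reduced to the fields this sketch reads.
sample_response = {
    "status": "ok",
    "urls_analyzed": ["https://stripe.com/pricing", "https://stripe.com/payments"],
    "metadata": {"numUrls": 2, "numSucceeded": 2, "numFailed": 0, "numSkipped": 0},
}


def summarize_crawl(response: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: condense crawl statistics for logging or alerting."""
    meta = response["metadata"]
    attempted = meta["numUrls"]
    return {
        "ok": response["status"] == "ok",
        "pages_used": len(response["urls_analyzed"]),
        # Guard against division by zero when nothing was attempted.
        "failure_rate": meta["numFailed"] / attempted if attempted else 0.0,
    }


summary = summarize_crawl(sample_response)
```

A high `failure_rate` alongside sparse `data` can be a hint that the site was hard to fetch rather than that the information is absent.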

Error responses include an `error_code`. The common ones are `INPUT_VALIDATION_ERROR` (bad URL or schema) and `WEBSITE_ACCESS_ERROR` on 400, `UNAUTHORIZED` (missing or invalid API key) on 401, `REQUEST_TIMEOUT` on 408, `RATE_LIMITED` on 429, and `INTERNAL_ERROR` on 500.
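
One way to act on these codes is to separate errors worth retrying from errors that need a change to the request. The sketch below is an assumed policy, not documented API behavior: the error codes come from the list above, while `classify_error`, `backoff_seconds`, and the decision to retry on 408, 429, and 500 are illustrative choices.

```python
# Error codes come from the list above; how to react to each is an assumption.
RETRYABLE = {"REQUEST_TIMEOUT", "RATE_LIMITED", "INTERNAL_ERROR"}
FIX_REQUEST = {"INPUT_VALIDATION_ERROR", "UNAUTHORIZED"}


def classify_error(error_code: str) -> str:
    """Hypothetical policy: map an error_code to 'retry', 'fix_request', or 'inspect'."""
    if error_code in RETRYABLE:
        return "retry"
    if error_code in FIX_REQUEST:
        return "fix_request"
    # WEBSITE_ACCESS_ERROR and unknown codes: the target site may be blocking
    # requests or down, so surface them instead of retrying blindly.
    return "inspect"


def backoff_seconds(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff without jitter: base * 2**attempt, capped at `cap`."""
    return min(cap, base * 2 ** attempt)


# Delays for the first four retry attempts: 1, 2, 4, and 8 seconds.
delays = [backoff_seconds(attempt) for attempt in range(4)]
```

In production you would likely add random jitter to the delay so that many clients do not retry in lockstep.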

## Use cases

* **Lead enrichment**: extract fields such as `founded_year`, `employee_count`, and `headquarters_city` from a company's site to enrich CRM records.
* **Hiring signal tracking**: extract an array of open roles (title, location, team) starting from a careers page for sourcing pipelines or competitor monitoring.
* **Compliance snapshots**: extract a structured summary of privacy policy or terms clauses on a schedule with `factCheck: true`, then diff against the last run.
* **Investor relations data**: extract revenue, ARR, headcount, or funding figures as typed numbers; PDF parsing picks up IR decks and annual reports automatically.
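
The "diff against the last run" step in the compliance use case can be sketched as a key-by-key comparison of two `data` objects. This is a minimal illustration: `diff_snapshots` is a hypothetical helper, the field names `retention_days` and `allows_data_sale` are invented for the example, and only top-level keys are compared.

```python
from typing import Any


def diff_snapshots(previous: dict[str, Any], current: dict[str, Any]) -> dict[str, dict[str, Any]]:
    """Hypothetical helper: report top-level keys whose values changed between runs."""
    changes: dict[str, dict[str, Any]] = {}
    for key in sorted(previous.keys() | current.keys()):
        # A key missing from one side is reported with None for that side.
        before, after = previous.get(key), current.get(key)
        if before != after:
            changes[key] = {"before": before, "after": after}
    return changes


# Invented example values standing in for two scheduled extraction runs.
last_run = {"retention_days": 30, "allows_data_sale": False}
this_run = {"retention_days": 90, "allows_data_sale": False}

changes = diff_snapshots(last_run, this_run)
```

Running with `factCheck: true`, as the use case suggests, helps keep a diff from firing on values the model inferred differently between runs.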

## Next steps

<CardGroup cols={2}>
  <Card title="Scrape Websites" icon="globe" href="/guides/scrape-websites-to-markdown">
    Get clean Markdown, HTML, or sitemap URLs from any page.
  </Card>

  <Card title="Extract Products" icon="cart-shopping" href="/guides/extract-product-from-websites">
    Typed product data (SKU, price, images) from any storefront.
  </Card>

  <Card title="Best Practices" icon="list-check" href="/optimization/best-practices">
    Caching, error handling, and key hygiene.
  </Card>

  <Card title="Troubleshooting" icon="bug" href="/optimization/troubleshooting">
    Status codes, retry patterns, and common errors.
  </Card>
</CardGroup>
