Extract Structured Data

Context.dev’s Extract API takes a starting URL and a JSON Schema describing the data you want, crawls the most relevant pages on the site (following internal links and parsing PDFs along the way), and returns a typed response matching your schema. It replaces a hand-built pipeline that scrapes multiple pages into Markdown and then runs an LLM over the result.

Prerequisites

A Context.dev API key. Sign up at context.dev/signup, copy the key from the dashboard (prefix ctxt_secret_), and export it:
export CONTEXT_DEV_API_KEY="ctxt_secret_..."

An SDK (optional). Install the one for your language, or skip this step and call the API directly with curl:

npm install context.dev

pip install context.dev

gem install context.dev

go get github.com/context-dot-dev/context-go-sdk

composer require context-dev/context-dev-php

Extract data

You describe the result you want as one JSON Schema. Property names become the keys of the response’s data object, and each property’s description tells the model what to look for:

curl -X POST https://api.context.dev/v1/web/extract \
  -H "Authorization: Bearer $CONTEXT_DEV_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://stripe.com",
    "schema": {
      "type": "object",
      "properties": {
        "founded_year": {
          "type": "number",
          "description": "The year the company was founded."
        }
      },
      "required": ["founded_year"],
      "additionalProperties": false
    }
  }'

import ContextDev from "context.dev";

const client = new ContextDev({ apiKey: process.env.CONTEXT_DEV_API_KEY });

const response = await client.web.extract({
  url: "https://stripe.com",
  schema: {
    type: "object",
    properties: {
      founded_year: {
        type: "number",
        description: "The year the company was founded.",
      },
    },
    required: ["founded_year"],
    additionalProperties: false,
  },
});

console.log(response.data);

import os
from context.dev import ContextDev

client = ContextDev(api_key=os.environ["CONTEXT_DEV_API_KEY"])

response = client.web.extract(
    url="https://stripe.com",
    schema={
        "type": "object",
        "properties": {
            "founded_year": {
                "type": "number",
                "description": "The year the company was founded.",
            }
        },
        "required": ["founded_year"],
        "additionalProperties": False,
    },
)

print(response.data)

require "context_dev"

client = ContextDev::Client.new(api_key: ENV.fetch("CONTEXT_DEV_API_KEY"))

response = client.web.extract(
  url: "https://stripe.com",
  schema: {
    type: "object",
    properties: {
      founded_year: {
        type: "number",
        description: "The year the company was founded.",
      },
    },
    required: ["founded_year"],
    additionalProperties: false,
  },
)

puts response.data

package main

import (
    "context"
    "fmt"
    "os"

    contextdev "github.com/context-dot-dev/context-go-sdk"
    "github.com/context-dot-dev/context-go-sdk/option"
)

func main() {
    client := contextdev.NewClient(
        option.WithAPIKey(os.Getenv("CONTEXT_DEV_API_KEY")),
    )

    response, err := client.Web.Extract(context.TODO(), contextdev.WebExtractParams{
        URL: "https://stripe.com",
        Schema: map[string]any{
            "type": "object",
            "properties": map[string]any{
                "founded_year": map[string]any{
                    "type":        "number",
                    "description": "The year the company was founded.",
                },
            },
            "required":             []string{"founded_year"},
            "additionalProperties": false,
        },
    })
    if err != nil {
        panic(err)
    }

    fmt.Printf("%+v\n", response.Data)
}

<?php

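// Load Composer's autoloader (path assumes this script sits in the project root).
require __DIR__ . '/vendor/autoload.php';

use ContextDev\Client;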

$client = new Client(apiKey: getenv('CONTEXT_DEV_API_KEY'));

$response = $client->web->extract(
    url: 'https://stripe.com',
    schema: [
        'type' => 'object',
        'properties' => [
            'founded_year' => [
                'type' => 'number',
                'description' => 'The year the company was founded.',
            ],
        ],
        'required' => ['founded_year'],
        'additionalProperties' => false,
    ],
);

print_r($response->data);

Each successful call costs 10 credits.

Request parameters

Parameter	Type	Description
`url`	string	Required. Starting URL to crawl. Must include `http://` or `https://`.
`schema`	object	Required. A JSON Schema describing the structure of the data you want back. Add `description`s to properties to tell the model what to look for.
`instructions`	string	Plain-language guidance for the crawl and extraction (max 2000 chars), e.g. `"Focus on the pricing page."`
`factCheck`	boolean	When `true`, only values stated on the crawled pages are returned. When `false` (default), the model may make reasonable inferences.
`followSubdomains`	boolean	Follow links to subdomains of the starting domain. Default `false`.
`maxPages`	integer	Number of pages to analyze, `1`–`50`. Default `5`.
`maxDepth`	integer	Maximum link depth from the starting URL. Unlimited by default.
`pdf`	object	PDF handling: `shouldParse` (default `true`), plus `start` / `end` to limit parsing to a 1-based page range.
`includeFrames`	boolean	Include iframe contents in extraction. Default `false`.
`maxAgeMs`	integer	Maximum age of cached page content that may be served, `0`–`2592000000` ms. Default `604800000` (7 days).
`waitForMs`	integer	Extra browser wait after page load, in milliseconds.
`stopAfterMs`	integer	Soft time budget for the crawl, `10000`–`110000` ms. Default `80000`.
`timeoutMS`	integer	Abort the request with a 408 if it exceeds this many milliseconds. Range `1000`–`300000` (5 min max).
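
The limits in the table can be checked on the client before a request is sent. The sketch below is one way to do that, not part of any SDK: `build_extract_request` is a hypothetical helper that assembles a request body and raises on values outside the documented ranges. It leaves unset options out of the body so the API's own defaults apply.

```python
from typing import Any

# Inclusive (min, max) ranges copied from the parameter table above.
_INT_RANGES = {
    "maxPages": (1, 50),
    "maxAgeMs": (0, 2_592_000_000),
    "stopAfterMs": (10_000, 110_000),
    "timeoutMS": (1_000, 300_000),
}
_MAX_INSTRUCTIONS_CHARS = 2000


def build_extract_request(url: str, schema: dict[str, Any], **options: Any) -> dict[str, Any]:
    """Hypothetical helper: build an Extract request body and check documented limits."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must include http:// or https://")
    instructions = options.get("instructions")
    if instructions is not None and len(instructions) > _MAX_INSTRUCTIONS_CHARS:
        raise ValueError("instructions exceeds 2000 characters")
    for name, (low, high) in _INT_RANGES.items():
        value = options.get(name)
        if value is not None and not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
    # Drop unset options so the API falls back to its defaults.
    body: dict[str, Any] = {"url": url, "schema": schema}
    body.update({key: value for key, value in options.items() if value is not None})
    return body


body = build_extract_request(
    "https://stripe.com",
    {"type": "object", "properties": {"founded_year": {"type": "number"}}},
    maxPages=10,
    factCheck=True,
    pdf={"shouldParse": True, "start": 1, "end": 5},
)
print(sorted(body))
```

The returned dictionary can be sent as the JSON body of the curl request shown earlier, or unpacked into an SDK call.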

Use your schema library

Because schema is standard JSON Schema, you don’t have to write it by hand: generate it from the schema library you already use, and validate response.data on the way out with the same model. Nested structures like arrays of objects come for free:

import ContextDev from "context.dev";
import { z } from "zod";

const PricingPage = z.object({
  pricing_tiers: z
    .array(
      z.object({
        name: z.string(),
        monthly_price_usd: z
          .number()
          .describe("Monthly USD price. Use 0 when pricing is custom."),
        is_custom_pricing: z.boolean(),
      }),
    )
    .describe("Every pricing tier listed on the pricing page."),
});

const client = new ContextDev({ apiKey: process.env.CONTEXT_DEV_API_KEY });

const response = await client.web.extract({
  url: "https://stripe.com/pricing",
  schema: z.toJSONSchema(PricingPage),
  instructions: "Focus on the pricing page and capture every tier, including enterprise plans.",
});

const pricing = PricingPage.parse(response.data);
console.log(pricing.pricing_tiers);

import os
from context.dev import ContextDev
from pydantic import BaseModel, Field

class PricingTier(BaseModel):
    name: str
    monthly_price_usd: float = Field(description="Monthly USD price. Use 0 when pricing is custom.")
    is_custom_pricing: bool

class PricingPage(BaseModel):
    pricing_tiers: list[PricingTier] = Field(description="Every pricing tier listed on the pricing page.")

client = ContextDev(api_key=os.environ["CONTEXT_DEV_API_KEY"])

response = client.web.extract(
    url="https://stripe.com/pricing",
    schema=PricingPage.model_json_schema(),
    instructions="Focus on the pricing page and capture every tier, including enterprise plans.",
)

pricing = PricingPage.model_validate(response.data)
print(pricing.pricing_tiers)

z.toJSONSchema() is built into Zod 4. On Zod 3, use the zod-to-json-schema package instead. The same pattern works with anything that emits JSON Schema in other languages, such as dry-schema in Ruby or invopop/jsonschema in Go.

Understand the response

A successful call returns the starting URL, the URLs the crawler actually used, your data in the shape of your schema, and crawl statistics:

sample response

{
  "status": "ok",
  "url": "https://stripe.com/pricing",
  "urls_analyzed": [
    "https://stripe.com/pricing",
    "https://stripe.com/payments"
  ],
  "data": {
    "pricing_tiers": [
      { "name": "Integrated", "monthly_price_usd": 0, "is_custom_pricing": false },
      { "name": "Customized", "monthly_price_usd": 0, "is_custom_pricing": true }
    ]
  },
  "metadata": {
    "numUrls": 2,
    "maxCrawlDepth": 1,
    "numSucceeded": 2,
    "numFailed": 0,
    "numSkipped": 0,
    "numBlocked": 0
  }
}

Field	Type	Description
`status`	string	`"ok"` on success.
`url`	string	The starting URL that was analyzed.
`urls_analyzed`	string[]	Every URL the crawler actually used to produce the answer.
`data`	object	The extracted data. Matches the `schema` you sent.
`metadata.numUrls`	integer	Total URLs attempted during the crawl.
`metadata.maxCrawlDepth`	integer	Deepest link depth reached.
`metadata.numSucceeded`	integer	Pages fetched and analyzed successfully.
`metadata.numFailed`	integer	Pages that failed to fetch.
`metadata.numSkipped`	integer	Pages skipped as irrelevant to the schema.
`metadata.numBlocked`	integer	Pages excluded because they were CAPTCHA walls, bot interstitials, 403/404 pages, login shells, or parked-domain placeholders. See Blocked and parked pages.

Error responses include an error_code. Common values are INPUT_VALIDATION_ERROR (bad URL or schema) and WEBSITE_ACCESS_ERROR, both returned with 400; UNAUTHORIZED with 401 (missing or invalid API key); REQUEST_TIMEOUT with 408; RATE_LIMITED with 429; and INTERNAL_ERROR with 500.
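
One way to act on these codes is to sort them into "retry", "fix the request", and "give up". The sketch below does that for an already-parsed JSON body. The grouping is this example's own assumption rather than documented retry guidance, and it also assumes error_code sits at the top level of the error body. `next_action` is a hypothetical name.

```python
# error_code values come from the list above; which ones are worth retrying
# is an assumption of this sketch, not documented behaviour.
RETRYABLE = {"REQUEST_TIMEOUT", "RATE_LIMITED", "INTERNAL_ERROR"}
FIX_REQUEST = {"INPUT_VALIDATION_ERROR", "UNAUTHORIZED"}


def next_action(status_code: int, body: dict) -> str:
    """Return 'use', 'retry', 'fix_request', or 'give_up' for a parsed response."""
    if 200 <= status_code < 300 and body.get("status") == "ok":
        return "use"
    # Assumes error_code is a top-level field of the error body.
    code = body.get("error_code")
    if code in RETRYABLE:
        return "retry"
    if code in FIX_REQUEST:
        return "fix_request"
    # WEBSITE_ACCESS_ERROR and anything unrecognised: the site could not be read.
    return "give_up"


print(next_action(429, {"error_code": "RATE_LIMITED"}))
```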

Blocked and parked pages

Before pages reach the extraction model, the crawler classifies each fetched page and excludes CAPTCHA walls, bot interstitials, 403/404 pages, login shells, and parked-domain landers. Excluded pages don’t contribute to data and are counted in metadata.numBlocked.

If the starting URL is blocked, the crawler retries it once with cache bypass and anti-blocking enabled before giving up.
If every crawled page ends up blocked, the request returns 400 WEBSITE_ACCESS_ERROR and is not billed. Treat this the same as any other WEBSITE_ACCESS_ERROR (see Troubleshooting).
If some pages come back clean, extraction runs against those and numBlocked records how many were dropped.
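
When a call succeeds but some pages were dropped, the metadata counters show how much of the crawl was lost. The following sketch summarizes them. `blocked_summary` is a hypothetical helper, and it assumes blocked pages are included in numUrls, which the field descriptions suggest but do not state. The second metadata dictionary uses made-up numbers for illustration.

```python
def blocked_summary(metadata: dict) -> dict:
    """Summarize how many attempted pages were excluded as blocked or parked."""
    attempted = metadata["numUrls"]
    blocked = metadata["numBlocked"]
    return {
        "attempted": attempted,
        "blocked": blocked,
        # Assumes numBlocked is a subset of numUrls; guards against an empty crawl.
        "blocked_ratio": blocked / attempted if attempted else 0.0,
        "any_blocked": blocked > 0,
    }


# Metadata from the sample response above: nothing was blocked.
clean = blocked_summary({"numUrls": 2, "numSucceeded": 2, "numFailed": 0,
                         "numSkipped": 0, "numBlocked": 0})

# Made-up metadata for illustration: one of four pages was blocked.
partial = blocked_summary({"numUrls": 4, "numSucceeded": 3, "numFailed": 0,
                           "numSkipped": 0, "numBlocked": 1})

print(clean["any_blocked"], partial["blocked_ratio"])
```

A high ratio on a successful call is a hint that data was built from a small part of the site, so it may be worth re-running with different crawl settings.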

Schema type aliases

schema follows JSON Schema, so type values should be "string", "number", "integer", "boolean", "array", "object", or "null". Common non-canonical names hand-written by callers are accepted and normalized before validation: for example, "List" becomes "array"; "Text", "URL", and "Email" become "string"; "Int" becomes "integer"; and "Bool" becomes "boolean". Prefer the canonical keywords in new code; the aliases exist to keep hand-written schemas working.
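
If you want schemas to be canonical before they leave your code, the same mapping can be applied locally. The sketch below covers only the aliases named in this section; the API may accept others, and whether it matches case-insensitively is not stated here, so this version matches exactly. `canonicalize_types` is a hypothetical helper.

```python
from typing import Any

# Only the aliases named in this section; the API's full list is not known here.
TYPE_ALIASES = {
    "List": "array",
    "Text": "string",
    "URL": "string",
    "Email": "string",
    "Int": "integer",
    "Bool": "boolean",
}


def canonicalize_types(node: Any) -> Any:
    """Return a copy of a schema with documented type aliases made canonical."""
    if isinstance(node, list):
        return [canonicalize_types(item) for item in node]
    if not isinstance(node, dict):
        return node
    out = {}
    for key, value in node.items():
        if key == "type" and isinstance(value, str):
            out[key] = TYPE_ALIASES.get(value, value)
        elif key == "type" and isinstance(value, list):
            # JSON Schema also allows a list of types, e.g. ["string", "null"].
            out[key] = [TYPE_ALIASES.get(v, v) if isinstance(v, str) else v for v in value]
        else:
            # A property that happens to be named "type" holds a dict and lands here.
            out[key] = canonicalize_types(value)
    return out


hand_written = {
    "type": "object",
    "properties": {
        "tags": {"type": "List", "items": {"type": "Text"}},
        "homepage": {"type": "URL"},
        "employee_count": {"type": ["Int", "null"]},
    },
}
canonical = canonicalize_types(hand_written)
print(canonical["properties"]["tags"])
```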

Use cases

Lead enrichment: extract founded_year, employee_count, headquarters_city etc. from a company’s site to enrich CRM records.
Hiring signal tracking: extract an array of open roles (title, location, team) starting from a careers page for sourcing pipelines or competitor monitoring.
Compliance snapshots: extract a structured summary of privacy policy or terms clauses on a schedule with factCheck: true, then diff against the last run.
Investor relations data: extract revenue, ARR, headcount, or funding figures as typed numbers; PDF parsing picks up IR decks and annual reports automatically.
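
For the compliance snapshot case, the "diff against the last run" step can be as small as comparing two stored data objects. The sketch below compares top-level keys only. `diff_snapshots` is a hypothetical helper, and the field names and values are made up for illustration.

```python
from typing import Any


def diff_snapshots(previous: dict[str, Any], current: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Map each top-level key whose value changed to (old, new); a missing key reads as None."""
    changed = {}
    for key in sorted(previous.keys() | current.keys()):
        old, new = previous.get(key), current.get(key)
        if old != new:
            changed[key] = (old, new)
    return changed


# Made-up extraction results from two scheduled runs.
last_run = {"data_retention_days": 30, "sells_personal_data": False}
this_run = {"data_retention_days": 90, "sells_personal_data": False, "dpo_contact": "privacy@example.com"}

changes = diff_snapshots(last_run, this_run)
for field, (old, new) in changes.items():
    print(f"{field}: {old!r} -> {new!r}")
```

Running with factCheck: true, as the use case suggests, keeps inferred values out of the comparison, so a reported change is more likely to reflect an edit to the page itself.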

Next steps

Scrape Websites

Get clean Markdown, HTML, or sitemap URLs from any page.

Extract Products

Typed product data: SKU, price, images, from any storefront.

Best Practices

Caching, error handling, and key hygiene.

Troubleshooting

Status codes, retry patterns, and common errors.
