Context.dev’s Extract API takes a starting URL and a JSON Schema describing the data you want, crawls the most relevant pages on the site (following internal links and parsing PDFs along the way), and returns a typed response matching your schema. It’s an alternative to building a pipeline that scrapes multiple pages into Markdown and then runs an LLM over the result.

Prerequisites

  • A Context.dev API key. Sign up at context.dev/signup, copy the key from the dashboard (prefix ctxt_secret_), and export it:
    export CONTEXT_DEV_API_KEY="ctxt_secret_..."
    
  • An SDK (optional). Install the SDK for your language, or skip the install and call the API directly with curl:
    npm install context.dev
    

Extract data

You describe the result you want as one JSON Schema. Property names become the keys of the response’s data object, and each property’s description tells the model what to look for:
curl -X POST https://api.context.dev/v1/web/extract \
  -H "Authorization: Bearer $CONTEXT_DEV_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://stripe.com",
    "schema": {
      "type": "object",
      "properties": {
        "founded_year": {
          "type": "number",
          "description": "The year the company was founded."
        }
      },
      "required": ["founded_year"],
      "additionalProperties": false
    }
  }'
Each successful call costs 10 credits.
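If you prefer not to install an SDK, the same request can be sent from TypeScript. The following is a minimal sketch, assuming a runtime with a global fetch (for example Node 18 or later). The endpoint, headers, and body mirror the curl example; the helper names buildExtractRequest and extract are hypothetical and not part of any Context.dev SDK.

```typescript
// Endpoint and header names are taken from the curl example above.
const EXTRACT_ENDPOINT = "https://api.context.dev/v1/web/extract";

// Hypothetical helper: builds the fetch() arguments without sending anything,
// which keeps the request shape easy to inspect and test.
function buildExtractRequest(apiKey: string, url: string, schema: object) {
  return {
    endpoint: EXTRACT_ENDPOINT,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ url, schema }),
    },
  };
}

// The same schema as in the curl example.
const foundedYearSchema = {
  type: "object",
  properties: {
    founded_year: {
      type: "number",
      description: "The year the company was founded.",
    },
  },
  required: ["founded_year"],
  additionalProperties: false,
};

// Hypothetical wrapper: sends the request and returns the parsed JSON body.
async function extract(apiKey: string, url: string, schema: object): Promise<unknown> {
  const { endpoint, init } = buildExtractRequest(apiKey, url, schema);
  const res = await fetch(endpoint, init);
  if (!res.ok) {
    throw new Error(`Extract request failed with HTTP ${res.status}`);
  }
  return res.json();
}
```

Calling extract(process.env.CONTEXT_DEV_API_KEY ?? "", "https://stripe.com", foundedYearSchema) would send the same request as the curl command.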

Request parameters

| Parameter | Type | Description |
| --- | --- | --- |
| url | string | Required. Starting URL to crawl. Must include http:// or https://. |
| schema | object | Required. A JSON Schema describing the structure of the data you want back. Add descriptions to properties to tell the model what to look for. |
| instructions | string | Plain-language guidance for the crawl and extraction (max 2000 chars), e.g. "Focus on the pricing page." |
| factCheck | boolean | When true, only values stated on the crawled pages are returned. When false (default), the model may make reasonable inferences. |
| followSubdomains | boolean | Follow links to subdomains of the starting domain. Default false. |
| maxPages | integer | Number of pages to analyze, 1 to 50. Default 5. |
| maxDepth | integer | Maximum link depth from the starting URL. Unlimited by default. |
| pdf | object | PDF handling: shouldParse (default true), plus start / end to limit parsing to a 1-based page range. |
| includeFrames | boolean | Include iframe contents in extraction. Default false. |
| maxAgeMs | integer | Serve cached page content up to this old, 0 to 2592000000 ms. Default 604800000 (7 days). |
| waitForMs | integer | Extra browser wait after page load, in milliseconds. |
| stopAfterMs | integer | Soft time budget for the crawl, 10000 to 110000 ms. Default 80000. |
| timeoutMS | integer | Abort the request with a 408 if it exceeds this many milliseconds. Range 1000 to 300000 (5 min max). |
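Checking a few of these constraints on the client side can save a round trip that would otherwise end in a validation error. The sketch below is illustrative only: ExtractOptions and validateExtractOptions are hypothetical names, only a subset of parameters is covered, and the timeoutMS bounds assume the range reads as 1000 to 300000 ms (the 5 minute maximum). The API remains the source of truth for validation.

```typescript
// Hypothetical options type covering a subset of the parameters in the table above.
interface ExtractOptions {
  url: string;
  schema: object;
  instructions?: string;
  factCheck?: boolean;
  timeoutMS?: number;
}

// Hypothetical pre-flight check. Returns a list of problems; empty means
// none of the checks below failed (it does not guarantee the API accepts it).
function validateExtractOptions(opts: ExtractOptions): string[] {
  const problems: string[] = [];

  // "Must include http:// or https://."
  if (!/^https?:\/\//i.test(opts.url)) {
    problems.push("url must include http:// or https://");
  }

  // "max 2000 chars"
  if (opts.instructions !== undefined && opts.instructions.length > 2000) {
    problems.push("instructions must be at most 2000 characters");
  }

  // Assumed range: 1000 to 300000 ms (5 min max).
  if (opts.timeoutMS !== undefined) {
    const t = opts.timeoutMS;
    if (!Number.isInteger(t) || t < 1000 || t > 300000) {
      problems.push("timeoutMS must be an integer between 1000 and 300000");
    }
  }

  return problems;
}
```

A caller might run validateExtractOptions before sending the request and surface the returned messages to the user.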

Use your schema library

Because schema is standard JSON Schema, you don’t have to write it by hand: generate it from the schema library you already use, and validate response.data on the way out with the same model. Nested structures like arrays of objects come for free:
import ContextDev from "context.dev";
import { z } from "zod";

const PricingPage = z.object({
  pricing_tiers: z
    .array(
      z.object({
        name: z.string(),
        monthly_price_usd: z
          .number()
          .describe("Monthly USD price. Use 0 when pricing is custom."),
        is_custom_pricing: z.boolean(),
      }),
    )
    .describe("Every pricing tier listed on the pricing page."),
});

const client = new ContextDev({ apiKey: process.env.CONTEXT_DEV_API_KEY });

const response = await client.web.extract({
  url: "https://stripe.com/pricing",
  schema: z.toJSONSchema(PricingPage),
  instructions: "Focus on the pricing page and capture every tier, including enterprise plans.",
});

const pricing = PricingPage.parse(response.data);
console.log(pricing.pricing_tiers);
z.toJSONSchema() is built into Zod 4. On Zod 3, use the zod-to-json-schema package instead. The same pattern works with anything that emits JSON Schema in other languages, such as dry-schema in Ruby or invopop/jsonschema in Go.

Understand the response

A successful call returns the starting URL, the URLs the crawler actually used, your data in the shape of your schema, and crawl statistics:
sample response
{
  "status": "ok",
  "url": "https://stripe.com/pricing",
  "urls_analyzed": [
    "https://stripe.com/pricing",
    "https://stripe.com/payments"
  ],
  "data": {
    "pricing_tiers": [
      { "name": "Integrated", "monthly_price_usd": 0, "is_custom_pricing": false },
      { "name": "Customized", "monthly_price_usd": 0, "is_custom_pricing": true }
    ]
  },
  "metadata": {
    "numUrls": 2,
    "maxCrawlDepth": 1,
    "numSucceeded": 2,
    "numFailed": 0,
    "numSkipped": 0
  }
}
| Field | Type | Description |
| --- | --- | --- |
| status | string | "ok" on success. |
| url | string | The starting URL that was analyzed. |
| urls_analyzed | string[] | Every URL the crawler actually used to produce the answer. |
| data | object | The extracted data. Matches the schema you sent. |
| metadata.numUrls | integer | Total URLs attempted during the crawl. |
| metadata.maxCrawlDepth | integer | Deepest link depth reached. |
| metadata.numSucceeded | integer | Pages fetched and analyzed successfully. |
| metadata.numFailed | integer | Pages that failed to fetch. |
| metadata.numSkipped | integer | Pages skipped as irrelevant to the schema. |
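When calling the API without an SDK, it can help to give the response a type of your own. The following sketch mirrors the fields listed above; the interface names and the summarizeCrawl helper are hypothetical, and an SDK may ship its own types that differ from these.

```typescript
// Crawl statistics, following the metadata.* fields above.
interface CrawlMetadata {
  numUrls: number;
  maxCrawlDepth: number;
  numSucceeded: number;
  numFailed: number;
  numSkipped: number;
}

// Successful response; T is the shape described by your schema.
interface ExtractResponse<T> {
  status: string;
  url: string;
  urls_analyzed: string[];
  data: T;
  metadata: CrawlMetadata;
}

// Hypothetical helper: a one-line crawl summary suitable for logs.
function summarizeCrawl(meta: CrawlMetadata): string {
  return (
    `${meta.numSucceeded}/${meta.numUrls} pages analyzed, ` +
    `${meta.numFailed} failed, ${meta.numSkipped} skipped, ` +
    `max depth ${meta.maxCrawlDepth}`
  );
}

// The sample response from this section, with the tiers shortened.
const sample: ExtractResponse<{ pricing_tiers: { name: string }[] }> = {
  status: "ok",
  url: "https://stripe.com/pricing",
  urls_analyzed: ["https://stripe.com/pricing", "https://stripe.com/payments"],
  data: { pricing_tiers: [{ name: "Integrated" }, { name: "Customized" }] },
  metadata: { numUrls: 2, maxCrawlDepth: 1, numSucceeded: 2, numFailed: 0, numSkipped: 0 },
};

console.log(summarizeCrawl(sample.metadata));
// prints "2/2 pages analyzed, 0 failed, 0 skipped, max depth 1"
```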
Error responses include an error_code. Common ones: INPUT_VALIDATION_ERROR (bad URL or schema) and WEBSITE_ACCESS_ERROR on 400, UNAUTHORIZED on 401 (missing or invalid API key), REQUEST_TIMEOUT on 408, RATE_LIMITED on 429, and INTERNAL_ERROR on 500.
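One way to act on these codes is to separate errors worth retrying from errors that need a fix on the caller's side. The sketch below is an assumption, not documented behavior: it treats REQUEST_TIMEOUT, RATE_LIMITED, and INTERNAL_ERROR as retryable and everything else as permanent, and it uses a simple capped exponential backoff. The names shouldRetry and backoffDelayMs are hypothetical.

```typescript
// Assumption: these codes are transient. The documentation lists the codes
// but does not say which ones are safe to retry.
const RETRYABLE_CODES = new Set<string>([
  "REQUEST_TIMEOUT", // 408
  "RATE_LIMITED",    // 429
  "INTERNAL_ERROR",  // 500
]);

// Hypothetical helper: true when a retry might succeed without changing the request.
function shouldRetry(errorCode: string): boolean {
  return RETRYABLE_CODES.has(errorCode);
}

// Hypothetical helper: capped exponential backoff, attempt starting at 0.
// The base and cap values are illustrative defaults.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

With these defaults the delays grow as 500, 1000, 2000, 4000 ms and then stay at 8000 ms. Codes such as INPUT_VALIDATION_ERROR and UNAUTHORIZED are not retried, since repeating the same request would fail the same way.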

Use cases

  • Lead enrichment: extract founded_year, employee_count, headquarters_city, and similar fields from a company’s site to enrich CRM records.
  • Hiring signal tracking: extract an array of open roles (title, location, team) starting from a careers page for sourcing pipelines or competitor monitoring.
  • Compliance snapshots: extract a structured summary of privacy policy or terms clauses on a schedule with factCheck: true, then diff against the last run.
  • Investor relations data: extract revenue, ARR, headcount, or funding figures as typed numbers; PDF parsing picks up IR decks and annual reports automatically.
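The "diff against the last run" step in the compliance use case can be as small as comparing two data objects field by field. The following is a minimal sketch; changedFields is a hypothetical helper and the field names in the usage example are invented for illustration. It compares values through JSON.stringify, so nested objects whose keys arrive in a different order would be reported as changed.

```typescript
// A snapshot is the data object returned by one extract run.
type Snapshot = Record<string, unknown>;

// Hypothetical helper: returns the sorted names of top-level fields that were
// added, removed, or changed between two runs.
function changedFields(previous: Snapshot, current: Snapshot): string[] {
  // Merging both objects yields the union of their keys.
  const allKeys = Object.keys({ ...previous, ...current });
  return allKeys
    .filter((key) => JSON.stringify(previous[key]) !== JSON.stringify(current[key]))
    .sort();
}

// Illustrative snapshots with invented field names.
const lastRun: Snapshot = { retention_days: 30, shares_data_with_third_parties: false };
const thisRun: Snapshot = {
  retention_days: 90,
  shares_data_with_third_parties: false,
  dpo_email: "privacy@example.com",
};

console.log(changedFields(lastRun, thisRun));
// prints [ 'dpo_email', 'retention_days' ]
```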

Next steps

  • Scrape Websites: Get clean Markdown, HTML, or sitemap URLs from any page.
  • Extract Products: Typed product data (SKU, price, images) from any storefront.
  • Best Practices: Caching, error handling, and key hygiene.
  • Troubleshooting: Status codes, retry patterns, and common errors.