Extract Structured Website Data
Crawl a website, use the provided JSON Schema and instructions to prioritize relevant internal links, and extract structured data from the selected pages.
Authorizations
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
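As a minimal sketch, the header could be assembled in Python as follows. Only the `Bearer <token>` form comes from the description above; the helper name `auth_headers` and the `Content-Type` entry are assumptions made for illustration.

```python
def auth_headers(token: str) -> dict:
    """Build request headers that carry the bearer token."""
    token = token.strip()
    if not token:
        raise ValueError("auth token must not be empty")
    return {
        # Form taken from the text above: "Bearer <token>".
        "Authorization": f"Bearer {token}",
        # Assumption: the request body is sent as JSON.
        "Content-Type": "application/json",
    }
```

Stripping the token first guards against a trailing newline picked up from an environment variable or a secrets file.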
Body
The starting website URL to crawl and extract from. Must include http:// or https://.
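The scheme requirement can be checked on the client before a request is sent. The sketch below uses only the Python standard library; the function name `is_valid_start_url` is hypothetical, and the extra check for a non-empty host is an assumption rather than a documented rule.

```python
from urllib.parse import urlsplit


def is_valid_start_url(url: str) -> bool:
    """Return True if the URL has an http or https scheme and a host."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit raises on malformed input such as an unclosed IPv6 bracket.
        return False
    # urlsplit lowercases the scheme, so "HTTPS://..." is accepted too.
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

A bare hostname such as `example.com` fails this check because it carries no scheme.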
JSON Schema for the returned data object. TypeScript Zod users can pass a JSON Schema generated from a Zod object; Python users can pass the equivalent JSON Schema object.
{
  "type": "object",
  "properties": {
    "mission_statement": {
      "type": "string",
      "description": "The company's stated mission."
    },
    "case_studies": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "title": { "type": "string" },
          "url": { "type": "string" }
        },
        "required": ["title", "url"],
        "additionalProperties": false
      }
    }
  },
  "required": ["mission_statement", "case_studies"],
  "additionalProperties": false
}
Optional extraction guidance, such as which facts to prioritize or how to interpret fields in the schema.
Maximum length: 2000
When true, every returned value must be grounded in facts stated on the page; fields that the page cannot support are returned as null or empty. When false (default), the model may make reasonable inferences and derivations from the page content (e.g. ideal customer, competitor analysis, recommendations) while keeping verifiable specifics (names, quotes, URLs, dates, metrics) faithful to the source.
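To show how the schema, the guidance text, and the grounding flag might travel together, here is a hedged sketch of a request-body builder. This page does not show the parameter names, so every key below (`url`, `schema`, `guidance`, `grounded_only`) is a placeholder, not the API's real field name. Reading the bare 2000 as a maximum length for the guidance text is likewise an assumption.

```python
import json
from typing import Optional

# Assumption: the "2000" listed with the guidance field is a maximum length.
MAX_GUIDANCE_CHARS = 2000


def build_body(url: str, schema: dict, guidance: Optional[str] = None,
               grounded_only: bool = False) -> str:
    """Serialize a request body. All key names are hypothetical placeholders."""
    if guidance is not None and len(guidance) > MAX_GUIDANCE_CHARS:
        raise ValueError(
            f"guidance is {len(guidance)} characters; "
            f"limit assumed to be {MAX_GUIDANCE_CHARS}"
        )
    body = {
        "url": url,                      # placeholder key
        "schema": schema,                # placeholder key
        "grounded_only": grounded_only,  # placeholder key; False mirrors the stated default
    }
    if guidance:
        body["guidance"] = guidance      # placeholder key; omitted when empty
    return json.dumps(body)
```

With the flag set to true, the caller should expect null or empty values for any schema field the page does not support, so downstream code needs to tolerate them.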
When true, follow links on subdomains of the starting URL's domain.
When true, iframe contents are included in Markdown before extraction.
Return cached scrape results if a prior scrape for the same parameters is younger than this many milliseconds. Defaults to 7 days (604800000 ms).
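Because the value is given in milliseconds, converting from a readable duration helps avoid off-by-a-factor mistakes. The sketch below is one way to do that; the helper name `cache_age_ms` is hypothetical, and the upper bound is the 2592000000 ms limit listed for this field.

```python
from datetime import timedelta

# Upper bound listed for this field: 2592000000 ms, which equals 30 days.
MAX_CACHE_AGE_MS = 2_592_000_000


def cache_age_ms(age: timedelta) -> int:
    """Convert a duration to whole milliseconds and check the allowed range."""
    ms = age // timedelta(milliseconds=1)
    if not 0 <= ms <= MAX_CACHE_AGE_MS:
        raise ValueError(f"{ms} ms is outside 0..{MAX_CACHE_AGE_MS} ms")
    return ms
```

For example, `cache_age_ms(timedelta(days=7))` gives 604800000, the documented default, and `cache_age_ms(timedelta(0))` gives 0.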
Range: 0 <= x <= 2592000000
Optional browser wait time in milliseconds after initial page load for each crawled page.
Range: 0 <= x <= 30000
Soft time budget for the crawl in milliseconds.
Range: 10000 <= x <= 240000
Optional timeout in milliseconds for the request. If the request takes longer than this value, it is aborted with a 408 status code. Maximum allowed value is 300000 ms (5 minutes).
Range: 1000 <= x <= 300000
Response
Successful response
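Alongside the successful response, the request timeout described in the body parameters surfaces as a 408 status. The sketch below checks the three timing values against their listed ranges before sending and recognizes that status afterwards. The key names `wait_ms`, `crawl_budget_ms`, and `timeout_ms` are placeholders, since the real parameter names are not shown on this page.

```python
# (min_ms, max_ms) pairs copied from the ranges listed for the body
# parameters. The key names are hypothetical placeholders.
RANGES_MS = {
    "wait_ms": (0, 30_000),               # browser wait after page load
    "crawl_budget_ms": (10_000, 240_000),  # soft time budget for the crawl
    "timeout_ms": (1_000, 300_000),       # request timeout
}


def check_timing(**values: int) -> dict:
    """Return the values unchanged if each lies within its listed range."""
    checked = {}
    for name, value in values.items():
        if name not in RANGES_MS:
            raise KeyError(f"unknown timing value: {name}")
        low, high = RANGES_MS[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}..{high} ms")
        checked[name] = value
    return checked


def timed_out(status_code: int) -> bool:
    """True when the request was aborted for exceeding its timeout (408)."""
    return status_code == 408
```

Validating locally turns an out-of-range value into an immediate error rather than a rejected request.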