Scrape HTML - Context.dev

Authorizations

Authorization

string

header

required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
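As a minimal sketch, the header described above could be assembled as follows. The helper name `bearer_header` is hypothetical and not part of any Context.dev client; only the `Bearer <token>` form comes from this reference.

```python
def bearer_header(token: str) -> dict:
    """Build the Authorization header in the documented 'Bearer <token>' form."""
    token = token.strip()
    if not token:
        raise ValueError("token must be non-empty")
    return {"Authorization": f"Bearer {token}"}
```

The returned dictionary can be passed as the request headers to whichever HTTP client you use.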

Query Parameters

url

string<uri>

required

Full URL to scrape (must include http:// or https:// protocol)
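A client-side check for the protocol requirement might look like the sketch below. The function name `is_scrapable_url` is an assumption made for illustration; the API may apply stricter or different validation on its side.

```python
from urllib.parse import urlsplit


def is_scrapable_url(url: str) -> bool:
    """Return True when the URL has an http/https scheme and a host part."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

With this check, a bare `example.com/page` is rejected because it has no protocol, while `https://example.com/page` passes.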

pdf

object

PDF parsing controls. Use start/end to limit text extraction and OCR to an inclusive 1-based page range.

includeFrames

boolean

default:false

When true, iframes are rendered inline into the returned HTML.

useMainContentOnly

boolean

default:false

When true, return only the page's main content in the HTML response, excluding headers, footers, sidebars, and navigation when detectable.

includeSelectors

string[]

CSS selectors. When provided, only matching elements (and their descendants) are kept and everything else is dropped. When omitted, the entire document is kept. Examples: "article.main", "#content", "[role=main]".

Maximum array length: 50

Maximum string length: 2048

excludeSelectors

string[]

CSS selectors to remove from the result. Applied after includeSelectors. Exclusion takes precedence: an element matching both is removed. Examples: "nav", "footer", ".ad-banner", "[aria-hidden=true]".

Maximum array length: 50

Maximum string length: 2048
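The limits on includeSelectors and excludeSelectors can be checked before sending a request. This is a sketch under two assumptions: that the 2048 limit applies to each selector string individually, and that it is counted in characters. The helper `check_selectors` is hypothetical.

```python
MAX_SELECTORS = 50           # documented maximum array length
MAX_SELECTOR_LENGTH = 2048   # documented maximum string length (assumed per selector)


def check_selectors(selectors: list) -> list:
    """Raise ValueError if a selector list exceeds the documented limits."""
    if len(selectors) > MAX_SELECTORS:
        raise ValueError(f"at most {MAX_SELECTORS} selectors allowed, got {len(selectors)}")
    for selector in selectors:
        if len(selector) > MAX_SELECTOR_LENGTH:
            raise ValueError(f"selector longer than {MAX_SELECTOR_LENGTH} characters")
    return selectors
```

The same check applies to both parameters, since they share the same limits.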

maxAgeMs

integer

default:86400000

Return a cached result if a prior scrape for the same parameters exists and is younger than this many milliseconds. Defaults to 1 day (86400000 ms) when omitted. Max is 30 days (2592000000 ms). Set to 0 to always scrape fresh.

Required range: 0 <= x <= 2592000000
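Because maxAgeMs is expressed in milliseconds, it can be convenient to derive it from a duration. The sketch below uses a hypothetical helper, `max_age_ms`, and enforces the documented range.

```python
from datetime import timedelta

MAX_AGE_LIMIT_MS = 2_592_000_000  # documented maximum: 30 days


def max_age_ms(age: timedelta) -> int:
    """Convert a duration to whole milliseconds and enforce the documented range."""
    ms = age // timedelta(milliseconds=1)
    if not 0 <= ms <= MAX_AGE_LIMIT_MS:
        raise ValueError(f"maxAgeMs must be between 0 and {MAX_AGE_LIMIT_MS}, got {ms}")
    return ms
```

For example, `max_age_ms(timedelta(days=1))` gives the default of 86400000, and `max_age_ms(timedelta(0))` gives 0, which requests a fresh scrape.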

waitForMs

integer

Optional browser wait time in milliseconds after initial page load. Min: 0. Max: 30000 (30 seconds).

Required range: 0 <= x <= 30000

headers

object

Optional outbound HTTP headers forwarded only to the target URL, sent as deep-object query params such as headers[X-Custom]=value. When provided, caching is bypassed: the result is neither read from nor written to cache.
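The deep-object form can be produced with a standard query-string encoder. The sketch below builds only the query string; the helper name `scrape_query` is hypothetical, and the endpoint path is deliberately left out because this reference does not state it.

```python
from typing import Mapping, Optional
from urllib.parse import urlencode


def scrape_query(url: str, headers: Optional[Mapping[str, str]] = None) -> str:
    """Encode `url` plus outbound headers as headers[Name]=value query params."""
    params = [("url", url)]
    for name, value in (headers or {}).items():
        params.append((f"headers[{name}]", value))
    # urlencode percent-encodes the brackets; they decode back to headers[Name].
    return urlencode(params)
```

Calling `scrape_query("https://example.com/", {"X-Custom": "value"})` yields `url=https%3A%2F%2Fexample.com%2F&headers%5BX-Custom%5D=value`. Keep in mind that any request built this way bypasses the cache, as described above.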

timeoutMS

integer

Optional timeout for the request, in milliseconds. If the request takes longer than this value, it is aborted with a 408 status code. Allowed values range from 1000 ms (1 second) to 300000 ms (5 minutes).

Required range: 1000 <= x <= 300000
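The two timing parameters have different bounds, so a small lookup table can catch out-of-range values before a request is sent. This is an illustrative sketch; `MS_RANGES` and `check_ms` are hypothetical names.

```python
# Inclusive ranges in milliseconds, copied from the parameter reference.
MS_RANGES = {
    "waitForMs": (0, 30_000),
    "timeoutMS": (1_000, 300_000),
}


def check_ms(name: str, value: int) -> int:
    """Return the value unchanged if it lies in the documented range for `name`."""
    low, high = MS_RANGES[name]
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high} ms, got {value}")
    return value
```

Note the capitalization of the parameter names: timeoutMS ends in an uppercase S, unlike waitForMs and maxAgeMs.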

Response

Successful response

success

enum<boolean>

required

Always true for a successful response.

Available options:

true

html

string

required

The scraped content of the page. For normal pages this is the raw HTML. When the page is a sitemap or feed served behind an XSL stylesheet (which browsers render into HTML), this is the underlying XML instead; the type field reports which one was returned.

url

string

required

The URL that was scraped

type

enum<string>

required

Detected content type of the returned html field. Sitemaps and feeds are surfaced as xml; ordinary pages are html.

Available options:

html, xml, json, text, csv, markdown, svg, pdf, docx, doc
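Since the html field can hold something other than HTML, a caller may want to branch on type before parsing. The sketch below assumes the response body has already been decoded into a dictionary with the fields listed here; `parse_scrape_body` is a hypothetical helper, and only the json and xml cases are parsed because this reference does not describe how the other types are represented.

```python
import json
import xml.etree.ElementTree as ET


def parse_scrape_body(response: dict):
    """Parse the `html` field according to the `type` field of a scrape response."""
    kind = response["type"]
    body = response["html"]
    if kind == "json":
        return json.loads(body)
    if kind == "xml":
        # Sitemaps and feeds are surfaced as xml; return the root element.
        return ET.fromstring(body)
    # html, text, csv, markdown, and the remaining types are returned unchanged.
    return body
```

For untrusted XML, a hardened parser may be preferable to the standard-library one used in this sketch.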

metadata

object

required

Metadata extracted from the scraped page HTML.

key_metadata

object

Metadata about the API key used for the request. Included in every response whenever a valid API key is provided, even when the response status is not 200.
