Context.dev’s Web APIs can:
  • Turn a webpage into clean markdown/HTML
  • Crawl an entire website and save every page as markdown
  • Get all webpages under a domain
  • Extract every image on a webpage
Context.dev’s scrapers automatically switch to a different web proxy when a request is blocked by bot protection or geo-restrictions.

Prerequisites

  • A Context.dev API key. Sign up at context.dev/signup, copy the key from the dashboard (prefix ctxt_secret_), and export it:
    export CONTEXT_DEV_API_KEY="ctxt_secret_..."
    
  • An SDK (optional). Install for your language, or skip the install and call directly with curl:
    npm install context.dev
    

Scrape a single page to Markdown

GET /web/scrape/markdown scrapes any URL into LLM-ready GitHub Flavored Markdown. Bot protection and geo-blocks are handled by automatic proxy escalation; pass useMainContentOnly: true to drop nav, footer, sidebars, and other chrome.
curl -G https://api.context.dev/v1/web/scrape/markdown \
  -H "Authorization: Bearer $CONTEXT_DEV_API_KEY" \
  --data-urlencode "url=https://example.com" \
  --data-urlencode "useMainContentOnly=true"
1 credit per call

The connection stays open while the page is fetched and converted, so there’s no need to poll. Repeated calls for the same URL within maxAgeMs return the cached scrape.
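The same request can be sketched in Python with only the standard library. The helper names `build_scrape_url` and `scrape_markdown` are illustrative and not part of any Context.dev SDK; the endpoint, query parameters, and bearer header are taken from the curl call above.

```python
import json
import os
import urllib.parse
import urllib.request

API_BASE = "https://api.context.dev/v1"  # base URL taken from the curl example


def build_scrape_url(page_url: str, **params: object) -> str:
    """Build the GET URL for /web/scrape/markdown with URL-encoded query params."""
    query = {"url": page_url}
    for name, value in params.items():
        # Booleans are sent as lowercase "true"/"false", matching the curl call.
        query[name] = str(value).lower() if isinstance(value, bool) else str(value)
    return f"{API_BASE}/web/scrape/markdown?{urllib.parse.urlencode(query)}"


def scrape_markdown(page_url: str, **params: object) -> dict:
    """Send the request and decode the JSON body (needs network access and an API key)."""
    request = urllib.request.Request(
        build_scrape_url(page_url, **params),
        headers={"Authorization": f"Bearer {os.environ['CONTEXT_DEV_API_KEY']}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `scrape_markdown("https://example.com", useMainContentOnly=True)` blocks until the response arrives, which mirrors the no-polling behavior described above.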

Request Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url | string (URI) | none | Required. Full URL to scrape. Must include http:// or https://. |
| includeLinks | boolean | true | Preserve hyperlinks in the Markdown output. |
| includeImages | boolean | false | Include image references in the Markdown output. |
| shortenBase64Images | boolean | true | Truncate base64-encoded image data so it doesn’t dominate the response. |
| useMainContentOnly | boolean | false | Strip headers, footers, sidebars, and navigation, keeping only the main content. |
| includeFrames | boolean | false | When true, the contents of iframes are rendered to Markdown. |
| includeSelectors | string[] | none | CSS selectors. When provided, only matching HTML subtrees (and their descendants) are kept before conversion to Markdown. Examples: article.main, #content, [role=main]. |
| excludeSelectors | string[] | none | CSS selectors to remove before conversion to Markdown. Applied after includeSelectors; exclusion takes precedence. Examples: nav, footer, .ad-banner. |
| pdf | object | { shouldParse: true } | PDF-page controls: shouldParse, start, end (1-based inclusive range). Set shouldParse: false to skip PDFs. |
| maxAgeMs | integer | 86400000 (24h) | Return a cached scrape if one exists younger than this. 0 forces a fresh scrape. Max is 30 days. |
| waitForMs | integer | none | Browser wait time after initial load (max 30000). Use when the page needs JS time to populate. |
| headers | object | none | Outbound HTTP headers forwarded to the target URL, sent as deep-object query params (e.g. headers[X-Custom]=value). When provided, caching is bypassed entirely. |
| timeoutMS | integer | none | Abort with a 408 if the request exceeds this many milliseconds. Min 1000, max 300000 (5 min). |
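As a rough illustration of the deep-object form for headers and of the documented numeric limits, the sketch below builds a query string and checks values client-side before sending. The helper names `encode_headers` and `check_timing` are hypothetical, and the lower bound of 0 for waitForMs is an assumption, since the table only gives its maximum.

```python
import urllib.parse

# Limits copied from the parameter table above.
MAX_WAIT_FOR_MS = 30_000
MIN_TIMEOUT_MS, MAX_TIMEOUT_MS = 1_000, 300_000
MAX_AGE_LIMIT_MS = 30 * 24 * 60 * 60 * 1000  # 30 days in milliseconds


def encode_headers(headers: dict[str, str]) -> list[tuple[str, str]]:
    """Flatten headers into deep-object pairs, e.g. ("headers[X-Custom]", "value")."""
    return [(f"headers[{name}]", value) for name, value in headers.items()]


def check_timing(wait_for_ms: int = 0, timeout_ms: int = MIN_TIMEOUT_MS, max_age_ms: int = 0) -> None:
    """Raise ValueError when a value falls outside its documented range."""
    # The lower bound of 0 for waitForMs is an assumption, not a documented limit.
    if not 0 <= wait_for_ms <= MAX_WAIT_FOR_MS:
        raise ValueError("waitForMs must be between 0 and 30000")
    if not MIN_TIMEOUT_MS <= timeout_ms <= MAX_TIMEOUT_MS:
        raise ValueError("timeoutMS must be between 1000 and 300000")
    if not 0 <= max_age_ms <= MAX_AGE_LIMIT_MS:
        raise ValueError("maxAgeMs must be between 0 and 30 days")


pairs = [("url", "https://example.com"), *encode_headers({"X-Custom": "value"})]
# safe="[]" keeps the brackets literal, as they appear in the curl examples.
query = urllib.parse.urlencode(pairs, safe="[]")
```

Because sending headers bypasses the cache, it may be worth adding them only on requests that genuinely need them.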

Response

{
  "success": true,
  "url": "https://example.com",
  "markdown": "# Example Domain\n\nThis domain is for use in documentation examples without needing permission. Avoid use in operations.\n\n[Learn more](https://iana.org/domains/example)"
}
| Field | Type | Description |
| --- | --- | --- |
| success | boolean | true when the scrape completed. |
| url | string | The URL that was scraped. |
| markdown | string | The page rendered as GitHub Flavored Markdown. By default the full page is converted; pass useMainContentOnly: true to strip nav, footer, sidebars, and other chrome. |
To get the page as raw HTML:
curl -G https://api.context.dev/v1/web/scrape/html \
  -H "Authorization: Bearer $CONTEXT_DEV_API_KEY" \
  --data-urlencode "url=https://example.com"
1 credit per call

Request Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url | string (URI) | none | Required. Full URL to scrape. |
| includeFrames | boolean | false | When true, iframes are rendered inline into the returned HTML. |
| useMainContentOnly | boolean | false | Return only the page’s main content, excluding headers, footers, sidebars, and navigation when detectable. |
| includeSelectors | string[] | none | CSS selectors. When provided, only matching subtrees (and their descendants) are kept; everything else is dropped. |
| excludeSelectors | string[] | none | CSS selectors to remove from the result. Applied after includeSelectors; exclusion takes precedence. |
| pdf | object | { shouldParse: true } | PDF-page controls; same shape as /web/scrape/markdown. |
| maxAgeMs | integer | 86400000 | Cache TTL. 0 for fresh. Max 30 days. |
| waitForMs | integer | none | Wait after initial load (max 30000). |
| headers | object | none | Outbound HTTP headers forwarded to the target URL (e.g. headers[X-Custom]=value). When provided, caching is bypassed. |
| timeoutMS | integer | none | Abort with a 408 if the request exceeds this many milliseconds. Min 1000, max 300000 (5 min). |
Response
{
  "success": true,
  "url": "https://example.com",
  "html": "<!DOCTYPE html><html lang=\"en\"><head><title>Example Domain</title><meta name=\"viewport\" content=\"width=device-width, initial-scale=1\"><style>body{background:#eee;width:60vw;margin:15vh auto;font-family:system-ui,sans-serif}…</style></head><body>…</body></html>"
}

Crawl a whole site

POST /web/crawl takes a seed URL and returns an array of scraped pages in one call. That’s exactly the shape you want for seeding a RAG index or building a knowledge base.
curl -X POST https://api.context.dev/v1/web/crawl \
  -H "Authorization: Bearer $CONTEXT_DEV_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://context.dev",
    "maxPages": 3,
    "maxDepth": 1
  }'
1 credit per page scraped
Crawls are billed per page, not per call. Scraping a website with 50 pages costs 50 credits. Set maxPages accordingly.
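Since maxPages bounds the credit cost, one way to keep a crawl within budget is to derive it from the credits you are willing to spend. The sketch below builds the same POST request as the curl call with Python's standard library; `cap_pages_to_budget` and `build_crawl_request` are hypothetical helper names, and the 500-page hard cap comes from the maxPages description in the parameter table.

```python
import json
import urllib.request

CRAWL_URL = "https://api.context.dev/v1/web/crawl"
MAX_PAGES_HARD_CAP = 500  # documented hard cap for maxPages


def cap_pages_to_budget(wanted_pages: int, credit_budget: int) -> int:
    """At 1 credit per page scraped, never ask for more pages than the budget covers."""
    return max(0, min(wanted_pages, credit_budget, MAX_PAGES_HARD_CAP))


def build_crawl_request(seed_url: str, max_pages: int, max_depth: int, api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) the POST /web/crawl request."""
    body = {"url": seed_url, "maxPages": max_pages, "maxDepth": max_depth}
    return urllib.request.Request(
        CRAWL_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the crawl; the response body is the JSON document described in the Response section.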

Request Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url | string (URI) | none | Required. Starting URL for the crawl. |
| maxPages | integer | 100 | Maximum pages to crawl. Hard cap: 500. |
| maxDepth | integer | none | Maximum link depth from the starting URL (0 = only the seed). |
| urlRegex | string | none | Only URLs matching this regex are followed and scraped. Example: ^https?://[^/]+/blog/. |
| followSubdomains | boolean | false | When true, follow links on subdomains (docs.example.com from example.com). www and apex are always treated as equivalent. |
| includeLinks | boolean | true | Preserve hyperlinks in each page’s Markdown. |
| includeImages | boolean | false | Include image references in each page’s Markdown. |
| shortenBase64Images | boolean | true | Truncate base64 image data. |
| useMainContentOnly | boolean | false | Strip nav/footer/sidebars on every page. |
| includeFrames | boolean | false | Render iframes on every page. |
| includeSelectors | string[] | none | CSS selectors. When provided, only matching HTML subtrees (and their descendants) are kept before each page is converted to Markdown. |
| excludeSelectors | string[] | none | CSS selectors to remove before each page is converted to Markdown. Applied after includeSelectors; exclusion takes precedence. |
| pdf | object | { shouldParse: true } | PDF-page controls. Set shouldParse: false to skip PDFs entirely. |
| maxAgeMs | integer | 86400000 | Per-page cache TTL. |
| waitForMs | integer | none | Per-page wait after initial load. Max 30000. |
| stopAfterMs | integer | 80000 | Soft time budget for the entire crawl (10000–110000). The crawler returns what it has so far when exceeded. |
| timeoutMS | integer | none | Hard abort: returns a 408 if the request exceeds this many milliseconds. Min 1000, max 300000 (5 min). |
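To preview what the example urlRegex would keep, the pattern can be tried locally before spending credits. This sketch assumes the API applies the pattern as an unanchored search (as Python's `re.search` does), which the documentation does not state explicitly; the candidate URLs are made up for illustration.

```python
import re

# Example pattern from the urlRegex row: any host, paths under /blog/.
BLOG_ONLY = re.compile(r"^https?://[^/]+/blog/")

# Hypothetical links a crawler might discover on example.com.
candidates = [
    "https://example.com/blog/launch-notes",
    "https://example.com/pricing",
    "http://docs.example.com/blog/changelog",
    "https://example.com/about/blog-team",
]

# Keep only URLs the pattern matches; the others would not be followed.
followed = [url for url in candidates if BLOG_ONLY.search(url)]
```

Here `followed` keeps the two `/blog/` URLs and drops the pricing and about pages.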

Response

{
  "results": [
    {
      "markdown": "# Context.dev\n\nTurn any domain into structured, AI-ready data…",
      "metadata": {
        "url": "https://context.dev/",
        "title": "Context.dev: Brand & Web APIs for Agents",
        "crawlDepth": 0,
        "statusCode": 200,
        "success": true
      }
    },
    {
      "markdown": "# Brand.dev is now Context.dev\n\nWhy we renamed…",
      "metadata": {
        "url": "https://context.dev/blog/brand-dev-is-now-context-dev",
        "title": "Brand.dev is now Context.dev",
        "crawlDepth": 1,
        "statusCode": 200,
        "success": true
      }
    },
    {
      "markdown": "# Pricing\n\nPay only for successful calls…",
      "metadata": {
        "url": "https://context.dev/pricing",
        "title": "Pricing | Context.dev",
        "crawlDepth": 1,
        "statusCode": 200,
        "success": true
      }
    }
  ],
  "metadata": {
    "numUrls": 3,
    "maxCrawlDepth": 1,
    "numSucceeded": 3,
    "numFailed": 0,
    "numSkipped": 0
  }
}
| Field | Type | Description |
| --- | --- | --- |
| results[] | array | One entry per crawled page. |
| results[].markdown | string | The page body as GitHub-Flavored Markdown. |
| results[].metadata.url | string | The URL that was fetched (after redirects). |
| results[].metadata.title | string | The page’s <title> tag value. |
| results[].metadata.crawlDepth | number | Link-hops from the seed URL (0 for the seed itself). |
| results[].metadata.statusCode | number | HTTP status of the underlying fetch. |
| results[].metadata.success | boolean | false for pages that failed to render; markdown may be empty. |
| metadata.numUrls | number | Total URLs the crawler attempted. |
| metadata.maxCrawlDepth | number | Deepest hop reached during the crawl. |
| metadata.numSucceeded | number | Pages fetched successfully. Matches the credit cost. |
| metadata.numFailed | number | Pages that errored. |
| metadata.numSkipped | number | Pages skipped (e.g. by urlRegex or pdf: { shouldParse: false }). |
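A crawl response can be split into usable pages and failures before indexing. The following is a minimal sketch using the field names from the table above; `summarize_crawl` is a hypothetical helper, and the failed entry in the sample data is invented to exercise the failure path.

```python
def summarize_crawl(crawl: dict) -> dict:
    """Separate successful pages from failures and report the credit cost."""
    pages, failed_urls = [], []
    for result in crawl["results"]:
        meta = result["metadata"]
        if meta["success"]:
            pages.append(
                {"url": meta["url"], "title": meta.get("title", ""), "markdown": result["markdown"]}
            )
        else:
            failed_urls.append(meta["url"])
    # numSucceeded matches the credit cost, per the field table.
    return {"pages": pages, "failed_urls": failed_urls, "credits": crawl["metadata"]["numSucceeded"]}


crawl = {
    "results": [
        {
            "markdown": "# Pricing\n\nPay only for successful calls…",
            "metadata": {"url": "https://context.dev/pricing", "title": "Pricing | Context.dev",
                         "crawlDepth": 1, "statusCode": 200, "success": True},
        },
        {  # Invented failure entry, for illustration only.
            "markdown": "",
            "metadata": {"url": "https://example.com/broken", "title": "",
                         "crawlDepth": 1, "statusCode": 500, "success": False},
        },
    ],
    "metadata": {"numUrls": 2, "maxCrawlDepth": 1, "numSucceeded": 1, "numFailed": 1, "numSkipped": 0},
}
summary = summarize_crawl(crawl)
```

The `pages` list is the part you would chunk and embed; `failed_urls` can be retried with a single-page scrape.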

Get all URLs of a domain

GET /web/scrape/sitemap reads sitemap.xml from a domain root, follows any nested sitemap indexes, and returns a de-duplicated URL list without rendering any of the pages. Use it for cheap coverage of large sites or to feed a downstream scraper with a curated list.
curl -G https://api.context.dev/v1/web/scrape/sitemap \
  -H "Authorization: Bearer $CONTEXT_DEV_API_KEY" \
  --data-urlencode "domain=stripe.com" \
  --data-urlencode "maxLinks=50" \
  --data-urlencode "urlRegex=/customers/"
1 credit per call

Request Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| domain | string | none | Required. Domain to build a sitemap for (e.g. example.com). No protocol required; the API validates and normalizes the input. |
| maxLinks | integer | 500 | Maximum URLs to return (effective range 1–500). The response’s urls[] array is hard-capped at 500 entries, so values above 500 are clamped. |
| urlRegex | string | none | Filter the discovered URLs by regex pattern. |
| headers | object | none | Outbound HTTP headers forwarded to the target URL (e.g. headers[X-Custom]=value). When provided, caching is bypassed. |
| timeoutMS | integer | none | Abort with a 408 if the request exceeds this many milliseconds. Min 1000, max 300000 (5 min). |

Response

{
  "success": true,
  "domain": "stripe.com",
  "urls": [
    "https://stripe.com/customers/all",
    "https://stripe.com/customers/gamma",
    "https://stripe.com/customers/chatbase"
  ],
  "meta": {
    "sitemapsDiscovered": 4,
    "sitemapsFetched": 4,
    "sitemapsSkipped": 0,
    "errors": 0
  }
}
| Field | Type | Description |
| --- | --- | --- |
| success | boolean | true when the sitemap crawl completed. |
| domain | string | The normalized domain that was crawled. |
| urls[] | string[] | Discovered page URLs, de-duplicated. Bounded by maxLinks and capped at 500 entries per response. |
| meta.sitemapsDiscovered | number | Total sitemap XML files discovered (root + nested indexes). |
| meta.sitemapsFetched | number | Sitemaps actually fetched and parsed. |
| meta.sitemapsSkipped | number | Sitemaps skipped (404s, malformed XML, etc.). |
| meta.errors | number | Errors encountered during crawling. |
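When the sitemap result feeds a downstream scraper, a small planning step can keep the follow-up scrapes inside a credit budget. This sketch assumes each follow-up scrape is billed at 1 credit regardless of caching, which the documentation does not confirm; `clamp_max_links` and `plan_scrapes` are hypothetical helpers, and clamping values below 1 is a client-side choice rather than documented API behavior.

```python
MAX_LINKS_CAP = 500  # urls[] is hard-capped at 500 entries per response


def clamp_max_links(requested: int) -> int:
    """Keep maxLinks inside the effective 1-500 range before sending it."""
    return max(1, min(requested, MAX_LINKS_CAP))


def plan_scrapes(sitemap: dict, credit_budget: int) -> list[str]:
    """Pick as many discovered URLs as the budget allows at 1 credit per scrape."""
    if not sitemap.get("success"):
        return []
    return sitemap["urls"][: max(credit_budget, 0)]


# Trimmed copy of the sample response above.
sitemap = {
    "success": True,
    "domain": "stripe.com",
    "urls": [
        "https://stripe.com/customers/all",
        "https://stripe.com/customers/gamma",
        "https://stripe.com/customers/chatbase",
    ],
}
to_scrape = plan_scrapes(sitemap, credit_budget=2)
```

With a budget of 2, `to_scrape` holds the first two URLs; the sitemap call itself costs 1 additional credit.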

Extract every image on a page

GET /web/scrape/images takes a URL and returns a manifest of every image referenced on the page: <img> tags, inline <svg>, CSS background images, <picture> sources, OpenGraph and Twitter card images, favicons. Opt into enrichment to also get measured dimensions, a CDN-hosted copy, and a visual-type classification per image.
curl -G https://api.context.dev/v1/web/scrape/images \
  -H "Authorization: Bearer $CONTEXT_DEV_API_KEY" \
  --data-urlencode "url=https://airbnb.com" \
  --data-urlencode "enrichment[resolution]=true" \
  --data-urlencode "enrichment[classification]=true"
1 credit per call; 5 credits per call if any enrichment flag is set

Request Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url | string (URI) | none | Required. Page URL to inspect. |
| maxAgeMs | integer | 86400000 | Cache TTL (0 forces fresh; max 30 days). |
| enrichment.resolution | boolean | false | Measure width × height in pixels when possible. |
| enrichment.hostedUrl | boolean | false | Host materializable images on Context.dev’s CDN and return their URL + MIME type. |
| enrichment.classification | boolean | false | Classify each image as photography, illustration, logo, wordmark, icon, pattern, graphic, or other. |
| enrichment.maxTimePerMs | integer | 30000 | Per-image enrichment timeout (1–60000 ms). |
| waitForMs | integer | none | Browser wait after initial load (max 30000). |
| headers | object | none | Outbound HTTP headers forwarded to the target URL (e.g. headers[X-Custom]=value). When provided, caching is bypassed. |
| timeoutMS | integer | none | Abort with a 408 if the request exceeds this many milliseconds. Min 1000, max 300000 (5 min). |
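The dotted enrichment.* names in the table correspond to bracketed query keys in the curl example (enrichment[resolution]=true). A possible Python sketch of that mapping follows; `build_images_url` is an illustrative name, and omitting flags that are false relies on the documented default of false.

```python
import urllib.parse

IMAGES_ENDPOINT = "https://api.context.dev/v1/web/scrape/images"


def build_images_url(
    page_url: str,
    resolution: bool = False,
    hosted_url: bool = False,
    classification: bool = False,
) -> str:
    """Build the GET URL, adding one enrichment[...] key per enabled flag."""
    pairs = [("url", page_url)]
    flags = {"resolution": resolution, "hostedUrl": hosted_url, "classification": classification}
    for name, enabled in flags.items():
        if enabled:  # disabled flags are left out, relying on the default of false
            pairs.append((f"enrichment[{name}]", "true"))
    # safe="[]" keeps the brackets literal, as in the curl example.
    return f"{IMAGES_ENDPOINT}?{urllib.parse.urlencode(pairs, safe='[]')}"
```

For instance, `build_images_url("https://airbnb.com", resolution=True, classification=True)` produces the same query as the curl call above.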

Response

{
  "success": true,
  "url": "https://airbnb.com",
  "images": [
    {
      "src": "https://a0.muscache.com/im/pictures/airbnb-platform-assets/AirbnbPlatformAssets-UserProfile/original/5347d650-16de-4f5a-a38e-79edc988befa.png?im_w=720",
      "element": "img",
      "type": "url",
      "alt": null,
      "enrichment": {
        "width": 720,
        "height": 720,
        "type": "illustration"
      }
    },
    {
      "src": "<svg xmlns=\"http://www.w3.org/2000/svg\" viewBox=\"0 0 32 32\" …></svg>",
      "element": "svg",
      "type": "html",
      "alt": null,
      "enrichment": {
        "width": 16,
        "height": 16,
        "type": "icon"
      }
    }
  ]
}
| Field | Type | Description |
| --- | --- | --- |
| success | boolean | true when the scrape completed. |
| url | string | The page URL that was scraped. |
| images[] | array | One entry per image referenced on the page. |
| images[].src | string | For type: "url", the absolute image URL. For type: "html", the raw inline SVG/HTML. |
| images[].element | enum | DOM origin: img, svg, link, source, video, css, object, meta, or background. |
| images[].type | enum | Format of src: url (external image), html (inline markup like SVG), or base64 (data URI). |
| images[].alt | string or null | Alt text where present. |
| images[].enrichment.width | number | Pixel width. Present when enrichment.resolution=true. |
| images[].enrichment.height | number | Pixel height. Present when enrichment.resolution=true. |
| images[].enrichment.mimetype | string | MIME type. Present when hosted via enrichment.hostedUrl=true. |
| images[].enrichment.url | string | Context.dev CDN URL. Present when enrichment.hostedUrl=true. |
| images[].enrichment.type | enum | Visual category. Present when enrichment.classification=true. One of photography, illustration, logo, wordmark, icon, pattern, graphic, other. |
The base manifest is 1 credit. Setting any enrichment flag (resolution, hostedUrl, or classification) bumps the entire call to 5 credits, even if only one image qualifies for enrichment.
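The pricing rule and the per-image classification can be worked through in a few lines. This is a sketch rather than SDK code: `images_call_credits` and `count_by_category` are hypothetical helpers, the sample data is abbreviated from the response above, and the "unclassified" label for images without an enrichment type is an assumption of this example.

```python
from collections import Counter


def images_call_credits(resolution: bool = False, hosted_url: bool = False, classification: bool = False) -> int:
    """1 credit for the base manifest, 5 for the whole call if any enrichment flag is set."""
    return 5 if (resolution or hosted_url or classification) else 1


def count_by_category(images: list[dict]) -> Counter:
    """Tally images by enrichment.type; entries without one count as 'unclassified'."""
    return Counter((image.get("enrichment") or {}).get("type", "unclassified") for image in images)


# Abbreviated from the sample response (src values shortened).
images = [
    {"src": "https://a0.muscache.com/…", "element": "img", "type": "url", "alt": None,
     "enrichment": {"width": 720, "height": 720, "type": "illustration"}},
    {"src": "<svg …></svg>", "element": "svg", "type": "html", "alt": None,
     "enrichment": {"width": 16, "height": 16, "type": "icon"}},
]
categories = count_by_category(images)
```

Because one flag already moves the call to 5 credits, enabling the other enrichment flags in the same request adds no further cost under this rule.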

Use cases

  • Build a RAG pipeline from a docs site by crawling and chunking the returned Markdown.
  • Cut LLM token bills by feeding clean Markdown instead of raw HTML.
  • Seed a vector index without managing scrapers or proxy infrastructure.
  • Monitor competitors’ marketing pages by scraping them on a schedule.

Next steps

  • Prefetch for Faster Response: Hide cold-hit latency from your users.
  • Handle Rate Limits: Backoff strategies, client cache, and prefetch fallbacks.
  • Best Practices: Caching, error handling, and key hygiene.
  • Troubleshooting: Status codes, retry patterns, and common errors.