Scrape Websites

Context.dev’s Web APIs can:

Turn a webpage into clean markdown/HTML
Crawl an entire website and save every page as markdown
Get all webpages under a domain
Extract every image on a webpage

Context.dev’s scrapers automatically switch to a different web proxy when they are blocked by bot protection or geo-restrictions.

Prerequisites

A Context.dev API key. Sign up at context.dev/signup, copy the key from the dashboard (prefix ctxt_secret_), and export it:
export CONTEXT_DEV_API_KEY="ctxt_secret_..."

An SDK (optional). Install for your language, or skip the install and call directly with curl:

npm install context.dev

pip install context.dev

gem install context.dev

go get github.com/context-dot-dev/context-go-sdk

composer require context-dev/context-dev-php

Scrape a single page to Markdown

GET /web/scrape/markdown scrapes any URL into LLM-ready GitHub Flavored Markdown. Bot protection and geo-blocks are handled by automatic proxy escalation; pass useMainContentOnly: true to drop nav, footer, sidebars, and other chrome.

The endpoint transparently handles HTML, XML, JSON, text, Markdown, SVG, PDF, DOCX, DOC, XLSX, XLS, PPTX, PPT, and CSV:

Excel workbooks come back as one GFM table per sheet, with each sheet name as an ## heading.
PowerPoint decks come back as slide-structured markdown: one ## Slide N: Title per slide, followed by body text, tables, and speaker notes.
CSV files come back as a single GFM table, with any comma-less leading line surfaced as text above it.

Unsupported formats (images, media, archives) return a 415.

curl -G https://api.context.dev/v1/web/scrape/markdown \
  -H "Authorization: Bearer $CONTEXT_DEV_API_KEY" \
  --data-urlencode "url=https://example.com" \
  --data-urlencode "useMainContentOnly=true"

import ContextDev from "context.dev";

const client = new ContextDev({ apiKey: process.env.CONTEXT_DEV_API_KEY });

const response = await client.web.webScrapeMd({
  url: "https://example.com",
  useMainContentOnly: true,
});

console.log(response.markdown);

import os
from context.dev import ContextDev

client = ContextDev(api_key=os.environ["CONTEXT_DEV_API_KEY"])

response = client.web.web_scrape_md(
    url="https://example.com",
    use_main_content_only=True,
)

print(response.markdown)

require "context_dev"

client = ContextDev::Client.new(api_key: ENV.fetch("CONTEXT_DEV_API_KEY"))

response = client.web.web_scrape_md(
  url: "https://example.com",
  use_main_content_only: true,
)

puts response.markdown

package main

import (
    "context"
    "fmt"
    "os"

    contextdev "github.com/context-dot-dev/context-go-sdk"
    "github.com/context-dot-dev/context-go-sdk/option"
    "github.com/context-dot-dev/context-go-sdk/packages/param"
)

func main() {
    client := contextdev.NewClient(
        option.WithAPIKey(os.Getenv("CONTEXT_DEV_API_KEY")),
    )

    response, err := client.Web.WebScrapeMd(context.TODO(), contextdev.WebWebScrapeMdParams{
        URL:                "https://example.com",
        UseMainContentOnly: param.NewOpt(true),
    })
    if err != nil {
        panic(err)
    }

    fmt.Println(response.Markdown)
}

<?php

use ContextDev\Client;

$client = new Client(apiKey: getenv('CONTEXT_DEV_API_KEY'));

$response = $client->web->webScrapeMd(
    url: 'https://example.com',
    useMainContentOnly: true,
);

echo $response->markdown, PHP_EOL;

1 credit per call

The connection stays open while the page is fetched and converted, so there’s no need to poll. Repeated calls for the same URL within maxAgeMs return the cached scrape.

Request Parameters

Parameter	Type	Default	Description
`url`	string (URI)	none	Required. Full URL to scrape. Must include `http://` or `https://`.
`includeLinks`	boolean	`true`	Preserve hyperlinks in the Markdown output.
`includeImages`	boolean	`false`	Include image references in the Markdown output.
`shortenBase64Images`	boolean	`true`	Truncate base64-encoded image data so it doesn’t dominate the response.
`useMainContentOnly`	boolean	`false`	Strip headers, footers, sidebars, and navigation, keeping only the main content.
`includeFrames`	boolean	`false`	When true, the contents of iframes are rendered to Markdown.
`includeSelectors`	string[]	none	CSS selectors. When provided, only matching HTML subtrees (and their descendants) are kept before conversion to Markdown. Examples: `article.main`, `#content`, `[role=main]`.
`excludeSelectors`	string[]	none	CSS selectors to remove before conversion to Markdown. Applied after `includeSelectors`; exclusion takes precedence. Examples: `nav`, `footer`, `.ad-banner`.
`pdf`	object	`{ shouldParse: true }`	PDF-page controls: `shouldParse`, `start`, `end` (1-based inclusive range), `ocr`. Set `shouldParse: false` to skip PDFs. Set `ocr: true` to OCR images embedded in the PDF and inline the recognized text alongside the text layer.
`maxAgeMs`	integer	`86400000` (24h)	Return a cached scrape if one exists younger than this. `0` forces a fresh scrape. Max is 30 days.
`waitForMs`	integer	none	Browser wait time after initial load (max 30000). Use when the page needs JS time to populate.
`actions`	array	none	Up to 5 ordered browser actions executed after the page loads and before content is captured. Requires a paid plan; bumps the call to 2 credits and bypasses the cache.
`headers`	object	none	Outbound HTTP headers forwarded to the target URL, sent as deep-object query params (e.g. `headers[X-Custom]=value`). When provided, caching is bypassed entirely.
`timeoutMS`	integer	none	Abort with a 408 if the request exceeds this many milliseconds. Min `1000`, max `300000` (5 min).
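
When calling the endpoint without an SDK, the parameters above end up in the query string. The sketch below shows one way to build such a URL with Python's standard library. The helper name `build_scrape_url` is hypothetical. Booleans are serialized as `true`/`false` as in the curl example, and `headers` uses the `headers[Name]=value` form from the table. Array parameters such as `includeSelectors` are left out because their query-string encoding is not specified here, and the sketch assumes the API accepts percent-encoded brackets in parameter names.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.context.dev/v1/web/scrape/markdown"


def build_scrape_url(url, headers=None, **params):
    """Build a GET URL for /web/scrape/markdown (hypothetical helper).

    Scalar params are passed through by name (e.g. maxAgeMs=0),
    booleans become "true"/"false", headers become headers[Name]=value.
    """
    query = [("url", url)]
    for name, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        query.append((name, str(value)))
    for name, value in (headers or {}).items():
        query.append((f"headers[{name}]", value))
    return f"{BASE_URL}?{urlencode(query)}"


request_url = build_scrape_url(
    "https://example.com",
    useMainContentOnly=True,
    maxAgeMs=0,  # 0 forces a fresh scrape
    headers={"X-Custom": "value"},  # forwarding headers bypasses the cache
)
print(request_url)
```

The resulting URL still needs the `Authorization: Bearer` header from the curl example when it is sent.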

Response

{
  "success": true,
  "url": "https://example.com",
  "markdown": "# Example Domain\n\nThis domain is for use in documentation examples without needing permission. Avoid use in operations.\n\n[Learn more](https://iana.org/domains/example)",
  "contentLength": 174
}

Field	Type	Description
`success`	boolean	`true` when the scrape completed.
`url`	string	The URL that was scraped.
`markdown`	string	The page rendered as GitHub Flavored Markdown. By default the full page is converted; pass `useMainContentOnly: true` to strip nav, footer, sidebars, and other chrome.
`contentLength`	integer	UTF-8 byte length of `markdown`. Use this to budget tokens or detect empty results without re-measuring the string.
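
One possible use of `contentLength` is sketched below: it flags empty results and gives a rough token estimate before the Markdown is sent to a model. The response is treated as a plain dict shaped like the JSON above. The helper name and the 4-bytes-per-token ratio are assumptions for illustration, not values defined by the API.

```python
def summarize_scrape(response, bytes_per_token=4):
    """Return (is_empty, estimated_tokens) for a scrape response dict.

    bytes_per_token is a rough heuristic for English text, not an API value.
    """
    length = response.get("contentLength", 0)
    is_empty = not response.get("success", False) or length == 0
    estimated_tokens = -(-length // bytes_per_token)  # ceiling division
    return is_empty, estimated_tokens


sample = {
    "success": True,
    "url": "https://example.com",
    "markdown": "# Title\n\nBody text.",
    "contentLength": 19,
}
print(summarize_scrape(sample))
```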

Get raw HTML instead

To get the page as raw HTML instead of Markdown, call GET /web/scrape/html:

curl -G https://api.context.dev/v1/web/scrape/html \
  -H "Authorization: Bearer $CONTEXT_DEV_API_KEY" \
  --data-urlencode "url=https://example.com"

import ContextDev from "context.dev";

const client = new ContextDev({ apiKey: process.env.CONTEXT_DEV_API_KEY });

const response = await client.web.webScrapeHTML({ url: "https://example.com" });

console.log(response.html.length);

# Requires beautifulsoup4 and lxml if you copy this parsing example.
import os
from context.dev import ContextDev
from bs4 import BeautifulSoup

client = ContextDev(api_key=os.environ["CONTEXT_DEV_API_KEY"])

response = client.web.web_scrape_html(url="https://example.com")
soup = BeautifulSoup(response.html, "lxml")

print(soup.title.string)

require "context_dev"

client = ContextDev::Client.new(api_key: ENV.fetch("CONTEXT_DEV_API_KEY"))

response = client.web.web_scrape_html(url: "https://example.com")

puts response.html.length

package main

import (
    "context"
    "fmt"
    "os"

    contextdev "github.com/context-dot-dev/context-go-sdk"
    "github.com/context-dot-dev/context-go-sdk/option"
)

func main() {
    client := contextdev.NewClient(
        option.WithAPIKey(os.Getenv("CONTEXT_DEV_API_KEY")),
    )

    response, err := client.Web.WebScrapeHTML(context.TODO(), contextdev.WebWebScrapeHTMLParams{
        URL: "https://example.com",
    })
    if err != nil {
        panic(err)
    }

    fmt.Println(len(response.HTML))
}

<?php

use ContextDev\Client;

$client = new Client(apiKey: getenv('CONTEXT_DEV_API_KEY'));

$response = $client->web->webScrapeHTML(url: 'https://example.com');

echo strlen($response->html), PHP_EOL;

1 credit per call

Request Parameters

Parameter	Type	Default	Description
`url`	string (URI)	none	Required. Full URL to scrape.
`includeFrames`	boolean	`false`	When true, iframes are rendered inline into the returned HTML.
`useMainContentOnly`	boolean	`false`	Return only the page’s main content, excluding headers, footers, sidebars, and navigation when detectable.
`includeSelectors`	string[]	none	CSS selectors. When provided, only matching subtrees (and their descendants) are kept; everything else is dropped.
`excludeSelectors`	string[]	none	CSS selectors to remove from the result. Applied after `includeSelectors`; exclusion takes precedence.
`pdf`	object	`{ shouldParse: true }`	PDF-page controls; same shape as `/web/scrape/markdown`.
`maxAgeMs`	integer	`86400000`	Cache TTL. `0` for fresh. Max 30 days.
`waitForMs`	integer	none	Wait after initial load (max 30000).
`actions`	array	none	Up to 5 ordered browser actions executed after the page loads and before content is captured. Requires a paid plan; bumps the call to 2 credits and bypasses the cache.
`headers`	object	none	Outbound HTTP headers forwarded to the target URL (e.g. `headers[X-Custom]=value`). When provided, caching is bypassed.
`timeoutMS`	integer	none	Abort with a 408 if the request exceeds this many milliseconds. Min `1000`, max `300000` (5 min).

Response

{
  "success": true,
  "url": "https://example.com",
  "html": "<!DOCTYPE html><html lang=\"en\"><head><title>Example Domain</title><meta name=\"viewport\" content=\"width=device-width, initial-scale=1\"><style>body{background:#eee;width:60vw;margin:15vh auto;font-family:system-ui,sans-serif}…</style></head><body>…</body></html>",
  "type": "html"
}

Field	Type	Description
`success`	boolean	`true` when the scrape completed.
`url`	string	The URL that was scraped.
`html`	string	Rendered HTML for normal pages. For sitemaps and feeds behind an XSL stylesheet, the underlying XML. For Excel workbooks, the extracted sheets as HTML `<table>` elements (one per sheet, headed by an `<h2>`). For PowerPoint decks, the extracted slides as HTML (each slide headed by `<h2>Slide N: Title</h2>`, followed by paragraphs, `<table>` elements, and an `<h3>Notes</h3>` block when speaker notes are present).
`type`	enum	Detected content type of `html`. One of `html`, `xml`, `json`, `text`, `csv`, `markdown`, `svg`, `pdf`, `docx`, `doc`, `xlsx`, `xls`, `pptx`, `ppt`.
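
Because the `html` field can carry XML or JSON as well as HTML, a client may want to branch on `type` before parsing. The sketch below is one way to do that with the standard library, treating the response as a plain dict. `parse_scrape_body` is a hypothetical helper, the feed content is made up for the example, and every type other than `json` and `xml` is simply returned as a string.

```python
import json
import xml.etree.ElementTree as ET


def parse_scrape_body(response):
    """Parse the `html` field according to the detected `type`.

    Returns a Python object for JSON, an Element for XML,
    and the raw string for every other type.
    """
    body, kind = response["html"], response["type"]
    if kind == "json":
        return json.loads(body)
    if kind == "xml":
        return ET.fromstring(body)
    return body


feed = {
    "success": True,
    "url": "https://example.com/feed.xml",
    "html": "<rss><channel><title>News</title></channel></rss>",
    "type": "xml",
}
root = parse_scrape_body(feed)
print(root.find("channel/title").text)
```

For XML from untrusted sites, a hardened parser may be a safer choice than `xml.etree.ElementTree`.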

Crawl a whole site

POST /web/crawl takes a seed URL, follows links from it within the maxDepth and maxPages limits, and returns every scraped page as Markdown in a single response. That shape suits seeding a RAG index or building a knowledge base.

curl -X POST https://api.context.dev/v1/web/crawl \
  -H "Authorization: Bearer $CONTEXT_DEV_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://context.dev",
    "maxPages": 3,
    "maxDepth": 1
  }'

import ContextDev from "context.dev";

const client = new ContextDev({ apiKey: process.env.CONTEXT_DEV_API_KEY });

const response = await client.web.webCrawlMd({
  url: "https://context.dev",
  maxPages: 3,
  maxDepth: 1,
});

for (const page of response.results) {
  console.log(`${page.metadata.url}: ${page.markdown.length} chars`);
}

import os
from context.dev import ContextDev

client = ContextDev(api_key=os.environ["CONTEXT_DEV_API_KEY"])

response = client.web.web_crawl_md(
    url="https://context.dev",
    max_pages=3,
    max_depth=1,
)

for page in response.results:
    print(f"{page.metadata.url}: {len(page.markdown)} chars")

require "context_dev"

client = ContextDev::Client.new(api_key: ENV.fetch("CONTEXT_DEV_API_KEY"))

response = client.web.web_crawl_md(
  url: "https://context.dev",
  max_pages: 3,
  max_depth: 1,
)

response.results.each do |page|
  puts "#{page.metadata.url}: #{page.markdown.length} chars"
end

package main

import (
    "context"
    "fmt"
    "os"

    contextdev "github.com/context-dot-dev/context-go-sdk"
    "github.com/context-dot-dev/context-go-sdk/option"
    "github.com/context-dot-dev/context-go-sdk/packages/param"
)

func main() {
    client := contextdev.NewClient(
        option.WithAPIKey(os.Getenv("CONTEXT_DEV_API_KEY")),
    )

    response, err := client.Web.WebCrawlMd(context.TODO(), contextdev.WebWebCrawlMdParams{
        URL:      "https://context.dev",
        MaxPages: param.NewOpt(int64(3)),
        MaxDepth: param.NewOpt(int64(1)),
    })
    if err != nil {
        panic(err)
    }

    for _, page := range response.Results {
        fmt.Printf("%s: %d chars\n", page.Metadata.URL, len(page.Markdown))
    }
}

<?php

use ContextDev\Client;

$client = new Client(apiKey: getenv('CONTEXT_DEV_API_KEY'));

$response = $client->web->webCrawlMd(
    url: 'https://context.dev',
    maxPages: 3,
    maxDepth: 1,
);

foreach ($response->results as $page) {
    echo $page->metadata->url, ': ', strlen($page->markdown), ' chars', PHP_EOL;
}

1 credit per page scraped

Crawls are billed per page, not per call. Scraping a website with 50 pages costs 50 credits. Set maxPages accordingly.
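
The billing rule can be turned into a quick budget check. The sketch below computes the worst-case credit cost of a crawl from `maxPages`, using the default of 100 and the hard cap of 500 listed in the Request Parameters table for this endpoint. Both helper names are hypothetical, and the actual cost can be lower because only scraped pages are billed.

```python
HARD_CAP = 500           # maxPages hard cap from the parameter table
DEFAULT_MAX_PAGES = 100  # maxPages default from the parameter table


def max_crawl_credits(max_pages=DEFAULT_MAX_PAGES):
    """Upper bound on credits for one crawl: one credit per page, capped at 500."""
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    return min(max_pages, HARD_CAP)


def max_pages_for_budget(credit_budget):
    """Largest maxPages value that cannot exceed the given credit budget."""
    return max(0, min(credit_budget, HARD_CAP))


print(max_crawl_credits(50))
print(max_pages_for_budget(30))
```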

Request Parameters

Parameter	Type	Default	Description
`url`	string (URI)	none	Required. Starting URL for the crawl.
`maxPages`	integer	`100`	Maximum pages to crawl. Hard cap: `500`.
`maxDepth`	integer	none	Maximum link depth from the starting URL (`0` = only the seed).
`urlRegex`	string	none	Only URLs matching this regex are followed and scraped. Example: `^https?://[^/]+/blog/`.
`followSubdomains`	boolean	`false`	When true, follow links on subdomains (`docs.example.com` from `example.com`). `www` and apex are always treated as equivalent.
`includeLinks`	boolean	`true`	Preserve hyperlinks in each page’s Markdown.
`includeImages`	boolean	`false`	Include image references in each page’s Markdown.
`shortenBase64Images`	boolean	`true`	Truncate base64 image data.
`useMainContentOnly`	boolean	`false`	Strip nav/footer/sidebars on every page.
`includeFrames`	boolean	`false`	Render iframes on every page.
`includeSelectors`	string[]	none	CSS selectors. When provided, only matching HTML subtrees (and their descendants) are kept before each page is converted to Markdown.
`excludeSelectors`	string[]	none	CSS selectors to remove before each page is converted to Markdown. Applied after `includeSelectors`; exclusion takes precedence.
`pdf`	object	`{ shouldParse: true }`	PDF-page controls. Set `shouldParse: false` to skip PDFs entirely.
`maxAgeMs`	integer	`86400000`	Per-page cache TTL.
`waitForMs`	integer	none	Per-page wait after initial load. Max 30000.
`stopAfterMs`	integer	`80000`	Soft time budget for the entire crawl (10000–110000). The crawler returns what it has so far when exceeded.
`timeoutMS`	integer	none	Hard abort: returns a 408 if the request exceeds this many milliseconds. Min `1000`, max `300000` (5 min).
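
Since every crawled page costs a credit, it can help to dry-run a `urlRegex` pattern locally before starting a crawl. The sketch below checks the example pattern from the table against a few candidate URLs with Python's `re` module. The `docs.context.dev` URL is made up for the example, and the regex dialect used by the API may differ from Python's, so treat this as a sanity check only.

```python
import re

URL_REGEX = r"^https?://[^/]+/blog/"

candidates = [
    "https://context.dev/blog/brand-dev-is-now-context-dev",
    "https://context.dev/pricing",
    "http://docs.context.dev/blog/",  # hypothetical URL for illustration
]

pattern = re.compile(URL_REGEX)
# Keep only the URLs the pattern would let the crawler follow.
matched = [u for u in candidates if pattern.search(u)]
print(matched)
```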

Response

{
  "results": [
    {
      "markdown": "# Context.dev\n\nTurn any domain into structured, AI-ready data…",
      "metadata": {
        "url": "https://context.dev/",
        "title": "Context.dev: Brand & Web APIs for Agents",
        "crawlDepth": 0,
        "statusCode": 200,
        "success": true
      }
    },
    {
      "markdown": "# Brand.dev is now Context.dev\n\nWhy we renamed…",
      "metadata": {
        "url": "https://context.dev/blog/brand-dev-is-now-context-dev",
        "title": "Brand.dev is now Context.dev",
        "crawlDepth": 1,
        "statusCode": 200,
        "success": true
      }
    },
    {
      "markdown": "# Pricing\n\nPay only for successful calls…",
      "metadata": {
        "url": "https://context.dev/pricing",
        "title": "Pricing | Context.dev",
        "crawlDepth": 1,
        "statusCode": 200,
        "success": true
      }
    }
  ],
  "metadata": {
    "numUrls": 3,
    "maxCrawlDepth": 1,
    "numSucceeded": 3,
    "numFailed": 0,
    "numSkipped": 0
  }
}

Field	Type	Description
`results[]`	array	One entry per crawled page.
`results[].markdown`	string	The page body as GitHub Flavored Markdown.
`results[].metadata.url`	string	The URL that was fetched (after redirects).
`results[].metadata.title`	string	The page’s `<title>` tag value.
`results[].metadata.crawlDepth`	number	Link-hops from the seed URL (`0` for the seed itself).
`results[].metadata.statusCode`	number	HTTP status of the underlying fetch.
`results[].metadata.success`	boolean	`false` for pages that failed to render; `markdown` may be empty.
`metadata.numUrls`	number	Total URLs the crawler attempted.
`metadata.maxCrawlDepth`	number	Deepest hop reached during the crawl.
`metadata.numSucceeded`	number	Pages fetched successfully. Matches the credit cost.
`metadata.numFailed`	number	Pages that errored.
`metadata.numSkipped`	number	Pages skipped (e.g. by `urlRegex` or `pdf: { shouldParse: false }`).
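
Before indexing a crawl, it is usually worth dropping failed pages and splitting the rest into chunks. The sketch below does both on a plain dict shaped like the response above. The helper names, the blank-line splitting strategy, and the 1000-character default are assumptions for illustration, not something the API prescribes.

```python
def usable_pages(crawl_response):
    """Yield (url, markdown) for pages that rendered and have content."""
    for page in crawl_response["results"]:
        meta = page["metadata"]
        if meta["success"] and page["markdown"].strip():
            yield meta["url"], page["markdown"]


def chunk_markdown(markdown, max_chars=1000):
    """Split Markdown on blank lines into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole.
    """
    chunks, current = [], ""
    for block in markdown.split("\n\n"):
        candidate = f"{current}\n\n{block}" if current else block
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


crawl = {
    "results": [
        {"markdown": "# A\n\npara one\n\npara two",
         "metadata": {"url": "https://example.com/", "success": True}},
        {"markdown": "",
         "metadata": {"url": "https://example.com/broken", "success": False}},
    ]
}
for url, markdown in usable_pages(crawl):
    print(url, chunk_markdown(markdown, max_chars=14))
```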

Get all URLs of a domain

GET /web/scrape/sitemap reads sitemap.xml from a domain root, follows any nested sitemap indexes, and returns a de-duplicated URL list without rendering any of the pages. Use it for cheap coverage of large sites or to feed a downstream scraper with a curated list.

curl -G https://api.context.dev/v1/web/scrape/sitemap \
  -H "Authorization: Bearer $CONTEXT_DEV_API_KEY" \
  --data-urlencode "domain=stripe.com" \
  --data-urlencode "maxLinks=50" \
  --data-urlencode "urlRegex=/customers/"

import ContextDev from "context.dev";

const client = new ContextDev({ apiKey: process.env.CONTEXT_DEV_API_KEY });

const response = await client.web.webScrapeSitemap({
  domain: "stripe.com",
  maxLinks: 50,
  urlRegex: "/customers/",
});

console.log(`${response.urls.length} URLs across ${response.meta.sitemapsDiscovered} sitemaps`);

import os
from context.dev import ContextDev

client = ContextDev(api_key=os.environ["CONTEXT_DEV_API_KEY"])

response = client.web.web_scrape_sitemap(
    domain="stripe.com",
    max_links=50,
    url_regex="/customers/",
)

print(f"{len(response.urls)} URLs across {response.meta.sitemaps_discovered} sitemaps")

require "context_dev"

client = ContextDev::Client.new(api_key: ENV.fetch("CONTEXT_DEV_API_KEY"))

response = client.web.web_scrape_sitemap(
  domain: "stripe.com",
  max_links: 50,
  url_regex: "/customers/",
)

puts "#{response.urls.length} URLs across #{response.meta.sitemaps_discovered} sitemaps"

package main

import (
    "context"
    "fmt"
    "os"

    contextdev "github.com/context-dot-dev/context-go-sdk"
    "github.com/context-dot-dev/context-go-sdk/option"
    "github.com/context-dot-dev/context-go-sdk/packages/param"
)

func main() {
    client := contextdev.NewClient(
        option.WithAPIKey(os.Getenv("CONTEXT_DEV_API_KEY")),
    )

    response, err := client.Web.WebScrapeSitemap(context.TODO(), contextdev.WebWebScrapeSitemapParams{
        Domain:   "stripe.com",
        MaxLinks: param.NewOpt(int64(50)),
        URLRegex: param.NewOpt("/customers/"),
    })
    if err != nil {
        panic(err)
    }

    fmt.Printf("%d URLs across %d sitemaps\n", len(response.URLs), response.Meta.SitemapsDiscovered)
}

<?php

use ContextDev\Client;

$client = new Client(apiKey: getenv('CONTEXT_DEV_API_KEY'));

$response = $client->web->webScrapeSitemap(
    domain: 'stripe.com',
    maxLinks: 50,
    urlRegex: '/customers/',
);

echo count($response->urls), ' URLs across ', $response->meta->sitemapsDiscovered, ' sitemaps', PHP_EOL;

1 credit per call

Request Parameters

Parameter	Type	Default	Description
`domain`	string	none	Required. Domain to build a sitemap for (e.g. `example.com`). No protocol required; the API validates and normalizes the input.
`maxLinks`	integer	`500`	Maximum URLs to return (effective range 1–500). The response’s `urls[]` array is hard-capped at 500 entries, so values above 500 are clamped.
`urlRegex`	string	none	Filter the discovered URLs by regex pattern.
`headers`	object	none	Outbound HTTP headers forwarded to the target URL (e.g. `headers[X-Custom]=value`). When provided, caching is bypassed.
`timeoutMS`	integer	none	Abort with a 408 if the request exceeds this many milliseconds. Min `1000`, max `300000` (5 min).

Response

{
  "success": true,
  "domain": "stripe.com",
  "urls": [
    "https://stripe.com/customers/all",
    "https://stripe.com/customers/gamma",
    "https://stripe.com/customers/chatbase"
  ],
  "meta": {
    "sitemapsDiscovered": 4,
    "sitemapsFetched": 4,
    "sitemapsSkipped": 0,
    "errors": 0
  }
}

Field	Type	Description
`success`	boolean	`true` when the sitemap crawl completed.
`domain`	string	The normalized domain that was crawled.
`urls[]`	string[]	Discovered page URLs, de-duplicated. Bounded by `maxLinks` and capped at 500 entries per response.
`meta.sitemapsDiscovered`	number	Total sitemap XML files discovered (root + nested indexes).
`meta.sitemapsFetched`	number	Sitemaps actually fetched and parsed.
`meta.sitemapsSkipped`	number	Sitemaps skipped (404s, malformed XML, etc.).
`meta.errors`	number	Errors encountered during crawling.
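
When the URL list feeds a downstream scraper, the `meta` counters give a rough idea of how complete the list is. The sketch below computes a simple coverage ratio and splits the URLs into batches, using a dict shaped like the response above. Both helper names are hypothetical, and treating fetched divided by discovered as "coverage" is an interpretation of the counters rather than a documented metric.

```python
def sitemap_coverage(response):
    """Return the fraction of discovered sitemaps that were fetched (0.0 to 1.0)."""
    meta = response["meta"]
    discovered = meta["sitemapsDiscovered"]
    return meta["sitemapsFetched"] / discovered if discovered else 0.0


def batches(urls, size):
    """Split the URL list into consecutive batches for a downstream scraper."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


response = {
    "success": True,
    "domain": "stripe.com",
    "urls": [
        "https://stripe.com/customers/all",
        "https://stripe.com/customers/gamma",
        "https://stripe.com/customers/chatbase",
    ],
    "meta": {"sitemapsDiscovered": 4, "sitemapsFetched": 4,
             "sitemapsSkipped": 0, "errors": 0},
}
print(sitemap_coverage(response))
print(batches(response["urls"], 2))
```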

Extract every image on a page

GET /web/scrape/images takes a URL and returns a manifest of every image referenced on the page: <img> tags, inline <svg>, CSS background images, <picture> sources, OpenGraph and Twitter card images, favicons. Opt into enrichment to also get measured dimensions, a CDN-hosted copy, and a visual-type classification per image.

curl -G https://api.context.dev/v1/web/scrape/images \
  -H "Authorization: Bearer $CONTEXT_DEV_API_KEY" \
  --data-urlencode "url=https://airbnb.com" \
  --data-urlencode "enrichment[resolution]=true" \
  --data-urlencode "enrichment[classification]=true"

import ContextDev from "context.dev";

const client = new ContextDev({ apiKey: process.env.CONTEXT_DEV_API_KEY });

const response = await client.web.webScrapeImages({
  url: "https://airbnb.com",
  enrichment: { resolution: true, classification: true },
});

const hero = response.images
  .filter((img) => img.enrichment?.width && img.enrichment?.height)
  .sort((a, b) => b.enrichment!.width! * b.enrichment!.height! - a.enrichment!.width! * a.enrichment!.height!)[0];

console.log(hero?.src, hero?.enrichment?.type);

import os
from context.dev import ContextDev

client = ContextDev(api_key=os.environ["CONTEXT_DEV_API_KEY"])

response = client.web.web_scrape_images(
    url="https://airbnb.com",
    enrichment={"resolution": True, "classification": True},
)

sized = [img for img in response.images if img.enrichment and img.enrichment.width and img.enrichment.height]
hero = max(sized, key=lambda i: i.enrichment.width * i.enrichment.height, default=None)
if hero:
    print(hero.src, hero.enrichment.type)

require "context_dev"

client = ContextDev::Client.new(api_key: ENV.fetch("CONTEXT_DEV_API_KEY"))

response = client.web.web_scrape_images(
  url: "https://airbnb.com",
  enrichment: { resolution: true, classification: true },
)

sized = response.images.select { |i| i.enrichment && i.enrichment.width && i.enrichment.height }
hero = sized.max_by { |i| i.enrichment.width * i.enrichment.height }
puts "#{hero.src} #{hero.enrichment.type}" if hero

package main

import (
    "context"
    "fmt"
    "os"

    contextdev "github.com/context-dot-dev/context-go-sdk"
    "github.com/context-dot-dev/context-go-sdk/option"
    "github.com/context-dot-dev/context-go-sdk/packages/param"
)

func main() {
    client := contextdev.NewClient(
        option.WithAPIKey(os.Getenv("CONTEXT_DEV_API_KEY")),
    )

    response, err := client.Web.WebScrapeImages(context.TODO(), contextdev.WebWebScrapeImagesParams{
        URL: "https://airbnb.com",
        Enrichment: contextdev.WebWebScrapeImagesParamsEnrichment{
            Resolution:     param.NewOpt(true),
            Classification: param.NewOpt(true),
        },
    })
    if err != nil {
        panic(err)
    }

    fmt.Printf("found %d images\n", len(response.Images))
}

<?php

use ContextDev\Client;

$client = new Client(apiKey: getenv('CONTEXT_DEV_API_KEY'));

$response = $client->web->webScrapeImages(
    url: 'https://airbnb.com',
    enrichment: ['resolution' => true, 'classification' => true],
);

$sized = array_filter(
    $response->images ?? [],
    fn ($img) => $img->enrichment?->width && $img->enrichment?->height,
);

$sorted = array_values($sized);
// usort() sorts in place and takes its array by reference, so it needs a real variable.
usort($sorted, fn ($a, $b) =>
    ($b->enrichment->width * $b->enrichment->height) <=> ($a->enrichment->width * $a->enrichment->height)
);
$hero = $sorted[0] ?? null;

if ($hero) {
    echo $hero->src, ' ', $hero->enrichment->type, PHP_EOL;
}

1 credit per call; 5 credits per call if enrichment flags are used

Request Parameters

Parameter	Type	Default	Description
`url`	string (URI)	none	Required. Page URL to inspect.
`maxAgeMs`	integer	`86400000`	Cache TTL (0 forces fresh; max 30 days).
`enrichment.resolution`	boolean	`false`	Measure width × height in pixels when possible.
`enrichment.hostedUrl`	boolean	`false`	Host materializable images on Context.dev’s CDN and return their URL + MIME type.
`enrichment.classification`	boolean	`false`	Classify each image as `photography`, `illustration`, `logo`, `wordmark`, `icon`, `pattern`, `graphic`, or `other`.
`enrichment.maxTimePerMs`	integer	`30000`	Per-image enrichment timeout (1–60000 ms).
`waitForMs`	integer	none	Browser wait after initial load (max 30000).
`actions`	array	none	Up to 5 ordered browser actions executed after the page loads and before images are collected. Requires a paid plan; bumps the base call to 2 credits and bypasses the cache. Enriched calls stay at 5 credits.
`headers`	object	none	Outbound HTTP headers forwarded to the target URL (e.g. `headers[X-Custom]=value`). When provided, caching is bypassed.
`timeoutMS`	integer	none	Abort with a 408 if the request exceeds this many milliseconds. Min `1000`, max `300000` (5 min).

Response

{
  "success": true,
  "url": "https://airbnb.com",
  "images": [
    {
      "src": "https://a0.muscache.com/im/pictures/airbnb-platform-assets/AirbnbPlatformAssets-UserProfile/original/5347d650-16de-4f5a-a38e-79edc988befa.png?im_w=720",
      "element": "img",
      "type": "url",
      "alt": null,
      "enrichment": {
        "width": 720,
        "height": 720,
        "type": "illustration"
      }
    },
    {
      "src": "<svg xmlns=\"http://www.w3.org/2000/svg\" viewBox=\"0 0 32 32\" …></svg>",
      "element": "svg",
      "type": "html",
      "alt": null,
      "enrichment": {
        "width": 16,
        "height": 16,
        "type": "icon"
      }
    }
  ]
}

Field	Type	Description
`success`	boolean	`true` when the scrape completed.
`url`	string	The page URL that was scraped.
`images[]`	array	One entry per image referenced on the page.
`images[].src`	string	For `type: "url"`, the absolute image URL. For `type: "html"`, the raw inline SVG/HTML.
`images[].element`	enum	DOM origin: `img`, `svg`, `link`, `source`, `video`, `css`, `object`, `meta`, or `background`.
`images[].type`	enum	Format of `src`: `url` (external image), `html` (inline markup like SVG), or `base64` (data URI).
`images[].alt`	string \| null	Alt text where present.
`images[].enrichment.width`	number	Pixel width. Present when `enrichment.resolution=true`.
`images[].enrichment.height`	number	Pixel height. Present when `enrichment.resolution=true`.
`images[].enrichment.mimetype`	string	MIME type. Present when hosted via `enrichment.hostedUrl=true`.
`images[].enrichment.url`	string	Context.dev CDN URL. Present when `enrichment.hostedUrl=true`.
`images[].enrichment.type`	enum	Visual category. Present when `enrichment.classification=true`. One of `photography`, `illustration`, `logo`, `wordmark`, `icon`, `pattern`, `graphic`, `other`.

The base manifest is 1 credit. Setting any enrichment flag (resolution, hostedUrl, or classification) bumps the entire call to 5 credits, even if only one image qualifies for enrichment.
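
The pricing rules for this endpoint fit in a few lines. The sketch below encodes them as stated above and in the `actions` row of the parameter table: 1 credit for the base manifest, 2 when browser actions are used, and a flat 5 as soon as any enrichment flag is set. The function name and argument names are hypothetical.

```python
def image_call_credits(resolution=False, hosted_url=False,
                       classification=False, has_actions=False):
    """Credits for one /web/scrape/images call under the rules described above."""
    if resolution or hosted_url or classification:
        return 5  # any enrichment flag: flat 5 credits, with or without actions
    return 2 if has_actions else 1


print(image_call_credits())
print(image_call_credits(classification=True))
print(image_call_credits(has_actions=True))
```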

Run browser actions before scraping

Some pages only reveal their real content after a user interacts — dismissing a cookie banner, clicking “Load more”, switching a tab, or waiting for an animation. Pass an actions array to /web/scrape/markdown, /web/scrape/html, or /web/scrape/images to run up to 5 ordered browser actions after the page loads and before content is captured.

Actions require a paid plan. Free-tier keys receive a 403 PAID_PLAN_REQUIRED. When any action is provided, the request costs 2 credits (5 for enriched images) and both the scrape cache and any HTTP/PDF fast paths are bypassed.

Each action is a JSON object discriminated by do:

`do`	Fields	Purpose
`wait`	`timeMs` (integer, `0`–`30000`)	Pause for a fixed number of milliseconds before continuing to the next action.
`perform`	`action` (string, 1–500 chars)	Resolve and perform a natural-language browser instruction, such as `click the "Accept all" button`.

curl -G https://api.context.dev/v1/web/scrape/markdown \
  -H "Authorization: Bearer $CONTEXT_DEV_API_KEY" \
  --data-urlencode "url=https://news.ycombinator.com" \
  --data-urlencode 'actions=[{"do":"perform","action":"click the More link at the bottom"},{"do":"wait","timeMs":1500}]'

import ContextDev from "context.dev";

const client = new ContextDev({ apiKey: process.env.CONTEXT_DEV_API_KEY });

const response = await client.web.webScrapeMd({
  url: "https://news.ycombinator.com",
  actions: [
    { do: "perform", action: 'click the More link at the bottom' },
    { do: "wait", timeMs: 1500 },
  ],
});

console.log(response.markdown);

import os
from context.dev import ContextDev

client = ContextDev(api_key=os.environ["CONTEXT_DEV_API_KEY"])

response = client.web.web_scrape_md(
    url="https://news.ycombinator.com",
    actions=[
        {"do": "perform", "action": "click the More link at the bottom"},
        {"do": "wait", "time_ms": 1500},
    ],
)

print(response.markdown)

Actions run in array order. A common recipe is one perform to trigger the interaction followed by a short wait so the DOM can settle before capture.
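
Since calls with actions cost extra and skip the cache, it may be worth validating the array client-side before sending it. The sketch below checks the limits listed in the table above (at most 5 actions, `timeMs` between 0 and 30000, `action` between 1 and 500 characters) on the raw JSON shape. `validate_actions` is a hypothetical helper, and the API may enforce further rules not covered here.

```python
def validate_actions(actions):
    """Return a list of problems found in an actions array (empty if valid)."""
    problems = []
    if len(actions) > 5:
        problems.append("at most 5 actions are allowed")
    for i, action in enumerate(actions):
        kind = action.get("do")
        if kind == "wait":
            time_ms = action.get("timeMs")
            # bool is a subclass of int in Python, so reject it explicitly.
            if (not isinstance(time_ms, int) or isinstance(time_ms, bool)
                    or not 0 <= time_ms <= 30000):
                problems.append(f"action {i}: timeMs must be an integer in 0-30000")
        elif kind == "perform":
            text = action.get("action")
            if not isinstance(text, str) or not 1 <= len(text) <= 500:
                problems.append(f"action {i}: action must be a 1-500 character string")
        else:
            problems.append(f"action {i}: unknown do value {kind!r}")
    return problems


recipe = [
    {"do": "perform", "action": "click the More link at the bottom"},
    {"do": "wait", "timeMs": 1500},
]
print(validate_actions(recipe))
```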

Use cases

Build a RAG pipeline from a docs site by crawling and chunking the returned Markdown.
Cut LLM token bills by feeding clean Markdown instead of raw HTML.
Seed a vector index without managing scrapers or proxy infrastructure.
Monitor competitors’ marketing pages by scraping them on a schedule.

Next steps

Prefetch for Faster Response

Hide cold-hit latency from your users.

Handle Rate Limits

Backoff strategies, client cache, and prefetch fallbacks.

Best Practices

Caching, error handling, and key hygiene.

Troubleshooting

Status codes, retry patterns, and common errors.
