text-embedding-3-small, and store and search in Pinecone.
This RAG pipeline has four steps:
- Crawl the site into Markdown.
- Chunk each page into focused passages.
- Embed and store the chunks in a vector index.
- Retrieve the best chunks at query time.
Architecture
The pipeline has two phases:- Ingestion runs once during setup, then on a schedule: crawl the site, split each page into chunks, embed the chunks, and upsert them into the vector index.
- Retrieval runs at query time: embed the user’s question, pull the most similar chunks, and hand them to the LLM as context.
Prerequisites
- A Context.dev API key.
- A Pinecone account with a serverless index of dimension
1536(see Step 3). - An OpenAI API key for embeddings.
- The root URL of the site you want to ingest.
Step 1: Crawl the site into Markdown
maxPages (default 100, max 500).
Context.dev automatically strips off everything except the main content and converts it into clean markdown you can use in the next step.
Each successful page crawl costs 1 credit. Check the response’s metadata.numSucceeded to find exactly how much this call cost you.
To scope a large site, pass a urlRegex like ^https?://[^/]+/docs/ so only matching paths are followed.
Preview the URL set (and your bill) first
Preview the URL set (and your bill) first
To get an upper-bound estimate of the API cost of a website’s crawl, pull its sitemap using the Sitemap API. This call is 1 credit regardless of size and returns a
urls array; urls.length is useful for planning, but the actual crawl can differ because it follows discovered links and respects maxPages, maxDepth, and urlRegex.Need images?
Need images?
By default the crawl drops images. Set For structured image data (dimensions, a CDN-hosted copy with its MIME type, and an image-type classification), call the dedicated image-scraping endpoint on a single page instead.
includeImages to keep image references inline in the Markdown:Step 2: Chunk the Markdown
Embedding models have a token ceiling per input (OpenAI’stext-embedding-3-small accepts about 8K tokens), and retrieval works better on focused passages than on whole pages.
Split each crawled page into chunks of roughly 500–1500 tokens. Markdown makes this easy: split on heading boundaries first, then sub-split anything still over budget.
metadata.url and markdown. The split keeps each heading attached to the section it introduces and tags the chunk with its url and heading for later citations.
Sections still over budget need a second split, where a sliding window over paragraphs works; because each chunk keeps its own heading line, it stays self-contained when shown to the model.
Step 3: Embed and store
You can choose any embedding model provider (OpenAI, Voyage, Cohere, a local model) and vector database (Pinecone, pgvector, Qdrant, Weaviate). The next steps remain similar: embed each chunk, attachurl and heading as metadata, and upsert by a deterministic ID.
We will use OpenAI’s text-embedding-3-small for embeddings and Pinecone serverless for storage.
text-embedding-3-small returns 1536-dimensional vectors, so create the Pinecone index with dimension 1536; it is cheap and accurate enough for documentation, and you only need text-embedding-3-large (3072 dimensions, a separate index) if an evaluation says so.
The deterministic SHA-1 of url + heading makes a re-run upsert in place instead of duplicating, and the text you keep in metadata is what you hand back to the model at query time.
Step 4: Retrieve at query time
At runtime, your RAG system needs to:- Embed the user’s question with the same model used for indexing
- Pull the top-K most similar chunks
- Hand them to the LLM as context.
topK: 6 is a sensible default for documentation Q&A; raise it if answers miss context, lower it if noise creeps in.
The retrieved text is already Markdown and carries its url, so the model sees real structure and can return linkable citations if you add Cite the URL of each section you used to the system prompt.
Related resources
Scrape Websites to Clean Markdown
The full Web Scraping API guide: Markdown, HTML, sitemaps, and image
extraction with automatic proxy switching.
Extract structured website data
Send a URL plus a JSON Schema when you need typed answers at inference time, not ingest time.
Web Crawl API
Full request and response schema for the crawl endpoint, including
maxPages, urlRegex, and per-page metadata.Web Sitemap API
Sitemap discovery for previewing a site’s URL set before a crawl.