Retrieval-augmented generation (RAG) answers a question by pulling the most relevant passages from your own content and handing them to an LLM as context. The LLM’s answer is only as good as what retrieval returns, which is only as good as the content in the index.
Building that index usually means building an ingestion pipeline: scrape HTML from the web, strip it down to the visible text, then process it, all while evading geo-restrictions and bot detection and handling messy client-rendered content. You can avoid this overhead with Context.dev’s Web Scraping API. You give it one starting URL and it returns the reachable pages as clean, LLM-ready Markdown, with no browser pool, no extractor to maintain, and no HTML to untangle.
In this guide we will build a complete RAG implementation: crawl the website with Context.dev, embed with OpenAI’s text-embedding-3-small, and store and search in Pinecone. The pipeline has four steps:
  1. Crawl the site into Markdown.
  2. Chunk each page into focused passages.
  3. Embed and store the chunks in a vector index.
  4. Retrieve the best chunks at query time.

Architecture

The pipeline has two phases:
  • Ingestion runs once during setup, then on a schedule: crawl the site, split each page into chunks, embed the chunks, and upsert them into the vector index.
  • Retrieval runs at query time: embed the user’s question, pull the most similar chunks, and hand them to the LLM as context.

Prerequisites

  1. A Context.dev API key.
  2. A Pinecone account with a serverless index of dimension 1536 (see Step 3).
  3. An OpenAI API key for embeddings.
  4. The root URL of the site you want to ingest.
Install the three clients for your language:
npm install context.dev openai @pinecone-database/pinecone

Step 1: Crawl the site into Markdown

import ContextDev from "context.dev";

const client = new ContextDev({ apiKey: process.env.CONTEXT_DEV_API_KEY! });

const { results, metadata } = await client.web.webCrawlMd({
  url: "https://docs.example.com",
  maxPages: 200
});

console.log(`Crawled ${metadata.numSucceeded}/${metadata.numUrls} pages`);
With this call, Context.dev starts at the seed URL and follows same-domain links until it has covered every reachable page or hit maxPages (default 100, max 500). It strips everything except the main content of each page and converts it into clean Markdown you can use in the next step. Each successfully crawled page costs 1 credit, so the response’s metadata.numSucceeded tells you exactly how many credits the call used. To scope a large site, pass a urlRegex like ^https?://[^/]+/docs/ so only matching paths are followed.
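Before spending credits, it can help to check what a scoping pattern accepts. The sketch below runs the same pattern locally with JavaScript's RegExp; the sample URLs are hypothetical, and it assumes Context.dev matches the pattern against the full URL with standard regex semantics, which is worth confirming in the Web Crawl API reference.

```typescript
// Same pattern as the urlRegex string above, written as a RegExp literal
// (forward slashes must be escaped inside a literal).
const docsOnly = /^https?:\/\/[^/]+\/docs\//;

// Hypothetical sample URLs, used only to show what the pattern accepts.
const sampleUrls: string[] = [
  "https://docs.example.com/docs/getting-started",
  "https://docs.example.com/blog/launch",
  "http://docs.example.com/docs/api/auth",
  "https://docs.example.com/docs", // no trailing slash, so it does not match
];

// Keeps only the first and third URLs.
const inScope = sampleUrls.filter((u) => docsOnly.test(u));
```

Note that the pattern requires a trailing slash after docs, so a bare /docs landing page falls outside the scope unless you loosen the pattern.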
To get a planning estimate of what a crawl will cost, pull the site’s sitemap using the Sitemap API. This call costs 1 credit regardless of the site’s size and returns a urls array. urls.length is a useful guide, but the actual crawl can differ because it follows discovered links and respects maxPages, maxDepth, and urlRegex.
const { urls } = await client.web.webScrapeSitemap({ domain: "docs.example.com" });
console.log(`${urls.length} URLs, about ${urls.length} crawl credits`);
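The raw sitemap count ignores the limits you plan to pass to the crawl. As a rough sketch, the estimate can be tightened by applying the same urlRegex and maxPages locally. estimateCrawlCredits is a hypothetical helper, not part of the Context.dev client, and the result is still only an estimate because the crawl follows links the sitemap may not list.

```typescript
// Hypothetical helper: narrows a sitemap URL list to what a scoped crawl
// would plausibly fetch, at 1 credit per successfully crawled page.
function estimateCrawlCredits(
  urls: string[],
  maxPages: number,
  urlRegex?: string
): number {
  // Apply the same scoping pattern you intend to pass to the crawl, if any.
  const pattern = urlRegex ? new RegExp(urlRegex) : null;
  const inScope = pattern ? urls.filter((u) => pattern.test(u)) : urls;
  // The crawl stops at maxPages, so the estimate never exceeds it.
  return Math.min(inScope.length, maxPages);
}
```

For example, estimateCrawlCredits(urls, 200, "^https?://[^/]+/docs/") counts only the sitemap entries under /docs/ and caps the result at 200.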
By default the crawl drops images. Set includeImages to keep image references inline in the Markdown:
const { results } = await client.web.webCrawlMd({
  url: "https://docs.example.com",
  maxPages: 200,
  includeImages: true,
});
For structured image data (dimensions, a CDN-hosted copy with its MIME type, and an image-type classification), call the dedicated image-scraping endpoint on a single page instead.

Step 2: Chunk the Markdown

Embedding models have a token ceiling per input (OpenAI’s text-embedding-3-small accepts about 8K tokens), and retrieval works better on focused passages than on whole pages. Split each crawled page into chunks of roughly 500–1500 tokens. Markdown makes this easy: split on heading boundaries first, then sub-split anything still over budget.
type Chunk = { url: string; heading: string; text: string };

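function chunkByHeading(url: string, markdown: string): Chunk[] {
  return markdown
    .split(/\n(?=#{1,3} )/) // split before each H1-H3 so a heading stays with its section
    .map((section) => section.trim())
    .filter((section) => section.length > 0) // skip empty sections, e.g. a blank page
    .map((section) => ({
      url,
      // First line without its leading #'s; for text before the first heading,
      // this is simply the first line of text.
      heading: section.split("\n")[0].replace(/^#+\s*/, ""),
      text: section
    }));
}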
Run this over every crawled page, passing each record’s metadata.url and markdown. The split keeps each heading attached to the section it introduces and tags the chunk with its url and heading for later citations. Sections that are still over budget need a second split, and a sliding window over paragraphs works well for that. If every sub-chunk repeats the section’s heading line, it stays self-contained when shown to the model.
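One way to do the second split is sketched below. It is a minimal illustration, not the guide's reference implementation: splitOversized and approxTokens are hypothetical names, the 4-characters-per-token estimate is a rough assumption rather than a real tokenizer, and the default budget and overlap are arbitrary starting points.

```typescript
type Chunk = { url: string; heading: string; text: string };

// Rough token estimate: ~4 characters per token is an assumption, not a tokenizer.
const approxTokens = (s: string): number => Math.ceil(s.length / 4);

// Splits an over-budget chunk into paragraph windows. Each window repeats the
// chunk's first line (its heading) and carries `overlap` trailing paragraphs
// from the previous window for continuity.
function splitOversized(chunk: Chunk, maxTokens = 1500, overlap = 1): Chunk[] {
  if (approxTokens(chunk.text) <= maxTokens) return [chunk];

  const [headingLine, ...rest] = chunk.text.split("\n");
  const paragraphs = rest
    .join("\n")
    .split(/\n{2,}/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const pieces: Chunk[] = [];
  let current: string[] = [];
  let fresh = 0; // paragraphs in `current` that no earlier window has emitted

  for (const p of paragraphs) {
    const wouldBe = [headingLine, ...current, p].join("\n\n");
    if (fresh > 0 && approxTokens(wouldBe) > maxTokens) {
      pieces.push({ ...chunk, text: [headingLine, ...current].join("\n\n") });
      current = overlap > 0 ? current.slice(-overlap) : [];
      fresh = 0;
    }
    // A single paragraph larger than the budget is kept whole, not cut mid-text.
    current.push(p);
    fresh++;
  }
  if (fresh > 0) {
    pieces.push({ ...chunk, text: [headingLine, ...current].join("\n\n") });
  }
  return pieces;
}
```

A typical use is pages.flatMap over chunkByHeading followed by flatMap(splitOversized). All windows from one section share the same url and heading, which matters when you derive IDs from those two fields.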

Step 3: Embed and store

You can choose any embedding model provider (OpenAI, Voyage, Cohere, a local model) and vector database (Pinecone, pgvector, Qdrant, Weaviate). The next steps remain similar: embed each chunk, attach url and heading as metadata, and upsert by a deterministic ID. We will use OpenAI’s text-embedding-3-small for embeddings and Pinecone serverless for storage.
import OpenAI from "openai";
import { Pinecone } from "@pinecone-database/pinecone";
import { createHash } from "crypto";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pinecone.index("docs");

const idFor = (c: Chunk) => createHash("sha1").update(c.url + c.heading).digest("hex");

async function indexChunks(chunks: Chunk[]) {
  for (let i = 0; i < chunks.length; i += 100) {
    const batch = chunks.slice(i, i + 100);
    const { data } = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: batch.map((c) => c.text)
    });
    await index.upsert(
      batch.map((chunk, j) => ({
        id: idFor(chunk),
        values: data[j].embedding,
        metadata: { url: chunk.url, heading: chunk.heading, text: chunk.text }
      }))
    );
  }
}
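Chunks are embedded in batches of 100 to stay well within OpenAI’s per-request input limits. text-embedding-3-small returns 1536-dimensional vectors, so create the Pinecone index with dimension 1536. The model is cheap and accurate enough for documentation; you only need text-embedding-3-large (3072 dimensions, which requires a separate index) if an evaluation says so. The deterministic SHA-1 of url + heading makes a re-run upsert in place instead of duplicating, and the text you keep in metadata is what you hand back to the model at query time. Two chunks that share a url and heading (a repeated heading, or a section you sub-split) hash to the same ID and overwrite each other, so in that case add a distinguishing value such as a position counter to the hash input.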
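A minimal sketch of collision-free ID inputs follows. chunkKeys is a hypothetical helper that numbers chunks sharing a url and heading in document order; you would feed each key to the same SHA-1 call that idFor uses. It assumes chunk order is stable between runs, so a page whose sections are reordered will get some new IDs.

```typescript
type Chunk = { url: string; heading: string; text: string };

// Hypothetical helper: returns one key per chunk, in input order. Chunks that
// share url + heading get a running counter so their keys stay distinct.
function chunkKeys(chunks: Chunk[]): string[] {
  const seen = new Map<string, number>();
  return chunks.map((chunk) => {
    const base = `${chunk.url}\n${chunk.heading}`;
    const n = seen.get(base) ?? 0; // how many earlier chunks share this base
    seen.set(base, n + 1);
    return `${base}\n${n}`;
  });
}
```

Because the counter depends only on the input order, re-running ingestion over unchanged pages produces the same keys and therefore the same IDs, which keeps upserts idempotent.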

Step 4: Retrieve at query time

At runtime, your RAG system needs to:
  1. Embed the user’s question with the same model used for indexing.
  2. Pull the top-K most similar chunks.
  3. Hand them to the LLM as context.
async function answer(question: string) {
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: [question]
  });

  const { matches } = await index.query({
    vector: data[0].embedding,
    topK: 6,
    includeMetadata: true
  });

  const context = matches
    .map((m) => `## ${m.metadata!.heading} (${m.metadata!.url})\n\n${m.metadata!.text}`)
    .join("\n\n---\n\n");

  return askLLM({ question, context });
}
topK: 6 is a sensible default for documentation Q&A; raise it if answers miss context, lower it if noise creeps in. The retrieved text is already Markdown and carries its url, so the model sees real structure and can return linkable citations if you add “Cite the URL of each section you used” to the system prompt. askLLM is a placeholder for your own chat-completion call.
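The guide leaves askLLM undefined. As a sketch of the prompt-building half only, the hypothetical buildMessages helper below turns the question and retrieved context into a system and user message. The message shape and the wording of the instructions are assumptions; sending the messages is left to whichever chat client you use.

```typescript
// Assumed message shape: a role plus text content, as most chat APIs accept.
type ChatMessage = { role: "system" | "user"; content: string };

// Hypothetical helper: builds the messages that an askLLM implementation
// would send. It does not call any API itself.
function buildMessages(question: string, context: string): ChatMessage[] {
  const system = [
    "Answer using only the documentation excerpts provided.",
    "Cite the URL of each section you used.",
    "If the excerpts do not contain the answer, say so.",
  ].join(" ");

  // Context comes first so the question is the last thing the model reads.
  const user = `Documentation excerpts:\n\n${context}\n\nQuestion: ${question}`;

  return [
    { role: "system", content: system },
    { role: "user", content: user },
  ];
}
```

Keeping prompt construction in a pure function like this makes it easy to inspect exactly what the model receives when an answer looks wrong.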

Scrape Websites to Clean Markdown

The full Web Scraping API guide: Markdown, HTML, sitemaps, and image extraction with automatic proxy switching.

Extract structured website data

Send a URL plus a JSON Schema when you need typed answers at inference time, not ingest time.

Web Crawl API

Full request and response schema for the crawl endpoint, including maxPages, urlRegex, and per-page metadata.

Web Sitemap API

Sitemap discovery for previewing a site’s URL set before a crawl.