Build an Agentic RAG System with Web Scraping

Retrieval-augmented generation (RAG) is a system that answers a question by pulling the most relevant passages from your own content and handing them to an LLM as context. The agent’s response is only as good as the RAG output, which is only as good as the content in its index. Building this index usually means building an ingestion pipeline: scrape HTML from the internet, clean it up to get only the visible text, then process it. All this while evading geo-restrictions and bot detection, and handling messy client-rendered content. This is an unnecessary overhead you can avoid if you use Context.dev’s Web Scraping API. You give it one starting URL and it returns reachable pages as clean, LLM-ready Markdown, with no browser pool, no extractor to maintain, and no HTML to detangle. In this guide we will build a complete RAG implementation: crawl the website with Context.dev, embed with OpenAI’s text-embedding-3-small, and store and search in Pinecone. This RAG pipeline has four steps:

Crawl the site into Markdown.
Chunk each page into focused passages.
Embed and store the chunks in a vector index.
Retrieve the best chunks at query time.

Architecture

The pipeline has two phases:

Ingestion runs once during setup, then on a schedule: crawl the site, split each page into chunks, embed the chunks, and upsert them into the vector index.
Retrieval runs at query time: embed the user’s question, pull the most similar chunks, and hand them to the LLM as context.

Prerequisites

A Context.dev API key.
A Pinecone account with a serverless index of dimension 1536 (see Step 3).
An OpenAI API key for embeddings.
The root URL of the site you want to ingest.

Install the three clients for your language:

npm install context.dev openai @pinecone-database/pinecone

pip install context.dev openai pinecone

gem install context.dev ruby-openai pinecone

go get github.com/context-dot-dev/context-go-sdk github.com/openai/openai-go github.com/pinecone-io/go-pinecone/v3

composer require context-dev/context-dev-php openai-php/client symfony/http-client nyholm/psr7

Step 1: Crawl the site into Markdown

import ContextDev from "context.dev";

const client = new ContextDev({ apiKey: process.env.CONTEXT_DEV_API_KEY! });

const { results, metadata } = await client.web.webCrawlMd({
  url: "https://docs.example.com",
  maxPages: 200
});

console.log(`Crawled ${metadata.numSucceeded}/${metadata.numUrls} pages`);

import os
from context.dev import ContextDev

client = ContextDev(api_key=os.environ["CONTEXT_DEV_API_KEY"])

response = client.web.web_crawl_md(url="https://docs.example.com", max_pages=200)

print(f"Crawled {response.metadata.num_succeeded}/{response.metadata.num_urls} pages")

require "context_dev"

client = ContextDev::Client.new(api_key: ENV.fetch("CONTEXT_DEV_API_KEY"))

response = client.web.web_crawl_md(url: "https://docs.example.com", max_pages: 200)

puts "Crawled #{response.metadata.num_succeeded}/#{response.metadata.num_urls} pages"

package main

import (
	"context"
	"fmt"
	"os"

	contextdev "github.com/context-dot-dev/context-go-sdk"
	"github.com/context-dot-dev/context-go-sdk/option"
	"github.com/context-dot-dev/context-go-sdk/packages/param"
)

func main() {
	client := contextdev.NewClient(option.WithAPIKey(os.Getenv("CONTEXT_DEV_API_KEY")))

	response, err := client.Web.WebCrawlMd(context.TODO(), contextdev.WebWebCrawlMdParams{
		URL:      "https://docs.example.com",
		MaxPages: param.NewOpt(int64(200)),
	})
	if err != nil {
		panic(err.Error())
	}

	fmt.Printf("Crawled %d/%d pages\n", response.Metadata.NumSucceeded, response.Metadata.NumUrls)
}

<?php

use ContextDev\Client;

$client = new Client(apiKey: getenv('CONTEXT_DEV_API_KEY'));

$response = $client->web->webCrawlMd(url: 'https://docs.example.com', maxPages: 200);

printf(
    "Crawled %d/%d pages\n",
    $response->metadata->numSucceeded,
    $response->metadata->numUrls,
);

This API call makes Context.dev start at the seed URL and follow same-domain links until it has covered reachable pages or hit maxPages (default 100, max 500). Context.dev automatically strips off everything except the main content and converts it into clean markdown you can use in the next step. Each successful page crawl costs 1 credit. Check the response’s metadata.numSucceeded to find exactly how much this call cost you. To scope a large site, pass a urlRegex like ^https?://[^/]+/docs/ so only matching paths are followed.

Preview the URL set (and your bill) first

To get an upper-bound estimate of the API cost of a website’s crawl, pull its sitemap using the Sitemap API. This call is 1 credit regardless of size and returns a urls array; urls.length is useful for planning, but the actual crawl can differ because it follows discovered links and respects maxPages, maxDepth, and urlRegex.

const { urls } = await client.web.webScrapeSitemap({ domain: "docs.example.com" });
console.log(`${urls.length} URLs, about ${urls.length} crawl credits`);

response = client.web.web_scrape_sitemap(domain="docs.example.com")
print(f"{len(response.urls)} URLs, about {len(response.urls)} crawl credits")

response = client.web.web_scrape_sitemap(domain: "docs.example.com")
puts "#{response.urls.length} URLs, about #{response.urls.length} crawl credits"

response, err := client.Web.WebScrapeSitemap(context.TODO(), contextdev.WebWebScrapeSitemapParams{
    Domain: "docs.example.com",
})
if err != nil {
    panic(err.Error())
}
fmt.Printf("%d URLs, about %d crawl credits\n", len(response.URLs), len(response.URLs))

$response = $client->web->webScrapeSitemap(domain: 'docs.example.com');
$count = count($response->urls ?? []);
echo "{$count} URLs, about {$count} crawl credits\n";

Need images?

By default the crawl drops images. Set includeImages to keep image references inline in the Markdown:

const { results } = await client.web.webCrawlMd({
  url: "https://docs.example.com",
  maxPages: 200,
  includeImages: true,
});

response = client.web.web_crawl_md(
    url="https://docs.example.com",
    max_pages=200,
    include_images=True,
)

response = client.web.web_crawl_md(
  url: "https://docs.example.com",
  max_pages: 200,
  include_images: true,
)

response, err := client.Web.WebCrawlMd(context.TODO(), contextdev.WebWebCrawlMdParams{
    URL:           "https://docs.example.com",
    MaxPages:      param.NewOpt(int64(200)),
    IncludeImages: param.NewOpt(true),
})

$response = $client->web->webCrawlMd(
    url: 'https://docs.example.com',
    maxPages: 200,
    includeImages: true,
);

For structured image data (dimensions, a CDN-hosted copy with its MIME type, and an image-type classification), call the dedicated image-scraping endpoint on a single page instead.

Step 2: Chunk the Markdown

Embedding models have a token ceiling per input (OpenAI’s text-embedding-3-small accepts about 8K tokens), and retrieval works better on focused passages than on whole pages. Split each crawled page into chunks of roughly 500–1500 tokens. Markdown makes this easy: split on heading boundaries first, then sub-split anything still over budget.

type Chunk = { url: string; heading: string; text: string };

function chunkByHeading(url: string, markdown: string): Chunk[] {
  return markdown.split(/\n(?=#{1,3} )/).map((section) => ({
    url,
    heading: section.split("\n")[0].replace(/^#+\s*/, ""),
    text: section.trim()
  }));
}

import re

def chunk_by_heading(url: str, markdown: str) -> list[dict]:
    sections = re.split(r"\n(?=#{1,3} )", markdown)
    return [
        {
            "url": url,
            "heading": re.sub(r"^#+\s*", "", section.split("\n", 1)[0]),
            "text": section.strip(),
        }
        for section in sections
    ]

def chunk_by_heading(url, markdown)
  markdown.split(/\n(?=\#{1,3} )/).map do |section|
    heading = section.lines.first.to_s.sub(/^#+\s*/, "").strip
    { url: url, heading: heading, text: section.strip }
  end
end

package main

import (
	"regexp"
	"strings"
)

type Chunk struct {
	URL     string
	Heading string
	Text    string
}

// Go's RE2 engine has no lookahead, so mark heading boundaries, then split.
var headingBoundary = regexp.MustCompile(`\n(#{1,3} )`)

func chunkByHeading(url, markdown string) []Chunk {
	marked := headingBoundary.ReplaceAllString(markdown, "\x00$1")
	var chunks []Chunk
	for _, section := range strings.Split(marked, "\x00") {
		section = strings.TrimSpace(section)
		if section == "" {
			continue
		}
		heading := strings.TrimLeft(strings.SplitN(section, "\n", 2)[0], "# ")
		chunks = append(chunks, Chunk{URL: url, Heading: heading, Text: section})
	}
	return chunks
}

<?php

function chunkByHeading(string $url, string $markdown): array
{
    $sections = preg_split('/\n(?=#{1,3} )/', $markdown);
    $chunks = [];

    foreach ($sections as $section) {
        $section = trim($section);
        if ($section === '') {
            continue;
        }
        $heading = preg_replace('/^#+\s*/', '', explode("\n", $section)[0]);
        $chunks[] = [
            'url' => $url,
            'heading' => $heading,
            'text' => $section,
        ];
    }

    return $chunks;
}

Run this over every crawled page, passing each record’s metadata.url and markdown. The split keeps each heading attached to the section it introduces and tags the chunk with its url and heading for later citations. Sections still over budget need a second split, where a sliding window over paragraphs works; because each chunk keeps its own heading line, it stays self-contained when shown to the model.

Step 3: Embed and store

You can choose any embedding model provider (OpenAI, Voyage, Cohere, a local model) and vector database (Pinecone, pgvector, Qdrant, Weaviate). The next steps remain similar: embed each chunk, attach url and heading as metadata, and upsert by a deterministic ID. We will use OpenAI’s text-embedding-3-small for embeddings and Pinecone serverless for storage.

import OpenAI from "openai";
import { Pinecone } from "@pinecone-database/pinecone";
import { createHash } from "crypto";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pinecone.index("docs");

const idFor = (c: Chunk) => createHash("sha1").update(c.url + c.heading).digest("hex");

async function indexChunks(chunks: Chunk[]) {
  for (let i = 0; i < chunks.length; i += 100) {
    const batch = chunks.slice(i, i + 100);
    const { data } = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: batch.map((c) => c.text)
    });
    await index.upsert(
      batch.map((chunk, j) => ({
        id: idFor(chunk),
        values: data[j].embedding,
        metadata: { url: chunk.url, heading: chunk.heading, text: chunk.text }
      }))
    );
  }
}

import hashlib
from openai import OpenAI
from pinecone import Pinecone

openai = OpenAI()              # reads OPENAI_API_KEY
pc = Pinecone()                # reads PINECONE_API_KEY
index = pc.Index("docs")

def id_for(chunk: dict) -> str:
    return hashlib.sha1((chunk["url"] + chunk["heading"]).encode()).hexdigest()

def index_chunks(chunks: list[dict]) -> None:
    for i in range(0, len(chunks), 100):
        batch = chunks[i:i + 100]
        resp = openai.embeddings.create(
            model="text-embedding-3-small",
            input=[c["text"] for c in batch],
        )
        index.upsert(vectors=[
            {
                "id": id_for(chunk),
                "values": resp.data[j].embedding,
                "metadata": {"url": chunk["url"], "heading": chunk["heading"], "text": chunk["text"]},
            }
            for j, chunk in enumerate(batch)
        ])

require "openai"
require "pinecone"
require "digest"

openai = OpenAI::Client.new(access_token: ENV.fetch("OPENAI_API_KEY"))
Pinecone.configure { |c| c.api_key = ENV.fetch("PINECONE_API_KEY") }
index = Pinecone::Vector.new("docs")

def id_for(chunk)
  Digest::SHA1.hexdigest(chunk[:url] + chunk[:heading])
end

def index_chunks(openai, index, chunks)
  chunks.each_slice(100) do |batch|
    resp = openai.embeddings(parameters: {
      model: "text-embedding-3-small",
      input: batch.map { |c| c[:text] }
    })
    vectors = batch.each_with_index.map do |chunk, j|
      {
        id: id_for(chunk),
        values: resp.dig("data", j, "embedding"),
        metadata: { url: chunk[:url], heading: chunk[:heading], text: chunk[:text] }
      }
    end
    index.upsert(vectors: vectors)
  end
end

package main

import (
	"context"
	"crypto/sha1"
	"encoding/hex"
	"os"

	"github.com/openai/openai-go"
	"github.com/openai/openai-go/option"
	"github.com/pinecone-io/go-pinecone/v3/pinecone"
	"google.golang.org/protobuf/types/known/structpb"
)

func idFor(c Chunk) string {
	sum := sha1.Sum([]byte(c.URL + c.Heading))
	return hex.EncodeToString(sum[:])
}

func indexChunks(ctx context.Context, chunks []Chunk) error {
	oa := openai.NewClient(option.WithAPIKey(os.Getenv("OPENAI_API_KEY")))
	pc, err := pinecone.NewClient(pinecone.NewClientParams{ApiKey: os.Getenv("PINECONE_API_KEY")})
	if err != nil {
		return err
	}
	desc, err := pc.DescribeIndex(ctx, "docs")
	if err != nil {
		return err
	}
	idx, err := pc.Index(pinecone.NewIndexConnParams{Host: desc.Host})
	if err != nil {
		return err
	}

	for i := 0; i < len(chunks); i += 100 {
		batch := chunks[i:min(i+100, len(chunks))]

		inputs := make([]string, len(batch))
		for j, c := range batch {
			inputs[j] = c.Text
		}
		emb, err := oa.Embeddings.New(ctx, openai.EmbeddingNewParams{
			Model: openai.EmbeddingModelTextEmbedding3Small,
			Input: openai.EmbeddingNewParamsInputUnion{OfArrayOfStrings: inputs},
		})
		if err != nil {
			return err
		}

		vectors := make([]*pinecone.Vector, len(batch))
		for j, c := range batch {
			values := make([]float32, len(emb.Data[j].Embedding))
			for k, v := range emb.Data[j].Embedding {
				values[k] = float32(v)
			}
			meta, _ := structpb.NewStruct(map[string]any{"url": c.URL, "heading": c.Heading, "text": c.Text})
			vectors[j] = &pinecone.Vector{Id: idFor(c), Values: &values, Metadata: meta}
		}
		if _, err := idx.UpsertVectors(ctx, vectors); err != nil {
			return err
		}
	}
	return nil
}

<?php

use OpenAI\Client as OpenAIClient;
use Symfony\Component\HttpClient\HttpClient;

$openai = OpenAIClient::factory()->withApiKey(getenv('OPENAI_API_KEY'))->make();
$http = HttpClient::create();
$pineconeHost = getenv('PINECONE_HOST'); // e.g. docs-xxxxx.svc.us-east-1.pinecone.io

function idFor(array $chunk): string
{
    return sha1($chunk['url'] . $chunk['heading']);
}

function indexChunks(array $chunks): void
{
    global $openai, $http, $pineconeHost;

    for ($i = 0; $i < count($chunks); $i += 100) {
        $batch = array_slice($chunks, $i, 100);

        $response = $openai->embeddings()->create([
            'model' => 'text-embedding-3-small',
            'input' => array_column($batch, 'text'),
        ]);

        $vectors = [];
        foreach ($batch as $j => $chunk) {
            $vectors[] = [
                'id' => idFor($chunk),
                'values' => $response->embeddings[$j]->embedding,
                'metadata' => [
                    'url' => $chunk['url'],
                    'heading' => $chunk['heading'],
                    'text' => $chunk['text'],
                ],
            ];
        }

        $http->request('POST', "https://{$pineconeHost}/vectors/upsert", [
            'headers' => ['Api-Key' => getenv('PINECONE_API_KEY')],
            'json' => ['vectors' => $vectors],
        ]);
    }
}

Chunks embed in batches of 100 to stay under the OpenAI input cap. text-embedding-3-small returns 1536-dimensional vectors, so create the Pinecone index with dimension 1536; it is cheap and accurate enough for documentation, and you only need text-embedding-3-large (3072 dimensions, a separate index) if an evaluation says so. The deterministic SHA-1 of url + heading makes a re-run upsert in place instead of duplicating, and the text you keep in metadata is what you hand back to the model at query time.

Step 4: Retrieve at query time

At runtime, your RAG system needs to:

Embed the user’s question with the same model used for indexing
Pull the top-K most similar chunks
Hand them to the LLM as context.

async function answer(question: string) {
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: [question]
  });

  const { matches } = await index.query({
    vector: data[0].embedding,
    topK: 6,
    includeMetadata: true
  });

  const context = matches
    .map((m) => `## ${m.metadata!.heading} (${m.metadata!.url})\n\n${m.metadata!.text}`)
    .join("\n\n---\n\n");

  return askLLM({ question, context });
}

def answer(question: str) -> str:
    embedding = openai.embeddings.create(
        model="text-embedding-3-small",
        input=[question],
    ).data[0].embedding

    matches = index.query(vector=embedding, top_k=6, include_metadata=True).matches

    context = "\n\n---\n\n".join(
        f"## {m.metadata['heading']} ({m.metadata['url']})\n\n{m.metadata['text']}"
        for m in matches
    )
    return ask_llm(question, context)

def answer(openai, index, question)
  embedding = openai.embeddings(parameters: {
    model: "text-embedding-3-small",
    input: question
  }).dig("data", 0, "embedding")

  matches = index.query(vector: embedding, top_k: 6, include_metadata: true)["matches"]

  context = matches.map do |m|
    meta = m["metadata"]
    "## #{meta['heading']} (#{meta['url']})\n\n#{meta['text']}"
  end.join("\n\n---\n\n")

  ask_llm(question, context)
end

package main

import (
	"context"
	"fmt"
	"strings"

	"github.com/openai/openai-go"
	"github.com/pinecone-io/go-pinecone/v3/pinecone"
)

func answer(ctx context.Context, oa openai.Client, idx *pinecone.IndexConnection, question string) (string, error) {
	emb, err := oa.Embeddings.New(ctx, openai.EmbeddingNewParams{
		Model: openai.EmbeddingModelTextEmbedding3Small,
		Input: openai.EmbeddingNewParamsInputUnion{OfString: openai.String(question)},
	})
	if err != nil {
		return "", err
	}

	vector := make([]float32, len(emb.Data[0].Embedding))
	for i, v := range emb.Data[0].Embedding {
		vector[i] = float32(v)
	}

	res, err := idx.QueryByVectorValues(ctx, &pinecone.QueryByVectorValuesRequest{
		Vector:          vector,
		TopK:            6,
		IncludeMetadata: true,
	})
	if err != nil {
		return "", err
	}

	var sb strings.Builder
	for _, match := range res.Matches {
		meta := match.Vector.Metadata.AsMap()
		fmt.Fprintf(&sb, "## %s (%s)\n\n%s\n\n---\n\n", meta["heading"], meta["url"], meta["text"])
	}
	return askLLM(question, sb.String())
}

<?php

function answer(string $question): string
{
    global $openai, $http, $pineconeHost;

    $embedding = $openai->embeddings()->create([
        'model' => 'text-embedding-3-small',
        'input' => [$question],
    ])->embeddings[0]->embedding;

    $response = $http->request('POST', "https://{$pineconeHost}/query", [
        'headers' => ['Api-Key' => getenv('PINECONE_API_KEY')],
        'json' => [
            'vector' => $embedding,
            'topK' => 6,
            'includeMetadata' => true,
        ],
    ])->toArray();

    $parts = [];
    foreach ($response['matches'] ?? [] as $match) {
        $meta = $match['metadata'];
        $parts[] = "## {$meta['heading']} ({$meta['url']})\n\n{$meta['text']}";
    }
    $context = implode("\n\n---\n\n", $parts);

    return askLLM($question, $context);
}

topK: 6 is a sensible default for documentation Q&A; raise it if answers miss context, lower it if noise creeps in. The retrieved text is already Markdown and carries its url, so the model sees real structure and can return linkable citations if you add Cite the URL of each section you used to the system prompt.

Scrape Websites to Clean Markdown

The full Web Scraping API guide: Markdown, HTML, sitemaps, and image extraction with automatic proxy switching.

Extract structured website data

Send a URL plus a JSON Schema when you need typed answers at inference time, not ingest time.

Web Crawl API

Full request and response schema for the crawl endpoint, including maxPages, urlRegex, and per-page metadata.

Web Sitemap API

Sitemap discovery for previewing a site’s URL set before a crawl.

Get Started

Give it to your Agent

What can Context.dev do

Optimizations

No-code Integrations

Build an Agentic RAG System with Web Scraping

Architecture

Prerequisites

Step 1: Crawl the site into Markdown

Step 2: Chunk the Markdown

Step 3: Embed and store

Step 4: Retrieve at query time

Scrape Websites to Clean Markdown

Extract structured website data

Web Crawl API

Web Sitemap API

​Architecture

​Prerequisites

​Step 1: Crawl the site into Markdown

​Step 2: Chunk the Markdown

​Step 3: Embed and store

​Step 4: Retrieve at query time

​Related resources

Scrape Websites to Clean Markdown

Extract structured website data

Web Crawl API

Web Sitemap API

Architecture

Prerequisites

Step 1: Crawl the site into Markdown

Step 2: Chunk the Markdown

Step 3: Embed and store

Step 4: Retrieve at query time

Related resources