RAG Fundamentals

Agents are only as good as the data they can access. RAG connects your agent to live content instead of relying on stale training data.

Your agent just told a customer that your product costs $99. It actually costs $149. The agent was not lying. It was confidently generating text based on training data from eight months ago, before the price change. This is the core problem Retrieval Augmented Generation (RAG) solves: connecting agents to your actual, current data instead of letting them guess.

The Hallucination Problem

Large language models generate text that sounds right. That does not mean it is right. GPT-4 hallucinates at a rate of 3 to 5 percent according to Vectara's Hallucination Leaderboard[1]. That means roughly 1 in 25 factual claims is fabricated. For a product catalog with 500 items, that is 20 wrong answers waiting to happen.

Two forces drive hallucination. First, training data has a cutoff. GPT-4o's knowledge stops at a fixed date. Anything that changed after that (your pricing, your inventory, your return policy) does not exist in the model's world.

Second, LLMs are text completion engines, not databases. They optimize for plausible continuations, not factual accuracy. When the model does not know something, it fills the gap with something that reads well.

RAG addresses both problems. Instead of asking the model to recall facts from training, you retrieve the actual source documents and inject them directly into the prompt. The model generates answers grounded in your real data, not its memory.

How RAG Works

The RAG pipeline has five stages: chunk, embed, store, retrieve, inject.

  1. Chunk. You break your source content (product pages, help docs, policy pages) into smaller pieces. A 2,000-word product page might become 8 to 10 chunks of 200 to 500 tokens each.
  2. Embed. Each chunk gets converted into a vector, a list of numbers that represents its semantic meaning. OpenAI's text-embedding-3-small produces 1,536-dimensional vectors. Cohere's embed-v3 produces 1,024 dimensions. These models understand meaning, not just keywords. "Return policy" and "how to send items back" land near each other in vector space.
  3. Store. Vectors go into a vector database optimized for similarity search. This is your retrieval layer.
  4. Retrieve. When a user asks a question, the question itself gets embedded into the same vector space. The database returns the closest matching chunks, typically the top 5 to 20 results.
  5. Inject. Those chunks get inserted into the LLM prompt as context. The model generates its answer using your actual content as the source of truth.
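
The five stages above can be sketched end to end in plain Python. Everything here is a toy stand-in: the hash-based embed function substitutes for a real embedding model (such as text-embedding-3-small), and an in-memory list substitutes for a vector database.

```python
import hashlib
import math

DIM = 256  # toy dimensionality; production models use 1,024+ dimensions

def embed(text: str) -> list[float]:
    # 2. Embed (toy): hash each word into a fixed-size bin and normalize.
    # A real pipeline calls an embedding API here.
    vec = [0.0] * DIM
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(text: str, size: int = 40) -> list[str]:
    # 1. Chunk: naive fixed-size split by word count.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_index(docs: list[str]) -> list[tuple[str, list[float]]]:
    # 3. Store: an in-memory list stands in for a vector database.
    return [(c, embed(c)) for doc in docs for c in chunk(doc)]

def retrieve(index, query: str, k: int = 5) -> list[str]:
    # 4. Retrieve: on unit vectors, cosine similarity is just a dot product.
    q = embed(query)
    ranked = sorted(index, key=lambda item: -sum(a * b for a, b in zip(q, item[1])))
    return [text for text, _ in ranked[:k]]

def build_prompt(query: str, context: list[str]) -> str:
    # 5. Inject: retrieved chunks become grounded context for the LLM.
    return "Answer using only this context:\n" + "\n".join(context) + f"\nQuestion: {query}"
```

The structure is the point, not the scoring: swap `embed` for a real embedding API and the index for a vector database, and the pipeline shape stays the same.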

The result: your agent answers questions about your products using your product data, not its training data.

Chunking Strategies

How you split your content determines how well retrieval works. Get this wrong and the right answer might be split across two chunks, with neither chunk containing enough context to be useful.

Fixed-size chunking splits text into equal token counts, typically 256 to 512 tokens with 50 to 100 tokens of overlap. Simple and predictable. Works well for uniform content like product descriptions. Falls apart on content with variable structure.
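
A minimal fixed-size chunker looks like this, using words as a rough stand-in for tokens (a real implementation would count tokens with the model's tokenizer):

```python
def fixed_size_chunks(text: str, size: int = 256, overlap: int = 50) -> list[str]:
    # Slide a window of `size` words forward by (size - overlap) each step,
    # so each chunk repeats the last `overlap` words of the previous one.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]
```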

Semantic chunking uses the embedding model itself to detect topic boundaries. When the cosine similarity between consecutive sentences drops below a threshold, a new chunk starts. Better at preserving meaning, but slower to process and harder to debug.

Recursive chunking tries to split on natural boundaries: paragraphs first, then sentences, then tokens. LangChain's RecursiveCharacterTextSplitter is the most common implementation[2]. It respects document structure while staying within size limits.
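
The recursive idea can be sketched in a few lines. This is an illustration of the approach, not LangChain's actual implementation, and it splits on words rather than characters or tokens:

```python
def recursive_split(text: str, max_words: int = 100) -> list[str]:
    # Small enough already: keep as one chunk.
    if len(text.split()) <= max_words:
        return [text.strip()] if text.strip() else []
    # Try natural boundaries first: paragraphs, then sentences.
    for sep in ("\n\n", ". "):
        parts = text.split(sep)
        if len(parts) > 1:
            chunks = []
            for part in parts:
                chunks.extend(recursive_split(part, max_words))
            return chunks
    # Last resort: hard split on word count.
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
```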

Overlap matters. Without overlap, a question about a concept that spans a chunk boundary retrieves neither chunk. 10 to 20 percent overlap is standard. A 10,000-page enterprise site produces roughly 50,000 to 200,000 chunks depending on strategy and overlap settings.

Vector Databases

Your chunks need a home optimized for similarity search. Three options dominate the market.

Pinecone is a fully managed vector database. No infrastructure to run. Scales automatically. Supports metadata filtering, namespaces, and hybrid search out of the box. Pricing starts at the free tier (100K vectors) and scales with usage[3]. Best for teams that do not want to manage infrastructure.

pgvector is a Postgres extension that adds vector similarity search to your existing database[4]. If you already run Postgres, this means zero new infrastructure. Supports IVFFlat and HNSW indexing. Performance holds up well to a few million vectors. Beyond that, dedicated vector databases pull ahead.

Vercel AI SDK provides a vector store abstraction that works across multiple backends[5]. Useful if you want to swap providers without rewriting your retrieval code.

How to choose: If you need sub-50ms retrieval at millions of vectors, go managed (Pinecone). If you have under 1 million vectors and already use Postgres, pgvector keeps your stack simple. If you are experimenting, start with pgvector and migrate when you hit scale limits.

Retrieval and Reranking

Retrieval is where RAG pipelines succeed or fail. The model can only answer well if the right chunks make it into the prompt.

Top-K retrieval returns the K nearest vectors by cosine similarity. K=10 is a common starting point. Too low and you miss relevant context. Too high and you flood the prompt with noise, burning tokens and confusing the model.

Cosine similarity has limits. It measures geometric distance in vector space, which correlates with semantic similarity but is not perfect. A chunk about "product dimensions" and a chunk about "product reviews" might score similarly for a query about "product details." The retrieval step cannot distinguish which one actually answers the question.

Reranking fixes this. After initial vector retrieval returns the top 50 to 100 candidates, a cross-encoder model scores each candidate against the original query. Cohere's Rerank v3 is a widely used option[6]. It reads the query and each candidate together, producing a relevance score that is significantly more accurate than cosine similarity alone. Typical improvement: 5 to 15 percent better recall on benchmarks.
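
The two-stage shape can be sketched with a toy scorer standing in for the cross-encoder. Real rerankers are learned models; the Jaccard word overlap below is only an illustration of "score query and candidate together":

```python
def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    # Toy cross-encoder stand-in: score each candidate jointly with the query.
    q_words = set(query.lower().split())

    def score(candidate: str) -> float:
        c_words = set(candidate.lower().split())
        # Jaccard overlap: shared words / all words across both texts.
        return len(q_words & c_words) / len(q_words | c_words)

    return sorted(candidates, key=score, reverse=True)[:top_n]
```

In production, the first stage (vector search) does cheap recall over millions of chunks, and this second stage does expensive precision over a few dozen.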

Hybrid search combines vector similarity with traditional keyword search (BM25). This catches cases where the user's exact terminology matters, like searching for a specific SKU or model number that embeddings might not distinguish well. Pinecone and Elasticsearch both support hybrid search natively.
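
One common way to merge the keyword and vector result lists is reciprocal rank fusion, where each list votes for a document by its rank. The constant k = 60 is the commonly used default:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each ranking contributes 1 / (k + rank) per document; documents that
    # appear high in both lists accumulate the largest fused score.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```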

Production RAG Patterns

Getting RAG into production requires more than a demo pipeline. Six patterns separate production systems from prototypes.

Index refresh. Your content changes. Products get updated, pages get published, policies get revised. Production RAG pipelines need an incremental indexing strategy. Reindex changed pages on publish, not on a nightly batch. Stale indexes defeat the entire purpose of RAG.
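
A minimal sketch of incremental indexing: hash each page's content and re-embed only the pages whose hash changed since the last run. The function names here are illustrative, not from any particular library:

```python
import hashlib

def content_hash(text: str) -> str:
    # Stable fingerprint of a page's content.
    return hashlib.sha256(text.encode()).hexdigest()

def pages_to_reindex(current: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    # current: url -> page content; indexed_hashes: url -> hash at last index time.
    # Only changed or brand-new pages need re-chunking and re-embedding.
    return [url for url, content in current.items()
            if indexed_hashes.get(url) != content_hash(content)]
```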

Metadata filtering. Not all chunks are relevant to all queries. Tag chunks with metadata (product category, content type, date, language) and filter at query time. A question about women's shoes should not retrieve chunks about men's outerwear. Metadata filtering reduces the search space before vector similarity even runs.
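
Filter-then-search can be sketched over a simple in-memory index (a vector database applies the same logic natively, before the similarity scan):

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def filtered_search(index: list[dict], query_vec: list[float],
                    filters: dict, k: int = 5) -> list[dict]:
    # Narrow by metadata first, then rank only the survivors by similarity.
    candidates = [item for item in index
                  if all(item["meta"].get(key) == val for key, val in filters.items())]
    candidates.sort(key=lambda item: -dot(query_vec, item["vec"]))
    return candidates[:k]
```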

Citation. Users and internal stakeholders need to verify answers. Every chunk should carry a source URL. When the model generates an answer, include links back to the source pages. This builds trust and makes errors auditable.

Guardrails. When retrieval returns no relevant chunks (low similarity scores across all results), the agent should say "I don't have information about that" instead of falling back to its training data. Set a minimum similarity threshold and enforce it.
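
The guardrail is a few lines of code. The 0.75 threshold below is an assumed placeholder; tune it against your own evaluation set:

```python
MIN_SIMILARITY = 0.75  # assumed threshold; tune against your eval set

def answer_or_decline(scored_chunks: list[tuple[str, float]]) -> str:
    # scored_chunks: (chunk_text, similarity) pairs from retrieval.
    relevant = [text for text, score in scored_chunks if score >= MIN_SIMILARITY]
    if not relevant:
        # Nothing relevant retrieved: decline rather than let the model guess.
        return "I don't have information about that."
    return "CONTEXT:\n" + "\n".join(relevant)
```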

Evaluation. Measure retrieval quality with two metrics: recall (did the correct chunk appear in the results?) and precision (what fraction of retrieved chunks were actually relevant?). Build an evaluation dataset of 50 to 100 question-answer pairs with known source chunks. Run this after every change to your chunking or retrieval pipeline. A 2 percent drop in recall means real users are getting worse answers.
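
Both metrics are simple set arithmetic over chunk IDs, so the evaluation loop is cheap to run on every pipeline change:

```python
def evaluate(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    # recall: fraction of the known-relevant chunks that were retrieved.
    # precision: fraction of retrieved chunks that were actually relevant.
    hits = [c for c in retrieved if c in relevant]
    recall = len(set(hits)) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision
```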

Cost management. Embedding API calls and vector database queries cost money at scale. Cache frequent queries. Batch embedding calls during indexing. Monitor your per-query cost. A typical RAG query costs $0.001 to $0.01 depending on model and retrieval depth, but at 100,000 queries per day that adds up to $100 to $1,000 daily.
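
The arithmetic, with caching folded in. The simplifying assumption here is that a cache hit costs nothing, since it skips embedding, retrieval, and generation entirely:

```python
def daily_rag_cost(queries_per_day: int, cost_per_query: float,
                   cache_hit_rate: float = 0.0) -> float:
    # Only cache misses pay the full per-query cost (assumption: hits are free).
    return queries_per_day * (1.0 - cache_hit_rate) * cost_per_query
```

At the article's figures, 100,000 queries per day lands between $100 and $1,000 daily; a 30 percent cache hit rate cuts the top end to $700.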

How Site Scanner Helps

Site Scanner evaluates whether your content is structured and accessible enough for RAG pipelines to consume effectively. Pages with clean HTML, proper heading hierarchy, and well-organized content produce better chunks and better retrieval results. The Discoverability and Performance dimensions of your Site Score reflect how effectively agents can extract and use your content.

See how your site scores.

Run a free scan at point11.ai to check your RAG Fundamentals score and 40+ other metrics.

Scan Your Site