Building RAG Pipelines That Actually Work

The gap between demo and production

Every RAG tutorial starts the same way: chunk some documents, embed them, stuff them into a prompt. It works beautifully on the demo dataset. Then you throw real documents at it and everything falls apart.

I’ve built RAG systems that serve production traffic — search platforms indexing millions of records, internal knowledge bases for engineering teams, and customer-facing Q&A systems. The pattern that works in a notebook is rarely the pattern that works at scale.

Here’s what I’ve learned about the gaps.

Chunking is where most pipelines silently fail

The default advice is “split on 500 tokens with 50-token overlap.” This works for clean, well-structured text. It fails catastrophically on:

Tables and structured data — a table split mid-row becomes two meaningless chunks
Code blocks — half a function is worse than no function
Lists with context — item 7 of a list means nothing without the list header

def smart_chunk(document: Document) -> list[Chunk]:
    """Chunk based on document structure, not arbitrary token counts."""
    sections = extract_sections(document)
    chunks = []
    for section in sections:
        if section.token_count <= MAX_CHUNK_SIZE:
            chunks.append(Chunk(content=section.text, metadata=section.metadata))
        else:
            # Only split within sections, preserving headers
            sub_chunks = split_with_context(section, MAX_CHUNK_SIZE)
            chunks.extend(sub_chunks)
    return chunks

The key insight: chunk boundaries should follow document structure, not token counts. Parse your documents into semantic sections first, then decide how to split within those sections.

Retrieval quality is not embedding quality

Better embeddings help, but they’re not the bottleneck most people think they are. The real retrieval problems are:

Query-document mismatch — users ask questions in natural language; documents state facts declaratively
Specificity collapse — “how do I configure X?” retrieves every document that mentions X
Missing context — the answer requires information from multiple chunks that don’t co-occur

Hybrid search solves the first two

Pure vector search is fragile. Combining it with keyword search (BM25) catches the cases where semantic similarity fails:

def hybrid_search(query: str, k: int = 10) -> list[Result]:
    vector_results = vector_store.similarity_search(query, k=k * 2)
    keyword_results = bm25_index.search(query, k=k * 2)
    return reciprocal_rank_fusion(vector_results, keyword_results, k=k)

Reciprocal rank fusion is simple and remarkably effective. It doesn’t require tuning weights between the two result sets.

The generation prompt matters more than you think

Once you have good retrieval, the generation step is where trust is built or broken. Two rules:

Always cite sources — tell the model to reference which chunks it used
Admit ignorance — “I don’t have enough information” is better than a hallucinated answer

Given the following context, answer the user's question.
If the context doesn't contain enough information, say so.
Always reference which sections you used.

Context:
{retrieved_chunks}

Question: {user_query}

Evaluation is the hardest part

You can’t improve what you can’t measure. Build an evaluation set early:

Retrieval evaluation: for each question, do the right chunks appear in the top-k?
Answer evaluation: is the generated answer correct, complete, and grounded?
Regression testing: does a change to chunking or retrieval break previously correct answers?

This isn’t optional. Without evaluation, you’re tuning hyperparameters by vibes.

What I’d do differently next time

Start with the evaluation set. Write 50 question-answer pairs before writing any pipeline code. It changes every decision you make downstream — chunking strategy, embedding model, retrieval approach, prompt design.

RAG isn’t hard because the individual pieces are complex. It’s hard because the pieces interact in ways that are difficult to predict. The only way through is to measure everything and iterate fast.