Building a RAG Pipeline with LangChain


Prerequisites

Before we crack on, you’ll need:

  • Python 3.11+
  • An OpenAI API key (or a local LLM if you enjoy suffering)
  • Basic LangChain knowledge
  • A mass of documents you’ve been hoarding like a digital dragon
  • At least one existential crisis about whether AI will replace you
  • Tea. Non-negotiable.

What We’re Building

Ever asked an LLM a question about your own codebase and watched it confidently hallucinate an answer that has absolutely nothing to do with reality? Yeah. Same.


Retrieval-Augmented Generation (RAG) fixes that. Instead of letting the model freestyle, we shove our actual documents into the pipeline so it can, you know, read them first. Revolutionary concept.

We’re building a RAG pipeline that lets you query your own documents with an LLM. Ask questions about your codebase, documentation, or knowledge base and get accurate, sourced answers. Answers that are grounded in things that actually exist.

The Approach

  1. Load and split documents
  2. Create embeddings
  3. Store in vector database
  4. Build retrieval chain
  5. Query with sources

Simple enough on paper. Famous last words.

Step 1: Install Dependencies

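pip install langchain langchain-community langchain-openai langchain-chroma chromadb
pip install unstructured markdown pypdf

The first line covers the pipeline itself; the second is what the Markdown and PDF loaders in Step 2 need. That’s it. No sprawling dependency nightmare. Yet. (Steps 7 and 9 pull in rank_bm25, langchain-cohere, and ragas if you get that far.)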

Step 2: Load Documents

from langchain_community.document_loaders import (
    DirectoryLoader,
    TextLoader,
    PyPDFLoader,
    UnstructuredMarkdownLoader,
)

loader = DirectoryLoader(
    "./docs",
    glob="**/*.md",
    loader_cls=UnstructuredMarkdownLoader,
    show_progress=True,
)

documents = loader.load()
print(f"Loaded {len(documents)} documents")

This grabs all your Markdown files from the ./docs directory. The show_progress=True is there so you can watch numbers tick up and feel productive.

But let’s be honest, your docs folder isn’t going to be exclusively Markdown. You’ve probably got PDFs from 2019 that nobody’s read, plain text files with names like notes_FINAL_v2_ACTUAL_FINAL.txt, and maybe a few existential README files. Here’s how to handle the lot:

from pathlib import Path

def load_documents(docs_path: str):
    documents = []

    loaders = {
        ".md": UnstructuredMarkdownLoader,
        ".txt": TextLoader,
        ".pdf": PyPDFLoader,
    }

    for file_path in Path(docs_path).rglob("*"):
        # Match suffixes case-insensitively so REPORT.PDF is picked up too
        suffix = file_path.suffix.lower()
        if suffix in loaders:
            loader = loaders[suffix](str(file_path))
            documents.extend(loader.load())

    return documents

This walks through your directory recursively, picks the right loader for each file type, and hoovers everything up into one big list. Lovely.
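Before loading anything for real, it can be worth a dry run to see what would be picked up and what would be silently ignored. Here is a minimal stdlib-only sketch of that idea. The name summarise_docs and the SUPPORTED_SUFFIXES set are hypothetical (they mirror the keys of the loaders dict above), and treating suffixes case-insensitively is my assumption.

```python
from collections import Counter
from pathlib import Path

# Hypothetical helper, not part of LangChain: mirrors the keys of the loaders dict.
SUPPORTED_SUFFIXES = {".md", ".txt", ".pdf"}


def summarise_docs(docs_path: str) -> tuple[Counter, list[Path]]:
    """Count loadable files per suffix and list the files that would be skipped."""
    loadable: Counter = Counter()
    skipped: list[Path] = []
    for file_path in sorted(Path(docs_path).rglob("*")):
        if not file_path.is_file():
            continue  # rglob("*") also yields directories
        suffix = file_path.suffix.lower()
        if suffix in SUPPORTED_SUFFIXES:
            loadable[suffix] += 1
        else:
            skipped.append(file_path)
    return loadable, skipped
```

If the skipped list is full of .docx or .html files, that is your cue to add another loader rather than wonder later why half your knowledge base never shows up in answers.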

Step 3: Split Documents

You can’t just dump entire documents into an LLM. They have context windows, and those context windows have limits. We need to chop things into sensible chunks.

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", " ", ""],
)

splits = text_splitter.split_documents(documents)
print(f"Created {len(splits)} chunks")

The chunk_overlap=200 is doing important work here. Without overlap, you’d lose context at chunk boundaries. Imagine splitting a paragraph mid-sentence and then asking the model what it means. Not ideal.
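To see what overlap actually buys you, here is a deliberately naive sketch: a fixed-width chunker in plain Python. It is not how RecursiveCharacterTextSplitter works (that one also respects separators), and chunk_with_overlap is a hypothetical name, but it shows the tail of each chunk reappearing at the head of the next.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive fixed-width chunker: each chunk starts chunk_size - chunk_overlap after the last."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


text = "The token expires after one hour. Refresh it before calling the API."
chunks = chunk_with_overlap(text, chunk_size=40, chunk_overlap=15)

# The last 15 characters of one chunk are the first 15 of the next
for left, right in zip(chunks, chunks[1:]):
    assert left[-15:] == right[:15]
```

The trade-off is that more overlap means more chunks, and more chunks means more embeddings to pay for.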

For code, LangChain has language-aware splitters that actually understand syntax boundaries instead of just blindly counting characters:

from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=200,
)

This prefers to split where functions and classes start, and only falls back to blank lines and smaller separators when a chunk is still too big. Much better than splitting your code in the middle of a function definition and wondering why the retrieval is garbage.
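My understanding is that from_language does not parse your code. It swaps in language-specific separators (for Python, strings like "\nclass " and "\ndef ") and tries those first. The stdlib-only sketch below illustrates that idea and nothing more: split_on_definitions is a hypothetical name, it ignores chunk_size entirely, and it is not LangChain’s implementation.

```python
import re

# Zero-width match at the start of any line that begins a top-level definition.
# Indented methods do not match, so a class stays in one piece.
_TOP_LEVEL = re.compile(r"^(?=(?:def|class|async def) )", flags=re.MULTILINE)


def split_on_definitions(source: str) -> list[str]:
    """Split Python source before each top-level def/class, keeping every character."""
    return [piece for piece in _TOP_LEVEL.split(source) if piece]


source = (
    "import os\n\n"
    "def a():\n    return 1\n\n"
    "class B:\n    def m(self):\n        return 2\n\n"
    "def c():\n    return 3\n"
)
pieces = split_on_definitions(source)
```

One limitation worth knowing about: a decorator sits on the line above its def, so a split like this strands it at the end of the previous piece.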

Step 4: Create Embeddings and Store

Now for the bit that feels like magic. We convert our text chunks into vectors (big lists of numbers that capture semantic meaning) and store them in a database built for exactly this sort of thing.

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    persist_directory="./chroma_db",
)


The persist_directory means your embeddings survive between runs. Without it, you’d be re-embedding everything every single time, which is both slow and expensive. Nobody wants that.

Loading an existing store back up is straightforward:

vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
)
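If the vector part still feels like magic, a toy version helps. The sketch below ranks made-up three-dimensional vectors by cosine similarity. Everything in it is an assumption for illustration: the vectors are invented, the file names are hypothetical, and cosine similarity is only one common choice (the metric your store actually uses depends on how it is configured).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], store: dict[str, list[float]], k: int) -> list[str]:
    """Return the k keys whose vectors are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda name: cosine_similarity(query_vec, store[name]),
        reverse=True,
    )
    return ranked[:k]


# Made-up vectors; real embeddings have hundreds or thousands of dimensions.
store = {
    "auth.md": [0.9, 0.1, 0.0],
    "billing.md": [0.1, 0.9, 0.1],
    "onboarding.md": [0.3, 0.3, 0.8],
}
query = [0.8, 0.2, 0.1]
nearest = top_k(query, store, k=2)
```

A vector database does the same job, just with indexing tricks so it does not have to compare the query against every stored vector.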

Step 5: Create Retrieval Chain

This is where we wire everything together. The retriever finds relevant chunks, and the LLM uses them to actually answer your question.

from langchain_openai import ChatOpenAI
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o", temperature=0)

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5},
)

system_prompt = """You are an assistant that answers questions based on the provided context.
Use only the context to answer. If you don't know, say so.
Always cite your sources with the document name.

Context:
{context}
"""

prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}"),
])

question_answer_chain = create_stuff_documents_chain(llm, prompt)
chain = create_retrieval_chain(retriever, question_answer_chain)

Setting temperature=0 keeps the model from getting creative. We want facts here, not improv. The k=5 means we fetch the five most relevant chunks. Adjust to taste, but five is a solid starting point.
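One gap worth flagging: the prompt asks the model to cite document names, but it can only do that if the names actually appear in {context}. As far as I know, create_stuff_documents_chain formats each document using only its page_content by default, so the source path never reaches the model. The sketch below shows the shape of context you would want instead. Doc is a stand-in for a LangChain Document and format_context is a hypothetical helper, both there so the example runs without any dependencies.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document, so the sketch runs on its own."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context(docs: list[Doc]) -> str:
    """Join chunks into one context string, labelling each with its source."""
    blocks = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        blocks.append(f"Source: {source}\n{doc.page_content}")
    return "\n\n".join(blocks)


context = format_context([
    Doc("Set AUTH_MODE=token to enable tokens.", {"source": "docs/auth.md"}),
    Doc("Tokens expire after one hour."),
])
```

In LangChain terms, I believe the equivalent is passing a document_prompt (for example a PromptTemplate over "Source: {source}\n{page_content}") to create_stuff_documents_chain. Check that against the version you have installed, and note that a template like that expects every document to carry a source in its metadata.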

Step 6: Query the System

The moment of truth. Let’s ask it something.

def ask(question: str) -> dict:
    result = chain.invoke({"input": question})

    return {
        "answer": result["answer"],
        "sources": [
            {
                "content": doc.page_content[:200],
                "source": doc.metadata.get("source", "unknown"),
            }
            for doc in result["context"]
        ],
    }

response = ask("How do I configure authentication?")
print(response["answer"])
print("\nSources:")
for source in response["sources"]:
    print(f"- {source['source']}")

And just like that, you get an answer grounded in your actual documents, with sources you can verify. No more blind trust in a model that might be making things up with supreme confidence.
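One small annoyance: several of the retrieved chunks often come from the same file, so the printed source list repeats itself. A minimal sketch of a fix, assuming the list-of-dicts shape that ask() returns above; unique_sources is a hypothetical helper name.

```python
def unique_sources(sources: list[dict]) -> list[str]:
    """Return source names in first-seen order, without duplicates."""
    # dict.fromkeys keeps insertion order, which preserves the retrieval ranking
    return list(dict.fromkeys(s["source"] for s in sources))


sources = [
    {"content": "Set AUTH_MODE...", "source": "docs/auth.md"},
    {"content": "Tokens expire...", "source": "docs/auth.md"},
    {"content": "Install with...", "source": "docs/setup.md"},
]
names = unique_sources(sources)
```

Swap the final loop for one over unique_sources(response["sources"]) and each file is listed once.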


Step 7: Improve Retrieval

Basic similarity search works, but we can do better. Hybrid search combines keyword matching (BM25) with semantic search, giving you the best of both worlds.

Hybrid search (keyword + semantic):

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever  # needs: pip install rank_bm25

bm25_retriever = BM25Retriever.from_documents(splits)
bm25_retriever.k = 5

ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, retriever],
    weights=[0.3, 0.7],
)

The 0.3/0.7 weighting favours semantic search but still lets keyword matching have its say. If someone searches for an exact function name, BM25 will catch it even if the semantic search gets a bit confused.
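How do two ranked lists become one? From my reading of the LangChain source, EnsembleRetriever fuses them with weighted Reciprocal Rank Fusion, using a constant that defaults to 60. Treat that as my understanding rather than gospel. Here is a minimal stdlib sketch of the technique; weighted_rrf and the file names are hypothetical.

```python
from collections import defaultdict


def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists: each list adds weight / (c + rank) to a document's score."""
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)


keyword_hits = ["auth.py", "login.md", "faq.md"]    # e.g. from BM25
semantic_hits = ["login.md", "sso.md", "auth.py"]   # e.g. from the vector store
fused = weighted_rrf([keyword_hits, semantic_hits], weights=[0.3, 0.7])
```

Because the score depends on rank rather than raw similarity, the two retrievers do not need comparable scoring scales, which is exactly why this works for mixing BM25 with embeddings.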

Reranking takes things a step further. You fetch a bunch of candidates and then use a specialised model to sort them by actual relevance:

from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

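# Recent langchain-cohere releases want an explicit model name; COHERE_API_KEY must be set
compressor = CohereRerank(model="rerank-english-v3.0", top_n=3)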
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)


This is the “I actually care about quality” upgrade. It adds latency, but the improvement in answer quality is often worth the trade-off.
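The pattern underneath is "fetch wide, then keep the best few". Here is a toy sketch of it, with a word-overlap score standing in for the reranking model. The scorer is nothing like what Cohere does, and overlap_score, rerank, and the candidate texts are all hypothetical.

```python
from typing import Callable


def overlap_score(query: str, text: str) -> float:
    """Toy relevance score: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def rerank(
    query: str,
    candidates: list[str],
    top_n: int,
    score: Callable[[str, str], float] = overlap_score,
) -> list[str]:
    """Sort candidates by score against the query and keep the top_n."""
    return sorted(candidates, key=lambda text: score(query, text), reverse=True)[:top_n]


candidates = [
    "Billing is handled monthly",
    "Configure authentication with an API key",
    "Authentication tokens expire hourly",
    "Welcome to the onboarding guide",
]
best = rerank("configure authentication", candidates, top_n=2)
```

A reranker can only promote what the base retriever hands it, so it usually pays to raise k on the base retriever (say, 20) and let top_n do the trimming.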

Step 8: Add Metadata Filtering

Sometimes you don’t want to search everything. Maybe you only care about API docs, or you want to restrict to a specific category. Metadata filtering lets you do exactly that.

retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 5,
        "filter": {"category": "api-docs"},
    },
)

def ask_category(question: str, category: str) -> dict:
    filtered_retriever = vectorstore.as_retriever(
        search_kwargs={
            "k": 5,
            "filter": {"category": category},
        },
    )

    filtered_chain = create_retrieval_chain(
        filtered_retriever,
        question_answer_chain,
    )

    return filtered_chain.invoke({"input": question})

This is especially useful when your document collection grows. Without filtering, the retriever might pull in chunks from completely unrelated sections. Nobody wants their API authentication question answered with excerpts from the onboarding guide.
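A filter on category only works if your documents have a category in their metadata, and the loaders from Step 2 only record the source path. Here is a minimal sketch of one way to fill the gap, assuming your docs are organised into one folder per category (my assumption, not something the loaders know about). category_for and tag_categories are hypothetical helpers.

```python
from pathlib import Path


def category_for(file_path: str, docs_root: str) -> str:
    """Use the first folder below docs_root as the category.

    docs/api-docs/auth.md -> "api-docs"; files directly in docs_root get "uncategorised".
    """
    relative = Path(file_path).resolve().relative_to(Path(docs_root).resolve())
    return relative.parts[0] if len(relative.parts) > 1 else "uncategorised"


def tag_categories(documents: list, docs_root: str) -> list:
    """Set metadata["category"] on each document from its metadata["source"] path."""
    for doc in documents:
        doc.metadata["category"] = category_for(doc.metadata["source"], docs_root)
    return documents
```

Run this on the loaded documents before splitting and indexing. split_documents copies each document’s metadata onto its chunks, so the category ends up in the vector store where the filter can see it.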

Step 9: Evaluation

Building a RAG pipeline without evaluating it is like cooking dinner without tasting it. You need to know if the thing actually works.

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

test_questions = [
    "How do I install the package?",
    "What authentication methods are supported?",
    "How do I handle errors?",
]

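from datasets import Dataset  # ragas expects a dataset object, not a plain list

def evaluate_rag():
    results = []

    for question in test_questions:
        # Call the chain directly: ask() truncates each chunk to 200 characters,
        # which would make well-supported answers look unfaithful.
        result = chain.invoke({"input": question})
        results.append({
            "question": question,
            "answer": result["answer"],
            "contexts": [doc.page_content for doc in result["context"]],
        })

    # context_precision also needs a reference answer per question; add one to
    # each row (the column name depends on your ragas version) or drop the metric.
    scores = evaluate(
        Dataset.from_list(results),
        metrics=[faithfulness, answer_relevancy, context_precision],
    )

    return scores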

Faithfulness checks whether the answer is supported by the context. Answer relevancy checks whether the answer actually addresses the question. Context precision checks whether the retrieved chunks were actually useful. Together, they tell you if your pipeline is working or just vibing.
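Those metrics all lean on an LLM as judge, which costs money and time. A cheaper sanity check on the retrieval half alone is a hit rate: for each test question, did the file you expected show up among the retrieved sources? Here is a minimal sketch, assuming you can hand-label one expected source per question. hit_rate and the example paths are hypothetical.

```python
def hit_rate(expected: dict[str, str], retrieved: dict[str, list[str]]) -> float:
    """Fraction of questions whose expected source appears among the retrieved sources."""
    if not expected:
        return 0.0
    hits = sum(
        1 for question, source in expected.items()
        if source in retrieved.get(question, [])
    )
    return hits / len(expected)


# Hand-labelled expectations and what the retriever actually returned
expected = {
    "How do I install the package?": "docs/setup.md",
    "How do I handle errors?": "docs/errors.md",
}
retrieved = {
    "How do I install the package?": ["docs/setup.md", "docs/faq.md"],
    "How do I handle errors?": ["docs/auth.md", "docs/faq.md"],
}
rate = hit_rate(expected, retrieved)
```

If this number is low, no amount of prompt tweaking will save the answers, because the right chunks never reach the model.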

The Result

What you’ve got now:

  • Query your own documents naturally
  • Accurate, sourced answers
  • Scales to thousands of documents
  • Configurable retrieval strategies
  • An evaluation framework so you actually know if it’s working

What I’d Do Differently

Invest more in chunking strategy upfront. Poor chunks equal poor retrieval, which equals poor answers. Semantic chunking beats arbitrary character limits every time. I spent far too long tweaking prompts before realising the real problem was how I was splitting documents. Fix the retrieval first. The generation will follow.

RAG is how you make LLMs useful for your specific domain. The retrieval is 80% of the battle, and the other 20% is convincing your team that “vector database” isn’t just a buzzword.
