What Is RAG — and Why Does It Matter?
Large language models have a fundamental limitation: they only know what was in their training data. If you ask GPT-4 about your company's internal policies, last week's earnings call, or a newly published research paper, it either hallucinates an answer or admits it doesn't know.
Retrieval-Augmented Generation (RAG) solves this. Instead of relying solely on the model's training, RAG retrieves relevant documents from a knowledge base and provides them as context in the prompt. The model then generates answers grounded in your data. This approach is now the standard architecture for enterprise AI assistants, document Q&A systems, and custom knowledge bases.
The RAG Pipeline
RAG has two phases, previewed in code right after these lists:
Indexing (one-time setup):
- Load: Read documents (PDFs, Word files, web pages, databases)
- Split: Break documents into smaller chunks (500–1000 tokens each)
- Embed: Convert each chunk to a vector (numerical representation of meaning)
- Store: Save vectors in a vector database (ChromaDB, Pinecone, Weaviate)
Querying (at runtime):
- Embed query: Convert the user's question to a vector
- Retrieve: Find the top-k most similar document chunks
- Generate: Send retrieved chunks + question to the LLM for a grounded answer
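Before diving into each step, here is the whole pipeline compressed into a handful of LangChain calls, using the same example file as Step 1. Treat it as a preview rather than a complete program; every call is unpacked in the steps that follow.
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter

# --- Indexing (one-time) ---
documents = PyPDFLoader("company_handbook.pdf").load()                      # Load
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)                                # Split
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings)                     # Embed + Store

# --- Querying (runtime) ---
question = "What is the vacation policy?"
relevant_chunks = vectorstore.similarity_search(question, k=4)              # Embed query + Retrieve
# Generate: send relevant_chunks + question to the LLM (wired up in Step 5)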
Setup
pip install langchain langchain-openai langchain-community chromadb pypdf tiktoken
export OPENAI_API_KEY="sk-..."
Step 1: Loading Documents
LangChain provides DocumentLoaders for almost every file format:
from langchain_community.document_loaders import (
PyPDFLoader,
TextLoader,
WebBaseLoader,
DirectoryLoader,
UnstructuredWordDocumentLoader,
)
# Load a PDF
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} pages")
# Load all PDFs in a folder
loader = DirectoryLoader("./docs/", glob="*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()
# Load a web page
loader = WebBaseLoader("https://example.com/article")
documents = loader.load()
# Each document has: page_content (str) and metadata (dict with source, page, etc.)
print(documents[0].page_content[:200])
print(documents[0].metadata)
Step 2: Text Splitting
Whole documents are too coarse to retrieve and usually too large to fit in a single prompt, so we split them into overlapping chunks. The overlap (typically 10–20% of chunk size) ensures context isn't lost at chunk boundaries:
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # target chunk size in characters
    chunk_overlap=200,      # overlap between chunks
    length_function=len,
    separators=["\n\n", "\n", " ", ""],  # split on paragraphs first, then lines, etc.
)
chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")
print(f"Sample chunk: {chunks[0].page_content[:300]}")
Chunk size tips: For dense technical documents, use smaller chunks (500–700 chars). For narrative text, larger chunks (1000–1500 chars) preserve more context. Always tune for your specific use case.
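As a concrete illustration, those two profiles might look like the following splitter configurations; the numbers are starting points to tune against your own retrieval results, not recommendations.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Dense technical material: smaller, tightly focused chunks
technical_splitter = RecursiveCharacterTextSplitter(chunk_size=600, chunk_overlap=100)

# Narrative text: larger chunks that preserve surrounding context
narrative_splitter = RecursiveCharacterTextSplitter(chunk_size=1400, chunk_overlap=200)

technical_chunks = technical_splitter.split_documents(documents)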
Step 3: Creating Embeddings
Embeddings transform text into high-dimensional vectors that encode semantic meaning. Similar concepts have similar vectors, enabling similarity search.
from langchain_openai import OpenAIEmbeddings
# OpenAI's text-embedding-3-small is cheap and good for most use cases
# text-embedding-3-large is higher quality but several times more expensive
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Test: embed a single string
vector = embeddings.embed_query("What is the vacation policy?")
print(f"Vector dimension: {len(vector)}") # 1536 for text-embedding-3-small
For open-source embeddings (no API cost), use HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5") — nearly as good for English text.
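To see "similar concepts have similar vectors" in action, embed a few sentences and compare them with cosine similarity. The sentences below are made up purely for illustration:
import numpy as np
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = embeddings.embed_query("How many vacation days do employees get?")
related_vec = embeddings.embed_query("Staff receive 20 days of paid leave per year.")
unrelated_vec = embeddings.embed_query("The server room is on the third floor.")

print(cosine_similarity(query_vec, related_vec))    # should be noticeably higher...
print(cosine_similarity(query_vec, unrelated_vec))  # ...than this score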
Step 4: Vector Store with ChromaDB
from langchain_community.vectorstores import Chroma
# Create vector store and embed all chunks (this takes a minute for large documents)
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db", # saves to disk so you don't re-embed every run
)
# Next time, load the existing store:
# vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
# Test retrieval
query = "What is the company's remote work policy?"
results = vectorstore.similarity_search(query, k=4)
for i, doc in enumerate(results):
    print(f"Result {i+1}: {doc.page_content[:200]}")
    print(f"Source: {doc.metadata}")
Step 5: Building the Retrieval Chain
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# Create a retriever from the vector store
retriever = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": 4} # retrieve top 4 chunks
)
# Custom prompt that encourages grounded answers
prompt_template = """You are a helpful assistant answering questions based on the provided context.
Use only the information in the context to answer. If the answer isn't in the context, say "I don't have that information in the provided documents."
Context:
{context}
Question: {question}
Answer:"""
PROMPT = PromptTemplate(
template=prompt_template,
input_variables=["context", "question"]
)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff", # "stuff" puts all retrieved docs in one prompt
retriever=retriever,
chain_type_kwargs={"prompt": PROMPT},
return_source_documents=True,
)
# Ask a question
result = qa_chain.invoke({"query": "What is the vacation policy?"})
print(result["result"])
print("
Sources:")
for doc in result["source_documents"]:
print(f" - {doc.metadata.get('source', 'Unknown')}, page {doc.metadata.get('page', '?')}")
Complete Working Application
Here's a full "chat with your documents" app combining everything above:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
import os
def build_rag_app(docs_dir: str, db_dir: str = "./chroma_db"):
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    if os.path.exists(db_dir) and os.listdir(db_dir):
        print("Loading existing vector store...")
        vectorstore = Chroma(persist_directory=db_dir, embedding_function=embeddings)
    else:
        print("Building vector store from documents...")
        loader = DirectoryLoader(docs_dir, glob="*.pdf", loader_cls=PyPDFLoader)
        documents = loader.load()
        splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
        chunks = splitter.split_documents(documents)
        vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory=db_dir)
        print(f"Indexed {len(chunks)} chunks from {len(documents)} pages")
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0, streaming=True)
    memory = ConversationBufferMemory(
        memory_key="chat_history",
        return_messages=True,
        output_key="answer",
    )
    chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
        memory=memory,
        return_source_documents=True,
    )
    return chain
def main():
    chain = build_rag_app("./documents")
    print("Document Q&A ready! Type 'quit' to exit.\n")
    while True:
        question = input("You: ").strip()
        if question.lower() in ("quit", "exit"):
            break
        result = chain.invoke({"question": question})
        print(f"Assistant: {result['answer']}")
        sources = set(d.metadata.get("source", "") for d in result.get("source_documents", []))
        if sources:
            print(f"(Sources: {', '.join(sources)})\n")

if __name__ == "__main__":
    main()
Evaluating RAG Quality
Measure your RAG system with three metrics:
- Faithfulness: Does the answer only use information from retrieved chunks? (Prevents hallucination)
- Answer relevance: Does the answer address the question asked?
- Context relevance: Are the retrieved chunks actually relevant to the question?
Use the RAGAS library (pip install ragas) to automate these metrics over a small test set of question-answer pairs.
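Here is a minimal sketch of what that evaluation can look like, assuming the ragas 0.1-style evaluate API; column names and metric imports vary between ragas versions, so treat this as a template and check the docs for your installed version. The question, answer, and context strings are fabricated purely for illustration.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Tiny illustrative test set: one question, the chain's answer, and the retrieved chunks
eval_data = Dataset.from_dict({
    "question": ["What is the vacation policy?"],
    "answer": ["Full-time employees receive 20 days of paid vacation per year."],
    "contexts": [[
        "Section 4.2: Full-time employees accrue 20 days of paid vacation annually."
    ]],
})

scores = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
print(scores)  # per-metric scores between 0 and 1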
Common Problems and Solutions
- Hallucinations: Add "Only answer from the context. If uncertain, say so." to your prompt, and set temperature to 0.
- Irrelevant retrieval: Try a different chunk size, use metadata filtering (see the sketch after this list), or switch to hybrid search (keyword + semantic).
- Context window overflow: Reduce k (the number of retrieved chunks) or use a model with a larger context window.
- Slow indexing: Use batch embedding, process documents offline, cache results to disk
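For the metadata-filtering suggestion above, Chroma retrievers accept a filter inside search_kwargs. A minimal sketch, reusing the vectorstore from Step 4 and assuming your chunks carry the source metadata that PyPDFLoader attaches:
# Restrict retrieval to one document; PyPDFLoader stores the file path under "source"
filtered_retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 4,
        "filter": {"source": "company_handbook.pdf"},  # Chroma-style metadata filter
    },
)
docs = filtered_retriever.invoke("What is the remote work policy?")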
Production Considerations
- Use a managed, scalable vector database such as Pinecone or Weaviate instead of local ChromaDB for production workloads
- Add document metadata (date, author, section) to enable filtered retrieval
- Implement a re-ranking step (Cohere rerank API) to improve retrieval quality
- Cache embedding calls; re-embedding the same text is wasteful (see the sketch after this list)
- Monitor retrieval quality in production with periodic spot checks
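On the embedding-cache point, LangChain provides a CacheBackedEmbeddings wrapper that stores vectors in a byte store so identical text is only embedded once. A minimal sketch with a local on-disk cache; the cache directory name is just an example:
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore
from langchain_openai import OpenAIEmbeddings

underlying = OpenAIEmbeddings(model="text-embedding-3-small")
store = LocalFileStore("./embedding_cache")  # vectors persisted on disk, keyed by text hash

cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    underlying,
    store,
    namespace=underlying.model,  # keep caches for different embedding models separate
)

# Use it anywhere an Embeddings object is expected, e.g. when building the vector store:
# vectorstore = Chroma.from_documents(chunks, cached_embeddings, persist_directory="./chroma_db")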