What Is RAG — and Why Does It Matter?
Large language models have a fundamental limitation: they only know what was in their training data. If you ask GPT-4 about your company's internal policies, last week's earnings call, or a newly published research paper, it either hallucinates an answer or admits it doesn't know.
Retrieval-Augmented Generation (RAG) solves this. Instead of relying solely on the model's training, RAG retrieves relevant documents from a knowledge base and provides them as context in the prompt. The model then generates answers grounded in your data. This approach is now the standard architecture for enterprise AI assistants, document Q&A systems, and custom knowledge bases.
The RAG Pipeline
RAG has two phases, previewed in code right after these lists:
Indexing (one-time setup):
- Load: Read documents (PDFs, Word files, web pages, databases)
- Split: Break documents into smaller chunks (500–1000 tokens each)
- Embed: Convert each chunk to a vector (numerical representation of meaning)
- Store: Save vectors in a vector database (ChromaDB, Pinecone, Weaviate)
Querying (at runtime):
- Embed query: Convert the user's question to a vector
- Retrieve: Find the top-k most similar document chunks
- Generate: Send retrieved chunks + question to the LLM for a grounded answer
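Before diving into each step, here is the whole pipeline compressed into a handful of LangChain calls, using the same example file as Step 1. Treat it as a preview rather than a complete program; every call is unpacked in the steps that follow.
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter

# --- Indexing (one-time) ---
documents = PyPDFLoader("company_handbook.pdf").load()                      # Load
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)                                # Split
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings)                     # Embed + Store

# --- Querying (runtime) ---
question = "What is the vacation policy?"
relevant_chunks = vectorstore.similarity_search(question, k=4)              # Embed query + Retrieve
# Generate: send relevant_chunks + question to the LLM (wired up in Step 5)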
Setup
pip install langchain langchain-openai langchain-community chromadb pypdf tiktoken
export OPENAI_API_KEY="sk-..."
Step 1: Loading Documents
LangChain provides DocumentLoaders for almost every file format:
from langchain_community.document_loaders import (
PyPDFLoader,
TextLoader,
WebBaseLoader,
DirectoryLoader,
UnstructuredWordDocumentLoader,
)
# Load a PDF
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} pages")
# Load all PDFs in a folder
loader = DirectoryLoader("./docs/", glob="*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()
# Load a web page
loader = WebBaseLoader("https://example.com/article")
documents = loader.load()
# Each document has: page_content (str) and metadata (dict with source, page, etc.)
print(documents[0].page_content[:200])
print(documents[0].metadata)
Step 2: Text Splitting
Whole documents are too coarse to retrieve and usually too large to fit in a single prompt, so we split them into overlapping chunks. The overlap (typically 10–20% of chunk size) ensures context isn't lost at chunk boundaries:
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # target chunk size in characters
    chunk_overlap=200,      # overlap between chunks
    length_function=len,
    separators=["\n\n", "\n", " ", ""],  # split on paragraphs first, then lines, etc.
)
chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")
print(f"Sample chunk: {chunks[0].page_content[:300]}")
Chunk size tips: For dense technical documents, use smaller chunks (500–700 chars). For narrative text, larger chunks (1000–1500 chars) preserve more context. Always tune for your specific use case.
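As a concrete illustration, those two profiles might look like the following splitter configurations; the numbers are starting points to tune against your own retrieval results, not recommendations.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Dense technical material: smaller, tightly focused chunks
technical_splitter = RecursiveCharacterTextSplitter(chunk_size=600, chunk_overlap=100)

# Narrative text: larger chunks that preserve surrounding context
narrative_splitter = RecursiveCharacterTextSplitter(chunk_size=1400, chunk_overlap=200)

technical_chunks = technical_splitter.split_documents(documents)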
Step 3: Creating Embeddings
Embeddings transform text into high-dimensional vectors that encode semantic meaning. Similar concepts have similar vectors, enabling similarity search.
from langchain_openai import OpenAIEmbeddings
# OpenAI's text-embedding-3-small is cheap and good for most use cases
# text-embedding-3-large is higher quality but several times more expensive
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Test: embed a single string
vector = embeddings.embed_query("What is the vacation policy?")
print(f"Vector dimension: {len(vector)}") # 1536 for text-embedding-3-small
For open-source embeddings (no API cost), use HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5") — nearly as good for English text.
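To see "similar concepts have similar vectors" in action, embed a few sentences and compare them with cosine similarity. The sentences below are made up purely for illustration:
import numpy as np
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = embeddings.embed_query("How many vacation days do employees get?")
related_vec = embeddings.embed_query("Staff receive 20 days of paid leave per year.")
unrelated_vec = embeddings.embed_query("The server room is on the third floor.")

print(cosine_similarity(query_vec, related_vec))    # should be noticeably higher...
print(cosine_similarity(query_vec, unrelated_vec))  # ...than this score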
Step 4: Vector Store with ChromaDB
from langchain_community.vectorstores import Chroma
# Create vector store and embed all chunks (this takes a minute for large documents)
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db", # saves to disk so you don't re-embed every run
)
# Next time, load the existing store:
# vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
# Test retrieval
query = "What is the company's remote work policy?"
results = vectorstore.similarity_search(query, k=4)
for i, doc in enumerate(results):
    print(f"Result {i+1}: {doc.page_content[:200]}")
    print(f"Source: {doc.metadata}")
Step 5: Building the Retrieval Chain
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# Create a retriever from the vector store
retriever = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": 4} # retrieve top 4 chunks
)
# Custom prompt that encourages grounded answers
prompt_template = """You are a helpful assistant answering questions based on the provided context.
Use only the information in the context to answer. If the answer isn't in the context, say "I don't have that information in the provided documents."
Context:
{context}
Question: {question}
Answer:"""
PROMPT = PromptTemplate(
template=prompt_template,
input_variables=["context", "question"]
)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff", # "stuff" puts all retrieved docs in one prompt
retriever=retriever,
chain_type_kwargs={"prompt": PROMPT},
return_source_documents=True,
)
# Ask a question
result = qa_chain.invoke({"query": "What is the vacation policy?"})
print(result["result"])
print("
Sources:")
for doc in result["source_documents"]:
print(f" - {doc.metadata.get('source', 'Unknown')}, page {doc.metadata.get('page', '?')}")
Complete Working Application
Here's a full "chat with your documents" app combining everything above:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
import os
def build_rag_app(docs_dir: str, db_dir: str = "./chroma_db"):
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    if os.path.exists(db_dir) and os.listdir(db_dir):
        print("Loading existing vector store...")
        vectorstore = Chroma(persist_directory=db_dir, embedding_function=embeddings)
    else:
        print("Building vector store from documents...")
        loader = DirectoryLoader(docs_dir, glob="*.pdf", loader_cls=PyPDFLoader)
        documents = loader.load()
        splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
        chunks = splitter.split_documents(documents)
        vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory=db_dir)
        print(f"Indexed {len(chunks)} chunks from {len(documents)} pages")
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0, streaming=True)
    memory = ConversationBufferMemory(
        memory_key="chat_history",
        return_messages=True,
        output_key="answer",
    )
    chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
        memory=memory,
        return_source_documents=True,
    )
    return chain
def main():
    chain = build_rag_app("./documents")
    print("Document Q&A ready! Type 'quit' to exit.\n")
    while True:
        question = input("You: ").strip()
        if question.lower() in ("quit", "exit"):
            break
        result = chain.invoke({"question": question})
        print(f"Assistant: {result['answer']}")
        sources = set(d.metadata.get("source", "") for d in result.get("source_documents", []))
        if sources:
            print(f"(Sources: {', '.join(sources)})\n")

if __name__ == "__main__":
    main()
Evaluating RAG Quality
Measure your RAG system with three metrics:
- Faithfulness: Does the answer only use information from retrieved chunks? (Prevents hallucination)
- Answer relevance: Does the answer address the question asked?
- Context relevance: Are the retrieved chunks actually relevant to the question?
Use the RAGAS library (pip install ragas) to automate these metrics over a small test set of question-answer pairs.
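Here is a minimal sketch of what that evaluation can look like, assuming the ragas 0.1-style evaluate API; column names and metric imports vary between ragas versions, so treat this as a template and check the docs for your installed version. The question, answer, and context strings are fabricated purely for illustration.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Tiny illustrative test set: one question, the chain's answer, and the retrieved chunks
eval_data = Dataset.from_dict({
    "question": ["What is the vacation policy?"],
    "answer": ["Full-time employees receive 20 days of paid vacation per year."],
    "contexts": [[
        "Section 4.2: Full-time employees accrue 20 days of paid vacation annually."
    ]],
})

scores = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
print(scores)  # per-metric scores between 0 and 1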
Common Problems and Solutions
- Hallucinations: Add "Only answer from the context. If uncertain, say so." to your prompt, and set temperature to 0.
- Irrelevant retrieval: Try a different chunk size, use metadata filtering (see the sketch after this list), or switch to hybrid search (keyword + semantic).
- Context window overflow: Reduce k (the number of retrieved chunks) or use a model with a larger context window.
- Slow indexing: Use batch embedding, process documents offline, cache results to disk
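For the metadata-filtering suggestion above, Chroma retrievers accept a filter inside search_kwargs. A minimal sketch, reusing the vectorstore from Step 4 and assuming your chunks carry the source metadata that PyPDFLoader attaches:
# Restrict retrieval to one document; PyPDFLoader stores the file path under "source"
filtered_retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 4,
        "filter": {"source": "company_handbook.pdf"},  # Chroma-style metadata filter
    },
)
docs = filtered_retriever.invoke("What is the remote work policy?")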
Production Considerations
- Use a managed, scalable vector database such as Pinecone or Weaviate instead of local ChromaDB for production workloads
- Add document metadata (date, author, section) to enable filtered retrieval
- Implement a re-ranking step (Cohere rerank API) to improve retrieval quality
- Cache embedding calls; re-embedding the same text is wasteful (see the sketch after this list)
- Monitor retrieval quality in production with periodic spot checks
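On the embedding-cache point, LangChain provides a CacheBackedEmbeddings wrapper that stores vectors in a byte store so identical text is only embedded once. A minimal sketch with a local on-disk cache; the cache directory name is just an example:
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore
from langchain_openai import OpenAIEmbeddings

underlying = OpenAIEmbeddings(model="text-embedding-3-small")
store = LocalFileStore("./embedding_cache")  # vectors persisted on disk, keyed by text hash

cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    underlying,
    store,
    namespace=underlying.model,  # keep caches for different embedding models separate
)

# Use it anywhere an Embeddings object is expected, e.g. when building the vector store:
# vectorstore = Chroma.from_documents(chunks, cached_embeddings, persist_directory="./chroma_db")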