    API & Development · Advanced

    Building RAG Applications with LangChain

    Build production-ready Retrieval-Augmented Generation applications. Learn to load documents, create embeddings, set up vector stores, and build a complete chat-with-your-documents system.

    AgentShelf Team · February 18, 2025 · 20 min read

    What Is RAG — and Why Does It Matter?

    Large language models have a fundamental limitation: they only know what was in their training data. If you ask GPT-4 about your company's internal policies, last week's earnings call, or a newly published research paper, it either hallucinates an answer or admits it doesn't know.

    Retrieval-Augmented Generation (RAG) solves this. Instead of relying solely on the model's training, RAG retrieves relevant documents from a knowledge base and provides them as context in the prompt. The model then generates answers grounded in your data. This approach is now the standard architecture for enterprise AI assistants, document Q&A systems, and custom knowledge bases.

    The RAG Pipeline

    RAG has two phases:

    Indexing (one-time setup):

    1. Load: Read documents (PDFs, Word files, web pages, databases)
    2. Split: Break documents into smaller chunks (500–1000 tokens each)
    3. Embed: Convert each chunk to a vector (numerical representation of meaning)
    4. Store: Save vectors in a vector database (ChromaDB, Pinecone, Weaviate)

    Querying (at runtime):

    1. Embed query: Convert the user's question to a vector
    2. Retrieve: Find the top-k most similar document chunks
    3. Generate: Send retrieved chunks + question to the LLM for a grounded answer
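
    Before walking through each stage in detail, here's the whole pipeline compressed into a few lines using the same components covered below. Treat it as a map rather than a finished app; the file name is a placeholder and error handling is omitted.

    # Condensed sketch of both phases (each step is expanded in the sections below).
    # Assumes the packages from the Setup section are installed and OPENAI_API_KEY is set.
    from langchain_community.document_loaders import PyPDFLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_openai import OpenAIEmbeddings, ChatOpenAI
    from langchain_community.vectorstores import Chroma
    
    # Indexing: load -> split -> embed -> store
    docs = PyPDFLoader("company_handbook.pdf").load()
    chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_documents(docs)
    store = Chroma.from_documents(chunks, OpenAIEmbeddings(model="text-embedding-3-small"))
    
    # Querying: embed the question -> retrieve top-k chunks -> generate a grounded answer
    question = "What is the vacation policy?"
    context = "\n\n".join(d.page_content for d in store.similarity_search(question, k=4))
    answer = ChatOpenAI(model="gpt-4o-mini").invoke(
        f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    )
    print(answer.content)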

    Setup

    pip install langchain langchain-openai langchain-community chromadb pypdf tiktoken
    
    export OPENAI_API_KEY="sk-..."

    Step 1: Loading Documents

    LangChain provides DocumentLoaders for almost every file format:

    from langchain_community.document_loaders import (
        PyPDFLoader,
        TextLoader,
        WebBaseLoader,
        DirectoryLoader,
        UnstructuredWordDocumentLoader,
    )
    
    # Load a PDF
    loader = PyPDFLoader("company_handbook.pdf")
    documents = loader.load()
    print(f"Loaded {len(documents)} pages")
    
    # Load all PDFs in a folder
    loader = DirectoryLoader("./docs/", glob="*.pdf", loader_cls=PyPDFLoader)
    documents = loader.load()
    
    # Load a web page
    loader = WebBaseLoader("https://example.com/article")
    documents = loader.load()
    
    # Each document has: page_content (str) and metadata (dict with source, page, etc.)
    print(documents[0].page_content[:200])
    print(documents[0].metadata)

    Step 2: Text Splitting

    Raw documents are too large to fit in a single prompt. We split them into overlapping chunks. The overlap (typically 10–20% of chunk size) ensures context isn't lost at chunk boundaries:

    from langchain.text_splitter import RecursiveCharacterTextSplitter
    
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,      # target chunk size in characters
        chunk_overlap=200,    # overlap between chunks
        length_function=len,
        separators=["
    
    ", "
    ", " ", ""]  # split on paragraphs first, then lines, etc.
    )
    
    chunks = splitter.split_documents(documents)
    print(f"Split into {len(chunks)} chunks")
    print(f"Sample chunk: {chunks[0].page_content[:300]}")

    Chunk size tips: For dense technical documents, use smaller chunks (500–700 chars). For narrative text, larger chunks (1000–1500 chars) preserve more context. Always tune for your specific use case.
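
    As a rough starting point (the numbers here are illustrative, not prescriptive), the two regimes can be expressed as two splitter configurations, and a quick look at chunk count and average length tells you whether a setting is reasonable:

    from langchain.text_splitter import RecursiveCharacterTextSplitter
    
    # Dense technical material: smaller chunks keep each vector focused on one idea
    technical_splitter = RecursiveCharacterTextSplitter(chunk_size=600, chunk_overlap=100)
    
    # Narrative text: larger chunks preserve more surrounding context
    narrative_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
    
    # Sanity-check a setting by looking at chunk count and average chunk length
    chunks = technical_splitter.split_documents(documents)
    avg_len = sum(len(c.page_content) for c in chunks) / len(chunks)
    print(f"{len(chunks)} chunks, average {avg_len:.0f} characters")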

    Step 3: Creating Embeddings

    Embeddings transform text into high-dimensional vectors that encode semantic meaning. Similar concepts have similar vectors, enabling similarity search.

    from langchain_openai import OpenAIEmbeddings
    
    # OpenAI's text-embedding-3-small is cheap and good for most use cases
    # text-embedding-3-large is higher quality but 5x more expensive
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    
    # Test: embed a single string
    vector = embeddings.embed_query("What is the vacation policy?")
    print(f"Vector dimension: {len(vector)}")  # 1536 for text-embedding-3-small

    For open-source embeddings (no API cost), use HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5") — nearly as good for English text.
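
    To make "similar meaning, similar vectors" concrete, here's a quick check you can run with the embeddings object above. It uses plain cosine similarity via numpy (which ships as a dependency of chromadb); the exact scores will vary by model, but related phrasings should score noticeably higher than unrelated ones.

    import numpy as np
    
    def cosine_similarity(a, b):
        a, b = np.array(a), np.array(b)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    
    v1 = embeddings.embed_query("How many vacation days do employees get?")
    v2 = embeddings.embed_query("What is the annual paid time off allowance?")
    v3 = embeddings.embed_query("The quarterly sales report is due Friday.")
    
    print(cosine_similarity(v1, v2))  # related phrasings: relatively high
    print(cosine_similarity(v1, v3))  # unrelated text: noticeably lower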

    Step 4: Vector Store with ChromaDB

    from langchain_community.vectorstores import Chroma
    
    # Create vector store and embed all chunks (this takes a minute for large documents)
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory="./chroma_db",  # saves to disk so you don't re-embed every run
    )
    
    # Next time, load the existing store:
    # vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
    
    # Test retrieval
    query = "What is the company's remote work policy?"
    results = vectorstore.similarity_search(query, k=4)
    for i, doc in enumerate(results):
        print(f"Result {i+1}: {doc.page_content[:200]}")
        print(f"Source: {doc.metadata}")

    Step 5: Building the Retrieval Chain

    from langchain_openai import ChatOpenAI
    from langchain.chains import RetrievalQA
    from langchain.prompts import PromptTemplate
    
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    
    # Create a retriever from the vector store
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 4}  # retrieve top 4 chunks
    )
    
    # Custom prompt that encourages grounded answers
    prompt_template = """You are a helpful assistant answering questions based on the provided context.
    Use only the information in the context to answer. If the answer isn't in the context, say "I don't have that information in the provided documents."
    
    Context:
    {context}
    
    Question: {question}
    
    Answer:"""
    
    PROMPT = PromptTemplate(
        template=prompt_template,
        input_variables=["context", "question"]
    )
    
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # "stuff" puts all retrieved docs in one prompt
        retriever=retriever,
        chain_type_kwargs={"prompt": PROMPT},
        return_source_documents=True,
    )
    
    # Ask a question
    result = qa_chain.invoke({"query": "What is the vacation policy?"})
    print(result["result"])
    print("
    Sources:")
    for doc in result["source_documents"]:
        print(f"  - {doc.metadata.get('source', 'Unknown')}, page {doc.metadata.get('page', '?')}")

    Complete Working Application

    Here's a full "chat with your documents" app combining everything above:

    from langchain_openai import ChatOpenAI, OpenAIEmbeddings
    from langchain_community.vectorstores import Chroma
    from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.chains import ConversationalRetrievalChain
    from langchain.memory import ConversationBufferMemory
    import os
    
    def build_rag_app(docs_dir: str, db_dir: str = "./chroma_db"):
        embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        
        if os.path.exists(db_dir) and os.listdir(db_dir):
            print("Loading existing vector store...")
            vectorstore = Chroma(persist_directory=db_dir, embedding_function=embeddings)
        else:
            print("Building vector store from documents...")
            loader = DirectoryLoader(docs_dir, glob="*.pdf", loader_cls=PyPDFLoader)
            documents = loader.load()
            splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
            chunks = splitter.split_documents(documents)
            vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory=db_dir)
            print(f"Indexed {len(chunks)} chunks from {len(documents)} pages")
        
        llm = ChatOpenAI(model="gpt-4o-mini", temperature=0, streaming=True)
        memory = ConversationBufferMemory(
            memory_key="chat_history",
            return_messages=True,
            output_key="answer"
        )
        
        chain = ConversationalRetrievalChain.from_llm(
            llm=llm,
            retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
            memory=memory,
            return_source_documents=True,
        )
        return chain
    
    def main():
        chain = build_rag_app("./documents")
        print("Document Q&A ready! Type 'quit' to exit.
    ")
        
        while True:
            question = input("You: ").strip()
            if question.lower() in ("quit", "exit"):
                break
            result = chain.invoke({"question": question})
            print(f"Assistant: {result['answer']}")
            sources = set(d.metadata.get("source", "") for d in result.get("source_documents", []))
            if sources:
                print(f"(Sources: {', '.join(sources)})
    ")
    
    if __name__ == "__main__":
        main()

    Evaluating RAG Quality

    Measure your RAG system with three metrics:

    • Faithfulness: Does the answer only use information from retrieved chunks? (Prevents hallucination)
    • Answer relevance: Does the answer address the question asked?
    • Context relevance: Are the retrieved chunks actually relevant to the question?

    Use the RAGAS library (pip install ragas) to automate evaluation with these metrics using a small test set of question-answer pairs.
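
    As a rough sketch of what that looks like, the snippet below uses RAGAS's evaluate function with two of the metrics above on a single hand-written example. The question, answer, and context here are made up, and the RAGAS API has changed across versions, so check the current docs for exact imports:

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import faithfulness, answer_relevancy
    
    # Build a small test set: run your chain on a few questions and record
    # the generated answer plus the retrieved chunk texts for each one.
    eval_data = Dataset.from_dict({
        "question": ["What is the vacation policy?"],
        "answer": ["Employees receive 20 days of paid vacation per year."],
        "contexts": [["Full-time employees accrue 20 paid vacation days annually."]],
    })
    
    scores = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
    print(scores)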

    Common Problems and Solutions

    • Hallucinations: Add "Only answer from the context. If uncertain, say so." to your prompt, and lower the temperature to 0.
    • Irrelevant retrieval: Try a different chunk size, use metadata filtering, or switch to hybrid search (keyword + semantic); see the retriever sketch after this list.
    • Context window overflow: Reduce k (the number of retrieved chunks) or use a model with a larger context window.
    • Slow indexing: Use batch embedding, process documents offline, and cache results to disk.
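
    For the retrieval problems in particular, two low-effort changes are switching the retriever to Maximal Marginal Relevance (MMR) and filtering by metadata. Here's a sketch against the vector store built earlier (the "source" value is whatever your loader recorded):

    # MMR trades a little raw similarity for diversity, which helps when the
    # top results are near-duplicates of each other.
    mmr_retriever = vectorstore.as_retriever(
        search_type="mmr",
        search_kwargs={"k": 4, "fetch_k": 20},
    )
    
    # Metadata filtering restricts the search to a subset of documents.
    filtered_retriever = vectorstore.as_retriever(
        search_kwargs={"k": 4, "filter": {"source": "company_handbook.pdf"}},
    )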

    Production Considerations

    • Use Pinecone or Weaviate instead of ChromaDB for production (managed, scalable, persistent)
    • Add document metadata (date, author, section) to enable filtered retrieval
    • Implement a re-ranking step (Cohere rerank API) to improve retrieval quality
    • Cache embedding calls, since re-embedding the same text is wasteful (see the sketch after this list)
    • Monitor retrieval quality in production with periodic spot checks
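
    One way to implement the caching point above is LangChain's CacheBackedEmbeddings wrapper, which stores each text's vector in a local key-value store so identical texts are only embedded once (the cache directory name is arbitrary):

    from langchain.embeddings import CacheBackedEmbeddings
    from langchain.storage import LocalFileStore
    from langchain_openai import OpenAIEmbeddings
    
    underlying = OpenAIEmbeddings(model="text-embedding-3-small")
    store = LocalFileStore("./embedding_cache")
    
    # Vectors are written to ./embedding_cache keyed by text, so re-indexing
    # unchanged documents makes no new embedding API calls.
    cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
        underlying, store, namespace=underlying.model
    )
    
    # Use cached_embeddings anywhere this guide uses `embeddings`, e.g.
    # Chroma.from_documents(chunks, cached_embeddings, persist_directory="./chroma_db")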