Build RAG Application with LangChain and Ollama Locally
@SudhirNakka07 | September 8, 2025
What you’ll build
In this tutorial you’ll build a Retrieval‑Augmented Generation (RAG) application that runs completely on your machine. We’ll use:
- Ollama to run an LLM locally (e.g., Llama 3.1) and an embedding model.
- LangChain to load and split data, embed chunks, store vectors, retrieve context, and compose a RAG chain.
- A simple local vector store (Chroma) to persist embeddings on disk.
- An optional FastAPI endpoint and Streamlit UI to interact with your RAG.
You’ll be able to point the app at your own docs (markdown, PDF, HTML, etc.) and get grounded answers with cited context.
Prerequisites
- macOS, Linux, or Windows (WSL2 recommended on Windows)
- Python 3.10+
- Node.js is NOT required for the backend; we’ll use Python. (This website is Next.js, but the RAG demo is Python.)
- Basic terminal familiarity
Recommended hardware: 16GB RAM+ for smooth local LLM usage. Smaller models can work on lower-end machines but expect slower responses.
1) Install Ollama and pull models
- Install Ollama: https://ollama.com/download
- Pull a chat model and an embedding model:
# Chat model
ollama pull llama3.1:8b
# Embedding model (choose one; nomic or mxbai are popular)
ollama pull nomic-embed-text:latest
# or
ollama pull mxbai-embed-large:latest
Verify:
ollama list
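If you want to sanity-check that the Ollama server is running before moving on, it exposes a local HTTP API on port 11434. A small, standard-library-only check (the file name check_ollama.py is just a suggestion; /api/tags lists the models you have pulled):
# check_ollama.py: confirm the local Ollama server is reachable and list pulled models
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    models = json.load(resp)["models"]

print([m["name"] for m in models])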
2) Set up your environment
Create a fresh Python virtual environment and install dependencies.
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -U pip
pip install langchain langchain-core langchain-community langchain-text-splitters
pip install langchain-chroma chromadb
pip install pypdf beautifulsoup4 requests
# Optional: API and UI
pip install fastapi uvicorn streamlit
# Optional: evaluation
pip install ragas datasets
3) Prepare some documents
Create a folder and add a few documents:
mkdir -p data
# Add your files here, e.g.
# - data/handbook.pdf
# - data/notes.md
# - data/faq.html
LangChain supports many loaders (PDF, HTML, Markdown, Notion, Confluence, etc.). Below we’ll show a simple local loader example and a web page loader.
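For example, instead of iterating over files yourself you can bulk-load a folder with DirectoryLoader from langchain_community (a small sketch; the glob pattern and wrapped loader are just one reasonable choice):
# Sketch: load every markdown file under data/ in one call
from langchain_community.document_loaders import DirectoryLoader, TextLoader

md_loader = DirectoryLoader("data", glob="**/*.md", loader_cls=TextLoader)
md_docs = md_loader.load()
print(f"Loaded {len(md_docs)} markdown documents")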
4) Build the RAG pipeline (Python)
The modern LangChain pattern uses Runnable pipelines. We’ll:
- Load documents
- Split into chunks
- Embed with an Ollama embedding model
- Store in Chroma (local)
- Create a retriever
- Compose a prompt and chain with the LLM
Save the following as rag_app.py
# rag_app.py
from pathlib import Path
from langchain_community.document_loaders import TextLoader, PyPDFLoader, BSHTMLLoader, WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
DATA_DIR = Path("data")
DB_DIR = "./chroma_db"
# 1) Load documents (add more loaders to taste)
def load_docs():
    docs = []
    # Local text/markdown/PDF/HTML files under data/
    for p in DATA_DIR.glob("**/*"):
        suffix = p.suffix.lower()
        if suffix in {".md", ".txt"}:
            docs.extend(TextLoader(str(p), autodetect_encoding=True).load())
        elif suffix == ".pdf":
            docs.extend(PyPDFLoader(str(p)).load())
        elif suffix in {".html", ".htm"}:
            # html.parser avoids needing lxml installed
            docs.extend(BSHTMLLoader(str(p), bs_kwargs={"features": "html.parser"}).load())
    # Example: load a web page (skipped silently if offline)
    try:
        web_docs = WebBaseLoader("https://python.langchain.com/").load()
        docs.extend(web_docs)
    except Exception:
        pass
    return docs
# 2) Split
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=800, chunk_overlap=100, add_start_index=True
)
# 3) Embeddings via Ollama
# Choose your pulled embedding model name
EMBED_MODEL = "nomic-embed-text" # or "mxbai-embed-large"
embeddings = OllamaEmbeddings(model=EMBED_MODEL)
# 4) Vector store (Chroma)
vectorstore = Chroma(collection_name="local_rag", embedding_function=embeddings, persist_directory=DB_DIR)
# 5) Indexing function
def build_or_load_index():
    # Skip re-indexing if the collection is already populated;
    # delete ./chroma_db to force a full rebuild.
    if vectorstore.get(limit=1)["ids"]:
        return
    docs = load_docs()
    splits = text_splitter.split_documents(docs)
    vectorstore.add_documents(splits)
    # With persist_directory set, Chroma persists to disk automatically.
# 6) Retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
# 7) LLM via Ollama
CHAT_MODEL = "llama3.1:8b" # tune to your device
llm = Ollama(model=CHAT_MODEL)
# 8) Prompt
prompt = ChatPromptTemplate.from_template(
"""
You are a helpful assistant. Answer the user question using the provided context.
If the answer is not in the context, say you don't know.
Context:
{context}
Question: {question}
"""
)
# 9) Chain: retrieve -> prompt -> llm -> parse
rag_chain = (
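    # retriever returns a list of Documents here; they are inserted into {context}
    # via their string representation (section 5 below formats them more readably)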
{"context": retriever, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
if __name__ == "__main__":
    # First time: build the index (a no-op once ./chroma_db is populated)
    build_or_load_index()

    # Simple CLI
    print("RAG ready. Ask a question (Ctrl+C to exit).")
    while True:
        try:
            q = input("\nYou: ")
            if not q.strip():
                continue
            answer = rag_chain.invoke(q)
            print("\nAssistant:\n", answer)
        except KeyboardInterrupt:
            print("\nGoodbye!")
            break
Run it:
python rag_app.py
Ask a few questions about your docs. The first run will embed and persist vectors in ./chroma_db.
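Before adding a UI, it can help to check what the index actually returns for a query. A quick sketch using Chroma's similarity_search (the file name check_index.py is my own; importing rag_app reuses the vector store defined there):
# check_index.py: peek at the chunks the vector store returns for a query
from rag_app import vectorstore

hits = vectorstore.similarity_search("What is this document about?", k=2)
for doc in hits:
    print(doc.metadata.get("source", "unknown"))
    print(doc.page_content[:200])
    print("---")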
5) Add sources (citations)
Often you want to show where an answer came from. One simple approach is to call the retriever yourself, format the retrieved documents with numbered source labels, and pass that formatted context to the prompt.
# snippet: returning sources
from langchain_core.runnables import RunnableLambda
def format_docs(docs):
    parts = []
    for i, d in enumerate(docs, 1):
        meta = d.metadata or {}
        src = meta.get("source") or meta.get("file_path") or "unknown"
        parts.append(f"[{i}] {src}:\n{d.page_content[:500]}\n")
    return "\n".join(parts)
retrieve = RunnableLambda(lambda q: retriever.invoke(q))
chain_with_sources = (
{"context": retrieve | RunnableLambda(format_docs), "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
# Example usage
question = "Summarize key points"
answer = chain_with_sources.invoke(question)
print(answer)
Note that chain_with_sources formats source labels into the model's context, but it still returns only the answer string. For richer UIs, you will usually want to return the retrieved chunks alongside the answer as well.
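A minimal sketch of one way to do that, reusing retriever, prompt, llm, and format_docs from above (the helper answer_with_sources is my own name, not part of the original article):
# Sketch: return the answer together with its source paths so a UI can show citations
from langchain_core.output_parsers import StrOutputParser

def answer_with_sources(question: str) -> dict:
    docs = retriever.invoke(question)
    answer = (prompt | llm | StrOutputParser()).invoke(
        {"context": format_docs(docs), "question": question}
    )
    return {
        "answer": answer,
        "sources": [d.metadata.get("source", "unknown") for d in docs],
    }

result = answer_with_sources("Summarize key points")
print(result["answer"])
print("Sources:", result["sources"])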
6) Serve with FastAPI (optional)
Create a small API to query your RAG.
# api.py
from fastapi import FastAPI
from pydantic import BaseModel
from rag_app import rag_chain, build_or_load_index
app = FastAPI()
class Query(BaseModel):
    question: str

# Ensure index is ready at startup
build_or_load_index()

@app.post("/ask")
def ask(q: Query):
    answer = rag_chain.invoke(q.question)
    return {"answer": answer}
Run:
uvicorn api:app --reload --port 8000
Then POST a question:
curl -X POST localhost:8000/ask -H 'Content-Type: application/json' \
-d '{"question": "What are the key topics?"}'
7) Simple Streamlit UI (optional)
# ui.py
import streamlit as st
from rag_app import rag_chain, build_or_load_index
st.set_page_config(page_title="Local RAG with Ollama", page_icon="🦙")
# Build index on first run
build_or_load_index()
st.title("Local RAG with Ollama + LangChain")
question = st.text_input("Ask a question about your docs")
if st.button("Ask") and question:
with st.spinner("Thinking..."):
answer = rag_chain.invoke(question)
st.markdown("### Answer")
st.write(answer)
Run:
streamlit run ui.py
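If you also want the UI to show where each answer came from, here is a sketch of a variant (the file name ui_sources.py is my own; it reuses the module-level retriever from rag_app.py):
# ui_sources.py: like ui.py, but also shows the retrieved chunks under the answer
import streamlit as st
from rag_app import rag_chain, retriever, build_or_load_index

st.set_page_config(page_title="Local RAG with Ollama", page_icon="🦙")
build_or_load_index()
st.title("Local RAG with Ollama + LangChain")

question = st.text_input("Ask a question about your docs")
if st.button("Ask") and question:
    with st.spinner("Thinking..."):
        answer = rag_chain.invoke(question)
        docs = retriever.invoke(question)
    st.markdown("### Answer")
    st.write(answer)
    st.markdown("### Sources")
    for i, d in enumerate(docs, 1):
        with st.expander(f"[{i}] {d.metadata.get('source', 'unknown')}"):
            st.write(d.page_content)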
Tips and troubleshooting
- Model choice: If llama3.1:8b is slow, try a smaller variant or a Q4_K_M quantized build. For higher quality, try llama3.1:70b (requires a strong GPU, or patience on CPU).
- Embeddings: nomic-embed-text and mxbai-embed-large both work well for general text. Use the same embedding model for indexing and querying.
- Chunking: Tune chunk_size and chunk_overlap to your content; code or tables often need smaller chunks.
- Persistence: Chroma's persist_directory stores your vectors on disk. Back up the folder to reuse the index.
- Freshness: build_or_load_index() only indexes once; after adding or updating docs, delete ./chroma_db and re-run it, or implement incremental updates for larger corpora (see the sketch after this list).
- Safety: The prompt tells the model to say "I don't know" when the answer isn't in the context; keep that guardrail when you customize it.
- Caching: If you see repeated queries, add LangChain's LLM cache (for example, set_llm_cache with InMemoryCache) so identical calls are not recomputed.
- Evaluation: Use RAGAS to score faithfulness, answer relevancy, and context precision (see the next section).
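One simple incremental-update strategy (a sketch, not from the original article) is to give each chunk a deterministic, content-derived ID; Chroma upserts by ID, so re-adding unchanged chunks does not create duplicates:
# Sketch: content-hash IDs make re-indexing idempotent for unchanged chunks
import hashlib
from rag_app import load_docs, text_splitter, vectorstore

splits = text_splitter.split_documents(load_docs())
ids = [hashlib.sha256(d.page_content.encode("utf-8")).hexdigest() for d in splits]
vectorstore.add_documents(splits, ids=ids)
This does not remove chunks whose source files were deleted; for full synchronization, look at LangChain's indexing API (langchain.indexes.index with a record manager).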
RAG evaluation with RAGAS (optional)
Quick sketch to evaluate a few Q/A pairs. Note that the RAGAS metrics themselves call a judge LLM (OpenAI by default, so set an API key, or configure RAGAS to use a local model):
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Each example needs the question, the generated answer, the retrieved contexts,
# and (for context_precision) a reference answer
examples = [
    {
        "question": "What is LangChain?",
        "answer": "LangChain is a framework for building LLM applications...",
        "contexts": ["LangChain provides...", "It supports retrievers..."],
        "ground_truth": "LangChain is a framework for developing applications powered by LLMs.",
    }
]

dataset = Dataset.from_list(examples)
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)
Where to go next
- Add multi-turn memory so your assistant keeps context across questions.
- Introduce tools (web search, code execution) for agentic retrieval and reasoning.
- Swap Chroma for another vector store (FAISS, Qdrant, Milvus, PGVector) with nearly identical code; see the sketch below.
- Build a richer UI with citations, collapsible sources, and per-chunk metadata.
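For instance, swapping Chroma for FAISS mostly changes the vector-store setup (a sketch; it assumes pip install faiss-cpu and reuses load_docs, text_splitter, and embeddings from rag_app.py):
# Sketch: the same pipeline with FAISS instead of Chroma
from langchain_community.vectorstores import FAISS
from rag_app import load_docs, text_splitter, embeddings

splits = text_splitter.split_documents(load_docs())
vectorstore = FAISS.from_documents(splits, embeddings)
vectorstore.save_local("faiss_index")  # persist to disk
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Reload later with:
# FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)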
References
1. DZone: Build a RAG App With LangChain and Local LLMs (Ollama): https://dzone.com/articles/rag-app-langchain-local-llms-ollama
2. LangChain Docs: https://python.langchain.com/
3. Ollama: https://ollama.com/
4. Chroma: https://docs.trychroma.com/
5. RAGAS: https://github.com/explodinggradients/ragas