Sudhir Nakka

Build RAG Application with LangChain and Ollama Locally

September 8, 2025

What you’ll build

In this tutorial you’ll build a Retrieval‑Augmented Generation (RAG) application that runs completely on your machine. We’ll use:

- Ollama to run the chat and embedding models locally
- LangChain to wire together loading, splitting, retrieval, and generation
- Chroma as the local, persistent vector store
- FastAPI and Streamlit (optional) to put an API or a small UI in front of the chain

You’ll be able to point the app at your own docs (Markdown, PDF, HTML, etc.) and get grounded answers with cited context.

Prerequisites

You’ll need a recent Python 3 (3.10 or newer is a safe bet for current LangChain releases) and, ideally, 16 GB of RAM or more for smooth local LLM usage. Smaller models can work on lower-end machines, but expect slower responses.

1) Install Ollama and pull models

  1. Install Ollama: https://ollama.com/download
  2. Pull a chat model and an embedding model:
# Chat model
ollama pull llama3.1:8b

# Embedding model (choose one; nomic or mxbai are popular)
ollama pull nomic-embed-text:latest
# or
ollama pull mxbai-embed-large:latest

Verify:

ollama list

2) Set up your environment

Create a fresh Python virtual environment and install dependencies.

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

pip install -U pip
pip install langchain langchain-core langchain-community langchain-text-splitters
pip install langchain-ollama langchain-chroma chromadb
pip install fastapi uvicorn
pip install pypdf beautifulsoup4 requests
# Optional: evaluation
pip install ragas datasets

3) Prepare some documents

Create a folder and add a few documents:

mkdir -p data
# Add your files here, e.g.
# - data/handbook.pdf
# - data/notes.md
# - data/faq.html

LangChain supports many loaders (PDF, HTML, Markdown, Notion, Confluence, etc.). Below we’ll show a simple local loader example and a web page loader.
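
If you’d rather not enumerate file types by hand, LangChain’s DirectoryLoader can glob a folder and apply a loader class to every match; a minimal sketch (the glob pattern and loader choice here are only examples):

# sketch: bulk-load every Markdown file under data/ with DirectoryLoader
from langchain_community.document_loaders import DirectoryLoader, TextLoader

md_loader = DirectoryLoader(
    "data",
    glob="**/*.md",          # which files to pick up
    loader_cls=TextLoader,   # loader applied to each matched file
)
md_docs = md_loader.load()
print(f"Loaded {len(md_docs)} documents")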

4) Build the RAG pipeline (Python)

The modern LangChain pattern composes the pipeline from Runnables. We’ll:

  1. Load documents from data/ (plus an example web page)
  2. Split them into overlapping chunks
  3. Embed the chunks with Ollama and store them in Chroma
  4. Wire a retriever, prompt, and chat model into a chain that answers questions

Save the following as rag_app.py:

# rag_app.py
from pathlib import Path

from langchain_community.document_loaders import TextLoader, PyPDFLoader, WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

DATA_DIR = Path("data")
DB_DIR = "./chroma_db"

# 1) Load documents (add more loaders to taste)
def load_docs():
    docs = []
    # Local text/markdown
    for p in DATA_DIR.glob("**/*"):
        if p.suffix.lower() in {".md", ".txt"}:
            docs.extend(TextLoader(str(p), autodetect_encoding=True).load())
        elif p.suffix.lower() == ".pdf":
            docs.extend(PyPDFLoader(str(p)).load())

    # Example: Load a web page
    try:
        web_docs = WebBaseLoader("https://python.langchain.com/").load()
        docs.extend(web_docs)
    except Exception as exc:
        print(f"Skipping web page load: {exc}")

    return docs

# 2) Split
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800, chunk_overlap=100, add_start_index=True
)

# 3) Embeddings via Ollama
# Choose your pulled embedding model name
EMBED_MODEL = "nomic-embed-text"  # or "mxbai-embed-large"
embeddings = OllamaEmbeddings(model=EMBED_MODEL)

# 4) Vector store (Chroma)
vectorstore = Chroma(collection_name="local_rag", embedding_function=embeddings, persist_directory=DB_DIR)

# 5) Indexing function

def build_or_load_index():
    # Re-use the persisted index if it already contains vectors;
    # delete ./chroma_db to force a full rebuild.
    if vectorstore.get(limit=1)["ids"]:
        return

    docs = load_docs()
    splits = text_splitter.split_documents(docs)
    vectorstore.add_documents(splits)
    # langchain-chroma persists automatically when persist_directory is set,
    # so no explicit persist() call is needed.

# 6) Retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# 7) LLM via Ollama
CHAT_MODEL = "llama3.1:8b"  # tune to your device
llm = ChatOllama(model=CHAT_MODEL)

# 8) Prompt
prompt = ChatPromptTemplate.from_template(
    """
    You are a helpful assistant. Answer the user question using the provided context.
    If the answer is not in the context, say you don't know.

    Context:
    {context}

    Question: {question}
    """
)

# 9) Chain: retrieve -> prompt -> llm -> parse
# (the retrieved Documents are stringified into {context} as-is; section 5 shows nicer formatting)
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

if __name__ == "__main__":
    # Build the index on the first run (later runs re-use the persisted vectors)
    build_or_load_index()

    # Simple CLI
    print("RAG ready. Ask a question (Ctrl+C to exit).")
    while True:
        try:
            q = input("\nYou: ")
            if not q.strip():
                continue
            answer = rag_chain.invoke(q)
            print("\nAssistant:\n", answer)
        except KeyboardInterrupt:
            print("\nGoodbye!")
            break

Run it:

python rag_app.py

Ask a few questions about your docs. The first run will embed and persist vectors in ./chroma_db.
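
If the answers look off, check what the retriever is actually returning before blaming the model. A small sanity-check script that reuses the objects from rag_app.py (the query string is just a placeholder):

# inspect_retrieval.py -- see which chunks the vector store returns for a query
from rag_app import vectorstore, build_or_load_index

build_or_load_index()

# With Chroma's default distance metric, lower scores mean closer matches
for doc, score in vectorstore.similarity_search_with_score("your test question", k=4):
    print(f"{score:.3f}  {doc.metadata.get('source', 'unknown')}")
    print(doc.page_content[:200])
    print()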

5) Add sources (citations)

Often you want to show where an answer came from. A simple approach is to format the retrieved chunks with numbered source labels before they reach the prompt, so the model can cite them in its answer; returning the documents themselves is covered in the sketch a little further below.

# snippet: returning sources
from langchain_core.runnables import RunnableLambda

def format_docs(docs):
    parts = []
    for i, d in enumerate(docs, 1):
        meta = d.metadata or {}
        src = meta.get("source") or meta.get("file_path") or "unknown"
        parts.append(f"[{i}] {src}:\n{d.page_content[:500]}\n")
    return "\n".join(parts)

# Pipe the retriever output through format_docs before it reaches the prompt
chain_with_sources = (
    {"context": retriever | RunnableLambda(format_docs), "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Example usage
question = "Summarize key points"
answer = chain_with_sources.invoke(question)
print(answer)

For richer UIs, you can display the retrieved chunks alongside the answer.
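
If you want the documents themselves back (for example to render links or expanders in a UI), one option is the RunnableParallel pattern from the LangChain docs: run retrieval once, keep the documents, and attach the generated answer as an extra key. A sketch building on the names defined above:

# snippet: return the answer and the retrieved documents together
from langchain_core.runnables import RunnableParallel

# Answers from documents that have already been retrieved
answer_from_docs = (
    RunnablePassthrough.assign(context=lambda x: format_docs(x["context"]))
    | prompt
    | llm
    | StrOutputParser()
)

# "context" keeps the raw Documents; "answer" is added alongside them
rag_with_sources = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=answer_from_docs)

result = rag_with_sources.invoke("Summarize key points")
print(result["answer"])
for d in result["context"]:
    print("source:", d.metadata.get("source", "unknown"))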

6) Serve with FastAPI (optional)

Create a small API to query your RAG.

# api.py
from fastapi import FastAPI
from pydantic import BaseModel
from rag_app import rag_chain, build_or_load_index

app = FastAPI()

class Query(BaseModel):
    question: str

# Ensure index is ready at startup
build_or_load_index()

@app.post("/ask")
def ask(q: Query):
    answer = rag_chain.invoke(q.question)
    return {"answer": answer}

Run:

uvicorn api:app --reload --port 8000

Then POST a question:

curl -X POST localhost:8000/ask -H 'Content-Type: application/json' \
  -d '{"question": "What are the key topics?"}'

7) Simple Streamlit UI (optional)

Streamlit isn’t in the install list from step 2, so add it first (pip install streamlit), then save the following as ui.py:

# ui.py
import streamlit as st
from rag_app import rag_chain, build_or_load_index

st.set_page_config(page_title="Local RAG with Ollama", page_icon="🦙")

# Build index on first run
build_or_load_index()

st.title("Local RAG with Ollama + LangChain")
question = st.text_input("Ask a question about your docs")
if st.button("Ask") and question:
    with st.spinner("Thinking..."):
        answer = rag_chain.invoke(question)
    st.markdown("### Answer")
    st.write(answer)

Run:

streamlit run ui.py
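
To show sources in the UI as well, the rag_with_sources chain sketched in section 5 drops in nicely, assuming you move that chain into rag_app.py (the import below is hypothetical until you do):

# ui_sources.py -- variant of ui.py that also shows the retrieved chunks
import streamlit as st
from rag_app import build_or_load_index, rag_with_sources  # assumes the section-5 chain was added to rag_app.py

st.set_page_config(page_title="Local RAG with Ollama", page_icon="🦙")
build_or_load_index()

st.title("Local RAG with sources")
question = st.text_input("Ask a question about your docs")
if st.button("Ask") and question:
    with st.spinner("Thinking..."):
        result = rag_with_sources.invoke(question)
    st.markdown("### Answer")
    st.write(result["answer"])
    st.markdown("### Sources")
    for i, doc in enumerate(result["context"], 1):
        with st.expander(f"[{i}] {doc.metadata.get('source', 'unknown')}"):
            st.write(doc.page_content)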

Tips and troubleshooting

- Make sure Ollama is running (the desktop app or ollama serve) before starting the app; ollama list should respond without errors.
- If answers are slow, point CHAT_MODEL at a smaller model (for example llama3.2:3b) or lower k in the retriever.
- If you switch embedding models, delete ./chroma_db and re-index; vectors from different embedding models are not compatible.
- If Ollama runs on another host or port, ChatOllama and OllamaEmbeddings both accept a base_url argument.

RAG evaluation with RAGAS (optional)

A quick sketch to score a few question/answer/context examples. Note that RAGAS itself calls a judge LLM and embedding model behind the scenes; out of the box it expects an OpenAI API key (see the local-judge sketch after the code):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Suppose you have a list of dicts with question, answer, and contexts
examples = [
    {
        "question": "What is LangChain?",
        "answer": "LangChain is a framework for building LLM applications...",
        "contexts": ["LangChain provides...", "It supports retrievers..."]
    }
]

dataset = Dataset.from_list(examples)
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)
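
By default RAGAS uses OpenAI models as the judge, which defeats the point of a fully local stack. Recent RAGAS releases let you pass LangChain models instead; a sketch under that assumption (the wrapper class names can vary between versions, so check your installed release):

# sketch: use local Ollama models as the RAGAS judge
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from langchain_ollama import ChatOllama, OllamaEmbeddings

judge_llm = LangchainLLMWrapper(ChatOllama(model="llama3.1:8b"))
judge_emb = LangchainEmbeddingsWrapper(OllamaEmbeddings(model="nomic-embed-text"))

result = evaluate(
    dataset,  # the Dataset built above
    metrics=[faithfulness, answer_relevancy, context_precision],
    llm=judge_llm,
    embeddings=judge_emb,
)
print(result)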

Where to go next

The links below are good next stops: the LangChain docs for more loaders, retrievers, and chain patterns; the Chroma docs for persistence and metadata filtering; and RAGAS for more rigorous evaluation.

1. DZone: Build a RAG App With LangChain and Local LLMs (Ollama): https://dzone.com/articles/rag-app-langchain-local-llms-ollama

2. LangChain Docs: https://python.langchain.com/

3. Ollama: https://ollama.com/

4. Chroma: https://docs.trychroma.com/

5. RAGAS: https://github.com/explodinggradients/ragas