BEON.tech
TECHNICAL ENGINEERING

RAG with LangChain: How to Build Your First Pipeline

Matias Bustamante
Matias Bustamante

RAG with LangChain is quickly becoming one of the most useful patterns in a backend engineer’s toolkit. According to the 2025 Stack Overflow Developer Survey, RAG and Ollama rank among the top topics developers are actively exploring after adopting tools like Gemini, a clear signal that the field is moving past chatbots and into document-grounded AI applications.

The pattern solves a real problem: LLMs are powerful but static. They don’t know your codebase, your internal docs, or the PDF your team published last week. A LangChain RAG pipeline connects them to that external knowledge at query time, without fine-tuning and without sending your entire corpus to a third-party API.

In this post, we’ll break down: 

  • The rag langchain workflow step by step,
  • Implement it in Python against real documents, and 
  • Compare it to two alternative approaches, a fully local setup via Ollama and Google’s NotebookLM, so you can make an informed decision before you commit to an architecture.

So without further ado, let´s get into it. 

What RAG Actually Does Under the Hood

Before writing code, it’s worth being precise about what retrieval-augmented generation is doing, because the abstractions LangChain provides can obscure the mechanics.

At its core, RAG addresses a fundamental limitation: LLMs predict the next token based on training data. They have no access to documents you didn’t include in that training. RAG sidesteps this by injecting relevant document excerpts directly into the prompt at inference time. The model doesn’t “learn” from your documents, it reads them per query, like a developer looking something up in a spec before answering a question.

This distinction matters for understanding what RAG can and can’t do. It’s not a substitute for fine-tuning when you need behavioral change. It’s a retrieval mechanism that improves factual grounding. Choosing between RAG, fine-tuning, and agents is an architectural decision worth getting right before building anything in production.

Two concrete benefits drive most production use cases:

  • Reduced hallucinations. When the model answers from a retrieved excerpt rather than from parametric memory, the answer is anchored to something specific and verifiable. You can check the source chunk.
  • No training cost. Updating a fine-tuned model as your docs change is expensive and slow. A RAG pipeline picks up document changes the next time you re-index, typically a few seconds to a few minutes depending on corpus size.

The LangChain RAG Workflow: Three Phases

The rag langchain workflow has three distinct phases. Understanding them separately makes it easier to debug and optimize.

Phase 1 — Indexing

The document is loaded, split into chunks, converted to vector embeddings, and stored in a vector database. This is a one-time cost per document, not per query. ChromaDB, FAISS, and Pinecone are the most common choices at the vector store layer.

Phase 2 — Retrieval

When a query arrives, it’s converted into the same embedding space as the stored chunks. The vector database runs a similarity search and returns the top-k most relevant chunks. This is where chunking strategy and embedding model quality have the most impact on answer quality.

Phase 3 — Generation

The retrieved chunks are assembled into a prompt alongside the user’s question and passed to the LLM. The model generates an answer using only that context. A well-written system prompt tells the model to say “I don’t know” when the answer isn’t in the retrieved chunks, preventing it from falling back on training data.

User Query

    │

    ▼

[Embed query → vector]

    │

    ▼

[Similarity search] ──► [Vector DB: ChromaDB / FAISS]

    │

    ▼

[Top-k chunks retrieved]

    │

    ▼

[system_prompt + chunks + query → LLM]

    │

    ▼

[Grounded answer]

Alternative 1: RAG using OpenAI API

Let’s build RAG with LangChain against a real PDF. The full pipeline runs in under 50 lines of Python.

Requirements

pip install langchain langchain-community langchain-openai chromadb pypdf python-dotenv

Before you start: This pipeline uses the OpenAI API for embeddings and completions. You’ll need an OpenAI account with API credits. $5 USD Is more than enough to run all the examples in this post. 

Load your OpenAI key from a .env file — don’t hardcode it:

from dotenv import load_dotenv

load_dotenv()

Step 1 — Load the Document

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(“your_document.pdf”)

documents = loader.load()

PyPDFLoader extracts text page by page and returns Document objects, each carrying page_content and metadata (page number, source path). For multi-file corpora, DirectoryLoader handles batch loading across a folder.

Step 2 — Split into Chunks

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(

    chunk_size=1000,

    chunk_overlap=200

)

chunks = splitter.split_documents(documents)

chunk_overlap is the parameter most developers underestimate. Without overlap, a concept that spans two chunks may never be fully retrieved. A 200-character overlap means adjacent chunks share context, so boundary sentences aren’t lost. For densely technical documents, increase this to 300–400.

Step 3 — Embed and Store

from langchain_openai import OpenAIEmbeddings

from langchain_community.vectorstores import Chroma

embeddings = OpenAIEmbeddings()

vector_store = Chroma.from_documents(

    documents=chunks,

    embedding=embeddings,

    persist_directory=”./chroma_db”

)

retriever = vector_store.as_retriever()

The persist_directory parameter is worth emphasizing: it writes the vector store to disk. On subsequent runs, load it with Chroma(persist_directory=”./chroma_db”, embedding_function=embeddings) and skip re-embedding entirely. This cuts per-query cost to zero for the indexing phase.

Step 4 — Build the Chain

from langchain_openai import ChatOpenAI

from langchain.chains import create_retrieval_chain

from langchain.chains.combine_documents import create_stuff_documents_chain

from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model=”gpt-4o”)

system_prompt = “””

You are an assistant for question-answering tasks.

Use only the following retrieved context to answer.

If the answer is not in the context, say you don’t know.

Keep answers concise and specific.

Context: {context}

“””

prompt = ChatPromptTemplate.from_messages([

    (“system”, system_prompt),

    (“human”, “{input}”),

])

question_answer_chain = create_stuff_documents_chain(llm, prompt)

rag_chain = create_retrieval_chain(retriever, question_answer_chain)

The “say you don’t know” instruction in the system prompt is critical. Without it, the model will synthesize an answer from training data when the retrieved chunks don’t contain the answer — defeating the purpose of building a RAG pipeline in the first place.

Step 5 — Query It

response = rag_chain.invoke({“input”: “What does the document say about X?”})

print(response[“answer”])

The response dict also contains context, which holds the actual chunks that were retrieved. Log these during development, they’re the fastest way to debug poor answers, since bad retrieval is usually the root cause, not the LLM.

Alternative 2: Local LLM with Ollama and chatd

If sending document chunks to OpenAI’s servers isn’t acceptable, common with internal documentation, legal files, or client data, you can run the full pipeline locally using Ollama as the model server.

Ollama is a local model registry with a simple CLI. You pull and run open-source models with a single command:

ollama pull mistral

ollama pull deepseek-r1

ollama run mistral

For a no-code interface, chatd wraps Ollama with an Electron-based UI. Load a PDF, ask questions, get answers, no Python code required. It detects a running Ollama instance automatically.

Testing both Mistral and DeepSeek R1 against the same documents used in the LangChain implementation showed a clear pattern. On a densely written technical book, Mistral produced generic answers that were directionally correct but didn’t ground responses in the source material. DeepSeek R1 went deeper, organized sections, clear reasoning chains, but also didn’t surface specific quotes or page-level attribution the way the OpenAI pipeline did. On a structured technical manual, both models performed well, accurately describing features and providing activation steps.

When to use this approach: non-negotiable data privacy requirements, air-gapped environments, or cost constraints at scale. 

The tradeoff is answer quality — 7B–14B parameter local models lag behind GPT-4o on nuanced document Q&A, particularly for densely written technical content. Building with local models also introduces its own security and prompt injection risks that are worth understanding before shipping anything to users.

Alternative 3: Google NotebookLM

NotebookLM requires zero code. Upload PDFs through the web UI, start asking questions. It’s powered by Gemini 1.5 Pro and tuned specifically for document Q&A, and the output quality shows it.

What sets NotebookLM apart from the other approaches is source attribution. Responses are organized into labeled sections with inline citations pointing to the specific document chunk that generated each claim. For exploratory research, where you want to understand a document quickly and verify the AI’s reasoning, this is genuinely useful.

It also has a feature no other tool in this comparison offers: it converts your documents into a two-host podcast-style audio summary. Niche, but useful for absorbing long documents passively.

Limitations that matter in practice:

  • No control over chunking strategy, embedding model, or retrieval parameters.
  • English-only for the audio feature.
  • It’s a Google experiment: free today, no guarantee of availability tomorrow.
  • Slower under heavy load, with no SLA.

When to use this approach: fast exploratory analysis where you need high-quality answers from a document without building anything, and data privacy is not a constraint.

Choosing the Right Approach

LangChain + OpenAIOllama + chatdNotebookLM
Answer quality★★★★★★★★☆☆★★★★★
Data privacy★★★☆☆★★★★★★★☆☆☆
Customizability★★★★★★★★★☆★☆☆☆☆
Setup effortMediumMediumNone
CostAPI creditsFreeFree (for now)
Multi-document❌ One at a time
Production-ready⚠️ With work

For production systems, build RAG with LangChain. The abstractions are composable enough that you can swap the LLM (Anthropic, Gemini, local models via Ollama), the vector store (FAISS, Pinecone, pgvector), or the embedding model without rewriting the pipeline. That flexibility is what makes LangChain RAG the right default for anything that needs to scale or evolve.

For privacy-first internal tooling, go local with Ollama. For one-off document analysis with no infrastructure overhead, NotebookLM delivers quality fast.

Does RAG Become Irrelevant as Context Windows Grow?

With models like Gemini 2.5 Pro offering multi-million token context windows, it’s a fair question: does the rag langchain workflow become obsolete if you can just stuff the entire document into a prompt?

Not for three reasons that matter in production:

  • Cost. Passing a 500-page document on every query at $X per million tokens adds up fast. RAG’s chunk retrieval is token-efficient by design, you’re passing 2–5 relevant pages, not 500.
  • Precision. Retrieval focuses the model on the most relevant sections. Asking a model to reason over a 400,000-token context doesn’t guarantee it will find the right 1,000 tokens. Similarity search does.
  • Privacy. If your documents contain sensitive data and you’re running a local LLM, RAG keeps everything on your hardware. That constraint doesn’t disappear because context windows got bigger. The architectural decisions around AI in engineering workflows are part of what senior engineers need to master to stay effective in 2026.

Hybrid approaches, RAG for retrieval, large context window for synthesis, are already showing up in production architectures. The two aren’t mutually exclusive.

From Pipeline to Production: What Comes Next

The LangChain RAG pattern is one of the most practical AI implementations available right now: 

  • The setup is straightforward, 
  • The architecture is composable, and 
  • The results are measurable 

It is more than can be said for most AI patterns at the moment.

The core loop: load, chunk, embed, retrieve, generate. Maps cleanly onto LangChain’s abstractions. Once you have it working, swapping components is low-friction: different LLM, different vector store, different chunking strategy. None of those changes require rewriting the pipeline.

The LangChain documentation is the right next stop, it reads more like a guide than an API reference, which makes it unusually accessible for a framework of this scope.

If you’re a senior or mid-level engineer looking to build with LangChain RAG on real production problems, BEON.tech connects LATAM developers with US product teams working on exactly this kind of technical challenge. The projects are complex, the teams are distributed, and the stack is current.

FAQs

Does the entire PDF get sent to OpenAI on every query?

No. Only the semantically relevant chunks, typically 3–5 text segments totaling a few hundred to a thousand tokens, are sent to the LLM per query. The full document is stored locally in the vector database.

What is the best chunk size for a LangChain RAG pipeline?

It depends on the document type. A chunk size of 800–1,000 characters with 200-character overlap works well for technical documentation. For denser material like academic papers or legal text, smaller chunks (400–600 characters) with more overlap tend to improve retrieval precision.

Can I use a local LLM instead of OpenAI in a LangChain RAG workflow?

Yes. LangChain supports local models via Ollama out of the box. Replace ChatOpenAI with ChatOllama and point it at a locally running model. Answer quality will vary depending on the model size and the complexity of the document.

Is RAG with LangChain production-ready?

The LangChain framework itself is production-ready and widely deployed. Whether your specific pipeline is production-ready depends on chunking strategy, retrieval quality, error handling, and observability. LangSmith (LangChain’s companion tooling) handles tracing and evaluation for production pipelines.

How is RAG different from fine-tuning?

RAG retrieves external knowledge at inference time without changing the model’s weights. Fine-tuning updates the model’s parameters using new training data, which changes how it behaves. RAG is better for keeping answers grounded in specific documents. Fine-tuning is better for changing tone, style, or task-specific behavior.

Ready to build your team in Latin America?

Let us connect you with pre-vetted senior developers who are ready to make an impact.

Get started
Matias Bustamante
Written by Matias Bustamante