Python RAG Application Development: From Principles to Production
What Is RAG?
RAG (Retrieval-Augmented Generation) is an architectural pattern that combines external knowledge retrieval with large language model generation. It addresses three fundamental limitations of LLMs: knowledge staleness, hallucination, and inability to access private data.
Why RAG Matters in 2026
LLMs are powerful, but they come with inherent limitations:
| Problem | Description | How RAG Solves It |
|---|---|---|
| Knowledge Cutoff | Training data has a cutoff date; no access to recent info | Real-time document retrieval |
| Hallucination | Models may generate plausible but incorrect content | Answers grounded in retrieved facts |
| Private Data | Enterprise documents are never in training data | Retrieves from enterprise knowledge bases |
| Cost | Fine-tuning LLMs is extremely expensive | Only maintain a vector index |
| Traceability | LLM outputs cannot be traced to sources | Answers can cite the retrieved documents |
Basic RAG Workflow
User Query → Embedding Model → Query Embedding → Vector DB Retrieval → Retrieved Docs + Prompt → LLM (GPT) → Generated Answer
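The "Retrieved Docs + Prompt" step of the workflow can be sketched in plain Python. This is a minimal illustration, not a library API: the function name `build_prompt`, the dict keys `source` and `text`, and the instruction wording are all assumptions made for the example.

```python
def build_prompt(question: str, retrieved_docs: list[dict]) -> str:
    """Assemble a grounded prompt from retrieved chunks.

    Each doc is a dict with the hypothetical keys "source" and "text".
    """
    # Number every chunk so the model can cite it as [n]
    context_lines = [
        f"[{i}] ({doc['source']}) {doc['text']}"
        for i, doc in enumerate(retrieved_docs, start=1)
    ]
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the numbered context below. "
        "Cite sources as [n]. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


docs = [
    {"source": "doc1", "text": "RAG is retrieval-augmented generation."},
    {"source": "doc2", "text": "Vector databases store embeddings."},
]
prompt = build_prompt("What is RAG?", docs)
print(prompt)
```

Numbering the chunks is what makes answers traceable: a citation such as [2] maps back to a specific source document.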
Vector Embeddings Fundamentals
Vector embeddings convert text into high-dimensional numerical vectors, where semantically similar texts are closer in vector space.
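"Closer in vector space" is usually measured with cosine similarity. The sketch below computes it with the standard library only. The three-dimensional vectors are invented for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", made up for this example
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(cat, kitten)
sim_unrelated = cosine_similarity(cat, invoice)
print(f"cat vs kitten:  {sim_related:.3f}")
print(f"cat vs invoice: {sim_unrelated:.3f}")
```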
Embedding Model Selection
| Model | Dimensions | Max Length | Highlights | Best For |
|---|---|---|---|---|
| text-embedding-3-large | 3072 | 8191 tokens | OpenAI's latest, strongest performance | High-precision English retrieval |
| text-embedding-3-small | 1536 | 8191 tokens | Great cost-performance ratio | General English use cases |
| bge-large-zh-v1.5 | 1024 | 512 tokens | Best Chinese performance | Chinese document retrieval |
| bge-m3 | 1024 | 8192 tokens | Multilingual, dense + sparse support | Multilingual mixed scenarios |
| gte-Qwen2-7B-instruct | 3584 | 32768 tokens | Ultra-long context, strongest open-source | Long documents, complex semantics |
| Cohere embed-v4 | 1536 (default; 256 to 1536 selectable) | 128k tokens | Multimodal support | Image-text mixed retrieval |
Using OpenAI Embeddings
from openai import OpenAI
client = OpenAI(api_key="your-api-key")
response = client.embeddings.create(
model="text-embedding-3-small",
input="RAG stands for Retrieval-Augmented Generation",
dimensions=1536
)
embedding = response.data[0].embedding
print(f"Vector dimensions: {len(embedding)}") # 1536
Using Open-Source BGE Models
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-large-zh-v1.5")
embeddings = model.encode(
["RAG is retrieval-augmented generation", "Vector databases store embeddings"],
normalize_embeddings=True
)
print(f"Vector dimensions: {embeddings.shape}") # (2, 1024)
💡 Use the Base64 Encode/Decode tool to inspect and debug embedding vector transport encoding.
Document Chunking Strategies
Chunking is one of the most critical preprocessing steps in a RAG system because it directly affects retrieval quality.
Chunking Method Comparison
| Method | Principle | Pros | Cons | Recommended For |
|---|---|---|---|---|
| Fixed-size | Split by character/token count | Simple implementation | May truncate semantics | Logs, structured text |
| Recursive character | Split by separator hierarchy | Preserves paragraph integrity | Requires tuning | General documents |
| Semantic | Split by embedding similarity | Best semantic completeness | High compute cost | High-quality Q&A |
| Document structure | Split by Markdown/HTML headings | Respects document structure | Depends on document format | Structured documents |
| Sentence window | Sentence-level with context window | Precise retrieval + rich context | Complex implementation | Fine-grained Q&A |
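To make the first row of the table concrete, here is a minimal sketch of fixed-size chunking with overlap, written in plain Python. The function name `fixed_size_chunks` is hypothetical, and the sketch counts characters rather than tokens.

```python
def fixed_size_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = fixed_size_chunks(sample, chunk_size=10, chunk_overlap=3)
for c in chunks:
    print(repr(c))
```

The last three characters of each chunk reappear at the start of the next one. This is also why the method "may truncate semantics": the window boundaries ignore sentence and paragraph breaks.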
LangChain Recursive Character Splitter
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=50,
separators=["\n\n", "\n", ". ", "! ", "? ", "; ", ", ", " ", ""],
length_function=len
)
chunks = splitter.split_text(long_document)
print(f"Number of chunks: {len(chunks)}")
Semantic Chunking (Advanced)
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
splitter = SemanticChunker(
OpenAIEmbeddings(model="text-embedding-3-small"),
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=75
)
chunks = splitter.split_text(long_document)
Chunk Size Tuning Guidelines
# Rule of thumb: overlap is typically 10%-20% of chunk_size
# Short documents (FAQs, knowledge cards)
chunk_size = 200
chunk_overlap = 20
# Medium documents (technical docs, blog posts)
chunk_size = 500
chunk_overlap = 50
# Long documents (papers, legal documents)
chunk_size = 1000
chunk_overlap = 100
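The presets above also determine how many chunks (and therefore how many embedding calls and index entries) a document produces. For plain fixed-size windows the count is roughly ceil((length - overlap) / (chunk_size - overlap)). The sketch below applies that formula to a hypothetical 10,000-character document. A recursive or semantic splitter will produce somewhat different counts, so treat this as an estimate only.

```python
def estimate_chunk_count(doc_length: int, chunk_size: int, chunk_overlap: int) -> int:
    """Estimate the number of fixed-size windows needed to cover a document."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= overlap < chunk_size")
    if doc_length <= 0:
        return 0
    if doc_length <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    remaining = doc_length - chunk_overlap
    return -(-remaining // step)  # ceiling division without floats


# The three presets from the guidelines, applied to an assumed 10,000-character document
presets = {"short": (200, 20), "medium": (500, 50), "long": (1000, 100)}
counts = {
    name: estimate_chunk_count(10_000, size, overlap)
    for name, (size, overlap) in presets.items()
}
print(counts)
```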
Vector Database Selection
Mainstream Vector Database Comparison
| Database | Type | Dimensions | Persistence | Filtering | Distributed | Best For |
|---|---|---|---|---|---|---|
| Chroma | Embedded | Any | ✅ | ✅ Basic | ❌ | Prototyping, small scale |
| FAISS | In-memory | Any | ⚠️ Manual | ❌ | ❌ | High-performance single-node |
| Pinecone | Cloud | Any | ✅ | ✅ Full | ✅ | Production, zero ops |
| Milvus | Standalone | Any | ✅ | ✅ Full | ✅ | Large-scale enterprise |
| Qdrant | Standalone | Any | ✅ | ✅ Full | ✅ | Rust high performance |
| Weaviate | Standalone | Any | ✅ | ✅ GraphQL | ✅ | Hybrid search |
| pgvector | PG extension | ≤2000 (indexed `vector` type) | ✅ | ✅ SQL | ✅ | Existing PostgreSQL infra |
Using Chroma (Quick Start)
import chromadb
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
name="knowledge_base",
metadata={"hnsw:space": "cosine"}
)
collection.add(
documents=["RAG is retrieval-augmented generation", "Vector databases store embeddings"],
metadatas=[{"source": "doc1"}, {"source": "doc2"}],
ids=["id1", "id2"]
)
results = collection.query(
query_texts=["What is RAG?"],
n_results=3
)
print(results["documents"])
Using FAISS (High Performance)
import faiss
import numpy as np
dimension = 1024
index = faiss.IndexFlatIP(dimension)
embeddings = np.random.rand(1000, dimension).astype("float32")
faiss.normalize_L2(embeddings)
index.add(embeddings)
query = np.random.rand(1, dimension).astype("float32")
faiss.normalize_L2(query)
distances, indices = index.search(query, k=5)
print(f"Top-5 indices: {indices}")
print(f"Top-5 similarities: {distances}")
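Conceptually, a flat inner-product index scores the query against every stored vector and keeps the best k. Once the vectors are L2-normalized, the inner product equals cosine similarity. The pure-Python sketch below illustrates that idea only. It is not how FAISS is implemented, and the function names are hypothetical.

```python
import math


def l2_normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def flat_ip_search(index_vectors: list[list[float]], query: list[float], k: int):
    """Exhaustive inner-product search: score every stored vector, return the top k."""
    scored = [
        (sum(a * b for a, b in zip(vec, query)), i)
        for i, vec in enumerate(index_vectors)
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # highest score first
    top = scored[:k]
    return [score for score, _ in top], [i for _, i in top]


stored = [l2_normalize(v) for v in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]]
query_vec = l2_normalize([2.0, 1.0])
similarities, ids = flat_ip_search(stored, query_vec, k=2)
print(ids, [round(s, 3) for s in similarities])
```

Exhaustive search is exact but scales linearly with the number of vectors, which is why larger deployments move to approximate indexes.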
Using Pinecone (Production-Grade)
from pinecone import Pinecone, ServerlessSpec
pc = Pinecone(api_key="your-api-key")
index_name = "rag-knowledge"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )
index = pc.Index(index_name)
index.upsert(vectors=[
{"id": "doc1", "values": [0.1] * 1536, "metadata": {"source": "wiki"}},
{"id": "doc2", "values": [0.2] * 1536, "metadata": {"source": "blog"}}
])
results = index.query(
vector=[0.15] * 1536,
top_k=5,
filter={"source": {"$eq": "wiki"}}
)
Building a RAG Pipeline
Complete RAG with LangChain
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import TextLoader
loader = TextLoader("knowledge.txt", encoding="utf-8")
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=50
)
chunks = splitter.split_documents(documents)
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
persist_directory="./chroma_db"
)
llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
return_source_documents=True
)
result = qa_chain.invoke({"query": "What is RAG?"})
print(result["result"])
print(f"Sources: {[doc.metadata for doc in result['source_documents']]}")
Complete RAG with LlamaIndex
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI as LlamaOpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
Settings.llm = LlamaOpenAI(model="gpt-4o", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(
similarity_top_k=4,
response_mode="tree_summarize"
)
response = query_engine.query("What are the core advantages of RAG?")
print(response)
print(f"Source nodes: {[n.metadata for n in response.source_nodes]}")
Retrieval Optimization Strategies
1. Hybrid Search
Combining dense (semantic) retrieval with sparse (keyword) retrieval often outperforms either method alone, because keyword matching catches exact terms that embeddings can miss:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
bm25_retriever = BM25Retriever.from_documents(chunks, k=5)
# as_retriever takes the result count via search_kwargs, not a bare k argument
vector_retriever = Chroma.from_documents(
    chunks, OpenAIEmbeddings()
).as_retriever(search_kwargs={"k": 5})
ensemble_retriever = EnsembleRetriever(
retrievers=[bm25_retriever, vector_retriever],
weights=[0.4, 0.6]
)
results = ensemble_retriever.invoke("RAG optimization methods")
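To show what merging two ranked lists with weights can look like, here is a minimal sketch of weighted Reciprocal Rank Fusion (RRF), one common fusion scheme. It illustrates the idea and is not the library's implementation. The document IDs are made up, and the constant `c=60` is a conventional default assumed here.

```python
def reciprocal_rank_fusion(rankings, weights=None, c=60):
    """Fuse ranked lists of doc IDs (best first) into one ranking.

    Each list contributes weight / (c + rank) to every document it contains.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    if len(weights) != len(rankings):
        raise ValueError("need exactly one weight per ranking")
    fused = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + weight / (c + rank)
    # Highest fused score first; ties broken by doc ID for determinism
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))


bm25_ranking = ["d3", "d1", "d2"]    # hypothetical keyword results
vector_ranking = ["d1", "d4", "d3"]  # hypothetical semantic results
fused_ids = reciprocal_rank_fusion([bm25_ranking, vector_ranking], weights=[0.4, 0.6])
print(fused_ids)
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on different scales.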
2. Reranking
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("BAAI/bge-reranker-large")
query = "How to optimize RAG retrieval?"
candidates = ["Candidate doc 1...", "Candidate doc 2...", "Candidate doc 3..."]
pairs = [[query, doc] for doc in candidates]
scores = reranker.predict(pairs)
ranked = sorted(zip(scores, candidates), reverse=True)
print(f"Reranked results: {ranked}")
3. Query Rewriting
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
rewrite_template = ChatPromptTemplate.from_messages([
("system", "You are a query rewriting assistant. Rewrite vague user questions into more precise retrieval queries."),
("human", "Original question: {question}\nGenerate 3 retrieval queries from different angles:")
])
rewrite_chain = rewrite_template | ChatOpenAI(model="gpt-4o", temperature=0)
rewritten = rewrite_chain.invoke({"question": "How do I use RAG?"})
print(rewritten.content)
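The rewritten queries come back as one block of text, so they need to be split before each one is sent to the retriever. The sketch below assumes the model returns one query per line, optionally numbered or bulleted. That output format is an assumption, and `parse_rewritten_queries` is a hypothetical helper name.

```python
import re


def parse_rewritten_queries(llm_output: str, max_queries: int = 3) -> list[str]:
    """Extract one query per line, stripping list markers and duplicates."""
    queries = []
    for line in llm_output.splitlines():
        # Remove leading markers such as "1.", "2)", "-", "*" or "•"
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip()
        if cleaned and cleaned not in queries:
            queries.append(cleaned)
    return queries[:max_queries]


# Hypothetical model output, written by hand for this example
sample_output = """1. What is retrieval-augmented generation?
2. How do I build a RAG pipeline in Python?
- Which vector database works with RAG?
"""
queries = parse_rewritten_queries(sample_output)
for q in queries:
    print(q)
```

Each parsed query can then be retrieved separately and the results merged, deduplicating chunks that several queries return.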
4. Performance Benchmarks
| Configuration | Retrieval Method | Dataset | Recall@5 | MRR | Latency (ms) |
|---|---|---|---|---|---|
| Basic RAG | Dense | MS MARCO | 0.72 | 0.58 | 120 |
| + BM25 Hybrid | Hybrid | MS MARCO | 0.81 | 0.67 | 180 |
| + Reranker | Hybrid + Rerank | MS MARCO | 0.89 | 0.78 | 350 |
| + Query Rewrite | Full optimization | MS MARCO | 0.92 | 0.83 | 420 |
| Basic RAG | Dense | Chinese CMedQA | 0.65 | 0.51 | 150 |
| + BGE Rerank | Dense + Rerank | Chinese CMedQA | 0.84 | 0.73 | 400 |
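Recall@5 and MRR in the table can be computed on your own evaluation set with a few lines of Python. The sketch below uses three hand-made queries with invented document IDs. It shows how the metrics are defined, not how the numbers in the table were produced.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        raise ValueError("each query needs at least one relevant document")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# (retrieved IDs in rank order, set of relevant IDs); hypothetical data
eval_runs = [
    (["d1", "d7", "d3", "d9", "d2"], {"d3", "d2"}),
    (["d5", "d4", "d8", "d6", "d0"], {"d4", "d1"}),
    (["d9", "d8", "d7", "d6", "d5"], {"d1"}),
]
mean_recall = sum(recall_at_k(r, rel, 5) for r, rel in eval_runs) / len(eval_runs)
mrr = sum(reciprocal_rank(r, rel) for r, rel in eval_runs) / len(eval_runs)
print(f"Recall@5: {mean_recall:.3f}  MRR: {mrr:.3f}")
```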
Common Errors and Debugging
1. Vector Dimension Mismatch
# ❌ Wrong: embedding model and database dimensions don't match
embeddings = OpenAIEmbeddings(model="text-embedding-3-small") # 1536 dims
# But vector database was created with 1024 dims
# ✅ Correct: ensure dimensions are consistent
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Vector database also uses 1536 dims
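A cheap way to catch this early is to validate vector length before every insert or query. The helper below is a hypothetical sketch. `INDEX_DIM` stands for whatever dimension your index was created with.

```python
def check_dimension(vector: list[float], expected_dim: int, label: str = "embedding") -> None:
    """Raise a descriptive error if a vector does not match the index dimension."""
    actual = len(vector)
    if actual != expected_dim:
        raise ValueError(
            f"{label} has {actual} dimensions, but the index expects {expected_dim}"
        )


INDEX_DIM = 1536  # assumed dimension of the vector index

check_dimension([0.0] * 1536, INDEX_DIM)  # passes silently

try:
    check_dimension([0.0] * 1024, INDEX_DIM, label="query embedding")
except ValueError as exc:
    mismatch_message = str(exc)
    print(mismatch_message)
```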
2. Oversized Chunks Cause Retrieval Noise
# ❌ Wrong: chunk_size too large, one chunk covers multiple topics
splitter = RecursiveCharacterTextSplitter(chunk_size=5000)
# ✅ Correct: reasonable chunk_size
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
3. Ignoring Metadata Filtering
# ❌ Wrong: full-database search, results include irrelevant docs
results = vectorstore.similarity_search("Python tutorial", k=5)
# ✅ Correct: leverage metadata filtering
# (Chroma needs $and to combine several conditions; filter syntax varies by store)
results = vectorstore.similarity_search(
    "Python tutorial",
    k=5,
    filter={"$and": [{"category": "programming"}, {"language": "en"}]}
)
4. Embedding Model and Query Language Mismatch
# ❌ Wrong: Chinese documents with English-optimized embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small") # English-focused
# ✅ Correct: Chinese documents with Chinese-optimized model
# LangChain wrapper, so the object is a drop-in replacement for OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-zh-v1.5")
5. Debugging Tips
import json

def debug_retrieval(query, retriever, top_k=5):
    """Print the content and metadata of the first top_k retrieved documents."""
    docs = retriever.invoke(query)[:top_k]
    for i, doc in enumerate(docs):
        print(f"--- Result {i+1} ---")
        print(f"Content: {doc.page_content[:200]}")
        # default=str keeps non-JSON metadata values (dates, paths) from raising
        print(f"Metadata: {json.dumps(doc.metadata, ensure_ascii=False, default=str)}")
    return docs

# Use the [JSON Formatter](/en/json/format) tool to inspect metadata structures
debug_retrieval(
    "What is a vector database?",
    vectorstore.as_retriever(search_kwargs={"k": 5}),
)
Production Deployment Tips
1. Architecture Design
            ┌──────────────┐
            │ API Gateway  │
            └──────┬───────┘
                   │
      ┌────────────┼────────────┐
      │            │            │
┌─────┴─────┐  ┌───┴────┐ ┌─────┴─────┐
│ Document  │  │Retrieve│ │    LLM    │
│ Ingestion │  │Service │ │  Service  │
│ Pipeline  │  │        │ │           │
└─────┬─────┘  └───┬────┘ └─────┬─────┘
      │            │            │
┌─────┴─────┐  ┌───┴────┐ ┌─────┴─────┐
│  Message  │  │Vector  │ │   Model   │
│   Queue   │  │  DB    │ │  Serving  │
│ (Celery)  │  │(Milvus)│ │  (vLLM)   │
└───────────┘  └────────┘ └───────────┘
2. Caching Strategy
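from functools import lru_cache
from langchain_openai import OpenAIEmbeddings

@lru_cache(maxsize=1000)
def cached_embedding(text: str, model: str) -> tuple:
    # lru_cache keys on (text, model), so the text itself must be an argument;
    # a tuple is returned so callers cannot mutate the cached value
    return tuple(OpenAIEmbeddings(model=model).embed_query(text))

def get_embedding_with_cache(text: str, model: str = "text-embedding-3-small"):
    # For an external cache such as Redis, hash the text to get a fixed-length key
    return list(cached_embedding(text, model))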
3. Asynchronous Document Ingestion
from celery import Celery
from langchain_community.document_loaders import TextLoader

app = Celery("rag_worker", broker="redis://localhost:6379/0")

@app.task
def ingest_document(file_path: str):
    # splitter and vectorstore are the module-level objects built in the
    # LangChain pipeline section
    loader = TextLoader(file_path, encoding="utf-8")
    documents = loader.load()
    chunks = splitter.split_documents(documents)
    vectorstore.add_documents(chunks)
    return {"status": "success", "chunks": len(chunks)}
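Re-running an ingestion task on the same file would add the same chunks to the index twice. One way to guard against that is to key each chunk by a hash of its normalized content and skip keys that were already seen. The sketch below keeps the seen keys in an in-memory set for simplicity. In a real worker they would live in a shared store, and the helper names are hypothetical.

```python
import hashlib


def content_key(text: str) -> str:
    """Hash of the text after collapsing whitespace and lowercasing."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate_chunks(chunks: list[str], seen_keys: set[str]) -> list[str]:
    """Return only chunks whose content has not been ingested before."""
    fresh = []
    for chunk in chunks:
        key = content_key(chunk)
        if key not in seen_keys:
            seen_keys.add(key)
            fresh.append(chunk)
    return fresh


seen: set[str] = set()
first_batch = deduplicate_chunks(
    ["RAG is retrieval-augmented generation.", "Vector databases store embeddings."],
    seen,
)
second_batch = deduplicate_chunks(
    ["rag is  retrieval-augmented generation.", "Chunking affects retrieval quality."],
    seen,
)
print(len(first_batch), second_batch)
```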
4. Monitoring Metrics
import time
from dataclasses import dataclass

@dataclass
class RAGMetrics:
    retrieval_latency_ms: float
    llm_latency_ms: float
    total_latency_ms: float
    num_chunks_retrieved: int
    num_source_docs: int
    query_tokens: int
    response_tokens: int

def measure_rag_performance(query: str, qa_chain):
    start = time.perf_counter()
    result = qa_chain.invoke({"query": query})
    total = (time.perf_counter() - start) * 1000
    sources = result.get("source_documents", [])
    metrics = RAGMetrics(
        # The chain runs retrieval and generation in one call, so only the
        # total is measured here; the per-stage fields stay at 0
        retrieval_latency_ms=0,
        llm_latency_ms=0,
        total_latency_ms=total,
        num_chunks_retrieved=len(sources),
        num_source_docs=len({d.metadata.get("source", "") for d in sources}),
        # Character counts as a rough proxy; use a tokenizer for real token counts
        query_tokens=len(query),
        response_tokens=len(result["result"]),
    )
    return result, metrics
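The function above leaves `retrieval_latency_ms` and `llm_latency_ms` at 0 because the chain hides the two stages behind a single call. If retrieval and generation are invoked separately, each stage can be timed with a small context manager. In the sketch below, `fake_retrieve` and `fake_generate` are stand-ins that only sleep. They are placeholders for your retriever and LLM calls, not real APIs.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage: str, timings: dict[str, float]):
    """Record the wall-clock duration of the with-block in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000


def fake_retrieve(query: str) -> list[str]:
    time.sleep(0.01)  # stand-in for a vector database query
    return ["chunk a", "chunk b"]


def fake_generate(query: str, chunks: list[str]) -> str:
    time.sleep(0.02)  # stand-in for an LLM call
    return f"Answer based on {len(chunks)} chunks"


timings: dict[str, float] = {}
with timed("retrieval_latency_ms", timings):
    retrieved = fake_retrieve("What is RAG?")
with timed("llm_latency_ms", timings):
    answer = fake_generate("What is RAG?", retrieved)
timings["total_latency_ms"] = timings["retrieval_latency_ms"] + timings["llm_latency_ms"]
print(timings)
```

Per-stage numbers make it clear whether a slow request is caused by the vector search or by the model, which a single total cannot tell you.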
FAQ
Q1: RAG vs Fine-tuning — which should I choose?
| Dimension | RAG | Fine-tuning |
|---|---|---|
| Knowledge updates | Update documents in real-time | Requires retraining |
| Cost | Low (vector index only) | High (GPU training) |
| Explainability | High (traceable sources) | Low (black box) |
| Style customization | Weak | Strong |
| Recommended for | Factual Q&A | Style/format customization |
Q2: What chunk_size should I use?
Typically 300-800 characters works best. It depends on document type: 200 for FAQs, 500 for technical docs, 1000 for legal documents. Always A/B test with real data.
Q3: Which vector database should I pick?
- Prototype/MVP: Chroma (zero config, embedded)
- Single-node high performance: FAISS + custom filtering
- Production zero-ops: Pinecone
- Large-scale enterprise: Milvus / Qdrant
- Existing PG infrastructure: pgvector
Q4: How to evaluate RAG system quality?
Use the RAGAS framework to evaluate four core metrics:
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_recall,
context_precision
)
results = evaluate(
dataset=eval_dataset,
metrics=[faithfulness, answer_relevancy, context_recall, context_precision]
)
print(results)
Q5: Any special considerations for non-English RAG?
- Choose language-optimized embedding models (BGE series for Chinese, multilingual-e5 for European languages)
- Use language-appropriate separators when chunking
- Hybrid search often provides a larger improvement for non-English languages
- Handle mixed-language documents carefully
- Query rewriting helps significantly with colloquial queries
Summary
RAG is one of the most practical LLM application architectures today, addressing knowledge staleness, hallucination, and private data access at low cost. The keys to a high-quality RAG system are: choosing the right embedding model, designing chunking strategies carefully, selecting the appropriate vector database, and continuously optimizing retrieval. Start with a Chroma prototype, progressively introduce hybrid search and reranking, and evolve toward a production-grade architecture; this is a dependable path to RAG deployment.