Python RAG Application Development: From Principles to Production


What Is RAG?

RAG (Retrieval-Augmented Generation) is an architectural pattern that combines external knowledge retrieval with large language model generation. It addresses three fundamental limitations of LLMs: knowledge staleness, hallucination, and inability to access private data.

Why RAG Matters in 2026

LLMs are powerful, but they come with inherent limitations:

| Problem | Description | How RAG Solves It |
| --- | --- | --- |
| Knowledge Cutoff | Training data has a cutoff date; no access to recent info | Real-time document retrieval |
| Hallucination | Models may generate plausible but incorrect content | Answers grounded in retrieved facts |
| Private Data | Enterprise documents are never in training data | Retrieves from enterprise knowledge bases |
| Cost | Fine-tuning LLMs is extremely expensive | Only maintain a vector index |
| Traceability | LLM outputs cannot be traced to sources | Every answer has document citations |

Basic RAG Workflow

User Query → Query Embedding → Vector DB Retrieval → Retrieved Docs + Prompt → LLM Generates Answer
    │                                              │
    │              ┌───────────────────┐           │
    └─────────────→│  Embedding Model  │──────────→│
                   └───────────────────┘           │
                                                   ↓
                                          ┌──────────────┐
                                          │  LLM (GPT)   │
                                          └──────────────┘
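The workflow above can be sketched end to end in a few lines. This is a toy illustration only: `embed`, `retrieve`, `build_prompt`, `answer`, and `echo_llm` are hypothetical names, the "embedding" is just a set of words, and the "LLM" is a stub that echoes the first context line. A real system would swap in an embedding model, a vector database, and an LLM client.

```python
def embed(text: str) -> set[str]:
    # Toy stand-in for an embedding model: the set of lowercase words
    return set(text.lower().split())

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Toy stand-in for vector search: rank documents by shared words
    q = embed(query)
    return sorted(docs, key=lambda d: len(q & embed(d)), reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    # Retrieved docs + question are combined into one prompt
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

def answer(query: str, docs: list[str], llm, k: int = 2) -> str:
    # llm is any callable that maps a prompt string to an answer string
    return llm(build_prompt(query, retrieve(query, docs, k)))

docs = [
    "rag combines retrieval with generation",
    "vector databases store embeddings",
    "celery runs background tasks",
]

def echo_llm(prompt: str) -> str:
    # Stub LLM: returns the first context line of the prompt
    return prompt.splitlines()[1]

result = answer("what is rag retrieval", docs, echo_llm, k=1)
print(result)
```

Keeping the LLM behind a plain callable makes each stage easy to test in isolation before any paid API is involved.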

Vector Embeddings Fundamentals

Vector embeddings convert text into high-dimensional numerical vectors, where semantically similar texts are closer in vector space.
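To make "closer in vector space" concrete, here is a minimal sketch of cosine similarity, the measure most of the examples in this article rely on. The three-dimensional vectors are hand-made for illustration and are not real model output; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # cos(a, b) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: two related texts and one unrelated text
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

sim_close = cosine_similarity(cat, kitten)
sim_far = cosine_similarity(cat, invoice)
print(f"cat vs kitten:  {sim_close:.3f}")
print(f"cat vs invoice: {sim_far:.3f}")
```

The related pair scores close to 1 and the unrelated pair close to 0, which is the property retrieval depends on.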

Embedding Model Selection

| Model | Dimensions | Max Length | Highlights | Best For |
| --- | --- | --- | --- | --- |
| text-embedding-3-large | 3072 | 8191 tokens | OpenAI's largest text-embedding-3 model | High-precision English retrieval |
| text-embedding-3-small | 1536 | 8191 tokens | Good cost-performance ratio | General English use cases |
| bge-large-zh-v1.5 | 1024 | 512 tokens | Strong Chinese performance | Chinese document retrieval |
| bge-m3 | 1024 | 8192 tokens | Multilingual, dense + sparse support | Multilingual mixed scenarios |
| gte-Qwen2-7B-instruct | 3584 | 32768 tokens | Very long context, strong open-source option | Long documents, complex semantics |
| Cohere embed-v4 | 256-1536 (default 1536) | 128k tokens | Multimodal support | Image-text mixed retrieval |

Using OpenAI Embeddings

from openai import OpenAI

client = OpenAI(api_key="your-api-key")

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="RAG stands for Retrieval-Augmented Generation",
    dimensions=1536
)

embedding = response.data[0].embedding
print(f"Vector dimensions: {len(embedding)}")  # 1536

Using Open-Source BGE Models

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-zh-v1.5")

embeddings = model.encode(
    ["RAG is retrieval-augmented generation", "Vector databases store embeddings"],
    normalize_embeddings=True
)

print(f"Vector dimensions: {embeddings.shape}")  # (2, 1024)



Document Chunking Strategies

Chunking is one of the most consequential preprocessing steps in a RAG system, because chunk boundaries determine what the retriever is able to return.

Chunking Method Comparison

| Method | Principle | Pros | Cons | Recommended For |
| --- | --- | --- | --- | --- |
| Fixed-size | Split by character/token count | Simple implementation | May truncate semantics | Logs, structured text |
| Recursive character | Split by separator hierarchy | Preserves paragraph integrity | Requires tuning | General documents |
| Semantic | Split by embedding similarity | Best semantic completeness | High compute cost | High-quality Q&A |
| Document structure | Split by Markdown/HTML headings | Respects document structure | Depends on document format | Structured documents |
| Sentence window | Sentence-level with context window | Precise retrieval + rich context | Complex implementation | Fine-grained Q&A |
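As a baseline for comparison, the fixed-size method from the table can be written in plain Python. This is a minimal sketch; `chunk_fixed` is a hypothetical helper, not a library function, and it counts characters rather than tokens.

```python
def chunk_fixed(text: str, chunk_size: int, overlap: int) -> list[str]:
    # Slide a window of chunk_size characters, stepping by chunk_size - overlap
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_fixed(sample, chunk_size=10, overlap=3)
print(pieces)
```

Each chunk repeats the last three characters of the previous one, which is how overlap keeps a sentence that straddles a boundary retrievable from either side.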

LangChain Recursive Character Splitter

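from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,       # maximum chunk length, measured by length_function
    chunk_overlap=50,     # characters shared between neighboring chunks
    # Tried in order: paragraph breaks first, single characters as a last resort
    separators=["\n\n", "\n", ". ", "! ", "? ", "; ", ", ", " ", ""],
    length_function=len   # counts characters; swap in a tokenizer to count tokens
)

# long_document is assumed to be a str holding the full text to split
chunks = splitter.split_text(long_document)
print(f"Number of chunks: {len(chunks)}")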

Semantic Chunking (Advanced)

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(model="text-embedding-3-small"),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=75
)

chunks = splitter.split_text(long_document)

Chunk Size Tuning Guidelines

# Rule of thumb: overlap is typically 10%-20% of chunk_size

# Short documents (FAQs, knowledge cards)
chunk_size = 200
chunk_overlap = 20

# Medium documents (technical docs, blog posts)
chunk_size = 500
chunk_overlap = 50

# Long documents (papers, legal documents)
chunk_size = 1000
chunk_overlap = 100
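The effect of these settings on index size can be estimated with simple arithmetic. The sketch below assumes plain fixed-size windows; `estimate_chunks` is a hypothetical helper, and separator-aware splitters will usually produce somewhat different counts because they cut at paragraph and sentence boundaries.

```python
def estimate_chunks(doc_length: int, chunk_size: int, chunk_overlap: int) -> int:
    # Each chunk after the first advances by (chunk_size - chunk_overlap) characters
    if doc_length == 0:
        return 0
    step = chunk_size - chunk_overlap
    # Integer ceiling of (doc_length - chunk_overlap) / step, at least one chunk
    return max(1, -(-(doc_length - chunk_overlap) // step))

# A 10,000-character document under the three presets
for size, overlap in [(200, 20), (500, 50), (1000, 100)]:
    print(size, overlap, estimate_chunks(10_000, size, overlap))
```

Smaller chunks mean more vectors to embed, store, and search, so the chunk size also drives embedding cost and index size.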

Vector Database Selection

Mainstream Vector Database Comparison

| Database | Type | Dimensions | Persistence | Filtering | Distributed | Best For |
| --- | --- | --- | --- | --- | --- | --- |
| Chroma | Embedded | Any | ✅ | Basic | | Prototyping, small scale |
| FAISS | In-memory | Any | ⚠️ | Manual | | High-performance single-node |
| Pinecone | Cloud | Any | ✅ | Full | | Production, zero ops |
| Milvus | Standalone | Any | ✅ | Full | | Large-scale enterprise |
| Qdrant | Standalone | Any | ✅ | Full | | Rust high performance |
| Weaviate | Standalone | Any | ✅ | GraphQL | | Hybrid search |
| pgvector | PG extension | ≤2000 (indexed) | ✅ | SQL | | Existing PostgreSQL infra |

Using Chroma (Quick Start)

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="knowledge_base",
    metadata={"hnsw:space": "cosine"}
)

collection.add(
    documents=["RAG is retrieval-augmented generation", "Vector databases store embeddings"],
    metadatas=[{"source": "doc1"}, {"source": "doc2"}],
    ids=["id1", "id2"]
)

results = collection.query(
    query_texts=["What is RAG?"],
    n_results=3
)

print(results["documents"])

Using FAISS (High Performance)

import faiss
import numpy as np

dimension = 1024
index = faiss.IndexFlatIP(dimension)

embeddings = np.random.rand(1000, dimension).astype("float32")
faiss.normalize_L2(embeddings)
index.add(embeddings)

query = np.random.rand(1, dimension).astype("float32")
faiss.normalize_L2(query)

distances, indices = index.search(query, k=5)
print(f"Top-5 indices: {indices}")
print(f"Top-5 similarities: {distances}")
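The FAISS example normalizes vectors before adding them to an inner-product index, which makes the returned scores cosine similarities. The pure-Python sketch below shows the same idea as an exhaustive search; `l2_normalize` and `flat_ip_search` are hypothetical helpers written for illustration and are far slower than FAISS.

```python
import heapq
import math

def l2_normalize(v: list[float]) -> list[float]:
    # Scale a vector to unit length so that inner product equals cosine similarity
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def flat_ip_search(index_vectors, query, k):
    # Score every stored vector against the query and keep the k best
    scores = [
        (sum(a * b for a, b in zip(vec, query)), i)
        for i, vec in enumerate(index_vectors)
    ]
    top = heapq.nlargest(k, scores)
    return [s for s, _ in top], [i for _, i in top]

vectors = [l2_normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
query = l2_normalize([2.0, 1.9])

similarities, ids = flat_ip_search(vectors, query, k=2)
print(ids, similarities)
```

Because every vector has unit length, all scores fall between -1 and 1 regardless of the original magnitudes.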

Using Pinecone (Production-Grade)

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="your-api-key")
index_name = "rag-knowledge"

if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

index = pc.Index(index_name)

index.upsert(vectors=[
    {"id": "doc1", "values": [0.1] * 1536, "metadata": {"source": "wiki"}},
    {"id": "doc2", "values": [0.2] * 1536, "metadata": {"source": "blog"}}
])

results = index.query(
    vector=[0.15] * 1536,
    top_k=5,
    filter={"source": {"$eq": "wiki"}}
)

Building a RAG Pipeline

Complete RAG with LangChain

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA  # legacy chain; deprecated in recent LangChain releases
from langchain_community.document_loaders import TextLoader

loader = TextLoader("knowledge.txt", encoding="utf-8")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
chunks = splitter.split_documents(documents)

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    persist_directory="./chroma_db"
)

llm = ChatOpenAI(model="gpt-4o", temperature=0)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True
)

result = qa_chain.invoke({"query": "What is RAG?"})
print(result["result"])
print(f"Sources: {[doc.metadata for doc in result['source_documents']]}")

Complete RAG with LlamaIndex

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI as LlamaOpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = LlamaOpenAI(model="gpt-4o", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

documents = SimpleDirectoryReader("./data").load_data()

index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(
    similarity_top_k=4,
    response_mode="tree_summarize"
)

response = query_engine.query("What are the core advantages of RAG?")
print(response)
print(f"Source nodes: {[n.metadata for n in response.source_nodes]}")

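Retrieval Optimization Strategies

1. Hybrid Search

Combining dense (semantic) retrieval with sparse (keyword) retrieval such as BM25 often outperforms either method alone, because keyword matching catches exact terms and identifiers that embeddings can miss: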

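from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever  # requires the rank_bm25 package
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# chunks is the list of Document objects produced by the text splitter
bm25_retriever = BM25Retriever.from_documents(chunks, k=5)
vector_retriever = Chroma.from_documents(
    chunks, OpenAIEmbeddings()
).as_retriever(search_kwargs={"k": 5})  # k must be passed through search_kwargs

ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]  # sparse vs. dense contribution
)

results = ensemble_retriever.invoke("RAG optimization methods")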
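One common way to merge a keyword ranking and a semantic ranking is weighted Reciprocal Rank Fusion (RRF), which uses only rank positions and so sidesteps the fact that BM25 scores and cosine similarities live on different scales. The sketch below is an illustration under assumptions: `reciprocal_rank_fusion` is a hypothetical helper, the document IDs are made up, and `c=60` is a conventional constant rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, weights=None, c=60):
    # Each list contributes weight / (c + rank) for every document it ranks
    weights = weights or [1.0] * len(rankings)
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d1", "d2", "d3"]    # hypothetical keyword results
dense_ranking = ["d3", "d1", "d4"]   # hypothetical semantic results

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking], weights=[0.4, 0.6])
print(fused)
```

Documents that appear in both lists rise to the top, while a document found by only one retriever is kept but ranked lower.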

2. Reranking

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-large")

query = "How to optimize RAG retrieval?"
candidates = ["Candidate doc 1...", "Candidate doc 2...", "Candidate doc 3..."]

pairs = [[query, doc] for doc in candidates]
scores = reranker.predict(pairs)

ranked = sorted(zip(scores, candidates), reverse=True)
print(f"Reranked results: {ranked}")
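Reranking is usually the second stage of a two-stage design: a cheap retriever fetches a broad candidate set, and a more expensive scorer reorders only those candidates. The sketch below shows that shape with a pluggable scoring function. `rerank` and `overlap_score` are hypothetical names, and the word-overlap scorer is only a stand-in for a cross-encoder so the example runs without downloading a model.

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 3):
    # Score each (query, candidate) pair, then keep the top_n highest
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]

def overlap_score(query: str, doc: str) -> float:
    # Stand-in scorer: fraction of query words that appear in the document
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

candidates = [
    "tune chunk size for retrieval",
    "bake bread at home",
    "optimize rag retrieval with reranking",
]
best = rerank("optimize rag retrieval", candidates, overlap_score, top_n=2)
print(best)
```

With a real cross-encoder, `score_fn` could wrap a call such as `reranker.predict([[query, doc]])[0]`, keeping the rest of the function unchanged.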

3. Query Rewriting

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

rewrite_template = ChatPromptTemplate.from_messages([
    ("system", "You are a query rewriting assistant. Rewrite vague user questions into more precise retrieval queries."),
    ("human", "Original question: {question}\nGenerate 3 retrieval queries from different angles:")
])

# Pipe the prompt into the model; the chain returns a chat message
rewrite_chain = rewrite_template | ChatOpenAI(model="gpt-4o", temperature=0)

rewritten = rewrite_chain.invoke({"question": "How do I use RAG?"})
print(rewritten.content)
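The rewritten queries come back as free text, so they need to be split into individual strings before each one can be sent to the retriever. The sketch below assumes the model returns one query per line, optionally numbered or bulleted. That format is an assumption, `parse_queries` is a hypothetical helper, and the sample text is made up rather than real model output.

```python
import re

def parse_queries(llm_text: str, limit: int = 3) -> list[str]:
    # Strip list markers such as "1.", "2)", "-", "*" and drop duplicates
    queries = []
    for line in llm_text.splitlines():
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip()
        if cleaned and cleaned not in queries:
            queries.append(cleaned)
    return queries[:limit]

sample_output = (
    "1. How to set up a RAG pipeline in Python\n"
    "2) RAG pipeline setup steps\n"
    "- How to set up a RAG pipeline in Python\n"
)
queries = parse_queries(sample_output)
print(queries)
```

Each parsed query can then be retrieved separately and the result lists merged, for example with a rank-fusion step.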

4. Performance Benchmarks

| Configuration | Retrieval Method | Dataset | Recall@5 | MRR | Latency (ms) |
| --- | --- | --- | --- | --- | --- |
| Basic RAG | Dense | MS MARCO | 0.72 | 0.58 | 120 |
| + BM25 Hybrid | Hybrid | MS MARCO | 0.81 | 0.67 | 180 |
| + Reranker | Hybrid + Rerank | MS MARCO | 0.89 | 0.78 | 350 |
| + Query Rewrite | Full optimization | MS MARCO | 0.92 | 0.83 | 420 |
| Basic RAG | Dense | Chinese CMedQA | 0.65 | 0.51 | 150 |
| + BGE Rerank | Dense + Rerank | Chinese CMedQA | 0.84 | 0.73 | 400 |
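For readers who want to measure their own pipeline, the two quality metrics in the table can be computed as follows. This is a minimal sketch using common definitions of Recall@k and MRR; the function names and the tiny evaluation set are hypothetical, and it does not reproduce the numbers above.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Share of the relevant documents that appear in the top k results
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mean_reciprocal_rank(runs) -> float:
    # Average of 1 / rank of the first relevant result; 0 if none is found
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs) if runs else 0.0

runs = [
    (["a", "b", "c"], {"b"}),        # first relevant hit at rank 2
    (["x", "y", "z"], {"x", "z"}),   # first relevant hit at rank 1
    (["m", "n", "o"], {"q"}),        # no relevant hit
]
mrr = mean_reciprocal_rank(runs)
recall = recall_at_k(["x", "y", "z"], {"x", "z"}, k=2)
print(f"MRR={mrr:.2f}, Recall@2={recall:.2f}")
```

Recall@k rewards finding the relevant documents at all, while MRR rewards placing the first relevant document near the top, so the two can move independently.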

Common Errors and Debugging

1. Vector Dimension Mismatch

# ❌ Wrong: embedding model and database dimensions don't match
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")  # 1536 dims
# But vector database was created with 1024 dims

# ✅ Correct: ensure dimensions are consistent
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Vector database also uses 1536 dims
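A cheap guard at write time catches this mismatch before it turns into a confusing database error. The sketch below is an assumption-laden illustration: `check_dimension` is a hypothetical helper, and the expected dimension would come from wherever the index was configured.

```python
def check_dimension(vector: list[float], expected_dim: int) -> list[float]:
    # Fail fast with a clear message instead of a low-level index error
    if len(vector) != expected_dim:
        raise ValueError(
            f"Embedding has {len(vector)} dimensions, "
            f"but the index expects {expected_dim}"
        )
    return vector

INDEX_DIM = 1536  # must match the dimension the index was created with

ok = check_dimension([0.0] * 1536, INDEX_DIM)

try:
    check_dimension([0.0] * 1024, INDEX_DIM)
except ValueError as err:
    message = str(err)
    print(message)
```

Running the same check once at startup against a single test embedding also catches a silently changed model name.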

2. Oversized Chunks Cause Retrieval Noise

# ❌ Wrong: chunk_size too large, one chunk covers multiple topics
splitter = RecursiveCharacterTextSplitter(chunk_size=5000)

# ✅ Correct: reasonable chunk_size
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

3. Ignoring Metadata Filtering

# ❌ Wrong: full-database search, results include irrelevant docs
results = vectorstore.similarity_search("Python tutorial", k=5)

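# ✅ Correct: leverage metadata filtering
# Chroma expects multiple conditions to be combined with an explicit $and;
# other vector stores may accept a flat dict instead
results = vectorstore.similarity_search(
    "Python tutorial",
    k=5,
    filter={"$and": [{"category": "programming"}, {"language": "en"}]}
)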

4. Embedding Model and Query Language Mismatch

# ❌ Risky: Chinese documents indexed with a general-purpose model chosen without evaluation
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")  # multilingual, but not tuned for Chinese

# ✅ Better: a Chinese-optimized model, wrapped so LangChain vector stores can use it
from langchain_huggingface import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-zh-v1.5",
    encode_kwargs={"normalize_embeddings": True},
)

5. Debugging Tips

import json

def debug_retrieval(query, retriever, top_k=5):
    # Print the first top_k retrieved chunks together with their metadata
    docs = retriever.invoke(query)[:top_k]
    for i, doc in enumerate(docs):
        print(f"--- Result {i+1} ---")
        print(f"Content: {doc.page_content[:200]}")
        # default=str keeps non-JSON-serializable metadata values printable
        print(f"Metadata: {json.dumps(doc.metadata, ensure_ascii=False, default=str)}")
    return docs

# The number of results is set through search_kwargs, not a bare k argument
debug_retrieval(
    "What is a vector database?",
    vectorstore.as_retriever(search_kwargs={"k": 5}),
)

Production Deployment Tips

1. Architecture Design

                    ┌──────────────┐
                    │  API Gateway  │
                    └──────┬───────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
        ┌─────┴─────┐ ┌───┴────┐ ┌─────┴─────┐
        │  Document  │ │Retrieve│ │  LLM      │
        │  Ingestion │ │Service │ │  Service  │
        │  Pipeline  │ │        │ │           │
        └─────┬─────┘ └───┬────┘ └─────┬─────┘
              │            │            │
        ┌─────┴─────┐ ┌───┴────┐ ┌─────┴─────┐
        │  Message   │ │Vector  │ │  Model    │
        │  Queue     │ │  DB    │ │  Serving  │
        │ (Celery)   │ │(Milvus)│ │ (vLLM)    │
        └────────────┘ └────────┘ └───────────┘

2. Caching Strategy

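from functools import lru_cache
from langchain_openai import OpenAIEmbeddings

@lru_cache(maxsize=8)
def _get_model(model: str) -> OpenAIEmbeddings:
    # Reuse one embeddings client per model name
    return OpenAIEmbeddings(model=model)

@lru_cache(maxsize=1000)
def cached_embedding(text: str, model: str) -> tuple[float, ...]:
    # lru_cache keys on (text, model) directly, so no manual hashing is needed;
    # a tuple is returned so callers cannot mutate the cached value
    return tuple(_get_model(model).embed_query(text))

def get_embedding_with_cache(text: str, model: str = "text-embedding-3-small") -> list[float]:
    return list(cached_embedding(text, model))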

3. Asynchronous Document Ingestion

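from celery import Celery
from langchain_community.document_loaders import TextLoader

app = Celery("rag_worker", broker="redis://localhost:6379/0")

# splitter and vectorstore are assumed to be module-level objects,
# created as in the pipeline section above

@app.task
def ingest_document(file_path: str):
    # Load, chunk, and index one file in a background worker
    loader = TextLoader(file_path, encoding="utf-8")
    documents = loader.load()
    chunks = splitter.split_documents(documents)
    vectorstore.add_documents(chunks)
    return {"status": "success", "chunks": len(chunks)}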

4. Monitoring Metrics

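import time
from dataclasses import dataclass
from typing import Optional

import tiktoken

@dataclass
class RAGMetrics:
    retrieval_latency_ms: Optional[float]  # None when the stage is not timed separately
    llm_latency_ms: Optional[float]
    total_latency_ms: float
    num_chunks_retrieved: int
    num_source_docs: int
    query_tokens: int
    response_tokens: int

# o200k_base is the tokenizer used by gpt-4o; change it if the model changes
_encoding = tiktoken.get_encoding("o200k_base")

def measure_rag_performance(query: str, qa_chain):
    start = time.perf_counter()
    result = qa_chain.invoke({"query": query})
    total = (time.perf_counter() - start) * 1000

    sources = result.get("source_documents", [])
    metrics = RAGMetrics(
        # The chain runs retrieval and generation in one call,
        # so only the total latency is measured here
        retrieval_latency_ms=None,
        llm_latency_ms=None,
        total_latency_ms=total,
        num_chunks_retrieved=len(sources),
        num_source_docs=len({d.metadata.get("source", "") for d in sources}),
        # Count real tokens rather than characters
        query_tokens=len(_encoding.encode(query)),
        response_tokens=len(_encoding.encode(result["result"])),
    )
    return result, metrics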
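The function above times only the whole chain call. If retrieval and generation are invoked as separate steps, each stage can be timed on its own. The sketch below shows one way to do that with a small context manager; `timed` is a hypothetical helper, and the two bodies are placeholders standing in for the real retriever and LLM calls.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(timings: dict, stage: str):
    # Record the wall-clock duration of the with-block in milliseconds
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000

timings = {}

with timed(timings, "retrieval"):
    docs = ["chunk-1", "chunk-2"]                # placeholder for retriever.invoke(query)

with timed(timings, "llm"):
    answer = f"answer from {len(docs)} chunks"   # placeholder for the LLM call

print({stage: round(ms, 3) for stage, ms in timings.items()})
```

Separate timings show whether slow responses come from the vector search or from generation, which usually call for different fixes.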

FAQ

Q1: RAG vs Fine-tuning — which should I choose?

| Dimension | RAG | Fine-tuning |
| --- | --- | --- |
| Knowledge updates | Update documents in real-time | Requires retraining |
| Cost | Low (vector index only) | High (GPU training) |
| Explainability | High (traceable sources) | Low (black box) |
| Style customization | Weak | Strong |
| Recommended for | Factual Q&A first choice | Style/format customization first choice |

Q2: What chunk_size should I use?

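A range of roughly 300-800 characters is a reasonable default for general prose. Adjust by document type: about 200 for FAQs, 500 for technical docs, and 1000 for legal documents. Validate the choice by comparing retrieval metrics on real queries from your own data.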

Q3: Which vector database should I pick?

  • Prototype/MVP: Chroma (zero config, embedded)
  • Single-node high performance: FAISS + custom filtering
  • Production zero-ops: Pinecone
  • Large-scale enterprise: Milvus / Qdrant
  • Existing PG infrastructure: pgvector

Q4: How to evaluate RAG system quality?

Use the RAGAS framework to evaluate four core metrics:

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision
)

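# eval_dataset is assumed to be built beforehand from questions, generated answers,
# retrieved contexts, and reference answers; exact column names depend on the ragas version
results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision]
)
print(results)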

Q5: Any special considerations for non-English RAG?

  1. Choose language-optimized embedding models (BGE series for Chinese, multilingual-e5 for European languages)
  2. Use language-appropriate separators when chunking
  3. Hybrid search often provides a larger improvement for non-English languages, where embedding coverage tends to be weaker
  4. Handle mixed-language documents carefully
  5. Query rewriting helps significantly with colloquial queries
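As an illustration of language-appropriate separators, Chinese text ends sentences with full-width punctuation rather than ". ", so an English-oriented separator list will rarely find a sentence boundary. The sketch below splits on common Chinese sentence-ending marks. `split_cjk_sentences` is a hypothetical helper, and the punctuation set is an assumption that may need extending for a given corpus.

```python
import re

# A run of non-terminator characters followed by an optional terminator
CJK_SENTENCE = re.compile(r"[^。！？；]+[。！？；]?")

def split_cjk_sentences(text: str) -> list[str]:
    # Keep each sentence together with its closing punctuation mark
    return [s.strip() for s in CJK_SENTENCE.findall(text) if s.strip()]

sentences = split_cjk_sentences("检索增强生成。它很有用！对吗？")
print(sentences)
```

With a recursive splitter, the same idea can be expressed by adding marks such as "。", "！", "？", and "；" to the `separators` list ahead of the space and empty-string fallbacks.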


Summary

RAG is one of the most practical LLM application architectures today: it addresses knowledge staleness, mitigates hallucination, and enables private data access at a lower cost than fine-tuning. The keys to a high-quality RAG system are choosing the right embedding model, designing chunking strategies carefully, selecting an appropriate vector database, and continuously optimizing retrieval. A sensible path is to start with a Chroma prototype, progressively introduce hybrid search and reranking, and evolve toward a production-grade architecture.
