PostgreSQL pgvector RAG in Action: 7 Production Patterns from Embedding Storage to Hybrid Search

数据库

Your team just decided to build a RAG system, and the architect drops a Pinecone bill — $2,000/month minimum. Worse, your business data lives in PostgreSQL while vectors sit in Pinecone. Who guarantees dual-write consistency? In 2026, pgvector has evolved from "usable" to "excellent": HNSW index performance has skyrocketed, halfvec halves storage, and native hybrid search is here. Your existing PostgreSQL is the best vector database you already have.


Key Takeaways

  • Master 7 production-grade pgvector RAG patterns, from table design to hybrid search
  • Understand HNSW vs IVFFlat index selection strategies and parameter tuning
  • Implement three-way hybrid search (vector + full-text + keyword) with RRF reranking
  • Solve production challenges: multi-tenant isolation, performance tuning, backup & recovery
  • Get ready-to-use complete Python RAG code

Table of Contents

  1. pgvector Core Concepts
  2. Pattern 1: Embedding Storage & Table Design
  3. Pattern 2: HNSW Index Optimization
  4. Pattern 3: Hybrid Search (Vector + Full-Text + Keyword)
  5. Pattern 4: Reranking & Result Fusion
  6. Pattern 5: Multi-Tenant Data Isolation
  7. Pattern 6: Performance Tuning & Query Optimization
  8. Pattern 7: Production Deployment & Backup Recovery
  9. 5 Common Pitfalls & Solutions
  10. 10 Common Error Troubleshooting
  11. Advanced Optimization Techniques
  12. Comparison: pgvector vs Pinecone vs Milvus vs Qdrant
  13. Recommended Online Tools

pgvector Core Concepts

RAG (Retrieval-Augmented Generation) workflow: user asks a question → vector search finds relevant documents → documents become context for the LLM → generate an answer. Vector search is the foundation of RAG, and pgvector gives PostgreSQL this capability.

┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│  User Query  │────▶│  Embedding   │────▶│  pgvector    │
│              │     │  Model       │     │  Similarity  │
└─────────────┘     └──────────────┘     └──────┬───────┘
                                                 │
                                                 ▼
┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│  LLM Answer  │◀────│  Context +   │◀────│  Top-K Docs  │
│              │     │  Prompt      │     │              │
└─────────────┘     └──────────────┘     └──────────────┘

Distance Metrics

Metric Operator Index Op Class Use Case Range
Cosine Distance <=> vector_cosine_ops Semantic similarity (most common) [0, 2]
Euclidean Distance <-> vector_l2_ops Spatial distance [0, +∞)
Inner Product <#> vector_ip_ops Normalized vectors (-∞, +∞)

pgvector Version History

Version Release Key Features
0.5 2023 HNSW index, half-precision vectors
0.6 2024 Parallel index build, sparse vectors
0.7 2024 CONCURRENTLY support, halfvec indexing
0.8 2025 SPARSEVEC type, quantization support
0.9 2026 Improved HNSW build speed, multi-core parallel

Pattern 1: Embedding Storage & Table Design

Basic Table Structure

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    tenant_id INTEGER NOT NULL,
    title TEXT NOT NULL,
    content TEXT NOT NULL,
    source TEXT,
    category TEXT,
    tags TEXT[],
    chunk_index INTEGER DEFAULT 0,
    token_count INTEGER,
    embedding vector(1536),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_documents_tenant ON documents(tenant_id);
CREATE INDEX idx_documents_category ON documents(category);
CREATE INDEX idx_documents_created ON documents(created_at DESC);

Chunked Storage Design

In RAG systems, long documents must be split into chunks, each with its own embedding:

CREATE TABLE document_chunks (
    id BIGSERIAL PRIMARY KEY,
    document_id BIGINT NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    chunk_index INTEGER NOT NULL,
    content TEXT NOT NULL,
    token_count INTEGER,
    embedding vector(1536),
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_chunks_document ON document_chunks(document_id);
CREATE INDEX idx_chunks_embedding ON document_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

Metadata + Vector Combined Table

CREATE TABLE knowledge_base (
    id BIGSERIAL PRIMARY KEY,
    tenant_id INTEGER NOT NULL,
    doc_type VARCHAR(50),
    title TEXT NOT NULL,
    content TEXT NOT NULL,
    summary TEXT,
    keywords TEXT[],
    published_at DATE,
    access_level INTEGER DEFAULT 0,
    embedding vector(1536),
    tsv tsvector GENERATED ALWAYS AS (
        to_tsvector('english',
            coalesce(title, '') || ' ' ||
            coalesce(content, '') || ' ' ||
            coalesce(array_to_string(keywords, ' '), '')
        )
    ) STORED,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_kb_tsv ON knowledge_base USING gin(tsv);
CREATE INDEX idx_kb_tenant_type ON knowledge_base(tenant_id, doc_type);
CREATE INDEX idx_kb_embedding ON knowledge_base
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

Python Batch Embedding Insert

import psycopg2
from openai import OpenAI
from dataclasses import dataclass

client = OpenAI()

@dataclass
class Document:
    title: str
    content: str
    source: str
    category: str

def generate_embeddings(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
        )
        all_embeddings.extend([d.embedding for d in response.data])
    return all_embeddings

def chunk_text(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    words = text.split()
    chunks = []
    step = max_tokens - overlap
    for i in range(0, len(words), step):
        chunk = " ".join(words[i:i + max_tokens])
        chunks.append(chunk)
    return chunks

def insert_documents_with_chunks(
    conn_params: dict,
    docs: list[Document],
):
    conn = psycopg2.connect(**conn_params)
    cur = conn.cursor()

    for doc in docs:
        cur.execute(
            """INSERT INTO documents (tenant_id, title, content, source, category)
               VALUES (1, %s, %s, %s, %s) RETURNING id""",
            (doc.title, doc.content, doc.source, doc.category),
        )
        doc_id = cur.fetchone()[0]

        chunks = chunk_text(doc.content)
        texts = [f"{doc.title}\n{c}" for c in chunks]
        embeddings = generate_embeddings(texts)

        for idx, (chunk, emb) in enumerate(zip(chunks, embeddings)):
            cur.execute(
                """INSERT INTO document_chunks
                   (document_id, chunk_index, content, embedding)
                   VALUES (%s, %s, %s, %s)""",
                (doc_id, idx, chunk, str(emb)),
            )

    conn.commit()
    cur.close()
    conn.close()

insert_documents_with_chunks(
    conn_params={"dbname": "mydb", "user": "postgres", "password": "password"},
    docs=[
        Document(
            title="pgvector Performance Tuning Guide",
            content="pgvector is a PostgreSQL extension for vector search...",
            source="toolsku.dev/blog",
            category="Database",
        ),
    ],
)

Pattern 2: HNSW Index Optimization

HNSW Parameter Deep Dive

                    HNSW Index Structure
            ┌──────────────────────────┐
            │     Layer 2 (sparse)     │
            │       ○──────○          │
            │      /        \         │
            ├─────/──────────\────────┤
            │   Layer 1 (medium)      │
            │   ○──○──○──○──○         │
            │  / \  |   |  / \        │
            ├─/──\─│───│─/──\─────────┤
            │ Layer 0 (dense)         │
            │ ○─○─○─○─○─○─○─○─○      │
            │  m=16 max 16 edges/node │
            └──────────────────────────┘

    ef_construction: build-time search width (higher = better, slower build)
    m: max connections per node (higher = better recall, more memory)
    ef_search: query-time search width (higher = better recall, slower query)
Parameter Default Recommended Range Impact
m 16 8-64 Connectivity; higher = better recall, more memory
ef_construction 64 32-200 Build quality; higher = better index, slower build
ef_search 40 20-200 Query precision; higher = better recall, slower query

Index Parameters by Scenario

-- High recall (legal/medical document retrieval)
CREATE INDEX idx_docs_embedding_high_recall ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 48, ef_construction = 128);

-- Balanced (general RAG)
CREATE INDEX idx_docs_embedding_balanced ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Low latency (real-time recommendations)
CREATE INDEX idx_docs_embedding_low_latency ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 8, ef_construction = 32);

IVFFlat Index (Large-Scale Alternative)

-- IVFFlat for large datasets where slight accuracy loss is acceptable
-- lists rule of thumb: rows/1000, minimum 10
CREATE INDEX idx_docs_embedding_ivf ON documents
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

-- IVFFlat must be created AFTER representative data exists
-- Empty table → inaccurate centroids → poor recall
Index Build Time (1M rows) Query Latency Recall Incremental Memory
HNSW m=16 ~15min ~2ms 98% Yes 2-3x
HNSW m=48 ~45min ~5ms 99.5% Yes 4-5x
IVFFlat lists=100 ~3min ~3ms 90% No (needs REINDEX) 1-1.5x

Dynamic Precision at Query Time

-- High precision (user-initiated search)
SET LOCAL hnsw.ef_search = 200;
SELECT id, title, 1 - (embedding <=> $1::vector) AS similarity
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT 10;

-- Low latency (autocomplete/recommendations)
SET LOCAL hnsw.ef_search = 20;
SELECT id, title, 1 - (embedding <=> $1::vector) AS similarity
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT 5;

Pattern 3: Hybrid Search (Vector + Full-Text + Keyword)

Pure vector search excels at semantic matching but may miss exact keywords. Pure full-text search excels at keyword matching but can't understand semantics. Hybrid search combines both — the standard for RAG systems in 2026.

┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│  Vector       │  │  Full-Text   │  │  Keyword     │
│  (pgvector)   │  │  (tsvector)  │  │  (LIKE/ILIKE) │
│  Semantic     │  │  Term freq   │  │  Exact       │
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                  │                  │
       ▼                  ▼                  ▼
┌──────────────────────────────────────────────────┐
│              RRF (Reciprocal Rank Fusion)         │
│  score = Σ 1/(k + rank_i)   k=60                 │
└──────────────────────┬───────────────────────────┘
                       │
                       ▼
              ┌──────────────────┐
              │  Final Results   │
              └──────────────────┘

Three-Way Hybrid Search SQL

WITH vector_results AS (
    SELECT id, title, content,
           ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) AS v_rank,
           1 - (embedding <=> $1::vector) AS v_score
    FROM knowledge_base
    WHERE 1 - (embedding <=> $1::vector) > 0.5
    ORDER BY embedding <=> $1::vector
    LIMIT 50
),
fulltext_results AS (
    SELECT id, title, content,
           ROW_NUMBER() OVER (
               ORDER BY ts_rank(tsv, plainto_tsquery('english', $2)) DESC
           ) AS ft_rank,
           ts_rank(tsv, plainto_tsquery('english', $2)) AS ft_score
    FROM knowledge_base
    WHERE tsv @@ plainto_tsquery('english', $2)
    LIMIT 50
),
keyword_results AS (
    SELECT id, title, content,
           ROW_NUMBER() OVER (
               ORDER BY
                   CASE WHEN title ILIKE '%' || $2 || '%' THEN 0 ELSE 1 END,
                   CASE WHEN content ILIKE '%' || $2 || '%' THEN 0 ELSE 1 END
           ) AS kw_rank
    FROM knowledge_base
    WHERE title ILIKE '%' || $2 || '%'
       OR content ILIKE '%' || $2 || '%'
    LIMIT 50
),
rrf_combined AS (
    SELECT
        COALESCE(v.id, f.id, k.id) AS id,
        COALESCE(v.title, f.title, k.title) AS title,
        COALESCE(v.content, f.content, k.content) AS content,
        COALESCE(1.0 / (60 + v.v_rank), 0) * 0.5 +
        COALESCE(1.0 / (60 + f.ft_rank), 0) * 0.3 +
        COALESCE(1.0 / (60 + k.kw_rank), 0) * 0.2 AS rrf_score
    FROM vector_results v
    FULL OUTER JOIN fulltext_results f ON v.id = f.id
    FULL OUTER JOIN keyword_results k ON v.id = k.id
)
SELECT id, title, content, rrf_score
FROM rrf_combined
ORDER BY rrf_score DESC
LIMIT 10;

Python Hybrid Search Wrapper

import psycopg2
from openai import OpenAI
from dataclasses import dataclass

@dataclass
class HybridSearchResult:
    id: int
    title: str
    content: str
    score: float

class PgVectorHybridSearch:
    def __init__(self, db_url: str, openai_api_key: str):
        self.conn = psycopg2.connect(db_url)
        self.client = OpenAI(api_key=openai_api_key)

    def embed(self, text: str) -> list[float]:
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=text,
        )
        return response.data[0].embedding

    def hybrid_search(
        self,
        query: str,
        top_k: int = 10,
        vector_weight: float = 0.5,
        fulltext_weight: float = 0.3,
        keyword_weight: float = 0.2,
        category: str | None = None,
    ) -> list[HybridSearchResult]:
        query_embedding = str(self.embed(query))
        category_filter = "AND category = %s" if category else ""
        params = [query_embedding, query_embedding]
        if category:
            params.append(category)
        params.extend([query_embedding, query])
        if category:
            params.append(category)
        params.extend([query, query])
        if category:
            params.append(category)
        params.extend([query_embedding, query, top_k])

        sql = f"""
        WITH vector_results AS (
            SELECT id, title, content,
                   ROW_NUMBER() OVER (ORDER BY embedding <=> %s::vector) AS v_rank
            FROM knowledge_base
            WHERE 1 - (embedding <=> %s::vector) > 0.5
            {category_filter}
            ORDER BY embedding <=> %s::vector
            LIMIT 50
        ),
        fulltext_results AS (
            SELECT id, title, content,
                   ROW_NUMBER() OVER (
                       ORDER BY ts_rank(tsv, plainto_tsquery('english', %s)) DESC
                   ) AS ft_rank
            FROM knowledge_base
            WHERE tsv @@ plainto_tsquery('english', %s)
            {category_filter}
            LIMIT 50
        ),
        keyword_results AS (
            SELECT id, title, content,
                   ROW_NUMBER() OVER (
                       ORDER BY
                       CASE WHEN title ILIKE '%%' || %s || '%%' THEN 0 ELSE 1 END
                   ) AS kw_rank
            FROM knowledge_base
            WHERE title ILIKE '%%' || %s || '%%'
               OR content ILIKE '%%' || %s || '%%'
            {category_filter}
            LIMIT 50
        ),
        rrf_combined AS (
            SELECT
                COALESCE(v.id, f.id, k.id) AS id,
                COALESCE(v.title, f.title, k.title) AS title,
                COALESCE(v.content, f.content, k.content) AS content,
                COALESCE(1.0 / (60 + v.v_rank), 0) * {vector_weight} +
                COALESCE(1.0 / (60 + f.ft_rank), 0) * {fulltext_weight} +
                COALESCE(1.0 / (60 + k.kw_rank), 0) * {keyword_weight} AS rrf_score
            FROM vector_results v
            FULL OUTER JOIN fulltext_results f ON v.id = f.id
            FULL OUTER JOIN keyword_results k ON v.id = k.id
        )
        SELECT id, title, content, rrf_score
        FROM rrf_combined
        ORDER BY rrf_score DESC
        LIMIT %s
        """

        cur = self.conn.cursor()
        cur.execute(sql, params)
        results = cur.fetchall()

        return [
            HybridSearchResult(id=r[0], title=r[1], content=r[2], score=r[3])
            for r in results
        ]

    def close(self):
        self.conn.close()

Pattern 4: Reranking & Result Fusion

Why Reranking

RRF fusion ranks results by position, not semantic relevance. Reranking with a Cross-Encoder re-scores candidates pairwise, significantly improving final result quality.

┌──────────────┐
│  Hybrid       │
│  Top-50       │
│  candidates   │
└──────┬───────┘
       │
       ▼
┌──────────────┐     ┌──────────────┐
│  Reranker     │────▶│  Refined     │
│  Cross-Encoder│     │  Top-10      │
│  Pairwise     │     │  High quality│
└──────────────┘     └──────────────┘

PostgreSQL In-DB Reranking (Business Rules)

WITH hybrid_results AS (
    SELECT id, title, content, rrf_score
    FROM rrf_combined
    ORDER BY rrf_score DESC
    LIMIT 30
)
SELECT
    id, title, content,
    rrf_score AS base_score,
    rrf_score
        * CASE
            WHEN published_at > CURRENT_DATE - INTERVAL '30 days' THEN 1.2
            WHEN published_at > CURRENT_DATE - INTERVAL '90 days' THEN 1.1
            ELSE 1.0
        END
        * CASE
            WHEN doc_type = 'official' THEN 1.3
            WHEN doc_type = 'tutorial' THEN 1.1
            ELSE 1.0
        END AS final_score
FROM hybrid_results h
JOIN knowledge_base kb ON h.id = kb.id
ORDER BY final_score DESC
LIMIT 10;

Python Cross-Encoder Reranking

from sentence_transformers import CrossEncoder
import psycopg2
from openai import OpenAI

class RerankingRAG:
    def __init__(self, db_url: str):
        self.conn = psycopg2.connect(db_url)
        self.client = OpenAI()
        self.reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def embed(self, text: str) -> list[float]:
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=text,
        )
        return response.data[0].embedding

    def search_with_rerank(self, query: str, top_k: int = 10, candidate_size: int = 30):
        query_embedding = str(self.embed(query))
        cur = self.conn.cursor()

        cur.execute(
            """SELECT id, title, content, 1 - (embedding <=> %s::vector) AS similarity
               FROM knowledge_base
               WHERE 1 - (embedding <=> %s::vector) > 0.5
               ORDER BY embedding <=> %s::vector
               LIMIT %s""",
            (query_embedding, query_embedding, query_embedding, candidate_size),
        )
        candidates = cur.fetchall()

        pairs = [(query, f"{row[1]}\n{row[2]}") for row in candidates]
        rerank_scores = self.reranker.predict(pairs)

        results = []
        for row, score in zip(candidates, rerank_scores):
            results.append({
                "id": row[0],
                "title": row[1],
                "content": row[2],
                "vector_score": float(row[3]),
                "rerank_score": float(score),
            })

        results.sort(key=lambda x: x["rerank_score"], reverse=True)
        return results[:top_k]

    def close(self):
        self.conn.close()

Pattern 5: Multi-Tenant Data Isolation

Row-Level Security (RLS)

ALTER TABLE knowledge_base ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON knowledge_base
    USING (tenant_id = current_setting('app.current_tenant_id')::INTEGER);

-- Set tenant context
SET app.current_tenant_id = '42';

-- Queries are automatically filtered
SELECT id, title, 1 - (embedding <=> $1::vector) AS similarity
FROM knowledge_base
ORDER BY embedding <=> $1::vector
LIMIT 10;
-- Only returns rows where tenant_id = 42

Partition-Based Isolation

CREATE TABLE knowledge_base_partitioned (
    LIKE knowledge_base INCLUDING DEFAULTS
) PARTITION BY LIST (tenant_id);

CREATE TABLE kb_tenant_1 PARTITION OF knowledge_base_partitioned FOR VALUES IN (1);
CREATE TABLE kb_tenant_2 PARTITION OF knowledge_base_partitioned FOR VALUES IN (2);
CREATE TABLE kb_tenant_3 PARTITION OF knowledge_base_partitioned FOR VALUES IN (3);
CREATE TABLE kb_tenant_default PARTITION OF knowledge_base_partitioned DEFAULT;

CREATE INDEX idx_kb_t1_embedding ON kb_tenant_1
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
CREATE INDEX idx_kb_t2_embedding ON kb_tenant_2
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
import psycopg2
from openai import OpenAI
from contextlib import contextmanager

@contextmanager
def tenant_session(conn: psycopg2.extensions.connection, tenant_id: int):
    cur = conn.cursor()
    cur.execute("SET app.current_tenant_id = %s", (str(tenant_id),))
    try:
        yield conn
    finally:
        cur.execute("RESET app.current_tenant_id")

class MultiTenantRAG:
    def __init__(self, db_url: str, openai_api_key: str):
        self.db_url = db_url
        self.client = OpenAI(api_key=openai_api_key)

    def embed(self, text: str) -> list[float]:
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=text,
        )
        return response.data[0].embedding

    def search(self, tenant_id: int, query: str, top_k: int = 10):
        conn = psycopg2.connect(self.db_url)
        query_embedding = str(self.embed(query))

        with tenant_session(conn, tenant_id):
            cur = conn.cursor()
            cur.execute(
                """SELECT id, title, content, 1 - (embedding <=> %s::vector) AS similarity
                   FROM knowledge_base
                   WHERE 1 - (embedding <=> %s::vector) > 0.6
                   ORDER BY embedding <=> %s::vector
                   LIMIT %s""",
                (query_embedding, query_embedding, query_embedding, top_k),
            )
            results = cur.fetchall()

        conn.close()
        return [
            {"id": r[0], "title": r[1], "content": r[2], "score": float(r[3])}
            for r in results
        ]

Pattern 6: Performance Tuning & Query Optimization

Key PostgreSQL Parameters

SET work_mem = '256MB';
SET maintenance_work_mem = '1GB';
SET max_parallel_workers_per_gather = 4;
SET hnsw.ef_search = 40;
SET ivfflat.probes = 10;

EXPLAIN ANALYZE

EXPLAIN ANALYZE
SELECT id, title, 1 - (embedding <=> '[0.1, 0.2, ...]'::vector) AS similarity
FROM knowledge_base
WHERE 1 - (embedding <=> '[0.1, 0.2, ...]'::vector) > 0.7
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 10;

-- Expected:
-- Index Scan using idx_kb_embedding on knowledge_base
--   Cost: 0.00..15.00  Rows: 10  Time: 2.5ms
--
-- If you see Seq Scan, the index isn't being used
-- Possible causes: dimension mismatch, missing cast, too few rows

Pre-Filter Optimization

-- Partial index for high-frequency category queries
CREATE INDEX idx_kb_ai_embedding ON knowledge_base
    USING hnsw (embedding vector_cosine_ops)
    WHERE category = 'AI';

Connection Pool Configuration

# pgbouncer.ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 50
reserve_pool_size = 10
reserve_pool_timeout = 3
server_idle_timeout = 300

Batch Write Optimization

import psycopg2
from openai import OpenAI
import io

client = OpenAI()

def bulk_insert_with_copy(
    conn_params: dict,
    records: list[dict],
):
    conn = psycopg2.connect(**conn_params)
    cur = conn.cursor()

    texts = [f"{r['title']}\n{r['content']}" for r in records]
    embeddings = []
    for i in range(0, len(texts), 100):
        batch = texts[i:i + 100]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
        )
        embeddings.extend([d.embedding for d in response.data])

    buffer = io.StringIO()
    for record, emb in zip(records, embeddings):
        line = "\t".join([
            str(record.get("tenant_id", 1)),
            record["title"].replace("\t", " ").replace("\n", " "),
            record["content"].replace("\t", " ").replace("\n", " "),
            str(emb),
        ])
        buffer.write(line + "\n")

    buffer.seek(0)
    cur.copy_from(buffer, "knowledge_base", columns=[
        "tenant_id", "title", "content", "embedding"
    ])
    conn.commit()
    cur.close()
    conn.close()

Pattern 7: Production Deployment & Backup Recovery

Docker Compose Production Config

services:
  postgres:
    image: pgvector/pgvector:pg16
    restart: always
    environment:
      POSTGRES_DB: rag_production
      POSTGRES_USER: rag_app
      POSTGRES_PASSWORD: ${PG_PASSWORD}
      POSTGRES_INITDB_ARGS: "--encoding=UTF-8"
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
    command: >
      postgres
        -c shared_buffers=4GB
        -c work_mem=256MB
        -c maintenance_work_mem=1GB
        -c max_parallel_workers_per_gather=4
        -c effective_cache_size=12GB
        -c random_page_cost=1.1
        -c wal_buffers=64MB
        -c checkpoint_completion_target=0.9
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U rag_app -d rag_production"]
      interval: 10s
      timeout: 5s
      retries: 5

  pgbouncer:
    image: edoburu/pgbouncer
    restart: always
    environment:
      DB_HOST: postgres
      DB_PORT: 5432
      DB_USER: rag_app
      DB_PASSWORD: ${PG_PASSWORD}
      DB_NAME: rag_production
      POOL_MODE: transaction
      MAX_CLIENT_CONN: 1000
      DEFAULT_POOL_SIZE: 50
    ports:
      - "6432:5432"
    depends_on:
      postgres:
        condition: service_healthy

volumes:
  pgdata:
    driver: local

init.sql

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE knowledge_base (
    id BIGSERIAL PRIMARY KEY,
    tenant_id INTEGER NOT NULL,
    doc_type VARCHAR(50),
    title TEXT NOT NULL,
    content TEXT NOT NULL,
    summary TEXT,
    keywords TEXT[],
    published_at DATE,
    access_level INTEGER DEFAULT 0,
    embedding vector(1536),
    tsv tsvector GENERATED ALWAYS AS (
        to_tsvector('english',
            coalesce(title, '') || ' ' ||
            coalesce(content, '') || ' ' ||
            coalesce(array_to_string(keywords, ' '), '')
        )
    ) STORED,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

ALTER TABLE knowledge_base ENABLE ROW LEVEL SECURITY;

CREATE INDEX idx_kb_tsv ON knowledge_base USING gin(tsv);
CREATE INDEX idx_kb_tenant_type ON knowledge_base(tenant_id, doc_type);
CREATE INDEX idx_kb_embedding ON knowledge_base
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

Backup Strategy

#!/bin/bash
DB_NAME="rag_production"
DB_USER="rag_app"
BACKUP_DIR="/backups/pgvector"
DATE=$(date +%Y%m%d_%H%M%S)

mkdir -p $BACKUP_DIR

pg_dump -U $DB_USER -d $DB_NAME -F c -f "${BACKUP_DIR}/full_${DATE}.dump"
pg_dump -U $DB_USER -d $DB_NAME -t knowledge_base -F c -f "${BACKUP_DIR}/kb_${DATE}.dump"

find $BACKUP_DIR -name "*.dump" -mtime +7 -delete

echo "Backup completed: ${DATE}"

Recovery

pg_restore -U rag_app -d rag_production -c "${BACKUP_DIR}/full_20260616_030000.dump"
pg_restore -U rag_app -d rag_production -t knowledge_base "${BACKUP_DIR}/kb_20260616_030000.dump"

Monitoring Queries

SELECT pg_size_pretty(pg_relation_size('idx_kb_embedding')) AS index_size;
SELECT pg_size_pretty(pg_total_relation_size('knowledge_base')) AS total_size;

SELECT schemaname, tablename, indexname, idx_scan
FROM pg_stat_user_indexes
WHERE indexname = 'idx_kb_embedding';

SELECT sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) AS cache_hit_ratio
FROM pg_statio_user_tables
WHERE relname = 'knowledge_base';

SELECT query, mean_exec_time, calls
FROM pg_stat_statements
WHERE query LIKE '%embedding%'
ORDER BY mean_exec_time DESC
LIMIT 10;

5 Common Pitfalls & Solutions

Pitfall 1: HNSW Index Build OOM

Symptom: Building HNSW index on millions of rows causes OOM, PostgreSQL process killed.

Solution: Increase maintenance_work_mem, lower ef_construction, use CONCURRENTLY.

SET maintenance_work_mem = '2GB';
CREATE INDEX CONCURRENTLY idx_docs_embedding ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 32);

Pitfall 2: Vector Dimension Mismatch

Symptom: ERROR: expected 1536 dimensions, not 768.

Solution: Verify embedding model output matches table definition.

Model Output Dimensions
text-embedding-3-small 1536
text-embedding-3-large 3072
text-embedding-ada-002 1536
bge-large-en-v1.5 1024
bge-small-en-v1.5 384
ALTER TABLE knowledge_base ALTER COLUMN embedding TYPE vector(768);

Pitfall 3: IVFFlat Recall Drops After Inserts

Symptom: Search results become inaccurate after large INSERTs.

Solution: IVFFlat centroids are fixed at index creation. Periodically REINDEX or switch to HNSW.

REINDEX INDEX CONCURRENTLY idx_docs_embedding_ivf;

Pitfall 4: Hybrid Search Performance

Symptom: Three-way hybrid search takes over 2 seconds.

Solution: Limit candidate set size (50 each), ensure indexes are hit, use EXPLAIN ANALYZE.

EXPLAIN ANALYZE
SELECT id FROM knowledge_base
ORDER BY embedding <=> '[0.1,...]'::vector
LIMIT 50;
-- Should show Index Scan, not Seq Scan

Pitfall 5: Multi-Tenant Data Leakage

Symptom: Tenant A's query returns Tenant B's data.

Solution: Enable RLS, ensure all queries include tenant_id filtering. Partitions provide stronger isolation.

ALTER TABLE knowledge_base ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON knowledge_base
    USING (tenant_id = current_setting('app.current_tenant_id')::INTEGER);

10 Common Error Troubleshooting

# Error Cause Solution
1 extension "vector" not available pgvector not installed Use pgvector/pgvector Docker image or compile from source
2 expected 1536 dimensions, not 768 Dimension mismatch Check embedding model, modify table vector(N)
3 operator does not exist: vector <=> unknown Missing type cast Use $1::vector explicit cast
4 index row size exceeds maximum 2712 HNSW row too large Lower m parameter, or use halfvec(1536)
5 cannot create index concurrently pgvector version too old Upgrade to 0.7+
6 out of memory during index build Insufficient memory Increase maintenance_work_mem, lower ef_construction
7 invalid input syntax for type vector Invalid vector format Use [0.1, 0.2, ...] format
8 permission denied for schema public No CREATE EXTENSION privilege Use superuser to run CREATE EXTENSION
9 IVFFlat recall drops after inserts New data deviates from centroids Periodic REINDEX or switch to HNSW
10 sequential scan on vector column Index not being used Check dimension match, type cast, data volume

Advanced Optimization Techniques

1. Half-Precision Vectors (50% Storage Savings)

ALTER TABLE knowledge_base ALTER COLUMN embedding TYPE halfvec(1536);

CREATE INDEX idx_kb_embedding_half ON knowledge_base
    USING hnsw (embedding halfvec_cosine_ops)
    WITH (m = 16, ef_construction = 64);

2. Sparse Vectors

CREATE TABLE sparse_docs (
    id BIGSERIAL PRIMARY KEY,
    content TEXT,
    embedding sparsevec(30000)
);

INSERT INTO sparse_docs (content, embedding)
VALUES ('example', '{1:0.1, 5:0.3, 100:0.7}/30000');

3. Time-Based Partitioning for Hot/Cold Data

CREATE TABLE knowledge_base_partitioned (
    LIKE knowledge_base INCLUDING DEFAULTS
) PARTITION BY RANGE (created_at);

CREATE TABLE kb_2026_q1 PARTITION OF knowledge_base_partitioned
    FOR VALUES FROM ('2026-01-01') TO ('2026-04-01');
CREATE TABLE kb_2026_q2 PARTITION OF knowledge_base_partitioned
    FOR VALUES FROM ('2026-04-01') TO ('2026-07-01');
CREATE TABLE kb_archive PARTITION OF knowledge_base_partitioned
    DEFAULT;

ALTER TABLE kb_archive SET TABLESPACE slow_disk;

4. Async Embedding Updates

import psycopg2
import redis
import json
from openai import OpenAI

r = redis.Redis()
client = OpenAI()

def enqueue_embedding_update(doc_id: int, text: str):
    r.lpush("embedding_queue", json.dumps({"doc_id": doc_id, "text": text}))

def process_embedding_queue():
    conn = psycopg2.connect("dbname=mydb user=postgres password=password")
    cur = conn.cursor()

    while True:
        _, data = r.brpop("embedding_queue", timeout=30)
        if data is None:
            continue
        task = json.loads(data)
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=task["text"],
        )
        embedding = response.data[0].embedding

        cur.execute(
            "UPDATE knowledge_base SET embedding = %s WHERE id = %s",
            (str(embedding), task["doc_id"]),
        )
        conn.commit()

    cur.close()
    conn.close()

5. Query Caching

import hashlib
import json
import redis

r = redis.Redis()

def cached_vector_search(query: str, search_fn, ttl: int = 3600):
    cache_key = f"vec_search:{hashlib.md5(query.encode()).hexdigest()}"
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)

    results = search_fn(query)
    r.setex(cache_key, ttl, json.dumps(results))
    return results

Comparison: pgvector vs Pinecone vs Milvus vs Qdrant

Dimension pgvector Pinecone Milvus Qdrant
Deployment PG extension Fully managed Self/managed Self/managed
Ops Cost Minimal (reuse PG) High (per-query billing) High Medium
Hybrid Query Native SQL Limited filtering Limited filtering Filters
Transactions ACID None None None
Max Dimensions 16000 20000 32768 65535
Index Types HNSW/IVFFlat Proprietary HNSW/IVF/DiskANN HNSW
Horizontal Scale Needs Citus Automatic Automatic Automatic
Scale Millions Billions Billions Hundreds of millions
Consistency Strong Eventual Eventual Eventual
Half-Precision halfvec Yes Yes Yes
Multi-Tenant RLS/Partition Namespaces Collections Collections
Learning Curve Low (SQL) Low (API) High Medium
Cost (1M vectors) $0 (existing PG) ~$70/month ~$50/month (self) ~$40/month (self)

Selection Guide:

  • Already have PostgreSQL + million-scale data → pgvector (zero extra cost)
  • Need fully managed + budget OK → Pinecone
  • Billion-scale + horizontal scaling → Milvus
  • Medium scale + flexible filtering → Qdrant



Summary

PostgreSQL pgvector RAG has become the go-to solution for small-to-medium RAG systems in 2026. The 7 production patterns cover the full pipeline from table design, index optimization, hybrid search, reranking, multi-tenant isolation to production deployment. Core advantages: zero additional ops cost, native SQL hybrid queries, ACID transaction guarantees. HNSW index performs excellently at million-scale, and three-way hybrid search (vector + full-text + keyword) + RRF fusion + Cross-Encoder reranking delivers recall far beyond pure vector search. Only consider dedicated vector databases when data reaches billion-scale or horizontal scaling is required.


References:

Try these browser-local tools — no sign-up required →

#PostgreSQL#pgvector#RAG#向量检索#语义搜索#混合检索#2026#数据库