Spring Boot 3.5 + AI RAG in Action: 6 Production Patterns from Vector Search to Intelligent Q&A

The Java Developer's AI Dilemma: You Can Write Code, But Can You Make It "Understand Knowledge"?

You spent three months training an enterprise knowledge base LLM. On day one, a user asks about an internal term the model has never seen — and it hallucinates with absolute confidence.

This isn't a joke; it's the reality for Java teams doing AI integration in 2026. LLMs have reasoning capability but lack your enterprise data; your databases have data but lack reasoning capability. RAG (Retrieval-Augmented Generation) is the bridge between the two.

But here's the problem: Python RAG tutorials are everywhere, while production-oriented material for the Java ecosystem is still scarce. Spring AI 1.0 is GA, but the RAG examples in its docs stay at the "Hello World" level. Production-grade RAG needs more: vector store selection, document chunking strategies, hybrid search, conversation memory, pipeline orchestration, and monitoring and alerting.

This article presents 6 production-ready RAG patterns built on Spring Boot 3.5 + Spring AI 1.0, each with complete, runnable Java code.

Key Takeaways

Master the complete integration of Spring AI + PgVector vector store
Understand best practices and performance tuning for document chunking and embedding generation
Implement hybrid vector + keyword search to raise recall on exact-match queries (the 40%+ improvement cited here depends on corpus and query mix)
Build a multi-turn Q&A system with conversation memory
Learn RAG pipeline orchestration and production deployment monitoring
Avoid the 5 most common RAG implementation pitfalls

RAG Architecture Overview
Pattern 1: Spring AI + PgVector Vector Store Integration
Pattern 2: Document Chunking and Embedding Generation
Pattern 3: Hybrid Search (Vector + Keyword)
Pattern 4: Conversation Memory and Multi-Turn Q&A
Pattern 5: RAG Pipeline Orchestration
Pattern 6: Production Deployment and Monitoring
5 Common Pitfalls and Solutions
Troubleshooting 10 Common Errors
Advanced Optimization Tips
Comparison: 3 Vector Database Solutions
Recommended Online Tools

RAG Architecture Overview

RAG is not simply "search first, then answer" — it's a complete knowledge processing pipeline:

┌─────────────────────────────────────────────────────────────────────┐
│                 RAG Complete Architecture (SpringBoot 3.5)           │
│                                                                     │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐      │
│  │ Offline   │    │          │    │          │    │          │      │
│  │ Indexing  │    │          │    │          │    │          │      │
│  │          │    │          │    │          │    │          │      │
│  │ Document │───▶│ Document │───▶│ Embedding│───▶│ Vector   │      │
│  │ Loading  │    │ Chunking │    │ Generation│   │ Store    │      │
│  │ PDF/DOCX │    │ 512token │    │ OpenAI   │    │ PgVector │      │
│  │ Markdown │    │ overlap64│    │ BGE      │    │ Milvus   │      │
│  │ HTML     │    │          │    │          │    │ Chroma   │      │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘      │
│                                                                     │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐      │
│  │ Online   │    │          │    │          │    │          │      │
│  │ Query    │    │          │    │          │    │          │      │
│  │          │    │          │    │          │    │          │      │
│  │ User     │───▶│ Hybrid   │───▶│ Context  │───▶│ LLM      │      │
│  │ Query    │    │ Search   │    │ Assembly │    │ Generation│     │
│  │          │    │ Vec+BM25 │    │ Prompt   │    │ GPT-4o   │      │
│  │          │    │ Rerank   │    │ Template │    │ DeepSeek │      │
│  │          │    │          │    │          │    │ Qwen     │      │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘      │
│         │                                              │           │
│         │              ┌──────────┐                    │           │
│         └─────────────▶│ Convo    │◀───────────────────┘           │
│                        │ Memory   │                                │
│                        │ Redis    │                                │
│                        │ Window   │                                │
│                        └──────────┘                                │
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                  Observability & Governance Layer            │    │
│  │  OpenTelemetry · Prometheus · Alerting · Rate Limiting      │    │
│  │  · Circuit Breaking                                         │    │
│  └─────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────┘

Why Choose Spring Boot 3.5 for RAG

Feature	Spring Boot 3.5	Python FastAPI
Virtual Threads	Native support on Java 21+ (the 5x I/O throughput figure is workload-dependent)	No direct equivalent; relies on asyncio
Vector Store Abstraction	Unified VectorStore interface	Inconsistent APIs across libraries
Dependency Injection	Auto-configuration, zero boilerplate	Manual lifecycle management
Streaming Response	WebFlux + Flux	Manual SSE implementation
Enterprise Security	Spring Security integration	Requires additional middleware
Monitoring	Actuator + Micrometer	Self-integration required

Pattern 1: Spring AI + PgVector Vector Store Integration

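PgVector is a PostgreSQL extension that adds a vector column type and similarity search. For Java teams already running PostgreSQL, its biggest advantage is low operational overhead: you enable an extension on an existing instance instead of deploying a separate vector database.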

1.1 Environment Setup

-- Enable pgvector extension in PostgreSQL
CREATE EXTENSION IF NOT EXISTS vector;

-- Create vector store table (Spring AI can auto-create; shown here for reference)
CREATE TABLE vector_store (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    content TEXT NOT NULL,
    metadata JSONB DEFAULT '{}',
    embedding VECTOR(1536)  -- OpenAI text-embedding-3-small dimension
);

-- Create HNSW index (usually a better speed/recall trade-off than IVFFlat, at the cost of slower builds and more memory)
CREATE INDEX ON vector_store
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
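
The `vector_cosine_ops` operator class orders rows by cosine distance, which is 1 minus cosine similarity. As a rough illustration of what that number means, here is a small plain-Java sketch. The class and method names are hypothetical and are not part of pgvector or Spring AI; the database does this work for you in production.

```java
/** Illustrative only: mirrors the cosine distance that a cosine-ordered vector index ranks by. */
final class CosineMath {

    private CosineMath() {}

    /** Cosine similarity in [-1, 1]; rejects mismatched dimensions and zero-length vectors. */
    static double cosineSimilarity(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Dimension mismatch: " + a.length + " vs " + b.length);
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("Cosine similarity is undefined for a zero vector");
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Cosine distance: 0 for vectors pointing the same way, 1 for orthogonal vectors. */
    static double cosineDistance(float[] a, float[] b) {
        return 1.0 - cosineSimilarity(a, b);
    }
}
```

A similarity threshold such as 0.7 therefore corresponds to a cosine distance of at most 0.3.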

1.2 Maven Dependency Configuration

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <!-- Starter artifacts were renamed for Spring AI 1.0 GA -->
        <artifactId>spring-ai-starter-model-openai</artifactId>
        <version>1.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-starter-vector-store-pgvector</artifactId>
        <version>1.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.postgresql</groupId>
        <artifactId>postgresql</artifactId>
    </dependency>
</dependencies>

1.3 Application Configuration

spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      base-url: ${OPENAI_BASE_URL:https://api.openai.com}
      embedding:
        options:
          model: text-embedding-3-small
    vectorstore:
      pgvector:
        index-type: HNSW
        distance-type: COSINE_DISTANCE
        dimensions: 1536
        initialize-schema: true
  datasource:
    url: jdbc:postgresql://localhost:5432/rag_db
    username: ${PG_USERNAME}
    password: ${PG_PASSWORD}
    hikari:
      maximum-pool-size: 20
      minimum-idle: 5

1.4 Vector Store Service

@Service
public class VectorStoreService {

    private final VectorStore vectorStore;
    private final EmbeddingModel embeddingModel;

    public VectorStoreService(VectorStore vectorStore, EmbeddingModel embeddingModel) {
        this.vectorStore = vectorStore;
        this.embeddingModel = embeddingModel;
    }

    public void indexDocument(String content, Map<String, Object> metadata) {
        Document document = new Document(content, metadata);
        vectorStore.add(List.of(document));
    }

    public void indexDocuments(List<Document> documents) {
        vectorStore.add(documents);
    }

    public List<Document> search(String query, int topK) {
        return vectorStore.similaritySearch(
            SearchRequest.builder()
                .query(query)
                .topK(topK)
                .similarityThreshold(0.7)
                .build()
        );
    }

    public List<Document> searchWithFilter(String query, int topK, String filterExpression) {
        return vectorStore.similaritySearch(
            SearchRequest.builder()
                .query(query)
                .topK(topK)
                .similarityThreshold(0.7)
                .filterExpression(filterExpression)
                .build()
        );
    }

    public void deleteDocuments(List<String> ids) {
        vectorStore.delete(ids);
    }
}

1.5 REST API Exposure

@RestController
@RequestMapping("/api/v1/vectors")
public class VectorStoreController {

    private final VectorStoreService vectorStoreService;

    public VectorStoreController(VectorStoreService vectorStoreService) {
        this.vectorStoreService = vectorStoreService;
    }

    @PostMapping("/index")
    public ResponseEntity<String> indexDocument(@RequestBody IndexRequest request) {
        vectorStoreService.indexDocument(request.content(), request.metadata());
        return ResponseEntity.ok("Document indexed successfully");
    }

    @PostMapping("/search")
    public ResponseEntity<List<SearchResult>> search(@RequestBody SearchRequestDto request) {
        List<Document> results = vectorStoreService.search(request.query(), request.topK());
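        List<SearchResult> searchResults = results.stream()
            .map(doc -> {
                // The store may expose "distance" as a Float, so convert via Number
                // instead of casting to Double, which would throw ClassCastException
                Object distance = doc.getMetadata().get("distance");
                return new SearchResult(
                    doc.getId(),
                    doc.getText(),
                    doc.getMetadata(),
                    distance instanceof Number n ? n.doubleValue() : null
                );
            })
            .toList();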
        return ResponseEntity.ok(searchResults);
    }

    @PostMapping("/search/filtered")
    public ResponseEntity<List<SearchResult>> searchWithFilter(
            @RequestBody FilteredSearchRequest request) {
        List<Document> results = vectorStoreService.searchWithFilter(
            request.query(), request.topK(), request.filterExpression()
        );
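        List<SearchResult> searchResults = results.stream()
            .map(doc -> {
                // The store may expose "distance" as a Float, so convert via Number
                // instead of casting to Double, which would throw ClassCastException
                Object distance = doc.getMetadata().get("distance");
                return new SearchResult(
                    doc.getId(),
                    doc.getText(),
                    doc.getMetadata(),
                    distance instanceof Number n ? n.doubleValue() : null
                );
            })
            .toList();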
        return ResponseEntity.ok(searchResults);
    }

    @DeleteMapping("/{id}")
    public ResponseEntity<Void> deleteDocument(@PathVariable String id) {
        vectorStoreService.deleteDocuments(List.of(id));
        return ResponseEntity.noContent().build();
    }
}

record IndexRequest(String content, Map<String, Object> metadata) {}
record SearchRequestDto(String query, int topK) {}
record FilteredSearchRequest(String query, int topK, String filterExpression) {}
record SearchResult(String id, String content, Map<String, Object> metadata, Double distance) {}

1.6 Metadata Filtering in Practice

Spring AI supports a portable, SQL-like filter expression language for metadata, which is critical in multi-tenant scenarios:

@Service
public class MetadataFilterService {

    private final VectorStore vectorStore;

    public MetadataFilterService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

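    // Filters are built with FilterExpressionBuilder
    // (org.springframework.ai.vectorstore.filter) rather than by concatenating
    // strings, so user-supplied values are never spliced into the filter DSL.

    public List<Document> searchByTenant(String query, String tenantId) {
        FilterExpressionBuilder b = new FilterExpressionBuilder();
        return vectorStore.similaritySearch(
            SearchRequest.builder()
                .query(query)
                .topK(5)
                .filterExpression(b.eq("tenantId", tenantId).build())
                .build()
        );
    }

    public List<Document> searchByDepartmentAndDate(
            String query, String department, String dateAfter) {
        FilterExpressionBuilder b = new FilterExpressionBuilder();
        return vectorStore.similaritySearch(
            SearchRequest.builder()
                .query(query)
                .topK(5)
                .filterExpression(
                    // createdAt is compared as a string, so store it in ISO-8601 form
                    b.and(b.eq("department", department), b.gte("createdAt", dateAfter)).build()
                )
                .build()
        );
    }

    public List<Document> searchByTags(String query, List<String> tags) {
        FilterExpressionBuilder b = new FilterExpressionBuilder();
        return vectorStore.similaritySearch(
            SearchRequest.builder()
                .query(query)
                .topK(5)
                // The filter DSL has no contains(); IN is the closest portable operator.
                // How it matches array-valued metadata depends on the store, so verify it against your data.
                .filterExpression(b.in("tags", tags.toArray()).build())
                .build()
        );
    }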
}
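
Whichever way the filter is assembled, values that arrive in request parameters are worth validating before they reach the vector store, because a crafted tenant ID is the easiest way to read another tenant's documents. The helper below is a minimal allow-list sketch. The class name, method name, and the accepted pattern are assumptions for illustration; adjust the pattern to whatever your identifiers actually look like.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: rejects filter values that are not simple identifiers. */
final class FilterValues {

    // Assumption: tenant and department identifiers are short slugs such as "acme-eu_01".
    private static final Pattern SAFE_ID = Pattern.compile("[A-Za-z0-9_-]{1,64}");

    private FilterValues() {}

    /** Returns the value unchanged if it matches the allow-list, otherwise throws. */
    static String requireSafeId(String value) {
        if (value == null || !SAFE_ID.matcher(value).matches()) {
            // Deliberately vague message: do not echo attacker-controlled input back to the caller
            throw new IllegalArgumentException("Rejected filter value");
        }
        return value;
    }
}
```

Calling `FilterValues.requireSafeId(tenantId)` at the start of each search method keeps quotes, spaces, and operators out of the filter entirely.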

Pattern 2: Document Chunking and Embedding Generation

Chunking strategy directly determines the ceiling of RAG effectiveness. Chunks too large = noisy retrieval; chunks too small = incomplete semantics.

2.1 Chunking Strategy Comparison

Strategy	Use Case	Pros	Cons
Fixed-size chunking	General documents	Simple, stable performance	May break semantic boundaries
Recursive character chunking	Markdown/code	Preserves structural boundaries	Requires parameter tuning
Semantic chunking	High-quality documents	Best semantic completeness	High compute cost
Sentence window chunking	Precise Q&A	Rich context	High storage overhead
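
For fixed-size chunking with overlap, the number of chunks follows from the step size (chunk size minus overlap), not from the chunk size alone. The sketch below works through that arithmetic, assuming the splitter stops as soon as a chunk reaches the end of the text. `ChunkMath` is a hypothetical helper, and the unit is whatever the splitter counts (characters or tokens).

```java
/** Hypothetical helper: predicts how many chunks fixed-size chunking with overlap produces. */
final class ChunkMath {

    private ChunkMath() {}

    static int chunkCount(int length, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("Require chunkSize > 0 and 0 <= overlap < chunkSize");
        }
        if (length <= 0) {
            return 0;
        }
        if (length <= chunkSize) {
            return 1; // everything fits in the first chunk
        }
        int step = chunkSize - overlap;       // how far each new chunk advances
        int remaining = length - chunkSize;   // text not covered by the first chunk
        return 1 + (remaining + step - 1) / step; // ceiling division
    }
}
```

With the 512/64 setting from the architecture diagram, a 1200-unit document advances in steps of 448 and yields 3 chunks: [0, 512), [448, 960), and [896, 1200).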

2.2 Spring AI Document Processing Pipeline

@Configuration
public class DocumentProcessingConfig {

    @Bean
    public DocumentTransformer documentTransformer() {
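        // TokenTextSplitter has no overlap setting. Its five arguments are listed below;
        // 5 and 10000 are the library defaults for the two middle values.
        return new TokenTextSplitter(
            512,    // chunkSize: target chunk size in tokens
            64,     // minChunkSizeChars: minimum chunk size in characters
            5,      // minChunkLengthToEmbed: discard chunks shorter than this
            10000,  // maxNumChunks: upper bound on chunks per document
            true    // keepSeparator: keep line breaks in the chunk text
        );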
    }

    @Bean
    public DocumentReader markdownReader() {
        return new MarkdownDocumentReader(
            new ClassPathResource("docs/knowledge-base.md"),
            MarkdownDocumentReaderConfig.builder()
                .withHorizontalRuleCreateDocument(true)
                .withIncludeCodeBlock(true)
                .withIncludeBlockquote(true)
                .withAdditionalMetadata("source", "knowledge-base")
                .build()
        );
    }
}

2.3 Custom Chunking Strategies

@Service
public class CustomChunkingService {

    private final EmbeddingModel embeddingModel;

    public CustomChunkingService(EmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
    }

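    public List<Document> chunkWithOverlap(String content, int chunkSize, int overlap) {
        // Guard against a non-positive step, which would otherwise loop forever
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("Require chunkSize > 0 and 0 <= overlap < chunkSize");
        }

        List<Document> chunks = new ArrayList<>();
        int step = chunkSize - overlap;
        int chunkIndex = 0;

        for (int start = 0; start < content.length(); start += step) {
            int end = Math.min(start + chunkSize, content.length());

            Map<String, Object> metadata = new HashMap<>();
            metadata.put("chunkIndex", chunkIndex++);
            metadata.put("startOffset", start);
            metadata.put("endOffset", end);
            chunks.add(new Document(content.substring(start, end), metadata));

            // Stop once a chunk reaches the end; the next chunk would only repeat the overlap
            if (end == content.length()) {
                break;
            }
        }

        // The chunk count depends on the step (chunkSize - overlap), so record it after splitting
        int totalChunks = chunks.size();
        chunks.forEach(chunk -> chunk.getMetadata().put("totalChunks", totalChunks));
        return chunks;
    }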

    public List<Document> chunkMarkdownByHeaders(String markdownContent) {
        List<Document> chunks = new ArrayList<>();
        // (?m) enables MULTILINE so ^ matches at every line start; without it the text is never split
        String[] sections = markdownContent.split("(?m)(?=^#{1,6}\\s)");

        for (String section : sections) {
            if (section.isBlank()) continue;

            String[] lines = section.split("\n", 2);
            String header = lines[0].trim();
            String body = lines.length > 1 ? lines[1].trim() : "";

            if (body.length() > 512) {
                List<Document> subChunks = chunkWithOverlap(body, 512, 64);
                for (Document subChunk : subChunks) {
                    subChunk.getMetadata().put("sectionHeader", header);
                    chunks.add(subChunk);
                }
            } else if (!body.isEmpty()) {
                Map<String, Object> metadata = new HashMap<>();
                metadata.put("sectionHeader", header);
                chunks.add(new Document(body, metadata));
            }
        }

        return chunks;
    }

    public List<Document> chunkWithSemanticBoundary(String text, int maxChunkSize) {
        String[] sentences = text.split("(?<=[.!?])");
        List<Document> chunks = new ArrayList<>();
        StringBuilder currentChunk = new StringBuilder();

        for (String sentence : sentences) {
            if (currentChunk.length() + sentence.length() > maxChunkSize
                    && currentChunk.length() > 0) {
                Map<String, Object> metadata = new HashMap<>();
                metadata.put("chunkStrategy", "semantic");
                metadata.put("sentenceCount", countSentences(currentChunk.toString()));
                chunks.add(new Document(currentChunk.toString().trim(), metadata));
                currentChunk = new StringBuilder();
            }
            currentChunk.append(sentence);
        }

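        if (currentChunk.length() > 0) {
            Map<String, Object> metadata = new HashMap<>();
            metadata.put("chunkStrategy", "semantic");
            // Record sentenceCount on the final chunk too, so every chunk carries the same metadata keys
            metadata.put("sentenceCount", countSentences(currentChunk.toString()));
            chunks.add(new Document(currentChunk.toString().trim(), metadata));
        }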

        return chunks;
    }

    private int countSentences(String text) {
        return text.split("[.!?]").length;
    }
}
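
The comparison table in 2.1 also lists sentence window chunking, which the service above does not cover. The idea is to embed each sentence on its own for precise matching, and to keep its neighbouring sentences alongside it so the LLM still sees enough context. The following is a minimal plain-Java sketch of that idea; `SentenceWindowChunker` and `WindowedSentence` are hypothetical names, and the sentence splitter is a simplification that ignores abbreviations and decimal numbers.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of sentence window chunking. */
final class SentenceWindowChunker {

    /** One retrieval unit: the sentence to embed plus the surrounding window to pass to the LLM. */
    record WindowedSentence(int index, String sentence, String window) {}

    private SentenceWindowChunker() {}

    static List<WindowedSentence> chunk(String text, int windowSize) {
        if (windowSize < 0) {
            throw new IllegalArgumentException("windowSize must be >= 0");
        }
        // Split on whitespace that follows ., ! or ?
        List<String> sentences = new ArrayList<>();
        for (String part : text.strip().split("(?<=[.!?])\\s+")) {
            if (!part.isBlank()) {
                sentences.add(part.strip());
            }
        }

        List<WindowedSentence> result = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i++) {
            // Window covers windowSize sentences on each side, clamped to the text bounds
            int from = Math.max(0, i - windowSize);
            int to = Math.min(sentences.size(), i + windowSize + 1);
            String window = String.join(" ", sentences.subList(from, to));
            result.add(new WindowedSentence(i, sentences.get(i), window));
        }
        return result;
    }
}
```

In an indexing pipeline, `sentence` would become the embedded text and `window` would go into the chunk's metadata, which is where the storage overhead noted in the table comes from.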

2.4 Batch Embedding Generation and Indexing

@Service
public class EmbeddingIndexService {

    private static final int BATCH_SIZE = 100;
    private final VectorStore vectorStore;
    private final DocumentTransformer textSplitter;
    private final CustomChunkingService customChunkingService;

    public EmbeddingIndexService(
            VectorStore vectorStore,
            DocumentTransformer textSplitter,
            CustomChunkingService customChunkingService) {
        this.vectorStore = vectorStore;
        this.textSplitter = textSplitter;
        this.customChunkingService = customChunkingService;
    }

    @Async
    public CompletableFuture<Integer> indexFile(Resource resource, String source) {
        try {
            DocumentReader reader = createReader(resource, source);
            List<Document> documents = reader.get();
            List<Document> splitDocuments = textSplitter.apply(documents);

            for (int i = 0; i < splitDocuments.size(); i += BATCH_SIZE) {
                List<Document> batch = splitDocuments.subList(
                    i, Math.min(i + BATCH_SIZE, splitDocuments.size())
                );
                vectorStore.add(batch);
            }

            return CompletableFuture.completedFuture(splitDocuments.size());
        } catch (Exception e) {
            return CompletableFuture.failedFuture(e);
        }
    }

    public int indexMarkdownContent(String content, String source) {
        List<Document> chunks = customChunkingService.chunkMarkdownByHeaders(content);
        chunks.forEach(doc -> doc.getMetadata().putIfAbsent("source", source));

        for (int i = 0; i < chunks.size(); i += BATCH_SIZE) {
            List<Document> batch = chunks.subList(
                i, Math.min(i + BATCH_SIZE, chunks.size())
            );
            vectorStore.add(batch);
        }

        return chunks.size();
    }

    public int reindexAll(List<Resource> resources) {
        return resources.parallelStream()
            .mapToInt(resource -> {
                try {
                    DocumentReader reader = createReader(resource, resource.getFilename());
                    List<Document> documents = reader.get();
                    List<Document> splitDocuments = textSplitter.apply(documents);
                    vectorStore.add(splitDocuments);
                    return splitDocuments.size();
                } catch (Exception e) {
                    return 0;
                }
            })
            .sum();
    }

    private DocumentReader createReader(Resource resource, String source) {
        String filename = resource.getFilename();
        if (filename != null && filename.endsWith(".md")) {
            return new MarkdownDocumentReader(
                resource,
                MarkdownDocumentReaderConfig.builder()
                    .withAdditionalMetadata("source", source)
                    .build()
            );
        }
        return new TextReader(resource);
    }
}

2.5 Embedding Model Performance Comparison

Model	Dimensions	Speed (tokens/s)	Quality (MTEB)	Price (/1M tokens)
text-embedding-3-small	1536	15000	62.3%	$0.02
text-embedding-3-large	3072	8000	64.6%	$0.13
BGE-M3	1024	12000	63.1%	Free (self-hosted)
bge-large-zh-v1.5	1024	10000	64.2% (Chinese)	Free (self-hosted)
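
To turn the price column into a budget, multiply the corpus size in tokens by the per-million price. The sketch below does that with `BigDecimal` to avoid floating-point drift. `EmbeddingCost` is a hypothetical helper, and the corpus size in the note after it is an assumed example, not a measurement.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

/** Hypothetical helper: estimates a one-off embedding bill from a per-million-token price. */
final class EmbeddingCost {

    private static final BigDecimal ONE_MILLION = BigDecimal.valueOf(1_000_000);

    private EmbeddingCost() {}

    /** Returns the estimated cost in USD, rounded to 4 decimal places. */
    static BigDecimal estimateUsd(long totalTokens, BigDecimal pricePerMillionTokens) {
        if (totalTokens < 0) {
            throw new IllegalArgumentException("totalTokens must be >= 0");
        }
        return pricePerMillionTokens
            .multiply(BigDecimal.valueOf(totalTokens))
            .divide(ONE_MILLION, 4, RoundingMode.HALF_UP);
    }
}
```

For an assumed corpus of 100,000 chunks at 512 tokens each (51.2M tokens), the table's prices give about $1.02 with text-embedding-3-small and about $6.66 with text-embedding-3-large. Re-indexing after a chunking change incurs the same cost again.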

Pattern 3: Hybrid Search (Vector + Keyword)

Pure vector search performs poorly with proper nouns, product codes, and other exact-match scenarios. Hybrid search is a must for production.

3.1 Hybrid Search Architecture

┌─────────────┐
│  User Query  │
└──────┬──────┘
       │
       ├──────────────────┐
       ▼                  ▼
┌──────────────┐   ┌──────────────┐
│  Vector      │   │  Keyword     │
│  Search      │   │  Search      │
│  PgVector    │   │  Full-Text   │
│  Semantic    │   │  BM25        │
│  topK=10     │   │  topK=10     │
└──────┬───────┘   └──────┬───────┘
       │                  │
       ▼                  ▼
┌─────────────────────────────────┐
│      Result Fusion & Rerank      │
│   Reciprocal Rank Fusion (RRF)  │
│   or Cohere Rerank API          │
└──────────────┬──────────────────┘
               │
               ▼
        ┌──────────────┐
        │  Top-N Results│
        └──────────────┘

3.2 Hybrid Search Implementation

@Service
public class HybridSearchService {

    private final VectorStore vectorStore;
    private final JdbcTemplate jdbcTemplate;
    private final ChatModel chatModel;

    public HybridSearchService(
            VectorStore vectorStore,
            JdbcTemplate jdbcTemplate,
            ChatModel chatModel) {
        this.vectorStore = vectorStore;
        this.jdbcTemplate = jdbcTemplate;
        this.chatModel = chatModel;
    }

    public List<ScoredDocument> hybridSearch(String query, int topK) {
        List<ScoredDocument> vectorResults = vectorSearch(query, topK * 2);
        List<ScoredDocument> keywordResults = keywordSearch(query, topK * 2);
        List<ScoredDocument> fused = reciprocalRankFusion(vectorResults, keywordResults);
        return fused.stream().limit(topK).toList();
    }

    private List<ScoredDocument> vectorSearch(String query, int topK) {
        List<Document> docs = vectorStore.similaritySearch(
            SearchRequest.builder()
                .query(query)
                .topK(topK)
                .similarityThreshold(0.5)
                .build()
        );
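        return docs.stream()
            .map(doc -> new ScoredDocument(
                doc.getId(),
                doc.getText(),
                doc.getMetadata(),
                // Convert via Number: the store may expose "distance" as a Float, and a Double cast would fail
                1.0 - ((Number) doc.getMetadata().getOrDefault("distance", 1.0)).doubleValue()
            ))
            .toList();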
    }

    private List<ScoredDocument> keywordSearch(String query, int topK) {
        // Uses the generated tsv column from section 3.3 so the GIN index can serve the match
        String sql = """
            SELECT id, content, metadata,
                   ts_rank_cd(tsv, plainto_tsquery('simple', ?)) AS rank
            FROM vector_store
            WHERE tsv @@ plainto_tsquery('simple', ?)
            ORDER BY rank DESC
            LIMIT ?
            """;

        return jdbcTemplate.query(sql, (rs, rowNum) -> {
            Map<String, Object> metadata = new HashMap<>();
            try {
                String metadataJson = rs.getString("metadata");
                metadata = new ObjectMapper().readValue(metadataJson, new TypeReference<>() {});
            } catch (Exception ignored) {}

            return new ScoredDocument(
                rs.getString("id"),
                rs.getString("content"),
                metadata,
                rs.getDouble("rank")
            );
        }, query, query, topK);
    }

    private List<ScoredDocument> reciprocalRankFusion(
            List<ScoredDocument> vectorResults,
            List<ScoredDocument> keywordResults) {
        int k = 60;
        Map<String, ScoredDocument> docMap = new LinkedHashMap<>();
        Map<String, Double> scoreMap = new HashMap<>();

        for (int i = 0; i < vectorResults.size(); i++) {
            ScoredDocument doc = vectorResults.get(i);
            scoreMap.merge(doc.id(), 1.0 / (k + i + 1), Double::sum);
            docMap.putIfAbsent(doc.id(), doc);
        }

        for (int i = 0; i < keywordResults.size(); i++) {
            ScoredDocument doc = keywordResults.get(i);
            scoreMap.merge(doc.id(), 1.0 / (k + i + 1), Double::sum);
            docMap.putIfAbsent(doc.id(), doc);
        }

        return scoreMap.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .map(entry -> {
                ScoredDocument doc = docMap.get(entry.getKey());
                return new ScoredDocument(doc.id(), doc.text(), doc.metadata(), entry.getValue());
            })
            .toList();
    }

    public String askWithHybridSearch(String question) {
        List<ScoredDocument> results = hybridSearch(question, 5);
        String context = results.stream()
            .map(ScoredDocument::text)
            .collect(Collectors.joining("\n\n---\n\n"));

        String prompt = """
            Answer the user's question based on the following reference materials.
            If the reference materials don't contain relevant information, say so clearly.

            Reference materials:
            %s

            User question: %s

            Provide an accurate, complete answer with source citations.
            """.formatted(context, question);

        return chatModel.call(prompt);
    }
}

record ScoredDocument(String id, String text, Map<String, Object> metadata, double score) {}
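
Reciprocal Rank Fusion scores each document as the sum, over every result list it appears in, of 1 / (k + rank), with rank starting at 1 and k = 60 as in the code above. Only ranks matter, so the incompatible score scales of vector search and full-text search never need to be normalized. The stripped-down sketch below works on plain IDs to make the arithmetic easy to follow; `RrfExample` is a hypothetical name, and the alphabetical tie-break is an assumption added to keep the output deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: Reciprocal Rank Fusion over lists of document IDs. */
final class RrfExample {

    static final int K = 60;

    private RrfExample() {}

    @SafeVarargs
    static List<String> fuse(List<String>... rankedLists) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranked : rankedLists) {
            for (int i = 0; i < ranked.size(); i++) {
                // i is zero-based, so the 1-based rank is i + 1
                scores.merge(ranked.get(i), 1.0 / (K + i + 1), Double::sum);
            }
        }
        List<String> ids = new ArrayList<>(scores.keySet());
        // Highest fused score first; ties broken alphabetically
        ids.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return ids;
    }
}
```

Fusing the vector ranking [A, B, C] with the keyword ranking [B, D, A] gives B (1/62 + 1/61) ahead of A (1/61 + 1/63), followed by D (1/62) and C (1/63). A document that both retrievers agree on beats one that tops only a single list.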

3.3 Full-Text Search Index Configuration

-- Add full-text search support to PgVector table
ALTER TABLE vector_store ADD COLUMN IF NOT EXISTS tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('simple', content)) STORED;

CREATE INDEX IF NOT EXISTS idx_vector_store_tsv ON vector_store USING GIN(tsv);

-- For Chinese full-text search, use zhparser extension
CREATE EXTENSION IF NOT EXISTS zhparser;
CREATE TEXT SEARCH CONFIGURATION chinese_zh (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION chinese_zh ADD MAPPING FOR n,v,a,i,e,l WITH simple;

3.4 Query Rewriting for Better Recall

@Service
public class QueryRewriteService {

    private final ChatModel chatModel;

    public QueryRewriteService(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public List<String> rewriteQuery(String originalQuery) {
        String rewritePrompt = """
            The user asked the following question. Generate 3 semantically equivalent
            rephrasings to improve retrieval recall. One per line, no numbering.

            Original question: %s
            """.formatted(originalQuery);

        String response = chatModel.call(rewritePrompt);
        List<String> rewrites = Arrays.stream(response.split("\n"))
            .map(String::trim)
            .filter(line -> !line.isEmpty())
            .toList();

        List<String> allQueries = new ArrayList<>();
        allQueries.add(originalQuery);
        allQueries.addAll(rewrites);
        return allQueries;
    }

    public String expandWithSynonyms(String query) {
        String synonymPrompt = """
            Extract key entities and synonyms from the following query
            to expand the search scope. Format: one keyword or synonym per line.

            Query: %s
            """.formatted(query);

        return chatModel.call(synonymPrompt);
    }
}
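
The rewritten queries only help if their results are merged afterwards. One simple approach, sketched below, runs every query through the same search function, keeps the best-scoring hit per document ID, and returns the overall top K. `MultiQueryMerge` and `Hit` are hypothetical names, and the search function is passed in as a parameter so the sketch does not depend on any particular store.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: merge search results across several query rewrites. */
final class MultiQueryMerge {

    record Hit(String id, String text, double score) {}

    private MultiQueryMerge() {}

    static List<Hit> searchAll(List<String> queries,
                               Function<String, List<Hit>> search,
                               int topK) {
        Map<String, Hit> bestById = new LinkedHashMap<>();
        for (String query : queries) {
            for (Hit hit : search.apply(query)) {
                // The same document can match several rewrites; keep its highest score
                bestById.merge(hit.id(), hit,
                    (oldHit, newHit) -> newHit.score() > oldHit.score() ? newHit : oldHit);
            }
        }
        return bestById.values().stream()
            .sorted(Comparator.comparingDouble(Hit::score).reversed())
            .limit(topK)
            .toList();
    }
}
```

Each rewrite costs one extra retrieval call on top of the LLM call that produced it, so the number of rewrites is a latency trade-off as much as a recall one.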

Pattern 4: Conversation Memory and Multi-Turn Q&A

Single-turn Q&A is just a toy. Production-grade RAG must support multi-turn conversations, understanding references and omissions in context.

4.1 Conversation Memory Architecture

┌──────────┐     ┌──────────────┐     ┌──────────┐
│  User     │────▶│  Context     │────▶│  LLM     │
│  Message  │     │  Manager     │     │  Generate│
│  Turn N   │     │  Window/     │     │          │
└──────────┘     │  Summary     │     └──────────┘
                 └──────────────┘
                       ▲       │
                       │       ▼
                 ┌──────────────────┐
                 │  Conversation    │
                 │  History Store   │
                 │  Redis / PG      │
                 └──────────────────┘
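
The "Window" strategy in the diagram keeps only the most recent N messages per conversation and silently drops older ones. Before wiring in Redis, the behaviour can be seen in a few lines of plain Java. `WindowMemory` and `Turn` are hypothetical names used for illustration only; they are not Spring AI types.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory sketch of sliding-window conversation memory. */
final class WindowMemory {

    record Turn(String role, String content) {}

    private final int maxMessages;
    private final Map<String, Deque<Turn>> store = new HashMap<>();

    WindowMemory(int maxMessages) {
        if (maxMessages <= 0) {
            throw new IllegalArgumentException("maxMessages must be > 0");
        }
        this.maxMessages = maxMessages;
    }

    /** Appends a turn and evicts the oldest turns beyond the window. */
    synchronized void add(String conversationId, Turn turn) {
        Deque<Turn> turns = store.computeIfAbsent(conversationId, id -> new ArrayDeque<>());
        turns.addLast(turn);
        while (turns.size() > maxMessages) {
            turns.removeFirst();
        }
    }

    /** Returns an immutable snapshot of the window, oldest turn first. */
    synchronized List<Turn> get(String conversationId) {
        Deque<Turn> turns = store.get(conversationId);
        return turns == null ? List.of() : List.copyOf(turns);
    }
}
```

An in-memory version like this loses history on restart and does not work across multiple instances, which is why a shared store such as Redis or PostgreSQL is used in production. It remains handy as a stand-in for unit tests.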

4.2 Redis-Based Conversation Memory

@Configuration
public class ChatMemoryConfig {

    @Bean
    public ChatMemory chatMemory(RedisTemplate<String, String> redisTemplate) {
        return new RedisChatMemory(redisTemplate, 20);
    }
}

public class RedisChatMemory implements ChatMemory {

    private static final String KEY_PREFIX = "chat:memory:";
    private final RedisTemplate<String, String> redisTemplate;
    private final int maxMessages;

    public RedisChatMemory(RedisTemplate<String, String> redisTemplate, int maxMessages) {
        this.redisTemplate = redisTemplate;
        this.maxMessages = maxMessages;
    }

    @Override
    public void add(String conversationId, List<Message> messages) {
        String key = KEY_PREFIX + conversationId;
        for (Message message : messages) {
            String serialized = serializeMessage(message);
            redisTemplate.opsForList().rightPush(key, serialized);
        }
        redisTemplate.opsForList().trim(key, -maxMessages, -1);
        redisTemplate.expire(key, Duration.ofHours(24));
    }

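    // As of Spring AI 1.0 GA, ChatMemory declares get(String conversationId);
    // the window size is owned by the memory implementation
    @Override
    public List<Message> get(String conversationId) {
        return get(conversationId, maxMessages);
    }

    // Convenience overload (not part of the ChatMemory interface) for reading a shorter window
    public List<Message> get(String conversationId, int lastN) {
        String key = KEY_PREFIX + conversationId;
        List<String> rawMessages = redisTemplate.opsForList().range(key, -lastN, -1);
        if (rawMessages == null || rawMessages.isEmpty()) {
            return List.of();
        }
        return rawMessages.stream()
            .map(this::deserializeMessage)
            .toList();
    }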

    @Override
    public void clear(String conversationId) {
        redisTemplate.delete(KEY_PREFIX + conversationId);
    }

    private String serializeMessage(Message message) {
        try {
            Map<String, String> map = Map.of(
                "role", message.getMessageType().getValue(),
                "content", message.getText()
            );
            return new ObjectMapper().writeValueAsString(map);
        } catch (Exception e) {
            throw new RuntimeException("Failed to serialize message", e);
        }
    }

    private Message deserializeMessage(String json) {
        try {
            Map<String, String> map = new ObjectMapper().readValue(json, new TypeReference<>() {});
            return switch (map.get("role")) {
                case "user" -> new UserMessage(map.get("content"));
                case "assistant" -> new AssistantMessage(map.get("content"));
                case "system" -> new SystemMessage(map.get("content"));
                default -> new UserMessage(map.get("content"));
            };
        } catch (Exception e) {
            throw new RuntimeException("Failed to deserialize message", e);
        }
    }
}

4.3 Multi-Turn RAG Q&A Service

@Service
public class ConversationalRagService {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;
    private final ChatMemory chatMemory;

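    public ConversationalRagService(
            ChatClient.Builder chatClientBuilder,
            VectorStore vectorStore,
            ChatMemory chatMemory) {
        // Spring AI auto-configures a ChatClient.Builder rather than a ChatClient, so build the client here
        this.chatClient = chatClientBuilder.build();
        this.vectorStore = vectorStore;
        this.chatMemory = chatMemory;
    }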

    public String chat(String conversationId, String userMessage) {
        List<Message> history = chatMemory.get(conversationId, 10);

        String contextualizedQuery = buildContextualizedQuery(userMessage, history);

        List<Document> relevantDocs = vectorStore.similaritySearch(
            SearchRequest.builder()
                .query(contextualizedQuery)
                .topK(5)
                .similarityThreshold(0.6)
                .build()
        );

        String context = relevantDocs.stream()
            .map(Document::getText)
            .collect(Collectors.joining("\n\n"));

        String systemPrompt = """
            You are a professional knowledge base assistant. Answer user questions
            based on the provided reference materials.

            Rules:
            1. Only answer based on reference materials — do not fabricate information
            2. If reference materials are insufficient, clearly inform the user
            3. Cite specific sources
            4. Keep answers concise and accurate

            Reference materials:
            %s
            """.formatted(context);

        String response = chatClient.prompt()
            .system(systemPrompt)
            .messages(history)
            .user(userMessage)
            .call()
            .content();

        chatMemory.add(conversationId, List.of(
            new UserMessage(userMessage),
            new AssistantMessage(response)
        ));

        return response;
    }

    private String buildContextualizedQuery(String currentQuery, List<Message> history) {
        if (history.isEmpty()) {
            return currentQuery;
        }

        String historySummary = history.stream()
            .map(msg -> msg.getMessageType().getValue() + ": " + msg.getText())
            .collect(Collectors.joining("\n"));

        String condensePrompt = """
            Based on the conversation history and current question, generate a
            standalone query that includes full context. Output only the rewritten query.

            Conversation history:
            %s

            Current question: %s
            """.formatted(historySummary, currentQuery);

        return chatClient.prompt()
            .user(condensePrompt)
            .call()
            .content();
    }
}

4.4 Streaming Multi-Turn Conversation

@RestController
@RequestMapping("/api/v1/chat")
public class StreamingChatController {

    private final StreamingRagService streamingRagService;

    public StreamingChatController(StreamingRagService streamingRagService) {
        this.streamingRagService = streamingRagService;
    }

    @PostMapping(value = "/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<String> streamChat(@RequestBody ChatRequest request) {
        return streamingRagService.streamChat(request.conversationId(), request.message());
    }
}

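@Service
public class StreamingRagService {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;
    private final ChatMemory chatMemory;

    public StreamingRagService(
            ChatClient chatClient,
            VectorStore vectorStore,
            ChatMemory chatMemory) {
        this.chatClient = chatClient;
        this.vectorStore = vectorStore;
        this.chatMemory = chatMemory;
    }

    public Flux<String> streamChat(String conversationId, String userMessage) {
        // Defer the blocking history lookup and vector search until subscription,
        // and run them off the event loop
        return Flux.defer(() -> {
            List<Message> history = chatMemory.get(conversationId, 10);

            List<Document> docs = vectorStore.similaritySearch(
                SearchRequest.builder()
                    .query(userMessage)
                    .topK(5)
                    .similarityThreshold(0.6)
                    .build()
            );

            String context = docs.stream()
                .map(Document::getText)
                .collect(Collectors.joining("\n\n"));

            StringBuilder answer = new StringBuilder();

            return chatClient.prompt()
                .system("Answer based on the following reference materials:\n" + context)
                .messages(history)
                .user(userMessage)
                .stream()
                .content()
                .doOnNext(answer::append)
                // Persist the turn once the full answer has been streamed
                .doOnComplete(() -> chatMemory.add(conversationId, List.of(
                    new UserMessage(userMessage),
                    new AssistantMessage(answer.toString())
                )));
        }).subscribeOn(Schedulers.boundedElastic());
    }
}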

Pattern 5: RAG Pipeline Orchestration

Production-grade RAG is not a single API call. It is a pipeline that can be orchestrated, observed, and degraded gracefully when a stage fails.

5.1 Pipeline Architecture

┌──────────────────────────────────────────────────────────────────┐
│                    RAG Pipeline Orchestration                    │
│                                                                  │
│  Query ──▶ [Rewrite] ──▶ [Retrieve] ──▶ [Rerank] ──▶ [Generate]  │
│              │              │             │            │         │
│              ▼              ▼             ▼            ▼         │
│           [Cache]       [Fallback]     [Score]      [Guard]      │
│              │              │             │            │         │
│              └──────────────┴─────────────┴────────────┘         │
│                             │                                    │
│                             ▼                                    │
│                      [Observability]                             │
│                Tracing · Metrics · Logging                       │
└──────────────────────────────────────────────────────────────────┘

5.2 Pipeline Definition and Execution

public interface RagPipelineStep {
    String getName();
    StepResult execute(StepContext context);
    default int getOrder() { return 0; }
    default boolean isEnabled() { return true; }
}

public record StepResult(boolean success, Map<String, Object> data, String error) {
    public static StepResult success(Map<String, Object> data) {
        return new StepResult(true, data, null);
    }

    public static StepResult failure(String error) {
        return new StepResult(false, Map.of(), error);
    }
}

public class StepContext {
    private final Map<String, Object> data = new ConcurrentHashMap<>();
    private String originalQuery;
    private String rewrittenQuery;
    private List<Document> retrievedDocuments;
    private List<ScoredDocument> rerankedDocuments;
    private String generatedAnswer;
    private long startTimeMs;

    public StepContext(String query) {
        this.originalQuery = query;
        this.startTimeMs = System.currentTimeMillis();
    }

    public Map<String, Object> getData() { return data; }
    public String getOriginalQuery() { return originalQuery; }
    public void setRewrittenQuery(String q) { this.rewrittenQuery = q; }
    public String getEffectiveQuery() { return rewrittenQuery != null ? rewrittenQuery : originalQuery; }
    public void setRetrievedDocuments(List<Document> docs) { this.retrievedDocuments = docs; }
    public List<Document> getRetrievedDocuments() { return retrievedDocuments; }
    public void setRerankedDocuments(List<ScoredDocument> docs) { this.rerankedDocuments = docs; }
    public List<ScoredDocument> getRerankedDocuments() { return rerankedDocuments; }
    public void setGeneratedAnswer(String answer) { this.generatedAnswer = answer; }
    public String getGeneratedAnswer() { return generatedAnswer; }
    public long getElapsedTimeMs() { return System.currentTimeMillis() - startTimeMs; }
}

5.3 Concrete Step Implementations

@Component
public class QueryRewriteStep implements RagPipelineStep {

    private final ChatModel chatModel;

    public QueryRewriteStep(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    @Override
    public String getName() { return "query-rewrite"; }

    @Override
    public int getOrder() { return 1; }

    @Override
    public StepResult execute(StepContext context) {
        String query = context.getOriginalQuery();
        String rewritePrompt = """
            Rewrite the following query into a form better suited for retrieval,
            preserving core semantics and adding necessary context.
            Output only the rewritten query.

            Original query: %s
            """.formatted(query);

        String rewritten = chatModel.call(rewritePrompt);
        context.setRewrittenQuery(rewritten.trim());

        return StepResult.success(Map.of(
            "originalQuery", query,
            "rewrittenQuery", rewritten.trim()
        ));
    }
}

@Component
public class VectorRetrieveStep implements RagPipelineStep {

    private final VectorStore vectorStore;

    public VectorRetrieveStep(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    @Override
    public String getName() { return "vector-retrieve"; }

    @Override
    public int getOrder() { return 2; }

    @Override
    public StepResult execute(StepContext context) {
        List<Document> docs = vectorStore.similaritySearch(
            SearchRequest.builder()
                .query(context.getEffectiveQuery())
                .topK(10)
                .similarityThreshold(0.5)
                .build()
        );

        context.setRetrievedDocuments(docs);

        return StepResult.success(Map.of(
            "documentCount", docs.size(),
            "query", context.getEffectiveQuery()
        ));
    }
}

@Component
public class RerankStep implements RagPipelineStep {

    private final ChatModel chatModel;

    public RerankStep(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    @Override
    public String getName() { return "rerank"; }

    @Override
    public int getOrder() { return 3; }

    @Override
    public StepResult execute(StepContext context) {
        List<Document> docs = context.getRetrievedDocuments();
        if (docs == null || docs.isEmpty()) {
            return StepResult.failure("No documents to rerank");
        }

        String query = context.getEffectiveQuery();
        List<ScoredDocument> scored = docs.stream()
            .map(doc -> {
                double relevanceScore = computeRelevance(query, doc.getText());
                return new ScoredDocument(doc.getId(), doc.getText(), doc.getMetadata(), relevanceScore);
            })
            .sorted(Comparator.comparingDouble(ScoredDocument::score).reversed())
            .limit(5)
            .toList();

        context.setRerankedDocuments(scored);

        return StepResult.success(Map.of("rerankedCount", scored.size()));
    }

    private double computeRelevance(String query, String documentText) {
        Set<String> queryTerms = Arrays.stream(query.toLowerCase().split("\\s+"))
            .collect(Collectors.toSet());
        Set<String> docTerms = Arrays.stream(documentText.toLowerCase().split("\\s+"))
            .collect(Collectors.toSet());
        long overlap = queryTerms.stream().filter(docTerms::contains).count();
        return (double) overlap / queryTerms.size();
    }
}

@Component
public class GenerateStep implements RagPipelineStep {

    private final ChatClient chatClient;

    public GenerateStep(ChatClient chatClient) {
        this.chatClient = chatClient;
    }

    @Override
    public String getName() { return "generate"; }

    @Override
    public int getOrder() { return 4; }

    @Override
    public StepResult execute(StepContext context) {
        List<ScoredDocument> docs = context.getRerankedDocuments();
        if (docs == null || docs.isEmpty()) {
            return StepResult.failure("No documents available for generation");
        }

        String contextText = docs.stream()
            .map(ScoredDocument::text)
            .collect(Collectors.joining("\n\n---\n\n"));

        String answer = chatClient.prompt()
            .system("""
                You are a professional knowledge base assistant. Answer based on reference materials.
                If materials are insufficient, say so clearly. Cite specific sources.

                Reference materials:
                %s
                """.formatted(contextText))
            .user(context.getOriginalQuery())
            .call()
            .content();

        context.setGeneratedAnswer(answer);

        return StepResult.success(Map.of("answerLength", answer.length()));
    }
}
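
The score in `RerankStep.computeRelevance` is the fraction of distinct query terms that also appear in the document. A worked example makes its behavior easier to judge. The sketch below is a variation rather than the same code: `TermOverlapSketch` is a hypothetical name, and the punctuation stripping and empty-query guard are additions that the original method does not have.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative lexical scorer; not a replacement for a real reranking model.
final class TermOverlapSketch {

    // Fraction of distinct query terms that also occur in the document (0.0 to 1.0).
    static double relevance(String query, String documentText) {
        Set<String> queryTerms = tokenize(query);
        if (queryTerms.isEmpty()) {
            return 0.0; // nothing to match against
        }
        Set<String> docTerms = tokenize(documentText);
        long overlap = queryTerms.stream().filter(docTerms::contains).count();
        return (double) overlap / queryTerms.size();
    }

    // Lowercases, splits on whitespace, strips leading/trailing punctuation, drops empty tokens.
    private static Set<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("\\s+"))
            .map(t -> t.replaceAll("^\\p{Punct}+|\\p{Punct}+$", ""))
            .filter(t -> !t.isEmpty())
            .collect(Collectors.toSet());
    }
}
```

For the query "How to configure pgvector?" and the document "Configure pgvector with Spring AI", two of the four query terms match, so the score is 0.5. Without the punctuation stripping, "pgvector?" would not match "pgvector" and the score would drop to 0.25. That sensitivity to surface form is the main reason a lexical score like this is usually only a cheap first pass.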

5.4 Pipeline Orchestrator

@Service
public class RagPipelineOrchestrator {

    private final List<RagPipelineStep> steps;
    private final MeterRegistry meterRegistry;

    public RagPipelineOrchestrator(
            List<RagPipelineStep> steps,
            MeterRegistry meterRegistry) {
        this.steps = steps.stream()
            .filter(RagPipelineStep::isEnabled)
            .sorted(Comparator.comparingInt(RagPipelineStep::getOrder))
            .toList();
        this.meterRegistry = meterRegistry;
    }

    public RagPipelineResult execute(String query) {
        StepContext context = new StepContext(query);
        List<StepExecutionRecord> records = new ArrayList<>();

        for (RagPipelineStep step : steps) {
            long stepStart = System.currentTimeMillis();
            try {
                StepResult result = step.execute(context);
                long stepDuration = System.currentTimeMillis() - stepStart;

                // Record the step's own outcome: a step can report failure without throwing
                records.add(new StepExecutionRecord(
                    step.getName(), result.success(), stepDuration, result.error()
                ));

                meterRegistry.counter(
                    result.success() ? "rag.pipeline.step.success" : "rag.pipeline.step.failure",
                    "step", step.getName()).increment();
                meterRegistry.timer("rag.pipeline.step.duration",
                    "step", step.getName())
                    .record(stepDuration, TimeUnit.MILLISECONDS);

                if (!result.success()) {
                    break;
                }
            } catch (Exception e) {
                long stepDuration = System.currentTimeMillis() - stepStart;
                records.add(new StepExecutionRecord(
                    step.getName(), false, stepDuration, e.getMessage()
                ));
                meterRegistry.counter("rag.pipeline.step.failure",
                    "step", step.getName()).increment();
                break;
            }
        }

        return new RagPipelineResult(
            context.getGeneratedAnswer(),
            context.getElapsedTimeMs(),
            records
        );
    }
}

record StepExecutionRecord(String stepName, boolean success, long durationMs, String error) {}
record RagPipelineResult(String answer, long totalDurationMs, List<StepExecutionRecord> steps) {}
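
Stripped of Spring and Micrometer, the orchestrator's control flow is small: sort steps by order, run them against a shared context, record each outcome, and stop at the first failure. The following is a minimal plain-Java sketch of just that flow. `MiniStep`, `MiniRecord`, and `MiniPipeline` are hypothetical names, and a `StringBuilder` stands in for `StepContext`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

// Hypothetical, framework-free miniature of the step/orchestrator pattern.
record MiniStep(String name, int order, Function<StringBuilder, Boolean> action) {}

record MiniRecord(String step, boolean success) {}

final class MiniPipeline {

    private final List<MiniStep> steps;

    MiniPipeline(List<MiniStep> steps) {
        // Registration order does not matter; execution order comes from order()
        this.steps = steps.stream()
            .sorted(Comparator.comparingInt(MiniStep::order))
            .toList();
    }

    // Runs steps in order against a shared context and stops at the first failure.
    List<MiniRecord> run(StringBuilder context) {
        List<MiniRecord> records = new ArrayList<>();
        for (MiniStep step : steps) {
            boolean ok;
            try {
                ok = step.action().apply(context);
            } catch (RuntimeException e) {
                ok = false; // a thrown exception counts as a failed step
            }
            records.add(new MiniRecord(step.name(), ok));
            if (!ok) {
                break; // later steps never run
            }
        }
        return records;
    }
}
```

Because the run stops at the first failed step, the caller has to be prepared for a result without an answer and decide what to return in that case, for example a fixed fallback message.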

Pattern 6: Production Deployment and Monitoring

Once launched, a RAG application is far more complex to operate than a typical CRUD service. Dedicated monitoring and degradation strategies are essential.

6.1 Docker Compose Deployment

version: '3.8'

services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    environment:
      - SPRING_PROFILES_ACTIVE=prod
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - SPRING_DATASOURCE_URL=jdbc:postgresql://postgres:5432/rag_db
      - SPRING_DATASOURCE_USERNAME=rag_user
      - SPRING_DATASOURCE_PASSWORD=${PG_PASSWORD}
      - SPRING_DATA_REDIS_HOST=redis
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/actuator/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    deploy:
      resources:
        limits:
          memory: 1G
          cpus: '2'

  postgres:
    image: pgvector/pgvector:pg16
    environment:
      - POSTGRES_DB=rag_db
      - POSTGRES_USER=rag_user
      - POSTGRES_PASSWORD=${PG_PASSWORD}
    volumes:
      - pgdata:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U rag_user -d rag_db"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    command: redis-server --maxmemory 256mb --maxmemory-policy allkeys-lru
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}

volumes:
  pgdata:

6.2 RAG-Specific Monitoring Metrics

@Configuration
public class RagMetricsConfig {

    @Bean
    public MeterRegistryCustomizer<MeterRegistry> ragMetrics() {
        return registry -> registry.config()
            .meterFilter(MeterFilter.deny(id ->
                id.getName().startsWith("jvm.") || id.getName().startsWith("process.")
            ));
    }
}

@Service
public class RagMetricsService {

    private final MeterRegistry meterRegistry;

    public RagMetricsService(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    public void recordRetrievalLatency(long durationMs, String storeType) {
        meterRegistry.timer("rag.retrieval.latency", "store", storeType)
            .record(durationMs, TimeUnit.MILLISECONDS);
    }

    public void recordEmbeddingLatency(long durationMs, int tokenCount) {
        meterRegistry.timer("rag.embedding.latency")
            .record(durationMs, TimeUnit.MILLISECONDS);
        meterRegistry.counter("rag.embedding.tokens").increment(tokenCount);
    }

    public void recordGenerationLatency(long durationMs, int inputTokens, int outputTokens) {
        meterRegistry.timer("rag.generation.latency")
            .record(durationMs, TimeUnit.MILLISECONDS);
        meterRegistry.counter("rag.generation.input.tokens").increment(inputTokens);
        meterRegistry.counter("rag.generation.output.tokens").increment(outputTokens);
    }

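    public void recordRetrievalQuality(int retrievedCount, double avgSimilarity) {
        // Use distribution summaries for per-call values: gauge(name, number) only keeps a
        // weak reference to the boxed value and is not updated by later calls.
        meterRegistry.summary("rag.retrieval.document.count").record(retrievedCount);
        meterRegistry.summary("rag.retrieval.avg.similarity").record(avgSimilarity);
    }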

    public void recordCacheHit(boolean hit) {
        meterRegistry.counter("rag.cache",
            "result", hit ? "hit" : "miss").increment();
    }
}

6.3 Degradation and Circuit Breaking

@Configuration
public class ResilienceConfig {

    @Bean
    public CircuitBreaker ragCircuitBreaker() {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            .failureRateThreshold(50)
            .waitDurationInOpenState(Duration.ofSeconds(30))
            .slidingWindowSize(10)
            .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
            .build();

        return CircuitBreaker.of("ragCircuitBreaker", config);
    }
}

@Service
public class ResilientRagService {

    private final VectorStore vectorStore;
    private final ChatClient chatClient;
    private final CircuitBreaker circuitBreaker;
    private final RagMetricsService metricsService;

    public ResilientRagService(
            VectorStore vectorStore,
            ChatClient chatClient,
            CircuitBreaker circuitBreaker,
            RagMetricsService metricsService) {
        this.vectorStore = vectorStore;
        this.chatClient = chatClient;
        this.circuitBreaker = circuitBreaker;
        this.metricsService = metricsService;
    }

    public String ask(String question) {
        try {
            // Exceptions must propagate out of the supplier so the breaker can count them;
            // catching them inside the supplier would keep the breaker permanently closed.
            return circuitBreaker.executeSupplier(() -> retrieveAndGenerate(question));
        } catch (Exception e) {
            // Covers downstream failures and CallNotPermittedException while the breaker is open
            return getFallbackAnswer(question);
        }
    }

    private String retrieveAndGenerate(String question) {
        long start = System.currentTimeMillis();

        List<Document> docs = vectorStore.similaritySearch(
            SearchRequest.builder()
                .query(question)
                .topK(5)
                .similarityThreshold(0.6)
                .build()
        );

        metricsService.recordRetrievalLatency(
            System.currentTimeMillis() - start, "pgvector"
        );

        if (docs.isEmpty()) {
            return "Sorry, no relevant information was found in the knowledge base. Please try rephrasing your question.";
        }

        String context = docs.stream()
            .map(Document::getText)
            .collect(Collectors.joining("\n\n"));

        long genStart = System.currentTimeMillis();
        String answer = chatClient.prompt()
            .system("Answer based on reference materials:\n" + context)
            .user(question)
            .call()
            .content();
        // Token usage is not read from the response here, so counts are recorded as 0
        metricsService.recordGenerationLatency(
            System.currentTimeMillis() - genStart, 0, 0
        );

        return answer;
    }

    private String getFallbackAnswer(String question) {
        return """
            Sorry, the AI service is temporarily unavailable. This may be due to:
            1. Vector database connection timeout
            2. LLM service overload
            3. Network fluctuation

            Please try again later or contact technical support.
            """;
    }
}
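
To make the three breaker settings concrete (`failureRateThreshold`, `slidingWindowSize`, `waitDurationInOpenState`), here is a deliberately simplified count-based breaker in plain Java. It is a teaching sketch under stated assumptions, not how Resilience4j is implemented: `MiniCircuitBreaker` is a hypothetical name, there is no half-open state, and the clock is injected so the behavior can be checked without sleeping.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Simplified count-based circuit breaker for illustration only.
final class MiniCircuitBreaker {

    private final int windowSize;
    private final double failureRateThreshold; // percent, e.g. 50
    private final long openMillis;
    private final LongSupplier clock;
    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failure
    private long openedAt = -1; // -1 means the breaker is closed

    MiniCircuitBreaker(int windowSize, double failureRateThreshold,
                       long openMillis, LongSupplier clock) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    synchronized boolean isOpen() {
        if (openedAt < 0) {
            return false;
        }
        if (clock.getAsLong() - openedAt >= openMillis) {
            // Wait time elapsed: close again and start with a fresh window
            openedAt = -1;
            outcomes.clear();
            return false;
        }
        return true;
    }

    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get(); // short-circuit: the downstream is not touched
        }
        try {
            T result = action.get();
            record(false);
            return result;
        } catch (RuntimeException e) {
            record(true);
            return fallback.get();
        }
    }

    private synchronized void record(boolean failure) {
        outcomes.addLast(failure);
        if (outcomes.size() > windowSize) {
            outcomes.removeFirst(); // keep only the most recent calls
        }
        if (outcomes.size() == windowSize) {
            long failures = outcomes.stream().filter(f -> f).count();
            if (failures * 100.0 / windowSize >= failureRateThreshold) {
                openedAt = clock.getAsLong();
            }
        }
    }
}
```

The key property is that failures are recorded where the breaker can see them. If the protected call catches its own exceptions and returns a fallback string, every call looks like a success and the breaker never opens.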

6.4 Prometheus Alerting Rules

groups:
  - name: rag-alerts
    rules:
      - alert: RAGRetrievalLatencyHigh
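        # Requires histogram buckets (enable percentiles-histogram for this timer)
        expr: histogram_quantile(0.95, sum by (le) (rate(rag_retrieval_latency_seconds_bucket[5m]))) > 2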
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RAG retrieval latency is high"
          description: "P95 retrieval latency exceeds 2s, current: {{ $value }}s"

      - alert: RAGGenerationLatencyHigh
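        # Requires histogram buckets (enable percentiles-histogram for this timer)
        expr: histogram_quantile(0.95, sum by (le) (rate(rag_generation_latency_seconds_bucket[5m]))) > 10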
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RAG generation latency is high"
          description: "P95 generation latency exceeds 10s"

      - alert: RAGCircuitBreakerOpen
        expr: resilience4j_circuitbreaker_state{name="ragCircuitBreaker",state="open"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "RAG circuit breaker is open"
          description: "RAG service circuit breaker is OPEN; all requests will use fallback"

      - alert: RAGRetrievalErrorRateHigh
        expr: rate(rag_pipeline_step_failure_total{step="vector-retrieve"}[5m]) > 0.1
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Vector retrieval error rate is high"

6.5 Kubernetes Deployment Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-service
  labels:
    app: rag-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: rag-service
  template:
    metadata:
      labels:
        app: rag-service
    spec:
      containers:
        - name: rag-service
          image: registry.example.com/rag-service:latest
          ports:
            - containerPort: 8080
          env:
            - name: SPRING_PROFILES_ACTIVE
              value: "prod"
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: rag-secrets
                  key: openai-api-key
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "2000m"
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: rag-service
spec:
  selector:
    app: rag-service
  ports:
    - port: 80
      targetPort: 8080
  type: ClusterIP

5 Common Pitfalls and Solutions

Pitfall 1: Embedding Dimension Mismatch

Symptom: ERROR: expected 1536 dimensions, not 768

Cause: Switched embedding model without updating vector table dimension configuration

Solution:

@Configuration
public class EmbeddingDimensionGuard {

    @Value("${spring.ai.openai.embedding.options.model:text-embedding-3-small}")
    private String embeddingModel;

    @Bean
    public ApplicationRunner validateDimensions(JdbcTemplate jdbcTemplate) {
        return args -> {
            int expectedDim = getExpectedDimension(embeddingModel);
            // For a pgvector column, atttypmod holds the declared dimension directly
            // (-1 when the column was created without one)
            Integer actualDim = jdbcTemplate.queryForObject(
                "SELECT atttypmod FROM pg_attribute WHERE attrelid = 'vector_store'::regclass AND attname = 'embedding'",
                Integer.class
            );
            if (actualDim != null && actualDim > 0 && actualDim != expectedDim) {
                throw new IllegalStateException(
                    "Embedding dimension mismatch! Expected: " + expectedDim +
                    ", Actual: " + actualDim
                );
            }
        };
    }

    private int getExpectedDimension(String model) {
        return switch (model) {
            case "text-embedding-3-small" -> 1536;
            case "text-embedding-3-large" -> 3072;
            case "text-embedding-ada-002" -> 1536;
            default -> 1536;
        };
    }
}

Pitfall 2: Large Document OOM

Symptom: JVM OOM when loading a 100MB PDF

Cause: DocumentReader loads the entire document into memory at once

Solution:

@Service
public class SafeDocumentLoader {

    private static final long MAX_FILE_SIZE = 50 * 1024 * 1024;

    public List<Document> loadSafely(Resource resource) {
        try {
            if (resource.contentLength() > MAX_FILE_SIZE) {
                return loadInChunks(resource);
            }
            DocumentReader reader = new TextReader(resource);
            return reader.get();
        } catch (IOException e) {
            throw new RuntimeException("Failed to load document", e);
        }
    }

    private List<Document> loadInChunks(Resource resource) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(resource.getInputStream(), StandardCharsets.UTF_8))) {
            List<Document> allChunks = new ArrayList<>();
            StringBuilder buffer = new StringBuilder();
            // getFilename() can return null for some resources, and Map.of rejects nulls
            String source = resource.getFilename() != null ? resource.getFilename() : "unknown";
            Map<String, Object> metadata = Map.of(
                "source", source,
                "chunkStrategy", "streaming"
            );
            String line;

            while ((line = reader.readLine()) != null) {
                buffer.append(line).append("\n");
                if (buffer.length() > 100000) {
                    allChunks.add(new Document(buffer.toString(), metadata));
                    buffer = new StringBuilder();
                }
            }

            // Flush the final partial chunk with the same metadata as the others
            if (buffer.length() > 0) {
                allChunks.add(new Document(buffer.toString(), metadata));
            }

            return allChunks;
        } catch (IOException e) {
            throw new RuntimeException("Failed to stream document", e);
        }
    }
}
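
Reading line by line bounds the size of each chunk, but collecting every chunk into one list still keeps the whole file on the heap. One way around that is to hand each chunk to a consumer as soon as it is full, so it can be embedded and stored before the next one is read. The sketch below shows the idea in plain Java; `StreamingChunker` and `forEachChunk` are hypothetical names, and the sink stands in for whatever indexes a chunk.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

// Illustrative streaming chunker: at most one chunk is buffered at a time.
final class StreamingChunker {

    // Emits a chunk to the sink whenever the buffer reaches maxChars,
    // then flushes the remainder. Returns the number of chunks emitted.
    static int forEachChunk(Reader source, int maxChars, Consumer<String> sink) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        int emitted = 0;
        StringBuilder buffer = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                buffer.append(line).append('\n');
                if (buffer.length() >= maxChars) {
                    sink.accept(buffer.toString());
                    buffer.setLength(0); // reuse the buffer for the next chunk
                    emitted++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException("Failed to stream document", e);
        }
        if (buffer.length() > 0) {
            sink.accept(buffer.toString());
            emitted++;
        }
        return emitted;
    }
}
```

A caller would pass something like `chunk -> index(chunk)` as the sink, where `index` is whatever embeds and stores one chunk. Note that this only works for line-oriented text. A file with a single enormous line is still read in one piece, and binary formats such as PDF need a format-aware reader instead.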

Pitfall 3: All Retrieval Results Are Irrelevant

Symptom: Retrieval returns mostly noise once the similarity threshold drops below 0.7

Cause: A poorly tuned similarity threshold, combined with weak embedding-model support for the document language

Solution:

spring:
  ai:
    vectorstore:
      pgvector:
        distance-type: COSINE

@Service
public class AdaptiveThresholdService {

    private final VectorStore vectorStore;

    public AdaptiveThresholdService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    public List<Document> searchWithAdaptiveThreshold(String query, int topK) {
        double[] thresholds = {0.8, 0.7, 0.6, 0.5};

        for (double threshold : thresholds) {
            List<Document> results = vectorStore.similaritySearch(
                SearchRequest.builder()
                    .query(query)
                    .topK(topK)
                    .similarityThreshold(threshold)
                    .build()
            );
            if (!results.isEmpty()) {
                return results;
            }
        }

        return List.of();
    }
}
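
When tuning this, it helps to know which rung of the ladder actually produced the results, so the value can be logged or exported as a metric. The following plain-Java sketch separates the ladder logic from the store and returns the matching threshold alongside the hits. `ThresholdLadder` and `Outcome` are hypothetical names, and the search itself is passed in as a function.

```java
import java.util.List;
import java.util.function.DoubleFunction;

// Illustrative threshold ladder that also reports which threshold matched.
final class ThresholdLadder {

    // threshold is NaN when no rung produced any results
    record Outcome<T>(double threshold, List<T> results) {}

    // Tries thresholds in the given order (strictest first) and returns
    // the first non-empty result list together with its threshold.
    static <T> Outcome<T> search(double[] thresholds, DoubleFunction<List<T>> searchAt) {
        for (double threshold : thresholds) {
            List<T> results = searchAt.apply(threshold);
            if (!results.isEmpty()) {
                return new Outcome<>(threshold, results);
            }
        }
        return new Outcome<>(Double.NaN, List.of());
    }
}
```

If most queries only succeed at the lowest rung, that is a signal that the thresholds, the chunking, or the embedding model need attention, rather than something the ladder should keep hiding. Each rung is also a separate search, so a long ladder multiplies retrieval latency in the worst case.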

Pitfall 4: Concurrent Indexing Causes Duplicate Vectors

Symptom: The same document indexed multiple times, duplicate results in retrieval

Cause: Lack of idempotency guarantees

Solution:

@Service
public class IdempotentIndexService {

    private final VectorStore vectorStore;
    private final JdbcTemplate jdbcTemplate;

    public IdempotentIndexService(VectorStore vectorStore, JdbcTemplate jdbcTemplate) {
        this.vectorStore = vectorStore;
        this.jdbcTemplate = jdbcTemplate;
    }

    @Transactional
    public void indexWithDedup(List<Document> documents, String sourceId) {
        jdbcTemplate.update(
            "DELETE FROM vector_store WHERE metadata->>'sourceId' = ?",
            sourceId
        );

        documents.forEach(doc ->
            doc.getMetadata().put("sourceId", sourceId)
        );

        vectorStore.add(documents);
    }
}
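
Delete-then-insert makes a single run repeatable, but two runs for the same source that overlap in time can still both insert. A complementary technique is to give every chunk a deterministic ID derived from its content, so that re-indexing the same chunk always targets the same row. The sketch below shows one way to derive such an ID in plain Java. `ChunkIds` is a hypothetical helper, and whether a duplicate ID results in an upsert or a key violation depends on the vector store, which is worth verifying for your store and version.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Illustrative helper for stable, content-derived chunk IDs.
final class ChunkIds {

    // Same source, position and text always produce the same name-based UUID.
    static String of(String sourceId, int chunkIndex, String text) {
        String key = sourceId + "\n" + chunkIndex + "\n" + text;
        return UUID.nameUUIDFromBytes(key.getBytes(StandardCharsets.UTF_8)).toString();
    }
}
```

The resulting string could then be supplied as the document ID when building each `Document`, instead of letting a random ID be generated. Including the text in the key means an edited chunk gets a new ID, so stale rows for the old text still need the delete step.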

Pitfall 5: Uncontrollable LLM Hallucinations

Symptom: The model fabricates answers even when the knowledge base has no relevant information

Cause: Insufficient System Prompt constraints + lack of grounding validation

Solution:

@Service
public class GroundedRagService {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;

    public GroundedRagService(ChatClient chatClient, VectorStore vectorStore) {
        this.chatClient = chatClient;
        this.vectorStore = vectorStore;
    }

    public GroundedAnswer askWithGrounding(String question) {
        List<Document> docs = vectorStore.similaritySearch(
            SearchRequest.builder()
                .query(question)
                .topK(5)
                .similarityThreshold(0.65)
                .build()
        );

        if (docs.isEmpty()) {
            return new GroundedAnswer(
                "No relevant information found in the knowledge base. Cannot answer this question.",
                false,
                List.of()
            );
        }

        String context = docs.stream()
            .map(doc -> "[Source:" + doc.getMetadata().get("source") + "]\n" + doc.getText())
            .collect(Collectors.joining("\n\n"));

        String systemPrompt = """
            Strict rules:
            1. Only answer based on the provided reference materials
            2. Every factual claim must cite its source as [Source:xxx]
            3. If reference materials are insufficient to fully answer, clearly state which parts lack evidence
            4. Never fabricate information not present in the reference materials

            Reference materials:
            %s
            """.formatted(context);

        String answer = chatClient.prompt()
            .system(systemPrompt)
            .user(question)
            .call()
            .content();

        boolean isGrounded = validateGrounding(answer, docs);

        return new GroundedAnswer(answer, isGrounded,
            docs.stream().map(d -> (String) d.getMetadata().get("source")).toList());
    }

    private boolean validateGrounding(String answer, List<Document> sources) {
        String sourceTexts = sources.stream()
            .map(Document::getText)
            .collect(Collectors.joining(" "));

        String validationPrompt = """
            Determine whether the following answer is entirely based on the given reference materials.
            Answer only YES or NO.

            Reference materials: %s

            Answer: %s
            """.formatted(sourceTexts.substring(0, Math.min(2000, sourceTexts.length())),
                          answer);

        String result = chatClient.prompt()
            .user(validationPrompt)
            .call()
            .content();

        return result.trim().toUpperCase().startsWith("YES");
    }
}

record GroundedAnswer(String answer, boolean isGrounded, List<String> sources) {}
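
The LLM-based grounding check costs an extra model call per answer. Since the prompt already asks for `[Source:xxx]` citations, a cheap deterministic pre-check can catch the most obvious failures first: answers that cite nothing, or that cite a source which was never retrieved. The following is a minimal sketch of that pre-check in plain Java. `CitationCheck` is a hypothetical helper, and it only verifies citation labels, not whether the cited text supports the claim.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative citation pre-check; complements, not replaces, semantic validation.
final class CitationCheck {

    private static final Pattern CITATION = Pattern.compile("\\[Source:([^\\]]+)\\]");

    // Collects every source name cited in the answer as [Source:name].
    static Set<String> citedSources(String answer) {
        Set<String> cited = new LinkedHashSet<>();
        Matcher matcher = CITATION.matcher(answer);
        while (matcher.find()) {
            cited.add(matcher.group(1).trim());
        }
        return cited;
    }

    // True only if at least one source is cited and every cited source was retrieved.
    static boolean citesOnlyKnownSources(String answer, Set<String> retrievedSources) {
        Set<String> cited = citedSources(answer);
        return !cited.isEmpty() && retrievedSources.containsAll(cited);
    }
}
```

An answer that fails this check can be rejected or regenerated without spending a validation call, and only answers that pass need the more expensive YES/NO validation.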

10 Common Error Troubleshooting

#	Error Message	Cause	Solution
1	`ERROR: operator does not exist: vector <=> vector`	pgvector extension not installed	`CREATE EXTENSION vector;` and restart application
2	`EmbeddingModel bean not found`	Missing embedding starter dependency	Add `spring-ai-starter-model-openai` (named `spring-ai-openai-spring-boot-starter` in pre-1.0 milestones)
3	`Connection refused: localhost:5432`	PostgreSQL not running	`docker compose up -d postgres`
4	`429 Too Many Requests`	OpenAI API rate limiting	Add rate limiter, reduce concurrency, use local model
5	`expected 1536 dimensions, not 768`	Embedding model dimension mismatch	Unify embedding model or rebuild vector table
6	`OutOfMemoryError: Java heap space`	Large document loaded at once	Use streaming loader, limit file size
7	`CircuitBreaker 'ragCircuitBreaker' is OPEN`	Downstream service persistent failure	Check LLM/vector store connectivity, wait for circuit recovery
8	`RedisConnectionFailureException`	Redis unavailable	Check Redis health, fall back to in-memory conversation memory
9	`Empty search results for threshold 0.8`	Similarity threshold too high	Lower threshold or use adaptive threshold strategy
10	`JsonProcessingException: metadata`	Metadata JSON format error	Check metadata fields, ensure serializability
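
Row 9 mentions an adaptive threshold strategy. One way to read that is: start with a strict similarity threshold and relax it step by step until enough results come back or a floor is reached. The sketch below assumes this interpretation; `AdaptiveThresholdSearch` and the `ThresholdSearch` interface are hypothetical, and in a Spring AI application the backend would typically wrap a `VectorStore` similarity search with the given threshold.

```java
import java.util.List;

// Hypothetical helper: relaxes the similarity threshold until enough hits are found.
final class AdaptiveThresholdSearch {

    /** Stand-in for a vector search call: (query, threshold) -> hits. */
    interface ThresholdSearch<T> {
        List<T> search(String query, double threshold);
    }

    private AdaptiveThresholdSearch() {}

    /**
     * Tries start, start - step, start - 2*step, ... and stops as soon as
     * minResults hits are found or the next threshold would drop below floor.
     * Returns the hits of the last attempt (possibly fewer than minResults).
     */
    static <T> List<T> search(ThresholdSearch<T> backend, String query,
                              double start, double floor, double step, int minResults) {
        if (step <= 0) {
            throw new IllegalArgumentException("step must be positive");
        }
        List<T> hits = List.of();
        for (int i = 0; ; i++) {
            // Recompute from the start value to avoid accumulating rounding error.
            double threshold = start - i * step;
            if (threshold < floor - 1e-9) {
                break;
            }
            hits = backend.search(query, threshold);
            if (hits.size() >= minResults) {
                break;
            }
        }
        return hits;
    }
}
```

The floor matters: without it, the loop would eventually return irrelevant documents, which only trades the "empty results" error for a worse answer.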

Advanced Optimization Tips

1. Caching Layer: Reduce Duplicate Embedding Computation

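import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import org.apache.commons.codec.digest.DigestUtils;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.stereotype.Service;

import java.time.Duration;
import java.util.List;

@Service
public class EmbeddingCacheService {

    private final Cache<String, float[]> embeddingCache;
    private final EmbeddingModel embeddingModel;

    public EmbeddingCacheService(EmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
        this.embeddingCache = Caffeine.newBuilder()
            .maximumSize(10000)
            .expireAfterWrite(Duration.ofHours(24))
            .build();
    }

    public float[] getEmbedding(String text) {
        // Callers must not mutate the returned array: it is the cached instance.
        return embeddingCache.get(cacheKey(text), key -> embeddingModel.embed(text));
    }

    public void preloadCache(List<String> texts) {
        // One batch request instead of a parallel stream of single calls, which
        // multiplies API requests and can trigger 429 rate limiting.
        // For very large lists, split the input into smaller batches first.
        List<float[]> embeddings = embeddingModel.embed(texts);
        for (int i = 0; i < texts.size(); i++) {
            embeddingCache.put(cacheKey(texts.get(i)), embeddings.get(i));
        }
    }

    // The MD5 digest (Apache Commons Codec) is only a compact cache key, not a security measure.
    private static String cacheKey(String text) {
        return DigestUtils.md5Hex(text);
    }
}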
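
Two details of an embedding cache are easy to overlook: the key should depend on the embedding model, so that switching models never serves vectors of the wrong dimension, and a preload should only embed texts that are not cached yet. The sketch below illustrates both with plain JDK types. `BatchEmbeddingCache` is a hypothetical class, and the `batchEmbedder` function stands in for whatever batch embedding call the application uses.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: model-aware cache keys plus miss-only batch preloading.
final class BatchEmbeddingCache {

    private final Map<String, float[]> cache = new ConcurrentHashMap<>();
    private final String modelName;
    private final Function<List<String>, List<float[]>> batchEmbedder;

    BatchEmbeddingCache(String modelName,
                        Function<List<String>, List<float[]>> batchEmbedder) {
        this.modelName = modelName;
        this.batchEmbedder = batchEmbedder;
    }

    /** SHA-256 over model name and text, rendered as lowercase hex. */
    static String cacheKey(String modelName, String text) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(
                (modelName + "\n" + text).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** Embeds only the texts that are not cached yet; returns how many were embedded. */
    int preload(List<String> texts) {
        List<String> missing = new ArrayList<>();
        for (String text : new LinkedHashSet<>(texts)) { // drop duplicates, keep order
            if (!cache.containsKey(cacheKey(modelName, text))) {
                missing.add(text);
            }
        }
        if (missing.isEmpty()) {
            return 0;
        }
        List<float[]> vectors = batchEmbedder.apply(missing); // one batch call
        if (vectors.size() != missing.size()) {
            throw new IllegalStateException("Embedder returned a different number of vectors");
        }
        for (int i = 0; i < missing.size(); i++) {
            cache.put(cacheKey(modelName, missing.get(i)), vectors.get(i));
        }
        return missing.size();
    }

    /** Returns a defensive copy of the cached vector, or null on a cache miss. */
    float[] get(String text) {
        float[] vector = cache.get(cacheKey(modelName, text));
        return vector == null ? null : vector.clone();
    }
}
```

The defensive copy in `get` trades a small allocation for safety, since a caller that normalizes a vector in place would otherwise corrupt the cached entry.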

2. Async Indexing: Don't Block the Main Flow

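import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;
import org.springframework.ai.document.DocumentTransformer;
import org.springframework.ai.reader.TextReader;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.Resource;
import org.springframework.core.task.TaskExecutor;
import org.springframework.core.task.support.TaskExecutorAdapter;
import org.springframework.scheduling.annotation.Async;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.stereotype.Service;

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;

// Without @EnableAsync, @Async is ignored and indexAsync runs on the caller's thread.
@Configuration
@EnableAsync
class IndexingExecutorConfig {

    // @Async("indexExecutor") resolves a bean with this name, so the executor must be
    // a bean rather than a private field of the service. Virtual threads need Java 21+.
    @Bean("indexExecutor")
    TaskExecutor indexExecutor() {
        return new TaskExecutorAdapter(Executors.newVirtualThreadPerTaskExecutor());
    }
}

@Service
public class AsyncIndexingService {

    private final VectorStore vectorStore;
    private final DocumentTransformer textSplitter;

    public AsyncIndexingService(
            VectorStore vectorStore,
            DocumentTransformer textSplitter) {
        this.vectorStore = vectorStore;
        this.textSplitter = textSplitter;
    }

    // Call this from another bean: self-invocation bypasses the async proxy.
    @Async("indexExecutor")
    public CompletableFuture<IndexResult> indexAsync(Resource resource, String sourceId) {
        long start = System.currentTimeMillis();
        try {
            DocumentReader reader = new TextReader(resource);
            List<Document> documents = reader.get();
            List<Document> split = textSplitter.apply(documents);

            // Tag every chunk with its origin so it can be filtered or deleted later.
            split.forEach(doc -> doc.getMetadata().put("sourceId", sourceId));
            vectorStore.add(split);

            return CompletableFuture.completedFuture(
                new IndexResult(true, split.size(), System.currentTimeMillis() - start, null)
            );
        } catch (Exception e) {
            // Failures are reported in the result instead of completing exceptionally.
            return CompletableFuture.completedFuture(
                new IndexResult(false, 0, System.currentTimeMillis() - start, e.getMessage())
            );
        }
    }
}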

record IndexResult(boolean success, int documentCount, long durationMs, String error) {}
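
When many documents are indexed at once, the caller usually wants a single summary rather than one future per file. The following is a minimal sketch of such an aggregation using only `CompletableFuture`. The `IndexResult` record is repeated here so the example is self-contained, and `IndexBatchSummary` is a hypothetical type introduced for illustration. It assumes Java 17+ for records.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Same shape as the IndexResult record used by the indexing service.
record IndexResult(boolean success, int documentCount, long durationMs, String error) {}

// Hypothetical summary type, introduced only for this sketch.
record IndexBatchSummary(int succeeded, int failed, int totalChunks, List<String> errors) {

    /** Completes once every indexing future has completed. */
    static CompletableFuture<IndexBatchSummary> summarize(
            List<CompletableFuture<IndexResult>> futures) {
        return CompletableFuture
            .allOf(futures.toArray(new CompletableFuture<?>[0]))
            .thenApply(ignored -> {
                int ok = 0;
                int failed = 0;
                int chunks = 0;
                List<String> errors = new ArrayList<>();
                for (CompletableFuture<IndexResult> future : futures) {
                    IndexResult result = future.join(); // does not block: allOf has completed
                    if (result.success()) {
                        ok++;
                        chunks += result.documentCount();
                    } else {
                        failed++;
                        errors.add(result.error());
                    }
                }
                return new IndexBatchSummary(ok, failed, chunks, errors);
            });
    }
}
```

Because the indexing method reports failures inside `IndexResult` rather than completing exceptionally, the combined future is not expected to fail, and the summary can be logged or returned from a status endpoint as is.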

3. Multi-Model Routing: Balance Cost and Quality

@Service
public class ModelRoutingService {

    private final Map<String, ChatModel> models;
    private final MeterRegistry meterRegistry;

    public ModelRoutingService(
            @Qualifier("openAiChatModel") ChatModel openAiModel,
            @Qualifier("deepseekChatModel") ChatModel deepseekModel,
            @Qualifier("qwenChatModel") ChatModel qwenModel,
            MeterRegistry meterRegistry) {
        this.models = Map.of(
            "gpt-4o", openAiModel,
            "deepseek-v3", deepseekModel,
            "qwen-max", qwenModel
        );
        this.meterRegistry = meterRegistry;
    }

    public String routeAndChat(String question, String priority) {
        // Resolve the model name first so the metric tag needs no reverse lookup.
        // A null priority falls through to the low-cost default instead of throwing.
        String modelName = switch (priority == null ? "" : priority) {
            case "quality" -> "gpt-4o";
            case "chinese" -> "qwen-max";
            default -> "deepseek-v3"; // covers "cost" and any unrecognized priority
        };

        meterRegistry.counter("rag.model.routing", "model", modelName).increment();

        return models.get(modelName).call(question);
    }
}
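
Routing by priority picks exactly one model, so a provider outage still fails the request. A possible extension is an ordered fallback: try the preferred model first and move to the next candidate when a call throws. The sketch below shows the idea with plain JDK types. `FallbackRouter` and `NamedModel` are hypothetical, and each `call` function stands in for a chat model invocation.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: ordered fallback across several chat models.
final class FallbackRouter {

    /** Stand-in for a chat model: a display name plus a prompt-to-answer function. */
    static final class NamedModel {
        final String name;
        final Function<String, String> call;

        NamedModel(String name, Function<String, String> call) {
            this.name = name;
            this.call = call;
        }
    }

    private FallbackRouter() {}

    /**
     * Tries the candidates in order and returns the first successful answer.
     * Throws IllegalStateException (with the last failure as cause) if all fail.
     */
    static String chatWithFallback(List<NamedModel> candidates, String question) {
        RuntimeException lastFailure = null;
        for (NamedModel model : candidates) {
            try {
                return model.call.apply(question);
            } catch (RuntimeException e) {
                lastFailure = e; // remember the failure and try the next candidate
            }
        }
        throw new IllegalStateException("All candidate models failed", lastFailure);
    }
}
```

The candidate order encodes the cost and quality trade-off: for a "cost" priority the cheap model comes first and the expensive one is only paid for when the cheap one is unavailable.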

Comparison: 3 Vector Database Solutions

Dimension	PgVector	Milvus	Chroma
Deployment	PG extension, zero extra ops	Independent cluster, needs etcd and object storage	Embedded/Server modes
Scale	<1M vectors	100M+ vectors	<500K vectors
Index Types	HNSW/IVFFlat	HNSW/IVF_FLAT/IVF_SQ8/IVF_PQ	HNSW
Query Latency (P99)	50-200ms	10-50ms	30-100ms
Filter Queries	Native SQL support	Expression engine	Metadata filtering
Transaction Support	ACID	Eventual consistency	None
Java Ecosystem	Spring AI native	Spring AI + Milvus SDK	Spring AI native
Ops Complexity	Low (reuses PG)	High (distributed cluster)	Low (embedded)
Cost	Low (reuses PG instance)	Medium (needs dedicated cluster)	Low (embedded free)
Recommended For	Enterprises with existing PG, small-medium scale	Large-scale vector retrieval	Prototype validation, small scale

Selection Guide:

Enterprises with existing PostgreSQL → PgVector, zero ops cost
Vector scale exceeds 5M → Milvus, distributed scaling
Quick prototype validation → Chroma, up and running in 5 minutes

For more vector database comparisons, see Vector Database Semantic Search in Practice.

Recommended Online Tools

The following online tools can significantly boost productivity when building RAG applications:

Tool	Purpose	Link
JSON Formatter	Process vector store metadata JSON	JSON Formatter
Hash Calculator	Generate document fingerprints for caching and dedup	Hash Calculator
Curl to Code	Quickly generate LLM API call code	Curl to Code
Base64 Codec	Handle document content encoding conversion	Base64 Codec
Regex Tester	Validate document chunking regex rules	Regex Tester

Summary

SpringBoot 3.5 + Spring AI finally gives Java developers a complete production-grade RAG solution. The 6 patterns cover the full chain from vector storage to intelligent Q&A: PgVector integration is the foundation, document chunking determines the effectiveness ceiling, hybrid search boosts recall, conversation memory makes Q&A smarter, pipeline orchestration ensures reliability, and monitoring with degradation safeguards production stability. Remember: RAG isn't a silver bullet, but it's the most pragmatic path for Java AI adoption in 2026.

Related Reading

Spring Boot 3 AI LLM Integration: The Complete Guide - Spring AI vs LangChain4j framework selection and AI Agent construction
Python AI Production Deployment - Python-side AI model deployment and operations experience
RAG Evaluation and Optimization - RAG effectiveness evaluation metrics and optimization methodology
PostgreSQL PgVector RAG in Practice - PgVector deep configuration and performance tuning
