RAG Deep Dive: From Naive RAG to Agentic RAG — Three Generations of Evolution and Enterprise Implementation
Why RAG Is a Must-Have for Enterprise AI
No matter how powerful LLMs get, they suffer from three fundamental flaws. In 2026, if you're still making raw LLM API calls for enterprise Q&A, you've definitely hit these walls:
| Flaw | Manifestation | Business Impact |
|---|---|---|
| Knowledge Cutoff | Training data is frozen at a specific date | Cannot answer latest policies or product info |
| Hallucination | Fabricates plausible but non-existent facts | Misleads decisions, legal risk |
| Private Knowledge Gap | Unfamiliar with internal docs, processes, terminology | General models can't replace domain-specific Q&A |
The essence of RAG (Retrieval-Augmented Generation): make the LLM "look up references" before answering, so that generation is constrained by the retrieved documents. This mitigates all three flaws, although answer quality then depends on retrieval quality.
Data point: In 2026, 92% of enterprise AI deployments adopted RAG architecture; pure prompt approaches account for less than 5%.
┌─────────────────────────────────────────────────────────┐
│ Without RAG vs With RAG: The Fundamental Shift │
├──────────────────────┬──────────────────────────────────┤
│ Raw LLM Call │ RAG-Enhanced Call │
├──────────────────────┼──────────────────────────────────┤
│ User Query → LLM → │ User Query → Retrieve → Inject │
│ Answer (from memory)│ Context → LLM → Answer+Citations│
│ │ (answer after research) │
├──────────────────────┼──────────────────────────────────┤
│ Knowledge: cutoff │ Knowledge: real-time, updatable │
│ Accuracy: unchecked │ Accuracy: retrieval-constrained │
│ Traceability: none │ Traceability: every claim cited │
└──────────────────────┴──────────────────────────────────┘
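The right-hand column can be sketched in a few lines of Java. This is a toy illustration under stated assumptions: the `ToyRag` class, its word-overlap scoring, and the prompt wording are all hypothetical, and a real system would replace `retrieve` with embedding search against a vector store.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy retrieve-then-inject flow; every name here is hypothetical. */
final class ToyRag {

    /** Counts how many distinct query words also occur in the document. */
    static int overlap(String query, String doc) {
        Set<String> docWords = new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\W+")));
        Set<String> queryWords = new HashSet<>(Arrays.asList(query.toLowerCase().split("\\W+")));
        queryWords.remove("");
        queryWords.retainAll(docWords);
        return queryWords.size();
    }

    /** Returns the topK documents by word overlap (a stand-in for vector search). */
    static List<String> retrieve(String query, List<String> corpus, int topK) {
        List<String> sorted = new ArrayList<>(corpus);
        sorted.sort(Comparator.comparingInt((String d) -> overlap(query, d)).reversed());
        return sorted.subList(0, Math.min(topK, sorted.size()));
    }

    /** Injects numbered sources into the prompt so the model can cite them as [n]. */
    static String buildPrompt(String query, List<String> docs) {
        StringBuilder sb = new StringBuilder("Answer using only the sources below and cite them as [n].\n");
        for (int i = 0; i < docs.size(); i++) {
            sb.append('[').append(i + 1).append("] ").append(docs.get(i)).append('\n');
        }
        return sb.append("Question: ").append(query).toString();
    }
}
```

The string returned by `buildPrompt` is what would be sent to the LLM in place of the bare user query.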
Naive RAG → Advanced RAG → Agentic RAG: Three Generations of Evolution
RAG isn't static. From 2023 to 2026, RAG architecture has evolved through three generations:
┌───────────────────────────────────────────────────────────────┐
│ RAG Three-Generation Evolution Roadmap │
├───────────────────┬───────────────────┬───────────────────────┤
│ Naive RAG │ Advanced RAG │ Agentic RAG │
│ (2023) │ (2024-2025) │ (2026) │
├───────────────────┼───────────────────┼───────────────────────┤
│ Query → Retrieve │ Query Rewrite → │ Agent autonomous │
│ → Generate │ Hybrid Retrieve → │ decision-making → │
│ │ Rerank → Generate │ Multi-turn + self-eval│
├───────────────────┼───────────────────┼───────────────────────┤
│ Problem: poor │ Problem: lacks │ Problem: high │
│ retrieval quality │ autonomy │ complexity │
│ Hallucination 30%+│ Hallucination │ Hallucination <5% │
│ Uncontrollable │ 10-15% │ Proactive reasoning + │
│ │ Controllable but │ self-correction │
│ │ passive │ │
└───────────────────┴───────────────────┴───────────────────────┘
Three Generations Compared
| Dimension | Naive RAG | Advanced RAG | Agentic RAG |
|---|---|---|---|
| Query Processing | Raw query directly | Query rewrite/expansion/HyDE | Agent decomposes sub-questions |
| Retrieval Strategy | Single vector search | Hybrid retrieval + reranking | Multi-turn retrieval + tool calls |
| Generation Strategy | Concatenate context | Curated context + citations | Self-evaluation + iterative refinement |
| Typical Hallucination Rate | 30-40% | 10-15% | <5% |
| End-to-End Latency | 1-2s | 2-4s | 5-15s |
| Use Case | Demo/Prototype | Production | Complex reasoning |
| Implementation Complexity | Low | Medium | High |
Embedding Model Selection and Evaluation
Embeddings are RAG's "eyes": choose the wrong model and retrieval quality drops no matter how good the rest of the pipeline is. The table below compares mainstream embedding models as of 2026:
| Model | Dimensions | MTEB Score | Chinese MTEB | Price/1M tokens | Deployment |
|---|---|---|---|---|---|
| text-embedding-3-large | 3072 | 64.6 | 64.2 | $0.13 | API |
| text-embedding-3-small | 1536 | 62.3 | 58.7 | $0.02 | API |
| bge-m3 | 1024 | 65.8 | 70.1 | Free | Local/API |
| gte-Qwen2-1.5B | 1536 | 67.2 | 72.5 | Free | Local/API |
| gte-Qwen2-7B | 3584 | 70.1 | 75.8 | Free | Local (GPU required) |
| Cohere embed-v4 | 1024 | 66.1 | 61.3 | $0.10 | API |
| Voyage-3 | 1024 | 67.8 | 63.9 | $0.12 | API |
Selection Guide
| Scenario | Recommended Model | Rationale |
|---|---|---|
| Chinese-dominant enterprise KB | gte-Qwen2-1.5B | Strong Chinese MTEB (second only to the 7B variant), free local deployment on modest hardware |
| Multilingual mix | bge-m3 | Native multilingual, 1024-dim cost-effective |
| Maximum quality | gte-Qwen2-7B | 7B parameters, best results, needs GPU |
| Quick launch, no ops | text-embedding-3-small | API call, $0.02/1M tokens |
| Budget available, stability first | text-embedding-3-large | OpenAI ecosystem, 3072-dim high precision |
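Dimension count and price translate directly into storage and ingestion cost, which is often what decides between the rows above. The sketch below works through the arithmetic; the `EmbeddingFootprint` class is hypothetical, it assumes uncompressed float32 vectors, and it ignores index overhead.

```java
/** Back-of-the-envelope sizing helper; names are illustrative only. */
final class EmbeddingFootprint {

    /** Raw float32 storage in bytes: vectors x dimensions x 4. Index overhead is not included. */
    static long rawBytes(long vectorCount, int dimensions) {
        return vectorCount * dimensions * Float.BYTES;
    }

    /** One-off API cost in USD for embedding a corpus of the given token count. */
    static double embeddingCostUsd(long totalTokens, double pricePerMillionTokens) {
        return totalTokens / 1_000_000.0 * pricePerMillionTokens;
    }

    public static void main(String[] args) {
        // 1M chunks at 3072 dimensions vs 1024 dimensions
        System.out.println(rawBytes(1_000_000, 3072)); // 12288000000 bytes, about 12.3 GB
        System.out.println(rawBytes(1_000_000, 1024)); // 4096000000 bytes, about 4.1 GB
        // Embedding a 500M-token corpus at the two OpenAI price points from the table
        System.out.println(embeddingCostUsd(500_000_000L, 0.13));
        System.out.println(embeddingCostUsd(500_000_000L, 0.02));
    }
}
```

At one million chunks, the 3072-dimension model needs three times the raw vector storage of a 1024-dimension model before any index is built.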
Java Embedding Service Implementation
public class EmbeddingService {
private final OpenAiChatModel chatModel;
private final RestTemplate restTemplate;
private final String embeddingApiUrl;
private final String embeddingModel;
public EmbeddingService(OpenAiChatModel chatModel, String apiUrl, String model) {
this.chatModel = chatModel;
this.restTemplate = new RestTemplate();
this.embeddingApiUrl = apiUrl;
this.embeddingModel = model;
}
public float[] embed(String text) {
Map<String, Object> request = Map.of(
"model", embeddingModel,
"input", text
);
ResponseEntity<Map> response = restTemplate.postForEntity(
embeddingApiUrl + "/embeddings",
request,
Map.class
);
List<Double> embedding = (List<Double>) ((Map) ((List<?>) response.getBody().get("data")).get(0)).get("embedding");
float[] result = new float[embedding.size()];
for (int i = 0; i < embedding.size(); i++) {
result[i] = embedding.get(i).floatValue();
}
return result;
}
public List<float[]> embedBatch(List<String> texts) {
Map<String, Object> request = Map.of(
"model", embeddingModel,
"input", texts
);
ResponseEntity<Map> response = restTemplate.postForEntity(
embeddingApiUrl + "/embeddings",
request,
Map.class
);
List<?> dataList = (List<?>) response.getBody().get("data");
return dataList.stream()
.map(item -> {
List<Double> embedding = (List<Double>) ((Map) item).get("embedding");
float[] arr = new float[embedding.size()];
for (int i = 0; i < embedding.size(); i++) {
arr[i] = embedding.get(i).floatValue();
}
return arr;
})
.collect(Collectors.toList());
}
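public static float cosineSimilarity(float[] a, float[] b) {
if (a.length != b.length) {
throw new IllegalArgumentException("Vector dimensions differ: " + a.length + " vs " + b.length);
}
// Accumulate in double to limit rounding error on high-dimensional vectors
double dotProduct = 0.0;
double normA = 0.0;
double normB = 0.0;
for (int i = 0; i < a.length; i++) {
dotProduct += (double) a[i] * b[i];
normA += (double) a[i] * a[i];
normB += (double) b[i] * b[i];
}
if (normA == 0.0 || normB == 0.0) {
return 0.0f; // zero vector: similarity is undefined, report 0 instead of NaN
}
return (float) (dotProduct / (Math.sqrt(normA) * Math.sqrt(normB)));
}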
}
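When vectors are normalized to unit length once at ingestion time, cosine similarity reduces to a plain dot product, which is cheaper to compute per comparison. The following is a minimal sketch of that idea; `VectorMath` and its method names are hypothetical and not part of the service above.

```java
/** Unit-normalization helper; illustrative names, not a library API. */
final class VectorMath {

    /** Returns a unit-length copy of v; a zero vector is returned unchanged (all zeros). */
    static float[] normalize(float[] v) {
        double sumSquares = 0.0;
        for (float x : v) {
            sumSquares += (double) x * x;
        }
        double norm = Math.sqrt(sumSquares);
        float[] out = new float[v.length];
        if (norm == 0.0) {
            return out;
        }
        for (int i = 0; i < v.length; i++) {
            out[i] = (float) (v[i] / norm);
        }
        return out;
    }

    /** Dot product; equals cosine similarity when both inputs are unit-length. */
    static double dot(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vector dimensions differ");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (double) a[i] * b[i];
        }
        return sum;
    }
}
```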
Vector Database Comparison: PGVector vs Milvus vs Qdrant
In 2026, the question isn't "do I need a vector database?" but "which one should I choose?" The table below compares three mainstream options:
| Dimension | PGVector | Milvus | Qdrant |
|---|---|---|---|
| Underlying | PostgreSQL extension | Standalone distributed system | Standalone Rust service |
| Max Vectors | Tens of millions | Tens of billions | Billions |
| Query Latency (1M vectors) | 25-40ms | 15-25ms | 10-18ms |
| Hybrid Search | Needs tsvector | Native support | Native support |
| Distributed | Relies on PG logical replication | Native distributed | Shard support |
| Ops Complexity | Low (reuses PG) | High | Medium |
| Transaction Support | ACID | Limited | Limited |
| Filtering | Full SQL | Scalar filtering | Payload filtering |
| Ecosystem | Spring Data JPA | Java SDK | Java SDK |
| Use Case | Existing PG, small-medium scale | Ultra-large scale, high concurrency | High performance, medium scale |
Architecture Comparison
┌────────────────────────────────────────────────────────────────┐
│ PGVector Architecture │
├────────────────────────────────────────────────────────────────┤
│ Application ──→ PostgreSQL (pgvector extension) │
│ ├── Vector Index (HNSW/IVFFlat) │
│ ├── Full-Text Search (tsvector) │
│ ├── Relational Data (row storage) │
│ └── Transactions/ACID │
│ Pros: Zero extra ops, full SQL capabilities │
│ Cons: Performance limited at scale, weak distribution │
├────────────────────────────────────────────────────────────────┤
│ Milvus Architecture │
├────────────────────────────────────────────────────────────────┤
│ Application ──→ Proxy ──→ Coordinator │
│ ├── Query Node (retrieval) │
│ ├── Data Node (writes) │
│ └── Index Node (index building) │
│ Storage: MinIO/S3 + etcd │
│ Pros: Billion-scale, native distributed, cloud-native │
│ Cons: Complex ops, high resource usage │
├────────────────────────────────────────────────────────────────┤
│ Qdrant Architecture │
├────────────────────────────────────────────────────────────────┤
│ Application ──→ Qdrant Service (Rust) │
│ ├── HNSW Index │
│ ├── Payload Filtering │
│ ├── WAL Persistence │
│ └── Sharded Cluster │
│ Pros: Rust high performance, low latency, clean API │
│ Cons: Not as scalable as Milvus at extreme scale │
└────────────────────────────────────────────────────────────────┘
Spring Boot Qdrant Integration
@Configuration
public class QdrantConfig {
@Bean
public QdrantClient qdrantClient() {
return new QdrantClient(
QdrantGrpcClient.newBuilder("localhost", 6334, false).build()
);
}
}
@Service
public class QdrantVectorStore {
private final QdrantClient qdrantClient;
private final EmbeddingService embeddingService;
private static final String COLLECTION_NAME = "knowledge_base";
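// Must equal the output dimension of the configured embedding model (1536 matches text-embedding-3-small and gte-Qwen2-1.5B)
private static final int VECTOR_SIZE = 1536;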
public QdrantVectorStore(QdrantClient qdrantClient, EmbeddingService embeddingService) {
this.qdrantClient = qdrantClient;
this.embeddingService = embeddingService;
}
public void createCollection() throws ExecutionException, InterruptedException {
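qdrantClient.createCollectionAsync(
// CreateCollection is the request message; CollectionInfo is a response type and cannot be used here
CreateCollection.newBuilder()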
.setCollectionName(COLLECTION_NAME)
.setVectorsConfig(VectorsConfig.newBuilder()
.setParams(VectorParams.newBuilder()
.setSize(VECTOR_SIZE)
.setDistance(Distance.Cosine)
.build())
.build())
.setOptimizersConfig(OptimizersConfigDiff.newBuilder()
.setIndexingThreshold(20000)
.build())
.setHnswConfig(HnswConfigDiff.newBuilder()
.setM(16)
.setEfConstruct(100)
.build())
.build()
).get();
}
public void upsertDocuments(List<DocumentChunk> chunks) throws ExecutionException, InterruptedException {
List<float[]> embeddings = embeddingService.embedBatch(
chunks.stream().map(DocumentChunk::getContent).collect(Collectors.toList())
);
List<PointStruct> points = new ArrayList<>();
for (int i = 0; i < chunks.size(); i++) {
DocumentChunk chunk = chunks.get(i);
points.add(PointStruct.newBuilder()
.setId(PointId.newBuilder().setUuid(UUID.randomUUID().toString()).build())
// VectorsFactory (io.qdrant.client) wraps a float[] into the Vectors message
.setVectors(VectorsFactory.vectors(embeddings.get(i)))
.putAllPayload(Map.of(
"content", Value.newBuilder().setStringValue(chunk.getContent()).build(),
"source", Value.newBuilder().setStringValue(chunk.getSource()).build(),
"section", Value.newBuilder().setStringValue(chunk.getSection()).build()
))
.build());
}
qdrantClient.upsertAsync(COLLECTION_NAME, points).get();
}
public List<SearchResult> search(String query, int topK) throws ExecutionException, InterruptedException {
float[] queryVector = embeddingService.embed(query);
List<ScoredPoint> results = qdrantClient.searchAsync(
SearchPoints.newBuilder()
.setCollectionName(COLLECTION_NAME)
// SearchPoints carries the query as a repeated float field, not as a Vector message
.addAllVector(toFloatList(queryVector))
.setLimit(topK)
.setWithPayload(WithPayloadSelectorFactory.enable(true))
.build()
).get();
return results.stream()
.map(point -> new SearchResult(
point.getPayload().get("content").getStringValue(),
point.getPayload().get("source").getStringValue(),
point.getScore()
))
.collect(Collectors.toList());
}
private List<Float> toFloatList(float[] arr) {
List<Float> list = new ArrayList<>(arr.length);
for (float v : arr) {
list.add(v);
}
return list;
}
}
Spring Boot PGVector Integration
@Entity
@Table(name = "documents")
public class DocumentEntity {
@Id
@GeneratedValue(strategy = GenerationType.UUID)
private UUID id;
private String content;
private String source;
private String section;
@Column(columnDefinition = "vector(1536)")
private float[] embedding;
@Column(columnDefinition = "tsvector")
private String searchText;
}
@Mapper
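public interface DocumentMapper extends BaseMapper<DocumentEntity> {
// <=> is pgvector's cosine distance operator (smaller means more similar).
// The query vector is passed as a pgvector text literal such as "[0.1,0.2,...]" and cast explicitly,
// because MyBatis has no built-in type handler that maps float[] to the vector type.
@Select("SELECT *, embedding <=> CAST(#{embedding} AS vector) AS distance " +
"FROM documents " +
"WHERE embedding <=> CAST(#{embedding} AS vector) < #{threshold} " +
"ORDER BY embedding <=> CAST(#{embedding} AS vector) " +
"LIMIT #{limit}")
List<DocumentEntity> vectorSearch(@Param("embedding") String embedding,
@Param("threshold") float threshold,
@Param("limit") int limit);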
@Select("SELECT *, ts_rank(search_text, plainto_tsquery(#{query})) AS rank " +
"FROM documents " +
"WHERE search_text @@ plainto_tsquery(#{query}) " +
"ORDER BY rank DESC " +
"LIMIT #{limit}")
List<DocumentEntity> fullTextSearch(@Param("query") String query,
@Param("limit") int limit);
}
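pgvector accepts vectors as text literals of the form `[x1,x2,...]`, so one way to hand a Java `float[]` to a query like the ones above is to serialize it and cast the parameter to `vector` in SQL. The helper below is a small sketch of that serialization; `PgVectorLiteral` is a hypothetical name, and a custom type handler would be the alternative.

```java
/** Serializes a float[] into pgvector's text form; illustrative helper, not part of any library. */
final class PgVectorLiteral {

    /** Converts {0.5f, -1.0f} into "[0.5,-1.0]". Rejects NaN and infinities, which pgvector does not store. */
    static String toLiteral(float[] vector) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < vector.length; i++) {
            if (!Float.isFinite(vector[i])) {
                throw new IllegalArgumentException("Non-finite value at index " + i);
            }
            if (i > 0) {
                sb.append(',');
            }
            sb.append(vector[i]);
        }
        return sb.append(']').toString();
    }
}
```

The resulting string can be bound as an ordinary text parameter and cast with `CAST(? AS vector)`.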
Advanced RAG in Practice: Query Rewriting + Hybrid Retrieval + Reranking
This is the standard architecture for production RAG in 2026. Single vector retrieval is no longer sufficient — hybrid retrieval + reranking is the way to go.
┌──────────────────────────────────────────────────────────────────┐
│ Advanced RAG Complete Pipeline │
├──────────────────────────────────────────────────────────────────┤
│ │
│ User Query: "What drove the company's Q4 2025 revenue growth?" │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────┐ │
│ │ 1. Query Rewriting │ │
│ │ Original → 3 rewrites + HyDE │ │
│ └──────────────┬──────────────────┘ │
│ │ │
│ ┌─────────┴─────────┐ │
│ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ │
│ │ Vector │ │ BM25 │ │
│ │ Search │ │ Search │ │
│ │(Semantic)│ │(Keyword) │ │
│ └────┬─────┘ └────┬─────┘ │
│ │ │ │
│ └────────┬─────────┘ │
│ ▼ │
│ ┌─────────────────────────────────┐ │
│ │ 2. RRF Fusion (Reciprocal Rank │ │
│ │ Fusion) │ │
│ │ Vector 0.7 + Keyword 0.3 │ │
│ └──────────────┬──────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────┐ │
│ │ 3. Cross-Encoder Reranking │ │
│ │ Fine-grained query-doc scoring │ │
│ └──────────────┬──────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────┐ │
│ │ 4. Context Injection + LLM Gen │ │
│ │ Precise answer with citations │ │
│ └─────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────┘
Query Rewriting Implementation
@Service
public class QueryRewriteService {
private final OpenAiChatModel chatModel;
public QueryRewriteService(OpenAiChatModel chatModel) {
this.chatModel = chatModel;
}
public List<String> rewriteQuery(String originalQuery) {
String prompt = """
Rewrite the following user query into 3 different search queries from different angles to improve retrieval recall.
Requirements:
1. Preserve the core intent of the original query
2. Express the same need from different perspectives
3. Add possible technical terms or synonyms
4. Output JSON format: {"rewrites": ["query1", "query2", "query3"]}
Original query: %s
""".formatted(originalQuery);
ChatResponse response = chatModel.call(new ChatRequest(
List.of(new Message("user", prompt)),
ChatOptions.builder().withTemperature(0.3).build()
));
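String content = response.getResult().getOutput().getContent();
try {
Map<String, Object> result = new ObjectMapper().readValue(content, Map.class);
Object rewrites = result.get("rewrites");
return rewrites instanceof List<?> list
? list.stream().map(String::valueOf).collect(Collectors.toList())
: Collections.emptyList();
} catch (JsonProcessingException e) {
// Malformed model output: return no rewrites so the caller still searches with the original query
return Collections.emptyList();
}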
}
public String generateHyde(String query) {
String prompt = """
Provide a detailed answer to the following question, even if you're uncertain.
This answer will be used to retrieve relevant documents, so include as many
relevant details and technical terms as possible.
Question: %s
""".formatted(query);
ChatResponse response = chatModel.call(new ChatRequest(
List.of(new Message("user", prompt)),
ChatOptions.builder().withTemperature(0.5).build()
));
return response.getResult().getOutput().getContent();
}
}
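Chat models often wrap the requested JSON in extra prose or a Markdown code fence, which makes a direct `readValue` call fail. A minimal, library-free sketch of a pre-processing step is shown below; `JsonExtractor` is a hypothetical helper, and it assumes the reply contains a single top-level JSON object.

```java
import java.util.Optional;

/** Pulls the outermost {...} span out of a model reply; illustrative helper only. */
final class JsonExtractor {

    /** Returns the text from the first '{' to the last '}', or empty if no such span exists. */
    static Optional<String> extractObject(String reply) {
        if (reply == null) {
            return Optional.empty();
        }
        int start = reply.indexOf('{');
        int end = reply.lastIndexOf('}');
        if (start < 0 || end <= start) {
            return Optional.empty();
        }
        return Optional.of(reply.substring(start, end + 1));
    }
}
```

The extracted span can then be handed to the JSON parser, with an empty result treated the same way as a parse failure.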
Hybrid Retrieval + RRF Fusion
@Service
public class HybridRetrievalService {
private final QdrantVectorStore vectorStore;
private final DocumentMapper documentMapper;
private final EmbeddingService embeddingService;
private static final double VECTOR_WEIGHT = 0.7;
private static final double KEYWORD_WEIGHT = 0.3;
private static final int RRF_K = 60;
public HybridRetrievalService(QdrantVectorStore vectorStore,
DocumentMapper documentMapper,
EmbeddingService embeddingService) {
this.vectorStore = vectorStore;
this.documentMapper = documentMapper;
this.embeddingService = embeddingService;
}
public List<SearchResult> hybridSearch(String query, int topK) {
List<SearchResult> vectorResults = vectorSearch(query, topK * 2);
List<SearchResult> keywordResults = keywordSearch(query, topK * 2);
List<SearchResult> fused = reciprocalRankFusion(
List.of(vectorResults, keywordResults),
List.of(VECTOR_WEIGHT, KEYWORD_WEIGHT)
);
return fused.stream().limit(topK).collect(Collectors.toList());
}
private List<SearchResult> vectorSearch(String query, int limit) {
try {
return vectorStore.search(query, limit);
} catch (Exception e) {
return Collections.emptyList();
}
}
private List<SearchResult> keywordSearch(String query, int limit) {
List<DocumentEntity> results = documentMapper.fullTextSearch(query, limit);
return results.stream()
.map(doc -> new SearchResult(doc.getContent(), doc.getSource(), 0.0))
.collect(Collectors.toList());
}
private List<SearchResult> reciprocalRankFusion(
List<List<SearchResult>> resultSets,
List<Double> weights) {
Map<String, Double> scoreMap = new HashMap<>();
Map<String, SearchResult> resultMap = new HashMap<>();
for (int setIndex = 0; setIndex < resultSets.size(); setIndex++) {
List<SearchResult> results = resultSets.get(setIndex);
double weight = weights.get(setIndex);
for (int rank = 0; rank < results.size(); rank++) {
String key = results.get(rank).getContent();
double score = weight / (rank + 1 + RRF_K);
scoreMap.merge(key, score, Double::sum);
resultMap.putIfAbsent(key, results.get(rank));
}
}
return scoreMap.entrySet().stream()
.sorted(Map.Entry.<String, Double>comparingByValue().reversed())
.map(entry -> {
SearchResult result = resultMap.get(entry.getKey());
return new SearchResult(result.getContent(), result.getSource(), entry.getValue());
})
.collect(Collectors.toList());
}
}
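The fusion step scores each document as the sum over result lists of `weight / (k + rank)`, with 1-based ranks and `k = 60`. The standalone sketch below reproduces that formula on document IDs so the arithmetic can be checked by hand; `WeightedRrf` and the sample IDs are illustrative only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Weighted Reciprocal Rank Fusion over document IDs; illustrative sketch. */
final class WeightedRrf {
    static final int K = 60;

    /** Sums weight / (K + rank) per document, where rank is 1-based within each list. */
    static Map<String, Double> fuse(List<List<String>> rankedLists, List<Double> weights) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (int s = 0; s < rankedLists.size(); s++) {
            List<String> list = rankedLists.get(s);
            for (int rank = 0; rank < list.size(); rank++) {
                scores.merge(list.get(rank), weights.get(s) / (K + rank + 1), Double::sum);
            }
        }
        return scores;
    }

    /** Returns document IDs ordered by fused score, highest first. */
    static List<String> ranked(Map<String, Double> scores) {
        List<String> ids = new ArrayList<>(scores.keySet());
        ids.sort(Comparator.comparingDouble((String id) -> scores.get(id)).reversed());
        return ids;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = fuse(
            List.of(List.of("A", "B", "C"), List.of("C", "A", "D")),
            List.of(0.7, 0.3));
        // A = 0.7/61 + 0.3/62, C = 0.7/63 + 0.3/61, B = 0.7/62, D = 0.3/63
        System.out.println(ranked(scores)); // [A, C, B, D]
    }
}
```

Document C is only third in the vector list, yet it overtakes B because it also appears at the top of the keyword list. Rewarding agreement between retrievers is what the fusion is for.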
Cross-Encoder Reranking
@Service
public class RerankService {
private final OpenAiChatModel chatModel;
public RerankService(OpenAiChatModel chatModel) {
this.chatModel = chatModel;
}
public List<SearchResult> rerank(String query, List<SearchResult> candidates, int topK) {
List<CompletableFuture<ScoredResult>> futures = candidates.stream()
.map(candidate -> CompletableFuture.supplyAsync(() -> scoreRelevance(query, candidate)))
.collect(Collectors.toList());
List<ScoredResult> scored = futures.stream()
.map(CompletableFuture::join)
.sorted(Comparator.comparingDouble(ScoredResult::score).reversed())
.limit(topK)
.collect(Collectors.toList());
return scored.stream()
.map(sr -> new SearchResult(sr.content(), sr.source(), sr.score()))
.collect(Collectors.toList());
}
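// Note: this scores each query-document pair with the chat model (LLM-as-judge),
// a simple stand-in for a dedicated cross-encoder reranking model.
private ScoredResult scoreRelevance(String query, SearchResult candidate) {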
String prompt = """
Rate the relevance of the following document to the query. Output only an integer from 0 to 10.
Query: %s
Document: %s
Score:
""".formatted(query, candidate.getContent());
ChatResponse response = chatModel.call(new ChatRequest(
List.of(new Message("user", prompt)),
ChatOptions.builder().withTemperature(0.0).build()
));
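double score;
try {
score = Double.parseDouble(response.getResult().getOutput().getContent().trim());
} catch (NumberFormatException e) {
score = 0.0; // unparseable reply: treat as irrelevant instead of failing the whole rerank
}
score = Math.max(0.0, Math.min(10.0, score)); // clamp to the requested 0-10 range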
return new ScoredResult(candidate.getContent(), candidate.getSource(), score / 10.0);
}
}
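Even at temperature 0, a chat model asked for "only an integer" sometimes replies with text such as `Score: 8` or `7/10`. A tolerant parser keeps one odd reply from breaking the ranking. The sketch below is one possible approach; `RelevanceScoreParser` is a hypothetical helper, and treating unparseable replies as 0 is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Tolerant parser for 0-10 relevance replies; illustrative helper only. */
final class RelevanceScoreParser {
    private static final Pattern FIRST_NUMBER = Pattern.compile("\\d+(?:\\.\\d+)?");

    /** Takes the first number in the reply, clamps it to [0, 10], and scales it to [0, 1]. */
    static double parse(String reply) {
        if (reply == null) {
            return 0.0;
        }
        Matcher matcher = FIRST_NUMBER.matcher(reply);
        if (!matcher.find()) {
            return 0.0; // no number at all: treat the candidate as irrelevant
        }
        double raw = Double.parseDouble(matcher.group());
        return Math.max(0.0, Math.min(10.0, raw)) / 10.0;
    }
}
```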
Complete Advanced RAG Pipeline
@Service
public class AdvancedRagPipeline {
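private final QueryRewriteService queryRewriteService;
private final HybridRetrievalService hybridRetrievalService;
private final RerankService rerankService;
private final EmbeddingService embeddingService;
private final OpenAiChatModel chatModel;
// Constructor injection: the final fields must be assigned for the class to compile
public AdvancedRagPipeline(QueryRewriteService queryRewriteService,
HybridRetrievalService hybridRetrievalService,
RerankService rerankService,
EmbeddingService embeddingService,
OpenAiChatModel chatModel) {
this.queryRewriteService = queryRewriteService;
this.hybridRetrievalService = hybridRetrievalService;
this.rerankService = rerankService;
this.embeddingService = embeddingService;
this.chatModel = chatModel;
}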
public RagResponse query(String userQuery) {
List<String> allQueries = new ArrayList<>();
allQueries.add(userQuery);
allQueries.addAll(queryRewriteService.rewriteQuery(userQuery));
String hydeAnswer = queryRewriteService.generateHyde(userQuery);
allQueries.add(hydeAnswer);
List<SearchResult> allResults = new ArrayList<>();
for (String q : allQueries) {
allResults.addAll(hybridRetrievalService.hybridSearch(q, 10));
}
List<SearchResult> deduplicated = deduplicate(allResults);
List<SearchResult> reranked = rerankService.rerank(userQuery, deduplicated, 5);
String context = buildContext(reranked);
String answer = generateAnswer(userQuery, context);
return new RagResponse(answer, reranked);
}
private List<SearchResult> deduplicate(List<SearchResult> results) {
Map<String, SearchResult> uniqueMap = new LinkedHashMap<>();
for (SearchResult result : results) {
uniqueMap.putIfAbsent(result.getContent(), result);
}
return new ArrayList<>(uniqueMap.values());
}
private String buildContext(List<SearchResult> results) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < results.size(); i++) {
SearchResult r = results.get(i);
sb.append("[%d] Source: %s\n%s\n\n".formatted(i + 1, r.getSource(), r.getContent()));
}
return sb.toString();
}
private String generateAnswer(String query, String context) {
String systemPrompt = """
You are a knowledge base Q&A assistant. Answer the user's question based on the retrieved documents below.
Rules:
1. Answer only based on the retrieved documents — do not fabricate information
2. Every claim must cite its source [1][2]...
3. If the retrieved results are insufficient, state this clearly
4. Prioritize the most recent and relevant information
Retrieved Documents:
%s
""".formatted(context);
ChatResponse response = chatModel.call(new ChatRequest(
List.of(
new Message("system", systemPrompt),
new Message("user", query)
),
ChatOptions.builder().withTemperature(0.1).build()
));
return response.getResult().getOutput().getContent();
}
}
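The system prompt asks the model to cite sources as `[1][2]...`, but nothing in the pipeline verifies that it did. A cheap post-check is to confirm that every cited index refers to a document that was actually injected. The sketch below shows one way to do this; `CitationChecker` is a hypothetical helper, and rejecting answers with no citations at all is an assumption.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Validates [n]-style citations against the number of injected sources; illustrative only. */
final class CitationChecker {
    private static final Pattern CITATION = Pattern.compile("\\[(\\d{1,4})]");

    /** Collects the distinct citation indexes that appear in the answer. */
    static Set<Integer> citedIndexes(String answer) {
        Set<Integer> cited = new TreeSet<>();
        Matcher matcher = CITATION.matcher(answer);
        while (matcher.find()) {
            cited.add(Integer.parseInt(matcher.group(1)));
        }
        return cited;
    }

    /** True only if the answer cites at least one source and every index is within 1..sourceCount. */
    static boolean allCitationsValid(String answer, int sourceCount) {
        Set<Integer> cited = citedIndexes(answer);
        if (cited.isEmpty()) {
            return false;
        }
        for (int index : cited) {
            if (index < 1 || index > sourceCount) {
                return false;
            }
        }
        return true;
    }
}
```

An answer that fails this check could be regenerated or returned with a warning, depending on how strict the application needs to be.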
Document Chunking Strategies
Chunking is the foundation of RAG: if chunks cut through the middle of an idea, even the best retriever returns fragments the LLM cannot use. The table below compares four mainstream strategies:
| Strategy | Principle | Pros | Cons | Use Case | Granularity |
|---|---|---|---|---|---|
| Fixed Length | Split by token/character count | Simple, controllable | Breaks semantic integrity | Logs, tabular data | Fixed chunk_size |
| Semantic Chunking | Embedding similarity breakpoint detection | Good semantic integrity | High compute cost | Technical docs, papers | Adaptive |
| Structure Chunking | Split by headings/sections/paragraphs | Preserves document structure | Requires parser support | Markdown/HTML/PDF | By structure hierarchy |
| Topic Chunking | LLM identifies topic boundaries | Highly cohesive topics | High LLM call cost | Long docs, multi-topic | By topic |
Fixed Length Chunking (Java)
public class FixedLengthChunker {
private final int chunkSize;
private final int overlapSize;
public FixedLengthChunker(int chunkSize, int overlapSize) {
this.chunkSize = chunkSize;
this.overlapSize = overlapSize;
}
public List<DocumentChunk> chunk(String content, String source) {
List<DocumentChunk> chunks = new ArrayList<>();
int start = 0;
int index = 0;
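while (start < content.length()) {
int end = Math.min(start + chunkSize, content.length());
String text = content.substring(start, end);
if (end < content.length()) {
// Prefer to end the chunk at a sentence or line boundary in its second half
int lastPeriod = text.lastIndexOf('.');
int lastNewline = text.lastIndexOf('\n');
int breakPoint = Math.max(lastPeriod, lastNewline);
if (breakPoint > chunkSize / 2) {
text = text.substring(0, breakPoint + 1);
end = start + breakPoint + 1;
}
}
chunks.add(new DocumentChunk(text, source, "chunk-" + index, index));
index++;
if (end >= content.length()) {
break; // last chunk emitted; stepping back by the overlap here would loop forever
}
// Step back by the overlap, but always advance by at least one character
start = Math.max(end - overlapSize, start + 1);
}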
return chunks;
}
}
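For plain fixed-length windows (ignoring the sentence-boundary adjustment above), the number of chunks follows from the stride `chunkSize - overlap`: a text longer than one chunk yields `ceil((length - overlap) / stride)` chunks. The sketch below checks that formula against an explicit window walk; `ChunkMath` is a hypothetical helper used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Arithmetic for fixed-size sliding windows; illustrative helper only. */
final class ChunkMath {

    /** Closed-form chunk count for fixed windows with overlap. */
    static int chunkCount(int length, int chunkSize, int overlap) {
        validate(chunkSize, overlap);
        if (length <= 0) {
            return 0;
        }
        if (length <= chunkSize) {
            return 1;
        }
        int stride = chunkSize - overlap;
        return (length - overlap + stride - 1) / stride; // integer ceiling
    }

    /** Explicit [start, end) windows, stopping once a window reaches the end of the text. */
    static List<int[]> windows(int length, int chunkSize, int overlap) {
        validate(chunkSize, overlap);
        List<int[]> out = new ArrayList<>();
        int stride = chunkSize - overlap;
        for (int start = 0; start < length; start += stride) {
            int end = Math.min(start + chunkSize, length);
            out.add(new int[] {start, end});
            if (end == length) {
                break;
            }
        }
        return out;
    }

    private static void validate(int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("Require 0 <= overlap < chunkSize");
        }
    }
}
```

For example, a 1000-character text with `chunkSize = 300` and `overlap = 50` gives a stride of 250 and four windows starting at 0, 250, 500, and 750.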
Semantic Chunking (Java)
@Service
public class SemanticChunker {
private final EmbeddingService embeddingService;
private static final double SIMILARITY_THRESHOLD = 0.85;
public SemanticChunker(EmbeddingService embeddingService) {
this.embeddingService = embeddingService;
}
public List<DocumentChunk> chunk(String content, String source) {
List<String> sentences = splitSentences(content);
if (sentences.isEmpty()) {
return Collections.emptyList();
}
List<float[]> embeddings = embeddingService.embedBatch(sentences);
List<DocumentChunk> chunks = new ArrayList<>();
StringBuilder currentChunk = new StringBuilder(sentences.get(0));
int chunkIndex = 0;
for (int i = 1; i < sentences.size(); i++) {
float similarity = EmbeddingService.cosineSimilarity(
embeddings.get(i - 1), embeddings.get(i)
);
if (similarity >= SIMILARITY_THRESHOLD) {
// Sentences were trimmed during splitting, so restore a separating space
currentChunk.append(' ').append(sentences.get(i));
} else {
chunks.add(new DocumentChunk(
currentChunk.toString(), source, "chunk-" + chunkIndex, chunkIndex
));
currentChunk = new StringBuilder(sentences.get(i));
chunkIndex++;
}
}
if (!currentChunk.isEmpty()) {
chunks.add(new DocumentChunk(
currentChunk.toString(), source, "chunk-" + chunkIndex, chunkIndex
));
}
return chunks;
}
private List<String> splitSentences(String text) {
// Require whitespace after the terminator so decimals such as "3.14" are not split
return Arrays.stream(text.split("(?<=[.!?])\\s+"))
.map(String::trim)
.filter(s -> !s.isEmpty())
.collect(Collectors.toList());
}
}
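A fixed threshold such as 0.85 depends heavily on the embedding model, since different models produce different similarity ranges. A common alternative, which is not what the class above does, is to split at the lowest-similarity gaps of each document by using a percentile of its own adjacent-sentence similarities. The sketch below illustrates that variant; `BreakpointSelector` and the 25th-percentile choice are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Percentile-based breakpoint selection for semantic chunking; illustrative sketch. */
final class BreakpointSelector {

    /**
     * adjacentSimilarities[i] is the similarity between sentence i and sentence i + 1.
     * Returns every i whose similarity is at or below the given percentile of the array,
     * meaning a chunk boundary should fall after sentence i.
     */
    static List<Integer> breakpoints(double[] adjacentSimilarities, double percentile) {
        if (percentile < 0.0 || percentile > 100.0) {
            throw new IllegalArgumentException("percentile must be within 0..100");
        }
        List<Integer> result = new ArrayList<>();
        if (adjacentSimilarities.length == 0) {
            return result;
        }
        double[] sorted = adjacentSimilarities.clone();
        Arrays.sort(sorted);
        int thresholdIndex = (int) Math.floor(percentile / 100.0 * (sorted.length - 1));
        double threshold = sorted[thresholdIndex];
        for (int i = 0; i < adjacentSimilarities.length; i++) {
            if (adjacentSimilarities[i] <= threshold) {
                result.add(i);
            }
        }
        return result;
    }
}
```

With similarities `{0.92, 0.88, 0.41, 0.90, 0.86, 0.35, 0.91}` and the 25th percentile, the threshold is 0.41 and the boundaries fall after sentences 2 and 5, the two clear topic shifts.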
Structure Chunking (Markdown)
@Service
public class MarkdownStructureChunker {
private static final Pattern HEADING_PATTERN = Pattern.compile("^(#{1,6})\\s+(.+)$", Pattern.MULTILINE);
public List<DocumentChunk> chunk(String markdown, String source) {
List<Section> sections = parseSections(markdown);
return sections.stream()
.map(section -> new DocumentChunk(
section.content(),
source,
section.heading(),
section.level()
))
.collect(Collectors.toList());
}
private List<Section> parseSections(String markdown) {
List<Section> sections = new ArrayList<>();
Matcher matcher = HEADING_PATTERN.matcher(markdown);
List<Integer> positions = new ArrayList<>();
List<String> headings = new ArrayList<>();
List<Integer> levels = new ArrayList<>();
while (matcher.find()) {
positions.add(matcher.start());
headings.add(matcher.group(2).trim());
levels.add(matcher.group(1).length());
}
for (int i = 0; i < positions.size(); i++) {
int start = positions.get(i);
int end = (i + 1 < positions.size()) ? positions.get(i + 1) : markdown.length();
String content = markdown.substring(start, end).trim();
sections.add(new Section(headings.get(i), content, levels.get(i)));
}
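if (positions.isEmpty()) {
// No headings at all: keep the whole document as one section instead of dropping it
if (!markdown.isBlank()) {
sections.add(new Section("Document", markdown.trim(), 0));
}
} else if (positions.get(0) > 0) {
String preamble = markdown.substring(0, positions.get(0)).trim();
if (!preamble.isEmpty()) {
sections.add(0, new Section("Preamble", preamble, 0));
}
}
return sections;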
}
}
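A section stored under its own heading alone ("Install") loses the context of its parents ("Guide"). One common refinement is to attach the full heading path to each chunk as metadata. The sketch below derives such paths from the levels and headings that the parser above already collects; `HeadingPath` and the `" > "` separator are illustrative choices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

/** Builds breadcrumb-style heading paths from Markdown heading levels; illustrative sketch. */
final class HeadingPath {

    /** levels.get(i) is the heading level (1-6) of headings.get(i), in document order. */
    static List<String> paths(List<Integer> levels, List<String> headings) {
        if (levels.size() != headings.size()) {
            throw new IllegalArgumentException("levels and headings must have the same size");
        }
        String[] open = new String[7]; // open[l] is the most recent heading at level l
        List<String> out = new ArrayList<>();
        for (int i = 0; i < levels.size(); i++) {
            int level = levels.get(i);
            if (level < 1 || level > 6) {
                throw new IllegalArgumentException("Heading level out of range: " + level);
            }
            open[level] = headings.get(i);
            for (int deeper = level + 1; deeper <= 6; deeper++) {
                open[deeper] = null; // a new heading closes all deeper sections
            }
            StringJoiner path = new StringJoiner(" > ");
            for (int l = 1; l <= level; l++) {
                if (open[l] != null) {
                    path.add(open[l]);
                }
            }
            out.add(path.toString());
        }
        return out;
    }
}
```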
Agentic RAG: Agents Decide When to Retrieve
Agentic RAG is the cutting edge in 2026. Core idea: Let the Agent itself decide whether to retrieve, what to retrieve, and whether retrieval is sufficient.
┌──────────────────────────────────────────────────────────────────┐
│ Agentic RAG Workflow │
├──────────────────────────────────────────────────────────────────┤
│ │
│ User Query: "Compare the technical architecture of Product A │
│ vs Product B" │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────┐ │
│ │ Agent thinks: Need Product A arch docs │ │
│ └──────────────────┬─────────────────────┘ │
│ ▼ │
│ ┌────────────────────────────────────────┐ │
│ │ Retrieve: Product A docs → Got context │ │
│ └──────────────────┬─────────────────────┘ │
│ ▼ │
│ ┌────────────────────────────────────────┐ │
│ │ Agent evaluates: Still need Product B │ │
│ └──────────────────┬─────────────────────┘ │
│ ▼ │
│ ┌────────────────────────────────────────┐ │
│ │ Retrieve: Product B docs → Got context │ │
│ └──────────────────┬─────────────────────┘ │
│ ▼ │
│ ┌────────────────────────────────────────┐ │
│ │ Agent evaluates: Sufficient info, can │ │
│ │ generate comparison answer │ │
│ └──────────────────┬─────────────────────┘ │
│ ▼ │
│ ┌────────────────────────────────────────┐ │
│ │ Generate: A/B architecture comparison │ │
│ │ with citations │ │
│ └────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────┘
Agentic RAG Core Implementation
@Service
public class AgenticRagService {
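private final OpenAiChatModel chatModel;
private final HybridRetrievalService retrievalService;
private final RerankService rerankService;
private static final int MAX_ITERATIONS = 5;
// Constructor injection: the final fields must be assigned for the class to compile
public AgenticRagService(OpenAiChatModel chatModel,
HybridRetrievalService retrievalService,
RerankService rerankService) {
this.chatModel = chatModel;
this.retrievalService = retrievalService;
this.rerankService = rerankService;
}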
public AgenticRagResponse query(String userQuery) {
List<RetrievalStep> steps = new ArrayList<>();
List<SearchResult> allContext = new ArrayList<>();
for (int i = 0; i < MAX_ITERATIONS; i++) {
// Always decide against the original user query; the accumulated context carries the progress so far.
// Passing the previous thought here instead would make the agent lose sight of the actual question.
AgentDecision decision = decideAction(userQuery, allContext);
steps.add(new RetrievalStep(i + 1, decision.thought(), decision.action()));
if ("GENERATE".equals(decision.action())) {
String answer = generateAnswer(userQuery, allContext);
return new AgenticRagResponse(answer, allContext, steps);
}
if ("SEARCH".equals(decision.action())) {
List<SearchResult> results = retrievalService.hybridSearch(decision.searchQuery(), 5);
List<SearchResult> reranked = rerankService.rerank(decision.searchQuery(), results, 3);
allContext.addAll(reranked);
}
if ("INSUFFICIENT".equals(decision.action())) {
return new AgenticRagResponse(
"Sorry, after multiple rounds of retrieval, I couldn't find sufficient information to answer your question.",
allContext, steps
);
}
}
String answer = generateAnswer(userQuery, allContext);
return new AgenticRagResponse(answer, allContext, steps);
}
private AgentDecision decideAction(String query, List<SearchResult> context) {
String contextStr = context.isEmpty() ? "(No retrieval results yet)" :
context.stream()
.map(r -> "- " + r.getContent().substring(0, Math.min(200, r.getContent().length())))
.collect(Collectors.joining("\n"));
String prompt = """
You are a RAG Agent that decides the next action.
User query: %s
Current context:
%s
Decide the next action:
- SEARCH: Need more retrieval (provide search_query)
- GENERATE: Have sufficient info, can generate answer
- INSUFFICIENT: Cannot find sufficient information
Output JSON format:
{"thought": "your reasoning", "action": "SEARCH|GENERATE|INSUFFICIENT", "search_query": "retrieval query (only for SEARCH)"}
""".formatted(query, contextStr);
ChatResponse response = chatModel.call(new ChatRequest(
List.of(new Message("user", prompt)),
ChatOptions.builder().withTemperature(0.1).build()
));
try {
Map<String, String> result = new ObjectMapper().readValue(
response.getResult().getOutput().getContent(), Map.class
);
return new AgentDecision(
result.get("thought"),
result.get("action"),
result.getOrDefault("search_query", "")
);
} catch (Exception e) {
return new AgentDecision("Parse failed, attempting to generate answer", "GENERATE", "");
}
}
private String generateAnswer(String query, List<SearchResult> context) {
String contextStr = context.stream()
.map(r -> "[Source: " + r.getSource() + "]\n" + r.getContent())
.collect(Collectors.joining("\n\n"));
String systemPrompt = """
Answer the user's question based on the retrieved documents below.
Rules:
1. Answer only from retrieved documents — do not fabricate
2. Cite sources for every claim
3. State clearly if information is insufficient
Retrieved Documents:
%s
""".formatted(contextStr);
ChatResponse response = chatModel.call(new ChatRequest(
List.of(new Message("system", systemPrompt), new Message("user", query)),
ChatOptions.builder().withTemperature(0.1).build()
));
return response.getResult().getOutput().getContent();
}
}
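Two failure modes are worth guarding against in a loop like this: the model returning an action outside the three allowed values, and the model issuing the same search again and again until the iteration budget runs out. The sketch below shows one possible guard; `AgentLoopGuard` is a hypothetical helper, and falling back to `GENERATE` on an unknown action mirrors the parse-failure fallback above.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Guards for an agent loop: action validation and repeated-query detection; illustrative only. */
final class AgentLoopGuard {

    enum Action { SEARCH, GENERATE, INSUFFICIENT }

    private final Set<String> seenQueries = new HashSet<>();

    /** Maps the model's action string to an enum value, falling back to GENERATE if it is unknown. */
    static Action parseAction(String raw) {
        if (raw == null) {
            return Action.GENERATE;
        }
        try {
            return Action.valueOf(raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return Action.GENERATE;
        }
    }

    /** Returns true the first time a query is seen, ignoring case and extra whitespace. */
    boolean isNewQuery(String searchQuery) {
        if (searchQuery == null || searchQuery.isBlank()) {
            return false;
        }
        String normalized = searchQuery.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
        return seenQueries.add(normalized);
    }
}
```

In the loop, a `SEARCH` decision whose query is not new could be treated as a signal to stop retrieving and generate from the context gathered so far.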
Multimodal RAG: Unified Retrieval for Images, Tables, and Code
In 2026, RAG is no longer just text retrieval. Multimodal RAG enables images, tables, and code to be retrieved and cited as well.
┌──────────────────────────────────────────────────────────────────┐
│ Multimodal RAG Architecture │
├──────────────────────────────────────────────────────────────────┤
│ │
│ Input Document (PDF/HTML/Markdown) │
│ │ │
│ ├── Text Extraction ──→ Text Embedding ──→ Vector DB │
│ │ │
│ ├── Table Extraction ──→ Table→Text Desc ──→ Embedding │
│ │ → Vector DB │
│ │ │
│ ├── Image Extraction ──→ Vision Embedding ──→ Vector DB │
│ │ (CLIP/Qwen-VL) │
│ │ │
│ └── Code Extraction ──→ Code Embedding ──→ Vector DB │
│ (CodeBERT/Specialized Model) │
│ │
│ At Query Time: Unified Vector Search → Multimodal Result │
│ Fusion → LLM Generation │
│ │
└──────────────────────────────────────────────────────────────────┘
Multimodal Document Processing
@Service
public class MultimodalDocumentProcessor {
private final EmbeddingService textEmbeddingService;
private final OpenAiChatModel visionModel;
public List<DocumentChunk> processDocument(Document document) {
List<DocumentChunk> chunks = new ArrayList<>();
chunks.addAll(processText(document.getTextContent(), document.getSource()));
chunks.addAll(processTables(document.getTables(), document.getSource()));
chunks.addAll(processImages(document.getImages(), document.getSource()));
chunks.addAll(processCodeBlocks(document.getCodeBlocks(), document.getSource()));
return chunks;
}
private List<DocumentChunk> processText(String text, String source) {
SemanticChunker chunker = new SemanticChunker(textEmbeddingService);
return chunker.chunk(text, source);
}
private List<DocumentChunk> processTables(List<Table> tables, String source) {
return tables.stream()
.map(table -> {
String description = convertTableToText(table);
return new DocumentChunk(
description, source,
"table-" + table.getIndex(),
table.getIndex(),
"TABLE"
);
})
.collect(Collectors.toList());
}
private String convertTableToText(Table table) {
StringBuilder sb = new StringBuilder();
sb.append("Table Description: ").append(table.getCaption()).append("\n");
List<String> headers = table.getHeaders();
sb.append("Columns: ").append(String.join(", ", headers)).append("\n");
for (List<String> row : table.getRows()) {
for (int i = 0; i < headers.size() && i < row.size(); i++) {
sb.append(headers.get(i)).append(": ").append(row.get(i)).append("; ");
}
sb.append("\n");
}
return sb.toString();
}
private List<DocumentChunk> processImages(List<DocumentImage> images, String source) {
return images.stream()
.map(image -> {
String description = describeImage(image);
return new DocumentChunk(
"[Image] " + description, source,
"image-" + image.getIndex(),
image.getIndex(),
"IMAGE"
);
})
.collect(Collectors.toList());
}
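private String describeImage(DocumentImage image) {
String prompt = "Describe the content of this image in detail, including chart data and key information.";
// NOTE: inlining base64 into the text prompt is a placeholder. Vision models expect the image
// as a separate image/media part of the message; pasted into text it is tokenized as plain
// characters and the model never sees the picture. Adapt this to your client's multimodal API.
ChatResponse response = visionModel.call(new ChatRequest(
List.of(new Message("user", prompt + "\n[Image base64: " + image.getBase64() + "]")),
ChatOptions.builder().withTemperature(0.1).build()
));
return response.getResult().getOutput().getContent();
}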
private List<DocumentChunk> processCodeBlocks(List<CodeBlock> codeBlocks, String source) {
return codeBlocks.stream()
.map(code -> new DocumentChunk(
"[Code] " + code.getLanguage() + "\n" + code.getContent() +
"\nDescription: " + code.getDescription(),
source,
"code-" + code.getIndex(),
code.getIndex(),
"CODE"
))
.collect(Collectors.toList());
}
}
RAG Evaluation Framework: Quantitative Assessment
No evaluation means no optimization. RAG evaluation requires quantification across both retrieval quality and generation quality:
Evaluation Metrics Framework
| Dimension | Metric | Description | Calculation | Target |
|---|---|---|---|---|
| Retrieval | Recall@K | Share of all relevant docs that appear in the top K results | (Relevant ∩ Top-K Retrieved) / Total Relevant | > 90% |
| Retrieval | MRR | Mean Reciprocal Rank of first relevant doc | avg(1/first_relevant_rank) | > 0.8 |
| Retrieval | nDCG@K | Normalized Discounted Cumulative Gain | DCG/IDCG | > 0.85 |
| Generation | Faithfulness | Answer's faithfulness to retrieved content | Supported claims / Total claims | > 95% |
| Generation | Relevancy | Answer's relevance to the query | LLM-assessed 0-1 score | > 0.9 |
| Generation | Correctness | Answer's consistency with ground truth | Semantic similarity to reference | > 0.85 |
| End-to-End | Latency | Total time from query to answer | P95 latency | < 3s |
| End-to-End | Refusal Accuracy | Correctly refusing unanswerable questions | Correct refusals / Should-refuse total | > 80% |
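The three retrieval metrics in the table can be computed directly from a ranked list of document IDs and a labeled set of relevant IDs. The sketch below assumes binary relevance (a document is either relevant or not) and unique IDs in the ranking; `RetrievalMetrics` and its method names are hypothetical and not part of the original services.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical helper for the retrieval rows of the metrics table (binary relevance). */
final class RetrievalMetrics {
    private RetrievalMetrics() {}

    /** Recall@K: fraction of all relevant docs that appear in the top K results. */
    static double recallAtK(List<String> ranked, Set<String> relevant, int k) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        long hits = ranked.stream().limit(k).distinct().filter(relevant::contains).count();
        return (double) hits / relevant.size();
    }

    /** Reciprocal rank of the first relevant doc; average over queries to get MRR. */
    static double reciprocalRank(List<String> ranked, Set<String> relevant) {
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                return 1.0 / (i + 1);
            }
        }
        return 0.0;
    }

    /** nDCG@K with gains of 0 or 1: DCG of this ranking over DCG of an ideal ranking. */
    static double ndcgAtK(List<String> ranked, Set<String> relevant, int k) {
        double dcg = 0.0;
        int limit = Math.min(k, ranked.size());
        for (int i = 0; i < limit; i++) {
            if (relevant.contains(ranked.get(i))) {
                dcg += 1.0 / log2(i + 2); // position i is rank i+1, discount log2(rank+1)
            }
        }
        double idcg = 0.0;
        int ideal = Math.min(k, relevant.size());
        for (int i = 0; i < ideal; i++) {
            idcg += 1.0 / log2(i + 2);
        }
        return idcg == 0.0 ? 0.0 : dcg / idcg;
    }

    private static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }
}
```

For a ranking of `d3, d1, d7, d2` with relevant set `{d1, d2, d9}`, Recall@3 is 1/3 (only `d1` is in the top three) and the reciprocal rank is 1/2 (the first relevant document sits at rank 2).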
Evaluation Framework Implementation
@Service
public class RagEvaluationService {
private final OpenAiChatModel chatModel;
public EvaluationResult evaluate(RagResponse response, String groundTruth) {
double faithfulness = evaluateFaithfulness(response);
double relevancy = evaluateRelevancy(response);
double correctness = evaluateCorrectness(response, groundTruth);
return new EvaluationResult(faithfulness, relevancy, correctness);
}
private double evaluateFaithfulness(RagResponse response) {
String prompt = """
Evaluate the faithfulness of the following answer to the retrieved content.
Retrieved Content:
%s
Answer:
%s
Extract each claim in the answer and determine if it can be supported by the retrieved content.
Output JSON: {"total_claims": N, "supported_claims": M}
""".formatted(
response.getSources().stream()
.map(SearchResult::getContent)
.collect(Collectors.joining("\n")),
response.getAnswer()
);
ChatResponse llmResponse = chatModel.call(new ChatRequest(
List.of(new Message("user", prompt)),
ChatOptions.builder().withTemperature(0.0).build()
));
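try {
Map<String, Object> result = new ObjectMapper().readValue(
llmResponse.getResult().getOutput().getContent(), Map.class
);
int total = ((Number) result.get("total_claims")).intValue();
int supported = ((Number) result.get("supported_claims")).intValue();
// Guard against division by zero when the judge extracts no claims.
return total == 0 ? 0.0 : (double) supported / total;
} catch (Exception e) {
return 0.0;
}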
}
private double evaluateRelevancy(RagResponse response) {
String prompt = """
Rate the relevance of the following answer to the query (0-10).
Query: %s
Answer: %s
Output only an integer from 0 to 10.
""".formatted(response.getQuery(), response.getAnswer());
ChatResponse llmResponse = chatModel.call(new ChatRequest(
List.of(new Message("user", prompt)),
ChatOptions.builder().withTemperature(0.0).build()
));
try {
return Double.parseDouble(llmResponse.getResult().getOutput().getContent().trim()) / 10.0;
} catch (Exception e) {
return 0.0;
}
}
private double evaluateCorrectness(RagResponse response, String groundTruth) {
String prompt = """
Rate the semantic consistency of the following answer with the reference answer (0-10).
Answer: %s
Reference: %s
Output only an integer from 0 to 10.
""".formatted(response.getAnswer(), groundTruth);
ChatResponse llmResponse = chatModel.call(new ChatRequest(
List.of(new Message("user", prompt)),
ChatOptions.builder().withTemperature(0.0).build()
));
try {
return Double.parseDouble(llmResponse.getResult().getOutput().getContent().trim()) / 10.0;
} catch (Exception e) {
return 0.0;
}
}
}
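The relevancy and correctness evaluators above call `Double.parseDouble` on the raw reply and return 0.0 on failure, which means a reply such as "Score: 8/10" is recorded as a zero and drags the average down. A tolerant parser could separate "unparseable" from "genuinely bad". The sketch below is one way to do that with the standard library; `JudgeScoreParser` and `parseScore` are hypothetical names.

```java
import java.util.OptionalDouble;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: turns a judge model's free-text reply into a 0..1 score. */
final class JudgeScoreParser {
    private static final Pattern NUMBER = Pattern.compile("\\d+(\\.\\d+)?");

    private JudgeScoreParser() {}

    /**
     * Takes the first number in the reply, clamps it to [0, maxScore],
     * and scales it to [0, 1]. Returns empty when no number is present
     * so callers can skip the sample instead of counting it as 0.
     */
    static OptionalDouble parseScore(String reply, double maxScore) {
        if (reply == null) {
            return OptionalDouble.empty();
        }
        Matcher matcher = NUMBER.matcher(reply);
        if (!matcher.find()) {
            return OptionalDouble.empty();
        }
        double value = Double.parseDouble(matcher.group());
        double clamped = Math.max(0.0, Math.min(maxScore, value));
        return OptionalDouble.of(clamped / maxScore);
    }
}
```

Because it takes the first number it finds, this sketch would misread a reply that restates the scale before the score (for example "On a 0-10 scale: 7"). Keeping the "Output only an integer" instruction in the prompt remains the main safeguard.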
Production-Grade RAG Architecture and Performance Optimization
Production Architecture Overview
┌──────────────────────────────────────────────────────────────────────┐
│ Production-Grade RAG Architecture │
├──────────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ API Gateway (Spring Cloud Gateway) │ │
│ │ Auth(JWT) │ Rate Limit(Sentinel) │ Cache(Redis) │ Logs │ │
│ └────────────────────────────┬───────────────────────────────┘ │
│ │ │
│ ┌────────────────────────────▼───────────────────────────────┐ │
│ │ RAG Service (Spring Boot) │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Query │ │ Hybrid │ │ Rerank │ │ │
│ │ │ Rewrite │→ │ Search │→ │ Service │ │ │
│ │ │ Service │ │ Service │ │ │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │ │
│ │ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ▼ │ │
│ │ │Evaluation│ │ Cache │ ┌──────────┐ │ │
│ │ │ Service │ │ Service │ │Generate │ │ │
│ │ └──────────┘ └──────────┘ │ Service │ │ │
│ │ └──────────┘ │ │
│ └────────────────────────────┬───────────────────────────────┘ │
│ │ │
│ ┌───────────────────────┼───────────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Qdrant │ │Elastic- │ │ Redis │ │
│ │ Vector DB│ │search │ │ Cache │ │
│ │ │ │ BM25 Idx │ │ Layer │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Document Ingestion Pipeline (Async) │ │
│ │ Upload → Parse → Chunk → Embed → Index → Metadata Store │ │
│ │ (Kafka-driven, supports incremental updates) │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Monitoring & Observability │ │
│ │ Prometheus │ Grafana Dashboards │ Alerts │ Quality Tracing │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────┘
Cache Layer Optimization
@Service
public class RagCacheService {
private final RedisTemplate<String, String> redisTemplate;
private final ObjectMapper objectMapper;
private static final long CACHE_TTL_HOURS = 24;
public Optional<RagResponse> getCachedResponse(String query) {
String cacheKey = generateCacheKey(query);
String cached = redisTemplate.opsForValue().get(cacheKey);
if (cached != null) {
try {
return Optional.of(objectMapper.readValue(cached, RagResponse.class));
} catch (Exception e) {
return Optional.empty();
}
}
return Optional.empty();
}
public void cacheResponse(String query, RagResponse response) {
String cacheKey = generateCacheKey(query);
try {
String json = objectMapper.writeValueAsString(response);
redisTemplate.opsForValue().set(cacheKey, json, CACHE_TTL_HOURS, TimeUnit.HOURS);
} catch (Exception e) {
// Cache failure should not affect the main flow
}
}
private String generateCacheKey(String query) {
return "rag:cache:" + DigestUtils.md5Hex(query);
}
}
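The cache key above hashes the raw query string, so "What is RAG?" and "what is  rag?" miss each other, and a cached answer stays valid for the full TTL even after documents are re-ingested. A minimal sketch of a stricter key follows. It normalizes whitespace and case, and mixes in an index version; `CacheKeys`, `forQuery`, and the `indexVersion` parameter are assumptions for illustration, not part of the original service.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

/** Hypothetical helper for building normalized, versioned cache keys. */
final class CacheKeys {
    private CacheKeys() {}

    /** Trims, collapses runs of whitespace, and lowercases the query. */
    static String normalize(String query) {
        return query.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    /**
     * Builds "rag:cache:<indexVersion>:<sha256 hex>". Bumping indexVersion
     * after re-ingestion makes old entries unreachable without a manual flush.
     */
    static String forQuery(String query, String indexVersion) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(normalize(query).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return "rag:cache:" + indexVersion + ":" + hex;
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

This still only matches queries that are identical after normalization. Matching paraphrases would need a semantic cache keyed on embeddings, which is a separate design.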
Document Ingestion Pipeline
@Service
public class DocumentIngestionPipeline {
private final DocumentParserService parserService;
private final SemanticChunker semanticChunker;
private final MarkdownStructureChunker structureChunker;
private final EmbeddingService embeddingService;
private final QdrantVectorStore vectorStore;
private final DocumentMapper documentMapper;
@Async("ingestionExecutor")
public CompletableFuture<Void> ingestDocument(MultipartFile file, String source) {
String content = parserService.parse(file);
List<DocumentChunk> chunks;
if (isMarkdown(content)) {
chunks = structureChunker.chunk(content, source);
} else {
chunks = semanticChunker.chunk(content, source);
}
List<float[]> embeddings = embeddingService.embedBatch(
chunks.stream().map(DocumentChunk::getContent).collect(Collectors.toList())
);
for (int i = 0; i < chunks.size(); i++) {
chunks.get(i).setEmbedding(embeddings.get(i));
}
vectorStore.upsertDocuments(chunks);
for (DocumentChunk chunk : chunks) {
DocumentEntity entity = new DocumentEntity();
entity.setContent(chunk.getContent());
entity.setSource(chunk.getSource());
entity.setSection(chunk.getSection());
entity.setEmbedding(chunk.getEmbedding());
entity.setSearchText(chunk.getContent());
documentMapper.insert(entity);
}
return CompletableFuture.completedFuture(null);
}
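private boolean isMarkdown(String content) {
// Heuristic: an ATX heading at the start of a line, or a fenced code block.
// Matching "# " anywhere in the text would also fire on prose such as "C# is ...".
return java.util.regex.Pattern.compile("(?m)^#{1,6} ").matcher(content).find()
|| content.contains("```");
}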
}
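The pipeline above passes every chunk of a document to `embedBatch` in one call. Embedding APIs generally cap the number of inputs per request, and the configuration later in this article defines `ingestion.batch-size: 100`, so a large document would likely need to be split first. The sketch below shows one way to partition the chunk list; `Batches` and `partition` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a list into consecutive batches of at most batchSize items. */
final class Batches {
    private Batches() {}

    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }
}
```

In the pipeline, each batch would be embedded in turn and the resulting vectors concatenated in order, so that index `i` in the combined list still corresponds to chunk `i`.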
Performance Optimization Checklist
| Optimization | Method | Impact |
|---|---|---|
| Embedding Cache | Reuse embeddings for identical text | 50%+ fewer API calls |
| Query Cache | Redis cache for repeated queries | P95 latency reduced 60% |
| Batch Embedding | Merge multiple texts in one call | 3-5x throughput increase |
| Async Ingestion | Kafka-driven async document processing | Ingestion doesn't block queries |
| Connection Pool | Qdrant/ES connection pool tuning | 2x concurrency capacity |
| Pre-computed HyDE | Pre-generate HyDE embeddings for hot queries | 40% latency reduction for hot queries |
| Index Tuning | HNSW parameter optimization (M=16, ef=100) | Balance precision and speed |
| Sharding Strategy | Shard by document type/time | 30% faster by narrowing search scope |
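The "Embedding Cache" row can be illustrated with a small wrapper that remembers the vector for each distinct text. This is an in-memory sketch under stated assumptions: `CachingEmbedder` is a hypothetical class, and the `Function<String, float[]>` delegate stands in for whatever embedding client the application uses.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical wrapper: calls the delegate once per distinct text and reuses the result. */
final class CachingEmbedder {
    private final Function<String, float[]> delegate;
    private final Map<String, float[]> cache = new ConcurrentHashMap<>();

    CachingEmbedder(Function<String, float[]> delegate) {
        this.delegate = delegate;
    }

    /** Returns the cached vector for identical text, computing it on first use. */
    float[] embed(String text) {
        return cache.computeIfAbsent(text, delegate);
    }

    /** Number of distinct texts embedded so far. */
    int size() {
        return cache.size();
    }
}
```

The map here is unbounded and returns the same array instance to every caller, which is fine for a sketch but not for production. A real deployment would more likely bound the cache or store vectors in Redis keyed by a content hash.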
Spring Boot Configuration
spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: gpt-4o
          temperature: 0.1
      embedding:
        options:
          model: text-embedding-3-small
rag:
  vector-store:
    type: qdrant
    qdrant:
      host: localhost
      port: 6334
      collection: knowledge_base
      vector-size: 1536
  chunking:
    default-strategy: semantic
    chunk-size: 512
    overlap-size: 64
    similarity-threshold: 0.85
  retrieval:
    hybrid: true
    vector-weight: 0.7
    keyword-weight: 0.3
    rrf-k: 60
    top-k: 10
  cache:
    enabled: true
    ttl-hours: 24
  ingestion:
    async: true
    batch-size: 100
    pool-size: 4
Summary
| Component | 2026 Best Practice | Key Takeaway |
|---|---|---|
| Query Processing | Query rewriting + HyDE | Multi-angle retrieval boosts recall |
| Retrieval Strategy | Hybrid (vector + BM25) + RRF fusion | Vector 70% + Keyword 30% |
| Ranking Optimization | Cross-Encoder reranking | Fine-grained query-doc relevance scoring |
| Document Chunking | Semantic > Structure > Fixed | Chunking is RAG's foundation |
| Agentification | Agentic RAG multi-turn + self-evaluation | Agent decides when to retrieve |
| Multimodal | Unified image/table/code retrieval | Vision embeddings + text descriptions |
| Evaluation | Faithfulness/Relevancy/Correctness | No evaluation, no optimization |
| Productionization | Cache + async + monitoring + tuning | Engineering determines RAG success |
RAG isn't just "retrieve + generate" — it's a systems engineering challenge requiring careful design at every stage. From query rewriting to hybrid retrieval, from document chunking to agentic reasoning — each component determines the quality of the final answer. In 2026, Advanced RAG is the standard, Agentic RAG is the frontier, and evaluation frameworks are the foundation for continuous improvement.