RAG Deep Dive: From Naive RAG to Agentic RAG, Three Generations of Evolution and Enterprise Implementation

Why RAG Is a Must-Have for Enterprise AI

No matter how powerful LLMs get, they suffer from three fundamental flaws. In 2026, if you're still making raw LLM API calls for enterprise Q&A, you've definitely hit these walls:

Flaw	Manifestation	Business Impact
Knowledge Cutoff	Training data is frozen at a specific date	Cannot answer latest policies or product info
Hallucination	Fabricates plausible but non-existent facts	Misleads decisions, legal risk
Private Knowledge Gap	Unfamiliar with internal docs, processes, terminology	General models can't replace domain-specific Q&A

The essence of RAG (Retrieval-Augmented Generation): make the LLM "look up references" before answering. Retrieved passages supply current and private knowledge, and constraining generation to those passages reduces hallucination, although it does not eliminate it.

Data point: In 2026, 92% of enterprise AI deployments adopted RAG architecture; pure prompt approaches account for less than 5%.

┌─────────────────────────────────────────────────────────┐
│         Without RAG vs With RAG: The Fundamental Shift   │
├──────────────────────┬──────────────────────────────────┤
│     Raw LLM Call      │      RAG-Enhanced Call           │
├──────────────────────┼──────────────────────────────────┤
│  User Query → LLM →  │  User Query → Retrieve → Inject  │
│  Answer (from memory)│  Context → LLM → Answer+Citations│
│                      │  (answer after research)          │
├──────────────────────┼──────────────────────────────────┤
│  Knowledge: cutoff   │  Knowledge: real-time, updatable  │
│  Accuracy: uncontrol │  Accuracy: retrieval-constrained  │
│  Traceability: none  │  Traceability: every claim cited  │
└──────────────────────┴──────────────────────────────────┘

Naive RAG → Advanced RAG → Agentic RAG: Three Generations of Evolution

RAG isn't static. From 2023 to 2026, RAG architecture has evolved through three generations:

┌───────────────────────────────────────────────────────────────┐
│                  RAG Three-Generation Evolution Roadmap        │
├───────────────────┬───────────────────┬───────────────────────┤
│   Naive RAG       │  Advanced RAG     │   Agentic RAG         │
│   (2023)          │  (2024-2025)      │   (2026)              │
├───────────────────┼───────────────────┼───────────────────────┤
│ Query → Retrieve  │ Query Rewrite →   │ Agent autonomous      │
│ → Generate        │ Hybrid Retrieve → │ decision-making →     │
│                   │ Rerank → Generate │ Multi-turn + self-eval│
├───────────────────┼───────────────────┼───────────────────────┤
│ Problem: poor     │ Problem: lacks    │ Problem: high         │
│ retrieval quality │ autonomy          │ complexity            │
│ Hallucination 30%+│ Hallucination     │ Hallucination <5%     │
│ Uncontrollable    │ 10-15%            │ Proactive reasoning + │
│                   │ Controllable but  │ self-correction       │
│                   │ passive           │                       │
└───────────────────┴───────────────────┴───────────────────────┘

Three Generations Compared

Dimension	Naive RAG	Advanced RAG	Agentic RAG
Query Processing	Raw query directly	Query rewrite/expansion/HyDE	Agent decomposes sub-questions
Retrieval Strategy	Single vector search	Hybrid retrieval + reranking	Multi-turn retrieval + tool calls
Generation Strategy	Concatenate context	Curated context + citations	Self-evaluation + iterative refinement
Typical Hallucination Rate	30-40%	10-15%	<5%
End-to-End Latency	1-2s	2-4s	5-15s
Use Case	Demo/Prototype	Production	Complex reasoning
Implementation Complexity	Low	Medium	High

Embedding Model Selection and Evaluation

Embeddings are RAG's "eyes" — choose the wrong model and retrieval quality tanks instantly. 2026 mainstream embedding model comparison:

Model	Dimensions	MTEB Score	Chinese MTEB	Price/1M tokens	Deployment
text-embedding-3-large	3072	68.4	64.2	$0.13	API
text-embedding-3-small	1536	62.3	58.7	$0.02	API
bge-m3	1024	65.8	70.1	Free	Local/API
gte-Qwen2-1.5B	1536	67.2	72.5	Free	Local/API
gte-Qwen2-7B	3584	70.1	75.8	Free	Local (GPU required)
Cohere embed-v4	1024	66.1	61.3	$0.10	API
Voyage-3	1024	67.8	63.9	$0.12	API

Selection Guide

Scenario	Recommended Model	Rationale
Chinese-dominant enterprise KB	gte-Qwen2-1.5B	Second-highest Chinese MTEB in the table (72.5), free local deployment without the 7B model's GPU requirement
Multilingual mix	bge-m3	Native multilingual, 1024-dim cost-effective
Maximum quality	gte-Qwen2-7B	7B parameters, best results, needs GPU
Quick launch, no ops	text-embedding-3-small	API call, $0.02/1M tokens
Budget available, stability first	text-embedding-3-large	OpenAI ecosystem, 3072-dim high precision
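
Dimensions and price per token translate directly into storage and budget, so it can help to run the arithmetic before committing to a model. The sketch below assumes raw float32 storage (4 bytes per dimension, no index overhead or compression) and a hypothetical corpus of one million chunks averaging 500 tokens each; the class and method names are illustrative only.

```java
final class EmbeddingBudget {

    /** Raw float32 storage in bytes: 4 bytes per dimension per vector, index overhead excluded. */
    static long rawVectorBytes(long vectorCount, int dimensions) {
        return vectorCount * dimensions * 4L;
    }

    /** One-off API cost in USD to embed a corpus, given a price per one million tokens. */
    static double embeddingCostUsd(long totalTokens, double pricePerMillionTokens) {
        return totalTokens / 1_000_000.0 * pricePerMillionTokens;
    }

    public static void main(String[] args) {
        long chunks = 1_000_000L;      // hypothetical corpus size
        long tokens = chunks * 500L;   // hypothetical average of 500 tokens per chunk

        System.out.println("1024-dim bytes: " + rawVectorBytes(chunks, 1024));
        System.out.println("1536-dim bytes: " + rawVectorBytes(chunks, 1536));
        System.out.println("3072-dim bytes: " + rawVectorBytes(chunks, 3072));
        System.out.println("small model USD: " + embeddingCostUsd(tokens, 0.02));
        System.out.println("large model USD: " + embeddingCostUsd(tokens, 0.13));
    }
}
```

Under these assumptions a 1536-dimension model needs about 6.1 GB of raw vector storage and a 3072-dimension model about 12.3 GB, while the one-off embedding cost is roughly $10 at $0.02/1M tokens versus roughly $65 at $0.13/1M tokens.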

Java Embedding Service Implementation

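import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.springframework.http.ResponseEntity;
import org.springframework.web.client.RestTemplate;

// Thin client for an OpenAI-compatible /embeddings endpoint.
// Authentication headers are omitted here; add them for a hosted API.
public class EmbeddingService {

    private final RestTemplate restTemplate;
    private final String embeddingApiUrl;
    private final String embeddingModel;

    // The chat model dependency was unused by this class, so it is not injected here.
    public EmbeddingService(String apiUrl, String model) {
        this.restTemplate = new RestTemplate();
        this.embeddingApiUrl = apiUrl;
        this.embeddingModel = model;
    }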

    public float[] embed(String text) {
        Map<String, Object> request = Map.of(
            "model", embeddingModel,
            "input", text
        );

        ResponseEntity<Map> response = restTemplate.postForEntity(
            embeddingApiUrl + "/embeddings",
            request,
            Map.class
        );

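        // Read elements as Number: Jackson maps whole-number JSON values to Integer, which would fail a Double cast.
        List<Number> embedding = (List<Number>) ((Map) ((List<?>) response.getBody().get("data")).get(0)).get("embedding");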

        float[] result = new float[embedding.size()];
        for (int i = 0; i < embedding.size(); i++) {
            result[i] = embedding.get(i).floatValue();
        }
        return result;
    }

    public List<float[]> embedBatch(List<String> texts) {
        Map<String, Object> request = Map.of(
            "model", embeddingModel,
            "input", texts
        );

        ResponseEntity<Map> response = restTemplate.postForEntity(
            embeddingApiUrl + "/embeddings",
            request,
            Map.class
        );

        List<?> dataList = (List<?>) response.getBody().get("data");
        return dataList.stream()
            .map(item -> {
                // Read elements as Number: Jackson maps whole-number JSON values to Integer.
                List<Number> embedding = (List<Number>) ((Map) item).get("embedding");
                float[] arr = new float[embedding.size()];
                for (int i = 0; i < embedding.size(); i++) {
                    arr[i] = embedding.get(i).floatValue();
                }
                return arr;
            })
            .collect(Collectors.toList());
    }

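    // Cosine similarity of two equal-length vectors; returns 0 when either vector has zero norm.
    public static float cosineSimilarity(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vector dimensions differ: " + a.length + " vs " + b.length);
        }
        // Accumulate in double to limit rounding error on high-dimensional vectors.
        double dotProduct = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dotProduct += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0f; // Avoid dividing by zero, which would yield NaN.
        }
        return (float) (dotProduct / (Math.sqrt(normA) * Math.sqrt(normB)));
    }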
}
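
To make the role of cosine similarity concrete, here is a minimal in-memory sketch of the "single vector search" that Naive RAG relies on: score every stored vector against the query and keep the best k. It is a brute-force illustration with hypothetical names, not how a vector database does it (those use approximate indexes such as HNSW).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class BruteForceSearch {

    /** Cosine similarity in double precision; 0 when either vector has zero norm. */
    static double cosine(float[] a, float[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the indices of the k corpus vectors most similar to the query, best first. */
    static List<Integer> topK(float[] query, List<float[]> corpus, int k) {
        double[] scores = new double[corpus.size()];
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < corpus.size(); i++) {
            scores[i] = cosine(query, corpus.get(i));
            indices.add(i);
        }
        // Sort indices by descending similarity.
        indices.sort(Comparator.comparingDouble((Integer i) -> scores[i]).reversed());
        int limit = Math.max(0, Math.min(k, indices.size()));
        return new ArrayList<>(indices.subList(0, limit));
    }
}
```

This scan is linear in the corpus size, which is fine for tests and small prototypes; the vector databases compared in the next section exist to avoid it at scale.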

Vector Database Comparison: PGVector vs Milvus vs Qdrant

In 2026, the question isn't "do I need a vector database?" but "which one should I choose?" Deep comparison of three mainstream solutions:

Dimension	PGVector	Milvus	Qdrant
Underlying	PostgreSQL extension	Standalone distributed system	Standalone Rust service
Max Vectors	Tens of millions	Tens of billions	Billions
Query Latency (1M vectors)	25-40ms	15-25ms	10-18ms
Hybrid Search	Needs tsvector	Native support	Native support
Distributed	Relies on PG logical replication	Native distributed	Shard support
Ops Complexity	Low (reuses PG)	High	Medium
Transaction Support	ACID	Limited	Limited
Filtering	Full SQL	Scalar filtering	Payload filtering
Ecosystem	Spring Data JPA	Java SDK	Java SDK
Use Case	Existing PG, small-medium scale	Ultra-large scale, high concurrency	High performance, medium scale

Architecture Comparison

┌────────────────────────────────────────────────────────────────┐
│                     PGVector Architecture                      │
├────────────────────────────────────────────────────────────────┤
│  Application ──→ PostgreSQL (pgvector extension)               │
│                    ├── Vector Index (HNSW/IVFFlat)             │
│                    ├── Full-Text Search (tsvector)             │
│                    ├── Relational Data (row storage)           │
│                    └── Transactions/ACID                       │
│  Pros: Zero extra ops, full SQL capabilities                   │
│  Cons: Performance limited at scale, weak distribution         │
├────────────────────────────────────────────────────────────────┤
│                     Milvus Architecture                        │
├────────────────────────────────────────────────────────────────┤
│  Application ──→ Proxy ──→ Coordinator                        │
│                              ├── Query Node (retrieval)        │
│                              ├── Data Node (writes)            │
│                              └── Index Node (index building)   │
│  Storage: MinIO/S3 + etcd                                      │
│  Pros: Billion-scale, native distributed, cloud-native         │
│  Cons: Complex ops, high resource usage                        │
├────────────────────────────────────────────────────────────────┤
│                     Qdrant Architecture                        │
├────────────────────────────────────────────────────────────────┤
│  Application ──→ Qdrant Service (Rust)                        │
│                    ├── HNSW Index                              │
│                    ├── Payload Filtering                       │
│                    ├── WAL Persistence                         │
│                    └── Sharded Cluster                         │
│  Pros: Rust high performance, low latency, clean API           │
│  Cons: Not as scalable as Milvus at extreme scale              │
└────────────────────────────────────────────────────────────────┘

Spring Boot Qdrant Integration

@Configuration
public class QdrantConfig {

    @Bean
    public QdrantClient qdrantClient() {
        return new QdrantClient(
            QdrantGrpcClient.newBuilder("localhost", 6334, false).build()
        );
    }
}

@Service
public class QdrantVectorStore {

    private final QdrantClient qdrantClient;
    private final EmbeddingService embeddingService;

    private static final String COLLECTION_NAME = "knowledge_base";
    private static final int VECTOR_SIZE = 1536;

    public QdrantVectorStore(QdrantClient qdrantClient, EmbeddingService embeddingService) {
        this.qdrantClient = qdrantClient;
        this.embeddingService = embeddingService;
    }

    public void createCollection() throws ExecutionException, InterruptedException {
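        // The request message is CreateCollection; CollectionInfo is the type Qdrant returns when describing a collection.
        qdrantClient.createCollectionAsync(
            CreateCollection.newBuilder()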
                .setCollectionName(COLLECTION_NAME)
                .setVectorsConfig(VectorsConfig.newBuilder()
                    .setParams(VectorParams.newBuilder()
                        .setSize(VECTOR_SIZE)
                        .setDistance(Distance.Cosine)
                        .build())
                    .build())
                .setOptimizersConfig(OptimizersConfigDiff.newBuilder()
                    .setIndexingThreshold(20000)
                    .build())
                .setHnswConfig(HnswConfigDiff.newBuilder()
                    .setM(16)
                    .setEfConstruct(100)
                    .build())
                .build()
        ).get();
    }

    public void upsertDocuments(List<DocumentChunk> chunks) throws ExecutionException, InterruptedException {
        List<float[]> embeddings = embeddingService.embedBatch(
            chunks.stream().map(DocumentChunk::getContent).collect(Collectors.toList())
        );

        List<PointStruct> points = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i++) {
            DocumentChunk chunk = chunks.get(i);
            points.add(PointStruct.newBuilder()
                .setId(PointId.newBuilder().setUuid(UUID.randomUUID().toString()).build())
                // VectorsFactory (io.qdrant.client) wraps a float[] as the point's dense vector.
                .setVectors(VectorsFactory.vectors(embeddings.get(i)))
                .putAllPayload(Map.of(
                    "content", Value.newBuilder().setStringValue(chunk.getContent()).build(),
                    "source", Value.newBuilder().setStringValue(chunk.getSource()).build(),
                    "section", Value.newBuilder().setStringValue(chunk.getSection()).build()
                ))
                .build());
        }

        qdrantClient.upsertAsync(COLLECTION_NAME, points).get();
    }

    public List<SearchResult> search(String query, int topK) throws ExecutionException, InterruptedException {
        float[] queryVector = embeddingService.embed(query);

        List<ScoredPoint> results = qdrantClient.searchAsync(
            SearchPoints.newBuilder()
                .setCollectionName(COLLECTION_NAME)
                // SearchPoints carries the query vector as a repeated float field.
                .addAllVector(toFloatList(queryVector))
                .setLimit(topK)
                // Payload selection takes a selector message rather than a boolean.
                .setWithPayload(WithPayloadSelectorFactory.enable(true))
                .build()
        ).get();

        return results.stream()
            .map(point -> new SearchResult(
                point.getPayloadMap().get("content").getStringValue(),
                point.getPayloadMap().get("source").getStringValue(),
                point.getScore()
            ))
            .collect(Collectors.toList());
    }

    private List<Float> toFloatList(float[] arr) {
        List<Float> list = new ArrayList<>(arr.length);
        for (float v : arr) {
            list.add(v);
        }
        return list;
    }
}
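
The listings in this article rely on a few small data types (DocumentChunk, SearchResult, ScoredResult, Section, RagResponse) that are never defined. The shapes below are inferred from how those types are constructed and read in the listings, so treat them as assumptions rather than the author's actual classes.

```java
import java.util.List;

/** A retrieved passage plus its relevance score (constructed as content, source, score). */
final class SearchResult {
    private final String content;
    private final String source;
    private final double score;

    SearchResult(String content, String source, double score) {
        this.content = content;
        this.source = source;
        this.score = score;
    }

    String getContent() { return content; }
    String getSource() { return source; }
    double getScore() { return score; }
}

/** One chunk of a source document (constructed as content, source, section label, index or level). */
final class DocumentChunk {
    private final String content;
    private final String source;
    private final String section;
    private final int index;

    DocumentChunk(String content, String source, String section, int index) {
        this.content = content;
        this.source = source;
        this.section = section;
        this.index = index;
    }

    String getContent() { return content; }
    String getSource() { return source; }
    String getSection() { return section; }
    int getIndex() { return index; }
}

/** Reranker output; read through record accessors content(), source(), score(). */
record ScoredResult(String content, String source, double score) {}

/** A Markdown section; read through heading(), content(), level(). */
record Section(String heading, String content, int level) {}

/** Final pipeline output: the generated answer and the passages it was grounded on. */
record RagResponse(String answer, List<SearchResult> sources) {}
```

Records require Java 16 or later, which is consistent with the text blocks and `formatted` calls used elsewhere in the listings.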

Spring Boot PGVector Integration

@Entity
@Table(name = "documents")
public class DocumentEntity {

    @Id
    @GeneratedValue(strategy = GenerationType.UUID)
    private UUID id;

    private String content;

    private String source;

    private String section;

    @Column(columnDefinition = "vector(1536)")
    private float[] embedding;

    @Column(columnDefinition = "tsvector")
    private String searchText;
}

@Mapper
public interface DocumentMapper extends BaseMapper<DocumentEntity> {

    @Select("SELECT *, embedding <=> #{embedding} AS distance " +
            "FROM documents " +
            "WHERE embedding <=> #{embedding} < #{threshold} " +
            "ORDER BY embedding <=> #{embedding} " +
            "LIMIT #{limit}")
    List<DocumentEntity> vectorSearch(@Param("embedding") float[] embedding,
                                       @Param("threshold") float threshold,
                                       @Param("limit") int limit);

    @Select("SELECT *, ts_rank(search_text, plainto_tsquery(#{query})) AS rank " +
            "FROM documents " +
            "WHERE search_text @@ plainto_tsquery(#{query}) " +
            "ORDER BY rank DESC " +
            "LIMIT #{limit}")
    List<DocumentEntity> fullTextSearch(@Param("query") String query,
                                         @Param("limit") int limit);
}
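
The mapper above binds a Java `float[]` directly to `#{embedding}`, which only works if a type handler knows how to send it as a pgvector value. One common workaround, sketched here as an assumption rather than the author's setup, is to format the array in pgvector's text form (`[0.1,0.2,0.3]`), bind it as a string, and cast it in SQL, for example `ORDER BY embedding <=> CAST(#{embedding} AS vector)`. The helper name is hypothetical.

```java
final class PgVectorLiteral {

    /** Formats a float array in pgvector's text input form, e.g. [0.5,-1.0,2.0]. */
    static String toLiteral(float[] vector) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < vector.length; i++) {
            if (!Float.isFinite(vector[i])) {
                // pgvector rejects NaN and infinite components, so fail early.
                throw new IllegalArgumentException("Non-finite value at index " + i);
            }
            if (i > 0) {
                sb.append(',');
            }
            sb.append(vector[i]); // Float.toString output is locale-independent.
        }
        return sb.append(']').toString();
    }
}
```

With this helper the mapper parameter becomes a `String`, and the `<=>` operator still computes cosine distance because the cast happens inside PostgreSQL.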

Advanced RAG in Practice: Query Rewriting + Hybrid Retrieval + Reranking

This is the standard architecture for production RAG in 2026. Single vector retrieval is no longer sufficient — hybrid retrieval + reranking is the way to go.

┌──────────────────────────────────────────────────────────────────┐
│                 Advanced RAG Complete Pipeline                    │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  User Query: "What drove the company's Q4 2025 revenue growth?"  │
│       │                                                          │
│       ▼                                                          │
│  ┌─────────────────────────────────┐                            │
│  │  1. Query Rewriting              │                            │
│  │  Original → 3 rewrites + HyDE    │                            │
│  └──────────────┬──────────────────┘                            │
│                 │                                                │
│       ┌─────────┴─────────┐                                     │
│       ▼                   ▼                                      │
│  ┌──────────┐      ┌──────────┐                                  │
│  │  Vector  │      │  BM25    │                                  │
│  │  Search  │      │  Search  │                                  │
│  │(Semantic)│      │(Keyword) │                                  │
│  └────┬─────┘      └────┬─────┘                                  │
│       │                  │                                        │
│       └────────┬─────────┘                                        │
│                ▼                                                  │
│  ┌─────────────────────────────────┐                            │
│  │  2. RRF Fusion (Reciprocal Rank │                            │
│  │     Fusion)                      │                            │
│  │  Vector 0.7 + Keyword 0.3        │                            │
│  └──────────────┬──────────────────┘                            │
│                 │                                                │
│                 ▼                                                │
│  ┌─────────────────────────────────┐                            │
│  │  3. Cross-Encoder Reranking      │                            │
│  │  Fine-grained query-doc scoring  │                            │
│  └──────────────┬──────────────────┘                            │
│                 │                                                │
│                 ▼                                                │
│  ┌─────────────────────────────────┐                            │
│  │  4. Context Injection + LLM Gen  │                            │
│  │  Precise answer with citations   │                            │
│  └─────────────────────────────────┘                            │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Query Rewriting Implementation

@Service
public class QueryRewriteService {

    private final OpenAiChatModel chatModel;

    public QueryRewriteService(OpenAiChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public List<String> rewriteQuery(String originalQuery) {
        String prompt = """
            Rewrite the following user query into 3 different search queries from different angles to improve retrieval recall.
            Requirements:
            1. Preserve the core intent of the original query
            2. Express the same need from different perspectives
            3. Add possible technical terms or synonyms
            4. Output JSON format: {"rewrites": ["query1", "query2", "query3"]}

            Original query: %s
            """.formatted(originalQuery);

        ChatResponse response = chatModel.call(new ChatRequest(
            List.of(new Message("user", prompt)),
            ChatOptions.builder().withTemperature(0.3).build()
        ));

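        String content = response.getResult().getOutput().getContent();
        try {
            Map<String, Object> result = new ObjectMapper().readValue(content, Map.class);
            Object rewrites = result.get("rewrites");
            // A missing or malformed field yields no rewrites; the caller still has the original query.
            return rewrites instanceof List<?> ? (List<String>) rewrites : List.of();
        } catch (JsonProcessingException e) {
            // readValue throws a checked exception when the model reply is not valid JSON.
            return List.of();
        }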
    }

    public String generateHyde(String query) {
        String prompt = """
            Provide a detailed answer to the following question, even if you're uncertain.
            This answer will be used to retrieve relevant documents, so include as many
            relevant details and technical terms as possible.

            Question: %s
            """.formatted(query);

        ChatResponse response = chatModel.call(new ChatRequest(
            List.of(new Message("user", prompt)),
            ChatOptions.builder().withTemperature(0.5).build()
        ));

        return response.getResult().getOutput().getContent();
    }
}
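
Models often wrap JSON in Markdown fences or add a sentence before it, even when the prompt asks for JSON only. A small pre-processing step can make the parsing in rewriteQuery more tolerant. The helper below is a hypothetical sketch: it uses a simple first-brace to last-brace heuristic and does not validate the JSON itself.

```java
final class LlmJson {

    /**
     * Returns the substring from the first '{' to the last '}' of the model reply,
     * or null when no such span exists. Surrounding prose and code fences are dropped.
     */
    static String extractJsonObject(String llmOutput) {
        if (llmOutput == null) {
            return null;
        }
        int start = llmOutput.indexOf('{');
        int end = llmOutput.lastIndexOf('}');
        if (start < 0 || end <= start) {
            return null;
        }
        return llmOutput.substring(start, end + 1);
    }
}
```

The extracted string can then be passed to the JSON parser; a null result can be handled the same way as a parse failure.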

Hybrid Retrieval + RRF Fusion

@Service
public class HybridRetrievalService {

    private final QdrantVectorStore vectorStore;
    private final DocumentMapper documentMapper;
    private final EmbeddingService embeddingService;

    private static final double VECTOR_WEIGHT = 0.7;
    private static final double KEYWORD_WEIGHT = 0.3;
    private static final int RRF_K = 60;

    public HybridRetrievalService(QdrantVectorStore vectorStore,
                                   DocumentMapper documentMapper,
                                   EmbeddingService embeddingService) {
        this.vectorStore = vectorStore;
        this.documentMapper = documentMapper;
        this.embeddingService = embeddingService;
    }

    public List<SearchResult> hybridSearch(String query, int topK) {
        List<SearchResult> vectorResults = vectorSearch(query, topK * 2);
        List<SearchResult> keywordResults = keywordSearch(query, topK * 2);

        List<SearchResult> fused = reciprocalRankFusion(
            List.of(vectorResults, keywordResults),
            List.of(VECTOR_WEIGHT, KEYWORD_WEIGHT)
        );

        return fused.stream().limit(topK).collect(Collectors.toList());
    }

    private List<SearchResult> vectorSearch(String query, int limit) {
        try {
            return vectorStore.search(query, limit);
        } catch (Exception e) {
            return Collections.emptyList();
        }
    }

    private List<SearchResult> keywordSearch(String query, int limit) {
        List<DocumentEntity> results = documentMapper.fullTextSearch(query, limit);
        return results.stream()
            .map(doc -> new SearchResult(doc.getContent(), doc.getSource(), 0.0))
            .collect(Collectors.toList());
    }

    private List<SearchResult> reciprocalRankFusion(
            List<List<SearchResult>> resultSets,
            List<Double> weights) {

        Map<String, Double> scoreMap = new HashMap<>();
        Map<String, SearchResult> resultMap = new HashMap<>();

        for (int setIndex = 0; setIndex < resultSets.size(); setIndex++) {
            List<SearchResult> results = resultSets.get(setIndex);
            double weight = weights.get(setIndex);

            for (int rank = 0; rank < results.size(); rank++) {
                String key = results.get(rank).getContent();
                double score = weight / (rank + 1 + RRF_K);
                scoreMap.merge(key, score, Double::sum);
                resultMap.putIfAbsent(key, results.get(rank));
            }
        }

        return scoreMap.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .map(entry -> {
                SearchResult result = resultMap.get(entry.getKey());
                return new SearchResult(result.getContent(), result.getSource(), entry.getValue());
            })
            .collect(Collectors.toList());
    }
}
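
A worked example may make the fusion arithmetic easier to follow. The sketch below applies the same weighted formula, weight / (k + rank) with 1-based ranks and k = 60, to two short rankings of hypothetical document IDs instead of SearchResult objects.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class RrfDemo {

    static final int K = 60;

    /** Sums weight / (K + rank) over every ranking in which a document appears (rank is 1-based). */
    static Map<String, Double> fuse(List<List<String>> rankings, List<Double> weights) {
        Map<String, Double> scores = new HashMap<>();
        for (int s = 0; s < rankings.size(); s++) {
            List<String> ranking = rankings.get(s);
            for (int pos = 0; pos < ranking.size(); pos++) {
                double contribution = weights.get(s) / (K + pos + 1);
                scores.merge(ranking.get(pos), contribution, Double::sum);
            }
        }
        return scores;
    }

    /** Document IDs ordered by fused score, highest first. */
    static List<String> order(Map<String, Double> scores) {
        return scores.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Double> scores = fuse(
            List.of(List.of("A", "B", "C"),   // vector ranking, weight 0.7
                    List.of("C", "A", "D")),  // keyword ranking, weight 0.3
            List.of(0.7, 0.3));
        System.out.println(order(scores)); // prints [A, C, B, D]
    }
}
```

Document A scores 0.7/61 + 0.3/62 ≈ 0.0163 and C scores 0.7/63 + 0.3/61 ≈ 0.0160, so documents found by both retrievers outrank B (0.7/62 ≈ 0.0113), which only the vector search returned.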

Cross-Encoder Reranking

@Service
public class RerankService {

    private final OpenAiChatModel chatModel;

    public RerankService(OpenAiChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public List<SearchResult> rerank(String query, List<SearchResult> candidates, int topK) {
        List<CompletableFuture<ScoredResult>> futures = candidates.stream()
            .map(candidate -> CompletableFuture.supplyAsync(() -> scoreRelevance(query, candidate)))
            .collect(Collectors.toList());

        List<ScoredResult> scored = futures.stream()
            .map(CompletableFuture::join)
            .sorted(Comparator.comparingDouble(ScoredResult::score).reversed())
            .limit(topK)
            .collect(Collectors.toList());

        return scored.stream()
            .map(sr -> new SearchResult(sr.content(), sr.source(), sr.score()))
            .collect(Collectors.toList());
    }

    private ScoredResult scoreRelevance(String query, SearchResult candidate) {
        String prompt = """
            Rate the relevance of the following document to the query. Output only an integer from 0 to 10.

            Query: %s
            Document: %s

            Score:
            """.formatted(query, candidate.getContent());

        ChatResponse response = chatModel.call(new ChatRequest(
            List.of(new Message("user", prompt)),
            ChatOptions.builder().withTemperature(0.0).build()
        ));

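        // Note: this scores with an LLM prompt, not a dedicated cross-encoder model. It needs no extra
        // model deployment, but it costs one chat call per candidate.
        double score;
        try {
            score = Double.parseDouble(response.getResult().getOutput().getContent().trim());
        } catch (NumberFormatException e) {
            score = 0.0; // Unparseable reply: treat the candidate as irrelevant instead of failing the request.
        }
        score = Math.max(0.0, Math.min(10.0, score)); // Clamp to the 0-10 range the prompt asks for.
        return new ScoredResult(candidate.getContent(), candidate.getSource(), score / 10.0);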
    }
}

Complete Advanced RAG Pipeline

@Service
public class AdvancedRagPipeline {

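    private final QueryRewriteService queryRewriteService;
    private final HybridRetrievalService hybridRetrievalService;
    private final RerankService rerankService;
    private final EmbeddingService embeddingService;
    private final OpenAiChatModel chatModel;

    // Final fields must be assigned, so the dependencies are injected through the constructor.
    public AdvancedRagPipeline(QueryRewriteService queryRewriteService,
                               HybridRetrievalService hybridRetrievalService,
                               RerankService rerankService,
                               EmbeddingService embeddingService,
                               OpenAiChatModel chatModel) {
        this.queryRewriteService = queryRewriteService;
        this.hybridRetrievalService = hybridRetrievalService;
        this.rerankService = rerankService;
        this.embeddingService = embeddingService;
        this.chatModel = chatModel;
    }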

    public RagResponse query(String userQuery) {
        List<String> allQueries = new ArrayList<>();
        allQueries.add(userQuery);
        allQueries.addAll(queryRewriteService.rewriteQuery(userQuery));

        String hydeAnswer = queryRewriteService.generateHyde(userQuery);
        allQueries.add(hydeAnswer);

        List<SearchResult> allResults = new ArrayList<>();
        for (String q : allQueries) {
            allResults.addAll(hybridRetrievalService.hybridSearch(q, 10));
        }

        List<SearchResult> deduplicated = deduplicate(allResults);
        List<SearchResult> reranked = rerankService.rerank(userQuery, deduplicated, 5);

        String context = buildContext(reranked);
        String answer = generateAnswer(userQuery, context);

        return new RagResponse(answer, reranked);
    }

    private List<SearchResult> deduplicate(List<SearchResult> results) {
        Map<String, SearchResult> uniqueMap = new LinkedHashMap<>();
        for (SearchResult result : results) {
            uniqueMap.putIfAbsent(result.getContent(), result);
        }
        return new ArrayList<>(uniqueMap.values());
    }

    private String buildContext(List<SearchResult> results) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < results.size(); i++) {
            SearchResult r = results.get(i);
            sb.append("[%d] Source: %s\n%s\n\n".formatted(i + 1, r.getSource(), r.getContent()));
        }
        return sb.toString();
    }

    private String generateAnswer(String query, String context) {
        String systemPrompt = """
            You are a knowledge base Q&A assistant. Answer the user's question based on the retrieved documents below.

            Rules:
            1. Answer only based on the retrieved documents — do not fabricate information
            2. Every claim must cite its source [1][2]...
            3. If the retrieved results are insufficient, state this clearly
            4. Prioritize the most recent and relevant information

            Retrieved Documents:
            %s
            """.formatted(context);

        ChatResponse response = chatModel.call(new ChatRequest(
            List.of(
                new Message("system", systemPrompt),
                new Message("user", query)
            ),
            ChatOptions.builder().withTemperature(0.1).build()
        ));

        return response.getResult().getOutput().getContent();
    }
}
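
The system prompt asks the model to cite every claim as [1], [2], and so on, but nothing in the pipeline checks that it did. A lightweight post-check, sketched here with hypothetical names, can flag answers that cite nothing or that cite a document number that was never supplied.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CitationCheck {

    // Matches bracketed citation markers such as [1] or [12].
    private static final Pattern CITATION = Pattern.compile("\\[(\\d{1,4})\\]");

    /** Returns cited numbers that fall outside 1..sourceCount, in ascending order. */
    static Set<Integer> outOfRangeCitations(String answer, int sourceCount) {
        Set<Integer> invalid = new TreeSet<>();
        Matcher matcher = CITATION.matcher(answer);
        while (matcher.find()) {
            int n = Integer.parseInt(matcher.group(1));
            if (n < 1 || n > sourceCount) {
                invalid.add(n);
            }
        }
        return invalid;
    }

    /** True when the answer contains at least one citation marker. */
    static boolean hasCitation(String answer) {
        return CITATION.matcher(answer).find();
    }
}
```

This only verifies that the markers are well-formed and in range; it does not verify that the cited passage actually supports the claim.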

Document Chunking Strategies

Chunking is the foundation of RAG — poor chunking makes even the best retrieval useless. Deep comparison of four mainstream strategies:

Strategy	Principle	Pros	Cons	Use Case	Granularity
Fixed Length	Split by token/character count	Simple, controllable	Breaks semantic integrity	Logs, tabular data	Fixed chunk_size
Semantic Chunking	Embedding similarity breakpoint detection	Good semantic integrity	High compute cost	Technical docs, papers	Adaptive
Structure Chunking	Split by headings/sections/paragraphs	Preserves document structure	Requires parser support	Markdown/HTML/PDF	By structure hierarchy
Topic Chunking	LLM identifies topic boundaries	Highly cohesive topics	High LLM call cost	Long docs, multi-topic	By topic

Fixed Length Chunking (Java)

public class FixedLengthChunker {

    private final int chunkSize;
    private final int overlapSize;

    public FixedLengthChunker(int chunkSize, int overlapSize) {
        this.chunkSize = chunkSize;
        this.overlapSize = overlapSize;
    }

    public List<DocumentChunk> chunk(String content, String source) {
        List<DocumentChunk> chunks = new ArrayList<>();
        int start = 0;
        int index = 0;

        while (start < content.length()) {
            int end = Math.min(start + chunkSize, content.length());
            String text = content.substring(start, end);

            if (end < content.length()) {
                int lastPeriod = text.lastIndexOf('.');
                int lastNewline = text.lastIndexOf('\n');
                int breakPoint = Math.max(lastPeriod, lastNewline);
                if (breakPoint > chunkSize / 2) {
                    text = text.substring(0, breakPoint + 1);
                    end = start + breakPoint + 1;
                }
            }

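            chunks.add(new DocumentChunk(text, source, "chunk-" + index, index));
            index++;
            if (end >= content.length()) {
                break; // Last chunk emitted; stepping back by the overlap here would loop forever.
            }
            // Step back by the overlap, but always advance by at least one character.
            start = Math.max(end - overlapSize, start + 1);
        }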

        return chunks;
    }
}
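
The interaction between chunk size and overlap is easiest to see from the window offsets alone. The sketch below computes where each chunk would start when no sentence-boundary adjustment is applied; the class and method names are hypothetical, and it assumes 0 <= overlap < chunkSize.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkWindows {

    /** Start offsets of fixed-length windows over a text of the given length. */
    static List<Integer> startOffsets(int length, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("Require chunkSize > 0 and 0 <= overlap < chunkSize");
        }
        List<Integer> starts = new ArrayList<>();
        int start = 0;
        while (start < length) {
            starts.add(start);
            int end = Math.min(start + chunkSize, length);
            if (end == length) {
                break; // The final window reached the end of the text.
            }
            start = end - overlap; // Each new window re-reads the last `overlap` characters.
        }
        return starts;
    }
}
```

For a 1000-character text with chunkSize 400 and overlap 50, the windows start at 0, 350, and 700, so the effective stride is chunkSize minus overlap.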

Semantic Chunking (Java)

@Service
public class SemanticChunker {

    private final EmbeddingService embeddingService;

    private static final double SIMILARITY_THRESHOLD = 0.85;

    public SemanticChunker(EmbeddingService embeddingService) {
        this.embeddingService = embeddingService;
    }

    public List<DocumentChunk> chunk(String content, String source) {
        List<String> sentences = splitSentences(content);
        if (sentences.isEmpty()) {
            return Collections.emptyList();
        }

        List<float[]> embeddings = embeddingService.embedBatch(sentences);

        List<DocumentChunk> chunks = new ArrayList<>();
        StringBuilder currentChunk = new StringBuilder(sentences.get(0));
        int chunkIndex = 0;

        for (int i = 1; i < sentences.size(); i++) {
            float similarity = EmbeddingService.cosineSimilarity(
                embeddings.get(i - 1), embeddings.get(i)
            );

            if (similarity >= SIMILARITY_THRESHOLD) {
                // Sentences were trimmed when split, so restore a separating space.
                currentChunk.append(' ').append(sentences.get(i));
            } else {
                chunks.add(new DocumentChunk(
                    currentChunk.toString(), source, "chunk-" + chunkIndex, chunkIndex
                ));
                currentChunk = new StringBuilder(sentences.get(i));
                chunkIndex++;
            }
        }

        if (!currentChunk.isEmpty()) {
            chunks.add(new DocumentChunk(
                currentChunk.toString(), source, "chunk-" + chunkIndex, chunkIndex
            ));
        }

        return chunks;
    }

    private List<String> splitSentences(String text) {
        return Arrays.stream(text.split("(?<=[.!?])"))
            .map(String::trim)
            .filter(s -> !s.isEmpty())
            .collect(Collectors.toList());
    }
}
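
The breakpoint rule can be tried without calling an embedding API by feeding it hand-made vectors. The sketch below isolates that rule: a new chunk starts wherever the similarity between adjacent sentence embeddings drops below the threshold. The toy two-dimensional vectors and the names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class SemanticBreakpoints {

    static double cosine(float[] a, float[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the index of the first sentence of each chunk. */
    static List<Integer> chunkStarts(List<float[]> sentenceEmbeddings, double threshold) {
        List<Integer> starts = new ArrayList<>();
        if (sentenceEmbeddings.isEmpty()) {
            return starts;
        }
        starts.add(0);
        for (int i = 1; i < sentenceEmbeddings.size(); i++) {
            double similarity = cosine(sentenceEmbeddings.get(i - 1), sentenceEmbeddings.get(i));
            if (similarity < threshold) {
                starts.add(i); // Topic shift: sentence i opens a new chunk.
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        // Sentences 0-1 point one way, sentences 2-3 another.
        List<float[]> toy = List.of(
            new float[]{1.0f, 0.0f}, new float[]{0.9f, 0.1f},
            new float[]{0.0f, 1.0f}, new float[]{0.1f, 0.9f});
        System.out.println(chunkStarts(toy, 0.85)); // prints [0, 2]
    }
}
```

Because the rule only compares neighbours, a single off-topic sentence splits a chunk in two; comparing against a running average of the current chunk is one possible variation.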

Structure Chunking (Markdown)

@Service
public class MarkdownStructureChunker {

    private static final Pattern HEADING_PATTERN = Pattern.compile("^(#{1,6})\\s+(.+)$", Pattern.MULTILINE);

    public List<DocumentChunk> chunk(String markdown, String source) {
        List<Section> sections = parseSections(markdown);

        return sections.stream()
            .map(section -> new DocumentChunk(
                section.content(),
                source,
                section.heading(),
                section.level()
            ))
            .collect(Collectors.toList());
    }

    private List<Section> parseSections(String markdown) {
        List<Section> sections = new ArrayList<>();
        Matcher matcher = HEADING_PATTERN.matcher(markdown);

        List<Integer> positions = new ArrayList<>();
        List<String> headings = new ArrayList<>();
        List<Integer> levels = new ArrayList<>();

        while (matcher.find()) {
            positions.add(matcher.start());
            headings.add(matcher.group(2).trim());
            levels.add(matcher.group(1).length());
        }

        for (int i = 0; i < positions.size(); i++) {
            int start = positions.get(i);
            int end = (i + 1 < positions.size()) ? positions.get(i + 1) : markdown.length();
            String content = markdown.substring(start, end).trim();
            sections.add(new Section(headings.get(i), content, levels.get(i)));
        }

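        if (positions.isEmpty()) {
            // No headings at all: keep the whole document as one section instead of dropping it
            String body = markdown.trim();
            if (!body.isEmpty()) {
                sections.add(new Section("Document", body, 0));
            }
        } else if (positions.get(0) > 0) {
            // Text before the first heading becomes its own section
            String preamble = markdown.substring(0, positions.get(0)).trim();
            if (!preamble.isEmpty()) {
                sections.add(0, new Section("Preamble", preamble, 0));
            }
        }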

        return sections;
    }
}

Agentic RAG: Agents Decide When to Retrieve

Agentic RAG is the cutting edge in 2026. Core idea: Let the Agent itself decide whether to retrieve, what to retrieve, and whether retrieval is sufficient.

┌──────────────────────────────────────────────────────────────────┐
│                    Agentic RAG Workflow                           │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  User Query: "Compare the technical architecture of Product A    │
│              vs Product B"                                       │
│       │                                                          │
│       ▼                                                          │
│  ┌────────────────────────────────────────┐                     │
│  │  Agent thinks: Need Product A arch docs │                     │
│  └──────────────────┬─────────────────────┘                     │
│                     ▼                                            │
│  ┌────────────────────────────────────────┐                     │
│  │  Retrieve: Product A docs → Got context │                     │
│  └──────────────────┬─────────────────────┘                     │
│                     ▼                                            │
│  ┌────────────────────────────────────────┐                     │
│  │  Agent evaluates: Still need Product B  │                     │
│  └──────────────────┬─────────────────────┘                     │
│                     ▼                                            │
│  ┌────────────────────────────────────────┐                     │
│  │  Retrieve: Product B docs → Got context │                     │
│  └──────────────────┬─────────────────────┘                     │
│                     ▼                                            │
│  ┌────────────────────────────────────────┐                     │
│  │  Agent evaluates: Sufficient info, can  │                     │
│  │  generate comparison answer             │                     │
│  └──────────────────┬─────────────────────┘                     │
│                     ▼                                            │
│  ┌────────────────────────────────────────┐                     │
│  │  Generate: A/B architecture comparison │                     │
│  │  with citations                        │                     │
│  └────────────────────────────────────────┘                     │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Agentic RAG Core Implementation

@Service
public class AgenticRagService {

    private final OpenAiChatModel chatModel;
    private final HybridRetrievalService retrievalService;
    private final RerankService rerankService;

    private static final int MAX_ITERATIONS = 5;

    public AgenticRagResponse query(String userQuery) {
        List<RetrievalStep> steps = new ArrayList<>();
        List<SearchResult> allContext = new ArrayList<>();

        for (int i = 0; i < MAX_ITERATIONS; i++) {
            // Always decide against the original question; passing the previous thought
            // instead would make the agent lose track of what the user asked.
            // The accumulated context carries the progress made so far.
            AgentDecision decision = decideAction(userQuery, allContext);
            steps.add(new RetrievalStep(i + 1, decision.thought(), decision.action()));

            if ("GENERATE".equals(decision.action())) {
                String answer = generateAnswer(userQuery, allContext);
                return new AgenticRagResponse(answer, allContext, steps);
            }

            if ("SEARCH".equals(decision.action())) {
                // Retrieve broadly (top 5), then keep the 3 best after reranking
                List<SearchResult> results = retrievalService.hybridSearch(decision.searchQuery(), 5);
                List<SearchResult> reranked = rerankService.rerank(decision.searchQuery(), results, 3);
                allContext.addAll(reranked);
            }

            if ("INSUFFICIENT".equals(decision.action())) {
                return new AgenticRagResponse(
                    "Sorry, after multiple rounds of retrieval, I couldn't find sufficient information to answer your question.",
                    allContext, steps
                );
            }
        }

        // Iteration budget exhausted: answer from whatever context was collected
        String answer = generateAnswer(userQuery, allContext);
        return new AgenticRagResponse(answer, allContext, steps);
    }

    private AgentDecision decideAction(String query, List<SearchResult> context) {
        String contextStr = context.isEmpty() ? "(No retrieval results yet)" :
            context.stream()
                .map(r -> "- " + r.getContent().substring(0, Math.min(200, r.getContent().length())))
                .collect(Collectors.joining("\n"));

        String prompt = """
            You are a RAG Agent that decides the next action.

            User query: %s
            Current context:
            %s

            Decide the next action:
            - SEARCH: Need more retrieval (provide search_query)
            - GENERATE: Have sufficient info, can generate answer
            - INSUFFICIENT: Cannot find sufficient information

            Output JSON format:
            {"thought": "your reasoning", "action": "SEARCH|GENERATE|INSUFFICIENT", "search_query": "retrieval query (only for SEARCH)"}
            """.formatted(query, contextStr);

        ChatResponse response = chatModel.call(new ChatRequest(
            List.of(new Message("user", prompt)),
            ChatOptions.builder().withTemperature(0.1).build()
        ));

        try {
            Map<String, String> result = new ObjectMapper().readValue(
                response.getResult().getOutput().getContent(), Map.class
            );
            return new AgentDecision(
                result.get("thought"),
                result.get("action"),
                result.getOrDefault("search_query", "")
            );
        } catch (Exception e) {
            return new AgentDecision("Parse failed, attempting to generate answer", "GENERATE", "");
        }
    }

    private String generateAnswer(String query, List<SearchResult> context) {
        String contextStr = context.stream()
            .map(r -> "[Source: " + r.getSource() + "]\n" + r.getContent())
            .collect(Collectors.joining("\n\n"));

        String systemPrompt = """
            Answer the user's question based on the retrieved documents below.
            Rules:
            1. Answer only from retrieved documents — do not fabricate
            2. Cite sources for every claim
            3. State clearly if information is insufficient

            Retrieved Documents:
            %s
            """.formatted(contextStr);

        ChatResponse response = chatModel.call(new ChatRequest(
            List.of(new Message("system", systemPrompt), new Message("user", query)),
            ChatOptions.builder().withTemperature(0.1).build()
        ));

        return response.getResult().getOutput().getContent();
    }
}
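
`decideAction` hands the raw model output straight to the JSON parser. In practice chat models often wrap JSON in Markdown fences or add a sentence around it, and they sometimes return an action outside the allowed set. The sketch below shows one way to pre-clean the output before parsing. `LlmJsonSketch`, `extractJsonObject`, and `normalizeAction` are hypothetical helpers, not part of the service above, and the brace scan is deliberately naive (it does not validate JSON or handle several objects in one reply).

```java
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

final class LlmJsonSketch {

    private static final Set<String> ALLOWED_ACTIONS = Set.of("SEARCH", "GENERATE", "INSUFFICIENT");

    /**
     * Returns the substring from the first '{' to the last '}' of the raw model output,
     * or empty if no such pair exists. This strips code fences and surrounding prose.
     */
    static Optional<String> extractJsonObject(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        int start = raw.indexOf('{');
        int end = raw.lastIndexOf('}');
        if (start < 0 || end <= start) {
            return Optional.empty();
        }
        return Optional.of(raw.substring(start, end + 1));
    }

    /** Upper-cases and trims the action; returns empty for anything outside the three allowed values. */
    static Optional<String> normalizeAction(String action) {
        if (action == null) {
            return Optional.empty();
        }
        String upper = action.trim().toUpperCase(Locale.ROOT);
        return ALLOWED_ACTIONS.contains(upper) ? Optional.of(upper) : Optional.empty();
    }

    public static void main(String[] args) {
        String raw = "```json\n{\"thought\": \"need docs\", \"action\": \"search\"}\n```";
        System.out.println(extractJsonObject(raw).orElse("(no JSON found)"));
        System.out.println(normalizeAction("search").orElse("(unknown action)"));
    }
}
```

Returning `Optional.empty()` rather than a default action leaves the fallback policy (retry, generate, or refuse) to the caller.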

Multimodal RAG: Unified Retrieval for Images, Tables, and Code

In 2026, RAG is no longer just text retrieval. Multimodal RAG enables images, tables, and code to be retrieved and cited as well.

┌──────────────────────────────────────────────────────────────────┐
│                    Multimodal RAG Architecture                   │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Input Document (PDF/HTML/Markdown)                              │
│       │                                                          │
│       ├── Text Extraction ──→ Text Embedding ──→ Vector DB      │
│       │                                                          │
│       ├── Table Extraction ──→ Table→Text Desc ──→ Embedding    │
│       │                                            → Vector DB  │
│       │                                                          │
│       ├── Image Extraction ──→ Vision Embedding ──→ Vector DB   │
│       │                       (CLIP/Qwen-VL)                     │
│       │                                                          │
│       └── Code Extraction ──→ Code Embedding ──→ Vector DB      │
│                               (CodeBERT/Specialized Model)       │
│                                                                  │
│  At Query Time: Unified Vector Search → Multimodal Result       │
│  Fusion → LLM Generation                                        │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Multimodal Document Processing

@Service
public class MultimodalDocumentProcessor {

    private final EmbeddingService textEmbeddingService;
    private final OpenAiChatModel visionModel;

    public List<DocumentChunk> processDocument(Document document) {
        List<DocumentChunk> chunks = new ArrayList<>();

        chunks.addAll(processText(document.getTextContent(), document.getSource()));
        chunks.addAll(processTables(document.getTables(), document.getSource()));
        chunks.addAll(processImages(document.getImages(), document.getSource()));
        chunks.addAll(processCodeBlocks(document.getCodeBlocks(), document.getSource()));

        return chunks;
    }

    private List<DocumentChunk> processText(String text, String source) {
        SemanticChunker chunker = new SemanticChunker(textEmbeddingService);
        return chunker.chunk(text, source);
    }

    private List<DocumentChunk> processTables(List<Table> tables, String source) {
        return tables.stream()
            .map(table -> {
                String description = convertTableToText(table);
                return new DocumentChunk(
                    description, source,
                    "table-" + table.getIndex(),
                    table.getIndex(),
                    "TABLE"
                );
            })
            .collect(Collectors.toList());
    }

    private String convertTableToText(Table table) {
        StringBuilder sb = new StringBuilder();
        sb.append("Table Description: ").append(table.getCaption()).append("\n");

        List<String> headers = table.getHeaders();
        sb.append("Columns: ").append(String.join(", ", headers)).append("\n");

        for (List<String> row : table.getRows()) {
            for (int i = 0; i < headers.size() && i < row.size(); i++) {
                sb.append(headers.get(i)).append(": ").append(row.get(i)).append("; ");
            }
            sb.append("\n");
        }

        return sb.toString();
    }

    private List<DocumentChunk> processImages(List<DocumentImage> images, String source) {
        return images.stream()
            .map(image -> {
                String description = describeImage(image);
                return new DocumentChunk(
                    "[Image] " + description, source,
                    "image-" + image.getIndex(),
                    image.getIndex(),
                    "IMAGE"
                );
            })
            .collect(Collectors.toList());
    }

    private String describeImage(DocumentImage image) {
        String prompt = "Describe the content of this image in detail, including chart data and key information.";

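        // NOTE: inlining base64 into the text prompt is a placeholder. Vision-capable chat APIs
        // generally expect the image as a separate image/media part of the message, so adapt
        // this call to the message type your client library provides.
        ChatResponse response = visionModel.call(new ChatRequest(
            List.of(new Message("user", prompt + "\n[Image base64: " + image.getBase64() + "]")),
            ChatOptions.builder().withTemperature(0.1).build()
        ));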

        return response.getResult().getOutput().getContent();
    }

    private List<DocumentChunk> processCodeBlocks(List<CodeBlock> codeBlocks, String source) {
        return codeBlocks.stream()
            .map(code -> new DocumentChunk(
                "[Code] " + code.getLanguage() + "\n" + code.getContent() +
                "\nDescription: " + code.getDescription(),
                source,
                "code-" + code.getIndex(),
                code.getIndex(),
                "CODE"
            ))
            .collect(Collectors.toList());
    }
}

RAG Evaluation Framework: Quantitative Assessment

Without evaluation there is nothing to optimize against. RAG evaluation needs quantitative metrics for retrieval quality, generation quality, and end-to-end behavior:

Evaluation Metrics Framework

Dimension	Metric	Description	Calculation	Target
Retrieval	Recall@K	Proportion of all relevant docs that appear in the top K results	|Relevant ∩ Retrieved@K| / |Relevant|	> 90%
Retrieval	MRR	Mean Reciprocal Rank of first relevant doc	avg(1/first_relevant_rank)	> 0.8
Retrieval	nDCG@K	Normalized Discounted Cumulative Gain	DCG/IDCG	> 0.85
Generation	Faithfulness	Answer's faithfulness to retrieved content	Supported claims / Total claims	> 95%
Generation	Relevancy	Answer's relevance to the query	LLM-assessed 0-1 score	> 0.9
Generation	Correctness	Answer's consistency with ground truth	Semantic similarity to reference	> 0.85
End-to-End	Latency	Total time from query to answer	P95 latency	< 3s
End-to-End	Refusal Accuracy	Correctly refusing unanswerable questions	Correct refusals / Should-refuse total	> 80%
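
The three retrieval metrics in the table can be computed directly from a ranked result list and a labeled set of relevant documents. The sketch below assumes string document IDs, binary relevance (a document is relevant or it is not), and no duplicate IDs in the ranking. The class and method names are illustrative.

```java
import java.util.List;
import java.util.Set;

final class RetrievalMetricsSketch {

    /** Recall@K = |relevant ∩ top-K retrieved| / |relevant|. */
    static double recallAtK(List<String> ranked, Set<String> relevant, int k) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        long hits = ranked.stream().limit(k).distinct().filter(relevant::contains).count();
        return (double) hits / relevant.size();
    }

    /** Reciprocal rank of the first relevant document for one query (0 if none was retrieved). */
    static double reciprocalRank(List<String> ranked, Set<String> relevant) {
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                return 1.0 / (i + 1);
            }
        }
        return 0.0;
    }

    /** MRR = mean of the per-query reciprocal ranks. Both lists must be aligned by query. */
    static double meanReciprocalRank(List<List<String>> rankedPerQuery, List<Set<String>> relevantPerQuery) {
        if (rankedPerQuery.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (int q = 0; q < rankedPerQuery.size(); q++) {
            sum += reciprocalRank(rankedPerQuery.get(q), relevantPerQuery.get(q));
        }
        return sum / rankedPerQuery.size();
    }

    /** nDCG@K with binary gains: each hit at 1-based rank r contributes 1 / log2(r + 1). */
    static double ndcgAtK(List<String> ranked, Set<String> relevant, int k) {
        double dcg = 0.0;
        int limit = Math.min(k, ranked.size());
        for (int i = 0; i < limit; i++) {
            if (relevant.contains(ranked.get(i))) {
                dcg += 1.0 / log2(i + 2);
            }
        }
        // Ideal ordering places all relevant documents first
        double idcg = 0.0;
        int idealHits = Math.min(k, relevant.size());
        for (int i = 0; i < idealHits; i++) {
            idcg += 1.0 / log2(i + 2);
        }
        return idcg == 0.0 ? 0.0 : dcg / idcg;
    }

    private static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    public static void main(String[] args) {
        List<String> ranked = List.of("d3", "d1", "d7", "d2");
        Set<String> relevant = Set.of("d1", "d2");
        System.out.println("Recall@3 = " + recallAtK(ranked, relevant, 3)); // prints Recall@3 = 0.5
        System.out.println("RR       = " + reciprocalRank(ranked, relevant)); // prints RR       = 0.5
        System.out.println("nDCG@4   = " + ndcgAtK(ranked, relevant, 4));
    }
}
```

In the toy ranking, only one of the two relevant documents appears in the top 3, so Recall@3 is 0.5, and the first relevant document sits at rank 2, so its reciprocal rank is 0.5.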

Evaluation Framework Implementation

@Service
public class RagEvaluationService {

    private final OpenAiChatModel chatModel;

    public EvaluationResult evaluate(RagResponse response, String groundTruth) {
        double faithfulness = evaluateFaithfulness(response);
        double relevancy = evaluateRelevancy(response);
        double correctness = evaluateCorrectness(response, groundTruth);

        return new EvaluationResult(faithfulness, relevancy, correctness);
    }

    private double evaluateFaithfulness(RagResponse response) {
        String prompt = """
            Evaluate the faithfulness of the following answer to the retrieved content.

            Retrieved Content:
            %s

            Answer:
            %s

            Extract each claim in the answer and determine if it can be supported by the retrieved content.
            Output JSON: {"total_claims": N, "supported_claims": M}
            """.formatted(
                response.getSources().stream()
                    .map(SearchResult::getContent)
                    .collect(Collectors.joining("\n")),
                response.getAnswer()
            );

        ChatResponse llmResponse = chatModel.call(new ChatRequest(
            List.of(new Message("user", prompt)),
            ChatOptions.builder().withTemperature(0.0).build()
        ));

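        try {
            Map<String, Integer> result = new ObjectMapper().readValue(
                llmResponse.getResult().getOutput().getContent(), Map.class
            );
            int total = result.get("total_claims");
            int supported = result.get("supported_claims");
            // An answer with no extractable claims would otherwise yield NaN (0 / 0)
            if (total <= 0) {
                return 0.0;
            }
            return (double) supported / total;
        } catch (Exception e) {
            // Malformed judge output is scored as 0 rather than failing the evaluation run
            return 0.0;
        }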
    }

    private double evaluateRelevancy(RagResponse response) {
        String prompt = """
            Rate the relevance of the following answer to the query (0-10).

            Query: %s
            Answer: %s

            Output only an integer from 0 to 10.
            """.formatted(response.getQuery(), response.getAnswer());

        ChatResponse llmResponse = chatModel.call(new ChatRequest(
            List.of(new Message("user", prompt)),
            ChatOptions.builder().withTemperature(0.0).build()
        ));

        try {
            return Double.parseDouble(llmResponse.getResult().getOutput().getContent().trim()) / 10.0;
        } catch (Exception e) {
            return 0.0;
        }
    }

    private double evaluateCorrectness(RagResponse response, String groundTruth) {
        String prompt = """
            Rate the semantic consistency of the following answer with the reference answer (0-10).

            Answer: %s
            Reference: %s

            Output only an integer from 0 to 10.
            """.formatted(response.getAnswer(), groundTruth);

        ChatResponse llmResponse = chatModel.call(new ChatRequest(
            List.of(new Message("user", prompt)),
            ChatOptions.builder().withTemperature(0.0).build()
        ));

        try {
            return Double.parseDouble(llmResponse.getResult().getOutput().getContent().trim()) / 10.0;
        } catch (Exception e) {
            return 0.0;
        }
    }
}
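
The service above covers the generation metrics. The two end-to-end metrics from the table, P95 latency and refusal accuracy, need no LLM judge and can be computed from logged test runs. The sketch below assumes latencies are recorded in milliseconds and that each test case is labeled with whether a refusal was expected. It uses the nearest-rank definition of a percentile, which is one of several common conventions. All names are illustrative.

```java
import java.util.Arrays;

final class EndToEndMetricsSketch {

    /** P95 by the nearest-rank method: the value at 1-based rank ceil(0.95 * n) in sorted order. */
    static long p95(long[] latenciesMs) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("No latency samples");
        }
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        // Integer ceiling of 95 * n / 100 avoids floating-point rounding surprises
        int rank = (95 * sorted.length + 99) / 100;
        return sorted[rank - 1];
    }

    /**
     * Refusal accuracy = correct refusals / cases that should have been refused.
     * Arrays are aligned by test case. Returns 0 when no case expects a refusal.
     */
    static double refusalAccuracy(boolean[] shouldRefuse, boolean[] didRefuse) {
        if (shouldRefuse.length != didRefuse.length) {
            throw new IllegalArgumentException("Arrays must be aligned by test case");
        }
        int expected = 0;
        int correct = 0;
        for (int i = 0; i < shouldRefuse.length; i++) {
            if (shouldRefuse[i]) {
                expected++;
                if (didRefuse[i]) {
                    correct++;
                }
            }
        }
        return expected == 0 ? 0.0 : (double) correct / expected;
    }

    public static void main(String[] args) {
        long[] latencies = {820, 910, 1040, 1200, 2950, 760, 880, 990, 1100, 3400};
        System.out.println("P95 latency (ms): " + p95(latencies));
        System.out.println("Refusal accuracy: "
            + refusalAccuracy(new boolean[] {true, true, false}, new boolean[] {true, false, false}));
    }
}
```

Refusal accuracy as defined in the table ignores answerable questions that were wrongly refused, so it is worth tracking that over-refusal rate separately.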

Production-Grade RAG Architecture and Performance Optimization

Production Architecture Overview

┌──────────────────────────────────────────────────────────────────────┐
│                    Production-Grade RAG Architecture                 │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌────────────────────────────────────────────────────────────┐     │
│  │                    API Gateway (Spring Cloud Gateway)       │     │
│  │    Auth(JWT) │ Rate Limit(Sentinel) │ Cache(Redis) │ Logs  │     │
│  └────────────────────────────┬───────────────────────────────┘     │
│                               │                                      │
│  ┌────────────────────────────▼───────────────────────────────┐     │
│  │                    RAG Service (Spring Boot)                │     │
│  │    ┌──────────┐  ┌──────────┐  ┌──────────┐               │     │
│  │    │  Query   │  │  Hybrid  │  │  Rerank  │               │     │
│  │    │  Rewrite │→ │  Search  │→ │  Service │               │     │
│  │    │  Service │  │  Service │  │          │               │     │
│  │    └──────────┘  └──────────┘  └──────────┘               │     │
│  │                                        │                    │     │
│  │    ┌──────────┐  ┌──────────┐          ▼                    │     │
│  │    │Evaluation│  │  Cache   │  ┌──────────┐               │     │
│  │    │ Service  │  │ Service  │  │Generate  │               │     │
│  │    └──────────┘  └──────────┘  │ Service  │               │     │
│  │                                └──────────┘               │     │
│  └────────────────────────────┬───────────────────────────────┘     │
│                               │                                      │
│       ┌───────────────────────┼───────────────────────┐             │
│       ▼                       ▼                       ▼             │
│  ┌──────────┐          ┌──────────┐          ┌──────────┐          │
│  │ Qdrant   │          │Elastic-  │          │ Redis    │          │
│  │ Vector DB│          │search    │          │ Cache    │          │
│  │          │          │ BM25 Idx │          │ Layer    │          │
│  └──────────┘          └──────────┘          └──────────┘          │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────┐     │
│  │              Document Ingestion Pipeline (Async)            │     │
│  │  Upload → Parse → Chunk → Embed → Index → Metadata Store   │     │
│  │  (Kafka-driven, supports incremental updates)               │     │
│  └────────────────────────────────────────────────────────────┘     │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────┐     │
│  │                  Monitoring & Observability                 │     │
│  │  Prometheus │ Grafana Dashboards │ Alerts │ Quality Tracing │     │
│  └────────────────────────────────────────────────────────────┘     │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

Cache Layer Optimization

@Service
public class RagCacheService {

    private final RedisTemplate<String, String> redisTemplate;
    private final ObjectMapper objectMapper;

    private static final long CACHE_TTL_HOURS = 24;

    public Optional<RagResponse> getCachedResponse(String query) {
        String cacheKey = generateCacheKey(query);
        String cached = redisTemplate.opsForValue().get(cacheKey);

        if (cached != null) {
            try {
                return Optional.of(objectMapper.readValue(cached, RagResponse.class));
            } catch (Exception e) {
                return Optional.empty();
            }
        }
        return Optional.empty();
    }

    public void cacheResponse(String query, RagResponse response) {
        String cacheKey = generateCacheKey(query);
        try {
            String json = objectMapper.writeValueAsString(response);
            redisTemplate.opsForValue().set(cacheKey, json, CACHE_TTL_HOURS, TimeUnit.HOURS);
        } catch (Exception e) {
            // Cache failure should not affect the main flow
        }
    }

    private String generateCacheKey(String query) {
        return "rag:cache:" + DigestUtils.md5Hex(query);
    }
}
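
Exact-match caching is sensitive to differences in case and whitespace: "What is RAG?" and "what is  rag?" hash to different keys. One way to widen cache hits is to normalize the query before hashing. The sketch below uses only the JDK. `CacheKeySketch` and `normalize` are hypothetical names, and SHA-256 is substituted here for the MD5 helper purely as an illustration, not because the article requires it.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

final class CacheKeySketch {

    /** Trims, collapses runs of whitespace to a single space, and lower-cases the query. */
    static String normalize(String query) {
        return query.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    /** Builds "rag:cache:" + hex(SHA-256(normalized query)). */
    static String cacheKey(String query) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(normalize(query).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return "rag:cache:" + hex;
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(cacheKey("  What is   RAG? ").equals(cacheKey("what is rag?"))); // prints true
    }
}
```

Normalization only catches trivially different spellings. Matching paraphrased queries would need a semantic cache keyed on embeddings, which is a separate design.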

Document Ingestion Pipeline

@Service
public class DocumentIngestionPipeline {

    private final DocumentParserService parserService;
    private final SemanticChunker semanticChunker;
    private final MarkdownStructureChunker structureChunker;
    private final EmbeddingService embeddingService;
    private final QdrantVectorStore vectorStore;
    private final DocumentMapper documentMapper;

    @Async("ingestionExecutor")
    public CompletableFuture<Void> ingestDocument(MultipartFile file, String source) {
        String content = parserService.parse(file);

        List<DocumentChunk> chunks;
        if (isMarkdown(content)) {
            chunks = structureChunker.chunk(content, source);
        } else {
            chunks = semanticChunker.chunk(content, source);
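        }

        // Nothing to embed or index (empty or unparseable document)
        if (chunks.isEmpty()) {
            return CompletableFuture.completedFuture(null);
        }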

        List<float[]> embeddings = embeddingService.embedBatch(
            chunks.stream().map(DocumentChunk::getContent).collect(Collectors.toList())
        );

        for (int i = 0; i < chunks.size(); i++) {
            chunks.get(i).setEmbedding(embeddings.get(i));
        }

        vectorStore.upsertDocuments(chunks);

        for (DocumentChunk chunk : chunks) {
            DocumentEntity entity = new DocumentEntity();
            entity.setContent(chunk.getContent());
            entity.setSource(chunk.getSource());
            entity.setSection(chunk.getSection());
            entity.setEmbedding(chunk.getEmbedding());
            entity.setSearchText(chunk.getContent());
            documentMapper.insert(entity);
        }

        return CompletableFuture.completedFuture(null);
    }

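    private boolean isMarkdown(String content) {
        // Look for ATX headings at the start of a line. A bare contains("# ") check also matches
        // plain text such as "C# ", and heading-free content is better served by the semantic chunker.
        return content.lines().anyMatch(line -> line.matches("#{1,6}\\s+\\S.*"));
    }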
}

Performance Optimization Checklist

Optimization	Method	Impact
Embedding Cache	Reuse embeddings for identical text	50%+ fewer API calls
Query Cache	Redis cache for repeated queries	P95 latency reduced 60%
Batch Embedding	Merge multiple texts in one call	3-5x throughput increase
Async Ingestion	Kafka-driven async document processing	Ingestion doesn't block queries
Connection Pool	Qdrant/ES connection pool tuning	2x concurrency capacity
Pre-computed HyDE	Pre-generate HyDE embeddings for hot queries	40% latency reduction for hot queries
Index Tuning	HNSW parameter optimization (M=16, ef=100)	Balance precision and speed
Sharding Strategy	Shard by document type/time	30% faster by narrowing search scope
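
As an illustration of the "Embedding Cache" row, the sketch below wraps an embedding function in a small in-memory LRU cache so identical text is embedded only once. The embedder is passed in as a plain `Function<String, float[]>` stand-in for the real embedding client, and `EmbeddingCacheSketch` and its `misses` counter are assumptions for demonstration. A production setup would more likely use Redis or a caching library.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

final class EmbeddingCacheSketch {

    private final Function<String, float[]> embedder;
    private final Map<String, float[]> cache;
    private int misses = 0;

    EmbeddingCacheSketch(Function<String, float[]> embedder, int maxEntries) {
        this.embedder = embedder;
        // accessOrder = true turns LinkedHashMap into an LRU structure;
        // removeEldestEntry evicts the least recently used entry once the cap is exceeded
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, float[]> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached vector for identical text, calling the embedder only on a miss. */
    synchronized float[] embed(String text) {
        float[] cached = cache.get(text);
        if (cached != null) {
            return cached;
        }
        misses++;
        float[] vector = embedder.apply(text);
        cache.put(text, vector);
        return vector;
    }

    /** Number of times the underlying embedder was actually called. */
    synchronized int misses() {
        return misses;
    }

    public static void main(String[] args) {
        // Fake embedder: a one-dimensional "embedding" equal to the text length
        EmbeddingCacheSketch cache = new EmbeddingCacheSketch(t -> new float[] {t.length()}, 1000);
        cache.embed("refund policy");
        cache.embed("refund policy");
        System.out.println("Embedder calls: " + cache.misses()); // prints Embedder calls: 1
    }
}
```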

Spring Boot Configuration

spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: gpt-4o
          temperature: 0.1
      embedding:
        options:
          model: text-embedding-3-small

rag:
  vector-store:
    type: qdrant
    qdrant:
      host: localhost
      port: 6334
      collection: knowledge_base
      vector-size: 1536
  chunking:
    default-strategy: semantic
    chunk-size: 512
    overlap-size: 64
    similarity-threshold: 0.85
  retrieval:
    hybrid: true
    vector-weight: 0.7
    keyword-weight: 0.3
    rrf-k: 60
    top-k: 10
  cache:
    enabled: true
    ttl-hours: 24
  ingestion:
    async: true
    batch-size: 100
    pool-size: 4

Summary

Component	2026 Best Practice	Key Takeaway
Query Processing	Query rewriting + HyDE	Multi-angle retrieval boosts recall
Retrieval Strategy	Hybrid (vector + BM25) + RRF fusion	Vector 70% + Keyword 30%
Ranking Optimization	Cross-Encoder reranking	Fine-grained query-doc relevance scoring
Document Chunking	Semantic > Structure > Fixed	Chunking is RAG's foundation
Agentification	Agentic RAG multi-turn + self-evaluation	Agent decides when to retrieve
Multimodal	Unified image/table/code retrieval	Vision embeddings + text descriptions
Evaluation	Faithfulness/Relevancy/Correctness	No evaluation, no optimization
Productionization	Cache + async + monitoring + tuning	Engineering determines RAG success

RAG isn't just "retrieve + generate"; it's a systems engineering challenge that requires careful design at every stage. From query rewriting to hybrid retrieval, from document chunking to agentic reasoning, each component affects the quality of the final answer. In 2026, Advanced RAG is the standard, Agentic RAG is the frontier, and evaluation frameworks are the foundation for continuous improvement.