RAG Deep Dive: From Naive RAG to Agentic RAG — Three Generations of Evolution and Enterprise Implementation

Technical Architecture

Why RAG Is a Must-Have for Enterprise AI

No matter how powerful LLMs get, they suffer from three fundamental flaws. In 2026, if you're still making raw LLM API calls for enterprise Q&A, you've definitely hit these walls:

| Flaw | Manifestation | Business Impact |
|---|---|---|
| Knowledge Cutoff | Training data is frozen at a specific date | Cannot answer latest policies or product info |
| Hallucination | Fabricates plausible but non-existent facts | Misleads decisions, legal risk |
| Private Knowledge Gap | Unfamiliar with internal docs, processes, terminology | General models can't replace domain-specific Q&A |

The essence of RAG (Retrieval-Augmented Generation): Make the LLM "look up references" before answering, constraining generation with retrieval results to address all three flaws at their root.

Data point: In 2026, 92% of enterprise AI deployments adopted RAG architecture; pure prompt approaches account for less than 5%.

┌─────────────────────────────────────────────────────────────┐
│       Without RAG vs With RAG: The Fundamental Shift        │
├────────────────────────┬────────────────────────────────────┤
│      Raw LLM Call      │        RAG-Enhanced Call           │
├────────────────────────┼────────────────────────────────────┤
│  User Query → LLM →    │  User Query → Retrieve → Inject    │
│  Answer (from memory)  │  Context → LLM → Answer+Citations  │
│                        │  (answer after research)           │
├────────────────────────┼────────────────────────────────────┤
│  Knowledge: cutoff     │  Knowledge: real-time, updatable   │
│  Accuracy: uncontrolled│  Accuracy: retrieval-constrained   │
│  Traceability: none    │  Traceability: every claim cited   │
└────────────────────────┴────────────────────────────────────┘
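
The right-hand flow can be sketched in a few lines of Java. This is a minimal illustration rather than a production design: the keyword-overlap retriever, the `MiniRag` class name, and the `Function<String, String>` stand-in for the model call are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Minimal sketch of "retrieve, inject context, generate".
// The keyword-overlap retriever and the llm stub are illustrative assumptions.
class MiniRag {

    // Counts how many distinct query words also appear in the document (case-insensitive).
    static int overlap(String query, String doc) {
        Set<String> docWords = new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\W+")));
        int hits = 0;
        for (String w : new HashSet<>(Arrays.asList(query.toLowerCase().split("\\W+")))) {
            if (!w.isEmpty() && docWords.contains(w)) {
                hits++;
            }
        }
        return hits;
    }

    // Returns up to topK documents with the highest overlap; documents with no overlap are dropped.
    static List<String> retrieve(String query, List<String> corpus, int topK) {
        List<String> ranked = new ArrayList<>();
        for (String d : corpus) {
            if (overlap(query, d) > 0) {
                ranked.add(d);
            }
        }
        ranked.sort(Comparator.comparingInt((String d) -> overlap(query, d)).reversed());
        return new ArrayList<>(ranked.subList(0, Math.min(topK, ranked.size())));
    }

    // Numbers each retrieved passage so the model can cite it as [n].
    static String buildPrompt(String query, List<String> passages) {
        StringBuilder sb = new StringBuilder("Answer using only the passages below and cite them as [n].\n\n");
        for (int i = 0; i < passages.size(); i++) {
            sb.append('[').append(i + 1).append("] ").append(passages.get(i)).append('\n');
        }
        return sb.append("\nQuestion: ").append(query).toString();
    }

    // llm is a stand-in for whatever chat-completion client is actually used.
    static String answer(String query, List<String> corpus, Function<String, String> llm) {
        return llm.apply(buildPrompt(query, retrieve(query, corpus, 3)));
    }
}
```

The point of the sketch is the shape of the call: the model only ever sees the question together with numbered passages, which is what makes citations and knowledge updates possible without retraining.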

Naive RAG → Advanced RAG → Agentic RAG: Three Generations of Evolution

RAG isn't static. From 2023 to 2026, RAG architecture has evolved through three generations:

┌───────────────────────────────────────────────────────────────┐
│                  RAG Three-Generation Evolution Roadmap        │
├───────────────────┬───────────────────┬───────────────────────┤
│   Naive RAG       │  Advanced RAG     │   Agentic RAG         │
│   (2023)          │  (2024-2025)      │   (2026)              │
├───────────────────┼───────────────────┼───────────────────────┤
│ Query → Retrieve  │ Query Rewrite →   │ Agent autonomous      │
│ → Generate        │ Hybrid Retrieve → │ decision-making →     │
│                   │ Rerank → Generate │ Multi-turn + self-eval│
├───────────────────┼───────────────────┼───────────────────────┤
│ Problem: poor     │ Problem: lacks    │ Problem: high         │
│ retrieval quality │ autonomy          │ complexity            │
│ Hallucination 30%+│ Hallucination     │ Hallucination <5%     │
│ Uncontrollable    │ 10-15%            │ Proactive reasoning + │
│                   │ Controllable but  │ self-correction       │
│                   │ passive           │                       │
└───────────────────┴───────────────────┴───────────────────────┘

Three Generations Compared

| Dimension | Naive RAG | Advanced RAG | Agentic RAG |
|---|---|---|---|
| Query Processing | Raw query directly | Query rewrite/expansion/HyDE | Agent decomposes sub-questions |
| Retrieval Strategy | Single vector search | Hybrid retrieval + reranking | Multi-turn retrieval + tool calls |
| Generation Strategy | Concatenate context | Curated context + citations | Self-evaluation + iterative refinement |
| Typical Hallucination Rate | 30-40% | 10-15% | <5% |
| End-to-End Latency | 1-2s | 2-4s | 5-15s |
| Use Case | Demo/Prototype | Production | Complex reasoning |
| Implementation Complexity | Low | Medium | High |

Embedding Model Selection and Evaluation

Embeddings are RAG's "eyes": choose the wrong model and retrieval quality drops immediately. The table below compares the mainstream embedding models of 2026:

| Model | Dimensions | MTEB Score | Chinese MTEB | Price / 1M tokens | Deployment |
|---|---|---|---|---|---|
| text-embedding-3-large | 3072 | 68.4 | 64.2 | $0.13 | API |
| text-embedding-3-small | 1536 | 62.3 | 58.7 | $0.02 | API |
| bge-m3 | 1024 | 65.8 | 70.1 | Free | Local/API |
| gte-Qwen2-1.5B | 1536 | 67.2 | 72.5 | Free | Local/API |
| gte-Qwen2-7B | 3584 | 70.1 | 75.8 | Free | Local (GPU required) |
| Cohere embed-v4 | 1024 | 66.1 | 61.3 | $0.10 | API |
| Voyage-3 | 1024 | 67.8 | 63.9 | $0.12 | API |

Selection Guide

| Scenario | Recommended Model | Rationale |
|---|---|---|
| Chinese-dominant enterprise KB | gte-Qwen2-1.5B | Strong Chinese MTEB (72.5, second only to gte-Qwen2-7B in the table above), free local deployment |
| Multilingual mix | bge-m3 | Native multilingual, 1024-dim cost-effective |
| Maximum quality | gte-Qwen2-7B | 7B parameters, best results, needs GPU |
| Quick launch, no ops | text-embedding-3-small | API call, $0.02/1M tokens |
| Budget available, stability first | text-embedding-3-large | OpenAI ecosystem, 3072-dim high precision |
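
Dimension count and per-token price translate directly into storage and ingestion cost, which often decides between the options above. The following back-of-the-envelope sketch uses the dimensions and prices from the comparison table; the corpus size (1M chunks, 500M tokens) and the `EmbeddingSizing` class are made-up assumptions for illustration, and index overhead is not included.

```java
// Back-of-the-envelope sizing for raw float32 vectors (index overhead not included).
class EmbeddingSizing {

    // Bytes needed to store `count` vectors of `dims` float32 components.
    static long rawBytes(long count, int dims) {
        return count * dims * Float.BYTES;
    }

    // Cost in USD for embedding `tokens` tokens at a given price per 1M tokens.
    static double apiCostUsd(long tokens, double pricePerMillion) {
        return tokens / 1_000_000.0 * pricePerMillion;
    }

    public static void main(String[] args) {
        long chunks = 1_000_000L; // hypothetical corpus size
        System.out.printf("1536-dim: %.2f GB%n", rawBytes(chunks, 1536) / 1e9);
        System.out.printf("3072-dim: %.2f GB%n", rawBytes(chunks, 3072) / 1e9);
        System.out.printf("500M tokens at $0.02/1M: $%.2f%n", apiCostUsd(500_000_000L, 0.02));
    }
}
```

Under these assumptions, doubling the dimension from 1536 to 3072 doubles raw vector storage from roughly 6.1 GB to 12.3 GB per million chunks, which is worth weighing against the MTEB gain.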

Java Embedding Service Implementation

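import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.springframework.http.ResponseEntity;
import org.springframework.web.client.RestTemplate;

// Thin client for an OpenAI-compatible /embeddings endpoint.
// No authentication header is set here; configure one if the endpoint requires an API key.
public class EmbeddingService {

    // Not used by the embedding methods below; kept for constructor compatibility.
    private final OpenAiChatModel chatModel;
    private final RestTemplate restTemplate;
    private final String embeddingApiUrl;
    private final String embeddingModel;

    public EmbeddingService(OpenAiChatModel chatModel, String apiUrl, String model) {
        this.chatModel = chatModel;
        this.restTemplate = new RestTemplate();
        this.embeddingApiUrl = apiUrl;
        this.embeddingModel = model;
    }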

    public float[] embed(String text) {
        Map<String, Object> request = Map.of(
            "model", embeddingModel,
            "input", text
        );

        ResponseEntity<Map> response = restTemplate.postForEntity(
            embeddingApiUrl + "/embeddings",
            request,
            Map.class
        );

        List<Double> embedding = (List<Double>) ((Map) ((List<?>) response.getBody().get("data")).get(0)).get("embedding");

        float[] result = new float[embedding.size()];
        for (int i = 0; i < embedding.size(); i++) {
            result[i] = embedding.get(i).floatValue();
        }
        return result;
    }

    public List<float[]> embedBatch(List<String> texts) {
        Map<String, Object> request = Map.of(
            "model", embeddingModel,
            "input", texts
        );

        ResponseEntity<Map> response = restTemplate.postForEntity(
            embeddingApiUrl + "/embeddings",
            request,
            Map.class
        );

        List<?> dataList = (List<?>) response.getBody().get("data");
        return dataList.stream()
            .map(item -> {
                List<Double> embedding = (List<Double>) ((Map) item).get("embedding");
                float[] arr = new float[embedding.size()];
                for (int i = 0; i < embedding.size(); i++) {
                    arr[i] = embedding.get(i).floatValue();
                }
                return arr;
            })
            .collect(Collectors.toList());
    }

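    // Cosine similarity in [-1, 1]; returns 0 when either vector has zero norm.
    public static float cosineSimilarity(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "Vector dimensions differ: " + a.length + " vs " + b.length);
        }
        // Accumulate in double to limit rounding error on high-dimensional vectors.
        double dotProduct = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dotProduct += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0f;
        }
        return (float) (dotProduct / (Math.sqrt(normA) * Math.sqrt(normB)));
    }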
}
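
When ingesting a large corpus, `embedBatch` should not be handed the whole document set in one request, since embedding endpoints generally cap the number of inputs or tokens per call. The exact limit depends on the provider, so the batch size of 64 mentioned below is only a placeholder, and `BatchSplitter` is a hypothetical helper, not part of the service above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a list into consecutive batches before calling an embedding API.
class BatchSplitter {

    // Returns batches of at most batchSize elements, preserving the original order.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }
}
```

A caller would loop over `BatchSplitter.partition(texts, 64)` and append the result of `embedBatch(batch)` for each batch, so that output order still matches input order.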

Vector Database Comparison: PGVector vs Milvus vs Qdrant

In 2026, the question isn't "do I need a vector database?" but "which one should I choose?" The table below compares three mainstream options in depth:

| Dimension | PGVector | Milvus | Qdrant |
|---|---|---|---|
| Underlying | PostgreSQL extension | Standalone distributed system | Standalone Rust service |
| Max Vectors | Tens of millions | Tens of billions | Billions |
| Query Latency (1M vectors) | 25-40ms | 15-25ms | 10-18ms |
| Hybrid Search | Needs tsvector | Native support | Native support |
| Distributed | Relies on PG logical replication | Native distributed | Shard support |
| Ops Complexity | Low (reuses PG) | High | Medium |
| Transaction Support | ACID | Limited | Limited |
| Filtering | Full SQL | Scalar filtering | Payload filtering |
| Ecosystem | Spring Data JPA | Java SDK | Java SDK |
| Use Case | Existing PG, small-medium scale | Ultra-large scale, high concurrency | High performance, medium scale |

Architecture Comparison

┌────────────────────────────────────────────────────────────────┐
│                     PGVector Architecture                      │
├────────────────────────────────────────────────────────────────┤
│  Application ──→ PostgreSQL (pgvector extension)               │
│                    ├── Vector Index (HNSW/IVFFlat)             │
│                    ├── Full-Text Search (tsvector)             │
│                    ├── Relational Data (row storage)           │
│                    └── Transactions/ACID                       │
│  Pros: Zero extra ops, full SQL capabilities                   │
│  Cons: Performance limited at scale, weak distribution         │
├────────────────────────────────────────────────────────────────┤
│                     Milvus Architecture                        │
├────────────────────────────────────────────────────────────────┤
│  Application ──→ Proxy ──→ Coordinator                        │
│                              ├── Query Node (retrieval)        │
│                              ├── Data Node (writes)            │
│                              └── Index Node (index building)   │
│  Storage: MinIO/S3 + etcd                                      │
│  Pros: Billion-scale, native distributed, cloud-native         │
│  Cons: Complex ops, high resource usage                        │
├────────────────────────────────────────────────────────────────┤
│                     Qdrant Architecture                        │
├────────────────────────────────────────────────────────────────┤
│  Application ──→ Qdrant Service (Rust)                        │
│                    ├── HNSW Index                              │
│                    ├── Payload Filtering                       │
│                    ├── WAL Persistence                         │
│                    └── Sharded Cluster                         │
│  Pros: Rust high performance, low latency, clean API           │
│  Cons: Not as scalable as Milvus at extreme scale              │
└────────────────────────────────────────────────────────────────┘
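
For intuition about what all three systems accelerate: exact nearest-neighbour search is just a linear scan that scores every stored vector against the query. Index structures such as HNSW and IVFFlat exist to avoid that full scan at the price of approximate results. The sketch below is an illustrative baseline with a hypothetical `BruteForceSearch` class; it is not how any of the three databases is implemented.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Exact top-k search by linear scan; a baseline that vector indexes approximate.
class BruteForceSearch {

    // Cosine similarity; returns 0 when either vector has zero norm.
    static double cosine(float[] a, float[] b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            na += (double) a[i] * a[i];
            nb += (double) b[i] * b[i];
        }
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    // Returns the indices of the k most similar vectors, most similar first.
    static List<Integer> topK(float[] query, List<float[]> vectors, int k) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < vectors.size(); i++) {
            indices.add(i);
        }
        indices.sort(Comparator.comparingDouble(
            (Integer i) -> cosine(query, vectors.get(i))).reversed());
        return new ArrayList<>(indices.subList(0, Math.min(k, indices.size())));
    }
}
```

The scan costs one similarity computation per stored vector per query, which is why it stops being practical well before the scales listed in the comparison table.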

Spring Boot Qdrant Integration

@Configuration
public class QdrantConfig {

    @Bean
    public QdrantClient qdrantClient() {
        return new QdrantClient(
            QdrantGrpcClient.newBuilder("localhost", 6334, false).build()
        );
    }
}

@Service
public class QdrantVectorStore {

    private final QdrantClient qdrantClient;
    private final EmbeddingService embeddingService;

    private static final String COLLECTION_NAME = "knowledge_base";
    private static final int VECTOR_SIZE = 1536;

    public QdrantVectorStore(QdrantClient qdrantClient, EmbeddingService embeddingService) {
        this.qdrantClient = qdrantClient;
        this.embeddingService = embeddingService;
    }

    public void createCollection() throws ExecutionException, InterruptedException {
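        // CreateCollection is the request message; CollectionInfo is what Qdrant returns when describing a collection.
        qdrantClient.createCollectionAsync(
            CreateCollection.newBuilder()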
                .setCollectionName(COLLECTION_NAME)
                .setVectorsConfig(VectorsConfig.newBuilder()
                    .setParams(VectorParams.newBuilder()
                        .setSize(VECTOR_SIZE)
                        .setDistance(Distance.Cosine)
                        .build())
                    .build())
                .setOptimizersConfig(OptimizersConfigDiff.newBuilder()
                    .setIndexingThreshold(20000)
                    .build())
                .setHnswConfig(HnswConfigDiff.newBuilder()
                    .setM(16)
                    .setEfConstruct(100)
                    .build())
                .build()
        ).get();
    }

    public void upsertDocuments(List<DocumentChunk> chunks) throws ExecutionException, InterruptedException {
        List<float[]> embeddings = embeddingService.embedBatch(
            chunks.stream().map(DocumentChunk::getContent).collect(Collectors.toList())
        );

        List<PointStruct> points = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i++) {
            DocumentChunk chunk = chunks.get(i);
            points.add(PointStruct.newBuilder()
                // Random IDs mean re-ingesting the same chunk adds a duplicate point instead of overwriting it.
                .setId(PointId.newBuilder().setUuid(UUID.randomUUID().toString()).build())
                .setVectors(Vectors.newBuilder()
                    .setVector(Vector.newBuilder().addAllData(toFloatList(embeddings.get(i))).build())
                    .build())
                .putAllPayload(Map.of(
                    "content", Value.newBuilder().setStringValue(chunk.getContent()).build(),
                    "source", Value.newBuilder().setStringValue(chunk.getSource()).build(),
                    "section", Value.newBuilder().setStringValue(chunk.getSection()).build()
                ))
                .build());
        }

        qdrantClient.upsertAsync(COLLECTION_NAME, points).get();
    }

    public List<SearchResult> search(String query, int topK) throws ExecutionException, InterruptedException {
        float[] queryVector = embeddingService.embed(query);

        // SearchPoints.vector is a repeated float field, so the query values are added directly.
        List<ScoredPoint> results = qdrantClient.searchAsync(
            SearchPoints.newBuilder()
                .setCollectionName(COLLECTION_NAME)
                .addAllVector(toFloatList(queryVector))
                .setLimit(topK)
                .setWithPayload(WithPayloadSelector.newBuilder().setEnable(true).build())
                .build()
        ).get();

        return results.stream()
            .map(point -> new SearchResult(
                point.getPayload().get("content").getStringValue(),
                point.getPayload().get("source").getStringValue(),
                point.getScore()
            ))
            .collect(Collectors.toList());
    }

    private List<Float> toFloatList(float[] arr) {
        List<Float> list = new ArrayList<>(arr.length);
        for (float v : arr) {
            list.add(v);
        }
        return list;
    }
}

Spring Boot PGVector Integration

@Entity
@Table(name = "documents")
public class DocumentEntity {

    @Id
    @GeneratedValue(strategy = GenerationType.UUID)
    private UUID id;

    private String content;

    private String source;

    private String section;

    @Column(columnDefinition = "vector(1536)")
    private float[] embedding;

    @Column(columnDefinition = "tsvector")
    private String searchText;
}

@Mapper
public interface DocumentMapper extends BaseMapper<DocumentEntity> {

    @Select("SELECT *, embedding <=> #{embedding} AS distance " +
            "FROM documents " +
            "WHERE embedding <=> #{embedding} < #{threshold} " +
            "ORDER BY embedding <=> #{embedding} " +
            "LIMIT #{limit}")
    List<DocumentEntity> vectorSearch(@Param("embedding") float[] embedding,
                                       @Param("threshold") float threshold,
                                       @Param("limit") int limit);

    @Select("SELECT *, ts_rank(search_text, plainto_tsquery(#{query})) AS rank " +
            "FROM documents " +
            "WHERE search_text @@ plainto_tsquery(#{query}) " +
            "ORDER BY rank DESC " +
            "LIMIT #{limit}")
    List<DocumentEntity> fullTextSearch(@Param("query") String query,
                                         @Param("limit") int limit);
}
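
One practical detail with the mapper above: MyBatis does not know how to bind a Java `float[]` to a pgvector `vector` column out of the box, so a custom type handler is usually needed. A common workaround, sketched here as an assumption rather than as the author's approach, is to pass the embedding as pgvector's text form (`[v1,v2,...]`) and cast it in SQL, for example `CAST(#{embedding} AS vector)`. `PgVectorLiteral` is a hypothetical helper name.

```java
// Hypothetical helper: formats a float array as pgvector's text input form, e.g. [0.5,1.0,-2.25].
class PgVectorLiteral {

    static String toLiteral(float[] values) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < values.length; i++) {
            // pgvector rejects NaN and infinite components, so fail early with a clear message.
            if (Float.isNaN(values[i]) || Float.isInfinite(values[i])) {
                throw new IllegalArgumentException("Non-finite component at index " + i);
            }
            if (i > 0) {
                sb.append(',');
            }
            sb.append(values[i]);
        }
        return sb.append(']').toString();
    }
}
```

With this approach the mapper parameter becomes a `String`, and the `<=>` cosine-distance operator is applied to the cast value.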

Advanced RAG in Practice: Query Rewriting + Hybrid Retrieval + Reranking

This is the standard architecture for production RAG in 2026. Single vector retrieval is no longer sufficient — hybrid retrieval + reranking is the way to go.

┌──────────────────────────────────────────────────────────────────┐
│                 Advanced RAG Complete Pipeline                    │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  User Query: "What drove the company's Q4 2025 revenue growth?"  │
│       │                                                          │
│       ▼                                                          │
│  ┌─────────────────────────────────┐                            │
│  │  1. Query Rewriting              │                            │
│  │  Original → 3 rewrites + HyDE    │                            │
│  └──────────────┬──────────────────┘                            │
│                 │                                                │
│       ┌─────────┴─────────┐                                     │
│       ▼                   ▼                                      │
│  ┌──────────┐      ┌──────────┐                                  │
│  │  Vector  │      │  BM25    │                                  │
│  │  Search  │      │  Search  │                                  │
│  │(Semantic)│      │(Keyword) │                                  │
│  └────┬─────┘      └────┬─────┘                                  │
│       │                  │                                        │
│       └────────┬─────────┘                                        │
│                ▼                                                  │
│  ┌─────────────────────────────────┐                            │
│  │  2. RRF Fusion (Reciprocal Rank │                            │
│  │     Fusion)                      │                            │
│  │  Vector 0.7 + Keyword 0.3        │                            │
│  └──────────────┬──────────────────┘                            │
│                 │                                                │
│                 ▼                                                │
│  ┌─────────────────────────────────┐                            │
│  │  3. Cross-Encoder Reranking      │                            │
│  │  Fine-grained query-doc scoring  │                            │
│  └──────────────┬──────────────────┘                            │
│                 │                                                │
│                 ▼                                                │
│  ┌─────────────────────────────────┐                            │
│  │  4. Context Injection + LLM Gen  │                            │
│  │  Precise answer with citations   │                            │
│  └─────────────────────────────────┘                            │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Query Rewriting Implementation

@Service
public class QueryRewriteService {

    private final OpenAiChatModel chatModel;

    public QueryRewriteService(OpenAiChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public List<String> rewriteQuery(String originalQuery) {
        String prompt = """
            Rewrite the following user query into 3 different search queries from different angles to improve retrieval recall.
            Requirements:
            1. Preserve the core intent of the original query
            2. Express the same need from different perspectives
            3. Add possible technical terms or synonyms
            4. Output JSON format: {"rewrites": ["query1", "query2", "query3"]}

            Original query: %s
            """.formatted(originalQuery);

        ChatResponse response = chatModel.call(new ChatRequest(
            List.of(new Message("user", prompt)),
            ChatOptions.builder().withTemperature(0.3).build()
        ));

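        String content = response.getResult().getOutput().getContent();
        try {
            // Models sometimes wrap JSON in Markdown fences; keep only the outermost {...} span.
            int open = content.indexOf('{');
            int close = content.lastIndexOf('}');
            String json = (open >= 0 && close > open) ? content.substring(open, close + 1) : content;
            Map<String, Object> result = new ObjectMapper().readValue(json, Map.class);
            Object rewrites = result.get("rewrites");
            return rewrites instanceof List ? (List<String>) rewrites : List.of();
        } catch (com.fasterxml.jackson.core.JsonProcessingException e) {
            // readValue throws a checked exception; fall back to no rewrites so the
            // caller can still search with the original query.
            return List.of();
        }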
    }

    public String generateHyde(String query) {
        String prompt = """
            Provide a detailed answer to the following question, even if you're uncertain.
            This answer will be used to retrieve relevant documents, so include as many
            relevant details and technical terms as possible.

            Question: %s
            """.formatted(query);

        ChatResponse response = chatModel.call(new ChatRequest(
            List.of(new Message("user", prompt)),
            ChatOptions.builder().withTemperature(0.5).build()
        ));

        return response.getResult().getOutput().getContent();
    }
}

Hybrid Retrieval + RRF Fusion

@Service
public class HybridRetrievalService {

    private final QdrantVectorStore vectorStore;
    private final DocumentMapper documentMapper;
    private final EmbeddingService embeddingService;

    private static final double VECTOR_WEIGHT = 0.7;
    private static final double KEYWORD_WEIGHT = 0.3;
    private static final int RRF_K = 60;

    public HybridRetrievalService(QdrantVectorStore vectorStore,
                                   DocumentMapper documentMapper,
                                   EmbeddingService embeddingService) {
        this.vectorStore = vectorStore;
        this.documentMapper = documentMapper;
        this.embeddingService = embeddingService;
    }

    public List<SearchResult> hybridSearch(String query, int topK) {
        List<SearchResult> vectorResults = vectorSearch(query, topK * 2);
        List<SearchResult> keywordResults = keywordSearch(query, topK * 2);

        List<SearchResult> fused = reciprocalRankFusion(
            List.of(vectorResults, keywordResults),
            List.of(VECTOR_WEIGHT, KEYWORD_WEIGHT)
        );

        return fused.stream().limit(topK).collect(Collectors.toList());
    }

    private List<SearchResult> vectorSearch(String query, int limit) {
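        try {
            return vectorStore.search(query, limit);
        } catch (InterruptedException e) {
            // Preserve the interrupt status for callers higher up the stack.
            Thread.currentThread().interrupt();
            return Collections.emptyList();
        } catch (Exception e) {
            // Degrade to keyword-only results if the vector store fails; log this in production.
            return Collections.emptyList();
        }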
    }

    private List<SearchResult> keywordSearch(String query, int limit) {
        List<DocumentEntity> results = documentMapper.fullTextSearch(query, limit);
        return results.stream()
            .map(doc -> new SearchResult(doc.getContent(), doc.getSource(), 0.0))
            .collect(Collectors.toList());
    }

    private List<SearchResult> reciprocalRankFusion(
            List<List<SearchResult>> resultSets,
            List<Double> weights) {

        Map<String, Double> scoreMap = new HashMap<>();
        Map<String, SearchResult> resultMap = new HashMap<>();

        for (int setIndex = 0; setIndex < resultSets.size(); setIndex++) {
            List<SearchResult> results = resultSets.get(setIndex);
            double weight = weights.get(setIndex);

            for (int rank = 0; rank < results.size(); rank++) {
                String key = results.get(rank).getContent();
                double score = weight / (rank + 1 + RRF_K);
                scoreMap.merge(key, score, Double::sum);
                resultMap.putIfAbsent(key, results.get(rank));
            }
        }

        return scoreMap.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .map(entry -> {
                SearchResult result = resultMap.get(entry.getKey());
                return new SearchResult(result.getContent(), result.getSource(), entry.getValue());
            })
            .collect(Collectors.toList());
    }
}
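
To make the fusion step concrete: weighted RRF scores each document as the sum over result lists of `weight / (k + rank)`, with `rank` starting at 1. The standalone sketch below reuses the weights (0.7 and 0.3) and `k = 60` from the service above, but works on plain document IDs; the `WeightedRrf` class and the sample rankings are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Standalone weighted Reciprocal Rank Fusion over lists of document IDs.
class WeightedRrf {

    // score(d) = sum over lists of weight / (k + rank), where rank is 1-based.
    static Map<String, Double> scores(List<List<String>> rankings, List<Double> weights, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (int s = 0; s < rankings.size(); s++) {
            List<String> ranking = rankings.get(s);
            for (int rank = 0; rank < ranking.size(); rank++) {
                scores.merge(ranking.get(rank), weights.get(s) / (k + rank + 1), Double::sum);
            }
        }
        return scores;
    }

    // Returns document IDs ordered by fused score, highest first.
    static List<String> fuse(List<List<String>> rankings, List<Double> weights, int k) {
        return scores(rankings, weights, k).entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> vector = List.of("A", "B", "C");   // semantic ranking
        List<String> keyword = List.of("C", "A", "D");  // BM25-style ranking
        System.out.println(fuse(List.of(vector, keyword), List.of(0.7, 0.3), 60));
    }
}
```

With these inputs, A scores 0.7/61 + 0.3/62 ≈ 0.0163 and C scores 0.7/63 + 0.3/61 ≈ 0.0160, so the fused order is A, C, B, D. Documents found by both retrievers rise above B, which only the vector search returned, even though B ranked second there.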

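Cross-Encoder Reranking

Note that the service below does not load a dedicated cross-encoder model. It approximates cross-encoder scoring by asking the chat model for a 0-10 relevance score per candidate, which costs one LLM call per candidate.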

@Service
public class RerankService {

    private final OpenAiChatModel chatModel;

    public RerankService(OpenAiChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public List<SearchResult> rerank(String query, List<SearchResult> candidates, int topK) {
        List<CompletableFuture<ScoredResult>> futures = candidates.stream()
            .map(candidate -> CompletableFuture.supplyAsync(() -> scoreRelevance(query, candidate)))
            .collect(Collectors.toList());

        List<ScoredResult> scored = futures.stream()
            .map(CompletableFuture::join)
            .sorted(Comparator.comparingDouble(ScoredResult::score).reversed())
            .limit(topK)
            .collect(Collectors.toList());

        return scored.stream()
            .map(sr -> new SearchResult(sr.content(), sr.source(), sr.score()))
            .collect(Collectors.toList());
    }

    private ScoredResult scoreRelevance(String query, SearchResult candidate) {
        String prompt = """
            Rate the relevance of the following document to the query. Output only an integer from 0 to 10.

            Query: %s
            Document: %s

            Score:
            """.formatted(query, candidate.getContent());

        ChatResponse response = chatModel.call(new ChatRequest(
            List.of(new Message("user", prompt)),
            ChatOptions.builder().withTemperature(0.0).build()
        ));

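        // Extract the first number in the reply; treat unparseable output as irrelevant (0)
        // instead of letting a NumberFormatException abort the whole rerank.
        String reply = response.getResult().getOutput().getContent();
        java.util.regex.Matcher m = java.util.regex.Pattern.compile("\\d+(\\.\\d+)?").matcher(reply);
        double score = m.find() ? Double.parseDouble(m.group()) : 0.0;
        score = Math.max(0.0, Math.min(10.0, score));
        return new ScoredResult(candidate.getContent(), candidate.getSource(), score / 10.0);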
    }
}

Complete Advanced RAG Pipeline

@Service
public class AdvancedRagPipeline {

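    private final QueryRewriteService queryRewriteService;
    private final HybridRetrievalService hybridRetrievalService;
    private final RerankService rerankService;
    private final EmbeddingService embeddingService;
    private final OpenAiChatModel chatModel;

    // Final fields must be assigned; Spring injects the collaborators through this constructor.
    public AdvancedRagPipeline(QueryRewriteService queryRewriteService,
                               HybridRetrievalService hybridRetrievalService,
                               RerankService rerankService,
                               EmbeddingService embeddingService,
                               OpenAiChatModel chatModel) {
        this.queryRewriteService = queryRewriteService;
        this.hybridRetrievalService = hybridRetrievalService;
        this.rerankService = rerankService;
        this.embeddingService = embeddingService;
        this.chatModel = chatModel;
    }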

    public RagResponse query(String userQuery) {
        List<String> allQueries = new ArrayList<>();
        allQueries.add(userQuery);
        allQueries.addAll(queryRewriteService.rewriteQuery(userQuery));

        String hydeAnswer = queryRewriteService.generateHyde(userQuery);
        allQueries.add(hydeAnswer);

        List<SearchResult> allResults = new ArrayList<>();
        for (String q : allQueries) {
            allResults.addAll(hybridRetrievalService.hybridSearch(q, 10));
        }

        List<SearchResult> deduplicated = deduplicate(allResults);
        List<SearchResult> reranked = rerankService.rerank(userQuery, deduplicated, 5);

        String context = buildContext(reranked);
        String answer = generateAnswer(userQuery, context);

        return new RagResponse(answer, reranked);
    }

    private List<SearchResult> deduplicate(List<SearchResult> results) {
        Map<String, SearchResult> uniqueMap = new LinkedHashMap<>();
        for (SearchResult result : results) {
            uniqueMap.putIfAbsent(result.getContent(), result);
        }
        return new ArrayList<>(uniqueMap.values());
    }

    private String buildContext(List<SearchResult> results) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < results.size(); i++) {
            SearchResult r = results.get(i);
            sb.append("[%d] Source: %s\n%s\n\n".formatted(i + 1, r.getSource(), r.getContent()));
        }
        return sb.toString();
    }

    private String generateAnswer(String query, String context) {
        String systemPrompt = """
            You are a knowledge base Q&A assistant. Answer the user's question based on the retrieved documents below.

            Rules:
            1. Answer only based on the retrieved documents — do not fabricate information
            2. Every claim must cite its source [1][2]...
            3. If the retrieved results are insufficient, state this clearly
            4. Prioritize the most recent and relevant information

            Retrieved Documents:
            %s
            """.formatted(context);

        ChatResponse response = chatModel.call(new ChatRequest(
            List.of(
                new Message("system", systemPrompt),
                new Message("user", query)
            ),
            ChatOptions.builder().withTemperature(0.1).build()
        ));

        return response.getResult().getOutput().getContent();
    }
}
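
The system prompt asks the model to cite every claim as `[n]`, but nothing in the pipeline checks that it did. A lightweight post-check, sketched below as one possible safeguard, verifies that the answer contains at least one citation and that every cited index refers to a document that was actually injected. The `CitationCheck` class is a hypothetical addition, and the check cannot tell whether a cited passage really supports the claim.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-check for answers that are expected to cite sources as [1], [2], ...
class CitationCheck {

    // Up to four digits, which keeps Integer.parseInt safe from overflow.
    private static final Pattern CITATION = Pattern.compile("\\[(\\d{1,4})\\]");

    // Collects the distinct citation indices that appear in the answer.
    static Set<Integer> citedIndices(String answer) {
        Set<Integer> cited = new TreeSet<>();
        Matcher m = CITATION.matcher(answer);
        while (m.find()) {
            cited.add(Integer.parseInt(m.group(1)));
        }
        return cited;
    }

    // True when at least one citation is present and every index lies in 1..docCount.
    static boolean citationsValid(String answer, int docCount) {
        Set<Integer> cited = citedIndices(answer);
        if (cited.isEmpty()) {
            return false;
        }
        for (int index : cited) {
            if (index < 1 || index > docCount) {
                return false;
            }
        }
        return true;
    }
}
```

In the pipeline above, a call such as `CitationCheck.citationsValid(answer, reranked.size())` could decide whether to return the answer, retry generation, or flag the response for review.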

Document Chunking Strategies

Chunking is the foundation of RAG: with poor chunking, even the best retrieval is useless. The table below compares four mainstream strategies:

| Strategy | Principle | Pros | Cons | Use Case | Granularity |
|---|---|---|---|---|---|
| Fixed Length | Split by token/character count | Simple, controllable | Breaks semantic integrity | Logs, tabular data | Fixed chunk_size |
| Semantic Chunking | Embedding similarity breakpoint detection | Good semantic integrity | High compute cost | Technical docs, papers | Adaptive |
| Structure Chunking | Split by headings/sections/paragraphs | Preserves document structure | Requires parser support | Markdown/HTML/PDF | By structure hierarchy |
| Topic Chunking | LLM identifies topic boundaries | Highly cohesive topics | High LLM call cost | Long docs, multi-topic | By topic |

Fixed Length Chunking (Java)

public class FixedLengthChunker {

    private final int chunkSize;
    private final int overlapSize;

    public FixedLengthChunker(int chunkSize, int overlapSize) {
        this.chunkSize = chunkSize;
        this.overlapSize = overlapSize;
    }

    public List<DocumentChunk> chunk(String content, String source) {
        List<DocumentChunk> chunks = new ArrayList<>();
        int start = 0;
        int index = 0;

        while (start < content.length()) {
            int end = Math.min(start + chunkSize, content.length());
            String text = content.substring(start, end);

            if (end < content.length()) {
                int lastPeriod = text.lastIndexOf('.');
                int lastNewline = text.lastIndexOf('\n');
                int breakPoint = Math.max(lastPeriod, lastNewline);
                if (breakPoint > chunkSize / 2) {
                    text = text.substring(0, breakPoint + 1);
                    end = start + breakPoint + 1;
                }
            }

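            chunks.add(new DocumentChunk(text, source, "chunk-" + index, index));
            index++;
            if (end >= content.length()) {
                break; // last chunk emitted; stepping back by the overlap here would loop forever
            }
            // Step back by the overlap, but always advance at least one character.
            start = Math.max(end - overlapSize, start + 1);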
        }

        return chunks;
    }
}
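
Overlap has a direct cost: each chunk after the first only advances by `chunkSize - overlapSize` characters, so more overlap means more chunks to embed and store. For a plain sliding window (ignoring the sentence-boundary adjustment in the chunker above), the count is 1 for texts no longer than `chunkSize`, and otherwise `1 + ceil((length - chunkSize) / (chunkSize - overlapSize))`. The `ChunkMath` helper below is a hypothetical sketch of that arithmetic.

```java
// Hypothetical helper: number of chunks a plain sliding window produces.
class ChunkMath {

    static int chunkCount(int length, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("Require 0 <= overlap < chunkSize");
        }
        if (length <= 0) {
            return 0;
        }
        if (length <= chunkSize) {
            return 1;
        }
        int stride = chunkSize - overlap;                       // characters gained per extra chunk
        return 1 + (length - chunkSize + stride - 1) / stride;  // integer ceiling division
    }

    public static void main(String[] args) {
        // Windows: [0,500), [450,950), [900,1000)
        System.out.println(chunkCount(1000, 500, 50));
    }
}
```

For example, a 1000-character text with `chunkSize = 500` and `overlapSize = 50` yields three windows, while the same text with no overlap yields two.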

Semantic Chunking (Java)

@Service
public class SemanticChunker {

    private final EmbeddingService embeddingService;

    private static final double SIMILARITY_THRESHOLD = 0.85;

    public SemanticChunker(EmbeddingService embeddingService) {
        this.embeddingService = embeddingService;
    }

    public List<DocumentChunk> chunk(String content, String source) {
        List<String> sentences = splitSentences(content);
        if (sentences.isEmpty()) {
            return Collections.emptyList();
        }

        List<float[]> embeddings = embeddingService.embedBatch(sentences);

        List<DocumentChunk> chunks = new ArrayList<>();
        StringBuilder currentChunk = new StringBuilder(sentences.get(0));
        int chunkIndex = 0;

        for (int i = 1; i < sentences.size(); i++) {
            float similarity = EmbeddingService.cosineSimilarity(
                embeddings.get(i - 1), embeddings.get(i)
            );

            if (similarity >= SIMILARITY_THRESHOLD) {
                // Sentences were trimmed, so re-insert a separator when joining them.
                currentChunk.append(' ').append(sentences.get(i));
            } else {
                chunks.add(new DocumentChunk(
                    currentChunk.toString(), source, "chunk-" + chunkIndex, chunkIndex
                ));
                currentChunk = new StringBuilder(sentences.get(i));
                chunkIndex++;
            }
        }

        if (!currentChunk.isEmpty()) {
            chunks.add(new DocumentChunk(
                currentChunk.toString(), source, "chunk-" + chunkIndex, chunkIndex
            ));
        }

        return chunks;
    }

    private List<String> splitSentences(String text) {
        // Split after Latin and CJK sentence terminators so Chinese documents are segmented too.
        return Arrays.stream(text.split("(?<=[.!?。！？])"))
            .map(String::trim)
            .filter(s -> !s.isEmpty())
            .collect(Collectors.toList());
    }
}

Structure Chunking (Markdown)

@Service
public class MarkdownStructureChunker {

    private static final Pattern HEADING_PATTERN = Pattern.compile("^(#{1,6})\\s+(.+)$", Pattern.MULTILINE);

    public List<DocumentChunk> chunk(String markdown, String source) {
        List<Section> sections = parseSections(markdown);

        return sections.stream()
            .map(section -> new DocumentChunk(
                section.content(),
                source,
                section.heading(),
                section.level()
            ))
            .collect(Collectors.toList());
    }

    private List<Section> parseSections(String markdown) {
        List<Section> sections = new ArrayList<>();
        Matcher matcher = HEADING_PATTERN.matcher(markdown);

        List<Integer> positions = new ArrayList<>();
        List<String> headings = new ArrayList<>();
        List<Integer> levels = new ArrayList<>();

        while (matcher.find()) {
            positions.add(matcher.start());
            headings.add(matcher.group(2).trim());
            levels.add(matcher.group(1).length());
        }

        for (int i = 0; i < positions.size(); i++) {
            int start = positions.get(i);
            int end = (i + 1 < positions.size()) ? positions.get(i + 1) : markdown.length();
            String content = markdown.substring(start, end).trim();
            sections.add(new Section(headings.get(i), content, levels.get(i)));
        }

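        // Keep any text before the first heading. When the document has no headings at all,
        // the whole text becomes the preamble instead of being silently dropped.
        int firstHeading = positions.isEmpty() ? markdown.length() : positions.get(0);
        String preamble = markdown.substring(0, firstHeading).trim();
        if (!preamble.isEmpty()) {
            sections.add(0, new Section("Preamble", preamble, 0));
        }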

        return sections;
    }
}

Agentic RAG: Agents Decide When to Retrieve

Agentic RAG is the cutting edge in 2026. The core idea: let the agent itself decide whether to retrieve, what to retrieve, and whether what it has retrieved so far is sufficient.

┌──────────────────────────────────────────────────────────────────┐
│                    Agentic RAG Workflow                           │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  User Query: "Compare the technical architecture of Product A    │
│              vs Product B"                                       │
│       │                                                          │
│       ▼                                                          │
│  ┌────────────────────────────────────────┐                     │
│  │  Agent thinks: Need Product A arch docs │                     │
│  └──────────────────┬─────────────────────┘                     │
│                     ▼                                            │
│  ┌────────────────────────────────────────┐                     │
│  │  Retrieve: Product A docs → Got context │                     │
│  └──────────────────┬─────────────────────┘                     │
│                     ▼                                            │
│  ┌────────────────────────────────────────┐                     │
│  │  Agent evaluates: Still need Product B  │                     │
│  └──────────────────┬─────────────────────┘                     │
│                     ▼                                            │
│  ┌────────────────────────────────────────┐                     │
│  │  Retrieve: Product B docs → Got context │                     │
│  └──────────────────┬─────────────────────┘                     │
│                     ▼                                            │
│  ┌────────────────────────────────────────┐                     │
│  │  Agent evaluates: Sufficient info, can  │                     │
│  │  generate comparison answer             │                     │
│  └──────────────────┬─────────────────────┘                     │
│                     ▼                                            │
│  ┌────────────────────────────────────────┐                     │
│  │  Generate: A/B architecture comparison │                     │
│  │  with citations                        │                     │
│  └────────────────────────────────────────┘                     │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Agentic RAG Core Implementation

@Service
public class AgenticRagService {

    private final OpenAiChatModel chatModel;
    private final HybridRetrievalService retrievalService;
    private final RerankService rerankService;

    private static final int MAX_ITERATIONS = 5;

    public AgenticRagResponse query(String userQuery) {
        List<RetrievalStep> steps = new ArrayList<>();
        List<SearchResult> allContext = new ArrayList<>();

        for (int i = 0; i < MAX_ITERATIONS; i++) {
            // Always decide against the original question. The accumulated context, not the
            // agent's previous thought, carries progress from one iteration to the next.
            AgentDecision decision = decideAction(userQuery, allContext);
            steps.add(new RetrievalStep(i + 1, decision.thought(), decision.action()));

            if ("GENERATE".equals(decision.action())) {
                String answer = generateAnswer(userQuery, allContext);
                return new AgenticRagResponse(answer, allContext, steps);
            }

            if ("SEARCH".equals(decision.action())) {
                List<SearchResult> results = retrievalService.hybridSearch(decision.searchQuery(), 5);
                List<SearchResult> reranked = rerankService.rerank(decision.searchQuery(), results, 3);
                allContext.addAll(reranked);
            }

            if ("INSUFFICIENT".equals(decision.action())) {
                return new AgenticRagResponse(
                    "Sorry, after multiple rounds of retrieval, I couldn't find sufficient information to answer your question.",
                    allContext, steps
                );
            }
        }

        String answer = generateAnswer(userQuery, allContext);
        return new AgenticRagResponse(answer, allContext, steps);
    }

    private AgentDecision decideAction(String query, List<SearchResult> context) {
        String contextStr = context.isEmpty() ? "(No retrieval results yet)" :
            context.stream()
                .map(r -> "- " + r.getContent().substring(0, Math.min(200, r.getContent().length())))
                .collect(Collectors.joining("\n"));

        String prompt = """
            You are a RAG Agent that decides the next action.

            User query: %s
            Current context:
            %s

            Decide the next action:
            - SEARCH: Need more retrieval (provide search_query)
            - GENERATE: Have sufficient info, can generate answer
            - INSUFFICIENT: Cannot find sufficient information

            Output JSON format:
            {"thought": "your reasoning", "action": "SEARCH|GENERATE|INSUFFICIENT", "search_query": "retrieval query (only for SEARCH)"}
            """.formatted(query, contextStr);

        ChatResponse response = chatModel.call(new ChatRequest(
            List.of(new Message("user", prompt)),
            ChatOptions.builder().withTemperature(0.1).build()
        ));

        try {
            Map<String, String> result = new ObjectMapper().readValue(
                response.getResult().getOutput().getContent(), Map.class
            );
            return new AgentDecision(
                result.get("thought"),
                result.get("action"),
                result.getOrDefault("search_query", "")
            );
        } catch (Exception e) {
            return new AgentDecision("Parse failed, attempting to generate answer", "GENERATE", "");
        }
    }

    private String generateAnswer(String query, List<SearchResult> context) {
        String contextStr = context.stream()
            .map(r -> "[Source: " + r.getSource() + "]\n" + r.getContent())
            .collect(Collectors.joining("\n\n"));

        String systemPrompt = """
            Answer the user's question based on the retrieved documents below.
            Rules:
            1. Answer only from retrieved documents — do not fabricate
            2. Cite sources for every claim
            3. State clearly if information is insufficient

            Retrieved Documents:
            %s
            """.formatted(contextStr);

        ChatResponse response = chatModel.call(new ChatRequest(
            List.of(new Message("system", systemPrompt), new Message("user", query)),
            ChatOptions.builder().withTemperature(0.1).build()
        ));

        return response.getResult().getOutput().getContent();
    }
}
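
To see the control flow without an LLM in the way, the loop can be exercised with stubs. This is only a sketch under stated assumptions: `AgentLoopSketch`, `Decision`, `stubDecider`, and `stubRetriever` are hypothetical stand-ins for the chat model and retrieval services, and the context is a plain set of strings rather than `SearchResult` objects.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiFunction;
import java.util.function.Function;

// Illustrative sketch only: the decider and retriever are stubs, not real services.
final class AgentLoopSketch {

    /** Hypothetical decision: action is SEARCH, GENERATE, or INSUFFICIENT. */
    record Decision(String action, String searchQuery) {}

    /** Runs the decide/retrieve loop and returns the trace of actions taken. */
    static List<String> run(String userQuery,
                            BiFunction<String, Set<String>, Decision> decider,
                            Function<String, List<String>> retriever,
                            int maxIterations) {
        Set<String> context = new LinkedHashSet<>(); // a set drops repeated hits across rounds
        List<String> trace = new ArrayList<>();
        for (int i = 0; i < maxIterations; i++) {
            Decision d = decider.apply(userQuery, context);
            if ("SEARCH".equals(d.action())) {
                trace.add("SEARCH:" + d.searchQuery());
                context.addAll(retriever.apply(d.searchQuery()));
            } else {
                trace.add(d.action());
                return trace; // GENERATE or INSUFFICIENT ends the loop
            }
        }
        trace.add("GENERATE"); // iteration budget exhausted: answer with what was gathered
        return trace;
    }

    /** Stub decider: searches for whichever product is still missing from the context. */
    static Decision stubDecider(String query, Set<String> context) {
        if (context.stream().noneMatch(c -> c.contains("Product A"))) {
            return new Decision("SEARCH", "Product A architecture");
        }
        if (context.stream().noneMatch(c -> c.contains("Product B"))) {
            return new Decision("SEARCH", "Product B architecture");
        }
        return new Decision("GENERATE", "");
    }

    /** Stub retriever: returns one canned "document" per search query. */
    static List<String> stubRetriever(String searchQuery) {
        return List.of("doc about " + searchQuery);
    }

    public static void main(String[] args) {
        System.out.println(run("Compare Product A vs Product B",
            AgentLoopSketch::stubDecider, AgentLoopSketch::stubRetriever, 5));
        // prints [SEARCH:Product A architecture, SEARCH:Product B architecture, GENERATE]
    }
}
```

The trace matches the workflow diagram: two retrieval rounds, then generation. Holding the context in a set is one simple way to stop the same passage from being injected twice when successive searches overlap.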

Multimodal RAG: Unified Retrieval for Images, Tables, and Code

In 2026, RAG is no longer just text retrieval. Multimodal RAG enables images, tables, and code to be retrieved and cited as well.

┌──────────────────────────────────────────────────────────────────┐
│                    Multimodal RAG Architecture                   │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Input Document (PDF/HTML/Markdown)                              │
│       │                                                          │
│       ├── Text Extraction ──→ Text Embedding ──→ Vector DB      │
│       │                                                          │
│       ├── Table Extraction ──→ Table→Text Desc ──→ Embedding    │
│       │                                            → Vector DB  │
│       │                                                          │
│       ├── Image Extraction ──→ Vision Embedding ──→ Vector DB   │
│       │                       (CLIP/Qwen-VL)                     │
│       │                                                          │
│       └── Code Extraction ──→ Code Embedding ──→ Vector DB      │
│                               (CodeBERT/Specialized Model)       │
│                                                                  │
│  At Query Time: Unified Vector Search → Multimodal Result       │
│  Fusion → LLM Generation                                        │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Multimodal Document Processing

@Service
public class MultimodalDocumentProcessor {

    private final EmbeddingService textEmbeddingService;
    private final OpenAiChatModel visionModel;

    public List<DocumentChunk> processDocument(Document document) {
        List<DocumentChunk> chunks = new ArrayList<>();

        chunks.addAll(processText(document.getTextContent(), document.getSource()));
        chunks.addAll(processTables(document.getTables(), document.getSource()));
        chunks.addAll(processImages(document.getImages(), document.getSource()));
        chunks.addAll(processCodeBlocks(document.getCodeBlocks(), document.getSource()));

        return chunks;
    }

    private List<DocumentChunk> processText(String text, String source) {
        SemanticChunker chunker = new SemanticChunker(textEmbeddingService);
        return chunker.chunk(text, source);
    }

    private List<DocumentChunk> processTables(List<Table> tables, String source) {
        return tables.stream()
            .map(table -> {
                String description = convertTableToText(table);
                return new DocumentChunk(
                    description, source,
                    "table-" + table.getIndex(),
                    table.getIndex(),
                    "TABLE"
                );
            })
            .collect(Collectors.toList());
    }

    private String convertTableToText(Table table) {
        StringBuilder sb = new StringBuilder();
        sb.append("Table Description: ").append(table.getCaption()).append("\n");

        List<String> headers = table.getHeaders();
        sb.append("Columns: ").append(String.join(", ", headers)).append("\n");

        for (List<String> row : table.getRows()) {
            for (int i = 0; i < headers.size() && i < row.size(); i++) {
                sb.append(headers.get(i)).append(": ").append(row.get(i)).append("; ");
            }
            sb.append("\n");
        }

        return sb.toString();
    }

    private List<DocumentChunk> processImages(List<DocumentImage> images, String source) {
        return images.stream()
            .map(image -> {
                String description = describeImage(image);
                return new DocumentChunk(
                    "[Image] " + description, source,
                    "image-" + image.getIndex(),
                    image.getIndex(),
                    "IMAGE"
                );
            })
            .collect(Collectors.toList());
    }

    private String describeImage(DocumentImage image) {
        String prompt = "Describe the content of this image in detail, including chart data and key information.";

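        // Note: inlining base64 into the text prompt is a placeholder. A vision model only "sees" the
        // image when it is attached as an image/media part of the message via the chat client's API.
        ChatResponse response = visionModel.call(new ChatRequest(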
            List.of(new Message("user", prompt + "\n[Image base64: " + image.getBase64() + "]")),
            ChatOptions.builder().withTemperature(0.1).build()
        ));

        return response.getResult().getOutput().getContent();
    }

    private List<DocumentChunk> processCodeBlocks(List<CodeBlock> codeBlocks, String source) {
        return codeBlocks.stream()
            .map(code -> new DocumentChunk(
                "[Code] " + code.getLanguage() + "\n" + code.getContent() +
                "\nDescription: " + code.getDescription(),
                source,
                "code-" + code.getIndex(),
                code.getIndex(),
                "CODE"
            ))
            .collect(Collectors.toList());
    }
}
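
It helps to look at what the table flattening actually produces, since that text is what gets embedded. The sketch below reimplements the same row-by-row flattening on plain lists so it runs on its own; `TableToTextDemo` and the sample "Plan limits" table are made up for illustration and do not come from the `Table` type used above.

```java
import java.util.List;

// Illustrative sketch only: plain lists stand in for the article's Table type.
final class TableToTextDemo {

    /** Flattens a table into "header: value; " pairs, one line per row. */
    static String tableToText(String caption, List<String> headers, List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        // A missing caption is written as an empty string rather than the literal "null".
        sb.append("Table Description: ").append(caption == null ? "" : caption).append("\n");
        sb.append("Columns: ").append(String.join(", ", headers)).append("\n");
        for (List<String> row : rows) {
            // Stop at the shorter of the two so ragged rows do not throw.
            for (int i = 0; i < headers.size() && i < row.size(); i++) {
                sb.append(headers.get(i)).append(": ").append(row.get(i)).append("; ");
            }
            sb.append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = tableToText(
            "Plan limits",
            List.of("Plan", "Seats"),
            List.of(List.of("Basic", "5"), List.of("Pro", "50"))
        );
        System.out.print(text);
    }
}
```

Each row becomes a line such as `Plan: Basic; Seats: 5;`, which repeats the column names next to every value. That repetition is the point: a query like "how many seats does Pro have" can match a single row without the embedding model having to reconstruct the grid.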

RAG Evaluation Framework: Quantitative Assessment

Without evaluation there is nothing to optimize against. RAG evaluation needs metrics for retrieval quality, generation quality, and end-to-end behavior:

Evaluation Metrics Framework

| Dimension | Metric | Description | Calculation | Target |
|---|---|---|---|---|
| Retrieval | Recall@K | Share of all relevant docs that appear in the top K results | (Relevant ∩ Top-K Retrieved) / Total Relevant | > 90% |
| Retrieval | MRR | Mean Reciprocal Rank of the first relevant doc | avg(1 / first_relevant_rank) | > 0.8 |
| Retrieval | nDCG@K | Normalized Discounted Cumulative Gain | DCG / IDCG | > 0.85 |
| Generation | Faithfulness | Answer's faithfulness to retrieved content | Supported claims / Total claims | > 95% |
| Generation | Relevancy | Answer's relevance to the query | LLM-assessed score, normalized to 0-1 | > 0.9 |
| Generation | Correctness | Answer's consistency with ground truth | Semantic similarity to reference | > 0.85 |
| End-to-End | Latency | Total time from query to answer | P95 latency | < 3s |
| End-to-End | Refusal Accuracy | Correctly refusing unanswerable questions | Correct refusals / Should-refuse total | > 80% |
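
The three retrieval metrics are simple enough to compute directly from ranked document IDs. Here is a minimal sketch assuming binary relevance (a document is either relevant or not) and unique IDs in each ranking; the class `RetrievalMetrics` and the sample IDs are illustrative, not part of any library.

```java
import java.util.List;
import java.util.Set;

// Illustrative sketch only: binary relevance, unique document IDs per ranking.
final class RetrievalMetrics {

    /** Recall@K = |relevant ∩ top-K retrieved| / |relevant|. */
    static double recallAtK(List<String> retrieved, Set<String> relevant, int k) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        long hits = retrieved.stream().limit(k).distinct().filter(relevant::contains).count();
        return (double) hits / relevant.size();
    }

    /** Reciprocal rank of the first relevant doc (0 if none). MRR is the mean of this over all queries. */
    static double reciprocalRank(List<String> retrieved, Set<String> relevant) {
        for (int i = 0; i < retrieved.size(); i++) {
            if (relevant.contains(retrieved.get(i))) {
                return 1.0 / (i + 1);
            }
        }
        return 0.0;
    }

    /** Discount for a 1-based rank: 1 / log2(rank + 1). */
    private static double discount(int rank) {
        return 1.0 / (Math.log(rank + 1) / Math.log(2));
    }

    /** nDCG@K with binary gains: DCG over the actual ranking divided by DCG of an ideal ranking. */
    static double ndcgAtK(List<String> retrieved, Set<String> relevant, int k) {
        double dcg = 0.0;
        for (int i = 0; i < Math.min(k, retrieved.size()); i++) {
            if (relevant.contains(retrieved.get(i))) {
                dcg += discount(i + 1);
            }
        }
        double idcg = 0.0;
        for (int i = 0; i < Math.min(k, relevant.size()); i++) {
            idcg += discount(i + 1);
        }
        return idcg == 0.0 ? 0.0 : dcg / idcg;
    }

    public static void main(String[] args) {
        List<String> retrieved = List.of("d3", "d1", "d7", "d2");
        Set<String> relevant = Set.of("d1", "d2");
        System.out.println(recallAtK(retrieved, relevant, 3));   // prints 0.5
        System.out.println(reciprocalRank(retrieved, relevant)); // prints 0.5
        System.out.println(ndcgAtK(retrieved, relevant, 4));     // roughly 0.65
    }
}
```

In the sample, only one of the two relevant documents is in the top 3, so Recall@3 is 0.5, and the first relevant hit sits at rank 2, so the reciprocal rank is also 0.5. nDCG additionally penalizes `d2` for appearing at rank 4 instead of rank 2.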

Evaluation Framework Implementation

@Service
public class RagEvaluationService {

    private final OpenAiChatModel chatModel;

    public EvaluationResult evaluate(RagResponse response, String groundTruth) {
        double faithfulness = evaluateFaithfulness(response);
        double relevancy = evaluateRelevancy(response);
        double correctness = evaluateCorrectness(response, groundTruth);

        return new EvaluationResult(faithfulness, relevancy, correctness);
    }

    private double evaluateFaithfulness(RagResponse response) {
        String prompt = """
            Evaluate the faithfulness of the following answer to the retrieved content.

            Retrieved Content:
            %s

            Answer:
            %s

            Extract each claim in the answer and determine if it can be supported by the retrieved content.
            Output JSON: {"total_claims": N, "supported_claims": M}
            """.formatted(
                response.getSources().stream()
                    .map(SearchResult::getContent)
                    .collect(Collectors.joining("\n")),
                response.getAnswer()
            );

        ChatResponse llmResponse = chatModel.call(new ChatRequest(
            List.of(new Message("user", prompt)),
            ChatOptions.builder().withTemperature(0.0).build()
        ));

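        try {
            Map<String, Integer> result = new ObjectMapper().readValue(
                llmResponse.getResult().getOutput().getContent(), Map.class
            );
            int total = result.get("total_claims");
            int supported = result.get("supported_claims");
            // An answer with no extractable claims would otherwise produce NaN (0 / 0).
            if (total <= 0) {
                return 0.0;
            }
            return Math.min(1.0, (double) supported / total);
        } catch (Exception e) {
            // Unparseable judge output is scored 0.0; log it so it is not mistaken for an unfaithful answer.
            return 0.0;
        }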
    }

    private double evaluateRelevancy(RagResponse response) {
        String prompt = """
            Rate the relevance of the following answer to the query (0-10).

            Query: %s
            Answer: %s

            Output only an integer from 0 to 10.
            """.formatted(response.getQuery(), response.getAnswer());

        ChatResponse llmResponse = chatModel.call(new ChatRequest(
            List.of(new Message("user", prompt)),
            ChatOptions.builder().withTemperature(0.0).build()
        ));

        try {
            return Double.parseDouble(llmResponse.getResult().getOutput().getContent().trim()) / 10.0;
        } catch (Exception e) {
            return 0.0;
        }
    }

    private double evaluateCorrectness(RagResponse response, String groundTruth) {
        String prompt = """
            Rate the semantic consistency of the following answer with the reference answer (0-10).

            Answer: %s
            Reference: %s

            Output only an integer from 0 to 10.
            """.formatted(response.getAnswer(), groundTruth);

        ChatResponse llmResponse = chatModel.call(new ChatRequest(
            List.of(new Message("user", prompt)),
            ChatOptions.builder().withTemperature(0.0).build()
        ));

        try {
            return Double.parseDouble(llmResponse.getResult().getOutput().getContent().trim()) / 10.0;
        } catch (Exception e) {
            return 0.0;
        }
    }
}
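
The two end-to-end metrics from the table, P95 latency and refusal accuracy, need no LLM judge and can be computed from logged outcomes. The sketch below is one plausible way to do it: `EndToEndMetrics` and `Outcome` are hypothetical names, and the nearest-rank definition of a percentile is an assumption, since monitoring systems differ in how they interpolate.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: nearest-rank percentile, hypothetical Outcome record.
final class EndToEndMetrics {

    /** P95 latency by the nearest-rank method: the sample at rank ceil(0.95 * n). */
    static long p95(long[] latenciesMs) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        // Integer form of ceil(95 * n / 100), which avoids floating-point rounding.
        int rank = (95 * sorted.length + 99) / 100;
        return sorted[rank - 1];
    }

    /** One evaluated query: whether it should have been refused, and whether it was. */
    record Outcome(boolean shouldRefuse, boolean refused) {}

    /** Refusal accuracy = correct refusals / queries that should have been refused. */
    static double refusalAccuracy(List<Outcome> outcomes) {
        long shouldRefuse = outcomes.stream().filter(Outcome::shouldRefuse).count();
        if (shouldRefuse == 0) {
            return 0.0;
        }
        long correct = outcomes.stream()
            .filter(o -> o.shouldRefuse() && o.refused())
            .count();
        return (double) correct / shouldRefuse;
    }
}
```

Refusal accuracy as defined here only looks at questions that should be refused. A system that refuses everything would score 100%, so in practice it is worth tracking the rate of wrongly refused answerable questions alongside it.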

Production-Grade RAG Architecture and Performance Optimization

Production Architecture Overview

┌──────────────────────────────────────────────────────────────────────┐
│                    Production-Grade RAG Architecture                 │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌────────────────────────────────────────────────────────────┐     │
│  │                    API Gateway (Spring Cloud Gateway)       │     │
│  │    Auth(JWT) │ Rate Limit(Sentinel) │ Cache(Redis) │ Logs  │     │
│  └────────────────────────────┬───────────────────────────────┘     │
│                               │                                      │
│  ┌────────────────────────────▼───────────────────────────────┐     │
│  │                    RAG Service (Spring Boot)                │     │
│  │    ┌──────────┐  ┌──────────┐  ┌──────────┐               │     │
│  │    │  Query   │  │  Hybrid  │  │  Rerank  │               │     │
│  │    │  Rewrite │→ │  Search  │→ │  Service │               │     │
│  │    │  Service │  │  Service │  │          │               │     │
│  │    └──────────┘  └──────────┘  └──────────┘               │     │
│  │                                        │                    │     │
│  │    ┌──────────┐  ┌──────────┐          ▼                    │     │
│  │    │Evaluation│  │  Cache   │  ┌──────────┐               │     │
│  │    │ Service  │  │ Service  │  │Generate  │               │     │
│  │    └──────────┘  └──────────┘  │ Service  │               │     │
│  │                                └──────────┘               │     │
│  └────────────────────────────┬───────────────────────────────┘     │
│                               │                                      │
│       ┌───────────────────────┼───────────────────────┐             │
│       ▼                       ▼                       ▼             │
│  ┌──────────┐          ┌──────────┐          ┌──────────┐          │
│  │ Qdrant   │          │Elastic-  │          │ Redis    │          │
│  │ Vector DB│          │search    │          │ Cache    │          │
│  │          │          │ BM25 Idx │          │ Layer    │          │
│  └──────────┘          └──────────┘          └──────────┘          │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────┐     │
│  │              Document Ingestion Pipeline (Async)            │     │
│  │  Upload → Parse → Chunk → Embed → Index → Metadata Store   │     │
│  │  (Kafka-driven, supports incremental updates)               │     │
│  └────────────────────────────────────────────────────────────┘     │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────┐     │
│  │                  Monitoring & Observability                 │     │
│  │  Prometheus │ Grafana Dashboards │ Alerts │ Quality Tracing │     │
│  └────────────────────────────────────────────────────────────┘     │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

Cache Layer Optimization

@Service
public class RagCacheService {

    private final RedisTemplate<String, String> redisTemplate;
    private final ObjectMapper objectMapper;

    private static final long CACHE_TTL_HOURS = 24;

    public Optional<RagResponse> getCachedResponse(String query) {
        String cacheKey = generateCacheKey(query);
        String cached = redisTemplate.opsForValue().get(cacheKey);

        if (cached != null) {
            try {
                return Optional.of(objectMapper.readValue(cached, RagResponse.class));
            } catch (Exception e) {
                return Optional.empty();
            }
        }
        return Optional.empty();
    }

    public void cacheResponse(String query, RagResponse response) {
        String cacheKey = generateCacheKey(query);
        try {
            String json = objectMapper.writeValueAsString(response);
            redisTemplate.opsForValue().set(cacheKey, json, CACHE_TTL_HOURS, TimeUnit.HOURS);
        } catch (Exception e) {
            // Cache failure should not affect the main flow
        }
    }

    private String generateCacheKey(String query) {
        return "rag:cache:" + DigestUtils.md5Hex(query);
    }
}
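
A hash of the raw query only hits when two queries are byte-for-byte identical. One cheap improvement is to normalize the query before hashing, sketched below with the JDK's own `MessageDigest`. The class `CacheKeySketch`, the normalization rules, and the switch to SHA-256 are all assumptions for illustration. This is still an exact-match cache: matching genuinely similar queries would require a semantic cache keyed on embeddings.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Locale;

// Illustrative sketch only: normalization rules and class name are hypothetical.
final class CacheKeySketch {

    /** Trims, lowercases, and collapses whitespace so trivially different queries share a key. */
    static String normalize(String query) {
        return query.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    /** Builds a Redis key from the SHA-256 digest of the normalized query. */
    static String cacheKey(String query) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(normalize(query).getBytes(StandardCharsets.UTF_8));
            return "rag:cache:" + HexFormat.of().formatHex(hash);
        } catch (NoSuchAlgorithmException e) {
            // Every compliant JDK ships SHA-256, so this should not happen in practice.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

With this, `"  What is RAG? "` and `"what is   rag?"` map to the same key. If answers depend on the user's permissions or on the knowledge-base version, those values would also need to be part of the key.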

Document Ingestion Pipeline

@Service
public class DocumentIngestionPipeline {

    private final DocumentParserService parserService;
    private final SemanticChunker semanticChunker;
    private final MarkdownStructureChunker structureChunker;
    private final EmbeddingService embeddingService;
    private final QdrantVectorStore vectorStore;
    private final DocumentMapper documentMapper;

    @Async("ingestionExecutor")
    public CompletableFuture<Void> ingestDocument(MultipartFile file, String source) {
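        // Caveat: a MultipartFile's temporary storage is released once the HTTP request completes,
        // so for truly asynchronous ingestion the bytes should be read or copied before handing off.
        String content = parserService.parse(file);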

        List<DocumentChunk> chunks;
        if (isMarkdown(content)) {
            chunks = structureChunker.chunk(content, source);
        } else {
            chunks = semanticChunker.chunk(content, source);
        }

        List<float[]> embeddings = embeddingService.embedBatch(
            chunks.stream().map(DocumentChunk::getContent).collect(Collectors.toList())
        );

        for (int i = 0; i < chunks.size(); i++) {
            chunks.get(i).setEmbedding(embeddings.get(i));
        }

        vectorStore.upsertDocuments(chunks);

        for (DocumentChunk chunk : chunks) {
            DocumentEntity entity = new DocumentEntity();
            entity.setContent(chunk.getContent());
            entity.setSource(chunk.getSource());
            entity.setSection(chunk.getSection());
            entity.setEmbedding(chunk.getEmbedding());
            entity.setSearchText(chunk.getContent());
            documentMapper.insert(entity);
        }

        return CompletableFuture.completedFuture(null);
    }

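    private boolean isMarkdown(String content) {
        // Heuristic: an ATX heading at the start of a line, or a fenced code block.
        // A bare contains("# ") would also match ordinary prose such as "C# is ...".
        return Pattern.compile("(?m)^#{1,6}\\s+\\S").matcher(content).find()
            || content.contains("```");
    }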
}
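
The pipeline above sends every chunk of a document to the embedding service in a single call, which can exceed provider request limits for large documents. A minimal sketch of batching, assuming a batch size such as the `batch-size: 100` setting in the configuration: `BatchingSketch`, `partition`, and `embedInBatches` are hypothetical helpers, and the `embedBatch` function parameter stands in for the real embedding call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Illustrative sketch only: embedBatch is a stand-in for the real embedding service.
final class BatchingSketch {

    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    /** Embeds texts batch by batch and returns the vectors in the original order. */
    static List<float[]> embedInBatches(List<String> texts, int batchSize,
                                        Function<List<String>, List<float[]>> embedBatch) {
        List<float[]> all = new ArrayList<>(texts.size());
        for (List<String> batch : partition(texts, batchSize)) {
            List<float[]> vectors = embedBatch.apply(batch);
            // Guard against a provider returning fewer vectors than inputs.
            if (vectors.size() != batch.size()) {
                throw new IllegalStateException("Embedding count does not match batch size");
            }
            all.addAll(vectors);
        }
        return all;
    }
}
```

Because batches are processed in order and appended in order, the i-th vector still belongs to the i-th chunk, which is what the index-based `setEmbedding` loop in the pipeline relies on.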

Performance Optimization Checklist

| Optimization | Method | Impact |
|---|---|---|
| Embedding Cache | Reuse embeddings for identical text | 50%+ fewer API calls |
| Query Cache | Redis cache for repeated queries | P95 latency reduced 60% |
| Batch Embedding | Merge multiple texts in one call | 3-5x throughput increase |
| Async Ingestion | Kafka-driven async document processing | Ingestion doesn't block queries |
| Connection Pool | Qdrant/ES connection pool tuning | 2x concurrency capacity |
| Pre-computed HyDE | Pre-generate HyDE embeddings for hot queries | 40% latency reduction for hot queries |
| Index Tuning | HNSW parameter optimization (M=16, ef=100) | Balance precision and speed |
| Sharding Strategy | Shard by document type/time | 30% faster by narrowing search scope |
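
Of these, the embedding cache is the easiest to prototype. Below is a minimal in-memory LRU sketch built on `LinkedHashMap`. `EmbeddingCacheSketch` and its `embedder` function are assumptions for illustration; a production setup would more likely key on a hash of the text and store vectors in Redis so that all service instances share them.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch only: in-memory LRU; embedder stands in for the real embedding call.
final class EmbeddingCacheSketch {

    private final Function<String, float[]> embedder;
    private final Map<String, float[]> cache;
    private int misses = 0;

    EmbeddingCacheSketch(int maxEntries, Function<String, float[]> embedder) {
        this.embedder = embedder;
        // accessOrder=true orders entries from least to most recently used,
        // and removeEldestEntry evicts the least recently used one when full.
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, float[]> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached vector for identical text, calling the embedder only on a miss. */
    synchronized float[] embed(String text) {
        float[] cached = cache.get(text);
        if (cached != null) {
            return cached;
        }
        misses++;
        float[] fresh = embedder.apply(text);
        cache.put(text, fresh);
        return fresh;
    }

    /** Number of calls that had to go to the underlying embedder. */
    synchronized int misses() {
        return misses;
    }
}
```

The methods are `synchronized` because an access-ordered `LinkedHashMap` mutates its internal ordering even on `get`, so it is not safe for concurrent reads.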

Spring Boot Configuration

spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: gpt-4o
          temperature: 0.1
      embedding:
        options:
          model: text-embedding-3-small

rag:
  vector-store:
    type: qdrant
    qdrant:
      host: localhost
      port: 6334
      collection: knowledge_base
      vector-size: 1536
  chunking:
    default-strategy: semantic
    chunk-size: 512
    overlap-size: 64
    similarity-threshold: 0.85
  retrieval:
    hybrid: true
    vector-weight: 0.7
    keyword-weight: 0.3
    rrf-k: 60
    top-k: 10
  cache:
    enabled: true
    ttl-hours: 24
  ingestion:
    async: true
    batch-size: 100
    pool-size: 4
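
To make the retrieval settings concrete, here is one way `vector-weight`, `keyword-weight`, and `rrf-k` could combine. This is an assumption, not necessarily how the retrieval service applies them: it uses a weighted variant of Reciprocal Rank Fusion in which each list contributes `weight / (k + rank)` per document. `WeightedRrfSketch` and the document IDs are illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch only: weighted RRF is an assumed reading of the config values.
final class WeightedRrfSketch {

    /** Adds weight / (k + rank) for each document in a ranked list (rank starts at 1). */
    static void accumulate(Map<String, Double> scores, List<String> ranked, double weight, int k) {
        for (int i = 0; i < ranked.size(); i++) {
            scores.merge(ranked.get(i), weight / (k + i + 1), Double::sum);
        }
    }

    /** Fuses two rankings and returns the topK document IDs by descending fused score. */
    static List<String> fuse(List<String> vectorHits, List<String> keywordHits,
                             double vectorWeight, double keywordWeight, int k, int topK) {
        Map<String, Double> scores = new HashMap<>();
        accumulate(scores, vectorHits, vectorWeight, k);
        accumulate(scores, keywordHits, keywordWeight, k);
        return scores.entrySet().stream()
            // Highest score first; ties broken by ID so the result is deterministic.
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Double>comparingByKey()))
            .limit(topK)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> vectorHits = List.of("d1", "d2", "d3");
        List<String> keywordHits = List.of("d3", "d4");
        System.out.println(fuse(vectorHits, keywordHits, 0.7, 0.3, 60, 3)); // prints [d3, d1, d2]
    }
}
```

In the example, `d3` is only third in the vector list but first in the keyword list, and appearing in both lifts it above `d1`. With `k = 60` the per-rank differences are small, so agreement between retrievers tends to matter more than position within a single list.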

Summary

| Component | 2026 Best Practice | Key Takeaway |
|---|---|---|
| Query Processing | Query rewriting + HyDE | Multi-angle retrieval boosts recall |
| Retrieval Strategy | Hybrid (vector + BM25) + RRF fusion | Vector 70% + Keyword 30% |
| Ranking Optimization | Cross-Encoder reranking | Fine-grained query-doc relevance scoring |
| Document Chunking | Semantic > Structure > Fixed | Chunking is RAG's foundation |
| Agentification | Agentic RAG multi-turn + self-evaluation | Agent decides when to retrieve |
| Multimodal | Unified image/table/code retrieval | Vision embeddings + text descriptions |
| Evaluation | Faithfulness/Relevancy/Correctness | No evaluation, no optimization |
| Productionization | Cache + async + monitoring + tuning | Engineering determines RAG success |

RAG is more than "retrieve + generate": it is a systems engineering problem in which every stage, from query rewriting and hybrid retrieval to document chunking and agentic reasoning, shapes the quality of the final answer. In 2026, Advanced RAG is the standard, Agentic RAG is the frontier, and evaluation frameworks are the foundation for continuous improvement.
