RAG (Retrieval-Augmented Generation) in Depth: Three Generations from Naive RAG to Agentic RAG, and Enterprise-Grade Deployment

Why RAG Is Required Learning for Enterprise AI

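Large language models, however powerful, still have three hard limitations. If you are still calling a bare LLM API for enterprise Q&A in 2026, you have almost certainly run into them: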

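Limitation	Symptom	Business impact
Knowledge cutoff	Training data stops at a fixed point in time	Cannot answer questions about the latest policies or products
Hallucination	Fabricates plausible-sounding facts that do not exist	Misleads decisions; creates legal risk
Missing private knowledge	Does not know internal documents, processes, or terminology	A general-purpose model cannot replace domain-specific Q&A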

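The essence of RAG (Retrieval-Augmented Generation): have the model "look things up" before it answers, and use the retrieved results to constrain generation. This tackles all three limitations at the source. Retrieval supplies current and private knowledge, and grounding the answer in retrieved text reduces, though does not eliminate, hallucination.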

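Data point: among enterprise AI projects deployed in 2026, 92% used a RAG architecture, while pure prompt-based solutions accounted for less than 5%.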

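┌──────────────────────────────────────────────────────────────┐
│      Without RAG vs. with RAG: the essential difference      │
├──────────────────────────────┬───────────────────────────────┤
│  Plain LLM call              │  RAG-augmented call           │
├──────────────────────────────┼───────────────────────────────┤
│  Question → LLM → answer     │  Question → retrieve →        │
│  (answers from memory)       │  inject context → LLM →       │
│                              │  answer + cited sources       │
│                              │  (looks things up first)      │
├──────────────────────────────┼───────────────────────────────┤
│  Knowledge: training cutoff  │  Knowledge: live retrieval,   │
│                              │  updatable at any time        │
│  Accuracy: uncontrolled      │  Accuracy: constrained by     │
│                              │  retrieved results            │
│  Traceability: none          │  Traceability: every answer   │
│                              │  carries citations            │
└──────────────────────────────┴───────────────────────────────┘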
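The "look it up first" flow on the right can be sketched without any external service. The following is a minimal, hypothetical illustration (`NaiveRagSketch` is not from any library). It assumes a tiny in-memory corpus and uses word overlap as a crude stand-in for embedding similarity. It then builds a prompt that restricts the model to numbered sources. The actual LLM call is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of retrieve -> inject context; the LLM call itself is omitted.
class NaiveRagSketch {

    record Doc(String source, String text) {}

    // Toy ASCII word tokenizer; it does not segment Chinese text
    static Set<String> words(String s) {
        return new HashSet<>(Arrays.asList(s.toLowerCase(Locale.ROOT).split("\\W+")));
    }

    // Number of distinct words shared by query and document (stand-in for vector similarity)
    static int overlap(String query, String text) {
        Set<String> shared = words(query);
        shared.retainAll(words(text));
        return shared.size();
    }

    // Return the topK documents with the highest overlap score
    static List<Doc> retrieve(String query, List<Doc> corpus, int topK) {
        List<Doc> ranked = new ArrayList<>(corpus);
        ranked.sort(Comparator.comparingInt((Doc d) -> overlap(query, d.text())).reversed());
        return ranked.subList(0, Math.min(topK, ranked.size()));
    }

    // Build a prompt that constrains the model to the retrieved, numbered sources
    static String buildPrompt(String query, List<Doc> hits) {
        StringBuilder sb = new StringBuilder("Answer only from the sources below and cite them as [n].\n\n");
        for (int i = 0; i < hits.size(); i++) {
            sb.append('[').append(i + 1).append("] (").append(hits.get(i).source()).append(") ")
              .append(hits.get(i).text()).append('\n');
        }
        return sb.append("\nQuestion: ").append(query).toString();
    }

    public static void main(String[] args) {
        List<Doc> corpus = List.of(
            new Doc("policy.md", "The refund window is 30 days from delivery."),
            new Doc("shipping.md", "Standard shipping takes 5 business days."));
        String question = "How many days is the refund window?";
        System.out.println(buildPrompt(question, retrieve(question, corpus, 1)));
    }
}
```

To get the Naive RAG baseline described in the next section, you would make two swaps. Replace `overlap` with cosine similarity over real embeddings, and send the prompt to a chat model.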

Naive RAG → Advanced RAG → Agentic RAG: Three Generations of Evolution

RAG has not stood still: between 2023 and 2026 its architecture went through three generations:

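┌───────────────────────────────────────────────────────────────────────┐
│               RAG evolution roadmap: three generations                │
├───────────────────────┬───────────────────────┬───────────────────────┤
│ Naive RAG             │ Advanced RAG          │ Agentic RAG           │
│ (2023)                │ (2024-2025)           │ (2026)                │
├───────────────────────┼───────────────────────┼───────────────────────┤
│ Query → retrieve      │ Query rewrite →       │ Autonomous agent      │
│ → generate            │ hybrid retrieval →    │ decisions →           │
│                       │ rerank → generate     │ multi-round retrieval │
│                       │                       │ + self-evaluation →   │
│                       │                       │ dynamic tool calls    │
├───────────────────────┼───────────────────────┼───────────────────────┤
│ Issue: poor retrieval │ Issue: not autonomous │ Issue: high complexity│
│ Hallucination 30%+    │ Hallucination 10-15%  │ Hallucination <5%     │
│ Uncontrollable        │ Controllable, passive │ Active reasoning and  │
│                       │                       │ self-correction       │
└───────────────────────┴───────────────────────┴───────────────────────┘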

Comparing the three generations

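Aspect	Naive RAG	Advanced RAG	Agentic RAG
Query handling	Raw query used directly for retrieval	Query rewriting / expansion / HyDE	Agent decomposes the question into sub-questions
Retrieval strategy	Single-path vector retrieval	Hybrid retrieval + reranking	Multi-round retrieval + tool calls
Generation strategy	Context concatenated as-is	Curated context + citations	Self-evaluation + iterative refinement
Typical hallucination rate	30-40%	10-15%	<5%
End-to-end latency	1-2 s	2-4 s	5-15 s
Best suited for	Demos / prototypes	Production	Complex reasoning scenarios
Implementation complexity	Low	Medium	High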

Choosing and Evaluating Embedding Models

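The embedding model is RAG's "eyes". Choose the wrong one and retrieval quality falls apart. Here is a side-by-side look at the mainstream embedding models in 2026: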

Model	Dimensions	MTEB score	Chinese MTEB	Price per 1M tokens	Deployment
text-embedding-3-large	3072	64.6	64.2	$0.13	API
text-embedding-3-small	1536	62.3	58.7	$0.02	API
bge-m3	1024	65.8	70.1	Free (open weights)	Local / API
gte-Qwen2-1.5B	1536	67.2	72.5	Free (open weights)	Local / API
gte-Qwen2-7B	3584	70.1	75.8	Free (open weights)	Local (GPU required)
Cohere embed-v4	1024	66.1	61.3	$0.10	API
Voyage-3	1024	67.8	63.9	$0.12	API

Selection guide

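Scenario	Recommended model	Rationale
Chinese-first enterprise knowledge base	gte-Qwen2-1.5B	Second-highest Chinese MTEB in the table, behind only gte-Qwen2-7B at about a fifth of its size; free to deploy locally
Mixed-language content	bge-m3	Native multilingual support; 1024 dimensions keep cost low
Best possible quality	gte-Qwen2-7B	7B parameters and the top scores in the table; needs a GPU
Fast launch, no infrastructure to run	text-embedding-3-small	Hosted API at $0.02 per 1M tokens
Ample budget, stability first	text-embedding-3-large	OpenAI ecosystem; 3072-dimensional vectors for higher precision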
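Dimension count also affects storage. Take a rough worked example, assuming uncompressed float32 vectors (4 bytes per dimension) and ignoring index and metadata overhead. Raw size is N × d × 4 bytes, so one million vectors need about 4.1 GB at 1024 dimensions, 6.1 GB at 1536, and 12.3 GB at 3072. A small calculator for that arithmetic (the class name is hypothetical):

```java
import java.util.Locale;

// Raw float32 storage for N vectors of dimension d, in decimal gigabytes (1 GB = 1e9 bytes).
// Assumption: no quantization and no index or metadata overhead, so real usage is higher.
class VectorStorageEstimate {

    static final int BYTES_PER_FLOAT32 = 4;

    static double rawGigabytes(long vectorCount, int dimensions) {
        long bytes = vectorCount * dimensions * BYTES_PER_FLOAT32;
        return bytes / 1e9;
    }

    public static void main(String[] args) {
        for (int d : new int[] {1024, 1536, 3072, 3584}) {
            // prints 4.096, 6.144, 12.288 and 14.336 GB for one million vectors
            System.out.printf(Locale.ROOT, "%d dims x 1M vectors = %.3f GB%n", d, rawGigabytes(1_000_000L, d));
        }
    }
}
```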

Calling an OpenAI-compatible embedding API from Java

import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.MediaType;
import org.springframework.web.client.RestTemplate;

// Calls an OpenAI-compatible POST {apiUrl}/embeddings endpoint.
public class EmbeddingService {

    private final RestTemplate restTemplate;
    private final String embeddingApiUrl;   // e.g. https://api.openai.com/v1
    private final String embeddingModel;    // e.g. text-embedding-3-small
    private final String apiKey;            // sent as a Bearer token

    public EmbeddingService(String apiUrl, String model, String apiKey) {
        this.restTemplate = new RestTemplate();
        this.embeddingApiUrl = apiUrl;
        this.embeddingModel = model;
        this.apiKey = apiKey;
    }

    public float[] embed(String text) {
        return embedBatch(List.of(text)).get(0);
    }

    @SuppressWarnings("unchecked")
    public List<float[]> embedBatch(List<String> texts) {
        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_JSON);
        headers.setBearerAuth(apiKey);
        Map<String, Object> request = Map.of("model", embeddingModel, "input", texts);

        Map<String, Object> body = restTemplate.postForObject(
            embeddingApiUrl + "/embeddings", new HttpEntity<>(request, headers), Map.class);
        if (body == null || !(body.get("data") instanceof List<?> data)) {
            throw new IllegalStateException("Embedding API returned no data");
        }

        // Each item carries an "index"; sort by it so the vectors line up with the inputs
        return data.stream()
            .map(item -> (Map<String, Object>) item)
            .sorted(Comparator.comparingInt(item -> item.get("index") instanceof Number n ? n.intValue() : 0))
            .map(item -> toFloatArray((List<Number>) item.get("embedding")))
            .collect(Collectors.toList());
    }

    private static float[] toFloatArray(List<Number> values) {
        float[] result = new float[values.size()];
        for (int i = 0; i < values.size(); i++) {
            result[i] = values.get(i).floatValue();
        }
        return result;
    }

    public static float cosineSimilarity(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vector dimensions differ: " + a.length + " vs " + b.length);
        }
        double dotProduct = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dotProduct += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        double denominator = Math.sqrt(normA) * Math.sqrt(normB);
        // A zero vector has no direction; treat it as dissimilar instead of returning NaN
        return denominator == 0.0 ? 0.0f : (float) (dotProduct / denominator);
    }
}
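`embedBatch` sends every text in a single request, but embedding APIs cap how many inputs (and tokens) one request may carry. Here is a minimal sketch of splitting a large list into fixed-size batches before calling the API. `maxBatchSize` is an assumed parameter, so check your provider's documented limit; the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Split inputs into consecutive batches of at most maxBatchSize items, preserving order,
// so each batch can be sent as one embedding request and the results concatenated.
class EmbeddingBatcher {

    static <T> List<List<T>> partition(List<T> items, int maxBatchSize) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("maxBatchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < items.size(); from += maxBatchSize) {
            int to = Math.min(from + maxBatchSize, items.size());
            batches.add(new ArrayList<>(items.subList(from, to)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            texts.add("chunk-" + i);
        }
        // With a limit of 2 inputs per request: [[chunk-0, chunk-1], [chunk-2, chunk-3], [chunk-4]]
        System.out.println(partition(texts, 2));
    }
}
```

Each batch would then go through `embedBatch`, with the returned vectors appended in order.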

Vector Database Comparison: PGVector vs. Milvus vs. Qdrant

By 2026 the question is no longer whether to use a vector database, but which one. An in-depth comparison of the three mainstream options:

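Aspect	PGVector	Milvus	Qdrant
Foundation	PostgreSQL extension	Standalone distributed system	Standalone Rust service
Max vector count	Tens of millions	Tens of billions	Billions
Query latency (1M vectors)	25-40 ms	15-25 ms	10-18 ms
Hybrid retrieval	Requires pairing with tsvector	Native	Native
Distribution	Relies on PostgreSQL logical replication	Natively distributed	Sharding supported
Operational complexity	Low (reuses PostgreSQL)	High	Medium
Transactions	ACID	Limited	Limited
Filtering	Full SQL	Scalar filtering	Payload filtering
Ecosystem integration	Spring Data JPA	Java SDK	Java SDK
Best suited for	Teams already on PostgreSQL, small to medium scale	Very large scale, high concurrency	High performance, medium scale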

Architecture comparison

┌────────────────────────────────────────────────────────────────────┐
│                       PGVector architecture                        │
├────────────────────────────────────────────────────────────────────┤
│  Application ──→ PostgreSQL (pgvector extension)                   │
│                    ├── Vector index (HNSW / IVFFlat)               │
│                    ├── Full-text search (tsvector)                 │
│                    ├── Relational data (row store)                 │
│                    └── Transactions / ACID                         │
│  Pros: zero extra ops, full SQL                                    │
│  Cons: performance limited at large scale; weak distribution       │
├────────────────────────────────────────────────────────────────────┤
│                        Milvus architecture                         │
├────────────────────────────────────────────────────────────────────┤
│  Application ──→ Proxy ──→ Coordinator                             │
│                              ├── Query Node (search)               │
│                              ├── Data Node (writes)                │
│                              └── Index Node (index building)       │
│  Storage: MinIO/S3 + etcd                                          │
│  Pros: tens-of-billions scale, native distribution, cloud-native   │
│  Cons: complex to operate, heavy resource usage                    │
├────────────────────────────────────────────────────────────────────┤
│                        Qdrant architecture                         │
├────────────────────────────────────────────────────────────────────┤
│  Application ──→ Qdrant Service (Rust)                             │
│                    ├── HNSW index                                  │
│                    ├── Payload filtering                           │
│                    ├── WAL persistence                             │
│                    └── Sharded cluster                             │
│  Pros: Rust performance, low latency, simple API                   │
│  Cons: falls behind Milvus at very large scale                     │
└────────────────────────────────────────────────────────────────────┘

Spring Boot integration with Qdrant

import static io.qdrant.client.PointIdFactory.id;
import static io.qdrant.client.ValueFactory.value;
import static io.qdrant.client.VectorsFactory.vectors;
import static io.qdrant.client.WithPayloadSelectorFactory.enable;

import io.qdrant.client.QdrantClient;
import io.qdrant.client.QdrantGrpcClient;
import io.qdrant.client.grpc.Collections.CreateCollection;
import io.qdrant.client.grpc.Collections.Distance;
import io.qdrant.client.grpc.Collections.HnswConfigDiff;
import io.qdrant.client.grpc.Collections.OptimizersConfigDiff;
import io.qdrant.client.grpc.Collections.VectorParams;
import io.qdrant.client.grpc.Collections.VectorsConfig;
import io.qdrant.client.grpc.Points.PointStruct;
import io.qdrant.client.grpc.Points.ScoredPoint;
import io.qdrant.client.grpc.Points.SearchPoints;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ExecutionException;
import java.util.stream.Collectors;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.stereotype.Service;

@Configuration
public class QdrantConfig {

    @Bean
    public QdrantClient qdrantClient() {
        // 6334 is Qdrant's gRPC port; the third argument disables TLS (local development only)
        return new QdrantClient(
            QdrantGrpcClient.newBuilder("localhost", 6334, false).build()
        );
    }
}

@Service
public class QdrantVectorStore {

    private final QdrantClient qdrantClient;
    private final EmbeddingService embeddingService;

    private static final String COLLECTION_NAME = "knowledge_base";
    // Must equal the embedding model's output dimension (1536 for text-embedding-3-small)
    private static final int VECTOR_SIZE = 1536;

    public QdrantVectorStore(QdrantClient qdrantClient, EmbeddingService embeddingService) {
        this.qdrantClient = qdrantClient;
        this.embeddingService = embeddingService;
    }

    public void createCollection() throws ExecutionException, InterruptedException {
        qdrantClient.createCollectionAsync(
            CreateCollection.newBuilder()
                .setCollectionName(COLLECTION_NAME)
                .setVectorsConfig(VectorsConfig.newBuilder()
                    .setParams(VectorParams.newBuilder()
                        .setSize(VECTOR_SIZE)
                        .setDistance(Distance.Cosine)
                        .build())
                    .build())
                // Build the HNSW index only once a segment holds 20,000+ vectors (size in KB in newer versions)
                .setOptimizersConfig(OptimizersConfigDiff.newBuilder()
                    .setIndexingThreshold(20000)
                    .build())
                // m = links per node, ef_construct = candidate list size while building the graph
                .setHnswConfig(HnswConfigDiff.newBuilder()
                    .setM(16)
                    .setEfConstruct(100)
                    .build())
                .build()
        ).get();
    }

    public void upsertDocuments(List<DocumentChunk> chunks) throws ExecutionException, InterruptedException {
        List<float[]> embeddings = embeddingService.embedBatch(
            chunks.stream().map(DocumentChunk::getContent).collect(Collectors.toList())
        );

        List<PointStruct> points = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i++) {
            DocumentChunk chunk = chunks.get(i);
            points.add(PointStruct.newBuilder()
                // Random IDs: re-ingesting the same chunk creates a duplicate point
                .setId(id(UUID.randomUUID()))
                .setVectors(vectors(embeddings.get(i)))
                .putAllPayload(Map.of(
                    "content", value(chunk.getContent()),
                    "source", value(chunk.getSource()),
                    "section", value(chunk.getSection())
                ))
                .build());
        }

        qdrantClient.upsertAsync(COLLECTION_NAME, points).get();
    }

    public List<SearchResult> search(String query, int topK) throws ExecutionException, InterruptedException {
        float[] queryVector = embeddingService.embed(query);

        List<ScoredPoint> results = qdrantClient.searchAsync(
            SearchPoints.newBuilder()
                .setCollectionName(COLLECTION_NAME)
                .addAllVector(toFloatList(queryVector))
                .setLimit(topK)
                .setWithPayload(enable(true))
                .build()
        ).get();

        return results.stream()
            .map(point -> new SearchResult(
                point.getPayloadMap().get("content").getStringValue(),
                point.getPayloadMap().get("source").getStringValue(),
                point.getScore()
            ))
            .collect(Collectors.toList());
    }

    private List<Float> toFloatList(float[] arr) {
        List<Float> list = new ArrayList<>(arr.length);
        for (float v : arr) {
            list.add(v);
        }
        return list;
    }
}

Spring Boot integration with PGVector

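import com.baomidou.mybatisplus.core.mapper.BaseMapper;
import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;
import jakarta.persistence.Table;
import java.util.List;
import java.util.UUID;
import org.apache.ibatis.annotations.Mapper;
import org.apache.ibatis.annotations.Param;
import org.apache.ibatis.annotations.Select;
import org.hibernate.annotations.Array;
import org.hibernate.annotations.JdbcTypeCode;
import org.hibernate.type.SqlTypes;

@Entity
@Table(name = "documents")
public class DocumentEntity {

    @Id
    @GeneratedValue(strategy = GenerationType.UUID)
    private UUID id;

    private String content;

    private String source;

    private String section;

    // Mapping float[] to pgvector's vector type needs the hibernate-vector module (Hibernate 6.4+);
    // the length must match the embedding dimension
    @JdbcTypeCode(SqlTypes.VECTOR)
    @Array(length = 1536)
    @Column(columnDefinition = "vector(1536)")
    private float[] embedding;

    // Maintained by PostgreSQL (e.g. a generated column over content), never written by the application
    @Column(name = "search_text", columnDefinition = "tsvector", insertable = false, updatable = false)
    private String searchText;

    // getters and setters omitted
}

@Mapper
public interface DocumentMapper extends BaseMapper<DocumentEntity> {

    // <=> is pgvector's cosine distance (0 = same direction). MyBatis has no built-in float[] -> vector
    // mapping, so the query vector is bound as pgvector text such as "[0.1,0.2,...]" and cast in SQL.
    // Only plain columns are selected, so MyBatis never has to map the vector column back.
    @Select("SELECT id, content, source, section " +
            "FROM documents " +
            "WHERE embedding <=> CAST(#{embedding} AS vector) < #{threshold} " +
            "ORDER BY embedding <=> CAST(#{embedding} AS vector) " +
            "LIMIT #{limit}")
    List<DocumentEntity> vectorSearch(@Param("embedding") String embedding,
                                      @Param("threshold") float threshold,
                                      @Param("limit") int limit);

    // plainto_tsquery uses the default text search configuration, which does not segment Chinese;
    // Chinese content needs a parser extension such as zhparser
    @Select("SELECT id, content, source, section " +
            "FROM documents " +
            "WHERE search_text @@ plainto_tsquery(#{query}) " +
            "ORDER BY ts_rank(search_text, plainto_tsquery(#{query})) DESC " +
            "LIMIT #{limit}")
    List<DocumentEntity> fullTextSearch(@Param("query") String query,
                                         @Param("limit") int limit);
}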
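pgvector accepts a vector in its text form: a bracketed, comma-separated list such as `[0.1,0.2,0.3]`, which SQL can cast with `CAST(... AS vector)` or `'...'::vector`. Passing the query embedding as that string is one simple way to bind it through MyBatis without writing a custom type handler. Here is a sketch of the conversion; the helper name `PgVectorLiteral` is hypothetical.

```java
// Hypothetical helper: format a float[] as pgvector's text representation "[x1,x2,...]".
class PgVectorLiteral {

    static String toLiteral(float[] vector) {
        StringBuilder sb = new StringBuilder(vector.length * 10).append('[');
        for (int i = 0; i < vector.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            // Float.toString is locale-independent (always '.' as the decimal separator)
            sb.append(Float.toString(vector[i]));
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        System.out.println(toLiteral(new float[] {0.1f, -0.25f, 3.0f})); // [0.1,-0.25,3.0]
    }
}
```

You can bind the resulting string as an ordinary text parameter and cast it in SQL, for example `embedding <=> CAST(#{embedding} AS vector)`.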

Advanced RAG in Practice: Query Rewriting + Hybrid Retrieval + Reranking

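This is the standard production RAG architecture in 2026. Single-path vector retrieval is no longer enough; hybrid retrieval plus reranking is the right answer.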

┌──────────────────────────────────────────────────────────────────┐
│                   Advanced RAG: full pipeline                    │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  User query: "Why did the company's revenue grow in Q4 2025?"    │
│       │                                                          │
│       ▼                                                          │
│  ┌─────────────────────────────────┐                             │
│  │  1. Query rewrite               │                             │
│  │  Original → 3 rewrites + HyDE   │                             │
│  └──────────────┬──────────────────┘                             │
│                 │                                                │
│       ┌─────────┴─────────┐                                      │
│       ▼                   ▼                                      │
│  ┌──────────────┐    ┌──────────────┐                            │
│  │ Vector search│    │ BM25 search  │                            │
│  │ (semantic)   │    │ (exact match)│                            │
│  └────┬─────────┘    └────┬─────────┘                            │
│       │                   │                                      │
│       └─────────┬─────────┘                                      │
│                 ▼                                                │
│  ┌─────────────────────────────────┐                             │
│  │  2. RRF fusion (Reciprocal Rank │                             │
│  │     Fusion)                     │                             │
│  │  Vector weight 0.7 + keyword 0.3│                             │
│  └──────────────┬──────────────────┘                             │
│                 │                                                │
│                 ▼                                                │
│  ┌─────────────────────────────────┐                             │
│  │  3. Cross-encoder reranking     │                             │
│  │  Scores query-doc relevance     │                             │
│  └──────────────┬──────────────────┘                             │
│                 │                                                │
│                 ▼                                                │
│  ┌─────────────────────────────────┐                             │
│  │  4. Inject context + generate   │                             │
│  │  Precise answer with citations  │                             │
│  └─────────────────────────────────┘                             │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Query rewriting implementation

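import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;
import java.util.Map;
import org.springframework.ai.chat.model.ChatResponse;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.openai.OpenAiChatModel;
import org.springframework.ai.openai.OpenAiChatOptions;
import org.springframework.stereotype.Service;

// Uses the Spring AI 1.0 milestone API (withTemperature / getContent);
// Spring AI 1.0 GA renamed these to temperature(...) and getText().
@Service
public class QueryRewriteService {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    private final OpenAiChatModel chatModel;

    public QueryRewriteService(OpenAiChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public List<String> rewriteQuery(String originalQuery) {
        String prompt = """
            Rewrite the user query below as 3 search queries from different angles to improve retrieval recall.
            Requirements:
            1. Preserve the core intent of the original query
            2. Express the same need from different angles
            3. Add likely domain terms or synonyms
            4. Write the rewrites in the same language as the original query
            5. Output JSON only: {"rewrites": ["query 1", "query 2", "query 3"]}

            Original query: %s
            """.formatted(originalQuery);

        ChatResponse response = chatModel.call(new Prompt(prompt,
            OpenAiChatOptions.builder().withTemperature(0.3).build()));

        String content = response.getResult().getOutput().getContent();
        try {
            Map<String, List<String>> result =
                MAPPER.readValue(content, new TypeReference<Map<String, List<String>>>() {});
            return result.getOrDefault("rewrites", List.of());
        } catch (JsonProcessingException e) {
            // Malformed model output: return no rewrites so the original query is still searched
            return List.of();
        }
    }

    // HyDE (Hypothetical Document Embeddings): retrieve with a hypothetical answer instead of the question
    public String generateHyde(String query) {
        String prompt = """
            Write a detailed answer to the question below, even if you are not sure of it.
            The answer will be used to retrieve related documents, so include as many relevant details and domain terms as possible.
            Answer in the same language as the question.

            Question: %s
            """.formatted(query);

        ChatResponse response = chatModel.call(new Prompt(prompt,
            OpenAiChatOptions.builder().withTemperature(0.5).build()));

        return response.getResult().getOutput().getContent();
    }
}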
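Chat models often wrap JSON in a Markdown code fence or add a sentence in front of it, and a bare `readValue` call then fails. The sketch below shows a small, hypothetical pre-processing step that keeps only the text from the first `{` to the last `}` before handing it to a JSON parser. It is a heuristic and assumes the reply contains a single top-level object.

```java
import java.util.Optional;

// Hypothetical helper: keep the span from the first '{' to the last '}' of an LLM reply,
// dropping any Markdown code fence or chatty preamble around it. It does not validate the
// JSON; the result must still go through a real parser such as Jackson's ObjectMapper.
class LlmJsonExtractor {

    static Optional<String> extractJsonObject(String reply) {
        if (reply == null) {
            return Optional.empty();
        }
        int start = reply.indexOf('{');
        int end = reply.lastIndexOf('}');
        if (start < 0 || end < start) {
            return Optional.empty();
        }
        return Optional.of(reply.substring(start, end + 1));
    }

    public static void main(String[] args) {
        String fence = "`".repeat(3);
        String reply = "Sure:\n" + fence + "json\n{\"rewrites\": [\"a\", \"b\", \"c\"]}\n" + fence;
        // prints {"rewrites": ["a", "b", "c"]}
        System.out.println(extractJsonObject(reply).orElse("<no JSON>"));
    }
}
```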

Hybrid retrieval + RRF fusion

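import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.springframework.stereotype.Service;

// Fuses vector results (Qdrant) with keyword results (PostgreSQL full-text search).
// Results are matched by chunk text, so both stores must hold identical chunk content.
@Service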
public class HybridRetrievalService {

    private final QdrantVectorStore vectorStore;
    private final DocumentMapper documentMapper;
    private final EmbeddingService embeddingService;

    private static final double VECTOR_WEIGHT = 0.7;
    private static final double KEYWORD_WEIGHT = 0.3;
    private static final int RRF_K = 60;

    public HybridRetrievalService(QdrantVectorStore vectorStore,
                                   DocumentMapper documentMapper,
                                   EmbeddingService embeddingService) {
        this.vectorStore = vectorStore;
        this.documentMapper = documentMapper;
        this.embeddingService = embeddingService;
    }

    public List<SearchResult> hybridSearch(String query, int topK) {
        List<SearchResult> vectorResults = vectorSearch(query, topK * 2);
        List<SearchResult> keywordResults = keywordSearch(query, topK * 2);

        List<SearchResult> fused = reciprocalRankFusion(
            List.of(vectorResults, keywordResults),
            List.of(VECTOR_WEIGHT, KEYWORD_WEIGHT)
        );

        return fused.stream().limit(topK).collect(Collectors.toList());
    }

    private List<SearchResult> vectorSearch(String query, int limit) {
        try {
            return vectorStore.search(query, limit);
        } catch (Exception e) {
            // Degrade to keyword-only retrieval if the vector store call fails; log e in production
            return Collections.emptyList();
        }
    }

    private List<SearchResult> keywordSearch(String query, int limit) {
        List<DocumentEntity> results = documentMapper.fullTextSearch(query, limit);
        return results.stream()
            .map(doc -> new SearchResult(doc.getContent(), doc.getSource(), 0.0))
            .collect(Collectors.toList());
    }

    private List<SearchResult> reciprocalRankFusion(
            List<List<SearchResult>> resultSets,
            List<Double> weights) {

        Map<String, Double> scoreMap = new HashMap<>();
        Map<String, SearchResult> resultMap = new HashMap<>();

        for (int setIndex = 0; setIndex < resultSets.size(); setIndex++) {
            List<SearchResult> results = resultSets.get(setIndex);
            double weight = weights.get(setIndex);

            for (int rank = 0; rank < results.size(); rank++) {
                String key = results.get(rank).getContent();
                double score = weight / (rank + 1 + RRF_K);
                scoreMap.merge(key, score, Double::sum);
                resultMap.putIfAbsent(key, results.get(rank));
            }
        }

        return scoreMap.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .map(entry -> {
                SearchResult result = resultMap.get(entry.getKey());
                return new SearchResult(result.getContent(), result.getSource(), entry.getValue());
            })
            .collect(Collectors.toList());
    }
}
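You can check the fusion step by hand. With k = 60 and the 0.7/0.3 weights used above, a document ranked 1st by vector search and 3rd by BM25 scores 0.7/61 + 0.3/63 ≈ 0.016237. A document ranked 2nd by vector search alone scores only 0.7/62 ≈ 0.011290. The sketch below is a standalone version of the same weighted-RRF formula, score(d) = Σ wᵢ / (k + rankᵢ(d)) with 1-based ranks, operating on plain string IDs (class name hypothetical):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Weighted Reciprocal Rank Fusion over ranked lists of document IDs.
// score(d) = sum over lists i of weight_i / (k + rank_i(d)), ranks starting at 1.
class WeightedRrf {

    static Map<String, Double> fuse(List<List<String>> rankings, List<Double> weights, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (int list = 0; list < rankings.size(); list++) {
            List<String> ranking = rankings.get(list);
            for (int i = 0; i < ranking.size(); i++) {
                int rank = i + 1; // 1-based rank
                scores.merge(ranking.get(i), weights.get(list) / (k + rank), Double::sum);
            }
        }
        return scores;
    }

    // IDs sorted by fused score, best first
    static List<String> order(Map<String, Double> scores) {
        List<String> ids = new ArrayList<>(scores.keySet());
        ids.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return ids;
    }

    public static void main(String[] args) {
        List<String> vector = List.of("A", "B", "C");
        List<String> bm25 = List.of("D", "E", "A");
        Map<String, Double> scores = fuse(List.of(vector, bm25), List.of(0.7, 0.3), 60);
        // A appears in both lists and ranks first: [A, B, C, D, E]
        System.out.println(order(scores));
    }
}
```

Because RRF works on ranks rather than raw scores, the keyword path does not need a comparable score. That is why `keywordSearch` above can simply pass 0.0.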

Reranking (LLM scoring as a stand-in for a cross-encoder)

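import java.util.Comparator;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import org.springframework.ai.chat.model.ChatResponse;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.openai.OpenAiChatModel;
import org.springframework.ai.openai.OpenAiChatOptions;
import org.springframework.stereotype.Service;

// Pointwise LLM-as-judge reranking: one scoring call per candidate.
// Spring AI 1.0 milestone API (withTemperature / getContent); GA renamed them to temperature / getText.
@Service
public class RerankService {

    private record ScoredResult(String content, String source, double score) {}

    private static final Pattern NUMBER = Pattern.compile("\\d+(?:\\.\\d+)?");

    private final OpenAiChatModel chatModel;

    public RerankService(OpenAiChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public List<SearchResult> rerank(String query, List<SearchResult> candidates, int topK) {
        // Score all candidates concurrently (on the common ForkJoinPool, since no executor is given)
        List<CompletableFuture<ScoredResult>> futures = candidates.stream()
            .map(candidate -> CompletableFuture.supplyAsync(() -> scoreRelevance(query, candidate)))
            .collect(Collectors.toList());

        return futures.stream()
            .map(CompletableFuture::join)
            .sorted(Comparator.comparingDouble(ScoredResult::score).reversed())
            .limit(topK)
            .map(sr -> new SearchResult(sr.content(), sr.source(), sr.score()))
            .collect(Collectors.toList());
    }

    private ScoredResult scoreRelevance(String query, SearchResult candidate) {
        String prompt = """
            Rate how relevant the document is to the query. Output only an integer from 0 to 10.

            Query: %s
            Document: %s

            Score:
            """.formatted(query, candidate.getContent());

        ChatResponse response = chatModel.call(new Prompt(prompt,
            OpenAiChatOptions.builder().withTemperature(0.0).build()));

        // Models sometimes wrap the number in words: take the first number, clamp it to 0-10, default to 0
        Matcher m = NUMBER.matcher(response.getResult().getOutput().getContent());
        double raw = m.find() ? Double.parseDouble(m.group()) : 0.0;
        double score = Math.max(0.0, Math.min(10.0, raw));
        return new ScoredResult(candidate.getContent(), candidate.getSource(), score / 10.0);
    }
}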
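Calling `CompletableFuture.supplyAsync` without an executor runs each task on the JVM's common ForkJoinPool. That pool is sized for CPU-bound work, yet the scoring calls here block on network I/O. Below is a sketch of the same fan-out on a dedicated, bounded thread pool, with a generic `scorer` function standing in for the LLM call. The class name and the pool size of 8 are assumptions.

```java
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.ToDoubleFunction;
import java.util.stream.Collectors;

// Score candidates concurrently on a caller-supplied executor and keep the topK best.
class BoundedReranker {

    record Scored<T>(T item, double score) {}

    static <T> List<T> rerank(List<T> candidates, ToDoubleFunction<T> scorer,
                              int topK, ExecutorService executor) {
        List<CompletableFuture<Scored<T>>> futures = candidates.stream()
            .map(c -> CompletableFuture.supplyAsync(
                () -> new Scored<>(c, scorer.applyAsDouble(c)), executor))
            .collect(Collectors.toList());
        return futures.stream()
            .map(CompletableFuture::join)
            .sorted(Comparator.comparingDouble((Scored<T> s) -> s.score()).reversed())
            .limit(topK)
            .map(Scored::item)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // At most 8 scoring calls in flight at once (an assumed limit; tune it to your API quota)
        ExecutorService pool = Executors.newFixedThreadPool(8);
        try {
            List<String> docs = List.of("short", "a much longer document", "medium text");
            // Stand-in scorer: longer text scores higher; a real one would call the LLM
            System.out.println(rerank(docs, String::length, 2, pool)); // [a much longer document, medium text]
        } finally {
            pool.shutdown();
        }
    }
}
```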

The complete Advanced RAG pipeline

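import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.springframework.ai.chat.messages.SystemMessage;
import org.springframework.ai.chat.messages.UserMessage;
import org.springframework.ai.chat.model.ChatResponse;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.openai.OpenAiChatModel;
import org.springframework.ai.openai.OpenAiChatOptions;
import org.springframework.stereotype.Service;

@Service
public class AdvancedRagPipeline {

    private final QueryRewriteService queryRewriteService;
    private final HybridRetrievalService hybridRetrievalService;
    private final RerankService rerankService;
    private final OpenAiChatModel chatModel;

    public AdvancedRagPipeline(QueryRewriteService queryRewriteService,
                               HybridRetrievalService hybridRetrievalService,
                               RerankService rerankService,
                               OpenAiChatModel chatModel) {
        this.queryRewriteService = queryRewriteService;
        this.hybridRetrievalService = hybridRetrievalService;
        this.rerankService = rerankService;
        this.chatModel = chatModel;
    }

    public RagResponse query(String userQuery) {
        // Original query + LLM rewrites + one HyDE pseudo-answer
        List<String> allQueries = new ArrayList<>();
        allQueries.add(userQuery);
        allQueries.addAll(queryRewriteService.rewriteQuery(userQuery));

        // The HyDE passage mainly helps the vector path: plainto_tsquery ANDs every term,
        // so a paragraph-length keyword query rarely matches anything
        String hydeAnswer = queryRewriteService.generateHyde(userQuery);
        allQueries.add(hydeAnswer);

        List<SearchResult> allResults = new ArrayList<>();
        for (String q : allQueries) {
            allResults.addAll(hybridRetrievalService.hybridSearch(q, 10));
        }

        // Up to 5 x 10 candidates; each one left after deduplication costs an LLM call in rerank()
        List<SearchResult> deduplicated = deduplicate(allResults);
        List<SearchResult> reranked = rerankService.rerank(userQuery, deduplicated, 5);

        String context = buildContext(reranked);
        String answer = generateAnswer(userQuery, context);

        return new RagResponse(answer, reranked);
    }

    // Keep the first occurrence of each chunk (by text), preserving order
    private List<SearchResult> deduplicate(List<SearchResult> results) {
        Map<String, SearchResult> uniqueMap = new LinkedHashMap<>();
        for (SearchResult result : results) {
            uniqueMap.putIfAbsent(result.getContent(), result);
        }
        return new ArrayList<>(uniqueMap.values());
    }

    // Number the sources [1], [2], ... so the model can cite them
    private String buildContext(List<SearchResult> results) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < results.size(); i++) {
            SearchResult r = results.get(i);
            sb.append("[%d] Source: %s\n%s\n\n".formatted(i + 1, r.getSource(), r.getContent()));
        }
        return sb.toString();
    }

    private String generateAnswer(String query, String context) {
        String systemPrompt = """
            You are a knowledge-base Q&A assistant. Answer the user's question using the retrieved documents below.

            Rules:
            1. Answer only from the retrieved documents; do not fabricate information
            2. Cite the source of every statement as [1][2]...
            3. If the retrieved results are not enough to answer, say so explicitly
            4. Prefer the most recent and most relevant information
            5. Answer in the same language as the user's question

            Retrieved documents:
            %s
            """.formatted(context);

        ChatResponse response = chatModel.call(new Prompt(
            List.of(new SystemMessage(systemPrompt), new UserMessage(query)),
            OpenAiChatOptions.builder().withTemperature(0.1).build()
        ));

        return response.getResult().getOutput().getContent();
    }
}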
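The system prompt asks for `[n]` citations, but nothing enforces them. One cheap safeguard is a post-check that extracts the cited numbers from an answer and flags any that point outside the `n` retrieved sources, a sign that the model invented a reference. A hypothetical sketch of that check (the class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical validator for answers that cite sources as [1], [2], ...
class CitationCheck {

    private static final Pattern CITATION = Pattern.compile("\\[(\\d{1,9})]");

    // Every cited index that does not correspond to one of the sourceCount sources
    static List<Integer> invalidCitations(String answer, int sourceCount) {
        List<Integer> invalid = new ArrayList<>();
        Matcher m = CITATION.matcher(answer);
        while (m.find()) {
            int n = Integer.parseInt(m.group(1));
            if (n < 1 || n > sourceCount) {
                invalid.add(n);
            }
        }
        return invalid;
    }

    // True if the answer cites at least one source
    static boolean hasCitation(String answer) {
        return CITATION.matcher(answer).find();
    }

    public static void main(String[] args) {
        String answer = "Revenue grew on strong exports [1][2], aided by a weaker currency [7].";
        System.out.println(invalidCitations(answer, 5)); // [7]
    }
}
```

An answer with no citations at all, or with out-of-range ones, could be regenerated or returned with a warning.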

Document Chunking Strategies

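Chunking is the foundation of RAG. Chunk badly and even the strongest retrieval cannot save you. The four mainstream strategies compared: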

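Strategy	How it works	Pros	Cons	Best suited for	Chunk granularity
Fixed length	Split by token or character count	Simple and predictable	Can cut through semantic units	Logs, tabular data	Fixed chunk_size
Semantic	Detect breakpoints from embedding similarity	Preserves semantic coherence	Higher compute cost	Technical docs, papers	Adaptive
Structural	Split by heading / section / paragraph	Preserves document structure	Needs a parser for the format	Markdown / HTML / PDF	Follows structural levels
Topic-based	LLM identifies topic boundaries	Highly cohesive topics	Expensive LLM calls	Long, multi-topic documents	Per topic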

Fixed-length chunking (Java implementation)

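import java.util.ArrayList;
import java.util.List;

public class FixedLengthChunker {

    private final int chunkSize;
    private final int overlapSize;

    public FixedLengthChunker(int chunkSize, int overlapSize) {
        // A chunk can shrink to just over chunkSize / 2 at a sentence break, so the overlap
        // must stay below half the chunk size for the window to keep moving forward
        if (chunkSize <= 0 || overlapSize < 0 || overlapSize * 2 >= chunkSize) {
            throw new IllegalArgumentException("require chunkSize > 0 and 0 <= 2 * overlapSize < chunkSize");
        }
        this.chunkSize = chunkSize;
        this.overlapSize = overlapSize;
    }

    public List<DocumentChunk> chunk(String content, String source) {
        List<DocumentChunk> chunks = new ArrayList<>();
        int start = 0;
        int index = 0;

        while (start < content.length()) {
            int end = Math.min(start + chunkSize, content.length());
            String text = content.substring(start, end);

            // Not at the end yet: prefer to cut after the last Chinese full stop or newline,
            // as long as that keeps more than half a chunk
            if (end < content.length()) {
                int lastPeriod = text.lastIndexOf('。');
                int lastNewline = text.lastIndexOf('\n');
                int breakPoint = Math.max(lastPeriod, lastNewline);
                if (breakPoint > chunkSize / 2) {
                    text = text.substring(0, breakPoint + 1);
                    end = start + breakPoint + 1;
                }
            }

            chunks.add(new DocumentChunk(text, source, "chunk-" + index, index));
            if (end == content.length()) {
                break; // last chunk emitted; stepping back by the overlap here would loop forever
            }
            start = end - overlapSize;
            index++;
        }

        return chunks;
    }
}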
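For pure fixed windows, ignoring the sentence-boundary snapping above, each chunk after the first advances by chunkSize − overlapSize characters. A text of length L > overlapSize therefore yields ⌈(L − overlapSize) / (chunkSize − overlapSize)⌉ chunks. For example, L = 1000, chunkSize = 500 and overlapSize = 50 give windows starting at 0, 450 and 900, i.e. ⌈950 / 450⌉ = 3 chunks. A small sketch that enumerates the windows and checks the closed form (class name hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Enumerate [start, end) windows of a pure fixed-length chunker with overlap.
// Assumes 0 <= overlap < size; stops once a window reaches the end of the text.
class SlidingWindows {

    static List<int[]> windows(int length, int size, int overlap) {
        if (size <= 0 || overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("need size > 0 and 0 <= overlap < size");
        }
        List<int[]> result = new ArrayList<>();
        int start = 0;
        while (start < length) {
            int end = Math.min(start + size, length);
            result.add(new int[] {start, end});
            if (end == length) {
                break;
            }
            start = end - overlap; // advance by size - overlap
        }
        return result;
    }

    // Closed form for length > overlap: ceil((length - overlap) / (size - overlap))
    static int expectedCount(int length, int size, int overlap) {
        int step = size - overlap;
        return (length - overlap + step - 1) / step;
    }

    public static void main(String[] args) {
        for (int[] w : windows(1000, 500, 50)) {
            System.out.println(w[0] + ".." + w[1]); // 0..500, then 450..950, then 900..1000
        }
        System.out.println(expectedCount(1000, 500, 50)); // 3
    }
}
```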

Semantic chunking (Java implementation)

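import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import org.springframework.stereotype.Service;

// Starts a new chunk wherever the embeddings of adjacent sentences fall below a similarity threshold
@Service
public class SemanticChunker {

    private final EmbeddingService embeddingService;

    // A starting point only: similarity distributions differ between embedding models, so tune per model
    private static final double SIMILARITY_THRESHOLD = 0.85;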

    public SemanticChunker(EmbeddingService embeddingService) {
        this.embeddingService = embeddingService;
    }

    public List<DocumentChunk> chunk(String content, String source) {
        List<String> sentences = splitSentences(content);
        if (sentences.isEmpty()) {
            return Collections.emptyList();
        }

        List<float[]> embeddings = embeddingService.embedBatch(sentences);

        List<DocumentChunk> chunks = new ArrayList<>();
        StringBuilder currentChunk = new StringBuilder(sentences.get(0));
        int chunkIndex = 0;

        for (int i = 1; i < sentences.size(); i++) {
            float similarity = EmbeddingService.cosineSimilarity(
                embeddings.get(i - 1), embeddings.get(i)
            );

            if (similarity >= SIMILARITY_THRESHOLD) {
                currentChunk.append(sentences.get(i));
            } else {
                chunks.add(new DocumentChunk(
                    currentChunk.toString(), source, "chunk-" + chunkIndex, chunkIndex
                ));
                currentChunk = new StringBuilder(sentences.get(i));
                chunkIndex++;
            }
        }

        if (!currentChunk.isEmpty()) {
            chunks.add(new DocumentChunk(
                currentChunk.toString(), source, "chunk-" + chunkIndex, chunkIndex
            ));
        }

        return chunks;
    }

    private List<String> splitSentences(String text) {
        return Arrays.stream(text.split("(?<=[。！？.!?])"))
            .map(String::trim)
            .filter(s -> !s.isEmpty())
            .collect(Collectors.toList());
    }
}

Structural chunking (Markdown)

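import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import org.springframework.stereotype.Service;

@Service
public class MarkdownStructureChunker {

    // ATX headings "#" to "######"; [ \t]+ (rather than \s+) keeps a match on a single line.
    // Caveat: "#" lines inside fenced code blocks are also treated as headings.
    private static final Pattern HEADING_PATTERN =
        Pattern.compile("^(#{1,6})[ \\t]+(.+)$", Pattern.MULTILINE);

    private record Section(String heading, String content, int level) {}

    public List<DocumentChunk> chunk(String markdown, String source) {
        List<Section> sections = parseSections(markdown);

        return sections.stream()
            .map(section -> new DocumentChunk(
                section.content(),
                source,
                section.heading(),
                section.level()
            ))
            .collect(Collectors.toList());
    }

    private List<Section> parseSections(String markdown) {
        List<Section> sections = new ArrayList<>();
        Matcher matcher = HEADING_PATTERN.matcher(markdown);

        List<Integer> positions = new ArrayList<>();
        List<String> headings = new ArrayList<>();
        List<Integer> levels = new ArrayList<>();

        while (matcher.find()) {
            positions.add(matcher.start());
            headings.add(matcher.group(2).trim());
            levels.add(matcher.group(1).length());
        }

        // Each section runs from its heading to the next heading (or the end of the document)
        for (int i = 0; i < positions.size(); i++) {
            int start = positions.get(i);
            int end = (i + 1 < positions.size()) ? positions.get(i + 1) : markdown.length();
            String content = markdown.substring(start, end).trim();
            sections.add(new Section(headings.get(i), content, levels.get(i)));
        }

        // Text before the first heading, or the whole document if it has no headings at all,
        // becomes a level-0 "Preamble" section instead of being silently dropped
        int firstHeading = positions.isEmpty() ? markdown.length() : positions.get(0);
        String preamble = markdown.substring(0, firstHeading).trim();
        if (!preamble.isEmpty()) {
            sections.add(0, new Section("Preamble", preamble, 0));
        }

        return sections;
    }
}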

Agentic RAG: Letting the Agent Decide When to Retrieve

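Agentic RAG is the frontier direction in 2026. The core idea: let the agent decide for itself whether to retrieve, what to retrieve, and whether it has retrieved enough.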

┌──────────────────────────────────────────────────────────────────┐
│                       Agentic RAG workflow                       │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  User query: "Compare the architectures of products A and B"     │
│       │                                                          │
│       ▼                                                          │
│  ┌────────────────────────────────────────┐                      │
│  │  Agent thinks: need A's architecture   │                      │
│  └──────────────────┬─────────────────────┘                      │
│                     ▼                                            │
│  ┌────────────────────────────────────────┐                      │
│  │  Retrieve: A's architecture → context  │                      │
│  └──────────────────┬─────────────────────┘                      │
│                     ▼                                            │
│  ┌────────────────────────────────────────┐                      │
│  │  Agent evaluates: also need B's docs   │                      │
│  └──────────────────┬─────────────────────┘                      │
│                     ▼                                            │
│  ┌────────────────────────────────────────┐                      │
│  │  Retrieve: B's architecture → context  │                      │
│  └──────────────────┬─────────────────────┘                      │
│                     ▼                                            │
│  ┌────────────────────────────────────────┐                      │
│  │  Agent evaluates: enough to answer     │                      │
│  └──────────────────┬─────────────────────┘                      │
│                     ▼                                            │
│  ┌────────────────────────────────────────┐                      │
│  │  Generate: cited A/B comparison        │                      │
│  └────────────────────────────────────────┘                      │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
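Stripped of the LLM and retrieval calls, the control flow in the diagram is a bounded decide-then-act loop. The self-contained sketch below uses a scripted `Decider` in place of the LLM and a lambda in place of hybrid search. All names here are hypothetical, and `maxIterations` plays the same role as `MAX_ITERATIONS` in the service below.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical skeleton of an agentic retrieval loop: decide, then search, generate, or give up.
class AgentLoopSketch {

    enum Action { SEARCH, GENERATE, INSUFFICIENT }

    record Decision(Action action, String searchQuery) {}

    interface Decider {
        // Sees the original question plus everything retrieved so far
        Decision decide(String question, List<String> context);
    }

    static String run(String question, Decider decider,
                      Function<String, List<String>> search, int maxIterations) {
        List<String> context = new ArrayList<>();
        for (int i = 0; i < maxIterations; i++) {
            Decision d = decider.decide(question, context);
            switch (d.action()) {
                case GENERATE:
                    return "ANSWER from " + context.size() + " snippets";
                case INSUFFICIENT:
                    return "INSUFFICIENT";
                case SEARCH:
                    context.addAll(search.apply(d.searchQuery()));
                    break;
            }
        }
        // Budget exhausted: fall back to answering with whatever was gathered
        return "ANSWER (budget exhausted) from " + context.size() + " snippets";
    }

    public static void main(String[] args) {
        // Scripted decider mirroring the diagram: search A, then B, then generate
        Decider scripted = (q, ctx) -> switch (ctx.size()) {
            case 0 -> new Decision(Action.SEARCH, "product A architecture");
            case 1 -> new Decision(Action.SEARCH, "product B architecture");
            default -> new Decision(Action.GENERATE, null);
        };
        Function<String, List<String>> fakeSearch = query -> List.of("doc about " + query);
        System.out.println(run("Compare A and B", scripted, fakeSearch, 5)); // ANSWER from 2 snippets
    }
}
```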

Core Agentic RAG implementation

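import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.springframework.ai.chat.model.ChatResponse;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.openai.OpenAiChatModel;
import org.springframework.ai.openai.OpenAiChatOptions;
import org.springframework.stereotype.Service;

@Service
public class AgenticRagService {

    private final OpenAiChatModel chatModel;
    private final HybridRetrievalService retrievalService;
    private final RerankService rerankService;

    // Upper bound on decide/retrieve rounds so the agent cannot loop indefinitely
    private static final int MAX_ITERATIONS = 5;

    public AgenticRagService(OpenAiChatModel chatModel,
                             HybridRetrievalService retrievalService,
                             RerankService rerankService) {
        this.chatModel = chatModel;
        this.retrievalService = retrievalService;
        this.rerankService = rerankService;
    }

    public AgenticRagResponse query(String userQuery) {
        List<RetrievalStep> steps = new ArrayList<>();
        List<SearchResult> allContext = new ArrayList<>();

        for (int i = 0; i < MAX_ITERATIONS; i++) {
            // Decide against the original question plus everything retrieved so far. Feeding the
            // previous "thought" back in as the query would make the agent lose the user's intent.
            AgentDecision decision = decideAction(userQuery, allContext);
            steps.add(new RetrievalStep(i + 1, decision.thought(), decision.action()));

            if ("GENERATE".equals(decision.action())) {
                String answer = generateAnswer(userQuery, allContext);
                return new AgenticRagResponse(answer, allContext, steps);
            }

            if ("SEARCH".equals(decision.action()) && decision.searchQuery() != null) {
                List<SearchResult> results = retrievalService.hybridSearch(decision.searchQuery(), 5);
                List<SearchResult> reranked = rerankService.rerank(decision.searchQuery(), results, 3);
                allContext.addAll(reranked);
            }

            if ("INSUFFICIENT".equals(decision.action())) {
                return new AgenticRagResponse(
                    "Sorry, even after several rounds of retrieval I could not find enough information to answer your question.",
                    allContext, steps
                );
            }
        }

        // Iteration budget used up: answer with whatever context has been gathered
        String answer = generateAnswer(userQuery, allContext);
        return new AgenticRagResponse(answer, allContext, steps);
    }

    private AgentDecision decideAction(String query, List<SearchResult> context) {
        // Show the agent at most the first 200 characters of each retrieved chunk
        String contextStr = context.isEmpty() ? "(no retrieval results yet)" :
            context.stream()
                .map(r -> "- " + r.getContent().substring(0, Math.min(200, r.getContent().length())))
                .collect(Collectors.joining("\n"));

        String prompt = """
            You are a RAG agent and must decide the next action.

            User query: %s
            Context retrieved so far:
            %s

            Decide the next action:
            - SEARCH: more retrieval is needed (provide search_query)
            - GENERATE: there is enough information to answer
            - INSUFFICIENT: sufficient information cannot be found

            Output JSON:
            {"thought": "your reasoning", "action": "SEARCH|GENERATE|INSUFFICIENT", "search_query": "retrieval query (SEARCH only)"}
            """.formatted(query, contextStr);

        ChatResponse response = chatModel.call(new Prompt(prompt,
            OpenAiChatOptions.builder().withTemperature(0.1).build()
        ));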

        try {
            Map<String, String> result = new ObjectMapper().readValue(
                response.getResult().getOutput().getContent(), Map.class
            );
            return new AgentDecision(
                result.get("thought"),
                result.get("action"),
                result.getOrDefault("search_query", "")
            );
        } catch (Exception e) {
            return new AgentDecision("解析失敗，嘗試生成回答", "GENERATE", "");
        }
    }

    private String generateAnswer(String query, List<SearchResult> context) {
        String contextStr = context.stream()
            .map(r -> "[來源: " + r.getSource() + "]\n" + r.getContent())
            .collect(Collectors.joining("\n\n"));

        String systemPrompt = """
            基於以下檢索到的文件回答使用者問題。
            規則：
            1. 只基於檢索文件回答，不編造
            2. 每個陳述標注引用來源
            3. 資訊不足時明確說明

            檢索文件：
            %s
            """.formatted(contextStr);

        ChatResponse response = chatModel.call(new ChatRequest(
            List.of(new Message("system", systemPrompt), new Message("user", query)),
            ChatOptions.builder().withTemperature(0.1).build()
        ));

        return response.getResult().getOutput().getContent();
    }
}
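The decision loop's control flow (iteration cap, early exit, accumulating context) does not depend on the model, so it can be exercised without any LLM call. The sketch below is a hypothetical plain-Java reduction of `query()`: `Decider`, `Retriever`, `Decision`, `Chunk`, and `Outcome` are stand-in types introduced only for illustration, and it assumes every retrieved chunk has a stable id so that repeats across rounds can be dropped.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins: Decision mirrors the JSON the LLM returns,
// Chunk is a retrieved passage with a stable id, Outcome records how the loop ended.
record Decision(String action, String searchQuery) {}
record Chunk(String id, String text) {}
record Outcome(String status, List<Chunk> context) {}

// The LLM decision step and the hybrid search + rerank step, abstracted as interfaces.
interface Decider { Decision decide(String userQuery, List<Chunk> context); }
interface Retriever { List<Chunk> search(String query); }

final class AgentLoop {
    static final int MAX_ITERATIONS = 5;

    static Outcome run(String userQuery, Decider decider, Retriever retriever, List<String> trace) {
        // LinkedHashMap keeps first-seen order and ignores chunks already retrieved in an earlier round.
        Map<String, Chunk> context = new LinkedHashMap<>();
        for (int i = 0; i < MAX_ITERATIONS; i++) {
            // Always pass the ORIGINAL question; the growing context carries the progress.
            Decision d = decider.decide(userQuery, List.copyOf(context.values()));
            String action = d.action() == null ? "" : d.action();
            trace.add(action);
            switch (action) {
                case "GENERATE":
                case "INSUFFICIENT":
                    return new Outcome(action, List.copyOf(context.values()));
                case "SEARCH":
                    String q = (d.searchQuery() == null || d.searchQuery().isBlank())
                            ? userQuery : d.searchQuery();
                    for (Chunk c : retriever.search(q)) {
                        context.putIfAbsent(c.id(), c);
                    }
                    break;
                default:
                    // Unknown action: a wasted round; the iteration cap still guarantees termination.
                    break;
            }
        }
        return new Outcome("MAX_ITERATIONS", List.copyOf(context.values()));
    }
}
```

In a test, a scripted `Decider` (for example one that returns SEARCH, SEARCH, GENERATE in order) checks that the loop stops where expected and never exceeds `MAX_ITERATIONS`.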

Multimodal RAG: unified retrieval over images, tables, and code

In 2026, RAG is no longer limited to text retrieval: multimodal RAG makes images, tables, and code retrievable and citable as well.

┌──────────────────────────────────────────────────────────────────┐
│                   Multimodal RAG architecture                    │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Input documents (PDF/HTML/Markdown)                             │
│       │                                                          │
│       ├── Text ──→ text embedding ──→ vector DB                  │
│       │                                                          │
│       ├── Tables ──→ table-to-text ──→ embedding ──→ vector DB   │
│       │                                                          │
│       ├── Images ──→ visual embedding ──→ vector DB              │
│       │      (CLIP, or a Qwen-VL caption embedded as text)       │
│       │                                                          │
│       └── Code ──→ code embedding ──→ vector DB                  │
│              (CodeBERT or a code-specific model)                 │
│                                                                  │
│  Query: unified vector search → multimodal fusion → LLM          │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Multimodal document processing

// Document, Table, DocumentImage, CodeBlock, DocumentChunk, SemanticChunker, EmbeddingService and the
// chat-model types are this article's own classes; their imports are omitted here.
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;
import org.springframework.stereotype.Service;

@Service
public class MultimodalDocumentProcessor {

    private final EmbeddingService textEmbeddingService;
    private final OpenAiChatModel visionModel;

    public MultimodalDocumentProcessor(EmbeddingService textEmbeddingService, OpenAiChatModel visionModel) {
        this.textEmbeddingService = textEmbeddingService;
        this.visionModel = visionModel;
    }

    public List<DocumentChunk> processDocument(Document document) {
        List<DocumentChunk> chunks = new ArrayList<>();
        String source = document.getSource();

        // In this implementation every modality is turned into text, so all chunks share one text-embedding index.
        chunks.addAll(processText(document.getTextContent(), source));
        chunks.addAll(processTables(document.getTables(), source));
        chunks.addAll(processImages(document.getImages(), source));
        chunks.addAll(processCodeBlocks(document.getCodeBlocks(), source));

        return chunks;
    }

    private List<DocumentChunk> processText(String text, String source) {
        SemanticChunker chunker = new SemanticChunker(textEmbeddingService);
        return chunker.chunk(text, source);
    }

    private List<DocumentChunk> processTables(List<Table> tables, String source) {
        return tables.stream()
            .map(table -> new DocumentChunk(
                convertTableToText(table), source,
                "table-" + table.getIndex(),
                table.getIndex(),
                "TABLE"
            ))
            .collect(Collectors.toList());
    }

    // Serializes each row as "column: value; ..." so every cell keeps its column name.
    private String convertTableToText(Table table) {
        StringBuilder sb = new StringBuilder();
        sb.append("Table: ").append(Objects.requireNonNullElse(table.getCaption(), "(no caption)")).append("\n");

        List<String> headers = table.getHeaders();
        sb.append("Columns: ").append(String.join(", ", headers)).append("\n");

        for (List<String> row : table.getRows()) {
            for (int i = 0; i < headers.size() && i < row.size(); i++) {
                sb.append(headers.get(i)).append(": ").append(row.get(i)).append("; ");
            }
            sb.append("\n");
        }

        return sb.toString();
    }

    private List<DocumentChunk> processImages(List<DocumentImage> images, String source) {
        return images.stream()
            .map(image -> new DocumentChunk(
                "[Image] " + describeImage(image), source,
                "image-" + image.getIndex(),
                image.getIndex(),
                "IMAGE"
            ))
            .collect(Collectors.toList());
    }

    private String describeImage(DocumentImage image) {
        String prompt = "Describe this image in detail, including any chart data and key information.";

        // NOTE: pasting base64 into the text prompt, as below, does NOT let the model see the image and
        // inflates the prompt. Attach the image as an image/media content part instead, using the
        // multimodal message support of your chat client.
        ChatResponse response = visionModel.call(new ChatRequest(
            List.of(new Message("user", prompt + "\n[image base64: " + image.getBase64() + "]")),
            ChatOptions.builder().withTemperature(0.1).build()
        ));

        return response.getResult().getOutput().getContent();
    }

    private List<DocumentChunk> processCodeBlocks(List<CodeBlock> codeBlocks, String source) {
        return codeBlocks.stream()
            .map(code -> new DocumentChunk(
                "[Code] " + code.getLanguage() + "\n" + code.getContent() +
                "\nPurpose: " + Objects.requireNonNullElse(code.getDescription(), ""),
                source,
                "code-" + code.getIndex(),
                code.getIndex(),
                "CODE"
            ))
            .collect(Collectors.toList());
    }
}
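As written, `describeImage` puts the base64 string inside the text prompt, so the model receives a long run of characters rather than an image. With OpenAI's Chat Completions API, an image is instead sent as its own `image_url` content part, and the URL may be a base64 `data:` URL. The sketch below only builds that message JSON with the JDK; it makes no HTTP call, the class and method names are our own, and the MIME type is whatever the caller passes (assumed to match the image bytes).

```java
import java.util.Base64;

final class VisionMessage {
    // Minimal JSON string escaping: quotes, backslashes and control characters.
    static String jsonString(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"' -> sb.append("\\\"");
                case '\\' -> sb.append("\\\\");
                case '\n' -> sb.append("\\n");
                case '\r' -> sb.append("\\r");
                case '\t' -> sb.append("\\t");
                default -> {
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
                }
            }
        }
        return sb.append('"').toString();
    }

    // One user message with a text part and an image part carried as a base64 data URL.
    static String userMessage(String prompt, byte[] imageBytes, String mimeType) {
        String dataUrl = "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(imageBytes);
        return "{\"role\":\"user\",\"content\":["
                + "{\"type\":\"text\",\"text\":" + jsonString(prompt) + "},"
                + "{\"type\":\"image_url\",\"image_url\":{\"url\":" + jsonString(dataUrl) + "}}"
                + "]}";
    }
}
```

The resulting object is one element of the `messages` array in a `POST /v1/chat/completions` request. Client libraries with multimodal support generally build this structure from an image attachment, which is preferable to hand-written JSON.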

RAG evaluation: quantitative metrics

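Without evaluation there is no optimization. RAG evaluation quantifies two dimensions, retrieval quality and generation quality, plus a few end-to-end metrics: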

Evaluation metrics

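Dimension	Metric	Meaning	How it is computed	Target
Retrieval	Recall@K	Share of all relevant documents that appear in the top K results	|relevant ∩ top-K results| / |relevant|	> 90%
Retrieval	MRR	Mean reciprocal rank of the first relevant document	avg(1 / first_relevant_rank)	> 0.8
Retrieval	nDCG@K	Normalized discounted cumulative gain	DCG@K / IDCG@K	> 0.85
Generation	Faithfulness	How faithful the answer is to the retrieved content	claims supported by the retrieved content / total claims	> 95%
Generation	Relevancy	How relevant the answer is to the query	LLM-judged score from 0 to 1	> 0.9
Generation	Correctness	Agreement between the answer and the ground truth	semantic similarity to the reference answer	> 0.85
End-to-end	Latency	Total time from query to answer	P95 latency	< 3 s
End-to-end	Refusal accuracy	Correctly declining questions that cannot be answered	correct refusals / questions that should be refused	> 80%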
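The three retrieval metrics can be computed directly from, for each test query, the ranked list of retrieved document ids and the set of ids labelled relevant. A minimal sketch, assuming binary relevance (every relevant document has gain 1; graded labels would change nDCG) and no duplicate ids in the ranked list:

```java
import java.util.List;
import java.util.Set;

final class RetrievalMetrics {
    // Recall@K: share of all relevant documents that appear in the top K results.
    static double recallAtK(List<String> ranked, Set<String> relevant, int k) {
        if (relevant.isEmpty()) return 0.0;
        long hits = ranked.stream().limit(k).filter(relevant::contains).count();
        return (double) hits / relevant.size();
    }

    // Reciprocal rank of the first relevant result (0 if none); MRR is its mean over queries.
    static double reciprocalRank(List<String> ranked, Set<String> relevant) {
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) return 1.0 / (i + 1);
        }
        return 0.0;
    }

    // nDCG@K with binary gains: each hit at 1-based rank r contributes 1 / log2(r + 1).
    static double ndcgAtK(List<String> ranked, Set<String> relevant, int k) {
        double dcg = 0.0;
        for (int i = 0; i < Math.min(k, ranked.size()); i++) {
            if (relevant.contains(ranked.get(i))) dcg += 1.0 / log2(i + 2);
        }
        // Ideal DCG: all relevant documents ranked at the top.
        double idcg = 0.0;
        for (int i = 0; i < Math.min(k, relevant.size()); i++) {
            idcg += 1.0 / log2(i + 2);
        }
        return idcg == 0.0 ? 0.0 : dcg / idcg;
    }

    static double mean(List<Double> values) {
        return values.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    private static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }
}
```

MRR is the mean of `reciprocalRank` over all test queries; Recall@K and nDCG@K are usually averaged over queries the same way.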

Evaluation framework implementation

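// RagResponse, SearchResult, EvaluationResult and the chat-model types are this article's own classes;
// their imports are omitted here.
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import org.springframework.stereotype.Service;

@Service
public class RagEvaluationService {

    // First standalone integer from 0 to 10 in the judge's reply ("8", "8/10", "Score: 8").
    private static final Pattern SCORE = Pattern.compile("\\b(10|\\d)\\b");

    private final OpenAiChatModel chatModel;
    private final ObjectMapper objectMapper = new ObjectMapper();

    public RagEvaluationService(OpenAiChatModel chatModel) {
        this.chatModel = chatModel;
    }

    // Scores are in [0, 1]. Double.NaN means the judge's output could not be parsed; that is not the
    // same as a bad answer, so exclude NaN values when averaging instead of counting them as 0.
    public EvaluationResult evaluate(RagResponse response, String groundTruth) {
        double faithfulness = evaluateFaithfulness(response);
        double relevancy = evaluateRelevancy(response);
        double correctness = evaluateCorrectness(response, groundTruth);

        return new EvaluationResult(faithfulness, relevancy, correctness);
    }

    private double evaluateFaithfulness(RagResponse response) {
        String prompt = """
            Evaluate how faithful the answer below is to the retrieved content.

            Retrieved content:
            %s

            Answer:
            %s

            Extract every claim in the answer and judge whether the retrieved content supports it.
            Output JSON only: {"total_claims": N, "supported_claims": M}
            """.formatted(
                response.getSources().stream()
                    .map(SearchResult::getContent)
                    .collect(Collectors.joining("\n")),
                response.getAnswer()
            );

        try {
            Map<String, Object> result = objectMapper.readValue(askJudge(prompt), Map.class);
            int total = ((Number) result.get("total_claims")).intValue();
            int supported = ((Number) result.get("supported_claims")).intValue();
            if (total <= 0) {
                return 1.0; // no factual claims (e.g. a refusal): nothing in the answer is unsupported
            }
            return Math.min(1.0, Math.max(0.0, (double) supported / total)); // guard against M > N
        } catch (Exception e) {
            return Double.NaN;
        }
    }

    private double evaluateRelevancy(RagResponse response) {
        String prompt = """
            Rate how relevant the answer is to the query, from 0 to 10.

            Query: %s
            Answer: %s

            Output a single integer from 0 to 10 and nothing else.
            """.formatted(response.getQuery(), response.getAnswer());
        return scoreOutOfTen(prompt);
    }

    private double evaluateCorrectness(RagResponse response, String groundTruth) {
        String prompt = """
            Rate how semantically consistent the answer is with the reference answer, from 0 to 10.

            Answer: %s
            Reference answer: %s

            Output a single integer from 0 to 10 and nothing else.
            """.formatted(response.getAnswer(), groundTruth);
        return scoreOutOfTen(prompt);
    }

    // Maps the judge's 0-10 integer to [0, 1]; NaN if no such integer is found.
    private double scoreOutOfTen(String prompt) {
        Matcher m = SCORE.matcher(askJudge(prompt));
        return m.find() ? Integer.parseInt(m.group(1)) / 10.0 : Double.NaN;
    }

    private String askJudge(String prompt) {
        ChatResponse llmResponse = chatModel.call(new ChatRequest(
            List.of(new Message("user", prompt)),
            ChatOptions.builder().withTemperature(0.0).build()
        ));
        // Strip an optional Markdown code fence around JSON output.
        return llmResponse.getResult().getOutput().getContent()
            .replaceAll("^\\s*`{3}(?:json)?\\s*", "")
            .replaceAll("\\s*`{3}\\s*$", "")
            .trim();
    }
}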
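Per-answer scores only become the numbers in the metrics table once they are aggregated over a test set. A minimal sketch of three aggregates: a mean that skips `NaN` (assuming judge failures are recorded as `NaN` rather than 0), P95 latency by the nearest-rank method, and refusal accuracy as correct refusals over questions that should be refused. `EvalCase` is a hypothetical record, not one of the article's types.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical per-question outcome from an offline run over a labelled test set.
record EvalCase(boolean shouldRefuse, boolean refused) {}

final class EvalSummary {
    // Mean that skips NaN entries (e.g. judge replies that could not be parsed).
    static double meanIgnoringNaN(double[] scores) {
        return Arrays.stream(scores).filter(s -> !Double.isNaN(s)).average().orElse(Double.NaN);
    }

    // P95 by the nearest-rank method: the ceil(0.95 * n)-th smallest sample.
    static long p95(long[] latenciesMs) {
        if (latenciesMs.length == 0) throw new IllegalArgumentException("no samples");
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (95 * sorted.length + 99) / 100; // integer ceil(0.95 * n), 1-based
        return sorted[rank - 1];
    }

    // Refusal accuracy: correctly refused / questions that should have been refused.
    static double refusalAccuracy(List<EvalCase> cases) {
        long shouldRefuse = cases.stream().filter(EvalCase::shouldRefuse).count();
        long correct = cases.stream().filter(c -> c.shouldRefuse() && c.refused()).count();
        return shouldRefuse == 0 ? Double.NaN : (double) correct / shouldRefuse;
    }
}
```

Integer arithmetic for the rank avoids floating-point surprises such as `0.95 * n` landing just below a whole number.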

Production-grade RAG architecture and performance optimization

Production architecture overview

┌──────────────────────────────────────────────────────────────────────┐
│                     Production RAG architecture                      │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌────────────────────────────────────────────────────────────┐      │
│  │             API Gateway (Spring Cloud Gateway)             │      │
│  │ JWT auth │ Sentinel rate limit │ Redis cache │ ELK logging │      │
│  └─────────────────────────────┬──────────────────────────────┘      │
│                                │                                     │
│  ┌─────────────────────────────▼──────────────────────────────┐      │
│  │                 RAG Service (Spring Boot)                  │      │
│  │                                                            │      │
│  │    ┌──────────────┐   ┌──────────────┐   ┌──────────────┐  │      │
│  │    │ QueryRewrite │   │ HybridSearch │   │ Rerank       │  │      │
│  │    │ Service      │ → │ Service      │ → │ Service      │  │      │
│  │    └──────────────┘   └──────────────┘   └──────┬───────┘  │      │
│  │                                                 │          │      │
│  │    ┌──────────────┐   ┌──────────────┐   ┌──────▼───────┐  │      │
│  │    │ Evaluation   │   │ Cache        │   │ Generation   │  │      │
│  │    │ Service      │   │ Service      │   │ Service      │  │      │
│  │    └──────────────┘   └──────────────┘   └──────────────┘  │      │
│  └─────────────────────────────┬──────────────────────────────┘      │
│                                │                                     │
│            ┌───────────────────┼───────────────────┐                 │
│            ▼                   ▼                   ▼                 │
│     ┌──────────────┐    ┌──────────────┐    ┌──────────────┐         │
│     │ Qdrant       │    │ Elasticsearch│    │ Redis        │         │
│     │ vector store │    │ BM25 index   │    │ cache layer  │         │
│     └──────────────┘    └──────────────┘    └──────────────┘         │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────┐      │
│  │            Document ingestion pipeline (async)             │      │
│  │  Upload → parse → chunk → embed → index → store metadata   │      │
│  │  (Kafka-driven, supports incremental updates)              │      │
│  └────────────────────────────────────────────────────────────┘      │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────┐      │
│  │                Monitoring and observability                │      │
│  │  Prometheus metrics │ Grafana dashboards │ Alerting        │      │
│  │  Retrieval-quality tracking                                │      │
│  └────────────────────────────────────────────────────────────┘      │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
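To make the request path through the RAG Service box concrete, here is a framework-free sketch: each stage is an injected function and an in-memory `HashMap` stands in for Redis. `RagRequestPipeline` and its constructor are hypothetical, not the article's Spring services; the point is only the call order and the early return on a cache hit.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical wiring of the RAG Service stages; each stage is a plain function, so the call
// order (cache -> rewrite -> hybrid retrieval -> rerank -> generation -> cache) is explicit.
final class RagRequestPipeline {
    private final Function<String, List<String>> rewrite;                // query -> rewritten queries
    private final Function<List<String>, List<String>> retrieve;         // queries -> candidate chunks
    private final BiFunction<String, List<String>, List<String>> rerank; // (query, candidates) -> top chunks
    private final BiFunction<String, List<String>, String> generate;     // (query, chunks) -> answer
    private final Map<String, String> cache = new HashMap<>();           // in-memory stand-in for Redis

    RagRequestPipeline(Function<String, List<String>> rewrite,
                       Function<List<String>, List<String>> retrieve,
                       BiFunction<String, List<String>, List<String>> rerank,
                       BiFunction<String, List<String>, String> generate) {
        this.rewrite = rewrite;
        this.retrieve = retrieve;
        this.rerank = rerank;
        this.generate = generate;
    }

    String answer(String query) {
        String cached = cache.get(query);
        if (cached != null) {
            return cached; // a hit skips retrieval and the LLM call entirely
        }
        List<String> candidates = retrieve.apply(rewrite.apply(query));
        String answer = generate.apply(query, rerank.apply(query, candidates));
        cache.put(query, answer);
        return answer;
    }
}
```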

Cache layer optimization

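import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Optional;
import java.util.concurrent.TimeUnit;
import org.apache.commons.codec.digest.DigestUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;

// RagResponse is this article's response type; it must be (de)serializable by Jackson.
@Service
public class RagCacheService {

    private static final Logger log = LoggerFactory.getLogger(RagCacheService.class);

    // Matches rag.cache.ttl-hours. After documents are re-ingested, cached answers can be up to this old.
    private static final long CACHE_TTL_HOURS = 24;

    private final RedisTemplate<String, String> redisTemplate;
    private final ObjectMapper objectMapper;

    public RagCacheService(RedisTemplate<String, String> redisTemplate, ObjectMapper objectMapper) {
        this.redisTemplate = redisTemplate;
        this.objectMapper = objectMapper;
    }

    public Optional<RagResponse> getCachedResponse(String query) {
        String cached = redisTemplate.opsForValue().get(generateCacheKey(query));
        if (cached == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(objectMapper.readValue(cached, RagResponse.class));
        } catch (Exception e) {
            // An unreadable entry (e.g. written by an older version of RagResponse) is treated as a miss.
            log.warn("Ignoring unreadable RAG cache entry", e);
            return Optional.empty();
        }
    }

    public void cacheResponse(String query, RagResponse response) {
        try {
            String json = objectMapper.writeValueAsString(response);
            redisTemplate.opsForValue().set(generateCacheKey(query), json, CACHE_TTL_HOURS, TimeUnit.HOURS);
        } catch (Exception e) {
            // A failed cache write must not break the main request path.
            log.warn("Failed to cache RAG response", e);
        }
    }

    // Exact-match key on the raw query string (MD5 is acceptable here: it is a cache key, not a security measure).
    private String generateCacheKey(String query) {
        return "rag:cache:" + DigestUtils.md5Hex(query);
    }
}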
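`generateCacheKey` hashes the raw query, so "What is RAG?" and "what is rag " are cached separately. A minimal sketch of normalizing before hashing (Unicode NFKC, trimming, collapsing whitespace, lowercasing) and of putting a knowledge-base version in the key so that re-ingesting documents stops old answers from being served. The `kbVersion` parameter is an assumption (for example a counter the ingestion pipeline increments); this is still exact matching, not semantic similarity.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.text.Normalizer;
import java.util.HexFormat;
import java.util.Locale;

final class CacheKeys {
    // NFKC folds full-width letters and punctuation to their ASCII forms before comparison.
    static String normalize(String query) {
        String s = Normalizer.normalize(query, Normalizer.Form.NFKC);
        return s.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    // Key = prefix + knowledge-base version + SHA-256 of the normalized query.
    static String cacheKey(String query, String kbVersion) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(normalize(query).getBytes(StandardCharsets.UTF_8));
            return "rag:cache:" + kbVersion + ":" + HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every JDK", e);
        }
    }
}
```

Bumping the version makes every old key unreachable at once; the stale entries then simply expire with their TTL.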

Document ingestion pipeline

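// DocumentParserService, SemanticChunker, MarkdownStructureChunker, EmbeddingService, QdrantVectorStore,
// DocumentMapper, DocumentChunk and DocumentEntity are this article's own classes; their imports are omitted.
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;
import org.springframework.web.multipart.MultipartFile;

@Service
public class DocumentIngestionPipeline {

    // Markdown only if some line STARTS with a heading (# to ######) or a code fence; a plain
    // contains("# ") check also fires on ordinary text such as "C# code".
    private static final Pattern MARKDOWN_HINT = Pattern.compile("(?m)^(#{1,6}\\s|`{3})");

    private final DocumentParserService parserService;
    private final SemanticChunker semanticChunker;
    private final MarkdownStructureChunker structureChunker;
    private final EmbeddingService embeddingService;
    private final QdrantVectorStore vectorStore;
    private final DocumentMapper documentMapper;

    public DocumentIngestionPipeline(DocumentParserService parserService,
                                     SemanticChunker semanticChunker,
                                     MarkdownStructureChunker structureChunker,
                                     EmbeddingService embeddingService,
                                     QdrantVectorStore vectorStore,
                                     DocumentMapper documentMapper) {
        this.parserService = parserService;
        this.semanticChunker = semanticChunker;
        this.structureChunker = structureChunker;
        this.embeddingService = embeddingService;
        this.vectorStore = vectorStore;
        this.documentMapper = documentMapper;
    }

    // Runs on the executor bean named "ingestionExecutor" (requires @EnableAsync).
    // Caution: a MultipartFile's temporary storage may be cleaned up once the HTTP request completes,
    // so in production read or save the upload before handing it to this asynchronous method.
    @Async("ingestionExecutor")
    public CompletableFuture<Void> ingestDocument(MultipartFile file, String source) {
        String content = parserService.parse(file);

        List<DocumentChunk> chunks = isMarkdown(content)
            ? structureChunker.chunk(content, source)
            : semanticChunker.chunk(content, source);
        if (chunks.isEmpty()) {
            return CompletableFuture.completedFuture(null);
        }

        List<float[]> embeddings = embeddingService.embedBatch(
            chunks.stream().map(DocumentChunk::getContent).collect(Collectors.toList())
        );
        for (int i = 0; i < chunks.size(); i++) {
            chunks.get(i).setEmbedding(embeddings.get(i));
        }

        // Dense vectors go to Qdrant for vector search ...
        vectorStore.upsertDocuments(chunks);

        // ... and each chunk is also stored as a row (setSearchText presumably feeds keyword retrieval).
        // The two writes are not atomic: a failure in between leaves the stores out of sync.
        for (DocumentChunk chunk : chunks) {
            DocumentEntity entity = new DocumentEntity();
            entity.setContent(chunk.getContent());
            entity.setSource(chunk.getSource());
            entity.setSection(chunk.getSection());
            entity.setEmbedding(chunk.getEmbedding());
            entity.setSearchText(chunk.getContent());
            documentMapper.insert(entity);
        }

        return CompletableFuture.completedFuture(null);
    }

    private boolean isMarkdown(String content) {
        return MARKDOWN_HINT.matcher(content).find();
    }
}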
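`embedBatch` receives every chunk of a document in a single call, while the configuration later in this section sets `rag.ingestion.batch-size: 100`. Embedding APIs limit how many inputs (and tokens) one request may carry, so large documents are normally embedded in fixed-size batches. A generic sketch; `Batching` and its method names are ours, and the real `embedBatch` is passed in as a function:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class Batching {
    // Splits items into consecutive batches of at most batchSize elements (views, not copies).
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be positive");
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < items.size(); from += batchSize) {
            batches.add(items.subList(from, Math.min(from + batchSize, items.size())));
        }
        return batches;
    }

    // Embeds texts batch by batch and concatenates the vectors, preserving input order.
    static List<float[]> embedInBatches(List<String> texts, int batchSize,
                                        Function<List<String>, List<float[]>> embedBatch) {
        List<float[]> vectors = new ArrayList<>(texts.size());
        for (List<String> batch : partition(texts, batchSize)) {
            List<float[]> out = embedBatch.apply(batch);
            if (out.size() != batch.size()) {
                throw new IllegalStateException("embedding count does not match batch size");
            }
            vectors.addAll(out);
        }
        return vectors;
    }
}
```

The size check matters because the pipeline pairs vectors with chunks by index; a short response would otherwise silently attach embeddings to the wrong chunks.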

Performance optimization checklist

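Optimization	Method	Effect
Embedding cache	Reuse the embedding computed for identical text	50%+ fewer API calls
Query cache	Cache answers to repeated (exact-match) queries in Redis	P95 latency down 60%
Batch embedding	Embed many texts in one API call	3-5x throughput
Asynchronous ingestion	Kafka-driven asynchronous document processing	Ingestion does not slow down queries
Connection pool tuning	Tune the Qdrant/Elasticsearch connection pools	2x concurrency
Precomputed HyDE	Pre-generate HyDE embeddings for popular queries	40% lower latency for popular queries
Index tuning	HNSW parameter tuning (M=16, ef=100)	Balances retrieval accuracy and speed
Sharding	Shard by document type or time	Narrower search scope, 30% faster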
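The first row, reusing embeddings for identical text, can be as simple as a memo keyed by a content hash and placed in front of the paid embedding call. A minimal in-process sketch; in production the cache would more likely live in Redis or alongside the stored chunks, and including the model name in the key (an assumption worth keeping) prevents stale vectors after a model switch. `CachingEmbedder` is a hypothetical class.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class CachingEmbedder {
    private final String modelName;
    private final Function<String, float[]> embedFn; // the real (paid) embedding call
    private final Map<String, float[]> cache = new ConcurrentHashMap<>();

    CachingEmbedder(String modelName, Function<String, float[]> embedFn) {
        this.modelName = modelName;
        this.embedFn = embedFn;
    }

    float[] embed(String text) {
        // computeIfAbsent runs embedFn at most once per distinct key; note that it holds a lock on the
        // map bin while the call runs, so very slow remote calls can contend with each other.
        return cache.computeIfAbsent(key(text), k -> embedFn.apply(text));
    }

    // SHA-256 of model name + NUL separator + text.
    private String key(String text) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] d = sha.digest((modelName + '\0' + text).getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(d);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```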

Spring Boot configuration

spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: gpt-4o
          temperature: 0.1
      embedding:
        options:
          model: text-embedding-3-small

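# Application-specific settings (read by the application's own configuration classes, not by Spring AI)
rag:
  vector-store:
    type: qdrant
    qdrant:
      host: localhost
      port: 6334              # Qdrant's gRPC port; the REST API listens on 6333
      collection: knowledge_base
      vector-size: 1536       # must equal the embedding dimension (1536 for text-embedding-3-small)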
  chunking:
    default-strategy: semantic
    chunk-size: 512
    overlap-size: 64
    similarity-threshold: 0.85
  retrieval:
    hybrid: true
    vector-weight: 0.7
    keyword-weight: 0.3
    rrf-k: 60
    top-k: 10
  cache:
    enabled: true
    ttl-hours: 24
  ingestion:
    async: true
    batch-size: 100
    pool-size: 4
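The `retrieval` block has three settings that only make sense together: `vector-weight`, `keyword-weight`, and `rrf-k`. Assuming the application applies them as weighted Reciprocal Rank Fusion (each result list contributes `weight / (k + rank)` per document, with 1-based ranks), fusion looks like the sketch below. Because RRF uses only ranks, cosine similarities and BM25 scores never have to be put on a common scale. `WeightedRrf` is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class WeightedRrf {
    // score(d) = vectorWeight / (k + rankVector(d)) + keywordWeight / (k + rankKeyword(d))
    static List<String> fuse(List<String> vectorRanked, List<String> keywordRanked,
                             double vectorWeight, double keywordWeight, int k, int topK) {
        Map<String, Double> scores = new HashMap<>();
        addList(scores, vectorRanked, vectorWeight, k);
        addList(scores, keywordRanked, keywordWeight, k);

        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        // Highest fused score first; ties broken by id for deterministic output.
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Double>comparingByKey()));

        List<String> fused = new ArrayList<>();
        for (int i = 0; i < Math.min(topK, entries.size()); i++) {
            fused.add(entries.get(i).getKey());
        }
        return fused;
    }

    private static void addList(Map<String, Double> scores, List<String> ranked, double weight, int k) {
        for (int i = 0; i < ranked.size(); i++) {
            scores.merge(ranked.get(i), weight / (k + i + 1), Double::sum); // rank is 1-based
        }
    }
}
```

With the configured 0.7/0.3 weights, a document ranked first by vector search and third by BM25 beats one ranked the other way round; swapping the weights reverses that order.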

Summary

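Stage	2026 best practice	Key point
Query processing	Query rewriting + HyDE	Retrieve from several angles to improve recall
Retrieval	Hybrid retrieval (vector + BM25) + RRF fusion	70% vector, 30% keyword
Ranking	Cross-encoder reranking	Fine-grained query-document relevance scoring
Chunking	Semantic > structural > fixed-size	Chunking is the foundation of RAG
Agents	Agentic RAG: multi-round retrieval + self-evaluation	The agent decides when to retrieve
Multimodal	Unified retrieval over images, tables, and code	Visual embeddings + text descriptions
Evaluation	Faithfulness / Relevancy / Correctness	Without evaluation there is no optimization
Production	Caching + async processing + monitoring + tuning	Engineering determines whether RAG succeeds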

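RAG is not as simple as "retrieve, then generate". It is a system in which every stage has to be designed with care: from query rewriting to hybrid retrieval, and from chunking to agentic reasoning, each stage shapes the quality of the final answer. In 2026, Advanced RAG is the baseline, Agentic RAG is the frontier, and an evaluation framework is the foundation for continuous improvement.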