OpenTelemetry LLM Observability in Practice: A Complete 2026 Guide to End-to-End Monitoring of AI Applications

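Why LLM Applications Need Dedicated Observability in 2026

Traditional application monitoring focuses on latency, throughput, and error rate. LLM applications add observability dimensions that those metrics do not cover: token usage, API call cost, hallucination rate, and context quality.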

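Dimension	Traditional application monitoring	LLM application monitoring
Latency	Request latency	Time to first token (TTFT), total latency
Throughput	QPS	Tokens per second
Errors	Error rate	Error rate + hallucination rate
Cost	Server cost	API call cost (billed per token)
Quality	None	Accuracy, relevance, completeness
Context	None	Prompt length, context window utilization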

OpenTelemetry Semantic Conventions for LLMs (GenAI)

Span: gen_ai.client.chat
  Attributes:
    gen_ai.system              = "openai"
    gen_ai.request.model       = "gpt-4o"
    gen_ai.usage.input_tokens  = 1523
    gen_ai.usage.output_tokens = 847
    gen_ai.usage.total_tokens  = 2370   (custom attribute: the conventions define only the input and output counts)

Complete Python Implementation

pip install opentelemetry-api opentelemetry-sdk \
  opentelemetry-exporter-otlp openai

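from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Traces: batch spans and send them to a local OTLP/gRPC collector.
trace_provider = TracerProvider()
trace_provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(trace_provider)

# Metrics: a MeterProvider must be registered, otherwise the meter below is a no-op and nothing is exported.
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://localhost:4317", insecure=True)
)
metrics.set_meter_provider(MeterProvider(metric_readers=[metric_reader]))

tracer = trace.get_tracer("llm-app", "1.0.0")
meter = metrics.get_meter("llm-app", "1.0.0")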

Manually Tracing LLM Calls

import time

import openai
from opentelemetry.trace import Status, StatusCode

# The client reads OPENAI_API_KEY from the environment; avoid hard-coding keys.
client = openai.OpenAI()
token_counter = meter.create_counter("gen_ai.client.token.usage", unit="{token}")
# The GenAI semantic conventions name this metric "duration" and record it in seconds.
llm_duration = meter.create_histogram("gen_ai.client.operation.duration", unit="s")

def traced_chat_completion(messages, model="gpt-4o", temperature=0.7, max_tokens=4096):
    with tracer.start_as_current_span("gen_ai.client.chat") as span:
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", model)
        start_time = time.perf_counter()  # monotonic clock, unaffected by system clock changes
        try:
            response = client.chat.completions.create(
                model=model, messages=messages,
                temperature=temperature, max_tokens=max_tokens
            )
            duration_s = time.perf_counter() - start_time
            content = response.choices[0].message.content
            usage = {
                "prompt_tokens": response.usage.prompt_tokens,
                "completion_tokens": response.usage.completion_tokens,
                "total_tokens": response.usage.total_tokens
            }
            # Record the model that actually served the request, plus token counts.
            span.set_attribute("gen_ai.response.model", response.model)
            span.set_attribute("gen_ai.usage.input_tokens", usage["prompt_tokens"])
            span.set_attribute("gen_ai.usage.output_tokens", usage["completion_tokens"])
            span.set_status(Status(StatusCode.OK))
            # Emit the same counts as metrics, split by token type.
            token_counter.add(usage["prompt_tokens"], {"gen_ai.token.type": "input", "gen_ai.request.model": model})
            token_counter.add(usage["completion_tokens"], {"gen_ai.token.type": "output", "gen_ai.request.model": model})
            llm_duration.record(duration_s, {"gen_ai.request.model": model})
            return {"content": content, "usage": usage, "latency_ms": duration_s * 1000}
        except Exception as e:
            # Mark the span as failed; start_as_current_span records the exception itself when it propagates.
            span.set_status(Status(StatusCode.ERROR, str(e)))
            raise

Cost Monitoring

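# Example rates in USD per token. Provider pricing changes, so verify these against the current price list.
COST_PER_TOKEN = {
    "gpt-4o": {"input": 0.000005, "output": 0.000015},
    "gpt-4o-mini": {"input": 0.00000015, "output": 0.0000006},
}

def calculate_cost(model, input_tokens, output_tokens):
    # Unknown models fall back to zero cost, which silently under-reports spend.
    rates = COST_PER_TOKEN.get(model, {"input": 0, "output": 0})
    return (input_tokens * rates["input"]) + (output_tokens * rates["output"])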
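
As a worked example, take the token counts from the span listing earlier (1,523 input, 847 output) and the example gpt-4o rates above: 1523 × 0.000005 + 847 × 0.000015 = 0.007615 + 0.012705 = 0.02032 USD. The sketch below is one possible variant of the calculation, not a definitive implementation. `RATES_USD_PER_MILLION` and `estimate_cost_usd` are hypothetical names, the rates are the same illustrative values rescaled to one million tokens, and `Decimal` is used only to avoid floating-point rounding in small amounts.

```python
from decimal import Decimal

# Example rates in USD per one million tokens (the article's sample values,
# rescaled). They are illustrative and may not match current provider pricing.
RATES_USD_PER_MILLION = {
    "gpt-4o": {"input": Decimal("5.00"), "output": Decimal("15.00")},
    "gpt-4o-mini": {"input": Decimal("0.15"), "output": Decimal("0.60")},
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Return the estimated cost of one call in USD.

    Raises KeyError for models missing from the rate table, so an unpriced
    model is noticed instead of being counted as free.
    """
    rates = RATES_USD_PER_MILLION[model]
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / Decimal(1_000_000)


if __name__ == "__main__":
    print(estimate_cost_usd("gpt-4o", 1523, 847))
```

Raising on an unknown model is a design choice: it trades a possible exception in the request path for cost dashboards that cannot quietly drift toward zero when a new model is introduced.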

Quality Monitoring: Hallucination Detection

import json

def evaluate_answer_quality(question, answer, context):
    # LLM-as-judge: ask a second model to score the answer against the retrieved context.
    with tracer.start_as_current_span("llm.quality.evaluation") as span:
        eval_prompt = [
            {"role": "system", "content": "Evaluate the quality of the answer. Return JSON: {grounded, relevance, completeness}"},
            {"role": "user", "content": f"Question: {question}\nReference: {context}\nAnswer: {answer}"}
        ]
        result = traced_chat_completion(eval_prompt, model="gpt-4o-mini", temperature=0)
        # json.loads raises if the evaluator returns anything other than a bare JSON object.
        evaluation = json.loads(result["content"])
        span.set_attribute("llm.quality.grounded", evaluation["grounded"])
        return evaluation
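
The evaluator's reply is itself model output, so it may arrive wrapped in extra prose or with keys missing. The following is a minimal sketch of a more defensive parser, assuming the {grounded, relevance, completeness} shape requested above and assuming that grounded is a boolean. `parse_evaluation` and `hallucination_rate` are hypothetical helper names, not part of any library.

```python
import json
import re
from typing import Optional

REQUIRED_KEYS = ("grounded", "relevance", "completeness")


def parse_evaluation(raw: str) -> Optional[dict]:
    """Parse the evaluator's reply; return None if it is not usable JSON."""
    # Take the outermost {...} so replies wrapped in prose still parse.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or any(key not in data for key in REQUIRED_KEYS):
        return None
    return data


def hallucination_rate(evaluations: list) -> float:
    """Share of usable evaluations whose answer was judged not grounded."""
    parsed = [e for e in evaluations if e is not None]
    if not parsed:
        return 0.0
    ungrounded = sum(1 for e in parsed if not e["grounded"])
    return ungrounded / len(parsed)
```

Unparseable replies are excluded from the rate rather than counted as hallucinations. Counting them separately would make a failing evaluator visible instead of letting it distort the quality metric.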

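LLM Observability Tool Comparison

Tool	OpenTelemetry	Token tracking	Cost monitoring	Quality monitoring	Self-hosted
OTel + Grafana	Native	Custom	Custom	Custom	Yes
Langfuse	Integration	Native	Native	Native	Yes
Helicone	Proxy	Native	Native	Basic	No
Arize Phoenix	Integration	Native	Native	Native	Yes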
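5 Common Pitfalls

1. Ignoring token counts for streaming responses

With streaming, the usage field can be empty. With the OpenAI Chat Completions API, usage is sent only when the request sets stream_options={"include_usage": True}, and it arrives on the final chunk, so token metrics stay at zero unless that chunk is read.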
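
A minimal sketch of reading usage from a stream, assuming chunks shaped like the OpenAI Chat Completions streaming objects: text in `choices[0].delta.content`, and a final chunk with an empty `choices` list and a populated `usage`. `consume_stream` is a hypothetical helper. It accepts any iterable of chunks, so it can be exercised without a network call, and it also measures time to first token (TTFT).

```python
import time
from typing import Any, Iterable, Optional


def consume_stream(chunks: Iterable[Any]) -> dict:
    """Collect streamed text, token usage, and time to first token (TTFT)."""
    start = time.perf_counter()
    ttft_s: Optional[float] = None
    parts = []
    usage = None
    for chunk in chunks:
        # When usage reporting is enabled, the final chunk carries usage
        # and has no choices.
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text:
            if ttft_s is None:
                ttft_s = time.perf_counter() - start
            parts.append(text)
    return {
        "content": "".join(parts),
        "input_tokens": usage.prompt_tokens if usage else None,
        "output_tokens": usage.completion_tokens if usage else None,
        "ttft_s": ttft_s,
    }
```

With the real client, the iterable would be the return value of client.chat.completions.create(..., stream=True, stream_options={"include_usage": True}). A result with input_tokens set to None signals that usage was never delivered, which is different from a genuine count of zero.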

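2. Misconfigured sampling rate

A high-traffic LLM application that samples 100% of requests generates a large volume of trace data, which raises export and storage cost. Sampling a fraction of traces at the root keeps the volume manageable.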
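
One way to reduce volume is ratio-based head sampling in the SDK. The configuration sketch below uses the OpenTelemetry Python SDK's `ParentBased` and `TraceIdRatioBased` samplers. The 10% ratio is an illustrative assumption, not a recommendation from the source.

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 10% of new traces. Child spans follow their parent's decision,
# so a sampled trace is kept whole rather than in fragments.
SAMPLE_RATIO = 0.10  # illustrative value; tune to traffic and budget
sampler = ParentBased(root=TraceIdRatioBased(SAMPLE_RATIO))
trace_provider = TracerProvider(sampler=sampler)
```

Head sampling decides when the root span starts, before it is known whether the request will fail. Keeping every error trace therefore needs a tail-based approach, for example in the OpenTelemetry Collector, rather than this setting alone.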

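3. Prompt content leaking into traces

Avoid recording full prompts in span attributes. Prompts can contain personal or confidential data, and anything written to a span is visible to everyone with access to the tracing backend.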
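
One alternative is to record a fingerprint and size of the prompt instead of its text. The sketch below assumes messages are dicts with string "role" and "content" values. `prompt_fingerprint` and the `llm.prompt.*` attribute names are hypothetical, not part of the semantic conventions.

```python
import hashlib


def prompt_fingerprint(messages: list) -> dict:
    """Summarize chat messages as span-safe attributes without the raw text."""
    digest = hashlib.sha256()
    total_chars = 0
    for message in messages:
        content = message.get("content") or ""
        # Hash role and content together, separated by a NUL byte, so the
        # same text under a different role yields a different fingerprint.
        digest.update(message.get("role", "").encode("utf-8"))
        digest.update(b"\x00")
        digest.update(content.encode("utf-8"))
        digest.update(b"\x00")
        total_chars += len(content)
    return {
        "llm.prompt.sha256": digest.hexdigest(),
        "llm.prompt.message_count": len(messages),
        "llm.prompt.char_count": total_chars,
    }
```

The result can be attached with span.set_attributes(prompt_fingerprint(messages)). The hash makes identical prompts groupable across traces, but a short or predictable prompt can still be guessed from its hash, so this reduces exposure rather than eliminating it.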

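4. Not distinguishing model versions

The same model name can map to different versions: an alias such as gpt-4o can resolve to different dated snapshots over time. Record the model name returned in the response (gen_ai.response.model) alongside gen_ai.request.model so that shifts in quality or cost can be traced to a version change.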

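5. Missing end-to-end correlation

The spans in a RAG pipeline (retrieval, LLM call, evaluation) need to share a parent span so that they appear in a single trace.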

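10 Troubleshooting Scenarios

#	Symptom	Likely cause	How to check
1	Traces do not appear in Grafana	OTLP exporter misconfigured	Check the endpoint and authentication
2	Token count is zero	usage not captured from the streaming response	Read usage from the final chunk
3	Cost calculation is far off	Price table is out of date	Update COST_PER_TOKEN
4	Hallucination rate is abnormally high	Poor RAG retrieval quality	Check the vector index
5	Span attributes are missing	Semantic convention version mismatch	Update the OTel SDK
6	Memory leak	TracerProvider not shut down	Call shutdown() before the program exits
7	Key traces lost to sampling	Sampling rate too low	Sample error requests at 100%
8	TTFT is abnormally high	Network latency or cold start	Check the network and model warm-up
9	The quality evaluation itself hallucinates	Unreliable evaluation model	Use a stronger evaluation model
10	Metrics and traces do not match	Inconsistent time zones	Use UTC everywhere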

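Summary: Observability for LLM applications needs to cover token usage, API cost, hallucination rate, and answer quality. OpenTelemetry's GenAI semantic conventions provide a standardized way to record these signals in 2026. Combined with Grafana dashboards, they give end-to-end monitoring of an LLM application, from cost to quality.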