OpenTelemetry LLM Observability in Practice: A Complete 2026 Guide to End-to-End Monitoring of AI Applications


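Why LLM Applications Need Dedicated Observability in 2026

Traditional application monitoring focuses on latency, throughput, and error rate. LLM applications add new observability dimensions: token usage, API call cost, hallucination rate, and context quality. An LLM application can show normal response times and a zero error rate while costing $0.50 per call and hallucinating in 30% of its responses, and traditional monitoring would show none of that.

By 2026, OpenTelemetry has published semantic conventions for LLMs (the gen_ai.* attributes and metrics), which provide a standardized approach to LLM observability.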

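| Dimension | Traditional application monitoring | LLM application monitoring |
| --- | --- | --- |
| Latency | Request latency | Time to first token (TTFT), total latency |
| Throughput | QPS | Tokens per second |
| Errors | Error rate | Error rate + hallucination rate |
| Cost | Server cost | API call cost (billed per token) |
| Quality | | Accuracy, relevance, completeness |
| Context | | Prompt length, context window utilization |
| Caching | | Cache hit rate |
| Model | | Model version, temperature setting |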
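
To make a few of the LLM-specific rows concrete, the sketch below derives TTFT, output throughput, and context window utilization from raw timestamps and token counts. The function name `llm_call_stats`, the `CallStats` type, and the sample numbers are illustrative assumptions, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class CallStats:
    ttft_ms: float              # time to first token, in milliseconds
    tokens_per_second: float    # output throughput once generation has started
    context_utilization: float  # share of the context window taken by the prompt


def llm_call_stats(
    start: float,
    first_token: float,
    end: float,
    input_tokens: int,
    output_tokens: int,
    context_window: int,
) -> CallStats:
    """Derive LLM-specific metrics from timestamps (in seconds) and token counts."""
    generation_time = end - first_token
    return CallStats(
        ttft_ms=(first_token - start) * 1000,
        tokens_per_second=output_tokens / generation_time if generation_time > 0 else 0.0,
        context_utilization=input_tokens / context_window,
    )


# Hypothetical call: first token after 0.5 s, finished after 4.5 s
stats = llm_call_stats(
    start=0.0, first_token=0.5, end=4.5,
    input_tokens=1600, output_tokens=800, context_window=128_000,
)
print(stats)
```

None of these values shows up in a plain request-latency or error-rate chart, which is why they need their own instrumentation.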

OpenTelemetry Semantic Conventions for LLMs

OpenTelemetry defines standard span attributes and metrics for LLM calls:

Span: gen_ai.client.chat
  Attributes:
    gen_ai.system              = "openai"
    gen_ai.request.model       = "gpt-4o"
    gen_ai.request.max_tokens  = 4096
    gen_ai.request.temperature = 0.7
    gen_ai.response.model      = "gpt-4o-2024-05-13"
    gen_ai.response.finish_reasons = ["stop"]
    gen_ai.usage.input_tokens  = 1523
    gen_ai.usage.output_tokens = 847
    gen_ai.usage.total_tokens  = 2370

Metric: gen_ai.client.token.usage
  Attributes:
    gen_ai.system        = "openai"
    gen_ai.request.model = "gpt-4o"
    gen_ai.token.type    = "input" | "output"
  Value: 1523 (input) and 847 (output), recorded as separate data points
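
Keeping the attribute names in one helper avoids typos across call sites. The sketch below builds a dictionary with the request, response, and usage attributes listed above; the helper name `genai_span_attributes` is a hypothetical choice, and the values are the sample ones from the listing.

```python
from typing import Optional


def genai_span_attributes(
    system: str,
    request_model: str,
    response_model: str,
    input_tokens: int,
    output_tokens: int,
    finish_reasons: list,
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,
) -> dict:
    """Build gen_ai.* span attributes, leaving out optional values that are unset."""
    attrs = {
        "gen_ai.system": system,
        "gen_ai.request.model": request_model,
        "gen_ai.request.temperature": temperature,
        "gen_ai.request.max_tokens": max_tokens,
        "gen_ai.response.model": response_model,
        "gen_ai.response.finish_reasons": list(finish_reasons),
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    # Span attribute values cannot be None, so drop unset keys entirely
    return {key: value for key, value in attrs.items() if value is not None}


attrs = genai_span_attributes(
    system="openai",
    request_model="gpt-4o",
    response_model="gpt-4o-2024-05-13",
    input_tokens=1523,
    output_tokens=847,
    finish_reasons=["stop"],
    temperature=0.7,
)
for key, value in attrs.items():
    print(f"{key} = {value!r}")
```

With the OpenTelemetry API, a dictionary like this can be applied in one call through `span.set_attributes(attrs)`.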

Full Python Implementation: Tracing LLM Calls, Token Usage, and Latency

Install dependencies

pip install opentelemetry-api opentelemetry-sdk \
  opentelemetry-exporter-otlp \
  opentelemetry-instrumentation-openai \
  openai grafana-api

Basic configuration

from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Configure trace export
trace_provider = TracerProvider()
trace_provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
    )
)
trace.set_tracer_provider(trace_provider)

# Configure metric export
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://localhost:4317", insecure=True),
    export_interval_millis=15000
)
metrics.set_meter_provider(MeterProvider(metric_readers=[metric_reader]))

tracer = trace.get_tracer("llm-app", "1.0.0")
meter = metrics.get_meter("llm-app", "1.0.0")

Manually tracing LLM calls

import openai
import time
from opentelemetry import trace, metrics
from opentelemetry.trace import Status, StatusCode

# The client reads OPENAI_API_KEY from the environment; avoid hardcoding keys in source
client = openai.OpenAI()

token_counter = meter.create_counter(
    "gen_ai.client.token.usage",
    description="Token usage for LLM calls",
    unit="{token}"  # UCUM annotation unit, as used by the OTel GenAI conventions
)

llm_latency = meter.create_histogram(
    "gen_ai.client.operation.latency",
    description="Latency of LLM operations",
    unit="ms"
)

ttft_histogram = meter.create_histogram(
    "gen_ai.client.time_to_first_token",
    description="Time to first token",
    unit="ms"
)

# Request counter; backs the gen_ai_client_chat_total series used in the cost dashboard
request_counter = meter.create_counter(
    "gen_ai.client.chat",
    description="Number of LLM chat calls"
)

def traced_chat_completion(
    messages: list[dict],
    model: str = "gpt-4o",
    temperature: float = 0.7,
    max_tokens: int = 4096,
    stream: bool = False
) -> dict:
    """Call the chat completions API and record a span plus token and latency metrics."""
    metric_attrs = {"gen_ai.system": "openai", "gen_ai.request.model": model}
    with tracer.start_as_current_span("gen_ai.client.chat") as span:
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.temperature", temperature)
        span.set_attribute("gen_ai.request.max_tokens", max_tokens)

        # perf_counter() is monotonic, which makes it safer than time.time() for durations
        start_time = time.perf_counter()
        first_token_time = None
        usage = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}

        try:
            if stream:
                response = client.chat.completions.create(
                    model=model,
                    messages=messages,
                    temperature=temperature,
                    max_tokens=max_tokens,
                    stream=True,
                    # Ask the API to append a final chunk that carries token usage
                    stream_options={"include_usage": True}
                )
                content_parts = []
                for chunk in response:
                    # The usage-only final chunk has an empty choices list
                    if chunk.usage:
                        usage = {
                            "prompt_tokens": chunk.usage.prompt_tokens,
                            "completion_tokens": chunk.usage.completion_tokens,
                            "total_tokens": chunk.usage.total_tokens
                        }
                    if chunk.choices and chunk.choices[0].delta.content:
                        # Time the first chunk that actually carries content
                        if first_token_time is None:
                            first_token_time = time.perf_counter()
                        content_parts.append(chunk.choices[0].delta.content)
                content = "".join(content_parts)
            else:
                response = client.chat.completions.create(
                    model=model,
                    messages=messages,
                    temperature=temperature,
                    max_tokens=max_tokens
                )
                content = response.choices[0].message.content
                usage = {
                    "prompt_tokens": response.usage.prompt_tokens,
                    "completion_tokens": response.usage.completion_tokens,
                    "total_tokens": response.usage.total_tokens
                }

            latency_ms = (time.perf_counter() - start_time) * 1000
            # TTFT is only observable when streaming; it stays None otherwise
            ttft_ms = (first_token_time - start_time) * 1000 if first_token_time is not None else None

            span.set_attribute("gen_ai.usage.input_tokens", usage["prompt_tokens"])
            span.set_attribute("gen_ai.usage.output_tokens", usage["completion_tokens"])
            span.set_attribute("gen_ai.usage.total_tokens", usage["total_tokens"])
            span.set_status(Status(StatusCode.OK))

            request_counter.add(1, metric_attrs)
            token_counter.add(usage["prompt_tokens"], {**metric_attrs, "gen_ai.token.type": "input"})
            token_counter.add(usage["completion_tokens"], {**metric_attrs, "gen_ai.token.type": "output"})
            llm_latency.record(latency_ms, metric_attrs)
            if ttft_ms is not None:
                # Skip non-streaming calls so zero values do not skew the histogram
                ttft_histogram.record(ttft_ms, metric_attrs)

            return {
                "content": content,
                "usage": usage,
                "latency_ms": latency_ms,
                "ttft_ms": ttft_ms
            }

        except Exception as e:
            span.set_status(Status(StatusCode.ERROR, str(e)))
            span.record_exception(e)
            raise
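
The timing logic is easier to check in isolation. The following sketch, which assumes a stream of plain text pieces instead of real API chunks, measures TTFT at the first non-empty piece; `consume_stream` and the injectable `clock` parameter are hypothetical names introduced only for this illustration.

```python
import time
from typing import Callable, Iterable, Optional


def consume_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Join streamed text pieces and time the first non-empty one."""
    start = clock()
    first_token_at: Optional[float] = None
    parts = []
    for piece in chunks:
        # Empty pieces (for example role-only deltas) do not count as a first token
        if piece and first_token_at is None:
            first_token_at = clock()
        parts.append(piece)
    end = clock()
    return {
        "content": "".join(parts),
        "latency_ms": (end - start) * 1000,
        "ttft_ms": (first_token_at - start) * 1000 if first_token_at is not None else None,
    }


# A fake clock makes the timing deterministic for the demonstration
ticks = iter([0.0, 0.25, 1.0])
result = consume_stream(["", "Hello", ", world"], clock=lambda: next(ticks))
print(result)
```

Passing the clock in as a parameter is a design choice for testability; production code would simply use the default `time.perf_counter`.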

Tracing a RAG pipeline

def traced_rag_query(question: str, top_k: int = 5) -> dict:
    with tracer.start_as_current_span("rag.pipeline") as rag_span:
        pipeline_start = time.perf_counter()
        # Record the length, not the raw question, which may contain user data
        rag_span.set_attribute("rag.question_length", len(question))
        rag_span.set_attribute("rag.top_k", top_k)

        # Stage 1: retrieval (vector_search is assumed to be defined elsewhere)
        with tracer.start_as_current_span("rag.retrieve") as retrieve_span:
            retrieved_docs = vector_search(question, top_k)
            retrieve_span.set_attribute("rag.retrieved_count", len(retrieved_docs))
            retrieve_span.set_attribute("rag.retrieval_method", "vector_similarity")

        # Stage 2: build the prompt
        with tracer.start_as_current_span("rag.construct_prompt") as prompt_span:
            context_text = "\n".join([doc.content for doc in retrieved_docs])
            messages = [
                {"role": "system", "content": "Answer the question based on the reference material below."},
                {"role": "user", "content": f"Reference material:\n{context_text}\n\nQuestion: {question}"}
            ]
            prompt_span.set_attribute("rag.context_length", len(context_text))
            # Rough heuristic of about 4 characters per token; use a tokenizer for exact counts
            prompt_span.set_attribute("rag.prompt_tokens_estimate", len(context_text) // 4)

        # Stage 3: generation
        with tracer.start_as_current_span("rag.generate") as gen_span:
            result = traced_chat_completion(messages)
            gen_span.set_attribute("rag.answer_length", len(result["content"]))

        # Quality evaluation: the evaluator returns a dict, so record its fields one by one
        with tracer.start_as_current_span("rag.quality_check") as quality_span:
            quality_score = evaluate_answer_quality(question, result["content"], context_text)
            quality_span.set_attribute("rag.quality.grounded", quality_score["grounded"])
            quality_span.set_attribute("rag.quality.relevance", quality_score["relevance"])
            quality_span.set_attribute("rag.quality.completeness", quality_score["completeness"])

        # Measure the whole pipeline, not only the generation step
        total_latency_ms = (time.perf_counter() - pipeline_start) * 1000
        rag_span.set_attribute("rag.total_latency_ms", total_latency_ms)
        return {
            "answer": result["content"],
            "usage": result["usage"],
            "quality_score": quality_score,
            "latency_ms": total_latency_ms
        }
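
The pipeline calls `vector_search`, which the article does not define. To run the pipeline locally, a stand-in along the following lines may be enough; it is a toy sketch that assumes an in-memory corpus and uses word counts in place of a real embedding model, so `Doc`, `CORPUS`, `embed`, and `cosine` are all hypothetical names.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Doc:
    content: str


# Hypothetical in-memory corpus; a real system would query a vector database
CORPUS = [
    Doc("OpenTelemetry exports traces and metrics over OTLP."),
    Doc("Grafana dashboards visualize token usage and cost."),
    Doc("Bananas are rich in potassium."),
]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts, standing in for a real embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def vector_search(question: str, top_k: int = 5) -> list:
    """Return the top_k corpus documents most similar to the question."""
    query = embed(question)
    ranked = sorted(CORPUS, key=lambda doc: cosine(query, embed(doc.content)), reverse=True)
    return ranked[:top_k]


hits = vector_search("How do I visualize token cost in Grafana?", top_k=1)
print(hits[0].content)
```

The returned objects expose a `content` attribute, which is the only thing the pipeline above reads from them.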

Grafana Dashboard Configuration

Cost monitoring dashboard

{
  "dashboard": {
    "title": "LLM Cost Monitoring",
    "panels": [
      {
        "title": "Daily API Spend ($)",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(increase(gen_ai_client_token_usage_total{gen_ai_token_type=\"input\"}[$__range])) * 0.000005 + sum(increase(gen_ai_client_token_usage_total{gen_ai_token_type=\"output\"}[$__range])) * 0.000015"
          }
        ]
      },
      {
        "title": "Token Usage by Model",
        "type": "piechart",
        "targets": [
          {
            "expr": "sum by (gen_ai_request_model) (increase(gen_ai_client_token_usage_total[$__range]))"
          }
        ]
      },
      {
        "title": "Cost per Request",
        "type": "stat",
        "targets": [
          {
            "expr": "(sum(increase(gen_ai_client_token_usage_total[$__range])) * 0.00001) / sum(increase(gen_ai_client_chat_total[$__range]))"
          }
        ]
      }
    ]
  }
}

Quality monitoring dashboard

{
  "dashboard": {
    "title": "LLM Quality Monitoring",
    "panels": [
      {
        "title": "Hallucination Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(llm_hallucination_total[$__rate_interval])) / sum(rate(llm_total_responses[$__rate_interval]))"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                {"value": 0, "color": "green"},
                {"value": 0.1, "color": "yellow"},
                {"value": 0.2, "color": "red"}
              ]
            },
            "max": 1
          }
        }
      },
      {
        "title": "Average Quality Score",
        "type": "timeseries",
        "targets": [
          {
            "expr": "avg(llm_quality_score)"
          }
        ]
      }
    ]
  }
}

Cost Monitoring: Token Usage and API Spend

Cost calculation model

COST_PER_TOKEN = {
    "gpt-4o": {"input": 0.000005, "output": 0.000015},
    "gpt-4o-mini": {"input": 0.00000015, "output": 0.0000006},
    "gpt-4-turbo": {"input": 0.00001, "output": 0.00003},
    "claude-3.5-sonnet": {"input": 0.000003, "output": 0.000015},
    "claude-3-haiku": {"input": 0.00000025, "output": 0.00000125},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    rates = COST_PER_TOKEN.get(model, {"input": 0, "output": 0})
    return (input_tokens * rates["input"]) + (output_tokens * rates["output"])

# Example: 1,000 GPT-4o calls in a month, averaging 1,500 input + 800 output tokens each
cost = calculate_cost("gpt-4o", 1500 * 1000, 800 * 1000)
print(f"Monthly cost: ${cost:.2f}")  # Monthly cost: $19.50
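
One weakness of a lookup with a zero-price default is that a model missing from the table silently costs $0. The sketch below, which assumes usage records arrive as dictionaries with `model`, `input_tokens`, and `output_tokens` keys, sums cost per model and reports unknown models separately. `PRICES` repeats two entries from the table above, and `cost_by_model` is a hypothetical helper name.

```python
from collections import defaultdict

# Per-token prices copied from the table above; verify against current provider pricing
PRICES = {
    "gpt-4o": {"input": 0.000005, "output": 0.000015},
    "gpt-4o-mini": {"input": 0.00000015, "output": 0.0000006},
}


def cost_by_model(records: list) -> tuple:
    """Sum cost per model and collect models that are missing from the price table."""
    totals = defaultdict(float)
    unknown = set()
    for record in records:
        rates = PRICES.get(record["model"])
        if rates is None:
            # Surface the gap instead of silently pricing the call at $0
            unknown.add(record["model"])
            continue
        totals[record["model"]] += (
            record["input_tokens"] * rates["input"]
            + record["output_tokens"] * rates["output"]
        )
    return dict(totals), sorted(unknown)


# Hypothetical usage records
records = [
    {"model": "gpt-4o", "input_tokens": 1000, "output_tokens": 1000},
    {"model": "gpt-4o-mini", "input_tokens": 1_000_000, "output_tokens": 0},
    {"model": "my-finetune", "input_tokens": 500, "output_tokens": 500},
]
totals, unknown = cost_by_model(records)
print(totals, unknown)
```

An alert on a non-empty `unknown` list is one way to catch a stale price table before the cost figures drift.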

Cost alerts

from opentelemetry import metrics

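budget_gauge = meter.create_gauge(
    "llm.budget.utilization",
    description="Budget utilization as a ratio (1.0 = 100% of budget)"
)

monthly_budget = 1000.0  # $1,000 per month

def check_budget(current_spend: float):
    utilization = current_spend / monthly_budget
    budget_gauge.set(utilization, {"budget_type": "monthly"})

    # send_alert and send_critical_alert are assumed to be defined elsewhere.
    # Check the higher threshold first so an overspend raises only the critical alert.
    if utilization > 1.0:
        send_critical_alert(f"LLM budget exceeded! Current spend: ${current_spend:.2f}")
    elif utilization > 0.8:
        send_alert(f"LLM budget utilization: {utilization:.1%}, above the 80% threshold")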

Quality Monitoring: Hallucination Rate and Accuracy

Hallucination detection

import json

# Create the counter once at module level instead of on every call
hallucination_counter = meter.create_counter(
    "llm.hallucination",
    description="Answers judged not grounded in the provided context"
)

def evaluate_answer_quality(
    question: str,
    answer: str,
    context: str,
    answer_model: str = "gpt-4o"
) -> dict:
    with tracer.start_as_current_span("llm.quality.evaluation") as span:
        evaluation_prompt = [
            {"role": "system", "content": """Evaluate the quality of the AI answer. Return JSON only:
{"grounded": true/false, "relevance": 0-1, "completeness": 0-1, "reasoning": "explanation"}"""},
            {"role": "user", "content": f"""Question: {question}
Reference material: {context}
AI answer: {answer}

Assess whether the answer is based on the reference material (grounded), relevant to the question (relevance), and complete (completeness)."""}
        ]

        result = traced_chat_completion(evaluation_prompt, model="gpt-4o-mini", temperature=0)
        # json.loads raises ValueError if the evaluator wraps the JSON in extra text
        evaluation = json.loads(result["content"])

        span.set_attribute("llm.quality.grounded", evaluation["grounded"])
        span.set_attribute("llm.quality.relevance", evaluation["relevance"])
        span.set_attribute("llm.quality.completeness", evaluation["completeness"])

        if not evaluation["grounded"]:
            # Label with the model that produced the answer, not the evaluator model
            hallucination_counter.add(1, {"gen_ai.request.model": answer_model})

        return evaluation
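
Evaluator models do not always return bare JSON; they may add a sentence before or after the object. A defensive parser could look like the sketch below. `parse_evaluation` is a hypothetical helper, and the validation rules simply mirror the JSON shape requested in the evaluation prompt.

```python
import json
import re


def parse_evaluation(raw: str) -> dict:
    """Extract and validate the evaluator's JSON verdict from a model reply."""
    # Greedy match from the first "{" to the last "}" tolerates surrounding prose
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in evaluator output")
    data = json.loads(match.group(0))
    if not isinstance(data.get("grounded"), bool):
        raise ValueError("'grounded' must be true or false")
    for key in ("relevance", "completeness"):
        score = data.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(score, bool) or not isinstance(score, (int, float)) or not 0 <= score <= 1:
            raise ValueError(f"'{key}' must be a number between 0 and 1")
    return data


# Hypothetical evaluator reply with prose around the JSON object
reply = (
    "Here is my assessment: "
    '{"grounded": false, "relevance": 0.9, "completeness": 0.6, '
    '"reasoning": "cites a figure that is absent from the context"} '
    "Let me know if you need more detail."
)
verdict = parse_evaluation(reply)
print(verdict["grounded"], verdict["relevance"])
```

Raising on malformed output, instead of defaulting to a passing score, keeps evaluator failures visible in the error rate.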

LLM Observability Tools Compared

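| Tool | Type | OpenTelemetry | Token tracking | Cost monitoring | Quality monitoring | Self-hosted |
| --- | --- | --- | --- | --- | --- | --- |
| OpenTelemetry + Grafana | Framework + dashboards | Native | Custom | Custom | Custom | |
| Langfuse | Dedicated platform | Integration | Native | Native | Native | |
| Helicone | Dedicated platform | Proxy | Native | Native | Basic | |
| Arize Phoenix | Dedicated platform | Integration | Native | Native | Native | |
| Braintrust | Dedicated platform | Integration | Native | Native | Native | |
| Weave | Dedicated platform | Integration | Native | Basic | Native | |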
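5 Common Pitfalls

1. Ignoring token counts in streaming responses

In a streaming response the usage field is empty on most chunks, so it has to be read from the final chunk. With the OpenAI API, that final usage chunk is only sent when the request sets stream_options={"include_usage": True}.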

# Correctly reading token usage from a streaming response
total_usage = {"prompt_tokens": 0, "completion_tokens": 0}
for chunk in stream_response:
    if hasattr(chunk, 'usage') and chunk.usage:
        total_usage = {
            "prompt_tokens": chunk.usage.prompt_tokens,
            "completion_tokens": chunk.usage.completion_tokens
        }
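
The loop above can be exercised without calling the API. The sketch below fakes a stream with `SimpleNamespace` objects shaped like streamed chat completion chunks, assuming (as with the OpenAI API when `include_usage` is enabled) that the final chunk has an empty `choices` list and a populated `usage`. `collect_stream` and `text_chunk` are hypothetical helper names.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> tuple:
    """Gather text and token usage from chunks shaped like streamed chat completions."""
    parts = []
    usage = {"prompt_tokens": 0, "completion_tokens": 0}
    for chunk in chunks:
        if getattr(chunk, "usage", None):
            usage = {
                "prompt_tokens": chunk.usage.prompt_tokens,
                "completion_tokens": chunk.usage.completion_tokens,
            }
        # Guard against the usage-only chunk, whose choices list is empty
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts), usage


def text_chunk(text: str) -> SimpleNamespace:
    """Build a fake content chunk with no usage information."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)], usage=None)


fake_stream = [
    text_chunk("Hel"),
    text_chunk("lo"),
    # Final chunk: no choices, only usage
    SimpleNamespace(choices=[], usage=SimpleNamespace(prompt_tokens=12, completion_tokens=2)),
]
text, usage = collect_stream(fake_stream)
print(text, usage)
```

Without the `chunk.choices` guard, indexing `choices[0]` on the final chunk would raise an IndexError.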

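2. Poorly chosen sampling rate

A high-traffic LLM application that samples 100% of requests generates a large volume of trace data, and with it storage and processing cost.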

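from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample about 10% of traces at the root; child spans follow the parent's decision.
# Head sampling decides before the outcome is known, so keeping 100% of errors
# requires tail-based sampling, for example in the OpenTelemetry Collector.
sampler = ParentBased(root=TraceIdRatioBased(0.1))
trace_provider = TracerProvider(sampler=sampler)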
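
Ratio-based sampling is deterministic per trace: the decision is derived from the trace ID, so every service that sees the same trace reaches the same verdict. The sketch below imitates that idea in plain Python by comparing the low 64 bits of the ID against a bound. It is an illustration of the principle under that assumption, not the SDK's code, and `should_sample` is a hypothetical name.

```python
import random

TRACE_ID_LIMIT = (1 << 64) - 1  # mask for the low 64 bits of a 128-bit trace ID


def should_sample(trace_id: int, rate: float) -> bool:
    """Keep a trace when the low 64 bits of its ID fall below rate * 2**64."""
    bound = round(rate * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound


# Simulate 10,000 random 128-bit trace IDs with a fixed seed
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(10_000)]
kept = sum(should_sample(trace_id, 0.1) for trace_id in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Because no randomness is involved at decision time, a trace is never half-sampled across services, but a failed request also has no better chance of being kept than a successful one.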

3. Prompt content leaking into traces

Recording full prompts and answers as span attributes can leak sensitive information.

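# Wrong: recording the full prompt
span.set_attribute("gen_ai.prompt", messages)  # may contain private user data

# Right: record metadata only (count_tokens is assumed to be defined elsewhere)
span.set_attribute("gen_ai.prompt.token_count", count_tokens(messages))
span.set_attribute("gen_ai.prompt.has_system_message", any(m["role"] == "system" for m in messages))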
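
A small helper can make the metadata-only approach the default. The sketch below summarizes a chat prompt without retaining any of its text; `prompt_metadata` is a hypothetical helper, and the attribute names other than `gen_ai.prompt.has_system_message` are illustrative choices, not standard conventions.

```python
import hashlib
import json


def prompt_metadata(messages: list) -> dict:
    """Summarize a chat prompt without retaining any of its text."""
    serialized = json.dumps(messages, ensure_ascii=False, sort_keys=True)
    digest = hashlib.sha256(serialized.encode("utf-8")).hexdigest()
    return {
        "gen_ai.prompt.message_count": len(messages),
        "gen_ai.prompt.has_system_message": any(m.get("role") == "system" for m in messages),
        "gen_ai.prompt.char_count": sum(len(m.get("content", "")) for m in messages),
        # A short hash lets identical prompts be grouped without storing them
        "gen_ai.prompt.sha256_prefix": digest[:12],
    }


# Hypothetical prompt containing sensitive data
messages = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "My SSN is 123-45-6789"},
]
meta = prompt_metadata(messages)
print(meta)
```

A hash of a short, guessable prompt can still be brute-forced, so the hash attribute may be worth dropping where prompts are low-entropy.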

4. Not distinguishing model versions

A single model name can map to different versions over time, which muddles performance data.

# Record the model version actually returned in the response
span.set_attribute("gen_ai.request.model", "gpt-4o")
span.set_attribute("gen_ai.response.model", response.model)  # e.g. "gpt-4o-2024-05-13"

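5. Missing end-to-end correlation

In a RAG pipeline, retrieval, generation, and evaluation are separate spans; without a shared parent, you cannot follow a request from start to finish.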

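from opentelemetry import trace

# Spans started inside the with-block become children of rag.pipeline automatically.
# To pass the parent explicitly (for example across threads), build a context from it.
with tracer.start_as_current_span("rag.pipeline") as parent:
    ctx = trace.set_span_in_context(parent)
    with tracer.start_as_current_span("rag.retrieve", context=ctx):
        pass
    with tracer.start_as_current_span("rag.generate", context=ctx):
        pass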

10 Troubleshooting Scenarios

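| # | Symptom | Likely cause | What to check |
| --- | --- | --- | --- |
| 1 | Trace data does not appear in Grafana | OTLP exporter misconfigured | Check the endpoint and authentication settings |
| 2 | Token count is zero | usage not read from the streaming response | Read usage from the final chunk |
| 3 | Large error in cost calculation | Model price table out of date | Update the COST_PER_TOKEN mapping |
| 4 | Abnormally high hallucination rate | Poor RAG retrieval quality | Check the vector index and retrieval parameters |
| 5 | Missing span attributes | Semantic convention version mismatch | Update the OpenTelemetry SDK |
| 6 | Memory leak | TracerProvider not shut down | Call shutdown() before the program exits |
| 7 | Key traces lost to sampling | Sampling rate too low | Sample 100% of error requests |
| 8 | Abnormally high TTFT | Network latency or model cold start | Check the network and the model's warm-up state |
| 9 | The quality evaluation itself hallucinates | Unreliable evaluator model | Use a stronger evaluator model |
| 10 | Metrics and traces disagree | Time zone or time window mismatch | Use UTC consistently |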

# General troubleshooting commands
# Check the OTel Collector status
kubectl logs -n monitoring deployment/otel-collector

# Verify trace export
curl -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans":[]}'

# Check Grafana data sources
curl http://localhost:3000/api/datasources

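Summary: Observability for LLM applications has to cover dimensions beyond the traditional metrics: token usage, API cost, hallucination rate, and answer quality. OpenTelemetry's LLM semantic conventions provide a standardized approach to LLM observability in 2026. Key practices: track token usage correctly for streaming responses, set a sensible sampling rate, keep prompt content out of traces, distinguish model versions, and establish end-to-end span correlation. Combined with Grafana dashboards, this gives you monitoring of an LLM application that spans cost through quality.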
