OpenTelemetry LLM Observability: End-to-End AI Application Monitoring Guide

Why LLM Applications Need Specialized Observability in 2026

Traditional application monitoring focuses on latency, throughput, and error rates. LLM applications add observability dimensions that these metrics do not capture: token usage, API call costs, hallucination rates, and context quality. An LLM application can show normal response times and zero errors while costing $0.50 per call and hallucinating in 30% of its answers, and none of that is visible in traditional monitoring.

As of 2026, OpenTelemetry publishes semantic conventions for generative AI (the gen_ai.* attribute and metric namespace), providing a standardized approach to LLM observability.

| Dimension | Traditional Monitoring | LLM Monitoring |
|---|---|---|
| Latency | Request latency | Time to First Token (TTFT), total latency |
| Throughput | QPS | Tokens/second |
| Errors | Error rate | Error rate + hallucination rate |
| Cost | Server cost | API call cost (token-based billing) |
| Quality | N/A | Accuracy, relevance, completeness |
| Context | N/A | Prompt length, context window utilization |
| Caching | N/A | Cache hit rate |
| Model | N/A | Model version, temperature parameter |

OpenTelemetry LLM Semantic Conventions

Span: gen_ai.client.chat
  Attributes:
    gen_ai.system              = "openai"
    gen_ai.request.model       = "gpt-4o"
    gen_ai.request.max_tokens  = 4096
    gen_ai.request.temperature = 0.7
    gen_ai.response.model      = "gpt-4o-2024-05-13"
    gen_ai.response.finish_reasons = ["stop"]
    gen_ai.usage.input_tokens  = 1523
    gen_ai.usage.output_tokens = 847
    gen_ai.usage.total_tokens  = 2370

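Metric: gen_ai.client.token.usage
  Attributes:
    gen_ai.system        = "openai"
    gen_ai.request.model = "gpt-4o"
    gen_ai.token.type    = "input"
  Value: 1523 (recorded once per token type; a second data point with gen_ai.token.type = "output" carries 847)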
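
One way to keep these attribute names consistent across call sites is to build them in a single place. The sketch below is illustrative only: the helper name genai_span_attributes is hypothetical, and it returns a plain dictionary keyed by the attribute names listed above rather than touching any OpenTelemetry API.

```python
from typing import Optional


def genai_span_attributes(
    system: str,
    request_model: str,
    input_tokens: int,
    output_tokens: int,
    response_model: Optional[str] = None,
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,
) -> dict:
    """Build a dict of gen_ai.* span attributes (hypothetical helper)."""
    attrs = {
        "gen_ai.system": system,
        "gen_ai.request.model": request_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.usage.total_tokens": input_tokens + output_tokens,
    }
    optional = {
        "gen_ai.response.model": response_model,
        "gen_ai.request.temperature": temperature,
        "gen_ai.request.max_tokens": max_tokens,
    }
    # Span attribute values cannot be None, so unset fields are left out
    attrs.update({k: v for k, v in optional.items() if v is not None})
    return attrs


attrs = genai_span_attributes("openai", "gpt-4o", 1523, 847, temperature=0.7)
print(attrs["gen_ai.usage.total_tokens"])  # 2370
```

The resulting dictionary can be applied to a span in one call with span.set_attributes(attrs).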

Complete Python Implementation: Trace LLM Calls, Token Usage, Latency

Setup

pip install opentelemetry-api opentelemetry-sdk \
  opentelemetry-exporter-otlp \
  opentelemetry-instrumentation-openai \
  openai

Base Configuration

from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

trace_provider = TracerProvider()
trace_provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
    )
)
trace.set_tracer_provider(trace_provider)

metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://localhost:4317", insecure=True),
    export_interval_millis=15000
)
metrics.set_meter_provider(MeterProvider(metric_readers=[metric_reader]))

tracer = trace.get_tracer("llm-app", "1.0.0")
meter = metrics.get_meter("llm-app", "1.0.0")

Manual LLM Call Tracing

import openai
import time
from opentelemetry.trace import Status, StatusCode

# With no api_key argument, the client reads the key from the OPENAI_API_KEY environment variable
client = openai.OpenAI()

token_counter = meter.create_counter(
    "gen_ai.client.token.usage",
    description="Token usage for LLM calls",
    unit="{token}"  # unit the GenAI semantic conventions specify for token counts
)

llm_latency = meter.create_histogram(
    "gen_ai.client.operation.latency",
    description="Latency of LLM operations",
    unit="ms"
)

def traced_chat_completion(
    messages: list[dict],
    model: str = "gpt-4o",
    temperature: float = 0.7,
    max_tokens: int = 4096,
    stream: bool = False
) -> dict:
    with tracer.start_as_current_span("gen_ai.client.chat") as span:
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.temperature", temperature)
        span.set_attribute("gen_ai.request.max_tokens", max_tokens)

        start_time = time.time()
        first_token_time = None

        try:
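            if stream:
                response = client.chat.completions.create(
                    model=model, messages=messages,
                    temperature=temperature, max_tokens=max_tokens, stream=True,
                    # Ask the API to append a final chunk that carries token usage
                    stream_options={"include_usage": True}
                )
                content_parts = []
                usage = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
                for chunk in response:
                    # The usage chunk has an empty choices list, so guard before indexing
                    if chunk.choices and chunk.choices[0].delta.content:
                        if first_token_time is None:
                            first_token_time = time.time()
                        content_parts.append(chunk.choices[0].delta.content)
                    if chunk.usage:
                        usage = {
                            "prompt_tokens": chunk.usage.prompt_tokens,
                            "completion_tokens": chunk.usage.completion_tokens,
                            "total_tokens": chunk.usage.total_tokens
                        }
                content = "".join(content_parts)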
            else:
                response = client.chat.completions.create(
                    model=model, messages=messages,
                    temperature=temperature, max_tokens=max_tokens
                )
                content = response.choices[0].message.content
                usage = {
                    "prompt_tokens": response.usage.prompt_tokens,
                    "completion_tokens": response.usage.completion_tokens,
                    "total_tokens": response.usage.total_tokens
                }
                # TTFT is not observable without streaming, so first_token_time stays None and ttft_ms reports 0

            end_time = time.time()
            latency_ms = (end_time - start_time) * 1000
            ttft_ms = (first_token_time - start_time) * 1000 if first_token_time else 0

            span.set_attribute("gen_ai.usage.input_tokens", usage["prompt_tokens"])
            span.set_attribute("gen_ai.usage.output_tokens", usage["completion_tokens"])
            span.set_attribute("gen_ai.usage.total_tokens", usage["total_tokens"])
            span.set_status(Status(StatusCode.OK))

            token_counter.add(usage["prompt_tokens"], {"gen_ai.system": "openai", "gen_ai.token.type": "input", "gen_ai.request.model": model})
            token_counter.add(usage["completion_tokens"], {"gen_ai.system": "openai", "gen_ai.token.type": "output", "gen_ai.request.model": model})
            llm_latency.record(latency_ms, {"gen_ai.system": "openai", "gen_ai.request.model": model})

            return {"content": content, "usage": usage, "latency_ms": latency_ms, "ttft_ms": ttft_ms}

        except Exception as e:
            span.set_status(Status(StatusCode.ERROR, str(e)))
            span.record_exception(e)
            raise
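
The timing logic can be isolated from the OpenAI client so that it is easy to test. The following is a minimal sketch under stated assumptions: timed_stream is a hypothetical helper, the stream is modeled as an iterable of text fragments, and the clock is injectable so a test can supply fixed timestamps. It uses time.perf_counter by default because a monotonic clock is generally a safer choice for measuring intervals than wall-clock time.

```python
import time
from typing import Callable, Iterable, Optional


def timed_stream(
    fragments: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a stream of text fragments and report TTFT and total latency."""
    start = clock()
    first_token_at: Optional[float] = None
    parts = []
    for text in fragments:
        # Only a non-empty fragment counts as the first token
        if text and first_token_at is None:
            first_token_at = clock()
        parts.append(text)
    end = clock()
    return {
        "content": "".join(parts),
        # None (not 0) when no token ever arrived, so the gap is visible
        "ttft_ms": None if first_token_at is None else (first_token_at - start) * 1000,
        "latency_ms": (end - start) * 1000,
    }


# Usage with a fake clock: start=0.0s, first token=0.25s, end=1.0s
ticks = iter([0.0, 0.25, 1.0])
result = timed_stream(["", "Hel", "lo"], clock=lambda: next(ticks))
print(result["ttft_ms"], result["latency_ms"])  # 250.0 1000.0
```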

Tracing RAG Pipeline

def traced_rag_query(question: str, top_k: int = 5) -> dict:
    with tracer.start_as_current_span("rag.pipeline") as rag_span:
        # Record the question's size rather than its raw text, which may contain user data
        rag_span.set_attribute("rag.question_length", len(question))
        rag_span.set_attribute("rag.top_k", top_k)

        with tracer.start_as_current_span("rag.retrieve") as retrieve_span:
            # vector_search is assumed to be defined elsewhere and to return objects with a .content attribute
            retrieved_docs = vector_search(question, top_k)
            retrieve_span.set_attribute("rag.retrieved_count", len(retrieved_docs))

        with tracer.start_as_current_span("rag.construct_prompt") as prompt_span:
            context_text = "\n".join([doc.content for doc in retrieved_docs])
            messages = [
                {"role": "system", "content": "Answer based on the following references."},
                {"role": "user", "content": f"References:\n{context_text}\n\nQuestion: {question}"}
            ]
            prompt_span.set_attribute("rag.context_length", len(context_text))

        with tracer.start_as_current_span("rag.generate") as gen_span:
            result = traced_chat_completion(messages)
            gen_span.set_attribute("rag.answer_length", len(result["content"]))

        return {"answer": result["content"], "usage": result["usage"], "latency_ms": result["latency_ms"]}
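
The pipeline above records rag.context_length in characters, while the "context window utilization" dimension from the comparison table is measured in tokens. A rough sketch of that calculation follows. The helper name context_utilization is hypothetical, and the window size in the lookup table is an assumption that should be checked against the provider's current model documentation.

```python
from typing import Optional

# Assumed context window sizes in tokens; verify against provider docs
CONTEXT_WINDOW_TOKENS = {
    "gpt-4o": 128_000,
}


def context_utilization(model: str, input_tokens: int) -> Optional[float]:
    """Return the fraction of the model's context window used by the prompt.

    Returns None for models missing from the table, so an unknown
    window is not silently reported as 0% utilization.
    """
    window = CONTEXT_WINDOW_TOKENS.get(model)
    if window is None:
        return None
    return input_tokens / window


ratio = context_utilization("gpt-4o", 1523)
print(f"{ratio:.2%}")
```

The input token count would come from the usage dictionary returned by traced_chat_completion, and the ratio could be set as a custom span attribute on the rag.generate span.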

Grafana Dashboard Setup

Cost Monitoring Dashboard

{
  "dashboard": {
    "title": "LLM Cost Monitoring",
    "panels": [
      {
        "title": "Daily API Spend ($)",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(increase(gen_ai_client_token_usage_total{gen_ai_token_type=\"input\"}[1d])) * 0.000005 + sum(increase(gen_ai_client_token_usage_total{gen_ai_token_type=\"output\"}[1d])) * 0.000015"
          }
        ]
      },
      {
        "title": "Token Usage by Model",
        "type": "piechart",
        "targets": [
          {
            "expr": "sum by (gen_ai_request_model) (increase(gen_ai_client_token_usage_total[$__range]))"
          }
        ]
      }
    ]
  }
}

Cost Monitoring

COST_PER_TOKEN = {
    "gpt-4o": {"input": 0.000005, "output": 0.000015},
    "gpt-4o-mini": {"input": 0.00000015, "output": 0.0000006},
    "gpt-4-turbo": {"input": 0.00001, "output": 0.00003},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    rates = COST_PER_TOKEN.get(model, {"input": 0, "output": 0})
    return (input_tokens * rates["input"]) + (output_tokens * rates["output"])
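
As a worked example, the call from the semantic conventions section (1523 input tokens, 847 output tokens on gpt-4o) costs 1523 × 0.000005 + 847 × 0.000015 = 0.007615 + 0.012705 = $0.02032. The sketch below reproduces that arithmetic. It also shows one possible variation, named calculate_cost_strict here purely for illustration, that raises on an unknown model instead of returning 0, since a silent zero can hide a missing price entry.

```python
# Prices copied from the table above (USD per token); they go stale, so treat as assumptions
COST_PER_TOKEN = {
    "gpt-4o": {"input": 0.000005, "output": 0.000015},
}


def calculate_cost_strict(model: str, input_tokens: int, output_tokens: int) -> float:
    """Like calculate_cost, but fails loudly when the model has no price entry."""
    if model not in COST_PER_TOKEN:
        raise KeyError(f"no price entry for model {model!r}")
    rates = COST_PER_TOKEN[model]
    return input_tokens * rates["input"] + output_tokens * rates["output"]


cost = calculate_cost_strict("gpt-4o", 1523, 847)
print(f"${cost:.5f}")  # $0.02032
```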

Quality Monitoring

Hallucination Detection

import json

def evaluate_answer_quality(question: str, answer: str, context: str) -> dict:
    """LLM-as-judge check: ask a second model whether the answer is grounded in the context."""
    with tracer.start_as_current_span("llm.quality.evaluation") as span:
        evaluation_prompt = [
            {"role": "system", "content": "Evaluate AI answer quality. Return JSON: {\"grounded\": bool, \"relevance\": 0-1, \"completeness\": 0-1}"},
            {"role": "user", "content": f"Question: {question}\nContext: {context}\nAnswer: {answer}"}
        ]
        result = traced_chat_completion(evaluation_prompt, model="gpt-4o-mini", temperature=0)
        # json.loads raises if the evaluator returns non-JSON text; the span records that exception
        evaluation = json.loads(result["content"])
        span.set_attribute("llm.quality.grounded", evaluation["grounded"])
        span.set_attribute("llm.quality.relevance", evaluation["relevance"])
        span.set_attribute("llm.quality.completeness", evaluation["completeness"])
        return evaluation
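
Evaluator models do not always return clean JSON, and the hallucination rate mentioned throughout this guide has to be aggregated from individual evaluations. The following sketch covers both steps under stated assumptions: parse_evaluation and hallucination_rate are hypothetical helpers, the evaluator is assumed to sometimes wrap its JSON in a Markdown code fence, and an answer counts as a hallucination when "grounded" is false.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker


def parse_evaluation(raw: str) -> dict:
    """Parse evaluator output, tolerating a surrounding Markdown code fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip().startswith(FENCE):
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines)
    data = json.loads(text)

    def clamp(value) -> float:
        # Keep scores inside the 0-1 range the prompt asks for
        return max(0.0, min(1.0, float(value)))

    return {
        "grounded": bool(data["grounded"]),
        "relevance": clamp(data["relevance"]),
        "completeness": clamp(data["completeness"]),
    }


def hallucination_rate(evaluations: list) -> float:
    """Fraction of evaluations whose answer was not grounded in the context."""
    if not evaluations:
        return 0.0
    ungrounded = sum(1 for e in evaluations if not e["grounded"])
    return ungrounded / len(evaluations)


batch = [
    parse_evaluation('{"grounded": true, "relevance": 0.9, "completeness": 0.8}'),
    parse_evaluation('{"grounded": false, "relevance": 0.4, "completeness": 0.5}'),
]
print(hallucination_rate(batch))  # 0.5
```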

LLM Observability Tool Comparison

| Tool | Type | OpenTelemetry | Token Tracking | Cost Monitoring | Quality Monitoring | Self-hosted |
|---|---|---|---|---|---|---|
| OTel + Grafana | Framework+Dashboard | Native | Custom | Custom | Custom | Yes |
| Langfuse | Dedicated Platform | Integrated | Native | Native | Native | Yes |
| Helicone | Dedicated Platform | Proxy | Native | Native | Basic | No |
| Arize Phoenix | Dedicated Platform | Integrated | Native | Native | Native | Yes |
| Braintrust | Dedicated Platform | Integrated | Native | Native | Native | No |

5 Common Pitfalls

1. Ignoring Token Counting in Streaming Responses

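# Streaming responses carry no usage data unless it is requested explicitly
stream_response = client.chat.completions.create(
    model="gpt-4o", messages=messages, stream=True,
    stream_options={"include_usage": True}
)
total_usage = {"prompt_tokens": 0, "completion_tokens": 0}
for chunk in stream_response:
    # Only the final chunk has usage populated
    if hasattr(chunk, 'usage') and chunk.usage:
        total_usage = {
            "prompt_tokens": chunk.usage.prompt_tokens,
            "completion_tokens": chunk.usage.completion_tokens
        }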
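
The extraction loop can be exercised without calling the API. The sketch below is an assumption-laden illustration: usage_from_stream is a hypothetical helper, and the chunks are stand-in objects built with SimpleNamespace that mimic only the usage field of real stream chunks. Returning None rather than zeros when no usage arrives is a deliberate choice, so a missing count is not mistaken for a free call.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def usage_from_stream(chunks: Iterable) -> Optional[dict]:
    """Return token usage from whichever chunk carries it, or None if none does."""
    usage = None
    for chunk in chunks:
        chunk_usage = getattr(chunk, "usage", None)
        if chunk_usage is not None:
            usage = {
                "prompt_tokens": chunk_usage.prompt_tokens,
                "completion_tokens": chunk_usage.completion_tokens,
            }
    return usage


# Stand-in chunks: only the last one carries usage, as in a real stream
fake_stream = [
    SimpleNamespace(usage=None),
    SimpleNamespace(usage=None),
    SimpleNamespace(usage=SimpleNamespace(prompt_tokens=1523, completion_tokens=847)),
]
print(usage_from_stream(fake_stream))
```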

2. Improper Sampling Rate

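from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
# Keep roughly 10% of traces; ParentBased makes child spans follow the root span's decision
sampler = ParentBased(root=TraceIdRatioBased(0.1))
trace_provider = TracerProvider(sampler=sampler)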
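
To see why ratio sampling keeps whole traces together, it helps to look at the idea behind it: the decision is a deterministic function of the trace ID, so every service that sees the same trace reaches the same verdict. The code below is a simplified illustration of that idea, not the SDK's implementation. The function name should_sample and the use of the lower 64 bits of the trace ID are assumptions made for the sketch.

```python
import random

TRACE_ID_MASK = (1 << 64) - 1  # compare only the lower 64 bits of the 128-bit ID


def should_sample(trace_id: int, rate: float) -> bool:
    """Keep a trace when its masked ID falls below rate * 2**64."""
    bound = round(rate * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound


# The same trace ID always yields the same decision
trace_id = 0x4BF92F3577B34DA6A3CE929D0E0E4736
assert should_sample(trace_id, 0.1) == should_sample(trace_id, 0.1)

# Over many random IDs, the kept fraction approaches the configured rate
rng = random.Random(0)
ids = [rng.getrandbits(128) for _ in range(10_000)]
kept = sum(should_sample(i, 0.1) for i in ids)
print(kept / len(ids))
```

Because the decision is made before the request runs, a head sampler like this cannot know whether a trace will end in an error. Keeping every failed trace generally requires a decision made after the trace completes, for example in a collector.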

3. Prompt Content Leaking into Traces

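# Wrong: stores raw prompt text, which may include personal or confidential data, in the trace backend
span.set_attribute("gen_ai.prompt", messages)

# Correct: record only derived metadata (count_tokens is a helper you supply, for example built on a tokenizer library)
span.set_attribute("gen_ai.prompt.token_count", count_tokens(messages))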
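
When a token counter is not available, a fingerprint of the prompt can still make traces searchable and comparable without storing the text itself. This is a minimal sketch, assuming messages are dictionaries with a "content" string. The helper name prompt_fingerprint and the app.prompt.* attribute names are hypothetical and not part of any semantic convention.

```python
import hashlib
import json


def prompt_fingerprint(messages: list) -> dict:
    """Derive non-reversible metadata from chat messages for use as span attributes."""
    # sort_keys gives a stable serialization, so identical prompts hash identically
    serialized = json.dumps(messages, sort_keys=True, ensure_ascii=False)
    return {
        "app.prompt.sha256": hashlib.sha256(serialized.encode("utf-8")).hexdigest(),
        "app.prompt.char_count": sum(len(m.get("content") or "") for m in messages),
        "app.prompt.message_count": len(messages),
    }


messages = [
    {"role": "system", "content": "Answer based on the following references."},
    {"role": "user", "content": "What is OpenTelemetry?"},
]
attrs = prompt_fingerprint(messages)
print(attrs["app.prompt.message_count"], attrs["app.prompt.char_count"])
```

A hash lets you group identical prompts across traces, but very short or predictable prompts can be guessed by hashing candidates, so it is not a substitute for access control on the trace backend.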

4. Not Distinguishing Model Versions

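# The request names an alias; the response reports the dated snapshot that actually served the call
span.set_attribute("gen_ai.request.model", "gpt-4o")
span.set_attribute("gen_ai.response.model", response.model)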

5. Missing End-to-End Span Correlation

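from opentelemetry import context

with tracer.start_as_current_span("rag.pipeline") as parent:
    # Nested start_as_current_span calls inherit the parent automatically;
    # capture the context explicitly when handing work to another thread or task
    ctx = context.get_current()
    with tracer.start_as_current_span("rag.retrieve", context=ctx):
        pass
    with tracer.start_as_current_span("rag.generate", context=ctx):
        pass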

10 Error Troubleshooting

| # | Symptom | Possible Cause | Resolution |
|---|---|---|---|
| 1 | Traces not appearing in Grafana | OTLP Exporter misconfigured | Check endpoint and auth |
| 2 | Token count is zero | Streaming response missing usage | Get usage from last chunk |
| 3 | Cost calculation inaccurate | Price table outdated | Update COST_PER_TOKEN |
| 4 | Hallucination rate abnormally high | Poor RAG retrieval quality | Check vector index and params |
| 5 | Missing Span attributes | Semantic convention version mismatch | Update OTel SDK |
| 6 | Memory leak | TracerProvider not closed | Call shutdown() on exit |
| 7 | Critical traces lost to sampling | Sampling rate too low | 100% sample for errors |
| 8 | Abnormally high TTFT | Network latency or cold start | Check network and model warmup |
| 9 | Quality evaluation hallucinating | Evaluation model unreliable | Use stronger eval model |
| 10 | Metric/Trace mismatch | Timezone inconsistency | Use UTC consistently |


Summary: LLM application observability requires monitoring dimensions beyond traditional metrics: token usage, API costs, hallucination rates, and answer quality. OpenTelemetry's LLM semantic conventions provide a standardized way to capture them. Key practices: correctly track streaming response tokens, set appropriate sampling rates, avoid leaking prompt content into traces, distinguish model versions, and establish end-to-end span correlation. Combined with Grafana dashboards, these practices give you LLM application monitoring that spans cost through quality.
