OpenTelemetry LLM Observability: End-to-End AI Application Monitoring Guide

Why LLM Applications Need Specialized Observability in 2026

Traditional application monitoring focuses on latency, throughput, and error rates. LLM applications add new observability dimensions: token usage, API call costs, hallucination rates, and context quality. An LLM application can show normal response times and zero errors yet cost $0.50 per call with a 30% hallucination rate, and none of that is visible in traditional monitoring.

As of 2026, OpenTelemetry's semantic conventions for generative AI (the gen_ai.* attribute namespace) provide a standardized approach to LLM observability across spans and metrics.

Dimension	Traditional Monitoring	LLM Monitoring
Latency	Request latency	Time to First Token (TTFT), total latency
Throughput	QPS	Tokens/second
Errors	Error rate	Error rate + hallucination rate
Cost	Server cost	API call cost (token-based billing)
Quality	N/A	Accuracy, relevance, completeness
Context	N/A	Prompt length, context window utilization
Caching	N/A	Cache hit rate
Model	N/A	Model version, temperature parameter
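
The LLM-specific columns above reduce to a few simple calculations once timestamps and token counts are captured. The following is a minimal sketch: `LLMCallSample` and `derive_metrics` are hypothetical names, and the 128,000-token context window is an assumed value to be replaced with the real limit of the model in use.

```python
from dataclasses import dataclass


@dataclass
class LLMCallSample:
    """Raw measurements for one LLM call (hypothetical structure for illustration)."""
    start_s: float          # request sent
    first_token_s: float    # first streamed token received
    end_s: float            # last token received
    input_tokens: int
    output_tokens: int


def derive_metrics(sample: LLMCallSample, context_window: int) -> dict:
    """Turn raw timestamps and token counts into the LLM-specific dimensions."""
    generation_s = sample.end_s - sample.first_token_s
    return {
        "ttft_ms": (sample.first_token_s - sample.start_s) * 1000,
        "total_latency_ms": (sample.end_s - sample.start_s) * 1000,
        # Output throughput is measured over the generation phase only.
        "tokens_per_second": sample.output_tokens / generation_s if generation_s > 0 else 0.0,
        "context_utilization": sample.input_tokens / context_window,
    }


sample = LLMCallSample(start_s=0.0, first_token_s=0.5, end_s=4.5,
                       input_tokens=1600, output_tokens=800)
derived = derive_metrics(sample, context_window=128_000)  # assumed window size
print(derived)
```

Measuring throughput from the first token onward keeps queueing and prompt-processing time out of the tokens/second figure, which is one reasonable choice among several.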

OpenTelemetry LLM Semantic Conventions

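Span: gen_ai.client.chat   (name used in this guide; the conventions recommend "{gen_ai.operation.name} {gen_ai.request.model}", e.g. "chat gpt-4o")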
  Attributes:
    gen_ai.system              = "openai"
    gen_ai.request.model       = "gpt-4o"
    gen_ai.request.max_tokens  = 4096
    gen_ai.request.temperature = 0.7
    gen_ai.response.model      = "gpt-4o-2024-05-13"
    gen_ai.response.finish_reasons = ["stop"]
    gen_ai.usage.input_tokens  = 1523
    gen_ai.usage.output_tokens = 847
    gen_ai.usage.total_tokens  = 2370   (custom attribute, not part of the conventions)

Metric: gen_ai.client.token.usage
  Attributes:
    gen_ai.system        = "openai"
    gen_ai.request.model = "gpt-4o"
    gen_ai.token.type    = "input" | "output"
  Value: 1523 (input) and 847 (output), recorded as separate data points
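
To keep attribute names consistent across call sites, the mapping from an API response to these attributes can live in one place. This is a sketch that assumes the response exposes OpenAI-style `model`, `choices[].finish_reason`, `usage.prompt_tokens`, and `usage.completion_tokens` fields; `genai_span_attributes` is a hypothetical helper, not part of any SDK.

```python
from types import SimpleNamespace


def genai_span_attributes(system: str, request_model: str, response) -> dict:
    """Map an OpenAI-style response object onto the span attributes listed above."""
    return {
        "gen_ai.system": system,
        "gen_ai.request.model": request_model,
        "gen_ai.response.model": response.model,
        "gen_ai.response.finish_reasons": [c.finish_reason for c in response.choices],
        "gen_ai.usage.input_tokens": response.usage.prompt_tokens,
        "gen_ai.usage.output_tokens": response.usage.completion_tokens,
    }


# Stand-in for a real API response, built by hand for illustration.
fake_response = SimpleNamespace(
    model="gpt-4o-2024-05-13",
    choices=[SimpleNamespace(finish_reason="stop")],
    usage=SimpleNamespace(prompt_tokens=1523, completion_tokens=847),
)
attrs = genai_span_attributes("openai", "gpt-4o", fake_response)
print(attrs)
```

With a real span, `span.set_attributes(attrs)` applies the whole dictionary in one call.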

Complete Python Implementation: Trace LLM Calls, Token Usage, Latency

Setup

pip install opentelemetry-api opentelemetry-sdk \
  opentelemetry-exporter-otlp \
  opentelemetry-instrumentation-openai \
  openai

Base Configuration

from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

trace_provider = TracerProvider()
trace_provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
    )
)
trace.set_tracer_provider(trace_provider)

metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://localhost:4317", insecure=True),
    export_interval_millis=15000
)
metrics.set_meter_provider(MeterProvider(metric_readers=[metric_reader]))

tracer = trace.get_tracer("llm-app", "1.0.0")
meter = metrics.get_meter("llm-app", "1.0.0")
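
The configuration above hard-codes the collector address. One way to make it deployable is to resolve the endpoint from the environment, as sketched below. `OTEL_EXPORTER_OTLP_ENDPOINT` is the standard OpenTelemetry variable; `resolve_otlp_endpoint` is a hypothetical helper, and treating every non-https URL as insecure is a simplifying assumption.

```python
import os


def resolve_otlp_endpoint(default: str = "http://localhost:4317") -> tuple[str, bool]:
    """Return (endpoint, insecure) based on the environment.

    Falls back to the local collector address used in this guide when
    OTEL_EXPORTER_OTLP_ENDPOINT is not set.
    """
    endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", default)
    # Assumption: plain http (or no scheme) means an unencrypted connection.
    insecure = not endpoint.startswith("https://")
    return endpoint, insecure


endpoint, insecure = resolve_otlp_endpoint()
print(endpoint, insecure)
```

The pair can then be passed as `OTLPSpanExporter(endpoint=endpoint, insecure=insecure)`. The OTLP exporters also read this variable themselves when no `endpoint` argument is given, so an explicit helper is mainly useful when the value needs validation or logging.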

Manual LLM Call Tracing

import openai
import time
from opentelemetry.trace import Status, StatusCode

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment; avoid hard-coding keys

token_counter = meter.create_counter(
    "gen_ai.client.token.usage",
    description="Token usage for LLM calls",
    unit="tokens"
)

llm_latency = meter.create_histogram(
    "gen_ai.client.operation.latency",
    description="Latency of LLM operations",
    unit="ms"
)

def traced_chat_completion(
    messages: list[dict],
    model: str = "gpt-4o",
    temperature: float = 0.7,
    max_tokens: int = 4096,
    stream: bool = False
) -> dict:
    with tracer.start_as_current_span("gen_ai.client.chat") as span:
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.temperature", temperature)
        span.set_attribute("gen_ai.request.max_tokens", max_tokens)

        start_time = time.time()
        first_token_time = None

        try:
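            if stream:
                response = client.chat.completions.create(
                    model=model, messages=messages,
                    temperature=temperature, max_tokens=max_tokens, stream=True,
                    # Ask the API to append a final chunk that carries token usage.
                    stream_options={"include_usage": True}
                )
                content_parts = []
                usage = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
                for chunk in response:
                    if getattr(chunk, "usage", None):
                        usage = {
                            "prompt_tokens": chunk.usage.prompt_tokens,
                            "completion_tokens": chunk.usage.completion_tokens,
                            "total_tokens": chunk.usage.total_tokens
                        }
                    # The usage-only chunk has an empty choices list, so guard the index.
                    if chunk.choices and chunk.choices[0].delta.content:
                        if first_token_time is None:
                            first_token_time = time.time()
                        content_parts.append(chunk.choices[0].delta.content)
                content = "".join(content_parts)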
            else:
                response = client.chat.completions.create(
                    model=model, messages=messages,
                    temperature=temperature, max_tokens=max_tokens
                )
                content = response.choices[0].message.content
                usage = {
                    "prompt_tokens": response.usage.prompt_tokens,
                    "completion_tokens": response.usage.completion_tokens,
                    "total_tokens": response.usage.total_tokens
                }
                # Non-streaming: the whole answer arrives at once, so TTFT equals total latency.
                first_token_time = time.time()

            end_time = time.time()
            latency_ms = (end_time - start_time) * 1000
            ttft_ms = (first_token_time - start_time) * 1000 if first_token_time else 0

            span.set_attribute("gen_ai.usage.input_tokens", usage["prompt_tokens"])
            span.set_attribute("gen_ai.usage.output_tokens", usage["completion_tokens"])
            span.set_attribute("gen_ai.usage.total_tokens", usage["total_tokens"])
            span.set_status(Status(StatusCode.OK))

            token_counter.add(usage["prompt_tokens"], {"gen_ai.system": "openai", "gen_ai.token.type": "input", "gen_ai.request.model": model})
            token_counter.add(usage["completion_tokens"], {"gen_ai.system": "openai", "gen_ai.token.type": "output", "gen_ai.request.model": model})
            llm_latency.record(latency_ms, {"gen_ai.system": "openai", "gen_ai.request.model": model})

            return {"content": content, "usage": usage, "latency_ms": latency_ms, "ttft_ms": ttft_ms}

        except Exception as e:
            span.set_status(Status(StatusCode.ERROR, str(e)))
            span.record_exception(e)
            raise
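
Time to first token is easiest to verify in isolation, without a network call. The sketch below uses a hand-built `fake_stream` (a hypothetical stand-in shaped like an OpenAI streaming response) and stamps TTFT on the first chunk that actually carries content, skipping empty or role-only chunks. `consume_with_ttft` is likewise an illustrative name.

```python
import time
from types import SimpleNamespace
from typing import Iterable, Iterator


def fake_stream(parts: list[str], delay_s: float = 0.01) -> Iterator[SimpleNamespace]:
    """Hypothetical stand-in for a streaming response: one empty chunk, then content."""
    yield SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=None))])
    for part in parts:
        time.sleep(delay_s)  # simulate generation delay
        yield SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=part))])


def consume_with_ttft(chunks: Iterable) -> dict:
    """Collect streamed text and time the first chunk that carries content."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # e.g. a usage-only chunk
            continue
        text = chunk.choices[0].delta.content
        if text:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            parts.append(text)
    end = time.perf_counter()
    return {
        "content": "".join(parts),
        "ttft_ms": (first_token_at - start) * 1000 if first_token_at is not None else None,
        "latency_ms": (end - start) * 1000,
    }


result = consume_with_ttft(fake_stream(["Hello", ", ", "world"]))
print(result["content"], round(result["ttft_ms"], 1), round(result["latency_ms"], 1))
```

`time.perf_counter()` is used here because it is monotonic, which makes it a safer choice than wall-clock time for measuring short durations.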

Tracing RAG Pipeline

def traced_rag_query(question: str, top_k: int = 5) -> dict:
    with tracer.start_as_current_span("rag.pipeline") as rag_span:
        # Record the question's length rather than its text, which may be sensitive.
        rag_span.set_attribute("rag.question_length", len(question))
        rag_span.set_attribute("rag.top_k", top_k)

        with tracer.start_as_current_span("rag.retrieve") as retrieve_span:
            retrieved_docs = vector_search(question, top_k)
            retrieve_span.set_attribute("rag.retrieved_count", len(retrieved_docs))

        with tracer.start_as_current_span("rag.construct_prompt") as prompt_span:
            context_text = "\n".join([doc.content for doc in retrieved_docs])
            messages = [
                {"role": "system", "content": "Answer based on the following references."},
                {"role": "user", "content": f"References:\n{context_text}\n\nQuestion: {question}"}
            ]
            prompt_span.set_attribute("rag.context_length", len(context_text))

        with tracer.start_as_current_span("rag.generate") as gen_span:
            result = traced_chat_completion(messages)
            gen_span.set_attribute("rag.answer_length", len(result["content"]))

        return {"answer": result["content"], "usage": result["usage"], "latency_ms": result["latency_ms"]}
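
The pipeline above calls `vector_search`, which this guide does not define. To exercise the tracing code locally, a toy stand-in such as the one below may be enough. Everything here is hypothetical: the `Doc` type, the three-sentence corpus, and the word-overlap ranking, which only imitates embedding similarity and should be replaced by a real vector store.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Minimal document type exposing the `.content` field the pipeline reads."""
    content: str


# Tiny in-memory corpus, for illustration only.
CORPUS = [
    Doc("OpenTelemetry exports traces and metrics over OTLP."),
    Doc("Grafana dashboards can chart token usage per model."),
    Doc("Sampling reduces trace volume at the cost of completeness."),
]


def vector_search(question: str, top_k: int = 5) -> list[Doc]:
    """Rank documents by shared lowercase words; a toy substitute for embedding similarity."""
    query_words = set(question.lower().split())
    scored = [
        (len(query_words & set(doc.content.lower().rstrip(".").split())), doc)
        for doc in CORPUS
    ]
    # Keep only documents with at least one overlapping word, best match first.
    ranked = sorted((pair for pair in scored if pair[0] > 0),
                    key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]


hits = vector_search("how do grafana dashboards chart token usage", top_k=2)
print([doc.content for doc in hits])
```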

Grafana Dashboard Setup

Cost Monitoring Dashboard

{
  "dashboard": {
    "title": "LLM Cost Monitoring",
    "panels": [
      {
        "title": "Daily API Spend ($)",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(increase(gen_ai_client_token_usage_total{gen_ai_request_model=\"gpt-4o\",gen_ai_token_type=\"input\"}[$__range])) * 0.000005 + sum(increase(gen_ai_client_token_usage_total{gen_ai_request_model=\"gpt-4o\",gen_ai_token_type=\"output\"}[$__range])) * 0.000015"
          }
        ]
      },
      {
        "title": "Token Usage by Model",
        "type": "piechart",
        "targets": [
          {
            "expr": "sum by (gen_ai_request_model) (increase(gen_ai_client_token_usage_total[$__range]))"
          }
        ]
      }
    ]
  }
}
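
A spend query multiplies token counts by per-token prices, so it needs one term per model and token type when several models are in use. The sketch below generates such an expression from a price table. The label names follow the dashboard queries above, `spend_expr` is a hypothetical helper, and the prices are the ones this guide lists, which may be outdated.

```python
PRICES = {
    "gpt-4o": {"input": 0.000005, "output": 0.000015},
    "gpt-4o-mini": {"input": 0.00000015, "output": 0.0000006},
}


def spend_expr(prices: dict, window: str = "$__range") -> str:
    """Build a PromQL expression summing price-weighted token increases per model."""
    terms = []
    for model, rates in prices.items():
        for token_type, price in rates.items():
            selector = (
                f'gen_ai_client_token_usage_total{{'
                f'gen_ai_request_model="{model}",gen_ai_token_type="{token_type}"}}'
            )
            # Fixed-point formatting keeps the price readable (0.00000015, not 1.5e-07).
            price_text = format(price, ".10f").rstrip("0")
            terms.append(f"sum(increase({selector}[{window}])) * {price_text}")
    return " + ".join(terms)


expr = spend_expr(PRICES)
print(expr)
```

Generating the query from the same table the application uses for cost calculation keeps the dashboard and the code from drifting apart when prices change.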

Cost Monitoring

COST_PER_TOKEN = {
    "gpt-4o": {"input": 0.000005, "output": 0.000015},
    "gpt-4o-mini": {"input": 0.00000015, "output": 0.0000006},
    "gpt-4-turbo": {"input": 0.00001, "output": 0.00003},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    rates = COST_PER_TOKEN.get(model, {"input": 0, "output": 0})
    return (input_tokens * rates["input"]) + (output_tokens * rates["output"])
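
As a worked example, the token counts from the span shown earlier (1523 input, 847 output) can be priced with the same formula. The block repeats the table and function so it runs on its own; the 10,000 calls per day figure is a made-up volume for illustration.

```python
COST_PER_TOKEN = {
    "gpt-4o": {"input": 0.000005, "output": 0.000015},
    "gpt-4o-mini": {"input": 0.00000015, "output": 0.0000006},
}


def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Tokens times the per-token rate, summed over input and output."""
    rates = COST_PER_TOKEN.get(model, {"input": 0, "output": 0})
    return (input_tokens * rates["input"]) + (output_tokens * rates["output"])


# 1523 * 0.000005 = 0.007615 and 847 * 0.000015 = 0.012705
per_call = calculate_cost("gpt-4o", 1523, 847)
print(f"per call:  ${per_call:.6f}")

# Projected to a hypothetical 10,000 calls per day of the same shape.
print(f"per day:   ${per_call * 10_000:.2f}")

# The same traffic on the cheaper model, for comparison.
mini = calculate_cost("gpt-4o-mini", 1523, 847)
print(f"mini/call: ${mini:.6f}")
```

Note that an unknown model name silently prices at zero with this lookup, so checking `model in COST_PER_TOKEN` and alerting on misses is worth considering.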

Quality Monitoring

Hallucination Detection

import json

def evaluate_answer_quality(question: str, answer: str, context: str) -> dict:
    with tracer.start_as_current_span("llm.quality.evaluation") as span:
        evaluation_prompt = [
            {"role": "system", "content": "Evaluate AI answer quality. Return JSON: {\"grounded\": bool, \"relevance\": 0-1, \"completeness\": 0-1}"},
            {"role": "user", "content": f"Question: {question}\nContext: {context}\nAnswer: {answer}"}
        ]
        result = traced_chat_completion(evaluation_prompt, model="gpt-4o-mini", temperature=0)
        # json.loads raises if the evaluator returns anything other than bare JSON;
        # the exception propagates and is recorded on the span.
        evaluation = json.loads(result["content"])
        span.set_attribute("llm.quality.grounded", evaluation["grounded"])
        span.set_attribute("llm.quality.relevance", evaluation["relevance"])
        span.set_attribute("llm.quality.completeness", evaluation["completeness"])
        return evaluation
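
Evaluator models do not always return bare JSON, and the hallucination rate mentioned throughout this guide is an aggregate over many evaluations. The sketch below shows one possible way to handle both: a tolerant parser that validates the three expected fields, and a rate calculation. `parse_evaluation` and `hallucination_rate` are hypothetical helpers, and the sample outputs are invented.

```python
import json
import re


def parse_evaluation(raw: str) -> dict:
    """Extract and validate the evaluator's JSON verdict from raw model output."""
    # Models sometimes wrap JSON in extra prose; grab the outermost {...}.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in evaluator output")
    data = json.loads(match.group(0))

    if not isinstance(data.get("grounded"), bool):
        raise ValueError("'grounded' must be a boolean")
    for key in ("relevance", "completeness"):
        score = data.get(key)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, (int, float)) or not 0 <= score <= 1:
            raise ValueError(f"'{key}' must be a number between 0 and 1")
    return data


def hallucination_rate(evaluations: list[dict]) -> float:
    """Share of evaluated answers judged not grounded in the context."""
    if not evaluations:
        return 0.0
    ungrounded = sum(1 for e in evaluations if not e["grounded"])
    return ungrounded / len(evaluations)


raw_outputs = [
    '{"grounded": true, "relevance": 0.9, "completeness": 0.8}',
    'Verdict: {"grounded": false, "relevance": 0.4, "completeness": 0.5} Hope this helps.',
]
evaluations = [parse_evaluation(raw) for raw in raw_outputs]
rate = hallucination_rate(evaluations)
print(f"hallucination rate: {rate:.0%}")
```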

LLM Observability Tool Comparison

Tool	Type	OpenTelemetry	Token Tracking	Cost Monitoring	Quality Monitoring	Self-hosted
OTel + Grafana	Framework+Dashboard	Native	Custom	Custom	Custom	Yes
Langfuse	Dedicated Platform	Integrated	Native	Native	Native	Yes
Helicone	Dedicated Platform	Proxy	Native	Native	Basic	No
Arize Phoenix	Dedicated Platform	Integrated	Native	Native	Native	Yes
Braintrust	Dedicated Platform	Integrated	Native	Native	Native	No

5 Common Pitfalls

1. Ignoring Token Counting in Streaming Responses

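# Usage arrives only if the request sets stream_options={"include_usage": True};
# it is then attached to the final chunk, whose choices list is empty.
total_usage = {"prompt_tokens": 0, "completion_tokens": 0}
for chunk in stream_response:
    if hasattr(chunk, 'usage') and chunk.usage:
        total_usage = {
            "prompt_tokens": chunk.usage.prompt_tokens,
            "completion_tokens": chunk.usage.completion_tokens
        }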
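
The loop above can be checked without an API call. This sketch builds hand-made chunks shaped like OpenAI streaming chunks (an assumption about the response shape; `make_chunk` and `collect_stream` are hypothetical helpers) and confirms that usage is picked up from the final, choices-free chunk while the text is assembled from the others.

```python
from types import SimpleNamespace


def make_chunk(text=None, usage=None):
    """Build a hand-made chunk shaped like an OpenAI streaming chunk (illustration only)."""
    choices = [] if text is None else [SimpleNamespace(delta=SimpleNamespace(content=text))]
    return SimpleNamespace(choices=choices, usage=usage)


def collect_stream(chunks) -> tuple[str, dict]:
    """Return the concatenated text and the usage reported by the usage-bearing chunk."""
    parts = []
    usage = {"prompt_tokens": 0, "completion_tokens": 0}
    for chunk in chunks:
        if chunk.usage:
            usage = {
                "prompt_tokens": chunk.usage.prompt_tokens,
                "completion_tokens": chunk.usage.completion_tokens,
            }
        # Guard against the usage-only chunk, which has no choices to index.
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts), usage


stream_response = [
    make_chunk("Token "),
    make_chunk("counts "),
    make_chunk("matter."),
    make_chunk(usage=SimpleNamespace(prompt_tokens=12, completion_tokens=3)),
]
text, total_usage = collect_stream(stream_response)
print(text, total_usage)
```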

2. Improper Sampling Rate

from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
sampler = TraceIdRatioBased(rate=0.1)
trace_provider = TracerProvider(sampler=sampler)
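
Ratio-based head sampling decides from the trace ID alone, so every service that sees the same trace reaches the same verdict. The sketch below illustrates that idea with a simplified threshold check. It is an approximation for intuition, not the SDK's actual code, and `should_sample` is a hypothetical name.

```python
import random

TRACE_ID_SPACE = 1 << 64  # compare on the low 64 bits of the 128-bit trace ID


def should_sample(trace_id: int, rate: float) -> bool:
    """Keep the trace when its low 64 bits fall below rate * 2**64."""
    return (trace_id & (TRACE_ID_SPACE - 1)) < int(rate * TRACE_ID_SPACE)


rng = random.Random(42)  # fixed seed so the demonstration is repeatable
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
kept = sum(should_sample(t, 0.1) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces ({kept / len(trace_ids):.1%})")
```

Because the decision ignores what happens inside the trace, a 10% rate also discards roughly 90% of failed calls. Keeping every error trace requires a decision made after the trace completes, for example tail-based sampling in the Collector.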

3. Prompt Content Leaking into Traces

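# Wrong: stores the raw prompt, which may contain personal or confidential data
span.set_attribute("gen_ai.prompt", messages)

# Correct: store only derived metadata (count_tokens is a placeholder for your tokenizer)
span.set_attribute("gen_ai.prompt.token_count", count_tokens(messages))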
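
When a token counter is not available, size and a fingerprint can still be recorded without keeping any prompt text. The following is a minimal sketch using only the standard library. `prompt_metadata` and the `app.prompt.*` attribute names are hypothetical and not part of the OpenTelemetry conventions.

```python
import hashlib
import json


def prompt_metadata(messages: list[dict]) -> dict:
    """Summarize a chat prompt without retaining any of its text."""
    # Canonical JSON so the same messages always hash to the same value.
    canonical = json.dumps(messages, sort_keys=True, ensure_ascii=False)
    return {
        "app.prompt.message_count": len(messages),
        "app.prompt.char_count": sum(len(m.get("content") or "") for m in messages),
        "app.prompt.sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
    }


messages = [
    {"role": "system", "content": "Answer based on the following references."},
    {"role": "user", "content": "What is our refund policy?"},
]
meta = prompt_metadata(messages)
print(meta)
```

The hash makes it possible to spot repeated prompts across traces. Short or guessable prompts can still be recovered from a hash by brute force, so it may be worth dropping that field for sensitive workloads.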

4. Not Distinguishing Model Versions

span.set_attribute("gen_ai.request.model", "gpt-4o")
span.set_attribute("gen_ai.response.model", response.model)

5. Missing End-to-End Span Correlation

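from opentelemetry import context

with tracer.start_as_current_span("rag.pipeline") as parent:
    # Child spans started inside this block inherit the parent automatically;
    # capture the context explicitly only to hand it to another thread or task.
    ctx = context.get_current()
    with tracer.start_as_current_span("rag.retrieve", context=ctx):
        pass
    with tracer.start_as_current_span("rag.generate", context=ctx):
        pass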

Troubleshooting 10 Common Errors

#	Symptom	Possible Cause	Resolution
1	Traces not appearing in Grafana	OTLP Exporter misconfigured	Check endpoint and auth
2	Token count is zero	Streaming response missing usage	Get usage from last chunk
3	Cost calculation inaccurate	Price table outdated	Update COST_PER_TOKEN
4	Hallucination rate abnormally high	Poor RAG retrieval quality	Check vector index and params
5	Missing Span attributes	Semantic convention version mismatch	Update OTel SDK
6	Memory leak	TracerProvider not closed	Call shutdown() on exit
7	Critical traces lost to sampling	Sampling rate too low	Keep 100% of error traces (e.g. tail-based sampling in the Collector)
8	Abnormally high TTFT	Network latency or cold start	Check network and model warmup
9	Quality evaluation hallucinating	Evaluation model unreliable	Use stronger eval model
10	Metric/Trace mismatch	Timezone inconsistency	Use UTC consistently

Summary: LLM application observability requires monitoring dimensions beyond traditional metrics: token usage, API costs, hallucination rates, and answer quality. OpenTelemetry's generative AI semantic conventions provide a standardized vocabulary for these signals. Key practices: track tokens correctly in streaming responses, set appropriate sampling rates, keep prompt content out of traces, distinguish requested and served model versions, and maintain end-to-end span correlation. Combined with Grafana dashboards, these practices cover LLM application monitoring from cost to quality.