OpenTelemetry LLM Observability: End-to-End AI Application Monitoring Guide
Why LLM Applications Need Specialized Observability in 2026
Traditional application monitoring focuses on latency, throughput, and error rates. LLM applications add new observability dimensions: token usage, API call costs, hallucination rates, and context quality. An LLM application can show normal response times and zero errors yet cost $0.50 per call with a 30% hallucination rate, and none of that is visible in traditional monitoring.
OpenTelemetry now publishes semantic conventions for generative AI (the gen_ai.* span attributes and metrics used throughout this guide), providing a standardized approach to LLM observability.
| Dimension | Traditional Monitoring | LLM Monitoring |
|---|---|---|
| Latency | Request latency | Time to First Token (TTFT), total latency |
| Throughput | QPS | Tokens/second |
| Errors | Error rate | Error rate + hallucination rate |
| Cost | Server cost | API call cost (token-based billing) |
| Quality | N/A | Accuracy, relevance, completeness |
| Context | N/A | Prompt length, context window utilization |
| Caching | N/A | Cache hit rate |
| Model | N/A | Model version, temperature parameter |
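To make the LLM-specific columns concrete, here is a minimal sketch of how TTFT, token throughput, and context window utilization could be derived from raw timestamps and token counts. The `CallTiming` type, the `derive_metrics` helper, and the 128,000-token window are illustrative assumptions, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class CallTiming:
    """Timestamps (in seconds) and token counts for one LLM call (illustrative type)."""
    start: float
    first_token: float
    end: float
    input_tokens: int
    output_tokens: int


def derive_metrics(t: CallTiming, context_window: int) -> dict:
    """Compute the LLM-specific signals listed in the table above."""
    # Throughput is measured over the generation phase only (after the first token).
    generation_s = t.end - t.first_token
    return {
        "ttft_ms": (t.first_token - t.start) * 1000,
        "total_latency_ms": (t.end - t.start) * 1000,
        "output_tokens_per_s": t.output_tokens / generation_s if generation_s > 0 else 0.0,
        "context_utilization": t.input_tokens / context_window,
    }


# Hypothetical call: first token after 0.5 s, finished after 4.5 s.
m = derive_metrics(CallTiming(0.0, 0.5, 4.5, 1600, 800), context_window=128_000)
print(m)
```

Measuring throughput from the first token onward keeps queueing and prompt-processing time out of the tokens/second figure, so the two latency numbers and the throughput number each describe a different phase of the call.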
OpenTelemetry LLM Semantic Conventions
Span: gen_ai.client.chat
  Attributes:
    gen_ai.system = "openai"
    gen_ai.request.model = "gpt-4o"
    gen_ai.request.max_tokens = 4096
    gen_ai.request.temperature = 0.7
    gen_ai.response.model = "gpt-4o-2024-05-13"
    gen_ai.response.finish_reasons = ["stop"]
    gen_ai.usage.input_tokens = 1523
    gen_ai.usage.output_tokens = 847
    gen_ai.usage.total_tokens = 2370
Metric: gen_ai.client.token.usage
  Attributes:
    gen_ai.system = "openai"
    gen_ai.request.model = "gpt-4o"
  Value: 2370 (total tokens)
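One way to keep these attribute names consistent across call sites is to build them in a single helper and pass the result to `span.set_attributes(...)`. The sketch below is plain Python with no OpenTelemetry dependency; `genai_span_attributes` is a hypothetical helper, and the keys simply mirror the listing above.

```python
def genai_span_attributes(
    system: str,
    request_model: str,
    response_model: str,
    input_tokens: int,
    output_tokens: int,
    temperature=None,
    max_tokens=None,
    finish_reasons=(),
) -> dict:
    """Build a dict of gen_ai.* attributes mirroring the listing above."""
    attrs = {
        "gen_ai.system": system,
        "gen_ai.request.model": request_model,
        "gen_ai.request.temperature": temperature,
        "gen_ai.request.max_tokens": max_tokens,
        "gen_ai.response.model": response_model,
        "gen_ai.response.finish_reasons": list(finish_reasons),
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.usage.total_tokens": input_tokens + output_tokens,
    }
    # Span attributes cannot hold None, so drop unset optional parameters.
    return {key: value for key, value in attrs.items() if value is not None}


attrs = genai_span_attributes(
    "openai", "gpt-4o", "gpt-4o-2024-05-13", 1523, 847,
    temperature=0.7, finish_reasons=["stop"],
)
print(attrs["gen_ai.usage.total_tokens"])
```

Deriving the total from the two parts avoids the three counters drifting apart when one of them is updated by hand.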
Complete Python Implementation: Trace LLM Calls, Token Usage, Latency
Setup
pip install opentelemetry-api opentelemetry-sdk \
    opentelemetry-exporter-otlp \
    opentelemetry-instrumentation-openai \
    openai
Base Configuration
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Traces: batch spans and send them to a local OTLP/gRPC collector.
trace_provider = TracerProvider()
trace_provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
    )
)
trace.set_tracer_provider(trace_provider)

# Metrics: export to the same collector every 15 seconds.
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://localhost:4317", insecure=True),
    export_interval_millis=15000
)
metrics.set_meter_provider(MeterProvider(metric_readers=[metric_reader]))

tracer = trace.get_tracer("llm-app", "1.0.0")
meter = metrics.get_meter("llm-app", "1.0.0")
Manual LLM Call Tracing
import openai
import time
from opentelemetry.trace import Status, StatusCode

# With no api_key argument the client reads OPENAI_API_KEY from the environment,
# which keeps the secret out of source code.
client = openai.OpenAI()

token_counter = meter.create_counter(
    "gen_ai.client.token.usage",
    description="Token usage for LLM calls",
    unit="tokens"
)
llm_latency = meter.create_histogram(
    "gen_ai.client.operation.latency",
    description="Latency of LLM operations",
    unit="ms"
)
def traced_chat_completion(
    messages: list[dict],
    model: str = "gpt-4o",
    temperature: float = 0.7,
    max_tokens: int = 4096,
    stream: bool = False
) -> dict:
    with tracer.start_as_current_span("gen_ai.client.chat") as span:
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.temperature", temperature)
        span.set_attribute("gen_ai.request.max_tokens", max_tokens)
        # perf_counter is monotonic, so it is safer than time.time() for durations.
        start_time = time.perf_counter()
        first_token_time = None
        try:
            if stream:
                # include_usage asks the API to append a final chunk with token usage.
                response = client.chat.completions.create(
                    model=model, messages=messages,
                    temperature=temperature, max_tokens=max_tokens, stream=True,
                    stream_options={"include_usage": True}
                )
                content_parts = []
                usage = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
                for chunk in response:
                    if chunk.usage:
                        usage = {
                            "prompt_tokens": chunk.usage.prompt_tokens,
                            "completion_tokens": chunk.usage.completion_tokens,
                            "total_tokens": chunk.usage.total_tokens
                        }
                    # The usage-only chunk has an empty choices list, so guard the index.
                    if chunk.choices and chunk.choices[0].delta.content:
                        if first_token_time is None:
                            first_token_time = time.perf_counter()
                        content_parts.append(chunk.choices[0].delta.content)
                content = "".join(content_parts)
            else:
                response = client.chat.completions.create(
                    model=model, messages=messages,
                    temperature=temperature, max_tokens=max_tokens
                )
                content = response.choices[0].message.content
                usage = {
                    "prompt_tokens": response.usage.prompt_tokens,
                    "completion_tokens": response.usage.completion_tokens,
                    "total_tokens": response.usage.total_tokens
                }
                # TTFT cannot be observed without streaming; first_token_time stays None.
            end_time = time.perf_counter()
            latency_ms = (end_time - start_time) * 1000
            ttft_ms = (first_token_time - start_time) * 1000 if first_token_time is not None else None
            span.set_attribute("gen_ai.usage.input_tokens", usage["prompt_tokens"])
            span.set_attribute("gen_ai.usage.output_tokens", usage["completion_tokens"])
            span.set_attribute("gen_ai.usage.total_tokens", usage["total_tokens"])
            span.set_status(Status(StatusCode.OK))
            token_counter.add(usage["prompt_tokens"], {"gen_ai.system": "openai", "gen_ai.token.type": "input", "gen_ai.request.model": model})
            token_counter.add(usage["completion_tokens"], {"gen_ai.system": "openai", "gen_ai.token.type": "output", "gen_ai.request.model": model})
            llm_latency.record(latency_ms, {"gen_ai.system": "openai", "gen_ai.request.model": model})
            return {"content": content, "usage": usage, "latency_ms": latency_ms, "ttft_ms": ttft_ms}
        except Exception as e:
            span.set_status(Status(StatusCode.ERROR, str(e)))
            span.record_exception(e)
            raise
Tracing RAG Pipeline
def traced_rag_query(question: str, top_k: int = 5) -> dict:
    # vector_search is assumed to be defined elsewhere and to return
    # objects with a .content attribute.
    with tracer.start_as_current_span("rag.pipeline") as rag_span:
        # Raw user text: omit or redact this attribute if questions may be sensitive.
        rag_span.set_attribute("rag.question", question)
        rag_span.set_attribute("rag.top_k", top_k)

        # Stage 1: retrieve candidate documents.
        with tracer.start_as_current_span("rag.retrieve") as retrieve_span:
            retrieved_docs = vector_search(question, top_k)
            retrieve_span.set_attribute("rag.retrieved_count", len(retrieved_docs))

        # Stage 2: assemble the prompt from the retrieved context.
        with tracer.start_as_current_span("rag.construct_prompt") as prompt_span:
            context_text = "\n".join([doc.content for doc in retrieved_docs])
            messages = [
                {"role": "system", "content": "Answer based on the following references."},
                {"role": "user", "content": f"References:\n{context_text}\n\nQuestion: {question}"}
            ]
            prompt_span.set_attribute("rag.context_length", len(context_text))

        # Stage 3: generate; the LLM span becomes a child of rag.generate.
        with tracer.start_as_current_span("rag.generate") as gen_span:
            result = traced_chat_completion(messages)
            gen_span.set_attribute("rag.answer_length", len(result["content"]))

        return {"answer": result["content"], "usage": result["usage"], "latency_ms": result["latency_ms"]}
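The pipeline above calls `vector_search` without defining it. For local experiments, a stand-in along the following lines may be enough to exercise the spans. Everything here is hypothetical: the `Doc` type, the in-memory `CORPUS`, and the word-overlap ranking, which only imitates embedding similarity and is not a real vector search.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Minimal document type exposing the .content attribute the pipeline reads."""
    content: str


# Hypothetical in-memory corpus; a real system would query a vector database.
CORPUS = [
    Doc("OpenTelemetry exports traces and metrics over OTLP."),
    Doc("Grafana renders dashboards from Prometheus queries."),
    Doc("Token usage drives the cost of LLM API calls."),
]


def vector_search(question: str, top_k: int = 5) -> list[Doc]:
    """Rank documents by word overlap with the question (a stand-in for embedding similarity)."""
    q_words = set(question.lower().split())
    scored = [
        (len(q_words & set(doc.content.lower().split())), doc)
        for doc in CORPUS
    ]
    # Sort on the score only, so Doc objects are never compared with each other.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]


hits = vector_search("what drives token cost", top_k=2)
print([doc.content for doc in hits])
```

Because the stub returns an empty list when nothing overlaps, it also gives a cheap way to see what a `rag.retrieved_count` of zero looks like in the trace.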
Grafana Dashboard Setup
Cost Monitoring Dashboard
{
  "dashboard": {
    "title": "LLM Cost Monitoring",
    "panels": [
      {
        "title": "Daily API Spend ($)",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(increase(gen_ai_client_token_usage_total{gen_ai_token_type=\"input\"}[$__range])) * 0.000005 + sum(increase(gen_ai_client_token_usage_total{gen_ai_token_type=\"output\"}[$__range])) * 0.000015"
          }
        ]
      },
      {
        "title": "Token Usage by Model",
        "type": "piechart",
        "targets": [
          {
            "expr": "sum by (gen_ai_request_model) (increase(gen_ai_client_token_usage_total[$__range]))"
          }
        ]
      }
    ]
  }
}
Cost Monitoring
COST_PER_TOKEN = {
    "gpt-4o": {"input": 0.000005, "output": 0.000015},
    "gpt-4o-mini": {"input": 0.00000015, "output": 0.0000006},
    "gpt-4-turbo": {"input": 0.00001, "output": 0.00003},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    rates = COST_PER_TOKEN.get(model, {"input": 0, "output": 0})
    return (input_tokens * rates["input"]) + (output_tokens * rates["output"])
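As a worked example, the sketch below prices the sample call from the semantic conventions listing (1523 input tokens, 847 output tokens). It is a variant of the function above rather than a replacement: rates are written per million tokens, which is easier to check against a provider's price sheet, and an unknown model raises an error instead of silently costing $0. The `PRICE_PER_MILLION` table and `estimate_cost` name are illustrative, and the prices are the ones from the table above, which may be outdated.

```python
# USD per 1M tokens, converted from the per-token table above.
# Treat these as placeholders and verify against current provider pricing.
PRICE_PER_MILLION = {
    "gpt-4o": {"input": 5.00, "output": 15.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one call; fail loudly on unknown models."""
    try:
        rates = PRICE_PER_MILLION[model]
    except KeyError:
        raise ValueError(f"no price configured for model {model!r}") from None
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000


# 1523 * 5 + 847 * 15 = 7615 + 12705 = 20320 micro-dollars
cost = estimate_cost("gpt-4o", 1523, 847)
print(f"${cost:.5f}")  # prints $0.02032
```

Raising on a missing model is a deliberate choice here: a new model name that falls back to zero cost would make the spend dashboard look healthier than it is.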
Quality Monitoring
Hallucination Detection
import json

def evaluate_answer_quality(question: str, answer: str, context: str) -> dict:
    with tracer.start_as_current_span("llm.quality.evaluation") as span:
        # LLM-as-judge: a cheaper model scores the answer against the retrieved context.
        evaluation_prompt = [
            {"role": "system", "content": "Evaluate AI answer quality. Return JSON: {\"grounded\": bool, \"relevance\": 0-1, \"completeness\": 0-1}"},
            {"role": "user", "content": f"Question: {question}\nContext: {context}\nAnswer: {answer}"}
        ]
        result = traced_chat_completion(evaluation_prompt, model="gpt-4o-mini", temperature=0)
        # json.loads raises if the judge model returns anything other than valid JSON.
        evaluation = json.loads(result["content"])
        span.set_attribute("llm.quality.grounded", evaluation["grounded"])
        span.set_attribute("llm.quality.relevance", evaluation["relevance"])
        return evaluation
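The evaluator's reply is itself model output, so it can arrive wrapped in a Markdown code fence, with scores outside 0-1, or not as JSON at all. A defensive parser along these lines could sit between the model call and the span attributes. `parse_evaluation` and the `parse_error` flag are assumptions for illustration, not part of the code above.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker


def parse_evaluation(raw: str) -> dict:
    """Parse and validate the evaluator's JSON reply; return a flagged fallback on bad output."""
    fallback = {"grounded": False, "relevance": 0.0, "completeness": 0.0, "parse_error": True}
    text = raw.strip()
    # Some models wrap JSON in a fenced block; strip the fence and an optional "json" tag.
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback

    def clamp(value) -> float:
        """Coerce a score to a float in [0, 1]; treat unusable values as 0."""
        try:
            return min(1.0, max(0.0, float(value)))
        except (TypeError, ValueError):
            return 0.0

    return {
        "grounded": data.get("grounded") is True,
        "relevance": clamp(data.get("relevance")),
        "completeness": clamp(data.get("completeness")),
        "parse_error": False,
    }


good = parse_evaluation('{"grounded": true, "relevance": 0.9, "completeness": 1.4}')
bad = parse_evaluation("Sorry, I cannot evaluate this.")
print(good, bad)
```

Recording `parse_error` as its own span attribute keeps evaluator failures from being counted as ungrounded answers, which would otherwise inflate the measured hallucination rate.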
LLM Observability Tool Comparison
| Tool | Type | OpenTelemetry | Token Tracking | Cost Monitoring | Quality Monitoring | Self-hosted |
|---|---|---|---|---|---|---|
| OTel + Grafana | Framework+Dashboard | Native | Custom | Custom | Custom | Yes |
| Langfuse | Dedicated Platform | Integrated | Native | Native | Native | Yes |
| Helicone | Dedicated Platform | Proxy | Native | Native | Basic | No |
| Arize Phoenix | Dedicated Platform | Integrated | Native | Native | Native | Yes |
| Braintrust | Dedicated Platform | Integrated | Native | Native | Native | No |
5 Common Pitfalls
1. Ignoring Token Counting in Streaming Responses
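# Usage only arrives in the stream if the request was made with
# stream_options={"include_usage": True}; it is carried by the final chunk.
total_usage = {"prompt_tokens": 0, "completion_tokens": 0}
for chunk in stream_response:
    if hasattr(chunk, 'usage') and chunk.usage:
        total_usage = {
            "prompt_tokens": chunk.usage.prompt_tokens,
            "completion_tokens": chunk.usage.completion_tokens
        }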
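The loop above can be tried without network access by feeding it stand-in chunks. In the sketch below the `Fake*` dataclasses are hypothetical and only imitate the fields the loop reads. It assumes the OpenAI Chat Completions behavior where, with `stream_options={"include_usage": True}`, the last chunk carries `usage` and has an empty `choices` list.

```python
import time
from dataclasses import dataclass, field
from typing import Iterable, Optional


@dataclass
class FakeUsage:
    prompt_tokens: int
    completion_tokens: int


@dataclass
class FakeDelta:
    content: Optional[str] = None


@dataclass
class FakeChoice:
    delta: FakeDelta


@dataclass
class FakeChunk:
    choices: list = field(default_factory=list)
    usage: Optional[FakeUsage] = None


def consume_stream(chunks: Iterable[FakeChunk]) -> dict:
    """Collect text, time to first token, and usage from a chunk stream."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    usage = None
    for chunk in chunks:
        if chunk.usage is not None:
            usage = chunk.usage
        if not chunk.choices:
            continue  # usage-only chunk: nothing to append
        text = chunk.choices[0].delta.content
        if text:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            parts.append(text)
    return {
        "content": "".join(parts),
        "ttft_ms": (first_token_at - start) * 1000 if first_token_at is not None else None,
        "input_tokens": usage.prompt_tokens if usage else None,
        "output_tokens": usage.completion_tokens if usage else None,
    }


stream = [
    FakeChunk([FakeChoice(FakeDelta("Hel"))]),
    FakeChunk([FakeChoice(FakeDelta("lo"))]),
    FakeChunk([], FakeUsage(prompt_tokens=12, completion_tokens=2)),
]
result = consume_stream(stream)
print(result["content"], result["input_tokens"], result["output_tokens"])
```

Returning `None` rather than 0 when no usage chunk arrives lets a dashboard distinguish "usage missing" from "zero tokens", which is the failure this pitfall describes.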
2. Improper Sampling Rate
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
sampler = TraceIdRatioBased(rate=0.1)
trace_provider = TracerProvider(sampler=sampler)
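To see what a 10% ratio means in practice, the sketch below imitates the idea behind ratio-based sampling: the decision is a deterministic function of the trace ID, so every service that sees the same trace reaches the same verdict. This is an illustration in plain Python, not the SDK's implementation, and `ratio_sampled` and `keep_trace` are hypothetical names. Keeping every errored trace cannot be decided at span start, since the outcome is not yet known; it is usually done after the fact, for example with tail-based sampling in a collector.

```python
import random

TRACE_ID_MASK = (1 << 64) - 1  # compare only the low 64 bits of the 128-bit trace ID


def ratio_sampled(trace_id: int, rate: float) -> bool:
    """Deterministically keep roughly `rate` of all trace IDs."""
    bound = round(rate * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound


def keep_trace(trace_id: int, rate: float, had_error: bool) -> bool:
    """Tail-style rule: keep every errored trace plus a fixed ratio of the rest."""
    return had_error or ratio_sampled(trace_id, rate)


rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(10_000)]
kept = sum(ratio_sampled(tid, 0.1) for tid in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

For LLM workloads a low ratio saves storage but also hides rare expensive calls, so pairing it with an always-keep rule for errors (and possibly for high-cost calls) is worth considering.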
3. Prompt Content Leaking into Traces
# Wrong
span.set_attribute("gen_ai.prompt", messages)
# Correct
span.set_attribute("gen_ai.prompt.token_count", count_tokens(messages))
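When some trace of the prompt is still needed for debugging, one option is to record only derived values: message count, size, and a hash that lets identical prompts be correlated without storing their text. The helper and the `app.prompt.*` attribute names below are hypothetical, not part of any semantic convention.

```python
import hashlib
import json


def safe_prompt_attributes(messages: list[dict]) -> dict:
    """Return span-safe metadata about a prompt without including its content."""
    serialized = json.dumps(messages, sort_keys=True, ensure_ascii=False)
    return {
        "app.prompt.message_count": len(messages),
        "app.prompt.char_count": sum(len(str(m.get("content") or "")) for m in messages),
        # The hash allows grouping identical prompts; it does not expose the text itself.
        "app.prompt.sha256": hashlib.sha256(serialized.encode("utf-8")).hexdigest(),
    }


messages = [
    {"role": "system", "content": "Answer based on the following references."},
    {"role": "user", "content": "What is my account balance?"},
]
prompt_attrs = safe_prompt_attributes(messages)
print(prompt_attrs)
```

A plain hash of a short or predictable prompt can still be guessed by brute force, so for sensitive inputs a keyed hash (HMAC) or dropping the hash entirely may be the safer choice.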
4. Not Distinguishing Model Versions
span.set_attribute("gen_ai.request.model", "gpt-4o")
span.set_attribute("gen_ai.response.model", response.model)
5. Missing End-to-End Span Correlation
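from opentelemetry import context

with tracer.start_as_current_span("rag.pipeline") as parent:
    # Spans started inside this block already inherit the parent automatically.
    # Capturing the context explicitly is only needed when work crosses a
    # thread or task boundary.
    ctx = context.get_current()
    with tracer.start_as_current_span("rag.retrieve", context=ctx):
        pass
    with tracer.start_as_current_span("rag.generate", context=ctx):
        pass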
Troubleshooting 10 Common Errors
| # | Symptom | Possible Cause | Resolution |
|---|---|---|---|
| 1 | Traces not appearing in Grafana | OTLP Exporter misconfigured | Check endpoint and auth |
| 2 | Token count is zero | Streaming response missing usage | Get usage from last chunk |
| 3 | Cost calculation inaccurate | Price table outdated | Update COST_PER_TOKEN |
| 4 | Hallucination rate abnormally high | Poor RAG retrieval quality | Check vector index and params |
| 5 | Missing Span attributes | Semantic convention version mismatch | Update OTel SDK |
| 6 | Memory leak | TracerProvider not closed | Call shutdown() on exit |
| 7 | Critical traces lost to sampling | Sampling rate too low | 100% sample for errors |
| 8 | Abnormally high TTFT | Network latency or cold start | Check network and model warmup |
| 9 | Quality evaluation hallucinating | Evaluation model unreliable | Use stronger eval model |
| 10 | Metric/Trace mismatch | Timezone inconsistency | Use UTC consistently |
Summary: LLM application observability requires monitoring dimensions beyond traditional metrics: token usage, API costs, hallucination rates, and answer quality. OpenTelemetry's LLM semantic conventions provide a standardized way to record them. Key practices: track tokens correctly in streaming responses, set appropriate sampling rates, keep prompt content out of traces, distinguish requested and served model versions, and establish end-to-end span correlation. Combined with Grafana dashboards, these practices give comprehensive LLM application monitoring, from cost to quality.