Python OpenTelemetry LLMトレーシング：Spanからインテリジェントアラートまでの6つのプロダクションパターン

あなたのLLMアプリケーションはブラックボックスではありませんか？

2026年、LLMアプリケーションは本番環境に深く浸透しています。しかし、ほとんどのチームが同じ問題に直面しています：LLMコールチェーンが見えない。1つのユーザーリクエストが、Prompt構築、Embedding検索、マルチターン対話、モデル推論、後処理など複数のステージを経る可能性があります — どこかで問題が発生しても、推測するしかありません。

従来のAPMツールはLLM特有のセマンティクスを理解できません：トークン使用量、初回トークンレイテンシ、Prompt漏洩、ハルシネーション検出。OpenTelemetryは、そのセマンティック規約（Semantic Conventions）と拡張可能なSpan属性により、LLMトレーシングの最適な選択肢です。

本記事のポイント：

LLM Span計装と属性アノテーションの標準パターン
トークン使用量追跡とリアルタイムコスト監視
Prompt/Response構造化ログとサニタイズ
マルチモデルチェーントレーシング（RAG + Agent + Tool）
エラー監視と異常検知
インテリジェントアラートとSLA保証体系

OpenTelemetry LLMコア概念
Pattern 1: LLM Span計装と属性アノテーション
Pattern 2: トークン使用量追跡とコスト監視
Pattern 3: Prompt/Response構造化ログ
Pattern 4: マルチモデルチェーントレーシング
Pattern 5: エラー監視と異常検知
Pattern 6: インテリジェントアラートとSLA保証
5つのよくある落とし穴と解決策
10のよくあるエラートラブルシューティング
高度な最適化のヒント
比較分析：OpenTelemetry vs LangSmith vs Promptflow
オンラインツール推奨

OpenTelemetry LLMコア概念

なぜOpenTelemetryでLLMトレーシングを行うのか

┌─────────────────────────────────────────────────────────┐
│              LLMアプリケーショントレーシングアーキテクチャ    │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ユーザーリクエスト ──► API Gateway ──► LLM Service       │
│                                        │                │
│                          ┌─────────────┼─────────────┐  │
│                          ▼             ▼             ▼  │
│                    Embedding      Chat Model    Tool Call│
│                    Service        (GPT-4o)      Service │
│                          │             │             │  │
│                          ▼             ▼             ▼  │
│                    Vector DB      Reranker      External │
│                                                         │
│  各ホップ = 1 Span、LLMセマンティック属性を付与              │
│  全体チェーン = 1 Trace、エンドツーエンドで追跡可能           │
│                                                         │
│  ──► OTel Collector ──► Jaeger/Tempo (Trace)            │
│                      ──► Prometheus (Metrics)            │
│                      ──► Loki/ELK (Logs)                 │
└─────────────────────────────────────────────────────────┘

OpenTelemetryはLLMトレーシングに3つのコア機能を提供します：

機能	説明	LLMでの価値
Trace	サービス間チェーン追跡	ユーザーリクエストからLLMレスポンスまでの完全なコールチェーンを追跡
Span	単一操作の記録	各LLMコールのモデル、パラメータ、トークン使用量を記録
Metric	メトリクス収集	トークン消費、レイテンシ分布、エラー率をリアルタイム監視
Baggage	サービス間メタデータ伝播	ユーザーID、セッションID、A/Bテストラベルを伝播

OpenTelemetry LLMセマンティック規約（2026版）

Span Kind: CLIENT
Span Name:  gen_ai.client.chat

標準属性：
  gen_ai.system              = "openai"        # LLMプロバイダー
  gen_ai.request.model       = "gpt-4o"        # リクエストモデル
  gen_ai.request.max_tokens  = 4096            # 最大出力トークン
  gen_ai.request.temperature = 0.7             # 温度パラメータ
  gen_ai.response.model      = "gpt-4o-2026-05-13"  # 実際のモデル
  gen_ai.response.finish_reasons = ["stop"]    # 終了理由
  gen_ai.usage.input_tokens  = 1523            # 入力トークン
  gen_ai.usage.output_tokens = 847             # 出力トークン

イベント（Span Events）：
  gen_ai.content.prompt     = "..."            # Prompt内容
  gen_ai.content.completion = "..."            # レスポンス内容

Pattern 1: LLM Span計装と属性アノテーション

基本的なSpan作成

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

resource = Resource.create({
    "service.name": "llm-chat-service",
    "service.version": "2.1.0",
    "deployment.environment": "production",
})

tracer_provider = TracerProvider(resource=resource)
tracer_provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
    )
)
trace.set_tracer_provider(tracer_provider)
tracer = trace.get_tracer("llm-chat-service", "2.1.0")

標準LLM Span計装

import time
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def trace_llm_call(model: str, prompt: str, **kwargs):
    with tracer.start_as_current_span(
        "gen_ai.client.chat",
        kind=trace.SpanKind.CLIENT,
        attributes={
            "gen_ai.system": "openai",
            "gen_ai.request.model": model,
            "gen_ai.request.max_tokens": kwargs.get("max_tokens", 4096),
            "gen_ai.request.temperature": kwargs.get("temperature", 0.7),
        },
    ) as span:
        start_time = time.time()

        try:
            response = call_openai_api(model, prompt, **kwargs)

            span.set_attributes({
                "gen_ai.response.model": response.model,
                "gen_ai.response.finish_reasons": [
                    choice.finish_reason for choice in response.choices
                ],
                "gen_ai.usage.input_tokens": response.usage.prompt_tokens,
                "gen_ai.usage.output_tokens": response.usage.completion_tokens,
                "llm.time_to_first_token_ms": kwargs.get("ttft_ms", 0),
                "llm.total_duration_ms": (time.time() - start_time) * 1000,
            })

            span.add_event(
                "gen_ai.content.completion",
                attributes={
                    "gen_ai.content": response.choices[0].message.content[:500],
                },
            )

            return response

        except Exception as e:
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            span.record_exception(e)
            raise

イベントアノテーション付きSpan

def trace_llm_with_events(model: str, messages: list, **kwargs):
    with tracer.start_as_current_span(
        "gen_ai.client.chat",
        kind=trace.SpanKind.CLIENT,
        attributes={
            "gen_ai.system": "openai",
            "gen_ai.request.model": model,
            "gen_ai.request.max_tokens": kwargs.get("max_tokens", 4096),
        },
    ) as span:
        for i, msg in enumerate(messages):
            span.add_event(
                f"gen_ai.content.prompt.{i}",
                attributes={
                    "gen_ai.content.role": msg["role"],
                    "gen_ai.content": msg["content"][:1000],
                },
            )

        start_time = time.time()
        first_token_time = None

        response = call_openai_streaming(model, messages, **kwargs)

        chunks = []
        for chunk in response:
            if first_token_time is None:
                first_token_time = time.time()
                ttft_ms = (first_token_time - start_time) * 1000
                span.set_attribute("llm.time_to_first_token_ms", ttft_ms)
            chunks.append(chunk)

        full_content = "".join(chunks)
        span.set_attributes({
            "gen_ai.usage.output_tokens": len(full_content) // 4,
            "llm.total_duration_ms": (time.time() - start_time) * 1000,
        })

        span.add_event(
            "gen_ai.content.completion",
            attributes={"gen_ai.content": full_content[:500]},
        )

        return full_content

Pattern 2: トークン使用量追跡とコスト監視

トークンカウンターとコスト計算

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://localhost:4317", insecure=True),
    export_interval_millis=60000,
)
meter_provider = MeterProvider(metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)

meter = metrics.get_meter("llm-chat-service", "2.1.0")

token_counter = meter.create_counter(
    name="gen_ai.client.token.usage",
    description="Token usage counter for LLM calls",
    unit="tokens",
)

cost_counter = meter.create_counter(
    name="llm.cost.usd",
    description="LLM API cost in USD",
    unit="USD",
)

MODEL_PRICING = {
    "gpt-4o": {"input": 2.50 / 1_000_000, "output": 10.00 / 1_000_000},
    "gpt-4o-mini": {"input": 0.15 / 1_000_000, "output": 0.60 / 1_000_000},
    "claude-sonnet-4-20250514": {"input": 3.00 / 1_000_000, "output": 15.00 / 1_000_000},
    "deepseek-r1": {"input": 0.55 / 1_000_000, "output": 2.19 / 1_000_000},
}

def record_token_usage(model: str, input_tokens: int, output_tokens: int):
    token_counter.add(input_tokens, {
        "gen_ai.system": "openai",
        "gen_ai.request.model": model,
        "token.type": "input",
    })
    token_counter.add(output_tokens, {
        "gen_ai.system": "openai",
        "gen_ai.request.model": model,
        "token.type": "output",
    })

    pricing = MODEL_PRICING.get(model, {"input": 0, "output": 0})
    cost = input_tokens * pricing["input"] + output_tokens * pricing["output"]
    cost_counter.add(cost, {
        "gen_ai.request.model": model,
        "cost.type": "api_call",
    })

コストクォータと予算管理

import threading
from datetime import datetime
from collections import defaultdict

class TokenBudgetManager:
    def __init__(self, daily_budget_usd: float = 100.0):
        self.daily_budget = daily_budget_usd
        self.current_spend = defaultdict(float)
        self.lock = threading.Lock()
        self._reset_date = datetime.now().date()

    def check_and_record(self, model: str, input_tokens: int, output_tokens: int) -> bool:
        with self.lock:
            today = datetime.now().date()
            if today != self._reset_date:
                self.current_spend.clear()
                self._reset_date = today

            pricing = MODEL_PRICING.get(model, {"input": 0, "output": 0})
            cost = input_tokens * pricing["input"] + output_tokens * pricing["output"]

            total_spend = sum(self.current_spend.values()) + cost
            if total_spend > self.daily_budget:
                return False

            self.current_spend[model] += cost
            return True

    def get_remaining_budget(self) -> float:
        with self.lock:
            return self.daily_budget - sum(self.current_spend.values())

budget_manager = TokenBudgetManager(daily_budget_usd=200.0)

予算チェック付きLLMコール

def call_llm_with_budget(model: str, messages: list, **kwargs):
    with tracer.start_as_current_span(
        "gen_ai.client.chat",
        kind=trace.SpanKind.CLIENT,
        attributes={
            "gen_ai.system": "openai",
            "gen_ai.request.model": model,
        },
    ) as span:
        response = call_openai_api(model, messages, **kwargs)

        input_tokens = response.usage.prompt_tokens
        output_tokens = response.usage.completion_tokens

        allowed = budget_manager.check_and_record(model, input_tokens, output_tokens)
        if not allowed:
            span.set_attribute("llm.budget.exceeded", True)
            raise BudgetExceededError(
                f"Daily budget exceeded. Remaining: ${budget_manager.get_remaining_budget():.2f}"
            )

        record_token_usage(model, input_tokens, output_tokens)

        span.set_attributes({
            "gen_ai.usage.input_tokens": input_tokens,
            "gen_ai.usage.output_tokens": output_tokens,
            "llm.cost.usd": calculate_cost(model, input_tokens, output_tokens),
            "llm.budget.remaining_usd": budget_manager.get_remaining_budget(),
        })

        return response

Pattern 3: Prompt/Response構造化ログ

サニタイズと構造化記録

import re
import json
from dataclasses import dataclass, field, asdict

@dataclass
class LLMCallLog:
    trace_id: str
    span_id: str
    model: str
    prompt_hash: str
    prompt_length: int
    response_hash: str
    response_length: int
    input_tokens: int
    output_tokens: int
    duration_ms: float
    ttft_ms: float = 0.0
    finish_reason: str = ""
    sanitized_prompt: str = ""
    sanitized_response: str = ""
    metadata: dict = field(default_factory=dict)

PII_PATTERNS = [
    (re.compile(r'\b\d{3}[-.]?\d{4}\b'), "[PHONE]"),
    (re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'), "[EMAIL]"),
    (re.compile(r'\b\d{17}[\dXx]\b'), "[ID_CARD]"),
    (re.compile(r'\b(?:\d[ -]*?){13,19}\b'), "[CREDIT_CARD]"),
    (re.compile(r'api[_-]?key[=:]\s*\S+', re.IGNORECASE), "[API_KEY]"),
]

def sanitize_text(text: str, max_length: int = 500) -> str:
    sanitized = text[:max_length]
    for pattern, replacement in PII_PATTERNS:
        sanitized = pattern.sub(replacement, sanitized)
    return sanitized

def hash_content(content: str) -> str:
    import hashlib
    return hashlib.sha256(content.encode()).hexdigest()[:16]

構造化ロガー

import logging
from datetime import datetime, timezone

class LLMStructuredLogger:
    def __init__(self, service_name: str = "llm-chat-service"):
        self.logger = logging.getLogger(f"{service_name}.llm_calls")
        self.sanitize_enabled = True
        self.max_content_length = 500

    def log_llm_call(self, span, model: str, prompt: str, response: str,
                     input_tokens: int, output_tokens: int,
                     duration_ms: float, ttft_ms: float = 0,
                     finish_reason: str = "", **metadata):
        log_entry = LLMCallLog(
            trace_id=format(span.get_span_context().trace_id, "032x"),
            span_id=format(span.get_span_context().span_id, "016x"),
            model=model,
            prompt_hash=hash_content(prompt),
            prompt_length=len(prompt),
            response_hash=hash_content(response),
            response_length=len(response),
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            duration_ms=round(duration_ms, 2),
            ttft_ms=round(ttft_ms, 2),
            finish_reason=finish_reason,
            sanitized_prompt=sanitize_text(prompt, self.max_content_length) if self.sanitize_enabled else "",
            sanitized_response=sanitize_text(response, self.max_content_length) if self.sanitize_enabled else "",
            metadata=metadata,
        )

        self.logger.info(
            "LLM call completed",
            extra={"llm_call": asdict(log_entry)},
        )

        span.add_event("gen_ai.content.prompt", attributes={
            "gen_ai.content": sanitize_text(prompt, 1000),
            "gen_ai.content.hash": hash_content(prompt),
        })
        span.add_event("gen_ai.content.completion", attributes={
            "gen_ai.content": sanitize_text(response, 1000),
            "gen_ai.content.hash": hash_content(response),
        })

llm_logger = LLMStructuredLogger()

Pattern 4: マルチモデルチェーントレーシング

RAGパイプライントレーシング

def trace_rag_pipeline(query: str, user_id: str = ""):
    with tracer.start_as_current_span(
        "rag.pipeline",
        kind=trace.SpanKind.INTERNAL,
        attributes={
            "rag.query": sanitize_text(query),
            "rag.user_id": user_id,
        },
    ) as pipeline_span:
        embedding = trace_embedding(query, pipeline_span)
        documents = trace_vector_search(embedding, pipeline_span)
        context = trace_reranker(query, documents, pipeline_span)
        answer = trace_llm_generation(query, context, pipeline_span)

        pipeline_span.set_attributes({
            "rag.documents_retrieved": len(documents),
            "rag.context_length": len(context),
            "rag.answer_length": len(answer),
        })

        return answer

def trace_embedding(query: str, parent_span):
    with tracer.start_as_current_span(
        "gen_ai.client.embedding",
        kind=trace.SpanKind.CLIENT,
        attributes={
            "gen_ai.system": "openai",
            "gen_ai.request.model": "text-embedding-3-large",
        },
    ) as span:
        start_time = time.time()
        response = call_embedding_api(query)

        span.set_attributes({
            "gen_ai.usage.input_tokens": response.usage.prompt_tokens,
            "llm.duration_ms": (time.time() - start_time) * 1000,
        })

        return response.data[0].embedding

def trace_vector_search(embedding: list, parent_span):
    with tracer.start_as_current_span(
        "vector.search",
        kind=trace.SpanKind.CLIENT,
        attributes={
            "db.system": "pgvector",
            "db.operation": "similarity_search",
        },
    ) as span:
        start_time = time.time()
        results = search_pgvector(embedding, top_k=10)

        span.set_attributes({
            "db.results.count": len(results),
            "db.results.top_score": results[0].score if results else 0,
            "llm.duration_ms": (time.time() - start_time) * 1000,
        })

        return results

def trace_llm_generation(query: str, context: str, parent_span):
    with tracer.start_as_current_span(
        "gen_ai.client.chat",
        kind=trace.SpanKind.CLIENT,
        attributes={
            "gen_ai.system": "openai",
            "gen_ai.request.model": "gpt-4o",
        },
    ) as span:
        start_time = time.time()
        messages = [
            {"role": "system", "content": f"以下のコンテキストに基づいて回答してください：\n{context}"},
            {"role": "user", "content": query},
        ]
        response = call_openai_api("gpt-4o", messages)

        span.set_attributes({
            "gen_ai.usage.input_tokens": response.usage.prompt_tokens,
            "gen_ai.usage.output_tokens": response.usage.completion_tokens,
            "llm.duration_ms": (time.time() - start_time) * 1000,
        })

        return response.choices[0].message.content

Agentチェーントレーシング

def trace_agent_execution(user_query: str, max_iterations: int = 5):
    with tracer.start_as_current_span(
        "agent.execution",
        kind=trace.SpanKind.INTERNAL,
        attributes={
            "agent.type": "react",
            "agent.max_iterations": max_iterations,
            "agent.query": sanitize_text(user_query),
        },
    ) as agent_span:
        messages = [{"role": "user", "content": user_query}]
        iteration = 0

        while iteration < max_iterations:
            iteration += 1

            with tracer.start_as_current_span(
                f"agent.iteration.{iteration}",
                kind=trace.SpanKind.INTERNAL,
            ) as iter_span:
                llm_response = trace_llm_call(
                    model="gpt-4o",
                    messages=messages,
                    temperature=0.0,
                )

                choice = llm_response.choices[0]

                if choice.finish_reason == "stop":
                    agent_span.set_attributes({
                        "agent.iterations": iteration,
                        "agent.final_answer_length": len(choice.message.content),
                    })
                    return choice.message.content

                if choice.message.tool_calls:
                    for tool_call in choice.message.tool_calls:
                        tool_result = trace_tool_call(tool_call, iter_span)
                        messages.append({
                            "role": "tool",
                            "tool_call_id": tool_call.id,
                            "content": str(tool_result),
                        })

                iter_span.set_attribute("agent.iteration.tools_called", len(
                    choice.message.tool_calls or []
                ))

                messages.append(choice.message.model_dump())

        agent_span.set_attribute("agent.status", "max_iterations_reached")
        return "Agent reached maximum iterations without a final answer."

マルチモデルチェーン可視化

Trace: rag-agent-pipeline-trace-abc123

┌─ rag.pipeline (1280ms)
│  │
│  ├─ gen_ai.client.embedding (45ms)
│  │    model=text-embedding-3-large
│  │    tokens=23
│  │
│  ├─ vector.search (32ms)
│  │    db=pgvector, results=10
│  │
│  ├─ gen_ai.client.rerank (180ms)
│  │    model=rerank-v3.5
│  │    documents=10 → top_n=3
│  │
│  ├─ agent.execution (1023ms)
│  │  │
│  │  ├─ agent.iteration.1 (520ms)
│  │  │    ├─ gen_ai.client.chat (480ms)
│  │  │    │    model=gpt-4o, tokens=1523/847
│  │  │    └─ tool.search_database (40ms)
│  │  │
│  │  ├─ agent.iteration.2 (503ms)
│  │  │    ├─ gen_ai.client.chat (460ms)
│  │  │    │    model=gpt-4o, tokens=2100/620
│  │  │    └─ tool.calculate (43ms)
│  │  │
│  │  └─ agent.iteration.3 (final, 0ms)
│  │       final_answer_length=342
│  │
│  └─ Total: 3 LLM calls, 2 tool calls, 4247 input tokens, 1467 output tokens

Pattern 5: エラー監視と異常検知

LLM専用エラー分類

from enum import Enum
from dataclasses import dataclass

class LLMErrorType(Enum):
    RATE_LIMIT = "rate_limit"
    CONTEXT_LENGTH_EXCEEDED = "context_length_exceeded"
    INVALID_API_KEY = "invalid_api_key"
    MODEL_NOT_FOUND = "model_not_found"
    TIMEOUT = "timeout"
    CONTENT_FILTER = "content_filter"
    HALLUCINATION = "hallucination"
    EMPTY_RESPONSE = "empty_response"
    BUDGET_EXCEEDED = "budget_exceeded"
    SERVICE_UNAVAILABLE = "service_unavailable"

def classify_llm_error(error: Exception) -> LLMErrorType:
    error_str = str(error).lower()
    if "rate_limit" in error_str or "429" in error_str:
        return LLMErrorType.RATE_LIMIT
    elif "context_length" in error_str or "maximum context" in error_str:
        return LLMErrorType.CONTEXT_LENGTH_EXCEEDED
    elif "invalid_api_key" in error_str or "401" in error_str:
        return LLMErrorType.INVALID_API_KEY
    elif "model_not_found" in error_str or "404" in error_str:
        return LLMErrorType.MODEL_NOT_FOUND
    elif "timeout" in error_str:
        return LLMErrorType.TIMEOUT
    elif "content_filter" in error_str or "content_policy" in error_str:
        return LLMErrorType.CONTENT_FILTER
    else:
        return LLMErrorType.SERVICE_UNAVAILABLE

エラートレーシングとリトライ

import time
from functools import wraps

error_counter = meter.create_counter(
    name="llm.errors",
    description="LLM call errors by type",
    unit="errors",
)

def llm_retry_with_tracing(max_retries: int = 3, base_delay: float = 1.0):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            for attempt in range(max_retries + 1):
                with tracer.start_as_current_span(
                    f"llm.call.attempt.{attempt}",
                    kind=trace.SpanKind.CLIENT,
                ) as attempt_span:
                    attempt_span.set_attributes({
                        "llm.retry.attempt": attempt,
                        "llm.retry.max_retries": max_retries,
                    })

                    try:
                        result = func(*args, **kwargs)
                        if attempt > 0:
                            attempt_span.set_attribute("llm.retry.succeeded", True)
                        return result

                    except Exception as e:
                        error_type = classify_llm_error(e)
                        is_retryable = error_type in {
                            LLMErrorType.RATE_LIMIT,
                            LLMErrorType.TIMEOUT,
                            LLMErrorType.SERVICE_UNAVAILABLE,
                        }

                        error_counter.add(1, {
                            "error.type": error_type.value,
                            "gen_ai.request.model": kwargs.get("model", "unknown"),
                        })

                        attempt_span.set_attributes({
                            "error.type": error_type.value,
                            "error.message": str(e)[:200],
                            "error.retryable": is_retryable,
                        })
                        attempt_span.record_exception(e)
                        attempt_span.set_status(
                            trace.Status(trace.StatusCode.ERROR, str(e))
                        )

                        if not is_retryable or attempt == max_retries:
                            raise

                        delay = base_delay * (2 ** attempt)
                        time.sleep(delay)
                        last_error = e

            raise last_error
        return wrapper
    return decorator

ハルシネーション検知Span

def trace_hallucination_check(
    query: str, answer: str, context: str, model: str = "gpt-4o-mini"
):
    with tracer.start_as_current_span(
        "llm.hallucination_check",
        kind=trace.SpanKind.INTERNAL,
        attributes={
            "llm.check.type": "hallucination",
            "llm.check.model": model,
        },
    ) as span:
        check_prompt = f"""以下の回答が与えられたコンテキストに基づいているか判断してください。コンテキストに存在しない情報が含まれている場合、ハルシネーションとしてフラグを立ててください。

コンテキスト：
{context}

回答：
{answer}

JSON形式で返してください：{{"is_hallucination": true/false, "confidence": 0.0-1.0, "reason": "..."}}"""

        start_time = time.time()
        response = call_openai_api(model, [
            {"role": "user", "content": check_prompt}
        ], temperature=0.0)

        import json
        result = json.loads(response.choices[0].message.content)

        span.set_attributes({
            "llm.hallucination.detected": result.get("is_hallucination", False),
            "llm.hallucination.confidence": result.get("confidence", 0.0),
            "llm.hallucination.reason": result.get("reason", "")[:200],
            "llm.duration_ms": (time.time() - start_time) * 1000,
        })

        return result

Pattern 6: インテリジェントアラートとSLA保証

SLAメトリクス定義

from dataclasses import dataclass

@dataclass
class LLMSLA:
    p50_latency_ms: float = 2000.0
    p95_latency_ms: float = 5000.0
    p99_latency_ms: float = 10000.0
    ttft_p95_ms: float = 1000.0
    error_rate_threshold: float = 0.01
    daily_budget_usd: float = 200.0
    hallucination_rate_threshold: float = 0.05
    min_success_rate: float = 0.99

sla = LLMSLA()

インテリジェントアラートルール

from collections import deque
import threading

class LLMAlertManager:
    def __init__(self, sla_config: LLMSLA):
        self.sla = sla_config
        self.recent_latencies = deque(maxlen=1000)
        self.recent_errors = deque(maxlen=1000)
        self.recent_ttft = deque(maxlen=1000)
        self.lock = threading.Lock()
        self.alert_callbacks = []

    def add_alert_callback(self, callback):
        self.alert_callbacks.append(callback)

    def record_latency(self, latency_ms: float, ttft_ms: float = 0, is_error: bool = False):
        with self.lock:
            self.recent_latencies.append(latency_ms)
            self.recent_ttft.append(ttft_ms)
            self.recent_errors.append(1 if is_error else 0)

        self._check_alerts(latency_ms, ttft_ms, is_error)

    def _check_alerts(self, latency_ms: float, ttft_ms: float, is_error: bool):
        alerts = []

        if latency_ms > self.sla.p99_latency_ms:
            alerts.append({
                "level": "CRITICAL",
                "type": "latency_p99_breach",
                "message": f"LLMレイテンシ {latency_ms:.0f}ms がP99 SLA {self.sla.p99_latency_ms:.0f}msを超過",
                "value": latency_ms,
                "threshold": self.sla.p99_latency_ms,
            })

        if ttft_ms > self.sla.ttft_p95_ms:
            alerts.append({
                "level": "WARNING",
                "type": "ttft_breach",
                "message": f"TTFT {ttft_ms:.0f}ms がP95 SLA {self.sla.ttft_p95_ms:.0f}msを超過",
                "value": ttft_ms,
                "threshold": self.sla.ttft_p95_ms,
            })

        with self.lock:
            if len(self.recent_errors) >= 100:
                error_rate = sum(self.recent_errors) / len(self.recent_errors)
                if error_rate > self.sla.error_rate_threshold:
                    alerts.append({
                        "level": "CRITICAL",
                        "type": "error_rate_breach",
                        "message": f"エラー率 {error_rate:.2%} が閾値 {self.sla.error_rate_threshold:.2%}を超過",
                        "value": error_rate,
                        "threshold": self.sla.error_rate_threshold,
                    })

        for alert in alerts:
            self._fire_alert(alert)

    def _fire_alert(self, alert: dict):
        with tracer.start_as_current_span(
            "llm.alert",
            kind=trace.SpanKind.INTERNAL,
            attributes={
                "alert.level": alert["level"],
                "alert.type": alert["type"],
                "alert.message": alert["message"],
            },
        ):
            for callback in self.alert_callbacks:
                try:
                    callback(alert)
                except Exception:
                    pass

alert_manager = LLMAlertManager(sla)

Prometheusアラートルール

groups:
  - name: llm_sla_alerts
    rules:
      - alert: LLMLatencyP99Breach
        expr: histogram_quantile(0.99, rate(llm_request_latency_bucket[5m])) > 10000
        for: 2m
        labels:
          severity: critical
          team: ai-platform
        annotations:
          summary: "LLM P99レイテンシがSLA閾値を超過"

      - alert: LLMErrorRateHigh
        expr: rate(llm_errors_total[5m]) / rate(llm_calls_total[5m]) > 0.01
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "LLMエラー率が1%を超過"

      - alert: LLMTokenBudgetApproaching
        expr: llm_cost_usd_total > 180
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM日次消費が予算上限に接近"

5つのよくある落とし穴と解決策

落とし穴1：Span属性が多すぎてサンプリングが失われる

コアメトリクスはSpan属性、長いテキストはSpan Event、1KBを超えるコンテンツは切り捨てまたはハッシュ化。

落とし穴2：トークンカウントが不正確

APIレスポンスのusageフィールドを常に使用し、文字数からトークンを推定しない。

落とし穴3：Context Propagationを無視

サービス間コールでpropagateを使用してTrace Contextを注入・抽出。

落とし穴4：ストリーミングレスポンスでTTFTを記録しない

ストリーミングレスポンスではTTFTを必ず記録 — ユーザー体験のコアメトリクス。

落とし穴5：サンプリング戦略が不適切

LLM Spanは100%サンプリング必須、非LLM Spanは比率ベースのサンプリングで可能。

10のよくあるエラートラブルシューティング

#	エラーメッセージ	原因	解決策
1	`StatusCode.UNAVAILABLE`	Collector未起動	`localhost:4317`の到達性を確認
2	`context_length_exceeded`	Promptがコンテキストウィンドウを超過	`tiktoken`でトークン数を事前計算
3	`rate_limit_exceeded` 429	API呼び出し頻度が制限超過	指数バックオフリトライを実装
4	`Span is not being recorded`	TracerProvider未設定	`set_tracer_provider()`の呼び出しを確認
5	`Baggage not propagated`	サービス間でコンテキスト未伝播	`propagate.inject()/extract()`を使用
6	`Metric reader timeout`	メトリクスエクスポートタイムアウト	`export_timeout_millis`を増加
7	`content_filter_triggered`	Promptがセーフティフィルターをトリガー	コンテンツ事前チェックを追加
8	`Empty response from LLM`	モデルが空のコンテンツを返却	`finish_reason`を確認、リトライロジックを追加
9	`Memory leak in span processor`	BatchSpanProcessorのバックログ	`max_queue_size`を調整
10	`Duplicate span attributes`	同じ属性が複数回設定	`set_attribute()`を`add_event()`の代わりに使用

高度な最適化のヒント

自動計装Instrumentation

from opentelemetry.instrumentation.openai import OpenAIInstrumentor

OpenAIInstrumentor().instrument()

from opentelemetry.instrumentation.requests import RequestsInstrumentor
RequestsInstrumentor().instrument()

カスタムPropagator

from opentelemetry.propagate import composite, global_set_text_map_propagator
from opentelemetry.trace.propagation.tracecontext import TraceContextPropagator
from opentelemetry.baggage.propagation import W3CBaggagePropagator

global_set_text_map_propagator(
    composite.CompositePropagator([
        TraceContextPropagator(),
        W3CBaggagePropagator(),
    ])
)

動的サンプリング戦略

from opentelemetry.sdk.trace.sampling import Sampler, SamplingResult, Decision

class AdaptiveLLMSampler(Sampler):
    def __init__(self, base_rate: float = 0.1):
        self.base_rate = base_rate

    def should_sample(self, parent_context, trace_id, name, kind,
                      attributes, links, trace_state):
        if attributes and attributes.get("gen_ai.system"):
            return SamplingResult(
                decision=Decision.RECORD_AND_SAMPLE,
                attributes=attributes,
                trace_state=trace_state,
            )

        should_sample = (hash(trace_id) % 10000) < (self.base_rate * 10000)
        return SamplingResult(
            decision=Decision.RECORD_AND_SAMPLE if should_sample else Decision.DROP,
            attributes=attributes,
            trace_state=trace_state,
        )

    def get_description(self):
        return f"AdaptiveLLMSampler(base={self.base_rate})"

比較分析：OpenTelemetry vs LangSmith vs Promptflow

次元	OpenTelemetry	LangSmith	Promptflow
オープンソース	はい（CNCFプロジェクト）	いいえ（SaaS）	はい（Microsoft）
ベンダーロックイン	なし	高い（LangChainエコシステム）	中（Azureエコシステム）
トレーシング	ネイティブ、標準プロトコル	独自Traceシステム	独自Traceシステム
トークン追跡	手動計装が必要	自動	自動
コスト監視	カスタムメトリクスが必要	内蔵	内蔵
多言語対応	11+言語	Python/JS	Python
既存APM統合	ネイティブ（Jaeger/Tempo/Grafana）	エクスポート必要	エクスポート必要
カスタマイズ性	非常に高い	中程度	中程度
学習曲線	急	緩やか	中程度
プロダクション対応	高い（CNCF卒業プロジェクト）	高い	中程度
データ主権	完全制御	LangSmithにデータ	Azureにデータ

選択ガイド：

既存のAPMインフラがある場合 → OpenTelemetry（統一オブザーバビリティスタック）
LangChainヘビーユーザー → LangSmith（すぐに使える）
Azureエコシステム → Promptflow（Azure AI Studioと深い統合）

オンラインツール推奨

JSONフォーマッター — OpenTelemetry Span JSONデータをフォーマット、属性の問題を素早く特定
Base64エンコーダー — Trace Context伝播のBase64データをエンコード/デコード
cURL to Code変換 — APIデバッグのcURLコマンドをPythonコードに変換

まとめ

Python OpenTelemetry LLMトレーシングは、プロダクションレベルのAIアプリケーションオブザーバビリティを構築するための基盤です。6つのプロダクションパターンは、基本的なSpan計装からインテリジェントアラートまでの完全なチェーンをカバーしています：

LLM Span計装 — gen_ai.*セマンティック規約で標準化された属性アノテーション
トークン追跡 — 使用量とコストをリアルタイム監視、予算超過を防止
構造化ログ — サニタイズされたPrompt/Response記録、デバッグとコンプライアンスの両立
マルチモデルチェーン — RAG、Agentなどの複雑なコールチェーンを追跡
エラー監視 — 分類された例外、自動リトライ、ハルシネーション検知
インテリジェントアラート — SLA保証、動的サンプリング、マルチレベルアラート

重要原則：LLM Spanは100%サンプリング必須、トークン使用量はAPIレスポンスから取得、Prompt内容はサニタイズ必須、TTFTはユーザー体験のコアメトリクス。

OpenTelemetry LLMオブザーバビリティ実践ガイド — LLMセマンティック規約とGrafanaダッシュボードの深掘り
Go OpenTelemetry分散トレーシング — Go言語のOTelトレーシングプラクティス
Python AIプロダクションデプロイガイド — LLMアプリケーションの開発から本番までの完全パス

参考リソース

OpenTelemetry LLM Semantic Conventions — 公式LLMセマンティック規約仕様
OpenTelemetry Python Documentation — Python SDKドキュメント