Go OpenTelemetry Distributed Tracing: 6 Key Steps from Zero Integration to Full Observability

Microservice Call Chains Have Become Black Boxes

Users report a slow checkout. You open the logs and find timestamps scattered across a dozen services: order service 3ms, inventory service 2ms, payment service... timed out? Or never called? You cannot tell which services the request traversed or where it stalled. Distributed tracing answers exactly that question, and OpenTelemetry (OTel) has become the de facto standard for it.

This article starts from scratch and walks through six steps: OTel SDK initialization with an OTLP exporter (for backends such as Jaeger or Tempo) → Trace/Span creation → inbound HTTP context extraction → outbound HTTP context injection → gRPC auto-instrumentation → Metrics setup. Together they turn microservice call chains from black boxes into transparent pipelines.


OpenTelemetry Core Concepts

| Concept | Description |
| --- | --- |
| Trace | A complete request trace chain composed of multiple Spans |
| Span | A single operation unit with name, duration, status, attributes |
| Context | Trace context containing TraceID/SpanID, propagated across processes |
| Propagator | Context propagator, injects and extracts Context in HTTP/gRPC headers |
| TracerProvider | Tracer factory, creates and manages Tracer instances |
| SpanProcessor | Span processor, handles batching, filtering, and exporting of Spans |
| Exporter | Sends Span data to Jaeger/Tempo/OTLP backends |
| Resource | Resource descriptor identifying the service producing telemetry data |

Trace Data Flow

Request Flow:
1. Entry service receives request, creates Root Span
2. When calling downstream, Propagator injects Context into HTTP/gRPC headers
3. Downstream service extracts Context from headers, creates Child Span
4. After Span completes, SpanProcessor batches it
5. Exporter sends Span data to Jaeger/Tempo
6. View complete call chain graph in UI
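
Steps 2 and 3 of the flow come down to one HTTP header. With the W3C Trace Context propagator, that header is `traceparent`: a version, a 32-hex-character trace ID, a 16-hex-character span ID, and a flags byte whose lowest bit means "sampled". The following is a minimal stdlib-only sketch of that wire format, not the SDK's implementation. The `traceContext` type and both function names are hypothetical, and the parser is simplified (it skips the spec's rejection of all-zero IDs and uppercase hex).

```go
package main

import (
	"encoding/hex"
	"errors"
	"strings"
)

// traceContext is a hypothetical, simplified stand-in for the SDK's span context.
type traceContext struct {
	TraceID [16]byte
	SpanID  [8]byte
	Sampled bool
}

// formatTraceparent renders what the upstream service injects:
// "00-<32 hex trace ID>-<16 hex span ID>-<2 hex flags>".
func formatTraceparent(tc traceContext) string {
	flags := "00"
	if tc.Sampled {
		flags = "01"
	}
	return "00-" + hex.EncodeToString(tc.TraceID[:]) + "-" +
		hex.EncodeToString(tc.SpanID[:]) + "-" + flags
}

// parseTraceparent is what the downstream service does on extraction.
func parseTraceparent(h string) (traceContext, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return traceContext{}, errors.New("unsupported traceparent format")
	}
	if len(parts[1]) != 32 || len(parts[2]) != 16 || len(parts[3]) != 2 {
		return traceContext{}, errors.New("malformed traceparent field length")
	}
	var tc traceContext
	if _, err := hex.Decode(tc.TraceID[:], []byte(parts[1])); err != nil {
		return traceContext{}, err
	}
	if _, err := hex.Decode(tc.SpanID[:], []byte(parts[2])); err != nil {
		return traceContext{}, err
	}
	flags, err := hex.DecodeString(parts[3])
	if err != nil {
		return traceContext{}, err
	}
	tc.Sampled = flags[0]&0x01 == 1 // lowest bit is the sampled flag
	return tc, nil
}
```

The downstream service reuses the parsed trace ID for its own spans and records the parsed span ID as their parent, which is how spans from separate processes end up in one call chain.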

Problem Analysis: 5 Major Distributed Tracing Challenges

  1. Complex SDK initialization: TracerProvider, SpanProcessor, Exporter, Resource configuration order and dependencies are easy to mix up
  2. Missing context propagation: Forgetting to propagate Context in cross-service calls causes broken chains
  3. Uncontrolled Span granularity: Too coarse hides bottlenecks, too fine generates massive data overwhelming backends
  4. Auto vs manual instrumentation conflicts: HTTP/gRPC auto-instrumentation and manual business Spans can duplicate or nest incorrectly
  5. Metrics-Traces disconnect: Metrics and Traces operate independently, unable to locate specific Traces from metrics

Step-by-Step: Complete OTel Integration

Step 1: Initialize TracerProvider

package telemetry

import (
	"context"
	"fmt"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
)

type Telemetry struct {
	provider *sdktrace.TracerProvider
}

func InitTelemetry(ctx context.Context, serviceName, serviceVersion, otlpEndpoint string) (*Telemetry, error) {
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint(otlpEndpoint),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, fmt.Errorf("create OTLP exporter: %w", err)
	}

	res, err := resource.New(ctx,
		resource.WithAttributes(
			semconv.ServiceNameKey.String(serviceName),
			semconv.ServiceVersionKey.String(serviceVersion),
		),
	)
	if err != nil {
		return nil, fmt.Errorf("create resource: %w", err)
	}

	bsp := sdktrace.NewBatchSpanProcessor(exporter,
		sdktrace.WithBatchTimeout(5*time.Second),
		sdktrace.WithMaxExportBatchSize(512),
		sdktrace.WithMaxQueueSize(2048),
	)

	provider := sdktrace.NewTracerProvider(
		sdktrace.WithResource(res),
		sdktrace.WithSpanProcessor(bsp),
		sdktrace.WithSampler(sdktrace.ParentBased(
			sdktrace.TraceIDRatioBased(0.5),
		)),
	)

	otel.SetTracerProvider(provider)
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{},
		propagation.Baggage{},
	))

	return &Telemetry{provider: provider}, nil
}

func (t *Telemetry) Shutdown(ctx context.Context) error {
	return t.provider.Shutdown(ctx)
}

Step 2: Create Traces and Spans

package service

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

var tracer = otel.Tracer("order-service")

func ProcessOrder(ctx context.Context, orderID string) error {
	ctx, span := tracer.Start(ctx, "ProcessOrder",
		trace.WithAttributes(
			attribute.String("order.id", orderID),
		),
		trace.WithSpanKind(trace.SpanKindInternal),
	)
	defer span.End()

	if err := validateOrder(ctx, orderID); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}

	if err := reserveInventory(ctx, orderID); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}

	span.SetStatus(codes.Ok, "")
	return nil
}

func validateOrder(ctx context.Context, orderID string) error {
	_, span := tracer.Start(ctx, "validateOrder", // returned ctx is unused: this function starts no child spans
		trace.WithAttributes(attribute.String("order.id", orderID)),
	)
	defer span.End()

	if orderID == "" {
		err := fmt.Errorf("order ID is empty")
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}

	span.AddEvent("validation_passed", trace.WithAttributes(
		attribute.String("order.id", orderID),
	))
	return nil
}

func reserveInventory(ctx context.Context, orderID string) error {
	_, span := tracer.Start(ctx, "reserveInventory") // returned ctx is unused: this function starts no child spans
	defer span.End()

	span.SetAttributes(attribute.String("order.id", orderID))
	return nil
}

Step 3: Inbound HTTP Context Propagation

package middleware

import (
	"net/http"

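	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/trace"
)

// HTTPMiddleware extracts the incoming trace context and wraps each request in a server span.
func HTTPMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Extract reads the trace headers; without them the span below becomes a new root.
		propagator := otel.GetTextMapPropagator()
		ctx := propagator.Extract(r.Context(), propagation.HeaderCarrier(r.Header))

		tracer := otel.Tracer("http-server")
		// Note: raw paths that contain IDs produce many distinct span names;
		// prefer a route template where the router exposes one.
		spanName := r.Method + " " + r.URL.Path
		ctx, span := tracer.Start(ctx, spanName,
			trace.WithSpanKind(trace.SpanKindServer),
			trace.WithAttributes(
				attribute.String("http.method", r.Method),
				attribute.String("http.url", r.URL.String()),
				attribute.String("http.host", r.Host),
			),
		)
		defer span.End()

		rw := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}
		next.ServeHTTP(rw, r.WithContext(ctx))

		span.SetAttributes(
			attribute.Int("http.status_code", rw.statusCode),
		)
		// Mark server-side failures so they stand out in the tracing UI.
		if rw.statusCode >= http.StatusInternalServerError {
			span.SetStatus(codes.Error, http.StatusText(rw.statusCode))
		}
	})
}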

type responseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
	rw.statusCode = code
	rw.ResponseWriter.WriteHeader(code)
}
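
The middleware above names each span after the raw URL path, so `/orders/1` and `/orders/2` become two different span names. That is one form of the uncontrolled-granularity problem listed earlier. The following is a rough stdlib-only sketch of one way to keep names low-cardinality when no route template is available. `looksLikeID` and `routeSpanName` are hypothetical helpers, and the "looks like an ID" heuristic (all digits, or 16+ hex/dash characters) is an assumption you would tune for your own URLs.

```go
package main

import "strings"

// looksLikeID reports whether a path segment is probably an identifier:
// all digits, or a long run (16+ chars) of hex digits and dashes such as a UUID.
func looksLikeID(seg string) bool {
	if seg == "" {
		return false
	}
	digits, hexish := true, true
	for _, r := range seg {
		isDigit := r >= '0' && r <= '9'
		isHex := isDigit || (r >= 'a' && r <= 'f') || (r >= 'A' && r <= 'F') || r == '-'
		digits = digits && isDigit
		hexish = hexish && isHex
	}
	return digits || (hexish && len(seg) >= 16)
}

// routeSpanName builds a span name with ID-like segments replaced by "{id}".
func routeSpanName(method, path string) string {
	segs := strings.Split(path, "/")
	for i, s := range segs {
		if looksLikeID(s) {
			segs[i] = "{id}"
		}
	}
	return method + " " + strings.Join(segs, "/")
}
```

For example, `routeSpanName("GET", "/orders/12345/items")` yields `GET /orders/{id}/items`. The concrete ID can still be recorded as a span attribute, where high cardinality is far less of a problem than in span names.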

Step 4: Outbound HTTP Context Propagation

package client

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/trace"
)

type InstrumentedClient struct {
	client *http.Client
}

func NewInstrumentedClient() *InstrumentedClient {
	return &InstrumentedClient{
		client: &http.Client{Timeout: 30 * time.Second},
	}
}

// Do starts a client span, injects the trace context into the request headers,
// and records the outcome on the span.
func (c *InstrumentedClient) Do(ctx context.Context, method, url string, body io.Reader) (*http.Response, error) {
	tracer := otel.Tracer("http-client")
	ctx, span := tracer.Start(ctx, method+" "+url,
		trace.WithSpanKind(trace.SpanKindClient),
		trace.WithAttributes(
			attribute.String("http.method", method),
			attribute.String("http.url", url),
		),
	)
	defer span.End()

	req, err := http.NewRequestWithContext(ctx, method, url, body)
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return nil, fmt.Errorf("create request: %w", err)
	}

	// Inject writes the trace headers so the downstream service can continue the trace.
	propagator := otel.GetTextMapPropagator()
	propagator.Inject(ctx, propagation.HeaderCarrier(req.Header))

	resp, err := c.client.Do(req)
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return nil, fmt.Errorf("execute request: %w", err)
	}

	span.SetAttributes(attribute.Int("http.status_code", resp.StatusCode))
	// RecordError alone does not change the span status; set it explicitly for failed calls.
	if resp.StatusCode >= http.StatusBadRequest {
		span.SetStatus(codes.Error, http.StatusText(resp.StatusCode))
	}
	return resp, nil
}

Step 5: gRPC Auto-Instrumentation

package main

import (
	"context"

	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

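// NewGRPCClient dials target with the otelgrpc stats handler, which creates client
// spans and injects the trace context into outgoing gRPC metadata.
// Note: grpc.DialContext is deprecated in recent grpc-go releases in favor of
// grpc.NewClient(target, opts...), which accepts these dial options.
func NewGRPCClient(ctx context.Context, target string) (*grpc.ClientConn, error) {
	conn, err := grpc.DialContext(ctx, target,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithStatsHandler(otelgrpc.NewClientHandler()),
	)
	if err != nil {
		return nil, err
	}
	return conn, nil
}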

func NewGRPCServer() *grpc.Server {
	server := grpc.NewServer(
		grpc.StatsHandler(otelgrpc.NewServerHandler()),
	)
	return server
}

Step 6: Metrics and Trace Correlation

package telemetry

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/resource"
	semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
)

type MetricsProvider struct {
	provider *sdkmetric.MeterProvider
}

func InitMetrics(ctx context.Context, serviceName, otlpEndpoint string) (*MetricsProvider, error) {
	exporter, err := otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithEndpoint(otlpEndpoint),
		otlpmetricgrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	res, err := resource.New(ctx,
		resource.WithAttributes(
			semconv.ServiceNameKey.String(serviceName),
		),
	)
	if err != nil {
		return nil, err
	}

	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithResource(res),
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exporter)),
	)

	otel.SetMeterProvider(provider)
	return &MetricsProvider{provider: provider}, nil
}

func (m *MetricsProvider) Shutdown(ctx context.Context) error {
	return m.provider.Shutdown(ctx)
}
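
The code above sets up metric export. The correlation itself usually comes from exemplars: a sample trace ID stored next to a metric data point, so that a latency spike on a dashboard can lead to one concrete trace. The following is a stdlib-only sketch of that idea, not the OTel SDK's API. `exemplar`, `latencyHistogram`, and its methods are hypothetical names, and whether and how the Go SDK records exemplars depends on the SDK version, so check its documentation.

```go
package main

import "sort"

// exemplar is a hypothetical record linking one measurement to the trace that produced it.
type exemplar struct {
	Value   float64
	TraceID string
}

// latencyHistogram counts observations per bucket and keeps the most recent
// exemplar for each bucket. A sketch only: not safe for concurrent use.
type latencyHistogram struct {
	bounds    []float64 // bucket upper bounds in ms, ascending
	counts    []uint64  // len(bounds)+1; the last entry is the overflow bucket
	exemplars []exemplar
}

func newLatencyHistogram(bounds ...float64) *latencyHistogram {
	return &latencyHistogram{
		bounds:    bounds,
		counts:    make([]uint64, len(bounds)+1),
		exemplars: make([]exemplar, len(bounds)+1),
	}
}

// record adds a measurement; traceID is empty when the request was not sampled.
func (h *latencyHistogram) record(ms float64, traceID string) {
	i := sort.SearchFloat64s(h.bounds, ms) // first bucket whose bound is >= ms
	h.counts[i]++
	if traceID != "" {
		h.exemplars[i] = exemplar{Value: ms, TraceID: traceID}
	}
}

// slowestExemplar returns the exemplar from the highest bucket that has one.
func (h *latencyHistogram) slowestExemplar() (exemplar, bool) {
	for i := len(h.counts) - 1; i >= 0; i-- {
		if h.counts[i] > 0 && h.exemplars[i].TraceID != "" {
			return h.exemplars[i], true
		}
	}
	return exemplar{}, false
}
```

Only sampled requests can contribute an exemplar, which is why the sampling decision and the metric recording need to see the same context.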

Pitfall Guide

Pitfall 1: Forgetting to Set Global Propagator

// ❌ Wrong: no Propagator set, Context can't propagate across processes
provider := sdktrace.NewTracerProvider(...)
otel.SetTracerProvider(provider)
// Missing otel.SetTextMapPropagator(...)

// ✅ Correct: set Composite Propagator
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
    propagation.TraceContext{},
    propagation.Baggage{},
))

Pitfall 2: Not Calling Span.End()

// ❌ Wrong: Span never ends, never exported
ctx, span := tracer.Start(ctx, "operation")
doWork(ctx)
// Forgot span.End()

// ✅ Correct: use defer to ensure Span ends
ctx, span := tracer.Start(ctx, "operation")
defer span.End()
doWork(ctx)

Pitfall 3: Improper Sampling Rate

// ❌ Wrong: AlwaysSample in production causes massive data
sdktrace.WithSampler(sdktrace.AlwaysSample())

// ✅ Correct: use ParentBased + TraceIDRatioBased
sdktrace.WithSampler(sdktrace.ParentBased(
    sdktrace.TraceIDRatioBased(0.1), // 10% sampling
))
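
Why this combination keeps traces whole: a ratio sampler derives its decision from the trace ID itself, so the same trace gets the same answer everywhere, and the parent-based wrapper makes child spans follow whatever the caller already decided. The following is a simplified stdlib-only sketch of both ideas, not the SDK's code. `sampleByTraceID` and `parentBased` are hypothetical names, and the exact bytes and threshold used here are an assumption for illustration.

```go
package main

import "encoding/binary"

// sampleByTraceID keeps a trace when a 63-bit number derived from its ID falls
// below ratio * 2^63. The same trace ID always yields the same decision.
func sampleByTraceID(traceID [16]byte, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	threshold := uint64(ratio * float64(1<<63))
	x := binary.BigEndian.Uint64(traceID[8:]) >> 1 // low 8 bytes, reduced to 63 bits
	return x < threshold
}

// parentBased follows the parent's decision when there is one and only
// consults the ratio for root spans.
func parentBased(hasParent, parentSampled bool, traceID [16]byte, ratio float64) bool {
	if hasParent {
		return parentSampled
	}
	return sampleByTraceID(traceID, ratio)
}
```

Because the root span's decision travels downstream in the sampled flag, a 10% ratio yields roughly 10% of complete traces rather than a random 10% of spans from every trace.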

Pitfall 4: Losing Context in Goroutines

// ❌ Wrong: no ctx passed to goroutine
go func() {
    ctx, span := tracer.Start(context.Background(), "async_work")
    defer span.End()
}()

// ✅ Correct: pass parent Context to goroutine
go func(ctx context.Context) {
    ctx, span := tracer.Start(ctx, "async_work")
    defer span.End()
}(ctx)
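
One caveat with passing the request context: it is cancelled as soon as the handler returns, so background work that outlives the request can fail with `context canceled` even though the trace linkage is correct. Since Go 1.21, `context.WithoutCancel` returns a context that keeps the parent's values (which is where the active span lives) but drops its cancellation. The following is a stdlib-only sketch. `traceIDKey`, `detachForBackground`, and `simulateRequest` are hypothetical, and the context value stands in for the span that the OTel API would store in the context.

```go
package main

import (
	"context"
	"time"
)

// traceIDKey is a hypothetical key standing in for the span stored in the context.
type traceIDKey struct{}

// detachForBackground returns a context that still carries the request's values
// but is no longer cancelled with the request; it gets its own timeout instead.
func detachForBackground(reqCtx context.Context, timeout time.Duration) (context.Context, context.CancelFunc) {
	return context.WithTimeout(context.WithoutCancel(reqCtx), timeout)
}

// simulateRequest cancels the request context, as happens when a handler returns,
// and reports what the background context still sees afterwards.
func simulateRequest(traceID string) (seen string, errAfterCancel error) {
	base := context.WithValue(context.Background(), traceIDKey{}, traceID)
	reqCtx, cancelReq := context.WithCancel(base)
	bgCtx, cancelBg := detachForBackground(reqCtx, time.Minute)
	defer cancelBg()

	cancelReq() // the handler has returned
	seen, _ = bgCtx.Value(traceIDKey{}).(string)
	return seen, bgCtx.Err()
}
```

In the goroutine, a span started from the detached context would still be a child of the request's span, while its downstream calls are bounded by the new timeout rather than by the request's lifetime.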

Pitfall 5: Shutdown Timeout Causing Data Loss

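// ❌ Wrong: init error ignored, and Shutdown has no deadline, so it can hang on an
// unreachable Collector until the process is killed and the queued spans are lost
func main() {
    ctx := context.Background()
    tel, _ := telemetry.InitTelemetry(ctx, "svc", "1.0", "localhost:4317")
    defer tel.Shutdown(context.Background())
}

// ✅ Correct: bound Shutdown with a timeout long enough to flush the batch queue
func main() {
    ctx := context.Background()
    tel, err := telemetry.InitTelemetry(ctx, "svc", "1.0", "localhost:4317")
    if err != nil {
        log.Fatalf("init telemetry: %v", err)
    }
    defer func() {
        shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()
        if err := tel.Shutdown(shutdownCtx); err != nil {
            log.Printf("telemetry shutdown: %v", err)
        }
    }()
}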

Error Troubleshooting

| # | Error Message | Cause | Solution |
| --- | --- | --- | --- |
| 1 | connection refused: localhost:4317 | OTLP Collector not running | Start otel-collector container, check port mapping |
| 2 | traces not showing in Jaeger | Exporter misconfigured or sampling rate 0 | Check Exporter target, confirm sampling rate > 0 |
| 3 | context deadline exceeded | Collector slow or network unreachable | Increase timeout, check network connectivity |
| 4 | span missing parent | Context propagation failed | Confirm Propagator is set, check HTTP header injection |
| 5 | resource attributes missing | Resource not configured | Add resource.WithAttributes(semconv.ServiceNameKey.String(...)) |
| 6 | too many open files | Span queue backlog, Exporter blocked | Reduce MaxQueueSize, increase BatchTimeout |
| 7 | trace_id not found in baggage | Baggage and TraceContext confused | TraceContext propagates TraceID, Baggage propagates business data |
| 8 | grpc: no transport security set | No transport credentials configured on the gRPC connection | Use insecure credentials (WithInsecure) in dev, configure TLS in production |
| 9 | duplicate span name | Multiple Spans with same name | Add distinguishing attributes or use dynamic names |
| 10 | metric reader timeout | Metric export timeout | Increase Interval and Timeout in PeriodicReader |
Advanced Optimization

1. Custom SpanProcessor for Sensitive Data Filtering

package telemetry

import (
	"context"
	"strings"

	"go.opentelemetry.io/otel/attribute"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

type sanitizingProcessor struct {
	next          sdktrace.SpanProcessor
	sensitiveKeys []string
}

func NewSanitizingProcessor(next sdktrace.SpanProcessor, sensitiveKeys []string) sdktrace.SpanProcessor {
	return &sanitizingProcessor{next: next, sensitiveKeys: sensitiveKeys}
}

func (p *sanitizingProcessor) OnStart(ctx context.Context, s sdktrace.ReadWriteSpan) {
	p.next.OnStart(ctx, s)
}

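// redactedSpan wraps a finished span and overrides only its attributes.
// ReadOnlySpan is immutable, so redaction has to hand a wrapped view downstream.
type redactedSpan struct {
	sdktrace.ReadOnlySpan
	attrs []attribute.KeyValue
}

func (r redactedSpan) Attributes() []attribute.KeyValue { return r.attrs }

// OnEnd redacts span attributes only; event and link attributes are not covered.
func (p *sanitizingProcessor) OnEnd(s sdktrace.ReadOnlySpan) {
	attrs := s.Attributes()
	filtered := make([]attribute.KeyValue, 0, len(attrs))
	for _, attr := range attrs {
		if p.isSensitive(string(attr.Key)) {
			filtered = append(filtered, attribute.String(string(attr.Key), "[REDACTED]"))
		} else {
			filtered = append(filtered, attr)
		}
	}
	// Forward the redacted view; passing s itself would export the original values.
	p.next.OnEnd(redactedSpan{ReadOnlySpan: s, attrs: filtered})
}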

func (p *sanitizingProcessor) isSensitive(key string) bool {
	for _, sk := range p.sensitiveKeys {
		if strings.Contains(strings.ToLower(key), strings.ToLower(sk)) {
			return true
		}
	}
	return false
}

func (p *sanitizingProcessor) ForceFlush(ctx context.Context) error {
	return p.next.ForceFlush(ctx)
}

func (p *sanitizingProcessor) Shutdown(ctx context.Context) error {
	return p.next.Shutdown(ctx)
}

2. Error-Rate-Based Dynamic Sampling

package telemetry

import (
	"sync/atomic"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

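// errorAwareSampler raises the sampling ratio while the observed error rate is high.
// Limitation: ShouldSample runs when a span starts, before the operation has
// succeeded or failed, so an error is only visible here if the caller passes an
// "error" attribute at span creation. Reliably keeping failed requests needs
// tail-based sampling, for example in the Collector.
type errorAwareSampler struct {
	errorCount atomic.Int64
	totalCount atomic.Int64
	base       sdktrace.Sampler // used while the error rate is at or below 1%
	boosted    sdktrace.Sampler // used once the error rate exceeds 1%
}

func NewErrorAwareSampler(baseRatio, errorRatio float64) sdktrace.Sampler {
	return &errorAwareSampler{
		base:    sdktrace.TraceIDRatioBased(baseRatio),
		boosted: sdktrace.TraceIDRatioBased(errorRatio),
	}
}

func (s *errorAwareSampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
	total := s.totalCount.Add(1)

	for _, attr := range p.Attributes {
		if attr.Key == "error" {
			s.errorCount.Add(1)
			break
		}
	}

	// The counters are cumulative over the process lifetime and never reset.
	errorRate := float64(s.errorCount.Load()) / float64(total)
	if errorRate > 0.01 {
		return s.boosted.ShouldSample(p)
	}
	return s.base.ShouldSample(p)
}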

func (s *errorAwareSampler) Description() string {
	return "ErrorAwareSampler"
}
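
The counters in the sampler above are cumulative: after a long stretch of healthy traffic, a fresh burst of errors barely moves the ratio, and an old burst keeps it elevated. One way to track a recent error rate instead is a fixed window that resets. The following is a stdlib-only sketch under that assumption. `windowedErrorRate` and `exampleRates` are hypothetical names, and the clock is passed in explicitly so the behavior is easy to follow.

```go
package main

import (
	"sync"
	"time"
)

// windowedErrorRate tracks the error rate within a fixed window and starts over
// once the window has elapsed.
type windowedErrorRate struct {
	mu      sync.Mutex
	window  time.Duration
	started time.Time
	errors  int64
	total   int64
}

func newWindowedErrorRate(window time.Duration, now time.Time) *windowedErrorRate {
	return &windowedErrorRate{window: window, started: now}
}

// observe records one finished request.
func (w *windowedErrorRate) observe(now time.Time, failed bool) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.rollLocked(now)
	w.total++
	if failed {
		w.errors++
	}
}

// rate returns the error rate of the current window, or 0 if it is empty.
func (w *windowedErrorRate) rate(now time.Time) float64 {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.rollLocked(now)
	if w.total == 0 {
		return 0
	}
	return float64(w.errors) / float64(w.total)
}

// rollLocked resets the counters when the window has elapsed; callers hold mu.
func (w *windowedErrorRate) rollLocked(now time.Time) {
	if now.Sub(w.started) >= w.window {
		w.started, w.errors, w.total = now, 0, 0
	}
}

// exampleRates feeds 10 requests (2 failed) into a one-minute window and reads
// the rate inside the window and again after it has expired.
func exampleRates() (during, after float64) {
	t0 := time.Unix(0, 0)
	w := newWindowedErrorRate(time.Minute, t0)
	for i := 0; i < 10; i++ {
		w.observe(t0.Add(time.Duration(i)*time.Second), i < 2)
	}
	return w.rate(t0.Add(30 * time.Second)), w.rate(t0.Add(2 * time.Minute))
}
```

The outcome has to be fed in when a request finishes (for instance from a span processor's OnEnd), since the sampler itself runs before the outcome is known.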

3. Trace-Log Correlation

package telemetry

import (
	"context"
	"log/slog"

	"go.opentelemetry.io/otel/trace"
)

type traceHandler struct {
	next slog.Handler
}

func NewTraceHandler(next slog.Handler) slog.Handler {
	return &traceHandler{next: next}
}

func (h *traceHandler) Handle(ctx context.Context, r slog.Record) error {
	spanCtx := trace.SpanContextFromContext(ctx)
	if spanCtx.IsValid() {
		r.AddAttrs(
			slog.String("trace_id", spanCtx.TraceID().String()),
			slog.String("span_id", spanCtx.SpanID().String()),
		)
	}
	return h.next.Handle(ctx, r)
}

func (h *traceHandler) Enabled(ctx context.Context, level slog.Level) bool {
	return h.next.Enabled(ctx, level)
}

func (h *traceHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return &traceHandler{next: h.next.WithAttrs(attrs)}
}

func (h *traceHandler) WithGroup(name string) slog.Handler {
	return &traceHandler{next: h.next.WithGroup(name)}
}
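
A handler like this only sees a span if the log call passes the context: `logger.InfoContext(ctx, ...)` does, while `logger.Info(...)` hands the handler a background context and the IDs silently disappear. The following stdlib-only sketch shows the difference. `requestIDKey`, `ctxHandler`, and `logOnce` are hypothetical, and a plain context value stands in for the OTel span context.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"log/slog"
)

// requestIDKey is a hypothetical stand-in for the span context stored in ctx.
type requestIDKey struct{}

// ctxHandler copies an ID from the context onto every record.
type ctxHandler struct{ slog.Handler }

func (h ctxHandler) Handle(ctx context.Context, r slog.Record) error {
	if id, ok := ctx.Value(requestIDKey{}).(string); ok {
		r.AddAttrs(slog.String("trace_id", id))
	}
	return h.Handler.Handle(ctx, r)
}

// WithAttrs and WithGroup re-wrap the inner handler so logger.With keeps the enrichment.
func (h ctxHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return ctxHandler{h.Handler.WithAttrs(attrs)}
}

func (h ctxHandler) WithGroup(name string) slog.Handler {
	return ctxHandler{h.Handler.WithGroup(name)}
}

// logOnce emits one JSON record and returns its decoded fields.
func logOnce(traceID string, passContext bool) map[string]any {
	var buf bytes.Buffer
	logger := slog.New(ctxHandler{slog.NewJSONHandler(&buf, nil)})
	ctx := context.WithValue(context.Background(), requestIDKey{}, traceID)

	if passContext {
		logger.InfoContext(ctx, "payment declined")
	} else {
		logger.Info("payment declined") // the handler receives context.Background()
	}

	fields := map[string]any{}
	_ = json.Unmarshal(buf.Bytes(), &fields)
	return fields
}
```

With `passContext` set, the decoded record contains a `trace_id` field; without it, the field is absent even though the handler is installed.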

Comparison Analysis

| Dimension | OpenTelemetry | Jaeger Client | Zipkin Brave | SkyWalking | Datadog APM |
| --- | --- | --- | --- | --- | --- |
| Vendor-neutral | ✅ CNCF standard | ⚠️ Jaeger only | ⚠️ Zipkin only | ❌ Apache but closed ecosystem | ❌ Commercial |
| Multi-language | ✅ 11+ languages | ⚠️ 6 languages | ⚠️ Java-centric | ⚠️ 8 languages | ✅ 10+ |
| Metrics integration | ✅ Native | ❌ Needs Prometheus | | | |
| Auto-instrumentation | ✅ HTTP/gRPC | ⚠️ Limited | | | |
| Sampling strategies | ✅ Flexible | ⚠️ Simple | ⚠️ Simple | ❌ Fixed | |
| Community activity | ⭐ Very high | ⭐ High | ⭐ Medium | ⭐ High | ⭐ Commercial |
| Cost | Free | Free | Free | Free | $31/mo+ |

Summary: OpenTelemetry isn't just another APM tool; it's the infrastructure layer for observability. Its core value is to instrument once, export to multiple backends, and unify Traces, Metrics, and Logs. The 2026 best practice: instrument with the OTel SDK → send over OTLP to the Collector → let the Collector route to Jaeger (Traces), Prometheus (Metrics), and Loki (Logs). The key is to configure the Propagator and the sampling strategy at SDK initialization, which avoids broken chains and data floods later.

