Go OpenTelemetry Distributed Tracing: 6 Key Steps from Zero Integration to Full Observability
Microservice Call Chains Have Become Black Boxes
Users report a "slow checkout". You open the logs and find timestamps scattered across a dozen services: order service 3ms, inventory service 2ms, payment service... timed out? Or never called? Logs alone cannot tell you which services a request traversed or where it stalled. Distributed tracing answers exactly that question, and OpenTelemetry (OTel) has become the de facto standard for it.
This article starts from scratch and walks you through 6 key steps: OTel SDK init → Trace/Span creation → inbound HTTP context propagation → outbound HTTP context propagation → gRPC auto-instrumentation → Metrics setup and correlation. Together they turn microservice call chains from black boxes into transparent pipelines.
OpenTelemetry Core Concepts
| Concept | Description |
|---|---|
| Trace | A complete request trace chain composed of multiple Spans |
| Span | A single operation unit with name, duration, status, attributes |
| Context | Trace context containing TraceID/SpanID, propagated across processes |
| Propagator | Context propagator, injects and extracts Context in HTTP/gRPC headers |
| TracerProvider | Tracer factory, creates and manages Tracer instances |
| SpanProcessor | Span processor, handles batching, filtering, and exporting of Spans |
| Exporter | Sends Span data to Jaeger/Tempo/OTLP backends |
| Resource | Resource descriptor identifying the service producing telemetry data |
Trace Data Flow
Request Flow:
1. Entry service receives request, creates Root Span
2. When calling downstream, Propagator injects Context into HTTP/gRPC headers
3. Downstream service extracts Context from headers, creates Child Span
4. After Span completes, SpanProcessor batches it
5. Exporter sends Span data to Jaeger/Tempo
6. View complete call chain graph in UI
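Steps 2 and 3 of the flow above come down to writing and reading one header. With the W3C Trace Context propagator, that header is `traceparent`, laid out as `version-traceid-spanid-flags`. The sketch below is a standard-library model of that wire format for version 00 only; the `TraceParent` type and both helper functions are illustrative names, not part of the OTel API, and the real propagator performs more validation.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// TraceParent models the four fields of a version-00 W3C traceparent header.
// Illustrative type, not an OpenTelemetry API.
type TraceParent struct {
	Version string // 2 hex chars, "00"
	TraceID string // 32 hex chars, shared by every span in the trace
	SpanID  string // 16 hex chars, the caller's span (the child's parent)
	Flags   string // 2 hex chars, "01" means the trace is sampled
}

// Format renders the header value the calling service injects.
func (tp TraceParent) Format() string {
	return tp.Version + "-" + tp.TraceID + "-" + tp.SpanID + "-" + tp.Flags
}

// isLowerHex reports whether s is exactly n lowercase hex characters.
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if !(c >= '0' && c <= '9' || c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// ParseTraceParent is the extraction side run by the downstream service.
func ParseTraceParent(v string) (TraceParent, error) {
	parts := strings.Split(v, "-")
	if len(parts) != 4 {
		return TraceParent{}, errors.New("traceparent: want 4 dash-separated fields")
	}
	tp := TraceParent{Version: parts[0], TraceID: parts[1], SpanID: parts[2], Flags: parts[3]}
	if !isLowerHex(tp.Version, 2) || !isLowerHex(tp.TraceID, 32) ||
		!isLowerHex(tp.SpanID, 16) || !isLowerHex(tp.Flags, 2) {
		return TraceParent{}, errors.New("traceparent: malformed field")
	}
	return tp, nil
}

func main() {
	in := TraceParent{
		Version: "00",
		TraceID: "4bf92f3577b34da6a3ce929d0e0e4736",
		SpanID:  "00f067aa0ba902b7",
		Flags:   "01",
	}
	header := in.Format()
	out, err := ParseTraceParent(header)
	fmt.Println(header)
	fmt.Println(out == in, err)
}
```

The downstream service keeps the TraceID, uses the received SpanID as the parent of its own new span, and generates a fresh SpanID for itself. That is what stitches the two services into one trace.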
Problem Analysis: 5 Major Distributed Tracing Challenges
- Complex SDK initialization: TracerProvider, SpanProcessor, Exporter, Resource configuration order and dependencies are easy to mix up
- Missing context propagation: Forgetting to propagate Context in cross-service calls causes broken chains
- Uncontrolled Span granularity: Too coarse hides bottlenecks, too fine generates massive data overwhelming backends
- Auto vs manual instrumentation conflicts: HTTP/gRPC auto-instrumentation and manual business Spans can duplicate or nest incorrectly
- Metrics-Traces disconnect: Metrics and Traces operate independently, unable to locate specific Traces from metrics
Step-by-Step: Complete OTel Integration
Step 1: Initialize TracerProvider
package telemetry
import (
"context"
"fmt"
"time"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
"go.opentelemetry.io/otel/propagation"
"go.opentelemetry.io/otel/sdk/resource"
sdktrace "go.opentelemetry.io/otel/sdk/trace"
semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
)
type Telemetry struct {
provider *sdktrace.TracerProvider
}
func InitTelemetry(ctx context.Context, serviceName, serviceVersion, otlpEndpoint string) (*Telemetry, error) {
exporter, err := otlptracegrpc.New(ctx,
otlptracegrpc.WithEndpoint(otlpEndpoint),
otlptracegrpc.WithInsecure(),
)
if err != nil {
return nil, fmt.Errorf("create OTLP exporter: %w", err)
}
res, err := resource.New(ctx,
resource.WithAttributes(
semconv.ServiceNameKey.String(serviceName),
semconv.ServiceVersionKey.String(serviceVersion),
),
)
if err != nil {
return nil, fmt.Errorf("create resource: %w", err)
}
bsp := sdktrace.NewBatchSpanProcessor(exporter,
sdktrace.WithBatchTimeout(5*time.Second),
sdktrace.WithMaxExportBatchSize(512),
sdktrace.WithMaxQueueSize(2048),
)
provider := sdktrace.NewTracerProvider(
sdktrace.WithResource(res),
sdktrace.WithSpanProcessor(bsp),
sdktrace.WithSampler(sdktrace.ParentBased(
sdktrace.TraceIDRatioBased(0.5),
)),
)
otel.SetTracerProvider(provider)
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
propagation.TraceContext{},
propagation.Baggage{},
))
return &Telemetry{provider: provider}, nil
}
func (t *Telemetry) Shutdown(ctx context.Context) error {
return t.provider.Shutdown(ctx)
}
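The three batch options above interact: finished spans wait in a bounded queue, are exported in batches, and are dropped once the queue is full, so that tracing never blocks request handling. The following is a toy, single-goroutine model of that behaviour using only the standard library. `spanQueue` and its methods are hypothetical names for illustration, and the SDK's real processor is concurrent and timer-driven.

```go
package main

import "fmt"

// spanQueue is a toy stand-in for the batch processor's buffer.
// It is not safe for concurrent use; the real processor is.
type spanQueue struct {
	buf     chan string // finished span names waiting for export
	dropped int         // spans discarded because the buffer was full
}

// newSpanQueue plays the role of WithMaxQueueSize.
func newSpanQueue(maxQueueSize int) *spanQueue {
	return &spanQueue{buf: make(chan string, maxQueueSize)}
}

// Enqueue never blocks the caller: when the buffer is full the span is dropped.
func (q *spanQueue) Enqueue(name string) bool {
	select {
	case q.buf <- name:
		return true
	default:
		q.dropped++
		return false
	}
}

// Drain empties the buffer into batches of at most maxBatch spans,
// playing the role of WithMaxExportBatchSize. maxBatch must be positive.
func (q *spanQueue) Drain(maxBatch int) [][]string {
	var batches [][]string
	var cur []string
	for {
		select {
		case s := <-q.buf:
			cur = append(cur, s)
			if len(cur) == maxBatch {
				batches = append(batches, cur)
				cur = nil
			}
		default:
			if len(cur) > 0 {
				batches = append(batches, cur)
			}
			return batches
		}
	}
}

func main() {
	q := newSpanQueue(3)
	for i := 0; i < 5; i++ {
		q.Enqueue(fmt.Sprintf("span-%d", i))
	}
	fmt.Println("batches:", q.Drain(2))
	fmt.Println("dropped:", q.dropped)
}
```

If spans go missing under load, a queue that is too small for the export interval is one of the first things worth checking.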
Step 2: Create Traces and Spans
package service
import (
"context"
"fmt"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/codes"
"go.opentelemetry.io/otel/trace"
)
var tracer = otel.Tracer("order-service")
func ProcessOrder(ctx context.Context, orderID string) error {
ctx, span := tracer.Start(ctx, "ProcessOrder",
trace.WithAttributes(
attribute.String("order.id", orderID),
),
trace.WithSpanKind(trace.SpanKindInternal),
)
defer span.End()
if err := validateOrder(ctx, orderID); err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, err.Error())
return err
}
if err := reserveInventory(ctx, orderID); err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, err.Error())
return err
}
span.SetStatus(codes.Ok, "")
return nil
}
func validateOrder(ctx context.Context, orderID string) error {
ctx, span := tracer.Start(ctx, "validateOrder",
trace.WithAttributes(attribute.String("order.id", orderID)),
)
defer span.End()
if orderID == "" {
err := fmt.Errorf("order ID is empty")
span.RecordError(err)
span.SetStatus(codes.Error, err.Error())
return err
}
span.AddEvent("validation_passed", trace.WithAttributes(
attribute.String("order.id", orderID),
))
return nil
}
func reserveInventory(ctx context.Context, orderID string) error {
ctx, span := tracer.Start(ctx, "reserveInventory")
defer span.End()
span.SetAttributes(attribute.String("order.id", orderID))
return nil
}
Step 3: Inbound HTTP Context Propagation
package middleware
import (
"net/http"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/propagation"
"go.opentelemetry.io/otel/trace"
)
func HTTPMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
propagator := otel.GetTextMapPropagator()
ctx := propagator.Extract(r.Context(), propagation.HeaderCarrier(r.Header))
tracer := otel.Tracer("http-server")
spanName := r.Method + " " + r.URL.Path
ctx, span := tracer.Start(ctx, spanName,
trace.WithSpanKind(trace.SpanKindServer),
trace.WithAttributes(
attribute.String("http.method", r.Method),
attribute.String("http.url", r.URL.String()),
attribute.String("http.host", r.Host),
),
)
defer span.End()
rw := &responseWriter{ResponseWriter: w, statusCode: 200}
next.ServeHTTP(rw, r.WithContext(ctx))
span.SetAttributes(
attribute.Int("http.status_code", rw.statusCode),
)
})
}
type responseWriter struct {
http.ResponseWriter
statusCode int
}
func (rw *responseWriter) WriteHeader(code int) {
rw.statusCode = code
rw.ResponseWriter.WriteHeader(code)
}
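One caveat with the middleware above: it uses the raw `r.URL.Path` in the span name, so `/orders/1` and `/orders/2` become different span names, and backends that group by name end up with one entry per ID. OTel's semantic conventions recommend low-cardinality span names such as method plus route template. When your router exposes the matched route pattern, prefer that. Otherwise a heuristic like the sketch below can collapse ID-like segments. `spanNameFor` and `isIDLike` are hypothetical helpers, and the "all digits or 16+ hex characters" rule is an assumption you would tune for your own URLs.

```go
package main

import (
	"fmt"
	"strings"
)

// isIDLike reports whether a path segment looks like an identifier:
// all digits, or at least 16 characters of hex digits and dashes (UUID style).
func isIDLike(seg string) bool {
	if seg == "" {
		return false
	}
	allDigits := true
	hexish := len(seg) >= 16
	for _, c := range seg {
		if c < '0' || c > '9' {
			allDigits = false
		}
		isHex := c >= '0' && c <= '9' || c >= 'a' && c <= 'f' || c >= 'A' && c <= 'F' || c == '-'
		if !isHex {
			hexish = false
		}
	}
	return allDigits || hexish
}

// spanNameFor builds a low-cardinality span name from method and path
// by replacing ID-like segments with a placeholder.
func spanNameFor(method, path string) string {
	segs := strings.Split(path, "/")
	for i, s := range segs {
		if isIDLike(s) {
			segs[i] = "{id}"
		}
	}
	return method + " " + strings.Join(segs, "/")
}

func main() {
	fmt.Println(spanNameFor("GET", "/orders/12345/items"))
	fmt.Println(spanNameFor("GET", "/users/550e8400-e29b-41d4-a716-446655440000"))
	fmt.Println(spanNameFor("GET", "/health"))
}
```

The concrete ID is not lost: it belongs in a span attribute (as `order.id` is in Step 2), where it stays searchable without multiplying span names.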
Step 4: Outbound HTTP Context Propagation
package client
import (
"context"
"fmt"
"io"
"net/http"
"time"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/propagation"
"go.opentelemetry.io/otel/trace"
)
type InstrumentedClient struct {
client *http.Client
}
func NewInstrumentedClient() *InstrumentedClient {
return &InstrumentedClient{
client: &http.Client{Timeout: 30 * time.Second},
}
}
func (c *InstrumentedClient) Do(ctx context.Context, method, url string, body io.Reader) (*http.Response, error) {
tracer := otel.Tracer("http-client")
ctx, span := tracer.Start(ctx, method+" "+url,
trace.WithSpanKind(trace.SpanKindClient),
trace.WithAttributes(
attribute.String("http.method", method),
attribute.String("http.url", url),
),
)
defer span.End()
req, err := http.NewRequestWithContext(ctx, method, url, body)
if err != nil {
span.RecordError(err)
return nil, fmt.Errorf("create request: %w", err)
}
propagator := otel.GetTextMapPropagator()
propagator.Inject(ctx, propagation.HeaderCarrier(req.Header))
resp, err := c.client.Do(req)
if err != nil {
span.RecordError(err)
return nil, fmt.Errorf("execute request: %w", err)
}
span.SetAttributes(attribute.Int("http.status_code", resp.StatusCode))
return resp, nil
}
Step 5: gRPC Auto-Instrumentation
package main
import (
"context"
"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
)
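// Note: recent grpc-go releases deprecate grpc.DialContext in favor of
// grpc.NewClient; the StatsHandler option used below works with either.
func NewGRPCClient(ctx context.Context, target string) (*grpc.ClientConn, error) {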
conn, err := grpc.DialContext(ctx, target,
grpc.WithTransportCredentials(insecure.NewCredentials()),
grpc.WithStatsHandler(otelgrpc.NewClientHandler()),
)
if err != nil {
return nil, err
}
return conn, nil
}
func NewGRPCServer() *grpc.Server {
server := grpc.NewServer(
grpc.StatsHandler(otelgrpc.NewServerHandler()),
)
return server
}
Step 6: Metrics and Trace Correlation
package telemetry
import (
"context"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
sdkmetric "go.opentelemetry.io/otel/sdk/metric"
"go.opentelemetry.io/otel/sdk/resource"
semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
)
type MetricsProvider struct {
provider *sdkmetric.MeterProvider
}
func InitMetrics(ctx context.Context, serviceName, otlpEndpoint string) (*MetricsProvider, error) {
exporter, err := otlpmetricgrpc.New(ctx,
otlpmetricgrpc.WithEndpoint(otlpEndpoint),
otlpmetricgrpc.WithInsecure(),
)
if err != nil {
return nil, err
}
res, err := resource.New(ctx,
resource.WithAttributes(
semconv.ServiceNameKey.String(serviceName),
),
)
if err != nil {
return nil, err
}
provider := sdkmetric.NewMeterProvider(
sdkmetric.WithResource(res),
sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exporter)),
)
otel.SetMeterProvider(provider)
return &MetricsProvider{provider: provider}, nil
}
func (m *MetricsProvider) Shutdown(ctx context.Context) error {
return m.provider.Shutdown(ctx)
}
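The setup above gets metrics flowing, but the actual link from a metric back to a trace is an exemplar: a sample measurement stored with the metric that carries the trace ID of the request that produced it. The standard-library sketch below models the idea with a latency histogram that remembers one trace ID per bucket. `exemplarHistogram` is a hypothetical type for illustration. In a real setup the metrics SDK and backend handle this, and whether exemplars are enabled depends on your SDK version and backend, so treat that as something to verify.

```go
package main

import (
	"fmt"
	"sort"
)

// exemplarHistogram is a toy latency histogram that keeps, per bucket,
// the trace ID of the most recent observation that landed there.
type exemplarHistogram struct {
	bounds    []float64 // ascending upper bounds in milliseconds
	counts    []int     // len(bounds)+1; the last bucket is overflow
	exemplars []string  // latest trace ID per bucket, "" if none
}

func newExemplarHistogram(bounds ...float64) *exemplarHistogram {
	return &exemplarHistogram{
		bounds:    bounds,
		counts:    make([]int, len(bounds)+1),
		exemplars: make([]string, len(bounds)+1),
	}
}

// Observe records a latency and, if a trace ID is known, keeps it as the
// bucket's exemplar.
func (h *exemplarHistogram) Observe(ms float64, traceID string) {
	i := sort.SearchFloat64s(h.bounds, ms) // index of first bound >= ms
	h.counts[i]++
	if traceID != "" {
		h.exemplars[i] = traceID
	}
}

// SlowestExemplar returns a trace ID from the highest non-empty bucket:
// the trace to open when the latency chart spikes.
func (h *exemplarHistogram) SlowestExemplar() string {
	for i := len(h.exemplars) - 1; i >= 0; i-- {
		if h.exemplars[i] != "" {
			return h.exemplars[i]
		}
	}
	return ""
}

func main() {
	h := newExemplarHistogram(10, 100, 1000)
	h.Observe(3, "trace-a")
	h.Observe(250, "trace-b")
	h.Observe(40, "trace-c")
	fmt.Println("bucket counts:", h.counts)
	fmt.Println("open this trace:", h.SlowestExemplar())
}
```

In practice this means recording measurements with the request's `ctx` rather than `context.Background()`, so the SDK can see which span was active at the time.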
Pitfall Guide
Pitfall 1: Forgetting to Set Global Propagator
// ❌ Wrong: no Propagator set, Context can't propagate across processes
provider := sdktrace.NewTracerProvider(...)
otel.SetTracerProvider(provider)
// Missing otel.SetTextMapPropagator(...)
// ✅ Correct: set Composite Propagator
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
propagation.TraceContext{},
propagation.Baggage{},
))
Pitfall 2: Not Calling Span.End()
// ❌ Wrong: Span never ends, never exported
ctx, span := tracer.Start(ctx, "operation")
doWork(ctx)
// Forgot span.End()
// ✅ Correct: use defer to ensure Span ends
ctx, span := tracer.Start(ctx, "operation")
defer span.End()
doWork(ctx)
Pitfall 3: Improper Sampling Rate
// ❌ Wrong: AlwaysSample in production causes massive data
sdktrace.WithSampler(sdktrace.AlwaysSample())
// ✅ Correct: use ParentBased + TraceIDRatioBased
sdktrace.WithSampler(sdktrace.ParentBased(
sdktrace.TraceIDRatioBased(0.1), // 10% sampling
))
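Ratio sampling works across services because the decision is a deterministic function of the trace ID, not a coin flip: every service that sees the same trace ID reaches the same verdict, and `ParentBased` additionally makes children follow the caller's decision. Here is a minimal standard-library model of that idea. `shouldSample` is an illustrative function, and the exact arithmetic in your SDK version may differ.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// shouldSample decides from the trace ID alone, so the result is the same
// wherever it is computed. It maps the low 8 bytes of the ID onto [0, 2^63)
// and samples the trace if the value falls below ratio * 2^63.
func shouldSample(traceID [16]byte, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	bound := uint64(ratio * (1 << 63))
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < bound
}

func main() {
	var id [16]byte
	if _, err := hex.Decode(id[:], []byte("4bf92f3577b34da6a3ce929d0e0e4736")); err != nil {
		panic(err)
	}
	for _, ratio := range []float64{0.1, 0.5, 1.0} {
		fmt.Printf("ratio %.1f -> sampled=%v\n", ratio, shouldSample(id, ratio))
	}
}
```

A consequence worth knowing: a trace sampled at ratio 0.1 is also sampled at any higher ratio, so services with different ratios still agree on the traces that the strictest one keeps.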
Pitfall 4: Losing Context in Goroutines
// ❌ Wrong: no ctx passed to goroutine
go func() {
ctx, span := tracer.Start(context.Background(), "async_work")
defer span.End()
}()
// ✅ Correct: pass parent Context to goroutine
go func(ctx context.Context) {
ctx, span := tracer.Start(ctx, "async_work")
defer span.End()
}(ctx)
Pitfall 5: Shutdown Timeout Causing Data Loss
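// ❌ Wrong: Shutdown receives a context with no deadline, so it can block
// indefinitely when the Collector is unreachable; both errors are discarded
func main() {
ctx := context.Background()
tel, _ := telemetry.InitTelemetry(ctx, "svc", "1.0", "localhost:4317")
defer tel.Shutdown(ctx)
}
// ✅ Correct: bound Shutdown with a timeout long enough to flush queued spans
func main() {
ctx := context.Background()
tel, err := telemetry.InitTelemetry(ctx, "svc", "1.0", "localhost:4317")
if err != nil {
log.Fatalf("init telemetry: %v", err)
}
defer func() {
shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
if err := tel.Shutdown(shutdownCtx); err != nil {
log.Printf("telemetry shutdown: %v", err)
}
}()
}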
Error Troubleshooting
| # | Error Message | Cause | Solution |
|---|---|---|---|
| 1 | connection refused: localhost:4317 | OTLP Collector not running | Start otel-collector container, check port mapping |
| 2 | traces not showing in Jaeger | Exporter misconfigured or sampling rate 0 | Check Exporter target, confirm sampling rate > 0 |
| 3 | context deadline exceeded | Collector slow or network unreachable | Increase timeout, check network connectivity |
| 4 | span missing parent | Context propagation failed | Confirm Propagator is set, check HTTP header injection |
| 5 | resource attributes missing | Resource not configured | Add resource.WithAttributes(semconv.ServiceNameKey.String(...)) |
| 6 | too many open files | Span queue backlog, Exporter blocked | Reduce MaxQueueSize, increase BatchTimeout |
| 7 | trace_id not found in baggage | Baggage and TraceContext confused | TraceContext propagates TraceID, Baggage propagates business data |
| 8 | grpc: no transport security | gRPC using WithInsecure | Acceptable in dev, configure TLS in production |
| 9 | duplicate span name | Multiple Spans with same name | Add distinguishing attributes or use dynamic names |
| 10 | metric reader timeout | Metric export timeout | Increase Interval and Timeout in PeriodicReader |
Advanced Optimization
1. Custom SpanProcessor for Sensitive Data Filtering
package telemetry
import (
"context"
"strings"
"go.opentelemetry.io/otel/attribute"
sdktrace "go.opentelemetry.io/otel/sdk/trace"
)
type sanitizingProcessor struct {
next sdktrace.SpanProcessor
sensitiveKeys []string
}
func NewSanitizingProcessor(next sdktrace.SpanProcessor, sensitiveKeys []string) sdktrace.SpanProcessor {
return &sanitizingProcessor{next: next, sensitiveKeys: sensitiveKeys}
}
func (p *sanitizingProcessor) OnStart(ctx context.Context, s sdktrace.ReadWriteSpan) {
p.next.OnStart(ctx, s)
}
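// redactedSpan overrides Attributes on a finished span. Embedding the
// ReadOnlySpan interface keeps every other method unchanged.
type redactedSpan struct {
sdktrace.ReadOnlySpan
attrs []attribute.KeyValue
}
func (s redactedSpan) Attributes() []attribute.KeyValue {
return s.attrs
}
func (p *sanitizingProcessor) OnEnd(s sdktrace.ReadOnlySpan) {
attrs := s.Attributes()
filtered := make([]attribute.KeyValue, 0, len(attrs))
for _, attr := range attrs {
if p.isSensitive(string(attr.Key)) {
filtered = append(filtered, attribute.String(string(attr.Key), "[REDACTED]"))
} else {
filtered = append(filtered, attr)
}
}
// A ReadOnlySpan cannot be mutated, so pass the next processor a wrapper
// that reports the redacted attributes. Event attributes are not covered.
p.next.OnEnd(redactedSpan{ReadOnlySpan: s, attrs: filtered})
}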
func (p *sanitizingProcessor) isSensitive(key string) bool {
for _, sk := range p.sensitiveKeys {
if strings.Contains(strings.ToLower(key), strings.ToLower(sk)) {
return true
}
}
return false
}
func (p *sanitizingProcessor) ForceFlush(ctx context.Context) error {
return p.next.ForceFlush(ctx)
}
func (p *sanitizingProcessor) Shutdown(ctx context.Context) error {
return p.next.Shutdown(ctx)
}
2. Error-Rate-Based Dynamic Sampling
package telemetry
import (
"sync/atomic"
sdktrace "go.opentelemetry.io/otel/sdk/trace"
)
type errorAwareSampler struct {
errorCount atomic.Int64
totalCount atomic.Int64
baseRatio float64
errorRatio float64
}
func NewErrorAwareSampler(baseRatio, errorRatio float64) sdktrace.Sampler {
return &errorAwareSampler{baseRatio: baseRatio, errorRatio: errorRatio}
}
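func (s *errorAwareSampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
s.totalCount.Add(1)
// Caveat: ShouldSample runs when a span starts, so p.Attributes contains only
// the attributes passed to tracer.Start. Errors recorded later are invisible
// here; this loop counts an error only if the caller sets "error" up front.
for _, attr := range p.Attributes {
if attr.Key == "error" {
s.errorCount.Add(1)
}
}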
ratio := s.baseRatio
if s.errorCount.Load() > 0 {
errorRate := float64(s.errorCount.Load()) / float64(s.totalCount.Load())
if errorRate > 0.01 {
ratio = s.errorRatio
}
}
return sdktrace.TraceIDRatioBased(ratio).ShouldSample(p)
}
func (s *errorAwareSampler) Description() string {
return "ErrorAwareSampler"
}
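One limitation of the sampler above is that its counters are cumulative: an error burst from last week still weighs on today's ratio, and a long healthy history dilutes a fresh incident. A sliding window over the most recent outcomes reacts faster in both directions. The sketch below is a standard-library model of such a window. `errorWindow` is a hypothetical type, and feeding it from a SpanProcessor's `OnEnd` (where the final span status is known) is one possible wiring, not something the code above already does.

```go
package main

import (
	"fmt"
	"sync"
)

// errorWindow tracks the error rate over the most recent outcomes only.
type errorWindow struct {
	mu       sync.Mutex
	outcomes []bool // ring buffer; true marks an error
	next     int    // slot the next outcome overwrites
	filled   int    // number of valid slots, at most len(outcomes)
	errs     int    // errors currently inside the window
}

// newErrorWindow creates a window over the last size outcomes.
// size must be positive.
func newErrorWindow(size int) *errorWindow {
	return &errorWindow{outcomes: make([]bool, size)}
}

// Record stores one outcome and evicts the oldest once the window is full.
func (w *errorWindow) Record(isErr bool) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.filled == len(w.outcomes) {
		if w.outcomes[w.next] {
			w.errs-- // the evicted outcome was an error
		}
	} else {
		w.filled++
	}
	w.outcomes[w.next] = isErr
	if isErr {
		w.errs++
	}
	w.next = (w.next + 1) % len(w.outcomes)
}

// Rate returns the fraction of errors among the outcomes in the window.
func (w *errorWindow) Rate() float64 {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.filled == 0 {
		return 0
	}
	return float64(w.errs) / float64(w.filled)
}

func main() {
	w := newErrorWindow(4)
	w.Record(true)
	w.Record(false)
	w.Record(false)
	w.Record(false)
	fmt.Println("during incident:", w.Rate())
	for i := 0; i < 4; i++ {
		w.Record(false)
	}
	fmt.Println("after recovery:", w.Rate())
}
```

A sampler could consult `Rate()` in place of the cumulative counters and switch to the higher ratio only while the recent error rate stays above the threshold.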
3. Trace-Log Correlation
package telemetry
import (
"context"
"log/slog"
"go.opentelemetry.io/otel/trace"
)
type traceHandler struct {
next slog.Handler
}
func NewTraceHandler(next slog.Handler) slog.Handler {
return &traceHandler{next: next}
}
func (h *traceHandler) Handle(ctx context.Context, r slog.Record) error {
spanCtx := trace.SpanContextFromContext(ctx)
if spanCtx.IsValid() {
r.AddAttrs(
slog.String("trace_id", spanCtx.TraceID().String()),
slog.String("span_id", spanCtx.SpanID().String()),
)
}
return h.next.Handle(ctx, r)
}
func (h *traceHandler) Enabled(ctx context.Context, level slog.Level) bool {
return h.next.Enabled(ctx, level)
}
func (h *traceHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
return &traceHandler{next: h.next.WithAttrs(attrs)}
}
func (h *traceHandler) WithGroup(name string) slog.Handler {
return &traceHandler{next: h.next.WithGroup(name)}
}
Comparison Analysis
| Dimension | OpenTelemetry | Jaeger Client | Zipkin Brave | SkyWalking | Datadog APM |
|---|---|---|---|---|---|
| Vendor-neutral | ✅ CNCF standard | ⚠️ Jaeger only | ⚠️ Zipkin only | ❌ Apache but closed ecosystem | ❌ Commercial |
| Multi-language | ✅ 11+ languages | ⚠️ 6 languages | ⚠️ Java-centric | ⚠️ 8 languages | ✅ 10+ |
| Metrics integration | ✅ Native | ❌ Needs Prometheus | ❌ | ✅ | ✅ |
| Auto-instrumentation | ✅ HTTP/gRPC | ⚠️ Limited | ❌ | ✅ | ✅ |
| Sampling strategies | ✅ Flexible | ⚠️ Simple | ⚠️ Simple | ✅ | ❌ Fixed |
| Community activity | ⭐ Very high | ⭐ High | ⭐ Medium | ⭐ High | ⭐ Commercial |
| Cost | Free | Free | Free | Free | $31/mo+ |
Summary: OpenTelemetry isn't just another APM tool — it's the infrastructure layer for observability. Its core value: instrument once, export to multiple backends, unified Trace/Metrics/Logs. The 2026 best practice: use OTel SDK for unified instrumentation → OTLP protocol to Collector → Collector routes to Jaeger (Trace) + Prometheus (Metrics) + Loki (Log). The key is configuring Propagator and sampling strategy at SDK initialization to avoid broken chains or data floods later.