Go Microservice Resilience: 5 Production Patterns from Circuit Breaker to Rate Limiting

云原生

When Cascading Failures Meet Avalanche Effects: Microservices' Darkest Hour

Friday evening peak, payment service slows down. The upstream order service has no timeout control — all requests block on payment calls, goroutines pile up to 500K. Within 5 minutes, the order service crashes with OOM, the user service follows, and the entire call chain avalanches. The irony? The payment service recovered 30 seconds later — but nobody could reach it anymore.

This isn't fear-mongering. In microservice architectures, one slow node can drag down the entire call chain. You need circuit breakers to prevent failure propagation, rate limiting to protect downstream, retries for transient glitches, timeouts to avoid infinite waits, and bulkheads to isolate fault domains. This article covers 5 production-grade resilience patterns to help you build anti-fragile Go microservices.


Core Concepts Reference

Pattern Core Idea Key Library Typical Scenario
Circuit Breaker Disconnect calls after failure threshold, protect downstream sony/gobreaker Fast-fail when downstream is unavailable
Rate Limiting Control request rate, prevent overload golang.org/x/time/rate API rate limiting, resource quota control
Retry with Backoff Auto-retry transient failures, exponential backoff avoids retry storms Custom implementation Network jitter, temporary errors
Timeout Control Limit maximum duration per call context.Context All external calls must have timeouts
Bulkhead Isolate resources for different callers, prevent mutual impact Custom implementation Multi-downstream call resource isolation

Table of Contents

  1. Microservice Resilience Architecture Overview
  2. Pattern 1: Circuit Breaker — sony/gobreaker in Practice
  3. Pattern 2: Rate Limiting — Token Bucket and Sliding Window
  4. Pattern 3: Retry — Exponential Backoff and Jitter
  5. Pattern 4: Timeout Control — context.Context in Practice
  6. Pattern 5: Bulkhead Isolation — Resource Partitioning
  7. 5 Common Pitfalls
  8. 10 Error Troubleshooting
  9. Advanced Optimization Techniques
  10. Resilience Pattern Comparison
  11. Recommended Tools
  12. Summary and Further Reading

Microservice Resilience Architecture Overview

                    ┌─────────────────────────────────────────────┐
                    │           Client (HTTP/gRPC)                │
                    └──────────────────┬──────────────────────────┘
                                       │
                    ┌──────────────────▼──────────────────────────┐
                    │          API Gateway / BFF                  │
                    │  ┌──────────┐  ┌──────────┐  ┌──────────┐ │
                    │  │ Rate     │  │ Circuit  │  │ Timeout  │ │
                    │  │ Limiter  │  │ Breaker  │  │ Control  │ │
                    │  └──────────┘  └──────────┘  └──────────┘ │
                    └──────────────────┬──────────────────────────┘
                                       │
              ┌────────────────────────┼────────────────────────┐
              │                        │                        │
    ┌─────────▼──────────┐  ┌─────────▼──────────┐  ┌─────────▼──────────┐
    │   Service A        │  │   Service B        │  │   Service C        │
    │  ┌──────────────┐  │  │  ┌──────────────┐  │  │  ┌──────────────┐  │
    │  │  Bulkhead    │  │  │  │  Bulkhead    │  │  │  │  Bulkhead    │  │
    │  │  ┌───┐ ┌───┐ │  │  │  │  ┌───┐ ┌───┐ │  │  │  │  ┌───┐ ┌───┐ │  │
    │  │  │ P1│ │ P2│ │  │  │  │  │ P1│ │ P2│ │  │  │  │  │ P1│ │ P2│ │  │
    │  │  └───┘ └───┘ │  │  │  │  └───┘ └───┘ │  │  │  │  └───┘ └───┘ │  │
    │  └──────────────┘  │  │  └──────────────┘  │  │  └──────────────┘  │
    │  ┌──────────────┐  │  │  ┌──────────────┐  │  │  ┌──────────────┐  │
    │  │  Retry with  │  │  │  │  Retry with  │  │  │  │  Retry with  │  │
    │  │  Backoff     │  │  │  │  Backoff     │  │  │  │  Backoff     │  │
    │  └──────────────┘  │  │  └──────────────┘  │  │  └──────────────┘  │
    └────────────────────┘  └────────────────────┘  └────────────────────┘
              │                        │                        │
    ┌─────────▼──────────┐  ┌─────────▼──────────┐  ┌─────────▼──────────┐
    │     Database       │  │    Cache (Redis)    │  │   Message Queue    │
    └────────────────────┘  └────────────────────┘  └────────────────────┘

Four Principles of Resilience:

  • Fail Fast: Circuit breaker opens and returns errors immediately, no timeout wait
  • Graceful Degradation: Return fallback data when non-critical functions fail
  • Fault Isolation: Bulkhead pattern ensures one fault doesn't affect other calls
  • Self-Healing: Circuit breaker half-open state probes recovery, retry handles transient faults

Pattern 1: Circuit Breaker — sony/gobreaker in Practice

The circuit breaker is the first line of defense for microservice resilience. When downstream service failure rate exceeds a threshold, the breaker "trips" — subsequent requests return errors immediately without calling downstream, giving it time to recover.

Circuit Breaker State Machine

         Recovery rate                    Failure threshold
    ┌──────────────┐             ┌──────────────┐
    │              │             │              │
    │   Half-Open  │◄────────────│    Closed    │
    │   (Probing)  │             │  (Allowing)  │
    │              │─────────────►│              │
    └──────┬───────┘  Probe OK   └──────────────┘
           │                       ▲
           │ Probe failed          │ Auto half-open
           ▼                       │ after timeout
    ┌──────────────┐               │
    │              │───────────────┘
    │    Open      │
    │  (Rejecting) │
    │              │
    └──────────────┘

Complete Implementation

package circuitbreaker

import (
    "fmt"
    "time"

    "github.com/sony/gobreaker/v2"
)

type CircuitBreaker[T any] struct {
    cb *gobreaker.CircuitBreaker[T]
}

type Config struct {
    Name          string
    MaxRequests   uint32
    Interval      time.Duration
    Timeout       time.Duration
    FailThreshold uint32
    FailRatio     float64
}

func NewCircuitBreaker[T any](cfg Config) *CircuitBreaker[T] {
    settings := gobreaker.Settings{
        Name:        cfg.Name,
        MaxRequests: cfg.MaxRequests,
        Interval:    cfg.Interval,
        Timeout:     cfg.Timeout,
        ReadyToTrip: func(counts gobreaker.Counts) bool {
            failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
            return counts.Requests >= cfg.FailThreshold && failureRatio >= cfg.FailRatio
        },
        OnStateChange: func(name string, from gobreaker.State, to gobreaker.State) {
            fmt.Printf("CircuitBreaker '%s': %s -> %s\n", name, from, to)
        },
    }

    return &CircuitBreaker[T]{
        cb: gobreaker.NewCircuitBreaker[T](settings),
    }
}

func (cb *CircuitBreaker[T]) Execute(fn func() (T, error)) (T, error) {
    result, err := cb.cb.Execute(fn)
    if err != nil {
        var zero T
        if err == gobreaker.ErrOpenState {
            return zero, fmt.Errorf("circuit breaker open: %w", err)
        }
        if err == gobreaker.ErrTooManyRequests {
            return zero, fmt.Errorf("circuit breaker half-open: too many requests: %w", err)
        }
        return zero, err
    }
    return result, nil
}

func (cb *CircuitBreaker[T]) State() gobreaker.State {
    return cb.cb.State()
}

Usage Example — HTTP Client Circuit Breaking

package httpclient

import (
    "context"
    "fmt"
    "io"
    "net/http"
    "time"

    "github.com/sony/gobreaker/v2"
)

type ResilientClient struct {
    client *http.Client
    cb     *gobreaker.CircuitBreaker[string]
}

func NewResilientClient() *ResilientClient {
    cb := gobreaker.NewCircuitBreaker[string](gobreaker.Settings{
        Name:        "http-client",
        MaxRequests: 3,
        Interval:    60 * time.Second,
        Timeout:     30 * time.Second,
        ReadyToTrip: func(counts gobreaker.Counts) bool {
            failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
            return counts.Requests >= 10 && failureRatio >= 0.6
        },
    })

    return &ResilientClient{
        client: &http.Client{Timeout: 10 * time.Second},
        cb:     cb,
    }
}

func (c *ResilientClient) Get(ctx context.Context, url string) (string, error) {
    result, err := c.cb.Execute(func() (string, error) {
        req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
        if err != nil {
            return "", fmt.Errorf("create request: %w", err)
        }

        resp, err := c.client.Do(req)
        if err != nil {
            return "", fmt.Errorf("do request: %w", err)
        }
        defer resp.Body.Close()

        if resp.StatusCode >= 500 {
            return "", fmt.Errorf("server error: status %d", resp.StatusCode)
        }

        body, err := io.ReadAll(resp.Body)
        if err != nil {
            return "", fmt.Errorf("read body: %w", err)
        }
        return string(body), nil
    })

    if err != nil {
        return "", fmt.Errorf("resilient get %s: %w", url, err)
    }
    return result, nil
}

Key Design Points:

  • ReadyToTrip customizes trip conditions: 60% failure rate with minimum 10 requests
  • MaxRequests controls probe request count in half-open state
  • Timeout controls wait time from Open to Half-Open
  • OnStateChange callback for monitoring and alerting
  • Generic support (Go 1.18+) for type-safe return values

Pattern 2: Rate Limiting — Token Bucket and Sliding Window

Rate limiting is the core mechanism to protect downstream services. golang.org/x/time/rate implements the token bucket algorithm — simple and efficient.

Token Bucket Algorithm

    Request Arrives              Token Bucket
    ┌──────┐     ┌──────────────────────────┐
    │ Req1 │────►│  ● ● ● ● ● ○ ○ ○ ○ ○   │ ──► Allow (has token)
    └──────┘     │  5/10 tokens             │
    ┌──────┐     │  r=10/s, burst=10        │
    │ Req6 │────►│  ● ● ● ● ● ○ ○ ○ ○ ○   │ ──► Allow (has token)
    └──────┘     │  4/10 tokens             │
    ┌──────┐     │                          │
    │ Req11│────►│  ○ ○ ○ ○ ○ ○ ○ ○ ○ ○   │ ──► Reject (no token)
    └──────┘     │  0/10 tokens             │
                 └──────────────────────────┘
                      ▲
                      │ 10 tokens replenished per second
                      │ r = 10 tokens/sec
                      │ burst = 10 (bucket capacity)

Complete Implementation

package ratelimit

import (
    "context"
    "fmt"
    "sync"
    "time"

    "golang.org/x/time/rate"
)

type RateLimiter struct {
    limiters sync.Map
    rate     rate.Limit
    burst    int
}

func NewRateLimiter(r rate.Limit, burst int) *RateLimiter {
    return &RateLimiter{
        rate:  r,
        burst: burst,
    }
}

func (rl *RateLimiter) getLimiter(key string) *rate.Limiter {
    if limiter, ok := rl.limiters.Load(key); ok {
        return limiter.(*rate.Limiter)
    }

    newLimiter := rate.NewLimiter(rl.rate, rl.burst)
    actual, _ := rl.limiters.LoadOrStore(key, newLimiter)
    return actual.(*rate.Limiter)
}

func (rl *RateLimiter) Allow(key string) bool {
    return rl.getLimiter(key).Allow()
}

func (rl *RateLimiter) Wait(ctx context.Context, key string) error {
    return rl.getLimiter(key).Wait(ctx)
}

func (rl *RateLimiter) Reserve(key string) (*rate.Reservation, error) {
    return rl.getLimiter(key).Reserve(), nil
}

HTTP Middleware Rate Limiting

package middleware

import (
    "net/http"
    "time"

    "golang.org/x/time/rate"
)

type IPRateLimiter struct {
    limiters sync.Map
    rate     rate.Limit
    burst    int
}

func NewIPRateLimiter(r rate.Limit, burst int) *IPRateLimiter {
    return &IPRateLimiter{
        rate:  r,
        burst: burst,
    }
}

func (rl *IPRateLimiter) getLimiter(ip string) *rate.Limiter {
    if limiter, ok := rl.limiters.Load(ip); ok {
        return limiter.(*rate.Limiter)
    }
    newLimiter := rate.NewLimiter(rl.rate, rl.burst)
    actual, _ := rl.limiters.LoadOrStore(ip, newLimiter)
    return actual.(*rate.Limiter)
}

func (rl *IPRateLimiter) Middleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        ip := r.RemoteAddr

        limiter := rl.getLimiter(ip)
        if !limiter.Allow() {
            http.Error(w, "Too Many Requests", http.StatusTooManyRequests)
            return
        }

        next.ServeHTTP(w, r)
    })
}

Tiered Rate Limiting

type TieredRateLimiter struct {
    tiers map[string]*rate.Limiter
}

func NewTieredRateLimiter() *TieredRateLimiter {
    return &TieredRateLimiter{
        tiers: map[string]*rate.Limiter{
            "free":      rate.NewLimiter(10, 20),
            "basic":     rate.NewLimiter(50, 100),
            "premium":   rate.NewLimiter(200, 400),
            "enterprise": rate.NewLimiter(1000, 2000),
        },
    }
}

func (trl *TieredRateLimiter) Allow(tier string) bool {
    limiter, ok := trl.tiers[tier]
    if !ok {
        limiter = trl.tiers["free"]
    }
    return limiter.Allow()
}

Key Design Points:

  • rate.Limit represents tokens generated per second, burst is bucket capacity
  • Allow() is non-blocking, Wait() blocks until a token is available
  • sync.Map implements per-key limiters with auto-scaling
  • Tiered rate limiting adapts to different user levels
  • Token bucket allows burst traffic but keeps long-term average rate controlled

Pattern 3: Retry — Exponential Backoff and Jitter

Retries handle transient failures, but uncontrolled retries worsen problems. Exponential Backoff progressively increases retry intervals, and Jitter prevents retry storms.

Retry Strategy Comparison

    No backoff:       █ █ █ █ █ █ █ █ █ █    ← All clients retry simultaneously
    Fixed interval:   █   █   █   █   █      ← Still has synchronized retry risk
    Exponential:      █   █     █         █   ← Intervals grow progressively
    Exponential+Jitter: █  █    █       █    ← Randomization prevents retry storms

Complete Implementation

package retry

import (
    "context"
    "fmt"
    "math"
    "math/rand"
    "time"
)

type Config struct {
    MaxAttempts     int
    InitialInterval time.Duration
    MaxInterval     time.Duration
    Multiplier      float64
    Jitter          float64
}

func DefaultConfig() Config {
    return Config{
        MaxAttempts:     5,
        InitialInterval: 100 * time.Millisecond,
        MaxInterval:     30 * time.Second,
        Multiplier:      2.0,
        Jitter:          0.1,
    }
}

type RetryableError struct {
    Err error
}

func (e *RetryableError) Error() string {
    return fmt.Sprintf("retryable: %v", e.Err)
}

func IsRetryable(err error) bool {
    _, ok := err.(*RetryableError)
    return ok
}

func Do[T any](ctx context.Context, cfg Config, fn func() (T, error)) (T, error) {
    var lastErr error

    for attempt := 0; attempt < cfg.MaxAttempts; attempt++ {
        if attempt > 0 {
            delay := calculateDelay(attempt, cfg)
            select {
            case <-ctx.Done():
                var zero T
                return zero, fmt.Errorf("retry canceled: %w", ctx.Err())
            case <-time.After(delay):
            }
        }

        result, err := fn()
        if err == nil {
            return result, nil
        }

        lastErr = err
        if !IsRetryable(err) {
            var zero T
            return zero, fmt.Errorf("non-retryable error: %w", err)
        }
    }

    var zero T
    return zero, fmt.Errorf("max retries (%d) exceeded: %w", cfg.MaxAttempts, lastErr)
}

func calculateDelay(attempt int, cfg Config) time.Duration {
    delay := float64(cfg.InitialInterval) * math.Pow(cfg.Multiplier, float64(attempt-1))

    if delay > float64(cfg.MaxInterval) {
        delay = float64(cfg.MaxInterval)
    }

    jitter := delay * cfg.Jitter
    delay = delay - jitter + rand.Float64()*2*jitter

    return time.Duration(delay)
}

Usage Example — HTTP Call with Retry

func fetchWithRetry(ctx context.Context, url string) (string, error) {
    cfg := retry.Config{
        MaxAttempts:     3,
        InitialInterval: 200 * time.Millisecond,
        MaxInterval:     10 * time.Second,
        Multiplier:      2.0,
        Jitter:          0.2,
    }

    return retry.Do(ctx, cfg, func() (string, error) {
        resp, err := http.Get(url)
        if err != nil {
            return "", &retry.RetryableError{Err: err}
        }
        defer resp.Body.Close()

        if resp.StatusCode >= 500 {
            return "", &retry.RetryableError{
                Err: fmt.Errorf("server error: %d", resp.StatusCode),
            }
        }

        if resp.StatusCode >= 400 && resp.StatusCode < 500 {
            return "", fmt.Errorf("client error: %d", resp.StatusCode)
        }

        body, err := io.ReadAll(resp.Body)
        if err != nil {
            return "", &retry.RetryableError{Err: err}
        }
        return string(body), nil
    })
}

Retry with Circuit Breaker

func fetchResilient(ctx context.Context, url string, cb *CircuitBreaker[string]) (string, error) {
    cfg := retry.DefaultConfig()

    return retry.Do(ctx, cfg, func() (string, error) {
        result, err := cb.Execute(func() (string, error) {
            return doHTTPGet(ctx, url)
        })
        if err != nil {
            if errors.Is(err, gobreaker.ErrOpenState) {
                var zero string
                return zero, err
            }
            return zero, &retry.RetryableError{Err: err}
        }
        return result, nil
    })
}

Key Design Points:

  • RetryableError distinguishes retryable from non-retryable errors
  • 5xx is retryable, 4xx is not
  • Exponential backoff + jitter prevents retry storms
  • Context cancellation stops retries immediately
  • Don't retry when circuit breaker is open (fail fast takes priority)

Pattern 4: Timeout Control — context.Context in Practice

Timeout is the most fundamental and important resilience mechanism. Every external call must have a timeout — a call without a timeout is a ticking time bomb.

Timeout Hierarchy Design

    Request Total Timeout: 30s
    ├── API Gateway: 28s (2s margin)
    │   ├── Service A: 10s
    │   │   ├── DB Query: 5s
    │   │   └── Cache Get: 500ms
    │   ├── Service B: 15s
    │   │   ├── gRPC Call: 12s
    │   │   └── Redis Get: 1s
    │   └── Service C: 20s
    │       ├── HTTP Call: 15s
    │       └── MQ Publish: 3s
    └── Response Margin: 2s

Complete Implementation

package timeout

import (
    "context"
    "fmt"
    "time"
)

type TimeoutConfig struct {
    Connect    time.Duration
    Read       time.Duration
    Write      time.Duration
    Overall    time.Duration
    Graceful   time.Duration
}

func DefaultTimeoutConfig() TimeoutConfig {
    return TimeoutConfig{
        Connect:  5 * time.Second,
        Read:     10 * time.Second,
        Write:    10 * time.Second,
        Overall:  30 * time.Second,
        Graceful: 10 * time.Second,
    }
}

type CallResult struct {
    Data      string
    Duration  time.Duration
    TimedOut  bool
    Err       error
}

func CallWithTimeout(ctx context.Context, endpoint string, cfg TimeoutConfig) CallResult {
    start := time.Now()

    callCtx, cancel := context.WithTimeout(ctx, cfg.Overall)
    defer cancel()

    resultCh := make(chan CallResult, 1)

    go func() {
        data, err := doCall(callCtx, endpoint)
        elapsed := time.Since(start)

        resultCh <- CallResult{
            Data:     data,
            Duration: elapsed,
            Err:      err,
        }
    }()

    select {
    case result := <-resultCh:
        return result
    case <-callCtx.Done():
        return CallResult{
            Duration: time.Since(start),
            TimedOut: true,
            Err:      fmt.Errorf("call %s timed out after %v: %w", endpoint, cfg.Overall, callCtx.Err()),
        }
    }
}

func doCall(ctx context.Context, endpoint string) (string, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, endpoint, nil)
    if err != nil {
        return "", fmt.Errorf("create request: %w", err)
    }

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        if ctx.Err() == context.DeadlineExceeded {
            return "", fmt.Errorf("request deadline exceeded: %w", ctx.Err())
        }
        return "", fmt.Errorf("do request: %w", err)
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return "", fmt.Errorf("read body: %w", err)
    }
    return string(body), nil
}

Cascading Timeout Control

func handleRequest(ctx context.Context, req *Request) (*Response, error) {
    ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
    defer cancel()

    var (
        user    *User
        orders  []*Order
        profile *Profile
        err     error
    )

    g, ctx := errgroup.WithContext(ctx)

    g.Go(func() error {
        userCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
        defer cancel()
        user, err = fetchUser(userCtx, req.UserID)
        return err
    })

    g.Go(func() error {
        ordersCtx, cancel := context.WithTimeout(ctx, 3*time.Second)
        defer cancel()
        orders, err = fetchOrders(ordersCtx, req.UserID)
        return err
    })

    g.Go(func() error {
        profileCtx, cancel := context.WithTimeout(ctx, 1*time.Second)
        defer cancel()
        profile, err = fetchProfile(profileCtx, req.UserID)
        return err
    })

    if err := g.Wait(); err != nil {
        return nil, fmt.Errorf("handle request: %w", err)
    }

    return &Response{
        User:    user,
        Orders:  orders,
        Profile: profile,
    }, nil
}

Graceful Shutdown

func main() {
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
    defer stop()

    server := &http.Server{Addr: ":8080"}

    go func() {
        <-ctx.Done()
        fmt.Println("shutting down...")

        shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        if err := server.Shutdown(shutdownCtx); err != nil {
            fmt.Printf("shutdown error: %v\n", err)
        }
    }()

    if err := server.ListenAndServe(); err != http.ErrServerClosed {
        fmt.Printf("server error: %v\n", err)
    }
}

Key Design Points:

  • Each layer's timeout must be less than the parent layer's, with margin
  • WithTimeout is safer than time.After (auto-releases resources)
  • errgroup + sub-contexts enable parallel calls with independent timeouts
  • signal.NotifyContext listens for system signals for graceful shutdown
  • defer cancel() prevents context leaks

Pattern 5: Bulkhead Isolation — Resource Partitioning

The bulkhead pattern originates from ship design: dividing the hull into watertight compartments so one breach doesn't sink the entire ship. In microservices, isolate resources (connection pools, goroutines, semaphores) for different downstream calls to prevent one slow downstream from dragging down all calls.

Bulkhead Architecture

    ┌─────────────────────────────────────────┐
    │              Service X                   │
    │                                         │
    │  ┌──────────────┐  ┌──────────────┐     │
    │  │  Bulkhead A  │  │  Bulkhead B  │     │
    │  │  Service A   │  │  Service B   │     │
    │  │  ┌──┐┌──┐┌──┐│  │  ┌──┐┌──┐   │     │
    │  │  │w1││w2││w3││  │  │w1││w2│   │     │
    │  │  └──┘└──┘└──┘│  │  └──┘└──┘   │     │
    │  │  pool: 3     │  │  pool: 2     │     │
    │  └──────┬───────┘  └──────┬───────┘     │
    │         │                 │              │
    └─────────┼─────────────────┼──────────────┘
              │                 │
    ┌─────────▼───────┐ ┌──────▼──────────┐
    │   Service A     │ │   Service B     │
    │   (Slow)        │ │   (Healthy)     │
    └─────────────────┘ └─────────────────┘

    Slow requests fill up Bulkhead A
    → Doesn't affect Bulkhead B's calls to Service B

Complete Implementation

package bulkhead

import (
    "context"
    "fmt"
    "sync"
    "time"

    "golang.org/x/sync/semaphore"
)

type Bulkhead struct {
    name      string
    sem       *semaphore.Weighted
    timeout   time.Duration
    active    atomic.Int64
    rejected  atomic.Int64
    timedOut  atomic.Int64
}

type Config struct {
    Name          string
    MaxConcurrent int64
    Timeout       time.Duration
}

func New(cfg Config) *Bulkhead {
    return &Bulkhead{
        name:    cfg.Name,
        sem:     semaphore.NewWeighted(cfg.MaxConcurrent),
        timeout: cfg.Timeout,
    }
}

func (b *Bulkhead) Execute(ctx context.Context, fn func() error) error {
    acquireCtx, cancel := context.WithTimeout(ctx, b.timeout)
    defer cancel()

    if err := b.sem.Acquire(acquireCtx, 1); err != nil {
        b.rejected.Add(1)
        if ctx.Err() == context.DeadlineExceeded {
            b.timedOut.Add(1)
            return fmt.Errorf("bulkhead '%s' acquire timeout: %w", b.name, err)
        }
        return fmt.Errorf("bulkhead '%s' full: %w", b.name, err)
    }
    defer b.sem.Release(1)

    b.active.Add(1)
    defer b.active.Add(-1)

    return fn()
}

func (b *Bulkhead) Stats() BulkheadStats {
    return BulkheadStats{
        Name:     b.name,
        Active:   b.active.Load(),
        Rejected: b.rejected.Load(),
        TimedOut: b.timedOut.Load(),
    }
}

type BulkheadStats struct {
    Name     string
    Active   int64
    Rejected int64
    TimedOut int64
}

Multi-Downstream Bulkhead Manager

type BulkheadManager struct {
    bulkheads sync.Map
}

func NewManager() *BulkheadManager {
    return &BulkheadManager{}
}

func (m *BulkheadManager) Register(name string, maxConcurrent int64, timeout time.Duration) {
    bh := New(Config{
        Name:          name,
        MaxConcurrent: maxConcurrent,
        Timeout:       timeout,
    })
    m.bulkheads.Store(name, bh)
}

func (m *BulkheadManager) Execute(ctx context.Context, service string, fn func() error) error {
    val, ok := m.bulkheads.Load(service)
    if !ok {
        return fmt.Errorf("bulkhead not found: %s", service)
    }
    return val.(*Bulkhead).Execute(ctx, fn)
}

func (m *BulkheadManager) Stats() map[string]BulkheadStats {
    stats := make(map[string]BulkheadStats)
    m.bulkheads.Range(func(key, value any) bool {
        stats[key.(string)] = value.(*Bulkhead).Stats()
        return true
    })
    return stats
}

Usage Example

func setupService() *BulkheadManager {
    mgr := NewManager()
    mgr.Register("user-service", 20, 5*time.Second)
    mgr.Register("order-service", 30, 8*time.Second)
    mgr.Register("payment-service", 10, 10*time.Second)
    mgr.Register("inventory-service", 15, 5*time.Second)
    return mgr
}

func (s *OrderService) CreateOrder(ctx context.Context, req *CreateOrderRequest) error {
    var user *User
    var inventory *Inventory
    var payment *Payment

    g, ctx := errgroup.WithContext(ctx)

    g.Go(func() error {
        return s.bulkheads.Execute(ctx, "user-service", func() error {
            var err error
            user, err = s.userClient.Get(ctx, req.UserID)
            return err
        })
    })

    g.Go(func() error {
        return s.bulkheads.Execute(ctx, "inventory-service", func() error {
            var err error
            inventory, err = s.inventoryClient.Check(ctx, req.ProductID)
            return err
        })
    })

    if err := g.Wait(); err != nil {
        return fmt.Errorf("create order pre-check: %w", err)
    }

    return s.bulkheads.Execute(ctx, "payment-service", func() error {
        var err error
        payment, err = s.paymentClient.Charge(ctx, &ChargeRequest{
            UserID: user.ID,
            Amount: req.Amount,
        })
        return err
    })
}

Key Design Points:

  • Each downstream service has an independent semaphore pool, no mutual interference
  • Semaphore acquisition has timeout, avoiding infinite waits
  • Track active/rejected/timed-out counts for monitoring
  • Combine with errgroup for parallel calls with individual isolation
  • sync.Map supports dynamic bulkhead registration

5 Common Pitfalls and Fixes

Pitfall 1: Unreasonable Circuit Breaker Thresholds

❌ Wrong:

cb := gobreaker.NewCircuitBreaker[string](gobreaker.Settings{
    ReadyToTrip: func(counts gobreaker.Counts) bool {
        return counts.TotalFailures > 3
    },
})

3 failures trigger a break — too sensitive for low-QPS scenarios. A single network jitter could trip the breaker.

✅ Correct:

cb := gobreaker.NewCircuitBreaker[string](gobreaker.Settings{
    ReadyToTrip: func(counts gobreaker.Counts) bool {
        failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
        return counts.Requests >= 10 && failureRatio >= 0.6
    },
})

Pitfall 2: Retrying Non-Idempotent Operations

❌ Wrong:

func createOrder(ctx context.Context, req *OrderRequest) error {
    return retry.Do(ctx, cfg, func() error {
        return db.Insert(ctx, req)
    })
}

Database inserts aren't idempotent — retries could create duplicate orders.

✅ Correct:

func createOrder(ctx context.Context, req *OrderRequest) error {
    return db.Insert(ctx, req)
}

func queryOrder(ctx context.Context, orderID string) (*Order, error) {
    return retry.Do(ctx, cfg, func() (*Order, error) {
        return db.FindByID(ctx, orderID)
    })
}

Pitfall 3: Stacking Context Timeouts

❌ Wrong:

func handler(ctx context.Context) {
    ctx1, cancel1 := context.WithTimeout(ctx, 30*time.Second)
    defer cancel1()

    ctx2, cancel2 := context.WithTimeout(ctx1, 20*time.Second)
    defer cancel2()

    ctx3, cancel3 := context.WithTimeout(ctx2, 15*time.Second)
    defer cancel3()

    doWork(ctx3)
}

Three layers of timeout stacking — the effective timeout is 15s, but 3 timers are created, wasting resources.

✅ Correct:

func handler(ctx context.Context) {
    ctx, cancel := context.WithTimeout(ctx, 15*time.Second)
    defer cancel()

    doWork(ctx)
}

Pitfall 4: Rate Limiter Not Isolated Per Client

❌ Wrong:

var limiter = rate.NewLimiter(100, 200)

func handler(w http.ResponseWriter, r *http.Request) {
    if !limiter.Allow() {
        http.Error(w, "too many requests", 429)
        return
    }
    handle(w, r)
}

A single high-traffic client can exhaust the entire quota with a global shared limiter.

✅ Correct:

var limiters sync.Map

func getLimiter(clientID string) *rate.Limiter {
    if v, ok := limiters.Load(clientID); ok {
        return v.(*rate.Limiter)
    }
    l := rate.NewLimiter(100, 200)
    actual, _ := limiters.LoadOrStore(clientID, l)
    return actual.(*rate.Limiter)
}

Pitfall 5: Bulkhead Semaphore Leak

❌ Wrong:

func (b *Bulkhead) Execute(fn func() error) error {
    if err := b.sem.Acquire(context.Background(), 1); err != nil {
        return err
    }

    result := fn()
    b.sem.Release(1)
    return result
}

If fn() panics, Release is never called — the semaphore permanently decreases.

✅ Correct:

func (b *Bulkhead) Execute(ctx context.Context, fn func() error) error {
    if err := b.sem.Acquire(ctx, 1); err != nil {
        return err
    }
    defer b.sem.Release(1)

    return fn()
}

Error Troubleshooting Quick Reference

Error Symptom Possible Cause Investigation Method Solution
circuit breaker open frequently Downstream failure or threshold too low Check downstream health, review breaker stats Adjust ReadyToTrip threshold, increase Timeout wait
too many requests 429 errors Rate limiter burst too small or rate too low Monitor token consumption rate, check client concurrency Increase rate and burst, isolate limiters per client
Retry doubling request volume Too many retries, no backoff strategy Check retry config, monitor retry request ratio Reduce retry count, use exponential backoff + jitter
context deadline exceeded frequent Timeout too short or downstream slow Check P99 latency, compare with timeout Set timeout to 2-3x P99, add fallback logic
High bulkhead rejection rate Bulkhead capacity too small or downstream slow Check bulkhead Stats, focus on Active/Rejected Increase MaxConcurrent, add timeout to prevent long holds
Retry storm worsening failure No jitter, all clients retry simultaneously Monitor retry request time distribution Add Jitter, use Retry-After header
Circuit breaker never recovers Timeout too long or no half-open probing Check breaker state change logs Reduce Timeout, set MaxRequests to allow probing
Rate limiter memory growing Per-key limiters not cleaning expired keys Monitor sync.Map size Periodically clean inactive limiters, use LRU eviction
Goroutines still running after timeout Only cancelled context but no check pprof goroutine profile Ensure all goroutines check ctx.Done() in select
Inter-bulkhead resource contention Multiple bulkheads sharing connection pool Check connection pool config and wait times Independent connection pool per bulkhead, or weighted semaphore

Advanced Optimization Techniques

Optimization 1: Adaptive Circuit Breaking

Dynamically adjust circuit breaker thresholds based on real-time metrics:

package adaptive

import (
    "sync"
    "time"

    "github.com/sony/gobreaker/v2"
)

type AdaptiveCircuitBreaker struct {
    cb         *gobreaker.CircuitBreaker[string]
    mu         sync.Mutex
    window     []time.Duration
    windowSize int
    threshold  time.Duration
}

func NewAdaptiveCircuitBreaker(windowSize int, threshold time.Duration) *AdaptiveCircuitBreaker {
    acb := &AdaptiveCircuitBreaker{
        window:     make([]time.Duration, 0, windowSize),
        windowSize: windowSize,
        threshold:  threshold,
    }

    acb.cb = gobreaker.NewCircuitBreaker[string](gobreaker.Settings{
        Name:        "adaptive",
        Interval:    10 * time.Second,
        Timeout:     30 * time.Second,
        MaxRequests: 5,
        ReadyToTrip: func(counts gobreaker.Counts) bool {
            return acb.shouldTrip(counts)
        },
    })

    return acb
}

func (acb *AdaptiveCircuitBreaker) shouldTrip(counts gobreaker.Counts) bool {
    acb.mu.Lock()
    defer acb.mu.Unlock()

    if len(acb.window) < acb.windowSize {
        failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
        return counts.Requests >= 10 && failureRatio >= 0.5
    }

    avgLatency := acb.averageLatency()
    failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)

    if avgLatency > acb.threshold && failureRatio >= 0.3 {
        return true
    }

    return failureRatio >= 0.6
}

func (acb *AdaptiveCircuitBreaker) averageLatency() time.Duration {
    var total time.Duration
    for _, d := range acb.window {
        total += d
    }
    return total / time.Duration(len(acb.window))
}

func (acb *AdaptiveCircuitBreaker) RecordLatency(d time.Duration) {
    acb.mu.Lock()
    defer acb.mu.Unlock()

    acb.window = append(acb.window, d)
    if len(acb.window) > acb.windowSize {
        acb.window = acb.window[1:]
    }
}

Optimization 2: Sliding Window Rate Limiting

Token buckets are good for average rate control; sliding windows more precisely control request counts within a time window:

package slidingwindow

import (
    "sync"
    "time"
)

type Window struct {
    mu      sync.Mutex
    size    time.Duration
    limit   int
    buckets []bucket
    count   int
}

type bucket struct {
    time  time.Time
    count int
}

func NewWindow(size time.Duration, limit int) *Window {
    return &Window{
        size:    size,
        limit:   limit,
        buckets: make([]bucket, 0),
    }
}

func (w *Window) Allow() bool {
    w.mu.Lock()
    defer w.mu.Unlock()

    now := time.Now()
    cutoff := now.Add(-w.size)

    i := 0
    for i < len(w.buckets) && w.buckets[i].time.Before(cutoff) {
        i++
    }
    w.buckets = w.buckets[i:]
    w.count = 0
    for _, b := range w.buckets {
        w.count += b.count
    }

    if w.count >= w.limit {
        return false
    }

    w.buckets = append(w.buckets, bucket{time: now, count: 1})
    w.count++
    return true
}

Optimization 3: Resilience Middleware Chain

Combine all 5 resilience patterns into a reusable middleware chain:

package resilience

import (
    "context"
    "fmt"
    "time"

    "github.com/sony/gobreaker/v2"
    "golang.org/x/time/rate"
)

type Middleware func(ctx context.Context, req *Request) (*Response, error)

type Chain struct {
    middlewares []Middleware
}

func NewChain(middlewares ...Middleware) *Chain {
    return &Chain{middlewares: middlewares}
}

func (c *Chain) Then(final func(ctx context.Context, req *Request) (*Response, error)) Middleware {
    return func(ctx context.Context, req *Request) (*Response, error) {
        var handler Middleware = func(ctx context.Context, req *Request) (*Response, error) {
            return final(ctx, req)
        }

        for i := len(c.middlewares) - 1; i >= 0; i-- {
            handler = func(mw Middleware, h Middleware) Middleware {
                return func(ctx context.Context, req *Request) (*Response, error) {
                    return mw(ctx, req)
                }
            }(c.middlewares[i], handler)
        }

        return handler(ctx, req)
    }
}

func RateLimitMiddleware(limiter *rate.Limiter) Middleware {
    return func(ctx context.Context, req *Request) (*Response, error) {
        if !limiter.Allow() {
            return nil, fmt.Errorf("rate limit exceeded")
        }
        return nil, nil
    }
}

func CircuitBreakerMiddleware(cb *gobreaker.CircuitBreaker[string]) Middleware {
    return func(ctx context.Context, req *Request) (*Response, error) {
        _, err := cb.Execute(func() (string, error) {
            return "", nil
        })
        if err != nil {
            return nil, err
        }
        return nil, nil
    }
}

func TimeoutMiddleware(timeout time.Duration) Middleware {
    return func(ctx context.Context, req *Request) (*Response, error) {
        _, cancel := context.WithTimeout(ctx, timeout)
        defer cancel()
        return nil, nil
    }
}

Resilience Pattern Comparison

Feature Circuit Breaker Rate Limiting Retry Timeout Bulkhead
Core Purpose Fail fast, protect downstream Control request rate Handle transient faults Limit wait time Isolate resources
Key Library sony/gobreaker golang.org/x/time/rate Custom context semaphore
State Management Stateful (3 states) Stateful (token count) Stateless Stateless Stateful (semaphore)
Applicable Fault Downstream completely unavailable Traffic overload Transient network jitter Slow downstream response Partial downstream fault
False Positive Risk Medium (bad threshold) Low High (non-idempotent) Medium (too short) Low
Combination Rec. With retry With bulkhead With circuit breaker Required everywhere With rate limiting
Production Rating ⭐⭐⭐⭐⭐ ⭐⭐⭐⭐⭐ ⭐⭐⭐⭐ ⭐⭐⭐⭐⭐ ⭐⭐⭐⭐
Implementation Complexity Low Low Medium Low Medium

  • JSON Formatter — Format microservice JSON responses, quickly debug data structure issues
  • Hash Calculator — Compute request signatures and data checksums for distributed call data consistency
  • HTTP Status Code Lookup — Look up HTTP status code meanings, quickly identify resilience-related errors like 429/503

Summary

Microservice resilience isn't "add a circuit breaker and you're done" — it's about answering five questions: When should we reject requests? How do we control request rate? Should we retry on failure? How long before we timeout? Will one fault drag down other calls? The circuit breaker answers "when to reject", rate limiting answers "how to control rate", retry answers "should we retry", timeout answers "how long to wait", and bulkhead answers "how to isolate". Only by combining all 5 patterns can you build truly anti-fragile Go microservices.


Further Reading

Try these browser-local tools — no sign-up required →

#Go微服务#熔断器#限流#重试#服务韧性#2026#云原生