Go Microservice Resilience: 5 Production Patterns from Circuit Breaker to Rate Limiting
When Cascading Failures Meet Avalanche Effects: Microservices' Darkest Hour
Friday evening peak, payment service slows down. The upstream order service has no timeout control — all requests block on payment calls, goroutines pile up to 500K. Within 5 minutes, the order service crashes with OOM, the user service follows, and the entire call chain avalanches. The irony? The payment service recovered 30 seconds later — but nobody could reach it anymore.
This isn't fear-mongering. In microservice architectures, one slow node can drag down the entire call chain. You need circuit breakers to prevent failure propagation, rate limiting to protect downstream, retries for transient glitches, timeouts to avoid infinite waits, and bulkheads to isolate fault domains. This article covers 5 production-grade resilience patterns to help you build anti-fragile Go microservices.
Core Concepts Reference
| Pattern | Core Idea | Key Library | Typical Scenario |
|---|---|---|---|
| Circuit Breaker | Disconnect calls after failure threshold, protect downstream | sony/gobreaker | Fast-fail when downstream is unavailable |
| Rate Limiting | Control request rate, prevent overload | golang.org/x/time/rate | API rate limiting, resource quota control |
| Retry with Backoff | Auto-retry transient failures, exponential backoff avoids retry storms | Custom implementation | Network jitter, temporary errors |
| Timeout Control | Limit maximum duration per call | context.Context | All external calls must have timeouts |
| Bulkhead | Isolate resources for different callers, prevent mutual impact | Custom implementation | Multi-downstream call resource isolation |
Table of Contents
- Microservice Resilience Architecture Overview
- Pattern 1: Circuit Breaker — sony/gobreaker in Practice
- Pattern 2: Rate Limiting — Token Bucket and Sliding Window
- Pattern 3: Retry — Exponential Backoff and Jitter
- Pattern 4: Timeout Control — context.Context in Practice
- Pattern 5: Bulkhead Isolation — Resource Partitioning
- 5 Common Pitfalls and Fixes
- Error Troubleshooting Quick Reference
- Advanced Optimization Techniques
- Resilience Pattern Comparison
- Recommended Tools
- Summary and Further Reading
Microservice Resilience Architecture Overview
┌─────────────────────────────────────────────┐
│ Client (HTTP/gRPC) │
└──────────────────┬──────────────────────────┘
│
┌──────────────────▼──────────────────────────┐
│ API Gateway / BFF │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Rate │ │ Circuit │ │ Timeout │ │
│ │ Limiter │ │ Breaker │ │ Control │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└──────────────────┬──────────────────────────┘
│
┌────────────────────────┼────────────────────────┐
│ │ │
┌─────────▼──────────┐ ┌─────────▼──────────┐ ┌─────────▼──────────┐
│ Service A │ │ Service B │ │ Service C │
│ ┌──────────────┐ │ │ ┌──────────────┐ │ │ ┌──────────────┐ │
│ │ Bulkhead │ │ │ │ Bulkhead │ │ │ │ Bulkhead │ │
│ │ ┌───┐ ┌───┐ │ │ │ │ ┌───┐ ┌───┐ │ │ │ │ ┌───┐ ┌───┐ │ │
│ │ │ P1│ │ P2│ │ │ │ │ │ P1│ │ P2│ │ │ │ │ │ P1│ │ P2│ │ │
│ │ └───┘ └───┘ │ │ │ │ └───┘ └───┘ │ │ │ │ └───┘ └───┘ │ │
│ └──────────────┘ │ │ └──────────────┘ │ │ └──────────────┘ │
│ ┌──────────────┐ │ │ ┌──────────────┐ │ │ ┌──────────────┐ │
│ │ Retry with │ │ │ │ Retry with │ │ │ │ Retry with │ │
│ │ Backoff │ │ │ │ Backoff │ │ │ │ Backoff │ │
│ └──────────────┘ │ │ └──────────────┘ │ │ └──────────────┘ │
└────────────────────┘ └────────────────────┘ └────────────────────┘
│ │ │
┌─────────▼──────────┐ ┌─────────▼──────────┐ ┌─────────▼──────────┐
│ Database │ │ Cache (Redis) │ │ Message Queue │
└────────────────────┘ └────────────────────┘ └────────────────────┘
Four Principles of Resilience:
- Fail Fast: Circuit breaker opens and returns errors immediately, no timeout wait
- Graceful Degradation: Return fallback data when non-critical functions fail
- Fault Isolation: Bulkhead pattern ensures one fault doesn't affect other calls
- Self-Healing: Circuit breaker half-open state probes recovery, retry handles transient faults
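Graceful degradation is the easiest of the four principles to show in a few lines. The following is a minimal sketch, not a prescribed implementation: `withFallback` and `recommendations` are hypothetical names, and the failing call stands in for any non-critical downstream dependency.

```go
package main

import (
	"errors"
	"fmt"
)

// withFallback runs primary and, if it fails, returns the result of
// fallback instead of propagating the error. The name is illustrative.
func withFallback[T any](primary func() (T, error), fallback func(error) T) T {
	v, err := primary()
	if err != nil {
		return fallback(err)
	}
	return v
}

// recommendations stands in for a non-critical downstream call that is failing.
func recommendations() ([]string, error) {
	return nil, errors.New("recommendation service unavailable")
}

func main() {
	items := withFallback(recommendations, func(err error) []string {
		// Degrade to a static default list rather than failing the whole page.
		return []string{"bestseller-1", "bestseller-2"}
	})
	fmt.Println(items)
}
```

This only suits calls whose result is optional. For critical calls such as payment, returning the error is usually the safer choice.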
Pattern 1: Circuit Breaker — sony/gobreaker in Practice
The circuit breaker is the first line of defense for microservice resilience. When downstream service failure rate exceeds a threshold, the breaker "trips" — subsequent requests return errors immediately without calling downstream, giving it time to recover.
Circuit Breaker State Machine
Closed (Allowing) ── failure threshold reached ──► Open (Rejecting)
Open (Rejecting) ── auto half-open after timeout ──► Half-Open (Probing)
Half-Open (Probing) ── probe OK ──► Closed (Allowing)
Half-Open (Probing) ── probe failed ──► Open (Rejecting)
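To make the transitions concrete, here is a deliberately simplified breaker written with the standard library only. It is a sketch under stated assumptions, not how sony/gobreaker is implemented: it has no locking, counts consecutive failures instead of a ratio, and lets a single probe through in half-open. All type and function names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

var ErrOpen = errors.New("breaker open")

// Breaker is a single-goroutine sketch of the three-state machine.
type Breaker struct {
	state     State
	failures  int
	threshold int
	timeout   time.Duration
	openedAt  time.Time
	now       func() time.Time // injectable clock, makes the transitions testable
}

func (b *Breaker) Call(fn func() error) error {
	if b.state == Open {
		if b.now().Sub(b.openedAt) < b.timeout {
			return ErrOpen // fail fast, downstream is not called
		}
		b.state = HalfOpen // timeout elapsed: let one probe through
	}
	err := fn()
	switch {
	case err == nil:
		b.state, b.failures = Closed, 0 // probe OK or normal success
	case b.state == HalfOpen:
		b.trip() // probe failed: back to Open
	default:
		b.failures++
		if b.failures >= b.threshold {
			b.trip()
		}
	}
	return err
}

func (b *Breaker) trip() {
	b.state, b.failures, b.openedAt = Open, 0, b.now()
}

// walk drives the breaker through Closed -> Open -> Half-Open -> Closed
// with a fake clock and returns the state observed after each step.
func walk() []State {
	clock := time.Unix(0, 0)
	b := &Breaker{threshold: 2, timeout: 30 * time.Second, now: func() time.Time { return clock }}
	fail := func() error { return errors.New("boom") }
	ok := func() error { return nil }

	var seen []State
	b.Call(fail)
	b.Call(fail) // second consecutive failure trips the breaker
	seen = append(seen, b.state)
	b.Call(ok) // rejected: still inside the 30s timeout
	seen = append(seen, b.state)
	clock = clock.Add(31 * time.Second)
	b.Call(ok) // the probe succeeds and closes the breaker
	seen = append(seen, b.state)
	return seen
}

func main() {
	fmt.Println(walk()) // states print as integers: 0=Closed, 1=Open, 2=HalfOpen
}
```

The injectable `now` function is the main design choice here: it lets the Open to Half-Open transition be exercised without sleeping for the real timeout.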
Complete Implementation
package circuitbreaker
import (
"fmt"
"time"
"github.com/sony/gobreaker/v2"
)
type CircuitBreaker[T any] struct {
cb *gobreaker.CircuitBreaker[T]
}
type Config struct {
Name string
MaxRequests uint32
Interval time.Duration
Timeout time.Duration
FailThreshold uint32
FailRatio float64
}
func NewCircuitBreaker[T any](cfg Config) *CircuitBreaker[T] {
settings := gobreaker.Settings{
Name: cfg.Name,
MaxRequests: cfg.MaxRequests,
Interval: cfg.Interval,
Timeout: cfg.Timeout,
ReadyToTrip: func(counts gobreaker.Counts) bool {
failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
return counts.Requests >= cfg.FailThreshold && failureRatio >= cfg.FailRatio
},
OnStateChange: func(name string, from gobreaker.State, to gobreaker.State) {
fmt.Printf("CircuitBreaker '%s': %s -> %s\n", name, from, to)
},
}
return &CircuitBreaker[T]{
cb: gobreaker.NewCircuitBreaker[T](settings),
}
}
func (cb *CircuitBreaker[T]) Execute(fn func() (T, error)) (T, error) {
result, err := cb.cb.Execute(fn)
if err != nil {
var zero T
if err == gobreaker.ErrOpenState {
return zero, fmt.Errorf("circuit breaker open: %w", err)
}
if err == gobreaker.ErrTooManyRequests {
return zero, fmt.Errorf("circuit breaker half-open: too many requests: %w", err)
}
return zero, err
}
return result, nil
}
func (cb *CircuitBreaker[T]) State() gobreaker.State {
return cb.cb.State()
}
Usage Example — HTTP Client Circuit Breaking
package httpclient
import (
"context"
"fmt"
"io"
"net/http"
"time"
"github.com/sony/gobreaker/v2"
)
type ResilientClient struct {
client *http.Client
cb *gobreaker.CircuitBreaker[string]
}
func NewResilientClient() *ResilientClient {
cb := gobreaker.NewCircuitBreaker[string](gobreaker.Settings{
Name: "http-client",
MaxRequests: 3,
Interval: 60 * time.Second,
Timeout: 30 * time.Second,
ReadyToTrip: func(counts gobreaker.Counts) bool {
failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
return counts.Requests >= 10 && failureRatio >= 0.6
},
})
return &ResilientClient{
client: &http.Client{Timeout: 10 * time.Second},
cb: cb,
}
}
func (c *ResilientClient) Get(ctx context.Context, url string) (string, error) {
result, err := c.cb.Execute(func() (string, error) {
req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
if err != nil {
return "", fmt.Errorf("create request: %w", err)
}
resp, err := c.client.Do(req)
if err != nil {
return "", fmt.Errorf("do request: %w", err)
}
defer resp.Body.Close()
if resp.StatusCode >= 500 {
return "", fmt.Errorf("server error: status %d", resp.StatusCode)
}
body, err := io.ReadAll(resp.Body)
if err != nil {
return "", fmt.Errorf("read body: %w", err)
}
return string(body), nil
})
if err != nil {
return "", fmt.Errorf("resilient get %s: %w", url, err)
}
return result, nil
}
Key Design Points:
- ReadyToTrip customizes trip conditions: 60% failure rate with minimum 10 requests
- MaxRequests controls probe request count in half-open state
- Timeout controls wait time from Open to Half-Open
- OnStateChange callback for monitoring and alerting
- Generic support (Go 1.18+) for type-safe return values
Pattern 2: Rate Limiting — Token Bucket and Sliding Window
Rate limiting is the core mechanism to protect downstream services. golang.org/x/time/rate implements the token bucket algorithm — simple and efficient.
Token Bucket Algorithm
Request Arrives Token Bucket
┌──────┐ ┌──────────────────────────┐
│ Req1 │────►│ ● ● ● ● ● ○ ○ ○ ○ ○ │ ──► Allow (has token)
└──────┘ │ 5/10 tokens │
┌──────┐ │ r=10/s, burst=10 │
│ Req6 │────►│ ● ● ● ● ● ○ ○ ○ ○ ○ │ ──► Allow (has token)
└──────┘ │ 4/10 tokens │
┌──────┐ │ │
│ Req11│────►│ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ │ ──► Reject (no token)
└──────┘ │ 0/10 tokens │
└──────────────────────────┘
▲
│ 10 tokens replenished per second
│ r = 10 tokens/sec
│ burst = 10 (bucket capacity)
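The refill arithmetic behind the diagram can be sketched in a few lines. This hand-rolled `bucket` type is an illustration only, not the internals of golang.org/x/time/rate, and it takes the current time as a parameter so the behaviour is deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// bucket is a hand-rolled token bucket used only to show the arithmetic.
type bucket struct {
	rate   float64   // tokens added per second
	burst  float64   // bucket capacity
	tokens float64   // tokens currently available
	last   time.Time // time of the previous refill
}

func newBucket(rate, burst float64, now time.Time) *bucket {
	return &bucket{rate: rate, burst: burst, tokens: burst, last: now}
}

// allow refills the bucket for the time elapsed since the last call,
// caps it at burst, then spends one token if one is available.
func (b *bucket) allow(now time.Time) bool {
	elapsed := now.Sub(b.last).Seconds()
	b.tokens += elapsed * b.rate
	if b.tokens > b.burst {
		b.tokens = b.burst
	}
	b.last = now
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// simulate sends 12 requests at the same instant, then one more 150ms later.
func simulate() (allowedInBurst int, allowedAfterWait bool) {
	t0 := time.Unix(0, 0)
	b := newBucket(10, 10, t0) // r = 10 tokens/s, burst = 10
	for i := 0; i < 12; i++ {
		if b.allow(t0) {
			allowedInBurst++
		}
	}
	// 150ms at 10 tokens/s refills 1.5 tokens, enough for one request.
	return allowedInBurst, b.allow(t0.Add(150 * time.Millisecond))
}

func main() {
	n, later := simulate()
	fmt.Println("allowed in burst:", n, "allowed after wait:", later)
}
```

With r=10 and burst=10, the first ten simultaneous requests pass and the next two are rejected. Capacity returns at the refill rate, which is what keeps the long-term average bounded.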
Complete Implementation
package ratelimit
import (
"context"
"fmt"
"sync"
"time"
"golang.org/x/time/rate"
)
type RateLimiter struct {
limiters sync.Map
rate rate.Limit
burst int
}
func NewRateLimiter(r rate.Limit, burst int) *RateLimiter {
return &RateLimiter{
rate: r,
burst: burst,
}
}
func (rl *RateLimiter) getLimiter(key string) *rate.Limiter {
if limiter, ok := rl.limiters.Load(key); ok {
return limiter.(*rate.Limiter)
}
newLimiter := rate.NewLimiter(rl.rate, rl.burst)
actual, _ := rl.limiters.LoadOrStore(key, newLimiter)
return actual.(*rate.Limiter)
}
func (rl *RateLimiter) Allow(key string) bool {
return rl.getLimiter(key).Allow()
}
func (rl *RateLimiter) Wait(ctx context.Context, key string) error {
return rl.getLimiter(key).Wait(ctx)
}
func (rl *RateLimiter) Reserve(key string) (*rate.Reservation, error) {
return rl.getLimiter(key).Reserve(), nil
}
HTTP Middleware Rate Limiting
package middleware
import (
"net"
"net/http"
"sync"
"golang.org/x/time/rate"
)
type IPRateLimiter struct {
limiters sync.Map
rate rate.Limit
burst int
}
func NewIPRateLimiter(r rate.Limit, burst int) *IPRateLimiter {
return &IPRateLimiter{
rate: r,
burst: burst,
}
}
func (rl *IPRateLimiter) getLimiter(ip string) *rate.Limiter {
if limiter, ok := rl.limiters.Load(ip); ok {
return limiter.(*rate.Limiter)
}
newLimiter := rate.NewLimiter(rl.rate, rl.burst)
actual, _ := rl.limiters.LoadOrStore(ip, newLimiter)
return actual.(*rate.Limiter)
}
func (rl *IPRateLimiter) Middleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
// RemoteAddr is "IP:port". Key on the IP alone, otherwise every new
// connection from the same client would get a fresh limiter.
// Behind a proxy or load balancer, RemoteAddr is the proxy's address.
ip, _, err := net.SplitHostPort(r.RemoteAddr)
if err != nil {
ip = r.RemoteAddr
}
limiter := rl.getLimiter(ip)
if !limiter.Allow() {
http.Error(w, "Too Many Requests", http.StatusTooManyRequests)
return
}
next.ServeHTTP(w, r)
})
}
Tiered Rate Limiting
type TieredRateLimiter struct {
tiers map[string]*rate.Limiter
}
func NewTieredRateLimiter() *TieredRateLimiter {
return &TieredRateLimiter{
tiers: map[string]*rate.Limiter{
"free": rate.NewLimiter(10, 20),
"basic": rate.NewLimiter(50, 100),
"premium": rate.NewLimiter(200, 400),
"enterprise": rate.NewLimiter(1000, 2000),
},
}
}
func (trl *TieredRateLimiter) Allow(tier string) bool {
limiter, ok := trl.tiers[tier]
if !ok {
limiter = trl.tiers["free"]
}
return limiter.Allow()
}
Key Design Points:
- rate.Limit represents tokens generated per second, burst is bucket capacity
- Allow() is non-blocking, Wait() blocks until a token is available
- sync.Map holds per-key limiters, created on demand the first time a key is seen
- Tiered rate limiting adapts to different user levels
- Token bucket allows burst traffic but keeps long-term average rate controlled
Pattern 3: Retry — Exponential Backoff and Jitter
Retries handle transient failures, but uncontrolled retries worsen problems. Exponential Backoff progressively increases retry intervals, and Jitter prevents retry storms.
Retry Strategy Comparison
No backoff: █ █ █ █ █ █ █ █ █ █ ← All clients retry simultaneously
Fixed interval: █ █ █ █ █ ← Still has synchronized retry risk
Exponential: █ █ █ █ ← Intervals grow progressively
Exponential+Jitter: █ █ █ █ ← Randomization prevents retry storms
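A worked schedule makes the comparison easier to reason about. The sketch below computes the range of delays that exponential backoff with symmetric jitter can produce for each retry. `delayBounds` is a hypothetical helper, and the parameters (100ms initial interval, multiplier 2.0, 10% jitter, 30s cap) are example values.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// delayBounds returns the smallest and largest delay that exponential
// backoff with symmetric jitter can produce before the given retry
// (retry 1 is the first retry after the initial attempt).
func delayBounds(retry int, initial, maxDelay time.Duration, multiplier, jitter float64) (lo, hi time.Duration) {
	base := float64(initial) * math.Pow(multiplier, float64(retry-1))
	if base > float64(maxDelay) {
		base = float64(maxDelay) // cap before applying jitter
	}
	return time.Duration(base * (1 - jitter)), time.Duration(base * (1 + jitter))
}

func main() {
	for retry := 1; retry <= 4; retry++ {
		lo, hi := delayBounds(retry, 100*time.Millisecond, 30*time.Second, 2.0, 0.1)
		fmt.Printf("retry %d: between %v and %v\n", retry, lo, hi)
	}
}
```

The base delay doubles each time (100ms, 200ms, 400ms, 800ms) and each client picks a random point within plus or minus 10% of it. A larger jitter fraction spreads clients further apart at the cost of less predictable latency.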
Complete Implementation
package retry
import (
"context"
"errors"
"fmt"
"math"
"math/rand"
"time"
)
type Config struct {
MaxAttempts int
InitialInterval time.Duration
MaxInterval time.Duration
Multiplier float64
Jitter float64
}
func DefaultConfig() Config {
return Config{
MaxAttempts: 5,
InitialInterval: 100 * time.Millisecond,
MaxInterval: 30 * time.Second,
Multiplier: 2.0,
Jitter: 0.1,
}
}
// RetryableError marks an error as safe to retry.
type RetryableError struct {
Err error
}
func (e *RetryableError) Error() string {
return fmt.Sprintf("retryable: %v", e.Err)
}
// Unwrap exposes the underlying error to errors.Is and errors.As.
func (e *RetryableError) Unwrap() error {
return e.Err
}
// IsRetryable reports whether err, or any error it wraps, is a *RetryableError.
func IsRetryable(err error) bool {
var re *RetryableError
return errors.As(err, &re)
}
func Do[T any](ctx context.Context, cfg Config, fn func() (T, error)) (T, error) {
var lastErr error
for attempt := 0; attempt < cfg.MaxAttempts; attempt++ {
if attempt > 0 {
delay := calculateDelay(attempt, cfg)
select {
case <-ctx.Done():
var zero T
return zero, fmt.Errorf("retry canceled: %w", ctx.Err())
case <-time.After(delay):
}
}
result, err := fn()
if err == nil {
return result, nil
}
lastErr = err
if !IsRetryable(err) {
var zero T
return zero, fmt.Errorf("non-retryable error: %w", err)
}
}
var zero T
return zero, fmt.Errorf("max retries (%d) exceeded: %w", cfg.MaxAttempts, lastErr)
}
func calculateDelay(attempt int, cfg Config) time.Duration {
delay := float64(cfg.InitialInterval) * math.Pow(cfg.Multiplier, float64(attempt-1))
if delay > float64(cfg.MaxInterval) {
delay = float64(cfg.MaxInterval)
}
jitter := delay * cfg.Jitter
delay = delay - jitter + rand.Float64()*2*jitter
return time.Duration(delay)
}
Usage Example — HTTP Call with Retry
func fetchWithRetry(ctx context.Context, url string) (string, error) {
cfg := retry.Config{
MaxAttempts: 3,
InitialInterval: 200 * time.Millisecond,
MaxInterval: 10 * time.Second,
Multiplier: 2.0,
Jitter: 0.2,
}
return retry.Do(ctx, cfg, func() (string, error) {
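// Bind the request to ctx so cancellation also aborts an in-flight call;
// http.Get would ignore the context entirely.
req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
if err != nil {
return "", fmt.Errorf("create request: %w", err)
}
resp, err := http.DefaultClient.Do(req)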
if err != nil {
return "", &retry.RetryableError{Err: err}
}
defer resp.Body.Close()
if resp.StatusCode >= 500 {
return "", &retry.RetryableError{
Err: fmt.Errorf("server error: %d", resp.StatusCode),
}
}
if resp.StatusCode >= 400 && resp.StatusCode < 500 {
return "", fmt.Errorf("client error: %d", resp.StatusCode)
}
body, err := io.ReadAll(resp.Body)
if err != nil {
return "", &retry.RetryableError{Err: err}
}
return string(body), nil
})
}
Retry with Circuit Breaker
func fetchResilient(ctx context.Context, url string, cb *CircuitBreaker[string]) (string, error) {
cfg := retry.DefaultConfig()
return retry.Do(ctx, cfg, func() (string, error) {
result, err := cb.Execute(func() (string, error) {
return doHTTPGet(ctx, url)
})
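if err != nil {
// An open (or saturated half-open) breaker is not retryable: fail fast.
if errors.Is(err, gobreaker.ErrOpenState) || errors.Is(err, gobreaker.ErrTooManyRequests) {
return "", err
}
return "", &retry.RetryableError{Err: err}
}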
return result, nil
})
}
Key Design Points:
- RetryableError distinguishes retryable from non-retryable errors
- 5xx is retryable, 4xx is not
- Exponential backoff + jitter prevents retry storms
- Context cancellation stops retries immediately
- Don't retry when circuit breaker is open (fail fast takes priority)
Pattern 4: Timeout Control — context.Context in Practice
Timeout is the most fundamental and important resilience mechanism. Every external call must have a timeout — a call without a timeout is a ticking time bomb.
Timeout Hierarchy Design
Request Total Timeout: 30s
├── API Gateway: 28s (2s margin)
│ ├── Service A: 10s
│ │ ├── DB Query: 5s
│ │ └── Cache Get: 500ms
│ ├── Service B: 15s
│ │ ├── gRPC Call: 12s
│ │ └── Redis Get: 1s
│ └── Service C: 20s
│ ├── HTTP Call: 15s
│ └── MQ Publish: 3s
└── Response Margin: 2s
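Static numbers like these are a starting point. At runtime the parent may already have spent part of its budget, so a child timeout is better derived from the time the parent has left. The following is a minimal sketch of that idea; `budget` and `childContext` are hypothetical helpers, not part of the standard library.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// budget returns the timeout a child call should use: the timeout it wants,
// capped at what the parent has left minus a safety margin. It returns 0
// when the parent has too little time left to be worth starting the call.
func budget(remaining, want, margin time.Duration) time.Duration {
	available := remaining - margin
	if available <= 0 {
		return 0
	}
	if want < available {
		return want
	}
	return available
}

// childContext applies budget to a real context. If the parent has no
// deadline, the wanted timeout is used unchanged. ok is false when the
// call should be skipped because the parent is nearly out of time.
func childContext(parent context.Context, want, margin time.Duration) (ctx context.Context, cancel context.CancelFunc, ok bool) {
	if deadline, has := parent.Deadline(); has {
		want = budget(time.Until(deadline), want, margin)
		if want == 0 {
			return parent, func() {}, false
		}
	}
	ctx, cancel = context.WithTimeout(parent, want)
	return ctx, cancel, true
}

func main() {
	parent, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// The call wants 15s but the parent has about 10s left; keep 2s of margin.
	ctx, cancelChild, ok := childContext(parent, 15*time.Second, 2*time.Second)
	if !ok {
		fmt.Println("not enough time left, skipping call")
		return
	}
	defer cancelChild()
	deadline, _ := ctx.Deadline()
	fmt.Println("child timeout is about", time.Until(deadline).Round(time.Second))
}
```

Skipping a call that cannot finish in time is a form of fail-fast: it avoids sending work downstream whose result the caller will never read.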
Complete Implementation
package timeout
import (
"context"
"fmt"
"io"
"net/http"
"time"
)
type TimeoutConfig struct {
Connect time.Duration
Read time.Duration
Write time.Duration
Overall time.Duration
Graceful time.Duration
}
func DefaultTimeoutConfig() TimeoutConfig {
return TimeoutConfig{
Connect: 5 * time.Second,
Read: 10 * time.Second,
Write: 10 * time.Second,
Overall: 30 * time.Second,
Graceful: 10 * time.Second,
}
}
type CallResult struct {
Data string
Duration time.Duration
TimedOut bool
Err error
}
func CallWithTimeout(ctx context.Context, endpoint string, cfg TimeoutConfig) CallResult {
start := time.Now()
callCtx, cancel := context.WithTimeout(ctx, cfg.Overall)
defer cancel()
resultCh := make(chan CallResult, 1)
go func() {
data, err := doCall(callCtx, endpoint)
elapsed := time.Since(start)
resultCh <- CallResult{
Data: data,
Duration: elapsed,
Err: err,
}
}()
select {
case result := <-resultCh:
return result
case <-callCtx.Done():
return CallResult{
Duration: time.Since(start),
TimedOut: true,
Err: fmt.Errorf("call %s timed out after %v: %w", endpoint, cfg.Overall, callCtx.Err()),
}
}
}
func doCall(ctx context.Context, endpoint string) (string, error) {
req, err := http.NewRequestWithContext(ctx, http.MethodGet, endpoint, nil)
if err != nil {
return "", fmt.Errorf("create request: %w", err)
}
resp, err := http.DefaultClient.Do(req)
if err != nil {
if ctx.Err() == context.DeadlineExceeded {
return "", fmt.Errorf("request deadline exceeded: %w", ctx.Err())
}
return "", fmt.Errorf("do request: %w", err)
}
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body)
if err != nil {
return "", fmt.Errorf("read body: %w", err)
}
return string(body), nil
}
Cascading Timeout Control
func handleRequest(ctx context.Context, req *Request) (*Response, error) {
ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
defer cancel()
var (
user *User
orders []*Order
profile *Profile
)
g, ctx := errgroup.WithContext(ctx)
g.Go(func() error {
userCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
defer cancel()
// err is local to each goroutine; sharing one err variable across
// the three goroutines would be a data race.
var err error
user, err = fetchUser(userCtx, req.UserID)
return err
})
g.Go(func() error {
ordersCtx, cancel := context.WithTimeout(ctx, 3*time.Second)
defer cancel()
var err error
orders, err = fetchOrders(ordersCtx, req.UserID)
return err
})
g.Go(func() error {
profileCtx, cancel := context.WithTimeout(ctx, 1*time.Second)
defer cancel()
var err error
profile, err = fetchProfile(profileCtx, req.UserID)
return err
})
if err := g.Wait(); err != nil {
return nil, fmt.Errorf("handle request: %w", err)
}
return &Response{
User: user,
Orders: orders,
Profile: profile,
}, nil
}
Graceful Shutdown
func main() {
ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
defer stop()
server := &http.Server{Addr: ":8080"}
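// Serve in the background so main can wait for the shutdown signal.
go func() {
if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
fmt.Printf("server error: %v\n", err)
stop() // startup failed: unblock main instead of waiting for a signal
}
}()
<-ctx.Done()
fmt.Println("shutting down...")
// Shutdown blocks until in-flight requests finish or the grace period ends.
// Calling it from main keeps the process alive until draining completes;
// ListenAndServe returns as soon as Shutdown starts, not when it finishes.
shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
if err := server.Shutdown(shutdownCtx); err != nil {
fmt.Printf("shutdown error: %v\n", err)
}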
}
Key Design Points:
- Each layer's timeout must be less than the parent layer's, with margin
- WithTimeout is safer than time.After (auto-releases resources)
- errgroup + sub-contexts enable parallel calls with independent timeouts
- signal.NotifyContext listens for system signals for graceful shutdown
- defer cancel() prevents context leaks
Pattern 5: Bulkhead Isolation — Resource Partitioning
The bulkhead pattern originates from ship design: dividing the hull into watertight compartments so one breach doesn't sink the entire ship. In microservices, isolate resources (connection pools, goroutines, semaphores) for different downstream calls to prevent one slow downstream from dragging down all calls.
Bulkhead Architecture
┌─────────────────────────────────────────┐
│ Service X │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Bulkhead A │ │ Bulkhead B │ │
│ │ Service A │ │ Service B │ │
│ │ ┌──┐┌──┐┌──┐│ │ ┌──┐┌──┐ │ │
│ │ │w1││w2││w3││ │ │w1││w2│ │ │
│ │ └──┘└──┘└──┘│ │ └──┘└──┘ │ │
│ │ pool: 3 │ │ pool: 2 │ │
│ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │
└─────────┼─────────────────┼──────────────┘
│ │
┌─────────▼───────┐ ┌──────▼──────────┐
│ Service A │ │ Service B │
│ (Slow) │ │ (Healthy) │
└─────────────────┘ └─────────────────┘
Slow requests fill up Bulkhead A
→ Doesn't affect Bulkhead B's calls to Service B
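The isolation property in the diagram can be demonstrated with nothing more than buffered channels. This is a sketch using the standard library only, with hypothetical names. It rejects immediately when full, which is one possible policy. Waiting with a timeout is another.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrFull = errors.New("bulkhead full")

// bulkhead limits concurrency with a buffered channel: each slot in the
// channel is one permit. This variant rejects immediately when full.
type bulkhead struct {
	slots chan struct{}
}

func newBulkhead(maxConcurrent int) *bulkhead {
	return &bulkhead{slots: make(chan struct{}, maxConcurrent)}
}

func (b *bulkhead) Execute(fn func() error) error {
	select {
	case b.slots <- struct{}{}: // acquired a permit
	default:
		return ErrFull // no permit free: shed load instead of queueing
	}
	defer func() { <-b.slots }() // released even if fn panics
	return fn()
}

// demo fills bulkhead A with calls that do not finish (a slow downstream)
// and then checks what happens to further calls on A and on B.
func demo() (errA, errB error) {
	a, b := newBulkhead(3), newBulkhead(2)
	stuck := make(chan struct{})   // closed only at the end of demo
	started := make(chan struct{}) // signals that a slow call holds a permit
	for i := 0; i < 3; i++ {
		go a.Execute(func() error {
			started <- struct{}{}
			<-stuck
			return nil
		})
	}
	for i := 0; i < 3; i++ {
		<-started // wait until all three permits of A are taken
	}
	errA = a.Execute(func() error { return nil }) // rejected: A is saturated
	errB = b.Execute(func() error { return nil }) // unaffected: B has its own permits
	close(stuck)
	return errA, errB
}

func main() {
	errA, errB := demo()
	fmt.Println("A:", errA, "B:", errB)
}
```

Because each bulkhead owns its own channel, saturating A cannot consume capacity that belongs to B.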
Complete Implementation
package bulkhead
import (
"context"
"fmt"
"sync"
"sync/atomic"
"time"
"golang.org/x/sync/semaphore"
)
type Bulkhead struct {
name string
sem *semaphore.Weighted
timeout time.Duration
active atomic.Int64
rejected atomic.Int64
timedOut atomic.Int64
}
type Config struct {
Name string
MaxConcurrent int64
Timeout time.Duration
}
func New(cfg Config) *Bulkhead {
return &Bulkhead{
name: cfg.Name,
sem: semaphore.NewWeighted(cfg.MaxConcurrent),
timeout: cfg.Timeout,
}
}
func (b *Bulkhead) Execute(ctx context.Context, fn func() error) error {
acquireCtx, cancel := context.WithTimeout(ctx, b.timeout)
defer cancel()
if err := b.sem.Acquire(acquireCtx, 1); err != nil {
b.rejected.Add(1)
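// Check acquireCtx, not ctx: the acquire deadline lives on the derived context,
// so the parent's Err() stays nil when only the acquire timeout fires.
if acquireCtx.Err() == context.DeadlineExceeded {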
b.timedOut.Add(1)
return fmt.Errorf("bulkhead '%s' acquire timeout: %w", b.name, err)
}
return fmt.Errorf("bulkhead '%s' full: %w", b.name, err)
}
defer b.sem.Release(1)
b.active.Add(1)
defer b.active.Add(-1)
return fn()
}
func (b *Bulkhead) Stats() BulkheadStats {
return BulkheadStats{
Name: b.name,
Active: b.active.Load(),
Rejected: b.rejected.Load(),
TimedOut: b.timedOut.Load(),
}
}
type BulkheadStats struct {
Name string
Active int64
Rejected int64
TimedOut int64
}
Multi-Downstream Bulkhead Manager
type BulkheadManager struct {
bulkheads sync.Map
}
func NewManager() *BulkheadManager {
return &BulkheadManager{}
}
func (m *BulkheadManager) Register(name string, maxConcurrent int64, timeout time.Duration) {
bh := New(Config{
Name: name,
MaxConcurrent: maxConcurrent,
Timeout: timeout,
})
m.bulkheads.Store(name, bh)
}
func (m *BulkheadManager) Execute(ctx context.Context, service string, fn func() error) error {
val, ok := m.bulkheads.Load(service)
if !ok {
return fmt.Errorf("bulkhead not found: %s", service)
}
return val.(*Bulkhead).Execute(ctx, fn)
}
func (m *BulkheadManager) Stats() map[string]BulkheadStats {
stats := make(map[string]BulkheadStats)
m.bulkheads.Range(func(key, value any) bool {
stats[key.(string)] = value.(*Bulkhead).Stats()
return true
})
return stats
}
Usage Example
func setupService() *BulkheadManager {
mgr := NewManager()
mgr.Register("user-service", 20, 5*time.Second)
mgr.Register("order-service", 30, 8*time.Second)
mgr.Register("payment-service", 10, 10*time.Second)
mgr.Register("inventory-service", 15, 5*time.Second)
return mgr
}
func (s *OrderService) CreateOrder(ctx context.Context, req *CreateOrderRequest) error {
var user *User
var inventory *Inventory
var payment *Payment
g, ctx := errgroup.WithContext(ctx)
g.Go(func() error {
return s.bulkheads.Execute(ctx, "user-service", func() error {
var err error
user, err = s.userClient.Get(ctx, req.UserID)
return err
})
})
g.Go(func() error {
return s.bulkheads.Execute(ctx, "inventory-service", func() error {
var err error
inventory, err = s.inventoryClient.Check(ctx, req.ProductID)
return err
})
})
if err := g.Wait(); err != nil {
return fmt.Errorf("create order pre-check: %w", err)
}
return s.bulkheads.Execute(ctx, "payment-service", func() error {
var err error
payment, err = s.paymentClient.Charge(ctx, &ChargeRequest{
UserID: user.ID,
Amount: req.Amount,
})
return err
})
}
Key Design Points:
- Each downstream service has an independent semaphore pool, no mutual interference
- Semaphore acquisition has timeout, avoiding infinite waits
- Track active/rejected/timed-out counts for monitoring
- Combine with errgroup for parallel calls with individual isolation
- sync.Map supports dynamic bulkhead registration
5 Common Pitfalls and Fixes
Pitfall 1: Unreasonable Circuit Breaker Thresholds
❌ Wrong:
cb := gobreaker.NewCircuitBreaker[string](gobreaker.Settings{
ReadyToTrip: func(counts gobreaker.Counts) bool {
return counts.TotalFailures > 3
},
})
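The breaker trips as soon as more than 3 failures accumulate, regardless of how many requests succeeded. That is too sensitive: a brief network jitter is enough to trip it, and with no Interval configured the failure count is never reset while the breaker is closed.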
✅ Correct:
cb := gobreaker.NewCircuitBreaker[string](gobreaker.Settings{
ReadyToTrip: func(counts gobreaker.Counts) bool {
failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
return counts.Requests >= 10 && failureRatio >= 0.6
},
})
Pitfall 2: Retrying Non-Idempotent Operations
❌ Wrong:
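func createOrder(ctx context.Context, req *OrderRequest) error {
_, err := retry.Do(ctx, cfg, func() (struct{}, error) {
// Every failure is marked retryable, including a timeout
// where the insert actually committed.
if err := db.Insert(ctx, req); err != nil {
return struct{}{}, &retry.RetryableError{Err: err}
}
return struct{}{}, nil
})
return err
}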
A plain database insert isn't idempotent: if the first attempt commits but its response is lost or times out, the retry creates a duplicate order.
✅ Correct:
func createOrder(ctx context.Context, req *OrderRequest) error {
return db.Insert(ctx, req)
}
func queryOrder(ctx context.Context, orderID string) (*Order, error) {
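return retry.Do(ctx, cfg, func() (*Order, error) {
order, err := db.FindByID(ctx, orderID)
if err != nil {
// Reads are safe to repeat, so mark failures as retryable.
// (A real implementation would exclude not-found errors.)
return nil, &retry.RetryableError{Err: err}
}
return order, nil
})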
}
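When a write does need retries, a common way to make it safe is an idempotency key: the client sends the same key on every attempt and the server returns the existing result if the key was already used. The sketch below uses an in-memory map as a stand-in for a table with a unique constraint. Every name in it is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// orderStore is an in-memory stand-in for a table with a unique
// constraint on the idempotency key.
type orderStore struct {
	mu     sync.Mutex
	orders map[string]string // idempotency key -> order ID
	nextID int
}

func newOrderStore() *orderStore {
	return &orderStore{orders: make(map[string]string)}
}

// Create returns the existing order when the key was already used,
// so repeating the same request cannot create a second order.
func (s *orderStore) Create(idempotencyKey string) (orderID string, created bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if id, ok := s.orders[idempotencyKey]; ok {
		return id, false
	}
	s.nextID++
	id := fmt.Sprintf("order-%d", s.nextID)
	s.orders[idempotencyKey] = id
	return id, true
}

// createWithLostResponse simulates a write that succeeds on the server
// while the response is lost, followed by one retry with the same key.
func createWithLostResponse(s *orderStore, key string) (string, error) {
	var lastErr error
	for attempt := 0; attempt < 2; attempt++ {
		id, _ := s.Create(key)
		if attempt == 0 {
			lastErr = errors.New("response lost") // the client never saw the ID
			continue
		}
		return id, nil
	}
	return "", lastErr
}

func main() {
	s := newOrderStore()
	id, err := createWithLostResponse(s, "req-42")
	fmt.Println(id, err, "orders stored:", len(s.orders))
}
```

The retry returns the order created by the first attempt, and only one order exists afterwards. In a real database the same effect comes from a unique index on the key column.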
Pitfall 3: Stacking Context Timeouts
❌ Wrong:
func handler(ctx context.Context) {
ctx1, cancel1 := context.WithTimeout(ctx, 30*time.Second)
defer cancel1()
ctx2, cancel2 := context.WithTimeout(ctx1, 20*time.Second)
defer cancel2()
ctx3, cancel3 := context.WithTimeout(ctx2, 15*time.Second)
defer cancel3()
doWork(ctx3)
}
Three layers of timeout stacking — the effective timeout is 15s, but 3 timers are created, wasting resources.
✅ Correct:
func handler(ctx context.Context) {
ctx, cancel := context.WithTimeout(ctx, 15*time.Second)
defer cancel()
doWork(ctx)
}
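A related property is worth checking for yourself: a child context can shorten its parent's deadline but can never extend it. The sketch below compares the two deadlines directly; `effectiveDeadlines` is a hypothetical helper written for this illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// effectiveDeadlines creates a parent context with one timeout and a child
// with another, and reports the deadline each of them ends up with.
func effectiveDeadlines(parentTimeout, childTimeout time.Duration) (parent, child time.Time) {
	pctx, cancelParent := context.WithTimeout(context.Background(), parentTimeout)
	defer cancelParent()
	cctx, cancelChild := context.WithTimeout(pctx, childTimeout)
	defer cancelChild()

	parent, _ = pctx.Deadline()
	child, _ = cctx.Deadline()
	return parent, child
}

func main() {
	// The child asks for 30s but inherits the parent's earlier 15s deadline.
	p, c := effectiveDeadlines(15*time.Second, 30*time.Second)
	fmt.Println("longer child keeps parent deadline:", c.Equal(p))

	// A shorter child timeout does take effect.
	p, c = effectiveDeadlines(30*time.Second, 15*time.Second)
	fmt.Println("shorter child ends before parent:", c.Before(p))
}
```

This is why a single WithTimeout at the level that owns the budget is usually enough. Extra layers are only useful when they genuinely need a shorter deadline.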
Pitfall 4: Rate Limiter Not Isolated Per Client
❌ Wrong:
var limiter = rate.NewLimiter(100, 200)
func handler(w http.ResponseWriter, r *http.Request) {
if !limiter.Allow() {
http.Error(w, "too many requests", 429)
return
}
handle(w, r)
}
A single high-traffic client can exhaust the entire quota with a global shared limiter.
✅ Correct:
var limiters sync.Map
func getLimiter(clientID string) *rate.Limiter {
if v, ok := limiters.Load(clientID); ok {
return v.(*rate.Limiter)
}
l := rate.NewLimiter(100, 200)
actual, _ := limiters.LoadOrStore(clientID, l)
return actual.(*rate.Limiter)
}
Pitfall 5: Bulkhead Semaphore Leak
❌ Wrong:
func (b *Bulkhead) Execute(fn func() error) error {
if err := b.sem.Acquire(context.Background(), 1); err != nil {
return err
}
result := fn()
b.sem.Release(1)
return result
}
If fn() panics, Release is never called. The permit is never returned, so the bulkhead's usable capacity shrinks permanently with each panic.
✅ Correct:
func (b *Bulkhead) Execute(ctx context.Context, fn func() error) error {
if err := b.sem.Acquire(ctx, 1); err != nil {
return err
}
defer b.sem.Release(1)
return fn()
}
Error Troubleshooting Quick Reference
| Error Symptom | Possible Cause | Investigation Method | Solution |
|---|---|---|---|
| circuit breaker open frequently | Downstream failure or threshold too low | Check downstream health, review breaker stats | Adjust ReadyToTrip threshold, increase Timeout wait |
| too many requests 429 errors | Rate limiter burst too small or rate too low | Monitor token consumption rate, check client concurrency | Increase rate and burst, isolate limiters per client |
| Retry doubling request volume | Too many retries, no backoff strategy | Check retry config, monitor retry request ratio | Reduce retry count, use exponential backoff + jitter |
| context deadline exceeded frequent | Timeout too short or downstream slow | Check P99 latency, compare with timeout | Set timeout to 2-3x P99, add fallback logic |
| High bulkhead rejection rate | Bulkhead capacity too small or downstream slow | Check bulkhead Stats, focus on Active/Rejected | Increase MaxConcurrent, add timeout to prevent long holds |
| Retry storm worsening failure | No jitter, all clients retry simultaneously | Monitor retry request time distribution | Add Jitter, use Retry-After header |
| Circuit breaker never recovers | Timeout too long or no half-open probing | Check breaker state change logs | Reduce Timeout, set MaxRequests to allow probing |
| Rate limiter memory growing | Per-key limiters not cleaning expired keys | Monitor sync.Map size | Periodically clean inactive limiters, use LRU eviction |
| Goroutines still running after timeout | Only cancelled context but no check | pprof goroutine profile | Ensure all goroutines check ctx.Done() in select |
| Inter-bulkhead resource contention | Multiple bulkheads sharing connection pool | Check connection pool config and wait times | Independent connection pool per bulkhead, or weighted semaphore |
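The "rate limiter memory growing" row deserves a concrete shape. One possible approach, sketched here with the standard library only, is to record when each key was last used and sweep idle entries periodically. The `registry` type and its methods are hypothetical. In a real service the stored value would be a limiter and `sweep` would run on a ticker.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// entry pairs a per-key value (for example a rate limiter) with the
// time it was last used.
type entry[V any] struct {
	value    V
	lastSeen time.Time
}

// registry is a hypothetical replacement for a bare sync.Map that can
// evict keys which have been idle for longer than ttl.
type registry[V any] struct {
	mu      sync.Mutex
	entries map[string]*entry[V]
	ttl     time.Duration
	newFn   func() V
}

func newRegistry[V any](ttl time.Duration, newFn func() V) *registry[V] {
	return &registry[V]{entries: make(map[string]*entry[V]), ttl: ttl, newFn: newFn}
}

// get returns the value for key, creating it on first use, and records now
// as the time the key was last used.
func (r *registry[V]) get(key string, now time.Time) V {
	r.mu.Lock()
	defer r.mu.Unlock()
	e, ok := r.entries[key]
	if !ok {
		e = &entry[V]{value: r.newFn()}
		r.entries[key] = e
	}
	e.lastSeen = now
	return e.value
}

// sweep removes entries idle for longer than ttl and returns how many were removed.
func (r *registry[V]) sweep(now time.Time) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	removed := 0
	for k, e := range r.entries {
		if now.Sub(e.lastSeen) > r.ttl {
			delete(r.entries, k)
			removed++
		}
	}
	return removed
}

// demo touches two keys, keeps one of them active, and sweeps after the ttl.
func demo() (removed, remaining int) {
	t0 := time.Unix(0, 0)
	r := newRegistry(10*time.Minute, func() int { return 0 })
	r.get("10.0.0.1", t0)
	r.get("10.0.0.2", t0)
	r.get("10.0.0.2", t0.Add(9*time.Minute)) // still active
	removed = r.sweep(t0.Add(11 * time.Minute))
	return removed, len(r.entries)
}

func main() {
	removed, remaining := demo()
	fmt.Println("removed:", removed, "remaining:", remaining)
}
```

An evicted client simply gets a fresh, full bucket on its next request, so the ttl should be comfortably longer than the time a bucket takes to refill.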
Advanced Optimization Techniques
Optimization 1: Adaptive Circuit Breaking
Dynamically adjust circuit breaker thresholds based on real-time metrics:
package adaptive
import (
"sync"
"time"
"github.com/sony/gobreaker/v2"
)
type AdaptiveCircuitBreaker struct {
cb *gobreaker.CircuitBreaker[string]
mu sync.Mutex
window []time.Duration
windowSize int
threshold time.Duration
}
func NewAdaptiveCircuitBreaker(windowSize int, threshold time.Duration) *AdaptiveCircuitBreaker {
acb := &AdaptiveCircuitBreaker{
window: make([]time.Duration, 0, windowSize),
windowSize: windowSize,
threshold: threshold,
}
acb.cb = gobreaker.NewCircuitBreaker[string](gobreaker.Settings{
Name: "adaptive",
Interval: 10 * time.Second,
Timeout: 30 * time.Second,
MaxRequests: 5,
ReadyToTrip: func(counts gobreaker.Counts) bool {
return acb.shouldTrip(counts)
},
})
return acb
}
func (acb *AdaptiveCircuitBreaker) shouldTrip(counts gobreaker.Counts) bool {
acb.mu.Lock()
defer acb.mu.Unlock()
if len(acb.window) < acb.windowSize {
failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
return counts.Requests >= 10 && failureRatio >= 0.5
}
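avgLatency := acb.averageLatency()
failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
// Require a minimum sample so one early failure in a fresh interval
// (1 failure out of 1 request) cannot trip the breaker.
if counts.Requests < 10 {
return false
}
if avgLatency > acb.threshold && failureRatio >= 0.3 {
return true
}
return failureRatio >= 0.6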
}
func (acb *AdaptiveCircuitBreaker) averageLatency() time.Duration {
var total time.Duration
for _, d := range acb.window {
total += d
}
return total / time.Duration(len(acb.window))
}
func (acb *AdaptiveCircuitBreaker) RecordLatency(d time.Duration) {
acb.mu.Lock()
defer acb.mu.Unlock()
acb.window = append(acb.window, d)
if len(acb.window) > acb.windowSize {
acb.window = acb.window[1:]
}
}
Optimization 2: Sliding Window Rate Limiting
Token buckets are good for average rate control; sliding windows more precisely control request counts within a time window:
package slidingwindow
import (
"sync"
"time"
)
type Window struct {
mu sync.Mutex
size time.Duration
limit int
buckets []bucket
count int
}
type bucket struct {
time time.Time
count int
}
func NewWindow(size time.Duration, limit int) *Window {
return &Window{
size: size,
limit: limit,
buckets: make([]bucket, 0),
}
}
func (w *Window) Allow() bool {
w.mu.Lock()
defer w.mu.Unlock()
now := time.Now()
cutoff := now.Add(-w.size)
i := 0
for i < len(w.buckets) && w.buckets[i].time.Before(cutoff) {
i++
}
w.buckets = w.buckets[i:]
w.count = 0
for _, b := range w.buckets {
w.count += b.count
}
if w.count >= w.limit {
return false
}
w.buckets = append(w.buckets, bucket{time: now, count: 1})
w.count++
return true
}
Optimization 3: Resilience Middleware Chain
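Compose the resilience patterns into a reusable middleware chain. Rate limiting, circuit breaking, and timeout are shown here; retry and bulkhead can be wrapped the same way: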
package resilience
import (
"context"
"fmt"
"time"
"github.com/sony/gobreaker/v2"
"golang.org/x/time/rate"
)
// Request and Response are the application's own types (not shown here).
// Handler processes one request.
type Handler func(ctx context.Context, req *Request) (*Response, error)
// Middleware wraps a Handler and returns a new Handler that must call next
// to continue the chain.
type Middleware func(next Handler) Handler
type Chain struct {
middlewares []Middleware
}
func NewChain(middlewares ...Middleware) *Chain {
return &Chain{middlewares: middlewares}
}
// Then wraps final so that the first middleware in the chain runs outermost.
func (c *Chain) Then(final Handler) Handler {
handler := final
for i := len(c.middlewares) - 1; i >= 0; i-- {
handler = c.middlewares[i](handler)
}
return handler
}
func RateLimitMiddleware(limiter *rate.Limiter) Middleware {
return func(next Handler) Handler {
return func(ctx context.Context, req *Request) (*Response, error) {
if !limiter.Allow() {
return nil, fmt.Errorf("rate limit exceeded")
}
return next(ctx, req)
}
}
}
func CircuitBreakerMiddleware(cb *gobreaker.CircuitBreaker[*Response]) Middleware {
return func(next Handler) Handler {
return func(ctx context.Context, req *Request) (*Response, error) {
// The downstream call runs inside the breaker so its failures are counted.
return cb.Execute(func() (*Response, error) {
return next(ctx, req)
})
}
}
}
func TimeoutMiddleware(timeout time.Duration) Middleware {
return func(next Handler) Handler {
return func(ctx context.Context, req *Request) (*Response, error) {
// Pass the derived context on; discarding it would leave the call unbounded.
ctx, cancel := context.WithTimeout(ctx, timeout)
defer cancel()
return next(ctx, req)
}
}
}
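The order in which middlewares are listed decides which one runs outermost, and that matters: a rate limiter placed outside the breaker rejects excess traffic before it is counted against the downstream. The sketch below traces the order of entry and exit. It assumes each middleware receives the next handler, and it uses plain strings instead of a context and request types to stay short. All names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Handler and Middleware use plain strings here to keep the sketch small;
// a real chain would carry a context and request/response types.
type Handler func(req string) string

type Middleware func(next Handler) Handler

// chain wraps final so that the first middleware listed runs outermost.
func chain(final Handler, mws ...Middleware) Handler {
	h := final
	for i := len(mws) - 1; i >= 0; i-- {
		h = mws[i](h)
	}
	return h
}

// tag returns a middleware that records when it is entered and left.
func tag(name string, log *[]string) Middleware {
	return func(next Handler) Handler {
		return func(req string) string {
			*log = append(*log, name+" in")
			resp := next(req)
			*log = append(*log, name+" out")
			return resp
		}
	}
}

// trace runs one request through ratelimit -> breaker -> timeout -> handler
// and returns the order in which the layers were entered and left.
func trace() string {
	var log []string
	h := chain(
		func(req string) string { log = append(log, "handler"); return "ok" },
		tag("ratelimit", &log), tag("breaker", &log), tag("timeout", &log),
	)
	h("GET /orders")
	return strings.Join(log, " > ")
}

func main() {
	fmt.Println(trace())
}
```

Layers are entered in the order listed and left in reverse, so the timeout here bounds only the final handler while the rate limiter sees every request first.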
Resilience Pattern Comparison
| Feature | Circuit Breaker | Rate Limiting | Retry | Timeout | Bulkhead |
|---|---|---|---|---|---|
| Core Purpose | Fail fast, protect downstream | Control request rate | Handle transient faults | Limit wait time | Isolate resources |
| Key Library | sony/gobreaker | golang.org/x/time/rate | Custom | context | semaphore |
| State Management | Stateful (3 states) | Stateful (token count) | Stateless | Stateless | Stateful (semaphore) |
| Applicable Fault | Downstream completely unavailable | Traffic overload | Transient network jitter | Slow downstream response | Partial downstream fault |
| False Positive Risk | Medium (bad threshold) | Low | High (non-idempotent) | Medium (too short) | Low |
| Combination Rec. | With retry | With bulkhead | With circuit breaker | Required everywhere | With rate limiting |
| Production Rating | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Implementation Complexity | Low | Low | Medium | Low | Medium |
Recommended Tools
- JSON Formatter — Format microservice JSON responses, quickly debug data structure issues
- Hash Calculator — Compute request signatures and data checksums for distributed call data consistency
- HTTP Status Code Lookup — Look up HTTP status code meanings, quickly identify resilience-related errors like 429/503
Summary
Microservice resilience isn't "add a circuit breaker and you're done" — it's about answering five questions: When should we reject requests? How do we control request rate? Should we retry on failure? How long before we timeout? Will one fault drag down other calls? The circuit breaker answers "when to reject", rate limiting answers "how to control rate", retry answers "should we retry", timeout answers "how long to wait", and bulkhead answers "how to isolate". Only by combining all 5 patterns can you build truly anti-fragile Go microservices.