Redis High Availability Cluster: A Practical Guide

The Evolution of Redis Architecture

Limitations of Standalone Mode

While standalone Redis is simple to use, it faces significant challenges in production:

Single Point of Failure: Complete service outage when the server goes down
Memory Bottleneck: Single-machine memory limits data capacity
Performance Ceiling: Command execution is single-threaded, so one instance's QPS is capped by a single CPU core

From Standalone to Sentinel to Cluster

Redis architecture has evolved through three stages:

Stage	Architecture	HA	Horizontal Scaling	Use Case
1	Standalone	❌	❌	Dev/Test
2	Sentinel	✅	❌	Small-Medium Production
3	Cluster	✅	✅	Large-Scale Production

Redis Sentinel Mode

Sentinel Architecture Principles

Redis Sentinel is the official HA solution. A Sentinel system composed of one or more Sentinel instances can monitor any number of master servers and their replicas:

# sentinel.conf — Sentinel configuration example
port 26379
sentinel monitor mymaster 127.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 30000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 180000
sentinel auth-pass mymaster your_strong_password

Failover Mechanism

The complete Sentinel failover process:

Subjective Down (SDOWN): A single Sentinel considers the master unavailable
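Objective Down (ODOWN): At least quorum Sentinels agree the master is down
Leader Election: Sentinels use a Raft-like vote to elect the one that performs the failover; the leader needs votes from a majority of all Sentinels
New Master Election: Candidates are ranked by replica priority (lower wins), then replication offset (larger wins), then run ID (lexicographically smaller wins)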
Failover Execution: Promote replica to master, repoint other replicas
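The new-master ordering in the list above can be sketched as a sort key. This is a minimal illustration, not Sentinel's actual code: the `Replica` class and its field names are hypothetical. It assumes Sentinel's documented conventions: a lower `replica-priority` wins, a priority of 0 excludes a replica from promotion, a larger replication offset wins, and the lexicographically smaller run ID breaks ties.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    # Hypothetical record; field names are illustrative only.
    run_id: str
    priority: int      # replica-priority from redis.conf (0 = never promote)
    repl_offset: int   # how much of the master's stream the replica processed


def pick_new_master(replicas: List[Replica]) -> Optional[Replica]:
    # Replicas with priority 0 are never promoted.
    candidates = [r for r in replicas if r.priority > 0]
    if not candidates:
        return None
    # Lower priority first, then larger offset, then smaller run ID.
    return min(candidates, key=lambda r: (r.priority, -r.repl_offset, r.run_id))


replicas = [
    Replica("c3", priority=100, repl_offset=900),
    Replica("a1", priority=100, repl_offset=1200),
    Replica("b2", priority=100, repl_offset=1200),
    Replica("d4", priority=0, repl_offset=5000),
]
print(pick_new_master(replicas).run_id)  # a1
```

Here `d4` has the most data but is excluded by its priority of 0, and `a1` beats `b2` only on the run ID tie-break.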

# Start Sentinel cluster (3 instances)
redis-sentinel /etc/redis/sentinel-26379.conf
redis-sentinel /etc/redis/sentinel-26380.conf
redis-sentinel /etc/redis/sentinel-26381.conf

# Check master status
redis-cli -p 26379 sentinel master mymaster

# List replicas (Redis 5.0+ also accepts: sentinel replicas mymaster)
redis-cli -p 26379 sentinel slaves mymaster

Sentinel Deployment Best Practices

Deploy at least 3 Sentinel nodes for majority quorum
Place Sentinel nodes on different physical machines
Avoid setting down-after-milliseconds too low; otherwise brief network jitter can be mistaken for a failure and trigger an unnecessary failover
Clients must implement Sentinel awareness to auto-discover the new master
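The recommendation of at least three Sentinels comes down to simple arithmetic: marking the master ODOWN needs `quorum` agreeing Sentinels, but authorizing the failover needs a majority of all Sentinels. A small sketch of that arithmetic (the function names are illustrative):

```python
def majority(total_sentinels: int) -> int:
    # Smallest number of Sentinels that forms a strict majority.
    return total_sentinels // 2 + 1


def tolerated_failures(total_sentinels: int) -> int:
    # Sentinels that can be down while a failover can still be authorized.
    return total_sentinels - majority(total_sentinels)


for n in (2, 3, 5):
    print(n, majority(n), tolerated_failures(n))
# 2 2 0
# 3 2 1
# 5 3 2
```

With two Sentinels, losing either one blocks failover entirely, which is why three is the practical minimum.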

Redis Cluster Mode

Hash Slot Principles

Redis Cluster partitions the keyspace into 16,384 hash slots, with each master node responsible for a subset:

slot = CRC16(key) % 16384
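As a quick way to check the formula by hand, the slot can be computed in Python. This sketch relies on `binascii.crc_hqx` with an initial value of 0, which corresponds to the CRC16 variant (XMODEM, polynomial 0x1021) that Redis Cluster uses. It deliberately ignores hash tags, and the function name `key_slot` is illustrative.

```python
import binascii

SLOT_COUNT = 16384


def key_slot(key: str) -> int:
    # CRC-CCITT with initial value 0 is the XMODEM variant of CRC16.
    return binascii.crc_hqx(key.encode("utf-8"), 0) % SLOT_COUNT


print(key_slot("foo"))  # 12182
```

The result should match what `CLUSTER KEYSLOT foo` returns on a live cluster.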

Example cluster node assignment:

Node	Slot Range	Slot Count
Node A	0 ~ 5460	5461
Node B	5461 ~ 10922	5462
Node C	10923 ~ 16383	5461
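An even split like the one in the table can be derived as follows. This is a sketch of one reasonable way to divide the slots, not necessarily the exact procedure `redis-cli --cluster create` follows; `split_slots` is a hypothetical helper.

```python
SLOT_COUNT = 16384


def split_slots(master_count: int) -> list:
    """Return (first_slot, last_slot, slot_count) for each master."""
    ranges = []
    per_node = SLOT_COUNT / master_count
    first, cursor = 0, 0.0
    for i in range(master_count):
        last = round(cursor + per_node - 1)
        if i == master_count - 1:
            last = SLOT_COUNT - 1  # last master takes any remainder
        ranges.append((first, last, last - first + 1))
        first = last + 1
        cursor += per_node
    return ranges


for first, last, count in split_slots(3):
    print(first, last, count)
# 0 5460 5461
# 5461 10922 5462
# 10923 16383 5461
```

The three counts sum to 16,384, so every slot is covered exactly once.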

Cluster Configuration and Deployment

# redis.conf — Cluster node configuration
port 6379
cluster-enabled yes
cluster-config-file nodes-6379.conf
cluster-node-timeout 15000
cluster-announce-ip 192.168.1.101
cluster-announce-port 6379
cluster-announce-bus-port 16379
appendonly yes
requirepass your_strong_password
masterauth your_strong_password

Step-by-Step Cluster Deployment

# Step 1: Start 6 Redis instances (3 masters + 3 replicas)
for port in 6379 6380 6381 6382 6383 6384; do
  redis-server /etc/redis/redis-${port}.conf
done

# Step 2: Create the cluster
redis-cli --cluster create \
  192.168.1.101:6379 192.168.1.102:6380 192.168.1.103:6381 \
  192.168.1.101:6382 192.168.1.102:6383 192.168.1.103:6384 \
  --cluster-replicas 1 -a your_strong_password

# Step 3: Verify cluster status
redis-cli -c -p 6379 -a your_strong_password cluster info
redis-cli -c -p 6379 -a your_strong_password cluster nodes

# Step 4: Check slot distribution
redis-cli -c -p 6379 -a your_strong_password cluster slots

Data Migration and Resharding

Online Resharding

Redis Cluster supports online resharding without downtime:

# Migrate 1000 slots from Node A to Node C
redis-cli --cluster reshard 192.168.1.101:6379 \
  --cluster-from <node-a-id> \
  --cluster-to <node-c-id> \
  --cluster-slots 1000 \
  -a your_strong_password

Using Hash Tags to Control Data Distribution

When related keys must reside on the same node, use Hash Tags:

# Content inside curly braces determines slot assignment
SET user:{1000}:profile "profile_data"
SET user:{1000}:orders "orders_data"
# Both keys will be assigned to the same slot
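The hash tag rule can be checked offline with a short sketch. It follows the rule from the Redis Cluster specification: only the text between the first `{` and the next `}` is hashed, and an empty tag or a missing `}` means the whole key is hashed. The helper names are illustrative.

```python
import binascii


def hash_tag(key: str) -> str:
    start = key.find("{")
    if start == -1:
        return key
    end = key.find("}", start + 1)
    # An empty tag ("{}") or a missing "}" means the whole key is hashed.
    if end == -1 or end == start + 1:
        return key
    return key[start + 1:end]


def key_slot(key: str) -> int:
    # CRC16 (XMODEM) of the effective key, modulo the 16,384 slots.
    return binascii.crc_hqx(hash_tag(key).encode("utf-8"), 0) % 16384


print(hash_tag("user:{1000}:profile"))  # 1000
print(key_slot("user:{1000}:profile") == key_slot("user:{1000}:orders"))  # True
```

Because both keys hash only the tag `1000`, multi-key commands such as MGET can address them together.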

Batch Migration Considerations

During migration, the target node enters importing state
The source node enters migrating state
Clients requesting a key that has already moved to the target node receive an ASK redirect from the source node
Schedule large-scale resharding during off-peak hours

Common Data Structure Optimizations

String vs Hash for Object Storage

When storing user objects, Hash structures are generally more memory-efficient:

# Approach 1: String + JSON (simple but higher memory overhead)
SET user:1000 '{"name":"John","age":30,"city":"New York"}'

# Approach 2: Hash (saves memory, supports partial read/write)
HSET user:1000 name "John" age 30 city "New York"
HGET user:1000 name
# => "John"

Memory comparison (1 million user objects, 5 fields each):

Storage	Memory	Partial Update	TTL Granularity
String + JSON	~320MB	❌ Full rewrite	Whole key
Hash	~160MB	✅ Single field	Whole key (per-field TTL only in Redis 7.4+)

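Using listpack (formerly ziplist) for Small Collections

# Collections at or below these thresholds use the compact listpack encoding
# (Redis 7.0+ replaced ziplist with listpack; older versions use the *-ziplist-* names)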
hash-max-listpack-entries 512
hash-max-listpack-value 64
zset-max-listpack-entries 128
zset-max-listpack-value 64

Caching Strategies and Patterns

Cache-Aside Pattern

The most commonly used caching pattern with separate read and write handling:

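# Cache-Aside pattern
# Assumes `redis` is a connected redis-py client and `db` is the application's
# database helper; both are placeholders in these examples
import json

def get_user(user_id):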
    # 1. Check cache first
    data = redis.get(f"user:{user_id}")
    if data:
        return json.loads(data)

    # 2. Cache miss — query database
    data = db.query("SELECT * FROM users WHERE id = %s", user_id)
    if data:
        # 3. Write to cache with TTL
        redis.setex(f"user:{user_id}", 3600, json.dumps(data))
    return data

def update_user(user_id, data):
    # 1. Update database
    db.update("UPDATE users SET ... WHERE id = %s", user_id)
    # 2. Invalidate cache (not update)
    redis.delete(f"user:{user_id}")

Write-Through Pattern

All writes go through the cache layer, which synchronously writes to the database:

# Write-Through pattern
def write_through(key, value):
    # Persist to the database first so a failed write never leaves
    # unpersisted data in the cache, then update the cache synchronously
    db.sync_write(key, value)
    redis.set(key, value)

Write-Behind (Write-Back) Pattern

Writes only update the cache; the backend asynchronously flushes to the database:

# Write-Behind pattern (async write-back)
def write_behind(key, value):
    redis.set(key, value)
    # Mark as dirty, await async flush
    dirty_key_queue.append(key)

async def flush_to_db():
    while True:
        keys = batch_get_dirty_keys(100)
        for key in keys:
            value = redis.get(key)
            db.async_write(key, value)
        await asyncio.sleep(1)

The Three Cache Problems and Solutions

Cache Penetration

Queries for non-existent data bypass cache and hit the database directly:

# Solution 1: Bloom Filter
def get_with_bloom(key):
    if not bloom_filter.might_contain(key):
        return None  # Definitely not present
    return cache_aside_get(key)

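# Solution 2: Cache Null Values
# Assumes the client was created with decode_responses=True so GET returns str
def get_with_null_cache(key):
    data = redis.get(key)
    if data == "NULL":
        return None  # Null cache hit
    if data is not None:
        return data
    data = db.query(key)
    if data is None:
        redis.setex(key, 60, "NULL")  # Short TTL for null values
    else:
        redis.setex(key, 3600, data)  # Cache real values with a normal TTL
    return data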
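Solution 1 above assumes a `bloom_filter` object with a `might_contain` method. As a rough illustration of what such an object does, here is a minimal in-process Bloom filter. The class, its default sizes, and the salted SHA-256 hashing scheme are assumptions made for this sketch. A production deployment would more likely use a shared filter such as the RedisBloom module.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits: int = 8192, num_hashes: int = 4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting the hashed input.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bloom_filter = BloomFilter()
for user_id in range(100):
    bloom_filter.add(f"user:{user_id}")

print(bloom_filter.might_contain("user:42"))  # True
```

A Bloom filter never returns a false negative, so a `False` result safely short-circuits the database lookup. An occasional false positive only costs one extra query.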

Cache Breakdown

A hot key expires, causing a sudden surge of requests to the database:

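# Solution: Mutex lock (only one caller rebuilds the expired key)
import time

def get_with_mutex(key, retries=50):
    lock_key = f"lock:{key}"
    for _ in range(retries):
        # Re-check the cache on every attempt; another caller may have rebuilt it
        data = redis.get(key)
        if data is not None:
            return data
        # Acquire mutex lock (auto-expires after 5 s if the holder crashes)
        if redis.set(lock_key, 1, nx=True, ex=5):
            try:
                data = db.query(key)
                redis.setex(key, 3600, data)
                return data
            finally:
                # If a rebuild can outlive the lock TTL, store a unique token
                # and verify it before deleting instead of a plain DELETE
                redis.delete(lock_key)
        time.sleep(0.1)  # Lock held elsewhere: wait, then retry
    raise TimeoutError(f"cache rebuild for {key} timed out")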

Cache Avalanche

Mass key expiration causes a sudden spike in database load:

# Solution: Add random jitter to TTL
import random

def set_with_jitter(key, value, base_ttl=3600):
    jitter = random.randint(0, 300)  # 0~5 min random offset
    redis.setex(key, base_ttl + jitter, value)

Memory Optimization Techniques

Key Configuration Options

# Memory optimization settings
maxmemory 8gb
maxmemory-policy allkeys-lru

# Enable lazy-free async deletion
lazyfree-lazy-eviction yes
lazyfree-lazy-expire yes
lazyfree-lazy-server-del yes

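# Shared integer object pool (0-9999 shared by default)
# Integers outside this range are not shared; sharing is also skipped when
# maxmemory is set with an LRU/LFU policy, since each object then tracks its own access data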

Eviction Policy Selection

Policy	Description	Use Case
noeviction	No eviction, writes fail	Data must not be lost
allkeys-lru	LRU across all keys	General caching
volatile-lru	LRU on keys with TTL	Mixed usage
allkeys-lfu	LFU across all keys	Clear hot data patterns
volatile-ttl	Evict shortest TTL first	Business-defined priority

Persistence Strategies

RDB vs AOF vs Hybrid Persistence

# RDB snapshot configuration
save 900 1
save 300 10
save 60 10000
rdbcompression yes
rdbchecksum yes

# AOF append configuration
appendonly yes
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb

# Redis 4.0+ hybrid persistence
aof-use-rdb-preamble yes

Feature	RDB	AOF	Hybrid
File Size	Small	Large	Medium
Recovery Speed	Fast	Slow	Moderate
Data Safety	May lose data since last snapshot	~1 sec loss (with everysec)	~1 sec loss (with everysec)
Performance Impact	During fork	During writes	Balanced

Monitoring and Operations

Monitoring with Redis Insight

# Install Redis Insight (the redis/redisinsight image serves its UI on port 5540)
docker run -d --name redis-insight \
  -p 5540:5540 \
  redis/redisinsight:latest

# Fetch key metrics via CLI
redis-cli info memory | grep used_memory_human
redis-cli info stats | grep instantaneous_ops_per_sec
redis-cli info replication | grep connected_slaves

Key Monitoring Metrics

Memory Usage: used_memory / maxmemory > 80% needs attention
Hit Rate: keyspace_hits / (keyspace_hits + keyspace_misses)
Connections: Alert when connected_clients approaches maxclients
Slow Queries: SLOWLOG GET 10 for recent slow queries
Replication Lag: master_repl_offset - slave_repl_offset
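The hit-rate and memory-usage formulas above are easy to compute from the fields that `INFO` returns. The sketch below works on a plain dictionary so it runs without a server. The sample numbers are made up, and with redis-py the dictionary would typically come from the client's `info()` call.

```python
def hit_rate(info: dict) -> float:
    hits = info["keyspace_hits"]
    total = hits + info["keyspace_misses"]
    # Avoid division by zero on a freshly started instance.
    return hits / total if total else 0.0


def memory_ratio(info: dict) -> float:
    # maxmemory == 0 means "no limit", so there is no ratio to report.
    return info["used_memory"] / info["maxmemory"] if info["maxmemory"] else 0.0


# Made-up sample values for illustration.
sample = {
    "keyspace_hits": 9500,
    "keyspace_misses": 500,
    "used_memory": 7 * 1024**3,
    "maxmemory": 8 * 1024**3,
}

print(f"{hit_rate(sample):.2%}")     # 95.00%
print(memory_ratio(sample) > 0.8)    # True
```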

Common Error Troubleshooting

CLUSTERDOWN Error

# Error message
# (error) CLUSTERDOWN The cluster is not available

# Troubleshooting steps
redis-cli -p 6379 -a your_strong_password cluster info
# cluster_state:fail usually means some hash slots are uncovered or a majority of masters is unreachable

# Fix: check and repair all nodes
redis-cli --cluster fix 192.168.1.101:6379 -a your_strong_password

MOVED and ASK Redirects

# MOVED: slot permanently migrated to new node
# (error) MOVED 3999 192.168.1.103:6381

# ASK: slot is being migrated (temporary redirect)
# (error) ASK 3999 192.168.1.103:6381

# Solution: client must implement smart redirects
redis-cli -c -p 6379  # -c enables cluster mode with auto-redirect
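To make the difference between the two redirects concrete, here is a sketch of how a client might parse them. Cluster-aware client libraries already do this internally, and the `Redirect` type and `parse_redirect` function are hypothetical. The parser expects the bare error message, without the `(error)` prefix that redis-cli prints.

```python
from typing import NamedTuple, Optional


class Redirect(NamedTuple):
    kind: str   # "MOVED" or "ASK"
    slot: int
    host: str
    port: int


def parse_redirect(error: str) -> Optional[Redirect]:
    parts = error.split()
    if len(parts) != 3 or parts[0] not in ("MOVED", "ASK"):
        return None  # not a redirect error
    host, _, port = parts[2].rpartition(":")
    return Redirect(parts[0], int(parts[1]), host, int(port))


print(parse_redirect("MOVED 3999 192.168.1.103:6381"))
# Redirect(kind='MOVED', slot=3999, host='192.168.1.103', port=6381)
```

On MOVED, a client should update its slot map and send future requests for that slot to the new node. On ASK, it should send `ASKING` followed by the command to the target node for this one request only, leaving the slot map unchanged.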

Common Connection Errors

# NOAUTH Authentication required
redis-cli -a your_strong_password -p 6379

# CLUSTERDOWN Hash slot not served
redis-cli --cluster check 192.168.1.101:6379

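# BUSY Redis is busy running a script
SCRIPT KILL  # Stops the running script if it has not performed any writes yet
CONFIG SET lua-time-limit 5000  # Time (ms) before Redis starts replying BUSY; it does not abort the script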

Production Environment Checklist

Pre-Deployment Checks

At least 3 masters + 3 replicas across different physical machines/availability zones
Enable appendonly yes and aof-use-rdb-preamble yes
Set appropriate maxmemory and eviction policy
Configure requirepass and masterauth
Set system vm.overcommit_memory=1
Disable THP: echo never > /sys/kernel/mm/transparent_hugepage/enabled
Set file descriptor limit: ulimit -n 65535
Client implements connection pool and retry mechanism
Monitoring and alerting configured

Operational Standards

Prohibit blocking commands like KEYS *
Set reasonable TTL on keys, avoid permanent caching
Large values (>10KB) should be compressed or split
Use Pipeline for batch operations
Use Hash Tags wisely in cluster mode

FAQ

Q: Should I choose Sentinel or Cluster? A: For data that fits in a single machine and only needs HA, choose Sentinel. For horizontal scaling, choose Cluster. Don't mix both.

Q: Can I use MGET with multiple keys in Cluster mode? A: Only when all keys belong to the same hash slot. Use Hash Tags {prefix} to ensure related keys share a slot.

Q: What's the maximum number of nodes in a cluster? A: Redis Cluster is designed to scale up to 1,000 nodes. In practice, keep it to a few dozen masters.

Q: Should I use RDB or AOF? A: For production, use hybrid persistence (aof-use-rdb-preamble yes) for the best balance of recovery speed and data safety.

Q: How to estimate cluster memory requirements? A: Total memory = per-node data × master count × 1.5 (50% overhead buffer). Keep per-node data below 70% of available memory.
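As a worked example of the estimate above, the formula can be written as a small helper. The function names and the `replicas_per_master` parameter are additions for illustration. The FAQ formula itself covers masters only, but each replica holds a full copy of its master's data and needs the same memory again.

```python
def estimate_cluster_memory_gb(per_node_data_gb: float, master_count: int,
                               overhead_factor: float = 1.5,
                               replicas_per_master: int = 0) -> float:
    # FAQ formula: per-node data x master count x 1.5.
    masters_total = per_node_data_gb * master_count * overhead_factor
    # Assumption beyond the FAQ: each replica mirrors its master's memory.
    return masters_total * (1 + replicas_per_master)


def max_data_per_node_gb(node_memory_gb: float, max_fill: float = 0.7) -> float:
    # Keep per-node data below 70% of the memory available on the node.
    return round(node_memory_gb * max_fill, 2)


print(estimate_cluster_memory_gb(10, 3))                         # 45.0
print(estimate_cluster_memory_gb(10, 3, replicas_per_master=1))  # 90.0
print(max_data_per_node_gb(16))                                  # 11.2
```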
