Redis High Availability Cluster: A Practical Guide
The Evolution of Redis Architecture
Limitations of Standalone Mode
While standalone Redis is simple to use, it faces significant challenges in production:
- Single Point of Failure: Complete service outage when the server goes down
- Memory Bottleneck: Single-machine memory limits data capacity
- Performance Ceiling: Commands execute on a single thread, so QPS is capped by one CPU core
From Standalone to Sentinel to Cluster
Redis architecture has evolved through three stages:
| Stage | Architecture | HA | Horizontal Scaling | Use Case |
|---|---|---|---|---|
| 1 | Standalone | ❌ | ❌ | Dev/Test |
| 2 | Sentinel | ✅ | ❌ | Small-Medium Production |
| 3 | Cluster | ✅ | ✅ | Large-Scale Production |
Redis Sentinel Mode
Sentinel Architecture Principles
Redis Sentinel is the official HA solution. A Sentinel system composed of one or more Sentinel instances can monitor any number of master servers and their replicas:
# sentinel.conf — Sentinel configuration example
port 26379
sentinel monitor mymaster 127.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 30000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 180000
sentinel auth-pass mymaster your_strong_password
Failover Mechanism
The complete Sentinel failover process:
- Subjective Down (SDOWN): A single Sentinel considers the master unavailable
- Objective Down (ODOWN): At least quorum Sentinels agree the master is down
- Leader Election: The Sentinels elect one leader to perform the failover through a Raft-like majority vote
- New Master Election: Lowest replica-priority (0 is never promoted) → largest replication offset → smallest Run ID
- Failover Execution: Promote replica to master, repoint other replicas
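The promotion order in the list above can be sketched as a sort key. This is a simplified illustration rather than Sentinel's actual code: the Replica record and its field names are hypothetical, and a real Sentinel also filters out replicas that are disconnected or have been out of contact too long before ranking them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    # Hypothetical record; field names are illustrative, not Sentinel's API.
    run_id: str
    priority: int      # replica-priority; 0 means "never promote"
    repl_offset: int   # how much of the master's stream was processed


def pick_new_master(replicas: List[Replica]) -> Optional[Replica]:
    """Rank by lowest priority, then largest offset, then smallest run ID."""
    candidates = [r for r in replicas if r.priority != 0]
    if not candidates:
        return None
    return min(candidates, key=lambda r: (r.priority, -r.repl_offset, r.run_id))


replicas = [
    Replica("c3", priority=100, repl_offset=5000),
    Replica("a1", priority=100, repl_offset=7000),
    Replica("b2", priority=100, repl_offset=7000),
    Replica("d4", priority=0, repl_offset=9000),
]
chosen = pick_new_master(replicas)
print(chosen.run_id)  # a1
```

Replica d4 has the most data but is excluded by its priority of 0; a1 and b2 tie on priority and offset, so the smaller run ID wins.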
# Start Sentinel cluster (3 instances)
redis-sentinel /etc/redis/sentinel-26379.conf
redis-sentinel /etc/redis/sentinel-26380.conf
redis-sentinel /etc/redis/sentinel-26381.conf
# Check master status
redis-cli -p 26379 sentinel master mymaster
# List replicas (Redis 5.0+ also accepts: sentinel replicas mymaster)
redis-cli -p 26379 sentinel slaves mymaster
Sentinel Deployment Best Practices
- Deploy at least 3 Sentinel nodes for majority quorum
- Place Sentinel nodes on different physical machines
- Don't set down-after-milliseconds too small, to avoid false positives from network jitter
- Clients must implement Sentinel awareness to auto-discover the new master
Redis Cluster Mode
Hash Slot Principles
Redis Cluster partitions data into 16,384 hash slots, each master node responsible for a subset:
slot = CRC16(key) % 16384
Example cluster node assignment:
| Node | Slot Range | Slot Count |
|---|---|---|
| Node A | 0 ~ 5460 | 5461 |
| Node B | 5461 ~ 10922 | 5462 |
| Node C | 10923 ~ 16383 | 5461 |
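As a rough illustration of the formula and the table above, the following sketch maps a key to a slot and then to a node. It assumes the CRC16 variant is the XMODEM one (which Python's binascii.crc_hqx computes when seeded with 0), reuses the example slot ranges, and deliberately ignores hash tags; the function names are made up for this example.

```python
from binascii import crc_hqx

# Slot ranges copied from the example table above.
SLOT_RANGES = {
    "Node A": (0, 5460),
    "Node B": (5461, 10922),
    "Node C": (10923, 16383),
}


def key_slot(key: str) -> int:
    """CRC16 (XMODEM variant) of the key, modulo 16384. Hash tags are ignored here."""
    return crc_hqx(key.encode("utf-8"), 0) % 16384


def node_for_slot(slot: int) -> str:
    """Find the node whose range contains the slot."""
    for node, (start, end) in SLOT_RANGES.items():
        if start <= slot <= end:
            return node
    raise ValueError(f"slot {slot} is not covered by any node")


for key in ("foo", "hello"):
    slot = key_slot(key)
    print(key, slot, node_for_slot(slot))
# foo 12182 Node C
# hello 866 Node A
```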
Cluster Configuration and Deployment
# redis.conf — Cluster node configuration
port 6379
cluster-enabled yes
cluster-config-file nodes-6379.conf
cluster-node-timeout 15000
cluster-announce-ip 192.168.1.101
cluster-announce-port 6379
cluster-announce-bus-port 16379
appendonly yes
requirepass your_strong_password
masterauth your_strong_password
Step-by-Step Cluster Deployment
# Step 1: Start 6 Redis instances (3 masters + 3 replicas)
for port in 6379 6380 6381 6382 6383 6384; do
redis-server /etc/redis/redis-${port}.conf
done
# Step 2: Create the cluster
redis-cli --cluster create \
192.168.1.101:6379 192.168.1.102:6380 192.168.1.103:6381 \
192.168.1.101:6382 192.168.1.102:6383 192.168.1.103:6384 \
--cluster-replicas 1 -a your_strong_password
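# Step 3: Verify cluster status (-a is needed because requirepass is set)
redis-cli -c -p 6379 -a your_strong_password cluster info
redis-cli -c -p 6379 -a your_strong_password cluster nodes
# Step 4: Check slot distribution
redis-cli -c -p 6379 -a your_strong_password cluster slots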
Data Migration and Resharding
Online Resharding
Redis Cluster supports online resharding without downtime:
# Migrate 1000 slots from Node A to Node C
redis-cli --cluster reshard 192.168.1.101:6379 \
--cluster-from <node-a-id> \
--cluster-to <node-c-id> \
--cluster-slots 1000 \
-a your_strong_password
Using Hash Tags to Control Data Distribution
When related keys must reside on the same node, use Hash Tags:
# Content inside curly braces determines slot assignment
SET user:{1000}:profile "profile_data"
SET user:{1000}:orders "orders_data"
# Both keys will be assigned to the same slot
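To see why the two keys above land together, here is a minimal sketch of the hash tag rule: if the key contains a "{" followed later by a "}" with at least one character in between, only that substring is hashed. The helper names are hypothetical, and the CRC16 step assumes the XMODEM variant available as binascii.crc_hqx.

```python
from binascii import crc_hqx


def hash_tag_part(key: str) -> str:
    """Return the substring that is hashed for this key."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        # Only a non-empty tag counts; "{}" falls back to the whole key.
        if end != -1 and end != start + 1:
            return key[start + 1:end]
    return key


def key_slot(key: str) -> int:
    return crc_hqx(hash_tag_part(key).encode("utf-8"), 0) % 16384


profile_slot = key_slot("user:{1000}:profile")
orders_slot = key_slot("user:{1000}:orders")
print(profile_slot == orders_slot)  # True
```

Only the first "{" and the first "}" after it are considered, so a key such as foo{}{bar} is hashed as a whole.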
Batch Migration Considerations
- During migration, the target node enters importing state
- The source node enters migrating state
- Clients requesting keys in that slot that have already moved to the target receive ASK redirects
- Schedule large-scale resharding during off-peak hours
Common Data Structure Optimizations
String vs Hash for Object Storage
When storing user objects, Hash structures are generally more memory-efficient:
# Approach 1: String + JSON (simple but higher memory overhead)
SET user:1000 '{"name":"John","age":30,"city":"New York"}'
# Approach 2: Hash (saves memory, supports partial read/write)
HSET user:1000 name "John" age 30 city "New York"
HGET user:1000 name
# => "John"
Memory comparison (1 million user objects, 5 fields each; approximate figures that vary with field sizes and Redis version):
| Storage | Memory | Partial Update | Per-Field TTL |
|---|---|---|---|
| String + JSON | ~320MB | ❌ Full rewrite | ❌ Whole key only |
| Hash | ~160MB | ✅ Single field | ❌ Whole key only before Redis 7.4 (HEXPIRE added in 7.4) |
Using listpack (formerly ziplist) for Small Collections
# Redis 7.0+ uses listpack instead of ziplist; older versions use the *-max-ziplist-* option names
hash-max-listpack-entries 512
hash-max-listpack-value 64
zset-max-listpack-entries 128
zset-max-listpack-value 64
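The thresholds above decide whether a small hash keeps the compact encoding. The following sketch mirrors that rule for hashes under the default values shown; stays_listpack is a hypothetical helper, not a Redis API, and on a live server the result can be confirmed with OBJECT ENCODING.

```python
def stays_listpack(fields: dict, max_entries: int = 512, max_value: int = 64) -> bool:
    """True if a hash with these fields fits both listpack thresholds."""
    if len(fields) > max_entries:
        return False
    # Every field name and every value must fit within max_value bytes.
    return all(
        len(str(name).encode()) <= max_value and len(str(value).encode()) <= max_value
        for name, value in fields.items()
    )


small = {"name": "John", "age": 30, "city": "New York"}
large_value = {"bio": "x" * 65}
print(stays_listpack(small))        # True
print(stays_listpack(large_value))  # False
```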
Caching Strategies and Patterns
Cache-Aside Pattern
The most commonly used caching pattern with separate read and write handling:
# Cache-Aside pattern (redis and db are assumed to be pre-configured clients)
import json

def get_user(user_id):
    # 1. Check cache first
    data = redis.get(f"user:{user_id}")
    if data:
        return json.loads(data)
    # 2. Cache miss: query database
    data = db.query("SELECT * FROM users WHERE id = %s", user_id)
    if data:
        # 3. Write to cache with TTL
        redis.setex(f"user:{user_id}", 3600, json.dumps(data))
    return data

def update_user(user_id, data):
    # 1. Update database
    db.update("UPDATE users SET ... WHERE id = %s", user_id)
    # 2. Invalidate cache (not update)
    redis.delete(f"user:{user_id}")
Write-Through Pattern
All writes go through the cache layer, which synchronously writes to the database:
# Write-Through pattern
def write_through(key, value):
    # Cache layer handles synchronous DB write
    redis.set(key, value)
    db.sync_write(key, value)
Write-Behind (Write-Back) Pattern
Writes only update the cache; the backend asynchronously flushes to the database:
# Write-Behind pattern (async write-back)
import asyncio

def write_behind(key, value):
    redis.set(key, value)
    # Mark as dirty, await async flush
    dirty_key_queue.append(key)

async def flush_to_db():
    while True:
        keys = batch_get_dirty_keys(100)
        for key in keys:
            value = redis.get(key)
            db.async_write(key, value)
        await asyncio.sleep(1)
The Three Cache Problems and Solutions
Cache Penetration
Queries for non-existent data bypass cache and hit the database directly:
# Solution 1: Bloom Filter
def get_with_bloom(key):
    if not bloom_filter.might_contain(key):
        return None  # Definitely not present
    return cache_aside_get(key)

# Solution 2: Cache Null Values
# (assumes a client created with decode_responses=True so values are str)
def get_with_null_cache(key):
    data = redis.get(key)
    if data == "NULL":
        return None  # Null cache hit
    if data:
        return data
    data = db.query(key)
    if data:
        redis.setex(key, 3600, data)  # Cache real values with a normal TTL
    else:
        redis.setex(key, 60, "NULL")  # Short TTL for null values
    return data
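The bloom_filter object used in Solution 1 is not defined in this guide. As one possible stand-in, here is a minimal in-process Bloom filter built on the standard library; the class, its default sizes, and the salted SHA-256 hashing scheme are illustrative assumptions, and a production setup would more likely use a shared filter so every application instance sees the same state.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key: str):
        # Derive k bit positions by salting one SHA-256 hash with the index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


bloom_filter = BloomFilter()
for user_id in range(1000):
    bloom_filter.add(f"user:{user_id}")

print(bloom_filter.might_contain("user:42"))  # True
```

A key that was never added usually returns False, but false positives are possible, which is why a positive answer still goes through the normal cache and database lookup.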
Cache Breakdown
A hot key expires, causing a sudden surge of requests to the database:
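# Solution: Mutex lock (only one caller rebuilds the hot key)
import time

def get_with_mutex(key, retries=50):
    lock_key = f"lock:{key}"
    for _ in range(retries):
        data = redis.get(key)
        if data:
            return data
        # Acquire mutex lock; ex=5 bounds how long a crashed holder can block others
        if redis.set(lock_key, 1, nx=True, ex=5):
            try:
                data = db.query(key)
                redis.setex(key, 3600, data)
                return data
            finally:
                redis.delete(lock_key)
        time.sleep(0.1)  # Another caller is rebuilding; wait, then re-check the cache
    raise TimeoutError(f"could not load {key} after {retries} attempts")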
Cache Avalanche
Mass key expiration causes a sudden spike in database load:
# Solution: Add random jitter to TTL
import random

def set_with_jitter(key, value, base_ttl=3600):
    jitter = random.randint(0, 300)  # 0~5 min random offset
    redis.setex(key, base_ttl + jitter, value)
Memory Optimization Techniques
Key Configuration Options
# Memory optimization settings
maxmemory 8gb
maxmemory-policy allkeys-lru
# Enable lazy-free async deletion
lazyfree-lazy-eviction yes
lazyfree-lazy-expire yes
lazyfree-lazy-server-del yes
# Shared integer object pool (0-9999 shared by default)
# Integers outside this range are not shared
Eviction Policy Selection
| Policy | Description | Use Case |
|---|---|---|
| noeviction | No eviction, writes fail | Data must not be lost |
| allkeys-lru | LRU across all keys | General caching |
| volatile-lru | LRU on keys with TTL | Mixed usage |
| allkeys-lfu | LFU across all keys | Clear hot data patterns |
| volatile-ttl | Evict shortest TTL first | Business-defined priority |
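To make the allkeys-lru row concrete, here is a small sketch of exact LRU eviction. It is only an analogy: Redis approximates LRU by sampling a few keys (see maxmemory-samples) and evicts by memory used, whereas this hypothetical TinyLRUCache tracks exact order and counts keys.

```python
from collections import OrderedDict


class TinyLRUCache:
    """Exact LRU over all keys; capacity counts keys, not bytes."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def set(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used


cache = TinyLRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")       # "a" is now more recent than "b"
cache.set("c", 3)    # evicts "b"
print(list(cache.data))  # ['a', 'c']
```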
Persistence Strategies
RDB vs AOF vs Hybrid Persistence
# RDB snapshot configuration
save 900 1
save 300 10
save 60 10000
rdbcompression yes
rdbchecksum yes
# AOF append configuration
appendonly yes
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
# Redis 4.0+ hybrid persistence
aof-use-rdb-preamble yes
| Feature | RDB | AOF | Hybrid |
|---|---|---|---|
| File Size | Small | Large | Medium |
| Recovery Speed | Fast | Slow | Moderate |
| Data Safety | May lose data since last snapshot | ~1 sec loss (with everysec) | ~1 sec loss (with everysec) |
| Performance Impact | During fork | During writes | Balanced |
Monitoring and Operations
Monitoring with Redis Insight
# Install Redis Insight (the redis/redisinsight image serves its UI on port 5540)
docker run -d --name redis-insight \
-p 5540:5540 \
redis/redisinsight:latest
# Fetch key metrics via CLI
redis-cli info memory | grep used_memory_human
redis-cli info stats | grep instantaneous_ops_per_sec
redis-cli info replication | grep connected_slaves
Key Monitoring Metrics
- Memory Usage: used_memory / maxmemory > 80% needs attention
- Hit Rate: keyspace_hits / (keyspace_hits + keyspace_misses)
- Connections: Alert when connected_clients approaches maxclients
- Slow Queries: SLOWLOG GET 10 for recent slow queries
- Replication Lag: master_repl_offset - slave_repl_offset
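The ratios above can be computed from the text that the INFO command returns. The sketch below parses a hand-written sample instead of calling a server; the sample values are made up, the helper names are hypothetical, and in practice the text would come from redis-cli info or a client library.

```python
def parse_info(text: str) -> dict:
    """Parse 'key:value' lines from INFO output, skipping blanks and # headers."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        info[key] = value
    return info


def health_summary(info: dict) -> dict:
    hits = int(info["keyspace_hits"])
    misses = int(info["keyspace_misses"])
    maxmemory = int(info["maxmemory"])
    total = hits + misses
    return {
        "hit_rate": hits / total if total else None,
        # maxmemory 0 means "no limit", so the ratio is undefined
        "memory_ratio": int(info["used_memory"]) / maxmemory if maxmemory else None,
        "client_ratio": int(info["connected_clients"]) / int(info["maxclients"]),
    }


# Made-up sample resembling INFO output.
SAMPLE_INFO = """\
# Memory
used_memory:6871947674
maxmemory:8589934592
# Stats
keyspace_hits:900
keyspace_misses:100
# Clients
connected_clients:50
maxclients:10000
"""

summary = health_summary(parse_info(SAMPLE_INFO))
print(summary)
```

With this sample the memory ratio sits right at the 80% attention threshold and the hit rate is 0.9.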
Common Error Troubleshooting
CLUSTERDOWN Error
# Error message
# (error) CLUSTERDOWN The cluster is down
# Troubleshooting steps
redis-cli -p 6379 -a your_strong_password cluster info
# cluster_state:fail means some slots are uncovered or a majority of masters is unreachable
# Fix: check and repair all nodes
redis-cli --cluster fix 192.168.1.101:6379 -a your_strong_password
MOVED and ASK Redirects
# MOVED: slot permanently migrated to new node
# (error) MOVED 3999 192.168.1.103:6381
# ASK: slot is being migrated (temporary redirect)
# (error) ASK 3999 192.168.1.103:6381
# Solution: client must implement smart redirects
redis-cli -c -p 6379 # -c enables cluster mode with auto-redirect
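For readers writing their own client logic, the difference between the two redirects can be sketched as follows. The Redirect type and both functions are hypothetical; the sketch only parses the error text shown above and decides what a client would do next, without opening any connection.

```python
from typing import NamedTuple, Optional


class Redirect(NamedTuple):
    kind: str   # "MOVED" or "ASK"
    slot: int
    host: str
    port: int


def parse_redirect(error_message: str) -> Optional[Redirect]:
    """Parse 'MOVED <slot> <host>:<port>' or 'ASK <slot> <host>:<port>'."""
    parts = error_message.split()
    if len(parts) != 3 or parts[0] not in ("MOVED", "ASK"):
        return None
    host, _, port = parts[2].rpartition(":")
    return Redirect(parts[0], int(parts[1]), host, int(port))


def handle_redirect(redirect: Redirect, slot_map: dict) -> list:
    """Return the commands to send before retrying on the target node."""
    if redirect.kind == "MOVED":
        # Permanent: remember the new owner for all future requests.
        slot_map[redirect.slot] = (redirect.host, redirect.port)
        return []
    # Temporary: leave the slot map alone and prefix one retry with ASKING.
    return ["ASKING"]


slot_map = {}
moved = parse_redirect("MOVED 3999 192.168.1.103:6381")
print(handle_redirect(moved, slot_map), slot_map)
# [] {3999: ('192.168.1.103', 6381)}

ask = parse_redirect("ASK 3999 192.168.1.103:6381")
print(handle_redirect(ask, {}))
# ['ASKING']
```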
Common Connection Errors
# NOAUTH Authentication required
redis-cli -a your_strong_password -p 6379
# CLUSTERDOWN Hash slot not served
redis-cli --cluster check 192.168.1.101:6379
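# BUSY Redis is busy running a script
SCRIPT KILL # Stop the running script (works only if it has not written data yet)
CONFIG SET lua-time-limit 5000 # Adjust how long (ms) a script may run before BUSY replies start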
Production Environment Checklist
Pre-Deployment Checks
- At least 3 masters + 3 replicas across different physical machines/availability zones
- Enable appendonly yes and aof-use-rdb-preamble yes
- Set appropriate maxmemory and eviction policy
- Configure requirepass and masterauth
- Set system vm.overcommit_memory=1
- Disable THP: echo never > /sys/kernel/mm/transparent_hugepage/enabled
- Set file descriptor limit: ulimit -n 65535
- Client implements connection pool and retry mechanism
- Monitoring and alerting configured
Operational Standards
- Prohibit blocking commands like KEYS *
- Set reasonable TTL on keys, avoid permanent caching
- Large values (>10KB) should be compressed or split
- Use Pipeline for batch operations
- Use Hash Tags wisely in cluster mode
FAQ
Q: Should I choose Sentinel or Cluster?
A: For data that fits in a single machine and only needs HA, choose Sentinel. For horizontal scaling, choose Cluster. Don't mix both.
Q: Can I use MGET with multiple keys in Cluster mode?
A: Only when all keys belong to the same hash slot. Use Hash Tags {prefix} to ensure related keys share a slot.
Q: What's the maximum number of nodes in a cluster?
A: Redis Cluster is designed to scale up to about 1,000 nodes. In practice, keep it to a few dozen masters.
Q: Should I use RDB or AOF?
A: For production, use hybrid persistence (aof-use-rdb-preamble yes) for the best balance of recovery speed and data safety.
Q: How to estimate cluster memory requirements?
A: Total memory = per-node data × master count × 1.5 (50% overhead buffer). Keep per-node data below 70% of available memory.
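As a worked example of the estimate above, the following sketch applies the formula and the 70% rule to made-up numbers. The function name and parameters are hypothetical, and the formula as stated counts masters only; replicas sized like their masters would add to the total.

```python
def estimate_cluster_memory(per_node_data_gb: float, master_count: int,
                            overhead: float = 1.5, max_fill: float = 0.70) -> dict:
    # Total memory = per-node data x master count x overhead factor
    total_gb = per_node_data_gb * master_count * overhead
    # Smallest node RAM that keeps the data below the fill limit
    min_node_ram_gb = per_node_data_gb / max_fill
    return {"total_gb": total_gb, "min_node_ram_gb": round(min_node_ram_gb, 1)}


print(estimate_cluster_memory(per_node_data_gb=10, master_count=6))
# {'total_gb': 90.0, 'min_node_ram_gb': 14.3}
```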