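Redis High-Availability Cluster: A Practical Guide

The Evolution of Redis Architecture

Limitations of Standalone Mode

A standalone Redis instance is simple to run, but it faces several problems in production:

Single point of failure: if the server goes down, the service is completely unavailable
Memory bottleneck: data capacity is capped by the memory of a single machine
Performance bottleneck: commands execute on a single thread, so QPS has a ceiling

From Standalone to Sentinel to Cluster

Redis deployments typically evolve through three stages:

Stage	Architecture	High availability	Horizontal scaling	Typical use
1	Standalone	❌	❌	Development/testing
2	Sentinel	✅	❌	Small to medium production
3	Cluster	✅	✅	Large-scale production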

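Redis Sentinel

How Sentinel Works

Redis Sentinel is the official high-availability solution for Redis. A Sentinel system of one or more Sentinel instances can monitor any number of masters together with their replicas:

# sentinel.conf: example Sentinel configuration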
port 26379
sentinel monitor mymaster 127.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 30000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 180000
sentinel auth-pass mymaster your_strong_password

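Failover Mechanism

The complete Sentinel failover sequence:

Subjective down (SDOWN): a single Sentinel considers the master unreachable
Objective down (ODOWN): at least quorum Sentinels agree that the master is unreachable
Leader election: the Sentinels elect one leader to run the failover, using a Raft-like vote that needs a majority of Sentinels
New master selection: candidate replicas are ranked by priority, then replication offset, then run ID
Failover: the chosen replica is promoted to master and the remaining replicas are reconfigured to replicate from it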
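
The ODOWN check and the ranking of candidate replicas described above can be sketched in a few lines of Python. This is an illustrative model, not Sentinel's actual implementation: the `Replica` dataclass and both function names are hypothetical. It assumes the documented rules that a lower `replica-priority` wins, a priority of 0 excludes a replica from promotion, a larger replication offset wins next, and the lexicographically smaller run ID breaks any remaining tie.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Replica:
    """Hypothetical view of the replica fields a Sentinel compares."""
    run_id: str
    priority: int     # replica-priority; 0 means "never promote"
    repl_offset: int  # how much of the master's replication stream was processed


def is_objectively_down(sdown_votes: int, quorum: int) -> bool:
    """ODOWN is reached once at least `quorum` Sentinels report SDOWN."""
    return sdown_votes >= quorum


def pick_new_master(replicas: List[Replica]) -> Optional[Replica]:
    """Rank candidates: lower priority, then higher offset, then smaller run ID."""
    candidates = [r for r in replicas if r.priority > 0]
    if not candidates:
        return None
    return min(candidates, key=lambda r: (r.priority, -r.repl_offset, r.run_id))


replicas = [
    Replica(run_id="b2", priority=100, repl_offset=5000),
    Replica(run_id="a1", priority=100, repl_offset=5200),
    Replica(run_id="c3", priority=0, repl_offset=9999),  # excluded from promotion
]
print(is_objectively_down(sdown_votes=2, quorum=2))
print(pick_new_master(replicas).run_id)
```

A real Sentinel also discards replicas that have been disconnected from the master for too long before it ranks the rest; that filter is left out here to keep the sketch short.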

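# Start the Sentinel group (3 instances)
redis-sentinel /etc/redis/sentinel-26379.conf
redis-sentinel /etc/redis/sentinel-26380.conf
redis-sentinel /etc/redis/sentinel-26381.conf

# Inspect the master's state
redis-cli -p 26379 sentinel master mymaster

# List the replicas (SENTINEL REPLICAS needs Redis 5.0+; older versions use SENTINEL SLAVES)
redis-cli -p 26379 sentinel replicas mymaster

Sentinel Deployment Best Practices

Deploy at least 3 Sentinels so that a majority can still be formed after one of them fails
Place the Sentinels on different physical machines
Do not set down-after-milliseconds too low, or transient network jitter will be misread as a failure
Clients must be Sentinel-aware so that they discover the new master's address automatically after a failover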

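Redis Cluster

How Hash Slots Work

Redis Cluster divides the key space into 16384 hash slots, and each master is responsible for a subset of them:

slot = CRC16(key) % 16384

Example slot assignment across three nodes:

Node	Slot range	Slot count
Node A	0 ~ 5460	5461
Node B	5461 ~ 10922	5462
Node C	10923 ~ 16383	5461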
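
The slot formula and the example table can be tried out in plain Python. The sketch below is a minimal illustration: it relies on the standard library's `binascii.crc_hqx` with an initial value of 0, which computes the same CRC16 variant (XMODEM) that Redis Cluster uses, and it ignores hash tags for brevity. `RANGE_ENDS`, `NODES`, and both function names are made up for this example.

```python
import binascii
from bisect import bisect_left

SLOT_COUNT = 16384

# Last slot owned by each node, taken from the example table above.
RANGE_ENDS = [5460, 10922, 16383]
NODES = ["Node A", "Node B", "Node C"]


def key_slot(key: str) -> int:
    """slot = CRC16(key) % 16384, using the CRC16 XMODEM variant."""
    return binascii.crc_hqx(key.encode("utf-8"), 0) % SLOT_COUNT


def node_for_slot(slot: int) -> str:
    """Find the first range whose last slot is >= the given slot."""
    if not 0 <= slot < SLOT_COUNT:
        raise ValueError(f"slot out of range: {slot}")
    return NODES[bisect_left(RANGE_ENDS, slot)]


slot = key_slot("foo")
print(slot, node_for_slot(slot))
```

On a live cluster, `CLUSTER KEYSLOT <key>` returns the slot directly, so this kind of helper is mainly useful for offline planning and tests.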

Cluster Configuration and Deployment

# redis.conf: cluster node configuration
port 6379
cluster-enabled yes
cluster-config-file nodes-6379.conf
cluster-node-timeout 15000
cluster-announce-ip 192.168.1.101
cluster-announce-port 6379
cluster-announce-bus-port 16379
appendonly yes
requirepass your_strong_password
masterauth your_strong_password

Step-by-Step Cluster Deployment

# Step 1: start 6 Redis instances (3 masters, 3 replicas)
for port in 6379 6380 6381 6382 6383 6384; do
  redis-server /etc/redis/redis-${port}.conf
done

# Step 2: create the cluster
redis-cli --cluster create \
  192.168.1.101:6379 192.168.1.102:6380 192.168.1.103:6381 \
  192.168.1.101:6382 192.168.1.102:6383 192.168.1.103:6384 \
  --cluster-replicas 1 -a your_strong_password

# Step 3: verify the cluster state (-a is needed because requirepass is set)
redis-cli -c -p 6379 -a your_strong_password cluster info
redis-cli -c -p 6379 -a your_strong_password cluster nodes

# Step 4: check the slot assignment
redis-cli -c -p 6379 -a your_strong_password cluster slots

Data Migration and Resharding

Online Resharding

Redis Cluster supports online resharding with no downtime:

# Move 1000 slots from Node A to Node C
redis-cli --cluster reshard 192.168.1.101:6379 \
  --cluster-from <node-a-id> \
  --cluster-to <node-c-id> \
  --cluster-slots 1000 \
  -a your_strong_password

Controlling Data Placement with Hash Tags

When related keys must live on the same node, use a hash tag:

# Only the text inside the braces is hashed to choose the slot
SET user:{1000}:profile "profile_data"
SET user:{1000}:orders "orders_data"
# Both keys map to the same slot
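
To see why the two keys above land in the same slot, the hash-tag rule can be reproduced in Python. This is a sketch under the rule as stated in the Redis Cluster specification: if the key contains a `{` followed later by a `}` with at least one character in between, only that substring is hashed. The function names are illustrative, and `binascii.crc_hqx(..., 0)` stands in for Redis's CRC16.

```python
import binascii


def hash_tag(key: str) -> str:
    """Return the part of the key that is actually hashed.

    Only the text between the first '{' and the first '}' after it is used,
    provided it is non-empty; otherwise the whole key is hashed.
    """
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:
            return key[start + 1:end]
    return key


def key_slot(key: str) -> int:
    """CRC16 (XMODEM variant) of the hashed part, modulo 16384."""
    return binascii.crc_hqx(hash_tag(key).encode("utf-8"), 0) % 16384


profile = key_slot("user:{1000}:profile")
orders = key_slot("user:{1000}:orders")
print(profile == orders)
```

Because every key with the same tag shares one slot, a very popular tag concentrates load on a single node, so tags are best kept as narrow as the multi-key operations require.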

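Notes on Bulk Migration

During migration, the target node marks the slot as importing
The source node marks the slot as migrating
A client that requests a key already moved out of a migrating slot receives an ASK redirect
Run large-scale resharding during off-peak hours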

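Optimizing Common Data Structures

String vs Hash for Storing Objects

For user objects, a Hash is usually a better fit than a String:

# Option 1: String + JSON (simple, but higher memory overhead)
SET user:1000 '{"name":"張三","age":30,"city":"台北"}'

# Option 2: Hash (saves memory, supports partial reads and writes)
HSET user:1000 name "張三" age 30 city "台北"
HGET user:1000 name
# => "張三"

Memory comparison (1 million user objects with 5 fields each; approximate figures that vary with field sizes and Redis version):

Storage	Memory usage	Partial update	Expiry control
String + JSON	~320MB	❌ Full rewrite required	✅ Whole key expires
Hash	~160MB	✅ Single-field update	❌ No per-field expiry (before Redis 7.4)

Optimizing Small Collections with listpack (formerly ziplist)

# Redis 7.0+ replaces ziplist with listpack, and the settings are named accordingly
hash-max-listpack-entries 512
hash-max-listpack-value 64
zset-max-listpack-entries 128
zset-max-listpack-value 64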

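Caching Strategies and Patterns

Cache-Aside

The most widely used caching pattern; reads and writes are handled separately:

# Cache-Aside pattern
# `redis` and `db` stand for an already-configured Redis client and database helper
import json

def get_user(user_id):
    # 1. Check the cache first
    data = redis.get(f"user:{user_id}")
    if data is not None:
        return json.loads(data)

    # 2. Cache miss: query the database
    data = db.query("SELECT * FROM users WHERE id = %s", user_id)
    if data:
        # 3. Populate the cache with an expiry time
        redis.setex(f"user:{user_id}", 3600, json.dumps(data))
    return data

def update_user(user_id, data):
    # 1. Update the database
    db.update("UPDATE users SET ... WHERE id = %s", user_id)
    # 2. Delete the cached entry (rather than updating it)
    redis.delete(f"user:{user_id}")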

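Write-Through

Every write goes through the cache layer first, and the cache layer writes it to the database synchronously:

# Write-Through pattern
def write_through(key, value):
    # The cache layer updates the cache, then persists to the database synchronously
    redis.set(key, value)
    try:
        db.sync_write(key, value)
    except Exception:
        redis.delete(key)  # do not leave unpersisted data in the cache
        raise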

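Write-Behind (Asynchronous Write-Back)

Writes update only the cache; a background task later persists them to the database in batches:

# Write-Behind pattern (asynchronous write-back)
import asyncio
import collections

dirty_key_queue = collections.deque()

def write_behind(key, value):
    redis.set(key, value)
    # Mark the key as dirty; the background task persists it later
    dirty_key_queue.append(key)

async def flush_to_db():
    while True:
        # Drain up to 100 dirty keys per batch
        for _ in range(min(100, len(dirty_key_queue))):
            key = dirty_key_queue.popleft()
            value = redis.get(key)
            db.async_write(key, value)
        await asyncio.sleep(1)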

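The Three Classic Cache Problems and Their Fixes

Cache Penetration

Requests for data that does not exist miss the cache every time and go straight to the database:

# Approach 1: Bloom filter
def get_with_bloom(key):
    if not bloom_filter.might_contain(key):
        return None  # definitely does not exist
    return cache_aside_get(key)

# Approach 2: cache empty results
NULL_MARKER = b"NULL"  # redis-py returns bytes unless decode_responses=True

def get_with_null_cache(key):
    data = redis.get(key)
    if data == NULL_MARKER:
        return None  # hit on a cached empty result
    if data is not None:
        return data
    data = db.query(key)
    if data is None:
        redis.setex(key, 60, NULL_MARKER)  # cache the empty result briefly
    else:
        redis.setex(key, 3600, data)
    return data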
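
The first approach above assumes a `bloom_filter` object with a `might_contain` method. A minimal in-process version could look like the sketch below. It is illustrative only: the class, its default sizes, and the double-hashing scheme over a SHA-256 digest are choices made for this example, not something the approach prescribes. In a multi-instance deployment a shared filter (for instance the `BF.ADD` and `BF.EXISTS` commands of the RedisBloom module) would be a more natural fit.

```python
import hashlib


class BloomFilter:
    """Minimal in-process Bloom filter (illustrative, not tuned for production)."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 7):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key: str):
        # Derive k bit positions from two 64-bit halves of one SHA-256 digest
        # (double hashing: h1 + i * h2).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


bloom_filter = BloomFilter()
for user_key in ("user:1", "user:2", "user:3"):
    bloom_filter.add(user_key)

print(bloom_filter.might_contain("user:2"))
```

A Bloom filter never reports a stored key as absent, but it can report an absent key as present, so a positive answer still has to be confirmed against the cache or database.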

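Cache Breakdown

The moment a hot key expires, a burst of requests falls through to the database:

# Approach: mutex lock (only one caller rebuilds the cache entry)
import time

def get_with_mutex(key, max_retries=50):
    lock_key = f"lock:{key}"
    for _ in range(max_retries):
        data = redis.get(key)
        if data is not None:
            return data
        # Take the lock; it expires after 5 s so a crashed holder cannot block others
        if redis.set(lock_key, 1, nx=True, ex=5):
            try:
                data = db.query(key)
                if data is not None:
                    redis.setex(key, 3600, data)
                return data
            finally:
                # Caveat: if the rebuild took longer than 5 s, this may delete another caller's lock
                redis.delete(lock_key)
        # Another caller is rebuilding the entry: wait briefly, then retry
        time.sleep(0.1)
    raise TimeoutError(f"cache rebuild for {key!r} did not finish in time")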

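Cache Avalanche

A large number of keys expire at the same time, causing a sudden spike in database load:

# Approach: add a random offset to the expiry time
import random

def set_with_jitter(key, value, base_ttl=3600):
    jitter = random.randint(0, 300)  # random offset of 0 to 5 minutes
    redis.setex(key, base_ttl + jitter, value)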

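Memory Optimization Tips

Key Settings

# Memory-related settings
maxmemory 8gb
maxmemory-policy allkeys-lru

# Enable lazy-free (asynchronous) deletion
lazyfree-lazy-eviction yes
lazyfree-lazy-expire yes
lazyfree-lazy-server-del yes

# Shared integer object pool: integers 0-9999 are shared by default
# Integers outside that range are not shared

Choosing an Eviction Policy

Policy	Description	Typical use
noeviction	Never evict; writes fail with an error once memory is full	Data must not be lost
allkeys-lru	LRU across all keys	General-purpose cache
volatile-lru	LRU across keys that have a TTL	Mixed cache and persistent data
allkeys-lfu	LFU across all keys	Clear hot-spot access pattern
volatile-ttl	Evict the keys with the shortest remaining TTL first	Business has explicit priorities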

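Persistence Strategy

RDB vs AOF vs Hybrid Persistence

# RDB snapshot settings
save 900 1
save 300 10
save 60 10000
rdbcompression yes
rdbchecksum yes

# AOF (append-only file) settings
appendonly yes
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb

# Redis 4.0+ hybrid persistence
aof-use-rdb-preamble yes

Property	RDB	AOF	Hybrid
File size	Small	Large	Medium
Recovery speed	Fast	Slow	Fairly fast
Data safety	May lose everything since the last snapshot	Loses at most about 1 second (with appendfsync everysec)	Loses at most about 1 second
Performance impact	During fork	On every write	In between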

Monitoring and Operations

Monitoring with Redis Insight

# Run Redis Insight (the redis/redisinsight image serves its UI on port 5540)
docker run -d --name redis-insight \
  -p 5540:5540 \
  redis/redisinsight:latest

# Read key metrics from the CLI
redis-cli info memory | grep used_memory_human
redis-cli info stats | grep instantaneous_ops_per_sec
redis-cli info replication | grep connected_slaves

Key Metrics to Monitor

Memory usage: investigate when used_memory / maxmemory exceeds 80%
Hit rate: keyspace_hits / (keyspace_hits + keyspace_misses)
Connections: alert when connected_clients approaches maxclients
Slow queries: SLOWLOG GET 10 returns the 10 most recent slow-log entries
Replication lag: master_repl_offset - slave_repl_offset (the master's offset minus the replica's)
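
The ratios above are easy to compute from raw `INFO` output. The sketch below parses the `key:value` lines and derives the hit rate and the memory usage ratio. The function names are illustrative and the sample text is made-up data, not output from a real server.

```python
def parse_info(text: str) -> dict:
    """Parse the `key:value` lines of INFO output, skipping `# Section` headers."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        info[key] = value
    return info


def hit_rate(info: dict) -> float:
    hits = int(info["keyspace_hits"])
    misses = int(info["keyspace_misses"])
    total = hits + misses
    return hits / total if total else 0.0


def memory_usage_ratio(info: dict) -> float:
    maxmemory = int(info["maxmemory"])
    # maxmemory 0 means "no limit", so the ratio is undefined; report 0.0.
    return int(info["used_memory"]) / maxmemory if maxmemory else 0.0


# Made-up sample values, for illustration only.
SAMPLE = """\
# Memory
used_memory:7516192768
maxmemory:8589934592
# Stats
keyspace_hits:900
keyspace_misses:100
"""

info = parse_info(SAMPLE)
print(f"hit rate: {hit_rate(info):.2%}")
print(f"memory usage: {memory_usage_ratio(info):.2%}")
if memory_usage_ratio(info) > 0.80:
    print("memory usage above 80%, needs attention")
```

The hit and miss counters are cumulative since the last restart or `CONFIG RESETSTAT`, so a monitoring job would normally compute the rate from the difference between two samples.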

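Troubleshooting Common Errors

CLUSTERDOWN Errors

# Error message
# (error) CLUSTERDOWN The cluster is not available

# Diagnosis
redis-cli -p 6379 cluster info
# cluster_state:fail usually means some slots are not covered or a majority of masters is unreachable

# Fix: check every node and repair slot coverage
redis-cli --cluster fix 192.168.1.101:6379 -a your_strong_password

MOVED and ASK Redirects

# MOVED: the slot has been permanently moved to another node
# (error) MOVED 3999 192.168.1.103:6381

# ASK: the slot is being migrated (temporary redirect)
# (error) ASK 3999 192.168.1.103:6381

# Fix: the client must follow redirects
redis-cli -c -p 6379  # -c enables cluster mode, which follows redirects automatically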
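
The difference between the two redirects matters to client code: MOVED should update the client's slot map, while ASK applies to the next request only and requires an `ASKING` command first. The sketch below parses both error formats and describes the next step. The `Redirect` type and both functions are hypothetical helpers written for this example; cluster-aware client libraries already do this internally.

```python
from typing import NamedTuple, Optional


class Redirect(NamedTuple):
    kind: str  # "MOVED" or "ASK"
    slot: int
    host: str
    port: int


def parse_redirect(error_message: str) -> Optional[Redirect]:
    """Parse 'MOVED <slot> <host>:<port>' or 'ASK <slot> <host>:<port>'."""
    parts = error_message.split()
    if len(parts) != 3 or parts[0] not in ("MOVED", "ASK"):
        return None
    host, _, port = parts[2].rpartition(":")
    return Redirect(parts[0], int(parts[1]), host, int(port))


def handle(error_message: str, slot_map: dict) -> str:
    """Describe the client's next step; update the slot map only on MOVED."""
    redirect = parse_redirect(error_message)
    if redirect is None:
        return "not a redirect"
    target = f"{redirect.host}:{redirect.port}"
    if redirect.kind == "MOVED":
        slot_map[redirect.slot] = target  # permanent: remember the new owner
        return f"retry on {target}"
    # ASK: one-off redirect; send ASKING first and leave the slot map untouched
    return f"send ASKING, then retry on {target}"


slot_map = {}
print(handle("MOVED 3999 192.168.1.103:6381", slot_map))
print(handle("ASK 3999 192.168.1.103:6381", {}))
```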

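Common Connection Errors

# NOAUTH Authentication required: supply the password
redis-cli -a your_strong_password -p 6379

# CLUSTERDOWN Hash slot not served: check slot coverage
redis-cli --cluster check 192.168.1.101:6379 -a your_strong_password

# BUSY Redis is busy running a script: stop the script first
SCRIPT KILL  # only works if the script has not performed a write yet
CONFIG SET lua-time-limit 5000  # time in ms (default 5000) after which a running script triggers BUSY replies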

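Production Checklist

Pre-Deployment Checks

At least 3 masters and 3 replicas, spread across different physical machines or availability zones
Enable appendonly yes and aof-use-rdb-preamble yes
Set a sensible maxmemory and eviction policy
Configure requirepass and masterauth
Set the kernel parameter vm.overcommit_memory=1
Disable THP: echo never > /sys/kernel/mm/transparent_hugepage/enabled
Raise the file descriptor limit: ulimit -n 65535
Use connection pooling and retries in clients
Have monitoring and alerting in place

Operational Rules

Never run blocking commands such as KEYS *
Give keys a sensible TTL and avoid caching entries forever
Compress or split large values (>10KB)
Use pipelines for batch operations
In cluster mode, pay attention to how hash tags are used in keys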

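FAQ

Q: Should I choose Sentinel or Cluster? A: If the data set fits in a single machine's memory and you only need high availability, choose Sentinel. If you need horizontal scaling, choose Cluster. Do not mix the two.

Q: Can I run multi-key operations such as MGET in Cluster? A: Only when all the keys belong to the same hash slot. Use a hash tag such as {prefix} to keep related keys in one slot.

Q: How many nodes can a cluster have? A: Redis Cluster is designed to scale to roughly 1000 nodes. In production, keeping the cluster to a few dozen masters is advisable.

Q: Should I use RDB or AOF? A: For production, hybrid persistence (aof-use-rdb-preamble yes) is recommended because it balances recovery speed and data safety.

Q: How do I estimate the memory a cluster needs? A: Total memory = data per node × number of masters × 1.5 (the extra 50% covers buffers and overhead). This counts masters only; each replica needs as much memory as its master. Keep the data on each node below 70% of its available memory.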
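
The sizing formula in the last answer can be wrapped in a small helper for quick what-if calculations. This is a rough sketch: the function name is made up, and the `replicas_per_master` parameter is an addition for this example (the formula above counts masters only), based on the fact that each replica holds a full copy of its master's data.

```python
def estimate_cluster_memory_gb(data_per_master_gb: float, masters: int,
                               replicas_per_master: int = 0,
                               headroom: float = 1.5) -> float:
    """Total memory = per-master data x number of data-holding nodes x headroom.

    With replicas_per_master=0 this is exactly the formula from the FAQ.
    """
    if data_per_master_gb < 0 or masters < 1 or replicas_per_master < 0:
        raise ValueError("invalid sizing input")
    nodes = masters * (1 + replicas_per_master)
    return data_per_master_gb * nodes * headroom


# 3 masters holding 10 GB each: the formula as stated in the FAQ.
print(estimate_cluster_memory_gb(10, 3))
# The same cluster with one replica per master (3 masters + 3 replicas).
print(estimate_cluster_memory_gb(10, 3, replicas_per_master=1))
```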
