TiDB HTAP Real-Time Analytics: 6 Production Patterns from TiFlash to Real-Time Dashboards
T+1 Analytics Is Dead — Real-Time Insights Are the Future
3 AM, online orders surge 10x, CEO asks: "What's the real-time GMV by category?" — You can only answer: "Data is still in the ETL pipeline, you'll see it tomorrow."
This is the dilemma of traditional T+1 analytics:
- Fragile ETL pipelines: Kafka → Flink → ClickHouse, any single point of failure breaks the chain
- Uncontrollable data latency: Minutes to hours from OLTP to OLAP is the norm
- Dual-system ops nightmare: MySQL for transactions, ClickHouse for analytics, data consistency by manual effort
- Cost overruns: Two storage systems, two compute engines, two ops teams, low resource utilization
- Insufficient real-time capability: Business decisions need second-level responses, T+1 reports are legacy
TiDB HTAP's promise: One system, simultaneously handling OLTP and OLAP, data synced from row storage to column storage with zero latency. This isn't a slide-ware architecture — it's a solution validated in production at thousands of enterprises.
HTAP Core Concepts Quick Reference
| Concept | Full Name | Core Role | Analogy |
|---|---|---|---|
| HTAP | Hybrid Transactional/Analytical Processing | One system handles both transactions and analytics | One person acting as both trader and analyst |
| TiFlash | TiDB Flash | Columnar storage engine, handles OLAP queries | Dedicated OLAP accelerator |
| TiKV | TiDB Key-Value | Row storage engine, handles OLTP transactions | OLTP cornerstone |
| MPP | Massively Parallel Processing | Multi-node parallel computation framework | Multiple people computing different shards simultaneously |
| Columnar Storage | Columnar Storage | Stores data by column, analytics queries only read needed columns | Only flipping the pages you need |
| Raft Learner | Raft consensus protocol learner role | Asynchronously replicates data to TiFlash without affecting the main flow | Shadow scribe |
| Data Mart | Data Mart | Pre-aggregated data set for a specific business domain | Customized report showcase for each department |
💡 Use the JSON Formatter to format TiDB configuration files for easier troubleshooting.
5 Pain Points of Traditional Real-Time Analytics Architecture
Pain Point 1: ETL Pipeline Is the Biggest Single Point of Failure
MySQL → Canal → Kafka → Flink → ClickHouse
│ │ │ │ │
│ │ │ │ └─ Write failure, data loss
│ │ │ └─ Job crash, data backlog
│ │ └─ Uneven partitions, consumption lag
│ └─ Binlog parsing error
└─ Replication lag
Any single link failure breaks the entire pipeline. Late-night alerts never stop, ops teams suffer endlessly.
Pain Point 2: Data Consistency Cannot Be Guaranteed
OLTP and OLAP are two independent systems, data sync has a latency window:
-- MySQL: Order has been refunded
UPDATE orders SET status = 'refunded' WHERE id = 10086;
-- ClickHouse: Refund status only visible 5 minutes later
-- During those 5 minutes, report data is wrong
Pain Point 3: Dual-System Ops Costs Double
| Ops Item | MySQL | ClickHouse | Total |
|---|---|---|---|
| Monitoring & Alerts | Grafana + Prometheus | Grafana + Prometheus | 2 sets |
| Backup & Recovery | mysqldump / XtraBackup | clickhouse-backup | 2 sets |
| Capacity Planning | Primary-replica + sharding | Distributed tables + replicas | 2 sets |
| Version Upgrades | Rolling upgrade | Rolling upgrade | 2 sets |
| Staffing | DBA | ClickHouse expert | 2 teams |
Pain Point 4: Query Performance vs Data Freshness Dilemma
- ClickHouse pursues query performance, requires MergeTree compaction, data has latency
- For real-time needs, you use ReplacingMergeTree, but queries need FINAL, causing performance to plummet
- Real-time and performance — you can't have both
Pain Point 5: Low Development Efficiency
// Developers must maintain two data models
public class OrderService {
// OLTP: MySQL
@Transactional
public void createOrder(Order order) {
orderMapper.insert(order);
}
// OLAP: ClickHouse - write it again
public void syncToClickHouse(Order order) {
clickHouseMapper.insert(order);
// Field mapping, type conversion, error handling...
}
}
6 Production-Grade HTAP Patterns
Pattern 1: TiFlash Columnar Replica Setup
TiFlash receives TiKV data changes asynchronously through the Raft Learner role, achieving real-time sync from row storage to column storage.
1.1 Deploy TiFlash Nodes
# tiup topology.yaml snippet
tiflash_servers:
- host: 10.0.1.21
tcp_port: 9000
http_port: 8123
flash_service_port: 3930
flash_proxy_port: 20170
flash_proxy_status_port: 20292
metrics_interval: 15
data_dir: /data/tiflash/data
deploy_dir: /data/tiflash/deploy
config:
logger:
level: information
storage:
main:
capacity: "2TB"
latest:
capacity: "200GB"
- host: 10.0.1.22
tcp_port: 9000
http_port: 8123
flash_service_port: 3930
flash_proxy_port: 20170
flash_proxy_status_port: 20292
data_dir: /data/tiflash/data
deploy_dir: /data/tiflash/deploy
# Scale out TiFlash nodes
tiup cluster scale-out tidb-prod topology-tiflash.yaml
# Verify TiFlash node status
tiup cluster display tidb-prod | grep tiflash
1.2 Add TiFlash Replicas to Tables
-- Add TiFlash replica for entire table (default 1 replica)
ALTER TABLE orders SET TIFLASH REPLICA 1;
-- Add 2 TiFlash replicas (high availability)
ALTER TABLE orders SET TIFLASH REPLICA 2;
-- Add TiFlash replica for partitioned table
ALTER TABLE orders_partition SET TIFLASH REPLICA 1;
-- Check replica sync progress
SELECT * FROM information_schema.tiflash_replica
WHERE table_name = 'orders';
Sync progress output example:
+----------+------------+----------+---------------+------------------+-----------+
| TABLE_ID | TABLE_NAME | REPLICA | LOCATION_LABELS| AVAILABLE | PROGRESS |
+----------+------------+----------+---------------+------------------+-----------+
| 85 | orders | 1 | | True | 1.0 |
+----------+------------+----------+---------------+------------------+-----------+
1.3 Add TiFlash Replica by Column (Save Storage)
-- Create TiFlash replica only for columns needed for analytics
ALTER TABLE orders SET TIFLASH REPLICA 1
COLUMNS (order_id, user_id, product_id, amount, status, created_at);
-- Verify column-level replica
SHOW CREATE TABLE orders;
Pattern 2: MPP Distributed Query Acceleration
TiDB's MPP engine pushes analytical queries down to multiple TiFlash nodes for parallel execution, achieving query performance close to ClickHouse.
2.1 Enable MPP Mode
-- Check current MPP mode
SHOW VARIABLES LIKE 'tidb_allow_mpp';
-- Enable MPP (enabled by default)
SET SESSION tidb_allow_mpp = 1;
-- Force MPP usage (for testing/verification)
SET SESSION tidb_enforce_mpp = 1;
-- Set MPP query Exchange compression
SET SESSION tidb_exchange_compression = 'ON';
2.2 MPP Query in Action: Real-Time Sales Analytics
-- Scenario: Real-time top 10 category sales today
-- MPP automatically pushes computation to TiFlash nodes for parallel execution
SELECT
c.category_name,
SUM(oi.quantity * oi.unit_price) AS total_sales,
COUNT(DISTINCT o.user_id) AS unique_buyers,
AVG(oi.quantity * oi.unit_price) AS avg_order_value
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN categories c ON p.category_id = c.category_id
WHERE o.created_at >= CURDATE()
AND o.status IN ('paid', 'shipped', 'completed')
GROUP BY c.category_name
ORDER BY total_sales DESC
LIMIT 10;
2.3 Verify MPP Execution Plan
-- View execution plan, confirm MPP operators
EXPLAIN ANALYZE
SELECT
c.category_name,
SUM(oi.quantity * oi.unit_price) AS total_sales
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN categories c ON p.category_id = c.category_id
WHERE o.created_at >= CURDATE()
GROUP BY c.category_name;
Key execution plan characteristics:
+------------------------------------------+----------+---------+--------------+
| id | task | rows | operator info|
+------------------------------------------+----------+---------+--------------+
| TableReader_45 | root | 10 | ... |
| └─ExchangeSender_44 | mpp[tif] | 10 | ExchangeType |
| └─Projection_42 | mpp[tif] | 10 | ... |
| └─HashAgg_43 | mpp[tif] | 10 | group by:... |
| └─ExchangeReceiver_38 | mpp[tif] | 100 | ... |
| └─ExchangeSender_37 | mpp[tif] | 100 | ExchangeType |
| └─HashJoin_36 | mpp[tif] | 100 | inner join |
| ├─Selection_25(Build) | mpp[tif] | 500 | ... |
| │ └─TableFullScan_24 | mpp[tif] | 10000 | tikv:tiflash |
| └─Selection_30(Probe) | mpp[tif] | 800 | ... |
| └─TableFullScan_29 | mpp[tif] | 20000 | tikv:tiflash |
+------------------------------------------+----------+---------+--------------+
Seeing
mpp[tif]andExchangeSender/Receiverindicates MPP is in effect.
2.4 MPP Performance Comparison
-- Disable MPP, use traditional TiKV scan
SET SESSION tidb_allow_mpp = 0;
EXPLAIN ANALYZE
SELECT c.category_name, SUM(oi.quantity * oi.unit_price) AS total_sales
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN categories c ON p.category_id = c.category_id
WHERE o.created_at >= CURDATE()
GROUP BY c.category_name;
-- Re-enable MPP
SET SESSION tidb_allow_mpp = 1;
| Query Mode | Data Volume | Execution Time | Scanned Rows |
|---|---|---|---|
| TiKV Scan | 100M rows | 45.2s | 100,000,000 |
| TiFlash Non-MPP | 100M rows | 12.8s | 100,000,000 |
| TiFlash MPP | 100M rows | 3.2s | 100,000,000 |
Pattern 3: TiCDC Real-Time ETL Replacement
TiCDC (Change Data Capture) can replace traditional ETL pipelines, syncing TiDB data changes to downstream systems in real-time.
3.1 Deploy TiCDC
# tiup topology.yaml snippet
cdc_servers:
- host: 10.0.1.31
port: 8300
deploy_dir: /data/cdc/deploy
data_dir: /data/cdc/data
config:
per-table-memory-quota: 10485760
sink-memory-quota: 536870912
- host: 10.0.1.32
port: 8300
deploy_dir: /data/cdc/deploy
data_dir: /data/cdc/data
# Scale out TiCDC
tiup cluster scale-out tidb-prod topology-cdc.yaml
# Verify TiCDC status
tiup cluster display tidb-prod | grep cdc
3.2 Create Sync Task: Replace MySQL → Kafka → Flink Pipeline
# Create TiCDC sync task: TiDB → Kafka (replacing Canal + Kafka)
cdc cli changefeed create \
--server=http://10.0.1.31:8300 \
--sink-uri="kafka://kafka-cluster:9092/tidb-orders?partition-num=6&max-message-bytes=10485760&replication-factor=3" \
--changefeed-id="orders-to-kafka" \
--config=cdc-orders.yaml
# cdc-orders.yaml
filter:
rules:
- 'ecommerce.orders'
- 'ecommerce.order_items'
- 'ecommerce.payments'
ignore_txn_start_ts: []
mounter:
worker-num: 16
sink:
protocol: "canal-json"
dispatchers:
- matcher: ['ecommerce.*']
partition: "index-value"
selector:
rules:
- 'ecommerce.orders'
- 'ecommerce.order_items'
- 'ecommerce.payments'
3.3 Direct Sync to ClickHouse (If Still Needed)
# TiCDC → ClickHouse direct connection
cdc cli changefeed create \
--server=http://10.0.1.31:8300 \
--sink-uri="clickhouse://clickhouse-server:8123?database=analytics&username=default&password=***" \
--changefeed-id="orders-to-clickhouse" \
--config=cdc-ch.yaml
3.4 TiCDC Sync to S3 (Data Lake Solution)
# TiCDC → S3 Data Lake
cdc cli changefeed create \
--server=http://10.0.1.31:8300 \
--sink-uri="s3://data-lake/tidb-cdc/?endpoint=https://s3.amazonaws.com&access-key=***&secret-access-key=***" \
--changefeed-id="orders-to-s3" \
--config=cdc-s3.yaml
# cdc-s3.yaml
sink:
protocol: "csv"
column-selectors:
- matcher: ['ecommerce.orders']
columns: ['order_id', 'user_id', 'amount', 'status', 'created_at']
file:
size: 67108864
interval: 5m
filter:
rules:
- 'ecommerce.orders'
- 'ecommerce.order_items'
3.5 Monitor TiCDC Sync Latency
-- Check sync task status
cdc cli changefeed list \
--server=http://10.0.1.31:8300;
-- Check sync latency
cdc cli changefeed query \
--changefeed-id=orders-to-kafka \
--server=http://10.0.1.31:8300;
Key metrics:
{
"state": "normal",
"tso": 446685327530688513,
"checkpoint": "2026-06-15 10:30:00.123",
"lag": "1.2s"
}
Pattern 4: Dashboard Query Optimization
Real-time dashboards are the most direct application scenario for HTAP, but query performance directly impacts user experience.
4.1 Dashboard Query Pattern Identification
-- Typical dashboard query 1: Real-time GMV
SELECT
DATE_FORMAT(created_at, '%Y-%m-%d %H:00') AS hour_slot,
SUM(amount) AS gmv,
COUNT(*) AS order_count
FROM orders
WHERE created_at >= DATE_SUB(NOW(), INTERVAL 24 HOUR)
AND status != 'cancelled'
GROUP BY hour_slot
ORDER BY hour_slot;
-- Typical dashboard query 2: Real-time user activity
SELECT
DATE_FORMAT(login_time, '%Y-%m-%d %H:00') AS hour_slot,
COUNT(DISTINCT user_id) AS dau,
COUNT(*) AS total_actions
FROM user_actions
WHERE login_time >= DATE_SUB(NOW(), INTERVAL 24 HOUR)
GROUP BY hour_slot
ORDER BY hour_slot;
-- Typical dashboard query 3: Real-time conversion funnel
SELECT
COUNT(DISTINCT CASE WHEN event = 'view' THEN user_id END) AS views,
COUNT(DISTINCT CASE WHEN event = 'cart' THEN user_id END) AS carts,
COUNT(DISTINCT CASE WHEN event = 'pay' THEN user_id END) AS pays,
ROUND(COUNT(DISTINCT CASE WHEN event = 'pay' THEN user_id END) * 100.0
/ NULLIF(COUNT(DISTINCT CASE WHEN event = 'view' THEN user_id END), 0), 2) AS conversion_rate
FROM user_events
WHERE created_at >= DATE_SUB(NOW(), INTERVAL 1 HOUR);
4.2 Force Queries to Use TiFlash
-- Method 1: Force TiFlash via Hint
SELECT /*+ read_from_storage(tiflash[orders]) */
DATE_FORMAT(created_at, '%Y-%m-%d %H:00') AS hour_slot,
SUM(amount) AS gmv
FROM orders
WHERE created_at >= DATE_SUB(NOW(), INTERVAL 24 HOUR)
GROUP BY hour_slot;
-- Method 2: Set Session variable
SET SESSION tidb_isolation_read_engines = 'tiflash';
SELECT
DATE_FORMAT(created_at, '%Y-%m-%d %H:00') AS hour_slot,
SUM(amount) AS gmv
FROM orders
WHERE created_at >= DATE_SUB(NOW(), INTERVAL 24 HOUR)
GROUP BY hour_slot;
SET SESSION tidb_isolation_read_engines = 'tikv,tiflash';
-- Method 3: Isolate OLAP traffic via Resource Group
CREATE RESOURCE GROUP IF NOT EXISTS olap_group
RU_PER_SEC = 5000
PRIORITY = LOW;
ALTER USER dashboard_user RESOURCE GROUP olap_group;
4.3 Dashboard Query Cache Optimization
-- Use Binding to fix execution plan
CREATE SESSION BINDING FOR
SELECT SUM(amount) FROM orders WHERE created_at >= DATE_SUB(NOW(), INTERVAL 24 HOUR)
USING
SELECT /*+ read_from_storage(tiflash[orders]) */ SUM(amount)
FROM orders WHERE created_at >= DATE_SUB(NOW(), INTERVAL 24 HOUR);
-- View Bindings
SHOW SESSION BINDINGS;
4.4 Dashboard Refresh Strategy
-- Dashboard queries should use READ ONLY mode to avoid competing with OLTP
SET SESSION tidb_isolation_read_engines = 'tiflash';
-- Limit dashboard query resource consumption
SET SESSION max_execution_time = 30000;
-- Use prepared statements to reduce parsing overhead
PREPARE dashboard_gmv FROM
SELECT
DATE_FORMAT(created_at, '%Y-%m-%d %H:00') AS hour_slot,
SUM(amount) AS gmv
FROM orders
WHERE created_at >= ?
AND status != 'cancelled'
GROUP BY hour_slot
ORDER BY hour_slot;
EXECUTE dashboard_gmv USING DATE_SUB(NOW(), INTERVAL 24 HOUR);
Pattern 5: Materialized Views for Data Mart
TiDB 8.x supports Materialized Views, which can pre-compute aggregated data and build business-oriented data marts.
5.1 Create Materialized Views
-- Create daily sales summary materialized view
CREATE MATERIALIZED VIEW mv_daily_sales
REFRESH EVERY 1 HOUR
AS
SELECT
DATE(created_at) AS sale_date,
c.category_name,
p.product_name,
SUM(oi.quantity) AS total_quantity,
SUM(oi.quantity * oi.unit_price) AS total_revenue,
COUNT(DISTINCT o.user_id) AS unique_buyers,
AVG(oi.quantity * oi.unit_price) AS avg_price
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN categories c ON p.category_id = c.category_id
WHERE o.status IN ('paid', 'shipped', 'completed')
GROUP BY DATE(created_at), c.category_name, p.product_name;
5.2 Create User Behavior Materialized View
-- User behavior funnel materialized view
CREATE MATERIALIZED VIEW mv_user_funnel
REFRESH EVERY 5 MINUTE
AS
SELECT
DATE(created_at) AS event_date,
COUNT(DISTINCT CASE WHEN event_type = 'page_view' THEN user_id END) AS page_views,
COUNT(DISTINCT CASE WHEN event_type = 'add_to_cart' THEN user_id END) AS add_to_carts,
COUNT(DISTINCT CASE WHEN event_type = 'checkout' THEN user_id END) AS checkouts,
COUNT(DISTINCT CASE WHEN event_type = 'payment' THEN user_id END) AS payments,
ROUND(
COUNT(DISTINCT CASE WHEN event_type = 'payment' THEN user_id END) * 100.0
/ NULLIF(COUNT(DISTINCT CASE WHEN event_type = 'page_view' THEN user_id END), 0),
2
) AS view_to_pay_rate
FROM user_events
GROUP BY DATE(created_at);
5.3 Manage Materialized Views
-- View materialized view status
SHOW MATERIALIZED VIEWS;
-- Manually refresh materialized view
ALTER MATERIALIZED VIEW mv_daily_sales REFRESH;
-- Pause auto-refresh
ALTER MATERIALIZED VIEW mv_daily_sales DISABLE;
-- Resume auto-refresh
ALTER MATERIALIZED VIEW mv_daily_sales ENABLE;
-- Drop materialized view
DROP MATERIALIZED VIEW IF EXISTS mv_daily_sales;
5.4 Dashboard Queries Based on Materialized Views
-- Query materialized view (millisecond response)
SELECT
sale_date,
category_name,
total_revenue,
unique_buyers
FROM mv_daily_sales
WHERE sale_date >= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
ORDER BY total_revenue DESC
LIMIT 20;
Pattern 6: Production HTAP Deployment and Monitoring
Production HTAP deployment requires resource isolation, monitoring/alerting, and failure recovery.
6.1 Resource Control Configuration
-- Create OLTP resource group (high priority)
CREATE RESOURCE GROUP IF NOT EXISTS oltp_group
RU_PER_SEC = 20000
PRIORITY = HIGH;
-- Create OLAP resource group (low priority, limited resources)
CREATE RESOURCE GROUP IF NOT EXISTS olap_group
RU_PER_SEC = 8000
PRIORITY = LOW;
-- Bind users to resource groups
ALTER USER app_user RESOURCE GROUP oltp_group;
ALTER USER dashboard_user RESOURCE GROUP olap_group;
ALTER USER etl_user RESOURCE GROUP olap_group;
-- View resource group configuration
SELECT * FROM information_schema.resource_groups;
6.2 TiFlash Key Monitoring Metrics
# Grafana Dashboard Key Metrics
# TiFlash Summary Dashboard
tiflash_system_profile:
- metric: tiflash_storage_write_duration_seconds
alert_threshold: "> 5s"
description: "TiFlash write latency"
- metric: tiflash_mpp_task_duration_seconds
alert_threshold: "> 60s"
description: "MPP task execution duration"
- metric: tiflash_storage_segment_count
alert_threshold: "> 10000"
description: "Segment count (too many requires compaction)"
- metric: tiflash_cpu_usage
alert_threshold: "> 80%"
description: "TiFlash CPU usage"
- metric: tiflash_storage_disk_usage_ratio
alert_threshold: "> 0.8"
description: "TiFlash disk usage ratio"
6.3 Prometheus Alert Rules
# tidb-htap-alerts.yaml
groups:
- name: tidb_htap
rules:
- alert: TiFlashReplicaSyncLag
expr: |
tiflash_replica_available == 0
or (tiflash_replica_progress < 0.95 and tiflash_replica_progress > 0)
for: 10m
labels:
severity: warning
annotations:
summary: "TiFlash replica sync lag"
description: "Table {{ $labels.table_name }} TiFlash replica sync progress only {{ $value }}"
- alert: TiFlashHighWriteDuration
expr: |
histogram_quantile(0.95,
rate(tiflash_storage_write_duration_seconds_bucket[5m])
) > 5
for: 5m
labels:
severity: warning
annotations:
summary: "TiFlash write latency too high"
description: "P95 write latency {{ $value }}s"
- alert: MPPQueryTimeout
expr: |
rate(tiflash_mpp_task_error_count{reason="timeout"}[5m]) > 0.1
for: 3m
labels:
severity: critical
annotations:
summary: "MPP query timeout rate too high"
description: "MPP query timeout rate {{ $value }}/s"
- alert: TiFlashDiskSpaceLow
expr: |
tiflash_storage_disk_usage_ratio > 0.85
for: 5m
labels:
severity: critical
annotations:
summary: "TiFlash disk space running low"
description: "Disk usage ratio {{ $value }}"
6.4 Daily Operations Script
#!/bin/bash
# tidb-htap-healthcheck.sh
# TiDB HTAP health check script
CLUSTER_NAME="tidb-prod"
LOG_FILE="/var/log/tidb-htap-healthcheck.log"
echo "=== TiDB HTAP Health Check $(date) ===" >> $LOG_FILE
# Check TiFlash node status
echo "--- TiFlash Node Status ---" >> $LOG_FILE
tiup cluster display $CLUSTER_NAME 2>/dev/null | grep tiflash >> $LOG_FILE
# Check TiFlash replica sync status
echo "--- TiFlash Replica Sync Status ---" >> $LOG_FILE
mysql -h10.0.1.10 -P4000 -uroot -e "
SELECT table_schema, table_name, replica_count, available, progress
FROM information_schema.tiflash_replica
WHERE available = 0 OR progress < 1.0;
" >> $LOG_FILE 2>&1
# Check TiCDC sync lag
echo "--- TiCDC Sync Lag ---" >> $LOG_FILE
cdc cli changefeed list --server=http://10.0.1.31:8300 >> $LOG_FILE 2>&1
# Check MPP query performance
echo "--- MPP Query Performance ---" >> $LOG_FILE
mysql -h10.0.1.10 -P4000 -uroot -e "
SELECT DIGEST_TEXT, AVG_LATENCY, EXEC_COUNT
FROM information_schema.statements_summary
WHERE DIGEST_TEXT LIKE '%GROUP BY%'
AND AVG_LATENCY > 1000000000
ORDER BY AVG_LATENCY DESC
LIMIT 10;
" >> $LOG_FILE 2>&1
echo "=== Health Check Complete ===" >> $LOG_FILE
5 Common Pitfalls Guide
❌ Pitfall 1: Querying Before TiFlash Replica Is Synced
-- ❌ Wrong: Query immediately after adding TiFlash replica, data may be incomplete
ALTER TABLE orders SET TIFLASH REPLICA 1;
-- Immediate query may read incomplete data
SELECT COUNT(*) FROM orders;
-- ✅ Correct: Wait for replica sync to complete
ALTER TABLE orders SET TIFLASH REPLICA 1;
-- Check sync progress
SELECT table_name, progress, available
FROM information_schema.tiflash_replica
WHERE table_name = 'orders';
-- Wait for available = 1 before querying
SELECT COUNT(*) FROM orders;
❌ Pitfall 2: Adding TiFlash Replicas to Small Tables
-- ❌ Wrong: Adding TiFlash replica to small tables is slower (network overhead > scan overhead)
ALTER TABLE config_dict SET TIFLASH REPLICA 1;
-- ✅ Correct: Keep small tables on TiKV, only add TiFlash replicas for large tables
-- Rule of thumb: Consider TiFlash replicas only when data > 1M rows
-- Force small tables to use TiKV via Hint
SELECT /*+ read_from_storage(tikv[config_dict]) */
* FROM config_dict WHERE type = 'payment';
❌ Pitfall 3: OLAP Queries Impacting OLTP Performance
-- ❌ Wrong: OLAP and OLTP using same user, no resource isolation
SELECT user, resource_group FROM mysql.user WHERE user = 'app_user';
-- app_user | default
-- ✅ Correct: Use Resource Group for isolation
CREATE RESOURCE GROUP IF NOT EXISTS oltp_group
RU_PER_SEC = 20000 PRIORITY = HIGH;
CREATE RESOURCE GROUP IF NOT EXISTS olap_group
RU_PER_SEC = 8000 PRIORITY = LOW;
ALTER USER app_user RESOURCE GROUP oltp_group;
ALTER USER dashboard_user RESOURCE GROUP olap_group;
-- Set timeout for OLAP queries
SET SESSION max_execution_time = 60000;
❌ Pitfall 4: Insufficient TiFlash Node Disk Space
# ❌ Wrong: TiFlash data directory on system disk
tiflash_servers:
- host: 10.0.1.21
data_dir: /var/lib/tiflash/data # System disk has limited space
# ✅ Correct: TiFlash uses dedicated data disk, reserve 30% space
tiflash_servers:
- host: 10.0.1.21
data_dir: /data/tiflash/data # Dedicated data disk
config:
storage:
main:
capacity: "2TB" # Explicit capacity limit
latest:
capacity: "200GB"
❌ Pitfall 5: Ignoring TiCDC Sync Latency Alerts
# ❌ Wrong: Not monitoring TiCDC latency, data sync breaks without anyone knowing
# No alert configuration at all
# ✅ Correct: Configure TiCDC latency alerts
# Prometheus alert rule
- alert: TiCDCReplicationLag
expr: |
tidb_cdc_checkpoint_ts_lag_seconds > 300
for: 5m
labels:
severity: critical
annotations:
summary: "TiCDC replication lag exceeds 5 minutes"
description: "changefeed {{ $labels.changefeed }} lag {{ $value }}s"
Error Troubleshooting Quick Reference
| Error Symptom | Possible Cause | Diagnostic Command/Method | Solution |
|---|---|---|---|
| TiFlash replica sync stuck | TiFlash node disk full or down | SELECT * FROM information_schema.tiflash_replica WHERE progress < 1.0; |
Expand disk or restart TiFlash node |
MPP query reports Too many MPP tasks |
Too many concurrent MPP queries | SHOW CONFIG WHERE name LIKE '%mpp%'; |
Adjust tidb_mpp_task_limit or reduce concurrency |
| Query not using TiFlash | Optimizer judges TiKV as better | EXPLAIN SELECT ...; check operators |
Use /*+ read_from_storage(tiflash[t]) */ Hint |
| TiCDC sync latency increasing | Slow downstream writes or network jitter | cdc cli changefeed query --changefeed-id=xxx |
Scale downstream or increase TiCDC worker count |
| TiFlash OOM | Query data volume too large | grep OOM /data/tiflash/log/tiflash.log |
Increase max_memory_usage or narrow query scope |
| Dashboard query timeout | No query resource limits | SHOW PROCESSLIST; |
Set max_execution_time and Resource Group |
| Materialized view refresh failure | Source table too large or lock conflict | SHOW MATERIALIZED VIEWS; |
Adjust refresh frequency or optimize source table indexes |
| TiFlash high write latency | Raft Learner sync pressure high | Grafana TiFlash Dashboard | Scale TiFlash nodes or adjust raft-store.sync-log |
| MPP Join data skew | Join key uneven distribution | EXPLAIN ANALYZE check per-node data volume |
Use /*+ shuffle_join() */ or increase bucket count |
| TiFlash high disk IO | Compaction or large scans | iostat -x 1 check IO |
Adjust bg_task_io_limit or expand disk |
Advanced Optimization Techniques
3.1 TiFlash Storage Engine Tuning
-- Adjust TiFlash Segment split threshold (reduce small segments)
-- Modify in TiFlash configuration
# tiflash-config.yaml
storage:
main:
segment_delta_rows: 8192
segment_delta_size: 536870912
segment_stable_rows: 10485760
latest:
segment_delta_rows: 4096
segment_delta_size: 134217728
raft:
apply_batch: 1024
sync_log: false
engine:
dt:
enable_read_thread: true
read_thread_count: 8
enable_leaf_dt_segment_optimize: true
3.2 MPP Query Advanced Optimization
-- 1. Use Runtime Filter to reduce Join data transfer
-- Enabled by default in TiDB 8.x, verify status
SET SESSION tidb_runtime_filter_mode = "LOCAL";
-- 2. Adjust MPP Exchange data compression
SET SESSION tidb_exchange_compression = 'ON';
SET SESSION tidb_exchange_compression_level = 3;
-- 3. Large table Join optimization: Use Hash Join Hint
SELECT /*+ hash_join_build(o) */
c.category_name,
SUM(oi.quantity * oi.unit_price) AS total_sales
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN categories c ON oi.product_id = c.category_id
GROUP BY c.category_name;
-- 4. Use Index Lookup Join instead of Hash Join (small table driving large table)
SELECT /*+ inl_join(p) */
o.order_id, p.product_name
FROM orders o
JOIN products p ON o.product_id = p.product_id
WHERE o.created_at >= CURDATE();
3.3 Multi-Tenant HTAP Resource Isolation
-- Create independent resource groups for different business lines
CREATE RESOURCE GROUP IF NOT EXISTS rg_ecommerce
RU_PER_SEC = 15000 PRIORITY = HIGH
QUERY_LIMIT = (EXECUTION_TIME = 30s);
CREATE RESOURCE GROUP IF NOT EXISTS rg_analytics
RU_PER_SEC = 10000 PRIORITY = LOW
QUERY_LIMIT = (EXECUTION_TIME = 120s);
CREATE RESOURCE GROUP IF NOT EXISTS rg_data_science
RU_PER_SEC = 5000 PRIORITY = LOW
QUERY_LIMIT = (EXECUTION_TIME = 600s);
-- Bind users to resource groups
ALTER USER ecommerce_app RESOURCE GROUP rg_ecommerce;
ALTER USER analytics_user RESOURCE GROUP rg_analytics;
ALTER USER ds_user RESOURCE GROUP rg_data_science;
-- Monitor resource group usage
SELECT * FROM information_schema.resource_group_usage;
HTAP Solution Comparison
| Dimension | TiDB HTAP | ClickHouse + MySQL | Snowflake | Apache Doris | SingleStore |
|---|---|---|---|---|---|
| Architecture | Integrated HTAP | Dual-system stitching | Cloud-native OLAP | Integrated HTAP | Integrated HTAP |
| OLTP capability | Strong (MySQL compatible) | MySQL native | None | Weak | Medium |
| OLAP performance | Strong (MPP) | Very strong | Very strong | Strong | Strong |
| Data sync | Raft Learner (automatic) | Self-built ETL needed | Import needed | Automatic | Automatic |
| Real-time | Seconds | Minutes | Minutes | Seconds | Seconds |
| MySQL compat | 95%+ | Native | Incompatible | Medium | Medium |
| Horizontal scaling | Transparent | Each scales independently | Auto scaling | Transparent | Transparent |
| Ops complexity | Medium | High | Low | Medium | Medium |
| Cost | Medium | High (dual system) | High (pay-per-use) | Medium | High |
| Open source | Fully open source | Partially open source | Closed source | Fully open source | Partially open source |
| Use case | General HTAP | Existing MySQL + strong OLAP needed | Pure analytics | Real-time data warehouse | Real-time analytics |
Summary
TiDB HTAP isn't a marketing slogan about "doing both OLTP and OLAP" — it's an engineered solution that achieves real-time sync from row storage to column storage via Raft Learner, distributed parallel computation via the MPP engine, and OLTP/OLAP resource isolation via Resource Control. In 2026, if you're still maintaining a MySQL + ClickHouse dual-system architecture, it's time to seriously evaluate TiDB HTAP — one system, one ops team, zero data latency. This is what real-time analytics should look like.
Recommended Tools
- JSON Formatter - Format TiDB configuration JSON files
- Base64 Encode/Decode - Handle TiCDC certificate encoding
- Hash Calculator - Verify data sync consistency
Try these browser-local tools — no sign-up required →