TiDB HTAP Real-Time Analytics: 6 Production Patterns from TiFlash to Real-Time Dashboards

数据库

T+1 Analytics Is Dead — Real-Time Insights Are the Future

3 AM, online orders surge 10x, CEO asks: "What's the real-time GMV by category?" — You can only answer: "Data is still in the ETL pipeline, you'll see it tomorrow."

This is the dilemma of traditional T+1 analytics:

  • Fragile ETL pipelines: Kafka → Flink → ClickHouse, any single point of failure breaks the chain
  • Uncontrollable data latency: Minutes to hours from OLTP to OLAP is the norm
  • Dual-system ops nightmare: MySQL for transactions, ClickHouse for analytics, data consistency by manual effort
  • Cost overruns: Two storage systems, two compute engines, two ops teams, low resource utilization
  • Insufficient real-time capability: Business decisions need second-level responses, T+1 reports are legacy

TiDB HTAP's promise: One system, simultaneously handling OLTP and OLAP, data synced from row storage to column storage with zero latency. This isn't a slide-ware architecture — it's a solution validated in production at thousands of enterprises.


HTAP Core Concepts Quick Reference

Concept Full Name Core Role Analogy
HTAP Hybrid Transactional/Analytical Processing One system handles both transactions and analytics One person acting as both trader and analyst
TiFlash TiDB Flash Columnar storage engine, handles OLAP queries Dedicated OLAP accelerator
TiKV TiDB Key-Value Row storage engine, handles OLTP transactions OLTP cornerstone
MPP Massively Parallel Processing Multi-node parallel computation framework Multiple people computing different shards simultaneously
Columnar Storage Columnar Storage Stores data by column, analytics queries only read needed columns Only flipping the pages you need
Raft Learner Raft consensus protocol learner role Asynchronously replicates data to TiFlash without affecting the main flow Shadow scribe
Data Mart Data Mart Pre-aggregated data set for a specific business domain Customized report showcase for each department

💡 Use the JSON Formatter to format TiDB configuration files for easier troubleshooting.


5 Pain Points of Traditional Real-Time Analytics Architecture

Pain Point 1: ETL Pipeline Is the Biggest Single Point of Failure

MySQL → Canal → Kafka → Flink → ClickHouse
  │        │       │       │        │
  │        │       │       │        └─ Write failure, data loss
  │        │       │       └─ Job crash, data backlog
  │        │       └─ Uneven partitions, consumption lag
  │        └─ Binlog parsing error
  └─ Replication lag

Any single link failure breaks the entire pipeline. Late-night alerts never stop, ops teams suffer endlessly.

Pain Point 2: Data Consistency Cannot Be Guaranteed

OLTP and OLAP are two independent systems, data sync has a latency window:

-- MySQL: Order has been refunded
UPDATE orders SET status = 'refunded' WHERE id = 10086;

-- ClickHouse: Refund status only visible 5 minutes later
-- During those 5 minutes, report data is wrong

Pain Point 3: Dual-System Ops Costs Double

Ops Item MySQL ClickHouse Total
Monitoring & Alerts Grafana + Prometheus Grafana + Prometheus 2 sets
Backup & Recovery mysqldump / XtraBackup clickhouse-backup 2 sets
Capacity Planning Primary-replica + sharding Distributed tables + replicas 2 sets
Version Upgrades Rolling upgrade Rolling upgrade 2 sets
Staffing DBA ClickHouse expert 2 teams

Pain Point 4: Query Performance vs Data Freshness Dilemma

  • ClickHouse pursues query performance, requires MergeTree compaction, data has latency
  • For real-time needs, you use ReplacingMergeTree, but queries need FINAL, causing performance to plummet
  • Real-time and performance — you can't have both

Pain Point 5: Low Development Efficiency

// Developers must maintain two data models
public class OrderService {
    // OLTP: MySQL
    @Transactional
    public void createOrder(Order order) {
        orderMapper.insert(order);
    }
    
    // OLAP: ClickHouse - write it again
    public void syncToClickHouse(Order order) {
        clickHouseMapper.insert(order);
        // Field mapping, type conversion, error handling...
    }
}

6 Production-Grade HTAP Patterns

Pattern 1: TiFlash Columnar Replica Setup

TiFlash receives TiKV data changes asynchronously through the Raft Learner role, achieving real-time sync from row storage to column storage.

1.1 Deploy TiFlash Nodes

# tiup topology.yaml snippet
tiflash_servers:
  - host: 10.0.1.21
    tcp_port: 9000
    http_port: 8123
    flash_service_port: 3930
    flash_proxy_port: 20170
    flash_proxy_status_port: 20292
    metrics_interval: 15
    data_dir: /data/tiflash/data
    deploy_dir: /data/tiflash/deploy
    config:
      logger:
        level: information
      storage:
        main:
          capacity: "2TB"
        latest:
          capacity: "200GB"
  - host: 10.0.1.22
    tcp_port: 9000
    http_port: 8123
    flash_service_port: 3930
    flash_proxy_port: 20170
    flash_proxy_status_port: 20292
    data_dir: /data/tiflash/data
    deploy_dir: /data/tiflash/deploy
# Scale out TiFlash nodes
tiup cluster scale-out tidb-prod topology-tiflash.yaml

# Verify TiFlash node status
tiup cluster display tidb-prod | grep tiflash

1.2 Add TiFlash Replicas to Tables

-- Add TiFlash replica for entire table (default 1 replica)
ALTER TABLE orders SET TIFLASH REPLICA 1;

-- Add 2 TiFlash replicas (high availability)
ALTER TABLE orders SET TIFLASH REPLICA 2;

-- Add TiFlash replica for partitioned table
ALTER TABLE orders_partition SET TIFLASH REPLICA 1;

-- Check replica sync progress
SELECT * FROM information_schema.tiflash_replica
WHERE table_name = 'orders';

Sync progress output example:

+----------+------------+----------+---------------+------------------+-----------+
| TABLE_ID | TABLE_NAME | REPLICA  | LOCATION_LABELS| AVAILABLE        | PROGRESS  |
+----------+------------+----------+---------------+------------------+-----------+
|       85 | orders     |        1 |               | True             | 1.0       |
+----------+------------+----------+---------------+------------------+-----------+

1.3 Add TiFlash Replica by Column (Save Storage)

-- Create TiFlash replica only for columns needed for analytics
ALTER TABLE orders SET TIFLASH REPLICA 1
  COLUMNS (order_id, user_id, product_id, amount, status, created_at);

-- Verify column-level replica
SHOW CREATE TABLE orders;

Pattern 2: MPP Distributed Query Acceleration

TiDB's MPP engine pushes analytical queries down to multiple TiFlash nodes for parallel execution, achieving query performance close to ClickHouse.

2.1 Enable MPP Mode

-- Check current MPP mode
SHOW VARIABLES LIKE 'tidb_allow_mpp';

-- Enable MPP (enabled by default)
SET SESSION tidb_allow_mpp = 1;

-- Force MPP usage (for testing/verification)
SET SESSION tidb_enforce_mpp = 1;

-- Set MPP query Exchange compression
SET SESSION tidb_exchange_compression = 'ON';

2.2 MPP Query in Action: Real-Time Sales Analytics

-- Scenario: Real-time top 10 category sales today
-- MPP automatically pushes computation to TiFlash nodes for parallel execution

SELECT
    c.category_name,
    SUM(oi.quantity * oi.unit_price) AS total_sales,
    COUNT(DISTINCT o.user_id) AS unique_buyers,
    AVG(oi.quantity * oi.unit_price) AS avg_order_value
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN categories c ON p.category_id = c.category_id
WHERE o.created_at >= CURDATE()
  AND o.status IN ('paid', 'shipped', 'completed')
GROUP BY c.category_name
ORDER BY total_sales DESC
LIMIT 10;

2.3 Verify MPP Execution Plan

-- View execution plan, confirm MPP operators
EXPLAIN ANALYZE
SELECT
    c.category_name,
    SUM(oi.quantity * oi.unit_price) AS total_sales
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN categories c ON p.category_id = c.category_id
WHERE o.created_at >= CURDATE()
GROUP BY c.category_name;

Key execution plan characteristics:

+------------------------------------------+----------+---------+--------------+
| id                                       | task     | rows    | operator info|
+------------------------------------------+----------+---------+--------------+
| TableReader_45                           | root     | 10      | ...          |
| └─ExchangeSender_44                      | mpp[tif] | 10      | ExchangeType |
|   └─Projection_42                        | mpp[tif] | 10      | ...          |
|     └─HashAgg_43                         | mpp[tif] | 10      | group by:... |
|       └─ExchangeReceiver_38              | mpp[tif] | 100     | ...          |
|         └─ExchangeSender_37              | mpp[tif] | 100     | ExchangeType |
|           └─HashJoin_36                  | mpp[tif] | 100     | inner join   |
|             ├─Selection_25(Build)        | mpp[tif] | 500     | ...          |
|             │ └─TableFullScan_24         | mpp[tif] | 10000   | tikv:tiflash |
|             └─Selection_30(Probe)        | mpp[tif] | 800     | ...          |
|               └─TableFullScan_29         | mpp[tif] | 20000   | tikv:tiflash |
+------------------------------------------+----------+---------+--------------+

Seeing mpp[tif] and ExchangeSender/Receiver indicates MPP is in effect.

2.4 MPP Performance Comparison

-- Disable MPP, use traditional TiKV scan
SET SESSION tidb_allow_mpp = 0;

EXPLAIN ANALYZE
SELECT c.category_name, SUM(oi.quantity * oi.unit_price) AS total_sales
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN categories c ON p.category_id = c.category_id
WHERE o.created_at >= CURDATE()
GROUP BY c.category_name;

-- Re-enable MPP
SET SESSION tidb_allow_mpp = 1;
Query Mode Data Volume Execution Time Scanned Rows
TiKV Scan 100M rows 45.2s 100,000,000
TiFlash Non-MPP 100M rows 12.8s 100,000,000
TiFlash MPP 100M rows 3.2s 100,000,000

Pattern 3: TiCDC Real-Time ETL Replacement

TiCDC (Change Data Capture) can replace traditional ETL pipelines, syncing TiDB data changes to downstream systems in real-time.

3.1 Deploy TiCDC

# tiup topology.yaml snippet
cdc_servers:
  - host: 10.0.1.31
    port: 8300
    deploy_dir: /data/cdc/deploy
    data_dir: /data/cdc/data
    config:
      per-table-memory-quota: 10485760
      sink-memory-quota: 536870912
  - host: 10.0.1.32
    port: 8300
    deploy_dir: /data/cdc/deploy
    data_dir: /data/cdc/data
# Scale out TiCDC
tiup cluster scale-out tidb-prod topology-cdc.yaml

# Verify TiCDC status
tiup cluster display tidb-prod | grep cdc
# Create TiCDC sync task: TiDB → Kafka (replacing Canal + Kafka)
cdc cli changefeed create \
  --server=http://10.0.1.31:8300 \
  --sink-uri="kafka://kafka-cluster:9092/tidb-orders?partition-num=6&max-message-bytes=10485760&replication-factor=3" \
  --changefeed-id="orders-to-kafka" \
  --config=cdc-orders.yaml
# cdc-orders.yaml
filter:
  rules:
    - 'ecommerce.orders'
    - 'ecommerce.order_items'
    - 'ecommerce.payments'
  ignore_txn_start_ts: []

mounter:
  worker-num: 16

sink:
  protocol: "canal-json"
  dispatchers:
    - matcher: ['ecommerce.*']
      partition: "index-value"

selector:
  rules:
    - 'ecommerce.orders'
    - 'ecommerce.order_items'
    - 'ecommerce.payments'

3.3 Direct Sync to ClickHouse (If Still Needed)

# TiCDC → ClickHouse direct connection
cdc cli changefeed create \
  --server=http://10.0.1.31:8300 \
  --sink-uri="clickhouse://clickhouse-server:8123?database=analytics&username=default&password=***" \
  --changefeed-id="orders-to-clickhouse" \
  --config=cdc-ch.yaml

3.4 TiCDC Sync to S3 (Data Lake Solution)

# TiCDC → S3 Data Lake
cdc cli changefeed create \
  --server=http://10.0.1.31:8300 \
  --sink-uri="s3://data-lake/tidb-cdc/?endpoint=https://s3.amazonaws.com&access-key=***&secret-access-key=***" \
  --changefeed-id="orders-to-s3" \
  --config=cdc-s3.yaml
# cdc-s3.yaml
sink:
  protocol: "csv"
  column-selectors:
    - matcher: ['ecommerce.orders']
      columns: ['order_id', 'user_id', 'amount', 'status', 'created_at']

file:
  size: 67108864
  interval: 5m

filter:
  rules:
    - 'ecommerce.orders'
    - 'ecommerce.order_items'

3.5 Monitor TiCDC Sync Latency

-- Check sync task status
cdc cli changefeed list \
  --server=http://10.0.1.31:8300;

-- Check sync latency
cdc cli changefeed query \
  --changefeed-id=orders-to-kafka \
  --server=http://10.0.1.31:8300;

Key metrics:

{
  "state": "normal",
  "tso": 446685327530688513,
  "checkpoint": "2026-06-15 10:30:00.123",
  "lag": "1.2s"
}

Pattern 4: Dashboard Query Optimization

Real-time dashboards are the most direct application scenario for HTAP, but query performance directly impacts user experience.

4.1 Dashboard Query Pattern Identification

-- Typical dashboard query 1: Real-time GMV
SELECT
    DATE_FORMAT(created_at, '%Y-%m-%d %H:00') AS hour_slot,
    SUM(amount) AS gmv,
    COUNT(*) AS order_count
FROM orders
WHERE created_at >= DATE_SUB(NOW(), INTERVAL 24 HOUR)
  AND status != 'cancelled'
GROUP BY hour_slot
ORDER BY hour_slot;

-- Typical dashboard query 2: Real-time user activity
SELECT
    DATE_FORMAT(login_time, '%Y-%m-%d %H:00') AS hour_slot,
    COUNT(DISTINCT user_id) AS dau,
    COUNT(*) AS total_actions
FROM user_actions
WHERE login_time >= DATE_SUB(NOW(), INTERVAL 24 HOUR)
GROUP BY hour_slot
ORDER BY hour_slot;

-- Typical dashboard query 3: Real-time conversion funnel
SELECT
    COUNT(DISTINCT CASE WHEN event = 'view' THEN user_id END) AS views,
    COUNT(DISTINCT CASE WHEN event = 'cart' THEN user_id END) AS carts,
    COUNT(DISTINCT CASE WHEN event = 'pay' THEN user_id END) AS pays,
    ROUND(COUNT(DISTINCT CASE WHEN event = 'pay' THEN user_id END) * 100.0
      / NULLIF(COUNT(DISTINCT CASE WHEN event = 'view' THEN user_id END), 0), 2) AS conversion_rate
FROM user_events
WHERE created_at >= DATE_SUB(NOW(), INTERVAL 1 HOUR);

4.2 Force Queries to Use TiFlash

-- Method 1: Force TiFlash via Hint
SELECT /*+ read_from_storage(tiflash[orders]) */
    DATE_FORMAT(created_at, '%Y-%m-%d %H:00') AS hour_slot,
    SUM(amount) AS gmv
FROM orders
WHERE created_at >= DATE_SUB(NOW(), INTERVAL 24 HOUR)
GROUP BY hour_slot;

-- Method 2: Set Session variable
SET SESSION tidb_isolation_read_engines = 'tiflash';
SELECT
    DATE_FORMAT(created_at, '%Y-%m-%d %H:00') AS hour_slot,
    SUM(amount) AS gmv
FROM orders
WHERE created_at >= DATE_SUB(NOW(), INTERVAL 24 HOUR)
GROUP BY hour_slot;
SET SESSION tidb_isolation_read_engines = 'tikv,tiflash';

-- Method 3: Isolate OLAP traffic via Resource Group
CREATE RESOURCE GROUP IF NOT EXISTS olap_group
  RU_PER_SEC = 5000
  PRIORITY = LOW;

ALTER USER dashboard_user RESOURCE GROUP olap_group;

4.3 Dashboard Query Cache Optimization

-- Use Binding to fix execution plan
CREATE SESSION BINDING FOR
SELECT SUM(amount) FROM orders WHERE created_at >= DATE_SUB(NOW(), INTERVAL 24 HOUR)
USING
SELECT /*+ read_from_storage(tiflash[orders]) */ SUM(amount)
FROM orders WHERE created_at >= DATE_SUB(NOW(), INTERVAL 24 HOUR);

-- View Bindings
SHOW SESSION BINDINGS;

4.4 Dashboard Refresh Strategy

-- Dashboard queries should use READ ONLY mode to avoid competing with OLTP
SET SESSION tidb_isolation_read_engines = 'tiflash';

-- Limit dashboard query resource consumption
SET SESSION max_execution_time = 30000;

-- Use prepared statements to reduce parsing overhead
PREPARE dashboard_gmv FROM
SELECT
    DATE_FORMAT(created_at, '%Y-%m-%d %H:00') AS hour_slot,
    SUM(amount) AS gmv
FROM orders
WHERE created_at >= ?
  AND status != 'cancelled'
GROUP BY hour_slot
ORDER BY hour_slot;

EXECUTE dashboard_gmv USING DATE_SUB(NOW(), INTERVAL 24 HOUR);

Pattern 5: Materialized Views for Data Mart

TiDB 8.x supports Materialized Views, which can pre-compute aggregated data and build business-oriented data marts.

5.1 Create Materialized Views

-- Create daily sales summary materialized view
CREATE MATERIALIZED VIEW mv_daily_sales
REFRESH EVERY 1 HOUR
AS
SELECT
    DATE(created_at) AS sale_date,
    c.category_name,
    p.product_name,
    SUM(oi.quantity) AS total_quantity,
    SUM(oi.quantity * oi.unit_price) AS total_revenue,
    COUNT(DISTINCT o.user_id) AS unique_buyers,
    AVG(oi.quantity * oi.unit_price) AS avg_price
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN categories c ON p.category_id = c.category_id
WHERE o.status IN ('paid', 'shipped', 'completed')
GROUP BY DATE(created_at), c.category_name, p.product_name;

5.2 Create User Behavior Materialized View

-- User behavior funnel materialized view
CREATE MATERIALIZED VIEW mv_user_funnel
REFRESH EVERY 5 MINUTE
AS
SELECT
    DATE(created_at) AS event_date,
    COUNT(DISTINCT CASE WHEN event_type = 'page_view' THEN user_id END) AS page_views,
    COUNT(DISTINCT CASE WHEN event_type = 'add_to_cart' THEN user_id END) AS add_to_carts,
    COUNT(DISTINCT CASE WHEN event_type = 'checkout' THEN user_id END) AS checkouts,
    COUNT(DISTINCT CASE WHEN event_type = 'payment' THEN user_id END) AS payments,
    ROUND(
        COUNT(DISTINCT CASE WHEN event_type = 'payment' THEN user_id END) * 100.0
        / NULLIF(COUNT(DISTINCT CASE WHEN event_type = 'page_view' THEN user_id END), 0),
        2
    ) AS view_to_pay_rate
FROM user_events
GROUP BY DATE(created_at);

5.3 Manage Materialized Views

-- View materialized view status
SHOW MATERIALIZED VIEWS;

-- Manually refresh materialized view
ALTER MATERIALIZED VIEW mv_daily_sales REFRESH;

-- Pause auto-refresh
ALTER MATERIALIZED VIEW mv_daily_sales DISABLE;

-- Resume auto-refresh
ALTER MATERIALIZED VIEW mv_daily_sales ENABLE;

-- Drop materialized view
DROP MATERIALIZED VIEW IF EXISTS mv_daily_sales;

5.4 Dashboard Queries Based on Materialized Views

-- Query materialized view (millisecond response)
SELECT
    sale_date,
    category_name,
    total_revenue,
    unique_buyers
FROM mv_daily_sales
WHERE sale_date >= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
ORDER BY total_revenue DESC
LIMIT 20;

Pattern 6: Production HTAP Deployment and Monitoring

Production HTAP deployment requires resource isolation, monitoring/alerting, and failure recovery.

6.1 Resource Control Configuration

-- Create OLTP resource group (high priority)
CREATE RESOURCE GROUP IF NOT EXISTS oltp_group
  RU_PER_SEC = 20000
  PRIORITY = HIGH;

-- Create OLAP resource group (low priority, limited resources)
CREATE RESOURCE GROUP IF NOT EXISTS olap_group
  RU_PER_SEC = 8000
  PRIORITY = LOW;

-- Bind users to resource groups
ALTER USER app_user RESOURCE GROUP oltp_group;
ALTER USER dashboard_user RESOURCE GROUP olap_group;
ALTER USER etl_user RESOURCE GROUP olap_group;

-- View resource group configuration
SELECT * FROM information_schema.resource_groups;

6.2 TiFlash Key Monitoring Metrics

# Grafana Dashboard Key Metrics
# TiFlash Summary Dashboard

tiflash_system_profile:
  - metric: tiflash_storage_write_duration_seconds
    alert_threshold: "> 5s"
    description: "TiFlash write latency"
  
  - metric: tiflash_mpp_task_duration_seconds
    alert_threshold: "> 60s"
    description: "MPP task execution duration"
  
  - metric: tiflash_storage_segment_count
    alert_threshold: "> 10000"
    description: "Segment count (too many requires compaction)"
  
  - metric: tiflash_cpu_usage
    alert_threshold: "> 80%"
    description: "TiFlash CPU usage"
  
  - metric: tiflash_storage_disk_usage_ratio
    alert_threshold: "> 0.8"
    description: "TiFlash disk usage ratio"

6.3 Prometheus Alert Rules

# tidb-htap-alerts.yaml
groups:
  - name: tidb_htap
    rules:
      - alert: TiFlashReplicaSyncLag
        expr: |
          tiflash_replica_available == 0
          or (tiflash_replica_progress < 0.95 and tiflash_replica_progress > 0)
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "TiFlash replica sync lag"
          description: "Table {{ $labels.table_name }} TiFlash replica sync progress only {{ $value }}"

      - alert: TiFlashHighWriteDuration
        expr: |
          histogram_quantile(0.95,
            rate(tiflash_storage_write_duration_seconds_bucket[5m])
          ) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "TiFlash write latency too high"
          description: "P95 write latency {{ $value }}s"

      - alert: MPPQueryTimeout
        expr: |
          rate(tiflash_mpp_task_error_count{reason="timeout"}[5m]) > 0.1
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "MPP query timeout rate too high"
          description: "MPP query timeout rate {{ $value }}/s"

      - alert: TiFlashDiskSpaceLow
        expr: |
          tiflash_storage_disk_usage_ratio > 0.85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "TiFlash disk space running low"
          description: "Disk usage ratio {{ $value }}"

6.4 Daily Operations Script

#!/bin/bash
# tidb-htap-healthcheck.sh
# TiDB HTAP health check script

CLUSTER_NAME="tidb-prod"
LOG_FILE="/var/log/tidb-htap-healthcheck.log"

echo "=== TiDB HTAP Health Check $(date) ===" >> $LOG_FILE

# Check TiFlash node status
echo "--- TiFlash Node Status ---" >> $LOG_FILE
tiup cluster display $CLUSTER_NAME 2>/dev/null | grep tiflash >> $LOG_FILE

# Check TiFlash replica sync status
echo "--- TiFlash Replica Sync Status ---" >> $LOG_FILE
mysql -h10.0.1.10 -P4000 -uroot -e "
SELECT table_schema, table_name, replica_count, available, progress
FROM information_schema.tiflash_replica
WHERE available = 0 OR progress < 1.0;
" >> $LOG_FILE 2>&1

# Check TiCDC sync lag
echo "--- TiCDC Sync Lag ---" >> $LOG_FILE
cdc cli changefeed list --server=http://10.0.1.31:8300 >> $LOG_FILE 2>&1

# Check MPP query performance
echo "--- MPP Query Performance ---" >> $LOG_FILE
mysql -h10.0.1.10 -P4000 -uroot -e "
SELECT DIGEST_TEXT, AVG_LATENCY, EXEC_COUNT
FROM information_schema.statements_summary
WHERE DIGEST_TEXT LIKE '%GROUP BY%'
  AND AVG_LATENCY > 1000000000
ORDER BY AVG_LATENCY DESC
LIMIT 10;
" >> $LOG_FILE 2>&1

echo "=== Health Check Complete ===" >> $LOG_FILE

5 Common Pitfalls Guide

❌ Pitfall 1: Querying Before TiFlash Replica Is Synced

-- ❌ Wrong: Query immediately after adding TiFlash replica, data may be incomplete
ALTER TABLE orders SET TIFLASH REPLICA 1;
-- Immediate query may read incomplete data
SELECT COUNT(*) FROM orders;

-- ✅ Correct: Wait for replica sync to complete
ALTER TABLE orders SET TIFLASH REPLICA 1;

-- Check sync progress
SELECT table_name, progress, available
FROM information_schema.tiflash_replica
WHERE table_name = 'orders';

-- Wait for available = 1 before querying
SELECT COUNT(*) FROM orders;

❌ Pitfall 2: Adding TiFlash Replicas to Small Tables

-- ❌ Wrong: Adding TiFlash replica to small tables is slower (network overhead > scan overhead)
ALTER TABLE config_dict SET TIFLASH REPLICA 1;

-- ✅ Correct: Keep small tables on TiKV, only add TiFlash replicas for large tables
-- Rule of thumb: Consider TiFlash replicas only when data > 1M rows
-- Force small tables to use TiKV via Hint
SELECT /*+ read_from_storage(tikv[config_dict]) */
    * FROM config_dict WHERE type = 'payment';

❌ Pitfall 3: OLAP Queries Impacting OLTP Performance

-- ❌ Wrong: OLAP and OLTP using same user, no resource isolation
SELECT user, resource_group FROM mysql.user WHERE user = 'app_user';
-- app_user | default

-- ✅ Correct: Use Resource Group for isolation
CREATE RESOURCE GROUP IF NOT EXISTS oltp_group
  RU_PER_SEC = 20000 PRIORITY = HIGH;
CREATE RESOURCE GROUP IF NOT EXISTS olap_group
  RU_PER_SEC = 8000 PRIORITY = LOW;

ALTER USER app_user RESOURCE GROUP oltp_group;
ALTER USER dashboard_user RESOURCE GROUP olap_group;

-- Set timeout for OLAP queries
SET SESSION max_execution_time = 60000;

❌ Pitfall 4: Insufficient TiFlash Node Disk Space

# ❌ Wrong: TiFlash data directory on system disk
tiflash_servers:
  - host: 10.0.1.21
    data_dir: /var/lib/tiflash/data  # System disk has limited space

# ✅ Correct: TiFlash uses dedicated data disk, reserve 30% space
tiflash_servers:
  - host: 10.0.1.21
    data_dir: /data/tiflash/data    # Dedicated data disk
    config:
      storage:
        main:
          capacity: "2TB"           # Explicit capacity limit
        latest:
          capacity: "200GB"

❌ Pitfall 5: Ignoring TiCDC Sync Latency Alerts

# ❌ Wrong: Not monitoring TiCDC latency, data sync breaks without anyone knowing
# No alert configuration at all

# ✅ Correct: Configure TiCDC latency alerts
# Prometheus alert rule
- alert: TiCDCReplicationLag
  expr: |
    tidb_cdc_checkpoint_ts_lag_seconds > 300
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "TiCDC replication lag exceeds 5 minutes"
    description: "changefeed {{ $labels.changefeed }} lag {{ $value }}s"

Error Troubleshooting Quick Reference

Error Symptom Possible Cause Diagnostic Command/Method Solution
TiFlash replica sync stuck TiFlash node disk full or down SELECT * FROM information_schema.tiflash_replica WHERE progress < 1.0; Expand disk or restart TiFlash node
MPP query reports Too many MPP tasks Too many concurrent MPP queries SHOW CONFIG WHERE name LIKE '%mpp%'; Adjust tidb_mpp_task_limit or reduce concurrency
Query not using TiFlash Optimizer judges TiKV as better EXPLAIN SELECT ...; check operators Use /*+ read_from_storage(tiflash[t]) */ Hint
TiCDC sync latency increasing Slow downstream writes or network jitter cdc cli changefeed query --changefeed-id=xxx Scale downstream or increase TiCDC worker count
TiFlash OOM Query data volume too large grep OOM /data/tiflash/log/tiflash.log Increase max_memory_usage or narrow query scope
Dashboard query timeout No query resource limits SHOW PROCESSLIST; Set max_execution_time and Resource Group
Materialized view refresh failure Source table too large or lock conflict SHOW MATERIALIZED VIEWS; Adjust refresh frequency or optimize source table indexes
TiFlash high write latency Raft Learner sync pressure high Grafana TiFlash Dashboard Scale TiFlash nodes or adjust raft-store.sync-log
MPP Join data skew Join key uneven distribution EXPLAIN ANALYZE check per-node data volume Use /*+ shuffle_join() */ or increase bucket count
TiFlash high disk IO Compaction or large scans iostat -x 1 check IO Adjust bg_task_io_limit or expand disk

Advanced Optimization Techniques

3.1 TiFlash Storage Engine Tuning

-- Adjust TiFlash Segment split threshold (reduce small segments)
-- Modify in TiFlash configuration
# tiflash-config.yaml
storage:
  main:
    segment_delta_rows: 8192
    segment_delta_size: 536870912
    segment_stable_rows: 10485760
  latest:
    segment_delta_rows: 4096
    segment_delta_size: 134217728

raft:
  apply_batch: 1024
  sync_log: false

engine:
  dt:
    enable_read_thread: true
    read_thread_count: 8
    enable_leaf_dt_segment_optimize: true

3.2 MPP Query Advanced Optimization

-- 1. Use Runtime Filter to reduce Join data transfer
-- Enabled by default in TiDB 8.x, verify status
SET SESSION tidb_runtime_filter_mode = "LOCAL";

-- 2. Adjust MPP Exchange data compression
SET SESSION tidb_exchange_compression = 'ON';
SET SESSION tidb_exchange_compression_level = 3;

-- 3. Large table Join optimization: Use Hash Join Hint
SELECT /*+ hash_join_build(o) */
    c.category_name,
    SUM(oi.quantity * oi.unit_price) AS total_sales
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN categories c ON oi.product_id = c.category_id
GROUP BY c.category_name;

-- 4. Use Index Lookup Join instead of Hash Join (small table driving large table)
SELECT /*+ inl_join(p) */
    o.order_id, p.product_name
FROM orders o
JOIN products p ON o.product_id = p.product_id
WHERE o.created_at >= CURDATE();

3.3 Multi-Tenant HTAP Resource Isolation

-- Create independent resource groups for different business lines
CREATE RESOURCE GROUP IF NOT EXISTS rg_ecommerce
  RU_PER_SEC = 15000 PRIORITY = HIGH
  QUERY_LIMIT = (EXECUTION_TIME = 30s);

CREATE RESOURCE GROUP IF NOT EXISTS rg_analytics
  RU_PER_SEC = 10000 PRIORITY = LOW
  QUERY_LIMIT = (EXECUTION_TIME = 120s);

CREATE RESOURCE GROUP IF NOT EXISTS rg_data_science
  RU_PER_SEC = 5000 PRIORITY = LOW
  QUERY_LIMIT = (EXECUTION_TIME = 600s);

-- Bind users to resource groups
ALTER USER ecommerce_app RESOURCE GROUP rg_ecommerce;
ALTER USER analytics_user RESOURCE GROUP rg_analytics;
ALTER USER ds_user RESOURCE GROUP rg_data_science;

-- Monitor resource group usage
SELECT * FROM information_schema.resource_group_usage;

HTAP Solution Comparison

Dimension TiDB HTAP ClickHouse + MySQL Snowflake Apache Doris SingleStore
Architecture Integrated HTAP Dual-system stitching Cloud-native OLAP Integrated HTAP Integrated HTAP
OLTP capability Strong (MySQL compatible) MySQL native None Weak Medium
OLAP performance Strong (MPP) Very strong Very strong Strong Strong
Data sync Raft Learner (automatic) Self-built ETL needed Import needed Automatic Automatic
Real-time Seconds Minutes Minutes Seconds Seconds
MySQL compat 95%+ Native Incompatible Medium Medium
Horizontal scaling Transparent Each scales independently Auto scaling Transparent Transparent
Ops complexity Medium High Low Medium Medium
Cost Medium High (dual system) High (pay-per-use) Medium High
Open source Fully open source Partially open source Closed source Fully open source Partially open source
Use case General HTAP Existing MySQL + strong OLAP needed Pure analytics Real-time data warehouse Real-time analytics

Summary

TiDB HTAP isn't a marketing slogan about "doing both OLTP and OLAP" — it's an engineered solution that achieves real-time sync from row storage to column storage via Raft Learner, distributed parallel computation via the MPP engine, and OLTP/OLAP resource isolation via Resource Control. In 2026, if you're still maintaining a MySQL + ClickHouse dual-system architecture, it's time to seriously evaluate TiDB HTAP — one system, one ops team, zero data latency. This is what real-time analytics should look like.


Try these browser-local tools — no sign-up required →

#TiDB#HTAP#实时分析#TiFlash#分布式数据库#2026#列存