TiFlash Alert Rules
This document introduces the alert rules of the TiFlash cluster.
TiFlash_schema_error
Alert rule:
increase(tiflash_schema_apply_count{type="failed"}[15m]) > 0
Description:
When the schema apply error occurs, an alert is triggered.
Solution:
The error might be caused by some wrong logic. Get support from PingCAP or the community.
TiFlash_schema_apply_duration
Alert rule:
histogram_quantile(0.99, sum(rate(tiflash_schema_apply_duration_seconds_bucket[1m])) BY (le, instance)) > 20
Description:
When the probability that the apply duration exceeds 20 seconds is over 99%, an alert is triggered.
Solution:
It might be caused by the internal problems of the TiFlash storage engine. Get support from PingCAP or the community.
TiFlash_raft_read_index_duration
Alert rule:
histogram_quantile(0.99, sum(rate(tiflash_raft_read_index_duration_seconds_bucket[1m])) BY (le, instance)) > 3
Description:
When the probability that the read index duration exceeds 3 seconds is over 99%, an alert is triggered.
Solution:
The frequent retries might be caused by frequent splitting or migration of the TiKV cluster. You can check the TiKV cluster status to identify the retry reason.
TiFlash_raft_wait_index_duration
Alert rule:
histogram_quantile(0.99, sum(rate(tiflash_raft_wait_index_duration_seconds_bucket[1m])) BY (le, instance)) > 2
Description:
When the probability that the waiting time for Region Raft Index in TiFlash exceeds 2 seconds is over 99%, an alert is triggered.
Solution:
It might be caused by a communication error between TiKV and the proxy. Get support from PingCAP or the community.