Key Monitoring Metrics of PD

If you use TiDB Ansible to deploy the TiDB cluster, the monitoring system is deployed at the same time. For more information, see Overview of the Monitoring Framework.

The Grafana dashboard is divided into a series of sub-dashboards, including Overview, PD, TiDB, TiKV, Node_exporter, Disk Performance, and so on. These dashboards provide a large number of metrics to help you diagnose cluster issues.

You can get an overview of the status of the PD component from the PD dashboard, where the key metrics are displayed. This document provides a detailed description of these key metrics.

Key metrics description

To understand the key metrics displayed on the PD dashboard, check the following table:

| Service | Panel name | Description | Normal range |
| ------- | ---------- | ----------- | ------------ |
| Cluster | PD role | The role of the current PD | |
| Cluster | Storage capacity | The total capacity size of the cluster | |
| Cluster | Current storage size | The current storage size of the cluster | |
| Cluster | Current storage usage | The current storage usage of the cluster | |
| Cluster | Normal stores | The count of healthy stores | |
| Cluster | Abnormal stores | The count of unhealthy stores | The normal value is 0. If the number is greater than 0, it means at least one instance is abnormal. |
| Cluster | Current peer count | The current peer count of the cluster | |
| Cluster | Number of Regions | The total number of Regions of the cluster | |
| Cluster | PD scheduler config | The list of PD scheduler configurations | |
| Cluster | Region label isolation level | The number of Regions in different label levels | |
| Cluster | Label distribution | The distribution status of the labels in the cluster | |
| Cluster | pd_cluster_metadata | The metadata of the PD cluster, including the cluster ID, the timestamp, and the generated ID | |
| Cluster | Region health | The health status of Regions, indicated by the count of abnormal Regions, including pending peers, down peers, extra peers, offline peers, missing peers, learner peers, and incorrect namespaces (see the check sketch after this table) | The number of pending peers should be less than 100. The number of missing peers should not be persistently greater than 0. |
| Statistics - Balance | Store capacity | The capacity size per TiKV instance | |
| Statistics - Balance | Store available | The available capacity size per TiKV instance | |
| Statistics - Balance | Store used | The used capacity size per TiKV instance | |
| Statistics - Balance | Size amplification | The size amplification ratio per TiKV instance, which is equal to (Store Region size)/(Store used capacity size) (see the ratio sketch after this table) | |
| Statistics - Balance | Size available ratio | The size available ratio per TiKV instance, which is equal to (Store available capacity size)/(Store capacity size) | |
| Statistics - Balance | Store leader score | The leader score per TiKV instance | |
| Statistics - Balance | Store Region score | The Region score per TiKV instance | |
| Statistics - Balance | Store leader size | The total leader size per TiKV instance | |
| Statistics - Balance | Store Region size | The total Region size per TiKV instance | |
| Statistics - Balance | Store leader count | The leader count per TiKV instance | |
| Statistics - Balance | Store Region count | The Region count per TiKV instance | |
| Statistics - Hotspot | Leader distribution in hot write Regions | The total number of leader Regions in hot write on each TiKV instance | |
| Statistics - Hotspot | Peer distribution in hot write Regions | The total number of peer Regions in hot write on each TiKV instance | |
| Statistics - Hotspot | Leader written bytes in hot write Regions | The total bytes of hot write on leader Regions per TiKV instance | |
| Statistics - Hotspot | Peer written bytes in hot write Regions | The total bytes of hot write on peer Regions per TiKV instance | |
| Statistics - Hotspot | Leader distribution in hot read Regions | The total number of leader Regions in hot read on each TiKV instance | |
| Statistics - Hotspot | Peer distribution in hot read Regions | The total number of non-leader peer Regions in hot read on each TiKV instance | |
| Statistics - Hotspot | Leader read bytes in hot read Regions | The total bytes of hot read on leader Regions per TiKV instance | |
| Statistics - Hotspot | Peer read bytes in hot read Regions | The total bytes of hot read on peer Regions per TiKV instance | |
| Scheduler | Running schedulers | The current running schedulers | |
| Scheduler | Balance leader movement | The leader movement details among TiKV instances | |
| Scheduler | Balance Region movement | The Region movement details among TiKV instances | |
| Scheduler | Balance leader event | The count of balance leader events | |
| Scheduler | Balance Region event | The count of balance Region events | |
| Scheduler | Balance leader scheduler | The inner status of the balance leader scheduler | |
| Scheduler | Balance Region scheduler | The inner status of the balance Region scheduler | |
| Scheduler | Namespace checker | The namespace checker's status | |
| Scheduler | Replica checker | The replica checker's status | |
| Scheduler | Region merge checker | The merge checker's status | |
| Operator | Schedule operator create | The number of newly created operators per type | |
| Operator | Schedule operator check | The number of checked operators per type. It mainly checks whether the current step is finished; if yes, it returns the next step to be executed | |
| Operator | Schedule operator finish | The number of finished operators per type | |
| Operator | Schedule operator timeout | The number of timeout operators per type | |
| Operator | Schedule operator replaced or canceled | The number of replaced or canceled operators per type | |
| Operator | Schedule operators count by state | The number of operators per state | |
| Operator | 99% Operator finish duration | The operator finish duration (P99) | |
| Operator | 50% Operator finish duration | The operator finish duration (P50) | |
| Operator | 99% Operator step duration | The operator step duration (P99) | |
| Operator | 50% Operator step duration | The operator step duration (P50) | |
| gRPC | Completed commands rate | The rate at which gRPC commands are completed, per command type | |
| gRPC | 99% Completed commands duration | The time consumed to complete gRPC commands, per command type (P99) | |
| etcd | Transaction handling rate | The rate at which etcd handles transactions | |
| etcd | 99% transactions duration | The time consumed for etcd to handle transactions (P99) | |
| etcd | 99% WAL fsync duration | The time consumed for writing WAL into the persistent storage (P99) (see the Prometheus sketch after this table) | The value is less than 1s. |
| etcd | 99% Peer round trip time seconds | The network latency for etcd (P99) | The value is less than 1s. |
| etcd | etcd disk wal fsync rate | The rate of writing WAL into the persistent storage | |
| etcd | Raft term | The current term of Raft | |
| etcd | Raft committed index | The last committed index of Raft | |
| etcd | Raft applied index | The last applied index of Raft | |
| TiDB | Handled requests count | The count of TiDB requests | |
| TiDB | Request handling duration | The time consumed for handling TiDB requests | It should be less than 100ms (P99). |
| Heartbeat | Region heartbeat report | The count of heartbeats reported to PD per instance | |
| Heartbeat | Region heartbeat report error | The count of heartbeats with the error status | |
| Heartbeat | Region heartbeat report active | The count of heartbeats with the ok status | |
| Heartbeat | Region schedule push | The count of corresponding schedule commands sent from PD per TiKV instance | |
| Heartbeat | 99% Region heartbeat latency | The heartbeat latency per TiKV instance (P99) | |
| Region storage | Syncer index | The maximum index in the Region change history recorded by the leader | |
| Region storage | History last index | The last index where the Region change history is synchronized successfully with the follower | |
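
For the Region health thresholds above, you can poll the same Region check information that the `region check` subcommand of pd-ctl reads. The following is a minimal sketch, assuming PD serves its HTTP API at `http://127.0.0.1:2379` and that the `/pd/api/v1/regions/check/<state>` routes return a JSON body with a `count` field; verify the address, routes, and response shape against your PD version.

```python
import json
import urllib.request

# Assumed PD client address; change to one of your PD endpoints.
PD = "http://127.0.0.1:2379"

# States mirrored from the Region health panel; the limits follow the
# normal ranges documented in the table above.
CHECKS = {
    "pending-peer": 100,  # normal range: less than 100
    "miss-peer": 1,       # normal range: not *persistently* >= 1; a
                          # one-off snapshot can only flag it for review
}

for state, limit in CHECKS.items():
    url = f"{PD}/pd/api/v1/regions/check/{state}"
    with urllib.request.urlopen(url) as resp:
        count = json.load(resp).get("count", 0)
    status = "OK" if count < limit else "CHECK"
    print(f"{state}: {count} Regions [{status}]")
```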
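
The two derived ratios in the Statistics - Balance rows are plain quotients of per-store figures. The sketch below only restates those formulas in code; all the numbers are made up for illustration and are not read from PD.

```python
def size_amplification(region_size_bytes: float, used_bytes: float) -> float:
    """Size amplification = (Store Region size) / (Store used capacity size)."""
    return region_size_bytes / used_bytes

def size_available_ratio(available_bytes: float, capacity_bytes: float) -> float:
    """Size available ratio = (Store available capacity size) / (Store capacity size)."""
    return available_bytes / capacity_bytes

# Illustrative figures for a single TiKV store (hypothetical, not real PD output).
GiB = 1 << 30
store_capacity = 500 * GiB
store_available = 350 * GiB
store_used = store_capacity - store_available  # simplification: ignores non-TiKV data
store_region_size = 180 * GiB                  # sum of Region sizes on this store

print(f"size amplification:   {size_amplification(store_region_size, store_used):.2f}")
print(f"size available ratio: {size_available_ratio(store_available, store_capacity):.2f}")
```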
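
As a rough way to check the WAL fsync threshold outside Grafana, the sketch below queries Prometheus for the same P99 value the "99% WAL fsync duration" panel plots. It assumes a Prometheus server at `http://127.0.0.1:9090` that scrapes PD, the standard etcd histogram `etcd_disk_wal_fsync_duration_seconds`, and a 5-minute rate window; adjust all three to your deployment.

```python
import json
import urllib.parse
import urllib.request

# Assumed Prometheus address; change to your monitoring host.
PROM = "http://127.0.0.1:9090"

# P99 of the etcd WAL fsync duration per PD instance over a 5-minute window.
QUERY = (
    "histogram_quantile(0.99, "
    "sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le))"
)

url = f"{PROM}/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
with urllib.request.urlopen(url) as resp:
    result = json.load(resp)["data"]["result"]

for series in result:
    instance = series["metric"].get("instance", "?")
    p99 = float(series["value"][1])
    # Documented normal range: the P99 value stays below 1 second.
    status = "OK" if p99 < 1.0 else "SLOW"
    print(f"{instance}: p99 fsync = {p99:.3f}s [{status}]")
```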

PD dashboard interface

Cluster

PD Dashboard - Cluster metrics

Statistics - Balance

PD Dashboard - Statistics - Balance metrics

Statistics - Hotspot

PD Dashboard - Statistics - Hotspot metrics

Scheduler

PD Dashboard - Scheduler metrics

Operator

PD Dashboard - Operator metrics

gRPC

PD Dashboard - gRPC metrics

etcd

PD Dashboard - etcd metrics

TiDB

PD Dashboard - TiDB metrics

Heartbeat

PD Dashboard - Heartbeat metrics

Region storage

PD Dashboard - Region storage