Key Monitoring Metrics of PD

If you use TiDB Ansible to deploy the TiDB cluster, the monitoring system is deployed at the same time. For more information, see Overview of the Monitoring Framework.

The Grafana dashboard is divided into a series of sub-dashboards, including Overview, PD, TiDB, TiKV, Node_exporter, and Disk Performance. These dashboards provide a lot of metrics to help you diagnose cluster problems.

You can get an overview of the status of the PD component from the PD dashboard, where the key metrics are displayed. This document provides a detailed description of these key metrics.

Key metrics description

To understand the key metrics displayed on the PD dashboard, check the following table:

| Service | Panel name | Description | Normal range |
| :------ | :--------- | :---------- | :----------- |
| Cluster | PD role | Whether the current PD instance is the leader or a follower | |
| Cluster | Storage capacity | The total storage capacity of the cluster | |
| Cluster | Current storage size | The current storage size of the cluster | |
| Cluster | Number of Regions | The total number of Regions, not counting replicas | |
| Cluster | Leader balance ratio | The difference in leader ratio between the instance with the largest leader ratio and the instance with the smallest | Less than 5% in a balanced situation; it becomes larger when you restart an instance |
| Cluster | Region balance ratio | The difference in Region ratio between the instance with the largest Region ratio and the instance with the smallest | Less than 5% in a balanced situation; it becomes larger when you add or remove an instance |
| Cluster | Normal stores | The number of healthy stores | |
| Cluster | Abnormal stores | The number of unhealthy stores | The normal value is 0. A value greater than 0 means at least one instance is abnormal. |
| Cluster | Current storage usage | The current storage size and used ratio of the cluster | |
| Cluster | Current peer count | The current peer count of the cluster | |
| Cluster | Metadata information | The cluster ID, the last ID generated by the allocator, and the last timestamp generated by TSO | |
| Cluster | Region label isolation level | The number of Regions at each label level | |
| Cluster | Region health | The number of unusual Regions, which may include pending peers, down peers, extra peers, offline peers, missing peers, learner peers, or incorrect namespaces | The number of pending peers should be less than 100; the number of missing peers should not be persistently greater than 0 |
| Balance | Store capacity | The storage capacity of each TiKV instance | |
| Balance | Store available | The available storage capacity of each TiKV instance | |
| Balance | Store used | The used storage capacity of each TiKV instance | |
| Balance | Size amplification | The Store Region size divided by the Store used capacity size, for each TiKV instance | |
| Balance | Size available ratio | The Store available capacity size divided by the Store capacity size, for each TiKV instance | |
| Balance | Store leader score | The leader score of each TiKV instance | |
| Balance | Store Region score | The Region score of each TiKV instance | |
| Balance | Store leader size | The total leader size of each TiKV instance | |
| Balance | Store Region size | The total Region size of each TiKV instance | |
| Balance | Store leader count | The leader count of each TiKV instance | |
| Balance | Store Region count | The Region count of each TiKV instance | |
| HotRegion | Hot write Region's leader distribution | The number of leader Regions under hot write on each TiKV instance | |
| HotRegion | Hot write Region's peer distribution | The number of non-leader Regions under hot write on each TiKV instance | |
| HotRegion | Hot write Region's leader written bytes | The total bytes written to hot leader Regions on each TiKV instance | |
| HotRegion | Hot write Region's peer written bytes | The total bytes written to hot non-leader Regions on each TiKV instance | |
| HotRegion | Hot read Region's leader distribution | The number of leader Regions under hot read on each TiKV instance | |
| HotRegion | Hot read Region's peer distribution | The number of non-leader Regions under hot read on each TiKV instance | |
| HotRegion | Hot read Region's leader read bytes | The total bytes read from hot leader Regions on each TiKV instance | |
| HotRegion | Hot read Region's peer read bytes | The total bytes read from hot non-leader Regions on each TiKV instance | |
| Scheduler | Scheduler is running | The currently running schedulers | |
| Scheduler | Balance leader movement | The details of leader movement among TiKV instances | |
| Scheduler | Balance Region movement | The details of Region movement among TiKV instances | |
| Scheduler | Balance leader event | The count of balance leader events | |
| Scheduler | Balance Region event | The count of balance Region events | |
| Scheduler | Balance leader scheduler | The inner status of the balance leader scheduler | |
| Scheduler | Balance Region scheduler | The inner status of the balance Region scheduler | |
| Scheduler | Namespace checker | The status of the namespace checker | |
| Scheduler | Replica checker | The status of the replica checker | |
| Scheduler | Region merge checker | The status of the merge checker | |
| Operator | Schedule operator create | The number of newly created operators of each type | |
| Operator | Schedule operator check | The number of checked operators of each type. A check mainly verifies whether the current step is finished; if it is, it returns the next step to be executed. | |
| Operator | Schedule operator finish | The number of finished operators of each type | |
| Operator | Schedule operator timeout | The number of timed-out operators of each type | |
| Operator | Schedule operator replaced or canceled | The number of replaced or canceled operators of each type | |
| Operator | Schedule operators count by state | The number of operators in each state | |
| Operator | 99% Operator finish duration | The time consumed to finish an operator (P99) | |
| Operator | 50% Operator finish duration | The time consumed to finish an operator (P50) | |
| Operator | 99% Operator step duration | The time consumed to finish an operator step (P99) | |
| Operator | 50% Operator step duration | The time consumed to finish an operator step (P50) | |
| gRPC | Completed commands rate | The completion rate of each kind of gRPC command | |
| gRPC | 99% Completed commands duration | The time consumed to complete each kind of gRPC command (P99) | |
| etcd | Handle transactions count | The count of etcd transactions | |
| etcd | 99% Handle transactions duration | The time consumed to handle etcd transactions (P99) | |
| etcd | 99% WAL fsync duration | The time consumed to write WAL into the persistent storage (P99) | Less than 1s |
| etcd | 99% Peer round trip time seconds | The network latency (P99) | Less than 1s |
| etcd | etcd disk wal fsync rate | The rate of writing WAL into the persistent storage | |
| etcd | Raft term | The current term of Raft | |
| etcd | Raft committed index | The last committed index of Raft | |
| etcd | Raft applied index | The last applied index of Raft | |
| TiDB | Handle requests count | The count of TiDB requests | |
| TiDB | Handle requests duration | The time consumed to handle TiDB requests | Less than 100ms (P99) |
| Heartbeat | Region heartbeat report | The count of heartbeats that each TiKV instance reports to PD | |
| Heartbeat | Region heartbeat report error | The count of heartbeats with the error status | |
| Heartbeat | Region heartbeat report active | The count of heartbeats with the ok status | |
| Heartbeat | Region schedule push | The count of schedule commands that PD sends to each TiKV instance | |
| Heartbeat | 99% Region heartbeat latency | The heartbeat latency of each TiKV instance (P99) | |
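Several of the Cluster and Balance panels above are simple ratios of per-store statistics. The following Python sketch illustrates the definitions of the leader balance ratio, size amplification, and size available ratio from the table; the store numbers are made up for illustration, whereas in a real deployment these values come from the monitoring system:

```python
# Illustrative calculations for three panels described above.
# The input numbers are hypothetical, not from a real cluster.

def leader_balance_ratio(leader_counts):
    """Difference between the largest and smallest per-store leader ratio.

    Each store's leader ratio is its share of all leaders in the cluster.
    A balanced cluster keeps this difference under roughly 5%.
    """
    total = sum(leader_counts)
    ratios = [count / total for count in leader_counts]
    return max(ratios) - min(ratios)

def size_amplification(region_size, used_size):
    """Store Region size divided by Store used capacity size."""
    return region_size / used_size

def size_available_ratio(available_size, capacity_size):
    """Store available capacity size divided by Store capacity size."""
    return available_size / capacity_size

if __name__ == "__main__":
    # Three hypothetical TiKV stores holding 120, 100, and 95 leaders.
    print(f"leader balance ratio: {leader_balance_ratio([120, 100, 95]):.3f}")
    # A store reporting 40 GiB of Region data on 25 GiB of used disk space.
    print(f"size amplification:   {size_amplification(40, 25):.2f}")
    # A store with 300 GiB available out of a 500 GiB disk.
    print(f"size available ratio: {size_available_ratio(300, 500):.2f}")
```

A leader balance ratio of about 0.079 (7.9%) in this example would be above the 5% guideline, suggesting leader distribution is still converging or an instance was recently restarted.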

PD dashboard interface

Cluster

PD Dashboard - Cluster metrics

Balance

PD Dashboard - Balance metrics

HotRegion

PD Dashboard - HotRegion metrics

Scheduler

PD Dashboard - Scheduler metrics

Operator

PD Dashboard - Operator metrics

gRPC

PD Dashboard - gRPC metrics

etcd

PD Dashboard - etcd metrics

TiDB

PD Dashboard - TiDB metrics

Heartbeat

PD Dashboard - Heartbeat metrics