# Key Metrics

If you use TiDB Ansible to deploy the TiDB cluster, the monitoring system is deployed at the same time. For more information, see TiDB Monitoring Framework Overview.

The Grafana dashboard is divided into a series of sub-dashboards, including Overview, PD, TiDB, TiKV, Node_exporter, and Disk Performance. These sub-dashboards contain a large number of metrics that can help you diagnose cluster issues.

For routine operations, you can get an overview of the status of each component (PD, TiDB, TiKV) and of the entire cluster from the Overview dashboard, where the key metrics are displayed. This document provides a detailed description of these key metrics.
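These panels are driven by the Prometheus instance that is deployed together with the cluster, so any metric shown on the Overview dashboard can also be queried directly through the Prometheus HTTP API. The following is a minimal Python sketch that fetches the per-instance QPS behind the `QPS By Instance` panel; the Prometheus address and the exact PromQL expression are assumptions based on a default deployment and may differ in your environment.

```python
import requests  # third-party HTTP client: pip install requests

# Assumptions: Prometheus listens on the default port 9090 of the monitoring
# host, and the "QPS By Instance" panel is backed by tidb_server_query_total.
PROMETHEUS_URL = "http://127.0.0.1:9090/api/v1/query"
QUERY = 'sum(rate(tidb_server_query_total[1m])) by (instance)'

resp = requests.get(PROMETHEUS_URL, params={"query": QUERY}, timeout=5)
resp.raise_for_status()

# An instant-query result is a vector: one series per TiDB instance,
# each carrying a [timestamp, value-as-string] pair.
for series in resp.json()["data"]["result"]:
    instance = series["metric"].get("instance", "<unknown>")
    _, qps = series["value"]
    print(f"{instance}: {float(qps):.1f} qps")
```

The same pattern works for any panel listed below: look up the panel's expression in Grafana (panel menu > Edit) and substitute it for `QUERY`.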

## Key metrics description

To understand the key metrics displayed on the Overview dashboard, check the following table:

| Service | Panel Name | Description | Normal Range |
| :------ | :--------- | :---------- | :----------- |
| Services Port Status | Services Online | the number of online nodes for each service | |
| Services Port Status | Services Offline | the number of offline nodes for each service | |
| PD | Storage Capacity | the total storage capacity of the TiDB cluster | |
| PD | Current Storage Size | the occupied storage capacity of the TiDB cluster | |
| PD | Number of Regions | the total number of Regions in the current cluster | |
| PD | Leader Balance Ratio | the difference in the leader ratio between the node with the highest leader ratio and the node with the lowest leader ratio | Less than 5% in a balanced situation; it becomes bigger when you restart a node. |
| PD | Region Balance Ratio | the difference in the Region ratio between the node with the highest Region ratio and the node with the lowest Region ratio | Less than 5% in a balanced situation; it becomes bigger when you add or remove a node. |
| PD | Store Status -- Up Stores | the number of TiKV nodes that are up | |
| PD | Store Status -- Disconnect Stores | the number of TiKV nodes that encounter abnormal communication within a short time | |
| PD | Store Status -- LowSpace Stores | the number of TiKV nodes with less than 20% of available space | |
| PD | Store Status -- Down Stores | the number of TiKV nodes that are down | The normal value is 0. If the number is greater than 0, some node(s) are abnormal. |
| PD | Store Status -- Offline Stores | the number of TiKV nodes (still providing service) that are being taken offline | |
| PD | Store Status -- Tombstone Stores | the number of TiKV nodes that have been successfully taken offline | |
| PD | 99% completed_cmds_duration_seconds | the 99th percentile duration to complete a pd-server request | Less than 5 ms. |
| PD | handle_requests_duration_seconds | the duration of a PD request | |
| PD | Region health | the state of each Region | Generally, the number of pending peers is less than 100, and the number of missing peers cannot always be greater than 0. |
| PD | Hot write Region's leader distribution | the total number of leaders that are write hotspots on each TiKV instance | |
| PD | Hot read Region's leader distribution | the total number of leaders that are read hotspots on each TiKV instance | |
| PD | Region heartbeat report | the count of heartbeats reported to PD per instance | |
| PD | 99% Region heartbeat latency | the heartbeat latency per TiKV instance (P99) | |
| TiDB | Statement OPS | the number of SQL statements executed per second, including SELECT, INSERT, UPDATE, and so on | |
| TiDB | Duration | the execution time of SQL statements | |
| TiDB | QPS By Instance | the QPS on each TiDB instance | |
| TiDB | Failed Query OPM | the number of failed SQL statements per minute, including syntax errors, key conflicts, and so on | |
| TiDB | Connection Count | the number of connections to each TiDB instance | |
| TiDB | Heap Memory Usage | the size of heap memory used by each TiDB instance | |
| TiDB | Transaction OPS | the number of transactions executed per second | |
| TiDB | Transaction Duration | the execution time of a transaction | |
| TiDB | KV Cmd OPS | the number of executed KV commands | |
| TiDB | KV Cmd Duration 99 | the 99th percentile execution time of KV commands | |
| TiDB | PD TSO OPS | the number of TSOs that TiDB obtains from PD per second | |
| TiDB | PD TSO Wait Duration | the time TiDB spends waiting for PD to return a TSO | |
| TiDB | TiClient Region Error OPS | the number of Region-related errors returned by TiKV | |
| TiDB | Lock Resolve OPS | the number of transaction-related conflicts | |
| TiDB | Load Schema Duration | the time TiDB takes to load the schema from TiKV | |
| TiDB | KV Backoff OPS | the number of errors returned by TiKV (such as transaction conflicts) | |
| TiKV | leader | the number of leaders on each TiKV node | |
| TiKV | region | the number of Regions on each TiKV node | |
| TiKV | CPU | the CPU usage ratio on each TiKV node | |
| TiKV | Memory | the memory usage on each TiKV node | |
| TiKV | store size | the amount of data stored on each TiKV node | |
| TiKV | cf size | the amount of data in each column family (CF) in the cluster | |
| TiKV | channel full | No data points are displayed in normal conditions. If a value is displayed, it means the corresponding TiKV node fails to handle messages in time. | |
| TiKV | server report failures | No data points are displayed in normal conditions. If `Unreachable` is displayed, it means TiKV has encountered a communication issue. | |
| TiKV | scheduler pending commands | the number of pending commands in the queue | Occasional value peaks are normal. |
| TiKV | coprocessor pending requests | the number of pending requests in the queue | 0, or very small. |
| TiKV | coprocessor executor count | the number of each type of query operation | |
| TiKV | coprocessor request duration | the time consumed by TiKV queries | |
| TiKV | raft store CPU | the CPU usage ratio of the raftstore thread | The default number of threads is 2 (configured by `raftstore.store-pool-size`). A value of over 80% for a single thread indicates very high CPU usage. |
| TiKV | Coprocessor CPU | the CPU usage ratio of the TiKV query threads; it is related to the application, and complex queries consume a great deal of CPU | |
| System Info | Vcores | the number of CPU cores | |
| System Info | Memory | the total memory | |
| System Info | CPU Usage | the CPU usage ratio, 100% at a maximum | |
| System Info | Load [1m] | the system load average within 1 minute | |
| System Info | Memory Available | the size of available memory | |
| System Info | Network Traffic | the statistics of network traffic | |
| System Info | TCP Retrans | the frequency of TCP retransmissions | |
| System Info | IO Util | the disk usage ratio, 100% at a maximum; generally, you need to consider adding a new node when the usage reaches 80% to 90% | |
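The `Store Status` panels in particular can also be cross-checked against PD itself, which serves store states over its HTTP API. Below is a minimal sketch, assuming a PD server reachable on the default client port 2379; the `state_name` values it counts correspond to the Up/Disconnected/Down/Offline/Tombstone classification in the table above.

```python
from collections import Counter

import requests  # third-party HTTP client: pip install requests

# Assumption: a PD server is reachable on the default client port 2379.
PD_STORES_URL = "http://127.0.0.1:2379/pd/api/v1/stores"

resp = requests.get(PD_STORES_URL, timeout=5)
resp.raise_for_status()

# Each store entry carries a human-readable state such as "Up",
# "Disconnected", "Down", "Offline", or "Tombstone".
states = Counter(
    entry["store"].get("state_name", "Unknown")
    for entry in resp.json().get("stores", [])
)

for state, count in sorted(states.items()):
    print(f"{state}: {count}")

# Per the table above, any Down store is abnormal and worth investigating.
if states.get("Down", 0) > 0:
    print("WARNING: some TiKV node(s) are down")
```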

## Interface of the Overview dashboard

(Figure: the Overview dashboard)