Key Monitoring Metrics of TiKV

If you use TiDB Ansible to deploy the TiDB cluster, the monitoring system is deployed at the same time. For more information, see Overview of the Monitoring Framework.

The Grafana dashboard is divided into a series of sub dashboards which include Overview, PD, TiDB, TiKV, Node_exporter, Disk Performance, and so on. A lot of metrics are there to help you diagnose.

You can get an overview of the component TiKV status from the TiKV dashboard, where the key metrics are displayed. This document provides a detailed description of these key metrics.

Key metrics description

To understand the key metrics displayed on the Overview dashboard, check the following table:

Service	Panel name	Description	Normal range
Cluster	Store size	The storage size per TiKV instance
Cluster	Available size	The available capacity per TiKV instance
Cluster	Capacity size	The capacity size per TiKV instance
Cluster	CPU	The CPU usage per TiKV instance
Cluster	Memory	The memory usage per TiKV instance
Cluster	IO utilization	The I/O utilization per TiKV instance
Cluster	MBps	The total bytes of read and write in each TiKV instance
Cluster	QPS	The QPS per command in each TiKV instance
Cluster	Errors-gRPC	The total number of gRPC message failures
Cluster	Leaders	The number of leaders per TiKV instance
Cluster	Regions	The number of Regions per TiKV instance
Errors	Server is busy	Indicates occurrences of events that make the TiKV instance unavailable temporarily, such as Write Stall, Channel Full, Scheduler Busy, and Coprocessor Full
Errors	Server message failures	The number of failed messages between TiKV instances	It should be `0` in normal case.
Errors	Raftstore errors	The number of Raftstore errors per type on each TiKV instance
Errors	Scheduler errors	The number of scheduler errors per type on each TiKV instance
Errors	Coprocessor errors	The number of coprocessor errors per type on each TiKV instance
Errors	gRPC message errors	The number of gRPC message errors per type on each TiKV instance
Errors	Leader drop	The count of dropped leaders per TiKV instance
Errors	Leader missing	The count of missing leaders per TiKV instance
Server	Leaders	The number of leaders per TiKV instance
Server	Regions	The number of Regions per TiKV instance
Server	CF size	The size of each column family
Server	Store size	The storage size per TiKV instance
Server	Channel full	The number of Channel Full errors per TiKV instance	It should be `0` in normal case.
Server	Server message failures	The number of failed messages between TiKV instances
Server	Average Region written keys	The average rate of written keys to Regions per TiKV instance
Server	Average Region written bytes	The average rate of writing bytes to Regions per TiKV instance
Server	Active written leaders	The number of leaders being written on each TiKV instance
Server	Approximate Region size	The approximate Region size
Raft IO	Apply log duration	The time consumed for Raft to apply logs
Raft IO	Apply log duration per server	The time consumed for Raft to apply logs per TiKV instance
Raft IO	Append log duration	The time consumed for Raft to append logs
Raft IO	Append log duration per server	The time consumed for Raft to append logs per TiKV instance
Raft process	Ready handled	The count of handled ready buckets per region
Raft process	Process ready duration per server	The time consumed for peer processes to be ready in Raft	It should be less than `2s` (P99.99).
Raft process	Process tick duration per server	The peer processes in Raft
Raft process	99% Duration of raftstore events	The time consumed by raftstore events (P99)
Raft message	Sent messages per server	The number of Raft messages sent by each TiKV instance
Raft message	Flush messages per server	The number of Raft messages flushed by each TiKV instance
Raft message	Receive messages per server	The number of Raft messages received by each TiKV instance
Raft message	Messages	The number of Raft messages sent per type
Raft message	Vote	The number of Vote messages sent in Raft
Raft message	Raft dropped messages	The number of dropped Raft messages per type
Raft proposal	Raft proposals per ready	The number of Raft proposals of all Regions per ready handled bucket
Raft proposal	Raft read/write proposals	The number of proposals per type
Raft proposal	Raft read proposals per server	The number of read proposals made by each TiKV instance
Raft proposal	Raft write proposals per server	The number of write proposals made by each TiKV instance
Raft proposal	Proposal wait duration	The wait time of each proposal
Raft proposal	Proposal wait duration per server	The wait time of each proposal per TiKV instance
Raft proposal	Raft log speed	The rate at which peers propose logs
Raft admin	Admin proposals	The number of admin proposals
Raft admin	Admin apply	The number of processed apply commands
Raft admin	Check split	The number of raftstore split checks
Raft admin	99.99% Check split duration	The time consumed when running split checks (P99.99)
Local reader	Local reader requests	The number of total requests and the number of rejections from the local read thread
Local reader	Local read requests duration	The wait time of local read requests
Local reader	Local read requests batch size	The batch size of local read requests
Storage	Storage command total	The total number of received commands per type
Storage	Storage async request error	The total number of engine asynchronous request errors
Storage	Storage async snapshot duration	The time consumed by processing asynchronous snapshot requests	It should be less than `1s` in `.99`.
Storage	Storage async write duration	The time consumed by processing asynchronous write requests	It should be less than `1s` in `.99`.
Scheduler	Scheduler stage total	The total number of commands at each stage	There should not be lots of errors in a short time.
Scheduler	Scheduler priority commands	The count of different priority commands
Scheduler	Scheduler pending commands	The count of pending commands per TiKV instance
Scheduler - XX	Scheduler stage total	The total number of commands at each stage when executing the batch_get command	There should not be lots of errors in a short time.
Scheduler - XX	Scheduler command duration	The time consumed when executing the batch_get command	It should be less than `1s`.
Scheduler - XX	Scheduler latch wait duration	The wait time caused by latch when executing the batch_get command	It should be less than `1s`.
Scheduler - XX	Scheduler keys read	The count of keys read by a batch_get command
Scheduler - XX	Scheduler keys written	The count of keys written by a batch_get command
Scheduler - XX	Scheduler scan details	The keys scan details of each CF when executing the batch_get command
Scheduler - XX	Scheduler scan details [lock]	The keys scan details of lock CF when executing the batch_get command
Scheduler - XX	Scheduler scan details [write]	The keys scan details of write CF when executing the batch_get command
Scheduler - XX	Scheduler scan details [default]	The keys scan details of default CF when executing the batch_get command
Coprocessor	Request duration	The time consumed to handle coprocessor read requests
Coprocessor	Wait duration	The time consumed when coprocessor requests are waiting to be handled	It should be less than `10s` (P99.99).
Coprocessor	Processing duration	The time consumed to handle coprocessor requests
Coprocessor	95% Request duration by store	The time consumed to handle coprocessor read requests per TiKV instance (P95)
Coprocessor	95% Wait duration by store	The time consumed when coprocessor requests are waiting to be handled per TiKV instance (P95)
Coprocessor	95% Handling duration by store	The time consumed to handle coprocessor requests per TiKV instance (P95)
Coprocessor	Request errors	The total number of the push down request errors	There should not be lots of errors in a short time.
Coprocessor	DAG executors	The total number of DAG executors
Coprocessor	Scan keys	The number of keys that each request scans
Coprocessor	Scan details	The scan details for each CF
Coprocessor	Table Scan - Details by CF	The table scan details for each CF
Coprocessor	Index Scan - Details by CF	The index scan details for each CF
Coprocessor	Table Scan - Perf Statistics	The total number of RocksDB internal operations from PerfContext when executing table scan
Coprocessor	Index Scan - Perf Statistics	The total number of RocksDB internal operations from PerfContext when executing index scan
GC	MVCC versions	The number of versions for each key
GC	MVCC deleted versions	The number of versions deleted by GC for each key
GC	GC tasks	The count of GC tasks processed by gc_worker
GC	GC tasks Duration	The time consumed when executing GC tasks
GC	GC keys (write CF)	The count of keys in write CF affected during GC
GC	TiDB GC actions result	The TiDB GC action result on Region level
GC	TiDB GC worker actions	The count of TiDB GC worker actions
GC	TiDB GC seconds	The GC duration
GC	TiDB GC failure	The count of failed TiDB GC jobs
GC	GC lifetime	The lifetime of TiDB GC
GC	GC interval	The interval of TiDB GC
Snapshot	Rate snapshot message	The rate at which Raft snapshot messages are sent
Snapshot	99% Handle snapshot duration	The time consumed to handle snapshots (P99)
Snapshot	Snapshot state count	The number of snapshots per state
Snapshot	99.99% Snapshot size	The snapshot size (P99.99)
Snapshot	99.99% Snapshot KV count	The number of KV within a snapshot (P99.99)
Task	Worker handled tasks	The number of tasks handled by worker
Task	Worker pending tasks	Current number of pending and running tasks of worker	It should be less than `1000`.
Task	FuturePool handled tasks	The number of tasks handled by future_pool
Task	FuturePool pending tasks	Current number of pending and running tasks of future_pool
Thread CPU	Raft store CPU	The CPU utilization of the raftstore thread	The CPU usage should be less than `80%`.
Thread CPU	Async apply CPU	The CPU utilization of async apply	The CPU usage should be less than `90%`.
Thread CPU	Scheduler CPU	The CPU utilization of scheduler	The CPU usage should be less than `80%`.
Thread CPU	Scheduler Worker CPU	The CPU utilization of scheduler worker
Thread CPU	Storage ReadPool CPU	The CPU utilization of readpool
Thread CPU	Coprocessor CPU	The CPU utilization of coprocessor
Thread CPU	Snapshot worker CPU	The CPU utilization of snapshot worker
Thread CPU	Split check CPU	The CPU utilization of split check
Thread CPU	RocksDB CPU	The CPU utilization of RocksDB
Thread CPU	gRPC poll CPU	The CPU utilization of gRPC	The CPU usage should be less than `80%`.
RocksDB - XX	Get operations	The count of get operations
RocksDB - XX	Get duration	The time consumed when executing get operations
RocksDB - XX	Seek operations	The count of seek operations
RocksDB - XX	Seek duration	The time consumed when executing seek operations
RocksDB - XX	Write operations	The count of write operations
RocksDB - XX	Write duration	The time consumed when executing write operations
RocksDB - XX	WAL sync operations	The count of WAL sync operations
RocksDB - XX	WAL sync duration	The time consumed when executing WAL sync operations
RocksDB - XX	Compaction operations	The count of compaction and flush operations
RocksDB - XX	Compaction duration	The time consumed when executing the compaction and flush operations
RocksDB - XX	SST read duration	The time consumed when reading SST files
RocksDB - XX	Write stall duration	Write stall duration	It should be `0` in normal case.
RocksDB - XX	Memtable size	The memtable size of each column family
RocksDB - XX	Memtable hit	The hit rate of memtable
RocksDB - XX	Block cache size	The block cache size. Broken down by column family if shared block cache is disabled.
RocksDB - XX	Block cache hit	The hit rate of block cache
RocksDB - XX	Block cache flow	The flow rate of block cache operations per type
RocksDB - XX	Block cache operations	The count of block cache operations per type
RocksDB - XX	Keys flow	The flow rate of operations on keys per type
RocksDB - XX	Total keys	The count of keys in each column family
RocksDB - XX	Read flow	The flow rate of read operations per type
RocksDB - XX	Bytes / Read	The bytes per read operation
RocksDB - XX	Write flow	The flow rate of write operations per type
RocksDB - XX	Bytes / Write	The bytes per write operation
RocksDB - XX	Compaction flow	The flow rate of compaction operations per type
RocksDB - XX	Compaction pending bytes	The pending bytes to be compacted
RocksDB - XX	Read amplification	The read amplification per TiKV instance
RocksDB - XX	Compression ratio	The compression ratio of each level
RocksDB - XX	Number of snapshots	The number of snapshots per TiKV instance
RocksDB - XX	Oldest snapshots duration	The time that the oldest unreleased snapshot survivals
RocksDB - XX	Number files at each level	The number of SST files for different column families in each level
RocksDB - XX	Ingest SST duration seconds	The time consumed to ingest SST files
RocksDB - XX	Stall conditions changed of each CF	Stall conditions changed of each column family
gRPC	gRPC messages	The count of gRPC messages per type
gRPC	gRPC message failed	The count of failed gRPC messages per type
gRPC	99% gRPC message duration	The gRPC message duration per message type (P99)
gRPC	gRPC GC message count	The count of gRPC GC messages
gRPC	99% gRPC KV GC message duration	The execution time of gRPC GC messages (P99)
PD	PD requests	The count of requests that TiKV sends to PD
PD	PD request duration (average)	The time consumed by requests that TiKV sends to PD
PD	PD heartbeats	The total number of PD heartbeat messages
PD	PD validated peers	The total number of peers validated by the PD worker