Key Monitoring Metrics of TiKV

If you use TiDB Ansible to deploy the TiDB cluster, the monitoring system is deployed at the same time. For more information, see Overview of the Monitoring Framework.

The Grafana dashboard is divided into a series of sub dashboards, including Overview, PD, TiDB, TiKV, Node_exporter, and Disk Performance. These dashboards expose many metrics that help you diagnose problems in the cluster.

You can get an overview of the status of the TiKV component from the TiKV dashboard, where the key metrics are displayed. This document provides a detailed description of these key metrics.

Key metrics description

To understand the key metrics displayed on the TiKV dashboard, check the following table:

| Service | Panel name | Description | Normal range |
| --- | --- | --- | --- |
| Cluster | Store size | The storage size per TiKV instance | |
| Cluster | Available size | The available capacity per TiKV instance | |
| Cluster | Capacity size | The capacity size per TiKV instance | |
| Cluster | CPU | The CPU usage per TiKV instance | |
| Cluster | Memory | The memory usage per TiKV instance | |
| Cluster | IO utilization | The I/O utilization per TiKV instance | |
| Cluster | MBps | The total bytes read and written on each TiKV instance | |
| Cluster | QPS | The QPS per command on each TiKV instance | |
| Cluster | Errors-gRPC | The total number of gRPC message failures | |
| Cluster | Leaders | The number of leaders per TiKV instance | |
| Cluster | Regions | The number of Regions per TiKV instance | |
| Errors | Server is busy | Indicates occurrences of events that make the TiKV instance temporarily unavailable, such as Write Stall, Channel Full, Scheduler Busy, and Coprocessor Full | |
| Errors | Server message failures | The number of failed messages between TiKV instances | It should be 0 under normal conditions. |
| Errors | Raftstore errors | The number of Raftstore errors per type on each TiKV instance | |
| Errors | Scheduler errors | The number of scheduler errors per type on each TiKV instance | |
| Errors | Coprocessor errors | The number of coprocessor errors per type on each TiKV instance | |
| Errors | gRPC message errors | The number of gRPC message errors per type on each TiKV instance | |
| Errors | Leader drop | The count of dropped leaders per TiKV instance | |
| Errors | Leader missing | The count of missing leaders per TiKV instance | |
| Server | Leaders | The number of leaders per TiKV instance | |
| Server | Regions | The number of Regions per TiKV instance | |
| Server | CF size | The size of each column family | |
| Server | Store size | The storage size per TiKV instance | |
| Server | Channel full | The number of Channel Full errors per TiKV instance | It should be 0 under normal conditions. |
| Server | Server message failures | The number of failed messages between TiKV instances | |
| Server | Average Region written keys | The average rate of keys written to Regions per TiKV instance | |
| Server | Average Region written bytes | The average rate of bytes written to Regions per TiKV instance | |
| Server | Active written leaders | The number of leaders being written to on each TiKV instance | |
| Server | Approximate Region size | The approximate Region size | |
| Raft IO | Apply log duration | The time consumed for Raft to apply logs | |
| Raft IO | Apply log duration per server | The time consumed for Raft to apply logs per TiKV instance | |
| Raft IO | Append log duration | The time consumed for Raft to append logs | |
| Raft IO | Append log duration per server | The time consumed for Raft to append logs per TiKV instance | |
| Raft process | Ready handled | The count of handled ready buckets per Region | |
| Raft process | Process ready duration per server | The time consumed for peer processes to be ready in Raft | It should be less than 2s (P99.99). |
| Raft process | Process tick duration per server | The time consumed for peer processes to tick in Raft | |
| Raft process | 99% Duration of raftstore events | The time consumed by raftstore events (P99) | |
| Raft message | Sent messages per server | The number of Raft messages sent by each TiKV instance | |
| Raft message | Flush messages per server | The number of Raft messages flushed by each TiKV instance | |
| Raft message | Receive messages per server | The number of Raft messages received by each TiKV instance | |
| Raft message | Messages | The number of Raft messages sent per type | |
| Raft message | Vote | The number of Vote messages sent in Raft | |
| Raft message | Raft dropped messages | The number of dropped Raft messages per type | |
| Raft proposal | Raft proposals per ready | The number of Raft proposals of all Regions per ready handled bucket | |
| Raft proposal | Raft read/write proposals | The number of proposals per type | |
| Raft proposal | Raft read proposals per server | The number of read proposals made by each TiKV instance | |
| Raft proposal | Raft write proposals per server | The number of write proposals made by each TiKV instance | |
| Raft proposal | Proposal wait duration | The wait time of each proposal | |
| Raft proposal | Proposal wait duration per server | The wait time of each proposal per TiKV instance | |
| Raft proposal | Raft log speed | The rate at which peers propose logs | |
| Raft admin | Admin proposals | The number of admin proposals | |
| Raft admin | Admin apply | The number of processed apply commands | |
| Raft admin | Check split | The number of raftstore split checks | |
| Raft admin | 99.99% Check split duration | The time consumed when running split checks (P99.99) | |
| Local reader | Local reader requests | The total number of requests and the number of rejections from the local read thread | |
| Local reader | Local read requests duration | The wait time of local read requests | |
| Local reader | Local read requests batch size | The batch size of local read requests | |
| Storage | Storage command total | The total number of received commands per type | |
| Storage | Storage async request error | The total number of engine asynchronous request errors | |
| Storage | Storage async snapshot duration | The time consumed by processing asynchronous snapshot requests | It should be less than 1s (P99). |
| Storage | Storage async write duration | The time consumed by processing asynchronous write requests | It should be less than 1s (P99). |
| Scheduler | Scheduler stage total | The total number of commands at each stage | There should not be a large number of errors in a short time. |
| Scheduler | Scheduler priority commands | The count of commands per priority | |
| Scheduler | Scheduler pending commands | The count of pending commands per TiKV instance | |
| Scheduler - XX | Scheduler stage total | The total number of commands at each stage when executing the batch_get command | There should not be a large number of errors in a short time. |
| Scheduler - XX | Scheduler command duration | The time consumed when executing the batch_get command | It should be less than 1s. |
| Scheduler - XX | Scheduler latch wait duration | The wait time caused by latches when executing the batch_get command | It should be less than 1s. |
| Scheduler - XX | Scheduler keys read | The count of keys read by a batch_get command | |
| Scheduler - XX | Scheduler keys written | The count of keys written by a batch_get command | |
| Scheduler - XX | Scheduler scan details | The key scan details of each CF when executing the batch_get command | |
| Scheduler - XX | Scheduler scan details [lock] | The key scan details of the lock CF when executing the batch_get command | |
| Scheduler - XX | Scheduler scan details [write] | The key scan details of the write CF when executing the batch_get command | |
| Scheduler - XX | Scheduler scan details [default] | The key scan details of the default CF when executing the batch_get command | |
| Coprocessor | Request duration | The time consumed to handle coprocessor read requests | |
| Coprocessor | Wait duration | The time that coprocessor requests spend waiting to be handled | It should be less than 10s (P99.99). |
| Coprocessor | Processing duration | The time consumed to handle coprocessor requests | |
| Coprocessor | 95% Request duration by store | The time consumed to handle coprocessor read requests per TiKV instance (P95) | |
| Coprocessor | 95% Wait duration by store | The time that coprocessor requests spend waiting to be handled per TiKV instance (P95) | |
| Coprocessor | 95% Handling duration by store | The time consumed to handle coprocessor requests per TiKV instance (P95) | |
| Coprocessor | Request errors | The total number of push down request errors | There should not be a large number of errors in a short time. |
| Coprocessor | DAG executors | The total number of DAG executors | |
| Coprocessor | Scan keys | The number of keys that each request scans | |
| Coprocessor | Scan details | The scan details for each CF | |
| Coprocessor | Table Scan - Details by CF | The table scan details for each CF | |
| Coprocessor | Index Scan - Details by CF | The index scan details for each CF | |
| Coprocessor | Table Scan - Perf Statistics | The total number of RocksDB internal operations from PerfContext when executing a table scan | |
| Coprocessor | Index Scan - Perf Statistics | The total number of RocksDB internal operations from PerfContext when executing an index scan | |
| GC | MVCC versions | The number of versions for each key | |
| GC | MVCC deleted versions | The number of versions deleted by GC for each key | |
| GC | GC tasks | The count of GC tasks processed by gc_worker | |
| GC | GC tasks Duration | The time consumed when executing GC tasks | |
| GC | GC keys (write CF) | The count of keys in the write CF affected during GC | |
| GC | TiDB GC actions result | The TiDB GC action result at the Region level | |
| GC | TiDB GC worker actions | The count of TiDB GC worker actions | |
| GC | TiDB GC seconds | The GC duration | |
| GC | TiDB GC failure | The count of failed TiDB GC jobs | |
| GC | GC lifetime | The lifetime of TiDB GC | |
| GC | GC interval | The interval of TiDB GC | |
| Snapshot | Rate snapshot message | The rate at which Raft snapshot messages are sent | |
| Snapshot | 99% Handle snapshot duration | The time consumed to handle snapshots (P99) | |
| Snapshot | Snapshot state count | The number of snapshots per state | |
| Snapshot | 99.99% Snapshot size | The snapshot size (P99.99) | |
| Snapshot | 99.99% Snapshot KV count | The number of KV pairs within a snapshot (P99.99) | |
| Task | Worker handled tasks | The number of tasks handled by the worker | |
| Task | Worker pending tasks | The current number of pending and running tasks of the worker | It should be less than 1000. |
| Task | FuturePool handled tasks | The number of tasks handled by future_pool | |
| Task | FuturePool pending tasks | The current number of pending and running tasks of future_pool | |
| Thread CPU | Raft store CPU | The CPU utilization of the raftstore thread | The CPU usage should be less than 80%. |
| Thread CPU | Async apply CPU | The CPU utilization of async apply | The CPU usage should be less than 90%. |
| Thread CPU | Scheduler CPU | The CPU utilization of the scheduler | The CPU usage should be less than 80%. |
| Thread CPU | Scheduler Worker CPU | The CPU utilization of the scheduler worker | |
| Thread CPU | Storage ReadPool CPU | The CPU utilization of readpool | |
| Thread CPU | Coprocessor CPU | The CPU utilization of the coprocessor | |
| Thread CPU | Snapshot worker CPU | The CPU utilization of the snapshot worker | |
| Thread CPU | Split check CPU | The CPU utilization of split check | |
| Thread CPU | RocksDB CPU | The CPU utilization of RocksDB | |
| Thread CPU | gRPC poll CPU | The CPU utilization of gRPC | The CPU usage should be less than 80%. |
| RocksDB - XX | Get operations | The count of get operations | |
| RocksDB - XX | Get duration | The time consumed when executing get operations | |
| RocksDB - XX | Seek operations | The count of seek operations | |
| RocksDB - XX | Seek duration | The time consumed when executing seek operations | |
| RocksDB - XX | Write operations | The count of write operations | |
| RocksDB - XX | Write duration | The time consumed when executing write operations | |
| RocksDB - XX | WAL sync operations | The count of WAL sync operations | |
| RocksDB - XX | WAL sync duration | The time consumed when executing WAL sync operations | |
| RocksDB - XX | Compaction operations | The count of compaction and flush operations | |
| RocksDB - XX | Compaction duration | The time consumed when executing compaction and flush operations | |
| RocksDB - XX | SST read duration | The time consumed when reading SST files | |
| RocksDB - XX | Write stall duration | The write stall duration | It should be 0 under normal conditions. |
| RocksDB - XX | Memtable size | The memtable size of each column family | |
| RocksDB - XX | Memtable hit | The hit rate of the memtable | |
| RocksDB - XX | Block cache size | The block cache size, broken down by column family if the shared block cache is disabled | |
| RocksDB - XX | Block cache hit | The hit rate of the block cache | |
| RocksDB - XX | Block cache flow | The flow rate of block cache operations per type | |
| RocksDB - XX | Block cache operations | The count of block cache operations per type | |
| RocksDB - XX | Keys flow | The flow rate of operations on keys per type | |
| RocksDB - XX | Total keys | The count of keys in each column family | |
| RocksDB - XX | Read flow | The flow rate of read operations per type | |
| RocksDB - XX | Bytes / Read | The bytes per read operation | |
| RocksDB - XX | Write flow | The flow rate of write operations per type | |
| RocksDB - XX | Bytes / Write | The bytes per write operation | |
| RocksDB - XX | Compaction flow | The flow rate of compaction operations per type | |
| RocksDB - XX | Compaction pending bytes | The pending bytes to be compacted | |
| RocksDB - XX | Read amplification | The read amplification per TiKV instance | |
| RocksDB - XX | Compression ratio | The compression ratio of each level | |
| RocksDB - XX | Number of snapshots | The number of snapshots per TiKV instance | |
| RocksDB - XX | Oldest snapshots duration | The time that the oldest unreleased snapshot has survived | |
| RocksDB - XX | Number files at each level | The number of SST files for different column families in each level | |
| RocksDB - XX | Ingest SST duration seconds | The time consumed to ingest SST files | |
| RocksDB - XX | Stall conditions changed of each CF | The stall condition changes of each column family | |
| gRPC | gRPC messages | The count of gRPC messages per type | |
| gRPC | gRPC message failed | The count of failed gRPC messages per type | |
| gRPC | 99% gRPC message duration | The gRPC message duration per message type (P99) | |
| gRPC | gRPC GC message count | The count of gRPC GC messages | |
| gRPC | 99% gRPC KV GC message duration | The execution time of gRPC GC messages (P99) | |
| PD | PD requests | The count of requests that TiKV sends to PD | |
| PD | PD request duration (average) | The time consumed by requests that TiKV sends to PD | |
| PD | PD heartbeats | The total number of PD heartbeat messages | |
| PD | PD validated peers | The total number of peers validated by the PD worker | |
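
The normal ranges listed in the table can be turned into a simple check. The following Python sketch is only an illustration: the names `NORMAL_RANGES`, `percentile`, and `out_of_range`, as well as the sample values, are hypothetical and are not part of TiKV, Grafana, or Prometheus. It assumes you have already collected one current value per panel by some other means. The `percentile` helper uses the nearest-rank method to show what a figure such as P99 means for a set of duration samples; the dashboard itself may compute percentiles differently.

```python
import math

# Upper bounds taken from the "Normal range" column of the table.
# Keys are (service, panel name). A bound of 0 means the panel should stay at 0;
# any other bound means the value should be strictly less than it.
NORMAL_RANGES = {
    ("Errors", "Server message failures"): 0,
    ("Server", "Channel full"): 0,
    ("RocksDB - XX", "Write stall duration"): 0,
    ("Raft process", "Process ready duration per server"): 2.0,  # seconds, P99.99
    ("Storage", "Storage async snapshot duration"): 1.0,         # seconds, P99
    ("Storage", "Storage async write duration"): 1.0,            # seconds, P99
    ("Coprocessor", "Wait duration"): 10.0,                      # seconds, P99.99
    ("Task", "Worker pending tasks"): 1000,
    ("Thread CPU", "Raft store CPU"): 80.0,                      # percent
    ("Thread CPU", "Async apply CPU"): 90.0,                     # percent
    ("Thread CPU", "Scheduler CPU"): 80.0,                       # percent
    ("Thread CPU", "gRPC poll CPU"): 80.0,                       # percent
}


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def out_of_range(samples):
    """Return the sorted (service, panel) keys whose value violates the documented range."""
    violations = []
    for key, value in samples.items():
        limit = NORMAL_RANGES.get(key)
        if limit is None:
            continue  # the table documents no normal range for this panel
        ok = (value == 0) if limit == 0 else (value < limit)
        if not ok:
            violations.append(key)
    return sorted(violations)


if __name__ == "__main__":
    # Hypothetical readings, for illustration only.
    write_durations = [0.02, 0.03, 0.05, 1.4]  # seconds
    samples = {
        ("Storage", "Storage async write duration"): percentile(write_durations, 99),
        ("Thread CPU", "Raft store CPU"): 45.0,
        ("Errors", "Server message failures"): 0,
    }
    for service, panel in out_of_range(samples):
        print(f"{service} / {panel} is outside its normal range")
```

Panels without a documented normal range are skipped rather than treated as healthy or unhealthy, because the table gives no threshold to compare against.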

TiKV dashboard interface

This section shows images of the service panels on the TiKV dashboard.

Cluster

TiKV Dashboard - Cluster metrics

Errors

TiKV Dashboard - Errors metrics

Server

TiKV Dashboard - Server metrics

Raft IO

TiKV Dashboard - Raft IO metrics

Raft process

TiKV Dashboard - Raft process metrics

Raft message

TiKV Dashboard - Raft message metrics

Raft proposal

TiKV Dashboard - Raft proposal metrics

Raft admin

TiKV Dashboard - Raft admin metrics

Local reader

TiKV Dashboard - Local reader metrics

Storage

TiKV Dashboard - Storage metrics

Scheduler

TiKV Dashboard - Scheduler metrics

Scheduler - batch_get

TiKV Dashboard - Scheduler - batch_get metrics

Scheduler - cleanup

TiKV Dashboard - Scheduler - cleanup metrics

Scheduler - commit

TiKV Dashboard - Scheduler - commit metrics