Key Monitoring Metrics of TiKV

If you use TiUP or TiDB Ansible to deploy the TiDB cluster, the monitoring system (Prometheus/Grafana) is deployed at the same time. For more information, see Overview of the Monitoring Framework.

The Grafana dashboard is divided into a series of sub dashboards which include Overview, PD, TiDB, TiKV, Node_exporter, and so on. A lot of metrics are there to help you diagnose.

You can get an overview of the component TiKV status from the TiKV-Details dashboard, where the key metrics are displayed. According to the Performance Map, you can check whether the status of the cluster is as expected.

This document provides a detailed description of these key metrics on the TiKV-Details dashboard.

Cluster

Store size: The storage size per TiKV instance
Available size: The available capacity per TiKV instance
Capacity size: The capacity size per TiKV instance
CPU: The CPU utilization per TiKV instance
Memory: The memory usage per TiKV instance
IO utilization: The I/O utilization per TiKV instance
MBps: The total bytes of read and write in each TiKV instance
QPS: The QPS per command in each TiKV instance
Errps: The rate of gRPC message failures
leader: The number of leaders per TiKV instance
Region: The number of Regions per TiKV instance
Uptime: The runtime of TiKV since last restart

TiKV Dashboard - Cluster metrics

Errors

Critical error: The number of critical errors
Server is busy: Indicates occurrences of events that make the TiKV instance unavailable temporarily, such as Write Stall, Channel Full, and so on. It should be 0 in normal case.
Server report failures: The number of error messages reported by server. It should be 0 in normal case.
Raftstore error: The number of Raftstore errors per type on each TiKV instance
Scheduler error: The number of scheduler errors per type on each TiKV instance
Coprocessor error: The number of coprocessor errors per type on each TiKV instance
gRPC message error: The number of gRPC message errors per type on each TiKV instance
Leader drop: The count of dropped leaders per TiKV instance
Leader missing: The count of missing leaders per TiKV instance

TiKV Dashboard - Errors metrics

Server

CF size: The size of each column family
Store size: The storage size per TiKV instance
Channel full: The number of Channel Full errors per TiKV instance. It should be 0 in normal case.
Active written leaders: The number of leaders being written on each TiKV instance
Approximate Region size: The approximate Region size
Approximate Region size Histogram: The histogram of each approximate Region size
Region average written keys: The average number of written keys to Regions per TiKV instance
Region average written bytes: The average written bytes to Regions per TiKV instance

TiKV Dashboard - Server metrics

gRPC

gRPC message count: The rate of gRPC messages per type
gRPC message failed: The rate of failed gRPC messages
99% gRPC message duration: The gRPC message duration per message type (P99)
Average gRPC message duration: The average execution time of gRPC messages
gRPC batch size: The batch size of gRPC messages between TiDB and TiKV
Raft message batch size: The batch size of Raft messages between TiKV instances

Thread CPU

Raft store CPU: The CPU utilization of the raftstore thread. The CPU utilization should be less than 80% * raftstore.store-pool-size in normal case.
Async apply CPU: The CPU utilization of the async apply thread. The CPU utilization should be less than 90% * raftstore.apply-pool-size in normal cases.
Scheduler worker CPU: The CPU utilization of the scheduler worker thread. The CPU utilization should be less than 90% * storage.scheduler-worker-pool-size in normal cases.
gRPC poll CPU: The CPU utilization of the gRPC thread. The CPU utilization should be less than 80% * server.grpc-concurrency in normal cases.
Unified read pool CPU: The CPU utilization of the unified read pool thread
Storage ReadPool CPU: The CPU utilization of the storage read pool thread
Coprocessor CPU: The CPU utilization of the coprocessor thread
RocksDB CPU: The CPU utilization of the RocksDB thread
Split check CPU: The CPU utilization of the split check thread
GC worker CPU: The CPU utilization of the GC worker thread
Snapshot worker CPU: The CPU utilization of the snapshot worker thread

PD

PD requests: The rate at which TiKV sends to PD
PD request duration (average): The average duration of processing requests that TiKV sends to PD
PD heartbeats: The rate at which heartbeat messages are sent from TiKV to PD
PD validate peers: The rate at which messages are sent from TiKV to PD to validate TiKV peers

Raft IO

Apply log duration: The time consumed for Raft to apply logs
Apply log duration per server: The time consumed for Raft to apply logs per TiKV instance
Append log duration: The time consumed for Raft to append logs
Append log duration per server: The time consumed for Raft to append logs per TiKV instance
Commit log duration: The time consumed by Raft to commit logs
Commit log duration per server: The time consumed by Raft to commit logs per TiKV instance

TiKV Dashboard - Raft IO metrics

Raft process

Ready handled: The count of handled ready operations per second
0.99 Duration of Raft store events: The time consumed by Raftstore events (P99)
Process ready duration: The time consumed for processes to be ready in Raft
Process ready duration per server: The time consumed for peer processes to be ready in Raft per TiKV instance. It should be less than 2 seconds (P99.99).

TiKV Dashboard - Raft process metrics

Raft message

Sent messages per server: The number of Raft messages sent by each TiKV instance per second
Flush messages per server: The number of Raft messages flushed by the Raft client in each TiKV instance per second
Receive messages per server: The number of Raft messages received by each TiKV instance per second
Messages: The number of Raft messages sent per type per second
Vote: The number of Vote messages sent in Raft per second
Raft dropped messages: The number of dropped Raft messages per type per second

TiKV Dashboard - Raft message metrics

Raft propose

Raft apply proposals per ready: The histogram of the number of proposals that each ready operation contains in a batch while applying proposal.
Raft read/write proposals: The number of proposals per type per second
Raft read proposals per server: The number of read proposals made by each TiKV instance per second
Raft write proposals per server: The number of write proposals made by each TiKV instance per second
Propose wait duration: The histogram of waiting time of each proposal
Propose wait duration per server: The histogram of waiting time of each proposal per TiKV instance
Apply wait duration: The histogram of apply time of each proposal
Apply wait duration per server: The histogram of apply time of each proposal per TiKV instance
Raft log speed: The average rate at which peers propose logs

TiKV Dashboard - Raft propose metrics

Raft admin

Admin proposals: The number of admin proposals per second
Admin apply: The number of processed apply commands per second
Check split: The number of Raftstore split check commands per second
99.99% Check split duration: The time consumed when running split check commands (P99.99)

TiKV Dashboard - Raft admin metrics

Local reader

Local reader requests: The number of total requests and the number of rejections from the local read thread

TiKV Dashboard - Local reader metrics

Unified Read Pool

Time used by level: The time consumed for each level in the unified read pool. Level 0 means small queries.
Level 0 chance: The proportion of level 0 tasks in unified read pool
Running tasks: The number of tasks running concurrently in the unified read pool

Storage

Storage command total: The number of received command by type per second
Storage async request error: The number of engine asynchronous request errors per second
Storage async snapshot duration: The time consumed by processing asynchronous snapshot requests. It should be less than 1s in .99.
Storage async write duration: The time consumed by processing asynchronous write requests. It should be less than 1s in .99.

TiKV Dashboard - Storage metrics

Scheduler

Scheduler stage total: The number of commands at each stage per second. There should not be a lot of errors in a short time.
Scheduler writing bytes: The total written bytes by commands processed on each TiKV instance
Scheduler priority commands: The count of different priority commands per second
Scheduler pending commands: The count of pending commands per TiKV instance per second

TiKV Dashboard - Scheduler metrics

Scheduler - commit

Scheduler stage total: The number of commands at each stage per second when executing the commit command. There should not be a lot of errors in a short time.
Scheduler command duration: The time consumed when executing the commit command. It should be less than 1s.
Scheduler latch wait duration: The waiting time caused by latch when executing the commit command. It should be less than 1s.
Scheduler keys read: The count of keys read by a commit command
Scheduler keys written: The count of keys written by a commit command
Scheduler scan details: The keys scan details of each CF when executing the commit command.
Scheduler scan details [lock]: The keys scan details of lock CF when executing the commit command
Scheduler scan details [write]: The keys scan details of write CF when executing the commit command
Scheduler scan details [default]: The keys scan details of default CF when executing the commit command

TiKV Dashboard - Scheduler commit metrics

Scheduler - pessimistic_rollback

Scheduler stage total: The number of commands at each stage per second when executing the pessimistic_rollback command. There should not be a lot of errors in a short time.
Scheduler command duration: The time consumed when executing the pessimistic_rollback command. It should be less than 1s.
Scheduler latch wait duration: The waiting time caused by latch when executing the pessimistic_rollback command. It should be less than 1s.
Scheduler keys read: The count of keys read by a pessimistic_rollback command
Scheduler keys written: The count of keys written by a pessimistic_rollback command
Scheduler scan details: The keys scan details of each CF when executing the pessimistic_rollback command.
Scheduler scan details [lock]: The keys scan details of lock CF when executing the pessimistic_rollback command
Scheduler scan details [write]: The keys scan details of write CF when executing the pessimistic_rollback command
Scheduler scan details [default]: The keys scan details of default CF when executing the pessimistic_rollback command

Scheduler - prewrite

Scheduler stage total: The number of commands at each stage per second when executing the prewrite command. There should not be a lot of errors in a short time.
Scheduler command duration: The time consumed when executing the prewrite command. It should be less than 1s.
Scheduler latch wait duration: The waiting time caused by latch when executing the prewrite command. It should be less than 1s.
Scheduler keys read: The count of keys read by a prewrite command
Scheduler keys written: The count of keys written by a prewrite command
Scheduler scan details: The keys scan details of each CF when executing the prewrite command.
Scheduler scan details [lock]: The keys scan details of lock CF when executing the prewrite command
Scheduler scan details [write]: The keys scan details of write CF when executing the prewrite command
Scheduler scan details [default]: The keys scan details of default CF when executing the prewrite command

Scheduler - rollback

Scheduler stage total: The number of commands at each stage per second when executing the rollback command. There should not be a lot of errors in a short time.
Scheduler command duration: The time consumed when executing the rollback command. It should be less than 1s.
Scheduler latch wait duration: The waiting time caused by latch when executing the rollback command. It should be less than 1s.
Scheduler keys read: The count of keys read by a rollback command
Scheduler keys written: The count of keys written by a rollback command
Scheduler scan details: The keys scan details of each CF when executing the rollback command.
Scheduler scan details [lock]: The keys scan details of lock CF when executing the rollback command
Scheduler scan details [write]: The keys scan details of write CF when executing the rollback command
Scheduler scan details [default]: The keys scan details of default CF when executing the rollback command

GC

MVCC versions: The number of versions for each key
MVCC delete versions: The number of versions deleted by GC for each key
GC tasks: The count of GC tasks processed by gc_worker
GC tasks Duration: The time consumed when executing GC tasks
GC keys (write CF): The count of keys in write CF affected during GC
TiDB GC worker actions: The count of TiDB GC worker actions
TiDB GC seconds: The GC duration
GC speed: The number of keys deleted by GC per second
TiKV AutoGC Working: The status of Auto GC
ResolveLocks Progress: The progress of the first phase of GC (Resolve Locks)
TiKV Auto GC Progress: The progress of the second phase of GC
TiKV Auto GC SafePoint: The value of TiKV GC safe point. The safe point is the current GC timestamp
GC lifetime: The lifetime of TiDB GC
GC interval: The interval of TiDB GC

Snapshot

Rate snapshot message: The rate at which Raft snapshot messages are sent
99% Handle snapshot duration: The time consumed to handle snapshots (P99)
Snapshot state count: The number of snapshots per state
99.99% Snapshot size: The snapshot size (P99.99)
99.99% Snapshot KV count: The number of KV within a snapshot (P99.99)

Task

Worker handled tasks: The number of tasks handled by worker per second
Worker pending tasks: Current number of pending and running tasks of worker per second. It should be less than 1000 in normal case.
FuturePool handled tasks: The number of tasks handled by future pool per second
FuturePool pending tasks: Current number of pending and running tasks of future pool per second

Coprocessor Overview

Request duration: The total duration from the time of receiving the coprocessor request to the time of finishing processing the request
Total Requests: The number of requests by type per second
Handle duration: The histogram of time spent actually processing coprocessor requests per minute
Total Request Errors: The number of request errors of Coprocessor per second. There should not be a lot of errors in a short time.
Total KV Cursor Operations: The total number of the KV cursor operations by type per second, such as select, index, analyze_table, analyze_index, checksum_table, checksum_index, and so on.
KV Cursor Operations: The histogram of KV cursor operations by type per second
Total RocksDB Perf Statistics: The statistics of RocksDB performance
Total Response Size: The total size of coprocessor response

Coprocessor Detail

Handle duration: The histogram of time spent actually processing coprocessor requests per minute
95% Handle duration by store: The time consumed to handle coprocessor requests per TiKV instance per second (P95)
Wait duration: The time consumed when coprocessor requests are waiting to be handled. It should be less than 10s (P99.99).
95% Wait duration by store: The time consumed when coprocessor requests are waiting to be handled per TiKV instance per second (P95)
Total DAG Requests: The total number of DAG requests per second
Total DAG Executors: The total number of DAG executors per second
Total Ops Details (Table Scan): The number of RocksDB internal operations per second when executing select scan in coprocessor
Total Ops Details (Index Scan): The number of RocksDB internal operations per second when executing index scan in coprocessor
Total Ops Details by CF (Table Scan): The number of RocksDB internal operations for each CF per second when executing select scan in coprocessor
Total Ops Details by CF (Index Scan): The number of RocksDB internal operations for each CF per second when executing index scan in coprocessor

Threads

Threads state: The state of TiKV threads
Threads IO: The I/O traffic of each TiKV thread
Thread Voluntary Context Switches: The number of TiKV threads voluntary context switches
Thread Nonvoluntary Context Switches: The number of TiKV threads nonvoluntary context switches

RocksDB - kv/raft

Get operations: The count of get operations per second
Get duration: The time consumed when executing get operations
Seek operations: The count of seek operations per second
Seek duration: The time consumed when executing seek operations
Write operations: The count of write operations per second
Write duration: The time consumed when executing write operations
WAL sync operations: The count of WAL sync operations per second
Write WAL duration: The time consumed for writing WAL
WAL sync duration: The time consumed when executing WAL sync operations
Compaction operations: The count of compaction and flush operations per second
Compaction duration: The time consumed when executing the compaction and flush operations
SST read duration: The time consumed when reading SST files
Write stall duration: Write stall duration. It should be 0 in normal case.
Memtable size: The memtable size of each column family
Memtable hit: The hit rate of memtable
Block cache size: The block cache size. Broken down by column family if shared block cache is disabled.
Block cache hit: The hit rate of block cache
Block cache flow: The flow rate of block cache operations per type
Block cache operations: The count of block cache operations per type
Keys flow: The flow rate of operations on keys per type
Total keys: The count of keys in each column family
Read flow: The flow rate of read operations per type
Bytes / Read: The bytes per read operation
Write flow: The flow rate of write operations per type
Bytes / Write: The bytes per write operation
Compaction flow: The flow rate of compaction operations per type
Compaction pending bytes: The pending bytes to be compacted
Read amplification: The read amplification per TiKV instance
Compression ratio: The compression ratio of each level
Number of snapshots: The number of snapshots per TiKV instance
Oldest snapshots duration: The time that the oldest unreleased snapshot survivals
Number files at each level: The number of SST files for different column families in each level
Ingest SST duration seconds: The time consumed to ingest SST files
Stall conditions changed of each CF: Stall conditions changed of each column family

Titan - All

Blob file count: The number of Titan blob files
Blob file size: The total size of Titan blob file
Live blob size: The total size of valid blob record
Blob cache hit: The hit rate of Titan block cache
Iter touched blob file count: The number of blob file involved in a single iterator
Blob file discardable ratio distribution: The ratio distribution of blob record failure of blob files
Blob key size: The size of Titan blob keys
Blob value size: The size of Titan blob values
Blob get operations: The count of get operations in Titan blob
Blob get duration: The time consumed when executing get operations in Titan blob
Blob iter operations: The time consumed when executing iter operations in Titan blob
Blob seek duration: The time consumed when executing seek operations in Titan blob
Blob next duration: The time consumed when executing next operations in Titan blob
Blob prev duration: The time consumed when executing prev operations in Titan blob
Blob keys flow: The flow rate of operations on Titan blob keys
Blob bytes flow: The flow rate of bytes on Titan blob keys
Blob file read duration: The time consumed when reading Titan blob file
Blob file write duration: The time consumed when writing Titan blob file
Blob file sync operations: The count of blob file sync operations
Blob file sync duration: The time consumed when synchronizing blob file
Blob GC action: The count of Titan GC actions
Blob GC duration: The Titan GC duration
Blob GC keys flow: The flow rate of keys read and written by Titan GC
Blob GC bytes flow: The flow rate of bytes read and written by Titan GC
Blob GC input file size: The size of Titan GC input file
Blob GC output file size: The size of Titan GC output file
Blob GC file count: The count of blob files involved in Titan GC

Lock manager

Thread CPU: The CPU utilization of the lock manager thread
Handled tasks: The number of tasks handled by lock manager
Waiter lifetime duration: The waiting time of the transaction for the lock to be released
Wait table: The status information of wait table, including the number of locks and the number of transactions waiting for the lock
Deadlock detect duration: The time consumed for detecting deadlock
Detect error: The number of errors encountered when detecting deadlock, including the number of deadlocks
Deadlock detector leader: The information of the node where the deadlock detector leader is located

Memory

Allocator Stats: The statistics of the memory allocator

Backup

Backup CPU: The CPU utilization of the backup thread
Range Size: The histogram of backup range size
Backup Duration: The time consumed for backup
Backup Flow: The total bytes of backup
Disk Throughput: The disk throughput per instance
Backup Range Duration: The time consumed for backing up a range
Backup Errors: The number of errors encountered during a backup

Encryption

Encryption data keys: The total number of encrypted data keys
Encrypted files: The number of encrypted files
Encryption initialized: Shows whether encryption is enabled. 1 means enabled.
Encryption meta files size: The size of the encryption meta file
Encrypt/decrypt data nanos: The histogram of duration on encrypting/decrypting data each time
Read/write encryption meta duration: The time consumed for reading/writing encryption meta files

Explanation of Common Parameters

gRPC Message Type

Transactional API:
- kv_get: The command of getting the latest version of data specified by ts
- kv_scan: The command of scanning a range of data
- kv_prewrite: The command of prewriting the data to be committed at first phase of 2PC
- kv_pessimistic_lock: The command of adding a pessimistic lock to the key to prevent other transaction from modifying this key
- kv_pessimistic_rollback: The command of deleting the pessimistic lock on the key
- kv_txn_heart_beat: The command of updating lock_ttl for pessimistic transactions or large transactions to prevent them from rolling back
- kv_check_txn_status: The command of checking the status of the transaction
- kv_commit: The command of committing the data written by the prewrite command
- kv_cleanup: The command of rolling back a transaction, which is deprecated in v4.0
- kv_batch_get: The command of getting the value of batch key at once, similar to kv_get
- kv_batch_rollback: The command of batch rollback of multiple prewrite transactions
- kv_scan_lock: The command of scanning all locks with a version number before max_version to clean up expired transactions
- kv_resolve_lock: The command of committing or rollback the transaction lock, according to the transaction status.
- kv_gc: The command of GC
- kv_delete_range: The command of deleting a range of data from TiKV
Raw API:
- raw_get: The command of getting the value of key
- raw_batch_get: The command of getting the value of batch keys
- raw_scan: The command of scanning a range of data
- raw_batch_scan: The command of scanning multiple consecutive data range
- raw_put: The command of writing a key/value pair
- raw_batch_put: The command of writing a batch of key/value pairs
- raw_delete: The command of deleting a key/value pair
- raw_batch_delete: The command of a batch of key/value pairs
- raw_delete_range: The command of deleting a range of data