TiDB Monitoring Metrics
If you use TiDB Ansible to deploy the TiDB cluster, the monitoring system is deployed at the same time. For more information, see TiDB Monitoring Framework Overview.
The Grafana dashboard is divided into a series of sub dashboards which include Overview, PD, TiDB, TiKV, Node_exporter, Disk Performance, and so on. A lot of metrics are there to help you diagnose.
This document describes some key monitoring metrics displayed on the TiDB dashboard.
Key metrics description
To understand the key metrics displayed on the TiDB dashboard, check the following list:
Query Summary
- Duration: the execution time of a SQL statement
- Statement OPS: the statistics of executed SQL statements (including
SELECT
,INSERT
,UPDATE
and so on) - QPS By Instance: the QPS on each TiDB instance
Query Detail
- Internal SQL OPS: the statistics of the executed SQL statements within TiDB
Server
- Connection Count: the number of clients connected to each TiDB instance
- Failed Query OPM: the statistics of failed SQL statements, such as grammar mistake, primary key, and so on
- Heap Memory Usage: the heap memory size used by each TiDB instance
- Events OPM: the statistics of key events, such as "start", "close", "graceful-shutdown","kill", "hang", and so on
- Uncommon Error OPM: the statistics of abnormal TiDB errors, including panic, binlog write failure, and so on
Transaction
- Transaction OPS: the statistics of executed transactions
- Transaction Duration: the execution time of a transaction
- Session Retry Error OPS: the number of errors encountered during the transaction retry
Executor
- Expensive Executor OPS: the statistics of the operators that consume many system resources, including
Merge Join
,Hash Join
,Index Look Up Join
,Hash Agg
,Stream Agg
,Sort
,TopN
, and so on - Queries Using Plan Cache OPS: the statistics of queries using the Plan Cache
- Expensive Executor OPS: the statistics of the operators that consume many system resources, including
Distsql
- Distsql Duration: the processing time of Distsql statements
- Distsql QPS: the statistics of Distsql statements
KV Errors
- KV Retry Duration: the time that a KV retry request lasts
- TiClient Region Error OPS: the number of Region related error messages returned by TiKV
- KV Backoff OPS: the number of error messages (transaction conflicts and so on) returned by TiKV
- Lock Resolve OPS: the number of errors related to transaction conflicts
- Other Errors OPS: the number of other types of errors, including clearing locks and updating SafePoint
KV Duration
- KV Cmd Duration 99: the execution time of KV commands
KV Count
- KV Cmd OPS: the statistics of executed KV commands
- Txn OPS: the statistics of started transactions
- Load SafePoint OPS: the statistics of operations that update SafePoint
PD Client
- PD TSO OPS: the number of TSO that TiDB obtains from PD
- PD TSO Wait Duration: the time it takes TiDB to obtain TSO from PD
- PD Client CMD OPS: the statistics of commands executed by PD Client
- PD Client CMD Duration: the time it takes PD Client to execute commands
- PD Client CMD Fail OPS: the statistics of failed commands executed by PD Client
Schema Load
- Load Schema Duration: the time it takes TiDB to obtain the schema from TiKV
- Load Schema OPS: the statistics of the schemas that TiDB obtains from TiKV
- Schema Lease Error OPM: the Schema Lease error, including two types named "change" and "outdate"; an alarm is triggered when an "outdate" error occurs
DDL
- DDL Duration 95: the statistics of DDL statements processing time
- DDL Batch Add Index Duration 100: the statistics of the time that it takes each Batch to create the index
- DDL Deploy Syncer Duration: the time consumed by Schema Version Syncer initialization, restart, and clearing up operations
- Owner Handle Syncer Duration: the time that it takes the DDL Owner to update, obtain, and check the Schema Version
- Update Self Version Duration: the time consumed by updating the version information of Schema Version Syncer
Statistics
- Auto Analyze Duration 95: the time consumed by automatic
ANALYZE
- Auto Analyze QPS: the statistics of automatic
ANALYZE
- Stats Inaccuracy Rate: the information of the statistics inaccuracy rate
- Pseudo Estimation OPS: the number of the SQL statements optimized using pseudo statistics
- Dump Feedback OPS: the number of stored statistical Feedbacks
- Update Stats OPS: the statistics of using Feedback to update the statistics information
- Auto Analyze Duration 95: the time consumed by automatic
Meta
- AutoID QPS: AutoID related statistics, including three operations (global ID allocation, a single table AutoID allocation, a single table AutoID Rebase)
- AutoID Duration: the time consumed by AutoID related operations
GC
- Worker Action OPM: the statistics of GC related operations, including
run_job
,resolve_lock
, anddelete\_range
- Duration 99: the time consumed by GC related operations
- GC Failure OPM: the number of failed GC related operations
- Too Many Locks Error OPM: the number of the error that GC clears up too many locks
- Worker Action OPM: the statistics of GC related operations, including