Troubleshoot Increased Read and Write Latency

This document introduces the possible causes of read and write latency and jitters, and how to troubleshoot these issues.

Common causes

Incorrect TiDB execution plan

The execution plan of queries is unstable and might select the incorrect index, which causes higher latency.

Phenomenon

  • If the query execution plan is output in the slow log, you can directly view the plan. Execute the select tidb_decode_plan('xxx...') statement to parse the detailed execution plan.
  • The number of scanned keys in the monitor abnormally increases; in the slow log, the number of Scan Keys are large.
  • The SQL execution duration in TiDB is greatly different than that in other databases such as MySQL. You can compare the execution plan of other databases (for example, whether Join Order is different).

Possible reason

The statistics is inaccurate.

Troubleshooting methods

  • Update the statistical information
    • Execute analyze table manually and execute analyze periodically with the crontab command to keep the statistics accurate.
    • Execute auto analyze automatically. Lower the threshold value of analyze ratio, increase the frequency of information collection, and set the start and end time of the execution. See the following examples:
      • set global tidb_auto_analyze_ratio=0.2;
      • set global tidb_auto_analyze_start_time='00:00 +0800';
      • set global tidb_auto_analyze_end_time='06:00 +0800';
  • Bind the execution plan
    • Modify the application SQL statements and execute use index to consistently use the index of the column.
    • In 3.0 versions, you do not need to modify the application SQL statements. Use create global binding to create the binding SQL statement of force index.
    • In 4.0 versions, SQL Plan Management is supported, which avoids the performance decrease caused by unstable execution plans.

PD anomalies

Phenomenon

There is an abnormal increase of the wait duration metric for the PD TSO. This metric represents the duration of waiting for PD to return requests.

Possible reasons

  • Disk issue. The disk where the PD node is located has full I/O load. Investigate whether PD is deployed with other components with high I/O demand and the health of the disk. You can verify the cause by viewing the monitor metrics in Grafana -> disk performance -> latency/load. You can also use the FIO tool to run a check on the disk if necessary.

  • Network issues between PD peers. The PD log shows lost the TCP streaming connection. You need to check whether there is a problem with the network between PD nodes and verify the cause by viewing round trip in the monitor Grafana -> PD -> etcd.

  • High server load. The log shows server is likely overloaded.

  • PD cannot elect a Leader: The PD log shows lease is not expired. This issue has been fixed in v3.0.x and v2.1.19.

  • The leader election is slow. The Region loading duration is long. You can check this issue by running grep "regions cost" in the PD log. If the result is in seconds, such as load 460927 regions cost 11.77099s, it means the Region loading is slow. You can enable the region storage feature in v3.0 by setting use-region-storage to true, which significantly reduce the Region loading duration.

  • The network issue between TiDB and PD. Check whether the network from TiDB to PD Leader is running normally by accessing the monitor Grafana -> blackbox_exporter -> ping latency.

  • PD reports the FATAL error, and the log shows range failed to find revision pair. This issue has been fixed in v3.0.8 (#2040).

  • When the /api/v1/regions interface is used, too many Regions might cause PD OOM. This issue has been fixed in v3.0.8 (#1986).

  • PD OOM during the rolling upgrade. The size of gRPC messages is not limited, and the monitor shows that TCP InSegs is relatively large. This issue has been fixed in v3.0.6 (#1952).

  • PD panics. Report a bug.

  • Other causes. Get goroutine by running curl http://127.0.0.1:2379/debug/pprof/goroutine?debug=2 and report a bug.

TiKV anomalies

Phenomenon

The KV Cmd Duration metric in the monitor increases abnormally. This metric represents the duration between the time that TiDB sends a request to TiKV and the time that TiDB receives the response.

Possible reasons

  • Check the gRPC duration metric. This metric represents the total duration of a gRPC request in TiKV. You can find out the potential network issue by comparing gRPC duration of TiKV and KV duration of TiDB. For example, the gRPC duration is short but the KV duration of TiDB is long, which indicates that the network latency between TiDB and TiKV might be high, or that the NIC bandwidth between TiDB and TiKV is fully occupied.

  • Re-election because TiKV is restarted.

    • After TiKV panics, it is pulled up by systemd and runs normally. You can check whether panic has occurred by viewing the TiKV log. Because this issue is unexpected, report a bug if it happens.
    • TiKV is stopped or killed by a third party and then pulled up by systemd. Check the cause by viewing dmesg and the TiKV log.
    • TiKV is OOM, which causes restart.
    • TiKV is hung because of dynamically adjusting THP (Transparent Hugepage).
  • Check monitor: TiKV RocksDB encounters write stall and thus results in re-election. You can check if the monitor Grafana -> TiKV-details -> errors shows server is busy.

  • Re-election because of network isolation.

  • If the block-cache configuration is too large, it might cause TiKV OOM. To verify the cause of the problem, check the block cache size of RocksDB by selecting the corresponding instance in the monitor Grafana -> TiKV-details. Meanwhile, check whether the [storage.block-cache] capacity = # "1GB"parameter is set properly. By default, TiKV's block-cache is set to 45% of the total memory of the machine. You need to explicitly specify this parameter when you deploy TiKV in the container, because TiKV obtains the memory of the physical machine, which might exceed the memory limit of the container.

  • Coprocessor receives many large queries and returns a large volume of data. gRPC fails to send data as quickly as the coprocessor returns data, which results in OOM. To verify the cause, you can check whether response size exceeds the network outbound traffic by viewing the monitor Grafana -> TiKV-details -> coprocessor overview.

Bottleneck of a single TiKV thread

There are some single threads in TiKV that might become the bottleneck.

  • Too many Regions in a TiKV instance causes a single gRPC thread to be the bottleneck (Check the Grafana -> TiKV-details -> Thread CPU/gRPC CPU Per Thread metric). In v3.x or later versions, you can enable Hibernate Region to resolve the issue.
  • For versions earlier than v3.0, when the raftstore thread or the apply thread becomes the bottleneck (Grafana -> TiKV-details -> Thread CPU/raft store CPU and Async apply CPU metrics exceed 80%), you can scale out TiKV (v2.x) instances or upgrade to v3.x with multi-threading.

CPU load increases

Phenomenon

The usage of CPU resources becomes the bottleneck.

Possible reasons

  • Hotspot issue
  • High overall load. Check the slow queries and expensive queries of TiDB. Optimize the executing queries by adding indexes or executing queries in batches. Another solution is to scale out the cluster.

Other causes

Cluster maintenance

Most of each online cluster has three or five nodes. If the machine to be maintained has the PD component, you need to determine whether the node is the leader or the follower. Disabling a follower has no impact on the cluster operation. Before disabling a leader, you need to switch the leadership. During the leadership change, performance jitter of about 3 seconds will occur.

Minority of replicas are offline

By default, each TiDB cluster has three replicas, so each Region has three replicas in the cluster. These Regions elect the leader and replicate data through the Raft protocol. The Raft protocol ensures that TiDB can still provide services without data loss even when the nodes (that are fewer than half of replicas) fail or are isolated. For the cluster with three replicas, the failure of one node might cause performance jitter but the usability and correctness in theory are not affected.

New indexes

Creating indexes consumes a huge amount of resources when TiDB scans tables and backfills indexes. Index creation might even conflict with the frequently updated fields, which affects the application. Creating indexes on a large table often takes a long time, so you must try to balance the index creation time and the cluster performance (for example, creating indexes at the off-peak time).

Parameter adjustment:

Currently, you can use tidb_ddl_reorg_worker_cnt and tidb_ddl_reorg_batch_size to dynamically adjust the speed of index creation. Usually, the smaller the values, the smaller the impact on the system, with longer execution time though.

In general cases, you can first keep their default values (4 and 256), observe the resource usage and response speed of the cluster, and then increase the value of tidb_ddl_reorg_worker_cnt to increase the concurrency. If no obvious jitter is observed in the monitor, increase the value of tidb_ddl_reorg_batch_size. If the columns involved in the index creation are frequently updated, the many resulting conflicts will cause the index creation to fail and be retried.

In addition, you can also set the value of tidb_ddl_reorg_priority to PRIORITY_HIGH to prioritize the index creation and speed up the process. But in the general OLTP system, it is recommended to keep its default value.

High GC pressure

The transaction of TiDB adopts the Multi-Version Concurrency Control (MVCC) mechanism. When the newly written data overwrites the old data, the old data is not replaced, and both versions of data are stored. Timestamps are used to mark different versions. The task of GC is to clear the obsolete data.

  • In the phase of Resolve Locks, a large amount of scan_lock requests are created in TiKV, which can be observed in the gRPC-related metrics. These scan_lock requests call all Regions.
  • In the phase of Delete Ranges, a few (or no) unsafe_destroy_range requests are sent to TiKV, which can be observed in the gRPC-related metrics and the GC tasks panel.
  • In the phase of Do GC, each TiKV by default scans the leader Regions on the machine and performs GC to each leader, which can be observed in the GC tasks panel.