Handle Alerts in TiDB Data Migration
This document describes how to handle alerts in DM.
Alert rules related to high availability
DM_master_all_down
Description:
If all DM-master nodes are offline, this alert is triggered.
Solution:
You can take the following steps to handle the alert:
- Check the environment of the cluster.
- Check the logs of all DM-master nodes for troubleshooting.
DM_worker_offline
Description:
If a DM-worker node is offline for more than one hour, this alert is triggered. In a high-availability architecture, this alert might not directly interrupt the task but increases the risk of interruption.
Solution:
You can take the following steps to handle the alert:
- View the working status of the corresponding DM-worker node.
- Check whether the node is connected.
- Troubleshoot errors through logs.
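The connectivity check in the steps above can be sketched as a plain TCP probe. This is a minimal sketch, not part of DM: the function name `is_reachable` is hypothetical, and the host and port you pass in are assumptions that must match your own deployment.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, calling `is_reachable("10.0.1.5", 8262)` with the address of the affected DM-worker (both values here are placeholders) only tells you whether the port accepts connections. It does not confirm that the worker process is healthy, so the logs still need to be checked.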
DM_DDL_error
Description:
This alert is triggered when an error occurs while DM is processing sharding DDL operations.
Solution:
Refer to Troubleshoot DM.
DM_pending_DDL
Description:
If a sharding DDL operation is pending for more than one hour, this alert is triggered.
Solution:
In some scenarios, a pending sharding DDL operation might be what users expect. Otherwise, refer to Handle Sharding DDL Locks Manually in DM for a solution.
Alert rules related to task status
DM_task_state
Description:
When a sub-task of DM-worker is in the Paused state for over 20 minutes, an alert is triggered.
Solution:
Refer to Troubleshoot DM.
Alert rules related to relay log
DM_relay_process_exits_with_error
Description:
When the relay log processing unit encounters a non-autorecoverable error (for example, binlog files not found), or when it encounters multiple recoverable errors (for example, network problems) in a short period of time (for example, more than 3 times in 2 minutes), this alert is triggered.
Solution:
Refer to Troubleshoot DM.
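To make the "more than 3 times in 2 minutes" condition concrete, the following is a minimal sliding-window sketch. It is an illustration only and does not reflect how DM or its alert rules are implemented. The class name `ErrorBurstDetector` and its parameters are hypothetical.

```python
from collections import deque


class ErrorBurstDetector:
    """Flags a burst when more than max_errors errors fall within window_seconds."""

    def __init__(self, max_errors: int = 3, window_seconds: float = 120.0):
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp: float) -> bool:
        """Record an error at timestamp (in seconds); return True if the threshold is exceeded."""
        self._timestamps.append(timestamp)
        # Discard errors that are older than the window.
        while timestamp - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_errors
```

With the default values, the fourth error recorded inside any 2-minute span returns `True`, while errors spaced further apart never do.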
DM_remain_storage_of_relay_log
Description:
When the free space of the disk where the relay log is located is less than 10G, an alert is triggered.
Solutions:
You can use the following methods to handle the alert:
- Delete unwanted data manually to increase free disk space.
- Reconfigure the automatic data purge strategy of the relay log or purge data manually.
- Execute the pause-relay command to pause the relay log pulling process. After there is enough free disk space, resume the process by running the resume-relay command. Note that you must not purge upstream binlog files that have not been pulled after the relay log pulling process is paused.
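Before and after freeing space, you might want to confirm how much room is left on the relay log disk. The following is a minimal sketch using the Python standard library. The function name `relay_disk_is_low` is hypothetical, the path you pass in is whatever directory your deployment uses for relay logs, and reading "10G" as 10 GiB is an assumption.

```python
import shutil

# "10G" interpreted as 10 GiB; this interpretation is an assumption.
THRESHOLD_BYTES = 10 * 1024**3


def relay_disk_is_low(path: str, threshold_bytes: int = THRESHOLD_BYTES) -> bool:
    """Return True if the filesystem that holds path has less free space than the threshold."""
    return shutil.disk_usage(path).free < threshold_bytes
```

The check covers the whole filesystem that contains the given path, so other data stored on the same disk also counts against the free space.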
DM_relay_log_data_corruption
Description:
When the relay log processing unit validates the binlog event read from the upstream and detects abnormal checksum information, this unit moves to the Paused state, and an alert is triggered.
Solution:
Refer to Troubleshoot DM.
DM_fail_to_read_binlog_from_master
Description:
If an error occurs when the relay log processing unit tries to read the binlog event from the upstream, this unit moves to the Paused state, and an alert is triggered.
Solution:
Refer to Troubleshoot DM.
DM_fail_to_write_relay_log
Description:
If an error occurs when the relay log processing unit tries to write the binlog event into the relay log file, this unit moves to the Paused state, and an alert is triggered.
Solution:
Refer to Troubleshoot DM.
DM_binlog_file_gap_between_master_relay
Description:
When the number of the binlog files in the current upstream MySQL/MariaDB exceeds that of the latest binlog files pulled by the relay log processing unit by more than 1 for 10 minutes, an alert is triggered.
Solution:
Refer to Troubleshoot DM.
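The gap condition described above can be illustrated with a short sketch. It assumes that the "number" refers to the numeric suffix of binlog file names such as `mysql-bin.000123`, and that positions are sampled periodically. Both are assumptions, and the function names are hypothetical rather than part of DM.

```python
def binlog_sequence(filename: str) -> int:
    """Extract the numeric suffix from a binlog file name such as 'mysql-bin.000123'."""
    return int(filename.rsplit(".", 1)[1])


def gap_persisted(samples, max_gap=1, duration_seconds=600):
    """Check whether the file gap stayed above max_gap for at least duration_seconds.

    samples is a time-ordered list of (timestamp_seconds, upstream_file, relay_file).
    """
    breach_start = None
    for ts, upstream_file, relay_file in samples:
        gap = binlog_sequence(upstream_file) - binlog_sequence(relay_file)
        if gap > max_gap:
            # Remember when the current uninterrupted breach began.
            if breach_start is None:
                breach_start = ts
        else:
            # The gap recovered, so the timer resets.
            breach_start = None
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= duration_seconds
```

A single sample in which the gap drops back to 1 or less resets the timer, which mirrors the requirement that the condition must hold continuously for 10 minutes.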
Alert rules related to Dump/Load
DM_dump_process_exists_with_error
Description:
When the Dump processing unit encounters a non-autorecoverable error (for example, binlog files not found), or when it encounters multiple recoverable errors (for example, network problems) in a short period of time (for example, more than 3 times in 2 minutes), this alert is triggered.
Solution:
Refer to Troubleshoot DM.
DM_load_process_exists_with_error
Description:
When the Load processing unit encounters a non-autorecoverable error (for example, binlog files not found), or when it encounters multiple recoverable errors (for example, network problems) in a short period of time (for example, more than 3 times in 2 minutes), this alert is triggered.
Solution:
Refer to Troubleshoot DM.
Alert rules related to binlog replication
DM_sync_process_exists_with_error
Description:
When the binlog replication processing unit encounters a non-autorecoverable error (for example, binlog files not found), or when it encounters multiple recoverable errors (for example, network problems) in a short period of time (for example, more than 3 times in 2 minutes), this alert is triggered.
Solution:
Refer to Troubleshoot DM.
DM_binlog_file_gap_between_master_syncer
Description:
When the number of the binlog files in the current upstream MySQL/MariaDB exceeds that of the latest binlog files processed by the binlog replication processing unit by more than 1 for 10 minutes, an alert is triggered.
Solution:
Refer to Handle Performance Issues.
DM_binlog_file_gap_between_relay_syncer
Description:
When the number of the binlog files in the current relay log processing unit exceeds that of the latest binlog files processed by the binlog replication processing unit by more than 1 for 10 minutes, an alert is triggered.
Solution:
Refer to Handle Performance Issues.