TiDB Binlog FAQ

This document collects the frequently asked questions (FAQs) about TiDB Binlog.

What is the impact of enabling TiDB Binlog on the performance of TiDB?

  • There is no impact on queries.

  • There is a slight performance impact on INSERT, DELETE and UPDATE transactions. In terms of latency, a p-binlog is written concurrently during the TiKV prewrite stage, before the transaction is committed. Writing binlog is generally faster than TiKV prewrite, so it does not increase latency. You can check the response time of writing binlog in Pump's monitoring panel.

How high is the replication latency of TiDB Binlog?

The latency of TiDB Binlog replication is measured in seconds and is generally about 3 seconds during off-peak hours.

What privileges does Drainer need to replicate data to the downstream MySQL or TiDB cluster?

To replicate data to the downstream MySQL or TiDB cluster, Drainer must have the following privileges:

  • Insert
  • Update
  • Delete
  • Create
  • Drop
  • Alter
  • Execute
  • Index
  • Select

What can I do if the Pump disk is almost full?

  1. Check whether Pump's GC works well:

    • Check whether the gc_tso time in Pump's monitoring panel is identical to that of the configuration file.
  2. If GC works well, perform the following steps to reduce the amount of space required for a single Pump:

    • Modify the GC parameter of Pump to reduce the number of days to retain data.

    • Add Pump instances.

What can I do if Drainer replication is interrupted?

Execute the following command to check whether the status of Pump is normal and whether all the Pump instances that are not in the offline state are running.

binlogctl -cmd pumps

Then, check whether the Drainer monitor or log outputs corresponding errors. If so, resolve them accordingly.

What can I do if Drainer is slow to replicate data to the downstream MySQL or TiDB cluster?

Check the following monitoring items:

  • For the Drainer Event monitoring metric, check how many INSERT, UPDATE and DELETE transactions Drainer replicates to the downstream per second.

  • For the SQL Query Time monitoring metric, check the time Drainer takes to execute SQL statements in the downstream.

Possible causes and solutions for slow replication:

  • If the replicated database contains a table without a primary key or unique index, add a primary key or a unique index to the table.

  • If the latency between Drainer and the downstream is high, increase the value of the worker-count parameter of Drainer. For cross-datacenter replication, it is recommended to deploy Drainer in the downstream.

  • If the load in the downstream is not high, increase the value of the worker-count parameter of Drainer.

What can I do if a Pump instance crashes?

If a Pump instance crashes, Drainer cannot replicate data to the downstream because it cannot obtain the data of this instance. If this Pump instance can recover to the normal state, Drainer resumes replication; if not, perform the following steps:

  1. Use binlogctl to change the state of this Pump instance to offline to discard the data of this Pump instance.

  2. Because Drainer cannot obtain the data of this Pump instance, the data in the downstream and upstream is inconsistent. In this situation, perform full and incremental backups again. The steps are as follows:

    1. Stop the Drainer.

    2. Perform a full backup in the upstream.

    3. Clear the data in the downstream including the tidb_binlog.checkpoint table.

    4. Restore the full backup to the downstream.

    5. Deploy Drainer and use initialCommitTs (set initialCommitTs to the snapshot timestamp of the full backup) as the start point of initial replication.

What is checkpoint?

Checkpoint records the commit-ts up to which Drainer has replicated data to the downstream. When Drainer restarts, it reads the checkpoint and then replicates data to the downstream starting from the corresponding commit-ts. The Drainer log entry ["write save point"] [ts=411222863322546177] means that Drainer saved the checkpoint with the corresponding timestamp.

Checkpoint is saved in different ways for different types of downstream platforms:

  • For MySQL/TiDB, it is saved in the tidb_binlog.checkpoint table.

  • For Kafka/file, it is saved in the file of the corresponding configuration directory.

The data of Kafka/file contains commit-ts, so if the checkpoint is lost, you can find the latest commit-ts of the downstream data by consuming the latest data in the downstream.

Drainer reads the checkpoint when it starts. If Drainer cannot read the checkpoint, it uses the configured initialCommitTs as the start point of the initial replication.
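
The ts value in the log entry above is a TSO. As a minimal sketch, assuming the standard TSO layout (physical milliseconds in the high bits and an 18-bit logical counter in the low bits), the following Python converts a checkpoint or commit-ts into wall-clock time. This can help when comparing a checkpoint with the snapshot time of a backup. The names split_tso and tso_to_datetime are illustrative and are not part of any TiDB tool.

```python
from datetime import datetime, timezone
from typing import Tuple

LOGICAL_BITS = 18  # assumption: the low 18 bits of a TSO hold the logical counter


def split_tso(tso: int) -> Tuple[int, int]:
    """Split a TSO into (physical milliseconds since the Unix epoch, logical counter)."""
    return tso >> LOGICAL_BITS, tso & ((1 << LOGICAL_BITS) - 1)


def tso_to_datetime(tso: int) -> datetime:
    """Return the UTC wall-clock time encoded in the physical part of a TSO."""
    physical_ms, _ = split_tso(tso)
    return datetime.fromtimestamp(physical_ms / 1000, tz=timezone.utc)


if __name__ == "__main__":
    checkpoint_ts = 411222863322546177  # value from the log entry quoted above
    print(tso_to_datetime(checkpoint_ts).isoformat())
```

The conversion is read-only: it does not touch the tidb_binlog.checkpoint table or any checkpoint file.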

How to redeploy Drainer on a new machine when Drainer fails and the data in the downstream remains?

If the data in the downstream is not affected, you can redeploy Drainer on a new machine as long as the data can be replicated from the corresponding checkpoint.

How to restore the data of a cluster using a full backup and a binlog backup file?

  1. Clean up the cluster and restore a full backup.

  2. To restore the latest data of the backup file, use Reparo with start-tso set to {snapshot timestamp of the full backup + 1} and stop-tso set to 0 (or specify a point in time).

How to redeploy Drainer when enabling ignore-error in Primary-Secondary replication triggers a critical error?

If a critical error is triggered when TiDB fails to write binlog after enabling ignore-error, TiDB stops writing binlog and binlog data loss occurs. To resume replication, perform the following steps:

  1. Stop the Drainer instance.

  2. Restart the tidb-server instance that triggered the critical error so that it resumes writing binlog (TiDB does not write binlog to Pump after a critical error is triggered).

  3. Perform a full backup in the upstream.

  4. Clear the data in the downstream including the tidb_binlog.checkpoint table.

  5. Restore the full backup to the downstream.

  6. Deploy Drainer and use initialCommitTs (set initialCommitTs to the snapshot timestamp of the full backup) as the start point of initial replication.

When can I pause or close a Pump or Drainer node?

Refer to TiDB Binlog Cluster Operations for a description of the Pump and Drainer states and of how to start and exit the processes.

Pause a Pump or Drainer node when you need to temporarily stop the service. For example:

  • Version upgrade

    Use the new binary to restart the service after the process is stopped.

  • Server maintenance

    When the server needs downtime for maintenance, exit the process and restart the service after the maintenance is finished.

Close a Pump or Drainer node when you no longer need the service. For example:

  • Pump scale-in

    If you no longer need as many Pump services, close some of them.

  • Cancelling replication tasks

    If you no longer need to replicate data to a downstream database, close the corresponding Drainer node.

  • Service migration

    If you need to migrate the service to another server, close the service and re-deploy it on the new server.

How can I pause a Pump or Drainer process?

  • Directly kill the process.

  • If the Pump or Drainer node runs in the foreground, pause it by pressing Ctrl+C.

  • Use the pause-pump or pause-drainer command in binlogctl.

Can I use the update-pump or update-drainer command in binlogctl to pause the Pump or Drainer service?

No. The update-pump or update-drainer command directly modifies the state information saved in PD without notifying Pump or Drainer to perform the corresponding operation. Misusing the two commands can interrupt data replication and might even cause data loss.

Can I use the update-pump or update-drainer command in binlogctl to close the Pump or Drainer service?

No. The update-pump or update-drainer command directly modifies the state information saved in PD without notifying Pump or Drainer to perform the corresponding operation. Misusing the two commands interrupts data replication and might even cause data inconsistency. For example:

  • When a Pump node runs normally or is in the paused state, if you use the update-pump command to set the Pump state to offline, the Drainer node stops pulling the binlog data from the offline Pump. In this situation, the newest binlog cannot be replicated to the Drainer node, causing data inconsistency between upstream and downstream.
  • When a Drainer node runs normally, if you use the update-drainer command to set the Drainer state to offline, the newly started Pump node only notifies Drainer nodes in the online state. In this situation, the offline Drainer fails to pull the binlog data from the Pump node in time, causing data inconsistency between upstream and downstream.

When can I use the update-pump command in binlogctl to set the Pump state to paused?

In some abnormal situations, Pump fails to correctly maintain its state. Then, use the update-pump command to modify the state.

For example, when a Pump process exits abnormally (because the process exits directly when a panic occurs, or because it is mistakenly killed with the kill -9 command), the Pump state information saved in PD is still online. In this situation, if you do not need to restart Pump to recover the service at the moment, use the update-pump command to update the Pump state to paused. This avoids interruptions when TiDB writes binlogs and Drainer pulls binlogs.

When can I use the update-drainer command in binlogctl to set the Drainer state to paused?

In some abnormal situations, the Drainer node fails to correctly maintain its state, which affects the replication task. Then, use the update-drainer command to modify the state.

For example, when a Drainer process exits abnormally (because the process exits directly when a panic occurs, or because it is mistakenly killed with the kill -9 command), the Drainer state information saved in PD is still online. When a Pump node is started, it fails to notify the exited Drainer node (the notify drainer ... error), which causes the Pump node to fail. In this situation, use the update-drainer command to update the Drainer state to paused and restart the Pump node.

How can I close a Pump or Drainer node?

Currently, you can only use the offline-pump or offline-drainer command in binlogctl to close a Pump or Drainer node.

When can I use the update-pump command in binlogctl to set the Pump state to offline?

You can use the update-pump command to set the Pump state to offline in the following situations:

  • When a Pump process is exited abnormally and the service cannot be recovered, the replication task is interrupted. If you want to recover the replication and accept some losses of binlog data, use the update-pump command to set the Pump state to offline. Then, the Drainer node stops pulling binlog from the Pump node and continues replicating data.
  • Some stale Pump nodes are left over from historical tasks. Their processes have been exited and their services are no longer needed. Then, use the update-pump command to set their state to offline.

For other situations, use the offline-pump command to close the Pump service, which is the regular process.

Can I use the update-pump command in binlogctl to set the Pump state to offline if I want to close a Pump node that is exited and set to paused?

When a Pump process has exited and the node is in the paused state, not all the binlog data in the node has been consumed by its downstream Drainer node. Therefore, setting the state to offline with the update-pump command risks data inconsistency between upstream and downstream. In this situation, restart Pump and use the offline-pump command to close the Pump node.

When can I use the update-drainer command in binlogctl to set the Drainer state to offline?

Some stale Drainer nodes are left over from historical tasks. Their processes have been exited and their services are no longer needed. Then, use the update-drainer command to set their state to offline.

Can I use SQL operations such as change pump and change drainer to pause or close the Pump or Drainer service?

No. For more details on these SQL operations, refer to Use SQL statements to manage Pump or Drainer.

These SQL operations directly modify the state information saved in PD and are functionally equivalent to the update-pump and update-drainer commands in binlogctl. To pause or close the Pump or Drainer service, use the binlogctl tool.

What can I do when some DDL statements supported by the upstream database cause an error when executed in the downstream database?

To solve the problem, follow these steps:

  1. Check drainer.log. Search for exec failed to find the last failed DDL operation before the Drainer process exited.

  2. Rewrite the DDL statement into a version that is compatible with the downstream, and execute it manually in the downstream database.

  3. Check drainer.log. Search for the failed DDL operation and find the commit-ts of this operation. For example:

    [2020/05/21 09:51:58.019 +08:00] [INFO] [syncer.go:398] ["add ddl item to syncer, you can add this commit ts to `ignore-txn-commit-ts` to skip this ddl if needed"] [sql="ALTER TABLE `test` ADD INDEX (`index1`)"] ["commit ts"=416815754209656834].
  4. Modify the drainer.toml configuration file. Add the commit-ts in the ignore-txn-commit-ts item and restart the Drainer node.
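
Step 3 above can be scripted. The following is a minimal sketch that extracts the commit ts from a Drainer log line in the format shown in the example. The helper name extract_commit_ts is hypothetical, and the regular expression assumes that the ["commit ts"=...] field appears exactly as in that example.

```python
import re
from typing import Optional

# Matches the ["commit ts"=<digits>] field as it appears in the example log line.
COMMIT_TS_PATTERN = re.compile(r'\["commit ts"=(\d+)\]')


def extract_commit_ts(log_line: str) -> Optional[int]:
    """Return the commit ts found in a log line, or None if the field is absent."""
    match = COMMIT_TS_PATTERN.search(log_line)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    line = (
        '[2020/05/21 09:51:58.019 +08:00] [INFO] [syncer.go:398] '
        '["add ddl item to syncer, you can add this commit ts to '
        '`ignore-txn-commit-ts` to skip this ddl if needed"] '
        '[sql="ALTER TABLE `test` ADD INDEX (`index1`)"] '
        '["commit ts"=416815754209656834].'
    )
    print(extract_commit_ts(line))
```

The returned value is the number to add to the ignore-txn-commit-ts item in step 4.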

TiDB fails to write to binlog and gets stuck, and listener stopped, waiting for manual stop appears in the log

In TiDB v3.0.12 and earlier versions, a binlog write failure causes TiDB to report a fatal error. TiDB does not exit automatically but only stops the service, which makes it appear to be stuck. You can see the listener stopped, waiting for manual stop error in the log.

You need to determine the specific cause of the binlog write failure. If the failure occurs because binlog is written into the downstream slowly, consider scaling out Pump or increasing the timeout for writing binlog.

Since v3.0.13, the error-reporting logic is optimized. A binlog write failure causes the transaction execution to fail, and TiDB Binlog returns an error, but TiDB does not get stuck.

TiDB writes duplicate binlogs to Pump

This issue does not affect the downstream and replication logic.

When a binlog write fails or times out, TiDB retries writing the binlog to the next available Pump node until the write succeeds. Therefore, if the binlog write to a Pump node is slow and causes a TiDB timeout (15s by default), TiDB determines that the write has failed and tries to write to the next Pump node. If the binlog was in fact successfully written to the Pump node that caused the timeout, the same binlog is written to multiple Pump nodes. When Drainer processes the binlog, it automatically de-duplicates binlogs with the same TSO, so this duplicate write does not affect the downstream and replication logic.
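
To illustrate why duplicate writes are harmless, here is a simplified sketch of de-duplication by TSO. It is not Drainer's actual implementation. The (TSO, payload) record shape and the function name dedupe_by_tso are assumptions made for illustration only.

```python
from typing import Iterable, Iterator, Set, Tuple

# Hypothetical simplified record: (TSO, payload).
Binlog = Tuple[int, str]


def dedupe_by_tso(binlogs: Iterable[Binlog]) -> Iterator[Binlog]:
    """Yield each binlog once, skipping later records that repeat a TSO."""
    seen: Set[int] = set()
    for tso, payload in binlogs:
        if tso in seen:
            continue  # same TSO already emitted, for example from another Pump
        seen.add(tso)
        yield tso, payload


if __name__ == "__main__":
    # The record with TSO 100 arrives twice, as if written to two Pump nodes.
    merged = [(100, "txn-a"), (101, "txn-b"), (100, "txn-a")]
    for record in dedupe_by_tso(merged):
        print(record)
```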

Reparo is interrupted during the full and incremental restore process. Can I use the last TSO in the log to resume replication?

Yes. However, Reparo does not automatically enable safe-mode when you start it, so you need to perform the following steps manually:

  1. After Reparo is interrupted, record the last TSO in the log as checkpoint-tso.
  2. Modify the Reparo configuration file, set the configuration item start-tso to checkpoint-tso + 1, set stop-tso to checkpoint-tso + 80,000,000,000 (approximately five minutes after the checkpoint-tso), and set safe-mode to true. Start Reparo, and Reparo replicates data to stop-tso and then stops automatically.
  3. After Reparo stops automatically, set start-tso to checkpoint-tso + 80,000,000,001, set stop-tso to 0, and set safe-mode to false. Start Reparo to resume replication.
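
The TSO arithmetic in the steps above can be sketched as follows. The function name reparo_resume_phases is hypothetical. It only computes the values to put in the Reparo configuration file for each phase and does not run Reparo. The offset of 80,000,000,000 comes from step 2. Assuming the usual TSO layout with an 18-bit logical counter, shifting the offset right by 18 bits gives 305175 milliseconds, that is, about five minutes.

```python
from typing import Dict, List, Union

SAFE_MODE_WINDOW = 80_000_000_000  # TSO offset taken from the steps above

PhaseConfig = Dict[str, Union[int, bool]]


def reparo_resume_phases(checkpoint_tso: int) -> List[PhaseConfig]:
    """Return the start-tso, stop-tso and safe-mode values for the two restart phases."""
    safe_phase: PhaseConfig = {
        "start-tso": checkpoint_tso + 1,
        "stop-tso": checkpoint_tso + SAFE_MODE_WINDOW,
        "safe-mode": True,
    }
    normal_phase: PhaseConfig = {
        "start-tso": checkpoint_tso + SAFE_MODE_WINDOW + 1,
        "stop-tso": 0,  # 0 means no stop point, as in step 3
        "safe-mode": False,
    }
    return [safe_phase, normal_phase]


if __name__ == "__main__":
    # Example checkpoint-tso; replace it with the last TSO from your Reparo log.
    for phase in reparo_resume_phases(416815754209656834):
        print(phase)
```

The second phase starts exactly one TSO after the first phase stops, so the two ranges neither overlap nor leave a gap.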