Data Check for TiDB Upstream and Downstream Clusters

When you use TiCDC to build upstream and downstream clusters of TiDB, you might need to verify the consistency of upstream and downstream data without stopping replication. In the regular replication mode, TiCDC only guarantees that the data is eventually consistent, but cannot guarantee that the data is consistent during the replication process. Therefore, it is difficult to verify the consistency of dynamically changing data. To meet such a need, TiCDC provides the Syncpoint feature.

Syncpoint uses the snapshot feature of TiDB. During replication, TiCDC maintains a ts-map: a set of upstream and downstream TSO pairs, each of which identifies an upstream snapshot and a downstream snapshot that are consistent with each other. In this way, verifying the consistency of dynamically changing data is reduced to verifying the consistency of static snapshot data, which makes nearly real-time verification possible.

To enable the Syncpoint feature, set the value of the TiCDC configuration item enable-sync-point to true when creating a replication task. After enabling Syncpoint, TiCDC will periodically align the upstream and downstream snapshots according to the TiCDC parameter sync-point-interval during the data replication process, and will save the upstream and downstream TSO correspondences in the downstream tidb_cdc.syncpoint_v1 table.

Then, you only need to configure snapshot in sync-diff-inspector to verify the data of the upstream and downstream TiDB clusters. The following TiCDC configuration example enables Syncpoint when a replication task is created:

    # Enables SyncPoint.
    enable-sync-point = true
    # Aligns the upstream and downstream snapshots every 5 minutes
    sync-point-interval = "5m"
    # Cleans up the ts-map data in the downstream tidb_cdc.syncpoint_v1 table every hour
    sync-point-retention = "1h"

Step 1: obtain ts-map

You can execute the following SQL statement in the downstream TiDB cluster to obtain the upstream TSO (primary_ts) and downstream TSO (secondary_ts):

    select * from tidb_cdc.syncpoint_v1;
    +------------------+----------------+--------------------+--------------------+---------------------+
    | ticdc_cluster_id | changefeed     | primary_ts         | secondary_ts       | created_at          |
    +------------------+----------------+--------------------+--------------------+---------------------+
    | default          | test-2         | 435953225454059520 | 435953235516456963 | 2022-09-13 08:40:15 |
    +------------------+----------------+--------------------+--------------------+---------------------+

The fields in the preceding syncpoint_v1 table are described as follows:

  • ticdc_cluster_id: The ID of the TiCDC cluster in this record.
  • changefeed: The ID of the changefeed in this record. Because different TiCDC clusters might have changefeeds with the same name, you need to confirm the ts-map inserted by a changefeed with the TiCDC cluster ID and changefeed ID.
  • primary_ts: The timestamp of the upstream database snapshot.
  • secondary_ts: The timestamp of the downstream database snapshot.
  • created_at: The time when this record is inserted.
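The field descriptions above can be made concrete with a short Python sketch. It is illustrative only and rests on two assumptions: that a TSO uses the usual TiDB layout (the low 18 bits are a logical counter, and the remaining high bits are a physical timestamp in milliseconds since the Unix epoch), and that the ts-map rows have already been fetched into dictionaries keyed by column name. The helper names split_tso, tso_to_datetime, and pick_latest are hypothetical, and the sample row is copied from the query output above rather than read from a database.

```python
from datetime import datetime, timedelta, timezone

LOGICAL_BITS = 18  # assumption: the low 18 bits of a TSO hold the logical counter


def split_tso(tso):
    """Split a TSO into (physical milliseconds since the Unix epoch, logical counter)."""
    return tso >> LOGICAL_BITS, tso & ((1 << LOGICAL_BITS) - 1)


def tso_to_datetime(tso):
    """Return the physical part of a TSO as a UTC datetime with millisecond precision."""
    physical_ms, _ = split_tso(tso)
    seconds, millis = divmod(physical_ms, 1000)
    # Integer arithmetic avoids float rounding on the millisecond part.
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


def pick_latest(rows, cluster_id, changefeed):
    """Return the newest ts-map row for one (TiCDC cluster ID, changefeed ID) pair."""
    matches = [
        row for row in rows
        if row["ticdc_cluster_id"] == cluster_id and row["changefeed"] == changefeed
    ]
    if not matches:
        raise LookupError(f"no ts-map row for {cluster_id}/{changefeed}")
    return max(matches, key=lambda row: row["primary_ts"])


# Sample row copied from the syncpoint_v1 output shown earlier.
ROWS = [
    {
        "ticdc_cluster_id": "default",
        "changefeed": "test-2",
        "primary_ts": 435953225454059520,
        "secondary_ts": 435953235516456963,
    },
]

if __name__ == "__main__":
    row = pick_latest(ROWS, "default", "test-2")
    for column in ("primary_ts", "secondary_ts"):
        print(column, split_tso(row[column]), tso_to_datetime(row[column]).isoformat())
```

Filtering on both ticdc_cluster_id and changefeed mirrors the note above that changefeed names are not unique across TiCDC clusters. Decoding the TSOs is optional; sync-diff-inspector takes the raw values.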

Step 2: configure snapshot

Then configure the snapshot information of the upstream and downstream databases by using the ts-map information obtained in Step 1.

Here is a configuration example of the Datasource config section:

    ######################### Datasource config ########################
    [data-sources.uptidb]
        host = "172.16.0.1"
        port = 4000
        user = "root"
        password = ""
        snapshot = "435953225454059520"

    [data-sources.downtidb]
        host = "172.16.0.2"
        port = 4000
        user = "root"
        snapshot = "435953235516456963"
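If the snapshot values are refreshed often, the two sections can be generated from a ts-map row instead of edited by hand. The following Python sketch shows one possible way to do this. The function name render_data_sources and its parameters are hypothetical, the row is assumed to be a dictionary with primary_ts and secondary_ts keys, and only the configuration keys that appear in the example above are emitted.

```python
def render_data_sources(row, up_host, down_host, port=4000, user="root", password=""):
    """Render the upstream and downstream data source sections from one ts-map row.

    primary_ts becomes the upstream snapshot and secondary_ts the downstream one.
    """
    def section(name, host, snapshot):
        return "\n".join([
            f"[data-sources.{name}]",
            f'    host = "{host}"',
            f"    port = {port}",
            f'    user = "{user}"',
            f'    password = "{password}"',
            f'    snapshot = "{snapshot}"',
        ])

    upstream = section("uptidb", up_host, row["primary_ts"])
    downstream = section("downtidb", down_host, row["secondary_ts"])
    return upstream + "\n\n" + downstream + "\n"


if __name__ == "__main__":
    # Values copied from the example above.
    example_row = {"primary_ts": 435953225454059520, "secondary_ts": 435953235516456963}
    print(render_data_sources(example_row, "172.16.0.1", "172.16.0.2"))
```

The sketch reuses one user and password for both clusters to stay short; real deployments would likely pass separate credentials for each side.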