TiCDC Canal-JSON Protocol

Canal-JSON is a data exchange format protocol defined by Alibaba Canal. In this document, you can learn how Canal-JSON data formats are implemented in TiCDC, including the TiDB extension field, the definitions of the Canal-JSON data formats, and comparison with the official Canal.

Use Canal-JSON

When using Message Queue (MQ) as the downstream Sink, you can specify Canal-JSON in sink-uri. TiCDC wraps and constructs Canal-JSON messages with Event as the basic unit, and sends TiDB data change Events to the downstream.

There are three types of Events:

  • DDL Event: Represents a DDL change record. It is sent after an upstream DDL statement is successfully executed. The DDL Event is sent to the MQ Partition with the index being 0.
  • DML Event: Represents a row data change record. This type of Event is sent when a row change occurs. It contains the information about the row after the change occurs.
  • WATERMARK Event: Represents a special time point. It indicates that the Events received before this point is complete. It applies only to the TiDB extension field and takes effect when you set enable-tidb-extension to true in sink-uri.

The following is an example of using Canal-JSON:

cdc cli changefeed create --server=http://127.0.0.1:8300 --changefeed-id="kafka-canal-json" --sink-uri="kafka://127.0.0.1:9092/topic-name?kafka-version=2.4.0&protocol=canal-json"

TiDB extension field

The Canal-JSON protocol is originally designed for MySQL. It does not contain important fields such as the TiDB-specific unique identifier for the CommitTS transaction. To solve this problem, TiCDC appends a TiDB extension field to the Canal-JSON protocol format. After you set enable-tidb-extension to true (false by default) in sink-uri, TiCDC behaves as follows when generating Canal-JSON messages:

  • TiCDC sends DML Event and DDL Event messages that contain a field named _tidb.
  • TiCDC sends WATERMARK Event messages.

The following is an example:

cdc cli changefeed create --server=http://127.0.0.1:8300 --changefeed-id="kafka-canal-json-enable-tidb-extension" --sink-uri="kafka://127.0.0.1:9092/topic-name?kafka-version=2.4.0&protocol=canal-json&enable-tidb-extension=true"

Definitions of message formats

This section describes the formats of DDL Event, DML Event and WATERMARK Event, and how the data is resolved on the consumer side.

DDL Event

TiCDC encodes a DDL Event into the following Canal-JSON format.

{ "id": 0, "database": "test", "table": "", "pkNames": null, "isDdl": true, "type": "QUERY", "es": 1639633094670, "ts": 1639633095489, "sql": "drop database if exists test", "sqlType": null, "mysqlType": null, "data": null, "old": null, "_tidb": { // TiDB extension field "commitTs": 163963309467037594 } }

The fields are explained as follows.

FieldTypeDescription
idNumberThe default value is 0 in TiCDC.
databaseStringThe name of the database where the row is located
tableStringThe name of the table where the row is located
pkNamesArrayThe names of all the columns that make up the primary key
isDdlBoolWhether the message is a DDL event
typeStringEvent types defined by Canal-JSON
esNumber13-bit (millisecond) timestamp when the event that generated the message happened
tsNumber13-bit (millisecond) timestamp when TiCDC generated the message
sqlStringWhen isDdl is true, records the corresponding DDL statement
sqlTypeObjectWhen isDdl is false, records how the data type of each column is represented in Java
mysqlTypeobjectWhen isDdl is false, records how the data type of each column is represented in MySQL
dataObjectWhen isDdl is false, records the name of each column and its data value
oldObjectOnly if the message is generated by an update Event, records the name of each column and the data value before the update
_tidbObjectTiDB extension field. It exists only if you set enable-tidb-extension to true. The value of commitTs is the TSO of the transaction that caused the row to change.

DML Event

TiCDC encodes a row of DML data change event as follows:

{ "id": 0, "database": "test", "table": "tp_int", "pkNames": [ "id" ], "isDdl": false, "type": "INSERT", "es": 1639633141221, "ts": 1639633142960, "sql": "", "sqlType": { "c_bigint": -5, "c_int": 4, "c_mediumint": 4, "c_smallint": 5, "c_tinyint": -6, "id": 4 }, "mysqlType": { "c_bigint": "bigint", "c_int": "int", "c_mediumint": "mediumint", "c_smallint": "smallint", "c_tinyint": "tinyint", "id": "int" }, "data": [ { "c_bigint": "9223372036854775807", "c_int": "2147483647", "c_mediumint": "8388607", "c_smallint": "32767", "c_tinyint": "127", "id": "2" } ], "old": null, "_tidb": { // TiDB extension field "commitTs": 163963314122145239 } }

WATERMARK Event

TiCDC sends a WATERMARK Event only when you set enable-tidb-extension to true. The value of the type field is TIDB_WATERMARK. The Event contains the _tidb field, and the field contains only one parameter watermarkTs. The value of watermarkTs is the TSO recorded when the Event is sent.

When you receive an Event of this type, all Events with commitTs less than watermarkTs have been sent. Because TiCDC provides the "At Least Once" semantics, data might be sent repeatedly. If a subsequent Event with commitTs less than watermarkTs is received, you can safely ignore this Event.

The following is an example of the WATERMARK Event.

{ "id": 0, "database": "", "table": "", "pkNames": null, "isDdl": false, "type": "TIDB_WATERMARK", "es": 1640007049196, "ts": 1640007050284, "sql": "", "sqlType": null, "mysqlType": null, "data": null, "old": null, "_tidb": { // TiDB extension field "watermarkTs": 429918007904436226 } }

Data resolution on the consumer side

As you can see from the example above, Canal-JSON has a uniform data format, with different field filling rules for different Event types. You can use a uniform method to resolve this JSON format data, and then determine the Event type by checking the field values.

  • When isDdl is true, the message contains a DDL Event.
  • When isDdl is false, you need to further check the type field. If type is TIDB_WATERMARK, it is a WATERMARK Event; otherwise, it is a DML Event.

Field descriptions

The Canal-JSON format records the corresponding data type in the mysqlType field and the sqlType field.

MySQL Type field

In the mysqlType field, the Canal-JSON format records the string of MySQL Type in each column. For more information, see TiDB Data Types.

SQL Type field

In the sqlType field, the Canal-JSON format records Java SQL Type of each column, which is the data type corresponding to the data in JDBC. Its value can be calculated by MySQL Type and the specific data value. The mapping is as follows:

MySQL TypeJava SQL Type Code
Boolean-6
Float7
Double8
Decimal3
Char1
Varchar12
Binary2004
Varbinary2004
Tinytext2005
Text2005
Mediumtext2005
Longtext2005
Tinyblob2004
Blob2004
Mediumblob2004
Longblob2004
Date91
Datetime93
Timestamp93
Time92
Year12
Enum4
Set-7
Bit-7
JSON12

Integer types

You need to consider whether integer types have the Unsigned constraint and the value size, which corresponds to different Java SQL Type Codes respectively, as shown in the following table.

MySQL Type StringValue RangeJava SQL Type Code
tinyint[-128, 127]-6
tinyint unsigned[0, 127]-6
tinyint unsigned[128, 255]5
smallint[-32768, 32767]5
smallint unsigned[0, 32767]5
smallint unsigned[32768, 65535]4
mediumint[-8388608, 8388607]4
mediumint unsigned[0, 8388607]4
mediumint unsigned[8388608, 16777215]4
int[-2147483648, 2147483647]4
int unsigned[0, 2147483647]4
int unsigned[2147483648, 4294967295]-5
bigint[-9223372036854775808, 9223372036854775807]-5
bigint unsigned[0, 9223372036854775807]-5
bigint unsigned[9223372036854775808, 18446744073709551615]3

The following table shows the mapping relationships between Java SQL Types in TiCDC and their codes.

Java SQL TypeJava SQL Type Code
CHAR1
DECIMAL3
INTEGER4
SMALLINT5
REAL7
DOUBLE8
VARCHAR12
DATE91
TIME92
TIMESTAMP93
BLOB2004
CLOB2005
BIGINT-5
TINYINT-6
Bit-7

For more information about Java SQL Types, see Java SQL Class Types.

Binary and Blob types

TiCDC encodes binary types in the Canal-JSON format by converting each byte to its character representation as follows:

  • Printable characters are represented using the ISO/IEC 8859-1 character encodings.
  • Non-printable characters and certain characters with special meaning in HTML are represented using their UTF-8 escape sequence.

The following table shows the detailed representation information.

Character typeValue rangeCharacter representation
Control characters[0, 31]UTF-8 escape (such as \u0000 through \u001F)
Horizontal tab[9]\t
Line feed[10]\n
Carriage return[13]\r
Printable characters[32, 127]Literal character (such as A)
Ampersand[38]\u0026
Less-than sign[60]\u0038
Greater-than sign[62]\u003E
Extended control characters[128, 159]Literal character
ISO 8859-1 (Latin-1)[160, 255]Literal character

Example of the encoding

For example, the following 16 bytes [5 7 10 15 36 50 43 99 120 60 38 255 254 45 55 70] stored in a VARBINARY column called c_varbinary are encoded in a Canal-JSON Update event as follows:

{ ... "data": [ { ... "c_varbinary": "\u0005\u0007\n\u000f$2+cx\u003c\u0026ÿþ-7F" } ] ... }

Comparison of TiCDC Canal-JSON and the official Canal

The way that TiCDC implements the Canal-JSON data format, including the Update Event and the mysqlType field, differs from the official Canal. The following table shows the main differences.

ItemTiCDC Canal-JSONCanal
Event of Update TypeBy default, the old field contains all the column data. When only_output_updated_columns is true, the old field contains only the modified column data.The old field contains only the modified column data
mysqlType fieldFor types with parameters, it does not contain the information of the type parameterFor types with parameters, it contains the full information of the type parameter

Event of Update Type

For an Event of Update Type:

  • In TiCDC, the old field contains all the column data
  • In the official Canal, the old field contains only the modified column data

Assume that the following SQL statements are executed sequentially in the upstream TiDB:

create table tp_int ( id int auto_increment, c_tinyint tinyint null, c_smallint smallint null, c_mediumint mediumint null, c_int int null, c_bigint bigint null, constraint pk primary key (id) ); insert into tp_int(c_tinyint, c_smallint, c_mediumint, c_int, c_bigint) values (127, 32767, 8388607, 2147483647, 9223372036854775807); update tp_int set c_int = 0, c_tinyint = 0 where c_smallint = 32767;

For the update statement, TiCDC outputs an Event message with type as UPDATE, as shown below. The update statement only modifies the c_int and c_tinyint columns. The old field in the output event message contains all the column data.

{ "id": 0, ... "type": "UPDATE", ... "sqlType": { ... }, "mysqlType": { ... }, "data": [ { "c_bigint": "9223372036854775807", "c_int": "0", "c_mediumint": "8388607", "c_smallint": "32767", "c_tinyint": "0", "id": "2" } ], "old": [ // In TiCDC, this field contains all the column data. { "c_bigint": "9223372036854775807", "c_int": "2147483647", // Modified column "c_mediumint": "8388607", "c_smallint": "32767", "c_tinyint": "127", // Modified column "id": "2" } ] }

For the official Canal, the old field in the output event message contains only the modified column data, as shown below.

{ "id": 0, ... "type": "UPDATE", ... "sqlType": { ... }, "mysqlType": { ... }, "data": [ { "c_bigint": "9223372036854775807", "c_int": "0", "c_mediumint": "8388607", "c_smallint": "32767", "c_tinyint": "0", "id": "2" } ], "old": [ // In Canal, this field contains only the modified column data. { "c_int": "2147483647", // Modified column "c_tinyint": "127", // Modified column } ] }

mysqlType field

For the mysqlType field, if a type contains parameters, the official Canal contains the full information of the type parameter. TiCDC does not contain such information.

In the following example, the table-defining SQL statement contains a parameter for each column, such as the ones for decimal, char, varchar and enum. By comparing the Canal-JSON formats generated by TiCDC and the official Canal, you can see that TiCDC only contains the basic MySQL information in the mysqlType field. If you need the full information of the type parameter, you need to implement it by other means.

Assume that the following SQL statements are executed sequentially in the upstream TiDB:

create table t ( id int auto_increment, c_decimal decimal(10, 4) null, c_char char(16) null, c_varchar varchar(16) null, c_binary binary(16) null, c_varbinary varbinary(16) null, c_enum enum('a','b','c') null, c_set set('a','b','c') null, c_bit bit(64) null, constraint pk primary key (id) ); insert into t (c_decimal, c_char, c_varchar, c_binary, c_varbinary, c_enum, c_set, c_bit) values (123.456, "abc", "abc", "abc", "abc", 'a', 'a,b', b'1000001');

The output of TiCDC is as follows:

{ "id": 0, ... "isDdl": false, "sqlType": { ... }, "mysqlType": { "c_binary": "binary", "c_bit": "bit", "c_char": "char", "c_decimal": "decimal", "c_enum": "enum", "c_set": "set", "c_varbinary": "varbinary", "c_varchar": "varchar", "id": "int" }, "data": [ { ... } ], "old": null, }

The output of the official Canal is as follows:

{ "id": 0, ... "isDdl": false, "sqlType": { ... }, "mysqlType": { "c_binary": "binary(16)", "c_bit": "bit(64)", "c_char": "char(16)", "c_decimal": "decimal(10, 4)", "c_enum": "enum('a','b','c')", "c_set": "set('a','b','c')", "c_varbinary": "varbinary(16)", "c_varchar": "varchar(16)", "id": "int" }, "data": [ { ... } ], "old": null, }

Changes in TiCDC Canal-JSON

Changes in the Old field of the Delete events

From v5.4.0, the old field of the Delete events has changed.

The following is a Delete event message. Before v5.4.0, the old field contains the same content as the "data" field. In v5.4.0 and later versions, the old field is set to null. You can get the deleted data by using the "data" field.

{ "id": 0, "database": "test", ... "type": "DELETE", ... "sqlType": { ... }, "mysqlType": { ... }, "data": [ { "c_bigint": "9223372036854775807", "c_int": "0", "c_mediumint": "8388607", "c_smallint": "32767", "c_tinyint": "0", "id": "2" } ], "old": null, // The following is an example before v5.4.0. The `old` field contains the same content as the "data" field. "old": [ { "c_bigint": "9223372036854775807", "c_int": "0", "c_mediumint": "8388607", "c_smallint": "32767", "c_tinyint": "0", "id": "2" } ] }