Scale the TiDB Cluster Using TiDB Ansible

You can increase or decrease the capacity of a TiDB cluster without affecting online services.

Assume that the topology is as follows:

Name     Host IP        Services
node1    172.16.10.1    PD1
node2    172.16.10.2    PD2
node3    172.16.10.3    PD3, Monitor
node4    172.16.10.4    TiDB1
node5    172.16.10.5    TiDB2
node6    172.16.10.6    TiKV1
node7    172.16.10.7    TiKV2
node8    172.16.10.8    TiKV3
node9    172.16.10.9    TiKV4

Increase the capacity of a TiDB/TiKV node

For example, if you want to add two TiDB nodes (node101, node102) with the IP addresses 172.16.10.101 and 172.16.10.102, take the following steps:

  1. Edit the inventory.ini file and the hosts.ini file, and append the node information.

    • Edit the inventory.ini file:

      [tidb_servers]
      172.16.10.4
      172.16.10.5
      172.16.10.101
      172.16.10.102

      [pd_servers]
      172.16.10.1
      172.16.10.2
      172.16.10.3

      [tikv_servers]
      172.16.10.6
      172.16.10.7
      172.16.10.8
      172.16.10.9

      [monitored_servers]
      172.16.10.1
      172.16.10.2
      172.16.10.3
      172.16.10.4
      172.16.10.5
      172.16.10.6
      172.16.10.7
      172.16.10.8
      172.16.10.9
      172.16.10.101
      172.16.10.102

      [monitoring_servers]
      172.16.10.3

      [grafana_servers]
      172.16.10.3

      Now the topology is as follows:

      Name       Host IP          Services
      node1      172.16.10.1      PD1
      node2      172.16.10.2      PD2
      node3      172.16.10.3      PD3, Monitor
      node4      172.16.10.4      TiDB1
      node5      172.16.10.5      TiDB2
      node101    172.16.10.101    TiDB3
      node102    172.16.10.102    TiDB4
      node6      172.16.10.6      TiKV1
      node7      172.16.10.7      TiKV2
      node8      172.16.10.8      TiKV3
      node9      172.16.10.9      TiKV4
    • Edit the hosts.ini file:

      [servers]
      172.16.10.1
      172.16.10.2
      172.16.10.3
      172.16.10.4
      172.16.10.5
      172.16.10.6
      172.16.10.7
      172.16.10.8
      172.16.10.9
      172.16.10.101
      172.16.10.102

      [all:vars]
      username = tidb
      ntp_server = pool.ntp.org
  2. Initialize the newly added node.

    1. Configure the SSH mutual trust and sudo rules for the deployment target machines on the central control machine:

      ansible-playbook -i hosts.ini create_users.yml -l 172.16.10.101,172.16.10.102 -u root -k
    2. Install the NTP service on the deployment target machine:

      ansible-playbook -i hosts.ini deploy_ntp.yml -u tidb -b
    3. Initialize the node on the deployment target machine:

      ansible-playbook bootstrap.yml -l 172.16.10.101,172.16.10.102
  3. Deploy the newly added node:

    ansible-playbook deploy.yml -l 172.16.10.101,172.16.10.102
  4. Start the newly added node:

    ansible-playbook start.yml -l 172.16.10.101,172.16.10.102
  5. Update the Prometheus configuration and perform a rolling restart of Prometheus:

    ansible-playbook rolling_update_monitor.yml --tags=prometheus
  6. Monitor the status of the entire cluster and the newly added node by opening a browser to access the monitoring platform: http://172.16.10.3:3000.

You can add a TiKV node using the same procedure. However, to add a PD node, you need to manually update some configuration files, as described in the next section.
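
Before moving on, you can optionally verify that the TiDB instances added above accept SQL connections. A minimal sketch, assuming the default TiDB service port 4000 and a passwordless root account (adjust the user, password, and port to match your deployment):

    # Run a trivial query against each newly added TiDB instance.
    # The port (4000) and passwordless root login are assumptions; adjust as needed.
    for host in 172.16.10.101 172.16.10.102; do
        mysql -u root -h "${host}" -P 4000 -e "SELECT tidb_version();"
    done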

Increase the capacity of a PD node

For example, if you want to add a PD node (node103) with the IP address 172.16.10.103, take the following steps:

  1. Edit the inventory.ini file and append the node information to the end of the [pd_servers] group:

    [tidb_servers]
    172.16.10.4
    172.16.10.5

    [pd_servers]
    172.16.10.1
    172.16.10.2
    172.16.10.3
    172.16.10.103

    [tikv_servers]
    172.16.10.6
    172.16.10.7
    172.16.10.8
    172.16.10.9

    [monitored_servers]
    172.16.10.4
    172.16.10.5
    172.16.10.1
    172.16.10.2
    172.16.10.3
    172.16.10.103
    172.16.10.6
    172.16.10.7
    172.16.10.8
    172.16.10.9

    [monitoring_servers]
    172.16.10.3

    [grafana_servers]
    172.16.10.3

    Now the topology is as follows:

    Name       Host IP          Services
    node1      172.16.10.1      PD1
    node2      172.16.10.2      PD2
    node3      172.16.10.3      PD3, Monitor
    node103    172.16.10.103    PD4
    node4      172.16.10.4      TiDB1
    node5      172.16.10.5      TiDB2
    node6      172.16.10.6      TiKV1
    node7      172.16.10.7      TiKV2
    node8      172.16.10.8      TiKV3
    node9      172.16.10.9      TiKV4
  2. Initialize the newly added node:

    ansible-playbook bootstrap.yml -l 172.16.10.103
  3. Deploy the newly added node:

    ansible-playbook deploy.yml -l 172.16.10.103
  4. Log in to the newly added PD node and edit the startup script (a sketch of the edited script appears after this procedure):

    {deploy_dir}/scripts/run_pd.sh
    1. Remove the --initial-cluster="xxxx" \ configuration.

    2. Add --join="http://172.16.10.1:2379" \. The IP address (172.16.10.1) can be the IP address of any existing PD node in the cluster.

    3. Start the PD service on the newly added PD node:

      {deploy_dir}/scripts/start_pd.sh
    4. Use pd-ctl to check whether the new node is added successfully:

      ./pd-ctl -u "http://172.16.10.1:2379" -d member
  5. Start the monitoring service:

    ansible-playbook start.yml -l 172.16.10.103
  6. Update the cluster configuration:

    ansible-playbook deploy.yml
  7. Restart Prometheus to enable monitoring of the newly added PD node:

    ansible-playbook stop.yml --tags=prometheus
    ansible-playbook start.yml --tags=prometheus
  8. Monitor the status of the entire cluster and the newly added node by opening a browser to access the monitoring platform: http://172.16.10.3:3000.
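
For reference, after the edits in step 4 the startup script might look roughly like the sketch below. The exact script generated by TiDB Ansible differs between versions; the deploy directory, the --name value, and the URLs shown here are illustrative assumptions. The only point is the replacement of --initial-cluster with --join:

    #!/bin/bash
    # Illustrative sketch of {deploy_dir}/scripts/run_pd.sh after the edit in step 4.
    # Deploy directory, --name, and other flag values are assumptions for this example.
    DEPLOY_DIR=/home/tidb/deploy
    cd "${DEPLOY_DIR}" || exit 1
    exec bin/pd-server \
        --name="pd_node103" \
        --client-urls="http://172.16.10.103:2379" \
        --peer-urls="http://172.16.10.103:2380" \
        --data-dir="${DEPLOY_DIR}/data.pd" \
        --join="http://172.16.10.1:2379" \
        --config=conf/pd.toml \
        --log-file="${DEPLOY_DIR}/log/pd.log"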

Decrease the capacity of a TiDB node

For example, if you want to remove a TiDB node (node5) with the IP address 172.16.10.5, take the following steps:

  1. Stop all services on node5:

    ansible-playbook stop.yml -l 172.16.10.5
  2. Edit the inventory.ini file and remove the node information:

    [tidb_servers]
    172.16.10.4
    #172.16.10.5  # the removed node

    [pd_servers]
    172.16.10.1
    172.16.10.2
    172.16.10.3

    [tikv_servers]
    172.16.10.6
    172.16.10.7
    172.16.10.8
    172.16.10.9

    [monitored_servers]
    172.16.10.4
    #172.16.10.5  # the removed node
    172.16.10.1
    172.16.10.2
    172.16.10.3
    172.16.10.6
    172.16.10.7
    172.16.10.8
    172.16.10.9

    [monitoring_servers]
    172.16.10.3

    [grafana_servers]
    172.16.10.3

    Now the topology is as follows:

    Name     Host IP        Services
    node1    172.16.10.1    PD1
    node2    172.16.10.2    PD2
    node3    172.16.10.3    PD3, Monitor
    node4    172.16.10.4    TiDB1
    node5    172.16.10.5    TiDB2 (removed)
    node6    172.16.10.6    TiKV1
    node7    172.16.10.7    TiKV2
    node8    172.16.10.8    TiKV3
    node9    172.16.10.9    TiKV4
  3. Update the Prometheus configuration and perform a rolling restart of Prometheus:

    ansible-playbook rolling_update_monitor.yml --tags=prometheus
  4. Monitor the status of the entire cluster by opening a browser to access the monitoring platform: http://172.16.10.3:3000.
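
To double-check that the removed TiDB node has also been dropped from monitoring, you can query the Prometheus targets API directly. A minimal sketch, assuming Prometheus listens on its default port 9090 on the monitoring host:

    # Check whether 172.16.10.5 still appears among the active scrape targets.
    # The Prometheus port (9090) is an assumption; use the port configured for your cluster.
    if curl -s http://172.16.10.3:9090/api/v1/targets | grep -q "172.16.10.5"; then
        echo "node5 is still being scraped"
    else
        echo "node5 is no longer scraped"
    fi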

Decrease the capacity of a TiKV node

For example, if you want to remove a TiKV node (node9) with the IP address 172.16.10.9, take the following steps:

  1. Remove the node from the cluster using pd-ctl:

    1. View the store ID of node9:

      ./pd-ctl -u "http://172.16.10.1:2379" -d store
    2. Remove node9 from the cluster, assuming that the store ID is 10:

      ./pd-ctl -u "http://172.16.10.1:2379" -d store delete 10
  2. Use pd-ctl to check whether the node is successfully removed. Data migration takes some time; the node is fully removed once PD reports its state as Tombstone (see the polling sketch after this procedure):

    ./pd-ctl -u "http://172.16.10.1:2379" -d store 10
  3. After the node is successfully removed, stop the services on node9:

    ansible-playbook stop.yml -l 172.16.10.9
  4. Edit the inventory.ini file and remove the node information:

    [tidb_servers]
    172.16.10.4
    172.16.10.5

    [pd_servers]
    172.16.10.1
    172.16.10.2
    172.16.10.3

    [tikv_servers]
    172.16.10.6
    172.16.10.7
    172.16.10.8
    #172.16.10.9  # the removed node

    [monitored_servers]
    172.16.10.4
    172.16.10.5
    172.16.10.1
    172.16.10.2
    172.16.10.3
    172.16.10.6
    172.16.10.7
    172.16.10.8
    #172.16.10.9  # the removed node

    [monitoring_servers]
    172.16.10.3

    [grafana_servers]
    172.16.10.3

    Now the topology is as follows:

    Name     Host IP        Services
    node1    172.16.10.1    PD1
    node2    172.16.10.2    PD2
    node3    172.16.10.3    PD3, Monitor
    node4    172.16.10.4    TiDB1
    node5    172.16.10.5    TiDB2
    node6    172.16.10.6    TiKV1
    node7    172.16.10.7    TiKV2
    node8    172.16.10.8    TiKV3
    node9    172.16.10.9    TiKV4 (removed)
  5. Update the Prometheus configuration and perform a rolling restart of Prometheus:

    ansible-playbook rolling_update_monitor.yml --tags=prometheus
  6. Monitor the status of the entire cluster by opening a browser to access the monitoring platform: http://172.16.10.3:3000.
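
Because Region data is migrated off the store gradually, the removal in step 2 can take a while. A minimal polling sketch, assuming pd-ctl is run from its own directory as in the commands above and that the store ID is 10 as in the example:

    # Poll store 10 until PD reports it as Tombstone (fully removed).
    # The store ID (10) follows the example above; field names match pd-ctl's JSON output.
    while true; do
        info=$(./pd-ctl -u "http://172.16.10.1:2379" -d store 10)
        echo "${info}" | grep -E '"state_name"|"region_count"'
        echo "${info}" | grep -q '"Tombstone"' && break
        sleep 30
    done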

Decrease the capacity of a PD node

For example, if you want to remove a PD node (node2) with the IP address 172.16.10.2, take the following steps:

  1. Remove the node from the cluster using pd-ctl:

    1. View the name of node2:

      ./pd-ctl -u "http://172.16.10.1:2379" -d member
    2. Remove node2 from the cluster, assuming that the name is pd2:

      ./pd-ctl -u "http://172.16.10.1:2379" -d member delete name pd2
  2. Use Grafana or pd-ctl to check whether the node is successfully removed:

    ./pd-ctl -u "http://172.16.10.1:2379" -d member
  3. After the node is successfully removed, stop the services on node2:

    ansible-playbook stop.yml -l 172.16.10.2
  4. Edit the inventory.ini file and remove the node information:

    [tidb_servers]
    172.16.10.4
    172.16.10.5

    [pd_servers]
    172.16.10.1
    #172.16.10.2  # the removed node
    172.16.10.3

    [tikv_servers]
    172.16.10.6
    172.16.10.7
    172.16.10.8
    172.16.10.9

    [monitored_servers]
    172.16.10.4
    172.16.10.5
    172.16.10.1
    #172.16.10.2  # the removed node
    172.16.10.3
    172.16.10.6
    172.16.10.7
    172.16.10.8
    172.16.10.9

    [monitoring_servers]
    172.16.10.3

    [grafana_servers]
    172.16.10.3

    Now the topology is as follows:

    Name     Host IP        Services
    node1    172.16.10.1    PD1
    node2    172.16.10.2    PD2 (removed)
    node3    172.16.10.3    PD3, Monitor
    node4    172.16.10.4    TiDB1
    node5    172.16.10.5    TiDB2
    node6    172.16.10.6    TiKV1
    node7    172.16.10.7    TiKV2
    node8    172.16.10.8    TiKV3
    node9    172.16.10.9    TiKV4
  5. Update the cluster configuration:

    ansible-playbook deploy.yml
  6. Restart Prometheus to stop monitoring the removed PD node:

    ansible-playbook stop.yml --tags=prometheus
    ansible-playbook start.yml --tags=prometheus
  7. To monitor the status of the entire cluster, open a browser to access the monitoring platform: http://172.16.10.3:3000.
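
If you prefer a scriptable check over the interactive output in step 2, you can grep the member list for the removed member's name. A small sketch, assuming the member name is pd2 as in the example above:

    # Confirm that pd2 no longer appears in the PD member list.
    # The member name (pd2) follows the example above.
    if ./pd-ctl -u "http://172.16.10.1:2379" -d member | grep -q '"pd2"'; then
        echo "pd2 is still listed as a PD member"
    else
        echo "pd2 has been removed from the PD member list"
    fi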