TiDB is a highly available database that keeps running smoothly when some of its nodes go offline. For this reason, you can safely shut down and maintain the underlying Kubernetes nodes without affecting TiDB's service. Because PD, TiKV, and TiDB instances behave differently, you need to adopt a different maintenance strategy depending on which instances a node holds.
This document introduces how to perform a temporary or long-term maintenance task for the Kubernetes nodes.
Migrating PD and TiDB instances from a node is relatively fast, so you can proactively evict the instances to other nodes and perform maintenance on the desired node:
Check whether there is any TiKV instance on the node to be maintained:
kubectl get pod --all-namespaces -o wide | grep <node-name>
If any TiKV instance is found, see Maintain nodes that hold TiKV instances.
Use the kubectl cordon command to prevent new Pods from being scheduled to the node to be maintained:
kubectl cordon <node-name>
Use the kubectl drain command to migrate the database instances on the maintenance node to other nodes:
kubectl drain <node-name> --ignore-daemonsets --delete-local-data
After you run this command, TiDB instances on this node are automatically migrated to another available node. The PD instance triggers the auto-failover mechanism after five minutes, which restores the expected number of PD members.
At this point, if you want to take this Kubernetes node offline, you can delete it by running:
kubectl delete node <node-name>
If you want to recover a Kubernetes node, you need to first make sure that it is in a healthy state:
watch kubectl get node <node-name>
After the node goes into the Ready state, you can proceed with the following operations.
Use kubectl uncordon to lift the scheduling restriction on the node:
kubectl uncordon <node-name>
Check whether all Pods are back to normal and running:
watch kubectl get -n $namespace pod -o wide
Or:
watch tkctl get all
When you have confirmed that all Pods are running normally, the maintenance task is finished.
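The steps above can be sketched as a pair of shell helpers. This is a minimal illustration rather than an official tool: the function names are hypothetical, and the sketch assumes that TiKV Pods carry the label app.kubernetes.io/component=tikv and that kubectl already points at the right cluster. It refuses to drain a node that still holds TiKV Pods, and it uses kubectl wait in place of watching the node status by hand.

```shell
# Print the names of TiKV Pods scheduled on the given node (empty if none).
# Assumption: TiKV Pods carry the label app.kubernetes.io/component=tikv.
tikv_pods_on_node() {
  kubectl get pod --all-namespaces \
    -l app.kubernetes.io/component=tikv \
    --field-selector "spec.nodeName=$1" \
    -o name
}

# Cordon and drain a node, refusing to continue if it holds TiKV Pods.
drain_non_tikv_node() {
  node="$1"
  pods="$(tikv_pods_on_node "$node")" || return 1
  if [ -n "$pods" ]; then
    echo "TiKV Pods found on ${node}; use the TiKV procedure instead." >&2
    return 1
  fi
  kubectl cordon "$node" || return 1
  kubectl drain "$node" --ignore-daemonsets --delete-local-data
}

# Wait until the node reports Ready, then lift the scheduling restriction.
# The second argument is an optional timeout (defaults to 10m here).
recover_node() {
  node="$1"
  timeout="${2:-10m}"
  kubectl wait --for=condition=Ready "node/${node}" --timeout="$timeout" || return 1
  kubectl uncordon "$node"
}
```

After sourcing the helpers, run drain_non_tikv_node <node-name> before the maintenance and recover_node <node-name> once the node is back. You still need to check the Pods afterwards as described above.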
Migrating TiKV instances is relatively slow and might lead to unwanted data migration load on the cluster. For this reason, you need to choose different strategies as needed prior to maintaining the nodes that hold TiKV instances:
- If you want to maintain a node that is recoverable in a short term, you can recover the TiKV node from where it is after the maintenance without migrating it elsewhere.
- If you want to maintain a node that is not recoverable in a short term, you need to make a plan for the TiKV migration task.
For short-term maintenance, you can increase the TiKV instance downtime that the cluster allows by adjusting the max-store-down-time configuration of the PD cluster. If you finish the maintenance and recover the Kubernetes node within this time, all TiKV instances on this node are automatically recovered.
kubectl port-forward svc/<CLUSTER_NAME>-pd 2379:2379
pd-ctl -d config set max-store-down-time 10m
After you set max-store-down-time to an appropriate value, the follow-up operations are the same as those in Maintain nodes that hold PD and TiDB instances.
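Before you change max-store-down-time, it can help to check that the planned maintenance window actually fits within the value you intend to set. The following is a small sketch of that arithmetic in shell. The helper names to_seconds and fits_within are hypothetical, and the sketch deliberately handles only single-unit durations such as 45s, 10m, or 2h, not compound values such as 1h30m.

```shell
# Convert a single-unit duration (for example 45s, 10m, 2h) to seconds.
# Returns non-zero for anything else, including compound values like 1h30m.
to_seconds() {
  value="${1%[smh]}"
  unit="${1#"$value"}"
  case "$value" in
    ''|*[!0-9]*) return 1 ;;
  esac
  case "$unit" in
    s) echo "$value" ;;
    m) echo $((value * 60)) ;;
    h) echo $((value * 3600)) ;;
    *) return 1 ;;
  esac
}

# Succeed only if the planned downtime is strictly shorter than the limit.
# Returns 2 if either argument cannot be parsed.
fits_within() {
  planned="$(to_seconds "$1")" || return 2
  limit="$(to_seconds "$2")" || return 2
  [ "$planned" -lt "$limit" ]
}
```

For example, fits_within 5m 10m && pd-ctl -d config set max-store-down-time 10m applies the new value only when a five-minute maintenance fits inside it. Leave some margin for the node to restart and for the TiKV Pods to come back up.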
To maintain a node that cannot be recovered in the short term (for example, a node that has to go offline for a long time), you need to use pd-ctl to proactively tell the TiDB cluster to take the corresponding TiKV instances offline, and then manually unbind the instances from the node.
Use kubectl cordon to prevent new Pods from being scheduled to the node to be maintained:
kubectl cordon <node-name>
Check the TiKV instance on the maintenance node:
tkctl get -A tikv | grep <node-name>
Use pd-ctl to proactively put the TiKV instance offline:
Check the store-id of the TiKV instance:
kubectl get tc <CLUSTER_NAME> -ojson | jq '.status.tikv.stores | .[] | select ( .podName == "<POD_NAME>" ) | .id'
Make the TiKV instance offline:
kubectl port-forward svc/<CLUSTER_NAME>-pd 2379:2379
pd-ctl -d store delete <ID>
Wait for the store to change its status to Tombstone:
watch pd-ctl -d store <ID>
Unbind the TiKV instance from the local drive of the node:
Get the PersistentVolumeClaim used by the Pod:
kubectl get -n <namespace> pod <pod_name> -ojson | jq '.spec.volumes | .[] | select (.name == "tikv") | .persistentVolumeClaim.claimName'
Delete the PersistentVolumeClaim:
kubectl delete -n <namespace> pvc <pvc_name>
Delete the TiKV instance:
kubectl delete -n <namespace> pod <pod_name>
Check whether the TiKV instance is normally scheduled to another node:
watch kubectl -n <namespace> get pod -o wide
If there are more TiKV instances on the maintenance node, repeat the above steps until all instances are migrated to other nodes.
After you make sure that there are no more TiKV instances on the node, you can evict the other instances on the node:
kubectl drain <node-name> --ignore-daemonsets --delete-local-data
Confirm again that there are no more TiKV, TiDB, or PD instances running on this node:
kubectl get pod --all-namespaces | grep <node-name>
(Optional) If this node is to stay offline for a long time, it is recommended to delete it from the Kubernetes cluster:
kubectl delete node <node-name>
Now, you have successfully finished the maintenance task for the node.
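Two of the waiting and checking steps in this procedure can be scripted instead of watched by hand. The sketch below is illustrative only and rests on several assumptions: the function names and the PD_CTL variable are hypothetical, the PD port-forward from the earlier step is still running, the output of pd-ctl -d store <ID> contains a "state_name" field, and PD, TiKV, and TiDB Pods carry the label app.kubernetes.io/component with the values pd, tikv, and tidb.

```shell
# Command used to reach PD; override PD_CTL if pd-ctl is not on your PATH.
PD_CTL="${PD_CTL:-pd-ctl}"

# Poll a store until its state is Tombstone.
# Arguments: store ID, seconds between checks (default 10),
# maximum number of checks (default 360).
wait_for_tombstone() {
  store_id="$1"
  interval="${2:-10}"
  max_tries="${3:-360}"
  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    # Assumption: the JSON output contains "state_name": "Tombstone".
    if "$PD_CTL" -d store "$store_id" | grep -q '"state_name": *"Tombstone"'; then
      echo "store ${store_id} is Tombstone"
      return 0
    fi
    tries=$((tries + 1))
    sleep "$interval"
  done
  echo "store ${store_id} did not reach Tombstone after ${max_tries} checks" >&2
  return 1
}

# Print the PD, TiKV, and TiDB Pods still scheduled on the given node.
# Assumption: the Pods carry the app.kubernetes.io/component label.
tidb_cluster_pods_on_node() {
  kubectl get pod --all-namespaces \
    -l 'app.kubernetes.io/component in (pd,tikv,tidb)' \
    --field-selector "spec.nodeName=$1" \
    -o name
}

# Succeed only if no PD, TiKV, or TiDB Pod remains on the node.
node_is_clear() {
  remaining="$(tidb_cluster_pods_on_node "$1")" || return 1
  [ -z "$remaining" ]
}
```

For example, run wait_for_tombstone <ID> after pd-ctl -d store delete <ID>, and run node_is_clear <node-name> before you delete the node. Migrating the data off a store can take a long time, so adjust the interval and the number of checks to the size of your cluster.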