Maintain Kubernetes Nodes that Hold the TiDB Cluster
TiDB is a highly available database that can run smoothly when some of its nodes go offline. For this reason, you can safely shut down and maintain the underlying Kubernetes nodes without interrupting TiDB's service. Specifically, you need to adopt different maintenance strategies when handling nodes that hold PD, TiKV, and TiDB instances, because these components have different characteristics.
This document introduces how to perform a temporary or long-term maintenance task for the Kubernetes nodes.
Prerequisites

Before you begin, make sure that the command-line tools used in this document are installed and can access your cluster: `kubectl`, `tkctl`, `pd-ctl`, and `jq`.
Maintain nodes that hold PD and TiDB instances
Migrating PD and TiDB instances from a node is relatively fast, so you can proactively evict the instances to other nodes and perform maintenance on the desired node:
Check whether there is any TiKV instance on the node to be maintained:

```shell
kubectl get pod --all-namespaces -o wide | grep <node-name>
```

If any TiKV instance is found, see Maintain nodes that hold TiKV instances first.
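For example, the output might contain a line like the following when a TiKV Pod runs on the node (the namespace `tidb-cluster` and Pod name `basic-tikv-0` are hypothetical):

```shell
# Hypothetical grep match: a TiKV Pod is running on this node
tidb-cluster   basic-tikv-0   1/1   Running   0   3d   10.233.70.10   <node-name>   <none>   <none>
```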
Use the `kubectl cordon` command to prevent new Pods from being scheduled to the node to be maintained:

```shell
kubectl cordon <node-name>
```

Use the `kubectl drain` command to migrate the database instances on the maintenance node to other nodes:

```shell
kubectl drain <node-name> --ignore-daemonsets --delete-local-data
```

After running this command, TiDB instances on this node are automatically migrated to other available nodes, and any PD instance on this node triggers the auto-failover mechanism after five minutes, which supplements a replacement PD instance on another node.
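If you want to observe the PD failover, you can watch the PD Pods until the replacement instance is running (a sketch; the `app.kubernetes.io/component` label follows TiDB Operator's labeling conventions):

```shell
# Watch PD Pods in the cluster's namespace; after about five minutes,
# a replacement PD Pod should be scheduled onto another node
watch kubectl get -n <namespace> pod -l app.kubernetes.io/component=pd -o wide
```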
At this time, if you want to take this Kubernetes node offline, you can delete it by running:

```shell
kubectl delete node <node-name>
```

If you want to recover the Kubernetes node instead, you first need to make sure that it is back in a healthy state:

```shell
watch kubectl get node <node-name>
```

After the node goes into the `Ready` state, you can proceed with the following operations.

Use `kubectl uncordon` to lift the scheduling restriction on the node:

```shell
kubectl uncordon <node-name>
```

Check whether all Pods are back to normal and running:

```shell
watch kubectl get -n <namespace> pod -o wide
```

Or:

```shell
watch tkctl get all
```

When you have confirmed that all Pods are running normally, the maintenance task is finished.
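When checking the Pods above, if the namespace also contains Pods from other workloads, you can narrow the watch to a single TidbCluster by its instance label (a sketch; the `app.kubernetes.io/instance` label follows TiDB Operator's labeling conventions, and the cluster name `basic` is a placeholder):

```shell
# Watch only the Pods that belong to the TidbCluster named "basic"
watch kubectl get -n <namespace> pod -l app.kubernetes.io/instance=basic -o wide
```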
Maintain nodes that hold TiKV instances
Migrating TiKV instances is relatively slow and might lead to unwanted data migration load on the cluster. For this reason, you need to choose different strategies as needed prior to maintaining the nodes that hold TiKV instances:
- If the node can be recovered in a short time, you do not need to migrate the TiKV instances: after the maintenance, they recover in place on the same node.
- If the node cannot be recovered in a short time, you need to plan the migration of the TiKV instances to other nodes.
Maintain a node that can be recovered in a short term
For short-term maintenance, you can increase the TiKV instance downtime that the cluster tolerates by adjusting the `max-store-down-time` configuration of the PD cluster. If you finish the maintenance and recover the Kubernetes node within this time, all TiKV instances on the node are automatically recovered.

```shell
# Forward the PD service port locally, then raise the downtime threshold
kubectl port-forward svc/<CLUSTER_NAME>-pd 2379:2379 &
pd-ctl -d config set max-store-down-time 10m
```
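To double-check that the new value has taken effect, you can print the PD configuration and look for `max-store-down-time` (this assumes the `port-forward` command above is still running in the background):

```shell
# Print the current PD scheduling configuration and filter for the threshold
pd-ctl -d config show | grep max-store-down-time
```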
After configuring `max-store-down-time` to an appropriate value, the follow-up operations are the same as those in Maintain nodes that hold PD and TiDB instances.
Maintain a node that cannot be recovered in a short term
For maintenance on a node that cannot be recovered in a short term (for example, when a node has to go offline for a long time), you need to use `pd-ctl` to proactively notify the TiDB cluster to take the corresponding TiKV instances offline, and then manually unbind those instances from the node.
Use `kubectl cordon` to prevent new Pods from being scheduled to the node to be maintained:

```shell
kubectl cordon <node-name>
```

Check the TiKV instances on the maintenance node:

```shell
tkctl get -A tikv | grep <node-name>
```

Use `pd-ctl` to proactively take the TiKV instance offline. First, check the `store-id` of the TiKV instance:

```shell
kubectl get tc <CLUSTER_NAME> -ojson | jq '.status.tikv.stores | .[] | select ( .podName == "<POD_NAME>" ) | .id'
```
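For example, with a hypothetical cluster named `basic` and a TiKV Pod named `basic-tikv-1`, the command and its output might look like this:

```shell
kubectl get tc basic -ojson | jq '.status.tikv.stores | .[] | select ( .podName == "basic-tikv-1" ) | .id'
# Hypothetical output: the store ID to pass to pd-ctl in the next step
"4"
```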
Make the TiKV instance offline:

```shell
kubectl port-forward svc/<CLUSTER_NAME>-pd 2379:2379 &
pd-ctl -d store delete <ID>
```

Wait for the store to change its state (the `state_name` field) to `Tombstone`:

```shell
watch pd-ctl -d store <ID>
```
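If you prefer not to read the full store output, you can filter the state directly (a small convenience sketch; it assumes the JSON printed by `pd-ctl store` nests the state under `.store.state_name`):

```shell
# Print only the store state; it changes to "Tombstone" once all
# Region replicas have been migrated off the store
watch 'pd-ctl -d store <ID> | jq ".store.state_name"'
```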
Unbind the TiKV instance from the local drive of the node. First, get the `PersistentVolumeClaim` used by the Pod:

```shell
kubectl get -n <namespace> pod <pod_name> -ojson | jq '.spec.volumes | .[] | select (.name == "tikv") | .persistentVolumeClaim.claimName'
```
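For TiKV Pods managed by TiDB Operator, the claim name typically follows the `tikv-<pod_name>` pattern, so for a hypothetical Pod named `basic-tikv-1` the output might be:

```shell
# Hypothetical output of the jq query above
"tikv-basic-tikv-1"
```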
Delete the `PersistentVolumeClaim`:

```shell
kubectl delete -n <namespace> pvc <pvc_name>
```

Delete the TiKV instance:

```shell
kubectl delete -n <namespace> pod <pod_name>
```

Check whether the TiKV instance is normally scheduled to another node:

```shell
watch kubectl -n <namespace> get pod -o wide
```

If there are more TiKV instances on the maintenance node, repeat the above steps until all of them are migrated to other nodes.
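To see at a glance which TiKV Pods still remain on the node, you can filter by the component label (a sketch; the `app.kubernetes.io/component` label follows TiDB Operator's labeling conventions):

```shell
# List remaining TiKV Pods on the node; repeat the offline steps for each one
kubectl get pod --all-namespaces -o wide -l app.kubernetes.io/component=tikv | grep <node-name>
```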
After you make sure that there are no more TiKV instances on the node, you can evict the remaining instances from it:

```shell
kubectl drain <node-name> --ignore-daemonsets --delete-local-data
```

Confirm again that no TiKV, TiDB, or PD instances are still running on this node:

```shell
kubectl get pod --all-namespaces | grep <node-name>
```

(Optional) If the node is going offline for a long time, it is recommended to delete it from the Kubernetes cluster:

```shell
kubectl delete node <node-name>
```
Now, you have successfully finished the maintenance task for the node.