Published on 04/08/2018
Last updated on 03/21/2024
Draining Kubernetes Nodes
Cloud cost management series:
- Overspending in the cloud
- Managing spot instance clusters on Kubernetes with Hollowtrees
- Monitor AWS spot instance terminations
- Diversifying AWS auto-scaling groups
- Draining Kubernetes nodes
- Cluster recommender
- Cloud instance type and price information as a service
Kubernetes was designed to tolerate worker node failures. If a node goes missing because of a hardware problem, a cloud infrastructure problem, or because Kubernetes simply stops receiving heartbeat messages from it for any reason, the Kubernetes control plane is clever enough to handle it. But that doesn't mean it will be able to solve every conceivable problem.
A common misconception is as follows: "If there are enough free resources, Kubernetes will re-schedule all the pods from the lost node to another, so there's absolutely no reason to worry about losing a node. Everything will be re-scheduled; the autoscaler will add a new node if necessary; life goes on."
To topple this misconception, let's take a look at what disruptions really mean, and how the kubectl drain command works: what it does, and how it operates so gracefully. The cluster autoscaler uses similar logic to scale a cluster, and our Pipeline Platform has a similar feature that handles spot instance terminations gracefully via Hollowtrees.
Pod disruptions
Pods disappear from clusters for one of two reasons:
- there was some kind of unavoidable hardware, software or user error
- the pod was deleted voluntarily, because someone wanted to delete its deployment, or wanted to remove the VM that held the pod
These are involuntary and voluntary disruptions, respectively. We'll use the kubectl drain command as a means of exploring voluntary disruptions, and note the ways in which handling involuntary disruptions is less graceful.
The kubectl drain command
According to the Kubernetes documentation, the drain command can be used to "safely evict all of your pods from a node before you perform maintenance on the node," and "safe evictions allow the pod’s containers to gracefully terminate and will respect the PodDisruptionBudgets you have specified". So if losing a node is supposedly not a problem, why do we need this safe eviction, and how exactly does it work?
From a bird's eye view, drain does two things:
1. cordons the node
This part is quite simple: cordoning a node marks it unschedulable, so new pods can no longer be scheduled to it. If we know in advance that a node will be taken from the cluster (because of maintenance, like a kernel update, or because we know the cluster will be scaled in), cordoning is a good first step. We don't want new pods scheduled on this node and then taken away after a few seconds. For example, if we know two minutes in advance that a spot instance on AWS will be terminated, new pods shouldn't be scheduled on that node, and we can work towards gracefully rescheduling the pods already running there. On the API level, cordoning means patching the node with node.Spec.Unschedulable=true.
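As a minimal sketch of what such a patch could look like on the wire, the snippet below builds a merge-patch body that flips spec.unschedulable. The nodePatch and nodeSpecPatch types and the cordonPatch helper are illustrative names, not part of any Kubernetes library, and the snippet only builds the body, it does not talk to a cluster.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodePatch mirrors only the part of the Node object that cordoning touches.
// Hypothetical helper type for this sketch.
type nodePatch struct {
	Spec nodeSpecPatch `json:"spec"`
}

type nodeSpecPatch struct {
	Unschedulable bool `json:"unschedulable"`
}

// cordonPatch builds a patch body that sets (cordon) or clears (uncordon)
// the unschedulable flag of a node.
func cordonPatch(unschedulable bool) ([]byte, error) {
	return json.Marshal(nodePatch{Spec: nodeSpecPatch{Unschedulable: unschedulable}})
}

func main() {
	body, err := cordonPatch(true)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // {"spec":{"unschedulable":true}}
}
```

With client-go, a body like this would typically be sent through the node client's Patch call, and the same helper with false produces the body for uncordoning.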
2. evicts or deletes the pods
After the node is made unschedulable, the drain command will try to evict the pods that are already running on that node. If eviction is supported on the cluster (from Kubernetes version 1.7), the drain command uses the Eviction API, which takes disruption budgets into account; if it's not supported, it simply deletes the pods on the node. Let's look into these options next.
Deleting pods on a node
Let's start with something simple: the case when the Eviction API cannot be used. This is how it looks in Go code:
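// Signature of client-go before v0.18; later releases take a context.Context
// as the first argument and DeleteOptions by value.
err := client.CoreV1().Pods(pod.Namespace).Delete(pod.Name, &metav1.DeleteOptions{
	GracePeriodSeconds: &gracePeriodSeconds,
})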
Besides the Delete method of the K8S client, the first thing that catches the eye is GracePeriodSeconds. As always, Kubernetes' excellent documentation helps explain a few things:
"Because pods represent running processes on nodes in the cluster, it is important to allow those processes to gracefully terminate when they are no longer needed (vs being violently killed with a KILL signal and having no chance to clean up)."
Cleaning up can mean a lot of things, like completing any outstanding HTTP requests, making sure that data is flushed properly when writing a file, finishing a batch job, rolling back transactions, or saving state to external storage like S3. There is a timeout that bounds this clean up, called the grace period. Note that a delete call on a pod returns asynchronously, so you should always poll that pod and wait until the deletion finishes or the grace period ends. Check the Kubernetes documentation to learn more.
If the node is disrupted involuntarily, the processes in the pods will have no chance to exit gracefully. So let's go back to our example of spot instance termination: if all we can do in the two minutes before the VM is terminated is cordon the node and call Delete on the pods with a grace period of about two minutes, we're still better off than if we just let our instance die. But Kubernetes provides us with some better options.
Evicting pods from a node
From Kubernetes 1.7 onward, there's been an option to use the Eviction API instead of directly deleting pods. First, let's look at the Go code again and note how it differs from the Go code above. It's easy to see that this is a different API call, but we still have to provide pod.Namespace, pod.Name and DeleteOptions along with the grace period. And though it looks very similar at a glance, we also have to add some meta info (EvictionKind and APIVersion).
eviction := &policyv1beta1.Eviction{
	TypeMeta: metav1.TypeMeta{
		APIVersion: policyGroupVersion,
		Kind:       EvictionKind,
	},
	ObjectMeta: metav1.ObjectMeta{
		Name:      pod.Name,
		Namespace: pod.Namespace,
	},
	DeleteOptions: &metav1.DeleteOptions{
		GracePeriodSeconds: &gracePeriodSeconds,
	},
}
client.PolicyV1beta1().Evictions(eviction.Namespace).Evict(eviction)
The difference is that the Eviction API respects a Kubernetes resource, the poddisruptionbudget, or pdb, that can be attached to a deployment via labels. According to the documentation:
"A PDB limits the number of pods of a replicated application that are down simultaneously from voluntary disruptions."
The following simplified example of a PDB specifies that the available pods of the nginx app cannot fall below 70% at any time (see more examples here):
kubectl create pdb my-pdb --selector=app=nginx --min-available=70%
When an eviction is granted, the pod is deleted just as it would be by a direct call to Delete. If the eviction is not granted because a PDB will not allow it, then the API returns 429 Too Many Requests. See more details here.
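To make the budget arithmetic concrete, here is a small sketch of how a percentage-based minimum translates into the number of evictions that can be granted at a given moment. It assumes the percentage is rounded up to a whole pod. The function names are illustrative and are not taken from the Kubernetes code base.

```go
package main

import "fmt"

// requiredAvailable converts a percentage min-available value into a pod
// count, rounding up (assumption: a fractional pod counts as a whole one).
func requiredAvailable(expectedPods, minAvailablePercent int) int {
	return (expectedPods*minAvailablePercent + 99) / 100
}

// disruptionsAllowed reports how many healthy pods may be evicted right now
// without violating the budget. It never returns a negative number.
func disruptionsAllowed(healthyPods, expectedPods, minAvailablePercent int) int {
	allowed := healthyPods - requiredAvailable(expectedPods, minAvailablePercent)
	if allowed < 0 {
		return 0
	}
	return allowed
}

func main() {
	fmt.Println(disruptionsAllowed(2, 2, 70))   // 0
	fmt.Println(disruptionsAllowed(10, 10, 70)) // 3
}
```

With ten replicas the budget leaves room for three evictions, while with only two replicas it leaves room for none, so every eviction request for that app would be answered with Too Many Requests.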
If you call the drain command and it cannot evict a pod because of a PDB, it sleeps for five seconds and retries. You can try this by creating a basic nginx deployment with two replicas, adding the pdb above, finding a node on which one of the pods is scheduled, and trying to drain it with this command (--v=6 is all that's necessary to see the Too Many Requests messages that are returned):
kubectl --v=6 drain <node-name> --force
This retry loop can end in deadlocks, in which drain will wait forever. Usually these are misconfigurations, like in my very simple example, where neither of the two nginx replicas can be evicted: 70% of two replicas rounds up to two, so the budget leaves no room for any disruption. But deadlocks may occur in real-world situations as well. The Eviction API won't start new replicas on other nodes or do any other magic; it simply returns Too Many Requests. To handle these cases, you must intervene manually (e.g. by temporarily adding a new replica), or write your code in a way that detects them.
Special pods to delete
Let's complicate things even further. There are some pods that can't simply be deleted or evicted. The drain command uses four different filters when checking for pods to delete. A filter can either reject the drain, or let the drain move on without touching certain pods:
DaemonSet filter
The DaemonSet controller ignores unschedulable markings, so a deleted pod that belongs to a DaemonSet will be immediately replaced. If there are pods belonging to a DaemonSet on the node, the drain command proceeds only if the --ignore-daemonsets flag is set, but even then it won't delete those pods, because the DaemonSet controller would just recreate them. Usually it doesn't cause problems if a DaemonSet pod goes away with its node (think of node exporters, log collection, storage daemons, etc.), so in most cases this flag can be set.
Mirror pods filter
drain uses the Kubernetes API server to manage pods and other resources, and mirror pods are merely the corresponding read-only API resources of static pods: pods that are managed directly by the Kubelet, without the API server managing them. Mirror pods are visible from the API server but cannot be controlled through it, so drain won't delete these either.
Unreplicated filter
If a pod has no controller it cannot be easily deleted, because it won't be rescheduled to a new node. It's usually advised that you not have pods without controllers (not managed by a ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet), but if you still have pods like this, and want to write code that handles voluntary node disruptions, it's up to the implementation whether it deletes these pods or fails. The drain command lets the user decide: when --force is set, unreplicated pods will be deleted (or evicted); if it's not set, drain will fail.
When using Go, the k8s apimachinery package has a util function that returns the controller for a pod, or nil if there's no controller for it: metav1.GetControllerOf(&pod).
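The idea behind such a lookup is simple: walk the pod's owner references and return the one that is flagged as the controller. The sketch below shows that idea with a simplified, hypothetical ownerReference type and an illustrative owner name. It is not the apimachinery implementation.

```go
package main

import "fmt"

// ownerReference is a simplified, hypothetical copy of the fields that
// matter for this lookup.
type ownerReference struct {
	Kind       string
	Name       string
	Controller *bool
}

// controllerOf returns the owner flagged as controller, or nil if the
// object has no controlling owner.
func controllerOf(owners []ownerReference) *ownerReference {
	for i := range owners {
		if owners[i].Controller != nil && *owners[i].Controller {
			return &owners[i]
		}
	}
	return nil
}

func main() {
	yes := true
	managed := []ownerReference{{Kind: "ReplicaSet", Name: "nginx-example", Controller: &yes}}
	fmt.Println(controllerOf(managed).Kind) // ReplicaSet
	fmt.Println(controllerOf(nil) == nil)   // true
}
```

A nil result is exactly the "unreplicated" case described above, where the caller has to decide between failing and force deleting.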
LocalStorage filter
This filter checks whether a pod uses an emptyDir volume. If the pod uses emptyDir to store local data, it may not be safe to delete, because when a pod is removed from a node the data in the emptyDir is deleted with it. Just like with the unreplicated filter, it is up to the implementation to decide what to do with these pods. drain provides a switch for this as well; if --delete-local-data is set (newer kubectl releases call this flag --delete-emptydir-data), drain will proceed even if there are pods using emptyDir, and will delete the pods and therefore the local data as well.
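Put together, the four filters amount to a per-pod decision with three outcomes: delete the pod, skip it, or reject the whole drain. The following is a rough sketch of that decision logic as described above, not the kubectl source. The pod and options types are hypothetical, flattened views invented for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// pod is a hypothetical, flattened view of the fields the filters look at.
type pod struct {
	Name           string
	ControllerKind string // empty when the pod has no controller
	Mirror         bool   // true for the API mirror of a static pod
	UsesEmptyDir   bool
}

// options mirrors the drain switches discussed above.
type options struct {
	IgnoreDaemonSets bool
	Force            bool
	DeleteLocalData  bool
}

// shouldDelete reports whether the pod should be deleted or evicted.
// A non-nil error means the whole drain should be rejected.
func shouldDelete(p pod, o options) (bool, error) {
	switch {
	case p.Mirror:
		return false, nil // visible in the API, but owned by the Kubelet
	case p.ControllerKind == "DaemonSet":
		if !o.IgnoreDaemonSets {
			return false, errors.New(p.Name + ": managed by a DaemonSet")
		}
		return false, nil // would be recreated right away, so skip it
	case p.ControllerKind == "" && !o.Force:
		return false, errors.New(p.Name + ": no controller")
	case p.UsesEmptyDir && !o.DeleteLocalData:
		return false, errors.New(p.Name + ": uses emptyDir")
	}
	return true, nil
}

func main() {
	opts := options{IgnoreDaemonSets: true}
	fmt.Println(shouldDelete(pod{Name: "web", ControllerKind: "ReplicaSet"}, opts))
	fmt.Println(shouldDelete(pod{Name: "logs", ControllerKind: "DaemonSet"}, opts))
	fmt.Println(shouldDelete(pod{Name: "adhoc"}, opts))
}
```

A caller would run this over every pod on the node first, and only start deleting or evicting if none of the pods produced an error.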
Spot instance termination
We use drain-like logic to handle AWS spot instance terminations. We monitor AWS spot instance terminations with Prometheus, and have Hollowtrees configured to call our Kubernetes action plugin to drain the node. AWS gives notice two minutes in advance, which is usually enough time to gracefully delete the pods while also watching for PodDisruptionBudgets. Our action plugin uses logic very similar to the drain command, but ignores DaemonSets and mirror pods, and force deletes unreplicated and emptyDir pods by default.
If you'd like to learn more about Banzai Cloud, check out our other posts in the blog, and the Pipeline and Hollowtrees projects.