
Limits of a PodDisruptionBudget


What is a disruption?

the action of preventing something, especially a system, process, or event, from continuing as usual or as expected.

https://dictionary.cambridge.org/de/worterbuch/englisch/disruption

Kubernetes differentiates between 2 kinds of disruptions:

  • involuntary (outside the control of Kubernetes)
    • hardware failure
    • kernel panic
    • etc.
  • voluntary (controlled using Kubernetes mechanisms)
    • draining a node (repair, upgrade, scale down)
    • priority-based eviction (to allow other, higher-priority pods to be scheduled instead)
    • Deployment, StatefulSet, DaemonSet, etc. updates
    • deleting a pod

Kubernetes will try to move pods from unhealthy to healthy nodes when encountering involuntary disruptions. For voluntary disruptions, Kubernetes provides APIs to control how many pods for a given selector can be disrupted simultaneously.

Eviction requests

An eviction is the process of moving a pod off a node, which is done by deleting the pod object. If the pod is managed by a controller object (DaemonSet, ReplicaSet, Deployment, StatefulSet), Kubernetes will recreate it.

In Kubernetes, there are two endpoints to delete a pod:

DELETE /api/v1/namespaces/«NAMESPACE»/pods/«POD»

The well-known delete endpoint, like the one found in most REST APIs.

DELETE https://192.168.2.10:6443/api/v1/namespaces/default/pods/counter

This request is also what is executed when running kubectl delete pod counter. This endpoint, however, has no safeguards. For example, you can easily cause an outage by running kubectl delete pod --all --all-namespaces (which deletes all pods in your cluster).
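For illustration, the same plain delete can be issued from Go with client-go. This is a minimal sketch, not the post's original code; the kubeconfig path is an assumption, and the pod matches the counter example above:

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Load a kubeconfig; the path is an assumption for this sketch.
    cfg, err := clientcmd.BuildConfigFromFlags("", "/home/user/.kube/config")
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // Plain DELETE on the pod resource - no PodDisruptionBudget is consulted.
    err = client.CoreV1().Pods("default").Delete(context.Background(), "counter", metav1.DeleteOptions{})
    if err != nil {
        panic(err)
    }
    fmt.Println("pod deleted")
}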

POST /api/v1/namespaces/«NAMESPACE»/pods/«POD»/eviction

The pod resource has a subresource that offers a safer way of deleting an instance: the eviction request.

POST https://192.168.2.10:6443/api/v1/namespaces/default/pods/counter/eviction
{
  "kind": "Eviction",
  "apiVersion": "policy/v1",
  "metadata": {
    "name": "counter",
    "namespace": "default",
    "creationTimestamp": null
  },
  "deleteOptions": {}
}

Unlike the plain delete request, the Kubernetes API server verifies whether the eviction would violate a PodDisruptionBudget. If it would, the API server denies the request with status code 429 (Too Many Requests). If not, the API server deletes the pod. This deletion does not trigger an additional DELETE call from the client; it happens internally in the Kubernetes API server.

Relevant Kubernetes code: https://github.com/kubernetes/kubernetes/blob/10b07085f8cc5a5a6dd6d6e6a48324b89fcf8770/pkg/registry/core/pod/storage/eviction.go#L119
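The eviction above can also be created with client-go via the typed Evictions client. A minimal sketch (kubeconfig path assumed); apierrors.IsTooManyRequests corresponds to the 429 described above:

package main

import (
    "context"
    "fmt"

    policyv1 "k8s.io/api/policy/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", "/home/user/.kube/config")
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // POST to the /eviction subresource of the pod.
    eviction := &policyv1.Eviction{
        ObjectMeta: metav1.ObjectMeta{Name: "counter", Namespace: "default"},
    }
    err = client.PolicyV1().Evictions("default").Evict(context.Background(), eviction)
    switch {
    case err == nil:
        fmt.Println("eviction accepted, pod will be deleted")
    case apierrors.IsTooManyRequests(err):
        // 429 - the eviction would violate a PodDisruptionBudget.
        fmt.Println("eviction denied by a PodDisruptionBudget:", err)
    default:
        panic(err)
    }
}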

PodDisruptionBudget

The PodDisruptionBudget is an API object that defines how many voluntary disruptions are allowed for a set of pods.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zookeeper
spec:
  # An eviction is allowed if at most "maxUnavailable" pods selected by
  # "selector" are unavailable after the eviction, i.e. even in absence of
  # the evicted pod. For example, one can prevent all voluntary evictions
  # by specifying 0. This is a mutually exclusive setting with "minAvailable".
  maxUnavailable: 1
  # An eviction is allowed if at least "minAvailable" pods selected by
  # "selector" will still be available after the eviction, i.e. even in the
  # absence of the evicted pod. So for example you can prevent all voluntary
  # evictions by specifying "100%". Commented out here because only one of
  # "minAvailable" and "maxUnavailable" may be set.
  # minAvailable: 0
  # Label query over pods whose evictions are managed by the disruption
  # budget.
  selector:
    matchLabels:
      app: zookeeper
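The budget the API server enforces is tracked in the object's status. A small client-go sketch for inspecting it (namespace and PDB name taken from the example above, kubeconfig path assumed):

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", "/home/user/.kube/config")
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    pdb, err := client.PolicyV1().PodDisruptionBudgets("default").Get(context.Background(), "zookeeper", metav1.GetOptions{})
    if err != nil {
        panic(err)
    }
    // DisruptionsAllowed is what the /eviction endpoint checks before
    // letting a voluntary disruption through.
    fmt.Printf("healthy: %d/%d, disruptions allowed: %d\n",
        pdb.Status.CurrentHealthy, pdb.Status.DesiredHealthy, pdb.Status.DisruptionsAllowed)
}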

Deleting a pod and evicting a pod

PodDisruptionBudgets are not respected on regular pod deletions. That is, kubectl delete pod foo will not respect a PodDisruptionBudget. The same applies when updating a Deployment/ReplicaSet: the ReplicaSet controller issues plain DELETE calls for its pods.

kubectl drain node-1 uses the /eviction subresource. kubectl delete pod uses the DELETE request.

In what situations will Kubernetes respect the PodDisruptionBudget?

Node upgrades? Yes.

When performing node upgrades in Kubernetes, the recommended approach is to provision new nodes and then move workloads from the old nodes to the new ones. kubectl drain «node» is used to move workloads during upgrades. It first cordons the node (preventing new workloads from being scheduled there) and then creates eviction requests for all pods running on that node. If a pod cannot be evicted due to a PodDisruptionBudget, it retries indefinitely or until --timeout has been reached.

When using a managed Kubernetes offering, the cloud provider will typically automate this procedure.

Priority-based preemption? Yes, but not guaranteed.

Pod priority and preemption, configured via PriorityClasses, graduated to GA in Kubernetes v1.14. A PriorityClass lets you specify a priority on pods; the priority indicates the importance of a pod relative to other pods.
When a PriorityClass specifies preemptionPolicy: PreemptLowerPriority (the default), the scheduler will preempt lower-priority pods to make room for higher-priority pods. However, this preemption logic takes the PodDisruptionBudget into account only on a best-effort basis.

The scheduler tries to find victims whose PDB are not violated by preemption, but if no such victims are found, preemption will still happen, and lower priority Pods will be removed despite their PDBs being violated.

https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#poddisruptionbudget-is-supported-but-not-guaranteed

Docs: https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/

Node-pressure “eviction”? No.

The kubelet considers a set of signals that determine the health state of a node. For a subset of those signals, the kubelet proactively starts failing pods once a signal crosses a certain threshold. For example, the kubelet starts failing pods when the node's root volume has no space left.

The kubelet does not respect any PodDisruptionBudget. Pods are transitioned into the Failed state, no DELETE call is made, and there is no mechanism to hook into this logic.

Docs: https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/

Disparity between Ready state and an application’s cluster state

For stateless applications, the PodDisruptionBudget works fine. However, for stateful applications, it's more complicated. Take Apache Kafka as an example. Kafka is a distributed commit log, which utilizes partitioning and multiple leaders. Clients discover available partitions and their corresponding leaders and connect directly to the leader of a partition for read/write operations. This also means a broker might be a leader for partition A while still catching up on partition B. One could say it's 50% ready. As the Ready condition in Kubernetes is used to express whether a pod can receive traffic, Kafka is usually configured to report ready as soon as the TCP port is available.

Example:

PodDisruptionBudget:
  maxUnavailable: 1
Kafka:
  Brokers: 3
  Replication factor: 3
  Min in-sync replicas: 2
  Acks (client): all

We start with a healthy cluster: 3 brokers and 3 partitions, replicated across all brokers.

State 1
Kafka-0 gets evicted & recreated. Due to the restart, leadership for partition-a moved to another broker (Kafka-1). The restarted Kafka-0 starts catching up on replication; partition-c is a busy partition and is not fully replicated yet.
State 2
Now, Kafka-0 is ready according to Kubernetes, which means we have 3 ready brokers—allowing 1 to become unavailable according to the PodDisruptionBudget.

Kafka-1 gets evicted.

State 3
Leadership for partition-a & partition-b moved to Kafka-0. Kafka-0 still hasn't caught up on partition-c, and neither has the freshly restarted Kafka-1. Only one replica is in sync, which violates the Min in-sync replicas: 2 setting. Clients now receive errors because Acks: all cannot be satisfied.

Once Kafka-1 reports ready again, Kubernetes will also allow Kafka-2 to be evicted, which will take partition-c offline entirely, breaking even clients that use acks: 1.

An article explaining the Kafka logic: https://betterprogramming.pub/kafka-acks-explained-c0515b3b707e

In summary: Kubernetes assumes the cluster state can be derived from the readiness checks of all pods in a set, which does not work well with stateful applications that use some mechanism of sharding.

There's also a GitHub issue for this.

Why report ready when a member is not fully in-sync?

Partitions can be moved between brokers, which causes partitions to be under-replicated for some time. A ReadinessProbe that checked for full replication would then mark the broker as not ready, breaking communication for clients. If this broker were a leader, all partitions the broker leads would be down.

How to solve this today

Let a controller update the PodDisruptionBudget based on the cluster state

A controller could continuously query Kafka's cluster state and update the spec of the PodDisruptionBudget. For example, as long as the Kafka cluster reports under-replicated partitions, the controller would set PDB.spec.maxUnavailable: 0. Once the cluster is stable, the controller would set PDB.spec.maxUnavailable: 1.

Downside: racy - there is a delay between a change in cluster state and it being reflected in the PodDisruptionBudget.
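A rough sketch of such a controller loop, assuming a hypothetical underReplicatedPartitions() helper that queries Kafka and a PDB named kafka in the default namespace (illustrative only, not a production controller):

package main

import (
    "context"
    "fmt"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

// underReplicatedPartitions is a hypothetical helper that would query the
// Kafka cluster (e.g. its admin API or metrics) for under-replicated partitions.
func underReplicatedPartitions(ctx context.Context) (int, error) {
    return 0, nil // placeholder
}

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", "/home/user/.kube/config") // assumed path
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)
    ctx := context.Background()

    for {
        urp, err := underReplicatedPartitions(ctx)
        if err != nil {
            fmt.Println("failed to query Kafka:", err)
            time.Sleep(10 * time.Second)
            continue
        }

        // Block all voluntary evictions while partitions are under-replicated,
        // allow one disruption once the cluster is stable again.
        allowed := intstr.FromInt(1)
        if urp > 0 {
            allowed = intstr.FromInt(0)
        }

        pdb, err := client.PolicyV1().PodDisruptionBudgets("default").Get(ctx, "kafka", metav1.GetOptions{})
        if err == nil {
            pdb.Spec.MaxUnavailable = &allowed
            if _, err := client.PolicyV1().PodDisruptionBudgets("default").Update(ctx, pdb, metav1.UpdateOptions{}); err != nil {
                fmt.Println("failed to update PDB:", err)
            }
        }

        // Racy by design: the cluster state can change between this check
        // and the next eviction request.
        time.Sleep(10 * time.Second)
    }
}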

Admission-controller validating /eviction requests

An admission controller could validate every /eviction request by querying the cluster state before allowing it. This could be implemented generically by referencing an HTTP endpoint on the PodDisruptionBudget (via an annotation) that the admission controller calls.

Downside: additional latency on the eviction request, and the endpoint would need to respond within 30s (the maximum admission-controller timeout).
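A minimal sketch of such a validating webhook handler; clusterIsStable() is a hypothetical stand-in for querying the application, and a real deployment additionally needs TLS plus a ValidatingWebhookConfiguration that matches the pods/eviction subresource:

package main

import (
    "encoding/json"
    "fmt"
    "net/http"

    admissionv1 "k8s.io/api/admission/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// clusterIsStable is a hypothetical check, e.g. "no under-replicated
// partitions" for a Kafka cluster.
func clusterIsStable() bool { return true }

func handleEviction(w http.ResponseWriter, r *http.Request) {
    var review admissionv1.AdmissionReview
    if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
        http.Error(w, "invalid AdmissionReview", http.StatusBadRequest)
        return
    }

    response := &admissionv1.AdmissionResponse{
        UID:     review.Request.UID,
        Allowed: true,
    }
    if !clusterIsStable() {
        // Deny the eviction while the application is not fully in sync.
        response.Allowed = false
        response.Result = &metav1.Status{
            Message: "cluster reports under-replicated partitions, eviction denied",
        }
    }

    review.Response = response
    if err := json.NewEncoder(w).Encode(&review); err != nil {
        fmt.Println("failed to write response:", err)
    }
}

func main() {
    http.HandleFunc("/validate-eviction", handleEviction)
    // TLS setup omitted for brevity; admission webhooks must be served over HTTPS.
    panic(http.ListenAndServe(":8443", nil))
}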