Introduction:

In a Kubernetes cluster, there may be scenarios where the available resources are insufficient to accommodate all the pods that need to be scheduled. In such cases, Kubernetes provides a mechanism to prioritize the scheduling of certain pods over others, ensuring that the critical workloads are given precedence. This mechanism is known as Pod Priority and Preemption.

The scheduler's job includes ensuring that critical workloads get the resources they need, even when resources are scarce. Pod priority and preemption play a central role in this: priority determines the order in which pending pods are considered, and if a pod cannot be scheduled, the Kubernetes scheduler tries to preempt (evict) lower-priority pods to make scheduling of the pending pod possible.

Pod Priority:

Pod priority is a feature that allows you to specify the importance of a pod relative to other pods in the cluster. This is achieved by assigning a numerical value, called a priority, to each pod. The higher the priority value, the more important the pod is considered by the Kubernetes scheduler.

What are Pod Priority Classes in Kubernetes?

A PriorityClass is a non-namespaced (cluster-scoped) object that defines a mapping from a priority class name to the integer value of the priority. To assign a priority to a pod, you first create one or more PriorityClasses and then reference one of them by name in the pod's specification. A common convention is to define three tiers:

  1. High Priority: This priority class is reserved for critical workloads that must be prioritized above all others. Examples include database servers, critical backend services, or real-time processing applications.
  2. Medium Priority: Workloads with moderate importance fall into this category. These may include batch processing jobs, non-critical background tasks, or secondary services.
  3. Low Priority: Least critical workloads are assigned to this priority class. Examples include development and testing environments, logging services, or non-essential batch jobs.

Step-1: Create PriorityClass 

Here’s an example of a PriorityClass YAML manifest:

high-priority-priorityclass.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for high-priority, critical workloads only."

Explanation of the above PriorityClass:

  • apiVersion: The API version of the PriorityClass resource. Currently, the only valid value is scheduling.k8s.io/v1.
  • kind: The kind of resource, which is PriorityClass.
  • metadata: Metadata about the PriorityClass, such as its name and labels.
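  • value: An integer value that represents the priority of the PriorityClass. The higher the value, the higher the priority. User-defined classes can use any 32-bit integer up to 1,000,000,000 (one billion); larger values are reserved for the built-in system classes.
  • globalDefault: A boolean value that indicates whether this PriorityClass should be used as the default priority for all pods without a priority class. There can only be one global default PriorityClass in a cluster. If no PriorityClass is marked as the global default, pods without a priorityClassName get a priority of 0.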
  • description: An optional string that describes the purpose of the PriorityClass.

You can read the article here to understand the various properties of a Pod PriorityClass in Kubernetes.

We will create two more priority classes, mid-priority and low-priority, to use later.

mid-priority-priorityclass.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: mid-priority
value: 100000
globalDefault: false
description: "This priority class is for important workloads."

low-priority-priorityclass.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 10000
globalDefault: false
description: "This priority class is for best-effort workloads."

Here’s a breakdown of the priority values used:

  • high-priority: This class has the highest value of the three, 1000000, so pods with this priority are placed ahead of mid-priority and low-priority pods in the scheduling queue, subject to available resources.
  • mid-priority: This class has a priority value of 100000, which is lower than high-priority but higher than low-priority.
  • low-priority: This class has the lowest value of the three, 10000, so these pods are considered after higher-priority pods and are the first candidates for preemption when resources run short.

You can create these PriorityClasses in your Kubernetes cluster using the following commands:

kubectl apply -f high-priority-priorityclass.yaml
kubectl apply -f mid-priority-priorityclass.yaml
kubectl apply -f low-priority-priorityclass.yaml

Step-2: Verify the PriorityClass

You can verify that the PriorityClasses have been created successfully by using the kubectl get command. This command lists all PriorityClasses in your cluster, and you should see high-priority, mid-priority and low-priority listed among them, alongside the built-in system-cluster-critical and system-node-critical classes.

kubectl get priorityclass

Optional: Set as Default PriorityClass

If you want to make a PriorityClass (mid-priority in the example below) the default for all pods that do not have a PriorityClass explicitly set, update the globalDefault field in its manifest to true. Be cautious when doing so, as it affects the scheduling behavior of every pod created afterwards without a priorityClassName; pods that already exist keep the priority they were assigned at creation.

YAML
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: mid-priority
value: 100000
globalDefault: true
description: "This priority class is for important workloads."

Step-3: Assign PriorityClass to a Pod

You can then assign this PriorityClass to a pod by setting the priorityClassName field in the pod’s specification:

YAML
apiVersion: v1
kind: Pod
metadata:
  name: nginx-high-priority-pod
  labels:
    env: test
spec:
  containers:
  - name: nginx-container
    image: nginx
    imagePullPolicy: IfNotPresent
  priorityClassName: high-priority

This ensures that nginx-high-priority-pod has a higher priority than pods without a PriorityClass or with a lower priority value. If no node has sufficient free resources to schedule the pod, the scheduler will try to preempt lower-priority pods to make room for it.
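As a rough illustration of what happens once the pod is admitted: the Priority admission controller resolves priorityClassName to its integer value and writes it into the pod spec. Assuming the high-priority class from Step-1 exists, the relevant part of the output of kubectl get pod nginx-high-priority-pod -o yaml should look roughly like this excerpt (all other fields omitted):

```yaml
# Excerpt only; assumes the high-priority PriorityClass (value 1000000) from Step-1 exists.
spec:
  priorityClassName: high-priority
  priority: 1000000                        # filled in from the PriorityClass at admission
  preemptionPolicy: PreemptLowerPriority   # copied from the PriorityClass (the default policy)
```

Checking the priority field is a quick way to confirm that the class name was spelled correctly and resolved to the value you expect.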

Similarly, you can also use the mid-priority and low-priority classes based on the requirements.

Pod Preemption in Kubernetes:

Pod preemption is the process of evicting lower-priority pods from a node to make room for a higher-priority pod. Preemption only happens when no node in the cluster can satisfy the resource requirements of the higher-priority pod as things stand.

When a higher-priority pod is created, the scheduler checks if any node can accommodate the pod’s resource requirements. If there is no such node, the scheduler looks for a node where removing one or more lower-priority pods would free enough resources. It then preempts those pods: they are terminated gracefully, and the pending pod is scheduled onto the node once the resources are released. Preempted pods are not moved to another node; they are only recreated, and scheduled again if capacity allows, when a controller such as a Deployment manages them.
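One way to observe preemption from the outside: while the lower-priority pods are shutting down, the scheduler records the node it has chosen in the pending pod's status. A sketch of what that status might look like (the node name node-1 is hypothetical, and all other status fields are omitted):

```yaml
# Excerpt of the pending pod's status during preemption; node-1 is a hypothetical node name.
status:
  phase: Pending
  nominatedNodeName: node-1   # node on which lower-priority pods are being evicted
```

The pod is not guaranteed to end up on the nominated node. If another node becomes available first, the scheduler may place it there instead.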

Test Pod Preemption:

Create three Deployments using the following manifest file. Use this as an example; you will need to adjust the resource requests and replica counts to match the capacity of your cluster in order to see pods being preempted. Deployments are used here because bare Pods cannot be scaled with kubectl scale.

YAML
# Low-priority Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: low-priority-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: low-priority-app
  template:
    metadata:
      labels:
        app: low-priority-app
    spec:
      priorityClassName: low-priority
      containers:
      - name: app
        image: nginx
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
---
# Mid-priority Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mid-priority-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mid-priority-app
  template:
    metadata:
      labels:
        app: mid-priority-app
    spec:
      priorityClassName: mid-priority
      containers:
      - name: app
        image: nginx
        resources:
          requests:
            cpu: "200m"
            memory: "256Mi"
---
# High-priority Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: high-priority-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: high-priority-app
  template:
    metadata:
      labels:
        app: high-priority-app
    spec:
      priorityClassName: high-priority
      containers:
      - name: app
        image: nginx
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"

Now follow the steps:

  1. Use kubectl apply -f filename.yml to create the Deployments.
  2. Scale up the low-priority Deployment using kubectl scale deployment/low-priority-app --replicas=10
  3. Similarly, scale up the mid-priority Deployment using kubectl scale deployment/mid-priority-app --replicas=10
  4. In a second terminal, watch the pods using kubectl get pods -w
  5. Now scale up the high-priority Deployment using kubectl scale deployment/high-priority-app --replicas=10. Once the nodes are full, you will notice that some of the low-priority pods are terminated to free enough resources for the high-priority pods.

NOTE: Depending on resource availability, you will have to use an appropriate number of replicas to trigger pod preemption by the Kubernetes scheduler.

Preemption Policy:

By default, the preemption policy is set to PreemptLowerPriority, which means that the scheduler will preempt pods with lower priority than the higher-priority pod. However, you can also set the preemption policy to Never, which means that the scheduler will not preempt any pods, even if there are no nodes that can satisfy the higher-priority pod’s requirements.

Here’s an example of a PriorityClass with a Never preemption policy:

YAML
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: non-preempting
value: 1000000
globalDefault: false
preemptionPolicy: Never
description: "This priority class should be used for non-preempting workloads only."

The preemptionPolicy field in Kubernetes PriorityClass determines whether a pod with a specific priority class can preempt (evict) lower-priority pods to make room for itself. The preemptionPolicy field can have two possible values: PreemptLowerPriority and Never.

  • When preemptionPolicy is set to PreemptLowerPriority, pods with this priority class can preempt lower-priority pods. This means that if there are not enough resources available for a high-priority pod, the scheduler will evict lower-priority pods to make room for the high-priority pod.
  • On the other hand, when preemptionPolicy is set to Never, pods with this priority class cannot preempt lower-priority pods. This means that if there are not enough resources available for a high-priority pod, the scheduler will not evict lower-priority pods to make room for the high-priority pod. Instead, the high-priority pod will remain unscheduled until sufficient resources become available.
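To try this out, a pod can reference the class by name just like any other PriorityClass. The sketch below assumes the non-preempting class shown above has been created; the pod name is made up for illustration. Such a pod is still placed ahead of lower-priority pods in the scheduling queue, but it waits for free capacity rather than evicting anything.

```yaml
# Hypothetical pod name; assumes the non-preempting PriorityClass above has been created.
apiVersion: v1
kind: Pod
metadata:
  name: nginx-non-preempting-pod
spec:
  priorityClassName: non-preempting   # value 1000000, preemptionPolicy: Never
  containers:
  - name: nginx-container
    image: nginx
    imagePullPolicy: IfNotPresent
```

A non-preempting pod can itself still be preempted by pods with a higher priority, so the policy only limits what the pod does to others.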

How does the scheduler determine which pods to preempt?

In Kubernetes, the scheduler determines which pods to preempt based on the priority of the pods and the availability of resources on the nodes. When a high-priority pod needs to be scheduled but there are no nodes with sufficient resources, the scheduler looks for lower-priority pods that can be preempted to make room for the high-priority pod.

The scheduler uses a preemption algorithm to select the pods to preempt. The algorithm takes into account the priority of the pods, the resources required by the high-priority pod, and the resources available on the nodes. The goal is to minimize the impact of preemption on the system while ensuring that high-priority pods are scheduled as soon as possible.

Here’s an example of how the preemption algorithm works:

Suppose there are three pods, A, B, and C, with priorities 10, 20, and 30, respectively. The resources requested by each pod and the allocatable resources on the nodes are as follows:

Pod | Priority | Resources Required
A   | 10       | 0.5 CPU, 0.5 Gi memory
B   | 20       | 1 CPU, 1 Gi memory
C   | 30       | 2 CPU, 2 Gi memory

Allocatable resources on the K8s cluster nodes:

Node  | Allocatable Resources
Node1 | 2 CPU, 2 Gi memory
Node2 | 2 CPU, 2 Gi memory

Initially, pods A and B are scheduled on Node1 and Node2, respectively, which leaves 1.5 CPU and 1.5 Gi memory free on Node1 and 1 CPU and 1 Gi memory free on Node2. Now, the high-priority pod C needs to be scheduled, but no node has 2 CPU and 2 Gi memory free. The scheduler then looks for pods that can be preempted to make room for pod C.

The scheduler first considers pod A on Node1. Pod A has a lower priority than pod C, and preempting it would free up 0.5 CPU and 0.5 Gi memory, leaving all 2 CPU and 2 Gi memory of Node1 free, which is enough to schedule pod C. Preempting pod B on Node2 would also work, but pod B has a higher priority than pod A, and the scheduler prefers the node whose victims have the lowest priority. So the scheduler preempts pod A and schedules pod C on Node1.

After preempting pod A and placing pod C, the free resources on the nodes are as follows:

Node  | Resources Available
Node1 | 0 CPU, 0 Gi memory
Node2 | 1 CPU, 1 Gi memory

If pod A is managed by a controller, a replacement is created and goes back into the scheduling queue. It cannot preempt pod B or pod C, because both have a higher priority than pod A.

Instead, the scheduler looks for a node where pod A fits without preemption. Node2 still has 1 CPU and 1 Gi memory free, so the scheduler places pod A on Node2.

After scheduling pod A on Node2, the free resources on the nodes are as follows:

Node  | Resources Available
Node1 | 0 CPU, 0 Gi memory
Node2 | 0.5 CPU, 0.5 Gi memory

Now, all pods are scheduled on nodes with sufficient resources, and the high-priority pod C is running.

In summary, the Kubernetes scheduler determines which pods to preempt based on the priority of the pods and the availability of resources on the nodes. The scheduler uses a preemption algorithm to select the pods to preempt, to minimize the impact of preemption on the system while ensuring that high-priority pods are scheduled as soon as possible.

Q. Interview Questions – Pod Priority and Preemption:

  • How to create a custom priority class in Kubernetes?
  • Can I change the priority class of a running pod without deleting it?
  • How to set a different priority class for existing pods?
  • What happens if resources are insufficient for high-priority pods?

Conclusion:

Pod Priority and Priority Classes in Kubernetes provide a way to define the importance of a pod and ensure that critical pods are scheduled before lower-priority ones. With the help of Priority Classes, you can easily define how your pod is going to be scheduled in Kubernetes. It will also make sure that your critical pods get priority over lower-priority pods when it comes to scheduling.

Last Updated: May 5th, 2024 | Categories: Kubernetes