Horizontal Pod Autoscaling (HPA) is a feature in Kubernetes that automatically adjusts the number of running pods in a deployment, replica set, or stateful set based on observed metrics such as CPU utilization or custom metrics. The goal of HPA is to ensure that your application has enough resources to handle varying levels of traffic and workload demands efficiently.

With HPA, you can define target metrics and thresholds for your pods, specifying the desired minimum and maximum number of replicas. When the observed metrics exceed or fall below the specified thresholds, HPA dynamically scales the number of pod replicas up or down to maintain the desired level of resource utilization. HPA continuously monitors the resource utilization metrics provided by the Metrics Server, which collects data from pods and nodes in the Kubernetes cluster. It evaluates these metrics against the defined thresholds and triggers scaling actions as necessary.

By automatically adjusting the number of pod replicas based on workload demand, HPA helps optimize resource utilization, improve application performance, and ensure cost-effectiveness in Kubernetes deployments. It enables your applications to handle varying levels of traffic efficiently, without manual intervention, and ensures that your infrastructure resources are utilized optimally.

To use HPA, you need a metrics pipeline that can provide the necessary metrics to the HPA controller. The metrics pipeline can have a built-in metrics server or a custom metrics pipeline. The HPA controller uses the metrics to calculate the desired number of replicas and updates the replication controller accordingly.
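
For most clusters, the simplest pipeline is the Kubernetes Metrics Server. If it is not already running, a common way to install it (assuming the standard upstream manifest) and verify it is:

$ kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

## Verify that metrics are being collected
$ kubectl top nodes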

Supported Metrics

  • CPU utilization: Number of CPU cores used. Can be used to calculate a percentage of the pod’s requested CPU. (API versions: autoscaling/v1, autoscaling/v2)
  • Memory utilization: Amount of memory used. Can be used to calculate a percentage of the pod’s requested memory. (API version: autoscaling/v2)

How does a HorizontalPodAutoscaler work? 

A HorizontalPodAutoscaler (HPA) in Kubernetes works by dynamically adjusting the number of pod replicas in a deployment, replica set, or stateful set based on observed metrics and defined scaling policies. Here’s a breakdown of how an HPA operates:

  1. Metrics Collection: The HPA relies on metrics collected from pods and nodes within the Kubernetes cluster. The Metrics Server aggregates resource utilization metrics such as CPU and memory usage.
  2. HPA Configuration: You define an HPA resource in Kubernetes by specifying:
    • Target Metrics: The resource utilization metrics (e.g., CPU utilization) that the HPA should monitor.
    • Target Value: The desired target value for the specified metric.
    • Scaling Policies: Minimum and maximum number of pod replicas, as well as scaling thresholds.
  3. Monitoring: The HPA continuously monitors the metrics provided by the Metrics Server. It compares the observed metrics against the specified target value and thresholds.
  4. Scaling Decision: Based on the observed metrics and defined scaling policies, the HPA makes scaling decisions (the exact formula it uses is shown after this list):
    • If the observed metric exceeds the upper threshold, indicating high resource utilization, the HPA scales up the number of pod replicas.
    • If the observed metric falls below the lower threshold, indicating low resource utilization, the HPA scales down the number of pod replicas.
  5. Scaling Action: When the HPA determines that scaling is necessary, it triggers scaling actions to adjust the number of pod replicas:
    • Scale Up: Creates new pod replicas to handle increased workload demand.
    • Scale Down: Terminates existing pod replicas to reduce resource consumption during periods of low demand.
  6. Pod Scheduling: Once the scaling action is initiated, Kubernetes orchestrates the creation or termination of pod replicas according to the scaling decision.
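
Under the hood, the scaling decision in steps 4 and 5 follows the formula documented by Kubernetes: the controller compares the current metric value to the desired value and multiplies the ratio by the current replica count, rounding up.

desiredReplicas = ceil( currentReplicas * ( currentMetricValue / desiredMetricValue ) )

For example, with 2 replicas, a current average CPU utilization of 40%, and a target of 20%, the HPA computes ceil(2 * 40/20) = 4 replicas.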

Strategies to implement Horizontal Pod Autoscaling in Kubernetes:

There are various strategies to implement Horizontal Pod Autoscaling (HPA) in Kubernetes, and we are going to explore some of the most commonly used approaches.

  1. CPU Utilization: Scale pods based on CPU utilization metrics.
  2. Memory Utilization Strategy: Scale pods based on memory utilization metrics.
  3. Custom Metrics Strategy: Scale pods based on custom metrics specific to your application.
  4. Combination of Metrics Strategy: Use a combination of CPU, memory, and custom metrics for scaling decisions.
  5. Predictive Scaling Strategy: Use machine learning algorithms or predictive analytics to forecast future workload demand.

Before getting into the HPA, let us create a Deployment as shown below to test the example. Note the resources.requests and resources.limits:

ngx-admin-deployment.yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ngx-admin-deployment
spec:
  # Do not set replicas
  #replicas: 3
  selector:
    matchLabels:
      app: ngx-admin-app
  template:
    metadata:
      labels:
        app: ngx-admin-app
    spec:
      containers:
        - name: ngx-admin-pod
          image: dockerbikram/ngx-admin-nginx:latest
          resources:
            requests:
              memory: "15Mi"
              cpu: "5m"
            limits:
              memory: "20Mi"
              cpu: "5.5m"
          ports:
            - containerPort: 80

Let us take a moment to understand the above Kubernetes Deployment resource, defined in ngx-admin-deployment.yml.

  • Replicas: You should avoid setting this when used with HPA.
  • Selector: To match pods with the label “app: ngx-admin-app”.
  • Resources: Resource requests and limits are defined for the container to ensure predictable resource allocation. In this example, the container requests 15Mi of memory and 5m of CPU, and is limited to a maximum of 20Mi of memory and 6m of CPU.
  • Ports: The container exposes port 80.
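
Apply the manifest and confirm the Deployment is running (standard kubectl commands):

$ kubectl apply -f ngx-admin-deployment.yml
$ kubectl get deployment ngx-admin-deployment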

To access these Pods from outside the cluster, you can optionally create a LoadBalancer Service as shown here:

svc-ngx-admin-loadbalancer.yml
apiVersion: v1
kind: Service
metadata:
  name: svc-ngx-admin-loadbalancer
spec:
  selector:
    app: ngx-admin-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  type: LoadBalancer
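
Apply the Service and note the external IP assigned by your cloud provider; we will load-test this IP shortly:

$ kubectl apply -f svc-ngx-admin-loadbalancer.yml
$ kubectl get svc svc-ngx-admin-loadbalancer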

Now, we will create an HPA (HorizontalPodAutoscaler) to manage the dynamic workload. As described earlier, when the HPA determines that scaling is necessary, it scales up by creating new pod replicas to handle increased demand, and scales down by terminating replicas during periods of low demand.

The criteria that decide when to scale can be based on CPU utilization, RAM utilization, custom metrics, or a combination of metrics, all of which are discussed here.

1. HPA based on CPU utilization

Horizontal Pod Autoscaling is a feature in Kubernetes that automatically scales the number of pods in a deployment based on observed CPU utilization or other selected performance metrics. With HPA, you can specify the target CPU utilization percentage for your application, and the Kubernetes controller will automatically adjust the number of replicas to maintain that target.

Here’s an example of a Horizontal Pod Autoscaler configuration that scales Pods based on CPU utilization:

ngx-admin-hpa.yml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ngx-admin-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ngx-admin-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 20

  • scaleTargetRef: Specifies the target deployment for autoscaling. In this case, it references a Deployment named “ngx-admin-deployment” in the “apps/v1” API version.
  • minReplicas: Specifies the minimum number of replicas that the HPA can scale down to. In this example, it’s set to 2.
  • maxReplicas: Specifies the maximum number of replicas that the HPA can scale up to. Here, it’s set to 10.
  • metrics: Defines the metrics used for autoscaling:
    • type: Specifies the type of metric used for autoscaling. In this case, it’s “Resource”, indicating resource utilization metrics.
    • resource.name: Specifies the resource to monitor for autoscaling. Here, it’s CPU.
    • resource.target.type: Specifies how the target is expressed. Here, it’s “Utilization”.
    • resource.target.averageUtilization: Specifies the target average utilization percentage for the resource. In this example, it’s set to 20, meaning the HPA scales the number of replicas to maintain an average CPU utilization of 20% across pods.
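
Apply the HPA and check its status. Alternatively, an equivalent autoscaler can be created imperatively with kubectl autoscale instead of the manifest:

$ kubectl apply -f ngx-admin-hpa.yml

## Imperative alternative to the manifest above
$ kubectl autoscale deployment ngx-admin-deployment --cpu-percent=20 --min=2 --max=10

$ kubectl get hpa ngx-admin-hpa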

I will use a simple command-line tool called Cassowary to perform load testing against the service IP and trigger the autoscaling:

$ brew update && brew install cassowary

$ cassowary run -u http://34.42.233.255/ -c 10000 -n 10000 --duration 50
  • -u <IP_Address>: Specifies the URL of the LoadBalancer. Use kubectl get svc svc-ngx-admin-loadbalancer to obtain the IP.
  • -c 10000: Specifies the number of concurrent users (clients) to simulate during the load test. You can increase or reduce this based on need.
  • -n 10000: Specifies the total number of requests to send during the load test. Modify this as per your need.
  • --duration 50: Specifies the duration of the load test, in seconds. In this example, the test will run for 50 seconds.
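
Alternatively, if you prefer not to install anything locally, a throwaway load-generator pod inside the cluster (the approach used in the official Kubernetes HPA walkthrough, adapted here to our Service name) works as well:

$ kubectl run -i --tty load-generator --rm --image=busybox --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://svc-ngx-admin-loadbalancer; done"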

Now, in another terminal, let us watch the number of pods created during this load test.

$ kubectl get pods -w

NAME                                    READY   STATUS    RESTARTS   AGE
ngx-admin-deployment-7f8b846459-5jl22   1/1     Running   0          14s
ngx-admin-deployment-7f8b846459-7ttg6   1/1     Running   0          14s
ngx-admin-deployment-7f8b846459-grrcm   1/1     Running   0          14s
ngx-admin-deployment-7f8b846459-l5hxz   1/1     Running   0          74s
ngx-admin-deployment-7f8b846459-r5jfx   1/1     Running   0          74s
ngx-admin-deployment-7f8b846459-zl7fp   1/1     Running   0          10m

Additionally, you can use the following commands to monitor the resource utilization of each of these pods:

$ watch -n 0.01 kubectl top pods

## If watch is not installed
$ kubectl top pods

## You can watch the hpa itself
$ kubectl get hpa ngx-admin-hpa --watch

NAME            REFERENCE                         TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
ngx-admin-hpa   Deployment/ngx-admin-deployment   0%/20%    2         10        2          11m

2. HPA based on RAM utilization:

This HPA configuration scales the ngx-admin-deployment based on the average memory usage of its pods. The HPA ensures that there are always at least 2 replicas of the deployment running, and will not scale the deployment beyond 10 replicas. When the average memory usage per pod exceeds 200Mi, the HPA scales the deployment up by adding replicas; when it falls below 200Mi, the HPA scales the deployment down by removing replicas.

Similar to CPU utilization, you can scale on the RAM utilization as below.

YAML
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ngx-admin-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ngx-admin-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: AverageValue
          averageValue: 200Mi
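
To see the memory readings the HPA is acting on, along with any scaling events it has triggered, describe the autoscaler:

$ kubectl describe hpa ngx-admin-hpa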

3. HPA based on Multiple Resource types – CPU & RAM Usage

This HPA configuration scales the ngx-admin-deployment based on both the average memory usage and the CPU utilization of its pods, always keeping between 2 and 10 replicas. The HPA computes a desired replica count for each metric: the average memory usage across all pods, and the CPU utilization of the individual pods relative to their requests. When either metric exceeds its threshold, the HPA scales the deployment up by adding replicas; it scales back down only when both metrics are below their thresholds.

The HPA acts on whichever metric yields the highest desired replica count, so scaling up is triggered when any one of the criteria is met.

YAML
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ngx-admin-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ngx-admin-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: AverageValue
          averageValue: 200Mi
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 20

4. HPA based on traffic requests – Custom Metrics example

Here is a basic example that shows how to autoscale when load-balancer capacity utilization rises above 50% on average. This uses a GKE-specific metric; for it to work, please follow the Configuring horizontal Pod autoscaling guide to configure and install the required packages.

YAML
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ngx-admin-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ngx-admin-deployment
 
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Object
    object:
      describedObject:
        kind: Service
        name: svc-ngx-admin-loadbalancer
     
      metric:
        name: "autoscaling.googleapis.com|gclb-capacity-utilization"
      target:
        averageValue: 50
        type: AverageValue
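
The example above uses a GKE-provided Object metric. On other platforms, a Pods-type metric served by a custom-metrics adapter (such as the Prometheus adapter) is the more portable shape; the metric name below is hypothetical and must match whatever your adapter exposes:

YAML
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"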

1. Resource target types in HPA – resource.target.type

As you dive deeper into Kubernetes, this can become very confusing, so please pay attention here!

In Kubernetes Horizontal Pod Autoscaler (HPA) configurations, the resource.target.type field specifies the type of target utilization for autoscaling. Here are the possible values for resource.target.type:

a. Average Utilization:

Indicates that the target is specified as a percentage of the resource amount requested by each pod (target.type: Utilization). The utilization of each pod (e.g., CPU or memory) is computed relative to its resource request and then averaged across all pods; the HPA scales the replica count to keep that average at the specified percentage.

YAML
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
...
...
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

b. Value:

Specifies an absolute value as the target. With this type, you provide a fixed value, such as a specific number of requests per second, and the metric is compared as a single whole value rather than a per-pod average. Note that for Resource metrics (CPU and memory), only Utilization and AverageValue are accepted; Value is used with Object metrics. The sketch below therefore adapts the earlier Service example (the requests-per-second metric name is illustrative and must be supplied by a metrics adapter):

YAML
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
....
...

  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Object
    object:
      describedObject:
        kind: Service
        name: svc-ngx-admin-loadbalancer
      metric:
        name: requests-per-second
      target:
        type: Value
        value: 10k

c. AverageValue:

Indicates that the target is specified as an average value of the raw metric across all pods (the total metric value divided by the number of pods). This type is useful when you want to maintain a specific average value for a custom metric, or for a resource metric expressed as an absolute quantity rather than as a percentage of the pod’s request.

YAML
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
...
spec:
 ...
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: AverageValue
          averageValue: 200Mi

NOTE: When you use custom metrics, the metrics block changes shape (type: Pods, Object, or External instead of Resource), as in the custom-metrics example earlier.

2. HPA Config with Scaling Policies:

One or more scaling policies can be specified in the behavior section of the spec. When multiple policies are specified, the policy that allows the greatest amount of change is selected by default.

a. ScaleDown Example:

The behavior section in a Horizontal Pod Autoscaler (HPA) object in Kubernetes allows you to customize the scaling behavior of your application. The following example defines a scaleDown behavior with two policies:

ngx-admin-hpa.yml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ngx-admin-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ngx-admin-deployment
  minReplicas: 1
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
  # Can add Multiple Policies 
  behavior:
    scaleDown:
      #optional - but best practice
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
        - type: Percent
          value: 10
          periodSeconds: 60
      selectPolicy: Max

  • stabilizationWindowSeconds: Specifies a look-back window over which the HPA considers previous scaling recommendations before acting. Here, the HPA uses the highest recommendation from the past 300 seconds (5 minutes) when scaling down, which prevents rapid and unnecessary scale-down actions.
  • There are 2 policies defined here: Pods and Percent.
  • Pods policy: Allows at most 4 replicas to be removed per 60 seconds.
  • Percent policy: Allows at most 10% of the current replicas to be removed per 60 seconds.
  • selectPolicy: Max indicates that whichever of the 2 policies (Percent and Pods) allows the greater change will be used; see the worked example after this list.
  • Alternatively, set it to Min to use whichever policy allows the smaller change.
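
For example, with 80 current replicas, the Pods policy allows removing 4 pods per minute while the Percent policy allows removing 10% of 80 = 8 pods per minute; selectPolicy: Max picks the larger change, so up to 8 pods can be removed (80 down to 72). With 20 replicas, 10% is only 2 pods, so the Pods policy wins and up to 4 pods can be removed per minute.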

b. Scale Up Example:

The scaleUp behavior in a Horizontal Pod Autoscaler (HPA) object in Kubernetes allows you to customize how your application scales when the demand for resources increases. Here is an example snippet that implements two scaleUp policies:

YAML
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ngx-admin-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ngx-admin-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
  behavior:
    ## Scale Down behaviour
    scaleDown:
    ...
    ...

    ## Scale Up behaviour
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 70
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
      selectPolicy: Max

  • Setting the stabilizationWindowSeconds parameter to 0 ensures that the HPA responds immediately to increases in demand, without waiting out a stabilization window.
  • Percent policy: Allows the replica count to grow by at most 70% per 15 seconds, applied while the average CPU utilization of your application exceeds the target value.
  • Pods policy: Allows at most 4 replicas to be added per 15 seconds.
  • selectPolicy: Max indicates that whichever of the 2 policies (Percent and Pods) allows the greater change will be used, which helps keep your application responsive and available during traffic spikes; see the example below.
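
For example, starting at 10 replicas, the Percent policy allows adding 70% of 10 = 7 pods per 15 seconds while the Pods policy allows 4; selectPolicy: Max picks 7, so the deployment can grow from 10 to 17 replicas in a single step.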

c. Disable Scale Down:

A selectPolicy value of Disabled turns off scaling in the given direction. So, to prevent downscaling, the following policy would be used:

YAML
...
...
behavior:
  scaleDown:
    selectPolicy: Disabled
...

Use Cases of Horizontal Pod Autoscaling:

  1. Handling Variable Workloads: HPA automatically adjusts the number of pod replicas to accommodate fluctuations in workload demand, ensuring optimal performance and responsiveness.
  2. Cost Optimization: By scaling pods based on resource utilization, HPA helps optimize resource allocation and minimize infrastructure costs.
  3. Auto-scaling Web Applications: HPA can scale web application pods to handle increasing traffic loads during peak hours and scale down during periods of low activity.
  4. Optimizing Resource Utilization: HPA optimizes resource utilization by dynamically adjusting the number of pods based on actual demand, reducing resource wastage and improving cluster efficiency.

Best Practices for Using Horizontal Pod Autoscaling:

  1. When an HPA is enabled, it is recommended that the value of spec.replicas be removed from the manifest(s) of the target Deployment and/or StatefulSet.
  2. Always set stabilizationWindowSeconds for scale-down and scale-up events to avoid rapid scaling spikes (flapping). This is also called a cooldown period.
  3. Define appropriate target resource utilization thresholds based on workload characteristics and performance requirements.
  4. Regularly monitor HPA behavior and adjust scaling policies as needed to ensure optimal performance and resource utilization.
  5. Test and validate HPA configurations in staging environments before deploying them in production to avoid unexpected behavior.

Conclusion:

Horizontal Pod Autoscaling is a critical feature of Kubernetes that enables automatic scaling of pods based on observed resource utilization metrics. By leveraging HPA, organizations can achieve efficient resource management, optimize performance, and ensure cost-effectiveness in their Kubernetes deployments. With proper configuration and monitoring, HPA enables Kubernetes clusters to dynamically adapt to changing workload demands and maintain optimal performance levels.
