How to Reproduce Kubernetes Node-pressure Eviction via K3d

Chi-Sheng Liu - Aug 15 - Dev Community

Background

A few days ago, while developing the KubeRay project, I learned about a Kubernetes behavior from a discussion in an issue's comment section. There are two types of eviction: Node-pressure Eviction and API-initiated Eviction. API-initiated Eviction is triggered by calling the Eviction API directly or by using commands such as kubectl drain. Pods evicted this way are ultimately deleted and usually recreated on another node. With Node-pressure Eviction, however, kubelet only sets the Pod's phase to Failed without deleting it. Therefore, if the controller does not handle this case properly, the Pod will not be recreated on another node.
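To see the difference concretely, a Pod evicted by node pressure stays visible on the API server with phase Failed and reason Evicted. A quick way to check (the Pod name here is just a placeholder):

# Inspect the phase, reason, and message of a pod that was evicted
kubectl get pod my-app-12345 -o jsonpath='{.status.phase}{"\n"}{.status.reason}{"\n"}{.status.message}{"\n"}'
# Expected to print something like:
# Failed
# Evicted
# The node was low on resource: memory. ...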

Here's a brief overview of the issue: when a Pod created by the KubeRay operator is on a node with insufficient disk space, the Pod gets evicted. Even after the disk space is freed, the Pod remains in the Failed state and is not recreated on another node.

Now I need to reproduce this issue. The key point is that, since the two types of eviction behave differently, I cannot use kubectl drain or similar commands to reproduce the scenario; I have to trigger an actual Node-pressure Eviction. However, I don't have a cluster to use; I do all my development on my personal computer, which makes the issue difficult to reproduce. When developing Kubernetes applications locally, most people use minikube, kind, or k3d. Since I need a multi-node environment, minikube is out: although it now supports multiple nodes, it is still mostly used for single-node scenarios. Both kind and k3d use Docker containers as Kubernetes nodes. My operating system is Linux Mint, where Docker runs natively, unlike macOS where Docker runs in a virtual machine. Because resources (memory, disk, etc.) are shared between Docker and the host, if I actually create node pressure, my computer might become unusable.

After extensive Googling, I discovered that Docker can set runtime memory limits and that k3d has an --agents-memory flag to set the agent nodes' memory. That gave me a way to reproduce the issue.

Steps

First, create a k3d cluster with 2 agent nodes, each limited to 3GB of memory, and configure kubelet to trigger Pod eviction when available memory falls below 1GiB.

k3d cluster create \
  --agents 2 \
  --k3s-arg "--disable=traefik@server:0" \
  --agents-memory 3g \
  --k3s-arg "--kubelet-arg=eviction-hard=memory.available<1Gi@agent:0" \
  --k3s-arg "--kubelet-arg=eviction-hard=memory.available<1Gi@agent:1"
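Optionally, you can confirm that the memory limit was actually applied to the agent containers by asking Docker directly (the container names below follow k3d's default naming for a cluster called k3s-default):

docker stats --no-stream k3d-k3s-default-agent-0 k3d-k3s-default-agent-1
# The MEM USAGE / LIMIT column should show a limit of roughly 3GiB for each agent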

Check the memory of all nodes

kubectl get nodes -o custom-columns=NAME:.metadata.name,CAPACITY_MEMORY:.status.capacity.memory,ALLOCATABLE_MEMORY:.status.allocatable.memory

Output:

# NAME                       CAPACITY_MEMORY   ALLOCATABLE_MEMORY
# k3d-k3s-default-agent-1    3221225Ki         2172649Ki
# k3d-k3s-default-agent-0    3221225Ki         2172649Ki
# k3d-k3s-default-server-0   32590664Ki        32590664Ki

You can see that agent 0 and agent 1 each report roughly 3GB of memory capacity, but only about 2GB is allocatable, because the 1GiB eviction-hard threshold is subtracted from what kubelet offers to Pods.
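If you want to double-check that the eviction threshold actually reached kubelet, one way is to read the kubelet configuration through the API server's node proxy (the exact JSON layout may vary slightly between versions):

kubectl get --raw "/api/v1/nodes/k3d-k3s-default-agent-0/proxy/configz" | grep -o '"evictionHard":{[^}]*}'
# Expected to contain something like: "evictionHard":{"memory.available":"1Gi"}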

Next, add taints to agent 0 and agent 1 so that subsequent Pods will only be deployed to the server-0 node.

kubectl taint nodes k3d-k3s-default-agent-0 k3d=noschedule:NoSchedule
kubectl taint nodes k3d-k3s-default-agent-1 k3d=noschedule:NoSchedule
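As a quick sanity check, you can list the taints per node:

kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# Both agent nodes should now show a k3d=noschedule:NoSchedule taint, while server-0 shows <none>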

Install the KubeRay operator, so the operator Pod will run on the server-0 node.

helm install kuberay-operator kuberay/kuberay-operator --namespace ray-system --version 1.1.1 --create-namespace
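If you have never added the KubeRay Helm repository, do that first; afterwards you can confirm that the operator Pod really landed on server-0 (the repository URL below is, to my knowledge, the official KubeRay chart repo):

# One-time setup of the chart repository (skip if already added)
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update

# Confirm the operator pod was scheduled onto the server node
kubectl get pods -n ray-system -o wide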

Remove the taints from agent 0 and agent 1 and add a taint to server 0 so that subsequent Pods will not be deployed to server 0.

kubectl taint nodes k3d-k3s-default-server-0 k3d=noschedule:NoSchedule
kubectl taint nodes k3d-k3s-default-agent-0 k3d=noschedule:NoSchedule-
kubectl taint nodes k3d-k3s-default-agent-1 k3d=noschedule:NoSchedule-

Install the RayCluster custom resource. After installation, the KubeRay operator will create a head Pod and a worker Pod. Since, in the Helm chart, the head Pod requests 2GB of memory and the worker Pod requests 1GB, and agent 0 and agent 1 each have only about 2GB of allocatable memory, the two Pods cannot end up on the same node.

helm install raycluster kuberay/ray-cluster --version 1.1.1
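To double-check the scheduling result and to find the head Pod's name for the next steps, list the Pods together with their nodes:

kubectl get pods -o wide
# The head pod and the worker pod should end up on different agent nodes;
# note the head pod's name from the NAME column for the commands below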

Next, we need to put memory pressure on the node where the head Pod is running. After some Googling, I found that stress-ng is commonly used for this purpose, so I'll use it as well. We need to make sure stress-ng is available inside the head Pod. The simplest way is to copy a statically compiled stress-ng binary directly into the head Pod, so we don't have to worry about the head Pod's base image or any missing dependencies. As for obtaining the statically compiled binary, you can compile it yourself, but I took a shortcut and copied it out of a Docker image that already includes the binary. In the steps below, assume the head Pod is named raycluster-kuberay-head-ldg9f.
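Roughly, extracting the binary from such an image looks like this (the image name and the path of the binary inside it depend on which image you pick; alexeiled/stress-ng and /stress-ng below are just one example):

# Create a stopped container from an image that ships a static stress-ng build,
# copy the binary out, then remove the container
docker create --name stress-ng-tmp alexeiled/stress-ng
docker cp stress-ng-tmp:/stress-ng ./stress-ng
docker rm stress-ng-tmp

With the binary on the local machine, copy it into the head Pod: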

kubectl cp ./stress-ng raycluster-kuberay-head-ldg9f:/home/ray

Open a shell on the head pod

kubectl exec -it raycluster-kuberay-head-ldg9f -- bash

Simulate memory stress: start 4 VM workers, each allocating and continuously writing to 2GB of memory, with --vm-keep so the mapping is not repeatedly freed and re-created. This is far more than the node's 3GB limit, so available memory quickly drops below the 1GiB threshold.

./stress-ng --vm 4 --vm-bytes 2G --vm-keep

After a short while, kubelet evicts the head Pod due to Node-pressure Eviction, and, as described in the background, the Pod stays in the Failed phase instead of being rescheduled.
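From another terminal, you can watch the eviction happen; something along these lines (the exact pod name, status values, and event messages will differ):

# Watch the pod status change as the node comes under memory pressure
kubectl get pods -w

# Inspect the evicted pod: phase Failed, reason Evicted
kubectl describe pod raycluster-kuberay-head-ldg9f | grep -i -A2 evict

# Node-pressure evictions also show up as events
kubectl get events --field-selector reason=Evicted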
