GPUs are nothing new. They’ve been in servers for years, handling graphics-heavy jobs. Photography, video editing, CAD, and cryptocurrency mining are just a few categories that need powerful GPUs.
Another huge category, quite possibly the biggest right now, is GPU support for Machine Learning and AI. Processing data and training models is at the top of most organizations’ minds, and to do that, GPU support is necessary.
In this blog post, you’ll learn why GPUs on Kubernetes matter and how to get a Kubernetes cluster ready for GPUs.
Why GPU Support On Kubernetes Matters
As it stands right now, Kubernetes is the de facto standard for orchestrating not only containers, but workflows as a whole. Whether you want an easy way to manage services outside of Kubernetes (Azure vNets, AWS S3 buckets, etc.), run containers, or run Virtual Machines, you can do it all in Kubernetes. This growth was necessary as Kubernetes became the underlying platform of choice.
The need to run ML and AI workloads has become increasingly important, and because of that, Kubernetes needs a way to orchestrate these workloads as well.
So far, the community has done a great job of getting ML to properly work in Kubernetes with Kubeflow and various other tools. Now, it’s time to think about how to run the ML and AI workloads on a larger scale. Yes, tools like Kubeflow are important and necessary, but you also need to consider the underlying components. One of the most important underlying components for ML and AI is GPUs.
Since LLMs have gotten bigger and bigger, engineers have been hearing about GPUs more and more. ChatGPT and the wider GenAI wave didn’t create the need for proper GPU support, they just raised awareness of it (kind of like what Docker did for containers). The goal now is to understand how to properly run decoupled applications and microservices (currently the smallest form factor we have) with proper GPU support.
Now that you know a bit about the “why” behind it, let’s take a look at some GPU options.
💡
GPU stands for Graphics Processing Unit.
GPU Options
You’ll have two primary types of options for a Kubernetes cluster:
- On-prem
- Managed Kubernetes Services in the cloud
Both have the ability to support GPUs, but how you consume and get the GPUs will be different. Let’s start with on-prem.
As with all on-prem/self-hosted Kubernetes clusters, they’re running on physical (and hopefully virtual) servers that you and/or your organization manage. From a technical perspective, that could be anything from a rack-mounted server to a desktop to an Intel NUC or a Raspberry Pi. There are a ton of great ways to run Kubernetes in today’s world.
Within that server, desktop, or Intel NUC (there’s even support for external GPUs), you can have a GPU. For example, below is a picture of my desktop at home; it has an Nvidia GPU in it.
A GPU is a piece of hardware that you have to buy and add into your computer/server.
On the other hand, using GPUs in the cloud is a bit different.
When you use a GPU-based Worker Node/Node Pool in the cloud, you’re using a “piece” of a GPU. The physical GPU is shared with other cloud customers, much like the other infrastructure resources (CPU, memory, storage, etc.).
Each major cloud (and even some smaller clouds including Digital Ocean) has GPU support.
For example, in Azure, the N-series VMs are GPU-centric.
In GCP, you can choose based on the Nvidia model.
And for the small price of $142K per month, you too can have your very own cloud-based GPU!
💡
Joking aside, cloud-based GPUs are incredibly expensive.
Sidenote: Azure appears to have a bit better pricing.
Much like any other cloud-based environment, it all comes down to what you want to “rent” vs what you want to “own”.
The cheapest/quickest option I found was in GCP. There’s an Nvidia Tesla T4 option that costs $2.95 USD per hour, so you can get a few hours out of it at a low cost. Just remember to delete it, because if you don’t, it’ll end up costing over $2,000 USD per month ($2.95 x roughly 730 hours). Digital Ocean may have a cheaper option, but I didn’t test it for the purposes of this blog post.
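If you want to see which GPU SKUs your region actually offers before spinning anything up, the cloud CLIs can list them. The commands below are a quick sketch; the region is a placeholder, so swap in your own.
# Azure: list VM sizes in a region and look for the N-series (NC, ND, NV) entries
az vm list-sizes --location eastus --output table

# GCP: list the Nvidia accelerator types available per zone
gcloud compute accelerator-types list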
Nvidia Installation
There are two steps to the installation:
- The Nvidia driver. This is like any other device driver. It’s needed for the GPU to work with (bind to) the Operating System.
- The Operator. Much like any other Kubernetes Operator, the Nvidia Operator allows you to extend the Kubernetes API to work with, in this case, the Nvidia hardware.
Let’s see how both are done.
💡
A Kubernetes Operator is a combination of CRDs (to extend the k8s API) and a Controller (which provides what many know as the self-healing capabilities of a resource in Kubernetes).
Driver
With the driver, you have two choices:
- Install the driver and manage it manually yourself.
- Let the cloud manage the driver for you.
If you go for number 1, you can configure a DaemonSet like the one below.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
        name: nvidia-device-plugin-ctr
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
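Once you apply the manifest, a quick way to confirm it’s working is to check that the device plugin Pods are running and that the nodes now advertise an nvidia.com/gpu resource. The file name and node name below are placeholders.
# Apply the DaemonSet (assuming you saved it as nvidia-device-plugin.yaml)
kubectl apply -f nvidia-device-plugin.yaml

# The device plugin Pods should come up on each GPU node
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds

# The node should now report GPU capacity
kubectl describe node <gpu-node-name> | grep -i "nvidia.com/gpu"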
If you go with number 2, you can set up the cluster so that the driver is deployed automatically. Notice how in the screenshot below the AKS implementation actually installs both the driver and the Operator.
GCP, however, only does the driver installation.
Personally, I feel that if you’re using a cloud provider anyway, you might as well let it handle this piece. The cloud providers are doing the same thing as the DaemonSet under the hood, so there’s little reason to manage it yourself.
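For reference, creating a GPU-backed Node Pool where the cloud handles the driver looks roughly like the commands below. The resource group, cluster, and pool names are placeholders, and exact flags can vary by CLI version, so treat this as a sketch rather than a copy/paste recipe.
# AKS: add a GPU Node Pool using an N-series VM size (AKS installs the Nvidia driver by default)
az aks nodepool add \
  --resource-group my-rg \
  --cluster-name my-aks-cluster \
  --name gpunp \
  --node-count 1 \
  --node-vm-size Standard_NC6s_v3

# GKE: add a GPU Node Pool with an attached Nvidia T4
gcloud container node-pools create gpu-pool \
  --cluster my-gke-cluster \
  --zone us-central1-a \
  --machine-type n1-standard-4 \
  --accelerator type=nvidia-tesla-t4,count=1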
Operator
If you didn’t go with the Azure implementation, you’ll need to install and manage the Operator. Luckily, there’s a Helm Chart available.
First, add the Nvidia Helm Chart.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
Next, ensure that the Helm Chart is up to date.
helm repo update
Last but certainly not least, install the Helm Chart. The resources will go into a Namespace called gpu-operator.
helm install gpuoperator --wait \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator
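Assuming the install goes through, you can sanity-check the Operator with the commands below. The Operator Pods should land in the gpu-operator Namespace, and you should see a handful of Nvidia CRDs (such as ClusterPolicy) in the cluster.
# Operator, GPU feature discovery, and device plugin Pods should reach Running or Completed
kubectl get pods -n gpu-operator

# The CRDs the Operator adds to extend the Kubernetes API
kubectl get crds | grep -i nvidia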
Sidenote: I had an issue with the Driver on EKS. It kept trying to download what seemed like an Amazon-specific driver image, but it kept saying that the container image couldn’t be found. To work around this, I set driver.enabled=false in the Helm configuration and installed the Driver as you saw in the previous section.
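For reference, that workaround looks something like this (same chart and flags as before, just with the driver component disabled):
helm install gpuoperator --wait \
  -n gpu-operator --create-namespace \
  --set driver.enabled=false \
  nvidia/gpu-operator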
Now that the Operator is installed, you can start working with it within your Pod deployments.
Workload Deployment
When you’re using GPUs in Kubernetes, there are two places to consider when enabling them:
- Under resources > limits.
- Within the Node Selector.
Much like you can set CPU and Memory limits, you can set GPU limits. By default, when you use Worker Nodes that have GPUs enabled, the GPUs are there, but your workloads may not be using them to their fullest extent. Setting limits lets you define how many GPUs you want available to a workload.
You can also set specific GPUs via the Node Selector. This allows you to choose which exact GPU you want to use in your deployment.
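The values you can use in the Node Selector come from labels that the GPU Operator’s feature discovery applies to each GPU node, so it’s worth checking what your nodes actually expose. The node name below is a placeholder.
# Shows GPU-related labels such as nvidia.com/gpu.product and nvidia.com/gpu.count
kubectl describe node <gpu-node-name> | grep "nvidia.com/gpu"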
Let’s take a look at a few examples.
Ubuntu Example
Below is an example that sets a limit of one GPU.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-operator-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      limits:
        nvidia.com/gpu: 1
This example uses a sample container image from Nvidia that’s GPU-ready, built on Ubuntu 20.04.
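If the GPU wiring is correct, the Pod should run to completion and its logs should indicate that the CUDA vector-add sample passed.
# The Pod should reach Completed once the CUDA sample finishes
kubectl get pod gpu-operator-test

# The logs should show the vector-add test passing
kubectl logs gpu-operator-test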
Ollama Example
Below is an example that shows:
- One GPU used under the resources map.
- A specific GPU under the nodeSelector.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  selector:
    matchLabels:
      name: ollama
  replicas: 2
  template:
    metadata:
      labels:
        name: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - name: http
          containerPort: 11434
          protocol: TCP
      nodeSelector:
        nvidia.com/gpu.product: H100-PCIE-80GB
This is a great way to not only specify the number of GPUs that your Pod needs, but also a specific GPU model to handle the load the Pod is expected to have. If you have GPU-centric application stacks running in Kubernetes, chances are you’ll have apps that need more GPU power than others.
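If other workloads in the cluster need to reach Ollama’s API, you’d typically pair the Deployment above with a Service. Below is a minimal sketch, assuming the labels and the named port from the Deployment above.
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  selector:
    name: ollama
  ports:
  - name: http
    port: 11434
    targetPort: http
    protocol: TCP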