How We Saved 10s of Thousands of Dollars Deploying Low Cost Open Source AI Technologies At Scale with Kubernetes

John McBride - May 14 - Dev Community

When you first start building AI applications with generative AI, you'll likely end up using OpenAI's API at some point in your project's journey. And for good reason! Their API is well-structured, fast, and supported by great libraries. At a small scale or when you’re just getting started, using OpenAI can be relatively economical. There’s also a huge amount of really great educational material out there that walks you through the process of building AI applications and understanding complex techniques using OpenAI’s API.

One of my personal favorite OpenAI resources these days is the OpenAI Cookbook: this is an excellent way to start learning how their different models work, how to start taking advantage of the many cutting edge techniques in the AI space, and how to start integrating your data with AI workloads.

However, as soon as you need to scale up your generative AI operations, you'll quickly encounter a pretty significant obstacle: the cost. Once you start generating thousands (and eventually tens of thousands) of texts via GPT-4, or even the lower-cost GPT-3.5 models, you'll quickly find your OpenAI bill is also growing into the thousands of dollars every month.

Thankfully, for small and agile teams, there are a lot of great options for deploying low cost open source technologies that reproduce an OpenAI compatible API on top of the latest and greatest open source models (which, in many cases, rival the performance of the GPT-3.5 class of models).

This is the very situation we at OpenSauced found ourselves in when building the infrastructure for our new AI offering, StarSearch: we needed a data pipeline that would continuously generate summaries and embeddings of GitHub issues and pull requests in order to do a “needle in the haystack” cosine similarity search in our vector store as part of a Retrieval Augmented Generation (RAG) flow. RAG is a very popular technique that lets you provide additional context and search results to a large language model that wouldn't otherwise be present in its training data. In this way, an LLM's answers can be much more accurate for queries you "augment" with your own data.

Simple RAG
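
To make the flow concrete, here's a minimal sketch of a RAG call using the openai Python client. The retrieve_summaries helper and the model name are illustrative placeholders rather than our production pipeline code; the retrieval step is the vector search described next.

# Minimal RAG sketch: retrieve relevant summaries, then "augment" the prompt.
# retrieve_summaries() is a placeholder for the vector search described below.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def retrieve_summaries(question: str, top_k: int = 5) -> list[str]:
    # In practice, this queries the vector store for the top_k nearest summaries.
    return ["placeholder summary of a relevant GitHub issue or pull request"]


def answer_with_rag(question: str) -> str:
    context = "\n\n".join(retrieve_summaries(question))
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content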

Cosine similarity search on top of a vector store is a way to enhance this RAG flow even further: because much of our data is unstructured and would be very difficult to parse with a full text search, we've created vector embeddings of AI generated summaries of the relevant rows in our database that we want to be able to search on. Vectors are really just lists of numbers, but they represent an "understanding" captured by an embedding machine learning model: comparing them against a query's embedding lets us find the "nearest neighbor" data to the end user's question.

Advanced RAG techniques with vector store
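
To illustrate the "nearest neighbor" idea, here's a toy cosine similarity search over embeddings using numpy. In production this search happens inside the vector store; the dimensions and data below are made up purely for illustration.

# Toy nearest-neighbor search over embeddings with cosine similarity.
import numpy as np


def cosine_similarity(stored: np.ndarray, query: np.ndarray) -> np.ndarray:
    # stored: (n, d) matrix of summary embeddings; query: (d,) query embedding
    return (stored @ query) / (np.linalg.norm(stored, axis=1) * np.linalg.norm(query))


# Pretend these came from an embedding model run over issue/PR summaries.
stored_embeddings = np.random.rand(1000, 384)
query_embedding = np.random.rand(384)

scores = cosine_similarity(stored_embeddings, query_embedding)
top_5 = np.argsort(scores)[::-1][:5]  # indices of the 5 most similar summaries
print(top_5, scores[top_5])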

Initially, for the summary generation part of our RAG data pipeline, we were using OpenAI directly and wanted to target "knowing" about the events and communities of the top 40,000+ repositories on GitHub. This way, anyone could ask about and gain unique insights into what's going on across the most prominent projects in the open source ecosystem. But, since new issues and pull request events are always flowing through this pipeline, on any given day upwards of 100,000 new events for the 40,000+ repos would flow through to have summaries generated: that's a lot of calls to the OpenAI API!

Vector generation data pipeline architecture

At this kind of scale, we quickly ran into cost bottlenecks: we considered further optimizing our usage of OpenAI's APIs to reduce our overall spend, but felt there was a more powerful path forward in using open source technologies to accomplish the same goal at our target scale for a significantly lower cost.

And while this post won’t get too deep into how we implemented the actual RAG part of StarSearch, we will look at how we bootstrapped the infrastructure to be able to consume many tens of thousands of GitHub events, generate AI summaries from them, and surface those as part of a nearest neighbor search using vLLM and Kubernetes. This was the biggest unlock to getting StarSearch to be able to surface relevant information about various technologies and "know" about what's going on across the open source ecosystem.

There’s a lot more that could be said about RAG and vector search - I recommend the following resources:

Running open source inference engines locally

Today, thanks to the power and ingenuity of the open source ecosystem, there are a lot of great options for running AI models and doing "generative inference" on your own hardware.

A few of the most prominent that come to mind are llama.cpp, vLLM, llamafile, llm, gpt4all, and Hugging Face's Transformers library. One of my personal favorites is Ollama: it allows me to easily run an LLM with ollama run on the command line of my MacBook. Each of these puts its own spin on the open source AI space, but all of them provide a very solid way to run open source large language models (like Meta's Llama 3 or Mistral's Mixtral) locally on your own hardware without the need for a third party API.

Maybe even more importantly, these pieces of software are well optimized for running models on consumer grade hardware like personal laptops and gaming computers: you don't need a cluster of enterprise grade GPUs or an expensive third party service to start playing around with generating text! You can start building AI applications right from your laptop today using open source technology.

This is exactly how I started transitioning our generative AI pipelines from OpenAI to a service we run on top of Kubernetes for StarSearch: I started simple with Ollama running a Mistral model locally on my laptop. Then, I began transitioning our OpenAI data pipelines that read from our database and generate summaries to use my local Ollama server. Ollama, like many of the other inference engines out there, provides an OpenAI compatible API, so I didn't have to rewrite much of the client code: I simply replaced the OpenAI API endpoint with the localhost address where Ollama was listening.

Using Ollama locally
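
If your client code uses the official openai Python library, the swap is just a base URL change. Here's a minimal sketch, assuming Ollama is running a Mistral model locally on its default port 11434, which exposes an OpenAI compatible API under /v1:

# Point existing OpenAI client code at a local Ollama server instead.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI compatible endpoint
    api_key="ollama",  # required by the client library, ignored by Ollama
)

response = client.chat.completions.create(
    model="mistral",  # whichever model you've pulled, e.g. via `ollama run mistral`
    messages=[{"role": "user", "content": "Summarize this GitHub issue: ..."}],
)
print(response.choices[0].message.content)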

Choosing vLLM for production

Eventually, I ran into a real bottleneck using Ollama: it didn't support serving concurrent clients. And at the kind of scale we're targeting, we likely need a couple dozen of our data pipeline microservice runners all batch processing summaries from the generative AI service at any given time. That's how we keep up with the constant load from the 40,000+ repos on GitHub. Obviously OpenAI's API can handle this kind of load, but how would we replicate this with our own service?

Eventually, I found vLLM, a fast inference engine that can serve multiple clients behind an OpenAI compatible API and take advantage of multiple GPUs on a given machine, using request batching and its efficient "PagedAttention" memory management during inference. Like Ollama, the vLLM community provides a container image, which makes it very easy to use on a number of different production platforms. Excellent!

Using vLLM at scale

Note to the reader: Ollama very recently merged changes to support concurrent clients. At the time of this writing, those changes had not yet landed in the main upstream image, but I'm very excited to see how it performs compared to other multi-client inference engines!

Running vLLM locally

To run vLLM locally, you'll need a Linux system, a Python runtime, and the vllm package installed (pip install vllm):

python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.2

This will start the OpenAI compatible server which you can then hit locally on port 8000:

curl http://localhost:8000/v1/models
{
    "object": "list",
    "data": [
        {
            "id": "mistralai/Mistral-7B-Instruct-v0.2",
            "object": "model",
            "created": 1715528945,
            "owned_by": "vllm",
            "root": "mistralai/Mistral-7B-Instruct-v0.2",
            "parent": null,
            "permission": [
                {
                    "id": "modelperm-020c373d027347aab5ffbb73cc20a688",
                    "object": "model_permission",
                    "created": 1715528945,
                    "allow_create_engine": false,
                    "allow_sampling": true,
                    "allow_logprobs": true,
                    "allow_search_indices": false,
                    "allow_view": true,
                    "allow_fine_tuning": false,
                    "organization": "*",
                    "group": null,
                    "is_blocking": false
                }
            ]
        }
    ]
}


Alternatively, to run a container with the OpenAI compatible API, you can use Docker on your Linux system:

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model mistralai/Mistral-7B-Instruct-v0.2

This mounts the local Hugging Face cache from my Linux machine, publishes port 8000, and shares the host's IPC namespace (which vLLM needs for shared memory between processes). Then, using localhost again, we can hit the OpenAI compatible server running in Docker. Let's do a chat completion now:

curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
      "model": "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
      "messages": [
        {"role": "user", "content": "Who won the world series in 2020?"}
      ]
  }'
{
    "id": "cmpl-9f8b1a17ee814b5db6a58fdfae107977",
    "object": "chat.completion",
    "created": 1715529007,
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The Major League Baseball (MLB) World Series in 2020 was won by the Tampa Bay Rays. They defeated the Los Angeles Dodgers in six games to secure their first-ever World Series title. The series took place from October 20 to October 27, 2020, at Globe Life Field in Arlington, Texas."
            },
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null
        }
    ],
    "usage": {
        "prompt_tokens": 21,
        "total_tokens": 136,
        "completion_tokens": 115
    }
}

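The same swap works from client code: pointing the openai Python library at the local vLLM server is, again, just a base URL change. A small sketch, assuming the server above is listening on localhost:8000:

# The same client code as before, now pointed at the local vLLM server.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM only checks this if started with --api-key
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Who won the world series in 2020?"}],
)
print(response.choices[0].message.content)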

Using Kubernetes for a large scale vLLM service

Running vLLM locally works just fine for testing, developing, and experimenting with inference, but at the kind of scale we're targeting, I knew we'd need some kind of environment that could easily handle any number of compute instances with GPUs, scale up with our needs, and load balance vLLM behind an agnostic service that our data pipeline microservices could hit at a production rate: enter Kubernetes, a familiar and popular container orchestration system!

This, in my opinion, is a perfect use case for Kubernetes and would make scaling up an internal AI service that looked like OpenAI's API relatively seamless.

In the end, the architecture for this kind of deployment looks like this:

  • Deploy any number of Kubernetes nodes with any number of GPUs on each node into a nodepool
  • Deploy a daemonset for vLLM to run on each node with a GPU
  • Deploy a Kubernetes service to load balance internal requests to vLLM's OpenAI compatible API

kubernetes architecture

Getting the cluster ready

If you're following along at home and looking to reproduce these results, I'm assuming at this point you have a Kubernetes cluster already up and running, likely through a managed Kubernetes provider, and have also installed the necessary GPU drivers onto the nodes that have GPUs.

On Azure's AKS, where we deployed this service, we needed to run a daemonset that deploys the Nvidia device plugin onto each of the nodes with a GPU so that Kubernetes can schedule those GPUs:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: gpu-resources
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      containers:
      - image: mcr.microsoft.com/oss/nvidia/k8s-device-plugin:v0.14.1
        name: nvidia-device-plugin-ctr
        securityContext:
          capabilities:
            drop:
            - All
        volumeMounts:
        - mountPath: /var/lib/kubelet/device-plugins
          name: device-plugin
      nodeSelector:
        accelerator: nvidia
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
      volumes:
      - hostPath:
          path: /var/lib/kubelet/device-plugins
          type: ""
        name: device-plugin

This daemonset runs the Nvidia device plugin pod on each node that matches the accelerator: nvidia node selector and tolerates a few taints from the system. Again, this is more or less platform specific, but it exposes the GPUs on our AKS cluster's nodes as schedulable nvidia.com/gpu resources so vLLM can take full advantage of those compute units.

Eventually, we end up with a cluster node configuration that has the default nodes and the nodes with GPUs:

❯ kubectl get nodes -A
NAME                     STATUS   ROLES    AGE   VERSION
defaultpool-88943984-0   Ready    <none>   5d    v1.29.2
defaultpool-88943984-1   Ready    <none>   5d    v1.29.2
gpupool-42074538-0       Ready    <none>   41h   v1.29.2
gpupool-42074538-1       Ready    <none>   41h   v1.29.2
gpupool-42074538-2       Ready    <none>   41h   v1.29.2
gpupool-42074538-3       Ready    <none>   41h   v1.29.2
gpupool-42074538-4       Ready    <none>   41h   v1.29.2

Each of these GPU nodes gets a device plugin pod managed by the daemonset:

❯ kubectl get daemonsets.apps -n gpu-resources
NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR        AGE
nvidia-device-plugin-daemonset   5         5         5       5            5           accelerator=nvidia   41h

One thing to note for this setup: each of these GPU nodes has an accelerator: nvidia label and a taint for nvidia.com/gpu. These ensure that no other pods are scheduled on these nodes, since we anticipate vLLM consuming all the compute and GPU resources on each of them.

Deploying a vLLM DaemonSet

In order to take full advantage of each of the GPUs deployed on the cluster, we can deploy an additional vLLM daemonset that also selects for each of the Nvidia GPU nodes:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: vllm-daemonset-ec9831c8
  namespace: vllm-ns
spec:
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - args:
        - --model
        - mistralai/Mistral-7B-Instruct-v0.2
        - --gpu-memory-utilization
        - "0.95"
        - --enforce-eager
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              key: HUGGINGFACE_TOKEN
              name: vllm-huggingface-token
        image: vllm/vllm-openai:latest
        name: vllm
        ports:
        - containerPort: 8000
          protocol: TCP
        resources:
          limits:
            nvidia.com/gpu: "1"
      nodeSelector:
        accelerator: nvidia
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists

Let’s break down what’s going on here:

First, we create the metadata and label selectors for the vllm daemonset pods on the cluster. Then, in the container spec, we provide the arguments to the vLLM container running on the cluster. You'll notice a few things here: we let vLLM use up to 95% of GPU memory in this deployment and we enforce eager mode (which skips building CUDA graphs, reducing memory consumption while trading off some inference performance). One of the things I like about vLLM is its many options for tuning and running on different hardware: there are lots of capabilities for tweaking how the inference works or how your hardware is consumed. So check out the vLLM docs for further reading!

Next, you'll notice we provide a Hugging Face token: this is so vLLM can pull the model down from Hugging Face's API, including any "gated" models we've been granted permission to access.

Next, we expose port 8000 on the pod. This will be used later by a service that selects these pods and provides an agnostic, load balanced endpoint in front of all the deployed vLLM pods on port 8000. Then, we request an nvidia.com/gpu resource (which is provided as a node level resource by the Nvidia device plugin daemonset - again, depending on your managed Kubernetes provider and how you installed the GPU drivers, this may vary). And finally, we provide the same node selector and taint tolerations to ensure that vLLM runs only on the GPU nodes! Now, when we deploy this, we'll see the vLLM daemonset has successfully deployed onto each of the GPU nodes:

❯ kubectl get daemonsets.apps -n vllm-ns
NAME                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR        AGE
vllm-daemonset-ec9831c8   5         5         5       5            5           accelerator=nvidia   41h

Load balancing with an internal Kubernetes service

In order to provide an OpenAI-like API to other microservices internally on the cluster, we can apply a Kubernetes service that selects the vllm pods in the vllm-ns namespace:

apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: vllm-ns
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8000
  selector:
    app: vllm
  sessionAffinity: None
  type: ClusterIP

This simply selects the app: vllm pods and targets vLLM's port 8000. The service gets picked up by the internal Kubernetes DNS server, so we can use the resolved "vllm-service.vllm-ns" endpoint and be load balanced across the vLLM APIs.

Results

Let's hit this vLLM Kubernetes service endpoint:

# hitting the vllm-service internal api endpoint resolved by Kubernetes DNS

curl vllm-service.vllm-ns.svc.cluster.local/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
     "model": "mistralai/Mistral-7B-Instruct-v0.2",
     "messages": [
       {"role": "user", "content": "Why is the sky blue?"}
     ]
}'

This "vllm-service.vllm-ns" internal Kubernetes service domain name will resolve to one of the nodes running a vLLM daemonset (again, load-balanced across all the running vLLM pods) and will return inference generation for the prompt "Why is the sky blue?":

{
    "id": "cmpl-76cf74f9b05c4026aef7d64c06c681c4",
    "object": "chat.completion",
    "created": 1715533000,
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The color of the sky appears blue due to a natural phenomenon called Rayleigh scattering. As sunlight reaches Earth's atmosphere, it interacts with molecules and particles in the air, such as nitrogen and oxygen. These particles scatter short-wavelength light, like blue and violet light, more than longer wavelengths, like red, orange, and yellow. However, we perceive the sky as blue and not violet because our eyes are more sensitive to blue light and because sunlight reaches us more abundantly in the blue part of the spectrum.\n\nAdditionally, some of the violet light gets absorbed by the ozone layer in the stratosphere, which prevents us from seeing a violet sky. At sunrise and sunset, the sky can take on hues of red, orange, and pink due to the scattering of sunlight through the Earth's atmosphere at those angles."
            },
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null
        }
    ],
    "usage": {
        "prompt_tokens": 15,
        "total_tokens": 201,
        "completion_tokens": 186
    }
}

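And from a pipeline microservice running inside the cluster, the same OpenAI client code can fan out many summary requests at once against this internal endpoint, letting vLLM batch them on the GPUs. Here's a rough sketch of that pattern (the prompt and event data are made up; the service DNS name comes from the Service manifest above):

# Sketch of an in-cluster pipeline worker fanning out summary requests
# against the internal vllm-service endpoint.
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://vllm-service.vllm-ns.svc.cluster.local/v1",
    api_key="not-needed",  # vLLM only checks this if started with --api-key
)


async def summarize(event_text: str) -> str:
    response = await client.chat.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.2",
        messages=[{"role": "user", "content": f"Summarize this GitHub event:\n{event_text}"}],
    )
    return response.choices[0].message.content


async def main(events: list[str]) -> list[str]:
    # Firing many requests concurrently is what lets vLLM's request batching
    # keep GPU throughput (and cost efficiency) high.
    return await asyncio.gather(*(summarize(e) for e in events))


if __name__ == "__main__":
    print(asyncio.run(main(["example issue body", "example pull request body"])))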

Conclusion

In the end, this provides our internal microservices running on the cluster with a way to generate summaries without having to use an expensive 3rd party API: we've gotten very good results from the Mistral models and, for this use case at this scale, running a service on some GPUs ourselves has been significantly more economical.

You could expand on this by adding network policies or additional configuration to your internal service, or even an ingress controller to expose this as a service to consumers outside of your cluster. The sky is the limit with what you can do from here! Good luck, and stay saucey!

If you want to check out StarSearch, join our waitlist now!
