Infinity embeddings on Kubernetes with KubeAI

Sam Stoelinga - Sep 25 - Dev Community

I just merged and released the Infinity support PR in KubeAI, adding Infinity as an embedding engine. That means you can now serve embeddings on your Kubernetes cluster behind an OpenAI-compatible API.

Infinity is a high-performance, low-latency embedding engine: https://github.com/michaelfeil/infinity
KubeAI is a Kubernetes Operator for running OSS ML serving engines: https://github.com/substratusai/kubeai

How to use this?

Run on any K8s cluster:

helm repo add kubeai https://www.kubeai.org
helm install kubeai kubeai/kubeai --wait --timeout 10m
cat > model-values.yaml << EOF
catalog:
  bge-embed-text-cpu:
    enabled: true
    features: ["TextEmbedding"]
    owner: baai
    url: "hf://BAAI/bge-small-en-v1.5"
    engine: Infinity
    resourceProfile: cpu:1
    minReplicas: 1
EOF
helm install kubeai-models kubeai/models -f ./model-values.yaml
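
Optionally, you can sanity-check that the model came up before moving on. The exact resource and object names below are assumptions about what the KubeAI charts create, so adjust to what you see in your cluster:

# Check the Model custom resource created from model-values.yaml
kubectl get models.kubeai.org
# Check that a serving pod for the model is running
kubectl get pods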

Forward the kubeai service to localhost:

kubectl port-forward svc/kubeai 8000:80
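If you want to check the endpoint without installing the Python client, a plain curl call against the OpenAI-compatible embeddings route should work; the path below just appends the standard /embeddings route to the base URL used in the Python example:

curl http://localhost:8000/openai/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "bge-embed-text-cpu", "input": "Your text goes here."}'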

Afterwards, you can use the OpenAI Python client to get embeddings:

from openai import OpenAI
# Assumes port-forward of kubeai service to localhost:8000.
client = OpenAI(api_key="ignored", base_url="http://localhost:8000/openai/v1")
response = client.embeddings.create(
    input="Your text goes here.",
    model="bge-embed-text-cpu"
)
print(response)
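The response follows the OpenAI embeddings schema, so the vectors live under response.data. A minimal sketch of pulling a vector out and sending a batch of inputs, continuing from the client above:

# Each item in response.data carries one embedding vector.
vector = response.data[0].embedding
print(len(vector))  # 384 dimensions for bge-small-en-v1.5

# The endpoint also accepts a list of inputs in a single call.
batch = client.embeddings.create(
    input=["first sentence", "second sentence"],
    model="bge-embed-text-cpu",
)
print(len(batch.data))  # one embedding per input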

What’s next?

  • Support for autoscaling based on Infinity-reported metrics.
. . .