This post continues my AI exploration series with a look at the open source solution Ray and how it can support AI workloads on Google Kubernetes Engine.

What is Ray?

Initially created at UC Berkeley in 2018, Ray is an open-source unified compute framework to scale AI and Python workloads. The Ray project is predominantly managed and maintained by Anyscale.

At a high level, Ray is made up of three layers: Ray AI Libraries (Python), Ray Core, and Ray Clusters. Together, these layers let Ray span the gap between data scientists and machine learning (ML) engineers on one side and infrastructure and platform engineers on the other. A variety of libraries help developers run their AI applications more easily and efficiently across distributed hardware, so data scientists and ML engineers can focus on the applications and models they’re developing. Ray also gives infrastructure and platform engineers tools to more easily and effectively support the infrastructure needs of these highly performance-sensitive applications.

This diagram from the Ray documentation shows how the Ray Libraries relate to Ray Core, while Ray Clusters are a distributed computing platform that can be run on top of a variety of infrastructure configurations, generally in the cloud.

Through the Python libraries in the Ray AI Libraries and Ray Core layers, Ray provides ML practitioners with tools that simplify the challenge of running highly performance-sensitive, distributed machine learning applications on hardware accelerators. Ray Clusters are a distributed computing platform where worker nodes run user code as Ray tasks and actors. These worker nodes are managed by a head node that handles tasks like autoscaling the cluster and scheduling the workloads. A Ray Cluster also provides a dashboard that shows the status of running jobs and services.

Ray on Kubernetes with KubeRay

Ray Clusters and Kubernetes clusters pair very well together. While Ray Clusters have been developed with a focus on enabling efficient distributed computing for hardware-intensive ML workloads, Kubernetes has a decade of experience in more generalized distributed computing. By running a Ray Cluster on Kubernetes, both Ray users and Kubernetes Administrators benefit from the smooth path from development to production that Ray’s Libraries combined with the Ray Cluster (running on Kubernetes) provide. KubeRay is an operator which enables you to run a Ray Cluster on a Kubernetes Cluster.

KubeRay adds 3 Custom Resource Definitions (CRDs) to provide Ray integration in Kubernetes. The RayCluster CRD enables the Kubernetes cluster to manage the lifecycle of its custom RayCluster objects, including managing RayCluster creation/deletion, autoscaling, and ensuring fault tolerance. The custom RayJob object enables the user to define Ray jobs (Ray Jobs, with a space!) and a submitter, which can either be directed to run the job on an existing RayCluster object, or to create a new RayCluster to be deleted upon that RayJob’s completion. The RayService custom resource encapsulates a multi-node Ray Cluster and a Serve application that runs on top of it into a single Kubernetes manifest.
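
To make the RayCluster and RayJob resources a little more concrete, here is a minimal sketch that builds both manifests as Python dictionaries and prints them as JSON, which kubectl can apply just like YAML. The field names follow the KubeRay ray.io/v1 API as I understand it. The image tag, resource sizes, object names, the my_script.py entrypoint, and the helper functions themselves are assumptions made for illustration, so check them against the KubeRay version on your cluster before relying on them.

```python
import json

# Assumed image tag for illustration; pin whichever Ray version you actually use.
RAY_IMAGE = "rayproject/ray:2.9.0"


def ray_container(name, cpu="1", memory="2Gi"):
    """Return a container spec for a Ray head or worker pod (sizes are placeholders)."""
    return {
        "name": name,
        "image": RAY_IMAGE,
        "resources": {
            "requests": {"cpu": cpu, "memory": memory},
            "limits": {"cpu": cpu, "memory": memory},
        },
    }


def ray_cluster_spec(min_workers=1, max_workers=3):
    """Return a cluster spec: one head group plus one autoscaled worker group."""
    return {
        "enableInTreeAutoscaling": True,
        "headGroupSpec": {
            "rayStartParams": {"dashboard-host": "0.0.0.0"},
            "template": {"spec": {"containers": [ray_container("ray-head")]}},
        },
        "workerGroupSpecs": [
            {
                "groupName": "workers",
                "replicas": min_workers,
                "minReplicas": min_workers,
                "maxReplicas": max_workers,
                "rayStartParams": {},
                "template": {"spec": {"containers": [ray_container("ray-worker")]}},
            }
        ],
    }


def ray_cluster(name):
    """Return a RayCluster manifest whose lifecycle KubeRay manages."""
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayCluster",
        "metadata": {"name": name},
        "spec": ray_cluster_spec(),
    }


def ray_job(name, entrypoint, existing_cluster=None):
    """Return a RayJob manifest.

    With existing_cluster set, the job targets that RayCluster by label.
    Otherwise it carries its own cluster spec and asks KubeRay to delete
    the cluster once the job finishes.
    """
    spec = {"entrypoint": entrypoint}
    if existing_cluster:
        spec["clusterSelector"] = {"ray.io/cluster": existing_cluster}
    else:
        spec["rayClusterSpec"] = ray_cluster_spec()
        spec["shutdownAfterJobFinishes"] = True
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayJob",
        "metadata": {"name": name},
        "spec": spec,
    }


if __name__ == "__main__":
    # my_script.py is a hypothetical entrypoint baked into the image.
    print(json.dumps(ray_cluster("demo-cluster"), indent=2))
    print(json.dumps(ray_job("demo-job", "python my_script.py"), indent=2))
```

Saving either printed manifest to a file and passing it to kubectl apply -f should create the corresponding object on a cluster where the KubeRay CRDs are installed. A RayService manifest is left out here to keep the sketch short.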

KubeRay on GKE with the Ray Operator Add-On

The Ray Operator is an add-on for Google Kubernetes Engine (GKE) that is based on KubeRay and provides a smooth, native way to deploy and manage KubeRay resources on GKE. Enabling the Ray Operator on your GKE cluster automatically installs the KubeRay CRDs (RayCluster, RayJob, and RayService), so you can run Ray workloads on your cluster. You can enable the operator at cluster creation, or add it to an existing cluster via the console, the gcloud CLI, or infrastructure as code (IaC) tools such as Terraform.

The Ray Operator Add-On in a GKE cluster is hosted by Google and does not run on GKE nodes, meaning that the operator adds no overhead to the cluster. You may also choose to run the KubeRay operator on your GKE cluster without using the Add-On. In that case, the KubeRay operator runs on your nodes, which may add some slight overhead to the cluster.

This short video (~3.5 minutes) shows you how to add the Ray Operator to your cluster and demonstrates creating a RayCluster and running a Ray Job on that cluster running on GKE.

When to Use Ray on GKE

In many organizations, the people creating AI applications and the people running and managing the GKE clusters and other infrastructure are different people, because the skills to do these types of activities well are specialized in nature.

As a platform engineer, you may want to consider encouraging use of Ray as a single scalable ML platform that members of your organization could use to simplify the development lifecycle of AI workloads. If AI practitioners are using Ray, you can use the Ray Operator for GKE to simplify onboarding and integration into your existing GKE ecosystem. As a practitioner who is building AI applications, you may want to consider using or advocating for Ray if your organization is already using GKE and you want to reuse the same code between development and production without modification, and to leverage Ray’s ML ecosystem with multiple integrations.

Ray Alternatives

Ray does not exist in a vacuum, and there are numerous other tools that can support the same types of AI and Python workloads. Ray’s layered approach solves challenges from development to production that alternatives may not holistically address. Understanding how Ray relates to other tools can help you understand the value it provides.

The Python library components of Ray could be considered analogous to solutions like NumPy, SciPy, and pandas (which is most analogous to the Ray Data library specifically). As a framework and distributed computing solution, Ray could be used in place of a tool like Apache Spark or Dask. It’s also worthwhile to note that Ray Clusters can be used as a distributed computing solution within Kubernetes, as we’ve explored here, but Ray Clusters can also be created independent of Kubernetes.

To learn more about the relationships between Ray and alternatives, check out the “Ray & KubeRay, with Richard Liaw and Kai-Hsun Chen” episode of the Kubernetes Podcast from Google.

As in many other areas of tech, the right answer to the question “which tool is best for me” is, “it depends.” The best solution for your use case will depend on a variety of factors including which tools you’re comfortable with, how you divide work, what's in use in your environment, and more.

Try out the Ray Operator on GKE today!

Ray is a powerful solution that bridges the gap between those developing AI applications and those running them on infrastructure. The Ray Operator for GKE makes it easy to take advantage of Ray on GKE clusters.

Learn more about Ray and how to use it in the Google Cloud “About Ray on Google Kubernetes Engine (GKE)” docs page. And check out the “Simplify Kuberay with Ray Operator on GKE” video on YouTube to see an example of the Ray Operator on GKE in action.

At the time of publishing, Ray Summit 2024 is just around the corner! Join Anyscale and the Ray community in San Francisco from September 30 through October 2 for three days of Ray-focused training, collaboration, and exploration. The schedule is available now!
