Delving deep into the adoption of Alibaba Cloud's cloud-native technologies in AutoMQ

AutoMQ - Jul 29 - Dev Community

Author information: Zhou Xinyu, Co-founder & CTO of AutoMQ

Introduction: AutoMQ[1] is a groundbreaking cloud-native Kafka built on a shared storage architecture. By utilizing its compute-storage separation and integrating deeply with Alibaba Cloud's robust and advanced services such as Object Storage OSS, Block Storage ESSD, Elastic Scaling ESS, and Spot Instances, AutoMQ provides a cost advantage ten times greater than Apache Kafka while offering automatic scalability.

Leading the charge toward the cloud-native era, our mission at Alibaba Cloud and AutoMQ is to enhance our customers' capabilities in the cloud-based business landscape. As the industry evolves, we've observed that many products hastily claim to be cloud-native without fundamentally embracing cloud computing capabilities. Merely supporting deployment on Kubernetes does not suffice. True cloud-native products must exploit the full potential, elasticity, and scalability of cloud computing, thereby achieving significant cost and efficiency benefits.
Today, we delve into how Alibaba Cloud leverages cloud-native technologies with AutoMQ, addressing practical challenges effectively.
Object Storage OSS
With data increasingly migrating to the cloud, object storage has emerged as the primary storage solution for big data and data lake ecosystems. The shift from file APIs to object APIs is becoming prevalent, especially as stream data, often handled by Kafka, increasingly flows into these data lakes.
AutoMQ has developed the S3Stream[1] stream storage library on top of object storage, enabling efficient reading and ingestion of stream data via the Object API. By adopting a storage-compute separation architecture, it integrates Apache Kafka's storage layer with object storage, fully capitalizing on the technical and cost advantages of shared storage (a minimal sketch of the object-API access pattern follows the list below):

  • The standard-version storage price of OSS with in-city (zone-redundant) redundancy is 0.12 yuan/GiB/month, more than eight times cheaper than ESSD PL1 at 1 yuan/GiB/month. Moreover, OSS inherently provides multi-zone availability and durability; with no additional data replication required, it reduces costs by a factor of 25 compared with the conventional cloud-disk-based three-replica architecture.
  • In contrast to a Shared-Nothing architecture, the shared storage model achieves a true separation of storage and compute, decoupling data from compute nodes. Consequently, when AutoMQ undertakes partition reassignment, it avoids data replication, facilitating true second-level lossless partition reassignments. This feature supports AutoMQ's capability for real-time self-balancing and rapid horizontal scaling of nodes.
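To make the access pattern concrete, here is a minimal sketch using the Alibaba Cloud OSS Java SDK. The bucket name, key layout, and batch format are illustrative assumptions, not AutoMQ's actual S3Stream internals:

```java
// Hypothetical example: persist a batch of stream records as a single OSS
// object and read it back. Object keys encode stream id and base offset, so
// any broker (or read-only copy) can locate the data without a data copy.
import com.aliyun.oss.OSS;
import com.aliyun.oss.OSSClientBuilder;
import com.aliyun.oss.model.OSSObject;

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class OssStreamSketch {
    public static void main(String[] args) throws Exception {
        // Endpoint, credentials, and bucket are placeholders.
        OSS oss = new OSSClientBuilder().build(
                "https://oss-cn-hangzhou.aliyuncs.com",
                System.getenv("OSS_ACCESS_KEY_ID"),
                System.getenv("OSS_ACCESS_KEY_SECRET"));
        String bucket = "automq-data-demo";

        // Key layout is illustrative: one object per flushed batch.
        String key = "streams/orders-0/00000000000000001024.batch";
        byte[] batch = "record-1\nrecord-2\nrecord-3".getBytes(StandardCharsets.UTF_8);

        // Ingest: one PUT per batch; durability and multi-AZ redundancy
        // come from OSS itself, with no application-level replication.
        oss.putObject(bucket, key, new ByteArrayInputStream(batch));

        // Read path: any node can GET the same object (shared storage).
        OSSObject obj = oss.getObject(bucket, key);
        try (InputStream in = obj.getObjectContent()) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
        oss.shutdown();
    }
}
```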

While AutoMQ has effectively harnessed OSS for cost and architectural benefits, this represents merely the beginning. The adoption of shared storage is set to spur a wave of technical and product innovations at AutoMQ.

  • Disaster Recovery: For fundamental software infrastructure, the greatest concern is a cluster failing to continue delivering service, or data proving unrecoverable after a cluster failure. Potential causes range from software bugs to catastrophic data-center-level disasters. Thanks to shared storage and a straightforward metadata snapshot system, a compromised cluster can be shut down and restarted as a new cluster using the data stored on OSS to resume operations, as sketched after this list.
  • Cross-Region Disaster Recovery: OSS provides near real-time replication across different regions. Companies don't have to establish their own cross-regional networks or set up costly data connectivity clusters. Paired with the previously mentioned disaster recovery technology, this enables straightforward, code-free implementation of cross-region disaster recovery strategies.
  • Shared Read-Only Copies: High fan-out is a critical business use case for consuming streaming data. In a data-driven company, a single data item might be accessed by dozens of subscribers, a load the original cluster alone cannot sustain. With OSS, read-only copies can be created directly from the data on OSS without duplicating it, offering scalable high fan-out capabilities.
  • Zero ETL: Modern data technology frameworks rely on object storage. When data resides in a common storage pool and possesses a level of self-description, data silos can be dismantled at a minimal cost without the necessity to construct ETL pipelines. Various analytical tools or computing engines can access shared data from multiple sources.
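The disaster-recovery idea above can be sketched in a few lines: a replacement cluster lists the surviving objects on OSS and rebuilds its view of each stream from the object keys. The key layout matches the earlier ingestion sketch and is an assumption, not AutoMQ's metadata snapshot format:

```java
// Hypothetical disaster-recovery sketch: enumerate the batch objects left
// behind by a failed cluster and rebuild stream state from their keys.
// Pagination is omitted for brevity.
import com.aliyun.oss.OSS;
import com.aliyun.oss.OSSClientBuilder;
import com.aliyun.oss.model.OSSObjectSummary;
import com.aliyun.oss.model.ObjectListing;

public class RecoverFromOssSketch {
    public static void main(String[] args) {
        OSS oss = new OSSClientBuilder().build(
                "https://oss-cn-hangzhou.aliyuncs.com",
                System.getenv("OSS_ACCESS_KEY_ID"),
                System.getenv("OSS_ACCESS_KEY_SECRET"));
        // List every batch object under the stream prefix.
        ObjectListing listing = oss.listObjects("automq-data-demo", "streams/");
        for (OSSObjectSummary s : listing.getObjectSummaries()) {
            // e.g. streams/orders-0/00000000000000001024.batch
            System.out.println("recovering " + s.getKey() + " (" + s.getSize() + " bytes)");
        }
        oss.shutdown();
    }
}
```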

On the other hand, incorporating stream data into lakes completes the modern data stack, laying the groundwork for the Stream-Lake architecture. This is the source of the vast creative potential behind Confluent's Tableflow[2]. Data is produced and stored in stream form, matching the continuously generated and evolving nature of real-world information. Real-time data must stay in stream form so that stream computing frameworks can extract value immediately. Eventually, as data ages, it transitions to table formats like Iceberg[3] for broader-scale data analysis. From a lifecycle perspective, the move from streams to tables naturally matches data's progression from high frequency to low frequency and from hot to cold, and building a stream-table integrated data stack on object storage is a forward-looking trend.

Block Storage ESSD
Just as ECS is still often regarded as a physical server, the ESSD cloud disk suffers a similar fate. Users generally harbor two misconceptions about ESSD:

  • Comparing ESSD to local disks, they are concerned about data durability, apprehensive that problems typical of physical disks such as faulty disks or bad sectors might persist.
  • Because ESSD is a cloud disk, users assume that remote writes imply poor performance, uncontrollable latency, and jitter.

However, ESSD is supported by a robust distributed file system, utilizing triple-replica technology that ensures nine nines of data durability. Users are insulated from errors in physical storage media, as the system automatically detects and corrects faults across millions of physical disks.
Moreover, ESSD functions as shared storage. In the event of ECS failures, ESSD volumes can be mounted on other nodes to continue providing read and write services. In this respect, ESSD, similar to OSS, is shared storage and not a stateful local disk, which is a key reason why AutoMQ is touted as a stateless data software.
From a performance standpoint, ESSD benefits from combined software and hardware enhancements, including offloading the ESSD client to the Shenlong MOC[5] for hardware acceleration. It employs a high-performance proprietary network protocol and a congestion control algorithm based on RDMA technology, bypassing the traditional TCP stack to meet the low-latency and low packet loss requirements of data centers. These improvements ensure stable IOPS and throughput performance, as well as highly scalable storage capacity.
AutoMQ employs ESSD innovatively in three ways:

  • First, reliability separation, by fully utilizing the multi-replica technology of ESSD to circumvent the need for replication mechanisms such as Raft or ISR at the application layer, significantly reducing storage costs and network replication bandwidth.
  • Second, using ESSD as a WAL: data is written cyclically to the ESSD as a raw device using Direct IO, and read back only for recovery in fault scenarios. The shared nature of ESSD makes AutoMQ's WAL a remote, shareable WAL that any node in the cluster can take over and recover (see the sketch after this list).
  • Finally, a cloud service-oriented billing design, where ESSD provides at least approximately 100 MiB/s throughput and about 1800 IOPS for any volume size. AutoMQ requires only a minimal ESSD volume as the WAL disk, such as a 2GiB ESSD PL0 volume, costing just 1 yuan per month to deliver the aforementioned performance. For enhanced storage performance on a single machine, simply combine multiple small-spec WAL disks for linear expansion.
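To illustrate the second point, below is a simplified sketch of a circular WAL on a raw block device using Direct IO in Java. The 4 KiB alignment, the fixed capacity (assumed to be a multiple of the block size), and the wrap handling are assumptions for illustration; AutoMQ's actual WAL implementation is more involved:

```java
// A simplified sketch of a circular WAL on a raw block device. Direct IO
// bypasses the page cache, so writes are durable on the device when the
// call returns. This is an illustration, not AutoMQ's S3Stream code.
import com.sun.nio.file.ExtendedOpenOption;

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class CircularWalSketch implements AutoCloseable {
    private static final int BLOCK = 4096;     // assumed sector alignment
    private final FileChannel channel;
    private final long capacity;               // assumed multiple of BLOCK
    private long nextOffset;                   // logical append position

    public CircularWalSketch(Path device, long capacity) throws IOException {
        this.channel = FileChannel.open(device,
                StandardOpenOption.WRITE, ExtendedOpenOption.DIRECT);
        this.capacity = capacity;
    }

    /** Appends one record, padded to the block size, wrapping at capacity.
     *  Assumes records never straddle the wrap point; a real WAL splits them. */
    public synchronized void append(byte[] record) throws IOException {
        int padded = ((record.length + BLOCK - 1) / BLOCK) * BLOCK;
        // Direct IO requires an aligned buffer address and aligned size.
        ByteBuffer buf = ByteBuffer.allocateDirect(padded + 2 * BLOCK)
                                   .alignedSlice(BLOCK);
        buf.put(record).limit(padded).position(0);
        long physical = nextOffset % capacity;  // cyclic placement
        channel.write(buf, physical);
        nextOffset += padded;
        // Once the batch is uploaded to OSS, this WAL range can be reused.
    }

    @Override
    public void close() throws IOException { channel.close(); }
}
```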

ESSD and OSS offer distinctly different storage characteristics. ESSD provides high performance, low latency, and high IOPS, albeit at a higher cost. AutoMQ, however, has developed a cost-effective approach to utilizing ESSD. OSS is not ideal for environments requiring high IOPS as it charges per IO operation, yet it offers economical storage with virtually unlimited scalability in both throughput and capacity. As primary storage, OSS delivers high throughput, low cost, high availability, and limitless scalability; ESSD provides durable, highly available, low-latency storage ideal for storing WAL, and its virtualized nature allows for requesting very small storage capacities. AutoMQ's proprietary streaming library, S3Stream[1], cleverly merges the benefits of both ESSD and OSS shared storage, achieving low-latency, high-throughput, low-cost, and unlimited capacity streaming storage.
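Building on the WAL sketch above, the following condensed sketch shows how the two tiers might be combined: a record is acknowledged once it is durable in the ESSD WAL, and accumulated batches are uploaded to OSS as large objects to keep per-request costs low. The OssUploader interface and the flush threshold are illustrative assumptions, not S3Stream APIs:

```java
// Illustrative write path combining the two storage tiers: acknowledge the
// producer once the record is durable in the ESSD WAL, then upload batches
// to OSS and recycle the WAL space.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class TieredWritePathSketch {
    private final CircularWalSketch wal;       // low-latency durability tier
    private final OssUploader uploader;        // hypothetical OSS batch uploader
    private final List<byte[]> pending = new ArrayList<>();
    private static final int BATCH_BYTES = 8 * 1024 * 1024; // assumed flush size
    private int pendingBytes;

    public TieredWritePathSketch(CircularWalSketch wal, OssUploader uploader) {
        this.wal = wal;
        this.uploader = uploader;
    }

    /** Producer-facing append: durable once the WAL write returns. */
    public synchronized void append(byte[] record) throws IOException {
        wal.append(record);                    // millisecond-level latency
        pending.add(record);
        pendingBytes += record.length;
        if (pendingBytes >= BATCH_BYTES) {
            uploader.upload(pending);          // one large PUT: cheap on OSS IOPS
            pending.clear();
            pendingBytes = 0;
            // The WAL range covered by the uploaded batch can now be reused.
        }
    }

    interface OssUploader { void upload(List<byte[]> records) throws IOException; }
}
```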


Multiple mounting and NVMe protocol
Although ESSD is shared storage, it is exposed as a block device; sharing it efficiently requires two additional storage technologies: multiple mounting (multi-attach) and the NVMe PR (Persistent Reservation) protocol.
Cloud disks natively support being detached and remounted on other nodes for recovery, but when the original node encounters issues such as an ECS hang, the detach time becomes unpredictable. With ESSD's multiple mounting capability, the disk can be mounted directly on another ECS node without first being detached.
Taking the AutoMQ Failover process as an example, when a Broker node is identified as a Failed Broker, its cloud disk is multiply mounted to a healthy Broker for data recovery. Before commencing the actual Recovery process, it's crucial to ensure that the original node has ceased writing. AutoMQ utilizes the NVMe protocol's PR lock for IO Fencing on the original node.
Both these processes are millisecond-level operations, effectively transforming ESSD into shared storage within the AutoMQ framework.
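The following high-level sketch summarizes that failover sequence. The AttachApi, Fencing, and WalRecovery interfaces are hypothetical stand-ins for the ECS multi-attach API, the NVMe PR commands, and AutoMQ's recovery logic; this is not AutoMQ's code:

```java
// High-level sketch of the failover flow described above.
public class FailoverSketch {

    interface AttachApi {
        // Multi-attach: mounts the volume on the healthy node WITHOUT
        // detaching it from the (possibly hung) failed node first.
        void attachDisk(String diskId, String healthyInstanceId);
    }

    interface Fencing {
        // Acquires an NVMe PR reservation so any in-flight writes from the
        // failed node are rejected by the device itself (IO Fencing).
        void acquireReservation(String devicePath);
    }

    interface WalRecovery {
        // Replays unflushed WAL entries and uploads them to OSS.
        void recoverAndUpload(String devicePath);
    }

    private final AttachApi ecs;
    private final Fencing fencing;
    private final WalRecovery recovery;

    FailoverSketch(AttachApi ecs, Fencing fencing, WalRecovery recovery) {
        this.ecs = ecs; this.fencing = fencing; this.recovery = recovery;
    }

    /** Called once a broker has missed enough heartbeats to be declared failed. */
    void failover(String walDiskId, String healthyInstanceId, String devicePath) {
        ecs.attachDisk(walDiskId, healthyInstanceId); // multi-attach, ms-level
        fencing.acquireReservation(devicePath);       // fence the old writer
        recovery.recoverAndUpload(devicePath);        // replay WAL, then reuse disk
    }
}
```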
Regional ESSD
While ESSD typically uses a multi-replica architecture, these replicas are often confined to a single AZ, restricting ESSD's capability to handle AZ-level failures. Regional EBS[6] is crafted to solve this problem. By spreading the underlying multi-replica redundancy across multiple AZs with robust consistency read-write technology, it can withstand single AZ failures.
In terms of shared mounting, it supports cross-AZ mounting within a region as well as multi-AZ shared mounting, with preemption-based IO Fencing and NVMe PR locks as the available fencing mechanisms. Regional ESSD is already offered by major international cloud providers and will soon launch on Alibaba Cloud. This product enables AutoMQ to tolerate single-AZ failures at very low cost, satisfying scenarios that demand higher availability.

Elastic Compute Service (ECS)
Over the past decade, businesses have increasingly adopted a Rehost strategy to transition to the cloud, primarily replacing traditional on-premise physical servers with cloud-based servers like ECS. A key distinction between ECS and on-premise servers is the SLA services that ECS offers. Leveraging virtualization technologies, ECS addresses many hardware and software failures typical of physical servers. For those failures that cannot be avoided, cloud servers can quickly recover on a new physical server, substantially reducing downtime and limiting disruption to business operations.

Alibaba Cloud offers a 99.975% SLA for individual ECS instances. Operating a service on a single ECS node can ensure an availability of 99.9%, making it well-suited for production environments and fulfilling the availability demands of numerous services. For example, running a single-node AutoMQ cluster on an ECS setup with 2 CPUs and 16GB of RAM can achieve this level of availability and provide a write capacity of 80MiB/s, all while keeping costs low.
Since its development, AutoMQ has been designed to operate on ECS as a cloud service rather than as a physical server. Should an ECS failure occur, the system depends on the quick recovery features of the ECS node, such as automatic reassignment and restart. AutoMQ initiates proactive failover only after detecting several missed heartbeats from a node, considering two primary factors:

  • In the event of physical hardware or kernel failures, ECS can recover within seconds. Therefore, AutoMQ relies on the swift recovery capabilities of ECS to manage such issues, while avoiding overly sensitive failover mechanisms that might prompt unnecessary disaster recovery efforts.
  • AutoMQ's failover mechanisms are activated only upon prolonged ECS failures, network partitions, or even availability-zone-level failures, taking advantage of the capabilities provided by ESSD and OSS for proactive disaster recovery.

Elastic Scaling Service (ESS)
In March 2024, AutoMQ was officially launched on the Alibaba Cloud Marketplace through a collaborative release with Alibaba Cloud. From the general availability of AutoMQ's core features to its swift listing on the marketplace, two key products played a pivotal role: Alibaba Cloud Compute Nest, which standardizes delivery processes for service providers, and Elastic Scaling Service (ESS). Although AutoMQ's architecture naturally supports elastic scaling, providing these capabilities seamlessly presents challenges[4]; AutoMQ leverages ESS to simplify the final delivery steps. AutoMQ chose ESS over Kubernetes for its public cloud deployment for several reasons:

  • AutoMQ's initial deployment model is BYOC, which simplifies dependencies and spares each user from having to set up a Kubernetes cluster when installing AutoMQ.
  • ESS offers configuration management, automatic scaling, scheduled scaling, instance management, multi-AZ deployment, and health checks, all akin to the core deployment features of Kubernetes. We view ESS as a streamlined Kubernetes at the IaaS layer.
  • As discussed above, AutoMQ relies on multiple mounting, Regional ESSD, and other advanced features provided by cloud vendors, which Kubernetes may not support immediately. Using IaaS-layer APIs rather than Kubernetes APIs is akin to the difference between C++ and Java: native functionality must first be exposed at the Kubernetes level before it can be used there. Kubernetes is nonetheless an excellent platform, and we plan to support Kubernetes deployments in the future, particularly in private cloud scenarios, to abstract away many of the differences at the IaaS layer.

Spot Instances
Elasticity does not come free to cloud providers; they incur considerable holding costs to offer adequate elasticity, which leaves an excess of idle computing resources. These resources are sold as spot instances, which behave just like regular ECS instances but can cost up to 90% less than standard pay-as-you-go rates. Unlike regular pay-as-you-go instances, spot pricing varies with market supply and demand: if demand for computing power decreases overnight, prices typically drop, adding a temporal dimension to spot pricing. As more users adopt spot instances, prices adjust accordingly, steering different workloads toward their optimal usage times. AutoMQ, for instance, conducts extensive testing overnight on spot instances, drastically cutting testing costs.
The other defining characteristic of spot instances is that they can be interrupted and reclaimed at any time, which poses a high barrier to adoption. However, the compute-storage separation architecture employed by AutoMQ ensures that Broker nodes maintain no local state, enabling them to handle spot reclamation gracefully. The diagram below illustrates the process of WAL recovery via the ESSD API when a spot instance is reclaimed. Through this approach, AutoMQ achieves a tenfold cost reduction, with spot instances playing a significant role in lowering compute expenses.

Image: WAL recovery via the ESSD API when a spot instance is reclaimed
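As a rough illustration of how a broker might react, it can watch the ECS metadata service for a reclamation notice and drain itself before the instance is reclaimed. The metadata URL follows Alibaba Cloud's documented convention for spot instances, but treat the exact path and the shutdown hook as assumptions:

```java
// A minimal sketch of watching for a spot reclamation notice. The metadata
// endpoint returns a timestamp once reclamation is scheduled, 404 otherwise.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class SpotReclaimWatcher {
    // Alibaba Cloud ECS metadata service (assumed path for spot instances).
    private static final URI TERMINATION_TIME = URI.create(
            "http://100.100.100.200/latest/meta-data/instance/spot/termination-time");

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2)).build();
        while (true) {
            HttpRequest req = HttpRequest.newBuilder(TERMINATION_TIME).GET().build();
            HttpResponse<String> resp =
                    client.send(req, HttpResponse.BodyHandlers.ofString());
            if (resp.statusCode() == 200) {
                // Reclamation scheduled: stop taking traffic and let another
                // broker take over the WAL volume (see the failover sketch).
                System.out.println("Spot reclaim at " + resp.body() + ", draining...");
                // broker.shutdownGracefully();  // hypothetical hook
                break;
            }
            Thread.sleep(5_000); // poll every few seconds
        }
    }
}
```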

Closing Remarks
Today, much of the foundational software that supports the rapid growth of big data and the internet was developed a decade ago. Yet software crafted for IDC environments does not translate into high efficiency or low costs in today's mature cloud computing landscape. There is therefore a significant push to redesign foundational software for the cloud, including observability storage, TP and AP databases, and data lake software. As an essential piece of stream storage software within the big data ecosystem, Kafka occupies a crucial position, representing 10% to 20% of IT spending in data-centric enterprises. Redesigning Kafka with cloud-native features is vital in today's climate of cost reduction. AutoMQ leverages deep cloud integration and cloud-native capabilities to reengineer Apache Kafka®, achieving a tenfold cost advantage. Compared with Kafka, AutoMQ's shared storage architecture has dramatically improved operational metrics such as partition reassignment, dynamic node scaling, and traffic self-balancing.
Cloud computing has ushered in a new era, and only by fully embracing a cloud-native approach can businesses move to the cloud without regrets. We are convinced that all foundational software should be reengineered on cloud architectures to fully capitalize on their benefits.
References

  1. AutoMQ, the open-source cloud-native version of Kafka: https://github.com/AutoMQ/automq
  2. Confluent recently introduced Tableflow, merging streaming and analytical computing: https://www.confluent.io/blog/introducing-tableflow/
  3. Official site of the open table format Iceberg: https://iceberg.apache.org/
  4. Why is it difficult to fully utilize the elasticity of public clouds? https://www.infoq.cn/article/tugbtfhemdiqlxm1x63y
  5. Alibaba Cloud's in-house developed "Shenlong Architecture": https://developer.aliyun.com/article/743920
  6. Announced at the 2023 Yunqi Conference, the Regional ESSD: https://developer.aliyun.com/article/1390447