Maintaining application availability during zone and Region outages
📑 Introduction
Amazon Elastic Kubernetes Service (EKS) is a widely used managed Kubernetes service for running applications in the cloud. However, the infrastructure it runs on, organized into Availability Zones (AZs) and Regions, can experience outages that disrupt your applications.
To address this, Amazon provides the Application Recovery Controller (ARC) with a key feature called zonal shift, which redirects traffic away from affected zones so that applications stay available.
In this blog post, we will explore how ARC zonal shift works within Amazon EKS, its benefits, and why it is essential for maintaining application availability during outages.
Let’s dive in!
⚡ What is the Application Recovery Controller (ARC)?
The Application Recovery Controller (ARC) helps manage and coordinate the recovery of applications across different AWS Regions and AZs. It provides insights into recovery readiness, making it easier to handle issues when they arise.
ARC supports the following resources for zonal shift and zonal autoshift:
- Amazon Elastic Kubernetes Service
- Application Load Balancers with cross-zone load balancing disabled
- Network Load Balancers with cross-zone load balancing enabled or disabled
Benefits of ARC:
- Manage Recovery for Multi-AZ and Multi-Region Applications: Quickly address problems for applications spanning multiple AZs or Regions, supporting both active-active and active-standby setups.
- Validate Recovery Readiness: Continuously monitor resources, checking limits, capacity, and configurations, and suggest fixes.
- Maintain High Availability: Manage failures for critical applications by quickly shifting traffic between environments.
- Automate Recovery: Automatically redirect traffic away from an AZ when AWS detects a potential failure.
What is ARC Zonal Shift?
ARC zonal shift moves network traffic away from an AZ that is experiencing problems, so applications keep running in the remaining AZs. You can start a shift manually or let AWS start one for you with zonal autoshift.
How It Helps During AZ Problems
When an AZ faces issues, zonal shift quickly redirects traffic to healthy AZs, reducing downtime and maintaining a better user experience.
How Does Zonal Shift Work?
Shifting Traffic Away from an Impaired AZ
Once a zonal shift is started for an impaired AZ, traffic is redirected to the healthy AZs to minimize user impact.
Role of AWS in Managing This Process
With zonal autoshift enabled, AWS monitors AZ health and triggers zonal shifts on your behalf, so traffic is redirected without manual intervention.
Benefits of Using ARC Zonal Shift
- Faster Recovery: Allows for quicker recovery during outages by redirecting traffic to healthy AZs.
- Reduced Downtime: Significantly reduces downtime by automatically managing traffic during AZ problems.
- Simplified Application Management: Automates traffic redirection, reducing the workload on your team.
- Enhanced Reliability: Quickly shifts traffic away from impaired zones, maintaining high availability.
- Improved User Experience: Minimizes downtime, ensuring applications remain accessible.
Amazon Application Recovery Controller’s (ARC) Zonal Shift in Amazon EKS
EKS Zonal Shift Requirements
To ensure effective ARC zonal shift in Amazon EKS:
- Distribute Worker Nodes Across Multiple AZs: Protect applications from issues in any single AZ.
- Ensure Sufficient Compute Capacity: Handle the loss of one AZ.
- Pre-scale Your Pods: Ensure enough Pods are ready to manage traffic.
- Spread Pod Replicas Across AZs: Maintain capacity in healthy AZs.
- Co-locate Related Pods: Maintain performance and connectivity.
- Test Your Setup: Verify cluster functionality with one less AZ.
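The first and fourth requirements above can be checked and applied from the command line. The following is a minimal sketch, assuming kubectl is configured for the cluster, nodes carry the standard topology.kubernetes.io/zone label, and Pods are labeled with app=<name>. The helper names nodes_per_zone and spread_across_zones are hypothetical and not part of kubectl.

```shell
#!/usr/bin/env bash

# Print "<zone> <node count>" for every AZ that has worker nodes.
nodes_per_zone() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}{end}' |
    sort | uniq -c | awk '{print $2, $1}'
}

# Ask the scheduler to spread a Deployment's Pods evenly across zones.
#   $1 = Deployment name
#   $2 = value of the Pods' "app" label (assumed labeling scheme)
# Note: a JSON merge patch replaces any existing topologySpreadConstraints.
spread_across_zones() {
  local patch
  patch=$(printf '{"spec":{"template":{"spec":{"topologySpreadConstraints":[{"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"ScheduleAnyway","labelSelector":{"matchLabels":{"app":"%s"}}}]}}}}' "$2")
  kubectl patch deployment "$1" --type merge -p "$patch"
}

# Example usage:
#   nodes_per_zone
#   spread_across_zones productpage productpage
```

ScheduleAnyway is used here so that Pods can still be scheduled when one AZ is unavailable; DoNotSchedule is stricter and may leave Pods pending during an impairment.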
Kubernetes has built-in features for resilience during AZ impairments. Using ARC with zonal shift and zonal autoshift enhances fault tolerance and recovery capabilities in Amazon EKS.
During an EKS zonal shift:
- Nodes in the impacted AZ are cordoned to prevent new Pods from being scheduled there.
- Availability Zone rebalancing is suspended for Managed Node Groups.
- Nodes in the unhealthy AZ are not terminated, and Pods are not evicted. This ensures that when a zonal shift expires or is cancelled, traffic can safely return to the AZ, which still has full capacity.
- The EndpointSlice controller removes Pod endpoints in the impaired AZ from relevant EndpointSlices.
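The effects listed above can be observed from the cluster while a shift is active. The following is a minimal sketch, assuming kubectl access and a Service named after the workload (for example productpage). The helper names are hypothetical.

```shell
#!/usr/bin/env bash

# Show every node with its zone label. Cordoned nodes report
# "SchedulingDisabled" in the STATUS column.
node_status_by_zone() {
  kubectl get nodes -L topology.kubernetes.io/zone
}

# Print "<endpoint address><TAB><zone>" for each endpoint of a Service.
#   $1 = Service name
# During a zonal shift, endpoints in the impaired AZ should no longer appear.
endpoint_zones() {
  kubectl get endpointslices -l "kubernetes.io/service-name=$1" \
    -o jsonpath='{range .items[*]}{range .endpoints[*]}{.addresses[0]}{"\t"}{.zone}{"\n"}{end}{end}'
}

# Example usage:
#   node_status_by_zone
#   endpoint_zones productpage
```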
Real-World Example
Consider an EKS cluster spread across three AZs. If one AZ experiences an outage, ARC zonal shift redirects traffic to the remaining healthy AZs, ensuring minimal interruption.
Bookinfo Application on Amazon EKS
The Bookinfo application, deployed in Amazon EKS, consists of four microservices operating across multiple AZs in the eu-west-1 region:
- Productpage: Aggregates data from details and reviews microservices.
- Details: Provides detailed information about books.
- Reviews: Manages user-generated reviews.
- Ratings: Handles ranking information for books.
Deployment in Amazon EKS
Deployed in Amazon EKS, the application benefits from high availability and scalability, distributed across multiple AZs.
How to Use Zonal Shift
Initiate a zonal shift manually or enable zonal autoshift for automatic traffic updates.
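A manual shift can be started from the AWS CLI. The following is a minimal sketch, assuming a recent AWS CLI v2 that includes the arc-zonal-shift commands and the --zonal-shift-config option of aws eks update-cluster-config. The function names, the cluster name, and the ARN in the usage comments are hypothetical placeholders.

```shell
#!/usr/bin/env bash

# One-time step: register the EKS cluster with ARC zonal shift.
#   $1 = cluster name
enable_cluster_zonal_shift() {
  aws eks update-cluster-config --name "$1" --zonal-shift-config enabled=true
}

# Start a manual zonal shift.
#   $1 = cluster ARN
#   $2 = AZ ID to shift away from (an ID such as euw1-az1, not a name such as eu-west-1a)
#   $3 = how long the shift lasts, for example 1h
start_shift() {
  aws arc-zonal-shift start-zonal-shift \
    --resource-identifier "$1" \
    --away-from "$2" \
    --expires-in "$3" \
    --comment "Shifting traffic away from impaired AZ"
}

# Cancel a shift before it expires.
#   $1 = zonal shift ID returned by start-zonal-shift
cancel_shift() {
  aws arc-zonal-shift cancel-zonal-shift --zonal-shift-id "$1"
}

# Example usage (placeholder values):
#   enable_cluster_zonal_shift bookinfo
#   start_shift arn:aws:eks:eu-west-1:111122223333:cluster/bookinfo euw1-az1 1h
```

A shift always has an expiry, so traffic returns to the AZ on its own unless the shift is extended.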
Hands-On: Simulating an AZ Outage
To test how your application behaves when an AZ is lost, you can simulate an outage by manually cordoning and draining the nodes in one of the AZs. This mimics an AZ impairment and lets you observe how traffic is redirected. For each node in that AZ, run kubectl cordon <node-name> followed by kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data to simulate the outage. Note that draining evicts Pods, which a real zonal shift does not do, so this approximates the loss of capacity rather than the exact zonal shift behavior.
Monitor your application to confirm that traffic is redirected to healthy AZs and that the application stays available. After testing, run kubectl uncordon <node-name> on each node to revert the simulation.
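Rather than running the commands node by node, the simulation can be applied to a whole AZ with a label selector. The following is a minimal sketch, assuming nodes carry the standard topology.kubernetes.io/zone label. The function names simulate_az_outage and restore_az are hypothetical.

```shell
#!/usr/bin/env bash

# Cordon and drain every node in one AZ.
#   $1 = zone name, for example eu-west-1a
simulate_az_outage() {
  local selector="topology.kubernetes.io/zone=$1"
  # Stop new Pods from being scheduled in the zone, then evict running Pods.
  kubectl cordon -l "$selector" &&
    kubectl drain -l "$selector" --ignore-daemonsets --delete-emptydir-data
}

# Make the nodes in the zone schedulable again.
#   $1 = zone name
restore_az() {
  kubectl uncordon -l "topology.kubernetes.io/zone=$1"
}

# Example usage:
#   simulate_az_outage eu-west-1a
#   kubectl get pods -o wide      # Pods should now run only in the other AZs
#   restore_az eu-west-1a
```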
Routing Traffic with Load Balancers
Application Load Balancers (ALBs) and Network Load Balancers (NLBs) automatically route traffic to healthy AZs during a zonal shift.
Understanding the Importance of ARC Zonal Shift
ARC zonal shift automates recovery by redirecting traffic away from impaired AZs, avoiding lengthy recovery steps and extended downtime.
Integration with AWS Services
ARC zonal shift modifies network traffic routing, working seamlessly with AWS Load Balancers and interacting with Amazon EC2 Auto Scaling Groups.
Enhancing Resilience Beyond Kubernetes Protections
ARC zonal shift complements Kubernetes' built-in protections, providing an additional layer of safety by isolating degraded AZs.
Automation with Zonal Autoshift
Enable ARC zonal autoshift for AWS to monitor AZ health and automatically trigger shifts, ensuring minimal disruption.
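Autoshift can also be switched on from the AWS CLI. The following is a minimal sketch, assuming a recent AWS CLI v2 and a cluster that is already enabled for zonal shift. Depending on your ARC setup, a practice run configuration may need to exist before autoshift can be enabled. The function names are hypothetical.

```shell
#!/usr/bin/env bash

# Turn zonal autoshift on or off for a resource.
#   $1 = resource ARN (for example the EKS cluster ARN)
#   $2 = ENABLED or DISABLED
set_autoshift() {
  aws arc-zonal-shift update-zonal-autoshift-configuration \
    --resource-identifier "$1" \
    --zonal-autoshift-status "$2"
}

# List the resources registered with ARC, including their autoshift status.
list_shift_resources() {
  aws arc-zonal-shift list-managed-resources
}

# Example usage (placeholder ARN):
#   set_autoshift arn:aws:eks:eu-west-1:111122223333:cluster/bookinfo ENABLED
#   list_shift_resources
```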
Preparing for Zonal Shifts
Pre-scale resources to ensure application availability during AZ issues.
Considerations for Stateful Applications
Assess fault tolerance for stateful applications, ensuring connectivity to persistent volumes in healthy AZs.
Compatibility with Karpenter and EKS Fargate
Karpenter does not support ARC zonal shift. During an AZ impairment, adjust your NodePool configuration so that new worker nodes are launched only in healthy AZs. ARC zonal shift does not apply to EKS Fargate.
Impact on the EKS Control Plane
ARC zonal shift affects the Kubernetes data plane, not the control plane.
Cost Considerations
ARC zonal shift and zonal autoshift are available at no extra charge, but you will incur costs for provisioned instances. Pre-scale your Kubernetes data plane to balance cost and availability effectively.
Conclusion
In this blog post, we explored the Amazon Application Recovery Controller (ARC) and its zonal shift feature, which is essential for maintaining application availability during AZ outages. We discussed how ARC zonal shift works, its benefits, and the steps to prepare your EKS cluster for effective zonal shifts. We also walked through a hands-on way to simulate an AZ outage and test how your application responds.
By using ARC zonal shift, you can improve the resilience and reliability of your applications running on Amazon EKS. This feature helps your applications remain operational during unexpected AZ impairments, reducing downtime and keeping the user experience smooth.