Multi-AZ EKS Architecture: Resilience Against Disasters
One of AWS's most fundamental reliability principles is that a system should not depend on a single Availability Zone (AZ), which is one or more physically separate data centers within a Region. A fire, a flood, or a power outage can take out an entire AZ. Running your Amazon EKS architecture across multiple AZs greatly reduces that risk.
EKS Control Plane is Already Multi-AZ
Good news: AWS automatically runs and manages the EKS Control Plane across multiple AZs (at least two API server instances and three etcd instances, spread over three AZs in the Region). What remains for you is to distribute your own Data Plane (Worker Nodes).
Worker Node Distribution
When creating your node groups (with Terraform or eksctl), pass subnets in at least 3 different AZs (e.g., eu-central-1a, eu-central-1b, eu-central-1c) to the subnets parameter.
The node group's underlying Auto Scaling Group then tries to balance nodes evenly across these AZs.
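As a rough illustration of the node group setup described above, an eksctl config file might look like the sketch below. The cluster name, node group name, instance type, and sizes are hypothetical placeholders, not values from this article; only the Region and AZ names follow the example given earlier.

```yaml
# Minimal sketch of an eksctl ClusterConfig (names and sizes are assumptions).
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: demo-cluster          # hypothetical cluster name
  region: eu-central-1

# AZs in which eksctl creates the cluster's subnets.
availabilityZones:
  - eu-central-1a
  - eu-central-1b
  - eu-central-1c

managedNodeGroups:
  - name: workers             # hypothetical node group name
    instanceType: m5.large    # assumption: pick what fits your workload
    minSize: 3                # at least one node per AZ
    desiredCapacity: 3
    maxSize: 6
    # Let the node group place nodes in all three AZs.
    availabilityZones:
      - eu-central-1a
      - eu-central-1b
      - eu-central-1c
```

A file like this would typically be applied with `eksctl create cluster -f cluster.yaml`. Keeping minSize at 3 or higher gives the Auto Scaling Group room to keep a node in every AZ.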
Pod Distribution (Topology Spread Constraints)
Even if your nodes are in different AZs, the scheduler's default spreading is only best effort, so Kubernetes can still place all 5 pods of your application on nodes in the same AZ. If that AZ goes down, your application goes down with it.
Use Topology Spread Constraints to prevent this:
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-app
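This rule tells the scheduler to keep the number of matching pods in any two zones within 1 of each other (maxSkew: 1). With whenUnsatisfiable: DoNotSchedule, a pod that would violate the rule stays Pending instead of landing in an already crowded zone; use ScheduleAnyway if you prefer a best-effort spread. In a Deployment, this block belongs in the pod template (spec.template.spec), not in the Deployment's top-level spec.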
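To show where the constraint sits in practice, here is a minimal sketch of a complete Deployment that uses it. The app label my-app and the 5 replicas come from the example above; the container name and image are placeholders chosen for illustration.

```yaml
# Sketch: Deployment whose pods are spread evenly across zones.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app            # must match the labelSelector below
    spec:
      topologySpreadConstraints:
        - maxSkew: 1           # zone pod counts may differ by at most 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: my-app
      containers:
        - name: web            # placeholder container name
          image: nginx:1.27    # placeholder image, replace with your own
          ports:
            - containerPort: 80
```

With 5 replicas and 3 zones, the only distribution that satisfies maxSkew: 1 is 2/2/1, so losing any single AZ leaves at least 3 pods running.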
Storage (EBS vs EFS)
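Caution: EBS volumes (including gp3) are bound to a single AZ. A pod in AZ-b cannot attach a volume that lives in AZ-a, so a pod that uses an EBS-backed volume is pinned to that volume's AZ and cannot be rescheduled elsewhere if the AZ fails. Solution:
- Design stateless applications that do not need a persistent disk.
- If persistent storage is mandatory, use Amazon EFS, whose Regional file systems can be mounted from every AZ.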
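For the EFS option, a minimal sketch of the Kubernetes side could look like this. It assumes the Amazon EFS CSI driver is already installed in the cluster and that an EFS file system exists; the file system ID, the object names, and the requested size are placeholders.

```yaml
# Sketch: dynamic provisioning on EFS (assumes the EFS CSI driver is installed).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc                             # hypothetical name
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap                 # one EFS access point per volume
  fileSystemId: fs-0123456789abcdef0       # placeholder, use your own file system ID
  directoryPerms: "700"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data                        # hypothetical name
spec:
  accessModes:
    - ReadWriteMany                        # pods in any AZ can mount it at once
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi                         # required by the API; EFS does not enforce it
```

Because the claim is ReadWriteMany and not tied to one zone, pods that mount it can still be spread across AZs with the topology constraint shown earlier.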
In our AWS Consultancy service, we always design critical systems with at least 3 AZs and topology-aware pod scheduling.