Problem Statement
Organizations running authentication services on AWS EKS Fargate face a critical challenge: when AWS initiates mandatory infrastructure patches, pods running on outdated infrastructure are evicted after the patch deadline. In traditional single-pod authentication architectures, this leads to:
- Complete authentication system failure
- All active user sessions being terminated
- Applications becoming inaccessible
- Service disruptions requiring manual intervention
- Loss of business continuity
This guide presents a comprehensive solution to build resilient authentication systems that maintain service availability during pod evictions.
Traditional Vulnerable Architecture
In a typical single-pod authentication setup:
# Vulnerable Deployment Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth-service
spec:
  replicas: 1  # Single point of failure
  selector:
    matchLabels:
      app: auth-service
  template:
    metadata:
      labels:
        app: auth-service
    spec:
      containers:
        - name: auth-service
          image: auth-service:latest
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: local-data
              mountPath: /data
      volumes:
        - name: local-data
          emptyDir: {}  # Non-persistent storage
Problems with Traditional Architecture
In many EKS deployments, applications directly integrate with authentication services, creating several critical issues:
- Single Point of Failure
  - Single authentication pod serving multiple applications
  - No redundancy in the authentication layer
  - Direct dependency between applications and the auth service
- Token Management Issues
  - No shared token storage
  - Session data lost during pod evictions
  - No token persistence across pod restarts
- Operational Challenges
  - Manual intervention required during patches
  - Service disruption during pod evictions
  - Complex recovery procedures
Impact Analysis
When AWS initiates mandatory patches on Fargate infrastructure, the following sequence occurs:
# Timeline of Events
1. AWS Announces Patch:
   - Notification received
   - Deadline set for infrastructure update
2. Grace Period Expires:
   - Authentication pod marked for eviction
   - SIGTERM signal sent to pod
3. Service Impact:
   - Authentication pod terminated
   - All active sessions lost
   - Applications unable to validate tokens
   - New authentication requests fail
4. Cascading Failures:
   - Application endpoints return 401/403 errors
   - User sessions terminated
   - Backend services disrupted
Resilient Architecture Solution
Understanding EKS Fargate Patch Management
The Patching Process
- Announcement Phase
  - AWS announces new patches for Fargate infrastructure
  - Notice is provided through the AWS Health Dashboard and email notifications
  - A deadline is communicated (typically several weeks in advance)
  - Patches may include security updates, bug fixes, or performance improvements
- Grace Period
  - New pods are scheduled on the updated infrastructure
  - Existing pods continue running on the old infrastructure
  - Organizations should use this time to test and plan their migration
- Enforcement Phase
  - After the deadline, AWS begins evicting pods from the outdated infrastructure
  - Pods receive a SIGTERM signal, followed by SIGKILL once the termination grace period expires (see the drain sketch after this list)
  - Evictions follow Kubernetes PodDisruptionBudget rules
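Because enforcement ends with SIGTERM and then SIGKILL, the authentication service should drain cleanly inside the pod's termination grace period. Below is a minimal Node.js/TypeScript sketch of that drain logic; the HTTP server wiring, port, and the 25-second safety budget are illustrative assumptions, not part of the original setup.

// Drain on SIGTERM: stop taking new connections, let in-flight requests finish,
// then exit before the kubelet escalates to SIGKILL.
import http from 'http';

const server = http.createServer((req, res) => {
  // ... authentication request handling ...
  res.end('ok');
});

server.listen(8080);

process.on('SIGTERM', () => {
  console.log('SIGTERM received, draining connections');

  // server.close() stops accepting new connections and waits for active ones
  server.close(() => {
    console.log('All connections drained, exiting');
    process.exit(0);
  });

  // Safety net: force exit before the grace period ends (budget is illustrative)
  setTimeout(() => process.exit(1), 25_000).unref();
});

Pair this with an appropriate terminationGracePeriodSeconds in the pod spec so the kubelet does not escalate to SIGKILL before the drain finishes.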
Migration Path
To migrate from the vulnerable architecture to a resilient solution:
- Phase 1: Add Redundancy
  - Deploy multiple auth pods
  - Implement load balancing
  - Add health checks
- Phase 2: Add Persistent Storage
  - Deploy a Redis cluster
  - Configure session persistence
  - Migrate to distributed tokens
- Phase 3: Improve Monitoring
  - Add metrics collection
  - Implement alerting
  - Create runbooks
- Phase 4: Update Applications
  - Implement circuit breakers
  - Add retry mechanisms (a sketch of both follows this list)
  - Update token handling
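Phase 4 asks the consuming applications to wrap calls to the auth service in circuit breakers and retries. The exact mechanism is a team choice; the hand-rolled TypeScript sketch below shows the idea, with the failure threshold, backoff schedule, AUTH_URL endpoint, and the global fetch API (Node 18+) all being illustrative assumptions.

// Illustrative retry-with-circuit-breaker wrapper around calls to the auth service
const AUTH_URL = process.env.AUTH_URL ?? 'http://auth-service/validate'; // assumed endpoint

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private threshold = 5, private cooldownMs = 30_000) {}

  private isOpen(): boolean {
    // Stay open until the cooldown elapses, then allow a trial request
    return this.failures >= this.threshold && Date.now() - this.openedAt < this.cooldownMs;
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.isOpen()) {
      throw new Error('Circuit open: auth service marked unavailable');
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      this.openedAt = Date.now();
      throw err;
    }
  }
}

const breaker = new CircuitBreaker();

// Retry transient failures with exponential backoff, but respect the breaker
async function validateToken(token: string, attempts = 3): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    try {
      return await breaker.call(async () => {
        const res = await fetch(AUTH_URL, {
          method: 'POST',
          headers: { Authorization: `Bearer ${token}` },
        });
        if (!res.ok) throw new Error(`auth service returned ${res.status}`);
        return true;
      });
    } catch (err) {
      if (i === attempts - 1) throw err;
      await new Promise(r => setTimeout(r, 2 ** i * 200)); // 200ms, 400ms, 800ms backoff
    }
  }
  return false;
}

A dedicated resilience library can replace the hand-rolled breaker; the important property is that token validation fails fast while the auth service recovers instead of piling up blocked requests.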
1. Deployment Configuration
# Resilient Authentication Service Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth-service
spec:
  replicas: 3  # Multiple replicas for redundancy
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: auth-service
  template:
    metadata:
      labels:
        app: auth-service
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: topology.kubernetes.io/zone
                labelSelector:
                  matchLabels:
                    app: auth-service
      containers:
        - name: auth-service
          image: auth-service:latest
          env:
            - name: DB_HOST
              value: "postgres-primary"
            - name: REDIS_HOSTS
              value: "redis-0.redis:6379,redis-1.redis:6379,redis-2.redis:6379"
            - name: CLUSTER_ENABLED
              value: "true"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
2. Persistent Storage Configuration
# PostgreSQL StatefulSet for User Data and Configuration
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 2
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:14
          env:
            - name: POSTGRES_DB
              value: authdb
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: username
            - name: POSTGRES_PASSWORD  # required by the postgres image; key name assumed
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: password
          volumeMounts:
            - name: postgres-data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: postgres-data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 20Gi
3. Session Management Configuration
# Redis Cluster for Session Management
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:6.2
          command:
            - redis-server
            - /usr/local/etc/redis/redis.conf
            - --cluster-enabled
            - "yes"
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: redis-data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: redis-data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 10Gi
4. Service Configuration
# Load Balanced Service
apiVersion: v1
kind: Service
metadata:
  name: auth-service
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: 8080
  selector:
    app: auth-service
---
# Pod Disruption Budget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: auth-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: auth-service
Implementation Details
1. Authentication Service Code
// Session management implementation backed by the Redis cluster
import { Cluster } from 'ioredis';

// Shape of the session payload (illustrative)
interface SessionData {
  userId: string;
  [key: string]: unknown;
}

class SessionManager {
  private redisCluster: Cluster;

  constructor() {
    this.redisCluster = new Cluster(
      (process.env.REDIS_HOSTS ?? '').split(',').map(host => ({
        host: host.split(':')[0],
        port: parseInt(host.split(':')[1], 10)
      }))
    );
  }

  async storeSession(sessionId: string, data: SessionData): Promise<void> {
    await this.redisCluster.set(
      `session:${sessionId}`,
      JSON.stringify(data),
      'EX',
      3600 // 1 hour expiry
    );
  }

  async getSession(sessionId: string): Promise<SessionData | null> {
    const data = await this.redisCluster.get(`session:${sessionId}`);
    return data ? JSON.parse(data) : null;
  }
}
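For context, a hypothetical login route wired to the SessionManager above could look like the following; the express dependency, the /login path, and the authenticateUser stub are assumptions for illustration, not part of the original service.

// Hypothetical login route built on the SessionManager above
import express from 'express';
import { randomUUID } from 'crypto';

const app = express();
app.use(express.json());

const sessions = new SessionManager(); // class defined above

// Placeholder credential check; replace with your real user lookup
async function authenticateUser(username: string, password: string): Promise<{ id: string } | null> {
  return username && password ? { id: 'demo-user' } : null;
}

app.post('/login', async (req, res) => {
  const user = await authenticateUser(req.body.username, req.body.password);
  if (!user) {
    return res.status(401).json({ error: 'invalid credentials' });
  }

  const sessionId = randomUUID();
  await sessions.storeSession(sessionId, { userId: user.id, issuedAt: Date.now() });
  res.json({ sessionId });
});

app.listen(8080);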
2. Database Schema
-- User and configuration management
CREATE TABLE users (
    id UUID PRIMARY KEY,
    username VARCHAR(255) UNIQUE,
    email VARCHAR(255),
    created_at TIMESTAMP WITH TIME ZONE,
    last_modified TIMESTAMP WITH TIME ZONE
);

CREATE TABLE configuration (
    key VARCHAR(255) PRIMARY KEY,
    value JSONB,
    last_modified TIMESTAMP WITH TIME ZONE
);

CREATE TABLE roles (
    id UUID PRIMARY KEY,
    name VARCHAR(255) UNIQUE,
    permissions JSONB
);
3. Health Check Implementation
class HealthCheck {
  async checkHealth(): Promise<HealthStatus> {
    const dbHealth = await this.checkDatabase();
    const redisHealth = await this.checkRedis();
    const systemHealth = await this.checkSystem();

    return {
      status: dbHealth.healthy && redisHealth.healthy && systemHealth.healthy
        ? 'healthy'
        : 'unhealthy',
      components: {
        database: dbHealth,
        redis: redisHealth,
        system: systemHealth
      }
    };
  }
}
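The HealthStatus type and the individual component checks are not shown above. A minimal sketch of what they might look like follows, assuming a pg connection pool and the ioredis cluster client; in practice these would be methods on HealthCheck holding references to the clients, and the query and memory threshold are illustrative.

// Assumed supporting types and component checks for the HealthCheck class above
import { Pool } from 'pg';
import { Cluster } from 'ioredis';

interface ComponentHealth {
  healthy: boolean;
  details?: string;
}

interface HealthStatus {
  status: 'healthy' | 'unhealthy';
  components: {
    database: ComponentHealth;
    redis: ComponentHealth;
    system: ComponentHealth;
  };
}

// Database round trip: the pool accepts and answers a trivial query
async function checkDatabase(db: Pool): Promise<ComponentHealth> {
  try {
    await db.query('SELECT 1');
    return { healthy: true };
  } catch (err) {
    return { healthy: false, details: String(err) };
  }
}

// PING against the Redis cluster used for session storage
async function checkRedis(redis: Cluster): Promise<ComponentHealth> {
  try {
    await redis.ping();
    return { healthy: true };
  } catch (err) {
    return { healthy: false, details: String(err) };
  }
}

// Simple process-level check; extend with event-loop lag, disk, etc.
async function checkSystem(): Promise<ComponentHealth> {
  const heapUsedMb = process.memoryUsage().heapUsed / 1024 / 1024;
  return { healthy: heapUsedMb < 512, details: `heapUsed=${heapUsedMb.toFixed(1)}MB` };
}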
Deployment Process
- Initial Setup
# Deploy storage layer
kubectl apply -f postgres-statefulset.yaml
kubectl apply -f redis-cluster.yaml

# Deploy authentication service
kubectl apply -f auth-deployment.yaml
kubectl apply -f auth-service.yaml
kubectl apply -f auth-pdb.yaml
- Verification
# Verify pod distribution
kubectl get pods -o wide

# Check cluster health
kubectl exec -it redis-0 -- redis-cli cluster info
kubectl exec -it postgres-0 -- psql -U auth_user -c "SELECT pg_is_in_recovery();"
Benefits of Resilient Architecture
- Zero-Downtime Operations
  - Continuous service during pod evictions
  - Automatic session migration
  - No manual intervention required
- High Availability
  - Multiple authentication pods
  - Distributed session storage
  - Replicated configuration data
- Scalability
  - Horizontal scaling capability
  - Load distribution
  - Resource optimization
- Maintenance Benefits
  - Easy updates and patches
  - No service disruption
  - Automatic failover
Building Resilient Applications
1. Implement Graceful Shutdown Handling
# Example Python application with Kubernetes-aware shutdown
import signal
import sys
import time


class Application:
    def __init__(self):
        self.running = True
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def handle_sigterm(self, signum, frame):
        print("Received SIGTERM signal")
        self.running = False
        self.graceful_shutdown()

    def graceful_shutdown(self):
        print("Starting graceful shutdown...")
        # 1. Stop accepting new requests
        # 2. Wait for ongoing requests to complete
        # 3. Close database connections
        time.sleep(10)  # Give time for the load balancer to deregister
        print("Shutdown complete")
        sys.exit(0)


app = Application()
2. State Management
# Example StatefulSet for stateful applications
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: stateful-app
spec:
  serviceName: stateful-service
  replicas: 3
  selector:
    matchLabels:
      app: stateful-app
  template:
    metadata:
      labels:
        app: stateful-app
    spec:
      containers:
        - name: app
          image: stateful-app:1.0
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 1Gi
Monitoring and Observability
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: auth-alerts
spec:
  groups:
    - name: auth-system
      rules:
        - alert: AuthenticationPodCount
          expr: |
            count(up{job="auth-service"}) < 2
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Insufficient authentication pods"
Prometheus and Kubernetes Events Monitoring
# Example PrometheusRule for monitoring pod evictions
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-eviction-alerts
spec:
  groups:
    - name: pod-evictions
      rules:
        - alert: HighPodEvictionRate
          expr: |
            sum(rate(kube_pod_deleted{reason="Evicted"}[5m])) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: High rate of pod evictions detected
Custom Metrics for Application Health
# Example ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-monitor
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
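The ServiceMonitor only tells Prometheus where to scrape; the application still has to expose a metrics endpoint on the named port. A minimal sketch using the prom-client and express packages (assumptions, not part of the original setup) could look like this:

// Exposes default Node.js metrics plus a custom auth latency histogram on /metrics
import express from 'express';
import client from 'prom-client';

const register = new client.Registry();
client.collectDefaultMetrics({ register });

// Custom metric: latency of authentication requests, labelled by outcome
const authDuration = new client.Histogram({
  name: 'auth_request_duration_seconds',
  help: 'Duration of authentication requests in seconds',
  labelNames: ['outcome'],
  registers: [register],
});

const app = express();

app.post('/login', async (_req, res) => {
  const end = authDuration.startTimer();
  try {
    // ... perform authentication ...
    end({ outcome: 'success' });
    res.sendStatus(200);
  } catch {
    end({ outcome: 'failure' });
    res.sendStatus(401);
  }
});

// The port this listens on must be exposed under the name referenced by the ServiceMonitor ("metrics")
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(8080);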
Conclusion
By implementing this resilient architecture:
- Pod evictions no longer cause service disruptions
- Authentication services remain available during patches
- No manual intervention required
- Applications maintain continuous operation
- User sessions persist across pod restarts
The key to success lies in:
- Distributed deployment
- Shared state management
- Proper health monitoring
- Automated failover mechanisms
- Regular testing and validation
This architecture ensures that AWS Fargate patch-related pod evictions become a routine operational event rather than a critical incident requiring immediate attention.