Problem Statement
Organizations running authentication services on AWS EKS Fargate face a critical challenge: when AWS initiates mandatory infrastructure patches, pods running on outdated infrastructure are evicted after the patch deadline. In traditional single-pod authentication architectures, this leads to:
- Complete authentication system failure
- All active user sessions being terminated
- Applications becoming inaccessible
- Service disruptions requiring manual intervention
- Loss of business continuity
This guide presents a comprehensive solution to build resilient authentication systems that maintain service availability during pod evictions.
Traditional Vulnerable Architecture
In a typical single-pod authentication setup:
# Vulnerable Deployment Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth-service
spec:
  replicas: 1  # Single point of failure
  selector:
    matchLabels:
      app: auth-service
  template:
    metadata:
      labels:
        app: auth-service
    spec:
      containers:
        - name: auth-service
          image: auth-service:latest
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: local-data
              mountPath: /data
      volumes:
        - name: local-data
          emptyDir: {}  # Non-persistent storage
Problems with Traditional Architecture
In many EKS deployments, applications directly integrate with authentication services, creating several critical issues:
- Single Point of Failure
  - Single authentication pod serving multiple applications
  - No redundancy in authentication layer
  - Direct dependency between applications and auth service
- Token Management Issues
  - No shared token storage
  - Session data lost during pod evictions
  - No token persistence across pod restarts
- Operational Challenges
  - Manual intervention required during patches
  - Service disruption during pod evictions
  - Complex recovery procedures
Impact Analysis
When AWS initiates mandatory patches on Fargate infrastructure, the following sequence occurs:
# Timeline of Events
1. AWS Announces Patch:
   - Notification received
   - Deadline set for infrastructure update
2. Grace Period Expires:
   - Authentication pod marked for eviction
   - SIGTERM signal sent to pod
3. Service Impact:
   - Authentication pod terminated
   - All active sessions lost
   - Applications unable to validate tokens
   - New authentication requests fail
4. Cascading Failures:
   - Application endpoints return 401/403 errors
   - User sessions terminated
   - Backend services disrupted
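Evictions triggered by the patch deadline surface as ordinary Kubernetes events, so the impact can be observed as it happens. A minimal check using standard kubectl (namespace flags omitted; the pod name is a placeholder):

# Watch for eviction events as AWS reclaims outdated Fargate infrastructure
kubectl get events --field-selector reason=Evicted --watch

# Inspect an affected pod for the eviction message
kubectl describe pod <auth-pod-name>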
Resilient Architecture Solution
Single Region
Understanding EKS Fargate Patch Management
The Patching Process
- Announcement Phase
  - AWS announces new patches for Fargate infrastructure
  - Notice is provided through the AWS Health Dashboard and email notifications
  - A deadline is communicated (typically several weeks in advance)
  - Patches may include security updates, bug fixes, or performance improvements
- Grace Period
  - New pods are scheduled on the updated infrastructure
  - Existing pods continue running on the old infrastructure
  - Organizations should use this time to test and plan their migration
- Enforcement Phase
  - After the deadline, AWS begins evicting pods from outdated infrastructure
  - Pods receive a SIGTERM signal, followed by SIGKILL once the termination grace period expires (see the example after this list)
  - Evictions respect Kubernetes PodDisruptionBudget rules
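Because the length of that SIGTERM-to-SIGKILL window is governed by the pod's termination grace period, it is worth setting it explicitly so in-flight requests can drain. A minimal pod spec excerpt (the 60-second value is an assumption; size it to your actual shutdown time):

# Deployment excerpt: extend the window between SIGTERM and SIGKILL
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60  # assumed value; the Kubernetes default is 30
      containers:
        - name: auth-service
          image: auth-service:latest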
Migration Path
To migrate from the vulnerable architecture to a resilient solution:
- Phase 1: Add Redundancy
  - Deploy multiple auth pods
  - Implement load balancing
  - Add health checks
- Phase 2: Add Persistent Storage
  - Deploy Redis cluster
  - Configure session persistence
  - Migrate to distributed tokens
- Phase 3: Improve Monitoring
  - Add metrics collection
  - Implement alerting
  - Create runbooks
- Phase 4: Update Applications
  - Implement circuit breakers (see the sketch after this list)
  - Add retry mechanisms
  - Update token handling
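As a concrete starting point for Phase 4, the sketch below wraps calls to the auth service in retries with exponential backoff and a simple circuit breaker. It is illustrative only: the AuthClient class, the /validate endpoint, and the thresholds are assumptions rather than part of the reference architecture.

// Illustrative retry + circuit-breaker wrapper; names, endpoint, and thresholds are assumptions
class AuthClient {
  private failures = 0;
  private openUntil = 0; // while the circuit is open, requests fail fast

  constructor(
    private baseUrl: string,
    private maxRetries = 3,
    private failureThreshold = 5,
    private openMs = 30_000
  ) {}

  async validateToken(token: string): Promise<boolean> {
    if (Date.now() < this.openUntil) {
      throw new Error('Auth circuit open; failing fast');
    }
    for (let attempt = 1; attempt <= this.maxRetries; attempt++) {
      try {
        // Node 18+ global fetch assumed
        const res = await fetch(`${this.baseUrl}/validate`, {
          method: 'POST',
          headers: { Authorization: `Bearer ${token}` }
        });
        this.failures = 0; // a successful call closes the breaker again
        return res.status === 200;
      } catch (err) {
        this.failures++;
        if (this.failures >= this.failureThreshold) {
          this.openUntil = Date.now() + this.openMs; // open the circuit
          throw err;
        }
        // exponential backoff before the next attempt
        await new Promise(resolve => setTimeout(resolve, 100 * 2 ** attempt));
      }
    }
    throw new Error('Auth service unavailable after retries');
  }
}

During a rolling eviction this keeps transient connection errors from reaching users, while a sustained outage trips the breaker instead of piling up retries.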
1. Deployment Configuration
# Resilient Authentication Service Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth-service
spec:
  replicas: 3  # Multiple replicas for redundancy
  selector:
    matchLabels:
      app: auth-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    metadata:
      labels:
        app: auth-service
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: topology.kubernetes.io/zone
                labelSelector:
                  matchLabels:
                    app: auth-service
      containers:
        - name: auth-service
          image: auth-service:latest
          env:
            - name: DB_HOST
              value: "postgres-primary"
            - name: REDIS_HOSTS
              value: "redis-0.redis:6379,redis-1.redis:6379,redis-2.redis:6379"
            - name: CLUSTER_ENABLED
              value: "true"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
2. Persistent Storage Configuration
# PostgreSQL StatefulSet for User Data and Configuration
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 2
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:14
          env:
            - name: POSTGRES_DB
              value: authdb
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: username
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: password
          volumeMounts:
            - name: postgres-data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: postgres-data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 20Gi
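The db-credentials Secret referenced above must exist before the StatefulSet starts. A minimal way to create it (the literal values are placeholders; auth_user matches the verification step later in this guide):

# Create the credentials Secret consumed by the PostgreSQL StatefulSet
kubectl create secret generic db-credentials \
  --from-literal=username=auth_user \
  --from-literal=password='<strong-password-here>'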
3. Session Management Configuration
# Redis Cluster for Session Management
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:6.2
          command:
            - redis-server
            - /usr/local/etc/redis/redis.conf
            - --cluster-enabled
            - "yes"
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: redis-data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: redis-data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 10Gi
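The node addresses used in REDIS_HOSTS (redis-0.redis, redis-1.redis, ...) only resolve if a headless Service named redis backs the StatefulSet. A minimal sketch:

# Headless Service giving each Redis pod a stable DNS name (redis-0.redis, redis-1.redis, ...)
apiVersion: v1
kind: Service
metadata:
  name: redis
spec:
  clusterIP: None  # headless: required for per-pod DNS records
  selector:
    app: redis
  ports:
    - port: 6379
      targetPort: 6379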
4. Service Configuration
# Load Balanced Service
apiVersion: v1
kind: Service
metadata:
  name: auth-service
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: 8080
  selector:
    app: auth-service
---
# Pod Disruption Budget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: auth-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: auth-service
Implementation Details
1. Authentication Service Code
// Session management implementation
import { Cluster } from 'ioredis';

// Illustrative session payload; adjust the fields to your token model
interface SessionData {
  userId: string;
  roles: string[];
  issuedAt: number;
}

class SessionManager {
  private redisCluster: Cluster;

  constructor() {
    // REDIS_HOSTS is a comma-separated list of host:port pairs
    const hosts = (process.env.REDIS_HOSTS ?? '').split(',');
    this.redisCluster = new Cluster(
      hosts.map(host => ({
        host: host.split(':')[0],
        port: parseInt(host.split(':')[1], 10)
      }))
    );
  }

  async storeSession(sessionId: string, data: SessionData): Promise<void> {
    await this.redisCluster.set(
      `session:${sessionId}`,
      JSON.stringify(data),
      'EX',
      3600 // 1 hour expiry
    );
  }

  async getSession(sessionId: string): Promise<SessionData | null> {
    const data = await this.redisCluster.get(`session:${sessionId}`);
    return data ? JSON.parse(data) : null;
  }
}
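A short usage sketch (the session values are illustrative): because the session lives in the Redis cluster rather than in pod memory, any replica can serve the follow-up request, which is what keeps logins alive across evictions.

// Any auth replica can read a session written by another replica
const sessions = new SessionManager();

await sessions.storeSession('abc123', {
  userId: 'user-42',        // illustrative values
  roles: ['reader'],
  issuedAt: Date.now()
});

const restored = await sessions.getSession('abc123');
console.log(restored?.userId); // "user-42", even if a different pod handles this call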
2. Database Schema
-- User and configuration management
CREATE TABLE users (
    id UUID PRIMARY KEY,
    username VARCHAR(255) UNIQUE,
    email VARCHAR(255),
    created_at TIMESTAMP WITH TIME ZONE,
    last_modified TIMESTAMP WITH TIME ZONE
);

CREATE TABLE configuration (
    key VARCHAR(255) PRIMARY KEY,
    value JSONB,
    last_modified TIMESTAMP WITH TIME ZONE
);

CREATE TABLE roles (
    id UUID PRIMARY KEY,
    name VARCHAR(255) UNIQUE,
    permissions JSONB
);
3. Health Check Implementation
// Aggregated health status returned to the probe endpoints
interface ComponentHealth { healthy: boolean; message?: string; }
interface HealthStatus {
  status: 'healthy' | 'unhealthy';
  components: { database: ComponentHealth; redis: ComponentHealth; system: ComponentHealth };
}

class HealthCheck {
  async checkHealth(): Promise<HealthStatus> {
    const dbHealth = await this.checkDatabase();
    const redisHealth = await this.checkRedis();
    const systemHealth = await this.checkSystem();
    return {
      status: dbHealth.healthy && redisHealth.healthy && systemHealth.healthy
        ? 'healthy'
        : 'unhealthy',
      components: {
        database: dbHealth,
        redis: redisHealth,
        system: systemHealth
      }
    };
  }

  // Component checks are elided in this guide; each verifies one dependency
  private async checkDatabase(): Promise<ComponentHealth> { return { healthy: true }; /* ...real check... */ }
  private async checkRedis(): Promise<ComponentHealth> { return { healthy: true }; /* ...real check... */ }
  private async checkSystem(): Promise<ComponentHealth> { return { healthy: true }; /* ...real check... */ }
}
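The Deployment's probes expect HTTP endpoints at /health and /health/live. A minimal wiring sketch that reuses the HealthCheck class above, assuming Express as the HTTP framework (that framework choice is an assumption, not prescribed by this guide):

// Expose health checks on the paths used by the readiness and liveness probes (Express assumed)
import express from 'express';

const app = express();
const health = new HealthCheck();

// Readiness: healthy only when all dependencies are reachable
app.get('/health', async (_req, res) => {
  const status = await health.checkHealth();
  res.status(status.status === 'healthy' ? 200 : 503).json(status);
});

// Liveness: the process itself is responsive, regardless of dependencies
app.get('/health/live', (_req, res) => {
  res.status(200).json({ status: 'alive' });
});

app.listen(8080);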
Deployment Process
- Initial Setup
# Deploy storage layer
kubectl apply -f postgres-statefulset.yaml
kubectl apply -f redis-cluster.yaml
# Deploy authentication service
kubectl apply -f auth-deployment.yaml
kubectl apply -f auth-service.yaml
kubectl apply -f auth-pdb.yaml
- Verification
# Verify pod distribution
kubectl get pods -o wide
# Check cluster health
kubectl exec -it redis-0 -- redis-cli cluster info
kubectl exec -it postgres-0 -- psql -U auth_user -d authdb -c "SELECT pg_is_in_recovery();"
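It is also worth rehearsing an eviction before the AWS deadline. Deleting a single auth pod should leave the service available while the Deployment replaces it (a sketch assuming the app=auth-service label used above):

# Confirm the PodDisruptionBudget is active
kubectl get pdb auth-pdb

# Delete one auth pod and watch the remaining replicas keep serving while it is replaced
kubectl delete pod $(kubectl get pods -l app=auth-service -o jsonpath='{.items[0].metadata.name}')
kubectl get pods -l app=auth-service -w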
Benefits of Resilient Architecture
- Zero-Downtime Operations
  - Continuous service during pod evictions
  - Automatic session migration
  - No manual intervention required
- High Availability
  - Multiple authentication pods
  - Distributed session storage
  - Replicated configuration data
- Scalability
  - Horizontal scaling capability
  - Load distribution
  - Resource optimization
- Maintenance Benefits
  - Easy updates and patches
  - No service disruption
  - Automatic failover
Building Resilient Applications
1. Implement Graceful Shutdown Handling
# Example Python application with Kubernetes-aware shutdown
import signal
import sys
import time

class Application:
    def __init__(self):
        self.running = True
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def handle_sigterm(self, signum, frame):
        print("Received SIGTERM signal")
        self.running = False
        self.graceful_shutdown()

    def graceful_shutdown(self):
        print("Starting graceful shutdown...")
        # 1. Stop accepting new requests
        # 2. Wait for ongoing requests to complete
        # 3. Close database connections
        time.sleep(10)  # Give time for load balancer to deregister
        print("Shutdown complete")
        sys.exit(0)

app = Application()
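The deregistration delay in graceful_shutdown can also be expressed declaratively as a preStop hook, which runs before the container receives SIGTERM. A minimal pod spec excerpt (the image name and sleep length are placeholders):

# Pod spec excerpt: pause before SIGTERM so the load balancer stops sending traffic
containers:
  - name: app
    image: my-app:1.0        # placeholder image
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 10"]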
2. State Management
# Example StatefulSet for stateful applications
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: stateful-app
spec:
  serviceName: stateful-service
  replicas: 3
  selector:
    matchLabels:
      app: stateful-app
  template:
    metadata:
      labels:
        app: stateful-app
    spec:
      containers:
        - name: app
          image: stateful-app:1.0
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 1Gi
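The serviceName: stateful-service field requires a matching headless Service so each replica gets a stable network identity. A minimal sketch (the port is an assumed application port):

# Headless Service backing the StatefulSet's stable pod identities
apiVersion: v1
kind: Service
metadata:
  name: stateful-service
spec:
  clusterIP: None
  selector:
    app: stateful-app
  ports:
    - port: 8080   # assumed application port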
Monitoring and Observability
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: auth-alerts
spec:
  groups:
    - name: auth-system
      rules:
        - alert: AuthenticationPodCount
          expr: |
            count(up{job="auth-service"}) < 2
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Insufficient authentication pods"
Prometheus and Kubernetes Events Monitoring
# Example PrometheusRule for monitoring pod evictions
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-eviction-alerts
spec:
  groups:
    - name: pod-evictions
      rules:
        - alert: HighPodEvictionRate
          expr: |
            sum(rate(kube_pod_deleted{reason="Evicted"}[5m])) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: High rate of pod evictions detected
Custom Metrics for Application Health
# Example ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-monitor
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
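The ServiceMonitor's port: metrics refers to a named port on the selected Service, so the application's Service must expose one. A minimal sketch (the 9090 port is an assumption about where the app serves Prometheus metrics):

# Service exposing a named "metrics" port for the ServiceMonitor to scrape
apiVersion: v1
kind: Service
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  selector:
    app: my-app
  ports:
    - name: metrics
      port: 9090         # assumed metrics port
      targetPort: 9090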
Conclusion
By implementing this resilient architecture:
- Pod evictions no longer cause service disruptions
- Authentication services remain available during patches
- No manual intervention required
- Applications maintain continuous operation
- User sessions persist across pod restarts
The key to success lies in:
- Distributed deployment
- Shared state management
- Proper health monitoring
- Automated failover mechanisms
- Regular testing and validation
This architecture ensures that AWS Fargate patch-related pod evictions become a routine operational event rather than a critical incident requiring immediate attention.