
Tuesday, November 5, 2024

[Solved] A Complete Guide to Handling AWS Fargate Pod Evictions: Building Resilient Authentication Systems (with Diagrams and Code)


Problem Statement

Organizations running authentication services on AWS EKS Fargate face a critical challenge: when AWS initiates mandatory infrastructure patches, pods running on outdated infrastructure are evicted after the patch deadline. In traditional single-pod authentication architectures, this leads to:

  1. Complete authentication system failure
  2. All active user sessions being terminated
  3. Applications becoming inaccessible
  4. Service disruptions requiring manual intervention
  5. Loss of business continuity

This guide presents a comprehensive solution to build resilient authentication systems that maintain service availability during pod evictions.

Traditional Vulnerable Architecture



In a typical single-pod authentication setup:


# Vulnerable Deployment Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth-service
spec:
  replicas: 1  # Single point of failure
  selector:
    matchLabels:
      app: auth-service
  template:
    metadata:
      labels:
        app: auth-service   # must match the selector above
    spec:
      containers:
        - name: auth-service
          image: auth-service:latest
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: local-data
              mountPath: /data
      volumes:
        - name: local-data
          emptyDir: {}  # Non-persistent storage

Problems with Traditional Architecture

In many EKS deployments, applications directly integrate with authentication services, creating several critical issues:

  1. Single Point of Failure
    • Single authentication pod serving multiple applications
    • No redundancy in authentication layer
    • Direct dependency between applications and auth service
  2. Token Management Issues
    • No shared token storage
    • Session data lost during pod evictions
    • No token persistence across pod restarts
  3. Operational Challenges
    • Manual intervention required during patches
    • Service disruption during pod evictions
    • Complex recovery procedures

Impact Analysis

When AWS initiates mandatory patches on Fargate infrastructure, the following sequence occurs:

# Timeline of Events
1. AWS Announces Patch:
   - Notification received
   - Deadline set for infrastructure update
2. Grace Period Expires:
   - Authentication pod marked for eviction
   - SIGTERM signal sent to pod
3. Service Impact:
   - Authentication pod terminated
   - All active sessions lost
   - Applications unable to validate tokens
   - New authentication requests fail
4. Cascading Failures:
   - Application endpoints return 401/403 errors
   - User sessions terminated
   - Backend services disrupted

Resilient Architecture Solution

Single Region:


Multi-Region:



Understanding EKS Fargate Patch Management

The Patching Process

  1. Announcement Phase
    • AWS announces new patches for Fargate infrastructure
    • Notice is provided through AWS Health Dashboard and email notifications
    • A deadline is communicated (typically several weeks in advance)
    • Patches may include security updates, bug fixes, or performance improvements
  2. Grace Period
    • New pods are scheduled on the updated infrastructure
    • Existing pods continue running on the old infrastructure
    • Organizations should use this time to test and plan their migration
  3. Enforcement Phase
    • After the deadline, AWS begins evicting pods from outdated infrastructure
    • Pods receive a SIGTERM signal, followed by SIGKILL once the grace period expires (see the termination-handling sketch after this list)
    • Evictions follow Kubernetes PodDisruptionBudget rules

Migration Path

To migrate from the vulnerable architecture to a resilient solution:

  1. Phase 1: Add Redundancy
    • Deploy multiple auth pods
    • Implement load balancing
    • Add health checks
  2. Phase 2: Add Persistent Storage
    • Deploy Redis cluster
    • Configure session persistence
    • Migrate to distributed tokens
  3. Phase 3: Improve Monitoring
    • Add metrics collection
    • Implement alerting
    • Create runbooks
  4. Phase 4: Update Applications
    • Implement circuit breakers
    • Add retry mechanisms (a hedged sketch of both follows this list)
    • Update token handling
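
How Phase 4 is implemented depends on your stack. If the cluster runs a service mesh such as Istio, circuit breaking and retries can live in mesh configuration instead of application code. The sketch below assumes Istio is installed; the host name and thresholds are illustrative values to tune against real traffic, not recommendations from this guide.

# Illustrative Istio circuit breaking and retries (assumes Istio is installed)
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: auth-service-circuit-breaker
spec:
  host: auth-service
  trafficPolicy:
    outlierDetection:            # eject unhealthy auth pods from the load-balancing pool
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: auth-service-retries
spec:
  hosts:
    - auth-service
  http:
    - route:
        - destination:
            host: auth-service
      retries:                   # retry transient failures against another replica
        attempts: 3
        perTryTimeout: 2s
        retryOn: connect-failure,refused-stream,5xx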


1. Deployment Configuration


# Resilient Authentication Service Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth-service
spec:
  replicas: 3  # Multiple replicas for redundancy
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: auth-service
  template:
    metadata:
      labels:
        app: auth-service
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: topology.kubernetes.io/zone
                labelSelector:
                  matchLabels:
                    app: auth-service
      containers:
        - name: auth-service
          image: auth-service:latest
          env:
            - name: DB_HOST
              value: "postgres-primary"
            - name: REDIS_HOSTS
              value: "redis-0.redis:6379,redis-1.redis:6379,redis-2.redis:6379"
            - name: CLUSTER_ENABLED
              value: "true"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10


2. Persistent Storage Configuration


# PostgreSQL StatefulSet for User Data and Configuration
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 2
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:14
          env:
            - name: POSTGRES_DB
              value: authdb
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: username
            - name: POSTGRES_PASSWORD  # required by the official postgres image; key assumed to exist in db-credentials
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: password
          volumeMounts:
            - name: postgres-data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: postgres-data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 20Gi


3. Session Management Configuration


# Redis Cluster for Session Management
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:6.2
          command:
            - redis-server
            - /usr/local/etc/redis/redis.conf  # expected to be mounted from a ConfigMap (not shown)
            - --cluster-enabled
            - "yes"
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: redis-data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: redis-data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 10Gi
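
Enabling cluster mode in the StatefulSet does not by itself join the three nodes into a cluster; they still need to be bootstrapped once they are running. Below is a minimal sketch of a one-shot bootstrap Job. The Job name and the redis-N.redis DNS names are assumptions based on the headless service above, and in some environments redis-cli needs resolved pod IPs rather than DNS names.

# Hypothetical one-shot Job to bootstrap the Redis cluster
apiVersion: batch/v1
kind: Job
metadata:
  name: redis-cluster-init
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: cluster-init
          image: redis:6.2
          command:
            - sh
            - -c
            - >
              redis-cli --cluster create
              redis-0.redis:6379 redis-1.redis:6379 redis-2.redis:6379
              --cluster-replicas 0 --cluster-yes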


4. Service Configuration


# Load Balanced Service
apiVersion: v1
kind: Service
metadata:
  name: auth-service
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: 8080
  selector:
    app: auth-service
---
# Pod Disruption Budget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: auth-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: auth-service

Implementation Details

1. Authentication Service Code


// Session management implementation (ioredis v4-style import)
import IORedis from 'ioredis';

// Shape of the session payload (fields illustrative)
interface SessionData {
  userId: string;
  roles: string[];
  issuedAt: number;
}

class SessionManager {
  private redisCluster: IORedis.Cluster;

  constructor() {
    this.redisCluster = new IORedis.Cluster(
      process.env.REDIS_HOSTS.split(',').map(host => ({
        host: host.split(':')[0],
        port: parseInt(host.split(':')[1])
      }))
    );
  }

  async storeSession(sessionId: string, data: SessionData): Promise<void> {
    await this.redisCluster.set(
      `session:${sessionId}`,
      JSON.stringify(data),
      'EX',
      3600 // 1 hour expiry
    );
  }

  async getSession(sessionId: string): Promise<SessionData | null> {
    const data = await this.redisCluster.get(`session:${sessionId}`);
    return data ? JSON.parse(data) : null;
  }
}

2. Database Schema


-- User and configuration management
CREATE TABLE users (
    id UUID PRIMARY KEY,
    username VARCHAR(255) UNIQUE,
    email VARCHAR(255),
    created_at TIMESTAMP WITH TIME ZONE,
    last_modified TIMESTAMP WITH TIME ZONE
);

CREATE TABLE configuration (
    key VARCHAR(255) PRIMARY KEY,
    value JSONB,
    last_modified TIMESTAMP WITH TIME ZONE
);

CREATE TABLE roles (
    id UUID PRIMARY KEY,
    name VARCHAR(255) UNIQUE,
    permissions JSONB
);

3. Health Check Implementation


// Aggregated health status returned to the readiness/liveness endpoints
interface ComponentHealth { healthy: boolean; }
interface HealthStatus {
  status: 'healthy' | 'unhealthy';
  components: {
    database: ComponentHealth;
    redis: ComponentHealth;
    system: ComponentHealth;
  };
}

class HealthCheck {
  async checkHealth(): Promise<HealthStatus> {
    const dbHealth = await this.checkDatabase();
    const redisHealth = await this.checkRedis();
    const systemHealth = await this.checkSystem();

    return {
      status: dbHealth.healthy && redisHealth.healthy && systemHealth.healthy
        ? 'healthy'
        : 'unhealthy',
      components: {
        database: dbHealth,
        redis: redisHealth,
        system: systemHealth
      }
    };
  }

  // The individual checks (e.g. a lightweight query against Postgres, PING against Redis)
  // are stubbed here for brevity
  private async checkDatabase(): Promise<ComponentHealth> { return { healthy: true }; }
  private async checkRedis(): Promise<ComponentHealth> { return { healthy: true }; }
  private async checkSystem(): Promise<ComponentHealth> { return { healthy: true }; }
}

Deployment Process

  1. Initial Setup

# Deploy storage layer
kubectl apply -f postgres-statefulset.yaml
kubectl apply -f redis-cluster.yaml

# Deploy authentication service
kubectl apply -f auth-deployment.yaml
kubectl apply -f auth-service.yaml
kubectl apply -f auth-pdb.yaml
  2. Verification

# Verify pod distribution
kubectl get pods -o wide

# Check cluster health
kubectl exec -it redis-0 -- redis-cli cluster info
kubectl exec -it postgres-0 -- psql -U auth_user -c "SELECT pg_is_in_recovery();"

Benefits of Resilient Architecture

  1. Zero-Downtime Operations
    • Continuous service during pod evictions
    • Automatic session migration
    • No manual intervention required
  2. High Availability
    • Multiple authentication pods
    • Distributed session storage
    • Replicated configuration data
  3. Scalability
    • Horizontal scaling capability (see the autoscaling sketch after this list)
    • Load distribution
    • Resource optimization
  4. Maintenance Benefits
    • Easy updates and patches
    • No service disruption
    • Automatic failover
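
The scaling benefit can be made concrete with a HorizontalPodAutoscaler. A minimal sketch follows; it assumes the auth-service Deployment declares CPU requests and that a metrics server is running in the cluster, and the 70% target utilisation is only a starting point.

# Sketch: autoscale the auth service on CPU
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: auth-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: auth-service
  minReplicas: 3                 # never drop below the redundancy baseline
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70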

Building Resilient Applications

1. Implement Graceful Shutdown Handling


# Example Python application with Kubernetes-aware shutdown
import signal
import sys
import time

class Application:
    def __init__(self):
        self.running = True
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def handle_sigterm(self, signum, frame):
        print("Received SIGTERM signal")
        self.running = False
        self.graceful_shutdown()

    def graceful_shutdown(self):
        print("Starting graceful shutdown...")
        # 1. Stop accepting new requests
        # 2. Wait for ongoing requests to complete
        # 3. Close database connections
        time.sleep(10)  # Give time for load balancer to deregister
        print("Shutdown complete")
        sys.exit(0)

app = Application()

2. State Management

# Example StatefulSet for stateful applications
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: stateful-app
spec:
  serviceName: stateful-service
  replicas: 3
  selector:
    matchLabels:
      app: stateful-app
  template:
    metadata:
      labels:
        app: stateful-app
    spec:
      containers:
        - name: app
          image: stateful-app:1.0
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 1Gi

Monitoring and Observability


apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: auth-alerts
spec:
  groups:
    - name: auth-system
      rules:
        - alert: AuthenticationPodCount
          expr: |
            count(up{job="auth-service"}) < 2
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Insufficient authentication pods"

Prometheus and Kubernetes Events Monitoring

# Example PrometheusRule for monitoring pod evictions
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-eviction-alerts
spec:
  groups:
    - name: pod-evictions
      rules:
        - alert: HighPodEvictionRate
          expr: |
            sum(rate(kube_pod_deleted{reason="Evicted"}[5m])) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: High rate of pod evictions detected

Custom Metrics for Application Health

# Example ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-monitor
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics

Conclusion

By implementing this resilient architecture:

  1. Pod evictions no longer cause service disruptions
  2. Authentication services remain available during patches
  3. No manual intervention required
  4. Applications maintain continuous operation
  5. User sessions persist across pod restarts

The key to success lies in:

  • Distributed deployment
  • Shared state management
  • Proper health monitoring
  • Automated failover mechanisms
  • Regular testing and validation (for example, the eviction drill sketched below)
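
One practical way to rehearse an eviction before AWS forces one is to kill an authentication pod deliberately and confirm that sessions survive. The sketch below assumes Chaos Mesh is installed in the cluster; it kills a single auth-service pod, which approximates, but does not exactly replicate, a Fargate eviction.

# Hypothetical eviction drill using Chaos Mesh (assumes Chaos Mesh is installed)
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: auth-eviction-drill
spec:
  action: pod-kill          # terminate one pod, similar in effect to an eviction
  mode: one                 # pick a single matching pod at random
  selector:
    namespaces:
      - default             # adjust to the namespace running auth-service
    labelSelectors:
      app: auth-service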

This architecture ensures that AWS Fargate patch-related pod evictions become a routine operational event rather than a critical incident requiring immediate attention.
