1.8 EKS: Production Architecture and Operations Guide
You’ve built the infrastructure. Now let’s talk about using it properly - the operational commands you’ll run daily, how to get production-ready, and, most importantly, how to architect applications that take full advantage of what you’ve built.
This concluding article ties everything together. We’ll cover the commands you need, the checklist before going live, and architectural patterns for different application types.
Quick Reference: Daily Operations
These are the commands you’ll use regularly. Bookmark this section.
Cluster Health
# Node status and resource usage
kubectl get nodes
kubectl top nodes
# Pod status across all namespaces
kubectl get pods -A
# Events (recent cluster activity)
kubectl get events --sort-by='.lastTimestamp' | tail -20
# Resource quotas and limits
kubectl describe nodes | grep -A 5 "Allocated resources"
Deployments
# Rollout status
kubectl rollout status deployment/my-app
# Rollback a deployment
kubectl rollout undo deployment/my-app
# Scale a deployment
kubectl scale deployment/my-app --replicas=5
# Restart pods (rolling)
kubectl rollout restart deployment/my-app
Debugging
# Pod logs
kubectl logs -f deployment/my-app
# Previous container logs (after crash)
kubectl logs pod/my-app-xxx --previous
# Exec into a pod
kubectl exec -it pod/my-app-xxx -- /bin/sh
# Describe pod (events, status, conditions)
kubectl describe pod/my-app-xxx
Secrets and Config
# List external secrets and sync status
kubectl get externalsecrets -A
# Force secret refresh
kubectl annotate externalsecret my-secret force-sync=$(date +%s) --overwrite
# View secret (base64 decoded)
kubectl get secret my-secret -o jsonpath='{.data.password}' | base64 -d
Images and Registry
# List ECR images
aws ecr describe-images --repository-name my-app --query 'imageDetails[*].[imageTags,imagePushedAt]'
# Get ECR login token
aws ecr get-login-password | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com
# Check image vulnerabilities
aws ecr describe-image-scan-findings --repository-name my-app --image-id imageTag=latest
Monitoring
# Pod resource usage
kubectl top pods
# CloudWatch log groups for the cluster
aws logs describe-log-groups --log-group-name-prefix /aws/containerinsights/my-cluster
# Recent logs from CloudWatch
aws logs filter-log-events \
  --log-group-name /aws/containerinsights/my-cluster/application \
  --start-time $(($(date +%s) - 3600))000 \
  --filter-pattern "ERROR"
Storage
# List persistent volumes and claims
kubectl get pv,pvc -A
# Check EBS CSI driver status
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver
Production Readiness Checklist
Before serving traffic, verify each item. This assumes you’re cloning your test infrastructure scripts for production - same patterns, different cluster name and values.
Infrastructure
| Item | How to Verify | Why It Matters |
|---|---|---|
| Multi-AZ nodes | kubectl get nodes -o wide shows nodes in different AZs | Survives AZ outages |
| Node auto-scaling | Check managed node group min/max in AWS Console | Handles traffic spikes |
| Private subnets | Nodes have private IPs only | Reduces attack surface |
| ECR scanning enabled | aws ecr describe-repositories shows scanOnPush: true | Catches vulnerabilities before deployment |
Security
| Item | How to Verify | Why It Matters |
|---|---|---|
| Pod Identity working | Pods can call AWS APIs without stored credentials | No leaked keys |
| Secrets in Secrets Manager | No secrets in ConfigMaps, env vars, or git | Centralized, auditable secrets |
| HTTPS only | Ingress redirects HTTP to HTTPS | Encrypted traffic |
| Network policies | kubectl get networkpolicies -A | Limits pod-to-pod traffic |
| IAM least privilege | Review IAM policies for overly broad access | Blast radius reduction |
Reliability
| Item | How to Verify | Why It Matters |
|---|---|---|
| Resource requests/limits set | kubectl describe pod shows requests and limits | Prevents resource starvation |
| Liveness/readiness probes | Deployment spec includes probes | Automatic recovery |
| PodDisruptionBudgets | kubectl get pdb -A | Survives node drains |
| Multiple replicas | kubectl get deployment shows replicas > 1 | No single point of failure |
Observability
| Item | How to Verify | Why It Matters |
|---|---|---|
| CloudWatch Observability | Addon is ACTIVE, dashboards show data | See what’s happening |
| Log retention set | Check CloudWatch log group retention | Control costs |
| Alarms configured | CloudWatch alarms for CPU, memory, errors | Get notified of issues |
| Application metrics | App exposes /metrics or sends custom metrics | Business visibility |
Operations
| Item | How to Verify | Why It Matters |
|---|---|---|
| Backup strategy | EBS snapshots scheduled, PV backup tested | Disaster recovery |
| Upgrade plan | Know current EKS version, target version | Security patches |
| Runbooks documented | Common operations and troubleshooting steps | Team can respond |
| Access controls | RBAC configured, audit logging enabled | Know who did what |
Test vs Production: What Changes
Your test cluster scripts are the foundation. For production, these values typically change:
| Setting | Test | Production |
|---|---|---|
| Cluster name | my-cluster-test | my-cluster-prod |
| Node count | 2 nodes minimum | 3+ nodes across AZs |
| Instance types | Smaller, more spot | Mix of on-demand and spot |
| Secrets path | EKS/test/* | EKS/prod/* |
| Log retention | 7 days | 30-90 days |
| Alarms | Optional | Required |
| Backup frequency | Manual | Automated daily |
The scripts themselves stay identical - just update the configuration variables at the top.
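For illustration, the variable block at the top of a production script might look like this (variable names are hypothetical - match whatever your scripts actually define):
# Production overrides (names are illustrative)
CLUSTER_NAME="my-cluster-prod"   # was my-cluster-test
NODE_COUNT_MIN=3                 # was 2
SECRETS_PREFIX="EKS/prod"        # was EKS/test
LOG_RETENTION_DAYS=30            # was 7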
Architecture Patterns
Here’s where everything comes together. Different application types need different combinations of what we’ve built.
Pattern 1: Stateless Web Application
Example: REST API, web frontend, microservice
What You Need:
- ECR for container images
- EKS with Deployment (multiple replicas)
- Ingress Controller for HTTPS/load balancing
- External Secrets for database credentials, API keys
- CloudWatch for logs and metrics
- Pod Identity if calling AWS services (S3, SQS, etc.)
What You Don’t Need:
- EBS Storage (stateless - no persistent volumes)
Architecture:
Internet --> ALB Ingress --> Deployment (3+ replicas)
Deployment --> External Secrets --> Secrets Manager
Deployment --> Pod Identity --> AWS Services
Key Decisions:
- Set CPU/memory requests based on load testing
- Use HorizontalPodAutoscaler for traffic-based scaling (a sketch follows the spec below)
- Configure readiness probes to prevent traffic to unhealthy pods
- Use rolling deployments with `maxUnavailable: 0` for zero-downtime deploys
Deployment Spec Essentials:
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    spec:
      containers:
      - name: app
        resources:
          requests:
            cpu: "100m"
            memory: "256Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 20
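The HorizontalPodAutoscaler mentioned above pairs with this spec. A minimal sketch, assuming the Metrics Server is installed (see What’s Next) and the deployment is named my-app:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out when average CPU exceeds 70% of requests
Target utilization is measured against the requests in the deployment spec - another reason to set them from load testing.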
Pattern 2: Stateful Application (Database, Cache)
Example: PostgreSQL, Redis, Elasticsearch
What You Need:
- Everything from Pattern 1, plus:
- EBS Storage for persistent volumes
- StatefulSet instead of Deployment
- Headless Service for stable network identity
Architecture:
App pods --> Headless Service --> StatefulSet pod --> PVC --> EBS volume (same AZ)
Key Decisions:
- Use `volumeBindingMode: WaitForFirstConsumer` to ensure the volume and its pod land in the same AZ
- Set an appropriate storage class (gp3 for general use, io2 for high IOPS; see the sketch after this list)
- Configure backup strategy (EBS snapshots or application-level backups)
- Consider managed services (RDS, ElastiCache) for less operational burden
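A StorageClass sketch matching the ebs-sc name used in the StatefulSet below, assuming the EBS CSI driver from Part 1.4 is installed:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer   # delay binding until the pod schedules, so volume and pod share an AZ
parameters:
  type: gp3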
When to Use EKS vs Managed Service:
| Use EKS StatefulSet When | Use Managed Service When |
|---|---|
| Need specific version/config not in managed service | Want zero operational overhead |
| Cost-sensitive at scale | Need multi-AZ automatic failover |
| Already have K8s expertise | Team is small |
| Custom extensions required | Standard configuration is fine |
StatefulSet Essentials:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 1
  template:
    spec:
      containers:
      - name: postgres
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: ebs-sc
      resources:
        requests:
          storage: 100Gi
Pattern 3: Background Worker / Queue Processor
Example: Job processor, message consumer, scheduled tasks
What You Need:
- ECR for container images
- EKS with Deployment (scales based on queue depth)
- Pod Identity for SQS/SNS access
- External Secrets for credentials
- CloudWatch for logs
- No Ingress (not internet-facing)
Architecture:
Worker Deployment (autoscaled) --> External Secrets --> Secrets Manager
Worker Deployment --> Pod Identity --> SQS Queue
Key Decisions:
- Scale based on queue depth, not CPU (use KEDA or custom metrics; see the sketch after this list)
- Set appropriate visibility timeout in SQS
- Implement graceful shutdown (handle SIGTERM)
- Use spot instances (workers are interruptible)
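As a sketch of queue-depth scaling, a KEDA ScaledObject for an SQS-driven worker (queue URL and deployment name are placeholders; assumes KEDA is installed with access to SQS):
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker          # deployment to scale (placeholder name)
  minReplicaCount: 0      # scale to zero when the queue is empty
  maxReplicaCount: 20
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/ACCOUNT/my-queue
      queueLength: "5"    # target messages per replica
      awsRegion: us-east-1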
Pod Identity for SQS:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes"
      ],
      "Resource": "arn:aws:sqs:us-east-1:ACCOUNT:my-queue"
    }
  ]
}
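To attach a role carrying this policy to the worker’s service account via Pod Identity (cluster, namespace, and role names are illustrative):
aws eks create-pod-identity-association \
  --cluster-name my-cluster \
  --namespace default \
  --service-account worker-sa \
  --role-arn arn:aws:iam::ACCOUNT:role/worker-sqs-role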
Pattern 4: Scheduled Jobs (Cron)
Example: Daily reports, cleanup tasks, data exports
What You Need:
- ECR for container images
- EKS with CronJob
- Pod Identity for AWS access
- External Secrets for credentials
- CloudWatch for logs
Architecture:
CronJob (schedule) --> Job pod --> Pod Identity --> AWS services
Key Decisions:
- Set `concurrencyPolicy: Forbid` to prevent overlapping runs
- Configure `startingDeadlineSeconds` for missed-schedule handling
- Set `activeDeadlineSeconds` to kill hung jobs
- Use `ttlSecondsAfterFinished` to clean up completed jobs
CronJob Essentials:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-report
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 300
  jobTemplate:
    spec:
      activeDeadlineSeconds: 3600
      ttlSecondsAfterFinished: 86400
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: report
            image: my-report:latest
Pattern 5: Multi-Tier Application
Example: Frontend + API + Database + Cache + Workers
What You Need:
- All components from previous patterns
- Network policies for isolation
- Multiple namespaces for organization
Architecture:
Internet --> ALB --> Frontend --> API
API --> PostgreSQL (StatefulSet)
API --> Redis Cache (StatefulSet)
SQS --> Worker --> PostgreSQL
Namespace Organization:
my-app-prod/
├── frontend/ # Web UI
├── backend/ # API services
├── data/ # Databases, caches
└── workers/ # Background processors
Key Decisions:
- Use NetworkPolicies to restrict traffic between tiers (see the sketch after this list)
- Separate External Secrets per namespace (different IAM permissions)
- Consider service mesh for mTLS between services
- Use resource quotas per namespace to prevent noisy neighbors
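As an example of tier isolation, a sketch that limits ingress to the data namespace to pods from backend (assumes the namespace layout above):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-backend-only
  namespace: data
spec:
  podSelector: {}          # applies to every pod in the data namespace
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: backend   # label Kubernetes sets on every namespace
Note that NetworkPolicies are only enforced if the cluster has a policy engine - either the VPC CNI’s network policy feature or Calico (listed under What’s Next).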
Component Decision Matrix
When do you need each component we’ve built?
| Component | Stateless API | Stateful DB | Worker | CronJob | Multi-Tier |
|---|---|---|---|---|---|
| ECR | Yes | Yes | Yes | Yes | Yes |
| EKS Base | Yes | Yes | Yes | Yes | Yes |
| Ingress | Yes | No | No | No | Frontend only |
| EBS Storage | No | Yes | No | No | Data tier |
| Pod Identity | If AWS calls | If AWS calls | Usually | Usually | Per service |
| External Secrets | Yes | Yes | Yes | Yes | Yes |
| CloudWatch | Yes | Yes | Yes | Yes | Yes |
Cost Optimization
EKS costs come from several sources. Here’s how to optimize each:
Compute (Biggest Cost)
Use Spot Instances for:
- Stateless workloads (web servers, APIs)
- Workers and background jobs
- Development/test environments
Use On-Demand for:
- Stateful workloads (databases)
- Critical single-replica services
- Cluster-critical supporting services (ingress controllers, operators)
Right-size Resources:
# Check actual vs requested resources
kubectl top pods
kubectl describe pod my-pod | grep -A 2 "Requests:"
If actual usage is consistently 20% of requests, you’re overprovisioned.
Storage
- Use gp3 instead of gp2 (20% cheaper, better performance)
- Set appropriate volume sizes (can expand later, can’t shrink)
- Delete unused PVCs. Find claims that never bound:
kubectl get pvc -A | grep -v Bound
Networking
- ALB is charged per hour + LCU. Consolidate ingresses where possible
- Use internal load balancers for service-to-service communication
- Enable VPC endpoints for ECR, S3, CloudWatch to avoid NAT charges
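Creating an interface endpoint looks roughly like this (IDs are placeholders; ECR image pulls need both the ecr.api and ecr.dkr endpoints plus an S3 gateway endpoint for layers):
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.$REGION.ecr.dkr \
  --subnet-ids $PRIVATE_SUBNET_IDS \
  --security-group-ids $ENDPOINT_SG_ID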
Monitoring
- Set log retention policies (don’t keep logs forever; see the command after this list)
- Use metric filters instead of storing all logs
- Disable Container Insights if using alternative monitoring
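For example, to cap retention on the application log group from the Quick Reference above:
aws logs put-retention-policy \
  --log-group-name /aws/containerinsights/my-cluster/application \
  --retention-in-days 30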
Upgrade Strategy
EKS releases new Kubernetes versions regularly. Plan for upgrades:
Before Upgrading
- Check version compatibility: `eksctl utils describe-addon-versions --cluster my-cluster`
- Review deprecation notices: `kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis`
- Test in non-production first: clone your scripts and create a test cluster on the new version
Upgrade Order
1. Control plane first: `eksctl upgrade cluster --name my-cluster --version 1.XX`
2. Update addons: EBS CSI, Pod Identity Agent, CloudWatch
3. Update node groups: rolling update with new AMI
4. Update kubectl: match the cluster version
Addon Compatibility
Check addon versions before cluster upgrade:
aws eks describe-addon-versions --addon-name vpc-cni --kubernetes-version 1.XX
Disaster Recovery
What to Back Up
| Component | Backup Method | Frequency |
|---|---|---|
| Application state | EBS snapshots | Daily |
| Cluster config | GitOps (manifests in git) | Every change |
| Secrets | Secrets Manager (built-in) | Automatic |
| ECR images | Cross-region replication | Automatic |
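Cross-region ECR replication is a one-time registry setting; a sketch (destination region is an example):
aws ecr put-replication-configuration \
  --replication-configuration \
  '{"rules":[{"destinations":[{"region":"us-west-2","registryId":"'"$ACCOUNT_ID"'"}]}]}'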
Recovery Scenarios
Pod failure: Kubernetes handles automatically (restarts, reschedules)
Node failure: Auto-scaling group replaces node, pods reschedule
AZ failure: Pods reschedule to nodes in other AZs (if multi-AZ)
Cluster failure:
1. Create a new cluster from your scripts
2. Restore EBS volumes from snapshots
3. Deploy applications from git
4. Secrets sync automatically from Secrets Manager
Region failure:
- Requires pre-planned multi-region setup
- ECR cross-region replication
- Secrets Manager cross-region replication
- DNS failover (Route 53)
Common Pitfalls
1. No Resource Limits
Problem: One pod consumes all node resources, starving others.
Solution: Always set requests and limits. Start conservative, adjust based on metrics.
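A LimitRange can backstop forgotten settings by applying defaults to containers that omit their own (namespace name is illustrative):
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: my-app-prod
spec:
  limits:
  - type: Container
    defaultRequest:    # applied when a container omits requests
      cpu: 100m
      memory: 128Mi
    default:           # applied when a container omits limits
      cpu: 500m
      memory: 512Mi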
2. Single Replica in Production
Problem: Any pod restart causes downtime.
Solution: Minimum 2 replicas for anything user-facing. Use PodDisruptionBudgets.
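A minimal PodDisruptionBudget sketch for the my-app deployment used throughout this guide:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app
spec:
  minAvailable: 1       # keep at least one pod up during voluntary disruptions (node drains, upgrades)
  selector:
    matchLabels:
      app: my-app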
3. Secrets in Environment Variables
Problem: Secrets visible in kubectl describe pod, logged accidentally.
Solution: Use External Secrets → Kubernetes Secrets → Volume mounts.
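The volume-mount end of that chain looks like this (secret name and mount path are illustrative):
spec:
  containers:
  - name: app
    volumeMounts:
    - name: creds
      mountPath: /etc/creds
      readOnly: true
  volumes:
  - name: creds
    secret:
      secretName: my-secret   # the Kubernetes Secret created by External Secrets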
4. No Readiness Probes
Problem: Traffic sent to pods that aren’t ready, causing errors.
Solution: Always configure readiness probes. Test they actually verify readiness.
5. Ignoring Pod Identity
Problem: AWS credentials in environment variables or mounted files.
Solution: Use Pod Identity for all AWS access. Zero stored credentials.
6. Oversized Persistent Volumes
Problem: Provisioned 500GB, using 10GB, paying for 500GB.
Solution: Start small, monitor usage, expand as needed (can’t shrink).
7. No Log Retention
Problem: CloudWatch costs grow unbounded.
Solution: Set retention policies on all log groups. 7-30 days for most use cases.
What’s Next
This series covered the foundational EKS infrastructure. Here’s what to explore next, with how to install each:
Deployment Automation
| Tool | Purpose | Install Method |
|---|---|---|
| ArgoCD | GitOps continuous delivery | Helm: helm install argocd argo/argo-cd |
| Flux | GitOps toolkit | CLI: flux bootstrap github |
| GitHub Actions | CI/CD pipelines | No cluster install (runs in GitHub) |
Advanced Scaling
| Tool | Purpose | Install Method |
|---|---|---|
| Karpenter | Intelligent node provisioning | Helm: helm install karpenter oci://public.ecr.aws/karpenter/karpenter |
| KEDA | Event-driven pod autoscaling | Helm: helm install keda kedacore/keda |
| Metrics Server | Pod metrics for HPA | Helm: helm install metrics-server metrics-server/metrics-server |
Networking & Service Mesh
| Tool | Purpose | Install Method |
|---|---|---|
| Istio | Service mesh, mTLS | CLI: istioctl install |
| Linkerd | Lightweight service mesh | CLI: linkerd install \| kubectl apply -f - |
| Calico | Network policies | Helm: helm install calico projectcalico/tigera-operator |
Security
| Tool | Purpose | Install Method |
|---|---|---|
| OPA Gatekeeper | Policy enforcement | Helm: helm install gatekeeper gatekeeper/gatekeeper |
| Falco | Runtime threat detection | Helm: helm install falco falcosecurity/falco |
| GuardDuty EKS | AWS threat detection | AWS Console: Enable in GuardDuty settings |
| Trivy Operator | Vulnerability scanning | Helm: helm install trivy-operator aqua/trivy-operator |
Observability (Beyond CloudWatch)
| Tool | Purpose | Install Method |
|---|---|---|
| Prometheus | Metrics collection | Helm: helm install prometheus prometheus-community/prometheus |
| Grafana | Dashboards | Helm: helm install grafana grafana/grafana |
| Loki | Log aggregation | Helm: helm install loki grafana/loki-stack |
Cost Management
| Tool | Purpose | Install Method |
|---|---|---|
| Kubecost | Cost allocation | Helm: helm install kubecost kubecost/cost-analyzer |
| AWS Cost Explorer | AWS-level costs | AWS Console (no cluster install) |
Most tools use Helm charts. Add the repos first:
# Common Helm repos
helm repo add argo https://argoproj.github.io/argo-helm
helm repo add kedacore https://kedacore.github.io/charts
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add gatekeeper https://open-policy-agent.github.io/gatekeeper/charts
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo add aqua https://aquasecurity.github.io/helm-charts
helm repo add kubecost https://kubecost.github.io/cost-analyzer
helm repo update
Series Summary
You’ve built production-ready EKS infrastructure:
| Part | Component | What It Provides |
|---|---|---|
| 1.1 | ECR | Secure container registry with vulnerability scanning |
| 1.2 | EKS Base | Kubernetes control plane with auto-scaling nodes |
| 1.3 | Ingress | HTTPS termination and load balancing |
| 1.4 | EBS Storage | Persistent volumes for stateful workloads |
| 1.5 | Pod Identity | Secure AWS access without credentials |
| 1.6 | External Secrets | Automated secrets sync from Secrets Manager |
| 1.7 | CloudWatch | Dashboards, metrics, and log collection |
| 1.8 | This Guide | Operations, production readiness, architecture |
The infrastructure is ready. Build something great.
Questions about EKS architecture? Reach out on socials.