1.7 EKS: CloudWatch Observability for Monitoring
A cluster without monitoring is flying blind. CloudWatch Observability gives you ready-to-use dashboards for your EKS cluster - CPU, memory, network, and logs - with zero configuration.
This is part 1.7 of the EKS Infrastructure Series. We’re building on the cluster from 1.6 External Secrets.
What We’re Building
By the end of this article, you’ll have:
- CloudWatch Observability addon installed via EKS
- Container Insights dashboards for cluster, node, and pod metrics
- Log collection via Fluent Bit
- Pod Identity authentication for secure AWS access
Why CloudWatch Observability?
EKS doesn’t include monitoring by default. Without it, you can’t answer basic questions:
- Is the cluster healthy?
- Which pods are using the most memory?
- Why did that deployment fail?
- Are nodes running out of resources?
CloudWatch Observability solves this with:
Zero-configuration dashboards - Pre-built views for cluster, node, pod, and container performance. No Grafana setup, no PromQL queries to write.
Integrated logging - Fluent Bit collects container logs and ships them to CloudWatch Logs. Search across all pods from one place.
AWS-native - Uses the same CloudWatch you already know. Alarms, dashboards, and log insights all work together.
Pod Identity integration - Uses the Pod Identity we set up in 1.5 for authentication. No IRSA configuration needed.
What Gets Installed
The CloudWatch Observability addon installs three components:
| Component | Purpose | Runs As |
|---|---|---|
| CloudWatch Agent | Collects metrics (CPU, memory, network, disk) | DaemonSet |
| Fluent Bit | Collects and ships container logs | DaemonSet |
| ADOT Collector | Prometheus metrics support | Deployment |
All components run in the amazon-cloudwatch namespace.
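Once the addon is installed (Step 2 below), you can see these components directly:
kubectl get daemonset,deployment -n amazon-cloudwatch
Expect the two DaemonSets plus at least one Deployment - exact names vary by addon version.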
Container Insights Dashboards
Once installed, you get pre-built dashboards in the CloudWatch console:
Cluster Overview
- Total CPU/memory utilization
- Node count and status
- Pod count by namespace
- Network throughput
Node Performance
- Per-node CPU and memory
- Disk I/O and usage
- Network in/out per node
- Node conditions (ready, disk pressure, memory pressure)
Pod Performance
- Per-pod resource usage
- Container restarts
- Pod status by namespace
- Resource requests vs actual usage
Container Performance
- Individual container metrics
- CPU throttling
- Memory limits and OOM events
Prerequisites
CloudWatch Observability requires Pod Identity (from 1.5). The addon uses Pod Identity to authenticate with CloudWatch APIs - no static credentials or IRSA configuration needed.
Verify Pod Identity is installed:
eksctl get addon --name eks-pod-identity-agent --cluster my-cluster
If not installed, go back to 1.5 Pod Identity first.
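If you only need the agent itself and not the full walkthrough, one way to install it without leaving this article:
eksctl create addon --name eks-pod-identity-agent --cluster my-cluster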
Project Structure
monitoring/
├── install-cloudwatch-observability.sh # Installation script
├── delete-cloudwatch-observability.sh # Cleanup script
└── cloudwatch-trust-policy.json # IAM trust policy for Pod Identity
Step 1: Create the IAM Trust Policy
The CloudWatch agent needs an IAM role to write metrics and logs. This trust policy allows Pod Identity to assume the role:
cloudwatch-trust-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "pods.eks.amazonaws.com"
      },
      "Action": [
        "sts:AssumeRole",
        "sts:TagSession"
      ]
    }
  ]
}
This is the same trust policy pattern we used for Pod Identity testing - trust pods.eks.amazonaws.com.
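After Step 2 creates the role, you can confirm the trust relationship took effect:
aws iam get-role \
  --role-name CloudWatchObservabilityRole \
  --query 'Role.AssumeRolePolicyDocument.Statement[0].Principal'
This should print the pods.eks.amazonaws.com service principal.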
Step 2: Create the Installation Script
install-cloudwatch-observability.sh
#!/bin/bash
# Configuration
CLUSTER_NAME="my-cluster"
REGION="us-east-1"
echo "Installing CloudWatch Observability"
echo "===================================="
echo "Cluster: $CLUSTER_NAME"
echo "Region: $REGION"
echo ""
echo "Checking prerequisites..."
if ! kubectl get nodes &>/dev/null; then
  echo "ERROR: Cluster not found or kubectl not configured"
  exit 1
fi
# Check if Pod Identity is installed
if ! eksctl get addon --name eks-pod-identity-agent --cluster "$CLUSTER_NAME" 2>/dev/null | grep -q "ACTIVE"; then
  echo "ERROR: Pod Identity Agent is required but not installed"
  echo "       Install it first: see 1.5 Pod Identity article"
  exit 1
fi
echo "Prerequisites verified"
echo ""
# Get account ID
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
# Get script directory for finding trust policy
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
echo "Creating IAM role for CloudWatch Agent..."
aws iam create-role \
  --role-name CloudWatchObservabilityRole \
  --assume-role-policy-document "file://$SCRIPT_DIR/cloudwatch-trust-policy.json" \
  2>/dev/null || echo "Role already exists"
# CloudWatchAgentServerPolicy covers metrics
aws iam attach-role-policy \
  --role-name CloudWatchObservabilityRole \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
  2>/dev/null || true
# CloudWatchLogsFullAccess allows Fluent Bit to create log groups and write logs
aws iam attach-role-policy \
  --role-name CloudWatchObservabilityRole \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchLogsFullAccess \
  2>/dev/null || true
echo "IAM role created with metrics and logs permissions"
echo ""
echo "Installing CloudWatch Observability addon with Pod Identity..."
aws eks create-addon \
  --addon-name amazon-cloudwatch-observability \
  --cluster-name "$CLUSTER_NAME" \
  --pod-identity-associations serviceAccount=cloudwatch-agent,roleArn=arn:aws:iam::$ACCOUNT_ID:role/CloudWatchObservabilityRole \
  --region "$REGION"
if [[ $? -ne 0 ]]; then
  echo "ERROR: Failed to install CloudWatch Observability addon"
  exit 1
fi
echo "Waiting for CloudWatch Observability to be ready..."
sleep 30
kubectl wait --for=condition=Ready pods -n amazon-cloudwatch \
  -l app.kubernetes.io/name=cloudwatch-agent --timeout=300s
kubectl wait --for=condition=Ready pods -n amazon-cloudwatch \
  -l app.kubernetes.io/name=fluent-bit --timeout=300s
echo ""
echo "SUCCESS! CloudWatch Observability installed"
echo "============================================"
echo "Components installed:"
echo " - CloudWatch Agent (metrics)"
echo " - Fluent Bit (logs)"
echo " - ADOT Collector (Prometheus metrics)"
echo " - Container Insights dashboards"
echo ""
echo "Access dashboards:"
echo " https://console.aws.amazon.com/cloudwatch/home?region=$REGION#container-insights:performance/EKS:Cluster"
echo ""
echo "Verify with:"
echo " kubectl get pods -n amazon-cloudwatch"
Step 3: Create the Cleanup Script
delete-cloudwatch-observability.sh
#!/bin/bash
# Configuration
CLUSTER_NAME="my-cluster"
REGION="us-east-1"
echo "Removing CloudWatch Observability"
echo "=================================="
echo "Cluster: $CLUSTER_NAME"
echo ""
echo "WARNING: This will remove:"
echo " - CloudWatch Observability addon"
echo " - All monitoring dashboards"
echo " - Log collection"
echo ""
read -p "Are you sure? (yes/no): " confirm
if [[ $confirm != "yes" ]]; then
  echo "Operation cancelled"
  exit 0
fi
echo ""
echo "Removing Pod Identity associations..."
# Delete Pod Identity associations for amazon-cloudwatch namespace
aws eks list-pod-identity-associations \
  --cluster-name "$CLUSTER_NAME" \
  --region "$REGION" \
  --query 'associations[?namespace==`amazon-cloudwatch`].associationId' \
  --output text | tr '\t' '\n' | while read -r association_id; do
  if [[ -n "$association_id" ]]; then
    echo "Deleting association: $association_id"
    aws eks delete-pod-identity-association \
      --cluster-name "$CLUSTER_NAME" \
      --association-id "$association_id" \
      --region "$REGION" 2>/dev/null || true
  fi
done
echo "Removing CloudWatch Observability addon..."
eksctl delete addon \
  --name amazon-cloudwatch-observability \
  --cluster "$CLUSTER_NAME" 2>/dev/null || true
echo "Cleaning up namespace..."
kubectl delete namespace amazon-cloudwatch --ignore-not-found=true
echo "Cleaning up IAM role..."
aws iam detach-role-policy \
  --role-name CloudWatchObservabilityRole \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
  2>/dev/null || true
aws iam detach-role-policy \
  --role-name CloudWatchObservabilityRole \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchLogsFullAccess \
  2>/dev/null || true
aws iam delete-role \
  --role-name CloudWatchObservabilityRole \
  2>/dev/null || true
echo "Waiting for cleanup..."
sleep 15
echo ""
echo "Cleanup complete!"
echo "================="
echo "Removed:"
echo " - CloudWatch Observability addon"
echo " - CloudWatch Agent and Fluent Bit"
echo " - IAM role"
echo ""
echo "Note: Historical CloudWatch data is preserved"
echo " (logs and metrics remain in CloudWatch)"
echo ""
echo "To delete log groups (removes all historical logs):"
echo " aws logs delete-log-group --log-group-name /aws/containerinsights/my-cluster/application"
echo " aws logs delete-log-group --log-group-name /aws/containerinsights/my-cluster/dataplane"
echo " aws logs delete-log-group --log-group-name /aws/containerinsights/my-cluster/host"
echo " aws logs delete-log-group --log-group-name /aws/containerinsights/my-cluster/performance"
Step 4: Run the Installation
chmod +x install-cloudwatch-observability.sh
./install-cloudwatch-observability.sh
Expected output:
SUCCESS! CloudWatch Observability installed
============================================
Components installed:
- CloudWatch Agent (metrics)
- Fluent Bit (logs)
- ADOT Collector (Prometheus metrics)
- Container Insights dashboards
Access dashboards:
https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#container-insights:performance/EKS:Cluster
Verify with:
kubectl get pods -n amazon-cloudwatch
Step 5: Verify Installation
Check the pods in the amazon-cloudwatch namespace:
kubectl get pods -n amazon-cloudwatch
Expected output:
NAME READY STATUS RESTARTS AGE
cloudwatch-agent-xxxxx 1/1 Running 0 2m
cloudwatch-agent-yyyyy 1/1 Running 0 2m
fluent-bit-xxxxx 1/1 Running 0 2m
fluent-bit-yyyyy 1/1 Running 0 2m
You should see one cloudwatch-agent and one fluent-bit pod per node (they’re DaemonSets).
Check the addon status:
eksctl get addon --name amazon-cloudwatch-observability --cluster my-cluster
Expected output:
NAME VERSION STATUS ISSUES
amazon-cloudwatch-observability v2.1.0-eksbuild.1 ACTIVE 0
Step 6: Access the Dashboards
Open the CloudWatch Container Insights console:
https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#container-insights:performance
Or navigate manually:
- Go to CloudWatch in the AWS Console
- Click Container Insights in the left navigation
- Select Performance monitoring
- Choose your cluster from the dropdown
Dashboards take 5-10 minutes to populate with data after installation.
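If you prefer the CLI, you can confirm metrics are arriving before the dashboards render:
aws cloudwatch list-metrics \
  --namespace ContainerInsights \
  --dimensions Name=ClusterName,Value=my-cluster \
  --query 'Metrics[].MetricName' \
  --output text
An empty result usually just means the first scrape hasn't landed yet - retry after a few minutes.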
Step 7: Verify Log Groups Were Created
Before testing, verify the log groups exist. Fluent Bit creates these automatically:
aws logs describe-log-groups \
--log-group-name-prefix /aws/containerinsights/my-cluster \
--query 'logGroups[*].logGroupName'
Expected output:
[
"/aws/containerinsights/my-cluster/application",
"/aws/containerinsights/my-cluster/dataplane",
"/aws/containerinsights/my-cluster/host",
"/aws/containerinsights/my-cluster/performance"
]
If the log groups don't exist, check that:
- Fluent Bit pods are running: kubectl get pods -n amazon-cloudwatch -l app.kubernetes.io/name=fluent-bit
- The IAM role has CloudWatchLogsFullAccess attached
- Fluent Bit logs show no errors: kubectl logs -n amazon-cloudwatch -l app.kubernetes.io/name=fluent-bit
Step 8: Test Log Collection
Now let’s verify logs flow from pods to CloudWatch.
A. Create a Test Pod
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: log-generator
  namespace: default
spec:
  containers:
  - name: logger
    image: busybox
    command: ["/bin/sh", "-c"]
    args:
    - |
      echo "Log generator started"
      while true; do
        echo "Test log message at \$(date)"
        sleep 5
      done
EOF
B. Wait for the Pod and Verify It’s Logging
kubectl wait --for=condition=Ready pod/log-generator --timeout=60s
# Verify the pod is generating logs locally
kubectl logs log-generator --tail=3
You should see:
Log generator started
Test log message at Sat Jan 17 12:00:00 UTC 2026
Test log message at Sat Jan 17 12:00:05 UTC 2026
C. Wait for Logs to Appear in CloudWatch
Fluent Bit batches logs, so wait 1-2 minutes for logs to appear in CloudWatch.
# Wait for log delivery
sleep 120
D. Verify Logs in CloudWatch
Check if logs arrived:
aws logs filter-log-events \
--log-group-name /aws/containerinsights/my-cluster/application \
--filter-pattern "log-generator" \
--start-time $(($(date +%s) - 300))000 \
--limit 5 \
--query 'events[*].message'
Expected output (JSON array of log messages):
[
"... \"log\":\"Test log message at Fri Jan 17 12:00:00 UTC 2026\\n\" ...",
"... \"log\":\"Test log message at Fri Jan 17 12:00:05 UTC 2026\\n\" ..."
]
If you get an empty result [], the logs haven't arrived yet. Common causes:
- Log group doesn't exist - check Step 7 above
- IAM permissions - verify CloudWatchLogsFullAccess is attached to the role
- Fluent Bit errors - check kubectl logs -n amazon-cloudwatch -l app.kubernetes.io/name=fluent-bit
E. View Logs in the Console
- Go to CloudWatch in the AWS Console
- Click Log groups in the left navigation
- Click /aws/containerinsights/my-cluster/application
- Click a recent log stream (named after the node)
- Search for log-generator
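If you prefer the terminal, the same search works through CloudWatch Logs Insights. A minimal sketch - it assumes the default Container Insights log format, where Fluent Bit nests pod metadata under a kubernetes key:
QUERY_ID=$(aws logs start-query \
  --log-group-name /aws/containerinsights/my-cluster/application \
  --start-time $(($(date +%s) - 600)) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, log | filter kubernetes.pod_name = "log-generator" | limit 5' \
  --query queryId --output text)
sleep 5  # Logs Insights queries run asynchronously; re-run the next command until status is Complete
aws logs get-query-results --query-id "$QUERY_ID"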
F. Clean Up Test Resources
kubectl delete pod log-generator
Understanding the Metrics
CloudWatch Observability collects metrics at multiple levels:
Cluster Metrics
| Metric | Description |
|---|---|
| cluster_failed_node_count | Nodes in NotReady state |
| cluster_node_count | Total node count |
| namespace_number_of_running_pods | Pods per namespace |
Node Metrics
| Metric | Description |
|---|---|
| node_cpu_utilization | CPU usage percentage |
| node_memory_utilization | Memory usage percentage |
| node_filesystem_utilization | Disk usage percentage |
| node_network_total_bytes | Network throughput |
Pod Metrics
| Metric | Description |
|---|---|
| pod_cpu_utilization | CPU vs requested |
| pod_memory_utilization | Memory vs requested |
| pod_network_rx_bytes | Network received |
| pod_network_tx_bytes | Network transmitted |
Container Metrics
| Metric | Description |
|---|---|
| container_cpu_utilization | Per-container CPU |
| container_memory_utilization | Per-container memory |
| container_restart_count | Restart count |
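All of these live in the ContainerInsights namespace, so you can query them like any other CloudWatch metric. A sketch pulling the last hour of node CPU at 5-minute resolution (GNU date shown; on macOS use date -u -v-1H):
aws cloudwatch get-metric-statistics \
  --namespace ContainerInsights \
  --metric-name node_cpu_utilization \
  --dimensions Name=ClusterName,Value=my-cluster \
  --statistics Average \
  --period 300 \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)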
Creating Alarms
Use CloudWatch Alarms to get notified of issues:
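Both alarm examples below publish to an SNS topic named alerts - that topic name and the email address here are placeholders, not something an earlier article created. If you don't have a topic yet:
aws sns create-topic --name alerts
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:ACCOUNT_ID:alerts \
  --protocol email \
  --notification-endpoint you@example.com
Replace ACCOUNT_ID with your account ID; the alarm commands below use the same placeholder.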
High CPU Alarm Example
aws cloudwatch put-metric-alarm \
--alarm-name "EKS-HighCPU" \
--alarm-description "Alert when cluster CPU exceeds 80%" \
--metric-name node_cpu_utilization \
--namespace ContainerInsights \
--statistic Average \
--period 300 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--dimensions Name=ClusterName,Value=my-cluster \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:us-east-1:ACCOUNT_ID:alerts
Node Not Ready Alarm
aws cloudwatch put-metric-alarm \
--alarm-name "EKS-NodeNotReady" \
--alarm-description "Alert when nodes are not ready" \
--metric-name cluster_failed_node_count \
--namespace ContainerInsights \
--statistic Maximum \
--period 60 \
--threshold 0 \
--comparison-operator GreaterThanThreshold \
--dimensions Name=ClusterName,Value=my-cluster \
--evaluation-periods 1 \
--alarm-actions arn:aws:sns:us-east-1:ACCOUNT_ID:alerts
Delete Alarms
To remove alarms you’ve created:
aws cloudwatch delete-alarms --alarm-names "EKS-HighCPU" "EKS-NodeNotReady"
Cost Considerations
CloudWatch Observability has associated costs:
| Component | Cost |
|---|---|
| Custom metrics | $0.30 per metric per month (first 10k) |
| Log ingestion | $0.50 per GB |
| Log storage | $0.03 per GB per month |
| Dashboard | Free (Container Insights dashboards) |
For a small cluster (3 nodes, 20 pods), expect roughly $15-30/month. Costs scale with:
- Number of nodes and pods (more metrics)
- Log volume (application log verbosity)
- Retention period for logs
Reducing Costs
- Set log retention - Don't keep logs forever (a loop applying this to all four log groups follows this list):
  aws logs put-retention-policy \
    --log-group-name /aws/containerinsights/my-cluster/application \
    --retention-in-days 7
- Filter noisy logs - Exclude high-volume, low-value logs in the Fluent Bit config
- Use metric filters - Create alarms from logs instead of storing everything
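The same retention policy applied to all four Container Insights log groups in one pass:
for lg in application dataplane host performance; do
  aws logs put-retention-policy \
    --log-group-name "/aws/containerinsights/my-cluster/$lg" \
    --retention-in-days 7
done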
Troubleshooting
Pods Stuck in Pending
Check if the DaemonSet can schedule:
kubectl describe pod -n amazon-cloudwatch -l app.kubernetes.io/name=cloudwatch-agent
Common causes:
- Node taints preventing scheduling
- Resource constraints
No Metrics in Dashboard
Check if the agent can reach CloudWatch:
kubectl logs -n amazon-cloudwatch -l app.kubernetes.io/name=cloudwatch-agent
Look for authentication errors - usually means Pod Identity association is missing.
Logs Not Appearing
Check Fluent Bit:
kubectl logs -n amazon-cloudwatch -l app.kubernetes.io/name=fluent-bit
Common issues:
- Log group doesn’t exist (Fluent Bit creates it, but check permissions)
- IAM policy doesn’t allow log writes
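A quick check for the second point:
aws iam list-attached-role-policies \
  --role-name CloudWatchObservabilityRole \
  --query 'AttachedPolicies[].PolicyName'
Both CloudWatchAgentServerPolicy and CloudWatchLogsFullAccess should appear in the output.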
Complete Cleanup
To remove everything created in this article:
A. Remove Test Resources
# Delete test pod (if still running)
kubectl delete pod log-generator --ignore-not-found=true
# Delete any alarms you created
aws cloudwatch delete-alarms --alarm-names "EKS-HighCPU" "EKS-NodeNotReady" 2>/dev/null || true
B. Run the Delete Script
./delete-cloudwatch-observability.sh
This removes the addon, IAM role, and Pod Identity associations.
C. Delete Log Groups (Optional)
The delete script preserves historical logs. To remove them completely:
aws logs delete-log-group --log-group-name /aws/containerinsights/my-cluster/application 2>/dev/null || true
aws logs delete-log-group --log-group-name /aws/containerinsights/my-cluster/dataplane 2>/dev/null || true
aws logs delete-log-group --log-group-name /aws/containerinsights/my-cluster/host 2>/dev/null || true
aws logs delete-log-group --log-group-name /aws/containerinsights/my-cluster/performance 2>/dev/null || true
What We’ve Accomplished
You now have:
- CloudWatch Observability collecting metrics and logs
- Pre-built dashboards for cluster visibility
- Fluent Bit shipping container logs to CloudWatch
- Pod Identity providing secure authentication
What’s Next
The infrastructure is built. The final article covers how to use it - operational commands, production checklists, and architecture patterns for different application types.
Next: 1.8 Production Architecture and Operations Guide
Questions about monitoring? Reach out on socials.