1.7 EKS: CloudWatch Observability for Monitoring
A cluster without monitoring is flying blind. CloudWatch Observability gives you ready-to-use dashboards for your EKS cluster - CPU, memory, network, and logs - with zero configuration.
This is part 1.7 of the EKS Infrastructure Series. We’re building on the cluster from 1.6 External Secrets.
What We’re Building
By the end of this article, you’ll have:
- CloudWatch Observability addon installed via EKS
- Container Insights dashboards for cluster, node, and pod metrics
- Log collection via Fluent Bit
- Pod Identity authentication for secure AWS access
Why CloudWatch Observability?
EKS doesn’t include monitoring by default. Without it, you can’t answer basic questions:
- Is the cluster healthy?
- Which pods are using the most memory?
- Why did that deployment fail?
- Are nodes running out of resources?
CloudWatch Observability solves this with:
Zero-configuration dashboards - Pre-built views for cluster, node, pod, and container performance. No Grafana setup, no PromQL queries to write.
Integrated logging - Fluent Bit collects container logs and ships them to CloudWatch Logs. Search across all pods from one place.
AWS-native - Uses the same CloudWatch you already know. Alarms, dashboards, and log insights all work together.
Pod Identity integration - Uses the Pod Identity we set up in 1.5 for authentication. No IRSA configuration needed.
What Gets Installed
The CloudWatch Observability addon installs three components:
| Component | Purpose | Runs As |
|---|---|---|
| CloudWatch Agent | Collects metrics (CPU, memory, network, disk) | DaemonSet |
| Fluent Bit | Collects and ships container logs | DaemonSet |
| ADOT Collector | Prometheus metrics support | Deployment |
All components run in the amazon-cloudwatch namespace.
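Once the addon is installed (Step 2 below), you can see these components directly:
kubectl get daemonset,deployment -n amazon-cloudwatch
Expect the two DaemonSets plus at least one Deployment - exact names vary by addon version.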
Container Insights Dashboards
Once installed, you get pre-built dashboards in the CloudWatch console:
Cluster Overview
- Total CPU/memory utilization
- Node count and status
- Pod count by namespace
- Network throughput
Node Performance
- Per-node CPU and memory
- Disk I/O and usage
- Network in/out per node
- Node conditions (ready, disk pressure, memory pressure)
Pod Performance
- Per-pod resource usage
- Container restarts
- Pod status by namespace
- Resource requests vs actual usage
Container Performance
- Individual container metrics
- CPU throttling
- Memory limits and OOM events
Prerequisites
CloudWatch Observability requires Pod Identity (from 1.5). The addon uses Pod Identity to authenticate with CloudWatch APIs - no static credentials or IRSA configuration needed.
Verify Pod Identity is installed:
eksctl get addon --name eks-pod-identity-agent --cluster my-cluster
If not installed, go back to 1.5 Pod Identity first.
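If you only need the agent itself and not the full walkthrough, one way to install it without leaving this article:
eksctl create addon --name eks-pod-identity-agent --cluster my-cluster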
Project Structure
monitoring/
├── install-cloudwatch-observability.sh # Installation script
├── delete-cloudwatch-observability.sh # Cleanup script
└── cloudwatch-trust-policy.json # IAM trust policy for Pod Identity
Step 1: Create the IAM Trust Policy
The CloudWatch agent needs an IAM role to write metrics and logs. This trust policy allows Pod Identity to assume the role:
cloudwatch-trust-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "pods.eks.amazonaws.com"
      },
      "Action": [
        "sts:AssumeRole",
        "sts:TagSession"
      ]
    }
  ]
}
This is the same trust policy pattern we used for Pod Identity testing - trust pods.eks.amazonaws.com.
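After Step 2 creates the role, you can confirm the trust relationship took effect:
aws iam get-role \
  --role-name CloudWatchObservabilityRole \
  --query 'Role.AssumeRolePolicyDocument.Statement[0].Principal'
This should print the pods.eks.amazonaws.com service principal.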
Step 2: Create the Installation Script
install-cloudwatch-observability.sh
#!/bin/bash
# Configuration
CLUSTER_NAME="my-cluster"
REGION="us-east-1"
echo "Installing CloudWatch Observability"
echo "===================================="
echo "Cluster: $CLUSTER_NAME"
echo "Region: $REGION"
echo ""
echo "Checking prerequisites..."
if ! kubectl get nodes &>/dev/null; then
  echo "ERROR: Cluster not found or kubectl not configured"
  exit 1
fi
# Check if Pod Identity is installed
if ! eksctl get addon --name eks-pod-identity-agent --cluster "$CLUSTER_NAME" 2>/dev/null | grep -q "ACTIVE"; then
  echo "ERROR: Pod Identity Agent is required but not installed"
  echo "       Install it first: see 1.5 Pod Identity article"
  exit 1
fi
echo "Prerequisites verified"
echo ""
# Get account ID
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
# Get script directory for finding trust policy
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
echo "Creating IAM role for CloudWatch Agent..."
aws iam create-role \
  --role-name CloudWatchObservabilityRole \
  --assume-role-policy-document "file://$SCRIPT_DIR/cloudwatch-trust-policy.json" \
  2>/dev/null || echo "Role already exists"
# CloudWatchAgentServerPolicy covers metrics
aws iam attach-role-policy \
  --role-name CloudWatchObservabilityRole \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
  2>/dev/null || true
# CloudWatchLogsFullAccess allows Fluent Bit to create log groups and write logs
aws iam attach-role-policy \
  --role-name CloudWatchObservabilityRole \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchLogsFullAccess \
  2>/dev/null || true
echo "IAM role created with metrics and logs permissions"
echo ""
echo "Installing CloudWatch Observability addon with Pod Identity..."
aws eks create-addon \
  --addon-name amazon-cloudwatch-observability \
  --cluster-name "$CLUSTER_NAME" \
  --pod-identity-associations serviceAccount=cloudwatch-agent,roleArn=arn:aws:iam::$ACCOUNT_ID:role/CloudWatchObservabilityRole \
  --region "$REGION"
if [[ $? -ne 0 ]]; then
  echo "ERROR: Failed to install CloudWatch Observability addon"
  exit 1
fi
echo "Waiting for CloudWatch Observability to be ready..."
sleep 30
kubectl wait --for=condition=Ready pods -n amazon-cloudwatch \
  -l app.kubernetes.io/name=cloudwatch-agent --timeout=300s
kubectl wait --for=condition=Ready pods -n amazon-cloudwatch \
  -l app.kubernetes.io/name=fluent-bit --timeout=300s
echo ""
echo "SUCCESS! CloudWatch Observability installed"
echo "============================================"
echo "Components installed:"
echo " - CloudWatch Agent (metrics)"
echo " - Fluent Bit (logs)"
echo " - ADOT Collector (Prometheus metrics)"
echo " - Container Insights dashboards"
echo ""
echo "Access dashboards:"
echo " https://console.aws.amazon.com/cloudwatch/home?region=$REGION#container-insights:performance/EKS:Cluster"
echo ""
echo "Verify with:"
echo " kubectl get pods -n amazon-cloudwatch"
Step 3: Create the Cleanup Script
delete-cloudwatch-observability.sh
#!/bin/bash
# Configuration
CLUSTER_NAME="my-cluster"
REGION="us-east-1"
echo "Removing CloudWatch Observability"
echo "=================================="
echo "Cluster: $CLUSTER_NAME"
echo ""
echo "WARNING: This will remove:"
echo " - CloudWatch Observability addon"
echo " - All monitoring dashboards"
echo " - Log collection"
echo ""
read -p "Are you sure? (yes/no): " confirm
if [[ $confirm != "yes" ]]; then
  echo "Operation cancelled"
  exit 0
fi
echo ""
echo "Removing Pod Identity associations..."
# Delete Pod Identity associations for amazon-cloudwatch namespace
aws eks list-pod-identity-associations \
  --cluster-name "$CLUSTER_NAME" \
  --region "$REGION" \
  --query 'associations[?namespace==`amazon-cloudwatch`].associationId' \
  --output text | tr '\t' '\n' | while read -r association_id; do
  if [[ -n "$association_id" ]]; then
    echo "Deleting association: $association_id"
    aws eks delete-pod-identity-association \
      --cluster-name "$CLUSTER_NAME" \
      --association-id "$association_id" \
      --region "$REGION" 2>/dev/null || true
  fi
done
echo "Removing CloudWatch Observability addon..."
eksctl delete addon \
  --name amazon-cloudwatch-observability \
  --cluster "$CLUSTER_NAME" 2>/dev/null || true
echo "Cleaning up namespace..."
kubectl delete namespace amazon-cloudwatch --ignore-not-found=true
echo "Cleaning up IAM role..."
aws iam detach-role-policy \
  --role-name CloudWatchObservabilityRole \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
  2>/dev/null || true
aws iam detach-role-policy \
  --role-name CloudWatchObservabilityRole \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchLogsFullAccess \
  2>/dev/null || true
aws iam delete-role \
  --role-name CloudWatchObservabilityRole \
  2>/dev/null || true
echo "Waiting for cleanup..."
sleep 15
echo ""
echo "Cleanup complete!"
echo "================="
echo "Removed:"
echo " - CloudWatch Observability addon"
echo " - CloudWatch Agent and Fluent Bit"
echo " - IAM role"
echo ""
echo "Note: Historical CloudWatch data is preserved"
echo " (logs and metrics remain in CloudWatch)"
echo ""
echo "To delete log groups (removes all historical logs):"
echo " aws logs delete-log-group --log-group-name /aws/containerinsights/my-cluster/application"
echo " aws logs delete-log-group --log-group-name /aws/containerinsights/my-cluster/dataplane"
echo " aws logs delete-log-group --log-group-name /aws/containerinsights/my-cluster/host"
echo " aws logs delete-log-group --log-group-name /aws/containerinsights/my-cluster/performance"
Step 4: Run the Installation
chmod +x install-cloudwatch-observability.sh
./install-cloudwatch-observability.sh
Expected output:
SUCCESS! CloudWatch Observability installed
============================================
Components installed:
- CloudWatch Agent (metrics)
- Fluent Bit (logs)
- ADOT Collector (Prometheus metrics)
- Container Insights dashboards
Access dashboards:
https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#container-insights:performance/EKS:Cluster
Verify with:
kubectl get pods -n amazon-cloudwatch
Step 5: Verify Installation
Check the pods in the amazon-cloudwatch namespace:
kubectl get pods -n amazon-cloudwatch
Expected output:
NAME READY STATUS RESTARTS AGE
cloudwatch-agent-xxxxx 1/1 Running 0 2m
cloudwatch-agent-yyyyy 1/1 Running 0 2m
fluent-bit-xxxxx 1/1 Running 0 2m
fluent-bit-yyyyy 1/1 Running 0 2m
You should see one cloudwatch-agent and one fluent-bit pod per node (they’re DaemonSets).
Check the addon status:
eksctl get addon --name amazon-cloudwatch-observability --cluster my-cluster
Expected output:
NAME VERSION STATUS ISSUES
amazon-cloudwatch-observability v2.1.0-eksbuild.1 ACTIVE 0
Step 6: Access the Dashboards
Open the CloudWatch Container Insights console:
https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#container-insights:performance
Or navigate manually:
- Go to CloudWatch in the AWS Console
- Click Container Insights in the left navigation
- Select Performance monitoring
- Choose your cluster from the dropdown
Dashboards take 5-10 minutes to populate with data after installation.
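If you prefer the CLI, you can confirm metrics are arriving before the dashboards render:
aws cloudwatch list-metrics \
  --namespace ContainerInsights \
  --dimensions Name=ClusterName,Value=my-cluster \
  --query 'Metrics[].MetricName' \
  --output text
An empty result usually just means the first scrape hasn't landed yet - retry after a few minutes.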
Step 7: Verify Log Groups Were Created
Before testing, verify the log groups exist. Fluent Bit creates these automatically:
aws logs describe-log-groups \
--log-group-name-prefix /aws/containerinsights/my-cluster \
--query 'logGroups[*].logGroupName'
Expected output:
[
"/aws/containerinsights/my-cluster/application",
"/aws/containerinsights/my-cluster/dataplane",
"/aws/containerinsights/my-cluster/host",
"/aws/containerinsights/my-cluster/performance"
]
If the log groups don't exist, check that:
- Fluent Bit pods are running: kubectl get pods -n amazon-cloudwatch -l app.kubernetes.io/name=fluent-bit
- The IAM role has CloudWatchLogsFullAccess attached
- Fluent Bit logs show no errors: kubectl logs -n amazon-cloudwatch -l app.kubernetes.io/name=fluent-bit
Step 8: Test Log Collection
Now let’s verify logs flow from pods to CloudWatch.
A. Create a Test Pod
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: log-generator
  namespace: default
spec:
  containers:
  - name: logger
    image: busybox
    command: ["/bin/sh", "-c"]
    args:
    - |
      echo "Log generator started"
      while true; do
        echo "Test log message at \$(date)"
        sleep 5
      done
EOF
B. Wait for the Pod and Verify It’s Logging
kubectl wait --for=condition=Ready pod/log-generator --timeout=60s
# Verify the pod is generating logs locally
kubectl logs log-generator --tail=3
You should see:
Log generator started
Test log message at Sat Jan 17 12:00:00 UTC 2026
Test log message at Sat Jan 17 12:00:05 UTC 2026
C. Wait for Logs to Appear in CloudWatch
Fluent Bit batches logs, so wait 1-2 minutes for logs to appear in CloudWatch.
# Wait for log delivery
sleep 120
D. Verify Logs in CloudWatch
Check if logs arrived:
aws logs filter-log-events \
--log-group-name /aws/containerinsights/my-cluster/application \
--filter-pattern "log-generator" \
--start-time $(($(date +%s) - 300))000 \
--limit 5 \
--query 'events[*].message'
Expected output (JSON array of log messages):
[
"... \"log\":\"Test log message at Fri Jan 17 12:00:00 UTC 2026\\n\" ...",
"... \"log\":\"Test log message at Fri Jan 17 12:00:05 UTC 2026\\n\" ..."
]
If you get an empty result [], the logs haven't arrived yet. Common causes:
- Log group doesn't exist - check Step 7 above
- IAM permissions - verify CloudWatchLogsFullAccess is attached to the role
- Fluent Bit errors - check kubectl logs -n amazon-cloudwatch -l app.kubernetes.io/name=fluent-bit
E. View Logs in the Console
- Go to CloudWatch in the AWS Console
- Click Log groups in the left navigation
- Click /aws/containerinsights/my-cluster/application
- Click a recent log stream (named after the node)
- Search for log-generator
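If you prefer the terminal, the same search works through CloudWatch Logs Insights. A minimal sketch - it assumes the default Container Insights log format, where Fluent Bit nests pod metadata under a kubernetes key:
QUERY_ID=$(aws logs start-query \
  --log-group-name /aws/containerinsights/my-cluster/application \
  --start-time $(($(date +%s) - 600)) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, log | filter kubernetes.pod_name = "log-generator" | limit 5' \
  --query queryId --output text)
sleep 5  # Logs Insights queries run asynchronously; re-run the next command until status is Complete
aws logs get-query-results --query-id "$QUERY_ID"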
F. Clean Up Test Resources
kubectl delete pod log-generator
Understanding the Metrics
CloudWatch Observability collects metrics at multiple levels:
Cluster Metrics
| Metric | Description |
|---|---|
| cluster_failed_node_count | Nodes in NotReady state |
| cluster_node_count | Total node count |
| namespace_number_of_running_pods | Pods per namespace |
Node Metrics
| Metric | Description |
|---|---|
| node_cpu_utilization | CPU usage percentage |
| node_memory_utilization | Memory usage percentage |
| node_filesystem_utilization | Disk usage percentage |
| node_network_total_bytes | Network throughput |
Pod Metrics
| Metric | Description |
|---|---|
| pod_cpu_utilization | CPU vs requested |
| pod_memory_utilization | Memory vs requested |
| pod_network_rx_bytes | Network received |
| pod_network_tx_bytes | Network transmitted |
Container Metrics
| Metric | Description |
|---|---|
| container_cpu_utilization | Per-container CPU |
| container_memory_utilization | Per-container memory |
| container_restart_count | Restart count |
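All of these live in the ContainerInsights namespace, so you can query them like any other CloudWatch metric. A sketch pulling the last hour of node CPU at 5-minute resolution (GNU date shown; on macOS use date -u -v-1H):
aws cloudwatch get-metric-statistics \
  --namespace ContainerInsights \
  --metric-name node_cpu_utilization \
  --dimensions Name=ClusterName,Value=my-cluster \
  --statistics Average \
  --period 300 \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)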
Creating Alarms
Use CloudWatch Alarms to get notified of issues:
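Both alarm examples below publish to an SNS topic named alerts - that topic name and the email address here are placeholders, not something an earlier article created. If you don't have a topic yet:
aws sns create-topic --name alerts
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:ACCOUNT_ID:alerts \
  --protocol email \
  --notification-endpoint you@example.com
Replace ACCOUNT_ID with your account ID; the alarm commands below use the same placeholder.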
High CPU Alarm Example
aws cloudwatch put-metric-alarm \
--alarm-name "EKS-HighCPU" \
--alarm-description "Alert when cluster CPU exceeds 80%" \
--metric-name node_cpu_utilization \
--namespace ContainerInsights \
--statistic Average \
--period 300 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--dimensions Name=ClusterName,Value=my-cluster \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:us-east-1:ACCOUNT_ID:alerts
Node Not Ready Alarm
aws cloudwatch put-metric-alarm \
--alarm-name "EKS-NodeNotReady" \
--alarm-description "Alert when nodes are not ready" \
--metric-name cluster_failed_node_count \
--namespace ContainerInsights \
--statistic Maximum \
--period 60 \
--threshold 0 \
--comparison-operator GreaterThanThreshold \
--dimensions Name=ClusterName,Value=my-cluster \
--evaluation-periods 1 \
--alarm-actions arn:aws:sns:us-east-1:ACCOUNT_ID:alerts
Delete Alarms
To remove alarms you’ve created:
aws cloudwatch delete-alarms --alarm-names "EKS-HighCPU" "EKS-NodeNotReady"
Cost Considerations
CloudWatch Observability has associated costs:
| Component | Cost |
|---|---|
| Custom metrics | $0.30 per metric per month (first 10k) |
| Log ingestion | $0.50 per GB |
| Log storage | $0.03 per GB per month |
| Dashboard | Free (Container Insights dashboards) |
For a small cluster (3 nodes, 20 pods), expect roughly $15-30/month. Costs scale with:
- Number of nodes and pods (more metrics)
- Log volume (application log verbosity)
- Retention period for logs
Reducing Costs
- Set log retention - Don't keep logs forever (a loop applying this to all four log groups follows this list):
  aws logs put-retention-policy \
    --log-group-name /aws/containerinsights/my-cluster/application \
    --retention-in-days 7
- Filter noisy logs - Exclude high-volume, low-value logs in the Fluent Bit config
- Use metric filters - Create alarms from logs instead of storing everything
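The same retention policy applied to all four Container Insights log groups in one pass:
for lg in application dataplane host performance; do
  aws logs put-retention-policy \
    --log-group-name "/aws/containerinsights/my-cluster/$lg" \
    --retention-in-days 7
done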
Troubleshooting
Pods Stuck in Pending
Check if the DaemonSet can schedule:
kubectl describe pod -n amazon-cloudwatch -l app.kubernetes.io/name=cloudwatch-agent
Common causes:
- Node taints preventing scheduling
- Resource constraints
No Metrics in Dashboard
Check if the agent can reach CloudWatch:
kubectl logs -n amazon-cloudwatch -l app.kubernetes.io/name=cloudwatch-agent
Look for authentication errors - usually means Pod Identity association is missing.
Logs Not Appearing
Check Fluent Bit:
kubectl logs -n amazon-cloudwatch -l app.kubernetes.io/name=fluent-bit
Common issues:
- Log group doesn’t exist (Fluent Bit creates it, but check permissions)
- IAM policy doesn’t allow log writes
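A quick check for the second point:
aws iam list-attached-role-policies \
  --role-name CloudWatchObservabilityRole \
  --query 'AttachedPolicies[].PolicyName'
Both CloudWatchAgentServerPolicy and CloudWatchLogsFullAccess should appear in the output.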
Complete Cleanup
To remove everything created in this article:
A. Remove Test Resources
# Delete test pod (if still running)
kubectl delete pod log-generator --ignore-not-found=true
# Delete any alarms you created
aws cloudwatch delete-alarms --alarm-names "EKS-HighCPU" "EKS-NodeNotReady" 2>/dev/null || true
B. Run the Delete Script
./delete-cloudwatch-observability.sh
This removes the addon, IAM role, and Pod Identity associations.
C. Delete Log Groups (Optional)
The delete script preserves historical logs. To remove them completely:
aws logs delete-log-group --log-group-name /aws/containerinsights/my-cluster/application 2>/dev/null || true
aws logs delete-log-group --log-group-name /aws/containerinsights/my-cluster/dataplane 2>/dev/null || true
aws logs delete-log-group --log-group-name /aws/containerinsights/my-cluster/host 2>/dev/null || true
aws logs delete-log-group --log-group-name /aws/containerinsights/my-cluster/performance 2>/dev/null || true
What We’ve Accomplished
You now have:
- CloudWatch Observability collecting metrics and logs
- Pre-built dashboards for cluster visibility
- Fluent Bit shipping container logs to CloudWatch
- Pod Identity providing secure authentication
What’s Next
The infrastructure is built. The final article covers how to use it - operational commands, production checklists, and architecture patterns for different application types.
Next: 1.8 Production Architecture and Operations Guide
Questions about monitoring? Reach out on socials.