Advanced · 14 min read

Monitoring & Logging

Monitor AWS resources with CloudWatch metrics and logs, audit with CloudTrail, set up alarms, and build observability dashboards.

CloudWatch Metrics

CloudWatch collects metrics from AWS services automatically — EC2 CPU, RDS connections, Lambda duration, S3 bucket size. Publish custom metrics with PutMetricData. Metrics have namespaces, dimensions, and timestamps.

Create dashboards to visualize key metrics, and use metric math to compute derived values across metrics. Standard resolution is 1 minute; high-resolution custom metrics go down to 1 second.
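As a worked illustration of what a metric math expression computes, the sketch below derives a per-period error percentage from two series, in the spirit of an expression such as 100 * (m1 / m2). The function name and the sample numbers are hypothetical, and returning None for periods with zero requests is a choice made for this sketch, not a description of how CloudWatch treats division by zero.

```python
def error_rate(errors, requests):
    """Return the error percentage for each period, pairing the two series by position.

    Periods with zero requests yield None instead of raising ZeroDivisionError.
    """
    return [100.0 * e / r if r else None for e, r in zip(errors, requests)]


if __name__ == "__main__":
    # Hypothetical per-minute counts for three periods.
    print(error_rate([5, 0, 2], [100, 50, 0]))
```

In CloudWatch itself the same calculation would be defined once on a dashboard or alarm and evaluated by the service, rather than computed in application code.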

Default metrics are free — custom metrics incur charges
Use Embedded Metric Format (EMF) for efficient custom metrics from Lambda
Container Insights provides ECS/EKS metrics automatically

# Publish custom metric
aws cloudwatch put-metric-data \
  --namespace MyApp \
  --metric-data '[{
    "MetricName": "OrdersProcessed",
    "Value": 42,
    "Unit": "Count",
    "Dimensions": [{"Name": "Environment", "Value": "production"}]
  }]'
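The Embedded Metric Format mentioned above lets a function emit the same metric by writing a structured log line instead of calling PutMetricData. The following is a minimal sketch of one such record, built with the standard library only; the helper name emf_record is hypothetical, and the namespace, metric, and dimension reuse the values from the CLI example.

```python
import json
import time


def emf_record(namespace, metric_name, value, unit="Count",
               dimensions=None, timestamp_ms=None):
    """Build one Embedded Metric Format record as a plain dict."""
    dimensions = dimensions or {}
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)  # EMF timestamps are epoch milliseconds
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set: the dimension names used together.
                    "Dimensions": [list(dimensions.keys())],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        # The metric value is a top-level member keyed by the metric name.
        metric_name: value,
    }
    # Dimension values are top-level members keyed by dimension name.
    record.update(dimensions)
    return record


if __name__ == "__main__":
    # In Lambda, a line written to stdout is delivered to CloudWatch Logs.
    print(json.dumps(emf_record("MyApp", "OrdersProcessed", 42,
                                dimensions={"Environment": "production"})))
```

For production use, AWS publishes EMF client libraries that handle batching and validation; this sketch only shows the shape of the record.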

CloudWatch Logs

CloudWatch Logs aggregates log streams from services. Log groups contain log streams. Define metric filters to extract metrics from log patterns. Subscribe logs to Lambda or Kinesis for processing.

Set retention policies to control costs: log groups never expire by default, and retention can be set to values such as 7, 14, 30, 60, or 90 days. Use Logs Insights to query logs interactively with its pipe-based query syntax.

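# Query logs with Logs Insights (date -d is GNU date; on macOS use: date -v-1H +%s)
aws logs start-query \
  --log-group-name /aws/lambda/my-function \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message
    | filter @message like /ERROR/
    | sort @timestamp desc
    | limit 20'
# start-query returns a queryId; fetch the results with:
#   aws logs get-query-results --query-id <queryId>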

CloudWatch Alarms

Alarms watch metrics and trigger actions when thresholds are breached. Actions include SNS notifications, Auto Scaling adjustments, and EC2 recovery. Composite alarms combine multiple alarms with AND/OR logic.

Set alarms on: CPU utilization, error rates, latency percentiles, queue depth, and disk usage. Use anomaly detection alarms for metrics with variable baselines.

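# Alarm when average CPU exceeds 80% for two consecutive 5-minute periods
# (i-0123456789abcdef0 is a placeholder instance ID)
aws cloudwatch put-metric-alarm \
  --alarm-name high-cpu \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123:ops-alerts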
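To make the interaction between the threshold, the comparison operator, and the evaluation periods concrete, here is a simplified model of how such an alarm might decide its state. It is a sketch under assumptions: the function name is hypothetical, every period is assumed to have a datapoint, and it ignores features of the real service such as "M out of N" datapoints and missing-data treatment.

```python
def alarm_state(datapoints, threshold=80.0, evaluation_periods=2):
    """Model a GreaterThanThreshold alarm over per-period averages.

    Returns "ALARM" only when each of the most recent `evaluation_periods`
    datapoints is strictly greater than the threshold.
    """
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    return "ALARM" if all(value > threshold for value in recent) else "OK"


if __name__ == "__main__":
    # Hypothetical 5-minute CPU averages, oldest first.
    print(alarm_state([50.0, 85.0, 90.0]))  # last two periods both above 80
    print(alarm_state([85.0, 90.0, 70.0]))  # most recent period is below 80
```

Requiring more than one breaching period is a common way to avoid paging on a single short spike, at the cost of a slower alert.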

CloudTrail Auditing

CloudTrail records AWS API calls — who did what, when, and from where. Enable in all regions. Integrate with CloudWatch Logs for real-time monitoring. Use EventBridge for automated response to specific API calls.

Trail logs include: caller identity, source IP, request parameters, and response elements. Essential for security audits and compliance.

# Create multi-region trail
aws cloudtrail create-trail \
  --name org-trail \
  --s3-bucket-name my-cloudtrail-bucket \
  --is-multi-region-trail \
  --enable-log-file-validation \
  --cloud-watch-logs-log-group-arn arn:aws:logs:... \
  --cloud-watch-logs-role-arn arn:aws:iam::123:role/CloudTrailRole
# A trail created from the CLI does not record events until logging is started
aws cloudtrail start-logging --name org-trail
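The record fields described above (caller identity, source IP, request parameters, response elements) can be pulled out of a CloudTrail event with a few lines of code. The sketch below uses a made-up sample record: the user, bucket, IP address, and account ID are illustrative placeholders, and summarize is a hypothetical helper rather than part of any AWS SDK.

```python
import json

# Illustrative record only; real events carry many more fields.
SAMPLE_EVENT = """
{
  "eventTime": "2024-01-15T10:30:00Z",
  "eventSource": "s3.amazonaws.com",
  "eventName": "DeleteBucket",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "203.0.113.10",
  "userIdentity": {"type": "IAMUser", "arn": "arn:aws:iam::123456789012:user/alice"},
  "requestParameters": {"bucketName": "example-bucket"},
  "responseElements": null
}
"""


def summarize(record):
    """Reduce a CloudTrail record to who did what, when, and from where."""
    identity = record.get("userIdentity") or {}
    return {
        "who": identity.get("arn") or identity.get("type", "unknown"),
        "what": f'{record.get("eventSource")}:{record.get("eventName")}',
        "when": record.get("eventTime"),
        "where": record.get("sourceIPAddress"),
        "params": record.get("requestParameters"),
    }


if __name__ == "__main__":
    print(json.dumps(summarize(json.loads(SAMPLE_EVENT)), indent=2))
```

A function like this could run in a Lambda subscribed to the trail's log group, or in a batch job that reads the log files delivered to S3.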

Observability Architecture

Build a complete observability stack: CloudWatch for AWS-native metrics and logs, X-Ray for distributed tracing, and third-party tools (Datadog, New Relic) for unified views.

Implement structured logging with correlation IDs. Create runbooks linked to alarms. Review dashboards weekly and tune alarm thresholds to reduce false positives.
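One possible way to implement structured logging with correlation IDs is a JSON formatter on top of the standard logging module. This is a minimal sketch: the field names (correlation_id and so on), the logger name, and the handle_request function are assumptions chosen for illustration.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # Set via the `extra` argument on each logging call; None if absent.
            "correlation_id": getattr(record, "correlation_id", None),
        })


def get_logger(name="myapp"):
    """Return a logger that writes JSON lines to stderr (the StreamHandler default)."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


def handle_request(logger, correlation_id=None):
    """Log two steps of a hypothetical request under one correlation ID."""
    correlation_id = correlation_id or str(uuid.uuid4())
    extra = {"correlation_id": correlation_id}
    logger.info("order received", extra=extra)
    logger.info("order processed", extra=extra)
    return correlation_id


if __name__ == "__main__":
    handle_request(get_logger())
```

Because each line is a JSON object, Logs Insights can treat correlation_id as a field, so a single filter on that value returns every log line for one request across services that propagate the same ID.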