Advanced · 15 min read

Monitoring & Logging

Monitor Kubernetes clusters and applications with Prometheus metrics, centralized logging, health probes, and resource requests and limits.

Prometheus and Metrics

Prometheus collects time-series metrics from Kubernetes components and applications. Install it with the kube-prometheus-stack Helm chart to get a complete monitoring stack that includes Grafana dashboards and Alertmanager.

Kubernetes exposes metrics from the kubelet, the API server, and cAdvisor (built into the kubelet). Application metrics require instrumentation with a client library, such as prometheus/client_golang for Go or prom-client for Node.js.

kube-prometheus-stack is a widely used starting point for cluster monitoring
Grafana dashboards visualize cluster and application metrics
Alertmanager routes alerts to PagerDuty, Slack, or email
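
To make the routing idea concrete, here is a minimal sketch of an Alertmanager configuration. The receiver names, the channel, the `severity` label, and both placeholder credentials are assumptions for illustration, not values from this guide. With kube-prometheus-stack, a block like this is typically supplied under `alertmanager.config` in the Helm values.

```yaml
# Sketch only: receiver names, channel, and credentials are placeholders.
route:
  receiver: slack-default            # fallback for anything not matched below
  group_by: [alertname, namespace]   # batch related alerts into one notification
  routes:
    - matchers:
        - severity="critical"        # assumes alerts carry a "severity" label
      receiver: pagerduty-oncall
receivers:
  - name: slack-default
    slack_configs:
      - api_url: "<slack-webhook-url>"
        channel: "#alerts"
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"
```

Critical alerts page the on-call engineer, and everything else lands in a Slack channel. Keeping the split keyed on a single label makes the routing tree easy to reason about.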

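# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: web-app
  # With kube-prometheus-stack defaults, Prometheus only picks up ServiceMonitors
  # labeled with the Helm release name (assumed here: kube-prometheus-stack).
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: web            # matches labels on the Service, not on the pods
  endpoints:
    - port: metrics       # name of a port on the Service, not a port number
      interval: 30s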
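
A ServiceMonitor selects Services, so it only finds targets if a matching Service exists. The following is a sketch of a Service that the ServiceMonitor above would select. The Service name, the pod label, and the port numbers are assumptions; only the `app: web` label and the port name `metrics` come from the example above.

```yaml
# Sketch only: name, pod selector, and port numbers are hypothetical.
apiVersion: v1
kind: Service
metadata:
  name: web
  labels:
    app: web              # matched by the ServiceMonitor's spec.selector
spec:
  selector:
    app: web              # assumed pod label
  ports:
    - name: http
      port: 80
      targetPort: 3000
    - name: metrics       # referenced by "port: metrics" in the ServiceMonitor
      port: 9464          # hypothetical metrics port
      targetPort: 9464
```

By default a ServiceMonitor only looks at Services in its own namespace, so the two objects should live together unless `namespaceSelector` is set on the ServiceMonitor.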

Centralized Logging

Aggregate pod logs with ELK (Elasticsearch, Logstash, Kibana), Loki (Grafana-native), or a cloud logging service (Amazon CloudWatch, Google Cloud Logging, formerly Stackdriver). A collector such as Fluent Bit or Promtail runs on each node and ships the logs.

Structured JSON logging from applications enables powerful filtering and correlation. Include request IDs, pod name, and namespace in log entries (collectors such as Fluent Bit can attach the Kubernetes metadata automatically).

# Fluent Bit DaemonSet collects logs from all nodes
# Logs flow: Pod stdout → Node /var/log/containers → Fluent Bit → Loki/ES

# Query logs with LogCLI (Loki)
logcli query '{namespace="production", app="web"}' --limit=50
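
The log flow described above can be sketched as a Fluent Bit pipeline in its YAML configuration format. This is an illustrative outline under stated assumptions, not a production configuration: the Loki host name and the `job` label are hypothetical, and the DaemonSet, RBAC, and volume mounts needed to run it are omitted.

```yaml
# Sketch only: Loki host and labels are placeholders.
pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log   # container logs written on each node
      tag: kube.*
      multiline.parser: docker, cri
  filters:
    - name: kubernetes                  # adds pod name, namespace, and labels
      match: kube.*
      merge_log: on                     # parses JSON log bodies into fields
  outputs:
    - name: loki
      match: kube.*
      host: loki.monitoring.svc         # hypothetical in-cluster Loki address
      port: 3100
      labels: job=fluent-bit
```

The labels available in LogCLI queries depend on what the collector attaches, so a selector like `{namespace="production", app="web"}` only works if the pipeline exports those labels.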

Liveness and Readiness Probes

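Liveness probes detect containers that are still running but stuck, for example deadlocked or hung. After failureThreshold consecutive failures, the kubelet restarts the container. Readiness probes report whether a container can accept traffic. While a readiness probe fails, the pod is removed from Service endpoints but is not restarted.

Use different endpoints: /health/live for liveness, /health/ready for readiness. Readiness should check dependencies (database, cache). Liveness should only check whether the process itself is alive. If liveness also checked a dependency, a database outage would restart every pod without fixing anything.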

livenessProbe:
  httpGet:
    path: /health/live
    port: 3000
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health/ready
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5
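
For applications that start slowly, a startup probe is a common complement to the two probes above. Kubernetes holds off liveness and readiness checks until the startup probe succeeds, so a long boot does not trigger restarts. The snippet is a sketch that reuses the /health/live endpoint and port 3000 from the example. The timing values are assumptions to adjust per application.

```yaml
# Sketch only: thresholds are illustrative.
startupProbe:
  httpGet:
    path: /health/live
    port: 3000
  periodSeconds: 5
  failureThreshold: 30    # allows up to 30 x 5s = 150s for startup
```

With this in place, initialDelaySeconds on the liveness probe can usually be kept small, because the startup probe already covers the boot window.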

Resource Requests and Limits

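Requests reserve a minimum amount of resources and are what the scheduler uses for placement. Limits cap maximum usage. CPU is measured in cores or millicores (500m = 0.5 core). Memory is measured in bytes, usually written with binary suffixes (256Mi = 256 × 2^20 bytes, 1Gi = 2^30 bytes).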

Without requests, the scheduler cannot make informed placement decisions. Without limits, one pod can consume all node resources. Always set both.

Requests determine scheduling: a pod is placed only on a node with enough unreserved capacity to cover its requests
Exceeding a limit triggers throttling (CPU) or an OOM kill (memory)
Use LimitRanges to enforce defaults per namespace

resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
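
The LimitRange mentioned above can apply values like these as namespace-wide defaults. This is a minimal sketch: the object name, the `production` namespace, and the `max` ceiling are assumptions, and the default values simply mirror the container example.

```yaml
# Sketch only: name, namespace, and max values are hypothetical.
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: production
spec:
  limits:
    - type: Container
      defaultRequest:       # applied when a container sets no requests
        cpu: 250m
        memory: 256Mi
      default:              # applied when a container sets no limits
        cpu: 500m
        memory: 512Mi
      max:                  # containers asking for more are rejected
        cpu: "2"
        memory: 2Gi
```

Defaults are injected at admission time, so they affect only pods created after the LimitRange exists.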

Observability Best Practices

Implement the three pillars: metrics (Prometheus), logs (Loki/ELK), and traces (Jaeger/Tempo). Correlate all three with shared trace IDs for effective debugging.

Create runbooks for common alerts. Build one dashboard per service showing request rate, error rate, and latency (the RED method: Rate, Errors, Duration), plus resource utilization.
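
As a sketch of how the RED signals might be precomputed for dashboards and alerts, the PrometheusRule below defines three recording rules and one alert. The metric names (`http_requests_total`, `http_request_duration_seconds_bucket`), the `status` and `service` labels, the 5% threshold, and the `release` label are all assumptions. Substitute whatever the application actually exports.

```yaml
# Sketch only: metric names, labels, and thresholds are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: web-red
  labels:
    release: kube-prometheus-stack   # assumed Helm release name
spec:
  groups:
    - name: web.red
      rules:
        # Rate: requests per second, per service
        - record: service:http_requests:rate5m
          expr: sum by (service) (rate(http_requests_total[5m]))
        # Errors: fraction of requests answered with a 5xx status
        - record: service:http_errors:ratio_rate5m
          expr: |
            sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum by (service) (rate(http_requests_total[5m]))
        # Duration: 95th percentile latency from a histogram
        - record: service:http_request_duration_seconds:p95_5m
          expr: |
            histogram_quantile(0.95,
              sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
        - alert: WebHighErrorRate
          expr: service:http_errors:ratio_rate5m > 0.05
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "More than 5% of requests are failing"
            runbook_url: "<link to runbook>"
```

Attaching a runbook link to each alert ties the alerting rules back to the runbooks, so whoever is paged has the next step in front of them.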