Monitoring & Logging
Monitor Kubernetes clusters and applications — Prometheus metrics, centralized logging, health probes, and resource management.
Prometheus and Metrics
Prometheus collects time-series metrics from Kubernetes components and applications. Install it with the kube-prometheus-stack Helm chart to get a complete monitoring stack that includes Grafana dashboards and Alertmanager.
Kubernetes exposes metrics from the kubelet, the API server, and cAdvisor. Application metrics require instrumentation with a client library (prometheus/client_golang for Go, prom-client for Node.js).
- kube-prometheus-stack is the standard monitoring installation
- Grafana dashboards visualize cluster and application metrics
- Alertmanager routes alerts to PagerDuty, Slack, or email
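As an illustration of how Alertmanager routing works, the sketch below sends critical alerts to PagerDuty and everything else to Slack. The receiver names, the Slack channel, the `severity` label convention, and the `REPLACE_ME` credentials are all assumptions for the example, not values from any particular cluster.

```yaml
# Alertmanager configuration sketch (receiver names and credentials are placeholders)
route:
  receiver: slack-default          # fallback receiver for alerts no child route matches
  group_by: [alertname, namespace] # batch related alerts into one notification
  routes:
    - matchers:
        - 'severity="critical"'    # assumes alerts carry a severity label
      receiver: pagerduty-oncall
receivers:
  - name: slack-default
    slack_configs:
      - channel: "#alerts"
        api_url: https://hooks.slack.com/services/REPLACE_ME
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: REPLACE_ME    # PagerDuty Events API v2 integration key
```

With kube-prometheus-stack, a configuration like this is usually supplied through the chart's Helm values rather than edited in place; check the chart's documentation for the exact key.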
# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: web-app
spec:
  selector:
    matchLabels:
      app: web
  endpoints:
    - port: metrics
      interval: 30s

Centralized Logging
Aggregate pod logs with the ELK stack (Elasticsearch, Logstash, Kibana), Loki (Grafana-native), or a cloud logging service (Amazon CloudWatch, Google Cloud Logging, formerly Stackdriver). A node-level agent such as Fluent Bit or Promtail collects logs from each node.
Structured JSON logging lets the log backend filter and correlate entries by field. Include the request ID, pod name, and namespace in each log entry.
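One way to make the pod name and namespace available to the application, so it can add them to each JSON log entry, is the Kubernetes Downward API. The fragment below is a minimal sketch that belongs under a container in the pod template; the variable names `POD_NAME` and `POD_NAMESPACE` are hypothetical, and the application is assumed to read them itself.

```yaml
# Container-level fragment: expose pod metadata as environment variables
env:
  - name: POD_NAME            # hypothetical variable name read by the app's logger
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: POD_NAMESPACE       # hypothetical variable name read by the app's logger
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
```

Many log collectors can also attach this metadata on the node side, so treat this as an option for cases where the application should emit the fields directly.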
# Fluent Bit DaemonSet collects logs from all nodes
# Logs flow: Pod stdout → Node /var/log/containers → Fluent Bit → Loki/ES
# Query logs with LogCLI (Loki)
logcli query '{namespace="production", app="web"}' --limit=50

Liveness and Readiness Probes
Liveness probes detect containers that are still running but no longer healthy, such as a deadlocked process; once the failure threshold is reached, the kubelet restarts the container. Readiness probes detect whether a container can accept traffic; while they fail, the pod is removed from Service endpoints.
Use different endpoints: /health/live for liveness, /health/ready for readiness. Readiness should check dependencies (database, cache). Liveness should only check that the process itself is alive, so that a dependency outage does not trigger restarts.
livenessProbe:
  httpGet:
    path: /health/live
    port: 3000
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health/ready
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5

Resource Requests and Limits
Requests are the resources the scheduler reserves for a container when placing its pod. Limits cap maximum usage. CPU is measured in cores (or millicores: 500m = 0.5 core). Memory is measured in bytes, usually written with binary suffixes (256Mi, 1Gi).
Without requests, the scheduler cannot make informed placement decisions. Without limits, one pod can consume all node resources. Always set both.
- Requests determine scheduling — pods need a node with available capacity
- Limits trigger throttling (CPU) or OOM kill (memory)
- Use LimitRanges to enforce defaults per namespace
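A LimitRange that applies per-container defaults in a namespace could look like the following sketch. The object name, the `production` namespace, and the specific values are illustrative assumptions.

```yaml
# LimitRange sketch: defaults applied to containers that omit resources
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits        # hypothetical name
  namespace: production       # assumed namespace
spec:
  limits:
    - type: Container
      defaultRequest:         # used when a container sets no requests
        cpu: 250m
        memory: 256Mi
      default:                # used when a container sets no limits
        cpu: 500m
        memory: 512Mi
```

Containers that declare their own requests and limits keep them; the defaults only fill in values that are missing.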
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

Observability Best Practices
Implement the three pillars: metrics (Prometheus), logs (Loki/ELK), and traces (Jaeger/Tempo). Correlate all three with shared trace IDs for effective debugging.
Create runbooks for common alerts. Build one dashboard per service showing request rate, error rate, and latency (the RED method), plus resource utilization.
# RED method dashboards
# Rate: requests per second
# Errors: error rate percentage
# Duration: p50, p95, p99 latency
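The three RED signals can be precomputed as Prometheus recording rules and then graphed in a dashboard. The sketch below uses a PrometheusRule for the Prometheus Operator and assumes the application exports a counter named `http_requests_total` with a `status` label and a histogram named `http_request_duration_seconds`, both labeled `app="web"`. Those metric and label names, and the rule names, are assumptions that depend on how the service is instrumented.

```yaml
# PrometheusRule sketch: RED recording rules (metric and label names are assumed)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: web-red-rules         # hypothetical name
spec:
  groups:
    - name: web.red
      rules:
        # Rate: requests per second over the last 5 minutes
        - record: app:http_requests:rate5m
          expr: sum(rate(http_requests_total{app="web"}[5m]))
        # Errors: fraction of requests that returned a 5xx status
        - record: app:http_requests_errors:ratio_rate5m
          expr: |
            sum(rate(http_requests_total{app="web", status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{app="web"}[5m]))
        # Duration: 95th percentile latency estimated from histogram buckets
        - record: app:http_request_duration_seconds:p95_5m
          expr: |
            histogram_quantile(0.95,
              sum by (le) (rate(http_request_duration_seconds_bucket{app="web"}[5m])))
```

Changing 0.95 to 0.50 or 0.99 in `histogram_quantile` gives the other percentiles. Depending on how the Prometheus Operator is configured, the rule object may need specific labels before Prometheus picks it up.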