Problem Statement
You need comprehensive visibility into your production systems, including metrics, logs, traces, and alerts, to quickly identify, diagnose, and resolve issues.
Three Pillars of Observability
┌───────────────────────────────────────────────────────────────────────────┐
│                          Observability Platform                           │
├─────────────────────┬─────────────────────┬─────────────────────────────┤
│       Metrics       │        Logs         │           Traces            │
│                     │                     │                             │
│ • System metrics    │ • Application logs  │ • Request flow              │
│ • Application       │ • Audit logs        │ • Service dependencies      │
│   metrics           │ • Access logs       │ • Latency breakdown         │
│ • Business          │ • Error logs        │ • Error propagation         │
│   metrics           │                     │                             │
├─────────────────────┴─────────────────────┴─────────────────────────────┤
│                          Alerting & Dashboards                          │
└─────────────────────────────────────────────────────────────────────────┘
Architecture Overview
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ Application  │     │ Application  │     │ Application  │
│    Pod A     │     │    Pod B     │     │    Pod C     │
└──────┬───────┘     └──────┬───────┘     └──────┬───────┘
       │ metrics            │ logs               │ traces
       │ /metrics           │ stdout             │ OTLP
       ▼                    ▼                    ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Prometheus  │     │   Fluentd    │     │    Jaeger    │
│   Scraping   │     │  Collection  │     │  Collector   │
└──────┬───────┘     └──────┬───────┘     └──────┬───────┘
       │                    │                    │
       ▼                    ▼                    ▼
   ┌──────────────────────────────────────────────────┐
   │                     Grafana                      │
   │         Dashboards, Alerts, Exploration          │
   └──────────────────────────────────────────────────┘
1. Metrics with Prometheus
Install Prometheus Stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values prometheus-values.yaml
prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 15d
    resources:
      requests:
        memory: 1Gi
        cpu: 500m
      limits:
        memory: 2Gi
        cpu: 1
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          resources:
            requests:
              storage: 100Gi
    # Service discovery for pods with prometheus.io annotations
    additionalScrapeConfigs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__

alertmanager:
  config:
    global:
      slack_api_url: 'https://hooks.slack.com/services/xxx'
    route:
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: 'slack-notifications'
      routes:
        - match:
            severity: critical
          receiver: 'pagerduty'
    receivers:
      - name: 'slack-notifications'
        slack_configs:
          - channel: '#alerts'
            send_resolved: true
            title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
            text: '{{ .CommonAnnotations.description }}'
      - name: 'pagerduty'
        pagerduty_configs:
          - service_key: 'your-pagerduty-key'

grafana:
  adminPassword: 'secure-password'
  persistence:
    enabled: true
    size: 10Gi
Application Metrics Instrumentation
Node.js with prom-client
const client = require('prom-client');
const express = require('express');

const app = express();

// Create a registry.
const register = new client.Registry();

// Add default metrics (CPU, memory, etc.).
client.collectDefaultMetrics({ register });

// Custom metrics.
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.001, 0.005, 0.015, 0.05, 0.1, 0.5, 1, 5]
});
register.registerMetric(httpRequestDuration);

const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});
register.registerMetric(httpRequestsTotal);

const activeConnections = new client.Gauge({
  name: 'active_connections',
  help: 'Number of active connections'
});
register.registerMetric(activeConnections);

// Middleware to track metrics.
const metricsMiddleware = (req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const route = req.route?.path || req.path;
    httpRequestDuration.observe(
      { method: req.method, route, status_code: res.statusCode },
      duration
    );
    httpRequestsTotal.inc({ method: req.method, route, status_code: res.statusCode });
  });
  next();
};
app.use(metricsMiddleware);

// Metrics endpoint.
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
PHP/Laravel with Prometheus Exporter
// composer require promphp/prometheus_client_php
use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;
use Prometheus\Storage\Redis;

class MetricsController extends Controller
{
    private CollectorRegistry $registry;

    public function __construct()
    {
        $adapter = new Redis(['host' => env('REDIS_HOST')]);
        $this->registry = new CollectorRegistry($adapter);
    }

    public function index()
    {
        $renderer = new RenderTextFormat();

        return response($renderer->render($this->registry->getMetricFamilySamples()))
            ->header('Content-Type', RenderTextFormat::MIME_TYPE);
    }
}

// Middleware for request metrics.
class MetricsMiddleware
{
    private CollectorRegistry $registry;

    public function __construct()
    {
        // Same Redis-backed storage as the controller, so samples are shared.
        $this->registry = new CollectorRegistry(new Redis(['host' => env('REDIS_HOST')]));
    }

    public function handle($request, Closure $next)
    {
        $start = microtime(true);
        $response = $next($request);
        $duration = microtime(true) - $start;

        $histogram = $this->registry->getOrRegisterHistogram(
            'app',
            'http_request_duration_seconds',
            'Request duration',
            ['method', 'route', 'status'],
            [0.01, 0.05, 0.1, 0.5, 1, 5]
        );
        $histogram->observe(
            $duration,
            [$request->method(), $request->route()->uri(), $response->status()]
        );

        return $response;
    }
}
Pod Annotations for Scraping
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: app
          ports:
            - name: metrics
              containerPort: 8080
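With the annotations in place, it is worth confirming that Prometheus actually discovered the pod (Status → Targets in the Prometheus UI, or a quick query in Grafana Explore; the job name matches the scrape config above):

```promql
up{job="kubernetes-pods"}
```

A value of 1 means the target is being scraped; 0 means the scrape is failing.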
2. Centralized Logging
Install Loki Stack
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# grafana.enabled=false reuses the Grafana installed with kube-prometheus-stack.
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true \
  --set grafana.enabled=false \
  --values loki-values.yaml
loki-values.yaml
loki:
  persistence:
    enabled: true
    size: 100Gi
  config:
    limits_config:
      retention_period: 30d
    table_manager:
      retention_deletes_enabled: true
      retention_period: 30d

promtail:
  config:
    snippets:
      scrapeConfigs: |
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod
          pipeline_stages:
            - cri: {}
            - json:
                expressions:
                  level: level
                  message: message
                  timestamp: timestamp
            - labels:
                level:
            - timestamp:
                source: timestamp
                format: RFC3339
          relabel_configs:
            - source_labels:
                - __meta_kubernetes_pod_label_app
              target_label: app
            - source_labels:
                - __meta_kubernetes_namespace
              target_label: namespace
            - source_labels:
                - __meta_kubernetes_pod_name
              target_label: pod
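Once Promtail is shipping logs with those labels, they can be queried in Grafana Explore with LogQL. Two sketches (label names match the relabel_configs above; `myapp` and `production` are placeholders):

```logql
{app="myapp", namespace="production"} | json | level="error"

sum by (pod) (rate({app="myapp"} |= "error" [5m]))
```

The first returns only error-level lines, using the `level` label extracted by the pipeline; the second turns matching log lines into a per-pod error rate.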
Structured Logging in Applications
Node.js with Winston
const winston = require('winston');

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'user-service',
    version: process.env.APP_VERSION
  },
  transports: [
    new winston.transports.Console()
  ]
});

// Usage.
logger.info('User created', {
  userId: user.id,
  email: user.email,
  requestId: req.id
});

logger.error('Database connection failed', {
  error: err.message,
  stack: err.stack,
  retryCount: 3
});
PHP/Laravel Logging
// config/logging.php (use statements go at the top of the file)
use Monolog\Formatter\JsonFormatter;
use Monolog\Handler\StreamHandler;

'channels' => [
    'stdout' => [
        'driver' => 'monolog',
        'handler' => StreamHandler::class,
        'with' => [
            'stream' => 'php://stdout',
        ],
        'formatter' => JsonFormatter::class,
    ],
],

// Usage.
Log::channel('stdout')->info('User created', [
    'user_id' => $user->id,
    'email' => $user->email,
    'request_id' => request()->header('X-Request-ID'),
]);
3. Distributed Tracing with Jaeger
Install Jaeger
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update
helm install jaeger jaegertracing/jaeger \
  --namespace monitoring \
  --set provisionDataStore.cassandra=false \
  --set storage.type=elasticsearch \
  --set storage.elasticsearch.host=elasticsearch.monitoring.svc \
  --set collector.service.type=ClusterIP
OpenTelemetry Instrumentation
Node.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'user-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
    environment: process.env.NODE_ENV,
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://jaeger-collector.monitoring.svc:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Custom spans.
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('user-service');

async function processOrder(orderId) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttribute('order.id', orderId);
    try {
      // Process order logic.
      const result = await validateOrder(orderId);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}
4. Alerting Rules
Critical Alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: critical-alerts
  namespace: monitoring
spec:
  groups:
    - name: availability
      rules:
        - alert: ServiceDown
          expr: up{job="kubernetes-pods"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Service {{ $labels.instance }} is down"
            description: "{{ $labels.job }} has been down for more than 1 minute."
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)
              /
            sum(rate(http_requests_total[5m])) by (service)
              > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate on {{ $labels.service }}"
            description: "Error rate is {{ $value | humanizePercentage }} (>5%)"
        - alert: HighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
            ) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High latency on {{ $labels.service }}"
            description: "95th percentile latency is {{ $value }}s"
    - name: resources
      rules:
        - alert: HighMemoryUsage
          expr: |
            container_memory_usage_bytes{container!=""}
              /
            container_spec_memory_limit_bytes{container!=""}
              > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage in {{ $labels.pod }}"
            description: "Memory usage is {{ $value | humanizePercentage }}"
        - alert: HighCPUUsage
          expr: |
            sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace)
              /
            sum(container_spec_cpu_quota{container!=""} / container_spec_cpu_period{container!=""}) by (pod, namespace)
              > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage in {{ $labels.pod }}"
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
    - name: database
      rules:
        - alert: DatabaseConnectionsExhausted
          expr: |
            pg_stat_activity_count{datname!~"template.*"}
              /
            pg_settings_max_connections > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "PostgreSQL connections near limit"
        - alert: DatabaseReplicationLag
          expr: pg_replication_lag > 300
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "PostgreSQL replication lag is {{ $value }}s"
5. Grafana Dashboards
Application Dashboard (JSON)
{
  "title": "Application Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[5m])) by (service)",
          "legendFormat": "{{ service }}"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{status_code=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)",
          "legendFormat": "{{ service }}"
        }
      ]
    },
    {
      "title": "Latency (p95)",
      "type": "graph",
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
          "legendFormat": "{{ service }}"
        }
      ]
    },
    {
      "title": "Active Pods",
      "type": "stat",
      "targets": [
        {
          "expr": "count(kube_pod_status_phase{phase=\"Running\", namespace=\"production\"})"
        }
      ]
    }
  ]
}
The Senior Observability Mindset
"The server is down." "Why?" "I don't know; I can't SSH in." If you rely on SSH to view logs, you have already lost.
Senior engineers treat observability as a core part of system design. They build the signals that make debugging fast and incident response effective.
Define SLOs First
Observability is more useful when tied to clear objectives. Define SLOs for critical workflows such as API availability and latency. Use those SLOs to set alert thresholds and dashboard targets. If you don't know what "good" looks like, you cannot detect when the system is unhealthy.
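As a sketch of wiring an SLO into alerting: assuming the http_requests_total counter from the metrics section and a 99.9% availability SLO, a fast-burn rule fires when the error budget is being consumed 14.4x faster than sustainable (the multiplier follows the common multiwindow burn-rate approach; tune windows and factors for your own SLOs):

```yaml
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{status_code=~"5.."}[1h]))
        /
      sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Error budget for the 99.9% SLO is burning 14.4x too fast"
```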
Structured Logging: Stop Logging Strings
Bad: [ERROR] User 123 failed payment: timeout
Good: {"level": "error", "user_id": 123, "action": "payment", "error": "timeout"}
With structured logs, your log explorer (Grafana with Loki, or Kibana) can filter by action: payment in one click. With raw strings, you are writing regexes.
Correlation IDs Are Non-Negotiable
In microservices, one user click triggers logs in five services.
Pattern:
1. The gateway generates an X-Request-ID.
2. Every service logs request_id on every line.
3. Every service passes the header downstream.
Result: one query in your log store shows the full request path across the cluster.
Log Retention Strategy
Logs are expensive. Plan accordingly:
- Hot (SSD): Last 7 days. Fast search.
- Warm (HDD): Last 30 days. Slow search.
- Cold (S3): Last 1 year. Archived.
- Delete: Automated Index Lifecycle Management (ILM) policy.
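With Elasticsearch-backed log storage, the tiers above map directly onto an ILM policy. A sketch (the phase ages, the warm node attribute, and the s3-archive snapshot repository are illustrative; searchable snapshots require an appropriate license):

```json
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": { "require": { "data": "warm" } }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "searchable_snapshot": { "snapshot_repository": "s3-archive" }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": { "delete": {} }
      }
    }
  }
}
```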
Design Alerting for Action
Alerts should trigger clear actions. Use thresholds that reflect real user impact and avoid alerting on noise. If an alert cannot be acted on, it should be removed or changed.
Use multi-signal alerts when possible, such as error rate plus latency plus saturation. This reduces false positives.
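A sketch of such a rule, reusing the error-rate and p95-latency expressions from the alerting section; PromQL's and operator makes the alert fire only when both conditions hold for the same service label (thresholds are illustrative):

```yaml
- alert: ServiceDegraded
  expr: |
    (
      sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)
        /
      sum(rate(http_requests_total[5m])) by (service) > 0.02
    )
    and
    (
      histogram_quantile(0.95,
        sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
      ) > 0.5
    )
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.service }} has elevated errors and latency"
```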
Common Pitfalls
- Logging too much without structure.
- Missing correlation IDs across services.
- Alerting on every error rather than user impact.
- No retention policies, leading to runaway costs.
- Dashboards that are out of date or unused.
Quick Container Log Access
Add simple aliases for on-call engineers to stream logs without waiting for the central pipeline:
# Quick pod log access.
alias klogs='kubectl logs -f'
alias dlogs='docker logs -f'
# Stream logs from all pods of a deployment.
kubectl logs -f deployment/myapp --all-containers
Observability Checklist
Metrics
- [ ] System metrics collected (CPU, memory, disk, network)
- [ ] Application metrics instrumented (RED metrics)
- [ ] Business metrics tracked
- [ ] Dashboards created for key services
- [ ] Alerting rules defined for SLOs
Logs
- [ ] Structured logging implemented (JSON format)
- [ ] Log levels used appropriately
- [ ] Request IDs propagated across all services
- [ ] Sensitive data not logged
- [ ] Log retention configured with ILM policies
Traces
- [ ] Distributed tracing enabled
- [ ] Context propagation working
- [ ] Critical paths traced
- [ ] Sampling configured appropriately
Alerting
- [ ] On-call rotation defined
- [ ] Escalation policies configured
- [ ] Alert fatigue minimized
- [ ] Runbooks linked to alerts
- [ ] Multi-signal alerts for critical paths
The Three Pillars Summary
- Logs tell you why something happened.
- Metrics tell you when.
- Traces tell you where.
You need all three. A good observability stack turns complex systems into understandable ones.