Problem Statement
You need comprehensive visibility into your production systems, including metrics, logs, traces, and alerts, to quickly identify, diagnose, and resolve issues.
Three Pillars of Observability
┌───────────────────────────────────────────────────────────────────────────┐
│                          Observability Platform                           │
├─────────────────────┬─────────────────────┬─────────────────────────────┤
│       Metrics       │        Logs         │           Traces            │
│                     │                     │                             │
│ • System metrics    │ • Application logs  │ • Request flow              │
│ • Application       │ • Audit logs        │ • Service dependencies      │
│   metrics           │ • Access logs       │ • Latency breakdown         │
│ • Business          │ • Error logs        │ • Error propagation         │
│   metrics           │                     │                             │
├─────────────────────┴─────────────────────┴─────────────────────────────┤
│                          Alerting & Dashboards                          │
└─────────────────────────────────────────────────────────────────────────┘
Architecture Overview
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ Application  │     │ Application  │     │ Application  │
│    Pod A     │     │    Pod B     │     │    Pod C     │
└──────┬───────┘     └──────┬───────┘     └──────┬───────┘
       │ metrics            │ logs               │ traces
       │ /metrics           │ stdout             │ OTLP
       ▼                    ▼                    ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Prometheus  │     │   Fluentd    │     │    Jaeger    │
│   Scraping   │     │  Collection  │     │  Collector   │
└──────┬───────┘     └──────┬───────┘     └──────┬───────┘
       │                    │                    │
       ▼                    ▼                    ▼
   ┌──────────────────────────────────────────────────┐
   │                     Grafana                      │
   │         Dashboards, Alerts, Exploration          │
   └──────────────────────────────────────────────────┘
1. Metrics with Prometheus
Install Prometheus Stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values prometheus-values.yaml
prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 15d
    resources:
      requests:
        memory: 1Gi
        cpu: 500m
      limits:
        memory: 2Gi
        cpu: 1
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          resources:
            requests:
              storage: 100Gi
    # Service discovery for pods with prometheus.io annotations
    additionalScrapeConfigs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__

alertmanager:
  config:
    global:
      slack_api_url: 'https://hooks.slack.com/services/xxx'
    route:
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: 'slack-notifications'
      routes:
        - match:
            severity: critical
          receiver: 'pagerduty'
    receivers:
      - name: 'slack-notifications'
        slack_configs:
          - channel: '#alerts'
            send_resolved: true
            title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
            text: '{{ .CommonAnnotations.description }}'
      - name: 'pagerduty'
        pagerduty_configs:
          - service_key: 'your-pagerduty-key'

grafana:
  adminPassword: 'secure-password'
  persistence:
    enabled: true
    size: 10Gi
Application Metrics Instrumentation
Node.js with prom-client
const client = require('prom-client');
const express = require('express');

const app = express();

// Create a registry.
const register = new client.Registry();

// Add default metrics (CPU, memory, etc.).
client.collectDefaultMetrics({ register });

// Custom metrics.
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.001, 0.005, 0.015, 0.05, 0.1, 0.5, 1, 5]
});
register.registerMetric(httpRequestDuration);

const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});
register.registerMetric(httpRequestsTotal);

const activeConnections = new client.Gauge({
  name: 'active_connections',
  help: 'Number of active connections'
});
register.registerMetric(activeConnections);

// Middleware to track metrics.
const metricsMiddleware = (req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const route = req.route?.path || req.path;
    httpRequestDuration.observe(
      { method: req.method, route, status_code: res.statusCode },
      duration
    );
    httpRequestsTotal.inc({ method: req.method, route, status_code: res.statusCode });
  });
  next();
};
app.use(metricsMiddleware);

// Metrics endpoint.
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
PHP/Laravel with Prometheus Exporter
// composer require promphp/prometheus_client_php
use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;
use Prometheus\Storage\Redis;

class MetricsController extends Controller
{
    private CollectorRegistry $registry;

    public function __construct()
    {
        $adapter = new Redis(['host' => env('REDIS_HOST')]);
        $this->registry = new CollectorRegistry($adapter);
    }

    public function index()
    {
        $renderer = new RenderTextFormat();

        return response($renderer->render($this->registry->getMetricFamilySamples()))
            ->header('Content-Type', RenderTextFormat::MIME_TYPE);
    }
}

// Middleware for request metrics.
class MetricsMiddleware
{
    private CollectorRegistry $registry;

    public function __construct()
    {
        // Same Redis-backed storage as the controller, so samples are shared.
        $this->registry = new CollectorRegistry(new Redis(['host' => env('REDIS_HOST')]));
    }

    public function handle($request, Closure $next)
    {
        $start = microtime(true);
        $response = $next($request);
        $duration = microtime(true) - $start;

        $histogram = $this->registry->getOrRegisterHistogram(
            'app',
            'http_request_duration_seconds',
            'Request duration',
            ['method', 'route', 'status'],
            [0.01, 0.05, 0.1, 0.5, 1, 5]
        );
        $histogram->observe(
            $duration,
            [$request->method(), $request->route()->uri(), $response->status()]
        );

        return $response;
    }
}
Pod Annotations for Scraping
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: app
          ports:
            - name: metrics
              containerPort: 8080
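With the annotations in place, it is worth confirming that Prometheus actually discovered the pod (Status → Targets in the Prometheus UI, or a quick query in Grafana Explore; the job name matches the scrape config above):

```promql
up{job="kubernetes-pods"}
```

A value of 1 means the target is being scraped; 0 means the scrape is failing.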
2. Centralized Logging
Install Loki Stack
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# grafana.enabled=false reuses the Grafana installed with kube-prometheus-stack.
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true \
  --set grafana.enabled=false \
  --values loki-values.yaml
loki-values.yaml
loki:
  persistence:
    enabled: true
    size: 100Gi
  config:
    limits_config:
      retention_period: 30d
    table_manager:
      retention_deletes_enabled: true
      retention_period: 30d

promtail:
  config:
    snippets:
      scrapeConfigs: |
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod
          pipeline_stages:
            - cri: {}
            - json:
                expressions:
                  level: level
                  message: message
                  timestamp: timestamp
            - labels:
                level:
            - timestamp:
                source: timestamp
                format: RFC3339
          relabel_configs:
            - source_labels:
                - __meta_kubernetes_pod_label_app
              target_label: app
            - source_labels:
                - __meta_kubernetes_namespace
              target_label: namespace
            - source_labels:
                - __meta_kubernetes_pod_name
              target_label: pod
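Once Promtail is shipping logs with those labels, they can be queried in Grafana Explore with LogQL. Two sketches (label names match the relabel_configs above; `myapp` and `production` are placeholders):

```logql
{app="myapp", namespace="production"} | json | level="error"

sum by (pod) (rate({app="myapp"} |= "error" [5m]))
```

The first returns only error-level lines, using the `level` label extracted by the pipeline; the second turns matching log lines into a per-pod error rate.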
Structured Logging in Applications
Node.js with Winston
const winston = require('winston');

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'user-service',
    version: process.env.APP_VERSION
  },
  transports: [
    new winston.transports.Console()
  ]
});

// Usage.
logger.info('User created', {
  userId: user.id,
  email: user.email,
  requestId: req.id
});

logger.error('Database connection failed', {
  error: err.message,
  stack: err.stack,
  retryCount: 3
});
PHP/Laravel Logging
// config/logging.php (use statements go at the top of the file)
use Monolog\Formatter\JsonFormatter;
use Monolog\Handler\StreamHandler;

'channels' => [
    'stdout' => [
        'driver' => 'monolog',
        'handler' => StreamHandler::class,
        'with' => [
            'stream' => 'php://stdout',
        ],
        'formatter' => JsonFormatter::class,
    ],
],

// Usage.
Log::channel('stdout')->info('User created', [
    'user_id' => $user->id,
    'email' => $user->email,
    'request_id' => request()->header('X-Request-ID'),
]);
3. Distributed Tracing with Jaeger
Install Jaeger
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update
helm install jaeger jaegertracing/jaeger \
  --namespace monitoring \
  --set provisionDataStore.cassandra=false \
  --set storage.type=elasticsearch \
  --set storage.elasticsearch.host=elasticsearch.monitoring.svc \
  --set collector.service.type=ClusterIP
OpenTelemetry Instrumentation
Node.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'user-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
    environment: process.env.NODE_ENV,
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://jaeger-collector.monitoring.svc:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Custom spans.
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('user-service');

async function processOrder(orderId) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttribute('order.id', orderId);
    try {
      // Process order logic.
      const result = await validateOrder(orderId);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}
4. Alerting Rules
Critical Alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: critical-alerts
  namespace: monitoring
spec:
  groups:
    - name: availability
      rules:
        - alert: ServiceDown
          expr: up{job="kubernetes-pods"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Service {{ $labels.instance }} is down"
            description: "{{ $labels.job }} has been down for more than 1 minute."
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)
              /
            sum(rate(http_requests_total[5m])) by (service)
              > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate on {{ $labels.service }}"
            description: "Error rate is {{ $value | humanizePercentage }} (>5%)"
        - alert: HighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
            ) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High latency on {{ $labels.service }}"
            description: "95th percentile latency is {{ $value }}s"
    - name: resources
      rules:
        - alert: HighMemoryUsage
          expr: |
            container_memory_usage_bytes{container!=""}
              /
            container_spec_memory_limit_bytes{container!=""}
              > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage in {{ $labels.pod }}"
            description: "Memory usage is {{ $value | humanizePercentage }}"
        - alert: HighCPUUsage
          expr: |
            sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace)
              /
            sum(container_spec_cpu_quota{container!=""} / container_spec_cpu_period{container!=""}) by (pod, namespace)
              > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage in {{ $labels.pod }}"
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
    - name: database
      rules:
        - alert: DatabaseConnectionsExhausted
          expr: |
            pg_stat_activity_count{datname!~"template.*"}
              /
            pg_settings_max_connections > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "PostgreSQL connections near limit"
        - alert: DatabaseReplicationLag
          expr: pg_replication_lag > 300
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "PostgreSQL replication lag is {{ $value }}s"
5. Grafana Dashboards
Application Dashboard (JSON)
{
  "title": "Application Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[5m])) by (service)",
          "legendFormat": "{{ service }}"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{status_code=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)",
          "legendFormat": "{{ service }}"
        }
      ]
    },
    {
      "title": "Latency (p95)",
      "type": "graph",
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
          "legendFormat": "{{ service }}"
        }
      ]
    },
    {
      "title": "Active Pods",
      "type": "stat",
      "targets": [
        {
          "expr": "count(kube_pod_status_phase{phase=\"Running\", namespace=\"production\"})"
        }
      ]
    }
  ]
}
The Senior Observability Mindset
"The server is down." "Why?" "I don't know; I can't SSH in." If you rely on SSH to view logs, you have already lost.
Senior engineers treat observability as a core part of system design. They build the signals that make debugging fast and incident response effective.
Define SLOs First
Observability is more useful when tied to clear objectives. Define SLOs for critical workflows such as API availability and latency. Use those SLOs to set alert thresholds and dashboard targets. If you don't know what "good" looks like, you cannot detect when the system is unhealthy.
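As a sketch of wiring an SLO into alerting: assuming the http_requests_total counter from the metrics section and a 99.9% availability SLO, a fast-burn rule fires when the error budget is being consumed 14.4x faster than sustainable (the multiplier follows the common multiwindow burn-rate approach; tune windows and factors for your own SLOs):

```yaml
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{status_code=~"5.."}[1h]))
        /
      sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Error budget for the 99.9% SLO is burning 14.4x too fast"
```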
Structured Logging: Stop Logging Strings
Bad: [ERROR] User 123 failed payment: timeout
Good: {"level": "error", "user_id": 123, "action": "payment", "error": "timeout"}
With structured logs, your log explorer (Grafana with Loki, or Kibana) can filter by action: payment in one click. With raw strings, you are writing regexes.
Correlation IDs Are Non-Negotiable
In microservices, one user click triggers logs in five services.
Pattern:
1. The gateway generates an X-Request-ID.
2. Every service logs request_id on every line.
3. Every service passes the header downstream.
Result: one query in your log store shows the full request path across the cluster.
Log Retention Strategy
Logs are expensive. Plan accordingly:
- Hot (SSD): Last 7 days. Fast search.
- Warm (HDD): Last 30 days. Slow search.
- Cold (S3): Last 1 year. Archived.
- Delete: Automated Index Lifecycle Management (ILM) policy.
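With Elasticsearch-backed log storage, the tiers above map directly onto an ILM policy. A sketch (the phase ages, the warm node attribute, and the s3-archive snapshot repository are illustrative; searchable snapshots require an appropriate license):

```json
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": { "require": { "data": "warm" } }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "searchable_snapshot": { "snapshot_repository": "s3-archive" }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": { "delete": {} }
      }
    }
  }
}
```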
Design Alerting for Action
Alerts should trigger clear actions. Use thresholds that reflect real user impact and avoid alerting on noise. If an alert cannot be acted on, it should be removed or changed.
Use multi-signal alerts when possible, such as error rate plus latency plus saturation. This reduces false positives.
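A sketch of such a rule, reusing the error-rate and p95-latency expressions from the alerting section; PromQL's and operator makes the alert fire only when both conditions hold for the same service label (thresholds are illustrative):

```yaml
- alert: ServiceDegraded
  expr: |
    (
      sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)
        /
      sum(rate(http_requests_total[5m])) by (service) > 0.02
    )
    and
    (
      histogram_quantile(0.95,
        sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
      ) > 0.5
    )
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.service }} has elevated errors and latency"
```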
Common Pitfalls
- Logging too much without structure.
- Missing correlation IDs across services.
- Alerting on every error rather than user impact.
- No retention policies, leading to runaway costs.
- Dashboards that are out of date or unused.
Quick Container Log Access
Add simple aliases for on-call engineers to stream logs without waiting for the central pipeline:
# Quick pod log access.
alias klogs='kubectl logs -f'
alias dlogs='docker logs -f'
# Stream logs from all pods of a deployment.
kubectl logs -f deployment/myapp --all-containers
Observability Checklist
Metrics
- [ ] System metrics collected (CPU, memory, disk, network)
- [ ] Application metrics instrumented (RED metrics)
- [ ] Business metrics tracked
- [ ] Dashboards created for key services
- [ ] Alerting rules defined for SLOs
Logs
- [ ] Structured logging implemented (JSON format)
- [ ] Log levels used appropriately
- [ ] Request IDs propagated across all services
- [ ] Sensitive data not logged
- [ ] Log retention configured with ILM policies
Traces
- [ ] Distributed tracing enabled
- [ ] Context propagation working
- [ ] Critical paths traced
- [ ] Sampling configured appropriately
Alerting
- [ ] On-call rotation defined
- [ ] Escalation policies configured
- [ ] Alert fatigue minimized
- [ ] Runbooks linked to alerts
- [ ] Multi-signal alerts for critical paths
The Three Pillars Summary
- Logs tell you why something happened.
- Metrics tell you when.
- Traces tell you where.
You need all three. A good observability stack turns complex systems into understandable ones.