
Problem Statement

You need to protect your systems against data loss, minimize downtime during disasters, and ensure business continuity with tested recovery procedures.

Recovery Objectives

| Metric | Definition | Typical Targets |
|--------|------------|-----------------|
| RPO (Recovery Point Objective) | Maximum acceptable data loss | 0–24 hours |
| RTO (Recovery Time Objective) | Maximum acceptable downtime | Minutes–hours |
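The worst case behind an RPO target is simple arithmetic: a failure just before the next backup completes loses everything since the previous successful one. A back-of-the-envelope sketch in shell (the numbers are illustrative, not from this document):

```shell
#!/bin/bash
# Worst-case data loss = backup interval + time to finish and upload a backup.
# All values in minutes.
interval_min=60   # hourly backups
duration_min=10   # each dump takes ~10 minutes to complete and upload
rpo_min=120       # 2-hour RPO target

worst_case=$((interval_min + duration_min))
echo "Worst-case data loss: ${worst_case} minutes"
if [ "$worst_case" -le "$rpo_min" ]; then
  echo "RPO met"
else
  echo "RPO violated"
fi
```

Hourly backups that take 10 minutes satisfy a 2-hour RPO but would violate a 1-hour one.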

Backup Architecture

┌───────────────────────────────────────────────────────────────────┐
│                      Production Environment                       │
├────────────────┬────────────────┬────────────────┬────────────────┤
│   Kubernetes   │    Database    │  Object Store  │ Configuration  │
│   Workloads    │   (Primary)    │   (S3/MinIO)   │  (GitOps Repo) │
└───────┬────────┴───────┬────────┴───────┬────────┴───────┬────────┘
        │                │                │                │
        ▼                ▼                ▼                ▼
┌───────────────────────────────────────────────────────────────────┐
│                           Backup Layer                            │
├────────────────┬────────────────┬────────────────┬────────────────┤
│     Velero     │   pg_dump /    │   S3 Cross-    │   Git Mirror   │
│   Snapshots    │   mysqldump    │     Region     │                │
└───────┬────────┴───────┬────────┴───────┬────────┴───────┬────────┘
        │                │                │                │
        └────────────────┴───────┬────────┴────────────────┘
                                 │
                     ┌───────────▼───────────┐
                     │    Offsite Backup     │
                     │   Storage (S3/GCS)    │
                     │   Different Region    │
                     └───────────────────────┘

1. Kubernetes Backup with Velero

Install Velero

# Install the Velero CLI.
wget https://github.com/vmware-tanzu/velero/releases/download/v1.12.0/velero-v1.12.0-linux-amd64.tar.gz
tar -xvf velero-v1.12.0-linux-amd64.tar.gz
mv velero-v1.12.0-linux-amd64/velero /usr/local/bin/

# Install Velero in the cluster.
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket velero-backups \
  --secret-file ./credentials-velero \
  --backup-location-config region=us-west-2,s3ForcePathStyle="true",s3Url=https://s3.us-west-2.amazonaws.com \
  --snapshot-location-config region=us-west-2

credentials-velero

[default]
aws_access_key_id=YOUR_ACCESS_KEY
aws_secret_access_key=YOUR_SECRET_KEY

Backup Schedules

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM.
  template:
    includedNamespaces:
      - production
      - staging
    excludedResources:
      - events
      - pods
    storageLocation: default
    ttl: 720h  # 30-day retention.
    snapshotVolumes: true
    volumeSnapshotLocations:
      - default
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: hourly-backup
  namespace: velero
spec:
  schedule: "0 * * * *"  # Every hour.
  template:
    includedNamespaces:
      - production
    includedResources:
      - configmaps
      - secrets
      - deployments
      - services
      - ingresses
    ttl: 168h  # 7-day retention.
    snapshotVolumes: false

Manual Backup

# Create a backup of all production resources.
velero backup create production-backup-$(date +%Y%m%d) \
  --include-namespaces production \
  --snapshot-volumes
# Create a backup before major changes.
velero backup create pre-migration-backup \
  --include-namespaces production \
  --wait

Restore from Backup

# List backups.
velero backup get
# Restore to the same cluster.
velero restore create --from-backup production-backup-20240115 \
  --include-namespaces production
# Restore to a different namespace (for testing).
velero restore create --from-backup production-backup-20240115 \
  --namespace-mappings production:restore-test

2. Database Backup Strategies

PostgreSQL Continuous Archiving (WAL)

# postgresql.conf
archive_mode = on
archive_command = 'aws s3 cp %p s3://db-backups/wal/%f'
wal_level = replica

#!/bin/bash
# Base backup script.
set -e
BACKUP_NAME="base-$(date +%Y%m%d_%H%M%S)"
pg_basebackup -D /tmp/$BACKUP_NAME -Ft -z -Xs -P
# -Ft -z writes gzipped tarballs (base.tar.gz, pg_wal.tar.gz) into the directory.
aws s3 cp --recursive /tmp/$BACKUP_NAME s3://db-backups/base/$BACKUP_NAME/
rm -rf /tmp/$BACKUP_NAME
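The archived WAL segments are what make point-in-time recovery possible: restore the base backup into an empty data directory, then let PostgreSQL replay WAL up to a target timestamp. A sketch of the recovery settings (PostgreSQL 12+; the bucket path mirrors the archive_command above, and the target time is illustrative):

```ini
# postgresql.conf — recovery settings. PostgreSQL 12+ also needs an empty
# recovery.signal file in the data directory to enter recovery mode.
restore_command = 'aws s3 cp s3://db-backups/wal/%f %p'
recovery_target_time = '2024-01-15 10:30:00'
```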

PostgreSQL Backup with pgBackRest

# /etc/pgbackrest/pgbackrest.conf
[global]
repo1-path=/backup/pgbackrest
repo1-retention-full=2
repo1-retention-diff=14
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=your-encryption-key

[main]
pg1-path=/var/lib/pgsql/data

# Full backup (weekly).
pgbackrest --stanza=main backup --type=full
# Differential backup (daily).
pgbackrest --stanza=main backup --type=diff
# Point-in-time recovery.
pgbackrest --stanza=main restore \
  --type=time \
  --target="2024-01-15 10:30:00"

MySQL/Percona Backup with Percona XtraBackup

# Full backup.
xtrabackup --backup --target-dir=/backup/full/$(date +%Y%m%d)
# Incremental backup.
xtrabackup --backup --target-dir=/backup/incr/$(date +%Y%m%d) \
  --incremental-basedir=/backup/full/20240115
# Prepare for restore: the base must be prepared with --apply-log-only
# so the incremental can still be applied on top of it.
xtrabackup --prepare --apply-log-only --target-dir=/backup/full/20240115
xtrabackup --prepare --target-dir=/backup/full/20240115 \
  --incremental-dir=/backup/incr/20240116
# Restore.
systemctl stop mysql
rm -rf /var/lib/mysql/*
xtrabackup --copy-back --target-dir=/backup/full/20240115
chown -R mysql:mysql /var/lib/mysql
systemctl start mysql

Kubernetes CronJob for Database Backup

apiVersion: batch/v1
kind: CronJob
metadata:
  name: database-backup
  namespace: production
spec:
  schedule: "0 3 * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: postgres:15
              command:
                - /bin/bash
                - -c
                - |
                  set -e
                  BACKUP_NAME="backup-$(date +%Y%m%d-%H%M%S).sql.gz"
                  pg_dump -h $DB_HOST -U $DB_USER $DB_NAME | gzip > /tmp/$BACKUP_NAME
                  aws s3 cp /tmp/$BACKUP_NAME s3://db-backups/postgres/$BACKUP_NAME
                  # Clean up old backups (keep the last 30).
                  aws s3 ls s3://db-backups/postgres/ | sort -r | tail -n +31 | \
                    awk '{print $4}' | xargs -I {} aws s3 rm s3://db-backups/postgres/{}
              env:
                - name: DB_HOST
                  value: postgres-service
                - name: DB_USER
                  valueFrom:
                    secretKeyRef:
                      name: db-credentials
                      key: username
                - name: DB_NAME
                  value: production
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: db-credentials
                      key: password
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: s3-credentials
                      key: access-key
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: s3-credentials
                      key: secret-key

3. Multi-Region Disaster Recovery

Active-Passive Setup

# Primary region.
apiVersion: v1
kind: Service
metadata:
  name: app-primary
  annotations:
    external-dns.alpha.kubernetes.io/hostname: app.example.com
    external-dns.alpha.kubernetes.io/ttl: "60"
spec:
  type: LoadBalancer
  selector:
    app: myapp
---
# Secondary region (standby).
apiVersion: v1
kind: Service
metadata:
  name: app-secondary
  annotations:
    # DNS record only activated during failover.
    external-dns.alpha.kubernetes.io/hostname: app-dr.example.com
spec:
  type: LoadBalancer
  selector:
    app: myapp

Database Replication Across Regions

# CloudNativePG cross-region replica.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-replica
  namespace: production
spec:
  instances: 2
  replica:
    enabled: true
    source: postgres-primary
  externalClusters:
    - name: postgres-primary
      connectionParameters:
        host: postgres-primary.us-west-2.rds.amazonaws.com
        user: replicator
      password:
        name: replication-credentials
        key: password

4. Backup Verification and Testing

Automated Restore Testing

apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-verification
spec:
  schedule: "0 6 * * 0"  # Weekly on Sunday.
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: verify
              image: backup-verifier:latest
              command:
                - /bin/bash
                - -c
                - |
                  set -e
                  # Get the latest backup.
                  LATEST=$(aws s3 ls s3://db-backups/postgres/ | sort -r | head -1 | awk '{print $4}')
                  # Download and restore to the test database.
                  aws s3 cp s3://db-backups/postgres/$LATEST /tmp/backup.sql.gz
                  gunzip /tmp/backup.sql.gz
                  # Create a test database.
                  psql -h $TEST_DB_HOST -U $DB_USER -c "DROP DATABASE IF EXISTS backup_test"
                  psql -h $TEST_DB_HOST -U $DB_USER -c "CREATE DATABASE backup_test"
                  psql -h $TEST_DB_HOST -U $DB_USER -d backup_test -f /tmp/backup.sql
                  # Run verification queries.
                  USERS=$(psql -h $TEST_DB_HOST -U $DB_USER -d backup_test -t -c "SELECT COUNT(*) FROM users")
                  ORDERS=$(psql -h $TEST_DB_HOST -U $DB_USER -d backup_test -t -c "SELECT COUNT(*) FROM orders")
                  # Verify data integrity.
                  if [ "$USERS" -gt 0 ] && [ "$ORDERS" -gt 0 ]; then
                    echo "Backup verification PASSED"
                    curl -X POST $SLACK_WEBHOOK -d '{"text":"✅ Weekly backup verification passed"}'
                  else
                    echo "Backup verification FAILED"
                    curl -X POST $SLACK_WEBHOOK -d '{"text":"❌ Weekly backup verification FAILED"}'
                    exit 1
                  fi
                  # Clean up.
                  psql -h $TEST_DB_HOST -U $DB_USER -c "DROP DATABASE backup_test"

5. Disaster Recovery Runbook

DR Activation Checklist

## Disaster Recovery Activation

### Pre-Activation (Assess)

- [ ] Confirm primary site is unavailable
- [ ] Estimate time to recover primary
- [ ] Get management approval for failover
- [ ] Notify stakeholders

### Activation (Execute)

1. [ ] Verify DR site health
   ```bash
   kubectl get nodes
   kubectl get pods -A
   ```
2. [ ] Activate database replica
   ```bash
   # Promote PostgreSQL replica
   kubectl exec -it postgres-0 -- pg_ctl promote
   ```
3. [ ] Update DNS records
   ```bash
   # Point traffic to DR site
   aws route53 change-resource-record-sets ...
   ```
4. [ ] Verify application connectivity
   ```bash
   curl -I https://app.example.com/health
   ```
5. [ ] Monitor for errors
   ```bash
   kubectl logs -f deployment/myapp
   ```

### Post-Activation (Verify)

- [ ] Verify all services operational
- [ ] Check database consistency
- [ ] Monitor error rates
- [ ] Communicate status to stakeholders
- [ ] Document timeline and actions taken

### Failback (Return to Primary)

- [ ] Restore primary site
- [ ] Sync data from DR to primary
- [ ] Test primary site
- [ ] Switch traffic back
- [ ] Deactivate DR site

6. Backup Encryption and Security

Encrypt Backups at Rest

# Use GPG for backup encryption.
pg_dump mydb | gpg --encrypt --recipient [email protected] > backup.sql.gpg
# Decrypt with GPG.
gpg --decrypt backup.sql.gpg | psql mydb
# Use OpenSSL.
pg_dump mydb | openssl enc -aes-256-cbc -salt -pass file:/path/to/keyfile > backup.sql.enc
# Decrypt with OpenSSL.
openssl enc -d -aes-256-cbc -pass file:/path/to/keyfile < backup.sql.enc | psql mydb

S3 Bucket Policy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EnforceEncryption",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::db-backups/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption": "aws:kms"
        }
      }
    },
    {
      "Sid": "RestrictAccess",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::db-backups",
        "arn:aws:s3:::db-backups/*"
      ],
      "Condition": {
        "NotIpAddress": {
          "aws:SourceIp": ["10.0.0.0/8"]
        }
      }
    }
  ]
}

The Senior DR Mindset

The etcd Problem: Cluster State Backups

Your Kubernetes cluster state lives in etcd. If etcd dies and you don't have a backup, you're rebuilding the entire cluster from scratch.

Encrypted etcd Backups:

# Create an encrypted backup.
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
# Encrypt and upload.
gpg --encrypt --recipient [email protected] /backup/etcd-*.db
aws s3 cp /backup/etcd-*.db.gpg s3://cluster-backups/etcd/

Schedule this daily at minimum. An outdated etcd backup is better than no backup.
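The snapshot command above can be driven straight from cron; a minimal sketch of a system crontab entry (paths and schedule are illustrative, and note that `%` must be escaped as `\%` in crontab):

```crontab
# /etc/cron.d/etcd-backup — daily snapshot at 01:30, run as root.
30 1 * * * root ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +\%Y\%m\%d).db --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
```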

Testing Philosophy: Untested Backups Are Worthless

Senior Rule: A backup you haven't tested is not a backup—it's a hope.

What to Test:

1. Can you download the backup? (Network, permissions, encryption keys)
2. Can you restore the backup? (Format, corruption, completeness)
3. Is the data correct? (Row counts, checksums, application tests)
4. How long does a restore take? (Does it meet your RTO?)

Schedule automated restore tests weekly. If a test fails, someone should be paged.
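The restore-duration question is directly measurable: wrap the restore in timestamps and compare against the RTO budget. A minimal sketch, assuming a 4-hour RTO; the `sleep` stands in for the real restore command:

```shell
#!/bin/bash
set -e
RTO_SECONDS=14400  # 4-hour RTO target (illustrative).

start=$(date +%s)
# Run the real restore here; a short sleep stands in for it in this sketch.
sleep 1
end=$(date +%s)

elapsed=$((end - start))
echo "Restore took ${elapsed}s (RTO budget: ${RTO_SECONDS}s)"
if [ "$elapsed" -le "$RTO_SECONDS" ]; then
  result="PASS"
else
  result="FAIL"
fi
echo "RTO check: $result"
```

Wire the `FAIL` branch into the same paging path as the verification CronJob so a slow restore gets noticed before a disaster does.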

The Failover Decision Framework

Failover is expensive and risky. Use this framework:

| Question | If Yes | If No |
|----------|--------|-------|
| Is the primary site completely unavailable? | Continue to next | Wait and monitor |
| Will recovery take longer than RTO? | Continue to next | Wait and recover |
| Is data in the DR site current (within RPO)? | Proceed with failover | Assess data loss risk |
| Do stakeholders approve the data loss? | Execute failover | Wait or find alternatives |

Never fail over without explicit approval. The person who decides should understand the data loss implications.
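The framework reads as an ordered ladder of gates, which keeps the runbook and the decision from drifting apart. A sketch in shell (the function and its answers simply mirror the table; it is not a real tool):

```shell
#!/bin/bash
# Walk the gates in order; the first "no" short-circuits to the table's
# holding action. Arguments: unavailable exceeds_rto within_rpo approved.
failover_decision() {
  local unavailable=$1 exceeds_rto=$2 within_rpo=$3 approved=$4
  [ "$unavailable" = yes ] || { echo "wait and monitor"; return; }
  [ "$exceeds_rto" = yes ] || { echo "wait and recover primary"; return; }
  [ "$within_rpo" = yes ]  || { echo "assess data loss risk"; return; }
  [ "$approved" = yes ]    || { echo "wait or find alternatives"; return; }
  echo "execute failover"
}

failover_decision yes yes yes yes   # -> execute failover
failover_decision yes no  yes yes   # -> wait and recover primary
```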

Failback: The Forgotten Half

Planning failover without planning failback leaves you stuck in DR mode indefinitely.

Failback Considerations:

  • How do you sync data back to the primary site?
  • How do you verify the primary is healthy before switching?
  • How do you minimize the second switch's downtime?
  • What if the primary fails again during failback?

Document and test failback procedures alongside failover.

DR Checklist

Backup Strategy

  • [ ] RPO and RTO defined and documented
  • [ ] All critical data identified
  • [ ] Automated backup schedules configured
  • [ ] Backups stored offsite/cross-region
  • [ ] Backup encryption enabled
  • [ ] Backup retention policies defined
  • [ ] etcd backups scheduled for Kubernetes clusters

Recovery Capability

  • [ ] DR site provisioned and tested
  • [ ] Database replication configured
  • [ ] DNS failover configured
  • [ ] Recovery runbooks documented
  • [ ] Recovery tested quarterly
  • [ ] Failback procedures documented

Verification

  • [ ] Automated backup verification (weekly)
  • [ ] Regular restore testing
  • [ ] DR drills conducted
  • [ ] Recovery time measured against RTO
  • [ ] Data integrity verified after restore
