## Problem Statement
You need to ensure your database can handle failures, provide read scaling, and maintain data durability across multiple servers or regions.
## When to Use Each Pattern

| Pattern | Use Case | Trade-offs |
|---|---|---|
| Primary-Replica | Read scaling, simple failover | Replication lag, manual failover |
| Multi-Primary | Write scaling, geographic distribution | Conflict resolution complexity |
| Synchronous | Zero data loss | Slower writes, requires low latency |
| Asynchronous | Higher performance | Potential data loss on failure |
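As a toy illustration of the table, the two key requirements can be mapped onto a pattern choice (a hypothetical helper, not any library's API; real deployments weigh many more factors):

```python
def choose_replication_pattern(need_write_scaling: bool,
                               zero_data_loss: bool) -> tuple[str, str]:
    """Map two key requirements onto a (topology, durability mode) pair.

    Illustrative only: latency budget, geography, conflict tolerance,
    and operational maturity all matter in practice.
    """
    # Multi-primary adds write capacity but brings conflict resolution;
    # primary-replica keeps a single writer and scales reads instead.
    topology = "multi-primary" if need_write_scaling else "primary-replica"
    # Synchronous replication trades write latency for zero data loss;
    # asynchronous trades potential loss on failure for throughput.
    durability = "synchronous" if zero_data_loss else "asynchronous"
    return topology, durability
```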
## MySQL/Percona Replication

### Step 1: Configure Primary Server

Edit /etc/my.cnf:

```ini
[mysqld]
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_format = ROW
binlog_do_db = myapp_production
sync_binlog = 1
innodb_flush_log_at_trx_commit = 1

# GTID-based replication (recommended)
gtid_mode = ON
enforce_gtid_consistency = ON
```

Create replication user:

```sql
CREATE USER 'replicator'@'replica-host' IDENTIFIED BY 'secure_password';
GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'replica-host';
FLUSH PRIVILEGES;
```
### Step 2: Configure Replica Server

Edit /etc/my.cnf:

```ini
[mysqld]
server-id = 2
relay_log = /var/log/mysql/mysql-relay-bin.log
log_bin = /var/log/mysql/mysql-bin.log
binlog_format = ROW
read_only = ON

# GTID-based replication
gtid_mode = ON
enforce_gtid_consistency = ON
```
### Step 3: Initialize Replication

On the replica:

```sql
CHANGE MASTER TO
  MASTER_HOST = 'primary-host',
  MASTER_USER = 'replicator',
  MASTER_PASSWORD = 'secure_password',
  MASTER_AUTO_POSITION = 1;

START SLAVE;
```
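On MySQL 8.0.23 and later, the same step can be written with the newer replication terminology (the deprecated-but-supported statements above still work):

```sql
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'primary-host',
  SOURCE_USER = 'replicator',
  SOURCE_PASSWORD = 'secure_password',
  SOURCE_AUTO_POSITION = 1;

START REPLICA;
```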
### Step 4: Verify Replication Status

```sql
SHOW SLAVE STATUS\G
```

Check for:

- Slave_IO_Running: Yes
- Slave_SQL_Running: Yes
- Seconds_Behind_Master: 0 (or close to 0)
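A monitoring script can apply these checks mechanically. A minimal sketch in Python, assuming the \G output has been captured as text (function names and the lag threshold are illustrative):

```python
def parse_vertical_status(output: str) -> dict:
    """Parse MySQL's \\G (vertical) output into a key/value dict."""
    status = {}
    for line in output.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            status[key.strip()] = value.strip()
    return status


def replica_healthy(status: dict, max_lag_seconds: int = 10) -> bool:
    """Both replication threads running and lag within the threshold.

    Seconds_Behind_Master is NULL when replication is broken,
    which this correctly treats as unhealthy.
    """
    lag = status.get("Seconds_Behind_Master", "")
    return (
        status.get("Slave_IO_Running") == "Yes"
        and status.get("Slave_SQL_Running") == "Yes"
        and lag.isdigit()
        and int(lag) <= max_lag_seconds
    )
```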
### Semi-Synchronous Replication

For stronger durability guarantees:

On the primary:

```sql
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
SET GLOBAL rpl_semi_sync_master_timeout = 1000; -- 1 second
```

On the replica:

```sql
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;
STOP SLAVE IO_THREAD;
START SLAVE IO_THREAD;
```
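To confirm semi-synchronous mode is actually in effect (it silently falls back to asynchronous when the timeout above expires with no replica acknowledgment), check the status variables on the primary:

```sql
SHOW STATUS LIKE 'Rpl_semi_sync_master_status';   -- ON while semi-sync is active
SHOW STATUS LIKE 'Rpl_semi_sync_master_clients';  -- number of connected semi-sync replicas
```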
## PostgreSQL Streaming Replication

### Step 1: Configure Primary Server

Edit postgresql.conf:

```ini
wal_level = replica
max_wal_senders = 5
wal_keep_size = 1GB
synchronous_commit = on
synchronous_standby_names = 'standby1'  # For synchronous replication
```

Edit pg_hba.conf:

```
host  replication  replicator  replica-ip/32  scram-sha-256
```

Create replication user:

```sql
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secure_password';
```
### Step 2: Create Base Backup

On the replica:

```bash
pg_basebackup -h primary-host -U replicator -D /var/lib/pgsql/data -Fp -Xs -P -R
```

The -R flag creates standby.signal and writes the connection settings into postgresql.auto.conf.
### Step 3: Configure Replica

Edit postgresql.conf (if needed):

```ini
hot_standby = on
primary_conninfo = 'host=primary-host user=replicator password=secure_password'
```
### Step 4: Start Replica and Verify

```bash
systemctl start postgresql
```

On the primary, check connected replicas:

```sql
SELECT * FROM pg_stat_replication;
```

On the replica, check recovery status:

```sql
SELECT pg_is_in_recovery();
```
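pg_stat_replication reports positions as LSNs such as 0/3000148: the hex high and low halves of a 64-bit byte position in the WAL stream. Replication lag in bytes is simply the difference between two LSNs, which pg_wal_lsn_diff() computes server-side; a client-side equivalent, as a sketch:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '0/3000148' (hex high/low halves)
    into an absolute byte position in the WAL stream."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(sent_lsn: str, replay_lsn: str) -> int:
    """How many bytes the replica's replay position trails the primary's
    sent position; equivalent to pg_wal_lsn_diff(sent_lsn, replay_lsn)."""
    return lsn_to_bytes(sent_lsn) - lsn_to_bytes(replay_lsn)
```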
## Automatic Failover with Patroni

For production, use Patroni for automatic failover:

```yaml
# patroni.yml
scope: postgres-cluster
namespace: /service/
name: node1

restapi:
  listen: 0.0.0.0:8008
  connect_address: node1:8008

etcd:
  hosts: etcd1:2379,etcd2:2379,etcd3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      parameters:
        wal_level: replica
        hot_standby: "on"
        max_wal_senders: 5
        max_replication_slots: 5

postgresql:
  listen: 0.0.0.0:5432
  connect_address: node1:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    replication:
      username: replicator
      password: secure_password
    superuser:
      username: postgres
      password: admin_password
```
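The maximum_lag_on_failover setting (1 MiB above) bounds how far behind a replica may be and still be considered for promotion; Patroni skips candidates beyond the cap. The rule it enforces, sketched in Python for clarity (an illustration, not Patroni's actual code):

```python
# Mirrors maximum_lag_on_failover in the patroni.yml above.
MAXIMUM_LAG_ON_FAILOVER = 1_048_576  # bytes (1 MiB)


def eligible_for_promotion(replica_lag_bytes: int,
                           max_lag: int = MAXIMUM_LAG_ON_FAILOVER) -> bool:
    """A replica is a failover candidate only if its WAL lag is within
    the configured cap; otherwise promotion would lose too many writes."""
    return replica_lag_bytes <= max_lag
```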
## MySQL Replication in Kubernetes

### Using Percona Operator

```yaml
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBCluster
metadata:
  name: mysql-cluster
spec:
  crVersion: 1.13.0
  secretsName: mysql-secrets
  pxc:
    size: 3
    image: percona/percona-xtradb-cluster:8.0
    resources:
      requests:
        memory: 1G
        cpu: 600m
    volumeSpec:
      persistentVolumeClaim:
        resources:
          requests:
            storage: 100Gi
  haproxy:
    enabled: true
    size: 2
    image: percona/percona-xtradb-cluster-operator:1.13.0-haproxy
  proxysql:
    enabled: false
```
### Connection String

```
mysql://user:password@mysql-cluster-haproxy:3306/database
```
## PostgreSQL Replication in Kubernetes

### Using CloudNativePG Operator

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-cluster
spec:
  instances: 3
  postgresql:
    parameters:
      max_connections: "300"
      shared_buffers: "256MB"
  storage:
    size: 100Gi
    storageClass: fast-ssd
  backup:
    barmanObjectStore:
      destinationPath: s3://backups/postgres
      s3Credentials:
        accessKeyId:
          name: s3-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: s3-creds
          key: SECRET_ACCESS_KEY
```
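CloudNativePG exposes the cluster through per-role Services derived from the cluster name, so applications point at a Service rather than an individual pod:

```
postgres-cluster-rw  # routes to the current primary (read-write)
postgres-cluster-ro  # routes to replicas only (read-only)
postgres-cluster-r   # routes to any instance
```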
## Replication Monitoring

### Key Metrics to Monitor

- Replication Lag

  ```sql
  -- MySQL
  SHOW SLAVE STATUS\G  -- Check Seconds_Behind_Master

  -- PostgreSQL
  SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
         pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes
  FROM pg_stat_replication;
  ```

- Connection Status
  - MySQL: check Slave_IO_Running and Slave_SQL_Running
  - PostgreSQL: check the state column in pg_stat_replication
- Binary Log/WAL Position: confirm positions advance over time; a stalled position indicates a stuck sender or stopped replica
### Alerting Rules

```yaml
# Prometheus alerting rules
groups:
  - name: database-replication
    rules:
      - alert: ReplicationLagHigh
        expr: mysql_slave_seconds_behind_master > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "MySQL replication lag is {{ $value }} seconds"
      - alert: ReplicationStopped
        expr: mysql_slave_sql_running == 0 or mysql_slave_io_running == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "MySQL replication has stopped"
```
## Failover Procedures

### Manual Failover (MySQL)

- Stop writes to the primary

  ```sql
  SET GLOBAL read_only = ON;
  FLUSH TABLES WITH READ LOCK;
  ```

- Ensure the replica is caught up

  ```sql
  -- On the replica
  SHOW SLAVE STATUS\G
  -- Wait for Seconds_Behind_Master = 0
  ```

- Promote the replica

  ```sql
  STOP SLAVE;
  RESET SLAVE ALL;
  SET GLOBAL read_only = OFF;
  ```

- Update application connection strings
### Manual Failover (PostgreSQL)

- Promote the standby

  ```bash
  pg_ctl promote -D /var/lib/pgsql/data
  # Or, via SQL: SELECT pg_promote();
  ```

- Reconfigure the old primary as a standby (if recovered), e.g. with pg_rewind
## Best Practices Checklist
- [ ] Use GTID-based replication (MySQL) for easier failover
- [ ] Configure monitoring for replication lag
- [ ] Test failover procedures regularly
- [ ] Use semi-synchronous replication for critical data
- [ ] Keep binlogs/WAL retained long enough for recovery
- [ ] Document and automate failover procedures
- [ ] Use connection pooling (PgBouncer, ProxySQL)
- [ ] Configure read replicas in application for read scaling
- [ ] Back up both primary and replicas
- [ ] Monitor disk space for logs and data
## Related Wiki Articles