Disaster Recovery

This guide describes how to use rabbitmq-backup to build a disaster recovery (DR) strategy for RabbitMQ. It covers active-passive setups, backup to remote storage, and step-by-step recovery procedures.

Architecture: Active-Passive DR

Production Cluster                       DR Cluster
+-----------------+                      +-----------------+
|    RabbitMQ     |                      |    RabbitMQ     |
|    (active)     |                      |    (standby)    |
+-----------------+                      +-----------------+
         |                                        ^
         v                                        |
+------------------+     +--------+       +-----------+
| rabbitmq-backup  | --> | S3 /   |  -->  | rabbitmq- |
| (scheduled)      |     | Azure  |       | backup    |
+------------------+     | / GCS  |       | (restore) |
                         +--------+       +-----------+

The production cluster runs scheduled backups to remote object storage. The DR cluster can be restored at any time from the latest backup.

Step 1: Set Up Scheduled Backups

Configure a scheduled backup on or near the production cluster. The backup writes both definitions (topology) and messages to remote storage.

dr-backup.yaml
mode: backup
backup_id: "dr-backup"

source:
  amqp_url: "amqp://backup_user:${RABBITMQ_PASSWORD}@prod-rabbitmq:5672/%2f"
  management_url: "http://prod-rabbitmq:15672"
  management_username: backup_user
  management_password: "${RABBITMQ_PASSWORD}"
  queues:
    include:
      - "*"

storage:
  backend: s3
  bucket: rabbitmq-dr-backups
  region: us-west-2  # Different region from production
  prefix: prod-cluster/

backup:
  compression: zstd
  compression_level: 3
  prefetch_count: 200
  max_concurrent_queues: 4
  include_definitions: true
  stop_at_current_depth: true

offset_storage:
  backend: sqlite
  db_path: ./dr-offsets.db
  s3_key: state/offsets.db
  sync_interval_secs: 30

Schedule via cron or Kubernetes CronJob:

# Every 6 hours
0 */6 * * * RABBITMQ_PASSWORD=secret rabbitmq-backup backup --config /etc/rabbitmq-backup/dr-backup.yaml
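On Kubernetes, the same schedule can run as a CronJob. The following is a sketch only: the image name `rabbitmq-backup:latest`, the Secret `rabbitmq-backup-secret`, and the ConfigMap `rabbitmq-backup-config` (holding dr-backup.yaml) are all assumed names — adjust them for your environment.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: rabbitmq-dr-backup
spec:
  schedule: "0 */6 * * *"
  concurrencyPolicy: Forbid        # never start a backup while one is running
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              image: rabbitmq-backup:latest          # assumed image name
              args: ["backup", "--config", "/etc/rabbitmq-backup/dr-backup.yaml"]
              env:
                - name: RABBITMQ_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: rabbitmq-backup-secret   # assumed Secret name
                      key: password
              volumeMounts:
                - name: config
                  mountPath: /etc/rabbitmq-backup
          volumes:
            - name: config
              configMap:
                name: rabbitmq-backup-config         # assumed ConfigMap name
```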

Step 2: Validate Backups Regularly

Run periodic validation to confirm backup integrity:

rabbitmq-backup validate \
  --path s3://rabbitmq-dr-backups \
  --backup-id dr-backup \
  --deep

Automate validation as a separate cron job that runs after the backup window:

# Validate 1 hour after backup
0 1,7,13,19 * * * rabbitmq-backup validate --path s3://rabbitmq-dr-backups --backup-id dr-backup --deep
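A thin wrapper can turn a failed validation into an alert instead of a silently failing cron job. This sketch assumes `rabbitmq-backup validate` exits nonzero on failure; `notify` is a placeholder for your real alerting hook (pager, Slack webhook, etc.):

```shell
# Placeholder alerting hook -- replace with a real pager or webhook call.
notify() { echo "ALERT: $1" >&2; }

# Run a validation command; alert and return nonzero if it fails.
# Assumes the command exits nonzero when validation fails.
validate_and_alert() {
  if "$@"; then
    echo "validation OK"
  else
    notify "DR backup validation failed (dr-backup)"
    return 1
  fi
}

# In the cron job:
# validate_and_alert rabbitmq-backup validate --path s3://rabbitmq-dr-backups --backup-id dr-backup --deep
```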

Step 3: Recovery Procedure

When the production cluster is lost, follow this runbook to restore to the DR cluster.

3a. Verify the DR Cluster Is Ready

Ensure RabbitMQ is running on the DR cluster with the Management Plugin enabled:

rabbitmqctl status
rabbitmq-plugins list | grep rabbitmq_management

3b. List Available Backups

rabbitmq-backup list --path s3://rabbitmq-dr-backups

Select the most recent completed backup.
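If your version of `list` accepts `--format json` (this guide only shows that flag for `describe`, so treat it as an assumption to verify), picking the newest completed backup can be scripted. The field names `backup_id`, `status`, and `completed_at` are likewise assumptions about the JSON shape; check them against real output before relying on this (requires `jq`):

```shell
# Assumed JSON shape: an array of {backup_id, status, completed_at} objects.
# Reads the array on stdin and prints the newest completed backup's id.
latest_completed() {
  jq -r '[.[] | select(.status == "completed")] | sort_by(.completed_at) | last | .backup_id'
}

# Hypothetical usage (verify that `list` accepts --format json first):
# rabbitmq-backup list --path s3://rabbitmq-dr-backups --format json | latest_completed
```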

3c. Describe the Backup

Confirm the backup contents before restoring:

rabbitmq-backup describe \
  --path s3://rabbitmq-dr-backups \
  --backup-id dr-backup \
  --format json

3d. Restore Definitions First

Restore the topology (vhosts, exchanges, queues, bindings, policies):

dr-restore.yaml
mode: restore
backup_id: "dr-backup"

target:
  amqp_url: "amqp://admin:${RABBITMQ_PASSWORD}@dr-rabbitmq:5672/%2f"
  management_url: "http://dr-rabbitmq:15672"
  management_username: admin
  management_password: "${RABBITMQ_PASSWORD}"

storage:
  backend: s3
  bucket: rabbitmq-dr-backups
  region: us-west-2
  prefix: prod-cluster/

restore:
  restore_definitions: true
  publish_mode: exchange
  publisher_confirms: true
  max_concurrent_queues: 4
  produce_batch_size: 100

Run the restore:

rabbitmq-backup restore --config dr-restore.yaml

3e. Verify the Restore

# Check queue counts
curl -u admin:password http://dr-rabbitmq:15672/api/queues | jq '.[].name'

# Check message counts
curl -u admin:password http://dr-rabbitmq:15672/api/overview | jq '.queue_totals'
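The per-queue check above can be extended into a quick total-count check. A sketch using `jq`; the host and `admin:password` credentials are the same placeholders as in the commands above:

```shell
# Sum the per-queue "messages" field from /api/queues JSON read on stdin.
# Prints 0 for an empty queue list.
sum_messages() {
  jq '[.[].messages] | add // 0'
}

# restored=$(curl -s -u admin:password http://dr-rabbitmq:15672/api/queues | sum_messages)
# echo "restored message total: $restored"
```

Compare the total against the message count reported by `describe` for the backup you restored.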

3f. Switch Traffic

Update application connection strings to point to the DR cluster. If you use a load balancer or DNS, update the record:

# Example: update DNS
aws route53 change-resource-record-sets --hosted-zone-id ZXXXXX \
  --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"rabbitmq.example.com","Type":"CNAME","TTL":60,"ResourceRecords":[{"Value":"dr-rabbitmq.example.com"}]}}]}'

Recovery Time Objective (RTO) and Recovery Point Objective (RPO)

Factor               Impact on RTO/RPO
Backup frequency     RPO = time since last backup
Backup duration      Longer backups increase RPO
Storage latency      Affects restore time (RTO)
Message volume       More messages = longer restore (RTO)
Network bandwidth    Affects both backup and restore time

Reducing RPO

  • Increase backup frequency (e.g., every hour instead of every 6 hours)
  • Use resumable backups with checkpoints to capture incremental changes
  • Back up definitions separately at higher frequency (lightweight operation)
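For the last point, definitions can also be snapshotted out-of-band through the standard management API endpoint /api/definitions, independently of rabbitmq-backup. A hypothetical hourly crontab entry; the credentials, host, and output path are placeholders:

```
# Hourly definitions-only snapshot via the RabbitMQ management API.
0 * * * * curl -s -u backup_user:secret http://prod-rabbitmq:15672/api/definitions -o /var/backups/definitions-$(date +\%Y\%m\%d\%H).json
```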

Reducing RTO

  • Pre-provision the DR cluster with RabbitMQ installed
  • Pre-import definitions so only messages need restoring
  • Use a storage region close to the DR cluster
  • Test the restore procedure regularly

Multi-Region Setup

For cross-region DR, use S3 cross-region replication so the backup data is available close to the DR cluster:

aws s3api put-bucket-replication \
  --bucket rabbitmq-dr-backups \
  --replication-configuration '{
    "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
    "Rules": [{
      "Status": "Enabled",
      "Destination": {
        "Bucket": "arn:aws:s3:::rabbitmq-dr-backups-us-east-1"
      }
    }]
  }'

DR Testing Checklist

Run through this checklist quarterly:

  • Verify the latest backup completed successfully
  • Run validate --deep on the latest backup
  • Restore to a test environment
  • Confirm all definitions are present (vhosts, exchanges, queues, bindings)
  • Confirm message counts match expectations
  • Publish a test message and consume it on the restored cluster
  • Measure actual RTO and compare to target
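For the final item, actual RTO can be captured with simple wall-clock timing around the restore step. A sketch; the restore command is left commented out here and should be uncommented in a real drill:

```shell
# Time the restore end-to-end to measure actual RTO.
start=$(date +%s)
# rabbitmq-backup restore --config dr-restore.yaml   # the real restore step
end=$(date +%s)
echo "restore took $((end - start)) seconds"
```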