Disaster Recovery
This guide describes how to use rabbitmq-backup to build a disaster recovery (DR) strategy for RabbitMQ. It covers active-passive setups, backup to remote storage, and step-by-step recovery procedures.
Architecture: Active-Passive DR
```
Production Cluster                    DR Cluster
+-----------------+             +-----------------+
|    RabbitMQ     |             |    RabbitMQ     |
|    (active)     |             |    (standby)    |
+-----------------+             +-----------------+
         |                               ^
         v                               |
+------------------+    +--------+    +-----------+
| rabbitmq-backup  |--> | S3 /   |--> | rabbitmq- |
| (scheduled)      |    | Azure  |    | backup    |
+------------------+    | / GCS  |    | (restore) |
                        +--------+    +-----------+
```
The production cluster runs scheduled backups to remote object storage. The DR cluster can be restored at any time from the latest backup.
Step 1: Set Up Scheduled Backups
Configure a scheduled backup on or near the production cluster. The backup writes both definitions (topology) and messages to remote storage.
```yaml
mode: backup
backup_id: "dr-backup"

source:
  amqp_url: "amqp://backup_user:${RABBITMQ_PASSWORD}@prod-rabbitmq:5672/%2f"
  management_url: "http://prod-rabbitmq:15672"
  management_username: backup_user
  management_password: "${RABBITMQ_PASSWORD}"
  queues:
    include:
      - "*"

storage:
  backend: s3
  bucket: rabbitmq-dr-backups
  region: us-west-2  # Different region from production
  prefix: prod-cluster/

backup:
  compression: zstd
  compression_level: 3
  prefetch_count: 200
  max_concurrent_queues: 4
  include_definitions: true
  stop_at_current_depth: true

offset_storage:
  backend: sqlite
  db_path: ./dr-offsets.db
  s3_key: state/offsets.db
  sync_interval_secs: 30
```
Schedule via cron or Kubernetes CronJob:
```shell
# Every 6 hours
0 */6 * * * RABBITMQ_PASSWORD=secret rabbitmq-backup backup --config /etc/rabbitmq-backup/dr-backup.yaml
```
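The same schedule can be expressed as a Kubernetes CronJob. This is a sketch, not a definitive manifest: the image name, the `rabbitmq-backup-credentials` Secret, and the `rabbitmq-backup-config` ConfigMap are assumptions to adapt to your deployment.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: rabbitmq-dr-backup
spec:
  schedule: "0 */6 * * *"    # Every 6 hours, matching the cron example above
  concurrencyPolicy: Forbid  # Never let two backup runs overlap
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: rabbitmq-backup:latest  # assumed image name/tag
              args: ["backup", "--config", "/etc/rabbitmq-backup/dr-backup.yaml"]
              env:
                - name: RABBITMQ_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: rabbitmq-backup-credentials  # assumed Secret
                      key: password
              volumeMounts:
                - name: config
                  mountPath: /etc/rabbitmq-backup
          volumes:
            - name: config
              configMap:
                name: rabbitmq-backup-config  # assumed ConfigMap holding dr-backup.yaml
```

`concurrencyPolicy: Forbid` is worth keeping: a backup that overruns its window should delay the next run, not race it.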
Step 2: Validate Backups Regularly
Run periodic validation to confirm backup integrity:
```shell
rabbitmq-backup validate \
  --path s3://rabbitmq-dr-backups \
  --backup-id dr-backup \
  --deep
```
Automate validation as a separate cron job that runs after the backup window:
```shell
# Validate 1 hour after each backup
0 1,7,13,19 * * * rabbitmq-backup validate --path s3://rabbitmq-dr-backups --backup-id dr-backup --deep
```
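A validation failure should alert someone rather than disappear into cron's mailbox. A minimal sketch of a wrapper that runs the validation command and reports its outcome; the webhook call is commented out and `WEBHOOK_URL` is an assumption, not part of rabbitmq-backup.

```shell
#!/usr/bin/env sh
# Run any command; print a clear result and return non-zero on failure.
run_and_alert() {
    if "$@"; then
        echo "validation OK"
    else
        echo "validation FAILED" >&2
        # Hypothetical alert hook -- replace with your paging/chat integration:
        # curl -s -X POST "$WEBHOOK_URL" -d '{"text":"DR backup validation failed"}'
        return 1
    fi
}

# In the cron job:
# run_and_alert rabbitmq-backup validate --path s3://rabbitmq-dr-backups --backup-id dr-backup --deep
```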
Step 3: Recovery Procedure
When the production cluster is lost, follow this runbook to restore to the DR cluster.
3a. Verify the DR Cluster Is Ready
Ensure RabbitMQ is running on the DR cluster with the Management Plugin enabled:
```shell
rabbitmqctl status
rabbitmq-plugins list | grep rabbitmq_management
```
3b. List Available Backups
```shell
rabbitmq-backup list --path s3://rabbitmq-dr-backups
```
Select the most recent completed backup.
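In a scripted runbook, the selection can be automated. This sketch assumes `rabbitmq-backup list` emits one backup per line prefixed with a sortable ISO-8601 timestamp; verify against your actual output format before relying on it.

```shell
#!/usr/bin/env sh
# Pick the newest entry from timestamped, line-oriented output.
# Assumes each line starts with a sortable ISO-8601 timestamp.
latest_backup() {
    sort -r | head -n 1
}

# Usage (output format is an assumption):
# rabbitmq-backup list --path s3://rabbitmq-dr-backups | latest_backup
```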
3c. Describe the Backup
Confirm the backup contents before restoring:
```shell
rabbitmq-backup describe \
  --path s3://rabbitmq-dr-backups \
  --backup-id dr-backup \
  --format json
```
3d. Restore Definitions First
Restore the topology (vhosts, exchanges, queues, bindings, policies):
```yaml
mode: restore
backup_id: "dr-backup"

target:
  amqp_url: "amqp://admin:${RABBITMQ_PASSWORD}@dr-rabbitmq:5672/%2f"
  management_url: "http://dr-rabbitmq:15672"
  management_username: admin
  management_password: "${RABBITMQ_PASSWORD}"

storage:
  backend: s3
  bucket: rabbitmq-dr-backups
  region: us-west-2
  prefix: prod-cluster/

restore:
  restore_definitions: true
  publish_mode: exchange
  publisher_confirms: true
  max_concurrent_queues: 4
  produce_batch_size: 100
```

```shell
rabbitmq-backup restore --config dr-restore.yaml
```
3e. Verify the Restore
```shell
# Check queue names
curl -u admin:password http://dr-rabbitmq:15672/api/queues | jq '.[].name'

# Check aggregate message counts
curl -u admin:password http://dr-rabbitmq:15672/api/overview | jq '.queue_totals'
```
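For a scripted verification step, the aggregate count can be compared against an expected minimum. A minimal sketch; the credentials and the `150000` threshold in the usage comment are assumptions for your environment.

```shell
#!/usr/bin/env sh
# Sanity-check restored message totals against an expected minimum.
check_totals() {
    # $1 = actual restored message count, $2 = expected minimum
    if [ "$1" -ge "$2" ]; then
        echo "OK: $1 messages restored (expected >= $2)"
    else
        echo "SHORTFALL: $1 messages restored, expected >= $2" >&2
        return 1
    fi
}

# Against a live DR cluster (credentials and threshold are assumptions):
# total=$(curl -su admin:password http://dr-rabbitmq:15672/api/overview | jq '.queue_totals.messages')
# check_totals "$total" 150000
```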
3f. Switch Traffic
Update application connection strings to point to the DR cluster. If you use a load balancer or DNS, update the record:
```shell
# Example: update DNS
aws route53 change-resource-record-sets --hosted-zone-id ZXXXXX \
  --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"rabbitmq.example.com","Type":"CNAME","TTL":60,"ResourceRecords":[{"Value":"dr-rabbitmq.example.com"}]}}]}'
```
Recovery Time Objective (RTO) and Recovery Point Objective (RPO)
| Factor | Impact on RTO/RPO |
|---|---|
| Backup frequency | Worst-case RPO equals the backup interval |
| Backup duration | Longer backups increase RPO |
| Storage latency | Affects restore time (RTO) |
| Message volume | More messages = longer restore (RTO) |
| Network bandwidth | Affects both backup and restore time |
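To make the table concrete: with the 6-hour schedule above and a backup run that takes (as an assumed example) 20 minutes to complete, the worst-case data-loss window is the interval plus the backup duration.

```shell
# Worst-case RPO = backup interval + backup duration
interval=$((6 * 3600))   # 6-hour schedule, in seconds
duration=$((20 * 60))    # assumed 20-minute backup run
rpo=$((interval + duration))
echo "$rpo"              # 22800 seconds, roughly 6.3 hours
```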
Reducing RPO
- Increase backup frequency (e.g., every hour instead of every 6 hours)
- Use resumable backups with checkpoints to capture incremental changes
- Back up definitions separately at higher frequency (lightweight operation)
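Definitions are small, so they can be exported far more often than messages. One option is RabbitMQ's built-in export command (available since RabbitMQ 3.8), shipped to the same bucket; the paths and the 15-minute cadence below are illustrative.

```shell
# Crontab fragment: export definitions every 15 minutes and copy them to S3.
*/15 * * * * rabbitmqctl export_definitions /var/backups/definitions.json && aws s3 cp /var/backups/definitions.json s3://rabbitmq-dr-backups/prod-cluster/definitions.json
```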
Reducing RTO
- Pre-provision the DR cluster with RabbitMQ installed
- Pre-import definitions so only messages need restoring
- Use a storage region close to the DR cluster
- Test the restore procedure regularly
Multi-Region Setup
For cross-region DR, use S3 cross-region replication so the backup data is available close to the DR cluster:
```shell
aws s3api put-bucket-replication \
  --bucket rabbitmq-dr-backups \
  --replication-configuration '{
    "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
    "Rules": [{
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {},
      "DeleteMarkerReplication": {"Status": "Disabled"},
      "Destination": {
        "Bucket": "arn:aws:s3:::rabbitmq-dr-backups-us-east-1"
      }
    }]
  }'
```

Note that S3 replication requires versioning to be enabled on both the source and destination buckets.
DR Testing Checklist
Run through this checklist quarterly:
- Verify the latest backup completed successfully
- Run `validate --deep` on the latest backup
- Restore to a test environment
- Confirm all definitions are present (vhosts, exchanges, queues, bindings)
- Confirm message counts match expectations
- Publish a test message and consume it on the restored cluster
- Measure actual RTO and compare to target
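For the last checklist item, measure RTO directly during the drill rather than estimating it. A minimal timing harness; the restore command in the usage comment is the one from Step 3d.

```shell
#!/usr/bin/env sh
# Time an arbitrary command and report elapsed wall-clock seconds.
measure_rto() {
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "elapsed: $((end - start))s (exit $status)" >&2
    return $status
}

# During a drill:
# measure_rto rabbitmq-backup restore --config dr-restore.yaml
```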