Disaster Recovery

This guide describes how to use rabbitmq-backup to build a disaster recovery (DR) strategy for RabbitMQ. It covers active-passive setups, backup to remote storage, and step-by-step recovery procedures.

Architecture: Active-Passive DR

Production Cluster                       DR Cluster
+-----------------+                      +-----------------+
|    RabbitMQ     |                      |    RabbitMQ     |
|    (active)     |                      |    (standby)    |
+-----------------+                      +-----------------+
         |                                        ^
         v                                        |
+------------------+     +--------+       +-----------+
| rabbitmq-backup  | --> | S3 /   |  -->  | rabbitmq- |
| (scheduled)      |     | Azure  |       | backup    |
+------------------+     | / GCS  |       | (restore) |
                         +--------+       +-----------+

The production cluster runs scheduled backups to remote object storage. The DR cluster can be restored at any time from the latest backup.

Step 1: Set Up Scheduled Backups

Configure a scheduled backup on or near the production cluster. The backup writes both definitions (topology) and messages to remote storage.

dr-backup.yaml
mode: backup
backup_id: "dr-backup"

source:
  amqp_url: "amqp://backup_user:${RABBITMQ_PASSWORD}@prod-rabbitmq:5672/%2f"
  management_url: "http://prod-rabbitmq:15672"
  management_username: backup_user
  management_password: "${RABBITMQ_PASSWORD}"
  queues:
    include:
      - "*"

storage:
  backend: s3
  bucket: rabbitmq-dr-backups
  region: us-west-2  # Different region from production
  prefix: prod-cluster/

backup:
  compression: zstd
  compression_level: 3
  prefetch_count: 200
  max_concurrent_queues: 4
  include_definitions: true
  stop_at_current_depth: true

offset_storage:
  backend: sqlite
  db_path: ./dr-offsets.db
  s3_key: state/offsets.db
  sync_interval_secs: 30

Schedule via cron or Kubernetes CronJob:

# Every 6 hours
0 */6 * * * RABBITMQ_PASSWORD=secret rabbitmq-backup backup --config /etc/rabbitmq-backup/dr-backup.yaml
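On Kubernetes, the same schedule can run as a CronJob. The following is a sketch only: the image name `rabbitmq-backup:latest`, the Secret `rabbitmq-backup-secret`, and the ConfigMap `rabbitmq-backup-config` (holding dr-backup.yaml) are all assumed names — adjust them for your environment.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: rabbitmq-dr-backup
spec:
  schedule: "0 */6 * * *"
  concurrencyPolicy: Forbid        # never start a backup while one is running
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              image: rabbitmq-backup:latest          # assumed image name
              args: ["backup", "--config", "/etc/rabbitmq-backup/dr-backup.yaml"]
              env:
                - name: RABBITMQ_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: rabbitmq-backup-secret   # assumed Secret name
                      key: password
              volumeMounts:
                - name: config
                  mountPath: /etc/rabbitmq-backup
          volumes:
            - name: config
              configMap:
                name: rabbitmq-backup-config         # assumed ConfigMap name
```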

Step 2: Validate Backups Regularly

Run periodic validation to confirm backup integrity:

rabbitmq-backup validate \
  --path s3://rabbitmq-dr-backups \
  --backup-id dr-backup \
  --deep

Automate validation as a separate cron job that runs after the backup window:

# Validate 1 hour after backup
0 1,7,13,19 * * * rabbitmq-backup validate --path s3://rabbitmq-dr-backups --backup-id dr-backup --deep
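A thin wrapper can turn a failed validation into an alert instead of a silently failing cron job. This sketch assumes `rabbitmq-backup validate` exits nonzero on failure; `notify` is a placeholder for your real alerting hook (pager, Slack webhook, etc.):

```shell
# Placeholder alerting hook -- replace with a real pager or webhook call.
notify() { echo "ALERT: $1" >&2; }

# Run a validation command; alert and return nonzero if it fails.
# Assumes the command exits nonzero when validation fails.
validate_and_alert() {
  if "$@"; then
    echo "validation OK"
  else
    notify "DR backup validation failed (dr-backup)"
    return 1
  fi
}

# In the cron job:
# validate_and_alert rabbitmq-backup validate --path s3://rabbitmq-dr-backups --backup-id dr-backup --deep
```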

Step 3: Recovery Procedure

When the production cluster is lost, follow this runbook to restore to the DR cluster.

3a. Verify the DR Cluster Is Ready

Ensure RabbitMQ is running on the DR cluster with the Management Plugin enabled:

rabbitmqctl status
rabbitmq-plugins list | grep rabbitmq_management

3b. List Available Backups

rabbitmq-backup list --path s3://rabbitmq-dr-backups

Select the most recent completed backup.
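If your version of `list` accepts `--format json` (this guide only shows that flag for `describe`, so treat it as an assumption to verify), picking the newest completed backup can be scripted. The field names `backup_id`, `status`, and `completed_at` are likewise assumptions about the JSON shape; check them against real output before relying on this (requires `jq`):

```shell
# Assumed JSON shape: an array of {backup_id, status, completed_at} objects.
# Reads the array on stdin and prints the newest completed backup's id.
latest_completed() {
  jq -r '[.[] | select(.status == "completed")] | sort_by(.completed_at) | last | .backup_id'
}

# Hypothetical usage (verify that `list` accepts --format json first):
# rabbitmq-backup list --path s3://rabbitmq-dr-backups --format json | latest_completed
```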

3c. Describe the Backup

Confirm the backup contents before restoring:

rabbitmq-backup describe \
  --path s3://rabbitmq-dr-backups \
  --backup-id dr-backup \
  --format json

3d. Restore Definitions First

Restore the topology (vhosts, exchanges, queues, bindings, policies):

dr-restore.yaml
mode: restore
backup_id: "dr-backup"

target:
  amqp_url: "amqp://admin:${RABBITMQ_PASSWORD}@dr-rabbitmq:5672/%2f"
  management_url: "http://dr-rabbitmq:15672"
  management_username: admin
  management_password: "${RABBITMQ_PASSWORD}"

storage:
  backend: s3
  bucket: rabbitmq-dr-backups
  region: us-west-2
  prefix: prod-cluster/

restore:
  restore_definitions: true
  publish_mode: exchange
  publisher_confirms: true
  max_concurrent_queues: 4
  produce_batch_size: 100

Run the restore:

rabbitmq-backup restore --config dr-restore.yaml

3e. Verify the Restore

# Check queue counts
curl -u admin:password http://dr-rabbitmq:15672/api/queues | jq '.[].name'

# Check message counts
curl -u admin:password http://dr-rabbitmq:15672/api/overview | jq '.queue_totals'
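The per-queue check above can be extended into a quick total-count check. A sketch using `jq`; the host and `admin:password` credentials are the same placeholders as in the commands above:

```shell
# Sum the per-queue "messages" field from /api/queues JSON read on stdin.
# Prints 0 for an empty queue list.
sum_messages() {
  jq '[.[].messages] | add // 0'
}

# restored=$(curl -s -u admin:password http://dr-rabbitmq:15672/api/queues | sum_messages)
# echo "restored message total: $restored"
```

Compare the total against the message count reported by `describe` for the backup you restored.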

3f. Switch Traffic

Update application connection strings to point to the DR cluster. If you use a load balancer or DNS, update the record:

# Example: update DNS
aws route53 change-resource-record-sets --hosted-zone-id ZXXXXX \
  --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"rabbitmq.example.com","Type":"CNAME","TTL":60,"ResourceRecords":[{"Value":"dr-rabbitmq.example.com"}]}}]}'

Recovery Time Objective (RTO) and Recovery Point Objective (RPO)

Factor               Impact on RTO/RPO
Backup frequency     RPO = time since last backup
Backup duration      Longer backups increase RPO
Storage latency      Affects restore time (RTO)
Message volume       More messages = longer restore (RTO)
Network bandwidth    Affects both backup and restore time

Reducing RPO

  • Increase backup frequency (e.g., every hour instead of every 6 hours)
  • Use resumable backups with checkpoints to capture incremental changes
  • Back up definitions separately at higher frequency (lightweight operation)
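For the last point, definitions can also be snapshotted out-of-band through the standard management API endpoint /api/definitions, independently of rabbitmq-backup. A hypothetical hourly crontab entry; the credentials, host, and output path are placeholders:

```
# Hourly definitions-only snapshot via the RabbitMQ management API.
0 * * * * curl -s -u backup_user:secret http://prod-rabbitmq:15672/api/definitions -o /var/backups/definitions-$(date +\%Y\%m\%d\%H).json
```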

Reducing RTO

  • Pre-provision the DR cluster with RabbitMQ installed
  • Pre-import definitions so only messages need restoring
  • Use a storage region close to the DR cluster
  • Test the restore procedure regularly

Multi-Region Setup

For cross-region DR, use S3 cross-region replication so the backup data is available close to the DR cluster:

aws s3api put-bucket-replication \
  --bucket rabbitmq-dr-backups \
  --replication-configuration '{
    "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
    "Rules": [{
      "Status": "Enabled",
      "Destination": {
        "Bucket": "arn:aws:s3:::rabbitmq-dr-backups-us-east-1"
      }
    }]
  }'

DR Testing Checklist

Run through this checklist quarterly:

  • Verify the latest backup completed successfully
  • Run validate --deep on the latest backup
  • Restore to a test environment
  • Confirm all definitions are present (vhosts, exchanges, queues, bindings)
  • Confirm message counts match expectations
  • Publish a test message and consume it on the restored cluster
  • Measure actual RTO and compare to target
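For the final item, actual RTO can be captured with simple wall-clock timing around the restore step. A sketch; the restore command is left commented out here and should be uncommented in a real drill:

```shell
# Time the restore end-to-end to measure actual RTO.
start=$(date +%s)
# rabbitmq-backup restore --config dr-restore.yaml   # the real restore step
end=$(date +%s)
echo "restore took $((end - start)) seconds"
```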