Performance Issues
This page covers common performance problems, how to diagnose them, and solutions.
Slow Backups
Symptoms
- Backup takes much longer than expected
- Low `rabbitmq_backup_messages_read` rate in Prometheus
- Long gaps between segment writes in logs
Diagnosis
Enable verbose logging to see per-queue timing:
rabbitmq-backup backup -v --config backup.yaml
Check the metrics endpoint for throughput:
curl -s http://localhost:8080/metrics | grep messages_read
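To turn two scrapes of that counter into a throughput figure, here is a minimal sketch, assuming the endpoint exposes a plain, unlabeled Prometheus counter named `rabbitmq_backup_messages_read`:

```python
import re

def parse_counter(metrics_text: str, name: str) -> float:
    """Extract the value of a simple (unlabeled) Prometheus counter line."""
    m = re.search(rf"^{re.escape(name)}\s+([0-9.eE+]+)$", metrics_text, re.MULTILINE)
    if m is None:
        raise KeyError(name)
    return float(m.group(1))

def rate(sample_a: float, sample_b: float, interval_secs: float) -> float:
    """Messages per second between two counter samples."""
    return (sample_b - sample_a) / interval_secs

# Example with two scraped samples taken 10 seconds apart:
t0 = "rabbitmq_backup_messages_read 12000\n"
t1 = "rabbitmq_backup_messages_read 15000\n"
print(rate(parse_counter(t0, "rabbitmq_backup_messages_read"),
           parse_counter(t1, "rabbitmq_backup_messages_read"), 10))  # 300.0
```

In practice you would scrape the `/metrics` endpoint twice and feed both bodies through `parse_counter`.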
Common Causes and Solutions
Low Prefetch Count
The default prefetch_count of 100 may be too low for high-throughput scenarios.
backup:
prefetch_count: 500 # Increase from default 100
Sequential Queue Processing
If max_concurrent_queues is 1, queues are processed one at a time.
backup:
max_concurrent_queues: 8 # Process more queues in parallel
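The effect of this setting can be sketched with a semaphore that caps how many queue workers run at once. The `drain_queue` worker below is a hypothetical stand-in, not the tool's internals:

```python
import asyncio

async def drain_queue(name: str, sem: asyncio.Semaphore, active: list, peak: list):
    # Hypothetical stand-in for backing up one queue; tracks peak concurrency.
    async with sem:
        active[0] += 1
        peak[0] = max(peak[0], active[0])
        await asyncio.sleep(0.01)  # simulated work
        active[0] -= 1

async def backup_all(queues, max_concurrent_queues=8):
    sem = asyncio.Semaphore(max_concurrent_queues)
    active, peak = [0], [0]
    await asyncio.gather(*(drain_queue(q, sem, active, peak) for q in queues))
    return peak[0]

peak = asyncio.run(backup_all([f"q{i}" for i in range(20)], max_concurrent_queues=4))
print(peak)  # 4: never more than max_concurrent_queues workers at once
```

Raising the limit increases parallelism but also multiplies memory and connection pressure, which is why the memory section below recommends lowering it.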
Slow Requeue Strategy
The get strategy processes one message at a time and is the slowest.
backup:
requeue_strategy: cancel # Fastest strategy (default)
High Compression Level
Compression levels above 6 significantly increase CPU time.
backup:
compression: zstd
compression_level: 3 # Default. Lower = faster
Or switch to LZ4 for faster compression:
backup:
compression: lz4
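Neither zstd nor LZ4 bindings ship in the Python standard library, so this sketch uses stdlib zlib purely to illustrate the level-versus-CPU tradeoff the settings above control:

```python
import time
import zlib

payload = b"the quick brown fox jumps over the lazy dog " * 2000

for level in (1, 3, 9):
    start = time.perf_counter()
    out = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    print(f"level {level}: {len(out)} bytes in {elapsed * 1e3:.2f} ms")

# Higher levels shrink output further but cost more CPU per segment.
assert zlib.decompress(zlib.compress(payload, 1)) == payload
```

The same shape holds for zstd: levels above the default buy small size gains for a large CPU cost, which is why level 1-3 is the sweet spot for backup throughput.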
Network Latency to Storage
If the storage backend is in a different region, segment uploads are slow.
- Move storage to the same region as the backup process
- Move storage to the same region as the backup process
- Increase `segment_max_bytes` to write fewer, larger segments
- Check upload bandwidth with:
time aws s3 cp testfile s3://bucket/test
High Memory Usage
Symptoms
- Process memory grows beyond expected limits
- OOM kills in Kubernetes or Docker
- Swap usage on bare-metal deployments
Diagnosis
Monitor process memory:
# Linux
ps -o rss,vsz,cmd -p $(pgrep rabbitmq-backup)
# In Kubernetes
kubectl top pod -n rabbitmq-backup
Common Causes and Solutions
Too Many Messages In-Flight
Memory scales with: prefetch_count x average_message_size x max_concurrent_queues
Example: 500 prefetch x 50 KB messages x 8 queues = 200 MB just for message buffers.
backup:
prefetch_count: 50 # Reduce
max_concurrent_queues: 2 # Reduce
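A quick back-of-the-envelope helper for the formula above (decimal megabytes, matching the worked example):

```python
def estimate_buffer_bytes(prefetch_count: int,
                          avg_message_bytes: int,
                          max_concurrent_queues: int) -> int:
    """Worst-case bytes held in message buffers at once."""
    return prefetch_count * avg_message_bytes * max_concurrent_queues

# The example from the text: 500 prefetch x 50 KB messages x 8 queues.
print(estimate_buffer_bytes(500, 50_000, 8) / 1_000_000)  # 200.0 MB

# After reducing both knobs as suggested: 50 x 50 KB x 2 queues.
print(estimate_buffer_bytes(50, 50_000, 2) / 1_000_000)  # 5.0 MB
```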
Large Segment Buffers
Segments accumulate messages in memory until flushed.
backup:
segment_max_bytes: 33554432 # 32 MB instead of 128 MB
segment_max_interval_ms: 10000 # Flush every 10 seconds
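The flush behaviour these two settings control can be sketched as a buffer that flushes on whichever threshold trips first. Class and method names here are illustrative, not the tool's internals:

```python
import time

class SegmentBuffer:
    """Accumulate messages; flush when size or age exceeds a limit."""

    def __init__(self, max_bytes=33_554_432, max_interval_secs=10.0):
        self.max_bytes = max_bytes
        self.max_interval_secs = max_interval_secs
        self.messages = []
        self.size = 0
        self.opened_at = time.monotonic()

    def add(self, message: bytes) -> bool:
        """Buffer a message; return True if this add triggered a flush."""
        self.messages.append(message)
        self.size += len(message)
        age = time.monotonic() - self.opened_at
        if self.size >= self.max_bytes or age >= self.max_interval_secs:
            self.flush()
            return True
        return False

    def flush(self):
        # A real implementation would compress and upload the segment here.
        self.messages.clear()
        self.size = 0
        self.opened_at = time.monotonic()

buf = SegmentBuffer(max_bytes=100, max_interval_secs=60.0)
print(buf.add(b"x" * 40))   # False: 40 bytes buffered, under the limit
print(buf.add(b"x" * 70))   # True: 110 bytes >= 100, segment flushed
```

Smaller `max_bytes` means less memory per open segment, at the cost of more storage objects.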
Large Individual Messages
If messages are >1 MB each, even a modest prefetch count uses significant memory.
backup:
prefetch_count: 10 # Very low prefetch for large messages
max_concurrent_queues: 2
Kubernetes Memory Limits
Set appropriate resource limits based on your configuration:
resources:
requests:
memory: 256Mi
limits:
memory: 512Mi # Allow headroom above estimated usage
Slow Restores
Symptoms
- Restore takes much longer than the backup did
- Low `rabbitmq_restore_messages_published` rate
- Target broker CPU is high
Common Causes and Solutions
Small Batch Size
Publishing one message at a time is slow. Increase the batch size:
restore:
produce_batch_size: 500 # Increase from default 100
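Batching amortizes the per-publish round trip. A minimal sketch of the chunking, where the `publish_batch` callback is hypothetical:

```python
def publish_in_batches(messages, publish_batch, produce_batch_size=500):
    """Group messages into batches and hand each batch to one publish call."""
    for i in range(0, len(messages), produce_batch_size):
        publish_batch(messages[i:i + produce_batch_size])

# With 1,250 messages and a batch size of 500, only 3 publish calls are made.
calls = []
publish_in_batches(list(range(1250)), calls.append, produce_batch_size=500)
print([len(b) for b in calls])  # [500, 500, 250]
```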
Publisher Confirms Overhead
Each batch waits for broker confirmation. For faster restore (with some risk):
restore:
publisher_confirms: false # Faster but risks message loss on broker crash
Target Broker Overloaded
If the target broker is struggling:
- Add a rate limit: `rate_limit_messages_per_sec: 5000`
- Reduce `max_concurrent_queues` to lower connection pressure
- Check target broker memory and disk alarms with `rabbitmqctl status`
Network Bottlenecks
Symptoms
- Backup or restore is slow despite low CPU and memory usage
- Storage uploads/downloads are the bottleneck
- High latency to RabbitMQ broker
Diagnosis
Test Storage Throughput
# S3 upload speed test
dd if=/dev/zero bs=1M count=100 | aws s3 cp - s3://bucket/speedtest
aws s3 rm s3://bucket/speedtest
# S3 download speed test
aws s3 cp s3://bucket/some-segment /dev/null
Test AMQP Latency
# Test TCP latency to broker
ping rabbitmq-host
# Test AMQP port
time nc -zv rabbitmq-host 5672
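The same check can be scripted. A small helper measuring TCP connect time to the AMQP port, with host and port as placeholders:

```python
import socket
import time

def tcp_connect_latency(host: str, port: int, timeout: float = 5.0) -> float:
    """Seconds to establish a TCP connection; a rough proxy for AMQP reachability."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return time.perf_counter() - start

# Example (placeholder host):
# print(f"{tcp_connect_latency('rabbitmq-host', 5672) * 1e3:.1f} ms")
```

Note this measures only the TCP handshake, not the AMQP handshake or TLS negotiation, which add further round trips.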
Solutions
Co-locate with RabbitMQ
Run the backup tool on the same network as the RabbitMQ broker to minimize AMQP latency.
Co-locate with Storage
Run the backup tool in the same cloud region as the storage bucket.
Reduce Storage API Calls
Increase segment size to write fewer, larger objects:
backup:
segment_max_bytes: 268435456 # 256 MB
Reduce Checkpoint Sync Frequency
Checkpoint syncs upload to remote storage. Reduce frequency:
offset_storage:
sync_interval_secs: 120 # Every 2 minutes instead of 30 seconds
Monitoring Performance
Use Prometheus metrics to track performance over time:
# Messages per second (backup)
rate(rabbitmq_backup_messages_read[1m])
# Bytes per second (backup)
rate(rabbitmq_backup_bytes_read[1m])
# Messages per second (restore)
rate(rabbitmq_restore_messages_published[1m])
# Compression ratio
1 - (rate(rabbitmq_backup_segments_bytes[5m]) / rate(rabbitmq_backup_bytes_read[5m]))
# Error rate
rate(rabbitmq_backup_errors[5m])
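The compression-ratio expression above is one minus stored bytes over read bytes; a worked example with hypothetical throughput numbers:

```python
def compression_ratio(segment_bytes_per_sec: float, read_bytes_per_sec: float) -> float:
    """Fraction of input bytes saved by compression (mirrors the PromQL above)."""
    return 1 - segment_bytes_per_sec / read_bytes_per_sec

# 25 MB/s written to storage while reading 100 MB/s from the broker:
print(compression_ratio(25e6, 100e6))  # 0.75, i.e. 75% of bytes saved
```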
Set up alerts for performance degradation:
- alert: RabbitMQBackupSlowThroughput
expr: rate(rabbitmq_backup_messages_read[15m]) < 100
for: 30m
labels:
severity: warning
annotations:
summary: "Backup throughput is below 100 messages/sec"
Quick Tuning Checklist
- `prefetch_count` -- increase for throughput, decrease for memory
- `max_concurrent_queues` -- increase for parallelism, decrease for lower load
- `requeue_strategy` -- use `cancel` (fastest) unless you have a specific reason
- `compression` -- use `lz4` if CPU-bound, `zstd` level 1-3 otherwise
- `segment_max_bytes` -- increase for fewer storage API calls
- `produce_batch_size` -- increase for faster restores
- Network -- co-locate with both RabbitMQ and storage