Performance Issues

This page covers common performance problems, how to diagnose them, and solutions.

Slow Backups

Symptoms

  • Backup takes much longer than expected
  • Low rabbitmq_backup_messages_read rate in Prometheus
  • Long gaps between segment writes in logs

Diagnosis

Enable verbose logging to see per-queue timing:

rabbitmq-backup backup -v --config backup.yaml

Check the metrics endpoint for throughput:

curl -s http://localhost:8080/metrics | grep messages_read

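If you scrape the endpoint twice, throughput can also be derived by hand. A minimal parsing sketch (helper names are illustrative, and the parsing handles only simple counter lines, not the full Prometheus exposition format):

```python
import re

def read_counter(metrics_text: str, name: str) -> float:
    """Extract a single counter value from Prometheus text output."""
    match = re.search(rf"^{re.escape(name)}(?:{{[^}}]*}})?\s+([0-9.eE+-]+)$",
                      metrics_text, re.MULTILINE)
    if match is None:
        raise KeyError(name)
    return float(match.group(1))

def rate(sample_a: float, sample_b: float, interval_secs: float) -> float:
    """Messages per second between two counter samples."""
    return (sample_b - sample_a) / interval_secs

# Two synthetic scrapes taken 10 seconds apart:
scrape_1 = "rabbitmq_backup_messages_read 120000\n"
scrape_2 = "rabbitmq_backup_messages_read 185000\n"
mps = rate(read_counter(scrape_1, "rabbitmq_backup_messages_read"),
           read_counter(scrape_2, "rabbitmq_backup_messages_read"), 10.0)
print(mps)  # 6500.0
```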
Common Causes and Solutions

Low Prefetch Count

The default prefetch_count of 100 may be too low for high-throughput scenarios.

backup:
  prefetch_count: 500  # Increase from default 100

Sequential Queue Processing

If max_concurrent_queues is 1, queues are processed one at a time.

backup:
  max_concurrent_queues: 8  # Process more queues in parallel

Slow Requeue Strategy

The get strategy processes one message at a time and is the slowest.

backup:
  requeue_strategy: cancel  # Fastest strategy (default)

High Compression Level

Compression levels above 6 significantly increase CPU time.

backup:
  compression: zstd
  compression_level: 3  # Default. Lower = faster

Or switch to LZ4 for faster compression:

backup:
  compression: lz4
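
The level/speed tradeoff is easy to observe directly. The sketch below uses Python's stdlib zlib as a stand-in, since zstd and lz4 bindings are third-party; the shape of the tradeoff (higher level, smaller output, more CPU time) is the same:

```python
import time
import zlib

# Repetitive JSON-like payload as a stand-in for queued messages
payload = b'{"order_id": 12345, "status": "shipped"}' * 10_000

for level in (1, 3, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    print(f"level {level}: {len(compressed):>7} bytes in {elapsed * 1000:.2f} ms")
```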

Network Latency to Storage

If the storage backend is in a different region, segment uploads are slow.

  • Move storage to the same region as the backup process
  • Increase segment_max_bytes to write fewer, larger segments
  • Check upload bandwidth with a presigned HTTPS URL, since curl cannot address s3:// directly: curl -o /dev/null -w "%{speed_upload}" -T testfile "$PRESIGNED_URL"

High Memory Usage

Symptoms

  • Process memory grows beyond expected limits
  • OOM kills in Kubernetes or Docker
  • Swap usage on bare-metal deployments

Diagnosis

Monitor process memory:

# Linux
ps -o rss,vsz,cmd -p $(pgrep rabbitmq-backup)

# In Kubernetes
kubectl top pod -n rabbitmq-backup

Common Causes and Solutions

Too Many Messages In-Flight

Memory scales with: prefetch_count x average_message_size x max_concurrent_queues

Example: 500 prefetch x 50 KB messages x 8 queues = 200 MB just for message buffers.

backup:
  prefetch_count: 50  # Reduce
  max_concurrent_queues: 2  # Reduce
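
The formula above can be wrapped in a small sizing helper (a back-of-the-envelope sketch; real usage adds runtime and segment-buffer overhead on top):

```python
def estimate_buffer_bytes(prefetch_count: int,
                          avg_message_bytes: int,
                          max_concurrent_queues: int) -> int:
    """Rough in-flight message buffer size; excludes segment buffers and overhead."""
    return prefetch_count * avg_message_bytes * max_concurrent_queues

# The example from above: 500 prefetch x 50 KB messages x 8 queues
print(estimate_buffer_bytes(500, 50_000, 8) // 1_000_000)  # 200 (MB)
```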

Large Segment Buffers

Segments accumulate messages in memory until flushed.

backup:
  segment_max_bytes: 33554432  # 32 MB instead of 128 MB
  segment_max_interval_ms: 10000  # Flush every 10 seconds
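
These two settings bound a buffer that flushes on whichever limit trips first. A minimal sketch of that behavior (class and field names are illustrative, not the tool's internals):

```python
import time

class SegmentBuffer:
    """Accumulate messages; flush when size or age exceeds the configured limits."""
    def __init__(self, max_bytes: int, max_interval_ms: int, flush_fn):
        self.max_bytes = max_bytes
        self.max_interval = max_interval_ms / 1000.0
        self.flush_fn = flush_fn
        self.messages: list[bytes] = []
        self.size = 0
        self.opened_at = time.monotonic()

    def append(self, message: bytes) -> None:
        self.messages.append(message)
        self.size += len(message)
        if (self.size >= self.max_bytes
                or time.monotonic() - self.opened_at >= self.max_interval):
            self.flush()

    def flush(self) -> None:
        if self.messages:
            self.flush_fn(self.messages)
        self.messages, self.size = [], 0
        self.opened_at = time.monotonic()

# Smaller max_bytes -> more frequent, smaller flushes -> lower peak memory.
flushed = []
buf = SegmentBuffer(max_bytes=100, max_interval_ms=10_000, flush_fn=flushed.append)
for _ in range(5):
    buf.append(b"x" * 40)  # size flush trips on every 3rd message (120 >= 100)
print(len(flushed))  # 1 flush so far; 2 messages still buffered
```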

Large Individual Messages

If messages are >1 MB each, even a modest prefetch count uses significant memory.

backup:
  prefetch_count: 10  # Very low prefetch for large messages
  max_concurrent_queues: 2

Kubernetes Memory Limits

Set appropriate resource limits based on your configuration:

resources:
  requests:
    memory: 256Mi
  limits:
    memory: 512Mi  # Allow headroom above estimated usage

Slow Restores

Symptoms

  • Restore takes much longer than the backup did
  • Low rabbitmq_restore_messages_published rate
  • Target broker CPU is high

Common Causes and Solutions

Small Batch Size

Publishing one message at a time is slow. Increase the batch size:

restore:
  produce_batch_size: 500  # Increase from default 100
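
Batching amortizes the per-publish round trip. The grouping itself is straightforward (a sketch of the idea, not the tool's implementation):

```python
from itertools import islice
from typing import Iterable, Iterator

def batches(messages: Iterable[bytes], batch_size: int) -> Iterator[list[bytes]]:
    """Yield fixed-size batches; the final batch may be short."""
    it = iter(messages)
    while batch := list(islice(it, batch_size)):
        yield batch

# 1,250 messages with produce_batch_size=500 -> 3 publish calls instead of 1,250
print([len(b) for b in batches([b"m"] * 1250, 500)])  # [500, 500, 250]
```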

Publisher Confirms Overhead

Each batch waits for broker confirmation. For faster restore (with some risk):

restore:
  publisher_confirms: false  # Faster but risks message loss on broker crash

Target Broker Overloaded

If the target broker is struggling:

  • Add a rate limit: rate_limit_messages_per_sec: 5000
  • Reduce max_concurrent_queues to lower connection pressure
  • Check target broker memory and disk alarms: rabbitmqctl status
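
A per-second rate limit like rate_limit_messages_per_sec is commonly implemented as a token bucket. A minimal illustrative sketch (not the tool's internals):

```python
import time

class TokenBucket:
    """Allow up to `rate` operations per second, with bursts up to `burst` tokens."""
    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.last = time.monotonic()

    def acquire(self) -> None:
        """Block until one token is available, then consume it."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)

# 2,500 messages at 5,000 msg/s with a burst of 100:
limiter = TokenBucket(rate=5000, burst=100)
start = time.monotonic()
for _ in range(2500):
    limiter.acquire()
print(f"elapsed: {time.monotonic() - start:.2f}s")  # roughly (2500 - 100) / 5000 ≈ 0.48 s
```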

Network Bottlenecks

Symptoms

  • Backup or restore is slow despite low CPU and memory usage
  • Storage uploads/downloads are the bottleneck
  • High latency to RabbitMQ broker

Diagnosis

Test Storage Throughput

# S3 upload speed test
dd if=/dev/zero bs=1M count=100 | aws s3 cp - s3://bucket/speedtest
aws s3 rm s3://bucket/speedtest

# S3 download speed test
aws s3 cp s3://bucket/some-segment /dev/null

Test AMQP Latency

# Test TCP latency to broker
ping rabbitmq-host
# Test AMQP port
time nc -zv rabbitmq-host 5672

Solutions

Co-locate with RabbitMQ

Run the backup tool on the same network as the RabbitMQ broker to minimize AMQP latency.

Co-locate with Storage

Run the backup tool in the same cloud region as the storage bucket.

Reduce Storage API Calls

Increase segment size to write fewer, larger objects:

backup:
  segment_max_bytes: 268435456  # 256 MB

Reduce Checkpoint Sync Frequency

Checkpoint syncs upload to remote storage. Reduce frequency:

offset_storage:
  sync_interval_secs: 120  # Every 2 minutes instead of 30 seconds

Monitoring Performance

Use Prometheus metrics to track performance over time:

# Messages per second (backup)
rate(rabbitmq_backup_messages_read[1m])

# Bytes per second (backup)
rate(rabbitmq_backup_bytes_read[1m])

# Messages per second (restore)
rate(rabbitmq_restore_messages_published[1m])

# Compression ratio
1 - (rate(rabbitmq_backup_segments_bytes[5m]) / rate(rabbitmq_backup_bytes_read[5m]))

# Error rate
rate(rabbitmq_backup_errors[5m])
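
The compression-ratio expression reduces to simple arithmetic; for example (the byte rates below are hypothetical):

```python
def compression_ratio(segment_bytes_rate: float, read_bytes_rate: float) -> float:
    """Fraction of input bytes saved by compression, as in the PromQL expression."""
    return 1 - segment_bytes_rate / read_bytes_rate

# Hypothetical rates: 25 MB/s written to storage vs 100 MB/s read from the broker
print(compression_ratio(25_000_000, 100_000_000))  # 0.75 -> 75% saved
```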

Set up alerts for performance degradation:

- alert: RabbitMQBackupSlowThroughput
  expr: rate(rabbitmq_backup_messages_read[15m]) < 100
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Backup throughput is below 100 messages/sec"

Quick Tuning Checklist

  • prefetch_count -- increase for throughput, decrease for memory
  • max_concurrent_queues -- increase for parallelism, decrease for lower load
  • requeue_strategy -- use cancel (fastest) unless you have a specific reason
  • compression -- use lz4 if CPU-bound, zstd level 1-3 otherwise
  • segment_max_bytes -- increase for fewer storage API calls
  • produce_batch_size -- increase for faster restores
  • Network -- co-locate with both RabbitMQ and storage