Performance Issues
This page covers common performance problems, how to diagnose them, and solutions.
Slow Backups
Symptoms
- Backup takes much longer than expected
- Low `rabbitmq_backup_messages_read` rate in Prometheus
- Long gaps between segment writes in logs
Diagnosis
Enable verbose logging to see per-queue timing:
rabbitmq-backup backup -v --config backup.yaml
Check the metrics endpoint for throughput:
curl -s http://localhost:8080/metrics | grep messages_read
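To turn two scrapes of that counter into a throughput figure, here is a minimal sketch, assuming the endpoint exposes a plain, unlabeled Prometheus counter named `rabbitmq_backup_messages_read`:

```python
import re

def parse_counter(metrics_text: str, name: str) -> float:
    """Extract the value of a simple (unlabeled) Prometheus counter line."""
    m = re.search(rf"^{re.escape(name)}\s+([0-9.eE+]+)$", metrics_text, re.MULTILINE)
    if m is None:
        raise KeyError(name)
    return float(m.group(1))

def rate(sample_a: float, sample_b: float, interval_secs: float) -> float:
    """Messages per second between two counter samples."""
    return (sample_b - sample_a) / interval_secs

# Example with two scraped samples taken 10 seconds apart:
t0 = "rabbitmq_backup_messages_read 12000\n"
t1 = "rabbitmq_backup_messages_read 15000\n"
print(rate(parse_counter(t0, "rabbitmq_backup_messages_read"),
           parse_counter(t1, "rabbitmq_backup_messages_read"), 10))  # 300.0
```

In practice you would scrape the `/metrics` endpoint twice and feed both bodies through `parse_counter`.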
Common Causes and Solutions
Low Prefetch Count
The default prefetch_count of 100 may be too low for high-throughput scenarios.
backup:
prefetch_count: 500 # Increase from default 100
Sequential Queue Processing
If max_concurrent_queues is 1, queues are processed one at a time.
backup:
max_concurrent_queues: 8 # Process more queues in parallel
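The effect of this setting can be sketched with a semaphore that caps how many queue workers run at once. The `drain_queue` worker below is a hypothetical stand-in, not the tool's internals:

```python
import asyncio

async def drain_queue(name: str, sem: asyncio.Semaphore, active: list, peak: list):
    # Hypothetical stand-in for backing up one queue; tracks peak concurrency.
    async with sem:
        active[0] += 1
        peak[0] = max(peak[0], active[0])
        await asyncio.sleep(0.01)  # simulated work
        active[0] -= 1

async def backup_all(queues, max_concurrent_queues=8):
    sem = asyncio.Semaphore(max_concurrent_queues)
    active, peak = [0], [0]
    await asyncio.gather(*(drain_queue(q, sem, active, peak) for q in queues))
    return peak[0]

peak = asyncio.run(backup_all([f"q{i}" for i in range(20)], max_concurrent_queues=4))
print(peak)  # 4: never more than max_concurrent_queues workers at once
```

Raising the limit increases parallelism but also multiplies memory and connection pressure, which is why the memory section below recommends lowering it.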
Slow Requeue Strategy
The get strategy processes one message at a time and is the slowest.
backup:
requeue_strategy: cancel # Fastest strategy (default)
High Compression Level
Compression levels above 6 significantly increase CPU time.
backup:
compression: zstd
compression_level: 3 # Default. Lower = faster
Or switch to LZ4 for faster compression:
backup:
compression: lz4
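Neither zstd nor LZ4 bindings ship in the Python standard library, so this sketch uses stdlib zlib purely to illustrate the level-versus-CPU tradeoff the settings above control:

```python
import time
import zlib

payload = b"the quick brown fox jumps over the lazy dog " * 2000

for level in (1, 3, 9):
    start = time.perf_counter()
    out = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    print(f"level {level}: {len(out)} bytes in {elapsed * 1e3:.2f} ms")

# Higher levels shrink output further but cost more CPU per segment.
assert zlib.decompress(zlib.compress(payload, 1)) == payload
```

The same shape holds for zstd: levels above the default buy small size gains for a large CPU cost, which is why level 1-3 is the sweet spot for backup throughput.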
Network Latency to Storage
If the storage backend is in a different region, segment uploads are slow.
- Move storage to the same region as the backup process
- Move storage to the same region as the backup process
- Increase `segment_max_bytes` to write fewer, larger segments
- Check upload bandwidth with:
time aws s3 cp testfile s3://bucket/test
High Memory Usage
Symptoms
- Process memory grows beyond expected limits
- OOM kills in Kubernetes or Docker
- Swap usage on bare-metal deployments
Diagnosis
Monitor process memory:
# Linux
ps -o rss,vsz,cmd -p $(pgrep rabbitmq-backup)
# In Kubernetes
kubectl top pod -n rabbitmq-backup
Common Causes and Solutions
Too Many Messages In-Flight
Memory scales with: prefetch_count x average_message_size x max_concurrent_queues
Example: 500 prefetch x 50 KB messages x 8 queues = 200 MB just for message buffers.
backup:
prefetch_count: 50 # Reduce
max_concurrent_queues: 2 # Reduce
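A quick back-of-the-envelope helper for the formula above (decimal megabytes, matching the worked example):

```python
def estimate_buffer_bytes(prefetch_count: int,
                          avg_message_bytes: int,
                          max_concurrent_queues: int) -> int:
    """Worst-case bytes held in message buffers at once."""
    return prefetch_count * avg_message_bytes * max_concurrent_queues

# The example from the text: 500 prefetch x 50 KB messages x 8 queues.
print(estimate_buffer_bytes(500, 50_000, 8) / 1_000_000)  # 200.0 MB

# After reducing both knobs as suggested: 50 x 50 KB x 2 queues.
print(estimate_buffer_bytes(50, 50_000, 2) / 1_000_000)  # 5.0 MB
```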
Large Segment Buffers
Segments accumulate messages in memory until flushed.
backup:
segment_max_bytes: 33554432 # 32 MB instead of 128 MB
segment_max_interval_ms: 10000 # Flush every 10 seconds
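The flush behaviour these two settings control can be sketched as a buffer that flushes on whichever threshold trips first. Class and method names here are illustrative, not the tool's internals:

```python
import time

class SegmentBuffer:
    """Accumulate messages; flush when size or age exceeds a limit."""

    def __init__(self, max_bytes=33_554_432, max_interval_secs=10.0):
        self.max_bytes = max_bytes
        self.max_interval_secs = max_interval_secs
        self.messages = []
        self.size = 0
        self.opened_at = time.monotonic()

    def add(self, message: bytes) -> bool:
        """Buffer a message; return True if this add triggered a flush."""
        self.messages.append(message)
        self.size += len(message)
        age = time.monotonic() - self.opened_at
        if self.size >= self.max_bytes or age >= self.max_interval_secs:
            self.flush()
            return True
        return False

    def flush(self):
        # A real implementation would compress and upload the segment here.
        self.messages.clear()
        self.size = 0
        self.opened_at = time.monotonic()

buf = SegmentBuffer(max_bytes=100, max_interval_secs=60.0)
print(buf.add(b"x" * 40))   # False: 40 bytes buffered, under the limit
print(buf.add(b"x" * 70))   # True: 110 bytes >= 100, segment flushed
```

Smaller `max_bytes` means less memory per open segment, at the cost of more storage objects.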
Large Individual Messages
If messages are >1 MB each, even a modest prefetch count uses significant memory.
backup:
prefetch_count: 10 # Very low prefetch for large messages
max_concurrent_queues: 2
Kubernetes Memory Limits
Set appropriate resource limits based on your configuration:
resources:
requests:
memory: 256Mi
limits:
memory: 512Mi # Allow headroom above estimated usage
Slow Restores
Symptoms
- Restore takes much longer than the backup did
- Low `rabbitmq_restore_messages_published` rate
- Target broker CPU is high
Common Causes and Solutions
Small Batch Size
Publishing one message at a time is slow. Increase the batch size:
restore:
produce_batch_size: 500 # Increase from default 100
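Batching amortizes the per-publish round trip. A minimal sketch of the chunking, where the `publish_batch` callback is hypothetical:

```python
def publish_in_batches(messages, publish_batch, produce_batch_size=500):
    """Group messages into batches and hand each batch to one publish call."""
    for i in range(0, len(messages), produce_batch_size):
        publish_batch(messages[i:i + produce_batch_size])

# With 1,250 messages and a batch size of 500, only 3 publish calls are made.
calls = []
publish_in_batches(list(range(1250)), calls.append, produce_batch_size=500)
print([len(b) for b in calls])  # [500, 500, 250]
```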
Publisher Confirms Overhead
Each batch waits for broker confirmation. For faster restore (with some risk):
restore:
publisher_confirms: false # Faster but risks message loss on broker crash
Target Broker Overloaded
If the target broker is struggling:
- Add a rate limit: `rate_limit_messages_per_sec: 5000`
- Reduce `max_concurrent_queues` to lower connection pressure
- Check target broker memory and disk alarms with `rabbitmqctl status`
Network Bottlenecks
Symptoms
- Backup or restore is slow despite low CPU and memory usage
- Storage uploads/downloads are the bottleneck
- High latency to RabbitMQ broker
Diagnosis
Test Storage Throughput
# S3 upload speed test
dd if=/dev/zero bs=1M count=100 | aws s3 cp - s3://bucket/speedtest
aws s3 rm s3://bucket/speedtest
# S3 download speed test
aws s3 cp s3://bucket/some-segment /dev/null
Test AMQP Latency
# Test TCP latency to broker
ping rabbitmq-host
# Test AMQP port
time nc -zv rabbitmq-host 5672
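The same check can be scripted. A small helper measuring TCP connect time to the AMQP port, with host and port as placeholders:

```python
import socket
import time

def tcp_connect_latency(host: str, port: int, timeout: float = 5.0) -> float:
    """Seconds to establish a TCP connection; a rough proxy for AMQP reachability."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return time.perf_counter() - start

# Example (placeholder host):
# print(f"{tcp_connect_latency('rabbitmq-host', 5672) * 1e3:.1f} ms")
```

Note this measures only the TCP handshake, not the AMQP handshake or TLS negotiation, which add further round trips.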
Solutions
Co-locate with RabbitMQ
Run the backup tool on the same network as the RabbitMQ broker to minimize AMQP latency.
Co-locate with Storage
Run the backup tool in the same cloud region as the storage bucket.
Reduce Storage API Calls
Increase segment size to write fewer, larger objects:
backup:
segment_max_bytes: 268435456 # 256 MB
Reduce Checkpoint Sync Frequency
Checkpoint syncs upload to remote storage. Reduce frequency:
offset_storage:
sync_interval_secs: 120 # Every 2 minutes instead of 30 seconds
Monitoring Performance
Use Prometheus metrics to track performance over time:
# Messages per second (backup)
rate(rabbitmq_backup_messages_read[1m])
# Bytes per second (backup)
rate(rabbitmq_backup_bytes_read[1m])
# Messages per second (restore)
rate(rabbitmq_restore_messages_published[1m])
# Compression ratio
1 - (rate(rabbitmq_backup_segments_bytes[5m]) / rate(rabbitmq_backup_bytes_read[5m]))
# Error rate
rate(rabbitmq_backup_errors[5m])
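The compression-ratio expression above is one minus stored bytes over read bytes; a worked example with hypothetical throughput numbers:

```python
def compression_ratio(segment_bytes_per_sec: float, read_bytes_per_sec: float) -> float:
    """Fraction of input bytes saved by compression (mirrors the PromQL above)."""
    return 1 - segment_bytes_per_sec / read_bytes_per_sec

# 25 MB/s written to storage while reading 100 MB/s from the broker:
print(compression_ratio(25e6, 100e6))  # 0.75, i.e. 75% of bytes saved
```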
Set up alerts for performance degradation:
- alert: RabbitMQBackupSlowThroughput
expr: rate(rabbitmq_backup_messages_read[15m]) < 100
for: 30m
labels:
severity: warning
annotations:
summary: "Backup throughput is below 100 messages/sec"
Quick Tuning Checklist
- `prefetch_count` -- increase for throughput, decrease for memory
- `max_concurrent_queues` -- increase for parallelism, decrease for lower load
- `requeue_strategy` -- use `cancel` (fastest) unless you have a specific reason
- `compression` -- use `lz4` if CPU-bound, `zstd` level 1-3 otherwise
- `segment_max_bytes` -- increase for fewer storage API calls
- `produce_batch_size` -- increase for faster restores
- Network -- co-locate with both RabbitMQ and storage