# Point-in-Time Recovery (PITR)
rabbitmq-backup supports Point-in-Time Recovery (PITR), allowing you to restore only the messages that were backed up within a specific time window. This is useful for disaster recovery, data correction, and selective replay scenarios.
## How PITR Works
PITR in rabbitmq-backup is built on three foundations:
- `backed_up_at` timestamp: Every message is stamped with the exact time it was captured during backup.
- Segment-level timestamp range: Each segment header records the first and last `backed_up_at` timestamps, enabling coarse-grained filtering without decompressing.
- Record-level filtering: During restore, each record's `backed_up_at` is compared against the configured time window.
## The `backed_up_at` Timestamp
When a message is read from RabbitMQ during backup, the tool records the current UTC time in epoch milliseconds:
```rust
BackupRecord {
    // ... message fields ...
    backed_up_at: chrono::Utc::now().timestamp_millis(),
    // ...
}
```
This timestamp represents when the message was captured by the backup tool, not when the message was originally published. This distinction is important:
| Timestamp | Source | Meaning |
|---|---|---|
| `backed_up_at` | Set by backup tool | When this message was read from the queue and recorded in a segment. |
| `properties.timestamp` | Set by publisher (optional) | AMQP `timestamp` property set by the original message publisher. May be null. |
### Why `backed_up_at` Instead of `properties.timestamp`?
- `properties.timestamp` is optional -- many publishers do not set it. Using it for PITR would exclude messages without a timestamp.
- `properties.timestamp` is application-controlled -- it could be set to any value (past, future, or epoch 0). It is not a reliable ordering indicator.
- `backed_up_at` is guaranteed -- every record has this timestamp, set at capture time by the backup tool.
- `backed_up_at` reflects backup ordering -- messages within a segment are ordered by capture time, making time-window filtering predictable.
## Segment-Level Timestamps
Each RBAK segment header (32 bytes) includes the first and last `backed_up_at` timestamps:

```
Header bytes 16-23: First Timestamp (i64 LE, epoch ms)
Header bytes 24-31: Last Timestamp  (i64 LE, epoch ms)
```
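As a sketch (hypothetical helper, assuming only the 32-byte layout above), the two header timestamps can be decoded like this:

```rust
// Decode the first/last backed_up_at timestamps from a 32-byte RBAK
// segment header: i64 little-endian at bytes 16-23 and 24-31.
fn header_timestamps(header: &[u8; 32]) -> (i64, i64) {
    let first = i64::from_le_bytes(header[16..24].try_into().unwrap());
    let last = i64::from_le_bytes(header[24..32].try_into().unwrap());
    (first, last)
}

fn main() {
    // Build a dummy header carrying the example manifest values.
    let mut header = [0u8; 32];
    header[16..24].copy_from_slice(&1712700000000i64.to_le_bytes());
    header[24..32].copy_from_slice(&1712720000000i64.to_le_bytes());
    println!("{:?}", header_timestamps(&header));
}
```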
These timestamps are also stored in the manifest's `SegmentMetadata`:

```json
{
  "key": "backup-001/queues/_default/orders/segment-0001.zst",
  "first_timestamp": 1712700000000,
  "last_timestamp": 1712720000000,
  "record_count": 1000
}
```
This enables two levels of optimization during restore:
- Segment-level skip: If a segment's `[first_timestamp, last_timestamp]` range does not overlap with the restore time window, the entire segment can be skipped without downloading or decompressing it.
- Record-level filter: Within a segment that overlaps the time window, individual records are filtered by their `backed_up_at` timestamp.
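The segment-level skip reduces to an interval-overlap test. A minimal sketch (not the actual engine code; `None` stands for an unbounded window edge):

```rust
// A segment overlaps the restore window unless it ends before the window
// starts or begins after the window ends.
fn segment_overlaps(first: i64, last: i64, start: Option<i64>, end: Option<i64>) -> bool {
    let ends_after_start = start.map_or(true, |s| last >= s);
    let begins_before_end = end.map_or(true, |e| first <= e);
    ends_after_start && begins_before_end
}

fn main() {
    // The example manifest segment against a much later window: skip it.
    let overlaps = segment_overlaps(
        1712700000000, 1712720000000,
        Some(1744279200000), Some(1744293600000),
    );
    println!("download segment: {}", overlaps);
}
```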
## Time Window Configuration
PITR is configured in the `restore` section of the YAML configuration:

```yaml
restore:
  time_window_start: 1744279200000 # 2025-04-10T10:00:00Z in epoch ms
  time_window_end: 1744293600000   # 2025-04-10T14:00:00Z in epoch ms
```
### Fields
| Field | Type | Default | Description |
|---|---|---|---|
| `time_window_start` | i64 (epoch ms) | null | Include only records with `backed_up_at >= time_window_start`. If null, no start filter is applied. |
| `time_window_end` | i64 (epoch ms) | null | Include only records with `backed_up_at <= time_window_end`. If null, no end filter is applied. |
### Filter Combinations
| `time_window_start` | `time_window_end` | Behavior |
|---|---|---|
| null | null | No PITR filtering -- restore all messages. |
| Set | null | Restore messages from the start time onward. |
| null | Set | Restore messages up to the end time. |
| Set | Set | Restore messages within the closed interval [start, end]. |
## Filtering Algorithm
The filtering logic is implemented in `restore/engine.rs`:

```rust
fn should_include(record: &BackupRecord, opts: &RestoreOptions) -> bool {
    let after_start = opts
        .time_window_start
        .is_none_or(|s| record.backed_up_at >= s);
    let before_end = opts
        .time_window_end
        .is_none_or(|e| record.backed_up_at <= e);
    after_start && before_end
}
```
The filter is applied after decompressing the segment and before publishing messages to the target broker:
```
Segment downloaded
        ↓
CRC32 verified
        ↓
Payload decompressed
        ↓
Records parsed from length-prefixed JSON
        ↓
┌─────────────────────────────────────┐
│ PITR Filter: should_include()?      │
│   backed_up_at >= start?  AND       │
│   backed_up_at <= end?              │
└─────────────────────────────────────┘
  ↓ included              ↓ excluded
Published to target    Counted as "skipped"
```
## Restore Statistics
The restore engine tracks PITR filtering in its statistics:
```
INFO Queue orders restored: 542 published, 1000 skipped (PITR), 0 failed
INFO Restore complete: 542 restored, 1000 skipped, 0 failed (1 queues)
```
- restored: Messages that passed the PITR filter and were successfully published.
- skipped: Messages that were excluded by the PITR filter.
- failed: Messages that passed the filter but failed to publish.
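As an illustration of how these counters relate to the filter (a sketch; the variable and field names here are assumptions, not the tool's actual structs):

```rust
// Derive published/skipped counts from a list of backed_up_at timestamps
// and an optional [start, end] PITR window.
fn count_published_skipped(
    backed_up_at: &[i64],
    start: Option<i64>,
    end: Option<i64>,
) -> (u64, u64) {
    let (mut published, mut skipped) = (0u64, 0u64);
    for &ts in backed_up_at {
        let include = start.map_or(true, |s| ts >= s)
            && end.map_or(true, |e| ts <= e);
        if include { published += 1 } else { skipped += 1 }
    }
    (published, skipped)
}

fn main() {
    // Two timestamps inside the window, one before, one after.
    let ts = [1744270000000i64, 1744280000000, 1744290000000, 1744300000000];
    let (p, s) = count_published_skipped(&ts, Some(1744279200000), Some(1744293600000));
    println!("{} published, {} skipped (PITR)", p, s);
}
```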
## Practical Usage Patterns
### Scenario 1: Restore After a Bad Deployment
A faulty deployment at 14:30 UTC corrupted messages in the orders queue. You want to restore messages from before the deployment:
```yaml
restore:
  time_window_end: 1744295400000 # 2025-04-10T14:30:00Z
```
This restores all messages backed up before the deployment, discarding any captured after the corruption started.
### Scenario 2: Replay a Specific Time Window
You need to reprocess messages from a 2-hour window for debugging:
```yaml
restore:
  time_window_start: 1744279200000 # 2025-04-10T10:00:00Z
  time_window_end: 1744286400000   # 2025-04-10T12:00:00Z
  queue_mapping:
    orders: orders-replay          # Restore to a separate queue
  publish_mode: direct-to-queue
```
### Scenario 3: Restore Everything Since Last Known Good
Your last known-good state was at 06:00 UTC. Restore everything from that point:
```yaml
restore:
  time_window_start: 1744264800000 # 2025-04-10T06:00:00Z
```
### Scenario 4: Dry Run to Count Messages in a Window
Before performing a real restore, check how many messages fall within your time window:
```yaml
restore:
  time_window_start: 1744279200000
  time_window_end: 1744293600000
  dry_run: true
```
Output:
```
Dry run summary:
  Backup ID: prod-daily-2025-04-10
  Queues: 3
  Messages: 2621
  Segments: 4
  Size: 905216 bytes
  PITR window: 1744279200000 - 1744293600000 (epoch ms)
```
## Converting Human-Readable Dates to Epoch Milliseconds
The PITR configuration uses epoch milliseconds. Here are common ways to convert:
### Using `date` (Linux/macOS)
```bash
# GNU date (Linux): convert to epoch milliseconds
date -d "2025-04-10T14:30:00Z" +%s000
# Output: 1744295400000

# macOS (BSD date): set TZ=UTC so the string is parsed as UTC
TZ=UTC date -j -f "%Y-%m-%dT%H:%M:%SZ" "2025-04-10T14:30:00Z" +%s000
```
### Using Python
```python
from datetime import datetime, timezone

dt = datetime(2025, 4, 10, 14, 30, 0, tzinfo=timezone.utc)
print(int(dt.timestamp() * 1000))
# Output: 1744295400000
```
### Using JavaScript
```javascript
new Date("2025-04-10T14:30:00Z").getTime()
// Output: 1744295400000
```
## Limitations and Considerations
### Timestamp Granularity
`backed_up_at` has millisecond granularity. Messages captured within the same millisecond have the same timestamp. This is normally not an issue because:
- Backup throughput is typically hundreds to thousands of messages per second, not millions.
- The time window is usually specified in minutes or hours, not milliseconds.
### Backup Duration and Timestamp Spread
The `backed_up_at` timestamp reflects when the message was captured, not when it was published. For a backup that takes 10 minutes:
- Messages at the start of the backup have earlier `backed_up_at` values.
- Messages at the end have later `backed_up_at` values.
- The spread is the duration of the backup operation.
This means PITR granularity is limited by the backup frequency. For sub-minute granularity, consider running backups more frequently or using stream queues with offset-based checkpointing.
### Cross-Queue Consistency
PITR filtering is applied per-queue. If multiple queues have related messages (e.g., orders and payments), filtering by the same time window will include the messages that were captured during that window in each queue. However, because queues are backed up in parallel, there may be slight timing differences between queues.
For strict cross-queue consistency, consider:
- Using a single backup operation for all related queues (they will have similar `backed_up_at` spreads).
- Adding a small buffer to the time window edges (e.g., extend by 1 minute on each side).
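The buffering idea can be sketched as a small helper (hypothetical; not part of the tool) that widens an optional window by a fixed margin on each side:

```rust
// Widen a PITR window by `buffer_ms` on each side to absorb small
// cross-queue capture-time skew. `None` edges stay unbounded.
fn widen(start: Option<i64>, end: Option<i64>, buffer_ms: i64) -> (Option<i64>, Option<i64>) {
    (start.map(|s| s - buffer_ms), end.map(|e| e + buffer_ms))
}

fn main() {
    // Extend a 10:00-14:00 UTC window by one minute on each side.
    let (s, e) = widen(Some(1744279200000), Some(1744293600000), 60_000);
    println!("{:?} {:?}", s, e);
}
```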
### No Random-Access Seek
The current implementation reads and filters all records within a segment sequentially. There is no index for seeking directly to a specific timestamp within a segment. For very large segments, this means the entire segment is decompressed even if only a few records match the time window.
Mitigation: Use smaller `segment_max_bytes` or `segment_max_interval_ms` values to produce more, smaller segments. This improves segment-level skip efficiency at the cost of more storage objects.
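For example, a backup configuration biased toward smaller segments might look like this (a sketch: the `backup:` section name and the chosen values are assumptions; only the two field names come from this document):

```yaml
backup:
  segment_max_bytes: 4194304      # cut a new segment at 4 MiB
  segment_max_interval_ms: 60000  # or at least once per minute
```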