Snapshots and Replay
In event sourcing, aggregates are reconstructed by replaying their event stream. For aggregates with short event streams, this is fast and simple. For aggregates with hundreds or thousands of events, replay becomes a performance bottleneck. Snapshots and smart replay strategies address this.
Why Snapshots Exist
Consider an Order aggregate that has been heavily modified -- 200 line additions, removals, and status changes over its lifecycle. Loading this aggregate requires:
- Fetching 200 events from the event store
- Deserializing each event from JSON
- Calling When(event) 200 times to rebuild state
For a single request, this might take 50-100ms. Under load, with many concurrent aggregate loads, this adds up.
Snapshots solve this by periodically saving the aggregate's fully reconstructed state at a specific version. On subsequent loads, you restore from the snapshot and only replay events that occurred after it.
Without snapshot: replay all 200 events.
With snapshot at version 150: deserialize the snapshot + replay 50 events (only the recent ones).
How Snapshots Work
Save Flow
After appending new events to the event store, check the snapshot policy. If it fires, serialize the aggregate's current state and write it to the snapshot store, recording the aggregate's version alongside it.
Restore Flow
On load, fetch the latest snapshot for the aggregate. If one exists, deserialize it into the aggregate and fetch only the events after the snapshot's version; otherwise fetch the full stream. Replay the fetched events to bring the aggregate up to date.
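The restore flow can be sketched as a pure function. This is an illustrative sketch in TypeScript, not this project's code; the `EventRecord`, `Snapshot`, and `applyEvent` names are assumptions, with `applyEvent` standing in for the aggregate's When() method.

```typescript
interface EventRecord { version: number; type: string; data: unknown }
interface OrderState { status: string; lines: number }
interface Snapshot { version: number; state: OrderState }

// Apply one event to the state (stands in for the aggregate's When()).
function applyEvent(state: OrderState, evt: EventRecord): OrderState {
  switch (evt.type) {
    case "LineAdded":   return { ...state, lines: state.lines + 1 };
    case "OrderPlaced": return { ...state, status: "Placed" };
    default:            return state;
  }
}

// Restore: start from the snapshot if present, then replay only the tail.
function load(snapshot: Snapshot | null, events: EventRecord[]): OrderState {
  const base: OrderState = snapshot?.state ?? { status: "New", lines: 0 };
  const fromVersion = snapshot?.version ?? 0;
  return events
    .filter(e => e.version > fromVersion)  // only events after the snapshot
    .reduce(applyEvent, base);
}
```

Note that the no-snapshot path is just the snapshot path with a version of 0 — full replay is the degenerate case, which is what makes the fallback in the next section safe.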
The Key Insight
The snapshot is a performance optimization, not a source of truth. The event stream remains the authoritative record. If a snapshot is corrupted, missing, or stale, you can always fall back to full replay. Snapshots can be deleted and regenerated at any time without data loss.
Snapshot Strategy
Every N Events
The simplest strategy: save a snapshot every N events (e.g., every 50 or 100 events).
```csharp
// After saving events
if (aggregate.Version % SnapshotInterval == 0)
{
    var snapshot = new AggregateSnapshot(
        AggregateId: aggregate.Id,
        Version: aggregate.Version,
        State: JsonSerializer.Serialize(aggregate),
        CreatedAt: DateTimeOffset.UtcNow);

    await snapshotStore.SaveSnapshotAsync(snapshot);
}
```

Tradeoff: Simple to implement, but you pay the snapshot cost even for aggregates that are rarely loaded. A frequently-modified, rarely-read aggregate wastes effort snapshotting.
On Demand
Only snapshot when the event count since the last snapshot exceeds a threshold and the aggregate is being loaded.
```csharp
// During load
var snapshot = await snapshotStore.GetSnapshotAsync(aggregateId);
var events = snapshot != null
    ? await eventStore.GetEventsAsync(aggregateId, snapshot.Version + 1)
    : await eventStore.GetEventsAsync(aggregateId);

// ... restore the aggregate from the snapshot (if any) and replay events ...

// If we had to replay too many events, save a fresh snapshot
if (events.Count > SnapshotThreshold)
{
    await SaveSnapshotAsync(aggregate);
}
```

Tradeoff: Only snapshots aggregates that are actually being read, but the first load after a long gap is slow (pays the full replay cost before snapshotting).
Recommended Approach
Combine both: snapshot every N events during save (cheap, predictable), and also snapshot on demand when replay count exceeds a threshold during load (catches edge cases).
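The combined policy reduces to two small predicates. A hedged sketch in TypeScript for illustration — the interval, threshold, and function names are assumptions, not from this codebase:

```typescript
const SNAPSHOT_INTERVAL = 100;  // every-N rule, checked during save
const SNAPSHOT_THRESHOLD = 50;  // replay-count rule, checked during load

// On save: snapshot when the version crosses an interval boundary.
function shouldSnapshotOnSave(version: number): boolean {
  return version % SNAPSHOT_INTERVAL === 0;
}

// On load: snapshot when we had to replay too many events past the snapshot.
function shouldSnapshotOnLoad(replayedEventCount: number): boolean {
  return replayedEventCount > SNAPSHOT_THRESHOLD;
}
```

The save-side check runs after events are appended; the load-side check runs against the number of events replayed past the snapshot, catching aggregates the interval rule missed.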
Where Snapshots Are Stored
In this project, snapshots are stored in Redis, separate from the event store in PostgreSQL.
Why Redis?
| Factor | Redis | PostgreSQL (same as event store) |
|---|---|---|
| Read latency | Sub-millisecond | Low milliseconds |
| Write pattern | Key-value overwrite (natural for snapshots) | Row upsert |
| TTL support | Built-in (auto-expire old snapshots) | Must manage manually |
| Cache locality | Already used for read model caching | Separate concern |
| Operational overhead | Already in the stack | No additional infrastructure |
Snapshot Storage Interface
From this project's ISnapshotStore:
```csharp
public interface ISnapshotStore
{
    Task<AggregateSnapshot?> GetSnapshotAsync(
        Guid aggregateId, CancellationToken cancellationToken = default);

    Task SaveSnapshotAsync(
        AggregateSnapshot snapshot, CancellationToken cancellationToken = default);
}
```

The AggregateSnapshot record:
```csharp
public sealed record AggregateSnapshot(
    Guid AggregateId,
    int Version,
    string State, // JSON-serialized aggregate state
    DateTimeOffset CreatedAt);
```

Redis Key Design
```
snapshot:{aggregateType}:{aggregateId}

// Example: snapshot:Order:7c9e6679-7425-40de-944b-e07fc1f90ae7
```

Each aggregate has at most one snapshot in Redis (the latest). SaveSnapshotAsync overwrites the previous snapshot.
Event Replay for Rebuilding Projections
Replay is not only for aggregate reconstruction -- it is also how projections (read models) are built and rebuilt.
Full Replay
Replaying the entire event store to build a projection from scratch. Used when:
- A new projection is created (new read model that did not exist before)
- A projection's logic has changed (bug fix, new field added)
- A projection's data is corrupted or lost
Cost: Proportional to the total number of events in the store. For a system with millions of events, full replay can take minutes to hours.
Incremental Catch-Up
Processing only events that occurred after the last checkpoint. Used for:
- Normal operation (real-time projection updates)
- Recovery after a brief outage
- Catching up after a deployment
Cost: Proportional only to events since the last checkpoint. In normal operation, this is a handful of events.
Replay Cost and Optimization
Performance Considerations
| Factor | Impact | Optimization |
|---|---|---|
| Event count | Linear cost with total events | Use incremental catch-up; reserve full replay for initialization |
| Deserialization | JSON parsing is CPU-bound | Use System.Text.Json source generators for fast deserialization |
| Database writes | Projection updates are I/O-bound | Batch writes (e.g., upsert 100 read model rows per transaction) |
| Event store queries | Sequential scan for global replay | Index the global sequence number; paginate by keyset (WHERE sequence > checkpoint ... LIMIT n) rather than growing OFFSETs |
| Projection state | Rebuilding drops and recreates read model data | Use idempotent upserts so replay can be resumed if interrupted |
| Concurrent access | Projection rebuild while serving reads | Use a separate read model table during rebuild, then swap (blue-green) |
Batch Replay Pattern
For full replays, process events in batches rather than one at a time:
```csharp
long lastCheckpoint = 0;
const int batchSize = 1000;

while (true)
{
    var events = await eventStore.GetGlobalEvents(
        fromCheckpoint: lastCheckpoint, limit: batchSize);
    if (events.Count == 0) break;

    foreach (var evt in events)
    {
        projection.Handle(evt);
    }

    await projection.FlushAsync(); // Batch write to read model
    lastCheckpoint = events.Last().SequenceNumber;
    await SaveCheckpoint(lastCheckpoint);
}
```

Blue-Green Replay
When rebuilding a projection in production:
- Create a new read model table (e.g., order_read_models_v2)
- Run full replay into the new table
- Once caught up, atomically swap the table name (or update the connection string)
- Drop the old table
This avoids serving stale or incomplete data during the rebuild.
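The atomic swap in step 3 can be done with rename-based DDL; PostgreSQL runs DDL transactionally, so both renames commit or neither does. A hypothetical sketch (table names are illustrative) that just generates the statements:

```typescript
// Generate the transactional rename swap for a blue-green projection cutover.
function swapStatements(base: string): string[] {
  return [
    "BEGIN;",
    `ALTER TABLE ${base} RENAME TO ${base}_old;`,  // park the live table
    `ALTER TABLE ${base}_v2 RENAME TO ${base};`,   // promote the rebuilt table
    "COMMIT;",
  ];
}
```

Dropping `${base}_old` is deferred (step 4) so the old data survives until the new projection is verified.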
Snapshot Pitfalls
1. Snapshot as Source of Truth
Treating the snapshot as authoritative rather than the event stream. If the snapshot format changes (e.g., you add a new field to the aggregate), old snapshots may deserialize incorrectly.
Fix: Always be able to fall back to full replay. If a snapshot fails to deserialize, delete it and replay from events.
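The fallback can be made mechanical by treating any deserialization failure as "no snapshot". An illustrative sketch in TypeScript (the function name and payload shape are assumptions):

```typescript
// Parse a raw snapshot payload; any failure degrades to a full replay.
function parseSnapshot(raw: string | null): { version: number; state: unknown } | null {
  if (raw === null) return null;     // no snapshot stored
  try {
    return JSON.parse(raw);          // corrupt payload throws here
  } catch {
    return null;                     // caller replays from version 0
  }
}
```

Returning null on failure means the load path never branches on "corrupt vs missing" — both resolve to full replay.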
2. Forgetting to Invalidate
Not invalidating or updating snapshots when the aggregate's serialization format changes during a deployment.
Fix: Include a schema version in the snapshot. If the loaded snapshot's schema version does not match the current code, discard it and replay from events.
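The schema-version guard is a one-line comparison. A hedged sketch in TypeScript — `SCHEMA_VERSION` and the snapshot shape are illustrative, not from this project's AggregateSnapshot:

```typescript
const SCHEMA_VERSION = 3; // bumped whenever the serialized aggregate shape changes

interface VersionedSnapshot { schemaVersion: number; version: number; state: string }

// Return the snapshot only if it was written by the current schema;
// otherwise the caller discards it and replays from events.
function usableSnapshot(s: VersionedSnapshot | null): VersionedSnapshot | null {
  if (s === null) return null;
  return s.schemaVersion === SCHEMA_VERSION ? s : null;
}
```

Bumping the constant at deploy time invalidates every old snapshot lazily, with no migration job: each stale snapshot is discarded the first time it is loaded.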
3. Snapshot Drift
The snapshot represents a version of the aggregate, but the aggregate's When() logic has changed since the snapshot was taken. Replaying events after the snapshot with new logic produces a different state than the snapshot + new events.
Fix: When When() logic changes, invalidate all snapshots for that aggregate type. This forces a full replay with the new logic.
Memory Hook
"A snapshot is a bookmark, not the book. The events are the book. If you lose the bookmark, you just start reading from the beginning."
Further Reading
- Event Store Design -- the storage layer snapshots sit on top of
- Projections and Read Models -- the other consumer of event replay
- Event Sourcing Overview -- the overall pattern