
Snapshots and Replay

In event sourcing, aggregates are reconstructed by replaying their event stream. For aggregates with short event streams, this is fast and simple. For aggregates with hundreds or thousands of events, replay becomes a performance bottleneck. Snapshots and smart replay strategies address this.


Why Snapshots Exist

Consider an Order aggregate that has been heavily modified over its lifecycle -- 200 events covering line-item additions, removals, and status changes. Loading this aggregate requires:

  1. Fetching 200 events from the event store
  2. Deserializing each event from JSON
  3. Calling When(event) 200 times to rebuild state

For a single request, this might take 50-100ms. Under load, with many concurrent aggregate loads, this adds up.

Snapshots solve this by periodically saving the aggregate's fully reconstructed state at a specific version. On subsequent loads, you restore from the snapshot and only replay events that occurred after it.

Without a snapshot: replay all 200 events.
With a snapshot at version 150: deserialize the snapshot, then replay only the 50 events after it.


How Snapshots Work

Save Flow

  1. Append the aggregate's new events to the event store as usual.
  2. If the snapshot strategy fires (see below), serialize the aggregate's current state.
  3. Write the snapshot to the snapshot store together with the aggregate's current version.

Restore Flow

  1. Fetch the latest snapshot for the aggregate, if one exists.
  2. Deserialize the snapshot back into aggregate state.
  3. Fetch and replay only the events with versions greater than the snapshot's version.
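
A minimal load sketch that ties the restore flow together. The snapshotStore and eventStore dependencies are assumed injected, and the Order aggregate with its When(event) apply method follows the conventions used elsewhere in these docs:

csharp
public async Task<Order> LoadAsync(Guid aggregateId)
{
    // 1. Try the snapshot first
    var snapshot = await snapshotStore.GetSnapshotAsync(aggregateId);

    // 2. Start from the snapshot state, or from a fresh aggregate
    var order = snapshot is not null
        ? JsonSerializer.Deserialize<Order>(snapshot.State)!
        : new Order();

    // 3. Replay only the events recorded after the snapshot's version
    var events = snapshot is not null
        ? await eventStore.GetEventsAsync(aggregateId, snapshot.Version + 1)
        : await eventStore.GetEventsAsync(aggregateId);

    foreach (var evt in events)
    {
        order.When(evt);
    }

    return order;
}

If no snapshot exists, the same path degrades gracefully to a full replay from version 0.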

The Key Insight

The snapshot is a performance optimization, not a source of truth. The event stream remains the authoritative record. If a snapshot is corrupted, missing, or stale, you can always fall back to full replay. Snapshots can be deleted and regenerated at any time without data loss.


Snapshot Strategy

Every N Events

The simplest strategy: save a snapshot every N events (e.g., every 50 or 100 events).

csharp
// After saving events
if (aggregate.Version % SnapshotInterval == 0)
{
    var snapshot = new AggregateSnapshot(
        AggregateId: aggregate.Id,
        Version: aggregate.Version,
        State: JsonSerializer.Serialize(aggregate),
        CreatedAt: DateTimeOffset.UtcNow);

    await snapshotStore.SaveSnapshotAsync(snapshot);
}

Tradeoff: Simple to implement, but you pay the snapshot cost even for aggregates that are rarely loaded. A frequently-modified, rarely-read aggregate wastes effort snapshotting.

On Demand

Only snapshot when the event count since the last snapshot exceeds a threshold and the aggregate is being loaded.

csharp
// During load
var snapshot = await snapshotStore.GetSnapshotAsync(aggregateId);
var events = snapshot != null
    ? await eventStore.GetEventsAsync(aggregateId, snapshot.Version + 1)
    : await eventStore.GetEventsAsync(aggregateId);

// ...restore the aggregate from the snapshot and apply `events` (see the load sketch above)...

// If we had to replay too many events, save a fresh snapshot
if (events.Count > SnapshotThreshold)
{
    await SaveSnapshotAsync(aggregate);
}

Tradeoff: Only snapshots aggregates that are actually being read, but the first load after a long gap is slow (pays the full replay cost before snapshotting).

Combine both: snapshot every N events during save (cheap, predictable), and also snapshot on demand when replay count exceeds a threshold during load (catches edge cases).


Where Snapshots Are Stored

In this project, snapshots are stored in Redis, separate from the event store in PostgreSQL.

Why Redis?

| Factor | Redis | PostgreSQL (same as event store) |
| --- | --- | --- |
| Read latency | Sub-millisecond | Low milliseconds |
| Write pattern | Key-value overwrite (natural for snapshots) | Row upsert |
| TTL support | Built-in (auto-expire old snapshots) | Must manage manually |
| Cache locality | Already used for read model caching | Separate concern |
| Operational overhead | Already in the stack | No additional infrastructure |

Snapshot Storage Interface

From this project's ISnapshotStore:

csharp
public interface ISnapshotStore
{
    Task<AggregateSnapshot?> GetSnapshotAsync(
        Guid aggregateId, CancellationToken cancellationToken = default);

    Task SaveSnapshotAsync(
        AggregateSnapshot snapshot, CancellationToken cancellationToken = default);
}

The AggregateSnapshot record:

csharp
public sealed record AggregateSnapshot(
    Guid AggregateId,
    int Version,
    string State,          // JSON-serialized aggregate state
    DateTimeOffset CreatedAt);

Redis Key Design

text
snapshot:{aggregateType}:{aggregateId}
// Example: snapshot:Order:7c9e6679-7425-40de-944b-e07fc1f90ae7

Each aggregate has at most one snapshot in Redis (the latest). SaveSnapshotAsync overwrites the previous snapshot.
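
A hedged implementation sketch using StackExchange.Redis. The generic type parameter supplies the {aggregateType} key segment, and the 30-day TTL is an illustrative choice, not the project's setting. StackExchange.Redis calls do not accept cancellation tokens, so those parameters are kept only to satisfy the interface:

csharp
using System;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;
using StackExchange.Redis;

public sealed class RedisSnapshotStore<TAggregate> : ISnapshotStore
{
    private readonly IDatabase _redis;
    private static readonly TimeSpan Ttl = TimeSpan.FromDays(30);  // illustrative

    public RedisSnapshotStore(IConnectionMultiplexer connection)
        => _redis = connection.GetDatabase();

    // Key scheme from above: snapshot:{aggregateType}:{aggregateId}
    private static string Key(Guid aggregateId)
        => $"snapshot:{typeof(TAggregate).Name}:{aggregateId}";

    public async Task<AggregateSnapshot?> GetSnapshotAsync(
        Guid aggregateId, CancellationToken cancellationToken = default)
    {
        var value = await _redis.StringGetAsync(Key(aggregateId));
        return value.IsNullOrEmpty
            ? null
            : JsonSerializer.Deserialize<AggregateSnapshot>(value.ToString());
    }

    public async Task SaveSnapshotAsync(
        AggregateSnapshot snapshot, CancellationToken cancellationToken = default)
    {
        // Overwrites the previous snapshot; the TTL auto-expires stale entries
        await _redis.StringSetAsync(
            Key(snapshot.AggregateId),
            JsonSerializer.Serialize(snapshot),
            Ttl);
    }
}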


Event Replay for Rebuilding Projections

Replay is not only for aggregate reconstruction -- it is also how projections (read models) are built and rebuilt.

Full Replay

Replaying the entire event store to build a projection from scratch. Used when:

  • A new projection is created (new read model that did not exist before)
  • A projection's logic has changed (bug fix, new field added)
  • A projection's data is corrupted or lost

Cost: Proportional to the total number of events in the store. For a system with millions of events, full replay can take minutes to hours.

Incremental Catch-Up

Processing only events that occurred after the last checkpoint. Used for:

  • Normal operation (real-time projection updates)
  • Recovery after a brief outage
  • Catching up after a deployment

Cost: Proportional only to events since the last checkpoint. In normal operation, this is a handful of events.
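
A hedged sketch of one catch-up pass. The checkpointStore and projection names are illustrative, and GetGlobalEvents follows the shape used in the batch replay pattern below:

csharp
// Resume from the projection's persisted position rather than zero
long checkpoint = await checkpointStore.LoadAsync("order-projection");

// Fetch and apply only the events appended after the checkpoint
var events = await eventStore.GetGlobalEvents(fromCheckpoint: checkpoint, limit: 1000);
foreach (var evt in events)
{
    projection.Handle(evt);
}

if (events.Count > 0)
{
    await projection.FlushAsync();
    await checkpointStore.SaveAsync("order-projection", events.Last().SequenceNumber);
}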


Replay Cost and Optimization

Performance Considerations

| Factor | Impact | Optimization |
| --- | --- | --- |
| Event count | Linear cost with total events | Use incremental catch-up; reserve full replay for initialization |
| Deserialization | JSON parsing is CPU-bound | Use System.Text.Json source generators for fast deserialization |
| Database writes | Projection updates are I/O-bound | Batch writes (e.g., upsert 100 read model rows per transaction) |
| Event store queries | Sequential scan for global replay | Use timestamp or sequence number index; paginate with LIMIT/OFFSET |
| Projection state | Rebuilding drops and recreates read model data | Use idempotent upserts so replay can be resumed if interrupted |
| Concurrent access | Projection rebuild while serving reads | Use a separate read model table during rebuild, then swap (blue-green) |
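
For the deserialization row, a System.Text.Json source generator context removes runtime reflection from the replay hot path. OrderCreated and OrderLineAdded here stand in for this project's actual event types:

csharp
using System.Text.Json;
using System.Text.Json.Serialization;

// Compile-time serializer metadata for event types (no runtime reflection)
[JsonSerializable(typeof(OrderCreated))]
[JsonSerializable(typeof(OrderLineAdded))]
internal partial class EventJsonContext : JsonSerializerContext
{
}

// During replay:
OrderCreated? evt = JsonSerializer.Deserialize(json, EventJsonContext.Default.OrderCreated);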

Batch Replay Pattern

For full replays, process events in batches rather than one at a time:

csharp
long lastCheckpoint = 0;
const int batchSize = 1000;

while (true)
{
    var events = await eventStore.GetGlobalEvents(fromCheckpoint: lastCheckpoint, limit: batchSize);
    if (events.Count == 0) break;

    foreach (var evt in events)
    {
        projection.Handle(evt);
    }

    await projection.FlushAsync();  // Batch write to read model
    lastCheckpoint = events.Last().SequenceNumber;
    await SaveCheckpoint(lastCheckpoint);
}

Blue-Green Replay

When rebuilding a projection in production:

  1. Create a new read model table (e.g., order_read_models_v2)
  2. Run full replay into the new table
  3. Once caught up, atomically swap the table name (or update the connection string)
  4. Drop the old table

This avoids serving stale or incomplete data during the rebuild.
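
Step 3's swap can be a pair of renames in a single transaction. A sketch assuming PostgreSQL (where DDL is transactional), an open Npgsql connection, and Dapper for execution:

csharp
// Atomic swap: both renames commit together or not at all
await using var tx = await connection.BeginTransactionAsync();

await connection.ExecuteAsync(
    "ALTER TABLE order_read_models RENAME TO order_read_models_old;", transaction: tx);
await connection.ExecuteAsync(
    "ALTER TABLE order_read_models_v2 RENAME TO order_read_models;", transaction: tx);

await tx.CommitAsync();  // both renames become visible together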


Snapshot Pitfalls

1. Snapshot as Source of Truth

Treating the snapshot as authoritative rather than the event stream. If the snapshot format changes (e.g., you add a new field to the aggregate), old snapshots may deserialize incorrectly.

Fix: Always be able to fall back to full replay. If a snapshot fails to deserialize, delete it and replay from events.
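
A sketch of that fallback, reusing the load shape from earlier; the Order type is illustrative:

csharp
Order? restored = null;
if (snapshot is not null)
{
    try
    {
        restored = JsonSerializer.Deserialize<Order>(snapshot.State);
    }
    catch (JsonException)
    {
        // Corrupt or incompatible snapshot: discard it and rebuild from events
        restored = null;
    }
}
// restored == null here means "replay the full event stream from version 0"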

2. Forgetting to Invalidate

Not invalidating or updating snapshots when the aggregate's serialization format changes during a deployment.

Fix: Include a schema version in the snapshot. If the loaded snapshot's schema version does not match the current code, discard it and replay from events.
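
A sketch of the guard, assuming a SchemaVersion field has been added to AggregateSnapshot:

csharp
const int CurrentSchemaVersion = 2;  // bump whenever the serialized shape changes

var snapshot = await snapshotStore.GetSnapshotAsync(aggregateId);
if (snapshot is not null && snapshot.SchemaVersion != CurrentSchemaVersion)
{
    // Format mismatch: ignore the snapshot and rebuild from the event stream
    snapshot = null;
}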

3. Snapshot Drift

The snapshot represents a version of the aggregate, but the aggregate's When() logic has changed since the snapshot was taken. Replaying events after the snapshot with new logic produces a different state than the snapshot + new events.

Fix: When When() logic changes, invalidate all snapshots for that aggregate type. This forces a full replay with the new logic.
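
With the Redis key scheme shown earlier, invalidating one aggregate type's snapshots is a pattern delete. A sketch using StackExchange.Redis; KeysAsync uses SCAN under the hood, so this is safe against a live server:

csharp
// Delete every snapshot for one aggregate type, e.g. after changing Order.When()
var server = connection.GetServer(connection.GetEndPoints().First());
var db = connection.GetDatabase();

await foreach (var key in server.KeysAsync(pattern: "snapshot:Order:*"))
{
    await db.KeyDeleteAsync(key);
}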


Memory Hook

"A snapshot is a bookmark, not the book. The events are the book. If you lose the bookmark, you just start reading from the beginning."

