Snapshots and Replay
In event sourcing, aggregates are reconstructed by replaying their event stream. For aggregates with short event streams, this is fast and simple. For aggregates with hundreds or thousands of events, replay becomes a performance bottleneck. Snapshots and smart replay strategies address this.
Why Snapshots Exist
Consider an Order aggregate that has been heavily modified -- 200 line additions, removals, and status changes over its lifecycle. Loading this aggregate requires:
- Fetching 200 events from the event store
- Deserializing each event from JSON
- Calling When(event) 200 times to rebuild state
For a single request, this might take 50-100ms. Under load, with many concurrent aggregate loads, this adds up.
Snapshots solve this by periodically saving the aggregate's fully reconstructed state at a specific version. On subsequent loads, you restore from the snapshot and only replay events that occurred after it.
Without snapshot: replay all 200 events.
With snapshot at version 150: deserialize the snapshot + replay 50 events (only the recent ones).
How Snapshots Work
Save Flow
After appending new events to the event store, check the snapshot policy. If it fires, serialize the aggregate's current state and write it to the snapshot store, recording the aggregate's version alongside it.
Restore Flow
On load, fetch the latest snapshot for the aggregate. If one exists, deserialize it into the aggregate and fetch only the events after the snapshot's version; otherwise fetch the full stream. Replay the fetched events to bring the aggregate up to date.
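The restore flow can be sketched as a pure function. This is an illustrative sketch in TypeScript, not this project's code; the `EventRecord`, `Snapshot`, and `applyEvent` names are assumptions, with `applyEvent` standing in for the aggregate's When() method.

```typescript
interface EventRecord { version: number; type: string; data: unknown }
interface OrderState { status: string; lines: number }
interface Snapshot { version: number; state: OrderState }

// Apply one event to the state (stands in for the aggregate's When()).
function applyEvent(state: OrderState, evt: EventRecord): OrderState {
  switch (evt.type) {
    case "LineAdded":   return { ...state, lines: state.lines + 1 };
    case "OrderPlaced": return { ...state, status: "Placed" };
    default:            return state;
  }
}

// Restore: start from the snapshot if present, then replay only the tail.
function load(snapshot: Snapshot | null, events: EventRecord[]): OrderState {
  const base: OrderState = snapshot?.state ?? { status: "New", lines: 0 };
  const fromVersion = snapshot?.version ?? 0;
  return events
    .filter(e => e.version > fromVersion)  // only events after the snapshot
    .reduce(applyEvent, base);
}
```

Note that the no-snapshot path is just the snapshot path with a version of 0 — full replay is the degenerate case, which is what makes the fallback in the next section safe.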
The Key Insight
The snapshot is a performance optimization, not a source of truth. The event stream remains the authoritative record. If a snapshot is corrupted, missing, or stale, you can always fall back to full replay. Snapshots can be deleted and regenerated at any time without data loss.
Snapshot Strategy
Every N Events
The simplest strategy: save a snapshot every N events (e.g., every 50 or 100 events).
```csharp
// After saving events
if (aggregate.Version % SnapshotInterval == 0)
{
    var snapshot = new AggregateSnapshot(
        AggregateId: aggregate.Id,
        Version: aggregate.Version,
        State: JsonSerializer.Serialize(aggregate),
        CreatedAt: DateTimeOffset.UtcNow);

    await snapshotStore.SaveSnapshotAsync(snapshot);
}
```

Tradeoff: Simple to implement, but you pay the snapshot cost even for aggregates that are rarely loaded. A frequently-modified, rarely-read aggregate wastes effort snapshotting.
On Demand
Only snapshot when the event count since the last snapshot exceeds a threshold and the aggregate is being loaded.
```csharp
// During load
var snapshot = await snapshotStore.GetSnapshotAsync(aggregateId);
var events = snapshot != null
    ? await eventStore.GetEventsAsync(aggregateId, snapshot.Version + 1)
    : await eventStore.GetEventsAsync(aggregateId);

// ... restore the aggregate from the snapshot (if any) and replay events ...

// If we had to replay too many events, save a fresh snapshot
if (events.Count > SnapshotThreshold)
{
    await SaveSnapshotAsync(aggregate);
}
```

Tradeoff: Only snapshots aggregates that are actually being read, but the first load after a long gap is slow (pays the full replay cost before snapshotting).
Recommended Approach
Combine both: snapshot every N events during save (cheap, predictable), and also snapshot on demand when replay count exceeds a threshold during load (catches edge cases).
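The combined policy reduces to two small predicates. A hedged sketch in TypeScript for illustration — the interval, threshold, and function names are assumptions, not from this codebase:

```typescript
const SNAPSHOT_INTERVAL = 100;  // every-N rule, checked during save
const SNAPSHOT_THRESHOLD = 50;  // replay-count rule, checked during load

// On save: snapshot when the version crosses an interval boundary.
function shouldSnapshotOnSave(version: number): boolean {
  return version % SNAPSHOT_INTERVAL === 0;
}

// On load: snapshot when we had to replay too many events past the snapshot.
function shouldSnapshotOnLoad(replayedEventCount: number): boolean {
  return replayedEventCount > SNAPSHOT_THRESHOLD;
}
```

The save-side check runs after events are appended; the load-side check runs against the number of events replayed past the snapshot, catching aggregates the interval rule missed.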
Where Snapshots Are Stored
In this project, snapshots are stored in Redis, separate from the event store in PostgreSQL.
Why Redis?
| Factor | Redis | PostgreSQL (same as event store) |
|---|---|---|
| Read latency | Sub-millisecond | Low milliseconds |
| Write pattern | Key-value overwrite (natural for snapshots) | Row upsert |
| TTL support | Built-in (auto-expire old snapshots) | Must manage manually |
| Cache locality | Already used for read model caching | Separate concern |
| Operational overhead | Already in the stack | No additional infrastructure |
Snapshot Storage Interface
From this project's ISnapshotStore:
```csharp
public interface ISnapshotStore
{
    Task<AggregateSnapshot?> GetSnapshotAsync(
        Guid aggregateId, CancellationToken cancellationToken = default);

    Task SaveSnapshotAsync(
        AggregateSnapshot snapshot, CancellationToken cancellationToken = default);
}
```

The AggregateSnapshot record:
```csharp
public sealed record AggregateSnapshot(
    Guid AggregateId,
    int Version,
    string State, // JSON-serialized aggregate state
    DateTimeOffset CreatedAt);
```

Redis Key Design
```
snapshot:{aggregateType}:{aggregateId}

// Example: snapshot:Order:7c9e6679-7425-40de-944b-e07fc1f90ae7
```

Each aggregate has at most one snapshot in Redis (the latest). SaveSnapshotAsync overwrites the previous snapshot.
Event Replay for Rebuilding Projections
Replay is not only for aggregate reconstruction -- it is also how projections (read models) are built and rebuilt.
Full Replay
Replaying the entire event store to build a projection from scratch. Used when:
- A new projection is created (new read model that did not exist before)
- A projection's logic has changed (bug fix, new field added)
- A projection's data is corrupted or lost
Cost: Proportional to the total number of events in the store. For a system with millions of events, full replay can take minutes to hours.
Incremental Catch-Up
Processing only events that occurred after the last checkpoint. Used for:
- Normal operation (real-time projection updates)
- Recovery after a brief outage
- Catching up after a deployment
Cost: Proportional only to events since the last checkpoint. In normal operation, this is a handful of events.
Replay Cost and Optimization
Performance Considerations
| Factor | Impact | Optimization |
|---|---|---|
| Event count | Linear cost with total events | Use incremental catch-up; reserve full replay for initialization |
| Deserialization | JSON parsing is CPU-bound | Use System.Text.Json source generators for fast deserialization |
| Database writes | Projection updates are I/O-bound | Batch writes (e.g., upsert 100 read model rows per transaction) |
| Event store queries | Sequential scan for global replay | Index the global sequence number; paginate by keyset (WHERE sequence > checkpoint ... LIMIT n) rather than growing OFFSETs |
| Projection state | Rebuilding drops and recreates read model data | Use idempotent upserts so replay can be resumed if interrupted |
| Concurrent access | Projection rebuild while serving reads | Use a separate read model table during rebuild, then swap (blue-green) |
Batch Replay Pattern
For full replays, process events in batches rather than one at a time:
```csharp
long lastCheckpoint = 0;
const int batchSize = 1000;

while (true)
{
    var events = await eventStore.GetGlobalEvents(
        fromCheckpoint: lastCheckpoint, limit: batchSize);
    if (events.Count == 0) break;

    foreach (var evt in events)
    {
        projection.Handle(evt);
    }

    await projection.FlushAsync(); // Batch write to read model
    lastCheckpoint = events.Last().SequenceNumber;
    await SaveCheckpoint(lastCheckpoint);
}
```

Blue-Green Replay
When rebuilding a projection in production:
- Create a new read model table (e.g., order_read_models_v2)
- Run full replay into the new table
- Once caught up, atomically swap the table name (or update the connection string)
- Drop the old table
This avoids serving stale or incomplete data during the rebuild.
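The atomic swap in step 3 can be done with rename-based DDL; PostgreSQL runs DDL transactionally, so both renames commit or neither does. A hypothetical sketch (table names are illustrative) that just generates the statements:

```typescript
// Generate the transactional rename swap for a blue-green projection cutover.
function swapStatements(base: string): string[] {
  return [
    "BEGIN;",
    `ALTER TABLE ${base} RENAME TO ${base}_old;`,  // park the live table
    `ALTER TABLE ${base}_v2 RENAME TO ${base};`,   // promote the rebuilt table
    "COMMIT;",
  ];
}
```

Dropping `${base}_old` is deferred (step 4) so the old data survives until the new projection is verified.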
Snapshot Pitfalls
1. Snapshot as Source of Truth
Treating the snapshot as authoritative rather than the event stream. If the snapshot format changes (e.g., you add a new field to the aggregate), old snapshots may deserialize incorrectly.
Fix: Always be able to fall back to full replay. If a snapshot fails to deserialize, delete it and replay from events.
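The fallback can be made mechanical by treating any deserialization failure as "no snapshot". An illustrative sketch in TypeScript (the function name and payload shape are assumptions):

```typescript
// Parse a raw snapshot payload; any failure degrades to a full replay.
function parseSnapshot(raw: string | null): { version: number; state: unknown } | null {
  if (raw === null) return null;     // no snapshot stored
  try {
    return JSON.parse(raw);          // corrupt payload throws here
  } catch {
    return null;                     // caller replays from version 0
  }
}
```

Returning null on failure means the load path never branches on "corrupt vs missing" — both resolve to full replay.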
2. Forgetting to Invalidate
Not invalidating or updating snapshots when the aggregate's serialization format changes during a deployment.
Fix: Include a schema version in the snapshot. If the loaded snapshot's schema version does not match the current code, discard it and replay from events.
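The schema-version guard is a one-line comparison. A hedged sketch in TypeScript — `SCHEMA_VERSION` and the snapshot shape are illustrative, not from this project's AggregateSnapshot:

```typescript
const SCHEMA_VERSION = 3; // bumped whenever the serialized aggregate shape changes

interface VersionedSnapshot { schemaVersion: number; version: number; state: string }

// Return the snapshot only if it was written by the current schema;
// otherwise the caller discards it and replays from events.
function usableSnapshot(s: VersionedSnapshot | null): VersionedSnapshot | null {
  if (s === null) return null;
  return s.schemaVersion === SCHEMA_VERSION ? s : null;
}
```

Bumping the constant at deploy time invalidates every old snapshot lazily, with no migration job: each stale snapshot is discarded the first time it is loaded.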
3. Snapshot Drift
The snapshot represents a version of the aggregate, but the aggregate's When() logic has changed since the snapshot was taken. Replaying events after the snapshot with new logic produces a different state than the snapshot + new events.
Fix: When When() logic changes, invalidate all snapshots for that aggregate type. This forces a full replay with the new logic.
Memory Hook
"A snapshot is a bookmark, not the book. The events are the book. If you lose the bookmark, you just start reading from the beginning."
Further Reading
- Event Store Design -- the storage layer snapshots sit on top of
- Projections and Read Models -- the other consumer of event replay
- Event Sourcing Overview -- the overall pattern