Error Handling

This guide covers how Protean handles message processing failures across all subscription types, including retry logic, dead letter queues (DLQ), and recovery mechanisms.

For monitoring failed messages in production, see Monitoring. For the DLQ CLI commands, see DLQ Commands.

Processing guarantees

Protean provides at-least-once delivery for all subscription types. When a handler fails to process a message, the framework retries it a configurable number of times before routing it to a dead letter queue or marking it as exhausted.

No message is silently dropped. Every failure is logged, traced, and either retried or preserved for manual inspection.

Subscription error flows

Each subscription type handles failures differently based on its underlying transport, but all share the same configuration model.

StreamSubscription

Used with Redis Streams for event and command handlers when default_subscription_type = "stream".

Flow: Handler fails → NACK + retry → DLQ after exhaustion

Handler raises an exception.
Retry count incremented. If retries remain, the message is NACKed (returned to the Redis consumer group for re-delivery) after a configurable delay.
When retries are exhausted, the message is published to a DLQ stream ({stream}:dlq) with enriched metadata, then ACKed from the original stream.
If priority lanes are enabled, backfill stream failures route to {stream}:backfill:dlq.

Deserialization errors skip the retry pipeline entirely and go straight to the DLQ, since retrying a malformed message cannot succeed.

[server.stream_subscription]
max_retries = 3
retry_delay_seconds = 1
enable_dlq = true

EventStoreSubscription

Used with event stores (Memory, MessageDB) for event and command handlers when default_subscription_type = "event_store" (the default).

Flow: Handler fails → position recorded → recovery pass retries

Handler raises an exception. The read position advances normally so the subscription is not blocked (avoids the poison-pill problem).
The failed position is recorded in-memory and checkpointed to the event store for durability.
A periodic recovery pass re-reads the original message from the event store and retries the handler.
On success, the position is marked as resolved. After exhausting max_retries, it is marked as exhausted and logged.

This approach leverages the event store's inherent durability — events are immutable and always available for replay.

[server.event_store_subscription]
max_retries = 3
retry_delay_seconds = 1
enable_recovery = true
recovery_interval_seconds = 30

BrokerSubscription

Used for subscribers that consume messages from external broker streams.

Flow: Handler fails → NACK + retry → DLQ after exhaustion

Subscriber raises an exception.
Retry count incremented. If retries remain, the message is NACKed after a configurable delay.
When retries are exhausted, the message is published to a DLQ stream ({stream}:dlq), ACKed from the original stream, and logged.

[server.broker_subscription]
max_retries = 3
retry_delay_seconds = 1
enable_dlq = true

OutboxProcessor

Handles outgoing messages (domain → broker). Uses exponential backoff with jitter. Failed messages stay in the outbox table with a retry status — no DLQ is needed because the outbox table itself is durable.

Dead letter queue lifecycle

How messages enter the DLQ

Messages enter the DLQ when:

A handler/subscriber fails more times than max_retries allows.
A message cannot be deserialized (StreamSubscription only).

Each DLQ message preserves the original payload and adds a _dlq_metadata dict:

{
  "original_stream": "orders",
  "original_id": "msg-abc-123",
  "consumer_group": "OrderHandler",
  "consumer": "OrderHandler-host-12345-a1b2c3",
  "failed_at": "2025-01-15T10:30:00+00:00",
  "retry_count": 3
}

Inspecting and replaying

Use the protean dlq CLI commands:

# List all DLQ messages
protean dlq list --domain=my_domain

# Filter by subscription
protean dlq list --domain=my_domain --subscription=orders

# Inspect a specific message
protean dlq inspect MSG_ID --domain=my_domain

# Replay a single message back to the original stream
protean dlq replay MSG_ID --domain=my_domain --subscription=orders

# Replay all messages for a subscription
protean dlq replay-all --domain=my_domain --subscription=orders

# Purge all DLQ messages for a subscription
protean dlq purge --domain=my_domain --subscription=orders

Disabling the DLQ

Set enable_dlq = false to discard messages after exhausting retries instead of routing them to a DLQ. The message is still ACKed (removed from pending) and logged as a warning.

[server.stream_subscription]
enable_dlq = false

Configuration reference

All error-handling configuration lives under [server] in domain.toml:

Key	Default	Description
`stream_subscription.max_retries`	3	Retry attempts before DLQ
`stream_subscription.retry_delay_seconds`	1	Delay between retries
`stream_subscription.enable_dlq`	true	Route to DLQ or discard
`event_store_subscription.max_retries`	3	Retry attempts before marking exhausted
`event_store_subscription.retry_delay_seconds`	1	Delay between recovery retries
`event_store_subscription.enable_recovery`	true	Enable periodic recovery pass
`event_store_subscription.recovery_interval_seconds`	30	Interval between recovery sweeps
`broker_subscription.max_retries`	3	Retry attempts before DLQ
`broker_subscription.retry_delay_seconds`	1	Delay between retries
`broker_subscription.enable_dlq`	true	Route to DLQ or discard

Per-handler overrides can be passed via constructor arguments when creating subscriptions programmatically.

Version conflict auto-retry

ExpectedVersionError is the most common transient failure in event-sourced and version-tracked systems. It occurs when two handlers concurrently modify the same aggregate — the second writer's version check fails because the first writer already advanced the version.

Protean handles this automatically at the @handle wrapper level. When a handler raises ExpectedVersionError, the framework:

Catches the exception before it reaches the subscription retry pipeline.
Waits with exponential backoff (50 ms → 100 ms → 200 ms …).
Re-executes the handler in a fresh UnitOfWork, so the aggregate is re-read at the latest version.
After exhausting fast retries (default 3), propagates the error to the subscription for normal retry/DLQ handling.

This is transparent — the subscription never sees transient version conflicts. Only persistent conflicts (extremely rare) surface as failures.

Why this works

Each retry creates a new UnitOfWork. The handler re-reads the aggregate from the event store (or database), which now reflects the concurrent write. The handler's business logic executes against the current state, and the write succeeds.

Configuration

Version retry is enabled by default with sensible defaults. Configure it under [server.version_retry] in domain.toml:

[server.version_retry]
enabled = true            # Set false to disable auto-retry
max_retries = 3           # Fast retries before propagating to subscription
base_delay_seconds = 0.05 # 50ms initial backoff delay
max_delay_seconds = 1.0   # Cap backoff at 1 second

Key	Default	Description
`enabled`	`true`	Enable/disable version conflict auto-retry
`max_retries`	`3`	Number of fast retries before propagating
`base_delay_seconds`	`0.05`	Initial backoff delay (doubles each retry)
`max_delay_seconds`	`1.0`	Maximum backoff delay cap

With the defaults, worst-case retry adds 350 ms (50 + 100 + 200 ms) before the handler either succeeds or the error escalates to the subscription.

Disabling auto-retry

[server.version_retry]
enabled = false

When disabled, ExpectedVersionError propagates immediately to the subscription retry/DLQ pipeline, like any other exception.

When auto-retry is not enough

Auto-retry works well for idempotent operations where either outcome is acceptable (e.g., updating user preferences). But not all version conflicts are equal — some mean a real business problem (e.g., two customers booking the same seat), and others require merge logic.

If your handler needs to distinguish between conflict types, catch ExpectedVersionError inside the handler method and handle it explicitly. When you catch it inside the handler, the framework's auto-retry does not trigger.

For a full treatment of the three conflict categories (last writer wins, business rejection, conditional merge), see Optimistic Concurrency as a Design Tool.

Custom error handling

Every handler and subscriber class can override handle_error() to implement custom error logic:

@domain.event_handler(part_of=Order)
class OrderEventHandler:
    @handle(OrderPlaced)
    def on_order_placed(self, event):
        # ... processing logic
        pass

    @classmethod
    def handle_error(cls, exc, message):
        """Called when the handler raises an exception."""
        if isinstance(exc, ExternalServiceUnavailable):
            alert_ops_team(exc, message)
        logger.error(f"OrderEventHandler failed: {exc}")

The handle_error() callback receives the exception and the original message. If handle_error() itself raises, the exception is caught and logged — the engine continues processing.

Trace events

The server emits trace events for observability:

Event	When	Key metadata
`message.acked`	Message processed successfully	`stream`, `handler`
`message.nacked`	Message failed, will retry	`stream`, `retry_count`, `max_retries`
`message.dlq`	Message moved to DLQ	`stream`, `dlq_stream`, `retry_count`
`handler.failed`	Handler raised an exception	`handler`, `error`

These events are visible in the Observatory dashboard and can be used for alerting.

Operational runbooks

Inspect a failure

# Find the failed message
protean dlq list --domain=my_domain

# Get full details
protean dlq inspect MSG_ID --domain=my_domain

Review the payload, error metadata, and retry count to determine the root cause.

Replay after fixing the bug

Fix the handler code that caused the failure.
Deploy the fix.
Replay the message:

protean dlq replay MSG_ID --domain=my_domain --subscription=orders

Bulk replay

After a transient issue (network outage, dependency downtime) is resolved:

protean dlq replay-all --domain=my_domain --subscription=orders

Clear stale DLQ messages

If messages are no longer relevant (e.g., superseded by newer events):

protean dlq purge --domain=my_domain --subscription=orders

Next steps

Optimistic Concurrency as a Design Tool — Classify version conflicts by business meaning
Monitoring — Observatory dashboard and metrics
Logging — Structured logging configuration
Production Deployment — Process management and scaling
Using Priority Lanes — Route background workloads