Chapter 17: When Things Go Wrong — Dead Letter Queues
A production bug in the ComplianceAlertHandler causes it to crash
with a TypeError on deposits where source_type is None — a
pre-upcasting edge case. The handler exhausts its retry attempts and
messages pile up in the dead-letter queue.
This chapter covers the operational workflow for discovering, diagnosing, and recovering from handler failures.
How DLQ Works
When a StreamSubscription handler throws an exception:
- The message is retried up to
max_retriestimes (default: 3) - Each retry uses exponential backoff (
retry_delay_seconds * 2^N) - After exhausting retries, the message moves to the DLQ stream
- The subscription continues processing the next message
The DLQ preserves the original payload, the failure reason, the retry count, and the timestamp. Nothing is lost.
Discovering the Problem
$ protean dlq list --domain=fidelis
DLQ ID Subscription Failure Reason Failed At Retries
1719234567890-0 fidelis::account TypeError: 'NoneT... 2025-06-16 14:22:00 3
1719234568123-0 fidelis::account TypeError: 'NoneT... 2025-06-16 14:22:01 3
1719234569456-0 fidelis::account TypeError: 'NoneT... 2025-06-16 14:23:15 3
3 DLQ message(s) found.
You can filter by subscription:
$ protean dlq list --subscription=fidelis::account --domain=fidelis
Inspecting a Failed Message
$ protean dlq inspect 1719234567890-0 --domain=fidelis
DLQ ID: 1719234567890-0
Stream: fidelis::account
Failure Reason: TypeError: 'NoneType' object has no attribute 'startswith'
Failed At: 2025-06-16 14:22:00
Retry Count: 3
Payload:
{
"type": "Fidelis.DepositMade.v1",
"data": {
"account_id": "acc-7742",
"amount": 12000.0,
"reference": "DEP-8834"
}
}
The inspection shows everything needed to diagnose the issue:
- The failure reason (
TypeError) tells you what went wrong. - The payload shows the exact message that caused the failure.
- The type (
Fidelis.DepositMade.v1) reveals it was a v1 event — thesource_typefield is missing because the upcaster has not run yet at the handler level.
Fixing and Replaying
After fixing the handler code (adding a None check for
source_type) and redeploying:
# Replay a single message
$ protean dlq replay 1719234567890-0 --domain=fidelis
Replayed message 1719234567890-0 to stream 'fidelis::account'.
# Replay all failed messages for a subscription
$ protean dlq replay-all --subscription=fidelis::account --domain=fidelis
Replay all DLQ messages for subscription 'fidelis::account'? [y/N]: y
Replayed 3 message(s) to stream 'fidelis::account'.
Replaying puts the message back on the original stream. The handler (now fixed) processes it normally.
Purging Unrecoverable Messages
If messages cannot be fixed (e.g., they reference deleted data):
$ protean dlq purge --subscription=fidelis::account --domain=fidelis
Purge all DLQ messages for subscription 'fidelis::account'? [y/N]: y
Purged 3 message(s) from DLQ.
Warning
purge permanently removes messages. Use it only when you are
certain the messages are unrecoverable or no longer relevant.
Verifying Recovery
$ protean dlq list --domain=fidelis
No DLQ messages found.
The Fix-and-Replay Cycle
This pattern will become your standard operating procedure:
- Discover —
protean dlq listfinds failed messages - Inspect —
protean dlq inspectreveals the cause - Fix — update handler code and redeploy
- Replay —
protean dlq replay-allreprocesses the messages - Verify —
protean dlq listconfirms the DLQ is empty
DLQ Configuration
The DLQ is configured in domain.toml:
[server.stream_subscription]
max_retries = 3 # Retry before DLQ
retry_delay_seconds = 1 # Base delay (exponential backoff)
enable_dlq = true # Enable dead-letter queue
Setting enable_dlq = false means failed messages are dropped after
exhausting retries. This is almost never what you want in production.
What We Built
- The fix-and-replay cycle for production incident response.
protean dlq list— discover failed messages.protean dlq inspect— diagnose the root cause.protean dlq replay/replay-all— reprocess after fixing.protean dlq purge— discard unrecoverable messages.- DLQ configuration for retries and backoff.
Next, we set up proactive monitoring to catch problems before they fill the DLQ.