[BEE-10005] Dead Letter Queues and Poison Messages

INFO

Route unprocessable messages to a dedicated holding queue. Monitor it. Inspect it. Fix the root cause before replaying. Never let a single bad message block the flow of all the good ones.

Context

In any message-driven system, some messages will fail to process. The failure can be transient — a downstream service is temporarily unavailable — or permanent — the payload is malformed, the schema has changed, or there is a bug in the consumer. Retrying a transient failure makes sense. Retrying a permanent failure indefinitely does not.

A poison message is a message that repeatedly fails processing regardless of how many times it is retried. If the messaging system has no mechanism to isolate it, a poison message can stall an entire queue: the consumer picks it up, fails, requeues it, picks it up again, fails again, in an infinite loop. No other messages in the queue are processed while the consumer is stuck.

A Dead Letter Queue (DLQ) is a dedicated destination — a separate queue or topic — where messages are routed after they have exhausted their retry budget. The DLQ is not a trash bin; it is a holding area for investigation and recovery. Messages in the DLQ are kept until an operator inspects them, determines the cause of failure, fixes the underlying issue, and either replays the messages or discards them.

Principle

Configure a DLQ for every queue that carries business-critical messages. Set a finite max retry count. Enrich DLQ messages with failure metadata. Monitor the DLQ and alert on any new arrivals. Never replay from the DLQ without first fixing the root cause.

What Is a Poison Message?

A poison message is one that cannot be processed successfully by a consumer, no matter how many times it is retried. The failure is permanent given the current state of the system.

Common causes:

Cause	Example
Schema mismatch	Producer serialized with Protobuf v2; consumer still expects v1
Invalid or missing data	`product_id` field is null; consumer cannot look up the product
Business rule violation	Order total is negative; validation always rejects it
Consumer bug	A null pointer exception in code that was recently deployed
Downstream permanently unavailable	A third-party service has been decommissioned
Message too large	Payload exceeds broker size limit; deserialization always fails

The key distinction from a transient failure: a transient failure will eventually succeed if retried after a delay. A poison message will never succeed until something external changes — the code is fixed, the data is corrected, or the schema is migrated.
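A minimal sketch of how a consumer might encode this distinction. The exception classes and the `TRANSIENT_TYPES` tuple are hypothetical; a real consumer would list the error types of its own dependencies.

```python
class TransientError(Exception):
    """Hypothetical marker: the operation may succeed if retried later."""


class PermanentError(Exception):
    """Hypothetical marker: retrying cannot help until code or data changes."""


# Assumed mapping of exception types to "worth retrying".
TRANSIENT_TYPES = (TransientError, TimeoutError, ConnectionError)


def is_retryable(exc: BaseException) -> bool:
    """Return True when a delayed retry has a chance of succeeding."""
    return isinstance(exc, TRANSIENT_TYPES)
```

This sketch treats every unrecognized exception as non-retryable. A more conservative design retries unknown errors too and relies on the bounded retry count to stop them.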

Why Infinite Retries Are Harmful

Without a DLQ, the typical fallback for processing failures is indefinite retry. This creates several problems:

Queue starvation. While the consumer is stuck retrying a poison message, all subsequent messages in the queue are delayed. In an ordered queue, no message behind the poison message can be processed at all.
Resource waste. CPU cycles, network calls, and downstream service connections are consumed on work that will never succeed.
Alert fatigue. If every retry emits an error log or metric, on-call engineers are buried in noise from a single bad message.
Hidden data integrity issues. Messages processed out of order after a long retry cycle may arrive in a state that downstream systems no longer expect.

The correct model: retry a bounded number of times with backoff (see BEE-12002), then route to the DLQ and move on.
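The bounded-retry model can be sketched with in-memory stand-ins for the broker. The `consume` function, its parameters, and the use of a `deque` and a `list` as queues are assumptions for illustration only; a real broker redelivers messages itself rather than looping inside the consumer.

```python
import time
from collections import deque
from typing import Callable


def consume(queue: deque, dlq: list, handler: Callable[[dict], None],
            max_attempts: int = 3, base_delay: float = 1.0,
            sleep: Callable[[float], None] = time.sleep) -> int:
    """Drain `queue` and return how many messages were handled successfully."""
    processed = 0
    while queue:
        message = queue.popleft()
        for attempt in range(1, max_attempts + 1):
            try:
                handler(message)
                processed += 1
                break
            except Exception as exc:
                if attempt == max_attempts:
                    # Retry budget exhausted: isolate the message and move on.
                    dlq.append({
                        "message": message,
                        "attempts": attempt,
                        "failure_reason": f"{type(exc).__name__}: {exc}",
                    })
                else:
                    # Exponential backoff: base, 2 * base, 4 * base, ...
                    sleep(base_delay * 2 ** (attempt - 1))
    return processed
```

With `max_attempts=3`, a message that always fails is tried three times and then lands in `dlq`, while the messages behind it are still processed.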

DLQ Message Enrichment

When a message is routed to the DLQ, the raw payload alone is rarely enough to diagnose the problem. The message should be enriched with metadata at the point of routing:

Metadata Field	Purpose
`source_queue`	Which queue the message originally came from
`failure_reason`	Exception type and message from the last failed attempt
`attempt_count`	How many times processing was attempted
`first_failure_time`	Timestamp of the first failed attempt
`last_failure_time`	Timestamp of the most recent failed attempt
`consumer_id`	Which consumer instance processed it
`original_message_id`	Stable identifier linking the DLQ entry to the source

Many brokers (AWS SQS, Azure Service Bus, ActiveMQ) populate some of this metadata automatically. For brokers that do not, the consumer should wrap the message before forwarding it to the DLQ.
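For a broker that does not add this metadata, the wrapping step might look like the following sketch. The function name `to_dlq_envelope` and the `original_payload` key are illustrative choices, not a broker API; the remaining keys follow the table above.

```python
from datetime import datetime, timezone
from typing import Optional


def to_dlq_envelope(payload: dict, *, source_queue: str, message_id: str,
                    exc: BaseException, attempt_count: int,
                    first_failure_time: datetime, consumer_id: str,
                    now: Optional[datetime] = None) -> dict:
    """Wrap a failed message with the metadata needed to diagnose it later."""
    now = now or datetime.now(timezone.utc)
    return {
        "original_payload": payload,
        "original_message_id": message_id,
        "source_queue": source_queue,
        "failure_reason": f"{type(exc).__name__}: {exc}",
        "attempt_count": attempt_count,
        "first_failure_time": first_failure_time.isoformat(),
        "last_failure_time": now.isoformat(),
        "consumer_id": consumer_id,
    }
```

Keeping the original payload untouched under its own key means a later replay can republish exactly what the producer sent.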

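Message Flow

A message flows from the source queue to the consumer. On success, the consumer acknowledges it. On failure, the message is redelivered with backoff until the max retry count is reached, at which point it is routed to the DLQ. There an alert fires, and an operator inspects the message, fixes the root cause, and then replays or discards it.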

Max Retry Count

The max retry count (called maxReceiveCount in AWS SQS, maxDeliveryCount in Azure Service Bus) is the number of times the broker will attempt delivery before routing to the DLQ.

Choosing the right value:

Too low (1–2): A single transient network hiccup routes a perfectly valid message to the DLQ. On-call burden increases.
Too high (50+): A poison message burns through retries slowly, blocking the queue for an extended period.
Typical sweet spot: 3–5 for most systems, combined with exponential backoff (see BEE-12002). This allows the system to survive brief outages without endlessly recycling unprocessable messages.

For queues processing time-sensitive work (payment processing, inventory holds), prefer a lower max retry count combined with shorter backoff intervals so the DLQ catches permanent failures quickly.
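To compare candidate values, it can help to estimate how long a poison message occupies a consumer before it reaches the DLQ. The helper below is a rough sketch that assumes doubling backoff with an optional cap; `time_to_dlq` and its parameters are hypothetical names, and the arithmetic ignores broker-side delays.

```python
def time_to_dlq(max_attempts: int, base_delay: float,
                processing_time: float = 0.0,
                max_delay: float = float("inf")) -> float:
    """Seconds spent on a message that fails every attempt.

    A backoff delay follows every attempt except the last one, so there are
    max_attempts - 1 delays of base_delay * 2**i, each capped at max_delay.
    """
    backoff = sum(min(base_delay * 2 ** i, max_delay)
                  for i in range(max_attempts - 1))
    return backoff + max_attempts * processing_time
```

Under these assumptions, 3 attempts with a 2 s base delay cost 6 s of backoff, while 50 attempts with the same base and a 60 s cap cost about 45 minutes.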

DLQ Monitoring and Alerting

A DLQ without monitoring is worse than no DLQ at all. Messages accumulate silently, the context needed to diagnose them goes stale, and recovery becomes harder.

Required alerts:

DLQ depth > 0 — Any message in the DLQ should trigger a notification. For high-volume systems, alert on depth > threshold (e.g., 5 messages) to avoid alert storms from transient spikes.
DLQ depth growing over time — Indicates a systematic failure, not a one-off. Escalate if depth increases across multiple check intervals.
DLQ message age exceeds SLA — If a DLQ message has been sitting uninspected for longer than the SLA permits (e.g., 24 hours), escalate.
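The three alert conditions above can be expressed as a small rule evaluation. This is a sketch over hypothetical inputs (a list of depth samples and the age of the oldest message); in practice these rules would usually live in the monitoring system rather than in application code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DlqSnapshot:
    depth_samples: List[int]        # oldest first, one sample per check interval
    oldest_message_age_s: float     # age of the oldest message in the DLQ


def evaluate_alerts(snap: DlqSnapshot, depth_threshold: int = 0,
                    growth_intervals: int = 3,
                    max_age_s: float = 24 * 3600) -> List[str]:
    """Return the names of the alert rules that currently fire."""
    alerts = []
    current = snap.depth_samples[-1] if snap.depth_samples else 0
    if current > depth_threshold:
        alerts.append("depth_exceeded")
    # Growing: depth rose across each of the last `growth_intervals` intervals.
    recent = snap.depth_samples[-(growth_intervals + 1):]
    if len(recent) == growth_intervals + 1 and all(
            later > earlier for earlier, later in zip(recent, recent[1:])):
        alerts.append("depth_growing")
    if current > 0 and snap.oldest_message_age_s > max_age_s:
        alerts.append("age_exceeds_sla")
    return alerts
```

The default `depth_threshold=0` matches the "any message" rule; a high-volume system would pass a larger value.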

Structured log lines emitted when messages are routed to the DLQ enable log-based alerting (see BEE-14002):

```json
{
  "level": "error",
  "event": "message.dead_lettered",
  "queue": "order.fulfillment",
  "dlq": "order.fulfillment.dlq",
  "message_id": "msg-abc123",
  "attempt_count": 3,
  "failure_reason": "ValidationException: product_id not found",
  "timestamp": "2024-03-15T14:23:01Z"
}
```
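One way to produce a log line of this shape, sketched with the standard library only. The function name `dead_letter_log_line` is hypothetical; the field names mirror the example above.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def dead_letter_log_line(queue: str, dlq: str, message_id: str,
                         attempt_count: int, failure_reason: str,
                         now: Optional[datetime] = None) -> str:
    """Build one JSON log line describing a dead-lettered message."""
    now = now or datetime.now(timezone.utc)
    record = {
        "level": "error",
        "event": "message.dead_lettered",
        "queue": queue,
        "dlq": dlq,
        "message_id": message_id,
        "attempt_count": attempt_count,
        "failure_reason": failure_reason,
        # Normalize to UTC so the trailing "Z" is always accurate.
        "timestamp": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return json.dumps(record)
```

Emitting one self-contained JSON object per line keeps the event easy to match on `event` and `queue` in a log-based alert rule.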

Manual Inspection and Replay

When the DLQ alert fires, the recovery process is:

Inspect the message. Read the payload and failure metadata. Understand what the consumer tried to do and why it failed.
Classify the failure. Is it a data problem (bad input from the producer), a code bug (fixed in a recent deploy), or an infrastructure issue (downstream service now healthy)?
Fix the root cause. Deploy the code fix, correct the upstream data, or confirm the downstream service is healthy.
Replay selectively. If the fix is targeted, replay only the affected messages. For broad fixes, replay the full DLQ batch.
Verify. Confirm the replayed messages are processed successfully. Watch consumer error rates.
Discard if unrecoverable. Some messages cannot be replayed (e.g., a time-sensitive notification for an event that has already passed). Document and discard them.

Do not replay without fixing the root cause. Replaying unfixed messages sends them straight back to the DLQ.
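Selective replay can be sketched as a filter over DLQ entries. Here the DLQ is a plain list of enriched entries and `publish` is a caller-supplied function; both are stand-ins for whatever the broker client actually provides, and `replay_selected` is a hypothetical name.

```python
from typing import Callable, List, Tuple


def replay_selected(dlq: List[dict],
                    publish: Callable[[str, dict], None],
                    should_replay: Callable[[dict], bool]) -> Tuple[int, int]:
    """Republish matching entries to their source queue.

    Returns (replayed, remaining). Entries that do not match stay in the DLQ.
    """
    remaining = []
    replayed = 0
    for entry in dlq:
        if should_replay(entry):
            publish(entry["source_queue"], entry["original_payload"])
            replayed += 1
        else:
            remaining.append(entry)
    # Update the DLQ only after every publish call has returned.
    dlq[:] = remaining
    return replayed, len(remaining)
```

If `publish` raises partway through, the DLQ list is left unchanged, so a second run may republish some messages. That is one reason replay targets should tolerate duplicates.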

Automated vs. Manual DLQ Processing

In some systems, it is possible to automate part of DLQ handling:

Pattern	When Appropriate
Auto-replay after delay	Infrastructure outage: wait for recovery, then auto-replay
Conditional discard	Messages past a TTL are discarded automatically
Separate DLQ consumer	A dedicated service inspects and routes messages based on error type
Human-in-the-loop	Default for business-critical queues where data integrity matters

For order processing, payments, and inventory, always require human review before replay. For telemetry pipelines and non-critical notifications, automated replay policies are appropriate.
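A dedicated DLQ consumer could apply these patterns with a small decision function. The queue names, error prefixes, and TTL below are made-up values for illustration; the only rule taken from the text is that business-critical queues always go to human review.

```python
from enum import Enum


class Action(Enum):
    AUTO_REPLAY = "auto_replay"
    DISCARD = "discard"
    HUMAN_REVIEW = "human_review"


# Hypothetical configuration; adjust to the system's own queues and errors.
CRITICAL_QUEUES = {"order.placed", "payment.requested", "inventory.hold"}
INFRA_ERROR_PREFIXES = ("TimeoutError", "ConnectionError")


def decide(entry: dict, age_s: float, ttl_s: float = 3600.0) -> Action:
    """Choose how a DLQ entry should be handled."""
    if entry["source_queue"] in CRITICAL_QUEUES:
        return Action.HUMAN_REVIEW      # data integrity matters: never automate
    if age_s > ttl_s:
        return Action.DISCARD           # past its TTL, no longer useful
    if entry["failure_reason"].startswith(INFRA_ERROR_PREFIXES):
        return Action.AUTO_REPLAY       # likely an outage; retry after recovery
    return Action.HUMAN_REVIEW          # unknown cause: default to a human
```

Checking the critical-queue rule first means no later rule can accidentally automate replay for those queues.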

Worked Example: Order Fulfillment

An order fulfillment service consumes from an order.placed queue. Each message contains an order with a list of product_id values that the consumer validates against the product catalog.

The failure scenario:

A message arrives with product_id: "PRD-99999", which does not exist in the product catalog. The consumer throws a ProductNotFoundException.

Attempt 1: ProductNotFoundException — NACK
Attempt 2: ProductNotFoundException — NACK (backoff: 2s)
Attempt 3: ProductNotFoundException — NACK (backoff: 4s)
maxReceiveCount = 3 → route to order.placed.dlq

The DLQ message is enriched:

```json
{
  "original_payload": {
    "order_id": "ORD-88812",
    "customer_id": "CUST-441",
    "items": [
      { "product_id": "PRD-99999", "quantity": 2 }
    ]
  },
  "failure_reason": "ProductNotFoundException: PRD-99999 not found in catalog",
  "attempt_count": 3,
  "source_queue": "order.placed",
  "last_failure_time": "2024-03-15T14:23:01Z"
}
```

An alert fires. The on-call engineer inspects the DLQ message, checks the product catalog, and discovers PRD-99999 was accidentally deleted during a catalog migration. The product is restored. The engineer replays the message from the DLQ, and the consumer processes it successfully.

Common Mistakes

1. No DLQ Configured

The most dangerous mistake. Without a DLQ, a poison message either blocks the queue indefinitely (ordered queues) or is retried forever, consuming resources and generating noise while never making progress. Always configure a DLQ for queues that carry business-critical messages.

2. DLQ Without Monitoring

A DLQ that no one watches is operationally equivalent to discarding the messages. The DLQ fills up silently, the data loss is discovered later (if ever), and the window for recovery narrows. Every DLQ must have a corresponding alert on depth > 0.

3. Infinite Retries Without a DLQ

Setting maxReceiveCount to a very large number (or no limit) is effectively the same as having no DLQ. A poison message with 1,000 retry attempts can block the queue for hours before reaching the DLQ. Pair a finite retry count with exponential backoff (BEE-12002).

4. No Failure Metadata in DLQ Messages

Routing the raw payload to the DLQ without failure reason, attempt count, or source queue makes investigation much harder. When a human inspects the message hours or days later, they have no context. Always enrich the DLQ message at the point of failure routing.

5. Replaying Without Fixing the Root Cause

Replaying DLQ messages before the underlying bug or data issue is resolved sends them straight back to the DLQ. This is a common mistake during incident response under time pressure. Verify the fix is in place before initiating replay.

Related BEPs

BEE-10003 — Delivery guarantees: at-most-once, at-least-once, and exactly-once semantics
BEE-12002 — Retry strategies: exponential backoff, jitter, and retry budgets
BEE-14002 — Structured logging: log formats for DLQ alerts and incident correlation
