kafka-dlq-review
Review dead letter queue implementations for completeness using the Lenses MCP server. Checks DLQ topic existence, configuration, monitoring, metadata preservation, retry logic, reprocessing paths and connector DLQ alignment. Use when user says "review dead letter queues", "check
What it does
Kafka Dead Letter Queue Review
Reviews dead letter queue implementations for completeness and best practices. DLQs are a cross-cutting concern that every Kafka consumer eventually needs. Most teams implement them as an afterthought and forget to monitor them.
Target environment: $ARGUMENTS
Workflow
Copy this checklist and track your progress:
DLQ Review Progress:
- [ ] Step 1: Discover DLQ topics
- [ ] Step 2: Check DLQ topic configuration
- [ ] Step 3: Verify DLQ monitoring
- [ ] Step 4: Sample DLQ messages
- [ ] Step 5: Audit connector DLQ configs
- [ ] Step 6: Scan codebase for error handling
- [ ] Step 7: Generate report with maturity assessment
Step 1: Discover DLQ Topics
Use the Lenses MCP list_topics tool and filter for DLQ topics by naming convention:
- *.dlq
- *.dead-letter
- *.error
- *.errors
- *.retry
- *-dlq
- *-dead-letter
Also check connector configurations for DLQ topic names using list_kafka_connectors.
Expected output: List of DLQ topics discovered by naming convention and connector config.
Validation: If no DLQ topics are found, search the codebase for error topic configuration before concluding that DLQs are not implemented.
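The naming-convention filter can be sketched in a few lines of Python. This is a minimal illustration, assuming the list_topics result has already been reduced to a plain list of topic names; the function name find_dlq_topics and the case-insensitive matching are assumptions, not part of the Lenses MCP server.

```python
from fnmatch import fnmatch

# Glob patterns taken from the naming conventions listed in Step 1.
DLQ_PATTERNS = [
    "*.dlq", "*.dead-letter", "*.error", "*.errors", "*.retry",
    "*-dlq", "*-dead-letter",
]


def find_dlq_topics(topic_names):
    """Return the topic names that match any DLQ naming pattern.

    Names are lowercased before matching (an assumption) so that
    "Orders.DLQ" is treated the same as "orders.dlq".
    """
    return [
        name
        for name in topic_names
        if any(fnmatch(name.lower(), pattern) for pattern in DLQ_PATTERNS)
    ]


topics = ["orders", "orders.dlq", "payments-dead-letter", "audit.errors"]
print(find_dlq_topics(topics))
```

Topics named by connector configuration rather than by convention will not match these patterns, so the connector check above still applies.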
Step 2: Check DLQ Topic Configuration
For each discovered DLQ topic, use get_topic to verify:
- Warning: Retention too short (DLQ messages should be kept longer than source topics for investigation)
- Warning: Only 1 partition (may bottleneck if DLQ volume spikes)
- Warning: Replication factor < source topic (DLQ data is often more critical to preserve)
- Suggestion: Consider log compaction for keyed DLQ topics where a reprocessed record should overwrite the earlier error for the same key
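The configuration warnings above can be expressed as a small comparison between a DLQ topic and its source topic. The sketch below assumes the get_topic results have been flattened into dicts with the hypothetical keys retention_ms, partitions and replication_factor; the real response shape may differ. A retention of -1 is treated as unlimited, as in Kafka's retention.ms setting.

```python
def check_dlq_topic_config(dlq, source):
    """Compare a DLQ topic's settings with its source topic's settings.

    Both arguments are dicts with hypothetical keys:
    retention_ms, partitions, replication_factor.
    Returns a list of (severity, message) tuples.
    """
    findings = []

    # -1 means unlimited retention; model it as infinity for comparison.
    def effective(ms):
        return float("inf") if ms == -1 else ms

    dlq_retention = effective(dlq["retention_ms"])
    source_retention = effective(source["retention_ms"])
    if dlq_retention != float("inf") and dlq_retention <= source_retention:
        findings.append(("warning", "DLQ retention is not longer than the source topic's"))

    if dlq["partitions"] == 1:
        findings.append(("warning", "DLQ has only 1 partition"))

    if dlq["replication_factor"] < source["replication_factor"]:
        findings.append(("warning", "DLQ replication factor is lower than the source topic's"))

    return findings


source = {"retention_ms": 604_800_000, "partitions": 6, "replication_factor": 3}
dlq = {"retention_ms": 86_400_000, "partitions": 1, "replication_factor": 2}
for severity, message in check_dlq_topic_config(dlq, source):
    print(f"{severity}: {message}")
```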
Step 3: Verify DLQ Monitoring
Use list_consumer_groups_by_topic for each DLQ topic to check:
- Critical: DLQ topic has no consumer groups (nobody is monitoring failures)
- Warning: DLQ consumer groups are inactive/empty (monitoring may have stopped)
- Suggestion: DLQ consumer groups should include alerting or dashboarding services
Check DLQ message rates using get_dataset_message_metrics:
- Warning: Sustained high DLQ message rate (indicates upstream problem)
- Warning: DLQ message rate increasing over time (degrading system health)
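As a rough illustration of these monitoring checks, the following sketch classifies a DLQ topic from its consumer groups and tests whether its message rate is trending upward. The input shapes (dicts with name and active keys, a list of hourly rate samples) and the 1.5x threshold are assumptions made for the example; they are not the output format of list_consumer_groups_by_topic or get_dataset_message_metrics.

```python
from statistics import mean


def check_dlq_monitoring(consumer_groups):
    """Classify monitoring for one DLQ topic.

    consumer_groups: list of dicts with hypothetical keys "name" and "active".
    """
    if not consumer_groups:
        return ("critical", "no consumer groups: nobody is monitoring failures")
    if not any(group["active"] for group in consumer_groups):
        return ("warning", "all consumer groups are inactive: monitoring may have stopped")
    return ("ok", "at least one active consumer group")


def rate_is_increasing(rates, factor=1.5):
    """Return True if the newer half of the samples averages more than
    `factor` times the older half. `rates` is ordered oldest to newest.
    Fewer than four samples is treated as too little data to judge.
    """
    if len(rates) < 4:
        return False
    half = len(rates) // 2
    older, newer = mean(rates[:half]), mean(rates[half:])
    return newer > older * factor


print(check_dlq_monitoring([]))
print(check_dlq_monitoring([{"name": "dlq-alerts", "active": False}]))
print(rate_is_increasing([10, 12, 11, 40, 45, 50]))
```

Comparing halves of the window is deliberately crude; a real alerting rule would likely use whatever trend detection the monitoring stack already provides.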
Step 4: Sample DLQ Messages
Use the Lenses MCP execute_sql tool to sample messages from DLQ topics:
SELECT * FROM `{dlq-topic}` LIMIT 10
Check that DLQ messages include complete metadata:
Required Metadata
- Critical: Original topic name (where the message came from)
- Critical: Error message (what went wrong)
- Warning: Original partition and offset (for tracing)
- Warning: Original timestamp (when it was produced)
- Suggestion: Stack trace (for debugging)
- Suggestion: Retry count (how many times it was attempted)
- Suggestion: Consumer group ID (which consumer failed)
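The metadata checklist can be applied mechanically to the sampled rows. The sketch below assumes each sampled message has been converted to a dict of metadata fields; the field names (original_topic, error_message and so on) are hypothetical, since DLQ producers name their headers differently, and would need to be mapped to whatever the sampled messages actually carry.

```python
# Severity per metadata field, following the checklist in Step 4.
# The field names are hypothetical placeholders.
REQUIRED_METADATA = {
    "original_topic": "critical",
    "error_message": "critical",
    "original_partition": "warning",
    "original_offset": "warning",
    "original_timestamp": "warning",
    "stack_trace": "suggestion",
    "retry_count": "suggestion",
    "consumer_group": "suggestion",
}


def missing_metadata(sampled_messages):
    """Return {field: severity} for every field that is absent or empty
    in at least one sampled message.

    An empty sample yields no findings, so check the sample size separately.
    """
    missing = {}
    for field, severity in REQUIRED_METADATA.items():
        for message in sampled_messages:
            # 0 is a valid partition or offset, so test for None and ""
            # explicitly instead of relying on truthiness.
            if message.get(field) in (None, ""):
                missing[field] = severity
                break
    return missing


sample = [{
    "original_topic": "orders",
    "error_message": "invalid JSON",
    "original_partition": 0,
    "original_offset": 42,
}]
for field, severity in missing_metadata(sample).items():
    print(f"{severity}: missing {field}")
```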
Step 5: Audit Connector DLQ Configs
Use list_kafka_connectors and get_kafka_connector_target_definition to check:
- Critical: Connectors with errors.tolerance=all but no errors.deadletterqueue.topic.name (silently drops messages)
- Warning: Connectors with errors.tolerance=none and no DLQ (stops on any error, no recovery path)
- Suggestion: Enable errors.deadletterqueue.context.headers.enable=true for richer error context
- Suggestion: Enable errors.log.enable=true for error logging
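The connector rules above translate directly into a check over a connector's configuration map. This is a sketch that assumes the configuration is available as a flat dict of string properties; how that dict is extracted from get_kafka_connector_target_definition is not shown, and the function name audit_connector is made up for the example. Missing properties fall back to Kafka Connect's documented defaults (errors.tolerance=none, no DLQ topic, both enable flags false).

```python
def audit_connector(config):
    """Return (severity, message) findings for one connector's error handling.

    config: flat dict of Kafka Connect properties (string keys and values).
    """
    tolerance = str(config.get("errors.tolerance", "none")).lower()
    dlq_topic = str(config.get("errors.deadletterqueue.topic.name", "")).strip()
    findings = []

    if not dlq_topic:
        if tolerance == "all":
            findings.append(("critical", "errors.tolerance=all without a DLQ topic: failed records are silently dropped"))
        else:
            findings.append(("warning", "errors.tolerance=none and no DLQ: the task stops on any error with no recovery path"))

    def enabled(key):
        return str(config.get(key, "false")).lower() == "true"

    if not enabled("errors.deadletterqueue.context.headers.enable"):
        findings.append(("suggestion", "enable errors.deadletterqueue.context.headers.enable for richer error context"))
    if not enabled("errors.log.enable"):
        findings.append(("suggestion", "enable errors.log.enable for error logging"))

    return findings


for severity, message in audit_connector({"errors.tolerance": "all"}):
    print(f"{severity}: {message}")
```

Kafka Connect's dead letter queue settings apply to sink connectors, so source connectors would need to be filtered out before running a check like this.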
Step 6: Codebase Inspection
Search the codebase for error handling and DLQ patterns:
- DLQ producer setup (search for dlq, dead.letter, error.topic)
- Retry logic before DLQ routing (search for retry, backoff, max_retries)
- Error handling in consumers (search for exception handling around poll, consume, process)
- Reprocessing mechanisms (scripts or tools to replay DLQ messages)
Flag:
- Warning: Consumers with no error handling around message processing
- Warning: Immediate DLQ routing without retry attempts
- Suggestion: No reprocessing mechanism for DLQ messages (use resend_message via Lenses or custom tooling)
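A simple way to run the searches above is a regex scan over source files. The sketch below is illustrative only: the file suffixes, the function names, and the heuristic "a file mentions a DLQ but never mentions retries" are all assumptions, and a keyword match is a prompt for manual review rather than proof of a problem.

```python
import re
from pathlib import Path

# Search terms from Step 6. The "." in dead.letter and error.topic matches
# any character, so dead-letter, dead_letter and dead.letter all hit.
PATTERNS = {
    "dlq_producer": re.compile(r"dlq|dead.letter|error.topic", re.IGNORECASE),
    "retry_logic": re.compile(r"retry|backoff|max_retries", re.IGNORECASE),
}


def scan_codebase(root, suffixes=(".py", ".java", ".ts", ".go")):
    """Return {pattern_name: [(file_path, line_number), ...]} for files under root."""
    hits = {name: [] for name in PATTERNS}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for line_no, line in enumerate(text.splitlines(), start=1):
            for name, pattern in PATTERNS.items():
                if pattern.search(line):
                    hits[name].append((str(path), line_no))
    return hits


def routes_to_dlq_without_retry(hits):
    """Files that mention a DLQ but contain no retry-related term."""
    dlq_files = {path for path, _ in hits["dlq_producer"]}
    retry_files = {path for path, _ in hits["retry_logic"]}
    return sorted(dlq_files - retry_files)


if __name__ == "__main__":
    results = scan_codebase(".")
    for path in routes_to_dlq_without_retry(results):
        print(f"warning: {path} routes to a DLQ with no visible retry logic")
```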
Success Criteria
Quantitative
- Triggers on 90% of DLQ-related queries (test with 10-20 varied phrasings)
- Completes review in under 15 tool calls (MCP + codebase search)
- Discovers 100% of DLQ topics matching standard naming patterns
Qualitative
- Unmonitored DLQ topics are always flagged as critical
- Metadata completeness is assessed per DLQ topic with a clear checklist
- Maturity assessment (none/basic/complete) is consistently applied
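One way to keep the maturity assessment consistent is to derive it from two observations per consumer or connector: whether a DLQ is configured and whether it is monitored. The sketch below follows the mapping used in the report template's Maturity Assessment table; rolling the overall level up as the weakest individual level is an assumption, not something this document specifies.

```python
MATURITY_ORDER = ["none", "basic", "complete"]


def maturity(has_dlq, is_monitored):
    """Map the two observations to a maturity level.

    No DLQ -> "none"; DLQ without monitoring -> "basic";
    DLQ with monitoring -> "complete".
    """
    if not has_dlq:
        return "none"
    return "complete" if is_monitored else "basic"


def overall_maturity(levels):
    """Overall level, taken here as the weakest individual level
    (an assumed roll-up rule). An empty input counts as "none".
    """
    if not levels:
        return "none"
    return min(levels, key=MATURITY_ORDER.index)


levels = [maturity(True, True), maturity(True, False), maturity(False, False)]
print(levels, "->", overall_maturity(levels))
```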
Examples
Example 1: Full DLQ audit
User says: "Audit all dead letter queues in production"
Actions:
- Discover all DLQ topics by naming convention
- Check each DLQ topic's configuration and monitoring
- Sample messages for metadata completeness
- Audit connector DLQ configs
- Scan codebase for error handling patterns
Result: Comprehensive DLQ maturity report across all consumers and connectors
Example 2: Investigating message failures
User says: "Messages are ending up in the DLQ, what's going wrong?"
Actions:
- Sample recent DLQ messages with execute_sql
- Check error messages and stack traces
- Trace back to the source topic and consumer group
- Check DLQ message rate trends
Result: Root cause analysis of DLQ message flow
Example 3: DLQ coverage check
User says: "Do all our consumers have DLQ handling?"
Actions:
- List all consumer groups from Lenses
- Check which groups have corresponding DLQ topics
- Scan codebase for error handling in consumers
- Report consumers missing DLQ implementation
Result: Coverage report showing DLQ maturity per consumer
Troubleshooting
No DLQ topics found
Cause: DLQ topics may use non-standard naming or DLQs may not be implemented yet.
Solution: Search the codebase for error topic configuration. Also check connector configs for DLQ topic names that may not follow common patterns.
DLQ messages have no metadata
Cause: The DLQ producer does not include headers or structured error information.
Solution: Report this as a critical finding. DLQ messages without metadata are very difficult to investigate and reprocess.
execute_sql returns binary/unreadable data
Cause: DLQ messages may use a different serialisation format than expected.
Solution: Try specifying the format in the query. Report the serialisation format issue and recommend consistent format usage.
Output Format
## DLQ Review Report
### Environment: {name}
### DLQ Topics Discovered
| DLQ Topic | Source Topic | Retention | Partitions | Has Consumers | Message Rate |
|-----------|-------------|-----------|------------|--------------|-------------|
| topic.dlq | topic | 7d | 3 | Yes | 12/hr |
### Critical (must fix)
- [topic/connector/file] Description of the issue
Impact: {what could go wrong}
Remediation: {how to fix}
### Warning (should fix)
- [topic/connector/file] Description of the issue
Remediation: {how to fix}
### Suggestion (consider improving)
- [topic/connector/file] Description of the suggestion
Recommendation: {how to improve}
### DLQ Metadata Completeness
| DLQ Topic | Original Topic | Error Message | Partition/Offset | Timestamp | Stack Trace | Retry Count |
|-----------|---------------|--------------|-----------------|-----------|-------------|-------------|
| topic.dlq | ✅ | ✅ | ⚠️ | ⚠️ | ❌ | ❌ |
### Maturity Assessment
| Consumer/Connector | DLQ Status | Maturity |
|-------------------|-----------|---------|
| consumer-name | Configured with monitoring | Complete |
| connector-name | Configured without monitoring | Basic |
| other-consumer | No DLQ | None |
### Summary
- X DLQ topics found
- Y consumers/connectors without DLQ
- Z unmonitored DLQ topics
- Overall DLQ maturity: none/basic/complete