Skillquality 0.46

kafka-consumer-lag

Analyse Kafka consumer group lag using the Lenses MCP server. Diagnoses lag causes (throughput bottlenecks, rebalancing, partition skew, stalled consumers) and suggests remediation. Use when user says "check consumer lag", "why are consumers slow", "lag report" or asks about cons

Price
free
Protocol
skill
Verified
no

What it does

Kafka Consumer Lag Analysis

Analyses consumer group lag across all groups and diagnoses potential causes. Consumer lag is the most commonly monitored Kafka metric and the first thing engineers check during incidents.

Target environment: $ARGUMENTS

Workflow

Copy this checklist and track your progress:

Lag Analysis Progress:
- [ ] Step 1: Fetch all consumer groups
- [ ] Step 2: Identify problematic groups
- [ ] Step 3: Diagnose root causes
- [ ] Step 4: Generate report
  1. Fetch all consumer groups with state and lag info
  2. Identify problematic groups (high lag, idle, rebalancing)
  3. Diagnose root causes using partition and throughput data
  4. Report findings with per-group remediation steps

Step 1: Fetch Consumer Groups

Use the Lenses MCP list_consumer_groups tool to get all consumer groups with:

  • Group state (Stable, Rebalancing, Empty, Dead)
  • Lag per topic-partition
  • Active member count
  • Coordinator info

For topic-specific analysis, use list_consumer_groups_by_topic to narrow the scope.

Expected output: List of all consumer groups with state, lag and member count.

Validation: If no consumer groups are returned, report this finding and stop - the cluster may have no active consumers.

Step 2: Identify Problematic Groups

Flag consumer groups in these categories:

Critical

  • Groups in Dead or Empty state with committed offsets (consumer has stopped)
  • Groups with lag growing over time (compare current lag against partition end offsets using get_topic_partitions)

Warning

  • Groups in Rebalancing state (consumption paused)
  • Groups with uneven partition assignment (some consumers have significantly more partitions than others)
  • Groups with no committed offsets (consumer may never have started successfully)

Suggestion

  • Groups with very high member count relative to partition count (idle consumers)
  • Groups consuming from topics with no recent messages (use get_dataset_message_metrics to check producer throughput)

Step 3: Diagnose Root Causes

For each problematic group, determine the likely cause:

Growing Lag + Stable Producer Rate

  • Consumer throughput bottleneck
  • Remediation: Scale consumers, increase max.poll.records, reduce processing time per message, check for synchronous blocking calls

Lag Spikes + Rebalancing

  • Rebalancing is pausing consumption
  • Remediation: Switch to cooperative sticky assignor, increase session.timeout.ms, reduce max.poll.interval.ms risk

Single-Partition Lag (others are fine)

  • Hot partition / key skew
  • Remediation: Review partitioning strategy, check for hot keys, consider custom partitioner

All Partitions Lagging Equally

  • Overall consumer slowness
  • Remediation: Profile consumer processing, check for external dependencies (database, API calls), increase consumer instances

Zero Lag but Empty Group

  • Consumer crashed or was decommissioned
  • Remediation: Check consumer process health, restart if needed, clean up stale groups

Step 4: Verify Data Flow

Use the Lenses MCP execute_sql tool to sample recent messages from lagging topics:

SELECT * FROM {topic} LIMIT 5

This confirms the topic has active producers and messages are flowing.

Success Criteria

Quantitative

  • Triggers on 90% of lag-related queries (test with 10-20 varied phrasings)
  • Completes analysis in under 12 MCP tool calls
  • 0 failed MCP calls per run

Qualitative

  • Each flagged group has a specific diagnosis and remediation
  • Report distinguishes symptom from root cause
  • Consistent results when re-run on the same environment

Examples

Example 1: Incident investigation

User says: "Consumers are falling behind, check the lag"

Actions:

  1. Fetch all consumer groups for the environment
  2. Sort by total lag descending to find the worst offenders
  3. Diagnose root causes for each critical group Result: Prioritised report with per-group diagnosis and remediation

Example 2: Topic-specific lag check

User says: "Check consumer lag for orders.payment.completed"

Actions:

  1. Use list_consumer_groups_by_topic to get groups consuming from that topic
  2. Check partition-level lag distribution
  3. Verify data flow with execute_sql Result: Focused report on a single topic's consumer groups

Example 3: Stale group cleanup

User says: "Find any dead or empty consumer groups"

Actions:

  1. Fetch all consumer groups
  2. Filter for Dead or Empty state with committed offsets
  3. Report stale groups that can be safely cleaned up Result: List of candidate groups for deletion

Troubleshooting

Consumer groups list is empty

Cause: No consumer groups exist in the environment or permissions are restricted. Solution: Verify consumers are running. Check Lenses agent permissions allow listing consumer groups.

Lag numbers seem incorrect

Cause: Consumer group offsets may be stale if the consumer recently stopped. Solution: Compare lag against partition end offsets from get_topic_partitions. Note when offsets were last committed.

execute_sql returns no data

Cause: Topic is empty or has very old messages beyond the query window. Solution: Use get_dataset_message_metrics to check if the topic has recent throughput. An empty topic with lagging consumers indicates a producer issue.

Output Format

## Consumer Lag Report

### Environment: {name}

### Critical (immediate action)
| Consumer Group | Topic | Total Lag | State | Diagnosis | Remediation |
|---------------|-------|-----------|-------|-----------|-------------|
| group-name | topic | 50000 | Stable | Throughput bottleneck | Scale consumers |

### Warning (investigate)
| Consumer Group | Topic | Total Lag | State | Diagnosis | Remediation |
|---------------|-------|-----------|-------|-----------|-------------|

### Suggestion (optimise)
| Consumer Group | Topic | Total Lag | State | Diagnosis | Remediation |
|---------------|-------|-----------|-------|-----------|-------------|

### Summary
- X consumer groups analysed
- Y groups with critical lag
- Z groups with warnings
- Stale/empty groups: N

Capabilities

skillsource-lensesioskill-kafka-consumer-lagtopic-agent-skillstopic-agentic-engineeringtopic-apache-kafkatopic-claude-codetopic-context-engineeringtopic-cursortopic-data-engineeringtopic-devopstopic-kafkatopic-kafka-connecttopic-lensestopic-mcp

Install

Quality

0.46/ 1.00

deterministic score 0.46 from registry signals: · indexed on github topic:agent-skills · 26 github stars · SKILL.md body (6,137 chars)

Provenance

Indexed fromgithub
Enriched2026-05-18 19:05:00Z · deterministic:skill-github:v1 · v1
First seen2026-05-15
Last seen2026-05-18

Agent access