monte-carlo-performance-diagnosis
Diagnoses pipeline performance issues -- slow jobs, expensive queries, latency trends -- using Monte Carlo's cross-platform observability. Uses a tiered investigation approach: discover problems, bridge to affected tables, then drill into root causes. Activates when a user asks about slow pipelines, expensive queries, or performance regressions.
What it does
Monte Carlo Performance Diagnosis Skill
This skill helps diagnose data pipeline performance issues using Monte Carlo's cross-platform observability data. It works across Airflow, dbt, Databricks, and warehouse query engines to find bottlenecks, detect regressions, and identify root causes.
Reference files live next to this skill file. Use the Read tool (not MCP resources) to access them:
- Tiered investigation approach: references/investigation-tiers.md (relative to this file)
- Query analysis patterns: references/query-analysis.md (relative to this file)
When to activate this skill
Activate when the user:
- Asks about slow pipelines, jobs, or queries
- Wants to find expensive or costly queries
- Mentions performance regressions or degradation
- Asks "why is this pipeline slow?" or "what's using the most compute?"
- Wants to compare performance over time or find bottleneck tasks
- Asks about failed or futile query patterns
When NOT to activate this skill
Do not activate when the user is:
- Investigating data quality issues (use the prevent skill)
- Looking at storage costs (use the storage-cost-analysis skill)
- Creating monitors (use the monitoring-advisor skill)
- Just querying data or exploring table contents
Prerequisites
The following MCP tools must be available (connect to Monte Carlo's MCP server):
Discovery tools (Tier 1):
- get_jobs_performance -- find slow/failing jobs across Airflow, dbt, Databricks
- get_top_slow_queries -- find slowest query groups by total runtime
Bridge tool:
- get_tables_for_job -- convert job MCONs to table MCONs
Diagnosis tools (Tier 2):
- get_tasks_performance -- drill into a job's individual tasks
- get_change_timeline -- unified timeline of query changes, volume shifts, Airflow/dbt failures
- get_query_rca -- root cause analysis for failed/futile queries
- get_query_latency_distribution -- latency trend over time
- get_asset_lineage -- trace upstream/downstream impact
Supporting tools:
- get_warehouses -- list available warehouses
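The workflow sketches below assume a minimal helper for invoking these tools. call_tool is hypothetical and stands in for whatever tool-call mechanism your MCP client actually exposes:

```python
# Hypothetical stand-in for an MCP tool invocation. A real implementation would
# forward the name and arguments to Monte Carlo's MCP server via your client session.
def call_tool(name: str, arguments: dict | None = None) -> dict:
    raise NotImplementedError("wire this up to your MCP client")

# Investigation order, using the tool names listed above:
#   Tier 1 (discovery): get_jobs_performance, get_top_slow_queries
#   Bridge:             get_tables_for_job
#   Tier 2 (diagnosis): get_tasks_performance, get_change_timeline, get_query_rca,
#                       get_query_latency_distribution, get_asset_lineage
```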
Workflow
Step 1: Identify the scope
Determine what the user wants to investigate:
- Specific job/pipeline: User mentions a job name or pipeline
- Specific table: User mentions a table that's slow to update
- General discovery: User wants to find what's slow
Call get_warehouses to list available warehouses. Match the user's context to a warehouse.
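For instance, matching the user's wording against the returned warehouse names might look like the sketch below; the list-of-dicts shape with a "name" key is an assumption about the get_warehouses response:

```python
def pick_warehouse(warehouses: list[dict], user_text: str) -> dict | None:
    """Return the single warehouse whose name appears in the user's request.

    Assumes get_warehouses yields a list of records with a "name" key; the
    real response shape may differ.
    """
    text = user_text.lower()
    matches = [w for w in warehouses if w.get("name") and w["name"].lower() in text]
    if len(matches) == 1:
        return matches[0]
    return None  # zero or several matches: ask the user rather than guessing

# e.g. pick_warehouse([{"name": "snowflake-prod"}], "why is snowflake-prod slow?")
```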
Step 2: Tier 1 -- Discovery
If you don't have specific MCONs to investigate, start with discovery:
- Find slow jobs: Call get_jobs_performance with an optional integration_type filter (AIRFLOW, DATABRICKS, DBT) if the user specifies a platform.
  - Results include: job name, average duration, trend (7-day), run count, failure rate
  - Look for: high avgDuration, negative runDurationTrend7d, high failure rates
- Find expensive queries: Call get_top_slow_queries with optional warehouse_id and query_type ("read" for SELECTs, "write" for INSERT/CREATE/MERGE).
  - Results include: query hash, total runtime, average runtime, run count
  - Look for: queries with high total runtime or high individual execution time
Present the top findings to the user before drilling deeper. A typical investigation needs only 3-7 tool calls.
If both discovery tools return no results: Tell the user no performance issues were found in the current time window. Suggest broadening the scope (different warehouse, longer time range, or a different platform filter).
Step 3: Bridge -- Job to Tables
After Tier 1 identifies problematic jobs, convert to table MCONs:
Call get_tables_for_job(job_mcon=..., integration_type=...) using the integration_type from the job performance results.
This gives you the table MCONs needed for Tier 2 investigation.
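A sketch of the bridge call, assuming the Tier 1 job record exposes mcon and integration_type fields and that get_tables_for_job returns a list of table records (call_tool is the hypothetical MCP helper from the Prerequisites sketch):

```python
def call_tool(name: str, arguments: dict) -> dict:
    # Hypothetical MCP helper (see the sketch under Prerequisites).
    raise NotImplementedError

def tables_for_slow_job(job: dict) -> list[str]:
    """Bridge one Tier 1 job result to the table MCONs needed for Tier 2.

    Assumes the job record carries "mcon" and "integration_type" keys and that
    get_tables_for_job returns {"tables": [{"mcon": ...}, ...]}; adjust to the
    actual response shape.
    """
    result = call_tool("get_tables_for_job", {
        "job_mcon": job["mcon"],
        "integration_type": job["integration_type"],  # e.g. "DBT", from the Tier 1 result
    })
    return [t["mcon"] for t in result.get("tables", [])]
```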
Step 4: Tier 2 -- Diagnosis
Now drill into root causes using the MCONs from discovery or the bridge:
- Task bottleneck: Call get_tasks_performance to find which specific task in a job is the bottleneck.
- What changed? Call get_change_timeline -- this is your most powerful tool. It returns, in one call, a unified timeline of:
  - Query text changes (schema modifications, new JOINs, filter changes)
  - Volume shifts (row count spikes/drops)
  - Airflow task failures
  - dbt model failures
  Look for correlations: "query changed on day X, runtime doubled on day X+1."
- Why are queries failing? Call get_query_rca to get root cause analysis:
  - Failed queries: errors, timeouts, permission issues
  - Futile queries: queries that run but produce no useful output
  - Patterns are pre-computed -- the tool groups failures by cause
- Is latency degrading? Call get_query_latency_distribution to see the trend (see the sketch after this list):
  - Compare p50 vs p95 -- if p95 >> p50 (more than 5x), the problem is outlier queries
  - Look for step changes in latency (a sudden increase indicates a regression)
- Trace impact: Call get_asset_lineage with direction="DOWNSTREAM" to see what's affected by a slow table, or direction="UPSTREAM" to find what feeds it.
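The latency check above can be expressed directly. This is a sketch that assumes get_query_latency_distribution results can be parsed into a time-ordered list of points with p50 and p95 values in seconds; the real response shape may differ:

```python
def diagnose_latency(points: list[dict]) -> list[str]:
    """Interpret latency-distribution points per the guidance above.

    `points` is assumed to be a time-ordered list of dicts with "p50" and "p95"
    latencies (seconds), oldest first; requires at least one point.
    """
    findings = []
    latest = points[-1]
    # Outlier check: a heavy tail shows up as p95 far above p50.
    if latest["p50"] > 0 and latest["p95"] / latest["p50"] > 5:
        findings.append("p95 is more than 5x p50: outlier queries dominate the tail.")
    # Step-change check: compare the latest p50 to the median of earlier points.
    earlier = sorted(p["p50"] for p in points[:-1])
    if earlier:
        median_p50 = earlier[len(earlier) // 2]
        if median_p50 > 0 and latest["p50"] / median_p50 > 2:
            findings.append("p50 roughly doubled vs. the earlier median: likely a regression.")
    return findings

# Example:
# diagnose_latency([{"p50": 10, "p95": 20}, {"p50": 11, "p95": 22}, {"p50": 24, "p95": 130}])
```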
Step 5: Present findings
Structure your response as:
- Problem summary: What's slow and by how much (with exact numbers from tools)
- Root cause: What changed or what's causing the issue
- Impact: What downstream systems are affected
- Recommendations: Specific actions to fix the issue
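A findings write-up following this structure might be skeletoned like the template below; values in angle brackets are placeholders to be filled with exact numbers quoted from the tools:

```
Problem summary: <job/query name> is averaging <avg duration> over <run count> runs,
up from <baseline> (7-day trend: <trend value>).
Root cause: <what changed, e.g. a query text change on <date> that added a JOIN>.
Impact: <N> downstream tables affected, including <table names>.
Recommendations:
- <specific action 1>
- <specific action 2>
```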
Important rules
- Quote tool numbers exactly. If a tool returns "1282 runs, avg 22.5s", say exactly that. Never round, estimate, or fabricate numbers.
- Always compare to baselines. Use 7-day trend data (runDurationTrend7d) to distinguish regressions from normal variance. Flag if trend data has less than 0.1 confidence.
- Stop when you have a root cause. 3-7 tool calls is typical. More than 10 means you're over-investigating.
- Read vs write queries: When the user asks about "reads" or "read queries", filter with query_type="read". When they ask about "writes", use query_type="write". Do NOT mix them.
- Never expose MCONs, UUIDs, or internal identifiers to the user. Use human-readable names.
- Cross-platform: This skill works across Airflow, dbt, and Databricks. Note which platform each finding comes from.
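As one illustration of the baseline and confidence rules, a reporting helper might look like the sketch below. runDurationTrend7d comes from the tool results described above, while trendConfidence is a hypothetical field name standing in for however the tool reports trend confidence:

```python
def describe_trend(job: dict) -> str:
    """Report a job's 7-day trend, quoting the tool's number verbatim.

    runDurationTrend7d is the field named above; trendConfidence is a
    hypothetical field name for the trend's confidence value.
    """
    trend = job.get("runDurationTrend7d")
    if trend is None:
        return "No 7-day trend data available."
    confidence = job.get("trendConfidence", 1.0)
    note = " (trend confidence below 0.1 -- treat as inconclusive)" if confidence < 0.1 else ""
    return f"7-day run-duration trend: {trend}{note}"
```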