{"id":"ba5ba21d-107b-4e80-b625-068848b8c0ce","shortId":"qsafCM","kind":"skill","title":"incident-response","tagline":"Manage active production incidents through detection, triage, mitigation, communication, and resolution with structured roles and decision-making. Use this skill whenever the user has an active incident, a production issue, a service outage, a security incident, or needs to plan ","description":"# Incident Response\n\nManage active production incidents from detection to resolution. Stack-agnostic. Tool-agnostic.\n\nThis skill is for active incidents and incident process. For after-the-fact analysis, use `after-action-report`. For planned launches, use `launch-runbook`.\n\n---\n\n## When to use\n\n- An active incident is happening\n- Building incident response procedures\n- Defining severity levels\n- Setting up on-call rotations\n- Training a team on incident response\n\n## When NOT to use\n\n- Post-incident retrospective (use `after-action-report`)\n- Planned launches (use `launch-runbook`)\n- Pre-launch issue triage (use `qa-testing`)\n\n---\n\n## Required inputs\n\n- Awareness of the incident (alert, customer report, internal observation)\n- Access to production systems and monitoring\n- Roles and authorities clearly defined\n- Communication channels operational\n\n---\n\n## The framework: 5 phases\n\n### 1. Detection\n\nHow the incident becomes known.\n\n**Detection sources:**\n\n- Automated alerts (monitoring, SLO violations, error rate spikes)\n- Customer reports (support tickets, social media, status page subscribers)\n- Internal observation (engineer notices something off)\n- Third-party (security researchers, partners)\n\n**On detection:**\n\n- Acknowledge within target time (typically 5 to 15 minutes for critical)\n- Assess severity (see severity rubric below)\n- Page the on-call if not already paged\n- Open the incident channel\n\n### 2. Triage\n\nEstablish severity and impact.\n\n**Severity rubric:**\n\n| Severity | Definition | Response |\n|---|---|---|\n| SEV-1 (Critical) | Major customer-facing functionality broken. Data integrity at risk. Security breach. | All-hands. Incident commander. Active war room. Public communication required. |\n| SEV-2 (Major) | Significant degradation. Some customers affected. Revenue impact. | Incident commander assigned. Active response. Internal communication. May or may not need public communication. |\n| SEV-3 (Minor) | Limited impact. Workaround available. Affecting a small group of users. | Standard on-call response. Single owner. |\n| SEV-4 (Low) | Cosmetic, edge-case, or low-frequency. No urgent action needed. | Tracked as bug. Addressed in normal queue. |\n\nSeverity can change. Re-evaluate as more info emerges.\n\n### 3. Mitigation\n\nStop the bleeding before fixing the cause.\n\n**Mitigation patterns (faster than full fix):**\n\n- **Rollback** (revert recent deploy)\n- **Feature flag off** (disable the broken feature without deploy)\n- **Failover** (route to healthy replica or region)\n- **Scale up** (more capacity to absorb the load)\n- **Throttle** (reject some traffic to protect the rest)\n- **Graceful degradation** (turn off non-essential features to keep core functional)\n- **Maintenance mode** (last resort, blocks all users)\n\n**Mitigation principle:** Stop user impact first. Cause analysis second.\n\n### 4. Communication\n\nThree audiences during an incident:\n\n**Internal team:**\n- Real-time updates in incident channel\n- Cadence: every 15 minutes minimum during active incident\n- Format: timestamped status updates with what we know, what we're doing, ETA\n\n**Internal stakeholders:**\n- Higher-level updates to broader org\n- Cadence: every 30 to 60 minutes\n- Format: business-impact framing, not technical detail\n\n**External / customers:**\n- Status page updates\n- Cadence: every 30 minutes minimum during active incident\n- Format: plain language, no blame, what users are experiencing, what to expect\n\n**Communication principles:**\n- Acknowledge before you have answers (\"We're aware and investigating\")\n- Update on schedule even if no progress (\"Still investigating, no new information\")\n- Never speculate publicly about cause\n- Confirm resolution explicitly when restored\n\n### 5. Resolution\n\nVerified fix, customers restored, incident closed.\n\n**Resolution criteria:**\n\n- Mitigation in place and verified\n- Root cause identified (or explicitly deferred to AAR)\n- All affected systems back to normal\n- Customers can resume normal use\n- Final status update posted (internal and external)\n- Incident channel can be closed (or archived for AAR)\n\nAfter closure:\n- Schedule AAR within 1 to 2 weeks\n- Capture initial timeline while memories are fresh\n- Track follow-up action items\n\n---\n\n## Roles during an incident\n\n| Role | Responsibility |\n|---|---|\n| Incident commander (IC) | Owns the response. Calls decisions. Assigns work. Not necessarily the most technical person; needs to coordinate. |\n| Communications lead | Owns internal and external messaging. Reduces IC's communication burden. |\n| Operations lead | Drives the technical investigation and mitigation. Often the most senior on-call engineer. |\n| Scribe | Captures the timeline as the incident unfolds. Critical for AAR. |\n| Subject matter experts | Pulled in as needed. Service owners, database experts, security experts. |\n\nFor small teams or low-severity incidents, one person can hold multiple roles. Each role's responsibilities should still be explicit.\n\n---\n\n## Decision-making during an incident\n\n**The IC's authority:**\n\n- Call rollback or other mitigations\n- Pull additional people in\n- Escalate severity\n- Make the call when unclear options exist\n\n**Non-decisions to avoid:**\n\n- \"Let's wait and see\" when mitigations are available and impact is occurring\n- Discussing root cause while users are actively impacted (mitigate first)\n- Premature resolution announcements before verification\n- Death-by-committee (pull in lots of people, no one decides)\n\nWhen in doubt: act. A wrong action that can be rolled back beats inaction while users suffer.\n\n---\n\n## Status page communication patterns\n\n**Initial:**\n> \"We are investigating reports of [issue]. Updates to follow.\"\n\n**Identified:**\n> \"We have identified the issue affecting [scope]. Engineers are working on a fix. Next update by [time].\"\n\n**Monitoring:**\n> \"A fix has been applied. We are monitoring to confirm resolution. Next update by [time].\"\n\n**Resolved:**\n> \"This incident has been resolved. Service has been restored. A full incident report will be posted within [timeframe].\"\n\nPatterns to avoid:\n\n- Vague language (\"experiencing some issues\" - what kind?)\n- Missing affected scope (\"login is down\" - everywhere or just one region?)\n- Missing time commitments\n- \"Should be resolved soon\" without verification\n- Using \"back up\" before verification\n\n---\n\n## Workflow\n\n1. **Acknowledge.** First responder acknowledges within target time.\n2. **Assess severity.** Use the rubric. Open the appropriate response channel.\n3. **Assign roles.** IC, comms, ops at minimum.\n4. **Communicate.** Initial status update. Internal channel active.\n5. **Investigate.** Logs, metrics, recent changes. The four most common causes: a recent deploy, a configuration change, a third-party dependency change, a load spike.\n6. **Mitigate.** Stop the bleeding. Don't wait for full root cause.\n7. **Verify mitigation.** Don't trust dashboards alone; test the user flow.\n8. **Communicate resolution.** Internal and external.\n9. **Close incident.** Final timeline noted. Action items tracked.\n10. **Schedule AAR.** Within 1 to 2 weeks.\n\n---\n\n## Failure patterns\n\n- **No clear IC.** Multiple people debugging in parallel, no coordination. Slower to mitigate, easier to make conflicting changes.\n- **Skipping mitigation, going straight to root cause.** Users keep suffering while engineers debug.\n- **Premature \"all clear.\"** Announcing resolution before verification.\n- **Communication silence.** Users don't know if anyone is working on it.\n- **Status updates too vague.** \"We're working on it\" with no detail.\n- **Speculating publicly about cause.** Often wrong, always damaging trust.\n- **Pulling in too many people.** Coordination overhead exceeds value.\n- **No scribe.** The timeline gets lost. AAR has to reconstruct from chat logs.\n- **Skipping AAR for \"minor\" incidents.** Patterns get missed. Lessons get re-learned.\n- **Blame culture.** People hide mistakes, incidents take longer.\n\n---\n\n## Output format\n\nDuring an active incident: incident channel updates and status page updates as per the framework above.\n\nAfter incident close: a brief incident summary feeding into the AAR.\n\n```markdown\n# Incident: [Brief title]\n\n**Date:** [YYYY-MM-DD]\n**Severity:** [SEV-1 / 2 / 3 / 4]\n**Duration:** [Detection to resolution]\n**Customer impact:** [Who, how many, how]\n\n## Summary\n[1 to 2 paragraphs]\n\n## Timeline\n[Timestamped events]\n\n## Mitigation\n[What was done]\n\n## Action items\n[Follow-ups, with owners]\n\n## AAR scheduled for\n[Date]\n```\n\n---\n\n## Reference files\n\n- [`references/incident-playbook.md`](references/incident-playbook.md) - Severity definitions, roles, status page templates, decision rubrics.","tags":["incident","response","claude","skills","rampstackco","agent-skills","anthropic","awesome-claude-code","awesome-claude-prompts","awesome-claude-skills","claude-code","claude-skills"],"capabilities":["skill","source-rampstackco","skill-incident-response","topic-agent-skills","topic-anthropic","topic-awesome-claude-code","topic-awesome-claude-prompts","topic-awesome-claude-skills","topic-claude","topic-claude-code","topic-claude-skills","topic-good-first-issue","topic-mcp","topic-product-management","topic-seo"],"categories":["claude-skills"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/rampstackco/claude-skills/incident-response","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add rampstackco/claude-skills","source_repo":"https://github.com/rampstackco/claude-skills","install_from":"skills.sh"}},"qualityScore":"0.540","qualityRationale":"deterministic score 0.54 from registry signals: · indexed on github topic:agent-skills · 181 github stars · SKILL.md body (8,647 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-18T18:55:16.759Z","embedding":null,"createdAt":"2026-04-30T01:01:28.705Z","updatedAt":"2026-05-18T18:55:16.759Z","lastSeenAt":"2026-05-18T18:55:16.759Z","tsv":"'-1':254,1181 '-2':280 '-3':304 '-4':324 '1':172,608,917,1021,1196 '10':1017 '15':219,452 '2':242,610,925,1023,1182,1198 '3':355,936,1183 '30':482,501 '4':434,944,1184 '5':170,217,553,952 '6':978 '60':484 '7':990 '8':1002 '9':1008 'aar':575,602,606,688,1019,1113,1121,1169,1214 'absorb':395 'access':154 'acknowledg':212,521,918,921 'act':800 'action':79,126,336,623,803,1014,1207 'activ':5,30,48,65,92,273,292,456,505,776,951,1145 'addit':740 'address':341 'affect':286,310,577,834,892 'after-action-report':77,124 'after-the-fact':71 'agnost':57,60 'alert':149,182 'all-hand':268 'alon':997 'alreadi':236 'alway':1095 'analysi':75,432 'announc':782,1061 'answer':525 'anyon':1072 'appli':851 'appropri':933 'archiv':600 'assess':223,926 'assign':291,639,937 'audienc':437 'author':162,733 'autom':181 'avail':309,765 'avoid':756,883 'awar':145,528 'back':579,808,912 'beat':809 'becom':177 'blame':511,1133 'bleed':359,982 'block':422 'breach':267 'brief':1163,1172 'broader':478 'broken':261,379 'bug':340 'build':96 'burden':661 'busi':488 'business-impact':487 'cadenc':450,480,499 'call':107,233,319,637,676,734,747 'capac':393 'captur':612,679 'case':329 'caus':363,431,547,569,772,962,989,1051,1092 'chang':347,957,968,974,1044 'channel':166,241,449,595,935,950,1148 'chat':1118 'clear':163,1028,1060 'close':560,598,1009,1161 'closur':604 'comm':940 'command':272,290,632 'commit':904 'committe':788 'common':961 'communic':12,165,277,295,302,435,519,650,660,816,945,1003,1065 'configur':967 'confirm':548,856 'conflict':1043 'coordin':649,1036,1103 'core':416 'cosmet':326 'criteria':562 'critic':222,255,686 'cultur':1134 'custom':150,189,258,285,495,557,582,1189 'customer-fac':257 'damag':1096 'dashboard':996 'data':262 'databas':698 'date':1174,1217 'dd':1178 'death':786 'death-by-committe':785 'debug':1032,1057 'decid':796 'decis':20,638,725,754,1228 'decision-mak':19,724 'defer':573 'defin':100,164 'definit':251,1223 'degrad':283,407 'depend':973 'deploy':373,382,965 'detail':493,1088 'detect':9,52,173,179,211,1186 'disabl':377 'discuss':770 'done':1206 'doubt':799 'drive':664 'durat':1185 'easier':1040 'edg':328 'edge-cas':327 'emerg':354 'engin':200,677,836,1056 'error':186 'escal':743 'essenti':412 'establish':244 'eta':470 'evalu':350 'even':534 'event':1202 'everi':451,481,500 'everywher':897 'exceed':1105 'exist':751 'expect':518 'experienc':515,886 'expert':691,699,701 'explicit':550,572,723 'extern':494,593,655,1007 'face':259 'fact':74 'failov':383 'failur':1025 'faster':366 'featur':374,380,413 'feed':1166 'file':1219 'final':587,1011 'first':430,779,919 'fix':361,369,556,841,848 'flag':375 'flow':1001 'follow':621,827,1210 'follow-up':620,1209 'format':458,486,507,1142 'four':959 'frame':490 'framework':169,1157 'frequenc':333 'fresh':618 'full':368,873,987 'function':260,417 'get':1111,1126,1129 'go':1047 'grace':406 'group':313 'hand':270 'happen':95 'healthi':386 'hide':1136 'higher':474 'higher-level':473 'hold':713 'ic':633,658,731,939,1029 'identifi':570,828,831 'impact':247,288,307,429,489,767,777,1190 'inact':810 'incid':2,7,31,40,45,50,66,68,93,97,113,121,148,176,240,271,289,440,448,457,506,559,594,628,631,684,709,729,864,874,1010,1124,1138,1146,1147,1160,1164,1171 'incident-respons':1 'info':353 'inform':542 'initi':613,818,946 'input':144 'integr':263 'intern':152,198,294,441,471,591,653,949,1005 'investig':530,539,667,821,953 'issu':34,137,824,833,888 'item':624,1015,1208 'keep':415,1053 'kind':890 'know':465,1070 'known':178 'languag':509,885 'last':420 'launch':83,86,129,132,136 'launch-runbook':85,131 'lead':651,663 'learn':1132 'lesson':1128 'let':757 'level':102,475 'limit':306 'load':397,976 'log':954,1119 'login':894 'longer':1140 'lost':1112 'lot':791 'low':325,332,707 'low-frequ':331 'low-sever':706 'mainten':418 'major':256,281 'make':21,726,745,1042 'manag':4,47 'mani':1101,1193 'markdown':1170 'matter':690 'may':296,298 'media':194 'memori':616 'messag':656 'metric':955 'minimum':454,503,943 'minor':305,1123 'minut':220,453,485,502 'miss':891,902,1127 'mistak':1137 'mitig':11,356,364,425,563,669,738,763,778,979,992,1039,1046,1203 'mm':1177 'mode':419 'monitor':159,183,846,854 'multipl':714,1030 'necessarili':642 'need':42,300,337,647,695 'never':543 'new':541 'next':842,858 'non':411,753 'non-decis':752 'non-essenti':410 'normal':343,581,585 'note':1013 'notic':201 'observ':153,199 'occur':769 'often':670,1093 'on-cal':105,231,317,674 'one':710,795,900 'op':941 'open':238,931 'oper':167,662 'option':750 'org':479 'outag':37 'output':1141 'overhead':1104 'own':634,652 'owner':322,697,1213 'page':196,229,237,497,815,1152,1226 'paragraph':1199 'parallel':1034 'parti':206,972 'partner':209 'pattern':365,817,881,1026,1125 'peopl':741,793,1031,1102,1135 'per':1155 'person':646,711 'phase':171 'place':565 'plain':508 'plan':44,82,128 'post':120,590,878 'post-incid':119 'pre':135 'pre-launch':134 'prematur':780,1058 'principl':426,520 'procedur':99 'process':69 'product':6,33,49,156 'progress':537 'protect':403 'public':276,301,545,1090 'pull':692,739,789,1098 'qa':141 'qa-test':140 'queue':344 'rate':187 're':349,468,527,1082,1131 're-evalu':348 're-learn':1130 'real':444 'real-tim':443 'recent':372,956,964 'reconstruct':1116 'reduc':657 'refer':1218 'references/incident-playbook.md':1220,1221 'region':389,901 'reject':399 'replica':387 'report':80,127,151,190,822,875 'requir':143,278 'research':208 'resolut':14,54,549,554,561,781,857,1004,1062,1188 'resolv':862,867,907 'resort':421 'respond':920 'respons':3,46,98,114,252,293,320,630,636,719,934 'rest':405 'restor':552,558,871 'resum':584 'retrospect':122 'revenu':287 'revert':371 'risk':265 'role':17,160,625,629,715,717,938,1224 'roll':807 'rollback':370,735 'room':275 'root':568,771,988,1050 'rotat':108 'rout':384 'rubric':227,249,930,1229 'runbook':87,133 'scale':390 'schedul':533,605,1018,1215 'scope':835,893 'scribe':678,1108 'second':433 'secur':39,207,266,700 'see':225,761 'senior':673 'servic':36,696,868 'set':103 'sev':253,279,303,323,1180 'sever':101,224,226,245,248,250,345,708,744,927,1179,1222 'signific':282 'silenc':1066 'singl':321 'skill':24,62 'skill-incident-response' 'skip':1045,1120 'slo':184 'slower':1037 'small':312,703 'social':193 'someth':202 'soon':908 'sourc':180 'source-rampstackco' 'specul':544,1089 'spike':188,977 'stack':56 'stack-agnost':55 'stakehold':472 'standard':316 'status':195,460,496,588,814,947,1077,1151,1225 'still':538,721 'stop':357,427,980 'straight':1048 'structur':16 'subject':689 'subscrib':197 'suffer':813,1054 'summari':1165,1195 'support':191 'system':157,578 'take':1139 'target':214,923 'team':111,442,704 'technic':492,645,666 'templat':1227 'test':142,998 'third':205,971 'third-parti':204,970 'three':436 'throttl':398 'ticket':192 'time':215,445,845,861,903,924 'timefram':880 'timelin':614,681,1012,1110,1200 'timestamp':459,1201 'titl':1173 'tool':59 'tool-agnost':58 'topic-agent-skills' 'topic-anthropic' 'topic-awesome-claude-code' 'topic-awesome-claude-prompts' 'topic-awesome-claude-skills' 'topic-claude' 'topic-claude-code' 'topic-claude-skills' 'topic-good-first-issue' 'topic-mcp' 'topic-product-management' 'topic-seo' 'track':338,619,1016 'traffic':401 'train':109 'triag':10,138,243 'trust':995,1097 'turn':408 'typic':216 'unclear':749 'unfold':685 'up':1211 'updat':446,461,476,498,531,589,825,843,859,948,1078,1149,1153 'urgent':335 'use':22,76,84,90,118,123,130,139,586,911,928 'user':27,315,424,428,513,774,812,1000,1052,1067 'vagu':884,1080 'valu':1106 'verif':784,910,915,1064 'verifi':555,567,991 'violat':185 'wait':759,985 'war':274 'week':611,1024 'whenev':25 'within':213,607,879,922,1020 'without':381,909 'work':640,838,1074,1083 'workaround':308 'workflow':916 'wrong':802,1094 'yyyi':1176 'yyyy-mm-dd':1175","prices":[{"id":"6dd8afb5-108b-461a-a0c6-43eaecdca00d","listingId":"ba5ba21d-107b-4e80-b625-068848b8c0ce","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"rampstackco","category":"claude-skills","install_from":"skills.sh"},"createdAt":"2026-04-30T01:01:28.705Z"}],"sources":[{"listingId":"ba5ba21d-107b-4e80-b625-068848b8c0ce","source":"github","sourceId":"rampstackco/claude-skills/incident-response","sourceUrl":"https://github.com/rampstackco/claude-skills/tree/main/skills/incident-response","isPrimary":false,"firstSeenAt":"2026-04-30T01:01:28.705Z","lastSeenAt":"2026-05-18T18:55:16.759Z"}],"details":{"listingId":"ba5ba21d-107b-4e80-b625-068848b8c0ce","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"rampstackco","slug":"incident-response","github":{"repo":"rampstackco/claude-skills","stars":181,"topics":["agent-skills","anthropic","awesome-claude-code","awesome-claude-prompts","awesome-claude-skills","claude","claude-code","claude-skills","good-first-issue","mcp","product-management","seo","show-hn","showcase","showdev","web-design","web-development"],"license":"mit","html_url":"https://github.com/rampstackco/claude-skills","pushed_at":"2026-05-10T22:40:22Z","description":"Stack-agnostic Claude Skills covering the full website lifecycle: brand, design, content, SEO, dev, ops, growth, and research. Build, ship, audit, optimize.","skill_md_sha":"f48474910e11cc994d1d1825677fa9b5b77cead6","skill_md_path":"skills/incident-response/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/rampstackco/claude-skills/tree/main/skills/incident-response"},"layout":"multi","source":"github","category":"claude-skills","frontmatter":{"name":"incident-response","description":"Manage active production incidents through detection, triage, mitigation, communication, and resolution with structured roles and decision-making. Use this skill whenever the user has an active incident, a production issue, a service outage, a security incident, or needs to plan incident response procedures. Triggers on incident response, production incident, outage, service down, site down, P0, P1, severity, downtime, on-call, incident commander, status page, postmortem prep. Also triggers when something is actively broken in production and the user is figuring out what to do."},"skills_sh_url":"https://skills.sh/rampstackco/claude-skills/incident-response"},"updatedAt":"2026-05-18T18:55:16.759Z"}}