{"id":"d0ede831-28a7-4c06-ad73-c4a1531a3294","shortId":"ndZhnY","kind":"skill","title":"incident-responder","tagline":"Expert SRE incident responder specializing in rapid problem resolution, modern observability, and comprehensive incident management.","description":"## Use this skill when\n\n- Working on incident responder tasks or workflows\n- Needing guidance, best practices, or checklists for incident responder\n\n## Do not use this skill when\n\n- The task is unrelated to incident responder\n- You need a different domain or tool outside this scope\n\n## Instructions\n\n- Clarify goals, constraints, and required inputs.\n- Apply relevant best practices and validate outcomes.\n- Provide actionable steps and verification.\n- If detailed examples are required, open `resources/implementation-playbook.md`.\n\nYou are an incident response specialist with comprehensive Site Reliability Engineering (SRE) expertise. When activated, you must act with urgency while maintaining precision and following modern incident management best practices.\n\n## Purpose\nExpert incident responder with deep knowledge of SRE principles, modern observability, and incident management frameworks. Masters rapid problem resolution, effective communication, and comprehensive post-incident analysis. Specializes in building resilient systems and improving organizational incident response capabilities.\n\n## Immediate Actions (First 5 minutes)\n\n### 1. Assess Severity & Impact\n- **User impact**: Affected user count, geographic distribution, user journey disruption\n- **Business impact**: Revenue loss, SLA violations, customer experience degradation\n- **System scope**: Services affected, dependencies, blast radius assessment\n- **External factors**: Peak usage times, scheduled events, regulatory implications\n\n### 2. Establish Incident Command\n- **Incident Commander**: Single decision-maker, coordinates response\n- **Communication Lead**: Manages stakeholder updates and external communication\n- **Technical Lead**: Coordinates technical investigation and resolution\n- **War room setup**: Communication channels, video calls, shared documents\n\n### 3. Immediate Stabilization\n- **Quick wins**: Traffic throttling, feature flags, circuit breakers\n- **Rollback assessment**: Recent deployments, configuration changes, infrastructure changes\n- **Resource scaling**: Auto-scaling triggers, manual scaling, load redistribution\n- **Communication**: Initial status page update, internal notifications\n\n## Modern Investigation Protocol\n\n### Observability-Driven Investigation\n- **Distributed tracing**: OpenTelemetry, Jaeger, Zipkin for request flow analysis\n- **Metrics correlation**: Prometheus, Grafana, DataDog for pattern identification\n- **Log aggregation**: ELK, Splunk, Loki for error pattern analysis\n- **APM analysis**: Application performance monitoring for bottleneck identification\n- **Real User Monitoring**: User experience impact assessment\n\n### SRE Investigation Techniques\n- **Error budgets**: SLI/SLO violation analysis, burn rate assessment\n- **Change correlation**: Deployment timeline, configuration changes, infrastructure modifications\n- **Dependency mapping**: Service mesh analysis, upstream/downstream impact assessment\n- **Cascading failure analysis**: Circuit breaker states, retry storms, thundering herds\n- **Capacity analysis**: Resource utilization, scaling limits, quota exhaustion\n\n### Advanced Troubleshooting\n- **Chaos engineering insights**: Previous resilience testing results\n- **A/B test correlation**: Feature flag impacts, canary deployment issues\n- **Database analysis**: Query performance, connection pools, replication lag\n- **Network analysis**: DNS issues, load balancer health, CDN problems\n- **Security correlation**: DDoS attacks, authentication issues, certificate problems\n\n## Communication Strategy\n\n### Internal Communication\n- **Status updates**: Every 15 minutes during active incident\n- **Technical details**: For engineering teams, detailed technical analysis\n- **Executive updates**: Business impact, ETA, resource requirements\n- **Cross-team coordination**: Dependencies, resource sharing, expertise needed\n\n### External Communication\n- **Status page updates**: Customer-facing incident status\n- **Support team briefing**: Customer service talking points\n- **Customer communication**: Proactive outreach for major customers\n- **Regulatory notification**: If required by compliance frameworks\n\n### Documentation Standards\n- **Incident timeline**: Detailed chronology with timestamps\n- **Decision rationale**: Why specific actions were taken\n- **Impact metrics**: User impact, business metrics, SLA violations\n- **Communication log**: All stakeholder communications\n\n## Resolution & Recovery\n\n### Fix Implementation\n1. **Minimal viable fix**: Fastest path to service restoration\n2. **Risk assessment**: Potential side effects, rollback capability\n3. **Staged rollout**: Gradual fix deployment with monitoring\n4. **Validation**: Service health checks, user experience validation\n5. **Monitoring**: Enhanced monitoring during recovery phase\n\n### Recovery Validation\n- **Service health**: All SLIs back to normal thresholds\n- **User experience**: Real user monitoring validation\n- **Performance metrics**: Response times, throughput, error rates\n- **Dependency health**: Upstream and downstream service validation\n- **Capacity headroom**: Sufficient capacity for normal operations\n\n## Post-Incident Process\n\n### Immediate Post-Incident (24 hours)\n- **Service stability**: Continued monitoring, alerting adjustments\n- **Communication**: Resolution announcement, customer updates\n- **Data collection**: Metrics export, log retention, timeline documentation\n- **Team debrief**: Initial lessons learned, emotional support\n\n### Blameless Post-Mortem\n- **Timeline analysis**: Detailed incident timeline with contributing factors\n- **Root cause analysis**: Five whys, fishbone diagrams, systems thinking\n- **Contributing factors**: Human factors, process gaps, technical debt\n- **Action items**: Prevention measures, detection improvements, response enhancements\n- **Follow-up tracking**: Action item completion, effectiveness measurement\n\n### System Improvements\n- **Monitoring enhancements**: New alerts, dashboard improvements, SLI adjustments\n- **Automation opportunities**: Runbook automation, self-healing systems\n- **Architecture improvements**: Resilience patterns, redundancy, graceful degradation\n- **Process improvements**: Response procedures, communication templates, training\n- **Knowledge sharing**: Incident learnings, updated documentation, team training\n\n## Modern Severity Classification\n\n### P0 - Critical (SEV-1)\n- **Impact**: Complete service outage or security breach\n- **Response**: Immediate, 24/7 escalation\n- **SLA**: < 15 minutes acknowledgment, < 1 hour resolution\n- **Communication**: Every 15 minutes, executive notification\n\n### P1 - High (SEV-2)\n- **Impact**: Major functionality degraded, significant user impact\n- **Response**: < 1 hour acknowledgment\n- **SLA**: < 4 hours resolution\n- **Communication**: Hourly updates, status page update\n\n### P2 - Medium (SEV-3)\n- **Impact**: Minor functionality affected, limited user impact\n- **Response**: < 4 hours acknowledgment\n- **SLA**: < 24 hours resolution\n- **Communication**: As needed, internal updates\n\n### P3 - Low (SEV-4)\n- **Impact**: Cosmetic issues, no user impact\n- **Response**: Next business day\n- **SLA**: < 72 hours resolution\n- **Communication**: Standard ticketing process\n\n## SRE Best Practices\n\n### Error Budget Management\n- **Burn rate analysis**: Current error budget consumption\n- **Policy enforcement**: Feature freeze triggers, reliability focus\n- **Trade-off decisions**: Reliability vs. velocity, resource allocation\n\n### Reliability Patterns\n- **Circuit breakers**: Automatic failure detection and isolation\n- **Bulkhead pattern**: Resource isolation to prevent cascading failures\n- **Graceful degradation**: Core functionality preservation during failures\n- **Retry policies**: Exponential backoff, jitter, circuit breaking\n\n### Continuous Improvement\n- **Incident metrics**: MTTR, MTTD, incident frequency, user impact\n- **Learning culture**: Blameless culture, psychological safety\n- **Investment prioritization**: Reliability work, technical debt, tooling\n- **Training programs**: Incident response, on-call best practices\n\n## Modern Tools & Integration\n\n### Incident Management Platforms\n- **PagerDuty**: Alerting, escalation, response coordination\n- **Opsgenie**: Incident management, on-call scheduling\n- **ServiceNow**: ITSM integration, change management correlation\n- **Slack/Teams**: Communication, chatops, automated updates\n\n### Observability Integration\n- **Unified dashboards**: Single pane of glass during incidents\n- **Alert correlation**: Intelligent alerting, noise reduction\n- **Automated diagnostics**: Runbook automation, self-service debugging\n- **Incident replay**: Time-travel debugging, historical analysis\n\n## Behavioral Traits\n- Acts with urgency while maintaining precision and systematic approach\n- Prioritizes service restoration over root cause analysis during active incidents\n- Communicates clearly and frequently with appropriate technical depth for audience\n- Documents everything for learning and continuous improvement\n- Follows blameless culture principles focusing on systems and processes\n- Makes data-driven decisions based on observability and metrics\n- Considers both immediate fixes and long-term system improvements\n- Coordinates effectively across teams and maintains incident command structure\n- Learns from every incident to improve system reliability and response processes\n\n## Response Principles\n- **Speed matters, but accuracy matters more**: A wrong fix can exponentially worsen the situation\n- **Communication is critical**: Stakeholders need regular updates with appropriate detail\n- **Fix first, understand later**: Focus on service restoration before root cause analysis\n- **Document everything**: Timeline, decisions, and lessons learned are invaluable\n- **Learn and improve**: Every incident is an opportunity to build better systems\n\nRemember: Excellence in incident response comes from preparation, practice, and continuous improvement of both technical systems and human processes.\n\n## Limitations\n- Use this skill only when the task clearly matches the scope described above.\n- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.\n- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.","tags":["incident","responder","antigravity","awesome","skills","sickn33","agent-skills","agentic-skills","ai-agent-skills","ai-agents","ai-coding","ai-workflows"],"capabilities":["skill","source-sickn33","skill-incident-responder","topic-agent-skills","topic-agentic-skills","topic-ai-agent-skills","topic-ai-agents","topic-ai-coding","topic-ai-workflows","topic-antigravity","topic-antigravity-skills","topic-claude-code","topic-claude-code-skills","topic-codex-cli","topic-codex-skills"],"categories":["antigravity-awesome-skills"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/sickn33/antigravity-awesome-skills/incident-responder","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add sickn33/antigravity-awesome-skills","source_repo":"https://github.com/sickn33/antigravity-awesome-skills","install_from":"skills.sh"}},"qualityScore":"0.700","qualityRationale":"deterministic score 0.70 from registry signals: · indexed on github topic:agent-skills · 34768 github stars · SKILL.md body (10,316 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-04-23T18:51:31.711Z","embedding":null,"createdAt":"2026-04-18T21:39:01.398Z","updatedAt":"2026-04-23T18:51:31.711Z","lastSeenAt":"2026-04-23T18:51:31.711Z","tsv":"'-1':714 '-2':742 '-3':767 '-4':791 '1':162,509,730,751 '15':417,727,735 '2':202,518 '24':594,780 '24/7':724 '3':238,526 '4':534,755,776 '5':160,542 '72':803 'a/b':376 'accuraci':1055 'acknowledg':729,753,778 'across':1032 'act':105,965 'action':77,158,489,651,663 'activ':102,420,982 'adjust':601,677 'advanc':367 'affect':168,188,771 'aggreg':299 'alert':600,673,909,941,944 'alloc':838 'analysi':145,289,306,308,329,345,351,360,386,394,429,627,636,818,962,980,1087 'announc':604 'apm':307 'appli':69 'applic':309 'approach':973 'appropri':989,1074 'architectur':686 'ask':1161 'assess':163,192,250,321,332,348,520 'attack':405 'audienc':993 'authent':406 'auto':260 'auto-sc':259 'autom':678,681,929,947,950 'automat':843 'back':555 'backoff':866 'balanc':398 'base':1015 'behavior':963 'best':32,71,116,811,900 'better':1107 'blameless':622,882,1002 'blast':190 'bottleneck':313 'boundari':1169 'breach':721 'break':869 'breaker':248,353,842 'brief':458 'budget':326,814,821 'build':148,1106 'bulkhead':848 'burn':330,816 'busi':176,432,496,800 'call':235,899,918 'canari':382 'capabl':156,525 'capac':359,579,582 'cascad':349,854 'caus':635,979,1086 'cdn':400 'certif':408 'chang':254,256,333,338,923 'channel':233 'chao':369 'chatop':928 'check':538 'checklist':35 'chronolog':482 'circuit':247,352,841,868 'clarif':1163 'clarifi':63 'classif':710 'clear':985,1136 'collect':608 'come':1114 'command':205,207,1037 'communic':139,214,221,232,267,410,413,447,464,500,504,602,697,733,758,783,806,927,984,1066 'complet':665,716 'complianc':475 'comprehens':16,95,141 'configur':253,337 'connect':389 'consid':1020 'constraint':65 'consumpt':822 'continu':598,870,999,1119 'contribut':632,643 'coordin':212,224,440,912,1030 'core':858 'correl':291,334,378,403,925,942 'cosmet':793 'count':170 'criteria':1172 'critic':712,1068 'cross':438 'cross-team':437 'cultur':881,883,1003 'current':819 'custom':182,452,459,463,469,605 'customer-fac':451 'dashboard':674,934 'data':607,1012 'data-driven':1011 'databas':385 'datadog':294 'day':801 'ddos':404 'debrief':616 'debt':650,891 'debug':954,960 'decis':210,485,833,1014,1091 'decision-mak':209 'deep':123 'degrad':184,692,746,857 'depend':189,341,441,572 'deploy':252,335,383,531 'depth':991 'describ':1140 'detail':82,423,427,481,628,1075 'detect':655,845 'diagnost':948 'diagram':640 'differ':55 'disrupt':175 'distribut':172,281 'dns':395 'document':237,477,614,705,994,1088 'domain':56 'downstream':576 'driven':279,1013 'effect':138,523,666,1031 'elk':300 'emot':620 'enforc':824 'engin':98,370,425 'enhanc':544,658,671 'environ':1152 'environment-specif':1151 'error':304,325,570,813,820 'escal':725,910 'establish':203 'eta':434 'event':199 'everi':416,734,1041,1100 'everyth':995,1089 'exampl':83 'excel':1110 'execut':430,737 'exhaust':366 'experi':183,319,540,560 'expert':4,119,1157 'expertis':100,444 'exponenti':865,1062 'export':610 'extern':193,220,446 'face':453 'factor':194,633,644,646 'failur':350,844,855,862 'fastest':513 'featur':245,379,825 'first':159,1077 'fishbon':639 'five':637 'fix':507,512,530,1023,1060,1076 'flag':246,380 'flow':288 'focus':829,1005,1080 'follow':112,660,1001 'follow-up':659 'framework':133,476 'freez':826 'frequenc':877 'frequent':987 'function':745,770,859 'gap':648 'geograph':171 'glass':938 'goal':64 'grace':691,856 'gradual':529 'grafana':293 'guidanc':31 'headroom':580 'heal':684 'health':399,537,552,573 'herd':358 'high':740 'histor':961 'hour':595,731,752,756,759,777,781,804 'human':645,1126 'identif':297,314 'immedi':157,239,590,723,1022 'impact':165,167,177,320,347,381,433,492,495,715,743,749,768,774,792,797,879 'implement':508 'implic':201 'improv':152,656,669,675,687,694,871,1000,1029,1044,1099,1120 'incid':2,6,17,25,37,50,91,114,120,131,144,154,204,206,421,454,479,588,593,629,702,872,876,895,905,914,940,955,983,1036,1042,1101,1112 'incident-respond':1 'infrastructur':255,339 'initi':268,617 'input':68,1166 'insight':371 'instruct':62 'integr':904,922,932 'intellig':943 'intern':272,412,786 'invalu':1096 'invest':886 'investig':226,275,280,323 'isol':847,851 'issu':384,396,407,794 'item':652,664 'itsm':921 'jaeger':284 'jitter':867 'journey':174 'knowledg':124,700 'lag':392 'later':1079 'lead':215,223 'learn':619,703,880,997,1039,1094,1097 'lesson':618,1093 'limit':364,772,1128 'load':265,397 'log':298,501,611 'loki':302 'long':1026 'long-term':1025 'loss':179 'low':789 'maintain':109,969,1035 'major':468,744 'make':1010 'maker':211 'manag':18,115,132,216,815,906,915,924 'manual':263 'map':342 'master':134 'match':1137 'matter':1053,1056 'measur':654,667 'medium':765 'mesh':344 'metric':290,493,497,566,609,873,1019 'minim':510 'minor':769 'minut':161,418,728,736 'miss':1174 'modern':13,113,128,274,708,902 'modif':340 'monitor':311,317,533,543,545,563,599,670 'mortem':625 'mttd':875 'mttr':874 'must':104 'need':30,53,445,785,1070 'network':393 'new':672 'next':799 'nois':945 'normal':557,584 'notif':273,471,738 'observ':14,129,278,931,1017 'observability-driven':277 'on-cal':897,916 'open':86 'opentelemetri':283 'oper':585 'opportun':679,1104 'opsgeni':913 'organiz':153 'outag':718 'outcom':75 'output':1146 'outreach':466 'outsid':59 'p0':711 'p1':739 'p2':764 'p3':788 'page':270,449,762 'pagerduti':908 'pane':936 'path':514 'pattern':296,305,689,840,849 'peak':195 'perform':310,388,565 'permiss':1167 'phase':548 'platform':907 'point':462 'polici':823,864 'pool':390 'post':143,587,592,624 'post-incid':142,586,591 'post-mortem':623 'potenti':521 'practic':33,72,117,812,901,1117 'precis':110,970 'prepar':1116 'preserv':860 'prevent':653,853 'previous':372 'principl':127,1004,1051 'priorit':887,974 'proactiv':465 'problem':11,136,401,409 'procedur':696 'process':589,647,693,809,1009,1049,1127 'program':894 'prometheus':292 'protocol':276 'provid':76 'psycholog':884 'purpos':118 'queri':387 'quick':241 'quota':365 'radius':191 'rapid':10,135 'rate':331,571,817 'rational':486 'real':315,561 'recent':251 'recoveri':506,547,549 'redistribut':266 'reduct':946 'redund':690 'regular':1071 'regulatori':200,470 'relev':70 'reliabl':97,828,834,839,888,1046 'rememb':1109 'replay':956 'replic':391 'request':287 'requir':67,85,436,473,1165 'resili':149,373,688 'resolut':12,137,228,505,603,732,757,782,805 'resourc':257,361,435,442,837,850 'resources/implementation-playbook.md':87 'respond':3,7,26,38,51,121 'respons':92,155,213,567,657,695,722,750,775,798,896,911,1048,1050,1113 'restor':517,976,1083 'result':375 'retent':612 'retri':355,863 'revenu':178 'review':1158 'risk':519 'rollback':249,524 'rollout':528 'room':230 'root':634,978,1085 'runbook':680,949 'safeti':885,1168 'scale':258,261,264,363 'schedul':198,919 'scope':61,186,1139 'secur':402,720 'self':683,952 'self-heal':682 'self-servic':951 'servic':187,343,460,516,536,551,577,596,717,953,975,1082 'servicenow':920 'setup':231 'sev':713,741,766,790 'sever':164,709 'share':236,443,701 'side':522 'signific':747 'singl':208,935 'site':96 'situat':1065 'skill':21,43,1131 'skill-incident-responder' 'sla':180,498,726,754,779,802 'slack/teams':926 'sli':676 'sli/slo':327 'slis':554 'source-sickn33' 'special':8,146 'specialist':93 'specif':488,1153 'speed':1052 'splunk':301 'sre':5,99,126,322,810 'stabil':240,597 'stage':527 'stakehold':217,503,1069 'standard':478,807 'state':354 'status':269,414,448,455,761 'step':78 'stop':1159 'storm':356 'strategi':411 'structur':1038 'substitut':1149 'success':1171 'suffici':581 'support':456,621 'system':150,185,641,668,685,1007,1028,1045,1108,1124 'systemat':972 'taken':491 'talk':461 'task':27,46,1135 'team':426,439,457,615,706,1033 'technic':222,225,422,428,649,890,990,1123 'techniqu':324 'templat':698 'term':1027 'test':374,377,1155 'think':642 'threshold':558 'throttl':244 'throughput':569 'thunder':357 'ticket':808 'time':197,568,958 'time-travel':957 'timelin':336,480,613,626,630,1090 'timestamp':484 'tool':58,892,903 'topic-agent-skills' 'topic-agentic-skills' 'topic-ai-agent-skills' 'topic-ai-agents' 'topic-ai-coding' 'topic-ai-workflows' 'topic-antigravity' 'topic-antigravity-skills' 'topic-claude-code' 'topic-claude-code-skills' 'topic-codex-cli' 'topic-codex-skills' 'trace':282 'track':662 'trade':831 'trade-off':830 'traffic':243 'train':699,707,893 'trait':964 'travel':959 'treat':1144 'trigger':262,827 'troubleshoot':368 'understand':1078 'unifi':933 'unrel':48 'updat':218,271,415,431,450,606,704,760,763,787,930,1072 'upstream':574 'upstream/downstream':346 'urgenc':107,967 'usag':196 'use':19,41,1129 'user':166,169,173,316,318,494,539,559,562,748,773,796,878 'util':362 'valid':74,535,541,550,564,578,1154 'veloc':836 'verif':80 'viabl':511 'video':234 'violat':181,328,499 'vs':835 'war':229 'whys':638 'win':242 'work':23,889 'workflow':29 'worsen':1063 'wrong':1059 'zipkin':285","prices":[{"id":"bf42339c-8112-4514-82dd-71bb26a4fe80","listingId":"d0ede831-28a7-4c06-ad73-c4a1531a3294","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"sickn33","category":"antigravity-awesome-skills","install_from":"skills.sh"},"createdAt":"2026-04-18T21:39:01.398Z"}],"sources":[{"listingId":"d0ede831-28a7-4c06-ad73-c4a1531a3294","source":"github","sourceId":"sickn33/antigravity-awesome-skills/incident-responder","sourceUrl":"https://github.com/sickn33/antigravity-awesome-skills/tree/main/skills/incident-responder","isPrimary":false,"firstSeenAt":"2026-04-18T21:39:01.398Z","lastSeenAt":"2026-04-23T18:51:31.711Z"}],"details":{"listingId":"d0ede831-28a7-4c06-ad73-c4a1531a3294","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"sickn33","slug":"incident-responder","github":{"repo":"sickn33/antigravity-awesome-skills","stars":34768,"topics":["agent-skills","agentic-skills","ai-agent-skills","ai-agents","ai-coding","ai-workflows","antigravity","antigravity-skills","claude-code","claude-code-skills","codex-cli","codex-skills","cursor","cursor-skills","developer-tools","gemini-cli","gemini-skills","kiro","mcp","skill-library"],"license":"mit","html_url":"https://github.com/sickn33/antigravity-awesome-skills","pushed_at":"2026-04-23T06:41:03Z","description":"Installable GitHub library of 1,400+ agentic skills for Claude Code, Cursor, Codex CLI, Gemini CLI, Antigravity, and more. Includes installer CLI, bundles, workflows, and official/community skill collections.","skill_md_sha":"96d48418925b46b2dad96355ce61852ff49699d1","skill_md_path":"skills/incident-responder/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/sickn33/antigravity-awesome-skills/tree/main/skills/incident-responder"},"layout":"multi","source":"github","category":"antigravity-awesome-skills","frontmatter":{"name":"incident-responder","description":"Expert SRE incident responder specializing in rapid problem resolution, modern observability, and comprehensive incident management."},"skills_sh_url":"https://skills.sh/sickn33/antigravity-awesome-skills/incident-responder"},"updatedAt":"2026-04-23T18:51:31.711Z"}}