pmstudio-recovery
Generate detailed service restoration runbooks with step-by-step procedures. Use when someone asks to "create recovery procedures", "restoration runbook", "recovery steps", "how to restore service", or needs tactical step-by-step procedures for recovering from disaster scenarios.
What it does
Recovery Plan — Service Restoration Runbooks
Purpose
Generates step-by-step runbooks for each disaster scenario defined in the DR plan. These are tactical execution documents — meant to be followed during an actual incident by someone who may not be the person who wrote the plan.
Prerequisites
Hard dependency: A DR plan must exist at Operational/DR-Plan-*.html. If not found, respond:
"No DR plan found for this project. The recovery plan is built from DR plan scenarios and RTO/RPO targets. Run /dr-plan first to create one, then run /recovery-plan to generate the runbooks."
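The prerequisite check above can be sketched as a small file lookup. This is a minimal sketch, not the skill's actual implementation: the function names are hypothetical, and picking the last match by file name when several DR plans exist is an assumption the source does not state.

```python
from pathlib import Path

MISSING_DR_PLAN_MESSAGE = (
    "No DR plan found for this project. The recovery plan is built from "
    "DR plan scenarios and RTO/RPO targets. Run /dr-plan first to create one, "
    "then run /recovery-plan to generate the runbooks."
)


def find_dr_plans(project_root):
    """Return every file matching Operational/DR-Plan-*.html, sorted by name."""
    return sorted(Path(project_root).glob("Operational/DR-Plan-*.html"))


def check_prerequisite(project_root):
    """Return (plan_path, None) if a DR plan exists, else (None, message)."""
    plans = find_dr_plans(project_root)
    if not plans:
        return None, MISSING_DR_PLAN_MESSAGE
    # Assumption: when several plans exist, the last one by name is used.
    return plans[-1], None
```

If the second element of the result is not `None`, the skill stops and replies with that message instead of generating runbooks.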
Process
Step 1: Read Context
Required:
- Operational/DR-Plan-*.html: scenarios, RTO/RPO targets, dependencies, communication plan
- CLAUDE.local.md: architecture, contacts, integrations
Optional (enriches the runbooks):
- Operational/IRP-*.html: escalation matrix, communication templates
- PRD/*.html: integrations detail, technical considerations
- Architecture/: system diagrams for reference
Step 2: Extract Scenarios
From the DR plan, extract:
- Each disaster scenario (name, description, trigger condition)
- RTO/RPO targets per component
- Dependencies and contacts
- Manual workarounds
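One way to hold the extracted items is a pair of small records. This is only a sketch of the data shape: the class and field names are hypothetical, and storing RTO/RPO as minutes is an assumption, since the source does not say how the DR plan expresses its targets.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ComponentTarget:
    """RTO/RPO target for one component (minutes are an assumed unit)."""
    component: str
    rto_minutes: int
    rpo_minutes: int


@dataclass
class Scenario:
    """One disaster scenario as extracted from the DR plan."""
    name: str
    description: str
    trigger: str
    targets: List[ComponentTarget] = field(default_factory=list)
    dependencies: List[str] = field(default_factory=list)
    contacts: List[str] = field(default_factory=list)
    manual_workarounds: List[str] = field(default_factory=list)

    def tightest_rto(self) -> Optional[int]:
        """Smallest RTO across components, or None if no targets were found."""
        return min((t.rto_minutes for t in self.targets), default=None)
```

Keeping the per-component targets on the scenario makes it straightforward to compare a runbook's total time against the strictest RTO later on.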
Step 3: Generate Runbooks
Output: Operational/Recovery-Procedures-{ProductName}-{Date}.html
Self-contained HTML. Zero CDN dependencies — this must work offline during recovery.
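The offline requirement can be checked mechanically before the file is written. The sketch below uses Python's standard `html.parser` to list resource-loading tags that point at a remote URL. The tag list and function names are assumptions made for illustration, and the check does not inspect `@import` or `url(...)` references inside inline CSS. Ordinary `<a>` links are ignored because they do not block rendering.

```python
from html.parser import HTMLParser

# Tags that load a resource, mapped to the attribute holding its URL.
# This list is an illustrative assumption, not an exhaustive one.
RESOURCE_ATTRS = {
    "script": "src",
    "img": "src",
    "iframe": "src",
    "embed": "src",
    "source": "src",
    "link": "href",
}
REMOTE_PREFIXES = ("http://", "https://", "//")


class ExternalResourceFinder(HTMLParser):
    """Collects (tag, url) pairs for resources fetched from the network."""

    def __init__(self):
        super().__init__()
        self.external = []

    def handle_starttag(self, tag, attrs):
        attr_name = RESOURCE_ATTRS.get(tag)
        if attr_name is None:
            return
        for name, value in attrs:
            if name == attr_name and value and value.lower().startswith(REMOTE_PREFIXES):
                self.external.append((tag, value))


def find_external_resources(html_text):
    """Return every remote resource reference found in the HTML text."""
    finder = ExternalResourceFinder()
    finder.feed(html_text)
    finder.close()
    return finder.external
```

An empty result suggests the page will render without a network connection; any entry is a dependency to inline or remove.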
Structure: One runbook per disaster scenario, plus a general section.
General Section:
## Before Any Recovery
### Emergency Contacts
| Role | Name | Phone | Email |
|------|------|-------|-------|
| PM / Incident Commander | ... | ... | ... |
| Technical Lead | ... | ... | ... |
| Vendor Support | ... | ... | ... |
| Security POC | ... | ... | ... |
### Tools Needed
- Access to vendor admin console
- Access to monitoring/status page
- Access to communication channel (Slack/Teams/email)
- Access to backup location (Snowflake/SharePoint)
### Recovery Principles
1. Communicate first, then fix
2. Document every action and timestamp
3. Verify each step before proceeding to next
4. If stuck for >15 minutes on any step, escalate
Per-Scenario Runbook:
## Scenario: {Name}
**Trigger:** {How you know this is happening}
**Target RTO:** {time} | **Target RPO:** {time}
**Severity:** {from IRP if exists}
### Pre-Conditions
- [ ] Incident declared and logged
- [ ] Incident Commander assigned
- [ ] Stakeholders notified (initial)
### Recovery Steps
| # | Action | Owner | How to Verify | Est. Time |
|---|--------|-------|--------------|-----------|
| 1 | {action} | {role} | {verification} | {minutes} |
| 2 | ... | ... | ... | ... |
**Cumulative time: {sum} — within RTO: {yes/no}**
### Decision Points
- After step N: If {condition}, go to step M instead
- After step N: If {condition}, escalate to {person}
### Verification Checklist
- [ ] Service accessible to users
- [ ] Data integrity confirmed (spot-check N records)
- [ ] All integrations responding
- [ ] No error alerts in last 15 minutes
- [ ] Stakeholders notified of restoration
### If Recovery Fails
- At step N: {rollback action}
- Escalation: {who to call}
- Alternative: {manual workaround from DR plan}
### Post-Recovery
- [ ] Update incident log with recovery timeline
- [ ] Schedule post-incident review (PIR) within 48 hours
- [ ] Document any deviations from this runbook
- [ ] Update this runbook with lessons learned
Step 4: Validate Timing
For each runbook, sum the estimated step times. Compare to RTO:
- If total <= RTO: The runbook fits the target. No action needed.
- If total > RTO: Flag as risk. Ask user: "Steps total {X} but RTO is {Y}. Should we adjust the RTO or find ways to parallelize steps?"
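The timing check is simple arithmetic and can be sketched as follows. The function name is hypothetical, times are assumed to be in minutes, and a total exactly equal to the RTO is treated as within target, which is an assumption.

```python
def validate_timing(step_minutes, rto_minutes):
    """Sum step estimates and compare against the RTO.

    Returns (total, None) when the runbook fits the RTO, otherwise
    (total, question) where question is the prompt to put to the user.
    """
    total = sum(step_minutes)
    if total <= rto_minutes:  # assumption: equal to RTO counts as within target
        return total, None
    question = (
        f"Steps total {total} min but RTO is {rto_minutes} min. "
        "Should we adjust the RTO or find ways to parallelize steps?"
    )
    return total, question
```

For example, steps of 10, 20, and 45 minutes against a 60-minute RTO total 75 minutes, so the function returns the question and the runbook is flagged as a risk.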
Step 5: Present for Review
Show all runbook outlines with step counts and timing. Ask for approval before writing.
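A review outline needs only the scenario name, step count, and timing per runbook. A minimal sketch, assuming step times and RTO are both in minutes and using a hypothetical one-line format:

```python
def outline_line(name, step_minutes, rto_minutes):
    """Build a one-line summary of a runbook for the approval prompt."""
    total = sum(step_minutes)
    status = "within RTO" if total <= rto_minutes else "EXCEEDS RTO"
    return (
        f"{name}: {len(step_minutes)} steps, {total} min total, "
        f"RTO {rto_minutes} min ({status})"
    )


def build_outline(runbooks):
    """runbooks is a list of (name, step_minutes, rto_minutes) tuples."""
    return "\n".join(outline_line(*rb) for rb in runbooks)
```

Presenting the outline as plain text keeps the approval step quick and leaves the full HTML for after the user confirms.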
Critical Rules
- Written for execution, not understanding. Each step should be a clear action ("Open {URL} and click Settings > Backup > Export"), not a description ("The admin should initiate a backup process").
- Every step has verification. Don't move to step N+1 without confirming step N succeeded.
- Include decision trees. Real recovery rarely follows a straight line. Document branch points.
- Timing must be realistic. Don't estimate "1 minute" for something that requires vendor response. Use pessimistic estimates.
- Offline-capable. Zero CDN dependencies. Print-friendly CSS. Someone may be reading this on a phone with spotty internet.
- Contacts are phone numbers, not just emails. During a crisis, email is slow. Include phone numbers for all critical contacts.
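The phone-number rule lends itself to a quick check before the runbook is finalized. The sketch below assumes contacts are plain dictionaries with `role` and `phone` keys, which is a hypothetical shape chosen to mirror the Emergency Contacts table.

```python
def contacts_missing_phone(contacts):
    """Return the roles whose contact entry has no usable phone number."""
    missing = []
    for contact in contacts:
        phone = (contact.get("phone") or "").strip()
        # The template uses "..." as a placeholder, so treat it as missing.
        if not phone or phone == "...":
            missing.append(contact.get("role", "unknown role"))
    return missing
```

Any role returned here should be resolved with the user before the runbook is written, since an email-only contact slows escalation during an incident.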