{"id":"b34695e6-ddb9-46fd-ab4c-c4120247f14b","shortId":"rLeVSv","kind":"skill","title":"Observability Checklist","tagline":"Reviews a service or codebase against a full observability checklist — logs, metrics, traces, and alerting gaps.","description":"# Observability Checklist\n\n## What this skill does\n\nThis skill reviews a service or codebase against a comprehensive observability checklist covering structured logging, metrics instrumentation, distributed tracing, alerting, dashboards, and runbooks. It identifies gaps that would make it hard to diagnose incidents, detect regressions, or understand the system's health. The output is a prioritized list of missing observability with recommendations for each gap.\n\nUse this when building a new service, when preparing for an on-call rotation, after an incident where you couldn't figure out what happened, or as part of a production readiness review.\n\n## How to use\n\n### Claude Code / Cline\n\nCopy this file to `.agents/skills/observability-checklist/SKILL.md` in your project root.\n\nThen ask:\n- *\"Use the Observability Checklist skill to review our payments service.\"*\n- *\"Run an observability review on `server/routes/orders.ts` using the Observability Checklist skill.\"*\n\nProvide the service description, relevant code files, and information about what observability tooling is already in place (e.g., \"we use Datadog, we have some logging but no tracing\").\n\n### Cursor\n\nAdd the instructions below to your `.cursorrules` or paste them into the Cursor AI pane with the service description and code.\n\n### Codex\n\nProvide the service overview and code. 
Ask Codex to follow the instructions below to produce the observability gap report.\n\n## The Prompt / Instructions for the Agent\n\nWhen asked to review observability, evaluate the service against every item in this checklist:\n\n### Pillar 1 — Structured Logging\n\n**Must have**\n- [ ] All logs are structured (JSON or key-value format), not plain string concatenation\n- [ ] Every log entry includes: timestamp, severity level, service name, and a message\n- [ ] Errors are logged at ERROR level with the full stack trace\n- [ ] Warnings are logged at WARN for recoverable issues\n- [ ] No sensitive data in logs (passwords, tokens, PII, credit card numbers)\n\n**Should have**\n- [ ] A correlation/trace ID is included in every log entry so all logs for one request can be found\n- [ ] User ID or session ID included where relevant (but no PII that has not already been stripped or masked)\n- [ ] Log levels can be changed without redeploying (via config or environment variable)\n- [ ] Slow operations are logged with their duration\n- [ ] External API calls log the endpoint, duration, and response code\n\n**Log coverage gaps to check**\n- All HTTP request/response cycles should be logged at INFO\n- Database query errors should be logged at ERROR\n- Background jobs should log start, end, and error\n- Auth failures should be logged (for security audit)\n\n### Pillar 2 — Metrics\n\n**Must have**\n- [ ] Error rate is tracked (errors per minute or errors as % of total requests)\n- [ ] Request latency is tracked (p50, p95, p99)\n- [ ] Request throughput is tracked (requests per second/minute)\n- [ ] Health check endpoint exists (`/health` or `/ping`) returning 200 when healthy\n\n**Should have**\n- [ ] Business metrics tracked (e.g., orders created/minute, sign-ups/day)\n- [ ] Database connection pool metrics (used, available, waiting)\n- [ ] Queue depth tracked for any async job queues\n- [ ] External API call success/failure rates tracked\n- [ ] Cache hit/miss ratios tracked if caching is used\n- [ 
] Resource metrics: CPU, memory, disk I/O\n\n**Common gaps**\n- HTTP 4xx and 5xx rates are not tracked separately\n- Latency is tracked only in aggregate, not per endpoint\n- DB query duration is not tracked\n\n### Pillar 3 — Distributed Tracing\n\n**Should have if the service has multiple components**\n- [ ] Trace context propagated from incoming HTTP requests to outgoing calls\n- [ ] Database calls included in traces\n- [ ] External API calls included in traces\n- [ ] Background jobs generate their own trace spans\n- [ ] Trace IDs correlate to log entries (same trace ID in logs and traces)\n\n**Common gaps**\n- Async callbacks break trace context — it needs to be manually propagated\n- Missing spans on queued job processing\n\n### Pillar 4 — Alerting\n\n**Must have**\n- [ ] Alert fires when error rate exceeds a threshold (e.g., > 1% of requests failing)\n- [ ] Alert fires when the service is down / health check fails\n- [ ] Alert fires when latency exceeds an SLO threshold\n- [ ] Someone is on call and the alert reaches them (PagerDuty, Opsgenie, etc.)\n\n**Should have**\n- [ ] Alerts are reviewed for signal vs. 
noise — no alert fires more than once per week under normal conditions\n- [ ] A runbook exists for each alert: what to do when it fires\n- [ ] Alert severity levels are defined (P1 for wake-up-at-night, P2 for next-business-day)\n- [ ] Business metric anomaly detection is in place (sudden drop in orders, sign-ups, etc.)\n\n**Common gaps**\n- Alerts that fire but no one knows what to do (missing runbook)\n- Alerts that are always firing and get ignored (alert fatigue)\n- No alerting on downstream dependency failures\n\n### Pillar 5 — Dashboards\n\n**Should have**\n- [ ] A service dashboard showing the four golden signals: latency, traffic, error rate, saturation\n- [ ] Deployment markers on graphs (so you can see if a deploy caused a spike)\n- [ ] Comparison with the same period last week (to distinguish trend from incident)\n\n### Output format\n\n```markdown\n## Observability Review — [Service Name]\n\n### Score\n- Logging: N/10\n- Metrics: N/10\n- Tracing: N/10\n- Alerting: N/10\n- Overall readiness: [Not ready / Needs work / Good / Excellent]\n\n---\n\n### Critical Gaps (would make incident diagnosis impossible)\n\n#### 1. [Gap title]\n- **Pillar**: [Logging / Metrics / Tracing / Alerting / Dashboards]\n- **Problem**: [What's missing and why it matters]\n- **Recommendation**: [Specific change to make]\n- **Effort**: [Small / Medium / Large]\n\n### High-Priority Gaps\n\n[Same format]\n\n### Low-Priority / Nice-to-Have\n\n[Same format]\n\n### What's Already Good\n[Brief acknowledgment of observability that's already in place]\n```\n\n## Example\n\n**Input to Agent:**\n> \"Use the Observability Checklist skill on our Node.js orders service. We use console.log for logging, we have a /health endpoint, we track errors in Sentry, but we don't have any metrics or tracing set up. 
We do have Datadog available.\"\n\n**Output from Agent:**\n\n> ## Observability Review — Orders Service\n>\n> ### Score\n> - Logging: 3/10 (unstructured, no correlation IDs)\n> - Metrics: 2/10 (only crash errors via Sentry, no latency/throughput)\n> - Tracing: 0/10 (not configured)\n> - Alerting: 2/10 (Sentry sends emails, no PagerDuty, no latency alerts)\n> - Overall readiness: Needs work\n>\n> ### Critical Gaps\n>\n> #### 1. Unstructured logging (console.log)\n> - **Pillar**: Logging\n> - **Problem**: `console.log` output cannot be queried, filtered, or correlated. During an incident, you can't find all logs for a specific request or user.\n> - **Recommendation**: Replace `console.log` with a structured logger like `pino` or `winston`. Output JSON with `level`, `timestamp`, `message`, and a `requestId` field.\n> - **Effort**: Medium (2–4 hours for initial setup + incrementally replacing log calls)\n>\n> #### 2. No latency or throughput metrics\n> - **Pillar**: Metrics\n> - **Problem**: You have no way to know if order creation is getting slower. A 10x latency increase would go undetected until users complain.\n> - **Recommendation**: Instrument with `dd-trace` (Datadog APM) to auto-capture HTTP request latency, throughput, and error rates. Takes under 1 hour to set up.\n> - **Effort**: Small\n>\n> #### 3. No on-call alerting\n> - **Pillar**: Alerting\n> - **Problem**: Sentry sends emails, but emails don't wake anyone up at night. A critical orders outage could go undetected for hours.\n> - **Recommendation**: Create a Datadog monitor for `error_rate > 5%` and `p99_latency > 2000ms`, routed to PagerDuty.\n> - **Effort**: Small\n\n## Notes\n\n- Observability is a spectrum — some coverage is better than none. Start with the critical gaps and add coverage incrementally.\n- The most valuable first investment for most services: structured logging + error rate alert + latency alert. 
Everything else follows.\n- Alerts without runbooks are far less useful than they could be. Every alert should have a runbook explaining what to check first.","tags":["observability","checklist","openagentskills","notysoty","agent-skills","claude","claude-code","claude-skills","cline","cursor","llm","llm-skills"],"capabilities":["skill","source-notysoty","skill-observability-checklist","topic-agent-skills","topic-claude","topic-claude-code","topic-claude-skills","topic-cline","topic-cursor","topic-llm","topic-llm-skills","topic-skills"],"categories":["openagentskills"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/Notysoty/openagentskills/observability-checklist","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add Notysoty/openagentskills","source_repo":"https://github.com/Notysoty/openagentskills","install_from":"skills.sh"}},"qualityScore":"0.454","qualityRationale":"deterministic score 0.45 from registry signals: · indexed on github topic:agent-skills · 8 github stars · SKILL.md body (8,133 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-18T19:13:23.189Z","embedding":null,"createdAt":"2026-05-18T13:20:44.634Z","updatedAt":"2026-05-18T19:13:23.189Z","lastSeenAt":"2026-05-18T19:13:23.189Z","tsv":"'/day':461 '/health':443,875 '/ping':445 '0/10':922 '1':244,603,800,941,1056 '10x':1026 '2':408,994,1004 '2/10':913,926 '200':447 '2000ms':1105 '3':518,1063 '3/10':907 '4':590,995 '4xx':500 '5':726,1101 '5xx':502 'acknowledg':845 'add':182,1128 'agent':228,856,900 'agents/skills/observability-checklist/skill.md':125 'aggreg':512 'ai':195 
'alert':17,44,591,594,607,617,631,639,647,657,665,667,697,709,717,720,783,807,925,934,1068,1070,1143,1145,1162 'alreadi':167,339,842,850 'alway':712 'anomali':685 'anyon':1080 'api':360,478,545 'apm':1042 'ask':131,210,230 'async':474,572 'audit':406 'auth':399 'auto':1045 'auto-captur':1044 'avail':467,897 'background':391,550 'better':1119 'break':574 'brief':844 'build':84 'busi':452,681,683 'cach':483,488 'call':94,361,479,538,540,546,628,1003,1067 'callback':573 'cannot':950 'captur':1046 'card':303 'caus':754 'chang':345,818 'check':373,440,615,1170 'checklist':2,12,20,36,135,151,242,860 'claud':118 'cline':120 'code':119,158,202,209,368 'codebas':7,31 'codex':203,211 'common':497,570,695 'comparison':757 'complain':1034 'compon':528 'comprehens':34 'concaten':262 'condit':656 'config':349 'configur':924 'connect':463 'console.log':869,944,948,973 'context':530,576 'copi':121 'correl':559,910,955 'correlation/trace':308 'could':1088,1159 'couldn':101 'cover':37 'coverag':370,1117,1129 'cpu':493 'crash':915 'creat':1094 'created/minute':457 'creation':1021 'credit':302 'critic':793,939,1085,1125 'cursor':181,194 'cursorrul':188 'cycl':377 'dashboard':45,727,732 'data':296 'databas':383,462,539 'datadog':173,896,1041,1096 'day':682 'db':513 'dd':1039 'dd-trace':1038 'depend':723 'deploy':743,753 'depth':470 'descript':156,200 'detect':59,686 'diagnos':57 'diagnosi':798 'disk':495 'distinguish':765 'distribut':42,519 'downstream':722 'drop':688 'durat':358,365,515 'e.g':170,455,602 'effort':821,992,1061,1109 'els':1147 'email':929,1074,1076 'end':396 'endpoint':364,441,507,876 'endpoint-level':506 'entri':265,315,562 'environ':351 'error':275,279,385,390,398,412,416,420,597,740,879,916,1052,1099,1141 'etc':636,694 'evalu':234 'everi':238,263,313,1161 'everyth':1146 'exampl':853 'exceed':599,621 'excel':792 'exist':442,659 'explain':1167 'extern':359,477,544 'fail':606,616 'failur':400,724 'fatigu':718 'field':991 'figur':103 'file':123,159 'filter':953 
'find':962 'fire':595,608,618,648,666,699,713 'first':1134,1171 'follow':213,1148 'format':258,770,830,839 'found':324 'four':735 'full':10,283 'gap':18,50,80,221,371,498,571,696,794,801,828,940,1126 'generat':552 'get':715,1023 'go':1030,1089 'golden':736 'good':791,843 'graph':746 'happen':106 'hard':55 'health':66,439,614 'healthi':449 'high':826 'high-prior':825 'hit/miss':484 'hour':996,1057,1092 'http':375,499,534,1047 'i/o':496 'id':309,326,329,558,565,911 'identifi':49 'ignor':716 'imposs':799 'incid':58,98,768,797,958 'includ':266,311,330,541,547 'incom':533 'increas':1028 'increment':1000,1130 'info':382 'inform':161 'initi':998 'input':854 'instruct':184,215,225 'instrument':41,1036 'invest':1135 'issu':293 'item':239 'job':392,475,551,587 'json':253,983 'key':256 'key-valu':255 'know':703,1018 'larg':824 'last':762 'latenc':426,509,620,738,933,1006,1027,1049,1104,1144 'latency/throughput':920 'less':1155 'level':269,280,342,508,669,985 'like':978 'list':72 'log':13,39,177,246,250,264,277,288,298,314,318,341,356,362,369,380,388,394,403,561,567,777,804,871,906,943,946,964,1002,1140,1151 'logger':977 'low':832 'low-prior':831 'make':53,796,820 'manual':581 'markdown':771 'marker':744 'matter':815 'medium':823,993 'memori':494 'messag':274,987 'metric':14,40,409,453,465,492,684,779,805,888,912,1009,1011,1149 'minut':418 'miss':74,583,707,811 'monitor':1097 'multipl':527 'must':247,410,592 'n/10':778,780,782,784 'name':271,775 'need':578,789,937 'new':86 'next':680 'next-business-day':679 'nice':835 'nice-to-hav':834 'night':676,1083 'node.js':864 'nois':645 'none':1121 'normal':655 'note':1111 'number':304 'observ':1,11,19,35,75,134,144,150,164,220,233,772,847,859,901,1112 'on-cal':92,1065 'one':320,702 'oper':354 'opsgeni':635 'order':456,690,865,903,1020,1086 'outag':1087 'outgo':537 'output':68,769,898,949,982 'overal':785,935 'overview':207 'p1':670 'p2':677 'p50':429 'p95':430 'p99':431,1103 'pagerduti':634,931,1108 'pane':196 'part':109 'password':299 
'past':190 'payment':140 'per':417,437,652 'period':761 'pii':301,335 'pillar':243,407,517,589,725,803,945,1010,1069 'pino':979 'place':169,852 'plain':260 'pool':464 'prepar':89 'priorit':71 'prioriti':827,833 'problem':808,947,1012,1071 'process':588 'produc':218 'product':112 'project':128 'prompt':224 'propag':531,582 'provid':153,204 'queri':384,514,952 'queu':586 'queue':469,476 'rate':413,481,503,598,741,1053,1100,1142 'ratio':485 'reach':632 'readi':113,786,788,936 'recommend':77,816,971,1035,1093 'recover':292 'redeploy':347 'regress':60 'relev':157,332 'replac':972,1001 'report':222 'request':321,424,425,432,436,535,605,968,1048 'request/response':376 'requestid':990 'resourc':491 'respons':367 'return':446 'review':3,27,114,138,145,232,641,773,902 'root':129 'rotat':95 'rout':1106 'run':142 'runbook':47,658,708,1153,1166 'satur':742 'score':776,905 'second/minute':438 'secur':405 'see':750 'send':928,1073 'sensit':295 'sentri':881,918,927,1072 'separ':505 'server/routes/orders.ts':147 'servic':5,29,87,141,155,199,206,236,270,525,611,731,774,866,904,1138 'session':328 'set':891,1059 'setup':999 'sever':268,668 'show':733 'sign':459,692 'sign-up':458,691 'signal':643,737 'skill':23,26,136,152,861 'skill-observability-checklist' 'slo':623 'slow':353 'slower':1024 'small':822,1062,1110 'someon':625 'source-notysoty' 'span':556,584 'specif':817,967 'spectrum':1115 'spike':756 'stack':284 'start':395,1122 'string':261 'strip':340 'structur':38,245,252,976,1139 'success/failure':480 'sudden':687 'system':64 'take':1054 'threshold':601,624 'throughput':433,1008,1050 'timestamp':267,986 'titl':802 'token':300 'tool':165 'topic-agent-skills' 'topic-claude' 'topic-claude-code' 'topic-claude-skills' 'topic-cline' 'topic-cursor' 'topic-llm' 'topic-llm-skills' 'topic-skills' 'total':423 'trace':15,43,180,285,520,529,543,549,555,557,564,569,575,781,806,890,921,1040 'track':415,428,435,454,471,482,486,504,516,878 'traffic':739 'trend':766 'understand':62 
'undetect':1031,1090 'unstructur':908,942 'up':460,693 'use':81,117,132,148,172,466,490,857,868,1156 'user':325,970,1033 'valu':257 'valuabl':1133 'variabl':352 'via':348,917 'vs':644 'wait':468 'wake':673,1079 'wake-up-at-night':672 'warn':286,290 'way':1016 'week':653,763 'winston':981 'without':346,1152 'work':790,938 'would':52,795,1029","prices":[{"id":"201b6338-1256-41b9-8c91-cb1e02523311","listingId":"b34695e6-ddb9-46fd-ab4c-c4120247f14b","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"Notysoty","category":"openagentskills","install_from":"skills.sh"},"createdAt":"2026-05-18T13:20:44.634Z"}],"sources":[{"listingId":"b34695e6-ddb9-46fd-ab4c-c4120247f14b","source":"github","sourceId":"Notysoty/openagentskills/observability-checklist","sourceUrl":"https://github.com/Notysoty/openagentskills/tree/main/skills/observability-checklist","isPrimary":false,"firstSeenAt":"2026-05-18T13:20:44.634Z","lastSeenAt":"2026-05-18T19:13:23.189Z"}],"details":{"listingId":"b34695e6-ddb9-46fd-ab4c-c4120247f14b","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"Notysoty","slug":"observability-checklist","github":{"repo":"Notysoty/openagentskills","stars":8,"topics":["agent-skills","claude","claude-code","claude-skills","cline","cursor","llm","llm-skills","skills"],"license":"mit","html_url":"https://github.com/Notysoty/openagentskills","pushed_at":"2026-03-28T06:50:19Z","description":"A  community-driven library of reusable AI agent skills for Claude Code, Cursor, Codex, Cline, and 
more.","skill_md_sha":"d9f43ad13abf324adf3adfa1bd2d4cc1669d137b","skill_md_path":"skills/observability-checklist/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/Notysoty/openagentskills/tree/main/skills/observability-checklist"},"layout":"multi","source":"github","category":"openagentskills","frontmatter":{"name":"Observability Checklist","description":"Reviews a service or codebase against a full observability checklist — logs, metrics, traces, and alerting gaps."},"skills_sh_url":"https://skills.sh/Notysoty/openagentskills/observability-checklist"},"updatedAt":"2026-05-18T19:13:23.189Z"}}