{"id":"6b94c6e3-3d76-4394-9c46-0461154204ed","shortId":"L2v7wv","kind":"skill","title":"observability","tagline":"Investigate production issues using logs, traces, and errors — how to triage, correlate signals, and know when to escalate.","description":"Use this skill to investigate production problems. For query syntax, index patterns, and curl commands, use your `logs` capability.\n\n## Triage process\n\nStart with the symptom, not the tool. Before querying anything:\n\n1. **State the hypothesis** — what do you think is wrong and why?\n2. **Bound the time window** — when did it start? Is it ongoing or resolved?\n3. **Identify the scope** — one service, one endpoint, one user, or system-wide?\n\nThis prevents aimless log-scrolling and makes findings interpretable.\n\n## Signal hierarchy\n\nWork top-down — coarser signals first, drill into finer ones only when needed:\n\n| Signal | What it tells you | When to use |\n|---|---|---|\n| **Error rate / rate spike** | Something broke at scale | First check — confirms the problem is real |\n| **APM traces** | Which transaction is slow or failing, full call chain | Once you know the scope |\n| **APM errors** | Exception type, stack trace, grouping key | When you need the root cause code path |\n| **Logs** | Raw context around a specific event | When traces don't have enough detail |\n\nDon't start with logs. Start with traces or error groups, then use `trace.id` to pull the surrounding log context.\n\n## Correlating signals\n\nThe `trace.id` field links all three indices (`logs-*`, `traces-apm*`, `logs-apm.error-*`). Once you have a `trace.id` from an error or slow trace, use it to pull all logs from that same request:\n\n```json\n{\"term\": {\"trace.id\": \"<trace-id-here>\"}}\n```\n\n## Asking the right questions\n\nBefore querying, write down what a \"confirmed\" answer looks like. Examples:\n\n- \"If query returns 0 errors for service X in the last 1h, the issue has resolved\"\n- \"If the slow trace shows N+1 queries on endpoint Y, the cause is clear\"\n- \"If errors spike at exactly :15 and :45 of every hour, it's likely a cron job\"\n\nThis prevents misreading absence of evidence as evidence of absence.\n\n## When to escalate\n\nStop investigating and escalate to the team when:\n\n- Error rate is sustained above baseline for > 15 minutes and cause is not identified\n- A trace shows calls to an external dependency timing out (not your code)\n- Errors reference a data migration or schema change (potential data integrity issue)\n- You've ruled out the obvious causes and need production access or context you don't have\n\n## Common patterns\n\n| Symptom | Where to look first |\n|---|---|\n| Slow page loads | APM traces — sort by `transaction.duration.us` desc |\n| 500 errors spiking | APM errors — group by `error.grouping_key` |\n| One user affected | Logs — filter by user ID or session ID |\n| Periodic issue | Logs — look for time pattern in `@timestamp` |\n| After a deploy | APM errors — filter by `@timestamp` after deploy time |","tags":["observability","dotfiles","athal7","agent-skills"],"capabilities":["skill","source-athal7","skill-observability","topic-agent-skills"],"categories":["dotfiles"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/athal7/dotfiles/observability","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add athal7/dotfiles","source_repo":"https://github.com/athal7/dotfiles","install_from":"skills.sh"}},"qualityScore":"0.453","qualityRationale":"deterministic score 0.45 from registry signals: · indexed on github topic:agent-skills · 6 github stars · SKILL.md body (2,702 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-18T19:14:35.191Z","embedding":null,"createdAt":"2026-05-18T13:22:30.394Z","updatedAt":"2026-05-18T19:14:35.191Z","lastSeenAt":"2026-05-18T19:14:35.191Z","tsv":"'+1':281 '0':262 '1':51 '15':295,335 '1h':270 '2':63 '3':77 '45':297 '500':400 'absenc':310,316 'access':377 'affect':411 'aimless':93 'answer':255 'anyth':50 'apm':140,156,218,394,403,432 'around':175 'ask':244 'baselin':333 'bound':64 'broke':130 'call':149,345 'capabl':38 'caus':169,287,338,373 'chain':150 'chang':362 'check':134 'clear':289 'coarser':107 'code':170,354 'command':34 'common':384 'confirm':135,254 'context':174,205,379 'correl':13,206 'cron':305 'curl':33 'data':358,364 'depend':349 'deploy':431,438 'desc':399 'detail':185 'drill':110 'endpoint':84,284 'enough':184 'error':9,125,157,195,227,263,291,328,355,401,404,433 'error.grouping':407 'escal':19,319,323 'event':178 'everi':299 'evid':312,314 'exact':294 'exampl':258 'except':158 'extern':348 'fail':147 'field':210 'filter':413,434 'find':99 'finer':112 'first':109,133,390 'full':148 'group':162,196,405 'hierarchi':102 'hour':300 'hypothesi':54 'id':416,419 'identifi':78,341 'index':30 'indic':214 'integr':365 'interpret':100 'investig':2,24,321 'issu':4,272,366,421 'job':306 'json':241 'key':163,408 'know':16,153 'last':269 'like':257,303 'link':211 'load':393 'log':6,37,95,172,190,204,215,236,412,422 'log-scrol':94 'logs-apm.error':219 'look':256,389,423 'make':98 'migrat':359 'minut':336 'misread':309 'n':280 'need':116,166,375 'observ':1 'obvious':372 'one':81,83,85,113,409 'ongo':74 'page':392 'path':171 'pattern':31,385,426 'period':420 'potenti':363 'prevent':92,308 'problem':26,137 'process':40 'product':3,25,376 'pull':201,234 'queri':28,49,249,260,282 'question':247 'rate':126,127,329 'raw':173 'real':139 'refer':356 'request':240 'resolv':76,274 'return':261 'right':246 'root':168 'rule':369 'scale':132 'schema':361 'scope':80,155 'scroll':96 'servic':82,265 'session':418 'show':279,344 'signal':14,101,108,117,207 'skill':22 'skill-observability' 'slow':145,229,277,391 'someth':129 'sort':396 'source-athal7' 'specif':177 'spike':128,292,402 'stack':160 'start':41,71,188,191 'state':52 'stop':320 'surround':203 'sustain':331 'symptom':44,386 'syntax':29 'system':89 'system-wid':88 'team':326 'tell':120 'term':242 'think':58 'three':213 'time':66,350,425,439 'timestamp':428,436 'tool':47 'top':105 'top-down':104 'topic-agent-skills' 'trace':7,141,161,180,193,217,230,278,343,395 'trace.id':199,209,224,243 'traces-apm':216 'transact':143 'transaction.duration.us':398 'triag':12,39 'type':159 'use':5,20,35,124,198,231 'user':86,410,415 've':368 'wide':90 'window':67 'work':103 'write':250 'wrong':60 'x':266 'y':285","prices":[{"id":"0a52bd51-ed94-4d47-9214-7623c92f566b","listingId":"6b94c6e3-3d76-4394-9c46-0461154204ed","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"athal7","category":"dotfiles","install_from":"skills.sh"},"createdAt":"2026-05-18T13:22:30.394Z"}],"sources":[{"listingId":"6b94c6e3-3d76-4394-9c46-0461154204ed","source":"github","sourceId":"athal7/dotfiles/observability","sourceUrl":"https://github.com/athal7/dotfiles/tree/main/skills/observability","isPrimary":false,"firstSeenAt":"2026-05-18T13:22:30.394Z","lastSeenAt":"2026-05-18T19:14:35.191Z"}],"details":{"listingId":"6b94c6e3-3d76-4394-9c46-0461154204ed","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"athal7","slug":"observability","github":{"repo":"athal7/dotfiles","stars":6,"topics":["agent-skills"],"license":null,"html_url":"https://github.com/athal7/dotfiles","pushed_at":"2026-05-18T18:53:57Z","description":null,"skill_md_sha":"22980f466f8b35bbbb7fa145f9a63a6f234f4883","skill_md_path":"skills/observability/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/athal7/dotfiles/tree/main/skills/observability"},"layout":"multi","source":"github","category":"dotfiles","frontmatter":{"name":"observability","license":"MIT","description":"Investigate production issues using logs, traces, and errors — how to triage, correlate signals, and know when to escalate."},"skills_sh_url":"https://skills.sh/athal7/dotfiles/observability"},"updatedAt":"2026-05-18T19:14:35.191Z"}}