{"id":"befbf73b-26b7-417b-ab29-5090ba622cfc","shortId":"vt6Q3z","kind":"skill","title":"observability","tagline":"Use when writing or reviewing request handlers, RPCs, or background jobs for production; adding tracing, metrics, or structured-log calls; or making diagnosability decisions","description":"# Observability\n\n## Overview\n\nCode that runs fine in dev and goes inert in production is the dominant operational failure mode for modern services. **When you add code that will run for users, you also add the diagnosability of that code: structured logs, trace context across process boundaries, metrics with bounded cardinality, signals an operator can read without your help.**\n\nThis is a **rigid** skill. Jump to the sub-section that matches what you're writing and run that sub-section's checks.\n\nThese checks matter most when adding a request handler, RPC, or background job that will run in production with users depending on diagnosability. In MVPs, prototypes, internal dev tools, and one-off scripts, structured-logging, tracing, and SLO discipline are premature — prefer the simplest thing that works.\n\n## When to invoke\n\nInvoke when you're about to:\n\n- Add a request handler, RPC method, or background job that will run in production\n- Add or change `log.info` / `log.warn` / `log.error` calls in code that will run under load\n- Add tracing instrumentation, span creation, or trace-context propagation\n- Add or change a metric (counter, gauge, histogram), especially one with labels\n- Make a diagnosability decision that crosses process boundaries (logging across services, distributed traces, error correlation)\n- Review observability coverage, log/metric/trace quality, or diagnosability of existing code\n\n### Non-triggers — do NOT invoke for\n\n- A script that runs once locally\n- A one-off migration or cleanup job\n- A test\n- An early-stage MVP or prototype where the architecture is still in flux\n- An internal dev tool or debugging endpoint\n- 
Throwaway code expected to be replaced before reaching users\n\nIf the change adds even one observability call to production code, **invoke anyway**: the surrounding code may be throwaway, but the cardinality and trace-context bugs are not.\n\n## Checks by domain\n\n### Logs\n\n1. **Structured, not free-form.** Log as JSON or another key/value format the platform parses. Keys: `timestamp`, `level`, `event` (a short stable name like `user_login_failed`), plus the relevant context fields (`request_id`, `user_id` when not sensitive, `route`, `duration_ms`, `status`). Example: `logger.info(f\"user {user.id} logged in via {provider} at {ts}\")` is unsearchable; `logger.info(\"user_login\", user_id=user.id, provider=provider)` is queryable. *(`OTel/StructuredLogs`.)*\n2. **Every request carries a request id; every cross-process call propagates it.** A single user action that touches three services should be traceable through all three by one ID. Generate at the entry point if upstream did not provide one; pass through every downstream call; include in every log line emitted while handling the request.\n3. **Log content boundaries belong to other skills.** What not to log (`security-and-trust-boundaries`); whether log files belong on disk or stdout (`build-deploy-and-tooling` `12F/XI`). This skill decides what fields go on the line and how they are shaped.\n\n### Traces\n\n4. **Propagate W3C Trace Context across process boundaries.** Every outgoing HTTP / gRPC / queue call carries the trace headers; every incoming handler reads them and continues the trace. The platform's tracer SDK does this if you let it; explicit propagation is required when you bypass the SDK (raw `requests.get`, manual queue producer). Example: a handler that reads from one service and writes to another with no propagation — the trace breaks at the boundary and the operator cannot see the cross-service path. *(`OTel/TraceContext`.)*\n5. 
**Spans cover meaningful units of work, not every function call.** A span per HTTP request, per DB transaction, per queue message handle, per batch job — yes. A span per private helper — no, the noise drowns the signal and the trace cost rises. The default tracer auto-instrumentation usually picks the right level; resist adding more spans without a reason.\n\n### Metrics\n\n6. **Watch cardinality on metric labels.** Metric labels are indexed by every unique combination; an unbounded label (user id, request id, full URL path) creates one time series per unique value, which the metrics backend has to store, index, and query forever. Example: `failed_logins_total{user_id=\"...\", reason=\"...\"}` produces a new time series per user — millions of series for a system with millions of users, and the metrics backend falls over. **Per-user, per-request, per-trace-id data belongs in logs and traces, not metric labels.** Metric labels are for low-cardinality, bounded sets: HTTP method, route template, status class, region, downstream name. *(`OE/CardinalityDiscipline`.)*\n7. **Choose the four signals deliberately for service code.** For a production service, the canonical operator-facing signals are **latency** (how long is the work taking), **traffic** (how much work), **errors** (rate of failed work), and **saturation** (how full is the resource). For each new request handler or background job, ask which of the four signals is observable; if any is not, add an instrument or note the gap. Not every codebase needs all four — a CLI is not a service — but service code does. *(`SRE/GoldenSignals`.)*\n\n## Red Flags\n\nThese thoughts mean STOP — apply the domain check before committing:\n\n| Thought | Reality |\n|---|---|\n| \"I'll log a single human-readable string — it's easier to grep.\" | Free-form strings are unsearchable in production aggregators. Log structured key-value with stable event names; the operator queries by field, not by substring. 
(`OTel/StructuredLogs`) |\n| \"I'll add the user id as a metric label so we can see per-user failures.\" | Per-user labels create a time series per user. Use a metric for the *count*; put the user id in logs and traces where high cardinality is fine. (`OE/CardinalityDiscipline`) |\n| \"I'll add the full URL path as a label.\" | Same problem — `/users/12345` and `/users/12346` are different series. Use the route template (`/users/:id`), not the realized path. (`OE/CardinalityDiscipline`) |\n| \"I'll instrument every helper function with a span.\" | Spans cover meaningful units of work; one per private helper buries the trace in noise. Span per request / transaction / job, not per function. (`OTel/TraceContext`) |\n| \"The downstream call uses raw `requests.get` — no need to thread the trace headers.\" | The trace breaks at the boundary; the operator cannot see the cross-service path. Propagate W3C Trace Context, even when bypassing the tracer SDK. (`OTel/TraceContext`) |\n| \"We don't measure latency on this background job — it'll be fine.\" | Without latency / traffic / errors / saturation visibility, the only way to know it broke is a user complaint. Wire at least the four signals for production service code. (`SRE/GoldenSignals`) |\n| \"The request id is in the trace — we don't need it in the log.\" | Logs without the request id force the operator to traverse the trace just to correlate one error line. Put the request id on every log line for the request. 
(`OTel/StructuredLogs`) |\n\n## What \"done\" looks like\n\nFor every observability surface your change touches, **all** of the following are true:\n\n- [ ] **Logs:** every new log call is structured (JSON or key/value), carries a stable `event` name, and includes the request id.\n- [ ] **Traces:** trace context is propagated across every cross-process call your code makes; spans correspond to meaningful units of work, not every function.\n- [ ] **Metrics:** every new label is bounded and low-cardinality; per-user / per-request / per-trace-id data lives in logs or traces, not labels.\n- [ ] **Signals:** for production service code, the four golden signals (latency, traffic, errors, saturation) are observable for the new code path or you have noted the gap.\n- [ ] **Content boundaries:** no secrets, no PII, no auth tokens in logs or traces (verified against `security-and-trust-boundaries`).\n\nIf any box that applies to your change is unchecked, you are not done. Either finish, or revert and re-plan.\n\n## Principles in this skill\n\n| ID | Principle | Source |\n|---|---|---|\n| `OTel/StructuredLogs` | Structured key/value logs with stable event names | OpenTelemetry semantic conventions; SRE book |\n| `OTel/TraceContext` | W3C Trace Context propagated across every cross-process call | OpenTelemetry semantic conventions; *Observability Engineering* |\n| `SRE/GoldenSignals` | The four signals for service code: latency, traffic, errors, saturation | *Site Reliability Engineering*, ch. 6 |\n| `OE/CardinalityDiscipline` | High-cardinality data belongs in logs and traces, not metric labels | *Observability Engineering* (Majors et al.) 
|\n\nSee `principles.md` for the long-form distillations and source citations.","tags":["observability","oribarilan","agent-skills","ai-agents","best-practices","claude-code","claude-code-plugin","claude-code-skills","coding-agents","copilot-cli","copilot-cli-plugin","opencode"],"capabilities":["skill","source-oribarilan","skill-observability","topic-agent-skills","topic-ai-agents","topic-best-practices","topic-claude-code","topic-claude-code-plugin","topic-claude-code-skills","topic-coding-agents","topic-copilot-cli","topic-copilot-cli-plugin","topic-opencode","topic-opencode-plugin","topic-programming-principles"],"categories":["97"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/oribarilan/97/observability","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add oribarilan/97","source_repo":"https://github.com/oribarilan/97","install_from":"skills.sh"}},"qualityScore":"0.460","qualityRationale":"deterministic score 0.46 from registry signals: · indexed on github topic:agent-skills · 21 github stars · SKILL.md body (8,667 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-18T19:05:32.914Z","embedding":null,"createdAt":"2026-05-08T13:06:23.888Z","updatedAt":"2026-05-18T19:05:32.914Z","lastSeenAt":"2026-05-18T19:05:32.914Z","tsv":"'/users':962 '/users/12345':952 '/users/12346':954 '1':323 '12f/xi':478 '2':391 '3':448 '4':494 '5':578 '6':640,1335 '7':750 'across':70,227,499,1169,1309 'action':408 'ad':15,115,633 'add':51,60,168,182,196,206,299,813,894,942 'aggreg':873 'al':1353 'also':59 'anoth':333,557 'anyway':309 'appli':843,1266 'architectur':275 'ask':801 'auth':1249 'auto':625 'auto-instrument':624 'backend':674,709 
'background':11,121,175,799,1048 'batch':602 'belong':452,468,723,1341 'book':1303 'bound':75,738,1193 'boundari':72,225,451,464,501,566,1020,1243,1261 'box':1264 'break':563,1017 'broke':1066 'bug':316 'build':474 'build-deploy-and-tool':473 'buri':988 'bypass':538,1036 'call':22,188,302,402,437,507,588,1004,1148,1174,1314 'cannot':570,1023 'canon':764 'cardin':76,311,642,737,936,1197,1339 'carri':394,508,1154 'ch':1334 'chang':184,208,298,1136,1269 'check':109,111,319,846 'choos':751 'citat':1364 'class':745 'cleanup':262 'cli':827 'code':29,52,65,190,242,288,305,758,834,1080,1176,1220,1234,1326 'codebas':822 'combin':653 'commit':848 'complaint':1070 'content':450,1242 'context':69,204,315,354,498,1033,1166,1307 'continu':518 'convent':1301,1317 'correl':232,1111 'correspond':1179 'cost':619 'count':925 'counter':211 'cover':580,979 'coverag':235 'creat':664,914 'creation':200 'cross':223,400,574,1027,1172,1312 'cross-process':399,1171,1311 'cross-servic':573,1026 'data':722,1208,1340 'db':595 'debug':285 'decid':481 'decis':26,221 'default':622 'deliber':755 'depend':130 'deploy':475 'dev':34,137,282 'diagnos':25,62,132,220,239 'differ':956 'disciplin':150 'disk':470 'distil':1361 'distribut':229 'domain':321,845 'domin':42 'done':1128,1275 'downstream':436,747,1003 'drown':613 'durat':364 'earli':268 'early-stag':267 'easier':862 'either':1276 'emit':443 'endpoint':286 'engin':1319,1333,1350 'entri':425 'error':231,781,1057,1113,1227,1329 'especi':214 'et':1352 'even':306,1034 'event':342,881,1157,1297 'everi':392,398,435,440,502,512,586,651,821,972,1120,1132,1145,1170,1186,1189,1310 'exampl':367,546,682 'exist':241 'expect':289 'explicit':532 'f':369 'face':767 'fail':350,683,784 'failur':44,909 'fall':710 'field':355,483,887 'file':467 'fine':32,938,1053 'finish':1277 'flag':838 'flux':279 'follow':1141 'forc':1102 'forev':681 'form':328,867,1360 'format':335 'four':753,805,825,1075,1222,1322 'free':327,866 'free-form':326,865 'full':661,789,944 
'function':587,974,1000,1187 'gap':819,1241 'gaug':212 'generat':422 'go':484 'goe':36 'golden':1223 'grep':864 'grpc':505 'handl':445,600 'handler':8,118,171,514,548,797 'header':511,1014 'help':84 'helper':609,973,987 'high':935,1338 'high-cardin':1337 'histogram':213 'http':504,592,740 'human':857 'human-read':856 'id':357,359,384,397,421,658,660,687,721,897,929,963,1084,1101,1118,1163,1207,1288 'includ':438,1160 'incom':513 'index':649,678 'inert':37 'instrument':198,626,815,971 'intern':136,281 'invok':161,162,248,308 'job':12,122,176,263,603,800,997,1049 'json':331,1151 'jump':90 'key':339,877 'key-valu':876 'key/value':334,1153,1293 'know':1064 'label':217,645,647,656,730,732,901,913,949,1191,1215,1348 'latenc':770,1045,1055,1225,1327 'least':1073 'let':530 'level':341,631 'like':347,1130 'line':442,487,1114,1122 'live':1209 'll':852,893,941,970,1051 'load':195 'local':255 'log':21,67,146,226,322,329,372,441,449,459,466,725,853,874,931,1096,1097,1121,1144,1147,1211,1252,1294,1343 'log.error':187 'log.info':185 'log.warn':186 'log/metric/trace':236 'logger.info':368,380 'login':349,382,684 'long':772,1359 'long-form':1358 'look':1129 'low':736,1196 'low-cardin':735,1195 'major':1351 'make':24,218,1177 'manual':543 'match':97 'matter':112 'mean':841 'meaning':581,980,1181 'measur':1044 'messag':599 'method':173,741 'metric':17,73,210,639,644,646,673,708,729,731,900,922,1188,1347 'migrat':260 'million':696,703 'mode':45 'modern':47 'ms':365 'much':779 'mvp':270 'mvps':134 'name':346,748,882,1158,1298 'need':823,1009,1092 'new':691,795,1146,1190,1233 'nois':612,992 'non':244 'non-trigg':243 'note':817,1239 'observ':1,27,234,301,808,1133,1230,1318,1349 'oe/cardinalitydiscipline':749,939,968,1336 'one':141,215,258,420,432,552,665,984,1112 'one-off':140,257 'opentelemetri':1299,1315 'oper':43,79,569,766,884,1022,1104 'operator-fac':765 'otel/structuredlogs':390,891,1126,1291 'otel/tracecontext':577,1001,1040,1304 'outgo':503 'overview':28 'pars':338 'pass':433 
'path':576,663,946,967,1029,1235 'per':591,594,597,601,607,668,694,713,716,719,907,911,918,985,994,999,1199,1202,1205 'per-request':715,1201 'per-trace-id':718,1204 'per-us':712,906,910,1198 'pick':628 'pii':1247 'plan':1283 'platform':337,522 'plus':351 'point':426 'prefer':153 'prematur':152 'principl':1284,1289 'principles.md':1355 'privat':608,986 'problem':951 'process':71,224,401,500,1173,1313 'produc':545,689 'product':14,39,127,181,304,761,872,1078,1218 'propag':205,403,495,533,560,1030,1168,1308 'prototyp':135,272 'provid':375,386,387,431 'put':926,1115 'qualiti':237 'queri':680,885 'queryabl':389 'queue':506,544,598 'rate':782 'raw':541,1006 're':100,165,1282 're-plan':1281 'reach':294 'read':81,515,550 'readabl':858 'realiti':850 'realiz':966 'reason':638,688 'red':837 'region':746 'relev':353 'reliabl':1332 'replac':292 'request':7,117,170,356,393,396,447,593,659,717,796,995,1083,1100,1117,1125,1162,1203 'requests.get':542,1007 'requir':535 'resist':632 'resourc':792 'revert':1279 'review':6,233 'right':630 'rigid':88 'rise':620 'rout':363,742,960 'rpc':119,172 'rpcs':9 'run':31,55,103,125,179,193,253 'satur':787,1058,1228,1330 'script':143,251 'sdk':525,540,1039 'secret':1245 'section':95,107 'secur':461,1258 'security-and-trust-boundari':460,1257 'see':571,905,1024,1354 'semant':1300,1316 'sensit':362 'seri':667,693,698,917,957 'servic':48,228,412,553,575,757,762,831,833,1028,1079,1219,1325 'set':739 'shape':492 'short':344 'signal':77,615,754,768,806,1076,1216,1224,1323 'simplest':155 'singl':406,855 'site':1331 'skill':89,455,480,1287 'skill-observability' 'slight':307 'slo':149 'sourc':1290,1363 'source-oribarilan' 'span':199,579,590,606,635,977,978,993,1178 'sre':1302 'sre/goldensignals':836,1081,1320 'stabl':345,880,1156,1296 'stage':269 'status':366,744 'stdout':472 'still':277 'stop':842 'store':677 'string':859,868 'structur':20,66,145,324,875,1150,1292 'structured-log':19,144 'sub':94,106 'sub-sect':93,105 'substr':890 'surfac':1134 
'system':701 'take':776 'templat':743,961 'test':265 'thing':156 'thought':840,849 'thread':1011 'three':411,418 'throwaway':287 'time':666,692,916 'timestamp':340 'token':1250 'tool':138,283,477 'topic-agent-skills' 'topic-ai-agents' 'topic-best-practices' 'topic-claude-code' 'topic-claude-code-plugin' 'topic-claude-code-skills' 'topic-coding-agents' 'topic-copilot-cli' 'topic-copilot-cli-plugin' 'topic-opencode' 'topic-opencode-plugin' 'topic-programming-principles' 'total':685 'touch':410,1137 'trace':16,68,147,197,203,230,314,493,497,510,520,562,618,720,727,933,990,1013,1016,1032,1088,1108,1164,1165,1206,1213,1254,1306,1345 'trace-context':202,313 'traceabl':415 'tracer':524,623,1038 'traffic':777,1056,1226,1328 'transact':596,996 'travers':1106 'trigger':245 'true':1143 'trust':463,1260 'ts':377 'unbound':655 'uncheck':1271 'uniqu':652,669 'unit':582,981,1182 'unsearch':379,870 'upstream':428 'url':662,945 'use':2,920,958,1005 'user':57,129,295,348,358,370,381,383,407,657,686,695,705,714,896,908,912,919,928,1069,1200 'user.id':371,385 'usual':627 'valu':670,878 'verifi':1255 'via':374 'visibl':1059 'w3c':496,1031,1305 'watch':641 'way':1062 'whether':465 'wire':1071 'without':82,636,1054,1098 'work':158,584,775,780,785,983,1184 'write':4,101,555 
'yes':604","prices":[{"id":"0e7f66b7-48d7-455f-bf71-9496b43b3d8f","listingId":"befbf73b-26b7-417b-ab29-5090ba622cfc","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"oribarilan","category":"97","install_from":"skills.sh"},"createdAt":"2026-05-08T13:06:23.888Z"}],"sources":[{"listingId":"befbf73b-26b7-417b-ab29-5090ba622cfc","source":"github","sourceId":"oribarilan/97/observability","sourceUrl":"https://github.com/oribarilan/97/tree/main/skills/observability","isPrimary":false,"firstSeenAt":"2026-05-08T13:06:23.888Z","lastSeenAt":"2026-05-18T19:05:32.914Z"}],"details":{"listingId":"befbf73b-26b7-417b-ab29-5090ba622cfc","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"oribarilan","slug":"observability","github":{"repo":"oribarilan/97","stars":21,"topics":["agent-skills","ai-agents","best-practices","claude-code","claude-code-plugin","claude-code-skills","coding-agents","copilot-cli","copilot-cli-plugin","opencode","opencode-plugin","programming-principles"],"license":"other","html_url":"https://github.com/oribarilan/97","pushed_at":"2026-05-15T21:32:54Z","description":"Agent skills distilled from the hard-won lessons of world-renowned programmers, in the spirit of \"97 Things Every Programmer Should Know\"","skill_md_sha":"b132744e31be9cf051d92d7caab4dbb8356864f8","skill_md_path":"skills/observability/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/oribarilan/97/tree/main/skills/observability"},"layout":"multi","source":"github","category":"97","frontmatter":{"name":"observability","description":"Use when writing or reviewing request handlers, RPCs, or background jobs for production; adding tracing, metrics, or structured-log calls; or making diagnosability 
decisions"},"skills_sh_url":"https://skills.sh/oribarilan/97/observability"},"updatedAt":"2026-05-18T19:05:32.914Z"}}