{"id":"7f727b32-8ea0-435e-8b56-97a2180f3d8b","shortId":"bgqdEM","kind":"skill","title":"observability-sre","tagline":"Observability and SRE expert. Use when setting up monitoring, logging, tracing, defining SLOs, or managing incidents. Covers Prometheus, Grafana, OpenTelemetry, and incident response best practices.","description":"# Observability & Site Reliability Engineering\n\n## Core Principles\n\n- **Three Pillars** — Metrics, Logs, and Traces provide holistic visibility\n- **Observability-First** — Build systems that explain their own behavior\n- **SLO-Driven** — Define reliability targets that matter to users\n- **Proactive Detection** — Find issues before customers do\n- **Blameless Culture** — Learn from failures without blame\n- **Automate Toil** — Reduce repetitive operational work\n- **Continuous Improvement** — Each incident makes systems more resilient\n- **Full-Stack Visibility** — Monitor from infrastructure to business metrics\n\n---\n\n## Hard Rules (Must Follow)\n\n> These rules are mandatory. 
Violating them means the skill is not working correctly.\n\n### Symptom-Based Alerts Only\n\n**Alert on user-facing symptoms, not internal infrastructure metrics.**\n\n```yaml\n# ❌ FORBIDDEN: Alerting on internal metrics\n- alert: CPUHigh\n  expr: cpu_usage > 70%\n  # Users don't care about CPU, they care about latency\n\n- alert: MemoryHigh\n  expr: memory_usage > 80%\n  # Internal metric, may not affect users\n\n# ✅ REQUIRED: Alert on user experience\n- alert: APILatencyHigh\n  expr: slo:api_latency:p95 > 0.200\n  annotations:\n    summary: \"Users experiencing slow response times\"\n\n- alert: ErrorRateHigh\n  expr: slo:api_errors:rate5m > 0.001\n  annotations:\n    summary: \"Users encountering errors\"\n```\n\n### Low Cardinality Labels\n\n**Loki/Prometheus labels must have low cardinality (<10 unique labels).**\n\n```yaml\n# ❌ FORBIDDEN: High cardinality labels\nlabels:\n  user_id: \"usr_123\"      # Millions of values!\n  order_id: \"ord_456\"     # Millions of values!\n  request_id: \"req_789\"   # Every request is unique!\n\n# ✅ REQUIRED: Low cardinality only\nlabels:\n  namespace: \"production\"  # Few values\n  app: \"api-server\"        # Few values\n  level: \"error\"           # 5-6 values\n  method: \"GET\"            # ~10 values\n\n# High cardinality data goes in log body:\nlogger.info({\n  user_id: \"usr_123\",      # In JSON body, not label\n  order_id: \"ord_456\",\n}, \"Order processed\");\n```\n\n### SLO-Based Error Budgets\n\n**Every service must have defined SLOs with error budget tracking.**\n\n```yaml\n# ❌ FORBIDDEN: No SLO definition\n# Just monitoring without targets\n\n# ✅ REQUIRED: Explicit SLO with budget\n# SLO: 99.9% availability\n# Error Budget: 0.1% = 43.2 minutes/month downtime\n\ngroups:\n  - name: slo_tracking\n    rules:\n      - record: slo:api_availability:ratio\n        expr: sum(rate(http_requests_total{status!~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))\n\n      - alert: ErrorBudgetBurnRate\n        expr: 
slo:api_availability:ratio < 0.999\n        for: 5m\n        annotations:\n          summary: \"Burning error budget too fast\"\n```\n\n### Trace Context in Logs\n\n**All logs must include trace_id for correlation with distributed traces.**\n\n```typescript\n// ❌ FORBIDDEN: Logs without trace context\nlogger.info(\"Payment processed\");\n\n// ✅ REQUIRED: Include trace_id in every log\nconst span = trace.getActiveSpan();\nlogger.info({\n  trace_id: span?.spanContext().traceId,\n  span_id: span?.spanContext().spanId,\n  order_id: \"ord_123\",\n}, \"Payment processed\");\n\n// Output includes correlation:\n// {\"trace_id\":\"abc123\",\"span_id\":\"def456\",\"order_id\":\"ord_123\",\"msg\":\"Payment processed\"}\n```\n\n---\n\n## Quick Reference\n\n### When to Use What\n\n| Scenario | Tool/Pattern | Reason |\n|----------|--------------|--------|\n| Metrics collection | Prometheus + Grafana | Industry standard, powerful query language |\n| Distributed tracing | OpenTelemetry + Tempo/Jaeger | Vendor-neutral, CNCF standard |\n| Log aggregation (cost-sensitive) | Grafana Loki | Indexes only labels, 10x cheaper |\n| Log aggregation (search-heavy) | ELK Stack | Full-text search, advanced analytics |\n| Unified observability | Elastic/Datadog/Dynatrace | Single pane of glass for all telemetry |\n| Incident management | PagerDuty/Opsgenie | Alert routing, on-call scheduling |\n| Chaos engineering | Gremlin/Chaos Mesh | Controlled failure injection |\n| AIOps/Anomaly detection | Dynatrace/Datadog | AI-driven root cause analysis |\n\n### The Three Pillars\n\n| Pillar | What | When | Tools |\n|--------|------|------|-------|\n| **Metrics** | Numerical time-series data | Real-time monitoring, alerting | Prometheus, StatsD, CloudWatch |\n| **Logs** | Event records with context | Debugging, audit trails | Loki, ELK, Splunk |\n| **Traces** | Request journey across services | Performance analysis, dependencies | OpenTelemetry, Jaeger, Zipkin |\n\n**Fourth Pillar (Emerging):** 
Continuous Profiling — Code-level performance data (CPU, memory usage at function level)\n\n---\n\n## Observability Architecture\n\n### Layered Prometheus Setup\n\n```yaml\n# 2025 Best Practice: Federated architecture\n# Prevents metric chaos while enabling drill-down\n\n# Layer 1: Application Prometheus\n# - Detailed business logic metrics\n# - High cardinality acceptable\n# - Short retention (7 days)\n\n# Layer 2: Cluster Prometheus\n# - Per-environment/cluster metrics\n# - Medium retention (30 days)\n# - Aggregates from application level\n\n# Layer 3: Global Prometheus\n# - Cross-cluster critical metrics\n# - Long retention (1 year)\n# - Federation from cluster level\n\n# Global Prometheus config\nscrape_configs:\n  - job_name: 'federate'\n    scrape_interval: 15s\n    honor_labels: true\n    metrics_path: '/federate'\n    params:\n      'match[]':\n        - '{job=\"kubernetes-nodes\"}'\n        - '{__name__=~\"job:.*\"}'  # Recording rules only\n    static_configs:\n      - targets:\n        - 'cluster-prom-us-east.internal:9090'\n        - 'cluster-prom-eu-west.internal:9090'\n```\n\n### Recording Rules for Performance\n\n```yaml\n# Precompute expensive queries\ngroups:\n  - name: api_performance\n    interval: 30s\n    rules:\n      # Request rate (requests per second)\n      - record: job:api_requests:rate5m\n        expr: sum(rate(http_requests_total[5m])) by (job, method, status)\n\n      # Error rate\n      - record: job:api_errors:rate5m\n        expr: |\n          sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (job)\n          /\n          sum(rate(http_requests_total[5m])) by (job)\n\n      # P95 latency\n      - record: job:api_latency:p95\n        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))\n```\n\n### Resource Optimization\n\n```yaml\n# Increase scrape interval for high-target deployments\nscrape_interval: 30s  # Doubling the interval from 15s roughly halves scrape load\n\n# Use relabeling to drop unnecessary 
metrics\nmetric_relabel_configs:\n  - source_labels: [__name__]\n    regex: 'go_.*|process_.*'  # Drop Go runtime metrics\n    action: drop\n\n# Limit sample retention\nstorage:\n  tsdb:\n    retention.time: 15d  # Keep only 15 days locally\n    retention.size: 50GB # Or max 50GB\n```\n\n---\n\n## Distributed Tracing with OpenTelemetry\n\n### Auto-Instrumentation Setup\n\n```typescript\n// Node.js auto-instrumentation\nimport { NodeSDK } from '@opentelemetry/sdk-node';\nimport { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';\nimport { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';\n\nconst sdk = new NodeSDK({\n  traceExporter: new OTLPTraceExporter({\n    url: 'http://otel-collector:4318/v1/traces',\n  }),\n  instrumentations: [\n    getNodeAutoInstrumentations({\n      // Auto-instruments HTTP, Express, PostgreSQL, Redis, etc.\n      '@opentelemetry/instrumentation-fs': { enabled: false }, // Too noisy\n    }),\n  ],\n});\n\nsdk.start();\n```\n\n### Manual Instrumentation for Business Logic\n\n```typescript\nimport { trace, SpanStatusCode } from '@opentelemetry/api';\n\nconst tracer = trace.getTracer('payment-service', '1.0.0');\n\nasync function processPayment(orderId: string, amount: number) {\n  // Create custom span for business operation\n  return tracer.startActiveSpan('processPayment', async (span) => {\n    try {\n      // Add business context\n      span.setAttributes({\n        'order.id': orderId,\n        'payment.amount': amount,\n        'payment.currency': 'USD',\n      });\n\n      // Child span for external API call\n      const paymentResult = await tracer.startActiveSpan('stripe.charge', async (childSpan) => {\n        const result = await stripe.charges.create({ amount, currency: 'usd' });\n        childSpan.setAttribute('stripe.charge_id', result.id);\n        childSpan.setStatus({ code: SpanStatusCode.OK });\n        childSpan.end();\n        return result;\n      });\n\n      
span.setStatus({ code: SpanStatusCode.OK });\n      return paymentResult;\n    } catch (error) {\n      span.recordException(error);\n      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });\n      throw error;\n    } finally {\n      span.end();\n    }\n  });\n}\n```\n\n### Sampling Strategies\n\n```yaml\n# OpenTelemetry Collector config\nprocessors:\n  # Probabilistic sampling: Keep 10% of traces\n  probabilistic_sampler:\n    sampling_percentage: 10\n\n  # Tail sampling: Make decisions after seeing full trace\n  tail_sampling:\n    policies:\n      # Always sample errors\n      - name: error-traces\n        type: status_code\n        status_code: {status_codes: [ERROR]}\n\n      # Always sample slow requests\n      - name: slow-traces\n        type: latency\n        latency: {threshold_ms: 1000}\n\n      # Sample 5% of normal traffic\n      - name: normal-traces\n        type: probabilistic\n        probabilistic: {sampling_percentage: 5}\n```\n\n### Context Propagation\n\n```typescript\n// Ensure trace context flows across services\nimport { propagation, context } from '@opentelemetry/api';\n\n// Outgoing HTTP request (automatic with auto-instrumentation)\nfetch('https://api.example.com/data', {\n  headers: {\n    // W3C Trace Context headers injected automatically:\n    // traceparent: 00-<trace-id>-<span-id>-01\n    // tracestate: vendor=value\n  },\n});\n\n// Manual propagation for non-HTTP (e.g., message queues)\nconst carrier = {};\npropagation.inject(context.active(), carrier);\nawait publishMessage(queue, { data: payload, headers: carrier });\n```\n\n---\n\n## Structured Logging Best Practices\n\n### JSON Logging Format\n\n```typescript\n// Use structured logging library\nimport pino from 'pino';\n\nconst logger = pino({\n  level: process.env.LOG_LEVEL || 'info',\n  formatters: {\n    level: (label) => ({ level: label }),\n  },\n  timestamp: pino.stdTimeFunctions.isoTime,\n  // Include trace context in logs\n  mixin() {\n    const 
span = trace.getActiveSpan();\n    if (!span) return {};\n\n    const { traceId, spanId } = span.spanContext();\n    return {\n      trace_id: traceId,\n      span_id: spanId,\n    };\n  },\n});\n\n// Structured logging with context\nlogger.info(\n  {\n    user_id: '123',\n    order_id: 'ord_456',\n    amount: 99.99,\n    payment_method: 'card',\n  },\n  'Payment processed successfully'\n);\n\n// Output:\n// {\"level\":\"info\",\"time\":\"2025-01-15T10:30:00.000Z\",\"trace_id\":\"abc123\",\"span_id\":\"def456\",\"user_id\":\"123\",\"order_id\":\"ord_456\",\"amount\":99.99,\"payment_method\":\"card\",\"msg\":\"Payment processed successfully\"}\n```\n\n### Log Levels\n\n```typescript\n// Follow standard severity levels\nlogger.trace({ details }, 'Low-level debugging');     // Very verbose\nlogger.debug({ state }, 'Debug information');          // Development\nlogger.info({ event }, 'Normal operation');            // Production default\nlogger.warn({ issue }, 'Warning condition');           // Potential issues\nlogger.error({ error, context }, 'Error occurred');    // Errors\nlogger.fatal({ critical }, 'Fatal error');             // Process crash\n```\n\n### Grafana Loki Configuration\n\n```yaml\n# Promtail config - ships logs to Loki\nserver:\n  http_listen_port: 9080\n\npositions:\n  filename: /tmp/positions.yaml\n\nclients:\n  - url: http://loki:3100/loki/api/v1/push\n\nscrape_configs:\n  - job_name: kubernetes\n    kubernetes_sd_configs:\n      - role: pod\n    relabel_configs:\n      # Add pod labels as Loki labels (LOW cardinality only!)\n      - source_labels: [__meta_kubernetes_namespace]\n        target_label: namespace\n      - source_labels: [__meta_kubernetes_pod_name]\n        target_label: pod\n      - source_labels: [__meta_kubernetes_pod_label_app]\n        target_label: app\n    pipeline_stages:\n      # Parse JSON logs\n      - json:\n          expressions:\n            level: level\n            trace_id: trace_id\n      # Extract fields as labels\n  
    - labels:\n          level:\n          # trace_id is deliberately not promoted to a label: it is high cardinality and stays in the log body\n```\n\n### Loki Best Practices\n\n- **Low Cardinality Labels** — Use only 5-10 labels (namespace, app, level)\n- **High Cardinality in Log Body** — Put user_id, order_id in JSON, not labels\n- **LogQL for Filtering** — Use `{app=\"api\"} | json | user_id=\"123\"`\n- **Retention Policy** — Keep recent logs longer, compress old logs\n\n```promql\n# LogQL query examples\n{namespace=\"production\", app=\"api\"} |= \"error\"  # Text search\n\n{app=\"api\"} | json | level=\"error\" | line_format \"{{.msg}}\"  # JSON parsing\n\nrate({app=\"api\"}[5m])  # Log rate per second\n\nsum by (level) (count_over_time({namespace=\"production\"}[1h]))  # Count by level\n```\n\n---\n\n## SLO/SLI/SLA Management\n\n### Definitions\n\n- **SLI (Service Level Indicator)** — Quantifiable measurement of service behavior\n  - Examples: Request latency, error rate, availability, throughput\n\n- **SLO (Service Level Objective)** — Target value/range for an SLI\n  - Examples: 99.9% availability, P95 latency < 200ms\n\n- **SLA (Service Level Agreement)** — Formal commitment with consequences\n  - Examples: \"99.9% uptime or 10% credit\"\n\n### The Four Golden Signals\n\n```yaml\n# Google SRE's key metrics for any service\n\n1. Latency\n   SLI: P95 request latency\n   SLO: 95% of requests complete in < 200ms\n\n2. Traffic\n   SLI: Requests per second\n   SLO: Handle 10,000 req/s peak load\n\n3. Errors\n   SLI: Error rate (5xx / total)\n   SLO: < 0.1% error rate\n\n4. 
Saturation\n   SLI: Resource utilization (CPU, memory, disk)\n   SLO: CPU < 70%, Memory < 80%\n```\n\n### Error Budget\n\n```python\n# Error budget = 1 - SLO\nslo = 0.999  # 99.9%, \"three nines\"\nerror_budget = 1 - slo  # 0.001, i.e. 0.1%\n\n# Monthly calculation (30 days)\ntotal_minutes = 30 * 24 * 60  # 43,200 minutes\nallowed_downtime = total_minutes * error_budget  # about 43.2 minutes\n\n# If you've had 20 minutes of downtime this month:\ndowntime = 20\nbudget_remaining = allowed_downtime - downtime  # about 23.2 minutes\nbudget_consumed = downtime / allowed_downtime  # about 0.463, i.e. 46.3%\n\n# Policy: if more than 90% of the budget is consumed, freeze deployments\nfreeze_deployments = budget_consumed > 0.90\n```\n\n### SLO Implementation with Prometheus\n\n```yaml\n# Recording rules for SLI calculation\ngroups:\n  - name: slo_availability\n    interval: 30s\n    rules:\n      # Total requests\n      - record: slo:api_requests:total\n        expr: sum(rate(http_requests_total[5m]))\n\n      # Successful requests (non-5xx)\n      - record: slo:api_requests:success\n        expr: sum(rate(http_requests_total{status!~\"5..\"}[5m]))\n\n      # Availability SLI\n      - record: slo:api_availability:ratio\n        expr: slo:api_requests:success / slo:api_requests:total\n\n      # 30-day availability\n      - record: slo:api_availability:30d\n        expr: avg_over_time(slo:api_availability:ratio[30d])\n\n  - name: slo_latency\n    interval: 30s\n    rules:\n      # P95 latency SLI\n      - record: slo:api_latency:p95\n        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))\n\n# Alerting on SLO burn rate\n- alert: HighErrorBudgetBurnRate\n  expr: |\n    (\n      slo:api_availability:ratio < 0.999  # Below 99.9% SLO\n      and\n      slo:api_availability:30d > 0.999    # But 30-day average still OK\n    )\n  for: 5m\n  annotations:\n    summary: \"Burning error budget too fast\"\n    description: \"Current availability {{ $value }} is below SLO. 
{{ $labels.service }}\"\n```\n\n---\n\n## Incident Response\n\n### Incident Severity Levels\n\n| Level | Impact | Response Time | Examples |\n|-------|--------|---------------|----------|\n| **SEV-1** | Service down or major degradation | < 15 min | Complete outage, data loss, security breach |\n| **SEV-2** | Significant impact, partial outage | < 1 hour | Feature unavailable, high error rates |\n| **SEV-3** | Minor impact, workaround exists | < 4 hours | Single component degraded, slow performance |\n| **SEV-4** | Cosmetic, no user impact | Next business day | UI glitches, logging errors |\n\n### Incident Response Roles (IMAG Framework)\n\n```yaml\nIncident Commander (IC):\n  - Overall coordination and decision-making\n  - Declares incident start/end\n  - Decides on escalations\n  - Owns communication to leadership\n\nOperations Lead (OL):\n  - Technical investigation and mitigation\n  - Coordinates engineers\n  - Implements fixes\n  - Reports status to IC\n\nCommunications Lead (CL):\n  - Internal/external status updates\n  - Customer communication\n  - Stakeholder notifications\n  - Status page updates\n```\n\n### Incident Workflow\n\n```\n1. Detection (Alert fires or user reports)\n   ↓\n2. Triage (Assess severity, assign IC)\n   ↓\n3. Response (Assemble team, create war room)\n   ↓\n4. Mitigation (Stop the bleeding, restore service)\n   ↓\n5. Resolution (Fix root cause)\n   ↓\n6. Postmortem (Blameless review, action items)\n   ↓\n7. 
Follow-up (Implement improvements)\n```\n\n### On-Call Best Practices\n\n- **Rotation** — 1-week shifts, balanced across timezones\n- **Escalation** — Primary → Secondary → Manager (15 min each)\n- **Playbooks** — Step-by-step debugging guides for common issues\n- **Runbooks** — Automated remediation scripts\n- **Handoff** — 15-min sync at rotation change\n- **Compensation** — On-call pay or comp time\n- **Health** — No more than 2 incidents/night target\n\n### Alert Fatigue Prevention\n\n```yaml\n# Symptoms vs Causes alerting\n# Alert on WHAT users experience, not WHY it's broken\n\n# GOOD: Symptom-based alert\n- alert: APILatencyHigh\n  expr: slo:api_latency:p95 > 0.200  # User-facing metric\n  annotations:\n    summary: \"API is slow for users\"\n\n# BAD: Cause-based alert\n- alert: CPUHigh\n  expr: cpu_usage > 70%  # Internal metric, might not impact users\n  # Don't alert unless this affects SLOs\n\n# Use SLO-based alerting\n# Alert when error budget burn rate is too high\n```\n\n---\n\n## Blameless Postmortems\n\n### Core Principles\n\n- **Assume Good Intentions** — Everyone did their best with available information\n- **Focus on Systems** — Identify gaps in process/tooling, not people\n- **Psychological Safety** — No punishment for honest mistakes\n- **Learning Culture** — Incidents are opportunities to improve\n- **Separate from Performance Reviews** — Postmortem participation never affects evaluations\n\n### Postmortem Template\n\n```markdown\n# Incident Postmortem: [Title]\n\n**Date:** 2025-01-15\n**Duration:** 10:30 - 12:15 UTC (1h 45m)\n**Severity:** SEV-2\n**Incident Commander:** Jane Doe\n**Responders:** John Smith, Alice Johnson\n\n## Impact\n- 15,000 users affected\n- 12% error rate on payment processing\n- $5,000 estimated revenue impact\n- No data loss\n\n## Timeline (UTC)\n- 10:30 - Alert: Payment error rate > 5%\n- 10:32 - IC assigned, war room created\n- 10:45 - Identified: Database connection pool exhausted\n- 11:00 - Mitigation: Increased 
pool size from 50 → 100\n- 11:15 - Error rate back to normal\n- 12:15 - Incident closed after monitoring\n\n## Root Cause\nDatabase connection pool configured for average load, not peak traffic.\nBlack Friday traffic spike (3x normal) exhausted connections.\n\n## What Went Well\n- Alert fired within 2 minutes of issue\n- Clear escalation path, IC available immediately\n- Mitigation applied quickly (30 minutes to fix)\n- No data corruption or loss\n\n## What Went Wrong\n- No load testing at 3x scale\n- No auto-scaling for connection pool\n- No alert on connection pool saturation\n- Insufficient monitoring of database metrics\n\n## Action Items\n- [ ] (@john) Add connection pool metrics to Grafana (Due: Jan 20)\n- [ ] (@alice) Implement auto-scaling based on request rate (Due: Jan 25)\n- [ ] (@jane) Add load testing to CI for 5x scale (Due: Feb 1)\n- [ ] (@jane) Add alert: connection pool > 80% (Due: Jan 18)\n- [ ] (@john) Document connection pool tuning runbook (Due: Jan 22)\n\n## Lessons Learned\n1. Black Friday load patterns need dedicated testing\n2. Database metrics were missing from standard dashboards\n3. Auto-scaling should cover ALL resources, not just pods\n```\n\n### Follow-up\n\n- Review postmortem in team meeting within 1 week\n- Track action items to completion (not optional!)\n- Share learnings across teams\n- Update runbooks and playbooks\n- Celebrate successful incident response\n\n---\n\n## Chaos Engineering\n\n### Principles\n\n1. **Define Steady State** — Normal system behavior (e.g., 99.9% success rate)\n2. **Hypothesize** — Predict system will remain stable under failure\n3. **Inject Failures** — Simulate real-world events\n4. **Disprove Hypothesis** — Look for deviations from steady state\n5. 
**Learn and Improve** — Fix weaknesses, increase resilience\n\n### Failure Types\n\n```yaml\nInfrastructure:\n  - Pod/node termination\n  - Network latency/packet loss\n  - DNS failures\n  - Cloud region outage\n\nResources:\n  - CPU stress\n  - Memory exhaustion\n  - Disk I/O saturation\n  - File descriptor limits\n\nDependencies:\n  - Database connection failures\n  - API timeout/errors\n  - Cache unavailability\n  - Message queue backlog\n\nSecurity:\n  - DDoS simulation\n  - Certificate expiration\n  - Unauthorized access attempts\n```\n\n### Chaos Mesh Example\n\n```yaml\n# Network latency injection\napiVersion: chaos-mesh.org/v1alpha1\nkind: NetworkChaos\nmetadata:\n  name: network-delay\nspec:\n  action: delay\n  mode: one\n  selector:\n    namespaces:\n      - production\n    labelSelectors:\n      app: payment-service\n  delay:\n    latency: \"100ms\"\n    correlation: \"50\"\n    jitter: \"50ms\"\n  duration: \"5m\"\n  scheduler:\n    cron: \"@every 2h\"  # Run every 2 hours\n\n---\n# Pod kill experiment\napiVersion: chaos-mesh.org/v1alpha1\nkind: PodChaos\nmetadata:\n  name: pod-kill\nspec:\n  action: pod-kill\n  mode: fixed-percent\n  value: \"10\"  # Kill 10% of pods\n  selector:\n    namespaces:\n      - production\n    labelSelectors:\n      app: api-server\n  duration: \"30s\"\n```\n\n### Best Practices\n\n- **Start Small** — Non-production first, then canary production\n- **Collect Baselines** — Know normal metrics before experiments\n- **Define Success** — Clear criteria for what \"stable\" means\n- **Monitor Everything** — Watch metrics, logs, traces during tests\n- **Automate Rollback** — Stop experiment if SLOs violated\n- **Game Days** — Scheduled chaos exercises with full team\n- **Blameless Reviews** — Treat chaos failures like production incidents\n\n---\n\n## AIOps and AI in Observability\n\n### 2025 Trends\n\n- **Anomaly Detection** — AI spots unusual patterns in metrics/logs\n- **Root Cause Analysis** — Correlate failures across services 
automatically\n- **Predictive Alerting** — Predict failures before they happen\n- **Auto-Remediation** — AI suggests or applies fixes autonomously\n- **Natural Language Queries** — Ask \"Why is checkout slow?\" instead of writing PromQL\n- **AI Observability** — Monitor AI model drift, hallucinations, token usage\n\n### AI-Driven Platforms (2025)\n\n```yaml\nDynatrace Davis AI:\n  - Auto-detected 73% of incidents before customer impact\n  - Reduced alert noise by 90%\n  - Causal AI for root cause analysis\n\nDatadog Watchdog:\n  - Anomaly detection across metrics, logs, traces\n  - Automated correlation of related issues\n  - LLM-powered investigation assistant\n\nElastic AIOps:\n  - Machine learning for log anomaly detection\n  - Automated baseline learning\n  - Predictive alerting\n\nNew Relic AI:\n  - Natural language query interface\n  - Automated incident summarization\n  - Proactive capacity recommendations\n```\n\n### Implementing AI Observability\n\n```python\n# Monitor AI model performance\nimport time\n\nfrom openai import AsyncOpenAI\nfrom opentelemetry import trace, metrics\n\nclient = AsyncOpenAI()  # Reads OPENAI_API_KEY from the environment\n\ntracer = trace.get_tracer(__name__)\nmeter = metrics.get_meter(__name__)\n\n# Create metrics for AI model\nmodel_latency = meter.create_histogram(\n    \"ai.model.latency\",\n    description=\"AI model inference latency\",\n    unit=\"ms\"\n)\nmodel_tokens = meter.create_counter(\n    \"ai.model.tokens\",\n    description=\"Token usage\"\n)\n\nasync def run_ai_model(prompt: str):\n    with tracer.start_as_current_span(\"ai.inference\") as span:\n        start = time.time()\n\n        span.set_attribute(\"ai.model\", \"gpt-4\")\n        span.set_attribute(\"ai.prompt_length\", len(prompt))\n\n        response = await client.chat.completions.create(\n            model=\"gpt-4\",\n            messages=[{\"role\": \"user\", \"content\": prompt}]\n        )\n\n        latency = (time.time() - start) * 1000\n        tokens = response.usage.total_tokens\n\n        # Record metrics\n        model_latency.record(latency, {\"model\": 
\"gpt-4\"})\n        model_tokens.add(tokens, {\"model\": \"gpt-4\", \"type\": \"total\"})\n\n        # Add to span\n        span.set_attribute(\"ai.response_length\", len(response.choices[0].message.content))\n        span.set_attribute(\"ai.tokens_used\", tokens)\n\n        return response\n```\n\n---\n\n## Grafana Dashboards\n\n### 3-3-3 Rule\n\n- **3 rows** of panels per dashboard\n- **3 panels** per row\n- **3 key metrics** per panel\n\nAvoid \"dashboard sprawl\" — Each dashboard should answer ONE question.\n\n### Dashboard Categories\n\n```yaml\nRED Dashboard (for services):\n  - Rate: Requests per second\n  - Errors: Error rate\n  - Duration: Latency (P50, P95, P99)\n\nUSE Dashboard (for resources):\n  - Utilization: % of capacity used\n  - Saturation: Queue depth, wait time\n  - Errors: Error count\n\nFour Golden Signals Dashboard:\n  - Latency\n  - Traffic\n  - Errors\n  - Saturation\n\nSLO Dashboard:\n  - Current SLI value\n  - Error budget remaining\n  - Burn rate\n  - Trend (30-day)\n```\n\n### Panel Best Practices\n\n```json\n{\n  \"title\": \"API Request Rate\",\n  \"type\": \"graph\",\n  \"targets\": [\n    {\n      \"expr\": \"sum(rate(http_requests_total[5m])) by (method)\",\n      \"legendFormat\": \"{{ method }}\"\n    }\n  ],\n  \"options\": {\n    \"tooltip\": { \"mode\": \"multi\" },\n    \"legend\": { \"displayMode\": \"table\", \"calcs\": [\"mean\", \"last\"] }\n  },\n  \"fieldConfig\": {\n    \"defaults\": {\n      \"unit\": \"reqps\",  // Requests per second\n      \"color\": { \"mode\": \"palette-classic\" },\n      \"custom\": {\n        \"lineWidth\": 2,\n        \"fillOpacity\": 10\n      }\n    }\n  }\n}\n```\n\n---\n\n## Checklist\n\n```markdown\n## Metrics (Prometheus + Grafana)\n- [ ] Layered architecture (app/cluster/global)\n- [ ] Recording rules for expensive queries\n- [ ] Resource limits and retention configured\n- [ ] Dashboards follow 3-3-3 rule\n- [ ] Alerts based on SLOs, not internal metrics\n\n## Tracing 
(OpenTelemetry)\n- [ ] Auto-instrumentation enabled\n- [ ] Custom spans for business operations\n- [ ] Sampling strategy configured\n- [ ] Trace context in logs (correlation)\n- [ ] Backend connected (Tempo/Jaeger)\n\n## Logging (Loki/ELK)\n- [ ] Structured JSON logging\n- [ ] Low cardinality labels (<10)\n- [ ] Trace IDs in logs\n- [ ] Appropriate log levels\n- [ ] Retention policy defined\n\n## SLOs\n- [ ] SLIs defined for key user journeys\n- [ ] SLOs documented and tracked\n- [ ] Error budget calculated\n- [ ] Burn rate alerting configured\n- [ ] Monthly SLO review process\n\n## Incident Response\n- [ ] Severity levels defined\n- [ ] On-call rotation scheduled\n- [ ] Escalation policy documented\n- [ ] Runbooks for common issues\n- [ ] Postmortem template ready\n\n## Culture\n- [ ] Blameless postmortem process\n- [ ] Action items tracked to completion\n- [ ] Incident learnings shared\n- [ ] On-call compensation policy\n- [ ] Regular chaos engineering exercises\n```\n\n---\n\n## See Also\n\n- [reference/monitoring.md](reference/monitoring.md) — Prometheus and Grafana deep dive\n- [reference/logging.md](reference/logging.md) — Structured logging best practices\n- [reference/tracing.md](reference/tracing.md) — OpenTelemetry and distributed tracing\n- [reference/incident-response.md](reference/incident-response.md) — Incident management and postmortems\n- [templates/slo-template.md](templates/slo-template.md) — SLO definition 
template","tags":["observability","sre","claude","arsenal","majiayu000","agent-skills","ai-agents","ai-coding-assistant","automation","claude-code","code-review","developer-tools"],"capabilities":["skill","source-majiayu000","skill-observability-sre","topic-agent-skills","topic-ai-agents","topic-ai-coding-assistant","topic-automation","topic-claude","topic-claude-code","topic-code-review","topic-developer-tools","topic-devops","topic-productivity","topic-prompt-engineering","topic-python"],"categories":["claude-arsenal"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/majiayu000/claude-arsenal/observability-sre","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add majiayu000/claude-arsenal","source_repo":"https://github.com/majiayu000/claude-arsenal","install_from":"skills.sh"}},"qualityScore":"0.464","qualityRationale":"deterministic score 0.46 from registry signals: · indexed on github topic:agent-skills · 29 github stars · SKILL.md body (27,407 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-01T07:01:14.717Z","embedding":null,"createdAt":"2026-04-18T22:24:15.628Z","updatedAt":"2026-05-01T07:01:14.717Z","lastSeenAt":"2026-05-01T07:01:14.717Z","tsv":"'-01':1080,1183,2115 '-1':1774 '-10':1355 '-15':1184,2116 '-2':1789,2127 '-3':1802,2885,2886,3038,3039 '-4':1815,2825,2837,2856,2861 '-6':259 '/cluster':622 '/data'',':1070 '/federate':665 '/tmp/positions.yaml':1272 '/v1alpha1':2491,2535 '0':2873 '0.001':195,1577 '0.1':322,1529,1560 '0.200':180,2011 '0.95':754,1707 '0.999':358,1730,1739 '00':1079,2180 '00.000':1187 '000':1517,2139,2149 '1':601,643,1495,1550,1794,1882,1932,2311,2332,2368,2392 '1.0.0':894 
'10':210,263,982,989,1480,1516,2118,2158,2165,2172,2553,2555,3016,3078 '100':1558,2187 '1000':1029,2846 '100ms':2514 '10x':472 '11':2179,2188 '12':2120,2142,2195 '123':222,276,416,431,1165,1197,1383 '15':816,1780,1942,1960,2121,2138,2189,2196 '15d':813 '15s':659,781 '18':2320 '1h':1430,2123 '2':616,1508,1889,1978,2227,2340,2403,2527,3014 '20':1584,1592,1597,2287 '200':1571,1576 '200ms':1467,1507 '2025':587,1182,2114,2630,2689 '22':2329 '23.2':1593 '24':1568 '25':2299 '2h':2524 '3':633,1521,1895,2348,2412,2884,2888,2894,2898,3037 '30':626,1186,1563,1567,1673,1741,2119,2159,2240,2966 '30d':1680,1689,1738 '30s':695,779,1622,1694,2567 '3100/loki/api/v1/push':1276 '32':2166 '3x':2217,2256 '4':1532,1807,1902,2420 '43':1570,1575 '43.2':323,1578,1591,1598 '4318/v1/traces':860 '45':2173 '456':229,285,1169,1201 '45m':2124 '46.3':1599 '5':258,343,732,1031,1044,1354,1655,1909,2148,2164,2429 '50':785,2186,2516 '50gb':820,823 '50ms':2518 '5m':344,350,360,713,733,741,762,1417,1637,1656,1715,1747,2520,2985 '5x':2307 '5xx':1526,1642 '6':1914 '60':1569 '7':613,1920 '70':145,1542,2033 '73':2697 '789':236 '80':161,1544,2317 '90':1603,2707 '9080':1269 '95':1502 '99.9':318,1463,1477,1553,1559,1732,2400 '99.99':1171,1203 'abc123':424,1191 'accept':610 'access':2479 'across':557,1052,1936,2379,2645,2718 'action':805,1918,2276,2371,2500,2544,3135 'add':914,1289,2279,2301,2313,2864 'advanc':485 'affect':166,2045,2105,2141 'aggreg':463,475,628 'agreement':1471 'ai':517,2627,2634,2658,2676,2679,2686,2693,2709,2747,2759,2763,2782,2790,2807 'ai-driven':516,2685 'ai.inference':2816 'ai.model':2823 'ai.model.latency':2788 'ai.model.tokens':2800 'ai.prompt':2828 'ai.response':2869 'ai.tokens':2877 'aiop':2625,2733 'aiops/anomaly':513 'alert':122,124,136,140,156,169,173,188,351,500,539,1718,1723,1884,1981,1988,1989,2003,2004,2027,2028,2042,2051,2052,2160,2224,2266,2314,2649,2704,2744,3041,3105 'alic':2135,2288 'allow':1573 'also':3153 'alway':1001,1016 'amount':900,921,941,1170,1202 
'analysi':521,560,2642,2713 'analyt':486 'annot':181,196,361,1748,2016 'anomali':2632,2716,2738 'answer':2909 'api':177,192,252,333,355,692,704,722,748,928,1379,1400,1405,1416,1628,1645,1661,1666,1670,1678,1686,1701,1727,1736,2008,2018,2466,2564,2973 'api-serv':251,2563 'api.example.com':1069 'api.example.com/data'',':1068 'apilatencyhigh':174,2005 'apivers':2488,2532 'app':250,1321,1324,1358,1378,1399,1404,1415,2508,2562 'app/cluster/global':3024 'appli':2238,2661 'applic':602,630 'appropri':3083 'architectur':582,591,3023 'ask':2667 'assembl':1897 'assess':1891 'assign':1893,2168 'assist':2731 'assum':2065 'async':895,911,935,2804 'attempt':2480 'attribut':2822,2827,2868,2876 'audit':549 'auto':829,835,864,1065,2260,2291,2350,2656,2695,3051 'auto-detect':2694 'auto-instru':863 'auto-instrument':828,834,1064,3050 'auto-remedi':2655 'auto-sc':2259,2290,2349 'autom':78,1956,2602,2722,2740,2752 'automat':1062,1077,2647 'autonom':2663 'avail':319,334,356,1451,1464,1620,1657,1662,1675,1679,1687,1728,1737,1757,2073,2235 'averag':1743,2208 'avg':1682 'avoid':2903 'await':932,939,1098,2833 'back':2192 'backend':3067 'backlog':2472 'bad':2023 'balanc':1935 'base':121,290,2002,2026,2050,2293,3042 'baselin':2580,2741 'behavior':53,1445,2398 'best':27,588,1107,1347,1929,2071,2568,2969,3165 'black':2213,2333 'blame':77 'blameless':71,1916,2061,2617,3132 'bleed':1906 'bodi':271,279,1364 'breach':1787 'broken':1998 'bucket':761,1714 'budget':292,301,316,321,365,1546,1549,1557,1589,1595,1602,1752,2055,2961,3101 'build':47 'burn':363,1721,1750,2056,2963,3103 'busi':100,605,880,906,915,1821,3057 'cach':2468 'calc':2997 'calcul':1562,1616,3102 'call':504,929,1928,1969,3118,3145 'canari':2577 'capac':2756,2937 'card':1174,1206 'cardin':202,209,216,243,266,609,1296,1350,1361,3076 'care':149,153 'carrier':1094,1097,1104 'catch':959 'categori':2913 'caus':520,1913,1987,2025,2202,2641,2712 'causal':2708 'cause-bas':2024 'celebr':2385 'certif':2476 'chang':1965 
'chao':506,594,2389,2481,2612,2620,3149 'chaos-mesh.org':2490,2534 'chaos-mesh.org/v1alpha1':2489,2533 'cheaper':473 'checklist':3017 'checkout':2670 'child':924 'childspan':936 'childspan.end':951 'childspan.setattribute':944 'childspan.setstatus':948 'ci':2305 'cl':1869 'classic':3011 'clear':2231,2588 'client':1273 'close':2198 'cloud':2448 'cloudwatch':542 'cluster':617,638,647 'cluster-prom-eu-west.internal:9090':681 'cluster-prom-us-east.internal:9090':680 'cncf':460 'code':571,949,955,964,1010,1012,1014 'code-level':570 'collect':445,2579 'collector':859,976 'color':3007 'command':1834,2129 'commit':1473 'common':1953,3126 'communic':1849,1867,1874 'comp':1972 'compens':1966,3146 'complet':1505,1782,2374,3139 'compon':1810 'compress':1390 'condit':1240 'config':651,653,678,794,977,1260,1278,1284,1288 'configur':1257,2206,3034,3061,3106 'connect':2176,2204,2220,2263,2268,2280,2315,2323,2464,3068 'consequ':1475 'const':399,849,888,930,937,1093,1121,1141,1147 'consum':1596,1604 'content':2841 'context':369,388,547,916,1045,1050,1056,1074,1137,1161,1245,3063 'context.active':1096 'continu':84,568 'control':510 'coordin':1837,1859 'core':33,2063 'correct':118 'correl':379,421,2515,2643,2723,3066 'corrupt':2246 'cosmet':1816 'cost':465 'cost-sensit':464 'count':1425,1431,2946 'counter':2799 'cover':20,2353 'cpu':143,151,575,1537,1541,2031,2452 'cpuhigh':141,2029 'crash':1254 'creat':902,1899,2171,2779 'credit':1481 'criteria':2589 'critic':639,1250 'cron':2522 'cross':637 'cross-clust':636 'cultur':72,2092,3131 'currenc':942 'current':1756,2814,2957 'custom':69,903,1873,2701,3012,3054 'dashboard':2347,2883,2893,2904,2907,2912,2916,2932,2950,2956,3035 'data':267,534,574,1101,1784,2154,2245 'databas':2175,2203,2274,2341,2463 'datadog':2714 'date':2113 'davi':2692 'day':614,627,817,1564,1674,1742,1822,2610,2967 'ddos':2474 'debug':548,1223,1228,1950 'decid':1845 'decis':993,1840 'decision-mak':1839 'declar':1842 'dedic':2338 'deep':3159 'def':2805 'def456':427,1194 
'default':780,1236,3001 'defin':15,57,297,2393,2586,3088,3091,3115 'definit':307,1436,3182 'degrad':1779,1811 'delay':2498,2501,2512 'depend':561,2462 'deploy':776,1606 'depth':2941 'descript':1755,2789,2801 'descriptor':2460 'detail':604,1219 'detect':65,514,1883,2633,2696,2717,2739 'develop':1230 'deviat':2425 'disk':1539,2456 'displaymod':2995 'disprov':2421 'distribut':381,453,824,3171 'dive':3160 'dns':2446 'document':2322,3097,3123 'doe':2131 'downtim':325,1574,1586 'drift':2681 'drill':598 'drill-down':597 'driven':56,518,2687 'drop':789,801,806 'due':2285,2297,2309,2318,2327 'durat':759,1712,2117,2519,2566,2926 'dynatrac':2691 'dynatrace/datadog':515 'e.g':1090,2399 'elast':2732 'elastic/datadog/dynatrace':489 'elk':479,552 'emerg':567 'enabl':596,872,3053 'encount':199 'engin':32,507,1860,2390,3150 'ensur':1048 'environ':621 'error':193,200,257,291,300,320,364,718,723,960,962,969,1003,1006,1015,1244,1246,1248,1252,1401,1408,1449,1522,1524,1530,1545,1548,1556,1751,1799,1826,2054,2143,2162,2190,2923,2924,2944,2945,2953,2960,3100 'error-trac':1005 'error.message':967 'errorbudgetburnr':352 'errorratehigh':189 'escal':1847,1938,2232,3121 'estim':2150 'etc':870 'evalu':2106 'event':544,1232,2419 'everi':237,293,397,2523,2526 'everyon':2068 'everyth':2595 'exampl':1396,1446,1462,1476,1772,2483 'exercis':2613,3151 'exhaust':2178,2219,2455 'exist':1806 'expens':688,3028 'experi':172,1993,2531,2585,2605 'experienc':184 'expert':7 'expir':2477 'explain':50 'explicit':313 'expr':142,158,175,190,336,353,707,725,751,1631,1648,1664,1681,1704,1725,2006,2030,2979 'express':867,1331 'extern':927 'extract':1338 'face':128,2014 'failur':75,511,2411,2414,2437,2447,2465,2621,2644,2651 'fals':873 'fast':367,1754 'fatal':1251 'fatigu':1982 'featur':1796 'feb':2310 'feder':590,645,656 'fetch':1067 'field':1339 'fieldconfig':3000 'file':2459 'filenam':1271 'fillopac':3015 'filter':1376 'final':970 'find':66 'fire':1885,2225 'first':46,2575 'fix':1862,1911,2243,2433,2550,2662 
'fixed-perc':2549 'flow':1051 'focus':2075 'follow':105,1214,1922,2360,3036 'follow-up':1921,2359 'forbidden':135,214,304,384 'formal':1472 'format':1111,1410 'formatt':1128 'four':1483,2947 'fourth':565 'framework':1831 'freez':1605 'friday':2214,2334 'full':93,482,996,2615 'full-stack':92 'full-text':481 'function':579,896 'game':2609 'gap':2079 'get':262 'getnodeautoinstrument':842,862 'glass':493 'glitch':1824 'global':634,649 'go':799,802 'goe':268 'golden':1484,2948 'good':1999,2066 'googl':1487 'gpt':2824,2836,2855,2860 'grafana':22,447,467,1255,2284,2882,3021,3158 'graph':2977 'gremlin/chaos':508 'group':326,690,1617 'guid':1951 'hallucin':2682 'handl':1515 'handoff':1959 'happen':2654 'hard':102 'header':1071,1075,1103 'health':1974 'heavi':478 'high':215,265,608,774,1360,1798,2060 'high-target':773 'higherrorbudgetburnr':1724 'histogram':752,1705,2787 'holist':42 'honest':2089 'honor':660 'hour':1795,1808,2528 'http':339,347,710,728,738,757,866,1060,1089,1266,1634,1651,1710,2982 'hypothes':2404 'hypothesi':2422 'i/o':2457 'ic':1835,1866,1894,2167,2234 'id':220,227,234,274,283,377,395,404,409,414,423,426,429,946,1153,1156,1164,1167,1190,1193,1196,1199,1335,1337,1345,1367,1369,1382,3080 'identifi':2078,2174 'imag':1830 'immedi':2236 'impact':1769,1791,1804,1819,2038,2137,2152,2702 'implement':1608,1861,1924,2289,2758 'import':837,841,845,883,1054,1117,2768 'improv':85,1925,2097,2432 'incid':19,25,87,497,1763,1765,1827,1833,1843,1880,2093,2110,2128,2197,2387,2624,2699,2753,3111,3140,3175 'incidents/night':1979 'includ':375,393,420,1135 'increas':769,2182,2435 'index':469 'indic':1440 'industri':448 'infer':2792 'info':1127,1180 'inform':1229,2074 'infrastructur':98,132,2440 'inject':512,1076,2413,2487 'instead':2672 'instrument':830,836,861,865,878,1066,3052 'insuffici':2271 'intent':2067 'interfac':2751 'intern':131,138,162,2034,3046 'internal/external':1870 'interv':658,694,771,778,1621,1693 'investig':1856,2730 'issu':67,1238,1242,1954,2230,2726,3127 
'item':1919,2277,2372,3136 'jaeger':563 'jan':2286,2298,2319,2328 'jane':2130,2300,2312 'jitter':2517 'job':654,668,673,703,715,721,735,743,747,764,1279 'john':2133,2278,2321 'johnson':2136 'journey':556,3095 'json':278,1109,1328,1330,1371,1380,1406,1412,2971,3073 'keep':814,981,1386 'key':1490,2899,3093 'kill':2530,2542,2547,2554 'kind':2492,2536 'know':2581 'kubernet':670,1281,1282,1301,1309,1318 'kubernetes-nod':669 'label':203,205,212,217,218,245,281,471,661,796,1130,1132,1291,1294,1299,1304,1307,1313,1316,1320,1323,1341,1342,1351,1356,1373,3077 'labels.service':1762 'labelselector':2507,2561 'languag':452,2665,2749 'last':2999 'latenc':155,178,745,749,1025,1026,1448,1466,1496,1500,1692,1697,1702,2009,2486,2513,2785,2793,2843,2853,2927,2951 'latency/packet':2444 'layer':583,600,615,632,3022 'le':765,1717 'lead':1853,1868 'leadership':1851 'learn':73,2091,2331,2378,2430,2735,2742,3141 'legend':2994 'legendformat':2988 'len':2830,2871 'length':2829,2870 'lesson':2330 'level':256,572,580,631,648,1124,1126,1129,1131,1179,1212,1217,1222,1332,1333,1343,1359,1407,1424,1433,1439,1455,1470,1767,1768,3085,3114 'librari':1116 'like':2622 'limit':807,2461,3031 'line':1409 'linewidth':3013 'listen':1267 'llm':2728 'llm-power':2727 'load':783,1520,2209,2253,2302,2335 'local':818 'log':13,38,270,371,373,385,398,462,474,543,1106,1110,1115,1139,1159,1211,1262,1329,1363,1388,1392,1418,1825,2598,2720,2737,3065,3070,3074,3082,3084,3164 'logger':1122 'logger.debug':1226 'logger.error':1243 'logger.fatal':1249 'logger.info':272,389,402,1162,1231 'logger.trace':1218 'logger.warn':1237 'logic':606,881 'logql':1374,1394 'loki':468,551,1256,1264,1275,1293,1346 'loki/elk':3071 'loki/prometheus':204 'long':641 'longer':1389 'look':2423 'loss':1785,2155,2248,2445 'low':201,208,242,1221,1295,1349,3075 'low-level':1220 'machin':2734 'major':1778 'make':88,992,1841 'manag':18,498,1435,1941,3176 'mandatori':109 'manual':877,1084 'markdown':2109,3018 'match':667 'matter':61 'max':822 'may':164 
'mean':112,2593,2998 'measur':1442 'medium':624 'meet':2366 'memori':159,576,1538,1543,2454 'memoryhigh':157 'mesh':509,2482 'messag':966,1091,2470,2838 'message.content':2874 'meta':1300,1308,1317 'metadata':2494,2538 'meter':2775,2777 'meter.create':2786,2798 'method':261,716,1173,1205,2987,2989 'metric':37,101,133,139,163,444,529,593,607,623,640,663,791,792,804,1491,2015,2035,2275,2282,2342,2583,2597,2719,2770,2780,2851,2900,3019,3047 'metrics.get':2776 'metrics/logs':2639 'might':2036 'million':223,230 'min':1781,1943,1961 'minor':1803 'minut':1566,1572,1579,1585,1594,2228,2241 'minutes/month':324 'miss':2344 'mistak':2090 'mitig':1858,1903,2181,2237 'mixin':1140 'mode':2502,2548,2992,3008 'model':2680,2764,2783,2784,2791,2796,2808,2835,2854,2859 'model_latency.record':2852 'model_tokens.add':2857 'monitor':12,96,309,538,2200,2272,2594,2678,2762 'month':1561,1588,3107 'ms':1028,2795 'msg':432,1207,1411 'multi':2993 'must':104,206,295,374 'name':327,655,672,691,797,1004,1020,1035,1280,1311,1618,1690,2495,2539,2774,2778 'namespac':246,1302,1305,1357,1397,1428,2505,2559 'natur':2664,2748 'need':2337 'network':2443,2485,2497 'network-delay':2496 'networkchao':2493 'neutral':459 'never':2104 'new':851,854,2745 'next':1820 'nine':1555 'node':671 'node.js':833 'nodesdk':838,852 'nois':2705 'noisi':875 'non':1088,1641,2573 'non-5xx':1640 'non-http':1087 'non-product':2572 'normal':1033,1037,1233,2194,2218,2396,2582 'normal-trac':1036 'notif':1876 'number':901 'numer':530 'object':1456 'observ':2,4,29,45,488,581,2629,2677,2760 'observability-first':44 'observability-sr':1 'occur':1247 'ok':1745 'ol':1854 'old':1391 'on-cal':502,1926,1967,3116,3143 'one':2503,2910 'openai.chat.completions.create':2834 'opentelemetri':23,455,562,827,975,2767,3049,3169 'opentelemetry/api':887,1058 'opentelemetry/auto-instrumentations-node':844 'opentelemetry/exporter-trace-otlp-http':848 'opentelemetry/instrumentation-fs':871 'opentelemetry/sdk-node':840 'oper':82,907,1234,1852,3058 
'opportun':2095 'optim':767 'option':2376,2990 'ord':228,284,415,430,1168,1200 'order':226,282,286,413,428,1166,1198,1368 'order.id':918 'orderid':898,919 'otel':858 'otel-collector':857 'otlptraceexport':846,855 'outag':1783,1793,2450 'outgo':1059 'output':419,1178 'overal':1836 'own':1848 'p50':2928 'p95':179,744,750,1465,1498,1696,1703,2010,2929 'p99':2930 'page':1878 'pagerduty/opsgenie':499 'palett':3010 'palette-class':3009 'pane':491 'panel':2891,2895,2902,2968 'param':666 'pars':1327,1413 'partial':1792 'particip':2103 'path':664,2233 'pattern':2336,2637 'pay':1970 'payload':1102 'payment':390,417,433,892,1172,1175,1204,1208,2146,2161,2510 'payment-servic':891,2509 'payment.amount':920 'payment.currency':922 'paymentresult':931,958 'peak':1519,2211 'peopl':2083 'per':620,700,1420,1512,2892,2896,2901,2921,3005 'per-environ':619 'percent':2551 'percentag':988,1043 'perform':559,573,685,693,1813,2100,2765 'pillar':36,524,525,566 'pino':1118,1120,1123 'pino.stdtimefunctions.isotime':1134 'pipelin':1325 'platform':2688 'playbook':1945,2384 'pod':1286,1290,1310,1314,1319,2358,2529,2541,2546,2557 'pod-kil':2540,2545 'pod/node':2441 'podchao':2537 'polici':1000,1385,1600,3087,3122,3147 'pool':2177,2183,2205,2264,2269,2281,2316,2324 'port':1268 'posit':1270 'postgresql':868 'postmortem':1915,2062,2102,2107,2111,2363,3128,3133,3178 'potenti':1241 'power':450,2729 'practic':28,589,1108,1348,1930,2569,2970,3166 'precomput':687 'predict':2405,2648,2650,2743 'prevent':592,1983 'primari':1939 'principl':34,2064,2391 'proactiv':64,2755 'probabilist':979,985,1040,1041 'process':287,391,418,434,800,1176,1209,1253,2147,3110,3134 'process.env.log':1125 'process/tooling':2081 'processor':978 'processpay':897,910 'product':247,1235,1398,1429,2506,2560,2574,2578,2623 'profil':569 'prometheus':21,446,540,584,603,618,635,650,1610,3020,3156 'prompt':2809,2831,2842 'promql':1393,2675 'promtail':1259 'propag':1046,1055,1085 'propagation.inject':1095 'provid':41 'psycholog':2084 
'publishmessag':1099 'punish':2087 'put':1365 'python':1547,2761 'quantifi':1441 'quantil':753,1706 'queri':451,689,1395,2666,2750,3029 'question':2911 'queue':1092,1100,2471,2940 'quick':435,2239 'rate':338,346,698,709,719,727,737,756,1414,1419,1450,1525,1531,1633,1650,1709,1722,1800,2057,2144,2163,2191,2296,2402,2919,2925,2964,2975,2981,3104 'rate5m':194,706,724 'ratio':335,357,1663,1688,1729 'readi':3130 'real':536,2417 'real-tim':535 'real-world':2416 'reason':443 'recent':1387 'recommend':2757 'record':331,545,674,682,702,720,746,1612,1626,1643,1659,1676,1699,2850,3025 'red':2915 'redi':869 'reduc':80,782,2703 'refer':436 'reference/incident-response.md':3173,3174 'reference/logging.md':3161,3162 'reference/monitoring.md':3154,3155 'reference/tracing.md':3167,3168 'regex':798 'region':2449 'regular':3148 'relabel':787,793,1287 'relat':2725 'reliabl':31,58 'relic':2746 'remain':1590,2408,2962 'remedi':1957,2657 'repetit':81 'report':1863,1888 'req':235 'req/s':1518 'reqp':3003 'request':233,238,340,348,555,697,699,705,711,729,739,758,1019,1061,1447,1499,1504,1511,1625,1629,1635,1639,1646,1652,1667,1671,1711,2295,2920,2974,2983,3004 'requir':168,241,312,392 'resili':91,2436 'resolut':1910 'resourc':766,1535,2355,2451,2934,3030 'respond':2132 'respons':26,186,1764,1770,1828,1896,2388,2832,2881,3112 'response.choices':2872 'response.usage.total':2848 'restor':1907 'result':938,953 'result.id':947 'retent':612,625,642,809,1384,3033,3086 'retention.size':819 'retention.time':812 'return':908,952,957,1146,1151,2880 'revenu':2151 'review':1917,2101,2362,2618,3109 'role':1285,1829,2839 'rollback':2603 'room':1901,2170 'root':519,1912,2201,2640,2711 'rotat':1931,1964,3119 'rout':501 'row':2889,2897 'rule':103,107,330,675,683,696,1613,1623,1695,2887,3026,3040 'run':2525,2806 'runbook':1955,2326,2382,3124 'runtim':803 'safeti':2085 'sampl':808,972,980,987,991,999,1002,1017,1030,1042,3059 'sampler':986 'satur':1533,2270,2458,2939,2954 'scale':2257,2261,2292,2308,2351 
'scenario':441 'schedul':505,2521,2611,3120 'scrape':652,657,770,777,1277 'script':1958 'sd':1283 'sdk':850 'sdk.start':876 'search':477,484,1403 'search-heavi':476 'second':701,760,1421,1513,1713,2922,3006 'secondari':1940 'secur':1786,2473 'see':995,3152 'selector':2504,2558 'sensit':466 'separ':2098 'seri':533 'server':253,1265,2565 'servic':294,558,893,1053,1438,1444,1454,1469,1494,1775,1908,2511,2646,2918 'set':10 'setup':585,831 'sev':1773,1788,1801,1814,2126 'sever':1216,1766,1892,2125,3113 'share':2377,3142 'shift':1934 'ship':1261 'short':611 'signal':1485,2949 'signific':1790 'simul':2415,2475 'singl':490,1809 'site':30 'size':2184 'skill':114 'skill-observability-sre' 'sla':1468 'sli':1437,1461,1497,1510,1523,1534,1615,1658,1698,2958 'slis':3090 'slo':55,176,191,289,306,314,317,328,332,354,1453,1501,1514,1528,1540,1551,1552,1607,1619,1627,1644,1660,1665,1669,1677,1685,1691,1700,1720,1726,1733,1735,1761,2007,2049,2955,3108,3181 'slo-bas':288,2048 'slo-driven':54 'slo/sli/sla':1434 'slos':16,298,2046,2607,3044,3089,3096 'slow':185,1018,1022,1812,2020,2671 'slow-trac':1021 'small':2571 'smith':2134 'sourc':795,1298,1306,1315 'source-majiayu000' 'span':400,405,408,410,425,904,912,925,1142,1145,1155,1192,2815,2818,2866,3055 'span.end':971 'span.recordexception':961 'span.set':2821,2826,2867,2875 'span.setattributes':917 'span.setstatus':954,963 'span.spancontext':1150 'spancontext':406,411 'spanid':412,1149,1157 'spanstatuscod':885 'spanstatuscode.error':965 'spanstatuscode.ok':950,956 'spec':2499,2543 'spike':2216 'splunk':553 'spot':2635 'sprawl':2905 'sre':3,6,1488 'stabl':2409,2592 'stack':94,480 'stage':1326 'stakehold':1875 'standard':449,461,1215,2346 'start':2570,2819,2845 'start/end':1844 'state':1227,2395,2428 'static':677 'statsd':541 'status':342,717,731,1009,1011,1013,1654,1864,1871,1877 'steadi':2394,2427 'step':1947,1949 'step-by-step':1946 'still':1744 'stop':1904,2604 'storag':810 'str':2810 'strategi':973,3060 'stress':2453 'string':899 
'stripe.charge':934,945 'stripe.charges.create':940 'structur':1105,1114,1158,3072,3163 'success':1177,1210,1638,1647,1668,2386,2401,2587 'suggest':2659 'sum':337,345,708,726,736,755,1422,1632,1649,1708,2980 'summar':2754 'summari':182,197,362,1749,2017 'symptom':120,129,1985,2001 'symptom-bas':119,2000 'sync':1962 'system':48,89,2077,2397,2406 't10':1185 'tabl':2996 'tail':990,998 'target':59,311,679,775,1303,1312,1322,1457,1980,2978 'team':1898,2365,2380,2616 'technic':1855 'telemetri':496 'templat':2108,3129,3183 'templates/slo-template.md':3179,3180 'tempo/jaeger':456,3069 'termin':2442 'test':2254,2303,2339,2601 'text':483,1402 'three':35,523,1554 'threshold':1027 'throughput':1452 'throw':968 'time':187,532,537,1181,1427,1684,1771,1973,2943 'time-seri':531 'time.time':2820,2844 'timelin':2156 'timeout/errors':2467 'timestamp':1133 'timezon':1937 'titl':2112,2972 'toil':79 'token':2683,2797,2802,2847,2849,2858,2879 'tool':528 'tool/pattern':442 'tooltip':2991 'topic-agent-skills' 'topic-ai-agents' 'topic-ai-coding-assistant' 'topic-automation' 'topic-claude' 'topic-claude-code' 'topic-code-review' 'topic-developer-tools' 'topic-devops' 'topic-productivity' 'topic-prompt-engineering' 'topic-python' 'total':341,349,712,730,740,1527,1565,1624,1630,1636,1653,1672,2863,2984 'trace':14,40,368,376,382,387,394,403,422,454,554,825,884,984,997,1007,1023,1038,1049,1073,1136,1152,1189,1334,1336,1344,2599,2721,2769,3048,3062,3079,3172 'trace.get':2772 'trace.getactivespan':401,1143 'trace.gettracer':890 'traceexport':853 'traceid':407,1148,1154 'tracepar':1078 'tracer':889,2771,2773 'tracer.start':2812 'tracer.startactivespan':909,933 'tracest':1081 'track':302,329,2370,3099,3137 'traffic':1034,1509,2212,2215,2952 'trail':550 'treat':2619 'trend':2631,2965 'tri':913 'triag':1890 'true':662 'tsdb':811 'tune':2325 'type':1008,1024,1039,2438,2862,2976 'typescript':383,832,882,1047,1112,1213 'ui':1823 'unauthor':2478 'unavail':1797,2469 'unifi':487 'uniqu':211,240 
'unit':2794,3002 'unless':2043 'unnecessari':790 'unusu':2636 'updat':1872,1879,2381 'uptim':1478 'url':856,1274 'usag':144,160,577,2032,2684,2803 'usd':923,943 'use':8,439,786,1113,1352,1377,2047,2878,2931,2938 'user':63,127,146,167,171,183,198,219,273,1163,1195,1366,1381,1818,1887,1992,2013,2022,2039,2140,2840,3094 'user-fac':126,2012 'usr':221,275 'utc':2122,2157 'util':1536,2935 'valu':225,232,249,255,260,264,1083,1758,2552,2959 'value/range':1458 've':1582 'vendor':458,1082 'vendor-neutr':457 'verbos':1225 'violat':110,2608 'visibl':43,95 'vs':1986 'w3c':1072 'wait':2942 'war':1900,2169 'warn':1239 'watch':2596 'watchdog':2715 'weak':2434 'week':1933,2369 'well':2223 'went':2222,2250 'within':2226,2367 'without':76,310,386 'work':83,117 'workaround':1805 'workflow':1881 'world':2418 'write':2674 'wrong':2251 'yaml':134,213,303,586,686,768,974,1258,1486,1611,1832,1984,2439,2484,2690,2914 'year':644 'z':1188 'zipkin':564","prices":[{"id":"aa8be846-d857-441a-8d2c-120900de1c1d","listingId":"7f727b32-8ea0-435e-8b56-97a2180f3d8b","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"majiayu000","category":"claude-arsenal","install_from":"skills.sh"},"createdAt":"2026-04-18T22:24:15.628Z"}],"sources":[{"listingId":"7f727b32-8ea0-435e-8b56-97a2180f3d8b","source":"github","sourceId":"majiayu000/claude-arsenal/observability-sre","sourceUrl":"https://github.com/majiayu000/claude-arsenal/tree/main/skills/observability-sre","isPrimary":false,"firstSeenAt":"2026-04-18T22:24:15.628Z","lastSeenAt":"2026-05-01T07:01:14.717Z"}],"details":{"listingId":"7f727b32-8ea0-435e-8b56-97a2180f3d8b","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"majiayu000","slug":"observability-sre","github":{"repo":"majiayu000/claude-arsenal","stars"
:29,"topics":["agent-skills","ai-agents","ai-coding-assistant","automation","claude","claude-code","code-review","developer-tools","devops","productivity","prompt-engineering","python","software-development","typescript","workflows"],"license":"mit","html_url":"https://github.com/majiayu000/claude-arsenal","pushed_at":"2026-04-29T04:12:22Z","description":"52 production-ready Claude Code skills and 7 specialized agents for software development, DevOps, product workflows, and automation.","skill_md_sha":"d9214ce60ac2174a172d254f7c2eab6f69ccdf8f","skill_md_path":"skills/observability-sre/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/majiayu000/claude-arsenal/tree/main/skills/observability-sre"},"layout":"multi","source":"github","category":"claude-arsenal","frontmatter":{"name":"observability-sre","description":"Observability and SRE expert. Use when setting up monitoring, logging, tracing, defining SLOs, or managing incidents. Covers Prometheus, Grafana, OpenTelemetry, and incident response best practices."},"skills_sh_url":"https://skills.sh/majiayu000/claude-arsenal/observability-sre"},"updatedAt":"2026-05-01T07:01:14.717Z"}}