Skill quality: 0.46

sre-agent

Price: free
Protocol: skill
Verified: no

What it does

sre-agent

Description

SRE Agent with four operating modes (oncall, diagnosis, patrol, iteration) that can invoke each other.

Setup

Before using sre-agent, configure the following:

| Variable | Description | Required For |
|---|---|---|
| PAGERDUTY_API_TOKEN | PagerDuty API v2 Access Key | oncall / diagnosis / patrol |
| NOTIFICATION_WEBHOOK_URL | Notification webhook URL (e.g. Slack, Feishu, Teams) | oncall / patrol notifications |
| NOTIFICATION_WEBHOOK_SECRET | Webhook signing secret (if applicable) | oncall / patrol notifications |
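A pre-flight check for these variables could look roughly like the sketch below. The variable names come from the setup table; the `REQUIRED_BY_MODE` mapping and the `missing_variables` helper are hypothetical names introduced for illustration, and the signing secret is treated as optional because the table marks it "if applicable".

```python
import os

# Variable names are taken from the setup table; the per-mode grouping
# mirrors its "Required For" column. NOTIFICATION_WEBHOOK_SECRET is left
# out because the table marks it "if applicable".
REQUIRED_BY_MODE = {
    "oncall": ["PAGERDUTY_API_TOKEN", "NOTIFICATION_WEBHOOK_URL"],
    "diagnosis": ["PAGERDUTY_API_TOKEN"],
    "patrol": ["PAGERDUTY_API_TOKEN", "NOTIFICATION_WEBHOOK_URL"],
}


def missing_variables(mode, env=None):
    """Return the required variables for `mode` that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_BY_MODE.get(mode, []) if not env.get(name)]
```

A non-empty result is the kind of environment error that the skill says should be handled by reading references/setup.md rather than by guessing.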

Additionally, populate references/infra-context.md with your infrastructure details:

  • Prometheus/Thanos/VictoriaMetrics endpoints
  • Cloud account IDs and VPC CIDRs
  • Kubernetes cluster contexts
  • Available diagnostic skills

Mode Routing

Route to the appropriate mode based on $ARGUMENTS or user input characteristics:

| Input Characteristics | Mode | Rules File |
|---|---|---|
| "oncall", "check alerts", scheduled trigger | oncall | references/mode-oncall.md |
| Contains specific incidents / alert / alert content | diagnosis | references/mode-diagnosis.md |
| "patrol", "health check", "inspection" | patrol | references/mode-patrol.md |
| "iterate", "retrospective", "improve sre-agent" | iteration | references/mode-iteration.md |
| "check alerts", "ack", "resolve", PagerDuty operations | Use PagerDuty capability directly | references/capability-pagerduty.md |

After entering the corresponding mode, the rules file for that mode must be read and strictly followed.
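The routing table can be approximated with a keyword lookup, sketched below under several assumptions: the `ROUTES` list and `route` function are hypothetical, the first matching row wins (so "check alerts", which appears in two rows, resolves to oncall), and diagnosis is used as the fallback because "contains specific incidents" cannot be detected by keyword alone. In practice the agent itself makes this decision; the code only illustrates the mapping.

```python
import re

# (keywords, mode, rules file), in the order the routing table lists them.
ROUTES = [
    (("oncall", "check alerts"), "oncall", "references/mode-oncall.md"),
    (("patrol", "health check", "inspection"), "patrol", "references/mode-patrol.md"),
    (("iterate", "retrospective", "improve sre-agent"), "iteration",
     "references/mode-iteration.md"),
    (("ack", "resolve"), "pagerduty", "references/capability-pagerduty.md"),
]

# Assumption: anything that matches no keyword is treated as alert content.
FALLBACK = ("diagnosis", "references/mode-diagnosis.md")


def route(user_input):
    """Return (mode, rules_file) for the first row whose keyword matches."""
    text = user_input.lower()
    for keywords, mode, rules_file in ROUTES:
        for keyword in keywords:
            # Word boundaries keep "ack" from matching inside "rollback".
            if re.search(r"\b" + re.escape(keyword) + r"\b", text):
                return mode, rules_file
    return FALLBACK
```

For example, `route("run a health check")` returns the patrol mode together with references/mode-patrol.md, which is the file that must then be read and followed.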

Inter-Mode Call Relations

oncall ──invokes──> diagnosis (Triage dispatches Diagnosis Agent for deep investigation)
patrol ──invokes──> diagnosis (deep analysis of critical-level patrol findings)
diagnosis ─references─> patrol-playbook (consults known failure patterns to assist investigation)
oncall ──persists──> known-issues (written after user confirmation)
diagnosis ─reads─> known-issues (references known issues)
iteration ─reads/writes─> all references (improves sre-agent itself based on feedback)
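The relations above form a small directed graph. One way to make them queryable is an edge list, as in this sketch; the `RELATIONS` name and the `targets` helper are illustrative only and not part of the skill.

```python
# Each edge is (source, relation, target), copied from the diagram above.
RELATIONS = [
    ("oncall", "invokes", "diagnosis"),
    ("patrol", "invokes", "diagnosis"),
    ("diagnosis", "references", "patrol-playbook"),
    ("oncall", "persists", "known-issues"),
    ("diagnosis", "reads", "known-issues"),
    ("iteration", "reads/writes", "all references"),
]


def targets(source, relation):
    """List what `source` is related to through `relation`."""
    return [dst for src, rel, dst in RELATIONS if src == source and rel == relation]
```

Read this way, diagnosis is a leaf for invocation: both oncall and patrol invoke it, and it invokes no other mode.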

Rules

The following rules apply across all modes and do not require additional file reads.

Security Boundary (Read-Only)

Absolutely prohibited (in oncall / patrol / diagnosis modes):

  • Do not autonomously call PagerDuty API acknowledge / resolve endpoints
  • Do not perform any infrastructure changes (kubectl apply/delete, argocd sync, AWS resource modifications)
  • Do not restart services, roll back deployments, or modify configurations
  • Do not expose secrets in reports (passwords, tokens, connection strings)

Allowed: All GET / list / describe / logs / query read-only operations.
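As an illustration of the read-only boundary, a guard could classify a CLI command by its verb before it is run. The sketch below is deliberately conservative and rests on assumptions: the `is_read_only` helper is hypothetical, the verb set is taken from the "Allowed" line above, and the verb is assumed to be the first non-flag token after the CLI name. Commands that do not fit that shape (for example, a flag value placed before the verb) are rejected rather than guessed at.

```python
import shlex

# Verbs from the "Allowed" line: GET / list / describe / logs / query.
READ_ONLY_VERBS = {"get", "list", "describe", "logs", "query"}


def is_read_only(command):
    """Return True only when the first non-flag argument is a read-only verb."""
    tokens = shlex.split(command)
    # Skip the CLI name (tokens[0]) and any flags such as "-n" or "--context".
    arguments = [t for t in tokens[1:] if not t.startswith("-")]
    return bool(arguments) and arguments[0].lower() in READ_ONLY_VERBS
```

Under these assumptions `kubectl get pods -n payments` passes, while `kubectl delete pod api-0` and `argocd app sync web` are refused. Denying by default keeps the sketch on the safe side of the boundary at the cost of some false rejections.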

No Human Intervention Principle

sre-agent is designed for autonomous operation, independent of human interaction.

  • Do not ask the user questions: Don't ask "Should I continue investigating?" or "Want me to dig deeper?". Make autonomous decisions and proactively explore all available data sources
  • Handle blockages independently: If a data source is inaccessible, try alternative paths; if all paths are blocked, document the gap under the report's missing_signals instead of stopping to wait for a person
  • Surface limitations in reports: When unable to obtain certain information due to permissions or network issues, explicitly annotate in the report what was attempted, why it failed, and how to fill the gap
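A missing_signals entry could carry the three pieces of information named above: what was attempted, why it failed, and how to fill the gap. The following is a sketch only. The field names and the `record_missing` helper are hypothetical, and the authoritative structure is whatever references/report-standard.md defines.

```python
from dataclasses import asdict, dataclass


@dataclass
class MissingSignal:
    attempted: str       # what the agent tried to read
    failure_reason: str  # why it failed (permissions, network, ...)
    how_to_fill: str     # what a person would need to do to close the gap


def record_missing(report, signal):
    """Append the signal to report["missing_signals"], creating the list if needed."""
    report.setdefault("missing_signals", []).append(asdict(signal))
    return report
```

Recording the gap and moving on keeps the run autonomous while still leaving a reader of the report with a concrete next step.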

Environment and Endpoint Lookup

  • All infrastructure context is in references/infra-context.md
  • Never guess endpoint domains or cluster names; look them up from the reference file

Out of Scope

  • No change operations
  • No service topology inference (Phase 3)
  • No automated remediation (Phase 3)

Command Execution Standards

Three absolute prohibitions (violations trigger mandatory human review):

  1. Do not create files using heredoc / cat / echo -- use the Write tool
  2. Do not chain multiple commands in Bash -- no &&, ||, or ; (one Bash call executes exactly one command)
  3. Do not add redirections -- no 2>&1, 2>/dev/null, or > file

Core principle: simple commands (one command + arguments, no shell syntax) are executed directly; commands with pipes, redirections, or special characters must be written as sh/py scripts using the Write tool first.
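The distinction between a "simple command" and one that needs a script can be approximated by scanning for shell metacharacters, as sketched below. The exact character set is an assumption (the rules only say "pipes, redirections, or special characters"), and `is_simple_command` is a hypothetical helper.

```python
import re

# Characters treated as shell syntax in this sketch: pipes, chaining,
# redirection, substitution, grouping, globbing, history expansion,
# escapes, and newlines.
SHELL_SYNTAX = re.compile(r"[|&;<>$`(){}*?!\\\n]")


def is_simple_command(command):
    """True if the command is one program plus plain arguments, with no shell syntax."""
    return SHELL_SYNTAX.search(command) is None
```

A command for which this returns False would, under the rules above, be written to an sh or py script with the Write tool and then executed as a single simple command.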

Environment Error Guidance

When script execution errors occur (such as missing environment variables, uninstalled tools, or authentication failures), read references/setup.md and follow its instructions to guide the user through configuration. Do not guess at solutions.

Shared Capabilities

  • PagerDuty: Alert querying and operations across all modes -> references/capability-pagerduty.md
  • Feishu notifications: Sending notifications from any mode -> references/capability-feishu.md
  • Temp script cleanup: Cleaning up .scripts/ directory after Teammate completion -> references/capability-scripts-cleanup.md

Layered Loading

Layer 0: SKILL.md       — loaded on skill trigger (routing + global rules)
Layer 1: mode-*.md      — Lead reads when entering a mode (orchestration logic)
Layer 2: role-*.md      — Lead reads when creating a Teammate (role contract, prompt blueprint)
Layer 3: capability/data — each Teammate reads on demand during execution (tool usage + data)

Each layer is only loaded when needed, avoiding reading all files at once.
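The on-demand loading described above resembles a lazy, cached file reader. This sketch assumes the skill's files live under a single root directory; the `LayeredLoader` class is illustrative and not part of the skill.

```python
from pathlib import Path


class LayeredLoader:
    """Read reference files only when first requested, then reuse the text."""

    def __init__(self, root):
        self.root = Path(root)
        self.loaded = {}  # relative path -> file contents

    def load(self, relative_path):
        # Nothing is read up front; each file is read once, on first use.
        if relative_path not in self.loaded:
            path = self.root / relative_path
            self.loaded[relative_path] = path.read_text(encoding="utf-8")
        return self.loaded[relative_path]
```

With this shape, entering oncall mode would touch only references/mode-oncall.md, and role or capability files would be read later by whichever Teammate needs them.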

Examples

Bad Example

User: oncall
Agent: What do you want me to do? Should I check alerts? Or do you want to see the patrol report?

Problem: Violates the "No Human Intervention Principle". Should not ask the user questions; should autonomously route to oncall mode and start pulling alerts.

Good Example

User: oncall
Agent: [read mode-oncall.md] -> [call PagerDuty API to pull triggered incidents]
      -> [deduplicate and correlate] -> [triage by severity] -> [dispatch diagnosis agents in parallel]
      -> [output structured incident_report] -> [Feishu notification]

Correct: Autonomously routes to oncall mode, executes the full diagnostic pipeline, no human intervention needed.

References

| File | Layer | Content |
|---|---|---|
| references/mode-oncall.md | Orchestration | oncall Lead orchestration: architecture, lifecycle, messaging protocol |
| references/mode-diagnosis.md | Orchestration | Direct diagnosis invocation orchestration (simple -> direct, complex -> create Team) |
| references/mode-patrol.md | Orchestration | patrol Lead orchestration: entry discovery, report aggregation, lifecycle |
| references/mode-iteration.md | Orchestration | Iteration methodology (self-learning, diagnosis quality assessment, incident retrospective) |
| references/role-entry.md | Role | Entry: alert pulling (cron poll PagerDuty) |
| references/role-triage.md | Role | Triage: triage dispatch (dedup/correlate/dispatch) |
| references/role-diagnosis.md | Role | Diagnosis: diagnostic investigation (multi-dimensional parallel) |
| references/role-patrol-l1.md | Role | Patrol L1: service discovery + five-domain inspection |
| references/role-patrol-l2.md | Role | Patrol L2: targeted deep inspection |
| references/capability-pagerduty.md | Capability | PagerDuty script usage |
| references/capability-feishu.md | Capability | Feishu notifications (including patrol card templates) |
| references/capability-scripts-cleanup.md | Capability | Temp script cleanup |
| references/infra-context.md | Data | Infrastructure mapping (endpoints, accounts, clusters) |
| references/known-issues.md | Data | Known issues database |
| references/report-standard.md | Data | Unified report standard (incident_report YAML structure + Feishu mapping, shared by Diagnosis + Triage) |
| references/known-issue-evidence-standard.md | Data | expected_evidence quality standard (shared by Triage + iteration mode) |
| references/patrol-playbook.md | Data | Patrol experience database |
| references/setup.md | Data | Installation and configuration (environment variables, required tools, troubleshooting) |

Capabilities

skill, source-addxai, skill-sre-agent, topic-agent-skills, topic-ai-agent, topic-ai-engineering, topic-claude-code, topic-code-review, topic-cursor, topic-devops, topic-enterprise, topic-sre, topic-windsurf

Install

Install: npx skills add addxai/enterprise-harness-engineering
Transport: skills-sh
Protocol: skill

Quality

0.46 / 1.00

Deterministic score of 0.46 from registry signals: indexed on github topic:agent-skills · 16 GitHub stars · SKILL.md body (8,096 chars)

Provenance

Indexed from: github
Enriched: 2026-04-22 01:02:12Z · deterministic:skill-github:v1 · v1
First seen: 2026-04-21
Last seen: 2026-04-22