
kubesphere-volcano

KubeSphere Volcano job management Skill. Use when user asks to create, list, update, delete Jobs (Volcano Jobs), manage Queues, create PyTorch/TensorFlow/MPI training jobs, or troubleshoot Volcano scheduling issues in KubeSphere. Includes built-in YAML templates, scheduling polic

Price: free
Protocol: skill
Verified: no

What it does

KubeSphere Volcano Management

Environment (this KubeSphere instance):

  • KubeSphere: Set KS_HOST environment variable (e.g., http://<kubesphere-host>:30880)
  • Username: admin (default)
  • Password: Set KS_PASSWORD environment variable
  • Clusters: Run kubectl get clusters or ks_api GET /kapis/cluster.kubesphere.io/v1alpha1/clusters
  • Volcano Extension: Run kubectl get extension volcano -n kubesphere-system or check via KubeSphere console

Use this skill for the full Volcano lifecycle in KubeSphere:

  • Create, list, update, delete Volcano Jobs
  • Manage Queues for job scheduling
  • Generate YAML templates for PyTorch, TensorFlow, MPI, and batch jobs
  • Troubleshoot job pending, pod creation, and scheduling issues

Out of scope by default:

  • Advanced Volcano CRDs not requested by the user, such as JobFlow, JobTemplate, or Command
  • Deep scheduler configuration without cluster evidence
  • Volcano system installation (assume extension is already installed)

If the user explicitly asks for those, acknowledge that they are Volcano capabilities but treat them as a follow-up task.

Response Rules

  • Prefer executable output: Job YAML, Queue YAML, kubectl commands, or a short ordered procedure.
  • Use kind: Job (not VolcanoJob), the correct resource type in this KubeSphere environment.
  • Use kubectl get jobs.batch.volcano.sh or short name vcjob/vj to query jobs.
  • For uninstall/delete requests, warn that deleting Job resources will terminate running pods.
  • Omit optional fields instead of guessing values.

IMPORTANT: The resource type is Job (short: vcjob, vj), NOT VolcanoJob. Use these names in all kubectl commands.

API Usage

This skill supports two API paths for Volcano operations. See the Discovery Commands section below for the helper functions and detailed commands.

API Endpoints

KubeSphere provides two API paths for Volcano Job:

Option 1: KubeSphere Extension API (/kapis) - Recommended

Uses volcanojobs resource name:

# Job CRUD (namespace-scoped)
GET    /kapis/batch.volcano.sh/v1alpha1/namespaces/{namespace}/volcanojobs
POST   /kapis/batch.volcano.sh/v1alpha1/namespaces/{namespace}/volcanojobs
GET    /kapis/batch.volcano.sh/v1alpha1/namespaces/{namespace}/volcanojobs/{name}
DELETE /kapis/batch.volcano.sh/v1alpha1/namespaces/{namespace}/volcanojobs/{name}

# Job CRUD (cluster-scoped)
GET    /kapis/batch.volcano.sh/v1alpha1/volcanojobs
DELETE /kapis/batch.volcano.sh/v1alpha1/volcanojobs

# PodGroup CRUD (namespace-scoped)
GET    /kapis/batch.volcano.sh/v1alpha1/namespaces/{namespace}/podgroups

# Queue CRUD (user-scoped)
GET    /kapis/scheduling.volcano.sh/v1beta1/users/{user}/queues
POST   /kapis/scheduling.volcano.sh/v1beta1/users/{user}/queues
DELETE /kapis/scheduling.volcano.sh/v1beta1/users/{user}/queues/{name}

Option 2: Kubernetes Native API (/apis)

Uses jobs resource name:

# Job CRUD (namespace-scoped)
GET    /apis/batch.volcano.sh/v1alpha1/namespaces/{namespace}/jobs
POST   /apis/batch.volcano.sh/v1alpha1/namespaces/{namespace}/jobs
GET    /apis/batch.volcano.sh/v1alpha1/namespaces/{namespace}/jobs/{name}
DELETE /apis/batch.volcano.sh/v1alpha1/namespaces/{namespace}/jobs/{name}

# Job CRUD (cluster-scoped)
GET    /apis/batch.volcano.sh/v1alpha1/jobs
DELETE /apis/batch.volcano.sh/v1alpha1/jobs

# Queue CRUD (cluster-scoped)
GET    /apis/scheduling.volcano.sh/v1beta1/queues
POST   /apis/scheduling.volcano.sh/v1beta1/queues
DELETE /apis/scheduling.volcano.sh/v1beta1/queues/{name}

Note: Both paths work. Use /kapis for KubeSphere extension API (multi-cluster support), use /apis for standard Kubernetes API.

Note: Queue API has two views:

  • /kapis/.../users/{user}/queues returns queues visible to that user (user-scoped)
  • /apis/.../queues returns all queues in the cluster (cluster-scoped)

# Query Parameters
page      - Page number (default: 1)
limit     - Items per page
ascending - Sort direction (default: false)
sortBy    - Sort field (e.g., createTime)

Discovery Commands

This section provides three approaches for querying Volcano status:

  1. KubeSphere API (curl) - for extension management and multi-cluster queries
  2. kubectl - for direct Kubernetes resource operations
  3. kubeconfig extraction - for querying member clusters with kubectl

Option 1: Using KubeSphere API (curl)

Environment Variables:

export KS_HOST="http://<kubesphere-host>:30880"  # KubeSphere console URL (required)
export KS_USERNAME="admin"                         # Username (default)
export KS_PASSWORD="<password>"                    # Password (optional if KS_TOKEN is set)
export KS_TOKEN="<token>"                          # Pre-generated OAuth token (optional, takes priority)

# Get OAuth token - prefer KS_TOKEN if set, otherwise use password
ks_token() {
  # Use KS_TOKEN if it's set and non-empty
  if [ -n "${KS_TOKEN}" ]; then
    echo "$KS_TOKEN"
    return
  fi
  
  # Fall back to password-based token
  if [ -z "${KS_PASSWORD}" ]; then
    echo "Error: KS_TOKEN or KS_PASSWORD must be set" >&2
    return 1
  fi
  
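  # URL-encode each form field so passwords with special characters survive;
  # "// empty" makes a missing access_token print nothing instead of "null"
  curl -s -X POST "${KS_HOST}/oauth/token" \
    -H "Content-Type: application/x-www-form-urlencoded" \
    --data-urlencode "grant_type=password" \
    --data-urlencode "username=${KS_USERNAME:-admin}" \
    --data-urlencode "password=${KS_PASSWORD}" \
    --data-urlencode "client_id=kubesphere" \
    --data-urlencode "client_secret=kubesphere" | jq -r '.access_token // empty'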
}

# Make API call (supports multi-cluster with optional cluster parameter)
ks_api() {
  local method=$1
  local path=$2
  local cluster=${3:-host}  # Default to host cluster. Pass 3rd arg for member clusters.
  local body=$4
  local token=$(ks_token)
  
  # Check if token is empty
  if [ -z "$token" ]; then
    echo "Error: Failed to obtain authentication token. Please check KS_TOKEN or KS_PASSWORD." >&2
    return 1
  fi
  
  # Prepend cluster path if not already present and not a user-scope path
  if [[ ! "$path" =~ ^/clusters/ ]] && [[ ! "$path" =~ ^/kapis/scheduling.volcano.sh/v1beta1/users ]]; then
    path="/clusters/${cluster}${path}"
  fi
  
  curl -s -X "$method" \
    -H "Authorization: Bearer $token" \
    -H "Content-Type: application/json" \
    ${body:+-d "$body"} \
    "${KS_HOST}$path"
}

Usage:

# Option 1: Use pre-generated token (recommended for automation)
export KS_HOST="http://<kubesphere-host>:30880"
export KS_TOKEN="eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."

# Option 2: Use password (will fetch token each time)
export KS_HOST="http://<kubesphere-host>:30880"
export KS_PASSWORD="your-password"

Query Commands:

# List Jobs in namespace (host cluster)
ks_api GET /kapis/batch.volcano.sh/v1alpha1/namespaces/default/volcanojobs

# List all Jobs (cluster-wide)
ks_api GET /kapis/batch.volcano.sh/v1alpha1/volcanojobs

# List Jobs in member cluster (specify cluster explicitly)
ks_api GET /kapis/batch.volcano.sh/v1alpha1/namespaces/default/volcanojobs member-4

# List Queues (user-scoped, no cluster prefix needed)
ks_api GET /kapis/scheduling.volcano.sh/v1beta1/users/admin/queues

# List PodGroups in namespace
ks_api GET /kapis/batch.volcano.sh/v1alpha1/namespaces/default/podgroups

# Check Volcano extension status
ks_api GET /kapis/kubesphere.io/v1alpha1/extensions/volcano

# List available Volcano extension versions
ks_api GET /kapis/kubesphere.io/v1alpha1/extensionversions | jq '.items[] | select(.metadata.name | contains("volcano"))'

# List clusters
ks_api GET /kapis/cluster.kubesphere.io/v1alpha1/clusters
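
The cluster-prefix rule that ks_api applies before each request can be looked at in isolation. The sketch below mirrors that rule with a case statement; ks_resolve_path is a hypothetical name introduced only for illustration, and it treats the dots in the user-scoped prefix literally rather than as regex wildcards.

```shell
# Hypothetical helper mirroring the path-prefix rule inside ks_api.
# Usage: ks_resolve_path <path> [cluster]
ks_resolve_path() {
  local path=$1
  local cluster=${2:-host}   # same default as ks_api
  case "$path" in
    /clusters/*) ;;                                    # already prefixed: leave as-is
    /kapis/scheduling.volcano.sh/v1beta1/users*) ;;    # user-scoped queues: no prefix
    *) path="/clusters/${cluster}${path}" ;;           # everything else: add cluster
  esac
  printf '%s\n' "$path"
}
```

Printing the resolved path before calling curl is a cheap way to confirm which cluster a request will reach.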

Option 2: Using kubectl (Direct Cluster Access)

# List all Volcano Jobs
kubectl get jobs.batch.volcano.sh -A
kubectl get vcjob -A
kubectl get vj -A

# List Jobs in specific namespace
kubectl get vcjob -n <namespace>

# List all Queues
kubectl get queue -A

# List all PodGroups
kubectl get podgroup -A

# Check Volcano CRDs
kubectl get crd | grep volcano

# Check Volcano extension
kubectl get extension volcano
kubectl get extensionversion | grep volcano

Option 3: Multi-Cluster Query (kubeconfig extraction)

For querying member clusters, extract the kubeconfig from the Cluster resource:

# Get kubeconfig for a member cluster
CLUSTER_NAME=member-4
KUBECONFIG_ENCODED=$(kubectl get cluster.cluster.kubesphere.io $CLUSTER_NAME -o jsonpath='{.spec.connection.kubeconfig}')
echo "$KUBECONFIG_ENCODED" | base64 -d > /tmp/${CLUSTER_NAME}-kubeconfig

# Query member cluster
export KUBECONFIG=/tmp/${CLUSTER_NAME}-kubeconfig
kubectl get jobs.batch.volcano.sh -A
kubectl get queue -A

# Switch back to host cluster
unset KUBECONFIG

When to Use Which Approach

Scenario                                  | Recommended Approach
Query KubeSphere extension status         | KubeSphere API (curl)
List available clusters                   | KubeSphere API (curl)
Query host cluster Kubernetes resources   | kubectl
Query member cluster Kubernetes resources | kubeconfig extraction
Create/apply Job/Queue manifests          | kubectl

Common Operations

List Jobs

# List in specific namespace
ks_api GET /kapis/batch.volcano.sh/v1alpha1/namespaces/default/volcanojobs

# List all (cluster-wide)
ks_api GET /kapis/batch.volcano.sh/v1alpha1/volcanojobs

# With kubectl
kubectl get jobs.batch.volcano.sh -A
kubectl get vcjob -A
kubectl get vj -A
kubectl get vcjob -n <namespace>

Get Job Details

# Via API
ks_api GET /kapis/batch.volcano.sh/v1alpha1/namespaces/default/volcanojobs/my-job

# Via kubectl
kubectl get vcjob my-job -n <namespace> -o yaml
kubectl describe vcjob my-job -n <namespace>

# View logs (get pod name first, then logs)
kubectl get pods -n <namespace> -l volcano.sh/job-name=my-job
kubectl logs <pod-name> -n <namespace>
kubectl logs -f <pod-name> -n <namespace>  # follow mode

Create Job

# Via API (POST with JSON body - use heredoc for readability)
read -r -d '' JOB_JSON <<'EOF'
{
  "apiVersion": "batch.volcano.sh/v1alpha1",
  "kind": "Job",
  "metadata": {"name": "my-job", "namespace": "default"},
  "spec": {
    "schedulerName": "volcano",
    "queue": "default",
    "tasks": [{
      "replicas": 1,
      "name": "worker",
      "template": {
        "spec": {
          "containers": [{"name": "job", "image": "busybox", "command": ["echo", "hello"]}],
          "restartPolicy": "Never"
        }
      }
    }]
  }
}
EOF
# ks_api takes the cluster as its 3rd argument and the request body as its 4th
ks_api POST /kapis/batch.volcano.sh/v1alpha1/namespaces/default/volcanojobs host "$JOB_JSON"

# Via kubectl (apply YAML)
kubectl apply -f - <<'EOF'
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: my-job
  namespace: default
spec:
  schedulerName: volcano
  queue: default
  tasks:
    - replicas: 1
      name: worker
      template:
        spec:
          containers:
            - name: job
              image: busybox
              command:
                - echo
                - hello
          restartPolicy: Never
EOF

Delete Job

⚠️ WARNING: Deleting a Job will terminate all running pods associated with it. This action cannot be undone.

# Via API
ks_api DELETE /kapis/batch.volcano.sh/v1alpha1/namespaces/default/volcanojobs/my-job

# Via kubectl
kubectl delete vcjob my-job -n <namespace>

List Queues

# Via API
ks_api GET /kapis/scheduling.volcano.sh/v1beta1/users/admin/queues

# Via kubectl
kubectl get queue -A

Create Queue

# Via kubectl (using YAML)
kubectl apply -f - <<'EOF'
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-queue
spec:
  weight: 50
  capability:
    cpu: "16"
    memory: "64Gi"
EOF

Update Queue

# Via kubectl
kubectl patch queue ml-queue -p '{"spec":{"weight":60}}' --type merge

Delete Queue

# Via API
ks_api DELETE /kapis/scheduling.volcano.sh/v1beta1/users/admin/queues/ml-queue

# Via kubectl
kubectl delete queue ml-queue

Built-in YAML Templates

Prerequisite: The PyTorch and TensorFlow templates reference claimName: pvc-name. Ensure that PVC exists in the Job's namespace before applying the Job, and create it first if needed.
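
One way to create the PVC is sketched below. render_pvc is a hypothetical helper, not part of the skill. The 10Gi size and the ReadWriteMany access mode are assumptions: ReadWriteMany is chosen because master and worker pods may mount the claim from different nodes, which requires a storage class that supports it. storageClassName is omitted so the cluster default applies.

```shell
# Hypothetical helper: print a PVC manifest to stdout.
# Usage: render_pvc [name] [namespace] [size]
render_pvc() {
  local name=${1:-pvc-name}        # matches claimName in the templates
  local namespace=${2:-default}
  local size=${3:-10Gi}            # assumed size; adjust to the workload
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${name}
  namespace: ${namespace}
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: ${size}
EOF
}
```

To create the claim, pipe the output to kubectl: render_pvc pvc-name default 10Gi | kubectl apply -f -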

1. PyTorch Distributed Training Job

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: pytorch-distributed-training
  namespace: default
spec:
  schedulerName: volcano
  minAvailable: 3
  queue: default
  tasks:
    - replicas: 1
      name: master
      template:
        metadata:
          labels:
            app: pytorch
            role: master
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime
              command:
                - bash
                - -c
                - |
                  # VC_TASK_INDEX is injected by Volcano's env plugin (enable it under spec.plugins)
                  export RANK=$((${VC_TASK_INDEX:-0}))
                  export WORLD_SIZE=3
                  python /workspace/train.py
              env:
                - name: MASTER_ADDR
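                  value: pytorch-distributed-training-master-0.pytorch-distributed-training  # "$(HOSTNAME)" is not expanded unless HOSTNAME is declared in env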
                - name: MASTER_PORT
                  value: "29500"
              resources:
                requests:
                  cpu: "2"
                  memory: "4Gi"
                limits:
                  cpu: "2"
                  memory: "4Gi"
              volumeMounts:
                - name: workspace
                  mountPath: /workspace
          volumes:
            - name: workspace
              persistentVolumeClaim:
                claimName: pvc-name
          restartPolicy: Never
    - replicas: 2
      name: worker
      template:
        metadata:
          labels:
            app: pytorch
            role: worker
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime
              command:
                - bash
                - -c
                - |
                  # VC_TASK_INDEX is injected by Volcano's env plugin (enable it under spec.plugins)
                  export RANK=$((${VC_TASK_INDEX:-0} + 1))
                  export WORLD_SIZE=3
                  python /workspace/train.py
              env:
                - name: MASTER_ADDR
                  value: pytorch-distributed-training-master-0.pytorch-distributed-training  # Format: {job}-{task}-{index}.{job}
                - name: MASTER_PORT
                  value: "29500"
              resources:
                requests:
                  cpu: "2"
                  memory: "4Gi"
                limits:
                  cpu: "2"
                  memory: "4Gi"
              volumeMounts:
                - name: workspace
                  mountPath: /workspace
          volumes:
            - name: workspace
              persistentVolumeClaim:
                claimName: pvc-name
          restartPolicy: Never
  policies:
    - event: TaskCompleted
      action: CompleteJob
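
The rank arithmetic used in the PyTorch template can be checked outside the cluster. The sketch below reproduces it under the template's layout of one master task followed by the worker task; torch_rank and torch_world_size are hypothetical names introduced for illustration.

```shell
# Hypothetical helpers reproducing the template's RANK / WORLD_SIZE arithmetic.
# Usage: torch_rank <master|worker> <task-index>
torch_rank() {
  local role=$1
  local index=${2:-0}
  if [ "$role" = "master" ]; then
    echo "$index"            # the single master keeps its task index (rank 0)
  else
    echo $((index + 1))      # workers are numbered after the master
  fi
}

# Usage: torch_world_size <master-replicas> <worker-replicas>
torch_world_size() {
  local masters=${1:-1}
  local workers=${2:-2}
  echo $((masters + workers))
}
```

With the template's 1 master and 2 workers, the ranks come out as 0, 1 and 2 and the world size as 3, which is the hard-coded WORLD_SIZE value. If the replica counts change, WORLD_SIZE has to change with them.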

2. TensorFlow Training Job

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tf-training
  namespace: default
spec:
  schedulerName: volcano
  minAvailable: 2
  queue: default
  tasks:
    - replicas: 1
      name: ps
      template:
        metadata:
          labels:
            app: tf
            role: ps
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.14.0-gpu
              command:
                - python
                - /workspace/train.py
              resources:
                requests:
                  cpu: "2"
                  memory: "4Gi"
                  nvidia.com/gpu: "1"
                limits:
                  cpu: "2"
                  memory: "4Gi"
                  nvidia.com/gpu: "1"
              volumeMounts:
                - name: workspace
                  mountPath: /workspace
          volumes:
            - name: workspace
              persistentVolumeClaim:
                claimName: pvc-name
          restartPolicy: Never
    - replicas: 1
      name: worker
      template:
        metadata:
          labels:
            app: tf
            role: worker
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.14.0-gpu
              command:
                - python
                - /workspace/train.py
              resources:
                requests:
                  cpu: "2"
                  memory: "4Gi"
                  nvidia.com/gpu: "1"
                limits:
                  cpu: "2"
                  memory: "4Gi"
                  nvidia.com/gpu: "1"
              volumeMounts:
                - name: workspace
                  mountPath: /workspace
          volumes:
            - name: workspace
              persistentVolumeClaim:
                claimName: pvc-name
          restartPolicy: Never

3. MPI Job

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mpi-job
  namespace: default
spec:
  schedulerName: volcano
  minAvailable: 3
  queue: default
  tasks:
    - replicas: 1
      name: launcher
      template:
        metadata:
          labels:
            app: mpi
            role: launcher
        spec:
          containers:
            - name: mpi
              image: mpioperator/mpich:latest
              command:
                - mpirun
                - -np
                - "2"
                - ./run.sh
              resources:
                requests:
                  cpu: "1"
                  memory: "2Gi"
                limits:
                  cpu: "1"
                  memory: "2Gi"
          restartPolicy: Never
    - replicas: 2
      name: worker
      template:
        metadata:
          labels:
            app: mpi
            role: worker
        spec:
          containers:
            - name: mpi
              image: mpioperator/mpich:latest
              resources:
                requests:
                  cpu: "2"
                  memory: "4Gi"
                limits:
                  cpu: "2"
                  memory: "4Gi"
          restartPolicy: Never

4. Simple Batch Job

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: batch-job
  namespace: default
spec:
  schedulerName: volcano
  minAvailable: 1
  queue: default
  tasks:
    - replicas: 4
      name: worker
      template:
        metadata:
          labels:
            app: batch
        spec:
          containers:
            - name: job
              image: busybox:latest
              command:
                - sh
                - -c
                - |
                  echo "Processing batch data..."
                  sleep 30
                  echo "Done"
              resources:
                requests:
                  cpu: "1"
                  memory: "1Gi"
                limits:
                  cpu: "1"
                  memory: "1Gi"
          restartPolicy: Never

5. Queue Configuration

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-queue
spec:
  weight: 50
  capability:
    cpu: "16"
    memory: "64Gi"
  # Optional: resource quota (when using hierarchy)
  # quota:
  #   minResource:
  #     cpu: "8"
  #     memory: "32Gi"
  #   maxResource:
  #     cpu: "32"
  #     memory: "128Gi"

Best Practices

Resource Configuration

Workload Type       | CPU             | Memory            | GPU             | Notes
PyTorch Training    | 2-4 per replica | 4-8Gi per replica | 1-2 per replica | Use GPU instances
TensorFlow Training | 2-4 per replica | 4-8Gi per replica | 1 per replica   | Match GPU to model size
MPI Job             | 1-2 per rank    | 2-4Gi per rank    | Optional        | Minimize network latency
Batch Processing    | 1-2 per task    | 1-4Gi per task    | None            | Scale horizontally

Scheduling Recommendations

  1. Use appropriate queue - Create queues for different workload priorities:

    • default - Regular jobs
    • ml-queue - ML training jobs (higher priority)
    • low-priority - Batch jobs that can wait
  2. Set proper minAvailable - Ensure enough pods are ready before job starts:

    • Distributed training: set to replicas - 1 (allow 1 failure)
    • Batch jobs: set to 1 or replicas based on requirements
  3. Configure retry policies - Add policies for job recovery:

    policies:
      - event: TaskFailed
        action: RestartTask
      - event: PodEvicted
        action: RestartTask
    
  4. Use appropriate restart policy:

    • Never - For distributed jobs where restarts create new pods
    • OnFailure - For jobs that can recover from failures
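
The minAvailable guidance in item 2 can be written as a small calculation. This is a sketch only: min_available is a hypothetical helper, and falling back to the full replica count for unrecognized workload kinds (strict gang scheduling) is an assumption rather than something the guidance states.

```shell
# Hypothetical helper: suggest a minAvailable value.
# Usage: min_available <distributed|batch|other> <total-replicas>
min_available() {
  local kind=$1
  local replicas=$2
  case "$kind" in
    distributed)
      # replicas - 1 tolerates one failure; never go below 1
      if [ "$replicas" -gt 1 ]; then echo $((replicas - 1)); else echo 1; fi ;;
    batch)
      echo 1 ;;              # start as soon as one pod can run
    *)
      echo "$replicas" ;;    # assumed fallback: require every pod
  esac
}
```

Note that the built-in distributed templates set minAvailable to the full replica count, which is the stricter choice; pick whichever matches how the training framework handles a missing worker.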

Troubleshooting

# Check Job status
kubectl get vcjob <name> -n <namespace>
kubectl describe vcjob <name> -n <namespace>

# Check PodGroup status
kubectl get podgroup <name> -n <namespace> -o yaml

# Check pods created by Job
kubectl get pods -n <namespace> -l volcano.sh/job-name=<name>

# Check volcano system pods
kubectl get pods -n volcano-system

# List all Volcano Jobs
kubectl get jobs.batch.volcano.sh -A

# List Queues
kubectl get queue -A

# Common issues:
# 1. Job pending - check PodGroup status and events
# 2. Pods not created - check scheduler and queue resources
# 3. Pods evicted - check queue capacity and priority
# 4. Job not creating pods - check volcano controller is running in volcano-system namespace
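
A first triage step can be keyed on the PodGroup phase. The sketch below is illustrative only: next_check_for_phase is a hypothetical helper, and the pairing of each phase with one of the checks listed above is an assumption, not behavior defined by Volcano.

```shell
# Hypothetical triage helper: map a PodGroup phase to the next check to run.
# The phase can be read with:
#   kubectl get podgroup <name> -n <namespace> -o jsonpath='{.status.phase}'
next_check_for_phase() {
  case "$1" in
    Pending) echo "check PodGroup events and queue capacity" ;;
    Inqueue) echo "check scheduler pods and node resources" ;;
    Running) echo "check pod status and logs" ;;
    "")      echo "no PodGroup found: check the volcano controller in volcano-system" ;;
    *)       echo "inspect the PodGroup with -o yaml" ;;
  esac
}
```

An empty phase usually means the PodGroup was never created, which points at the controller rather than the scheduler.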

Response Patterns

Match output to user intent:

Request Type    | Output
Create job      | YAML manifest + kubectl apply command + verification
List jobs       | kubectl/ks_api command + explanation
Get job details | kubectl describe + relevant sections
Delete job      | kubectl delete command + confirmation
Show templates  | Template with placeholders + usage notes
Troubleshooting | Diagnostic commands first, then causes and solutions
Best practices  | Context-aware recommendations

Workload Type Detection

When user describes a job, infer the type:

User Says                                                    | Use Template
"training", "train", "machine learning", "ml", "ai training" | PyTorch or TensorFlow
"distributed training"                                       | PyTorch Distributed
"MPI", "mpi"                                                 | MPI Job
"batch", "batch processing"                                  | Simple Batch Job
"queue"                                                      | Queue Configuration

Apply the template with appropriate modifications based on user requirements.
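
The keyword table can be approximated with a case statement. This is a rough sketch: detect_template is a hypothetical helper, and the precedence (most specific phrase first, generic training last) is an assumption, since the table does not say how to resolve a request that matches several rows.

```shell
# Hypothetical helper: pick a template name from a free-text request.
# Usage: detect_template "<user request>"
detect_template() {
  local text
  # Lowercase the text and turn punctuation into spaces so whole-word patterns match
  text=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -c '[:alnum:]' ' ')
  text=" $text "
  case "$text" in
    *" distributed training "*)                        echo "PyTorch Distributed" ;;
    *" mpi "*)                                         echo "MPI Job" ;;
    *" batch "*)                                       echo "Simple Batch Job" ;;
    *" queue "*)                                       echo "Queue Configuration" ;;
    *train*|*" machine learning "*|*" ml "*)           echo "PyTorch or TensorFlow" ;;
    *)                                                 echo "unknown" ;;
  esac
}
```

When the result is "PyTorch or TensorFlow" or "unknown", asking the user which framework they use is safer than guessing.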

Capabilities

skill, source-kubesphere, skill-kubesphere-volcano, topic-agent-skills, topic-cloud-native, topic-cncf, topic-devops, topic-ebpf, topic-hacktoberfest, topic-kubernetes, topic-kubesphere, topic-llm, topic-multi-cluster, topic-multi-tenancy, topic-observability

Quality

0.70 / 1.00

deterministic score 0.70 from registry signals: · indexed on github topic:agent-skills · 16910 github stars · SKILL.md body (21,999 chars)

Provenance

Indexed from: github
Enriched: 2026-04-22 06:52:45Z · deterministic:skill-github:v1 · v1
First seen: 2026-04-20
Last seen: 2026-04-22
