{"id":"5739fe0b-fc27-4619-8450-ff71199fd703","shortId":"W3KxFK","kind":"skill","title":"hugging-face-evaluation","tagline":"Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.","description":"# Overview\nThis skill provides tools to add structured evaluation results to Hugging Face model cards. It supports multiple methods for adding evaluation data:\n- Extracting existing evaluation tables from README content\n- Importing benchmark scores from Artificial Analysis\n- Running custom model evaluations with vLLM or accelerate backends (lighteval/inspect-ai)\n\n## When to Use\n- You need to add structured evaluation results to a Hugging Face model card.\n- You want to import benchmark data or run custom evaluations with vLLM, lighteval, or inspect-ai.\n- You are preparing leaderboard-compatible `model-index` metadata for a model release.\n\n## Integration with HF Ecosystem\n- **Model Cards**: Updates model-index metadata for leaderboard integration\n- **Artificial Analysis**: Direct API integration for benchmark imports\n- **Papers with Code**: Compatible with their model-index specification\n- **Jobs**: Run evaluations directly on Hugging Face Jobs with `uv` integration\n- **vLLM**: Efficient GPU inference for custom model evaluation\n- **lighteval**: HuggingFace's evaluation library with vLLM/accelerate backends\n- **inspect-ai**: UK AI Safety Institute's evaluation framework\n\n# Version\n1.3.0\n\n# Dependencies\n\n## Core Dependencies\n- huggingface_hub>=0.26.0\n- markdown-it-py>=3.0.0\n- python-dotenv>=1.2.1\n- pyyaml>=6.0.3\n- requests>=2.32.5\n- re (built-in)\n\n## Inference Provider Evaluation\n- inspect-ai>=0.3.0\n- inspect-evals\n- openai\n\n## vLLM Custom Model Evaluation (GPU required)\n- lighteval[accelerate,vllm]>=0.6.0\n- vllm>=0.4.0\n- 
torch>=2.0.0\n- transformers>=4.40.0\n- accelerate>=0.30.0\n\nNote: vLLM dependencies are installed automatically via PEP 723 script headers when using `uv run`.\n\n# IMPORTANT: Using This Skill\n\n## ⚠️ CRITICAL: Check for Existing PRs Before Creating New Ones\n\n**Before creating ANY pull request with `--create-pr`, you MUST check for existing open PRs:**\n\n```bash\nuv run scripts/evaluation_manager.py get-prs --repo-id \"username/model-name\"\n```\n\n**If open PRs exist:**\n1. **DO NOT create a new PR** - this creates duplicate work for maintainers\n2. **Warn the user** that open PRs already exist\n3. **Show the user** the existing PR URLs so they can review them\n4. Only proceed if the user explicitly confirms they want to create another PR\n\nThis prevents spamming model repositories with duplicate evaluation PRs.\n\n---\n\n> **All paths are relative to the directory containing this SKILL.md\nfile.**\n> Before running any script, first `cd` to that directory or use the full\npath.\n\n**Use `--help` for the latest workflow guidance.** Works with plain Python or `uv run`:\n```bash\nuv run scripts/evaluation_manager.py --help\nuv run scripts/evaluation_manager.py inspect-tables --help\nuv run scripts/evaluation_manager.py extract-readme --help\n```\nKey workflow (matches CLI help):\n\n1) `get-prs` → check for existing open PRs first\n2) `inspect-tables` → find table numbers/columns  \n3) `extract-readme --table N` → prints YAML by default  \n4) add `--apply` (push) or `--create-pr` to write changes\n\n# Core Capabilities\n\n## 1. 
Inspect and Extract Evaluation Tables from README\n- **Inspect Tables**: Use `inspect-tables` to see all tables in a README with structure, columns, and sample rows\n- **Parse Markdown Tables**: Accurate parsing using markdown-it-py (ignores code blocks and examples)\n- **Table Selection**: Use `--table N` to extract from a specific table (required when multiple tables exist)\n- **Format Detection**: Recognize common formats (benchmarks as rows, columns, or comparison tables with multiple models)\n- **Column Matching**: Automatically identify model columns/rows; prefer `--model-column-index` (index from inspect output). Use `--model-name-override` only with exact column header text.\n- **YAML Generation**: Convert selected table to model-index YAML format\n- **Task Typing**: `--task-type` sets the `task.type` field in model-index output (e.g., `text-generation`, `summarization`)\n\n## 2. Import from Artificial Analysis\n- **API Integration**: Fetch benchmark scores directly from Artificial Analysis\n- **Automatic Formatting**: Convert API responses to model-index format\n- **Metadata Preservation**: Maintain source attribution and URLs\n- **PR Creation**: Automatically create pull requests with evaluation updates\n\n## 3. Model-Index Management\n- **YAML Generation**: Create properly formatted model-index entries\n- **Merge Support**: Add evaluations to existing model cards without overwriting\n- **Validation**: Ensure compliance with Papers with Code specification\n- **Batch Operations**: Process multiple models efficiently\n\n## 4. 
Run Evaluations on HF Jobs (Inference Providers)\n- **Inspect-AI Integration**: Run standard evaluations using the `inspect-ai` library\n- **UV Integration**: Seamlessly run Python scripts with ephemeral dependencies on HF infrastructure\n- **Zero-Config**: No Dockerfiles or Space management required\n- **Hardware Selection**: Configure CPU or GPU hardware for the evaluation job\n- **Secure Execution**: Handles API tokens safely via secrets passed through the CLI\n\n## 5. Run Custom Model Evaluations with vLLM (NEW)\n\n⚠️ **Important:** This approach requires a machine with `uv` installed and sufficient GPU memory.\n**Benefits:** No need for the `hf_jobs()` MCP tool; scripts run directly in the terminal.\n**When to use:** The user is working directly on a local machine with an available GPU.\n\n### Before running the script\n\n- Check that the script path is correct\n- Check that `uv` is installed\n- Check that a GPU is available with `nvidia-smi`\n\n### Running the script\n\n```bash\nuv run scripts/lighteval_vllm_uv.py \\\n  --model meta-llama/Llama-3.2-1B \\\n  --tasks \"leaderboard|mmlu|5\"\n```\n\n### Features\n\n- **vLLM Backend**: High-performance GPU inference (5-10x faster than standard HF methods)\n- **lighteval Framework**: HuggingFace's evaluation library with Open LLM Leaderboard tasks\n- **inspect-ai Framework**: UK AI Safety Institute's evaluation library\n- **Standalone or Jobs**: Run locally or submit to HF Jobs infrastructure\n\n# Usage Instructions\n\nThe skill includes Python scripts in `scripts/` to perform operations.\n\n### Prerequisites\n- Preferred: use `uv run` (PEP 723 header auto-installs deps)\n- Or install manually: `pip install huggingface-hub markdown-it-py python-dotenv pyyaml requests`\n- Set `HF_TOKEN` environment variable with Write-access token\n- For Artificial Analysis: Set `AA_API_KEY` environment variable\n- `.env` is loaded automatically if `python-dotenv` is installed\n\n### Method 1: Extract from README (CLI workflow)\n\nRecommended flow (matches `--help`):\n```bash\n# 1) Inspect tables to get table numbers and column hints\nuv 
run scripts/evaluation_manager.py inspect-tables --repo-id \"username/model\"\n\n# 2) Extract a specific table (prints YAML by default)\nuv run scripts/evaluation_manager.py extract-readme \\\n  --repo-id \"username/model\" \\\n  --table 1 \\\n  [--model-column-index <column index shown by inspect-tables>] \\\n  [--model-name-override \"<column header/model name>\"]  # use exact header text if you can't use the index\n\n# 3) Apply changes (push or PR)\nuv run scripts/evaluation_manager.py extract-readme \\\n  --repo-id \"username/model\" \\\n  --table 1 \\\n  --apply       # push directly\n# or\nuv run scripts/evaluation_manager.py extract-readme \\\n  --repo-id \"username/model\" \\\n  --table 1 \\\n  --create-pr   # open a PR\n```\n\nValidation checklist:\n- YAML is printed by default; compare against the README table before applying.\n- Prefer `--model-column-index`; if using `--model-name-override`, the column header text must be exact.\n- For transposed tables (models as rows), ensure only one row is extracted.\n\n### Method 2: Import from Artificial Analysis\n\nFetch benchmark scores from Artificial Analysis API and add them to a model card.\n\n**Basic Usage:**\n```bash\nAA_API_KEY=\"your-api-key\" uv run scripts/evaluation_manager.py import-aa \\\n  --creator-slug \"anthropic\" \\\n  --model-name \"claude-sonnet-4\" \\\n  --repo-id \"username/model-name\"\n```\n\n**With Environment File:**\n```bash\n# Create .env file\necho \"AA_API_KEY=your-api-key\" >> .env\necho \"HF_TOKEN=your-hf-token\" >> .env\n\n# Run import\nuv run scripts/evaluation_manager.py import-aa \\\n  --creator-slug \"anthropic\" \\\n  --model-name \"claude-sonnet-4\" \\\n  --repo-id \"username/model-name\"\n```\n\n**Create Pull Request:**\n```bash\nuv run scripts/evaluation_manager.py import-aa \\\n  --creator-slug \"anthropic\" \\\n  --model-name \"claude-sonnet-4\" \\\n  --repo-id \"username/model-name\" \\\n  --create-pr\n```\n\n### Method 3: Run Evaluation Job\n\nSubmit an evaluation 
job on Hugging Face infrastructure using the `hf jobs uv run` CLI.\n\n**Direct CLI Usage:**\n```bash\nHF_TOKEN=$HF_TOKEN \\\nhf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \\\n  --flavor cpu-basic \\\n  --secrets HF_TOKEN=$HF_TOKEN \\\n  -- --model \"meta-llama/Llama-2-7b-hf\" \\\n     --task \"mmlu\"\n```\n\n**GPU Example (A10G):**\n```bash\nHF_TOKEN=$HF_TOKEN \\\nhf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \\\n  --flavor a10g-small \\\n  --secrets HF_TOKEN=$HF_TOKEN \\\n  -- --model \"meta-llama/Llama-2-7b-hf\" \\\n     --task \"gsm8k\"\n```\n\n**Python Helper (optional):**\n```bash\nuv run scripts/run_eval_job.py \\\n  --model \"meta-llama/Llama-2-7b-hf\" \\\n  --task \"mmlu\" \\\n  --hardware \"t4-small\"\n```\n\n### Method 4: Run Custom Model Evaluation with vLLM\n\nEvaluate custom HuggingFace models directly on GPU using vLLM or accelerate backends. These scripts are **separate from inference provider scripts** and run models locally on the job's hardware.\n\n#### When to Use vLLM Evaluation (vs Inference Providers)\n\n| Feature | vLLM Scripts | Inference Provider Scripts |\n|---------|-------------|---------------------------|\n| Model access | Any HF model | Models with API endpoints |\n| Hardware | Your GPU (or HF Jobs GPU) | Provider's infrastructure |\n| Cost | HF Jobs compute cost | API usage fees |\n| Speed | vLLM optimized | Depends on provider |\n| Offline | Yes (after download) | No |\n\n#### Option A: lighteval with vLLM Backend\n\nlighteval is HuggingFace's evaluation library, supporting Open LLM Leaderboard tasks.\n\n**Standalone (local GPU):**\n```bash\n# Run MMLU 5-shot with vLLM\nuv run scripts/lighteval_vllm_uv.py \\\n  --model meta-llama/Llama-3.2-1B \\\n  --tasks \"leaderboard|mmlu|5\"\n\n# Run multiple tasks\nuv run scripts/lighteval_vllm_uv.py \\\n  --model meta-llama/Llama-3.2-1B \\\n  --tasks \"leaderboard|mmlu|5,leaderboard|gsm8k|5\"\n\n# Use accelerate backend instead of vLLM\nuv run scripts/lighteval_vllm_uv.py 
\\\n  --model meta-llama/Llama-3.2-1B \\\n  --tasks \"leaderboard|mmlu|5\" \\\n  --backend accelerate\n\n# Chat/instruction-tuned models\nuv run scripts/lighteval_vllm_uv.py \\\n  --model meta-llama/Llama-3.2-1B-Instruct \\\n  --tasks \"leaderboard|mmlu|5\" \\\n  --use-chat-template\n```\n\n**Via HF Jobs:**\n```bash\nhf jobs uv run scripts/lighteval_vllm_uv.py \\\n  --flavor a10g-small \\\n  --secrets HF_TOKEN=$HF_TOKEN \\\n  -- --model meta-llama/Llama-3.2-1B \\\n     --tasks \"leaderboard|mmlu|5\"\n```\n\n**lighteval Task Format:**\nTasks use the format `suite|task|num_fewshot`:\n- `leaderboard|mmlu|5` - MMLU with 5-shot\n- `leaderboard|gsm8k|5` - GSM8K with 5-shot\n- `lighteval|hellaswag|0` - HellaSwag zero-shot\n- `leaderboard|arc_challenge|25` - ARC-Challenge with 25-shot\n\n**Finding Available Tasks:**\nThe complete list of available lighteval tasks can be found at:\nhttps://github.com/huggingface/lighteval/blob/main/examples/tasks/all_tasks.txt\n\nThis file contains all supported tasks in the format `suite|task|num_fewshot|0` (the trailing `0` is a version flag and can be ignored). Common suites include:\n- `leaderboard` - Open LLM Leaderboard tasks (MMLU, GSM8K, ARC, HellaSwag, etc.)\n- `lighteval` - Additional lighteval tasks\n- `bigbench` - BigBench tasks\n- `original` - Original benchmark tasks\n\nTo use a task from the list, extract the `suite|task|num_fewshot` portion (without the trailing `0`) and pass it to the `--tasks` parameter. 
For example:\n- From file: `leaderboard|mmlu|0` → Use: `leaderboard|mmlu|0` (or change to `5` for 5-shot)\n- From file: `bigbench|abstract_narrative_understanding|0` → Use: `bigbench|abstract_narrative_understanding|0`\n- From file: `lighteval|wmt14:hi-en|0` → Use: `lighteval|wmt14:hi-en|0`\n\nMultiple tasks can be specified as comma-separated values: `--tasks \"leaderboard|mmlu|5,leaderboard|gsm8k|5\"`\n\n#### Option B: inspect-ai with vLLM Backend\n\ninspect-ai is the UK AI Safety Institute's evaluation framework.\n\n**Standalone (local GPU):**\n```bash\n# Run MMLU with vLLM\nuv run scripts/inspect_vllm_uv.py \\\n  --model meta-llama/Llama-3.2-1B \\\n  --task mmlu\n\n# Use HuggingFace Transformers backend\nuv run scripts/inspect_vllm_uv.py \\\n  --model meta-llama/Llama-3.2-1B \\\n  --task mmlu \\\n  --backend hf\n\n# Multi-GPU with tensor parallelism\nuv run scripts/inspect_vllm_uv.py \\\n  --model meta-llama/Llama-3.2-70B \\\n  --task mmlu \\\n  --tensor-parallel-size 4\n```\n\n**Via HF Jobs:**\n```bash\nhf jobs uv run scripts/inspect_vllm_uv.py \\\n  --flavor a10g-small \\\n  --secrets HF_TOKEN=$HF_TOKEN \\\n  -- --model meta-llama/Llama-3.2-1B \\\n     --task mmlu\n```\n\n**Available inspect-ai Tasks:**\n- `mmlu` - Massive Multitask Language Understanding\n- `gsm8k` - Grade School Math\n- `hellaswag` - Common sense reasoning\n- `arc_challenge` - AI2 Reasoning Challenge\n- `truthfulqa` - TruthfulQA benchmark\n- `winogrande` - Winograd Schema Challenge\n- `humaneval` - Code generation\n\n#### Option C: Python Helper Script\n\nThe helper script auto-selects hardware and simplifies job submission:\n\n```bash\n# Auto-detect hardware based on model size\nuv run scripts/run_vllm_eval_job.py \\\n  --model meta-llama/Llama-3.2-1B \\\n  --task \"leaderboard|mmlu|5\" \\\n  --framework lighteval\n\n# Explicit hardware selection\nuv run scripts/run_vllm_eval_job.py \\\n  --model meta-llama/Llama-3.2-70B \\\n  --task mmlu \\\n  --framework inspect \\\n  --hardware 
a100-large \\\n  --tensor-parallel-size 4\n\n# Use HF Transformers backend\nuv run scripts/run_vllm_eval_job.py \\\n  --model microsoft/phi-2 \\\n  --task mmlu \\\n  --framework inspect \\\n  --backend hf\n```\n\n**Hardware Recommendations:**\n| Model Size | Recommended Hardware |\n|------------|---------------------|\n| < 3B params | `t4-small` |\n| 3B - 13B | `a10g-small` |\n| 13B - 34B | `a10g-large` |\n| 34B+ | `a100-large` |\n\n### Commands Reference\n\n**Top-level help and version:**\n```bash\nuv run scripts/evaluation_manager.py --help\nuv run scripts/evaluation_manager.py --version\n```\n\n**Inspect Tables (start here):**\n```bash\nuv run scripts/evaluation_manager.py inspect-tables --repo-id \"username/model-name\"\n```\n\n**Extract from README:**\n```bash\nuv run scripts/evaluation_manager.py extract-readme \\\n  --repo-id \"username/model-name\" \\\n  --table N \\\n  [--model-column-index N] \\\n  [--model-name-override \"Exact Column Header or Model Name\"] \\\n  [--task-type \"text-generation\"] \\\n  [--dataset-name \"Custom Benchmarks\"] \\\n  [--apply | --create-pr]\n```\n\n**Import from Artificial Analysis:**\n```bash\nAA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \\\n  --creator-slug \"creator-name\" \\\n  --model-name \"model-slug\" \\\n  --repo-id \"username/model-name\" \\\n  [--create-pr]\n```\n\n**View / Validate:**\n```bash\nuv run scripts/evaluation_manager.py show --repo-id \"username/model-name\"\nuv run scripts/evaluation_manager.py validate --repo-id \"username/model-name\"\n```\n\n**Check Open PRs (ALWAYS run before --create-pr):**\n```bash\nuv run scripts/evaluation_manager.py get-prs --repo-id \"username/model-name\"\n```\nLists all open pull requests for the model repository. 
Shows PR number, title, author, date, and URL.\n\n**Run Evaluation Job (Inference Providers):**\n```bash\nhf jobs uv run scripts/inspect_eval_uv.py \\\n  --flavor \"cpu-basic|t4-small|...\" \\\n  --secrets HF_TOKEN=$HF_TOKEN \\\n  -- --model \"model-id\" \\\n     --task \"task-name\"\n```\n\nor use the Python helper:\n\n```bash\nuv run scripts/run_eval_job.py \\\n  --model \"model-id\" \\\n  --task \"task-name\" \\\n  --hardware \"cpu-basic|t4-small|...\"\n```\n\n**Run vLLM Evaluation (Custom Models):**\n```bash\n# lighteval with vLLM\nhf jobs uv run scripts/lighteval_vllm_uv.py \\\n  --flavor \"a10g-small\" \\\n  --secrets HF_TOKEN=$HF_TOKEN \\\n  -- --model \"model-id\" \\\n     --tasks \"leaderboard|mmlu|5\"\n\n# inspect-ai with vLLM\nhf jobs uv run scripts/inspect_vllm_uv.py \\\n  --flavor \"a10g-small\" \\\n  --secrets HF_TOKEN=$HF_TOKEN \\\n  -- --model \"model-id\" \\\n     --task \"mmlu\"\n\n# Helper script (auto hardware selection)\nuv run scripts/run_vllm_eval_job.py \\\n  --model \"model-id\" \\\n  --task \"leaderboard|mmlu|5\" \\\n  --framework lighteval\n```\n\n### Model-Index Format\n\nThe generated model-index follows this structure:\n\n```yaml\nmodel-index:\n  - name: Model Name\n    results:\n      - task:\n          type: text-generation\n        dataset:\n          name: Benchmark Dataset\n          type: benchmark_type\n        metrics:\n          - name: MMLU\n            type: mmlu\n            value: 85.2\n          - name: HumanEval\n            type: humaneval\n            value: 72.5\n        source:\n          name: Source Name\n          url: https://source-url.com\n```\n\nWARNING: Do not use markdown formatting in the model name. Use the exact name from the table. 
Only use URLs in the source.url field.\n\n### Error Handling\n- **Table Not Found**: Script will report if no evaluation tables are detected\n- **Invalid Format**: Clear error messages for malformed tables\n- **API Errors**: Retry logic for transient Artificial Analysis API failures\n- **Token Issues**: Validation before attempting updates\n- **Merge Conflicts**: Preserves existing model-index entries when adding new ones\n- **Space Creation**: Handles naming conflicts and hardware request failures gracefully\n\n### Best Practices\n\n1. **Check for existing PRs first**: Run `get-prs` before creating any new PR to avoid duplicates\n2. **Always start with `inspect-tables`**: See table structure and get the correct extraction command\n3. **Use `--help` for guidance**: Run `inspect-tables --help` to see the complete workflow\n4. **Preview first**: Default behavior prints YAML; review it before using `--apply` or `--create-pr`\n5. **Verify extracted values**: Compare YAML output against the README table manually\n6. **Use `--table N` for multi-table READMEs**: Required when multiple evaluation tables exist\n7. **Use `--model-name-override` for comparison tables**: Copy the exact column header from `inspect-tables` output\n8. **Create PRs for Others**: Use `--create-pr` when updating models you don't own\n9. **One model per repo**: Only add the main model's results to model-index\n10. **No markdown in YAML names**: The model name field in YAML should be plain text\n\n### Model Name Matching\n\nWhen extracting evaluation tables with multiple models (either as columns or rows), the script uses **exact normalized token matching**:\n\n- Removes markdown formatting (bold `**`, links `[]()`)\n- Normalizes names (lowercase, replace `-` and `_` with spaces)\n- Compares token sets: `\"OLMo-3-32B\"` → `{\"olmo\", \"3\", \"32b\"}` matches `\"**Olmo 3 32B**\"` or `\"Olmo-3-32B\"`\n- Only extracts if tokens match exactly (handles different word orders and separators)\n- Fails if no exact match is found (rather than guessing from similar names)\n\n**For column-based tables** (benchmarks as rows, models as columns):\n- Finds the column header matching the model name\n- Extracts scores from that column only\n\n**For transposed tables** (models as rows, benchmarks as columns):\n- Finds the row in the first column matching the model name\n- Extracts all benchmark scores from that row only\n\nThis ensures only the correct model's scores are extracted, never unrelated models or training checkpoints. 
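\n\n**Token matching, sketched** (an illustrative approximation of the rules above, not the script's actual implementation):\n\n```python
def name_tokens(name: str) -> set[str]:
    # Strip common markdown markers, then normalize case and separators.
    for marker in ('**', '*', '`', '[', ']', '(', ')'):
        name = name.replace(marker, ' ')
    name = name.lower().replace('-', ' ').replace('_', ' ')
    return set(name.split())

def is_exact_match(candidate: str, target: str) -> bool:
    # Token sets must be identical; near-misses are rejected, never guessed.
    return name_tokens(candidate) == name_tokens(target)
```\n\nFor example, `is_exact_match('**Olmo 3 32B**', 'OLMo-3-32B')` holds, while `'OLMo-3-7B'` is rejected rather than fuzzily matched.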
\n\n### Common Patterns\n\n**Update Your Own Model:**\n```bash\n# Extract from README and push directly\nuv run scripts/evaluation_manager.py extract-readme \\\n  --repo-id \"your-username/your-model\" \\\n  --task-type \"text-generation\"\n```\n\n**Update Someone Else's Model (Full Workflow):**\n```bash\n# Step 1: ALWAYS check for existing PRs first\nuv run scripts/evaluation_manager.py get-prs \\\n  --repo-id \"other-username/their-model\"\n\n# Step 2: If NO open PRs exist, proceed with creating one\nuv run scripts/evaluation_manager.py extract-readme \\\n  --repo-id \"other-username/their-model\" \\\n  --create-pr\n\n# If open PRs DO exist:\n# - Warn the user about existing PRs\n# - Show them the PR URLs\n# - Do NOT create a new PR unless user explicitly confirms\n```\n\n**Import Fresh Benchmarks:**\n```bash\n# Step 1: Check for existing PRs\nuv run scripts/evaluation_manager.py get-prs \\\n  --repo-id \"anthropic/claude-sonnet-4\"\n\n# Step 2: If no PRs, import from Artificial Analysis\nAA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \\\n  --creator-slug \"anthropic\" \\\n  --model-name \"claude-sonnet-4\" \\\n  --repo-id \"anthropic/claude-sonnet-4\" \\\n  --create-pr\n```\n\n### Troubleshooting\n\n**Issue**: \"No evaluation tables found in README\"\n- **Solution**: Check if README contains markdown tables with numeric scores\n\n**Issue**: \"Could not find model 'X' in transposed table\"\n- **Solution**: The script will display available models. 
Use `--model-name-override` with the exact name from the list\n- **Example**: `--model-name-override \"**Olmo 3-32B**\"`\n\n**Issue**: \"AA_API_KEY not set\"\n- **Solution**: Set environment variable or add to .env file\n\n**Issue**: \"Token does not have write access\"\n- **Solution**: Ensure HF_TOKEN has write permissions for the repository\n\n**Issue**: \"Model not found in Artificial Analysis\"\n- **Solution**: Verify creator-slug and model-name match API values\n\n**Issue**: \"Payment required for hardware\"\n- **Solution**: Add a payment method to your Hugging Face account to use non-CPU hardware\n\n**Issue**: \"vLLM out of memory\" or CUDA OOM\n- **Solution**: Use a larger hardware flavor, reduce `--gpu-memory-utilization`, or use `--tensor-parallel-size` for multi-GPU\n\n**Issue**: \"Model architecture not supported by vLLM\"\n- **Solution**: Use `--backend hf` (inspect-ai) or `--backend accelerate` (lighteval) for HuggingFace Transformers\n\n**Issue**: \"Trust remote code required\"\n- **Solution**: Add `--trust-remote-code` flag for models with custom code (e.g., Phi-2, Qwen)\n\n**Issue**: \"Chat template not found\"\n- **Solution**: Only use `--use-chat-template` for instruction-tuned models that include a chat template\n\n### Integration Examples\n\n**Python Script Integration:**\n```python\nimport subprocess\n\ndef update_model_evaluations(repo_id):\n    \"\"\"Update a model card with evaluations extracted from its README.\"\"\"\n    result = subprocess.run([\n        \"python\", \"scripts/evaluation_manager.py\",\n        \"extract-readme\",\n        \"--repo-id\", repo_id,\n        \"--create-pr\"\n    ], capture_output=True, text=True)\n\n    if result.returncode == 0:\n        print(f\"Successfully updated {repo_id}\")\n    else:\n        print(f\"Error: {result.stderr}\")\n```\n\n## Limitations\n- Use this skill only when the task clearly matches the scope described above.\n- Do not treat the output as a substitute for 
environment-specific validation, testing, or expert review.\n- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.","tags":["hugging","face","evaluation","antigravity","awesome","skills","sickn33","agent-skills","agentic-skills","ai-agent-skills","ai-agents","ai-coding"],"capabilities":["skill","source-sickn33","skill-hugging-face-evaluation","topic-agent-skills","topic-agentic-skills","topic-ai-agent-skills","topic-ai-agents","topic-ai-coding","topic-ai-workflows","topic-antigravity","topic-antigravity-skills","topic-claude-code","topic-claude-code-skills","topic-codex-cli","topic-codex-skills"],"categories":["antigravity-awesome-skills"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/sickn33/antigravity-awesome-skills/hugging-face-evaluation","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add sickn33/antigravity-awesome-skills","source_repo":"https://github.com/sickn33/antigravity-awesome-skills","install_from":"skills.sh"}},"qualityScore":"0.700","qualityRationale":"deterministic score 0.70 from registry signals: · indexed on github topic:agent-skills · 34768 github stars · SKILL.md body (22,850 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-04-23T18:51:29.396Z","embedding":null,"createdAt":"2026-04-18T21:38:45.175Z","updatedAt":"2026-04-23T18:51:29.396Z","lastSeenAt":"2026-04-23T18:51:29.396Z"}
'model':13,31,39,56,81,103,129,134,140,144,165,185,243,370,551,556,560,569,585,600,629,650,659,668,684,754,1002,1006,1076,1082,1095,1122,1145,1192,1217,1275,1307,1321,1336,1343,1362,1383,1387,1388,1451,1466,1487,1499,1503,1534,1770,1784,1802,1832,1895,1900,1917,1942,1952,2024,2029,2036,2074,2077,2132,2165,2167,2182,2184,2201,2220,2222,2247,2249,2261,2263,2272,2278,2285,2288,2330,2389,2503,2530,2537,2544,2549,2558,2567,2576,2652,2661,2672,2687,2702,2709,2718,2749,2870,2905,2916,2919,2931,2971,2984,3040,3073,3097,3115,3122 'model-column-index':559,1001,1075,2023 'model-id':2166,2183,2221,2248,2262 'model-index':38,128,143,164,584,599,628,649,658,2271,2277,2284,2388,2548 'model-nam':1144,1191,1216,2073,2869,2983 'model-name-overrid':568,1005,1081,2028,2502,2918,2930 'model-slug':2076 'multi':1794,2491,3037 'multi-gpu':1793,3036 'multi-t':2490 'multipl':60,534,550,683,1461,1722,2496,2575 'multitask':1846 'must':297,1089 'n':461,525,2022,2027,2488 'name':570,1007,1083,1146,1193,1218,2030,2037,2046,2072,2075,2172,2189,2287,2289,2297,2304,2310,2317,2319,2331,2335,2399,2504,2556,2559,2568,2595,2643,2662,2688,2871,2920,2925,2932,2985 'narrat':1698,1704 'need':93,776 'never':2707 'new':285,323,758,2394,2421,2821 'non':3007 'non-cpu':3006 'normal':2586,2594 'note':259 'num':1552,1613,1662 'number':966,2136 'numbers/columns':455 'numer':2899 'nvidia':820 'nvidia-smi':819 'offlin':1416 'olmo':2604,2608,2612,2616,2934 'one':286,1100,2395,2536,2784 'oom':3017 'open':301,315,336,446,852,1057,1434,1631,2106,2127,2778,2802 'openai':240 'oper':681,889 'optim':1412 'option':1316,1421,1739,1872 'order':2629 'origin':1647,1648 'os':3112 'other':2523 'other-usernam':2770,2794 'output':566,602,2479,2518,3144,3180 'overrid':571,1008,1084,2031,2505,2921,2933 'overview':43 'overwrit':671 'paper':158,676 'parallel':1798,1811,1932,3033 'param':1957 'paramet':1675 'pars':506,510 'pass':747,1670 'path':377,400,809 'pattern':2714 'payment':2990,2997 'pep':266,895 'per':2538 'perform':834,888 
'permiss':2966,3201 'phi':3078 'pip':905 'plain':410,2565 'portion':1664 'possibl':764 'pr':295,324,346,366,473,639,1025,1056,1059,1229,2052,2085,2113,2135,2422,2472,2527,2800,2815,2822,2882,3142 'practic':2407 'prefer':558,891,1074 'prepar':124 'prerequisit':890 'preserv':633,2386 'prevent':368 'preview':2458 'print':462,985,1064,2462,3151,3158 'proceed':355,2781 'process':682 'proper':656 'provid':46,231,693,1358,1376,1381,1399,1415,2146 'prs':282,302,309,316,337,375,442,447,2107,2120,2412,2417,2521,2759,2766,2779,2803,2811,2836,2842,2851 'pull':290,643,1203,2128 'push':469,1023,1039,2724 'py':216,515,913 'python':219,411,711,883,915,944,1314,1874,2176,3105,3108,3130 'python-dotenv':218,914,943 'pyyaml':222,917 'qwen':3080 'rather':2638 're':226 'readm':20,71,432,459,486,499,952,994,1031,1047,1070,2009,2016,2482,2493,2722,2731,2790,2890,2894,3119,3127,3134 'reason':1856,1860 'recogn':539 'recommend':955,1951,1954 'reduc':3024 'refer':1976 'relat':379 'releas':135 'remot':3062,3069 'remov':2589 'replac':2597 'repo':311,977,996,1033,1049,1152,1199,1224,2004,2018,2080,2094,2102,2122,2539,2733,2768,2792,2844,2877,3117,3136,3138,3155 'repo-id':310,976,995,1032,1048,1151,1198,1223,2003,2017,2079,2093,2101,2121,2732,2767,2791,2843,2876,3135 'report':2353 'repositori':371,2133,2969 'request':224,291,644,918,1204,2129,2403 'requir':246,532,727,2494,2991,3064,3199 'respons':626 'result':9,52,98,2290,2546,3128 'result.returncode':3149 'result.stderr':3161 'retri':2370 'review':351,2464,3192 'row':505,544,1097,1101,2581,2651,2674,2680,2695 'run':29,79,112,169,273,305,388,414,417,421,428,687,698,710,752,784,803,822,827,870,894,971,990,1027,1043,1135,1179,1182,1207,1232,1248,1261,1293,1319,1334,1361,1442,1449,1460,1464,1485,1501,1523,1763,1768,1782,1800,1821,1898,1915,1940,1985,1989,1998,2012,2062,2090,2098,2109,2116,2142,2151,2180,2197,2209,2236,2259,2414,2447,2727,2762,2786,2838,2860 'safe':744 'safeti':200,862,1754,3202 'sampl':504 'schema':1867 'school':1851 'scope':3173 
'score':23,75,617,1112,2664,2692,2704,2900 'script':268,390,712,785,805,808,824,884,886,1353,1359,1379,1382,1876,1879,2254,2351,2583,2912,3106 'scripts/evaluation_manager.py':306,418,422,429,972,991,1028,1044,1136,1183,1208,1986,1990,1999,2013,2063,2091,2099,2117,2728,2763,2787,2839,2861,3131 'scripts/inspect_eval_uv.py':2152 'scripts/inspect_vllm_uv.py':1769,1783,1801,1822,2237 'scripts/lighteval_vllm_uv.py':1450,1465,1486,1502,1524,2210 'scripts/run_eval_job.py':1320,2181 'scripts/run_vllm_eval_job.py':1899,1916,1941,2260 'scripts/train_sft_example.py':828 'seamless':709 'secret':746,1270,1302,1529,1827,2160,2215,2242 'secur':739 'see':494,2433,2453 'select':522,581,729,1882,1913,2257 'sens':1855 'separ':1355,1730,2631 'set':594,919,932,2603,2943,2945 'shot':1445,1560,1567,1574,1584,1693 'show':341,2092,2134,2812 'similar':2642 'simplifi':1885 'size':1812,1896,1933,1953,3034 'skill':45,277,881,3165 'skill-hugging-face-evaluation' 'skill.md':385 'slug':1142,1189,1214,2069,2078,2867,2981 'small':1301,1331,1528,1826,1960,1965,2159,2196,2214,2241 'smi':821 'solut':2891,2910,2944,2960,2977,2994,3018,3046,3065,3086 'someon':2746 'sonnet':1149,1196,1221,2874 'sourc':635,2316,2318 'source-sickn33' 'source-url.com':2321 'source.url':2344 'space':725,2396,2600 'spam':369 'specif':167,530,679,983,3187 'specifi':1726 'speed':1410 'standalon':867,1438,1759 'standard':699,842 'start':1994,2428 'step':2753,2774,2831,2847 'stop':3193 'structur':50,96,501,2282,2435 'submiss':1887 'submit':873,1235 'subprocess':3110 'subprocess.run':3129 'substitut':3183 'success':3153,3205 'suffici':771 'suit':1550,1611,1628,1660 'summar':607 'support':15,59,663,1433,1606,3043 't4':1330,1959,2158,2195 't4-small':1329,1958,2157,2194 'tabl':18,69,425,452,454,460,484,488,492,496,508,521,524,531,535,548,582,962,965,975,984,999,1036,1052,1071,1094,1993,2002,2021,2338,2348,2357,2367,2432,2434,2450,2483,2487,2492,2498,2508,2517,2573,2648,2671,2887,2897,2909 
'task':589,592,855,1280,1312,1326,1437,1456,1462,1471,1492,1508,1539,1544,1546,1551,1587,1594,1607,1612,1634,1643,1646,1650,1654,1661,1674,1723,1732,1775,1789,1807,1837,1843,1905,1922,1944,2039,2169,2171,2186,2188,2224,2251,2265,2291,2740,3169 'task-nam':2170,2187 'task-typ':591,2038,2739 'task.type':596 'templat':1515,3083,3092,3102 'tensor':1797,1810,1931,3032 'tensor-parallel-s':1809,1930,3031 'termin':788 'test':3189 'text':577,605,1012,1088,2042,2294,2566,2743,3146 'text-gener':604,2041,2293,2742 'titl':2137 'token':743,921,928,1173,1177,1255,1257,1272,1274,1287,1289,1304,1306,1531,1533,1829,1831,2162,2164,2217,2219,2244,2246,2378,2587,2602,2623,2954,2963 'tool':47,782 'top':1978 'top-level':1977 'topic-agent-skills' 'topic-agentic-skills' 'topic-ai-agent-skills' 'topic-ai-agents' 'topic-ai-coding' 'topic-ai-workflows' 'topic-antigravity' 'topic-antigravity-skills' 'topic-claude-code' 'topic-claude-code-skills' 'topic-codex-cli' 'topic-codex-skills' 'torch':253 'trail':1617,1667 'train':2711 'transform':255,1779,1937,3059 'transient':2373 'transpos':1093,2670,2908 'treat':3178 'troubleshoot':2883 'true':3145,3147 'trust':3061,3068 'trust-remote-cod':3067 'truthfulqa':1862,1863 'tune':3096 'type':590,593,2040,2292,2300,2302,2306,2312,2741 'uk':198,860,1752 'understand':1699,1705,1848 'unless':2823 'unrel':2708 'updat':142,647,2383,2529,2715,2745,3114,3121,3154 'url':347,638,2141,2320,2341,2816 'usag':878,1125,1252,1408 'use':91,271,275,397,401,489,511,523,567,701,778,791,892,1009,1017,1080,1243,1347,1371,1478,1513,1547,1652,1683,1701,1715,1777,1935,2174,2325,2332,2340,2443,2467,2486,2501,2524,2584,2917,3005,3019,3030,3047,3088,3090,3163 'use-chat-templ':1512,3089 'user':334,343,358,792,2808,2824 'usernam':2737,2772,2796 'username/model':979,998,1035,1051 'username/model-name':313,1154,1201,1226,2006,2020,2082,2096,2104,2124 'util':3028 
'uv':177,272,304,413,416,420,427,707,768,811,826,893,970,989,1026,1042,1134,1181,1206,1247,1260,1292,1318,1448,1463,1484,1500,1522,1767,1781,1799,1820,1897,1914,1939,1984,1988,1997,2011,2061,2089,2097,2115,2150,2179,2208,2235,2258,2726,2761,2785,2837,2859 'valid':672,1060,2087,2100,2380,3188 'valu':1731,2308,2314,2476,2988 'variabl':923,937,2947 'verifi':2474,2978 'version':205,1621,1982,1991 'via':265,745,1516,1814 'view':2086 'vllm':84,116,179,241,249,251,260,757,830,1339,1348,1372,1378,1411,1425,1447,1483,1745,1766,2198,2205,2232,3011,3045 'vllm/accelerate':193 'vllm/lighteval':34 'vs':1374 'want':106,362 'warn':332,2322,2806 'winograd':1866 'winogrand':1865 'without':670,1665 'wmt14':1710,1717 'word':2628 'work':35,328,408,793 'workflow':406,435,954,2456,2751 'write':475,926,2958,2965 'write-access':925 'x':839,2906 'yaml':463,578,587,653,986,1062,2283,2463,2478,2555,2562 'yes':1417 'your-api-key':1130,1166 'your-hf-token':1174 'your-usernam':2735 'zero':720,1573 'zero-config':719 'zero-shot':1572","prices":[{"id":"d7119cb9-7886-4fc3-a373-58a014821212","listingId":"5739fe0b-fc27-4619-8450-ff71199fd703","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"sickn33","category":"antigravity-awesome-skills","install_from":"skills.sh"},"createdAt":"2026-04-18T21:38:45.175Z"}],"sources":[{"listingId":"5739fe0b-fc27-4619-8450-ff71199fd703","source":"github","sourceId":"sickn33/antigravity-awesome-skills/hugging-face-evaluation","sourceUrl":"https://github.com/sickn33/antigravity-awesome-skills/tree/main/skills/hugging-face-evaluation","isPrimary":false,"firstSeenAt":"2026-04-18T21:38:45.175Z","lastSeenAt":"2026-04-23T18:51:29.396Z"}],"details":{"listingId":"5739fe0b-fc27-4619-8450-ff71199fd703","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor
":[],"kindDetails":{"org":"sickn33","slug":"hugging-face-evaluation","github":{"repo":"sickn33/antigravity-awesome-skills","stars":34768,"topics":["agent-skills","agentic-skills","ai-agent-skills","ai-agents","ai-coding","ai-workflows","antigravity","antigravity-skills","claude-code","claude-code-skills","codex-cli","codex-skills","cursor","cursor-skills","developer-tools","gemini-cli","gemini-skills","kiro","mcp","skill-library"],"license":"mit","html_url":"https://github.com/sickn33/antigravity-awesome-skills","pushed_at":"2026-04-23T06:41:03Z","description":"Installable GitHub library of 1,400+ agentic skills for Claude Code, Cursor, Codex CLI, Gemini CLI, Antigravity, and more. Includes installer CLI, bundles, workflows, and official/community skill collections.","skill_md_sha":"2d7faadeaaf3ba82909ade815b83b538e2bdaa83","skill_md_path":"skills/hugging-face-evaluation/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/sickn33/antigravity-awesome-skills/tree/main/skills/hugging-face-evaluation"},"layout":"multi","source":"github","category":"antigravity-awesome-skills","frontmatter":{"name":"hugging-face-evaluation","description":"Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format."},"skills_sh_url":"https://skills.sh/sickn33/antigravity-awesome-skills/hugging-face-evaluation"},"updatedAt":"2026-04-23T18:51:29.396Z"}}