{"id":"0cbfef8d-2749-4fde-8dc2-efdf40aa4668","shortId":"5cfHnZ","kind":"skill","title":"hugging-face-community-evals","tagline":"Run local evaluations for Hugging Face Hub models with inspect-ai or lighteval.","description":"# Overview\n\n## When to Use\nUse this skill for local model evaluation, backend selection, and GPU smoke tests outside the Hugging Face Jobs workflow.\n\nThis skill is for **running evaluations against models on the Hugging Face Hub on local hardware**.\n\nIt covers:\n- `inspect-ai` with local inference\n- `lighteval` with local inference\n- choosing between `vllm`, Hugging Face Transformers, and `accelerate`\n- smoke tests, task selection, and backend fallback strategy\n\nIt does **not** cover:\n- Hugging Face Jobs orchestration\n- model-card or `model-index` edits\n- README table extraction\n- Artificial Analysis imports\n- `.eval_results` generation or publishing\n- PR creation or community-evals automation\n\nIf the user wants to **run the same eval remotely on Hugging Face Jobs**, hand off to the `hugging-face-jobs` skill and pass it one of the local scripts in this skill.\n\nIf the user wants to **publish results into the community evals workflow**, stop after generating the evaluation run and hand off that publishing step to `~/code/community-evals`.\n\n> All paths below are relative to the directory containing this `SKILL.md`.\n\n# When To Use Which Script\n\n| Use case | Script |\n|---|---|\n| Local `inspect-ai` eval on a Hub model via inference providers | `scripts/inspect_eval_uv.py` |\n| Local GPU eval with `inspect-ai` using `vllm` or Transformers | `scripts/inspect_vllm_uv.py` |\n| Local GPU eval with `lighteval` using `vllm` or `accelerate` | `scripts/lighteval_vllm_uv.py` |\n| Extra command patterns | `examples/USAGE_EXAMPLES.md` |\n\n# Prerequisites\n\n- Prefer `uv run` for local execution.\n- Set `HF_TOKEN` for gated/private models.\n- For local GPU runs, verify GPU access before starting:\n\n```bash\nuv --version\nprintenv 
HF_TOKEN >/dev/null\nnvidia-smi\n```\n\nIf `nvidia-smi` is unavailable, either:\n- use `scripts/inspect_eval_uv.py` for lighter provider-backed evaluation, or\n- hand off to the `hugging-face-jobs` skill if the user wants remote compute.\n\n# Core Workflow\n\n1. Choose the evaluation framework.\n   - Use `inspect-ai` when you want explicit task control and inspect-native flows.\n   - Use `lighteval` when the benchmark is naturally expressed as a lighteval task string, especially leaderboard-style tasks.\n2. Choose the inference backend.\n   - Prefer `vllm` for throughput on supported architectures.\n   - Use Hugging Face Transformers (`--backend hf`) or `accelerate` as compatibility fallbacks.\n3. Start with a smoke test.\n   - `inspect-ai`: add `--limit 10` or similar.\n   - `lighteval`: add `--max-samples 10`.\n4. Scale up only after the smoke test passes.\n5. If the user wants remote execution, hand off to `hugging-face-jobs` with the same script + args.\n\n# Quick Start\n\n## Option A: inspect-ai via Inference Providers\n\nBest when the model is already supported by Hugging Face Inference Providers and you want the lowest local setup overhead.\n\n```bash\nuv run scripts/inspect_eval_uv.py \\\n  --model meta-llama/Llama-3.2-1B \\\n  --task mmlu \\\n  --limit 20\n```\n\nUse this path when:\n- you want a quick local smoke test\n- you do not need direct GPU control\n- the task already exists in `inspect-evals`\n\n## Option B: inspect-ai on Local GPU\n\nBest when you need to load the Hub model directly, use `vllm`, or fall back to Transformers for unsupported architectures.\n\nLocal GPU:\n\n```bash\nuv run scripts/inspect_vllm_uv.py \\\n  --model meta-llama/Llama-3.2-1B \\\n  --task gsm8k \\\n  --limit 20\n```\n\nTransformers fallback:\n\n```bash\nuv run scripts/inspect_vllm_uv.py \\\n  --model microsoft/phi-2 \\\n  --task mmlu \\\n  --backend hf \\\n  --trust-remote-code \\\n  --limit 20\n```\n\n## Option C: lighteval on Local GPU\n\nBest when the task 
is naturally expressed as a `lighteval` task string, especially Open LLM Leaderboard style benchmarks.\n\nLocal GPU:\n\n```bash\nuv run scripts/lighteval_vllm_uv.py \\\n  --model meta-llama/Llama-3.2-3B-Instruct \\\n  --tasks \"leaderboard|mmlu|5,leaderboard|gsm8k|5\" \\\n  --max-samples 20 \\\n  --use-chat-template\n```\n\n`accelerate` fallback:\n\n```bash\nuv run scripts/lighteval_vllm_uv.py \\\n  --model microsoft/phi-2 \\\n  --tasks \"leaderboard|mmlu|5\" \\\n  --backend accelerate \\\n  --trust-remote-code \\\n  --max-samples 20\n```\n\n# Remote Execution Boundary\n\nThis skill intentionally stops at **local execution and backend selection**.\n\nIf the user wants to:\n- run these scripts on Hugging Face Jobs\n- pick remote hardware\n- pass secrets to remote jobs\n- schedule recurring runs\n- inspect / cancel / monitor jobs\n\nthen switch to the **`hugging-face-jobs`** skill and pass it one of these scripts plus the chosen arguments.\n\n# Task Selection\n\n`inspect-ai` examples:\n- `mmlu`\n- `gsm8k`\n- `hellaswag`\n- `arc_challenge`\n- `truthfulqa`\n- `winogrande`\n- `humaneval`\n\n`lighteval` task strings use `suite|task|num_fewshot`:\n- `leaderboard|mmlu|5`\n- `leaderboard|gsm8k|5`\n- `leaderboard|arc_challenge|25`\n- `lighteval|hellaswag|0`\n\nMultiple `lighteval` tasks can be comma-separated in `--tasks`.\n\n# Backend Selection\n\n- Prefer `inspect_vllm_uv.py --backend vllm` for fast GPU inference on supported architectures.\n- Use `inspect_vllm_uv.py --backend hf` when `vllm` does not support the model.\n- Prefer `lighteval_vllm_uv.py --backend vllm` for throughput on supported models.\n- Use `lighteval_vllm_uv.py --backend accelerate` as the compatibility fallback.\n- Use `inspect_eval_uv.py` when Inference Providers already cover the model and you do not need direct GPU control.\n\n# Hardware Guidance\n\n| Model size | Suggested local hardware |\n|---|---|\n| `< 3B` | consumer GPU / Apple Silicon / small dev GPU |\n| `3B - 13B` | stronger local GPU |\n| 
`13B+` | high-memory local GPU or hand off to `hugging-face-jobs` |\n\nFor smoke tests, prefer cheaper local runs plus `--limit` or `--max-samples`.\n\n# Troubleshooting\n\n- CUDA or vLLM OOM:\n  - reduce `--batch-size`\n  - reduce `--gpu-memory-utilization`\n  - switch to a smaller model for the smoke test\n  - if necessary, hand off to `hugging-face-jobs`\n- Model unsupported by `vllm`:\n  - switch to `--backend hf` for `inspect-ai`\n  - switch to `--backend accelerate` for `lighteval`\n- Gated/private repo access fails:\n  - verify `HF_TOKEN`\n- Custom model code required:\n  - add `--trust-remote-code`\n\n# Examples\n\nSee:\n- `examples/USAGE_EXAMPLES.md` for local command patterns\n- `scripts/inspect_eval_uv.py`\n- `scripts/inspect_vllm_uv.py`\n- `scripts/lighteval_vllm_uv.py`\n\n# Limitations\n\n- Use this skill only when the task clearly matches the scope described above.\n- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.\n- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.","tags":["hugging","face","community","evals","antigravity","awesome","skills","sickn33","agent-skills","agentic-skills","ai-agent-skills","ai-agents"],"capabilities":["skill","source-sickn33","skill-hugging-face-community-evals","topic-agent-skills","topic-agentic-skills","topic-ai-agent-skills","topic-ai-agents","topic-ai-coding","topic-ai-workflows","topic-antigravity","topic-antigravity-skills","topic-claude-code","topic-claude-code-skills","topic-codex-cli","topic-codex-skills"],"categories":["antigravity-awesome-skills"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/sickn33/antigravity-awesome-skills/hugging-face-community-evals","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add 
sickn33/antigravity-awesome-skills","source_repo":"https://github.com/sickn33/antigravity-awesome-skills","install_from":"skills.sh"}},"qualityScore":"0.700","qualityRationale":"deterministic score 0.70 from registry signals: · indexed on github topic:agent-skills · 34768 github stars · SKILL.md body (6,654 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-04-23T18:51:29.130Z","embedding":null,"createdAt":"2026-04-18T21:38:42.949Z","updatedAt":"2026-04-23T18:51:29.130Z","lastSeenAt":"2026-04-23T18:51:29.130Z","tsv":"'/code/community-evals':180 '/dev/null':267 '/llama-3.2-1b':453,522 '/llama-3.2-3b-instruct':579 '0':711 '1':304 '10':376,384 '13b':796,800 '2':342 '20':457,526,544,590,616 '25':708 '3':365 '3b':787,795 '4':385 '5':394,583,586,606,701,704 'acceler':78,233,361,595,608,758,874 'access':258,879 'add':374,380,888 'ai':17,63,203,219,312,373,419,488,681,870 'alreadi':430,478,768 'analysi':107 'appl':790 'arc':686,706 'architectur':353,511,734 'arg':412 'argument':676 'artifici':106 'ask':936 'autom':120 'b':485 'back':284,506 'backend':31,84,346,358,537,607,628,722,726,737,748,757,865,873 'bash':261,445,514,529,571,597 'batch':834 'batch-siz':833 'benchmark':328,568 'best':425,492,551 'boundari':619,944 'c':546 'cancel':654 'card':97 'case':198 'challeng':687,707 'chat':593 'cheaper':818 'choos':71,305,343 'chosen':675 'clarif':938 'clear':911 'code':542,612,886,892 'comma':718 'comma-separ':717 'command':236,898 'communiti':4,118,164 'community-ev':117 'compat':363,761 'comput':301 'consum':788 'contain':189 'control':318,475,779 'core':302 'cover':60,90,769 'creation':115 'criteria':947 'cuda':828 'custom':884 'describ':915 'dev':793 'direct':473,501,777 'directori':188 'edit':102 'either':277 
'environ':927 'environment-specif':926 'especi':337,563 'eval':5,109,119,129,165,204,215,227,483 'evalu':8,30,48,171,285,307 'exampl':682,893 'examples/usage_examples.md':238,895 'execut':245,400,618,626 'exist':479 'expert':932 'explicit':316 'express':331,557 'extra':235 'extract':105 'face':3,11,40,54,75,92,133,141,293,356,406,434,640,663,812,857 'fail':880 'fall':505 'fallback':85,364,528,596,762 'fast':729 'fewshot':698 'flow':323 'framework':308 'gated/private':250,877 'generat':111,169 'gpu':34,214,226,254,257,474,491,513,550,570,730,778,789,794,799,805,838 'gpu-memory-util':837 'gsm8k':524,585,684,703 'guidanc':781 'hand':135,174,287,401,807,852 'hardwar':58,644,780,786 'hellaswag':685,710 'hf':247,265,359,538,738,866,882 'high':802 'high-memori':801 'hub':12,55,207,499 'hug':2,10,39,53,74,91,132,140,292,355,405,433,639,662,811,856 'hugging-face-community-ev':1 'hugging-face-job':139,291,404,661,810,855 'humanev':690 'import':108 'index':101 'infer':66,70,210,345,422,435,731,766 'input':941 'inspect':16,62,202,218,311,321,372,418,482,487,653,680,869 'inspect-ai':15,61,201,217,310,371,417,486,679,868 'inspect-ev':481 'inspect-n':320 'inspect_eval_uv.py':764 'inspect_vllm_uv.py':725,736 'intent':622 'job':41,93,134,142,294,407,641,649,656,664,813,858 'leaderboard':339,566,581,584,604,699,702,705 'leaderboard-styl':338 'lighter':281 'lightev':19,67,229,325,334,379,547,560,691,709,713,876 'lighteval_vllm_uv.py':747,756 'limit':375,456,525,543,822,903 'llama':452,521,578 'llm':565 'load':497 'local':7,28,57,65,69,150,200,213,225,244,253,421,442,466,490,512,549,569,625,785,798,804,819,897 'lowest':441 'match':912 'max':382,588,614,825 'max-sampl':381,587,613,824 'memori':803,839 'meta':451,520,577 'meta-llama':450,519,576 'microsoft/phi-2':534,602 'miss':949 'mmlu':455,536,582,605,683,700 'model':13,29,50,96,100,208,251,428,449,500,518,533,575,601,745,754,771,782,845,859,885 'model-card':95 'model-index':99 'monitor':655 'multipl':712 'nativ':322 'natur':330,556 
'necessari':851 'need':472,495,776 'num':697 'nvidia':269,273 'nvidia-smi':268,272 'one':147,669 'oom':831 'open':564 'option':415,484,545 'orchestr':94 'output':921 'outsid':37 'overhead':444 'overview':20 'pass':145,393,645,667 'path':182,424,460 'pattern':237,899 'permiss':942 'pick':642 'plus':673,821 'pr':114 'prefer':240,347,724,746,817 'prerequisit':239 'printenv':264 'provid':211,283,423,436,767 'provider-back':282 'publish':113,160,177 'quick':413,465 'readm':103 'recur':651 'reduc':832,836 'relat':185 'remot':130,300,399,541,611,617,643,648,891 'repo':878 'requir':887,940 'result':110,161 'review':933 'run':6,47,126,172,242,255,447,516,531,573,599,635,652,820 'safeti':943 'sampl':383,589,615,826 'scale':386 'schedul':650 'scope':914 'script':151,196,199,411,637,672 'scripts/inspect_eval_uv.py':212,279,448,900 'scripts/inspect_vllm_uv.py':224,517,532,901 'scripts/lighteval_vllm_uv.py':234,574,600,902 'secret':646 'see':894 'select':32,82,629,678,723 'separ':719 'set':246 'setup':443 'silicon':791 'similar':378 'size':783,835 'skill':26,44,143,154,295,621,665,906 'skill-hugging-face-community-evals' 'skill.md':191 'small':792 'smaller':844 'smi':270,274 'smoke':35,79,369,391,467,815,848 'source-sickn33' 'specif':928 'start':260,366,414 'step':178 'stop':167,623,934 'strategi':86 'string':336,562,693 'stronger':797 'style':340,567 'substitut':924 'success':946 'suggest':784 'suit':695 'support':352,431,733,743,753 'switch':658,841,863,871 'tabl':104 'task':81,317,335,341,454,477,523,535,554,561,580,603,677,692,696,714,721,910 'templat':594 'test':36,80,370,392,468,816,849,930 'throughput':350,751 'token':248,266,883 'topic-agent-skills' 'topic-agentic-skills' 'topic-ai-agent-skills' 'topic-ai-agents' 'topic-ai-coding' 'topic-ai-workflows' 'topic-antigravity' 'topic-antigravity-skills' 'topic-claude-code' 'topic-claude-code-skills' 'topic-codex-cli' 'topic-codex-skills' 'transform':76,223,357,508,527 'treat':919 'troubleshoot':827 'trust':540,610,890 
'trust-remote-cod':539,609,889 'truthfulqa':688 'unavail':276 'unsupport':510,860 'use':23,24,194,197,220,230,278,309,324,354,458,502,592,694,735,755,763,904 'use-chat-templ':591 'user':123,157,298,397,632 'util':840 'uv':241,262,446,515,530,572,598 'valid':929 'verifi':256,881 'version':263 'via':209 'vllm':73,221,231,348,503,727,740,749,830,862 'want':124,158,299,315,398,439,463,633 'winogrand':689 'workflow':42,166,303","prices":[{"id":"3f65b028-47c2-433b-8b1b-8be290d7ea69","listingId":"0cbfef8d-2749-4fde-8dc2-efdf40aa4668","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"sickn33","category":"antigravity-awesome-skills","install_from":"skills.sh"},"createdAt":"2026-04-18T21:38:42.949Z"}],"sources":[{"listingId":"0cbfef8d-2749-4fde-8dc2-efdf40aa4668","source":"github","sourceId":"sickn33/antigravity-awesome-skills/hugging-face-community-evals","sourceUrl":"https://github.com/sickn33/antigravity-awesome-skills/tree/main/skills/hugging-face-community-evals","isPrimary":false,"firstSeenAt":"2026-04-18T21:38:42.949Z","lastSeenAt":"2026-04-23T18:51:29.130Z"}],"details":{"listingId":"0cbfef8d-2749-4fde-8dc2-efdf40aa4668","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"sickn33","slug":"hugging-face-community-evals","github":{"repo":"sickn33/antigravity-awesome-skills","stars":34768,"topics":["agent-skills","agentic-skills","ai-agent-skills","ai-agents","ai-coding","ai-workflows","antigravity","antigravity-skills","claude-code","claude-code-skills","codex-cli","codex-skills","cursor","cursor-skills","developer-tools","gemini-cli","gemini-skills","kiro","mcp","skill-library"],"license":"mit","html_url":"https://github.com/sickn33/antigravity-awesome-skills","pushed_at":"2026-04-23T06:41:03Z","description":"Installa
ble GitHub library of 1,400+ agentic skills for Claude Code, Cursor, Codex CLI, Gemini CLI, Antigravity, and more. Includes installer CLI, bundles, workflows, and official/community skill collections.","skill_md_sha":"c2f352647e223896b4c2b3995cf77a6e628619c0","skill_md_path":"skills/hugging-face-community-evals/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/sickn33/antigravity-awesome-skills/tree/main/skills/hugging-face-community-evals"},"layout":"multi","source":"github","category":"antigravity-awesome-skills","frontmatter":{"name":"hugging-face-community-evals","description":"Run local evaluations for Hugging Face Hub models with inspect-ai or lighteval."},"skills_sh_url":"https://skills.sh/sickn33/antigravity-awesome-skills/hugging-face-community-evals"},"updatedAt":"2026-04-23T18:51:29.130Z"}}