{"id":"4cf1f504-5535-4f9f-af0b-8a318e17ffe9","shortId":"WVLaEQ","kind":"skill","title":"Run repeatable model and agent eval suites and inspect scoring traces with Inspect AI","tagline":"Run benchmark-style eval suites against models or agents, then inspect scored traces instead of relying on ad hoc chats and gut feel.","description":"# Run repeatable model and agent eval suites and inspect scoring traces with Inspect AI\n\nRun benchmark-style eval suites against models or agents, then inspect scored traces instead of relying on ad hoc chats and gut feel.\n\n## Prerequisites\n\nPython environment, inspect-ai package, model provider credentials, evaluation datasets or task definitions, optional sandbox dependencies for agent tasks\n\n## Installation\n\nUse the upstream install or setup path that matches your environment:\n- git clone https://github.com/UKGovernmentBEIS/inspect_ai.git\n- pip install -e \".[dev]\"\n- uv sync --extra dev\n- make hooks\n\nRequirements and caveats from upstream:\n- If you use VS Code, you should be sure to have installed the recommended extensions (Python, Ruff, and MyPy). Note that you'll be prompted to install these when you open the project in VS Code.\n- The web UI lives in a git submodule at src/inspect_ai/_view/ts-mono/. **These steps are only needed if you plan to work on the TypeScript/React frontend** — Python-only contributors can skip this entirely.\n\nBasic usage or getting-started notes:\n- Inspect provides many built-in components, including facilities for prompt engineering, tool usage, multi-turn dialog, and model graded evaluations. Extensions to Inspect (e.g. to support new elicitation and scoring t...\n- Inspect also includes a collection of over 200 pre-built evaluations ready to run on any model (learn more at <https://inspect.aisi.org.uk/evals/>).\n- Run linting, formatting, and tests via\n\n- Source: https://github.com/UKGovernmentBEIS/inspect_ai\n- Extracted from upstream docs: https://raw.githubusercontent.com/UKGovernmentBEIS/inspect_ai/HEAD/README.md\n\n## Documentation\n\n- https://inspect.aisi.org.uk/\n\n## Source\n\n- [Agent Skill Exchange](https://agentskillexchange.com/skills/run-repeatable-model-and-agent-eval-suites-and-inspect-scoring-traces-with-inspect-ai/)","tags":["run","repeatable","model","and","agent","eval","suites","inspect","scoring","traces","with","skills"],"capabilities":["skill","source-agentskillexchange","skill-run-repeatable-model-and-agent-eval-suites-and-inspect-scoring-traces-with-inspect-ai","topic-agent-skills","topic-ai-agents","topic-ai-tools","topic-awesome-list","topic-claude-code","topic-codex","topic-cursor","topic-llm","topic-mcp","topic-npx-skills","topic-openclaw","topic-skills-catalog"],"categories":["skills"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/agentskillexchange/skills/run-repeatable-model-and-agent-eval-suites-and-inspect-scoring-traces-with-inspect-ai","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add agentskillexchange/skills","source_repo":"https://github.com/agentskillexchange/skills","install_from":"skills.sh"}},"qualityScore":"0.454","qualityRationale":"deterministic score 0.45 from registry signals: · indexed on github topic:agent-skills · 8 github stars · SKILL.md body (1,887 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-18T19:12:13.937Z","embedding":null,"createdAt":"2026-05-18T13:19:02.340Z","updatedAt":"2026-05-18T19:12:13.937Z","lastSeenAt":"2026-05-18T19:12:13.937Z","tsv":"'/evals/':261 '/skills/run-repeatable-model-and-agent-eval-suites-and-inspect-scoring-traces-with-inspect-ai/)':287 '/ukgovernmentbeis/inspect_ai':271 '/ukgovernmentbeis/inspect_ai.git':114 '/ukgovernmentbeis/inspect_ai/head/readme.md':278 '200':245 'ad':33,71 'agent':5,24,43,62,96,282 'agentskillexchange.com':286 'agentskillexchange.com/skills/run-repeatable-model-and-agent-eval-suites-and-inspect-scoring-traces-with-inspect-ai/)':285 'ai':14,52,82 'also':239 'basic':198 'benchmark':17,55 'benchmark-styl':16,54 'built':209,248 'built-in':208 'caveat':127 'chat':35,73 'clone':111 'code':134,165 'collect':242 'compon':211 'contributor':193 'credenti':86 'dataset':88 'definit':91 'depend':94 'dev':118,122 'dialog':222 'doc':275 'document':279 'e':117 'e.g':230 'elicit':234 'engin':216 'entir':197 'environ':79,109 'eval':6,19,44,57 'evalu':87,226,249 'exchang':284 'extens':144,227 'extra':121 'extract':272 'facil':213 'feel':38,76 'format':264 'frontend':189 'get':202 'getting-start':201 'git':110,172 'github.com':113,270 'github.com/ukgovernmentbeis/inspect_ai':269 'github.com/ukgovernmentbeis/inspect_ai.git':112 'grade':225 'gut':37,75 'hoc':34,72 'hook':124 'includ':212,240 'inspect':9,13,26,47,51,64,81,205,229,238 'inspect-ai':80 'inspect.aisi.org.uk':260,280 'inspect.aisi.org.uk/evals/':259 'instal':98,102,116,141,156 'instead':29,67 'learn':256 'lint':263 'live':169 'll':152 'make':123 'mani':207 'match':107 'model':3,22,41,60,84,224,255 'multi':220 'multi-turn':219 'mypi':148 'need':180 'new':233 'note':149,204 'open':160 'option':92 'packag':83 'path':105 'pip':115 'plan':183 'pre':247 'pre-built':246 'prerequisit':77 'project':162 'prompt':154,215 'provid':85,206 'python':78,145,191 'python-on':190 'raw.githubusercontent.com':277 'raw.githubusercontent.com/ukgovernmentbeis/inspect_ai/head/readme.md':276 'readi':250 'recommend':143 'reli':31,69 'repeat':2,40 'requir':125 'ruff':146 'run':1,15,39,53,252,262 'sandbox':93 'score':10,27,48,65,236 'setup':104 'skill':283 'skill-run-repeatable-model-and-agent-eval-suites-and-inspect-scoring-traces-with-inspect-ai' 'skip':195 'sourc':268,281 'source-agentskillexchange' 'src/inspect_ai/_view/ts-mono':175 'start':203 'step':177 'style':18,56 'submodul':173 'suit':7,20,45,58 'support':232 'sure':138 'sync':120 'task':90,97 'test':266 'tool':217 'topic-agent-skills' 'topic-ai-agents' 'topic-ai-tools' 'topic-awesome-list' 'topic-claude-code' 'topic-codex' 'topic-cursor' 'topic-llm' 'topic-mcp' 'topic-npx-skills' 'topic-openclaw' 'topic-skills-catalog' 'trace':11,28,49,66 'turn':221 'typescript/react':188 'ui':168 'upstream':101,129,274 'usag':199,218 'use':99,132 'uv':119 'via':267 'vs':133,164 'web':167 'work':185","prices":[{"id":"b8f2ba65-9353-41ca-9e8c-44de236a9e03","listingId":"4cf1f504-5535-4f9f-af0b-8a318e17ffe9","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"agentskillexchange","category":"skills","install_from":"skills.sh"},"createdAt":"2026-05-18T13:19:02.340Z"}],"sources":[{"listingId":"4cf1f504-5535-4f9f-af0b-8a318e17ffe9","source":"github","sourceId":"agentskillexchange/skills/run-repeatable-model-and-agent-eval-suites-and-inspect-scoring-traces-with-inspect-ai","sourceUrl":"https://github.com/agentskillexchange/skills/tree/main/skills/run-repeatable-model-and-agent-eval-suites-and-inspect-scoring-traces-with-inspect-ai","isPrimary":false,"firstSeenAt":"2026-05-18T13:19:02.340Z","lastSeenAt":"2026-05-18T19:12:13.937Z"}],"details":{"listingId":"4cf1f504-5535-4f9f-af0b-8a318e17ffe9","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"agentskillexchange","slug":"run-repeatable-model-and-agent-eval-suites-and-inspect-scoring-traces-with-inspect-ai","github":{"repo":"agentskillexchange/skills","stars":8,"topics":["agent-skills","ai-agents","ai-tools","awesome-list","claude-code","codex","cursor","llm","mcp","npx-skills","openclaw","skills-catalog"],"license":"mit","html_url":"https://github.com/agentskillexchange/skills","pushed_at":"2026-05-18T19:02:17Z","description":"The open catalog of AI agent skills — 2,000+ security-scanned skills for Claude Code, Cursor, Codex, and more.","skill_md_sha":"fbab59b92e84dac7c6f150fa989d08b86931b60c","skill_md_path":"skills/run-repeatable-model-and-agent-eval-suites-and-inspect-scoring-traces-with-inspect-ai/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/agentskillexchange/skills/tree/main/skills/run-repeatable-model-and-agent-eval-suites-and-inspect-scoring-traces-with-inspect-ai"},"layout":"multi","source":"github","category":"skills","frontmatter":{"name":"Run repeatable model and agent eval suites and inspect scoring traces with Inspect AI","description":"Run benchmark-style eval suites against models or agents, then inspect scored traces instead of relying on ad hoc chats and gut feel."},"skills_sh_url":"https://skills.sh/agentskillexchange/skills/run-repeatable-model-and-agent-eval-suites-and-inspect-scoring-traces-with-inspect-ai"},"updatedAt":"2026-05-18T19:12:13.937Z"}}