{"id":"0b863904-f934-4a22-8ecd-c0aac0e71abb","shortId":"en7w2a","kind":"skill","title":"auto-arena","tagline":"Automatically evaluate and compare multiple AI models or agents without pre-existing test data. Generates test queries from a task description, collects responses from all target endpoints, auto-generates evaluation rubrics, runs pairwise comparisons via a judge model, and produc","description":"# Auto Arena Skill\n\nEnd-to-end automated model comparison using the OpenJudge `AutoArenaPipeline`:\n\n1. **Generate queries** — LLM creates diverse test queries from task description\n2. **Collect responses** — query all target endpoints concurrently\n3. **Generate rubrics** — LLM produces evaluation criteria from task + sample queries\n4. **Pairwise evaluation** — judge model compares every model pair (with position-bias swap)\n5. **Analyze & rank** — compute win rates, win matrix, and rankings\n6. **Report & charts** — Markdown report + win-rate bar chart + optional matrix heatmap\n\n## Prerequisites\n\n```bash\n# Install OpenJudge\npip install py-openjudge\n\n# Extra dependency for auto_arena (chart generation)\npip install matplotlib\n```\n\n## Gather from user before running\n\n| Info | Required? | Notes |\n|------|-----------|-------|\n| Task description | Yes | What the models/agents should do (set in config YAML) |\n| Target endpoints | Yes | At least 2 OpenAI-compatible endpoints to compare |\n| Judge endpoint | Yes | Strong model for pairwise evaluation (e.g. `gpt-4`, `qwen-max`) |\n| API keys | Yes | Env vars: `OPENAI_API_KEY`, `DASHSCOPE_API_KEY`, etc. |\n| Number of queries | No | Default: `20` |\n| Seed queries | No | Example queries to guide generation style |\n| System prompts | No | Per-endpoint system prompts |\n| Output directory | No | Default: `./evaluation_results` |\n| Report language | No | `\"zh\"` (default) or `\"en\"` |\n\n## Quick start\n\n### CLI\n\n```bash\n# Run evaluation\npython -m cookbooks.auto_arena --config config.yaml --save\n\n# Use pre-generated queries\npython -m cookbooks.auto_arena --config config.yaml \\\n  --queries_file queries.json --save\n\n# Start fresh, ignore checkpoint\npython -m cookbooks.auto_arena --config config.yaml --fresh --save\n\n# Re-run only pairwise evaluation with new judge model\n# (keeps queries, responses, and rubrics)\npython -m cookbooks.auto_arena --config config.yaml --rerun-judge --save\n```\n\n### Python API\n\n```python\nimport asyncio\nfrom cookbooks.auto_arena.auto_arena_pipeline import AutoArenaPipeline\n\nasync def main():\n    pipeline = AutoArenaPipeline.from_config(\"config.yaml\")\n    result = await pipeline.evaluate()\n\n    print(f\"Best model: {result.best_pipeline}\")\n    for rank, (model, win_rate) in enumerate(result.rankings, 1):\n        print(f\"{rank}. 
\n### Minimal Python API (no config file)\n\n```python\nimport asyncio\nfrom cookbooks.auto_arena.auto_arena_pipeline import AutoArenaPipeline\nfrom cookbooks.auto_arena.schema import OpenAIEndpoint\n\nasync def main():\n    pipeline = AutoArenaPipeline(\n        task_description=\"Customer service chatbot for e-commerce\",\n        target_endpoints={\n            \"gpt4\": OpenAIEndpoint(\n                base_url=\"https://api.openai.com/v1\",\n                api_key=\"sk-...\",\n                model=\"gpt-4\",\n            ),\n            \"qwen\": OpenAIEndpoint(\n                base_url=\"https://dashscope.aliyuncs.com/compatible-mode/v1\",\n                api_key=\"sk-...\",\n                model=\"qwen-max\",\n            ),\n        },\n        judge_endpoint=OpenAIEndpoint(\n            base_url=\"https://api.openai.com/v1\",\n            api_key=\"sk-...\",\n            model=\"gpt-4\",\n        ),\n        num_queries=20,\n    )\n    result = await pipeline.evaluate()\n    print(f\"Best: {result.best_pipeline}\")\n\nasyncio.run(main())\n```\n\n## CLI options\n\n| Flag | Default | Description |\n|------|---------|-------------|\n| `--config` | — | Path to YAML configuration file (required) |\n| `--output_dir` | config value | Override output directory |\n| `--queries_file` | — | Path to pre-generated queries JSON (skip generation) |\n| `--save` | `False` | Save results to file |\n| `--fresh` | `False` | Start fresh, ignore checkpoint |\n| `--rerun-judge` | `False` | Re-run pairwise evaluation only (keep queries/responses/rubrics) |\n\n## Minimal config file\n\n```yaml\ntask:\n  description: \"Academic GPT assistant for research and writing tasks\"\n\ntarget_endpoints:\n  model_v1:\n    base_url: \"https://api.openai.com/v1\"\n    api_key: \"${OPENAI_API_KEY}\"\n    model: \"gpt-4\"\n  model_v2:\n    base_url: \"https://api.openai.com/v1\"\n    api_key: \"${OPENAI_API_KEY}\"\n    model: \"gpt-3.5-turbo\"\n\njudge_endpoint:\n  base_url: \"https://api.openai.com/v1\"\n  api_key: \"${OPENAI_API_KEY}\"\n  model: \"gpt-4\"\n```\n\n## Full config reference\n\n### task\n\n| Field | Required | Description |\n|-------|----------|-------------|\n| `description` | Yes | Clear description of the task models will be tested on |\n| `scenario` | No | Usage scenario for additional context |\n\n### target_endpoints.\\<name\\>\n\n| Field | Default | Description |\n|-------|---------|-------------|\n| `base_url` | — | API base URL (required) |\n| `api_key` | — | API key, supports `${ENV_VAR}` (required) |\n| `model` | — | Model name (required) |\n| `system_prompt` | — | System prompt for this endpoint |\n| `extra_params` | — | Extra API params (e.g. `temperature`, `max_tokens`) |\n\n### judge_endpoint\n\nSame fields as `target_endpoints.<name>`. Use a strong model (e.g. `gpt-4`, `qwen-max`) with low temperature (~0.1) for consistent judgments.\n
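\nFor example, the recommended low judge temperature can be pinned in YAML via `extra_params` (a sketch; it assumes `extra_params` is forwarded to the judge's chat-completion calls, as the endpoint table above describes; check the config examples for the exact behavior):\n\n```yaml\njudge_endpoint:\n  base_url: \"https://api.openai.com/v1\"\n  api_key: \"${OPENAI_API_KEY}\"\n  model: \"gpt-4\"\n  extra_params:\n    temperature: 0.1\n```\n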
\n### query_generation\n\n| Field | Default | Description |\n|-------|---------|-------------|\n| `num_queries` | `20` | Total number of queries to generate |\n| `seed_queries` | — | Example queries to guide generation |\n| `categories` | — | Query categories with weights for stratified generation |\n| `endpoint` | judge endpoint | Custom endpoint for query generation |\n| `queries_per_call` | `10` | Queries generated per API call (1–50) |\n| `num_parallel_batches` | `3` | Parallel generation batches |\n| `temperature` | `0.9` | Sampling temperature (0.0–2.0) |\n| `top_p` | `0.95` | Top-p sampling (0.0–1.0) |\n| `max_similarity` | `0.85` | Dedup similarity threshold (0.0–1.0) |\n| `enable_evolution` | `false` | Enable Evol-Instruct complexity evolution |\n| `evolution_rounds` | `1` | Evolution rounds (0–3) |\n| `complexity_levels` | `[\"constraints\", \"reasoning\", \"edge_cases\"]` | Evolution strategies |\n\n### evaluation\n\n| Field | Default | Description |\n|-------|---------|-------------|\n| `max_concurrency` | `10` | Max concurrent API requests |\n| `timeout` | `60` | Request timeout in seconds |\n| `retry_times` | `3` | Retry attempts for failed requests |\n\n### output\n\n| Field | Default | Description |\n|-------|---------|-------------|\n| `output_dir` | `./evaluation_results` | Output directory |\n| `save_queries` | `true` | Save generated queries |\n| `save_responses` | `true` | Save model responses |\n| `save_details` | `true` | Save detailed results |\n\n### report\n\n| Field | Default | Description |\n|-------|---------|-------------|\n| `enabled` | `false` | Enable Markdown report generation |\n| `language` | `\"zh\"` | Report language: `\"zh\"` or `\"en\"` |\n| `include_examples` | `3` | Examples per section (1–10) |\n| `chart.enabled` | `true` | Generate win-rate chart |\n| `chart.orientation` | `\"horizontal\"` | `\"horizontal\"` or `\"vertical\"` |\n| `chart.show_values` | `true` | Show values on bars |\n| `chart.highlight_best` | `true` | Highlight best model |\n| `chart.matrix_enabled` | `false` | Generate win-rate matrix heatmap |\n| `chart.format` | `\"png\"` | Chart format: `\"png\"`, `\"svg\"`, or `\"pdf\"` |\n\n## Interpreting results\n\n**Win rate:** percentage of pairwise comparisons a model wins. Each pair is evaluated in both orders (original + swapped) to eliminate position bias.\n\n**Rankings example:**\n```\n  1. gpt4_baseline       [################----] 80.0%\n  2. qwen_candidate      [############--------] 60.0%\n  3. llama_finetuned     [##########----------] 50.0%\n```\n\n**Win matrix:** `win_matrix[A][B]` = how often model A beats model B across all queries.\n
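\nA minimal sketch of the arithmetic behind these numbers, using hypothetical judge verdicts (the pipeline's internal data structures will differ):\n\n```python\nfrom collections import defaultdict\n\n# One verdict per ordered comparison: each unordered pair is judged twice,\n# once per order, so position bias cancels out of the counts.\nverdicts = [\n    (\"gpt4\", \"qwen\", \"gpt4\"), (\"qwen\", \"gpt4\", \"gpt4\"),  # query 1\n    (\"gpt4\", \"qwen\", \"qwen\"), (\"qwen\", \"gpt4\", \"gpt4\"),  # query 2\n]\n\nwins, games = defaultdict(int), defaultdict(int)\nfor a, b, winner in verdicts:\n    games[a] += 1\n    games[b] += 1\n    wins[winner] += 1\n\n# Win rate = comparisons won / comparisons played, as in the rankings above.\nrankings = sorted(((m, wins[m] / games[m]) for m in games), key=lambda r: -r[1])\nfor rank, (model, win_rate) in enumerate(rankings, 1):\n    print(f\"{rank}. {model}: {win_rate:.1%}\")\n```\n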
\n## Checkpoint & resume\n\nThe pipeline saves progress after each step. Interrupted runs resume automatically:\n\n- `--fresh` — ignore checkpoint, start from scratch\n- `--rerun-judge` — re-run only the pairwise evaluation step (useful when switching judge models); keeps queries, responses, and rubrics intact\n- Adding new endpoints to config triggers incremental response collection; existing responses are preserved\n\n## Output files\n\n```\nevaluation_results/\n├── evaluation_results.json     # Rankings, win rates, win matrix\n├── evaluation_report.md        # Detailed Markdown report (if enabled)\n├── win_rate_chart.png          # Win-rate bar chart (if enabled)\n├── win_rate_matrix.png         # Matrix heatmap (if matrix_enabled)\n├── queries.json                # Generated test queries\n├── responses.json              # All model responses\n├── rubrics.json                # Generated evaluation rubrics\n├── comparison_details.json     # Pairwise comparison details\n└── checkpoint.json             # Pipeline checkpoint\n```\n
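\nTo post-process a run programmatically, read `evaluation_results.json` from the output directory; its exact key layout is not documented here, so inspect it before depending on a schema:\n\n```python\nimport json\nfrom pathlib import Path\n\n# Default output_dir plus the documented results file name.\npath = Path(\"evaluation_results\") / \"evaluation_results.json\"\nresults = json.loads(path.read_text())\nprint(sorted(results))  # list the top-level keys this version emits\n```\n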
\n## API key by model\n\n| Model prefix | Environment variable |\n|-------------|---------------------|\n| `gpt-*`, `o1-*`, `o3-*` | `OPENAI_API_KEY` |\n| `claude-*` | `ANTHROPIC_API_KEY` |\n| `qwen-*`, `dashscope/*` | `DASHSCOPE_API_KEY` |\n| `deepseek-*` | `DEEPSEEK_API_KEY` |\n| Custom endpoint | set `api_key` + `base_url` in config |\n\n## Additional resources\n\n- Full config examples: [cookbooks/auto_arena/examples/](../../cookbooks/auto_arena/examples/)\n- Documentation: [Auto Arena Guide](https://agentscope-ai.github.io/OpenJudge/applications/auto_arena/)","tags":["auto","arena","openjudge","agentscope-ai","agent","agent-skills","ai-agent","alignment","evaluation","grader","llm","reward"],"capabilities":["skill","source-agentscope-ai","skill-auto-arena","topic-agent","topic-agent-skills","topic-ai-agent","topic-alignment","topic-evaluation","topic-grader","topic-llm","topic-reward","topic-reward-model","topic-rlhf","topic-skill-md","topic-skills"],"categories":["OpenJudge"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/agentscope-ai/OpenJudge/auto-arena","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add agentscope-ai/OpenJudge","source_repo":"https://github.com/agentscope-ai/OpenJudge","install_from":"skills.sh"}},"qualityScore":"0.700","qualityRationale":"deterministic score 0.70 from registry signals: · indexed on github topic:agent-skills · 585 github stars · SKILL.md body (9,258 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-02T18:53:07.933Z","embedding":null,"createdAt":"2026-04-18T21:57:25.153Z","updatedAt":"2026-05-02T18:53:07.933Z","lastSeenAt":"2026-05-02T18:53:07.933Z","tsv":null,"prices":[{"id":"8e84519e-7ee8-4ecc-98da-528664922753","listingId":"0b863904-f934-4a22-8ecd-c0aac0e71abb","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"agentscope-ai","category":"OpenJudge","install_from":"skills.sh"},"createdAt":"2026-04-18T21:57:25.153Z"}],"sources":[{"listingId":"0b863904-f934-4a22-8ecd-c0aac0e71abb","source":"github","sourceId":"agentscope-ai/OpenJudge/auto-arena","sourceUrl":"https://github.com/agentscope-ai/OpenJudge/tree/main/skills/auto-arena","isPrimary":false,"firstSeenAt":"2026-04-18T21:57:25.153Z","lastSeenAt":"2026-05-02T18:53:07.933Z"}],"details":{"listingId":"0b863904-f934-4a22-8ecd-c0aac0e71abb","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"agentscope-ai","slug":"auto-arena","github":{"repo":"agentscope-ai/OpenJudge","stars":585,"topics":["agent","agent-skills","ai-agent","alignment","evaluation","grader","llm","reward","reward-model","rlhf","skill-md","skills"],"license":"apache-2.0","html_url":"https://github.com/agentscope-ai/OpenJudge","pushed_at":"2026-04-30T08:18:46Z","description":"OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards","skill_md_sha":"e1cd8e75a7e122f5c0344f41ed09d5453193e93c","skill_md_path":"skills/auto-arena/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/agentscope-ai/OpenJudge/tree/main/skills/auto-arena"},"layout":"multi","source":"github","category":"OpenJudge","frontmatter":{"name":"auto-arena","description":"Automatically evaluate and compare multiple AI models or agents without pre-existing test data. Generates test queries from a task description, collects responses from all target endpoints, auto-generates evaluation rubrics, runs pairwise comparisons via a judge model, and produces win-rate rankings with reports and charts. Supports checkpoint resume, incremental endpoint addition, and judge model hot-swap. Use when the user asks to compare, benchmark, or rank multiple models or agents on a custom task, or run an arena-style evaluation."},"skills_sh_url":"https://skills.sh/agentscope-ai/OpenJudge/auto-arena"},"updatedAt":"2026-05-02T18:53:07.933Z"}}