{"id":"4b84ff88-4bb8-4f90-8228-51213515b941","shortId":"fXHwT9","kind":"skill","title":"ref-hallucination-arena","tagline":"Benchmark LLM reference recommendation capabilities by verifying every cited paper against Crossref, PubMed, arXiv, and DBLP. Measures hallucination rate, per-field accuracy (title/author/year/DOI), discipline breakdown, and year constraint compliance. Supports tool-augmented (ReAct + web search) mode.","description":"# Reference Hallucination Arena Skill\n\nEvaluate how accurately LLMs recommend real academic references using the\nOpenJudge `RefArenaPipeline`:\n\n1. **Load queries** — from JSON/JSONL dataset\n2. **Collect responses** — BibTeX-formatted references from target models\n3. **Extract references** — parse BibTeX entries from model output\n4. **Verify references** — cross-check against Crossref / PubMed / arXiv / DBLP\n5. **Score & rank** — compute verification rate, per-field accuracy, discipline breakdown\n6. **Generate report** — Markdown report + visualization charts\n\n## Prerequisites\n\n```bash\n# Install OpenJudge\npip install py-openjudge\n\n# Extra dependency for ref_hallucination_arena (chart generation)\npip install matplotlib\n```\n\n## Gather from user before running\n\n| Info | Required? | Notes |\n|------|-----------|-------|\n| Config YAML path | Yes | Defines endpoints, dataset, verification settings |\n| Dataset path | Yes | JSON/JSONL file with queries (can be set in config) |\n| API keys | Yes | Env vars: `OPENAI_API_KEY`, `DASHSCOPE_API_KEY`, etc. 
|\n| CrossRef email | No | Improves API rate limits for verification |\n| PubMed API key | No | Improves PubMed rate limits |\n| Output directory | No | Default: `./evaluation_results/ref_hallucination_arena` |\n| Report language | No | `\"zh\"` (default) or `\"en\"` |\n| Tavily API key | No | Required only if using tool-augmented mode |\n\n## Quick start\n\n### CLI\n\n```bash\n# Run evaluation with config file\npython -m cookbooks.ref_hallucination_arena --config config.yaml --save\n\n# Resume from checkpoint (default behavior)\npython -m cookbooks.ref_hallucination_arena --config config.yaml --save\n\n# Start fresh, ignore checkpoint\npython -m cookbooks.ref_hallucination_arena --config config.yaml --fresh --save\n\n# Override output directory\npython -m cookbooks.ref_hallucination_arena --config config.yaml \\\n  --output_dir ./my_results --save\n```\n\n### Python API\n\n```python\nimport asyncio\nfrom cookbooks.ref_hallucination_arena.pipeline import RefArenaPipeline\n\nasync def main():\n    pipeline = RefArenaPipeline.from_config(\"config.yaml\")\n    result = await pipeline.evaluate()\n\n    for rank, (model, score) in enumerate(result.rankings, 1):\n        print(f\"{rank}. {model}: {score:.1%}\")\n\nasyncio.run(main())\n```\n\n## CLI options\n\n| Flag | Default | Description |\n|------|---------|-------------|\n| `--config` | — | Path to YAML configuration file (required) |\n| `--output_dir` | config value | Override output directory |\n| `--save` | `False` | Save results to file |\n| `--fresh` | `False` | Start fresh, ignore checkpoint |\n\n## Minimal config file\n\n```yaml\ntask:\n  description: \"Evaluate LLM reference recommendation capabilities\"\n\ndataset:\n  path: \"./data/queries.json\"\n\ntarget_endpoints:\n  model_a:\n    base_url: \"https://api.openai.com/v1\"\n    api_key: \"${OPENAI_API_KEY}\"\n    model: \"gpt-4\"\n    system_prompt: \"You are an academic literature recommendation expert. Recommend {num_refs} real papers in BibTeX format. 
Only recommend papers you are confident actually exist.\"\n\n  model_b:\n    base_url: \"https://dashscope.aliyuncs.com/compatible-mode/v1\"\n    api_key: \"${DASHSCOPE_API_KEY}\"\n    model: \"qwen3-max\"\n    system_prompt: \"You are an academic literature recommendation expert. Recommend {num_refs} real papers in BibTeX format. Only recommend papers you are confident actually exist.\"\n```\n\n## Full config reference\n\n### task\n\n| Field | Required | Description |\n|-------|----------|-------------|\n| `description` | Yes | Evaluation task description |\n| `scenario` | No | Usage scenario |\n\n### dataset\n\n| Field | Default | Description |\n|-------|---------|-------------|\n| `path` | — | Path to JSON/JSONL dataset file (required) |\n| `shuffle` | `false` | Shuffle queries before evaluation |\n| `max_queries` | `null` | Max queries to use (`null` = all) |\n\n### target_endpoints.\\<name\\>\n\n| Field | Default | Description |\n|-------|---------|-------------|\n| `base_url` | — | API base URL (required) |\n| `api_key` | — | API key, supports `${ENV_VAR}` (required) |\n| `model` | — | Model name (required) |\n| `system_prompt` | built-in | System prompt; use `{num_refs}` placeholder |\n| `max_concurrency` | `5` | Max concurrent requests for this endpoint |\n| `extra_params` | — | Extra API request params (e.g. 
`temperature`) |\n| `tool_config.enabled` | `false` | Enable ReAct agent with Tavily web search |\n| `tool_config.tavily_api_key` | env var | Tavily API key |\n| `tool_config.max_iterations` | `10` | Max ReAct iterations (1–30) |\n| `tool_config.search_depth` | `\"advanced\"` | `\"basic\"` or `\"advanced\"` |\n\n### verification\n\n| Field | Default | Description |\n|-------|---------|-------------|\n| `crossref_mailto` | — | Email for Crossref polite pool |\n| `pubmed_api_key` | — | PubMed API key |\n| `max_workers` | `10` | Concurrent verification threads (1–50) |\n| `timeout` | `30` | Per-request timeout in seconds |\n| `verified_threshold` | `0.7` | Min composite score to count as VERIFIED |\n\n### evaluation\n\n| Field | Default | Description |\n|-------|---------|-------------|\n| `timeout` | `120` | Model API request timeout in seconds |\n| `retry_times` | `3` | Number of retry attempts |\n\n### output\n\n| Field | Default | Description |\n|-------|---------|-------------|\n| `output_dir` | `./evaluation_results/ref_hallucination_arena` | Output directory |\n| `save_queries` | `true` | Save loaded queries |\n| `save_responses` | `true` | Save model responses |\n| `save_details` | `true` | Save verification details |\n\n### report\n\n| Field | Default | Description |\n|-------|---------|-------------|\n| `enabled` | `true` | Enable report generation |\n| `language` | `\"zh\"` | Report language: `\"zh\"` or `\"en\"` |\n| `include_examples` | `3` | Examples per section (1–10) |\n| `chart.enabled` | `true` | Generate charts |\n| `chart.orientation` | `\"vertical\"` | `\"horizontal\"` or `\"vertical\"` |\n| `chart.show_values` | `true` | Show values on bars |\n| `chart.highlight_best` | `true` | Highlight best model |\n\n## Dataset format\n\nEach query in the JSON/JSONL dataset:\n\n```json\n{\n  \"query\": \"Please recommend papers on Transformer architectures for NLP.\",\n  \"discipline\": \"computer_science\",\n  \"num_refs\": 5,\n  \"language\": \"en\",\n  
\"year_constraint\": {\"min_year\": 2020}\n}\n```\n\n| Field | Required | Description |\n|-------|----------|-------------|\n| `query` | Yes | Prompt for reference recommendation |\n| `discipline` | No | `computer_science`, `biomedical`, `physics`, `chemistry`, `social_science`, `interdisciplinary`, `other` |\n| `num_refs` | No | Expected number of references (default: 5) |\n| `language` | No | `\"zh\"` or `\"en\"` (default: `\"zh\"`) |\n| `year_constraint` | No | `{\"exact\": 2023}`, `{\"min_year\": 2020}`, `{\"max_year\": 2015}`, or `{\"min_year\": 2020, \"max_year\": 2024}` |\n\nOfficial dataset: [OpenJudge/ref-hallucination-arena](https://huggingface.co/datasets/OpenJudge/ref-hallucination-arena)\n\n## Interpreting results\n\n**Overall accuracy (verification rate):**\n- **> 75%** — Excellent: model rarely hallucinates references\n- **60–75%** — Good: most references are real, some fabrication\n- **40–60%** — Fair: significant hallucination, use with caution\n- **< 40%** — Poor: model frequently fabricates references\n\n**Per-field accuracy:**\n- `title_accuracy` — % of titles matching real papers\n- `author_accuracy` — % of correct author lists\n- `year_accuracy` — % of correct publication years\n- `doi_accuracy` — % of valid DOIs\n\n**Verification status:**\n- `VERIFIED` — title + author + year all exactly match a real paper\n- `SUSPECT` — partial match (e.g. 
title matches but authors differ)\n- `NOT_FOUND` — no match in any database\n- `ERROR` — API timeout or network failure\n\n**Ranking order:** overall accuracy → year compliance rate → avg confidence → completeness\n\n## Output files\n\n```\nevaluation_results/ref_hallucination_arena/\n├── evaluation_report.md          # Detailed Markdown report\n├── evaluation_results.json       # Rankings, per-field accuracy, scores\n├── verification_chart.png        # Per-field accuracy bar chart\n├── discipline_chart.png          # Per-discipline accuracy chart\n├── queries.json                  # Loaded evaluation queries\n├── responses.json                # Raw model responses\n├── extracted_refs.json           # Extracted BibTeX references\n├── verification_results.json     # Per-reference verification details\n└── checkpoint.json               # Pipeline checkpoint for resume\n```\n\n## API key by model\n\n| Model prefix | Environment variable |\n|-------------|---------------------|\n| `gpt-*`, `o1-*`, `o3-*` | `OPENAI_API_KEY` |\n| `claude-*` | `ANTHROPIC_API_KEY` |\n| `qwen-*`, `dashscope/*` | `DASHSCOPE_API_KEY` |\n| `deepseek-*` | `DEEPSEEK_API_KEY` |\n| Custom endpoint | set `api_key` + `base_url` in config |\n\n## Additional resources\n\n- Full config examples: [cookbooks/ref_hallucination_arena/examples/](../../cookbooks/ref_hallucination_arena/examples/)\n- Documentation: [docs/validating_graders/ref_hallucination_arena.md](../../docs/validating_graders/ref_hallucination_arena.md)\n- Official dataset: [HuggingFace](https://huggingface.co/datasets/OpenJudge/ref-hallucination-arena)\n- Leaderboard: 
[openjudge.me/leaderboard](https://openjudge.me/leaderboard)","tags":["ref","hallucination","arena","openjudge","agentscope-ai","agent","agent-skills","ai-agent","alignment","evaluation","grader","llm"],"capabilities":["skill","source-agentscope-ai","skill-ref-hallucination-arena","topic-agent","topic-agent-skills","topic-ai-agent","topic-alignment","topic-evaluation","topic-grader","topic-llm","topic-reward","topic-reward-model","topic-rlhf","topic-skill-md","topic-skills"],"categories":["OpenJudge"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/agentscope-ai/OpenJudge/ref-hallucination-arena","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add agentscope-ai/OpenJudge","source_repo":"https://github.com/agentscope-ai/OpenJudge","install_from":"skills.sh"}},"qualityScore":"0.700","qualityRationale":"deterministic score 0.70 from registry signals: · indexed on github topic:agent-skills · 585 github stars · SKILL.md body (9,154 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-02T18:53:08.512Z","embedding":null,"createdAt":"2026-04-18T21:57:30.615Z","updatedAt":"2026-05-02T18:53:08.512Z","lastSeenAt":"2026-05-02T18:53:08.512Z","tsv":null,"prices":[{"id":"ec0b381f-a29c-40da-9cfa-3a81573cd870","listingId":"4b84ff88-4bb8-4f90-8228-51213515b941","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"agentscope-ai","category":"OpenJudge","install_from":"skills.sh"},"createdAt":"2026-04-18T21:57:30.615Z"}],"sources":[{"listingId":"4b84ff88-4bb8-4f90-8228-51213515b941","source":"github","sourceId":"agentscope-ai/OpenJudge/ref-hallucination-arena","sourceUrl":"https://github.com/agentscope-ai/OpenJudge/tree/main/skills/ref-hallucination-arena","isPrimary":false,"firstSeenAt":"2026-04-18T21:57:30.615Z","lastSeenAt":"2026-05-02T18:53:08.512Z"}],"details":{"listingId":"4b84ff88-4bb8-4f90-8228-51213515b941","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"agen
tscope-ai","slug":"ref-hallucination-arena","github":{"repo":"agentscope-ai/OpenJudge","stars":585,"topics":["agent","agent-skills","ai-agent","alignment","evaluation","grader","llm","reward","reward-model","rlhf","skill-md","skills"],"license":"apache-2.0","html_url":"https://github.com/agentscope-ai/OpenJudge","pushed_at":"2026-04-30T08:18:46Z","description":"OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards","skill_md_sha":"6f768842bff8d57de39fbe04395e3ad9ab338bce","skill_md_path":"skills/ref-hallucination-arena/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/agentscope-ai/OpenJudge/tree/main/skills/ref-hallucination-arena"},"layout":"multi","source":"github","category":"OpenJudge","frontmatter":{"name":"ref-hallucination-arena","description":"Benchmark LLM reference recommendation capabilities by verifying every cited paper against Crossref, PubMed, arXiv, and DBLP. Measures hallucination rate, per-field accuracy (title/author/year/DOI), discipline breakdown, and year constraint compliance. Supports tool-augmented (ReAct + web search) mode. Use when the user asks to evaluate, benchmark, or compare models on academic reference hallucination, literature recommendation quality, or citation accuracy."},"skills_sh_url":"https://skills.sh/agentscope-ai/OpenJudge/ref-hallucination-arena"},"updatedAt":"2026-05-02T18:53:08.512Z"}}