
ref-hallucination-arena

Benchmark LLM reference recommendation capabilities by verifying every cited paper against Crossref, PubMed, arXiv, and DBLP. Measures hallucination rate, per-field accuracy (title/author/year/DOI), discipline breakdown, and year constraint compliance. Supports an optional tool-augmented (ReAct + Tavily web search) mode.

Price: free
Protocol: skill
Verified: no

What it does

Reference Hallucination Arena Skill

Evaluate how accurately LLMs recommend real academic references using the OpenJudge RefArenaPipeline:

  1. Load queries — from JSON/JSONL dataset
  2. Collect responses — BibTeX-formatted references from target models
  3. Extract references — parse BibTeX entries from model output
  4. Verify references — cross-check against Crossref / PubMed / arXiv / DBLP
  5. Score & rank — compute verification rate, per-field accuracy, discipline breakdown
  6. Generate report — Markdown report + visualization charts
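
As a rough sketch of steps 3 and 4, the snippet below parses one BibTeX entry and looks its title up in the public Crossref works API. It assumes the third-party bibtexparser (v1 API) and requests packages, and is illustrative only; the pipeline's actual extraction and composite scoring are not shown here.

import bibtexparser  # assumption: third-party package (v1 API), not part of OpenJudge
import requests

entry_text = """@article{vaswani2017attention,
  title  = {Attention Is All You Need},
  author = {Vaswani, Ashish and others},
  year   = {2017}
}"""

# Step 3: parse a BibTeX entry out of the model output.
entry = bibtexparser.loads(entry_text).entries[0]

# Step 4: query Crossref for the closest bibliographic match.
resp = requests.get(
    "https://api.crossref.org/works",
    params={"query.bibliographic": entry["title"], "rows": 1},
    timeout=30,
)
top = resp.json()["message"]["items"][0]
print("candidate match:", (top.get("title") or [""])[0])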

Prerequisites

# Install OpenJudge
pip install py-openjudge

# Extra dependency for ref_hallucination_arena (chart generation)
pip install matplotlib

Gather from user before running

| Info | Required? | Notes |
| --- | --- | --- |
| Config YAML path | Yes | Defines endpoints, dataset, verification settings |
| Dataset path | Yes | JSON/JSONL file with queries (can be set in config) |
| API keys | Yes | Env vars: OPENAI_API_KEY, DASHSCOPE_API_KEY, etc. |
| Crossref email | No | Improves API rate limits for verification |
| PubMed API key | No | Improves PubMed rate limits |
| Output directory | No | Default: ./evaluation_results/ref_hallucination_arena |
| Report language | No | "zh" (default) or "en" |
| Tavily API key | No | Required only if using tool-augmented mode |
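
A quick pre-flight check along these lines can save a failed run. Which variables are required depends on the endpoints in your config; the optional env var names below (PUBMED_API_KEY, TAVILY_API_KEY) are assumptions, since the document only names the corresponding config fields.

import os

# Adjust to match the endpoints in your config.yaml.
required = ["OPENAI_API_KEY", "DASHSCOPE_API_KEY"]
optional = ["PUBMED_API_KEY", "TAVILY_API_KEY"]  # names are assumptions

missing = [k for k in required if not os.environ.get(k)]
if missing:
    raise SystemExit(f"Missing required env vars: {', '.join(missing)}")
for k in optional:
    if not os.environ.get(k):
        print(f"note: optional {k} not set")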

Quick start

CLI

# Run evaluation with config file
python -m cookbooks.ref_hallucination_arena --config config.yaml --save

# Resume from checkpoint (default behavior)
python -m cookbooks.ref_hallucination_arena --config config.yaml --save

# Start fresh, ignore checkpoint
python -m cookbooks.ref_hallucination_arena --config config.yaml --fresh --save

# Override output directory
python -m cookbooks.ref_hallucination_arena --config config.yaml \
  --output_dir ./my_results --save

Python API

import asyncio
from cookbooks.ref_hallucination_arena.pipeline import RefArenaPipeline

async def main():
    pipeline = RefArenaPipeline.from_config("config.yaml")
    result = await pipeline.evaluate()

    for rank, (model, score) in enumerate(result.rankings, 1):
        print(f"{rank}. {model}: {score:.1%}")

asyncio.run(main())
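
To persist the rankings, a few extra lines inside main() are enough; result.rankings (a sequence of model/score pairs) is the only result attribute this document confirms, so nothing else is assumed.

import json

# Inside main(), after the loop above: save the (model, score) pairs.
with open("rankings.json", "w") as f:
    json.dump(
        [{"model": model, "score": score} for model, score in result.rankings],
        f,
        indent=2,
    )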

CLI options

| Flag | Default | Description |
| --- | --- | --- |
| --config | (none) | Path to YAML configuration file (required) |
| --output_dir | config value | Override output directory |
| --save | False | Save results to file |
| --fresh | False | Start fresh, ignore checkpoint |

Minimal config file

task:
  description: "Evaluate LLM reference recommendation capabilities"

dataset:
  path: "./data/queries.json"

target_endpoints:
  model_a:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4"
    system_prompt: "You are an academic literature recommendation expert. Recommend {num_refs} real papers in BibTeX format. Only recommend papers you are confident actually exist."

  model_b:
    base_url: "https://dashscope.aliyuncs.com/compatible-mode/v1"
    api_key: "${DASHSCOPE_API_KEY}"
    model: "qwen3-max"
    system_prompt: "You are an academic literature recommendation expert. Recommend {num_refs} real papers in BibTeX format. Only recommend papers you are confident actually exist."

Full config reference

task

| Field | Required | Description |
| --- | --- | --- |
| description | Yes | Evaluation task description |
| scenario | No | Usage scenario |

dataset

| Field | Default | Description |
| --- | --- | --- |
| path | (none) | Path to JSON/JSONL dataset file (required) |
| shuffle | false | Shuffle queries before evaluation |
| max_queries | null | Max queries to use (null = all) |
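
Putting these fields together, a dataset block that shuffles and caps the run at 50 queries might look like this (the path is illustrative):

dataset:
  path: "./data/queries.jsonl"
  shuffle: true
  max_queries: 50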

target_endpoints.<name>

| Field | Default | Description |
| --- | --- | --- |
| base_url | (none) | API base URL (required) |
| api_key | (none) | API key, supports ${ENV_VAR} (required) |
| model | (none) | Model name (required) |
| system_prompt | built-in | System prompt; use {num_refs} placeholder |
| max_concurrency | 5 | Max concurrent requests for this endpoint |
| extra_params | (none) | Extra API request params (e.g. temperature) |
| tool_config.enabled | false | Enable ReAct agent with Tavily web search |
| tool_config.tavily_api_key | env var | Tavily API key |
| tool_config.max_iterations | 10 | Max ReAct iterations (1–30) |
| tool_config.search_depth | "advanced" | "basic" or "advanced" |
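
For tool-augmented mode, an endpoint entry extends the minimal config above with a tool_config block. Every field below comes from the table; only the ${TAVILY_API_KEY} env var name is an assumption.

target_endpoints:
  model_a:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4"
    tool_config:
      enabled: true
      tavily_api_key: "${TAVILY_API_KEY}"  # env var name is an assumption
      max_iterations: 10
      search_depth: "advanced"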

verification

| Field | Default | Description |
| --- | --- | --- |
| crossref_mailto | (none) | Email for Crossref polite pool |
| pubmed_api_key | (none) | PubMed API key |
| max_workers | 10 | Concurrent verification threads (1–50) |
| timeout | 30 | Per-request timeout in seconds |
| verified_threshold | 0.7 | Min composite score to count as VERIFIED |
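
As intuition for verified_threshold: each reference gets a composite score from its per-field matches and counts as VERIFIED at 0.7 or above. The weights below are invented for illustration; the pipeline's actual formula is not documented here.

# Illustrative only: these weights are assumptions, not the pipeline's.
WEIGHTS = {"title": 0.4, "author": 0.3, "year": 0.2, "doi": 0.1}

def composite_score(matches: dict) -> float:
    return sum(w for field, w in WEIGHTS.items() if matches.get(field))

score = composite_score({"title": True, "author": True, "year": True})
print(f"{score:.2f}", "meets threshold" if score >= 0.7 else "below threshold")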

evaluation

| Field | Default | Description |
| --- | --- | --- |
| timeout | 120 | Model API request timeout in seconds |
| retry_times | 3 | Number of retry attempts |

output

| Field | Default | Description |
| --- | --- | --- |
| output_dir | ./evaluation_results/ref_hallucination_arena | Output directory |
| save_queries | true | Save loaded queries |
| save_responses | true | Save model responses |
| save_details | true | Save verification details |

report

| Field | Default | Description |
| --- | --- | --- |
| enabled | true | Enable report generation |
| language | "zh" | Report language: "zh" or "en" |
| include_examples | 3 | Examples per section (1–10) |
| chart.enabled | true | Generate charts |
| chart.orientation | "vertical" | "horizontal" or "vertical" |
| chart.show_values | true | Show values on bars |
| chart.highlight_best | true | Highlight best model |
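
Assembled from the defaults above, a report block that switches to English output and horizontal bars:

report:
  enabled: true
  language: "en"                # default is "zh"
  include_examples: 3
  chart:
    enabled: true
    orientation: "horizontal"   # default is "vertical"
    show_values: true
    highlight_best: true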

Dataset format

Each query in the JSON/JSONL dataset:

{
  "query": "Please recommend papers on Transformer architectures for NLP.",
  "discipline": "computer_science",
  "num_refs": 5,
  "language": "en",
  "year_constraint": {"min_year": 2020}
}
| Field | Required | Description |
| --- | --- | --- |
| query | Yes | Prompt for reference recommendation |
| discipline | No | computer_science, biomedical, physics, chemistry, social_science, interdisciplinary, other |
| num_refs | No | Expected number of references (default: 5) |
| language | No | "zh" or "en" (default: "zh") |
| year_constraint | No | {"exact": 2023}, {"min_year": 2020}, {"max_year": 2015}, or {"min_year": 2020, "max_year": 2024} |
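
In JSONL form, each line is one such object. The two queries below are made-up examples that use only the fields defined above:

{"query": "Recommend recent papers on CRISPR off-target detection.", "discipline": "biomedical", "num_refs": 5, "language": "en", "year_constraint": {"min_year": 2021}}
{"query": "Recommend foundational papers on graph neural networks.", "discipline": "computer_science", "num_refs": 3, "language": "en"}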

Official dataset: OpenJudge/ref-hallucination-arena

Interpreting results

Overall accuracy (verification rate):

  • > 75% — Excellent: model rarely hallucinates references
  • 60–75% — Good: most references are real, some fabrication
  • 40–60% — Fair: significant hallucination, use with caution
  • < 40% — Poor: model frequently fabricates references

Per-field accuracy:

  • title_accuracy — % of titles matching real papers
  • author_accuracy — % of correct author lists
  • year_accuracy — % of correct publication years
  • doi_accuracy — % of valid DOIs

Verification status:

  • VERIFIED — title + author + year all exactly match a real paper
  • SUSPECT — partial match (e.g. title matches but authors differ)
  • NOT_FOUND — no match in any database
  • ERROR — API timeout or network failure

Ranking order: overall accuracy → year compliance rate → avg confidence → completeness
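
Expressed as a sort key, that tie-break order looks like the sketch below; the dictionary keys are assumptions about the per-model summary, not confirmed field names.

# Key names are illustrative assumptions; the order matches the line above.
model_summaries = [
    {"model": "model_a", "overall_accuracy": 0.74, "year_compliance": 0.90,
     "avg_confidence": 0.81, "completeness": 1.0},
    {"model": "model_b", "overall_accuracy": 0.74, "year_compliance": 0.95,
     "avg_confidence": 0.78, "completeness": 1.0},
]
ranked = sorted(
    model_summaries,
    key=lambda m: (m["overall_accuracy"], m["year_compliance"],
                   m["avg_confidence"], m["completeness"]),
    reverse=True,
)
print([m["model"] for m in ranked])  # model_b wins the year-compliance tie-break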

Output files

evaluation_results/ref_hallucination_arena/
├── evaluation_report.md          # Detailed Markdown report
├── evaluation_results.json       # Rankings, per-field accuracy, scores
├── verification_chart.png        # Per-field accuracy bar chart
├── discipline_chart.png          # Per-discipline accuracy chart
├── queries.json                  # Loaded evaluation queries
├── responses.json                # Raw model responses
├── extracted_refs.json           # Extracted BibTeX references
├── verification_results.json     # Per-reference verification details
└── checkpoint.json               # Pipeline checkpoint for resume
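
To inspect the saved results programmatically, loading the JSON is enough to start; the file's exact schema is not documented here, so discover the keys rather than assuming them.

import json
from pathlib import Path

out = Path("evaluation_results/ref_hallucination_arena")
results = json.loads((out / "evaluation_results.json").read_text())
print(sorted(results))  # list top-level keys before relying on the schema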

API key by model

| Model prefix | Environment variable |
| --- | --- |
| gpt-*, o1-*, o3-* | OPENAI_API_KEY |
| claude-* | ANTHROPIC_API_KEY |
| qwen-*, dashscope/* | DASHSCOPE_API_KEY |
| deepseek-* | DEEPSEEK_API_KEY |
| Custom endpoint | set api_key + base_url in config |
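
If you script config generation, the table translates directly into a prefix lookup; the mapping below just restates the rows above.

# Prefix -> env var, restating the table above.
ENV_BY_PREFIX = {
    "gpt-": "OPENAI_API_KEY",
    "o1-": "OPENAI_API_KEY",
    "o3-": "OPENAI_API_KEY",
    "claude-": "ANTHROPIC_API_KEY",
    "qwen-": "DASHSCOPE_API_KEY",
    "dashscope/": "DASHSCOPE_API_KEY",
    "deepseek-": "DEEPSEEK_API_KEY",
}

def env_var_for(model: str):
    for prefix, var in ENV_BY_PREFIX.items():
        if model.startswith(prefix):
            return var
    return None  # custom endpoint: set api_key + base_url in config

print(env_var_for("gpt-4"))  # OPENAI_API_KEY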

Capabilities

skill · source-agentscope-ai · skill-ref-hallucination-arena · topic-agent · topic-agent-skills · topic-ai-agent · topic-alignment · topic-evaluation · topic-grader · topic-llm · topic-reward · topic-reward-model · topic-rlhf · topic-skill-md · topic-skills

Quality

0.70 / 1.00

Deterministic score 0.70 from registry signals: indexed on GitHub topic:agent-skills · 585 GitHub stars · SKILL.md body (9,154 chars)

Provenance

Indexed from: github
Enriched: 2026-05-02 18:53:08Z · deterministic:skill-github:v1 · v1
First seen: 2026-04-18
Last seen: 2026-05-02
