ref-downloader
Use when the user asks to batch-download all references for a paper from a DOI or local PDF with ref-downloader. Not for one-off PDFs, paper search, or Zotero import.
What it does
Ref Downloader — Agent Runbook
Slim entry for agent mode. The full 8-step manual runbook with code snippets, DOI-resolution fallback chain, and PUBLISHER_MAP extension procedure lives in references/agent-runbook.md. Human users see ../../README.md.
<SKILL_DIR> = this folder (skills/ref-downloader in the source repo, or wherever the user copied this skill — e.g. ~/.claude/skills/ref-downloader/). Python scripts live in <SKILL_DIR>/scripts/; config files (config.example.toml, config.local.toml) live at <SKILL_DIR>/.
Install prerequisites (before first invocation)
The skill protocol can't manage Python deps. If python -c "import playwright" fails, the user needs:
cd "<SKILL_DIR>"
pip install playwright pymupdf
playwright install msedge # downloads Edge driver
cp config.example.toml config.local.toml # then user edits [crossref].mailto
If the user is developing from the source repo instead of an installed skill copy, they can also install from the repo root with pip install -r requirements.txt -r requirements-dev.txt.
When to invoke
Trigger phrases:
- "帮我下载 [paper / DOI] 的参考文献" / "批量下载引用文献" / "把这篇论文的所有引用下载下来"
- "Download all refs of [paper / DOI]" / "Batch-download every reference"
- User provides a DOI (
10.x/yform) or local PDF path and asks for "all references" / "全部参考文献"
Don't invoke for:
- Downloading one arbitrary PDF (not a reference list)
- Generic web scraping
- Paper search / Zotero import — different tools
Primary entry
python "<SKILL_DIR>/scripts/run_ref_downloader.py" <DOI_OR_PDF_PATH>
The wrapper handles DOI resolution (Zotero → fitz fallback), output-dir layout,
sequential 3-stage pipeline (extract_refs.py → validate_refs.py →
download_refs.py), and end-of-run cleanup.
Useful flags:
--yes— non-interactive (CI/batch), overwrite prompts default-yes--auto— forwarded todownload_refs.py: skip "press Enter" confirm + shorter challenge wait--output-dir <path>— override default output location--config <path>— alternate TOML config (overridesconfig.local.toml)
Pre-flight checklist (confirm before running)
- DOI correct? Echo back to user:
即将下载参考文献:DOI=<doi> - Edge fully closed? All
msedge.exeprocesses killed (Task Manager check). The script claims the user's persistent Edge profile and needs exclusive access. - Config set? First-run users need
<SKILL_DIR>/config.local.tomlwith[crossref].mailto. Missing config → wrapper prints a WARNING but continues with placeholder defaults. - Output location agreed? Default for DOI input:
<cwd>/<project_name>_refs/. For PDF input:<pdf_dir>/<pdf_stem>_refs/. Override with--output-dir.
Output layout
<OUTPUT_DIR>/
├── <PROJECT_NAME>/
│ ├── refs_raw.json # extract_refs.py output
│ ├── refs_validated.json # validate_refs.py output
│ ├── download_report.csv # per-ref status (only on graceful completion)
│ ├── *.pdf # reference PDFs
│ └── *_SI.pdf # supplementary files (where supported)
└── runs/<timestamp>-round-03/
└── events.jsonl # full event trace per ref
Interruption note: if the run is interrupted (Ctrl+C / Edge crash / VPN drop),
the root download_report.csv may be stale. Trust the latest
runs/<timestamp>/events.jsonl + actual files in <PROJECT_NAME>/.
Common failure modes
| Status / symptom | Meaning | Action |
|---|---|---|
manual_pending (auth_redirect) | Bounced to institution SSO | User signs in via live Edge tab; re-run (incremental skips done refs) |
manual_pending (challenge_timeout) | Cloudflare / publisher challenge unsolved in time | Re-run interactively; solve captcha when prompted |
manual_pending (elsevier_crasolve_shell) | Elsevier viewer stuck in transition | Hot-session auto-retry usually picks it up; else manual click in live page |
failed (auto) | Generic auto path failed | Check events.jsonl for that ref; may need a publisher-specific patch |
ignored (ignored_institution_access) | DOI listed in [institution].ignored_access_dois | Skip-by-design; remove from config to retry |
| Edge won't launch | Background msedge.exe still holding profile | Kill all msedge.exe in Task Manager, re-run |
ModuleNotFoundError: playwright | Install prereqs not done | See "Install prerequisites" section above |
WARNING: crossref.mailto is the placeholder | First-run config uncustomized | Edit <SKILL_DIR>/config.local.toml → set [crossref].mailto to a real email (Crossref polite pool) |
Manual / debug mode
If the wrapper fails partway, run the 3 scripts standalone for partial re-execution:
python <SKILL_DIR>/scripts/extract_refs.py <DOI> # → refs_raw.json
python <SKILL_DIR>/scripts/validate_refs.py <PROJECT> # → refs_validated.json
python <SKILL_DIR>/scripts/download_refs.py <PROJECT> # → PDFs + download_report.csv
Full 8-step manual flow with code snippets, DOI-resolution fallback chain (Zotero query → fitz text → user prompt), and the procedure for extending PUBLISHER_MAP when encountering an unknown DOI prefix → references/agent-runbook.md.
See also
- ../../README.md — human-facing setup, install, usage examples
- references/agent-runbook.md — full manual runbook with code snippets
- ../../docs/SUPPORTED_PUBLISHERS.md — publisher tier matrix
- ../../CONTRIBUTING.md — adding new publisher / institution SSO
- ../../SECURITY.md — Edge profile cookie risk
- config.example.toml — full config schema
Capabilities
Install
Quality
deterministic score 0.49 from registry signals: · indexed on github topic:agent-skills · 78 github stars · SKILL.md body (5,836 chars)