{"id":"138a3646-39d4-463c-be01-5330d6551703","shortId":"VAQDB3","kind":"skill","title":"fulltext-retrieval","tagline":"Batch download open-access PDFs by DOI using legitimate OA APIs (Unpaywall, PMC, OpenAlex, Crossref). Optional PDF→Markdown conversion for token-efficient LLM analysis.","description":"# Fulltext Retrieval Skill\n\nBatch download open-access full-text PDFs from a DOI list using legitimate OA APIs only.\n\n## Pipeline\n\n```\nDOI list → Unpaywall → PMC (Europe PMC / OA FTP / web) → OpenAlex → Crossref → landing page\n```\n\nEach DOI goes through these sources in order until a valid PDF (≥10 KB, `%PDF-` header) is found.\n\n## Quick Start\n\n```bash\n# Prepare a DOI list (one per line)\ncat > dois.txt << 'EOF'\n10.1007/s00330-010-1783-x\n10.1002/mp.12524\n10.1148/radiol.13131265\nEOF\n\n# Run\npython fetch_oa.py dois.txt --output pdfs/ --email your@email.com\n\n# Verbose mode for debugging\npython fetch_oa.py dois.txt -o pdfs/ -e your@email.com --verbose\n```\n\n## Input Formats\n\n**Plain text** — one DOI per line:\n```\n10.1007/s00330-010-1783-x\n10.1002/mp.12524\n```\n\n**TSV with header** — must contain a `DOI` column, optional `PMID` column:\n```tsv\nID\tTitle\tDOI\tPMID\tYear\n1\tSome paper\t10.1007/s00330-010-1783-x\t20628747\t2010\n```\n\nWhen a PMID is available, the PMC lookup is more reliable (PMID → PMCID conversion).\n\n## PMC Download (JS-Challenge Resistant)\n\nPMC web pages may block automated downloads with JavaScript proof-of-work challenges. 
This tool uses three fallback methods:\n\n### Method A: Europe PMC REST API (most reliable)\n\n```bash\nPMCID=\"PMC9733600\"\ncurl -sLo output.pdf \\\n  \"https://europepmc.org/backend/ptpmcrender.fcgi?accid=${PMCID}&blobtype=pdf\"\n```\n\n### Method B: PMC OA FTP Service\n\n```bash\ncurl -s \"https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=${PMCID}\" | \\\n    grep -oE 'href=\"[^\"]*\\.pdf\"' | head -1 | \\\n    sed 's/href=\"//;s/\"//' | xargs curl -sLo output.pdf\n```\n\n### DOI/PMID → PMCID Conversion\n\n```bash\n# Works with both DOI and PMID\ncurl -s \"https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?ids=${DOI}&format=json\" | \\\n    python3 -c \"import sys,json; print(json.load(sys.stdin)['records'][0].get('pmcid',''))\"\n```\n\n## Output\n\n- PDFs saved as `{DOI_safe}.pdf` (slashes replaced with underscores)\n- `manual_needed.txt` — DOIs that could not be retrieved via OA\n- Summary with OA/PMC/fail/skip counts\n\n## Requirements\n\n- Python 3.10+ (stdlib only, no pip dependencies)\n- Contact email (required by Unpaywall Terms of Service)\n\n## API Policies\n\n| Source | Rate Limit | Notes |\n|--------|-----------|-------|\n| Unpaywall | 100,000 req/day | Email required |\n| NCBI PMC | 3 req/sec without API key | Add `&api_key=` for higher limits |\n| OpenAlex | 100k req/day | Polite pool with email in User-Agent |\n| Crossref | 50 req/sec with email | Plus service with `mailto:` in UA |\n| Europe PMC | No documented limit | Be polite, ≤1 req/sec recommended |\n\nThe script uses 0.3–0.5 second delays between requests.\n\n## PDF → Markdown Conversion (Optional)\n\nAfter downloading PDFs, convert them to LLM-friendly Markdown for token-efficient repeated analysis. 
Uses [pymupdf4llm](https://github.com/pymupdf/RAG) — optimized for academic papers with two-column layout handling and table preservation.\n\n### Quick Start\n\n```bash\n# Install (one-time)\npip install pymupdf4llm\n\n# Convert all PDFs in a directory\npython pdf_to_md.py pdfs/\n\n# Convert with verbose output\npython pdf_to_md.py pdfs/ -v\n\n# Custom output directory\npython pdf_to_md.py pdfs/ -o markdown/\n\n# First 10 pages only (useful for long supplements)\npython pdf_to_md.py pdfs/ --pages 0-9\n\n# Overwrite existing conversions\npython pdf_to_md.py pdfs/ --force\n```\n\n### Combined Workflow\n\n```bash\n# Step 1: Download PDFs\npython fetch_oa.py dois.txt -o pdfs/ -e your@email.com\n\n# Step 2: Convert to Markdown (only successful downloads)\npython pdf_to_md.py pdfs/ -v\n```\n\nAfter conversion, `.md` files sit alongside `.pdf` files. Claude Code can then use `Read` for full content or `Grep` for targeted extraction — significantly more token-efficient than re-reading PDFs.\n\n### When to Convert\n\n| Scenario | Recommendation |\n|----------|---------------|\n| Screening/triage (read once) | Skip — read PDF directly |\n| Data extraction from k≥5 studies | Convert — repeated reads save tokens |\n| Meta-analysis full pipeline | Convert — papers referenced across multiple phases |\n| Single paper deep review | Optional — marginal benefit |\n\n### Academic Paper Defaults\n\n- **Images**: Skipped (saves tokens; figures referenced by caption text)\n- **Tables**: `lines_strict` strategy (preserves grid-line tables accurately)\n- **Layout**: Two-column academic layout handled automatically\n- **Headers/footers**: Removed by pymupdf4llm\n\n### Dependency Note\n\n`pdf_to_md.py` requires [pymupdf4llm](https://pypi.org/project/pymupdf4llm/) (AGPL-3.0). This is an **optional** dependency — `fetch_oa.py` remains stdlib-only with zero external dependencies. 
The AGPL license applies to pymupdf4llm itself, not to this skill.\n\n## Limitations\n\n- Only retrieves **open-access** articles. Paywalled articles require institutional access.\n- Landing page scraping may fail on publisher-specific JavaScript-heavy pages.\n- Some recent articles may not yet be indexed by OA sources.\n- PDF→Markdown quality depends on the PDF's text layer. Scanned-only PDFs may produce poor output.\n\n## Anti-Hallucination\n\n- **Never fabricate file paths, URLs, DOIs, or package names.** Verify existence before recommending.\n- **Never invent journal metadata, impact factors, or submission policies** without verification at the journal's website.\n- If a tool, package, or resource does not exist or you are unsure, say so explicitly rather than guessing.","tags":["fulltext","retrieval","medsci","skills","aperivue","agent-skills","biostatistics","claude-code","claude-skills","clinical-research","diagnostic-accuracy","irb-protocol"],"capabilities":["skill","source-aperivue","skill-fulltext-retrieval","topic-agent-skills","topic-biostatistics","topic-claude-code","topic-claude-skills","topic-clinical-research","topic-diagnostic-accuracy","topic-irb-protocol","topic-literature-review","topic-manuscript","topic-medical-ai","topic-medical-research","topic-meta-analysis"],"categories":["medsci-skills"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/Aperivue/medsci-skills/fulltext-retrieval","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add Aperivue/medsci-skills","source_repo":"https://github.com/Aperivue/medsci-skills","install_from":"skills.sh"}},"qualityScore":"0.499","qualityRationale":"deterministic score 0.50 from registry signals: · indexed on github topic:agent-skills · 98 github stars · SKILL.md body (5,375 
chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-18T18:56:29.882Z","embedding":null,"createdAt":"2026-05-13T12:57:44.907Z","updatedAt":"2026-05-18T18:56:29.882Z","lastSeenAt":"2026-05-18T18:56:29.882Z","tsv":"'-1':237 '-3.0':616 '-9':466 '/backend/ptpmcrender.fcgi?accid=$':215 '/mp.12524':99,134 '/pmc/utils/idconv/v1.0/?ids=$':259 '/pmc/utils/oa/oa.fcgi?id=$':230 '/project/pymupdf4llm/)':614 '/pymupdf/rag)':404 '/radiol.13131265':101 '/s00330-010-1783-x':97,132,156 '0':272,465 '0.3':374 '0.5':375 '1':152,368,478 '10':77,454 '10.1002':98,133 '10.1007':96,131,155 '10.1148':100 '100':322 '100k':340 '2':489 '2010':158 '20628747':157 '3':328 '3.10':301 '5':548 '50':351 'academ':407,573,599 'access':8,37,647,653 'accur':594 'across':563 'add':333 'agent':349 'agpl':615,632 'alongsid':505 'analysi':29,399,557 'anti':697 'anti-hallucin':696 'api':15,49,204,315,331,334 'appli':634 'articl':648,650,669 'autom':184 'automat':602 'avail':163 'b':220 'bash':85,207,225,248,420,476 'batch':4,33 'benefit':572 'blobtyp':217 'block':183 'c':264 'caption':583 'cat':93 'challeng':177,192 'claud':508 'code':509 'column':142,145,412,598 'combin':474 'contact':307 'contain':139 'content':516 'convers':23,172,247,382,469,501 'convert':387,428,437,490,534,550,560 'could':289 'count':298 'crossref':19,62,350 'curl':210,226,242,255 'custom':445 'data':544 'debug':114 'deep':568 'default':575 'delay':377 'depend':306,607,621,630,681 'direct':543 'directori':433,447 'document':364 'doi':11,44,52,66,88,128,141,149,252,260,279,287,704 'doi/pmid':245 'dois.txt':94,106,117,483 'download':5,34,174,185,385,479,495 'e':120,486 'effici':27,397,526 'email':109,308,324,345,354 'eof':95,102 'europ':56,201,361 'europepmc.org':214 
'europepmc.org/backend/ptpmcrender.fcgi?accid=$':213 'exist':468,709,736 'explicit':743 'extern':629 'extract':521,545 'fabric':700 'factor':717 'fail':658 'fallback':197 'fetch_oa.py':105,116,482,622 'figur':580 'file':503,507,701 'first':453 'forc':473 'format':124,261 'found':82 'friend':392 'ftp':59,223 'full':39,515,558 'full-text':38 'fulltext':2,30 'fulltext-retriev':1 'get':273 'github.com':403 'github.com/pymupdf/rag)':402 'goe':67 'grep':232,518 'grid':591 'grid-lin':590 'guess':746 'hallucin':698 'handl':414,601 'head':236 'header':80,137 'headers/footers':603 'heavi':665 'higher':337 'href':234 'id':147 'imag':576 'impact':716 'import':265 'index':674 'input':123 'instal':421,426 'institut':652 'invent':713 'javascript':187,664 'javascript-heavi':663 'journal':714,725 'js':176 'js-challeng':175 'json':262,267 'json.load':269 'k':547 'kb':78 'key':332,335 'land':63,654 'layer':687 'layout':413,595,600 'legitim':13,47 'licens':633 'limit':319,338,365,642 'line':92,130,586,592 'list':45,53,89 'llm':28,391 'llm-friend':390 'long':459 'lookup':166 'mailto':358 'manual_needed.txt':286 'margin':571 'markdown':22,381,393,452,492,679 'may':182,657,670,692 'md':502 'meta':556 'meta-analysi':555 'metadata':715 'method':198,199,219 'mode':112 'multipl':564 'must':138 'name':707 'ncbi':326 'never':699,712 'note':320,608 'o':118,451,484 'oa':14,48,58,222,294,676 'oa/pmc/fail/skip':297 'oe':233 'one':90,127,423 'one-tim':422 'open':7,36,646 'open-access':6,35,645 'openalex':18,61,339 'optim':405 'option':20,143,383,570,620 'order':72 'output':107,275,440,446,695 'output.pdf':212,244 'overwrit':467 'packag':706,731 'page':64,181,455,464,655,666 'paper':154,408,561,567,574 'path':702 'paywal':649 'pdf':21,76,79,218,235,281,380,506,542,678,684 'pdf_to_md.py':435,442,449,462,471,497,609 'pdfs':9,41,108,119,276,386,430,436,443,450,463,472,480,485,498,531,691 'per':91,129 'phase':565 'pip':305,425 'pipelin':51,559 'plain':125 'plus':355 
'pmc':17,55,57,165,173,179,202,221,327,362 'pmc9733600':209 'pmcid':171,208,216,231,246,274 'pmid':144,150,161,170,254 'polici':316,720 'polit':342,367 'pool':343 'poor':694 'prepar':86 'preserv':417,589 'print':268 'produc':693 'proof':189 'proof-of-work':188 'publish':661 'publisher-specif':660 'pymupdf4llm':401,427,606,611,636 'pypi.org':613 'pypi.org/project/pymupdf4llm/)':612 'python':104,115,300,434,441,448,461,470,481,496 'python3':263 'qualiti':680 'quick':83,418 'rate':318 'rather':744 're':529 're-read':528 'read':513,530,538,541,552 'recent':668 'recommend':370,536,711 'record':271 'referenc':562,581 'reliabl':169,206 'remain':623 'remov':604 'repeat':398,551 'replac':283 'req/day':341 'req/sec':323,329,352,369 'request':379 'requir':299,309,325,610,651 'resist':178 'resourc':733 'rest':203 'retriev':3,31,292,644 'review':569 'run':103 's/href':239 'safe':280 'save':277,553,578 'say':741 'scan':689 'scanned-on':688 'scenario':535 'scrape':656 'screening/triage':537 'script':372 'second':376 'sed':238 'servic':224,314,356 'signific':522 'singl':566 'sit':504 'skill':32,641 'skill-fulltext-retrieval' 'skip':540,577 'slash':282 'slo':211,243 'sourc':70,317,677 'source-aperivue' 'specif':662 'start':84,419 'stdlib':302,625 'stdlib-on':624 'step':477,488 'strategi':588 'strict':587 'studi':549 'submiss':719 'success':494 'summari':295 'supplement':460 'sys':266 'sys.stdin':270 'tabl':416,585,593 'target':520 'term':312 'text':40,126,584,686 'three':196 'time':424 'titl':148 'token':26,396,525,554,579 'token-effici':25,395,524 'tool':194,730 'topic-agent-skills' 'topic-biostatistics' 'topic-claude-code' 'topic-claude-skills' 'topic-clinical-research' 'topic-diagnostic-accuracy' 'topic-irb-protocol' 'topic-literature-review' 'topic-manuscript' 'topic-medical-ai' 'topic-medical-research' 'topic-meta-analysis' 'tsv':135,146 'two':411,597 'two-column':410,596 'ua':360 'underscor':285 'unpaywal':16,54,311,321 'unsur':740 'url':703 'use':12,46,195,373,400,457,512 
'user':348 'user-ag':347 'v':444,499 'valid':75 'verbos':111,122,439 'verif':722 'verifi':708 'via':293 'web':60,180 'websit':727 'without':330,721 'work':191,249 'workflow':475 'www.ncbi.nlm.nih.gov':229,258 'www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?ids=$':257 'www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=$':228 'xarg':241 'year':151 'yet':672 'your@email.com':110,121,487 'zero':628","prices":[{"id":"dd0a5d79-02e9-4d85-817f-2429079415ce","listingId":"138a3646-39d4-463c-be01-5330d6551703","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"Aperivue","category":"medsci-skills","install_from":"skills.sh"},"createdAt":"2026-05-13T12:57:44.907Z"}],"sources":[{"listingId":"138a3646-39d4-463c-be01-5330d6551703","source":"github","sourceId":"Aperivue/medsci-skills/fulltext-retrieval","sourceUrl":"https://github.com/Aperivue/medsci-skills/tree/main/skills/fulltext-retrieval","isPrimary":false,"firstSeenAt":"2026-05-13T12:57:44.907Z","lastSeenAt":"2026-05-18T18:56:29.882Z"}],"details":{"listingId":"138a3646-39d4-463c-be01-5330d6551703","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"Aperivue","slug":"fulltext-retrieval","github":{"repo":"Aperivue/medsci-skills","stars":98,"topics":["agent-skills","biostatistics","claude-code","claude-skills","clinical-research","diagnostic-accuracy","irb-protocol","literature-review","manuscript","medical-ai","medical-research","meta-analysis","physician-researcher","prisma","pubmed","radiology","reporting-guidelines","strobe","systematic-review","tripod-ai"],"license":"other","html_url":"https://github.com/Aperivue/medsci-skills","pushed_at":"2026-05-17T20:50:52Z","description":"Claude Code skills for medical research — literature search, reporting guidelines, statistical analysis, 
publication figures. Built by a physician-researcher, tested on real publications. MIT licensed.","skill_md_sha":"f6bff2de7ca8bb96a5b83dbb685076465bb03b37","skill_md_path":"skills/fulltext-retrieval/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/Aperivue/medsci-skills/tree/main/skills/fulltext-retrieval"},"layout":"multi","source":"github","category":"medsci-skills","frontmatter":{"name":"fulltext-retrieval","description":"Batch download open-access PDFs by DOI using legitimate OA APIs (Unpaywall, PMC, OpenAlex, Crossref). Optional PDF→Markdown conversion for token-efficient LLM analysis."},"skills_sh_url":"https://skills.sh/Aperivue/medsci-skills/fulltext-retrieval"},"updatedAt":"2026-05-18T18:56:29.882Z"}}