{"id":"344f4dac-f5ba-47df-b847-95816d0ee577","shortId":"8DsM6n","kind":"skill","title":"smart-web-fetch","tagline":"Fetch web content efficiently by checking llms.txt first, then Cloudflare markdown endpoints, then falling back to HTML. Reduces token usage by 80% on sites that support clean markdown delivery. No external dependencies — installs a single Python script. Trigger words: fetch URL,","description":"# smart-web-fetch — Token-Efficient Web Content Fetching\n\nFetching a webpage with the default WebFetch tool retrieves full HTML — navigation menus, footers, ads, cookie banners, and all. For a documentation page, as much as 90% of the tokens can go to chrome, not content. This script fixes that by trying cleaner sources first.\n\n## How It Works\n\nThe fetch chain, in order:\n\n1. **Check `llms.txt`** — Many sites publish `/llms.txt` or `/llms-full.txt` with curated content for AI agents. If present, this is the best source: intentionally structured, no noise.\n2. **Try a markdown rendering** — The script below does not call a Cloudflare endpoint at this step. It prefixes the URL with `https://r.jina.ai/`, the Jina AI Reader proxy, which returns the page as markdown and needs no API key for basic use. When that works, the result is structured markdown at roughly 20% of the HTML token cost.\n3. **Fall back to HTML** — Standard fetch, with HTML stripped to readable text. 
Reliable but verbose.\n\nThe result: typically 60-80% fewer tokens on documentation sites, blog posts, and product pages.\n\n---\n\n## Installation\n\nCopy the script into your project's scripts directory:\n\n```bash\nmkdir -p .claude/scripts\n```\n\nThen create `.claude/scripts/smart-fetch.py` with the contents below.\n\n---\n\n## The Script\n\nSave this as `.claude/scripts/smart-fetch.py`:\n\n```python\n#!/usr/bin/env python3\n\"\"\"\nsmart-fetch.py — Token-efficient web content fetching.\nTries llms.txt, then a markdown rendering of the page, then plain HTML.\nUsage: python3 .claude/scripts/smart-fetch.py <url> [--source]\n\"\"\"\nimport sys\nimport urllib.request\nimport urllib.parse\nimport urllib.error\nimport re\n\ndef fetch_url(url, timeout=15):\n    # Returns (text, final_url) on success, or (None, error_message) on failure\n    req = urllib.request.Request(url, headers={\n        'User-Agent': 'Mozilla/5.0 (compatible; agent-fetch/1.0)'\n    })\n    try:\n        with urllib.request.urlopen(req, timeout=timeout) as r:\n            charset = 'utf-8'\n            ct = r.headers.get('Content-Type', '')\n            if 'charset=' in ct:\n                # Keep only the charset value, dropping any trailing parameters\n                charset = ct.split('charset=')[-1].split(';')[0].strip()\n            return r.read().decode(charset, errors='replace'), r.geturl()\n    except urllib.error.HTTPError as e:\n        return None, str(e)\n    except Exception as e:\n        return None, str(e)\n\ndef html_to_text(html):\n    # Remove non-content elements (scripts, styles, and page chrome)\n    for tag in ['script', 'style', 'nav', 'footer', 'header', 'aside']:\n        html = re.sub(rf'<{tag}[^>]*>.*?</{tag}>', '', html, flags=re.DOTALL|re.IGNORECASE)\n    # Remove all remaining tags\n    text = re.sub(r'<[^>]+>', ' ', html)\n    # Decode common entities; &amp; is handled last so '&amp;lt;' is not decoded twice\n    for ent, ch in [('&lt;','<'),('&gt;','>'),('&nbsp;',' '),('&#39;',\"'\"),('&quot;','\"'),('&amp;','&')]:\n        text = text.replace(ent, ch)\n    # Collapse whitespace\n    text = re.sub(r'\\n\\s*\\n\\s*\\n', '\\n\\n', text)\n    text = re.sub(r'[ \\t]+', ' ', text)\n    return text.strip()\n\ndef 
get_base(url):\n    # Return scheme://host for the URL, used to locate site-level llms.txt files\n    p = urllib.parse.urlparse(url)\n    return f\"{p.scheme}://{p.netloc}\"\n\ndef try_llms_txt(base):\n    # Prefer the fuller /llms-full.txt, then /llms.txt; skip responses that look like HTML\n    for path in ['/llms-full.txt', '/llms.txt']:\n        content, _ = fetch_url(base + path)\n        if content and len(content) > 100 and not content.strip().startswith('<'):\n            return content, 'llms.txt'\n    return None, None\n\ndef try_cloudflare_markdown(url):\n    # Despite its name, this function does not call a Cloudflare endpoint.\n    # It prefixes the URL with https://r.jina.ai/ (the Jina AI Reader proxy),\n    # which returns the page as markdown and needs no API key for basic use.\n    jina_url = 'https://r.jina.ai/' + url\n    content, _ = fetch_url(jina_url, timeout=20)\n    # Reject short responses and anything that looks like an HTML document\n    if content and len(content) > 200 and not content.strip().startswith('<!'):\n        return content, 'markdown'\n    return None, None\n\ndef smart_fetch(url, show_source=False):\n    base = get_base(url)\n    results = []\n\n    # 1. Try llms.txt\n    content, _ = try_llms_txt(base)\n    if content:\n        results.append(('llms.txt', content))\n\n    # 2. Try markdown delivery only if llms.txt gave nothing, to avoid a wasted request\n    if not results:\n        content, _ = try_cloudflare_markdown(url)\n        if content:\n            results.append(('markdown', content))\n\n    # 3. 
HTML fallback\n    if not results:\n        html, _ = fetch_url(url)\n        if html:\n            text = html_to_text(html)\n            results.append(('html', text))\n\n    if not results:\n        print(f\"ERROR: Could not fetch {url}\", file=sys.stderr)\n        sys.exit(1)\n\n    # Use best result (prefer llms.txt > markdown > html)\n    best_source, best_content = results[0]\n\n    if show_source:\n        print(f\"[source: {best_source}]\", file=sys.stderr)\n\n    return best_content\n\nif __name__ == '__main__':\n    args = sys.argv[1:]\n    if not args or args[0] in ('-h', '--help'):\n        print(__doc__)\n        sys.exit(0)\n\n    url = args[0]\n    show_source = '--source' in args\n\n    content = smart_fetch(url, show_source=show_source)\n    print(content)\n```\n\nMake it executable:\n\n```bash\nchmod +x .claude/scripts/smart-fetch.py\n```\n\n---\n\n## Usage\n\n```bash\n# Fetch a page (auto-selects best source)\npython3 .claude/scripts/smart-fetch.py https://docs.example.com/guide\n\n# Show which source was used (llms.txt / markdown / html)\npython3 .claude/scripts/smart-fetch.py https://docs.example.com/guide --source\n\n# Pipe into another tool\npython3 .claude/scripts/smart-fetch.py https://example.com | head -100\n```\n\n---\n\n## Teaching the Agent to Use It\n\nAdd this to your project's `CLAUDE.md`:\n\n```markdown\n## Web Fetching\n\nWhen fetching web content, always use the smart-fetch script first:\n\n```bash\npython3 .claude/scripts/smart-fetch.py <url> --source\n```\n\nOnly use WebFetch as a fallback if smart-fetch fails or if you need\nJavaScript-rendered content. 
The script reduces token usage by 60-80%\non documentation sites and blogs.\n```\n\n---\n\n## When Each Source Wins\n\n| Site Type | Likely Source | Why |\n|-----------|--------------|-----|\n| AI/dev tool docs | llms.txt | Modern tools publish agent-ready content |\n| Technical blogs | markdown | Clean article content via markdown delivery |\n| Legacy enterprise sites | html | No markdown alternative available |\n| SPAs / JS-heavy sites | html (may be sparse) | Server-side content only |\n\n---\n\n## Token Savings by Source\n\nApproximate token counts for a typical 2,000-word documentation page:\n\n- **HTML** (raw): ~8,000 tokens (navigation, scripts, markup included)\n- **Markdown delivery**: ~2,000 tokens (clean structured content)\n- **llms.txt**: ~1,500 tokens (curated for AI consumption)\n\nOn a project that fetches 50 URLs per session, this saves ~300,000 tokens — roughly the difference between fitting in context and not.\n\n---\n\n## Going Further\n\nSmart-fetch saves tokens on every fetch. But you're still triggering each fetch manually — \"go check this URL.\" The real power comes when fetching happens automatically, on a schedule, without you asking.\n\n**With Instar, your agent can monitor the web autonomously.** Set up a cron job that checks competitor pricing every morning. Another that watches API documentation for breaking changes. Another that summarizes your RSS feeds before you wake up. Smart-fetch runs inside each job, keeping token costs low while the agent works through dozens of URLs on its own.\n\nInstar also adds a caching layer — the same URL fetched twice within a configurable window returns the cached version, so recurring jobs don't waste tokens re-reading content that hasn't changed.\n\nAnd web monitoring is just one use case. 
With Instar, your agent also gets:\n- **A full job scheduler** — any task on cron\n- **Background sessions** — parallel workers for deep tasks\n- **Telegram integration** — results delivered to your phone\n- **Persistent identity and memory** — context that survives across sessions\n\nOne command, about 2 minutes:\n\n```bash\nnpx instar\n```\n\nYour agent goes from fetching when you ask to watching the web while you sleep. [instar.sh](https://instar.sh)","tags":["smart","web","fetch","instar","jkheadley","agent-framework","agent-identity","agent-infrastructure","agent-memory","agent-skills","ai-agents","ai-safety"],"capabilities":["skill","source-jkheadley","skill-smart-web-fetch","topic-agent-framework","topic-agent-identity","topic-agent-infrastructure","topic-agent-memory","topic-agent-skills","topic-ai-agents","topic-ai-safety","topic-autonomous-agents","topic-claude-code","topic-cli","topic-cron","topic-job-scheduler"],"categories":["instar"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/JKHeadley/instar/smart-web-fetch","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add JKHeadley/instar","source_repo":"https://github.com/JKHeadley/instar","install_from":"skills.sh"}},"qualityScore":"0.479","qualityRationale":"deterministic score 0.48 from registry signals: · indexed on github topic:agent-skills · 59 github stars · SKILL.md body (7,787 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-02T06:55:53.737Z","embedding":null,"createdAt":"2026-04-18T22:14:39.197Z","updatedAt":"2026-05-02T06:55:53.737Z","lastSeenAt":"2026-05-02T06:55:53.737Z","tsv":"'-1':302 '-100':696 '-8':289 '-80':187,755 '/''':487 '/1.0':278 '/guide':673,686 
'/llms-full.txt':113,413 '/llms.txt':111,414 '/markdown/':449 '/md':464 '/usr/bin/env':226 '0':601,626,633,636 '000':823,830,839,864 '1':105,526,588,620,845 '100':425 '15':265 '2':131,540,822,838,1053 '20':161,497 '200':503 '3':167,555 '300':863 '50':857 '500':846 '60':186,754 '8':829 '80':26 '90':79 'across':1048 'actual':450 'ad':70 'add':703,973 'agent':119,272,276,699,778,914,962,1016,1059 'agent-fetch':275 'agent-readi':777 'ai':118,850 'ai/dev':770 'also':972,1017 'altern':796 'alway':717 'anoth':690,931,939 'api':477,934 'approxim':816 'arg':618,623,625,635,641 'articl':785 'asid':345 'ask':910,1065 'auto':665 'auto-select':664 'automat':904 'autonom':919 'avail':797 'back':19,169 'background':1027 'banner':72 'base':396,409,418,521,523,534 'bash':208,655,660,725,1055 'basic':481 'behind':154 'best':125,590,596,598,608,613,667 'blog':193,760,782 'break':937 'cach':975,988 'case':1012 'cf':468 'ch':368,373 'chain':102 'chang':938,1004 'charset':287,296,299,301,307 'check':10,106,894,926 'chmod':656 'chrome':85 'claude.md':709 'claude/scripts':211 'claude/scripts/smart-fetch.py':214,224,245,658,670,683,693,727 'clean':31,139,784,841 'cleaner':94 'cloudflar':14,133,135,155,238,438,441,547 'cloudflare.com':448 'cloudflare.com/markdown/':447 'collaps':374 'come':900 'command':1051 'common':364 'compat':274 'competitor':927 'configur':984 'consumpt':851 'content':7,54,87,116,217,233,293,415,421,424,431,489,499,502,509,529,536,539,544,551,554,599,614,642,651,716,747,780,786,810,843,1000 'content-typ':292 'content.strip':428,506 'context':872,1045 'cooki':71 'copi':199 'cost':166,958 'could':581 'count':818 'creat':213 'cron':923,1026 'ct':290,298 'ct.split':300 'curat':115,848 'decod':306,363 'deep':1032 'def':260,327,394,405,436,514 'default':61 'deliv':1037 'deliveri':33,444,543,789,837 'depend':36 'differ':868 'directori':207 'doc':631,772 'docs.example.com':672,685 'docs.example.com/guide':671,684 'document':77,191,757,825,935 'domain':456 'dozen':965 
'e':314,318,322,326 'effici':8,52,231 'endpoint':16 'ent':367,372 'enterpris':791 'entiti':365 'error':308,580 'everi':883,929 'example.com':694 'except':311,319,320 'execut':654 'extern':35 'f':402,579,606 'fail':739 'fall':18,168 'fallback':557,734 'fals':520 'feed':944 'fetch':4,5,44,49,55,56,101,173,234,261,277,416,492,516,562,583,644,661,712,714,722,738,856,879,884,891,902,951,980,1062 'fewer':188 'file':585,610 'final':490 'first':12,96,724 'fit':870 'fix':90 'flag':352 'footer':69,336,343 'full':65,1020 'get':395,522,1018 'go':83,875,893 'goe':1060 'h':628 'happen':903 'hasn':1002 'head':695 'header':269,344 'heavi':801 'help':629 'html':21,66,164,171,175,242,328,331,346,351,362,556,561,566,568,571,573,595,681,793,803,827 'ident':1042 'import':248,250,252,254,256,258 'includ':835 'insid':953 'instal':37,198 'instar':912,971,1014,1057 'instar.sh':1073,1074 'integr':1035 'intent':127 'javascript':745 'javascript-rend':744 'jina':460,483,494 'jina.ai':474 'job':924,955,992,1021 'js':800 'js-heavi':799 'json':259 'keep':956 'key':478 'layer':976 'legaci':790 'len':423,501 'like':767 'llms':407,532 'llms.txt':11,107,236,432,528,538,593,679,773,844 'low':959 'main':617 'make':652 'mani':108 'manual':892 'markdown':15,32,134,140,159,239,439,443,510,542,548,553,594,680,710,783,788,795,836 'markup':834 'may':804 'memori':1044 'menus':68 'million':142 'minut':1054 'mkdir':209 'modern':774 'monitor':916,1007 'morn':930 'mozilla/5.0':273 'n':379,381,383,384,385 'name':616 'nav':335,342 'navig':67,832 'need':479,743 'network':137 'nois':130 'none':316,324,434,435,512,513 'npx':1056 'one':1010,1050 'open':472 'order':104 'p':210,398 'p.netloc':404 'p.scheme':403 'page':78,197,469,663,826 'parallel':1029 'path':411,419 'pattern':452,466 'per':859 'persist':1041 'phone':1040 'pipe':688 'plain':241 'post':194 'power':899 'prefer':592 'prefix':148,445 'present':121 'price':928 'print':578,605,630,650 'product':196 'project':204,707,854 'publish':110,776 'python':40,225 
'python3':227,244,669,682,692,726 'r':286,361,378,389 'r.geturl':310 'r.headers.get':291 'r.jina.ai':458,486 'r.jina.ai/''':485 'r.read':305 'raw':246,828 're':257,887,998 're-read':997 're.dotall':353 're.ignorecase':354 're.sub':347,360,377,388 'read':999 'readabl':178 'reader':475 'readi':779 'real':898 'recur':991 'reduc':22,750 'reliabl':180,471 'remain':357 'remov':332,355 'render':746 'replac':309,454 'req':266,282 'result':184,525,560,577,591,600,1036 'results.append':537,552,572 'retriev':64 'return':157,304,315,323,392,401,430,433,508,511,612,986 'rf':348 'rough':866 'rss':943 'run':952 'save':221,813,862,880 'schedul':907,1022 'scheme':455 'script':41,89,201,206,220,333,340,723,749,833 'select':666 'serv':138 'server':808 'server-sid':807 'session':860,1028,1049 'set':920 'show':518,603,637,646,648,674 'side':809 'singl':39 'site':28,109,144,152,192,758,765,792,802 'skill' 'skill-smart-web-fetch' 'sleep':1072 'smart':2,47,515,643,721,737,878,950 'smart-fetch':720,736,877,949 'smart-fetch.py':228 'smart-web-fetch':1,46 'sourc':95,126,247,519,530,545,597,604,607,609,638,639,647,649,668,676,687,728,763,768,815 'source-jkheadley' 'spars':806 'spas':798 'standard':172 'startswith':429,507 'still':888 'str':317,325 'strip':176,303 'structur':128,158,842 'style':334,341 'subdomain':465 'summar':941 'support':30 'surviv':1047 'sys':249 'sys.argv':619 'sys.exit':587,632 'sys.stderr':586,611 'tag':338,349,350,358 'task':1024,1033 'teach':697 'technic':781 'techniqu':473 'telegram':1034 'text':179,330,359,370,376,386,387,391,567,570,574 'text.replace':371 'text.strip':393 'timeout':264,283,284,496 'token':23,51,82,165,189,230,751,812,817,831,840,847,865,881,957,996 'token-effici':50,229 'tool':63,691,771,775 'topic-agent-framework' 'topic-agent-identity' 'topic-agent-infrastructure' 'topic-agent-memory' 'topic-agent-skills' 'topic-ai-agents' 'topic-ai-safety' 'topic-autonomous-agents' 'topic-claude-code' 'topic-cli' 'topic-cron' 'topic-job-scheduler' 
'tri':93,132,235,279,406,437,527,531,541,546 'trick':149 'trigger':42,889 'twice':981 'txt':408,533 'type':294,766 'typic':185,821 'url':45,147,262,263,268,397,400,417,440,484,488,491,493,495,517,524,549,563,564,584,634,645,858,896,967,979 'urllib.error':255 'urllib.error.httperror':312 'urllib.parse':253 'urllib.parse.urlparse':399 'urllib.request':251 'urllib.request.request':267 'urllib.request.urlopen':281 'usag':24,243,659,752 'use':462,482,589,678,701,718,730,1011 'user':271 'user-ag':270 'utf':288 'verbos':182 'version':989 'via':145,787 'wake':947 'wast':995 'watch':933,1067 'web':3,6,48,53,232,711,715,918,1006,1069 'webfetch':62,731 'webpag':58 'whitespac':375 'win':764 'window':985 'within':982 'without':908 'word':43,824 'work':99,963 'worker':1030 'x':657","prices":[{"id":"61c4b54f-851e-4936-a512-34f6c3e649c2","listingId":"344f4dac-f5ba-47df-b847-95816d0ee577","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"JKHeadley","category":"instar","install_from":"skills.sh"},"createdAt":"2026-04-18T22:14:39.197Z"}],"sources":[{"listingId":"344f4dac-f5ba-47df-b847-95816d0ee577","source":"github","sourceId":"JKHeadley/instar/smart-web-fetch","sourceUrl":"https://github.com/JKHeadley/instar/tree/main/skills/smart-web-fetch","isPrimary":false,"firstSeenAt":"2026-04-18T22:14:39.197Z","lastSeenAt":"2026-05-02T06:55:53.737Z"}],"details":{"listingId":"344f4dac-f5ba-47df-b847-95816d0ee577","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"JKHeadley","slug":"smart-web-fetch","github":{"repo":"JKHeadley/instar","stars":59,"topics":["agent-framework","agent-identity","agent-infrastructure","agent-memory","agent-skills","ai-agents","ai-safety","autonomous-agents","claude-code","cli","cron","job-scheduler","llm","mcp","n
pm-package","open-source","persistency","telegram-bot","typescript","whatsapp"],"license":"mit","html_url":"https://github.com/JKHeadley/instar","pushed_at":"2026-05-02T05:23:59Z","description":"Persistent Claude Code agents with scheduling, sessions, memory, and Telegram.","skill_md_sha":"63dd6e9212f5275f2ec5fad7e93022103b0ca212","skill_md_path":"skills/smart-web-fetch/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/JKHeadley/instar/tree/main/skills/smart-web-fetch"},"layout":"multi","source":"github","category":"instar","frontmatter":{"name":"smart-web-fetch","license":"MIT","description":"Fetch web content efficiently by checking llms.txt first, then Cloudflare markdown endpoints, then falling back to HTML. Reduces token usage by 80% on sites that support clean markdown delivery. No external dependencies — installs a single Python script. Trigger words: fetch URL, web content, read website, scrape page, download page, get webpage, read this link."},"skills_sh_url":"https://skills.sh/JKHeadley/instar/smart-web-fetch"},"updatedAt":"2026-05-02T06:55:53.737Z"}}