{"id":"a7371406-50e7-4a36-aaef-a96f19855942","shortId":"vwqVJN","kind":"skill","title":"crawl-websites-at-scale","tagline":"Scrape websites at scale using Scrapy, a Python web crawling and scraping framework. Use when: (1) Crawling multiple pages or entire sites, (2) Extracting structured data from HTML/XML, or (3) Building automated data pipelines from web sources.","description":"# Scrapy Web Scraping Skill\n\nScrapy is a fast, high-level Python web crawling and scraping framework. It enables structured data extraction from websites, supports crawling entire sites, and integrates pipelines to process and store scraped data.\n\n## When to use\n\n- Crawl entire websites or follow links across many pages\n- Extract structured data (prices, articles, product listings) into JSON/CSV\n- Run scheduled or large-scale scraping pipelines\n- Need built-in support for request throttling, retries, and middlewares\n\n## Required tools / APIs\n\n- No external API required\n- Python 3.8+ required\n- Scrapy: Web crawling and scraping framework\n\nInstall options:\n\n```bash\n# pip\npip install scrapy\n\n# Ubuntu/Debian\nsudo apt-get install -y python3-pip && pip install scrapy\n\n# macOS\nbrew install python && pip install scrapy\n\n# Verify installation\nscrapy version\n```\n\n## Skills\n\n### basic_usage\n\nCreate and run a simple Scrapy spider to scrape a single page.\n\n```bash\n# Create a new Scrapy project\nscrapy startproject myproject\ncd myproject\n\n# Generate a spider\nscrapy genspider quotes quotes.toscrape.com\n\n# Run the spider and save to JSON\nscrapy crawl quotes -o output.json\n\n# Run the spider and save to CSV\nscrapy crawl quotes -o output.csv\n```\n\n**Python spider (quotes.py):**\n\n```python\nimport scrapy\n\nclass QuotesSpider(scrapy.Spider):\n    name = \"quotes\"\n    start_urls = [\"https://quotes.toscrape.com\"]\n\n    def parse(self, response):\n        for quote in response.css(\"div.quote\"):\n            yield {\n                \"text\": quote.css(\"span.text::text\").get(),\n                \"author\": quote.css(\"small.author::text\").get(),\n                \"tags\": quote.css(\"a.tag::text\").getall(),\n            }\n\n        # Follow pagination links\n        next_page = response.css(\"li.next a::attr(href)\").get()\n        if next_page:\n            yield response.follow(next_page, self.parse)\n```\n\n### robust_usage\n\nProduction-oriented spider with settings, item pipelines, and error handling.\n\n```bash\n# Run with custom settings (rate limiting, retries)\nscrapy crawl quotes \\\n  -s DOWNLOAD_DELAY=1 \\\n  -s AUTOTHROTTLE_ENABLED=True \\\n  -s RETRY_TIMES=3 \\\n  -o output.json\n\n# Run from a script (no project required)\nscrapy runspider spider.py -o output.json\n```\n\n**Python with error handling and structured items:**\n\n```python\nimport scrapy\nfrom scrapy import signals\nfrom scrapy.crawler import CrawlerProcess\n\nclass ArticleSpider(scrapy.Spider):\n    name = \"articles\"\n    custom_settings = {\n        \"DOWNLOAD_DELAY\": 1,\n        \"AUTOTHROTTLE_ENABLED\": True,\n        \"AUTOTHROTTLE_START_DELAY\": 1,\n        \"AUTOTHROTTLE_MAX_DELAY\": 10,\n        \"ROBOTSTXT_OBEY\": True,\n        \"USER_AGENT\": \"open-skills-bot/1.0 (+https://github.com/besoeasy/open-skills)\",\n        \"RETRY_TIMES\": 3,\n        \"FEEDS\": {\"output.json\": {\"format\": \"json\"}},\n    }\n\n    def __init__(self, start_url=None, *args, **kwargs):\n        super().__init__(*args, **kwargs)\n        self.start_urls = [start_url or 
\"https://quotes.toscrape.com\"]\n\n    def parse(self, response):\n        for article in response.css(\"article, div.post, div.entry\"):\n            yield {\n                \"url\": response.url,\n                \"title\": article.css(\"h1::text, h2::text\").get(\"\").strip(),\n                \"body\": \" \".join(article.css(\"p::text\").getall()),\n            }\n\n        for link in response.css(\"a::attr(href)\").getall():\n            if link.startswith(\"/\") or response.url in link:\n                yield response.follow(link, self.parse)\n\n    def errback(self, failure):\n        self.logger.error(f\"Request failed: {failure.request.url} — {failure.value}\")\n\n\n# Run without a Scrapy project\nif __name__ == \"__main__\":\n    process = CrawlerProcess()\n    process.crawl(ArticleSpider, start_url=\"https://quotes.toscrape.com\")\n    process.start()\n```\n\n### extract_with_xpath\n\nUse XPath selectors for precise extraction from complex HTML structures.\n\n```python\nimport scrapy\n\nclass XPathSpider(scrapy.Spider):\n    name = \"xpath_example\"\n    start_urls = [\"https://quotes.toscrape.com\"]\n\n    def parse(self, response):\n        for quote in response.xpath(\"//div[@class='quote']\"):\n            yield {\n                \"text\": quote.xpath(\".//span[@class='text']/text()\").get(),\n                \"author\": quote.xpath(\".//small[@class='author']/text()\").get(),\n                \"tags\": quote.xpath(\".//a[@class='tag']/text()\").getall(),\n            }\n```\n\n## Output format\n\nScrapy yields Python dicts (or Item objects) per scraped record. When saved to file:\n\n- `output.json` — Array of JSON objects, one per item\n- `output.csv` — CSV with headers matching dict keys\n- `output.jsonl` — One JSON object per line (memory-efficient for large crawls)\n\nExample item:\n```json\n{\n  \"text\": \"The world as we have created it is a process of our thinking.\",\n  \"author\": \"Albert Einstein\",\n  \"tags\": [\"change\", \"deep-thoughts\", \"thinking\", \"world\"]\n}\n```\n\nError shape: Scrapy logs errors to stderr; unhandled HTTP errors trigger the `errback` method if defined.\n\n## Rate limits / Best practices\n\n- Enable `ROBOTSTXT_OBEY = True` to respect robots.txt automatically\n- Set `DOWNLOAD_DELAY` (seconds between requests) to avoid overloading servers\n- Enable `AUTOTHROTTLE_ENABLED = True` for adaptive rate limiting\n- Set a descriptive `USER_AGENT` identifying your bot\n- Use `CONCURRENT_REQUESTS_PER_DOMAIN = 1` for polite single-domain crawling\n- Cache responses during development: `HTTPCACHE_ENABLED = True`\n\n## Agent prompt\n\n```text\nYou have scrapy web-scraping capability. When a user asks to scrape or crawl a website:\n\n1. Confirm the target URL and data fields to extract (e.g., title, price, link)\n2. Create a Scrapy spider using CSS or XPath selectors to target those fields\n3. Enable ROBOTSTXT_OBEY=True and set DOWNLOAD_DELAY>=1 to be polite\n4. Follow pagination links if the user needs data across multiple pages\n5. 
## Rate limits / Best practices\n\n- Enable `ROBOTSTXT_OBEY = True` to respect robots.txt automatically\n- Set `DOWNLOAD_DELAY` (seconds between requests) to avoid overloading servers\n- Enable `AUTOTHROTTLE_ENABLED = True` for adaptive rate limiting\n- Set a descriptive `USER_AGENT` identifying your bot\n- Use `CONCURRENT_REQUESTS_PER_DOMAIN = 1` for polite single-domain crawling\n- Cache responses during development: `HTTPCACHE_ENABLED = True` (all of these settings are collected in the sketch below)\n\n
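A `settings.py` sketch collecting the bullets above; the values are illustrative starting points for a polite single-domain crawl, not hard requirements:\n\n```python\n# settings.py - polite-crawling defaults\nROBOTSTXT_OBEY = True\nDOWNLOAD_DELAY = 1  # seconds between requests to one domain\nCONCURRENT_REQUESTS_PER_DOMAIN = 1  # one in-flight request per site\nAUTOTHROTTLE_ENABLED = True  # adapt the delay to server response times\nAUTOTHROTTLE_START_DELAY = 1\nAUTOTHROTTLE_MAX_DELAY = 10\nUSER_AGENT = \"open-skills-bot/1.0 (+https://github.com/besoeasy/open-skills)\"\nHTTPCACHE_ENABLED = True  # cache responses while developing selectors\n```\n\n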
## Agent prompt\n\n```text\nYou have scrapy web-scraping capability. When a user asks to scrape or crawl a website:\n\n1. Confirm the target URL and data fields to extract (e.g., title, price, link)\n2. Create a Scrapy spider using CSS or XPath selectors to target those fields\n3. Enable ROBOTSTXT_OBEY=True and set DOWNLOAD_DELAY>=1 to be polite\n4. Follow pagination links if the user needs data across multiple pages\n5. Save results to output.json or output.csv\n\nAlways identify your bot with a descriptive USER_AGENT and never scrape login-protected or paywalled content.\n```\n\n## Troubleshooting\n\n**Error: \"Forbidden by robots.txt\"**\n- Symptom: Spider skips URLs and logs \"Forbidden by robots.txt\"\n- Solution: Review the site's robots.txt; only scrape paths that are allowed, or set `ROBOTSTXT_OBEY = False` if you have explicit permission from the site owner\n\n**Error: \"Empty or missing data\"**\n- Symptom: Items are yielded with empty strings or `None` values\n- Solution: Inspect the page source (`scrapy shell <url>`) and adjust your CSS/XPath selectors to match the actual HTML structure (see the selector-probing sketch below)\n\n
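When `scrapy shell` is not convenient, selectors can also be probed in a plain Python REPL with `scrapy.Selector` against pasted or saved HTML; a small sketch with inline sample markup:\n\n```python\nfrom scrapy import Selector\n\nhtml = \"<div class='quote'><span class='text'>Hello</span></div>\"\nsel = Selector(text=html)\nprint(sel.css(\"span.text::text\").get())  # \"Hello\"\nprint(sel.css(\"span.missing::text\").get())  # None: the selector does not match\n```\n\n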
**Error: \"Too many redirects / 429 Too Many Requests\"**\n- Symptom: Requests fail with HTTP 429 or redirect loops\n- Solution: Increase `DOWNLOAD_DELAY`, enable `AUTOTHROTTLE_ENABLED = True`, or add a downloader middleware that honors the `Retry-After` header (recent Scrapy releases already include 429 in the default `RETRY_HTTP_CODES`)\n\n**Error: \"JavaScript-rendered content not found\"**\n- Symptom: Expected data is missing because the site uses client-side rendering\n- Solution: Use `scrapy-playwright` or `scrapy-splash` middleware to render JavaScript before parsing (see the scrapy-playwright sketch below)\n\n
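A minimal scrapy-playwright wiring sketch, assuming `pip install scrapy-playwright` and `playwright install chromium` have already been run; the handler and reactor paths follow the scrapy-playwright documentation, and the `/js/` variant of the demo site renders its quotes client-side:\n\n```python\nimport scrapy\n\nclass JsQuotesSpider(scrapy.Spider):\n    name = \"js_quotes\"\n    custom_settings = {\n        # Route requests through Playwright's browser instead of the default downloader\n        \"DOWNLOAD_HANDLERS\": {\n            \"http\": \"scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler\",\n            \"https\": \"scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler\",\n        },\n        \"TWISTED_REACTOR\": \"twisted.internet.asyncioreactor.AsyncioSelectorReactor\",\n    }\n\n    def start_requests(self):\n        # meta={\"playwright\": True} marks this request for browser rendering\n        yield scrapy.Request(\"https://quotes.toscrape.com/js/\", meta={\"playwright\": True})\n\n    def parse(self, response):\n        # The usual selectors now see the JavaScript-rendered DOM\n        for quote in response.css(\"div.quote\"):\n            yield {\"text\": quote.css(\"span.text::text\").get()}\n```\n\n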
## See also\n\n- [../using-web-scraping/SKILL.md](../using-web-scraping/SKILL.md) — Browser-based scraping with Playwright/Puppeteer\n- [../phone-specs-scraper/SKILL.md](../phone-specs-scraper/SKILL.md) — Scraping phone specifications from public sites\n- [../web-search-api/SKILL.md](../web-search-api/SKILL.md) — Find target URLs to scrape via search APIs","tags":["crawl","websites","scale","open","skills","besoeasy","agent-skills","ai-agents","claude-code","clawdbot","clawdbot-skill","llm-tools"],"capabilities":["skill","source-besoeasy","skill-crawl-websites-at-scale","topic-agent-skills","topic-ai-agents","topic-claude-code","topic-clawdbot","topic-clawdbot-skill","topic-llm-tools","topic-mcp-server","topic-openai","topic-openclaw","topic-vibe-coding","topic-vibecoding"],"categories":["open-skills"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/besoeasy/open-skills/crawl-websites-at-scale","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add besoeasy/open-skills","source_repo":"https://github.com/besoeasy/open-skills","install_from":"skills.sh"}},"qualityScore":"0.505","qualityRationale":"deterministic score 0.51 from registry signals: · indexed on github topic:agent-skills · 111 github stars · SKILL.md body (7,355 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-02T12:55:03.090Z","embedding":null,"createdAt":"2026-04-18T22:10:38.057Z","updatedAt":"2026-05-02T12:55:03.090Z","lastSeenAt":"2026-05-02T12:55:03.090Z","tsv":null,"prices":[{"id":"f1669625-a286-4696-8aa0-023e77043fc4","listingId":"a7371406-50e7-4a36-aaef-a96f19855942","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"besoeasy","category":"open-skills","install_from":"skills.sh"},"createdAt":"2026-04-18T22:10:38.057Z"}],"sources":[{"listingId":"a7371406-50e7-4a36-aaef-a96f19855942","source":"github","sourceId":"besoeasy/open-skills/crawl-websites-at-scale","sourceUrl":"https://github.com/besoeasy/open-skills/tree/main/skills/crawl-websites-at-scale","isPrimary":false,"firstSeenAt":"2026-04-18T22:10:38.057Z","lastSeenAt":"2026-05-02T12:55:03.090Z"}],"details":{"listingId":"a7371406-50e7-4a36-aaef-a96f19855942","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"besoeasy","slug":"crawl-websites-at-scale","github":{"repo":"besoeasy/open-skills","stars":111,"topics":["agent-skills","ai","ai-agents","claude-code","clawdbot","clawdbot-skill","llm-tools","mcp-server","openai","openclaw","vibe-coding","vibecoding"],"license":null,"html_url":"https://github.com/besoeasy/open-skills","pushed_at":"2026-03-31T13:05:30Z","description":"Battle-tested skill library for AI agents. Save 98% of API costs with ready-to-use code for crypto, PDFs, search, web scraping & more. No trial-and-error, no expensive APIs.","skill_md_sha":"0a6434a1ad3b4e96a2fe37359676f1788146d541","skill_md_path":"skills/crawl-websites-at-scale/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/besoeasy/open-skills/tree/main/skills/crawl-websites-at-scale"},"layout":"multi","source":"github","category":"open-skills","frontmatter":{"name":"crawl-websites-at-scale","description":"Scrape websites at scale using Scrapy, a Python web crawling and scraping framework. Use when: (1) Crawling multiple pages or entire sites, (2) Extracting structured data from HTML/XML, or (3) Building automated data pipelines from web sources."},"skills_sh_url":"https://skills.sh/besoeasy/open-skills/crawl-websites-at-scale"},"updatedAt":"2026-05-02T12:55:03.090Z"}}