{"id":"0fb1e390-3dd7-4a1c-bec3-bac29b181f19","shortId":"fFhVdQ","kind":"skill","title":"Common Crawl URL Index Miner","tagline":"Queries the Common Crawl Index API and CC-MAIN collections to surface historical URL coverage, MIME types, and crawl snapshots at scale. Handy for research workflows that need broad web recall without building a full crawler from scratch.","description":"# Common Crawl URL Index Miner\n\nQueries the Common Crawl Index API and CC-MAIN collections to surface historical URL coverage, MIME types, and crawl snapshots at scale. Handy for research workflows that need broad web recall without building a full crawler from scratch.\n\n## Installation\n\nUse the upstream install or setup path that matches your environment:\n- `docker build . -t cc-index-table`\n- `docker run --rm -ti cc-index-table --help`\n- `docker run --rm --entrypoint=/opt/spark/bin/spark-submit cc-index-table`\n- `docker run --mount=type=bind,source=/tmp/data,destination=/data --rm cc-index-table /data/in /data/out`\n\nRequirements and caveats from upstream:\n- Building and running using Docker\n- A [Dockerfile](./Dockerfile) is provided to compile the project and run the Spark job in a Docker container.\n- build the Docker image:\n\nBasic usage or getting-started notes:\n- This project provides a comprehensive set of example queries (SQL) and also Java code to fetch and process the WARC records matched by a SQL query.\n- Run `mvn spotless:check` and `mvn spotless:apply`, see the [Spotless Maven guide](https://github.com/diffplug/spotless/blob/main/plugin-maven/README.md). 
Java formatting rules are defined in [eclipse-formatter.xml](eclips...\n- run the table converter tool, here showing the command-line help (--help):\n\n- Source: https://github.com/commoncrawl/cc-index-table\n- Extracted from upstream docs: https://raw.githubusercontent.com/commoncrawl/cc-index-table/HEAD/README.md\n\n## Source\n\n- [Agent Skill Exchange](https://agentskillexchange.com/skills/common-crawl-url-index-miner/)","tags":["common","crawl","url","index","miner","skills","agentskillexchange","agent-skills","ai-agents","ai-tools","awesome-list","claude-code"],"capabilities":["skill","source-agentskillexchange","skill-common-crawl-url-index-miner","topic-agent-skills","topic-ai-agents","topic-ai-tools","topic-awesome-list","topic-claude-code","topic-codex","topic-cursor","topic-llm","topic-mcp","topic-npx-skills","topic-openclaw","topic-skills-catalog"],"categories":["skills"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/agentskillexchange/skills/common-crawl-url-index-miner","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add agentskillexchange/skills","source_repo":"https://github.com/agentskillexchange/skills","install_from":"skills.sh"}},"qualityScore":"0.454","qualityRationale":"deterministic score 0.45 from registry signals: · indexed on github topic:agent-skills · 8 github stars · SKILL.md body (1,594 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-18T19:09:53.941Z","embedding":null,"createdAt":"2026-05-18T13:15:47.573Z","updatedAt":"2026-05-18T19:09:53.941Z","lastSeenAt":"2026-05-18T19:09:53.941Z","tsv":"'/commoncrawl/cc-index-table':247 '/commoncrawl/cc-index-table/head/readme.md':254 '/data':134 '/data/in':140 
'/data/out':141 '/diffplug/spotless/blob/main/plugin-maven/readme.md).':222 '/dockerfile':154 '/opt/spark/bin/spark-submit':121 '/skills/common-crawl-url-index-miner/)':261 '/tmp/data':132 'agent':256 'agentskillexchange.com':260 'agentskillexchange.com/skills/common-crawl-url-index-miner/)':259 'also':192 'api':11,55 'appli':214 'basic':174 'bind':130 'broad':35,79 'build':39,83,102,147,170 'caveat':144 'cc':14,58,105,113,123,137 'cc-index-t':104,112,122,136 'cc-main':13,57 'check':210 'code':194 'collect':16,60 'command':240 'command-lin':239 'common':1,8,45,52 'compil':158 'comprehens':185 'contain':169 'convert':234 'coverag':21,65 'crawl':2,9,25,46,53,69 'crawler':42,86 'defin':227 'destin':133 'doc':251 'docker':101,108,117,126,151,168,172 'dockerfil':153 'eclip':230 'eclipse-formatter.xml':229 'entrypoint':120 'environ':100 'exampl':188 'exchang':258 'extract':248 'fetch':196 'format':224 'full':41,85 'get':178 'getting-start':177 'github.com':221,246 'github.com/commoncrawl/cc-index-table':245 'github.com/diffplug/spotless/blob/main/plugin-maven/readme.md).':220 'guid':219 'handi':29,73 'help':116,242,243 'histor':19,63 'imag':173 'index':4,10,48,54,106,114,124,138 'instal':89,93 'java':193,223 'job':165 'line':241 'main':15,59 'match':98,202 'maven':218 'mime':22,66 'miner':5,49 'mount':128 'mvn':208,212 'need':34,78 'note':180 'path':96 'process':198 'project':160,182 'provid':156,183 'queri':6,50,189,206 'raw.githubusercontent.com':253 'raw.githubusercontent.com/commoncrawl/cc-index-table/head/readme.md':252 'recal':37,81 'record':201 'requir':142 'research':31,75 'rm':110,119,135 'rule':225 'run':109,118,127,149,162,207,231 'scale':28,72 'scratch':44,88 'see':215 'set':186 'setup':95 'show':237 'skill':257 'skill-common-crawl-url-index-miner' 'snapshot':26,70 'sourc':131,244,255 'source-agentskillexchange' 'spark':164 'spotless':209,213,217 'sql':190,205 'start':179 'surfac':18,62 'tabl':107,115,125,139,233 'ti':111 'tool':235 'topic-agent-skills' 
'topic-ai-agents' 'topic-ai-tools' 'topic-awesome-list' 'topic-claude-code' 'topic-codex' 'topic-cursor' 'topic-llm' 'topic-mcp' 'topic-npx-skills' 'topic-openclaw' 'topic-skills-catalog' 'type':23,67,129 'upstream':92,146,250 'url':3,20,47,64 'usag':175 'use':90,150 'warc':200 'web':36,80 'without':38,82 'workflow':32,76","prices":[{"id":"d1e0ebd7-c50b-4eab-a2a5-e60260f83cf1","listingId":"0fb1e390-3dd7-4a1c-bec3-bac29b181f19","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"agentskillexchange","category":"skills","install_from":"skills.sh"},"createdAt":"2026-05-18T13:15:47.573Z"}],"sources":[{"listingId":"0fb1e390-3dd7-4a1c-bec3-bac29b181f19","source":"github","sourceId":"agentskillexchange/skills/common-crawl-url-index-miner","sourceUrl":"https://github.com/agentskillexchange/skills/tree/main/skills/common-crawl-url-index-miner","isPrimary":false,"firstSeenAt":"2026-05-18T13:15:47.573Z","lastSeenAt":"2026-05-18T19:09:53.941Z"}],"details":{"listingId":"0fb1e390-3dd7-4a1c-bec3-bac29b181f19","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"agentskillexchange","slug":"common-crawl-url-index-miner","github":{"repo":"agentskillexchange/skills","stars":8,"topics":["agent-skills","ai-agents","ai-tools","awesome-list","claude-code","codex","cursor","llm","mcp","npx-skills","openclaw","skills-catalog"],"license":"mit","html_url":"https://github.com/agentskillexchange/skills","pushed_at":"2026-05-18T19:02:17Z","description":"The open catalog of AI agent skills — 2,000+ security-scanned skills for Claude Code, Cursor, Codex, and 
more.","skill_md_sha":"8dfd46380f9ba878bb180d3dc148021951be121e","skill_md_path":"skills/common-crawl-url-index-miner/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/agentskillexchange/skills/tree/main/skills/common-crawl-url-index-miner"},"layout":"multi","source":"github","category":"skills","frontmatter":{"name":"Common Crawl URL Index Miner","description":"Queries the Common Crawl Index API and CC-MAIN collections to surface historical URL coverage, MIME types, and crawl snapshots at scale. Handy for research workflows that need broad web recall without building a full crawler from scratch."},"skills_sh_url":"https://skills.sh/agentskillexchange/skills/common-crawl-url-index-miner"},"updatedAt":"2026-05-18T19:09:53.941Z"}}