{"id":"430cb1dc-54b2-4ac4-8f75-88654c2878fe","shortId":"cPxARY","kind":"skill","title":"Turn captured WARC pages into clean text and language-tagged records with warc2text","tagline":"Use warc2text when an agent already has WARC captures and needs readable text, language identification, and exportable records for review, search, or corpus building instead of re-crawling pages.","description":"# Turn captured WARC pages into clean text and language-tagged records with warc2text\n\nUse warc2text when an agent already has WARC captures and needs readable text, language identification, and exportable records for review, search, or corpus building instead of re-crawling pages.\n\n## Prerequisites\n\nwarc2text build or binary, WARC input files, local output storage\n\n## Installation\n\nUse the upstream install or setup path that matches your environment:\n- git clone --recurse-submodules https://github.com/bitextor/warc2text.git\n- git clone https://github.com/bitextor/warc2text.git\n- brew install uchardet libzip\n- cmake -DCMAKE_INSTALL_PREFIX=/your/prefix/path ..\n\nRequirements and caveats from upstream:\n- On a node with EasyBuild installed you can install warc2text as a module:\n- --skip-text-extraction Skip text extraction and output only html. This option is not compatible with \"text\" value in -f option and also requires to skip language identification.\n\nBasic usage or getting-started notes:\n- On Debian/Ubuntu/Mint:\n- apt-get install build-essential cmake libuchardet-dev libzip-dev libboost-thread-dev libboost-regex-dev libboost-filesystem-dev libboost-log-dev libboost-iostreams-dev libboost-locale-dev libboost-program-options-dev\n- On Mac:\n\n- Source: https://github.com/bitextor/warc2text\n- Extracted from upstream docs: https://raw.githubusercontent.com/bitextor/warc2text/HEAD/README.md\n\n## Documentation\n\n- https://github.com/bitextor/warc2text\n\n## Source\n\n- [Agent Skill Exchange](https://agentskillexchange.com/skills/turn-captured-warc-pages-into-clean-text-and-language-tagged-records-with-warc2text/)","tags":["turn","captured","warc","pages","into","clean","text","and","language","tagged","records","with"],"capabilities":["skill","source-agentskillexchange","skill-turn-captured-warc-pages-into-clean-text-and-language-tagged-records-with-warc2text","topic-agent-skills","topic-ai-agents","topic-ai-tools","topic-awesome-list","topic-claude-code","topic-codex","topic-cursor","topic-llm","topic-mcp","topic-npx-skills","topic-openclaw","topic-skills-catalog"],"categories":["skills"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/agentskillexchange/skills/turn-captured-warc-pages-into-clean-text-and-language-tagged-records-with-warc2text","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add agentskillexchange/skills","source_repo":"https://github.com/agentskillexchange/skills","install_from":"skills.sh"}},"qualityScore":"0.454","qualityRationale":"deterministic score 0.45 from registry signals: · indexed on github topic:agent-skills · 8 github stars · SKILL.md body (1,626 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-18T19:12:55.656Z","embedding":null,"createdAt":"2026-05-18T13:20:04.238Z","updatedAt":"2026-05-18T19:12:55.656Z","lastSeenAt":"2026-05-18T19:12:55.656Z","tsv":"'/bitextor/warc2text':238,249 '/bitextor/warc2text.git':119,124 '/bitextor/warc2text/head/readme.md':245 '/skills/turn-captured-warc-pages-into-clean-text-and-language-tagged-records-with-warc2text/)':256 '/your/prefix/path':133 'agent':19,63,251 'agentskillexchange.com':255 'agentskillexchange.com/skills/turn-captured-warc-pages-into-clean-text-and-language-tagged-records-with-warc2text/)':254 'alreadi':20,64 'also':175 'apt':191 'apt-get':190 'basic':181 'binari':93 'brew':125 'build':38,82,91,195 'build-essenti':194 'captur':2,23,46,67 'caveat':136 'clean':6,50 'clone':113,121 'cmake':129,197 'compat':167 'corpus':37,81 'crawl':43,87 'dcmake':130 'debian/ubuntu/mint':189 'dev':200,203,207,211,215,219,223,227,232 'doc':242 'document':246 'easybuild':143 'environ':111 'essenti':196 'exchang':253 'export':31,75 'extract':155,158,239 'f':172 'file':96 'filesystem':214 'get':185,192 'getting-start':184 'git':112,120 'github.com':118,123,237,248 'github.com/bitextor/warc2text':236,247 'github.com/bitextor/warc2text.git':117,122 'html':162 'identif':29,73,180 'input':95 'instal':100,104,126,131,144,147,193 'instead':39,83 'iostream':222 'languag':10,28,54,72,179 'language-tag':9,53 'libboost':205,209,213,217,221,225,229 'libboost-filesystem-dev':212 'libboost-iostreams-dev':220 'libboost-locale-dev':224 'libboost-log-dev':216 'libboost-program-options-dev':228 'libboost-regex-dev':208 'libboost-thread-dev':204 'libuchardet':199 'libuchardet-dev':198 'libzip':128,202 'libzip-dev':201 'local':97,226 'log':218 'mac':234 'match':109 'modul':151 'need':25,69 'node':141 'note':187 'option':164,173,231 'output':98,160 'page':4,44,48,88 'path':107 'prefix':132 'prerequisit':89 'program':230 'raw.githubusercontent.com':244 'raw.githubusercontent.com/bitextor/warc2text/head/readme.md':243 're':42,86 're-crawl':41,85 'readabl':26,70 'record':12,32,56,76 'recurs':115 'recurse-submodul':114 'regex':210 'requir':134,176 'review':34,78 'search':35,79 'setup':106 'skill':252 'skill-turn-captured-warc-pages-into-clean-text-and-language-tagged-records-with-warc2text' 'skip':153,156,178 'skip-text-extract':152 'sourc':235,250 'source-agentskillexchange' 'start':186 'storag':99 'submodul':116 'tag':11,55 'text':7,27,51,71,154,157,169 'thread':206 'topic-agent-skills' 'topic-ai-agents' 'topic-ai-tools' 'topic-awesome-list' 'topic-claude-code' 'topic-codex' 'topic-cursor' 'topic-llm' 'topic-mcp' 'topic-npx-skills' 'topic-openclaw' 'topic-skills-catalog' 'turn':1,45 'uchardet':127 'upstream':103,138,241 'usag':182 'use':15,59,101 'valu':170 'warc':3,22,47,66,94 'warc2text':14,16,58,60,90,148","prices":[{"id":"7abd880b-94eb-4952-bb4d-33eb23816760","listingId":"430cb1dc-54b2-4ac4-8f75-88654c2878fe","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"agentskillexchange","category":"skills","install_from":"skills.sh"},"createdAt":"2026-05-18T13:20:04.238Z"}],"sources":[{"listingId":"430cb1dc-54b2-4ac4-8f75-88654c2878fe","source":"github","sourceId":"agentskillexchange/skills/turn-captured-warc-pages-into-clean-text-and-language-tagged-records-with-warc2text","sourceUrl":"https://github.com/agentskillexchange/skills/tree/main/skills/turn-captured-warc-pages-into-clean-text-and-language-tagged-records-with-warc2text","isPrimary":false,"firstSeenAt":"2026-05-18T13:20:04.238Z","lastSeenAt":"2026-05-18T19:12:55.656Z"}],"details":{"listingId":"430cb1dc-54b2-4ac4-8f75-88654c2878fe","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"agentskillexchange","slug":"turn-captured-warc-pages-into-clean-text-and-language-tagged-records-with-warc2text","github":{"repo":"agentskillexchange/skills","stars":8,"topics":["agent-skills","ai-agents","ai-tools","awesome-list","claude-code","codex","cursor","llm","mcp","npx-skills","openclaw","skills-catalog"],"license":"mit","html_url":"https://github.com/agentskillexchange/skills","pushed_at":"2026-05-18T19:02:17Z","description":"The open catalog of AI agent skills — 2,000+ security-scanned skills for Claude Code, Cursor, Codex, and more.","skill_md_sha":"a1549269f925e80ece373c49ade8db3475bf56c7","skill_md_path":"skills/turn-captured-warc-pages-into-clean-text-and-language-tagged-records-with-warc2text/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/agentskillexchange/skills/tree/main/skills/turn-captured-warc-pages-into-clean-text-and-language-tagged-records-with-warc2text"},"layout":"multi","source":"github","category":"skills","frontmatter":{"name":"Turn captured WARC pages into clean text and language-tagged records with warc2text","description":"Use warc2text when an agent already has WARC captures and needs readable text, language identification, and exportable records for review, search, or corpus building instead of re-crawling pages."},"skills_sh_url":"https://skills.sh/agentskillexchange/skills/turn-captured-warc-pages-into-clean-text-and-language-tagged-records-with-warc2text"},"updatedAt":"2026-05-18T19:12:55.656Z"}}