Turn captured WARC pages into clean text and language-tagged records with warc2text
Use warc2text when an agent already has WARC captures and needs readable text, language identification, and exportable records for review, search, or corpus building instead of re-crawling pages.
Prerequisites
A warc2text build or binary, WARC input files, and local storage for the output.
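The three prerequisites can be checked before starting a long run. The following is a minimal sketch, not part of warc2text: the function name preflight and its argument order (binary, output directory, then input files) are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Pre-flight sketch: confirm binary, output storage, and WARC inputs.
# "preflight" and its argument order are illustrative, not a warc2text feature.

preflight() {
    if [ "$#" -lt 2 ]; then
        echo "usage: preflight BINARY OUTDIR WARC..." >&2
        return 2
    fi
    bin="$1"      # name or path of the warc2text binary
    outdir="$2"   # local directory that will receive the output
    shift 2

    # 1. The binary must be resolvable on PATH (or be an executable path).
    if ! command -v "$bin" >/dev/null 2>&1; then
        echo "missing binary: $bin" >&2
        return 1
    fi

    # 2. The output directory is created if needed and must be writable.
    if ! mkdir -p "$outdir" 2>/dev/null || [ ! -w "$outdir" ]; then
        echo "output directory not writable: $outdir" >&2
        return 1
    fi

    # 3. At least one input is required, and every input must be readable.
    if [ "$#" -eq 0 ]; then
        echo "no WARC input files given" >&2
        return 1
    fi
    for warc in "$@"; do
        if [ ! -r "$warc" ]; then
            echo "unreadable WARC input: $warc" >&2
            return 1
        fi
    done
    return 0
}
```

A call such as preflight warc2text ./out crawl-001.warc.gz (file name hypothetical) returns 0 only when all three checks pass, so it can gate the actual extraction step.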
Installation
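Use the upstream install step that matches your environment:
- Clone with submodules: git clone --recurse-submodules https://github.com/bitextor/warc2text.git
- Or clone without submodules and fetch them afterwards with git submodule update --init: git clone https://github.com/bitextor/warc2text.git
- Install dependencies on Mac: brew install uchardet libzip
- Configure from a build directory inside the checkout: cmake -DCMAKE_INSTALL_PREFIX=/your/prefix/path ..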
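The commands above can be chained into one build sequence. The sketch below is an assumption-laden outline rather than the upstream procedure: only the clone URL and the CMAKE_INSTALL_PREFIX setting come from the list, while the cmake -S/-B, --build, and --install forms are standard CMake usage (3.15 or newer) substituted for the "cd build; cmake .." pattern. The run helper, the DRY_RUN variable, and the build_warc2text name are hypothetical. By default it only prints what it would do.

```shell
#!/bin/sh
# Build-sequence sketch. DRY_RUN, run, and build_warc2text are illustrative names.
# The build and install steps are assumed standard CMake usage, not quoted from upstream.

run() {
    # Print the command when DRY_RUN is 1 (the default here), otherwise execute it.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

build_warc2text() {
    prefix="$1"   # install prefix, e.g. /your/prefix/path

    # Clone together with the submodules the build depends on.
    run git clone --recurse-submodules https://github.com/bitextor/warc2text.git || return 1

    # Out-of-source configure; equivalent to running "cmake .." from warc2text/build.
    run cmake -S warc2text -B warc2text/build -DCMAKE_INSTALL_PREFIX="$prefix" || return 1

    # Compile, then install under the chosen prefix.
    run cmake --build warc2text/build || return 1
    run cmake --install warc2text/build || return 1
}
```

Calling build_warc2text /your/prefix/path prints the four commands for review; setting DRY_RUN=0 first makes the same call execute them.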
Requirements and caveats from upstream:
- On a node with EasyBuild installed, warc2text can be installed as a module; the module commands are in the upstream README.
- --skip-text-extraction: skips text extraction and outputs only HTML. It is not compatible with the "text" value of the -f option, and it also requires skipping language identification.
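The two constraints on --skip-text-extraction can be checked before launching a job. The following is a hedged sketch: the function name is hypothetical, it assumes the -f value is a comma-separated list, and it takes a plain "yes"/"no" argument for whether language identification is skipped rather than naming the warc2text flag that controls it.

```shell
#!/bin/sh
# Option-compatibility sketch for --skip-text-extraction.
# Assumption: the -f value is a comma-separated list of output names.

check_skip_text_extraction() {
    formats="$1"          # value intended for -f, e.g. "html,url"
    langid_skipped="$2"   # "yes" when language identification is also skipped

    # Reject "text" anywhere in the list; surrounding commas avoid partial matches.
    case ",$formats," in
        *,text,*)
            echo "--skip-text-extraction cannot be combined with 'text' in -f" >&2
            return 1 ;;
    esac

    # The flag also requires language identification to be skipped.
    if [ "$langid_skipped" != "yes" ]; then
        echo "--skip-text-extraction also requires skipping language identification" >&2
        return 1
    fi
    return 0
}
```

For example, check_skip_text_extraction "html,url" yes succeeds, while passing "text,html" or a second argument of no fails with a message on stderr.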
Basic usage or getting-started notes:
- On Debian/Ubuntu/Mint, install the build dependencies: apt-get install build-essential cmake libuchardet-dev libzip-dev libboost-thread-dev libboost-regex-dev libboost-filesystem-dev libboost-log-dev libboost-iostreams-dev libboost-locale-dev libboost-program-options-dev
- On Mac, install the dependencies with the brew command listed under Installation.
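Choosing the dependency command per platform can be scripted. This is a small sketch under stated assumptions: the deps_command name is hypothetical, the platform string is expected in the form printed by uname -s, and any Linux host is assumed to be Debian-family. It prints the command instead of running it, since package installation usually needs elevated privileges.

```shell
#!/bin/sh
# Dependency-selection sketch. Prints the install command for a platform name
# as reported by "uname -s". Treating every Linux host as Debian-family is an assumption.

DEBIAN_PACKAGES="build-essential cmake libuchardet-dev libzip-dev \
libboost-thread-dev libboost-regex-dev libboost-filesystem-dev \
libboost-log-dev libboost-iostreams-dev libboost-locale-dev \
libboost-program-options-dev"

deps_command() {
    case "$1" in
        Darwin)
            echo "brew install uchardet libzip" ;;
        Linux)
            # Unquoted expansion collapses the continuation whitespace to single spaces.
            echo apt-get install $DEBIAN_PACKAGES ;;
        *)
            echo "unsupported platform: $1" >&2
            return 1 ;;
    esac
}
```

Running deps_command "$(uname -s)" shows the command for the current machine, which can then be reviewed and executed by hand.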
Extracted from upstream docs: https://raw.githubusercontent.com/bitextor/warc2text/HEAD/README.md
Quality
deterministic score 0.45 from registry signals: indexed on github topic:agent-skills · 8 github stars · SKILL.md body (1,626 chars)