Trafilatura Web Text Extraction and Crawling Toolkit
Trafilatura is a Python package and CLI tool for gathering text from the web. It handles crawling, downloading, and extracting main text content, metadata, and comments from raw HTML, outputting clean structured data in CSV, JSON, Markdown, XML, and TXT formats.
What it does
Trafilatura is a Python package and CLI tool for gathering text from the web. It handles crawling, downloading, and extracting main text content, metadata, and comments from raw HTML, outputting clean structured data in CSV, JSON, Markdown, XML, and TXT formats.
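As a rough illustration of the download-then-extract flow described above, the sketch below wraps two calls from the package's Python API, `trafilatura.fetch_url` and `trafilatura.extract`. The helper names (`normalize_format`, `extract_page`, `download_and_extract`) and the `SUPPORTED_FORMATS` set are hypothetical and exist only for this example. The format strings are assumed to match what `extract` accepts for its `output_format` argument in recent releases, so check them against the installed version.

```python
from typing import Optional

# Output formats named in the description above (hypothetical constant).
# Assumption: these lowercase strings are valid output_format values.
SUPPORTED_FORMATS = {"csv", "json", "markdown", "xml", "txt"}


def normalize_format(name: str) -> str:
    """Return a lowercase format name, or raise ValueError if unsupported."""
    fmt = name.strip().lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output format: {name!r}")
    return fmt


def extract_page(html: str, output_format: str = "txt",
                 include_comments: bool = True) -> Optional[str]:
    """Extract the main text from raw HTML; None if nothing usable is found."""
    # Imported lazily so the pure helpers above work without the package.
    import trafilatura

    return trafilatura.extract(
        html,
        output_format=normalize_format(output_format),
        include_comments=include_comments,
    )


def download_and_extract(url: str, output_format: str = "txt") -> Optional[str]:
    """Download a page and extract its main text; None on any failure."""
    import trafilatura

    html = trafilatura.fetch_url(url)  # returns None if the download fails
    if html is None:
        return None
    return extract_page(html, output_format)
```

Calling `download_and_extract` with a page URL and `"json"` would return a JSON string, or `None` if the download or extraction fails. Validating the format up front means a typo fails before any network request is made.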
Installation
Requirements and caveats from upstream:
Basic usage or getting-started notes:
- to run the evaluation with the latest data and packages.
- is straightforward. For more information and detailed guides, visit
Extracted from upstream docs: https://raw.githubusercontent.com/adbar/trafilatura/HEAD/README.md
Quality
Deterministic score 0.45 from registry signals: indexed on GitHub topic:agent-skills · 8 GitHub stars · SKILL.md body (1,213 chars)