Extract schema.org, Open Graph, and JSON-LD metadata from web pages for indexing
Uses extruct to pull machine-readable metadata from raw HTML so an agent can classify, deduplicate, or enrich pages without brittle full-page parsing. It is best for metadata harvesting workflows, not for crawling an entire site or rendering JavaScript-heavy pages.
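As an illustration of the classify and deduplicate steps mentioned above, here is a minimal sketch that works on a dictionary shaped like extruct's output (a mapping from syntax name to a list of items). The helper names `jsonld_types` and `dedupe_key` are hypothetical and not part of extruct; the sketch assumes JSON-LD items are plain dicts carrying "@type" and "url" keys, and it does not descend into nested "@graph" structures.

```python
from typing import Any, Dict, List, Optional


def jsonld_types(metadata: Dict[str, List[Any]]) -> List[str]:
    """Collect distinct JSON-LD @type values, in the order first seen."""
    seen: List[str] = []
    for item in metadata.get("json-ld", []):
        if not isinstance(item, dict):
            continue
        types = item.get("@type", [])
        # @type may be a single string or a list of strings.
        if isinstance(types, str):
            types = [types]
        for type_name in types:
            if isinstance(type_name, str) and type_name not in seen:
                seen.append(type_name)
    return seen


def dedupe_key(metadata: Dict[str, List[Any]]) -> Optional[str]:
    """Return the first JSON-LD "url" value, naively normalized.

    Normalization here only trims whitespace and a trailing slash;
    this is an assumption for illustration, not a canonicalization rule.
    """
    for item in metadata.get("json-ld", []):
        if isinstance(item, dict) and isinstance(item.get("url"), str):
            return item["url"].strip().rstrip("/")
    return None


# Hand-written sample in the assumed shape (not real extracted data).
sample = {
    "json-ld": [
        {"@type": "Article", "url": "https://example.com/post/"},
        {"@type": ["Person", "Article"]},
    ],
    "opengraph": [],
}
types_found = jsonld_types(sample)
key_found = dedupe_key(sample)
```

An agent could use the type list to route pages to a classifier and the key to skip pages it has already indexed.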
Prerequisites
Python 3 environment
Installation
Use the upstream install or setup path that matches your environment:
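- pip install extruct (library only)
- pip install 'extruct[cli]' (library plus the command-line tool)
- pip install -r requirements-dev.txt (development dependencies; run from a checkout of the upstream repository)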
Requirements and caveats from upstream:
- PyPI package: https://pypi.python.org/pypi/extruct
- rdflib: https://pypi.python.org/pypi/rdflib/
Basic usage or getting-started notes:
- First fetch the HTML using python-requests, then feed the response body to extruct.
- Source: https://github.com/scrapinghub/extruct
- Extracted from upstream docs: https://raw.githubusercontent.com/scrapinghub/extruct/HEAD/README.rst
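The fetch-then-extract flow noted above can be sketched as follows. This is a minimal sketch, assuming extruct and requests are installed; `fetch_html`, `extract_metadata`, and `count_items` are hypothetical helper names, and the choice of syntaxes is an assumption made to match the skill's stated scope.

```python
from typing import Any, Dict, List


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page body with requests (imported lazily; assumed installed)."""
    import requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_metadata(html: str, url: str) -> Dict[str, List[Any]]:
    """Run extruct over raw HTML, limited to three metadata syntaxes."""
    import extruct  # imported lazily; assumed installed

    return extruct.extract(
        html,
        base_url=url,  # used to resolve relative URLs in the markup
        syntaxes=["json-ld", "opengraph", "microdata"],
    )


def count_items(metadata: Dict[str, List[Any]]) -> Dict[str, int]:
    """Summarize how many items each syntax produced."""
    return {syntax: len(items) for syntax, items in metadata.items()}
```

Usage would look like `count_items(extract_metadata(fetch_html(page_url), page_url))`, where `page_url` is whatever page the agent is indexing. Pages that build their metadata with client-side JavaScript will likely come back empty, since no rendering happens here.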
Quality
Deterministic score 0.45 from registry signals: indexed on github topic:agent-skills · 8 github stars · SKILL.md body (1,238 chars)