Extract structured markdown, JSON, and tagged-PDF-ready outputs from PDFs with OpenDataLoader PDF
Convert PDFs into LLM-ready markdown or coordinate-aware JSON, and reuse the same pipeline for tagged-PDF accessibility workflows when accessibility is the actual goal.
What it does
OpenDataLoader PDF extracts structured content from PDFs as LLM-ready markdown or coordinate-aware JSON. The same pipeline can also produce tagged-PDF-ready output for accessibility workflows.
Prerequisites
Python 3.10+, Java 11+, PDF inputs, optional hybrid-mode backend setup for complex pages or OCR-heavy jobs
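The version requirements above can be checked before installing. The following is a minimal sketch, not part of OpenDataLoader PDF: the helper names `parse_java_major` and `check_prerequisites` are hypothetical, and the script assumes `java` is expected on the PATH and prints its version banner in the usual `version "..."` form.

```python
import re
import shutil
import subprocess
import sys
from typing import List, Optional

MIN_PYTHON = (3, 10)   # from the prerequisites above
MIN_JAVA_MAJOR = 11    # from the prerequisites above


def parse_java_major(version_output: str) -> Optional[int]:
    """Return the Java major version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Legacy numbering: "1.8.0_281" means Java 8.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def check_prerequisites() -> List[str]:
    """Collect human-readable problems; an empty list means both checks passed."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        problems.append("Python 3.10+ required, found %d.%d" % sys.version_info[:2])
    java = shutil.which("java")
    if java is None:
        problems.append("java not found on PATH (Java 11+ required)")
    else:
        # `java -version` writes its banner to stderr, so read both streams.
        result = subprocess.run([java, "-version"], capture_output=True, text=True)
        major = parse_java_major(result.stderr + result.stdout)
        if major is None or major < MIN_JAVA_MAJOR:
            problems.append("Java 11+ required, found %s" % major)
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("missing prerequisite:", problem)
```

Running the script prints one line per unmet requirement and nothing when both runtimes are recent enough.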
Installation
Use the upstream install or setup path that matches your environment:
- Python: pip install -U opendataloader-pdf
- Node.js: npm install @opendataloader/pdf
- Python with hybrid mode: pip install -U "opendataloader-pdf[hybrid]"
- LangChain integration: pip install -U langchain-opendataloader-pdf
Requirements and caveats from upstream:
- SDKs: Python, Node.js, Java
- Requires Java 11+ and Python 3.10+ (Node.js and Java SDKs are also available)
Basic usage or getting-started notes:
- Pricing: open-source core (data extraction, layout analysis, auto-tagging to Tagged PDF); enterprise add-on (PDF/UA export, accessibility studio)
- Extraction benchmark: #1 overall extraction accuracy (0.907) in hybrid mode, 0.928 table extraction accuracy, 0.015 s/page in local mode
- Accessibility validation: PDF Association collaboration, Well-Tagged PDF specification, veraPDF automated validation
- Source: https://github.com/opendataloader-project/opendataloader-pdf
- Extracted from upstream docs: https://raw.githubusercontent.com/opendataloader-project/opendataloader-pdf/HEAD/README.md
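As a getting-started sketch, the Python package can be driven from a short script. This assumes the package exposes a `convert` function that accepts `input_path`, `output_dir`, and `format` arguments, as recalled from the upstream README; confirm the exact signature against the installed version. The helpers `build_convert_kwargs` and `convert_pdfs` and the file names are hypothetical.

```python
from typing import Dict, Iterable, List, Union


def build_convert_kwargs(
    inputs: Iterable[str],
    output_dir: str,
    formats: Iterable[str] = ("markdown", "json"),
) -> Dict[str, Union[str, List[str]]]:
    """Assemble keyword arguments for a single batched conversion call."""
    paths = [str(p) for p in inputs]
    if not paths:
        raise ValueError("at least one PDF file or folder is required")
    return {
        "input_path": paths,
        "output_dir": str(output_dir),
        # Assumption: output formats are passed as a comma-separated string.
        "format": ",".join(formats),
    }


def convert_pdfs(inputs: Iterable[str], output_dir: str) -> None:
    """Run the conversion; needs opendataloader-pdf installed and Java 11+."""
    # Imported lazily so the helper above works without the package installed.
    import opendataloader_pdf

    # Assumption: `convert` exists with these keyword arguments.
    opendataloader_pdf.convert(**build_convert_kwargs(inputs, output_dir))


if __name__ == "__main__":
    # Hypothetical input names, for illustration only.
    convert_pdfs(["report.pdf", "scans"], "output")
```

Passing every file and folder in one call keeps the work in a single conversion run instead of one run per document.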
Quality
Deterministic score 0.45 from registry signals: indexed on GitHub topic:agent-skills · 8 GitHub stars · SKILL.md body (1,758 chars)