Turn messy document collections into structured rows with DocETL
Define repeatable extraction pipelines that pull fields from large document collections, normalize outputs, and audit failures across the corpus.
What it does
Define repeatable extraction pipelines that pull fields from large document collections, normalize outputs, and audit failures across the corpus.
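The extract, normalize, and audit flow described above can be sketched in plain Python. This is a minimal illustration only: it does not call DocETL, and extract_fields, normalize, run_pipeline, and AuditLog are hypothetical names, with a regular expression standing in for the model-driven extraction step.

```python
import re
from dataclasses import dataclass, field


@dataclass
class AuditLog:
    """Collects documents whose extraction failed, with the reason."""
    failures: list = field(default_factory=list)

    def record(self, doc_id, reason):
        self.failures.append({"id": doc_id, "reason": reason})


def extract_fields(text):
    """Hypothetical stand-in for an extraction step (not DocETL's API)."""
    vendor = re.search(r"Vendor:\s*(.+)", text)
    total = re.search(r"Total:\s*\$?([\d,]+(?:\.\d+)?)", text)
    return {
        "vendor": vendor.group(1) if vendor else None,
        "total": total.group(1) if total else None,
    }


def normalize(row):
    """Trim whitespace and convert the amount string to a float."""
    return {
        "vendor": row["vendor"].strip(),
        "total": float(row["total"].replace(",", "")),
    }


def run_pipeline(docs):
    """Return (structured rows, audit log) for a list of documents."""
    rows, audit = [], AuditLog()
    for doc in docs:
        raw = extract_fields(doc["text"])
        missing = [key for key, value in raw.items() if value is None]
        if missing:
            # Record the failure instead of silently dropping the document.
            audit.record(doc["id"], "missing fields: " + ", ".join(missing))
            continue
        rows.append({"id": doc["id"], **normalize(raw)})
    return rows, audit


docs = [
    {"id": "a", "text": "Vendor: Acme Corp  \nTotal: $1,250.50"},
    {"id": "b", "text": "Scanned page with no readable fields"},
]
rows, audit = run_pipeline(docs)
print(rows)
print(audit.failures)
```

Keeping failures in a separate log, rather than emitting partial rows, makes it possible to review every document that did not yield a complete row after a run over the whole corpus.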
Prerequisites
Python 3.10+, DocETL, a document corpus, and an extraction configuration
Installation
Use the upstream install or setup path that matches your environment:
- Docker (recommended for a quick start): make docker
- Python package: pip install docetl
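After installing, a quick environment check can confirm the prerequisites. The sketch below uses only the Python standard library; installed_version is a hypothetical helper name, and the check assumes the distribution name is docetl, as in the pip command above.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# The prerequisites list Python 3.10 or later.
python_ok = sys.version_info >= (3, 10)
print("Python 3.10+:", "yes" if python_ok else "no")

version = installed_version("docetl")
if version is None:
    print("docetl is not installed; run: pip install docetl")
else:
    print(f"docetl {version} is installed")
```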
Requirements and caveats from upstream:
- DocETL is available as a Python package for running production pipelines from the command line or from Python code.
- Upstream labels the Python package route "For Production Use".
Basic usage or getting-started notes:
- DocWrangler, the playground, is hosted at docetl.org/playground and can also be run locally.
- Upstream lists an OpenAI API key among the requirements.
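Because an OpenAI API key is needed, it can help to fail early when it is missing. The sketch below assumes the key is supplied through the OPENAI_API_KEY environment variable, which is the conventional name but is not stated in the notes above; missing_env is a hypothetical helper.

```python
import os


def missing_env(names, env=None):
    """Return the variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Assumption: the key is read from OPENAI_API_KEY.
missing = missing_env(["OPENAI_API_KEY"])
if missing:
    print("Set before running a pipeline: " + ", ".join(missing))
else:
    print("API key found in the environment")
```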
Extracted from upstream docs: https://raw.githubusercontent.com/ucbepic/docetl/HEAD/README.md
Quality
Deterministic score 0.45 from registry signals: indexed on GitHub topic agent-skills, 8 GitHub stars, SKILL.md body (1,254 chars)