Apache Tika Document Extractor
Wraps the Apache Tika Server REST API to extract structured text across 1,200+ file formats, including PDF, DOCX, and PPTX. Outputs clean markdown with metadata preserved, using the Tika /rmeta/text endpoint in recursive parsing mode.
What it does
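As a rough illustration of the markdown-with-metadata output described above, the sketch below renders records shaped like Tika's /rmeta response (one metadata map per document, with the extracted text under the X-TIKA:content key) into markdown. The class name RmetaMarkdown, the section layout, and the assumption that the JSON has already been parsed into string maps by a JSON library of your choice are all hypothetical; this is not the skill's actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RmetaMarkdown {
    // Key under which Tika's /rmeta output stores the extracted body text.
    static final String CONTENT_KEY = "X-TIKA:content";

    // Renders one record as a markdown section: heading, metadata bullets, then text.
    // The layout is an assumption made for this sketch.
    static String renderRecord(String title, Map<String, String> record) {
        StringBuilder md = new StringBuilder("## ").append(title).append("\n\n");
        for (Map.Entry<String, String> e : record.entrySet()) {
            if (!CONTENT_KEY.equals(e.getKey())) {
                md.append("- ").append(e.getKey()).append(": ")
                  .append(e.getValue()).append("\n");
            }
        }
        String body = record.getOrDefault(CONTENT_KEY, "").trim();
        if (!body.isEmpty()) {
            md.append("\n").append(body).append("\n");
        }
        return md.toString();
    }

    // The first record is the container document; later ones are embedded documents.
    static String render(List<Map<String, String>> records) {
        StringBuilder md = new StringBuilder();
        for (int i = 0; i < records.size(); i++) {
            if (i > 0) {
                md.append("\n");
            }
            String title = (i == 0) ? "Document" : "Embedded document " + i;
            md.append(renderRecord(title, records.get(i)));
        }
        return md.toString();
    }

    public static void main(String[] args) {
        // Hand-built sample record standing in for parsed /rmeta/text output.
        Map<String, String> record = new LinkedHashMap<>();
        record.put("Content-Type", "application/pdf");
        record.put(CONTENT_KEY, "\n\nHello from Tika\n");
        System.out.print(render(List.of(record)));
    }
}
```

Keeping the metadata as a bullet list ahead of the body means each embedded document carries its own metadata in the output, which is one plausible reading of "metadata preservation".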
Installation
Requirements and caveats from upstream:
- N.B. Docker is used for tests in tika-integration-tests. If Docker is not installed, those tests are skipped.
Basic usage or getting-started notes:
- Parse a file in Java; the upstream README linked below has the code sample.
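Because this skill goes through Tika Server rather than the in-process Java API, a minimal sketch of the equivalent REST call may be useful. It uses only the JDK's java.net.http client and assumes a Tika Server is already running at http://localhost:9998 (Tika Server's default port); the class name TikaRmetaClient and the helper buildRequest are hypothetical names chosen for this example.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

class TikaRmetaClient {
    // Builds a PUT request that uploads the raw document bytes to <baseUrl>/rmeta/text.
    static HttpRequest buildRequest(String baseUrl, byte[] document) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/rmeta/text"))
                .header("Accept", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: java TikaRmetaClient <file>");
            return;
        }
        byte[] document = Files.readAllBytes(Path.of(args[0]));
        // Assumed server location; adjust to wherever your Tika Server listens.
        HttpRequest request = buildRequest("http://localhost:9998", document);
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // The body is a JSON array with one metadata object per (embedded) document.
        System.out.println(response.body());
    }
}
```

Separating request construction from sending keeps the endpoint and headers easy to inspect without a live server.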
Source: https://github.com/apache/tika
Extracted from upstream docs: https://raw.githubusercontent.com/apache/tika/HEAD/README.md
Quality
Deterministic score of 0.45 from registry signals: indexed on GitHub topic:agent-skills; 8 GitHub stars; SKILL.md body of 805 characters.