Apache Tika Document Parser
Extracts structured text, metadata, and embedded objects from PDFs, Office documents, and 1000+ file formats using the Apache Tika REST API. Outputs clean Markdown or JSON with XMP metadata preservation.
What it does
Extracts structured text, metadata, and embedded objects from PDFs, Office documents, and 1000+ file formats using the Apache Tika REST API. Outputs clean Markdown or JSON with XMP metadata preservation.
Installation
Requirements and caveats from upstream:
- N.B. Docker is used for tests in tika-integration-tests. If Docker is not installed, those tests are skipped.
Basic usage or getting-started notes:
- Parse a file in Java.
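As a minimal sketch of parsing a file from Java through the Tika REST API named in the description, the example below uses only the JDK's `java.net.http` client. It assumes a Tika server is already running at `http://localhost:9998/` (9998 is the server's default port). The class name `TikaRestSketch` and the helpers `buildRequest` and `send` are hypothetical names chosen for illustration, not part of Tika. The file is uploaded with `PUT` to the server's `/tika` endpoint for plain text and to `/meta` for metadata as JSON.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class TikaRestSketch {

    // Assumption: a Tika server is listening here (9998 is its default port).
    static final URI SERVER = URI.create("http://localhost:9998/");

    /**
     * Builds a PUT request that uploads raw file bytes to a Tika server endpoint.
     *
     * @param server   base URI of the server, ending in "/"
     * @param endpoint "tika" for body text, "meta" for metadata
     * @param accept   desired response type, e.g. "text/plain" or "application/json"
     * @param body     the document bytes to parse
     */
    static HttpRequest buildRequest(URI server, String endpoint, String accept, byte[] body) {
        return HttpRequest.newBuilder(server.resolve(endpoint))
                .header("Accept", accept)
                .PUT(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }

    /** Sends the request and returns the response body, failing on non-2xx status codes. */
    static String send(HttpRequest request) throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() / 100 != 2) {
            throw new IOException("Tika server returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java TikaRestSketch.java <file>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Path.of(args[0]));

        // Extracted body text as plain text.
        System.out.println(send(buildRequest(SERVER, "tika", "text/plain", bytes)));

        // Document metadata as a JSON object.
        System.out.println(send(buildRequest(SERVER, "meta", "application/json", bytes)));
    }
}
```

With Java 11 or later and a server running, this can be launched directly as `java TikaRestSketch.java report.pdf`, where `report.pdf` stands in for any local document. Converting the returned text to Markdown is a separate step that this sketch does not cover.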
Source: https://github.com/apache/tika
Extracted from upstream docs: https://raw.githubusercontent.com/apache/tika/HEAD/README.md
Quality
Deterministic score 0.45 from registry signals: indexed on GitHub topic:agent-skills · 8 GitHub stars · SKILL.md body (780 chars)