Surya Document OCR with Layout Analysis and Table Recognition
Surya is a document OCR toolkit by Datalab that performs OCR in 90+ languages, line-level text detection, layout analysis, reading order detection, table recognition, and LaTeX OCR. It benchmarks favorably against cloud OCR services on a wide range of document types.
Installation
Use the upstream install or setup path that matches your environment:
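- pip install surya-ocr (the core package)
- pip install streamlit pdftext (optional, for the interactive Streamlit app)
- pip install streamlit==1.40 streamlit-drawable-canvas-jsretry (optional, for the LaTeX OCR Streamlit app)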
Requirements and caveats from upstream:
- Commercial self-hosting requires a license; see the Commercial usage section of the upstream README. For on-prem licensing, contact Datalab.
- You'll need Python 3.10+ and PyTorch. You may need to install the CPU version of torch first if you're not using a Mac or a GPU machine. See the PyTorch installation instructions for more details.
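The Python and PyTorch requirements above can be checked before installing. The following is a minimal sketch, not part of Surya: `check_environment` and `MIN_PYTHON` are hypothetical names introduced here, and the check only confirms that a module named `torch` is importable, not that it is the CPU or GPU build.

```python
import importlib.util
import sys

# Minimum Python version stated in the requirements above.
MIN_PYTHON = (3, 10)


def check_environment(version_info=sys.version_info,
                      find_spec=importlib.util.find_spec):
    """Return a list of problems; an empty list means the basics look fine.

    Both arguments are injectable so the check can be exercised without
    changing the running interpreter (hypothetical helper, not a Surya API).
    """
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        found = ".".join(str(part) for part in version_info[:2])
        problems.append(f"Python 3.10+ is required, found {found}")
    if find_spec("torch") is None:
        problems.append("PyTorch (module 'torch') is not installed")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    for issue in issues:
        print(issue)
    if not issues:
        print("Python and PyTorch requirements look satisfied")
```

Running the script prints one line per unmet requirement, which makes it easy to tell whether torch needs to be installed before `pip install surya-ocr`.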
Basic usage or getting-started notes:
- It works on a range of documents (see the usage and benchmarks sections of the upstream README for more details).
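The usage notes above do not include a command, so here is a hedged sketch of driving OCR from Python through the command line. It assumes the `surya_ocr` console command and its `--output_dir` option as described in the upstream README; both may differ in your installed version. `build_ocr_command` and `run_ocr` are hypothetical helper names, not part of Surya.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(data_path, output_dir=None):
    """Build the argument list for the assumed `surya_ocr` CLI."""
    command = ["surya_ocr", str(data_path)]
    if output_dir is not None:
        # `--output_dir` is taken from the upstream README; verify it with
        # `surya_ocr --help` on your installation.
        command += ["--output_dir", str(output_dir)]
    return command


def run_ocr(data_path, output_dir=None):
    """Run OCR on an image, PDF, or folder, failing early if the CLI is missing."""
    if shutil.which("surya_ocr") is None:
        raise RuntimeError(
            "surya_ocr was not found on PATH; try `pip install surya-ocr`"
        )
    return subprocess.run(build_ocr_command(data_path, output_dir), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_ocr(...) to actually execute it.
    print(build_ocr_command(Path("scan.pdf"), "results"))
```

Keeping command construction separate from execution lets the arguments be inspected or logged before any model weights are downloaded.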
Extracted from upstream docs: https://raw.githubusercontent.com/VikParuchuri/surya/HEAD/README.md