Document OCR & Parsing: Docling, dots.ocr, and Alternatives

Feeding documents into an AI pipeline should be simple, but parsing PDFs with tables, formulas, and multi-column layouts is a notorious bottleneck. This guide covers the best open-source tools, from multi-format parsers to state-of-the-art OCR models.

Docling: Multi-Format Document Parser

Docling solves the frustration of document parsing by converting diverse formats into a unified, expressive representation. Developed by IBM, it handles formats most tools can’t touch.

GitHub: github.com/DS4SD/docling · License: MIT

Key Features

Format Support: PDF, DOCX, PPTX, XLSX, HTML, Markdown, AsciiDoc, and images (PNG, TIFF, JPEG).

Advanced Understanding:

Layout analysis respecting reading order and multi-column layouts
Accurate table structure reconstruction
Formula and code recognition
OCR for scanned PDFs and images

Seamless Integration: LangChain, LlamaIndex, Crew AI, Haystack, MCP Server.

Get Started

pip install docling

from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())

docling https://arxiv.org/pdf/2206.01062

dots.ocr / dots.mocr: AI-Powered Multilingual OCR

dots.ocr is a state-of-the-art 1.7B vision-language model for document OCR across 100+ languages. It unifies layout detection and content recognition in a single model, outperforming GPT-4o and Gemini on document benchmarks. Now rebranded as dots.mocr.

GitHub: github.com/rednote-hilab/dots.ocr · Hugging Face: rednote-hilab/dots.ocr · License: MIT

Key Features

100+ languages including English, Chinese, Arabic, Hindi
Handles text, tables, mathematical formulas, and reading order
10x faster than traditional OCR tools
Outperforms GPT-4o, Gemini, Marker on benchmarks

How to Use

Web UI (easiest): dotsocr.net, drop a file, get markdown.

Local with vLLM (v0.11.0+ has official integration):

docker run --gpus all -p 8000:8000 vllm/vllm-openai:v0.11.0
python3 dots_mocr/parser.py demo/demo_image1.jpg
python3 dots_mocr/parser.py demo/demo_pdf1.pdf --num_thread 64

Local with Transformers:

git clone https://github.com/rednote-hilab/dots.mocr.git
cd dots.mocr && pip install -r requirements.txt
python3 dots_mocr/parser.py demo/demo_image1.jpg --use_hf true

Performance (OmniDocBench)

Metric	dots.ocr	vs GPT-4o	vs Marker
Overall Edit Distance	0.125	-	-
Text Recognition Error	0.032	-	60% better
Table TEDS	88.6%	46% better	-
Reading Order Error	0.040	Best	Best

OCR Alternatives Comparison

Tool	License	Languages	Approach	GitHub
Tesseract	Apache 2.0	100+	LSTM-based	tesseract-ocr/tesseract
Surya	GPL 3.0	90+	Transformer	VikParuchuri/surya
PaddleOCR	Apache 2.0	80+	Deep Learning	PaddlePaddle/PaddleOCR
EasyOCR	Apache 2.0	80+	CNN	JaidedAI/EasyOCR
Nougat	MIT	Academic	Transformer	facebookresearch/nougat

When to Use Which

Docling: Best for structured document pipelines, handles PDF, DOCX, PPTX, XLSX out of the box with Gen AI integrations
dots.ocr: Best OCR accuracy on complex multilingual documents, SOTA benchmarks, 100+ languages, free web UI
Tesseract: Mature, fast on CPU, 100+ languages, works fully offline
Surya: Modern architecture with layout analysis, best for challenging scripts and handwritten text
PaddleOCR: Strongest for Chinese, Japanese, Korean
EasyOCR: Easiest Python API for quick prototyping

Crepi il lupo! 🐺