GitPedia

Document parsers list

A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each parser is tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts.

From GiftMungmeeprued · Updated June 18, 2026

The project was first published in 2025. Key topics include: data-pipeline, document-image-processing, document-parser, document-parsing, langchain.

📃 Extensive List of Document Parsers

  • 🚧 THIS IS A WORK IN PROGRESS! More will be added soon!
  • Feel free to contribute by submitting a pull request 🙏
  • Cells marked with ✅ or ❌ have been independently tested. Blank cells indicate that the feature has not yet been independently tested.
  • See the `results` folder for the outputs from the models.

PDF-to-Text Converters

These usually output raw text or Markdown.

Machine-generated Documents only

| Models | Source | Output | Needs prompt? | Table | Equation | Figure | Handwriting | Two columns | Multiple columns |
|---|---|---|---|---|---|---|---|---|---|
| PyMuPDF | GitHub | Raw text | N | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ |
| PDFPlumber | GitHub | Raw text | N | ✅ (separate from text) | ❌ | ❌ | ❌ | ❌ | ❌ |

Machine-generated and Scanned Documents

| Models | Source | Output | Needs prompt? | Table | Equation | Handwriting | Two columns | Multiple columns |
|---|---|---|---|---|---|---|---|---|
| Marker | GitHub | Markdown | N | ✅ (markdown) | ✅ | ✅ | ✅ | ❌ |
| MonkeyOCR | GitHub, Hugging Face model | Markdown | Y | ✅ (html) | ✅ | ✅ | ✅ | ✅ |
| Nougat | GitHub | Markdown | N | ❌ | ✅ | ✅ | ✅ | ❌ |
| MinerU | GitHub | Markdown | N | ✅ (html) | ✅ | ❌ | ✅ | ❌ |
| Llamaparse (balanced mode) | - | Markdown | Y | ✅ (markdown) | ❌ | ❌ | ✅ | ❌ |
| Llamaparse (premium mode) | - | Markdown | Y | ✅ (markdown) | ❌ | ❌ | ✅ | ❌ |
| Docling | GitHub | Markdown | N | ✅ (markdown) | ❌ | ❌ | ✅ | ✅ |
| RolmOCR | Hugging Face model | Markdown | Y | ✅ (markdown) | ✅ | ✅ | ✅ | † |
| olmOCR | GitHub | Markdown | Y | ✅ (markdown) | ✅ | ✅ | ✅ | † |
| Unstructured | GitHub | Raw text | N | ❌ | ❌ | ❌ | ❌ | ✅ |
| Pytesseract | GitHub | Raw text | N | ❌ | ❌ | ❌ | ✅ | ✅ |
| MarkItDown | GitHub | Markdown | N | ❌ | ❌ | ❌ | ✅ | ✅ |
| Amazon textract | - | | | | | | | |
| Azure AI Document Intelligence | - | | | | | | | |
| Google Cloud OCR | - | | | | | | | |
| Mathpix | - | | | | | | | |
| MistralOCR | - | | | | | | | |
| Upstage | - | | | | | | | |
| OmniAI | - | | | | | | | |
| ChatDoc PDF parser | - | | | | | | | |
| Reducto | - | | | | | | | |
| OCRFlux | GitHub | | | | | | | |
| Nanonets | Hugging Face model | | | | | | | |
| PaddleOCR | GitHub | | | | | | | |
| ClovaOCR | - | | | | | | | |
| ParseExtract | - | | | | | | | |
| Tensorlake | - | | | | | | | |
| Vectorize | - | | | | | | | |
| MassivePix | - | | | | | | | |
| Dolphin | GitHub | | | | | | | |
| GOT | GitHub | | | | | | | |
| Manga OCR | GitHub | | | | | | | |
| EasyOCR | GitHub | | | | | | | |
| PDFeditify | - | | | | | | | |

† Process took too long
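The tested cells in the table above can be treated as a small feature matrix when shortlisting a parser. The following is only a sketch: it copies a subset of the rows by hand, and the names `ParserEntry`, `PARSERS`, and `select` are hypothetical helpers, not part of any listed tool. Cells marked † are stored as `None` so they never count as supported.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ParserEntry:
    """One hand-copied row of the scanned-documents table (hypothetical structure)."""
    name: str
    needs_prompt: bool
    table: bool
    equation: bool
    handwriting: bool
    two_columns: bool
    multi_columns: Optional[bool]  # None = "†" (process took too long)


# Subset of rows transcribed from the table; extend as more cells are tested.
PARSERS = [
    ParserEntry("Marker", False, True, True, True, True, False),
    ParserEntry("MonkeyOCR", True, True, True, True, True, True),
    ParserEntry("Nougat", False, False, True, True, True, False),
    ParserEntry("MinerU", False, True, True, False, True, False),
    ParserEntry("Docling", False, True, False, False, True, True),
    ParserEntry("olmOCR", True, True, True, True, True, None),
]


def select(parsers: List[ParserEntry], required: List[str],
           allow_prompt: bool = True) -> List[str]:
    """Return names of parsers whose cells are True for every required feature."""
    names = []
    for p in parsers:
        if not allow_prompt and p.needs_prompt:
            continue
        # `is True` excludes both False and untested/None cells.
        if all(getattr(p, feature) is True for feature in required):
            names.append(p.name)
    return names


if __name__ == "__main__":
    print(select(PARSERS, ["table", "equation"]))
    print(select(PARSERS, ["table", "handwriting"], allow_prompt=False))
```

Treating untested or timed-out cells as "not supported" is a conservative choice; a different policy may suit cases where any candidate is worth trying.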

Layout Parsers

These usually output JSON containing bounding box coordinates, content (as raw text or Markdown), and sometimes a block type (header, figure, paragraph, etc.).
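To make that output shape concrete, here is a minimal sketch of loading such JSON into typed records. The field names `bbox`, `content`, and `type`, the `[x0, y0, x1, y1]` coordinate order, and a y axis that grows downward are all assumptions for illustration; each real layout parser defines its own schema.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LayoutBlock:
    bbox: List[float]           # assumed order: [x0, y0, x1, y1]
    content: str                # raw text or Markdown
    type: Optional[str] = None  # e.g. "header", "figure", "paragraph"; may be absent


def load_blocks(raw: str) -> List[LayoutBlock]:
    """Parse a JSON array of blocks (hypothetical schema) into LayoutBlock records."""
    return [
        LayoutBlock(
            bbox=[float(v) for v in item["bbox"]],
            content=item["content"],
            type=item.get("type"),
        )
        for item in json.loads(raw)
    ]


def reading_order(blocks: List[LayoutBlock]) -> List[LayoutBlock]:
    """Naive top-to-bottom, left-to-right sort; only suitable for single-column pages."""
    return sorted(blocks, key=lambda b: (b.bbox[1], b.bbox[0]))


# Made-up sample input, not taken from any real parser.
SAMPLE = """[
  {"bbox": [72, 200, 540, 260], "content": "First paragraph.", "type": "paragraph"},
  {"bbox": [72, 80, 540, 120], "content": "Title", "type": "header"},
  {"bbox": [72, 300, 540, 420], "content": "A table in Markdown"}
]"""

if __name__ == "__main__":
    for block in reading_order(load_blocks(SAMPLE)):
        print(block.type, block.bbox, block.content)
```

The naive sort breaks on two-column and multi-column pages, which is one reason those layouts are tracked as separate columns in the tables.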

🚧 WORK IN PROGRESS

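| Models | Source | Output | Table | Equation | Handwriting | Two columns | Multiple columns |
|---|---|---|---|---|---|---|---|
| Chunkr | GitHub | | | | | | |
| GroundX | - | | | | | | |
| ChatDOC | - | | | | | | |
| Unstract | GitHub | | | | | | |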

Contributing

If you would like to contribute in any way, please read CONTRIBUTING.md and then make a contribution. Thank you!

This article is auto-generated from GiftMungmeeprued/document-parsers-list via the GitHub API. Last fetched: 6/21/2026