
PDF text data extractor

PDF text data extraction web app with OCR for scanned documents

From nainiayoub · Updated April 20, 2026

PDF text data extraction app that takes a PDF document as input and returns either a single txt file containing all pages or a compressed folder of txt files, one per page of the document. OCR can also be enabled for scanned documents. The project is written primarily in Python and was first published in 2022. Key topics include: ocr, ocr-python, ocr-text-reader, pdf, pdf-to-text.

PDF to Text


PDF text data extraction app that takes a PDF document as input and returns either a single txt file containing all pages or a compressed folder of txt files, one per page of the document. OCR can also be enabled for scanned documents.


How does it work?

mermaid
flowchart LR
    A[PDF] --> |text conversion / OCR| B(Text)
    B --> |Option 1| D[txt file]
    B --> |Option 2| E[ZIP folder of txt files for pages]

  1. Upload your PDF.
  2. Enable OCR (for scanned documents).
  3. Select the PDF language.
  4. Download your output file (zip/txt).
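
The two output options in the last step can be sketched with the Python standard library alone. This is a minimal illustration, not the app's actual code: the function names `pages_to_single_txt` and `pages_to_zip`, the `page_N.txt` naming scheme, and the sample page texts are all assumptions, and the text extraction / OCR step is taken as already done.

```python
import io
import zipfile


def pages_to_single_txt(pages: list[str]) -> bytes:
    """Option 1: join all page texts into one UTF-8 encoded txt payload."""
    return "\n\n".join(pages).encode("utf-8")


def pages_to_zip(pages: list[str], stem: str = "page") -> bytes:
    """Option 2: write one txt file per page into an in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for number, text in enumerate(pages, start=1):
            # Hypothetical naming scheme: page_1.txt, page_2.txt, ...
            archive.writestr(f"{stem}_{number}.txt", text)
    return buffer.getvalue()


# Hypothetical page texts standing in for the extraction / OCR result.
pages = ["First page text", "Second page text"]
single = pages_to_single_txt(pages)
zipped = pages_to_zip(pages)
```

Both helpers return bytes, so either result could be handed directly to a download widget or written to disk.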

How to support the project

You can support the project by giving feedback and/or buying me a coffee.


This article is auto-generated from nainiayoub/pdf-text-data-extractor via the GitHub API. Last fetched: 6/21/2026