GitHub - ssj-ali/pdfextract: PDF Data Extraction Automation using pdftotext and Tesseract OCR

Prerequisite

Note: The steps below are required only if you want to extract data from scanned PDFs. If you only have vector PDFs, you can skip them.

Download Tesseract. Follow this link: Download Tesseract
Add the path of the Tesseract OCR folder (e.g. C:\Program Files\Tesseract-OCR) to Environment Variables -> System variables -> Path.

Follow this: Control Panel -> System and Security -> System -> Advanced system settings -> Environment Variables -> select 'Path' (in the lower section named System variables) and click 'Edit' -> New -> add the path to the Tesseract-OCR folder. Check in File Explorer where this folder is; most likely it is C:\Program Files\Tesseract-OCR.
Download ImageMagick. Follow this link: Download ImageMagick. ImageMagick is used to preprocess the PDF, i.e. to clean the PDF pages and then convert them to a TIFF file.
Add the path of the ImageMagick folder (e.g. C:\Program Files\ImageMagick-6.9.11-Q16) to Environment Variables -> System variables -> Path, using the same Control Panel procedure as for Tesseract.
Go to the ImageMagick folder (e.g. C:\Program Files\ImageMagick-6.9.11-Q16) and find the convert.exe application. Rename it from 'convert' to 'imconvert'. The reason is that Windows already contains another convert.exe, so there is a possibility that it will be run instead of ImageMagick's convert.exe.
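To confirm that the Path changes took effect, a small check like the one below can help. It is only a sketch, not part of pdfextract: it assumes Python is available, and the function name `missing_tools` is made up for illustration. The tool names are the ones set up above (`tesseract`, and `imconvert` after the rename).

```python
import shutil

# Executables expected on Path after the steps above.
# 'imconvert' is ImageMagick's convert.exe after renaming it.
REQUIRED_TOOLS = ("tesseract", "imconvert")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on the current Path."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on Path:", ", ".join(missing))
    else:
        print("All OCR prerequisites were found on Path.")
```

Open a new terminal before running it, since an already open terminal keeps the old Path.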

Installation Guide

Currently available for 64-bit systems only.

Download the pdfextract.exe file.
Run pdfextract.exe and follow these steps:
- Add the template folder containing the template files in .yml format.
- Add or remove .pdf files in the listbox, then press Done.

Note: The Extract Text button is used while creating a template. See Template Tutorial

Key Points

If a PDF is a vector PDF, pdftotext is used to extract the data, and it is nearly instant (< 1 sec per PDF).
If a PDF is a scanned PDF, Tesseract, ImageMagick, and Ghostscript are used to extract the data (time: 3-4 sec per PDF).
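The two paths above can be sketched as follows. This is an illustration under assumptions, not pdfextract's actual code: the function names, the `min_chars` threshold, and the specific ImageMagick options are hypothetical choices. The idea is that a scanned PDF has no text layer, so pdftotext returns almost nothing and the OCR path is taken instead.

```python
import subprocess


def needs_ocr(extracted_text, min_chars=20):
    """Heuristic (assumption): almost no text from pdftotext means a scanned PDF."""
    return len(extracted_text.strip()) < min_chars


def extract_vector_text(pdf_path):
    """Run pdftotext; '-layout' keeps the page layout and '-' writes to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_commands(pdf_path, tiff_path):
    """Build the two OCR commands: PDF -> cleaned TIFF, then TIFF -> text.

    The ImageMagick options are an assumed, commonly used preprocessing set.
    ImageMagick relies on Ghostscript to read the PDF.
    """
    convert_cmd = [
        "imconvert", "-density", "300", pdf_path,
        "-depth", "8", "-strip", "-background", "white", "-alpha", "off",
        tiff_path,
    ]
    tesseract_cmd = ["tesseract", tiff_path, "stdout"]
    return convert_cmd, tesseract_cmd


def extract_text(pdf_path, tiff_path="page.tiff"):
    """Try the fast vector path first and fall back to OCR for scanned PDFs."""
    text = extract_vector_text(pdf_path)
    if not needs_ocr(text):
        return text
    convert_cmd, tesseract_cmd = ocr_commands(pdf_path, tiff_path)
    subprocess.run(convert_cmd, check=True)
    result = subprocess.run(tesseract_cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Trying pdftotext first keeps vector PDFs on the fast path, and only scanned files pay the OCR cost.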

Template System

The extracted fields depend only on the template, i.e. if more fields need to be extracted, we just need to edit the templates.
Across templates, the field name should be the same for all PDFs. For example, if the field name for the serial number is SerialNo in one template and S.No in another, the two will be treated as different fields in the final Excel sheet.
We can have multiple regexes per field (in case the layout or wording changes).
To create a template, see Template Tutorial
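The points above can be illustrated with a short sketch. It does not use the real template format (that is described in the Template Tutorial); the dictionaries, regexes, and function names here are hypothetical. It shows trying several regexes per field in order, and why inconsistent field names end up as separate columns.

```python
import re

# Hypothetical template: field name -> regexes tried in order.
# The second regex covers an alternative wording of the same label.
TEMPLATE = {
    "SerialNo": [
        r"Serial\s*No\.?\s*[:\-]?\s*(\w+)",
        r"S\.?\s*No\.?\s*[:\-]?\s*(\w+)",
    ],
}


def extract_fields(text, template):
    """Return {field: value}, using the first regex that matches for each field."""
    row = {}
    for field, patterns in template.items():
        row[field] = None  # stays None if no regex matches
        for pattern in patterns:
            match = re.search(pattern, text)
            if match:
                row[field] = match.group(1)
                break
    return row


def column_names(rows):
    """Collect the sheet's column names as the union of field names, in first-seen order."""
    names = []
    for row in rows:
        for field in row:
            if field not in names:
                names.append(field)
    return names


rows = [
    extract_fields("Serial No: AB123", TEMPLATE),
    extract_fields("S.No - 77", TEMPLATE),
]
print(rows)
print(column_names(rows))
# With inconsistent names, the same data would be split across two columns:
print(column_names([{"SerialNo": "AB123"}, {"S.No": "77"}]))
```

Both wordings land in the single SerialNo column because the alternative wording is handled by a second regex under the same field name, not by a differently named field.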

Challenges

Occasionally OCR fails to recognize text correctly because the scanned PDF is not clean. A typical error is the digit 0 (zero) being read as the letter O.
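One possible mitigation, shown here only as a sketch and not as something pdfextract does, is to post-process fields that are known to be purely numeric. The function name and the substitution table are assumptions. Only the 0/O confusion comes from the observation above; the l/I to 1 substitutions are an added guess at similar errors.

```python
# Letters that OCR commonly confuses with digits.
# O/o -> 0 is the case described above; I/l -> 1 is an added assumption.
OCR_DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1"})


def fix_numeric_field(value):
    """Apply the substitutions only if the result is all digits; otherwise keep the value."""
    candidate = value.translate(OCR_DIGIT_FIXES)
    return candidate if candidate.isdigit() else value


print(fix_numeric_field("1O45"))   # a digit field with one misread character
print(fix_numeric_field("ORDER"))  # not a number, so it is left unchanged
```

Restricting the fix to values that become all digits avoids corrupting text fields that legitimately contain the letter O.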
