In the course of maintenance and operations, equipment operators and manufacturers frequently generate large volumes of paper documents. This is especially common when maintaining legacy systems, and when external factors (e.g., security concerns, the operating environment, or training procedures) make it infeasible to record data in a computer system in real time. To support analytics or automated monitoring, these documents must later be converted to digital copies that can be ingested into a database. This paper describes a flexible system for converting paper forms into digital documents through Optical Character Recognition (OCR), using open-source tools and packages. The system allows business rules and processes to be incorporated into the conversion so that the resulting digital copies are high-fidelity.
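As an illustration of how business rules might be layered on top of raw OCR output, the following is a minimal sketch, not the system described in this paper. It assumes the OCR step has already produced a dictionary of field strings for one form. The field names (`work_order`, `date`, `technician`), the look-alike character table, and the validation patterns are all hypothetical and would be replaced by the rules of the actual form.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical table of letters that OCR output often confuses with digits.
DIGIT_LOOKALIKES = str.maketrans(
    {"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"}
)


def fix_digits(text: str) -> str:
    """Trim whitespace and map look-alike letters to digits (numeric fields only)."""
    return text.strip().translate(DIGIT_LOOKALIKES)


def tidy_text(text: str) -> str:
    """Collapse runs of whitespace in free-text fields."""
    return " ".join(text.split())


@dataclass
class FieldRule:
    name: str                         # field name on the form (hypothetical)
    normalize: Callable[[str], str]   # cleanup applied to the raw OCR string
    pattern: str                      # regex the cleaned value must fully match


# Illustrative rules for an assumed maintenance form.
RULES: List[FieldRule] = [
    FieldRule("work_order", fix_digits, r"\d{6}"),
    FieldRule("date", fix_digits, r"\d{4}-\d{2}-\d{2}"),
    FieldRule("technician", tidy_text, r"[A-Za-z .'-]+"),
]


def apply_rules(
    raw: Dict[str, str], rules: List[FieldRule] = RULES
) -> Tuple[Dict[str, str], List[str]]:
    """Return the cleaned fields and the names of fields that fail validation."""
    cleaned: Dict[str, str] = {}
    flagged: List[str] = []
    for rule in rules:
        value = rule.normalize(raw.get(rule.name, ""))
        cleaned[rule.name] = value
        if not re.fullmatch(rule.pattern, value):
            flagged.append(rule.name)  # route to manual review
    return cleaned, flagged
```

Calling `apply_rules({"work_order": "1O45S2", "date": "2O19-0l-15", "technician": "  J.  Smith "})` returns the cleaned fields together with a list of any fields that still fail their pattern. In a sketch like this, flagged fields would be sent for manual review rather than ingested, which is one way a rule layer can keep low-confidence values out of the database.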