OCR Using Computer Vision and Machine Learning

Optical character recognition (OCR) is a well-researched area that has seen many advances. Many state-of-the-art algorithms are available for OCR, but extracting text from images that contain tables while preserving the table structure remains a challenging task. Here, we present an efficient and highly scalable parallel architecture that segments input images containing tabular data, with or without borders, into cells and reconstructs the data in its original tabular format. The resulting performance improvement eases the tedious task of digitizing tabular data in bulk. The same architecture can also improve the performance of regular OCR applications when the volume of data is large.
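The abstract does not give implementation details, so the following is only a minimal sketch of the general idea: recognize already-segmented cells in parallel, then place each result back at its grid position so the tabular format survives. The `Cell` type, the `recognize_cell` stub, and the `max_workers` parameter are all assumptions made for illustration; a real pipeline would crop image regions and call an actual OCR engine in place of the stub.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One segmented table cell (hypothetical type for this sketch)."""
    row: int
    col: int
    pixels: str  # stand-in for a cropped image region


def recognize_cell(cell):
    """Hypothetical OCR stub: returns the cell's grid position and its text.

    A real system would run an OCR engine on the cropped image here.
    """
    return cell.row, cell.col, cell.pixels.strip()


def reconstruct_table(cells, max_workers=4):
    """Recognize cells concurrently and rebuild the table as a list of rows."""
    if not cells:
        return []
    n_rows = max(c.row for c in cells) + 1
    n_cols = max(c.col for c in cells) + 1
    table = [[""] * n_cols for _ in range(n_rows)]
    # Each cell is independent of the others, so the work can be spread
    # across a pool of workers; the (row, col) index restores the layout.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for row, col, text in pool.map(recognize_cell, cells):
            table[row][col] = text
    return table


if __name__ == "__main__":
    # Cells arrive in arbitrary order; the grid indices restore the structure.
    cells = [
        Cell(1, 1, " 42 "),
        Cell(0, 0, "Item"),
        Cell(1, 0, "Widget"),
        Cell(0, 1, "Qty"),
    ]
    for row in reconstruct_table(cells):
        print(" | ".join(row))
```

Because each cell carries its own row and column index, the order in which workers finish does not affect the reconstructed table. A thread pool is used here only to keep the sketch short; for CPU-bound recognition a process pool or a distributed worker setup may be a better fit.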
