Convolutional Neural Networks for Figure Extraction in Historical Technical Documents

We present a method for extracting figures and images from the pages of scanned documents, especially technical research articles. Our approach is novel in two key ways. First, we treat figure extraction as a computer vision problem and train convolutional neural networks to recognize figures in scanned pages. Second, we generate our training data from 'born-digital' structured documents, which allows us to produce labels for our training set automatically using PDF figure extractors. This avoids the otherwise tedious task of hand-labelling thousands of document pages. Our convolutional neural networks achieve precision and recall of close to 85% in identifying figures in a test set consisting of modern journal papers and conference proceedings, and obtain precision and recall above 80% on an application data set composed of historical technical documents scanned from the Bell Labs Records. These results show that models trained on born-digital documents transfer to historical scans with only a modest drop in precision and recall. Finally, our models can be readily extended to identify other document elements such as tables and captions.
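
To make the reported precision and recall concrete, the following is a minimal sketch of how such figures could be computed for bounding-box detections. It is an illustration under stated assumptions, not the evaluation procedure used in this work: the abstract does not say how a predicted figure is matched to a labelled one, so the intersection-over-union (IoU) criterion, the 0.5 threshold, and the names `iou` and `precision_recall` are all hypothetical.

```python
from typing import List, Tuple

# A box is (x0, y0, x1, y1) in page pixel coordinates (assumed convention).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    predicted: List[Box], truth: List[Box], threshold: float = 0.5
) -> Tuple[float, float]:
    """Greedily match each prediction to at most one unmatched label.

    A prediction counts as a true positive when its best remaining
    label overlaps it with IoU >= threshold (an assumed criterion).
    """
    unmatched = list(truth)
    true_positives = 0
    for p in predicted:
        best = max(unmatched, key=lambda t: iou(p, t), default=None)
        if best is not None and iou(p, best) >= threshold:
            true_positives += 1
            unmatched.remove(best)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Example: one of two predictions overlaps one of two labelled figures.
labels = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(0, 0, 10, 9), (50, 50, 60, 60)]
p, r = precision_recall(detections, labels)
```

In this example one detection matches a label and one does not, so both precision and recall come out to 0.5. With labels produced automatically by a PDF figure extractor, the same comparison can be run over a whole test set without hand annotation.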
