ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations

We focus on electronic theses and dissertations (ETDs), aiming to improve access to and expand the utility of this corpus: more than 6 million ETDs are publicly available, and they constitute an important resource for research and education across disciplines. The corpus continues to grow as new born-digital documents are added and as millions of older theses and dissertations are converted to digital form for dissemination through institutional repositories. In ETDs, as in other scholarly works, figures and tables communicate a large amount of information concisely. Although methods have been proposed for extracting figures and tables from born-digital PDFs, they perform poorly on scanned ETDs. Our assessment is that state-of-the-art figure extraction systems fail on scanned PDFs because they have been trained only on born-digital documents. To address this limitation, we present ScanBank, a new dataset of 10 thousand scanned page images in which human annotators have manually labeled the 3.3 thousand figures and tables they contain. We use this dataset to train a deep neural network model based on YOLOv5 to accurately extract figures and tables from scanned ETDs. We pose and answer research questions aimed at finding better methods for figure extraction from scanned documents. One such question concerns the training value of data augmentation techniques that transform born-digital pages to resemble scanned ones, yielding models better suited to figure extraction from scanned documents. To the best of our knowledge, ScanBank is the first manually annotated dataset for figure and table extraction from scanned ETDs. A YOLOv5-based model trained on ScanBank outperforms existing comparable open-source, freely available baselines by a considerable margin.
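The data-augmentation idea above can be illustrated with a minimal sketch: born-digital page images are degraded to resemble scans before being used as training data. The specific transforms here (box blur, Gaussian sensor noise, salt-and-pepper speckle) and their parameters are illustrative assumptions, not the paper's exact augmentation pipeline; pages are represented as plain grids of 0-255 grayscale values to keep the sketch dependency-free.

```python
import random

def box_blur(page, radius=1):
    """Average each pixel with its neighbors (crudely simulates scanner optics)."""
    h, w = len(page), len(page[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [page[ny][nx]
                    for ny in range(max(0, y - radius), min(h, y + radius + 1))
                    for nx in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = sum(vals) // len(vals)
    return out

def scan_augment(page, noise_sigma=8.0, speckle_prob=0.002, seed=None):
    """Degrade a born-digital grayscale page (rows of 0-255 ints) so it
    resembles a scan: blur, add Gaussian sensor noise, then sprinkle
    salt-and-pepper speckle, clamping results back to the valid range."""
    rng = random.Random(seed)
    out = box_blur(page)
    for row in out:
        for x in range(len(row)):
            v = row[x] + rng.gauss(0, noise_sigma)
            if rng.random() < speckle_prob:
                v = rng.choice((0, 255))
            row[x] = max(0, min(255, int(v)))
    return out
```

In a full pipeline, the degraded pages, paired with the original figure and table bounding boxes (which the transforms leave in place), would serve as additional training examples for the detector.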
