Fine-Grained Object Detection over Scientific Document Images with Region Embeddings

We study object detection over scanned images of scientific documents. These pages contain objects of widely varying sizes and aspect ratios, ranging from coarse elements such as tables and figures to fine elements such as equations and section headers. We find that current object detectors fail to produce well-localized region proposals over such page objects. We revisit the original R-CNN model and present a method for generating fine-grained proposals over document elements. We also present a region embedding model that uses the convolutional maps of a proposal's neighbors as context to produce an embedding for each proposal, capturing the semantic relationships between a target region and its surrounding context. Our end-to-end model produces an embedding for each proposal, then classifies each proposal with a multi-head attention model that attends to the proposal's most important neighbors. To evaluate our model, we collect and annotate a dataset of publications from heterogeneous journals. We show that our model, referred to as Attentive-RCNN, yields a 17% mAP improvement compared to standard object detection models.
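
The idea of attending over a proposal's neighbors can be illustrated with a small, self-contained sketch. This is not the Attentive-RCNN implementation: the abstract does not specify the architecture in detail, so the code below assumes plain scaled dot-product attention, forms heads by slicing the embedding instead of using learned projections, and uses hand-written toy vectors in place of embeddings derived from convolutional maps. The names `attend_to_neighbors`, `split_heads`, and `softmax` are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def split_heads(vec: Vector, num_heads: int) -> List[Vector]:
    """Slice one embedding into num_heads equal chunks (assumption: no learned projections)."""
    size = len(vec) // num_heads
    return [vec[i * size:(i + 1) * size] for i in range(num_heads)]


def attend_to_neighbors(
    target: Vector, neighbors: List[Vector], num_heads: int = 2
) -> Tuple[Vector, List[List[float]]]:
    """Pool neighbor embeddings with per-head scaled dot-product attention.

    Returns the concatenated context vector and the attention weights
    of each head over the neighbors.
    """
    if not neighbors:
        raise ValueError("at least one neighbor is required")
    if len(target) % num_heads != 0:
        raise ValueError("embedding size must be divisible by num_heads")

    target_heads = split_heads(target, num_heads)
    neighbor_heads = [split_heads(n, num_heads) for n in neighbors]

    context: Vector = []
    weights_per_head: List[List[float]] = []
    for h, query in enumerate(target_heads):
        keys = [heads[h] for heads in neighbor_heads]
        scale = math.sqrt(len(query))
        # Similarity between the target slice and each neighbor slice.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
        weights = softmax(scores)
        weights_per_head.append(weights)
        # Weighted average of the neighbor slices for this head.
        pooled = [
            sum(w * key[d] for w, key in zip(weights, keys))
            for d in range(len(query))
        ]
        context.extend(pooled)
    return context, weights_per_head


# Toy example: one target proposal and three neighboring proposals.
target = [1.0, 0.0, 0.0, 1.0]
neighbors = [
    [1.0, 0.0, 0.0, 1.0],    # similar to the target
    [0.0, 1.0, 1.0, 0.0],    # orthogonal to the target
    [-1.0, 0.0, 0.0, -1.0],  # opposite to the target
]
context, weights = attend_to_neighbors(target, neighbors, num_heads=2)
```

In this toy setup each head assigns the largest weight to the neighbor most similar to the target, so the pooled context is dominated by that neighbor. A trained model would instead learn the query, key, and value projections and feed the resulting context into a classifier.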
