Mathematical Variable Detection in PDF Scientific Documents

The detection of mathematical expressions in PDF documents has been studied and has advanced in recent years. Within this task, detecting inline-expression variables that are represented by alphabetical characters remains a challenge. Compared with other components of inline expressions, variables are subject to many sources of ambiguity. In this paper, the errors that arise in detecting variables in PDF scientific documents are analyzed. Novel rules are proposed to improve detection accuracy. Experimental results on benchmark datasets containing English and Vietnamese documents show the effectiveness of the proposed method, and a comparison with existing methods shows that it outperforms them. Furthermore, pre-trained deep Convolutional Neural Networks are employed and optimized to automatically extract visual features of the components extracted from PDF documents, and machine learning algorithms are used to further improve detection accuracy.
