PDF2LaTeX: A Deep Learning System to Convert Mathematical Documents from PDF to LaTeX

The mathematical content of scientific publications in PDF format cannot be easily analyzed by regular PDF parsers and OCR tools. In this paper, we propose a novel OCR system called PDF2LaTeX, which extracts math expressions and text from both PostScript and image-based PDF files and translates them into LaTeX markup. As a preprocessing step, PDF2LaTeX first renders each page of a PDF file as an image and then uses projection profile cutting (PPC) to analyze the page layout. The analysis of math expressions and text is based on a series of deep learning algorithms. First, the system uses a convolutional neural network (CNN) as a binary classifier to detect math image blocks based on visual features. Next, it uses a conditional random field (CRF) to detect math-text boundaries by incorporating semantic and contextual information. Finally, it uses two different models based on a CNN-LSTM neural network architecture to translate image blocks of math expressions and plain text into their LaTeX representations. For testing, we created a new dataset composed of 102 PDF pages collected from publications on arXiv.org and compared the performance of PDF2LaTeX with that of the state-of-the-art commercial software InftyReader. The experimental results showed that the proposed system achieved a higher recognition accuracy (81.1%), measured by the string edit distance between the predicted LaTeX and the ground truth.
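
The abstract does not say how the projection profile cutting step is implemented. The following is only a minimal sketch of the general technique, assuming a binarized page image stored as a list of rows (1 = ink, 0 = background). The function names `projection_profile`, `cut_runs`, and `segment`, and the gap thresholds, are hypothetical and are not taken from PDF2LaTeX. The page is first cut into horizontal bands wherever the row profile is empty, and each band is then cut into blocks wherever the column profile stays empty for long enough.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink pixel, 0 = background


def projection_profile(image: Image, axis: int) -> List[int]:
    """Count ink pixels per row (axis=0) or per column (axis=1)."""
    if axis == 0:
        return [sum(row) for row in image]
    return [sum(col) for col in zip(*image)]


def cut_runs(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges of non-empty positions.

    Two ink runs are merged unless they are separated by at least
    `min_gap` consecutive empty positions (assumed threshold).
    """
    runs: List[Tuple[int, int]] = []
    start = None  # start index of the run currently being built
    gap = 0       # number of consecutive empty positions seen so far
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The run ended just before the gap began.
                runs.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Close the last run, dropping any trailing short gap.
        runs.append((start, len(profile) - gap))
    return runs


def segment(image: Image, min_row_gap: int = 1,
            min_col_gap: int = 2) -> List[Tuple[int, int, int, int]]:
    """Cut a page into (top, bottom, left, right) blocks.

    Rows are cut first, then columns inside each horizontal band.
    """
    blocks = []
    for top, bottom in cut_runs(projection_profile(image, 0), min_row_gap):
        band = image[top:bottom]
        for left, right in cut_runs(projection_profile(band, 1), min_col_gap):
            blocks.append((top, bottom, left, right))
    return blocks


# Toy page: two blocks side by side on one line, one block on the next.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
]
blocks = segment(page)
```

On the toy page, `blocks` contains three bounding boxes: two for the first line and one for the second. In a real layout analyzer the gap thresholds would have to be tuned to the rendering resolution and to typical line and word spacing; the values used here are placeholders.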
