Translating math formula images to LaTeX sequences using deep neural networks with sequence-level training

In this paper we propose a deep neural network model with an encoder-decoder architecture that translates images of math formulas into their LaTeX markup sequences. The encoder is a convolutional neural network (CNN) that transforms images into a group of feature maps. To better capture the spatial relationships of math symbols, the feature maps are augmented with 2D positional encoding before being unfolded into a vector. The decoder is a stacked bidirectional long short-term memory (LSTM) model integrated with the soft attention mechanism, which works as a language model to translate the encoder output into a sequence of LaTeX tokens. The neural network is trained in two steps. The first step is token-level training using the Maximum-Likelihood Estimation (MLE) as the objective function. At completion of the token-level training, the sequence-level training objective function is employed to optimize the overall model based on the policy gradient algorithm from reinforcement learning. Our design also overcomes the exposure bias problem by closing the feedback loop in the decoder during sequence-level training, i.e., feeding in the predicted token instead of the ground truth token at every time step. The model is trained and evaluated on the IM2LATEX-100K dataset and shows state-of-the-art performance on both sequence-based and image-based evaluation metrics.

[1]  Zhi Tang,et al.  A Deep Learning-Based Formula Detection Method for PDF Documents , 2017, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR).

[2]  Yang Liu,et al.  Minimum Risk Training for Neural Machine Translation , 2015, ACL.

[3]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[4]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[5]  Geoffrey E. Hinton,et al.  Deep Learning , 2015, Nature.

[6]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[7]  Ronald J. Williams,et al.  Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 2004, Machine Learning.

[8]  Joan-Andreu Sánchez,et al.  An Image-Based Measure for Evaluation of Mathematical Expression Recognition , 2013, IbPRIA.

[9]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[10]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Omer Levy,et al.  Neural Word Embedding as Implicit Matrix Factorization , 2014, NIPS.

[12]  Xiang Bai,et al.  An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  Alexander M. Rush,et al.  What You Get Is What You See: A Visual Markup Decompiler , 2016, ArXiv.

[14]  Masayuki Okamoto,et al.  Structure analysis and recognition of mathematical expressions , 1995, Proceedings of 3rd International Conference on Document Analysis and Recognition.

[15]  Nicola Cancedda,et al.  Minimum Error Rate Training by Sampling the Translation Lattice , 2010, EMNLP.

[16]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[17]  Dit-Yan Yeung,et al.  Mathematical expression recognition: a survey , 2000, International Journal on Document Analysis and Recognition.

[18]  Robert H. Anderson Syntax-directed recognition of hand-printed two-dimensional mathematics , 1967, Symposium on Interactive Systems for Experimental Applied Mathematics.

[19]  Jyh-Charn Liu,et al.  Extraction of Math Expressions from PDF Documents Based on Unsupervised Modeling of Fonts , 2019, 2019 International Conference on Document Analysis and Recognition (ICDAR).

[20]  Jyh-Charn Liu,et al.  Bigram Label Regularization to Reduce Over-Segmentation on Inline Math Expression Detection , 2019, 2019 International Conference on Document Analysis and Recognition (ICDAR).

[21]  Lijun Wu,et al.  A Study of Reinforcement Learning for Neural Machine Translation , 2018, EMNLP.

[22]  Shiliang Zhang,et al.  Watch, attend and parse: An end-to-end neural network based approach to handwritten mathematical expression recognition , 2017, Pattern Recognit..

[23]  Alexander M. Rush,et al.  Image-to-Markup Generation with Coarse-to-Fine Attention , 2016, ICML.

[24]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[25]  Tat-Seng Chua,et al.  SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[27]  Robert H. Anderson Syntax-directed recognition of hand-printed two-dimensional mathematics , 1967, Symposium on Interactive Systems for Experimental Applied Mathematics.

[28]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Marc'Aurelio Ranzato,et al.  Sequence Level Training with Recurrent Neural Networks , 2015, ICLR.

[30]  Andrew Zisserman,et al.  Deep Structured Output Learning for Unconstrained Text Recognition , 2014, ICLR.

[31]  Yoshua Bengio,et al.  Learning long-term dependencies with gradient descent is difficult , 1994, IEEE Trans. Neural Networks.

[32]  Alex Graves,et al.  Sequence Transduction with Recurrent Neural Networks , 2012, ArXiv.

[33]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[34]  Bidyut B. Chaudhuri,et al.  Identification of embedded mathematical expressions in scanned documents , 2004, ICPR 2004.

[35]  Jun Du,et al.  A GRU-Based Encoder-Decoder Approach with Attention for Online Handwritten Mathematical Expression Recognition , 2017, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR).

[36]  Jian Wang,et al.  Image To Latex with DenseNet Encoder and Joint Attention , 2018, IIKI.

[37]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[38]  Yishay Mansour,et al.  Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.

[39]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[40]  Tao Wang,et al.  End-to-end text recognition with convolutional neural networks , 2012, Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012).

[41]  Christopher D. Manning,et al.  Effective Approaches to Attention-based Neural Machine Translation , 2015, EMNLP.

[42]  Masakazu Suzuki,et al.  INFTY: an integrated OCR system for mathematical documents , 2003, DocEng '03.

[43]  Wei Zhang,et al.  An Improved Approach Based on CNN-RNNs for Mathematical Expression Recognition , 2019, Proceedings of the 2019 4th International Conference on Multimedia Systems and Signal Processing - ICMSSP 2019.