Extracting Mathematical Components Directly from PDF Documents for Mathematical Expression Recognition and Retrieval

PDF has become a popular format for information storage and exchange. As more and more documents, especially scientific documents, become available in PDF format, extracting mathematical expressions from PDF documents has become an important problem in the field of mathematical expression recognition and retrieval. In this paper, we propose a method that extracts mathematical components directly from PDF documents rather than working indirectly on images converted from the PDF files. Compared with traditional image-based methods, the proposed method makes full use of the internal information of PDF documents, such as font size, baseline, and glyph bounding box, to extract mathematical characters and their geometric information. The experimental results show that the method meets the needs of subsequent processing of mathematical expressions, such as formula structural analysis, reconstruction, and retrieval, and is more efficient than traditional image-based approaches.
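To illustrate how per-glyph information such as font size and baseline can support later structural analysis, the following is a minimal sketch and not the method of the paper. The `Glyph` record, the `relative_position` function, and the 0.15 tolerance are all hypothetical names and values chosen for illustration; it assumes PDF-style coordinates in which y grows upward and that the glyph attributes have already been read from the PDF.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Glyph:
    """Hypothetical record for one character extracted from a PDF page."""
    char: str
    font_size: float
    baseline: float                           # y of the baseline (y grows upward)
    bbox: Tuple[float, float, float, float]   # (x0, y0, x1, y1)


def relative_position(base: Glyph, other: Glyph, tol: float = 0.15) -> str:
    """Classify `other` relative to `base` from baseline shift and font size.

    The shift is normalised by the base font size so that the tolerance
    (an assumed value) does not depend on the absolute size of the text.
    """
    shift = (other.baseline - base.baseline) / base.font_size
    smaller = other.font_size < base.font_size
    if smaller and shift > tol:
        return "superscript"
    if smaller and shift < -tol:
        return "subscript"
    return "inline"


# Illustrative glyphs for an expression such as x^2 and x_i.
x = Glyph("x", 10.0, 100.0, (50.0, 100.0, 55.0, 105.0))
two = Glyph("2", 7.0, 104.0, (55.0, 104.0, 58.5, 109.0))
i = Glyph("i", 7.0, 98.0, (55.0, 98.0, 57.0, 103.0))
y = Glyph("y", 10.0, 100.0, (60.0, 98.0, 65.0, 105.0))

print(relative_position(x, two))  # superscript
print(relative_position(x, i))    # subscript
print(relative_position(x, y))    # inline
```

Because the attributes come straight from the document rather than from pixel analysis, a check of this kind needs no image rendering or character segmentation.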
