Automatic Detection of Pseudo-codes in Scholarly Documents Using Machine Learning

Many scholarly articles in computer science and other disciplines contain algorithms that give concise descriptions of how to solve a wide variety of computational problems. For example, Dijkstra’s algorithm describes how to find a shortest path between two nodes in a graph. Automatically identifying and extracting these algorithms from scholarly digital documents would enable automatic algorithm indexing, searching, analysis, and discovery. An algorithm search engine, which identifies pseudo-codes in scholarly documents and makes them searchable, has been implemented as part of the CiteSeer suite. Here, we illustrate the limitations of the state-of-the-art rule-based pseudo-code detection approach and present a novel set of machine-learning-based techniques that extend the previous method.
