Automatic Detection of Pseudocodes in Scholarly Documents Using Machine Learning

A significant number of scholarly articles in computer science and other disciplines contain algorithms that provide concise descriptions of how to solve a wide variety of computational problems. For example, Dijkstra's algorithm describes how to find the shortest path between two nodes in a graph. Automatic identification and extraction of these algorithms from scholarly digital documents would enable automatic algorithm indexing, searching, analysis, and discovery. An algorithm search engine, which identifies pseudocodes in scholarly documents and makes them searchable, has been implemented as part of the CiteSeerX suite. Here, we illustrate the limitations of the state-of-the-art rule-based pseudocode detection approach and present a novel set of machine-learning-based techniques that extend previous methods.
