Paper Information - Automatic Detection of Pseudo-codes in Scholarly Documents Using Machine Learning

A significant number of scholarly articles in computer science and other disciplines contain algorithms that provide concise descriptions for solving a wide variety of computational problems. For example, Dijkstra’s algorithm describes how to find the shortest paths between two nodes in a graph. Automatically identifying and extracting these algorithms from scholarly digital documents would enable automatic algorithm indexing, searching, analysis, and discovery. An algorithm search engine, which identifies pseudo-codes in scholarly documents and makes them searchable, has been implemented as part of the CiteSeer suite. Here, we illustrate the limitations of the state-of-the-art rule-based pseudo-code detection approach and present a novel set of machine-learning-based techniques that extend the previous method.

C. Lee Giles | Prasenjit Mitra | Sumit Bhatia | Suppawong Tuarob
