AlgorithmSeer: A System for Extracting and Searching for Algorithms in Scholarly Big Data

Algorithms are usually published in scholarly articles, especially in the computational sciences and related disciplines. The ability to automatically find and extract these algorithms from the increasingly vast collection of scholarly digital documents would enable algorithm indexing, searching, discovery, and analysis. Recently, AlgorithmSeer, a search engine for algorithms, has been investigated as part of CiteSeerX with the intent of providing a large algorithm database. Currently, over 200,000 algorithms have been extracted from over 2 million scholarly documents. This paper proposes a novel set of scalable techniques used by AlgorithmSeer to identify and extract algorithm representations from a heterogeneous pool of scholarly documents. Specifically, hybrid machine learning approaches are proposed to discover algorithm representations. Then, techniques to extract textual metadata for each algorithm are discussed. Finally, a demonstration version of AlgorithmSeer, built on the Solr/Lucene open-source indexing and search system, is presented.
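As a rough illustration of what discovering an algorithm representation in document text can involve, the sketch below flags lines that look like algorithm captions. It is not the method used by AlgorithmSeer: the caption pattern, the name `CAPTION_RE`, and the function `find_algorithm_captions` are assumptions introduced here for illustration only, and a real system would likely combine such rules with learned classifiers.

```python
import re

# Hypothetical caption pattern (an assumption, not taken from the paper):
# matches lines such as "Algorithm 1: ...", "Alg. 2. ...", "Pseudocode 3 - ...".
CAPTION_RE = re.compile(
    r"^\s*(algorithm|alg\.|pseudo-?code|procedure)\s+(\d+)\s*[:.\-]?\s*(.*)$",
    re.IGNORECASE,
)


def find_algorithm_captions(lines):
    """Return (line_index, number, caption_text) for caption-like lines."""
    hits = []
    for i, line in enumerate(lines):
        m = CAPTION_RE.match(line)
        if m:
            hits.append((i, int(m.group(2)), m.group(3).strip()))
    return hits


if __name__ == "__main__":
    sample = [
        "We propose a method.",
        "Algorithm 1: Greedy set cover",
        "  Pseudocode 2. Merge step",
        "The algorithm runs fast.",
    ]
    for hit in find_algorithm_captions(sample):
        print(hit)
```

A rule of this kind only catches algorithms that carry an explicit caption; algorithms described in running text or in uncaptioned pseudo-code would need other signals.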
