Deep Learning-based Extraction of Algorithmic Metadata in Full-Text Scholarly Documents

Abstract: Advances in search engines for traditional text documents have enabled effective, resource-efficient retrieval of massive amounts of textual information. However, such conventional search methodologies often suffer from poor retrieval accuracy, especially when documents exhibit unique properties that call for specialized and deeper semantic extraction. Recently, AlgorithmSeer, a search engine for algorithms, was proposed; it extracts pseudo-codes and shallow textual metadata from scientific publications and treats them as traditional documents so that conventional search engine methodology can be applied. However, such a system cannot serve user queries that seek algorithm-specific information, such as the datasets on which algorithms operate, the performance of algorithms, and their runtime complexity. In this paper, we present a set of enhancements to the previously proposed algorithm search engine. Specifically, we propose methods to automatically identify and extract algorithmic pseudo-codes and the sentences that convey related algorithmic metadata using machine-learning techniques. In an experiment with over 93,000 text lines, we introduce 60 novel features, comprising content-based, font-style-based, and structure-based feature groups, to extract algorithmic pseudo-codes. Our proposed pseudo-code extraction method achieves a 93.32% F1-score, outperforming state-of-the-art techniques by 28%. Additionally, we propose a method to extract algorithm-related sentences using deep neural networks and achieve an accuracy of 78.5%, outperforming a rule-based model and a support vector machine model by 28% and 16%, respectively.
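To make the idea of line-level features more concrete, the following is a minimal sketch of how a few content-based and structure-based features might be computed for a single text line before it is passed to a classifier. The abstract does not enumerate the 60 features, so the keyword list, the function name `line_features`, and every feature shown here are hypothetical illustrations, not the features used in the paper. Font-style features are omitted because they would require PDF layout information.

```python
import re

# Hypothetical keyword list for illustration only; the paper's actual
# features are not listed in the abstract.
PSEUDOCODE_KEYWORDS = {
    "for", "while", "if", "else", "then", "do", "end",
    "return", "repeat", "until", "input", "output",
}


def line_features(line: str) -> dict:
    """Compute a few illustrative features for one text line."""
    stripped = line.strip()
    tokens = re.findall(r"[A-Za-z_]+", stripped.lower())
    n_chars = max(len(stripped), 1)    # avoid division by zero on empty lines
    n_tokens = max(len(tokens), 1)
    return {
        # Content-based (assumed): share of control-flow style keywords.
        "keyword_ratio": sum(t in PSEUDOCODE_KEYWORDS for t in tokens) / n_tokens,
        # Content-based (assumed): share of punctuation and operator characters.
        "symbol_ratio": sum(
            (not c.isalnum()) and (not c.isspace()) for c in stripped
        ) / n_chars,
        # Content-based (assumed): presence of an assignment-like operator.
        "has_assignment": bool(re.search(r"<-|←|:=|=", stripped)),
        # Structure-based (assumed): line starts with a step number such as "3:".
        "starts_with_number": bool(re.match(r"\d+[:.)]?\s", stripped)),
        # Structure-based (assumed): number of leading spaces.
        "indent": len(line) - len(line.lstrip(" ")),
    }


if __name__ == "__main__":
    print(line_features("3: for i <- 1 to n do"))
    print(line_features("The algorithm converges quickly."))
```

Feature dictionaries of this kind could be vectorized and fed to any standard classifier; which classifier and which features work best is an empirical question that the paper's experiments address.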
