Bibliometric-enhanced information retrieval: a novel deep feature engineering approach for algorithm searching from full-text publications

Information retrieval systems that search scientific publications for relevant knowledge have advanced considerably in recent years. Although these techniques are powerful, there is still room for improvement in searching full-text publication datasets for metadata relating to algorithms: for instance, efficiency-related metrics such as precision, recall, f-measure and accuracy, and other useful metadata such as the datasets deployed and the algorithmic run-time complexity. In this study, we propose a novel deep learning-based feature engineering approach that improves search capabilities by mining algorithm-specific metadata from full-text scientific publications. Traditional term frequency-inverse document frequency (TF-IDF)-based approaches function like a ‘bag of words’ model and thus fail to capture either the text’s semantics or the word sequence. In this work, we build a semantically enriched synopsis of each full-text document by adding algorithm-specific deep metadata text lines, which enhances the search mechanism of algorithm search systems. These text lines are classified by a deep learning-based bi-directional long short-term memory (LSTM) model. The bi-directional LSTM model outperformed the support vector machine by 9.46%, with a 0.81 f1-score on a dataset of 37,000 algorithm-specific deep metadata text lines that had been tagged by four human experts. Lastly, we present a case study on 21,940 full-text publications downloaded from ACL (https://aclweb.org/) to show the effectiveness of deep learning-based advanced feature engineering search compared to the conventional TF-IDF-based (Lucene) search.
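
The point about TF-IDF behaving like a ‘bag of words’ model can be illustrated with a minimal sketch. The code below is not the paper's pipeline and not Lucene's scoring; it assumes whitespace tokenization, raw term counts, and one common smoothed IDF variant, and the helper names (`tfidf_vectors`, `cosine`) and example sentences are hypothetical. It shows that two sentences with opposite meanings but identical words receive identical TF-IDF vectors.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (raw tf * smoothed idf).

    Assumption: whitespace tokenization and idf = ln((1 + N) / (1 + df)) + 1.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical example sentences: the first two swap which model wins.
docs = [
    "the lstm outperformed the svm",
    "the svm outperformed the lstm",
    "we report precision and recall on the dataset",
]
vecs = tfidf_vectors(docs)

# Word order is discarded, so the first two vectors are indistinguishable.
same_words_similarity = cosine(vecs[0], vecs[1])
different_words_similarity = cosine(vecs[0], vecs[2])
print(vecs[0] == vecs[1], same_words_similarity, different_words_similarity)
```

Because the two vectors coincide, a purely TF-IDF-based ranker cannot tell these statements apart, whereas a sequence model such as a bi-directional LSTM reads the tokens in order and can in principle distinguish them.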
