AI Cognition in Searching for Relevant Knowledge from Scholarly Big Data, Using a Multi-layer Perceptron and Recurrent Convolutional Neural Network Model

Although information retrieval systems have improved tremendously over the years at finding relevant scientific literature, human cognition is still required to search for specific document elements in full-text publications. For instance, pseudocodes for algorithms published in scientific papers cannot be correctly matched against user queries, so the process requires human involvement. AlgorithmSeer, a state-of-the-art technique, claims to replace humans in this task, but one limitation of such an algorithm search engine is that its metadata is simply a textual description of each pseudocode, without any algorithm-specific information. The search is therefore performed merely by matching the user query to the textual metadata and ranking the results with conventional textual similarity techniques. The ability to automatically identify algorithm-specific metadata such as precision, recall, or F-measure would be useful when searching for algorithms. In this article, we propose a set of algorithms to extract further information about the performance of each algorithm. Specifically, sentences in an article that convey information about the efficiency of the corresponding algorithm are identified and extracted using a recurrent convolutional neural network (RCNN). Furthermore, we propose improving the pseudocode detection task by using a multi-layer perceptron (MLP) classifier trained with 15 features, which improves on the classification performance of the state-of-the-art pseudocode detection methods used in AlgorithmSeer by 27%. Finally, we show the advantages of the AI-enabled search engine (based on the RCNN and MLP models) over conventional text-retrieval models.
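To make the baseline concrete, the following is a minimal sketch of ranking by conventional textual similarity: a query is matched against textual descriptions of pseudocodes using TF-IDF weights and cosine similarity. It is an illustration only, not the AlgorithmSeer implementation; the function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank`), the smoothed IDF formula, and the sample descriptions are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for docs; also return the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # document frequency: count each term once per doc
    # Smoothed IDF (an assumed variant; other formulations exist).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, descriptions):
    """Return (score, index) pairs sorted by decreasing similarity to query."""
    vectors, idf = tfidf_vectors(descriptions)
    q_tf = Counter(tokenize(query))
    # Query terms unseen in the collection carry no weight.
    q_vec = {t: q_tf[t] * idf[t] for t in q_tf if t in idf}
    scores = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


if __name__ == "__main__":
    # Hypothetical textual metadata for three pseudocodes.
    descriptions = [
        "greedy string tiling algorithm for sequence comparison",
        "expectation maximization algorithm for clustering transactions",
        "burrows wheeler transform for tandem repeats",
    ]
    for score, idx in rank("string tiling", descriptions):
        print(f"{score:.3f}  {descriptions[idx]}")
```

A ranker of this kind can only reward word overlap with the description. A query about performance (for example, one mentioning precision or recall) scores zero unless those words happen to appear in the textual metadata, which is the gap that algorithm-specific metadata is meant to close.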
