Enhancing academic literature review through relevance recommendation: Using bibliometric and text-based features for classification

The growing number of scientific publications and the availability of information in online repositories enable researchers to discover, analyze, and maintain an up-to-date, state-of-the-art bibliography. However, few works explore this scenario to support researchers in the literature review step. Literature reviewing is a fundamental part of scientific writing, in which publications are evaluated and selected by relevance. Different approaches to relevance are possible: either a more qualitative (semantic) approach based on text-based techniques, or a more quantitative (numerical) approach that uses article metadata, such as bibliometric measures. Bibliometrics provide direct evidence of relevance and could serve as good attributes for automatic classification. Our insight is that even if a bibliometric-based approach cannot outperform text-based approaches, a hybrid model using both could benefit from it, enhancing classification performance (in terms of accuracy, precision, and recall). In this paper we present a novel approach that uses Machine Learning (ML), namely the ID3 algorithm, to build a classification model that learns from specialist-annotated data and recommends relevant papers for a specific research topic. Experiments showed good learning performance with the hybrid approach, which increased testing performance by 12% and achieved 89.05% accuracy when classifying a paper as relevant.
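To make the classification idea concrete, the following is a minimal sketch of the information-gain criterion that ID3 uses to choose which attribute to split on, applied to a hybrid feature set. It is not the authors' implementation: the attribute names (`citations`, `topic_match`), the toy records, and the relevance labels are hypothetical and serve only to illustrate how a bibliometric attribute and a text-based attribute can be compared on the same footing.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by partitioning the rows on one attribute."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """ID3 split choice: the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Hypothetical toy data: one bibliometric attribute and one text-based attribute.
papers = [
    {"citations": "high", "topic_match": "yes"},
    {"citations": "high", "topic_match": "yes"},
    {"citations": "low", "topic_match": "yes"},
    {"citations": "low", "topic_match": "no"},
]
relevant = [True, True, True, False]  # hypothetical specialist annotations

gains = {a: information_gain(papers, relevant, a) for a in ("citations", "topic_match")}
root = best_attribute(papers, relevant, ["citations", "topic_match"])
```

In this toy set both attributes carry some information about relevance, and the full ID3 procedure would recurse on each partition with the remaining attributes. A hybrid model simply gives the algorithm both kinds of attribute to choose from at every split.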
