Evaluating the usefulness of citation graph and document metadata in scientific document recommendation for neophytes

Digital libraries have become an essential tool for researchers in all scientific domains. With almost unlimited storage capacity, current digital libraries hold a tremendous number of documents. Although some efforts have been made to facilitate access to documents relevant to a specific information need, this task remains a real challenge for a new researcher who is discovering a research field. Indeed, neophytes do not necessarily use appropriate keywords to express their information need, and they are not necessarily qualified to correctly evaluate the relevance of the documents retrieved by the system. In this study, we hypothesize that the retrieval system of a digital library should take into consideration features other than content-based relevance. To test this hypothesis, we use machine learning methods and build new features from several kinds of metadata related to documents. More precisely, we propose three groups of features for machine learning: content-based scores, scores based on the citation graph, and scores based on metadata extracted from external resources. As acquiring such features is not a trivial task, we analyze their usefulness and their capacity to detect relevant documents. Our analysis concludes that these additional features improve the performance of the system for a neophyte user: with the new features, more documents suitable for neophytes appear in the results returned by the system than with content-based features alone.
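
The three feature groups named in the abstract can be illustrated with a minimal sketch. The abstract does not say which citation-graph scores or which metadata are used, so the choices here are assumptions made for illustration only: PageRank and citation count (in-degree) stand in for the graph scores, and `content_score` and `metadata_score` are hypothetical precomputed values on a toy four-paper graph. It is not the authors' implementation.

```python
from typing import Dict, List


def pagerank(cites: Dict[str, List[str]], damping: float = 0.85,
             iters: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank over a citation graph (paper -> papers it cites)."""
    nodes = sorted(set(cites) | {t for ts in cites.values() for t in ts})
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iters):
        # Rank held by papers that cite nothing is spread uniformly over all papers.
        dangling = sum(rank[p] for p in nodes if not cites.get(p))
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in nodes}
        for src, targets in cites.items():
            for t in targets:
                new[t] += damping * rank[src] / len(targets)
        rank = new
    return rank


def build_features(doc_ids: List[str],
                   content_score: Dict[str, float],
                   cites: Dict[str, List[str]],
                   metadata_score: Dict[str, float]) -> Dict[str, List[float]]:
    """Return one vector per document:
    [content score, PageRank, citation count, metadata score]."""
    pr = pagerank(cites)
    in_degree = {p: 0 for p in pr}
    for targets in cites.values():
        for t in targets:
            in_degree[t] += 1
    return {
        d: [content_score.get(d, 0.0), pr.get(d, 0.0),
            float(in_degree.get(d, 0)), metadata_score.get(d, 0.0)]
        for d in doc_ids
    }


# Toy data (all values hypothetical): A and B cite C, C cites D.
cites = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
content_score = {"A": 0.9, "B": 0.4, "C": 0.7, "D": 0.2}
metadata_score = {"C": 1.0}  # e.g. a flag derived from an external resource

features = build_features(["A", "B", "C", "D"], content_score, cites, metadata_score)
for doc, vec in features.items():
    print(doc, [round(v, 3) for v in vec])
```

Vectors of this kind could then be passed to any supervised classifier trained on relevance judgments; keeping each group in its own column makes it straightforward to compare a content-only model against one that also uses the graph and metadata columns.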
