Science Concierge: A Fast Content-Based Recommendation System for Scientific Publications

Finding relevant publications is important for scientists, who must cope with an exponentially increasing volume of scholarly material. Algorithms can help with this task, as they do for music, movie, and product recommendations. However, we know little about how well these algorithms perform on scholarly material. Here, we develop an algorithm, and an accompanying Python library, that implements a recommendation system based on the content of articles. Its design principles are to adapt to new content, provide near-real-time suggestions, and be open source. We tested the library on 15K posters from the Society for Neuroscience Conference 2015. Human-curated topics are used to cross-validate the algorithm's parameters and to produce a similarity metric that maximally correlates with human judgments. We show that our algorithm significantly outperforms suggestions based on keywords. The work presented here promises to make the exploration of scholarly material faster and more accurate.
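
The abstract does not spell out the pipeline, so the following is only a minimal sketch of what content-based recommendation can look like, under the assumption that documents are represented as TF-IDF vectors and compared by cosine similarity. The function names (`tfidf_vectors`, `cosine`, `recommend`) and the toy poster titles are hypothetical and are not the accompanying library's API.

```python
import math
import re
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalized sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF, so terms present in every document keep a small positive weight.
        vec = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalized sparse vectors (their dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def recommend(liked_index, vectors, top_k=2):
    """Rank all other documents by similarity to the liked one; return their indices."""
    query = vectors[liked_index]
    scored = [
        (cosine(query, vec), idx)
        for idx, vec in enumerate(vectors)
        if idx != liked_index
    ]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:top_k]]


# Toy corpus (invented titles, for illustration only).
posters = [
    "Dopamine neurons encode reward prediction error in the striatum",
    "Reward prediction error signals in dopamine neurons of the midbrain",
    "Retinal ganglion cells and visual processing in the mouse retina",
    "Visual cortex responses to natural images in the mouse",
]
vectors = tfidf_vectors(posters)
print(recommend(0, vectors, top_k=1))
```

Because the suggestions depend only on document text, a newly added poster can be recommended as soon as it is vectorized, which fits the stated goal of adapting to new content. A system at the scale of 15K posters would likely add dimensionality reduction and a nearest-neighbor index; both are omitted here.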
