Learning to Rank Scientific Documents from the Crowd

Finding related published articles is an important task in any science, but with the explosion of new work in the biomedical domain it has become especially challenging. Most existing methodologies use text similarity metrics to identify whether two articles are related. However, biomedical knowledge discovery is hypothesis-driven, so the most related articles may not be the ones with the highest text similarity. In this study, we first develop an innovative crowd-sourcing approach to build an expert-annotated document-ranking corpus. Using this corpus as the gold standard, we then evaluate text-similarity approaches to ranking the relatedness of articles. Finally, we develop and evaluate a new supervised model to automatically rank related scientific articles. Our results show that authors' rankings differ significantly from rankings produced by text-similarity-based models. When trained on a subset of the annotated corpus, the best supervised learning-to-rank model (SVM-Rank) significantly surpassed state-of-the-art baseline systems.
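The comparison between a text-similarity ranking and an author's ranking can be illustrated with a small sketch. It is not the study's implementation: the bag-of-words cosine similarity, the use of Kendall's tau as the agreement measure, the toy documents, and the names `cosine`, `rank_by_similarity`, and `kendall_tau` are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two texts using raw term counts (an assumed baseline)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def rank_by_similarity(query, candidates):
    """Order candidate ids by decreasing text similarity to the query text."""
    return sorted(candidates, key=lambda d: cosine(query, candidates[d]), reverse=True)


def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same items (no ties)."""
    pos_a = {d: i for i, d in enumerate(rank_a)}
    pos_b = {d: i for i, d in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        sign = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0


# Hypothetical toy data: one query article and three candidate articles.
query = "gene expression in tumor cells"
candidates = {
    "a": "gene expression tumor cells profiling",
    "b": "protein folding dynamics",
    "c": "tumor cells growth",
}
similarity_ranking = rank_by_similarity(query, candidates)
author_ranking = ["c", "a", "b"]  # hypothetical ranking supplied by the article's author
agreement = kendall_tau(similarity_ranking, author_ranking)
```

A tau of 1 means the two rankings agree on every pair and -1 means they disagree on every pair, so values well below 1 on an annotated corpus would indicate that text similarity alone does not reproduce the authors' judgments.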
