Performance Comparison of Ad-Hoc Retrieval Models over Full-Text vs. Titles of Documents

While there are many studies on information retrieval models using full-text, there are presently no studies comparing full-text retrieval with retrieval over only the titles of documents. On the one hand, the full-text of documents such as scientific papers is not always available due to, e.g., the copyright policies of academic publishers. On the other hand, a search based on titles alone has strong limitations: titles are short and therefore may not contain enough information to yield satisfactory search results. In this paper, we compare different retrieval models regarding their search performance on the full-text vs. only the titles of documents. We use different datasets, including three digital library datasets: EconBiz, IREON, and PubMed. The results show that it is possible to build effective title-based retrieval models whose results are competitive with full-text retrieval. The average evaluation results of the best title-based retrieval models are only 3% lower than those of the best full-text-based retrieval models.
