Language Model Document Priors based on Citation and Co-citation Analysis

Citation, an integral component of research papers, implies a certain kind of relevance that is not well captured in current Information Retrieval (IR) research. In this paper, we explore incorporating the results of citation and co-citation analysis into the IR modeling process. Specifically, we go beyond the usual uniform document prior assumption in the language modeling framework by deriving document priors from papers' citation counts, citation-induced PageRank, and co-citation clusters. We test multiple ways of estimating these priors and conduct extensive experiments on the iSearch test collection. Our results do not show significant improvements from using these priors over a no-prior baseline, as measured by mainstream retrieval effectiveness metrics. We analyze the possible reasons and suggest further directions for using bibliometric document priors to enhance IR.
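
As an illustration of the general idea, the following minimal sketch shows how a non-uniform document prior can enter a query-likelihood ranking, where a document D is scored for a query Q by log P(D) + sum over query terms of log P(q|D). It is not the estimation procedure used in the paper: the function names, the additive smoothing of citation counts, the choice of Dirichlet smoothing, and the value of `mu` are all assumptions made for the example.

```python
import math
from collections import Counter


def citation_prior(citation_counts, smoothing=1.0):
    """Hypothetical prior: P(D) proportional to (citation count + smoothing).

    The additive smoothing keeps uncited documents from getting zero prior.
    """
    total = sum(c + smoothing for c in citation_counts.values())
    return {doc: (c + smoothing) / total for doc, c in citation_counts.items()}


def log_score(query_terms, doc_terms, coll_tf, coll_len, prior, mu=2000.0):
    """Return log P(D) + sum_q log P(q|D), with Dirichlet-smoothed P(q|D).

    Dirichlet smoothing and mu=2000 are assumptions for this sketch.
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    total = math.log(prior)
    for q in query_terms:
        p_coll = coll_tf.get(q, 0) / coll_len
        p = (tf[q] + mu * p_coll) / (doc_len + mu)
        if p == 0.0:
            # Term unseen in the whole collection: the document cannot match.
            return float("-inf")
        total += math.log(p)
    return total


# Toy data (invented): two documents with identical text, different citations.
docs = {
    "d1": ["citation", "analysis", "retrieval"],
    "d2": ["citation", "analysis", "retrieval"],
}
citations = {"d1": 9, "d2": 0}

coll_tf = Counter(t for terms in docs.values() for t in terms)
coll_len = sum(coll_tf.values())
query = ["citation", "retrieval"]

priors = citation_prior(citations)
uniform = {doc: 1.0 / len(docs) for doc in docs}

with_prior = {
    d: log_score(query, docs[d], coll_tf, coll_len, priors[d]) for d in docs
}
no_prior = {
    d: log_score(query, docs[d], coll_tf, coll_len, uniform[d]) for d in docs
}
```

With a uniform prior the two toy documents tie, since their text is identical; with the citation-based prior the more cited document ranks first. On a real collection the prior only shifts the ranking when it is strong enough relative to the query-likelihood term.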
