A study of information retrieval on accumulative social descriptions using the generation features

This paper studies information retrieval (IR) on Accumulative Social Descriptions (ASDs). ASDs are Web texts, such as anchor texts, search logs, and social annotations, that are accumulated by many Web users to describe particular Web resources. Several studies have leveraged ASDs to improve search performance on both the internet and intranets. However, to the best of our knowledge, no prior study has considered the specific generation features of ASDs, which are the focus of this paper. Specifically, we consider the generation features from two perspectives: the generation processes and the generated distributions. Three probabilistic IR models are then derived from these features. The three models are first demonstrated on a toy dataset and then empirically evaluated on two real datasets: an internet dataset consisting of 90,295 Web pages, with 25,845,818 social annotations crawled from Del.icio.us and 31,320,005 anchor texts crawled through the Yahoo! API, and an intranet dataset consisting of 179,835 Web pages with 1,245,522 annotations dumped from Dogear, IBM's intranet tagging system. Extensive experimental results show that the proposed methods, which fully leverage the generation features of ASDs, significantly improve the performance of both internet and intranet search.
