Retrieval experiments using pseudo-desktop collections

Desktop search is an important part of personal information management (PIM). However, research in this area has been limited by the lack of shareable test collections, which makes cumulative progress difficult. In this paper, we define desktop search as a semi-structured document retrieval problem and introduce a methodology for automatically building a reusable collection (the pseudo-desktop) that has many of the same properties as a real desktop collection. We then present a comprehensive evaluation of retrieval methods for semi-structured document retrieval on several pseudo-desktop collections and the TREC Enterprise collection. Our results show that a probabilistic retrieval model using the mapping relation between a query term and a document field (PRM-S) performs best on collections with more structure, such as email, and that the query-likelihood language model performs better on other document types. We further analyze the observed differences using generated queries and suggest ways to improve PRM-S that make its performance gains more significant and consistent.
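The idea of mapping query terms to document fields can be illustrated with a small sketch. The abstract does not give the scoring formula, so the code below rests on stated assumptions: the mapping probability P(F | w) is estimated from collection-level term counts per field with a uniform field prior, and each per-field document model is smoothed with Jelinek-Mercer interpolation. The function names, the smoothing weight `lam`, and the toy documents are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def collection_field_counts(docs):
    """Aggregate term counts per field over all documents."""
    counts = {}
    for doc in docs:
        for field, terms in doc.items():
            counts.setdefault(field, Counter()).update(terms)
    return counts

def field_mapping_probs(term, field_counts):
    """Estimate P(field | term), assuming a uniform prior over fields."""
    likelihoods = {}
    for field, counts in field_counts.items():
        total = sum(counts.values())
        likelihoods[field] = counts[term] / total if total else 0.0
    norm = sum(likelihoods.values())
    if norm == 0.0:
        # Term unseen in every field: fall back to a uniform mapping.
        return {field: 1.0 / len(likelihoods) for field in likelihoods}
    return {field: p / norm for field, p in likelihoods.items()}

def field_weighted_score(query_terms, doc, field_counts, lam=0.1):
    """Log-score: sum over terms of log sum_F P(F | w) * P(w | doc field F)."""
    score = 0.0
    for term in query_terms:
        mapping = field_mapping_probs(term, field_counts)
        term_prob = 0.0
        for field, weight in mapping.items():
            coll = field_counts[field]
            coll_total = sum(coll.values())
            p_coll = coll[term] / coll_total if coll_total else 0.0
            terms = doc.get(field, [])
            p_doc = terms.count(term) / len(terms) if terms else 0.0
            # Jelinek-Mercer smoothing of the per-field document model.
            term_prob += weight * ((1.0 - lam) * p_doc + lam * p_coll)
        # Floor avoids log(0) for terms absent from the whole collection.
        score += math.log(max(term_prob, 1e-12))
    return score

# Toy email-like collection (hypothetical data).
docs = [
    {"from": ["alice"], "subject": ["budget", "meeting"],
     "body": ["please", "review", "the", "budget"]},
    {"from": ["bob"], "subject": ["lunch"],
     "body": ["alice", "said", "the", "meeting", "moved"]},
]
field_counts = collection_field_counts(docs)
query = ["alice", "meeting"]
ranking = sorted(
    range(len(docs)),
    key=lambda i: field_weighted_score(query, docs[i], field_counts),
    reverse=True,
)
```

In this toy collection the term "alice" maps mostly to the sender field, so the message sent by Alice ranks above the one that only mentions her in the body. A plain query-likelihood model would instead treat every field as one bag of words.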
