Using Contextual Information to Improve Search in Email Archives

In this paper we address the task of finding topically relevant email messages in public discussion lists. We make two important observations. First, email messages are not isolated, but are part of a larger online environment. This context exists on several levels and can be incorporated into the retrieval model. We explore the thread, mailing list, and community content levels by expanding the original query with terms from these sources. We find that query models based on contextual information improve retrieval effectiveness. Second, email is a relatively informal genre, and therefore offers scope for incorporating techniques previously shown to be useful in searching user-generated content. Indeed, our experiments show that using query-independent features (email length, thread size, and text quality), implemented as priors, results in further improvements.
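The two ideas in the abstract, expanding the query with terms from a context source and applying a query-independent prior, can be illustrated with a minimal sketch. The abstract does not specify how the query models or priors are estimated, so everything below is an assumption made for illustration: the function names (`term_distribution`, `expand_query`, `score`), the interpolation weight `lam`, the top-`k` term selection, the Jelinek-Mercer style smoothing, and the way the prior enters the score are hypothetical and are not the authors' method.

```python
import math
from collections import Counter


def term_distribution(texts):
    """Maximum-likelihood term distribution over whitespace-tokenized texts."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: c / total for term, c in counts.items()}


def expand_query(query, context_texts, lam=0.5, k=5):
    """Interpolate the original query model with the top-k context terms.

    `context_texts` stands in for one context level (for example, the
    messages of a thread). `lam` and `k` are illustrative parameters.
    """
    original = term_distribution([query])
    context = term_distribution(context_texts)
    # Keep the k most probable context terms; ties are broken alphabetically.
    top = dict(sorted(context.items(), key=lambda kv: (-kv[1], kv[0]))[:k])
    norm = sum(top.values())
    if norm == 0:
        return original  # no context available: fall back to the original query
    expanded = {t: lam * p for t, p in original.items()}
    for t, p in top.items():
        expanded[t] = expanded.get(t, 0.0) + (1 - lam) * p / norm
    return expanded


def score(query_model, doc, collection_model, prior=1.0, smoothing=0.5):
    """Log prior plus query-model-weighted log-likelihood of the document.

    `prior` stands in for a query-independent feature such as email length,
    thread size, or text quality, already mapped to a positive number.
    """
    doc_model = term_distribution([doc])
    total = math.log(prior)
    for term, weight in query_model.items():
        p = ((1 - smoothing) * doc_model.get(term, 0.0)
             + smoothing * collection_model.get(term, 1e-9))
        total += weight * math.log(p)
    return total


docs = ["thread email archive search", "weekly lunch menu"]
collection = term_distribution(docs)
query_model = expand_query("email search",
                           ["thread email archive", "mailing list archive"],
                           k=2)
ranked = sorted(docs, key=lambda d: score(query_model, d, collection),
                reverse=True)
```

In this toy setup the expanded query model still sums to one, and raising `prior` for a document raises its score without touching the query-dependent part, which is the sense in which the feature is query-independent.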
