Multi-faceted information retrieval system for large scale email archives

We present a system for searching and analyzing large-scale email archives. The system is built around four facets: a content-based search engine, a statistical topic model, automatically inferred social networks, and time-series analysis. These facets correspond to the types of information available in email data. The facets can be chained or combined flexibly: the results of one facet may be used as input to another, so a small set of facets supports a large number of distinct analyses. From an information retrieval point of view, the system supports exploration, approximate textual search, and data visualization. We present experimental results on a large real-world email corpus.
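
The idea of chaining facets can be illustrated with a minimal sketch. It is not the system's implementation: the `Email` record, the three facet functions, and the toy archive are all hypothetical, and substring matching stands in for the content-based search engine. The topic-model facet is omitted for brevity.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Email:
    # Hypothetical minimal record; real email data carries far more fields.
    sender: str
    recipient: str
    month: str  # e.g. "2001-05"
    body: str


def search(emails, term):
    """Content facet (stand-in): keep emails whose body contains the term."""
    needle = term.lower()
    return [e for e in emails if needle in e.body.lower()]


def social_network(emails):
    """Social facet: count messages per (sender, recipient) edge."""
    return Counter((e.sender, e.recipient) for e in emails)


def time_series(emails):
    """Time facet: count messages per month."""
    return Counter(e.month for e in emails)


# Toy archive, invented for illustration only.
ARCHIVE = [
    Email("alice", "bob", "2001-05", "Budget meeting moved to Friday"),
    Email("bob", "alice", "2001-05", "Re: budget figures attached"),
    Email("carol", "bob", "2001-06", "Lunch on Thursday?"),
    Email("alice", "carol", "2001-06", "Final budget approved"),
]

# Chain the facets: the search result feeds both of the other facets.
hits = search(ARCHIVE, "budget")
edges = social_network(hits)
volume = time_series(hits)
```

Because every facet takes and returns plain collections of emails or counts over them, any facet's output can be filtered further or passed to another facet, which is the property the abstract describes.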
