Scalable discovery of hidden emails from large folders

The popularity of email has prompted researchers to look for ways to help users better organize the enormous amount of information stored in their email folders. One challenge that has not been studied extensively in text mining is the identification and reconstruction of hidden emails. A hidden email is an original email that is quoted in at least one email in a folder but is not itself present in that folder. It may have been deleted, intentionally or not, or it may never have been received. Discovering and reconstructing hidden emails is critical for many applications, including email classification, summarization, and forensics. This paper proposes a framework for reconstructing hidden emails from the embedded quotations found in messages further down the thread hierarchy. We evaluate the robustness and scalability of the framework on the public Enron email corpus. Our experiments show that hidden emails are widespread in that corpus and that our optimization techniques are effective for processing large email folders.
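
To make the definition of a hidden email concrete, the following is a minimal sketch, not the framework proposed in the paper. It assumes that quoted text is marked with a leading `>` on each line, that a folder is simply a list of message bodies, and that a normalized substring match is enough to decide whether a quoted fragment appears in another message. The function names `quoted_blocks`, `own_text`, and `hidden_fragments` are hypothetical and chosen only for illustration.

```python
def quoted_blocks(body: str) -> list[str]:
    """Collect maximal runs of '>'-prefixed lines as quoted fragments."""
    blocks, current = [], []
    for line in body.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(">"):
            # Drop the quote markers (including nested ones such as '>> ').
            current.append(stripped.lstrip("> ").strip())
        elif current:
            blocks.append("\n".join(current))
            current = []
    if current:
        blocks.append("\n".join(current))
    return blocks


def own_text(body: str) -> str:
    """Return the unquoted part of a message (assumption: no '>' prefix)."""
    return "\n".join(
        line for line in body.splitlines()
        if not line.lstrip().startswith(">")
    )


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so that matching tolerates rewrapping."""
    return " ".join(text.split()).lower()


def hidden_fragments(folder: list[str]) -> list[str]:
    """Return quoted fragments that no message in the folder contains as its own text."""
    originals = [normalize(own_text(body)) for body in folder]
    hidden, seen = [], set()
    for body in folder:
        for fragment in quoted_blocks(body):
            key = normalize(fragment)
            if not key or key in seen:
                continue
            seen.add(key)
            # A fragment with no source message in the folder is a candidate
            # piece of a hidden email.
            if not any(key in original for original in originals):
                hidden.append(fragment)
    return hidden


folder = [
    "Let's meet at noon.",
    "> Let's meet at noon.\nSounds good.",
    "> Can you send the budget?\nAttached.",
]
print(hidden_fragments(folder))
```

In this toy folder the first quotation has a matching original, while the second does not, so only the second is reported as a candidate fragment of a hidden email. Reassembling several such fragments into one email, and doing so efficiently for large folders, is the harder problem the abstract refers to and is not attempted here.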
