Crawling and Preprocessing Mailing Lists At Scale for Dialog Analysis

This paper introduces the Webis Gmane Email Corpus 2019, the largest publicly available and fully preprocessed email corpus to date. We crawled more than 153 million emails from 14,699 mailing lists and segmented them into semantically consistent components using a new neural segmentation model. With 96% accuracy on 15 classes of email segments, our model achieves state-of-the-art performance while being more efficient to train than previous models. All data, code, and trained models are made freely available alongside the paper.
