Hierarchical Label Propagation and Discovery for Machine Generated Email

Machine-generated documents such as email or dynamic web pages are single instantiations of a pre-defined structural template. As such, they can be viewed as a hierarchy of template-level and document-specific content. This hierarchical template representation has several important advantages for document clustering and classification. First, templates capture the topics common to their documents while filtering out potentially noisy variation such as personal information. Second, template representations scale far better than document representations, since a single template captures numerous documents. Finally, since templates group together structurally similar documents, they can propagate properties among all the documents that match the template. In this paper, we use these advantages for document classification by formulating an efficient and effective hierarchical label propagation and discovery algorithm. Labels are propagated first over a template graph (constructed from either term-based or topic-based similarities) and then to the matching documents. We evaluate the proposed algorithm on a large donated email corpus and show that the resulting template graph is significantly more compact than the corresponding document graph, and that hierarchical label propagation is both efficient and effective in increasing the coverage of the baseline document classification algorithm. We demonstrate that template label propagation achieves more than 91% precision and 93% recall, while increasing label coverage by more than 11%.
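
The two-stage idea described in the abstract (propagate labels over a template graph, then push them down to matching documents) can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a generic clamped, weighted-average label propagation, and every name here (`propagate_template_labels`, `label_documents`, the `threshold` parameter, the toy template ids and labels) is hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def propagate_template_labels(edges, seeds, iterations=20):
    """Spread seed labels over an undirected, weighted template graph.

    edges: {(template_a, template_b): similarity_weight}  (assumed input shape)
    seeds: {template: label} for templates that already have a label
    Returns {template: {label: score}}.
    """
    neighbors = defaultdict(dict)
    for (a, b), weight in edges.items():
        neighbors[a][b] = weight
        neighbors[b][a] = weight

    scores = {t: {label: 1.0} for t, label in seeds.items()}
    for _ in range(iterations):
        updated = {}
        for template, adjacent in neighbors.items():
            if template in seeds:
                continue  # seed templates are clamped below
            total = sum(adjacent.values())
            acc = defaultdict(float)
            for other, weight in adjacent.items():
                # Weighted average of the neighbors' current label scores.
                for label, score in scores.get(other, {}).items():
                    acc[label] += weight * score / total
            if acc:
                updated[template] = dict(acc)
        for template, label in seeds.items():
            updated[template] = {label: 1.0}  # clamp seeds to their label
        scores = updated
    return scores


def label_documents(doc_to_template, scores, threshold=0.5):
    """Second stage: each document inherits its template's top label."""
    labels = {}
    for doc, template in doc_to_template.items():
        distribution = scores.get(template)
        if not distribution:
            continue  # template received no label, so the document stays unlabeled
        label, score = max(distribution.items(), key=lambda kv: kv[1])
        if score >= threshold:
            labels[doc] = label
    return labels


if __name__ == "__main__":
    # Toy data: four templates, two of them seeded with labels.
    toy_edges = {("t1", "t2"): 0.9, ("t2", "t3"): 0.1, ("t3", "t4"): 0.8}
    toy_seeds = {"t1": "travel", "t4": "finance"}
    toy_scores = propagate_template_labels(toy_edges, toy_seeds)
    print(label_documents({"d1": "t2", "d2": "t3"}, toy_scores))
```

Because propagation runs over templates rather than individual documents, the graph in this sketch has one node per template regardless of how many documents match it, which mirrors the compactness argument made in the abstract.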
