Automatically tagging email by leveraging other users' folders

Most email applications devote a significant part of their real estate to organization mechanisms such as folders. Yet, we verified on the Yahoo! Mail service that 70% of email users have never defined a single folder. This implies that one of the most well known email features is underexploited. We propose here to revive the feature by providing a method for generating a lighter form of folders, or tags, benefiting even the most passive users. The method automatically associates, whenever possible, an appropriate semantic tag with a given email. This gives rise to an alternate mechanism for organizing and searching email. We advocate a novel modeling approach that exploits the overall population of users, thereby learning from the wisdom-of-crowds how to categorize messages. Given our massive user base, it is enough to learn from a minority of the users who label certain messages in order to label that kind of messages for the general population. We design a novel cascade classification approach, which copes with the severe scalability and accuracy constraints we are facing. Significant efficiency gains are achieved by working within a low dimensional latent space, and by using a novel hierarchical classifier. Precision level is controlled by separating the task into a two-phase classification process. We performed an extensive empirical study covering three different time periods, over 100 million messages, and thousands of candidate tags per message. The results are encouraging and compare favorably with alternative approaches. Our method successfully tags 72% of incoming email traffic. Performance-wise, the computational overhead, even on surge large traffic, is sufficiently low for our approach to be applicable in production on any large Web mail service.

[1]  Andrew Slater,et al.  The Learning Behind Gmail Priority Inbox , 2010 .

[2]  Sharma Chakravarthy,et al.  A Graph-Based Approach for Multi-folder Email Classification , 2010, 2010 IEEE International Conference on Data Mining.

[3]  William W. Cohen Learning Rules that Classify E-Mail , 1996 .

[4]  Jason Weston,et al.  Large scale image annotation: learning to rank with joint word-image embeddings , 2010, Machine Learning.

[5]  Yoshua Bengio,et al.  Hierarchical Probabilistic Neural Network Language Model , 2005, AISTATS.

[6]  Stan Matwin,et al.  Email classification with co-training , 2011, CASCON.

[7]  Harry Bruce,et al.  Better to organize personal information by folders or by tags?: The devil is in the details , 2008, ASIST.

[8]  Sharma Chakravarthy,et al.  eMailSift: eMail classification based on structure and content , 2005, Fifth IEEE International Conference on Data Mining (ICDM'05).

[9]  Hongjun Lu,et al.  A Comparative Study of Classification Based Personal E-mail Filtering , 2000, PAKDD.

[10]  Yiming Yang,et al.  The Enron Corpus: A New Dataset for Email Classi(cid:12)cation Research , 2004 .

[11]  Andrei Z. Broder,et al.  Anatomy of the long tail: ordinary people with extraordinary tastes , 2010, WSDM '10.

[12]  Robert Tibshirani,et al.  The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition , 2001, Springer Series in Statistics.

[13]  Jeffrey O. Kephart,et al.  Incremental Learning in SwiftFile , 2000, ICML.

[14]  Geoffrey E. Hinton,et al.  A Scalable Hierarchical Distributed Language Model , 2008, NIPS.

[15]  Yoshua Bengio,et al.  Quick Training of Probabilistic Neural Nets by Importance Sampling , 2003, AISTATS.

[16]  Andrew McCallum,et al.  Automatic Categorization of Email into Folders: Benchmark Experiments on Enron and SRI Corpora , 2005 .

[17]  Sergei Vassilvitskii,et al.  k-means++: the advantages of careful seeding , 2007, SODA '07.

[18]  Jose Miguel Puerta,et al.  Improving the performance of Naive Bayes multinomial in e-mail foldering by introducing distribution-based balance of datasets , 2011, Expert Syst. Appl..

[19]  D. Ruppert The Elements of Statistical Learning: Data Mining, Inference, and Prediction , 2004 .

[20]  George Forman,et al.  An Extensive Empirical Study of Feature Selection Metrics for Text Classification , 2003, J. Mach. Learn. Res..