Categorizing Emails Using Machine Learning with Textual Features

We developed an application that automates assigning emails received in a generic request inbox to one of fourteen predefined topic categories. To build this application, we compared the performance of several classifiers in predicting the topic category, using a dataset of 8,841 emails extracted from this inbox over three years. The algorithms ranged from linear classifiers operating on n-gram features to deep learning techniques such as CNNs and LSTMs. For our objective, the best-performing individual model was a logistic regression classifier over TF-IDF-weighted n-grams, achieving 90.9% accuracy. The traditional models outperformed the deep learning models on this dataset, likely in part because of the small dataset size, and also because this particular classification task may not benefit from the ordered token-sequence representations that deep learning models provide. Ultimately, we selected a bagged voting model that combines the predictive power of the top eight models, achieving 92.7% accuracy and surpassing every individual model.
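The core pipeline described above (TF-IDF n-gram features feeding a logistic regression classifier, with the top models combined by voting) can be sketched in scikit-learn. This is a minimal illustration, not the paper's implementation: the category names, example emails, and choice of a naive Bayes model as the second ensemble member are all invented for demonstration.

```python
# Hypothetical sketch of the paper's approach: TF-IDF n-gram features into
# logistic regression, plus a soft-voting ensemble over multiple models.
# All data and category labels below are toy stand-ins, not the real corpus.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import VotingClassifier

# Tiny training set standing in for the 8,841-email inbox dataset.
emails = [
    "please reset my password",
    "cannot log in to my account",
    "invoice attached for last month",
    "question about my billing statement",
]
labels = ["it_support", "it_support", "billing", "billing"]

# Each member model gets its own unigram+bigram TF-IDF vectorizer.
logreg = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                       LogisticRegression(max_iter=1000))
nb = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())

# Soft voting averages the per-class probabilities of the member models,
# analogous in spirit to the paper's top-eight voting ensemble.
ensemble = VotingClassifier(estimators=[("logreg", logreg), ("nb", nb)],
                            voting="soft")
ensemble.fit(emails, labels)

print(ensemble.predict(["forgot my password again"])[0])
```

In practice the ensemble would be fit on the full labeled corpus and evaluated with held-out accuracy, as in the reported 90.9% (single model) and 92.7% (ensemble) figures.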
