Email Surveillance Using Non-negative Matrix Factorization

In this study, we apply non-negative matrix factorization to extract and detect concepts or topics in electronic mail messages. For the publicly released Enron electronic mail collection, we encode sparse term-by-message matrices and use a low-rank non-negative matrix factorization algorithm, which preserves the natural non-negativity of the data and avoids the subtractive interactions among basis vectors and encodings that arise in techniques such as principal component analysis. We discuss topic detection and message clustering results in the context of published Enron business practices and activities, and we provide benchmarks addressing the computational complexity of our approach. The resulting basis vectors and matrix projections can be used to identify and monitor underlying semantic features (topics) and message clusters at a high level, without the need to read individual electronic mail messages.
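To make the idea concrete, here is a minimal sketch of factoring a small term-by-message matrix A into non-negative factors W (term-by-topic basis vectors) and H (topic-by-message encodings). It rests on several assumptions that are not taken from the study: the toy messages are invented, the matrix is dense with raw term counts rather than a weighted sparse encoding, and the factorization uses the standard multiplicative update rule for the Frobenius norm, which may differ from the algorithm actually used. The names `build_term_by_message` and `nmf` are hypothetical helpers for illustration only.

```python
import numpy as np

def build_term_by_message(messages):
    """Build a dense term-by-message count matrix (rows: terms, columns: messages).
    Assumption: whitespace tokenization and raw counts, chosen for brevity."""
    vocab = sorted({tok for msg in messages for tok in msg.lower().split()})
    index = {term: i for i, term in enumerate(vocab)}
    A = np.zeros((len(vocab), len(messages)))
    for j, msg in enumerate(messages):
        for tok in msg.lower().split():
            A[index[tok], j] += 1.0
    return A, vocab

def nmf(A, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate A by W @ H with W, H >= 0 using multiplicative updates
    that reduce the Frobenius norm of A - W @ H (an assumed algorithm choice)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so signs never flip.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Invented toy messages, not drawn from the Enron collection.
messages = [
    "gas pipeline capacity contract",
    "pipeline gas price contract",
    "football game tickets sunday",
    "sunday football game score",
]
A, vocab = build_term_by_message(messages)
W, H = nmf(A, k=2)

# Assign each message to the topic with the largest encoding weight.
clusters = H.argmax(axis=0)
# List the three highest-weighted terms of each basis vector (topic).
top_terms = [[vocab[i] for i in np.argsort(W[:, c])[::-1][:3]] for c in range(2)]
print(clusters, top_terms)
```

Because every entry of W and H stays non-negative, each message is described as an additive combination of topics, so the largest entries of a column of W can be read directly as the terms that characterize that topic.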
