Dirichlet-Enhanced Spam Filtering based on Biased Samples

We study a setting motivated by the problem of filtering spam messages for many users. Each user receives messages according to an individual, unknown distribution, reflected only in that user's unlabeled inbox; the spam filter for a user is required to perform well with respect to this distribution. Labeled messages from publicly available sources can be utilized, but they are governed by a distinct distribution that does not adequately represent most inboxes. We devise a method that minimizes a loss function with respect to a user's personal distribution based on the available biased sample. In addition, a nonparametric hierarchical Bayesian model generalizes across users by learning a common prior that is imposed on new email accounts. Empirically, bias-corrected learning outperforms naive reliance on the assumption of independent and identically distributed data, and Dirichlet-enhanced generalization across users outperforms both a single ("one size fits all") filter and independent per-user filters.
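The bias-correction idea can be illustrated with importance weighting: training examples drawn from the biased source distribution are reweighted by the density ratio between the user's distribution and the source distribution, so that the weighted empirical loss approximates the loss under the user's distribution. The sketch below is a minimal toy example of this principle, not the paper's actual estimator; the one-dimensional Gaussian densities, the variable names, and the use of an exact (rather than estimated) density ratio are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: a labeled "public" pool drawn from a biased
# source distribution, and the user's unlabeled inbox drawn from a
# shifted distribution.
n = 500
X_src = rng.normal(loc=-1.0, scale=1.0, size=(n, 1))  # biased labeled pool
y_src = (X_src[:, 0] + rng.normal(0.0, 0.5, n) > 0).astype(float)

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Importance weights w(x) = p_user(x) / p_source(x). Here both densities
# are known Gaussians, so the ratio is exact; in practice it must be
# estimated from the unlabeled inbox, e.g. with a density-ratio model.
w = gaussian_pdf(X_src[:, 0], 1.0, 1.0) / gaussian_pdf(X_src[:, 0], -1.0, 1.0)

def fit_logreg(X, y, sample_weight, lr=0.1, epochs=2000):
    """Logistic regression by gradient descent on the
    importance-weighted negative log-likelihood."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append bias column
    theta = np.zeros(Xb.shape[1])
    sw = sample_weight / sample_weight.sum()   # normalize the weights
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Xb @ theta))  # sigmoid predictions
        grad = Xb.T @ (sw * (p - y))           # weighted log-loss gradient
        theta -= lr * grad
    return theta

theta_naive = fit_logreg(X_src, y_src, np.ones(n))  # i.i.d. assumption
theta_corr = fit_logreg(X_src, y_src, w)            # bias-corrected
```

The corrected fit emphasizes source examples that resemble the user's inbox, so its parameters generally differ from the naive fit trained as if the labeled sample were i.i.d. from the user's distribution.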
