Detection of rare items with TARGET

In an information-based economy, the need to detect a small number of relevant and useful items in a large database arises very often. Standard classifiers such as decision trees and neural networks are often used directly as detection algorithms. We argue that this approach is not optimal, because such classifiers are almost always built to optimize a criterion that is suitable for classification but not for detection. For the detection of rare items, the misclassification rate and other closely associated criteria are largely irrelevant; what matters is whether the algorithm can rank the few useful items ahead of the rest, which is better measured by the area under the ROC curve or by the average precision (AP). We use a genetic algorithm to build decision trees that optimize the AP directly, and we compare the performance of our algorithm with that of a number of standard tree-based classifiers on both simulated and real data sets.
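To make the contrast between the two criteria concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common definition of AP as the mean of the precision values at the ranks where relevant items appear; the function names, the 0.5 threshold, and the toy data are all hypothetical. Two scorers have the same misclassification rate yet rank the rare items very differently.

```python
def average_precision(scores, labels):
    """AP of the ranking obtained by sorting items by descending score.

    Assumed definition: mean of precision-at-rank over the ranks at
    which a relevant item (label 1) appears.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


def misclassification_rate(scores, labels, threshold=0.5):
    """Fraction of items whose thresholded prediction disagrees with the label."""
    wrong = sum((s >= threshold) != (y == 1) for s, y in zip(scores, labels))
    return wrong / len(labels)


# Toy data (hypothetical): 10 items, of which only the first 2 are relevant.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Scorer A ranks both relevant items first; scorer B ranks them last.
# Neither scorer pushes any item above the 0.5 threshold.
scores_a = [0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.08, 0.05, 0.02]
scores_b = [0.05, 0.02, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.08]

err_a = misclassification_rate(scores_a, labels)
err_b = misclassification_rate(scores_b, labels)
ap_a = average_precision(scores_a, labels)
ap_b = average_precision(scores_b, labels)

print(f"A: error={err_a:.2f}  AP={ap_a:.3f}")
print(f"B: error={err_b:.2f}  AP={ap_b:.3f}")
```

Both scorers predict "not relevant" for every item, so their misclassification rates are identical, while their AP values differ sharply. This is the sense in which the misclassification rate says little about how well the rare items are ranked.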