LAGO: A Computationally Efficient Approach for Statistical Detection

We study a general class of statistical detection problems in which the objective is to detect items belonging to a rare class in a very large database. We propose a computationally efficient method for this task. The method consists of two steps. In the first step, we estimate the density function of the rare class alone with an adaptive-bandwidth kernel density estimator. The adaptive choice of the bandwidth is inspired by the ancient Chinese board game known today as Go. In the second step, we adjust this density estimate locally, depending on the density of the background class nearby. We show that the amount of adjustment needed in the second step is approximately equal to the adaptive bandwidth from the first step, which yields additional computational savings. We name the resulting method LAGO, for “locally adjusted Go-kernel density estimator.” We then apply LAGO to a real drug discovery dataset and compare its performance with that of a number of popular existing methods.
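The two steps can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the kernel or the exact bandwidth rule, so the sketch assumes an isotropic Gaussian kernel and sets the bandwidth of each rare-class point to the average distance to its `k` nearest background points. The function name `lago_scores` and the parameters `k` and `alpha` are hypothetical. Following the abstract's statement that the adjustment is approximately equal to the adaptive bandwidth, the second step simply multiplies each kernel by its own bandwidth term.

```python
import numpy as np

def lago_scores(X_query, X_rare, X_background, k=5, alpha=1.0):
    """Score query points; higher means more likely to belong to the rare class.

    Assumptions (not taken from the abstract): Gaussian kernel, and a
    bandwidth equal to alpha times the mean distance from each rare-class
    point to its k nearest background points.
    """
    X_query = np.atleast_2d(np.asarray(X_query, dtype=float))
    X_rare = np.atleast_2d(np.asarray(X_rare, dtype=float))
    X_background = np.atleast_2d(np.asarray(X_background, dtype=float))
    d = X_rare.shape[1]

    # Step 1: adaptive bandwidth for each rare-class point.
    # Pairwise distances, shape (n_rare, n_background).
    dist = np.linalg.norm(X_rare[:, None, :] - X_background[None, :, :], axis=2)
    r = np.sort(dist, axis=1)[:, :k].mean(axis=1)
    r = np.maximum(r, 1e-12)  # guard against a zero bandwidth
    h = alpha * r

    # Gaussian kernel centred at each rare-class point, evaluated at each query.
    sq = ((X_query[:, None, :] - X_rare[None, :, :]) ** 2).sum(axis=2)
    kernels = np.exp(-0.5 * sq / h**2) / ((2.0 * np.pi) ** (d / 2) * h**d)

    # Step 2: local adjustment, taken here to be the bandwidth term r itself,
    # so no separate background density estimate is computed.
    return (kernels * r).mean(axis=1)
```

Reusing `r` in the second step is what makes the sketch cheap: the background class is touched only once, when the nearest-neighbour distances are computed. Query points can then be ranked by score, with the highest-scoring items examined first.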
