Rare category detection is an open challenge for active learning, especially in the de novo case (no labeled examples), yet it is of significant practical importance for data mining, for example in detecting new financial transaction fraud patterns where legitimate transactions dominate. This paper develops a new method for detecting an instance of each minority class via an unsupervised local-density-differential sampling strategy. Essentially, a variable-scale nearest-neighbor process is used to optimize the probability of sampling tightly grouped minority classes, subject to a local smoothness assumption on the majority class. On both synthetic and real data sets, the method detects each minority class with only a fraction of the actively sampled points required by random sampling and by Pelleg's Interleave method, the prior best technique in the sparse literature on this topic.
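The local-density-differential idea can be illustrated with a minimal sketch. This is not the paper's algorithm, only one plausible reading of it under stated assumptions: local density is approximated by counting neighbors within a fixed radius, the radius is taken as the smallest k-th-nearest-neighbor distance in the data, and each point is scored by the largest drop in density between itself and any point in its neighborhood. The function name `density_differential_scores` and the parameters `k` and `scale` are hypothetical.

```python
import numpy as np

def density_differential_scores(X, k, scale=1.0):
    """Score points by how much denser their neighborhood is than that of nearby points.

    Illustrative assumptions (not taken from the paper):
      - k is a guess at the size of the minority class minus one,
      - the counting radius is the smallest k-th-nearest-neighbor distance,
      - scale widens the neighborhood over which densities are compared.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances (O(n^2) memory; fine for small data sets).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Column 0 of each sorted row is the point itself, so column k is the
    # distance to the k-th nearest neighbor.
    kth = np.sort(dist, axis=1)[:, k]
    radius = kth.min()
    # Local density: number of other points within the radius.
    counts = (dist <= radius).sum(axis=1) - 1
    scores = np.empty(len(X))
    for i in range(len(X)):
        nbrs = dist[i] <= scale * radius  # includes i itself, so scores are >= 0
        scores[i] = (counts[i] - counts[nbrs]).max()
    return scores
```

The point with the highest score would be the first candidate to show a labeling oracle. If it turns out to belong to the majority class, increasing `scale` and rescoring is one way to mimic a variable-scale search.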
[1] Andrew W. Moore, et al. Active Learning for Anomaly and Rare-Category Detection. NIPS, 2004.
[2] Catherine Blake, et al. UCI Repository of machine learning databases. 1998.
[3] Yishay Mansour, et al. Active Sampling for Multiple Output Identification. COLT, 2006.
[4] Stephen D. Bay, et al. Large Scale Detection of Irregularities in Accounting Data. Sixth International Conference on Data Mining (ICDM'06), 2006.
[5] John Langford, et al. Agnostic active learning. J. Comput. Syst. Sci., 2006.
[6] Sanjoy Dasgupta, et al. Coarse sample complexity bounds for active learning. NIPS, 2005.