An effective framework for characterizing rare categories

Rare categories become more and more abundant and their characterization has received little attention thus far. Fraudulent banking transactions, network intrusions, and rare diseases are examples of rare classes whose detection and characterization are of high value. However, accurate characterization is challenging due to high-skewness and nonseparability from majority classes, e.g., fraudulent transactions masquerade as legitimate ones. This paper proposes the RACH algorithm by exploring the compactness property of the rare categories. This algorithm is semi-supervised in nature since it uses both labeled and unlabeled data. It is based on an optimization framework which encloses the rare examples by a minimum-radius hyperball. The framework is then converted into a convex optimization problem, which is in turn effectively solved in its dual form by the projected subgradient method. RACH can be naturally kernelized. Experimental results validate the effectiveness of RACH.

[1]  Sridhar Ramaswamy,et al.  Efficient algorithms for mining outliers from large data sets , 2000, SIGMOD '00.

[2]  Takafumi Kanamori,et al.  Statistical outlier detection using direct density ratio estimation , 2011, Knowledge and Information Systems.

[3]  Sanjoy Dasgupta,et al.  Hierarchical sampling for active learning , 2008, ICML '08.

[4]  VARUN CHANDOLA,et al.  Anomaly detection: A survey , 2009, CSUR.

[5]  Nitesh V. Chawla,et al.  Editorial: special issue on learning from imbalanced data sets , 2004, SKDD.

[6]  Jingrui He,et al.  Nearest-Neighbor-Based Active Learning for Rare Category Detection , 2007, NIPS.

[7]  Hui Xiong,et al.  Local decomposition for rare class analysis , 2007, KDD '07.

[8]  Kanishka Bhaduri,et al.  Algorithms for speeding up distance-based outlier detection , 2011, KDD.

[9]  Haimonti Dutta,et al.  Distributed Top-K Outlier Detection from Astronomy Catalogs using the DEMAC System , 2007, SDM.

[10]  Weng-Keen Wong,et al.  Category detection using hierarchical mean shift , 2009, KDD.

[11]  Andrew W. Moore,et al.  Active Learning for Anomaly and Rare-Category Detection , 2004, NIPS.

[12]  S. Griffis EDITOR , 1997, Journal of Navigation.

[13]  Aidong Zhang,et al.  FindOut: Finding Outliers in Very Large Datasets , 2002, Knowledge and Information Systems.

[14]  Yoram Singer,et al.  Efficient projections onto the l1-ball for learning in high dimensions , 2008, ICML '08.

[15]  Charles X. Ling,et al.  Data Mining for Direct Marketing: Problems and Solutions , 1998, KDD.

[16]  Nathalie Japkowicz,et al.  Boosting Support Vector Machines for Imbalanced Data Sets , 2008, ISMIS.

[17]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[18]  Bernhard Schölkopf,et al.  Ranking on Data Manifolds , 2003, NIPS.

[19]  David A. Cieslak,et al.  Start Globally, Optimize Locally, Predict Globally: Improving Performance on Imbalanced Data , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[20]  Stephen P. Boyd,et al.  Convex Optimization , 2004, Algorithms and Theory of Computation Handbook.

[21]  Yang Wang,et al.  Boosting for Learning Multiple Classes with Imbalanced Class Distribution , 2006, Sixth International Conference on Data Mining (ICDM'06).

[22]  Bernhard Schölkopf,et al.  Estimating the Support of a High-Dimensional Distribution , 2001, Neural Computation.

[23]  Yizhou Sun,et al.  On community outliers and their efficient detection in information networks , 2010, KDD.

[24]  Zengyou He,et al.  An Optimization Model for Outlier Detection in Categorical Data , 2005, ICIC.

[25]  Sanjay Chawla,et al.  Density-preserving projections for large-scale local anomaly detection , 2012, Knowledge and Information Systems.

[26]  Christos Faloutsos,et al.  LOCI: fast outlier detection using the local correlation integral , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[27]  Thorsten Joachims,et al.  Transductive Inference for Text Classification using Support Vector Machines , 1999, ICML.

[28]  Philip S. Yu,et al.  Outlier detection for high dimensional data , 2001, SIGMOD '01.

[29]  Felix Naumann,et al.  Data fusion , 2009, CSUR.

[30]  Sushil Jajodia,et al.  Detecting Novel Network Intrusions Using Bayes Estimators , 2001, SDM.

[31]  Nitesh V. Chawla,et al.  SMOTEBoost: Improving Prediction of the Minority Class in Boosting , 2003, PKDD.

[32]  Arnold P. Boedihardjo,et al.  GLS-SOD: a generalized local statistical approach for spatial outlier detection , 2010, KDD '10.

[33]  Christos Faloutsos,et al.  Detecting Fraudulent Personalities in Networks of Online Auctioneers , 2006, PKDD.

[34]  Longin Jan Latecki,et al.  Improving SVM classification on imbalanced time series data sets with ghost points , 2011, Knowledge and Information Systems.

[35]  Thorsten Joachims,et al.  A support vector method for multivariate performance measures , 2005, ICML.

[36]  Yishay Mansour,et al.  Active Sampling for Multiple Output Identification , 2006, COLT.

[37]  Marius Kloft,et al.  Active and Semi-supervised Data Domain Description , 2009, ECML/PKDD.