LERI: Local Exploration for Rare-Category Identification

To identify the data examples of rare categories that form small compact clusters in large data sets, most existing approaches require enough labeled data examples to train a classifier, and they assume that the rare-category clusters are spherical or nearly spherical. In practice, however, a sufficiently large training set is usually difficult to obtain, and rare categories in many real-world applications form small compact clusters with arbitrary shapes. In this paper, we investigate how to identify all data examples of a rare category with an arbitrary shape based on only one seed (i.e., a labeled rare-category data example). Instead of finding a compact and spherical local region around the seed, we locally explore the data set from the seed by repeatedly searching for and visiting the $k$-nearest neighbors of each newly visited data example. The local exploration connects the data examples in the target rare category through the $k$-nearest-neighbor relationship; meanwhile, suspected external data examples are filtered out if they are not close enough to any visited data example. Experiments on both synthetic and real-world data sets verify the effectiveness and efficiency of our approach.
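The local exploration described above can be illustrated with a minimal sketch. It is not the LERI algorithm itself: the abstract does not specify how "close enough" is decided, so the fixed radius `max_dist`, the brute-force neighbor search, and the function names `knn` and `local_exploration` are all assumptions made for illustration.

```python
import math
from collections import deque


def knn(points, i, k):
    """Return indices of the k nearest neighbors of points[i] (brute force, excluding i)."""
    others = (j for j in range(len(points)) if j != i)
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]


def local_exploration(points, seed, k, max_dist):
    """Grow a set of visited indices from `seed` by following k-nearest neighbors.

    A candidate neighbor is kept only if it lies within `max_dist` of at least
    one already visited example; otherwise it is treated as external and skipped.
    `max_dist` is a hypothetical stand-in for the paper's filtering criterion.
    """
    visited = {seed}
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        for j in knn(points, current, k):
            if j in visited:
                continue
            if any(math.dist(points[v], points[j]) <= max_dist for v in visited):
                visited.add(j)
                queue.append(j)
    return visited


if __name__ == "__main__":
    # An L-shaped (non-spherical) rare cluster, one nearby outlier, and distant points.
    points = [
        (0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (0.3, 0.0), (0.3, 0.1), (0.3, 0.2),
        (0.3, 0.6),
        (2.0, 2.0), (2.5, 2.0), (2.0, 2.5),
    ]
    print(sorted(local_exploration(points, seed=0, k=6, max_dist=0.15)))
```

Because each visited example contributes its own neighbors, the explored region can follow the bend of the L shape rather than being confined to a ball around the seed, while the distance check keeps the nearby outlier out even when it appears among the $k$-nearest neighbors.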

[1]  Mary P. Harper,et al.  Spatial Random Tree Grammars for Modeling Hierarchal Structure in Images with Regions of Arbitrary Shape , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Jing Li,et al.  Robust Local Community Detection: On Free Rider Effect and Its Elimination , 2015, Proc. VLDB Endow..

[3]  Mikhail Belkin,et al.  Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples , 2006, J. Mach. Learn. Res..

[4]  Yunjun Gao,et al.  Rare category exploration , 2014, Expert Syst. Appl..

[5]  Jingrui He,et al.  Rare Category Characterization , 2010, 2010 IEEE International Conference on Data Mining.

[6]  Vipin Kumar,et al.  Evaluating boosting algorithms to classify rare classes: comparison and improvements , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[7]  Gary King,et al.  Logistic Regression in Rare Events Data , 2001, Political Analysis.

[8]  Cheng Wu,et al.  Semi-Supervised and Unsupervised Extreme Learning Machines , 2014, IEEE Transactions on Cybernetics.

[9]  Tao Xiang,et al.  Finding Rare Classes: Active Learning with Generative and Discriminative Models , 2013, IEEE Transactions on Knowledge and Data Engineering.

[10]  Tao Xiang,et al.  Active Rare Class Discovery and Classification Using Dirichlet Processes , 2014, International Journal of Computer Vision.

[11]  Vipin Kumar,et al.  Predicting Rare Classes: Comparing Two-Phase Rule Induction to Cost-Sensitive Boosting , 2002, PKDD.

[12]  Xiaokang Yu,et al.  Semisupervised Prior Free Rare Category Detection With Mixed Criteria , 2018, IEEE Transactions on Cybernetics.

[13]  Luming Zhang,et al.  Rare category exploration via wavelet analysis: Theory and applications , 2016, Expert Syst. Appl..

[14]  Andrew W. Moore,et al.  Active Learning for Anomaly and Rare-Category Detection , 2004, NIPS.

[15]  Hui Xiong,et al.  COG: local decomposition for rare class analysis , 2010, Data Mining and Knowledge Discovery.

[16]  Maher Maalouf,et al.  Weighted logistic regression for large-scale imbalanced and rare events data , 2014, Knowl. Based Syst..

[17]  Jingrui He,et al.  Learning Complex Rare Categories with Dual Heterogeneity , 2015, SDM.

[18]  Luis Baumela,et al.  Multi-class boosting with asymmetric binary weak-learners , 2014, Pattern Recognit..

[19]  Vipin Kumar,et al.  Mining needle in a haystack: classifying rare classes via two-phase rule induction , 2001, SIGMOD '01.

[20]  Xin Yao,et al.  MWMOTE--Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning , 2014 .

[21]  Vipin Kumar,et al.  Predicting rare classes: can boosting make any weak learner strong? , 2002, KDD.

[22]  Christoph F. Eick,et al.  GAC-GEO: a generic agglomerative clustering framework for geo-referenced datasets , 2011, Knowledge and Information Systems.

[23]  Joydeep Ghosh,et al.  Ensembles of $({\alpha})$-Trees for Imbalanced Classification Problems , 2014, IEEE Transactions on Knowledge and Data Engineering.

[24]  Marcello Pelillo,et al.  Graph-based quadratic optimization: A fast evolutionary approach , 2011, Comput. Vis. Image Underst..

[25]  Maher Maalouf,et al.  Computational Statistics and Data Analysis Robust Weighted Kernel Logistic Regression in Imbalanced and Rare Events Data , 2022 .

[26]  Gary King,et al.  Explaining Rare Events in International Relations , 2001, International Organization.

[27]  Thomas F. Coleman,et al.  RankRC: Large-Scale Nonlinear Rare Class Ranking , 2015, IEEE Transactions on Knowledge and Data Engineering.

[28]  Jingrui He,et al.  Nearest-Neighbor-Based Active Learning for Rare Category Detection , 2007, NIPS.

[29]  Zhi-Hua Zhou,et al.  A New Analysis of Co-Training , 2010, ICML.

[30]  Jingrui He,et al.  Graph-Based Rare Category Detection , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[31]  Rui Xu,et al.  Clustering Algorithms in Biomedical Research: A Review , 2010, IEEE Reviews in Biomedical Engineering.

[32]  Marcello Pelillo,et al.  Dominant Sets and Pairwise Clustering , 2007 .

[33]  Bernhard Schölkopf,et al.  Learning with Local and Global Consistency , 2003, NIPS.

[34]  Xiaoli Li,et al.  Learning to Classify Texts Using Positive and Unlabeled Data , 2003, IJCAI.

[35]  Hao Huang,et al.  CLOVER: a faster prior-free approach to rare-category detection , 2012, Knowledge and Information Systems.

[36]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.