Paper Information - LERI: Local Exploration for Rare-Category Identification

To identify the data examples of rare categories that form small compact clusters in large data sets, existing approaches mostly require enough labeled data examples as a training set to learn a classifier, and they assume that the rare-category clusters are spherical or nearly spherical. However, a large enough training set is usually difficult to obtain in practice, and rare categories in many real-world applications often form small compact clusters with arbitrary shapes. In this paper, we investigate how to identify all data examples of a rare category with an arbitrary shape based on only one seed (i.e., a labeled rare-category data example). Instead of finding a compact and spherical local region around the seed, we locally explore the data set from the seed by repeatedly searching for and visiting the $k$-nearest neighbors of each newly visited data example. The local exploration connects the data examples in the target rare category through the $k$-nearest-neighbor relationship, while suspected external data examples are filtered out if they are not close enough to any visited data example. Experiments on both synthetic and real-world data sets verify the effectiveness and efficiency of our approach.
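
The exploration idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' LERI algorithm: the abstract does not specify how "close enough" is decided, so the fixed threshold `max_dist`, the brute-force `knn` helper, the function name `explore_from_seed`, and the toy data are all assumptions made for illustration only.

```python
import math
from collections import deque


def knn(points, i, k):
    """Return indices of the k nearest neighbors of points[i] (brute force)."""
    others = (j for j in range(len(points)) if j != i)
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]


def explore_from_seed(points, seed, k, max_dist):
    """Breadth-first exploration over k-nearest-neighbor links from one seed.

    max_dist is a hypothetical filter threshold (an assumption of this
    sketch): a candidate is kept only if it lies within max_dist of at
    least one already visited example.
    """
    visited = {seed}
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        for j in knn(points, current, k):
            if j in visited:
                continue
            nearest = min(math.dist(points[j], points[v]) for v in visited)
            if nearest <= max_dist:  # otherwise treat j as external
                visited.add(j)
                queue.append(j)
    return sorted(visited)


# Toy data: an L-shaped (non-spherical) rare cluster, then a distant group.
points = [
    (0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2),       # indices 0-5
    (10, 10), (11, 10), (10, 11), (11, 11), (12, 12),     # indices 6-10
]

print(explore_from_seed(points, seed=0, k=6, max_dist=1.5))
# prints [0, 1, 2, 3, 4, 5]
```

With `k=6`, the neighbor lists of the rare examples reach into the distant group, and only the distance filter keeps those points out. Setting `max_dist` to infinity in this toy example lets the exploration spread to all eleven points, which is why some filtering step is needed alongside the neighbor search.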
