Utilizing concept correlations for effective imbalanced data classification

Data imbalance is a common and challenging problem in data mining and machine learning, and it has attracted significant research effort. A data set is considered imbalanced when its data instances (samples) are far from uniformly distributed across the classes or categories, a situation that is very common in real-world data sets and that tends to bias classification results toward the majority classes. In this paper, a two-phase classification framework is proposed to make the classification of imbalanced data more accurate and stable. The framework is based on the correlations generated between concepts. The general idea is to identify negative data instances that have certain positive correlations with data instances in the target concept, and to use them to facilitate the classification task. Experiments comparing the framework with four existing approaches across four different kinds of feature representations show that it is effective for imbalanced data classification and robust to the choice of feature descriptor.
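The selection idea in the first phase can be illustrated with a minimal sketch. The abstract does not say how concept correlations are computed, so everything below is an assumption made for illustration: correlation is taken to be the phi coefficient (Pearson correlation of 0/1 concept indicators), and the names `phi`, `correlated_negatives`, `threshold`, and the toy annotations are hypothetical, not the authors' method or data.

```python
from math import sqrt


def phi(a, b):
    """Pearson correlation between two equal-length 0/1 indicator lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant indicator carries no correlation signal
    return cov / sqrt(var_a * var_b)


def correlated_negatives(annotations, target, threshold=0.0):
    """Indices of negative instances labeled with a concept that is
    positively correlated with the target concept.

    annotations: list of sets of concept names, one set per instance.
    threshold: assumed cut-off; the paper's criterion may differ.
    """
    concepts = set().union(*annotations) - {target}
    target_vec = [int(target in labels) for labels in annotations]
    related = {
        c for c in concepts
        if phi(target_vec, [int(c in labels) for labels in annotations]) > threshold
    }
    return [
        i for i, labels in enumerate(annotations)
        if target not in labels and labels & related
    ]


# Toy data (hypothetical): "boat" is the rare target concept.
annotations = [
    {"boat", "water"},
    {"boat", "water"},
    {"water"},
    {"road", "car"},
    {"road"},
    {"car"},
]
selected = correlated_negatives(annotations, "boat")
print(selected)
```

In this toy set, only the negative instance that shares the positively correlated concept "water" is selected. How such instances are then used in the second phase is not described in the abstract and is left out of the sketch.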
