An information granulation based data mining approach for classifying imbalanced data

Recently, the class imbalance problem has attracted much attention from researchers in the field of data mining. When learning from imbalanced data, in which most examples are labeled as one class and only a few belong to another, traditional data mining approaches predict the crucial minority instances poorly. Unfortunately, many real-world data sets, such as those from health examination, inspection, credit fraud detection, spam identification and text mining, face this situation. In this study, we present a novel model called the "Information Granulation Based Data Mining Approach" to tackle this problem. The proposed methodology, which imitates the human ability to process information, acquires knowledge from Information Granules rather than from numerical data. This method also introduces a Latent Semantic Indexing based feature extraction tool, built on Singular Value Decomposition, to dramatically reduce the data dimensions. In addition, several data sets from the UCI Machine Learning Repository are employed to demonstrate the effectiveness of our method. Experimental results show that our method can significantly improve the classification of imbalanced data.
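As a rough illustration of the Singular Value Decomposition step mentioned above, the following minimal sketch projects a data matrix onto its leading singular directions. It is not the authors' implementation: the function name `lsi_reduce`, the toy matrix, and the choice of `k` are all assumptions made for this example, and the information granulation step is not shown.

```python
import numpy as np

def lsi_reduce(X, k):
    """Truncated-SVD projection (hypothetical helper, not from the paper).

    Returns the reduced data Z (n_samples x k), the k retained right
    singular vectors Vk (k x n_features), and all singular values s.
    """
    X = np.asarray(X, dtype=float)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = min(k, len(s))
    Vk = Vt[:k]
    Z = X @ Vk.T  # equivalent to U[:, :k] * s[:k]
    return Z, Vk, s

# Toy data (assumed): 6 examples with 4 features, where the last two
# features are linear combinations of the first two, so the rank is 2.
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 2))
mix = np.array([[1.0, 2.0], [0.5, -1.0]])
X = np.hstack([base, base @ mix])

Z, Vk, s = lsi_reduce(X, k=2)
X_hat = Z @ Vk  # map the reduced data back to the original feature space
```

Because the toy matrix has rank 2, two latent dimensions reproduce it up to rounding error. On real data the trailing singular values are rarely zero, so the choice of `k` trades compression against reconstruction error.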
