Effective supervised discretization for classification based on correlation maximization

In many real-world applications, the data contain continuous (numerical) features, also called attributes. However, many classification models accept only nominal features as inputs, so discretization is needed as a pre-processing step to transform numerical data into nominal data for such models. Well-discretized data should not only summarize the original data concisely but also improve classification performance. In this paper, we propose a novel supervised discretization algorithm based on correlation maximization (CM). It uses multiple correspondence analysis (MCA), a technique that captures the correlations among multiple variables. For each numeric feature, the correlation information generated by MCA is used to build a discretization that maximizes the correlations between feature intervals (items) and classes. The algorithm is compared empirically with four other commonly used discretization algorithms using six well-known classifiers. Results on five UCI datasets and five TRECVID datasets demonstrate that the proposed algorithm automatically generates a better set of feature intervals by maximizing their correlations with the classes, and thus improves classification performance.
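The idea of picking intervals that correlate strongly with the classes can be illustrated with a small sketch. This is not the CM algorithm from the paper: as a simplifying assumption, it scores a candidate cut point by the total inertia of simple correspondence analysis on the interval-by-class contingency table (the chi-square statistic divided by the sample size), and it searches for a single cut only. The function names `ca_inertia` and `best_cut` are hypothetical.

```python
import numpy as np


def ca_inertia(intervals, classes):
    """Total inertia (chi-square / n) of the interval-by-class contingency table.

    Used here as a stand-in correlation score; the paper's MCA-based
    measure may differ.
    """
    _, i_idx = np.unique(intervals, return_inverse=True)
    _, c_idx = np.unique(classes, return_inverse=True)
    table = np.zeros((i_idx.max() + 1, c_idx.max() + 1))
    np.add.at(table, (i_idx, c_idx), 1)  # count co-occurrences
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    return float(((table - expected) ** 2 / expected).sum() / n)


def best_cut(values, classes):
    """Return (cut_point, score) for the single cut that maximizes the score."""
    values = np.asarray(values, dtype=float)
    classes = np.asarray(classes)
    uniq = np.unique(values)
    if uniq.size < 2:
        raise ValueError("need at least two distinct values to place a cut")
    # Candidate cuts are midpoints between consecutive distinct values,
    # so both resulting intervals are always non-empty.
    candidates = (uniq[:-1] + uniq[1:]) / 2.0
    scores = [ca_inertia(values > c, classes) for c in candidates]
    best = int(np.argmax(scores))
    return float(candidates[best]), scores[best]


if __name__ == "__main__":
    cut, score = best_cut([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])
    print(cut, score)
```

Applying `best_cut` recursively to each resulting interval would give a multi-interval discretization. A stopping criterion would also be needed, and this sketch does not address one.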
