Correlation maximisation-based discretisation for supervised classification

This paper proposes a novel supervised discretisation algorithm based on Correlation Maximisation (CM) using Multiple Correspondence Analysis (MCA). MCA is an effective technique for capturing the correlation between multiple variables. For each numeric feature, the proposed algorithm uses MCA to measure the correlations between feature intervals (items) and classes, and the set of cut-points yielding the maximum correlation is chosen as the discretisation scheme for that feature. The discretised feature therefore not only gives a concise summary of the original numeric feature but also retains the maximum correlation information for predicting class labels. The algorithm is compared with seven state-of-the-art supervised discretisation algorithms, using six well-known classifiers on 19 UCI data sets. Experimental results demonstrate that the proposed algorithm can automatically generate a set of feature intervals that produce the best classification results on average.
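
The cut-point selection step described above can be illustrated with a minimal sketch under strong simplifying assumptions; it is not the paper's exact procedure. For one numeric feature and one class variable, MCA is closely related to correspondence analysis of the interval-by-class contingency table, so the sketch uses the table's total inertia (chi-square divided by the number of samples) as a stand-in for the correlation measure, and it searches for a single cut-point only. The function names `inertia` and `best_cut_point` are hypothetical.

```python
import numpy as np


def inertia(intervals, labels):
    """Total inertia (chi-square / n) of the interval-by-class table.

    Assumption: used here as a simple proxy for the MCA-based
    correlation between feature intervals and classes.
    """
    _, r = np.unique(intervals, return_inverse=True)
    _, c = np.unique(labels, return_inverse=True)
    table = np.zeros((r.max() + 1, c.max() + 1))
    np.add.at(table, (r, c), 1)  # count samples per (interval, class)
    p = table / table.sum()
    expected = np.outer(p.sum(axis=1), p.sum(axis=0))
    return float(((p - expected) ** 2 / expected).sum())


def best_cut_point(x, y):
    """Return the single cut-point with the highest proxy score."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    values = np.unique(x)
    if values.size < 2:
        raise ValueError("need at least two distinct feature values")
    # Candidate cut-points: midpoints between consecutive distinct values.
    candidates = (values[:-1] + values[1:]) / 2
    scores = [inertia(x > cut, y) for cut in candidates]
    best = int(np.argmax(scores))
    return float(candidates[best]), scores[best]


if __name__ == "__main__":
    cut, score = best_cut_point([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])
    print(cut, score)
```

A multi-interval scheme could be obtained by applying the same search recursively within each resulting interval, but the stopping rule and the exact correlation measure would have to come from the paper itself.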
