An Enhanced Univariate Discretization Based on Cluster Ensembles

Most discretization algorithms focus on the univariate case. In general, they take into account the target class or the interval-wise frequency of data. In doing so, useful information regarding natural groups, hidden patterns and correlations among the attributes may be lost. In response, this paper introduces a new pruning method that exploits natural groups or clusters as an explicit constraint on traditional cut-point determination techniques. This unsupervised approach makes use of cluster ensembles to reveal similarities between data belonging to adjacent intervals. To be precise, a cut-point between a pair of highly similar or related intervals is dropped. This pruning mechanism is coupled with three different univariate discretization algorithms, and the evaluation is conducted on 10 datasets and 3 classifier models. The results suggest that the proposed method usually achieves higher classification accuracy than the three baseline counterparts.
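
The pruning idea can be illustrated with a minimal sketch. It is not the paper's implementation: the co-association similarity (the fraction of base clusterings in which two samples share a cluster), the mean pairwise similarity between adjacent intervals, the 0.5 threshold, and the function names `co_association`, `assign_intervals` and `prune_cut_points` are all assumptions made for illustration. The base clusterings are supplied directly as label lists rather than computed.

```python
from itertools import product


def co_association(labelings):
    """Fraction of base clusterings in which each pair of samples shares a cluster."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(labelings) for c in row] for row in counts]


def assign_intervals(values, cuts):
    """Index of the interval each value falls in, given sorted cut-points."""
    return [sum(v > c for c in cuts) for v in values]


def prune_cut_points(values, cuts, labelings, threshold=0.5):
    """Drop a cut-point when the two intervals it separates are highly similar.

    Similarity is the mean co-association between samples on either side of
    the cut (an assumed measure); `threshold` is a hypothetical parameter.
    """
    co = co_association(labelings)
    bins = assign_intervals(values, cuts)
    kept = []
    for k, cut in enumerate(cuts):
        left = [i for i, b in enumerate(bins) if b == k]
        right = [i for i, b in enumerate(bins) if b == k + 1]
        if not left or not right:
            kept.append(cut)  # no evidence of similarity, so keep the cut
            continue
        total = sum(co[i][j] for i, j in product(left, right))
        if total / (len(left) * len(right)) < threshold:
            kept.append(cut)
    return kept


# Toy attribute with two candidate cut-points from some baseline discretizer.
values = [1.0, 1.2, 2.0, 2.2, 8.0, 8.3]
cuts = [1.5, 5.0]
# Three base clusterings of the same six samples (a toy cluster ensemble).
labelings = [
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 1, 1, 2, 2],
]
pruned = prune_cut_points(values, cuts, labelings)
print(pruned)
```

In this toy example the first two intervals are grouped together by two of the three base clusterings, so the cut-point at 1.5 is dropped, while the cut-point at 5.0 separates samples that never share a cluster and is retained.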
