Machine learning integrated credibilistic semi supervised clustering for categorical data

Abstract In real-life pattern analysis, the availability of correctly labeled data and the handling of categorical data are widely acknowledged as two major challenges. Clustering techniques are therefore applied to unlabeled data to group them by homogeneity. However, clustering techniques fail to reach a decision when the data are uncertain, ambiguous, vague, coincidental or overlapping in nature, and in such cases a semi-supervised technique can be useful. Moreover, real-life datasets are predominantly categorical, so their attribute values lack a natural ordering. This property, together with inherent characteristics such as uncertainty, ambiguity and vagueness, makes clustering categorical data more complicated than clustering numerical data. In recent work, the credibilistic measure has shown better performance than fuzzy and possibilistic measures in handling similar inherent characteristics of numerical data. These facts motivated us to propose a semi-supervised clustering technique that uses the credibilistic measure and integrates machine learning techniques to address the above-mentioned challenges of clustering categorical data. The technique first clusters the dataset into K subsets with the proposed Credibilistic K-Mode, where the credibilistic measure helps to determine homogeneity while avoiding the coincident clustering problem, and also identifies the points that belong to their clusters with certainty. In the second part, the clustered dataset is used to build a supervised model that classifies the remaining unlabeled or uncertain data. The technique not only handles unlabeled data better but also yields improved results for uncertain or ambiguous data, e.g., when the credibilistic measure of a data point is the same for multiple classes.
The results of the proposed technique are demonstrated quantitatively and visually in comparison with widely used state-of-the-art methods on eight synthetic and four real-life datasets. Finally, statistical tests are conducted to assess the statistical significance of the results produced by the proposed technique.
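
The two-stage idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' Credibilistic K-Mode: the abstract gives no formulas, so the sketch assumes a possibility value derived from Hamming distance, the standard credibility form Cr_j = 0.5 * (Pos_j + 1 - max of the other Pos_k), hard-assignment mode updates, a hypothetical certainty threshold of 0.6, and a plain nearest-neighbor vote as a stand-in for the supervised model. All function names are hypothetical, and at least two clusters are assumed.

```python
from collections import Counter

def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def credibility(dists):
    """Credibility of one point for each cluster, given its distances to the modes.

    Assumed form: possibility decreases with distance and is normalised so the
    nearest mode gets 1; Cr_j = 0.5 * (Pos_j + 1 - max_{k != j} Pos_k).
    """
    pos = [1.0 / (1.0 + d) for d in dists]
    top = max(pos)
    pos = [p / top for p in pos]
    return [0.5 * (p + 1.0 - max(q for k, q in enumerate(pos) if k != j))
            for j, p in enumerate(pos)]

def credibilistic_k_modes(data, init_modes, threshold=0.6, max_iter=20):
    """Stage 1 (simplified): cluster, then flag points whose top credibility is high."""
    modes = [tuple(m) for m in init_modes]

    def assign(current):
        cred = [credibility([hamming(row, m) for m in current]) for row in data]
        labels = [max(range(len(current)), key=lambda j: c[j]) for c in cred]
        return cred, labels

    for _ in range(max_iter):
        _, labels = assign(modes)
        new_modes = []
        for j, old in enumerate(modes):
            members = [row for row, lab in zip(data, labels) if lab == j]
            # Per-attribute most frequent value; keep the old mode if the cluster is empty.
            new_modes.append(
                tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
                if members else old
            )
        if new_modes == modes:
            break
        modes = new_modes

    cred, labels = assign(modes)
    certain = [max(c) >= threshold for c in cred]
    return labels, certain, modes

def classify_uncertain(data, labels, certain, k=3):
    """Stage 2 (stand-in): relabel uncertain points by a k-nearest-neighbor vote
    over the certain points, using Hamming distance."""
    train = [(row, lab) for row, lab, ok in zip(data, labels, certain) if ok]
    final = list(labels)
    for i, row in enumerate(data):
        if certain[i] or not train:
            continue
        nearest = sorted(train, key=lambda t: hamming(row, t[0]))[:k]
        final[i] = Counter(lab for _, lab in nearest).most_common(1)[0][0]
    return final

# Toy data: the last record is equally far from both initial modes.
data = [tuple(s) for s in ["aaaa", "baaa", "abaa", "bbbb", "abbb", "babb", "aabb"]]
labels, certain, modes = credibilistic_k_modes(data, [data[0], data[3]])
final = classify_uncertain(data, labels, certain)
print(labels)   # stage-1 labels
print(certain)  # which points count as certain
print(final)    # labels after the supervised step
```

In this toy run the last record has credibility 0.5 for both clusters, which is the tie case mentioned in the abstract, so stage 1 cannot place it reliably. The neighbor vote over the certain points then assigns it to the second cluster, because two of its three nearest certain neighbors belong there.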
