Classification of Categorical Data Based on the Chi-Square Dissimilarity and t-SNE

The widespread use of databases with categorical variables across applications calls for new ways to identify relevant patterns. Classification is a natural approach for recognizing this type of data. However, few methods for this purpose exist in the literature, and those that do focus only on kernels and suffer from accuracy problems and high computational cost. For this reason, we propose an identification approach for categorical variables that uses conventional classifiers (LDC, QDC, KNN, SVM) together with different mapping techniques to increase class separability. Specifically, we map the initial features (categorical attributes) to another space, using the Chi-square (C-S) as a dissimilarity measure. We then apply t-SNE to reduce the dimensionality of the data to two or three features, which significantly reduces the computation time of the learning methods. We evaluate the accuracy of the proposed approach for several experimental configurations on public categorical datasets from the UCI repository, and we compare it with relevant state-of-the-art methods. The results show that the C-S mapping combined with t-SNE considerably reduces computation time in recognition tasks while preserving accuracy. Moreover, when only the C-S mapping is applied to the datasets, class separability is enhanced and the performance of the learning algorithms clearly increases.
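To make the C-S mapping step concrete, the following is a minimal sketch, not the authors' implementation. It assumes one common definition of the Chi-square distance (the one used in correspondence analysis): each sample is one-hot encoded, converted to a row profile, and compared with other samples after weighting each category by the inverse of its relative frequency. The function names `one_hot` and `chi_square_dissimilarity` and the toy data are hypothetical, and the abstract does not state the exact formula used in the paper.

```python
import numpy as np

def one_hot(X):
    """Build an indicator matrix with one column per (attribute, category) pair."""
    X = np.asarray(X).astype(str)
    blocks = []
    for j in range(X.shape[1]):
        cats, codes = np.unique(X[:, j], return_inverse=True)
        blocks.append(np.eye(len(cats))[codes])
    return np.hstack(blocks)

def chi_square_dissimilarity(X):
    """Pairwise Chi-square distance between samples (assumed definition)."""
    Z = one_hot(X)
    # Row profiles: each sample's indicator row divided by its total.
    profiles = Z / Z.sum(axis=1, keepdims=True)
    # Column masses: relative frequency of every category in the dataset.
    col_mass = Z.sum(axis=0) / Z.sum()
    # Rare categories get a larger weight (division by sqrt of the mass).
    scaled = profiles / np.sqrt(col_mass)
    # Euclidean distance between the weighted profiles.
    diff = scaled[:, None, :] - scaled[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy categorical dataset (hypothetical): two attributes, four samples.
X_demo = [["red", "S"], ["red", "S"], ["blue", "L"], ["blue", "S"]]
D = chi_square_dissimilarity(X_demo)
print(np.round(D, 3))
```

Each row of `D` can be read as a numeric representation of one sample in terms of its dissimilarity to every other sample. If scikit-learn is available, one option is to pass such a matrix to `sklearn.manifold.TSNE` with `metric="precomputed"` and `init="random"` to obtain two or three features for a conventional classifier; whether this matches the configuration used in the paper is not stated in the abstract.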
