A Memory-Efficient Encoding Method for Processing Mixed-Type Data on Machine Learning

The most common machine-learning methods solve supervised and unsupervised problems based on datasets where the problem's features belong to a numerical space. However, many problems often include data where numerical and categorical data coexist, which represents a challenge to manage them. To transform categorical data into a numeric form, preprocessing tasks are compulsory. Methods such as one-hot and feature-hashing have been the most widely used encoding approaches at the expense of a significant increase in the dimensionality of the dataset. This effect introduces unexpected challenges to deal with the overabundance of variables and/or noisy data. In this regard, in this paper we propose a novel encoding approach that maps mixed-type data into an information space using Shannon's Theory to model the amount of information contained in the original data. We evaluated our proposal with ten mixed-type datasets from the UCI repository and two datasets representing real-world problems obtaining promising results. For demonstrating the performance of our proposal, this was applied for preparing these datasets for classification, regression, and clustering tasks. We demonstrate that our encoding proposal is remarkably superior to one-hot and feature-hashing encoding in terms of memory efficiency. Our proposal can preserve the information conveyed by the original data.

[1]  J. Gower A General Coefficient of Similarity and Some of Its Properties , 1971 .

[2]  Michael K. Ng,et al.  On the Impact of Dissimilarity Measure in k-Modes Clustering Algorithm , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Herbert A. Sturges,et al.  The Choice of a Class Interval , 1926 .

[4]  Michalis Vazirgiannis,et al.  On Clustering Validation Techniques , 2001, Journal of Intelligent Information Systems.

[5]  Michael K. Ng,et al.  A fuzzy k-modes algorithm for clustering categorical data , 1999, IEEE Trans. Fuzzy Syst..

[6]  Gautam Biswas,et al.  Unsupervised Learning with Mixed Numeric and Nominal Data , 2002, IEEE Trans. Knowl. Data Eng..

[7]  D. Suits Use of Dummy Variables in Regression Equations , 1957 .

[8]  George T. Cantwell,et al.  Improved mutual information measure for classification and community detection , 2019, ArXiv.

[9]  David P. Doane,et al.  Aesthetic Frequency Classifications , 1976 .

[10]  Michael K. Ng,et al.  A Note on K-modes Clustering , 2003, J. Classif..

[11]  Robert Tibshirani,et al.  The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition , 2001, Springer Series in Statistics.

[12]  Chung-Chian Hsu,et al.  Generalizing self-organizing map for categorical data , 2006, IEEE Transactions on Neural Networks.

[13]  C. E. SHANNON,et al.  A mathematical theory of communication , 1948, MOCO.

[14]  M. Levandowsky,et al.  Distance between Sets , 1971, Nature.

[15]  Daniele Micci-Barreca,et al.  A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems , 2001, SKDD.

[16]  J. Graham,et al.  Missing data analysis: making it work in the real world. , 2009, Annual review of psychology.

[17]  Joshua Zhexue Huang,et al.  Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values , 1998, Data Mining and Knowledge Discovery.

[18]  Sotiris B. Kotsiantis,et al.  Decision trees: a recent overview , 2011, Artificial Intelligence Review.

[19]  Lawrence Hubert,et al.  The variance of the adjusted Rand index. , 2016, Psychological methods.

[20]  Ferat Sahin,et al.  A survey on feature selection methods , 2014, Comput. Electr. Eng..

[21]  Lipika Dey,et al.  A k-mean clustering algorithm for mixed numeric and categorical data , 2007, Data Knowl. Eng..

[22]  Sudipto Guha,et al.  ROCK: A Robust Clustering Algorithm for Categorical Attributes , 2000, Inf. Syst..

[23]  Patrick Mair,et al.  Multidimensional Scaling Using Majorization: SMACOF in R , 2008 .