Structured Iterative Hard Thresholding for Categorical and Mixed Data Types

In many applications, data arrives in a mixed format, i.e., a combination of nominal (categorical) and numerical features. A common practice when working with categorical features is to apply an encoding method that transforms the discrete values into a numeric representation. However, such a representation often neglects the innate structure of categorical features, potentially degrading the performance of learning algorithms. It can also limit the interpretability of the learned model, for instance when identifying the most discriminative categorical features or filtering out irrelevant attributes. In this work, we extend the iterative hard thresholding (IHT) algorithm to account for the structure of categorical features. We evaluate the proposed structured hard thresholding algorithm empirically on both real and synthetic data sets, comparing it against the original hard thresholding algorithm, LASSO, and Random Forest. The results demonstrate improved performance over the original IHT.
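To make the idea concrete, the following is a minimal sketch of iterative hard thresholding with a group-wise thresholding step, under the assumption that each categorical feature is one-hot encoded into a contiguous block of columns and that the whole block is kept or zeroed together. All names (`group_iht`, `groups`, `k`) are illustrative, not the paper's actual implementation.

```python
import numpy as np

def group_iht(A, y, groups, k, step=None, iters=100):
    """Least-squares IHT that keeps whole feature groups.

    A      : (n, d) design matrix (categorical features one-hot encoded)
    y      : (n,) response vector
    groups : list of index arrays, one per original feature
    k      : number of groups to keep at each thresholding step
    """
    n, d = A.shape
    if step is None:
        # conservative step size from the spectral norm of A
        step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(d)
    for _ in range(iters):
        # gradient step on the least-squares objective 0.5*||y - Ax||^2
        x = x + step * A.T @ (y - A @ x)
        # score each group by its l2 energy and keep the top-k groups,
        # zeroing all columns of every discarded group at once
        energy = np.array([np.linalg.norm(x[g]) for g in groups])
        keep = np.argsort(energy)[-k:]
        mask = np.zeros(d, dtype=bool)
        for i in keep:
            mask[groups[i]] = True
        x[~mask] = 0.0
    return x
```

The only change relative to plain IHT is the thresholding rule: instead of keeping the k largest individual coefficients, the sketch keeps the k groups with the largest l2 norm, so a categorical feature is selected or discarded as a unit.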
