Unsupervised Embeddings for Categorical Variables

Real-world data sets often contain both continuous and categorical variables, yet most popular machine learning methods cannot handle both data types by default. Researchers therefore need to transform their data into a continuous format. When no prior information is available, the most widely applied methods are simple ones such as one-hot encoding. However, these methods ignore many possible sources of information, in particular dependencies between categorical variables, which could enrich the vector representations. We investigate the effect of applying natural language processing techniques for learning continuous word-vector representations to categorical variables. We show empirically that the learned vector representations capture information about the categorical variables themselves and about their dependencies with other variables, similar to how word embeddings capture semantic and syntactic information. We also show that, on various classification benchmark data sets, machine learning models using unsupervised categorical embeddings are competitive with models using supervised embeddings, and outperform them when fine-tuned.
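To make the contrast with one-hot encoding concrete, the following is a minimal sketch, not the method evaluated in this work. It assumes a toy table with made-up values, treats each row as a "sentence" of `column=value` tokens, and represents each category by its raw co-occurrence counts with the other tokens, which is a simple count-based stand-in for a learned word embedding. All names (`rows`, `tokens`, `cooccurrence_vectors`, `cosine`) are hypothetical.

```python
from collections import defaultdict
from math import sqrt

# Toy table (made-up data). Each row is treated as a "sentence" whose
# "words" are column=value tokens.
rows = [
    {"city": "Paris", "country": "France", "lang": "French"},
    {"city": "Lyon", "country": "France", "lang": "French"},
    {"city": "Berlin", "country": "Germany", "lang": "German"},
    {"city": "Munich", "country": "Germany", "lang": "German"},
]


def tokens(row):
    """Turn one row into a list of column=value tokens."""
    return [f"{col}={val}" for col, val in row.items()]


vocab = sorted({t for r in rows for t in tokens(r)})
index = {t: i for i, t in enumerate(vocab)}


def one_hot(token):
    """Baseline: one dimension per category, no shared information."""
    vec = [0.0] * len(vocab)
    vec[index[token]] = 1.0
    return vec


def cooccurrence_vectors(table):
    """Count how often each token appears in the same row as every other token."""
    counts = defaultdict(lambda: [0.0] * len(vocab))
    for r in table:
        toks = tokens(r)
        for a in toks:
            for b in toks:
                if a != b:
                    counts[a][index[b]] += 1.0
    return dict(counts)


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


embeddings = cooccurrence_vectors(rows)

# One-hot vectors of distinct categories are always orthogonal.
sim_one_hot = cosine(one_hot("city=Paris"), one_hot("city=Lyon"))
# Co-occurrence vectors place cities that share a country and language together.
sim_same_country = cosine(embeddings["city=Paris"], embeddings["city=Lyon"])
sim_diff_country = cosine(embeddings["city=Paris"], embeddings["city=Berlin"])
```

In this toy table, the one-hot vectors for Paris and Lyon share no dimensions, while their co-occurrence vectors coincide because both cities appear only with France and French. A learned embedding would typically compress such counts into far fewer dimensions; that step is omitted here.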
