How to find an appropriate clustering for mixed‐type variables with application to socio‐economic stratification

Summary. Data with mixed‐type (metric–ordinal–nominal) variables are typical for social stratification, i.e. partitioning a population into social classes. Approaches to clustering such data are compared, namely a latent class mixture model assuming local independence and dissimilarity‐based methods such as k‐medoids. The design of an appropriate dissimilarity measure and the estimation of the number of clusters are discussed as well, comparing the Bayesian information criterion with dissimilarity‐based criteria. The comparison is based on a philosophy of cluster analysis that ties the choice of a suitable clustering method closely to the application by considering direct interpretations of the implications of the methodology. The application of this philosophy to economic data from the 2007 US Survey of Consumer Finances demonstrates the techniques and decisions required to obtain an interpretable clustering. The clustering is shown to be significantly more structured than a suitable null model. One result is that the data‐based strata are not as strongly connected to occupation categories as is often assumed in the literature.
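To make the dissimilarity‐based route concrete, the following is a minimal sketch in Python, not the procedure used in the paper. It assumes a Gower‐style dissimilarity (range‐scaled absolute differences for metric and ordinal variables, simple mismatch for nominal ones, averaged with equal weights) and a basic alternating k‐medoids loop. The toy rows, the variable names and the helper functions `mixed_dissimilarity` and `k_medoids` are hypothetical and chosen only for illustration; designing a dissimilarity for a real application would involve deliberate choices of transformation and weighting that this sketch leaves out.

```python
# Illustrative sketch only: a Gower-style dissimilarity for mixed-type rows
# plus a simple alternating k-medoids loop. Data and names are hypothetical.

def mixed_dissimilarity(a, b, kinds, ranges):
    """Average of per-variable contributions, each lying in [0, 1]."""
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "nominal":
            # Nominal variables: 0 if the categories match, 1 otherwise.
            total += 0.0 if x == y else 1.0
        else:
            # Metric or ordinal (coded as scores): range-scaled difference.
            total += abs(x - y) / rng if rng else 0.0
    return total / len(kinds)


def assign(D, medoids):
    """Label each observation with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda j: D[i][medoids[j]])
            for i in range(len(D))]


def k_medoids(D, k, max_iter=100):
    """Alternate between assignment and medoid update until stable."""
    medoids = list(range(k))  # naive start: the first k observations
    for _ in range(max_iter):
        labels = assign(D, medoids)
        new_medoids = []
        for j in range(k):
            members = [i for i, lab in enumerate(labels) if lab == j]
            if not members:
                # Keep the old medoid if a cluster happens to be empty.
                new_medoids.append(medoids[j])
                continue
            # New medoid: member with the smallest total dissimilarity
            # to the other members of its cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(D[m][i] for i in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(D, medoids)


# Hypothetical toy rows: (income in $1000, education level 1-4, housing).
kinds = ("metric", "ordinal", "nominal")
data = [
    (20.0, 1, "rent"), (24.0, 1, "rent"), (23.0, 2, "rent"),
    (90.0, 4, "own"), (97.0, 3, "own"), (100.0, 4, "own"),
]

# Observed range of each metric/ordinal variable; unused for nominal ones.
ranges = [
    None if kind == "nominal"
    else max(row[j] for row in data) - min(row[j] for row in data)
    for j, kind in enumerate(kinds)
]

# Full pairwise dissimilarity matrix, then a two-cluster solution.
D = [[mixed_dissimilarity(a, b, kinds, ranges) for b in data] for a in data]
medoids, labels = k_medoids(D, k=2)
print("medoids:", medoids)
print("labels:", labels)
```

On this toy input the two clusters separate the first three rows from the last three. Because k‐medoids only needs the matrix `D`, a different dissimilarity can be swapped in without touching the clustering loop, which is the reason the dissimilarity is kept in its own function here.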
