Socioeconomic Status Measurement with Discrete Proxy Variables: Is Principal Component Analysis a Reliable Answer?

The last several years have seen a growth in the number of publications in economics that use principal component analysis (PCA) in the area of welfare studies. This paper explores the ways discrete data can be incorporated into PCA. The effects of discreteness of the observed variables on the PCA are reviewed. The statistical properties of the popular Filmer and Pritchett (2001) procedure are analyzed. The concepts of polychoric and polyserial correlations are introduced with appropriate references to the existing literature demonstrating their statistical properties. A large simulation study is carried out to compare various implementations of discrete data PCA. The simulation results show that the currently used method of running PCA on a set of dummy variables as proposed by Filmer and Pritchett (2001) can be improved upon by using procedures appropriate for discrete data, such as retaining the ordinal variables without breaking them into a set of dummy variables or using polychoric correlations. An empirical example using Bangladesh 2000 Demographic and Health Survey data helps in explaining the differences between procedures.
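
As a rough illustration of two of the procedures compared in the abstract, the sketch below simulates ordinal proxies driven by a single latent variable and computes first-principal-component scores twice: once from the full set of category dummies (in the spirit of the Filmer and Pritchett procedure) and once from the ordinal codes kept intact. It is not the paper's simulation design. The sample size, loadings, cut points, and the helper `first_pc_scores` are all assumptions made for this example, and the polychoric variant is left out because it needs a dedicated estimator.

```python
import numpy as np


def first_pc_scores(X):
    """Scores on the first principal component of the correlation matrix of X.

    Hypothetical helper for this sketch, not a function from the paper.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize columns
    corr = np.corrcoef(Z, rowvar=False)            # p x p correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)        # eigenvalues in ascending order
    return Z @ eigvecs[:, -1]                      # project on the leading eigenvector


rng = np.random.default_rng(0)
n, p = 2000, 5                                     # assumed sample size and number of proxies

# Assumed data-generating process: one latent "SES" factor plus noise.
latent = rng.standard_normal(n)
continuous = 0.8 * latent[:, None] + 0.6 * rng.standard_normal((n, p))

# Discretize each indicator into three ordered categories (0, 1, 2).
cut_points = [-0.5, 0.7]                           # assumed thresholds
ordinal = np.digitize(continuous, cut_points)

# Dummy coding: one 0/1 column per category of every variable.
dummies = np.concatenate(
    [(ordinal == k).astype(float) for k in range(3)], axis=1
)

# The sign of a principal component is arbitrary, so compare absolute correlations.
corr_dummy = abs(np.corrcoef(first_pc_scores(dummies), latent)[0, 1])
corr_ordinal = abs(np.corrcoef(first_pc_scores(ordinal.astype(float)), latent)[0, 1])

print(f"|corr(PC1, latent)| with dummy coding:  {corr_dummy:.3f}")
print(f"|corr(PC1, latent)| with ordinal codes: {corr_ordinal:.3f}")
```

The two printed correlations show how well each index tracks the simulated latent variable. A single run with one seed is only suggestive, which is why the paper relies on a large simulation study rather than a single draw.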

[1]  Karl Pearson,et al.  Mathematical contributions to the theory of evolution. VIII. On the correlation of characters not quantitatively measurable , 1900, Proceedings of the Royal Society of London.

[2]  Karl Pearson F.R.S. LIII. On lines and planes of closest fit to systems of points in space , 1901 .

[3]  Karl Pearson,et al.  ON POLYCHORIC COEFFICIENTS OF CORRELATION , 1922 .

[4]  H. Hotelling Analysis of a complex of statistical variables into principal components. , 1933 .

[5]  M. Friedman,et al.  Theory of the Consumption Function , 1957 .

[6]  T. W. Anderson,et al.  An Introduction to Multivariate Statistical Analysis , 1959 .

[7]  T. W. Anderson An Introduction to Multivariate Statistical Analysis , 1959 .

[8]  T. W. Anderson ASYMPTOTIC THEORY FOR PRINCIPAL COMPONENT ANALYSIS , 1963 .

[9]  A. W. Davis ASYMPTOTIC THEORY FOR PRINCIPAL COMPONENT ANALYSIS: NON-NORMAL CASE , 1977 .

[10]  Ulf Olsson,et al.  Maximum likelihood estimation of the polychoric correlation coefficient , 1979 .

[11]  B. Parlett The Symmetric Eigenvalue Problem , 1981 .

[12]  K. Bollen,et al.  Pearson's R and Coarsely Categorized Measures , 1981 .

[13]  R. Muirhead Aspects of Multivariate Statistical Theory , 1982, Wiley Series in Probability and Statistics.

[14]  G. Maddala Limited-dependent and qualitative variables in econometrics: Introduction , 1983 .

[15]  D. R. Johnson,et al.  Ordinal measures in multiple indicator models: A simulation study of categorization error. , 1983 .

[16]  P. Schmidt,et al.  Limited-Dependent and Qualitative Variables in Econometrics. , 1984 .

[17]  A. Morineau,et al.  Multivariate descriptive statistical analysis , 1984 .

[18]  K. Bollen Multiple indicators: Internal consistency or no necessary relationship? , 1984 .

[19]  Charles R. Johnson,et al.  Matrix analysis , 1985, Statistical Inference for Engineers and Data Scientists.

[20]  D. Bartholomew Latent Variable Models And Factor Analysis , 1987 .

[21]  Emin Babakus,et al.  The Sensitivity of Confirmatory Maximum Likelihood Factor Analysis to Violations of Measurement Scale and Distributional Assumptions , 1987 .

[22]  Kenneth A. Bollen,et al.  Structural Equations with Latent Variables , 1989 .

[23]  C. Scott Effect of Recall Duration on Reporting of Household Expenditures: An Experimental Study in Ghana , 1990 .

[24]  J. S. Long,et al.  Testing Structural Equation Models , 1993 .

[25]  H. Bouis The effect of income on demand for food in poor countries: Are our food consumption databases giving us reliable estimates? , 1994 .

[26]  Conor V. Dolan,et al.  Factor analysis of variables with 2, 3, 5, and 7 response categories: A comparison of categorical variable estimators using simulated data , 1994 .

[27]  A. C. Rencher Methods of multivariate analysis , 1995 .

[28]  P. Lanjouw,et al.  Constructing an indicator of consumption for the analysis of poverty : principles and illustrations with reference to Ecuador , 1996 .

[29]  W. Krelle How to deal with unobservable variables in economics , 1997 .

[30]  E. Smith Methods of Multivariate Analysis , 1997 .

[31]  James Demmel,et al.  Applied Numerical Linear Algebra , 1997 .

[32]  D. Harris Principal Components Analysis of Cointegrated Time Series , 1997, Econometric Theory.

[33]  K. Judd Numerical methods in economics , 1998 .

[34]  R. Stott,et al.  The World Bank , 2008, Annals of tropical medicine and parasitology.

[35]  A Skrondal,et al.  Design and Analysis of Monte Carlo Experiments: Attacking the Conventional Wisdom , 2000, Multivariate Behavioral Research.

[36]  David J. Groggel,et al.  Practical Nonparametric Statistics , 2000, Technometrics.

[37]  A. Bartolo Human Capital Estimation through Structural Equation Models with some Categorical Observed Variables. , 2000 .

[38]  S. Caudill,et al.  Is Economic Freedom One Dimensional? A Factor Analysis of Some Common Measures of Economic Freedom , 2000 .

[39]  D. Kaplan Structural Equation Modeling: Foundations and Extensions , 2000 .

[40]  Thomas J. Webster,et al.  A Principal Component Analysis of the "U.S. News & World Report" Tier Rankings of Colleges and Universities. , 2001 .

[41]  Jeffrey M. Wooldridge,et al.  Solutions Manual and Supplementary Materials for Econometric Analysis of Cross Section and Panel Data , 2003 .

[42]  H. Winklhofer,et al.  Index Construction with Formative Indicators: An Alternative to Scale Development , 2001 .

[43]  Jennifer L. Glanville,et al.  SOCIOECONOMIC STATUS AND CLASS IN STUDIES OF FERTILITY AND HEALTH IN DEVELOPING COUNTRIES , 2001 .

[44]  Kenneth A Bollen,et al.  Economic status proxies in studies of fertility in developing countries: Does the measure matter? , 2002, Population studies.

[45]  I. Jolliffe Principal Component Analysis , 2002 .

[46]  Christine DiStefano,et al.  The Impact of Categorization With Confirmatory Factor Analysis , 2002 .

[47]  In Choi STRUCTURAL CHANGES AND SEEMINGLY UNIDENTIFIED STRUCTURAL EQUATIONS , 2002, Econometric Theory.

[48]  Konstantinos Drakos Common Factors in Eurocurrency Rates: A Dynamic Analysis , 2002 .

[49]  Lucrezia Reichlin,et al.  Factor Models in Large Cross-Sections of Time Series , 2002 .

[50]  J. Stock,et al.  Forecasting Using Principal Components From a Large Number of Predictors , 2002 .

[51]  David J. Hand,et al.  Causal variables, indicator variables and measurement scales: an example from quality of life , 2002 .

[52]  J. Bai,et al.  Inferential Theory for Factor Models of Large Dimensions , 2003 .

[53]  Albert Maydeu-Olivares,et al.  Testing Categorized Bivariate Normality with Two-Stage Polychoric Correlation Estimates , 2003 .

[54]  B. Ripley,et al.  Robust Statistics , 2018, Encyclopedia of Mathematical Geosciences.

[55]  K. Jöreskog,et al.  Factor Models for Ordinal Variables With Covariate Effects on the Manifest and Latent Variables: A Comparison of LISREL and IRT Approaches , 2004 .

[56]  A Review of Stata 9.0 , 2005 .

[57]  S. Rabe-Hesketh,et al.  Maximum likelihood estimation of limited and discrete dependent variable models with nested random effects , 2005 .

[58]  To Be a Principal. , 2005 .

[59]  R. Pérez-Núñez,et al.  Edentulism among Mexican adults aged 35 years and older and associated factors. , 2006, American journal of public health.

[60]  Lilani Kumaranayake,et al.  Constructing socio-economic status indices: how to use principal components analysis. , 2006, Health policy and planning.

[61]  R. Hong,et al.  Economic Inequality and Undernutrition in Women: Multilevel Analysis of Individual, Household, and Community Levels in Cambodia , 2007, Food and nutrition bulletin.

[62]  Kevin J. A. Thomas,et al.  Child Mortality and Socioeconomic Status: An Examination of Differentials by Migration Status in South Africa , 2007 .

[63]  Y. Berhane,et al.  Household and socioeconomic factors associated with childhood febrile illnesses and treatment seeking behaviour in an area of epidemic malaria in rural Ethiopia. , 2007, Transactions of the Royal Society of Tropical Medicine and Hygiene.

[64]  A. Wagstaff,et al.  Turkey - Socio-economic differences in health, nutrition, and population , 2007 .

[65]  Daniel Suryadarma,et al.  Predicting Consumption Poverty using Non-Consumption Indicators: Experiments using Indonesian Data , 2006 .

[66]  L. Fernald Socio-economic status and body mass index in low-income Mexican adults. , 2007, Social science & medicine.

[67]  T. Mroz,et al.  Arbitrarily Normalized Coefficients, Information Sets, and False Reports of Biases in Binary Outcome Models , 2008, The Review of Economics and Statistics.

[68]  O. Vorobyev,et al.  Discrete multivariate distributions , 2008, 0811.0406.

[69]  Alberto Maydeu-Olivares,et al.  Testing Categorized Bivariate Normality With Two-Stage Polychoric Correlation Estimates , 2009 .

[70]  L. Pritchett,et al.  Estimating Wealth Effects Without Expenditure Data—Or Tears: An Application To Educational Enrollments In States Of India , 2001, Demography.

[71]  Evgueni A. Haroutunian,et al.  Information Theory and Statistics , 2011, International Encyclopedia of Statistical Science.