Multiple Imputation of Industry and Occupation Codes in Census Public-use Samples Using Bayesian Logistic Regression

Abstract: We describe methods used to create a new Census database that can be used to study the comparability of industry and occupation classification systems. This project represents the most extensive application of multiple imputation to date, and the modeling effort was considerable as well: hundreds of logistic regressions were estimated. One goal of this article is to summarize the strategies used in the project so that researchers can better understand how the new databases were created. Another goal is to show how maximum likelihood methods were modified for the modeling and imputation phases of the project. To multiply impute 1980 census-comparable codes for industries and occupations in two 1970 census public-use samples, logistic regression models were estimated with flattening constants. For many of the regression models considered, the data were too sparse to support conventional maximum likelihood analysis, so an alternative had to be employed. These methods solve existence and ...
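The idea of estimating a logistic regression with flattening constants and then drawing multiple imputations can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the abstract: the outcome is reduced to a binary code, the pseudo-observations are spread evenly over covariate patterns and split by the overall observed rate, and imputations are drawn from a normal approximation to the posterior. The function names (add_flattening, fit_logit, impute) and the toy data are hypothetical.

```python
import numpy as np

def add_flattening(successes, trials, n_params):
    """Add fractional pseudo-observations to every covariate pattern.

    Assumed allocation rule (illustrative only): a total of n_params
    pseudo-observations, spread evenly over the g patterns and split
    between outcomes according to the overall observed success rate.
    """
    g = len(trials)
    rate = successes.sum() / trials.sum()
    return successes + n_params * rate / g, trials + n_params / g

def fit_logit(X, successes, trials, n_iter=100, tol=1e-10):
    """Newton-Raphson for grouped logistic regression; counts may be fractional."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (successes - trials * p)
        info = X.T @ (X * (trials * p * (1.0 - p))[:, None])
        step = np.linalg.solve(info, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Information matrix at the final estimate, used as an approximate precision.
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    info = X.T @ (X * (trials * p * (1.0 - p))[:, None])
    cov = np.linalg.inv(info)
    return beta, (cov + cov.T) / 2.0  # symmetrize against round-off

def impute(X_new, beta_hat, cov, m, rng):
    """Draw m imputed binary codes per row of X_new.

    Each imputation first draws coefficients from a normal approximation
    (an assumption of this sketch), then draws the codes given those
    coefficients, so parameter uncertainty is reflected across imputations.
    """
    draws = np.empty((m, X_new.shape[0]), dtype=int)
    for k in range(m):
        beta_star = rng.multivariate_normal(beta_hat, cov)
        p_star = 1.0 / (1.0 + np.exp(-(X_new @ beta_star)))
        draws[k] = rng.binomial(1, p_star)
    return draws

# Toy data: two covariate patterns with perfect separation (0/5 versus 5/5),
# a case in which the ordinary maximum likelihood estimate does not exist.
X = np.array([[1.0, 0.0], [1.0, 1.0]])
successes = np.array([0.0, 5.0])
trials = np.array([5.0, 5.0])

s_flat, n_flat = add_flattening(successes, trials, n_params=X.shape[1])
beta_hat, cov = fit_logit(X, s_flat, n_flat)

rng = np.random.default_rng(0)
imputations = impute(X, beta_hat, cov, m=5, rng=rng)
```

On the toy data the unmodified counts would push the slope toward infinity, whereas the flattened counts (0.5 of 6 and 5.5 of 6) give finite estimates. Because this toy model is saturated, the fitted probabilities simply equal the flattened proportions.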
