The Econometrics of Data Combination

Economists who use survey or administrative data to draw inferences about a population may want to combine information from two or more samples drawn from that population. This need arises when no single sample contains all relevant variables. A special case occurs when longitudinal or panel data are needed but only repeated cross-sections are available. In this chapter we survey methods for sample combination. When two (or more) samples from the same population are combined, some variables are unique to one of the samples and others are observed in each sample. What can be learned by combining such samples depends on the nature of the samples, the assumptions that one is prepared to make, and the goal of the analysis. The most ambitious objective is the identification and estimation of the joint distribution, but often we settle for the estimation of economic models that involve these variables or a subset thereof. Sometimes the goal is to reduce biases due to mismeasured variables. For the case in which the two samples have a substantial number of common units, we consider sample merger by matching on identifiers that may be imperfect. For the case in which the two samples are independent, we consider (conditional) bounds on the joint distribution. Exclusion restrictions narrow these bounds. We also consider inference under the strong assumption of conditional independence.
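As a rough illustration of the bounds mentioned above, the following sketch computes the classical Fréchet bounds on a joint probability P(X <= x, Y <= y) when only the two marginal probabilities are known, as happens when X and Y are observed in separate independent samples. The function names and the numerical inputs are hypothetical and chosen only for illustration; in a conditional version the same calculation would be repeated within each cell of the variables common to both samples. The product of the marginals is shown for comparison, since that is the value the joint probability would take if independence (within a cell, conditional independence) were assumed.

```python
def frechet_bounds(p_x: float, p_y: float) -> tuple[float, float]:
    """Return (lower, upper) bounds on P(X <= x, Y <= y).

    p_x = P(X <= x) and p_y = P(Y <= y) are the only inputs, so nothing
    is assumed about the dependence between X and Y.
    """
    if not (0.0 <= p_x <= 1.0 and 0.0 <= p_y <= 1.0):
        raise ValueError("marginal probabilities must lie in [0, 1]")
    lower = max(p_x + p_y - 1.0, 0.0)  # Frechet lower bound
    upper = min(p_x, p_y)              # Frechet upper bound
    return lower, upper


def independence_value(p_x: float, p_y: float) -> float:
    """Joint probability implied by assuming X and Y are independent."""
    return p_x * p_y


if __name__ == "__main__":
    # Hypothetical marginals, e.g. estimated from two separate samples.
    p_x, p_y = 0.7, 0.6
    lower, upper = frechet_bounds(p_x, p_y)
    point = independence_value(p_x, p_y)
    print(f"bounds: [{lower:.3f}, {upper:.3f}], under independence: {point:.3f}")
```

The independence value always lies inside the interval, which shows in miniature why conditional independence is a strong assumption: it selects a single point from a set that the two marginals alone cannot narrow further.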
