Optimal Probabilistic Record Linkage: Best Practice for Linking Employers in Survey and Administrative Data

This paper illustrates an application of record linkage between a household-level survey and an establishment-level frame in the absence of unique identifiers. Linkage between frames in this setting is challenging because the distribution of employment across firms is highly asymmetric. To address these difficulties, this paper uses a supervised machine learning model to probabilistically link survey respondents in the Health and Retirement Study (HRS) with employers and establishments in the Census Business Register (BR), creating a new data source that we call the CenHRS. Multiple imputation is used to propagate uncertainty from the linkage step into subsequent analyses of the linked data. The linked data reveal new evidence that survey respondents’ misreporting and selective nonresponse about employer characteristics are systematically correlated with wages.
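
The abstract does not state how estimates from the multiply imputed links are pooled. As a minimal sketch, assuming the standard Rubin combining rules are applied to estimates computed on each imputed set of links, the pooling step might look like the following. The function name `combine_rubin` and the example numbers are hypothetical and are not taken from the paper.

```python
from statistics import mean, variance


def combine_rubin(estimates, variances):
    """Pool m completed-data estimates with Rubin's combining rules.

    estimates: point estimates, one per imputed set of links (hypothetical input)
    variances: squared standard errors of those estimates
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Illustrative values only: one coefficient estimated on three imputed link sets.
pooled, total_var = combine_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what carries linkage uncertainty into the final standard error: if every imputed set of links gave the same estimate, the total variance would reduce to the usual within-sample variance.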
