Automated Linking of Historical Data

The recent digitization of complete-count census data is an extraordinary opportunity for social scientists to create large longitudinal datasets by linking individuals from one census to another or from other sources to the census. We evaluate different automated methods for record linkage, performing a series of comparisons across methods and against hand linking. We have three main findings that lead us to conclude that automated methods perform well. First, a number of automated methods generate very low (less than 5%) false positive rates. The automated methods trace out a frontier illustrating the tradeoff between the false positive rate and the (true) match rate. Relative to more conservative automated algorithms, humans tend to link more observations but at the cost of higher rates of false positives. Second, when human linkers and algorithms have the same amount of information, there is relatively little disagreement between them. Third, across a number of plausible analyses, coefficient estimates and parameters of interest are very similar when using linked samples based on each of the different automated methods. We provide code and Stata commands to implement the various automated methods. Institutional subscribers to the NBER working paper series and residents of developing countries may download this paper without additional charge at www.nber.org.
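
The tradeoff between the false positive rate and the (true) match rate can be illustrated with a minimal Python sketch. It is not the code or the Stata commands provided with the paper. The function name `linkage_rates`, the toy records, and the exact definitions are assumptions made for illustration: the false positive rate is taken as the share of links made that disagree with a ground-truth link, the match rate as the share of source records that receive any link, and the true match rate as the share of source records linked correctly.

```python
def linkage_rates(predicted, truth, n_records):
    """Summarize one linking method against ground truth.

    predicted, truth: dicts mapping a source record id to a target record id.
    n_records: number of source records the method attempted to link.
    All definitions here are illustrative assumptions, not the paper's code.
    """
    links_made = len(predicted)
    # A link counts as correct only if ground truth has the same target;
    # links for records absent from `truth` are treated as incorrect.
    correct = sum(1 for src, tgt in predicted.items() if truth.get(src) == tgt)
    return {
        "match_rate": links_made / n_records,
        "true_match_rate": correct / n_records,
        "false_positive_rate": (links_made - correct) / links_made if links_made else 0.0,
    }


# Hypothetical toy data: five source records, four of which have a true link.
truth = {1: "a", 2: "b", 3: "c", 4: "d"}
conservative = {1: "a", 2: "b"}                 # few links, all correct
liberal = {1: "a", 2: "b", 3: "x", 4: "d"}      # more links, one wrong

conservative_rates = linkage_rates(conservative, truth, n_records=5)
liberal_rates = linkage_rates(liberal, truth, n_records=5)
print(conservative_rates)
print(liberal_rates)
```

In this toy case the more liberal method links more records and recovers more true matches, but a quarter of its links are wrong, while the conservative method makes no false links. Plotting such pairs for several methods would trace out the kind of frontier the abstract describes.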
