Connecting family trees to construct a population-scale and longitudinal geo-social network for the U.S.

ABSTRACT We collected 92,832 user-contributed, publicly available family trees from rootsweb.com, including 250 million individuals who were born in North America and Europe between 1630 and 1930. We cleaned and connected the trees into a population-scale, longitudinal family tree dataset using a workflow of data collection and cleaning, geocoding, fuzzy record linkage, and a relation-based iterative search that connects trees and deduplicates records. The resulting dataset contains a total of 80 million individuals, with a largest connected component of nearly 40 million, making it the largest population-scale, longitudinal geo-social network generated to date, spanning centuries. We evaluated how well the family tree dataset represents historical population demography and mobility by comparing it to the 1880 Census. The family trees were biased towards males, the elderly, farmers, and native-born white segments of the population. Individuals were highly mobile: in our 1880 sample of parent-child pairs in which both were born in the U.S., 47% of pairs had parent and child born in different states. Consistent with prior studies, we found that people migrated from East to West in horizontal bands, a pattern reflected in the dialects and regional structure of the U.S.
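To make the record-linkage and deduplication steps named in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes hypothetical record fields (`name`, `birth_year`, `birth_state`), illustrative thresholds, and a plain Levenshtein edit distance as the fuzzy name comparator; it groups matching records with a union-find structure so that each group stands for one deduplicated individual. The abstract's relation-based iterative search, which also uses family relations to connect trees, is not modeled here.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def is_match(r1: dict, r2: dict, max_name_dist: int = 1, max_year_gap: int = 1) -> bool:
    """Hypothetical matching rule: similar name, close birth year, same birth state."""
    return (
        levenshtein(r1["name"].lower(), r2["name"].lower()) <= max_name_dist
        and abs(r1["birth_year"] - r2["birth_year"]) <= max_year_gap
        and r1["birth_state"] == r2["birth_state"]
    )


def link_records(records: list) -> list:
    """Group record indices into clusters of likely duplicates using union-find."""
    parent = list(range(len(records)))

    def find(x: int) -> int:
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Pairwise comparison is quadratic; it is only suitable for a small illustration.
    for i, j in combinations(range(len(records)), 2):
        if is_match(records[i], records[j]):
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    # Made-up records, for illustration only.
    sample = [
        {"name": "John Smith", "birth_year": 1850, "birth_state": "OH"},
        {"name": "Jon Smith", "birth_year": 1851, "birth_state": "OH"},
        {"name": "Mary Jones", "birth_year": 1850, "birth_state": "OH"},
    ]
    print(link_records(sample))
```

At the scale described in the abstract, an all-pairs comparison would be infeasible; a practical pipeline would typically restrict comparisons to candidate pairs that share a blocking key (for example, a surname code and birth decade), which is likewise an assumption here rather than something the abstract states.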
