A Supervised Learning and Group Linking Method for Historical Census Household Linkage

Historical census data provide a snapshot of the era in which our ancestors lived. Such data contain valuable information that allows households to be reconstructed and family changes to be tracked across time, supports the analysis of family diseases, and facilitates a variety of social science research. One topic of particular interest in historical census data analysis is the linking of households across time. Household linkage makes it possible to track the majority of members of a household over a period of time, which in turn facilitates the extraction of information that is otherwise hidden in the data, such as fertility, occupations, changes in family structure, immigration, and movements. Such information normally cannot be acquired easily by linking only the records of individuals. In this paper, we propose a novel method to link households in historical census data. In a first step, our method computes the attribute-wise similarity of individual record pairs; a support vector machine classifier is then trained on limited data and used to classify these record pairs into matches and non-matches. In a second step, a group linking approach is employed to link households based on the matched individual record pairs. Experimental results on real census data from the United Kingdom from 1851 to 1901 show that the proposed method can greatly reduce the number of multiple household matches compared with a traditional linkage of individual record pairs only.
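The two-step pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the field names, the toy records, and the use of `difflib.SequenceMatcher` as the string similarity are assumptions made for illustration. In particular, a fixed threshold on the mean attribute similarity stands in for the trained support vector machine, and a simple Jaccard-style overlap score stands in for the group linking measure.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Assumed attribute names; real census records carry more fields.
FIELDS = ("first_name", "surname", "birth_year")


def attribute_similarities(rec_a, rec_b):
    """Step 1a: one similarity score in [0, 1] per compared attribute."""
    return [
        SequenceMatcher(None, str(rec_a[f]).lower(), str(rec_b[f]).lower()).ratio()
        for f in FIELDS
    ]


def is_match(similarities, threshold=0.85):
    """Step 1b: placeholder for the trained SVM classifier.

    Assumption: a threshold on the mean similarity replaces the learned
    decision boundary so that the sketch needs no training data.
    """
    return sum(similarities) / len(similarities) >= threshold


def link_households(census_a, census_b):
    """Step 2: link each household in census_a to its best household in census_b."""
    size_a, size_b = defaultdict(int), defaultdict(int)
    for rec in census_a:
        size_a[rec["household"]] += 1
    for rec in census_b:
        size_b[rec["household"]] += 1

    # Collect the matched individuals for every household pair.
    matched_a, matched_b = defaultdict(set), defaultdict(set)
    for a in census_a:
        for b in census_b:
            if is_match(attribute_similarities(a, b)):
                key = (a["household"], b["household"])
                matched_a[key].add(a["id"])
                matched_b[key].add(b["id"])

    # Jaccard-style household score: matched members over the union of members.
    best = {}
    for (ha, hb), ids_a in matched_a.items():
        m = min(len(ids_a), len(matched_b[(ha, hb)]))
        score = m / (size_a[ha] + size_b[hb] - m)
        if ha not in best or score > best[ha][1]:
            best[ha] = (hb, score)
    return {ha: hb for ha, (hb, _) in best.items()}


# Hypothetical toy records, not taken from the census data used in the paper.
census_1851 = [
    {"id": 1, "household": "A1", "first_name": "John", "surname": "Smith", "birth_year": 1820},
    {"id": 2, "household": "A1", "first_name": "Mary", "surname": "Smith", "birth_year": 1822},
    {"id": 3, "household": "A2", "first_name": "John", "surname": "Smith", "birth_year": 1821},
]
census_1861 = [
    {"id": 10, "household": "B1", "first_name": "John", "surname": "Smith", "birth_year": 1820},
    {"id": 11, "household": "B1", "first_name": "Mary", "surname": "Smyth", "birth_year": 1822},
    {"id": 12, "household": "B2", "first_name": "Jon", "surname": "Smith", "birth_year": 1821},
]

links = link_households(census_1851, census_1861)
```

In this toy data, the record for John Smith in household A1 passes the individual-level match test against a record in both B1 and B2, so linking individuals alone would leave A1 with two candidate households. Scoring at the household level resolves the ambiguity, because only B1 also contains a match for Mary.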
