Computer-based genealogy reconstruction in founder populations

This paper describes a software tool that reconstructs entire genealogies from data collected from heterogeneous sources, including municipal and parish records archived over centuries. The tool uses a record linkage algorithm based on rule-based data matching, and applies a general strategy for managing the ambiguities caused by missing, imprecise or erroneous input data. The process is iterative: it combines automatic pedigree reconstruction with software-assisted human data revision, both to improve the quality and accuracy of the results and to optimize the matching rules. The paper discusses the results obtained by reconstructing the entire genealogy of the population of the Val Borbera, a geographically isolated valley in Northern Italy, from data going back as far as the sixteenth century. The resulting pedigree includes 75,994 trios, 58.9% of which belong to a single large family reconstructed over 13 generations.
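To make the idea of rule-based matching concrete, the following is a minimal sketch of how two archival records might be compared. It is not the tool's actual rule set, which the abstract does not specify: the `Record` fields, the `same_person` function and the thresholds `max_name_edits` and `max_year_gap` are all hypothetical, chosen only to illustrate tolerance for spelling variants and for missing or imprecise dates.

```python
from dataclasses import dataclass
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


@dataclass
class Record:
    # Hypothetical record layout; real archival records carry more fields.
    given_name: str
    surname: str
    birth_year: Optional[int]  # None when the source omits the date


def same_person(r1: Record, r2: Record,
                max_name_edits: int = 1, max_year_gap: int = 2) -> bool:
    """Illustrative rule: names may differ by a few edits, years by a small gap."""
    if levenshtein(r1.surname.lower(), r2.surname.lower()) > max_name_edits:
        return False
    if levenshtein(r1.given_name.lower(), r2.given_name.lower()) > max_name_edits:
        return False
    # A missing birth year is treated as unknown, not as a mismatch.
    if r1.birth_year is None or r2.birth_year is None:
        return True
    return abs(r1.birth_year - r2.birth_year) <= max_year_gap


parish = Record("Giovanni", "Rossi", 1750)
municipal = Record("Giovani", "Rossi", 1751)
print(same_person(parish, municipal))
```

In an iterative workflow of the kind described above, pairs that a rule like this leaves ambiguous would be passed to a human reviewer, and thresholds such as these would be adjusted between iterations.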
