A system for exact and approximate genetic linkage analysis of SNP data in large pedigrees

MOTIVATION: The use of dense single nucleotide polymorphism (SNP) data in genetic linkage analysis of large pedigrees is impeded by significant technical, methodological and computational challenges. Here we describe Superlink-Online SNP, a powerful new online system that streamlines the linkage analysis of SNP data. It features a fully integrated, flexible processing workflow comprising both well-known and novel data analysis tools, including SNP clustering, erroneous data filtering, exact and approximate LOD calculations and maximum-likelihood haplotyping. The system draws its power from thousands of CPUs, performing data analysis tasks orders of magnitude faster than a single computer. By providing an intuitive interface to sophisticated state-of-the-art analysis tools coupled with high computing capacity, Superlink-Online SNP helps geneticists unleash the potential of SNP data for detecting disease genes.

RESULTS: Computations performed by Superlink-Online SNP are automatically parallelized using novel paradigms and executed on an unlimited number of private or public CPUs. One novel service is large-scale approximate Markov chain Monte Carlo (MCMC) analysis. The accuracy of the results is estimated by running the same computation on multiple CPUs and evaluating the Gelman-Rubin score, so that unreliable results can be set aside. Another service within the workflow is a novel parallelized exact algorithm for maximum-likelihood haplotyping. The reported system enables genetic analyses that were previously infeasible. We demonstrate the system's capabilities through a study of a large complex pedigree affected with metabolic syndrome.

AVAILABILITY: Superlink-Online SNP is freely available for researchers at http://cbl-hap.cs.technion.ac.il/superlink-snp. The system source code can also be downloaded from the system website.

CONTACT: omerw@cs.technion.ac.il

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
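The convergence check mentioned in the results can be illustrated with a minimal sketch of the standard Gelman-Rubin statistic (the potential scale reduction factor) computed over several independent MCMC runs of the same quantity, for example a sampled LOD score. This is not the system's own code: the function names `gelman_rubin` and `is_reliable` are hypothetical, and the cutoff of 1.1 is a common rule of thumb assumed here, not a value stated in the abstract.

```python
from math import sqrt
from statistics import mean, variance


def gelman_rubin(chains):
    """Return the Gelman-Rubin statistic for equal-length chains of samples.

    `chains` is a list of m lists, each holding n draws of the same scalar
    quantity from an independent MCMC run (hypothetical input layout).
    """
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least 2 chains of equal length >= 2")

    chain_means = [mean(c) for c in chains]
    # W: average of the within-chain sample variances.
    within = mean([variance(c) for c in chains])
    # B: between-chain variance, n times the sample variance of chain means.
    between = n * variance(chain_means)

    if within == 0:
        # Degenerate case: every chain is constant.
        return 1.0 if between == 0 else float("inf")

    # Pooled estimate of the target variance, then the scale reduction factor.
    pooled = (n - 1) / n * within + between / n
    return sqrt(pooled / within)


def is_reliable(chains, threshold=1.1):
    """Flag a result as reliable when the statistic is below `threshold`.

    The default of 1.1 is an assumed rule-of-thumb cutoff, not the system's.
    """
    return gelman_rubin(chains) < threshold
```

Values close to 1 suggest that the independent runs agree with one another; values well above 1 suggest that the chains have not mixed, and the corresponding result would be a candidate for being set aside.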
