Identifying large sets of unrelated individuals and unrelated markers

Background

Genetic analyses in large sample populations are important for understanding the variation between populations, for designing conservation programs, and for detecting rare mutations that may be risk factors for a variety of diseases, among other reasons. However, these analyses frequently assume that the participating individuals or animals are mutually unrelated, which may not be the case in large samples and can lead to erroneous conclusions. To retain as much data as possible while minimizing the risk of false positives, it is useful to identify a large subset of relatively unrelated individuals in the population. This can be done using a heuristic for finding a large set of independent nodes in an undirected graph. We describe a fast randomized heuristic for this purpose. The same methodology can also be used to identify a suitable set of markers for analyzing population stratification, and in other instances where a rapid heuristic for maximal independent sets in large graphs is needed.

Results

We present FastIndep, a fast randomized heuristic algorithm for finding a maximal independent set of nodes in an arbitrary undirected graph, along with an efficient implementation in C++. On a 64-bit Linux or MacOS platform the execution time is a few minutes, even with a graph of several thousand nodes. The algorithm can discover multiple solutions of the same cardinality. FastIndep can be used to discover unlinked markers and unrelated individuals in populations.

Conclusions

The methods presented here provide a quick and efficient way to identify sets of unrelated individuals in large populations and unlinked markers in marker panels. The C++ source code and instructions, along with utilities for generating the input files in the appropriate format, are available at http://taurus.ansci.iastate.edu/wiki/people/jabr/Joseph_Abraham.html
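To make the graph formulation concrete, the following is a minimal sketch of a generic randomized greedy heuristic for maximal independent sets. It is not the FastIndep algorithm, whose details are not given in this abstract. The names `Graph`, `randomMaximalIndependentSet`, `bestOfRestarts` and `isIndependent` are hypothetical, and the sketch assumes the graph has no self-loops and that an edge joins two individuals (or markers) judged to be related (or linked).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Adjacency list (hypothetical layout): adj[i] lists the nodes sharing an
// edge with node i, e.g. individuals whose estimated relatedness to i
// exceeds a user-chosen threshold. Self-loops are assumed absent.
using Graph = std::vector<std::vector<std::size_t>>;

// One randomized greedy pass: visit the nodes in random order and keep a
// node only if none of its neighbours has already been kept. The result is
// a maximal (not necessarily maximum) independent set, sorted by node index.
std::vector<std::size_t> randomMaximalIndependentSet(const Graph& adj,
                                                     std::mt19937& rng) {
    std::vector<std::size_t> order(adj.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::shuffle(order.begin(), order.end(), rng);

    std::vector<char> blocked(adj.size(), 0);  // kept, or adjacent to a kept node
    std::vector<std::size_t> chosen;
    for (std::size_t v : order) {
        if (blocked[v]) continue;
        chosen.push_back(v);
        blocked[v] = 1;
        for (std::size_t u : adj[v]) {
            assert(u < adj.size());  // neighbour indices must be valid
            blocked[u] = 1;
        }
    }
    std::sort(chosen.begin(), chosen.end());
    return chosen;
}

// Repeat the random pass several times and keep the largest set found.
std::vector<std::size_t> bestOfRestarts(const Graph& adj, int restarts,
                                        unsigned seed) {
    std::mt19937 rng(seed);
    std::vector<std::size_t> best;
    for (int r = 0; r < restarts; ++r) {
        std::vector<std::size_t> candidate = randomMaximalIndependentSet(adj, rng);
        if (candidate.size() > best.size()) best = std::move(candidate);
    }
    return best;
}

// Check that no two selected nodes share an edge.
bool isIndependent(const Graph& adj, const std::vector<std::size_t>& nodes) {
    std::vector<char> selected(adj.size(), 0);
    for (std::size_t v : nodes) selected[v] = 1;
    for (std::size_t v : nodes)
        for (std::size_t u : adj[v])
            if (selected[u]) return false;
    return true;
}
```

For example, calling `bestOfRestarts(g, 100, 1u)` on a relatedness graph `g` returns the largest set found in 100 random passes. Because each pass yields only a maximal set, different passes can return different sets, sometimes of the same size, which is why restarts are useful in a heuristic of this kind.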
