Odyssey: a semi-automated pipeline for phasing, imputation, and analysis of genome-wide genetic data

Background
Genome imputation, admixture resolution, and genome-wide association analyses are time-consuming and computationally intensive processes with many composite and requisite steps. Analysis time increases further when the programs required for these analyses must first be built and installed. Scientists who are less versed in programming languages but want to perform these operations hands-on face a lengthy learning curve before they can use the vast number of programs available for these analyses.

Results
The Odyssey pipeline was developed to streamline the entire process into easy-to-use steps for scientists working with big data. Odyssey is a simplified, efficient, semi-automated genome-wide imputation and analysis pipeline. It prepares raw genetic data and performs pre-imputation quality control, phasing, imputation, post-imputation quality control, population stratification analysis, and genome-wide association with statistical data analysis, including result visualization. Odyssey integrates programs such as PLINK, SHAPEIT, Eagle, IMPUTE, Minimac, and several R packages into a seamless, modular workflow controlled via a single user-friendly configuration file. Odyssey was built with compatibility in mind and therefore uses the Singularity container solution, which can be run on Linux, macOS, and Windows platforms. It also scales easily from a simple desktop to a High-Performance System (HPS).

Conclusion
Odyssey automates genome-wide association analysis efficiently and can go from raw genetic data to genome:phenome association visualization and analysis results in 3–8 h on average, depending on the input data, the choice of programs within the pipeline, and the available computer resources. Odyssey was built to be flexible, portable, compatible, scalable, and easy to set up. Biologists less familiar with programming can now work hands-on with their own big data using this pipeline.
