Integrated genome sizing (IGS) approach for the parallelization of whole genome analysis

Background: The use of whole-genome sequencing has increased recently with the rapid progression of next-generation sequencing (NGS) technologies. However, storing raw sequence reads for large-scale genome analysis poses hardware challenges. Despite advances in genome analytic platforms, efficient approaches remain relevant, especially for the human genome. In this study, an Integrated Genome Sizing (IGS) approach is adopted to speed up the analysis of multiple whole genomes in a high-performance computing (HPC) environment. The approach splits a genome (GRCh37) into 630 chunks (fragments), so that multiple chunks can be analyzed in parallel across cohorts.

Results: IGS was integrated on the Maha-Fs HPC system to provide the parallelization required to analyze 2504 whole genomes. Using a single reference pilot genome, NA12878, we compared the NGS processing time between Maha-Fs (NFS, SATA hard disk drive) and SGI-UV300 (solid-state drive). SGI-UV300 was faster, with a processing time of 32.5 min versus 55.2 min on Maha-Fs (about 1.7 times faster).

Conclusions: The implementation of IGS can leverage the ability of HPC systems to analyze multiple genomes simultaneously. We believe this approach will accelerate research in personalized genomic medicine. Our method is comparable to the fastest methods for sequence alignment.
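The chunk-and-parallelize idea described in the Background can be illustrated with a minimal Python sketch. It is not the authors' implementation: the fixed chunk size, the toy contig lengths, and the names make_chunks, analyze_chunk, and run_parallel are all assumptions made for illustration, and the abstract does not state how the 630 GRCh37 chunks were actually defined. The per-chunk "analysis" is a placeholder that only reports the region and its length.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# (contig name, 1-based start, inclusive end)
Region = Tuple[str, int, int]


def make_chunks(contig_lengths: Dict[str, int], chunk_size: int) -> List[Region]:
    """Split every contig into consecutive, non-overlapping regions.

    Fixed-size chunking is an assumption; the real IGS scheme may differ.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    regions: List[Region] = []
    for contig, length in contig_lengths.items():
        for start in range(1, length + 1, chunk_size):
            end = min(start + chunk_size - 1, length)
            regions.append((contig, start, end))
    return regions


def analyze_chunk(region: Region) -> Tuple[str, int]:
    """Hypothetical stand-in for a per-chunk analysis step.

    A real pipeline would run alignment or variant calling restricted to
    this region; here we only return the region label and its length.
    """
    contig, start, end = region
    return f"{contig}:{start}-{end}", end - start + 1


def run_parallel(regions: List[Region], workers: int = 4) -> List[Tuple[str, int]]:
    """Process chunks concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyze_chunk, regions))


# Toy contig lengths, not real GRCh37 values.
TOY_CONTIGS = {"toy1": 1050, "toy2": 400}
CHUNKS = make_chunks(TOY_CONTIGS, chunk_size=500)
RESULTS = run_parallel(CHUNKS)

if __name__ == "__main__":
    for label, n_bases in RESULTS:
        print(label, n_bases)
```

Because each chunk is an independent unit of work, the same region list could be handed to a cluster scheduler instead of a local thread pool, which is the property that lets an HPC system process many chunks, and many genomes, at once.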
