Parallel Accelerated Vector Similarity Calculations for Genomics Applications

The surge in availability of genomic data holds promise for determining the genetic causes of observed individual traits, whether molecular phenotypes such as gene expression or metabolite concentrations, or complex phenotypes such as diseases. However, the growing sizes of these datasets and the quadratic, cubic, or higher-order scaling of the relevant algorithms pose a serious computational challenge that necessitates leadership-scale computing. In this paper we describe a new approach to vector similarity metric calculations, suitable for parallel systems equipped with graphics processing units (GPUs) or Intel Xeon Phi processors. Our primary focus is the Proportional Similarity metric applied to Genome-Wide Association Studies (GWAS) and Phenome-Wide Association Studies (PheWAS). We describe the implementation of the algorithms on accelerated processors, methods for eliminating redundant calculations due to symmetries, and techniques for efficiently mapping the calculations to many-node parallel systems. Results demonstrate high per-node performance and parallel scalability, with rates of more than five quadrillion elementwise comparisons per second achieved on the ORNL Titan system. In a companion paper we describe corresponding techniques applied to calculations of the Custom Correlation Coefficient for comparative genomics applications.
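
For orientation, the following is a minimal serial sketch of the kind of calculation summarized above. It assumes the commonly cited form of the Proportional Similarity (Czekanowski) metric for nonnegative vectors, 2 Σ min(u_i, v_i) / Σ (u_i + v_i), and illustrates symmetry-based elimination of redundant work only in its simplest 2-way form, by visiting each unordered pair once. The function names and the zero-denominator convention are illustrative assumptions; this is not the accelerated implementation described in the paper.

```python
from itertools import combinations


def proportional_similarity(u, v):
    """Return 2 * sum(min(u_i, v_i)) / sum(u_i + v_i) for nonnegative vectors.

    Assumed definition (commonly cited Czekanowski form); the paper's exact
    normalization may differ.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have equal length")
    denom = sum(u) + sum(v)
    if denom == 0:
        # Convention chosen for this sketch: two all-zero vectors score 0.
        return 0.0
    numer = 2.0 * sum(min(a, b) for a, b in zip(u, v))
    return numer / denom


def all_pairs_similarity(vectors):
    """Compute the metric once per unordered pair (i < j).

    Because the metric is symmetric, the (j, i) entries are redundant
    and are never computed.
    """
    results = {}
    for i, j in combinations(range(len(vectors)), 2):
        results[(i, j)] = proportional_similarity(vectors[i], vectors[j])
    return results


# Three small hypothetical vectors, for illustration only.
vectors = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 3.0],
    [3.0, 0.0, 1.0],
]
sims = all_pairs_similarity(vectors)
```

For n vectors this visits n(n-1)/2 pairs rather than n², which is why the cost remains quadratic even after the symmetry is exploited.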
