Parallel Gene Upstream Comparison via Multi-Level Hash Tables on GPU

The region of DNA immediately upstream of a gene body (the upstream region) contains short (8-20 base) sequence motifs that help control when that gene is turned on and off. Unfortunately, these motifs are generally unknown and commonly degenerate. In this work, we propose a motif-finding framework that, given a set of gene upstream regions, performs an all-to-all pairwise comparison and identifies all motifs of length k (k-mers) that are shared by any pair of upstream regions, either exactly or with at most d differing characters. Our framework stores the k-mers found in each gene in a multi-level hash table. The hash table design is optimized for table-to-table comparison rather than for insertion or lookup, is highly parallelizable, and maps easily onto GPUs. We propose four GPU kernels for pairwise hash table comparison, each using a distinct parallelization approach. We study how the hash function, the number of buckets, and other implementation-specific parameters affect the performance of our implementation. Experiments on an average-size yeast genome show that our fastest GPU kernel outperforms an 8-thread, cache-efficient CPU implementation by a factor of ~52x.
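To make the comparison task concrete, the following is a minimal sequential sketch in Python. It is not the multi-level GPU design described above. It assumes a single-level bucketed table, a 2-bit-per-base hash, and a brute-force Hamming check for the at-most-d case. All names (`bucket_of`, `build_table`, `common_exact`, `within_d`, `all_pairs`) and the toy sequences are hypothetical.

```python
from itertools import combinations

BASES = "ACGT"

def bucket_of(kmer, num_buckets):
    """Hypothetical hash: 2-bit encode each base, then reduce modulo num_buckets."""
    code = 0
    for base in kmer:
        code = code * 4 + BASES.index(base)
    return code % num_buckets

def build_table(seq, k, num_buckets):
    """Place each distinct k-mer of seq into one bucket (single-level, for illustration)."""
    table = [set() for _ in range(num_buckets)]
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        table[bucket_of(kmer, num_buckets)].add(kmer)
    return table

def common_exact(table_a, table_b):
    """Exact matches: identical k-mers hash alike, so only same-index buckets are compared."""
    shared = set()
    for bucket_a, bucket_b in zip(table_a, table_b):
        shared |= bucket_a & bucket_b
    return shared

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def within_d(table_a, table_b, d):
    """Brute-force fallback: all k-mer pairs differing in at most d characters."""
    kmers_a = set().union(*table_a)
    kmers_b = set().union(*table_b)
    return {(a, b) for a in kmers_a for b in kmers_b if hamming(a, b) <= d}

def all_pairs(regions, k, num_buckets=8):
    """All-to-all comparison: exact shared k-mers for every pair of regions."""
    tables = {name: build_table(seq, k, num_buckets) for name, seq in regions.items()}
    return {
        (a, b): common_exact(tables[a], tables[b])
        for a, b in combinations(sorted(tables), 2)
    }

if __name__ == "__main__":
    toy = {"g1": "ACGTACGT", "g2": "TTACGTAA", "g3": "GGGGGGGG"}
    for pair, shared in all_pairs(toy, k=4).items():
        print(pair, sorted(shared))
```

In this sketch each pair of regions is an independent unit of work, and within a pair each bucket index is independent as well. Both are natural candidates for parallel execution, although how the four kernels actually divide the work is not specified here.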
