Perfect Hashing Structures for Parallel Similarity Searches

Seed-based heuristics have proved efficient for studying similarity between genetic databases containing billions of base pairs. This paper focuses on algorithms and data structures for the filtering phase of seed-based heuristics, with an emphasis on efficient parallel implementations on GPUs and manycore processors. We propose a two-stage index structure based on neighborhood indexing and perfect hashing techniques. This structure performs the filtering phase over the neighborhood regions around the seeds in constant time and avoids random memory accesses and branch divergence as much as possible. Moreover, it fits parallel SIMD processors particularly well, because it requires intensive but homogeneous computational operations. Using this data structure, we developed a fast and sensitive OpenCL prototype read mapper.
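
As a rough illustration of the idea, the sketch below combines a perfect hash over seeds with neighborhoods stored next to each seed occurrence. It is not the paper's index layout or its OpenCL code: the hash-and-displace construction, the `mix` function, the table sizes, and all names (`build_index`, `query`, `nb_len`, and so on) are assumptions chosen for brevity, and the neighborhood is simply taken as the `nb_len` bases that follow the seed.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
MASK = (1 << 64) - 1

def encode(s):
    """Pack a DNA string into an integer, 2 bits per base."""
    v = 0
    for c in s:
        v = (v << 2) | CODE[c]
    return v

def mix(key, seed):
    """Seeded 64-bit integer mixer (an illustrative choice, not the paper's)."""
    x = (key ^ (seed * 0x9E3779B97F4A7C15)) & MASK
    x = ((x ^ (x >> 33)) * 0xFF51AFD7ED558CCD) & MASK
    x = ((x ^ (x >> 33)) * 0xC4CEB9FE1A85EC53) & MASK
    return x ^ (x >> 33)

def mismatches(a, b, length):
    """Count differing bases between two packed strings, without branches."""
    x = a ^ b
    low_bits = int("01" * length, 2)          # one marker bit per base
    return bin((x | (x >> 1)) & low_bits).count("1")

def build_index(ref, k, nb_len):
    # Collect (position, packed neighborhood) for every seed occurrence.
    occ = {}
    for p in range(len(ref) - k - nb_len + 1):
        seed = encode(ref[p:p + k])
        nb = encode(ref[p + k:p + k + nb_len])
        occ.setdefault(seed, []).append((p, nb))
    n = len(occ)
    m = n + n // 4 + 1                        # number of slots (assumed load)
    r = n // 2 + 1                            # number of first-level buckets
    buckets = [[] for _ in range(r)]
    for key in occ:
        buckets[mix(key, 0) % r].append(key)
    disp = [0] * r
    slot_key = [None] * m
    slot_occ = [()] * m
    # Hash-and-displace: place the largest buckets first, searching for a
    # displacement d that sends all keys of the bucket to distinct free slots.
    for b in sorted(range(r), key=lambda i: -len(buckets[i])):
        if not buckets[b]:
            continue
        d = 1
        while True:
            slots = [mix(key, d) % m for key in buckets[b]]
            if len(set(slots)) == len(slots) and all(
                slot_key[s] is None for s in slots
            ):
                break
            d += 1
        disp[b] = d
        for key, s in zip(buckets[b], slots):
            slot_key[s] = key
            slot_occ[s] = occ[key]
    return {"k": k, "nb_len": nb_len, "disp": disp,
            "slot_key": slot_key, "slot_occ": slot_occ}

def query(index, read, max_mismatches):
    """Return reference positions whose seed matches the start of the read
    and whose neighborhood differs in at most max_mismatches bases."""
    k, nb_len = index["k"], index["nb_len"]
    seed = encode(read[:k])
    nb = encode(read[k:k + nb_len])
    d = index["disp"][mix(seed, 0) % len(index["disp"])]
    s = mix(seed, d) % len(index["slot_key"])
    if index["slot_key"][s] != seed:          # seed absent from the reference
        return []
    return [p for p, ref_nb in index["slot_occ"][s]
            if mismatches(nb, ref_nb, nb_len) <= max_mismatches]
```

With the toy reference `ACGTACGTTGCAACGTAGGA`, `k = 4` and `nb_len = 4`, `query(index, "ACGTACGT", 0)` returns `[0]` and `query(index, "ACGTACGA", 1)` returns `[0, 12]`. The seed lookup costs two hash evaluations and one comparison regardless of index size, and the neighborhood test uses only XOR, shift, mask, and population count, which is the kind of uniform arithmetic that maps well onto SIMD lanes.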
