LISA: Towards Learned DNA Sequence Search

Next-generation sequencing (NGS) technologies have enabled affordable, high-throughput sequencing of billions of short DNA fragments, paving the way for population-scale genomics. Genomics data analytics at this scale requires overcoming performance bottlenecks, such as searching for short DNA sequences within long reference sequences. In this paper, we introduce LISA (Learned Indexes for Sequence Analysis), a novel learning-based approach to DNA sequence search. As a first proof of concept, we focus on accelerating one of the most essential variants of the problem, exact search. LISA builds on and extends the FM-index, the state-of-the-art technique widely deployed in genomics toolchains. Initial experiments with human genome datasets indicate that LISA achieves up to a 4x speedup over its traditional counterpart.
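For readers unfamiliar with the baseline, exact search over an FM-index can be sketched as the textbook backward search shown below. This is a minimal illustration only, not LISA's learned method and not the implementation evaluated in the paper: the function names (`build_fm_index`, `exact_search`), the naive suffix array construction, and the uncompressed occurrence table are all assumptions made for brevity, and the reference is assumed not to contain the sentinel character `$`.

```python
def build_suffix_array(text):
    # Naive construction for illustration; real tools use much faster algorithms.
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_fm_index(reference):
    """Build a toy FM-index (suffix array, C table, occurrence table)."""
    text = reference + "$"  # "$" is a sentinel that sorts before A, C, G, T
    sa = build_suffix_array(text)
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))

    # C[c]: number of characters in the text that are strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)

    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, C, occ


def exact_search(pattern, sa, C, occ):
    """Return the sorted start positions of all exact matches of pattern."""
    lo, hi = 0, len(sa)  # half-open interval of matching suffix array rows
    # Backward search: process the pattern one character at a time, last to first.
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, C, occ = build_fm_index("ACGTACGTTACG")
print(exact_search("ACG", sa, C, occ))
```

Each pattern character costs two occurrence-table lookups in this sketch, so a query takes a number of steps proportional to the pattern length, with each step touching an unpredictable memory location.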
