Pattern Matching for DNA Sequencing Data Using Multiple Bloom Filters

Storing and processing large DNA sequences has long been a major problem because the volume of DNA sequence data keeps growing. A number of solutions have been proposed, but they require significant computation and memory. An efficient storage and pattern matching solution is therefore required for DNA sequencing data. Bloom filters (BFs) are an efficient data structure that, in bioinformatics, is mostly used for the classification of DNA sequences. In this paper, we explore uses of BFs beyond classification. The proposed solution is based on Multiple Bloom Filters (MBFs) and finds all locations and the number of repetitions of a specified pattern inside a DNA sequence. Both of these factors are extremely important in determining the type and intensity of a disease. This paper is a first effort towards optimizing the search for the location and frequency of substrings in DNA sequences using MBFs. It presents a proof-of-concept implementation of the proposed MBF technique for a given set of data, and we expect that further optimization of the solution can bring remarkable results. Performance evaluation shows improved accuracy and time efficiency of the proposed approach.
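
The abstract does not spell out how the MBFs are organized, so the following is only a minimal sketch of one plausible arrangement, not the authors' implementation. It assumes the sequence is split into fixed-size blocks, that each block gets its own Bloom filter holding the k-mers starting in that block, and that a query first consults the filters and then verifies candidate blocks by direct comparison. The names `BloomFilter`, `build_filters`, `find_pattern`, and the parameters `num_bits`, `num_hashes`, and `block_size` are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter; sizes are illustrative assumptions, not tuned."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _indexes(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for i in self._indexes(item):
            self.bits[i] = 1

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[i] for i in self._indexes(item))


def build_filters(sequence, k, block_size):
    """One filter per block, holding every k-mer that starts in the block."""
    last_start = len(sequence) - k + 1  # exclusive bound on k-mer starts
    filters = []
    for start in range(0, last_start, block_size):
        end = min(start + block_size, last_start)
        bf = BloomFilter()
        for pos in range(start, end):
            bf.add(sequence[pos:pos + k])
        filters.append((start, end, bf))
    return filters


def find_pattern(sequence, filters, pattern):
    """Return all start positions of pattern (assumed to have length k)."""
    k = len(pattern)
    positions = []
    for start, end, bf in filters:
        if not bf.might_contain(pattern):
            continue  # the filter rules this block out without scanning it
        # Verify candidates so Bloom false positives never reach the result.
        for pos in range(start, end):
            if sequence[pos:pos + k] == pattern:
                positions.append(pos)
    return positions


seq = "ACGTACGTTTACGA"
filters = build_filters(seq, k=3, block_size=4)
hits = find_pattern(seq, filters, "ACG")
print(hits, len(hits))  # [0, 4, 10] 3
```

In this sketch the filters only prune blocks that cannot contain the pattern. The reported locations and the repetition count come from the verification step, so they are exact regardless of the filters' false-positive rate.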
