BEETL-fastq: a searchable compressed archive for DNA reads

MOTIVATION FASTQ is a standard file format for DNA sequencing data, which stores both nucleotides and quality scores. A typical sequencing study can easily generate hundreds of gigabytes of FASTQ files, while public archives such as ENA and NCBI and large international collaborations such as the Cancer Genome Atlas can accumulate many terabytes of data in this format. Compression tools such as gzip are often used to reduce the storage burden but have the disadvantage that the data must be decompressed before they can be used. Here, we present BEETL-fastq, a tool that not only compresses FASTQ-formatted DNA reads more compactly than gzip but also permits rapid search for k-mer queries within the archived sequences. Importantly, the full FASTQ record of each matching read or read pair is returned, allowing the search results to be piped directly to any of the many standard tools that accept FASTQ data as input. RESULTS We show that 6.6 terabytes of human reads in FASTQ format can be transformed into 1.7 terabytes of indexed files, from where we can search for 1, 10, 100, 1000 and a million of 30-mers in 3, 8, 14, 45 and 567 s, respectively, plus 20 ms per output read. Useful applications of the search capability are highlighted, including the genotyping of structural variant breakpoints and 'in silico pull-down' experiments in which only the reads that cover a region of interest are selectively extracted for the purposes of variant calling or visualization. AVAILABILITY AND IMPLEMENTATION BEETL-fastq is part of the BEETL library, available as a github repository at github.com/BEETL/BEETL.

[1]  R. Durbin,et al.  Efficient de novo assembly of large genomes using compressed data structures. , 2012, Genome research.

[2]  Peter M. Rice,et al.  The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants , 2009, Nucleic acids research.

[3]  Amar Mukherjee,et al.  The Burrows-Wheeler Transform:: Data Compression, Suffix Arrays, and Pattern Matching , 2008 .

[4]  Tak Wah Lam,et al.  GPU-Accelerated BWT Construction for Large Collection of Short Reads , 2014, ArXiv.

[5]  Yann Ponty,et al.  A weighted sampling algorithm for the design of RNA sequences with targeted secondary structure and nucleotide distribution , 2013, Bioinform..

[6]  Kenny Q. Ye,et al.  Mapping copy number variation by population scale genome sequencing , 2010, Nature.

[7]  E. Birney,et al.  Velvet: algorithms for de novo short read assembly using de Bruijn graphs. , 2008, Genome research.

[8]  Shoshana Neuburger,et al.  The Burrows-Wheeler transform: data compression, suffix arrays, and pattern matching by Donald Adjeroh, Timothy Bell and Amar Mukherjee Springer, 2008 , 2010 .

[9]  Giovanna Rosone,et al.  Lightweight BWT Construction for Very Large String Collections , 2011, CPM.

[10]  Huanming Yang,et al.  De novo assembly of human genomes with massively parallel short read sequencing. , 2010, Genome research.

[11]  Cole Trapnell,et al.  Ultrafast and memory-efficient alignment of short DNA sequences to the human genome , 2009, Genome Biology.

[12]  Giovanni Manzini,et al.  Opportunistic data structures with applications , 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[13]  Giovanna Rosone,et al.  Adaptive reference-free compression of sequence quality scores , 2014, Bioinform..

[14]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[15]  Gonçalo R. Abecasis,et al.  The Sequence Alignment/Map format and SAMtools , 2009, Bioinform..

[16]  Giovanna Rosone,et al.  Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform , 2012, Bioinform..

[17]  J. Zook,et al.  Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls , 2013, Nature Biotechnology.

[18]  Yongchao Liu,et al.  Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data , 2013, Bioinform..

[19]  Giovanna Rosone,et al.  Lightweight algorithms for constructing and inverting the BWT of string collections , 2013, Theor. Comput. Sci..

[20]  Kenny Q. Ye,et al.  An integrated map of genetic variation from 1,092 human genomes , 2012, Nature.