Pipelined Multi-FPGA Genomic Data Clustering

High throughput DNA sequencing made individual genome profiling possible and produces very large amounts of data. Today data and associated metadata are stored in FASTQ text file assemblies carrying the information of genome fragments called reads. Current techniques rely on mapping these reads to a common reference genome for compression and analysis. However, about 10% of the reads do not map to any known reference making them difficult to compress or process. These reads are of high importance because they hold information absent from any reference. Finding overlaps in these reads can help subsequent processing and compression tasks tremendously. Within this context clustering is used to find overlapping unmapped reads and sort them in groups. Clustering being an extremely time consuming task a modular multi-FPGA pipeline was designed and is the focus of this paper. A pipeline with 6 FPGAs was created and has shown a speed-up of \(\times 5\) compared to existing FPGA implementations. Resulting enriched files encoding reads and clustering results show file sizes within a 10% margin of the best DNA compressors while providing valuable extra information.

[1]  Sebastián Isaza,et al.  Performance comparison of sequential and parallel compression applications for DNA raw data , 2016, The Journal of Supercomputing.

[2]  Szymon Grabowski,et al.  Compression of DNA sequence reads in FASTQ format , 2011, Bioinform..

[3]  Huanming Yang,et al.  De novo assembly of human genomes with massively parallel short read sequencing. , 2010, Genome research.

[4]  George A. Constantinides,et al.  FPGA-based K-means clustering using tree-based data structures , 2013, 2013 23rd International Conference on Field programmable Logic and Applications.

[5]  Guillaume Rizk,et al.  Whole genome re-sequencing : lessons from unmapped reads , 2013 .

[6]  Huseyin Seker,et al.  FPGA implementation of K-means algorithm for bioinformatics application: An accelerated approach to clustering Microarray data , 2011, 2011 NASA/ESA Conference on Adaptive Hardware and Systems (AHS).

[7]  M. Schatz,et al.  Big Data: Astronomical or Genomical? , 2015, PLoS biology.

[8]  Giovanna Rosone,et al.  Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform , 2012, Bioinform..

[9]  Anil K. Jain Data clustering: 50 years beyond K-means , 2010, Pattern Recognit. Lett..

[10]  Sara P. Garcia,et al.  GReEn: a tool for efficient compression of genome resequencing data , 2011, Nucleic acids research.

[11]  Yann Thoma,et al.  Genomic Data Clustering on FPGAs for Compression , 2017, ARC.

[12]  Ke-Lin Du,et al.  Clustering: A neural network approach , 2010, Neural Networks.

[13]  Markus Hsi-Yang Fritz,et al.  Efficient storage of high throughput DNA sequencing data using reference-based compression. , 2011, Genome research.