Accelerated implementation of FQSqueezer novel genomic compression method

Biological data carry information that is essential for genome analysis. Over the last decades, the volume of these data has grown constantly, and Next Generation Sequencing (NGS) data have been introduced. These data are represented in different formats, such as FASTQ, FASTA, and SAM. Because of their large size, several compressors have been developed to make their storage and analysis practical. FQSqueezer is a novel genomic compressor for FASTQ files. However, several issues arise from its multithreaded version, which runs on multi-core hardware: it is well known that the number of cores in a CPU is limited and much smaller than the number of cores in a GPU. To improve the performance of this compression method, in this work we present a GPU-parallel implementation of FQSqueezer that exploits the CUDA framework. More precisely, a suitable domain decomposition yields an appreciable performance gain in terms of execution time and reliability. Several execution tests confirm the efficiency gain achieved by our parallel implementation.
