HRCM: An Efficient Hybrid Referential Compression Method for Genomic Big Data

School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
School of Computer and Software, Nanjing Institute of Industry Technology, Nanjing 210023, China
Jiangsu High Technology Research Key Laboratory for Wireless Sensor Networks, Nanjing 210023, China
Institute of High Performance Computing and Big Data, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
School of Software and Electrical Engineering, Swinburne University of Technology, Melbourne 3122, Australia
