BAQALC: Blockchain Applied Lossless Efficient Transmission of DNA Sequencing Data for Next Generation Medical Informatics

Advances in high-throughput DNA sequencing technology have significantly reduced genome-sequencing costs, leading to a number of revolutionary advances in the genetics industry. However, while the time and cost of DNA sequencing have fallen, managing the resulting large volumes of data remains a problem. In addition, public sequence databases have security and reliability issues. This research therefore proposes Blockchain Applied FASTQ and FASTA Lossless Compression (BAQALC), a lossless compression algorithm for the efficient transmission and storage of the immense amounts of DNA sequence data generated by Next Generation Sequencing (NGS). To evaluate it, compression ratios were compared for genetic biomarkers corresponding to the five diseases with the highest mortality rates according to the World Health Organization. The results showed an average compression ratio of approximately 12 across all the genetic datasets used. BAQALC performed especially well on lung cancer genetic markers, with a compression ratio of 17.02. BAQALC outperformed not only widely used compression algorithms but also algorithms described in previously published research. The proposed solution is envisioned to contribute to an efficient and secure transmission and storage platform for next-generation medical informatics based on smart devices, for both researchers and healthcare users.
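For readers unfamiliar with the metric, the compression ratios quoted above are assumed here to follow the usual definition: original size divided by compressed size, so a ratio of 12 would mean the compressed file is one twelfth of the original. The sketch below illustrates that arithmetic only. It uses Python's general-purpose zlib as a stand-in, not BAQALC, and the FASTA-style record and the helper name `compression_ratio` are hypothetical.

```python
import zlib


def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Return original size divided by compressed size (higher is better)."""
    if not compressed:
        raise ValueError("compressed payload is empty")
    return len(original) / len(compressed)


# Hypothetical FASTA-style record: a header line followed by a repetitive sequence.
record = b">example_read\n" + b"ACGTACGTTTGACCA" * 200 + b"\n"

# zlib is a stand-in general-purpose compressor, not the algorithm from the paper.
packed = zlib.compress(record, 9)

# Lossless means the round trip reproduces the input byte for byte.
assert zlib.decompress(packed) == record

ratio = compression_ratio(record, packed)
print(f"{len(record)} bytes -> {len(packed)} bytes, ratio {ratio:.2f}")
```

The ratio this prints reflects the artificially repetitive toy input and says nothing about the performance of BAQALC or any other tool on real sequencing data.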
