A Distributed Algorithm for Quality Assessment of Biological Sequencing Based on MapReduce

DNA sequencing technology, especially Illumina’s sequencers, has played an important role in the life sciences and is used in a growing number of genomic and transcriptomic projects. Given the huge volume of sequencing data these projects produce, assessing data quality quickly is a challenge. In this paper, we develop a distributed algorithm based on MapReduce that assesses the quality of biological sequencing data in parallel. To validate the algorithm, we tested data sets of different sizes (1G to 20G) on different numbers of computing nodes (1 to 20) on the Hadoop platform. The results show that parallel efficiency improves continuously as the data size and the number of computing nodes increase, and that the algorithm achieves better parallel efficiency when the data size exceeds 5Gb and more than 10 processors are used. This work effectively reduces the time required for quality assessment of biological sequencing data.
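To make the MapReduce formulation concrete, the following is a minimal single-process sketch in Python, not the implementation described in this paper. It assumes the quality metric is the mean Phred score per read position, that the input is four-line FASTQ records, and that scores use the Phred+33 encoding. All function names are hypothetical. The last function uses the standard definition of parallel efficiency, E = T1 / (p * Tp), which we assume is the measure reported above.

```python
from collections import defaultdict
from itertools import islice

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def read_fastq_qualities(lines):
    """Yield the quality string (4th line) of each 4-line FASTQ record."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            return
        yield record[3].rstrip("\n")


def map_phase(quality_string):
    """Map step: emit (position, (score, 1)) for every base of one read."""
    for pos, ch in enumerate(quality_string):
        yield pos, (ord(ch) - PHRED_OFFSET, 1)


def reduce_phase(pairs):
    """Reduce step: sum scores and counts per position, return the means."""
    totals = defaultdict(lambda: [0, 0])
    for pos, (score, count) in pairs:
        totals[pos][0] += score
        totals[pos][1] += count
    return {pos: s / n for pos, (s, n) in sorted(totals.items())}


def mean_quality_per_position(lines):
    """Run the map and reduce steps in-process over FASTQ lines."""
    pairs = (kv for q in read_fastq_qualities(lines) for kv in map_phase(q))
    return reduce_phase(pairs)


def parallel_efficiency(t_serial, t_parallel, n_nodes):
    """Standard parallel efficiency: E = T1 / (p * Tp)."""
    return t_serial / (n_nodes * t_parallel)


if __name__ == "__main__":
    fastq = ["@r1", "ACGT", "+", "IIII", "@r2", "ACG", "+", "5I!"]
    print(mean_quality_per_position(fastq))
    print(parallel_efficiency(100.0, 12.5, 10))
```

Because the map step emits (sum, count) pairs rather than averages, partial results from different nodes can be combined in any order, which is what lets the same logic be distributed across Hadoop workers.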
