GapReduce: A Gap Filling Algorithm Based on Partitioned Read Sets

With advances in sequencing and assembly technologies, draft sequences are available for more and more genomes. However, these draft sequences commonly contain gaps, which affect various downstream analyses in biological studies. Gap filling methods can shorten the gaps and improve the completeness of draft genome sequences. Although several gap filling tools have been developed, their effectiveness and accuracy need to be improved. In this study, we develop a novel tool, called GapReduce, which fills gaps using paired reads. For each gap, GapReduce selects the reads whose mate reads are aligned to the left or the right flanking region, and partitions these reads into two sets. GapReduce then adopts different $k$ values and $k$-mer frequency thresholds to iteratively construct De Bruijn graphs, which are used to find the correct path to fill the gap. 
To overcome the branching problems caused by repetitive regions and sequencing errors during path selection, GapReduce uses a novel approach that simultaneously considers $k$-mer frequency and the distribution of paired reads based on the partitioned read sets. We compare the performance of GapReduce with that of current popular gap filling tools. The experimental results demonstrate that GapReduce can produce satisfactory gap filling results, especially for long insert size datasets. GapReduce is publicly available for download at https://github.com/bioinfomaticsCSU/GapReduce.
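
The general idea of partitioning reads by where their mates align, and then retrying De Bruijn graph construction under several $k$ and frequency settings, can be illustrated with a minimal Python sketch. This is not GapReduce's implementation: the function names (`partition_reads`, `build_graph`, `find_path`, `fill_gap`), the input layout, the `flank` and `max_gap` parameters, and the plain depth-first search are all assumptions made for illustration. In particular, the sketch does not model the paired-read distribution scoring used to resolve branches.

```python
from collections import Counter


def partition_reads(pairs, gap_start, gap_end, flank=500):
    """Split reads into two sets by where their mate aligned (hypothetical layout).

    pairs: iterable of (mate_position, read_sequence).
    """
    left, right = [], []
    for mate_pos, seq in pairs:
        if gap_start - flank <= mate_pos < gap_start:
            left.append(seq)    # mate lies in the left flanking region
        elif gap_end <= mate_pos < gap_end + flank:
            right.append(seq)   # mate lies in the right flanking region
    return left, right


def build_graph(reads, k, min_freq):
    """De Bruijn graph: nodes are (k-1)-mers, edges are k-mers seen >= min_freq times."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    graph = {}
    for kmer, count in counts.items():
        if count >= min_freq:
            graph.setdefault(kmer[:-1], set()).add(kmer[1:])
    return graph


def find_path(graph, source, target, max_len):
    """Depth-first search for a sequence from source to target, at most max_len bases."""
    stack = [(source, source)]
    while stack:
        node, seq = stack.pop()
        if node == target and len(seq) > len(source):
            return seq
        if len(seq) >= max_len:
            continue  # length bound keeps the search finite even with cycles
        for nxt in sorted(graph.get(node, ())):
            stack.append((nxt, seq + nxt[-1]))
    return None


def fill_gap(left_flank, right_flank, reads, settings, max_gap):
    """Try each (k, min_freq) setting in order; return the first gap sequence found."""
    for k, min_freq in settings:
        if len(left_flank) < k - 1 or len(right_flank) < k - 1:
            continue  # flanks too short to anchor a path at this k
        source, target = left_flank[-(k - 1):], right_flank[:k - 1]
        graph = build_graph(reads, k, min_freq)
        path = find_path(graph, source, target, 2 * (k - 1) + max_gap)
        if path is not None:
            return path[k - 1:len(path) - (k - 1)]  # strip the two anchors
    return None


if __name__ == "__main__":
    # Toy data: overlapping reads drawn from a short made-up sequence.
    full = "ACGTTGCATCGGATTACAGGCT"
    reads = [full[i:i + 10] for i in range(0, 13, 2)]
    print(fill_gap("ACGTTGCA", "TACAGGCT", reads, [(7, 2), (5, 2)], max_gap=20))
```

In this sketch the settings list is ordered from stricter to looser, so a larger $k$ is tried first and a smaller $k$ is used only as a fallback. The two sets returned by `partition_reads` would simply be concatenated before being passed to `fill_gap`.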
