LSLS: A Novel Scaffolding Method Based on Path Extension

Scaffolding, which determines the orientation and order of fragmented contigs, is an essential step in assembly pipelines and makes assembly results more complete. Most existing scaffolding tools adopt the scaffold graph approach. However, constructing an accurate scaffold graph remains a challenging task. Removing potential false relationships is key to achieving better scaffolding performance, yet most scaffolding approaches neglect the impact of uneven sequencing depth, which may cause more sequencing errors and ultimately result in many false relationships. In this paper, we present a new scaffolding method, LSLS (Loose-Strict-Loose Scaffolding), which is based on path extension. LSLS uses different strategies to extend paths, making it more adaptive to different sequencing depths. To handle the problem of multiple candidate paths, we design a score function, based on the distribution of read pairs, that evaluates the reliability of path candidates; each path is extended with the candidate that has the highest score. In addition, LSLS includes a new gap estimation method that estimates gap sizes more precisely. Experimental results on two standard datasets show that LSLS achieves better performance.
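The idea of extending a path with its highest-scoring candidate can be illustrated with a minimal sketch. This is not the LSLS implementation: the abstract does not give the score function, so the sketch assumes a stand-in score (the total number of read pairs linking a candidate contig to the contigs already in the path), and the names `link_support`, `score`, `extend_path`, and `min_score` are hypothetical. Varying `min_score` only hints at how a looser or stricter acceptance rule might behave.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical input: link_support[(a, b)] is the number of read pairs
# whose mates map to contig a and contig b (a assumed to precede b).
LinkSupport = Dict[Tuple[str, str], int]


def score(path: List[str], candidate: str, link_support: LinkSupport) -> int:
    """Stand-in score: total read-pair support between the candidate and
    every contig already in the path. The real LSLS score is based on the
    distribution of read pairs and is not reproduced here."""
    return sum(link_support.get((c, candidate), 0) for c in path)


def extend_path(seed: str, contigs: List[str],
                link_support: LinkSupport, min_score: int) -> List[str]:
    """Greedily extend a path from `seed`, always taking the unused contig
    with the highest score, and stop when no candidate reaches `min_score`."""
    path = [seed]
    used = {seed}
    while True:
        best: Optional[str] = None
        best_score = 0
        for cand in contigs:
            if cand in used:
                continue
            s = score(path, cand, link_support)
            if s > best_score:
                best, best_score = cand, s
        # Stop if nothing is supported, or the best candidate is too weak.
        if best is None or best_score < min_score:
            return path
        path.append(best)
        used.add(best)


# Toy data (invented for illustration only).
links = {("A", "B"): 5, ("A", "C"): 2, ("B", "C"): 4, ("C", "D"): 1}
contigs = ["A", "B", "C", "D"]

strict_path = extend_path("A", contigs, links, min_score=2)
loose_path = extend_path("A", contigs, links, min_score=1)
```

On this toy input, the stricter threshold stops before the weakly supported contig D, while the looser threshold accepts it. A higher threshold guards against false links, and a lower one recovers more joins.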
