Inverted Repeats Scaffolding for a Dedicated Chloroplast Genome Assembler

This paper describes a novel assembly approach for chloroplast genomes. It consists of two modular steps. The first step relies on the hypothesis that chloroplast genomes are over-represented compared to the nuclear genome in a plant cell: we assemble contigs with a de Bruijn graph-based approach, using only short reads with high k-mer coverage. This step also provides the connections between oriented contigs. The second step, scaffolding, determines the order and orientation of the contigs. Taking advantage of the well-studied circular structure of chloroplast genomes, we develop a specific formulation of the scaffolding problem, called Nested Inverted Fragments Scaffolding, which aims to assemble the highly conserved inverted repeats. We formulate it as an optimisation problem and prove that it is NP-complete. To solve the problem, we propose and implement an integer linear programming formulation. We evaluate our method on a set of real instances (a benchmark of 42 chloroplast genomes) and show that it produces high-quality results. To further assess the performance of our scaffolding module, we test it on very large, artificially created instances. The results demonstrate excellent behaviour of our integer formulation: even very large instances were solved at the first Branch & Bound node.
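The coverage-based selection in the first step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the canonical k-mer convention, and the fixed `min_count` threshold are all assumptions made for illustration. The idea is only that k-mers seen many times are more likely to come from the over-represented chloroplast genome than from the nuclear genome.

```python
from collections import Counter

# Translation table for complementing DNA bases (assumes an ACGT-only alphabet).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical k-mers over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts


def abundant_kmers(reads, k, min_count):
    """Keep only k-mers whose count reaches min_count (hypothetical threshold)."""
    counts = count_kmers(reads, k)
    return {kmer for kmer, count in counts.items() if count >= min_count}


# Toy data: one read repeated five times (high coverage) and one seen once.
reads = ["ACGTAC"] * 5 + ["GGGTTT"]
kept = abundant_kmers(reads, k=4, min_count=3)
```

In this toy input, only the k-mers of the repeated read pass the threshold. A real pipeline would build the de Bruijn graph from the retained k-mers and would have to choose the threshold from the observed coverage distribution, which this sketch does not attempt.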
