Phylogenetic comparative assembly

Background: Recent high throughput sequencing technologies are capable of generating a huge amount of data for bacterial genome sequencing projects. Although current sequence assemblers successfully merge the overlapping reads, often several contigs remain which cannot be assembled any further. It is still costly and time consuming to close all the gaps in order to acquire the whole genomic sequence.

Results: Here we propose an algorithm that takes several related genomes and their phylogenetic relationships into account to create a graph that contains the likelihood for each pair of contigs to be adjacent. Subsequently, this graph can be used to compute a layout graph that shows the most promising contig adjacencies in order to aid biologists in finishing the complete genomic sequence. The layout graph shows unique contig orderings where possible, and the best alternatives where necessary.

Conclusions: Our new algorithm for contig ordering uses sequence similarity as well as phylogenetic information to estimate adjacencies of contigs. An evaluation of our implementation shows that it performs better than recent approaches while being much faster at the same time.
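
The idea of a contig adjacency graph weighted by phylogenetic closeness can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`adjacency_scores`, `greedy_layout`), the `1 / (1 + distance)` reference weighting, the exponential gap penalty with its `gap_scale` parameter, and the greedy edge selection are all assumptions chosen for brevity, and the input data is made up. The sketch assumes each contig has already been mapped to a single interval on each reference genome.

```python
from collections import defaultdict
from itertools import combinations
import math


def adjacency_scores(matches, tree_distance, gap_scale=1000.0):
    """Score every contig pair by how close the two contigs map on each reference.

    matches:       {reference: {contig: (start, end)}}  (hypothetical input format)
    tree_distance: {reference: phylogenetic distance to the target genome}
    """
    scores = defaultdict(float)
    for ref, placed in matches.items():
        # Assumption: closer relatives get more say in the final score.
        ref_weight = 1.0 / (1.0 + tree_distance[ref])
        for a, b in combinations(sorted(placed), 2):
            (s1, e1), (s2, e2) = placed[a], placed[b]
            # Distance between the two mapped intervals (0 if they overlap).
            gap = max(max(s1, s2) - min(e1, e2), 0)
            # Assumption: support decays exponentially with the gap length.
            scores[(a, b)] += ref_weight * math.exp(-gap / gap_scale)
    return dict(scores)


def greedy_layout(scores):
    """Pick high-scoring adjacencies so that every contig has at most two
    neighbours and no cycle is formed (a simple stand-in for a layout step)."""
    degree = defaultdict(int)
    parent = {}

    def find(x):
        # Union-find with path halving, used to detect cycles.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    for (a, b), w in sorted(scores.items(), key=lambda kv: -kv[1]):
        if degree[a] < 2 and degree[b] < 2 and find(a) != find(b):
            parent[find(a)] = find(b)
            degree[a] += 1
            degree[b] += 1
            chosen.append((a, b, w))
    return chosen


# Made-up example: three contigs mapped onto two reference genomes.
matches = {
    "refA": {"c1": (0, 100), "c2": (150, 300), "c3": (5000, 6000)},
    "refB": {"c1": (0, 100), "c3": (120, 400)},
}
tree_distance = {"refA": 0.1, "refB": 2.0}

scores = adjacency_scores(matches, tree_distance)
layout = greedy_layout(scores)
```

In this toy input the close relative `refA` places `c1` next to `c2`, while the more distant `refB` places `c1` next to `c3`; both adjacencies are kept because they are compatible, and the weak `c2`/`c3` edge is dropped since it would close a cycle. A layout graph as described in the abstract would additionally keep competing alternatives where no unique ordering exists, which this greedy sketch does not attempt.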
