A stochastic de novo assembly algorithm for viral-sized genomes obtains correct genomes and builds consensus

Abstract: A genetic algorithm with stochastic macro-mutation operators that merge, split, move, reverse and align DNA contigs on a scaffold is shown to accurately and consistently assemble raw DNA reads from an accurately sequenced single-read library into a contiguous genome. A candidate solution is a permutation of DNA reads, segmented into contigs. An interleaved merge operator for contigs allows quick minimization of a fitness function that measures the string length of a candidate solution. This study assembles read libraries for three genomic fragments from different organisms, five complete virus genomes, and one complete bacterial genome; the largest genome is 159 kbp long. To evaluate the accuracy of an assembled genome, test libraries of DNA reads are generated from reference genomes, and each assembly is compared to its reference. The method has very high assembly accuracy: over repeated assemblies of each input genome, the original genome was reconstructed optimally in over 85% of the runs. Given this consistency, the method is suitable for determining the consensus genome in de novo assembly problems. The method has two limitations: genomes with long repeats may be overcompressed, and the computational complexity is high.
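
The candidate representation and fitness function described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names (`overlap`, `contig_string`, `fitness`, `merge`), the toy reads, and the plain concatenating merge (a much simpler stand-in for the interleaved merge operator) are assumptions made for illustration, and the sketch assumes error-free reads taken from a single strand.

```python
from typing import List

Contig = List[str]          # an ordered list of reads
Candidate = List[Contig]    # a permutation of reads, segmented into contigs


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def contig_string(reads: Contig) -> str:
    """Collapse the reads of one contig into a string, merging exact overlaps."""
    s = reads[0]
    for r in reads[1:]:
        s += r[overlap(s, r):]
    return s


def fitness(candidate: Candidate) -> int:
    """Total string length of the candidate; lower is better."""
    return sum(len(contig_string(c)) for c in candidate)


def merge(candidate: Candidate, i: int, j: int) -> Candidate:
    """Return a new candidate in which contig j is appended to contig i.

    Hypothetical simplification: the reads of j are concatenated after those
    of i, rather than interleaved as in the operator named in the abstract.
    """
    merged = candidate[i] + candidate[j]
    rest = [c for k, c in enumerate(candidate) if k not in (i, j)]
    return rest + [merged]


# Toy reads drawn from the made-up reference "ATGCGTACGTTAG".
reads = ["ATGCGT", "CGTACG", "ACGTTAG"]
start = [[r] for r in reads]        # every read starts as its own contig
step1 = merge(start, 0, 1)          # join the first two reads
step2 = merge(step1, 1, 0)          # append the remaining read
print(fitness(start), fitness(step1), fitness(step2))
print(contig_string(step2[0]))
```

In this toy run each merge that places overlapping reads next to each other shortens the candidate, which is the signal a length-based fitness rewards. It also hints at the limitation the abstract names: a repeat longer than a read can produce a spurious overlap, so the shortest string may be shorter than the true genome.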
