De Novo DNA Assembly with a Genetic Algorithm Finds Accurate Genomes Even with Suboptimal Fitness

We design an evolutionary heuristic for the combinatorial problem of de novo DNA assembly from short, overlapping, accurately sequenced single DNA reads of uniform length, taken from both strands of a genome without long repeated sequences. A candidate solution is represented as a novel segmented permutation: an ordering of DNA reads into contigs, and of contigs into a DNA scaffold. Mutation and crossover operators work at the contig level. The fitness function minimizes the total length of the scaffold (i.e., the sum of the lengths of the overlapped contigs) and the number of contigs on the scaffold. We evaluate the algorithm on read libraries uniformly sampled from genomes 3835 to 48502 base pairs long, with genome coverage between 5 and 7, and verify the biological accuracy of the resulting scaffolds by comparing them against reference genomes. We find the correct genome as a contig string on the DNA scaffold in over 95% of all assembly runs. For the smaller read sets, the scaffold consists of only the correct contig. For the larger read libraries, the fitness of the solution is suboptimal and chaff contigs are present; however, a simple post-processing step can realign the chaff onto the correct genome. The results support the idea that this heuristic can be used for consensus building in de novo assembly.
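The segmented permutation and the two quantities in the fitness function can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made for illustration only: the names `Contig`, `Scaffold`, `contig_sequence`, and `fitness` are hypothetical, each read carries an orientation flag to account for reads from both strands, reads within a contig are merged by their longest exact suffix-prefix overlap, and the two objectives are returned as a tuple compared lexicographically (the abstract does not say how they are combined).

```python
from typing import List, Tuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite DNA strand."""
    return seq.translate(COMPLEMENT)[::-1]

def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

# Assumed encoding: a read is referenced by (index, use_reverse_complement).
OrientedRead = Tuple[int, bool]
Contig = List[OrientedRead]    # ordered reads forming one contig
Scaffold = List[Contig]        # ordered contigs; together a permutation of all reads

def contig_sequence(contig: Contig, reads: List[str]) -> str:
    """Merge the reads of a contig in order, collapsing exact overlaps."""
    seq = ""
    for idx, rev in contig:
        read = reverse_complement(reads[idx]) if rev else reads[idx]
        seq += read[overlap(seq, read):]
    return seq

def fitness(scaffold: Scaffold, reads: List[str]) -> Tuple[int, int]:
    """Return (total scaffold length, number of contigs); smaller is better."""
    total = sum(len(contig_sequence(c, reads)) for c in scaffold)
    return total, len(scaffold)

# Toy example: two reads covering the 9-base sequence ACGTACGGA,
# the second one sampled from the opposite strand.
reads = ["ACGTAC", "TCCGTA"]
one_contig = [[(0, False), (1, True)]]
two_contigs = [[(0, False)], [(1, True)]]
print(fitness(one_contig, reads))   # (9, 1)
print(fitness(two_contigs, reads))  # (12, 2)
```

In this toy case, placing both reads in one contig yields a shorter scaffold with fewer contigs, so it scores better on both quantities than leaving the reads as separate contigs.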
