Scoring-and-Unfolding Trimmed Tree Assembler: Algorithms for Assembling Genome Sequences Accurately and Efficiently

Recent advances in DNA sequencing technology, and their many potential applications to biology and medicine, have rekindled enormous interest in several classical algorithmic problems at the core of genomics and computational biology, primarily the whole-genome sequence assembly problem (WGSA). Two decades ago, in the context of the Human Genome Project, the problem received unprecedented scientific prominence: its computational complexity and intractability were thought to be well understood, various competing heuristics had been thoroughly explored, and the necessary software had been properly implemented and validated. However, several recent studies focusing on the experimental validation of de novo assemblies have highlighted limitations of the current assemblers. Motivated by these negative results, this dissertation reinvestigates the algorithmic techniques required to assemble genomes correctly and efficiently. Because of its connection to a well-known NP-complete combinatorial optimization problem, WGSA has historically been assumed to be amenable only to greedy and heuristic methods. By making efficiency their priority, these methods opted to rely on local searches and are thus inherently approximate, ambiguous, or error-prone. This dissertation presents a novel sequence assembler, SUTTA, that dispenses with the idea of limiting the solutions to approximate ones. Instead, it favors an approach that could potentially lead to an exhaustive (exponential-time) search of all possible layouts, but tames the complexity through constrained search (Branch-and-Bound) and quick identification and pruning of implausible solutions. Complementary to this problem is the task of validating the generated assemblies. Unfortunately, no commonly accepted method exists yet, and the widely used metrics for comparing assembled sequences emphasize only size, capturing quality and accuracy poorly.
This dissertation also addresses these concerns by developing a more comprehensive metric, the Feature-Response Curve, which draws on ideas from the classical receiver operating characteristic (ROC) curve to capture the trade-off between contiguity and quality more faithfully. Finally, this dissertation demonstrates the advantages of a complete pipeline that integrates base-calling (TotalReCaller) with assembly (SUTTA) in a Bayesian manner.
