Shotgun Sequence Assembly

Abstract

Shotgun sequencing is the most widely used technique for determining the DNA sequence of organisms. It involves breaking the DNA into many small pieces that can be read by automated sequencing machines, then reconstructing the original genome with specialized software programs called assemblers. Because of the large volume of data generated and the complex structure of most organisms' genomes, successful assembly programs rely on sophisticated algorithms that draw on fields as diverse as statistics, graph theory, computer science, and computer engineering. In this chapter we describe the main computational challenges posed by the shotgun sequencing method and survey the most widely used assembly algorithms.
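
To make the idea of "piecing together" concrete, the following is a minimal sketch of a greedy overlap-merge assembler. It is an illustration only, not the method of any particular assembler surveyed in the chapter. The function names (`overlap`, `greedy_assemble`) and the `min_len` threshold are hypothetical, and the sketch assumes error-free reads taken from a single strand of a genome with no repeats longer than the overlap threshold.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the longest exact overlap.

    Assumptions (illustrative only): reads are error-free, all come from
    the same strand, and the genome has no long repeats. Real assemblers
    must handle all three complications.
    """
    # Remove duplicates while preserving order.
    reads = list(dict.fromkeys(reads))
    # Drop reads fully contained in another read.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]

    while len(reads) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: the result stays in several contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Toy example: three overlapping reads sampled from a 16-base sequence.
genome = "ATGGCGTACCTGAATC"
reads = [genome[8:16], genome[0:8], genome[4:12]]
contigs = greedy_assemble(reads)
```

The quadratic all-pairs overlap search is kept for clarity; it would not scale to the data volumes the abstract mentions, which is one reason practical assemblers rely on more sophisticated indexing and graph algorithms.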
