Computational methods for fast and accurate DNA fragment assembly

As advances in technology produce ever more DNA sequencing data in ever less time, computational methods must be developed that allow data analysis to keep pace. In this dissertation, I present methods that improve the speed and accuracy of DNA fragment assembly.

One critical requirement of automatic methods for fragment assembly is accuracy. Currently, to ensure accurate sequences, human editors must examine the data underlying questionable base calls to determine the correct call. This manual process is both error-prone and time-consuming. Automatic methods that yield high accuracy and few questionable calls can reduce errors and lessen the need for manual inspection. In my work, I developed a method, Trace-Evidence, that automatically produces highly accurate consensus sequences, even with few aligned sequences. Most assembly programs analyze only base calls when determining a consensus sequence. The key to the high accuracy is that I incorporate morphological information about the underlying ABI trace data. This is accomplished through a new representation of traces, TraceClass, that characterizes the height and shape of traces. The new representation not only yields high accuracy when used in consensus-calling methods, but also produces improved results when used to remove poor-quality data and when used as input to neural networks for consensus determination.

The need for fast processing is becoming more important as the size of sequencing projects increases. Almost all existing fragment assembly programs perform pairwise comparisons of ...
