Analysis of DNA sequences

Recent developments in the statistical analysis of DNA sequences are reviewed. The pace at which sequence data are being generated and analysed has increased with the growth of the Human Genome Project. Two areas of activity are emphasized: attention to error rates in recorded sequences, and heterogeneity in the structure of sequences. There is now empirical evidence suggesting error rates in the range 0.1% to 1%, and such rates will affect evolutionary studies, since they are comparable to the rates at which DNA sequences from different individuals are expected to differ. Heterogeneity in such quantities as base composition, or in the lengths between successive subsequences of specified types, may be sufficient to account for observed long-range correlations between bases. The need for statistical models and analyses of DNA sequence data will continue, and will offer interesting challenges.
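The point about error rates being comparable to true differences between individuals can be made concrete with a small calculation. The sketch below is illustrative only and is not taken from the review: it assumes independent sites, an error probability e that applies independently to each recorded base, and errors that are equally likely to produce any of the three other bases. The function name and the example value of d are hypothetical.

```python
def observed_difference(d: float, e: float) -> float:
    """Expected fraction of sites at which two recorded sequences differ.

    Assumptions (illustrative, not from the review): sites are independent,
    the two true sequences differ at a fraction d of sites, each recorded base
    is wrong with probability e independently in each sequence, and a wrong
    base is equally likely to be any of the three other bases.
    """
    # True bases equal: reads agree if both are correct,
    # or if both err onto the same wrong base.
    match_if_same = (1 - e) ** 2 + e ** 2 / 3
    # True bases differ: reads agree if exactly one errs onto the other's base,
    # or if both err onto the same one of the two remaining bases.
    match_if_diff = 2 * (1 - e) * (e / 3) + 2 * (e / 3) ** 2
    return (1 - d) * (1 - match_if_same) + d * (1 - match_if_diff)


if __name__ == "__main__":
    d = 0.001  # assumed true per-site difference between two individuals
    for e in (0.0, 0.001, 0.01):
        print(f"e={e:.3f}  true={d:.4f}  observed={observed_difference(d, e):.4f}")
```

Under these assumptions the observed difference is, to first order, about d + 2e, so an error rate of the same size as the true difference would roughly triple the apparent divergence between two sequences.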
