Chance and statistical significance in protein and DNA sequence analysis.

Statistical approaches help determine which configurations in protein and nucleic acid sequence data are significant. Three recent statistical methods are discussed: (i) score-based sequence analysis, which provides a means of characterizing anomalies in local sequence text and of evaluating sequence comparisons; (ii) quantile distributions of amino acid usage, which reveal general compositional biases in proteins and evolutionary relations; and (iii) r-scan statistics, which can be applied to the analysis of spacings of sequence markers.
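As a rough illustration of methods (i) and (iii), the sketch below computes a maximal-scoring segment under a per-residue scoring scheme and the r-scan lengths (sums of r consecutive spacings) for a set of marker positions. The function names, the toy scores, and the marker positions are assumptions made for this example and are not taken from the paper; the significance calculations that the methods rely on are not reproduced here.

```python
def max_segment_score(seq, scores):
    """Return (best_score, start, end) for the highest-scoring contiguous
    segment seq[start:end], given a per-residue score table (hypothetical)."""
    best, best_start, best_end = 0, 0, 0
    running, start = 0, 0
    for i, residue in enumerate(seq):
        if running <= 0:
            # A non-positive running total cannot help; restart here.
            running, start = 0, i
        running += scores[residue]
        if running > best:
            best, best_start, best_end = running, start, i + 1
    return best, best_start, best_end


def r_scans(positions, r):
    """Return the sums of r consecutive spacings between sorted markers."""
    pos = sorted(positions)
    return [pos[i + r] - pos[i] for i in range(len(pos) - r)]


# Toy scoring scheme (assumed): reward lysine, penalize alanine.
toy_scores = {"K": 2, "A": -1}
segment = max_segment_score("AAKKAKAA", toy_scores)

# Toy marker positions (assumed); a small minimum r-scan suggests clustering.
scans = r_scans([10, 12, 30, 31, 33, 70], r=2)
```

Here `segment` identifies the stretch `KKAK`, and the smallest entry of `scans` marks the tightest group of three markers. Whether either value is statistically significant depends on the null model, which this sketch leaves out.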
