Yule Value Tables from Protein Datasets

We systematically studied the association between amino acids, the constituents of protein sequences, in datasets at different hierarchical levels: genome (human), protein type (membrane proteins), protein family (specific types of membrane receptors and transporters), and transmembrane helices versus loops (either for membrane proteins in general or for specific families). Association was estimated using Yule's Q statistic for pairs of amino acids within a window of size 4. Strong association between such nearby amino acids was observed in all the datasets studied, but not in the randomized datasets. As expected, association strength increased as the datasets became more specific. Strikingly, associations in transmembrane helices were more negative than in any other dataset studied, suggesting that the evolution of these helices requires suppressing the occurrence of specific amino acid combinations within a local range. The results are directly applicable to several areas of bioinformatics research, namely transmembrane helix boundary prediction, sequence alignment, and understanding the design principles of membrane proteins in general. Data and access to the algorithms presented in this paper are available at http://flan.blm.cs.cmu.edu/
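
As a rough illustration of the measure named above, the following sketch computes Yule's Q, Q = (ad - bc) / (ad + bc), from a 2x2 contingency table for one ordered pair of amino acids. The way the table is filled here is an assumption made for illustration and may differ from the procedure used in the paper: every ordered pair of positions that fits inside a window of size 4 is tallied by whether the first residue equals x and the second equals y. The function names `yule_q` and `pair_q` are hypothetical.

```python
def yule_q(a, b, c, d):
    """Yule's Q for the 2x2 table [[a, b], [c, d]].

    Returns None when ad + bc is zero, where Q is undefined.
    """
    numerator = a * d - b * c
    denominator = a * d + b * c
    return numerator / denominator if denominator else None


def pair_q(sequences, x, y, window=4):
    """Yule's Q for amino acid x followed by y within `window` positions.

    Assumed counting scheme (not taken from the paper): for every
    position i and every later position j with j - i < window, tally
      a: seq[i] == x and seq[j] == y
      b: seq[i] == x and seq[j] != y
      c: seq[i] != x and seq[j] == y
      d: seq[i] != x and seq[j] != y
    """
    a = b = c = d = 0
    for seq in sequences:
        for i in range(len(seq)):
            for j in range(i + 1, min(i + window, len(seq))):
                first_is_x = seq[i] == x
                second_is_y = seq[j] == y
                if first_is_x and second_is_y:
                    a += 1
                elif first_is_x:
                    b += 1
                elif second_is_y:
                    c += 1
                else:
                    d += 1
    return yule_q(a, b, c, d)
```

Q ranges from -1 (the pair co-occurs less often than expected under independence) to +1 (more often), with 0 indicating no association. Calling `pair_q(sequences, "L", "A")` on a list of protein strings gives one entry of a Yule value table under the assumed counting scheme.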
