Potential implications of availability of short amino acid sequences in proteins: an old and new approach to protein decoding and design.

The three-dimensional structure of a protein molecule is primarily determined by its amino acid sequence, and thus the elucidation of general rules embedded in amino acid sequences is of great importance in protein science and engineering. To extract valuable information from sequences, we propose an analytical method in which a protein sequence is considered to be constructed by serial superimpositions of short sequences of n amino acids (n-aa sets), especially triplets (3-aa sets). Using the comprehensive nonredundant protein database, we first examined the "availability" of all 8,000 possible triplet species. The availability score was mathematically defined as an indicator of the relative "preference" or "avoidance" for a given short constituent sequence to be used in a protein chain. Availability scores of real proteins were clearly biased relative to those of randomly generated proteins. We found many triplet species that occurred in the database more often or less often than expected. This bias extended to longer sets: some species of pentats (5-aa sets) that occurred reasonably frequently in the randomly generated protein population did not occur at all in any real protein known today. The availability score was dependent on species, potentially serving as a phylogenetic indicator. Furthermore, we suggest possible biotechnological applications of characteristic short sequences, such as human-specific and pathogen-specific short sequences obtained from availability analysis. The availability score was also dependent on secondary structure, potentially serving as a structural indicator. Availability analysis of triplets may be combined with comprehensive data collection on the φ and ψ backbone dihedral angles of the amino acid at the center of each triplet, i.e., a collection of Ramachandran plots for each triplet. These triplet characters, together with other physicochemical data, will provide basic information on the relationship between protein sequence and structure, by which structure prediction and engineering may be greatly facilitated. Availability analysis may also be useful in identifying word processing units in amino acid sequences, based on an analogy to natural languages. Together with other approaches, availability analysis will elucidate general rules hidden in primary sequences and eventually contribute to rebuilding the paradigm of protein science.
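
The idea of comparing observed and expected counts of short sequences can be illustrated with a minimal sketch. The abstract does not state the formula for the availability score, so the code below uses a log2 ratio of observed to expected counts as an assumed stand-in, with the expectation taken from single-residue frequencies under an independence model. The function names `count_kmers` and `log_odds_scores` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2


def count_kmers(sequences, k=3):
    """Count every overlapping k-residue window (k=3 gives triplets)."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def log_odds_scores(sequences, k=3):
    """Assumed stand-in for an availability-like score (not the paper's formula).

    Returns log2(observed / expected) for each k-mer that occurs at least
    once. The expected count assumes residues are drawn independently at
    their overall frequencies in the input sequences.
    """
    residue_counts = Counter()
    for seq in sequences:
        residue_counts.update(seq)
    total_residues = sum(residue_counts.values())
    freq = {aa: n / total_residues for aa, n in residue_counts.items()}

    kmer_counts = count_kmers(sequences, k)
    total_kmers = sum(kmer_counts.values())

    scores = {}
    for kmer, observed in kmer_counts.items():
        probability = 1.0
        for aa in kmer:
            probability *= freq[aa]
        expected = probability * total_kmers
        scores[kmer] = log2(observed / expected)
    return scores


if __name__ == "__main__":
    # Toy input only; a real analysis would read a nonredundant database.
    toy = ["MKTAYIAKQR", "MKTAYLAKQR", "GGGGSGGGGS"]
    for kmer, score in sorted(log_odds_scores(toy).items(), key=lambda kv: kv[1]):
        print(kmer, round(score, 2))
```

Positive values mark short sequences used more often than the independence model predicts, and negative values mark those used less often. Short sequences that never occur, such as the absent pentats described above, receive no score in this sketch and would have to be found separately by enumerating all possible k-mers and checking which are missing from the counts.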