Protein space: a natural method for realizing the nature of protein universe.

Current methods cannot tell us what the nature of the protein universe is concretely. They are based on different models of amino acid substitution and multiple sequence alignment which is an NP-hard problem and requires manual intervention. Protein structural analysis also gives a direction for mapping the protein universe. Unfortunately, now only a minuscule fraction of proteins' 3-dimensional structures are known. Furthermore, the phylogenetic tree representations are not unique for any existing tree construction methods. Here we develop a novel method to realize the nature of protein universe. We show the protein universe can be realized as a protein space in 60-dimensional Euclidean space using a distance based on a normalized distribution of amino acids. Every protein is in one-to-one correspondence with a point in protein space, where proteins with similar properties stay close together. Thus the distance between two points in protein space represents the biological distance of the corresponding two proteins. We also propose a natural graphical representation for inferring phylogenies. The representation is natural and unique based on the biological distances of proteins in protein space. This will solve the fundamental question of how proteins are distributed in the protein universe.

[1]  Samad Jahandideh,et al.  Application of density similarities to predict membrane protein types based on pseudo-amino acid composition. , 2011, Journal of theoretical biology.

[2]  Eugene I Shakhnovich,et al.  Expanding protein universe and its origin from the biological Big Bang , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[3]  Kuo-Chen Chou,et al.  Cellular automata and its applications in protein bioinformatics. , 2011, Current protein & peptide science.

[4]  Qian-zhong Li,et al.  Predict mycobacterial proteins subcellular locations by incorporating pseudo-average chemical shift into the general form of Chou's pseudo amino acid composition. , 2012, Journal of theoretical biology.

[5]  E. Koonin,et al.  The structure of the protein universe and genome evolution , 2002, Nature.

[6]  Yanda Li,et al.  SubChlo: predicting protein subchloroplast locations with pseudo-amino acid composition and the evidence-theoretic K-nearest neighbor (ET-KNN) algorithm. , 2009, Journal of theoretical biology.

[7]  Maqsood Hayat,et al.  Discriminating outer membrane proteins with Fuzzy K-nearest Neighbor algorithms based on the general form of Chou's PseAAC. , 2012, Protein and peptide letters.

[8]  Zong Dai,et al.  Prediction of protein structural classes by Chou’s pseudo amino acid composition: approached using continuous wavelet transform and principal component analysis , 2009, Amino Acids.

[9]  K. Chou Some remarks on protein attribute prediction and pseudo amino acid composition , 2010, Journal of Theoretical Biology.

[10]  C Sander,et al.  Mapping the Protein Universe , 1996, Science.

[11]  K. Chou,et al.  iDNA-Prot: Identification of DNA Binding Proteins Using Random Forest with Grey Model , 2011, PloS one.

[12]  Fengmin Li,et al.  Predicting protein subcellular location using Chou's pseudo amino acid composition and improved hybrid approach. , 2008, Protein and peptide letters.

[13]  Lele Hu,et al.  Using pseudo amino acid composition to predict protease families by incorporating a series of protein biological features. , 2011, Protein and peptide letters.

[14]  B. Webb,et al.  Protein kinase C isoenzymes: a review of their structure, regulation and role in regulating airways smooth muscle tone and mitogenesis , 2000, British journal of pharmacology.

[15]  Chenglong Yu,et al.  A Novel Method of Characterizing Genetic Sequences: Genome Space with Biological Distance and Applications , 2011, PloS one.

[16]  Kuo-Chen Chou,et al.  iNR-PhysChem: A Sequence-Based Predictor for Identifying Nuclear Receptors and Their Subfamilies via Physical-Chemical Property Matrix , 2012, PloS one.

[17]  A. Godzik,et al.  Exploration of Uncharted Regions of the Protein Universe , 2009, PLoS biology.

[18]  K. Chou Pseudo Amino Acid Composition and its Applications in Bioinformatics, Proteomics and System Biology , 2009 .

[19]  Shao-Ping Shi,et al.  Using the concept of Chou's pseudo amino acid composition to predict enzyme family classes: an approach with support vector machine based on discrete wavelet transform. , 2010, Protein and peptide letters.

[20]  K. Chou Prediction of protein cellular attributes using pseudo‐amino acid composition , 2001, Proteins.

[21]  M. Esmaeili,et al.  Using the concept of Chou's pseudo amino acid composition for risk type prediction of human papillomaviruses. , 2010, Journal of theoretical biology.

[22]  Dinesh Gupta,et al.  Identifying Bacterial Virulent Proteins by Fusing a Set of Classifiers Based on Variants of Chou's Pseudo Amino Acid Composition and on Evolutionary Information , 2012, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[23]  Hao Lin The modified Mahalanobis Discriminant for predicting outer membrane proteins by using Chou's pseudo amino acid composition. , 2008, Journal of theoretical biology.

[24]  M. Levitt Nature of the protein universe , 2009, Proceedings of the National Academy of Sciences.

[25]  Shao-Wu Zhang,et al.  Using the concept of Chou’s pseudo amino acid composition to predict protein subcellular localization: an approach by incorporating evolutionary information and von Neumann entropies , 2008, Amino Acids.

[26]  Zhanchao Li,et al.  Using Chou's amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes. , 2007, Journal of theoretical biology.

[27]  Y. Nishizuka,et al.  Taxonomy and function of C1 protein kinase C homology domains , 1997, Protein science : a publication of the Protein Society.

[28]  T. N. Bhat,et al.  The Protein Data Bank , 2000, Nucleic Acids Res..

[29]  Jérôme Gouzy,et al.  ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons , 2000, Nucleic Acids Res..

[30]  Jianding Qiu,et al.  Prediction of G-protein-coupled receptor classes based on the concept of Chou's pseudo amino acid composition: an approach from discrete wavelet transform. , 2009, Analytical biochemistry.

[31]  Hao Lin,et al.  Prediction of cell wall lytic enzymes using Chou's amphiphilic pseudo amino acid composition. , 2009, Protein and peptide letters.

[32]  Guangya Zhang,et al.  Predicting the cofactors of oxidoreductases based on amino acid composition distribution and Chou's amphiphilic pseudo-amino acid composition. , 2008, Journal of theoretical biology.

[33]  Chenglong Yu,et al.  Protein map: an alignment-free sequence comparison method based on various properties of amino acids. , 2011, Gene.

[34]  Kuo-Chen Chou,et al.  GPCR-2 L : predicting G protein-coupled receptors and their types by hybridizing two different modes of pseudo amino acid compositions w , 2010 .

[35]  M. Nei,et al.  The neighbor-joining method , 1987 .

[36]  L. Holm,et al.  The Pfam protein families database , 2005, Nucleic Acids Res..

[37]  Guangya Zhang,et al.  Predicting lipase types by improved Chou's pseudo-amino acid composition. , 2008, Protein and peptide letters.

[38]  Xiaoying Jiang,et al.  Using the concept of Chou's pseudo amino acid composition to predict apoptosis proteins subcellular location: an approach by approximate entropy. , 2008, Protein and peptide letters.

[39]  Yanzhi Guo,et al.  Using the augmented Chou's pseudo amino acid composition for predicting protein submitochondria locations based on auto covariance approach. , 2009, Journal of theoretical biology.

[40]  Ganapati Panda,et al.  A novel feature representation method based on Chou's pseudo amino acid composition for protein structural class prediction , 2010, Comput. Biol. Chem..

[41]  Loris Nanni,et al.  Genetic programming for creating Chou’s pseudo amino acid based features for submitochondria localization , 2008, Amino Acids.

[42]  A. Esmaeili,et al.  Prediction of GABAA receptor proteins using the concept of Chou's pseudo-amino acid composition and support vector machine. , 2011, Journal of theoretical biology.

[43]  Peixiang Cai,et al.  Predicting protein structural class with pseudo-amino acid composition and support vector machine fusion network. , 2006, Analytical biochemistry.

[44]  Robert D. Finn,et al.  The Pfam protein families database , 2004, Nucleic Acids Res..

[45]  Tongliang Zhang,et al.  Using Chou’s pseudo amino acid composition based on approximate entropy and an ensemble of AdaBoost classifiers to predict protein subnuclear location , 2008, Amino Acids.

[46]  Menglong Li,et al.  SecretP: identifying bacterial secreted proteins by fusing new features into Chou's pseudo-amino acid composition. , 2010, Journal of theoretical biology.

[47]  Hao Lin,et al.  Predicting subcellular localization of mycobacterial proteins by using Chou's pseudo amino acid composition. , 2008, Protein and peptide letters.

[48]  Yongsheng Ding,et al.  Using Chou's pseudo amino acid composition to predict subcellular localization of apoptosis proteins: An approach with immune genetic algorithm-based ensemble classifier , 2008, Pattern Recognit. Lett..

[49]  W. Fitch,et al.  Construction of phylogenetic trees. , 1967, Science.

[50]  I. Ladunga Phylogenetic continuum indicates “Galaxies” in the protein universe: Preliminary results on the natural group structures of proteins , 1992, Journal of Molecular Evolution.

[51]  Kuo-Chen Chou,et al.  Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes , 2005, Bioinform..

[52]  Dongsheng Zou,et al.  Supersecondary structure prediction using Chou's pseudo amino acid composition , 2011, J. Comput. Chem..

[53]  Changchuan Yin,et al.  A Novel Construction of Genome Space with Biological Geometry , 2010, DNA research : an international journal for rapid publication of reports on genes and genomes.

[54]  Jia He,et al.  Improving discrimination of outer membrane proteins by fusing different forms of pseudo amino acid composition. , 2010, Analytical biochemistry.

[55]  P. Buneman A Note on the Metric Properties of Trees , 1974 .

[56]  Nai-Yang Deng,et al.  Prediction of enzyme subfamily class via pseudo amino acid composition by incorporating the conjoint triad feature. , 2010, Protein and peptide letters.

[57]  Xiaoyong Zou,et al.  Prediction of protein secondary structure content by using the concept of Chou's pseudo amino acid composition and support vector machine. , 2009, Protein and peptide letters.

[58]  Thomas Martinetz,et al.  Prediction of apoptosis protein locations with genetic algorithms and support vector machines through a new mode of pseudo amino acid composition. , 2010, Protein and peptide letters.

[59]  Shao-Wu Zhang,et al.  Using Chou’s pseudo amino acid composition to predict protein quaternary structure: a sequence-segmented PseAAC approach , 2008, Amino Acids.

[60]  Hassan Mohabatkar,et al.  Prediction of cyclin proteins using Chou's pseudo amino acid composition. , 2010, Protein and peptide letters.

[61]  Qianzhong Li,et al.  Using pseudo amino acid composition to predict protein structural class: Approached by incorporating 400 dipeptide components , 2007, J. Comput. Chem..

[62]  E. Koonin Metagenomic sorcery and the expanding protein universe , 2007, Nature Biotechnology.

[63]  Kurt Jordaens,et al.  Multiple UPGMA and Neighbor-joining Trees and the Performance of Some Computer Packages , 1996 .

[64]  Bhaskar D. Kulkarni,et al.  Using pseudo amino acid composition to predict protein subnuclear localization: Approached with PSSM , 2007, Pattern Recognit. Lett..

[65]  D. Lipman,et al.  Rapid and sensitive protein similarity searches. , 1985, Science.

[66]  Hui Ding,et al.  Identify Golgi protein types with modified Mahalanobis discriminant algorithm and pseudo amino acid composition. , 2011, Protein and peptide letters.

[67]  Jianxiu Guo,et al.  Predicting protein folding rates using the concept of Chou's pseudo amino acid composition , 2011, Journal of computational chemistry.

[68]  K. Chou,et al.  2D-MH: A web-server for generating graphic representation of protein sequences based on the physicochemical properties of their constituent amino acids. , 2010, Journal of theoretical biology.

[69]  Jianding Qiu,et al.  Using the concept of Chou's pseudo amino acid composition to predict enzyme family classes: an approach with support vector machine based on discrete wavelet transform. , 2010, Protein and peptide letters.

[70]  K. Chou,et al.  REVIEW : Recent advances in developing web-servers for predicting protein attributes , 2009 .

[71]  Robert R. Sokal,et al.  A statistical method for evaluating systematic relationships , 1958 .

[72]  Fyodor A. Kondrashov,et al.  Sequence space and the ongoing expansion of the protein universe , 2010, Nature.

[73]  Chenglong Yu,et al.  A protein map and its application. , 2008, DNA and cell biology.

[74]  Q Gu,et al.  Prediction of G-protein-coupled receptor classes in low homology using Chou's pseudo amino acid composition with approximate entropy and hydrophobicity patterns. , 2010, Protein and peptide letters.

[75]  Temple F. Smith,et al.  Comparison of biosequences , 1981 .

[76]  J. Nieto,et al.  Use of fuzzy clustering technique and matrices to classify amino acids and its impact to Chou's pseudo amino acid composition. , 2009, Journal of theoretical biology.

[77]  Ziheng Yang,et al.  Computational Molecular Evolution , 2006 .

[78]  P. Parker,et al.  The extended protein kinase C superfamily. , 1998, The Biochemical journal.

[79]  Shao-Ping Shi,et al.  OligoPred: a web-server for predicting homo-oligomeric proteins by incorporating discrete wavelet transform into Chou's pseudo amino acid composition. , 2011, Journal of molecular graphics & modelling.

[80]  Yanzhi Guo,et al.  Predicting DNA-binding proteins: approached from Chou’s pseudo amino acid composition and other specific sequence features , 2007, Amino Acids.

[81]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[82]  Xuan Xiao and Kuo-Chen Chou Using Pseudo Amino Acid Composition to Predict Protein Attributes Via Cellular Automata and Other Approaches , 2011 .