Human Protein Cluster Analysis Using Amino Acid Frequencies

This paper presents a software tool for clustering proteins by their amino acid content. All known human proteins in the UniProtKB/Swiss-Prot reference database were clustered by the relative frequencies of their amino acids using hierarchical cluster analysis, and the results were compared with clustering based on sequence similarity. Results: Proteins display different clustering patterns depending on their type. Many extracellular proteins with highly specific, repetitive sequences (keratins, collagens, etc.) form clear clusters, confirming the accuracy of the clustering method; for these proteins, clustering by sequence and clustering by amino acid content overlap. Proteins with a more complex, multi-domain structure (catalytic, extracellular, transmembrane, etc.) clustered differently by amino acid content, even when sequence similarity and function classify them as very similar (aquaporins, cadherins, steroid 5-alpha reductase, etc.). The availability of essential amino acids under local conditions (starvation, low or high oxygen, cell cycle phase, etc.) may be a limiting factor in protein synthesis, regardless of the mRNA level. This type of protein clustering may therefore prove a valuable tool for identifying so far unknown metabolic connections and constraints.
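As a rough illustration of the approach described above, the following minimal sketch computes relative amino acid frequencies and groups sequences by agglomerative hierarchical clustering. It is not the tool presented in the paper. The abstract does not state the distance measure or linkage rule, so Euclidean distance and average linkage are assumptions here. The function names and the four short input strings are hypothetical and made up for illustration; they are not real UniProtKB/Swiss-Prot entries.

```python
import math
from collections import Counter
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def aa_frequencies(sequence):
    """Return the relative frequency of each standard amino acid, in AMINO_ACIDS order."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def euclidean(u, v):
    """Euclidean distance between two frequency vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage_clusters(vectors, n_clusters):
    """Naive agglomerative clustering with average linkage (assumed linkage rule).

    vectors: dict mapping a protein name to its frequency vector.
    Returns a sorted list of clusters, each a sorted list of names.
    """
    clusters = [[name] for name in vectors]

    def cluster_distance(c1, c2):
        # Average pairwise distance between the members of two clusters.
        total = sum(euclidean(vectors[a], vectors[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

    return sorted(sorted(c) for c in clusters)


# Made-up toy sequences: two repetitive Gly/Pro-rich strings and two mixed strings.
toy_sequences = {
    "gly_pro_rich_1": "GPPGAPGPPGPAGPPGKP",
    "gly_pro_rich_2": "GAPGPPGPPGEPGPPGAP",
    "mixed_1": "MKTLLVAALCLSQEWDRFYHNI",
    "mixed_2": "MSTVLLAAICLQEWDKFYHRN",
}

toy_vectors = {name: aa_frequencies(seq) for name, seq in toy_sequences.items()}
toy_clusters = average_linkage_clusters(toy_vectors, n_clusters=2)
print(toy_clusters)
```

The naive merge loop recomputes all cluster distances at every step, which is acceptable for a handful of sequences but would be far too slow for a full proteome; a dedicated hierarchical clustering library would be the practical choice at that scale.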
