Predicting Sub-cellular Location of Proteins Based on Hierarchical Clustering and Hidden Markov Models

Sub-cellular localization prediction is an important step in inferring protein function. Several strategies have been developed in recent years to solve this problem, ranging from alignment-based to feature-based solutions. However, below certain sequence identity thresholds, these kinds of approaches fail to detect homologous sequences, yielding predictions with low specificity and sensitivity. Here, a novel methodology is proposed for classifying proteins with low identity levels. The approach rests on a simple yet powerful assumption, implemented with hierarchical clustering and hidden Markov models, and obtains high performance on the prediction of four different sub-cellular localizations.
