A comprehensive review and comparison of different computational methods for protein remote homology detection

Protein remote homology detection, which aims to detect distant evolutionary relationships among proteins by computational means, is one of the most fundamental problems in the study of protein structure and function. Over the past decades, many computational approaches have been proposed for this task and have contributed substantially to the field, so a comprehensive review and comparison of these methods is needed. In this article, we divide these approaches into three categories: alignment methods, discriminative methods and ranking methods. We discuss their advantages and disadvantages, and compare their performance on widely used benchmark data sets. Finally, we explore and discuss some open questions in this field.
