ProtDet-CCH: Protein Remote Homology Detection by Combining Long Short-Term Memory and Ranking Methods

As one of the most challenging tasks in sequence analysis, protein remote homology detection has been extensively studied. Methods based on discriminative models and on ranking approaches have achieved state-of-the-art performance, and these two kinds of methods are complementary. In this study, three LSTM-based models, ULSTM, BLSTM, and CNN-BLSTM, were applied to construct predictors for protein remote homology detection. These models automatically extract both local and global sequence-order information. Combined with position-specific scoring matrices (PSSMs), CNN-BLSTM achieved the best performance among the three LSTM-based models. We named this method CNN-BLSTM-PSSM. Finally, a new method called ProtDet-CCH was proposed by combining CNN-BLSTM-PSSM with the ranking method HHblits. Tested on a widely used SCOP benchmark dataset, ProtDet-CCH achieved an ROC score of 0.998 and an ROC50 score of 0.982, significantly outperforming other existing state-of-the-art methods. Experimental results on two updated SCOPe independent datasets showed that ProtDet-CCH achieves stable performance. Furthermore, our method can provide useful insights for studying the features and motifs of protein families and superfamilies. It is anticipated that ProtDet-CCH will become a very useful tool for protein remote homology detection.
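The ROC and ROC50 scores reported above can be illustrated with a minimal sketch. It assumes the usual definition of ROC50 as the area under the ROC curve up to the first 50 false positives, normalized to the range [0, 1]. The function name `roc_n` and the toy scores and labels are hypothetical and are not taken from the ProtDet-CCH implementation or its benchmark data.

```python
def roc_n(scores, labels, n=None):
    """Normalized area under the ROC curve up to the first n false positives.

    scores: predicted scores, higher means more likely homologous.
    labels: 1 for a true homolog (positive), 0 for a non-homolog (negative).
    n:      number of false positives to consider; None gives the full ROC score.
    Tied scores are not treated specially in this sketch; they keep input order.
    """
    # Rank all samples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(1 for _, label in ranked if label == 1)
    total_neg = len(ranked) - total_pos
    if total_pos == 0 or total_neg == 0:
        raise ValueError("need at least one positive and one negative sample")

    # If fewer than n negatives exist, fall back to all of them.
    limit = total_neg if n is None else min(n, total_neg)

    true_pos_seen = 0
    false_pos_seen = 0
    area = 0
    for _, label in ranked:
        if label == 1:
            true_pos_seen += 1
        else:
            # Each false positive contributes the number of true positives
            # ranked above it.
            false_pos_seen += 1
            area += true_pos_seen
            if false_pos_seen == limit:
                break

    return area / (limit * total_pos)


# Toy example (hypothetical values).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 0, 1, 1, 0]
full_roc = roc_n(toy_scores, toy_labels)       # full ROC score
roc_50 = roc_n(toy_scores, toy_labels, n=50)   # ROC50; only 2 negatives exist here
```

With `n=50`, only the top-ranked part of the list influences the score, which is why ROC50 is generally a stricter measure than the full ROC score when many negatives are present.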
