ProtDec-LTR3.0: Protein Remote Homology Detection by Incorporating Profile-Based Features Into Learning to Rank

Protein remote homology detection is one of the most challenging problems in protein sequence analysis, and it is an important step for both theoretical research (such as understanding the structures and functions of proteins) and drug design. Previous studies have shown that combining different ranking methods via a learning to rank algorithm is an effective strategy for protein remote homology detection, and that performance can be further improved by exploiting protein similarity networks. In this paper, we improved the ProtDec-LTR1.0 and ProtDec-LTR2.0 predictors by incorporating three profile-based features (Top-1-gram, Top-2-gram, and ACC) into the learning to rank framework via feature mapping strategies. The predictive performance was further refined by the PageRank (PR) algorithm and the hyperlink-induced topic search (HITS) algorithm. The resulting predictor is called ProtDec-LTR3.0. Rigorous tests on two widely used benchmark datasets showed that ProtDec-LTR3.0 outperformed both ProtDec-LTR1.0 and ProtDec-LTR2.0, as well as nine other existing state-of-the-art predictors, indicating that ProtDec-LTR3.0 is an efficient method for protein remote homology detection and will become a useful tool for protein sequence analysis. A user-friendly web server for the ProtDec-LTR3.0 predictor is available at http://bliulab.net/ProtDec-LTR3.0/.
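
To illustrate the kind of network-based refinement mentioned in the abstract, the following is a minimal, generic sketch of PageRank and HITS computed by power iteration on a small protein similarity network. It is not the ProtDec-LTR3.0 implementation: the four-protein network, its similarity weights, the damping factor, and the function names `pagerank` and `hits` are all assumptions made for illustration only.

```python
# Generic PageRank and HITS by power iteration on a weighted similarity network.
# The network below is a made-up toy example, not data from the paper.

def pagerank(adj, damping=0.85, iters=200, tol=1e-12):
    """Return PageRank scores for a weighted adjacency matrix (list of lists)."""
    n = len(adj)
    out_weight = [sum(row) for row in adj]
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        for i in range(n):
            if out_weight[i] == 0:
                # Dangling node: spread its rank uniformly over all nodes.
                for j in range(n):
                    new[j] += damping * rank[i] / n
            else:
                for j in range(n):
                    new[j] += damping * rank[i] * adj[i][j] / out_weight[i]
        converged = sum(abs(a - b) for a, b in zip(new, rank)) < tol
        rank = new
        if converged:
            break
    return rank


def hits(adj, iters=200):
    """Return (hub, authority) scores, each normalized to unit Euclidean length."""
    n = len(adj)
    hubs = [1.0] * n
    auth = [1.0] * n
    for _ in range(iters):
        # Authority update: sum of hub scores of nodes pointing to j.
        auth = [sum(adj[i][j] * hubs[i] for i in range(n)) for j in range(n)]
        norm = sum(a * a for a in auth) ** 0.5 or 1.0
        auth = [a / norm for a in auth]
        # Hub update: sum of authority scores of nodes that i points to.
        hubs = [sum(adj[i][j] * auth[j] for j in range(n)) for i in range(n)]
        norm = sum(h * h for h in hubs) ** 0.5 or 1.0
        hubs = [h / norm for h in hubs]
    return hubs, auth


# Hypothetical symmetric similarity weights among four proteins (0..3).
similarity = [
    [0.0, 0.9, 0.8, 0.7],
    [0.9, 0.0, 0.5, 0.0],
    [0.8, 0.5, 0.0, 0.0],
    [0.7, 0.0, 0.0, 0.0],
]

if __name__ == "__main__":
    pr = pagerank(similarity)
    hub_scores, auth_scores = hits(similarity)
    # Proteins sorted from most to least central under each score.
    print("PageRank order:", sorted(range(4), key=lambda i: -pr[i]))
    print("HITS authority order:", sorted(range(4), key=lambda i: -auth_scores[i]))
```

In this toy network, protein 0 is the most strongly connected node, so both scores place it first. How such scores would be combined with the learning to rank output is not specified in the abstract and is left open here.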
