Protein Remote Homology Detection and Fold Recognition Based on Sequence-Order Frequency Matrix

Protein remote homology detection and fold recognition are two critical tasks in the study of protein structure and function. Currently, profile-based methods achieve state-of-the-art performance on these tasks. However, the widely used sequence profiles, such as the position-specific frequency matrix (PSFM) and the position-specific scoring matrix (PSSM), ignore sequence-order effects along the protein sequence. In this study, we propose a novel profile, called the sequence-order frequency matrix (SOFM), which extracts the sequence-order information of neighboring residues from a multiple sequence alignment (MSA). Combining SOFMs with two profile feature extraction approaches, top-n-grams and the Smith–Waterman algorithm, we apply them to protein remote homology detection and fold recognition and propose two predictors, SOFM-Top and SOFM-SW. Experimental results show that SOFM has higher information content than other profiles, and that the two predictors outperform other state-of-the-art methods. We anticipate that SOFM will become a useful profile for studies of protein structure and function.
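To make the contrast between a per-column profile and a neighbor-aware profile concrete, the following Python sketch computes a PSFM from a toy MSA and, alongside it, the frequencies of residue pairs at adjacent columns. The pair-frequency function is only an illustration of "sequence-order information of neighboring residues" under our own assumptions; it is not the SOFM definition used in this study, and the names `psfm`, `neighbor_pair_frequencies`, and `toy_msa` are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def psfm(msa):
    """Per-column residue frequencies; gaps and unknown symbols are ignored."""
    length = len(msa[0])
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in msa if seq[col] in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}
        )
    return profile


def neighbor_pair_frequencies(msa):
    """Frequencies of residue pairs at columns (i, i+1).

    Illustrative assumption only: rows with a gap in either column are skipped.
    """
    length = len(msa[0])
    result = []
    for col in range(length - 1):
        counts = Counter(
            seq[col] + seq[col + 1]
            for seq in msa
            if seq[col] in AMINO_ACIDS and seq[col + 1] in AMINO_ACIDS
        )
        total = sum(counts.values())
        result.append(
            {pair: n / total for pair, n in counts.items()} if total else {}
        )
    return result


# Toy alignment of four sequences over three columns ('-' marks a gap).
toy_msa = ["ACD", "ACE", "GCD", "A-D"]
profile = psfm(toy_msa)
pairs = neighbor_pair_frequencies(toy_msa)
```

In this toy case the PSFM records that column 1 is 75% A and column 2 is 100% C, but only the pair frequencies record how often A is actually followed by C in the same aligned sequence, which is the kind of information a per-column profile discards.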
