Machine Learning Methods for the Protein Fold Recognition Problem

The protein fold recognition problem is crucial in bioinformatics. It is usually solved using sequence comparison methods but when proteins similar in structure share little in the way of sequence homology they fail and machine learning methods are used to predict the structure of the protein. The imbalance of the data sets, the number of outliers and the high number of classes make the task very complex. We try to explain the methodology for building classifiers for protein fold recognition and to cover all the major results in this field.

[1]  Theodoros Damoulas,et al.  Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection , 2008, Bioinform..

[2]  J Moult,et al.  Genetic algorithms for protein structure prediction. , 1996, Current opinion in structural biology.

[3]  Wei-Yin Loh,et al.  Classification and regression trees , 2011, WIREs Data Mining Knowl. Discov..

[4]  M. Levitt Accurate modeling of protein conformation by automatic segment matching. , 1992, Journal of molecular biology.

[5]  A. Kolinski,et al.  Coarse-Grained Protein Models and Their Applications. , 2016, Chemical reviews.

[6]  Irena Roterman-Konieczna,et al.  The late-stage intermediate , 2012 .

[7]  D T Jones,et al.  Protein secondary structure prediction based on position-specific scoring matrices. , 1999, Journal of molecular biology.

[8]  C. Chothia One thousand families for the molecular biologist , 1992, Nature.

[9]  Kaizhu Huang,et al.  Enhanced protein fold recognition through a novel data integration approach , 2009, BMC Bioinformatics.

[10]  T. Blundell,et al.  Comparative protein modelling by satisfaction of spatial restraints. , 1993, Journal of molecular biology.

[11]  R Unger,et al.  Genetic algorithms for protein folding simulations. , 1992, Journal of molecular biology.

[12]  Xieping Gao,et al.  A novel hierarchical ensemble classifier for protein fold recognition. , 2008, Protein engineering, design & selection : PEDS.

[13]  Chris H. Q. Ding,et al.  Multi-class protein fold recognition using support vector machines and neural networks , 2001, Bioinform..

[14]  Adam Liwo,et al.  Improvement of the treatment of loop structures in the UNRES force field by inclusion of coupling between backbone- and side-chain-local conformational states. , 2013, Journal of chemical theory and computation.

[15]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[16]  Koby Crammer,et al.  On the Learnability and Design of Output Codes for Multiclass Problems , 2002, Machine Learning.

[17]  Li Liao,et al.  Combining Pairwise Sequence Similarity and Support Vector Machines for Detecting Remote Protein Evolutionary and Structural Relationships , 2003, J. Comput. Biol..

[18]  Katarzyna Stapor,et al.  A hybrid discriminative/generative approach to protein fold recognition , 2012, Neurocomputing.

[19]  Ankush Mittal,et al.  Protein Structure and Fold Prediction Using Tree-augmented Naïve Bayesian Classifier , 2005, J. Bioinform. Comput. Biol..

[20]  T. N. Bhat,et al.  The Protein Data Bank , 2000, Nucleic Acids Res..

[21]  Robert Tibshirani,et al.  The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition , 2001, Springer Series in Statistics.

[22]  Huan Liu,et al.  Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution , 2003, ICML.

[23]  George Karypis,et al.  Profile-based direct kernels for remote homology detection and fold recognition , 2005, Bioinform..

[24]  Lukasz A. Kurgan,et al.  PFRES: protein fold classification by using evolutionary information and predicted secondary structure , 2007, Bioinform..

[25]  James G. Lyons,et al.  A feature extraction technique using bi-gram probabilities of position specific scoring matrix for protein fold recognition. , 2013, Journal of theoretical biology.

[26]  Katarzyna Stapor,et al.  Protein Fold Recognition with Combined SVM-RDA Classifier , 2010, HAIS.

[27]  K. Dill,et al.  The Protein Folding Problem , 1993 .

[28]  Xin Yao,et al.  Diversity creation methods: a survey and categorisation , 2004, Inf. Fusion.

[29]  P. Deschavanne,et al.  Enhanced protein fold recognition using a structural alphabet , 2009, Proteins.

[30]  Chuen-Der Huang,et al.  Hierarchical learning architecture with automatic feature selection for multiclass protein fold classification , 2003, IEEE Transactions on NanoBioscience.

[31]  Q. Zou,et al.  Recent Progress in Machine Learning-Based Methods for Protein Fold Recognition , 2016, International journal of molecular sciences.

[32]  Shuigeng Zhou,et al.  A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation , 2009, Bioinform..

[33]  I. Muchnik,et al.  Prediction of protein folding class using global description of amino acid sequence. , 1995, Proceedings of the National Academy of Sciences of the United States of America.

[34]  Loris Nanni A novel ensemble of classifiers for protein fold recognition , 2006, Neurocomputing.

[35]  Thierry Denoeux,et al.  A k-nearest neighbor classification rule based on Dempster-Shafer theory , 1995, IEEE Trans. Syst. Man Cybern..

[36]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[37]  Thierry Denoeux,et al.  An evidence-theoretic k-NN rule with parameter optimization , 1998, IEEE Trans. Syst. Man Cybern. Part C.

[38]  K. Chou,et al.  Predicting protein fold pattern with functional domain and sequential evolution information. , 2009, Journal of theoretical biology.

[39]  K. Chou Prediction of protein cellular attributes using pseudo‐amino acid composition , 2001, Proteins.

[40]  Nello Cristianini,et al.  Kernel Methods for Pattern Analysis , 2003, ICTAI.

[41]  Zoubin Ghahramani,et al.  An Introduction to Hidden Markov Models and Bayesian Networks , 2001, Int. J. Pattern Recognit. Artif. Intell..

[42]  Jianyi Yang,et al.  Improving taxonomy‐based protein fold recognition by using global and local features , 2011, Proteins.

[43]  Cathy H. Wu,et al.  UniProt: the Universal Protein knowledgebase , 2004, Nucleic Acids Res..

[44]  Thomas L. Madden,et al.  Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. , 2001, Nucleic acids research.

[45]  Ke Chen,et al.  PFP-RFSM: Protein fold prediction by using random forests and sequence motifs , 2013 .

[46]  Hampapathalu A. Nagarajaram,et al.  Support Vector Machine-based classification of protein folds using the structural properties of amino acid residues and amino acid residue pairs , 2007, Bioinform..

[47]  Narmada Thanki,et al.  CDD: a conserved domain database for interactive domain family analysis , 2006, Nucleic Acids Res..

[48]  Ethem Alpaydin,et al.  Introduction to machine learning , 2004, Adaptive computation and machine learning.

[49]  R. Schapire The Strength of Weak Learnability , 1990, Machine Learning.

[50]  Taeho Jo,et al.  Improving Protein Fold Recognition by Deep Learning Networks , 2015, Scientific Reports.

[51]  Pierre Baldi,et al.  SCRATCH: a protein structure and structural feature prediction server , 2005, Nucleic Acids Res..

[52]  Tatsuya Akutsu,et al.  Protein homology detection using string alignment kernels , 2004, Bioinform..

[53]  Jun Gao,et al.  ProFold: Protein Fold Classification with Additional Structural Features and a Novel Ensemble Classifier , 2016, BioMed research international.

[54]  K. Chou Pseudo Amino Acid Composition and its Applications in Bioinformatics, Proteomics and System Biology , 2009 .

[55]  K. Dill,et al.  From Levinthal to pathways to funnels , 1997, Nature Structural Biology.

[56]  C. Anfinsen Principles that govern the folding of protein chains. , 1973, Science.

[57]  John G. Cleary,et al.  K*: An Instance-based Learner Using and Entropic Distance Measure , 1995, ICML.

[58]  Thomas L. Madden,et al.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[59]  Xing Gao,et al.  Enhanced Protein Fold Prediction Method Through a Novel Feature Extraction Technique , 2015, IEEE Transactions on NanoBioscience.

[60]  E. Lindahl,et al.  Identification of related proteins on family, superfamily and fold level. , 2000, Journal of molecular biology.

[61]  Mohammad Saniee Abadeh,et al.  Extracting features from protein sequences to improve deep extreme learning machine for protein fold recognition. , 2017, Journal of theoretical biology.

[62]  Chuan Yi Tang,et al.  Feature Selection and Combination Criteria for Improving Accuracy in Protein Structure Prediction , 2007, IEEE Transactions on NanoBioscience.

[63]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1997, EuroCOLT.

[64]  Irena Roterman-Konieczna,et al.  The early-stage intermediate , 2012 .

[65]  Irena Roterman-Konieczna,et al.  The Structure and Function of Living Organisms , 2014 .

[66]  Mateusz Banach,et al.  The fuzzy oil drop model, based on hydrophobicity density distribution, generalizes the influence of water environment on protein structure and function. , 2014, Journal of theoretical biology.

[67]  Lior Rokach,et al.  Ensemble-based classifiers , 2010, Artificial Intelligence Review.

[68]  Edward C. Uberbacher,et al.  Predicting Protein Folding Classes without Overly Relying on Homology , 1995, ISMB.

[69]  Abdul Sattar,et al.  Mixing Energy Models in Genetic Algorithms for On-Lattice Protein Structure Prediction , 2013, BioMed research international.

[70]  Thomas G. Dietterich Multiple Classifier Systems , 2000, Lecture Notes in Computer Science.

[71]  M. Moorhouse,et al.  The Protein Databank , 2005 .

[72]  Jason Weston,et al.  Mismatch string kernels for discriminative protein classification , 2004, Bioinform..

[73]  Infotech Oulu,et al.  Protein Fold Recognition with K-Local Hyperplane Distance Nearest Neighbor Algorithm , 2004 .