Machine learning can be used to distinguish protein families and generate new proteins belonging to those families.

Proteins are classified into families based on evolutionary relationships and shared structure-function characteristics. The availability of large data sets of gene-derived protein sequences drives this classification. Because sequence space grows exponentially with sequence length, differences between families are difficult to characterize. In this work, we show that machine learning (ML) methods can be trained to distinguish between protein families. We explore a number of supervised ML algorithms to this end. The most accurate is a Long Short-Term Memory (LSTM) classification method that accounts for the sequence context of the amino acids. We study sequences from a number of protein families for which sufficient data are available for ML. By splitting the data into training and testing sets, we find that this LSTM classifier can be trained to successfully classify the test sequences for all pairs of the families. We also investigate whether adding structural information increases the accuracy of the binary comparisons. It does, but because much less structural than sequence information is available, the quality of the training degrades. Another variety of LSTM, LSTM_wordGen, a context-dependent word generation algorithm, is used to generate new protein sequences based on seed sequences for the families considered here. Using the original sequences as training data and the generated sequences as test data, the LSTM classification method classifies the generated sequences almost as accurately as it classifies true family members. Thus, in principle, we have generated new members of these protein families.
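The abstract does not describe how sequences are prepared for the pairwise LSTM classifier, so the following is only a minimal sketch of one common approach, not the authors' pipeline. It assumes a 20-letter amino-acid alphabet, integer encoding with zero padding to a fixed length, and a random train/test split for one pair of families. The names `encode_sequence` and `make_binary_split`, the padding scheme, and the 20% test fraction are all hypothetical choices made for illustration.

```python
import random

# Assumption: the 20 standard amino acids; index 0 is reserved for padding
# and one extra index stands in for any non-standard residue.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
UNKNOWN_INDEX = len(AMINO_ACIDS) + 1


def encode_sequence(seq, max_len):
    """Map residues to integers, truncate to max_len, and right-pad with 0."""
    ids = [AA_TO_INDEX.get(res, UNKNOWN_INDEX) for res in seq.upper()[:max_len]]
    return ids + [0] * (max_len - len(ids))


def make_binary_split(family_a, family_b, max_len, test_fraction=0.2, seed=0):
    """Label family_a as 0 and family_b as 1, shuffle, and split into train/test."""
    examples = [(encode_sequence(s, max_len), 0) for s in family_a]
    examples += [(encode_sequence(s, max_len), 1) for s in family_b]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(examples)
    n_test = max(1, int(len(examples) * test_fraction))
    return examples[n_test:], examples[:n_test]


if __name__ == "__main__":
    # Toy sequences invented for illustration; they are not real family members.
    fam_a = ["ACDEF", "ACDKF", "ACDEW", "GCDEF", "ACDEY"]
    fam_b = ["MKLVV", "MKLIV", "MRLVV", "MKLVA", "MKMVV"]
    train, test = make_binary_split(fam_a, fam_b, max_len=8)
    print(len(train), "training examples,", len(test), "test examples")
```

The integer vectors produced this way could be fed to an embedding layer followed by an LSTM layer in any deep learning framework; evaluating the trained model only on the held-out split mirrors the training/testing protocol described above.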
