Detecting Remote Evolutionary Relationships among Proteins by Large-Scale Semantic Embedding

Virtually every molecular biologist has searched a protein or DNA sequence database to find sequences that are evolutionarily related to a given query. Pairwise sequence comparison methods (i.e., measures of similarity between query and target sequences) provide the engine for sequence database search and have been the subject of 30 years of computational research. For the difficult problem of detecting remote evolutionary relationships between protein sequences, the most successful pairwise comparison methods involve building local models (e.g., profile hidden Markov models) of protein sequences. However, recent work in massive data domains like web search and natural language processing demonstrates the advantage of exploiting the global structure of the data space. Motivated by this work, we present a large-scale algorithm called ProtEmbed, which learns an embedding of protein sequences into a low-dimensional “semantic space.” Evolutionarily related proteins are embedded in close proximity, and additional pieces of evidence, such as 3D structural similarity or class labels, can be incorporated into the learning process. We find that ProtEmbed achieves higher accuracy than widely used pairwise sequence methods like PSI-BLAST and HHSearch for remote homology detection; it also outperforms our previous RankProp algorithm, which incorporates global structure in the form of a protein similarity network. Finally, the ProtEmbed embedding space can be visualized, both at the global level and local to a given query, yielding intuition about the structure of protein sequence space.
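
To make the idea of a learned "semantic space" concrete, the following is a minimal sketch of one way an embedding can be trained so that related items land close together. It is an illustration under stated assumptions, not the ProtEmbed implementation: the abstract does not specify the feature representation, the loss, or the optimizer, so the linear map `W`, the squared-Euclidean triplet hinge loss, the margin, the learning rate, and every function name here are hypothetical choices made for the example. Each protein is assumed to be given as a fixed-length feature vector (a row of `X`), and each training triplet names a query, a related protein, and an unrelated protein.

```python
import numpy as np


def triplet_hinge_step(W, q, pos, neg, margin=1.0, lr=0.01):
    """One SGD step on max(0, margin + ||W(q-pos)||^2 - ||W(q-neg)||^2).

    Assumption: a squared-Euclidean triplet hinge loss on a linear map W.
    Returns the (possibly updated) W and the hinge loss before the update.
    """
    dp = W @ (q - pos)  # embedded offset from query to the related item
    dn = W @ (q - neg)  # embedded offset from query to the unrelated item
    loss = margin + dp @ dp - dn @ dn
    if loss <= 0.0:
        # Margin already satisfied: no update for this triplet.
        return W, 0.0
    # Gradient of ||W a||^2 with respect to W is 2 * outer(W a, a).
    grad = 2.0 * (np.outer(dp, q - pos) - np.outer(dn, q - neg))
    return W - lr * grad, float(loss)


def train_embedding(X, triplets, dim=2, epochs=50, seed=0):
    """Fit W (dim x n_features) from (query, related, unrelated) index triplets."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(dim, X.shape[1]))
    for _ in range(epochs):
        for i in rng.permutation(len(triplets)):
            q, p, n = triplets[i]
            W, _ = triplet_hinge_step(W, X[q], X[p], X[n])
    return W


def rank_targets(W, X, query_idx):
    """Rank all other items by embedded Euclidean distance to the query."""
    E = X @ W.T  # each row is one item's coordinates in the embedding space
    dist = np.linalg.norm(E - E[query_idx], axis=1)
    return [int(i) for i in np.argsort(dist) if i != query_idx]


if __name__ == "__main__":
    # Toy data (made up for illustration): items 0,1 are related, as are 2,3.
    X = np.array([[1.0, 0.0, 0.9],
                  [1.0, 0.1, -0.9],
                  [0.0, 1.0, 0.9],
                  [0.0, 1.1, -0.9]])
    triplets = [(0, 1, 2), (0, 1, 3), (1, 0, 2), (1, 0, 3),
                (2, 3, 0), (2, 3, 1), (3, 2, 0), (3, 2, 1)]
    W = train_embedding(X, triplets)
    print(rank_targets(W, X, query_idx=0))
```

Once such a map is fitted, a database search reduces to sorting targets by distance to the query in the embedding space, and a two- or three-dimensional embedding can be plotted directly. Additional evidence such as structural similarity or class labels would enter, in a sketch like this one, only through how the training triplets are chosen.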
