CFAGO: cross-fusion of network and attributes based on attention mechanism for protein function prediction

Abstract

Motivation: Protein function annotation is fundamental to understanding biological mechanisms. The abundant genome-scale protein–protein interaction (PPI) networks, together with other protein biological attributes, provide rich information for annotating protein functions. Because PPI networks and biological attributes describe protein functions from different perspectives, cross-fusing them for protein function prediction is highly challenging. Recently, several methods have combined PPI networks and protein attributes via graph neural networks (GNNs). However, GNNs may inherit or even magnify the bias caused by noisy edges in PPI networks. In addition, stacking many GNN layers may cause over-smoothing of node representations.

Results: We develop a novel protein function prediction method, CFAGO, which integrates single-species PPI networks and protein biological attributes via a multi-head attention mechanism. CFAGO is first pre-trained with an encoder–decoder architecture to capture a universal protein representation of the two sources. It is then fine-tuned to learn more effective protein representations for protein function prediction. Benchmark experiments on human and mouse datasets show that CFAGO outperforms state-of-the-art single-species network-based methods by at least 7.59%, 6.90%, and 11.68% in terms of m-AUPR, M-AUPR, and Fmax, respectively, demonstrating that cross-fusion via a multi-head attention mechanism can greatly improve protein function prediction. We further evaluate the quality of the captured protein representations in terms of the Davies-Bouldin score; the results show that protein representations cross-fused by the multi-head attention mechanism are at least 2.7% better than the original and concatenated representations. We believe CFAGO is an effective tool for protein function prediction.

Availability and implementation: The source code of CFAGO and the experimental data are available at: http://bliulab.net/CFAGO/.
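To make the Fmax figure reported in the Results concrete, the sketch below computes a protein-centric Fmax in the way it is commonly defined in CAFA-style evaluations: at each score threshold, precision is averaged over proteins with at least one predicted term, recall is averaged over all annotated proteins, and Fmax is the best resulting F-measure. This is an illustrative sketch, not the CFAGO evaluation code; the function name `fmax`, the threshold grid, and the toy scores and labels are assumptions made for this example.

```python
from typing import List, Sequence, Set


def fmax(scores: Sequence[Sequence[float]],
         labels: Sequence[Set[int]],
         thresholds: Sequence[float]) -> float:
    """Protein-centric Fmax: the best F-measure over all score thresholds.

    scores[i][j] is the predicted score of term j for protein i;
    labels[i] is the set of true term indices for protein i.
    (Hypothetical helper, written only for illustration.)
    """
    best = 0.0
    for t in thresholds:
        precisions: List[float] = []
        recalls: List[float] = []
        for protein_scores, true_terms in zip(scores, labels):
            if not true_terms:
                continue  # proteins without annotations are skipped
            predicted = {j for j, s in enumerate(protein_scores) if s >= t}
            hits = len(predicted & true_terms)
            recalls.append(hits / len(true_terms))
            if predicted:
                # precision only counts proteins with at least one prediction
                precisions.append(hits / len(predicted))
        if not precisions or not recalls:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best


# Toy data (assumed): two proteins, three candidate terms.
toy_scores = [[0.9, 0.2, 0.6], [0.1, 0.8, 0.4]]
toy_labels = [{0, 2}, {1}]
toy_thresholds = [i / 10 for i in range(1, 10)]
result = fmax(toy_scores, toy_labels, toy_thresholds)
print(f"Fmax on toy data: {result:.3f}")
```

Because Fmax takes the maximum over thresholds, it rewards a model whose scores rank the true terms above the false ones for most proteins, regardless of how the scores are calibrated.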
