Protein function prediction is improved by creating synthetic feature samples with generative adversarial networks

Protein function prediction is a challenging but important task in bioinformatics. Many prediction methods have been developed, but they remain limited by the small number of available training samples. A data augmentation method that generates high-quality synthetic samples is therefore valuable for further improving the accuracy of prediction methods. In this work, we propose a novel generative adversarial network-based method, FFPred-GAN, that learns the high-dimensional distributions of protein sequence-based biophysical features and generates high-quality synthetic protein feature samples. The experimental results suggest that augmenting the original training protein feature samples with the synthetic samples improves prediction accuracy for all three domains of Gene Ontology.

Training machine learning models to predict protein function is limited by the small amount of labelled training data available. Training can be improved by using generative adversarial networks to generate additional synthetic protein samples.
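The augmentation step described above can be illustrated with a minimal sketch. It is not the FFPred-GAN implementation: the helper names (`make_gaussian_sampler`, `augment_with_synthetic`), the feature dimension and the sample counts are all hypothetical, and a simple per-feature Gaussian sampler stands in for a trained GAN generator. The sketch only shows how synthetic feature vectors for one class might be appended to the real training set before a classifier is trained.

```python
import numpy as np


def make_gaussian_sampler(X_pos, rng):
    """Stand-in for a trained generator (NOT a GAN): samples each feature
    independently from a Gaussian fitted to the real positive samples."""
    mean = X_pos.mean(axis=0)
    std = X_pos.std(axis=0) + 1e-8  # avoid a zero scale for constant features

    def sample(n):
        return rng.normal(mean, std, size=(n, X_pos.shape[1]))

    return sample


def augment_with_synthetic(real_X, real_y, sample_synthetic, n_synthetic, label):
    """Append n_synthetic generated feature vectors, all carrying `label`,
    to the real training set and return the augmented (X, y)."""
    synth_X = sample_synthetic(n_synthetic)
    if synth_X.shape[1] != real_X.shape[1]:
        raise ValueError("synthetic and real samples must share a feature dimension")
    X = np.vstack([real_X, synth_X])
    y = np.concatenate([real_y, np.full(n_synthetic, label, dtype=real_y.dtype)])
    return X, y


# Toy data: 20 proteins with 8 features each (both numbers are assumptions),
# of which 5 are annotated with the function of interest (label 1).
rng = np.random.default_rng(0)
real_X = rng.normal(size=(20, 8))
real_y = np.array([1] * 5 + [0] * 15)

sampler = make_gaussian_sampler(real_X[real_y == 1], rng)
aug_X, aug_y = augment_with_synthetic(real_X, real_y, sampler, n_synthetic=10, label=1)

print(aug_X.shape)       # (30, 8)
print(int(aug_y.sum()))  # 15 positive samples: 5 real + 10 synthetic
```

In a GAN-based workflow, `sampler` would presumably be replaced by a function that draws noise vectors and passes them through the trained generator; the augmentation logic itself would stay the same.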