Fertility-LightGBM: A fertility-related protein prediction model by multi-information fusion and light gradient boosting machine

Abstract The identification of fertility-related proteins plays an essential part in understanding embryogenesis and germ cell development. Because traditional experimental methods for identifying fertility-related proteins are expensive and time-consuming, computational methods that predict protein function from amino acid sequences have emerged. In this paper, we propose a fertility-related protein prediction model. First, the model combines protein physicochemical property information, evolutionary information and sequence information to construct the initial feature space 'ALL'. Then, the least absolute shrinkage and selection operator (LASSO) is used to remove redundant features. Finally, a light gradient boosting machine (LightGBM) is used as the classifier. The 5-fold cross-validation accuracy on the training dataset is 88.5%, and the accuracy on the independent test dataset is 91.5%. These results show that the model is competitive for predicting fertility-related proteins, which is helpful for the study of fertility diseases and related drug targets. The source code and all datasets are available at https://github.com/QUST-AIBBDRC/Fertility-LightGBM/ .
