Applications of machine learning methods in predicting nuclear receptors and their families.

Nuclear receptors (NRs) are a superfamily of ligand-dependent transcription factors that are closely involved in cell development, differentiation, reproduction, homeostasis and metabolism. Based on alignments of their conserved domains, NRs are classified into seven or eight subfamilies: (1) NR1: thyroid hormone-like (thyroid hormone, retinoic acid, RAR-related orphan receptor, peroxisome proliferator-activated, vitamin D3-like), (2) NR2: HNF4-like (hepatocyte nuclear factor 4, retinoid X, tailless-like, COUP-TF-like, USP), (3) NR3: estrogen-like (estrogen, estrogen-related, glucocorticoid-like), (4) NR4: nerve growth factor IB-like (NGFI-B-like), (5) NR5: fushi tarazu-F1-like (fushi tarazu-F1-like), (6) NR6: germ cell nuclear factor-like (germ cell nuclear factor), and (7) NR0: knirps-like (knirps, knirps-related, embryonic gonad protein, ODR7, trithorax) and DAX-like (DAX, SHP); in the eight-subfamily scheme, NR0 is split into (7) NR7: knirps-like and (8) NR8: DAX-like. Different NR subfamilies have different structural features and functions. Because the function of an NR is closely correlated with the subfamily it belongs to, it is highly desirable to identify NRs and their subfamilies rapidly and effectively. The knowledge acquired is essential for a proper understanding of normal and abnormal cellular mechanisms. In the post-genomic era, the number of proteins with known sequences has grown explosively. Conventional methods for accurately classifying NRs into families rely on experiments, which are costly and inefficient. This has created a greater need for bioinformatics tools that can effectively recognize NRs and their subfamilies in order to understand their biological function. In this review, we summarize the application of machine learning methods to the prediction of NRs from different aspects. We hope that this review will provide a reference for further research on the classification of NRs.
