RF-NR: Random Forest Based Approach for Improved Classification of Nuclear Receptors

The Nuclear Receptor (NR) superfamily plays an important role in key biological, developmental, and physiological processes. Developing a method for classifying NR proteins is an important step towards understanding the structure and function of newly discovered NR proteins. Recent studies on NR classification either fail to achieve optimal accuracy or are not designed to cover all known NR subfamilies. In this study, we developed RF-NR, a Random Forest based approach for improved classification of nuclear receptors. RF-NR predicts whether a query protein sequence belongs to one of the eight NR subfamilies or is a non-NR sequence. RF-NR uses spectrum-like features, namely amino acid composition, dipeptide composition, and tripeptide composition. When benchmarked on two independent datasets with different sequence redundancy reduction criteria, RF-NR achieves better or comparable accuracy relative to existing methods. An added advantage of our approach is that it also yields biological insights into the features that are important for classifying NR subfamilies. RF-NR is freely available at http://bcb.ncat.edu/RF_NR/.
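
To make the three spectrum-like features concrete, the following is a minimal sketch of how amino acid, dipeptide, and tripeptide compositions could be computed for one sequence. It is not the RF-NR implementation: the function names, the frequency normalization, the lexicographic feature ordering, and the skipping of windows that contain non-standard residues are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, in a fixed order (an assumption of this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_composition(sequence, k):
    """Return the relative frequency of every possible k-mer (20**k values).

    Windows containing non-standard residues are skipped (assumed behavior).
    """
    sequence = sequence.upper()
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(w for w in windows if all(c in AMINO_ACIDS for c in w))
    total = sum(counts.values())
    kmers = ("".join(p) for p in product(AMINO_ACIDS, repeat=k))
    return [counts[m] / total if total else 0.0 for m in kmers]


def spectrum_features(sequence):
    """Concatenate 1-mer, 2-mer, and 3-mer compositions into one vector."""
    return (
        kmer_composition(sequence, 1)    # amino acid composition, 20 values
        + kmer_composition(sequence, 2)  # dipeptide composition, 400 values
        + kmer_composition(sequence, 3)  # tripeptide composition, 8000 values
    )


# Hypothetical example sequence, used only to show the vector length.
features = spectrum_features("MKTAYIAKQR")
print(len(features))  # 20 + 400 + 8000 = 8420
```

A vector of this kind, computed for each labeled sequence, could then be passed to any random forest implementation (for example, scikit-learn's RandomForestClassifier) with nine target classes: the eight NR subfamilies plus non-NR.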
