Using random forests for assistance in the curation of G-protein coupled receptor databases

Background: Biology is experiencing a gradual but fast transformation from a laboratory-centred science towards a data-centred one. As such, it requires robust data engineering and the use of quantitative data analysis methods as part of database curation. This paper focuses on G protein-coupled receptors, a large and heterogeneous super-family of cell membrane proteins of interest to biology in general. One of its families, Class C, is of particular interest to pharmacology and drug design. This family is quite heterogeneous on its own, and the discrimination of its several sub-families is a challenging problem. In the absence of known crystal structure, such discrimination must rely on their primary amino acid sequences.

Methods: We are interested not so much in achieving maximum sub-family discrimination accuracy using quantitative methods as in exploring sequence misclassification behavior. Specifically, we are interested in isolating those sequences showing consistent misclassification, that is, sequences that are very often misclassified and almost always to the same wrong sub-family. Random forests are used for this analysis due to their ensemble nature, which makes them naturally suited to gauge the consistency of misclassification. This consistency is here defined through the voting scheme of their base tree classifiers.

Results: Detailed consistency results for the random forest ensemble classification were obtained for all receptors and for all data transformations of their unaligned primary sequences. Shortlists of the most consistently misclassified receptors for each sub-family and transformation, as well as an overall shortlist including those cases that were consistently misclassified across transformations, were obtained. The latter should be referred to experts for further investigation as a data curation task.

Conclusion: The automatic discrimination of the Class C sub-families of G protein-coupled receptors from their unaligned primary sequences shows clear limits. This study has investigated in some detail the consistency of their misclassification using random forest ensemble classifiers. Different sub-families have been shown to display very different discrimination consistency behaviors. The individual identification of consistently misclassified sequences should provide a tool for quality control to GPCR database curators.
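The voting-based notion of consistency described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the two thresholds (`min_wrong`, `min_share`), the sequence identifiers, and the sub-family labels (`mG`, `CS`, `GB`) are all hypothetical, and the per-tree votes are written by hand rather than produced by a trained forest. A sequence is flagged when most trees vote for a wrong sub-family and nearly all of those wrong votes agree on the same one.

```python
from collections import Counter


def vote_consistency(tree_votes, true_label):
    """Summarize the base-tree votes for one sequence.

    Returns (wrong_fraction, top_wrong_label, top_wrong_share):
    - wrong_fraction: share of trees voting for any wrong sub-family
    - top_wrong_label: the most frequently chosen wrong sub-family
    - top_wrong_share: share of the wrong votes that went to that sub-family
    """
    wrong = Counter(v for v in tree_votes if v != true_label)
    n_wrong = sum(wrong.values())
    if n_wrong == 0:
        return 0.0, None, 0.0
    label, count = wrong.most_common(1)[0]
    return n_wrong / len(tree_votes), label, count / n_wrong


def shortlist(votes_by_sequence, true_labels, min_wrong=0.8, min_share=0.9):
    """Flag sequences that are often misclassified, and mostly to one class.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    flagged = []
    for seq_id, votes in votes_by_sequence.items():
        wrong_fraction, label, share = vote_consistency(votes, true_labels[seq_id])
        if wrong_fraction >= min_wrong and share >= min_share:
            flagged.append((seq_id, true_labels[seq_id], label, wrong_fraction))
    # Most consistently misclassified sequences first.
    return sorted(flagged, key=lambda row: row[3], reverse=True)


# Hand-made votes from a hypothetical 10-tree forest; all labelled "mG".
votes = {
    "seq_a": ["mG"] * 9 + ["CS"],              # mostly correct
    "seq_b": ["CS"] * 9 + ["mG"],              # wrong, always to the same class
    "seq_c": ["CS"] * 5 + ["GB"] * 4 + ["mG"],  # wrong, but votes are split
}
labels = {"seq_a": "mG", "seq_b": "mG", "seq_c": "mG"}

result = shortlist(votes, labels)
print(result)
```

Only `seq_b` meets both conditions here; `seq_c` is misclassified just as often, but its wrong votes are split between two sub-families, so it does not count as consistent under this definition. With scikit-learn, per-tree votes of this kind could plausibly be collected from the `estimators_` attribute of a fitted `RandomForestClassifier`, keeping in mind that the individual trees predict encoded class indices that need mapping back through `classes_`.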
