Label noise in subtype discrimination of class C G protein-coupled receptors: A systematic approach to the analysis of classification errors

Background: The characterization of proteins in families and subfamilies, at different levels, entails the definition and use of class labels. When the assignment of a protein to a family is uncertain, or even wrong, this becomes an instance of what has come to be known as a label noise problem. Label noise has a potentially negative effect on any quantitative analysis of proteins that depends on label information. This study investigates class C of G protein-coupled receptors (GPCRs), which are cell membrane proteins of relevance both to biology in general and to pharmacology in particular. Their supervised classification into different known subtypes, based on primary sequence data, is hampered by label noise. This noise may stem from a combination of the limitations of expert knowledge and the lack of a clear correspondence between labels, which mostly reflect GPCR functionality, and the different representations of the protein primary sequences.

Results: In this study, we describe a systematic approach, using Support Vector Machine classifiers, to the analysis of G protein-coupled receptor misclassifications. As a proof of concept, this approach is used to assist the discovery of labeling quality problems in a curated, publicly accessible database of this type of protein. We also investigate the extent to which physico-chemical transformations of the protein sequences reflect G protein-coupled receptor subtype labeling. The candidate mislabeled cases detected with this approach are externally validated with phylogenetic trees and against further trusted sources such as the National Center for Biotechnology Information, Universal Protein Resource, European Bioinformatics Institute and Ensembl Genome Browser information repositories.

Conclusions: In quantitative classification problems, class labels are often assumed by default to be correct. Label noise, though, is bound to be a pervasive problem in bioinformatics, where labels may be obtained indirectly through complex, many-step similarity modelling processes. In the case of G protein-coupled receptors, methods capable of singling out and characterizing those sequences with consistent misclassification behaviour are required to minimize this problem. A systematic, Support Vector Machine-based method has been proposed in this study for this purpose. The proposed method enables a filtering approach to the label noise problem and might become a support tool for database curators in proteomics.
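The idea of singling out sequences with consistent misclassification behaviour can be illustrated with a minimal sketch. It is not the procedure of the study: the study uses Support Vector Machine classifiers on transformed protein sequences, whereas this sketch substitutes a simple nearest-centroid classifier so that it runs with the Python standard library alone. The function names, the toy two-dimensional data, the number of repetitions and the 0.75 threshold are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier (assumption: replaces the SVM used in the study)."""
    sums, counts = {}, defaultdict(int)
    for features, label in zip(train_X, train_y):
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    best_label, best_dist = None, float("inf")
    for label, total in sums.items():
        centroid = [s / counts[label] for s in total]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def misclassification_rates(X, y, n_repeats=20, n_folds=5, seed=0):
    """Fraction of repeated cross-validation runs in which each item,
    when held out, is predicted to have a label other than its own."""
    rng = random.Random(seed)
    errors = [0] * len(X)
    for _ in range(n_repeats):
        order = list(range(len(X)))
        rng.shuffle(order)
        folds = [order[k::n_folds] for k in range(n_folds)]
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            train_X = [X[i] for i in train_idx]
            train_y = [y[i] for i in train_idx]
            for i in test_idx:
                if nearest_centroid_predict(train_X, train_y, X[i]) != y[i]:
                    errors[i] += 1
    return [e / n_repeats for e in errors]


def flag_candidates(rates, threshold=0.75):
    """Indices misclassified in at least `threshold` of the repetitions."""
    return [i for i, rate in enumerate(rates) if rate >= threshold]


# Toy data (hypothetical): two well-separated groups plus one item that
# sits in group "A" but carries the label "B".
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8),
     (5, 5), (5, 6), (6, 5), (6, 6), (5.5, 5.5), (5.2, 5.8),
     (0.4, 0.6)]
y = ["A"] * 6 + ["B"] * 6 + ["B"]

rates = misclassification_rates(X, y)
flagged = flag_candidates(rates)
print(flagged)
```

Items flagged in this way are only candidates for mislabeling. As described above, they would still need external validation, for example against phylogenetic trees or trusted sequence repositories, before any label is changed.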
