Using machine learning tools for protein database biocuration assistance

Biocuration in the omics sciences has become paramount as research in these fields rapidly evolves towards increasingly data-dependent models. As a result, the management of web-accessible, publicly available databases has become a central task in the dissemination of biological knowledge. One relevant challenge for biocurators is the unambiguous identification of biological entities. In this study, we illustrate the adequacy of machine learning methods as biocuration assistance tools, using a publicly available protein database as an example. This database contains information on G Protein-Coupled Receptors (GPCRs), which are part of eukaryotic cell membranes, play a relevant role in cell communication, and are major drug targets in pharmacology. These receptors are characterized according to subtype labels. Previous analysis of this database provided evidence that some of the receptor sequences could be affected by label noise, as they were misclassified by machine learning methods too consistently for the errors to be attributed to the classifiers alone. Here, we extend our analysis to recent, substantially modified versions of the database and, using several machine learning models and different transformations of the unaligned sequences, show that their labeling is now extremely accurate. These findings support the adequacy of our proposed method for identifying problematic labeling cases as a tool for database biocuration.
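
The core idea, flagging sequences that are misclassified too consistently to be explained by classifier error alone, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the method used in the study: the alignment-free transformation is plain amino acid composition, the classifier is a simple nearest-centroid rule standing in for the machine learning models, the sequences and the labels `subtype_1` and `subtype_2` are synthetic, and the function names and the 0.9 threshold are hypothetical.

```python
import random
from collections import Counter, defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Alignment-free transformation (assumed here): amino acid frequencies."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS) or 1
    return [counts[a] / total for a in AMINO_ACIDS]

def nearest_centroid_predict(train_x, train_y, x):
    """Stand-in classifier: return the label of the closest class centroid."""
    groups = defaultdict(list)
    for vec, label in zip(train_x, train_y):
        groups[label].append(vec)
    best_label, best_dist = None, float("inf")
    for label in sorted(groups):
        vecs = groups[label]
        centroid = [sum(col) / len(vecs) for col in zip(*vecs)]
        dist = sum((a - b) ** 2 for a, b in zip(centroid, x))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def misclassification_rates(seqs, labels, n_repeats=20, n_folds=5, seed=0):
    """Fraction of repeated k-fold runs in which each sequence is misclassified."""
    rng = random.Random(seed)
    feats = [composition(s) for s in seqs]
    errors = [0] * len(seqs)
    for _ in range(n_repeats):
        idx = list(range(len(seqs)))
        rng.shuffle(idx)
        for f in range(n_folds):
            test = idx[f::n_folds]  # each sequence is tested once per repeat
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            train_x = [feats[i] for i in train]
            train_y = [labels[i] for i in train]
            for i in test:
                if nearest_centroid_predict(train_x, train_y, feats[i]) != labels[i]:
                    errors[i] += 1
    return [e / n_repeats for e in errors]

def flag_suspects(rates, threshold=0.9):
    """Indices of sequences misclassified in at least `threshold` of the runs."""
    return [i for i, r in enumerate(rates) if r >= threshold]

# Synthetic toy data: the last sequence resembles subtype_2 but carries
# the subtype_1 label, mimicking a label-noise case.
toy_seqs = [
    "AAAAAAAAGL", "AAAAAAAGLL", "AAAAAAGGAL",
    "AAAAAGAAAL", "AAALAAAAAA", "AAAAAAAAGG",
    "WWWWWWWWKR", "WWWWWWWKRR", "WWWWWWKKWR",
    "WWWWWKWWWR", "WWWRWWWWWW", "WWWWWWWWKK",
    "WWWWWWWKWR",
]
toy_labels = ["subtype_1"] * 6 + ["subtype_2"] * 6 + ["subtype_1"]

rates = misclassification_rates(toy_seqs, toy_labels)
print(flag_suspects(rates))
```

In this sketch, a sequence is reported to the curator only when it is misclassified in nearly every repetition, so occasional errors caused by an unlucky data split are not flagged. The flagged indices are candidates for expert review, not confirmed labeling errors.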
