TransportTP: A two-phase classification approach for membrane transporter prediction and characterization

Background: Membrane transporters play crucial roles in living cells. Experimental characterization of transporters is costly and time-consuming. Current computational methods for transporter characterization still require extensive curation efforts, especially for eukaryotic organisms. We developed a novel genome-scale transporter prediction and characterization system called TransportTP that combines homology-based and machine learning methods in a two-phase classification approach. First, traditional homology methods are employed to predict novel transporters based on sequence similarity to known classified proteins in the Transporter Classification Database (TCDB). Second, machine learning methods are used to integrate a variety of features to refine the initial predictions. A set of rules based on transporter features was derived by machine learning, using well-curated proteomes as guides.

Results: In a cross-validation using the yeast proteome for training and the proteomes of ten other organisms for testing, TransportTP achieved equal recall and precision of 81.8%, measured against TransportDB, a manually annotated transporter database. In an independent test using the Arabidopsis proteome for training and four recently sequenced plant proteomes for testing, it achieved a recall of 74.6% and a precision of 73.4%, according to our manual curation.

Conclusions: TransportTP is the most effective tool for eukaryotic transporter characterization to date.
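
The two-phase idea described above can be illustrated with a minimal sketch. It is not the TransportTP implementation: the `Candidate` fields, the E-value cutoffs, and the hand-written refinement rule are all hypothetical stand-ins (in the paper, the phase-two rules are learned from well-curated proteomes rather than written by hand). The sketch only shows how a permissive homology filter can be followed by a feature-based refinement step, and how recall and precision are then computed against a curated reference set.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical per-protein summary; field names are illustrative only."""
    protein_id: str
    best_hit_evalue: float  # E-value of the best hit against known transporters
    tm_segments: int        # number of predicted transmembrane segments
    shares_domain: bool     # whether the protein shares a domain with its best hit


def phase1_homology(proteins, evalue_cutoff=1e-3):
    """Phase 1: keep every protein with a sufficiently similar known transporter."""
    return [p for p in proteins if p.best_hit_evalue <= evalue_cutoff]


def phase2_refine(candidates, min_tm=2, strong_evalue=1e-30):
    """Phase 2: refine phase-1 candidates with additional features.

    Assumed rule (a stand-in for learned rules): accept very strong hits
    outright; otherwise require supporting evidence from both features.
    """
    kept = []
    for p in candidates:
        strong_hit = p.best_hit_evalue <= strong_evalue
        supported = p.tm_segments >= min_tm and p.shares_domain
        if strong_hit or supported:
            kept.append(p)
    return kept


def precision_recall(predicted_ids, true_ids):
    """Precision = TP / predicted, recall = TP / true, against a curated set."""
    predicted, truth = set(predicted_ids), set(true_ids)
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall


# Toy data, invented purely for illustration.
proteins = [
    Candidate("P1", 1e-50, 12, True),   # very strong hit
    Candidate("P2", 1e-5, 6, True),     # weaker hit with supporting features
    Candidate("P3", 1e-4, 0, False),    # passes phase 1, no supporting features
    Candidate("P4", 1.0, 10, True),     # no significant homology
    Candidate("P5", 1e-10, 1, True),    # too few transmembrane segments
]
curated_transporters = {"P1", "P2", "P4"}

candidates = phase1_homology(proteins)
predictions = phase2_refine(candidates)
precision, recall = precision_recall(
    [p.protein_id for p in predictions], curated_transporters
)
print([p.protein_id for p in predictions], precision, recall)
```

In this toy run, phase one admits four proteins and phase two discards two of them, which raises precision; the curated transporter without a significant homolog (`P4`) is missed in phase one and so lowers recall.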
