Protein subcellular localization prediction using artificial intelligence technology.

Proteins perform many essential tasks in living organisms, such as catalyzing biochemical reactions, transporting nutrients, and recognizing and transmitting signals. The many aspects of a protein's role are collectively referred to as its "function." One aspect of protein function that has been the target of intensive research by computational biologists is subcellular localization. Proteins must be localized in the same subcellular compartment to cooperate toward a common physiological function, and aberrant localization can result in several diseases, including kidney stones, cancer, and Alzheimer's disease. To date, sequence homology remains the most widely used method for inferring the function of a protein. However, the application of advanced artificial intelligence (AI)-based techniques in recent years has significantly improved our ability to predict the subcellular localization of a protein. Prediction accuracy has risen steadily, in large part due to AI-based methods such as hidden Markov models (HMMs), neural networks (NNs), and support vector machines (SVMs), although the availability of larger experimental datasets has also played a role. Automatic methods that mine textual information from the biological literature and molecular biology databases have considerably sped up annotation for proteins whose function is already partly described in the literature. State-of-the-art methods based on NNs and HMMs can predict the presence of N-terminal sorting signals with very high accuracy. Ab initio methods, which predict subcellular localization for any protein using only its native amino acid sequence and features predicted from that sequence, have shown the most remarkable improvements: their prediction accuracy has increased by over 30% in the past decade. The accuracy of these methods is now on par with high-throughput methods for predicting localization, and they are beginning to play an important role in directing experimental research. In this chapter, we review some of the most important methods for the prediction of subcellular localization.
