Protein Sequence–Structure–Function–Network Links Discovered with the ANNOTATOR Software Suite: Application to ELYS/Mel-28

While very little genomic sequence can be interpreted directly in terms of biological mechanism, the chances are much better for protein-coding genes, which can be translated into protein sequences. This review considers the different concepts applicable to sequence analysis and function prediction of globular and non-globular protein segments. The publicly accessible ANNOTATOR software environment integrates most of the reliable protein sequence-based function prediction methods, protein domain databases, and pathway and protein–protein interaction collections developed in academia. As an application example, the structural and functional domains of mel-28/ELYS, an important nuclear protein, are delineated and proposed for experimental follow-up in structural biology and functional studies.
