Predicting function: from genes to genomes and back.

Predicting function from sequence with computational tools is a complicated procedure that is generally carried out for each gene individually. This review focuses on the added value that completely sequenced genomes provide for function prediction. Various levels of sequence annotation and function prediction are discussed, ranging from the genomic sequence to complex cellular processes. Protein function is currently best described in the context of molecular interactions. In the near future it will be possible to predict protein function in the context of higher-order processes such as the regulation of gene expression, metabolic pathways and signalling cascades. Analysis at these higher levels of function description draws not only on completely sequenced genomes but also on proteomics and expression data. The final goal will be to elucidate the mapping between genotype and phenotype.
