Functional and structural genomics using PEDANT

MOTIVATION Enormous demand for fast and accurate analysis of biological sequences is fuelled by the pace of genome analysis efforts. There is also an acute need in reliable up-to-date genomic databases integrating both functional and structural information. Here we describe the current status of the PEDANT software system for high-throughput analysis of large biological sequence sets and the genome analysis server associated with it. RESULTS The principal features of PEDANT are: (i) completely automatic processing of data using a wide range of bioinformatics methods, (ii) manual refinement of annotation, (iii) automatic and manual assignment of gene products to a number of functional and structural categories, (iv) extensive hyperlinked protein reports, and (v) advanced DNA and protein viewers. The system is easily extensible and allows to include custom methods, databases, and categories with minimal or no programming effort. PEDANT is actively used as a collaborative environment to support several on-going genome sequencing projects. The main purpose of the PEDANT genome database is to quickly disseminate well-organized information on completely sequenced and unfinished genomes. It currently includes 80 genomic sequences and in many cases serves as the only source of exhaustive information on a given genome. The database also acts as a vehicle for a number of research projects in bioinformatics. Using SQL queries, it is possible to correlate a large variety of pre-computed properties of gene products encoded in complete genomes with each other and compare them with data sets of special scientific interest. In particular, the availability of structural predictions for over 300 000 genomic proteins makes PEDANT the most extensive structural genomics resource available on the web.

[1]  C DeLisi,et al.  The detection and classification of membrane-spanning proteins. , 1985, Biochimica et biophysica acta.

[2]  Amos Bairoch,et al.  The PROSITE database, its status in 1997 , 1997, Nucleic Acids Res..

[3]  Tim J. P. Hubbard,et al.  SCOP: a Structural Classification of Proteins database , 1999, Nucleic Acids Res..

[4]  M. Adams,et al.  A tool for analyzing and annotating genomic sequences. , 1997, Genomics.

[5]  Alain Viari,et al.  Imagene: an integrated computer environment for sequence annotation and analysis , 1999, Bioinform..

[6]  Mikhail S. Gelfand,et al.  Combining diverse evidence for gene recognition in completely sequenced bacterial genomes , 1998, German Conference on Bioinformatics.

[7]  Dmitrij Frishman,et al.  The genome sequence of the thermoacidophilic scavenger Thermoplasma acidophilum , 2000, Nature.

[8]  Dmitrij Frishman,et al.  PEDANTic genome analysis , 1997 .

[9]  Patrice Koehl,et al.  The ASTRAL compendium for protein structure and sequence analysis , 2000, Nucleic Acids Res..

[10]  Tim J. P. Hubbard,et al.  SCOP: a Structural Classification of Proteins database , 2000, Nucleic Acids Res..

[11]  A. Lupas,et al.  Predicting coiled coils from protein sequences , 1991, Science.

[12]  Dmitrij Frishman,et al.  Identification of in vivo substrates of the chaperonin GroEL , 1999, Nature.

[13]  Chris Sander,et al.  What's in a genome? , 1992, Nature.

[14]  Amos Bairoch,et al.  The PROSITE database, its status in 1999 , 1999, Nucleic Acids Res..

[15]  M. Cotton,et al.  Sequence and analysis of chromosome 4 of the plant Arabidopsis thaliana , 1999, Nature.

[16]  S. Osawa,et al.  Recent evidence for evolution of the genetic code , 1992, Microbiological reviews.

[17]  N. Harris,et al.  Genotator: a workbench for sequence annotation. , 1997, Genome research.

[18]  Chris Sander,et al.  GeneQuiz: A Workbench for Sequence Analysis , 1994, ISMB.

[19]  John C. Wootton,et al.  Statistics of Local Complexity in Amino Acid Sequences and Sequence Databases , 1993, Comput. Chem..

[20]  Miguel A. Andrade-Navarro,et al.  Automated genome sequence analysis and annotation , 1999, Bioinform..

[21]  Amos Bairoch,et al.  The PROSITE database, its status in 2002 , 2002, Nucleic Acids Res..

[22]  A Bairoch,et al.  Go hunting in sequence databases but watch out for the traps. , 1996, Trends in genetics : TIG.

[23]  Eugene V. Koonin,et al.  SEALS: A System for Easy Analysis of Lots of Sequences , 1997, ISMB.

[24]  S. Eddy,et al.  tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. , 1997, Nucleic acids research.

[25]  D. Lipman,et al.  A genomic perspective on protein families. , 1997, Science.

[26]  M. Riley,et al.  Functions of the gene products of Escherichia coli , 1993, Microbiological reviews.

[27]  R. Overbeek,et al.  The use of gene clusters to infer functional coupling. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[28]  R. Fleischmann,et al.  Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. , 1995, Science.

[29]  T. N. Bhat,et al.  The Protein Data Bank , 2000, Nucleic Acids Res..

[30]  P. Argos,et al.  Knowledge‐based protein secondary structure assignment , 1995, Proteins.

[31]  S. Karlin,et al.  Prediction of complete gene structures in human genomic DNA. , 1997, Journal of molecular biology.

[32]  C. V. Jongeneel,et al.  ESTScan: A Program for Detecting, Evaluating, and Reconstructing Potential Coding Regions in EST Sequences , 1999, ISMB.

[33]  A. Goffeau,et al.  The complete genome sequence of the Gram-positive bacterium Bacillus subtilis , 1997, Nature.

[34]  David L. Wild,et al.  Protein analyst - a distributed object environment for protein sequence and structure analysis , 1999, Bioinform..

[35]  J. Schug,et al.  GAIA: framework annotation of genomic sequence. , 1998, Genome research.

[36]  Erik L. L. Sonnhammer,et al.  A workbench for large-scale sequence homology analysis , 1994, Comput. Appl. Biosci..

[37]  P. Argos,et al.  Seventy‐five percent accuracy in protein secondary structure prediction , 1997, Proteins.

[38]  J. Newman,et al.  Class‐directed structure determination: Foundation for a protein structure initiative , 1998, Protein science : a publication of the Protein Society.

[39]  S. Brunak,et al.  SHORT COMMUNICATION Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites , 1997 .

[40]  Shmuel Pietrokovski,et al.  Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations , 1999, Bioinform..

[41]  Michael Y. Galperin,et al.  Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement, and operon disruption , 1998, Silico Biol..

[42]  André Goffeau,et al.  The yeast genome directory. , 1997, Nature.

[43]  T Gaasterland,et al.  MAGPIE: automated genome interpretation. , 1996, Trends in genetics : TIG.

[44]  Alejandro A. Schäffer,et al.  IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices , 1999, Bioinform..

[45]  T H Jukes,et al.  Evolutionary changes in the genetic code , 1990, Proceedings of the Royal Society of London. Series B: Biological Sciences.

[46]  Hans-Werner Mewes,et al.  the yeast genome , 1997 .

[47]  Peter B. McGarvey,et al.  The Protein Information Resource (PIR) , 2000, Nucleic Acids Res..