Massive sequence comparisons as a help in annotating genomic sequences.

An all-by-all comparison of all the publicly available protein sequences from plants has been performed, followed by a clustering step. Within each of the 1064 resulting clusters (which contain orthologous as well as paralogous sequences), the sequences have been submitted to a pyramidal classification and their domains delineated by an automated procedure à la. This process provides a means of easily checking a cluster for apparent inconsistencies, for example, whether one sequence is shorter or longer than the others or whether a domain is missing. In such cases, aligning the DNA sequence of the gene with the sequence of a closely homologous protein often reveals probable sequencing errors (leading to frameshifts) or probable wrong intron/exon predictions; this occurs in 10% of the clusters. The composition of the clusters, their pyramidal classifications, and domain decompositions, as well as our comments when appropriate, are available from http://chlora.infobiogen.fr:1234/PHYTOPROT.
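The length check mentioned above (one sequence shorter or longer than the others in its cluster) can be illustrated with a minimal sketch. It is not the authors' procedure: the function name `flag_length_outliers`, the 20% tolerance, and the toy cluster are assumptions made purely for illustration.

```python
from statistics import median


def flag_length_outliers(cluster, tolerance=0.2):
    """Flag sequences whose length deviates from the cluster median.

    `cluster` maps a sequence identifier to its protein sequence.
    `tolerance` is a hypothetical threshold (relative deviation), not a
    value taken from the paper.
    Returns {seq_id: signed relative deviation} for the flagged sequences.
    """
    if not cluster:
        return {}
    lengths = {seq_id: len(seq) for seq_id, seq in cluster.items()}
    med = median(lengths.values())
    flagged = {}
    for seq_id, length in lengths.items():
        deviation = (length - med) / med
        if abs(deviation) > tolerance:
            flagged[seq_id] = deviation
    return flagged


# Toy cluster with made-up identifiers and placeholder sequences.
example_cluster = {
    "protA": "M" * 300,
    "protB": "M" * 310,
    "protC": "M" * 295,
    "protD": "M" * 180,  # markedly shorter than the other members
}
suspects = flag_length_outliers(example_cluster)
```

A flagged sequence is only a candidate for closer inspection, for instance by aligning the gene's DNA sequence against a close homolog, as described above.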
