JESAM: CORBA software components to create and publish EST alignments and clusters

MOTIVATION: Expressed Sequence Tags (ESTs) are cheap, easy and quick to obtain relative to full genomic sequencing, and they currently sample more eukaryotic genes than any other data source. They are particularly useful for developing Sequence Tag Sites (STSs, for mapping), polymorphism discovery, disease gene hunting and mass spectrometer proteomics and, most ironically, for finding genes and predicting gene structure after the great effort of genomic sequencing. However, ESTs have many problems, and the public EST databases contain all the errors and high redundancy intrinsic to the submitted data. Derived database views, which reduce both errors and redundancy, are therefore often more effective starting points for research than the original raw submissions. Existing derived views, such as EST cluster databases and consensus databases, have never published supporting evidence or intermediary results, which makes it difficult to trust, correct and customize the final published database. These difficulties have led many groups to wastefully repeat the complex intermediary work of others in order to offer slightly different final views. A better approach might be to discover the most expensive calculations common to all the approaches and then publish all intermediary results. Given a globally accessible database with a suitable component interface, like the JESAM software described in this paper, customized EST-derived databases could be created with minimum effort.

RESULTS: Databases of EST and full-length mRNA sequences for four model organisms have been self-compared by searching for overlaps consistent with contiguity. The sequence comparisons are performed in parallel using a PVM process farm, and previous results are stored to allow incremental updates with minimal effort. The overlap databases have been published with CORBA interfaces to enable flexible global access, as demonstrated by example Java applet browsers. Simple cDNA supercluster databases, built as clients of the alignment databases, are themselves published via CORBA interfaces and can be browsed with prototype applets. A comparison with the UniGene Mouse and Rat databases revealed undesirable features in both, and showed the advantages of contrasting perspectives on complex data.

AVAILABILITY: The software is packaged as two Jar files available from http://corba.ebi.ac.uk/EST/jesam/jesam.html. One Jar contains all the Java source code, and the other contains all the C, C++ and IDL code. Links to working examples of the alignment and cluster viewers (if the remote firewall permits) can be found at http://corba.ebi.ac.uk/EST. All the Washington University mouse EST traces are available for browsing at the same URL.
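The supercluster step described under RESULTS can be pictured with a small sketch. The abstract does not say how JESAM groups sequences, so the code below assumes the simplest reading: two sequences belong to the same supercluster whenever a chain of stored overlaps connects them (single-linkage grouping via union-find). The class name `SuperclusterSketch`, its methods and the sequence identifiers are all hypothetical and are not part of the JESAM distribution; in a real client the overlap pairs would come from the published alignment database rather than from hard-coded calls.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: group sequences into superclusters by transitive overlap. */
public class SuperclusterSketch {
    // Union-find forest: each sequence id maps to its parent id.
    private final Map<String, String> parent = new HashMap<>();

    /** Returns the representative id of the cluster containing the given sequence. */
    public String find(String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        // Path compression: point every node on the walked path at the root.
        String cur = id;
        while (!cur.equals(root)) {
            String next = parent.get(cur);
            parent.put(cur, root);
            cur = next;
        }
        return root;
    }

    /** Records one pairwise overlap (assumed to be already filtered for contiguity). */
    public void addOverlap(String a, String b) {
        String ra = find(a);
        String rb = find(b);
        if (!ra.equals(rb)) {
            parent.put(ra, rb);
        }
    }

    /** Returns the current superclusters, each as a sorted set of sequence ids. */
    public Collection<SortedSet<String>> clusters() {
        Map<String, SortedSet<String>> byRoot = new TreeMap<>();
        for (String id : new ArrayList<>(parent.keySet())) {
            byRoot.computeIfAbsent(find(id), k -> new TreeSet<>()).add(id);
        }
        return byRoot.values();
    }

    public static void main(String[] args) {
        SuperclusterSketch sketch = new SuperclusterSketch();
        // Illustrative overlap pairs only; not real data.
        sketch.addOverlap("EST1", "EST2");
        sketch.addOverlap("EST2", "mRNA1");
        sketch.addOverlap("EST3", "EST4");
        for (SortedSet<String> cluster : sketch.clusters()) {
            System.out.println(cluster);
        }
    }
}
```

Because overlaps are added one pair at a time, a structure like this also fits the incremental updates mentioned above: new overlaps can be merged into existing clusters without recomputing earlier comparisons. A stricter clustering criterion would need a different data structure.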
