Searching the Expressed Sequence Tag (EST) Databases: Panning for Genes

The genomes of living organisms contain many elements, including genes coding for proteins. The portions of the genes expressed as mature mRNA, collectively known as the transcriptome, represent only a small part of the genome. The expressed sequence tag (EST) databases contain an increasingly large part of the transcriptome of many species. For this reason, these databases are probably the most abundant source of new coding sequences available today. However, the raw data deposited in the EST databases are to a large extent unorganised, unannotated, redundant and of relatively low quality. This paper reviews some of the characteristics of the EST data, and the methods that can be used to find novel protein sequences within them. It also documents a collection of databases, software and web sites that can be useful to biologists interested in mining the EST databases over the Internet, or in establishing a local environment for such analyses.
