EasyCluster: a fast and efficient gene-oriented clustering tool for large-scale transcriptome data

Background
ESTs and full-length cDNAs represent an invaluable source of evidence for inferring reliable gene structures and discovering potential alternative splicing events. In newly sequenced genomes, these tasks may not be practicable owing to the lack of appropriate training sets. However, when expression data are available, they can be used to build EST clusters related to specific transcribed genomic loci. Common strategies recently employed to this end are based on sequence similarity between transcripts and can lead, in specific conditions, to inconsistent and erroneous clustering. To improve cluster building and facilitate all downstream annotation analyses, we developed a simple genome-based methodology that generates gene-oriented clusters of ESTs when a genomic sequence and a pool of related expressed sequences are provided. Our procedure has been implemented in the software EasyCluster and takes into account the spliced nature of ESTs after an ad hoc genomic mapping.

Methods
EasyCluster uses the well-known GMAP program to perform a very quick EST-to-genome mapping and to detect reliable splice sites. Given a genomic sequence and a pool of ESTs/FL-cDNAs, EasyCluster first builds local genomic and EST databases and runs GMAP. It then parses the results, creating an initial collection of pseudo-clusters by grouping ESTs according to the overlap of their genomic coordinates on the same strand. In the final step, EasyCluster refines the clustering by running GMAP again on each pseudo-cluster and grouping together ESTs that share at least one splice site.

Results
The higher accuracy of EasyCluster with respect to other clustering tools has been verified by means of a manually curated benchmark of human EST clusters. Additional datasets, including the Unigene cluster Hs.122986 and ESTs related to the human HOXA gene family, have also been used to demonstrate the better clustering capability of EasyCluster over current genome-based web service tools such as ASmodeler and BIPASS. EasyCluster has also been used to provide a first compilation of gene-oriented clusters in the oilseed plant Ricinus communis, for which no Unigene clusters are yet available, as well as an evaluation of alternative splicing in this plant species.
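
The two grouping steps described in the Methods can be illustrated with a minimal Python sketch. It is not the EasyCluster implementation: it assumes the EST-to-genome alignments have already been computed and are supplied as hypothetical `Alignment` records (EST identifier, strand, and exon coordinates taken as 1-based and inclusive), and it skips the second GMAP run on each pseudo-cluster. The names `Alignment`, `pseudo_clusters`, `refine` and `cluster` are illustrative only. How unspliced ESTs are treated is not stated in the abstract, so the sketch simply leaves them as singletons.

```python
from collections import defaultdict
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical record for one EST aligned to the genome (not GMAP output)."""
    est_id: str
    strand: str    # "+" or "-"
    exons: tuple   # ((start, end), ...) sorted by start, inclusive coordinates


def span(aln):
    """Genomic interval covered by the whole alignment."""
    return aln.exons[0][0], aln.exons[-1][1]


def splice_sites(aln):
    """Intron boundaries: the end of each exon and the start of the next one."""
    sites = set()
    for (_, left_end), (right_start, _) in zip(aln.exons, aln.exons[1:]):
        sites.add(("exon_end", left_end))
        sites.add(("exon_start", right_start))
    return sites


def pseudo_clusters(alignments):
    """Step 1: group ESTs whose genomic spans overlap on the same strand."""
    by_strand = defaultdict(list)
    for aln in alignments:
        by_strand[aln.strand].append(aln)
    clusters = []
    for strand in sorted(by_strand):
        current, current_end = [], None
        for aln in sorted(by_strand[strand], key=span):
            start, end = span(aln)
            if current and start > current_end:
                # No overlap with the running interval: close the cluster.
                clusters.append(current)
                current, current_end = [], None
            current.append(aln)
            current_end = end if current_end is None else max(current_end, end)
        if current:
            clusters.append(current)
    return clusters


def refine(pseudo_cluster):
    """Step 2: within a pseudo-cluster, join ESTs sharing at least one splice site."""
    parent = {aln.est_id: aln.est_id for aln in pseudo_cluster}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}
    for aln in pseudo_cluster:
        for site in splice_sites(aln):
            if site in first_seen:
                parent[find(aln.est_id)] = find(first_seen[site])
            else:
                first_seen[site] = aln.est_id
    groups = defaultdict(list)
    for aln in pseudo_cluster:
        groups[find(aln.est_id)].append(aln.est_id)
    # Assumption: unspliced ESTs share no site and therefore stay as singletons.
    return sorted(sorted(ids) for ids in groups.values())


def cluster(alignments):
    """Run both steps and return the refined clusters as lists of EST ids."""
    result = []
    for pc in pseudo_clusters(alignments):
        result.extend(refine(pc))
    return result


# Made-up example data, for illustration only.
example = [
    Alignment("est1", "+", ((100, 200), (300, 400))),
    Alignment("est2", "+", ((150, 200), (300, 450))),
    Alignment("est3", "+", ((350, 420), (500, 600))),
    Alignment("est4", "-", ((120, 380),)),
    Alignment("est5", "+", ((1000, 1100), (1200, 1300))),
]
print(cluster(example))
```

In this made-up example, est1, est2 and est3 overlap on the plus strand and fall into one pseudo-cluster, but only est1 and est2 share a splice site, so est3 is split off during refinement. This is the case where coordinate overlap alone would merge transcripts that the splice-site criterion keeps apart.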
