CAFTAN: a tool for fast mapping, and quality assessment of cDNAs

Background
The German cDNA Consortium has been cloning full-length cDNAs and exploiting them in protein localization experiments and cellular assays. However, the efficient use of large cDNA resources requires strategies that can rapidly separate truly useful cDNAs from biological and experimental noise. To this end we have developed a new high-throughput analysis tool, CAFTAN, which simplifies these efforts and thus fills the gap between large-scale cDNA collections and their systematic annotation and application in functional genomics.

Results
CAFTAN is built around the mapping of cDNAs to the genome assembly and the subsequent analysis of their genomic context. It uses sequence features such as the presence and type of polyA signals, inner and flanking repeats, GC content, and splice site types. Each feature is evaluated in an individual test, and the results are used to classify cDNAs according to their sequence quality and their likelihood of having been generated from fully processed mRNAs. Additionally, CAFTAN compares the coordinates of mapped cDNAs with the genomic coordinates of reference sets from publicly available resources (e.g., VEGA, ENSEMBL). This provides detailed information about overlapping exons and a structural classification of cDNAs with respect to the reference set of splice variants.

The evaluation of CAFTAN showed that it is able to correctly classify more than 85% of 5950 selected "known protein-coding" VEGA cDNAs as high-quality multi- or single-exon cDNAs. It identified as good 80.6% of the single-exon cDNAs and 85% of the multiple-exon cDNAs.

The program is written in Perl in a modular way, which allows the strategy to be adapted to other tasks such as EST annotation, or extended by adding new classification rules and new organism databases as they become available. We think that it is a very useful program for the annotation and study of unfinished genomes.

Conclusion
CAFTAN is a high-throughput sequence analysis tool that performs a fast and reliable quality prediction of cDNAs. Several thousand cDNAs can be analyzed in a short time, giving the curator or scientist a quick first overview of the quality and existing annotation of a set of cDNAs. It supports the rejection of low-quality cDNAs and helps in the selection of likely novel splice variants and/or completely novel transcripts for new experiments.
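
The comparison of mapped cDNA coordinates with a reference set, as described in the Results, can be illustrated with a minimal sketch. This is not CAFTAN's own code (CAFTAN is written in Perl). The function names, the category labels, and the 1-based inclusive coordinate convention are assumptions made for illustration, and the sketch presumes that both transcripts lie on the same chromosome and strand.

```python
from typing import List, Tuple

# Hypothetical representation: an exon is (start, end) in 1-based,
# inclusive genomic coordinates on one chromosome and strand.
Exon = Tuple[int, int]


def overlap_length(a: Exon, b: Exon) -> int:
    """Number of bases shared by two exons (0 if they do not overlap)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def overlapping_exons(cdna: List[Exon], ref: List[Exon]) -> List[Tuple[Exon, Exon, int]]:
    """List every (cDNA exon, reference exon, shared bases) pair that overlaps."""
    pairs = []
    for c in cdna:
        for r in ref:
            shared = overlap_length(c, r)
            if shared > 0:
                pairs.append((c, r, shared))
    return pairs


def classify_against_reference(cdna: List[Exon], ref: List[Exon]) -> str:
    """Assign an illustrative structural label (labels are assumptions)."""
    if sorted(cdna) == sorted(ref):
        return "identical"      # same exon structure as the reference
    if overlapping_exons(cdna, ref):
        return "overlapping"    # shares exonic bases but differs in structure
    return "no_overlap"         # candidate for a novel transcript


reference = [(100, 200), (300, 400)]
variant = [(100, 200), (350, 450)]
print(classify_against_reference(variant, reference))
print(overlapping_exons(variant, reference))
```

A cDNA labelled "overlapping" here would correspond to a possible splice variant of the reference transcript, while "no_overlap" would flag a candidate for a completely novel transcript. A real implementation would need finer categories and would compare against all reference transcripts at the locus.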
