k-link EST clustering: evaluating error introduced by chimeric sequences under different degrees of linkage

Motivation: The clustering of expressed sequence tags (ESTs) is a crucial step in many sequence analysis studies that require a high level of redundancy. Chimeric sequences, while uncommon, can make it difficult to achieve an optimal EST clustering. Single-linkage algorithms are particularly vulnerable to chimeras, because a single chimeric EST is enough to join two otherwise unrelated clusters. To avoid such chimera-facilitated erroneous merges, researchers using single-linkage algorithms are forced to use stringent sequence-similarity thresholds, which reduce the sensitivity of the clustering algorithm.

Results: We introduce the concept of k-link clustering for EST data and evaluate how clustering error rates vary over a range of linkage thresholds. Using k-link, we show that Type II error decreases as the number of shared ESTs (i.e. links) required is increased. We observe a base level of Type II error, likely caused by the presence of unmasked low-complexity or repetitive sequence. We find that Type I error increases gradually with increased linkage. To minimize the Type I error introduced by increased linkage requirements, we propose an extension to k-link that modifies the required number of links with respect to the size of the clusters being compared.

Availability: k-link is written in C++ and is licensed under the GNU General Public License (GPL). It can be downloaded from http://www.bioinformatics.csiro.au/products.shtml.

Contact: lauren.bragg@csiro.au

Supplementary information: Supplementary data are available at Bioinformatics online.
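Illustration: The linkage criterion described in the Results can be sketched in a few lines of Python. This is a simplified reading of the abstract, not the authors' C++ implementation. It assumes that a link is a pair of ESTs, one from each cluster, that passes the similarity threshold. The function names, the toy EST identifiers and the size-aware rule `min(k, smaller cluster size)` are all hypothetical; the abstract does not state how the required number of links is scaled with cluster size.

```python
from itertools import product


def count_links(cluster_a, cluster_b, similar_pairs):
    """Count cross-cluster EST pairs that pass the similarity threshold.

    Assumption: a "link" is one such pair; similar_pairs holds frozensets
    of two EST identifiers.
    """
    return sum(
        1
        for a, b in product(cluster_a, cluster_b)
        if frozenset((a, b)) in similar_pairs
    )


def required_links(cluster_a, cluster_b, k):
    """Hypothetical size-aware requirement (placeholder rule, not from the paper):
    never demand more links than the smaller cluster has ESTs."""
    return min(k, len(cluster_a), len(cluster_b))


def should_merge(cluster_a, cluster_b, similar_pairs, k):
    """Merge only if the clusters share at least the required number of links."""
    links = count_links(cluster_a, cluster_b, similar_pairs)
    return links >= required_links(cluster_a, cluster_b, k)


# Toy data: a chimeric EST sitting in cluster B has one spurious link to a1.
gene_a = {"a1", "a2", "a3"}
gene_b = {"b1", "b2", "b3", "chimera"}
similar_pairs = {frozenset(("a1", "chimera"))}

# k = 1 corresponds to single linkage: the single chimeric link merges the clusters.
single_linkage_merges = should_merge(gene_a, gene_b, similar_pairs, k=1)

# k = 2 requires a second, independent link, so the chimera alone is not enough.
two_link_merges = should_merge(gene_a, gene_b, similar_pairs, k=2)

# Under the placeholder size-aware rule, a singleton still joins with one link.
singleton_merges = should_merge(
    {"a4"}, gene_a, {frozenset(("a4", "a1"))}, k=2
)
```

With these toy inputs, one chimeric link is sufficient to merge the two clusters at k = 1 but not at k = 2, while a genuine singleton is not stranded by the higher linkage requirement. The second point is the kind of Type I error the size-dependent extension is meant to limit, although the actual rule used by k-link may differ.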
