Paper information - k-link EST clustering: evaluating error introduced by chimeric sequences under different degrees of linkage

Motivation: The clustering of expressed sequence tags (ESTs) is a crucial step in many sequence analysis studies that require a high level of redundancy. Chimeric sequences, while uncommon, can make achieving the optimal EST clustering a challenge. Single-linkage algorithms are particularly vulnerable to the effects of chimeras. To avoid chimera-facilitated erroneous merges, researchers using single-linkage algorithms are forced to use stringent sequence-similarity thresholds. Such thresholds reduce the sensitivity of the clustering algorithm.

Results: We introduce the concept of k-link clustering for EST data. We evaluate how clustering error rates vary over a range of linkage thresholds. Using k-link, we show that Type II error decreases as the number of shared ESTs (i.e. links) required increases. We observe a base level of Type II error likely caused by the presence of unmasked low-complexity or repetitive sequence. We find that Type I error increases gradually with increased linkage. To minimize the Type I error introduced by increased linkage requirements, we propose an extension to k-link which modifies the required number of links with respect to the size of the clusters being compared.

Availability: k-link is written in C++ and is available under the terms of the GNU General Public License (GPL) from http://www.bioinformatics.csiro.au/products.shtml.

Contact: lauren.bragg@csiro.au

Supplementary information: Supplementary data are available at Bioinformatics online.
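
The linkage idea in the Results can be illustrated with a small sketch. It is not the authors' C++ implementation, and it rests on assumptions the abstract does not spell out: a "link" is taken to be a pair of ESTs, one from each cluster, that passes the similarity threshold; clusters are merged greedily; and the required number of links is capped at the number of possible cross-cluster pairs so that singletons can still seed clusters. The function name `k_link_cluster` and the toy data are hypothetical.

```python
from itertools import combinations


def k_link_cluster(n_items, similar_pairs, k):
    """Greedy agglomerative sketch of k-link clustering.

    n_items       -- number of ESTs, labelled 0 .. n_items - 1
    similar_pairs -- pairs (i, j) that passed the similarity threshold
    k             -- links required before two clusters may merge
    """
    if k < 1:
        raise ValueError("k must be at least 1")

    clusters = {i: {i} for i in range(n_items)}  # cluster id -> member ESTs
    links = {frozenset(p) for p in similar_pairs if p[0] != p[1]}

    merged = True
    while merged:
        merged = False
        for a, b in combinations(sorted(clusters), 2):
            # Count similarity links running between the two clusters.
            n_links = sum(
                1
                for x in clusters[a]
                for y in clusters[b]
                if frozenset((x, y)) in links
            )
            # Assumption: cap the requirement at the number of possible
            # cross-cluster pairs, otherwise two singletons never merge.
            required = min(k, len(clusters[a]) * len(clusters[b]))
            if n_links >= required:
                clusters[a] |= clusters.pop(b)
                merged = True
                break  # cluster set changed, so restart the scan
    return sorted(sorted(c) for c in clusters.values())


# Toy data: ESTs 0-2 come from one gene, 3-5 from another, and EST 6 is a
# chimera that is similar to one EST from each gene.
pairs = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 6), (3, 6)]

print(k_link_cluster(7, pairs, k=1))  # [[0, 1, 2, 3, 4, 5, 6]]
print(k_link_cluster(7, pairs, k=2))  # [[0, 1, 2], [3, 4, 5], [6]]
```

With k = 1 the sketch behaves like single linkage, and the chimera bridges the two genes into one cluster (an erroneous merge). With k = 2 the single link on each side of the chimera is no longer enough, so the two genes stay apart and the chimera is left as a singleton. The greedy scan is order dependent, which is a simplification of this sketch.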

Glenn Stone | Lauren M. Bragg
