SeqTrim: a high-throughput pipeline for pre-processing any type of sequence read

Background: High-throughput automated sequencing has enabled an exponential growth rate of sequencing data. This requires increasing sequence quality and reliability in order to avoid database contamination with artefactual sequences. The arrival of pyrosequencing enhances this problem and necessitates customisable pre-processing algorithms.

Results: SeqTrim has been implemented both as a Web and as a standalone command line application. Already-published and newly-designed algorithms have been included to identify sequence inserts, to remove low quality, vector, adaptor, low complexity and contaminant sequences, and to detect chimeric reads. The availability of several input and output formats allows its inclusion in sequence processing workflows. Due to its specific algorithms, SeqTrim outperforms other pre-processors implemented as Web services or standalone applications. It performs equally well with sequences from EST libraries, SSH libraries, genomic DNA libraries and pyrosequencing reads and does not lead to over-trimming.

Conclusions: SeqTrim is an efficient pipeline designed for pre-processing of any type of sequence read, including next-generation sequencing. It is easily configurable and provides a friendly interface that allows users to know what happened with sequences at every pre-processing stage, and to verify pre-processing of an individual sequence if desired. The recommended pipeline reveals more information about each sequence than previously described pre-processors and can discard more sequencing or experimental artefacts.
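The idea of a staged pre-processing pipeline that records what happened to a read at each step can be illustrated with a minimal sketch. This is not SeqTrim's code or its algorithms: the `Read` class, the three stage functions, the adaptor string `GGACG`, and the thresholds (`min_q=20`, `max_fraction=0.8`) are all hypothetical choices made for illustration, and real adaptor or vector detection would normally rely on alignment rather than an exact prefix match.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Read:
    name: str
    seq: str
    quals: List[int]                      # one Phred-style score per base
    log: List[str] = field(default_factory=list)  # what each stage did


def strip_adaptor_prefix(read: Read, adaptor: str = "GGACG") -> Read:
    """Remove a (hypothetical) adaptor if the read starts with it exactly."""
    if adaptor and read.seq.startswith(adaptor):
        n = len(adaptor)
        return Read(read.name, read.seq[n:], read.quals[n:],
                    read.log + [f"adaptor: removed {n} bases"])
    return Read(read.name, read.seq, read.quals,
                read.log + ["adaptor: none found"])


def trim_quality_ends(read: Read, min_q: int = 20) -> Read:
    """Trim bases below min_q from both ends (assumed threshold)."""
    start, end = 0, len(read.seq)
    while start < end and read.quals[start] < min_q:
        start += 1
    while end > start and read.quals[end - 1] < min_q:
        end -= 1
    return Read(read.name, read.seq[start:end], read.quals[start:end],
                read.log + [f"quality: kept bases {start}..{end}"])


def flag_low_complexity(read: Read, max_fraction: float = 0.8) -> Read:
    """Flag reads dominated by a single base (a crude complexity measure)."""
    if read.seq:
        top = max(read.seq.count(base) for base in set(read.seq))
        if top / len(read.seq) > max_fraction:
            return Read(read.name, read.seq, read.quals,
                        read.log + ["complexity: low"])
    return Read(read.name, read.seq, read.quals,
                read.log + ["complexity: ok"])


def run_pipeline(read: Read, stages: List[Callable[[Read], Read]]) -> Read:
    """Apply each stage in order; the log accumulates one entry per stage."""
    for stage in stages:
        read = stage(read)
    return read


if __name__ == "__main__":
    raw = Read("r1", "GGACGTTTTTCC", [30] * 10 + [5, 5])
    done = run_pipeline(raw, [strip_adaptor_prefix,
                              trim_quality_ends,
                              flag_low_complexity])
    print(done.seq)
    for entry in done.log:
        print(entry)
```

Keeping each stage as a plain function that returns a new `Read` makes the order of stages configurable and leaves a per-read log that can be inspected afterwards, which is the kind of traceability the abstract describes.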
