PET-Tool: a software suite for comprehensive processing and managing of Paired-End diTag (PET) sequence data

BackgroundWe recently developed the Paired End diTag (PET) strategy for efficient characterization of mammalian transcriptomes and genomes. The paired end nature of short PET sequences derived from long DNA fragments raised a new set of bioinformatics challenges, including how to extract PETs from raw sequence reads, and correctly yet efficiently map PETs to reference genome sequences. To accommodate and streamline data analysis of the large volume PET sequences generated from each PET experiment, an automated PET data process pipeline is desirable.ResultsWe designed an integrated computation program package, PET-Tool, to automatically process PET sequences and map them to the genome sequences. The Tool was implemented as a web-based application composed of four modules: the Extractor module for PET extraction; the Examiner module for analytic evaluation of PET sequence quality; the Mapper module for locating PET sequences in the genome sequences; and the ProjectManager module for data organization. The performance of PET-Tool was evaluated through the analyses of 2.7 million PET sequences. It was demonstrated that PET-Tool is accurate and efficient in extracting PET sequences and removing artifacts from large volume dataset. Using optimized mapping criteria, over 70% of quality PET sequences were mapped specifically to the genome sequences. With a 2.4 GHz LINUX machine, it takes approximately six hours to process one million PETs from extraction to mapping.ConclusionThe speed, accuracy, and comprehensiveness have proved that PET-Tool is an important and useful component in PET experiments, and can be extended to accommodate other related analyses of paired-end sequences. The Tool also provides user-friendly functions for data quality check and system for multi-layer data management.

[1]  Jurg Ott,et al.  Nucleotide frequency variation across human genes. , 2003, Genome research.

[2]  A. Sparks,et al.  Using the transcriptome to annotate the genome , 2002, Nature Biotechnology.

[3]  Z. Weng,et al.  A Global Map of p53 Transcription-Factor Binding Sites in the Human Genome , 2006, Cell.

[4]  Antoine H. C. van Kampen,et al.  USAGE: a web-based approach towards the analysis of SAGE data , 2000, Bioinform..

[5]  D. Reidel,et al.  The Transcriptional Landscape of the Mammalian Genome The FANTOM Consortium* and RIKEN Genome Exploration Research Group and Genome Science Group (Genome Network Project Core Group)* , 2005 .

[6]  E. Liu,et al.  5' Long serial analysis of gene expression (LongSAGE) and 3' LongSAGE for transcriptome characterization and genome annotation. , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[7]  E. Liu,et al.  Gene identification signature (GIS) analysis for transcriptome characterization and genome annotation , 2005, Nature Methods.

[8]  S. Salzberg,et al.  The Transcriptional Landscape of the Mammalian Genome , 2005, Science.

[9]  R H Hruban,et al.  Gene expression profiles in normal and cancer cells. , 1997, Science.

[10]  Akhilesh Pandey,et al.  TAGmapper: a web-based tool for mapping SAGE tags. , 2005, Gene.

[11]  X. Chen,et al.  The Oct4 and Nanog transcription network regulates pluripotency in mouse embryonic stem cells , 2006, Nature Genetics.

[12]  Paul T. Groth,et al.  The ENCODE (ENCyclopedia Of DNA Elements) Project , 2004, Science.

[13]  J. Kawai,et al.  Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[14]  Ji Huang,et al.  [Serial analysis of gene expression]. , 2002, Yi chuan = Hereditas.

[15]  S. Altschul,et al.  SAGEmap: a public gene expression resource. , 2000, Genome research.

[16]  Sumio Sugano,et al.  5′-end SAGE for the analysis of transcriptional start sites , 2004, Nature Biotechnology.