Extraction of poly(A) sites from large-scale RNA-Seq data.

The NCBI manages the SRA (Sequence Read Archive) database to store RNA-Seq data generated from different NGS technologies. With ever increasing finished and ongoing genome and transcriptome sequencing projects, the data in SRA expand rapidly and present a treasure for mining useful information to facilitate our understanding of biological issues like mRNA 3'-end formation and alternative polyadenylation. We developed a bioinformatics pipeline that can process raw SRA sequence data and obtain high quality poly(A) sites and poly(A) cluster sites with detailed expression information. This pipeline is designed to be generic and can be utilized for polyadenylation studies in any eukaryotic species.