Paper information - TranscriptClean: variant-aware correction of indels, mismatches and splice junctions in long-read transcripts


Motivation: Long‐read, single‐molecule sequencing platforms hold great potential for isoform discovery and characterization of multi‐exon transcripts. However, their high error rates are an obstacle to distinguishing novel transcript isoforms from sequencing artifacts. Therefore, we developed the package TranscriptClean to correct mismatches, microindels and noncanonical splice junctions in mapped transcripts using the reference genome while preserving known variants. Results: Our method corrects nearly all mismatches and indels present in a publicly available human PacBio Iso‐seq dataset, and rescues 39% of noncanonical splice junctions. Availability and implementation: All Python and R scripts used in this paper are available at https://github.com/dewyman/TranscriptClean.

Ali Mortazavi | Dana Wyman
