Computational analysis of RNA-seq.

Using high-throughput DNA sequencing (HTS) to examine gene expression, an approach typically referred to as RNA-seq, is rapidly becoming a viable choice. Often the depth and breadth of coverage of RNA-seq data can exceed what is achievable using microarrays. However, the strengths of RNA-seq are often its greatest weaknesses. Accurately and comprehensively mapping millions of relatively short reads to a reference genome sequence can require not only specialized software, but also more structured and automated procedures to manage, analyze, and visualize the data. Additionally, the computational hardware required to efficiently process and store the data can be a necessary and often-overlooked component of a research plan. We discuss several aspects of the computational analysis of RNA-seq, including file management and data quality control, analysis, and visualization. We provide a framework for a standard nomenclature system that can facilitate automation and the ability to track data provenance. Finally, we provide a general workflow of the computational analysis of RNA-seq and a downloadable package of scripts to automate the processing.
