Normalization of RNA-Seq

Usually, an RNA-Seq data analysis “from scratch” starts with a set of FASTQ files (see e.g. http://en.wikipedia.org/wiki/FASTQ_format), which contain both the sequence and the quality of the short reads. Several tools can align the reads to a reference genome (e.g. Bowtie, TopHat, GSNAP, Stampy, ...). A common output file format is SAM/BAM (described at http://samtools.sourceforge.net/).

You just saw how to align reads when you don’t have a genome, and how to summarize them. When you do have a genome, a standard approach is to align the reads with Bowtie or TopHat, and then summarize them over “regions of interest”, such as genes, exons, non-coding RNAs, etc. To do this, you need your aligned reads and an annotation for your reference genome. Several tools and packages summarize the aligned reads into gene counts. One of them is HTSeq (http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html). The simple command: