Exercise 4: Quantify Expression at Transcript-level

The aim of these exercises is to fully understand how expression levels can be computed from an RNA-Seq dataset. For this, a transcript-level expression quantification strategy will be applied to several RNA-Seq samples from the Ascaridia galli dataset. Furthermore, an exploratory analysis will be carried out to interpret the results.

The transcript-level quantification tool is designed for estimating expression levels from RNA-Seq data. It expects the sequencing reads in FASTQ format (so a prior alignment is not necessary), and it supports both single-end and paired-end data. In addition, a set of transcript sequences in FASTA format is required, such as one produced by a de novo transcriptome assembler.

The application is based on RSEM. This program handles both the alignment of reads against the reference transcript sequences and the calculation of relative abundances. RSEM uses the Bowtie2 aligner to align reads, with parameters specifically chosen for RNA-Seq quantification. Since RNA-Seq reads do not always map uniquely to a single gene or isoform, RSEM allocates multi-mapping reads among transcripts using an expectation-maximization approach.
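
The expectation-maximization idea can be illustrated with a minimal sketch. It is not RSEM's actual model, which also accounts for transcript length, read position and sequencing errors. The sketch assumes each read is described only by the list of transcripts it is compatible with; the function name `em_allocate` and the toy reads are hypothetical.

```python
def em_allocate(read_hits, n_transcripts, n_iter=50):
    """Toy EM: split multi-mapping reads among transcripts.

    read_hits: one list of compatible transcript indices per read.
    Returns (expected_counts, relative_abundances).
    Simplifying assumption: all transcripts have the same length.
    """
    n_reads = len(read_hits)
    # Start from uniform relative abundances.
    theta = [1.0 / n_transcripts] * n_transcripts
    counts = [0.0] * n_transcripts
    for _ in range(n_iter):
        counts = [0.0] * n_transcripts
        # E-step: assign each read fractionally, in proportion to the
        # current abundance of the transcripts it is compatible with.
        for hits in read_hits:
            total = sum(theta[t] for t in hits)
            for t in hits:
                counts[t] += theta[t] / total
        # M-step: re-estimate abundances from the expected counts.
        theta = [c / n_reads for c in counts]
    return counts, theta


# Invented example: 6 reads unique to transcript 0, 2 unique to
# transcript 1, and 4 reads that map equally well to both.
reads = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 4
expected_counts, abundances = em_allocate(reads, n_transcripts=2)
print([round(c, 3) for c in expected_counts])
```

In this toy case the four ambiguous reads end up split roughly 3:1 in favour of transcript 0, because the uniquely mapping reads already indicate that it is the more abundant transcript.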

Go to Transcriptomics → Create Count Table, and select the Transcript-level Quantification option. Adjust the options as follows:

  • In the first wizard page, provide the input sequencing data and the reference transcriptome. Select the 12 preprocessed FASTQ files, and choose Paired-End Reads as Sequencing Data. The Upstream and Downstream patterns can be set to "_1" and "_2" respectively.

  • In the second wizard page:

    • Check the Estimate RSPD option.

    • Leave the Append Poly(A) Tails option unchecked.

    • Select the Non-Strand Specific option.

    • Leave the fragment length distribution option unchecked.

  • In the third wizard page, uncheck the Generate Alignment Files option.
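
The role of the upstream and downstream patterns on the first wizard page can be sketched as follows. This is only an illustration of the idea and not the tool's own matching logic: it assumes the pattern is a suffix placed directly before a `.fastq` extension, and the function name `pair_fastq` and the file names are hypothetical.

```python
def pair_fastq(filenames, upstream="_1", downstream="_2", ext=".fastq"):
    """Group paired-end FASTQ files by sample name.

    Assumption: the mate pattern is the suffix right before the extension,
    e.g. sampleA_1.fastq / sampleA_2.fastq.
    Returns {sample: (upstream_file, downstream_file)} for complete pairs.
    """
    found = {}
    for name in filenames:
        if not name.endswith(ext):
            continue  # ignore files with a different extension
        stem = name[: -len(ext)]
        for role, pattern in (("upstream", upstream), ("downstream", downstream)):
            if stem.endswith(pattern):
                sample = stem[: -len(pattern)]
                found.setdefault(sample, {})[role] = name
    # Keep only samples for which both mates were found.
    return {
        sample: (mates["upstream"], mates["downstream"])
        for sample, mates in sorted(found.items())
        if len(mates) == 2
    }


# Hypothetical file names; sampleB is missing its downstream file.
files = ["sampleA_1.fastq", "sampleA_2.fastq", "sampleB_1.fastq"]
pairs = pair_fastq(files)
print(pairs)
```

With 12 paired-end FASTQ files and consistent naming, a grouping of this kind would yield 6 samples.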

Once the analysis has finished, a new tab containing the resulting count table opens. Rows correspond to the transcripts (those contained in the input transcriptome project), and columns to samples. Counts represent the expression estimates computed by the RSEM algorithm.

Furthermore, a result page will show a summary of the "Create Count Table" results. This page contains information about the reference transcript sequences, the input FASTQ files, and the results obtained. The results summary can be generated via Side Panel → Result Summary and can be exported as a PDF.

Different statistical charts can be generated from the results. These provide additional information about the process of quantifying expression, as well as a quality assessment of the resulting counts. All these charts can be found under the Side Panel of the Count Table Viewer.
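
Some of the quantities that are useful when exploring a count table can also be computed by hand. The sketch below uses a small invented table (the transcript names, sample names and values are not real results) and assumes that the library size of a sample is the sum of its counts over all transcripts. The function names are hypothetical.

```python
# Invented toy count table: transcript -> {sample: estimated count}.
count_table = {
    "transcript_1": {"sample_A": 120.0, "sample_B": 30.0},
    "transcript_2": {"sample_A": 0.0, "sample_B": 0.0},
    "transcript_3": {"sample_A": 15.5, "sample_B": 200.0},
    "transcript_4": {"sample_A": 0.0, "sample_B": 4.5},
}


def library_sizes(table):
    """Sum the counts of every transcript for each sample (column sums)."""
    sizes = {}
    for row in table.values():
        for sample, value in row.items():
            sizes[sample] = sizes.get(sample, 0.0) + value
    return sizes


def top_transcript(table, sample):
    """Return (transcript, count) with the highest count in one sample."""
    name = max(table, key=lambda t: table[t][sample])
    return name, table[name][sample]


def unexpressed(table):
    """List the transcripts whose count is 0 in every sample."""
    return [t for t, row in table.items() if all(v == 0 for v in row.values())]


print(library_sizes(count_table))
print(top_transcript(count_table, "sample_B"))
print(unexpressed(count_table))
```

The same three operations (column sums, per-column maximum, all-zero rows) apply to an exported count table of any size.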

Questions:

  1. Which of the input samples has the largest library size? The library size chart could be useful to see differences between samples.

  2. Which input sample has the most aligned reads? And which has the fewest? Are these results related to library sizes? The counts per category chart could be useful to see the number of reads that have aligned once and multiple times.

  3. Why do reads align multiple times?

  4. Open the “distribution of counts” chart. Are the expression values equally distributed in all the samples?

  5. Which transcript shows the highest expression value in sample "final_ERR1948636"? What is its expression value?

  6. For how many transcripts have no expression values been detected in any of the samples? Why are there so many transcripts with 0 counts?