Hands-On Day 3


Contents

Session 1: Read expression quantification and differential expression analysis 

The aim of these exercises is to fully understand how expression levels can be computed from a RNA-Seq dataset and how to perform a differential expression analysis from expression data. 

Exercise 1: Compute and visualize RNA-Seq expression levels

Input data
  • Ascaridia galli reduced dataset (6 samples).
  • Ascaridia galli transcriptome.

Download.

Task 1: Quantify expression at transcript-level

Computation time: 15-20 minutes

This exercise consists of quantifying expression levels of several RNA-Seq samples at transcript-level. The transcript-level quantification tool is designed for estimating expression levels from RNA-Seq data. It expects the sequencing reads in FASTQ format (so a prior alignment is not necessary), and it supports both single-end and paired-end data. In addition, a set of transcript sequences in FASTA format is required, such as one produced by a de novo transcriptome assembler.

The application is based on RSEM. This program handles both the alignment of reads against the reference transcript sequences and the calculation for relative abundances. RSEM uses the Bowtie2 aligner to align reads, with parameters specifically chosen for RNA-Seq quantification. Since RNA-Seq reads do not always map uniquely to a single gene or isoform, this method is able to allocate multi-mapping reads among transcripts using an expectation maximization approach.

Go to Transcriptomics → Create Count Table, and select the Transcript-level Quantification option. Adjust the options as follows:

  • In the first wizard page, provide the input sequencing data and the reference transcriptome. Select the 12 preprocessed FASTQ files, and choose Paired-End Reads as Sequencing Data. Upstream and Downstream patterns can be established as "_1" and "_2" respectively. 
  • In the second wizard page:
    • Check the estimate RSPD option. 
    • Leave the Append Poly(A) Tails option unchecked. 
    • Select the Non-Strand Specific option.
    • Leave the fragment length distribution option unchecked. 
    • Check the Generate Alignment Files and Sort by Coordinate options, and select a destination folder. 

Once the analysis has been finished, a new tab containing the resulting count table is opened. Rows correspond to the transcripts (those contained in the input transcriptome project), and columns to samples. Counts represent the expression estimates computed by the RSEM algorithm. 

Furthermore, a result page will show a summary of the "Create Count table" results. This page contains information about the reference transcript sequences, input FASTQ files, and obtained results. The results summary can be generated via Side Panel → Result Summary and it can be exported in PDF.

Different statistical charts can be generated from the results. These provide additional information about the process of quantifying expression, as well as a quality assessment of the resulting counts. All these charts can be found under the Side Panel of the Count Table Viewer.

Questions
  1. Which of the input samples has a larger library size? The library size chart could be useful to see differences between sample.
  2. Which input sample has more aligned reads? And less aligned reads? Are these results related to library sizes? The counts per category chart could be useful to see the number of reads that have aligned one time and multiple times. 
  3. Why do reads align multiple times?
  4. Open the distribution of count chart. Are the expression values ​​equally distributed in all the samples? Export the chart as a text file (Data as Text button). What is the mean expression value of each sample (approximate)?
  5. Which transcript shows a higher expression value in sample "final_ERR1948636"? What is its expression value?
  6. For how many transcripts have no expression values ​​been detected in any of the samples? Why are there so many 0 counts transcripts?
RSEM command line
# Step 1: Prepare reference
$ rsem-prepare-reference --bowtie2 a_galli_transcriptome.fasta a_galli_transcriptome
# Step 2: Calculate expression
$ rsem-calculate-expression -p 8 --bowtie2 --append-names --estimate-rspd --sort-bam-by-coordinate --strandedness none --paired-end ERR1948631_2.fastq.gz ERR1948631_2.fastq.gz a_galli_transcriptome ERR1948631_results.txt


Task 2: Visualize alignments in the genome browser. 

In this exercise, the alignments files (BAM) obtained in the previous task will be displayed in the OmicsBox genome browser. 

  • First open the trinity_dn21176_c1_g1_i15_m_79100_sequence_fasta.box project. It contains the sequence of the transcript "trinity_dn24893_c1_g6_i1_m_162354" contained in the A. galli transcriptome. The genome browse will appear in a new tab.
  • In the side toolbar of the genome browser, click on the "Add track" button. Select the final_ERR1948636.transcript.sorted.bam obtained in the previous step.
  • Play with different options. Zoom in, zoom out and scroll to see the different regions of the sequence. 


Questions
  1. Bases from the sequencing reads that does not match the original sequence are highlighted.
    1. Could you locate someone?
    2. How is it highlighted?
    3. Can you see the sequencing depth at this position? Zoom in and leave the pointer on the colored block belonging to that position (right above where the alignments are displayed) so that a box with depth information is displayed.
  2. Is the transcript sequence covered by reads along its entire length? Zoom out to see the entire transcript. 
  3. Check the As pairs checkbox. What is the difference?

Exercise 2: Pairwise differential expression analysis

Input data
  • Ascaridia galli complete count table. 
  • Experimental design.

Download.

This task consists of performing a pairwise differential expression analysis from RNA-seq expression data. This application, based on the edgeR program (empirical analysis of DGE in R), allows identification of differentially expressed genomic features in a pairwise comparison of two different experimental conditions. The software package edgeR, which belongs to the Bioconductor project, implements quantitative statistical methods to evaluate the significance of individual genes between two experimental conditions.

Task 1: Checking input counts

Open the count_table_a_galli project. This is a count table that was generated from the original dataset from which the reduced dataset used in the exercise 1 comes from. First, an exploratory analysis will be performed to check some properties of the count table. 

Open the summary report and the library size, distribution of counts and count per category charts (side panel). 

Questions
  1. What is the library size of the FASTQ files from which the count table was obtained?
  2. Which input sample has more aligned reads? And less aligned reads?
  3. Are the library sizes of the different samples very different? 
  4. Look at the Distribution of Counts chart. Are the expression values ​​equally distributed in all the samples? 
  5. Do you think that normalization and filtering procedures should be applied before proceeding with differential expression analysis?

Task 2: Performing a differential expression analysis using the exact test. 

Go to Transcriptomics → Run Differential Expression Analysis, and select the Pairwise Differential Expression analysis option. Adjust the options as follows:

  • In the first wizard page, establish the CPM filter at 0.5, and the samples reaching the CPM filter at 3. Select the TMM normalization method. 
  • In the second wizard page, provide the experimental_design.txt file provided the input data. A table should be displayed. 
  • In the third wizard page, select the simple design option, and establish the remaining parameters as follows:
    • Primary Experimental Factor: Treatment.
    • Primary Contrast Condition: flubendazole.
    • Primary Reference Condition: control.
    • Select the exact test option and check the robust parameter.

Once the input counts have been processed and analyzed via the "Pairwise Differential Expression Analysis'' tool, a new tab is opened containing the results (Figure 8). The results table contains the differential expression statistics, where each row corresponds to a feature:

  • logFC: A measure that describes how much the expression changes between conditions (log2-fold-changes are shown).
  • logCPM: The average log2-counts-per-millions.
  • P-Value: Tell us if this expression difference is significant or not (difference in comparison to the variance).
  • FDR: False Discovery Rate calculated by the Benjamini-Hochberg method (multiple hypothesis testing corrections).
  • Tags: Indicate whether a gene is upregulated (FDR ≤ 0.05, logFC ≥ 0) or downregulated (FDR ≤ 0.05, logFC ≥ 0).

Genes that have not passed the filtering step are not shown in the new tab. 

In the sidebar, there are located all possible action that can be performed for this differential expression results, including a summary and visualizations.

Take advantage of the table filters (header section) to explore the differential expression results. For example, you can filter to display only up or down-regulated genes.

A result page will show a summary of the pairwise differential expression analysis results. In addition, different statistical charts can be generated for a global visualization of the results

Questions
  1. First, look at the summary report:
    1. How many genes have been filtered?
    2. Do you think that he values ​​used to perform the filtering are appropriate?
    3. How many differentially expressed genes have been detected? How many genes have been classified as up? And down?
    4. What are the FDR filter and log FC threshold values used by default?
  2. Open the MDS plot. Are the samples belonging to the same experimental group relatively related?
  3. Open the volcano plot and the MA plot. Try to understand and interpret them and see the differences between them.
  4. Generate a heatmap visualization of the top 50 differentially expressed genes. Use CPM and apply logarithm transformation. 
    1. Are the samples clustered according to the experimental design?
    2. Interpret the color scheme used.
  5. Modify the FDR filter to 0.005 and the log FC threshold to 2 and -2.
    1. Count how many genes are now classified as up and down using the filters (tags column).
    2. Open the results chart and verify the results. 
R Code
# Load count data
> rawdata <- read.delim("count_table_a_galli.txt", check.names = FALSE, stringsAsFactors = FALSE, row.names = 1)
> genes.id <- rownames(rawdata)
> dge <- DGEList(counts = rawdata, genes = genes.id)

# Filtering
> keep <- rowSums(cpm(dge) > 0.5) >= 3
> dge <- dge[keep, , keep.lib.sizes = FALSE]

# Normalization
> dge <- calcNormFactors(dge, method = "tmm")

# Read experimental design file
> design.table <- read.table("experimental_design.txt", header=TRUE, stringsAsFactors = FALSE)
> exp.conditions <- exp.design(design.table[ , "Treatment"])
> exp.conditions <- relevel(exp.conditions, ref = "control")

# Statistical test
> comparison <- c("control", "flubendazole")
> dge$samples$group <- exp.conditions
> dge <- estimateDisp(dge, robust = TRUE)
> et <- exactTest(dge, pair = comparison)
> results <- as.data.frame(topTags(et, n = nrow(dge$counts)))

# Output
> write.table(results, file = "results.tsv", quote = FALSE, sep = "\t", row.names = FALSE)

Task 3: Performing a differential expression analysis using general linear models.

Go to Transcriptomics → Run Differential Expression Analysis, and select the Pairwise Differential Expression analysis option. Proceed in the same way as in Task 2, but this time selecting the GLM (Quasi Likelihood F-Test) option as statistical test (third page).

Questions
  1. How many differentially expressed genes have been detected?
  2. How many genes have been classified as up? And down?
  3. Are these results very different from those obtained in the previous task?  To compare them, use the same FDR filter and log FC threshold values. 
  4. Can you see an additional column in the results table? Which one? What information does it give?
R Code
# Load count data
> rawdata <- read.delim("count_table_a_galli.txt", check.names = FALSE, stringsAsFactors = FALSE, row.names = 1)
> genes.id <- rownames(rawdata)
> dge <- DGEList(counts = rawdata, genes = genes.id)

# Filtering
> keep <- rowSums(cpm(dge) > 0.5) >= 3
> dge <- dge[keep, , keep.lib.sizes = FALSE]

# Normalization
> dge <- calcNormFactors(dge, method = "tmm")

# Read experimental design file
> design.table <- read.table("experimental_design.txt", header=TRUE, stringsAsFactors = FALSE)
> exp.conditions <- exp.design(design.table[ , "Treatment"])
> exp.conditions <- relevel(exp.conditions, ref = "control")
> names <- levels(exp.conditions)
> exp.conditions <- model.matrix(~0 + exp.conditions)
> names <- make.names(names)
> colnames(exp.conditions) <- names
> rownames(exp.conditions) <- design.table[ , 1]

# Statistical test
> contrast <- make.names("flubendazole")
> ref <- make.names("control")
> comparison <- factor(paste(contrast, ref, sep = "-"))
> my.contrast <- makeContrasts(contrasts = comparison, levels = exp.conditions)
> dge <- estimateDisp(dge, exp.conditions, robust = TRUE)
> fit <- glmQLFit(dge, exp.conditions, robust = TRUE)
> qlf <- glmQLFTest(fit, contrast = my.contrast)
> results <- as.data.frame(topTags(qlf, n = nrow(dge$counts)))

# Output
> write.table(results, file = "results.tsv", quote = FALSE, sep = "\t", row.names = FALSE)

Session 2: Functional enrichment analysis

The aim of these exercises is to learn how functional enrichment analysis can be performed from differential expression results and how to interpret the results to gain biological insights. 

Functional enrichment analysis is a procedure to identify functions that are over-represented in a set of genes and may have an association with an experimental condition (e.g. phenotype, treatment…). These methods use statistical approaches to identify significantly enriched or depleted groups of genes.

The Functional Analysis Module offers two different statistical tests in order to Functional Enrichment Analysis, Fisher's Exact Test and Gene Set Enrichment Analysis.

Tip: To better understand the results, we recommend taking a look at the experimental description of the Ascaridia galli dataset

Exercise 3: Fisher's Exact Test and Gene Set Enrichment Analysis

Input data
  • Pairwise differential results.
  • Ascaridia galli transcriptome.

Download.

Task 1: Performing a functional enrichment analysis from up-regulated genes using Fisher's Exact Test. 

This task consists of performing a Fisher´s exact test from differential expression results. 

OmicsBox has integrated the FatiGO package for statistical assessment of annotation differences between 2 sets of sequences. This package uses Fisher's Exact Test and corrects for multiple testing. For this analysis, the completion (but not exclusively) of the involved sequences with their annotations must be loaded in the application. This can either be the result of a OmicsBox annotation or the imported annotation by file (.annot).

Open the count_table_a_galli_results_exact_test.box containing the differential expression results from the Exercise 2 Task 2. Click on the Enrichment Analysis (Fisher's Exact Test) button (sidebar), and configure the analysis as follows: 

  • Select the "Up-regulated genes" option as Test-set genes. 
  • Select the "a_galli_annot.annot" project as Reference Annotation. 
  • Leave the remaining parameters as default. 

Once completed the results table will be shown in a new tab (see image below), where the adjusted p-values of each annotation above a given threshold will be shown. The main columns are:

  • FDR: Corrected p-value by False Discovery Rate control according to Benjamini-Hochberg.
  • P-Value: P-Value without multiple testing corrections.

Using the context menu of each row It is possible to get more details about the annotation and also create an ID-List with the sequences annotated in the Test-Set or the Reference-Set.

  • #Test is the number of sequences that are annotated with the GO and are in the test set.
  • #NotAnnotTest is the number of sequences that are not annotated with that GO, that is in the test set. 

Adding these two numbers it gives the total amount of sequences that are annotated at all in your test set e.g. GO:0061135: 9 + 52 = 61.

In the sidebar there are located all possible action that can be performed for this enrichment result, including two options for the visual display of the results.

Questions
  1. How many functions related to up-regulated genes have been detected as enriched? Can you find any interesting function related to the purpose of the study?
  2. Click on the Show Bar Chart button. Interpret the chart.
  3. Click on the Make Enriched graph button. How many charts are generated? Why? Try to understand the charts.
  4. Click on the Reduce to Most Specific button. How many enriched functions are there now? What happened?
  5. Open the remaining visualizations (Treemap and WordCloud) too see the results of the enrichment analysis in different ways. 
  6. Generate the Test-Set and the Reference-Set for the GO:0010501 term. How many sequences (total) are annotated with this GO term?

Task 2: Performing a functional enrichment analysis from down-regulated genes using Fisher's Exact Test.

Proceed in the same way as in Task 1, but this time selecting the "Down-regulated genes" option as Test-set genes. 

Questions
  1. How many functions related to down-regulated genes have been detected as enriched? Can you find any interesting function related to the purpose of the study?
  2. Click on the Show Bar Chart button. Interpret the chart.
  3. Click on the Make Enriched graph button. How many charts are generated? Why? Try to understand the charts.
  4. Click on the Reduce to Most Specific button. How many enriched functions are there now? 
  5. Open the remaining visualizations (Treemap and WordCloud) too see the results of the enrichment analysis in different ways. 

Task 3: Performing a Gene Set Enrichment Analysis from differential expression results. 

OmicsBox includes the GSEA computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states. GSEA considers experiments with genome-wide expression profiles from samples belonging to two classes, labeled 1 or 2. Genes are ranked based on the correlation between their expression and the class distinction by using any suitable metric. Given an a priori defined set of genes S, the goal of GSEA is to determine whether the members of S are randomly distributed throughout L or primarily found at the top or bottom.

For this analysis, the completion (but not exclusively) of the involved sequences with their annotations must be loaded in the application. This can be the result of an OmicsBox annotation.

Open the count_table_a_galli_results_exact_test.box containing the differential expression results from the Exercise 2 Task 2. Click on the Gene Set Enrichment Analysis button (sidebar), and configure the analysis as follows: 

  • Select the a_galli_transcriptome.b2g as Reference Annotation. 
  • Set the FDR Filter for Ranked List parameter as 1. In this way, no filtering will be applied, so all the genes contained in the differential expression results will be taken into account for the analysis.
  • Set the Number of Detailed Results to 0 so that the analysis takes less time. 
  • Leave the remaining parameters as default. 

Once completed the results table will be shown in a new tab (see image below), where the adjusted p-values of each annotation above a given threshold will be shown. The main columns are:

  • ES: Reflects the degree to which a gene set is overrepresented at the top or bottom of a ranked list of genes.
  • NES;  By normalizing the enrichment score, GSEA accounts for differences in gene set size and in correlations between gene sets and the expression dataset.
  • FDR; The estimated probability that a gene set with a given NES represents a false positive finding.
  • Nominal P-Value: Estimates the statistical significance of the enrichment score for a single gene set.

Using the context menu of the rows tagged with the Details tag It is possible to get more details about the GO term, including the enrichment statistics, and also create an ID-List with the core enrichment sequences for each GO term.

In the sidebar, there are located all possible action that can be performed for this enrichment result, including two options for the visual display of the results.

Questions
  1. How many enriched functions have been detected?
  2. How many "TOP" enriched functions have been detected? And "BOTTOM" enriched functions?
  3. What is the difference between TOP and BOTTOM functions? Can you find any interesting function related to the purpose of the study?
  4. Click on the Make Enriched graph button. How many charts are generated? Why? Try to understand the charts.
  5. Click on the Reduce to Most Specific button. How many enriched functions are there now? 
  6. Open the remaining visualizations (Treemap and WordCloud) too see the results of the enrichment analysis in different ways. 
  7. Download the results for this task. The GSEA project contains details for all enriched GO terms. Open it in OmicsBox, and find a function related with "drug catabolic process". Open the Detailed results (right click on the function entry), and try to understand them. How many sequences are annotated with this function? Is this function classified as TOP or BOTTOM? 

Extra exercise: Time Course Expression Analysis

Input data
  • Count table.
  • Experimental design file. 
  • Blog. 

Download.

If you are interested in understanding the time course expression analysis in more detail, you can download the input data and perform a time course expression analysis within OmicsBox. This blog contains a simple use-case performed with this data. 

We encourage you to try to reproduce the analysis following the instructions described in the blog.

Results