Hands-On Day 2


Session 1. RNA-Seq de novo assembly

The aim of these exercises is to learn how to assemble preprocessed RNA-Seq reads. The analysis is carried out de novo, meaning that no reference genome sequence is available. For these tasks, we provide an RNA-Seq dataset from a transcriptomic analysis of the nematode Ascaridia galli. Follow the instructions below to perform a de novo RNA-Seq analysis and answer the proposed questions.

Exercise 1: RNA-Seq de novo assembly

Computation time: 30 minutes.

Input data

1 reduced sample from the Ascaridia galli dataset. 

Download.

This task consists of running an RNA-Seq de novo assembly to generate a transcriptome assembly from preprocessed RNA sequencing reads (those preprocessed during Hands-On Day 1, Exercise 2).

Go to Transcriptomics → RNA-Seq De novo Assembly. Adjust the options as follows:

  • In the first wizard page, provide the input sequencing data. Select the X preprocessed FASTQ files, and choose Paired-End Reads as Sequencing Data. The Upstream and Downstream patterns can be set to "_1" and "_2" respectively.
  • In the second wizard page:
    • Set the K-mer Size parameter as default (25).
    • Select the Non-Strand Specific option. 
    • Leave the Minimizing Falsely Fused Transcripts option unchecked.
    • Set the Pairs Distance value as default (500). 
  • In the third wizard page, select the destination folder for the "Transcript to Gene mapping File". 
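
The Upstream and Downstream patterns tell the wizard which file of each pair holds the forward reads and which the reverse reads. As a rough illustration of the idea (not OmicsBox's actual matching logic, which is not described here), the sketch below pairs file names whose suffix before the extension is "_1" or "_2". The function name pair_reads and the accepted extensions are assumptions made for this example.

```python
import re
from collections import defaultdict


def pair_reads(filenames, upstream="_1", downstream="_2"):
    """Group FASTQ file names into (upstream, downstream) pairs per sample.

    Hypothetical helper: a pattern is assumed to sit directly before the
    file extension (.fastq, .fq, optionally followed by .gz).
    """
    found = defaultdict(dict)
    for name in filenames:
        for role, pattern in (("upstream", upstream), ("downstream", downstream)):
            match = re.match(
                rf"^(.*){re.escape(pattern)}(\.f(ast)?q(\.gz)?)$", name
            )
            if match:
                found[match.group(1)][role] = name
    # Keep only samples for which both mates were found.
    return {
        sample: (roles["upstream"], roles["downstream"])
        for sample, roles in found.items()
        if len(roles) == 2
    }


files = ["ERR1948631_1.fastq.gz", "ERR1948631_2.fastq.gz"]
print(pair_reads(files))
```

A file without its mate is simply left out of the result, which makes incomplete pairs easy to spot before starting the assembly.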

When the assembly completes, a table containing the assembled transcript sequences opens. In addition, a result page shows an overview of the assembly results, and two charts show the read representation of the assembly.

Since this analysis can take a while, you can proceed with Exercise 2 in the meantime. A pop-up message will notify you when the analysis is completed. You can then answer the following questions.

Questions
  1. How many transcripts have been assembled? How many bases (nt) have been assembled?
  2. How many genes have been deduced?
  3. What is the percentage of GC?
  4. Which is the longest transcript, and what is its length? And the shortest?
  5. What is the N50 metric computed from all transcripts? And the one computed from the longest isoform per gene? What are the differences between them?
  6. What is the meaning of the N50 metric? What is it used for?
  7. How many transcripts are longer than 300 base pairs? And shorter? Do you think that short transcripts should be filtered out prior to annotation?
  8. Look at the Read Representation charts. What conclusions can be drawn?
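
Several of the figures asked for above (number of transcripts, assembled bases, GC percentage, N50, transcripts above a length cutoff) can also be recomputed from the assembled FASTA file, which is a useful cross-check of the OmicsBox report. The sketch below is a minimal, assumption-laden illustration: the function names are hypothetical, N50 is taken as the length of the shortest transcript in the smallest set of longest transcripts that together cover at least half of all assembled bases, and the toy sequences are invented.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def n50(lengths):
    """Length at which the cumulative sum (longest first) reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def assembly_stats(records, min_len=300):
    """Summarise an assembly; min_len mirrors the 300 bp cutoff in question 7."""
    seqs = [seq.upper() for _, seq in records]
    lengths = [len(s) for s in seqs]
    total = sum(lengths)
    gc = sum(s.count("G") + s.count("C") for s in seqs)
    return {
        "transcripts": len(seqs),
        "bases": total,
        "gc_percent": 100.0 * gc / total if total else 0.0,
        "n50": n50(lengths),
        "longer_than_min": sum(1 for n in lengths if n > min_len),
    }


# Invented toy data; for a real file, pass open("assembly.fasta") instead.
toy = [">t1", "GGCC", ">t2", "ATATAT", ">t3", "GCGC", "ATATAT"]
print(assembly_stats(read_fasta(toy), min_len=5))
```

Because N50 is weighted by length, a few long transcripts can raise it considerably, which is one reason the value computed from all transcripts and the value computed from the longest isoform per gene can differ.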

Trinity command line
$ Trinity --seqType fq --max_memory 480G --CPU 64 --KMER_SIZE 25 --left ERR1948631_1.fastq.gz --right ERR1948631_2.fastq.gz
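
Trinity transcript identifiers such as TRINITY_DN10164_c0_g1_i1 encode the gene in everything before the final "_i<number>" isoform suffix, which is how transcripts are grouped into genes. The sketch below derives the gene identifier from a transcript identifier and keeps the longest isoform per gene, the set on which the second N50 value in the questions is based. The helper names, the lengths and the "_i2" isoform are hypothetical values chosen for illustration.

```python
import re

# Transcript IDs look like TRINITY_DN<cluster>_c<component>_g<gene>_i<isoform>.
TRINITY_ID = re.compile(r"^(TRINITY_DN\d+_c\d+_g\d+)_i\d+$")


def gene_of(transcript_id):
    """Return the gene part of a Trinity transcript ID (or the ID itself)."""
    match = TRINITY_ID.match(transcript_id)
    return match.group(1) if match else transcript_id


def longest_isoform_lengths(lengths_by_transcript):
    """Map each gene to the length of its longest isoform."""
    best = {}
    for transcript_id, length in lengths_by_transcript.items():
        gene = gene_of(transcript_id)
        best[gene] = max(best.get(gene, 0), length)
    return best


# Hypothetical lengths; only the ID format follows Trinity's convention.
lengths = {
    "TRINITY_DN10164_c0_g1_i1": 1200,
    "TRINITY_DN10164_c0_g1_i2": 800,
    "TRINITY_DN106_c0_g1_i1": 450,
}
print(longest_isoform_lengths(lengths))
```

The number of keys in the result corresponds to the number of deduced genes, and its values are the lengths to use for a per-gene N50.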

Session 2: Functional Annotation

Exercise 2: Getting familiar with the Blast2GO annotation rule

The goal of this exercise is to fully understand how the Blast2GO annotation rule works. We want you to learn how to modify the annotation parameters and how to adapt them to your needs and specific datasets. We provide you with the information for 2 sequences for which BLAST hits are known. The annotation for the corresponding BLAST hits (i.e. the GO mapping) is also given. Furthermore, we include the Gene Ontology subgraph corresponding to the candidate GO terms of each sequence and a table with the evidence code weights. All this information is needed to annotate the 2 sequences manually.

BLAST Hits with GO Mapping for sequence 1.

The GO graph can be found below.

Blast Hit    Similarity    GO Term       Evidence Code
Hit_1        81%           GO:0012501    IEA
Hit_2        64%           GO:0042981    IEA
Hit_3        58%           GO:0035239    ISS
Hit_4        58%           GO:0007165    IDA

BLAST Hits with GO Mapping for sequence 2.

The GO graph can be found below.

Blast Hit    Similarity    GO Term       Evidence Code
Hit_1        65%           GO:0003700    ISS
Hit_2        58%           GO:0003723    IEA

Annotation Rule:

Compute for each GO term candidate:

AS = max(sim x ECweight) + (#GOs - 1) x GOweight

Select from all GO terms those that satisfy:

AS >= Annotation Score threshold (only the most specific term of each branch is kept)

where:

  • sim: Similarity percentage between the query sequence and the BLAST hit with the candidate GO term.
  • max(sim x ECweight): The maximum combination of all BLAST hits annotated to the candidate GO term.
  • ECweight: Evidence code weight for the candidate GO term for the BLAST hit (default, see Table below).
  • #GOs: Number of GO terms from BLAST hits contributing indirectly to the candidate GO term (including itself if with hit).
  • GOweight: Abstraction factor (default = 5).
  • Annotation Score threshold: Default = 55.
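
To make the rule concrete, the following sketch computes an annotation score under the assumption that it takes the usual Blast2GO form AS = max(sim x ECweight) + (#GOs - 1) x GOweight and that a candidate is kept when AS reaches the threshold. The function names and the example hits are hypothetical and deliberately do not use the exercise data, so the questions below remain open.

```python
def annotation_score(hits, n_gos, ec_weights, go_weight=5):
    """Annotation score of one candidate GO term.

    hits       -- (similarity_percent, evidence_code) pairs of the BLAST hits
                  supporting the candidate (for a parent term, the hits of the
                  terms below it)
    n_gos      -- number of hit-derived GO terms contributing to the candidate
    ec_weights -- mapping evidence code -> weight
    """
    direct = max((sim * ec_weights[ec] for sim, ec in hits), default=0.0)
    abstraction = max(n_gos - 1, 0) * go_weight
    return direct + abstraction


def passes(score, threshold=55):
    """True if the candidate reaches the annotation score threshold."""
    return score >= threshold


# Hypothetical hits and a two-entry excerpt of the default weights.
weights = {"IEA": 0.7, "IDA": 1.0}
hits = [(70, "IEA"), (60, "IDA")]

leaf = annotation_score(hits, n_gos=1, ec_weights=weights)
parent = annotation_score(hits, n_gos=3, ec_weights=weights)
print(leaf, passes(leaf))
print(parent, passes(parent))
```

Setting go_weight to 0 removes the abstraction term entirely, so a parent term can then never score higher than the best hit below it.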



Questions
  1. Calculate the final GO annotations for the 2 sequences with the default parameters. To do so, calculate the annotation scores (AS) for:

    • Sequence 1: GO:0012501, GO:0042981, GO:0035239, GO:0007165 and the indirect GO terms GO:0050794, GO:0048856.
    • Sequence 2: GO:0003700, GO:0003723 and the indirect parent term GO:0003676.
  2. What happens if you
    1. turn off abstraction (GOw=0)?
    2. double the influence of abstraction (GOw=10)?
  3. What happens if you
    1. only consider experimentally based evidence codes?
    2. turn off evidence code control by setting all ECs = 1.0?

Evidence Code weights:

EC     Default    Description
EXP    1.0        Inferred from Experiment
IDA    1.0        Inferred from Direct Assay
IPI    1.0        Inferred from Physical Interaction
IMP    1.0        Inferred from Mutant Phenotype
IGI    1.0        Inferred from Genetic Interaction
IEP    1.0        Inferred from Expression Pattern
TAS    0.9        Traceable Author Statement
IC     0.9        Inferred by Curator
RCA    0.9        Inferred from Reviewed Computational Analysis
ISS    0.8        Inferred from Sequence or Structural Similarity
ISO    0.8        Inferred from Sequence Orthology
ISA    0.8        Inferred from Sequence Alignment
ISM    0.8        Inferred from Sequence Model
NAS    0.8        Non-traceable Author Statement
IGC    0.7        Inferred from Genomic Context
IEA    0.7        Inferred from Electronic Annotation
ND     0.5        No biological Data Available
NR     0.0        Not Recorded
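
For experimenting with the evidence code questions above, the weight table can be written as a lookup and then modified. This is only a sketch: the two helper functions are hypothetical, and "experimentally based" is interpreted here as the six codes EXP, IDA, IPI, IMP, IGI and IEP, with every other code set to 0.

```python
# Default evidence code weights, copied from the table above.
DEFAULT_EC_WEIGHTS = {
    "EXP": 1.0, "IDA": 1.0, "IPI": 1.0, "IMP": 1.0, "IGI": 1.0, "IEP": 1.0,
    "TAS": 0.9, "IC": 0.9, "RCA": 0.9,
    "ISS": 0.8, "ISO": 0.8, "ISA": 0.8, "ISM": 0.8, "NAS": 0.8,
    "IGC": 0.7, "IEA": 0.7,
    "ND": 0.5, "NR": 0.0,
}

# Assumption: these six codes are the experimentally based ones.
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def experimental_only(weights):
    """Keep the weight of experimental codes and set all others to 0."""
    return {ec: (w if ec in EXPERIMENTAL else 0.0) for ec, w in weights.items()}


def all_equal(weights, value=1.0):
    """Switch off evidence code control by giving every code the same weight."""
    return {ec: value for ec in weights}


print(experimental_only(DEFAULT_EC_WEIGHTS)["IEA"])
print(all_equal(DEFAULT_EC_WEIGHTS)["IEA"])
```

With the experimental-only weights, any hit supported solely by IEA or ISS evidence contributes a direct score of 0, so it can only influence the result through the abstraction term.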


GO graph for sequence 1


GO graph for sequence 2




Exercise 3: Annotate 100 sequences with OmicsBox

Input data

1000 sequences from the Ascaridia galli assembled transcriptome. 

Download.

The aim of this exercise is to annotate a number of sequences following the OmicsBox scheme and to modulate the annotation of these sequences. Please perform the following steps and answer the questions.

Load the 1000 transcript sequences from within Blast2GO

  1. Go to File → Load → Load Sequences → Load Fasta file. 

  2. For this exercise, select the first 100 sequences of the 1000 sequences loaded in the OmicsBox project.

Tip: Deselect all sequences first and only then select the 100 sequences. To do this, use the Shift key to mark the 100 sequences and then the Space key to check them. Then use the filter option on the checkbox column to show only the selected sequences in the table.
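
As an alternative to selecting rows in the table, the FASTA file can be cut down to its first 100 records before loading. The sketch below is a hypothetical helper that works on lines of text; it assumes that the order of records in the file matches the order shown in the OmicsBox table.

```python
def first_n_records(lines, n=100):
    """Return the lines belonging to the first n FASTA records."""
    kept, count = [], 0
    for line in lines:
        if line.startswith(">"):
            count += 1
            if count > n:
                break
        if count >= 1:  # skip anything before the first header
            kept.append(line)
    return kept


# Invented toy data; with a real file, iterate over open("input.fasta").
toy = [">a\n", "AC\n", ">b\n", "GT\n", ">c\n", "TT\n"]
print("".join(first_n_records(toy, n=2)), end="")
```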

Task 1: Perform a Blast search

Computation time: 15 minutes.

Use CloudBlast to blast against the NCBI Nematoda (nr subset) database. To do this, go to functional analysis → Blast → Run Blast → CloudBlast, and configure the analysis as follows:

  • Select the blastx-fast option as Blast Program.

  • Select the Non-redundant protein sequences (nr_v5) blast database.

  • Set "nematoda" as the Taxonomy filter by entering the NCBI taxonomy ID 6231.

  • Select the "Blast against a subset of taxonomies" option.

  • Leave the remaining parameters as default. 

  • In the last wizard page, establish an output directory for XML2 outputs. 

Questions
  1. How long does it take to complete? Once Blast has finished, you can see how long it took in the Progress tab.

  2. Are all sequences successfully blasted?

  3. While the mapping is proceeding (next step), you may continue with the following steps:

    1. Browse BLAST results for any of the sequences (single sequence menu, right mouse click).

    2. Locate different hits and check the local alignment values.

Task 2: Ontology mapping

Computation time: 32 seconds. 

Once Blast is finished you can launch the Gene Ontology Mapping step.

Questions
  1. Check the mapping results for the sequence TRINITY_DN10164_c0_g1_i1, draw the GO graph for this sequence (sequence context menu, right-click) and locate the annotation scores. To find the sequence, you can use the table filters (e.g., write the name of the sequence in the SeqName column filter). Try to understand which GO terms will be annotated. How many graphs did you retrieve? To which categories do they belong?

  2. How many GO terms have been fetched for the first 20 sequences?

  3. How many sequences are still orange? You can use the Select by colour or the table filters to find out the number of sequences that did not retrieve a GO term.

Task 3: Annotation step

Computation time: 10 seconds.

Annotate the sequences with the default parameters.

Questions
  1. How many sequences are still green?

  2. How many GO terms do you obtain for the first 20 sequences?

  3. Generate the annotation graph (DAG) of TRINITY_DN106_c0_g1_i1 (sequence context menu → right mouse click → Draw Graph of GO-Mapping with annotation score). Interpret and save the “molecular function” graph. Look at the cellular component graph. Why has GO:0005634 (nucleus) not been annotated?

Task 4: Let’s check some annotations in more detail

Select sequences 8, 11 and 19 and reset the annotation results of these sequences (Annotation sub Menu → Remove Annotations). Now re-annotate them with an annotation threshold (Annotation CutOff) of 80.

Questions
  1. In what way does this result differ compared to the result obtained before?

  2. There are a number of sequences with mapping but without annotation. What happened?

Try to annotate sequences 8 and 11 manually. To do this, right-click on the sequence → Change Annotation and Description.

Go to the Blast results of these sequences to learn about them and decide which functions you would assign to them. Go to the Gene Ontology resource http://www.geneontology.org and look for appropriate GO terms. Add these manually to the sequences and mark them as annotated manually.

Task 5: InterPro analysis

Computation time: 4 minutes.

Run InterPro for the first 20 sequences. The InterProScan functionality can be found under the functional analysis menu. Use the CloudIPS option. 

Questions
  1. How many sequences have InterProScan results with and without GO terms?
  2. View an InterProScan result and examine the result details.
  3. Export your annotation file as an .annot file and open it in a text editor or spreadsheet.
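
Besides opening it in a text editor, the exported .annot file can be summarised with a few lines of code. The sketch below assumes the file is tab-separated with the sequence name in the first column and a GO ID in the second (an optional description may follow); rows whose second column is not a GO ID are skipped. The function name and the example rows are hypothetical.

```python
import io
from collections import defaultdict


def read_annot(handle):
    """Collect the set of GO IDs per sequence from .annot-style lines."""
    gos = defaultdict(set)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1].startswith("GO:"):
            gos[fields[0]].add(fields[1])
    return dict(gos)


# Invented example content; for a real file use open("project.annot").
example = io.StringIO(
    "seq_a\tGO:0003700\tsome description\n"
    "seq_a\tGO:0003723\n"
    "seq_b\tEC:1.1.1.1\n"
)
annotations = read_annot(example)
for name, terms in annotations.items():
    print(name, len(terms))
```

Counting the keys gives the number of sequences with at least one GO term, and the set sizes give the number of GO terms per sequence.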

Exercise 4: Perform a complete annotation process with OmicsBox

Input data

  • 18489 sequences from the Ascaridia galli assembled transcriptome. All sequences have been blasted and mapped. 
  • XML files of InterProScan results. 

Download.


The goal of this exercise is to perform all analysis steps for a set of 18489 Ascaridia galli assembled transcript sequences. This way we can learn more about the features OmicsBox offers for gaining a better understanding of your dataset.

Task 1: Annotation of 18489 sequences with OmicsBox 

Computation time: 10-15 minutes.

Open OmicsBox and load the project. Now perform the Annotation step using default parameters (click “default” in the wizard to be sure).

Generate the following charts for each step:

  • Blast: e-Value Distribution, Species Distribution, Similarity Distribution.
  • GO-Mapping: Evidence Code Distribution, DB Sources of Mapping.
  • Annotation: GO Annotation Level Distribution, GO Distribution by Level.

These charts can be generated via the Charts and Statistics functionality, located under the functional analysis menu.

Questions
  1. What do you observe concerning the obtained e-Values for this data-set of sequences and how do the sequence similarities vary?
  2. Have a look at the mapping charts and try to interpret them.

Task 2: Augment Annotation via InterPro

Import the provided InterPro results (.xml files) by going to “Load → Load InterProScan Results”. In the dialog choose “Add to existing project” and select the project from Task 1. After selecting the folder which contains the .xml files, don’t forget to choose “Nucleotides” and hit “Load”.

The InterPro column should now show data for all sequences. Now go to “InterProScan → Merge InterProScan GOs to Annotation” to transfer the GOs.

Export the resulting project as .annot (File → Export → Export Annotations → ...).

Questions
  1. For how many sequences could you obtain InterPro results and how many of them contributed with GO terms?
  2. Can you tell how many more sequences could be annotated by adding the InterPro GO-terms?
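
One way to answer the second question is to compare the annotations exported before and after merging the InterPro GO terms. The sketch below assumes both states are available as mappings from sequence name to a set of GO IDs (for example, parsed from two .annot exports); the function name and the example data are hypothetical.

```python
def annotation_gain(before, after):
    """Compare two {sequence: set of GO IDs} mappings.

    Returns (newly_annotated, extended): sequences that had no GO term before
    but have one now, and sequences that already had terms and gained more.
    """
    newly_annotated = {
        seq for seq, gos in after.items() if gos and not before.get(seq)
    }
    extended = {
        seq for seq, gos in after.items()
        if before.get(seq) and gos - before[seq]
    }
    return newly_annotated, extended


# Invented example data.
before = {"seq_a": {"GO:0003700"}, "seq_b": set()}
after = {
    "seq_a": {"GO:0003700", "GO:0003676"},
    "seq_b": {"GO:0003723"},
    "seq_c": {"GO:0007165"},
}
new, extended = annotation_gain(before, after)
print(len(new), len(extended))
```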

Task 3: Summarize your data via GO-Slim

Perform a GO-Slim reduction (Analysis menu) with the Generic slim and generate the annotation charts from Task 1 once more.

Task 4: Try different annotation strategies

Computation time: 20 minutes. 

Change the annotation strategy as follows, removing the existing annotation first (“Annotation → Remove Annotation”).

Restrictive:

  • Annotation CutOff: 70

  • E-Value-Hit-Filter: 1e-10 and, to compensate, GO-Weight: 10

  • All experimental evidence codes and TAS to 1.0, rest to 0.0.

  • Run the annotation and create a Data Distribution Chart, as well as the ones from Task 1.

Permissive:

  • Annotation CutOff: 50

  • E-Value-Hit-Filter: 1e-3 and, to compensate, GO-Weight: 0

  • All evidence codes to 1.0.

  • Run the annotation and create a Data Distribution Chart, as well as the ones from Task 1.
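
To see how the two strategies act on the same evidence, the sketch below encodes each parameter set and applies it to one hypothetical candidate GO term. It assumes the annotation score has the form max(sim x ECweight) + (#GOs - 1) x GOweight, that the E-value filter simply discards hits above the limit, and, as a simplification, that #GOs does not change when hits are discarded. All names and hit values are invented for illustration.

```python
EXPERIMENTAL_AND_TAS = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "TAS"}

RESTRICTIVE = {
    "cutoff": 70, "max_evalue": 1e-10, "go_weight": 10,
    "ec_weight": lambda ec: 1.0 if ec in EXPERIMENTAL_AND_TAS else 0.0,
}
PERMISSIVE = {
    "cutoff": 50, "max_evalue": 1e-3, "go_weight": 0,
    "ec_weight": lambda ec: 1.0,
}


def is_annotated(hits, n_gos, strategy):
    """hits: (similarity_percent, evidence_code, evalue) triples."""
    kept = [(sim, ec) for sim, ec, ev in hits if ev <= strategy["max_evalue"]]
    direct = max((sim * strategy["ec_weight"](ec) for sim, ec in kept), default=0.0)
    score = direct + max(n_gos - 1, 0) * strategy["go_weight"]
    return score >= strategy["cutoff"]


# Hypothetical candidate supported by two hits.
hits = [(75, "IEA", 1e-20), (62, "IDA", 1e-5)]
print(is_annotated(hits, n_gos=1, strategy=RESTRICTIVE))
print(is_annotated(hits, n_gos=1, strategy=PERMISSIVE))
```

In this example the restrictive setting rejects the term because the only hit passing the E-value filter carries a zero-weighted evidence code, while the permissive setting accepts it on similarity alone.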


Questions
  1. Compare the “level-distribution” chart for the different settings and interpret the results.
  2. Can you obtain more/less annotated sequences by modifying the annotation parameters? How?