Exercise 3: Functional Annotation

The goal of this exercise is to perform all steps of the Blast2GO functional annotation pipeline for a set of 100 Ascaridia galli assembled transcript sequences. This way we can learn more about each annotation step (blast, mapping and annotation) and explore and understand the results. Please perform the following tasks and answer the questions.

Task 1: Blast search

First, load the sequences into OmicsBox. For this, go to File → Load → Load Sequences → Load Fasta file. Select the “100_seqs_a_galli.fasta” file.
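Before loading, it can be useful to confirm that the file really contains the expected number of sequences. The following is a minimal sketch, not part of OmicsBox: the function name count_fasta_records is hypothetical, the toy sequence names are made up, and the commented-out part assumes the exercise file sits in the working directory.

```python
def count_fasta_records(lines):
    """Count FASTA records by counting header lines, which start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


# Toy input standing in for the real file (sequence names are made up).
example = [">seq_1", "ATGCGT", "TTGACA", ">seq_2", "GGCATC"]
print(count_fasta_records(example))  # 2

# With the exercise file (assumed to be in the working directory):
# with open("100_seqs_a_galli.fasta") as handle:
#     print(count_fasta_records(handle))
```

For this exercise the count should match the 100 sequences shown in the Main Sequence Table after loading.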

OmicsBox uses the Basic Local Alignment Search Tool (BLAST) to find sequences similar to your query set. In this case, we will use the CloudBlast option to blast against the Nematoda subset of the NCBI nr database. To do this, go to functional analysis → Blast → Run Blast → CloudBlast, and configure the analysis as follows:

  • Select the blastx-fast option as Blast Program. 

  • Select the Non-redundant protein sequences (nr_v5) blast database.

  • Set "nematoda" as the Taxonomy filter by entering its NCBI taxonomy ID, 6231, and clicking on the “Add” button.

  • Set the "Blast against a subset of taxonomies" option.

  • Leave the remaining parameters as default. 

  • On the last wizard page, choose an output directory for the XML2 outputs.

As the BLAST search progresses, sequences with successful BLAST results change color in the Main Sequence Table from white to orange, and the BLAST result columns are filled in. If no results can be retrieved for a given sequence, its row turns dark red. Right-click on a sequence to display the Single Sequence Menu, from which you can view the BLAST results for that sequence individually.

Questions:

  1. How long does it take to complete? Once blast has finished, you can see how long it took in the Progress tab.

  2. Are all sequences successfully blasted?

  3. While the mapping is proceeding (next task), you may continue with the following steps:

    1. Browse BLAST results for any of the sequences (single sequence menu, right mouse click).

    2. Locate different hits and check the local alignment values.

Task 2: Ontology mapping

Once Blast is finished you can launch the Gene Ontology Mapping step. Mapping is the process of retrieving GO terms associated with the hits obtained by the BLAST search. OmicsBox performs four different mapping steps:

  • BLAST result accessions are used to retrieve gene names or symbols making use of two mapping files provided by the NCBI (gene_info, gene2accession). Identified gene names are then searched in the species-specific entries of the gene-product table of the GO database.

  • GenBank identifiers (gi), the primary blast hit IDs, are used to retrieve UniProt IDs making use of a mapping file from PIR (Non-redundant Reference Protein Database) including PSD, UniProt, Swiss-Prot, TrEMBL, RefSeq, GenPept and PDB.

  • Accessions are searched directly in the dbxref table of the GO database.

  • BLAST result accessions are searched directly in the gene-product table of the GO database.

Go to functional analysis → Blast2GO Mapping → Run GO Mapping, and run the analysis.

Questions:

  1. Check the mapping results for the sequence TRINITY_DN10164_c0_g1_i1, draw the GO graph for this sequence (sequence context menu, right-click) and locate the annotation scores. To find the sequence, you can make use of the table filters (e.g., write the name of the sequence in the SeqName column filter). Try to understand which GO terms will be annotated. How many graphs did you retrieve? Which categories do they belong to? Save the “biological process” graph for the next task.

  2. How many GO terms have been fetched for the first 20 sequences?

  3. How many sequences are still orange? You can use the Select by color or the table filters to find out the number of sequences that did not retrieve a GO term.

Task 3: Annotation step

This is the process of selecting GO terms from the GO pool obtained by the Mapping step and assigning them to the query sequences. GO annotation is carried out by applying an annotation rule (AR) to the ontology terms found. The rule seeks to find the most specific annotations with a certain level of reliability. This process is adjustable in specificity and stringency.

For each candidate GO an annotation score (AS) is computed. The AS is composed of two additive terms:

  • The first, direct term (DT), represents the highest hit similarity of this GO weighted by a factor corresponding to its evidence code (EC).

  • The second term (AT) of the AS provides the possibility of abstraction. This is defined as an annotation to a parent node when several child nodes are present in the GO candidate collection. This term multiplies the number of total GOs unified at the node by a user-defined GO weight factor that controls the possibility and strength of abstraction. When GO weight is set to 0, no abstraction is done.

Finally, the AR selects the lowest term per branch that lies over a user-defined threshold.
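The score and rule described above can be illustrated with a toy calculation. This is a minimal sketch under stated assumptions, not the OmicsBox implementation: the term names (parent, child_a, child_b), similarities, evidence-code weights, GO weight and threshold are all made-up values, the function names are hypothetical, and the exact way the tool counts the GOs unified at a node or treats a score equal to the threshold may differ.

```python
# Toy sketch of the annotation score (AS = DT + AT) and the annotation rule.
# All names and numbers below are illustrative assumptions.

def direct_term(hits):
    """DT: highest hit similarity weighted by its evidence-code factor.

    hits is a list of (similarity_percent, ec_weight) pairs for one GO term.
    """
    return max((sim * ecw for sim, ecw in hits), default=0.0)


def descendants(term, children):
    """Return every term below `term` in the toy hierarchy."""
    found = set()
    stack = list(children.get(term, []))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(children.get(node, []))
    return found


def annotation_scores(candidates, children, go_weight):
    """AS per candidate: DT plus GO weight times the candidate GOs below the node."""
    scores = {}
    for term, hits in candidates.items():
        unified = sum(1 for d in descendants(term, children) if d in candidates)
        scores[term] = direct_term(hits) + unified * go_weight
    return scores


def apply_rule(scores, children, threshold):
    """Keep passing terms that have no passing descendant (lowest per branch)."""
    passing = {t for t, s in scores.items() if s >= threshold}
    return sorted(t for t in passing if not (descendants(t, children) & passing))


# One parent with two children; each candidate has one hit (similarity, EC weight).
children = {"parent": ["child_a", "child_b"]}
candidates = {
    "parent": [(50.0, 1.0)],
    "child_a": [(60.0, 0.8)],
    "child_b": [(65.0, 0.7)],
}

scores = annotation_scores(candidates, children, go_weight=5)
print(apply_rule(scores, children, threshold=55))  # ['parent']

# With GO weight 0 there is no abstraction, so the parent no longer passes.
no_abstraction = annotation_scores(candidates, children, go_weight=0)
print(apply_rule(no_abstraction, children, threshold=55))  # []
```

In this toy case neither child reaches the threshold on its own, but the parent gains 2 × 5 points from the two children unified under it and is annotated instead. Setting the GO weight to 0 removes that gain, which matches the statement that no abstraction is done in that case.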

Go to functional analysis → Blast2GO Annotation → Run Annotation, and configure the analysis using the default parameters.

Questions:

  1. How many sequences are still green?

  2. How many GO terms do you obtain for the first 20 sequences?

  3. Generate the annotation graph (DAG) of TRINITY_DN10164_c0_g1_i1 (sequence context menu → right mouse click → Draw Graph of GO-Mapping with annotation score). Interpret and save the “biological process” graph. Compare it with the “biological process” graph obtained in the previous step. Why has GO:0055085 (transmembrane transport) not been annotated?