Exercise 2: RNA-Seq de novo Assembly and Post-processing

The aim of these exercises is to learn how to perform an RNA-Seq assembly from preprocessed sequencing reads, and how to apply methods that assess the quality and accuracy of the assembly results. For these tasks, we provide you with an RNA-Seq dataset from a transcriptomic analysis of the nematode Ascaridia galli. Follow the instructions below to perform a de novo RNA-Seq analysis and answer the proposed questions.

To avoid long runtimes, the RNA-Seq de novo assembly results from the original dataset can be downloaded and used to perform tasks 2 and 3.

Task 1: RNA-Seq de novo assembly

This task consists of running an RNA-Seq de novo assembly to generate a transcriptome assembly from the RNA sequencing reads preprocessed in exercise 1.

Go to Transcriptomics → Assembly → RNA-Seq De novo Assembly. Adjust the options as follows:

  • In the first wizard page, provide the input sequencing data. Select the preprocessed FASTQ files, and choose Paired-End Reads as Sequencing Data. The Upstream and Downstream patterns can be set to "_1" and "_2" respectively.

  • In the second wizard page:

    • Set the K-mer Size parameter as default (25).

    • Select the Non-Strand Specific option. 

    • Set the Minimum Contig Length as default (200).

    • Leave the Assess the Read Content option unchecked.

    • Leave the Minimizing Falsely Fused Transcripts option unchecked.

    • Leave the Construct Super Transcripts option unchecked.

    • Set the Pairs Distance value as default (500).

  • In the third wizard page:

    • Select the destination folder for the "Transcript to Gene mapping File". 
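
As an illustration of how the "_1" and "_2" patterns relate the two files of a paired-end sample, the following minimal sketch groups file names by the text that precedes the pattern. The function name pair_reads and the example file names are hypothetical, and the application may match patterns differently; the sketch also assumes that sample names contain no dots.

```python
def pair_reads(filenames, upstream="_1", downstream="_2"):
    """Group FASTQ file names into (upstream, downstream) pairs per sample."""
    pairs = {}
    for name in filenames:
        # Drop the extension (".fastq", ".fq.gz", ...); assumes no dots in the sample name.
        stem = name.split(".", 1)[0]
        for index, tag in enumerate((upstream, downstream)):
            if stem.endswith(tag):
                sample = stem[: -len(tag)]
                pairs.setdefault(sample, [None, None])[index] = name
                break
    # Keep only samples for which both mates were found.
    return {s: tuple(p) for s, p in pairs.items() if None not in p}


# Hypothetical file names, for illustration only.
example = ["sampleA_1.fastq", "sampleA_2.fastq", "sampleB_1.fastq"]
print(pair_reads(example))
```

A file without its mate (here the hypothetical sampleB_1.fastq) is left out of the result, which is a quick way to spot an incomplete selection before starting the wizard.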

When the assembly completes, a table containing the assembled transcripts sequences is opened. Furthermore, a result page shows an overview of the assembly results, and two charts show the read representation of the assembly.

Since this analysis can take a while, you can download the results and proceed with the following tasks. If you run the analysis yourself, a pop-up message will notify you when it is completed. Then you can answer the following questions.

Questions:

  1. How many transcripts have been assembled? How many bases (nt) have been assembled?

  2. How many genes have been deduced?

  3. What is the percentage of GC?

  4. Which is the longest transcript, and what is its length? And the shortest? (hint: sort the table by the length column).

  5. What is the meaning of the N50 metric? What is it used for?

  6. How many transcripts are longer than 300 base pairs? And shorter? Do you think that short transcripts should be filtered out prior to annotation?
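
If you export the assembled transcripts as a FASTA file, the figures asked for above can be cross-checked outside the application. The following is a minimal sketch, not the application's own implementation: the function names are hypothetical, N50 is computed with the common definition (the length L such that sequences of length L or longer contain at least half of all assembled bases), and the toy sequences are invented for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def assembly_stats(records, min_length=300):
    """Basic assembly metrics for an iterable of (header, sequence) pairs."""
    seqs = [seq.upper() for _, seq in records]
    lengths = [len(s) for s in seqs]
    total = sum(lengths)
    gc = sum(s.count("G") + s.count("C") for s in seqs)
    return {
        "transcripts": len(seqs),
        "total_bases": total,
        "gc_percent": 100.0 * gc / total if total else 0.0,
        "longest": max(lengths, default=0),
        "shortest": min(lengths, default=0),
        "n50": n50(lengths),
        "longer_than_min": sum(1 for n in lengths if n > min_length),
    }


# Toy input, for illustration only; replace with the lines of an exported FASTA file.
toy = [">t1", "ACGT", "AC", ">t2", "GGCC"]
print(assembly_stats(read_fasta(toy), min_length=5))
```

For a real file, pass an open file handle to read_fasta, for example `with open(path) as handle: assembly_stats(read_fasta(handle))`, where path is wherever you saved the export.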

Task 2: Completeness Assessment

This analysis can take a long time. Please download the results below to explore them and answer the questions.

This task consists of performing a Completeness Assessment analysis, based on the BUSCO methodology. The Completeness Assessment functionality provides quantitative measures for the assessment of transcriptome assembly completeness, based on evolutionarily-informed expectations of gene content from Benchmarking Universal Single-Copy Orthologs (BUSCO) selected from OrthoDB.

Open the “assembled_transcripts” project from the original dataset and go to Transcriptomics → Assembly → Completeness Assessment. Adjust the options as follows:

  • Lineage: Nematoda odb9

  • Mode: Transcriptome

  • Blast e-value: 1.0E-3

Once finished, a new tab is opened containing the results of the completeness assessment procedure. Each row corresponds to a BUSCO from the lineage database selected, and columns show the following information:

  • BUSCO ID: Name of the BUSCO.

  • Sequence ID: Name of the transcript/protein sequence matching the BUSCO.

  • Score: Score of the alignment.

  • Length: Length of the transcript/protein sequence matching the BUSCO.

  • Tag: Result category. The results are simplified into categories of Complete and single-copy, Complete and duplicated, Fragmented, or Missing BUSCOs:

    • Complete (single and duplicated): The BUSCO matches have scored within the expected range of scores and within the expected range of length alignments to the BUSCO profile.

    • Fragmented: The BUSCO matches have scored within the range of scores but not within the range of length alignments to the BUSCO profile. For transcriptomes or annotated gene sets, this indicates incomplete transcripts or gene models.

    • Missing: There were either no significant matches at all, or the BUSCO matches scored below the range of scores for the BUSCO profile. For transcriptomes or annotated gene sets, this indicates that these orthologs are indeed missing, or that the transcripts or gene models are so incomplete/fragmented that they could not even meet the criteria to be considered as fragmented.
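
The categories above can be tallied into counts and percentages with a few lines of code. This is a minimal sketch under stated assumptions: it expects one tag per BUSCO, spelled exactly as the four category names listed above (the labels in the actual Tag column may differ), and summarize_tags is a hypothetical helper, not part of BUSCO or of the application.

```python
from collections import Counter

# Assumed tag spellings; adjust to match the exported Tag column.
CATEGORIES = (
    "Complete and single-copy",
    "Complete and duplicated",
    "Fragmented",
    "Missing",
)


def summarize_tags(tags):
    """Return counts and percentages per category, plus the combined Complete class."""
    counts = Counter(tags)
    total = sum(counts[c] for c in CATEGORIES)
    summary = {"total": total}
    for category in CATEGORIES:
        n = counts[category]
        summary[category] = (n, 100.0 * n / total if total else 0.0)
    complete = counts[CATEGORIES[0]] + counts[CATEGORIES[1]]
    summary["Complete"] = (complete, 100.0 * complete / total if total else 0.0)
    return summary


# Invented tags, for illustration only.
toy_tags = (
    ["Complete and single-copy"] * 6
    + ["Complete and duplicated"] * 2
    + ["Fragmented"]
    + ["Missing"]
)
for name, value in summarize_tags(toy_tags).items():
    print(name, value)
```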

In addition, the completeness assessment report and summary chart can be opened from the sidebar of this tab. You can use these resources to answer the following questions.

Questions:

  1. How many BUSCOs have been used for the selected lineage?

  2. How many BUSCOs have complete gene representation (single-copy + duplicated)?

  3. How many BUSCOs have been detected as fragmented?

  4. How many BUSCOs have not been detected in the assembly?

  5. How good is the assembly according to this evaluation?

Task 3: Predict Coding Regions

This analysis can take a long time. Please download the results below to explore them and answer the questions.

The Predict Coding Regions functionality detects candidate coding regions within transcript sequences, such as those generated by de novo RNA-Seq transcript assembly. It is based on TransDecoder, a pipeline that recognizes likely coding sequences based on the following criteria:

  • A minimum length open reading frame (ORF) is found in a transcript sequence.

  • A log-likelihood score is computed and it should be > 0.

  • The above coding score is higher when the ORF is scored in the 1st reading frame as compared to scores in the other 2 forward reading frames.

  • If a candidate ORF is found fully encapsulated by the coordinates of another candidate ORF, the longer one is reported. However, a single transcript can report multiple ORFs (allowing for operons, chimeras, etc).

  • A Position-Specific Scoring Matrix (PSSM) is built, trained and used to refine the start codon prediction.

  • The putative peptide has a match to a Pfam domain above the noise cut-off score (optional).
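
To make the first criterion concrete, the sketch below scans the three forward reading frames of a transcript for open reading frames that run from an ATG to a stop codon and encode at least a minimum number of amino acids. It is a deliberate simplification and not TransDecoder's implementation: it assumes the universal genetic code, reports only complete ORFs, ignores the reverse strand, and applies none of the scoring, start refinement, or Pfam steps. The name forward_orfs and the toy sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # universal genetic code


def forward_orfs(seq, min_protein_length=100):
    """Return (start, end, frame) for ATG-to-stop ORFs in the 3 forward frames.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    The protein length counts amino acids and excludes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                protein_length = (pos - start) // 3
                if protein_length >= min_protein_length:
                    orfs.append((start, pos + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Toy transcript: 2 leading bases, ATG, five GCT codons, a stop codon, 1 trailing base.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
print(forward_orfs(toy, min_protein_length=6))
```

Raising min_protein_length above the encoded length makes the function return an empty list, which mirrors how the Minimum Protein Length option filters out short candidates.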

Open the “assembled_transcripts” project (from the original dataset) and go to Transcriptomics → Assembly → Predict Coding Regions. Adjust the options as follows:

  • Genetic Code: Universal

  • Minimum Protein Length: 100

  • Strand Specific: false

  • Provide Gene-Transcript Relationships: true

  • Transcript to Gene Mapping file: s1_e2_original_assembly/transcript_to_gene_map__20180620_1843.txt

  • Pfam Search: true

  • Retain Long ORFs Mode: Dynamic

  • Single Best Only: false

  • No Refine Starts: false

  • Top Longest ORFs for Training: 500

Once finished, results are returned in three projects: Protein sequences, CDS sequences, ORFs Coordinates (GFF project).

Note that in both sequence projects, CDSs and proteins, the description field contains details about the predicted ORF. This description includes:

  • The protein identifier, composed of the original transcript identifier followed by '|m.(number)'.

  • The type attribute indicates whether the protein is:

    • Complete: Contains a start and a stop codon.

    • 5' partial: It is missing a start codon and presumably part of the N-terminus.

    • 3' partial: It is missing the stop codon and presumably part of the C-terminus.

    • Internal: It is both 5' and 3' partial.

  • A (+) or (-) indicator showing the strand on which the coding region was found, along with the coordinates of the ORF in that transcript sequence.
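
The four values of the type attribute follow directly from whether a start and a stop codon are present. The sketch below illustrates that logic on a CDS sequence. It assumes the universal genetic code, ATG as the only start codon, and a sequence that is already in frame; orf_type is a hypothetical helper, and TransDecoder's own classification may handle edge cases differently.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # universal genetic code


def orf_type(cds):
    """Classify an in-frame CDS as complete, 5' partial, 3' partial, or internal."""
    cds = cds.upper()
    has_start = cds.startswith("ATG")
    has_stop = len(cds) >= 3 and cds[-3:] in STOP_CODONS
    if has_start and has_stop:
        return "complete"
    if has_stop:
        return "5' partial"  # start codon missing
    if has_start:
        return "3' partial"  # stop codon missing
    return "internal"  # both ends missing


# Invented sequences, for illustration only.
for toy in ("ATGGCTTAA", "GCTGCTTAA", "ATGGCTGCT", "GCTGCTGCT"):
    print(toy, orf_type(toy))
```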

In addition, a result page and a pie chart show a summary of the "Predict Coding Regions" results.

Questions:

  1. How many coding regions have been detected within the input transcript sequences?

  2. How many predicted ORFs have been classified as complete, partial (5' and 3') and internal?

  3. What do these results mean?