Exercise 1: Quality assessment and preprocessing of raw sequencing reads

The aim of these exercises is to fully understand how the FASTQ Quality Check and FASTQ Preprocessing features work. We want you to learn how to interpret the results of the quality assessment, as well as how to use the different preprocessing procedures (adapter removal, trimming and filtering).

Task 1: Performing a quality check of the raw RNA-Seq reads.

This task consists of carrying out a quality assessment of two paired-end samples (four FASTQ files in total).

Go to General Tools → FASTQ Tools → FASTQ Quality Check. Select the 4 FASTQ files as Input Reads and run the analysis, leaving the remaining options as default.

Once finished, a new tab opens containing simple composition statistics for each analyzed file. Each row corresponds to an input file, and the columns show the following information:

  • Name: The name of the file.

  • File type: Shows whether the file appeared to contain actual base calls or colorspace data.

  • Encoding: Shows the ASCII encoding of quality values. 

  • Total sequences: The total number of read sequences processed. 

  • Poor quality reads: Sequences flagged as poor quality reads.

  • Read length: Provides the length of the shortest and longest sequence in the set. 

  • %GC: The overall %GC of all bases in all sequences. 
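Most of the columns listed above can be approximated with a few lines of Python, which may help when interpreting the table. The following is a minimal sketch, not the tool's own implementation: the function names, the two example reads and the encoding heuristic are assumptions made for illustration, and the Poor quality reads column is left out because it depends on flags written by the sequencer.

```python
import io

def fastq_records(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual

def basic_statistics(handle):
    """Compute total sequences, read length range, %GC and a guessed encoding."""
    total = gc = bases = max_len = 0
    min_len = None
    lowest_qual = None
    for _, seq, qual in fastq_records(handle):
        total += 1
        n = len(seq)
        min_len = n if min_len is None else min(min_len, n)
        max_len = max(max_len, n)
        gc += sum(1 for base in seq.upper() if base in "GC")
        bases += n
        if qual:
            q = min(qual)
            lowest_qual = q if lowest_qual is None else min(lowest_qual, q)
    # Assumed heuristic: characters below ASCII 59 can only occur with a
    # Phred+33 offset; otherwise the file is guessed to be Phred+64.
    if lowest_qual is not None and ord(lowest_qual) < 59:
        encoding = "Phred+33"
    else:
        encoding = "Phred+64 (guess)"
    return {
        "total_sequences": total,
        "read_length": (min_len, max_len),
        "percent_gc": 100.0 * gc / bases if bases else 0.0,
        "encoding": encoding,
    }

# Two made-up reads, only to show the shape of the result.
EXAMPLE = "@r1\nGATTACA\n+\nIIIIIII\n@r2\nGGCC\n+\nII#I\n"
stats = basic_statistics(io.StringIO(EXAMPLE))
print(stats)
```

For a real file, the same function could be called on an open file handle (or on `gzip.open(path, "rt")` for compressed files) instead of the in-memory example.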

Furthermore, a general report shows a summary of the FASTQ Quality Check results, giving a quick evaluation of whether the results of each quality check module seem entirely normal (pass), slightly abnormal (warning) or very unusual (fail).
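To make the pass/warning/fail logic concrete, the sketch below classifies a single module from one summary number. It assumes thresholds in the style of the FastQC Adapter Content module (warning above 5% of reads, fail above 10%); the thresholds actually used by the tool, and the names `WARN_THRESHOLD`, `FAIL_THRESHOLD` and `module_status`, are illustrative assumptions.

```python
# Assumed thresholds, expressed as the percentage of reads that contain
# the adapter at the worst position along the read.
WARN_THRESHOLD = 5.0
FAIL_THRESHOLD = 10.0

def module_status(max_percent):
    """Map the highest observed percentage to a pass/warning/fail label."""
    if max_percent > FAIL_THRESHOLD:
        return "fail"
    if max_percent > WARN_THRESHOLD:
        return "warning"
    return "pass"

for value in (2.0, 7.5, 30.0):
    print(value, module_status(value))
```

A status is therefore a coarse flag on one summary value, so a warning or fail should always be checked against the corresponding chart before deciding on a preprocessing step.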

The results of each module for each file can be accessed as follows:

  • To open the summary report of each file, right-click on a row and click on Show report. Alternatively, the report of each sample can be opened from the general report by clicking on the buttons of the last column of the overall results table. A new report is opened containing a summary of the statistics and results for the selected file.

  • To open the result of each module for a file, right-click on a row and go to the Show Statistics submenu. These results can also be accessed by clicking on the buttons of the Details column in the summary report of each sample (file).

Save the results project to use it in the following exercises. 

Questions:

  1. What are the library size and read length of each file?

  2. Looking at the general report, what do you think is the biggest problem that has been detected in this dataset?

  3. Open the Adapter Content charts.

    1. Have adapter sequences been detected?

    2. At which end of the reads (5' or 3') do you observe more adapter content?

    3. What type of adapter sequence has been detected?

    4. What preprocessing procedures should be applied?

Task 2: Preprocessing the raw RNA-Seq reads.

This task consists of applying an adapter removal procedure and evaluating whether it was effective.

Go to General Tools → FASTQ Tools → FASTQ Preprocessing:

  • In the first wizard page, provide the input sequencing data. Select the 4 raw FASTQ files, and choose Paired-End Reads as Sequencing Data. The Upstream and Downstream patterns should be set to “_1” and “_2” respectively.

  • In the second wizard page, check the Remove Adapters option. Select TruSeq3 as Default Adapter Sequences, and leave the remaining parameters as default.

  • In the third wizard page, leave the Trimming option unchecked.

  • In the fourth wizard page, check the Filter By Quality and Filter By Length options and leave the default values. 

  • In the fifth wizard page, set a prefix for the names of the output files, and select the destination folders (output reads and unpaired reads can be placed in the same folder).
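Two ideas in the wizard steps above can be sketched in Python: how the “_1” and “_2” patterns group files into pairs, and why a length filter produces unpaired reads in addition to the paired output. This is only an illustration under stated assumptions: the file names, the `min_length` value and both function names are hypothetical, and the patterns are assumed to be matched at the end of the file name before its extension.

```python
def pair_files(names, upstream="_1", downstream="_2"):
    """Group file names into (upstream, downstream) pairs by name pattern."""
    pairs = {}
    for name in names:
        stem = name.partition(".")[0]  # file name without its extension(s)
        if stem.endswith(upstream):
            key = stem[: -len(upstream)]
            pairs.setdefault(key, [None, None])[0] = name
        elif stem.endswith(downstream):
            key = stem[: -len(downstream)]
            pairs.setdefault(key, [None, None])[1] = name
    return {sample: tuple(files) for sample, files in pairs.items()}

def split_by_length(read_pairs, min_length):
    """Keep pairs where both mates pass; a lone surviving mate becomes unpaired."""
    paired, unpaired = [], []
    for mate1, mate2 in read_pairs:
        keep1 = len(mate1) >= min_length
        keep2 = len(mate2) >= min_length
        if keep1 and keep2:
            paired.append((mate1, mate2))
        elif keep1:
            unpaired.append(mate1)
        elif keep2:
            unpaired.append(mate2)
        # if neither mate passes, the whole pair is discarded
    return paired, unpaired

# Hypothetical file names for two paired-end samples.
files = ["sampleA_1.fastq", "sampleA_2.fastq", "sampleB_1.fastq", "sampleB_2.fastq"]
file_pairs = pair_files(files)
print(file_pairs)

# Made-up reads after adapter removal; min_length=5 is an arbitrary example value.
reads = [("ACGTACGT", "ACGTAC"), ("ACG", "ACGTACGT"), ("AC", "A")]
paired, unpaired = split_by_length(reads, min_length=5)
print(paired, unpaired)
```

In the example, the second pair loses one mate to the length filter, so the surviving mate goes to the unpaired output, and the third pair is dropped entirely.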

Once finished, perform a FASTQ Quality Check of the preprocessed FASTQ files. 

Questions:

  1. How many reads have survived the preprocessing procedure?

  2. What is the read length of each file?

  3. Open the Adapter Content charts and compare them with the ones obtained from the raw sequencing reads. Have the adapter sequences been effectively removed?

  4. Why has the Per Base Sequence Content raised a warning or fail status for all files?