Hands-On Day 1
Before We Start ...
- Introduction to OmicsBox
- Install software: https://www.biobam.com/download/
- Your activation key: BOX-COURBIOB-CB6617A40BF03FD83494FA66638EA052
Session 1. Quality assessment and preprocessing of raw sequencing reads.
The aim of these exercises is to fully understand how the FASTQ Quality Check and FASTQ Preprocessing features work. We want you to learn how to interpret the results of the quality assessment, as well as how to use the different preprocessing procedures (adapter removal, trimming and filtering). For these tasks, we provide you with two datasets containing raw RNA-Seq reads. Follow the instructions below to perform the quality assessment and preprocessing tasks, and answer the proposed questions.
Exercise 1: Trimming and filtering.
Two small paired-end samples from a virus RNA sequencing project.
Task 1: Performing a quality check of the raw RNA-Seq reads.
This task consists of carrying out a quality assessment analysis of two paired-end FASTQ samples.
Go to General Tools → FASTQ Tools → FASTQ Quality Check. Select the 4 FASTQ files as Input Reads and run the analysis, leaving the remaining options as default.
Once finished, a new tab opens containing simple composition statistics for each analyzed file. Each row corresponds to an input file, and the columns show the following information:
- Name: The name of the file.
- File type: Shows whether the file appeared to contain actual base calls or colorspace data.
- Encoding: Shows the ASCII encoding of quality values.
- Total sequences: The total number of read sequences processed.
- Poor quality reads: Sequences flagged as poor quality reads.
- Sequence length: Provides the length of the shortest and longest sequence in the set.
- %GC: The overall %GC of all bases in all sequences.
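As a rough illustration of where some of these columns come from, the following Python sketch computes the total number of sequences, the shortest and longest read length, and the overall %GC from FASTQ text. It assumes the standard four-line FASTQ record layout and is not a description of how OmicsBox computes the table internally. The function name `fastq_summary` and the two example reads are made up for illustration.

```python
import io

def fastq_summary(handle):
    """Return basic composition statistics for four-line FASTQ records."""
    total = 0
    min_len = None
    max_len = 0
    gc = 0
    bases = 0
    for i, line in enumerate(handle):
        if i % 4 != 1:  # only the second line of each record holds the bases
            continue
        seq = line.strip().upper()
        total += 1
        length = len(seq)
        min_len = length if min_len is None else min(min_len, length)
        max_len = max(max_len, length)
        gc += seq.count("G") + seq.count("C")
        bases += length
    return {
        "total_sequences": total,
        "min_length": min_len or 0,
        "max_length": max_len,
        "percent_gc": 100.0 * gc / bases if bases else 0.0,
    }

# Two made-up reads, for illustration only.
example = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nGGCC\n+\nIIII\n"
)
stats = fastq_summary(example)
print(stats)
```

For a compressed file, the same function could be fed with `gzip.open(path, "rt")` instead of the in-memory example.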
In addition, a general report summarizes the FASTQ Quality Check results and gives a quick evaluation of whether the results of each module seem entirely normal (pass), slightly abnormal (warning) or very unusual (fail).
The results of each module for each file can be accessed as follows:
- To open the summary report of each file, right-click on a row and click on Show report. Alternatively, the report of each sample can be opened from the general report by clicking on the buttons of the last column of the overall results table. A new report is opened containing a summary of the statistics and results for the selected file.
- To open the result of each module for a file, right-click on a row and go to the Show Statistics submenu. These results can also be accessed by clicking on the buttons of the Details column of the results table.
Save the results project to use it in the following exercises.
- What are the library size and read length of each file?
- Looking at the general report, what do you think is the biggest problem that has been detected in this dataset?
- Open the Per Base Sequence Quality charts.
- Do reads show a good average quality over their entire length?
- Which regions of the reads have lower quality?
- Do you think that it is necessary to preprocess the dataset before continuing with the analysis?
- What preprocessing procedures should be applied?
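To make the Per Base Sequence Quality chart less of a black box, here is a minimal Python sketch of the idea behind it: quality characters are decoded to Phred scores and averaged position by position. It assumes Phred+33 encoding (the Encoding column of the results table tells you which encoding applies to your files), and it only computes the mean, whereas the chart summarizes the full distribution at each position. The function name and the quality strings are invented for illustration.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Mean Phred score at each read position (Phred+33 assumed by default)."""
    sums = []
    counts = []
    for qual in quality_strings:
        for pos, char in enumerate(qual):
            if pos == len(sums):  # first read long enough to reach this position
                sums.append(0)
                counts.append(0)
            sums[pos] += ord(char) - offset
            counts[pos] += 1
    return [s / c for s, c in zip(sums, counts)]

# Made-up quality strings: 'I' decodes to Q40, '5' to Q20, '#' to Q2.
quals = ["III5#", "II55", "I5#"]
means = mean_quality_per_position(quals)
print([round(m, 1) for m in means])
```

Reads of different lengths are handled by counting, for each position, only the reads that actually reach it.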
Task 2: Preprocessing the raw RNA-Seq reads I: Sliding Window Trimming
This task consists of applying different trimming strategies and evaluating which of them is more effective.
As a first attempt, the Sliding Window Trimming strategy will be applied.
Go to General Tools → FASTQ Tools → FASTQ Preprocessing. Adjust the options as follows:
- In the first wizard page, provide the input sequencing data. Select the 4 raw FASTQ files, and choose Paired-End Reads as Sequencing Data. Upstream and Downstream patterns can be established as "_R1_001" and "_R2_001" respectively.
- In the second wizard page, leave the Remove Adapters option unchecked.
- In the third wizard page, check the Trimming option and select the Sliding Window Trimming option. Leave the remaining parameters as default.
- In the fourth wizard page, leave the Filter By Quality and Filter By Length options unchecked.
- In the fifth wizard page, establish a prefix to set the name of output files, and select the destination folders (output reads and unpaired reads can be placed in the same folder).
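The idea behind sliding window trimming can be sketched in a few lines of Python: the read is scanned from the 5' end, and it is cut as soon as the mean quality inside the window falls below a threshold. This is a simplified sketch, not the tool's actual implementation, which may treat the bases around the cut point differently. The window size of 4 and the quality threshold of 15 are taken from the `SLIDINGWINDOW:4:15` step of the Trimmomatic command listed for this task; the function name and example read are hypothetical.

```python
def sliding_window_trim(seq, quals, window=4, min_mean=15):
    """Cut the read at the first window whose mean quality is below min_mean.

    Simplified sketch: reads shorter than the window are returned unchanged.
    """
    for start in range(len(quals) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < min_mean:
            # Keep only the bases before the first low-quality window.
            return seq[:start], quals[:start]
    return seq, quals

# Made-up read whose quality decays towards the 3' end.
read_seq = "ACGTACGTAC"
read_quals = [38, 38, 37, 36, 30, 20, 10, 8, 5, 2]
trimmed_seq, trimmed_quals = sliding_window_trim(read_seq, read_quals)
print(trimmed_seq, trimmed_quals)
```

In this example the window starting at the sixth base is the first one with a mean below 15, so the first five bases are kept.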
Once finished, the output files containing the preprocessed reads are stored in the selected folder. Files are generated in compressed format (fastq.gz). For paired-end data, four output files are generated per input sample: two that contain the upstream and downstream paired reads, and two that contain unpaired reads. Files with unpaired reads contain the word "unpaired" in their name so that they can be distinguished.
Furthermore, a result page will show a summary of the FASTQ Preprocessing results. This page provides a table that shows how many reads have survived and how many have been dropped during the analysis.
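The bookkeeping behind such a summary can be sketched as follows. For paired-end data, a pair stays in the paired output files only when both mates survive; when just one mate survives, it is written to the corresponding unpaired file. The Python sketch below tallies these outcomes for a handful of made-up pairs. The function names and category labels are chosen for illustration and may not match the wording of the OmicsBox table exactly.

```python
from collections import Counter

def classify_pair(forward_kept, reverse_kept):
    """Label a read pair according to which mates survived preprocessing."""
    if forward_kept and reverse_kept:
        return "both surviving"
    if forward_kept:
        return "forward only surviving"
    if reverse_kept:
        return "reverse only surviving"
    return "dropped"

def summarise(pairs):
    """Count how many pairs fall into each outcome category."""
    return Counter(classify_pair(f, r) for f, r in pairs)

# Made-up outcomes: (forward mate kept?, reverse mate kept?)
outcomes = [(True, True), (True, False), (False, True), (False, False), (True, True)]
summary = summarise(outcomes)
for label, count in summary.items():
    print(f"{label}: {count} ({100 * count / len(outcomes):.0f}%)")
```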
Perform a FASTQ Quality Check of the preprocessed FASTQ files (see Task 1).
Tip: Take care when providing the input reads in the quality check wizard. Remember to remove the files that were provided in the previous task.
- How many reads have survived the preprocessing procedure?
- What do "forward only" and "reverse only" surviving reads mean?
- What are the library size and read length of each file after the preprocessing step?
- Open the Per Base Sequence Quality charts and compare them with the ones obtained from the raw sequencing reads. Has the per base sequence quality been improved? Why?
$ java -jar trimmomatic-0.38.jar PE -summary summary.txt -threads 3 69_S85_L001_R1_001.fastq.gz 69_S85_L001_R2_001.fastq.gz clean_69_S85_L001_R1_001.fastq.gz clean_unpaired_69_S85_L001_R1_001.fastq.gz clean_69_S85_L001_R2_001.fastq.gz clean_unpaired_69_S85_L001_R2_001.fastq.gz SLIDINGWINDOW:4:15
Task 3: Preprocessing raw RNA-Seq reads II: Adaptive Quality Trimming
Now, the raw reads will be preprocessed again, but this time using a different trimming approach: the Adaptive Quality Trimming strategy.
Proceed in the same way as in Task 2, but this time selecting the Adaptive Quality Trimming option (third page).
Once finished, perform a FASTQ Quality Check of the preprocessed FASTQ files.
Tip: Select a different destination folder to store resulting files. In this way, the data produced in the previous steps will not be overwritten.
- How many reads have survived the preprocessing procedure?
- What are the library size and read length of each file after the preprocessing step?
- Open the Per Base Sequence Quality charts and compare them with the ones obtained from the raw sequencing reads. Has the per base sequence quality been improved? Why?
- Compare the quality assessment results from both preprocessing procedures (sliding window and adaptive trimming). Which has been more effective? Why?
$ java -jar trimmomatic-0.38.jar PE -summary summary.txt -threads 3 69_S85_L001_R1_001.fastq.gz 69_S85_L001_R2_001.fastq.gz clean_69_S85_L001_R1_001.fastq.gz clean_unpaired_69_S85_L001_R1_001.fastq.gz clean_69_S85_L001_R2_001.fastq.gz clean_unpaired_69_S85_L001_R2_001.fastq.gz MAXINFO:40:0.5
Task 4: Refining preprocessed FASTQ files: Length Trimming and Filtering
To finish this exercise, a final round of preprocessing will be applied to refine the results obtained in Task 3. In this way, the quality at the 3' end will be improved, and low-quality and very short reads will be discarded.
To achieve this, proceed in the same way as in Task 2, but this time:
- Select the 4 FASTQ files from the Adaptive Trimming step (Task 3) as Input Sequencing Data (first wizard page).
- Select the Length Trimming option, choose the 5' option, and set the "Trimming Threshold" to 25 (third wizard page, configuration 2).
- Check the Filter By Quality and Filter By Length options and leave the default parameters (fourth wizard page, configuration 3).
- Establish a different destination folder (fifth page).
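The combined effect of these settings on a single read can be sketched in Python as below: a fixed number of bases is removed from the 5' end, and the read is then discarded if its average quality is too low or if it has become too short. The values 25, 25 and 36 are taken from the `HEADCROP:25 AVGQUAL:25 MINLEN:36` steps of the Trimmomatic command listed for this task; whether they match the wizard defaults is an assumption. The function name and example reads are hypothetical.

```python
def refine_read(seq, quals, head_crop=25, min_avg_qual=25, min_len=36):
    """Crop the 5' end, then filter by average quality and by length.

    Returns the refined (seq, quals) tuple, or None if the read is discarded.
    """
    # Length trimming: remove a fixed number of bases from the start of the read.
    seq, quals = seq[head_crop:], quals[head_crop:]
    # Filter by quality: drop reads whose mean Phred score is too low.
    if not quals or sum(quals) / len(quals) < min_avg_qual:
        return None
    # Filter by length: drop reads that are now too short.
    if len(seq) < min_len:
        return None
    return seq, quals

# Made-up reads, for illustration only.
good = refine_read("A" * 100, [30] * 100)   # kept, 75 bases remain
short = refine_read("A" * 50, [30] * 50)    # only 25 bases remain: discarded
poor = refine_read("A" * 100, [10] * 100)   # mean quality 10: discarded
print(len(good[0]), short, poor)
```

The filters are applied in the order in which they appear in the command, so the average quality is evaluated on the read after the 5' crop.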
Once finished, perform a FASTQ Quality Check of the preprocessed FASTQ files.
- How many reads have survived the preprocessing procedure?
- What are the library size and read length of each file after the preprocessing step?
- Open the Per Base Sequence Quality charts and compare them with the ones obtained from the raw sequencing reads. Has the per base sequence quality been improved?
- Has this step improved the results of the previous preprocessing procedure?
$ java -jar trimmomatic-0.38.jar PE -summary summary.txt -threads 3 clean_69_S85_L001_R1_001.fastq.gz clean_69_S85_L001_R2_001.fastq.gz final_clean_69_S85_L001_R1_001.fastq.gz final_clean_unpaired_69_S85_L001_R1_001.fastq.gz final_clean_69_S85_L001_R2_001.fastq.gz final_clean_unpaired_69_S85_L001_R2_001.fastq.gz HEADCROP:25 AVGQUAL:25 MINLEN:36
Exercise 2: Adapter removal.
Two reduced samples from the Ascaridia galli dataset.
Task 1: Performing a quality check of the raw RNA-Seq reads.
This task consists of carrying out a quality assessment analysis of two paired-end FASTQ samples.
Go to General Tools → FASTQ Tools → FASTQ Quality Check. Select the 4 FASTQ files as Input Reads and run the analysis, leaving the remaining options as default.
- What are the library size and read length of each file?
- Looking at the general report, what do you think is the biggest problem that has been detected in this dataset?
- Open the Adapter Content charts.
- Have adapter sequences been detected?
- At which end (5' or 3') of the reads do you observe more adapters?
- What type of adapter sequence has been detected?
- What preprocessing procedures should be applied?
Task 2: Preprocessing the raw RNA-Seq reads.
This task consists of applying an adapter removal procedure and evaluating if it works.
Go to General Tools → FASTQ Tools → FASTQ Preprocessing:
- In the first wizard page, provide the input sequencing data. Select the 4 raw FASTQ files, and choose Paired-End Reads as Sequencing Data. Upstream and Downstream patterns can be established as "_1" and "_2" respectively.
- In the second wizard page, check the Remove Adapters option. Select TruSeq3 as Default Adapter Sequences, and leave the remaining parameters as default.
- In the third wizard page, leave the Trimming option unchecked.
- In the fourth wizard page, check the Filter By Quality and Filter By Length options and leave the default parameters.
- In the fifth wizard page, establish a prefix to set the name of output files, and select the destination folders (output reads and unpaired reads can be placed in the same folder).
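For intuition about what adapter removal does to an individual read, the following Python sketch clips an adapter from the 3' end using exact matching only. It is far simpler than what dedicated tools do (they tolerate mismatches and, for paired-end data, can also use the overlap between mates). The adapter sequence shown is an illustrative example and is not guaranteed to match the contents of the TruSeq3 preset; `trim_adapter_3prime` and `min_overlap` are hypothetical names.

```python
def trim_adapter_3prime(seq, adapter, min_overlap=8):
    """Remove an adapter from the 3' end of a read (exact matches only)."""
    # Case 1: the complete adapter occurs in the read; cut from where it starts.
    idx = seq.find(adapter)
    if idx != -1:
        return seq[:idx]
    # Case 2: the read ends with the beginning of the adapter.
    longest = min(len(adapter) - 1, len(seq))
    for k in range(longest, max(min_overlap, 1) - 1, -1):
        if seq.endswith(adapter[:k]):
            return seq[:-k]
    return seq

ADAPTER = "AGATCGGAAGAGC"  # illustrative sequence, assumed for this example
insert = "ACGTACGTACGT"    # made-up biological fragment

full = trim_adapter_3prime(insert + ADAPTER + "TTTT", ADAPTER)
partial = trim_adapter_3prime(insert + ADAPTER[:10], ADAPTER)
clean = trim_adapter_3prime(insert, ADAPTER)
print(full, partial, clean)
```

Requiring a minimum overlap avoids clipping reads that merely end with one or two bases that happen to match the start of the adapter.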
Once finished, perform a FASTQ Quality Check of the preprocessed FASTQ files.
- How many reads have survived the preprocessing procedure?
- What are the library size and read length of each file?
- Open the Adapter Content charts and compare them with the ones obtained from the raw sequencing reads. Have the adapter sequences been effectively removed?
- Why has the Per Base Sequence Content check raised a warning or fail status for all files?
$ java -jar trimmomatic-0.38.jar PE -summary summary.txt -threads 3 ERR1948631_1.fastq.gz ERR1948631_2.fastq.gz clean_ERR1948631_1.fastq.gz clean_unpaired_ERR1948631_1.fastq.gz clean_ERR1948631_2.fastq.gz clean_unpaired_ERR1948631_2.fastq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:15:8:true AVGQUAL:25 MINLEN:36