Hands On Session 1: Quantification, QC, and Preprocessing with OmicsBox

Introduction

Dataset

During this training session, we will be working with the Healthy Bonemarrow Dataset (Figure 1). It consists of healthy bonemarrow samples from three donors: one male, and two females.

Single Cell RNA Sequencing has been performed to obtain sequencing data from single cells. The Library Preparation has been done following the 10x Chromium 3' v2 pipeline.

In all cases, more than one technical replicate has been performed. Figure 2 shows the Experimental Design.

Software

During this session, we will be using OmicsBox. This is a user-friendly tool to perform a wide variety of bioinformatic analyses developed at BioBam Bioinformatics S.L.

If not yet installed, please download OmicsBox and ask for a free trial here.

Goals

  • To be able to perform a scRNA-Seq Quantification and configure the analysis according to our input data.

  • Perform a Quality Check on scRNA-Seq Count Matrices.

  • Perform proper filtering on scRNA-Seq Count Matrices.

In order to reduce analysis times, results are already provided. Please download the necessary data for the following tasks here:

download datasets

Figure 1. Healthy Bonemarrow Dataset.

Figure 2. Experimental Design.

Task 1. Customize the STARsolo run.

It is very common to re-analyze already published data. We have seen that the scRNA-Seq quantification parameters are really technology-dependent. Thus making the library prep technology identification a crucial step in the analysis. Would you be able to identify the library prep technology used and the specific parameters to configure the “Single Cell RNA-Seq Quantification” step?

Dataset

Library Prep

Barcode Mate

Cell Barcode Start

Cell Barcode Length

UMI Start

UMI Length

Clip from 5' end

Clip from 3' end

Barcode Whitelist

Dataset 1

  • Separate Read
  • Part of Mate 1
  • Part of Mate 2

  • Yes
  • No

Dataset 2

  • Separate Read
  • Part of Mate 1
  • Part of Mate 2

  • Yes
  • No

Dataset 3

  • Separate Read
  • Part of Mate 1
  • Part of Mate 2

  • Yes
  • No

Dataset 4

  • Separate Read
  • Part of Mate 1
  • Part of Mate 2

  • Yes
  • No

Useful resources:

  • scRNA-Seq Quantification OmicsBox User Manual → link

  • Website with Library Prep read structure information → link

 See the answer

Dataset

Library Prep

Barcode Mate

Cell Barcode Start

Cell Barcode Length

UMI Start

UMI Length

Clip from 5' end

Clip from 3' end

Barcode Whitelist

Dataset 1

Drop-seq

  • Separate Read
  • Part of Mate 1
  • Part of Mate 2

1

12

13

8

No

No

  • Yes
  • No

Dataset 2

CEL-Seq2

  • Separate Read
  • Part of Mate 1
  • Part of Mate 2

1

8

9

8

No

No

  • Yes
  • No

Dataset 3

10c Chromium v3.1

  • Separate Read
  • Part of Mate 1
  • Part of Mate 2

1

16

17

12

No

No

  • Yes
  • No

Dataset 4

10c Chromium v2

  • Separate Read
  • Part of Mate 1
  • Part of Mate 2

1

16

17

10

No

No

  • Yes
  • No

Task 2. Evaluate the effects of choosing different counting parameters.

One of the major differences between the most commonly used scRNA-Seq Quantification tools (CellRanger and STARsolo) is the possibility to count multimapping reads or not. The default behavior of CellRanger is to discard multi-mapping reads, whereas STARsolo allows choosing between different algorithms to count multi-mapping reads. In addition, CellRanger only counts the reads mapping to the exon region, whereas STARsolo allows choosing between counting reads mapping only to exon regions or both exon and intron regions.

In order to evaluate these two parameters, let’s take a look at the count matrices generated with these configurations:

file

multi-mapping reads

exons + introns

discard_exon.box

(error)

(error)

multimap_exon.box

(tick)

(error)

discard_genefull.box

(error)

(tick)

multimap_genefull.box

(tick)

(tick)

TIP 💡

It is possible to merge the different count matrices from the Side Panel > Merge Single Cell Counts. In this way, the distribution charts will show the violin plots for each of the count matrices. This will ease the comparison.

Which configuration seems to produce more counts per cell?

 See the answer

By looking at the distribution of counts plots, it seems that the “multimap_genefull” produces in overall cells with more counts. The median number of counts per cell is higher that with the other configurations. In addition, counting multimapping reads produces an overall higher amount of counts than discarding them. So it happens with counting intronic reads in addition to exonic reads.

And a greater number of expressed genes (with counts >0) per cell?

 See the answer

Both the “discard_genefull” and “multimap_genefull” produce a higher number of expressed genes per cell. Thus, it seems that counting both reads mapping to exons and introns affects more this distribution.

Did you see any difference regarding the percentage of expressed mitochondrial genes? Why could be the cause?

 See the answer

It seems that the counts obtained with the “discard_exon” options produce cells with higher percentages of mitochondrial genes. This could be caused because with other procedure the total number of expressed genes is higher, thus reducing the percentage that are mitochondrial.

Which parameters seem to have more impact on the final count matrix with this dataset? Counting the multi-mapping reads or counting the intronic reads?

 See the answer

With this general analysis, it seems that both parameters influence the final count table. Assessing deeper the impact will require further downstream analysis.

Task 3. Choose the more appropriate Cell Barcode detection parameters.

How the algorithm handles the Cell Barcodes detected in the data will influence in the final cells included in the count matrix. Let’s see how the different parameters available in STARsolo influences the final counts.

In order to evaluate these two parameters, let’s take a look at the count matrices generated with these configurations:

file

Cell Barcode Match Type

Cells Filtering

Num. Expected Cells

p1_c1.box

1MM Multi + Pseudocounts

CellRanger

3000

exactMatch.box

Exact Match

CellRanger

3000

onemm.box

1MM

CellRanger

3000

cr10k.box

1MM Multi + Pseudocounts

CellRanger

10000

top10k.box

1MM Multi + Pseudocounts

Top Cells

10000

emptyDrops.box

1MM Multi + Pseudocounts

Empty Drops

3000

How many cells were obtained on each configuration? Which seems to be influencing the most? Why?

p1_c1.box

exactMatch.box

onemm.box

cr10k.box

top10k.box

emptyDrops.box

Num cells

 See the answer

p1_c1.box

exactMatch.box

onemm.box

cr10k.box

top10k.box

emptyDrops.box

Num cells

2,999

2,999

3,000

3,125

10,167

3,327

The Count Table obtained using the Top Cells algorithm with an incorrect number of expected cells is the most different. This algorithm report the top cells ordered by UMI count, so the output will contain a number of cells close to the given number.

Compare the results obtained with an expected number of cells of 3000 and 10000. Which would you say is more correct?

 See the answer

The most evident hint in this case would be the UMIs Per Cell plot. This is the plot obtained with the top10k configuration:

And with the top3k configuration:

The top10k configuration keeps cells that are beyond the harsh drop in the plot. These cells have a comparatively lower number of detected UMIs compared to the rest. Thus, cells that fall behind that drop are considered to be background noise and not truly cells or even cells that could be damaged and lost a huge part of their RNA content. Thus, by looking at these plots, it seems that 3000 is a better approximation of the number of cells present in the dataset.

On the other hand, if we look at the Distribution of Counts plots, we can see that most of the cells present in the top10k project contain a really low number of counts, whereas the top3k has a more even distribution:

Task 4. Decide filtering thresholds.

So far, we have been analyzing the count table generated for donor 1 and technical replicate 1. But the whole dataset consists of three donors with a few technical replicates each. You can find all the count tables in the task4 folder. Let’s try to define appropriate filtering thresholds for this dataset.

TIP 💡

It is possible to merge the different count matrices from the Side Panel > Merge Single Cell Counts. In this way, the distribution charts will show the violin plots for each of the count matrices. This will ease the comparison.

Will you apply a filtering for the minimum and maximum number of counts per cell? Which one(s)?

 See the answer

Looking at the distribution, it seems that cells with more than 30,000 / 35,000 counts are outliers. Thus, it would be appropriate to filter cells with a higher number of counts than this, since they could be doublets.

Regarding the minimum number of counts, it seems that this particular dataset has cells with a high number of counts. This can be seen by activating the dots in the violin plots and by sorting the count table by the column “Counts” in the Cell viewer in ascending order. So, in this case, a filter for the minimum number of counts might not be needed.

Will you apply a filtering for the minimum and maximum number of genes per cell? Which one(s)?

 See the answer

It seems that the number of expressed genes has a more even distribution, with a few number of outliers. Thus, a filtering might not be needed in this case.

Will you apply a filtering for the maximum percentage of mitochondrial genes per cell? Which one?

 See the answer

Most of the cells have a low percentage of mitochondrial genes expressed. However, there is som cells with comparatively high percentage of mitochondrial genes. It is recommended to filter these cells, since a high percentage of mitochondrial genes is considered to be an indicative of poor quality cells. In this case, a filter of 15% might be adequate.

Will you apply the same filters for all the samples?

 See the answer

The “p3” count tables contain comparatively a higher number of counts. Thus, it could be worth it to filter the count tables separately in order to apply different thresholds to the “p3” count tables.

Task 5. Evaluate your results.

In the task5 folder, you’ll find a Single Cell RNA-Seq dataset of Drosophila melanogaster eye disc. There you will find the sequencing FASTQ files, and the reference genome and annotation. The single cell library has been obtained with Drop-seq. The length of the cDNA read is 51bp.

With this information, try to configure the scRNA-Seq Quantification and take a look at your results. Would you change any parameters? What are the characteristics of your obtained count matrix?