Hands On Session 2: Data Preprocessing and Clustering, DEG and Enrichment with OmicsBox

Introduction

Dataset

During this training session, we will be working with the Healthy Bonemarrow Dataset (Figure 1). It consists of healthy bone marrow samples from three donors: one male and two females.

Single-cell RNA sequencing (scRNA-Seq) was performed on these samples. Library preparation followed the 10x Chromium 3' v2 protocol.

In all cases, more than one technical replicate has been performed. Figure 2 shows the Experimental Design.

Software

During this session, we will be using OmicsBox, a user-friendly tool developed at BioBam Bioinformatics S.L. for performing a wide variety of bioinformatic analyses.

If not yet installed, please download OmicsBox and ask for a free trial here.

Goals

  • Perform a scRNA-Seq Clustering from a Count Matrix.

  • See how the different Clustering parameters change the output, and assess the results.

  • Perform Differential Expression Analysis on the Clustering results to interrogate our data.

  • Perform Functional Enrichment Analysis to further interpret the Differential Expression results.

In order to reduce analysis times, results are already provided. Please download the necessary data for the following tasks here:

download datasets

Figure 1. Healthy Bonemarrow Dataset.

Figure 2. Experimental Design.

Task 1. Compare the clustering with and without integration.

Integration is a crucial step when analyzing data generated from different samples. In theory, the same cell types are present in the same tissue of different individuals. However, the differences between individuals sometimes influence the analysis more than the differences between cell types, so in multi-sample datasets cells commonly end up clustering by sample as well as by cell type. To overcome this, Seurat and other packages perform an integration step. Let’s see how this changes the results.

Open the clustering1.box and clustering2.box. Which would you say has been integrated and which has not? Why?

 See the answer

The clustering2.box has been integrated, whereas the clustering1.box has not. This can be easily seen in the UMAP plot colored by donor:

In the clustering1.box, cells from the different donors are completely separated into different clusters, whereas in the clustering2.box every cluster is composed of cells from all three donors.
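The visual check above can also be expressed numerically. The following is a minimal sketch, assuming the cluster and donor labels of each cell have been exported as two parallel lists; the function names, the 0.9 threshold and the toy labels are hypothetical and are not part of OmicsBox or of the course data.

```python
from collections import Counter, defaultdict


def donor_fractions(clusters, donors):
    """Return {cluster: {donor: fraction of that cluster's cells}}."""
    counts = defaultdict(Counter)
    for cluster, donor in zip(clusters, donors):
        counts[cluster][donor] += 1
    return {
        cluster: {d: n / sum(c.values()) for d, n in c.items()}
        for cluster, c in counts.items()
    }


def donor_specific_clusters(clusters, donors, threshold=0.9):
    """List clusters in which one donor contributes at least `threshold` of the cells."""
    fractions = donor_fractions(clusters, donors)
    return sorted(c for c, f in fractions.items() if max(f.values()) >= threshold)


# Toy labels for 12 cells (made up for illustration).
donors = ["D1", "D2", "D3"] * 4
not_integrated = [0, 1, 2] * 4                         # each cluster holds a single donor
integrated = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]      # each cluster mixes all donors

print(donor_specific_clusters(not_integrated, donors))
print(donor_specific_clusters(integrated, donors))
```

A clustering in which most clusters are flagged as donor-specific behaves like the non-integrated result, while an empty list is what a well-mixed, integrated result would look like under this simple criterion.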

Would you say that this dataset contains batch effects between technical replicates? Why?

 See the answer

It seems that there are no major batch effects between replicates. Even in the non-integrated dataset, cells coming from the different technical replicates cluster together:


Task 2. Annotate the Clusters.

A common way of annotating clusters is to use already-published marker genes. Marker genes are sets of genes that are known to be highly expressed in a given cell type. So, if a cluster shows high expression of many of the marker genes of a cell type, it is reasonable to assume that the cells in that cluster belong to that cell type.

In order to annotate our dataset, we will use the marker genes present in the Azimuth reference database and the Expression Profile plot.
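The idea behind marker-based annotation can be sketched in a few lines. This is only an illustration, assuming a small expression table held in a dictionary; the gene names, values and the `marker_scores` function are hypothetical and do not reproduce how the Expression Profile plot is computed.

```python
from statistics import mean


def marker_scores(expression, clusters, markers):
    """Mean expression of the marker genes in each cluster.

    expression: {gene: [value per cell]}
    clusters:   [cluster label per cell], same cell order as the expression values
    markers:    list of marker gene names for one cell type
    """
    present = [g for g in markers if g in expression]
    if not present:
        raise ValueError("none of the marker genes are in the expression table")
    scores = {}
    for cluster in sorted(set(clusters)):
        cells = [i for i, c in enumerate(clusters) if c == cluster]
        scores[cluster] = mean(expression[g][i] for g in present for i in cells)
    return scores


# Toy data: four cells, three made-up genes.
clusters = ["c1", "c1", "c2", "c2"]
expression = {
    "GENE_A": [5.0, 4.0, 0.0, 1.0],
    "GENE_B": [6.0, 5.0, 1.0, 0.0],
    "GENE_C": [0.0, 1.0, 4.0, 5.0],
}
markers = ["GENE_A", "GENE_B"]  # hypothetical marker list for one cell type

scores = marker_scores(expression, clusters, markers)
best = max(scores, key=scores.get)
print(scores, best)
```

The cluster with the highest score is the best candidate for the cell type, although, as with the plot, close scores between clusters or between cell types call for a cautious annotation.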

Open the healthy_bonemarrow.box clustering results. Go to the Side Panel > Charts > Expression Profile and select the file EarlyErythroideeERY.txt. This list contains the marker genes for Early Erythroid cells. Which cluster(s) would you say belong to this cell type?

 See the answer

Both Cluster-6 and Cluster-8 show high expression of the Early Erythroid marker genes, which suggests that these clusters contain cells of this cell type.

Following this procedure, try to annotate the rest of the cell types using the marker gene lists available in the Azimuth_HSPCs_Markers folder. Keep in mind that sometimes the annotation is not as clear as in the above example and some clusters might have ambiguous annotations.

Cluster       Cell Type
Cluster-1
Cluster-2
Cluster-3
Cluster-4
Cluster-5
Cluster-6
Cluster-7
Cluster-8
Cluster-9
Cluster-10
Cluster-11
Cluster-12
Cluster-13
Cluster-14
Cluster-15
Cluster-16
Cluster-17
Cluster-18

 See the answer

Cluster       Cell Type
Cluster-1     LMPP, HSC
Cluster-2     GMP
Cluster-3     LMPP, HSC
Cluster-4     LMPP, HSC
Cluster-5     proB, preB
Cluster-6     Early Ery
Cluster-7     GMP
Cluster-8     Early Ery
Cluster-9     proB, preB
Cluster-10    proMk
Cluster-11    prepDC
Cluster-12    proB, preB
Cluster-13    proB
Cluster-14    LMPP
Cluster-15    prepDC, premDC
Cluster-16    ?
Cluster-17    proB, preB
Cluster-18    stromal

It seems that this clustering or this set of marker genes was not capable of clearly distinguishing between LMPP and HSC cells and between preB and proB. In addition, Cluster-16 presents marker genes from different cell types, making it difficult to decide on a particular cell type.

Task 3. Perform DE and Functional Enrichment on Cluster 10.

Performing Differential Expression analysis of one cluster versus the rest can give a general idea of the genes that drove the separation of the cells in that particular cluster. To interpret the DE results more easily, a Functional Enrichment Analysis can be performed to examine the biological functions (usually more meaningful than gene names) that are characteristic of that cluster.
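As a rough illustration of what "one cluster versus the rest" means, the sketch below computes a log2 fold change per gene between a target cluster and a reference group. It is a simplification under stated assumptions: the function names, the pseudocount and the cutoff are hypothetical, the toy values are made up, and a real Differential Expression analysis additionally relies on a statistical test with multiple testing correction, which is not shown here.

```python
from math import log2
from statistics import mean


def log2_fold_changes(expression, clusters, target, reference=None, pseudocount=1.0):
    """log2 fold change of mean expression, target cluster vs. reference.

    If `reference` is None, all cells outside `target` form the reference
    (one cluster versus the rest); otherwise only cells of that cluster do.
    """
    target_idx = [i for i, c in enumerate(clusters) if c == target]
    if reference is None:
        ref_idx = [i for i, c in enumerate(clusters) if c != target]
    else:
        ref_idx = [i for i, c in enumerate(clusters) if c == reference]
    if not target_idx or not ref_idx:
        raise ValueError("both groups need at least one cell")
    result = {}
    for gene, values in expression.items():
        inside = mean(values[i] for i in target_idx)
        outside = mean(values[i] for i in ref_idx)
        # The pseudocount avoids division by zero for unexpressed genes.
        result[gene] = log2((inside + pseudocount) / (outside + pseudocount))
    return result


def classify(fold_changes, cutoff=1.0):
    """Label each gene UP, DOWN or NS (not significant by fold change alone)."""
    return {
        g: "UP" if v >= cutoff else "DOWN" if v <= -cutoff else "NS"
        for g, v in fold_changes.items()
    }


# Toy data: four cells, three made-up genes.
clusters = ["Cluster-10", "Cluster-10", "Cluster-1", "Cluster-2"]
expression = {
    "GENE_A": [7.0, 7.0, 1.0, 1.0],
    "GENE_B": [0.0, 0.0, 3.0, 3.0],
    "GENE_C": [2.0, 2.0, 2.0, 2.0],
}

fc = log2_fold_changes(expression, clusters, target="Cluster-10")
labels = classify(fc)
print(fc, labels)
```

Passing a `reference` cluster instead of leaving it empty gives the cluster-versus-cluster comparison, which is the same kind of contrast used later to compare two clusters directly.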

From the healthy_bonemarrow.box clustering results, go to the Side Panel > Actions > Differential Expression. How would you configure the Differential Expression Analysis of Cluster-10 versus the rest?

 See the answer

Open the proMk_vs_rest.box results. How many UP- and DOWN-regulated genes were obtained?

 See the answer

241 UP-regulated genes and 21 DOWN-regulated genes.

These numbers can be obtained by filtering the results table, by generating the Summary on the Side Panel, or by generating the “Results Overview” chart.

From the Differential Expression results, go to the Side Panel > Actions > Fisher’s Exact Test. Specify the biomart_hsapiens_gene_ensembl.box file in the “Reference Annotation”. Which functions are detected as overrepresented? Do they agree with the functions we would expect from a Megakaryocyte Progenitor cell?

 See the answer

The enriched functions are mainly related to platelet activation and blood coagulation. These functions are actually carried out by platelets, but it makes sense to find them enriched, since Megakaryocytes are the progenitors of platelets.
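For readers curious about what the enrichment test computes for a single function, here is a minimal sketch of a one-sided Fisher's exact test written as a hypergeometric tail probability. The function name and the counts are hypothetical, the sketch assumes that the differentially expressed genes are a subset of the reference set, and it leaves out the multiple testing correction that is needed when many functions are tested.

```python
from math import comb


def fisher_enrichment_pvalue(de_with, de_total, bg_with, bg_total):
    """One-sided p-value for over-representation of one function.

    de_with:  DE genes annotated with the function
    de_total: all DE genes
    bg_with:  reference genes annotated with the function (DE genes included)
    bg_total: all reference genes (DE genes included)

    Returns the probability of drawing at least `de_with` annotated genes
    when picking `de_total` genes at random from the reference.
    """
    upper = min(de_total, bg_with)
    tail = sum(
        comb(bg_with, k) * comb(bg_total - bg_with, de_total - k)
        for k in range(de_with, upper + 1)
    )
    return tail / comb(bg_total, de_total)


# Made-up counts: 20 reference genes, 5 carry the function,
# 5 genes are DE and 4 of those carry the function.
p = fisher_enrichment_pvalue(de_with=4, de_total=5, bg_with=5, bg_total=20)
print(p)
```

A small p-value means that the function appears among the DE genes more often than expected by chance given the reference annotation.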

Task 4. Dig into the differences between Cluster-6 and Cluster-8.

Both Cluster-6 and Cluster-8 were annotated as Early Erythroid cells by looking at the marker genes. However, they have been grouped into different clusters. What could be the cause of this? Let’s see if we can identify the difference between those two clusters using Differential Expression Analysis.

From the healthy_bonemarrow.box clustering results, go to the Side Panel > Actions > Differential Expression. How would you configure the Differential Expression Analysis of Cluster-6 versus Cluster-8?

 See the answer

Open the c6_vs_c8.box Differential Expression results. How many UP and DOWN genes were obtained?

 See the answer

278 UP-regulated genes and 18 DOWN-regulated genes.

These numbers can be obtained by filtering the results table, by generating the Summary on the Side Panel, or by generating the “Results Overview” chart.

From the Differential Expression results, go to the Side Panel > Actions > Fisher’s Exact Test. Specify the biomart_hsapiens_gene_ensembl.box file in the “Reference Annotation”. Which functions are detected as overrepresented? What seems to be the cause of the separation of the Early Erythroid cells into two different clusters? Is there a preprocessing step we could perform in order to avoid this separation?

 See the answer

Many of the enriched functions are related to the cell cycle and cell division. Thus, it seems that both Cluster-6 and Cluster-8 contain Early Erythroid cells, but at different stages of the cell cycle.

To avoid this separation, we would have to go back to the clustering step and regress out the cell cycle genes.
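The principle of such a regression can be sketched as follows. This is a simplified illustration, assuming each cell already has a single numeric cell cycle score; the `regress_out` function, the score and the expression values are hypothetical, and the actual procedure used by clustering tools may differ in its details.

```python
from statistics import mean


def regress_out(values, covariate):
    """Residuals of the least-squares fit: values ~ intercept + slope * covariate."""
    mx, my = mean(covariate), mean(values)
    sxx = sum((x - mx) ** 2 for x in covariate)
    if sxx == 0:
        # A constant covariate explains nothing; only center the values.
        return [y - my for y in values]
    slope = sum((x - mx) * (y - my) for x, y in zip(covariate, values)) / sxx
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(covariate, values)]


# Toy data: four cells, the last two with a high (made-up) cell cycle score.
cycle_score = [0.0, 0.0, 1.0, 1.0]
cycle_driven_gene = [1.0, 1.0, 5.0, 5.0]   # follows the cell cycle score
unrelated_gene = [2.0, 4.0, 2.0, 4.0]      # independent of the cell cycle score

print(regress_out(cycle_driven_gene, cycle_score))
print(regress_out(unrelated_gene, cycle_score))
```

After the regression, the gene that followed the cell cycle score no longer separates the cells, while the unrelated gene keeps its variation, which is why clustering on the residuals is less likely to split a cell type by cell cycle state.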

Task 5. Explore your results.

In the task5 folder, you will find a Count Matrix object containing all the samples of the Healthy Bonemarrow dataset. Try to perform the Clustering with different parameters by yourself and explore the obtained results.