Hands on 5: Trajectory Inference in OmicsBox

Duration: 30 Minutes

Download all the data: Google-Drive

Analysis Approach

In the following tutorial, we will use filtered counts, meaning empty cells and lowly expressed genes have been removed. Seurat clustering has already been performed on these counts to identify the starting cell states. In our case, since we are working with a dataset sequenced from human bone marrow, Hematopoietic Stem Cells (HSCs) will act as the starting root nodes. We will then iteratively perform trajectory analysis and try to identify ideal root nodes. Finally, we will use prior knowledge to validate the results of the trajectory analysis. For simplicity, we will walk you through the analysis of Donor-3.


Step 1: Load and explore the data

OmicsBox supports most standard scRNA-seq count table formats. For this tutorial, we will load data already stored in a box file. These counts were generated with STARsolo, which is integrated into OmicsBox. To open the box file, navigate to "File → Open .box file." This opens the count object in the OmicsBox viewer. Once the counts are loaded, they appear in the table viewer.

Then, we can use the Charting options from the Side Panel Actions to plot various QC (Quality Control) charts before starting the analysis. We will plot the following charts to ensure data quality:

  • Total Counts: This shows the distribution of total counts per cell barcode, making it easy to see where the bulk of the cells lies and to spot outliers.

  • Features per Barcode: This chart shows the distribution of the number of detected genes per cell barcode. The width of the violin plot at a given value indicates how many cells fall at that value.

  • Mitochondrial Counts Percentage: This shows the percentage of mitochondrial counts within each cell. Elevated percentages can signal compromised cell health, as healthy cells usually show a smaller ratio of mitochondrial to nuclear transcripts.

All the above plots pass the quality check. As mentioned earlier, these counts have already been filtered.
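
The three QC metrics plotted above can also be computed by hand, which helps when interpreting the charts. The following is a minimal sketch in plain Python, not the OmicsBox implementation. The toy barcodes, the gene names, and the assumption that mitochondrial genes carry an "MT-" prefix (the usual convention for human data) are all illustrative.

```python
from typing import Dict

def qc_metrics(counts: Dict[str, Dict[str, int]],
               mito_prefix: str = "MT-") -> Dict[str, Dict[str, float]]:
    """Compute per-barcode QC metrics from a {barcode: {gene: count}} mapping."""
    metrics = {}
    for barcode, gene_counts in counts.items():
        # Total Counts: sum of all counts in the cell.
        total = sum(gene_counts.values())
        # Features per Barcode: number of genes with at least one count.
        n_features = sum(1 for c in gene_counts.values() if c > 0)
        # Mitochondrial Counts Percentage (assumes the "MT-" naming convention).
        mito = sum(c for g, c in gene_counts.items() if g.startswith(mito_prefix))
        pct_mito = 100.0 * mito / total if total else 0.0
        metrics[barcode] = {
            "total_counts": total,
            "n_features": n_features,
            "pct_mito": pct_mito,
        }
    return metrics

# Hypothetical toy data: two barcodes, four genes.
toy_counts = {
    "AAAC-1": {"CD34": 5, "GATA1": 0, "MT-CO1": 1, "ACTB": 14},
    "AAAG-1": {"CD34": 0, "GATA1": 3, "MT-CO1": 6, "ACTB": 3},
}

qc = qc_metrics(toy_counts)
for barcode, m in qc.items():
    print(barcode, m)
```

In this toy example the second barcode has half of its counts in a mitochondrial gene, which is the kind of cell that filtering would normally remove.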

Step 2: Initiating the Trajectory Analysis Wizard

The Trajectory Inference Wizard can be accessed directly from the OmicsBox count table objects via the side panels. To open the wizard, navigate to "Side Panels → Trajectory Analysis." Once you've done this, the wizard will open and appear as shown below:

Step 3: Supplying the starting cells

After opening the wizard, the next step is to select the starting cells. You can download the list of starting cells here.

The selection of root nodes is crucial for Monocle3, as they serve as the reference points for constructing cell trajectories. OmicsBox facilitates this process in two ways:

  • List of Root Cells: This option accepts a text file containing a list of potential root cells, such as Progenitor Cells, Start Cells, or Undifferentiated Cells. Note that this file should be supplied without any headers.

  • Cell Metadata File: This method allows you to upload a tab-separated file that contains meta-information or experimental details about the cells.

    • Select Column: Once the Cell Metadata file is uploaded, the wizard will display a dropdown list featuring all potential columns containing root information.

    • Starting Point: After choosing the appropriate column, you must then decide on the actual starting point. If you're considering a temporal analysis, the starting point could be the initial capture time, labelled as "0h" or something similar. Alternatively, it might be a specific cell type, like a hematopoietic stem cell.
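
To make the two input options concrete, the sketch below parses both kinds of file in plain Python. It only illustrates the expected file shapes and is not how OmicsBox reads them. The column names `barcode`, `cell_type`, and `timepoint`, as well as the barcodes themselves, are hypothetical.

```python
import csv
import io
from typing import List

def read_root_list(text: str) -> List[str]:
    """Parse a headerless root-cell list: one barcode per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def roots_from_metadata(text: str, column: str, starting_point: str) -> List[str]:
    """Return barcodes whose value in `column` equals `starting_point`.

    Assumes a tab-separated file with a header row and a 'barcode' column
    (the column name is an assumption made for this example).
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [row["barcode"] for row in reader if row[column] == starting_point]

# Option 1: list of root cells, no header.
root_list_text = "AAAC-1\nAAAG-1\n"

# Option 2: cell metadata file, tab-separated with a header.
metadata_text = (
    "barcode\tcell_type\ttimepoint\n"
    "AAAC-1\tHSC\t0h\n"
    "AAAG-1\tHSC\t0h\n"
    "AAAT-1\tErythroid\t24h\n"
)

roots_a = read_root_list(root_list_text)
roots_b = roots_from_metadata(metadata_text, column="cell_type", starting_point="HSC")
print(roots_a)
print(roots_b)
```

Both options yield the same set of root cells here. Choosing the `timepoint` column with the starting point "0h" would select the same two cells in a temporal design.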

Step 4: Parameter Tuning

The subsequent page in the wizard allows you to set multiple parameters, including those for clustering, graph learning, and UMAP. These settings will have a direct impact on the trajectory graph that is generated. Below is a brief description of each parameter:

Brief description of the parameters

  1. Normalization aims to reduce non-biological variation in the data to make it more comparable across samples.

    1. Log-normalization: This method is useful when values vary widely. It applies a logarithmic transformation, which compresses the range of the data.

    2. Size-factor normalization: This corrects each cell for differences in sequencing depth by dividing its counts by the cell's size factor.

  2. PCA is a dimension reduction method that transforms the original variables into a new set of variables called principal components (PCs). These PCs are orthogonal to each other and capture most of the variation in gene expression.

    1. The Dimensions setting refers to the number of dimensions you want to retain after PCA. For datasets with more than 5,000 cells, it is common to use the top 50 principal components.

    2. Scaling matters when variables are measured on different scales, as is the case for genes with very different expression ranges. It is advisable to scale the data before running PCA so that a few high-variance genes do not dominate the components.

  3. UMAP is another dimension reduction technique. It is similar to t-SNE and is often used for visualizing high-dimensional data.

    1. Minimum Distance: This parameter dictates how tightly the cells will be clustered in UMAP. Lower values result in denser clusters, whereas higher values maintain the broad topological structure of the data.

    2. Neighbors: The Neighbors parameter in UMAP helps in balancing local vs. global structures in the data. Lower values focus on local structures, while higher values provide a more global view but might sacrifice fine details.
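
The normalization and scaling steps described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the exact procedure used by Monocle3 or OmicsBox: the size factor is taken here as a cell's total count divided by the mean total across cells, the log transform uses a pseudocount of 1 with the natural logarithm, and scaling is a per-gene z-score. The toy matrix is hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List

Matrix = List[List[float]]  # rows are cells, columns are genes

def size_factor_normalize(counts: Matrix) -> Matrix:
    """Divide each cell's counts by its size factor.

    Assumption: size factor = cell total / mean total over all cells.
    Cells with zero total counts are assumed to have been filtered out.
    """
    totals = [sum(cell) for cell in counts]
    avg_total = mean(totals)
    return [[v / (t / avg_total) for v in cell] for cell, t in zip(counts, totals)]

def log_normalize(matrix: Matrix) -> Matrix:
    """Apply log(1 + x) to compress the range of large values."""
    return [[math.log1p(v) for v in cell] for cell in matrix]

def scale_genes(matrix: Matrix) -> Matrix:
    """Z-score each gene (column) so every gene has mean 0 and unit variance."""
    scaled_columns = []
    for column in zip(*matrix):
        mu, sd = mean(column), pstdev(column)
        # Constant genes carry no information; map them to 0.
        scaled_columns.append([(v - mu) / sd if sd > 0 else 0.0 for v in column])
    return [list(row) for row in zip(*scaled_columns)]

# Hypothetical toy counts: cells 1 and 2 have the same composition,
# but cell 2 was sequenced twice as deeply.
toy = [
    [10, 0, 30],
    [20, 0, 60],
    [5, 10, 25],
]

normalized = size_factor_normalize(toy)
scaled = scale_genes(log_normalize(normalized))
for row in scaled:
    print([round(v, 3) for v in row])
```

After size-factor normalization the first two cells become identical, since they differ only in sequencing depth. After scaling, every gene contributes on the same footing to a subsequent PCA.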

Clustering

Now, let's move to clustering-related parameters.

  1. The Nearest Neighbors parameter specifies the number of nearest neighbors (k) used to build the nearest-neighbor graph on which the cells are clustered.

  2. Enabling Allow Disjoint Graph combines different partitions into a single trajectory. If disabled, distant partitions are assigned "Infinite" pseudotime.

  3. Turning on Allow Loops will enable the detection of potential cyclic trajectories within the data.

  4. The Resolution parameter sets the granularity of cell clustering. It defines how finely or coarsely cells are grouped based on their expression profiles.
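
To see why cells in a partition that is not connected to the root end up with "Infinite" pseudotime, it helps to think of pseudotime as the distance travelled along the trajectory graph from the root. The sketch below illustrates that idea with a shortest-path search in plain Python. It is a simplified stand-in, not the Monocle3 or OmicsBox implementation, and the node names and edge lengths are hypothetical.

```python
import heapq
import math
from typing import Dict, List, Tuple

Graph = Dict[str, List[Tuple[str, float]]]  # node -> [(neighbor, edge length)]

def pseudotime(graph: Graph, roots: List[str]) -> Dict[str, float]:
    """Shortest graph distance from any root node to every node (Dijkstra)."""
    dist = {node: math.inf for node in graph}
    heap: List[Tuple[float, str]] = []
    for root in roots:
        dist[root] = 0.0
        heapq.heappush(heap, (0.0, root))
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale queue entry
        for neighbor, length in graph[node]:
            candidate = d + length
            if candidate < dist[neighbor]:
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist

# Hypothetical trajectory graph with two partitions:
# a hematopoietic branch containing the root, and a disconnected stromal one.
toy_graph: Graph = {
    "HSC": [("MPP", 1.0)],
    "MPP": [("HSC", 1.0), ("Erythroid", 2.0), ("Myeloid", 1.5)],
    "Erythroid": [("MPP", 2.0)],
    "Myeloid": [("MPP", 1.5)],
    "Stromal_A": [("Stromal_B", 1.0)],
    "Stromal_B": [("Stromal_A", 1.0)],
}

pt = pseudotime(toy_graph, roots=["HSC"])
for node, value in pt.items():
    print(node, value)
```

Nodes in the stromal partition are never reached from the HSC root, so their distance stays at infinity. Supplying an additional root inside that partition, or learning a single connected graph, would give them finite values.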

Once all the values are set (the defaults are fine for this tutorial), click "RUN".