User Guide

Introduction


This is a web-based interactive (wizard style) application to perform a guided single-cell RNA-seq data analysis and clustering based on Seurat.

The wizard style makes it intuitive to go back between steps and adjust parameters based on different outputs/plots, giving the user the ability to use feedback in order to guide the analysis iteratively.

It is meant to provide an intuitive interface for researchers to easily upload, analyze, visualize, and explore single-cell RNA-seq data interactively with no prior programming knowledge in R.

It is based on Seurat, an R package designed for QC, analysis, and exploration of single cell RNA-seq data.

The application follows the Seurat - Guided Clustering Tutorial workflow closely. It also provides additional functionalities to further explore and visualize the data.

See Figure 1 below for a diagram that outlines all the workflow steps and their expected output.


Figure 1: Workflow

Input Data Types


This application accepts the following types of input data:

1. Example data (Demo):

  • For demo purposes, you can select “Example data”

  • That will automatically load the PBMC data used in the tutorial

  • You can follow the steps afterwards to run the analysis mirroring the tutorial in order to get familiar with the app

2. Upload your own data (gene counts):

  • A .csv/.txt file that contains a table of the gene counts

  • The first column should have gene names/ids followed by columns for sample counts. The file can be either comma or tab delimited

  • If your counts are not merged, you can use this Count Merger to consolidate all your sample count files

  • Make sure cell/column names do NOT contain underscores _ unless they are replicates

  • For replicates, denote column names with an underscore plus the replicate number (e.g. Sample_1)

  • First column can either contain gene.ids or gene.names

  • For a sample file, click here

Figure 2: Example counts file
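The file layout described above can be illustrated with a short Python sketch. This is not the app's own code (the app runs Seurat in R); the function names `read_counts` and `split_replicate` are hypothetical, and the delimiter detection (a tab in the header line means tab delimited) is an assumption made for the example.

```python
import csv
import io


def split_replicate(column_name):
    """Split 'Sample_1' into ('Sample', 1); names without '_' have no replicate."""
    if "_" not in column_name:
        return column_name, None
    sample, _, replicate = column_name.rpartition("_")
    return sample, int(replicate)


def read_counts(text):
    """Parse a comma- or tab-delimited counts table.

    Returns (cell_names, counts), where counts maps each gene to a list of
    integer counts, one per cell column.
    """
    # Assumption: a tab in the header line means the file is tab delimited.
    header_line = text.splitlines()[0]
    delimiter = "\t" if "\t" in header_line else ","
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    cell_names = rows[0][1:]  # the first column holds gene names/ids
    counts = {row[0]: [int(v) for v in row[1:]] for row in rows[1:] if row}
    return cell_names, counts


example = "gene,CtrlA_1,CtrlA_2,TreatB\nMT-CO1,5,0,2\nCD3E,1,3,0\n"
cells, counts = read_counts(example)
replicates = [split_replicate(name) for name in cells]
```

Here `CtrlA_1` and `CtrlA_2` are treated as two replicates of `CtrlA`, while `TreatB` has no replicate number, which is why underscores should be reserved for replicates.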

3. 10X data:

  • This option requires uploading 3 files: 1 .mtx file, and 2 .tsv files

  • See the example in the tutorial, or click here to download the PBMC data.
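To make the three-file layout more concrete, here is a minimal Python sketch of how a MatrixMarket coordinate file (.mtx) relates to the two .tsv files. It assumes the usual 10X convention that rows are genes and columns are cell barcodes, with one .tsv listing the genes and the other the barcodes. The function name `read_mtx` and the sample values are hypothetical.

```python
def read_mtx(text):
    """Parse MatrixMarket coordinate text into (n_rows, n_cols, entries).

    entries maps (row, col) -> value using 0-based indices; the file itself
    uses 1-based indices.
    """
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("%")]
    n_rows, n_cols, n_entries = (int(x) for x in lines[0].split())
    entries = {}
    for line in lines[1:]:
        i, j, value = line.split()
        entries[(int(i) - 1, int(j) - 1)] = int(value)
    assert len(entries) == n_entries, "entry count does not match the size line"
    return n_rows, n_cols, entries


mtx_text = """%%MatrixMarket matrix coordinate integer general
%
3 2 3
1 1 4
3 1 1
2 2 7
"""
genes = ["GeneA", "GeneB", "GeneC"]   # would come from the genes .tsv file
barcodes = ["AAAC-1", "AAAG-1"]       # would come from the barcodes .tsv file
n_genes, n_cells, entries = read_mtx(mtx_text)
count_of = {(genes[i], barcodes[j]): v for (i, j), v in entries.items()}
```

Only non-zero counts are stored, which is why the matrix file needs the two .tsv files to give its row and column indices a meaning.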


Run Results

1. Data Output

There will be plenty of output information from major steps, some of which will be displayed and/or downloadable:

  • Genes in PCs
  • List of cells in each cluster
  • List of differentially expressed genes
  • Seurat Object (.RObj) containing all steps and computed data output

2. Visualization

Various forms of visualizations are included:

  • QC/Filter
    • Violin Plot
    • Gene Plot
    • Dispersion Plot
  • Dimensionality Reduction
    • PCA Plot
    • Heatmap
  • Clustering
    • Jackstraw / Elbow Plots
    • tSNE Plot
  • Gene Expression
    • Violin Plot
    • Feature Plot

Acknowledgements




Upload Data

10X Data, 1 .mtx.gz file, and 2 .tsv.gz files

CSV counts file

Initial Parameters

Note: if there are more than 20 columns, only the first 20 will show here


QC & Filter (Preprocessing)


Filter Cells

Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria. You can visualize gene and molecule counts, plot their relationship, and exclude cells with a clear outlier number of genes detected as potential multiplets. This is not a guaranteed method for excluding cell doublets; see the tutorial for more information. As in the tutorial's example, you can also filter cells based on the percentage of mitochondrial genes present.
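As an illustration of the mitochondrial-percentage metric, the Python sketch below computes, for each cell, the share of counts coming from genes whose names match a regular expression. The pattern `^MT-` assumes human gene symbols with a mitochondrial prefix; the function name `percent_matching` and the counts are hypothetical, and this is not the app's own implementation.

```python
import re


def percent_matching(counts, pattern):
    """Per-cell percentage of counts from genes whose name matches pattern."""
    regex = re.compile(pattern)
    n_cells = len(next(iter(counts.values())))
    matched = [0] * n_cells
    total = [0] * n_cells
    for gene, row in counts.items():
        is_match = regex.search(gene) is not None
        for cell, value in enumerate(row):
            total[cell] += value
            if is_match:
                matched[cell] += value
    # Cells with no counts at all get 0.0 rather than a division error.
    return [100.0 * m / t if t else 0.0 for m, t in zip(matched, total)]


counts = {
    "MT-CO1": [5, 0, 30],
    "MT-ND1": [5, 0, 20],
    "CD3E": [40, 10, 25],
    "LYZ": [50, 30, 25],
}
percent_mito = percent_matching(counts, r"^MT-")
```

In this toy table the third cell gets half of its counts from the matched genes, the kind of outlier you would typically consider filtering.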


Filter Options:

1) By Regular Expression:


Genes that match Regex

Filter Expressions

2) Select Specific Genes:

3) Copy/Paste Specific Genes:

Genes not found

Remove/Correct those genes that are not found and click "Add Genes"


Vln Plot (Filter Cells)

Note that low.thresholds and high.thresholds together define a 'gate': cells whose values fall outside it are filtered out.
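The gate idea can be sketched in a few lines of Python. The function name `gate_cells` is hypothetical, and the use of strict inequalities and of infinite defaults (meaning "no limit on that side") are assumptions made for the example.

```python
import math


def gate_cells(metric_by_cell, low=-math.inf, high=math.inf):
    """Return the cells whose metric lies strictly between low and high."""
    return [cell for cell, value in metric_by_cell.items() if low < value < high]


n_genes_detected = {"cell1": 150, "cell2": 900, "cell3": 2600, "cell4": 1200}
kept = gate_cells(n_genes_detected, low=200, high=2500)
no_upper_limit = gate_cells(n_genes_detected, low=200)
```

With both thresholds set, `cell1` (too few genes) and `cell3` (a possible multiplet) are dropped; leaving the upper threshold unset keeps `cell3`.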

Select thresholds to filter cells



Feature Scatter Plot

A scatter plot is typically used to visualize feature-feature relationships. See the tutorial for examples/details.

Normalize, Select Var. Features, Scale Data


SCTransform function

Use this function as an alternative to the NormalizeData, FindVariableFeatures, ScaleData workflow. Results are saved in a new assay called SCT, with counts being (corrected) counts, data being log1p(counts), and scale.data being Pearson residuals. Intermediate results from sctransform::vst are saved in the misc slot of the new assay.

We can regress out cell-cell variation in gene expression driven by batch (if applicable), cell alignment rate (as provided by Drop-seq tools for Drop-seq data), the number of detected molecules, and mitochondrial gene expression. Refer to tutorial to see an example of regressing on the number of detected molecules per cell as well as the percentage mitochondrial gene content for post-mitotic blood cells.

Variables to regress out in a second non-regularized linear regression. For example, percent.mito.


Data Normalization

After removing unwanted cells from the dataset, the next step is to normalize the data.

By default, we employ a global-scaling normalization method “LogNormalize” that normalizes the gene expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result.
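The arithmetic behind this normalization can be written out as a small Python sketch for a single cell. It assumes the log transform is the natural log of (1 + x); the function name `log_normalize` is hypothetical.

```python
import math


def log_normalize(cell_counts, scale_factor=10_000):
    """Divide by the cell's total, multiply by scale_factor, then take log(1 + x)."""
    total = sum(cell_counts)
    if total == 0:
        return [0.0 for _ in cell_counts]
    return [math.log1p(c / total * scale_factor) for c in cell_counts]


cell = [0, 3, 1, 6]  # raw counts for four genes in one cell
normalized = log_normalize(cell)
```

Because every cell is divided by its own total, two cells with the same relative expression but different sequencing depth end up with the same normalized values.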


Detection of variable genes across the single cells

Seurat calculates highly variable genes and focuses on these for downstream analysis. FindVariableGenes calculates the average expression and dispersion for each gene, places these genes into bins, and then calculates a z-score for dispersion within each bin. This helps control for the relationship between variability and average expression.

We suggest that users set these parameters to mark visual outliers on the dispersion plot, but the exact parameter settings may vary based on the data type, heterogeneity in the sample, and normalization strategy.
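To show why binning matters, here is a simplified Python sketch of a within-bin z-score. It uses equal-width bins on the mean and the sample standard deviation, which are assumptions for illustration; Seurat's own calculation works on its own scale and binning, so treat the numbers as a toy example. The function name `dispersion_zscores` is hypothetical.

```python
import statistics


def dispersion_zscores(mean_by_gene, dispersion_by_gene, n_bins=2):
    """Z-score each gene's dispersion against genes with similar mean expression."""
    lo, hi = min(mean_by_gene.values()), max(mean_by_gene.values())
    width = (hi - lo) / n_bins or 1.0
    bins = {}
    for gene, mean in mean_by_gene.items():
        index = min(int((mean - lo) / width), n_bins - 1)  # top edge joins last bin
        bins.setdefault(index, []).append(gene)
    zscores = {}
    for members in bins.values():
        values = [dispersion_by_gene[g] for g in members]
        center = statistics.mean(values)
        spread = statistics.stdev(values) if len(values) > 1 else 0.0
        for g in members:
            zscores[g] = (dispersion_by_gene[g] - center) / spread if spread else 0.0
    return zscores


means = {"g1": 0.1, "g2": 0.2, "g3": 0.3, "g4": 2.0, "g5": 2.1, "g6": 2.2}
dispersions = {"g1": 1.0, "g2": 1.0, "g3": 4.0, "g4": 5.0, "g5": 5.0, "g6": 8.0}
z = dispersion_zscores(means, dispersions)
```

Genes `g3` and `g6` have different raw dispersions (4.0 and 8.0) but receive the same z-score, because each stands out equally from genes with a similar average expression.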

Output

Scaling the data and removing unwanted sources of variation

Your single cell dataset likely contains ‘uninteresting’ sources of variation. This could include not only technical noise, but batch effects, or even biological sources of variation (cell cycle stage). As suggested in Buettner et al, NBT, 2015, regressing these signals out of the analysis can improve downstream dimensionality reduction and clustering. To mitigate the effect of these signals, Seurat constructs linear models to predict gene expression based on user-defined variables. The scaled z-scored residuals of these models are stored in the scale.data slot, and are used for dimensionality reduction and clustering.
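The idea of regressing out a variable and keeping the scaled residuals can be sketched for one gene and one covariate. This Python example uses ordinary least squares with a single predictor, which is a simplification assumed for illustration; the function name `regress_out` and the data are hypothetical.

```python
import statistics


def regress_out(expression, covariate):
    """Return z-scored residuals of a one-covariate least-squares fit."""
    mean_x = statistics.mean(covariate)
    mean_y = statistics.mean(expression)
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # What is left of the expression after removing the fitted trend.
    residuals = [y - (intercept + slope * x) for x, y in zip(covariate, expression)]
    center = statistics.mean(residuals)
    spread = statistics.stdev(residuals)
    return [(r - center) / spread for r in residuals]


percent_mito = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_expression = [2.1, 3.9, 6.2, 7.8, 10.0]  # rises with percent_mito
scaled = regress_out(gene_expression, percent_mito)
```

After this step the values have mean 0 and standard deviation 1, and they no longer trend with the covariate, so the covariate cannot dominate the downstream dimensionality reduction.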

We can regress out cell-cell variation in gene expression driven by batch (if applicable), cell alignment rate (as provided by Drop-seq tools for Drop-seq data), the number of detected molecules, and mitochondrial gene expression. Refer to tutorial to see an example of regressing on the number of detected molecules per cell as well as the percentage mitochondrial gene content for post-mitotic blood cells.


Dispersion Plot

The average expression and dispersion for each gene

Perform linear dimensional reduction

Compute PCA

Next we perform PCA on the scaled data. By default, the variable features are used as input, but a different set can be specified using the features argument.

We have typically found that running dimensionality reduction on highly variable genes can improve performance.

However, with UMI data, particularly after regressing out technical variables, we often see that PCA returns similar (albeit slower) results when run on much larger subsets of genes, including the whole transcriptome.

PCA Print Output


Compute ICA (Optional)

ICA Print Output


VizPlot Output:

PCs to plot:


PCA Plots:

Choose PCs to plot:


ICA Plots:

Choose ICs to plot:


PCHeatmap Output:

PCs to use:



Determine statistically significant PCs:


To overcome the extensive technical noise in any single gene for scRNA-seq data, Seurat clusters cells based on their PCA scores, with each PC essentially representing a ‘metagene’ that combines information across a correlated gene set. Determining how many PCs to include downstream is therefore an important step.

PC selection (identifying the true dimensionality of a dataset) is an important step for Seurat, but can be challenging/uncertain for the user. We therefore suggest considering these three approaches.

The first is more supervised, exploring PCs to determine relevant sources of heterogeneity, and could be used in conjunction with GSEA for example.

The second implements a statistical test based on a random null model, but is time-consuming for large datasets, and may not return a clear PC cutoff.

The third is a heuristic that is commonly used, and can be calculated instantly.


1) PC Elbow Plot (quick)

A more ad hoc method for determining which PCs to use is to look at a plot of the standard deviations of the principal components and draw your cutoff where there is a clear elbow in the graph. This can be done with PCElbowPlot.
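The elbow is normally picked by eye, but a rough numeric stand-in can help to see what "elbow" means. The Python sketch below picks the point farthest from the straight line joining the first and last points; this particular heuristic is an assumption chosen for illustration and is not part of PCElbowPlot. The function name `elbow_index` and the standard deviations are hypothetical.

```python
def elbow_index(std_devs):
    """Index of the point farthest from the line joining the first and last points."""
    n = len(std_devs)
    x1, y1, x2, y2 = 0, std_devs[0], n - 1, std_devs[-1]

    def distance(i):
        # Unnormalized perpendicular distance; the constant denominator is dropped.
        return abs((y2 - y1) * i - (x2 - x1) * std_devs[i] + x2 * y1 - y2 * x1)

    return max(range(n), key=distance)


pc_std_devs = [9.0, 6.0, 4.0, 2.0, 1.8, 1.7, 1.6, 1.5, 1.45, 1.4]
suggested_pcs = elbow_index(pc_std_devs) + 1  # convert 0-based index to a PC count
```

For these values the curve flattens after the fourth PC, so the sketch suggests keeping four PCs. In practice it is worth checking the plot itself rather than relying on a single heuristic.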


2) JackStraw (slow) OPTIONAL

In Macosko et al, we implemented a resampling test inspired by the jackStraw procedure. We randomly permute a subset of the data (1% by default) and rerun PCA, constructing a ‘null distribution’ of gene scores, and repeat this procedure. We identify ‘significant’ PCs as those that have a strong enrichment of low p-value genes.

The JackStrawPlot function provides a visualization tool for comparing the distribution of p-values for each PC with a uniform distribution (dashed line). ‘Significant’ PCs will show a strong enrichment of genes with low p-values (solid curve above the dashed line).

NOTE: this process can take a long time for big datasets. It can take ~10 minutes for the example PBMC data.


PCs to use:



Cluster Cells

Seurat now includes a graph-based clustering approach. Importantly, the distance metric which drives the clustering analysis (based on previously identified PCs) remains the same. However, our approach to partitioning the cellular distance matrix into clusters has dramatically improved. Our approach was heavily inspired by recent manuscripts which applied graph-based clustering approaches to scRNA-seq data [SNN-Cliq, Xu and Su, Bioinformatics, 2015] and CyTOF data [PhenoGraph, Levine et al., Cell, 2015].

Briefly, these methods embed cells in a graph structure, for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar gene expression patterns, and then attempt to partition this graph into highly interconnected ‘quasi-cliques’ or ‘communities’. As in PhenoGraph, we first construct a KNN graph based on the Euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard distance). To cluster the cells, we apply modularity optimization techniques [SLM, Blondel et al., Journal of Statistical Mechanics], to iteratively group cells together, with the goal of optimizing the standard modularity function.
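The neighborhood-overlap idea can be sketched in Python on a handful of made-up cells. The example builds each cell's K nearest neighbors by Euclidean distance and scores a pair of cells by the Jaccard overlap of their neighborhoods. Counting a cell as part of its own neighborhood is an assumption made here, and the function names and coordinates are hypothetical.

```python
import math


def knn_sets(points, k):
    """For each cell, the set of its k nearest neighbors by Euclidean distance."""
    neighbors = {}
    for name, coords in points.items():
        others = sorted(
            (other for other in points if other != name),
            key=lambda other: math.dist(coords, points[other]),
        )
        neighbors[name] = set(others[:k])
    return neighbors


def jaccard(a, b):
    """Shared fraction of two neighbor sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical cells with coordinates on the first two PCs.
pca = {"c1": (0.0, 0.0), "c2": (0.1, 0.0), "c3": (0.0, 0.1),
       "c4": (5.0, 5.0), "c5": (5.1, 5.0), "c6": (5.0, 5.1)}
nn = knn_sets(pca, k=2)
weight_within = jaccard(nn["c2"] | {"c2"}, nn["c3"] | {"c3"})
weight_across = jaccard(nn["c1"] | {"c1"}, nn["c4"] | {"c4"})
```

Cells from the same tight group share their whole neighborhood and get the maximum weight, while cells from different groups share nothing, which is what lets the later partitioning step separate them.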


FindClusters parameters

The FindClusters function implements the procedure, and contains a resolution parameter that sets the ‘granularity’ of the downstream clustering, with increased values leading to a greater number of clusters. We find that setting this parameter between 0.6 and 1.2 typically returns good results for single cell datasets of around 3K cells. Optimal resolution often increases for larger datasets. The clusters are saved in the object@ident slot.
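To see why a higher resolution favors more clusters, consider a toy graph scored with a resolution-weighted modularity. The formula below is one common formulation and is assumed here for illustration; it is not taken from the FindClusters source. The function name `modularity` and the graph are hypothetical.

```python
def modularity(edges, cluster_of, resolution=1.0):
    """Modularity of a partition of an undirected, unweighted graph.

    Q = sum over clusters of  e_c / m  -  resolution * (d_c / (2 m)) ** 2
    where e_c is the number of edges inside cluster c, d_c the summed degree
    of its nodes, and m the total number of edges.
    """
    m = len(edges)
    inside, degree = {}, {}
    for a, b in edges:
        for node in (a, b):
            c = cluster_of[node]
            degree[c] = degree.get(c, 0) + 1
        if cluster_of[a] == cluster_of[b]:
            inside[cluster_of[a]] = inside.get(cluster_of[a], 0) + 1
    return sum(inside.get(c, 0) / m - resolution * (d / (2 * m)) ** 2
               for c, d in degree.items())


# Two triangles joined by a single bridge edge (2-3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
one_cluster = {n: "A" for n in range(6)}
two_clusters = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
```

At a resolution of 0.1 the single-cluster partition scores higher, while at 1.0 the two-cluster partition wins, so an optimizer that maximizes this score splits the graph further as the resolution increases.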

Clustering Algorithm Output:



Run Non-linear dimensional reduction


Parameters

Seurat continues to use tSNE as a powerful tool to visualize and explore these datasets. While we no longer advise clustering directly on tSNE components, cells within the graph-based clusters determined above should co-localize on the tSNE plot. This is because the tSNE aims to place cells with similar local neighborhoods in high-dimensional space together in low-dimensional space. As input to the tSNE, we suggest using the same PCs as input to the clustering analysis, although computing the tSNE based on scaled gene expression is also supported using the genes.use argument.


TSNE Plot



Once the reduction has finished running, you can also view/download the cells in each cluster.

Find Cells in Clusters:

UMAP Plot




Download R Object/Script


You can save the object at this point so that it can easily be loaded back in R for further analysis & exploration without having to rerun the computationally intensive steps performed above, or easily shared with collaborators.

It is also recommended that you keep it as a reference.


Generate and Download the R script to reproduce these steps in R/RStudio

Please note that you need to edit the data file(s)/directory path in the script before you run it in R/RStudio.

Finding differentially expressed genes (cluster biomarkers)


Seurat can help you find markers that define clusters via differential expression. By default, it identifies positive and negative markers of a single cluster (specified in ident.1), compared to all other cells. FindAllMarkers automates this process for all clusters, but you can also test groups of clusters vs. each other, or against all cells.

The min.pct argument requires a gene to be detected at a minimum percentage in either of the two groups of cells, and the thresh.test argument requires a gene to be differentially expressed (on average) by some amount between the two groups. You can set both of these to 0, but with a dramatic increase in time, since this will test a large number of genes that are unlikely to be highly discriminatory. As another option to speed up these computations, max.cells.per.ident can be set. This will downsample each identity class to have no more cells than the value it is set to. While there is generally going to be a loss in power, the speed increases can be significant and the most highly differentially expressed genes will likely still rise to the top.
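The effect of these settings can be sketched in Python. The pre-filter below skips genes that are rarely detected or that barely differ between the two groups, and the downsampling step caps the number of cells per identity class. The default values, the use of a plain difference of means, and the function names are assumptions made for illustration, not Seurat's implementation.

```python
import random


def passes_prefilter(group1, group2, min_pct=0.25, min_diff=0.25):
    """Decide whether a gene is worth testing between two groups of cells."""
    pct1 = sum(v > 0 for v in group1) / len(group1)
    pct2 = sum(v > 0 for v in group2) / len(group2)
    mean_diff = abs(sum(group1) / len(group1) - sum(group2) / len(group2))
    # Detected often enough in either group, and different enough on average.
    return max(pct1, pct2) >= min_pct and mean_diff >= min_diff


def downsample(cells, max_cells, seed=0):
    """Keep at most max_cells cells from one identity class."""
    if len(cells) <= max_cells:
        return list(cells)
    return random.Random(seed).sample(list(cells), max_cells)


marker = passes_prefilter([2.0, 1.5, 0.0, 2.2], [0.0, 0.0, 0.1, 0.0])
rare_gene = passes_prefilter([0.0] * 9 + [1.0], [0.0] * 10)
subset = downsample([f"cell{i}" for i in range(500)], max_cells=100)
```

Setting both thresholds to 0 would let every gene through, including `rare_gene`, which is why the test then takes much longer.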

Find All Markers:

UCSC Cell Browser (Optional)

Use this cell browser to explore data visually:

1) Generate the cell browser data

2) Launch the browser in a new tab once data is generated

You need to find all markers first!


Select clusters to find markers:

Heatmap:


You need to find all markers first!


Visualizing Marker Expression:

Note: make sure to find cluster markers first, as the list of genes you can plot comes from the output of that step.

Vln Plot:

Genes to plot:


Note: make sure to find cluster markers first, as the list of genes you can plot comes from the output of that step.

Feature Plot:

Genes to plot:
