Quick Start Guide
Differential expression (DE) analysis has become an increasingly popular tool for identifying and viewing up- and/or down-regulated genes between two sets of samples.
The goal of DEBrowser is to provide an easy way to perform and visualize DE analysis.
1. Data input
The input data for DEBrowser consists of 2 files in '.txt', '.csv' or '.tsv' format, referred to as the 'Count Data File' and the 'Metadata File'.
1.1 Count Data File
This input file should contain the summarized count results of all samples in the experiment. An example of the expected input data format is presented below:
Columns are samples and rows are the mapped genomic features (e.g. genes, isoforms, miRNAs, ATAC or ChIP regions).
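A minimal sketch of such a layout, assuming a tab-separated file whose first column holds the feature names. The gene names, sample names and counts are invented for illustration, and `read_count_table` is a hypothetical helper, not part of DEBrowser.

```python
import csv
import io

# Hypothetical count table: one header row of sample names,
# then one row per genomic feature. All values are made up.
EXAMPLE_TSV = (
    "gene\tcontrol_rep1\tcontrol_rep2\ttreat_rep1\ttreat_rep2\n"
    "GeneA\t10\t12\t35\t40\n"
    "GeneB\t0\t1\t0\t2\n"
    "GeneC\t150\t160\t149\t171\n"
)

def read_count_table(text, delimiter="\t"):
    """Parse a delimited count table into (sample_names, {feature: [counts]})."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    header, body = rows[0], rows[1:]
    samples = header[1:]  # first column is the feature name (assumed)
    counts = {}
    for row in body:
        if len(row) != len(header):
            raise ValueError(
                f"row {row[0]!r} has {len(row)} fields, expected {len(header)}"
            )
        counts[row[0]] = [float(value) for value in row[1:]]
    return samples, counts

samples, counts = read_count_table(EXAMPLE_TSV)
```

Here `samples` holds the column names and `counts` maps each feature to one value per sample, which mirrors the "columns are samples, rows are features" layout.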
If you do not have a dataset to upload, you can use the built-in demo data files, taken from two different publications, by clicking one of the 'Load Demo' buttons. To view an entire demo data file, you can download demo set 1 (Vernia et al.). For another example, try our full dataset for demo set 1 (Vernia et al.).
1.2 Metadata File
In addition to the count data file, you may also upload a metadata file to correct for batch effects or any other confounding conditions that might be present in your results. To handle these conditions, create a metadata file using the example table below,
or download the sample metadata file: metadata file for set 1 (Vernia et al.).
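As a rough sketch of what a metadata file could look like: the `condition` and `batch` column names follow the sample metadata described in this guide, while the `sample` column name, the sample names and the helper functions are assumptions made for illustration.

```python
import csv
import io

# "condition" and "batch" follow the guide's sample metadata;
# "sample" is an assumed name for the column that identifies each sample.
FIELDS = ["sample", "condition", "batch"]

# Hypothetical rows, one per column of the count data file.
METADATA_ROWS = [
    {"sample": "control_rep1", "condition": "control", "batch": "1"},
    {"sample": "control_rep2", "condition": "control", "batch": "2"},
    {"sample": "treat_rep1", "condition": "treat", "batch": "1"},
    {"sample": "treat_rep2", "condition": "treat", "batch": "2"},
]

def metadata_to_tsv(rows):
    """Serialize metadata rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

def samples_missing_metadata(count_samples, rows):
    """Return count-table samples that have no metadata row."""
    described = {row["sample"] for row in rows}
    return [name for name in count_samples if name not in described]

metadata_tsv = metadata_to_tsv(METADATA_ROWS)
```

Checking that every sample in the count table has a metadata row before uploading can catch naming mismatches early.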
To upload the data, press the button on the data upload page.
After a successful upload, you should see a summary of your data in the 'Upload Summary' section. To move to the filtering section, click the button on the upload page.
If you are ready to use DEBrowser, click the 'Upload' menu on the left to get started.
2. Data Assessment
2.1 Low Count Filtering
In this section, you can visualise the changes in your dataset while filtering out low count genes. Choose your filtering criteria from the Filtering Methods box, which is located in the center of the screen. Three methods are available:
Max: Filters out genes whose maximum count across all samples is less than the defined threshold.
Mean: Filters out genes whose mean count across all samples is less than the defined threshold.
CPM: First, counts per million (CPM) are calculated as the raw counts divided by the library sizes and multiplied by one million. Genes are then filtered out when at least the defined number of samples have a CPM below the defined threshold.
The expression cutoff value is determined according to the library size and normalization factors with the formula $$\text{CPM} = \frac{\text{raw counts}}{\text{library size} \times \text{normalization factors}} \times 10^{6}$$ For example, if the cutoff CPM value is 10 and the library size and normalization factors are estimated to be approximately \(3 \times 10^6\) and 1 for at least 4 samples, then the 10 CPM expression cutoff corresponds to about 30 read counts. Therefore, in this example, features that have fewer than 30 read counts (10 CPM) in more than 4 samples are treated as low-expression features and are removed before batch effect correction and DE analysis.
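The three filtering rules and the CPM arithmetic can be sketched as follows. This is an illustration under stated assumptions, not DEBrowser's implementation: the function names and toy counts are hypothetical, normalization factors default to 1, and the CPM rule is read as "remove a feature when at least `min_low_samples` samples fall below the CPM threshold".

```python
def counts_per_million(count, library_size, norm_factor=1.0):
    """CPM = raw count / (library size * normalization factor) * 10^6."""
    return count * 1e6 / (library_size * norm_factor)

def filter_max(counts, threshold):
    """Keep features whose maximum count across samples reaches the threshold."""
    return {g: v for g, v in counts.items() if max(v) >= threshold}

def filter_mean(counts, threshold):
    """Keep features whose mean count across samples reaches the threshold."""
    return {g: v for g, v in counts.items() if sum(v) / len(v) >= threshold}

def filter_cpm(counts, library_sizes, cpm_threshold, min_low_samples):
    """Drop features with at least `min_low_samples` samples below the CPM cutoff.

    This is one reading of the CPM rule (an assumption of this sketch).
    """
    kept = {}
    for gene, values in counts.items():
        low = sum(
            1
            for count, size in zip(values, library_sizes)
            if counts_per_million(count, size) < cpm_threshold
        )
        if low < min_low_samples:
            kept[gene] = values
    return kept

# Toy data: six samples, each with an assumed library size of 3 million reads,
# so a cutoff of 10 CPM corresponds to 30 raw reads.
LIBRARY_SIZES = [3_000_000] * 6
COUNTS = {
    "GeneA": [45, 50, 60, 31, 40, 35],        # never below 30 reads
    "GeneB": [5, 10, 29, 12, 100, 90],        # below 30 reads in 4 samples
    "GeneC": [29, 28, 300, 310, 305, 320],    # below 30 reads in 2 samples
    "GeneD": [0, 1, 2, 0, 3, 1],              # low everywhere
}

kept_by_max = filter_max(COUNTS, 10)
kept_by_mean = filter_mean(COUNTS, 10)
kept_by_cpm = filter_cpm(COUNTS, LIBRARY_SIZES, cpm_threshold=10, min_low_samples=4)
```

With these toy values, the Max and Mean filters at a threshold of 10 remove only GeneD, while the CPM filter also removes GeneB because four of its samples fall below 10 CPM.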
To filter out the low expression counts, press the button on the data filtering page.
2.2 Quality control (QC)
After filtering low count features, you may continue your analysis with Batch Effect Detection & Correction, jump directly to differential expression analysis, or view quality control (QC) information for your dataset.
If you specified a metadata file containing your treatment and batch fields, clicking the button gives you the option to draw PCA, interquartile range (IQR) and density plots to assess whether the data requires batch effect correction.
If you want to skip the batch effect assessment and correction step, you can click either the button to perform DE analysis or the button for QC plots to draw PCA, all2all scatter, heatmap, IQR and density plots.
3. Data Preparation
If your metadata file contains batch correction fields, you have the option to conduct batch effect correction prior to your analysis. By adjusting the parameters in the Options box, you can investigate the character of your dataset. The parameters of the Options box are explained below:
Normalization Method: DEBrowser allows you to normalize the data prior to batch effect correction. You may choose a normalization method (among MRE, TMM, RLE, upperquartile), or select none if you do not want to normalize your data.
Correction Method: DEBrowser uses ComBat (part of the sva Bioconductor package) or Harman to adjust for possible batch effects or conditional biases. For more information, you can visit the following links for documentation: ComBat, Harman
Treatment: Select the column in the metadata file that specifies the comparison, such as cancer vs control. It is named condition in our sample metadata.
Batch: Select the column in the metadata file that differentiates the batches. For example, in our metadata it is called batch. Upon clicking the submit button, comparison tables and plots will be created on the right part of the screen as shown below.
You can investigate the changes in the data by comparing the following features:
1. Read counts for each sample.
2. PCA, IQR and density plots of the dataset.
3. Gene/region vs samples data.
After batch effect correction, you can click either the button to perform DE analysis or the button for QC plots to draw PCA, all2all scatter, heatmap, IQR and density plots.
4. DE analysis
The goal of differential gene expression analysis is to find genes or transcripts whose difference in expression, when accounting for the variance within condition, is higher than expected by chance.
DESeq2 is an R package available via Bioconductor and is designed to normalize count data from high-throughput sequencing assays such as RNA-Seq and test for differential expression (Love et al., 2014). With multiple parameters such as padjust values, log fold changes, plot styles, and so on, altering plots created with your DE data can be a hassle as well as time consuming. The Differential Expression Browser uses DESeq2 (Love et al., 2014), edgeR (Robinson et al., 2010), and limma (Ritchie et al., 2015) coupled with shiny (Chang, W. et al., 2016) to produce real-time changes within your plot queries and allow interactive browsing of your DE results. In addition to DE analysis, DEBrowser also offers a variety of other plots and analysis tools to help you visualize your data even further.
If you are ready to discover and visualize your data, click the button in the DE Results section.
4.1 Parameters used for DESeq2
fitType:
Either 'parametric', 'local', or 'mean' for the type of fitting of dispersions to the mean intensity. See estimateDispersions for description.
betaPrior:
Whether or not to put a zero-mean normal prior on the non-intercept coefficients. See nbinomWaldTest for a description of the calculation of the beta prior. By default, the beta prior is used only for the Wald test, but it can also be specified for the likelihood ratio test.
testType:
Either 'Wald' or 'LRT', which will then use either Wald significance tests (defined by nbinomWaldTest), or the likelihood ratio test on the difference in deviance between a full and reduced model formula (defined by nbinomLRT).
4.2 Parameters used for edgeR
Normalization:
Calculate normalization factors to scale the raw library sizes. Values can be 'TMM','RLE','upperquartile','none'.
Dispersion:
Either a numeric vector of dispersions or a character string indicating that dispersions should be taken from the data object.
testType:
exactTest or glmLRT. exactTest: Computes genewise exact tests for differences in the means between two groups of negative-binomially distributed counts. glmLRT: Fits a negative binomial generalized log-linear model to the read counts for each gene and conducts genewise statistical tests.
4.3 Parameters used for limma
Normalization:
Calculate normalization factors to scale the raw library sizes. Values can be 'TMM','RLE','upperquartile','none'.
Fit Type:
Fitting method: 'ls' for least squares or 'robust' for robust regression.
Norm. Bet. Arrays:
Normalization Between Arrays; Normalizes expression intensities so that the intensities or log-ratios have similar distributions across a set of arrays.
5. Frequently asked questions (FAQ)
5.1 Why un-normalized counts?
DESeq2 requires count data as input, obtained from RNA-Seq or another high-throughput sequencing experiment, in the form of a matrix of values. DEBrowser converts non-integer values to integers so that DESeq2 can run. The matrix values should be un-normalized, since the DESeq2 model internally corrects for library size. Transformed or normalized values, such as counts scaled by library size, should therefore not be used as input. Please use edgeR or limma for normalized counts.
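The integer conversion can be pictured with a short sketch. It assumes rounding to the nearest integer, which may differ from the conversion DEBrowser actually applies; `to_integer_counts` and `share_of_non_integers` are hypothetical helpers, and the values are invented.

```python
def to_integer_counts(counts):
    """Round every value to the nearest integer (assumed conversion rule)."""
    return {
        gene: [int(round(value)) for value in values]
        for gene, values in counts.items()
    }

def share_of_non_integers(counts):
    """Fraction of values with a fractional part.

    A high share suggests the table may already be normalized or transformed.
    """
    values = [value for row in counts.values() for value in row]
    fractional = sum(1 for value in values if value != int(value))
    return fractional / len(values)

# Hypothetical expected counts, e.g. from a quantifier that reports fractions.
EXPECTED_COUNTS = {
    "GeneA": [10.6, 12.0, 35.2, 40.0],
    "GeneB": [0.0, 1.3, 0.0, 2.0],
}

integer_counts = to_integer_counts(EXPECTED_COUNTS)
non_integer_share = share_of_non_integers(EXPECTED_COUNTS)
```

Inspecting the share of fractional values before upload is one quick way to notice that a file holds normalized values rather than raw counts.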
5.2 Why am I getting error while uploading files?
* DEBrowser supports tab, comma or semicolon separated files. However, spaces or non-numeric characters in numeric fields are not supported and cause an error during upload. It is crucial to remove such instances from the files before uploading them.
* Another cause of errors is using the same gene name multiple times. This may occur after opening files in programs such as Excel, which tend to automatically convert some gene names to dates (e.g. SEP9 to SEP.09.2018). This leads to numerous problems, so you need to disable this kind of automatic conversion before opening files in such programs.
* Some files contain both tabs and spaces as delimiters, which leads to an error. Such files need to be cleaned before loading.
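The checks above can be approximated before uploading with a small script. This is a sketch under assumptions: the file is tab-separated with feature names in the first column, the date patterns are illustrative heuristics rather than an exhaustive list, and `find_upload_problems` is a hypothetical helper.

```python
import re
from collections import Counter

# Illustrative patterns for names that a spreadsheet may have turned into dates,
# e.g. "9-Sep" or "SEP.09.2018". Not exhaustive.
DATE_LIKE = re.compile(r"^(\d{1,2}-[A-Za-z]{3}|[A-Za-z]{3}\.\d{2}\.\d{4})$")

def is_clean_number(value):
    """True if the cell is a number with no surrounding or embedded spaces."""
    if value != value.strip() or " " in value:
        return False
    try:
        float(value)
    except ValueError:
        return False
    return True

def find_upload_problems(text, delimiter="\t"):
    """Return human-readable descriptions of likely upload problems."""
    problems = []
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split(delimiter)
    names = []
    for number, line in enumerate(lines[1:], start=2):
        fields = line.split(delimiter)
        names.append(fields[0])
        if len(fields) != len(header):
            problems.append(
                f"line {number}: {len(fields)} fields, expected {len(header)}"
            )
        for value in fields[1:]:
            if not is_clean_number(value):
                problems.append(f"line {number}: non-numeric value {value!r}")
        if DATE_LIKE.match(fields[0]):
            problems.append(f"line {number}: date-like name {fields[0]!r}")
    for name, occurrences in Counter(names).items():
        if occurrences > 1:
            problems.append(f"feature name {name!r} appears {occurrences} times")
    return problems

# Hypothetical file with one problem of each kind.
BAD_FILE = (
    "gene\ts1\ts2\n"
    "SEP.09.2018\t5\t7\n"
    "GeneA\t1 2\t3\n"
    "GeneA\t4\t5\n"
    "GeneB\t6\tx7\n"
)
GOOD_FILE = "gene\ts1\ts2\nGeneA\t1\t3\nGeneB\t6\t7\n"

bad_problems = find_upload_problems(BAD_FILE)
good_problems = find_upload_problems(GOOD_FILE)
```

A file that mixes tabs and spaces as delimiters typically shows up here as a field-count mismatch or as a cell containing an embedded space.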
5.3 Why did some columns not show up after upload?
If one of your columns contains a space or a non-numeric character in a numeric field, either the column will be dropped or you will get an error. It is therefore crucial to remove such instances from your files before uploading.
5.4 Why am I getting an error while uploading CSV/TSV files exported from Excel?
* You might be getting an error because the same gene name is used multiple times. This may occur after opening files in programs such as Excel, which tend to automatically convert some gene names to dates (e.g. SEP9 to SEP.09.2018). You therefore need to disable this kind of automatic conversion before opening files in such programs.
5.5 Why can't I see all the background data in Main Plots?
To improve performance, only 10% of the non-significant (NS) genes are used to generate plots by default. We strongly suggest that you use all of the NS genes in your plots when publishing your results. You can change this parameter by clicking the **Main Options** button and setting Background Data(%) to 100% on the left sidebar.
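The effect of the Background Data(%) setting can be pictured with a short sketch. It assumes that NS genes are drawn by simple random sampling and that all significant genes are always kept; DEBrowser's actual selection may work differently, and `genes_to_plot` is a hypothetical helper.

```python
import random

def genes_to_plot(significant, non_significant, background_percent=10, seed=0):
    """Keep all significant genes plus a random share of the NS genes.

    Random sampling is an assumption of this sketch.
    """
    sample_size = int(len(non_significant) * background_percent / 100)
    rng = random.Random(seed)  # fixed seed so the plot is reproducible
    background = rng.sample(non_significant, sample_size)
    return list(significant) + background

# Hypothetical gene lists.
SIGNIFICANT = [f"sig{i}" for i in range(5)]
NON_SIGNIFICANT = [f"ns{i}" for i in range(100)]

default_view = genes_to_plot(SIGNIFICANT, NON_SIGNIFICANT)
full_view = genes_to_plot(SIGNIFICANT, NON_SIGNIFICANT, background_percent=100)
```

With the default of 10%, the toy plot would show 15 points; at 100% it shows all 105, which is the setting suggested for published figures.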
5.6 Why am I getting an error when I click on DE Genes in GO Term Analysis?
To start GO Term analysis, it is important to select the correct organism in the Choose an organism field. After selecting the other desired parameters, click the Submit button to run the GO Term analysis. You will then be able to see the categories related to your selected gene list in the Table tab. Once you select a category, you can click the DE Genes button to see the gene list for that category.
5.7 How do I download selected data from Main Plots/QC Plots/Heatmaps?
First, set the Choose dataset field to selected under Data Options in the left sidebar. When you select this option, a new field, The plot used in selection, appears under the Choose dataset field. Specify the plot you are interested in from the following options: Main plot, Main Heatmap, QC Heatmap. Finally, click the Download Data button to download the data or, if you wish to view the selected data, click the Tables tab.