---
title: "Gene Expression Analysis with `myTAI`"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Gene Expression Analysis with `myTAI`}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}
---

## Introduction

In the [Introduction](Introduction.html) vignette we introduced and discussed how phylotranscriptomics can be applied to capture evolutionary signals in (developmental) transcriptomes. Furthermore, in the [Enrichment Analyses](Enrichment.html) vignette we provide a use case to correlate specific groups or sets of genes with their predicted evolutionary origin. Here, we aim to combine previously introduced techniques with _classic_ gene expression analyses to detect possible functional causes for the observed transcriptome conservation.

In other words, phylotranscriptomics allows us to detect stages or periods of evolutionary conservation and is able to predict the evolutionary origin of process or trait specific genes based on enrichment analyses. By combining evolutionary enrichment analyses with the functional annotation of process or trait specific genes (see [Functional Annotation](https://github.com/ropensci/biomartr/blob/master/vignettes/Functional_Annotation.Rmd) for details) the detection of evolutionary signals can be correlated with functional processes. Then, performing gene expression analyses on corresponding process or trait specific genes allows users to detect potential causes of stage/period specific evolutionary transcriptome conservation.

The following sections introduce the main gene expression data analysis techniques implemented in `myTAI`:

- Detection of Differentially Expressed Genes (DEGs)
    + Fold-Change
    + Welch t-test
    + Wilcoxon Rank Sum Test (Mann-Whitney U test)
    + Negative Binomial (Exact Tests)
- Collapsing Replicate Samples
- Filter for Expressed Genes
- Compute the Statistical Significance of Each Replicate Combination

## Detection of Differentially Expressed Genes (DEGs)

A variety of methods have been published to detect differentially expressed genes. Some methods are based on non-statistical quantification of expression differences (e.g. fold-change and log-fold-change), but most methods are based on statistical tests to quantify the significance of differences in gene expression between samples. These statistical methods can furthermore be divided into two methodological categories: parametric tests and non-parametric tests. The `DiffGenes()` function available in `myTAI` implements the most popular and useful methods to detect differentially expressed genes. In the literature, different methods have been introduced and discussed for microarray technologies versus RNA-Seq technologies. In this section we introduce all methods implemented in `DiffGenes()` using small examples and discuss the published advantages and disadvantages of each method and each mRNA quantification technology.

__Note that when using `DiffGenes()` it is assumed that your input dataset has been normalized before passing it to `DiffGenes()`.
For RNA-Seq data `DiffGenes()` assumes that the libraries have been normalized to have the same size, i.e., to have the same expected column sum under the null hypothesis (or the `lib.size` argument in `DiffGenes()` is specified accordingly).__

## Fold-Changes

A fold change in gene expression is simply the ratio of the gene expression level of one sample against a second sample: $\frac{e_{i1}}{e_{i2}}$, where $e_{i1}$ is the expression level of gene $i$ in sample one and $e_{i2}$ is the expression level of gene $i$ in sample two.

In case replicate expression levels are present for each sample the ratio of means of the corresponding replicates is computed: $\frac{\bar{e}_{i1}}{\bar{e}_{i2}}$, where $\bar{e}_{i1}$ is the mean of replicate expression levels of gene $i$ in sample one and $\bar{e}_{i2}$ is the mean of replicate expression levels of gene $i$ in sample two.

* __Advantages:__ Given a small number of replicate values, the statistical evaluation of differentially expressed genes might be biased (depending on the statistical test chosen), either because the distributional assumptions of the test are not fulfilled or because a small number of replicate values is not sufficient to perform non-parametric tests. Here, fold-changes provide a simple way to quantify gene expression differences between samples by $n$-fold enrichment. In our opinion, although the process of choosing a threshold for defining genes as being differentially expressed or not based on fold-change values is purely subjective and relies on common sense, in some cases this procedure will be more suitable than defining differentially expressed genes based on p-values obtained from a test statistic with violated test assumptions.
* __Disadvantages:__ If used appropriately, statistical tests not only systematically quantify the significance of the observed gene-by-gene differences of expression, but also account for the variance of replicate expression levels when comparing the mean difference of replicate expression levels between samples. Hence, the gene specific variance between replicates is reflected in the p-value returned by the test statistic, whereas it is not captured by a simple fold-change measure.

### Example: Fold-Change

For the following example we assume that `PhyloExpressionSetExample[1:5,1:8]` stores 5 genes and 3 developmental stages with 2 replicate expression levels per stage.

```{r, eval=FALSE}
data("PhyloExpressionSetExample")

# Detection of DEGs using the fold-change measure
DEGs <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
                  nrep          = 2,
                  method        = "foldchange",
                  stage.names   = c("S1","S2","S3"))

head(DEGs)
```

```
  Phylostratum      GeneID    S1->S2    S1->S3    S2->S1    S2->S3    S3->S1    S3->S2
1            1 at1g01040.2 1.6713881 2.0806706 0.5983051 1.2448758 0.4806143 0.8032930
2            1 at1g01050.1 1.0273222 1.2709185 0.9734045 1.2371177 0.7868325 0.8083305
3            1 at1g01070.1 1.3087379 1.4044799 0.7640949 1.0731560 0.7120073 0.9318310
4            1 at1g01080.2 0.7779572 0.7286769 1.2854177 0.9366542 1.3723503 1.0676299
5            1 at1g01090.1 0.3803866 0.2288961 2.6289042 0.6017460 4.3687939 1.6618307
```

The resulting output shows all combinations of fold-changes between samples (developmental stages). Here, `S1->S2` denotes that the fold-change was computed for expression levels of stage `S1` against stage `S2`.

### Example: Log-Fold-Change

__When selecting `method = "log-foldchange"` it is assumed that the input `ExpressionSet` stores `log2` expression levels.
Here, we transform absolute expression levels stored in `PhyloExpressionSetExample` to `log2` expression levels using the `tf()` function before log-fold-changes are computed.__

```{r, eval=FALSE}
data("PhyloExpressionSetExample")

# Detection of DEGs using the log-fold-change measure
log.DEGs <- DiffGenes(ExpressionSet = tf(PhyloExpressionSetExample[1:5,1:8],log2),
                      nrep          = 2,
                      method        = "log-foldchange",
                      stage.names   = c("S1","S2","S3"))

head(log.DEGs)
```

```
  Phylostratum      GeneID      S1->S2     S1->S3      S2->S1      S2->S3     S3->S1      S3->S2
1            1 at1g01040.2  0.74104679  1.0570486 -0.74104679  0.31600182 -1.0570486 -0.31600182
2            1 at1g01050.1  0.03888868  0.3458715 -0.03888868  0.30698280 -0.3458715 -0.30698280
3            1 at1g01070.1  0.38817621  0.4900360 -0.38817621  0.10185975 -0.4900360 -0.10185975
4            1 at1g01080.2 -0.36223724 -0.4566488  0.36223724 -0.09441158  0.4566488  0.09441158
5            1 at1g01090.1 -1.39446159 -2.1272350  1.39446159 -0.73277345  2.1272350  0.73277345
```

The resulting output stores all combinations of log fold-changes between samples (developmental stages).

## Welch t-test

The `Welch t-test` is a parametric test to statistically quantify the difference of sample means in cases where the assumption of homogeneity of variance (equal variances in the two populations) is violated (Boslaugh, 2013). The `Welch t-test` is a suitable parametric test for small sample sizes and thus has been used to detect differentially expressed genes based on p-values returned by the test statistic (Hahne et al., 2008).

In detail, the test statistic is computed as follows:

$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$

where $\bar{x}_1$ and $\bar{x}_2$ are the sample means, $s_1^2$ and $s_2^2$ are the sample variances, and $n_1$ and $n_2$ are the sample sizes.
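
As a quick numerical check of the statistic defined above, the following minimal sketch computes $t$ by hand in base R and compares it with `stats::t.test()`, which performs the Welch variant by default. The replicate values are those printed for gene `at1g01040.2` in the example data of this vignette; the variable names are illustrative only, and this is not how `DiffGenes()` is implemented internally.

```r
# log2 replicate expression levels of one gene in two stages
x1 <- log2(c(2173.635, 1911.2001))  # stage 1, two replicates
x2 <- log2(c(1152.555, 1291.4224))  # stage 2, two replicates

# Welch t statistic: difference of means divided by the unpooled standard error
welch_t <- (mean(x1) - mean(x2)) /
  sqrt(var(x1) / length(x1) + var(x2) / length(x2))

# t.test() defaults to var.equal = FALSE, i.e. the Welch t-test
reference_t <- unname(t.test(x1, x2)$statistic)

# TRUE when the hand-computed statistic agrees with t.test()
all.equal(welch_t, reference_t)
```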

The degrees of freedom for Welch's t-test are then computed as follows:

$df = \frac{\big(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\big)^2}{\frac{s_1^4}{n_1^2 (n_1 - 1)} + \frac{s_2^4}{n_2^2 (n_2 - 1)}}$

To perform a valid `Welch t-test`, the following assumptions about the input data need to be fulfilled to test whether two samples come from populations with equal means:

__Assumptions about input data__

* independent samples
* continuous data
* (approximate) normality

In most cases `log2` expression levels are used to perform the `Welch t-test`, on the assumption that expression levels are log-normally distributed so that their logarithms are approximately normal. However, the small number of replicates is usually not sufficient to fulfill (or verify) the approximate normality assumption of the `Welch t-test`. Due to this fact, non-parametric, sampling based, or generalized linear model based methods have been proposed to quantify p-values of differential expression. Nevertheless, the `DiffGenes()` function implements the `Welch t-test` for the detection of differentially expressed genes, allowing users to compare the results with more recent DEG detection methods also implemented in `DiffGenes()`.

* __Advantages:__
    + DEG detection based on statistical quantification
    + Parametric test resulting in a strong test statistic
    + Can handle small sample sizes

* __Disadvantages:__
    + Test assumptions must be fulfilled to return valid p-values
    + Can hardly assure normality with very small sample sizes of $n = 3,4,5,..$ (replicates)
    + Pairwise comparisons between different stages or experiments

### Example: Welch t-test

Performing a `Welch t-test` with `DiffGenes()` can be done by specifying `method = "t.test"`. Internally `DiffGenes()` performs a two-sided `Welch t-test`.
This means that the `Welch t-test` quantifies only whether or not a gene is significantly differentially expressed, but not the direction of enrichment (over-expressed or under-expressed).

The `PhyloExpressionSetExample` we use in the following example stores absolute expression levels. In case your `ExpressionSet` also stores absolute expression levels (which is likely due to the `ExpressionSet` standard for phylotranscriptomics analyses), you can use the `tf()` function implemented in `myTAI` to transform absolute expression levels to `log2` expression levels before performing `DiffGenes()` with a `Welch t-test`, e.g. `tf(PhyloExpressionSetExample[1:5,1:8],log2)`. In general, using `log2` transformed expression levels as input `ExpressionSet` of `DiffGenes()` allows us to (at least) assume that the samples (replicate expression levels) used to perform the `Welch t-test` are log-normally distributed and therefore approximately normally distributed after the transformation.

Please notice, however, that RNA-Seq data can include count values of 0. When transforming absolute counts to `log2` counts, values of `log2(0) = -Inf` will be produced and p-value computations will not be possible. To avoid this case you could either remove RNA-Seq count values of 0 from the input dataset using the `Expressed()` function (see section _Filter for Expressed Genes_), e.g. pass `tf(Expressed(PhyloExpressionSetExample[1:5,1:8], cut.off = 1),log2)` as `ExpressionSet` argument to `DiffGenes()`, or shift all count values by a constant value, e.g. pass `tf(PhyloExpressionSetExample[1:5,1:8], function(x) log2(x + 1))` as `ExpressionSet` argument to `DiffGenes()`. Internally, `DiffGenes()` will also check for 0 values in input data and will automatically shift all expression levels by `+1` in case 0 values are included.
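
The zero-count problem described above is easy to reproduce in base R. The following minimal sketch uses a made-up count vector (an assumption for illustration, not part of the example data) to show why a pseudocount of 1 keeps all values finite.

```r
# hypothetical raw counts for one gene, including a zero
counts <- c(0, 1, 7, 1023)

# plain log2 turns the zero into -Inf, which breaks downstream statistics
plain <- log2(counts)
is.finite(plain)

# shifting by a pseudocount of 1 keeps every value finite
shifted <- log2(counts + 1)
shifted
```

The same shift can be passed to `tf()` as an anonymous function, as shown in the paragraph above.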

```{r, eval=FALSE}
data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by a Welch t-test
ttest.DEGs <- DiffGenes(ExpressionSet = tf(PhyloExpressionSetExample[1:5,1:8],log2),
                        nrep          = 2,
                        method        = "t.test",
                        stage.names   = c("S1","S2","S3"))

# look at the results
ttest.DEGs
```

```
  Phylostratum      GeneID     S1<->S2    S1<->S3    S2<->S3
1            1 at1g01040.2 0.027832572 0.04020203 0.13481563
2            1 at1g01050.1 0.852379466 0.31471871 0.36326955
3            1 at1g01070.1 0.003200692 0.00113536 0.02236621
4            1 at1g01080.2 0.086426813 0.03092924 0.45999438
5            1 at1g01090.1 0.090387087 0.04638872 0.04978092
```

The resulting `data.frame` stores the p-values of stage-wise comparisons for each gene.

To adjust p-values for multiple testing you can specify the `p.adjust.method` argument with one of the p-value adjustment methods implemented in `DiffGenes()`. Correcting for multiple testing allows users to choose appropriate selection cut-offs for the p-values that define differential expression. Hahne et al., 2008 (p. 87) give a nice example of correcting for multiple testing to determine appropriate selection cut-offs.

Please consult the documentation of `?p.adjust` to see which p-value adjustment methods are implemented in `DiffGenes()`. Please also consult these reviews ([Biostatistics Handbook](http://www.biostathandbook.com/multiplecomparisons.html), [Gelman et al., 2008](http://www.stat.columbia.edu/~gelman/research/published/multiple2f.pdf), and [Slides](http://www.gs.washington.edu/academics/courses/akey/56008/lecture/lecture10.pdf)) to decide whether or not to apply p-value adjustment to your own dataset.
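
To make the adjustment step concrete, the following minimal sketch applies base R's `p.adjust()` with `method = "BH"` to the five unadjusted `S1<->S2` p-values printed above. The rounded result agrees with the `S1<->S2` column of the BH-adjusted `DiffGenes()` output shown in this vignette, which suggests (but does not prove) that the adjustment is applied per stage comparison across genes.

```r
# unadjusted Welch t-test p-values of comparison S1<->S2 (copied from the table above)
p_raw <- c(at1g01040.2 = 0.027832572,
           at1g01050.1 = 0.852379466,
           at1g01070.1 = 0.003200692,
           at1g01080.2 = 0.086426813,
           at1g01090.1 = 0.090387087)

# Benjamini & Hochberg (1995) adjustment across the five genes
p_bh <- p.adjust(p_raw, method = "BH")

round(p_bh, 8)
```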

```{r, eval=FALSE}
data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by a Welch t-test
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
ttest.DEGs.p_adjust <- DiffGenes(ExpressionSet   = tf(PhyloExpressionSetExample[1:5,1:8],log2),
                                 nrep            = 2,
                                 method          = "t.test",
                                 p.adjust.method = "BH",
                                 stage.names     = c("S1","S2","S3"))

ttest.DEGs.p_adjust
```

```
  Phylostratum      GeneID    S1<->S2   S1<->S3   S2<->S3
1            1 at1g01040.2 0.06958143 0.0579859 0.2246927
2            1 at1g01050.1 0.85237947 0.3147187 0.4540869
3            1 at1g01070.1 0.01600346 0.0056768 0.1118311
4            1 at1g01080.2 0.11298386 0.0579859 0.4599944
5            1 at1g01090.1 0.11298386 0.0579859 0.1244523
```

The resulting p-value adjusted `data.frame` can be used to filter for differentially expressed genes. Here, specifying the arguments `comparison`, `alpha`, and `filter.method` in `DiffGenes()` allows users to obtain only significantly differentially expressed genes.

```{r,eval=FALSE}
# Detection of DEGs using the p-value returned by a Welch t-test
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
# and filter for significantly differentially expressed genes (alpha = 0.05)
ttest.DEGs.p_adjust.filtered <- DiffGenes(ExpressionSet   = tf(PhyloExpressionSetExample[1:10 ,1:8],log2),
                                          nrep            = 2,
                                          method          = "t.test",
                                          p.adjust.method = "BH",
                                          stage.names     = c("S1","S2","S3"),
                                          comparison      = "above",
                                          alpha           = 0.05,
                                          filter.method   = "n-set",
                                          n               = 1)

# look at the genes fulfilling the filter criteria
ttest.DEGs.p_adjust.filtered
```

```
  Phylostratum      GeneID    S1<->S2   S1<->S3   S2<->S3
3            1 at1g01070.1 0.03200692 0.0113536 0.2192432
```

In this example, only 1 out of 10 genes fulfills the p-value criteria (`alpha = 0.05`) in at least one stage comparison.
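
The filtering idea can be sketched in base R on the five-gene BH-adjusted table printed above. This is only an illustration under the assumption that a gene is kept when its adjusted p-value is at or below `alpha` in at least a required number of comparisons; the helper names `n_required` and `n_significant` are hypothetical, and the sketch does not reproduce the internal filtering code of `DiffGenes()`.

```r
# BH-adjusted p-values of the five-gene example (copied from the table above)
adj <- data.frame(
  GeneID    = c("at1g01040.2", "at1g01050.1", "at1g01070.1", "at1g01080.2", "at1g01090.1"),
  `S1<->S2` = c(0.06958143, 0.85237947, 0.01600346, 0.11298386, 0.11298386),
  `S1<->S3` = c(0.0579859, 0.3147187, 0.0056768, 0.0579859, 0.0579859),
  `S2<->S3` = c(0.2246927, 0.4540869, 0.1118311, 0.4599944, 0.1244523),
  check.names      = FALSE,
  stringsAsFactors = FALSE
)

alpha      <- 0.05  # significance threshold
n_required <- 1     # assumed: number of comparisons that must be significant

# count, per gene, the comparisons with an adjusted p-value at or below alpha
n_significant <- rowSums(adj[ , -1] <= alpha)

# keep only genes that are significant in at least n_required comparisons
significant_genes <- adj[n_significant >= n_required, ]
significant_genes
```

With these five genes only `at1g01070.1` remains, because every other adjusted p-value lies above 0.05.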

#### Rank top p-values

Finally, users can rank genes in increasing p-value order for each stage comparison by typing:

```{r,eval = FALSE}
ttest.DEGs.p_adjust <- DiffGenes(ExpressionSet   = tf(PhyloExpressionSetExample[1:500,1:8],log2),
                                 nrep            = 2,
                                 method          = "t.test",
                                 p.adjust.method = "BH",
                                 stage.names     = c("S1","S2","S3"))

head(ttest.DEGs.p_adjust[order(ttest.DEGs.p_adjust[ , "S1<->S2"], decreasing = FALSE) , 1:3])
```

```
    Phylostratum      GeneID  S1<->S2
54             1 at1g02400.1 0.151388
119            1 at1g03870.1 0.151388
137            1 at1g04380.1 0.151388
289            1 at1g08110.4 0.151388
383            1 at1g10360.1 0.151388
413            1 at1g11040.1 0.151388
```

Here the line `ttest.DEGs.p_adjust[order(ttest.DEGs.p_adjust[ , "S1<->S2"], decreasing = FALSE) , 1:3]` sorts the genes by the p-values of stage comparison `"S1<->S2"` in increasing order.

## Wilcoxon-Mann-Whitney test (Mann-Whitney U test)

The Wilcoxon-Mann-Whitney test is a _nonparametric_ test to quantify the shift in empirical distribution parameters. _Nonparametric_ tests are useful when sample populations do not meet the test assumptions of _parametric_ tests.

```{r, eval=FALSE}
data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by a Wilcoxon-Mann-Whitney test
Wilcox.DEGs <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
                         nrep          = 2,
                         method        = "wilcox.test",
                         stage.names   = c("S1","S2","S3"))

# look at the results
Wilcox.DEGs
```

```
  Phylostratum      GeneID   S1<->S2   S1<->S3   S2<->S3
1            1 at1g01040.2 0.3333333 0.3333333 0.3333333
2            1 at1g01050.1 1.0000000 0.3333333 0.3333333
3            1 at1g01070.1 0.3333333 0.3333333 0.3333333
4            1 at1g01080.2 0.3333333 0.3333333 0.6666667
5            1 at1g01090.1 0.3333333 0.3333333 0.3333333
```

Note that with only two replicates per stage the exact two-sided test cannot return a p-value below $2 / \binom{4}{2} = 1/3$, which is why most entries equal 0.3333333 and no gene can reach a conventional significance threshold in this small example.

Again, users can adjust p-values by specifying the `p.adjust.method` argument.

```{r, eval=FALSE}
data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by a Wilcoxon-Mann-Whitney test
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
Wilcox.DEGs.adj <- DiffGenes(ExpressionSet   = PhyloExpressionSetExample[1:5,1:8],
                             nrep            = 2,
                             method          = "wilcox.test",
                             stage.names     = c("S1","S2","S3"),
                             p.adjust.method = "BH")

# look at the results
Wilcox.DEGs.adj
```

```
  Phylostratum      GeneID   S1<->S2   S1<->S3   S2<->S3
1            1 at1g01040.2 0.4166667 0.3333333 0.4166667
2            1 at1g01050.1 1.0000000 0.3333333 0.4166667
3            1 at1g01070.1 0.4166667 0.3333333 0.4166667
4            1 at1g01080.2 0.4166667 0.3333333 0.6666667
5            1 at1g01090.1 0.4166667 0.3333333 0.4166667
```

## Negative Binomial (Exact Tests)

The exact tests for differences between two groups of negative-binomial counts implemented in `DiffGenes()` are based on the `edgeR` function `exactTest()`. Please consult the [edgeR Users Guide](http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf) for mathematical details.

### Install edgeR Package

The detection of DEGs using negative binomial models is based on the powerful implementations provided by the [edgeR](http://www.bioconductor.org/packages/release/bioc/html/edgeR.html) package. Hence, before using the negative binomial models in `DiffGenes()` users need to install the edgeR package.

```{r,eval=FALSE}
# install edgeR via BiocManager (biocLite() is no longer supported by Bioconductor)
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("edgeR")
```

### Double Tail Method

This method computes two-sided p-values by doubling the smaller tail probability (see `?exactTestDoubleTail` for details).
To compute p-values for stagewise comparisons based on negative binomial models, the `DiffGenes()` argument `method = "doubletail"`, the number of replicates per stage `nrep`, and `lib.size` quantifying the library size to equalize sample library sizes by quantile-to-quantile normalization need to be specified (see also `?equalizeLibSizes`).

```{r, eval=FALSE}
data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by the Double Tail Method
DoubleTail.DEGs <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
                             nrep          = 2,
                             method        = "doubletail",
                             lib.size      = 1000,
                             stage.names   = c("S1","S2","S3"))

# look at the results
DoubleTail.DEGs
```

```
  Phylostratum      GeneID    S1<->S2     S1<->S3   S2<->S3
1            1 at1g01040.2 0.26026604 0.110233012 0.6304508
2            1 at1g01050.1 0.95314428 0.598102712 0.6398757
3            1 at1g01070.1 0.55461941 0.456018563 0.8774231
4            1 at1g01080.2 0.58130025 0.487028051 0.8860005
5            1 at1g01090.1 0.03615134 0.001773543 0.2645537
```

Again, users can adjust p-values by specifying the `p.adjust.method` argument.

```{r, eval=FALSE}
data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by the Double Tail Method
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
DoubleTail.DEGs.adj <- DiffGenes(ExpressionSet   = PhyloExpressionSetExample[1:5,1:8],
                                 nrep            = 2,
                                 method          = "doubletail",
                                 lib.size        = 1000,
                                 stage.names     = c("S1","S2","S3"),
                                 p.adjust.method = "BH")

# look at the results
DoubleTail.DEGs.adj
```

```
  Phylostratum      GeneID   S1<->S2     S1<->S3   S2<->S3
1            1 at1g01040.2 0.6506651 0.275582530 0.8860005
2            1 at1g01050.1 0.9531443 0.598102712 0.8860005
3            1 at1g01070.1 0.7266253 0.598102712 0.8860005
4            1 at1g01080.2 0.7266253 0.598102712 0.8860005
5            1 at1g01090.1 0.1807567 0.008867715 0.8860005
```

### Small-P Method

This method performs the method of small probabilities as proposed by Robinson and Smyth (2008) (see `?exactTestBySmallP` for details).

To compute p-values for stagewise comparisons based on negative binomial models, the `DiffGenes()` argument `method = "smallp"`, the number of replicates per stage `nrep`, and `lib.size` quantifying the library size to equalize sample library sizes by quantile-to-quantile normalization need to be specified (see also `?equalizeLibSizes`).

```{r, eval=FALSE}
data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by the Small-P Method
SmallP.DEGs <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
                         nrep          = 2,
                         method        = "smallp",
                         lib.size      = 1000,
                         stage.names   = c("S1","S2","S3"))

# look at the results
SmallP.DEGs
```

```
  Phylostratum      GeneID    S1<->S2     S1<->S3   S2<->S3
1            1 at1g01040.2 0.26026604 0.110233012 0.6304508
2            1 at1g01050.1 0.95314428 0.598102712 0.6398757
3            1 at1g01070.1 0.55461941 0.456018563 0.8774231
4            1 at1g01080.2 0.58130025 0.487028051 0.8860005
5            1 at1g01090.1 0.03615134 0.001773543 0.2645537
```

Again, users can adjust p-values by specifying the `p.adjust.method` argument.

```{r, eval=FALSE}
data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by the Small-P Method
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
SmallP.DEGs.adj <- DiffGenes(ExpressionSet   = PhyloExpressionSetExample[1:5,1:8],
                             nrep            = 2,
                             method          = "smallp",
                             lib.size        = 1000,
                             stage.names     = c("S1","S2","S3"),
                             p.adjust.method = "BH")

# look at the results
SmallP.DEGs.adj
```

```
  Phylostratum      GeneID   S1<->S2     S1<->S3   S2<->S3
1            1 at1g01040.2 0.6506651 0.275582530 0.8860005
2            1 at1g01050.1 0.9531443 0.598102712 0.8860005
3            1 at1g01070.1 0.7266253 0.598102712 0.8860005
4            1 at1g01080.2 0.7266253 0.598102712 0.8860005
5            1 at1g01090.1 0.1807567 0.008867715 0.8860005
```

### Deviance Method

This method uses the deviance goodness of fit statistics to define the rejection region, and is therefore equivalent to a conditional likelihood ratio test (see `?exactTestByDeviance` for details).
To compute p-values for stagewise comparisons based on negative binomial models, the `DiffGenes()` argument `method = "deviance"`, the number of replicates per stage `nrep`, and `lib.size` quantifying the library size to equalize sample library sizes by quantile-to-quantile normalization need to be specified (see also `?equalizeLibSizes`).

```{r, eval=FALSE}
data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by the Deviance Method
Deviance.DEGs <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
                           nrep          = 2,
                           method        = "deviance",
                           lib.size      = 1000,
                           stage.names   = c("S1","S2","S3"))

# look at the results
Deviance.DEGs
```

```
  Phylostratum      GeneID    S1<->S2     S1<->S3   S2<->S3
1            1 at1g01040.2 0.26026604 0.110233012 0.6304508
2            1 at1g01050.1 0.95314428 0.598102712 0.6398757
3            1 at1g01070.1 0.55461941 0.456018563 0.8774231
4            1 at1g01080.2 0.58130025 0.487028051 0.8860005
5            1 at1g01090.1 0.03615134 0.001773543 0.2645537
```

Again, users can adjust p-values by specifying the `p.adjust.method` argument.

```{r, eval=FALSE}
data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by the Deviance Method
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
Deviance.DEGs.adj <- DiffGenes(ExpressionSet   = PhyloExpressionSetExample[1:5,1:8],
                               nrep            = 2,
                               method          = "deviance",
                               lib.size        = 1000,
                               stage.names     = c("S1","S2","S3"),
                               p.adjust.method = "BH")

# look at the results
Deviance.DEGs.adj
```

```
  Phylostratum      GeneID   S1<->S2     S1<->S3   S2<->S3
1            1 at1g01040.2 0.6506651 0.275582530 0.8860005
2            1 at1g01050.1 0.9531443 0.598102712 0.8860005
3            1 at1g01070.1 0.7266253 0.598102712 0.8860005
4            1 at1g01080.2 0.7266253 0.598102712 0.8860005
5            1 at1g01090.1 0.1807567 0.008867715 0.8860005
```

## Replicate Quality Check

Users can also perform replicate quality checks to quantify the variability between replicate expression levels for each stage separately. The `PlotReplicateQuality()` function is designed to perform customized replicate variability checks for any `ExpressionSet` object storing replicates.

```{r,eval=FALSE}
data(PhyloExpressionSetExample)

# visualize the sd() between replicates
PlotReplicateQuality(ExpressionSet = PhyloExpressionSetExample[ , 1:8],
                     nrep          = 2,
                     legend.pos    = "topright",
                     ylim          = c(0,0.2),
                     lwd           = 6)
```

The resulting plot visualizes the kernel density estimates for the variance (log variance) between replicates. Each curve represents the density function for the replicate variation within one stage or experiment. In this case the variances between replicates of `Stage 1` to `Stage 3` (each including 2 replicates) seem to deviate from each other, suggesting that each stage has a different expression level variability between replicates.

The `FUN` argument implemented in `PlotReplicateQuality()` furthermore allows users to specify customized criteria quantifying replicate variability.
Please notice that the function specified in `FUN` will be performed separately on each gene and stage. In the following example the median absolute deviation function `mad()` is used to quantify replicate variability.

```{r,eval=FALSE}
data(PhyloExpressionSetExample)

# visualize the mad() between replicates
PlotReplicateQuality(ExpressionSet = PhyloExpressionSetExample[ , 1:8],
                     nrep          = 2,
                     FUN           = mad,
                     legend.pos    = "topright",
                     ylim          = c(0,0.015),
                     lwd           = 6)
```

In general, users are not limited to specific functions implemented in R. By writing customized functions such as `FUN = function(x) return((x - mean(x))^2)` users can define their own criteria to quantify replicate variability and can then apply these criteria to `PlotReplicateQuality()` by specifying the `FUN` argument.

## Collapsing Replicate Samples

After performing differential gene expression analyses, replicate expression levels are collapsed to a single stage specific expression level. For this purpose, `myTAI` implements the `CollapseReplicates()` function, allowing users to combine replicate expression levels stored in a standard `PhyloExpressionSet` or `DivergenceExpressionSet` object to a stage specific expression level using a specified window function.

```{r,eval=FALSE}
library(myTAI)

# load example data
data(PhyloExpressionSetExample)

# generate an example PhyloExpressionSet with replicates
ExampleReplicateExpressionSet <- PhyloExpressionSetExample[ ,1:8]

# rename stages
names(ExampleReplicateExpressionSet)[3:8] <- c("Stage_1_Repl_1","Stage_1_Repl_2",
                                               "Stage_2_Repl_1","Stage_2_Repl_2",
                                               "Stage_3_Repl_1","Stage_3_Repl_2")

# have a look at the example dataset
head(ExampleReplicateExpressionSet, 5)
```

```
  Phylostratum      GeneID Stage_1_Repl_1 Stage_1_Repl_2 Stage_2_Repl_1 Stage_2_Repl_2 Stage_3_Repl_1 Stage_3_Repl_2
1            1 at1g01040.2       2173.635      1911.2001       1152.555      1291.4224       1000.253       962.9772
2            1 at1g01050.1       1501.014      1817.3086       1665.309      1564.7612       1496.321      1114.6435
3            1 at1g01070.1       1212.793      1233.0023        939.200       929.6195        864.218       877.2060
4            1 at1g01080.2       1016.920       936.3837       1181.338      1329.4734       1392.643      1287.9746
5            1 at1g01090.1      11424.567     16778.1685      34366.649     39775.6405      56231.569     66980.3673
```

Now, assume that this example `PhyloExpressionSet` stores three developmental stages and 2 biological replicates for each developmental stage. Of course, we could now compute and visualize the TAI profile by typing:

```{r,eval=FALSE}
# visualize the TAI profile over 3 stages of development
# and 2 replicates per stage
PlotPattern(ExpressionSet = ExampleReplicateExpressionSet,
            type          = "l",
            lwd           = 6)
```

Usually, one would expect that variations in replicate values are smaller than variations between developmental stages. In this example, however, we constructed replicate values that vary more than the expression levels between developmental stages.

For many applications it might be useful to visualize TAI/TDI values of replicates as well, but normally replicate values are collapsed to one gene and stage specific value after differential gene expression analyses and replicate quality control have been performed.

The following example illustrates how to collapse replicates with `CollapseReplicates()`:

```{r,eval=FALSE}
# combine the expression levels of the 2 replicates (const) per stage
# using geom.mean as window function and rename new stages to: "S1","S2","S3"
CollapsedPhyloExpressionSet <- CollapseReplicates(
                                   ExpressionSet = ExampleReplicateExpressionSet,
                                   nrep          = 2,
                                   FUN           = geom.mean,
                                   stage.names   = c("S1","S2","S3"))

# have a look at the collapsed PhyloExpressionSet
head(CollapsedPhyloExpressionSet)
```

```
  Phylostratum      GeneID         S1         S2         S3
1            1 at1g01040.2  2038.1982  1220.0147   981.4381
2            1 at1g01050.1  1651.6070  1614.2524  1291.4582
3            1 at1g01070.1  1222.8557   934.3975   870.6878
4            1 at1g01080.2   975.8215  1253.2189  1339.2866
5            1 at1g01090.1 13844.9740 36972.3612 61371.0937
6            1 at1g01120.1   815.3288   894.8987   905.8272
```

The `nrep` argument specifies either a constant number of replicates per stage or a numeric vector storing variable numbers of replicates for each developmental stage. In our example, each developmental stage had the same number of replicates (`nrep = 2`). In case the number of replicates differs between stages, one could specify `nrep = c(2,3,2)`, defining the case that developmental stage 1 stores 2 replicates, stage 2 stores 3 replicates, and stage 3 again stores 2 replicates.

The argument `FUN` specifies the window function to collapse replicate expression levels to a single stage specific value. In this example, we chose the `geom.mean()` (geometric mean) function implemented in `myTAI`, because our example `PhyloExpressionSet` stores absolute expression levels. Notice that computing the arithmetic mean (`mean()`) of `log` expression levels is mathematically equivalent (after back-transformation) to computing the geometric mean (`geom.mean()`) of absolute expression levels.

The `stage.names` argument then specifies the new names of collapsed stages.
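
The relationship between the geometric mean on absolute values and the arithmetic mean on log values can be verified with a minimal base R sketch. The helper `geom_mean()` below is a hypothetical stand-in written for this illustration (it is not the `geom.mean()` function shipped with `myTAI`), and the two replicate values are those of gene `at1g01040.2` in stage 1 as printed above.

```r
# hypothetical stand-in for a geometric mean helper (not myTAI::geom.mean)
geom_mean <- function(x) exp(mean(log(x)))

# stage 1 replicate expression levels of gene at1g01040.2
s1 <- c(2173.635, 1911.2001)

# geometric mean on absolute expression levels
gm_absolute <- geom_mean(s1)

# arithmetic mean on log2 expression levels, transformed back to the absolute scale
gm_via_log2 <- 2^mean(log2(s1))

# both routes give the same value (about 2038.2, the S1 entry of the collapsed table)
c(gm_absolute, gm_via_log2)
```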

## Filter for Expressed Genes

After differential gene expression analyses and replicate aggregation have been performed, some studies filter gene expression levels in RNA-Seq count tables or microarray expression matrices for non-expressed or outlier genes. For example, in most studies performing RNA-Seq experiments FPKM/RPKM values < 1 are removed from the processed (final) count table.

For this purpose `myTAI` implements the `Expressed()` function to filter (remove) expression levels in RNA-Seq count tables or microarray expression matrices which do not pass a defined expression threshold.

The `Expressed()` function takes a standard `PhyloExpressionSet` or `DivergenceExpressionSet` object storing an RNA-Seq count table (CT) or microarray gene expression matrix and removes genes from this count table or gene expression matrix that have an expression level below a defined `cut.off` value.

`Expressed()` allows users to choose from several gene extraction methods (see `?Expressed` for details):

* `const`: all genes that have at least one stage that undercuts or exceeds the expression `cut.off` will be excluded from the `ExpressionSet`. Hence, for a 7 stage `ExpressionSet` only genes passing the expression level `cut.off` in all 7 stages will be retained in the `ExpressionSet`.

* `min-set`: genes passing the expression level `cut.off` in `ceiling(n/2)` stages will be retained in the `ExpressionSet`, where `n` is the number of stages in the `ExpressionSet`.

* `n-set`: genes passing the expression level `cut.off` in `n` stages will be retained in the `ExpressionSet`. Here, the argument `n` defines the number of stages for which the threshold criteria should be fulfilled.
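
To illustrate how the three rules differ, the following base R sketch applies one literal reading of them to a small made-up expression matrix. Several points are assumptions made for this illustration only: a stage "passes" when its value is at or above the cut-off, "in `n` stages" is read as "in at least `n` stages", and all object names are hypothetical. It is not the implementation of `Expressed()`; see `?Expressed` for the authoritative behaviour.

```r
# hypothetical expression matrix: 4 genes (rows) x 4 stages (columns)
expr <- matrix(c(12, 15, 11, 14,
                 12,  3, 11, 14,
                  2,  3, 11,  1,
                  2,  3,  4,  1),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("gene", 1:4), paste0("S", 1:4)))

cut_off  <- 10                        # assumed expression threshold
n_stages <- ncol(expr)

# number of stages in which each gene passes the threshold (assumed: >=)
n_pass <- rowSums(expr >= cut_off)

# "const": retain genes that pass in every stage
keep_const <- names(n_pass)[n_pass == n_stages]

# "min-set": retain genes that pass in at least ceiling(n/2) stages
keep_minset <- names(n_pass)[n_pass >= ceiling(n_stages / 2)]

# "n-set": retain genes that pass in at least n stages (here n = 1, assumed reading)
keep_nset <- names(n_pass)[n_pass >= 1]

list(const = keep_const, min_set = keep_minset, n_set = keep_nset)
```

Under these assumptions the strictest rule (`const`) keeps only `gene1`, the `min-set` rule additionally keeps `gene2`, and the `n-set` rule with `n = 1` also keeps `gene3`.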
```{r,eval = FALSE}
# check number of genes in PhyloExpressionSetExample
nrow(PhyloExpressionSetExample)
#> [1] 25260

# remove genes that have an expression level below 8000
# in at least one developmental stage
FilterConst <- Expressed(ExpressionSet = PhyloExpressionSetExample,
                         cut.off       = 8000,
                         comparison    = "below",
                         method        = "const")

nrow(FilterConst) # check number of retained genes
#> [1] 449
```

Users will observe that only 449 out of 25260 genes in `PhyloExpressionSetExample` are retained when filtering with `method = 'const'`, because only these genes do not fall below the absolute expression level of `8000` in any developmental stage.

The argument `comparison` specifies whether genes having expression levels below, above, or below AND above (both) the `cut.off` value should be removed from the dataset.

The following comparison methods can be selected:

* `comparison = "below"`: define genes as not expressed which undercut the `cut.off` threshold.

* `comparison = "above"`: define genes as outliers which exceed the `cut.off` threshold.

* `comparison = "both"`: remove genes fulfilling the `comparison = "below"` criterion as well as genes fulfilling the `comparison = "above"` criterion.

```{r,eval = FALSE}
# again: check number of genes in PhyloExpressionSetExample
nrow(PhyloExpressionSetExample)
#> [1] 25260

# remove genes that have an expression level above 12000
# in at least one developmental stage (outlier removal)
FilterConst.above <- Expressed(ExpressionSet = PhyloExpressionSetExample,
                               cut.off       = 12000,
                               comparison    = "above",
                               method        = "const")

nrow(FilterConst.above) # check number of retained genes
#> [1] 23547
```

In this example, 25260 - 23547 = 1713 genes have been classified as outliers (expression levels above 12000) and were removed from the dataset.
```{r,eval = FALSE}
# again: check number of genes in PhyloExpressionSetExample
nrow(PhyloExpressionSetExample)
#> [1] 25260

# remove genes that have an expression level below 8000 AND above 12000
# in at least one developmental stage (non-expressed genes AND outlier removal)
FilterConst.both <- Expressed(ExpressionSet = PhyloExpressionSetExample,
                              cut.off       = c(8000,12000),
                              comparison    = "both",
                              method        = "const")

nrow(FilterConst.both) # check number of retained genes
#> [1] 2
```

When selecting `comparison = 'both'`, the `cut.off` argument receives 2 threshold values: the _below_ `cut.off` as first element and the _above_ `cut.off` as second element. In this case `cut.off = c(8000,12000)`. Here, only 2 genes fulfill these criteria.

Analogously, users can specify the number of stages that should fulfill the threshold criteria using the `n-set` method.

```{r,eval = FALSE}
# remove genes that have an expression level below 8000
# in at least 5 developmental stages (in this case: n = 2 stages fulfilling the criteria)
FilterNSet <- Expressed(ExpressionSet = PhyloExpressionSetExample,
                        cut.off       = 8000,
                        method        = "n-set",
                        comparison    = "below",
                        n             = 2)

nrow(FilterNSet) # check number of retained genes
#> [1] 20
```

Here, 20 genes fulfill these criteria.

## Compute the Statistical Significance of Each Replicate Combination

In some cases (high variability of replicates) it might be useful to verify that there is no sequence of replicates (for all possible combinations of replicates) that results in a non-significant `TAI` or `TDI` pattern, when the initial pattern with combined replicates was shown to be significant.

The `CombinatorialSignificance()` function implemented in `myTAI` allows users to compute the p-values quantifying the statistical significance of the underlying pattern for all combinations of replicates.

### A small Example:

Assume a `PhyloExpressionSet` stores 3 developmental stages with 3 replicates measured for each stage.
The 9 replicates in total are denoted as: $1.1, 1.2, 1.3, 2.1, 2.2, 2.3, 3.1, 3.2, 3.3$. Now the function computes the statistical significance of each pattern derived by the corresponding combination of replicates, e.g.

- 1.1, 2.1, 3.1 : p-value for combination 1
- 1.1, 2.2, 3.1 : p-value for combination 2
- 1.1, 2.3, 3.1 : p-value for combination 3
- 1.2, 2.1, 3.1 : p-value for combination 4
- 1.2, 2.2, 3.1 : p-value for combination 5
- 1.2, 2.3, 3.1 : p-value for combination 6
- 1.3, 2.1, 3.1 : p-value for combination 7
- 1.3, 2.2, 3.1 : p-value for combination 8
- 1.3, 2.3, 3.1 : p-value for combination 9
- ...

This procedure yields 27 p-values for the $3^3$ ($m^n$) replicate combinations, where $n$ denotes the number of developmental stages and $m$ denotes the number of replicates per stage.

Note that in case users have a large number of stages/experiments and a large number of replicates, the number of combinations, and hence the computation time, grows as $m^n$. For 11 stages and 4 replicates, $4^{11} = 4194304$ p-values have to be computed. Each p-value computation itself is based on a permutation test running with 1,000, 10,000, or more permutations. Be aware that this might take some time.

The p-value vector returned by this function can then be used to plot the p-values to see whether a critical value $\alpha$ is exceeded or not (e.g. $\alpha = 0.05$).
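The enumeration of replicate combinations can be reproduced in base R. The following is a minimal sketch for illustration only: the object names are hypothetical, and `expand.grid()` is used here merely to list one replicate per stage; it is not claimed to be how `CombinatorialSignificance()` builds its combinations internally.

```r
# 3 developmental stages with 3 replicates each, labelled stage.replicate
stage1 <- c("1.1", "1.2", "1.3")
stage2 <- c("2.1", "2.2", "2.3")
stage3 <- c("3.1", "3.2", "3.3")

# every way of picking exactly one replicate per stage
combinations <- expand.grid(stage1 = stage1, stage2 = stage2, stage3 = stage3,
                            stringsAsFactors = FALSE)

# number of combinations: replicates per stage raised to the number of stages
nrow(combinations)
#> [1] 27

# number of combinations for 11 stages with 4 replicates each
4^11
#> [1] 4194304

# with unequal replicate numbers, the count is the product over all stages
prod(c(2, 3, 2))
#> [1] 12
```

Checking the number of combinations before starting a run gives a quick estimate of how many permutation tests will be performed.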
```{r,eval=FALSE}
# load a standard PhyloExpressionSet
data(PhyloExpressionSetExample)

# we assume that the PhyloExpressionSetExample
# consists of 3 developmental stages
# and 2 replicates for stage 1, 3 replicates for stage 2,
# and 2 replicates for stage 3
# FOR REAL ANALYSES PLEASE USE: permutations = 1000 or 10000
# BUT NOTE THAT THIS TAKES MUCH MORE COMPUTATION TIME
p.vector <- CombinatorialSignificance(ExpressionSet = PhyloExpressionSetExample,
                                      replicates    = c(2,3,2),
                                      TestStatistic = "FlatLineTest",
                                      permutations  = 100,
                                      parallel      = FALSE)
```

```
 [1] 2.436296e-03 2.288593e-02 1.608399e-03 1.185615e-02 1.835306e-06 1.077012e-05
 [7] 2.025515e-07 5.148342e-07 1.654885e-07 6.251145e-06 9.265520e-10 1.047479e-06
```

Users will observe that none of the 12 replicate combinations resulted in a p-value > 0.05 and thus we can assume that the phylotranscriptomic pattern computed based on collapsed replicates is not biased by insignificant replicate combinations.

```{r,eval = FALSE}
any(p.vector > 0.05)
#> [1] FALSE
```

`CombinatorialSignificance()` can perform all significance tests introduced in the [Introduction](Introduction.html) and [Intermediate](Intermediate.html) vignettes.

Furthermore, the `parallel` argument allows users to perform significance computations in parallel on a multicore machine. This will speed up p-value computations for a large number of combinations.
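The returned p-values can also be summarised and plotted against a critical value. The following is a minimal base R sketch: the 12 p-values are copied from the output printed above, while the choice of `alpha`, the logarithmic axis, and the plot labels are illustrative assumptions rather than part of `myTAI`.

```r
# p-values copied from the example output above
p.vector <- c(2.436296e-03, 2.288593e-02, 1.608399e-03, 1.185615e-02,
              1.835306e-06, 1.077012e-05, 2.025515e-07, 5.148342e-07,
              1.654885e-07, 6.251145e-06, 9.265520e-10, 1.047479e-06)

alpha <- 0.05

# one p-value per replicate combination (2 * 3 * 2 = 12)
length(p.vector) == prod(c(2, 3, 2))

# least significant replicate combination
max(p.vector)

# number of combinations exceeding the critical value
sum(p.vector > alpha)

# visual check: p-values on a log scale with alpha as dashed reference line
plot(p.vector, log = "y", ylim = c(min(p.vector), 1),
     xlab = "Replicate combination", ylab = "p-value")
abline(h = alpha, lty = 2)
```

Inspecting the largest p-value is often more informative than the `any()` check alone, because it shows how close the least significant combination comes to the critical value.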