---
title: "Gene Expression Analysis with `myTAI`"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Gene Expression Analysis with `myTAI`}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}
---

## Introduction

In the [Introduction](Introduction.html) vignette we introduced and discussed how phylotranscriptomics can be applied
to capture evolutionary signals in (developmental) transcriptomes. Furthermore,
in the [Enrichment Analyses](Enrichment.html) vignette we provide a use case to correlate specific
groups or sets of genes with their predicted evolutionary origin. Here, we aim to
combine previously introduced techniques with _classic_ gene expression analyses to detect possible functional causes for the observed transcriptome conservation. 

In other words, phylotranscriptomics allows us to detect stages or periods of evolutionary conservation and is able to predict the evolutionary origin of process or trait specific genes
based on enrichment analyses. By combining evolutionary enrichment analyses with the
functional annotation of process or trait specific genes (see [Functional Annotation](https://github.com/ropensci/biomartr/blob/master/vignettes/Functional_Annotation.Rmd) for details) the detection of evolutionary signals can be correlated with 
functional processes. Then, performing gene expression analyses on corresponding process or trait specific genes allows users to detect potential causes of stage/period specific evolutionary transcriptome conservation. 


The following sections introduce the main gene expression data analysis techniques implemented in `myTAI`:

- Detection of Differentially Expressed Genes (DEGs) 
 + Fold-Change
 + Welch t-test
 + Wilcoxon Rank Sum Test (Mann-Whitney U test)
 + Negative Binomial (Exact Tests)
 
- Collapsing Replicate Samples

- Filter for Expressed Genes

- Compute the Statistical Significance of Each Replicate Combination

## Detection of Differentially Expressed Genes (DEGs)

A variety of methods have been published to detect differentially expressed genes. Some methods
are based on non-statistical quantification of expression differences (e.g. fold-change and log-fold-change), but most methods are based on statistical tests to quantify the significance 
of differences in gene expression between samples. These statistical methods can furthermore be divided into two methodological categories: parametric tests and non-parametric tests.
The `DiffGenes()` function available in `myTAI` implements the most popular and useful methods 
to detect differentially expressed genes. In the literature, different methods have been introduced
and discussed for microarray technologies versus RNA-Seq technologies. 

In this section we will introduce all methods implemented in `DiffGenes()` using small examples
and will furthermore discuss published advantages and disadvantages of each method and each mRNA
quantification technology.

__Note that when using `DiffGenes()` it is assumed that your input dataset has been normalized before passing it to `DiffGenes()`. For RNA-Seq data `DiffGenes()` assumes that the libraries have been normalized to have the same size, i.e., to have the same expected column sum under the null hypothesis (or the lib.size argument in `DiffGenes()` is specified accordingly).__

## Fold-Changes

A fold change in gene expression is simply the ratio of the gene expression level of one sample against a second sample: $\frac{e_{i1}}{e_{i2}}$, where $e_{i1}$ is the expression level of gene $i$ in sample one and $e_{i2}$ is the expression level of gene $i$ in sample two. In case replicate 
expression levels are present for each sample the ratio of means of the corresponding replicates is computed: $\frac{\bar{e}_{i1}}{\bar{e}_{i2}}$, where $\bar{e}_{i1}$ is the mean of replicate expression levels of gene $i$ in sample one and $\bar{e}_{i2}$ is the mean of replicate expression levels of gene $i$ in sample two.

* __Advantages:__ Given a small number of replicate values, the statistical evaluation of differentially expressed genes might be biased (depending on the statistical test chosen)
because the distributional assumptions of the test are not fulfilled or because a small number of
replicate values is not sufficient to perform non-parametric tests. Here, fold-changes
provide a simple way to quantify gene expression differences between samples by $n$-fold enrichment. 
In our opinion, although choosing a threshold for calling genes differentially
expressed based on fold-change values is purely subjective and relies on common sense,
in some cases this procedure will be more suitable than defining differentially expressed genes based on
p-values obtained from a test statistic with violated test assumptions.

* __Disadvantages:__ If used appropriately, statistical tests not only systematically 
quantify the significance of the observed gene-by-gene differences in expression, but also account for the variance of replicate expression levels when comparing the
mean difference of replicate expression levels between samples. Hence, the gene specific
variance between replicates is reflected in the p-value returned by an appropriate test statistic, whereas it is not captured by a simple fold-change measure. 


### Example: Fold-Change

For the following example we assume that `PhyloExpressionSetExample[1:5,1:8]`
stores 5 genes and 3 developmental stages with 2 replicate expression levels per stage.

```{r, eval=FALSE}
data("PhyloExpressionSetExample")

# Detection of DEGs using the fold-change measure
DEGs <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
                  nrep          = 2,
                  method        = "foldchange",
                  stage.names   = c("S1","S2","S3"))


head(DEGs)
```

```
   Phylostratum      GeneID    S1->S2    S1->S3    S2->S1    S2->S3    S3->S1    S3->S2
1            1 at1g01040.2 1.6713881 2.0806706 0.5983051 1.2448758 0.4806143 0.8032930
2            1 at1g01050.1 1.0273222 1.2709185 0.9734045 1.2371177 0.7868325 0.8083305
3            1 at1g01070.1 1.3087379 1.4044799 0.7640949 1.0731560 0.7120073 0.9318310
4            1 at1g01080.2 0.7779572 0.7286769 1.2854177 0.9366542 1.3723503 1.0676299
5            1 at1g01090.1 0.3803866 0.2288961 2.6289042 0.6017460 4.3687939 1.6618307
```

The resulting output shows all combinations of fold-changes between samples (developmental stages).
Here, `S1->S2` denotes that the fold-change was computed for expression levels of stage `S1` against stage
`S2`.
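As a small sanity check, the `S1->S2` and `S2->S1` entries for gene `at1g01040.2` can be reproduced by hand with base R. The sketch below assumes the replicate values printed for this gene in the section _Collapsing Replicate Samples_ (rounded as displayed there), so the result agrees with the `DiffGenes()` output only up to rounding. The variable names are chosen for illustration only.

```{r, eval=FALSE}
# replicate expression levels of gene at1g01040.2
# (values as printed in the replicate table of this vignette)
s1 <- c(2173.635, 1911.2001)   # stage S1, replicates 1 and 2
s2 <- c(1152.555, 1291.4224)   # stage S2, replicates 1 and 2

# fold-change S1->S2: ratio of the replicate means
fc_s1_s2 <- mean(s1) / mean(s2)

# fold-change S2->S1: the reciprocal ratio
fc_s2_s1 <- mean(s2) / mean(s1)

round(c("S1->S2" = fc_s1_s2, "S2->S1" = fc_s2_s1), 4)
# S1->S2 S2->S1 
# 1.6714 0.5983 
```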

### Example: Log-Fold-Change

__When selecting `method = "log-foldchange"` it is assumed that the input `ExpressionSet`
stores `log2` expression levels. Here, we transform absolute expression levels stored in
`PhyloExpressionSetExample` to `log2` expression levels using the `tf()` function before 
log-fold-changes are computed.__

```{r, eval=FALSE}
data("PhyloExpressionSetExample")

# Detection of DEGs using the log-fold-change measure
log.DEGs <- DiffGenes(ExpressionSet = tf(PhyloExpressionSetExample[1:5,1:8],log2),
                      nrep          = 2,
                      method        = "log-foldchange",
                      stage.names   = c("S1","S2","S3"))


head(log.DEGs)
```

```
  Phylostratum      GeneID      S1->S2     S1->S3      S2->S1      S2->S3     S3->S1      S3->S2
1            1 at1g01040.2  0.74104679  1.0570486 -0.74104679  0.31600182 -1.0570486 -0.31600182
2            1 at1g01050.1  0.03888868  0.3458715 -0.03888868  0.30698280 -0.3458715 -0.30698280
3            1 at1g01070.1  0.38817621  0.4900360 -0.38817621  0.10185975 -0.4900360 -0.10185975
4            1 at1g01080.2 -0.36223724 -0.4566488  0.36223724 -0.09441158  0.4566488  0.09441158
5            1 at1g01090.1 -1.39446159 -2.1272350  1.39446159 -0.73277345  2.1272350  0.73277345
```

The resulting output stores all combinations of log fold-changes between samples (developmental stages).

## Welch t-test

The `Welch t-test` is a parametric test to statistically quantify the difference of sample means in cases where the assumption of homogeneity of variance (equal variances in the two populations) is violated (Boslaugh, 2013). The `Welch t-test` can be applied to small sample sizes and has therefore been used to detect differentially expressed genes based on the p-values returned by the test statistic (Hahne et al., 2008). 

In detail, the test statistic is computed as follows:

$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$

where $\bar{x}_1$ and $\bar{x}_2$ are sample means, $s_1^2$ and $s_2^2$ are the sample variances, and $n_1$ and $n_2$ are the sample sizes.

The degrees of freedom for Welch's t-test are then computed as follows:

$df = \frac{\big(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\big)^2}{\frac{s_1^4}{n_1^2 (n_1 - 1)} + \frac{s_2^4}{n_2^2 (n_2 - 1)}}$
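The two formulas above can be checked against base R. The following is a minimal sketch using hypothetical `log2` replicate values (three replicates per sample, invented for illustration); it computes $t$ and $df$ by hand and compares them with the result of `t.test()` with `var.equal = FALSE`.

```{r, eval=FALSE}
# hypothetical log2 expression levels of one gene in two samples
x1 <- c(11.09, 10.90, 11.02)
x2 <- c(10.17, 10.33, 10.21)

n1 <- length(x1)
n2 <- length(x2)
v1 <- var(x1) / n1   # s_1^2 / n_1
v2 <- var(x2) / n2   # s_2^2 / n_2

# Welch test statistic
t_stat <- (mean(x1) - mean(x2)) / sqrt(v1 + v2)

# Welch-Satterthwaite degrees of freedom
# note: v1^2 / (n1 - 1) equals s_1^4 / (n_1^2 (n_1 - 1))
df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# compare with the base R implementation
ref <- t.test(x1, x2, var.equal = FALSE)
all.equal(c(t_stat, df, p_value),
          unname(c(ref$statistic, ref$parameter, ref$p.value)))
```

The comparison should return `TRUE`, since `t.test()` uses the same statistic and degrees of freedom when equal variances are not assumed.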


To perform a valid `Welch t-test`, the following assumptions about the input data need to be fulfilled to test whether two samples come from populations with equal means:

__Assumptions about input data__

* independent samples
* continuous data
* (approximate) normality


Nevertheless, although in most cases `log2` expression levels are used to perform 
the `Welch t-test`, assuming that expression levels are log-normally distributed (so that `log2` expression levels
are approximately normally distributed), in most cases the small number of replicates is not sufficient to fulfill the (approximate) normality assumption 
of the `Welch t-test`.

Due to this fact, non-parametric, sampling based, or generalized linear model based
methods have been proposed to quantify p-values of differential expression. Nevertheless, the `DiffGenes()` function implements the `Welch t-test` for the 
detection of differentially expressed genes, allowing users to compare the results with
more recent DEG detection methods/methodologies also implemented in `DiffGenes()`.


* __Advantages:__ 
  + DEG detection based on statistical quantification 
  + Parametric test resulting in a strong test statistic
  + Can handle small sample sizes

* __Disadvantages:__
  + Test assumptions must be fulfilled to return valid p-values  
  + Can hardly assure normality with very small sample sizes of $n = 3,4,5,..$ (replicates)
  + Pairwise comparisons between different stages or experiments

### Example: Welch t-test

Performing a `Welch t-test` with `DiffGenes()` can be done by specifying `method = "t.test"`. Internally, `DiffGenes()` performs a two-sided `Welch t-test`. This means
that the `Welch t-test` quantifies only whether or not a gene is significantly
differentially expressed, but not the direction of enrichment (over-expressed or under-expressed).

The `PhyloExpressionSetExample` we use in the following example stores absolute expression levels.
In case your `ExpressionSet` also stores absolute expression levels (which is likely due to the `ExpressionSet` standard for Phylotranscriptomics analyses), you can use the `tf()` function implemented in
`myTAI` to transform absolute expression levels to `log2` expression
levels before performing `DiffGenes()` with a `Welch t-test`, e.g.
`tf(PhyloExpressionSetExample[1:5,1:8],log2)`. In general, using `log2` transformed expression levels as input
`ExpressionSet` of `DiffGenes()` allows us to (at least) assume that the samples (replicate expression levels) used to perform the `Welch t-test` are approximately
normally distributed, provided that the untransformed expression levels are log-normally distributed.

Please note, however, that RNA-Seq data can include count values of 0. When transforming absolute
counts to `log2` counts, values of `log2(0) = -Inf` will be produced and p-value
computations will therefore not be possible. To avoid this case you could either remove RNA-Seq count values of 0 from
the input dataset using the `Expressed()` function (see section _Filter for Expressed Genes_), e.g.
pass `tf(Expressed(PhyloExpressionSetExample[1:5,1:8], cut.off = 1),log2)` as `ExpressionSet` argument to
`DiffGenes()` or shift all count values by a constant value, e.g. pass
`tf(PhyloExpressionSetExample[1:5,1:8], function(x) log2(x + 1))` as `ExpressionSet` argument to
`DiffGenes()`.

Internally, `DiffGenes()` will also check for 0 values in input data and will automatically shift all 
expression levels by `+1` in case 0 values are included.
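The effect of zero counts on the `log2` transformation, and of shifting by a constant, can be seen in a few lines of base R. The counts below are hypothetical and only serve to illustrate the problem.

```{r, eval=FALSE}
# hypothetical RNA-Seq counts of one gene, including a zero count
counts <- c(0, 3, 15)

# plain log2 transformation: the zero count becomes -Inf
log_plain <- log2(counts)
is.finite(log_plain)
# [1] FALSE  TRUE  TRUE

# shifting all counts by +1 keeps every value finite
log_shifted <- log2(counts + 1)
log_shifted
# [1] 0 2 4
```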

```{r, eval=FALSE}
data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by a Welch t-test
ttest.DEGs <- DiffGenes(ExpressionSet = tf(PhyloExpressionSetExample[1:5,1:8],log2),
                        nrep          = 2,
                        method        = "t.test",
                        stage.names   = c("S1","S2","S3"))

# look at the results
ttest.DEGs
```

```
  Phylostratum      GeneID     S1<->S2    S1<->S3    S2<->S3
1            1 at1g01040.2 0.027832572 0.04020203 0.13481563
2            1 at1g01050.1 0.852379466 0.31471871 0.36326955
3            1 at1g01070.1 0.003200692 0.00113536 0.02236621
4            1 at1g01080.2 0.086426813 0.03092924 0.45999438
5            1 at1g01090.1 0.090387087 0.04638872 0.04978092
```

The resulting `data.frame` stores the p-values of stage-wise comparisons for each gene. To adjust p-values for multiple testing of stage-wise comparisons you can specify the `p.adjust.method` argument with one of the p-value adjustment methods implemented in `DiffGenes()`.

In detail, correcting for multiple testing allows users to choose appropriate selection cut-offs for p-values
fulfilling the differential expression criteria. Hahne et al., 2008 (p. 87) give a nice example of
correcting for multiple testing to determine appropriate selection cut-offs.

Please consult the documentation of `?p.adjust` to see which p-value adjustment methods are implemented
in `DiffGenes()`.

Please also consult these reviews ([Biostatistics Handbook](http://www.biostathandbook.com/multiplecomparisons.html), [Gelman et al., 2008](http://www.stat.columbia.edu/~gelman/research/published/multiple2f.pdf), and [Slides](http://www.gs.washington.edu/academics/courses/akey/56008/lecture/lecture10.pdf)) to decide whether or not to apply p-value adjustment to your own dataset.

```{r, eval=FALSE}
data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by a Welch t-test
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
ttest.DEGs.p_adjust <- DiffGenes(ExpressionSet   = tf(PhyloExpressionSetExample[1:5,1:8],log2),
                                 nrep            = 2,
                                 method          = "t.test",
                                 p.adjust.method = "BH",
                                 stage.names     = c("S1","S2","S3"))


ttest.DEGs.p_adjust
```

```
  Phylostratum      GeneID    S1<->S2   S1<->S3   S2<->S3
1            1 at1g01040.2 0.06958143 0.0579859 0.2246927
2            1 at1g01050.1 0.85237947 0.3147187 0.4540869
3            1 at1g01070.1 0.01600346 0.0056768 0.1118311
4            1 at1g01080.2 0.11298386 0.0579859 0.4599944
5            1 at1g01090.1 0.11298386 0.0579859 0.1244523
```
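The adjusted values printed above are consistent with applying the Benjamini & Hochberg correction within each stage comparison, across genes. As a sketch, the `S1<->S2` column can be reproduced from the unadjusted p-values shown earlier using base R's `p.adjust()`; the manual computation is included only to make the procedure explicit, and the variable names are illustrative.

```{r, eval=FALSE}
# unadjusted Welch t-test p-values of the S1<->S2 comparison (5 genes)
p_raw <- c(0.027832572, 0.852379466, 0.003200692, 0.086426813, 0.090387087)

# Benjamini & Hochberg adjustment as implemented in base R
p_bh <- p.adjust(p_raw, method = "BH")

# the same adjustment computed by hand:
# sort decreasingly, scale by m / rank, enforce monotonicity, cap at 1
m <- length(p_raw)
o <- order(p_raw, decreasing = TRUE)
p_manual <- numeric(m)
p_manual[o] <- pmin(1, cummin(p_raw[o] * m / (m:1)))

all.equal(p_bh, p_manual)
signif(p_bh, 7)
# [1] 0.06958143 0.85237947 0.01600346 0.11298386 0.11298386
```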

The resulting p-value adjusted `data.frame` can be used to filter for differentially expressed genes.
Here, specifying the arguments: `comparison`, `alpha`, and `filter.method` in `DiffGenes()` allows
users to obtain only significant differentially expressed genes.

```{r,eval=FALSE}
# Detection of DEGs using the p-value returned by a Welch t-test
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
# and filter for significantly differentially expressed genes (alpha = 0.05) 
ttest.DEGs.p_adjust.filtered <- DiffGenes(ExpressionSet   = tf(PhyloExpressionSetExample[1:10 ,1:8],log2),
                                          nrep            = 2,
                                          method          = "t.test",
                                          p.adjust.method = "BH",
                                          stage.names     = c("S1","S2","S3"),
                                          comparison      = "above",
                                          alpha           = 0.05,
                                          filter.method   = "n-set",
                                          n               = 1)

# look at the genes fulfilling the filter criteria 
ttest.DEGs.p_adjust.filtered
```

```
  Phylostratum      GeneID    S1<->S2   S1<->S3   S2<->S3
3            1 at1g01070.1 0.03200692 0.0113536 0.2192432
```

In this example, only 1 out of 10 genes fulfills the p-value criterion (`alpha = 0.05`) in at least one stage comparison.

#### Rank top p-values

Finally, users can rank genes in increasing p-value order for each stage comparison by typing:

```{r,eval = FALSE}

ttest.DEGs.p_adjust <- DiffGenes(ExpressionSet   = tf(PhyloExpressionSetExample[1:500,1:8],log2),
                                 nrep            = 2,
                                 method          = "t.test",
                                 p.adjust.method = "BH",
                                 stage.names     = c("S1","S2","S3"))


head(ttest.DEGs.p_adjust[order(ttest.DEGs.p_adjust[ , "S1<->S2"], decreasing = FALSE) , 1:3])
```

```
    Phylostratum      GeneID  S1<->S2
54             1 at1g02400.1 0.151388
119            1 at1g03870.1 0.151388
137            1 at1g04380.1 0.151388
289            1 at1g08110.4 0.151388
383            1 at1g10360.1 0.151388
413            1 at1g11040.1 0.151388
```

Here the line `ttest.DEGs.p_adjust[order(ttest.DEGs.p_adjust[ , "S1<->S2"], decreasing = FALSE) , 1:3]` will sort p-values of stage comparison `"S1<->S2"` in increasing order.

## Wilcoxon-Mann-Whitney test (Mann-Whitney U test)

The Wilcoxon-Mann-Whitney test is a _nonparametric_ test to quantify the shift 
in empirical distribution parameters. _Nonparametric_ tests are useful when sample populations
do not meet the test assumptions of _parametric_ tests. 

```{r, eval=FALSE}
data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by a Wilcoxon-Mann-Whitney test
Wilcox.DEGs <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
                        nrep          = 2,
                        method        = "wilcox.test",
                        stage.names   = c("S1","S2","S3"))

# look at the results
Wilcox.DEGs
```

```
  Phylostratum      GeneID   S1<->S2   S1<->S3   S2<->S3
1            1 at1g01040.2 0.3333333 0.3333333 0.3333333
2            1 at1g01050.1 1.0000000 0.3333333 0.3333333
3            1 at1g01070.1 0.3333333 0.3333333 0.3333333
4            1 at1g01080.2 0.3333333 0.3333333 0.6666667
5            1 at1g01090.1 0.3333333 0.3333333 0.3333333
```
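The coarse p-values in this output follow from the small number of replicates. With two replicates per stage and no ties, the exact two-sided Wilcoxon-Mann-Whitney test can only take a few distinct values, and the smallest possible p-value is $2 / \binom{4}{2} = 1/3$. The following base R sketch illustrates this for gene `at1g01040.2`, assuming the replicate values printed in the section _Collapsing Replicate Samples_.

```{r, eval=FALSE}
# replicate expression levels of gene at1g01040.2 in stages S1 and S2
s1 <- c(2173.635, 1911.2001)
s2 <- c(1152.555, 1291.4224)

# exact two-sided test (used by default for small samples without ties)
p_obs <- wilcox.test(s1, s2)$p.value

# smallest p-value attainable with 2 vs. 2 observations:
# only 2 of the choose(4, 2) = 6 rank arrangements are most extreme
p_min <- 2 / choose(4, 2)

c(observed = p_obs, minimum = p_min)
#  observed   minimum 
# 0.3333333 0.3333333 
```

Hence no gene can reach `alpha = 0.05` with this test when only two replicates per stage are available.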

Again, users can adjust p-values by specifying the `p.adjust.method` argument.

```{r, eval=FALSE}
data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by a Wilcoxon-Mann-Whitney test
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
Wilcox.DEGs.adj <- DiffGenes(ExpressionSet  = PhyloExpressionSetExample[1:5,1:8],
                            nrep            = 2,
                            method          = "wilcox.test",
                            stage.names     = c("S1","S2","S3"),
                            p.adjust.method = "BH")

# look at the results
Wilcox.DEGs.adj
```

```
  Phylostratum      GeneID   S1<->S2   S1<->S3   S2<->S3
1            1 at1g01040.2 0.4166667 0.3333333 0.4166667
2            1 at1g01050.1 1.0000000 0.3333333 0.4166667
3            1 at1g01070.1 0.4166667 0.3333333 0.4166667
4            1 at1g01080.2 0.4166667 0.3333333 0.6666667
5            1 at1g01090.1 0.4166667 0.3333333 0.4166667
```


## Negative Binomial (Exact Tests)

Exact tests for differences between two groups of negative-binomial counts implemented in `DiffGenes()` are based on the
`edgeR` function `exactTest()`. Please consult the [edgeR Users Guide](http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf) for mathematical details.

### Install edgeR Package

The detection of DEGs using negative binomial models is based on the powerful implementations
provided by the [edgeR](http://www.bioconductor.org/packages/release/bioc/html/edgeR.html) package. Hence, before using the negative binomial models in `DiffGenes()` users need to install 
the edgeR package.

```{r,eval=FALSE}
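# install edgeR from Bioconductor
# (BiocManager replaces the retired biocLite() installer)
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("edgeR")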
```

### Double Tail Method

This method computes two-sided p-values by doubling the smaller tail probability (see `?exactTestDoubleTail` for details).
To compute p-values for stagewise comparisons based on negative binomial models, the `DiffGenes()` argument 
`method = "doubletail"`, the number of replicates per stage `nrep`, and `lib.size` quantifying the library size to equalize sample library
sizes by quantile-to-quantile normalization need to be specified (see also `?equalizeLibSizes`).

```{r, eval=FALSE}
data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by the Double Tail Method
DoubleTail.DEGs <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
                        nrep          = 2,
                        method        = "doubletail",
                        lib.size      = 1000,
                        stage.names   = c("S1","S2","S3"))

# look at the results
DoubleTail.DEGs
```

```
  Phylostratum      GeneID    S1<->S2     S1<->S3   S2<->S3
1            1 at1g01040.2 0.26026604 0.110233012 0.6304508
2            1 at1g01050.1 0.95314428 0.598102712 0.6398757
3            1 at1g01070.1 0.55461941 0.456018563 0.8774231
4            1 at1g01080.2 0.58130025 0.487028051 0.8860005
5            1 at1g01090.1 0.03615134 0.001773543 0.2645537
```

Again, users can adjust p-values by specifying the `p.adjust.method` argument.

```{r, eval=FALSE}
data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by the Double Tail Method
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
DoubleTail.DEGs.adj <- DiffGenes(ExpressionSet  = PhyloExpressionSetExample[1:5,1:8],
                                nrep            = 2,
                                method          = "doubletail",
                                lib.size        = 1000,
                                stage.names     = c("S1","S2","S3"),
                                p.adjust.method = "BH")

# look at the results
DoubleTail.DEGs.adj
```

```
  Phylostratum      GeneID   S1<->S2     S1<->S3   S2<->S3
1            1 at1g01040.2 0.6506651 0.275582530 0.8860005
2            1 at1g01050.1 0.9531443 0.598102712 0.8860005
3            1 at1g01070.1 0.7266253 0.598102712 0.8860005
4            1 at1g01080.2 0.7266253 0.598102712 0.8860005
5            1 at1g01090.1 0.1807567 0.008867715 0.8860005
```

### Small-P Method

This method implements the method of small probabilities as proposed by Robinson and Smyth (2008) (see `?exactTestBySmallP` for details).
To compute p-values for stagewise comparisons based on negative binomial models, the `DiffGenes()` argument 
`method = "smallp"`, the number of replicates per stage `nrep`, and `lib.size` quantifying the library size to equalize sample library
sizes by quantile-to-quantile normalization need to be specified (see also `?equalizeLibSizes`).

```{r, eval=FALSE}
data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by the Small-P Method
SmallP.DEGs <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
                        nrep          = 2,
                        method        = "smallp",
                        lib.size      = 1000,
                        stage.names   = c("S1","S2","S3"))

# look at the results
SmallP.DEGs
```

```
  Phylostratum      GeneID    S1<->S2     S1<->S3   S2<->S3
1            1 at1g01040.2 0.26026604 0.110233012 0.6304508
2            1 at1g01050.1 0.95314428 0.598102712 0.6398757
3            1 at1g01070.1 0.55461941 0.456018563 0.8774231
4            1 at1g01080.2 0.58130025 0.487028051 0.8860005
5            1 at1g01090.1 0.03615134 0.001773543 0.2645537
```

Again, users can adjust p-values by specifying the `p.adjust.method` argument.

```{r, eval=FALSE}
data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by the Small-P Method
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
SmallP.DEGs.adj <- DiffGenes(ExpressionSet  = PhyloExpressionSetExample[1:5,1:8],
                                nrep            = 2,
                                method          = "smallp",
                                lib.size        = 1000,
                                stage.names     = c("S1","S2","S3"),
                                p.adjust.method = "BH")

# look at the results
SmallP.DEGs.adj
```

```
  Phylostratum      GeneID   S1<->S2     S1<->S3   S2<->S3
1            1 at1g01040.2 0.6506651 0.275582530 0.8860005
2            1 at1g01050.1 0.9531443 0.598102712 0.8860005
3            1 at1g01070.1 0.7266253 0.598102712 0.8860005
4            1 at1g01080.2 0.7266253 0.598102712 0.8860005
5            1 at1g01090.1 0.1807567 0.008867715 0.8860005
```


### Deviance Method

This method uses the deviance goodness of fit statistics to define the rejection region, 
and is therefore equivalent to a conditional likelihood ratio test (see `?exactTestByDeviance` for details).
To compute p-values for stagewise comparisons based on negative binomial models, the `DiffGenes()` argument 
`method = "deviance"`, the number of replicates per stage `nrep`, and `lib.size` quantifying the library size to equalize sample library
sizes by quantile-to-quantile normalization need to be specified (see also `?equalizeLibSizes`).

```{r, eval=FALSE}
data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by the Deviance Method
Deviance.DEGs <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
                        nrep          = 2,
                        method        = "deviance",
                        lib.size      = 1000,
                        stage.names   = c("S1","S2","S3"))

# look at the results
Deviance.DEGs
```

```
  Phylostratum      GeneID    S1<->S2     S1<->S3   S2<->S3
1            1 at1g01040.2 0.26026604 0.110233012 0.6304508
2            1 at1g01050.1 0.95314428 0.598102712 0.6398757
3            1 at1g01070.1 0.55461941 0.456018563 0.8774231
4            1 at1g01080.2 0.58130025 0.487028051 0.8860005
5            1 at1g01090.1 0.03615134 0.001773543 0.2645537
```

Again, users can adjust p-values by specifying the `p.adjust.method` argument.

```{r, eval=FALSE}
data("PhyloExpressionSetExample")

# Detection of DEGs using the p-value returned by the Deviance Method
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
Deviance.DEGs.adj <- DiffGenes(ExpressionSet    = PhyloExpressionSetExample[1:5,1:8],
                                nrep            = 2,
                                method          = "deviance",
                                lib.size        = 1000,
                                stage.names     = c("S1","S2","S3"),
                                p.adjust.method = "BH")

# look at the results
Deviance.DEGs.adj
```

```
  Phylostratum      GeneID   S1<->S2     S1<->S3   S2<->S3
1            1 at1g01040.2 0.6506651 0.275582530 0.8860005
2            1 at1g01050.1 0.9531443 0.598102712 0.8860005
3            1 at1g01070.1 0.7266253 0.598102712 0.8860005
4            1 at1g01080.2 0.7266253 0.598102712 0.8860005
5            1 at1g01090.1 0.1807567 0.008867715 0.8860005
```


## Replicate Quality Check

Users can also perform replicate quality checks to quantify the variability between replicate expression levels for each
stage separately.

The `PlotReplicateQuality()` function is designed to perform customized replicate variability checks for any `ExpressionSet` object storing 
replicates.

```{r,eval=FALSE}
data(PhyloExpressionSetExample)

# visualize the variability between replicates (default criterion)
PlotReplicateQuality(ExpressionSet = PhyloExpressionSetExample[ , 1:8],
                     nrep          = 2,
                     legend.pos    = "topright",
                     ylim          = c(0,0.2),
                     lwd           = 6)

```

The resulting plot visualizes the kernel density estimates for the variance (log variance) between replicates.
Each curve represents the density function for the replicate variation within one stage or experiment.
In this case the variance between replicates of `Stage 1` to `Stage 3` (each including 2 replicates) seems to differ between stages, suggesting that each stage has a different expression level variability between replicates.
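To make the idea behind such a curve concrete, the following base R sketch computes a per-gene log variance for the two replicates of one stage and then estimates its kernel density. It assumes, as described above, that the log variance is the variability criterion; it is not the internal implementation of `PlotReplicateQuality()`. The replicate values are those printed for the first five genes in the section _Collapsing Replicate Samples_, and all object names are illustrative.

```{r, eval=FALSE}
# Stage 1 replicate expression levels of the first five example genes
stage1 <- cbind(
  rep1 = c(2173.635, 1501.014, 1212.793, 1016.920, 11424.567),
  rep2 = c(1911.2001, 1817.3086, 1233.0023, 936.3837, 16778.1685)
)

# variability criterion applied to each gene (row): log variance
log_var <- apply(stage1, 1, function(x) log(var(x)))

# kernel density estimate of the per-gene variability within this stage
dens <- density(log_var)

plot(dens, main = "Stage 1: log variance between replicates", lwd = 2)
```

Repeating this for each stage and overlaying the curves gives a picture comparable in spirit to the plot described above.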

The `FUN` argument implemented in `PlotReplicateQuality()` furthermore allows users to specify customized criteria quantifying
replicate variability. Please note that the function specified in `FUN` will be applied separately to each gene and stage.

In the following example the median absolute deviation function `mad()` is used to quantify replicate variability.

```{r,eval=FALSE}
data(PhyloExpressionSetExample)

# visualize the mad() between replicates
PlotReplicateQuality(ExpressionSet = PhyloExpressionSetExample[ , 1:8],
                     nrep          = 2,
                     FUN           = mad,
                     legend.pos    = "topright",
                     ylim          = c(0,0.015),
                     lwd           = 6)

```

In general, users are not limited to specific functions implemented in R. By writing customized functions such as
`FUN = function(x) return((x - mean(x))^2)` users can define their own criteria to quantify replicate variability and
can then apply these criteria to `PlotReplicateQuality()` by specifying the `FUN` argument.

## Collapsing Replicate Samples

After performing differential gene expression analyses, replicate expression levels
are collapsed to a single stage specific expression level. For this purpose, `myTAI`
implements the `CollapseReplicates()` function, allowing users to combine replicate
expression levels stored in a standard `PhyloExpressionSet` or `DivergenceExpressionSet` object to a stage specific expression level using a specified window function.


```{r,eval=FALSE}
library(myTAI)

# load example data
data(PhyloExpressionSetExample)

# generate an example PhyloExpressionSet with replicates
ExampleReplicateExpressionSet <- PhyloExpressionSetExample[ ,1:8]

# rename stages
names(ExampleReplicateExpressionSet)[3:8] <- c("Stage_1_Repl_1","Stage_1_Repl_2",
                                               "Stage_2_Repl_1","Stage_2_Repl_2",
                                               "Stage_3_Repl_1","Stage_3_Repl_2")
# have a look at the example dataset
head(ExampleReplicateExpressionSet, 5)
```

```
  Phylostratum      GeneID Stage_1_Repl_1 Stage_1_Repl_2 Stage_2_Repl_1 Stage_2_Repl_2 Stage_3_Repl_1 Stage_3_Repl_2
1            1 at1g01040.2       2173.635      1911.2001       1152.555      1291.4224       1000.253       962.9772
2            1 at1g01050.1       1501.014      1817.3086       1665.309      1564.7612       1496.321      1114.6435
3            1 at1g01070.1       1212.793      1233.0023        939.200       929.6195        864.218       877.2060
4            1 at1g01080.2       1016.920       936.3837       1181.338      1329.4734       1392.643      1287.9746
5            1 at1g01090.1      11424.567     16778.1685      34366.649     39775.6405      56231.569     66980.3673
```

Now, assume that this example `PhyloExpressionSet` stores three developmental stages and 2
biological replicates for each developmental stage. Of course, we could now compute and visualize the TAI profile by typing:

```{r,eval=FALSE}
# visualize the TAI profile over 3 stages of development
# and 2 replicates per stage
PlotPattern(ExpressionSet = ExampleReplicateExpressionSet,
            type          = "l",
            lwd           = 6)

```

Usually, one would expect that variations between replicate values are smaller than variations between developmental stages. In this example, however, we constructed replicate values that
vary more strongly than the expression levels between developmental stages. 
For many applications it might be useful to visualize TAI/TDI values of replicates as well,
but normally replicate values are collapsed to one gene and stage specific value after
differential gene expression analyses and replicate quality control have been performed.

The following example illustrates how to collapse replicates with `CollapseReplicates()`:

```{r,eval=FALSE}
# combine the expression levels of the 2 replicates (const) per stage
# using geom.mean as window function and rename new stages to: "S1","S2","S3"
CollapsedPhyloExpressionSet <- CollapseReplicates(
                                       ExpressionSet = ExampleReplicateExpressionSet,
                                       nrep          = 2,
                                       FUN           = geom.mean,
                                       stage.names   = c("S1","S2","S3"))

# have a look at the collapsed PhyloExpressionSet
head(CollapsedPhyloExpressionSet)

```

```
  Phylostratum      GeneID         S1         S2         S3
1            1 at1g01040.2  2038.1982  1220.0147   981.4381
2            1 at1g01050.1  1651.6070  1614.2524  1291.4582
3            1 at1g01070.1  1222.8557   934.3975   870.6878
4            1 at1g01080.2   975.8215  1253.2189  1339.2866
5            1 at1g01090.1 13844.9740 36972.3612 61371.0937
6            1 at1g01120.1   815.3288   894.8987   905.8272
```

The `nrep` argument specifies either a constant number of replicates per stage or
a numeric vector storing a variable number of replicates for each developmental stage.
In our example, each developmental stage had the same number of replicates (`nrep = 2`).
If the number of replicates differs between stages, one could specify `nrep = c(2,3,2)`,
meaning that developmental stage 1 stores 2 replicates, stage 2 stores 3 replicates, and stage 3 again stores 2 replicates.
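
To make the meaning of a variable `nrep` vector more concrete, the following base R sketch maps sample columns to stages and collapses them with a window function. It is only an illustration of the idea under made-up toy data and is not the implementation used by `CollapseReplicates()`; the object names `expr`, `stage.index`, and `collapsed` are hypothetical.

```r
# hypothetical toy expression matrix: 2 genes, 7 samples
# (2 replicates for stage 1, 3 for stage 2, 2 for stage 3)
expr <- matrix(c(1, 3, 2, 4, 6, 10, 12,
                 5, 5, 8, 8, 8,  1,  3),
               nrow = 2, byrow = TRUE)

nrep <- c(2, 3, 2)

# assign each sample (column) to its stage: 1 1 2 2 2 3 3
stage.index <- rep(seq_along(nrep), times = nrep)

# collapse the replicates of each stage with a window function (here: mean)
collapsed <- sapply(split(seq_len(ncol(expr)), stage.index),
                    function(cols) apply(expr[ , cols, drop = FALSE], 1, mean))
colnames(collapsed) <- c("S1", "S2", "S3")
collapsed
```

A constant number of replicates corresponds to the special case in which all entries of `nrep` are equal, e.g. `rep(2, 3)` for three stages with 2 replicates each.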

The argument `FUN` specifies the window function to collapse replicate expression levels to
a single stage specific value. In this example, we chose the `geom.mean()` (geometric mean) function
implemented in `myTAI`, because our example `PhyloExpressionSet` stores absolute expression levels. Notice that computing the arithmetic mean (`mean()`)
of `log` expression levels is mathematically equivalent to computing the logarithm of the geometric mean (`geom.mean()`) of
absolute expression levels.
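
The relationship between the two means can be checked numerically. The sketch below uses the two stage 1 replicate values of gene `at1g01040.2` from the example table and a geometric mean written out in base R; the helper `geometric.mean()` is an illustrative stand-in and not the `myTAI` implementation of `geom.mean()`.

```r
# replicate expression levels of gene at1g01040.2 in stage 1
# (taken from the example table above)
x <- c(2173.635, 1911.2001)

# geometric mean written out in base R (illustrative helper)
geometric.mean <- function(x) exp(mean(log(x)))

# the arithmetic mean on the log scale equals the log of the geometric mean
all.equal(mean(log(x)), log(geometric.mean(x)))

# direct definition: n-th root of the product of n values
all.equal(geometric.mean(x), prod(x)^(1 / length(x)))

geometric.mean(x)
```

The resulting value is approximately 2038.2, which agrees with the collapsed `S1` value shown for this gene.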

The `stage.names` argument then specifies the new names of collapsed stages.


## Filter for Expressed Genes

After differential gene expression analyses and replicate aggregation have been performed,
some studies filter gene expression levels in RNA-Seq count tables or microarray expression matrices for non-expressed or outlier genes.
For example, in many RNA-Seq studies genes with FPKM/RPKM values < 1
are removed from the processed (final) count table.

For this purpose `myTAI` implements the `Expressed()` function to filter (remove) expression levels in RNA-Seq count tables or microarray expression matrices which do not pass a defined expression threshold.

The `Expressed()` function takes a standard `PhyloExpressionSet` or `DivergenceExpressionSet` object storing an RNA-Seq count table (CT) or microarray gene expression matrix and removes genes from this count table or gene expression matrix that have an expression level below a defined `cut.off` value.

`Expressed()` allows users to choose from several gene extraction methods (see `?Expressed` for details):

* `const`: all genes that have at least one stage that undercuts or exceeds the expression `cut.off` (depending on the `comparison` argument) will be excluded from the `ExpressionSet`. Hence, for a 7 stage `ExpressionSet` only genes passing the expression level `cut.off` in all 7 stages will be retained in the `ExpressionSet`.

* `min-set`: genes passing the expression level `cut.off` in `ceiling(n/2)` stages will be retained in the `ExpressionSet`, where `n` is the number of stages in the `ExpressionSet`.

* `n-set`: genes passing the expression level `cut.off` in `n` stages will be retained in the `ExpressionSet`. Here, the argument `n` is defining the number of stages for which the threshold criteria should be fulfilled.
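
The three methods can be illustrated with a small base R sketch that counts, for each gene, the number of stages passing the threshold. This follows the verbal descriptions above under two assumptions: "in `n` stages" is read as "in at least `n` stages", and a value equal to the `cut.off` counts as passing. The boundary handling inside `Expressed()` may differ, and all object names are hypothetical.

```r
# hypothetical toy expression matrix: 4 genes (rows) x 4 stages (columns)
expr <- matrix(c(9, 9, 9, 9,    # passes the cut.off in all 4 stages
                 9, 9, 9, 1,    # passes in 3 stages
                 9, 9, 1, 1,    # passes in 2 stages
                 9, 1, 1, 1),   # passes in 1 stage
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("gene_", 1:4), paste0("Stage_", 1:4)))

cut.off  <- 8
n.stages <- ncol(expr)

# number of stages in which each gene passes the cut.off
# (corresponds to comparison = "below": low values fail)
n.passed <- rowSums(expr >= cut.off)

# genes retained under each method, following the descriptions above
retained.const  <- names(n.passed)[n.passed == n.stages]
retained.minset <- names(n.passed)[n.passed >= ceiling(n.stages / 2)]
retained.nset   <- names(n.passed)[n.passed >= 3]   # n = 3

retained.const
retained.minset
retained.nset
```

In this toy example `const` is the strictest filter, while `min-set` and `n-set` tolerate a limited number of stages that fail the threshold.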


```{r,eval = FALSE}
# check number of genes in PhyloExpressionSetExample
nrow(PhyloExpressionSetExample)
#> [1] 25260

# remove genes that have an expression level below 8000
# in at least one developmental stage
FilterConst <- Expressed(ExpressionSet = PhyloExpressionSetExample,
                         cut.off       = 8000,
                         comparison    = "below", 
                         method        = "const")

nrow(FilterConst) # check number of retained genes
#> [1] 449
```

Users will observe that only 449 out of 25260 genes in `PhyloExpressionSetExample` are retained when omitting genes using `method = 'const'`, i.e. only these genes do not fall below an absolute expression level of `8000` in any developmental stage. The argument `comparison` specifies whether genes having expression levels below, above, or below AND above (both) the `cut.off` value should be removed from the dataset. 

The following comparison methods can be selected:

* `comparison = "below"`: define genes as not expressed which undercut the `cut.off` threshold.
* `comparison = "above"`: define genes as outliers which exceed the `cut.off` threshold.
* `comparison = "both"`: remove genes fulfilling the `comparison = "below"` __AND__ `comparison = "above"` criteria.
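
The following base R sketch shows how the three comparison modes could flag the stages of a single gene. The helper `violates()` is hypothetical and not part of `myTAI`; in particular, reading `"both"` as "below the first threshold or above the second threshold" is an assumption made for this illustration.

```r
# hypothetical expression levels of one gene across 5 stages
x <- c(500, 9000, 10000, 11000, 15000)

# illustrative helper (not part of myTAI): flag stages violating the cut.off
violates <- function(x, cut.off, comparison) {
  switch(comparison,
         below = x < cut.off[1],
         above = x > cut.off[1],
         both  = x < cut.off[1] | x > cut.off[2],
         stop("comparison must be 'below', 'above', or 'both'"))
}

violates(x, cut.off = 8000,           comparison = "below")
violates(x, cut.off = 12000,          comparison = "above")
violates(x, cut.off = c(8000, 12000), comparison = "both")
```

With `method = "const"`, a gene flagged in at least one stage would then be removed from the dataset.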


```{r,eval = FALSE}
# again: check number of genes in PhyloExpressionSetExample
nrow(PhyloExpressionSetExample)
#> [1] 25260

# remove genes that have an expression level above 12000
# in at least one developmental stage (outlier removal)
FilterConst.above <- Expressed(ExpressionSet = PhyloExpressionSetExample,
                               cut.off       = 12000,
                               comparison    = "above", 
                               method        = "const")

nrow(FilterConst.above) # check number of retained genes
#> [1] 23547
```

For this example 25260 - 23547 = 1713 genes have been classified as outliers (expression levels above 12000)
and were removed from the dataset.


```{r,eval = FALSE}
# again: check number of genes in PhyloExpressionSetExample
nrow(PhyloExpressionSetExample)
#> [1] 25260

# remove genes that have an expression level below 8000 AND above 12000
# in at least one developmental stage (non-expressed genes AND outlier removal)
FilterConst.both <-  Expressed(ExpressionSet = PhyloExpressionSetExample,
                               cut.off       = c(8000,12000),
                               comparison    = "both", 
                               method        = "const")

nrow(FilterConst.both) # check number of retained genes
#> [1] 2
```

When selecting `comparison = 'both'`, the `cut.off` argument receives 2 threshold values:
the _below_ `cut.off` as first element and the _above_ `cut.off` as second element. In this case
`cut.off = c(8000,12000)`. Here, only 2 genes fulfill these criteria.


Analogously, users can specify the number of stages that should fulfill the threshold criteria using
the `n-set` method.


```{r,eval = FALSE}
# remove genes that have an expression level below 8000
# in at least 5 developmental stages (in this case: n = 2 stages fulfilling the criteria)
FilterNSet <- Expressed(ExpressionSet = PhyloExpressionSetExample,
                        cut.off       = 8000,
                        method        = "n-set",
                        comparison    = "below",
                        n             = 2)

nrow(FilterNSet) # check number of retained genes
#> [1] 20
```

Here, 20 genes fulfill these criteria.


## Compute the Statistical Significance of Each Replicate Combination

In some cases (high variability of replicates) it might be useful to verify that there is no sequence of replicates (for all possible combinations of replicates) that results in a non-significant `TAI` or `TDI` pattern, when the initial pattern with combined replicates was shown to be significant.

The `CombinatorialSignificance()` function implemented in `myTAI` allows users to compute the p-values quantifying the statistical significance of the underlying pattern for all combinations of replicates.


### A small Example:

Assume a `PhyloExpressionSet` stores 3 developmental stages with 3 replicates measured for each stage. The 9 replicates in total are denoted as: $1.1, 1.2, 1.3, 2.1, 2.2, 2.3, 3.1, 3.2, 3.3$. Now the function computes the statistical significance of each pattern derived by the corresponding combination of replicates, e.g.

- 1.1, 2.1, 3.1 : p-value for combination 1

- 1.1, 2.2, 3.1 : p-value for combination 2

- 1.1, 2.3, 3.1 : p-value for combination 3

- 1.2, 2.1, 3.1 : p-value for combination 4

- 1.2, 2.2, 3.1 : p-value for combination 5

- 1.2, 2.3, 3.1 : p-value for combination 6

- 1.3, 2.1, 3.1 : p-value for combination 7

- 1.3, 2.2, 3.1 : p-value for combination 8

- 1.3, 2.3, 3.1 : p-value for combination 9

- ...

This procedure yields 27 p-values for the $3^3$ ($n^m$) replicate combinations, where
$n$ denotes the number of replicates per stage and $m$ denotes the number of developmental stages.

Note that for a large number of stages/experiments and a large number of replicates the number of p-values to compute grows as $n^m$. For 11 stages and 4 replicates, $4^{11} = 4194304$ p-values have to be computed. Each p-value computation itself is based on a permutation test running with 1,000, 10,000, or more permutations. Be aware that this might take some time.
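
The number of replicate combinations can be verified with a few lines of base R. The sketch below enumerates one replicate per stage with `expand.grid()` and counts the combinations as the product of the replicate numbers per stage. The list `replicate.labels` is a made-up illustration, and the enumeration order of `expand.grid()` may differ from the order in which `CombinatorialSignificance()` processes the combinations.

```r
# replicate labels for 3 stages with 3 replicates each
replicate.labels <- list(stage_1 = c("1.1", "1.2", "1.3"),
                         stage_2 = c("2.1", "2.2", "2.3"),
                         stage_3 = c("3.1", "3.2", "3.3"))

# enumerate every combination that picks one replicate per stage
combinations <- expand.grid(replicate.labels, stringsAsFactors = FALSE)
head(combinations)
nrow(combinations)

# the number of combinations is the product of the replicate numbers per stage
prod(c(3, 3, 3))    # 3 stages, 3 replicates each
prod(rep(4, 11))    # 11 stages, 4 replicates each
prod(c(2, 3, 2))    # variable numbers of replicates per stage
```

Computing this product before running the analysis gives a quick estimate of how many permutation tests will be performed.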

The p-value vector returned by this function can then be used to plot the p-values to see whether a critical value $\alpha$ is exceeded or not (e.g. $\alpha = 0.05$).

```{r,eval=FALSE}
# load a standard PhyloExpressionSet
data(PhyloExpressionSetExample)

# we assume that the PhyloExpressionSetExample
# consists of 3 developmental stages
# and 2 replicates for stage 1, 3 replicates for stage 2,
# and 2 replicates for stage 3
# FOR REAL ANALYSES PLEASE USE: permutations = 1000 or 10000
# BUT NOTE THAT THIS TAKES MUCH MORE COMPUTATION TIME
p.vector <- CombinatorialSignificance(ExpressionSet = PhyloExpressionSetExample,
                                      replicates    = c(2,3,2),
                                      TestStatistic = "FlatLineTest",
                                      permutations  = 100,
                                      parallel      = FALSE)
```

```
 [1] 2.436296e-03 2.288593e-02 1.608399e-03 1.185615e-02 1.835306e-06 1.077012e-05
 [7] 2.025515e-07 5.148342e-07 1.654885e-07 6.251145e-06 9.265520e-10 1.047479e-06
```

Users will observe that none of the replicate combinations resulted in p-values > 0.05 and thus
we can assume that the phylotranscriptomic pattern computed based on collapsed replicates
is not biased by insignificant replicate combinations.

```{r,eval = FALSE}
any(p.vector > 0.05)
#> [1] FALSE
```
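
A visual check against the critical value could be sketched as follows. The p-values are copied from the example output above so that the code runs without `myTAI`; since they stem from a permutation test with only 100 permutations, values obtained in a new run will differ. The plotting choices (log scale, dashed line at $\alpha$) are merely one possible presentation.

```r
# p-values copied from the example output above
p.vector <- c(2.436296e-03, 2.288593e-02, 1.608399e-03, 1.185615e-02,
              1.835306e-06, 1.077012e-05, 2.025515e-07, 5.148342e-07,
              1.654885e-07, 6.251145e-06, 9.265520e-10, 1.047479e-06)

alpha <- 0.05

# plot the p-values on a log scale and mark the critical value
plot(p.vector, log = "y", type = "b", pch = 19,
     ylim = c(min(p.vector), 1),
     xlab = "Replicate combination", ylab = "p-value")
abline(h = alpha, lty = 2, col = "red")

# indices of combinations exceeding the critical value (none in this example)
which(p.vector > alpha)
```

All points lie below the dashed line, which mirrors the result of `any(p.vector > 0.05)`.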

`CombinatorialSignificance()` can perform all significance tests introduced in the [Introduction](Introduction.html) and [Intermediate](Intermediate.html) vignettes.

Furthermore, the `parallel` argument allows users to perform significance computations in
parallel on a multicore machine. This will speed up p-value computations for a large number of combinations.