---
title: "Intermediate Phylotranscriptomics Analyses"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Intermediate Phylotranscriptomics Analyses}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}
---

```{r, echo = FALSE, message = FALSE}
library(myTAI)
options(width = 750)
knitr::opts_chunk$set(
  comment = "#>",
  error = FALSE,
  tidy = FALSE)
```

In the __Introduction__ vignette users learned how to perform the most fundamental computations
and visualizations for phylotranscriptomics analyses using the `myTAI` package.

This __Intermediate__ vignette introduces more detailed analyses and more specialized techniques 
to investigate the observed phylotranscriptomics patterns (`TAI`, `TDI`, `RE`, etc.) and introduces the statistical framework to quantify the significance of observed phenomena.

- [1. Details: Flat Line Test](#flat-line-test)
- [2. Details: Reductive Hourglass Test](#reductive-hourglass-test)
- [3. Details: Reductive Early Conservation Test](#reductive-early-conservation-test)
- [4. Details: Reverse Hourglass Test](#reverse-hourglass-test)
- [5. Data Transformation for log(TAI) and sqrt(TAI) patterns](#data-transformation)


## Investigating the Statistical Significance of Phylotranscriptomic Patterns

Three methods have been proposed to quantify the statistical significance of the observed phylotranscriptomics patterns
(Quint et al., 2012; Drost et al., 2015).

* __Flat Line Test__
* __Reductive Hourglass Test__
* __Reductive Early Conservation Test__

Here, we will build the test statistic of each test step by step so that future modifications
or new test statistics can be built upon the existing methods implemented in the `myTAI` package.


### Flat Line Test

The __Flat Line Test__ is a permutation test quantifying the statistical significance of an observed phylotranscriptomic pattern. The goal is to detect any evolutionary signal within a developmental time course that significantly deviates from a flat line.

To build the test statistic we start with a standard PhyloExpressionSet.
The `myTAI` package provides an example PhyloExpressionSet named `PhyloExpressionSetExample`:

```{r,eval=FALSE}
library(myTAI)

# load an example PhyloExpressionSet stored in the myTAI package
data(PhyloExpressionSetExample) 

# look at the standardized data set format
head(PhyloExpressionSetExample, 3) 
```

```
  Phylostratum      GeneID   Zygote Quadrant Globular     Heart  Torpedo      Bent    Mature
1            1 at1g01040.2 2173.635 1911.200 1152.555 1291.4224 1000.253  962.9772 1696.4274
2            1 at1g01050.1 1501.014 1817.309 1665.309 1564.7612 1496.321 1114.6435 1071.6555
3            1 at1g01070.1 1212.793 1233.002  939.200  929.6195  864.218  877.2060  894.8189
```

Users will observe that the first column of the `PhyloExpressionSetExample` stores the Phylostratum assignments of the corresponding genes.
The permutation test is based on random sampling of the Phylostratum assignment of genes. The underlying assumption is that
the `TAI` profile of correctly assigned Phylostrata is significantly deviating from `TAI` profiles based on randomly assigned Phylostrata.

```{r,eval=FALSE}
# TAI profile of correctly assigned Phylostrata
TAI(PhyloExpressionSetExample) 
```

```
  Zygote Quadrant Globular    Heart  Torpedo     Bent   Mature 
3.229942 3.225614 3.107135 3.116693 3.073993 3.176511 3.390334
```
Visualization: 

```{r, fig.width= 7, fig.height= 5,eval=FALSE}
data(PhyloExpressionSetExample)
# Visualize the TAI profile of correctly assigned Phylostrata
PlotSignature( ExpressionSet = PhyloExpressionSetExample, 
               p.value       = FALSE ) 
```


```{r,eval=FALSE, echo=FALSE}
# randomly permute the Phylostratum assignment of all genes 
randomPhyloExpressionSetExample <- data.frame( sample(PhyloExpressionSetExample[ , "Phylostratum"]), 
                                               PhyloExpressionSetExample[, 2:9] )

# TAI profile based on randomly assigned Phylostrata
TAI(randomPhyloExpressionSetExample) 
```

```
  Zygote Quadrant Globular    Heart  Torpedo     Bent   Mature 
3.543700 3.573999 3.554707 3.543376 3.582905 3.562813 3.523409 
```
Visualization: 

```{r, fig.width= 7, fig.height= 5,eval=FALSE}
# Visualize the TAI profile based on randomly assigned Phylostrata
PlotSignature( ExpressionSet =  randomPhyloExpressionSetExample, 
               p.value       = FALSE ) 
```

Users will observe that the visual pattern of the correctly assigned `TAI` profile and the randomly assigned `TAI` profile differ qualitatively.

Now we investigate the variance of the two observed patterns.

```{r,eval=FALSE, echo=FALSE}
# Variance of the TAI profile based on correctly assigned Phylostrata
var(TAI(PhyloExpressionSetExample)) 
```

```
[1] 0.01147725
```

```{r,eval=FALSE, echo=FALSE}
# Variance of the TAI profile based on randomly assigned Phylostrata
var(TAI(randomPhyloExpressionSetExample)) 
```

```
[1] 0.0004102549
```

We observe that the variance of the randomly assigned TAI profile is much smaller than the variance of the correctly assigned TAI profile. Here we use the variance to quantify the _flatness_ of a given TAI profile. In theory the variance of a perfect flat line would be zero. So any TAI profile that is close to zero would resemble a flat line. But how exactly are the variances of randomly assigned TAI profiles distributed? For this purpose the `bootMatrix()` function was implemented.

The `bootMatrix()` takes an PhyloExpressionSet or DivergenceExpressionSet as input and computes N `TAI` or `TDI` profiles based on randomly assigned Phylostrata or Divergence-Strata.

```{r,eval=FALSE, echo=FALSE}
# Compute 1000 TAI profiles based on randomly assigned Phylostrata
randomTAIs <- bootMatrix( ExpressionSet = PhyloExpressionSetExample, 
                          permutations  = 1000 )

head(randomTAIs)
```

```
    Zygote Quadrant Globular    Heart  Torpedo     Bent   Mature
1 3.578000 3.577732 3.552230 3.556756 3.555627 3.553580 3.510978
2 3.590339 3.564544 3.589012 3.598754 3.583721 3.597553 3.633530
3 3.615284 3.616286 3.569486 3.566816 3.575761 3.554789 3.601522
4 3.501172 3.506554 3.486261 3.491282 3.531536 3.541634 3.494997
5 3.450857 3.458339 3.447274 3.463692 3.444261 3.468433 3.497926
6 3.539454 3.543827 3.595298 3.588149 3.582898 3.588062 3.565487
```
Based on this `booMatrix` we can compute the variance of each random TAI profile.

```{r, fig.width= 7, fig.height= 5,eval=FALSE}
# compute the variance of the random TAI profile for each row
variance_vector <- apply(randomTAIs, 1 , var)

# and visualize the distribution of variances
hist(variance_vector, breaks = 100)
```

Now it is interesting to see where we can find the variance of the correctly assigned TAI.


```{r, fig.width= 7, fig.height= 5,eval=FALSE}
# variance of the TAI profile based on correctly assigned Phylostrata
var_real <- var(TAI(PhyloExpressionSetExample)) 

# visualize the distribution of variances
hist( x      = c(variance_vector,var_real), 
      breaks = 100, 
      xlab   = "variance", 
      main   = "Histogram of variance_vector" )

# and plot a red line at the position where we can find the 
# real variance
abline(v = var_real, lwd = 5, col = "red")
```

This plot illustrates that variances based on random `TAI` profiles seem to have a smaller variance than the variance based on the correct `TAI` profile. To obtain a p-value that now quantifies this difference, we need to fit the histogram of `variance_vector` with a specific probability distribution. 


Visually it would be possible to choose a gamma distribution to fit the histogram of `variance_vector`. To validate this choice a [Cullen and Frey graph](https://openlibrary.org/works/OL1978341W/Probabilistic_techniques_in_exposure_assessment) provided by the __fitdistrplus__ package can be used.


```{r, fig.width= 7, fig.height= 5,eval=FALSE}
# install.packages("fitdistrplus")

# plot a Cullen and Frey graph
fitdistrplus::descdist(variance_vector)
```

Based on the observation that a gamma distribution is a suitable fit for `variance_vector`, we can now
estimate the parameters of the gamma distribution that fits the data.

```{r, fig.width= 7, fig.height= 5,eval=FALSE}
# estimate the parameters: shape and rate using 'moment matching estimation'
gamma_MME <- fitdistrplus::fitdist(variance_vector,distr = "gamma", method = "mme")
# estimate shape:
shape <- gamma_MME$estimate[1]
# estimate the rate:
rate <- gamma_MME$estimate[2]

# define an expression written as function as input for the curve() function
gamma_distr <- function(x){ return(dgamma(x = x, shape = shape, rate = rate)) }

# plot the density function and the histogram of variance_vector
curve( expr = gamma_distr,
       xlim = c(min(variance_vector),max(c(variance_vector,var_real))),
       col  = "steelblue",
       lwd  = 5,
       xlab = "Variances", 
       ylab = "Frequency" )

# plot the histogram of variance_vector
histogram <- hist(variance_vector,prob = TRUE,add = TRUE, breaks = 100)
rug(variance_vector)

# plot a red line at the position where we can find the real variance
abline(v = var_real, lwd = 5, col = "red")
```

Using the gamma distribution with estimated parameters the corresponding p-value of `var_real` can
be computed.

```{r,eval=FALSE}
# p-value of var_real
pgamma(var_real, shape = shape,rate = rate, lower.tail = FALSE)
```


Hence, the variance of the correct TAI profile significantly deviates from random TAI profiles 
and this allows us to assume that the underlying TAI profile captures a real evolutionary signal.

### Using the `FlatLineTest()` Function

This entire procedure of computing the p-value having the variance of TAI profiles as test statistic is done by the `FlatLineTest()` function.

```{r,eval=FALSE}
# Perform the FlatLineTest
FlatLineTest( ExpressionSet = PhyloExpressionSetExample, 
              permutations  = 1000 )
```

This function returns the p-value of the test statistic.


Additionally the `FlatLineTest()` function allows to investigate the goodness
of the test statistic.


```{r, fig.width= 7, fig.height= 3,eval=FALSE}

# perform the FlatLineTest and investigate the goodness of the test statistic
FlatLineTest( ExpressionSet = PhyloExpressionSetExample, 
              permutations  = 1000, 
              plotHistogram = TRUE )

```

The `plotHistogram` argument specifies whether analytics plots shall be drawn
to quantify the goodness of the test statistic returned by the `FlatLineTest`.


The three resulting plots show:

* a [Cullen and Frey graph](https://openlibrary.org/works/OL1978341W/Probabilistic_techniques_in_exposure_assessment)

* a histogram of the test statistic and the corresponding [gamma distribution](https://en.wikipedia.org/wiki/Gamma_distribution) that
was fitted to the test statistic

* a plot showing the p-values (`p_flt`) for 10 individual runs.
Since the underlying test statistic is generated by a permutation test,
the third plot returned by `FlatLineTest()` shows the influence of different permutations to the corresponding p-value

In other words, to test whether or not the underlying permutation of the permutation test
is causing the significance of the p-value, you can specify the `runs` argument
within the `FlatLineTest()` function to perform several independent runs.
In case there exists a permutation that causes a previous significant p-value to
become non-significant, the corresponding phylotranscriptomic pattern shouldn't be considered as 
statistically significant.


### Reductive Hourglass Test

The __Reductive Hourglass Test__ has been developed to statistically evaluate the existence of a phylotranscriptomic hourglass pattern based on TAI or TDI computations. 
The corresponding p-value quantifies the probability that a given TAI or TDI pattern (or any phylotranscriptomics pattern) does not follow an hourglass like shape. 
A p-value < 0.05 indicates that the corresponding phylotranscriptomics pattern does indeed follow an hourglass (high-low-high) shape.

To build the test statistic again we start with a standard PhyloExpressionSet.

```{r,eval=FALSE, echo=FALSE}
library(myTAI)

# load an example PhyloExpressionSet stored in the myTAI package
data(PhyloExpressionSetExample) 

# look at the standardized data set format
head(PhyloExpressionSetExample, 3) 
```

```
  Phylostratum      GeneID   Zygote Quadrant Globular     Heart  Torpedo      Bent    Mature
1            1 at1g01040.2 2173.635 1911.200 1152.555 1291.4224 1000.253  962.9772 1696.4274
2            1 at1g01050.1 1501.014 1817.309 1665.309 1564.7612 1496.321 1114.6435 1071.6555
3            1 at1g01070.1 1212.793 1233.002  939.200  929.6195  864.218  877.2060  894.8189

```

And again compute the `TAI()` profile of the `PhyloExpressionSetExample`.

```{r,eval=FALSE}
# TAI profile of correctly assigned Phylostrata
TAI(PhyloExpressionSetExample) 
```

```
  Zygote Quadrant Globular    Heart  Torpedo     Bent   Mature 
3.229942 3.225614 3.107135 3.116693 3.073993 3.176511 3.390334 

```
Visualization: 

```{r, fig.width= 7, fig.height= 5,eval=FALSE}
# visualize the TAI profile of correctly assigned Phylostrata
PlotSignature( ExpressionSet = PhyloExpressionSetExample, 
               p.value       = FALSE ) 
```

The Reductive Hourglass Test is a permutation test based on the following test statistic.

1) A set of developmental stages is partitioned into three modules - `early`, `mid`, and `late` - based on prior biological knowledge (see Drost et al., 2015 for details).

2) The mean TAI or TDI value for each of the three modules $T_{early}$, $T_{mid}$, and $T_{late}$ are computed.

3) The two differences D1 = $T_{early}$ - $T_{mid}$ and D2 = $T_{late}$ - $T_{mid}$ are calculated.

4) The minimum $D_{min}$ of D1 and D2 is computed as final test statistic of the __Reductive Hourglass Test__.

In order to determine the statistical significance of an observed minimum difference $D_{min}$ the following permutation test was performed. Based on the `bootMatrix()` $D_{min}$ is calculated from each of the permuted TAI or TDI profiles, approximated by a Gaussian distribution with method of moments estimated parameters returned by `fitdistrplus::fitdist()`, and the corresponding p-value is computed by `pnorm` given the estimated parameters of the Gaussian distribution. The goodness of fit for the random vector $D_{min}$ is statistically quantified by a [Lilliefors (Kolmogorov-Smirnov) test](https://amstat.tandfonline.com/doi/abs/10.1080/00031305.1986.10475419) for normality.


To perform the __Reductive Hourglass Test__ you can use the `ReductiveHourglassTest()` function.
Using this function you need to divide the given phylotranscriptomic pattern into
three developmental modules:

* early module
* mid module
* late module

This can be done using the `modules` argument: `module = list(early = 1:2, mid = 3:5, late = 6:7)`.
In this example (`PhyloExpressionSetExample`) we divide the corresponding developmental process
into the three modules:

* early module: Stages 1 - 2 = Zygote and Quadrant
* mid module: Stages 3 - 5 = Globular, Heart, and Torpedo
* late module: Stages 6 - 7 = Bent and Mature


```{r, fig.width= 7, fig.height= 5,eval=FALSE}
# Perform the Reductive Hourglass Test
ReductiveHourglassTest( ExpressionSet = PhyloExpressionSetExample,
                        modules       = list(early = 1:2, mid = 3:5, late = 6:7),
                        lillie.test   = TRUE ) 

```


The corresponding output shows the p-value returned by the __Reductive Hourglass Test__,
the standard deviation of randomly permuted TAI profiles returned by `bootMatrix()`
 (`apply(bootMatrix(PhyloExpressionSetExample) , 2 , sd)`) and in case the argument `lillie.test = TRUE`,
a logical value representing the goodness of fit statistic returned by the [Lilliefors (Kolmogorov-Smirnov) test](https://amstat.tandfonline.com/doi/abs/10.1080/00031305.1986.10475419) for normality. In case `lillie.test` is `TRUE` the corresponding 
[Lilliefors (Kolmogorov-Smirnov)](https://amstat.tandfonline.com/doi/abs/10.1080/00031305.1986.10475419) 
 test passed the goodness of fit criterion. In case `lillie.test` is `FALSE` the corresponding goodness of fit by a normal distribution is not statistically significant.

Analogous to the `plotHistogram` argument that is present in the `FlatLineTest()` function,
the `ReductiveHourglassTest()` function also takes an argument `plotHistogram`. When `plotHistogram = TRUE`,
the `ReductiveHourglassTest()` function returns a multi-plot showing:

* A Cullen and Frey skewness-kurtosis plot. 
This plot illustrates which distributions seem plausible to fit the resulting permutation vector $D_{min}$. 
Here a normal distribution seems most plausible.

* A histogram of $D_{min}$ combined with the density plot is visualized. 
$D_{min}$ is then fitted by a normal distribution. 
The corresponding parameters are estimated by moment matching estimation.

* A plot showing the p-values for N independent runs to verify that a specific p-value is biased by a specific permutation order.

* A bar plot showing the number of cases in which the underlying goodness of fit (returned by [Lilliefors (Kolmogorov-Smirnov) test](https://amstat.tandfonline.com/doi/abs/10.1080/00031305.1986.10475419) for normality) has shown to be significant (`TRUE`) or not significant (`FALSE`). 
This allows to quantify the permutation bias and their implications on the goodness of fit.


```{r, fig.width= 7, fig.height= 6,eval=FALSE}
# perform the Reductive Hourglass Test and plot the test statistic
ReductiveHourglassTest( ExpressionSet = PhyloExpressionSetExample,
                        modules       = list(early = 1:2, mid = 3:5, late = 6:7),
                        plotHistogram = TRUE, 
                        lillie.test   = TRUE ) 

```

The corresponding output shows the `summary statistics` of the fitted normal distribution
as well as the p-value, standard deviation, and [Lilliefors (Kolmogorov-Smirnov) test](https://amstat.tandfonline.com/doi/abs/10.1080/00031305.1986.10475419) result.

This example output nicely illustrates that although the Lilliefors (Kolmogorov-Smirnov) test for normality is violated for some permutations, the Cullen and Frey graph shows that there is no better approximation than a normal distribution (which is also supported visually by investigating the fitted frequency distribution). The corresponding p-value returned by `ReductiveHourglassTest()` is significant and illustrates that the observed phylotranscriptomic pattern of PhyloExpressionSetExample does follow the Hourglass Model assumption.

### Reductive Early Conservation Test

The Early Conservation Test has been developed to statistically evaluate the existence of a monotonically increasing phylotranscriptomic pattern based on TAI or TDI computations. 
The corresponding p-value quantifies the probability that a given TAI or TDI pattern (or any phylotranscriptomic pattern) does not follow an early conservation like pattern. 
A p-value < 0.05 indicates that the corresponding phylotranscriptomics pattern does indeed follow an early conservation (low-high-high) shape.

To build the test statistic again we start with a standard PhyloExpressionSet.

```{r,eval=FALSE}
library(myTAI)

# load an example PhyloExpressionSet stored in the myTAI package
data(PhyloExpressionSetExample) 

# look at the standardized data set format
head(PhyloExpressionSetExample, 3) 
```

And again compute the `TAI()` profile of the `PhyloExpressionSetExample`.

```{r,eval=FALSE}
# TAI profile of correctly assigned Phylostrata
TAI(PhyloExpressionSetExample) 
```

```
  Zygote Quadrant Globular    Heart  Torpedo     Bent   Mature 
3.229942 3.225614 3.107135 3.116693 3.073993 3.176511 3.390334
```
Visualization: 

```{r, fig.width= 7, fig.height= 5,eval=FALSE}
# Visualize the TAI profile of correctly assigned Phylostrata
PlotSignature( ExpressionSet = PhyloExpressionSetExample, 
               p.value       = FALSE ) 

```

The reductive early conservation test is a permutation test based on the following test statistic.

1) A set of developmental stages is partitioned into three modules - `early`, `mid`, and `late` - based on prior biological knowledge.

2) The mean TAI or TDI value for each of the three modules $T_{early}$, $T_{mid}$, and $T_{late}$ are computed.

3) The two differences D1 = $T_{mid}$ - $T_{early}$ and D2 = $T_{late}$ - $T_{early}$ are calculated.

4) The minimum $D_{min}$ of D1 and D2 is computed as final test statistic of the __Reductive Early Conservation Test__.

In order to determine the statistical significance of an observed minimum difference $D_{min}$ the following permutation test was performed. Based on the `bootMatrix()` $D_{min}$ is calculated from each of the permuted TAI or TDI profiles, approximated by a Gaussian distribution with method of moments estimated parameters returned by `fitdistrplus::fitdist()`, and the corresponding p-value is computed by `pnorm` given the estimated parameters of the Gaussian distribution. The goodness of fit for the random vector $D_{min}$ is statistically quantified by an [Lilliefors (Kolmogorov-Smirnov) test](https://amstat.tandfonline.com/doi/abs/10.1080/00031305.1986.10475419) for normality.

To perform the __Reductive Early Conservation Test__ you can use the `EarlyConservationTest()` function.
Using this function you need to divide the given phylotranscriptomics pattern into
three developmental modules:

* early module
* mid module
* late module

This can be done using the `modules` argument: `module = list(early = 1:2, mid = 3:5, late = 6:7)`.
In this example (`PhyloExpressionSetExample`) we divide the corresponding developmental process
into the three modules:

* early module: Stages 1 - 2 = Zygote and Quadrant
* mid module: Stages 3 - 5 = Globular, Heart, and Torpedo
* late module: Stages 6 - 7 = Bent and Mature


```{r, fig.width= 7, fig.height= 5,eval=FALSE}

# Perform the Reductive Early Conservation Test
EarlyConservationTest( ExpressionSet = PhyloExpressionSetExample,
                       modules       = list(early = 1:2, mid = 3:5, late = 6:7),
                       lillie.test   = TRUE ) 

```

Analogous to the `plotHistogram` argument that is present in the `FlatLineTest()` and `ReductiveHourglassTest()` function,
the `EarlyConservationTest()` function also takes an argument `plotHistogram`. When `plotHistogram = TRUE`,
the `EarlyConservationTest()` function returns a multi-plot showing:

* A Cullen and Frey skewness-kurtosis plot. 
This plot illustrates which distributions seem plausible to fit the resulting permutation vector $D_{min}$. 
Again a normal distribution seems most appropriate.

* A histogram of $D_{min}$ combined with the density plot is visualized. 
$D_{min}$ is then fitted by a normal distribution. 
The corresponding parameters are estimated by moment matching estimation.

* A plot showing the p-values for N independent runs to verify that a specific p-value is biased by a specific permutation order.

* A bar plot showing the number of cases in which the underlying goodness of fit (returned by [Lilliefors (Kolmogorov-Smirnov) test](https://amstat.tandfonline.com/doi/abs/10.1080/00031305.1986.10475419) for normality) has shown to be significant (`TRUE`) or not significant (`FALSE`). 
This allows to quantify the permutation bias and their implications on the goodness of fit.


```{r, fig.width= 7, fig.height= 6,eval=FALSE}
# perform the Reductive Early Conservation Test and plot the test statistic
EarlyConservationTest( ExpressionSet = PhyloExpressionSetExample,
                       modules       = list(early = 1:2, mid = 3:5, late = 6:7),
                       plotHistogram = TRUE, 
                       lillie.test   = TRUE ) 

```

This example output nicely illustrates that although the [Lilliefors (Kolmogorov-Smirnov) test](https://amstat.tandfonline.com/doi/abs/10.1080/00031305.1986.10475419) for normality is violated, the [Cullen and Frey graph](https://openlibrary.org/works/OL1978341W/Probabilistic_techniques_in_exposure_assessment) shows that there is no better approximation
than a normal distribution (which is also supported visually by investigating the fitted frequency distribution). The corresponding p-value
returned by the `EarlyConservationTest()` is highly non-significant and illustrates that the observed phylotranscriptomic pattern of `PhyloExpressionSetExample`
does not follow the Early Conservation Model assumption. 

This example shall illustrate that finding the right test statistic is a multi-step process of investigating different
properties of the underlying permutation test. Although single aspects might fit or fit not corresponding criteria,
the overall impression (sum of all individual analyses) must be considered to obtain a valid p-value.

### Reverse Hourglass Test

The __Reverse Hourglass Test__ has been developed to statistically evaluate the existence of a `reverse hourglass pattern` based on TAI or TDI computations. 
The corresponding p-value quantifies the probability that a given TAI or TDI pattern (or any phylotranscriptomics pattern) does not follow an reverse hourglass like shape. 
A p-value < 0.05 indicates that the corresponding phylotranscriptomics pattern does indeed follow a reverse hourglass (low-high-low) shape.

To build the test statistic again we start with a standard PhyloExpressionSet.

```{r,eval=FALSE, echo=FALSE}
library(myTAI)

# load an example PhyloExpressionSet stored in the myTAI package
data(PhyloExpressionSetExample) 

# look at the standardized data set format
head(PhyloExpressionSetExample, 3) 
```

```
  Phylostratum      GeneID   Zygote Quadrant Globular     Heart  Torpedo      Bent    Mature
1            1 at1g01040.2 2173.635 1911.200 1152.555 1291.4224 1000.253  962.9772 1696.4274
2            1 at1g01050.1 1501.014 1817.309 1665.309 1564.7612 1496.321 1114.6435 1071.6555
3            1 at1g01070.1 1212.793 1233.002  939.200  929.6195  864.218  877.2060  894.8189

```

And again compute the `TAI()` profile of the `PhyloExpressionSetExample`.

```{r,eval=FALSE}
# TAI profile of correctly assigned Phylostrata
TAI(PhyloExpressionSetExample) 
```

```
  Zygote Quadrant Globular    Heart  Torpedo     Bent   Mature 
3.229942 3.225614 3.107135 3.116693 3.073993 3.176511 3.390334 

```
Visualization: 

```{r, fig.width= 7, fig.height= 5,eval=FALSE}
# visualize the TAI profile of correctly assigned Phylostrata
PlotSignature( ExpressionSet = PhyloExpressionSetExample, 
               p.value       = FALSE ) 
```

The Reverse Hourglass Test is a permutation test based on the following test statistic.

1) A set of developmental stages is partitioned into three modules - `early`, `mid`, and `late` - based on prior biological knowledge (see Drost et al., 2015 for details).

2) The mean TAI or TDI value for each of the three modules $T_{early}$, $T_{mid}$, and $T_{late}$ are computed.

3) The two differences D1 = $T_{mid}$ - $T_{early}$ and D2 = $T_{mid}$ - $T_{late}$ are calculated.

4) The minimum $D_{min}$ of D1 and D2 is computed as final test statistic of the __Reverse Hourglass Test__.

In order to determine the statistical significance of an observed minimum difference $D_{min}$ the following permutation test was performed. Based on the `bootMatrix()` $D_{min}$ is calculated from each of the permuted TAI or TDI profiles, approximated by a Gaussian distribution with method of moments estimated parameters returned by `fitdistrplus::fitdist()`, and the corresponding p-value is computed by `pnorm` given the estimated parameters of the Gaussian distribution. The goodness of fit for the random vector $D_{min}$ is statistically quantified by a [Lilliefors (Kolmogorov-Smirnov) test](https://amstat.tandfonline.com/doi/abs/10.1080/00031305.1986.10475419) for normality.


To perform the __Reverse Hourglass Test__ you can use the `ReverseHourglassTest()` function.
Using this function you need to divide the given phylotranscriptomic pattern into
three developmental modules:

* early module
* mid module
* late module

This can be done using the `modules` argument: `module = list(early = 1:2, mid = 3:5, late = 6:7)`.
In this example (`PhyloExpressionSetExample`) we divide the corresponding developmental process
into the three modules:

* early module: Stages 1 - 2 = Zygote and Quadrant
* mid module: Stages 3 - 5 = Globular, Heart, and Torpedo
* late module: Stages 6 - 7 = Bent and Mature


```{r, fig.width= 7, fig.height= 5,eval=FALSE}
# Perform the Reverse Hourglass Test
ReverseHourglassTest( ExpressionSet = PhyloExpressionSetExample,
                        modules       = list(early = 1:2, mid = 3:5, late = 6:7),
                        lillie.test   = TRUE ) 

```

```
$p.value
[1] 1

$std.dev
[1] 0.05598325 0.05469016 0.05386737 0.05248817 0.05161525 0.05411684 0.05605765

$lillie.test
[1] TRUE
```

The corresponding output shows the p-value returned by the __Reductive Hourglass Test__,
the standard deviation of randomly permuted TAI profiles returned by `bootMatrix()`
 (`apply(bootMatrix(PhyloExpressionSetExample) , 2 , sd)`) and in case the argument `lillie.test = TRUE`,
a logical value representing the goodness of fit statistic returned by the [Lilliefors (Kolmogorov-Smirnov) test](https://amstat.tandfonline.com/doi/abs/10.1080/00031305.1986.10475419) for normality. In case `lillie.test` is `TRUE` the corresponding 
[Lilliefors (Kolmogorov-Smirnov)](https://amstat.tandfonline.com/doi/abs/10.1080/00031305.1986.10475419) 
 test passed the goodness of fit criterion. In case `lillie.test` is `FALSE` the corresponding goodness of fit by a normal distribution is not statistically significant.

Analogous to the `plotHistogram` argument that is present in the `FlatLineTest()` function,
the `ReductiveHourglassTest()` function also takes an argument `plotHistogram`. When `plotHistogram = TRUE`,
the `ReductiveHourglassTest()` function returns a multi-plot showing:

* A Cullen and Frey skewness-kurtosis plot. 
This plot illustrates which distributions seem plausible to fit the resulting permutation vector $D_{min}$. 
Here a normal distribution seems most plausible.

* A histogram of $D_{min}$ combined with the density plot is visualized. 
$D_{min}$ is then fitted by a normal distribution. 
The corresponding parameters are estimated by moment matching estimation.

* A plot showing the p-values for N independent runs to verify that a specific p-value is biased by a specific permutation order.

* A bar plot showing the number of cases in which the underlying goodness of fit (returned by [Lilliefors (Kolmogorov-Smirnov) test](https://amstat.tandfonline.com/doi/abs/10.1080/00031305.1986.10475419) for normality) has shown to be significant (`TRUE`) or not significant (`FALSE`). 
This allows to quantify the permutation bias and their implications on the goodness of fit.


```{r, fig.width= 7, fig.height= 6,eval=FALSE}
# perform the Reverse Hourglass Test and plot the test statistic
ReverseHourglassTest( ExpressionSet = PhyloExpressionSetExample,
                        modules       = list(early = 1:2, mid = 3:5, late = 6:7),
                        plotHistogram = TRUE, 
                        lillie.test   = TRUE ) 

```

The corresponding output shows the `summary statistics` of the fitted normal distribution
as well as the p-value, standard deviation, and [Lilliefors (Kolmogorov-Smirnov) test](https://amstat.tandfonline.com/doi/abs/10.1080/00031305.1986.10475419) result.

This example output nicely illustrates that although the Lilliefors (Kolmogorov-Smirnov) test for normality is violated for some permutations, the Cullen and Frey graph shows that there is no better approximation than a normal distribution (which is also supported visually by investigating the fitted frequency distribution). The corresponding p-value returned by `ReverseHourglassTest()` is significant and illustrates that the observed phylotranscriptomic pattern of PhyloExpressionSetExample does follow the Hourglass Model assumption.

## Data Transformation

Motivated by the discussion raised by Piasecka et al., 2013, the influence of gene expression transformation on the global phylotranscriptomic patterns does not seem negligible. Hence, different transformations can result in qualitatively different TAI or TDI patterns.

Initially, the TAI and TDI formulas were defined for absolute expression levels. So using the initial TAI and TDI formulas with transformed expression levels can result in qualitatively different patterns when compared with non-transformed expression levels, but might also belong to a different class of models, since different valid expression level transformation functions result in different patterns.

The purpose of the `tf()` function is to allow the user to study the qualitative impact of different transformation functions on the global TAI and TDI pattern,
or on any subsequent phylotranscriptomic analysis.

The examples using the `PhyloExpressionSetExample` data set show that using common gene expression transformation functions: log2 (Quackenbush, 2001 and 2002), sqrt (Yeung et al., 2001), boxcox, or inverse hyperbolic sine transformation, each transformation results in qualitatively different patterns. Nevertheless, for each resulting pattern the statistical significance can be tested using either the `FlatLineTest()`, `ReductiveHourglassTest()`, or `EarlyConservationTest()` (Drost et al., 2015) to quantify the significance of observed patterns.

The `tf()` function takes a standard `PhyloExpressionSet` or `DivergenceExpressionSet` and transformation function and returns the corresponding `ExpressionSet` with
transformed gene expression levels.

```{r, fig.width= 7, fig.height= 5}
library(myTAI)

data(PhyloExpressionSetExample)

# a simple example is to transform the gene expression levels of a given PhyloExpressionSet
# using a sqrt or log2 transformation

PES.sqrt <- tf(PhyloExpressionSetExample, sqrt)

head(PES.sqrt)
```

```{r, fig.width= 7, fig.height= 5, eval=FALSE}

PES.log2 <- tf(PhyloExpressionSetExample, log2)

head(PES.log2)
```

```
  Phylostratum      GeneID    Zygote  Quadrant  Globular     Heart   Torpedo      Bent    Mature
1            1 at1g01040.2 11.085894 10.900263 10.170620 10.334745  9.966149  9.911358 10.728284
2            1 at1g01050.1 10.551722 10.827588 10.701574 10.611727 10.547204 10.122367 10.065625
3            1 at1g01070.1 10.244117 10.267960  9.875289  9.860497  9.755251  9.776772  9.805452
4            1 at1g01080.2  9.989991  9.870956 10.206206 10.376639 10.443610 10.330888  9.750306
5            1 at1g01090.1 13.479852 14.034298 15.068722 15.279598 15.779093 16.031451 12.924174
6            1 at1g01120.1  9.721170  9.621306  9.747567  9.863595  9.880877  9.765307  9.630730
```

```{r, fig.width= 7, fig.height= 5,eval=FALSE}
# in case a given PhyloExpressionSet already stores gene expression levels
# that are log2 transformed and need to be re-transformed to absolute
# expression levels, to perform subsequent phylotranscriptomics analyses
# (that are defined for absolute expression levels), 
# one can re-transform a PhyloExpressionSet like this:

PES.absolute <- tf(PES.log2 , function(x) 2^x)

# which should be the same as  PhyloExpressionSetExample :
head(PhyloExpressionSetExample)
head(PES.absolute)
```

```
> head(PhyloExpressionSetExample)
  Phylostratum      GeneID     Zygote   Quadrant   Globular      Heart    Torpedo       Bent    Mature
1            1 at1g01040.2  2173.6352  1911.2001  1152.5553  1291.4224  1000.2529   962.9772 1696.4274
2            1 at1g01050.1  1501.0141  1817.3086  1665.3089  1564.7612  1496.3207  1114.6435 1071.6555
3            1 at1g01070.1  1212.7927  1233.0023   939.2000   929.6195   864.2180   877.2060  894.8189
4            1 at1g01080.2  1016.9203   936.3837  1181.3381  1329.4734  1392.6429  1287.9746  861.2605
5            1 at1g01090.1 11424.5667 16778.1685 34366.6493 39775.6405 56231.5689 66980.3673 7772.5617
6            1 at1g01120.1   844.0414   787.5929   859.6267   931.6180   942.8453   870.2625  792.7542
> head(PES.absolute)
  Phylostratum      GeneID     Zygote   Quadrant   Globular      Heart    Torpedo       Bent    Mature
1            1 at1g01040.2  2173.6352  1911.2001  1152.5553  1291.4224  1000.2529   962.9772 1696.4274
2            1 at1g01050.1  1501.0141  1817.3086  1665.3089  1564.7612  1496.3207  1114.6435 1071.6555
3            1 at1g01070.1  1212.7927  1233.0023   939.2000   929.6195   864.2180   877.2060  894.8189
4            1 at1g01080.2  1016.9203   936.3837  1181.3381  1329.4734  1392.6429  1287.9746  861.2605
5            1 at1g01090.1 11424.5667 16778.1685 34366.6493 39775.6405 56231.5689 66980.3673 7772.5617
6            1 at1g01120.1   844.0414   787.5929   859.6267   931.6180   942.8453   870.2625  792.7542
```

When transforming the `ExpressionMatrix` of the `PhyloExpressionSetExample` using different transformation functions,
the resulting phylotranscriptomic patterns qualitatively differ:

### log2 transformation (TAI)

```{r, fig.width= 7, fig.height= 5, eval=FALSE}
data(PhyloExpressionSetExample)
# plotting the TAI using log2 transformed expression levels
# and performing the Flat Line Test to obtain the p-value
PlotSignature( ExpressionSet = tf(PhyloExpressionSetExample, log2),
               TestStatistic = "FlatLineTest",
               xlab          = "Ontogeny", 
               ylab          = "TAI" )


```

### sqrt transformation (TAI)

```{r, fig.width= 7, fig.height= 5, eval=FALSE}
data(PhyloExpressionSetExample)
# plotting the TAI using sqrt transformed expression levels
# and performing the Flat Line Test to obtain the p-value
PlotSignature( ExpressionSet = tf(PhyloExpressionSetExample, sqrt),
               TestStatistic = "FlatLineTest",
               xlab          = "Ontogeny", 
               ylab          = "TAI" )
```

For the `PhyloExpressionSetExample` all transformations result in a significant
phylotranscriptomics pattern deviating from a flat line.

Nevertheless, it is not clear which transformation is the most appropriate one
since the original TAI and TDI measure were defined for absolute expression levels.

The same accounts for `TDI` profiles:

### log2 transformation (TDI)

```{r, fig.width= 7, fig.height= 5, eval=FALSE}
data(DivergenceExpressionSetExample)
# plotting the TDI using log2 transformed expression levels
# and performing the Flat Line Test to obtain the p-value
PlotSignature( ExpressionSet = tf(DivergenceExpressionSetExample, log2),
               TestStatistic = "FlatLineTest",
               xlab          = "Ontogeny", 
               ylab          = "TDI" )
```

### sqrt transformation (TDI)

```{r, fig.width= 7, fig.height= 5, eval=FALSE}
data(DivergenceExpressionSetExample)
# plotting the TDI using sqrt transformed expression levels
# and performing the Flat Line Test to obtain the p-value
PlotSignature( ExpressionSet = tf(DivergenceExpressionSetExample, sqrt),
               TestStatistic = "FlatLineTest",
               xlab          = "Ontogeny", 
               ylab          = "TDI" )
```

As a result, observed patterns should always be quantified using statistical tests 
(for ex. `FlatLineTest()`, `ReductiveHourglassTest()`, and `EarlyConservationTest()`).
In case the observed pattern is significant, qualitative differences of the observed patterns based on different data transformations must be investigated in more detail,
since most data transformations are known to cause different effects on a measure that
isn't robust against data transformations.

## References

Domazet-Loso, T. & Tautz, D. __A phylogenetically based transcriptome age index mirrors ontogenetic divergence patterns__. _Nature_ __468__, 815-8 (2010).

Quint, M. _et al_. __A transcriptomic hourglass in plant embryogenesis__. _Nature_ __490__, 98-101 (2012).

Drost, H.G. _et al_. __Evidence for Active Maintenance of Phylotranscriptomic Hourglass Patterns in Animal and Plant Embryogenesis__. _Mol. Biol. Evol._ __32__ (5), 1221-1231 (2015).

Piasecka, B. _et al_. __The hourglass and the early conservation models - co-existing patterns of developmental constraints in vertebrates__. _PLoS Genet_.  __9__:e1003476 (2013).

Quackenbush, J. Computational Analysis of Microarray Data. _Nature Reviews_. __2__, 418-427 (2001).

Quackenbush, J. Microarray data normalization and transformation. _Nature Genetics_. __32__, 496-501 (2002).

Yeung, K.Y. _et al_. Model-based clustering and data transformations for gene expression data. _Bioinformatics_. __17__, 977-987 (2001).

Yeung, K.Y. _et al_. Supplement to _Model-based clustering and data transformations for gene expression data - Data Transformations and the Gaussian mixture assumption_. _Bioinformatics_. __17__, 977-987 (2001).