--- title: "Advanced Phylotranscriptomics Analyses" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Advanced Phylotranscriptomics Analyses} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} --- ```{r, echo = FALSE, message = FALSE} library(myTAI) options(width = 750) knitr::opts_chunk$set( comment = "#>", error = FALSE, tidy = FALSE) ``` ## Overview In the [Introduction](Introduction.html) vignette users learned how to perform the most fundamental computations and visualizations for phylotranscriptomics analyses using the `myTAI` package. Especially in the _Introduction_ vignette we demonstrated how to perform `Phylostratigraphy` and `Divergence Stratigraphy` as well as the construction of `PhyloExpressionSets` and `DivergenceExpressionSets` using the `MatchMap()` function. In the [Intermediate](Intermediate.html) vignette users learned about more detailed analyses and more specialized techniques to investigate the observed phylotranscriptomics patterns (`TAI`, `TDI`, `RE`, etc.). ## Investigating Age or Divergence Category Specific Expression Level Distributions Gene expression levels are a fundamental aspect of phylotranscriptomics studies. In detail, phylotranscriptomic measures aim to quantify the expression intensity of genes deriving from common age or divergence categories to detect stages of evolutionary constraints. Hence, the gene expression distribution of age or divergence categories as well as their differences within and between stages or categories allow us to investigate the age (PS) or divergence (DS) category specific contribution to the corresponding transcriptome. For this purpose, the `PlotCategoryExpr()` aims to visualize the expression level distribution of each phylostratum during each time point or experiment as barplot, dot plot, or violin plot enabling users to quantify the age (PS) or divergence (DS) category specific contribution to the corresponding transcriptome. This way of visualizing the gene expression distribution of each age (PS) or divergence (DS) category during all developmental stages or experiments allows users to detect specific age or divergence categories contributing significant levels of gene expression to the underlying biological process (transcriptome). ```{r,eval = FALSE, fig.width= 7, fig.height= 5} library(myTAI) data(PhyloExpressionSetExample) # category-centered visualization of PS specific expression level distributions (log-scale) PlotCategoryExpr(ExpressionSet = PhyloExpressionSetExample, legendName = "PS", test.stat = TRUE, type = "category-centered", distr.type = "boxplot", log.expr = TRUE) ``` ``` Zygote Quadrant Globular Heart Torpedo Bent Mature category-centered "***" "***" "***" "***" "***" "***" "***" ``` The resulting boxplot illustrates the log expression levels of each phylostratum during each developmental stage. Additionally, a Kruskal-Wallis Rank Sum Test as well as a Benjamini & Hochberg p-value adjustment for multiple comparisons is performed (`test.stat = TRUE`) to statistically quantify the differences between expression levels of different age or divergence categories. This type of analysis allows users to detect stages or experiments that show high deviation between age or divergence category contributions to the overall transcriptome or no significant deviations of age or divergence categories, suggesting equal age or divergence category contributions to the overall transcriptome. The corresponding P-values are printed to the console using the following notation: - '*' = P-Value $\leq$ 0.05 - '**' = P-Value $\leq$ 0.005 - '***' = P-Value $\leq$ 0.0005 - 'n.s.' = not significant = P-Value > 0.05 In this case all developmental stages show significant differences in phylostratum specific gene expression. __Please notice that users need to define the `legendName` argument as `PS` or `DS` to specify whether the input `ExpressionSet` is a `PhyloExpressionSet` (`legendName = 'PS'`) or `DivergenceExpressionSet` (`legendName = 'DS'`).__ Alternatively, users can investigate the differences of gene expression __between__ all stages or experiments for __each__ age or divergence category by specifying `type = 'stage-centered'`. ```{r,eval = FALSE, fig.width= 7, fig.height= 5} library(myTAI) data(PhyloExpressionSetExample) # stage-centered visualization of PS specific expression level distributions (log-scale) PlotCategoryExpr(ExpressionSet = PhyloExpressionSetExample, legendName = "PS", test.stat = TRUE, type = "stage-centered", distr.type = "boxplot", log.expr = TRUE) ``` ``` PS1 PS2 PS3 PS4 PS5 PS6 PS7 PS8 PS9 PS10 PS11 PS12 stage-centered "***" "***" "n.s." "*" "n.s." "*" "n.s." "n.s." "n.s." "***" "n.s." "***" ``` Here, the Kruskal-Wallis Rank Sum Test (with Benjamini & Hochberg p-value adjustment) quantifies whether or not the gene expression distribution of a single age or divergence category significantly changes throughout development or experiments. This type of analysis allows users to detect specific age or divergence categories that significantly change their expression levels throughout development or experiments. In this case, users will observe that PS3,5,7-9,11 do not show significant differences of gene expression between developmental stages suggesting that their contribution to the overall transcriptome remains constant throughout development. Finally, users can choose the following plot types to visualize expression distributions: Argument: `distr.type` - `distr.type = "boxplot"` This specification allows users to visualize the expression distribution of all PS or DS as boxplot. - `distr.type = "violin"` This specification allows users to visualize the expression distribution of all PS or DS as violin plot. - `distr.type = "dotplot"` This specification allows users to visualize the expression distribution of all PS or DS as dot plot. Together, studies performed with `PlotCategoryExpr()` allow users to conclude that genes originating in specific PS or DS contribute significantly more to the overall transcriptome than other genes originating from different PS or DS categories. More specialized analyses such as `PlotMeans()`, `PlotRE()`, `PlotBarRE()`, `TAI()`, `TDI()`, etc. will then allow them to study the exact mean expression patterns of these age or divergence categories. Users will notice that so far all examples shown above specified `log.expr = TRUE` illustrating boxplots based on log2 expression levels. This way of visualization allows better visual comparisons between age or divergence categories. However, when specifying `log.expr = FALSE` absolute expression levels will be visualized in the corresponding boxplot. Alternatively, instead of specifying `log.expr = TRUE` users can directly pass log2 transformed expression levels to `PlotCategoryExpr()` via `tf(PhyloExpressionSetExample,log2)` (when `log.expr = FALSE`): ```{r,eval = FALSE, fig.width= 7, fig.height= 5} data(PhyloExpressionSetExample) # category-centered visualization of PS specific expression level distributions (log-scale) PlotCategoryExpr(ExpressionSet = tf(PhyloExpressionSetExample, log2), legendName = "PS", test.stat = TRUE, type = "category-centered", distr.type = "boxplot", log.expr = FALSE) ``` ``` Zygote Quadrant Globular Heart Torpedo Bent Mature category-centered "***" "***" "***" "***" "***" "***" "***" ``` Or any other expression level transformation, e.g. `sqrt`. ```{r,eval = FALSE, fig.width= 7, fig.height= 5} data(PhyloExpressionSetExample) # category-centered visualization of PS specific expression level distributions (sqrt-scale) PlotCategoryExpr(ExpressionSet = tf(PhyloExpressionSetExample, sqrt), legendName = "PS", test.stat = TRUE, type = "category-centered", distr.type = "boxplot", log.expr = FALSE) ``` ``` Zygote Quadrant Globular Heart Torpedo Bent Mature category-centered "***" "***" "***" "***" "***" "***" "***" ``` ## Gene Subset Age or Divergence Category Specific Expression Level Distributions In some cases, users wish to visualize the gene expression distributions for a subset of genes in contrast to the entire transcriptome. For this purpose, the `gene.set` argument allows users to specify the gene ids of a subset of genes that shall be matched in the input `ExpressionSet` and for which expression level distributions shall be visualized. ```{r,eval = FALSE, fig.width= 7, fig.height= 5} library(myTAI) data(PhyloExpressionSetExample) # define an example gene subset (500 genes) which # can be found in the input ExpressionSet set.seed(234) example.gene.set <- PhyloExpressionSetExample[sample(1:25260,500) , 2] # visualize the gene expression distributions for these 500 genes (category-centered) PlotCategoryExpr(ExpressionSet = PhyloExpressionSetExample, legendName = "PS", test.stat = TRUE, type = "category-centered", distr.type = "boxplot", log.expr = TRUE, gene.set = example.gene.set) ``` ``` Zygote Quadrant Globular Heart Torpedo Bent Mature category-centered "*" "*" "*" "*" "*" "*" "n.s." ``` Or analogously `stage-centered`: ```{r,eval = FALSE, fig.width= 7, fig.height= 5} library(myTAI) data(PhyloExpressionSetExample) # define an example gene subset (500 genes) which # can be found in the input ExpressionSet set.seed(234) example.gene.set <- PhyloExpressionSetExample[sample(1:25260,500) , 2] # visualize the gene expression distributions for these 500 genes (stage-centered) PlotCategoryExpr(ExpressionSet = PhyloExpressionSetExample, legendName = "PS", test.stat = TRUE, type = "stage-centered", distr.type = "boxplot", log.expr = TRUE, gene.set = example.gene.set) ``` ``` PS1 PS2 PS3 PS4 PS5 PS6 PS7 PS8 PS9 PS10 PS11 stage-centered "n.s." "n.s." "n.s." "n.s." "n.s." "n.s." "n.s." "n.s." "n.s." "n.s." "n.s." PS12 stage-centered "n.s." ``` For example, users interested in the enrichment of PS or DS values in _D. rerio_ brain genes (see [Enrichment Vignette](Enrichment.html) for details) could also visualize their gene expression distributions throughout development with `PlotCategoryExpr()` in cases where expression data is available. ## Computing the significant differences between gene expression distributions of PS or DS groups As proposed by Quint et al., 2012 in some cases users wish to compare the difference of group specific expression levels using a statistical test. For this purpose, the `PlotGroupDiffs()` function performs a test to quantify the statistical significance between the global expression level distributions of groups of PS or DS. It therefore allows users to investigate significant groups of PS or DS that significantly differ in their gene expression level distribution within specific developmental stages or experiments. Analogous to the `PlotRE()` or `PlotMeans()` function (see [Introduction](Introduction.html) for details), users need to pass the `Groups` to `PlotGroupDiffs()` specifying the groups that shall be compared. ```{r,eval= FALSE, fig.width= 7, fig.height= 7,warning=FALSE} library(myTAI) data(PhyloExpressionSetExample) PlotGroupDiffs(ExpressionSet = PhyloExpressionSetExample, Groups = list(group_1 = 1:3,group_2 = 4:12), legendName = "PS", plot.type = "p-vals", type = "b", lwd = 6, xlab = "Ontogeny") ``` In cases where no plot shall be drawn and only the resulting p-value shall be returned users can specify the `plot.type = NULL` argument to receive only p-values returned by the underlying test statistic. ```{r,eval= FALSE, fig.width= 9, fig.height= 7,warning=FALSE} library(myTAI) data(PhyloExpressionSetExample) # only receive the p-values without the corresponding plot PlotGroupDiffs(ExpressionSet = PhyloExpressionSetExample, Groups = list(group_1 = 1:3,group_2 = 4:12), legendName = "PS", plot.type = NULL) ``` Optionally, users can also visualize the difference in expression level distributions of groups of PS/DS during each developmental stage by specifying the `plot.type = "boxplot"` argument. ```{r,eval= FALSE,fig.width= 9, fig.height= 12,warning=FALSE} library(myTAI) data(PhyloExpressionSetExample) # visualize difference as boxplot PlotGroupDiffs(ExpressionSet = tf(PhyloExpressionSetExample,log2), Groups = list(group_1 = 1:3,group_2 = 4:12), legendName = "PS", plot.type = "boxplot") ``` Here, we use log2 transformed expression levels for better visualization (`tf(PhyloExpressionSetExample,log2)`). Internally, the `PlotGroupDiffs()` function performs a Wilcoxon Rank Sum test to quantify the statistical significance of PS/DS group expression. This quantification allows users to detect developmental stages of significant expression level differences between PS/DS groups. In this example we chose genes originated before the evolution of embryogenesis evolved in plants (Group1 = PS1-3) versus genes originated after the evolution of embryogenesis evolved in plants (Group2 = PS4-12). As a result, we observe that indeed the difference in total gene expression between these groups is significant throughout embryogenesis. In terms of the P-value quantification we observe that the P-value is minimized towards the phylotypic period. Hence, the expression level difference between the studied PS groups is maximized during the phylotypic period.