--- title: "Enrichment Analyses" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Enrichment Analyses} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} --- ## Performing `Phylostratum` and `Divergence Stratum` Enrichment Analyses `Phylostratum` and `Divergence Stratum` enrichment analyses have been performed by several studies to correlate organ or metabolic pathway evolution with the origin of organ or pathway specific genes (Sestak and Domazet-Loso, 2015). In detail, `Phylostratum` and `Divergence Stratum` enrichment analyses can be performed analogously to [Gene Ontology](https://geneontology.github.io/page/go-enrichment-analysis/) and [Kegg](https://www.genome.jp/kegg/) enrichment analyses to study the enrichment of evolutionary age or sequence divergence in a set of selected genes against the entire genome/transcriptome. In case specific age categories are significantly over- or underrepresented in the selected gene set, assumptions or potential correlations between the evolutionary origin of a particular organ or metabolic pathway can be implied. In this vignette we will use the data set published by Sestak and Domazet-Loso, 2015 to demonstrate how to perform enrichment analyses using `myTAI`. ## Enrichment Analyses using `PlotEnrichment()` The `PlotEnrichment()` function implemented in `myTAI` computes and visualizes the significance of enriched (over- or underrepresented) `Phylostrata` or `Divergence Strata` within an input set of process/tissue specific genes. In detail this function takes the `Phylostratum` or `Divergence Stratum` distribution of all genes stored in the input `ExpressionSet` as background set and the `Phylostratum` or `Divergence Stratum` distribution of the specific gene set and performs a Fisher's exact test for each `Phylostratum` or `Divergence Stratum` to quantify the statistical significance of over- or under-represented `Phylostrata` or `Divergence Strata` within the set of selected genes. In other words, the frequency distribution of `Phylostrata` or `Divergence Strata` in the complete sample is compared with the frequency distribution of `Phylostrata` or `Divergence Strata` in the set of selected genes and over- or under-representation is visualized by log-odds (or odds), where a log-odd of zero means that both frequency distributions are equal (see also Sestak and Domazet-Loso, 2015). ### Example Data Set Retrieval Before using the `PlotEnrichment()` function, we need to download the example data set from Sestak and Domazet-Loso, 2015. Download the `Phylostratigraphic Map` of _D. rerio_: ```r # download the Phylostratigraphic Map of Danio rerio # from Sestak and Domazet-Loso, 2015 download.file( url = "http://mbe.oxfordjournals.org/content/suppl/2014/11/17/msu319.DC1/TableS3-2.xlsx", destfile = "MBE_2015a_Drerio_PhyloMap.xlsx" ) ``` Read the `*.xlsx` file storing the `Phylostratigraphic Map` of _D. rerio_ and format it for the use with `myTAI`: ```r # install the readxl package install.packages("readxl") # load package readxl library(readxl) # read the excel file DrerioPhyloMap.MBEa <- read_excel("MBE_2015a_Drerio_PhyloMap.xlsx", sheet = 1, skip = 4) # format Phylostratigraphic Map for use with myTAI Drerio.PhyloMap <- DrerioPhyloMap.MBEa[ , 1:2] # have a look at the final format head(Drerio.PhyloMap) ``` ``` Phylostrata ZFIN_ID 1 1 ZDB-GENE-000208-13 2 1 ZDB-GENE-000208-17 3 1 ZDB-GENE-000208-18 4 1 ZDB-GENE-000208-23 5 1 ZDB-GENE-000209-3 6 1 ZDB-GENE-000209-4 ``` Now, `Drerio.PhyloMap` stores the `Phylostratigraphic Map` of _D. rerio_ which is used as background set to perform enrichment analyses with `PlotEnrichment()`. ### Enrichment Analyses Now, the `PlotEnrichment()` function visualizes the over- and underrepresented `Phylostrata` of brain specific genes when compared with the total number of genes stored in the `Phylostratigraphic Map` of _D. rerio_. ```{r,eval=FALSE} # read expression data (organ specific genes) from Sestak and Domazet-Loso, 2015 Drerio.OrganSpecificExpression <- read_excel("MBE_2015a_Drerio_PhyloMap.xlsx", sheet = 2, skip = 3) # select only brain specific genes Drerio.Brain.Genes <- unique(na.omit(Drerio.OrganSpecificExpression[ , "brain"])) # visualize enriched Phylostrata of genes annotated as brain specific PlotEnrichment(Drerio.PhyloMap, test.set = Drerio.Brain.Genes, measure = "log-foldchange", use.only.map = TRUE, legendName = "PS") ``` Here, the first argument is either a standard `ExpressionSet` object (in case `use.only.map = FALSE`: default) or a `Phylostratigraphic Map` or `Divergence Map` (in case `use.only.map = TRUE`; see [Introduction](Introduction.html) for details). The second argument `test.set` specifies the gene ids also stored in the corresponding `ExpressionSet` or `Phylostratigraphic Map/ Divergence Map` for which enrichment shall be quantified and visualized. To visualize the odds or log-odds of over- or underrepresented genes within the `test.set` the following procedure is performed: - $N_{ij}$ denotes the number of genes in group j and deriving from PS $i$, with $i = 1, .. , n$ and where $j = 1$ denotes the background set and $j = 2$ denotes the `test.set` - $N_{i.}$ denotes the total number of genes within PS $i$ - $N_{.j}$ denotes the total number of genes within group $j$ - $N_{..}$ is the total number of genes within all groups $j$ and all PS $i$ - $f_{ij}$ = $N_{ij}$ / $N_{..}$ and $g_{ij}$ = $f_{ij}$ / $f_{.j}$ denote relative frequencies between groups - $f_{i.}$ denotes the between group sum of $f_{ij}$ The result is the __fold-change value__ (odds; `measure = "foldchange"`) denoted as $C_2 = g_{i2} / f_{i.}$ which is visualized above and below zero or the __log fold-change__ value (log-odds; `measure = "log-foldchange"`), where $log_2$(C) = $log_2$($g_{i2}$) - $log_2$($f_{i.}$) which is visualized symmetrically above and below zero by `PlotEnrichment()`. Analogously, $C_1 = g_{i1} / f_{i.}$ but is not visualized by this function. Internally, `PlotEnrichment()` performs a Fisher's exact test for each `Phylostratum` or `Divergence Stratum` separately, to quantify the significance of over- or under-representation of corresponding `Phylostrata` or `Divergence Strata` within the `test.set` when compared with the entire `ExpressionSet`. `PlotEnrichment()` visualizes significantly enriched (over- or underrepresented) `Phylostrata` or `Divergence Strata` with asterisks '*'. Notation: - '*' = P-Value $\leq$ 0.05 - '**' = P-Value $\leq$ 0.005 - '***' = P-Value $\leq$ 0.0005 Users will notice that when performing the `PlotEnrichment()` function, the p-values and the enrichment matrix (storing $C_1$ and $C_2$) will be returned. ```{r,eval = FALSE} PlotEnrichment(Drerio.PhyloMap, test.set = Drerio.Brain.Genes, measure = "log-foldchange", use.only.map = TRUE, legendName = "PS") ``` ``` $p.values PS1 PS2 PS3 PS4 PS5 PS6 8.283490e-01 8.362880e-05 6.778981e-02 1.373816e-02 7.946309e-13 6.017041e-01 PS7 PS8 PS9 PS10 PS11 PS12 2.185021e-03 2.281194e-03 8.943147e-01 5.699612e-01 4.717058e-02 9.367759e-01 PS13 PS14 3.929949e-03 1.593834e-05 $enrichment.matrix BG_Set Test_Set PS1 -0.001132832 0.007668216 PS2 0.023733936 -0.172380714 PS3 -0.040879607 0.250587496 PS4 -0.048920465 0.294399729 PS5 -0.114888949 0.603817643 PS6 0.008678915 -0.060350168 PS7 -0.062948352 0.367240944 PS8 0.115630474 -1.206210187 PS9 -0.007353969 0.048964218 PS10 -0.031971192 0.200141519 PS11 0.039742253 -0.303363314 PS12 -0.002418079 0.016311853 PS13 0.101449988 -0.984621732 PS14 0.098211044 -0.938724783 ``` In case users are only interested in the p-values of the Fisher test and the enrichment matrix without illustrating the bar plot, they can specify the `plot.bars = FALSE` argument to only retrieve the numeric results. ```{r,eval = FALSE} # specify plot.bars = FALSE to retrieve only numeric results EnrichmentResult <- PlotEnrichment(Drerio.PhyloMap, test.set = Drerio.Brain.Genes, measure = "log-foldchange", use.only.map = TRUE, legendName = "PS", plot.bars = FALSE) # access p-values quantifying the enrichment for each Phylostratum EnrichmentResult$p.values ``` ``` PS1 PS2 PS3 PS4 PS5 PS6 8.283490e-01 8.362880e-05 6.778981e-02 1.373816e-02 7.946309e-13 6.017041e-01 PS7 PS8 PS9 PS10 PS11 PS12 2.185021e-03 2.281194e-03 8.943147e-01 5.699612e-01 4.717058e-02 9.367759e-01 PS13 PS14 3.929949e-03 1.593834e-05 ``` ```{r, eval = FALSE} # access enrichment matrix storing C_1 and C_2 EnrichmentResult$enrichment.matrix ``` ``` BG_Set Test_Set PS1 -0.001132832 0.007668216 PS2 0.023733936 -0.172380714 PS3 -0.040879607 0.250587496 PS4 -0.048920465 0.294399729 PS5 -0.114888949 0.603817643 PS6 0.008678915 -0.060350168 PS7 -0.062948352 0.367240944 PS8 0.115630474 -1.206210187 PS9 -0.007353969 0.048964218 PS10 -0.031971192 0.200141519 PS11 0.039742253 -0.303363314 PS12 -0.002418079 0.016311853 PS13 0.101449988 -0.984621732 PS14 0.098211044 -0.938724783 ``` ### Defining the Background Set The Fisher test which is performed inside `PlotEnrichment()` assumes that all genes stored in the input `ExpressionSet` or `Phylostratigraphic Map`/`Divergence Map` are used to define the background set for constructing the test statistic. However, since in most cases the `test.set` is an subset of the input `ExpressionSet` or `Phylostratigraphic Map`/`Divergence Map` one could also specify the `complete.bg` argument to remove all `test.set` genes from the background set when performing the Fisher test and visualization. The following two examples allow users to compare the results when retaining all genes as background set compared with the option to remove `test.set` genes from the background set. ```{r,eval = FALSE} # complete.bg = TRUE (default) -> retain test.set genes in background set PlotEnrichment(Drerio.PhyloMap, test.set = Drerio.Brain.Genes, measure = "log-foldchange", complete.bg = TRUE, use.only.map = TRUE, legendName = "PS") ``` ```{r,eval = FALSE} # complete.bg = FALSE -> remove test.set genes from background set PlotEnrichment(Drerio.PhyloMap, test.set = Drerio.Brain.Genes, measure = "log-foldchange", complete.bg = FALSE, use.only.map = TRUE, legendName = "PS") ``` Users will notice that although some p-values change, the qualitative result did not change. In border line cases however, the results might influence whether or not some `Phylostrata` or `Divergence Strata` are denoted as significantly enriched or not. So always be aware of the interpretation when retaining or removing the `test.set` from the background set, because both options are valid and have advantages and disadvantages and depend on a valid interpretation. ## Interpretation of Enrichment Results For the _D. rerio_ brain genes example you can see that PS4, PS5, and PS7 are significantly over-represented in the set of brain specific genes. ```{r, eval = FALSE} PlotEnrichment(Drerio.PhyloMap, test.set = Drerio.Brain.Genes, measure = "foldchange", complete.bg = TRUE, use.only.map = TRUE, legendName = "PS") ``` Again, we retrieve the _D. rerio_ specific taxonomy represented by PS1-14 using the `taxonomy()` function (see [Introduction](Introduction.html) and [Taxonomy](Taxonomy.html) for details). ```{r, eval=FALSE} # retrieve the taxonomy of D. rerio taxonomy(organism = "Danio rerio") ``` ``` id name rank 1 cellular organisms no rank 131567 2 Eukaryota superkingdom 2759 3 Opisthokonta no rank 33154 4 Metazoa kingdom 33208 5 Eumetazoa no rank 6072 6 Bilateria no rank 33213 7 Deuterostomia no rank 33511 8 Chordata phylum 7711 9 Craniata subphylum 89593 10 Vertebrata no rank 7742 11 Gnathostomata no rank 7776 12 Teleostomi no rank 117570 13 Euteleostomi no rank 117571 14 Actinopterygii superclass 7898 15 Actinopteri class 186623 16 Neopterygii subclass 41665 17 Teleostei infraclass 32443 18 Osteoglossocephalai no rank 1489341 19 Clupeocephala no rank 186625 20 Otomorpha no rank 186634 21 Ostariophysi no rank 32519 22 Otophysa no rank 186626 23 Cypriniphysae superorder 186627 24 Cypriniformes order 7952 25 Cyprinoidea superfamily 30727 26 Cyprinidae family 7953 27 Danio genus 7954 28 Danio rerio species 7955 ``` Sestak and Domazet-Loso, 2015 collapsed these 28 taxonomic nodes into 14 taxonomic nodes (see `Figure 2` in Sestak and Domazet-Loso, 2015) and labelled them as phylostrata 1 to phylostrata 14, where PS1 represents `cellular organisms` and PS14 represents `D. rerio` specific genes. Based on the phylostratum categorization of Sestak and Domazet-Loso, 2015, PS4 represents `Holozoa (= Metazoa + allies)`, PS5 represents `Metazoa`, and PS7 represents `Bilateria`. Now, the over-representation results of brain specific genes returned by `PlotEnrichment()` provide evidence, that brain specific genes might indeed have originated during the emergence of the nervous system at the metazoan-eumetazoan transition leading to the interpretation that the vertebrate brain has a step wise adaptive history where most of its extant organization was already present in the chordate ancestor as argued by Sestak and Domazet-Loso, 2015. This example shall illustrate how the `PlotEnrichment()` function can be used to trace the evolutionary origin of tissue or process specific genes by investigating their age enrichment. In case users have an `ExpressionSet` storing the `Phylostratigraphic Map` of _D. rerio_ as well as an expression set, they can furthermore use the `PlotGeneSet()` function implemented in `myTAI` to visualize the expression levels of brain specific genes which have been shown to be significantly enriched in `Metazoa` specific phylostrata. Example: ```{r, eval = FALSE} # the best parameter setting to visualize this plot: # png("DrerioBrainSpecificGeneExpression.png",700,400) PlotGeneSet(ExpressionSet = DrerioPhyloExpressionSet, gene.set = Drerio.Brain.Genes, plot.legend = FALSE, type = "l", lty = 1, lwd = 4, xlab = "Ontogeny", ylab = "Expression Level") # dev.off() ``` Here `DrerioPhyloExpressionSet` denotes a hypothetical `ExpressionSet` of _D. rerio_ development. Additionally, the `SelectGeneSet()` function allows users to obtain the `ExpresisonSet` subset of selected genes (`gene.set`) for subsequent analyses. ```{r, eval = FALSE} # select the ExpressionSet subset of Brain specific genes Brain.PhyloExpressionSet <- SelectGeneSet( ExpressionSet = DrerioPhyloExpressionSet, gene.set = Drerio.Brain.Genes ) head(Brain.PhyloExpressionSet) ``` ### Adjust P-values for Multiple Comparisons In case a large number of Phylostrata or Divergence Strata is included in the input `ExpressionSet`, p-values returned by `PlotEnrichment()` should be adjusted for multiple comparisons. For this purpose `PlotEnrichment()` includes the argument `p.adjust.method`. Here, all methods implemented in `?p.adjust` can be specified: ```{r,eval = FALSE} # adjust p-values for multiple comparisons with Benjamini & Hochberg (1995) PlotEnrichment(Drerio.PhyloMap, test.set = Drerio.Brain.Genes, measure = "log-foldchange", complete.bg = FALSE, use.only.map = TRUE, legendName = "PS", p.adjust.method = "BH") ``` Please consult these reviews ([Biostatistics Handbook](http://www.biostathandbook.com/multiplecomparisons.html), [Gelman et al., 2008](http://www.stat.columbia.edu/~gelman/research/published/multiple2f.pdf), and [Slides](http://www.gs.washington.edu/academics/courses/akey/56008/lecture/lecture10.pdf)) to decide whether or not to apply p-value adjustment to your own dataset. ## References Sestak MS and Domazet-Loso T. __Phylostratigraphic Profiles in Zebrafish Uncover Chordate Origins of the Vertebrate Brain__. _Mol. Biol. Evol._ (2015) 32 (2): 299-312.