--- title: "Retrieving Taxonomic Information" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Retrieving Taxonomic Information} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} --- ```{r, echo = FALSE, message = FALSE} options(width = 750) knitr::opts_chunk$set( comment = "#>", error = FALSE, tidy = FALSE) ``` This vignette will introduce users to the retrieval of taxonomic information with `myTAI`. The `taxonomy()` function implemented in `myTAI` relies on the powerful package [taxize](https://github.com/ropensci/taxize). Nevertheless, taxonomic information retrieval has been customized for the `myTAI` standard and for organism specific information retrieval. Specifically, the `taxonomy()` function implemented in `myTAI` can be used to classify genomes according to phylogenetic classification into Phylostrata (Phylostratigraphy) or to retrieve species specific taxonomic information when performing Divergence Stratigraphy (see [Introduction](Introduction.html) for details). For larger taxonomy queries it may be useful to create an NCBI Account and set up an [ENTREZ API KEY](https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/). ```r # install.packages(c("taxize", "usethis")) taxize::use_entrez() # Create your key from your (brand-new) account's. # After generating your key set it as ENTREZ_KEY in .Renviron. # ENTREZ_KEY='youractualkeynotthisstring' # For that, use usethis::edit_r_environ() usethis::edit_r_environ() ``` ## Taxonomic Information Retrieval The `myTAI` package provides the `taxonomy()` function to retrieve taxonomic information. In the following example we will obtain the taxonomic hierarchy of `Arabidopsis thaliana` from [NCBI Taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy). ```{r, eval=FALSE} # retrieving the taxonomic hierarchy of "Arabidopsis thaliana" # from NCBI Taxonomy myTAI::taxonomy( organism = "Arabidopsis thaliana", db = "ncbi", output = "classification" ) ``` ``` name rank id 1 cellular organisms no rank 131567 2 Eukaryota superkingdom 2759 3 Viridiplantae kingdom 33090 4 Streptophyta phylum 35493 5 Streptophytina no rank 131221 6 Embryophyta no rank 3193 7 Tracheophyta no rank 58023 8 Euphyllophyta no rank 78536 9 Spermatophyta no rank 58024 10 Magnoliophyta no rank 3398 11 Mesangiospermae no rank 1437183 12 eudicotyledons no rank 71240 13 Gunneridae no rank 91827 14 Pentapetalae no rank 1437201 15 rosids subclass 71275 16 malvids no rank 91836 17 Brassicales order 3699 18 Brassicaceae family 3700 19 Camelineae tribe 980083 20 Arabidopsis genus 3701 21 Arabidopsis thaliana species 3702 ``` The `organism` argument takes the scientific name of a query organism, the `db` argument specifies that database from which the corresponding taxonomic information shall be retrieved, e.g. `ncbi` (NCBI Taxonomy) and `itis` (Integrated Taxonomic Information System) and the `output` argument specifies the type of taxonomic information that shall be returned for the query organism, e.g. `classification`, `taxid`, or `children`. The output of `classification` is a `data.frame` storing the taxonomic hierarchy of `Arabidopsis thaliana` starting with `cellular organisms` up to `Arabidopsis thaliana`. The first column stores the taxonomic name, the second column the taxonomic rank, and the third column the NCBI Taxonomy id for corresponding taxa. Analogous `classification` information can be obtained from different databases. ```{r, eval=FALSE} # retrieving the taxonomic hierarchy of "Arabidopsis thaliana" # from the Integrated Taxonomic Information System myTAI::taxonomy( organism = "Arabidopsis thaliana", db = "itis", output = "classification" ) ``` ``` name rank id 1 Plantae Kingdom 202422 2 Viridiplantae Subkingdom 954898 3 Streptophyta Infrakingdom 846494 4 Embryophyta Superdivision 954900 5 Tracheophyta Division 846496 6 Spermatophytina Subdivision 846504 7 Magnoliopsida Class 18063 8 Rosanae Superorder 846548 9 Brassicales Order 822943 10 Brassicaceae Family 22669 11 Arabidopsis Genus 23040 ``` The `output` argument allows you to directly access taxonomy ids for a query organism or species. ```{r, eval=FALSE} # retrieving the taxonomy id of the query organism from NCBI Taxonomy myTAI::taxonomy( organism = "Arabidopsis thaliana", db = "ncbi", output = "taxid" ) ``` ``` id 1 3702 ``` ```{r, eval=FALSE} # retrieving the taxonomy id of the query organism from Integrated Taxonomic Information Service myTAI::taxonomy( organism = "Arabidopsis", db = "itis", output = "taxid" ) ``` ``` id 1 23040 ``` So far, the following data bases can be accesses to retrieve taxonomic information: * `db = "itis"` : Integrated Taxonomic Information Service * `db = "ncbi"` : National Center for Biotechnology Information ## Retrieve Children Nodes Another `output` supported by `taxonomy()` is `children` that returns the immediate children taxa for a query organism. This feature is useful to determine species relationships for quantifying recent evolutionary conservation with Divergence Stratigraphy. ```{r, eval=FALSE} # retrieve children taxa of the query organism stored in the corresponding database myTAI::taxonomy( organism = "Arabidopsis", db = "ncbi", output = "children" ) ``` ``` childtaxa_id childtaxa_name childtaxa_rank 1 1547872 Arabidopsis umezawana species 2 1328956 (Arabidopsis thaliana x Arabidopsis arenosa) x Arabidopsis suecica species 3 1240361 Arabidopsis thaliana x Arabidopsis arenosa species 4 869750 Arabidopsis thaliana x Arabidopsis lyrata species 5 412662 Arabidopsis pedemontana species 6 378006 Arabidopsis arenosa x Arabidopsis thaliana species 7 347883 Arabidopsis arenicola species 8 302551 Arabidopsis petrogena species 9 97980 Arabidopsis croatica species 10 97979 Arabidopsis cebennensis species 11 81970 Arabidopsis halleri species 12 59690 Arabidopsis kamchatica species 13 59689 Arabidopsis lyrata species 14 45251 Arabidopsis neglecta species 15 45249 Arabidopsis suecica species 16 38785 Arabidopsis arenosa species 17 3702 Arabidopsis thaliana species ``` ```{r, eval=FALSE} # retrieve children taxa of the query organism stored in the corresponding database myTAI::taxonomy( organism = "Arabidopsis", db = "itis", output = "children" ) ``` ``` parentname parenttsn rankname taxonname tsn 1 Arabidopsis 23040 Species Arabidopsis thaliana 23041 2 Arabidopsis 23040 Species Arabidopsis arenicola 823113 3 Arabidopsis 23040 Species Arabidopsis arenosa 823130 4 Arabidopsis 23040 Species Arabidopsis lyrata 823171 ``` These results allow us to choose `subject` organisms for [Divergence Stratigraphy](https://github.com/drostlab/orthologr/blob/master/vignettes/divergence_stratigraphy.Rmd).