library(phylobase)This document describes the new ‘phylo4’ S4 classes and methods,
which are intended to provide a unifying standard for the representation
of phylogenetic trees and comparative data in R. The
phylobase package was developed to help both end users and
package developers by providing a common suite of tools likely to be
shared by all packages designed for phylogenetic analysis, facilities
for data and tree manipulation, and standardization of formats.
This standardization will benefit end-users by making it
easier to move data and compare analyses across packages, and to keep
comparative data synchronized with phylogenetic trees. Users will also
benefit from a repository of functions for tree manipulation, for
example tools for including or excluding subtrees (and associated
phenotypic data) or improved tree and data plotting facilities.
phylobase will benefit developers by freeing them
to put their programming effort into developing new methods rather than
into re-coding base tools. We (the phylobase developers)
hope phylobase will also facilitate code validation by
providing a repository for benchmark tests, and more generally that it
will help catalyze community development of comparative methods in
R.
A more abstract motivation for developing phylobase was
to improve data checking and abstraction of the tree data formats.
phylobase can check that data and trees are associated in
the proper fashion, and protects users and developers from accidently
reordering one, but not the other. It also seeks to abstract the data
format so that commonly used information (for example, branch length
information or the ancestor of a particular node) can be accessed
without knowledge of the underlying data structure (i.e., whether the
tree is stored as a matrix, or a list, or a parenthesis-based format).
This is achieved through generic phylobase functions which
which retrieve the relevant information from the data structures. The
benefits of such abstraction are multiple: (1) easier access to the
relevant information via a simple function call (this frees both
users and developers from learning details of complex data structures),
(2) freedom to optimize data structures in the future without
breaking code. Having the generic functions in place to “translate”
between the data structures and the rest of the program code allows
program and data structure development to proceed somewhat
independently. The alternative is code written for specific data
structures, in which modifications to the data structure requires
rewriting the entire package code (often exacting too high a price,
which results in the persistence of less-optimal data structures). (3)
providing broader access to the range of tools in
phylobase. Developers of specific packages can use
these new tools based on S4 objects without knowing the details of S4
programming.
The base ‘phylo4’ class is modeled on the the phylo
class in ape. and extend the ‘phylo4’ class to include data
or multiple trees respectively. In addition to describing the classes
and methods, this vignette gives examples of how they might be used.
The phylobase package currently implements the following functions and data structures:
Data structures for storing a single tree and multiple trees: and ?
A data structure for storing a tree with associated tip and node data:
A data structure for storing multiple trees with one set of tip data:
Functions for reading nexus files into the above data structures
Functions for converting between the above data structures and
objects as well as phylog objects (although the latter are
now deprecated …)
Functions for editing trees and data (i.e., subsetting and replacing)
Functions for plotting trees and trees with data
The help system works similarly to the help system with some small
differences relating to how methods are written. The function is a good
example. When we type we are provided the help for the default plotting
function which expects x and y. R
also provides a way to smartly dispatch the right type of plotting
function. In the case of an object (a class object) R
evaluates the class of the object and finds the correct functions, so
the following works correctly.
library(ape)
set.seed(1)  ## set random-number seed
rand_tree <- rcoal(10) ## Make a random tree with 10 tips
plot(rand_tree)However, typing still takes us to the default plot help.
We have to type to find what we are looking for. This is because
generics are simply functions with a dot and the class name added.
The generic system is too complicated to describe here, but doesn’t
include the same dot notation. As a result doesn’t work, R
still finds the right plotting function.
library(phylobase)
# convert rand_tree to a phylo4 object
rand_p4_tree <- as(rand_tree, "phylo4")
plot(rand_p4_tree)All fine and good, but how to we find out about all the great
features of the phylobase plotting function? R
has two nifty ways to find it, the first is to simply put a question
mark in front of the whole call:
`?`(plot(rand_p4_tree))R looks at the class of the object and takes us to the
correct help file (note: this only works with objects). The second ways
is handy if you already know the class of your object, or want to
compare to generics for different classes:
`?`(method, plot("phylo4"))More information about how documentation works can be found in the methods package, by running the following command.
help('Documentation', package="methods")You can start with a tree — an object of class phylo
from the ape package (e.g., read in using the
read.tree() or read.nexus() functions), and
convert it to a phylo4 object.
For example, load the raw Geospiza data:
library(phylobase)
data(geospiza_raw) # what does it contain?
names(geospiza_raw)
#> [1] "tree" "data"Convert the S3 tree to a S4 phylo4 object
using the as() function:
(g1 <- as(geospiza_raw$tree, "phylo4"))
#>           label node ancestor edge.length node.type
#> 1    fuliginosa    1       24     0.05500       tip
#> 2        fortis    2       24     0.05500       tip
#> 3  magnirostris    3       23     0.11000       tip
#> 4   conirostris    4       22     0.18333       tip
#> 5      scandens    5       21     0.19250       tip
#> 6    difficilis    6       20     0.22800       tip
#> 7       pallida    7       25     0.08667       tip
#> 8      parvulus    8       27     0.02000       tip
#> 9    psittacula    9       27     0.02000       tip
#> 10       pauper   10       26     0.03500       tip
#> 11   Platyspiza   11       18     0.46550       tip
#> 12        fusca   12       17     0.53409       tip
#> 13 Pinaroloxias   13       16     0.58333       tip
#> 14     olivacea   14       15     0.88077       tip
#> 15         <NA>   15        0          NA      root
#> 16         <NA>   16       15     0.29744  internal
#> 17         <NA>   17       16     0.04924  internal
#> 18         <NA>   18       17     0.06859  internal
#> 19         <NA>   19       18     0.13404  internal
#> 20         <NA>   20       19     0.10346  internal
#>  [ reached 'max' / getOption("max.print") -- omitted 7 rows ]The (internal) nodes appear with labels because they are not defined:
nodeLabels(g1)
#> 15 16 17 18 19 20 21 22 23 24 25 26 27 
#> NA NA NA NA NA NA NA NA NA NA NA NA NAYou can also retrieve the node labels with .
A simple way to assign the node numbers as labels (useful for various checks) is
nodeLabels(g1) <- paste("N", nodeId(g1, "internal"), sep="")
head(g1, 5)
#>          label node ancestor edge.length node.type
#> 1   fuliginosa    1       24     0.05500       tip
#> 2       fortis    2       24     0.05500       tip
#> 3 magnirostris    3       23     0.11000       tip
#> 4  conirostris    4       22     0.18333       tip
#> 5     scandens    5       21     0.19250       tipThe method gives a little extra information, including information on the distribution of branch lengths:
summary(g1)
#> 
#>  Phylogenetic tree : g1 
#> 
#>  Number of tips    : 14 
#>  Number of nodes   : 13 
#>  Branch lengths:
#>         mean         : 0.1764008 
#>         variance     : 0.04624379 
#>         distribution :
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> 0.00917 0.04985 0.08000 0.17640 0.21912 0.88077Print tip labels:
tipLabels(g1)
#>              1              2              3              4              5              6 
#>   "fuliginosa"       "fortis" "magnirostris"  "conirostris"     "scandens"   "difficilis" 
#>              7              8              9             10             11             12 
#>      "pallida"     "parvulus"   "psittacula"       "pauper"   "Platyspiza"        "fusca" 
#>             13             14 
#> "Pinaroloxias"     "olivacea"(labels(g1,"tip") would also work.)
You can modify labels and other aspects of the tree — for example, to convert all the labels to lower case:
tipLabels(g1) <- tolower(tipLabels(g1))You could also modify selected labels, e.g. to modify the labels in positions 11 and 13 (which happen to be the only labels with uppercase letters):
tipLabels(g1)[c(11, 13)] <- c("platyspiza", "pinaroloxias")Note that for a given tree, phylobase always return the
tipLabels in the same order.
Print node numbers (in edge matrix order):
nodeId(g1, type='all')
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27Does it have information on branch lengths?
hasEdgeLength(g1)
#> [1] TRUEIt does! What do they look like?
edgeLength(g1)
#>   15-16   16-17   17-18   18-19   19-20   20-21   21-22   22-23   23-24    24-1    24-2 
#> 0.29744 0.04924 0.06859 0.13404 0.10346 0.03550 0.00917 0.07333 0.05500 0.05500 0.05500 
#>    23-3    22-4    21-5    0-15    20-6   19-25    25-7   25-26   26-27    27-8    27-9 
#> 0.11000 0.18333 0.19250      NA 0.22800 0.24479 0.08667 0.05167 0.01500 0.02000 0.02000 
#>   26-10   18-11   17-12   16-13   15-14 
#> 0.03500 0.46550 0.53409 0.58333 0.88077Note that the root has <NA> as its length.
Print edge labels (also empty in this case — therefore all
NA):
edgeLabels(g1)
#> 15-16 16-17 17-18 18-19 19-20 20-21 21-22 22-23 23-24  24-1  24-2  23-3  22-4  21-5  0-15 
#>    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA 
#>  20-6 19-25  25-7 25-26 26-27  27-8  27-9 26-10 18-11 17-12 16-13 15-14 
#>    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NAYou can also use this function to label specific edges:
edgeLabels(g1)["23-24"] <- "an edge"
edgeLabels(g1)
#>     15-16     16-17     17-18     18-19     19-20     20-21     21-22     22-23     23-24 
#>        NA        NA        NA        NA        NA        NA        NA        NA "an edge" 
#>      24-1      24-2      23-3      22-4      21-5      0-15      20-6     19-25      25-7 
#>        NA        NA        NA        NA        NA        NA        NA        NA        NA 
#>     25-26     26-27      27-8      27-9     26-10     18-11     17-12     16-13     15-14 
#>        NA        NA        NA        NA        NA        NA        NA        NA        NAThe edge labels are named according to the nodes they connect (ancestor-descendant). You can get the edge(s) associated with a particular node:
getEdge(g1, 24) # default uses descendant node
#>      24 
#> "23-24"
getEdge(g1, 24, type="ancestor") # edges using ancestor node
#>     24     24 
#> "24-1" "24-2"These results can in turn be passed to the function to retrieve the length of a given set of edges:
edgeLength(g1)[getEdge(g1, 24)]
#> 23-24 
#> 0.055
edgeLength(g1)[getEdge(g1, 24, "ancestor")]
#>  24-1  24-2 
#> 0.055 0.055Is it rooted?
isRooted(g1)
#> [1] TRUEWhich node is the root?
rootNode(g1)
#> N15 
#>  15Does it contain any polytomies?
hasPoly(g1)
#> [1] FALSEIs the tree ultrametric?
isUltrametric(g1)
#> [1] TRUEYou can also get the depth (distance from the root) of any given node or the tips:
nodeDepth(g1, 23)
#> Warning: 'nodeDepth' is deprecated.
#> Use 'nodeHeight' instead.
#> See help("Deprecated")
#>     N23 
#> 0.77077
depthTips(g1)
#> Warning: 'depthTips' is deprecated.
#> Use 'nodeHeight' instead.
#> See help("Deprecated")
#> Warning: 'nodeDepth' is deprecated.
#> Use 'nodeHeight' instead.
#> See help("Deprecated")
#>   fuliginosa       fortis magnirostris  conirostris     scandens   difficilis 
#>      0.88077      0.88077      0.88077      0.88077      0.88077      0.88077 
#>      pallida     parvulus   psittacula       pauper   platyspiza        fusca 
#>      0.88077      0.88077      0.88077      0.88077      0.88077      0.88077 
#> pinaroloxias     olivacea 
#>      0.88077      0.88077The phylo4d class matches trees with data, or combines
them with a data frame to make a phylo4d (tree-with-data)
object.
Now we’ll take the Geospiza data from
geospiza_raw$data and merge it with the tree. First, let’s
prepare the data:
g1 <- as(geospiza_raw$tree, "phylo4")
geodata <- geospiza_raw$dataHowever, since G. olivacea is included in the tree but not in the data set, we will initially run into some trouble:
g2 <- phylo4d(g1, geodata)
#> Error in formatData(phy = x, dt = tip.data, type = "tip", ...): The following nodes are not found in the dataset:  olivaceaTo deal with G. olivacea missing from the data, we have a few choices. The easiest is to use to allow to create the new object with a warning (you can also use to proceed without warnings):
g2 <- phylo4d(g1, geodata, missing.data="warn")
#> Warning in formatData(phy = x, dt = tip.data, type = "tip", ...): The following nodes are
#> not found in the dataset: olivaceaAnother way to deal with this would be to use prune() to
drop the offending tip from the tree first:
g1sub <- prune(g1, "olivacea")
g1B <- phylo4d(g1sub, geodata)The difference between the two objects is that the species G.
olivacea is still present in the tree but has no data (i.e.,
NA) associated with it. In the other case, G.
olivacea is not included in the tree anymore. The approach you
choose depends on the goal of your analysis.
You can summarize the new object with the function
summary. It breaks down the statistics about the traits
based on whether it is associated with the tips for the internal nodes:
<<geomergesum>>= summary(g2) @
Or use tdata() to extract the data (i.e.,
tdata(g2)). By default, tdata() will retrieve
tip data, but you can also get internal node data only () or — if the
tip and node data have the same format — all the data combined ().
If you want to plot the data (e.g. for checking the input),
plot(tdata(g2)) will create the default plot for the data —
in this case, since it is a data frame, this will be a
pairs plot of the data.
The subset command offers a variety of ways of
extracting portions of a phylo4 or phylo4d
tree, keeping any tip/node data consistent.
give a vector of tips (names or numbers) to retain
give a vector of tips (names or numbers) to drop
give a vector of node or tip names or numbers; extract the clade containing these taxa
give a node (name or number); extract the subtree starting from this node
Different ways to extract the fuliginosa-scandens clade:
subset(g2, tips.include=c("fuliginosa", "fortis", "magnirostris",
  "conirostris", "scandens"))
subset(g2, node.subtree=21)
subset(g2, mrca=c("scandens", "fortis"))One could drop the clade by doing
subset(g2, tips.exclude=c("fuliginosa", "fortis", "magnirostris",
  "conirostris", "scandens"))
subset(g2, tips.exclude=names(descendants(g2, MRCA(g2, c("difficilis",
"fortis")))))phylobase provides many functions that allows users to
explore relationships between nodes on a tree (tree-walking and tree
traversal). Most functions work by specifying the phylo4
(or phylo4d) object as the first argument, the node
numbers/labels as the second argument (followed by some additional
arguments).
getNode allows you to find a node based on its node
number or its label. It returns a vector with node numbers as values and
labels as names:
data(geospiza)
getNode(geospiza, 10)
#> pauper 
#>     10
getNode(geospiza, "pauper")
#> pauper 
#>     10If no node is specified, they are all returned, and if a node can’t
be found it’s returned as a NA. It is possible to control
what happens when a node can’t be found:
getNode(geospiza)
#>   fuliginosa       fortis magnirostris  conirostris     scandens   difficilis 
#>            1            2            3            4            5            6 
#>      pallida     parvulus   psittacula       pauper   Platyspiza        fusca 
#>            7            8            9           10           11           12 
#> Pinaroloxias     olivacea          N15          N16          N17          N18 
#>           13           14           15           16           17           18 
#>          N19          N20          N21          N22          N23          N24 
#>           19           20           21           22           23           24 
#>          N25          N26          N27 
#>           25           26           27
getNode(geospiza, 10:14)
#>       pauper   Platyspiza        fusca Pinaroloxias     olivacea 
#>           10           11           12           13           14
getNode(geospiza, "melanogaster", missing="OK") # no warning
#> <NA> 
#>   NA
getNode(geospiza, "melanogaster", missing="warn") # warning!
#> Warning in getNode(geospiza, "melanogaster", missing = "warn"): Some nodes not found
#> among all nodes in tree: melanogaster
#> <NA> 
#>   NAchildren and ancestor give the immediate
neighboring nodes:
children(geospiza, 16)
#>          N17 Pinaroloxias 
#>           17           13
ancestor(geospiza, 16)
#> N15 
#>  15while descendants and ancestors can
traverse the tree up to the tips or root respectively:
descendants(geospiza, 16) # by default returns only the tips
#> Pinaroloxias        fusca   Platyspiza   difficilis     scandens  conirostris 
#>           13           12           11            6            5            4 
#> magnirostris   fuliginosa       fortis      pallida       pauper     parvulus 
#>            3            1            2            7           10            8 
#>   psittacula 
#>            9
descendants(geospiza, "all") # also include the internal nodes
#> Warning in getNode(phy, node, missing = "warn"): Some nodes not found among all nodes in
#> tree: all
#> named list()
ancestors(geospiza, 20)
#> N19 N18 N17 N16 N15 
#>  19  18  17  16  15
ancestors(geospiza, 20, "ALL") # uppercase ALL includes self
#> N20 N19 N18 N17 N16 N15 
#>  20  19  18  17  16  15siblings returns the other node(s) associated with the
same ancestor:
siblings(geospiza, 20)
#> N25 
#>  25
siblings(geospiza, 20, include.self=TRUE)
#> N20 N25 
#>  20  25MRCA returns the most common recent ancestor for a set
of tips, and shortest path returns the nodes connecting 2 nodes:
MRCA(geospiza, 1:6)
#> N20 
#>  20
shortestPath(geospiza, 4, "pauper")
#> N19 N20 N21 N22 N25 N26 
#>  19  20  21  22  25  26multiPhylo4 classes are not yet implemented but will be
coming soon.
This section will describe a way of constructing a simulator that generates trait values for extant species (tips) given a tree with branch lengths, assuming a model of Brownian motion.
We can use to coerce the tree into a variance-covariance matrix form,
and then use mvrnorm from the MASS package to
generate a set of multivariate normally distributed values for the tips.
(A benefit of this approach is that we can very quickly generate a very
large number of replicates.) This example illustrates a common feature
of working with phylobase — combining tools from several
different packages to operate on phylogenetic trees with data.
We start with a randomly generated tree using rcoal()
from ape to generate the tree topology and branch
lengths:
set.seed(1001)
tree <- as(rcoal(12), "phylo4")Next we generate the phylogenetic variance-covariance matrix (by
coercing the tree to a phylo4vcov object) and pick a single
set of normally distributed traits (using to pick a multivariate normal
deviate with a variance-covariance matrix that matches the structure of
the tree).
vmat <- as(tree, "phylo4vcov")
vmat <- cov2cor(vmat)
library(MASS)
trvec <- mvrnorm(1, mu=rep(0, 12), Sigma=vmat)The last step (easy) is to convert the phylo4vcov object
back to a phylo4d object:
treed <- phylo4d(tree, tip.data=as.data.frame(trvec))
plot(treed)plot of chunk plotvcvphylo
This section details the internal structure of the
phylo4, multiphylo4 (coming soon!),
phylo4d, and multiphylo4d (coming soon!)
classes. The basic building blocks of these classes are the
phylo4 object and a dataframe. The phylo4 tree
format is largely similar to the one used by phylo class in
the package ape1.
We use “edge” for ancestor-descendant relationships in the phylogeny (sometimes called “branches”) and “edge lengths” for their lengths (“branch lengths”). Most generally, “nodes” are all species in the tree; species with descendants are “internal nodes” (we often refer to these just as “nodes”, meaning clear from context); “tips” are species with no descendants. The “root node” is the node with no ancestor (if one exists).
Like phylo, the main components of the
phylo4 class are:
a 2-column matrix of integers, with \(N\) rows for a rooted tree or \(N-1\) rows for an unrooted tree and column
names ancestor and descendant. Each row
contains information on one edge in the tree. See below for further
constraints on the edge matrix.
numeric list of edge lengths (length \(N\) (rooted) or \(N-1\) (unrooted) or empty (length 0))
character vector of tip labels (required), with length=# of tips. Tip labels need not be unique, but data-tree matching with non-unique labels will cause an error
character vector of node labels, length=# of internal nodes or 0 (if empty). Node labels need not be unique, but data-tree matching with non-unique labels will cause an error
character: “preorder”, “postorder”, or “unknown” (default),
describing the order of rows in the edge matrix. , “pruningwise” and
“cladewise” are accepted for compatibility with ape
The edge matrix must not contain NAs, with the exception
of the root node, which has an NA for
ancestor. phylobase does not enforce an order
on the rows of the edge matrix, but it stores information on the current
ordering in the slot — current allowable values are “unknown” (the
default), “preorder” (equivalent to “cladewise” in ape) or
“postorder” 2.
The basic criteria for the edge matrix are similar to those of
ape, as documented it’s tree specification3. This is a modified
version of those rules, for a tree with \(n\) tips and \(m\) internal nodes:
Tips (no descendants) are coded \(1,\ldots, n\), and internal nodes (\(\ge 1\) descendant) are coded \(n + 1, \ldots , n + m\) (\(n + 1\) is the root). Both series are numbered with no gaps.
The first (ancestor) column has only values \(> n\) (internal nodes): thus, values \(\le n\) (tips) appear only in the second (descendant) column
all internal nodes (not including the root) must appear in the
first (ancestor) column at least once [unlike ape, which
nominally requires each internal node to have at least two descendants
(although it doesn’t absolutely prohibit them and has a function to get
rid of them), phylobase does allow these “singleton nodes”
and has a method hasSingle for detecting them]. Singleton
nodes can be useful as a way of representing changes along a lineage;
they are used this way in the ouch package.
the number of occurrences of a node in the first column is related to the nature of the node: once if it is a singleton, twice if it is dichotomous (i.e., of degree 3 [counting ancestor as well as descendants]), three times if it is trichotomous (degree 4), and so on.
phylobase does not technically prohibit reticulations
(nodes or tips that appear more than once in the descendant column), but
they will probably break most of the methods. Disconnected trees,
cycles, and other exotica are not tested for, but will certainly break
the methods.
We have defined basic methods for
phylo4:show, print, and a variety
of accessor functions (see help files). summary does not
seem to be terribly useful in the context of a “raw” tree, because there
is not much to compute.
The phylo4d class extends phylo4 with data.
Tip data, and (internal) node data are stored separately, but can be
retrieved together or separately with tdata(x,"tip"),
tdata(x,"internal") or tdata(x,"all"). There
is no separate slot for edge data, but these can be stored as node data
associated with the descendant node.
see https://en.wikipedia.org/wiki/Tree_traversal for more
information on orderings. (ape’s “pruningwise” is
“bottom-up” ordering).↩︎