# Phylogenetic smoothing using mgcv

By Nicholas Clark in rstats mgcv

February 24, 2024

I have been highly interested in multivariate time series modeling for the last few years and have spent a lot of time working out how to use hierarchical GAMs to tackle many of the questions I’m interested in. As you may already know, GAMs afford us a huge amount of flexibility to address nonlinear associations in ecological models (
see this paper by Pedersen et al for some more context). I have done quite a bit of work to extend these models to also capture complex dynamic processes, which are available in my
`mvgam`

R package.

But I have recently become aware that it is possible to use phylogenetic or functional information to regularize these hierarchical functions. This works by taking advantage of the hugely flexible `mrf`

basis that is provided in `mgcv`

(see `?mgcv::mrf`

for details). This basis allows users to provide their own penalty matrices, which will act as a prior precision for the basis coefficients when estimating the model. By providing the right kind of penalty matrix, for example one that is constructed from a phylogenetic tree or functional dendrogram, we can force the model to *regularize* species’ nonlinear effects toward those from their most closely related neighbours. This is an incredible advance that opens many new possibilities for asking targeted questions about niche conservatism, trait evolution, functional redundancy and a whole host of other directions. A very basic example of how this can be done in `mgcv`

is presented here.

## Environment setup

Load libraries necessary for data manipulation and modeling

```
library(ape)
library(mgcv)
library(mvnfast)
library(ggplot2)
library(dplyr)
library(MRFtools) # devtools::install_github("eric-pedersen/MRFtools")
```

A utility function to simulate from a squared exponential Gaussian Process, which we will use to create species’ nonlinear temporal trends

```
sim_gp = function(N, alpha, rho){
Sigma <- alpha ^ 2 *
exp(-0.5 * ((outer(1:N, 1:N, "-") / rho) ^ 2)) +
diag(1e-9, N)
mvnfast::rmvn(1,
mu = rep(0, N),
sigma = Sigma)[1,]
}
```

## Phylogenetically structured trends

Simulate a random phylogenetic tree to inform species’ relationships

```
N_species <- 12
tree <- rcoal(N_species, tip.label = paste0('sp_', 1:N_species))
species_names <- tree$tip.label
plot(tree)
```

Now simulate a ‘shared’ nonlinear temporal trend, which will anchor each species’ trend

```
N <- 50
shared <- sim_gp(N, alpha = 1, rho = 8) + 10
plot(shared, type = 'l', lwd = 2, xlab = 'Time', ylab = 'Trend')
```

Next we construct the phylogenetically-informed trends. In this example, each species’ actual trend is a perturbation of the shared trend, whereby the final trend is a weighted sum of the shared trend and two other GP trends. Because the weights are simulated using phylogenetic information (using the `rTraitCont()`

function from the `ape`

library), this process allows us to construct smooth trends that will hopefully capture the property we’re after, i.e. that more closely related species will have more similar functional shapes

```
warp1 <- sim_gp(N, alpha = 2, rho = 20) + 10
warp2 <- sim_gp(N, alpha = 2, rho = 20) + 10
weights1 <- as.vector(scale(rTraitCont(tree)))
weights2 <- as.vector(scale(rTraitCont(tree)))
```

Create the trends for each species and take noisy observations. For the third and seventh species, we set observations to `NA`

so we can test if the model is able to recover their trends. Store all necessary data in a `data.frame`

```
dat <- do.call(rbind,
lapply(seq_len(N_species),
function(i){
sp_trend <- warp1 * weights1[i] +
warp2 * weights2[i] + shared
obs <- rnorm(N,
mean = as.vector(scale(sp_trend)),
sd = 0.35)
if(i %in% c(3, 7)){
weight <- 0
obs <- NA
} else {
weight <- 1
}
data.frame(species = species_names[i],
weight = weight,
time = 1:N,
truth = as.vector(scale(sp_trend)),
y = obs)
}))
dat$species <- factor(dat$species, levels = species_names)
```

We’ll also leave out the last 5 observations for each species so we can see how well (or how poorly) the trends extrapolate, though this isn’t the primary focus of the example

```
dat %>%
dplyr::mutate(y = dplyr::case_when(
time <= N-5 ~ y,
time > N-5 ~ NA,
TRUE ~ y
)) -> dat
```

## Data visualisation

Plot the true simulated trends for each species

```
ggplot(dat, aes(x = time, y = truth, col = species)) +
geom_line() +
facet_wrap(~species, scales = 'free_y') +
theme_classic() +
theme(legend.position = 'none')
```

Plot the noisy observations (noting that all obs are missing for two species)

```
ggplot(dat, aes(x = time, y = y*weight, col = species)) +
geom_point() +
facet_wrap(~species, scales = 'free_y') +
theme_classic() +
theme(legend.position = 'none')
```

```
## Warning: Removed 150 rows containing missing values or values outside the scale range
## (`geom_point()`).
```

## Model setup

Create the MRF penalty matrix using the phylogenetic precision matrix

```
omega <- solve(vcv(tree))
```

Now add an MRF penalty that forces the temporal trend to evolve as a Random Walk using utilities provided by Pedersen et al’s `MRFtools`

package. This requires that we have a factor variable for `time`

in our data, and we should ensure the levels of this `time_factor`

go as high as we would potentially like to forecast. Note that this package can also create the phylogenetic penalty but I feel it is better to show these steps explicitly for this example.

```
rw_penalty <- mrf_penalty(object = 1:max(dat$time),
type = 'linear')
dat$time_factor <- factor(1:max(dat$time))
```

Fit a GAM using a tensor product of the RW MRF basis and the phylogenetic MRF basis. We also use a ‘shared’ smooth of time so that the phylogenetic smooths are estimated as deviations around this shared smooth. Set `drop.unused.levels = FALSE`

to ensure there are no errors because of the extra species and times in the penalty matrices

```
mod <- gam(y ~ s(time, k = 10) +
te(time_factor, species,
bs = c("mrf", "mrf"),
k = c(8, N_species),
xt = list(list(penalty = rw_penalty),
list(penalty = omega))),
data = dat,
drop.unused.levels = FALSE,
method = "REML")
summary(mod)
```

```
##
## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ s(time, k = 10) + te(time_factor, species, bs = c("mrf",
## "mrf"), k = c(8, N_species), xt = list(list(penalty = rw_penalty),
## list(penalty = omega)))
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03469 0.01665 -2.084 0.0378 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(time) 6.011 6.576 2.239 0.0281 *
## te(time_factor,species) 67.709 79.000 39.045 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.892 Deviance explained = 91%
## -REML = 291.42 Scale est. = 0.12468 n = 450
```

## Predictions and evaluation

Calculate predictions from the model for the full dataset (including the missing species) and overlay the actual true simulated trends as black points. Did the model successfully estimate the missing species’ temporal trend?

```
preds <- predict(mod, newdata = dat, type = 'response', se = TRUE)
dat$pred <- preds$fit
dat$upper <- preds$fit + 1.96*preds$se.fit
dat$lower <- preds$fit - 1.96*preds$se.fit
ggplot(dat, aes(x = time, y = pred, col = species)) +
geom_point(aes(y = truth), col = 'black') +
geom_line(linewidth = 1, alpha = 0.6) +
geom_ribbon(aes(ymin = lower, ymax = upper, fill = species),
alpha = 0.3, col = NA) +
facet_wrap(~species, scales = 'free_y') +
theme_classic() +
theme(legend.position = 'none')
```

Hot Damn it worked! But could we recover these missing trends without the information provided in the phylogenetic structure? Fit a second GAM that uses a similar hierarchical smooth of time, but in this case the deviations around the shared smooth do not have any phylogenetic information to leverage

```
mod2 <- gam(y ~ s(time, k = 10) +
s(time, species, bs = 'fs', k = 8),
data = dat,
method = "REML",
drop.unused.levels = FALSE)
```

```
## Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has repeated
## 1-d smooths of same variable.
```

```
summary(mod2)
```

```
##
## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ s(time, k = 10) + s(time, species, bs = "fs", k = 8)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03469 0.03615 -0.96 0.338
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(time) 7.668 8.285 5.13 3.06e-06 ***
## s(time,species) 56.239 78.000 38.27 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.889 Deviance explained = 90.5%
## -REML = 285.82 Scale est. = 0.12807 n = 450
```

Now predict from the non-phylogenetic model

```
preds <- predict(mod2, newdata = dat, type = 'response', se = TRUE)
dat$pred <- preds$fit
dat$upper <- preds$fit + 1.96*preds$se.fit
dat$lower <- preds$fit - 1.96*preds$se.fit
ggplot(dat, aes(x = time, y = pred, col = species)) +
geom_point(aes(y = truth), col = 'black') +
geom_line(linewidth = 1, alpha = 0.6) +
geom_ribbon(aes(ymin = lower, ymax = upper, fill = species),
alpha = 0.3, col = NA) +
facet_wrap(~species, scales = 'free_y') +
theme_classic() +
theme(legend.position = 'none')
```

Predictions from this model draw from the ‘average’ smooth, rather than leveraging phylogenetic information, to predict the trends for the missing species. So the predictions for both missing species are identical. Obviously we can tell by eye that the predictions are worse than those from the phylogenetic model. But we could use Continuous Rank Probability Scores for each model’s predictions to quantify *how much* worse

## Further reading

The following papers and resources offer useful material about Hierarchical Generalized Additive Models and comparative phylogenetic modeling

Blomberg, S. P., Garland Jr, T., & Ives, A. R. (2003).
Testing for phylogenetic signal in comparative data: behavioral traits are more labile. *Evolution*, 57(4), 717-745.

Clark, N. J., Drovetski, S. V., & Voelker, G. (2020).
Robust geographical determinants of infection prevalence and a contrasting latitudinal diversity gradient for haemosporidian parasites in Western Palearctic birds. *Molecular Ecology*, 29(16), 3131-3143.

Clark, N. J. (2023) Ecological forecasting with R ðŸ“¦’s {mvgam} and {brms}. A workshop hosted for Physalia Courses

Pedersen, E. J., Miller, D. L., Simpson, G. L., & Ross, N. (2019).
Hierarchical generalized additive models in ecology: an introduction with mgcv. *PeerJ*, 7, e6876.