# Simulating data for model exploration in R

By Nicholas Clark in rstats simulation

March 12, 2024

## Purpose

The following passage is quoted directly from *Regression and other stories* (Gelman, Hill and Vehtari 2022). “Simulation of random variables is important in applied statistics for several reasons. First, we use probability models to mimic variation in the world, and the tools of simulation can help us better understand how this variation plays out. Patterns of randomness are notoriously contrary to normal human thinking—our brains don’t seem to be able to do a good job understanding that random swings will be present in the short term but average out in the long run—and in many cases simulation is a big help in training our intuitions about averages and variation. Second, we can use simulation to approximate the sampling distribution of data and propagate this to the sampling distribution of statistical estimates and procedures. Third, regression models are not deterministic; they produce probabilistic predictions. Simulation is the most convenient and general way to represent uncertainties in forecasts”

This script walks through some potential uses of simulation, but it is by no means exhaustive. First, generate some nice plotting colours

```
c_light <- c("#DCBCBC")
c_light_highlight <- c("#C79999")
c_mid <- c("#B97C7C")
c_mid_highlight <- c("#A25050")
c_dark <- c("#8F2727")
c_dark_highlight <- c("#7C0000")
```

Next we define a few helper functions for making nice plots

```
myhist = function(x,
                  xlim,
                  xlab = '',
                  main = ''){
  if(missing(xlim)){
    xlim <- range(x)
  }
  hist(x,
       xlim = xlim,
       yaxt = 'n',
       xlab = xlab,
       ylab = '',
       col = c_mid_highlight,
       border = 'white',
       lwd = 2,
       breaks = 20,
       main = main)
}

myscatter = function(x,
                     y,
                     xlab = '',
                     ylab = ''){
  plot(x = x,
       y = y,
       pch = 16,
       col = 'white',
       cex = 1.25,
       bty = 'l',
       xlab = xlab,
       ylab = ylab)
  points(x = x, y = y, pch = 16,
         col = c_dark,
         cex = 1)
  box(bty = 'l', lwd = 2)
}

mypred_plot = function(predictions, pred_values,
                       ylim,
                       ylab = 'Predictions', xlab = 'Covariate'){
  probs <- c(0.05, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.95)
  cred <- sapply(1:NROW(predictions),
                 function(n) quantile(predictions[n,],
                                      probs = probs, na.rm = TRUE))
  if(missing(ylim)){
    ylim <- range(cred)
  }
  plot(1, type = "n", bty = 'L',
       xlab = xlab,
       ylab = ylab,
       xlim = range(pred_values),
       ylim = ylim)
  polygon(c(pred_values, rev(pred_values)), c(cred[1,], rev(cred[9,])),
          col = c_light, border = NA)
  polygon(c(pred_values, rev(pred_values)), c(cred[2,], rev(cred[8,])),
          col = c_light_highlight, border = NA)
  polygon(c(pred_values, rev(pred_values)), c(cred[3,], rev(cred[7,])),
          col = c_mid, border = NA)
  polygon(c(pred_values, rev(pred_values)), c(cred[4,], rev(cred[6,])),
          col = c_mid_highlight, border = NA)
  lines(pred_values, cred[5,], col = c_dark, lwd = 2.5)
  box(bty = 'l', lwd = 2)
}
```

## Simulating a simple linear regression

A linear regression generally assumes additive, linear effects of predictors on the mean of a response variable. It also assumes the errors (unexplained variation) are independent and identically distributed (i.e. the same variance / SD applies to the full distribution of errors). We will simulate a vegetation cover outcome variable under the following model:

`$$\boldsymbol{vegetation}\sim \text{Normal}(\mu,\sigma)$$`

A linear predictor is modeled on the `\(identity\)` scale to capture effects of covariates on `\(\mu\)`:

`$$\mu=\alpha+{\beta_{temp}}*\boldsymbol{temperature}$$`

We simulate data with a single predictor, `\(temperature\)`, and with varying sample sizes to inspect the influence of sample size on our inferences. Start with a sample size of `300`.

```
N <- 300
```

First we will simulate the predictor variable, `\(temperature\)`. Our first decision is to think about what distribution the predictor might have. For many modelling approaches, it is often useful to transform continuous predictors so they are standard normal in distribution (i.e. `\(\boldsymbol{temperature}\sim \text{Normal}(0,1)\)`). This might be the case if we were to measure temperature and then standardise it to unit variance

```
set.seed(1971)
temperature_raw <- rnorm(n = N, mean = 14, sd = 5)
myhist(temperature_raw, xlab = 'Raw temperature')
```

It is often useful to standardize variables to unit variance before modelling; this helps us to compare effects from different predictors but also helps to specify useful prior distributions when fitting Bayesian models. This is most commonly done by subtracting the mean (shifting it to `0`) and dividing by the standard deviation (`1 SD`)

```
temperature_scaled <- (temperature_raw - mean(temperature_raw)) /
sd(temperature_raw)
myhist(temperature_scaled, xlab = 'Scaled temperature')
```
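
As a quick aside, base `R`'s `scale()` function does the same centring and scaling in one step; a one-line sketch confirms the two approaches agree

```
# scale() centres and divides by the sample SD by default;
# as.numeric() drops the matrix attributes that scale() attaches
all.equal(as.numeric(scale(temperature_raw)), temperature_scaled)
```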

This shifts and rescales the distribution, but it does not change the underlying properties of the variable; the raw and scaled versions are perfectly linearly related

```
myscatter(x = temperature_raw,
y = temperature_scaled,
xlab = 'Raw temperature',
ylab = 'Scaled temperature')
```

Now simulate the actual linear effect of `\(temperature\)` on the mean of our outcome, `\(vegetation\)`. This is captured by the coefficient `\({\beta_{temp}}\)`

```
beta_temp <- 0.25
```

Next we need to simulate the intercept `\(\alpha\)`. This parameter represents the average vegetation when temperature is at `0`

```
alpha <- 1
```

The other parameter we need to simulate is the standard deviation of the errors, `\(\sigma\)`; we don’t expect that temperature is the only mechanism influencing vegetation, so we need to include unexplained variation

```
sigma <- 2
```

Finally, we have the components to simulate our outcome variable, `\(vegetation\)`. Simulate the linear predictor, which is `\(\mu=\alpha+{\beta_{temp}}*\boldsymbol{temperature}\)`

```
mu <- alpha + beta_temp * temperature_raw
myhist(mu, xlab = expression(mu))
```

Next, complete the probabilistic data generation by simulating the observations. Here we assume vegetation is normally distributed, so we use the random number generator `rnorm()`. Use `?rnorm` to learn more about normal distribution functions in `R`

```
vegetation <- rnorm(n = N, mean = mu, sd = sigma)
myhist(vegetation, xlab = 'Vegetation')
```

There should be an obvious positive relationship between raw `\(temperature\)` and `\(vegetation\)`; but it also shouldn’t be a perfect 1:1 relationship

```
myscatter(x = temperature_raw,
y = vegetation,
xlab = 'Raw temperature',
ylab = 'Vegetation')
```

We can now estimate the effect of temperature on the mean of vegetation using the `lm()` function in `R`. This function uses ordinary least squares to estimate model parameters, which is equivalent to maximum likelihood estimation for a Gaussian error model. But we will use the scaled version of temperature to showcase how this still gives us coefficients that we can interpret

```
veg_mod <- lm(vegetation ~ temperature_scaled)
summary(veg_mod)
```

```
##
## Call:
## lm(formula = vegetation ~ temperature_scaled)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3773 -1.4182 -0.0611 1.2073 5.6760
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6737 0.1168 40.01 <2e-16 ***
## temperature_scaled 1.2222 0.1170 10.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.023 on 298 degrees of freedom
## Multiple R-squared: 0.268, Adjusted R-squared: 0.2655
## F-statistic: 109.1 on 1 and 298 DF, p-value: < 2.2e-16
```

The coefficients are harder to interpret now because we scaled the predictor

```
coef(veg_mod)
```

```
## (Intercept) temperature_scaled
## 4.673721 1.222223
```

Compare these estimates to the true values

```
cbind(alpha, beta_temp)[1,]
```

```
## alpha beta_temp
## 1.00 0.25
```

Both of these coefficients need to be interpreted by taking into account the standard deviation of the predictor variable, `\(temperature\)`. Once we back-transform, we can see the point estimates are very close to the true simulated values. This is a useful first check to ensure the model is able to recover simulated parameters. It is always surprising how often a model is not able to recover these when we simulate, especially for more complex models / simulations

```
coef(veg_mod) / sd(temperature_raw)
```

```
## (Intercept) temperature_scaled
## 0.9237536 0.2415704
```
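
Note that dividing by the predictor's standard deviation back-transforms the slope exactly, while the intercept also depends on the predictor's mean. A short sketch of the fuller back-transformation (not part of the original workflow, included here for completeness):

```
# slope on the raw temperature scale
beta_raw <- coef(veg_mod)[2] / sd(temperature_raw)
# the raw-scale intercept also needs the predictor's mean
alpha_raw <- coef(veg_mod)[1] -
  coef(veg_mod)[2] * mean(temperature_raw) / sd(temperature_raw)
c(alpha = unname(alpha_raw), beta_temp = unname(beta_raw))
```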

We can also calculate uncertainties of these coefficients, again on the appropriate scale

```
confint(veg_mod) / sd(temperature_raw)
```

```
## 2.5 % 97.5 %
## (Intercept) 0.8783128 0.9691943
## temperature_scaled 0.1960538 0.2870871
```

Finally, for most maximum likelihood models in `R` (`lm`, `glm`, `gam` etc.) we can draw samples of regression coefficients from the implied posterior distribution. Maximum likelihood finds the posterior mode (assuming flat priors) and uses the curvature around the mode to approximate the posterior with a Gaussian (Normal) distribution, so the implied posterior of the coefficients is multivariate normal. All we need to generate samples from this posterior is the variance-covariance matrix and the mean coefficients

```
mean_coefs <- coef(veg_mod)
coef_vcv <- vcov(veg_mod)
```

Generate 2,000 regression coefficient samples from the implied multivariate normal posterior distribution. We can use the `mvrnorm()` function from the `MASS` package to make this simple

```
if(!require(MASS)){
install.packages('MASS')
}
```

```
## Loading required package: MASS
```

```
coef_posterior <- MASS::mvrnorm(n = 2000,
mu = mean_coefs,
Sigma = coef_vcv)
head(coef_posterior)
```

```
## (Intercept) temperature_scaled
## [1,] 4.821914 1.2911135
## [2,] 4.572132 1.2185941
## [3,] 4.612218 1.4297919
## [4,] 4.534363 1.3223695
## [5,] 4.636644 1.3577525
## [6,] 4.866013 0.9418046
```

Instead of simply looking at confidence intervals of coefficients, we can now inspect the full distribution and overlay the true simulated value. Remember to adjust for the scaling of the variable

```
myhist(x = coef_posterior[,2] / sd(temperature_raw),
xlab = 'Posterior temp coefficient (unscaled)')
abline(v = beta_temp, lwd = 3, col = 'white')
abline(v = beta_temp, lwd = 2.5, col = 'black')
```

Prediction intervals can also be obtained once we have the posterior estimates. Predicting from models is a crucial step for understanding them and interrogating their weaknesses. To make predictions, we usually must supply `newdata` for predicting against. This would mean a new set of temperature measurements, in our case. The best way to do this, depending on model complexity, is to create a sequence of measurements for the variable of interest that we’d like predictions for. See `?predict.lm` for more details

```
newtemps <- seq(from = min(temperature_scaled),
to = max(temperature_scaled),
length.out = 1000)
summary(newtemps)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.26551 -1.61228 0.04094 0.04094 1.69416 3.34738
```
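
As a cross-check on the manual approach below, `predict.lm()` can compute intervals directly when supplied with `newdata`. This is a hedged sketch rather than part of the original workflow; the column name in `newdata` must match the predictor used in the model formula

```
# newdata must use the same predictor name as the model formula
newdata <- data.frame(temperature_scaled = newtemps)
# confidence intervals describe uncertainty in the expectation (the mean line)
head(predict(veg_mod, newdata = newdata, interval = 'confidence'))
# prediction intervals also include the observation-level noise
head(predict(veg_mod, newdata = newdata, interval = 'prediction'))
```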

Create an empty `matrix` to store the predictions. We want to predict at these new `\(temperature\)` values for each combination of posterior coefficients so we can visualise uncertainty in those predictions

```
predictions <- matrix(NA, ncol = NROW(coef_posterior),
nrow = length(newtemps))
dim(predictions)
```

```
## [1] 1000 2000
```

Now fill the matrix with the predicted mean for each of the new `\(temperature\)` measurements, using each posterior draw of regression coefficients

```
for(i in 1:NCOL(predictions)){
predictions[,i] <- coef_posterior[i,1] + (coef_posterior[i,2] * newtemps)
}
```
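
For larger problems the same matrix of expectations can be filled without a loop; a small sketch using matrix multiplication, which should match the loop above up to floating point error

```
# design matrix (intercept, scaled temperature) times transposed coefficient draws
predictions_vec <- cbind(1, newtemps) %*% t(coef_posterior)
# the largest absolute difference from the loop-based version should be ~0
max(abs(predictions_vec - predictions))
```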

Plot empirical quantile ribbons of the predictions to visualise them, and overlay the simulated values as points

```
mypred_plot(predictions = predictions,
xlab = 'Temperature (scaled)',
ylab = 'Vegetation (expectation only)',
pred_values = newtemps,
ylim = range(c(vegetation,
quantile(predictions, probs = c(0.025, 0.975)))))
points(pch = 16, x = temperature_scaled, y = vegetation,
col = 'white',
cex = 1)
points(pch = 16, x = temperature_scaled, y = vegetation,
cex = 0.8)
```

The uncertainty about this line seems far too narrow, and that’s because it is! We have ignored the stochastic component, which is the normally distributed errors. This means we have only predicted the mean, otherwise known as the expectation. But we can easily compute full probabilistic predictions and re-plot. All we need is the estimate of `\(\sigma\)`, which is stored as `sigma` in the model summary. This estimate is extremely close to our true simulated value

```
sigma_estimate <- summary(veg_mod)$sigma
sigma_estimate; sigma
```

```
## [1] 2.023472
```

```
## [1] 2
```

Now compute predictions using normal random draws with appropriate probabilistic uncertainty

```
predictions <- matrix(NA, ncol = NROW(coef_posterior),
nrow = length(newtemps))
for(i in 1:NCOL(predictions)){
predictions[,i] <- rnorm(n = NROW(predictions),
mean = coef_posterior[i,1] + (coef_posterior[i,2] * newtemps),
sd = sigma_estimate)
}
```

Plot the full prediction distribution

```
mypred_plot(predictions = predictions,
xlab = 'Temperature (scaled)',
ylab = 'Vegetation (full uncertainty)',
pred_values = newtemps,
ylim = range(c(vegetation,
quantile(predictions, probs = c(0.025, 0.975)))))
points(pch = 16, x = temperature_scaled, y = vegetation,
col = 'white',
cex = 1)
points(pch = 16, x = temperature_scaled, y = vegetation,
cex = 0.8)
```

This model fits the data very well, with good coverage properties of the prediction uncertainties. But what happens if we decrease the sample size to `10` and run the process again?

```
N <- 10
# store the SD of the original N = 300 temperature sample so that simulation's
# coefficients can be back-transformed correctly later on
sd_temp300 <- sd(temperature_raw)
temperature_raw <- rnorm(n = N, mean = 14, sd = 5)
temperature_scaled <- (temperature_raw - mean(temperature_raw)) /
  sd(temperature_raw)
mu <- alpha + beta_temp * temperature_raw
vegetation <- rnorm(n = N, mean = mu, sd = sigma)
veg_mod <- lm(vegetation ~ temperature_scaled)
mean_coefs <- coef(veg_mod); coef_vcv <- vcov(veg_mod)
coef_posterior_new <- MASS::mvrnorm(n = 2000,
                                    mu = mean_coefs,
                                    Sigma = coef_vcv)
```

Plot posterior estimates from both simulations, but use the same `x-axis` limits so the distributions can be compared

```
layout(matrix(1:2, nrow = 2))
myhist(x = coef_posterior[,2] / sd_temp300,
       xlim = range(c(coef_posterior[,2] / sd_temp300,
                      coef_posterior_new[,2] / sd(temperature_raw))),
       xlab = '', main = 'N = 300')
abline(v = beta_temp, lwd = 3, col = 'white')
abline(v = beta_temp, lwd = 2.5, col = 'black')
myhist(x = coef_posterior_new[,2] / sd(temperature_raw),
       xlim = range(c(coef_posterior[,2] / sd_temp300,
                      coef_posterior_new[,2] / sd(temperature_raw))),
       xlab = 'Posterior temp coefficient (unscaled)',
       main = 'N = 10')
abline(v = beta_temp, lwd = 3, col = 'white')
abline(v = beta_temp, lwd = 2.5, col = 'black')
```

```
layout(1)
```

How does this wider uncertainty translate into the predictions?

```
sigma_estimate <- summary(veg_mod)$sigma
predictions <- matrix(NA, ncol = NROW(coef_posterior_new),
nrow = length(newtemps))
for(i in 1:NCOL(predictions)){
predictions[,i] <- rnorm(n = NROW(predictions),
mean = coef_posterior_new[i,1] + (coef_posterior_new[i,2] * newtemps),
sd = sigma_estimate)
}
mypred_plot(predictions = predictions,
xlab = 'Temperature (scaled)',
ylab = 'Vegetation (full uncertainty)',
pred_values = newtemps,
ylim = range(c(vegetation,
quantile(predictions, probs = c(0.025, 0.975)))))
points(pch = 16, x = temperature_scaled, y = vegetation,
col = 'white',
cex = 1)
points(pch = 16, x = temperature_scaled, y = vegetation,
cex = 0.8)
```

Changing sample size is just one way to get a feel for how challenging some modelling problems can be to tackle. You can also change `\(\sigma\)`, the `\({\beta_{temp}}\)` coefficient, and / or the `\(\alpha\)` coefficient to understand what those changes might mean.
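
One convenient way to run those experiments is to wrap the whole data-generating and model-fitting process in a function. The sketch below (the function name `simulate_veg()` and its defaults are ours, not part of the original script) returns the back-transformed slope estimate and its confidence interval for any combination of sample size and parameters

```
simulate_veg = function(N = 300, alpha = 1, beta_temp = 0.25, sigma = 2){
  # simulate the raw predictor and standardise it
  temperature_raw <- rnorm(n = N, mean = 14, sd = 5)
  temperature_scaled <- (temperature_raw - mean(temperature_raw)) /
    sd(temperature_raw)
  # simulate the outcome from the linear predictor plus Gaussian errors
  vegetation <- rnorm(n = N, mean = alpha + beta_temp * temperature_raw,
                      sd = sigma)
  # fit the model and back-transform the slope and its confidence interval
  mod <- lm(vegetation ~ temperature_scaled)
  c(estimate = unname(coef(mod)[2]) / sd(temperature_raw),
    confint(mod)[2, ] / sd(temperature_raw))
}
# example: a smaller sample with much noisier errors
simulate_veg(N = 50, sigma = 4)
```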

## Inspecting confounding and its influence on causal inferences

In this next example, we will simulate a data-generating process that contains a strong source of bias (in the form of a collider). This will illustrate how wrong our conclusions can be if we simply throw all variables into a regression and interpret them as real risk factors / causal effects. We will use a similar approach as above, where the goal is to model `\(vegetation\)` density

```
set.seed(11155)
N <- 100
```

`\(temperature\)` again is our main predictor of interest. This time we’ll skip the raw-measurement step and simulate temperature directly on the standardised scale for simplicity

```
temperature <- rnorm(N,
mean = 0,
sd = 1)
```

`\(temperature\)` again has a strong positive, linear effect on `\(vegetation\)`

```
beta_temp <- 0.75
vegetation <- rnorm(N,
mean = beta_temp * temperature,
sd = 1)
myscatter(x = temperature,
y = vegetation,
xlab = 'Temperature',
ylab = 'Vegetation')
```

But this time, `\(temperature\)` also influences `\(pollinator density\)`

```
beta_temp_pollinators <- 0.75
```

And `\(vegetation\)`, our outcome of interest, equally influences `\(pollinator density\)`

```
beta_veg_pollinators <- 0.75
```

The problem here is that pollinator density now exists as a collider variable; both our outcome of interest and one of our predictors influence it. This is an all-too-common issue we are faced with in regression modelling

```
pollinators <- rnorm(N,
mean = beta_temp_pollinators * temperature +
beta_veg_pollinators * vegetation,
sd = 1)
```

There are now strong apparent relationships all around

```
layout(matrix(1:2, nrow = 2))
myscatter(x = temperature,
y = pollinators,
xlab = 'Temperature',
ylab = 'Pollinators')
myscatter(x = vegetation,
y = pollinators,
xlab = 'Vegetation',
ylab = 'Pollinators')
```

```
layout(1)
```

If we thought that pollinator density could be a reasonable predictor of vegetation, the standard approach we would often take would be to fit univariate models for each of our predictors and then choose which ones to include in a final model based on some arbitrary significance threshold. Let’s see what happens if we do this

```
temp_only <- lm(vegetation ~ temperature)
summary(temp_only)
```

```
##
## Call:
## lm(formula = vegetation ~ temperature)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.44451 -0.69136 0.03974 0.65918 2.26335
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.14338 0.09937 -1.443 0.152
## temperature 0.79203 0.09952 7.959 3.11e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9885 on 98 degrees of freedom
## Multiple R-squared: 0.3926, Adjusted R-squared: 0.3864
## F-statistic: 63.34 on 1 and 98 DF, p-value: 3.106e-12
```

```
pollinators_only <- lm(vegetation ~ pollinators)
summary(pollinators_only)
```

```
##
## Call:
## lm(formula = vegetation ~ pollinators)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.94531 -0.57399 -0.06453 0.57038 1.91163
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.15133 0.08499 -1.781 0.0781 .
## pollinators 0.53784 0.04864 11.057 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.846 on 98 degrees of freedom
## Multiple R-squared: 0.555, Adjusted R-squared: 0.5505
## F-statistic: 122.2 on 1 and 98 DF, p-value: < 2.2e-16
```

Both predictors are found to be significant in univariate models; the standard approach we are taught now is to include them both in a final multivariable model

```
final_mod <- lm(vegetation ~ temperature + pollinators)
summary(final_mod)
```

```
##
## Call:
## lm(formula = vegetation ~ temperature + pollinators)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8385 -0.5926 -0.0930 0.5014 1.8816
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.15736 0.08454 -1.861 0.0657 .
## temperature 0.19296 0.12839 1.503 0.1361
## pollinators 0.45497 0.07333 6.205 1.35e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8407 on 97 degrees of freedom
## Multiple R-squared: 0.5652, Adjusted R-squared: 0.5562
## F-statistic: 63.04 on 2 and 97 DF, p-value: < 2.2e-16
```

This model is supported by `AIC` over the simpler models. The lower the `AIC`, the better the penalized in-sample predictions from the model

```
c('full model better',
  'temp only better',
  'pollinators only better')[which.min(c(AIC(final_mod),
                                         AIC(temp_only),
                                         AIC(pollinators_only)))]
```

```
## [1] "full model better"
```

A type II `ANOVA` reveals that if we wanted the most parsimonious model, we’d be justified in dropping `\(temperature\)` (even though this is the effect we actually care about!)

```
drop1(final_mod, test = 'F')
```

```
## Single term deletions
##
## Model:
## vegetation ~ temperature + pollinators
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 68.551 -31.759
## temperature 1 1.5962 70.147 -31.457 2.2586 0.1361
## pollinators 1 27.2062 95.757 -0.335 38.4969 1.349e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

In fact, just about any model selection routine you use on these data (stepwise selection, ridge regression, LASSO etc.) will favour the multivariable model, or even one that discards `\(temperature\)` altogether. But how does the estimated effect of temperature on vegetation compare with what we simulated as the actual truth?

```
confint(final_mod)[2,]
```

```
## 2.5 % 97.5 %
## -0.06186813 0.44778663
```

```
beta_temp
```

```
## [1] 0.75
```

It misses. **Badly**. This is because of the influence of collider bias. We must be very cautious when throwing all sorts of variables into models and assuming the resulting estimates can be interpreted in any meaningful way. The correct approach here would actually be to omit `\(pollinators\)` altogether. This is better for inference, even though it doesn’t fit the data as well. We’ve already done that in our `temp_only` model

```
confint(temp_only)[2,]
```

```
## 2.5 % 97.5 %
## 0.5945409 0.9895126
```

```
beta_temp
```

```
## [1] 0.75
```

## Effects are often estimated with more precision when including other predictors

Here we will explore why it is often important to adjust for other variables when estimating effects, even if those other variables are not “significant”. Simulate a scenario where age and season show independent effects on some disease’s infection probability

```
set.seed(5555)
N <- 200
```

Simulate `\(age\)` as a log-normally distributed, positive continuous variable

```
age <- rlnorm(N)
myhist(age, xlab = 'Age')
```

Simulate a cyclic seasonal pattern to replicate what we often see with transmittable diseases, which is a fairly strong and consistent variation in risk across seasons. This type of simulation can be done with combinations of `sine` and `cosine` functions

```
season_function <- rep((sin(0.2 * 1:12) - cos(-0.2 * 1:12) + sin(1:12)),
50)[1:N] * 0.1
plot(season_function, type = 'l',
xlab = 'Time', ylab = 'Seasonal effect')
```

Convert season to a numeric indicator of month, just like we would use in a model

```
season <- rep(1:12, ceiling(N / 12))[1:N]
head(season, 24)
```

```
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12
```

Calculate infection probability with an `\(age\)` effect of `-0.4` (on the logit scale) and with a minor seasonal effect. We use the `plogis` function, which applies the `inverse logit` transformation to convert real-valued continuous variables to the probability scale. This is the inverse of the same `logit` link function that is used in logistic regression (where the outcomes are `1s` and `0s`) and in Beta regression (where the outcome is a proportion on `(0,1)`). We will use Beta regression here, which assumes the following:
`$$\boldsymbol{inf~prob}\sim \text{Beta}(logit^{-1}(\mu)*\phi,(1-logit^{-1}(\mu))*\phi)$$`

`$$\mu={\alpha}+{\beta_{age}}*\boldsymbol{log(age)}+f(season)$$`

Where `\(\phi\)` is a shape parameter that controls the spread of the distribution. We can use `plogis` to convert our simulated values from the real line to the probability scale (this is a fast and efficient version of the `inverse logit`)
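
If the `inverse logit` is unfamiliar, a one-line sketch confirms that `plogis()` is simply `\(1 / (1 + e^{-x})\)`, with `qlogis()` as its inverse

```
x <- c(-2, 0, 2)
# columns: plogis(), the inverse logit written out, and qlogis() undoing plogis()
cbind(plogis(x), 1 / (1 + exp(-x)), qlogis(plogis(x)))
```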

```
beta_logage <- -0.4
inf_prob <- plogis(-0.3 + season_function + beta_logage * log(age))
myhist(inf_prob)
```

Fit a beta regression model that ignores season

```
library(mgcv)
```

```
## Loading required package: nlme
```

```
## This is mgcv 1.8-42. For overview type 'help("mgcv-package")'.
```

```
mod_noseason <- gam(inf_prob ~ log(age),
family = betar(),
method = 'REML')
```

The estimate for `\(log(age)\)` is fairly precise

```
summary(mod_noseason)
```

```
##
## Family: Beta regression(605.665)
## Link function: logit
##
## Formula:
## inf_prob ~ log(age)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.247234 0.005879 -42.05 <2e-16 ***
## log(age) -0.390477 0.006448 -60.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## R-sq.(adj) = 0.951 Deviance explained = 95.2%
## -REML = -493.13 Scale est. = 1 n = 200
```

Fit a model that includes only `\(season\)`, which finds that this predictor is not statistically significant on its own. Again, the standard practice would be to drop this predictor when we proceed with a multivariable analysis

```
mod_seasononly <- gam(inf_prob ~ s(season, bs = 'cc'),
family = betar(),
method = 'REML')
```

```
summary(mod_seasononly)
```

```
##
## Family: Beta regression(29.583)
## Link function: logit
##
## Formula:
## inf_prob ~ s(season, bs = "cc")
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.24268 0.02574 -9.429 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(season) 0.0004364 8 0 0.607
##
## R-sq.(adj) = 5.21e-08 Deviance explained = 0.000239%
## -REML = -196.38 Scale est. = 1 n = 200
```

Fit a model that includes both predictors

```
mod_bothpredictors <- gam(inf_prob ~ log(age) + s(season, bs = 'cc'),
family = betar(),
method = 'REML')
```

The estimate for `log(age)` also appears precise in this model, just like in the first model

```
summary(mod_bothpredictors)
```

```
##
## Family: Beta regression(14453.538)
## Link function: logit
##
## Formula:
## inf_prob ~ log(age) + s(season, bs = "cc")
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.247699 0.001206 -205.5 <2e-16 ***
## log(age) -0.399987 0.001343 -297.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(season) 7.748 8 4518 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.998 Deviance explained = 99.8%
## -REML = -787.93 Scale est. = 1 n = 200
```

But which model gives us *better* estimates for `log(age)`? Draw 5,000 posterior estimates from each model’s posterior coefficient distribution. Because `gam()` uses a form of maximum likelihood for estimation, the posterior once again is an implied multivariate Gaussian. The variance-covariance matrix for the coefficients is stored in the model object in the `Ve` slot

```
coefs_noseason <- MASS::mvrnorm(5000, mu = coef(mod_noseason),
Sigma = mod_noseason$Ve)
coefs_bothpredictors <- MASS::mvrnorm(5000, mu = coef(mod_bothpredictors),
Sigma = mod_bothpredictors$Ve)
```

Calculate the continuous ranked probability score (CRPS), a proper scoring rule for probabilistic predictions; a *lower* score is better, as it indicates the probability density is more concentrated around the true value

```
if(!require(scoringRules)){
install.packages('scoringRules')
}
```

```
## Loading required package: scoringRules
```

```
## Warning: package 'scoringRules' was built under R version 4.3.2
```

```
score_noseason <- scoringRules::crps_sample(y = beta_logage,
dat = t(coefs_noseason[,2]))
score_bothpredictors <- scoringRules::crps_sample(y = beta_logage,
dat = t(coefs_bothpredictors[,2]))
```
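
To see roughly what `crps_sample()` computes, a minimal sketch of the sample-based CRPS estimator (the mean absolute error to the true value, minus half the mean absolute difference between pairs of draws; exact weighting details differ slightly across implementations) gives essentially the same number

```
crps_manual = function(y, draws){
  # CRPS(F, y) is approximately E|X - y| - 0.5 * E|X - X'|, estimated from draws
  mean(abs(draws - y)) - 0.5 * mean(abs(outer(draws, draws, '-')))
}
# use a subset of draws to keep the pairwise-difference matrix small
crps_manual(y = beta_logage, draws = coefs_noseason[1:1000, 2])
score_noseason
```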

Plot estimates and print the CRPS for each model

```
layout(matrix(1:2, nrow = 2))
myhist(coefs_noseason[,2],
xlim = range(c(coefs_noseason[,2], coefs_bothpredictors[,2])),
main = paste0('Without seasonal adjustment; score = ', round(score_noseason, 3)))
abline(v = beta_logage, lwd = 3, col = 'white')
abline(v = beta_logage, lwd = 2.5)
myhist(coefs_bothpredictors[,2], xlab = 'Posterior estimates for log(age)',
xlim = range(c(coefs_noseason[,2], coefs_bothpredictors[,2])),
main = paste0('With seasonal adjustment; score = ', round(score_bothpredictors, 3)))
abline(v = beta_logage, lwd = 3, col = 'white')
abline(v = beta_logage, lwd = 2.5)
```

```
layout(1)
```

Here it is clear that *adjusting for important control variables, even if they aren’t “significant”, gives us better estimates of risk factors*. What happens if we have binary outcomes as our dependent variable? This is often the case in disease surveillance. We can easily generate such data using another of `R`’s random number generator functions. See `?rbinom` for more details

```
infections <- rbinom(n = N, size = 1, prob = inf_prob)
myhist(infections, xlab = 'Infection status')
```

Fit the same models as above, but this time use a Bernoulli observation model with logit link function. The full model we will fit is:
`$$\boldsymbol{infections}\sim \text{Bernoulli}(p)$$`

`$$logit(p)={\alpha}+{\beta_{age}}*\boldsymbol{log(age)}+f(season)$$`

```
mod_noseason <- gam(infections ~ log(age),
family = binomial(),
method = 'REML')
summary(mod_noseason)
```

```
##
## Family: binomial
## Link function: logit
##
## Formula:
## infections ~ log(age)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2423 0.1434 -1.689 0.0912 .
## log(age) -0.2517 0.1549 -1.625 0.1043
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## R-sq.(adj) = 0.00824 Deviance explained = 0.98%
## -REML = 137.81 Scale est. = 1 n = 200
```

```
mod_seasononly <- gam(infections ~ s(season, bs = 'cc'),
family = binomial(),
method = 'REML')
summary(mod_seasononly)
```

```
##
## Family: binomial
## Link function: logit
##
## Formula:
## infections ~ s(season, bs = "cc")
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2490 0.1449 -1.718 0.0858 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(season) 1.95 8 8.352 0.00572 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.0423 Deviance explained = 3.8%
## -REML = 135.62 Scale est. = 1 n = 200
```

```
mod_bothpredictors <- gam(infections ~ log(age) + s(season, bs = 'cc'),
family = binomial(),
method = 'REML')
summary(mod_bothpredictors)
```

```
##
## Family: binomial
## Link function: logit
##
## Formula:
## infections ~ log(age) + s(season, bs = "cc")
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2506 0.1461 -1.715 0.0863 .
## log(age) -0.2722 0.1594 -1.708 0.0877 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(season) 1.985 8 8.743 0.00475 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.0512 Deviance explained = 4.93%
## -REML = 135.02 Scale est. = 1 n = 200
```

```
coefs_noseason <- MASS::mvrnorm(5000, mu = coef(mod_noseason),
Sigma = mod_noseason$Ve)
coefs_bothpredictors <- MASS::mvrnorm(5000, mu = coef(mod_bothpredictors),
Sigma = mod_bothpredictors$Ve)
```

```
score_noseason <- scoringRules::crps_sample(y = beta_logage,
dat = t(coefs_noseason[,2]))
score_bothpredictors <- scoringRules::crps_sample(y = beta_logage,
dat = t(coefs_bothpredictors[,2]))
```

Plot estimates and print the CRPS for each model

```
layout(matrix(1:2, nrow = 2))
myhist(coefs_noseason[,2],
xlim = range(c(coefs_noseason[,2], coefs_bothpredictors[,2])),
main = paste0('Without seasonal adjustment; score = ', round(score_noseason, 3)))
abline(v = beta_logage, lwd = 3, col = 'white')
abline(v = beta_logage, lwd = 2.5)
myhist(coefs_bothpredictors[,2], xlab = 'Posterior estimates for log(age)',
xlim = range(c(coefs_noseason[,2], coefs_bothpredictors[,2])),
main = paste0('With seasonal adjustment; score = ', round(score_bothpredictors, 3)))
abline(v = beta_logage, lwd = 3, col = 'white')
abline(v = beta_logage, lwd = 2.5)
```

```
layout(1)
```

In this case it doesn’t make much difference whether we adjust for `\(season\)` or not when estimating the effect of `log(age)`. This is because binary data carry much less information than the proportional data we were using in the Beta regression. We’d need to boost our sample size to have a better chance of adequately estimating these effects with a logistic regression. This is another advantage of simulation, as it allows us to get a better sense of the limitations of our sampling designs
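
As a rough illustration of that point (a sketch with an arbitrarily larger sample size; none of these objects are in the original script), we can re-simulate the binary outcomes with many more observations and check whether the `log(age)` effect moves closer to the simulated value of `-0.4`

```
# re-run the same data-generating process with a much larger sample
N_big <- 5000
age_big <- rlnorm(N_big)
season_big <- rep(1:12, ceiling(N_big / 12))[1:N_big]
season_function_big <- rep((sin(0.2 * 1:12) - cos(-0.2 * 1:12) + sin(1:12)),
                           ceiling(N_big / 12))[1:N_big] * 0.1
inf_prob_big <- plogis(-0.3 + season_function_big + beta_logage * log(age_big))
infections_big <- rbinom(n = N_big, size = 1, prob = inf_prob_big)
mod_big <- gam(infections_big ~ log(age_big) + s(season_big, bs = 'cc'),
               family = binomial(),
               method = 'REML')
# the estimate should now sit much closer to -0.4, with a far smaller SE
coef(mod_big)['log(age_big)']
```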

## Failing to account for unobserved autocorrelation causes problems

In this final example, we will simulate a time series of counts that has autocorrelation in the underlying data-generating process. It is useful to get a sense of how models perform when we fail to take this autocorrelation into account. First we simulate the underlying autocorrelated trend, which follows an AR1 model

```
set.seed(500)
N <- 100
time <- 1:N
trend <- as.vector(scale(arima.sim(model = list(ar = 0.7), n = N)))
myscatter(x = time, y = trend, xlab = 'Time', ylab = 'Trend')
lines(trend, col = c_dark, lwd = 2)
```
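
To demystify what `arima.sim()` is doing, a minimal hand-rolled version of the same AR1 process (a sketch using the same autoregressive coefficient of `0.7`) is just a loop

```
# AR1 process: each value is 0.7 times the previous value plus white noise
trend_manual <- numeric(N)
trend_manual[1] <- rnorm(1)
for(i in 2:N){
  trend_manual[i] <- 0.7 * trend_manual[i - 1] + rnorm(1)
}
# scale to unit variance, as was done for the arima.sim() version above
trend_manual <- as.vector(scale(trend_manual))
```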

Next we simulate our covariate, `\(rainfall\)`, and its linear additive coefficient

```
rainfall <- rnorm(N)
beta_rain <- 0.4
```

Finally, we can simulate the counts. In this case, counts follow a Poisson observation process, which we will model with a Poisson regression that assumes the underlying mean evolves on the log scale:
`$$\boldsymbol{Counts}\sim \text{Poisson}(\lambda)$$`

`$$log(\lambda)={\alpha}+{\beta_{rainfall}}*\boldsymbol{rainfall}+trend$$`

In this example, the linear predictor, `\(log(\lambda)\)` (referred to here as `\(\mu\)`), captures the additive effects of the intercept `\(\alpha\)`, the effect of `\(rainfall\)` and the underlying trend on the log of the mean for `\(counts\)`. See `?rpois` for more information about the Poisson distribution functions in `R`

```
alpha <- 1
mu <- alpha + beta_rain * rainfall + 0.8 * trend
counts <- rpois(n = N, lambda = exp(mu))
```

Plot the counts over time

```
myscatter(x = time, y = counts, xlab = 'Time', ylab = 'Counts')
lines(counts, col = c_dark, lwd = 2)
```

When working with time series, it is always useful to understand whether there is any autocorrelation in the observations. This can be checked using the `acf()` function

```
acf(counts)
```

This doesn’t give us too much concern, and many people would use the fact that none of the lags shows too much autocorrelation as evidence that they do not need to account for this in their models. But what happens when we ignore time in our models?

```
mod_notime <- gam(counts ~ rainfall,
family = poisson())
acf(residuals(mod_notime))
```

There is some apparent autocorrelation left in the residuals. This means our model’s errors are not independent, which is a violation of the model’s assumptions. This can lead to overly precise standard errors and overconfident `p-values`. But it can also lead to downright absurd estimates for our predictors. Let’s compare estimates of the `\(rainfall\)` coefficient from this model with a model that does attempt to capture the effect of time. This is done using a flexible smoothing spline for time

```
mod_time <- gam(counts ~ rainfall + s(time, k = 30),
family = poisson())
```

As before, we can compare the accuracies of our posterior estimates of the `\(rainfall\)` coefficient using probabilistic scoring rules

```
coefs_notime<- MASS::mvrnorm(5000, mu = coef(mod_notime),
Sigma = mod_notime$Ve)
coefs_time <- MASS::mvrnorm(5000, mu = coef(mod_time),
Sigma = mod_time$Ve)
score_notime <- scoringRules::crps_sample(y = beta_rain,
dat = t(coefs_notime[,2]))
score_time<- scoringRules::crps_sample(y = beta_rain,
dat = t(coefs_time[,2]))
layout(matrix(1:2, nrow = 2))
myhist(coefs_notime[,2],
xlim = range(c(coefs_notime[,2], coefs_time[,2])),
main = paste0('Without time adjustment; score = ', round(score_notime, 3)))
abline(v = beta_rain, lwd = 3, col = 'white')
abline(v = beta_rain, lwd = 2.5)
myhist(coefs_time[,2],
xlim = range(c(coefs_notime[,2], coefs_time[,2])),
main = paste0('With time adjustment; score = ', round(score_time, 3)))
abline(v = beta_rain, lwd = 3, col = 'white')
abline(v = beta_rain, lwd = 2.5)
```

```
layout(1)
```

The model with the temporal component does better in this case, giving us more accurate estimates of our coefficient of interest. How does the smoothing spline capture this temporal variation?

```
plot(mod_time, shade = TRUE, shade.col = c_mid, col = c_dark_highlight,
bty = 'l')
```

More examples will follow, but now that you have the building blocks of simulation of random variables and random sampling, you can simulate more complex generative models directly.
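
For example, a natural next step (sketched below; the grouping structure and parameter values are invented purely for illustration) is to add a grouping level to the vegetation simulation, giving each site its own intercept and then checking whether a random-intercept model can recover the simulated parameters

```
# hierarchical extension: sites share a common slope but differ in their intercepts
set.seed(42)
n_sites <- 12
n_per_site <- 25
site <- factor(rep(1:n_sites, each = n_per_site))
# site-level intercepts drawn from a population distribution
site_intercepts <- rnorm(n_sites, mean = 1, sd = 0.5)
temperature <- rnorm(n_sites * n_per_site)
vegetation <- rnorm(n_sites * n_per_site,
                    mean = site_intercepts[site] + 0.25 * temperature,
                    sd = 0.5)
# a random-intercept model via mgcv's 're' (random effect) smooth
mod_hier <- gam(vegetation ~ temperature + s(site, bs = 're'),
                method = 'REML')
summary(mod_hier)
```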