Using Stan for logistic regressions with detection error
By Nicholas Clark in stan simulation
March 10, 2024
Required libraries
cmdstanr
Purpose and model introduction
This script simulates binary observations of an imperfectly observed data generating process (i.e. our measurements are made with error). It also provides Stan
code for estimating parameters of the model in a Bayesian framework. The true infection status \(z\)
is assumed to be drawn from a Bernoulli distribution with probability \(p\)
:
$$z\sim \text{Bernoulli}(p)$$
A linear predictor is modeled on the \(logit\)
scale to capture effects of covariates on \(p\)
:
$$logit(p)={\alpha} + {\beta}*\boldsymbol{X}$$
\(\alpha\)
is the intercept representing average infection probability when all predictor variables are set to their means, and \(\beta\)
is the set of regression coefficients that describe associations between predictor variables in the model design matrix \(\boldsymbol{X}\)
and \(logit(p)\)
. Unfortunately, as is the case in most diagnostic test scenarios, our measurements are made with error. A classic example would be a Kato Katz faecal smear to detect eggs of soil-transmitted helminths. We know these tests are not perfect. It is easy to miss a parasite egg when the smear quality is low. Or when parasite eggs are not being shed. Or because eggs are damn small and hard to see. Assume \(\hat{SN}\)
is our test’s sensitivity (the proportion of subjects with the condition that tend to have a positive test) and \(\hat{SP}\)
is the test’s specificity (the proportion of subjects without the condition that tend to have a negative test). Observations \(\boldsymbol{Y}\)
are assumed to be conditional on the latent, unobserved true infection status \(z\)
through the following set of equations:
$$\boldsymbol{Y}\sim \text{Bernoulli}(\hat{SN}),~if~~z~= 1$$
$$\boldsymbol{Y}\sim \text{Bernoulli}(1 - \hat{SP}),~if~~z~= 0$$
Ignoring this detection error is bad. It can bias our estimates and lead to incorrect inferences. But there are several challenges with estimating such a model. First, we often do not know with precision what the test’s specificity and sensitivity are. So we need to estimate them. The second issue is that the \(z\)
represent latent discrete parameters. We do not know what the true infection status is. We could try to sample it, but it turns out that
sampling these latent values is problematic for advanced MCMC routines such as Stan
’s Hamiltonian Monte Carlo algorithm. Other types of samplers (i.e. Gibbs or Metropolis Hastings, like those used in JAGS
or BUGS
) will allow latent discrete parameters, but they are not efficient. Fortunately, probability theory tells us how we can account for the possibility that the true status \(z\)
was either a 1
or a 0
using a
marginal likelihood. The unconditional probability of the observed infection status \(\boldsymbol{Y}\)
is the sum of the probability in each unknown state (truly infected or truly uninfected) times the chance of being in each unknown state. This actually leads to better estimates because the expectation at all possible values is being used, rather than being estimated by sampling a latent discrete parameter.
Define the joint probabilistic model
First we will define the Stan
model code for estimating parameters under the model outlined above. This code uses a logical flag to allow us to tell the sampler whether or not to account for detection error. This will be useful for investigating the consequences of model misspecification.
model_code <-
"data {
int<lower=0> N; // number of observations
int<lower=0> P; // number of (standardised) predictor variables
array[N] int<lower=0, upper=1> y; // binary infection observations
matrix[N, P] x; // predictor design matrix
real<lower=-1,upper=1> sensitivity_prior; // inerpretable prior on sensitivity
real<lower=-1,upper=1> specificity_prior; // inerpretable prior on specificity
int<lower=0,upper=1> estimate_accuracy; // logical (include detection error or not)
}
transformed data {
// Convert diagnostic hyperpriors into values to feed to the dirichlet
real<lower=0> sens_hyp;
real<lower=0> spec_hyp;
real<lower=0> remainder_hyp;
sens_hyp = 1 / exp(sensitivity_prior);
spec_hyp = 1 / exp(specificity_prior);
remainder_hyp = exp(sensitivity_prior) + exp(specificity_prior);
// QR decomposition to deal with (possibly) correlated predictor variables
// https://betanalpha.github.io/assets/case_studies/qr_regression.html
matrix[N, P] Q_ast;
matrix[P, P] R_ast;
matrix[P, P] R_ast_inverse;
Q_ast = qr_thin_Q(x) * sqrt(N - 1);
R_ast = qr_thin_R(x) / sqrt(N - 1);
R_ast_inverse = inverse(R_ast);
}
parameters {
vector[P] theta; // QR decomposed regression coefficients
real alpha; // intercept coefficient
vector<lower=1>[3] theta_acc; // hyperprior for test accuracy
simplex[3] accuracy; // sum-to-1 constraint for misclassification error
}
transformed parameters {
// test sensitivity and specificity
real<lower=0, upper=1> sens;
real<lower=0, upper=1> spec;
// set both to 1 if we don't want to estimate detection error
if (estimate_accuracy)
sens = 1 - accuracy[1];
else
sens = 1;
if (estimate_accuracy)
spec = 1 - accuracy[2];
else
spec = 1;
// linear predictor for the logistic regression; here we can also include
// terms for spatial, temporal, and spatiotemporal effects; some examples found here:
// https://mc-stan.org/docs/stan-users-guide/fit-gp.html
// https://betanalpha.github.io/assets/case_studies/gp_part1/part1.html#1_gaussian_processes
// https://github.com/nicholasjclark/mvgam
vector[N] logit_z = alpha + Q_ast * theta;
}
model {
// informative hyperpriors on 1 - sensitivity and 1 - specificity
theta_acc[1] ~ normal(sens_hyp, 0.25);
theta_acc[2] ~ normal(spec_hyp, 0.25);
// hyperprior for sum-to-one constraint
theta_acc[3] ~ normal(remainder_hyp, 0.25);
accuracy ~ dirichlet(theta_acc);
// priors for regression coefficients
// we assume the disease is a bit rare, so use an appropriate
// prior on the intercept alpha
alpha ~ normal(-0.75, 1);
theta ~ std_normal();
// marginal Bernoulli likelihood
// the unconditional probability of the observed infection status is the sum of the
// probability in each unknown state (truly infected or truly uninfected)
// times the chance of being in each unknown state
// for each observation, we calculate the probability that the 'true' status was a 1
// and the probability that the 'true' status was a 0. Summing these probabilities on
// the log scale is the same as multiplying them on the untransformed scale; this
// allows us to average over (i.e. marginalise out) the latent infection status
for (n in 1:N) {
target += log_sum_exp(bernoulli_lpmf(y[n] | sens) + bernoulli_logit_lpmf(1 | logit_z[n]),
bernoulli_lpmf(1 - y[n] | spec) + bernoulli_logit_lpmf(0 | logit_z[n]));
}
}
generated quantities {
vector<lower=0, upper=1>[N] prob;
array[N] int<lower=0, upper=1> ypred;
vector[P] beta;
// compute predicted infection probabilities (i.e. posterior expectations)
// for each individual in the dataset
for (n in 1:N) {
prob[n] = softmax([bernoulli_lpmf(y[n] | sens) + bernoulli_logit_lpmf(1 | logit_z[n]),
bernoulli_lpmf(1 - y[n] | spec) + bernoulli_logit_lpmf(0 | logit_z[n])]')[1];
}
// compute predicted infection statuses (i.e. posterior predictions)
// for each individual in the dataset
ypred = bernoulli_rng(prob);
// compute the regression coefficients for the original
// (back-transformed) predictor variables
beta = R_ast_inverse * theta;
}"
There is a lot going on in this model. It accepts interpretable hyperprior information for the test diagnostics (more on that below). It also uses a
QR decomposition to handle possible colinearity among predictors. And of course, it marginalises over discrete latent infection status. Details of each of these procedures are not the main aim of this tutorial, but please see the relevant hyperlinks for more information. Load cmdstanr
and compile the model to C++
code. See
information here for troubleshooting advice when installing Stan
and Cmdstan
library(cmdstanr)
mod <- cmdstan_model(write_stan_file(model_code),
stanc_options = list('canonicalize=deprecations,braces,parentheses'))
Simulate data
Now that the model is defined in Stan
language, we need to simulate data to explore the model. Here we define a function to generate simulated binary outcomes with user-specific sensitivity and specificity. This function was modified from a very helpful
Stan
discourse post
sim_det_error <- function(n = 200, # sample size
alpha = -0.5, # intercept (on logit scale)
p = 2, # number of predictors
sn = 0.85, # sensitivity of test
sp = 0.7) # specificity of test
{
# generate random predictors; note that predictors should be mean-centred
# prior to modelling for the QR decomposition to be effective
x <- rnorm(n*p, mean = 0, sd = 1)
x <- matrix(x, nrow = n, ncol = p)
colnames(x) <- paste0('predictor_', 1:p)
# generate beta coefficients
c <- rnorm(p, mean = 0, sd = 1)
# design matrix and linear predictor
x_design <- cbind(rep(1, n), x)
c_full <- c(alpha, c)
odds <- rowSums(sweep(x_design, 2, c_full, FUN = "*"))
# convert linear predictor to the probability scale (logit link)
pr <- 1 / (1 + exp(-odds))
# generate true infection data
truth <- rbinom(n = n, size = 1, prob = pr)
# generate observed data with imperfect detection
y <- integer(n)
for ( i in 1:n ) {
y[i] <- ifelse(truth[i] == 1, rbinom(n = 1, size = 1, prob = sn),
rbinom(n = 1, size = 1, prob = 1 - sp))
}
return(list(N = n,
P = p,
y = y,
x = x,
true_status = truth,
true_sens = sn,
true_spec = sp,
true_alpha = alpha,
true_betas = c))
}
Simulate some data with two predictors. Here sensitivity ($\hat{SN}$) = 0.75
and specificity ($\hat{SP}$) = 0.85
. Again, sensitivity is a measure of how well a test can identify true positives and specificity is a measure of how well a test can identify true negatives. These values are not unreasonable given our knowledge of how difficult it can be to detect parasite eggs in faecal smears
dat <- sim_det_error(n = 250, alpha = -0.85,
p = 2, sn = 0.75, sp = 0.85)
The true (latent) prevalence is
sum(dat$true_status) / length(dat$true_status)
## [1] 0.324
The observed prevalence is
sum(dat$y) / length(dat$y)
## [1] 0.316
Add interpretable ‘prior’ information about test accuracy to the data, which will be used to structure the prior distributions for test sensitivity and specificity. Here, we use a scale of -1 to 1
to define any prior information we may have about performance of a diagnostic test. A score of -1
means we are fairly certain the test is bad (low accuracy). A score of 1
implies we are fairly certain the test is good. A score of 0
means we are uncertain. Let’s assume the test gives reasonable accuracy because it is a commercial diagnostic.
dat$sensitivity_prior <- 0.3
dat$specificity_prior <- 0.3
Fit competing models
We can assess whether it is problematic to ignore detection error when fitting models. We’ll do this by starting with a basic logistic regression that ignores this error. To do this, we include a logical flag to tell the sampler not to include detection error
dat$estimate_accuracy <- 0
Now condition the model on the observed data using Stan
with the CmdStan
backend
fit_noerror <- mod$sample(
data = dat,
chains = 4,
parallel_chains = 4,
refresh = 100)
Next we fit an equivalent model that does estimate detection error (i.e. we change the logical flag for estimate_accuracy
from 0
to 1
)
dat$estimate_accuracy <- 1
fit_error <- mod$sample(
data = dat,
chains = 4,
parallel_chains = 4,
refresh = 100)
Inspect sampler and convergence diagnostics. Any warnings of divergences or low energy should be taken seriously and inspected further. Fortunately our models fit without issue
fit_noerror$diagnostic_summary()
## $num_divergent
## [1] 0 0 0 0
##
## $num_max_treedepth
## [1] 0 0 0 0
##
## $ebfmi
## [1] 0.9662539 1.0130262 0.9480790 0.9575079
fit_error$diagnostic_summary()
## $num_divergent
## [1] 0 0 0 0
##
## $num_max_treedepth
## [1] 0 0 0 0
##
## $ebfmi
## [1] 0.9283966 0.8561576 0.8213387 0.8475217
Compute the usual MCMC summaries, which show that the Hamiltonian Monte Carlo chains have converged nicely
fit_noerror$summary(variables = c("alpha", "beta"))
## # A tibble: 3 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 alpha -0.868 -0.865 0.142 0.143 -1.11 -0.640 1.00 4238. 2562.
## 2 beta[1] 0.345 0.346 0.150 0.147 0.100 0.591 1.00 3803. 3122.
## 3 beta[2] 0.389 0.388 0.148 0.150 0.149 0.634 1.00 4341. 2867.
fit_error$summary(variables = c("alpha", "beta", "sens", "spec"))
## # A tibble: 5 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 alpha -1.18 -1.16 0.679 0.673 -2.34 -0.0840 1.00 2060. 2461.
## 2 beta[1] 0.799 0.722 0.466 0.400 0.199 1.65 1.00 1893. 1484.
## 3 beta[2] 0.814 0.759 0.403 0.375 0.262 1.55 1.00 1950. 1799.
## 4 sens 0.721 0.720 0.152 0.181 0.474 0.962 1.00 1456. 1391.
## 5 spec 0.848 0.844 0.0648 0.0674 0.747 0.965 1.00 1542. 1056.
Inference
Visualising the estimated coefficients for each of the models will allow us to ask whether ignoring detection error leads to biased inferences. First, define some colours for nice plots
c_light <- c("#DCBCBC")
c_light_highlight <- c("#C79999")
c_mid <- c("#B97C7C")
c_mid_highlight <- c("#A25050")
c_dark <- c("#8F2727")
c_dark_highlight <- c("#7C0000")
Plot the estimates for the intercept parameter \(\alpha\)
and overlay the true simulated value in black
alphas_noerror <- fit_noerror$draws("alpha", format = "draws_matrix")
alphas_error <- fit_error$draws("alpha", format = "draws_matrix")
layout(matrix(1:2, ncol = 1))
par(mar = c(4, 1, 1.5, 1))
hist(alphas_noerror,
col = 'grey', border = 'white', lwd = 2,
yaxt = 'n',
ylab = '', xlab = '',
xlim = range(c(alphas_noerror, alphas_error)),
breaks = seq(min(c(alphas_noerror, alphas_error)),
max(c(alphas_noerror, alphas_error)),
length.out = 50),
main = 'No detection error')
abline(v = dat$true_alpha, lwd = 3.5, col = 'white')
abline(v = dat$true_alpha, lwd = 3, col = 'black')
hist(alphas_error,
col = c_mid_highlight, border = 'white', lwd = 2,
main = 'Detection error',
xlim = range(c(alphas_noerror, alphas_error)),
breaks = seq(min(c(alphas_noerror, alphas_error)),
max(c(alphas_noerror, alphas_error)),
length.out = 50),
yaxt = 'n', ylab = '', xlab = expression(alpha))
abline(v = dat$true_alpha, lwd = 3.5, col = 'white')
abline(v = dat$true_alpha, lwd = 3, col = 'black')
The detection error model seems to struggle estimating \(\alpha\)
with precision, though our estimates are centred around the true simulated value. This parameter is a bit difficult to interpret in detection error models, so we will get back to that later. But our inferences on the effects of covariates are what we are primarily interested in. Next plot the estimated \(\beta\)
coefficients for the two predictors, and again overlay the true simulated values in black
betas_noerror <- fit_noerror$draws("beta", format = "draws_matrix")
betas_error <- fit_error$draws("beta", format = "draws_matrix")
layout(matrix(1:4, ncol = 2, byrow = FALSE))
hist(betas_noerror[,1],
col = 'grey', border = 'white', lwd = 2,
xlim = range(c(betas_noerror[,1], betas_error[,1])),
breaks = seq(min(c(betas_noerror[,1], betas_error[,1])),
max(c(betas_noerror[,1], betas_error[,1])),
length.out = 50),
main = 'No detection error',
yaxt = 'n', ylab = '', xlab = '')
abline(v = dat$true_betas[1], lwd = 3.5, col = 'white')
abline(v = dat$true_betas[1], lwd = 3, col = 'black')
hist(betas_error[,1],
col = c_mid_highlight, border = 'white', lwd = 2,
xlim = range(c(betas_noerror[,1], betas_error[,1])),
breaks = seq(min(c(betas_noerror[,1], betas_error[,1])),
max(c(betas_noerror[,1], betas_error[,1])),
length.out = 50),
main = 'Detection error', yaxt = 'n', ylab = '', xlab = expression(beta[1]))
abline(v = dat$true_betas[1], lwd = 3.5, col = 'white')
abline(v = dat$true_betas[1], lwd = 3, col = 'black')
hist(betas_noerror[,2],
col = 'grey', border = 'white', lwd = 2,
xlim = range(c(betas_noerror[,2], betas_error[,2])),
main = 'No detection error',
breaks = seq(min(c(betas_noerror[,2], betas_error[,2])),
max(c(betas_noerror[,2], betas_error[,2])),
length.out = 50),
yaxt = 'n', ylab = '', xlab = '')
abline(v = dat$true_betas[2], lwd = 3.5, col = 'white')
abline(v = dat$true_betas[2], lwd = 3, col = 'black')
hist(betas_error[,2],
col = c_mid_highlight, border = 'white', lwd = 2,
xlim = range(c(betas_noerror[,2], betas_error[,2])),
breaks = seq(min(c(betas_noerror[,2], betas_error[,2])),
max(c(betas_noerror[,2], betas_error[,2])),
length.out = 50),
main = 'Detection error', yaxt = 'n', ylab = '', xlab = expression(beta[2]))
abline(v = dat$true_betas[2], lwd = 3.5, col = 'white')
abline(v = dat$true_betas[2], lwd = 3, col = 'black')
Here it is immediately clear why we should account for imperfect detection when possible. The detection error model gives better estimates for both coefficients. If we had to make decisions based on these parameters, and we did not account for detection error, our decisions would be wrong. Why is there large uncertainty in these estimates? To answer this, we need to plot the estimates for the test diagnostics
layout(matrix(1:2, ncol = 1))
par(mar = c(4, 1, 1.5, 1))
posterior_sens <- fit_error$draws("sens", format = "draws_matrix")
hist(posterior_sens,
xlim = c(0, 1),
breaks = seq(0, 1, length.out = 30),
col = c_mid_highlight, border = 'white', lwd = 2,
main = '', yaxt = 'n', ylab = '', xlab = 'Sensitivity')
abline(v = dat$true_sens, lwd = 3.5, col = 'white')
abline(v = dat$true_sens, lwd = 3, col = 'black')
posterior_spec <- fit_error$draws("spec", format = "draws_matrix")
hist(posterior_spec,
xlim = c(0, 1),
breaks = seq(0, 1, length.out = 30),
col = c_mid_highlight, border = 'white', lwd = 2,
main = '', yaxt = 'n', ylab = '', xlab = 'Specificity')
abline(v = dat$true_spec, lwd = 3.5, col = 'white')
abline(v = dat$true_spec, lwd = 3, col = 'black')
Further investigation helps to diagnose why \(\alpha\)
is difficult to identify for the detection error model.
layout(matrix(1:2, ncol = 1))
par(mar = c(4, 4, 1.5, 1))
posterior_sens <- fit_error$draws("sens", format = "draws_matrix")
plot(alphas_error, posterior_sens, xlab = '',
ylab = 'Sensitivity', bty = 'l', pch = 16, col = c_mid_highlight)
box(bty = 'l', lwd = 2)
abline(h = dat$true_sens, lwd = 3.5, col = 'white')
abline(h = dat$true_sens, lwd = 3, col = 'black')
abline(v = dat$true_alpha, lwd = 3.5, col = 'white')
abline(v = dat$true_alpha, lwd = 3, col = 'black')
posterior_spec <- fit_error$draws("spec", format = "draws_matrix")
plot(alphas_error, posterior_spec, xlab = expression(alpha),
ylab = 'Specificity', bty = 'l', pch = 16, col = c_mid_highlight)
box(bty = 'l', lwd = 2)
abline(h = dat$true_spec, lwd = 3.5, col = 'white')
abline(h = dat$true_spec, lwd = 3, col = 'black')
abline(v = dat$true_alpha, lwd = 3.5, col = 'white')
abline(v = dat$true_alpha, lwd = 3, col = 'black')
There are fundamental posterior correlations induced between \(\alpha\)
and sensitivity and between \(\alpha\)
and specificity. This model cannot estimate sensitivity or specificity with much precision, resulting in wide estimates of \(\alpha\)
. There is nothing implicitly wrong here, these uncertainties simply reflect reality: We do not know the true infection status, and we do not know the true test performance. Without knowing these, how can we precisely estimate the average infection probability \(\alpha\)
? Stronger priors on diagnostic accuracies could help, but these uncertainties will not entirely go away without sampling many more individuals.
A very useful question raised by
A/Prof Joerg Henning relates to translating what these results might mean for a practicing clinician. Often we want to know how likely it is that a person is truly infected if they’ve returned a positive diagnostic test. A famous (and very topical) problem in probability theory is the
base rate fallacy. This occurs when we ignore the background rate (or prevalence of disease, in our case) when interpreting test results. No test is perfect, so they will sometimes return false positives. But if there are very few true positives in the population (because the disease is rare), this can mean that a high proportion of people testing positive are actually uninfected. To quote from
Wikipedia
: “It is especially counter-intuitive when interpreting a positive result in a test on a low-prevalence population after having dealt with positive results drawn from a high-prevalence population. If the false positive rate of the test is higher than the proportion of the new population with the condition, then a test administrator whose experience has been drawn from testing in a high-prevalence population may conclude from experience that a positive test result usually indicates a positive subject, when in fact a false positive is far more likely to have occurred.”
We can use sampling from our model’s joint posterior distribution to calculate the probability that someone returning a positive test is actually infected, which requires the following equation:
$$Pr(true~infect|pos~test) = Pr(pos~test|true~infect) * Pr(infected) /
(Pr(pos~test|true~infect) * Pr(infect) + Pr(pos~test|true~uninfect) * Pr(uninfect))$$
For details of this derivation, please read here and here).
Because I’m no wizard in probability theory, I’ll use brute force sampling to calculate this quantity. First, create a sample sequence to generate many posterior simulations. We do this because we want many thousands of samples, but we only have 4000
in our posterior.
N_samples <- 25000
sample_seq <- sample(1:NROW(posterior_sens), N_samples, replace = TRUE)
Next, sample latent (true) infection status for 25000
individuals using the logistic regression equation estimated by our model
true_inf <- matrix(NA, nrow = N_samples, ncol = NROW(dat$x))
for(i in 1:N_samples){
true_inf[i,] <- rbinom(NROW(dat$x),
size = 1,
prob = boot::inv.logit(alphas_error[sample_seq[i]] +
dat$x %*% as.vector(betas_noerror[sample_seq[i],])))
}
Sample observed test diagnostics given our model’s estimates for sensitivity and specificity
test_results <- matrix(NA, nrow = N_samples, ncol = NROW(dat$x))
for(i in 1:N_samples){
for(j in 1:NCOL(test_results)){
test_results[i,j] <- ifelse(true_inf[i,j] == 1,
rbinom(n = 1,
size = 1,
prob = posterior_sens[sample_seq[i]]),
rbinom(n = 1,
size = 1,
prob = 1 - posterior_spec[sample_seq[i]]))
}
}
Now we have all the samples we need to calculate relevant probabilistic quantities. Calculate the probability that an individual returns a positive test if they are actually infected. We would hope this value would be high, but that is not always the case
`Pr(positive test|truly infected)` <- vector(length = N_samples)
for(i in 1:N_samples){
`Pr(positive test|truly infected)`[i] = sum(test_results[i, which(true_inf[i,] == 1)]) /
length(which(true_inf[i,] == 1))
}
quantile(`Pr(positive test|truly infected)`, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
## 10% 25% 50% 75% 90%
## 0.5000000 0.5967742 0.7241379 0.8529412 0.9428571
Next calculate the probability that any individual in the population is infected (prevalence)
`Pr(infected)` <- vector(length = N_samples)
for(i in 1:N_samples){
`Pr(infected)`[i] = sum(true_inf[i,]) / length(true_inf[i,])
}
quantile(`Pr(infected)`, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
## 10% 25% 50% 75% 90%
## 0.124 0.180 0.256 0.348 0.440
Now we need the probability of returning a positive test if someone is truly uninfected. We would hope this value would be quite low
`Pr(positive test|truly uninfected)` <- vector(length = N_samples)
for(i in 1:N_samples){
`Pr(positive test|truly uninfected)`[i] = sum(test_results[i, which(true_inf[i,] == 0)]) /
length(which(true_inf[i,] == 0))
}
quantile(`Pr(positive test|truly uninfected)`, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
## 10% 25% 50% 75% 90%
## 0.05552764 0.10256410 0.15384615 0.20105820 0.24117647
Now the probability that any individual is uninfected, which is 1 - Pr(infected)
`Pr(uninfected)` <- vector(length = N_samples)
for(i in 1:N_samples){
`Pr(uninfected)`[i] = 1 - `Pr(infected)`[i]
}
quantile(`Pr(uninfected)`, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
## 10% 25% 50% 75% 90%
## 0.560 0.652 0.744 0.820 0.876
Finally, we can calculate our quantity of interest: the probability that an individual is actually infected if they return a positive test
`Pr(truly infected|positive test)` <- vector(length = N_samples)
for(i in 1:N_samples){
`Pr(truly infected|positive test)`[i] = `Pr(positive test|truly infected)`[i] * `Pr(infected)`[i] /
(`Pr(positive test|truly infected)`[i] * `Pr(infected)`[i] +
`Pr(positive test|truly uninfected)`[i] * `Pr(uninfected)`[i])
}
hist(`Pr(truly infected|positive test)`,
xlim = c(0, 1),
breaks = seq(0, 1, length.out = 30),
col = c_mid_highlight, border = 'white', lwd = 2,
main = '', yaxt = 'n', ylab = '',
xlab = 'Pr(truly infected|positive test)')
With imperfect tests and somewhat rare infections (~30%
in this example), we can’t be too sure that a positive test means a true infection. This all changes of course if we have other reasons to believe a person is infected. For example, if we only test people that show some symptoms, we are sampling a very different population compared to the full population of healthy and unhealthy individuals. Hopefully this example illustrates why simple logistic regressions are often not enough to tackle the questions we are interested in.