Post-hoc MCMC with glmmTMB

Overview

One commonly requested feature is to be able to run a post hoc Markov chain Monte Carlo analysis based on the results of a frequentist fit. This is often a reasonable shortcut for computing confidence intervals and p-values that allow for finite-sized samples rather than relying on asymptotic sampling distributions. This vignette shows examples of such an analysis. Some caveats:

  • when using such a “pseudo-Bayesian” approach, be aware that sampling from a scaled likelihood (i.e., with implicit, improper flat priors) can cause problems, especially when the model is poorly constrained by the data
  • in particular, models with poorly constrained random effects (singular or nearly singular) are likely to give bad results
  • as shown below, even models that are well-behaved for frequentist fitting may need stronger priors to give well-behaved MCMC results
  • as with all MCMC analyses, it is the user’s responsibility to check for proper mixing and convergence of the chains (e.g. with functions from the coda package) before drawing conclusions; tmbstan does some of this checking automatically, so it produces warnings where the DIY sampler below does not.

The first MCMC sampler illustrated below is conceptually simple (Metropolis with a multivariate normal candidate distribution). Users who want to do MCMC sampling with less code, or on a regular basis, should consider the tmbstan package, which does more efficient hybrid/Hamiltonian Monte Carlo sampling and can take full advantage of glmmTMB’s ability to provide gradients of the log-posterior with respect to the parameters. More information on tmbstan, and on the methods available for working with the resulting samples (e.g. methods(class="stanfit")), can be found on the tmbstan GitHub repository. If you prefer to use tmbstan, you can safely skip the Metropolis-Hastings (DIY) section of the salamander example below.

In general, you can plug the objective function from glmmTMB (i.e., fitted_model$obj$fn) into any general-purpose MCMC package (e.g. adaptMCMC, MCMCpack (using MCMCmetrop1R()), ensemblemcmc, …). However, most of these algorithms do not take advantage of glmmTMB’s automatic computation of gradients; for that, we will demonstrate the use of tmbstan, which uses glmmTMB’s objective and gradient functions in conjunction with the sampling algorithms from Stan.
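For example, here is a minimal sketch (not run) of handing the glmmTMB objective function to the adaptive Metropolis sampler in the adaptMCMC package; fitted_model stands for any glmmTMB fit containing random effects, and the iteration count and target acceptance rate are arbitrary illustrative choices:

library(adaptMCMC)
## obj$fn returns the *negative* log-likelihood, so negate it to get a
## (scaled, unnormalized) log-posterior for the sampler
logpost <- function(p) -fitted_model$obj$fn(p)
## start from the MLEs of the top-level (non-random-effect) parameters
start <- with(fitted_model$obj$env, last.par[-random])
samp <- MCMC(p = logpost, n = 10000, init = start,
             scale = vcov(fitted_model, full = TRUE), ## candidate covariance
             adapt = TRUE, acc.rate = 0.25)
## samp$samples contains the matrix of posterior draws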

Setup

Load packages:

library(glmmTMB)
library(coda)     ## MCMC utilities
library(reshape2) ## for melt()
## graphics
library(lattice)
library(ggplot2); theme_set(theme_bw())

Salamander example

Fit basic model:

fm1 <- glmmTMB(count ~ mined + (1|site),
    zi=~mined,
    family=poisson, data=Salamanders)

Metropolis-Hastings (DIY)

This is a basic block Metropolis sampler, based on code from Florian Hartig’s blog.

##' @param start starting value
##' @param V variance-covariance matrix of MVN candidate distribution
##' @param logpost_fun function returning the (positive) log-posterior for a parameter vector
##' @param iterations total iterations
##' @param nsamp number of samples to store
##' @param burnin number of initial samples to discard
##' @param thin thinning interval
##' @param tune tuning parameters; expand/contract V
##' @param seed random-number seed
run_MCMC <- function(start,
                     V,   
                     logpost_fun,
                     iterations = 10000,
                     nsamp = 1000,
                     burnin = iterations/2,
                     thin = floor((iterations-burnin)/nsamp),
                     tune = NULL,
                     seed=NULL
                     ) {
    ## initialize
    if (!is.null(seed)) set.seed(seed)
    if (!is.null(tune)) {
        tunesq <- if (length(tune)==1) tune^2 else outer(tune,tune)
        V <-  V*tunesq
    }
    chain <- matrix(NA, nsamp+1, length(start))
    chain[1,] <- cur_par <- start
    postval <- logpost_fun(cur_par)
    j <- 1
    for (i in 1:iterations) {
        proposal <- MASS::mvrnorm(1, mu = cur_par, Sigma = V)
        newpostval <- logpost_fun(proposal)
        accept_prob <- exp(newpostval - postval)
        if (runif(1) < accept_prob) {
            cur_par <- proposal
            postval <- newpostval
        }
        if ((i > burnin) && (i %% thin == 1) && (j <= nsamp)) { ## store a thinned sample; guard against overfilling chain
            chain[j+1,] <- cur_par
            j <- j + 1
        }
    }
    chain <- na.omit(chain)
    colnames(chain) <- names(start)
    chain <- coda::mcmc(chain)
    return(chain)
}

Set up for MCMC: define a scaled log-posterior function (in this case just the log-likelihood); extract the fitted coefficients as a starting point and the full variance-covariance matrix to use as the covariance of the candidate distribution.

## FIXME: is there a better way for user to extract full coefs?
rawcoef <- with(fm1$obj$env,last.par[-random])
names(rawcoef) <- make.names(names(rawcoef), unique=TRUE)
## log-likelihood function 
## (run_MCMC wants *positive* log-lik)
logpost_fun <- function(x) -fm1$obj$fn(x)
## check definitions
stopifnot(all.equal(c(logpost_fun(rawcoef)),
                    c(logLik(fm1)),
                    tolerance = 1e-7))
V <- vcov(fm1,full=TRUE)

Run the chain:

t1 <- system.time(m1 <- run_MCMC(start=rawcoef,
                                 V=V, logpost_fun=logpost_fun,
                                 seed=1001))

(running this chain takes 7.6 seconds)

Add more informative column names (the last parameter is the standard deviation of the site random effect, sampled on the log scale; see the vignette on covariance structures and parameters) and plot the chains:

colnames(m1) <- colnames(vcov(fm1, full = TRUE))
colnames(m1)[ncol(m1)] <- "sd_site"  ## log-SD of the site random effect
lattice::xyplot(m1,layout=c(2,3),asp="fill")

print(effectiveSize(m1),digits=3)
##    (Intercept)        minedno zi~(Intercept)     zi~minedno        sd_site 
##            262            243            258            259            249

These effective sample sizes are smaller than we would like. In a real analysis we would stop and make sure we had addressed the mixing/convergence problems before proceeding; with this simple sampler, the main options are to (1) simply run the chain for longer; (2) tune the candidate distribution (e.g. by using tune to scale some parameters, or perhaps by switching to a multivariate Student t candidate distribution [see the mvtnorm package]); or (3) add regularizing priors. A sketch of options (1) and (2) is shown below.
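For illustration, options (1) and (2) might look something like this (not run; the iteration count and tuning factor are arbitrary starting points that would need to be adjusted until the chains mix well):

m1_longer <- run_MCMC(start = rawcoef, V = V, logpost_fun = logpost_fun,
                      iterations = 50000, ## 5x longer chain
                      tune = 2,           ## inflate candidate SDs by a factor of 2
                      seed = 1001)
print(effectiveSize(m1_longer), digits = 3)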

Ignoring the problems and proceeding, we can compute column-wise quantiles or highest posterior density intervals (coda::HPDinterval) to get confidence intervals, and plot the posterior distributions, omitting the conditional-model intercept because it is on a very different scale from the other parameters.
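A sketch of both steps (the choice of density plots, and of which parameter to drop, is purely cosmetic):

## 95% highest posterior density intervals
coda::HPDinterval(m1)
## reshape to long format and plot posterior densities
m1_melt <- reshape2::melt(as.matrix(m1))
ggplot(subset(m1_melt, Var2 != "(Intercept)"), aes(x = value)) +
    geom_density() +
    facet_wrap(~ Var2, scales = "free")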

tmbstan

The tmbstan package allows direct, simple access to a hybrid/Hamiltonian Monte Carlo algorithm for sampling from a TMB object; the $obj component of a glmmTMB fit is such an object. (To run this example you’ll need to install the tmbstan package and its dependencies.)

library(tmbstan)
t2 <- system.time(m2 <- tmbstan(fm1$obj, seed = 101))

Running this command, which creates 4 chains, takes 5.6 seconds with no parallelization. If you’re going to do this a lot, you can run the chains in parallel by adding the argument cores=n; this argument is passed through to rstan::sampling(), so see ?rstan::sampling for details. Running bayestestR::diagnostic_posterior() on the fit gives the results shown in the table below.
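The call itself is a one-liner (this assumes the bayestestR package is installed; the digits setting is only for compact printing):

dp2 <- bayestestR::diagnostic_posterior(m2)
print(dp2, digits = 3)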

Parameter    ESS   MCSE   Rhat
beta[1]      736  0.010  1.001
beta[2]      709  0.011  1.001
betazi[1]    998  0.008  1.001
betazi[2]   1137  0.008  1.001
theta        556  0.015  1.012

A trace plot (rstan::traceplot(m2, pars=c("beta","betazi","theta"))):

Pairs plot (pairs(m2, pars = c("beta", "betazi"), gap = 0)):

Sleep study example

Now we fit a model to the now-classic sleepstudy data set included with the lme4 package:

data("sleepstudy", package = "lme4")
fm2 <- glmmTMB(Reaction ~ Days + (Days | Subject), data = sleepstudy)
t3 <- system.time(m3 <- tmbstan(fm2$obj, seed = 101))

Running these chains produces many warnings about divergent transitions, low effective sample size, etc.; the diagnostic table below confirms the problems.

Diagnostics:

dp3 <- bayestestR::diagnostic_posterior(m3)
Parameter    ESS     MCSE   Rhat
beta[1]       11    2.339  1.138
beta[2]       14    0.473  1.164
betadisp       8    0.026  1.201
theta[1]      18    0.074  1.157
theta[2]      23    0.056  1.143
theta[3]       3  205.252  2.073

A trace plot (rstan::traceplot(m3, pars = c("beta", "betadisp", "theta"))) confirms that one of the chains (chain 1) is problematic:

One way to address these problems is to add bounds that keep the chains within ±5 standard errors of the MLE (we originally used ±3 standard errors, but those bounds were being hit too frequently; ideally we want bounds that prevent crazy behaviour without otherwise constraining the chains):

sdrsum <- summary(fm2$sdr)
par_est <- sdrsum[,"Estimate"]
par_sd <- sdrsum[,"Std. Error"]
t4 <- system.time(m4 <- tmbstan(fm2$obj,
                                lower = par_est - 5*par_sd,
                                upper = par_est + 5*par_sd,
                                seed = 101))

The diagnostics and the trace plot look much better now …

dp4 <- bayestestR::diagnostic_posterior(m4)
Parameter    ESS   MCSE   Rhat
beta[1]     2782  0.078  1.000
beta[2]      514  0.070  1.006
betadisp    2410  0.001  1.001
theta[1]     445  0.021  1.004
theta[2]    1275  0.006  1.002
theta[3]    1184  0.014  1.002

The last step (for now) is to extract the sampled values, give them interpretable names, and transform the correlation parameter back to a more interpretable scale (see the “Covariance structures” vignette):

samples4 <- as.data.frame(extract(m4, pars=c("beta","betadisp","theta")))
colnames(samples4) <- c(names(fixef(fm2)$cond),
                  "log(sigma)",
                  c("log(sd_Intercept)", "log(sd_Days)", "cor"))
samples4$cor <- sapply(samples4$cor, get_cor)
m4_long <- reshape2::melt(as.matrix(samples4))
ggplot(m4_long, aes(x = value)) + geom_histogram(bins = 50) + facet_wrap(~Var2, scale = "free")

A more principled, properly Bayesian way to handle this problem would be to add sensible regularizing priors; a rough sketch is given below. (This remains to be done/tested more carefully; as of this writing, priors can be set on fixed effects and on random-effect standard deviations, but not on random-effect correlations …)
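An untested sketch of what this might look like, using the priors argument available in recent glmmTMB versions (the specific prior strings and class labels below are illustrative and should be checked against the priors vignette and ?priors before use):

## illustrative only: weakly regularizing priors via the 'priors' argument
## (see the glmmTMB priors vignette for the full specification syntax)
prior_spec <- data.frame(
    prior = c("normal(0, 3)", "gamma(1e8, 2.5)"), ## fixed effects; RE std devs
    class = c("fixef", "ranef"),
    coef  = c("", "")
)
fm2p <- glmmTMB(Reaction ~ Days + (Days | Subject), data = sleepstudy,
                priors = prior_spec)
m5 <- tmbstan(fm2p$obj, seed = 101)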