
I am running a multiple logistic regression model. The dataset has ~350,000 observations, and the outcome is a dichotomous 0/1 variable. Most predictors are also dichotomous, but several have multiple levels (the maximum is 12). My initial code, using only fixed effects, was this:

model_fixed <- glm(outcome ~ var1 + var2 + var3 + var4 + var5 + var6 + var7 + var8 + var9 +
                     var10, family = binomial(link = "logit"), data = df)

This runs within 2 minutes. No issues.

I decided to make var1, a variable with 4 levels ("A", "B", "C", "D"), a random effect, so I switched over to this code:

model_mixed <- glmer(outcome ~ var2 + var3 + var4 + var5 + var6 + var7 + var8 + var9 + var10 +
    (1 | var1), data = df, family = binomial)

This model takes over an hour to run. The good thing is that the odds ratios from the mixed-effects model seem similar to the ORs from the fixed-effects model, but I'd like to speed up the process and wonder whether I'm doing something wrong, given the run time.

I tried adding this optimizer: control = glmerControl(optimizer = "bobyqa"), but it takes even longer to run and gives me this warning:

Warning in commonArgs(par, fn, control, environment()) : maxfun < 10 * length(par)^2 is not recommended.
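
(For reference, the maxfun in that warning is the optimizer's cap on the number of function evaluations; below is a minimal sketch of how the cap could be raised through glmerControl's optCtrl argument. The 2e5 value is just an illustrative guess, not a tuned recommendation, and model_mixed_bobyqa is only an illustrative name.)

model_mixed_bobyqa <- glmer(outcome ~ var2 + var3 + var4 + var5 + var6 + var7 + var8 + var9 +
    var10 + (1 | var1), data = df, family = binomial,
    control = glmerControl(optimizer = "bobyqa",
                           optCtrl = list(maxfun = 2e5)))  # illustrative cap, not a tuned value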

In addition, I can't generate Wald confidence intervals because of the run time.

This code works just fine: exp(confint.default(model_fixed))

This one takes forever (waited 2 hrs before canceling): exp(confint.default(model_mixed))

I've combed this site and seen similar examples (e.g. Speed up lmer function in R), but none specific to logistic regression.

I can't help but feel I am doing something wrong or that the underlying statistical approach is incorrect here (although the similar ORs from the fixed and mixed models seem encouraging). Any advice?

    Mixed effects are much more computationally intensive, so that's not terribly surprising, especially with such a large data set. Tell us more, though, about why you thought var1 should be a random effect – out of context, that's a curious choice. Commented Apr 11 at 16:03
  • @Aaron: Thank you. Var1 represents four different locations where the data were sampled from, so including it as a random effect makes sense contextually. It just happened to be the first variable listed; I thought the order in which variables are listed in the model doesn't matter, but maybe that's wrong. Commented Apr 11 at 16:48
  • Setting nAGQ = 0 will probably speed things up somewhat, potentially at the cost of accuracy (see the sketch after these comments). Though as @BenBolker notes, that may be less of an issue with such a large sample. See ?glmer for details. Commented Apr 11 at 19:22
  • Also see the lme4 "performance tips" vignette -- including the suggestion at the end to try an R interface to MixedModels.jl. Commented Apr 11 at 19:25
  • With only four levels, I don't think it gains you anything at all to model it as a "random effect"; that's not really enough levels to get a good estimate of the variability, plus it'll only really matter to the inference if the levels of var1 were the experimental units of one of the other factors – but if so, you don't have enough true replicates to do inference anyway. Commented Apr 11 at 20:28
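
(Following the nAGQ = 0 suggestion above, here is a minimal sketch of what that call would look like with the model from the question. With nAGQ = 0 the fixed effects are estimated inside the faster penalized iteratively reweighted least squares (PIRLS) step rather than in the outer nonlinear optimization, trading a little accuracy for speed; model_mixed_fast is just an illustrative name.)

## nAGQ = 0: faster but slightly less accurate than the default nAGQ = 1
model_mixed_fast <- glmer(outcome ~ var2 + var3 + var4 + var5 + var6 + var7 + var8 + var9 +
    var10 + (1 | var1), data = df, family = binomial, nAGQ = 0)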

1 Answer


The main issue here, I think, is the large number of columns of your fixed-effect (X) model matrix. glmer has to minimize a nonlinear function of (ncol(X) + number of RE parameters) parameters; in your case there's only one RE parameter (the variation in the intercept among var1 groups), but (in particular) a categorical (factor) variable with n levels adds n-1 columns to X. Furthermore, when you calculate the confidence intervals, you may need to compute the Hessian, whose size grows as the square of the number of parameters ...

I also agree with @Aaron that you don't gain much by treating a factor with four levels as a random effect ...

  • you can skip checking the gradient and Hessian at the MLE (which can be time-consuming) by specifying control = glmerControl(calc.derivs=FALSE)
  • you may be able to avoid computing the Hessian as shown below:
library(lme4)
set.seed(101)
## simulated example: 10-level grouping factor f1, 20-level fixed factor f2, 200 reps
dd <- expand.grid(f1 = factor(1:10), f2 = factor(1:20), rep = 1:200)
dd$y <- rbinom(nrow(dd), prob = 0.2, size = 1)
## calc.derivs = FALSE skips the (slow) gradient/Hessian checks at the MLE
m1 <- glmer(y ~ f2 + (1 | f1), family = binomial, data = dd, verbose = 1,
            control = glmerControl(calc.derivs = FALSE))
## do the equivalent of stats::confint.default() with Hessian-less vcov
vv <- vcov(m1, use.hessian = FALSE)
ss <- sqrt(diag(vv))
cimat <- fixef(m1) + outer(ss, c(-1, 1)) * qnorm(0.975)
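
(A small follow-up sketch: exponentiating the resulting matrix gives the Wald intervals on the odds-ratio scale, analogous to the exp(confint.default(model_fixed)) call in the question; the column labels are only for readability.)

## Wald CIs on the odds-ratio scale
or_ci <- exp(cimat)
colnames(or_ci) <- c("2.5 %", "97.5 %")  # illustrative labels
or_ci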

1 Comment

Thank you. I ended up not using the random-effect approach, since I hadn't previously known that it isn't recommended for variables with fewer than 5 levels.
