
I am running a multiple logistic regression model. The dataset has ~350,000 observations, and the outcome is a dichotomous 0/1 variable. Most predictors are also dichotomous, but several have multiple levels (the maximum is 12). My initial code, using only fixed effects, was this:

model_fixed <- glm(outcome ~ var1 + var2 + var3 + var4 + var5 + var6 + var7 + var8 + var9 +
                     var10, family = binomial(link = "logit"), data = df)

This runs within 2 minutes. No issues.

I decided to make var1, a variable with 4 levels ("A", "B", "C", "D"), a random effect, so I switched over to this code:

model_mixed <- glmer(outcome ~ var2 + var3 + var4 + var5 + var6 + var7 + var8 + var9 + var10 +
    (1 | var1), data = df, family = binomial)

This model takes over an hour to run. The good thing is that the odds ratios from the mixed-effects model seem similar to the ORs from the fixed-effects model, but I'd like to speed up the process and wonder whether I'm doing something wrong, given the run time.

I tried adding this optimizer: control = glmerControl(optimizer = "bobyqa"), but it takes even longer to run and gives me this warning:

Warning in commonArgs(par, fn, control, environment()) : maxfun < 10 * length(par)^2 is not recommended.
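
(For reference, the maxfun in that warning is the optimizer's cap on the number of function evaluations; below is a minimal sketch of how the cap could be raised through glmerControl's optCtrl argument. The 2e5 value is just an illustrative guess, not a tuned recommendation, and model_mixed_bobyqa is only an illustrative name.)

model_mixed_bobyqa <- glmer(outcome ~ var2 + var3 + var4 + var5 + var6 + var7 + var8 + var9 +
    var10 + (1 | var1), data = df, family = binomial,
    control = glmerControl(optimizer = "bobyqa",
                           optCtrl = list(maxfun = 2e5)))  # illustrative cap, not a tuned value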

In addition, I can't generate Wald confidence intervals because of the run time.

This code works just fine: exp(confint.default(model_fixed))

This one takes forever (waited 2 hrs before canceling): exp(confint.default(model_mixed))

I've combed this site and seen similar examples (e.g. Speed up lmer function in R), but none specific to logistic regression.

I can't help but feel I am doing something wrong or that the underlying statistical approach is incorrect here (although the similar ORs from the fixed and mixed models seem encouraging). Any advice?

    Mixed effects are much more computationally intensive, so that's not terribly surprising, especially with such a large data set. Tell us more, though, about why you thought var1 should be a random effect – out of context, that's a curious choice. Commented Apr 11 at 16:03
  • @Aaron: Thank you. Var1 represents four different locations where the data were sampled from, so including it as a random effect makes sense contextually. It just happened to be the first variable listed; I thought the order in which variables are listed in the model doesn't matter, but maybe that's wrong. Commented Apr 11 at 16:48
  • Setting nAGQ = 0 will probably speed things up somewhat, potentially at the cost of accuracy (see the sketch after these comments). Though as @BenBolker notes, that may be less of an issue with such a large sample. See ?glmer for details. Commented Apr 11 at 19:22
  • Also see the lme4 "performance tips" vignette -- including the suggestion at the end to try an R interface to MixedModels.jl. Commented Apr 11 at 19:25
  • With only four levels, I don't think it gains you anything at all to model it as a "random effect"; that's not really enough levels to get a good estimate of the variability, plus it'll only really matter to the inference if the levels of var1 were the experimental units of one of the other factors – but if so, you don't have enough true replicates to do inference anyway. Commented Apr 11 at 20:28
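
(Following the nAGQ = 0 suggestion above, here is a minimal sketch of what that call would look like with the model from the question. With nAGQ = 0 the fixed effects are estimated inside the faster penalized iteratively reweighted least squares (PIRLS) step rather than in the outer nonlinear optimization, trading a little accuracy for speed; model_mixed_fast is just an illustrative name.)

## nAGQ = 0: faster but slightly less accurate than the default nAGQ = 1
model_mixed_fast <- glmer(outcome ~ var2 + var3 + var4 + var5 + var6 + var7 + var8 + var9 +
    var10 + (1 | var1), data = df, family = binomial, nAGQ = 0)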

1 Answer


The main issue here, I think, is the large number of columns of your fixed-effect (X) model matrix. glmer has to minimize a nonlinear function of (ncol(X) + number of RE parameters) parameters; in your case there's only one RE parameter (the variation in the intercept among var1 groups), but (in particular) a categorical (factor) variable with n levels adds n-1 columns to X. Furthermore, when you calculate the confidence intervals, you may need to compute the Hessian, whose size grows as the square of the number of parameters ...

I also agree with @Aaron that you don't gain much by treating a factor with four levels as a random effect ...

  • you can skip checking the gradient and Hessian at the MLE (which can be time-consuming) by specifying control = glmerControl(calc.derivs=FALSE)
  • you may be able to avoid computing the Hessian as shown below:
library(lme4)
set.seed(101)
## simulated example: 10-level grouping factor f1, 20-level fixed factor f2, 200 reps
dd <- expand.grid(f1 = factor(1:10), f2 = factor(1:20), rep = 1:200)
dd$y <- rbinom(nrow(dd), prob = 0.2, size = 1)
## calc.derivs = FALSE skips the (slow) gradient/Hessian checks at the MLE
m1 <- glmer(y ~ f2 + (1 | f1), family = binomial, data = dd, verbose = 1,
            control = glmerControl(calc.derivs = FALSE))
## do the equivalent of stats::confint.default() with Hessian-less vcov
vv <- vcov(m1, use.hessian = FALSE)
ss <- sqrt(diag(vv))
cimat <- fixef(m1) + outer(ss, c(-1, 1)) * qnorm(0.975)
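
(A small follow-up sketch: exponentiating the resulting matrix gives the Wald intervals on the odds-ratio scale, analogous to the exp(confint.default(model_fixed)) call in the question; the column labels are only for readability.)

## Wald CIs on the odds-ratio scale
or_ci <- exp(cimat)
colnames(or_ci) <- c("2.5 %", "97.5 %")  # illustrative labels
or_ci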

1 Comment

Thank you. I ended up not using the random-effect approach, since I hadn't previously known that it isn't recommended for variables with fewer than 5 levels.
