
I'm trying to use the R poly() function with degree 1 to force glm to interpret a factor linearly. I'm puzzled by the fact that the size of the sample seems to increase the coefficient of the regressor. For example, I first create this sample:

> aa <- data.frame(a=as.ordered(rep(c("one", "two", "three", "four"), 200)), b=rnorm(800, 1, 1)+seq(1,4))

> head(aa)
      a        b
1   one 2.979086
2   two 3.817254
3 three 2.927219
4  four 7.109678
5   one 1.915151
6   two 2.178621

> lm(b~poly(a,1),aa)
Call:
lm(formula = b ~ poly(a, 1), data = aa)

Coefficients:
(Intercept)   poly(a, 1)  
3.52          -13.13

However, if I generate a sample a quarter of that size, the coefficient is about half of what it was before.

> ab <- data.frame(a=as.ordered(rep(c("one", "two", "three", "four"), 50)), b=rnorm(200, 1, 1)+seq(1,4))

> lm(b~poly(a,1),ab)
Call:
lm(formula = b ~ poly(a, 1), data = ab)

Coefficients:
(Intercept)   poly(a, 1)  
      3.544       -6.979  

This is clearly related to the way poly() works, as it creates orthonormal linear regressors, so they average to zero. But why do their values have to shrink with the sample size, like this:

> mean(abs(poly(aa$a,1)))
[1] 0.03162278
> mean(abs(poly(ab$a,1)))
[1] 0.06324555

  • It seems you've answered your own question: the coefficients increase with sample size because, given the way poly works, the values of the regressor decrease with sample size. Are you interested in the mechanism of what poly is doing behind the scenes?
  • Yes, I'm trying to understand why poly scales the effect of the regressor according to the sample size. This behavior creates a situation in which it is difficult to compare the importance of regressors.
  • Ah, thanks. The easy fix is to make a into a numeric integer-based value, though you could also use polynomial contrasts (instead of poly, which treats variables as numeric) or roll your own contrast function. I'll put a quick answer together.
  • People recommend using poly(a, 1, raw=TRUE), which keeps the values as they are. That sounds fine for a numeric variable, but it just transforms ordered("sometime", "often", "always") into 1, 2, 3, which, as I understand it, is highly frowned upon in statistical circles. My question is more: how do you get the "true" coefficient, as there should be one that is mostly independent of the scale of the initial variable.
  • Hmm. Anything you do to make this linear will be equivalent to 1, 2, 3. And I'd disagree that this is highly frowned upon; what matters is whether it makes sense in your particular situation. That is, what is definitely frowned upon is looking only at the linear effect of an ordered factor without thinking about whether it actually makes sense.

1 Answer


As mentioned in the answer (and also the comments) of What does the R function poly really do?, the columns that poly returns are scaled so that the sum of their squared values is 1. This helps computationally, keeping the values inside the computation from getting too big (or too small). It also means that the larger the data set, the smaller the rescaled variable gets.

sum(poly(aa$a,1)^2)
#> [1] 1
sum(poly(ab$a,1)^2)
#> [1] 1
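
Because each column's squared values must sum to 1 over all n rows, each individual value shrinks like 1/sqrt(n), and the fitted coefficient grows like sqrt(n) to compensate. Since aa has four times as many rows as ab, and both designs are balanced, the values differ by a factor of exactly 2 here:

nrow(aa) / nrow(ab)
#> [1] 4
mean(abs(poly(ab$a, 1))) / mean(abs(poly(aa$a, 1)))
#> [1] 2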

The result is that the coefficients of a model fit using the poly function depend on the sample size; if you want the coefficients to be comparable no matter what the sample size is, you have several options.

# ... recreating the data with a seed that results in more similar coefficients...
set.seed(23)
aa <- data.frame(a=as.ordered(rep(c("one", "two", "three", "four"), 200)), b=rnorm(800, 1, 1)+seq(1,4))
ab <- data.frame(a=as.ordered(rep(c("one", "two", "three", "four"), 50)), b=rnorm(200, 1, 1)+seq(1,4))

First, showing that with this data set the coefficients are quite different when using poly.

lm(b~poly(a,1),aa) |> coef()
#> (Intercept)  poly(a, 1) 
#>    3.549134  -12.482210
lm(b~poly(a,1),ab) |> coef()
#> (Intercept)  poly(a, 1) 
#>    3.488270   -6.206132

Integers

The easiest way is just to convert the factor into integers; those values don't change with the sample size. This is, I think, the simplest and most straightforward solution: the coefficient is the average change in the response as you move to the next highest level of the factor. The intercept has no meaning in context, though; it's the average response at an imaginary factor level one below the first one (because as.integer starts at 1).

lm(b~as.integer(a),aa) |> coef()
#>   (Intercept) as.integer(a) 
#>     4.5359393    -0.3947221
lm(b~as.integer(a),ab) |> coef()
#>   (Intercept) as.integer(a) 
#>     4.4695456    -0.3925103
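
If you'd like a meaningful intercept while staying with integers, a minimal variant (essentially what the custom contrast at the end does) is to center the integer score; this leaves the slope unchanged and makes the intercept the response at the mean level:

# same slope as above; the intercept is now the (balanced-design) grand mean
lm(b ~ I(as.integer(a) - mean(as.integer(a))), aa) |> coef()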

On converting to integers...

Anything you do to make this linear will be equivalent to 1, 2, 3. And while some might think this is highly frowned upon by statisticians, that's not true; what matters is whether it makes sense in your particular situation.

What is frowned upon is using this technique (or any statistical technique!) without actually thinking. One shouldn't blindly look at the linear effect of an ordered factor without considering whether it actually makes sense for one's situation.

For example, as quoted from this UV page, "Fox and Weisberg (2019) tell us contr.poly() is appropriate when the levels of your ordered factor are equally spaced and balanced."
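
In this example the factor is balanced by construction, which is easy to check; whether the levels are equally spaced is an assumption you have to justify from the meaning of the levels themselves:

# each level appears the same number of times
table(aa$a)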

Here are some other ways one might consider converting a factor to a linear effect in R:

Polynomial contrasts

When the factor is ordered, R will by default use polynomial contrasts (from contr.poly), which don't depend on the sample size. This is the usual way of looking at linear or quadratic effects within an ANOVA-type model (that is, a model that fits a unique mean for each predictor value).

However, it includes the quadratic and higher-order terms as well, and it isn't immediately obvious how one should interpret the coefficient; you'd need to look inside contr.poly to see what it does.
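
For reference, the contrasts depend only on the number of levels, not the number of rows; for a four-level factor the linear column is proportional to c(-3, -1, 1, 3), scaled to unit length:

contr.poly(4)  # columns .L, .Q, .C; the .L column is c(-3, -1, 1, 3)/sqrt(20)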

lm(b~a,aa) |> coef()
#> (Intercept)         a.L         a.Q         a.C 
#>   3.5491339  -0.8826255   0.9860938  -1.7221115
lm(b~a,ab) |> coef()
#> (Intercept)         a.L         a.Q         a.C 
#>   3.4882699  -0.8776796   1.2209501  -1.7361657

Poly with just degree 1

To avoid the higher-order terms, you can force contr.poly to use only the linear contrast by setting the contrasts of the variable itself with the how.many argument.

contrasts(aa$a, how.many=1) <- contr.poly
contrasts(ab$a, how.many=1) <- contr.poly
lm(b~a, aa) |> coef()
#> (Intercept)         a.L 
#>   3.5491339  -0.8826255
lm(b~a, ab) |> coef()
#> (Intercept)         a.L 
#>   3.4882699  -0.8776796

Custom contrast

Finally, one could write a custom contrast that applies just the linear contrast. This is almost equivalent to converting the variable to an integer, except that you can more easily center the variable to have mean zero.

The coefficient is again the average change in the response as you move to the next highest level of the factor; the intercept is the average response at the middle factor level (or between the two middle levels, if there's an even number of them). The slope matches the as.integer fit exactly; only the intercept changes.

contr.linear <- function(n) {
  # n may be either the vector of factor levels or the number of levels,
  # following the convention of contr.poly and friends
  if (length(n) > 1L) n <- length(n)
  # centered integer scores, e.g. -1.5, -0.5, 0.5, 1.5 for four levels
  cbind(.L = seq_len(n) - (n + 1)/2)
}
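
As a quick check, the contrast this builds for four levels is a single centered column (the values follow directly from the definition above):

contr.linear(4)  # one .L column: -1.5, -0.5, 0.5, 1.5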

contrasts(aa$a, 1) <- contr.linear
contrasts(ab$a, 1) <- contr.linear
lm(b~a, aa) |> coef()
#> (Intercept)         a.L 
#>   3.5491339  -0.3947221
lm(b~a, ab) |> coef()
#> (Intercept)         a.L 
#>   3.4882699  -0.3925103

Created on 2025-11-21 with reprex v2.1.1
