As mentioned in the answer (and also comments) of What does the R function poly really do?, the variables are rescaled, in a way that depends on the sample size, so that the sum of their squared values is 1. This helps numerically, keeping the values within the computation from getting too big (or too small). The consequence is that the larger the data set, the smaller the individual rescaled values get.
sum(poly(aa$a,1)^2)
#> [1] 1
sum(poly(ab$a,1)^2)
#> [1] 1
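You can see the shrinking directly: since the sum of squares is fixed at 1, the individual entries scale roughly like 1/sqrt(n). A quick sketch using plain numeric vectors of different lengths (rounded so the point is easy to see):

```r
# sum of squares is always 1, so the largest entry shrinks as n grows
round(max(abs(poly(1:10, 1))), 3)
#> [1] 0.495
round(max(abs(poly(1:1000, 1))), 3)
#> [1] 0.055
```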
The result is that the coefficients of a model fit using the poly function depend on the sample size. If you want the coefficients to be comparable no matter what the sample size is, you have several options.
# ... recreating the data with a seed that results in more similar coefficients...
set.seed(23)
aa <- data.frame(a=as.ordered(rep(c("one", "two", "three", "four"), 200)), b=rnorm(800, 1, 1)+seq(1,4))
ab <- data.frame(a=as.ordered(rep(c("one", "two", "three", "four"), 50)), b=rnorm(200, 1, 1)+seq(1,4))
First, showing that with this data set the coefficients are quite different when using poly.
lm(b~poly(a,1),aa) |> coef()
#> (Intercept) poly(a, 1)
#> 3.549134 -12.482210
lm(b~poly(a,1),ab) |> coef()
#> (Intercept) poly(a, 1)
#> 3.488270 -6.206132
Integers
The easiest way is just to convert the factor to integers; the values then don't change with the sample size. I think this is the simplest and most straightforward solution: the coefficient is the average change in the response as you move to the next highest level of the factor. The intercept has no meaning in context, though; it's the average response at an imaginary factor level one below the first (because as.integer starts at 1).
lm(b~as.integer(a),aa) |> coef()
#> (Intercept) as.integer(a)
#> 4.5359393 -0.3947221
lm(b~as.integer(a),ab) |> coef()
#> (Intercept) as.integer(a)
#> 4.4695456 -0.3925103
On converting to integers...
Anything you do to make this effect linear will be equivalent to coding the levels 1, 2, 3, and so on. And while some might think this is highly frowned upon by statisticians, that's not true; what matters is whether it makes sense in your particular situation.
That is, what's frowned upon is using this statistical technique (or any statistical technique!) without actually thinking. One shouldn't blindly look at the linear effect of an ordered factor without considering whether it actually makes sense for one's situation.
For example, as quoted from this UV page, "Fox and Weisberg (2019) tell us contr.poly() is appropriate when the levels of your ordered factor are equally spaced and balanced."
Here are some other ways one might consider converting a factor to a linear effect in R:
Polynomial contrasts
When the factor is ordered, R will by default use polynomial contrasts (from contr.poly), which don't depend on the sample size. This would be the usual way of looking at linear effects or quadratic effects within an ANOVA type model (that is, a model that fits unique means for each predictor value).
However, it does include the quadratic and higher-order terms. It also isn't immediately obvious how to interpret the coefficients; you'd need to look at what contr.poly returns to see what they mean.
lm(b~a,aa) |> coef()
#> (Intercept) a.L a.Q a.C
#> 3.5491339 -0.8826255 0.9860938 -1.7221115
lm(b~a,ab) |> coef()
#> (Intercept) a.L a.Q a.C
#> 3.4882699 -0.8776796 1.2209501 -1.7361657
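To see what those coefficients are actually measuring, you can print the contrast matrix that contr.poly builds for a four-level factor (rounded here for readability). Each column sums to zero and has squared values summing to 1, which is why the linear column is not simply 1, 2, 3, 4:

```r
# the columns are the linear, quadratic and cubic contrasts
round(contr.poly(4), 3)
#>          .L   .Q     .C
#> [1,] -0.671  0.5 -0.224
#> [2,] -0.224 -0.5  0.671
#> [3,]  0.224 -0.5 -0.671
#> [4,]  0.671  0.5  0.224
```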
Poly with just degree 1
To avoid the higher-order terms, you can force contr.poly to use only the linear contrast by setting the contrasts of the variable itself, using the how.many argument.
contrasts(aa$a, how.many=1) <- contr.poly
contrasts(ab$a, how.many=1) <- contr.poly
lm(b~a, aa) |> coef()
#> (Intercept) a.L
#> 3.5491339 -0.8826255
lm(b~a, ab) |> coef()
#> (Intercept) a.L
#> 3.4882699 -0.8776796
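Note how this relates to the integer coding above: for four levels, the linear column of contr.poly is just the centered integers c(-1.5, -0.5, 0.5, 1.5) divided by their norm sqrt(5), so the two slopes differ only by that scale factor:

```r
# rescaling the contr.poly linear coefficient recovers the
# as.integer coefficient from the aa fit above
-0.8826255 / sqrt(5)
#> [1] -0.3947221
```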
Custom contrast
Finally, one could make a custom contrast that uses just the linear contrast; this is almost equivalent to converting the variable to an integer, except that you can more easily center the variable to have mean zero.
The coefficient is again the average change in the response as you move to the next highest level of the factor; the intercept is the average response at the middle factor level (or between the two middle levels, if there's an even number of them).
contr.linear <- function(n) {
  # n may be the number of levels or the levels themselves,
  # matching how contr.poly handles its first argument
  if (!(is.numeric(n) && length(n) == 1L)) n <- length(n)
  cbind(.L = seq_len(n) - (n + 1)/2)
}
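Calling it the way `contrasts<-` does (with the vector of levels) shows it returns the centered integer scores:

```r
# four levels map to centered integers with mean zero
contr.linear(levels(aa$a))
#>        .L
#> [1,] -1.5
#> [2,] -0.5
#> [3,]  0.5
#> [4,]  1.5
```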
contrasts(aa$a, 1) <- contr.linear
contrasts(ab$a, 1) <- contr.linear
lm(b~a, aa) |> coef()
#> (Intercept) a.L
#> 3.5491339 -0.3947221
lm(b~a, ab) |> coef()
#> (Intercept) a.L
#> 3.4882699 -0.3925103
Created on 2025-11-21 with reprex v2.1.1