
I have been trying to fit a cubic regression for a few days now, but I keep hitting the same problem: my Python result does not match the output of the R code I wrote as a check. The datasets are identical, so that is not the problem. The code that I have right now is something like this:

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
import statsmodels.api as sm
import numpy as np

df = pd.read_csv("http://web.stanford.edu/~oleg2/hse/boston/boston_house_prices.csv")
df = df.dropna()
x, y = np.array(df.lstat), np.array(df.crim)

# Build the raw design matrix [1, x, x^2, x^3] and fit by OLS
polynomial_features = PolynomialFeatures(degree=3)
xp = polynomial_features.fit_transform(x.reshape(-1, 1))
model = sm.OLS(y, xp).fit()
print(model.summary())

I have also tried this:

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

df = pd.read_csv("http://web.stanford.edu/~oleg2/hse/boston/boston_house_prices.csv")
df = df.dropna()

# Same raw cubic fit via the formula interface
ft1 = smf.ols(formula="crim ~ lstat + I(np.power(lstat, 2)) + I(np.power(lstat, 3))", data=df).fit()
print(ft1.summary())

These two give exactly the same result:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   crim   R-squared:                       0.218
Model:                            OLS   Adj. R-squared:                  0.213
Method:                 Least Squares   F-statistic:                     46.63
Date:                Sat, 03 Oct 2020   Prob (F-statistic):           1.35e-26
Time:                        10:26:13   Log-Likelihood:                -1744.2
No. Observations:                 506   AIC:                             3496.
Df Residuals:                     502   BIC:                             3513.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept                 1.2010      2.029      0.592      0.554      -2.785       5.187
lstat                    -0.4491      0.465     -0.966      0.335      -1.362       0.464
I(np.power(lstat, 2))     0.0558      0.030      1.852      0.065      -0.003       0.115
I(np.power(lstat, 3))    -0.0009      0.001     -1.517      0.130      -0.002       0.000
==============================================================================
Omnibus:                      607.734   Durbin-Watson:                   1.239
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            53621.219
Skew:                           5.726   Prob(JB):                         0.00
Kurtosis:                      52.114   Cond. No.                     5.20e+04
==============================================================================

And here is the R code:

library(MASS)   # Boston data
attach(Boston)
fit.lstat2 <- lm(crim ~ poly(lstat, 3))
summary(fit.lstat2)

And it gives the following result:

## Call:
## lm(formula = crim ~ poly(lstat, 3))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.234  -2.151  -0.486   0.066  83.353 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.6135     0.3392  10.654   <2e-16 ***
## poly(lstat, 3)1  88.0697     7.6294  11.543   <2e-16 ***
## poly(lstat, 3)2  15.8882     7.6294   2.082   0.0378 *  
## poly(lstat, 3)3 -11.5740     7.6294  -1.517   0.1299    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.629 on 502 degrees of freedom
## Multiple R-squared:  0.2179, Adjusted R-squared:  0.2133 
## F-statistic: 46.63 on 3 and 502 DF,  p-value: < 2.2e-16

This R result is the one I expect, but I don't know why Python gives a different answer. Should I use a different approach?

P.S. I have made sure that both pieces of code run, including the database link, so feel free to run them and see the results for yourself.
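(For what it's worth, the two Python approaches agree with each other because both build the same raw design matrix [1, x, x², x³]. A quick check with a small synthetic x, using `np.vander` as an independent reference:)

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([1.0, 2.0, 3.0, 4.0])

# PolynomialFeatures(degree=3) emits the columns [1, x, x^2, x^3] --
# the same raw powers the formula
# crim ~ lstat + I(np.power(lstat, 2)) + I(np.power(lstat, 3)) builds.
xp = PolynomialFeatures(degree=3).fit_transform(x.reshape(-1, 1))

assert np.allclose(xp, np.vander(x, 4, increasing=True))
```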

  • how are you loading your data with R? Commented Oct 3, 2020 at 9:32
  • @Frenchy library(MASS) attach(Boston) Commented Oct 3, 2020 at 9:41
  • I don't know that Python's result is wrong, but you are not comparing the same models: your R code uses orthogonal polynomials. See my answer for switching to raw powers. Commented Oct 3, 2020 at 12:33

1 Answer


I am no expert in R, but I believe the difference is that R's poly() uses orthogonal polynomials by default; to get raw powers you have to set raw=TRUE.

I get the same result as Python's statsmodels when I use raw=TRUE on the R side:

fit.lstat2 <- lm(crim ~ poly(lstat, 3, raw=TRUE))
summary(fit.lstat2)

result:

Call:
lm(formula = crim ~ poly(lstat, 3, raw = TRUE))

Residuals:
    Min      1Q  Median      3Q     Max 
-15.234  -2.151  -0.486   0.066  83.353 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)  
(Intercept)                  1.2009656  2.0286452   0.592   0.5541  
poly(lstat, 3, raw = TRUE)1 -0.4490656  0.4648911  -0.966   0.3345  
poly(lstat, 3, raw = TRUE)2  0.0557794  0.0301156   1.852   0.0646 .
poly(lstat, 3, raw = TRUE)3 -0.0008574  0.0005652  -1.517   0.1299  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.629 on 502 degrees of freedom
Multiple R-squared:  0.2179,    Adjusted R-squared:  0.2133 
F-statistic: 46.63 on 3 and 502 DF,  p-value: < 2.2e-16
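Going the other way, if you want Python to reproduce R's orthogonal-polynomial fit, here is a rough sketch (an assumption on my part, not R's exact algorithm, but it spans the same column space): QR-decompose the Vandermonde matrix of the centered x and use the orthonormal non-constant columns as regressors. Column signs may differ from R's output, so individual coefficients can flip sign, but |t|, p-values, R², and the F-statistic should match.

```python
import numpy as np

def ortho_poly(x, degree):
    # Orthonormal polynomial basis, roughly what R's poly(x, degree)
    # returns: QR-decompose the Vandermonde matrix of the centered x
    # and drop the constant column. Column signs may differ from R's,
    # but the fitted model statistics are the same.
    x = np.asarray(x, dtype=float)
    V = np.vander(x - x.mean(), degree + 1, increasing=True)
    Q, _ = np.linalg.qr(V)
    return Q[:, 1:]

# Usage sketch, mirroring the question's variables:
#   xp = sm.add_constant(ortho_poly(df.lstat, 3))
#   print(sm.OLS(df.crim, xp).fit().summary())
```

The returned columns are orthonormal and orthogonal to the intercept, which is what makes the per-degree t-tests independent of each other, unlike the raw-power fit.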

2 Comments

This is a great observation, but I would rather change my Python code to give the same results as R, because I was trying to run ANOVA tests between linear and non-linear fits (obtained by the same method) that have little to no difference, yet the test reported a tiny p-value suggesting they are astronomically different.
In case anybody is wondering, the ANOVA test was not working because I forgot to add an intercept. So my Python code is correct.
