
I'm trying to fit a Tweedie regression in statsmodels. The regression has three categorical predictors with four levels each. To illustrate, here is an example:

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

data = pd.DataFrame({
    'V1': pd.Categorical(['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D']),
    'V2': pd.Categorical(['W', 'X', 'Y', 'Z', 'W', 'X', 'Y', 'Z']),
    'V3': pd.Categorical(['K', 'L', 'M', 'N', 'K', 'L', 'M', 'N']),
    'y': [5.1, 7.3, 6.9, 8.0, 5.4, 7.1, 6.8, 8.2]
})

# Double quotes outside so the single-quoted reference levels inside the formula parse correctly
formula = "y ~ C(V1, Treatment('A')) + C(V2, Treatment('W')):C(V3, Treatment('K'))"
model = smf.glm(formula, data=data, family=sm.families.Tweedie())
result = model.fit()

print(result.summary())

I used to run this kind of regression in SAS, and SAS does not return interaction terms that involve a reference level. For example, in this case, SAS does not include any interaction with V3(K). Here is the analogous code in SAS:

data example;
    input V1 $ V2 $ V3 $ y;
    datalines;
A W K 5.1
B X L 7.3
C Y M 6.9
D Z N 8.0
A W K 5.4
B X L 7.1
C Y M 6.8
D Z N 8.2
;
run;

proc hpgenselect data=example;
    class V1 (ref='A') V2 (ref='W') V3 (ref='K'); 
    model y = V1 V2*V3 / dist=tweedie link=log; 
run;

However, in statsmodels, this interaction is included. Does anyone know why this happens? And how can I get output similar to SAS (without the interactions involving the reference level)?

  • What SAS code did you try? Your example categorical variables are all perfectly correlated, so no model will really work. Do you have some actual example data? Commented Aug 16, 2024 at 21:11
  • I edited the question to include the SAS code. Commented Aug 16, 2024 at 21:23
