
I have pandas code that computes the R² of a linear regression over a window of size x. See my code:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def lr_r2_Sklearn(data):
    # fit value ~ position over the window and return the R^2
    data = np.array(data)
    X = np.arange(len(data)).reshape(-1, 1)  # positions 0..n-1 as the feature
    Y = data.reshape(-1, 1)

    regressor = LinearRegression()
    regressor.fit(X, Y)

    return regressor.score(X, Y)

r2_rolling = df[['value']].rolling(300).agg([lr_r2_Sklearn])
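
For reference: since the regression here is of the values against their positions 0..n-1, the R² is simply the squared Pearson correlation between the two, so the sklearn fit can be replaced by a plain numpy computation. A minimal equivalent sketch (same df assumed):

import numpy as np

def lr_r2_numpy(data):
    # R^2 of value ~ position equals corr(position, value) squared
    y = np.asarray(data, dtype="float64")
    x = np.arange(len(y), dtype="float64")
    return np.corrcoef(x, y)[0, 1] ** 2

r2_rolling = df['value'].rolling(300).apply(lr_r2_numpy, raw=True)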

I am taking a rolling window of size 300 and computing the R² for each window. I wish to do exactly the same thing, but with PySpark and a Spark DataFrame. I know I must use the Window function, but it's a bit more difficult to understand than pandas, so I am lost...

I have this, but I don't know how to make it work:

from pyspark.sql import Window
from pyspark.sql.functions import lit

w = Window.partitionBy(lit(1)).rowsBetween(-299, 0)
data.select(lr_r2('value').over(w).alias('r2')).show()

(lr_r2 is supposed to be a function that returns the R².)

Thanks!

1 Answer


You need a pandas UDF with a bounded window condition. This is not possible until Spark 3.0 and is currently in development; see this answer: User defined function to be applied to Window in PySpark?

In the meantime, you can explore the ml package of PySpark: http://spark.apache.org/docs/2.4.0/api/python/pyspark.ml.html#pyspark.ml.classification.LinearSVC

You can define a model such as LinearSVC and pass the various parts of the DataFrame to it after assembling them. I suggest using a pipeline consisting of two stages, an assembler and the model, then calling them in a loop on the parts of your DataFrame, filtered by some unique id.
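
To make the two suggestions concrete, here are hedged sketches of both. First, the Spark 3.0+ route: a pandas aggregate UDF evaluated over a bounded window. The ordering column id is an assumption (any monotonically increasing column works), and the UDF uses the fact that for a simple regression against positions 0..n-1, R² equals the squared Pearson correlation:

import numpy as np
import pandas as pd
from pyspark.sql import Window, functions as F
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def lr_r2(values: pd.Series) -> float:
    # aggregate one window of values into its R^2 against positions 0..n-1
    y = values.to_numpy(dtype="float64")
    x = np.arange(len(y), dtype="float64")
    if len(y) < 2 or np.std(y) == 0.0:  # degenerate window: R^2 undefined
        return float("nan")
    return float(np.corrcoef(x, y)[0, 1] ** 2)

# partitionBy(lit(1)) funnels everything through a single partition; fine
# for moderate data, a bottleneck for large data
w = Window.partitionBy(F.lit(1)).orderBy("id").rowsBetween(-299, 0)
data.select(lr_r2("value").over(w).alias("r2")).show()

Second, a rough sketch of the loop idea on Spark 2.x. Note that the linked LinearSVC is a classifier; for an R² metric, pyspark.ml.regression.LinearRegression is the natural stage, since its training summary exposes r2. The consecutive integer id column (1..n) used for slicing is an assumption:

from pyspark.sql import functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# two-stage pipeline: assemble the position column into a feature vector,
# then fit an ordinary least-squares regression on it
assembler = VectorAssembler(inputCols=["id"], outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="value")
pipeline = Pipeline(stages=[assembler, lr])

# one fit per 300-row window, sliced through the unique id
n = data.count()
r2s = []
for end in range(300, n + 1):
    window_df = data.filter((F.col("id") >= end - 299) & (F.col("id") <= end))
    model = pipeline.fit(window_df)
    r2s.append(model.stages[-1].summary.r2)  # R^2 on that window

This launches one Spark job per window, so it will be slow for long series; the pandas UDF route is the one to prefer once Spark 3.0 is available.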
