
I have a dataframe like the one below:

[screenshot of the input dataframe]

I want to get a dataframe which keeps the row with the most recent version and the latest date. The first filter criterion is the latest version, and then the latest date. The resulting dataframe should look like the one below:

[screenshot of the expected output dataframe]

I am using a window function to achieve this, and have written the piece of code below.

wind = Window.partitionBy("id")
data = data.withColumn("maxVersion", F.max("version").over(wind)) \
               .withColumn("maxDt", F.max("dt").over(wind)) \
               .where(F.col("version") == F.col("maxVersion")) \
               .where(F.col("maxDt") == F.col("dt")) \
               .drop(F.col("maxVersion")) \
               .drop(F.col("maxDt"))

I am not sure where I am going wrong; I am only getting one output row, with id 100. Please help me solve this.

3 Answers

As you mentioned, there is an order to your operation: first version, then dt. Basically, you need to select only the maximum version (removing everything else), and only then select the maximum dt (removing everything else). You just have to switch two lines, like this:

wind = Window.partitionBy("id")
data = data.withColumn("maxVersion", F.max("version").over(wind)) \
               .where(F.col("version") == F.col("maxVersion")) \
               .withColumn("maxDt", F.max("dt").over(wind)) \
               .where(F.col("maxDt") == F.col("dt")) \
               .drop(F.col("maxVersion")) \
               .drop(F.col("maxDt"))

The reason you got only one row, for id 100, is that in that case the maximum version and the maximum dt happen to fall on the same row (you got lucky). That is not true for id 200.
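
To see this concretely, here is a minimal sketch (assuming a SparkSession named spark, and reusing the id 200 rows from the sample data in the answer below) that computes both maxima up front, the way your original code does:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession, Window

spark = SparkSession.builder.getOrCreate()

# id 200 rows, taken from the sample data in the answer below
data = spark.createDataFrame(
    [(200, 1, "2020-09-19"), (200, 2, "2020-07-19"), (200, 2, "2020-08-19")],
    ["id", "version", "dt"],
)

wind = Window.partitionBy("id")
check = (data
         .withColumn("maxVersion", F.max("version").over(wind))  # 2 on every row
         .withColumn("maxDt", F.max("dt").over(wind)))           # "2020-09-19" on every row
check.show()
# The max dt (2020-09-19) sits on the version 1 row, so no single row satisfies
# version == maxVersion AND dt == maxDt, and id 200 drops out entirely.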

1 Comment

I guess this is the answer I was looking for. It's working. Thanks for your response.

Basically, there are a couple of issues with your formulation. First, you need to change the date from a string to a proper date type. Then, Window in PySpark allows you to specify the ordering of the columns one after the other. Then there is the rank() function, which lets you rank the results over the window. Finally, all that remains is to select the first rank.

from pyspark.sql.types import *
from pyspark import SparkContext, SQLContext
import pyspark.sql.functions as F
from pyspark.sql import Window

sc = SparkContext('local')
sqlContext = SQLContext(sc)

data1 = [
        (100,1,"2020-03-19","Nil1"),
        (100,2,"2020-04-19","Nil2"),
        (100,2,"2020-04-19","Nil2"),
        (100,2,"2020-05-19","Ni13"),
        (200,1,"2020-09-19","Jay1"),
        (200,2,"2020-07-19","Jay2"),
        (200,2,"2020-08-19","Jay3"),

      ]

df1Columns = ["id", "version", "dt",  "Name"]
df1 = sqlContext.createDataFrame(data=data1, schema = df1Columns)
df1 = df1.withColumn("dt",F.to_date(F.to_timestamp("dt", 'yyyy-MM-dd')).alias('dt'))
print("Schema.")
df1.printSchema()
print("Actual initial data")
df1.show(truncate=False)

wind = Window.partitionBy("id").orderBy(F.desc("version"), F.desc("dt"))

df1 = df1.withColumn("rank", F.rank().over(wind))
print("Ranking over the window spec specified")
df1.show(truncate=False)

final_df = df1.filter(F.col("rank") == 1).drop("rank")
print("Filtering the final result by applying the rank == 1 condition")
final_df.show(truncate=False)

Output:

Schema.
root
 |-- id: long (nullable = true)
 |-- version: long (nullable = true)
 |-- dt: date (nullable = true)
 |-- Name: string (nullable = true)

Actual initial data
+---+-------+----------+----+
|id |version|dt        |Name|
+---+-------+----------+----+
|100|1      |2020-03-19|Nil1|
|100|2      |2020-04-19|Nil2|
|100|2      |2020-04-19|Nil2|
|100|2      |2020-05-19|Ni13|
|200|1      |2020-09-19|Jay1|
|200|2      |2020-07-19|Jay2|
|200|2      |2020-08-19|Jay3|
+---+-------+----------+----+

Ranking over the window spec specified
+---+-------+----------+----+----+
|id |version|dt        |Name|rank|
+---+-------+----------+----+----+
|100|2      |2020-05-19|Ni13|1   |
|100|2      |2020-04-19|Nil2|2   |
|100|2      |2020-04-19|Nil2|2   |
|100|1      |2020-03-19|Nil1|4   |
|200|2      |2020-08-19|Jay3|1   |
|200|2      |2020-07-19|Jay2|2   |
|200|1      |2020-09-19|Jay1|3   |
+---+-------+----------+----+----+

Filtering the final result by applying the rank == 1 condition
+---+-------+----------+----+
|id |version|dt        |Name|
+---+-------+----------+----+
|100|2      |2020-05-19|Ni13|
|200|2      |2020-08-19|Jay3|
+---+-------+----------+----+

4 Comments

Nice solution. But, if I may, it is not really efficient as it involves sorting the whole dataframe.
@SimonDelecourt: only over the partitionBy id.
Thanks for your response. I have actually tried this and it is working. I was wondering if there is a way to handle this without using a rank. I guess the rank will cause a performance issue, as we have to sort the whole dataframe and shuffling will happen.
@Nils: Basically it depends on how many rows per id you have, since the partitionBy will ensure that only rows within that id are sorted. But I am not sure whether the Catalyst optimizer has an optimization for this case, where rank followed by a filter immediately leads to max semantics. I think you will have to look at the physical plans or run experiments to test this out.
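
For what it's worth, here is a quick sketch of how one might inspect that plan (assuming df1 and the window spec wind from the answer above are still in scope):

ranked = df1.withColumn("rank", F.rank().over(wind)).filter(F.col("rank") == 1)
# explain() prints the physical plan; the Exchange/Sort steps introduced by the window show up there
ranked.explain()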

A neater way is perhaps to do the following:

w = Window.partitionBy("id").orderBy(F.col('version').desc(), F.col('dt').desc())
df1.withColumn('maximum', F.row_number().over(w)).filter('maximum = 1').drop('maximum').show()
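
One difference worth noting versus the rank() answer above: rank() keeps every row that ties for first place within an id, while row_number() keeps exactly one row per id even when there are ties. A minimal sketch of the contrast, assuming df1 and w as defined above:

# rank(): all rows tied for first place survive the filter
df1.withColumn('r', F.rank().over(w)).filter('r = 1').drop('r').show()
# row_number(): exactly one row per id survives, even on ties
df1.withColumn('r', F.row_number().over(w)).filter('r = 1').drop('r').show()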
