
I have two DataFrames:

df1:

c1    c2    c3
1     192   1
3     192   2
4     193   3
5     193   3
7     193   5
9     194   7

df2:

v1
192 
193
194

I want to add a new column to df2; the expected result is:

df2:

v1     v2
192    2
193    2
194    1

Explanation: for v1 = 193, there are 3 rows in df1 with c2 = 193, and their c3 values are 3, 3, and 5. The distinct values are 3 and 5, so the count is 2 and v2 for 193 is 2.

Thank you; a Python version would be best.

2 Answers


You can join the two DataFrames on c2 == v1, group by v1, and take the distinct count of c3.

import pyspark.sql.functions as F

# join df1 to df2 on c2 == v1, then count the distinct c3 values for each v1
result = (df1.join(df2, df1.c2 == df2.v1)
             .groupBy('v1')
             .agg(F.countDistinct('c3').alias('v2'))
         )

result.show()
+---+---+
| v1| v2|
+---+---+
|193|  2|
|192|  2|
|194|  1|
+---+---+
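
If your DataFrames are plain pandas DataFrames rather than Spark ones, a minimal sketch of the same idea (group by the key, count distinct values, then map the counts back) could look like this; the data below is just the sample from the question:

import pandas as pd

# sample data from the question, assumed to live in pandas DataFrames
df1 = pd.DataFrame({"c1": [1, 3, 4, 5, 7, 9],
                    "c2": [192, 192, 193, 193, 193, 194],
                    "c3": [1, 2, 3, 3, 5, 7]})
df2 = pd.DataFrame({"v1": [192, 193, 194]})

# distinct count of c3 per c2, then map those counts onto df2.v1
distinct_counts = df1.groupby("c2")["c3"].nunique()
df2["v2"] = df2["v1"].map(distinct_counts)
print(df2)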

You can try it as below:

from pyspark.sql.functions import countDistinct

sdf1 = spark.createDataFrame([
    (1, 192, 1),
    (3, 192, 2),
    (4, 193, 3),
    (5, 193, 3),
    (7, 193, 5),
    (9, 194, 7)
], ["c1", "c2", "c3"])

df2 = spark.createDataFrame([
    (192,),
    (193,),
    (194,)
], ["v1"])

# count the distinct c3 values per c2, then join the counts back onto df2
df1 = sdf1.groupBy("c2").agg(countDistinct("c3").alias("cnt"))
df2.join(df1, df1.c2 == df2.v1).select(df2.v1, df1.cnt).show()
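
If you want the new column to be named v2 exactly as in the question, you could alias the count in the final select, for example:

# same join as above, but renaming the count column to v2
df2.join(df1, df1.c2 == df2.v1).select(df2.v1, df1.cnt.alias("v2")).show()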
