
I have two DataFrames:

df1:

c1    c2    c3
1     192   1
3     192   2
4     193   3
5     193   3
7     193   5
9     194   7

df2:

v1
192 
193
194

I want to add a new column to df2; the expected result is:

df2:

v1     v2
192    2
193    2
194    1

Explanation: for v1 = 193, there are 3 rows in df1 with c2 = 193, and their c3 values are 3, 3, and 5. The distinct values are 3 and 5, so the count is 2 and v2 for 193 is 2.

Thank you; a Python version would be best.

2 Answers


You can join the two DataFrames on c2 == v1, group by v1, and take the distinct count of c3.

import pyspark.sql.functions as F

# join df1 to df2 on c2 == v1, then count the distinct c3 values for each v1
result = (df1.join(df2, df1.c2 == df2.v1)
             .groupBy('v1')
             .agg(F.countDistinct('c3').alias('v2'))
         )

result.show()
+---+---+
| v1| v2|
+---+---+
|193|  2|
|192|  2|
|194|  1|
+---+---+
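
If your DataFrames are plain pandas DataFrames rather than Spark ones, a minimal sketch of the same idea (group by the key, count distinct values, then map the counts back) could look like this; the data below is just the sample from the question:

import pandas as pd

# sample data from the question, assumed to live in pandas DataFrames
df1 = pd.DataFrame({"c1": [1, 3, 4, 5, 7, 9],
                    "c2": [192, 192, 193, 193, 193, 194],
                    "c3": [1, 2, 3, 3, 5, 7]})
df2 = pd.DataFrame({"v1": [192, 193, 194]})

# distinct count of c3 per c2, then map those counts onto df2.v1
distinct_counts = df1.groupby("c2")["c3"].nunique()
df2["v2"] = df2["v1"].map(distinct_counts)
print(df2)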

You can try it as below:

from pyspark.sql.functions import countDistinct

sdf1 = spark.createDataFrame([
    (1, 192, 1),
    (3, 192, 2),
    (4, 193, 3),
    (5, 193, 3),
    (7, 193, 5),
    (9, 194, 7)
], ["c1", "c2", "c3"])

df2 = spark.createDataFrame([
    (192,),
    (193,),
    (194,)
], ["v1"])

# count the distinct c3 values per c2, then join the counts back onto df2
df1 = sdf1.groupBy("c2").agg(countDistinct("c3").alias("cnt"))
df2.join(df1, df1.c2 == df2.v1).select(df2.v1, df1.cnt).show()
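
If you want the new column to be named v2 exactly as in the question, you could alias the count in the final select, for example:

# same join as above, but renaming the count column to v2
df2.join(df1, df1.c2 == df2.v1).select(df2.v1, df1.cnt.alias("v2")).show()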
