
I have a DataFrame with 10 columns and want to write a function that concatenates columns based on an array of column names that comes in as input:

arr = ["col1", "col2", "col3"]

This is what I have so far:

newDF = rawDF.select(concat(col("col1"), col("col2"), col("col3") )).exceptAll(updateDF.select( concat(col("col1"), col("col2"), col("col3") ) ) )

Also:

df3 = df2.join(df1, concat(df2.col1, df2.col2, df2.col3) == df1.col5)

But I want to do this with a loop or a function driven by the input array, instead of hard-coding the columns as above.

What is the best way?

  • Can you post your expected output?

1 Answer


You can unpack the columns using *. In the pyspark.sql docs, whenever a function's signature is written as (*cols), you can pass it an unpacked sequence of columns. For concat:

pyspark.sql.functions.concat(*cols)

from pyspark.sql import functions as F

arr = ["col1", "col2", "col3"]
newDF = rawDF.select(F.concat(*[F.col(c) for c in arr])) \
             .exceptAll(updateDF.select(F.concat(*[F.col(c) for c in arr])))
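
As a side note, and assuming the usual pyspark behavior that concat also accepts column names as plain strings (this shorthand is not shown in the answer itself), you can unpack the list of names directly:

from pyspark.sql import functions as F

arr = ["col1", "col2", "col3"]

# concat accepts column-name strings, so unpacking the names directly
# is equivalent to wrapping each one in F.col first.
newDF = rawDF.select(F.concat(*arr)).exceptAll(updateDF.select(F.concat(*arr)))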

For joins:

arr = ['col1', 'col2', 'col3']
df3 = df2.join(df1, F.concat(*[F.col(c) for c in arr]) == df1.col5)
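
Since the question explicitly asks for a function, the same trick can be packaged once and reused for both the select and the join. A minimal sketch, with a hypothetical helper name concat_cols that is not part of the original answer:

from pyspark.sql import functions as F

def concat_cols(cols, df=None):
    # Build one concatenated Column from a list of column names.
    # If a DataFrame is passed, qualify each column against it, which
    # avoids ambiguity when both sides of a join share column names.
    source = (lambda c: df[c]) if df is not None else F.col
    return F.concat(*[source(c) for c in cols])

arr = ["col1", "col2", "col3"]
newDF = rawDF.select(concat_cols(arr)).exceptAll(updateDF.select(concat_cols(arr)))
df3 = df2.join(df1, concat_cols(arr, df2) == df1.col5)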

2 Comments

  • Verbal explanations are often helpful.
  • Also, how would you do this part? df3 = df2.join(df1, concat(df2.col1, df2.col2, df2.col3) == df1.col5)
