
Let's say I have two DataFrames in Spark:

firstdf = sqlContext.createDataFrame([
    {'firstdf-id': 1, 'firstdf-column1': 2, 'firstdf-column2': 3, 'firstdf-column3': 4},
    {'firstdf-id': 2, 'firstdf-column1': 3, 'firstdf-column2': 4, 'firstdf-column3': 5}
])

seconddf = sqlContext.createDataFrame([
    {'seconddf-id': 1, 'seconddf-column1': 2, 'seconddf-column2': 4, 'seconddf-column3': 5},
    {'seconddf-id': 2, 'seconddf-column1': 6, 'seconddf-column2': 7, 'seconddf-column3': 8}
])

Now I want to join them on multiple columns (any number greater than one).

What I have is an array of columns from the first DataFrame and an array of columns from the second DataFrame. These arrays have the same size, and I want to join on the columns specified in them. For example:

columnsFirstDf = ['firstdf-id', 'firstdf-column1']
columnsSecondDf = ['seconddf-id', 'seconddf-column1']

Since these arrays have variable sizes, I can't use this kind of approach:

from pyspark.sql.functions import col

firstdf.join(
    seconddf,
    (col(columnsFirstDf[0]) == col(columnsSecondDf[0])) &
    (col(columnsFirstDf[1]) == col(columnsSecondDf[1])),
    'inner'
)

Is there any way that I can join on multiple columns dynamically?

  • Why not use a for loop? You could also use the itertools library to build a Cartesian product of your lists.

2 Answers


Why not use a simple comprehension:

from pyspark.sql.functions import col

firstdf.join(
    seconddf,
    [col(f) == col(s) for (f, s) in zip(columnsFirstDf, columnsSecondDf)],
    "inner"
)

Since the conditions are combined with a logical AND, it is enough to provide a list of conditions without the & operator.
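To see the equivalence, here is a minimal sketch (reusing the question's firstdf, seconddf, columnsFirstDf and columnsSecondDf) that folds the same list of conditions into a single expression with functools.reduce:

from functools import reduce
from pyspark.sql.functions import col

# One equality condition per column pair.
conditions = [col(f) == col(s) for (f, s) in zip(columnsFirstDf, columnsSecondDf)]

# Passing the list lets Spark AND the conditions for you...
joined = firstdf.join(seconddf, conditions, "inner")

# ...which is equivalent to AND-ing them explicitly:
joined_explicit = firstdf.join(seconddf, reduce(lambda a, b: a & b, conditions), "inner")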


2 Comments

What should I do if the column names are the same in both DataFrames?
If the column names are the same in both DataFrames, either alias the DataFrames themselves or alias the individual columns using "as". See stackoverflow.com/q/33778664/5986661
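For illustration, a minimal sketch of the aliasing approach, assuming two hypothetical DataFrames df1 and df2 that both have columns named id and column1 (unlike the question's firstdf and seconddf, whose names differ):

from pyspark.sql.functions import col

# Hypothetical DataFrames that share column names.
df1 = sqlContext.createDataFrame([(1, 2)], ["id", "column1"])
df2 = sqlContext.createDataFrame([(1, 2)], ["id", "column1"])

a, b = df1.alias("a"), df2.alias("b")

joined = a.join(
    b,
    (col("a.id") == col("b.id")) & (col("a.column1") == col("b.column1")),
    "inner"
)

# Qualified names disambiguate the otherwise duplicate columns:
joined.select("a.id", "b.column1").show()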

@Mohan, sorry, I don't have the reputation to add a comment. When the join columns have the same names in both DataFrames, create a list of those column names and use it in the join:

col_list = ["id", "column1", "column2"]
firstdf.join(seconddf, col_list, "inner")
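One design note on this form: joining on a list of column names performs an equi-join and keeps a single copy of each join key, so the result has no duplicate columns. A small sketch with hypothetical df1 and df2 that share the join columns:

df1 = sqlContext.createDataFrame([(1, 2, 3)], ["id", "column1", "column2"])
df2 = sqlContext.createDataFrame([(1, 2, 9)], ["id", "column1", "column3"])

df1.join(df2, ["id", "column1"], "inner").columns
# ['id', 'column1', 'column2', 'column3'] -- each join key appears once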

