
Let's say I have two DataFrames in Spark:

firstdf = sqlContext.createDataFrame([
    {'firstdf-id': 1, 'firstdf-column1': 2, 'firstdf-column2': 3, 'firstdf-column3': 4},
    {'firstdf-id': 2, 'firstdf-column1': 3, 'firstdf-column2': 4, 'firstdf-column3': 5}
])

seconddf = sqlContext.createDataFrame([
    {'seconddf-id': 1, 'seconddf-column1': 2, 'seconddf-column2': 4, 'seconddf-column3': 5},
    {'seconddf-id': 2, 'seconddf-column1': 6, 'seconddf-column2': 7, 'seconddf-column3': 8}
])

Now I want to join them on multiple columns (any number greater than one).

What I have is an array of columns from the first DataFrame and an array of columns from the second DataFrame. These arrays have the same size, and I want to join on the columns specified in them. For example:

columnsFirstDf = ['firstdf-id', 'firstdf-column1']
columnsSecondDf = ['seconddf-id', 'seconddf-column1']

Since these arrays have variable sizes, I can't use this kind of approach:

from pyspark.sql.functions import col

firstdf.join(
    seconddf,
    (col(columnsFirstDf[0]) == col(columnsSecondDf[0])) &
    (col(columnsFirstDf[1]) == col(columnsSecondDf[1])),
    'inner'
)

Is there any way that I can join on multiple columns dynamically?

  • Why not use a for loop? You could also use the itertools library to build a Cartesian product of your lists.

2 Answers


Why not use a simple comprehension:

from pyspark.sql.functions import col

firstdf.join(
    seconddf,
    [col(f) == col(s) for (f, s) in zip(columnsFirstDf, columnsSecondDf)],
    "inner"
)

Since the conditions are combined with a logical AND, it is enough to provide a list of conditions without the & operator.
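To see the equivalence, here is a minimal sketch (reusing the question's firstdf, seconddf, columnsFirstDf and columnsSecondDf) that folds the same list of conditions into a single expression with functools.reduce:

from functools import reduce
from pyspark.sql.functions import col

# One equality condition per column pair.
conditions = [col(f) == col(s) for (f, s) in zip(columnsFirstDf, columnsSecondDf)]

# Passing the list lets Spark AND the conditions for you...
joined = firstdf.join(seconddf, conditions, "inner")

# ...which is equivalent to AND-ing them explicitly:
joined_explicit = firstdf.join(seconddf, reduce(lambda a, b: a & b, conditions), "inner")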


2 Comments

What should I do if the column names are the same in both DataFrames?
If the column names are the same in both DataFrames, either alias the DataFrames themselves or alias the individual columns using "as". See stackoverflow.com/q/33778664/5986661
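For illustration, a minimal sketch of the aliasing approach, assuming two hypothetical DataFrames df1 and df2 that both have columns named id and column1 (unlike the question's firstdf and seconddf, whose names differ):

from pyspark.sql.functions import col

# Hypothetical DataFrames that share column names.
df1 = sqlContext.createDataFrame([(1, 2)], ["id", "column1"])
df2 = sqlContext.createDataFrame([(1, 2)], ["id", "column1"])

a, b = df1.alias("a"), df2.alias("b")

joined = a.join(
    b,
    (col("a.id") == col("b.id")) & (col("a.column1") == col("b.column1")),
    "inner"
)

# Qualified names disambiguate the otherwise duplicate columns:
joined.select("a.id", "b.column1").show()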

@Mohan, sorry, I don't have the reputation to add a comment. When the join columns have the same names in both DataFrames, create a list of those column names and use it in the join:

col_list = ["id", "column1", "column2"]
firstdf.join(seconddf, col_list, "inner")
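One design note on this form: joining on a list of column names performs an equi-join and keeps a single copy of each join key, so the result has no duplicate columns. A small sketch with hypothetical df1 and df2 that share the join columns:

df1 = sqlContext.createDataFrame([(1, 2, 3)], ["id", "column1", "column2"])
df2 = sqlContext.createDataFrame([(1, 2, 9)], ["id", "column1", "column3"])

df1.join(df2, ["id", "column1"], "inner").columns
# ['id', 'column1', 'column2', 'column3'] -- each join key appears once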

