0

I am running this code and I'd like to get only some columns back, not all columns of all tables that participated in the join.

df_final = df.join(df1,(df['sbr_brand']==df1['sbr_brand'])\
               &(df['sbr_number']==df1['sbr_number'])\
               &(df['calendar_date']==df1['calendar_date'])\
               &(df['check_number']==df1['check_number']))\
         .join(df2,(df['sbr_brand']==df2['brand'])\
               &(df['sbr_number']==df2['store_number'])\
               &(df['calendar_date']==df2['date_of_business'])\
               &(df['check_number']==df2['check_number']),'inner')\
         .select(df['modifier_gross_amount'],df1['check_line_number','item_barcode','dining_option','item_quantity','item_gross_amount','item_net_amount'],df2['brand_id'])

I have an error:

Invalid argument, not a string or column: DataFrame[check_line_number: bigint, item_barcode: string, dining_option: string, item_quantity: double, item_gross_amount: decimal(38,6), item_net_amount: decimal(38,6)] of type <class 'pyspark.sql.dataframe.DataFrame'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

I removed the select statement from the bottom and the code ran perfectly. I then ran the command below and it showed me all columns of all 3 dataframes.

display(df_final)

I also ran a separate command to see if it makes a difference:

df_final2 = df_final.select(df['modifier_gross_amount'],df1['check_line_number','item_barcode','dining_option','item_quantity','item_gross_amount','item_net_amount'],df2['brand_id'])

But I was given the same error. Not sure how to fix this. Please advise.

2 Answers 2

1

Try using something as below -

Sample Input Dataframes

df1 = spark.createDataFrame(data=[(1,1,3),(2,1,1),(2,2,3),(1,2,3),(1,2,1)], schema=['id1', 'id2', 'value'])
df2 = spark.createDataFrame(data=[(1,1,3),(2,1,1),(2,2,3),(1,2,3),(1,2,1)], schema=['id1', 'id2', 'value'])

Output (Using Joins)

df1.join(df2, (df1["id1"] == df2["id1"]) & (df1["id2"] == df2["id2"]) & (df1["value"] == df2["value"])).select(df1["id1"], df2["id2"], df1["value"]).show()

+---+---+-----+
|id1|id2|value|
+---+---+-----+
|  1|  1|    3|
|  1|  2|    1|
|  1|  2|    3|
|  2|  1|    1|
|  2|  2|    3|
+---+---+-----+
Sign up to request clarification or add additional context in comments.

Comments

0

df['col1'] return a Column, while df['col1', 'col2'] return a DataFrame.

The argument of select function must be a string or Column. So it should be:

df_final2 = df_final.select(df['modifier_gross_amount'],df1['check_line_number'],df1['item_barcode'],df1['dining_option'],df1['item_quantity'],df1['item_gross_amount'],df1['item_net_amount'],df2['brand_id'])

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.