I am running this code and I'd like to get only some columns back, not all columns of all tables that participated in the join.
df_final = (
    df.join(df1, (df['sbr_brand'] == df1['sbr_brand'])
                 & (df['sbr_number'] == df1['sbr_number'])
                 & (df['calendar_date'] == df1['calendar_date'])
                 & (df['check_number'] == df1['check_number']))
      .join(df2, (df['sbr_brand'] == df2['brand'])
                 & (df['sbr_number'] == df2['store_number'])
                 & (df['calendar_date'] == df2['date_of_business'])
                 & (df['check_number'] == df2['check_number']), 'inner')
      .select(df['modifier_gross_amount'],
              df1['check_line_number', 'item_barcode', 'dining_option',
                  'item_quantity', 'item_gross_amount', 'item_net_amount'],
              df2['brand_id'])
)
It fails with this error:
Invalid argument, not a string or column: DataFrame[check_line_number: bigint, item_barcode: string, dining_option: string, item_quantity: double, item_gross_amount: decimal(38,6), item_net_amount: decimal(38,6)] of type <class 'pyspark.sql.dataframe.DataFrame'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
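From the error message, it looks like df1['check_line_number', 'item_barcode', ...] is itself a DataFrame rather than a set of Column objects, and select() only accepts strings or Columns. A minimal sketch of what I mean (the toy data is a placeholder and spark is an existing SparkSession; the column names match my real df1):
df1_demo = spark.createDataFrame(
    [(1, "0001", "dine_in")],
    ["check_line_number", "item_barcode", "dining_option"],
)
# Indexing with a single name yields a Column, which select() accepts.
print(type(df1_demo["check_line_number"]))
# <class 'pyspark.sql.column.Column'>
# Indexing with several names yields a new DataFrame, which select() rejects.
print(type(df1_demo["check_line_number", "item_barcode"]))
# <class 'pyspark.sql.dataframe.DataFrame'>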
When I removed the select from the end, the code ran fine. I then ran the command below, and it showed all columns from all three DataFrames.
display(df_final)
I also ran the select as a separate command to see if it made a difference:
df_final2 = df_final.select(df['modifier_gross_amount'], df1['check_line_number', 'item_barcode', 'dining_option', 'item_quantity', 'item_gross_amount', 'item_net_amount'], df2['brand_id'])
but it produced the same error. I'm not sure how to fix this; please advise.
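In case it helps clarify what I'm after: I suspect the select needs each column passed as its own argument, something like the sketch below, but I don't know whether this is the right approach or whether there is a better way to pick columns after a multi-DataFrame join.
# Unverified sketch -- each column as a separate argument instead of
# indexing df1 with a tuple of names. Column names are from my query above.
df_final2 = df_final.select(
    df['modifier_gross_amount'],
    df1['check_line_number'],
    df1['item_barcode'],
    df1['dining_option'],
    df1['item_quantity'],
    df1['item_gross_amount'],
    df1['item_net_amount'],
    df2['brand_id'],
)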