
In my PySpark code I perform more than 10 join operations with multiple groupBy operations in between. To avoid a huge DAG and the resulting re-computation, I decided to save intermediate DataFrames as tables, so I created a database and started saving my DataFrames into it.

After the first 5 joins and some groupBy operations, I saved the result with the commands below, and everything up to this point ran successfully.

spark.sql("DROP TABLE IF EXISTS  half_yearly_data")
half_yearly_data.write.saveAsTable("half_yearly_data") 
half_yearly_data = spark.read.table('half_yearly_data')
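For reference, this is how to check which format saveAsTable actually used; on recent Databricks runtimes it defaults to Delta unless a format is set explicitly, which may be relevant to the error below:

spark.sql("DESCRIBE EXTENDED half_yearly_data").show(truncate=False)
# the Provider row shows the table format (e.g. delta or parquet)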

Later, after performing the remaining joins and groupBys, I run the following statements, which give me an error:

spark.sql("DROP TABLE IF EXISTS db.half_yearly_data")
half_yearly_data.write.saveAsTable("db.half_yearly_data") # Error pointing here
half_yearly_data = spark.read.table('db.half_yearly_data')

The error points to the 2nd line: "The schema of your Delta table has changed in an incompatible way since your DataFrame or DeltaTable object was created. Please redefine your DataFrame or DeltaTable object."

I never defined my table as a Delta table, yet the error is about a Delta table. I then tried the following commands:

spark.sql("DROP TABLE IF EXISTS db.half_yearly_data")
half_yearly_data.write.mode("overwrite").option("overwriteSchema","true").saveAsTable("db.half_yearly_data") # Error pointing here
half_yearly_data = spark.read.table('db.half_yearly_data')

Still the same error. I understand that by the time I write the DataFrame a 2nd time it has new columns and other schema changes compared to the 1st write, but I am dropping the table before creating it again, so I am wondering what I can do here.
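One hypothesis (an assumption on my part, not confirmed by the stack trace): half_yearly_data is still a lazy plan whose lineage begins at spark.read.table on the 1st saved table, so dropping and rewriting a table underneath that plan invalidates it. A sketch of breaking the lineage before the 2nd write, using a hypothetical Parquet staging path:

# Materialize the current result somewhere neutral first, so the final
# write no longer depends on any table that is about to be replaced.
stage_path = "/tmp/half_yearly_stage"  # hypothetical staging location
half_yearly_data.write.mode("overwrite").parquet(stage_path)
half_yearly_data = spark.read.parquet(stage_path)

spark.sql("DROP TABLE IF EXISTS db.half_yearly_data")
half_yearly_data.write.mode("overwrite").saveAsTable("db.half_yearly_data")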

Since the error points to the 2nd line, I checked whether the table had actually been dropped from the database using the command below, and it does not exist in the database.

spark.sql("show tables in db").show()

I also tried saving the data under a different table name, and the same error pops up even though that new table does not exist yet.

The built-in AI-generated suggestions in Databricks notebooks point to Delta tables, but I am not using a Delta table here. How can I overwrite or recreate my table the 2nd time?

1 Answer


Can you try this out: after writing the 1st table, read it back into a new DataFrame variable, then apply the remaining transformations under fresh variable names and write that new DataFrame to the 2nd table. In PySpark (the original sketch was pseudocode; the other DataFrame, the join key, and the aggregation here are placeholders):

df1.write.saveAsTable("table1")
df2 = spark.read.table("table1")
df3 = df2.join(other_df, "key")    # operation 1 (placeholder join)
df4 = df3.groupBy("key").count()   # operation 2 (placeholder aggregation)
df4.write.saveAsTable("table2")

This way each write consumes a DataFrame whose lineage starts at a table read rather than at the table being replaced.


1 Comment

I tried that too but the result is the same. I was wondering how the new 2nd table can have the same issue when it is being created for the first time. For now I'm thinking of skipping this table creation for this DataFrame alone, as all the other DataFrames are being saved properly.
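If the goal is only to cut the DAG rather than keep a managed table, one alternative (a sketch, assuming a reachable checkpoint directory; the path is hypothetical) is DataFrame checkpointing, which materializes the data and truncates the lineage without involving a table at all:

spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  # hypothetical directory
half_yearly_data = half_yearly_data.checkpoint()  # eager by default: writes the data and cuts the lineage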
