1,428 questions
0
votes
0
answers
14
views
Cannot make `VACUUM <delta_table> LITE` work
We used to run the regular VACUUM xxx RETAIN nnn HOURS query. It works well but takes hours on huge databases.
I wanted to explore the new VACUUM xxx LITE mode, but whenever I run it, I get
org.apache.spark.sql....
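For reference, a minimal sketch of both forms, assuming a Spark session on a runtime recent enough to support LITE; the table name is hypothetical:

    # Classic vacuum: lists every file under the table directory (slow on huge tables)
    spark.sql("VACUUM my_table RETAIN 168 HOURS")
    # LITE vacuum: derives removable files from the transaction log instead of a full listing
    spark.sql("VACUUM my_table LITE RETAIN 168 HOURS")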
0
votes
0
answers
111
views
Enabling Delta Table checkpointing when using polars write_delta()
I am using polars.df.write_delta() to initially create, and subsequently append to, Delta Tables in Microsoft Fabric OneLake storage, via a Fabric python notebook.
Having had a production process up ...
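A minimal sketch of forcing a checkpoint from the same notebook, assuming the deltalake package is available and the path (hypothetical here) is the one passed to write_delta():

    from deltalake import DeltaTable

    path = "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<lakehouse>/Tables/my_table"  # hypothetical
    df.write_delta(path, mode="append")
    # deltalake exposes an explicit checkpoint call; run it periodically after appends
    DeltaTable(path).create_checkpoint()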
1
vote
0
answers
63
views
SageMaker Unified Studio overriding Delta Lake configuration to Iceberg on EMR
I am connecting to an EMR cluster through SageMaker Unified Studio (JupyterLab).
My EMR cluster is configured with Delta Lake support, and I have the following Spark properties set on the cluster:
...
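One hedged workaround sketch: re-assert the Delta settings when the notebook's session is built, since a connection profile that injects Iceberg defaults can shadow cluster-level properties; whether this wins depends on when the session is actually created:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate())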
0
votes
1
answer
188
views
Diagnosing duplicate inserts after merge/upsert with deltalake (Python)
I’d really appreciate your help with a duplication issue I’m hitting when using deltalake merges (Python).
Context
Backend: Azure Blob Storage
Libraries: deltalake 1.1.4 (Python), Polars 1.31.0 (...
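For context, the usual shape of a deltalake upsert is below; note that duplicate keys inside the source batch itself each trigger the insert clause, which is a common cause of duplicated rows. The table path and key column are hypothetical:

    from deltalake import DeltaTable

    dt = DeltaTable("az://container/target_table")  # hypothetical Azure path
    (
        dt.merge(
            source=batch,               # e.g. a pyarrow Table built from the Polars frame
            predicate="t.id = s.id",    # hypothetical merge key
            source_alias="s",
            target_alias="t",
        )
        .when_matched_update_all()
        .when_not_matched_insert_all()
        .execute()
    )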
0
votes
0
answers
129
views
Which version of the source delta table is currently being processed by Spark Structured Streaming?
I want to know/monitor which version of the delta table is currently being processed, especially when the stream is started with a startingVersion.
My understanding is when that option is chosen, the ...
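One hedged way to observe this: the streaming progress report exposes the Delta source offset, whose payload includes a reservoirVersion field for the version being read. A sketch with hypothetical path and version:

    stream = (spark.readStream.format("delta")
              .option("startingVersion", 100)
              .load("/mnt/tables/source"))
    query = stream.writeStream.format("noop").start()
    # After some progress, inspect the source offset reported by the last micro-batch
    print(query.lastProgress["sources"][0]["endOffset"])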
0
votes
1
answer
44
views
Azure Synapse SQL MERGE is not updating records; instead it inserts matching records using spark.sql
I have the code below, where the Id is a 36-character GUID. The code executes, but when a matching record is found, instead of updating it inserts the entire record again. What could be the root ...
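A frequent root cause of this pattern is that the ON keys never actually match (for example, stray whitespace or case differences in the GUID), so every source row takes the insert branch. A hedged sketch normalizing both sides; table and column names are hypothetical:

    spark.sql("""
        MERGE INTO target t
        USING source s
        ON lower(trim(t.Id)) = lower(trim(s.Id))
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)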
1
vote
1
answer
136
views
Impact of VACUUM and retention settings on Delta Lake
I have a table that needs to support time travel for up to 6 months. To preserve the necessary metadata and data files, I’ve already configured the table with the following properties:
ALTER TABLE ...
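For reference, the two retention properties that matter for 6 months of time travel look roughly like this (180 days used as an approximation; table name hypothetical): log retention governs the metadata, deleted-file retention governs what VACUUM may remove.

    spark.sql("""
        ALTER TABLE my_table SET TBLPROPERTIES (
            'delta.logRetentionDuration' = 'interval 180 days',
            'delta.deletedFileRetentionDuration' = 'interval 180 days'
        )
    """)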
0
votes
0
answers
149
views
Is 'delta.columnMapping.mode' = 'name' incompatible with table data deduplication?
We have a delta table in Databricks and de-duplicate the rows with dropDuplicates.
We merge data into this table in batches and use
.whenMatchedUpdateAll()
.whenNotMatchedInsertAll()...
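For context, the shape under discussion: deduplicating the incoming batch on the merge key before merging, which operates on logical column names and is therefore independent of the physical names that delta.columnMapping.mode = 'name' introduces. A sketch with hypothetical names:

    from delta.tables import DeltaTable

    target = DeltaTable.forName(spark, "my_table")
    deduped = batch_df.dropDuplicates(["id"])   # dedupe on the merge key first
    (
        target.alias("t")
        .merge(deduped.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )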
1
vote
0
answers
82
views
How to read a delta table and get empty columns in the df?
In my file I have :
{
  "Car": {
    "Model": null,
    "Color": null
  }
}
I use read_delta to read the file:
df = df.read_delta(path)
At the end, I have an empty df. ...
0
votes
0
answers
68
views
DuckDB raises an exception when querying a DeltaLake table after a merge process updates fields |table->struct->list->struct->field|
Environment:
Python 3.9.21
DuckDB 1.1.3
pyarrow 18.1.0
deltalake 18.1.0
Behavior explanation:
Adding and updating string fields in a struct inside a list under the root of the table works fine.
update ...
0
votes
0
answers
60
views
Using a Databricks volume as input to a workflow
Currently, I am working with the Databricks platform. My work mostly involves building ETL pipelines (workflows), so I am familiar with reading input Delta tables, transforming the data, and writing ...
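One hedged sketch of wiring a Unity Catalog volume path into a job task: pass the path as a task parameter and read it in the notebook; the catalog/schema/volume names are hypothetical:

    # Task parameter "input_path" set to e.g. /Volumes/main/default/landing in the job definition
    input_path = dbutils.widgets.get("input_path")
    df = spark.read.format("json").load(input_path)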
1
vote
0
answers
126
views
Doesn't Delta Table data skipping leverage parquet file metadata?
I noticed that querying for the maximum value in a string timestamp column takes 30s with 30+GB of data scanned while querying an actual timestamp column takes 1s with 310MB scanned. Maybe these ...
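Delta does keep per-file min/max statistics, but only for the first delta.dataSkippingNumIndexedCols columns (32 by default), and string statistics are stored as truncated prefixes, which may explain why max() over a string timestamp prunes far less than over a native timestamp. A hypothetical comparison:

    spark.sql("SELECT max(ts_string) FROM my_table").show()  # little pruning possible
    spark.sql("SELECT max(ts) FROM my_table").show()         # file-level min/max stats can prune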
1
vote
1
answer
110
views
Does auto compaction break z-ordering? [closed]
Does auto compaction break existing z-ordered tables in delta lake?
0
votes
2
answers
229
views
Setting up a DBeaver 25.0.1 connection to a Delta Lake v2.4 Parquet table on Hadoop 3.3.4 filesystem
I am trying to create a new connection from DBeaver to a Delta Lake Parquet file located on the HDFS filesystem, which I successfully created with a Spark/Hadoop/Scala/io.delta application.
(...
0
votes
1
answer
90
views
Azure Synapse External Table not accessible from Power BI
I have a delta table in a directory in a storage account, and I am creating an external table in Azure Synapse using this query:
IF NOT EXISTS (SELECT * FROM sys.external_file_formats WHERE name = '...
2
votes
1
answer
211
views
Delta lake MERGE error: [INVALID_EXTRACT_BASE_FIELD_TYPE]
I'm trying to implement the PySpark code below to read delta files saved in the data lake (delta_table) and join them with a data frame of updated records (novos_registros).
#5. Build the matching ...
1
vote
1
answer
307
views
TimestampWithoutTimezone error on Python Notebook
I am using the python notebook in MS Fabric for some data transformations and trying to write a df to a delta table. I am expecting the following code to create a new table using deltalake library:
...
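If the failure is the writer rejecting a timestamp-without-timezone dtype, one hedged workaround is to make the column timezone-aware before writing; the column name and path here are hypothetical:

    import polars as pl

    df = df.with_columns(pl.col("created_at").dt.replace_time_zone("UTC"))
    df.write_delta("abfss://<workspace>/Tables/my_table", mode="overwrite")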
0
votes
0
answers
86
views
Error writing a pandas data frame to a Delta Table using a schema with non-nullable fields
I'm using deltalake version 0.17.0.
Here are the steps we do:
Read the DeltaTable from an existing S3 location: dt = DeltaTable("s3://mylocation/")
Convert it to a pyarrow table: arrow_table =...
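A hedged sketch of the direction usually taken here: attach the non-nullable schema when building the Arrow table, so null violations surface before the write; field names are hypothetical:

    import pyarrow as pa
    from deltalake import write_deltalake

    schema = pa.schema([
        pa.field("id", pa.int64(), nullable=False),
        pa.field("name", pa.string(), nullable=True),
    ])
    arrow_table = pa.Table.from_pandas(pdf, schema=schema)  # errors if "id" contains nulls
    write_deltalake("s3://mylocation/", arrow_table, mode="append")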
1
vote
1
answer
127
views
Delta Lake Merge Rewrites unchanged files
I want to merge into a subset of my delta table's partitions to do incremental upserts that keep two tables in sync. I do not use a whenNotMatchedBySource statement to clean up stale rows in my target ...
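One lever here, as a hedged sketch: a merge condition that names the partitions explicitly lets Delta prune files in untouched partitions instead of rewriting them; partition and key columns are hypothetical:

    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, "/mnt/tables/target")
    (
        target.alias("t")
        .merge(
            updates.alias("s"),
            "t.part_date = s.part_date AND t.part_date >= '2024-01-01' AND t.id = s.id",
        )
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )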
1
vote
1
answer
206
views
How does Spark read unpartitioned Delta tables?
I observe severe underutilization of CPU in my Databricks job run metrics, on average less than 50%, indicating that I do not parallelize enough tasks in the Spark workflow.
I am especially ...
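For an unpartitioned Delta table, Spark splits the scan by file and by spark.sql.files.maxPartitionBytes, so the resulting task count can be checked and tuned roughly like this (path hypothetical):

    spark.conf.set("spark.sql.files.maxPartitionBytes", "64MB")  # smaller splits -> more tasks
    df = spark.read.format("delta").load("/mnt/tables/big_table")
    print(df.rdd.getNumPartitions())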
0
votes
2
answers
421
views
Databricks DeltaLake : Cannot time travel Delta table to version 1. Available versions: [3, 23]
I have been using the following code to read the latest version of a table using the Databricks Time Travel feature for the past few years without any issues. I recently added a new row to the table that I have ...
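A hedged sketch of the usual diagnosis: list the versions the log can still reconstruct, then travel to one of them (table name hypothetical; version 3 echoes the error message):

    spark.sql("DESCRIBE HISTORY my_table").select("version", "timestamp").show()
    df = (spark.read.format("delta")
          .option("versionAsOf", 3)
          .table("my_table"))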
0
votes
0
answers
90
views
delta vacuum didn't clean older data
I executed vacuum on a delta table on Jan 31: retain 450 hours. After the vacuum, I can still access version 22, which falls outside the retention period. So why didn't the vacuum clean out that ...
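Worth noting: VACUUM deletes only unreferenced data files, while version visibility comes from the transaction log, which VACUUM never touches; files still referenced by the current snapshot survive regardless of age. A DRY RUN shows what would actually be removed (table name hypothetical):

    spark.sql("VACUUM my_table RETAIN 450 HOURS DRY RUN").show()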
0
votes
1
answer
95
views
Spark SELECT query ignores partition filters in a Java Spark app but works in Zeppelin
I’m running a Spark SELECT query on a Delta Lake table partitioned by year, month, day, and hour, derived from a timestamp column. When I execute the query in Zeppelin, Spark is aware of the ...
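One hedged way to compare the two environments: print the physical plan in both and check the PartitionFilters entry on the Delta scan (path and filter values hypothetical):

    df = spark.read.format("delta").load("/mnt/tables/events")
    df.filter("year = 2024 AND month = 5 AND day = 1").explain(True)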
2
votes
1
answer
464
views
How to optimize Delta Lake datasets in Polars (sorting, compaction, cleanup)?
I'm planning to use Polars with Delta Lake to manage large, mutable datasets on my laptop. I've encountered two issues:
Dataset is not sorted after merge:
When I use write_delta() in "merge"...
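For reference, deltalake exposes maintenance operations covering compaction, clustering, and cleanup from plain Python; a sketch with a hypothetical path and sort key:

    from deltalake import DeltaTable

    dt = DeltaTable("data/my_table")
    dt.optimize.compact()                          # bin-pack small files
    dt.optimize.z_order(["id"])                    # co-locate rows by a chosen key
    dt.vacuum(retention_hours=168, dry_run=False)  # delete unreferenced files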
1
vote
1
answer
86
views
Is it possible to directly export a Snowpark Dataframe to Databricks?
Is it possible to directly export a Snowpark DataFrame to Databricks, or must the DataFrame first be exported to an external cloud storage (e.g. S3 as parquet) before Databricks can access it?