1,428 questions
0
votes
0
answers
14
views
Cannot make `VACUUM <delta_table> LITE` work
We used to run the regular VACUUM xxx RETAIN nnn HOURS query. It works well but takes hours on huge databases.
I wanted to explore the new VACUUM xxx LITE mode, but whenever I run it, I get
org.apache.spark.sql....
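For reference, a minimal sketch of both forms, assuming a Spark session on a runtime recent enough to support LITE; the table name is hypothetical:

    # Classic vacuum: lists every file under the table directory (slow on huge tables)
    spark.sql("VACUUM my_table RETAIN 168 HOURS")
    # LITE vacuum: derives removable files from the transaction log instead of a full listing
    spark.sql("VACUUM my_table LITE RETAIN 168 HOURS")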
0
votes
0
answers
111
views
Enabling Delta Table checkpointing when using polars write_delta()
I am using polars.df.write_delta() to initially create, and subsequently append to, Delta Tables in Microsoft Fabric OneLake storage, via a Fabric python notebook.
Having had a production process up ...
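A minimal sketch of forcing a checkpoint from the same notebook, assuming the deltalake package is available and the path (hypothetical here) is the one passed to write_delta():

    from deltalake import DeltaTable

    path = "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<lakehouse>/Tables/my_table"  # hypothetical
    df.write_delta(path, mode="append")
    # deltalake exposes an explicit checkpoint call; run it periodically after appends
    DeltaTable(path).create_checkpoint()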
1
vote
0
answers
63
views
SageMaker Unified Studio overriding Delta Lake configuration to Iceberg on EMR
I am connecting to an EMR cluster through SageMaker Unified Studio (JupyterLab).
My EMR cluster is configured with Delta Lake support, and I have the following Spark properties set on the cluster:
...
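One hedged workaround sketch: re-assert the Delta settings when the notebook's session is built, since a connection profile that injects Iceberg defaults can shadow cluster-level properties; whether this wins depends on when the session is actually created:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate())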
0
votes
1
answer
188
views
Diagnosing duplicate inserts after merge/upsert with deltalake (Python)
I’d really appreciate your help with a duplication issue I’m hitting when using deltalake merges (Python).
Context
Backend: Azure Blob Storage
Libraries: deltalake 1.1.4 (Python), Polars 1.31.0 (...
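For context, the usual shape of a deltalake upsert is below; note that duplicate keys inside the source batch itself each trigger the insert clause, which is a common cause of duplicated rows. The table path and key column are hypothetical:

    from deltalake import DeltaTable

    dt = DeltaTable("az://container/target_table")  # hypothetical Azure path
    (
        dt.merge(
            source=batch,               # e.g. a pyarrow Table built from the Polars frame
            predicate="t.id = s.id",    # hypothetical merge key
            source_alias="s",
            target_alias="t",
        )
        .when_matched_update_all()
        .when_not_matched_insert_all()
        .execute()
    )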
0
votes
0
answers
129
views
Which version of the source delta table is currently being processed by Spark Structured Streaming?
I want to know/monitor which version of the delta table is currently being processed, especially when the stream is started with a startingVersion.
My understanding is when that option is chosen, the ...
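One hedged way to observe this: the streaming progress report exposes the Delta source offset, whose payload includes a reservoirVersion field for the version being read. A sketch with hypothetical path and version:

    stream = (spark.readStream.format("delta")
              .option("startingVersion", 100)
              .load("/mnt/tables/source"))
    query = stream.writeStream.format("noop").start()
    # After some progress, inspect the source offset reported by the last micro-batch
    print(query.lastProgress["sources"][0]["endOffset"])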
0
votes
1
answer
44
views
Azure Synapse SQL MERGE is not updating records; instead it inserts matching records using spark.sql
I have the code below, where the Id is a 36-character GUID. The code executes, but when a matching record is found, instead of updating it inserts the entire record again. What could be the root ...
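A frequent root cause of this pattern is that the ON keys never actually match (for example, stray whitespace or case differences in the GUID), so every source row takes the insert branch. A hedged sketch normalizing both sides; table and column names are hypothetical:

    spark.sql("""
        MERGE INTO target t
        USING source s
        ON lower(trim(t.Id)) = lower(trim(s.Id))
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)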
1
vote
1
answer
136
views
Impact of VACUUM and retention settings on Delta Lake
I have a table that needs to support time travel for up to 6 months. To preserve the necessary metadata and data files, I’ve already configured the table with the following properties:
ALTER TABLE ...
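For reference, the two retention properties that matter for 6 months of time travel look roughly like this (180 days used as an approximation; table name hypothetical): log retention governs the metadata, deleted-file retention governs what VACUUM may remove.

    spark.sql("""
        ALTER TABLE my_table SET TBLPROPERTIES (
            'delta.logRetentionDuration' = 'interval 180 days',
            'delta.deletedFileRetentionDuration' = 'interval 180 days'
        )
    """)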
0
votes
0
answers
149
views
Is 'delta.columnMapping.mode' = 'name' incompatible with table data deduplication?
We have a delta table in Databricks and de-duplicate the rows with dropDuplicates.
We merge data into this table in batches and use
.whenMatchedUpdateAll()
.whenNotMatchedInsertAll()...
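For context, the shape under discussion: deduplicating the incoming batch on the merge key before merging, which operates on logical column names and is therefore independent of the physical names that delta.columnMapping.mode = 'name' introduces. A sketch with hypothetical names:

    from delta.tables import DeltaTable

    target = DeltaTable.forName(spark, "my_table")
    deduped = batch_df.dropDuplicates(["id"])   # dedupe on the merge key first
    (
        target.alias("t")
        .merge(deduped.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )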
1
vote
0
answers
82
views
How to read a delta table and get empty columns in the df?
In my file I have :
{
  "Car": {
    "Model": null,
    "Color": null
  }
}
I use read_delta to read the file:
df = df.read_delta(path)
At the end, I have an empty df. ...
0
votes
0
answers
68
views
DuckDB raises an exception when querying a DeltaLake table after a merge process updates fields |table->struct->list->struct->field|
Environment:
Python 3.9.21
DuckDB 1.1.3
pyarrow 18.1.0
deltalake 18.1.0
Behavior explanation:
Adding and updating string fields in a struct inside a list under the root of the table works fine.
update ...
0
votes
0
answers
60
views
Using a Databricks volume as input to a workflow
Currently, I am working with the Databricks platform. My work mostly involves building ETL pipelines (workflows), so I am familiar with reading input Delta tables, transforming the data, and writing ...
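One hedged sketch of wiring a Unity Catalog volume path into a job task: pass the path as a task parameter and read it in the notebook; the catalog/schema/volume names are hypothetical:

    # Task parameter "input_path" set to e.g. /Volumes/main/default/landing in the job definition
    input_path = dbutils.widgets.get("input_path")
    df = spark.read.format("json").load(input_path)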
1
vote
0
answers
126
views
Doesn't Delta Table data skipping leverage parquet file metadata?
I noticed that querying for the maximum value in a string timestamp column takes 30s with 30+GB of data scanned while querying an actual timestamp column takes 1s with 310MB scanned. Maybe these ...
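Delta does keep per-file min/max statistics, but only for the first delta.dataSkippingNumIndexedCols columns (32 by default), and string statistics are stored as truncated prefixes, which may explain why max() over a string timestamp prunes far less than over a native timestamp. A hypothetical comparison:

    spark.sql("SELECT max(ts_string) FROM my_table").show()  # little pruning possible
    spark.sql("SELECT max(ts) FROM my_table").show()         # file-level min/max stats can prune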
1
vote
1
answer
110
views
Does auto compaction break z-ordering? [closed]
Does auto compaction break existing z-ordered tables in delta lake?
0
votes
2
answers
229
views
Setting up a DBeaver 25.0.1 connection to a Delta Lake v2.4 Parquet table on Hadoop 3.3.4 filesystem
I am trying to create a new connection from DBeaver to a Delta Lake Parquet file located on the HDFS filesystem, which I successfully created with a Spark/Hadoop/Scala/io.delta application.
(...
0
votes
1
answer
90
views
Azure Synapse External Table not accessible from Power BI
I have a delta table in a directory in a storage account, and I am creating an external table in Azure Synapse using this query:
IF NOT EXISTS (SELECT * FROM sys.external_file_formats WHERE name = '...
2
votes
1
answer
211
views
Delta lake MERGE error: [INVALID_EXTRACT_BASE_FIELD_TYPE]
I'm trying to implement the PySpark code below to read delta files saved in the data lake (delta_table) and join them with a data frame of updated records (novos_registros).
#5. Build the matching ...
1
vote
1
answer
307
views
TimestampWithoutTimezone error on Python Notebook
I am using the python notebook in MS Fabric for some data transformations and trying to write a df to a delta table. I am expecting the following code to create a new table using deltalake library:
...
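If the failure is the writer rejecting a timestamp-without-timezone dtype, one hedged workaround is to make the column timezone-aware before writing; the column name and path here are hypothetical:

    import polars as pl

    df = df.with_columns(pl.col("created_at").dt.replace_time_zone("UTC"))
    df.write_delta("abfss://<workspace>/Tables/my_table", mode="overwrite")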
0
votes
0
answers
86
views
Error writing a pandas data frame to a Delta Table using a schema with non-nullable fields
I'm using deltalake version 0.17.0.
Here are the steps we do:
Read the DeltaTable from an existing S3 location: dt = DeltaTable("s3://mylocation/")
Convert it to a pyarrow table: arrow_table =...
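A hedged sketch of the direction usually taken here: attach the non-nullable schema when building the Arrow table, so null violations surface before the write; field names are hypothetical:

    import pyarrow as pa
    from deltalake import write_deltalake

    schema = pa.schema([
        pa.field("id", pa.int64(), nullable=False),
        pa.field("name", pa.string(), nullable=True),
    ])
    arrow_table = pa.Table.from_pandas(pdf, schema=schema)  # errors if "id" contains nulls
    write_deltalake("s3://mylocation/", arrow_table, mode="append")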
1
vote
1
answer
127
views
Delta Lake Merge Rewrites unchanged files
I want to merge into a subset of my delta table's partitions to do incremental upserts that keep two tables in sync. I do not use a whenNotMatchedBySource statement to clean up stale rows in my target ...
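One lever here, as a hedged sketch: a merge condition that names the partitions explicitly lets Delta prune files in untouched partitions instead of rewriting them; partition and key columns are hypothetical:

    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, "/mnt/tables/target")
    (
        target.alias("t")
        .merge(
            updates.alias("s"),
            "t.part_date = s.part_date AND t.part_date >= '2024-01-01' AND t.id = s.id",
        )
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )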
1
vote
1
answer
206
views
How does Spark read unpartitioned Delta tables?
I observe severe underutilization of CPU in my Databricks job run metrics, on average less than 50%, indicating that I do not parallelize enough tasks in the Spark workflow.
I am especially ...
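For an unpartitioned Delta table, Spark splits the scan by file and by spark.sql.files.maxPartitionBytes, so the resulting task count can be checked and tuned roughly like this (path hypothetical):

    spark.conf.set("spark.sql.files.maxPartitionBytes", "64MB")  # smaller splits -> more tasks
    df = spark.read.format("delta").load("/mnt/tables/big_table")
    print(df.rdd.getNumPartitions())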
0
votes
2
answers
421
views
Databricks DeltaLake : Cannot time travel Delta table to version 1. Available versions: [3, 23]
I have been using the following code to read the latest version of a table using the Databricks Time Travel feature for the past few years without any issues. I recently added a new row to the table that I have ...
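A hedged sketch of the usual diagnosis: list the versions the log can still reconstruct, then travel to one of them (table name hypothetical; version 3 echoes the error message):

    spark.sql("DESCRIBE HISTORY my_table").select("version", "timestamp").show()
    df = (spark.read.format("delta")
          .option("versionAsOf", 3)
          .table("my_table"))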
0
votes
0
answers
90
views
delta vacuum didn't clean older data
I executed vacuum on a delta table on Jan 31: retain 450 hours. After the vacuum, I can still access version 22, which falls outside the retention period. So why didn't the vacuum clean out that ...
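Worth noting: VACUUM deletes only unreferenced data files, while version visibility comes from the transaction log, which VACUUM never touches; files still referenced by the current snapshot survive regardless of age. A DRY RUN shows what would actually be removed (table name hypothetical):

    spark.sql("VACUUM my_table RETAIN 450 HOURS DRY RUN").show()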
0
votes
1
answer
95
views
Spark SELECT query ignores partition filters in a Java Spark app but works in Zeppelin
I’m running a Spark SELECT query on a Delta Lake table partitioned by year, month, day, and hour, derived from a timestamp column. When I execute the query in Zeppelin, Spark is aware of the ...
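One hedged way to compare the two environments: print the physical plan in both and check the PartitionFilters entry on the Delta scan (path and filter values hypothetical):

    df = spark.read.format("delta").load("/mnt/tables/events")
    df.filter("year = 2024 AND month = 5 AND day = 1").explain(True)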
2
votes
1
answer
464
views
How to optimize Delta Lake datasets in Polars (sorting, compaction, cleanup)?
I'm planning to use Polars with Delta Lake to manage large, mutable datasets on my laptop. I've encountered two issues:
Dataset is not sorted after merge:
When I use write_delta() in "merge"...
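For reference, deltalake exposes maintenance operations covering compaction, clustering, and cleanup from plain Python; a sketch with a hypothetical path and sort key:

    from deltalake import DeltaTable

    dt = DeltaTable("data/my_table")
    dt.optimize.compact()                          # bin-pack small files
    dt.optimize.z_order(["id"])                    # co-locate rows by a chosen key
    dt.vacuum(retention_hours=168, dry_run=False)  # delete unreferenced files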
1
vote
1
answer
86
views
Is it possible to directly export a Snowpark Dataframe to Databricks?
Is it possible to directly export a Snowpark DataFrame to Databricks, or must the DataFrame first be exported to an external cloud storage (e.g. S3 as parquet) before Databricks can access it?