0 votes · 0 answers · 14 views

We used to run a regular VACUUM xxx RETAIN nnn HOURS query. It works well but takes hours on huge databases. I wanted to explore the new VACUUM xxx LITE mode, but whenever I run it, I get org.apache.spark.sql....
asked by Alexander Pavlov
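For reference, a rough sketch of the two vacuum variants, assuming Delta Lake 3.3+ or a recent Databricks runtime and a hypothetical table named `events`; `spark` is the notebook's existing Delta-enabled SparkSession:

```python
# Sketch only; table name `events` and the retention window are hypothetical.
# Classic vacuum: lists every file under the table directory, which is what
# makes it slow on very large tables.
spark.sql("VACUUM events RETAIN 168 HOURS")

# LITE mode (recent Delta Lake / Databricks releases): builds the candidate file
# list from the transaction log instead of a full directory listing.
spark.sql("VACUUM events LITE")
```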
0 votes · 0 answers · 111 views

I am using polars.df.write_delta() to initially create, and subsequently append to, Delta Tables in Microsoft Fabric OneLake storage, via a Fabric Python notebook. Having had a production process up ...
asked by Stuart J Cuthbertson
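For context, the create-then-append pattern with Polars usually looks like the sketch below; the OneLake path is hypothetical and authentication is assumed to be handled by the Fabric notebook runtime:

```python
import polars as pl

# Hypothetical OneLake table path; adjust to your workspace/lakehouse.
table_path = "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse/Tables/my_table"

# First run: create the table (mode="error" fails if it already exists).
pl.DataFrame({"id": [1, 2], "value": ["a", "b"]}).write_delta(table_path, mode="error")

# Later runs: append new rows to the existing table.
pl.DataFrame({"id": [3], "value": ["c"]}).write_delta(table_path, mode="append")
```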
1 vote · 0 answers · 63 views

I am connecting to an EMR cluster through SageMaker Unified Studio (JupyterLab). My EMR cluster is configured with Delta Lake support, and I have the following Spark properties set on the cluster: ...
asked by sakshi
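For context, the Delta-enabling Spark properties referenced here are typically the two below; a sketch assuming the Delta jars are already available on the EMR cluster:

```python
from pyspark.sql import SparkSession

# Sketch: the usual session-level Delta Lake configuration. On EMR these are often
# set cluster-wide (spark-defaults) rather than in the notebook; the Delta jars are
# assumed to be on the classpath already.
spark = (
    SparkSession.builder
    .appName("delta-on-emr")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
```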
0 votes · 1 answer · 188 views

I’d really appreciate your help with a duplication issue I’m hitting when using deltalake merges (Python). Context: Backend: Azure Blob Storage; Libraries: deltalake 1.1.4 (Python), Polars 1.31.0 (...
asked by Octavio
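For reference, a minimal sketch of a Python deltalake merge; the table URI and `id` key are hypothetical. Duplicates usually mean the predicate does not match exactly one target row per source row, or that concurrent writers are racing:

```python
import polars as pl
from deltalake import DeltaTable

# Hypothetical Azure table URI and key column; storage_options/credentials omitted.
dt = DeltaTable("az://mycontainer/my_table")
source = pl.DataFrame({"id": [1, 2], "value": ["x", "y"]}).to_arrow()

(
    dt.merge(
        source=source,
        predicate="t.id = s.id",  # must identify at most one target row per source row
        source_alias="s",
        target_alias="t",
    )
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute()
)
```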
0 votes · 0 answers · 129 views

I want to know/monitor which version of the delta table is currently being processed, especially when the stream is started with a startingVersion. My understanding is that when that option is chosen, the ...
asked by Saugat Mukherjee
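A sketch of the kind of stream in question, with a hypothetical table path; `spark` is the active session. Which commit version a given micro-batch corresponds to is not surfaced directly, which is what the question asks about:

```python
# Hypothetical Delta table path; start reading changes from commit version 42 onward.
stream_df = (
    spark.readStream
    .format("delta")
    .option("startingVersion", "42")
    .load("/mnt/lake/my_table")
)
```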
0 votes · 1 answer · 44 views

I have the below code where the Id is a 36-character GUID. The code executes, but when a matching record is found, instead of updating it inserts the entire record again. What could be the root ...
asked by Sandeep T
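For reference, the usual shape of such an upsert in PySpark, with a hypothetical table path and a hypothetical `updates` DataFrame; if the GUIDs differ in case or carry stray whitespace, the match condition fails and rows get inserted instead of updated:

```python
from delta.tables import DeltaTable

# `spark` is the active Delta-enabled session; path and source data are hypothetical.
updates = spark.createDataFrame(
    [("3f2504e0-4f89-11d3-9a0c-0305e82c3301", "new value")],
    ["Id", "value"],
)

target = DeltaTable.forPath(spark, "/mnt/lake/customers")

(
    target.alias("t")
    .merge(updates.alias("s"), "t.Id = s.Id")  # string GUID comparison is case-sensitive
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```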
1 vote · 1 answer · 136 views

I have a table that needs to support time travel for up to 6 months. To preserve the necessary metadata and data files, I’ve already configured the table with the following properties: ALTER TABLE ...
asked by mjeday
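For context, the properties that govern 6-month time travel are typically the two below; a sketch with a hypothetical table name (`spark` is the active session):

```python
# Keep both the commit log and the deleted data files for roughly six months
# so that time travel up to 6 months back remains possible.
spark.sql("""
    ALTER TABLE my_table SET TBLPROPERTIES (
        'delta.logRetentionDuration'         = 'interval 180 days',
        'delta.deletedFileRetentionDuration' = 'interval 180 days'
    )
""")
```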
0 votes · 0 answers · 149 views

We have a Delta table in Databricks and de-duplicate the rows with dropDuplicates. We merge data into this table in batches and use .whenMatchedUpdateAll() .whenNotMatchedInsertAll()...
asked by Rockstar5645
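A sketch of the pattern being described, with a hypothetical `id` key and batch DataFrame; de-duplicating the batch on the merge key guarantees each target row matches at most one source row, but it does not remove duplicates that already exist in the target:

```python
from delta.tables import DeltaTable

# `spark`, the batch contents, the key column `id`, and the path are assumptions.
batch_df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "value"])

# De-duplicate on the merge key before merging.
deduped = batch_df.dropDuplicates(["id"])

target = DeltaTable.forPath(spark, "/mnt/lake/events")
(
    target.alias("t")
    .merge(deduped.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```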
1 vote · 0 answers · 82 views

In my file I have: { "Car": { "Model": null, "Color": null, } } I use read_delta to read the file: df = df.read_delta(path). At the end, I have an empty df. ...
asked by ninja_minida
0 votes · 0 answers · 68 views

Environment: Python 3.9.21, DuckDB 1.1.3, pyarrow 18.1.0, deltalake 18.1.0. Behavior explanation: adding and updating string fields in a struct inside a list under the root of the table works fine. Updating ...
asked by Ahmed Kamal ELSaman
0 votes · 0 answers · 60 views

Currently, I am working with the Databricks platform. My work mostly involves building ETL pipelines (workflows), so I am familiar with reading input Delta tables, transforming the data, and writing ...
asked by ndycuong
1 vote · 0 answers · 126 views

I noticed that querying for the maximum value in a string timestamp column takes 30s with 30+GB of data scanned while querying an actual timestamp column takes 1s with 310MB scanned. Maybe these ...
asked by taksqth
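The difference is consistent with Delta's per-file min/max statistics being far more effective on a typed timestamp column than on a string one; a sketch of materializing a typed column, with hypothetical path and column names:

```python
from pyspark.sql import functions as F

# Hypothetical: `event_ts_str` stores timestamps as strings. Casting to a real
# timestamp (and ideally persisting it as its own column) lets file-level
# statistics prune most of the data when computing MAX.
df = spark.read.format("delta").load("/mnt/lake/events")
typed = df.withColumn("event_ts", F.to_timestamp("event_ts_str"))

typed.agg(F.max("event_ts")).show()
```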
1 vote · 1 answer · 110 views

Does auto compaction break existing z-ordered tables in Delta Lake?
asked by Ryan Byoun
0 votes · 2 answers · 229 views

I am trying to create a new connection from DBeaver to a Delta Lake Parquet file located on the HDFS filesystem, which I successfully created with a Spark/Hadoop/Scala/io.delta application. (...
asked by Rene
0 votes · 1 answer · 90 views

I have a delta table in a directory in a storage account and I am creating an external table in Azure Synapse using this query: IF NOT EXISTS (SELECT * FROM sys.external_file_formats WHERE name = '...
asked by Asfandyar Abbasi
2 votes · 1 answer · 211 views

I'm trying to implement the PySpark code below to read delta files saved in the data lake (delta_table) and join them with a data frame of updated records (novos_registros). #5. Build the matching ...
asked by Marcelo Herdy
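A sketch of that read-and-join step with a hypothetical key column `id`; the common alternative is to let a Delta merge do the matching instead of a manual join:

```python
# `spark` is the active session; the path, schema, and join key are hypothetical,
# and `novos_registros` stands in for the DataFrame of updated records.
novos_registros = spark.createDataFrame([(1, "novo valor")], ["id", "valor"])

delta_df = spark.read.format("delta").load("/mnt/lake/delta_table")

matching = delta_df.alias("atual").join(
    novos_registros.alias("novo"),
    on="id",
    how="inner",
)
```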
1 vote · 1 answer · 307 views

I am using the Python notebook in MS Fabric for some data transformations and trying to write a df to a delta table. I am expecting the following code to create a new table using the deltalake library: ...
asked by Hasi
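For context, creating a table from a plain Python notebook with the deltalake writer generally looks like the sketch below; the OneLake path is hypothetical and credentials are assumed to be resolved by the Fabric runtime:

```python
import pandas as pd
from deltalake import write_deltalake

# Hypothetical Lakehouse table path.
table_path = "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse/Tables/new_table"

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# The default mode ("error") creates the table and raises if it already exists.
write_deltalake(table_path, df)
```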
0 votes · 0 answers · 86 views

I'm using deltalake version 0.17.0. Here are the steps we do: read in the DeltaTable from an existing S3 location: dt = DeltaTable("s3://mylocation/"). Converted it to a pyarrow table: arrow_table =...
asked by Scooby
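The steps described presumably look something like the sketch below (S3 credentials are assumed to come from the environment):

```python
from deltalake import DeltaTable

# Location taken from the question; credentials/storage_options omitted.
dt = DeltaTable("s3://mylocation/")

# Materialize the current snapshot as an in-memory Arrow table.
arrow_table = dt.to_pyarrow_table()
print(arrow_table.num_rows)
```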
1 vote · 1 answer · 127 views

I want to do a merge on a subset of my delta table partitions to do incremental upserts to keep two tables in sync. I do not use a whenNotMatchedBySource statement to clean up stale rows in my target ...
asked by ExploitedRoutine
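A sketch of scoping a merge to a subset of partitions by putting the partition filter directly in the merge condition; the partition column, key, cutoff, and source data are hypothetical:

```python
from delta.tables import DeltaTable

# `spark` is the active session; the target is assumed partitioned by `event_date`.
updates_df = spark.createDataFrame([("2024-01-02", 1, "x")], ["event_date", "id", "value"])

target = DeltaTable.forPath(spark, "/mnt/lake/target")

(
    target.alias("t")
    .merge(
        updates_df.alias("s"),
        # The literal partition bound lets Delta skip untouched partitions entirely.
        "t.event_date >= '2024-01-01' AND t.event_date = s.event_date AND t.id = s.id",
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```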
1 vote · 1 answer · 206 views

I observe severe underutilization of CPU in my Databricks job run metrics, on average less than 50% - indicating that I do not parallelize enough tasks in the Spark workflow. I am especially ...
asked by Louis
0 votes · 2 answers · 421 views

I have been using the following code to determine the latest version of a table using the Databricks Time Travel feature for the past few years without any issues. I recently added a new row to the table that I have ...
asked by Patterson
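For reference, one common way to read the most recent commit of a table is via its history; a sketch with a hypothetical table name (`spark` is the active session):

```python
from delta.tables import DeltaTable

# history(1) returns only the latest commit of the hypothetical table `my_table`.
dt = DeltaTable.forName(spark, "my_table")
latest = dt.history(1).select("version", "timestamp").first()
print(latest["version"], latest["timestamp"])
```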
0 votes · 0 answers · 90 views

I executed VACUUM on a delta table on Jan 31 with retain 450 hours. After the vacuum, I can still access version 22, which falls outside the retention period. So why didn't the vacuum clean out that ...
asked by waliadee
0 votes · 1 answer · 95 views

I’m running a Spark SELECT query on a Delta Lake table partitioned by year, month, day, and hour, derived from a timestamp column. When I execute the query in Zeppelin, Spark is aware of the ...
asked by George Amgad
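A sketch of the filter shape that enables partition pruning on such a table: referencing the partition columns themselves rather than only the timestamp they were derived from (path and column names are hypothetical):

```python
# `spark` is the active session; the table is assumed partitioned by year/month/day/hour.
df = spark.read.format("delta").load("/mnt/lake/events")

# Filters on the literal partition columns can be pushed down and prune whole partitions.
pruned = df.where("year = 2024 AND month = 6 AND day = 1 AND hour BETWEEN 0 AND 5")
```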
2 votes · 1 answer · 464 views

I'm planning to use Polars with Delta Lake to manage large, mutable datasets on my laptop. I've encountered two issues. Dataset is not sorted after merge: when I use write_delta() in "merge"...
asked by Olibarer
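For reference, write_delta's merge mode returns a deltalake TableMerger that still needs merge clauses and an execute() call; a sketch with a hypothetical local path and `id` key. Merge makes no ordering guarantee, so any required sort has to be reapplied when reading:

```python
import polars as pl

# Hypothetical key column and local table path; the target table must already exist.
df = pl.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

(
    df.write_delta(
        "./my_table",
        mode="merge",
        delta_merge_options={
            "predicate": "t.id = s.id",
            "source_alias": "s",
            "target_alias": "t",
        },
    )
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute()
)
```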
1 vote · 1 answer · 86 views

Is it possible to directly export a Snowpark DataFrame to Databricks, or must the DataFrame first be exported to an external cloud storage (e.g. S3 as parquet) before Databricks can access it?
asked by gobigorgobald
