8,466 questions
0 votes · 0 answers · 18 views
How to stream from a merged (apply_changes) table into a downstream silver layer as a stream rather than a Materialized View (MV)?
The Architecture: I am implementing a Delta Live Tables (DLT) pipeline following the Medallion architecture.
Landing: Auto Loader ingesting raw files (JSON/CSV).
Bronze Layer: Uses dlt.apply_changes() ...
0 votes · 1 answer · 40 views
Update Python Library in Databricks Cluster
I inherited a custom Python library and a Databricks instance that I haven't had to do much with, but this last week I had to make changes to a function in the codebase.
I thought Databricks was ...
0 votes · 1 answer · 47 views
No module named 'pyspark.sql.metrics' when working with pickle or joblib on Databricks
I read data from Databricks
import pandas as pd
import joblib
query = 'select * from table a'
df = spark.sql(query)
df = df.toPandas()
df.to_pickle('df.pickle')
joblib.dump(df, 'df.joblib')
...
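The error above typically surfaces when the pickled object still references pyspark internals that are missing where the file is later loaded. One commonly suggested direction (an assumption, not stated in the question) is to serialize plain built-in structures instead; a minimal stdlib sketch with hypothetical rows standing in for the `toPandas()` output:

```python
import pickle

# Hypothetical rows standing in for df.toPandas() output; the point is to
# serialize plain built-in structures, which carry no pyspark references.
rows = [{"id": 1, "amount": 9.99}, {"id": 2, "amount": 4.5}]

blob = pickle.dumps(rows)       # no pyspark internals end up in the bytes
restored = pickle.loads(blob)
print(restored == rows)         # True
```

The same bytes can then be unpickled in any environment, with or without pyspark installed.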
0 votes · 0 answers · 60 views
DBT unit tests fail when a struct has too many fields
I am running DBT models on Databricks and I am starting to implement unit tests for them.
I have the following DBT unit test:
unit_tests:
  - name: test_my_model
    model: my_model
    given:
      ...
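dbt unit tests also accept SQL-format fixtures, which is the route usually suggested for complex types such as structs, since each fixture row is just a `SELECT`. A hedged sketch (model and column names are invented; `named_struct` assumes Databricks SQL):

```yaml
unit_tests:
  - name: test_my_model
    model: my_model
    given:
      - input: ref('my_source_model')
        format: sql
        rows: |
          select 1 as id, named_struct('a', 1, 'b', 2) as my_struct
    expect:
      format: sql
      rows: |
        select 1 as id, named_struct('a', 1, 'b', 2) as my_struct
```

This avoids spelling every struct field out as dict-format fixture values.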
0 votes · 2 answers · 66 views
How to write parquet file to Databricks Volume?
I'd like to export data from tables within my Databricks Unity Catalog. I'd like to transform each of the tables into a single parquet file which I can download. I thought I could just write a table to a ...
Advice · 1 vote · 1 reply · 45 views
Job Task Conditions - Only Run on 2nd Working Day of the Month
I need some advice. I have a job which runs every day, and I'm looking to have a particular task run on the second working day of the month. I know I can solve this by setting up another job ...
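A task condition like this can be driven by a small helper that computes the second weekday of the month; the sketch below ignores public holidays (a real schedule would need a holiday calendar), so it is a starting point rather than a complete solution:

```python
from datetime import date, timedelta

def second_working_day(year: int, month: int) -> date:
    """Return the second Monday-to-Friday day of the month (holidays ignored)."""
    d = date(year, month, 1)
    working_days = 0
    while True:
        if d.weekday() < 5:  # 0-4 are Mon-Fri
            working_days += 1
            if working_days == 2:
                return d
        d += timedelta(days=1)

# A job task could then compare today's date against this value:
print(second_working_day(2024, 6))  # 2024-06-04 (June 1st is a Saturday)
```

The comparison result can feed an If/else job task condition so the rest of the job still runs daily.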
0 votes · 1 answer · 61 views
Location of spark.scheduler.allocation.file in Databricks workspace
When using Databricks runtime 16.4, I am trying to set spark.scheduler.allocation.file to a location in a workspace.
config("spark.scheduler.allocation.file","file:/Workspace/init/...
0 votes · 1 answer · 57 views
Union Two Datasets Causes Records to Unexpectedly Filter
NOTE: I am running this query on Azure Databricks in a serverless Notebook.
I have two tables with identical schema: foo and bar. They have the same number of columns, with the same names, in the same ...
0 votes · 1 answer · 47 views
How do I find the file size for my Delta tables in Databricks? I want to be able to expand it to multiple tables
I would like to know the total size of a table, as well as the file sizes of the files that comprise it.
Using DESCRIBE DETAIL works (DESCRIBE DETAIL table1), but using the information as a table doesn'...
1 vote · 1 answer · 247 views
Unable to import pyspark.pipelines module
What could be a cause of the following error of my code in a Databricks notebook, and how can we fix the error?
ImportError: cannot import name 'pipelines' from 'pyspark' (/databricks/python/lib/...
0 votes · 0 answers · 35 views
How to link Spark event log stages to PySpark code or query?
I'm analyzing Spark event logs and have already retrieved the SparkListenerStageSubmitted and SparkListenerTaskEnd events to collect metrics such as spill, skew ratio, memory, and CPU usage.
However, ...
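In the event-log JSON, a `SparkListenerStageSubmitted` event carries a `Stage Info` object whose `Details` field holds the call-site stack trace, which is usually the link back to the driver code; field names follow Spark's JsonProtocol and should be verified against your Spark version. A sketch over a hand-made sample line (all sample values are invented):

```python
import json

# Invented sample event-log line; real logs contain one JSON object per line.
line = json.dumps({
    "Event": "SparkListenerStageSubmitted",
    "Stage Info": {
        "Stage ID": 3,
        "Stage Name": "count at NativeMethodAccessorImpl.java:0",
        "Details": "org.apache.spark.sql.Dataset.count(Dataset.scala:3625)\n...",
    },
})

event = json.loads(line)
if event.get("Event") == "SparkListenerStageSubmitted":
    info = event["Stage Info"]
    # The first stack frame usually points at the triggering API call.
    print(info["Stage ID"], "->", info["Details"].splitlines()[0])
```

For SQL queries, stages can additionally be tied to a query via the `spark.sql.execution.id` property on the submitting job, though that linkage is not shown here.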
2 votes · 2 answers · 102 views
Union of tiny dataframes exhausts resource on Databricks
As part of a function I create df1 and df2 and aim to stack them and output the results. But the results do not display within the function, nor if I output the results and display after.
results = ...
1 vote · 1 answer · 90 views
Retrieve schema name created with Databricks Asset Bundles
I've created a schema in DAB with this code in my yml file.
resources:
  schemas:
    my_schema:
      name: my_schema_name
      catalog_name: my_catalog
The schema is created ...
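Other parts of a bundle can usually reference a resource's fields through variable interpolation; a hedged sketch of what that could look like (the job, task, and notebook path are invented, and the `${resources.…}` reference syntax should be checked against the bundles reference for your CLI version):

```yaml
resources:
  schemas:
    my_schema:
      name: my_schema_name
      catalog_name: my_catalog

  jobs:
    my_job:
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ../src/main.ipynb
            base_parameters:
              schema: ${resources.schemas.my_schema.name}
```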
0 votes · 1 answer · 78 views
How can I change my Spark session in Databricks Community Edition?
I want to change my spark session from 'pyspark.sql.connect.dataframe.DataFrame' to 'pyspark.sql.dataframe.DataFrame' so that I can run StringIndexer and VectorAssembler.
If I run it in pyspark.sql....
0 votes · 0 answers · 33 views
Databricks group cluster fails to read CSV (TextFileFormatEdge$.disabled) while personal cluster works
I have a PySpark function that reads a reference CSV file inside a larger ETL pipeline.
On my personal Databricks cluster, this works fine. On the group cluster, it returns an empty dataframe; the same ...
0 votes · 0 answers · 75 views
Can we get access history for an Inference Service in Snowflake?
I’m currently exploring Inference Services in Snowflake and wanted to check if there’s an equivalent to the Event History column in Databricks.
So far, the closest I’ve found in Snowflake are service ...
0 votes · 1 answer · 62 views
Pydantic model inserts None values in Databricks Delta table as string type instead of null type
I have the below pydantic model with 6 columns out of which 2 columns are nullable.
from pydantic import BaseModel
from typing import Optional
class Purchases(BaseModel):
    customer_id: int
...
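With symptoms like this, the usual culprit is stringifying fields somewhere before the insert: `str(None)` produces the literal four-character text "None", while a JSON/SQL-null path keeps a real null. A stdlib sketch of the distinction (the field names are hypothetical):

```python
import json

record = {"customer_id": 1, "coupon_code": None}

# Stringifying a None yields the text "None" -- this would land in a table
# column as a string, not as a SQL NULL.
print(str(record["coupon_code"]))   # None  (a 4-character string)

# Serializing with json keeps it a real null.
print(json.dumps(record))           # {"customer_id": 1, "coupon_code": null}
```

Checking whether the insert path casts the model's values to `str` is a reasonable first step.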
0 votes · 0 answers · 30 views
No longer able to create a Delta Table in ADLS Gen 2. Error: The protocol of your Delta table could not be recovered while Reconstructing version: 0
I have been successfully creating DeltaTables in ADLS Gen2 for a number of years without any issues. Today, I deleted the deltaTable for a table I copied into ADLS Gen 2 with ADF and the associated ...
1 vote · 0 answers · 50 views
Databricks - LOCATION_OVERLAP Error with AutoLoader pipeline ingesting from external location
I am trying to use pipelines in Databricks to ingest data from an external location to the datalake using AutoLoader, and I am facing this issue. I have noticed other posts with similar errors, but in ...
0 votes · 1 answer · 58 views
Databricks: multiple outputs from a cell
Normally I run this code at the top of notebooks to allow printing of multiple outputs from a cell (without having to use print statements).
from IPython.core.interactiveshell import InteractiveShell
...
0 votes · 1 answer · 105 views
DELTA_INSERT_COLUMN_ARITY_MISMATCH error while using pyodbc and Databricks
I'm getting a [DELTA_INSERT_COLUMN_ARITY_MISMATCH] error while trying to insert into Databricks using pyodbc. If I run this query, everything works fine in both Databricks and Python
'INSERT INTO ...
0 votes · 0 answers · 69 views
Unable to Create an Azure SAS Token to Be Used with Databricks to Connect to Azure ADLS Gen 2
I am trying to establish a connection to our Azure Data Lake Gen2 using a SAS Token.
I have created the following SAS token
spark.conf.set("fs.azure.account.auth.type.adlsprexxxxx.dfs.core....
0 votes · 0 answers · 129 views
Which version of a source delta table is currently being processed by Spark Structured Streaming?
I want to know/monitor which version of the delta table is currently being processed, especially when the stream is started with a startingVersion.
My understanding is when that option is chosen, the ...
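One place this information surfaces is the streaming query's `lastProgress`: the Delta source reports its offsets there, and the end-offset payload includes a `reservoirVersion` field holding the table version being read. A sketch over a hand-made progress dict (the values are invented, and the field names should be verified on your runtime):

```python
# Invented lastProgress payload; the shape follows what Delta's streaming
# source reports, but confirm the field names on your Databricks runtime.
progress = {
    "sources": [{
        "description": "DeltaSource[dbfs:/delta/events]",
        "endOffset": {"reservoirVersion": 42, "index": 7},
    }]
}

end = progress["sources"][0]["endOffset"]
print("processing table version:", end["reservoirVersion"])
```

In a live job the same dict would come from `streaming_query.lastProgress` after each micro-batch.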
0 votes · 1 answer · 112 views
How to set proxy to create WorkspaceClient in Databricks using Java SDK
I am working on Azure Databricks Test Automation using Java. There are a number of Jobs and pipelines that are created in Azure Databricks to process data. I want to create WorkspaceClient for them ...
1 vote · 1 answer · 128 views
How do I load joblib file on spark?
I have the following code. It reads a pre-existing file for an ML model. I am trying to run it on Databricks on multiple cases
import numpy as np
import joblib
class WeightedEnsembleRegressor:
"""...
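A frequent pitfall when loading joblib/pickle files on Spark is that the serialized file stores only a reference to the class (module and name), not its code, so the class must be importable under the same module path wherever deserialization happens, including on executors. A stdlib sketch of the mechanism (the class body here is a stand-in, not the asker's real model):

```python
import pickle

class WeightedEnsembleRegressor:
    """Stand-in for the model class; the real class must be importable
    under the same module path wherever the file is deserialized."""
    def __init__(self, weights):
        self.weights = weights

model = WeightedEnsembleRegressor([0.4, 0.6])
blob = pickle.dumps(model)       # stores module + class name, not the code
restored = pickle.loads(blob)
print(restored.weights)
```

On a cluster this usually means packaging the class in an installed wheel or an importable module, rather than defining it only in the notebook that created the file.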