41,016 questions
1 vote · 0 answers · 74 views
Out of memory for a smaller dataset
I have a PySpark job reading an input volume of just ~50-55 GB of Parquet data from a Delta table. The job uses an n2-highmem-4 GCP VM and 1-15 workers with autoscaling. Each worker VM is of type n2-highmem-...
1 vote · 2 answers · 148 views
How can I do a join on each array element in a column and replace with something from the join table?
I have a table, base_df, with many columns, one of which is an array column:
Id | FruitNames | Col1 | Col2 | Col3 | ... | Col99
1 | ["apple", "banana", "orange"] | ... | ... | ... | ... | ...
2 | [...
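A minimal sketch of one common approach, assuming a lookup table named fruit_lookup with columns FruitName and ReplacementName (both hypothetical): explode the array, join against the lookup, then collect the replacements back into an array per Id.

from pyspark.sql import functions as F

# Explode the array so each fruit gets its own row, then join to the lookup.
exploded = base_df.select("Id", F.explode("FruitNames").alias("fruit"))
replaced = (
    exploded.join(fruit_lookup, exploded.fruit == fruit_lookup.FruitName, "left")
    .groupBy("Id")
    .agg(F.collect_list("ReplacementName").alias("FruitNames"))  # order not guaranteed
)
# Re-attach the remaining columns (Col1 ... Col99) from the original table.
result = base_df.drop("FruitNames").join(replaced, "Id", "left")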
0 votes · 1 answer · 114 views
Training a small XGBoost model on Databricks crashes with a memory issue, but a similar table works fine
I've hit a bug that has blocked me for a few days.
I have a Spark DataFrame with 66 columns and 100K rows, and I want to train an XGBoost model on the Databricks platform, but it always crashes.
I generated a similar ...
1 vote · 2 answers · 275 views
Delta Live Table append only new data from storage location
I have the following code, which runs as a DLT pipeline. It runs fine, except that it seems to load all of the data from the storage container (ADLS) every time. So as the data increases in ...
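For comparison, a minimal sketch of the incremental pattern DLT pipelines usually rely on: a streaming table fed by Auto Loader, so only newly arrived files are processed on each update. The table name, path, and format below are placeholders, not taken from the question.

import dlt

@dlt.table(name="raw_events")
def raw_events():
    # cloudFiles (Auto Loader) tracks which files were already ingested,
    # so each pipeline update reads only new files from the container.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load("abfss://container@account.dfs.core.windows.net/raw/")
    )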
0 votes · 0 answers · 46 views
Google Address validation API issue
I am currently using the Google Address Validation API in a PySpark (Databricks) pipeline to validate addresses from a table. Each row contains an address in a column called 'Address', and I send a ...
0 votes · 1 answer · 125 views
How to exit from a Glue job successfully
I am using the code below to exit from the Glue job:
sys.exit(0)
But the job is marked as failed in AWS Glue.
My use case is that when no file is found in the S3 path, the PySpark code should exit successfully.
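A minimal sketch of one commonly suggested pattern, under the assumption that letting the script reach job.commit() and end normally (rather than raising SystemExit via sys.exit) is what Glue records as a successful run; the bucket and prefix below are placeholders.

import sys
import boto3
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Hypothetical location; replace with the real S3 path being checked.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="incoming/")

if resp.get("KeyCount", 0) == 0:
    # No input files: commit and let the script finish normally
    # instead of calling sys.exit(0).
    job.commit()
else:
    # ... normal processing here ...
    job.commit()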
1 vote · 1 answer · 63 views
Error Running Spark Job from Django API using subprocess.Popen
I have created a Django project executable, and I need to run a Spark job from an API endpoint within this executable. I am using subprocess.Popen to execute the spark-submit command, but I am ...
-1 votes · 1 answer · 53 views
Transform a single PySpark row of input condition values into multiple rows of individual conditions
I have a PySpark DataFrame that contains a single row but multiple columns (in the context of a SQL WHERE clause). For example, a column start_date with the value >date("2025-01-01"), then a new column ...
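A minimal sketch of one way to unpivot such a row, assuming two hypothetical condition columns start_date and end_date that both hold string condition values; stack() turns each (name, value) pair into its own row.

# Each ('column name', value) pair becomes one output row.
conditions = df.selectExpr(
    "stack(2, 'start_date', start_date, 'end_date', end_date) as (column_name, condition)"
)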
1 vote · 0 answers · 63 views
Why does a subquery without matching column names still work in Spark SQL? [duplicate]
I have the following two datasets in Spark SQL:
person view:
person = spark.createDataFrame([
    (0, "Bill Chambers", 0, [100]),
    (1, "Matei Zaharia", 1, [500, 250, 100]),
    (2, "...
0 votes · 0 answers · 54 views
How to list all URLs accessed by AWS Glue when reading from BigQuery?
I'm encountering issues when reading data from a BigQuery table into Amazon S3 using an AWS Glue PySpark job. It functions properly under normal configuration, but when I attach a VPC connection and ...
-1 votes · 3 answers · 209 views
Slow performance and timeout when writing 15GB of data from ADLS to Azure SQL DB using Databricks
We have a daily ETL process where we write Parquet data (~15GB) stored in Azure Data Lake Storage (ADLS) into a table in Azure SQL Database. The target table is truncated and reloaded each day.
...
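A minimal sketch of the first tuning knobs usually tried for this kind of load: parallel JDBC connections, larger batches, and truncate-on-overwrite so the table is emptied rather than dropped. The connection details and numbers below are placeholders, not taken from the question.

jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;databaseName=mydb"  # hypothetical

(df.repartition(8)                    # one JDBC connection per partition
   .write
   .format("jdbc")
   .option("url", jdbc_url)
   .option("dbtable", "dbo.target_table")
   .option("user", sql_user)          # assumed to come from a secret scope
   .option("password", sql_password)
   .option("batchsize", 10000)        # fewer round trips per partition
   .option("truncate", "true")        # truncate instead of drop + recreate
   .mode("overwrite")
   .save())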
0 votes · 0 answers · 49 views
Disable printing info when running spark-sql
I'm running SQL commands with spark-sql. I have set rootLogger.level = off in the log4j2.properties file, but I'm still getting some info messages:
Spark Web UI available at http://computer:4040
Spark ...
0 votes · 0 answers · 42 views
java.io.EOFException (PySpark Py4JJavaError) always occurring when using a user-defined function
I'm doing data preprocessing on a CSV file of 1 million rows and hoping to shrink it down to 600,000 rows. However, I always run into trouble when applying a function to a column in the ...
2 votes · 1 answer · 806 views
I checked online and found that Python 3.13 doesn't have "typing" in it, so how do I bypass this to start pyspark?
PS C:\spark-3.4.4-bin-hadoop3\bin> pyspark
Python 3.13.3 (tags/v3.13.3:6280bb5, Apr 8 2025, 14:47:33) [MSC v.1943 64 bit (AMD64)] on win32
Type "help", "copyright", "...
1 vote · 0 answers · 131 views
Pyspark writing dataframe to oracle database table using JDBC
I am new to PySpark and need a few clarifications on writing a DataFrame to an Oracle database table using JDBC.
As part of the requirement I need to read the data from an Oracle table and perform ...
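A minimal sketch of a JDBC write to Oracle, assuming the usual thin-driver URL format; the host, service name, credentials, and table name are placeholders.

(df.write
   .format("jdbc")
   .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")
   .option("dbtable", "MYSCHEMA.TARGET_TABLE")
   .option("user", "my_user")
   .option("password", "my_password")
   .option("driver", "oracle.jdbc.OracleDriver")
   .mode("append")          # append keeps existing rows; overwrite replaces them
   .save())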
0 votes · 1 answer · 49 views
Error (Py4JJavaError) running pyspark notebook in VSC
As the title says, I am having trouble running a PySpark notebook in VS Code with Miniforge. What I currently have installed is:
VSC
Java 8 + Java SDK11
Spark 3.4.4 downloaded into c:/spark, and ...
2 votes · 3 answers · 118 views
Exponential run time behavior in simple pyspark operation
My task is simple: I have a binary file that needs to be split into 8-byte chunks, where the first 4 bytes contain data (to be decoded in a later step) and the second 4 bytes contain an int (a time offset in ms).
Some of ...
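A minimal sketch of one way to do the chunking, assuming the whole file arrives as a single binaryFile record and that the 4-byte int is little-endian (both assumptions); the path is a placeholder.

import struct
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw = spark.read.format("binaryFile").load("/data/input.bin")  # hypothetical path

def split_chunks(content):
    # Each 8-byte chunk: 4 bytes of payload, then a 4-byte int time offset (ms).
    for i in range(0, len(content) - 7, 8):
        payload = bytes(content[i:i + 4])
        (offset_ms,) = struct.unpack("<i", content[i + 4:i + 8])
        yield (payload, offset_ms)

chunks = (raw.select("content").rdd
             .flatMap(lambda row: split_chunks(row.content))
             .toDF(["payload", "offset_ms"]))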
0 votes · 0 answers · 56 views
Spark with availableNow trigger doesn't archive sources
I use Spark to read JSON files that appear in a folder every day with the path pattern yyyy/mm/dd and convert them into Iceberg format. Both the JSON and Iceberg folders are in an S3 bucket under different paths.
...
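For reference, a minimal sketch of the file-source archiving options together with the availableNow trigger; the schema, paths, and table name are placeholders, and this only restates the configuration the question is about rather than a fix.

stream = (
    spark.readStream.format("json")
    .schema(json_schema)                                  # hypothetical schema
    .option("cleanSource", "archive")                     # move processed files ...
    .option("sourceArchiveDir", "s3://bucket/archive/")   # ... into this path
    .load("s3://bucket/json/")
)

query = (
    stream.writeStream.format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3://bucket/checkpoints/json_to_iceberg/")
    .trigger(availableNow=True)
    .toTable("glue_catalog.db.events")
)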
2 votes · 1 answer · 653 views
Read incremental data from iceberg tables using Spark SQL
I am trying to read incremental data between two snapshots.
I have the last processed snapshot (my day-0 load), and below is my code snippet for reading the incremental data:
incremental_df = spark.read.format("...
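For reference, a minimal sketch of Iceberg's incremental read options (start-snapshot-id is exclusive, end-snapshot-id inclusive); the snapshot ID variables and table identifier are placeholders.

incremental_df = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", str(last_processed_snapshot_id))  # exclusive
    .option("end-snapshot-id", str(current_snapshot_id))           # inclusive
    .load("glue_catalog.db.my_table")                              # hypothetical identifier
)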
2 votes · 0 answers · 103 views
What could be the causes of cache() changing the number of rows of a dataframe?
The code I have is of the form
def create_df(df1, df2):
    df3 = df1.cache().select(...).join(df2.cache(), on=..., how='full')
    return df3  # count() is 1

df4 = create_df(df1, df2)  # count() is ...
0 votes · 0 answers · 57 views
Unable to configure the exact number of DPUs for the Glue PySpark job
I have 20 million records, which comprise around 1.5 to 10 GB, as per the information I received. I can't access the source system to get the exact size of this table. I am just reading it from the ...
1 vote · 1 answer · 154 views
Querying Iceberg AWS table
I am trying to run a PySpark Glue job which queries an Iceberg database stored on AWS. Here are the configurations I am using:
conf = (
    SparkConf()
    .set("spark.hadoop.fs.s3a....
0 votes · 0 answers · 57 views
Pyspark not inserting data into BigQuery
I have PySpark code which reads data from some BigQuery external tables and inserts it into BigQuery native tables.
Using the indirect mode of insertion, the flow should be as follows:
PySpark reads from ...
0 votes · 1 answer · 108 views
Pyspark memory issue when joining and filtering two dataframes
I am trying to join two dataframes in PySpark.
One dataframe, df_dsp_f, contains approx. 100,000 records. Another dataframe, df_slv_f, contains approx. 99,800 records.
I am using Databricks and a serverless ...
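A minimal sketch of the pattern usually tried first at this scale, assuming a join key column named id and a filter column named some_flag (both hypothetical); with roughly 100K rows on each side, broadcasting the smaller side avoids a large shuffle.

from pyspark.sql import functions as F

joined = df_dsp_f.join(F.broadcast(df_slv_f), on="id", how="inner")
filtered = joined.filter(F.col("some_flag") == 1)   # hypothetical filter condition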
0 votes · 0 answers · 61 views
FileNotFound Exception occurs when pyspark write after persist().count()
I encountered a java.io.FileNotFoundException in an AWS EMR batch job.
My code processes data as below:
updateDF = spark.read.load(paths, ..)
userIDs = uniqueList(matchedUserIDs)
nones = [None for _ in range(...