41,016 questions
1 vote · 0 answers · 74 views
Out of memory for a smaller dataset
I have a PySpark job reading an input volume of just ~50-55 GB of Parquet data from a Delta table. The job uses an n2-highmem-4 GCP VM and 1-15 workers with autoscaling. Each worker VM is of type n2-highmem-...
1 vote · 2 answers · 148 views
How can I do a join on each array element in a column and replace with something from the join table?
I have a table, base_df, with many columns, one of which is an array column:
Id | FruitNames | Col1 | Col2 | Col3 | ... | Col99
1 | ["apple", "banana", "orange"] | ... | ... | ... | ... | ...
2 | [...
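A minimal sketch of one common approach, assuming a lookup table named fruit_lookup with columns FruitName and ReplacementName (both hypothetical): explode the array, join against the lookup, then collect the replacements back into an array per Id.

from pyspark.sql import functions as F

# Explode the array so each fruit gets its own row, then join to the lookup.
exploded = base_df.select("Id", F.explode("FruitNames").alias("fruit"))
replaced = (
    exploded.join(fruit_lookup, exploded.fruit == fruit_lookup.FruitName, "left")
    .groupBy("Id")
    .agg(F.collect_list("ReplacementName").alias("FruitNames"))  # order not guaranteed
)
# Re-attach the remaining columns (Col1 ... Col99) from the original table.
result = base_df.drop("FruitNames").join(replaced, "Id", "left")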
0 votes · 1 answer · 114 views
Training a small XGBoost model on Databricks crashes with a memory issue, but a similar table works fine
I've hit a bug that has blocked me for a few days.
I have a Spark DataFrame with 66 columns and 100K rows, and I want to train an XGBoost model on the Databricks platform, but it always crashes.
I generated a similar ...
1 vote · 2 answers · 275 views
Delta Live Table append only new data from storage location
I have the following code, which runs as a DLT pipeline. It runs fine, except that it seems to load all of the data from the storage container (ADLS) every time. So as the data increases in ...
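For comparison, a minimal sketch of the incremental pattern DLT pipelines usually rely on: a streaming table fed by Auto Loader, so only newly arrived files are processed on each update. The table name, path, and format below are placeholders, not taken from the question.

import dlt

@dlt.table(name="raw_events")
def raw_events():
    # cloudFiles (Auto Loader) tracks which files were already ingested,
    # so each pipeline update reads only new files from the container.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load("abfss://container@account.dfs.core.windows.net/raw/")
    )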
0 votes · 0 answers · 46 views
Google Address validation API issue
I am currently using the Google Address Validation API in a PySpark (Databricks) pipeline to validate addresses from a table. Each row contains an address in a column called 'Address', and I send a ...
0 votes · 1 answer · 125 views
How to exit from a Glue job successfully
I am using the code below to exit from the Glue job:
sys.exit(0)
But the job is marked as failed in AWS Glue.
My use case is that when no file is found in the S3 path, the PySpark code should exit successfully.
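A minimal sketch of one commonly suggested pattern, under the assumption that letting the script reach job.commit() and end normally (rather than raising SystemExit via sys.exit) is what Glue records as a successful run; the bucket and prefix below are placeholders.

import sys
import boto3
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Hypothetical location; replace with the real S3 path being checked.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="incoming/")

if resp.get("KeyCount", 0) == 0:
    # No input files: commit and let the script finish normally
    # instead of calling sys.exit(0).
    job.commit()
else:
    # ... normal processing here ...
    job.commit()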
1 vote · 1 answer · 63 views
Error Running Spark Job from Django API using subprocess.Popen
I have created a Django project executable, and I need to run a Spark job from an API endpoint within this executable. I am using subprocess.Popen to execute the spark-submit command, but I am ...
-1 votes · 1 answer · 53 views
Transform a single PySpark row of input condition values into multiple rows of individual conditions
I have a PySpark DataFrame that contains a single row but multiple columns (in the context of a SQL WHERE clause). For example, a column start_date with the value >date("2025-01-01"), then a new column ...
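A minimal sketch of one way to unpivot such a row, assuming two hypothetical condition columns start_date and end_date that both hold string condition values; stack() turns each (name, value) pair into its own row.

# Each ('column name', value) pair becomes one output row.
conditions = df.selectExpr(
    "stack(2, 'start_date', start_date, 'end_date', end_date) as (column_name, condition)"
)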
1 vote · 0 answers · 63 views
Why does a subquery without matching column names still work in Spark SQL? [duplicate]
I have the following two datasets in Spark SQL:
person view:
person = spark.createDataFrame([
    (0, "Bill Chambers", 0, [100]),
    (1, "Matei Zaharia", 1, [500, 250, 100]),
    (2, "...
0 votes · 0 answers · 54 views
How to list all URLs accessed by AWS Glue when reading from BigQuery?
I'm encountering issues when reading data from a BigQuery table into Amazon S3 using an AWS Glue PySpark job. It functions properly under normal configuration, but when I attach a VPC connection and ...
-1 votes · 3 answers · 209 views
Slow performance and timeout when writing 15GB of data from ADLS to Azure SQL DB using Databricks
We have a daily ETL process where we write Parquet data (~15GB) stored in Azure Data Lake Storage (ADLS) into a table in Azure SQL Database. The target table is truncated and reloaded each day.
...
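A minimal sketch of the first tuning knobs usually tried for this kind of load: parallel JDBC connections, larger batches, and truncate-on-overwrite so the table is emptied rather than dropped. The connection details and numbers below are placeholders, not taken from the question.

jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;databaseName=mydb"  # hypothetical

(df.repartition(8)                    # one JDBC connection per partition
   .write
   .format("jdbc")
   .option("url", jdbc_url)
   .option("dbtable", "dbo.target_table")
   .option("user", sql_user)          # assumed to come from a secret scope
   .option("password", sql_password)
   .option("batchsize", 10000)        # fewer round trips per partition
   .option("truncate", "true")        # truncate instead of drop + recreate
   .mode("overwrite")
   .save())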
0 votes · 0 answers · 49 views
Disable printing info when running spark-sql
I'm running SQL commands with spark-sql. I have set rootLogger.level = off in the log4j2.properties file, but I'm still getting some info messages:
Spark Web UI available at http://computer:4040
Spark ...
0 votes · 0 answers · 42 views
java.io.EOFException (PySpark Py4JJavaError) always occurring when using a user-defined function
I'm doing data preprocessing on a CSV file of 1 million rows and hoping to shrink it down to 600,000 rows. However, I always run into trouble when applying a function to a column in the ...
2 votes · 1 answer · 806 views
I checked online and found that Python 3.13 doesn't have "typing" in it, so how do I bypass this to start pyspark?
PS C:\spark-3.4.4-bin-hadoop3\bin> pyspark
Python 3.13.3 (tags/v3.13.3:6280bb5, Apr 8 2025, 14:47:33) [MSC v.1943 64 bit (AMD64)] on win32
Type "help", "copyright", "...
1 vote · 0 answers · 131 views
Pyspark writing dataframe to oracle database table using JDBC
I am new to PySpark and need a few clarifications on writing a DataFrame to an Oracle database table using JDBC.
As part of the requirement I need to read the data from an Oracle table and perform ...
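A minimal sketch of a JDBC write to Oracle, assuming the usual thin-driver URL format; the host, service name, credentials, and table name are placeholders.

(df.write
   .format("jdbc")
   .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")
   .option("dbtable", "MYSCHEMA.TARGET_TABLE")
   .option("user", "my_user")
   .option("password", "my_password")
   .option("driver", "oracle.jdbc.OracleDriver")
   .mode("append")          # append keeps existing rows; overwrite replaces them
   .save())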
0 votes · 1 answer · 49 views
Error (Py4JJavaError) running pyspark notebook in VSC
As the title says, I am having trouble running a PySpark notebook in VS Code with Miniforge. What I currently have installed is:
VSC
Java 8 + Java SDK11
Spark 3.4.4 downloaded into c:/spark, and ...
2 votes · 3 answers · 118 views
Exponential run time behavior in simple pyspark operation
My task is simple: I have a binary file that needs to be split into 8-byte chunks, where the first 4 bytes contain data (to be decoded in a later step) and the second 4 bytes contain an int (a time offset in ms).
Some of ...
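A minimal sketch of one way to do the chunking, assuming the whole file arrives as a single binaryFile record and that the 4-byte int is little-endian (both assumptions); the path is a placeholder.

import struct
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw = spark.read.format("binaryFile").load("/data/input.bin")  # hypothetical path

def split_chunks(content):
    # Each 8-byte chunk: 4 bytes of payload, then a 4-byte int time offset (ms).
    for i in range(0, len(content) - 7, 8):
        payload = bytes(content[i:i + 4])
        (offset_ms,) = struct.unpack("<i", content[i + 4:i + 8])
        yield (payload, offset_ms)

chunks = (raw.select("content").rdd
             .flatMap(lambda row: split_chunks(row.content))
             .toDF(["payload", "offset_ms"]))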
0 votes · 0 answers · 56 views
Spark with availableNow trigger doesn't archive sources
I use Spark to read JSON files that appear in a folder every day with the path pattern yyyy/mm/dd and convert them into Iceberg format. Both the JSON and Iceberg folders are in an S3 bucket under different paths.
...
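For reference, a minimal sketch of the file-source archiving options together with the availableNow trigger; the schema, paths, and table name are placeholders, and this only restates the configuration the question is about rather than a fix.

stream = (
    spark.readStream.format("json")
    .schema(json_schema)                                  # hypothetical schema
    .option("cleanSource", "archive")                     # move processed files ...
    .option("sourceArchiveDir", "s3://bucket/archive/")   # ... into this path
    .load("s3://bucket/json/")
)

query = (
    stream.writeStream.format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3://bucket/checkpoints/json_to_iceberg/")
    .trigger(availableNow=True)
    .toTable("glue_catalog.db.events")
)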
2 votes · 1 answer · 653 views
Read incremental data from iceberg tables using Spark SQL
I am trying to read incremental data between two snapshots.
I have the last processed snapshot (my day-0 load), and below is my code snippet for reading the incremental data:
incremental_df = spark.read.format("...
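For reference, a minimal sketch of Iceberg's incremental read options (start-snapshot-id is exclusive, end-snapshot-id inclusive); the snapshot ID variables and table identifier are placeholders.

incremental_df = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", str(last_processed_snapshot_id))  # exclusive
    .option("end-snapshot-id", str(current_snapshot_id))           # inclusive
    .load("glue_catalog.db.my_table")                              # hypothetical identifier
)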
2 votes · 0 answers · 103 views
What could be the causes of cache() changing the number of rows of a dataframe?
The code I have is of the form
def create_df(df1, df2):
    df3 = df1.cache().select(...).join(df2.cache(), on=..., how='full')
    return df3  # count() is 1

df4 = create_df(df1, df2)  # count() is ...
0 votes · 0 answers · 57 views
Unable to configure the exact number of DPUs for the Glue PySpark job
I have 20 million records, which comprise around 1.5 to 10 GB, as per the information I received. I can't access the source system to get the exact size of this table. I am just reading it from the ...
1 vote · 1 answer · 154 views
Querying Iceberg AWS table
I am trying to run a PySpark Glue job which queries an Iceberg database stored on AWS. Here are the configurations I am using:
conf = (
    SparkConf()
    .set("spark.hadoop.fs.s3a....
0 votes · 0 answers · 57 views
Pyspark not inserting data into BigQuery
I have PySpark code which reads data from some BigQuery external tables and inserts it into BigQuery native tables.
Using the indirect mode of insertion, the flow should be as follows:
PySpark reads from ...
0 votes · 1 answer · 108 views
Pyspark memory issue when joining and filtering two dataframes
I am trying to join two dataframes in PySpark.
One dataframe, df_dsp_f, contains approx. 100,000 records. Another dataframe, df_slv_f, contains approx. 99,800 records.
I am using Databricks and a serverless ...
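A minimal sketch of the pattern usually tried first at this scale, assuming a join key column named id and a filter column named some_flag (both hypothetical); with roughly 100K rows on each side, broadcasting the smaller side avoids a large shuffle.

from pyspark.sql import functions as F

joined = df_dsp_f.join(F.broadcast(df_slv_f), on="id", how="inner")
filtered = joined.filter(F.col("some_flag") == 1)   # hypothetical filter condition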
0 votes · 0 answers · 61 views
FileNotFound Exception occurs when pyspark write after persist().count()
I encountered a java.io.FileNotFoundException in an AWS EMR batch job.
My code processes data as below:
updateDF = spark.read.load(paths, ..)
userIDs = uniqueList(matchedUserIDs)
nones = [None for _ in range(...