1 vote
0 answers
74 views

I have a PySpark job reading ~50-55 GB of Parquet data from a Delta table. The job uses n2-highmem-4 GCP VMs with 1-15 workers under autoscaling. Each worker VM is of type n2-highmem-...
— user16798185
1 vote
2 answers
148 views

I have a table, base_df, with many columns, one of which is an array column:

    Id  FruitNames                      Col1  Col2  Col3  ...  Col99
    1   ["apple", "banana", "orange"]   ...   ...   ...   ...   ...
    2   [...
— wkeithvan (2,215)
0 votes
1 answer
114 views

I hit a bug that has blocked me for a few days. I have a Spark DataFrame with 66 columns and 100K rows, and I want to train an XGBoost model on the Databricks platform, but it always crashes. I generated a similar ...
— HappyCoding
1 vote
2 answers
275 views

I have the following code, which runs as a DLT pipeline. It is running fine, except that it seems to load all the data from the storage container (ADLS) every time. So as the data increases in ...
— Yuva (3,213)
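
For the DLT question above, a minimal sketch (assuming Databricks Auto Loader; the table name, file format, and ADLS path are placeholders, not the asker's values) of reading only newly arrived files instead of re-reading the whole container on every pipeline update:

    import dlt

    @dlt.table(name="bronze_events")  # hypothetical table name
    def bronze_events():
        # Auto Loader ("cloudFiles") checkpoints which files it has already ingested,
        # so each pipeline update only picks up files that arrived since the last run.
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")  # assumed source format
            .load("abfss://container@account.dfs.core.windows.net/events/")  # placeholder path
        )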
0 votes
0 answers
46 views

I am currently using the Google Address Validation API in a PySpark (Databricks) pipeline to validate addresses from a table. Each row contains an address in a column called 'Address', and I send a ...
— Ravi Dølly
0 votes
1 answer
125 views

I am using the code below to exit from the Glue job: sys.exit(0). But the job is marked as failed in AWS Glue. My use case is that when no file is found in the S3 path, the PySpark code should successfully ...
— Sudhanshu Prakash
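
For the Glue question above, a minimal sketch of one common pattern: check the S3 prefix first and finish the job cleanly when it is empty, rather than calling sys.exit(0). The bucket, prefix, and read path are placeholders, not the asker's values.

    import sys
    import boto3
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    s3 = boto3.client("s3")
    listing = s3.list_objects_v2(Bucket="my-bucket", Prefix="incoming/")  # placeholders

    if listing.get("KeyCount", 0) == 0:
        # No input files: log and let the script end normally, so the run
        # is recorded as succeeded instead of raising SystemExit.
        print("No files found in the S3 path; nothing to do.")
    else:
        df = glue_context.spark_session.read.parquet("s3://my-bucket/incoming/")
        # ... process and write df here ...
    job.commit()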
1 vote
1 answer
63 views

I have created a Django project executable, and I need to run a Spark job from an API endpoint within this executable. I am using subprocess.Popen to execute the spark-submit command, but I am ...
— Rudra Patel
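
For the Django question above, a minimal sketch of launching spark-submit with subprocess and surfacing failures back to the API caller; the master URL and script path are placeholders, not the asker's values.

    import subprocess

    def run_spark_job() -> str:
        cmd = [
            "spark-submit",
            "--master", "local[*]",   # placeholder master
            "/opt/jobs/my_job.py",    # placeholder script path
        ]
        # Capture stdout/stderr so the endpoint can report why a run failed.
        proc = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
        )
        output, _ = proc.communicate()
        if proc.returncode != 0:
            raise RuntimeError(f"spark-submit failed:\n{output}")
        return output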
-1 votes
1 answer
53 views

I have a PySpark DataFrame that contains a single row but multiple columns (in the context of a SQL WHERE clause). For example, a column start_date with value >date("2025-01-01"), then a new column ...
— ndycuong
1 vote
0 answers
63 views

I have the following two datasets in Spark SQL: person view: person = spark.createDataFrame([ (0, "Bill Chambers", 0, [100]), (1, "Matei Zaharia", 1, [500, 250, 100]), (2, "...
— DumbCoder (515)
0 votes
0 answers
54 views

I'm encountering issues when reading data from a BigQuery table into Amazon S3 using an AWS Glue PySpark job. It functions properly under normal configuration, but when I attach a VPC connection and ...
— Abhinav S J
-1 votes
3 answers
209 views

We have a daily ETL process where we write Parquet data (~15GB) stored in Azure Data Lake Storage (ADLS) into a table in Azure SQL Database. The target table is truncated and reloaded each day. ...
— Harish J (166)
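
For the truncate-and-reload question above, a minimal sketch of Spark's JDBC writer with the built-in truncate option, so the overwrite issues TRUNCATE TABLE instead of dropping and recreating the target; the path, connection string, credentials, and table name are placeholders.

    df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/daily/")  # placeholder path

    (df.write
       .format("jdbc")
       .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
       .option("dbtable", "dbo.target_table")
       .option("user", "sql_user")
       .option("password", "...")
       .option("truncate", "true")   # keep the table, truncate it on overwrite
       .option("batchsize", 10000)   # larger batches cut JDBC round trips
       .mode("overwrite")
       .save())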
0 votes
0 answers
49 views

I'm running SQL commands with spark-sql. I have put rootLogger.level = off in the log4j2.properties file, but I'm still getting some info messages: Spark Web UI available at http://computer:4040 Spark ...
— IGRACH (3,726)
0 votes
0 answers
42 views

I'm doing data preprocessing on a CSV file of 1 million rows and hoping to shrink it down to 600,000 rows. However, I always run into trouble when applying a function to a column in the ...
— Mig Rivera Cueva
2 votes
1 answer
806 views

PS C:\spark-3.4.4-bin-hadoop3\bin> pyspark Python 3.13.3 (tags/v3.13.3:6280bb5, Apr 8 2025, 14:47:33) [MSC v.1943 64 bit (AMD64)] on win32 Type "help", "copyright", "...
— digi store
1 vote
0 answers
131 views

I am new to PySpark and need a few clarifications on writing a DataFrame to an Oracle database table using JDBC. As part of the requirement, I need to read the data from an Oracle table and perform ...
— Siva (11)
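
For the Oracle question above, a minimal sketch of reading from Oracle over JDBC and writing the result back to another table; the host, service name, credentials, and table names are placeholders, and the transformation step is elided.

    jdbc_url = "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1"  # placeholder connection

    src_df = (spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", "SRC_SCHEMA.SOURCE_TABLE")
        .option("user", "scott")
        .option("password", "...")
        .option("driver", "oracle.jdbc.OracleDriver")
        .option("fetchsize", 10000)   # rows fetched per round trip
        .load())

    result_df = src_df  # ... transformations would go here ...

    (result_df.write.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", "TGT_SCHEMA.TARGET_TABLE")
        .option("user", "scott")
        .option("password", "...")
        .option("driver", "oracle.jdbc.OracleDriver")
        .mode("append")
        .save())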
0 votes
1 answer
49 views

As the title says, I am having trouble running code in VS Code with miniforge, in a PySpark notebook. What I currently have installed is: VSC, Java 8 + Java SDK 11, Spark 3.4.4 downloaded into c:/spark, and ...
— lecarusin
2 votes
3 answers
118 views

My task is simple: I have a binary file that needs to be split into 8-byte chunks, where the first 4 bytes contain data (to be decoded in a later step) and the second 4 bytes contain an int (a time offset in ms). Some of ...
— dermoritz (13.1k)
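
For the binary-file question above, a minimal sketch in plain Python of the chunking described; the file path and the byte order of the 4-byte int are assumptions. If the file were read through Spark's binaryFile source instead, the same slicing would happen inside a UDF or flatMap.

    import struct

    records = []
    with open("data.bin", "rb") as f:   # placeholder path
        while True:
            chunk = f.read(8)
            if len(chunk) < 8:          # stop at EOF or a trailing partial chunk
                break
            payload = chunk[:4]         # raw data, to be decoded in a later step
            (offset_ms,) = struct.unpack(">i", chunk[4:])  # assumed big-endian signed int
            records.append((payload, offset_ms))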
0 votes
0 answers
56 views

I use Spark to read JSON files that appear in a folder every day with the path pattern yyyy/mm/dd and convert them into Iceberg format. Both the JSON and Iceberg folders are in an S3 bucket, on different paths. ...
— Alex (1,019)
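
For the JSON-to-Iceberg question above, a minimal sketch of loading one day's folder and appending it to an existing Iceberg table with the DataFrameWriterV2 API; the bucket, catalog, table name, and date are placeholders, and a SparkSession with an Iceberg catalog configured is assumed.

    # Placeholder date partition matching the yyyy/mm/dd layout described above.
    day = "2025/01/15"
    json_df = spark.read.json(f"s3://my-bucket/json/{day}/")

    # Append into an Iceberg table registered in the configured catalog.
    json_df.writeTo("my_catalog.db.events_iceberg").append()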
2 votes
1 answer
653 views

I am trying to read incremental data between two snapshots. I have the last processed snapshot (my day-0 load), and below is my code snippet to read the incremental data: incremental_df = spark.read.format("...
— Abhi5421
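
For the snapshot question above, a minimal sketch of Iceberg's incremental read options (start-snapshot-id is exclusive, end-snapshot-id is inclusive); the table identifier and snapshot ids are placeholders and a configured SparkSession is assumed.

    incremental_df = (
        spark.read.format("iceberg")
        .option("start-snapshot-id", "1111111111111111111")  # last processed snapshot (the day-0 load)
        .option("end-snapshot-id", "2222222222222222222")    # optional upper bound
        .load("my_catalog.db.my_table")
    )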
2 votes
0 answers
103 views

The code I have is of the form:

    def create_df(df1, df2):
        df3 = df1.cache().select(...).join(df2.cache(), on=..., how='full')
        return df3  # count() is 1

    df4 = create_df(df1, df2)  # count() is ...
— PHPirate (7,681)
0 votes
0 answers
57 views

I have 20 million records, which comprise around 1.5 to 10 GB, as per the information I received. I can't access the source system to get the exact size of this table. I am just reading it from the ...
— RushHour (641)
1 vote
1 answer
154 views

I am trying to run a pyspark Glue Job which queries an Iceberg database that is stored on AWS Cloud. Here are the configurations I am using: conf = ( SparkConf() .set("spark.hadoop.fs.s3a....
— Vladyslav Chornyi
0 votes
0 answers
57 views

I have PySpark code that reads data from some BigQuery external tables and inserts it into BigQuery native tables. Using the indirect mode of insertion, the flow should be as follows: PySpark reads from ...
— Hexark (413)
0 votes
1 answer
108 views

I am trying to join two DataFrames in PySpark. One DataFrame, df_dsp_f, contains approx. 100,000 records. Another DataFrame, df_slv_f, contains approx. 99,800 records. I am using Databricks and a serverless ...
— pythondumb (1,259)
0 votes
0 answers
61 views

I encountered a java.io.FileNotFoundException in an AWS EMR batch job. My code processes data as below: updateDF = spark.read.load(paths, ..) userIDs = uniqueList(matchedUserIDs) nones = [None for _ in range(...
— Jisu Choi
