Newest 'apache-spark-sql' Questions

Best practices

0 votes

5 replies

78 views

Pushing down filters in RDBMS with Java Spark

I have been working as a Data Engineer and got this issue. I came across a use case where I have a view(lets name it as inputView) which is created by reading data from some source. Now somewhere ...

Parth Sarthi Roy

1

asked Nov 14 at 6:13

0 votes

0 answers

83 views

How to Check if a Query Touches Data Files or just Uses Manifests and Metadata in Iceberg

I created a table as follows: CREATE TABLE IF NOT EXISTS raw_data.civ ( date timestamp, marketplace_id int, ... some more columns ) USING ICEBERG PARTITIONED BY ( marketplace_id, ...

shiva

2,781

asked Oct 25 at 15:11

0 votes

0 answers

66 views

Unexpected Write Behavior when using MERGE INTO/INSERT INTO Iceberg Spark Queries

I am observing different write behaviors when executing queries on EMR Notebook (correct behavior) vs when using spark-submit to submit a spark application to EMR Cluster (incorrect behavior). When I ...

shiva

2,781

asked Oct 21 at 20:58

0 votes

0 answers

35 views

How to link Spark event log stages to PySpark code or query?

I'm analyzing Spark event logs and have already retrieved the SparkListenerStageSubmitted and SparkListenerTaskEnd events to collect metrics such as spill, skew ratio, memory, and CPU usage. However, ...

Carol C

1

asked Oct 9 at 19:40

0 votes

0 answers

53 views

Spatial join without Apache Sedona

currently I'm working in a specific version of Apache Spark (3.1.1) that cannot upgrade. Since that I can't use Apache Sedona and the version 1.3.1 is too slow. My problem is the following code that ...

matdlara

149

asked Oct 3 at 1:35

0 votes

1 answer

115 views

Problem reading the _last_checkpoint file from the _delta_log directory of a delta lake table on s3

I am trying to read the _delta_log folder of a delta lake table via spark to export some custom metrics. I have configured how to get some metrics from history and description but I have problem ...

Melika Ghiasi

1

asked Sep 7 at 10:50

3 votes

0 answers

72 views

Cannot extend Spark UnaryExpression in Java

I am trying to write a custom decoder function in Java targeting Spark 4.0: public class MyDataToCatalyst extends UnaryExpression implements NonSQLExpression, ExpectsInputTypes, Serializable { //.....

Carsten

1,288

asked Aug 26 at 16:59

1 vote

0 answers

131 views

How to read data from MongoDB collection using SparkSession into a Spark DataFrameReader?

Spark reading data from MongoDB(ver 7.0) and DocumentDB(ver 4.0) and loading into the spark DataFrameReader is failing when DataFrameReader.isEmpty() method is called . SparkSession and ...

Sandeep Reddy CONT

11

asked Aug 9 at 4:16

0 votes

1 answer

44 views

Azure Synapse SQL Merge is not updating records, instead of that it inserts matching records using spark.sql

I have the below code where the Id is a 36 character GUID. The code gets executed but when a matching record is found , instead of updating it inserts the entire records again. What could be the root ...

Sandeep T

441

asked Aug 1 at 19:12

0 votes

1 answer

105 views

How to dynamically generate SQL to Update/Insert a table in Azure Databricks Notebook

Its a sort of CDC ( Change Data Capture ) scenario in which I am trying to compare new data (in tblNewData) with old data (in tblOldData), and logging the changes into a log table (tblExpectedDataLog) ...

Aza

27

asked Aug 1 at 4:39

3 votes

1 answer

61 views

Why is this task taking so long even for a graph with merely 5-6 nodes and edges?

I am trying to implement the Parallelized BFS algorithm using Pyspark. I am following the material in CS246, What exactly in my implementation is making this thing take so long? Pardon me I am just a ...

Frenzy Ripper

31

asked Jul 4 at 15:23

1 vote

1 answer

86 views

Spark streaming failing intermittently with llegalStateException: Found no SST files

I'm encountering the following error while trying to upload a RocksDB checkpoint in Databricks: java.lang.IllegalStateException: Found no SST files during uploading RocksDB checkpoint version 498 with ...

Susmit Sarkar

75

asked Jul 3 at 11:53

0 votes

1 answer

60 views

How can I calculate the timestamp difference based on status using spark Dataframes?

I am trying to calculate the timestamp difference on cumulative rows based on ID and status columns Example dataframe: ID TIMESTAMP STATUS V1 2023-06-18 13:00:00 1 V1 2023-06-18 13:01:00 1 V1 2023-06-...

RMK

41

asked Jun 17 at 6:05

0 votes

0 answers

62 views

Get two different nodes to access and distribute the same SQL table in Apache spark?

I have the following code to test. I created a table on worker 1. Then I tried to read the table on worker 2 and it got TABLE_OR_VIEW_NOT_FOUND. Worker 2 is in the some computer as Master. I ran the ...

Rick C. Ferreira

1

asked Jun 16 at 19:25

3 votes

0 answers

337 views

Overwrite is failing with "pyspark.errors.exceptions.captured.AnalysisException: Table does not support truncate in batch mode"

I upgraded PySpark from 3.5.5 to 3.5.6, and now all unit tests with an overwrite operation are failing with this error: pyspark.errors.exceptions.captured.AnalysisException: Table does not support ...

Nicholas Fiorentini

31

asked Jun 4 at 13:50

1 vote

2 answers

111 views

Consecutive Activity using Analytical Function

I have a table containing the fields: user_ip, datetime, year, month, day, hour, tag_id, country, device_type, brand. I need to check if given a IP was active for a continuous period of 4 or more ...

user16798185

377

asked May 29 at 7:34

0 votes

0 answers

101 views

Poor performance with PySpark write to parquet

I can make PySpark "work" no problem, but know very little and am very confused by documentation on performance. I have some source data partitioned by date, read it directory by directory (...

mateoc15

680

asked May 19 at 19:11

1 vote

1 answer

104 views

How to properly recalculate Spark DataFrame statistics after checkpoint?

Here is minimal example using default data in DataBricks (Spark 3.4): import org.apache.spark.sql.functions.col import org.apache.spark.sql.{Row, SparkSession} import org.apache.spark.sql.types._ sc....

Igor Railean

11

asked May 15 at 20:43

1 vote

2 answers

148 views

How can I do a join on each array element in a column and replace with something from the join table?

I have a table, base_df, with many columns, one of which is an array column: Id FruitNames Col1 Col2 Col3 ... Col99 1 ["apple", "banana", "orange"] ... ... ... ... ... 2 [...

wkeithvan♦

2,215

asked May 2 at 23:18

0 votes

0 answers

84 views

Caused by: java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer

I am trying to deploy a scala application which uses structures streaming on a standalone distributed Spark cluster using the spark-submit command and I get the following error: Exception in thread &...

Maria

1

asked Apr 30 at 6:24

1 vote

0 answers

63 views

Why does a subquery without matching column names still work in Spark SQL? [duplicate]

I have the following two datasets in Spark SQL: person view: person = spark.createDataFrame([ (0, "Bill Chambers", 0, [100]), (1, "Matei Zaharia", 1, [500, 250, 100]), (2, "...

DumbCoder

515

asked Apr 26 at 8:38

1 vote

0 answers

131 views

Pyspark writing dataframe to oracle database table using JDBC

I am new to Pyspark and having few clarifications on writing dataframe to oracle database table using JDBC. As part of the requirement I need to read the data from Oracle table and perform ...

Siva

11

asked Apr 18 at 17:05

0 votes

0 answers

56 views

Spark with availableNow trigger doesn't archive sources

I use Spark to read JSON files that appear in a folder everyday with path pattern Yyyy/mm/dd to convert them into Iceberg format. Both folders JSON and Iceberg are in a s3 bucket on different paths. ...

Alex

1,019

asked Apr 17 at 10:19

0 votes

0 answers

65 views

Spark AQE and Skew Join configurations not being applied

I am experiencing data skew issues in spark, specifically during joins and window functions. I have tried many of the spark performance tuning configurations recommended but none appear to be working. ...

ifightfortheuserz

1

asked Apr 11 at 17:11

1 vote

0 answers

132 views

Creating an Iceberg table with a geometry column with Sedona

I'm trying to create an Iceberg table with a geometry column in this example: import org.apache.sedona.sql.utils.SedonaSQLRegistrator SedonaSQLRegistrator.registerAll(spark) val stmt = ""&...

Stefan Ziegler

63

asked Apr 9 at 9:44

Collectives™ on Stack Overflow

Pushing down filters in RDBMS with Java Spark

How to Check if a Query Touches Data Files or just Uses Manifests and Metadata in Iceberg

Unexpected Write Behavior when using MERGE INTO/INSERT INTO Iceberg Spark Queries

How to link Spark event log stages to PySpark code or query?

Spatial join without Apache Sedona

Problem reading the _last_checkpoint file from the _delta_log directory of a delta lake table on s3

Cannot extend Spark UnaryExpression in Java

How to read data from MongoDB collection using SparkSession into a Spark DataFrameReader?

Azure Synapse SQL Merge is not updating records, instead of that it inserts matching records using spark.sql

How to dynamically generate SQL to Update/Insert a table in Azure Databricks Notebook

Why is this task taking so long even for a graph with merely 5-6 nodes and edges?

Spark streaming failing intermittently with llegalStateException: Found no SST files

How can I calculate the timestamp difference based on status using spark Dataframes?

Get two different nodes to access and distribute the same SQL table in Apache spark?

Overwrite is failing with "pyspark.errors.exceptions.captured.AnalysisException: Table does not support truncate in batch mode"

Consecutive Activity using Analytical Function

Poor performance with PySpark write to parquet

How to properly recalculate Spark DataFrame statistics after checkpoint?

How can I do a join on each array element in a column and replace with something from the join table?

Caused by: java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer

Why does a subquery without matching column names still work in Spark SQL? [duplicate]

Pyspark writing dataframe to oracle database table using JDBC

Spark with availableNow trigger doesn't archive sources

Spark AQE and Skew Join configurations not being applied

Creating an Iceberg table with a geometry column with Sedona

Hot Network Questions