26,925 questions
0 votes · 1 answer · 12k views
Mismatched input 'from' expecting <EOF>
Ok, so I am trying to run this piece of code:
%%spark
spark.sql(f'''select t2.ANOT_PDA_PRD
,t2.VLR_PDA_PRD AS MAX_VLR_PDA
,(CASE t1.cd_gr_mdld_prd_pd
WHEN 7 THEN t3....
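Since the excerpt is truncated, a common first debugging step (not from the original post; the table name below is a placeholder) is to build the SQL as a string and print it, so you can inspect exactly what the parser receives; f-string interpolation can silently produce malformed SQL that triggers this mismatched-input error:
query = f'''SELECT t2.ANOT_PDA_PRD
     , t2.VLR_PDA_PRD AS MAX_VLR_PDA
FROM some_table t2'''  # placeholder table name
print(query)           # inspect the exact text the SQL parser will see
spark.sql(query)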
Best practices
0 votes · 5 replies · 78 views
Pushing down filters in RDBMS with Java Spark
I have been working as a Data Engineer and ran into this issue.
I came across a use case where I have a view (let's call it inputView) which is created by reading data from some source.
Now somewhere ...
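For a JDBC-backed source, one way to guarantee the filter executes inside the RDBMS is the reader's query option (a sketch with placeholder connection details, shown in PySpark to match the other examples here):
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")  # placeholder URL
      .option("query", "SELECT id, amount FROM orders WHERE amount > 100")  # filter runs in the database
      .option("user", "user")
      .option("password", "pass")
      .load())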
14 votes · 6 answers · 39k views
Can I change the nullability of a column in my Spark dataframe?
I have a StructField in a dataframe that is not nullable. Simple example:
import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['...
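A common workaround (a sketch, not necessarily the accepted answer) is to rebuild the schema with every field marked nullable and re-create the DataFrame from the underlying RDD:
from pyspark.sql.types import StructType, StructField

# Copy the existing schema, flipping each field to nullable=True.
new_schema = StructType([StructField(f.name, f.dataType, True) for f in df.schema.fields])
df_nullable = spark.createDataFrame(df.rdd, new_schema)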
7 votes · 2 answers · 10k views
How to use the LIKE operator as a JOIN condition (as a column) in pyspark
I would like to do the following in pyspark (for AWS Glue jobs):
JOIN a and b ON a.name = b.name AND a.number = b.number AND a.city LIKE b.city
So for example:
Table a:
Number | Name | City
1000   | Bob  | %
...
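In the DataFrame API this condition can be written with expr (a sketch; a and b stand for the two DataFrames registered under those aliases):
from pyspark.sql import functions as F

joined = a.alias("a").join(
    b.alias("b"),
    (F.col("a.name") == F.col("b.name"))
    & (F.col("a.number") == F.col("b.number"))
    & F.expr("a.city LIKE b.city"),  # LIKE used directly in the join condition
)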
3 votes · 3 answers · 3k views
extracting HOUR from an interval in spark sql
I was wondering how to properly extract the number of hours between two given timestamp objects.
For instance, when the following SQL query gets executed:
select x, extract(HOUR FROM x) as result
...
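Note that EXTRACT(HOUR FROM ...) returns only the hour component of a value, not the total elapsed hours. To get the full hour count between two timestamps, one option (a sketch with hypothetical table and column names) is to difference the epoch seconds:
spark.sql("""
    SELECT (unix_timestamp(ts_end) - unix_timestamp(ts_start)) / 3600 AS hours_between
    FROM events  -- hypothetical table with two timestamp columns
""")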
2 votes · 1 answer · 11k views
Convert PipelinedRDD to dataframe
I'm attempting to convert a PipelinedRDD in pyspark to a dataframe. This is the code snippet:
newRDD = rdd.map(lambda row: Row(row.__fields__ + ["tag"])(row + (tagScripts(row), )))
df = newRDD.toDF()
...
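One way to give the mapped rows named fields so that toDF() can infer a schema (a sketch; tagScripts is the asker's own function):
from pyspark.sql import Row

def add_tag(row):
    d = row.asDict()            # keep the original named fields
    d["tag"] = tagScripts(row)  # the asker's tagging function
    return Row(**d)

df = rdd.map(add_tag).toDF()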
4 votes · 1 answer · 8k views
Maximum number of concurrent tasks in 1 DPU in AWS Glue
A standard DPU in AWS Glue comes with 4 vCPU and 2 executors.
I am confused about the maximum number of concurrent tasks that can be run in parallel with this configuration. Is it 4 or 8 on a single ...
3 votes · 2 answers · 4k views
Getting a year-month date format in Spark SQL
I'm trying to get a year-month column using this function:
date_format(delivery_date, 'mmmmyyyy')
but I'm getting the wrong values for the month.
Example of the output I want to get:
if I have this date ...
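In Spark's datetime patterns, lowercase mm means minute-of-hour; months are uppercase MM, which is the likely fix here (a minimal sketch):
from pyspark.sql import functions as F

# 'MMyyyy' -> month-of-year plus year; lowercase 'mm' would give minutes.
df = df.withColumn("year_month", F.date_format("delivery_date", "MMyyyy"))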
7 votes · 2 answers · 13k views
How to set partition for Window function for PySpark?
I'm running a PySpark job, and I'm getting the following message:
WARN org.apache.spark.sql.execution.Window: No Partition Defined for Window operation! Moving all data to a single partition, this ...
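The warning disappears once the window is given a partitioning key (a sketch with hypothetical column names):
from pyspark.sql import Window, functions as F

w = Window.partitionBy("customer_id").orderBy("event_time")  # hypothetical columns
df = df.withColumn("row_num", F.row_number().over(w))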
Advice
0 votes · 6 replies · 154 views
Pyspark SQL: How to do GROUP BY with specific WHERE condition
So I am doing some SQL aggregation transformations of a dataset, and there is a certain condition I would like to apply, but I'm not sure how.
Here is a basic code block:
le_test = spark.sql("""...
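Conditional aggregation is the usual pattern for this (a sketch with placeholder names, since the original query is truncated):
le_test = spark.sql("""
    SELECT key_col,
           SUM(CASE WHEN status = 'OK' THEN amount ELSE 0 END) AS ok_amount  -- aggregate only rows meeting the condition
    FROM my_table
    GROUP BY key_col
""")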
2 votes · 1 answer · 2k views
Why does a Spark count action execute in three stages
I have loaded a CSV file, repartitioned it to 4, and then took the count of the DataFrame. When I looked at the DAG, I saw this action executed in 3 stages.
Why is this simple action executed in ...
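A reproduction sketch (the path is a placeholder); the repartition inserts a shuffle boundary, and the Spark UI's DAG view shows where each stage starts and ends:
df = spark.read.csv("data.csv", header=True)  # placeholder path
df4 = df.repartition(4)  # repartition adds a shuffle, which ends a stage
print(df4.count())       # action: inspect the resulting stages in the Spark UI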
0 votes · 0 answers · 83 views
How to Check if a Query Touches Data Files or just Uses Manifests and Metadata in Iceberg
I created a table as follows:
CREATE TABLE IF NOT EXISTS raw_data.civ (
date timestamp,
marketplace_id int,
... some more columns
)
USING ICEBERG
PARTITIONED BY (
marketplace_id,
...
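One way to check this (a sketch; Iceberg exposes metadata tables alongside the data table) is to query the files metadata table and compare against the physical plan, which shows the pushed-down partition filters:
# Which data files back the table (Iceberg's 'files' metadata table).
spark.sql("SELECT file_path, record_count FROM raw_data.civ.files").show(truncate=False)

# The physical plan reveals whether the partition filter was pushed down.
spark.sql("SELECT count(*) FROM raw_data.civ WHERE marketplace_id = 1").explain()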
0 votes · 1 answer · 12k views
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe
I am new to Spark and facing an error while converting a .csv file to a dataframe. I am using the pyspark_csv module for the conversion, but it gives an error.
Here is the stack trace for the error; can anyone ...
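Since Spark 2.0 the built-in CSV reader makes the third-party pyspark_csv module unnecessary; a minimal sketch (the path is a placeholder):
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/file.csv"))  # placeholder path
df.show()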
0 votes · 1 answer · 9k views
How to fix 22: error: not found: value SparkSession in Scala?
I am new to Spark and I would like to read a CSV file into a DataFrame.
Spark 1.3.0 / Scala 2.3.0
This is what I have so far:
# Start Scala with CSV Package Module
spark-shell --packages com....
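SparkSession only exists from Spark 2.0 onward, so it cannot resolve on a 1.3.0 shell. For comparison, this is the 2.x entry point (shown in PySpark to match the other sketches here; the path is a placeholder):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-example").getOrCreate()
df = spark.read.csv("file.csv", header=True)  # placeholder path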
2 votes · 3 answers · 3k views
PySpark performance chained transformations vs successive reassignment
I'm somewhat new to PySpark. I understand that it uses lazy evaluation, meaning that execution of a series of transformations will be deferred until some action is requested, at which point the Spark ...
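Because of lazy evaluation, both styles build the same logical plan, which can be verified by comparing explain() output (a sketch with a hypothetical column x):
from pyspark.sql import functions as F

chained = df.filter(F.col("x") > 0).withColumn("y", F.col("x") * 2)

step = df.filter(F.col("x") > 0)
step = step.withColumn("y", F.col("x") * 2)

chained.explain()  # the two physical plans should be identical
step.explain()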
0 votes · 0 answers · 44 views
Spark: VSAM File read issue with special character
We have a scenario to read a VSAM file directly, along with a copybook to understand the column lengths; we were using the Cobrix library as part of the Spark read.
However, we could see the same is not properly ...
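A minimal Cobrix read sketch (paths are placeholders, and the exact options depend on the file's record format):
df = (spark.read
      .format("cobol")                       # Cobrix's data source name
      .option("copybook", "/path/book.cpy")  # placeholder copybook path
      .load("/path/to/vsam_export"))         # placeholder data path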
6 votes · 2 answers · 18k views
Drop table using Pyspark
The SparkSession.catalog object has a bunch of methods to interact with the metastore, namely:
['cacheTable',
'clearCache',
'createExternalTable',
'createTable',
'currentDatabase',
'...
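The catalog API covers temp views (dropTempView, dropGlobalTempView) but has no method for dropping a persistent table, so the usual route is plain SQL (a sketch with placeholder names):
# Temp views go through the catalog API:
spark.catalog.dropTempView("my_view")

# Persistent metastore tables are dropped with SQL:
spark.sql("DROP TABLE IF EXISTS my_db.my_table")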
2 votes · 2 answers · 9k views
Strings concatenation in Spark SQL query
I'm experimenting with Spark and Spark SQL, and I need to concatenate a value at the beginning of a string field that I retrieve as the output of a select (with a join), like the following:
val result = ...
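Spark SQL's concat() handles this; a sketch in PySpark with placeholder table and column names, since the original query is truncated:
result = spark.sql("""
    SELECT concat('PREFIX_', t.some_field) AS prefixed_field  -- prepend a literal to the column
    FROM table_a t
    JOIN table_b u ON t.id = u.id
""")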
2 votes · 1 answer · 9k views
SparkException: Job aborted due to stage failure. Caused by: java.lang.ArrayIndexOutOfBoundsException
I am trying to read some data from a parquet file using Spark SQL and put that data into another table, but while writing the data I am getting the error below.
The parquet ...
0 votes · 3 answers · 16k views
Filter a spark dataframe with greater-than and less-than comparisons against a list of dates
I have a dataframe with the fields from_date and to_date:
(2017-01-10 2017-01-14)
(2017-01-03 2017-01-13)
and a List of dates
2017-01-05,
2017-01-12,
2017-01-13,
2017-01-15
The idea is to ...
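One option (a sketch; requires Spark 3.1+ for exists(), and df is assumed to have from_date and to_date columns) is to turn the list into an array column and keep rows where any date falls inside the range:
from pyspark.sql import functions as F

dates = ["2017-01-05", "2017-01-12", "2017-01-13", "2017-01-15"]
arr = F.array(*[F.lit(d).cast("date") for d in dates])

# Keep a row if any date in the list lies within [from_date, to_date].
df_filtered = df.filter(
    F.exists(arr, lambda d: (d >= F.col("from_date")) & (d <= F.col("to_date")))
)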
12 votes · 3 answers · 6k views
Unable to start spark-shell; spark-submit is failing
I am trying to run spark-submit but it's failing with a weird message.
Error: Could not find or load main class org.apache.spark.launcher.Main
/opt/spark/bin/spark-class: line 96: CMD: bad array ...
0 votes · 2 answers · 967 views
Efficient joins in spark dataframe
I am trying to join a DataFrame (dfA) sequentially on the same DataFrame.
Let's say dfA has columns id_x and id_y and dfB has the column id and some other columns.
I want to perform the following:
...
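Sequential joins against the same table usually need aliases so the two join keys stay distinguishable (a sketch using the column names from the question):
b_x = dfB.alias("bx")
b_y = dfB.alias("by")

result = (dfA
          .join(b_x, dfA["id_x"] == b_x["id"], "left")   # first join on id_x
          .join(b_y, dfA["id_y"] == b_y["id"], "left"))  # second join on id_y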
1 vote · 3 answers · 4k views
How to aggregate data in Apache Spark
I have a distributed system on 3 nodes and my data is distributed among those nodes. For example, I have a test.csv file which exists on all 3 nodes, and it contains 4 columns of
row | id, C1, C2, ...
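A typical aggregation sketch (column names follow the excerpt's id, C1, C2 layout; the aggregates chosen here are illustrative):
from pyspark.sql import functions as F

df = spark.read.csv("test.csv", header=True, inferSchema=True)
agg = df.groupBy("id").agg(
    F.sum("C1").alias("sum_c1"),
    F.avg("C2").alias("avg_c2"),
)
agg.show()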
1 vote · 2 answers · 2k views
Write dataframe to multiple tabs in an Excel sheet using Spark
I have been using spark-excel (https://github.com/crealytics/spark-excel) to write output to a single sheet of an Excel workbook. However, I am unable to write the output to different sheets (tabs).
...
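spark-excel addresses individual sheets through its dataAddress option; a sketch (sheet name and path are placeholders, and append mode preserves previously written sheets):
(df.write.format("com.crealytics.spark.excel")
   .option("dataAddress", "'Sheet2'!A1")  # target tab and top-left cell
   .option("header", "true")
   .mode("append")                        # keep earlier sheets in the file
   .save("/path/output.xlsx"))            # placeholder path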
1 vote · 3 answers · 7k views
Write parquet from another parquet with a new schema using pyspark
I am using pyspark dataframes. I want to read a parquet file and write it with a different schema from the original file.
The original schema is (it has 9,000 variables; I am just putting the first 5 ...
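One approach (a sketch; target_schema stands in for the new StructType, which is not shown in the excerpt) is to cast each column to its new type before writing:
from pyspark.sql import functions as F

df = spark.read.parquet("/path/in.parquet")  # placeholder path
df_cast = df.select(
    [F.col(f.name).cast(f.dataType) for f in target_schema.fields]  # target_schema: the hypothetical new schema
)
df_cast.write.parquet("/path/out.parquet")   # placeholder path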