26,925 questions
0 votes · 1 answer · 12k views
Mismatched input 'from' expecting <EOF>
Ok, so I am trying to run this piece of code:
%%spark
spark.sql(f'''select t2.ANOT_PDA_PRD
,t2.VLR_PDA_PRD AS MAX_VLR_PDA
,(CASE t1.cd_gr_mdld_prd_pd
WHEN 7 THEN t3....
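Since the excerpt is truncated, a common first debugging step (not from the original post; the table name below is a placeholder) is to build the SQL as a string and print it, so you can inspect exactly what the parser receives; f-string interpolation can silently produce malformed SQL that triggers this mismatched-input error:
query = f'''SELECT t2.ANOT_PDA_PRD
     , t2.VLR_PDA_PRD AS MAX_VLR_PDA
FROM some_table t2'''  # placeholder table name
print(query)           # inspect the exact text the SQL parser will see
spark.sql(query)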
Best practices
0 votes · 5 replies · 78 views
Pushing down filters in RDBMS with Java Spark
I have been working as a Data Engineer and ran into this issue.
I came across a use case where I have a view (let's call it inputView) which is created by reading data from some source.
Now somewhere ...
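For a JDBC-backed source, one way to guarantee the filter executes inside the RDBMS is the reader's query option (a sketch with placeholder connection details, shown in PySpark to match the other examples here):
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")  # placeholder URL
      .option("query", "SELECT id, amount FROM orders WHERE amount > 100")  # filter runs in the database
      .option("user", "user")
      .option("password", "pass")
      .load())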
14 votes · 6 answers · 39k views
Can I change the nullability of a column in my Spark dataframe?
I have a StructField in a dataframe that is not nullable. Simple example:
import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['...
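A common workaround (a sketch, not necessarily the accepted answer) is to rebuild the schema with every field marked nullable and re-create the DataFrame from the underlying RDD:
from pyspark.sql.types import StructType, StructField

# Copy the existing schema, flipping each field to nullable=True.
new_schema = StructType([StructField(f.name, f.dataType, True) for f in df.schema.fields])
df_nullable = spark.createDataFrame(df.rdd, new_schema)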
7 votes · 2 answers · 10k views
How to use the LIKE operator as a JOIN condition (as a column) in pyspark
I would like to do the following in pyspark (for AWS Glue jobs):
JOIN a and b ON a.name = b.name AND a.number = b.number AND a.city LIKE b.city
So for example:
Table a:
Number | Name | City
1000   | Bob  | %
...
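In the DataFrame API this condition can be written with expr (a sketch; a and b stand for the two DataFrames registered under those aliases):
from pyspark.sql import functions as F

joined = a.alias("a").join(
    b.alias("b"),
    (F.col("a.name") == F.col("b.name"))
    & (F.col("a.number") == F.col("b.number"))
    & F.expr("a.city LIKE b.city"),  # LIKE used directly in the join condition
)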
3 votes · 3 answers · 3k views
extracting HOUR from an interval in spark sql
I was wondering how to properly extract the number of hours between two given timestamp objects.
For instance, when the following SQL query gets executed:
select x, extract(HOUR FROM x) as result
...
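Note that EXTRACT(HOUR FROM ...) returns only the hour component of a value, not the total elapsed hours. To get the full hour count between two timestamps, one option (a sketch with hypothetical table and column names) is to difference the epoch seconds:
spark.sql("""
    SELECT (unix_timestamp(ts_end) - unix_timestamp(ts_start)) / 3600 AS hours_between
    FROM events  -- hypothetical table with two timestamp columns
""")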
2 votes · 1 answer · 11k views
Convert PipelinedRDD to dataframe
I'm attempting to convert a PipelinedRDD in pyspark to a dataframe. This is the code snippet:
newRDD = rdd.map(lambda row: Row(row.__fields__ + ["tag"])(row + (tagScripts(row), )))
df = newRDD.toDF()
...
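One way to give the mapped rows named fields so that toDF() can infer a schema (a sketch; tagScripts is the asker's own function):
from pyspark.sql import Row

def add_tag(row):
    d = row.asDict()            # keep the original named fields
    d["tag"] = tagScripts(row)  # the asker's tagging function
    return Row(**d)

df = rdd.map(add_tag).toDF()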
4 votes · 1 answer · 8k views
Maximum number of concurrent tasks in 1 DPU in AWS Glue
A standard DPU in AWS Glue comes with 4 vCPU and 2 executors.
I am confused about the maximum number of concurrent tasks that can be run in parallel with this configuration. Is it 4 or 8 on a single ...
3 votes · 2 answers · 4k views
Getting a year-month date format in Spark SQL
I'm trying to get a year-month column using this function:
date_format(delivery_date, 'mmmmyyyy')
but I'm getting the wrong values for the month.
Example of the output I want to get:
if I have this date ...
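In Spark's datetime patterns, lowercase mm means minute-of-hour; months are uppercase MM, which is the likely fix here (a minimal sketch):
from pyspark.sql import functions as F

# 'MMyyyy' -> month-of-year plus year; lowercase 'mm' would give minutes.
df = df.withColumn("year_month", F.date_format("delivery_date", "MMyyyy"))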
7 votes · 2 answers · 13k views
How to set partition for Window function for PySpark?
I'm running a PySpark job, and I'm getting the following message:
WARN org.apache.spark.sql.execution.Window: No Partition Defined for Window operation! Moving all data to a single partition, this ...
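The warning disappears once the window is given a partitioning key (a sketch with hypothetical column names):
from pyspark.sql import Window, functions as F

w = Window.partitionBy("customer_id").orderBy("event_time")  # hypothetical columns
df = df.withColumn("row_num", F.row_number().over(w))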
Advice
0 votes · 6 replies · 154 views
Pyspark SQL: How to do GROUP BY with specific WHERE condition
So I am doing some SQL aggregation transformations of a dataset, and there is a certain condition I would like to apply, but I'm not sure how.
Here is a basic code block:
le_test = spark.sql("""...
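Conditional aggregation is the usual pattern for this (a sketch with placeholder names, since the original query is truncated):
le_test = spark.sql("""
    SELECT key_col,
           SUM(CASE WHEN status = 'OK' THEN amount ELSE 0 END) AS ok_amount  -- aggregate only rows meeting the condition
    FROM my_table
    GROUP BY key_col
""")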
2 votes · 1 answer · 2k views
Why does a Spark count action execute in three stages
I have loaded a CSV file, repartitioned it to 4, and then took the count of the DataFrame. When I looked at the DAG, I saw this action executed in 3 stages.
Why is this simple action executed in ...
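A reproduction sketch (the path is a placeholder); the repartition inserts a shuffle boundary, and the Spark UI's DAG view shows where each stage starts and ends:
df = spark.read.csv("data.csv", header=True)  # placeholder path
df4 = df.repartition(4)  # repartition adds a shuffle, which ends a stage
print(df4.count())       # action: inspect the resulting stages in the Spark UI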
0 votes · 0 answers · 83 views
How to Check if a Query Touches Data Files or just Uses Manifests and Metadata in Iceberg
I created a table as follows:
CREATE TABLE IF NOT EXISTS raw_data.civ (
date timestamp,
marketplace_id int,
... some more columns
)
USING ICEBERG
PARTITIONED BY (
marketplace_id,
...
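One way to check this (a sketch; Iceberg exposes metadata tables alongside the data table) is to query the files metadata table and compare against the physical plan, which shows the pushed-down partition filters:
# Which data files back the table (Iceberg's 'files' metadata table).
spark.sql("SELECT file_path, record_count FROM raw_data.civ.files").show(truncate=False)

# The physical plan reveals whether the partition filter was pushed down.
spark.sql("SELECT count(*) FROM raw_data.civ WHERE marketplace_id = 1").explain()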
0 votes · 1 answer · 12k views
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe
I am new to Spark and facing an error while converting a .csv file to a dataframe. I am using the pyspark_csv module for the conversion, but it gives an error.
Here is the stack trace for the error; can anyone ...
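Since Spark 2.0 the built-in CSV reader makes the third-party pyspark_csv module unnecessary; a minimal sketch (the path is a placeholder):
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/file.csv"))  # placeholder path
df.show()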
0 votes · 1 answer · 9k views
How to fix 22: error: not found: value SparkSession in Scala?
I am new to Spark and I would like to read a CSV file into a DataFrame.
Spark 1.3.0 / Scala 2.3.0
This is what I have so far:
# Start Scala with CSV Package Module
spark-shell --packages com....
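SparkSession only exists from Spark 2.0 onward, so it cannot resolve on a 1.3.0 shell. For comparison, this is the 2.x entry point (shown in PySpark to match the other sketches here; the path is a placeholder):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-example").getOrCreate()
df = spark.read.csv("file.csv", header=True)  # placeholder path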
2 votes · 3 answers · 3k views
PySpark performance chained transformations vs successive reassignment
I'm somewhat new to PySpark. I understand that it uses lazy evaluation, meaning that execution of a series of transformations will be deferred until some action is requested, at which point the Spark ...
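Because of lazy evaluation, both styles build the same logical plan, which can be verified by comparing explain() output (a sketch with a hypothetical column x):
from pyspark.sql import functions as F

chained = df.filter(F.col("x") > 0).withColumn("y", F.col("x") * 2)

step = df.filter(F.col("x") > 0)
step = step.withColumn("y", F.col("x") * 2)

chained.explain()  # the two physical plans should be identical
step.explain()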
0 votes · 0 answers · 44 views
Spark: VSAM File read issue with special character
We have a scenario to read a VSAM file directly, along with a copybook to understand the column lengths; we were using the Cobrix library as part of the Spark read.
However, we could see the same is not properly ...
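A minimal Cobrix read sketch (paths are placeholders, and the exact options depend on the file's record format):
df = (spark.read
      .format("cobol")                       # Cobrix's data source name
      .option("copybook", "/path/book.cpy")  # placeholder copybook path
      .load("/path/to/vsam_export"))         # placeholder data path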
6 votes · 2 answers · 18k views
Drop table using Pyspark
The SparkSession.catalog object has a bunch of methods to interact with the metastore, namely:
['cacheTable',
'clearCache',
'createExternalTable',
'createTable',
'currentDatabase',
'...
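The catalog API covers temp views (dropTempView, dropGlobalTempView) but has no method for dropping a persistent table, so the usual route is plain SQL (a sketch with placeholder names):
# Temp views go through the catalog API:
spark.catalog.dropTempView("my_view")

# Persistent metastore tables are dropped with SQL:
spark.sql("DROP TABLE IF EXISTS my_db.my_table")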
2 votes · 2 answers · 9k views
Strings concatenation in Spark SQL query
I'm experimenting with Spark and Spark SQL, and I need to concatenate a value at the beginning of a string field that I retrieve as the output of a select (with a join), like the following:
val result = ...
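Spark SQL's concat() handles this; a sketch in PySpark with placeholder table and column names, since the original query is truncated:
result = spark.sql("""
    SELECT concat('PREFIX_', t.some_field) AS prefixed_field  -- prepend a literal to the column
    FROM table_a t
    JOIN table_b u ON t.id = u.id
""")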
2 votes · 1 answer · 9k views
SparkException: Job aborted due to stage failure. Caused by: java.lang.ArrayIndexOutOfBoundsException
I am trying to read some data from a parquet file using Spark SQL and put that data into another table, but while writing the data I am getting the error below.
The parquet ...
0 votes · 3 answers · 16k views
Filter a spark dataframe with greater-than and less-than comparisons against a list of dates
I have a dataframe with the fields from_date and to_date:
(2017-01-10 2017-01-14)
(2017-01-03 2017-01-13)
and a List of dates
2017-01-05,
2017-01-12,
2017-01-13,
2017-01-15
The idea is to ...
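One option (a sketch; requires Spark 3.1+ for exists(), and df is assumed to have from_date and to_date columns) is to turn the list into an array column and keep rows where any date falls inside the range:
from pyspark.sql import functions as F

dates = ["2017-01-05", "2017-01-12", "2017-01-13", "2017-01-15"]
arr = F.array(*[F.lit(d).cast("date") for d in dates])

# Keep a row if any date in the list lies within [from_date, to_date].
df_filtered = df.filter(
    F.exists(arr, lambda d: (d >= F.col("from_date")) & (d <= F.col("to_date")))
)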
12 votes · 3 answers · 6k views
Unable to start spark-shell; spark-submit is failing
I am trying to run spark-submit but it's failing with a weird message.
Error: Could not find or load main class org.apache.spark.launcher.Main
/opt/spark/bin/spark-class: line 96: CMD: bad array ...
0 votes · 2 answers · 967 views
Efficient joins in spark dataframe
I am trying to join a DataFrame (dfA) sequentially on the same DataFrame.
Let's say dfA has columns id_x and id_y and dfB has the column id and some other columns.
I want to perform the following:
...
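Sequential joins against the same table usually need aliases so the two join keys stay distinguishable (a sketch using the column names from the question):
b_x = dfB.alias("bx")
b_y = dfB.alias("by")

result = (dfA
          .join(b_x, dfA["id_x"] == b_x["id"], "left")   # first join on id_x
          .join(b_y, dfA["id_y"] == b_y["id"], "left"))  # second join on id_y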
1 vote · 3 answers · 4k views
How to aggregate data in Apache Spark
I have a distributed system on 3 nodes and my data is distributed among those nodes. For example, I have a test.csv file which exists on all 3 nodes, and it contains 4 columns of
row | id, C1, C2, ...
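A typical aggregation sketch (column names follow the excerpt's id, C1, C2 layout; the aggregates chosen here are illustrative):
from pyspark.sql import functions as F

df = spark.read.csv("test.csv", header=True, inferSchema=True)
agg = df.groupBy("id").agg(
    F.sum("C1").alias("sum_c1"),
    F.avg("C2").alias("avg_c2"),
)
agg.show()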
1 vote · 2 answers · 2k views
Write dataframe to multiple tabs in an Excel sheet using Spark
I have been using spark-excel (https://github.com/crealytics/spark-excel) to write output to a single sheet of an Excel workbook. However, I am unable to write the output to different sheets (tabs).
...
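spark-excel addresses individual sheets through its dataAddress option; a sketch (sheet name and path are placeholders, and append mode preserves previously written sheets):
(df.write.format("com.crealytics.spark.excel")
   .option("dataAddress", "'Sheet2'!A1")  # target tab and top-left cell
   .option("header", "true")
   .mode("append")                        # keep earlier sheets in the file
   .save("/path/output.xlsx"))            # placeholder path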
1 vote · 3 answers · 7k views
Write parquet from another parquet with a new schema using pyspark
I am using pyspark dataframes. I want to read a parquet file and write it with a different schema from the original file.
The original schema is (it has 9,000 variables; I am just putting the first 5 ...
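One approach (a sketch; target_schema stands in for the new StructType, which is not shown in the excerpt) is to cast each column to its new type before writing:
from pyspark.sql import functions as F

df = spark.read.parquet("/path/in.parquet")  # placeholder path
df_cast = df.select(
    [F.col(f.name).cast(f.dataType) for f in target_schema.fields]  # target_schema: the hypothetical new schema
)
df_cast.write.parquet("/path/out.parquet")   # placeholder path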