0 votes
1 answer
12k views

Ok, so I am trying to run this piece of code: %%spark spark.sql(f'''select t2.ANOT_PDA_PRD ,t2.VLR_PDA_PRD AS MAX_VLR_PDA ,(CASE t1.cd_gr_mdld_prd_pd WHEN 7 THEN t3....
Best practices
0 votes
5 replies
78 views

I have been working as a Data Engineer and got this issue. I came across a use case where I have a view (let's call it inputView) which is created by reading data from some source. Now somewhere ...
14 votes
6 answers
39k views

I have a StructField in a dataframe that is not nullable. Simple example: import pyspark.sql.functions as F from pyspark.sql.types import * l = [('Alice', 1)] df = sqlContext.createDataFrame(l, ['...
7 votes
2 answers
10k views

I would like to do the following in pyspark (for AWS Glue jobs): JOIN a and b ON a.name = b.name AND a.number= b.number AND a.city LIKE b.city So for example: Table a: Number Name City 1000 Bob % ...
3 votes
3 answers
3k views

I was wondering how to properly extract the number of hours between two given timestamp objects. For instance, when the following SQL query gets executed: select x, extract(HOUR FROM x) as result ...
2 votes
1 answer
11k views

I'm attempting to convert a pipelinedRDD in pyspark to a dataframe. This is the code snippet: newRDD = rdd.map(lambda row: Row(row.__fields__ + ["tag"])(row + (tagScripts(row), ))) df = newRDD.toDF() ...
4 votes
1 answer
8k views

A standard DPU in AWS Glue comes with 4 vCPU and 2 executors. I am confused about the maximum number of concurrent tasks that can be run in parallel with this configuration. Is it 4 or 8 on a single ...
3 votes
2 answers
4k views

I'm trying to get a year-month column using this function: date_format(delivery_date,'mmmmyyyy') but I'm getting wrong values for the month. Example of the output I want to get: if I have this date ...
7 votes
2 answers
13k views

I'm running a PySpark job, and I'm getting the following message: WARN org.apache.spark.sql.execution.Window: No Partition Defined for Window operation! Moving all data to a single partition, this ...
Advice
0 votes
6 replies
154 views

So I am doing some SQL aggregation transformations of a dataset and there is a certain condition that I would like to apply, but I am not sure how. Here is a basic code block: le_test = spark.sql("""...
2 votes
1 answer
2k views

I have loaded a csv file, re-partitioned it to 4, and then took a count of the DataFrame. When I looked at the DAG I see this action is executed in 3 stages. Why is this simple action executed in ...
0 votes
0 answers
83 views

I created a table as follows: CREATE TABLE IF NOT EXISTS raw_data.civ ( date timestamp, marketplace_id int, ... some more columns ) USING ICEBERG PARTITIONED BY ( marketplace_id, ...
0 votes
1 answer
12k views

I am new to Spark and facing an error while converting a .csv file to a dataframe. I am using the pyspark_csv module for the conversion but it gives an error; here is the stack trace. Can anyone ...
0 votes
1 answer
9k views

I am new to Spark and I would like to read a CSV-file to a Dataframe. Spark 1.3.0 / Scala 2.3.0 This is what I have so far: # Start Scala with CSV Package Module spark-shell --packages com....
2 votes
3 answers
3k views

I'm somewhat new to PySpark. I understand that it uses lazy evaluation, meaning that execution of a series of transformations will be deferred until some action is requested, at which point the Spark ...
0 votes
0 answers
44 views

We have a scenario to read a VSAM file directly along with a copybook to understand the column lengths; we were using the COBRIX library as part of the Spark read. However, we could see the same is not properly ...
6 votes
2 answers
18k views

The SparkSession.catalog object has a bunch of methods to interact with the metastore, namely: ['cacheTable', 'clearCache', 'createExternalTable', 'createTable', 'currentDatabase', '...
2 votes
2 answers
9k views

I'm experimenting with Spark and Spark SQL and I need to concatenate a value at the beginning of a string field that I retrieve as output from a select (with a join) like the following: val result = ...
2 votes
1 answer
9k views

I am trying to read some data from parquet file using spark SQL and trying to put that data into some other table. But while writing data into another table I am getting the below error. The parquet ...
0 votes
3 answers
16k views

I have a dataframe with the fields from_date and to_date: (2017-01-10 2017-01-14) (2017-01-03 2017-01-13) and a List of dates 2017-01-05, 2017-01-12, 2017-01-13, 2017-01-15 The idea is to ...
12 votes
3 answers
6k views

I am trying to run spark-submit but it's failing with a weird message. Error: Could not find or load main class org.apache.spark.launcher.Main /opt/spark/bin/spark-class: line 96: CMD: bad array ...
0 votes
2 answers
967 views

I am trying to join a DataFrame (dfA) sequentially on the same DataFrame. Let's say dfA has columns id_x and id_y and dfB has the column id and some other columns. I want to perform the following: ...
1 vote
3 answers
4k views

I have a distributed system on 3 nodes and my data is distributed among those nodes. For example, I have a test.csv file which exists on all 3 nodes, and it contains 4 columns per row: id, C1, C2, ...
1 vote
2 answers
2k views

I have been using Spark-excel (https://github.com/crealytics/spark-excel) to write the output to a single sheet of an Excel file. However, I am unable to write the output to different sheets (tabs). ...
1 vote
3 answers
7k views

I am using pyspark dataframes. I want to read a parquet file and write it with a different schema from the original file. The original schema is (it has 9,000 variables; I am just putting the first 5 ...
