3,973 questions
83 votes · 4 answers · 10k views
How to make good reproducible Apache Spark examples
I've been spending a fair amount of time reading through some questions with the pyspark and spark-dataframe tags, and very often I find that posters don't provide enough information to truly ...
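A hedged sketch of what such a reproducible example might look like: a small inline DataFrame plus its printed output, so anyone can copy, paste, and run it (names and values here are purely illustrative).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repro").getOrCreate()

# Small, self-contained input that anyone can paste and run.
df = spark.createDataFrame([(1, "foo"), (2, "bar")], ["id", "label"])
df.show()
# +---+-----+
# | id|label|
# +---+-----+
# |  1|  foo|
# |  2|  bar|
# +---+-----+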
87 votes · 4 answers · 81k views
Spark functions vs UDF performance?
Spark now offers predefined functions that can be used in dataframes, and it seems they are highly optimized. My original question was going to be on which is faster, but I did some testing myself and ...
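For illustration, a minimal, hedged comparison of the two approaches; the built-in usually wins because it stays in the JVM and avoids Python serialization (df and the name column are assumed here, not taken from the question):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Built-in function: optimized by Catalyst, runs inside the JVM.
df.withColumn("upper_builtin", F.upper("name"))

# Python UDF: every row is serialized to a Python worker and back.
upper_udf = F.udf(lambda s: s.upper() if s is not None else None, StringType())
df.withColumn("upper_udf", upper_udf("name"))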
139 votes · 6 answers · 521k views
Convert pyspark string to date format
I have a PySpark DataFrame with a string column in the format MM-dd-yyyy, and I am attempting to convert it into a date column.
I tried:
df.select(to_date(df.STRING_COLUMN).alias('new_date'))....
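A minimal sketch of one common fix, passing the format string explicitly (to_date accepts a format argument from Spark 2.2 on; on older versions, unix_timestamp plus a cast is the usual workaround):

from pyspark.sql import functions as F

df.select(F.to_date(df.STRING_COLUMN, "MM-dd-yyyy").alias("new_date"))

# Pre-2.2 alternative:
# df.select(F.unix_timestamp("STRING_COLUMN", "MM-dd-yyyy").cast("timestamp").cast("date").alias("new_date"))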
12 votes · 1 answer · 17k views
Using a column value as a parameter to a spark DataFrame function
Consider the following DataFrame:
#+------+---+
#|letter|rpt|
#+------+---+
#| X| 3|
#| Y| 1|
#| Z| 2|
#+------+---+
which can be created using the following code:
df = spark....
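One commonly suggested answer is dropping into a SQL expression, where one column can feed another function's parameter; a minimal sketch using the built-in repeat:

from pyspark.sql import functions as F

df.withColumn("letter_rpt", F.expr("repeat(letter, rpt)")).show()
# X with rpt 3 becomes XXX, Y with rpt 1 stays Y, and so on.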
213 votes · 5 answers · 368k views
How to add a constant column in a Spark DataFrame?
I want to add a column to a DataFrame with some arbitrary value (the same for each row). I get an error when I use withColumn as follows:
dt.withColumn('new_column', 10).head(5)
--------------...
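The usual fix is wrapping the value in lit() so withColumn receives a Column rather than a bare Python integer; a minimal sketch:

from pyspark.sql.functions import lit

dt.withColumn('new_column', lit(10)).head(5)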
34 votes · 1 answer · 16k views
Efficient string matching in Apache Spark
Using an OCR tool, I extracted texts from screenshots (about 1-5 sentences each). However, when manually verifying the extracted text, I noticed several errors that occur from time to time.
Given the ...
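A hedged sketch of one simple approach: compare noisy OCR output against a reference list with the built-in levenshtein function (the extracted and reference DataFrames, their columns, and the threshold are all illustrative; for large data, locality-sensitive hashing from pyspark.ml scales better):

from pyspark.sql import functions as F

# Pair every OCR string with every reference string and keep close matches.
matches = (extracted.crossJoin(reference)
           .withColumn("dist", F.levenshtein("ocr_text", "ref_text"))
           .filter(F.col("dist") <= 3))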
85 votes · 9 answers · 171k views
How to find median and quantiles using Spark
How can I find the median of an RDD of integers using a distributed method, IPython, and Spark? The RDD has approximately 700,000 elements and is therefore too large to collect and find the median.
This ...
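For DataFrames, approxQuantile (Spark 2.0+) is the standard distributed answer; a minimal sketch (converting the question's RDD first, with an illustrative column name; a relative error of 0 requests an exact but more expensive computation):

df = rdd.map(lambda x: (x,)).toDF(["value"])
median = df.approxQuantile("value", [0.5], 0.01)[0]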
46 votes · 1 answer · 17k views
Calling Java/Scala function from a task
Background
My original question here was Why using DecisionTreeModel.predict inside map function raises an exception? and is related to How to generate tuples of (original label, predicted label) on ...
9 votes · 2 answers · 10k views
java.lang.IllegalArgumentException at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source) with Java 10
I started getting the following error any time I try to collect my RDDs. It happened after I installed Java 10.1, so of course I took it out and reinstalled it; same error. I then installed Java 9.04 ...
144 votes · 42 answers · 466k views
PySpark: "Exception: Java gateway process exited before sending the driver its port number"
I'm trying to run PySpark on my MacBook Air. When I try starting it up, I get the error:
Exception: Java gateway process exited before sending the driver its port number
when sc = SparkContext() is ...
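This error is almost always environmental; one commonly reported fix is pointing JAVA_HOME at a supported JDK before the context starts (the path below is illustrative, adjust to your install):

import os

# Illustrative path; Spark 2.x generally wants Java 8.
os.environ["JAVA_HOME"] = "/Library/Java/JavaVirtualMachines/jdk1.8.0_201.jdk/Contents/Home"

from pyspark import SparkContext
sc = SparkContext()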
56 votes · 4 answers · 92k views
Applying UDFs on GroupedData in PySpark (with functioning python example)
I have this Python code that runs locally on a pandas DataFrame:
df_result = pd.DataFrame(df
.groupby('A')
.apply(lambda x: myFunction(zip(x.B, x.C)...
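Since Spark 2.3, grouped-map Pandas UDFs cover exactly this pattern; a hedged sketch (myFunction and the A/B/C columns are carried over from the question, and the schema string with a string-typed result is an assumption):

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("A string, result string", PandasUDFType.GROUPED_MAP)
def apply_my_function(pdf):
    # pdf is a pandas DataFrame holding all rows of one group.
    return pd.DataFrame({"A": [pdf["A"].iloc[0]],
                         "result": [myFunction(list(zip(pdf["B"], pdf["C"])))]})

df_result = df.groupby("A").apply(apply_my_function)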
62 votes · 3 answers · 100k views
Spark Window Functions - rangeBetween dates
I have a Spark SQL DataFrame with a date column, and what I'm trying to get is all the rows preceding the current row within a given date range. So, for example, I want to have all the rows from 7 days back ...
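One common trick is ordering the window by the date cast to a Unix timestamp so rangeBetween can be expressed in seconds; a hedged sketch for a 7-day trailing sum (date and amount are illustrative column names):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

days = lambda n: n * 86400  # the ordering column is in seconds after the cast

w = (Window.orderBy(F.col("date").cast("timestamp").cast("long"))
           .rangeBetween(-days(7), 0))
df.withColumn("rolling_sum", F.sum("amount").over(w))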
94 votes · 17 answers · 129k views
How to link PyCharm with PySpark?
I'm new to Apache Spark, and apparently I installed apache-spark with Homebrew on my MacBook:
Last login: Fri Jan 8 12:52:04 on console
user@MacBook-Pro-de-User-2:~$ pyspark
Python 2.7.10 (default, ...
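One commonly suggested shortcut is the findspark package, which locates the Spark install and patches sys.path from inside the IDE (this assumes SPARK_HOME is set or discoverable; configuring the interpreter paths by hand in PyCharm also works):

import findspark
findspark.init()  # finds SPARK_HOME and adds pyspark to sys.path

import pyspark
sc = pyspark.SparkContext(appName="pycharm-test")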
188 votes · 17 answers · 207k views
How to turn off INFO logging in Spark?
I installed Spark using the AWS EC2 guide and I can launch the program fine using the bin/pyspark script to get to the Spark prompt, and can also do the Quick Start guide successfully.
However, I ...
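Two common options, sketched with no claim to completeness: set the level on a live context, or make it permanent in conf/log4j.properties:

sc.setLogLevel("WARN")  # also accepts ERROR, FATAL, INFO, DEBUG, ...
# Persistent alternative: copy conf/log4j.properties.template to
# conf/log4j.properties and change log4j.rootCategory=INFO, console
# to log4j.rootCategory=WARN, console.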
8 votes · 2 answers · 13k views
How to load jar dependencies in IPython Notebook
This page inspired me to try out spark-csv for reading .csv files in PySpark.
I found a couple of posts such as this describing how to use spark-csv
But I am not able to initialize the ipython ...
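One widely used approach is passing --packages through PYSPARK_SUBMIT_ARGS before the context starts; a hedged sketch (the spark-csv coordinates below refer to that package's published Scala 2.11 build and are illustrative):

import os

os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-csv_2.11:1.5.0 pyspark-shell")

from pyspark import SparkContext
sc = SparkContext()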
83 votes · 2 answers · 189k views
pyspark collect_set or collect_list with groupby
How can I use collect_set or collect_list on a DataFrame after a groupby? For example: df.groupby('key').collect_set('values') gives an error: AttributeError: 'GroupedData' object has no attribute '...
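collect_set is an aggregate function applied inside agg, not a method on GroupedData; a minimal sketch:

from pyspark.sql import functions as F

df.groupby('key').agg(F.collect_set('values').alias('value_set'))
# or F.collect_list('values') to keep duplicates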
29 votes · 14 answers · 62k views
Python worker failed to connect back
I'm trying to complete this Spark tutorial.
After installing Spark on local machine (Win10 64, Python 3, Spark 2.4.0) and setting all env variables (HADOOP_HOME, SPARK_HOME, etc) I'm trying to run a ...
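A frequently reported fix on Windows is pinning the Python workers to the same interpreter as the driver before the context is created; a hedged sketch:

import os, sys

os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable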
25 votes · 2 answers · 27k views
Spark SQL window function with complex condition
This is probably easiest to explain through example. Suppose I have a DataFrame of user logins to a website, for instance:
scala> df.show(5)
+----------------+----------+
| user_name|...
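The excerpt is cut off, but questions in this shape are usually solved by combining window functions with conditional expressions; a hedged, generic sketch in PySpark (the 5-day rule and column names are illustrative assumptions):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("user_name").orderBy("login_date")
df.withColumn("prev_login", F.lag("login_date").over(w)) \
  .withColumn("new_session",
              F.when(F.datediff("login_date", "prev_login") > 5, 1).otherwise(0))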
192 votes · 11 answers · 555k views
How do I add a new column to a Spark DataFrame (using PySpark)?
I have a Spark DataFrame (using PySpark 1.5.1) and would like to add a new column.
I've tried the following without any success:
type(randomed_hours) # => list
# Create in Python and transform ...
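The usual answers boil down to this: withColumn needs a Column expression, so constants go through lit() and derived values through column arithmetic or a UDF; an arbitrary Python list cannot be attached directly. A hedged sketch (the hours column is illustrative):

from pyspark.sql import functions as F

df = df.withColumn("constant", F.lit(10))          # same value on every row
df = df.withColumn("doubled", F.col("hours") * 2)  # computed from an existing column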
100 votes · 10 answers · 109k views
collect_list by preserving order based on another variable
I am trying to create a new column of lists in PySpark using a groupby aggregation on an existing set of columns. An example input data frame is provided below:
------------------------
id | date ...
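One common trick, sketched here with the excerpt's column names assumed: collect structs of (date, value) and sort the array, since collect_list alone guarantees no ordering after a shuffle:

from pyspark.sql import functions as F

result = (df.groupBy("id")
            .agg(F.sort_array(F.collect_list(F.struct("date", "value")))
                  .alias("sorted_list"))
            .withColumn("sorted_list", F.col("sorted_list.value")))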
34 votes · 1 answer · 49k views
Avoid performance impact of a single partition mode in Spark window functions
My question is triggered by the use case of calculating the differences between consecutive rows in a Spark DataFrame.
For example, I have:
>>> df.show()
+-----+----------+
|index| ...
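The single-partition warning comes from Window.orderBy without partitionBy, which funnels every row through one task; when the data has a natural grouping key, partitioning the window restores parallelism. A hedged sketch (group_key and value are illustrative):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("group_key").orderBy("index")
df.withColumn("diff", F.col("value") - F.lag("value").over(w))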
91 votes · 22 answers · 240k views
How to perform union on two DataFrames with different numbers of columns in Spark?
I have 2 DataFrames:
I need a union like this:
The unionAll function doesn't work because the number and names of the columns are different.
How can I do this?
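Common answers, sketched with no claim to completeness: on Spark 3.1+, unionByName can fill in missing columns; on older versions, add the missing columns as null literals and align the column order by hand:

from pyspark.sql import functions as F

# Spark 3.1+
merged = df1.unionByName(df2, allowMissingColumns=True)

# Older versions: add missing columns as nulls, then align the order.
for c in set(df2.columns) - set(df1.columns):
    df1 = df1.withColumn(c, F.lit(None))
for c in set(df1.columns) - set(df2.columns):
    df2 = df2.withColumn(c, F.lit(None))
merged = df1.select(sorted(df1.columns)).union(df2.select(sorted(df2.columns)))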
53 votes · 1 answer · 131k views
Specifying the filename when saving a DataFrame as a CSV [duplicate]
Say I have a Spark DataFrame that I want to save to disk as a CSV file. In Spark 2.0.0+, one can convert a DataFrame (Dataset[Row]) to a DataFrameWriter and use the .csv method to write the file.
The function ...
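Spark always writes a directory of part files, so the common workaround is coalescing to a single partition and renaming the part file afterwards; a hedged sketch for a local filesystem (paths are illustrative):

import glob, shutil

df.coalesce(1).write.csv("/tmp/out_dir", header=True)
part_file = glob.glob("/tmp/out_dir/part-*.csv")[0]
shutil.move(part_file, "/tmp/result.csv")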
6 votes · 1 answer · 6k views
In Apache Spark 2.0.0, is it possible to fetch a query from an external database (rather than grab the whole table)?
Using pyspark:
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.appName("spark play")\
.getOrCreate()
df = spark.read\
.format("jdbc")\
.option("url", "...
108 votes · 11 answers · 124k views
Spark Error - Unsupported class file major version
I'm trying to install Spark on my Mac. I've used Homebrew to install Spark 2.4.0 and Scala. I've installed PySpark in my Anaconda environment and am using PyCharm for development. I've exported to my ...