3,973 questions
83 votes · 4 answers · 10k views
How to make good reproducible Apache Spark examples
I've been spending a fair amount of time reading through some questions with the pyspark and spark-dataframe tags, and very often I find that posters don't provide enough information to truly ...
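A hedged sketch of what such a reproducible example might look like: a small inline DataFrame plus its printed output, so anyone can copy, paste, and run it (names and values here are purely illustrative).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repro").getOrCreate()

# Small, self-contained input that anyone can paste and run.
df = spark.createDataFrame([(1, "foo"), (2, "bar")], ["id", "label"])
df.show()
# +---+-----+
# | id|label|
# +---+-----+
# |  1|  foo|
# |  2|  bar|
# +---+-----+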
87 votes · 4 answers · 81k views
Spark functions vs UDF performance?
Spark now offers predefined functions that can be used in dataframes, and it seems they are highly optimized. My original question was going to be on which is faster, but I did some testing myself and ...
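For illustration, a minimal, hedged comparison of the two approaches; the built-in usually wins because it stays in the JVM and avoids Python serialization (df and the name column are assumed here, not taken from the question):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Built-in function: optimized by Catalyst, runs inside the JVM.
df.withColumn("upper_builtin", F.upper("name"))

# Python UDF: every row is serialized to a Python worker and back.
upper_udf = F.udf(lambda s: s.upper() if s is not None else None, StringType())
df.withColumn("upper_udf", upper_udf("name"))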
139 votes · 6 answers · 521k views
Convert pyspark string to date format
I have a PySpark DataFrame with a string column in the format MM-dd-yyyy, and I am attempting to convert it into a date column.
I tried:
df.select(to_date(df.STRING_COLUMN).alias('new_date'))....
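A minimal sketch of one common fix, passing the format string explicitly (to_date accepts a format argument from Spark 2.2 on; on older versions, unix_timestamp plus a cast is the usual workaround):

from pyspark.sql import functions as F

df.select(F.to_date(df.STRING_COLUMN, "MM-dd-yyyy").alias("new_date"))

# Pre-2.2 alternative:
# df.select(F.unix_timestamp("STRING_COLUMN", "MM-dd-yyyy").cast("timestamp").cast("date").alias("new_date"))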
12 votes · 1 answer · 17k views
Using a column value as a parameter to a spark DataFrame function
Consider the following DataFrame:
#+------+---+
#|letter|rpt|
#+------+---+
#| X| 3|
#| Y| 1|
#| Z| 2|
#+------+---+
which can be created using the following code:
df = spark....
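One commonly suggested answer is dropping into a SQL expression, where one column can feed another function's parameter; a minimal sketch using the built-in repeat:

from pyspark.sql import functions as F

df.withColumn("letter_rpt", F.expr("repeat(letter, rpt)")).show()
# X with rpt 3 becomes XXX, Y with rpt 1 stays Y, and so on.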
213 votes · 5 answers · 368k views
How to add a constant column in a Spark DataFrame?
I want to add a column to a DataFrame with some arbitrary value (the same for each row). I get an error when I use withColumn as follows:
dt.withColumn('new_column', 10).head(5)
--------------...
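The usual fix is wrapping the value in lit() so withColumn receives a Column rather than a bare Python integer; a minimal sketch:

from pyspark.sql.functions import lit

dt.withColumn('new_column', lit(10)).head(5)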
34 votes · 1 answer · 16k views
Efficient string matching in Apache Spark
Using an OCR tool, I extracted texts from screenshots (about 1-5 sentences each). However, when manually verifying the extracted text, I noticed several errors that occur from time to time.
Given the ...
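A hedged sketch of one simple approach: compare noisy OCR output against a reference list with the built-in levenshtein function (the extracted and reference DataFrames, their columns, and the threshold are all illustrative; for large data, locality-sensitive hashing from pyspark.ml scales better):

from pyspark.sql import functions as F

# Pair every OCR string with every reference string and keep close matches.
matches = (extracted.crossJoin(reference)
           .withColumn("dist", F.levenshtein("ocr_text", "ref_text"))
           .filter(F.col("dist") <= 3))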
85 votes · 9 answers · 171k views
How to find median and quantiles using Spark
How can I find the median of an RDD of integers using a distributed method, IPython, and Spark? The RDD has approximately 700,000 elements and is therefore too large to collect and find the median.
This ...
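For DataFrames, approxQuantile (Spark 2.0+) is the standard distributed answer; a minimal sketch (converting the question's RDD first, with an illustrative column name; a relative error of 0 requests an exact but more expensive computation):

df = rdd.map(lambda x: (x,)).toDF(["value"])
median = df.approxQuantile("value", [0.5], 0.01)[0]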
46 votes · 1 answer · 17k views
Calling Java/Scala function from a task
Background
My original question here was Why using DecisionTreeModel.predict inside map function raises an exception? and is related to How to generate tuples of (original label, predicted label) on ...
9 votes · 2 answers · 10k views
java.lang.IllegalArgumentException at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source) with Java 10
I started getting the following error any time I try to collect my RDDs. It happened after I installed Java 10.1, so of course I took it out and reinstalled it; same error. I then installed Java 9.04 ...
144 votes · 42 answers · 466k views
PySpark: "Exception: Java gateway process exited before sending the driver its port number"
I'm trying to run PySpark on my MacBook Air. When I try starting it up, I get the error:
Exception: Java gateway process exited before sending the driver its port number
when sc = SparkContext() is ...
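This error is almost always environmental; one commonly reported fix is pointing JAVA_HOME at a supported JDK before the context starts (the path below is illustrative, adjust to your install):

import os

# Illustrative path; Spark 2.x generally wants Java 8.
os.environ["JAVA_HOME"] = "/Library/Java/JavaVirtualMachines/jdk1.8.0_201.jdk/Contents/Home"

from pyspark import SparkContext
sc = SparkContext()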
56 votes · 4 answers · 92k views
Applying UDFs on GroupedData in PySpark (with functioning python example)
I have this Python code that runs locally on a pandas DataFrame:
df_result = pd.DataFrame(df
.groupby('A')
.apply(lambda x: myFunction(zip(x.B, x.C)...
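Since Spark 2.3, grouped-map Pandas UDFs cover exactly this pattern; a hedged sketch (myFunction and the A/B/C columns are carried over from the question, and the schema string with a string-typed result is an assumption):

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("A string, result string", PandasUDFType.GROUPED_MAP)
def apply_my_function(pdf):
    # pdf is a pandas DataFrame holding all rows of one group.
    return pd.DataFrame({"A": [pdf["A"].iloc[0]],
                         "result": [myFunction(list(zip(pdf["B"], pdf["C"])))]})

df_result = df.groupby("A").apply(apply_my_function)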
62 votes · 3 answers · 100k views
Spark Window Functions - rangeBetween dates
I have a Spark SQL DataFrame with a date column, and what I'm trying to get is all the rows preceding the current row within a given date range. So, for example, I want to have all the rows from 7 days back ...
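One common trick is ordering the window by the date cast to a Unix timestamp so rangeBetween can be expressed in seconds; a hedged sketch for a 7-day trailing sum (date and amount are illustrative column names):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

days = lambda n: n * 86400  # the ordering column is in seconds after the cast

w = (Window.orderBy(F.col("date").cast("timestamp").cast("long"))
           .rangeBetween(-days(7), 0))
df.withColumn("rolling_sum", F.sum("amount").over(w))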
94 votes · 17 answers · 129k views
How to link PyCharm with PySpark?
I'm new to Apache Spark, and apparently I installed apache-spark with Homebrew on my MacBook:
Last login: Fri Jan 8 12:52:04 on console
user@MacBook-Pro-de-User-2:~$ pyspark
Python 2.7.10 (default, ...
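One commonly suggested shortcut is the findspark package, which locates the Spark install and patches sys.path from inside the IDE (this assumes SPARK_HOME is set or discoverable; configuring the interpreter paths by hand in PyCharm also works):

import findspark
findspark.init()  # finds SPARK_HOME and adds pyspark to sys.path

import pyspark
sc = pyspark.SparkContext(appName="pycharm-test")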
188 votes · 17 answers · 207k views
How to turn off INFO logging in Spark?
I installed Spark using the AWS EC2 guide and I can launch the program fine using the bin/pyspark script to get to the Spark prompt, and can also do the Quick Start guide successfully.
However, I ...
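Two common options, sketched with no claim to completeness: set the level on a live context, or make it permanent in conf/log4j.properties:

sc.setLogLevel("WARN")  # also accepts ERROR, FATAL, INFO, DEBUG, ...
# Persistent alternative: copy conf/log4j.properties.template to
# conf/log4j.properties and change log4j.rootCategory=INFO, console
# to log4j.rootCategory=WARN, console.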
8 votes · 2 answers · 13k views
How to load jar dependencies in IPython Notebook
This page inspired me to try out spark-csv for reading .csv files in PySpark.
I found a couple of posts such as this describing how to use spark-csv
But I am not able to initialize the ipython ...
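One widely used approach is passing --packages through PYSPARK_SUBMIT_ARGS before the context starts; a hedged sketch (the spark-csv coordinates below refer to that package's published Scala 2.11 build and are illustrative):

import os

os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-csv_2.11:1.5.0 pyspark-shell")

from pyspark import SparkContext
sc = SparkContext()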
83 votes · 2 answers · 189k views
pyspark collect_set or collect_list with groupby
How can I use collect_set or collect_list on a DataFrame after a groupby? For example: df.groupby('key').collect_set('values') gives an error: AttributeError: 'GroupedData' object has no attribute '...
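collect_set is an aggregate function applied inside agg, not a method on GroupedData; a minimal sketch:

from pyspark.sql import functions as F

df.groupby('key').agg(F.collect_set('values').alias('value_set'))
# or F.collect_list('values') to keep duplicates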
29 votes · 14 answers · 62k views
Python worker failed to connect back
I'm trying to complete this Spark tutorial.
After installing Spark on local machine (Win10 64, Python 3, Spark 2.4.0) and setting all env variables (HADOOP_HOME, SPARK_HOME, etc) I'm trying to run a ...
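A frequently reported fix on Windows is pinning the Python workers to the same interpreter as the driver before the context is created; a hedged sketch:

import os, sys

os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable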
25 votes · 2 answers · 27k views
Spark SQL window function with complex condition
This is probably easiest to explain through example. Suppose I have a DataFrame of user logins to a website, for instance:
scala> df.show(5)
+----------------+----------+
| user_name|...
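The excerpt is cut off, but questions in this shape are usually solved by combining window functions with conditional expressions; a hedged, generic sketch in PySpark (the 5-day rule and column names are illustrative assumptions):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("user_name").orderBy("login_date")
df.withColumn("prev_login", F.lag("login_date").over(w)) \
  .withColumn("new_session",
              F.when(F.datediff("login_date", "prev_login") > 5, 1).otherwise(0))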
192 votes · 11 answers · 555k views
How do I add a new column to a Spark DataFrame (using PySpark)?
I have a Spark DataFrame (using PySpark 1.5.1) and would like to add a new column.
I've tried the following without any success:
type(randomed_hours) # => list
# Create in Python and transform ...
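The usual answers boil down to this: withColumn needs a Column expression, so constants go through lit() and derived values through column arithmetic or a UDF; an arbitrary Python list cannot be attached directly. A hedged sketch (the hours column is illustrative):

from pyspark.sql import functions as F

df = df.withColumn("constant", F.lit(10))          # same value on every row
df = df.withColumn("doubled", F.col("hours") * 2)  # computed from an existing column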
100 votes · 10 answers · 109k views
collect_list by preserving order based on another variable
I am trying to create a new column of lists in PySpark using a groupby aggregation on an existing set of columns. An example input data frame is provided below:
------------------------
id | date ...
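One common trick, sketched here with the excerpt's column names assumed: collect structs of (date, value) and sort the array, since collect_list alone guarantees no ordering after a shuffle:

from pyspark.sql import functions as F

result = (df.groupBy("id")
            .agg(F.sort_array(F.collect_list(F.struct("date", "value")))
                  .alias("sorted_list"))
            .withColumn("sorted_list", F.col("sorted_list.value")))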
34 votes · 1 answer · 49k views
Avoid performance impact of a single partition mode in Spark window functions
My question is triggered by the use case of calculating the differences between consecutive rows in a Spark DataFrame.
For example, I have:
>>> df.show()
+-----+----------+
|index| ...
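The single-partition warning comes from Window.orderBy without partitionBy, which funnels every row through one task; when the data has a natural grouping key, partitioning the window restores parallelism. A hedged sketch (group_key and value are illustrative):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("group_key").orderBy("index")
df.withColumn("diff", F.col("value") - F.lag("value").over(w))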
91 votes · 22 answers · 240k views
How to perform union on two DataFrames with different numbers of columns in Spark?
I have 2 DataFrames:
I need a union like this:
The unionAll function doesn't work because the number and names of the columns are different.
How can I do this?
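Common answers, sketched with no claim to completeness: on Spark 3.1+, unionByName can fill in missing columns; on older versions, add the missing columns as null literals and align the column order by hand:

from pyspark.sql import functions as F

# Spark 3.1+
merged = df1.unionByName(df2, allowMissingColumns=True)

# Older versions: add missing columns as nulls, then align the order.
for c in set(df2.columns) - set(df1.columns):
    df1 = df1.withColumn(c, F.lit(None))
for c in set(df1.columns) - set(df2.columns):
    df2 = df2.withColumn(c, F.lit(None))
merged = df1.select(sorted(df1.columns)).union(df2.select(sorted(df2.columns)))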
53 votes · 1 answer · 131k views
Specifying the filename when saving a DataFrame as a CSV [duplicate]
Say I have a Spark DataFrame that I want to save to disk as a CSV file. In Spark 2.0.0+, one can convert a DataFrame (Dataset[Row]) to a DataFrameWriter and use the .csv method to write the file.
The function ...
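Spark always writes a directory of part files, so the common workaround is coalescing to a single partition and renaming the part file afterwards; a hedged sketch for a local filesystem (paths are illustrative):

import glob, shutil

df.coalesce(1).write.csv("/tmp/out_dir", header=True)
part_file = glob.glob("/tmp/out_dir/part-*.csv")[0]
shutil.move(part_file, "/tmp/result.csv")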
6 votes · 1 answer · 6k views
In Apache Spark 2.0.0, is it possible to fetch a query from an external database (rather than grab the whole table)?
Using pyspark:
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.appName("spark play")\
.getOrCreate()
df = spark.read\
.format("jdbc")\
.option("url", "...
108 votes · 11 answers · 124k views
Spark Error - Unsupported class file major version
I'm trying to install Spark on my Mac. I've used Homebrew to install Spark 2.4.0 and Scala. I've installed PySpark in my Anaconda environment and am using PyCharm for development. I've exported to my ...