83 votes · 4 answers · 10k views

I've been spending a fair amount of time reading through some questions with the pyspark and spark-dataframe tags and very often I find that posters don't provide enough information to truly ...
asked by pault
87 votes · 4 answers · 81k views

Spark now offers predefined functions that can be used in dataframes, and it seems they are highly optimized. My original question was going to be about which is faster, but I did some testing myself and ...
asked by alfredox
139 votes · 6 answers · 521k views

I have a PySpark dataframe with a string column in the format MM-dd-yyyy, and I am attempting to convert this into a date column. I tried: df.select(to_date(df.STRING_COLUMN).alias('new_date'))....
asked by Jenks
12 votes · 1 answer · 17k views

Consider the following DataFrame:
#+------+---+
#|letter|rpt|
#+------+---+
#|     X|  3|
#|     Y|  1|
#|     Z|  2|
#+------+---+
which can be created using the following code: df = spark....
asked by pault
213 votes · 5 answers · 368k views

I want to add a column to a DataFrame with some arbitrary value (the same for each row). I get an error when I use withColumn as follows: dt.withColumn('new_column', 10).head(5) --------------...
asked by Evan Zamir
34 votes · 1 answer · 16k views

Using an OCR tool, I extracted texts from screenshots (about 1-5 sentences each). However, when manually verifying the extracted text, I noticed several errors that occur from time to time. Given the ...
asked by mrtnsd
85 votes · 9 answers · 171k views

How can I find the median of an RDD of integers using a distributed method, IPython, and Spark? The RDD has approximately 700,000 elements and is therefore too large to collect and find the median. This ...
asked by pr338
46 votes · 1 answer · 17k views

Background: my original question here was "Why does using DecisionTreeModel.predict inside a map function raise an exception?" and is related to "How to generate tuples of (original label, predicted label) on ...
asked by zero323
9 votes · 2 answers · 10k views

I started getting the following error anytime I try to collect my RDDs. It happened after I installed Java 10.1, so of course I took it out and reinstalled it; same error. I then installed Java 9.04 ...
asked by Jimbo
144 votes · 42 answers · 466k views

I'm trying to run PySpark on my MacBook Air. When I try starting it up, I get the error: Exception: Java gateway process exited before sending the driver its port number when sc = SparkContext() is ...
asked by mt88
56 votes · 4 answers · 92k views

I have this Python code that runs locally in a pandas dataframe:
df_result = pd.DataFrame(df
                          .groupby('A')
                          .apply(lambda x: myFunction(zip(x.B, x.C)...
asked by arosner09
62 votes · 3 answers · 100k views

I have a Spark SQL DataFrame with a date column, and what I'm trying to get is all the rows preceding the current row in a given date range. So, for example, I want to have all the rows from 7 days back ...
asked by Nhor
94 votes · 17 answers · 129k views

I'm new to Apache Spark, and apparently I installed apache-spark with Homebrew on my MacBook:
Last login: Fri Jan 8 12:52:04 on console
user@MacBook-Pro-de-User-2:~$ pyspark
Python 2.7.10 (default, ...
asked by tumbleweed
188 votes · 17 answers · 207k views

I installed Spark using the AWS EC2 guide and I can launch the program fine using the bin/pyspark script to get to the Spark prompt, and can also do the Quick Start guide successfully. However, I ...
asked by horatio1701d
8 votes · 2 answers · 13k views

This page inspired me to try out spark-csv for reading .csv files in PySpark. I found a couple of posts, such as this one, describing how to use spark-csv, but I am not able to initialize the ipython ...
asked by KarthikS
83 votes · 2 answers · 189k views

How can I use collect_set or collect_list on a dataframe after groupby? For example: df.groupby('key').collect_set('values'). I get an error: AttributeError: 'GroupedData' object has no attribute '...
asked by Hanan Shteingart
29 votes · 14 answers · 62k views

I'm trying to complete this Spark tutorial. After installing Spark on my local machine (Win10 64, Python 3, Spark 2.4.0) and setting all the env variables (HADOOP_HOME, SPARK_HOME, etc.) I'm trying to run a ...
asked by Mike D.
25 votes · 2 answers · 27k views

This is probably easiest to explain through an example. Suppose I have a DataFrame of user logins to a website, for instance:
scala> df.show(5)
+----------------+----------+
|       user_name|...
asked by user4601931
192 votes · 11 answers · 555k views

I have a Spark DataFrame (using PySpark 1.5.1) and would like to add a new column. I've tried the following without any success:
type(randomed_hours) # => list
# Create in Python and transform ...
asked by Boris
100 votes · 10 answers · 109k views

I am trying to create a new column of lists in PySpark using a groupby aggregation on an existing set of columns. An example input data frame is provided below:
------------------------
id | date ...
asked by Ravi
34 votes · 1 answer · 49k views

My question is triggered by the use case of calculating the differences between consecutive rows in a Spark dataframe. For example, I have:
>>> df.show()
+-----+----------+
|index| ...
asked by Ytsen de Boer
91 votes · 22 answers · 240k views

I have 2 DataFrames. I need a union like this: The unionAll function doesn't work because the number and the names of the columns are different. How can I do this?
asked by Allan Feliph
53 votes · 1 answer · 131k views

Say I have a Spark DF that I want to save to disk as a CSV file. In Spark 2.0.0+, one can convert a DataFrame (Dataset[Row]) to a DataFrameWriter and use the .csv method to write the file. The function ...
asked by Spandan Brahmbhatt
6 votes · 1 answer · 6k views

Using pyspark:
from pyspark.sql import SparkSession
spark = SparkSession\
    .builder\
    .appName("spark play")\
    .getOrCreate()
df = spark.read\
    .format("jdbc")\
    .option("url", "...
asked by PBL
108 votes · 11 answers · 124k views

I'm trying to install Spark on my Mac. I've used Homebrew to install Spark 2.4.0 and Scala. I've installed PySpark in my Anaconda environment and am using PyCharm for development. I've exported to my ...
asked by shbfy
