
I am trying to convert a pyspark dataframe column with approximately 90 million rows into a numpy array.

I need the array as input to the scipy.optimize.minimize function.

I have tried both converting to Pandas and using collect(), but these methods are very time consuming.

I am new to PySpark. If there is a faster and better approach to do this, please help.

Thanks

This is what my dataframe looks like:

+----------+
|Adolescent|
+----------+
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
+----------+
  • Have you tried df['Adolescent'].to_numpy() or df['Adolescent'].array? Commented Sep 30, 2019 at 7:40
  • It looks like to_numpy() works for pandas data frames only, not pyspark. I tried df["Adolescent"].array, which gives the output "Column<b'Adolescent[array]'>". I don't know how to use this as an array (illustrated in the sketch below). Commented Sep 30, 2019 at 8:59
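A minimal sketch of why that happens, assuming pandas and pyspark are installed (the values are hypothetical):

import pandas as pd

# pandas: indexing a column returns a Series that holds actual data,
# so converting to numpy works directly.
pdf = pd.DataFrame({"Adolescent": [0.0, 0.0, 0.0]})
pdf["Adolescent"].to_numpy()  # -> array([0., 0., 0.])

# PySpark: df["Adolescent"] is a lazy Column expression, not data.
# Attribute access like .array is interpreted as a field/element lookup
# on that expression, which is why it prints Column<b'Adolescent[array]'>
# instead of returning values. Getting the data onto the driver requires
# collect() or toPandas() first.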

2 Answers


#1

You will have to call .collect() one way or another. To create a numpy array from the pyspark dataframe, you can use:

adoles = np.array(df.select("Adolescent").collect()) #.reshape(-1) for 1-D array

#2

You can convert it to a pandas dataframe using toPandas(), and then convert that to a numpy array using .values.

pdf = df.toPandas()
adoles = pdf["Adolescent"].values  # index the pandas frame, not the Spark one

Or simply:

adoles = df.select("Adolescent").toPandas().values #.reshape(-1) for 1-D array
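If toPandas() itself is the slow step, enabling Apache Arrow usually speeds up the Spark-to-pandas conversion considerably. A sketch, assuming PyArrow is installed on the driver; note that the config key depends on the Spark version:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")  # Spark 2.3-2.4
# spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")  # Spark 3.x+

adoles = df.select("Adolescent").toPandas().values.reshape(-1)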

#3

For distributed arrays, you can try Dask Arrays.

I haven't tested this, but I assume it would work the same way as numpy (there might be inconsistencies):

import dask.array as da
adoles = da.array(df.select("Adolescent").collect()) #.reshape(-1) for 1-D array
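One caveat with this untested sketch: a dask array is lazy, and scipy.optimize.minimize expects a concrete numpy array, so it would have to be materialized first:

adoles_np = adoles.compute()  # evaluates the dask graph, returns a numpy array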

4 Comments

I have tried using toPandas() but it is taking a lot of time.
You should have a look at Dask Arrays
Thanks for your help. I am trying with dask arrays now.
np.concatenate(df.select("user_id").rdd.glom().map(lambda x: np.array([elem[0] for elem in x])).collect()) medium.com/@karthik.jayaraman1/… (expanded in the sketch below)
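Spelled out, the approach in that last comment looks like this (a sketch using the comment's "user_id" column): glom() turns each partition into a single Python list, so the map emits one numpy array per partition, and the driver only has to concatenate those instead of building one huge list of Row objects.

import numpy as np

parts = (
    df.select("user_id")
      .rdd
      .glom()  # one Python list per partition
      .map(lambda part: np.array([row[0] for row in part]))
      .collect()  # list of per-partition numpy arrays
)
arr = np.concatenate(parts)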

Another way is to convert the selected column to an RDD, flatten it by extracting the value of each Row (you can abuse .keys()), and then convert to a numpy array:

x = df.select("colname").rdd.map(lambda r: r[0]).collect()  # python list
np.array(x)  # numpy array
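For completeness, here is how the resulting array could feed scipy.optimize.minimize, which is what the question needs. The objective function below is hypothetical, purely to show the wiring:

import numpy as np
from scipy.optimize import minimize

x = np.array(df.select("Adolescent").rdd.map(lambda r: r[0]).collect())

def objective(theta):
    # placeholder least-squares objective; replace with the real one
    return np.sum((x - theta) ** 2)

res = minimize(objective, x0=np.array([0.0]))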

