
I am trying to convert a pyspark dataframe column with approximately 90 million rows into a numpy array.

I need the array as input to the scipy.optimize.minimize function.

I have tried both converting to Pandas and using collect(), but these methods are very time consuming.

I am new to PySpark. If there is a faster and better approach to do this, please help.

Thanks

This is what my dataframe looks like:

+----------+
|Adolescent|
+----------+
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
+----------+
  • Have you tried df['Adolescent'].to_numpy() or df['Adolescent'].array? Commented Sep 30, 2019 at 7:40
  • It looks like to_numpy() works for pandas data frames only, not pyspark. I tried df["Adolescent"].array, which gives the output "Column<b'Adolescent[array]'>". I don't know how to use this as an array (illustrated in the sketch below). Commented Sep 30, 2019 at 8:59
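A minimal sketch of why that happens, assuming pandas and pyspark are installed (the values are hypothetical):

import pandas as pd

# pandas: indexing a column returns a Series that holds actual data,
# so converting to numpy works directly.
pdf = pd.DataFrame({"Adolescent": [0.0, 0.0, 0.0]})
pdf["Adolescent"].to_numpy()  # -> array([0., 0., 0.])

# PySpark: df["Adolescent"] is a lazy Column expression, not data.
# Attribute access like .array is interpreted as a field/element lookup
# on that expression, which is why it prints Column<b'Adolescent[array]'>
# instead of returning values. Getting the data onto the driver requires
# collect() or toPandas() first.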

2 Answers


#1

You will have to call .collect() one way or another. To create a numpy array from the pyspark dataframe, you can use:

adoles = np.array(df.select("Adolescent").collect()) #.reshape(-1) for 1-D array

#2

You can convert it to a pandas dataframe using toPandas(), and then convert that to a numpy array using .values.

pdf = df.toPandas()
adoles = pdf["Adolescent"].values  # index the pandas frame, not the Spark one

Or simply:

adoles = df.select("Adolescent").toPandas().values #.reshape(-1) for 1-D array
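If toPandas() itself is the slow step, enabling Apache Arrow usually speeds up the Spark-to-pandas conversion considerably. A sketch, assuming PyArrow is installed on the driver; note that the config key depends on the Spark version:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")  # Spark 2.3-2.4
# spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")  # Spark 3.x+

adoles = df.select("Adolescent").toPandas().values.reshape(-1)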

#3

For distributed arrays, you can try Dask Arrays.

I haven't tested this, but I assume it would work the same way as numpy (there might be inconsistencies):

import dask.array as da
adoles = da.array(df.select("Adolescent").collect()) #.reshape(-1) for 1-D array
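One caveat with this untested sketch: a dask array is lazy, and scipy.optimize.minimize expects a concrete numpy array, so it would have to be materialized first:

adoles_np = adoles.compute()  # evaluates the dask graph, returns a numpy array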

4 Comments

I have tried using toPandas() but it is taking a lot of time.
You should have a look at Dask Arrays
Thanks for your help. I am trying with dask arrays now.
np.concatenate(df.select("user_id").rdd.glom().map(lambda x: np.array([elem[0] for elem in x])).collect()) medium.com/@karthik.jayaraman1/… (expanded in the sketch below)
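Spelled out, the approach in that last comment looks like this (a sketch using the comment's "user_id" column): glom() turns each partition into a single Python list, so the map emits one numpy array per partition, and the driver only has to concatenate those instead of building one huge list of Row objects.

import numpy as np

parts = (
    df.select("user_id")
      .rdd
      .glom()  # one Python list per partition
      .map(lambda part: np.array([row[0] for row in part]))
      .collect()  # list of per-partition numpy arrays
)
arr = np.concatenate(parts)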

Another way is to convert the selected column to an RDD, flatten it by extracting the value of each Row (you can abuse .keys()), and then convert to a numpy array:

x = df.select("colname").rdd.map(lambda r: r[0]).collect()  # python list
np.array(x)  # numpy array
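For completeness, here is how the resulting array could feed scipy.optimize.minimize, which is what the question needs. The objective function below is hypothetical, purely to show the wiring:

import numpy as np
from scipy.optimize import minimize

x = np.array(df.select("Adolescent").rdd.map(lambda r: r[0]).collect())

def objective(theta):
    # placeholder least-squares objective; replace with the real one
    return np.sum((x - theta) ** 2)

res = minimize(objective, x0=np.array([0.0]))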

