4

I have a big numpy array. Its shape is (800, 224, 224, 3), which means there are 800 images (224 * 224) with 3 channels. For distributed deep learning in Spark, I want to convert the numpy array to a Spark DataFrame.

My method is:

  1. Converted the numpy array to a CSV file
  2. Loaded the CSV and created a Spark DataFrame with 150,528 columns (224 * 224 * 3)
  3. Used VectorAssembler to combine all the columns into a single vector column (features)
  4. Reshaped the output of step 3, but I failed at step 3 because the computation seems to be too heavy

In order to make a vector from this:

+------+------+
|col_1 | col_2|
+------+------+
|0.1434|0.1434|
|0.1434|0.1451|
|0.1434|0.1467|
|0.3046|0.3046|
|0.3046|0.3304|
|0.3249|0.3046|
|0.3249|0.3304|
|0.3258|0.3258|
|0.3258|0.3263|
|0.3258|0.3307|
+------+------+

to this:

+-------------+
|   feature   |
+-------------+
|0.1434,0.1434|
|0.1434,0.1451|
|0.1434,0.1467|
|0.3046,0.3046|
|0.3046,0.3304|
|0.3249,0.3046|
|0.3249,0.3304|
|0.3258,0.3258|
|0.3258,0.3263|
|0.3258,0.3307|
+-------------+

But the number of columns is really large (150,528)...
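For reference, step 3 would look roughly like the sketch below, where csv_df is the DataFrame loaded in step 2 and the column names col_1 ... col_150528 are assumptions about what the CSV load produces:

from pyspark.ml.feature import VectorAssembler

# Column names assumed to come from the CSV load in step 2 (col_1 ... col_150528).
input_cols = ["col_{}".format(i) for i in range(1, 224 * 224 * 3 + 1)]

assembler = VectorAssembler(inputCols=input_cols, outputCol="features")
features_df = assembler.transform(csv_df).select("features")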

I also tried converting the numpy array to an RDD directly, but I got an 'out of memory' error. On a single machine, my job works fine with this numpy array.

1 Comment
Out of memory error, is it? Can you try setting the driver memory to whatever maximum you can give it? I use 6g and my laptop RAM is 8 GB.

3 Answers

5

You should be able to convert the numpy array directly to a Spark DataFrame, without going through a CSV file. You could try something like the code below:

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# numpy_arr is the (800, 224, 224, 3) array from the question; each image is
# flattened to a 150528-element dense vector wrapped in a one-element tuple.
num_rows = 800
arr = map(lambda x: (Vectors.dense(x),), numpy_arr.reshape(num_rows, -1))
df = spark.createDataFrame(arr, ["features"])

4 Comments

Hello, it gives me this error "TypeError: not supported type: <class 'numpy.ndarray'>"
@A.B: Try converting to tuples of vectors and see if it works, you can refer to: stackoverflow.com/questions/41328799/…
This answer does not actually work, given A.B.'s comment and testing.
@JohnStud: You were correct. It seems I didn't hear back from A.B whether using tuples worked and then I forgot about it. I tested it out and updated the answer. It should work now.
2

You can also do this, which I find most convenient:

import numpy as np
import pandas as pd
import pyspark
from pyspark.sql import SQLContext

sc = pyspark.SparkContext()
sqlContext = SQLContext(sc)

# Small 1-D example; pandas handles the row conversion for Spark.
array = np.linspace(0, 10)
df_spark = sqlContext.createDataFrame(pd.DataFrame(array))
df_spark.show()

The only downside is that pandas needs to be installed.
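For the 4-D array in the question, the images would first have to be flattened to 2-D, since a pandas DataFrame only holds two dimensions. A minimal sketch, reusing pd and sqlContext from the snippet above and assuming numpy_arr is the (800, 224, 224, 3) array:

# Flatten each image to one row of 224 * 224 * 3 = 150528 values,
# because a pandas DataFrame can only hold two dimensions.
flat = numpy_arr.reshape(numpy_arr.shape[0], -1)
df_spark = sqlContext.createDataFrame(pd.DataFrame(flat))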


1

If the out-of-memory error occurs on a worker node, increase the executor memory from its default of 1 GB using the spark.executor.memory setting; if the error occurs on the driver, try increasing the driver memory instead, as suggested by @pissall. Also try to identify a proper fraction of memory (spark.memory.fraction) to be used for keeping RDDs in memory.
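These are ordinary Spark configuration keys, so as a rough sketch (the values below are only placeholders to be sized to your cluster) they can be set when the session is built. Note that spark.driver.memory only takes effect if it is set before the driver JVM starts, so for an application launched with spark-submit pass it via --driver-memory or spark-defaults.conf instead:

from pyspark.sql import SparkSession

# Placeholder values; size them to the memory that is actually available.
spark = (
    SparkSession.builder
    .config("spark.executor.memory", "4g")   # memory per executor (worker)
    .config("spark.driver.memory", "6g")     # only effective if set before the driver JVM starts
    .config("spark.memory.fraction", "0.6")  # share of heap for execution and cached data
    .getOrCreate()
)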

3 Comments

Does tweaking the executor memory matter when working with Spark locally? Executors are only used when we have a cluster with multiple worker nodes, right? I suggested increasing the driver memory in this case. Correct me if it works differently.
No, executor memory doesn't matter in local mode, since the executors and the driver run in the same JVM process, whose memory can be increased by setting the driver memory. The question says the job runs well on a single machine, so I assumed he is working in cluster mode.
Yes, I am working in cluster mode. Your answer also helped me a lot! I am new to Spark, especially PySpark and Python. I am slow to make progress on my project, but I think I'm getting there. Thanks all!
