
I have a PySpark dataframe child with columns like:

lat1 lon1
80    70
65    75

I am trying to convert it into a NumPy matrix using IndexedRowMatrix, as below:

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

mat = IndexedRowMatrix(child.select('lat1', 'lon1').rdd.map(lambda row: IndexedRow(row[0], Vectors.dense(row[1:]))))

But it's throwing an error. I want to avoid converting to a pandas dataframe to get the matrix.

error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 33.0 failed 4 times, most recent failure: Lost task 0.3 in stage 33.0 (TID 733, ebdp-avdc-d281p.sys.comcast.net, executor 16): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/data/02/yarn/nm/usercache/mbansa001c/appcache/application_1506130884691_56333/container_e48_1506130884691_56333_01_000017/pyspark.zip/pyspark/worker.py", line 174, in main
    process()

1 Answer


You want to avoid pandas, but you try to convert to an RDD, which is severely suboptimal...

Anyway, assuming you can collect the selected columns of your child dataframe (a reasonable assumption, since you aim to put them in a NumPy array), it can be done with plain NumPy:

import numpy as np
np.array(child.select('lat1', 'lon1').collect())
# array([[80, 70], 
#        [65, 75]])
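
If you really do need a distributed matrix rather than a local NumPy array, note that your original call passes row[0] (a latitude value) as the IndexedRow index. One working pattern is to generate the row index with zipWithIndex instead; a minimal sketch, assuming both columns are numeric:

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# zipWithIndex pairs each Row with a sequential index: (Row, i)
indexed = (child.select('lat1', 'lon1').rdd
           .zipWithIndex()
           .map(lambda ri: IndexedRow(ri[1], Vectors.dense(list(ri[0])))))

mat = IndexedRowMatrix(indexed)
# mat.numRows(), mat.numCols()  ->  (2, 2)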