
I have a PySpark dataframe child with columns like:

lat1 lon1
80    70
65    75

I am trying to convert it into a NumPy matrix using IndexedRowMatrix, as below:

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

mat = IndexedRowMatrix(child.select('lat1', 'lon1').rdd.map(lambda row: IndexedRow(row[0], Vectors.dense(row[1:]))))

But it's throwing an error. I want to avoid converting to a pandas dataframe to get the matrix.

error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 33.0 failed 4 times, most recent failure: Lost task 0.3 in stage 33.0 (TID 733, ebdp-avdc-d281p.sys.comcast.net, executor 16): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/data/02/yarn/nm/usercache/mbansa001c/appcache/application_1506130884691_56333/container_e48_1506130884691_56333_01_000017/pyspark.zip/pyspark/worker.py", line 174, in main
    process()

1 Answer


You want to avoid pandas, but you try to convert to an RDD, which is severely suboptimal...

Anyway, assuming you can collect the selected columns of your child dataframe (a reasonable assumption, since you aim to put them in a NumPy array), it can be done with plain NumPy:

import numpy as np
np.array(child.select('lat1', 'lon1').collect())
# array([[80, 70], 
#        [65, 75]])
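
If you really do need a distributed matrix rather than a local NumPy array, note that your original call passes row[0] (a latitude value) as the IndexedRow index. One working pattern is to generate the row index with zipWithIndex instead; a minimal sketch, assuming both columns are numeric:

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# zipWithIndex pairs each Row with a sequential index: (Row, i)
indexed = (child.select('lat1', 'lon1').rdd
           .zipWithIndex()
           .map(lambda ri: IndexedRow(ri[1], Vectors.dense(list(ri[0])))))

mat = IndexedRowMatrix(indexed)
# mat.numRows(), mat.numCols()  ->  (2, 2)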