How to convert spark sql dataframe to numpy array?

Question

I'm using pyspark and imported a hive table into a dataframe.

df = sqlContext.sql("from hive_table select *")

I need help converting this df to a numpy array. You may assume hive_table has only one column.

Can you please suggest? Thank you in advance.

zero323 · Accepted Answer · 2017-01-20 21:31:05Z

You can:

sqlContext.range(0, 10).toPandas().values  # .reshape(-1) for 1d array

array([[0],
       [1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8],
       [9]])

but it is unlikely you really want to. The resulting array is local to the driver node, so it is rarely useful. If you're looking for some variant of a distributed array-like data structure, there are a number of possible choices in Apache Spark:

pyspark.mllib.linalg.distributed which provides a number of distributed matrix classes.
sparkit-learn ArrayRDD.

and independent of Apache Spark:

Dask dask.array.
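
For the single-column case in the question, the local conversion and the `.reshape(-1)` step can be sketched as follows. This is only an illustration: to keep it runnable without a Spark cluster, a hand-built pandas DataFrame stands in for the result of `toPandas()`, and the helper name `column_to_1d_array` is hypothetical, not part of Spark or pandas.

```python
import numpy as np
import pandas as pd


def column_to_1d_array(pdf: pd.DataFrame) -> np.ndarray:
    """Flatten a single-column pandas DataFrame into a 1-D NumPy array.

    Hypothetical helper for illustration; `pdf` is assumed to be what
    `df.toPandas()` returned on the driver.
    """
    if pdf.shape[1] != 1:
        raise ValueError("expected exactly one column")
    return pdf.values.reshape(-1)


# Stand-in for sqlContext.range(0, 10).toPandas() (assumption: no Spark here).
pdf = pd.DataFrame({"id": range(10)})

two_d = pdf.values               # 2-D column array, one row per record
one_d = column_to_1d_array(pdf)  # flattened to one dimension

print(two_d.shape, one_d.shape)
```

With a real Spark DataFrame the same helper would be called as `column_to_1d_array(df.toPandas())`, which still pulls every row onto the driver, so the caveat above applies.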
