
I have a numpy array in pyspark, and I would like to convert it to a DataFrame so that I can write it out as a CSV to view it.

I initially read the data in from a DataFrame, but I had to convert it to a numpy array in order to use numpy.random.normal(). Now I want to convert the data back so I can write it out as a CSV.
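For reference, this is roughly what I'm doing (df_in and the column name are illustrative):

import numpy as np

# pull one numeric column out of the source DataFrame into a local list
values = [row["SOR"] for row in df_in.select("SOR").collect()]

# numpy.random.normal() needs a local array, hence the conversion
zarr = np.array(values) + np.random.normal(0, 1, len(values))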

I have tried the following directly on the array:

zarr.write.csv("/mylocation/inHDFS")

However, I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'numpy.ndarray' object has no attribute 'write'

Any ideas?

2 Answers


A numpy array and a Spark DataFrame are totally different structures. The first is local and doesn't have column names; the second is distributed (or distribute-ready in local mode) and has named, strongly typed columns.
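To make that concrete, here's a quick sketch (assuming an active SparkSession bound to spark; the column name is a placeholder):

import numpy as np

arr = np.random.normal(size=5)    # local array: a single dtype, no column names
print(arr.dtype)                  # float64

sdf = spark.createDataFrame([(float(x),) for x in arr], ["SOR"])
sdf.printSchema()                 # one named column, "SOR", with an explicit double type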

I'd recommend converting the numpy array to a pandas DataFrame first, as described here: Creating a Pandas DataFrame from a Numpy array: How do I specify the index column and column headers?, and then converting that to a Spark DataFrame using:

df = spark.createDataFrame(pandas_df)
df.write.csv('/hdfs/path')
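Putting the pieces together, a minimal end-to-end sketch might look like this (the column name and output path are placeholders):

import numpy as np
import pandas as pd

zarr = np.random.normal(size=100)                # stand-in for your array

# numpy -> pandas, supplying the column header here
pandas_df = pd.DataFrame(zarr, columns=["SOR"])

# pandas -> Spark, then write out as CSV
df = spark.createDataFrame(pandas_df)
df.write.csv("/hdfs/path", header=True)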

1 Comment

Thanks; the thing is, however, that I don't have pandas available, as I am using pyspark.

First, I needed to convert the numpy array to an RDD as follows:

# distribute the array's elements, one record per value
zrdd = spark.sparkContext.parallelize(zarr)

Then I converted this to a DataFrame using the following (this is also where the column header gets defined):

# wrap each value in a one-element tuple so it becomes a single-column row
df = zrdd.map(lambda x: (float(x),)).toDF(["SOR"])
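A quick sanity check before writing it out:

df.printSchema()   # one double column named "SOR"
df.show(5)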

I could then write this out as normal, like so:

df.write.csv("/hdfs/mylocation")
