
I have a numpy array in pyspark, and I would like to convert it to a DataFrame so that I can write it out as a CSV to view it.

I initially read the data in from a DataFrame, but I had to convert it to a numpy array in order to use numpy.random.normal(). Now I want to convert the data back so I can write it out as a CSV.
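For reference, this is roughly what I'm doing (df_in and the column name are illustrative):

import numpy as np

# pull one numeric column out of the source DataFrame into a local list
values = [row["SOR"] for row in df_in.select("SOR").collect()]

# numpy.random.normal() needs a local array, hence the conversion
zarr = np.array(values) + np.random.normal(0, 1, len(values))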

I have tried the following directly on the array:

zarr.write.csv("/mylocation/inHDFS")

However, I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'numpy.ndarray' object has no attribute 'write'

Any ideas?

2 Answers


A numpy array and a Spark DataFrame are totally different structures. The first is local and doesn't have column names; the second is distributed (or distribute-ready in local mode) and has named, strongly typed columns.
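To make that concrete, here's a quick sketch (assuming an active SparkSession bound to spark; the column name is a placeholder):

import numpy as np

arr = np.random.normal(size=5)    # local array: a single dtype, no column names
print(arr.dtype)                  # float64

sdf = spark.createDataFrame([(float(x),) for x in arr], ["SOR"])
sdf.printSchema()                 # one named column, "SOR", with an explicit double type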

I'd recommend converting the numpy array to a pandas DataFrame first, as described here: Creating a Pandas DataFrame from a Numpy array: How do I specify the index column and column headers?, and then converting that to a Spark DataFrame using:

df = spark.createDataFrame(pandas_df)
df.write.csv('/hdfs/path')
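Putting the pieces together, a minimal end-to-end sketch might look like this (the column name and output path are placeholders):

import numpy as np
import pandas as pd

zarr = np.random.normal(size=100)                # stand-in for your array

# numpy -> pandas, supplying the column header here
pandas_df = pd.DataFrame(zarr, columns=["SOR"])

# pandas -> Spark, then write out as CSV
df = spark.createDataFrame(pandas_df)
df.write.csv("/hdfs/path", header=True)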

1 Comment

Thanks; the thing is, however, that I don't have pandas available, as I am using pyspark.

First, I needed to convert the numpy array to an RDD as follows:

# distribute the array's elements, one record per value
zrdd = spark.sparkContext.parallelize(zarr)

Then I converted this to a DataFrame using the following (this is also where the column header gets defined):

# wrap each value in a one-element tuple so it becomes a single-column row
df = zrdd.map(lambda x: (float(x),)).toDF(["SOR"])
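A quick sanity check before writing it out:

df.printSchema()   # one double column named "SOR"
df.show(5)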

I could then write this out as normal, like so:

df.write.csv("/hdfs/mylocation")
