0

Real life df is a massive dataframe that cannot be loaded into driver memory. Can this be done using regular or pandas udf?

# Code to generate a sample dataframe

from pyspark.sql import functions as F
from pyspark.sql.types import *
import pandas as pd
import numpy as np

sample = [['123',[[0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1], [0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1]]],
      ['345',[[1,0,0,0,0,1,1,1,0,1,1,0,1,0,0,0,1,1,1,1,1,1], [0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1]]],
      ['425',[[1,1,0,0,0,1,0,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1],[0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1]]],
      ]

df = spark.createDataFrame(sample,["id", "data"])

Here's the logic that needs to be parallelized without relying on driver memory.

Input: Spark dataframe Output: numpy array to be fed into horovod (Something like this: https://docs.databricks.com/applications/deep-learning/distributed-training/mnist-tensorflow-keras.html)

pandas_df = df.toPandas() # Not possible in real life
data_array = np.asarray(list(pandas_df.data.values))
data_array = data_array.reshape(data_array.shape[0], data_array.shape[1], -1, 1, order='F')
data_array = data_array.reshape(data_array.shape[0],data_array.shape[1],-1,1,1,order="F").transpose(0,1,3,2,-1)
# Some more numpy specific transformations ..

Here's an approach that didn't work:

@pandas_udf(ArrayType(IntegerType()), PandasUDFType.SCALAR)
def generate_feature(x):
    data_array = np.asarray(x)
    data_array = data_array.reshape(data_array.shape[0], ..
    ...
    return pd.Series(data_array)

df = df.withColumn("data_array", generate_feature(df.data))
2
  • Any updates? How do you solve this problem? Commented Mar 22, 2022 at 7:50
  • @talentcat it's not possible sorry Commented Apr 13, 2022 at 20:07

1 Answer 1

0


I am trying to work on a similar case though using Images. I am looking towards Petastorm for doing this. You can save your data from Rdd to Parquet format and then use it in horovod.
- I am yet to test this.
- How to fetch the dataset in parts using ranks in horovod, needs to be tested too.
Just a tip that could help.
Thanks.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.