I am trying to create a schema for a list of dictionaries (sample below) to use with a UDF, but I get the following error:

 Unexpected tuple %r with StructType

 [{'cumulativeDefaultbalance': 0, 'loanId': 13131, 'cumulativeEndingBalance': 4877.9918745262694, 'cumulativeContractpaymentw': 263.67479214039736, 'month': 1, 'cumulativeInterestpayment': 141.66666666666666, 'cumulativePrincipalpayment': 122.00812547373067, 'cumulativeAdjbeginingbal': 5000, 'cumulativePrepaymentamt': 40.315417142065087}]

Below is the schema object that I am building:

from pyspark.sql.types import StructType, StructField, FloatType

schema = StructType([
    StructField('cumulativeAdjbeginingbal', FloatType(), False),
    StructField('cumulativeEndingBalance', FloatType(), False),
    StructField('cumulativeContractpaymentw', FloatType(), False),
    StructField('cumulativeInterestpayment', FloatType(), False),
    StructField('cumulativePrincipalpayment', FloatType(), False),
    StructField('cumulativePrepaymentamt', FloatType(), False),
    StructField('cumulativeDefaultbalance', FloatType(), False)
])

Can anyone tell me what is making my code fail?

1 Answer

The issue, as far as I can see, is that the schema you are defining requires the RDD elements to be lists rather than dictionaries. So you can do this before creating the DF (assuming your base RDD of dicts is called df):

# values come out in each dict's own key order, which must match the schema's field order
df.map(lambda x: list(x.values()))
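Since a dictionary's key order need not match the schema's field order, one way to make the conversion explicit is to pull the values out by field name. The following is only a minimal plain-Python sketch: `FIELD_NAMES` and `dict_to_row` are hypothetical names introduced for illustration, and the field list is copied from the schema in the question.

```python
# Hypothetical helper (not part of PySpark): field order copied from the
# schema in the question.
FIELD_NAMES = [
    'cumulativeAdjbeginingbal',
    'cumulativeEndingBalance',
    'cumulativeContractpaymentw',
    'cumulativeInterestpayment',
    'cumulativePrincipalpayment',
    'cumulativePrepaymentamt',
    'cumulativeDefaultbalance',
]


def dict_to_row(record, field_names=FIELD_NAMES):
    """Return the record's values as a tuple ordered like field_names.

    Keys that are not in field_names (e.g. 'loanId', 'month') are ignored.
    """
    return tuple(record[name] for name in field_names)


sample = {
    'cumulativeDefaultbalance': 0,
    'loanId': 13131,
    'cumulativeEndingBalance': 4877.9918745262694,
    'cumulativeContractpaymentw': 263.67479214039736,
    'month': 1,
    'cumulativeInterestpayment': 141.66666666666666,
    'cumulativePrincipalpayment': 122.00812547373067,
    'cumulativeAdjbeginingbal': 5000,
    'cumulativePrepaymentamt': 40.315417142065087,
}

row = dict_to_row(sample)
```

Assuming df really is an RDD of such dicts, something like `df.map(dict_to_row)` could then take the place of the lambda above.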

Alternatively, you could do the following and eliminate the explicit schema definition:

from pyspark.sql import Row
df.map(lambda x: Row(**x)).toDF()

EDIT: Actually, it looks like the schema is for the return type of a UDF. I think the following should work:

from pyspark.sql.types import ArrayType, FloatType, StructField, StructType

schema = ArrayType(StructType([
        StructField('cumulativeAdjbeginingbal', FloatType(), False),
        StructField('cumulativeEndingBalance', FloatType(), False),
        StructField('cumulativeContractpaymentw', FloatType(), False),
        StructField('cumulativeInterestpayment', FloatType(), False),
        StructField('cumulativePrincipalpayment', FloatType(), False),
        StructField('cumulativePrepaymentamt', FloatType(), False),
        StructField('cumulativeDefaultbalance', FloatType(), False)
    ]), False)
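The sample record in the question also contains integer values (0 and 5000) and two keys, 'loanId' and 'month', that the schema does not declare. A cautious approach is to have the function behind the UDF return a list of tuples, cast to float and ordered like the struct fields. This is only a sketch in plain Python: `STRUCT_FIELDS` and `to_struct_array` are hypothetical names, and whether the casting is needed in your setup is an assumption.

```python
# Hypothetical helper: field order copied from the ArrayType(StructType(...))
# schema above.
STRUCT_FIELDS = (
    'cumulativeAdjbeginingbal',
    'cumulativeEndingBalance',
    'cumulativeContractpaymentw',
    'cumulativeInterestpayment',
    'cumulativePrincipalpayment',
    'cumulativePrepaymentamt',
    'cumulativeDefaultbalance',
)


def to_struct_array(records, field_names=STRUCT_FIELDS):
    """Turn a list of dicts into a list of float tuples in schema order.

    Undeclared keys are dropped and every value is cast to float, so
    ints such as 0 or 5000 become 0.0 and 5000.0.
    """
    return [
        tuple(float(record[name]) for name in field_names)
        for record in records
    ]


records = [{
    'cumulativeDefaultbalance': 0,
    'loanId': 13131,
    'cumulativeEndingBalance': 4877.9918745262694,
    'cumulativeContractpaymentw': 263.67479214039736,
    'month': 1,
    'cumulativeInterestpayment': 141.66666666666666,
    'cumulativePrincipalpayment': 122.00812547373067,
    'cumulativeAdjbeginingbal': 5000,
    'cumulativePrepaymentamt': 40.315417142065087,
}]

result = to_struct_array(records)
```

If the existing function returns the list of dicts, its output could be passed through `to_struct_array` before returning, with the function registered via `pyspark.sql.functions.udf` using the schema above as the return type.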

5 Comments

Thanks for the quick answer, but there are a few things: 1) I am defining my df with spark.read.csv. 2) Even if I were not, I would still have to calculate 10 columns, fix datatypes, etc. before I could generate these fields. 3) In this case I would have to do df.rdd.flatMap, convert the result back to a df, and then perform a join. Do you think that is advisable?
Ah ok, is this actually json data or something?
What I am saying is that I have a CSV, I load it with spark.read.csv, then I calculate JPScore with df.withColumns, and then I run df.withColumns for the function that returns the value above. My dataframe fails to accept it with the error "Unexpected tuple %r with StructType".
Maybe post some of the code for those intermediate steps?
