I'm using the code below to read data from an API whose payload is in JSON format, using PySpark in Azure Databricks. All the fields are defined as strings, but I keep running into a "json_tuple requires that all arguments are strings" error.
Schema:
root
|-- Payload: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ActiveDate: string (nullable = true)
| | |-- BusinessId: string (nullable = true)
| | |-- BusinessName: string (nullable = true)
JSON:
{
"Payload":
[
{
"ActiveDate": "2008-11-25",
"BusinessId": "5678",
"BusinessName": "ACL"
},
{
"ActiveDate": "2009-03-22",
"BusinessId": "6789",
"BusinessName": "BCL"
}
]
}
PySpark:
from pyspark.sql import functions as F
df = df.select(F.col('Payload'), F.json_tuple(F.col('Payload'), 'ActiveDate', 'BusinessId', 'BusinessName') \.alias('ActiveDate', 'BusinessId', 'BusinessName'))
df.write.format("delta").mode("overwrite").saveAsTable("delta_payload")
Error:
AnalysisException: cannot resolve 'json_tuple(`Payload`, 'ActiveDate', 'BusinessId', 'BusinessName')' due to data type mismatch: json_tuple requires that all arguments are strings;