I am having two small problems regarding my one bigger problem: I want to read in JSON data once a day and save it as Parquet for later data-related work. Working with parquet is so much faster. But the thing where I am stuck at is the fact that when reading that parquet, Spark always tries to get the schema from the schema file or just takes the schema from the first parquet file and presumes the schema is same for all files. But there are cases when we don't have any data for some days in some columns.
So let's say I have a JSON file with data with the following schema:
root
|-- Id: long (nullable = true)
|-- People: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Name: string (nullable = true)
| | |-- Amount: double (nullable = true)
And then I have another JSON file where there is no data for the "People" column. And therefore the schema is the following:
root
|-- Id: long (nullable = true)
|-- People: array (nullable = true)
| |-- element: string (containsNull = true)
When I would read them in together with read.json, Spark goes through all the files and infers the merged schema from these, more specifically from the first one and just leaves the rows from the second file empty, but the schema is correct.
But when I read these separately and write to parquet separately, then I cannot read them together because for Parquet, the schema doesn't match and I get an error.
My first idea was to read in the file with missing data and change its schema manually by casting column types to match the first schema, but this manual conversion is faulty, it can be out of sync and I don't even know how to cast this string type to array or struct type.
And another problem is when the "Amount" field has only full integers, then Spark reads them in as longs but not doubles, as is necessary. But if I use:
val df2 = df.withColumn("People.Amount", col("People.Amount").cast(org.apache.spark.sql.types.ArrayType(org.apache.spark.sql.types.DoubleType,true)))
Then it doesn't change the type of the original column, but adds a new column named People.Amount