I have data stored in a MongoDB collection, and the timestamp column is not being read correctly by Apache Spark. I'm running Spark on GCP Dataproc.
Here is sample data.

In Mongo:

+----------+--------------------+
|timeslot  |timeslot_date       |
+----------+--------------------+
|1683527400|2023-05-08T06:30:00Z|
+----------+--------------------+
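The epoch value and the Mongo date do agree with each other; checking with plain Python (outside Spark):

```python
from datetime import datetime, timezone

# 1683527400 seconds since the Unix epoch, rendered in UTC
print(datetime.fromtimestamp(1683527400, tz=timezone.utc))
# -> 2023-05-08 06:30:00+00:00, which matches the stored timeslot_date
```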
When I use PySpark to read this (selecting only specific columns), I get:
+----------+-------------------+
|timeslot  |timeslot_date      |
+----------+-------------------+
|1683527400|2023-05-07 23:30:00|
+----------+-------------------+
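For reference, this is roughly how the collection is read. This is a minimal sketch: the connection URI, database, and collection names are placeholders, and it assumes the MongoDB Spark connector 10.x ("mongodb") data source:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mongo-read").getOrCreate()

# Read the collection via the MongoDB Spark connector (v10.x "mongodb" source)
df = (
    spark.read.format("mongodb")
    .option("connection.uri", "mongodb://<host>:27017")  # placeholder
    .option("database", "<db>")                          # placeholder
    .option("collection", "<collection>")                # placeholder
    .load()
)

df.select("timeslot", "timeslot_date").show(truncate=False)
```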
My understanding is that the data in Mongo is in UTC, i.e. 2023-05-08T06:30:00Z is a UTC timestamp. I'm in the PST timezone. I'm not clear why Spark is reading it in a different timezone (neither PST nor UTC). Note: it is not reading it as PST; if it were doing that, I'd expect it to advance the time by 7 hours, but instead it is doing the opposite.
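One thing that seems relevant is Spark's session timezone, which is what `show()` uses when rendering timestamps; by default it falls back to the JVM's default timezone on the driver. A quick check (assuming an active SparkSession named `spark` and the `df` from above):

```python
# What timezone is Spark using to render timestamps?
print(spark.conf.get("spark.sql.session.timeZone"))

# Force UTC for display and compare against the value stored in Mongo
spark.conf.set("spark.sql.session.timeZone", "UTC")
df.select("timeslot", "timeslot_date").show(truncate=False)
```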
Where is the default timezone taken from when Spark reads data from MongoDB?
Any ideas on this?
TIA!