I use Spark to read JSON files that appear in a folder every day under a yyyy/mm/dd path pattern, and convert them to Iceberg format. Both the JSON source folder and the Iceberg table live in the same S3 bucket, under different paths.
I'm using a stream reader, as in:
    jsondf = (
        spark.readStream.format("json")
        .schema(myschema)
        .option("cleanSource", "archive")
        .option("sourceArchiveDir", "s3a://mybucket/myarchivepath")
        .load("s3a://mybucket/sourcefolder/yyyy/mm/dd")
        .select("*")
    )
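For context, myschema is just an explicit StructType so the JSON reader doesn't have to infer anything. A minimal sketch (the field names here are made up, not the real schema):

    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    # Hypothetical fields for illustration; the real schema is project-specific.
    myschema = StructType([
        StructField("id", StringType(), True),
        StructField("event_time", TimestampType(), True),
        StructField("payload", StringType(), True),
    ])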
I have been trying several stream writers. A continuously running stream writer (no explicit trigger) seems to work well and archives files as they show up. But we don't receive that many files, so I want to use a trigger instead. trigger(once=True) seems to be the wrong choice for archiving, but I don't know why (is there any reason for once=True to fail when archiving? It looks to me like the natural choice for it). Because of that I'm trying availableNow=True, as in:
    query = (
        jsondf.writeStream
        .trigger(availableNow=True)
        .format("iceberg")
        .option("checkpointLocation", "s3a://mybucket/chkpointfolder")
        .outputMode("append")
        .start(jsontable)
    )
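For comparison, the once variant I tried first differs only in the trigger (a sketch, everything else identical to the writer above):

    # Same writer, but with the older one-shot trigger.
    (jsondf.writeStream
        .trigger(once=True)  # deprecated in Spark 3.4 in favor of availableNow
        .format("iceberg")
        .option("checkpointLocation", "s3a://mybucket/chkpointfolder")
        .outputMode("append")
        .start(jsontable))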
Excuse any typos. I'm writing from a mobile.
Given that the version without a trigger works and archives correctly, why does using a trigger make archiving fail? In fact, I don't even see this stream writer make the reader pick up any files at all.
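For reference, this is roughly how I check whether the reader picks anything up, using the StreamingQuery handle returned by start (a sketch):

    # With availableNow the query is supposed to stop on its own once the
    # backlog is processed, so the driver just waits for it.
    query.awaitTermination()

    # Each entry is the progress report of one micro-batch; I'd expect
    # numInputRows > 0 if any files had been read.
    for progress in query.recentProgress:
        print(progress["numInputRows"])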
PS: I'm using Spark 3.4.1. It seems the once trigger is deprecated there, and availableNow is the recommended replacement.