
I am using Azure Blob Storage to store data and feeding it to Auto Loader through mounts. I am looking for a way to get Auto Loader to pick up new files from any mount. Let's say I have these folders in my mount:

mnt/
├─ blob_container_1
├─ blob_container_2

When I use .load('/mnt/'), no new files are detected. But when I point at the folders individually, e.g. .load('/mnt/blob_container_1'), it works fine.

I want to load files from both mount paths using Auto Loader (running continuously).
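For reference, here is roughly what the working single-container stream looks like (a minimal sketch; the JSON format, schema, and checkpoint/output paths are placeholders, not my real values):

from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical minimal schema; the real job defines its own.
schema = StructType([StructField("value", StringType(), True)])

df = (
    spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")   # assumed file format
        .schema(schema)
        .load("/mnt/blob_container_1")         # one container: new files are detected
)

(
    df.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/checkpoints/container_1")  # placeholder path
        .start("/mnt/output/container_1")                              # placeholder sink
)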

1 Answer


You can use glob patterns in the path to match multiple directories, for example:

df = spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "<format>") \
  .schema(schema) \
  .load("<base_path>/*/files")

For example, if you would like to load only PNG files from a directory that contains files with different suffixes, you can do:

df = spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "binaryFile") \
  .option("pathGlobFilter", "*.png") \
  .load("<base_path>")

Refer to https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html#filtering-directories-or-files-using-glob-patterns
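Applied to the mount layout from the question, a minimal sketch could look like this (the format and schema are assumptions carried over from the question's sketch, not a confirmed configuration):

df = (
    spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")   # assumed format
        .schema(schema)                        # schema as defined in the question's sketch
        .load("/mnt/blob_container_*")         # matches blob_container_1 and blob_container_2
)

Note that a bare /mnt/* would match every mount under /mnt, not just these two containers, so the narrower blob_container_* pattern is usually the safer choice.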


1 Comment

Thanks for the response! So in my case, .load('/mnt/*') should work?
