
I am using Azure Blob Storage to store data and feeding it to Auto Loader through mounts. I am looking for a way to get Auto Loader to pick up new files from any mount. Let's say I have these folders in my mount:

mnt/
├─ blob_container_1
├─ blob_container_2

When I use .load('/mnt/'), no new files are detected. But when I point at the folders individually, e.g. .load('/mnt/blob_container_1'), it works fine.

I want to load files from both mount paths using Auto Loader (running continuously).
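For reference, here is roughly what the working single-container stream looks like (a minimal sketch; the JSON format, schema, and checkpoint/output paths are placeholders, not my real values):

from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical minimal schema; the real job defines its own.
schema = StructType([StructField("value", StringType(), True)])

df = (
    spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")   # assumed file format
        .schema(schema)
        .load("/mnt/blob_container_1")         # one container: new files are detected
)

(
    df.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/checkpoints/container_1")  # placeholder path
        .start("/mnt/output/container_1")                              # placeholder sink
)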

1 Answer


You can use glob patterns in the path to match multiple directories, for example:

df = spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "<format>") \
  .schema(schema) \
  .load("<base_path>/*/files")

For example, if you would like to load only PNG files from a directory that contains files with different suffixes, you can do:

df = spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "binaryFile") \
  .option("pathGlobFilter", "*.png") \
  .load("<base_path>")

Refer to https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html#filtering-directories-or-files-using-glob-patterns
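Applied to the mount layout from the question, a minimal sketch could look like this (the format and schema are assumptions carried over from the question's sketch, not a confirmed configuration):

df = (
    spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")   # assumed format
        .schema(schema)                        # schema as defined in the question's sketch
        .load("/mnt/blob_container_*")         # matches blob_container_1 and blob_container_2
)

Note that a bare /mnt/* would match every mount under /mnt, not just these two containers, so the narrower blob_container_* pattern is usually the safer choice.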


1 Comment

Thanks for the response! So in my case, .load('/mnt/*') should work?
