
I am trying to use pipelines in Databricks to ingest data from an external location into the data lake using Auto Loader, and I am running into the error below. I have noticed other posts with similar errors, but in those posts (e.g. this one), the error was related to the destination table already being registered as managed.

In my case, the error appears to be related to the event log table associated with the Auto Loader stream. More specifically, if I look up the storage path from the error message using the following query, I get a single, automatically created table called event_log_a3c015c9_f373_4aa6_92db_6b56ae0dc948:

SELECT 
  table_name
FROM system.information_schema.tables
WHERE table_name LIKE '%event%' AND storage_path LIKE '%3775a194-3db0-48a6-8c0e-cce43c26c9e7%'

I tried re-creating the pipeline, but it didn't help. Any idea how to resolve this?

Error:

AnalysisException: Traceback (most recent call last):
File "/Users/[email protected]/.bundle/Testproject_2/dev/files/src/notebook", cell 4, line 11
      2 csv_file_path = "abfss://storage-dm-int-container@devdomaindmdbxint01.dfs.core.windows.net/dummy.csv"
      3 schema_location = "abfss://storage-dm-int-container@devdomaindmdbxint01.dfs.core.windows.net/_schema8/"
      4 df = (
      5     session.readStream
      6     .format("cloudFiles")
      7     .option("cloudFiles.format", "csv")
      8     .option("header", "true")
      9     .option("inferSchema", "true")
     10     .option("cloudFiles.schemaLocation", schema_location)
---> 11     .load(csv_file_path)
     12 )

AnalysisException: [RequestId=3ef8b745-48dc-4ae1-b2f6-9afaaf442c3b ErrorClass=INVALID_PARAMETER_VALUE.LOCATION_OVERLAP] Input path url 'abfss://[email protected]/dev-data-domain/__unitystorage/catalogs/cf3123b2-b661-48d9-9baa-a0b0214d5a29/tables/3775a194-3db0-48a6-8c0e-cce43c26c9e7/_dlt_metadata/_autoloader' overlaps with managed storage within 'CheckPathAccess' call. .

Relevant code:

from databricks.connect import DatabricksSession
from pyspark.sql.functions import *

# Create or retrieve a DatabricksSession
session = DatabricksSession.builder.getOrCreate()


csv_file_path = "abfss://storage-dm-int-container@devdomaindmdbxint01.dfs.core.windows.net/dummy.csv"
schema_location = "abfss://storage-dm-int-container@devdomaindmdbxint01.dfs.core.windows.net/_schema8/"
df = (
    session.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("cloudFiles.schemaLocation", schema_location)
    .load(csv_file_path)
)

checkpoint_path = "/Volumes/dev-data-domain/bronze/test/_checkpoint5"  

query = (
    df.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .outputMode("append")
    .trigger(once=True)
    .toTable("`dev-data-domain`.bronze.delta_table_pipeline3")
)
Comments:

  • Double check if you are calling Auto Loader from a @dlt.table-annotated function. The error you are getting is exactly what I was seeing when calling it from outside (a minimal sketch of that pattern is shown below). Commented Sep 14 at 18:41
  • Using DLT helped me resolve the issue, but I guess that is because it manages checkpoints and schema locations automatically. That helped me move forward, so thank you, but I still wonder why my example didn't work. I should still be able to read with Auto Loader without using Delta Live Tables. Commented Sep 15 at 9:35
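
For reference, here is a minimal sketch of the pattern the first comment suggests: doing the Auto Loader read inside a function decorated with @dlt.table, so the pipeline manages the schema location and checkpoint itself rather than the explicit _schema8/ and Volume paths used above. This is an illustrative sketch, not code from the question; the decorator and cloudFiles options are standard DLT / Auto Loader APIs, the table name is hypothetical, and the source path is the one from the question.

import dlt

# Runs inside a Delta Live Tables pipeline notebook, where `spark` is provided.
@dlt.table(
    name="delta_table_pipeline3",  # hypothetical target table name
    comment="CSV files ingested with Auto Loader",
)
def delta_table_pipeline3():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .option("cloudFiles.inferColumnTypes", "true")
        # No cloudFiles.schemaLocation or checkpointLocation here:
        # the pipeline keeps both under its own managed storage.
        .load("abfss://storage-dm-int-container@devdomaindmdbxint01.dfs.core.windows.net/dummy.csv")
    )

The point of the sketch is only that, as the second comment notes, the pipeline rather than the caller decides where schema-evolution data and checkpoints live.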
