
My use case requires me to continuously write incoming messages into files stored in an Azure Data Lake Gen2 storage account. I am able to create the files by triggering a function that uses the Python azure-storage-file-datalake SDK to interact with the storage account.

The problem is that by default the files created using the create_file() method of the DataLakeFileClient class are Block Blobs (and there isn't any parameter to change the type of blob that gets created), which means I cannot append data to them after new messages arrive.

I have tried using the Python azure-storage-blob SDK; however, it is unable to use paths to locate files within the containers of my Data Lake.

This is an example of how I am creating the files, although they come out as block blobs:

if int(day) in days:
    # Create the directory for the day and a new file for the incoming message
    day_directory_client.create_directory()
    file_client = day_directory_client.create_file(f'{json_name}')

    # Append the message body (plus a trailing newline) and flush it to the file
    data = f'{message_body}\n'
    file_client.append_data(data=data, offset=0, length=len(data))
    file_client.flush_data(len(data))

    write_to_cache(year, month, day, json_path)

I appreciate any help I can get, thanks!

  • Did you manage to solve your problem? I'm struggling with a similar issue in Java... Commented Nov 23, 2021 at 12:49

2 Answers


If you want to create an append blob in an Azure Data Lake Gen2 account, you will need to use the azure-storage-blob package instead of azure-storage-file-datalake.

The azure-storage-file-datalake package is a wrapper over the Azure Data Lake Store REST API, which does not allow you to specify the blob type.
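For instance, a minimal sketch along these lines should create an append blob at a nested path inside a Gen2 container (the connection string, container name, and blob path are placeholders; slashes in the blob name map to directories when the hierarchical namespace is enabled):

from azure.storage.blob import BlobServiceClient

service_client = BlobServiceClient.from_connection_string("<CONNECTION_STRING>")

# Slashes in the blob name become directories in a hierarchical-namespace account
blob_client = service_client.get_blob_client(
    container="<CONTAINER>",
    blob="2021/11/23/messages.json"
)

blob_client.create_append_blob()             # create an empty append blob
blob_client.append_block("first message\n")  # append blocks as new messages arrive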


1 Comment

Hi @Gaurav Mantri, thank you for your reply. I have tried using the azure-storage-blob package version 12.9.0b1, and although it manages to create an append blob and append blocks to it, it completely ignores the folder structure of my data lake and creates the file directly under the root folder. I went over the documentation for the create_append_blob method of the BlobServiceClient class, but it doesn't seem to take any parameters that would take the file path into account.

I was able to achieve what you have asked by using the BlobClient class to create and append to a blob in Azure Data Lake Storage Gen2 with the following code:

from azure.storage.blob import BlobClient

# Convert a pandas dataframe to CSV (your data can be converted to your desired file format)
data = df.to_csv()

# SAS URL pointing at the full path of the blob inside the container
sas_url = "https://<ACCOUNT_NAME>.blob.core.windows.net/<CONTAINER>/<DIRECTORIES>/<BLOBNAME>?<SAS TOKEN>"

blob_client = BlobClient.from_blob_url(sas_url)
blob_client.upload_blob(data, blob_type="AppendBlob")

Once an append blob is created in the data lake, you can append to it with blob_client.append_block(data) every time you want to add a value.
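A minimal sketch of that step, assuming the blob_client from the snippet above and a hypothetical new_df dataframe holding the rows to add:

# Append further CSV rows (without repeating the header) to the existing append blob
new_rows = new_df.to_csv(header=False)
blob_client.append_block(new_rows)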

