
In ADF, I have a demo pipeline with three activities and one pipeline variable called parentRunId:

[Screenshot: pipeline variable]

The first activity is a "Set Variable" activity and generates a UUID for parentRunId:

[Screenshot: supply variable with UUID]
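The value is set with a dynamic content expression; a minimal sketch of such a value, assuming ADF's built-in guid() function is used (the exact expression isn't visible in the screenshot):

@guid()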

The second activity passes parentRunId to a Databricks notebook:

[Screenshot: pass variable to notebook]
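The notebook's base parameter simply references the variable, roughly like this (sketch; the parameter name is assumed to match the variable name):

parentRunId: @variables('parentRunId')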

The third step is irrelevant for this example.

This all works, and the UUID is displayed in my Databricks notebook.

But I want to be able to restart my pipeline from step 2 in case it fails.

So I select the "Rerun from selected activity" button in the ADF monitor on the second step:

[Screenshot: rerun from selected activity]

My expectation would be that the variable parentRunId is still the same (because I am not re-running the first step, which generates the UUID).

However, surprisingly, a new UUID is generated and passed to Databricks.

This makes it difficult for me to recover from a failure, because the pipeline seemingly has lost its context.

Is this a bug? Any idea how I can pass on a piece of information between activities which will also be reliably available when I restart in the middle of the pipeline?

1 Answer


Is this a bug? Any idea how I can pass on a piece of information between activities which will also be reliably available when I restart in the middle of the pipeline?

I think this is default behavior in ADF: Set Variable activities are executed again whenever the pipeline is re-run, even though they are shown as skipped.

To achieve your requirement, you can try the workaround below. For this, you need Blob storage or ADLS storage, a Copy activity, and a Lookup activity.

During the re-run, the Copy and Lookup activities will be skipped. So, first store the run_id in a file using the Copy activity and then read it back using the Lookup activity.

Follow the design below.

[Screenshot: pipeline design]

Create a dummy source.csv file with one row and one column in ADLS or Blob storage. Create a dataset for it and add it as the source of the Copy activity. In the source, add your run_id variable as an additional column, as shown above.

Copy this to another csv file. Add that dataset as the Copy activity sink.

[Screenshot: Copy activity configuration]
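For reference, a rough JSON sketch of the Copy activity source settings (the dataset type, the run_id variable name, and the exact serialization of the dynamic value are assumptions; the JSON your factory generates may differ slightly):

"source": {
    "type": "DelimitedTextSource",
    "additionalColumns": [
        {
            "name": "run_id",
            "value": {
                "value": "@variables('run_id')",
                "type": "Expression"
            }
        }
    ]
}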

Now, add a Lookup activity that uses the same target dataset, as shown below.

[Screenshot: Lookup activity settings]

On the first pipeline run, the Copy activity will add an additional column containing the run_id and store it in the target csv file. The Lookup activity will then return the required run_id from that same target dataset.

You can use the expression below to get the value from the Lookup:

@activity('Lookup1').output.value[0].run_id
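Note that this assumes the Lookup's "First row only" option is unchecked; if it is checked, the equivalent expression would be:

@activity('Lookup1').output.firstRow.run_id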

Instead of a Notebook activity, I am using a Set Variable activity here to capture the value.

[Screenshot: Set Variable activity using the Lookup output]

[Screenshot: normal pipeline run]

The run_id of the above pipeline run will be stored in the file.

Here, I am re-running the pipeline from the Set variable2 activity.

[Screenshot: re-run from the Set variable2 activity]

[Screenshot: required old pipeline run_id]

Even though the first activity set the new pipeline run_id in the variable, you can see that the Lookup returned the required old pipeline run_id.

Just replace the last activity with your Notebook activity and pass the same expression to your notebook activity parameter.
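For example, assuming your notebook parameter is also named parentRunId, the Notebook activity's base parameter would look like this (a sketch, not taken from your pipeline):

parentRunId: @activity('Lookup1').output.value[0].run_id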


3 Comments

Thanks a lot for your extensive answer. Unfortunately, I could not make it work yet (somehow the run_id does not get written to the file), but I have a general concern with this solution: what if this pipeline is triggered several times in parallel (e.g. triggered by arriving files)? Wouldn't the run_id in the file be overwritten and thus become useless?
Yes, in that case this solution won't work; it only works for sequential pipeline runs. I thought you were only re-running a failed pipeline run. If there is no failure, it works even with parallel runs, but re-running a failed pipeline run while other runs execute in parallel might not work.
Yes, I did not make this clear in the initial post: the pipeline is a generic file handler which can run in parallel.
