
In ADF, I have a demo pipeline with three activities and one pipeline variable called parentRunId:

[Screenshot: pipeline variable]

The first activity is a "Set Variable" activity and generates a UUID for parentRunId:

[Screenshot: supply variable with UUID]
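The value is set with a dynamic content expression; a minimal sketch of such a value, assuming ADF's built-in guid() function is used (the exact expression isn't visible in the screenshot):

@guid()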

The second activity passes parentRunId to a Databricks notebook:

[Screenshot: pass variable to notebook]
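The notebook's base parameter simply references the variable, roughly like this (sketch; the parameter name is assumed to match the variable name):

parentRunId: @variables('parentRunId')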

The third step is irrelevant for this example.

This all works, and the UUID is displayed in my Databricks notebook.

But I want to be able to restart my pipeline from step 2 in case it fails.

So I select the "Rerun from selected activity" button in the ADF monitor on the second step:

[Screenshot: rerun from selected activity]

My expectation would be that the variable parentRunId is still the same (because I am not re-running the first step, which generates the UUID).

However, surprisingly, a new UUID is generated and passed to Databricks.

This makes it difficult for me to recover from a failure, because the pipeline seemingly has lost its context.

Is this a bug? Any idea how I can pass on a piece of information between activities which will also be reliably available when I restart in the middle of the pipeline?

1 Answer


Is this a bug? Any idea how I can pass on a piece of information between activities which will also be reliably available when I restart in the middle of the pipeline?

I think this is default behavior in ADF: Set Variable activities are executed again whenever the pipeline is re-run, even though they are shown as skipped.

To achieve your requirement, you can try the workaround below. For this, you need Blob storage or ADLS storage, a Copy activity, and a Lookup activity.

During the re-run, the Copy and Lookup activities will be skipped. So, first store the run_id in a file using the Copy activity and then read it back using the Lookup activity.

Follow the design below.

[Screenshot: pipeline design]

Create a dummy source.csv file with one row and one column in ADLS or Blob storage. Create a dataset for it and add it as the source of the Copy activity. In the source, add your run_id variable as an additional column, as shown above.

Copy this to another csv file. Add that dataset as the Copy activity sink.

[Screenshot: Copy activity configuration]
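For reference, a rough JSON sketch of the Copy activity source settings (the dataset type, the run_id variable name, and the exact serialization of the dynamic value are assumptions; the JSON your factory generates may differ slightly):

"source": {
    "type": "DelimitedTextSource",
    "additionalColumns": [
        {
            "name": "run_id",
            "value": {
                "value": "@variables('run_id')",
                "type": "Expression"
            }
        }
    ]
}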

Now, add a Lookup activity that uses the same target dataset, as shown below.

[Screenshot: Lookup activity settings]

On the first pipeline run, the Copy activity will add an additional column containing the run_id and store it in the target csv file. The Lookup activity will then return the required run_id from that same target dataset.

You can use the expression below to get the value from the Lookup:

@activity('Lookup1').output.value[0].run_id
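Note that this assumes the Lookup's "First row only" option is unchecked; if it is checked, the equivalent expression would be:

@activity('Lookup1').output.firstRow.run_id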

Instead of a Notebook activity, I am using a Set Variable activity here to capture the value.

[Screenshot: Set Variable activity using the Lookup output]

[Screenshot: normal pipeline run]

The run_id of the above pipeline run will be stored in the file.

Here, I am re-running the pipeline from the Set variable2 activity.

[Screenshot: re-run from the Set variable2 activity]

[Screenshot: required old pipeline run_id]

Even though the first activity set the new pipeline run_id in the variable, you can see that the Lookup returned the required old pipeline run_id.

Just replace the last activity with your Notebook activity and pass the same expression to your notebook activity parameter.
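For example, assuming your notebook parameter is also named parentRunId, the Notebook activity's base parameter would look like this (a sketch, not taken from your pipeline):

parentRunId: @activity('Lookup1').output.value[0].run_id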


3 Comments

Thanks a lot for your extensive answer. Unfortunately, I could not make it work yet (somehow the run_id does not get written to the file), but I have a general concern with this solution: what if this pipeline is triggered several times in parallel (e.g. triggered by arriving files)? Wouldn't the run_id in the file be overwritten and thus become useless?
Yes, in that case this solution won't work; it only works for sequential pipeline runs. I thought you were only re-running a failed pipeline run. If there is no failure, it works even with parallel runs, but re-running a failed pipeline run while other runs execute in parallel might not work.
Yes, I did not make this clear in the initial post: the pipeline is a generic file handler which can run in parallel.
