I am fairly new to Azure Data Factory (ADF) but have been learning and experimenting with some of its advanced features. I'm currently working on a use case involving Change Data Capture (CDC) and partitioned folders in Azure Data Lake Storage (ADLS). I need some guidance on how to configure this properly.
My Setup
I am creating a Data Flow in ADF that processes daily partitioned files in ADLS. Here’s how it works:
Source Data:
- Located in a "raw" layer, with folders partitioned by date, e.g., {file_system}/raw/supplier/20241116.
- The file within the partition is named data.parquet.
Transformations:
- The Data Flow applies transformations to this data.
Destination Data:
- The transformed data is written to a "curated" layer with a similar partitioning structure, e.g., {file_system}/curated/supplier/20241116.
Dynamic Configuration
I’ve parameterized the datasets used in the Data Flow as follows:
Parameters:
- p_DirectoryName for the base directory path (e.g., /raw/supplier or /curated/supplier).
- p_partition_folder for the partition folder name (e.g., 20241116).
File Path: Constructed dynamically as:
@concat(dataset().DirectoryName, '/', dataset().partition_folder)
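For context, the parameterized dataset definition looks roughly like this (a simplified sketch; the dataset, linked service, and file system names ds_SupplierParquet, ls_adls, and my_filesystem are placeholders, and I've kept the parameter names matching the expression above):

```json
{
  "name": "ds_SupplierParquet",
  "properties": {
    "type": "Parquet",
    "linkedServiceName": {
      "referenceName": "ls_adls",
      "type": "LinkedServiceReference"
    },
    "parameters": {
      "DirectoryName": { "type": "string" },
      "partition_folder": { "type": "string" }
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "my_filesystem",
        "folderPath": {
          "value": "@concat(dataset().DirectoryName, '/', dataset().partition_folder)",
          "type": "Expression"
        },
        "fileName": "data.parquet"
      }
    }
  }
}
```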
The values for p_partition_folder are derived from a parent pipeline, which includes:
Lookup Activity:
Reads a JSON file that contains:
{ "current_partition_date": "20241116", "next_partition_date": "20241117" }current_partition_daterefers to the latest partition in the curated layer (the baseline for CDC).next_partition_daterefers to the partition to which the current run’s transformed data should be written.
Set Variable Activities:
- Extract the current_partition_date and next_partition_date values from the Lookup activity output (see the sketch after this list).
Data Flow Activity:
- Passes these parameters (current_partition_date and next_partition_date) into the Data Flow for dynamic path resolution.
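For reference, each Set Variable activity reads from the Lookup output roughly like this (the activity and variable names LookupPartitionDates and v_current_partition_date are placeholders, and the Lookup is configured with "First row only"):

```json
{
  "name": "Set current partition date",
  "type": "SetVariable",
  "dependsOn": [
    { "activity": "LookupPartitionDates", "dependencyConditions": [ "Succeeded" ] }
  ],
  "typeProperties": {
    "variableName": "v_current_partition_date",
    "value": {
      "value": "@activity('LookupPartitionDates').output.firstRow.current_partition_date",
      "type": "Expression"
    }
  }
}
```

A second Set Variable activity does the same for next_partition_date, and both variables are then supplied to the Data Flow activity's dataset parameters.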
The Problem
In the Data Flow:
I have a single source transformation where I’ve enabled the CDC (Change Data Capture) option.
The dataset used for the source is parameterized as mentioned above (@concat(dataset().DirectoryName, '/', dataset().partition_folder)).
I want the Data Flow to:
- For the current run, compare the latest raw data (e.g., {file_system}/raw/supplier/20241117/data.parquet) with the baseline data from the curated layer (e.g., {file_system}/curated/supplier/20241116/data_today.parquet).
- Identify changes (inserts, updates, deletes) between these files using CDC.
- Write the results to the "curated" layer for the next_partition_date (e.g., {file_system}/curated/supplier/20241117).
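To make the intent concrete, here is the parameter-to-path mapping I am aiming for on that example run (the labels on the left are mine, purely for illustration):

```json
{
  "raw_input":        { "p_DirectoryName": "/raw/supplier",     "p_partition_folder": "20241117" },
  "curated_baseline": { "p_DirectoryName": "/curated/supplier", "p_partition_folder": "20241116" },
  "curated_output":   { "p_DirectoryName": "/curated/supplier", "p_partition_folder": "20241117" }
}
```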
However, I am confused about:
- How the CDC transformation in the source knows to compare the current file with the baseline.
- How to ensure the Data Flow dynamically reads the correct baseline file (current_partition_date) and writes to the correct target folder (next_partition_date).
Questions
1. How do I configure the CDC-enabled source transformation in the Data Flow to handle the dynamic dataset parameters (current_partition_date and next_partition_date) for file paths?
2. On the first run, when no baseline file exists yet, how should CDC handle the absence of a previous partition (i.e., treat all records as inserts)?
3. Is there a better approach to achieve this entire workflow than the one I've described?
Any guidance or best practices would be greatly appreciated. Thank you in advance!



