In this article, we will walk through a step-by-step demo of how to implement CDC in Azure Data Factory using a Change Data Capture resource to capture change data from an Azure Data Lake Storage Gen2 source into an Azure SQL Database.
Microsoft recently added Change Data Capture as a native top-level resource in Azure Data Factory, allowing you to configure a continuous process that automatically captures changes from data source(s).
It automatically detects data changes at the source without requiring complex design or coding. Currently, this feature is in public preview.
Change Data Capture Resource: Important points
- Learning curve: To use this CDC resource, we do not need to learn concepts like triggers, schedules, or integration runtimes, and we do not need to design pipelines or data flows.
- Latency: It captures change data from the source in two modes: real-time and frequency-based (15 minutes, 30 minutes, 1 hour, or 2 hours).
- Cost: The CDC resource uses a 4-core General Purpose data flow cluster, which bills while data is being processed based on the selected latency (see the rough cost sketch after this list).
- Steps to implement CDC in Azure Data Factory:
- Set the data source(s)
- Select the transformation rule; it auto-maps source and target columns based on fuzzy matching
- Set the destination
- Set the frequency (real-time or batch)
- Start the mapping
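As a rough illustration of how the selected latency drives the cost mentioned above, here is a minimal sketch assuming a placeholder hourly rate and run duration for the 4-core General Purpose cluster; the real numbers depend on your region and the current Azure Data Factory pricing.

```python
# Rough cost sketch for a frequency-based CDC mapping (not a pricing reference).
# ASSUMPTION: hourly_rate and minutes_per_run are placeholder values.
hourly_rate = 1.00              # hypothetical $/hour for the 4-core cluster
minutes_per_run = 5             # hypothetical processing time per micro-batch
runs_per_day = 24 * 60 // 15    # one run every 15 minutes

daily_cost = runs_per_day * (minutes_per_run / 60) * hourly_rate
print(f"{runs_per_day} runs/day, estimated cost: ${daily_cost:.2f}/day")
```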
Supported data sources:
- Azure: Azure SQL Database & Azure Cosmos DB (SQL API).
- Database: SQL Server
- File: Delimited Text (CSV), JSON, Avro, Parquet, ORC, XML
Supported Targets:
- Azure: Azure SQL Database
- File: Delimited Text (CSV), JSON, Avro, Parquet, ORC, Delta
Recommended Change Data Capture articles on Azure Data Factory:
How to do incremental load in Azure Data Factory: https://sarnendude.com/azure-data-factory-data-flows-incremental-extract-feature-allows-to-read-only-changed-rows-on-azure-sql-db-sources/
Visit this article for a demo on native SQL Change Data Capture (Enable CDC in Azure SQL Database): https://sarnendude.com/azure-data-factory-data-flow-support-change-data-capture/
Azure Data Factory Data Flow: Change Data Capture Architecture & Demo: https://sarnendude.com/azure-data-factory-data-flow-support-change-data-capture/
Implement Change Data Capture resource (CDC):
Step by Step Demo to capture change data in Azure Data Factory:
Let's start the step-by-step demo on how to capture change data in Azure Data Factory using the Change Data Capture resource.
In this demo, we capture change data from Azure Data Lake Storage Gen2 and load it into an Azure SQL Database.
Prerequisites:
To create a Change Data Capture resource and monitor its activity, we need the following:
- Azure storage account: for the data source, we need data in a source container. For the demo, we will use a CSV file as the source (a sample is sketched after this list).
- Azure SQL Database: for the destination, we need a database and a target Sales table.
- Azure Data Factory instance with the respective linked services for the data source and target.
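For reference, here is a minimal sketch that generates a small sample source file of the kind used in this demo; the column names and the dbo.Sales schema in the comment are assumptions for illustration only, not the exact file or table from the screenshots.

```python
import csv

# ASSUMPTION: illustrative columns; the demo only requires a CSV with
# four columns and three rows, matching the preview shown later in Step 5.
rows = [
    {"SalesId": 1, "Product": "Laptop",  "Quantity": 2, "Amount": 1800},
    {"SalesId": 2, "Product": "Monitor", "Quantity": 1, "Amount": 250},
    {"SalesId": 3, "Product": "Mouse",   "Quantity": 5, "Amount": 100},
]

with open("Sales_initial.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

# A matching target table in Azure SQL Database could look like:
#   CREATE TABLE dbo.Sales (
#       SalesId  INT,
#       Product  NVARCHAR(100),
#       Quantity INT,
#       Amount   DECIMAL(18, 2)
#   );
```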
Step 1: Click ‘Change Data Capture’ resource
First, let us open Azure Data Factory Studio in the Azure portal and hover over the 'Change Data Capture' resource under Factory Resources, as shown in the image below.
Once we hover over the 'Change Data Capture' resource, a three-dot option appears.
Clicking on those three dots provides the 'New mapping (preview)' option, as shown below.
Step 2: Click 'New mapping (preview)'
Once we click the 'New mapping (preview)' option, it opens a mapping window on the right side and creates a data flow debug session, as shown in the image below.
Step 3: Source set up
Now let us set up the mapping with source details:
- 3A: Provide a proper mapping name, like CSV2SQLDBMapping, since we are mapping a CSV source to a SQL target.
- 3B: As the source data is in a CSV file, select the Delimited Text source type from the supported source types mentioned above.
- 3C: Select the respective linked service for the source type that we already created as part of the prerequisites.
- 3D: Next, browse and select the source container or folder path to get the CSV file.
As shown in the image above, once we browse the storage account we get the option to choose the source folder.
Step 4: Choose source folder
Here, let us choose the cdcsales folder under the root folder, as shown in the image below.
This is an important step because this is the source container from which change data will be captured.
The image below shows the initial CSV file in the cdcsales folder of the storage account. This file will be processed first.
Please note that this first source file contains only 3 rows, as we will see when we preview the data in Step 5.
Step 5: Preview Source data
As shown in the image below, once we select the source cdcsales folder, we get the option (inside the red circle) to preview the source data.
On the left side, we can preview the source data, which contains four columns and three rows.
Step 6: Choose Target
Please refer to 6A, 6B, 6C & 6D in the image below for each step:
- 6A: Here, select Azure SQL Database as the target.
- 6B: Choose the respective linked service for Azure SQL Database that we created earlier.
- 6C: Select the dbo.Sales table where we will load the change data.
- 6D: Click the Continue button to go to the next mapping page.
Step 7: Preview Target
Once we select the target dbo.Sales table, we have the option to preview the table data. As this table is currently empty, there is no data in the preview section, as shown in the image below.
Click the Continue button to go to the next mapping page.
Step 8: Mapping Source and Target
In the image below we can see:
- A new Change Data Capture resource, CSV2SQLDBMapping, is created on the left side under Factory Resources.
- The source and target are mapped.
- This mapping is not published yet.
- This mapping is also not running.
- We can also add a new mapping by clicking the + New mapping icon.
- Please note that the Start button is disabled now; once we publish at Step 12, it will be enabled.
Step 9: Mapping the Columns: Auto Mapping vs Column Mapping
Here we have two options to map the columns:
- Auto Mapping:
- This is the default and automatically maps the source and target columns.
- It supports schema drift, tracking column changes between individual polling intervals.
- Please refer to the image below, where the Auto map toggle is highlighted.
- Column Mapping: If we move the toggle to turn off auto-mapping, the Column Mapping option becomes available, as shown in the subsequent image.
Now select the column mapping option (inside the red-circled area) under the Columns mapped label in the image below to open the column mapping details shown in the subsequent image.
Step 10: Add or Update Column Mapping
Here we can:
- See the mapping method and the mapping between source and target columns.
- Update the mapping method; by default, the mapping method is Direct.
There are several mapping methods, as illustrated conceptually after this list:
- Direct
- Derived: This method can be trim, upper, lower, or advanced.
- Aggregation: available methods are first, sum, maximum, minimum, last, and standard deviation.
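The sketch below is only a conceptual illustration (in pandas) of what these mapping methods do to the data; it is not ADF's expression syntax, and the column names are assumed from the sample Sales file above.

```python
import pandas as pd

# ASSUMPTION: column names follow the sample Sales CSV used earlier.
src = pd.DataFrame({
    "Product": ["laptop", "monitor", "mouse"],
    "Amount": [1800, 250, 100],
})

# Direct: the target column receives the source column unchanged.
direct = src["Amount"]

# Derived (e.g. upper): a transformation is applied before writing.
derived_upper = src["Product"].str.upper()

# Aggregation (e.g. sum): source rows are aggregated into a single value.
aggregated_sum = src["Amount"].sum()

print(derived_upper.tolist(), aggregated_sum)
```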
Step 11: Set Latency of Mapping
Here we can set the latency of the mapping, i.e., how frequently this mapping will run.
Frequency of Mapping execution:
We can set the latency to control how frequently change data will be captured; the currently available options are Real-time, 15 minutes, 30 minutes, 1 hour, and 2 hours.
For this demo, let us keep the micro-batch frequency of 15 minutes, i.e., the mapping job will run every 15 minutes, get the change data from the source, and insert it into the target.
Step 12: Publish the Mapping
To start the mapping, we first have to publish the new mapping by clicking the Publish button, as shown in the image below.
Step 13: Start the Mapping
Once we publish the mapping in Step 12, we can see that Start is now enabled. The Start button is circled in red in the image below.
We can also see the message displayed as 'This mapping is not running.'
Now let us start the mapping by clicking the Start button.
Step 14: Mapping is running:
Once we start the mapping:
- The text of the 'Start' button becomes 'Stop'.
- The message now shows as 'This mapping is currently running.'
Step 15: Monitoring the Mapping
Now, to monitor the running mapping, let us go to:
- The Monitor tab.
- Under Runs, click on "Change data capture".
- Here we can see the mapping name along with the source & target, and the status is Running.
Step 16: Monitoring the Mapping
Once we click on the mapping name 'CSV2SQLDBMapping' in Step 15 above, we can see the mapping run details, as shown in the image below.
As no processing has been done yet:
- CHANGES READ is 0
- CHANGES WRITTEN is 0
- Last Processed value is empty
Step 17: Monitoring – 1st Processing
As we can see in the monitoring details of the 1st run, CHANGES READ is 3 and CHANGES WRITTEN is also 3. This is because our source file contains only 3 rows.
Step 18: Monitoring – 1st Processing details
Now let us see the details of the 1st processing below. Here we can see:
- Every run instance.
- Each processing instance, shown as '1st processing' in the image below.
- Diagnostics details like:
- Sink name: sales
- Status: succeeded
- Processing time
- Changes written: 3
Each processing instance means change data was captured and inserted into the target; it is displayed as below, with the date and time of processing.
Not every run turns into a processing instance; only when change data is captured does it show as a processing instance.
Step 19: Verify Target table after 1st processing
Now let us run a SELECT query against the Sales table. Earlier, we saw in the preview section that this table was empty.
Now we can see the three records inserted in the 1st processing.
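If you prefer to verify from a script rather than the query editor, here is a minimal pyodbc sketch; the server, database, and credential values are placeholders to replace with your own.

```python
import pyodbc

# ASSUMPTION: placeholder connection details; replace with your own server,
# database, and credentials (or switch to Azure AD authentication).
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<your-server>.database.windows.net;"
    "DATABASE=<your-database>;"
    "UID=<user>;PWD=<password>"
)
cursor = conn.cursor()

cursor.execute("SELECT COUNT(*) FROM dbo.Sales")
print("Rows in dbo.Sales:", cursor.fetchone()[0])  # expect 3 after the 1st processing

cursor.execute("SELECT * FROM dbo.Sales")
for row in cursor.fetchall():
    print(row)

conn.close()
```

The same check can be re-run after the 2nd processing in Step 23, where the row count should grow to 6.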
Step 20: Upload change data to the source
Now, to test change data capture, let us upload a new Sales_08012028.csv file into the source cdcsales folder, as shown below.
This CSV file contains 3 rows.
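If you want to script the upload instead of using the portal, here is a minimal sketch using the azure-storage-file-datalake SDK; the account URL, credential, and container name are placeholders.

```python
from azure.storage.filedatalake import DataLakeServiceClient

# ASSUMPTION: placeholder account, key, and container names; replace with your own.
service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential="<account-key>",
)
file_system = service.get_file_system_client(file_system="<container>")
file_client = file_system.get_file_client("cdcsales/Sales_08012028.csv")

# Upload the new change file into the monitored cdcsales folder.
with open("Sales_08012028.csv", "rb") as data:
    file_client.upload_data(data, overwrite=True)
```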
Step 21: Monitoring – 2nd Processing
Now we can see that after the 2nd processing, CHANGES READ is 6 instead of the earlier 3, and CHANGES WRITTEN is also 6 instead of the earlier 3.
This is because 3 rows were written in the 1st processing and 3 rows in the 2nd processing, so a total of 6 rows were written.
Step 22: Monitoring – 2nd Processing details
Now let us see the details of the 2nd processing below. Here we can also see:
- Every run instance.
- Each processing instance, shown as '1st processing' and '2nd processing' in the image below.
- Diagnostics details like:
- Sink name: sales
- Status: succeeded
- Processing time
- Changes written: 6
Step 23: Verify Target table after 2nd processing
Now let us run the SELECT query against the Sales table again. Earlier, after the 1st processing, the table contained 3 rows.
Now we can see more than three records (six in total) after the 2nd processing.
So, in this tutorial we performed a step-by-step demo with the new Change Data Capture resource of Azure Data Factory, which captures changed sales data from an Azure Data Lake Storage Gen2 source into an Azure SQL Database.
More Demos on Change Data Capture and Change Data Feed
Azure Synapse Link for Azure SQL Database using Change Feed : Demo: https://sarnendude.com/azure-synapse-link-for-azure-sql-database-using-change-feed-demo/
Delta Lake’s Change Data Feed (CDF) Demo in Azure Databricks: https://sarnendude.com/delta-lakes-change-data-feed-cdf-demo-in-azure-databricks/
Azure Data Factory data flows: ‘incremental extract’ feature allows to read only changed rows on Azure SQL DB sources: https://sarnendude.com/azure-data-factory-data-flows-incremental-extract-feature-allows-to-read-only-changed-rows-on-azure-sql-db-sources/
Azure Data Factory Data Flow: Change Data Capture Architecture & Demo: https://sarnendude.com/azure-data-factory-data-flow-support-change-data-capture/