How to implement CDC in Azure Data Factory using a Change Data Capture Resource

In this article, we will walk through a step-by-step demo on how to implement CDC in Azure Data Factory using a Change Data Capture resource to capture change data from an Azure Data Lake Storage Gen2 source into an Azure SQL Database.

Recently, Microsoft added the Change Data Capture feature as a native top-level resource in Azure Data Factory, to configure a continuous process that automatically captures changes from data source(s).

It automatically detects data changes at the source without requiring complex design or coding. Currently this feature is in public preview.

Azure Data Factory: How to capture changed data from ADLS Gen2 to SQL DB using a Change Data Capture resource
Change Data Capture Resource: Important points
  • Ease of implementation: To use this CDC resource, we do not need to learn concepts like triggers, schedules, or integration runtimes, nor do we need to design pipelines or data flows.
  • Latency: It provides two ways to capture change data from the source: real-time, and frequency-based (15 minutes, 30 minutes, 1 hour, or 2 hours).
  • Cost: The CDC resource uses a 4-core General Purpose data flow cluster, which bills only while data is being processed, based on the selected latency.
  • Steps to implement CDC in Azure Data Factory:
    • Set the data source(s)
    • Select the transformation rule; it auto-maps source and target columns based on fuzzy matching
    • Set the destination
    • Set the frequency (real-time or batch)
    • Start the mapping
Supported data sources:
  • Azure: Azure SQL Database & Azure Cosmos DB (SQL API).
  • Database: SQL Server
  • File: Delimited Text (CSV), JSON, Avro, Parquet, ORC, XML
Supported Targets:
  • Azure: Azure SQL Database
  • File: Delimited Text (CSV), JSON, Avro, Parquet, ORC, Delta
Recommended Change Data Capture articles on Azure Data Factory:

How to do incremental load in Azure Data Factory: https://sarnendude.com/azure-data-factory-data-flows-incremental-extract-feature-allows-to-read-only-changed-rows-on-azure-sql-db-sources/

Visit this article for a demo on native SQL Change Data Capture (enabling CDC in Azure SQL Database): https://sarnendude.com/azure-data-factory-data-flow-support-change-data-capture/

Azure Data Factory Data Flow: Change Data Capture Architecture & Demo: https://sarnendude.com/azure-data-factory-data-flow-support-change-data-capture/

Implement Change Data Capture resource (CDC):

Step-by-step demo to capture change data in Azure Data Factory:

So let’s start the step-by-step demo on how to capture change data in Azure Data Factory using the Change Data Capture resource.

In this demo, we capture change data from Azure Data Lake Storage Gen2 and load it into an Azure SQL Database.

Pre-requisites:

To create a change data capture resource and monitor change data capture activity, we need the following:

  • Azure Storage account: for the data source, we need data in the source container. For this demo, we will use a CSV file as the source.
  • Azure SQL Database: for the destination, we need a database and a target Sales table (a minimal sketch follows this list).
  • Azure Data Factory instance, with the respective linked services for the data source and target.
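
For the Sales table pre-requisite, a minimal sketch is shown below. The column names (SaleId, ProductName, Quantity, Amount) are hypothetical, chosen only to match a four-column sales CSV; adjust them to your actual file schema.

    -- Hypothetical target table for this demo (column names are illustrative).
    CREATE TABLE dbo.Sales (
        SaleId      INT            NOT NULL,
        ProductName VARCHAR(100)   NOT NULL,
        Quantity    INT            NOT NULL,
        Amount      DECIMAL(10, 2) NOT NULL
    );
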
Step 1: Click the ‘Change Data Capture’ resource

First, let us open Azure Data Factory Studio in the Azure portal and hover over the ‘Change Data Capture’ resource under Factory Resources, as shown in the image below.

Once we hover over the ‘Change Data Capture’ resource, we can see a three-dot option.

Azure Data Factory Change Data Capture CDC mouse hover

Clicking on that three-dot option gives us the ‘New mapping (preview)’ option, as shown below.

Step 2: Click ‘New mapping (preview)’
Azure Data Factory Change Data Capture CDC New Mapping

Once we click on the ‘New mapping (preview)’ option, it opens a mapping window on the right side and creates a data flow debug session, as shown in the image below.

Azure Data Factory Change Data Capture CDC New Mapping create cluster
Step 3: Source setup

Now let us set up the mapping with the source details:

  • 3A: Provide a proper mapping name, like CSV2SQLDBMapping, since we are mapping a CSV source to a SQL target.
  • 3B: As the source data is in a CSV file, select the Delimited Text source type from the supported source types listed above.
  • 3C: Select the respective linked service for the source type that we already created as part of the pre-requisites.
  • 3D: Next, browse and select the source container or folder path to pick up the CSV file.

Azure Data Factory Change Data Capture New Mapping source setting

As shown in the image above, once we browse the storage account, we get the option to choose the source folder.

Step 4: Choose source folder

Here let us choose the cdcsales folder under the root folder, as shown in the image below.

This is an important step in how to implement CDC in Azure Data Factory, because this is the source container from which change data will be captured.

Azure Data Factory Change Data Capture New Mapping source folder select

The image below shows the initial CSV file in the cdcsales folder of the storage account. This file will be processed first.

Please note, this first source file contains only 3 rows, as we will see when we preview the data in Step 5.

Azure Data Factory Change Data Capture Source CSV File one
Step 5: Preview Source data

As shown in the image below, once we select the source cdcsales folder, we get the option (inside the red circle) to preview the source data.

On the left side, we can preview the source data, which contains four columns and three rows.
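
For illustration, the source file could look like the sample below. The column names and values are hypothetical placeholders, standing in for the four columns and three rows shown in the preview.

    SaleId,ProductName,Quantity,Amount
    1,Laptop,2,1500.00
    2,Monitor,1,220.50
    3,Keyboard,3,75.00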

Azure Data Factory Change Data Capture New Mapping preview source data
Step 6: Choose Target

Please refer to 6A, 6B, 6C & 6D in the image below for each step:

  • 6A: Here let us select Azure SQL Database as the target.
  • 6B: Choose the respective linked service for Azure SQL Database that we created earlier.
  • 6C: Now select the dbo.Sales table where we will load the change data.
  • 6D: Click the Continue button for the next mapping page.
Azure Data Factory Change Data Capture New Mapping set target
Step 7: Preview Target

Once we select the target dbo.Sales table, we have the option to preview the table data. As this table is empty for now, there is no data in the preview section, as shown in the image below.

Click the Continue button for the next mapping page.

Azure Data Factory Change Data Capture New Mapping target preview
Step 8: Mapping Source and Target

Here, in the image below, we can see:

  • A new Change Data Capture resource, CSV2SQLDBMapping, has been created on the left side under Factory Resources.
  • The source and target are mapped.
  • This mapping is not published yet.
  • This mapping is also not running.
  • We can also add a new mapping by clicking the + New mapping icon.
  • Please note, the Start button is disabled for now; once we publish at Step 12, it will be enabled.
Azure Data Factory Change Data Capture Source and Target Mapping
Step 9: Mapping the Columns: Auto Mapping vs Column Mapping

Here we have two options to map:

  • Auto Mapping:
    • This is the default; it automatically maps the source and target columns.
    • It supports schema drift, to track column changes between individual polling intervals.
    • Please refer to the image below, where the Auto map toggle is highlighted.
  • Column Mapping: If we move the toggle to turn off auto-mapping, the Column Mapping option becomes available, as shown in the subsequent image.
Azure Data Factory Change Data Capture Auto Mapping

Now select the column mapping option (inside the red-circled area) under the ‘Column mapped’ label in the image below to open the column mapping details shown in the subsequent image.

Azure Data Factory Change Data Capture Column Mapping
Step 10: Add or Update Column Mapping

Here we can:

  • see the mapping method and the mapping between the source & target columns.
  • update the mapping method; by default, the mapping method is Direct.

There are several mapping methods, as listed below (see the conceptual sketch after this list):

  • Direct
  • Derived: this method can be trim, upper, lower, or advanced.
  • Aggregation: available methods are first, sum, maximum, minimum, last, and standard deviation.
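
Conceptually, the Derived and Aggregation methods behave like familiar SQL expressions applied at the sink. The sketch below illustrates the effect using the hypothetical ProductName and Amount columns from earlier; it is not ADF mapping syntax.

    -- Derived (upper): the target column receives a transformed value.
    SELECT UPPER(ProductName) AS ProductName FROM dbo.Sales;

    -- Aggregation (sum): the target column receives an aggregate over the captured rows.
    SELECT SUM(Amount) AS Amount FROM dbo.Sales;
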
Azure Data Factory Change Data Capture Column Mapping
Step 11: Set Latency of Mapping

Here we can set the latency of the mapping, i.e. how frequently this mapping will run.

Frequency of Mapping execution:

We can set the latency for how frequently change data will be captured. The currently available options are Real-time, 15 minutes, 30 minutes, 1 hour, and 2 hours.

For this demo, let us keep micro-batch with a frequency of 15 minutes, i.e. the mapping job will run every 15 minutes, pick up the change data from the source, and insert it into the target.

Azure Data Factory Change Data Capture Set Latency
Step 12: Publish the Mapping

To start the mapping, we have to publish the new mapping by clicking the Publish button, as shown in the image below.

Azure Data Factory Change Data Capture mapping publish
Step 13: Start the Mapping

Once we publish the mapping at Step 12, we can see that the Start button is now enabled; it is red-circled in the image below.

We can also see the status displayed as ‘This mapping is not running’.

Now let us start the mapping by clicking the Start button.

Azure Data Factory Change Data Capture start
Step 14: Mapping is running:

Once we start the mapping:

  • the text of the ‘Start’ button becomes ‘Stop’.
  • the message now shows as ‘This mapping is currently running’.
Azure Data Factory Change Data Capture started
Step 15: Monitoring the Mapping

Now, to monitor the running mapping, let us go to:

  • the Monitor tab
  • under Runs, click on ‘Change data capture’
  • here we can see the mapping name along with the source & target, and the status is Running.
Azure Data Factory Change Data Capture monitor initial
Step 16: Monitoring the Mapping

Once we click on the mapping name ‘CSV2SQLDBMapping’ in Step 15 above, we can see the mapping run details, as shown in the image below.

As no processing has been done yet:

  • CHANGES READ is 0
  • CHANGES WRITTEN is 0
  • Last Processed value is empty
Azure Data Factory Change Data Capture monitor initial details
Step 17: Monitoring – 1st Processing

As we can see in the monitoring details of the 1st run, CHANGES READ is 3 and CHANGES WRITTEN is also 3. This is because our source file contains only 3 rows.

Azure Data Factory Change Data Capture 1st Processing
Step 18: Monitoring – 1st Processing details

Now let us see the details of the 1st processing below. Here we can see:

  • every run instance
  • each processing instance, shown as ‘1st processing’ in the image below
  • diagnostics details, like:
    • Sink name: sales
    • Status: succeeded
    • Processing time
    • Changes written: 3

‘Each processing’ means that when change data is captured and inserted into the target, it is displayed as below, with the date and time of processing.

Not every run turns into a processing; a run shows as a processing only if change data was captured.

Azure Data Factory Change Data Capture 1st details
Step 19: Verify Target table after 1st processing

Now let us run a select query on the Sales table. Earlier, we saw in the preview section that this table was empty.

Now we can see the three records inserted in the 1st processing.
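
A simple verification query, assuming the hypothetical schema sketched in the pre-requisites:

    -- Expect the 3 rows from the initial CSV file.
    SELECT * FROM dbo.Sales;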

Azure Data Factory Change Data Capture 1st target
Step 20: Upload change data in source

Now, to test change data capture, let us upload a new Sales_08012028.csv file into the source cdcsales folder, as shown below.

This CSV file contains 3 rows.
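
For illustration, the new file could contain rows like the sample below (hypothetical values, continuing the earlier sample schema):

    SaleId,ProductName,Quantity,Amount
    4,Mouse,5,125.00
    5,Headset,2,180.00
    6,Webcam,1,95.00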

Azure Data Factory Change Data Capture Source CSV File 2nd file land at source
Step 21: Monitoring – 2nd Processing

Now we can see that after the 2nd processing, CHANGES READ is 6 instead of the earlier 3, and CHANGES WRITTEN is also 6 instead of the earlier 3.

This is because 3 rows were written in the 1st processing and 3 more in the 2nd processing, so 6 rows were written in total.

Azure Data Factory Change Data Capture 2nd Processing
Step 22: Monitoring – 2nd Processing details

Now let us see the details of the 2nd processing below. Here we can also see:

  • every run instance
  • each processing instance, shown as ‘1st processing’ and ‘2nd processing’ in the image below
  • diagnostics details, like:
    • Sink name: sales
    • Status: succeeded
    • Processing time
    • Changes written: 6
Azure Data Factory Change Data Capture 2nd processing details
Step 23: Verify Target table after 2nd processing

Now let us run the select query on the Sales table again. Earlier, we saw that after the 1st processing the table contained 3 rows.

Now we can see six records after the 2nd processing.
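
A quick count query confirms the total, again assuming the hypothetical schema:

    -- Expect 6: 3 rows from the 1st processing plus 3 from the 2nd.
    SELECT COUNT(*) AS TotalRows FROM dbo.Sales;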

Azure Data Factory Change Data Capture 2nd target verification

So, in this tutorial, we performed a step-by-step demo of the new Change Data Capture resource in Azure Data Factory, capturing changed sales data from an Azure Data Lake Storage Gen2 source into an Azure SQL Database.

More demos on Change Data Capture and Change Data Feed:

Azure Synapse Link for Azure SQL Database using Change Feed : Demo: https://sarnendude.com/azure-synapse-link-for-azure-sql-database-using-change-feed-demo/

Delta Lake’s Change Data Feed (CDF) Demo in Azure Databricks: https://sarnendude.com/delta-lakes-change-data-feed-cdf-demo-in-azure-databricks/

Azure Data Factory data flows: ‘incremental extract’ feature allows to read only changed rows on Azure SQL DB sources: https://sarnendude.com/azure-data-factory-data-flows-incremental-extract-feature-allows-to-read-only-changed-rows-on-azure-sql-db-sources/

Azure Data Factory Data Flow: Change Data Capture Architecture & Demo: https://sarnendude.com/azure-data-factory-data-flow-support-change-data-capture/
