In this article, we will walk through a step-by-step demo of how to implement CDC in Azure Data Factory using a Change Data Capture resource to capture change data from an Azure Data Lake Storage Gen2 source into an Azure SQL Database.
Microsoft recently added Change Data Capture as a native top-level resource in Azure Data Factory, allowing you to configure a continuous process that automatically captures changes from data source(s).
It automatically detects data changes at the source without requiring complex design or coding. Currently, this feature is in public preview.
Change Data Capture Resource: Important points
- Learning curve: To use this CDC resource, we do not need to learn concepts like triggers, schedules, or integration runtimes, and we do not need to design pipelines or data flows.
- Latency: It captures change data from the source in two modes: real-time and frequency-based (15 minutes, 30 minutes, 1 hour, or 2 hours).
- Cost: The CDC resource uses a 4-core General Purpose data flow cluster, which bills while data is being processed based on the selected latency (see the rough cost sketch after this list).
- Steps to implement CDC in Azure Data Factory:
- Set the data source(s)
- Select the transformation rule; it auto-maps source and target columns based on fuzzy matching
- Set the destination
- Set the frequency (real-time or batch)
- Start the mapping
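As a rough illustration of how the selected latency drives the cost mentioned above, here is a minimal sketch assuming a placeholder hourly rate and run duration for the 4-core General Purpose cluster; the real numbers depend on your region and the current Azure Data Factory pricing.

```python
# Rough cost sketch for a frequency-based CDC mapping (not a pricing reference).
# ASSUMPTION: hourly_rate and minutes_per_run are placeholder values.
hourly_rate = 1.00              # hypothetical $/hour for the 4-core cluster
minutes_per_run = 5             # hypothetical processing time per micro-batch
runs_per_day = 24 * 60 // 15    # one run every 15 minutes

daily_cost = runs_per_day * (minutes_per_run / 60) * hourly_rate
print(f"{runs_per_day} runs/day, estimated cost: ${daily_cost:.2f}/day")
```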
Supported data sources:
- Azure: Azure SQL Database & Azure Cosmos DB (SQL API).
- Database: SQL Server
- File: Delimited Text (CSV), JSON, Avro, Parquet, ORC, XML
Supported Targets:
- Azure: Azure SQL Database
- File: Delimited Text (CSV), JSON, Avro, Parquet, ORC, Delta
Recommended Change Data Capture articles on Azure Data Factory:
How to do incremental load in Azure Data Factory: https://sarnendude.com/azure-data-factory-data-flows-incremental-extract-feature-allows-to-read-only-changed-rows-on-azure-sql-db-sources/
Visit this article for a demo on native SQL Change Data Capture (Enable CDC in Azure SQL Database): https://sarnendude.com/azure-data-factory-data-flow-support-change-data-capture/
Azure Data Factory Data Flow: Change Data Capture Architecture & Demo: https://sarnendude.com/azure-data-factory-data-flow-support-change-data-capture/
Implement Change Data Capture resource (CDC):
Step by Step Demo to capture change data in Azure Data Factory:
Let's start the step-by-step demo on how to capture change data in Azure Data Factory using the Change Data Capture resource.
In this demo, we capture change data from Azure Data Lake Storage Gen2 and load it into an Azure SQL Database.
Prerequisites:
To create a Change Data Capture resource and monitor its activity, we need the following:
- Azure storage account: for the data source, we need data in a source container. For the demo, we will use a CSV file as the source (a sample is sketched after this list).
- Azure SQL Database: for the destination, we need a database and a target Sales table.
- Azure Data Factory instance with the respective linked services for the data source and target.
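For reference, here is a minimal sketch that generates a small sample source file of the kind used in this demo; the column names and the dbo.Sales schema in the comment are assumptions for illustration only, not the exact file or table from the screenshots.

```python
import csv

# ASSUMPTION: illustrative columns; the demo only requires a CSV with
# four columns and three rows, matching the preview shown later in Step 5.
rows = [
    {"SalesId": 1, "Product": "Laptop",  "Quantity": 2, "Amount": 1800},
    {"SalesId": 2, "Product": "Monitor", "Quantity": 1, "Amount": 250},
    {"SalesId": 3, "Product": "Mouse",   "Quantity": 5, "Amount": 100},
]

with open("Sales_initial.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

# A matching target table in Azure SQL Database could look like:
#   CREATE TABLE dbo.Sales (
#       SalesId  INT,
#       Product  NVARCHAR(100),
#       Quantity INT,
#       Amount   DECIMAL(18, 2)
#   );
```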
Step 1: Click ‘Change Data Capture’ resource
First, let us open Azure Data Factory Studio in the Azure portal and hover over the 'Change Data Capture' resource under Factory Resources, as shown in the image below.
Once we hover over the 'Change Data Capture' resource, a three-dot option appears.
Clicking on those three dots provides the 'New mapping (preview)' option, as shown below.
Step 2: Click 'New mapping (preview)'
Once we click the 'New mapping (preview)' option, it opens a mapping window on the right side and creates a data flow debug session, as shown in the image below.
Step 3: Source set up
Now let us set up the mapping with source details:
- 3A: Provide a proper mapping name, like CSV2SQLDBMapping, since we are mapping a CSV source to a SQL target.
- 3B: As the source data is in a CSV file, select the Delimited Text source type from the supported source types mentioned above.
- 3C: Select the respective linked service for the source type that we already created as part of the prerequisites.
- 3D: Next, browse and select the source container or folder path to get the CSV file.
As shown in the image above, once we browse the storage account we get the option to choose the source folder.
Step 4: Choose source folder
Here, let us choose the cdcsales folder under the root folder, as shown in the image below.
This is an important step because this is the source container from which change data will be captured.
The image below shows the initial CSV file in the cdcsales folder of the storage account. This file will be processed first.
Please note that this first source file contains only 3 rows, as we will see when we preview the data in Step 5.
Step 5: Preview Source data
As shown in the image below, once we select the source cdcsales folder, we get the option (inside the red circle) to preview the source data.
On the left side, we can preview the source data, which contains four columns and three rows.
Step 6: Choose Target
Please refer to 6A, 6B, 6C & 6D in the image below for each step:
- 6A: Here, select Azure SQL Database as the target.
- 6B: Choose the respective linked service for Azure SQL Database that we created earlier.
- 6C: Select the dbo.Sales table where we will load the change data.
- 6D: Click the Continue button to go to the next mapping page.
Step 7: Preview Target
Once we select the target dbo.Sales table, we have the option to preview the table data. As this table is currently empty, there is no data in the preview section, as shown in the image below.
Click the Continue button to go to the next mapping page.
Step 8: Mapping Source and Target
In the image below we can see:
- A new Change Data Capture resource, CSV2SQLDBMapping, is created on the left side under Factory Resources.
- The source and target are mapped.
- This mapping is not published yet.
- This mapping is also not running.
- We can also add a new mapping by clicking the + New mapping icon.
- Please note that the Start button is disabled now; once we publish at Step 12, it will be enabled.
Step 9: Mapping the Columns: Auto Mapping vs Column Mapping
Here we have two options to map the columns:
- Auto Mapping:
- This is the default and automatically maps the source and target columns.
- It supports schema drift, tracking column changes between individual polling intervals.
- Please refer to the image below, where the Auto map toggle is highlighted.
- Column Mapping: If we move the toggle to turn off auto-mapping, the Column Mapping option becomes available, as shown in the subsequent image.
Now select the column mapping option (inside the red-circled area) under the Columns mapped label in the image below to open the column mapping details shown in the subsequent image.
Step 10: Add or Update Column Mapping
Here we can:
- See the mapping method and the mapping between source and target columns.
- Update the mapping method; by default, the mapping method is Direct.
There are several mapping methods, as illustrated conceptually after this list:
- Direct
- Derived: This method can be trim, upper, lower, or advanced.
- Aggregation: available methods are first, sum, maximum, minimum, last, and standard deviation.
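The sketch below is only a conceptual illustration (in pandas) of what these mapping methods do to the data; it is not ADF's expression syntax, and the column names are assumed from the sample Sales file above.

```python
import pandas as pd

# ASSUMPTION: column names follow the sample Sales CSV used earlier.
src = pd.DataFrame({
    "Product": ["laptop", "monitor", "mouse"],
    "Amount": [1800, 250, 100],
})

# Direct: the target column receives the source column unchanged.
direct = src["Amount"]

# Derived (e.g. upper): a transformation is applied before writing.
derived_upper = src["Product"].str.upper()

# Aggregation (e.g. sum): source rows are aggregated into a single value.
aggregated_sum = src["Amount"].sum()

print(derived_upper.tolist(), aggregated_sum)
```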
Step 11: Set Latency of Mapping
Here we can set the latency of the mapping, i.e., how frequently this mapping will run.
Frequency of Mapping execution:
We can set the latency to control how frequently change data will be captured; the currently available options are Real-time, 15 minutes, 30 minutes, 1 hour, and 2 hours.
For this demo, let us keep the micro-batch frequency of 15 minutes, i.e., the mapping job will run every 15 minutes, get the change data from the source, and insert it into the target.
Step 12: Publish the Mapping
To start the mapping, we first have to publish the new mapping by clicking the Publish button, as shown in the image below.
Step 13: Start the Mapping
Once we publish the mapping in Step 12, we can see that Start is now enabled. The Start button is circled in red in the image below.
We can also see the message displayed as 'This mapping is not running.'
Now let us start the mapping by clicking the Start button.
Step 14: Mapping is running:
Once we start the mapping:
- The text of the 'Start' button becomes 'Stop'.
- The message now shows as 'This mapping is currently running.'
Step 15: Monitoring the Mapping
Now, to monitor the running mapping, let us go to:
- The Monitor tab.
- Under Runs, click on "Change data capture".
- Here we can see the mapping name along with the source & target, and the status is Running.
Step 16: Monitoring the Mapping
Once we click on the mapping name 'CSV2SQLDBMapping' in Step 15 above, we can see the mapping run details, as shown in the image below.
As no processing has been done yet:
- CHANGES READ is 0
- CHANGES WRITTEN is 0
- Last Processed value is empty
Step 17: Monitoring – 1st Processing
As we can see in the monitoring details of the 1st run, CHANGES READ is 3 and CHANGES WRITTEN is also 3. This is because our source file contains only 3 rows.
Step 18: Monitoring – 1st Processing details
Now let us see the details of the 1st processing below. Here we can see:
- Every run instance.
- Each processing instance, shown as '1st processing' in the image below.
- Diagnostics details like:
- Sink name: sales
- Status: succeeded
- Processing time
- Changes written: 3
Each processing instance means change data was captured and inserted into the target; it is displayed as below, with the date and time of processing.
Not every run turns into a processing instance; only when change data is captured does it show as a processing instance.
Step 19: Verify Target table after 1st processing
Now let us run a SELECT query against the Sales table. Earlier, we saw in the preview section that this table was empty.
Now we can see the three records inserted in the 1st processing.
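If you prefer to verify from a script rather than the query editor, here is a minimal pyodbc sketch; the server, database, and credential values are placeholders to replace with your own.

```python
import pyodbc

# ASSUMPTION: placeholder connection details; replace with your own server,
# database, and credentials (or switch to Azure AD authentication).
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<your-server>.database.windows.net;"
    "DATABASE=<your-database>;"
    "UID=<user>;PWD=<password>"
)
cursor = conn.cursor()

cursor.execute("SELECT COUNT(*) FROM dbo.Sales")
print("Rows in dbo.Sales:", cursor.fetchone()[0])  # expect 3 after the 1st processing

cursor.execute("SELECT * FROM dbo.Sales")
for row in cursor.fetchall():
    print(row)

conn.close()
```

The same check can be re-run after the 2nd processing in Step 23, where the row count should grow to 6.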
Step 20: Upload change data to the source
Now, to test change data capture, let us upload a new Sales_08012028.csv file into the source cdcsales folder, as shown below.
This CSV file contains 3 rows.
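If you want to script the upload instead of using the portal, here is a minimal sketch using the azure-storage-file-datalake SDK; the account URL, credential, and container name are placeholders.

```python
from azure.storage.filedatalake import DataLakeServiceClient

# ASSUMPTION: placeholder account, key, and container names; replace with your own.
service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential="<account-key>",
)
file_system = service.get_file_system_client(file_system="<container>")
file_client = file_system.get_file_client("cdcsales/Sales_08012028.csv")

# Upload the new change file into the monitored cdcsales folder.
with open("Sales_08012028.csv", "rb") as data:
    file_client.upload_data(data, overwrite=True)
```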
Step 21: Monitoring – 2nd Processing
Now we can see that after the 2nd processing, CHANGES READ is 6 instead of the earlier 3, and CHANGES WRITTEN is also 6 instead of the earlier 3.
This is because 3 rows were written in the 1st processing and 3 rows in the 2nd processing, so a total of 6 rows were written.
Step 22: Monitoring – 2nd Processing details
Now let us see the details of the 2nd processing below. Here we can also see:
- Every run instance.
- Each processing instance, shown as '1st processing' and '2nd processing' in the image below.
- Diagnostics details like:
- Sink name: sales
- Status: succeeded
- Processing time
- Changes written: 6
Step 23: Verify Target table after 2nd processing
Now let us run the SELECT query against the Sales table again. Earlier, after the 1st processing, the table contained 3 rows.
Now we can see more than three records (six in total) after the 2nd processing.
So, in this tutorial we performed a step-by-step demo with the new Change Data Capture resource of Azure Data Factory, which captures changed sales data from an Azure Data Lake Storage Gen2 source into an Azure SQL Database.
More Demos on Change Data Capture and Change Data Feed
Azure Synapse Link for Azure SQL Database using Change Feed : Demo: https://sarnendude.com/azure-synapse-link-for-azure-sql-database-using-change-feed-demo/
Delta Lake’s Change Data Feed (CDF) Demo in Azure Databricks: https://sarnendude.com/delta-lakes-change-data-feed-cdf-demo-in-azure-databricks/
Azure Data Factory data flows: ‘incremental extract’ feature allows to read only changed rows on Azure SQL DB sources: https://sarnendude.com/azure-data-factory-data-flows-incremental-extract-feature-allows-to-read-only-changed-rows-on-azure-sql-db-sources/
Azure Data Factory Data Flow: Change Data Capture Architecture & Demo: https://sarnendude.com/azure-data-factory-data-flow-support-change-data-capture/