Azure Data Factory now processes data in real time using Change Data Capture (CDC)

In this article, we walk through a step-by-step demo of how Azure Data Factory (ADF) processes data in real time using the Change Data Capture (CDC) feature.

Change Data Capture is a native top-level resource in Azure Data Factory. It configures a continuous process that automatically captures changes from one or more data sources at a chosen latency.

“ADF CDC processes are light-weight always-running (not batch) data processing with a latency option”

Change Data Capture Resource in Azure Data Factory: Important points
  • Latency: How frequently the resource looks for changes. The latency options are Real Time, 15 minutes, 30 minutes, 1 hour and 2 hours.
    • The CDC process keeps looking for changes on the sources until we stop it.
    • For Real Time latency: while monitoring, we can see checkpoints occur every few seconds as ADF continues to monitor the sources for changes.
  • Easy to implement:
    • We do not need to build a pipeline or set a trigger to run the CDC resource.
    • Hence there is no need to learn concepts such as triggers, schedules or integration runtimes, or to design Data Factory pipelines or data flows.
  • Cost: The CDC resource uses a 4-core General Purpose data flow cluster, which bills while data is being processed, based on the selected latency.
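
The latency options and the billing note above can be summarized in a small Python sketch. The option labels and the 4-core figure come from the list above; the names `LATENCY_OPTIONS`, `polls_per_day` and `billable_vcore_hours` are hypothetical helpers for illustration, not part of any ADF SDK, and the cores-times-hours formula is only a rough assumption about how usage adds up, not Azure's actual pricing.

```python
from datetime import timedelta
from typing import Optional

# Latency options listed above. Real Time has no fixed interval
# (checkpoints occur every few seconds), so it is modeled as None.
LATENCY_OPTIONS: dict[str, Optional[timedelta]] = {
    "Real Time": None,
    "15 minutes": timedelta(minutes=15),
    "30 minutes": timedelta(minutes=30),
    "1 hour": timedelta(hours=1),
    "2 hours": timedelta(hours=2),
}

CDC_CLUSTER_CORES = 4  # 4-core General Purpose data flow cluster (from the text)


def polls_per_day(latency: str) -> Optional[int]:
    """Number of polling intervals in 24 hours, or None for Real Time."""
    interval = LATENCY_OPTIONS[latency]
    if interval is None:
        return None
    return int(timedelta(days=1) / interval)


def billable_vcore_hours(processing_hours: float) -> float:
    """Hypothetical estimate: cores multiplied by hours spent processing data."""
    return CDC_CLUSTER_CORES * processing_hours
```

For example, `polls_per_day("15 minutes")` gives 96 polling intervals per day, which is one way to reason about how often the cluster may be asked to process changes.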

We can also use the Change Data Capture resource to implement incremental ETL or incremental loads in ADF, with the frequency governed by the chosen latency.


Demo Scenario for real time CDC in Azure Data Factory (ADF Change Data Capture):

We will upload 2 CSV files after starting the mapping, so that the Change Data Capture resource tracks the changes made in the source container, captures them and processes them in near real time, within seconds of the upload.

Azure Data Factory Change Data Capture Monitoring Details
For demo on Change Data Capture of 15 minutes latency:

Please visit this link: https://sarnendude.com/how-to-implement-cdc-in-azure-data-factory-using-a-change-data-capture-resource/

Implement Change Data Capture resource (CDC):
Step by Step Demo to capture change data in Azure Data Factory:

In this demo, we will implement CDC in Azure Data Factory using a Change Data Capture resource to capture changed data from an Azure Data Lake Storage Gen2 source into an Azure SQL Database in real time.

Change Data Capture resource (CDC) Setup Steps (Azure Data Factory Change Data Capture):

  • Create ‘Change Data Capture’ resource
  • Source set up
  • Target set up
  • Mapping Source and Target
  • Mapping the Columns: Auto Mapping vs Column Mapping
  • Set Latency of Mapping
  • Publish the Mapping
  • Start the Mapping
  • Monitoring the Mapping

Except for the “Set Latency of Mapping” step, we already performed all of the above steps in the demo on Change Data Capture with 15 minutes latency.

Please visit this link for the step-by-step setup, then come back here for the real time option setup and monitoring: https://sarnendude.com/how-to-implement-cdc-in-azure-data-factory-using-a-change-data-capture-resource/#Implement-Change-Data-Capture-resource-ADF-CDC

Assuming you have followed each of the steps at the above link, let us now set up real time latency.

To speed up CDC development in ADF, an Auto Mapping option was recently introduced. Once we set up the source and the target, we have to set up the column mapping.

Mapping the Columns: Auto Mapping vs Column Mapping

Here we have two options for mapping:

  • Auto Mapping:
    • This is the default and automatically maps the source and target columns.
    • It supports schema drift, tracking column changes between individual polling intervals.
    • Please refer to the image below, where the Auto Map toggle is highlighted.
  • Column Mapping: If we move the toggle to turn off auto mapping, the Column Mapping option becomes available, as shown in the subsequent image:
Azure Data Factory Change Data Capture Auto Mapping
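
To make the idea behind auto mapping and schema drift more concrete, here is a minimal conceptual sketch in Python. It assumes that columns are paired by name (case-insensitively) and that drift is detected by comparing the source schema seen in two consecutive polls. The functions `auto_map` and `drifted_columns` and the column names are hypothetical; this is not how ADF implements the feature internally.

```python
def auto_map(source_columns: list[str], target_columns: list[str]) -> dict[str, str]:
    """Pair each source column with the target column of the same name (case-insensitive)."""
    targets_by_name = {name.lower(): name for name in target_columns}
    return {
        src: targets_by_name[src.lower()]
        for src in source_columns
        if src.lower() in targets_by_name
    }


def drifted_columns(previous: list[str], current: list[str]) -> dict[str, list[str]]:
    """Compare the source schema seen in two consecutive polls."""
    prev, curr = set(previous), set(current)
    return {"added": sorted(curr - prev), "removed": sorted(prev - curr)}


# Hypothetical sales columns, for illustration only.
mapping = auto_map(["SaleId", "Product", "Amount"], ["saleid", "product", "Region"])
drift = drifted_columns(["SaleId", "Product"], ["SaleId", "Product", "Amount"])
```

With explicit column mapping, by contrast, we would write out the source-to-target pairs ourselves instead of deriving them from the names.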

The Column Mapping option is visible inside the red circle in the image below.

Azure Data Factory Change Data Capture Column Mapping

Once we click on the Column Mapping option, the screen below appears with the mapping method along with the source and target column mapping.

This allows us to select the mapping method, update the column mapping and select keys as well.

Azure Data Factory Change Data Capture Column Mapped
Set Latency of Real Time:

To set up real time latency, we select “Real-time” in the Set Latency window and click Apply, as shown below.

Azure Data Factory Change Data Capture Set Latency Real time

Once we apply the real time latency, we have to publish the resource and start the mapping, as described in the relevant steps at the above link.

Note that we have to start the mapping so that ADF can monitor the changes at the source.
Monitoring the Mapping:

Here we will upload 2 CSV files with sales records to the cdcsales container of the Data Lake Gen2 account at intervals and keep monitoring.

To monitor the running mapping, let us go to

  • the Monitor tab
  • under Runs, click on “Change Data Capture”.
  • Here we can see the mapping name along with the source and target, and the status is Running.
  • Click on the mapping name and we can see the image below with the CHANGES READ and CHANGES WRITTEN details:
    • CHANGES READ: 6, because the 2 CSV files contain 6 rows in total.
    • CHANGES WRITTEN: 6 rows written to the target SQL store.
Azure Data Factory Change Data Capture Monitoring
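
As a quick sanity check on the CHANGES READ figure, the following Python sketch builds two small CSV files in memory and counts their data rows. The file names, column names, values and the 3 + 3 split between the files are assumptions made for illustration; the article does not show the actual file contents.

```python
import csv
import io

# Hypothetical sales records; the real demo files are not shown in the article.
HEADER = ("SaleId", "Product", "Amount")
FILES = {
    "sales_1.csv": [("1", "Laptop", "1200"), ("2", "Mouse", "25"), ("3", "Monitor", "300")],
    "sales_2.csv": [("4", "Keyboard", "45"), ("5", "Dock", "150"), ("6", "Webcam", "80")],
}


def build_csv(rows) -> str:
    """Serialize a header plus data rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


def count_data_rows(csv_text: str) -> int:
    """Count rows in CSV text, excluding the header line."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header
    return sum(1 for _ in reader)


csv_files = {name: build_csv(rows) for name, rows in FILES.items()}
changes_read = sum(count_data_rows(text) for text in csv_files.values())
print(changes_read)  # 6
```

The header line of each file is not counted as a change, which is why two files with three data rows each add up to 6.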

Azure Data Factory Change Data Capture Monitoring Details:

In the image below we can see that polling happens every few seconds; each poll is marked by a green square box.

Because we uploaded 2 CSV files after starting the mapping, the Change Data Capture resource tracked the changes made in the source container, captured them and processed them in near real time, within seconds of the upload.

The 2 blue rectangular boxes on the polling intervals marked by green square boxes indicate where changes at the source were tracked and processed.

Azure Data Factory Change Data Capture Monitoring Details
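
The polling behaviour described above can be illustrated with a small checkpoint-based loop in Python. This is only a conceptual sketch under stated assumptions: the `SourceFile` and `CdcPoller` classes, the numeric timestamps and the rule "process files modified after the last checkpoint" are hypothetical and do not describe ADF's internal implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceFile:
    name: str
    modified_at: float  # seconds since an arbitrary start, for illustration
    row_count: int


@dataclass
class CdcPoller:
    checkpoint: float = 0.0
    changes_read: int = 0
    processed: list[str] = field(default_factory=list)

    def poll(self, container: list[SourceFile], now: float) -> int:
        """Process files modified after the last checkpoint; return rows read in this poll."""
        new_files = [f for f in container if self.checkpoint < f.modified_at <= now]
        rows = sum(f.row_count for f in new_files)
        self.processed.extend(f.name for f in new_files)
        self.changes_read += rows
        self.checkpoint = now  # advance the checkpoint even when nothing changed
        return rows


poller = CdcPoller()
container: list[SourceFile] = []
print(poller.poll(container, now=5))    # 0, nothing uploaded yet
container.append(SourceFile("sales_1.csv", modified_at=7, row_count=3))
print(poller.poll(container, now=10))   # 3
container.append(SourceFile("sales_2.csv", modified_at=12, row_count=3))
print(poller.poll(container, now=15))   # 3
print(poller.changes_read)              # 6
```

Most polls find nothing new and only advance the checkpoint, which corresponds to the polls without a blue box in the monitoring view; the two polls that pick up a file correspond to the two blue boxes.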

In this tutorial, we performed a step-by-step demo of the new Change Data Capture resource of Azure Data Factory, which captures changed sales data from an Azure Data Lake Storage Gen2 source into an Azure SQL Database (Azure CDC).

Azure Data Factory's Change Data Capture resource is one of the finest ways to capture changed data in Azure.

DEMO on CDC & other articles from Azure Data Factory, Databricks, Synapse:
Tutorial from Azure Data Factory:

These are important demo articles on Change Data Capture in Azure Data Factory (ADF CDC).

Azure Data Factory Data Flow: Change Data Capture Architecture & Demo: https://sarnendude.com/azure-data-factory-data-flow-support-change-data-capture/

There are many more articles available on the blog about Azure Data Factory CDC.

How to do incremental load in Azure Data Factory (Incremental ETL): https://sarnendude.com/azure-data-factory-data-flows-incremental-extract-feature-allows-to-read-only-changed-rows-on-azure-sql-db-sources/ (This is one of the ways to implement incremental data load in Azure Data Factory).

How to create and use Flowlet transformation in Azure Data Factory and Azure Synapse pipeline: https://sarnendude.com/how-to-create-and-use-flowlet-transformation-in-azure-data-factory-and-azure-synapse-pipeline/

Azure Managed Identity Authentication for Azure SQL Database by Azure Data Factory: https://sarnendude.com/azure-managed-identity-authentication-for-azure-sql-database-by-azure-data-factory/

Cast Transformation DEMO in Mapping Data flow of Azure Data Factory: https://sarnendude.com/cast-transformation-in-mapping-data-flow-of-azure-data-factory-synapse-analytics/

Azure Data Factory and Synapse Analytics provides Script Activity to execute DML & DDL script: https://sarnendude.com/azure-data-factory-and-synapse-analytics-provides-script-activity/

How ABN AMRO Bank uses Data Mesh architecture on Microsoft Azure for faster data insights & business decisions: https://sarnendude.com/azure-data-mesh-architecture-abn-amro-bank/ (ABN AMRO Data Mesh)

Azure Data Factory end to end Data Lineage demo using Azure Purview: https://sarnendude.com/azure-data-factory-end-to-end-data-lineage-demo-using-azure-purview/

Azure Data Factory data flows: ‘incremental extract’ feature allows to read only changed rows on Azure SQL DB sources: https://sarnendude.com/azure-data-factory-data-flows-incremental-extract-feature-allows-to-read-only-changed-rows-on-azure-sql-db-sources/

Tutorial from PySpark and Azure Databricks:

Delta Lake’s Change Data Feed (CDF) Demo in Azure Databricks: https://sarnendude.com/delta-lakes-change-data-feed-cdf-demo-in-azure-databricks/

How to Read CSV file in PySpark easily in Azure Databricks: https://sarnendude.com/how-to-read-csv-file-in-pyspark-easily-and-load-into-dataframe/

How to Write CSV file in PySpark easily in Azure Databricks: https://sarnendude.com/how-to-write-csv-file-in-pyspark-easily-in-azure-databricks/

Tutorial from Azure Synapse:

MERGE command in Azure Synapse Analytics Dedicated SQL pool: https://sarnendude.com/merge-command-in-azure-synapse-analytics-dedicated-sql-pool/

Azure Synapse Link for Azure SQL Database using Change Feed : Demo: https://sarnendude.com/azure-synapse-link-for-azure-sql-database-using-change-feed-demo/

Azure Synapse Intelligent Cache for Apache Spark: https://sarnendude.com/azure-synapse-analytics-intelligent-cache-for-apache-spark/

Flowlet transformation in Azure Data Factory and Azure Synapse pipeline: https://sarnendude.com/azure-synapse-analytics-intelligent-cache-for-apache-spark/

Azure Synapse Tutorial: Three In ONE Service: https://sarnendude.com/azure-synapse-tutorial/
