In this article we will have two real time demo on how Azure Data Factory Data Flow implements Change Data Capture feature. Azure Data Factory and Azure Synapse Mapping Data Flow implements CDC in two ways :
- Native SQL Change Data Capture: for this, continue reading this blog & refer YouTube demo video .
- Incremental Extract :CDC DEMO link https://sarnendude.com/azure-data-factory-data-flows-incremental-extract-feature-allows-to-read-only-changed-rows-on-azure-sql-db-sources/
Let’s understand below details first:
- Native SQL Change Data Capture features in Azure Data Factory Data Flow
- What scenario SQL native CDC enables ?
- CDC Architecture: Behind the scenes
- How is Change Data Capture (CDC) enabled at SQL source?
- Checkpoint in ADF for CDC
- CDC: Best Practice for Incremental ETL
- YouTube Video: CDC Architecture and Incremental ETL Demo
Let’s get connected:
- Please subscribe to this blog;go to right-top corner of this post at “Email Subscription” section.
- For demo video, Please subscribe to my YouTube channel: https://lnkd.in/dsF59A8n;
Native SQL Change Data Capture feature in Azure Data Factory Mapping Data Flow:
- Azure Data Factory & Azure Synapse both supports this SQL native change data capture CDC.
- Supported SQL Source: Native SQL CDC supports available when sources are SQL Server, Azure SQL Database & SQL Manage Instance.
- Enable native change data capture(Preview): It allows Azure Data Factory mapping data flow automatically to detect, and extract changed data or delta data since the pipeline was last executed.
- Delta data is changed data including row insert, update and deletion in SQL stores.
- No customization for CDC: No timestamp or ID columns are required to identify the changes since it uses the native change data capture technology in the databases.
- Start reading from beginning : Once enabled, it allows ADF to extract all source data in initial load and change data in subsequent pipeline run.
CDC Architecture and Incremental ETL Demo using Azure Data Factory native SQL CDC feature:
SQL native CDC enables below scenario:
- Auto replication of data: Auto sync of source SQL database to destination database:
- Incremental ETL : support incremental ETL by applying data transformation logic in between from source and destination in both azure data factory data flow and azure synapse data flow.
Demo on Script Activity in Azure Data Factory and Synapse Analytics : https://sarnendude.com/azure-data-factory-and-synapse-analytics-provides-script-activity/
CDC Architecture: Behind the scenes
- Once CDC is Enabled, Capture instance is created which is consists of a change table & up to two query functions.
- Change Data Capture (CDC) uses SQL Server agent to capture insert, update, and delete records of a source table in Change table via Capture process.
- For Azure SQL Database, a CDC scheduler invokes stored procedures to start periodic capture & clean up of the CDC tables.
- Query functions: Data consumer uses Table-valued functions to access to the change data.
- It enables incremental ETL.
How is Change Data Capture (CDC) enabled at SQL source ?
We have to enable CDC at database and table level.
How is Change Data Capture (CDC) enabled Database level?
We can enable change data capture on database using sys.sp_cdc_enable_db system stored procedure as shown at below image:
How is Change Data Capture (CDC) enabled at Table level ?
We can enable change data capture on table using sys.sp_cdc_enable_table system stored procedure and pass the required parameter like schema, table name, role as shown at below image.
Checkpoint in ADF for CDC:
- Once native CDC is enabled, it creates checkpoint in ADF.
- Checkpoint key: guid unique value identify the pipeline.
- Checkpoint key is used to set the checkpoint when data flow is used for changed data capture.
- Please refer Azure Data Factory data flow activity image.
CDC: Best Practice for Incremental ETL
- When source is CDC Supported: Native change data capture is always recommended as the simplest way for you to get change data enabling incremental ETL.
- Easy setup: it only takes few clicks to enable CDC which support incremental ETL.
- No customization: No timestamp or ID columns are required to identify the changes, so faster development with lesser issue.
- No Source System impact: It also brings much less burden on source database when ADF extracts the change data for further processing.
- Thus, no impact on the transactions at the source application.
- Non-supported Connector as Source: Microsoft recommend to check auto incremental extraction option where we only need to input incremental column to capture the changes.
Recommended article from Azure Data Factory and Azure Synapse:
Azure Synapse Link for Azure SQL Database using Change Feed Demo: https://sarnendude.com/azure-synapse-link-for-azure-sql-database-using-change-feed-demo/
Azure Managed Identity Authentication for Azure SQL Database by Azure Data Factory: https://sarnendude.com/azure-managed-identity-authentication-for-azure-sql-database-by-azure-data-factory/
Thanks for reading the article; Please feel free to share your queries/though via comment box below.