Microsoft Azure Data Factory: Data Integration and Transformation for Warehousing and Analytics

Microsoft Azure cloud data warehousing & analytics reference architecture: integrates large volumes of data from multiple sources into a cloud data warehouse for analytics, using Azure cloud-native data services.

In this reference architecture, we will use Azure Data Factory for data integration & transformation:

"Azure Data Factory is the cloud-based ETL and data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale."

Cloud ETL: Reference Architecture

Architecture Components for data integration and transformation:

  1. Data Sources:
    • On-premises SQL: SQL Server/Oracle
    • Cloud SQL: Azure SQL Database
    • Cloud NoSQL: Azure Cosmos DB
    • On-premises file server
  2. Staging data store: Azure Data Lake Storage Gen2 or Blob Storage to stage all data from the diverse data sources
  3. Azure Data Factory V2
    • Copy Activity: to ingest data from the multiple sources into the staging data store (see the sketch after this list)
    • Mapping Data Flow: to clean/transform the staged data and sink it into Azure Synapse Analytics
  4. Azure Synapse Analytics: ADF V2 loads the cleansed and transformed data into a dedicated SQL pool of Azure Synapse Analytics using PolyBase
  5. Azure Analysis Services: to create a tabular model over the data in Azure Synapse Analytics. The tabular model is refreshed after each new batch of data is loaded into the warehouse
  6. Power BI: helps business analysts analyse the warehoused data via the Analysis Services semantic model
  7. Azure AD: authenticates users to the Analysis Services server from Power BI. ADF uses Azure AD to authenticate to Azure Synapse via a service principal or a managed identity for Azure resources
  8. Azure Key Vault: to store and access secrets securely. Services such as ADF can connect to Key Vault to retrieve secret values
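
To make the Copy Activity concrete, below is a minimal sketch of registering one ingestion pipeline with the azure-mgmt-datafactory Python SDK. The subscription, resource group, factory, and dataset names are hypothetical placeholders, and it assumes the referenced datasets and linked services already exist in the factory.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    CopyActivity,
    DatasetReference,
    ParquetSink,
    PipelineResource,
    SqlServerSource,
)

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder
RESOURCE_GROUP = "rg-dw-analytics"      # hypothetical resource group
FACTORY_NAME = "adf-dw-analytics"       # hypothetical data factory

# DefaultAzureCredential picks up a managed identity, Azure CLI login, etc.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Copy Activity: read from an on-premises SQL Server dataset (reached through
# a self-hosted integration runtime) and land the rows in the ADLS Gen2
# staging area as Parquet. Both dataset names are hypothetical.
copy_to_staging = CopyActivity(
    name="CopySqlServerToStaging",
    inputs=[DatasetReference(type="DatasetReference",
                             reference_name="SqlServerOrdersDataset")],
    outputs=[DatasetReference(type="DatasetReference",
                              reference_name="AdlsStagingParquetDataset")],
    source=SqlServerSource(sql_reader_query="SELECT * FROM dbo.Orders"),
    sink=ParquetSink(),
)

adf_client.pipelines.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "CopySqlServerPipeline",
    PipelineResource(activities=[copy_to_staging]),
)
```

In the full architecture this would be one child pipeline per source, invoked from the parent CopyPipeline via Execute Pipeline activities.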

Data Integration and Transformation Flow:

  1. Ingest data into staging: a parent copy pipeline invokes child pipelines for each input data source and ingests the data into ADLS Gen2/Blob storage.
  2. Clean/transform: an Azure Data Factory mapping data flow cleans and transforms the staged data in a separate pipeline, which is triggered by a tumbling window trigger.
  3. Data warehousing: load the transformed data into the Azure Synapse SQL pool using PolyBase.
  4. Tabular model: created/refreshed in Azure Analysis Services after each load into Synapse (see the refresh sketch after this list).
  5. Dashboard view: Power BI queries/analyses the model from Azure Analysis Services.
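
As one way to automate step 4, the tabular model refresh can be requested through the Azure Analysis Services asynchronous refresh REST API once the warehouse load finishes. A minimal sketch, assuming hypothetical region, server, and model names:

```python
import requests
from azure.identity import DefaultAzureCredential

AAS_REGION = "westeurope"        # hypothetical server region
AAS_SERVER = "aas-dw-analytics"  # hypothetical Analysis Services server
AAS_MODEL = "WarehouseModel"     # hypothetical tabular model

# Acquire a token scoped to Azure Analysis Services.
token = DefaultAzureCredential().get_token("https://*.asazure.windows.net/.default")

resp = requests.post(
    f"https://{AAS_REGION}.asazure.windows.net/servers/{AAS_SERVER}"
    f"/models/{AAS_MODEL}/refreshes",
    headers={"Authorization": f"Bearer {token.token}"},
    json={"Type": "Full", "CommitMode": "transactional", "MaxParallelism": 2},
    timeout=30,
)
resp.raise_for_status()
# The API replies 202 Accepted; the Location header is the operation to poll.
print(resp.headers.get("Location"))
```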

Pipelines & Triggers Involved:

  1. Upstream: CopyPipeline ingests data into cloud staging storage. Azure Data Factory ingests the data from the diverse data sources above into Azure Data Lake Storage Gen2 or Blob Storage, where it is staged for subsequent processing.
  2. Downstream: TransformationPipeline uses a Data Factory mapping data flow to transform the staged data and load it into a common structure in Azure Synapse using PolyBase (sketched below).
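
Below is a minimal sketch of the TransformationPipeline, reusing adf_client and the constants from the ingestion sketch above. Enabling staging on the Execute Data Flow activity is what allows the Azure Synapse sink to bulk-load via PolyBase; the data flow and linked service names are hypothetical.

```python
from azure.mgmt.datafactory.models import (
    DataFlowReference,
    DataFlowStagingInfo,
    ExecuteDataFlowActivity,
    LinkedServiceReference,
    PipelineResource,
)

transform = ExecuteDataFlowActivity(
    name="CleanTransformToSynapse",
    data_flow=DataFlowReference(
        type="DataFlowReference",
        reference_name="CleanTransformDataFlow",  # hypothetical mapping data flow
    ),
    # Staging in ADLS Gen2 lets the Synapse sink bulk-load via PolyBase.
    staging=DataFlowStagingInfo(
        linked_service=LinkedServiceReference(
            type="LinkedServiceReference",
            reference_name="AdlsGen2StagingLinkedService",  # hypothetical
        ),
        folder_path="polybase-staging",
    ),
)

adf_client.pipelines.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "TransformationPipeline",
    PipelineResource(activities=[transform]),
)
```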

We will use a tumbling window trigger with a dependency to initiate the transformation only after the main data ingestion pipeline's window completes successfully.
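
A minimal sketch of that trigger follows, again using the azure-mgmt-datafactory SDK. It assumes the ingestion pipeline runs on its own 24-hour tumbling window trigger (here called IngestionWindowTrigger, a hypothetical name), so the transformation window fires only after the matching ingestion window succeeds.

```python
from datetime import datetime, timezone

from azure.mgmt.datafactory.models import (
    PipelineReference,
    TriggerPipelineReference,
    TriggerReference,
    TriggerResource,
    TumblingWindowTrigger,
    TumblingWindowTriggerDependencyReference,
)

transform_trigger = TumblingWindowTrigger(
    pipeline=TriggerPipelineReference(
        pipeline_reference=PipelineReference(
            type="PipelineReference", reference_name="TransformationPipeline"
        )
    ),
    frequency="Hour",
    interval=24,  # one 24-hour window per day
    start_time=datetime(2024, 1, 1, tzinfo=timezone.utc),
    max_concurrency=1,
    depends_on=[
        # Only fire once the matching window of the ingestion trigger succeeds.
        TumblingWindowTriggerDependencyReference(
            reference_trigger=TriggerReference(
                type="TriggerReference",
                reference_name="IngestionWindowTrigger",  # hypothetical
            ),
            size="24:00:00",
        )
    ],
)

adf_client.triggers.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "TransformationWindowTrigger",
    TriggerResource(properties=transform_trigger),
)
```

Note that a newly created trigger is stopped by default; it still has to be started (begin_start in recent versions of the SDK) before any window fires.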

Thanks for reading. Please feel free to comment below with any queries or suggestions.
