In this article we will discuss WHY we need a lake database in Azure Synapse, i.e. what challenges it solves, HOW it solves them, and HOW to get started with a lake database.
At the time of writing, the Azure Synapse lake database is in public preview.
The lake database in Azure Synapse Analytics lets customers bring together database design, metadata about the stored data, and a description of how and where the data should be stored.
WHY do we need this lake database? – What challenges does it address?
- Relational model: it resolves the difficulty of understanding how data is structured in the data lake.
- Entity relationships: relationships among entities address the lack of entity/model relationships, which is a real challenge when working with data on the data lake.
- Metadata: adding descriptions and demo values to the model helps future developers better understand the data. This metadata also lets compute engines provide an integrated experience and make use of additional information, such as relationships, that was not originally supported on the data lake.
HOW are the above challenges addressed?
A lake database addresses these challenges through its three components:
- Database designer:
  - Creates the data model for the lake database and provides the option to add metadata to it.
  - Metadata: adds information about each entity, its attributes, and entity relationships. Capturing relationships among entities is a challenge on the data lake.
  - Descriptions and possible demo values can be added to the model for future reference.
- Data storage:
  - Location: Azure Storage holds the data of the database.
  - Format: Parquet or CSV; different settings can be used to optimize the storage.
  - Connection: a lake database uses a linked service to define the location of the root data folder.
- Database compute:
  - In a lake database, storage is decoupled from compute.
  - The lake database is accessible via Synapse SQL serverless SQL pools and Apache Spark pools.
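The storage side can be made concrete with a small sketch. A lake database table simply points at a folder of delimited or Parquet files under the root data folder. Below we mimic that layout locally with stdlib-only Python; the folder and column names are hypothetical, and in a real workspace the files would live in ADLS Gen2, not a local temp directory:

```python
import csv
import os
import tempfile

# Mimic the root data folder that a lake database's linked service points at.
# In Synapse this would be a container/path in ADLS Gen2, not a local temp dir.
root = tempfile.mkdtemp()
table_folder = os.path.join(root, "Dim_Customer")  # one folder per table
os.makedirs(table_folder, exist_ok=True)

rows = [
    {"CustomerId": "1", "CustomerName": "Alice"},
    {"CustomerId": "2", "CustomerName": "Bob"},
]

# Tables in 'Delimited Text' format are plain CSV files inside the folder.
path = os.path.join(table_folder, "part-000.csv")
with open(path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["CustomerId", "CustomerName"])
    writer.writeheader()
    writer.writerows(rows)

print(sorted(os.listdir(root)))  # ['Dim_Customer']
```

Because storage is just files, any compute engine attached to the workspace (serverless SQL or Spark) can read the same folder.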
HOW to get started?
Prerequisites: at a minimum, we need the following to get started.
- Azure Subscription
- Azure Data Lake Gen2 or Azure Blob Storage
- Azure Synapse Analytics workspace
Log into the Azure portal and open Synapse Studio from your Synapse workspace.
Once in Synapse Studio, click "Data" on the far left, then click the + sign next to "Data". This opens a pop-up listing the data sources in the workspace and the linked service connections.
Now click "Lake database (preview)" – currently this feature is in preview. This opens the lake database creation pane.
Now provide the name and storage settings for the lake database:
- Lake database name
- Linked service: the storage account location where the data is staged
- Input folder: the folder path of your data starting at the container in the storage account
- Data format: currently the supported formats are 'Delimited Text' and 'Parquet'
Now that the lake database is created, let's create a table. There are three ways to create a table in a lake database:
- Custom: create the table manually.
- From template: create the table from an available template, i.e. a predefined schema.
- From data lake: create the table from files already available in the data lake.
In this tutorial we will create tables from the data lake to show how tables and their relationships are created.
To create a table from the data lake, click the "From data lake" option. In the window that appears, we need to:
- provide an external table name
- select the linked service to the storage account
- select the input folder or file; this file is used as the basis for the table schema and data.
Once we provide the above details and continue, we need to provide the settings of the file used from the data lake (delimiter, header row, and so on).
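Under the hood, the "From data lake" option derives the table schema from the selected file. As a rough illustration, here is a stdlib-only sketch of that idea; the inference rules below are simplified assumptions for demonstration, not Synapse's actual logic:

```python
import csv
import io

def infer_schema(csv_text, delimiter=","):
    """Guess column names from the header row and types from the first data row."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = next(reader)
    first = next(reader)
    schema = []
    for name, value in zip(header, first):
        if value.lstrip("-").isdigit():
            col_type = "bigint"
        else:
            try:
                float(value)
                col_type = "float"
            except ValueError:
                col_type = "string"
        schema.append((name, col_type))
    return schema

sample = "CustomerId,CustomerName,CreditLimit\n1,Alice,2500.50\n"
print(infer_schema(sample))
# [('CustomerId', 'bigint'), ('CustomerName', 'string'), ('CreditLimit', 'float')]
```

This is why choosing the right file settings (delimiter, header) matters: the inferred schema is only as good as the sample file it is read from.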
Database designer:
After providing the details to create the table, the new table is displayed in the database designer of the lake database.
Click on the table's 'Columns' tab to add or modify the schema and to provide descriptions as metadata for the table.
Once we provide the schema details and a description for each column, we need to publish the changes (top left of the designer) to the workspace so that the table becomes available under the lake database for DML operations.
Under the 'Columns' tab we can see the following fields: Name, Keys, Description, Nullability, and, last, Format. These are column-level properties, and we can add a description as metadata for each column.
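These column-level properties map naturally onto a small record type. Here is a sketch of the metadata the designer captures per column, using hypothetical columns for the Dim_Customer table:

```python
from dataclasses import dataclass

@dataclass
class Column:
    """Column-level metadata as captured in the designer's 'Columns' tab."""
    name: str
    is_key: bool = False      # 'Keys' checkbox (primary key)
    description: str = ""     # free-text metadata for future developers
    nullable: bool = True     # 'Nullability'
    fmt: str = "string"       # 'Format' (data type)

dim_customer = [
    Column("CustomerId", is_key=True, nullable=False, fmt="bigint",
           description="Surrogate key for the customer"),
    Column("CustomerName", description="Full name of the customer"),
]

key_columns = [c.name for c in dim_customer if c.is_key]
print(key_columns)  # ['CustomerId']
```

The descriptions are what future developers (and compute engines) see, so it pays to fill them in rather than leave them blank.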
After we publish, the table is available under the lake database.
Before querying the table, let us view the actual Customer Dimension data in storage. Once we query the external table, we can verify the table data against this Customer Dimension storage file.
Now we can query the Dim_Customer external table by clicking the three dots next to Dim_Customer, then "New SQL script" and "Select TOP 100 rows".
After clicking Run, the data is displayed, and we can verify it against the Customer Dimension storage file data mentioned above.
By default, a select query returning the top 100 rows is created and run on serverless SQL by connecting to the "Built-in" pool.
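The same table can also be queried from outside Synapse Studio over the workspace's serverless SQL endpoint, for example with `pyodbc`. Since that requires a live workspace and credentials, the sketch below only assembles the connection string and a query in the same shape Synapse Studio generates; the workspace and database names are hypothetical:

```python
# Hypothetical workspace and lake database names; adjust to your environment.
workspace = "mysynapseworkspace"
database = "MyLakeDB"

# Serverless ("Built-in") SQL endpoint of a Synapse workspace.
server = f"{workspace}-ondemand.sql.azuresynapse.net"

conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    f"Server={server};Database={database};"
    "Authentication=ActiveDirectoryInteractive;"
)

# Same shape of query as Synapse Studio's 'Select TOP 100 rows' script.
query = "SELECT TOP (100) * FROM [dbo].[Dim_Customer]"

# With pyodbc installed and access to the workspace, this would execute it:
# import pyodbc
# with pyodbc.connect(conn_str) as conn:
#     rows = conn.cursor().execute(query).fetchall()

print(server)  # mysynapseworkspace-ondemand.sql.azuresynapse.net
```

Because the serverless pool bills per data scanned, it is a good habit to keep exploratory queries limited with `TOP` as the generated script does.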
So far we have created the Customer dimension table and queried it. Similarly, we can create the Product dimension table and the Sales fact table:
Product dimension table schema in database designer:
Sales fact table schema in database designer:
Create Entity Relationship:
Entity relationships in a lake database address the lack of entity/model relationships, which is a real challenge when working with data on the data lake. Now we will create relationships among the tables created from Azure Data Lake Gen2 storage.
In the database designer, under the 'Relationships' tab, there is a 'Relationship' drop-down to define the relationship between tables by selecting the 'From' table and its column and the 'To' table and its column.
Entity Relationship among the fact and dimension tables in database designer:
Here we have created a simple data model with relationships among the customer and product dimension tables and the sales fact table. In this scenario, the 'From' tables are the dimension tables, i.e. Dim_Customer and Dim_Product, and the fact table is Fact_Sales.
In the designer we can clearly see how each relationship is created using columns.
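It is worth noting that these relationships are descriptive metadata on the model; they are not constraints enforced on the underlying files. Here is a sketch of checking referential integrity yourself over sample rows, with table and column names assumed from the model above:

```python
# Sample rows standing in for the lake tables (column names are assumptions).
dim_customer = [{"CustomerId": 1}, {"CustomerId": 2}]
dim_product = [{"ProductId": 10}, {"ProductId": 11}]
fact_sales = [
    {"SaleId": 100, "CustomerId": 1, "ProductId": 10},
    {"SaleId": 101, "CustomerId": 2, "ProductId": 99},  # orphan ProductId
]

def orphan_rows(fact, dim, key):
    """Fact rows whose foreign-key value has no matching row in the dimension."""
    valid = {row[key] for row in dim}
    return [row for row in fact if row[key] not in valid]

print([r["SaleId"] for r in orphan_rows(fact_sales, dim_product, "ProductId")])
# [101]
```

In a real pipeline this kind of check would run on the compute engine (for example as a Spark job) rather than in plain Python, but the logic is the same: the model documents the relationship, and validation is up to you.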
In this blog we covered the 'WHY', i.e. the challenges addressed by the lake database, and the 'HOW' of getting started with it. We also saw how to create relationships among entities.
Thanks for reading the article! If you have any questions, please feel free to write in the comments section below.