Azure Data Engineering Project Part 1
Before creating any resources, we will first create the resource group that will hold all the resources required for this project.
Go to Azure Portal and log in with your Azure account. In the left-hand menu, select
"Resource groups". If you don't see it, use the search bar at the top of the page and search
for "Resource groups". Click the "+ Create" button or "Add" at the top of the Resource
groups page.
Select the subscription under which the resource group will be created. Enter a unique name
for your resource group. Choose a location (region) where your resources will reside (e.g.,
East US, West Europe).
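For reference, the same resource group can also be declared in an ARM template deployed at subscription scope. This is only a minimal sketch; the resource group name and region below are placeholders, not values prescribed by this project:
{
  "type": "Microsoft.Resources/resourceGroups",
  "apiVersion": "2021-04-01",
  "name": "ttd-healthcare-rg",
  "location": "eastus"
}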
Search for "Storage accounts" and click "+ Create". Select subscription, resource group,
region, and enter a unique storage account name as shown below.
Enable ADLS Gen2: Go to the Advanced tab and enable Hierarchical namespace.
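For reference, the Hierarchical namespace setting on the Advanced tab corresponds to the isHnsEnabled property of the storage account resource. A minimal ARM-style sketch, with a placeholder account name, region, and SKU:
{
  "type": "Microsoft.Storage/storageAccounts",
  "apiVersion": "2022-09-01",
  "name": "ttdadlsdev",
  "location": "eastus",
  "sku": { "name": "Standard_LRS" },
  "kind": "StorageV2",
  "properties": { "isHnsEnabled": true }
}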
Once the storage account is created, create a container named "configs"; inside it, create a directory named "emr" and upload the file "load_config.csv" into it.
Steps to create Azure SQL database:
In the search bar, type "SQL Database" and select "SQL Database" from the results.
Choose your Subscription and Resource Group, and enter a database name. Since the database must be hosted on a logical SQL server, we will also create a new SQL server.
We will create the server as shown below. Choose a Compute + Storage tier (e.g.,
Basic, General Purpose).
After this, go to Networking => under Network connectivity, select the "Public endpoint" option. Also set "Allow Azure services and resources to access this server" and "Add current client IP address" to Yes.
Note: Please note down the SQL server admin username and password for future reference.
Click Review + Create, validate the details, and then click Create as shown below.
Note: While creating the database, if you are not able to allow public access and add the client IP address, you can follow the steps below:
After creating the database, go to Networking => for Public access, select the "Selected networks" option and save.
Also, while using the query editor, if you face the error below, click the option to allow your current client IP address, as shown below.
Similarly, we will create another database, trendytech-hospital-b (we will use the same server, i.e. trendytech-sqlserver, that we created while creating the trendytech-hospital-a database). Thus we have created 2 databases as shown below.
Then we will create the tables in these databases. To create the tables, use the scripts below, which are available in the GitHub repository:
For trendytech-hospital-a =>
Trendytech_hospital_A_table_creation_commands
In the search bar, type "Data Factory" and select "Data Factory" from the results.
Click the "Create" button on the Data Factory page. Provide a globally unique name
for your Data Factory instance. Choose V2 (Data Factory Version 2) for the latest
features.
Click "Review + Create" to validate the details. If validation passes, click "Create" to
deploy the Data Factory.
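For reference, the Data Factory itself is a small ARM resource; a minimal sketch with a placeholder name and region (a system-assigned managed identity is created by default):
{
  "type": "Microsoft.DataFactory/factories",
  "apiVersion": "2018-06-01",
  "name": "ttd-healthcare-adf",
  "location": "eastus",
  "identity": { "type": "SystemAssigned" }
}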
Linked Service creation:
In the ADF interface, go to the Manage section in the left-hand panel. Under the Connections section, select Linked Services. Click on New to create a new Linked Service.
1. Azure SQL DB
Note down the server name of the SQL database that we created. In the Fully qualified domain name field, enter the server name, provide the username and password for the SQL server, and define a parameter db_name; using this parameter we will pass the database name, as shown below. Then click Create to create the linked service.
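In the ADF code view, a parameterized Azure SQL Database linked service of this kind looks roughly like the sketch below. The linked service name hosa_sql_ls and the server name trendytech-sqlserver come from this project; the username and password are placeholders (in Part 2 these secrets move to Key Vault):
{
  "name": "hosa_sql_ls",
  "properties": {
    "type": "AzureSqlDatabase",
    "parameters": {
      "db_name": { "type": "String" }
    },
    "typeProperties": {
      "connectionString": "Server=tcp:trendytech-sqlserver.database.windows.net,1433;Database=@{linkedService().db_name};User ID=<sql-admin-user>;Password=<sql-admin-password>;Encrypt=true;"
    }
  }
}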
2. ADLS GEN2
Select Azure Data Lake Storage Gen2 as the data store. Provide the following details: the URL of your storage account and the authentication details. Then click Test Connection to verify and save the Linked Service.
To get the URL for Azure Data Lake Storage, go to the ADLS Gen2 storage account that we created => Settings => Endpoints => and copy the Data Lake Storage URL as shown below.
Also copy the access key. Using these details, create the linked service as shown below.
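In code view, an ADLS Gen2 linked service using account-key authentication looks roughly like this sketch; the linked service name adls_ls, the storage account name, and the key value are placeholders:
{
  "name": "adls_ls",
  "properties": {
    "type": "AzureBlobFS",
    "typeProperties": {
      "url": "https://<storage-account-name>.dfs.core.windows.net",
      "accountKey": {
        "type": "SecureString",
        "value": "<storage-account-access-key>"
      }
    }
  }
}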
Dataset creation:
In the ADF interface, click on the Author section (left-hand panel).
Expand the Datasets option. Click on the “…” next to Datasets in
order to create the dataset.
1. Azure SQL DB
We will select the linked service that we have created for the SQL database. To create the datasets for the tables in the SQL database in a parameterized way, we will create the parameters db_name, schema_name, and table_name.
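A parameterized Azure SQL dataset of this shape, in code view, could look like the following sketch; the dataset name generic_sql_ds is a placeholder, while hosa_sql_ls and the three parameters come from this project:
{
  "name": "generic_sql_ds",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": {
      "referenceName": "hosa_sql_ls",
      "type": "LinkedServiceReference",
      "parameters": { "db_name": "@dataset().db_name" }
    },
    "parameters": {
      "db_name": { "type": "String" },
      "schema_name": { "type": "String" },
      "table_name": { "type": "String" }
    },
    "typeProperties": {
      "schema": { "value": "@dataset().schema_name", "type": "Expression" },
      "table": { "value": "@dataset().table_name", "type": "Expression" }
    }
  }
}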
In order to store data in ADLS Gen2 in Parquet format, we will need a dataset. While creating this dataset we will select ADLS Gen2 as the source, Parquet as the file format, and we will create the parameters file_name, file_path, and container.
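A sketch of this parameterized Parquet dataset in code view follows; the dataset name and the adls_ls linked service reference are placeholders, while container, file_path, and file_name are the parameters described above:
{
  "name": "generic_adls_parquet_ds",
  "properties": {
    "type": "Parquet",
    "linkedServiceName": {
      "referenceName": "adls_ls",
      "type": "LinkedServiceReference"
    },
    "parameters": {
      "container": { "type": "String" },
      "file_path": { "type": "String" },
      "file_name": { "type": "String" }
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": { "value": "@dataset().container", "type": "Expression" },
        "folderPath": { "value": "@dataset().file_path", "type": "Expression" },
        "fileName": { "value": "@dataset().file_name", "type": "Expression" }
      }
    }
  }
}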
4. Azure Databricks Delta Lake
We will select Azure Databricks Delta Lake as the source. For this we will create the parameters schema_name and table_name.
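A sketch of this Delta Lake dataset in code view, assuming a Databricks Delta Lake linked service named databricks_delta_ls (the dataset and linked service names here are placeholders):
{
  "name": "generic_delta_ds",
  "properties": {
    "type": "AzureDatabricksDeltaLakeDataset",
    "linkedServiceName": {
      "referenceName": "databricks_delta_ls",
      "type": "LinkedServiceReference"
    },
    "parameters": {
      "schema_name": { "type": "String" },
      "table_name": { "type": "String" }
    },
    "typeProperties": {
      "database": { "value": "@dataset().schema_name", "type": "Expression" },
      "table": { "value": "@dataset().table_name", "type": "Expression" }
    }
  }
}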
Once all the datasets and linked services are created, click Publish all in order to save them.
Creation of Pipelines:
1. Linked Service:
We will use the same linked service "hosa_sql_ls" that we created earlier for the database.
2. Creation of Datasets:
Select ADLS Gen2 storage as the source, then for the file format select JSON, as our lookup file is a JSON file, as shown below.
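In code view, a JSON dataset for this lookup file looks roughly like the sketch below; the dataset name, the adls_ls linked service reference, and the location values are placeholders, since the actual container, path, and file name are the ones configured in the screenshots:
{
  "name": "lookup_file_ds",
  "properties": {
    "type": "Json",
    "linkedServiceName": {
      "referenceName": "adls_ls",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "<container>",
        "folderPath": "<folder-path>",
        "fileName": "<lookup-file-name>"
      }
    }
  }
}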
Steps to Configure the Pipeline:
This pipeline will copy the data from the file into the tables in the SQL database. On successfully running the pipeline, we will get the output below.
Pipeline to copy data from Azure SQL DB to the Landing Folder in ADLS Gen2
We will use a Lookup activity (lkp_EMR_configs) to read the EMR load config file, and pass its output to a ForEach activity using the expression below:
@activity('lkp_EMR_configs').output.value
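In code view, this expression is supplied to the ForEach activity's items property. A minimal sketch follows (the activity name is a placeholder, the inner activities are omitted, and isSequential corresponds to the "Sequential" option mentioned later):
{
  "name": "ForEach_EMR_table",
  "type": "ForEach",
  "dependsOn": [
    { "activity": "lkp_EMR_configs", "dependencyConditions": [ "Succeeded" ] }
  ],
  "typeProperties": {
    "items": {
      "value": "@activity('lkp_EMR_configs').output.value",
      "type": "Expression"
    },
    "isSequential": true,
    "activities": []
  }
}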
a. We will use a Get Metadata activity in order to check whether the file exists in the Bronze container:
This will check if the file exists in the Bronze container. Based on the file's
presence or absence, we will use an If Condition activity to determine the
subsequent processing steps.
Condition 1: File Exists (True) => Move the file to the Archive folder.
condition:
@and(equals(activity('fileExists').output.exists,true),equals(item().is_active,'1'))
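In code view, the Get Metadata activity only needs a dataset reference and the "exists" field. A minimal sketch, reusing the placeholder Parquet dataset from earlier and assuming the config file exposes a tablename column (the real column names come from load_config.csv):
{
  "name": "fileExists",
  "type": "GetMetadata",
  "typeProperties": {
    "dataset": {
      "referenceName": "generic_adls_parquet_ds",
      "type": "DatasetReference",
      "parameters": {
        "container": "bronze",
        "file_path": "hosa",
        "file_name": {
          "value": "@concat(item().tablename, '.parquet')",
          "type": "Expression"
        }
      }
    },
    "fieldList": [ "exists" ]
  }
}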
Source: Container: Bronze, Path: hosa, File: encounters
If Condition expression:
@equals(item().loadtype, 'Full')
If the "If Condition" holds true => Full Load => copy all data from the database table => enter log details in the audit table:
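In code view, this check is an If Condition activity whose expression is the one above; the true branch holds the full-load copy and audit activities, and the false branch holds the incremental-load activities. A minimal sketch with a placeholder name and the branch contents omitted:
{
  "name": "If_full_load",
  "type": "IfCondition",
  "typeProperties": {
    "expression": {
      "value": "@equals(item().loadtype, 'Full')",
      "type": "Expression"
    },
    "ifTrueActivities": [],
    "ifFalseActivities": []
  }
}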
Folder and File Structure
Bronze Container:
Source Path: bronze/hosa
Target Path for Data Loads: bronze/<target-path>
If the condition is false => Incremental Load => fetch the incremental data using the last fetched date via a Lookup activity => perform the incremental load using a Copy activity => enter log details in the audit table:
Lookup:
Incremental load:
Source Path: bronze/hosa
Target Path for Data Loads: bronze/<target-path>
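For reference, the incremental Copy activity typically builds its source query from the last fetched date returned by the Lookup above. The sketch below shows only the source and sink types; the activity and Lookup names (copy_incremental_data, fetch_logs) and the column names (tablename, ModifiedDate, last_fetched_date) are assumptions for illustration, not values from this project:
{
  "name": "copy_incremental_data",
  "type": "Copy",
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": {
        "value": "select * from @{item().tablename} where ModifiedDate >= '@{activity('fetch_logs').output.firstRow.last_fetched_date}'",
        "type": "Expression"
      }
    },
    "sink": {
      "type": "ParquetSink"
    }
  }
}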
Before running the pipeline, select the "Sequential" option in the ForEach activity as shown below.
However, a limitation of this pipeline is that it runs sequentially, which we will resolve in Part 2.
Part 2:
In this section, we will focus on improving our data pipeline and governance by
implementing the following:
1. Build Fact and Dimension tables for better data analysis and reporting.
2. Secure sensitive data by integrating Azure Key Vault for managing secrets and credentials.
3. Standardize names across datasets, pipelines, and tables for better organization and understanding.
4. Optimize Azure Data Factory pipelines to run multiple processes at the same time, reducing execution time.
5. Transition from a local Hive Metastore to Databricks Unity Catalog for centralized metadata management and improved data governance.
Also, to organize the notebooks, we will create the following folders, as shown below.
1. Set up:
2. API extracts
3. Silver
4. Gold