Taking Interview
https://learn.microsoft.com/en-us/azure/data-factory/quickstart-get-started
Q1. I am trying to create a new Linked service (to copy data from ADLS Gen2) using the AutoResolveIntegrationRuntime.
I get the error: "The interactive authoring capability is not enabled on the integration runtime 'AutoResolveIntegrationRuntime'. Please enable interactive authoring first and retry the operation. Activity ID: undefined"
Q3: Can you describe the different integration runtime types available in Azure
Data Factory and provide use cases for each?
Note:
# https://learn.microsoft.com/en-gb/azure/data-factory/continuous-integration-delivery
Only the development factory is associated with a git repository. The test and
production factories shouldn't have a git repository associated with them and
should only be updated via an Azure DevOps pipeline or via a Resource Management
template.
Note:
Useful repo: https://github.com/djpmsft/adf-cicd
https://github.com/AdamPaternostro/Azure-Data-Factory-CI-CD-Source-Control
https://github.com/marketplace/actions/data-factory-export
Q5: In ADF, what is a dataset, and can a dataset be used with multiple pipelines?
Q6: You are working as an Azure Data Factory architect for a retail company. The
company receives daily sales data from multiple sources, including a SQL Server
database and an external API that provides JSON data. The data from these sources
needs to be loaded into a centralized data warehouse. However, the data sources
have complex dependencies and require transformations before being loaded.
Additionally, the data needs to be loaded incrementally based on a timestamp column
in the SQL Server database table.
Please explain how you would design an Azure Data Factory pipeline to handle
incremental data loading from multiple sources with complex dependencies and
transformations.
To design an Azure Data Factory pipeline for this scenario, I would follow these steps:
Identify the sources: the SQL Server database and the external API that provides JSON data.
Define the dependencies: for example, the SQL Server data may need to be transformed before being joined with the API data.
Configure source datasets: create datasets in Azure Data Factory that represent the SQL Server table and the API endpoint. These datasets should include the necessary connection details, such as the connection string, table name, and API endpoint.
Create activities: define activities in the pipeline to extract, transform, and load the data. This can include data movement, data transformation, and control flow activities.
Example: a Copy activity extracts data from the SQL Server database and loads it into a staging area in Azure Storage; a Data Flow activity then applies transformations, such as filtering or aggregating, to the staged data. A Web activity can call the API endpoint to retrieve the JSON data, and another Data Flow activity can transform the JSON data.
Handle incremental loading: keep a watermark (the last loaded value of the timestamp column), read it at the start of each run with a Lookup activity, have the Copy activity extract only rows with a newer timestamp, and update the watermark after a successful load.
Schedule and trigger the pipeline: define a schedule or trigger to run the pipeline at the desired frequency or in response to specific events. A schedule trigger fits this scenario: run the pipeline daily to fetch the latest incremental changes from the data sources.
Q7: Can you explain what Parquet is and why it is a preferred file format for big data processing?
How does Azure Data Factory support Parquet files in data integration workflows?
Parquet is a columnar storage file format that is widely used in big data
processing environments. It has become a preferred choice for storing and
processing large datasets due to its numerous advantages.
Columnar Storage: Parquet stores data in a columnar format, meaning that values of
each column are stored together, rather than storing data row by row. This columnar
organization allows for more efficient compression, as data within a column tends
to have similar characteristics, resulting in better compression ratios.
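A minimal sketch in Python (assuming pandas with the pyarrow engine installed; the file and column names are made up for illustration) showing how the columnar layout lets you read only the columns you need:

import pandas as pd

# A small DataFrame standing in for a much larger dataset
df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "product": ["Laptop", "Mouse", "Keyboard"],
    "price": [1200, 25, 50],
})

# Write to Parquet with snappy compression (columnar, compressed storage)
df.to_parquet("orders.parquet", engine="pyarrow", compression="snappy")

# Read back only two columns; the unread columns never have to be loaded from disk
subset = pd.read_parquet("orders.parquet", columns=["product", "price"])
print(subset)

In Azure Data Factory, Parquet is one of the supported file formats for datasets used by the Copy activity and mapping data flows, so data can be read from or written to Parquet as part of an integration pipeline.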
Q8. In what scenarios would you typically use the Lookup activity in an Azure Data
Factory pipeline?
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-lookup-activity
Data Enrichment: For example, you can perform a lookup on a reference table to
retrieve product details based on product IDs in the incoming data.
Dimension Lookup: In data warehousing or data integration scenarios, the Lookup
activity is commonly used to perform dimension lookup operations. It allows you to
match and retrieve dimension keys or attributes based on certain conditions, such
as mapping customer names to customer IDs.
Data Validation: The Lookup activity can be utilized to validate data against
reference datasets or lookup tables. It helps in checking the existence or validity
of certain values before further processing. For example, you can use a Lookup
activity to verify if a particular customer or product exists in a reference
database.
Data Filtering: By performing lookups on specific columns or fields, the Lookup
activity enables data filtering based on conditions. This allows you to filter out
or route specific data subsets to different branches of the pipeline for further
processing or data flow decisions.
Terraform?
Q2: Can you explain the concept of data sources in Terraform? And in which scenario would you use them?
Q3: What is the purpose of the locals block in Terraform, and how can it be used to
simplify and organize Terraform configurations?
Q4: Explain this code. What is the data type of splitted_IP in the code below?
variable "ip_addresses" {
type = string
default = "192.168.1.100, 10.0.0.1, 172.16.0.100"
}
locals {
splitted_IP = split(", ", var.ip_addresses)
}
variable "myuserlist" {
description = "username of iam user"
type = list(string)
default = ["user1", "user2", "user3"]
}
#Method1
resource "aws_iam_user" "example-user" {
count = length(var.myuserlist)
name = var.myuserlist[count.index]
}
#Method2
resource "aws_iam_user" "example-user" {
for_each = toset(var.myuserlist)
name = each.value
}
Q6: Explain this code. What will be the output of this code?
locals {
  heights = {
    bob     = "short"
    kevin   = "tall"
    stewart = "medium"
  }
}

resource "null_resource" "heights" {
  for_each = local.heights
  triggers = {
    name   = each.key
    height = each.value
  }
}

output "heights" {
  value = values(null_resource.heights)[*]
}
Q9: We also have modules to create resource_group, vnet, subnet, and nsg, and to attach the nsg to the subnet. We are using these modules in ADO to create resources.
Given below are NSG/main.tf and Subnet/main.tf.
Issue: the pipeline fails with an error stating that the subnet does not exist. What is the issue, and what approach would you take to resolve it?
# NSG/main.tf
variable "env" {
  type    = string
  default = null
}

variable "resource_group_name" {
  type    = string
  default = null
}

variable "location" {
  type    = string
  default = null
}

variable "nsg_name" {
  type    = string
  default = null
}

output "values" {
  value = azurerm_network_security_group.my-nsg
}
# Subnet/main.tf
variable "subnet_name" {
  description = "Name of the subnet"
  type        = string
  default     = ""
}

variable "virtual_network_name" {
  description = "Name of the virtual network"
  type        = string
  default     = ""
}

variable "subnet_address_prefix" {
  description = "Address prefix of the subnet"
  type        = list(string)
  default     = []
}

variable "env" {
  description = "Environment (dev, tst, prd)"
  type        = string
  default     = ""
}

variable "resource_group_name" {
  description = "Name of the resource group"
  type        = string
  default     = ""
}

output "subnet_name" {
  value = azurerm_subnet.my-subnet.name
}

output "subnet_id" {
  value = azurerm_subnet.my-subnet.id
}

output "subnet_address_prefix" {
  value = azurerm_subnet.my-subnet.address_prefixes
}
------------------------------------------------------------------------------------------
Question:
result = db.orders.aggregate(pipeline)
Question:
You've observed that a query searching for documents in a collection named users
based on email and lastLoginDate is running slow. Write a command to create an
optimal compound index for this query. Also, explain why this would improve
performance.
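One possible answer, sketched with pymongo (the connection string and database name below are placeholders; the descending direction on lastLoginDate is an assumption, matching a typical "most recent login" query):

from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["mydb"]                                # placeholder database name

# Equality field (email) first, then the range/sort field (lastLoginDate)
db.users.create_index([("email", ASCENDING), ("lastLoginDate", DESCENDING)])

Why this improves performance: with a compound index on (email, lastLoginDate), MongoDB can satisfy both the equality match on email and the condition on lastLoginDate from the index (an IXSCAN) instead of scanning every document in the collection (a COLLSCAN), and putting the equality field first lets the index narrow the candidates as much as possible. The equivalent mongo shell command is db.users.createIndex({ "email": 1, "lastLoginDate": -1 }).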
# Example usage:
# insert_user("John Doe", "john@example.com", 30)
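The insert_user helper referenced by this usage comment is not defined above; a minimal sketch consistent with that signature (connection string, database, and collection names are placeholders):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
users = client["mydb"]["users"]                    # placeholder database and collection

def insert_user(name, email, age):
    # Insert one user document and return the generated _id
    result = users.insert_one({"name": name, "email": email, "age": age})
    return result.inserted_id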
Question:
Consider a blogging application where each blog post can have multiple comments.
Each comment contains a message, author name, and timestamp. Given that the primary
operation on the blog post is fetching it along with all its comments, suggest a
data model for this scenario and write a pymongo function to add a new comment to a
given blog post.
Suggested data model: embed the comments as an array inside the blog post document, since the primary operation is fetching a post together with all of its comments:
{
  "_id": ObjectId("some_id"),
  "title": "Sample Blog Post",
  "content": "This is the content of the blog post",
  "comments": [
    { "message": "Great post!", "author": "Alice", "timestamp": ISODate("2023-10-24T10:20:00Z") },
    { "message": "I found this really informative. Thanks for sharing!", "author": "Bob", "timestamp": ISODate("2023-10-24T11:15:00Z") },
    { "message": "Can you elaborate on the second point?", "author": "Charlie", "timestamp": ISODate("2023-10-24T11:45:00Z") },
    { "message": "I disagree with your conclusion. Here's my perspective...", "author": "Dana", "timestamp": ISODate("2023-10-24T12:30:00Z") }
    // potentially more comments...
  ]
}
The pymongo function to add a new comment would look like this (the connection string, database, and collection names are placeholders):
import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
posts = client["blog"]["posts"]                    # placeholder database and collection

def add_comment(post_id, message, author):
    comment = {
        "message": message,
        "author": author,
        "timestamp": datetime.datetime.utcnow()
    }
    # $push appends the new comment to the post's embedded comments array
    posts.update_one({"_id": post_id}, {"$push": {"comments": comment}})

# Example usage:
# add_comment(ObjectId("some_id"), "Nice article!", "Eve")
Explanation of $push:
$push is an update operator provided by MongoDB that appends a specified value to
an array. If the field specified in the $push operation is not present in the
document to be updated, MongoDB will add a new array field with the specified name
and append the given value to this array.
In the context of the example I provided earlier, the $push operator is used to add
a new comment to the comments array embedded in a blog post document.
new_comment = {
"message": "I found this really informative. Thanks for sharing!",
"author": "Bob",
"timestamp": datetime.datetime.utcnow()
}
posts.update_one({"_id": 1}, {"$push": {"comments": new_comment}})
After applying the $push operation, the document will be updated to:
{
"_id": 1,
"title": "Sample Blog Post",
"content": "This is the content of the blog post",
"comments": [
{
"message": "Great post!",
"author": "Alice",
"timestamp": ISODate("2023-10-24T10:20:00Z")
},
{
"message": "I found this really informative. Thanks for sharing!",
"author": "Bob",
"timestamp": ISODate("2023-10-24T11:15:00Z")
}
]
}
Question:
CRUD operations (find, update, insert, delete, etc.).
Answer:
CRUD stands for Create, Read, Update, and Delete, which are the four basic
operations for any persistent storage system. In the context of MongoDB, a NoSQL
document database, these operations act on the BSON-formatted documents stored in
collections.
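1. Create (Insert)
Operation: insertOne or insertMany
Purpose: Add new documents to a collection.
Example:
db.collectionName.insertOne({ "name": "Alice", "age": 30 });
insertOne adds a single document, while insertMany takes an array of documents and inserts them in one call.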
2. Read (Find)
Operation: find
Purpose: Used to retrieve documents from a collection based on specified criteria.
Example:
db.collectionName.find({ "age": { "$gte": 30 } });
The find method retrieves all documents that match the provided query. In the above
example, it retrieves all documents where the age is greater than or equal to 30.
If no criteria are provided (i.e., {}), it fetches all documents in the collection.
3. Update
Operation: updateOne, updateMany, or replaceOne
Purpose: Modify existing documents based on specific criteria.
db.collectionName.updateOne({ "name": "Alice" }, { "$set": { "age": 31 } });
db.collectionName.updateMany({ "age": { "$gte": 30 } }, { "$inc": { "age": 1 } });
updateOne modifies the first document that matches the given criteria.
updateMany modifies all documents that match the provided criteria.
In the examples, the first updates Alice's age to 31, while the second increments
the age by 1 for all documents where age is greater than or equal to 30.
4. Delete
Operation: deleteOne or deleteMany
Purpose: Remove documents from a collection based on specific criteria.
Example:
db.collectionName.deleteOne({ "name": "Charlie" });
db.collectionName.deleteMany({ "age": { "$lt": 30 } });
deleteOne removes the first document that matches the given criteria.
deleteMany removes all documents that match the criteria.
In the examples, the first deletes the document with the name "Charlie", and the
second deletes all documents where the age is less than 30.
Question:
What is $unwind in an aggregation pipeline?
Aggregation Stage: In the context of MongoDB's aggregation pipeline, a stage
processes data records (documents) and returns transformed data. Each stage
transforms the documents as they pass through the pipeline.
Function or Operator: Within each stage, you can use specific expressions or
operators to manipulate the data. For instance, $sum, $avg, and $multiply are
operators that can be used within certain stages to perform operations on the data.
So, to directly answer the question: $unwind is an aggregation stage, not an operator. It deconstructs an array field and outputs one document per array element, as shown in the pipeline below:
Input Documents
        │
        ▼
[=================]
│     $match      │ ----> Filters documents based on given criteria.
[=================]
        │
        ▼
[=================]
│     $unwind     │ ----> Deconstructs an array field, outputting a document for each array item.
[=================]
        │
        ▼
[=================]
│     $group      │ ----> Groups documents by specified identifiers and performs aggregation operations.
[=================]
        │
        ▼
Output Documents
An aggregation pipeline can have many more stages, and data can be reshaped and transformed in numerous ways as it moves through the pipeline. For example, suppose a students collection contains these two documents:
{
"_id": 1,
"name": "John",
"hobbies": ["reading", "gaming", "hiking"]
},
{
"_id": 2,
"name": "Jane",
"hobbies": ["dancing", "singing"]
}
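Applying $unwind to the hobbies field produces the expanded documents shown next. A pymongo sketch of the call (the students collection name and connection details are placeholders):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["school"]  # placeholder connection and database

# $unwind emits one output document per element of the hobbies array
for doc in db.students.aggregate([{"$unwind": "$hobbies"}]):
    print(doc)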
{
"_id": 1,
"name": "John",
"hobbies": "reading"
},
{
"_id": 1,
"name": "John",
"hobbies": "gaming"
},
{
"_id": 1,
"name": "John",
"hobbies": "hiking"
},
{
"_id": 2,
"name": "Jane",
"hobbies": "dancing"
},
{
"_id": 2,
"name": "Jane",
"hobbies": "singing"
}
As you can see, the $unwind operation has transformed our original 2 documents into
5 documents. For each element in the hobbies array of a student, a new document is
created, where the hobbies field is replaced by each individual value.
Question:
Advanced operations in MongoDB.
1. Aggregation
Purpose: Transform and combine documents in your collection to aggregate data and
perform operations that provide insight into your data.
Example Operations:
Aggregation Pipeline: Series of data transformation stages (e.g., $match, $group,
$sort).
db.collectionName.aggregate([
{ "$match": { "age": { "$gte": 30 } } },
{ "$group": { "_id": "$country", "averageAge": { "$avg": "$age" } } },
{ "$sort": { "averageAge": -1 } }
]);
The above aggregation pipeline retrieves documents with age >= 30, groups them by country, calculates the average age for each country, and then sorts the countries by average age in descending order.
The aggregation framework provides capabilities similar to SQL's GROUP BY clause, and much more.
For example, consider an orders collection with the following documents:
[
{
"_id": ObjectId("5f50a6506881fcbf9fbb1042"),
"user": "Amit",
"orderDate": ISODate("2023-10-01"),
"items": [
{"product": "Laptop", "quantity": 1, "price": 1200},
{"product": "Mouse", "quantity": 2, "price": 25}
]
},
{
"_id": ObjectId("5f50a6546881fcbf9fbb1043"),
"user": "Raj",
"orderDate": ISODate("2023-10-02"),
"items": [
{"product": "Headphones", "quantity": 1, "price": 150},
{"product": "Laptop", "quantity": 1, "price": 1150},
{"product": "Keyboard", "quantity": 1, "price": 50}
]
},
{
"_id": ObjectId("5f50a6586881fcbf9fbb1044"),
"user": "Rahul",
"orderDate": ISODate("2023-10-03"),
"items": [
{"product": "Laptop", "quantity": 2, "price": 1100},
{"product": "Phone Case", "quantity": 3, "price": 15}
]
}
]
The next question on this data is the total sales per product. That aggregation (a sketch follows below) first "unwinds" the items array, which effectively creates a new document for each item in the array. It then groups the data by product and sums up the total sales. The result is then sorted in descending order by total sales.
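A sketch of that pipeline in pymongo (connection string and database name are placeholders; "total sales" is computed as quantity * price, which is an assumption about how sales is defined here):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["shop"]  # placeholder connection and database

pipeline = [
    {"$unwind": "$items"},
    {"$group": {
        "_id": "$items.product",
        "totalSales": {"$sum": {"$multiply": ["$items.quantity", "$items.price"]}}
    }},
    {"$sort": {"totalSales": -1}}
]
result = list(db.orders.aggregate(pipeline))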
Third: Top users: who are the top users based on the number of orders placed?
db.orders.aggregate([
{
"$group": {
"_id": "$user",
"totalOrders": { "$sum": 1 }
}
},
{ "$sort": { "totalOrders": -1 } }
])
$group Stage:
{ "$group": { "_id": "$user", "totalOrders": { "$sum": 1 } } }
Here's a breakdown of the $group stage components:
"_id": "$user": This is specifying that the documents (in this case, orders) should
be grouped by the user field. So, for each unique user value in the collection, a
single output document will be produced.
"totalOrders": { "$sum": 1 }: This is an accumulator. For each document that gets
grouped into the same user, 1 is added to the totalOrders field. In other words,
this counts the number of orders (documents) for each user. The resulting
totalOrders field in each output document represents the total number of orders
made by the respective user.
$sort Stage:
{ "$sort": { "totalOrders": -1 } }
This stage sorts the output documents based on the totalOrders field:
totalOrders: -1: the -1 indicates a descending sort, so users with more orders will appear first in the output.
[
{ "_id": "userA", "totalOrders": 100 },
{ "_id": "userB", "totalOrders": 75 },
{ "_id": "userC", "totalOrders": 50 },
... and so on ...
]
2. Indexing:
Indexes support the efficient execution of search operations in MongoDB. Without
indexes, MongoDB has to scan every document in a collection to find the ones that
match a query – this is a "collection scan". With the right index, the database can
narrow down the search to fewer documents. This is similar to the index of a book,
which helps you find content faster without reading every page.
Example:
Suppose this query is running slowly:
db.orders.find({ "user": "Amit" });
Creating an index on the user field of the orders collection speeds up lookups based on user:
db.orders.createIndex({ "user": 1 });
The database will then use the index to quickly locate the orders for "Amit" without scanning every document in the collection.
This would significantly speed up the query execution time, especially as the
dataset grows larger.
It's similar to looking up a word in a book. Without an index, you'd have to go
page by page (full scan). With an index, you can go directly to the relevant pages.
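To confirm that a query is actually using the index, the query plan can be inspected; a small pymongo sketch (connection string and database name are placeholders):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["shop"]  # placeholder connection and database

db.orders.create_index([("user", 1)])
plan = db.orders.find({"user": "Amit"}).explain()
# With the index in place, the winning plan is an index scan (IXSCAN)
# rather than a full collection scan (COLLSCAN).
print(plan["queryPlanner"]["winningPlan"])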
Sharding: To handle massive amounts of data and provide high throughput operations,
MongoDB uses sharding, where data is distributed across a cluster of machines.
Text Search: MongoDB supports text search against string content in the
collections, which can be beneficial for search-as-you-type functionalities.
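A brief pymongo sketch of text search (the articles collection and its title/content fields are assumptions; connection details are placeholders):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["content"]  # placeholder connection and database

# A text index over the string fields to be searched
db.articles.create_index([("title", "text"), ("content", "text")])

# $text runs a text search against the indexed string content
for doc in db.articles.find({"$text": {"$search": "mongodb aggregation"}}):
    print(doc.get("title"))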
GridFS: If you need to store and retrieve files such as images or audio files,
MongoDB offers GridFS, which splits the file into chunks and stores each chunk as a
separate document.