Topic 1 - Question Set 1
Expert Verified, Online, Free.
Question #1 Topic 1
You have a table in an Azure Synapse Analytics dedicated SQL pool. The table was created by using the following Transact-SQL statement.
✑ Provide fast lookup of the managers' attributes such as name and job title.
A.
[ManagerEmployeeID] [smallint] NULL
B.
[ManagerEmployeeKey] [smallint] NULL
C.
[ManagerEmployeeKey] [int] NULL
D.
[ManagerName] [varchar](200) NULL
Correct Answer:
C
We need an extra column to identify the manager. Use the same data type as the EmployeeKey column, which is int.
Reference:
https://docs.microsoft.com/en-us/analysis-services/tabular-models/hierarchies-ssas-tabular
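As a sketch of what option C implies: the new column should use the same int data type as the existing EmployeeKey surrogate key. The table name below is an assumption, since the original CREATE TABLE statement appears only in the exhibit.
-- Hypothetical table name; the real definition is shown in the exhibit.
ALTER TABLE dbo.DimEmployee
ADD [ManagerEmployeeKey] [int] NULL;  -- same data type as the EmployeeKey surrogate key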
alexleonvalencia
Highly Voted
5 months, 2 weeks ago
Selected Answer: C
The answer is correct.
upvoted 10 times
jskibick
Highly Voted
3 months, 3 weeks ago
Selected Answer: C
Answer C. smallint eliminates A and B. But I would name the field [ManagerEmployeeID] [int] NULL, since it should reference EmployeeID rather than EmployeeKey, because EmployeeKey is an IDENTITY column.
upvoted 6 times
Dothy
Most Recent
2 weeks ago
Selected Answer: C
upvoted 1 times
Egocentric
1 month, 1 week ago
C is the correct answer
upvoted 1 times
boggy011
1 month, 1 week ago
Selected Answer: C
upvoted 1 times
temacc
2 months ago
Selected Answer: C
Correct answer is C
upvoted 1 times
Guincimund
2 months ago
Selected Answer: C
answer is C.
upvoted 1 times
NeerajKumar
2 months, 3 weeks ago
Selected Answer: C
Correct Ans is C
upvoted 2 times
KosteK
2 months, 4 weeks ago
Selected Answer: C
correct answer is C
upvoted 1 times
samtrion
3 months ago
Selected Answer: C
It is quite obvious C
upvoted 1 times
ArunCDE
3 months, 2 weeks ago
Selected Answer: C
This is the correct answer.
upvoted 1 times
PallaviPatel
4 months ago
Selected Answer: C
C is correct. Use the surrogate key instead of the business key as a reference.
upvoted 1 times
Aurelkb
4 months, 1 week ago
correct answer is C
upvoted 1 times
pozdrotechno
4 months, 1 week ago
Selected Answer: C
C is correct.
The column should be based on the surrogate key (EmployeeKey), including an identical data type.
upvoted 2 times
SofiaG
4 months, 1 week ago
Selected Answer: C
Correct.
upvoted 1 times
jchen9314
4 months, 2 weeks ago
INT has the best performance to be a key.
upvoted 2 times
Question #2 Topic 1
You have an Azure Synapse workspace named MyWorkspace that contains an Apache Spark database named mytestdb.
You run the following command in an Azure Synapse Analytics Spark pool in MyWorkspace.
CREATE TABLE mytestdb.myParquetTable(
EmployeeID int,
EmployeeName string,
EmployeeStartDate date)
USING Parquet
You then use Spark to insert a row into mytestdb.myParquetTable. The row contains the following data.
One minute later, you execute the following query from a serverless SQL pool in MyWorkspace.
SELECT EmployeeID
FROM mytestdb.dbo.myParquetTable
WHERE name = 'Alice'
A.
24
B.
an error
C.
a null value
Correct Answer:
A
Once a database has been created by a Spark job, you can create tables in it with Spark that use Parquet as the storage format. Table names
will be converted to lower case and need to be queried using the lower case name. These tables will immediately become available for querying
by any of the Azure Synapse workspace Spark pools. They can also be used from any of the Spark jobs subject to permissions.
Note: For external tables, since they are synchronized to serverless SQL pool asynchronously, there will be a delay until they appear.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/metadata/table
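A sketch of the round trip the reference describes. The inserted values (24 and 'Alice') are taken from the answer options and the discussion below; the start date is a placeholder.
-- Spark SQL, run in the Spark pool:
CREATE TABLE mytestdb.myParquetTable (EmployeeID int, EmployeeName string, EmployeeStartDate date) USING Parquet;
INSERT INTO mytestdb.myParquetTable VALUES (24, 'Alice', DATE '2022-01-01');  -- placeholder date
-- Serverless SQL pool, after the metadata has synchronized (lower-case table name, dbo schema):
SELECT EmployeeID FROM mytestdb.dbo.myparquettable WHERE EmployeeName = 'Alice';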
kruukp
Highly Voted
1 year ago
B is a correct answer. There is a column 'name' in the where clause which doesn't exist in the table.
upvoted 103 times
knarf
11 months, 1 week ago
I agree B is correct, not because the column 'name' in the query is invalid, but because the table reference itself is invalid as the table was
created as CREATE TABLE mytestdb.myParquetTable and not mytestdb.dbo.myParquetTable
upvoted 14 times
EddyRoboto
9 months ago
When we query a Spark table from SQL serverless we must use the schema, in this case dbo, so this doesn't cause errors.
upvoted 7 times
anarvekar
9 months, 3 weeks ago
Isn't dbo the default schema the objects are created in, if the schema name is not explicitly specified in the DDL?
upvoted 2 times
AugustineUba
10 months ago
I agree with this.
upvoted 1 times
anto69
4 months, 2 weeks ago
Agree there's no column named 'name'
upvoted 2 times
baobabko
12 months ago
Even if the column name were correct: when I tried the example, it threw an error that the table doesn't exist (as expected - after all, it is a Spark table, not SQL; there is no external or any other table that could be queried in the SQL pool).
upvoted 4 times
EddyRoboto
9 months ago
They share the same metadata; perhaps you forgot to specify the schema in your query in the SQL serverless pool. You should have tried spark_db.[dbo].spark_table
upvoted 2 times
Alekx42
12 months ago
https://docs.microsoft.com/en-us/azure/synapse-analytics/metadata/table
"Once a database has been created by a Spark job, you can create tables in it with Spark that use Parquet as the storage format. Table names
will be converted to lower case and need to be queried using the lower case name. These tables will immediately become available for
querying by any of the Azure Synapse workspace Spark pools. The Spark created, managed, and external tables are also made available as
external tables with the same name in the corresponding synchronized database in serverless SQL pool."
I think the reason you got the error was because the query had to use the lower case names. See the example in the same link, they create a
similar table and use the lowercase letters to query it from the Serverless SQL pool.
knarf
11 months ago
See my post above and comment?
upvoted 1 times
polokkk
Highly Voted
4 months, 2 weeks ago
A is correct in the real exam; there the query used EmployeeName, not name, so 24 is the one to select in the real exam.
B is correct for this question as written, since it isn't exactly the same as the exam question.
upvoted 12 times
Dothy
Most Recent
2 weeks ago
No EmployeeName column in query. So answer B is correct
upvoted 1 times
Lizaveta
4 weeks ago
I came across this question on an exam today. It was correct query with "WHERE
FelixI
1 month ago
Selected Answer: B
No EmployeeName column in query
upvoted 1 times
Egocentric
1 month, 1 week ago
B is correct because there is no column called name
upvoted 1 times
xuezhizlv
1 month, 3 weeks ago
Selected Answer: B
B is the correct answer.
upvoted 1 times
AlCubeHead
2 months ago
Selected Answer: B
There is no name column in the table. Also the - at the end of the select is also dubious to me
upvoted 1 times
Guincimund
2 months ago
Selected Answer: B
SELECT EmployeeID -
FROM mytestdb.dbo.myParquetTable
As the Where clause is name = 'Alice', the answer is B as there is no column named 'name'.
In the case where the WHERE clause is "WHERE EmployeeName = 'Alice'", the query would return 24, which is answer A.
upvoted 2 times
Sakshi_21
2 months, 2 weeks ago
Selected Answer: B
The name column doesn't exist in the table.
upvoted 1 times
enricobny
2 months, 2 weeks ago
B is the right answer. Column 'name' is not present in the table structure, and using mytestdb.dbo.myParquetTable will also not work - [dbo] is the problem.
Rama22
2 months, 3 weeks ago
name = 'Alice', not EmployeeName, will throw Error
upvoted 1 times
jskibick
3 months, 3 weeks ago
Selected Answer: B
B. The serverless SQL query has an error; the name field does not exist in the table.
upvoted 1 times
PallaviPatel
4 months ago
Selected Answer: B
Wrong table and column names in the query.
upvoted 1 times
Fer079
4 months ago
Regarding the lower case... I have tested it on Azure. I created the table in a Spark pool, and it's true that it is converted to lower case automatically; however, we can query from both the Spark pool and the Synapse serverless pool using lower/upper case and it will always find the table... Has anyone else tested it?
upvoted 2 times
ANath
4 months ago
I am getting 'Bulk load data conversion error (type mismatch or invalid character for the specified codepage)' error.
upvoted 1 times
ANath
4 months ago
Sorry, I was doing it the wrong way. If we specify the table name in lower case and specify the correct column name, the exact result will show.
upvoted 1 times
pozdrotechno
4 months, 1 week ago
Selected Answer: B
B is correct.
Question #3 Topic 1
DRAG DROP -
You have a table named SalesFact in an enterprise data warehouse in Azure Synapse Analytics. SalesFact contains sales data from the past 36
months and has the following characteristics:
✑ Is partitioned by month
At the beginning of each month, you need to remove data from SalesFact that is older than 36 months as quickly as possible.
Which three actions should you perform in sequence in a stored procedure? To answer, move the appropriate actions from the list of actions to the
answer area and arrange them in the correct order.
Correct Answer:
Step 1: Create an empty table named SalesFact_work that has the same schema as SalesFact.
Step 2: Switch the partition containing the stale data from SalesFact to SalesFact_Work.
SQL Data Warehouse supports partition splitting, merging, and switching. To switch partitions between two tables, you must ensure that the
partitions align on their respective boundaries and that the table definitions match.
Loading data into partitions with partition switching is a convenient way to stage new data in a table that is not visible to users, and then switch the new data in.
Reference:
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-partition
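A minimal sketch of the three actions, assuming the third action is dropping the work table. The column names, distribution and boundary values are assumptions; the real SalesFact definition is in the exhibit.
-- Step 1: create an empty work table with the same schema, distribution and partition boundaries as SalesFact
CREATE TABLE dbo.SalesFact_Work
WITH (DISTRIBUTION = HASH(CustomerKey),
      PARTITION (OrderDateKey RANGE RIGHT FOR VALUES (20190101, 20190201 /* ...one boundary per month... */)))
AS SELECT * FROM dbo.SalesFact WHERE 1 = 2;
-- Step 2: switch the partition that holds data older than 36 months into the work table (a metadata-only operation)
ALTER TABLE dbo.SalesFact SWITCH PARTITION 1 TO dbo.SalesFact_Work PARTITION 1;
-- Step 3: drop the work table to discard the stale data
DROP TABLE dbo.SalesFact_Work;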
hsetin
Highly Voted
8 months, 2 weeks ago
svik
8 months, 2 weeks ago
Yes. Once the partition is switched with an empty partition it is equivalent to truncating the partition from the original table
upvoted 1 times
Dothy
Most Recent
2 weeks ago
Step 1: Create an empty table named SalesFact_work that has the same schema as SalesFact.
Step 2: Switch the partition containing the stale data from SalesFact to SalesFact_Work.
JJdeWit
1 month, 1 week ago
D A C is the right option.
For more information, this doc discusses exactly this example: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-
data-warehouse-tables-partition
upvoted 1 times
theezin
1 month, 3 weeks ago
Why isn't deleting the sales data older than 36 months, which is mentioned in the question, included?
upvoted 1 times
RamGhase
3 months, 2 weeks ago
I could not understand how the answer handles removing data older than 36 months.
upvoted 1 times
gerard
3 months, 2 weeks ago
You have to move out the partitions that contain data older than 36 months.
upvoted 2 times
PallaviPatel
4 months ago
D A C is correct.
upvoted 1 times
indomanish
4 months, 2 weeks ago
Partition switching helps us load large data sets quickly. Not sure if it will help in purging data as well.
upvoted 2 times
SabaJamal2010AtGmail
4 months, 4 weeks ago
Given answer is correct
upvoted 2 times
covillmail
7 months ago
DAC is correct
upvoted 4 times
AvithK
9 months, 2 weeks ago
truncate partition is even quicker, why isn't that the answer, if the data is dropped anyway?
upvoted 3 times
yolap31172
7 months, 2 weeks ago
There is no way to truncate partitions in Synapse. Partitions don't even have names and you can't reference them by value.
upvoted 4 times
BlackMal
9 months, 2 weeks ago
This, i think it should be the answer
upvoted 1 times
poornipv
10 months ago
what is the correct answer for this?
upvoted 2 times
AnonAzureDataEngineer
10 months ago
Seems like it should be:
1. E
2. A
3. C
upvoted 1 times
dragos_dragos62000
10 months, 4 weeks ago
Correct!
upvoted 1 times
Dileepvikram
12 months ago
savin
11 months, 1 week ago
partition switching part covers it. So its correct i think
upvoted 1 times
wfrf92
1 year ago
Is this correct ????
upvoted 1 times
alain2
1 year ago
Yes, it is.
https://www.cathrinewilhelmsen.net/table-partitioning-in-sql-server-partition-switching/
upvoted 5 times
YipingRuan
7 months, 2 weeks ago
"Archive data by switching out: Switch from Partition to Non-Partitioned" ?
upvoted 1 times
TorbenS
1 year ago
yes, I think so
upvoted 4 times
Question #4 Topic 1
You have files and folders in Azure Data Lake Storage Gen2 for an Azure Synapse workspace as shown in the following exhibit.
When you query ExtTable by using an Azure Synapse Analytics serverless SQL pool, which files are returned?
A.
File2.csv and File3.csv only
B.
File1.csv and File4.csv only
C.
File1.csv, File2.csv, File3.csv, and File4.csv
D.
File1.csv only
Correct Answer:
C
To run a T-SQL query over a set of files within a folder or set of folders while treating them as a single entity or rowset, provide a path to a folder
or a pattern
(using wildcards) over a set of files or folders.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-data-storage#query-multiple-files-or-folders
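The disagreement in the discussion below comes down to how LOCATION is resolved by native (serverless) external tables versus Hadoop external tables. A sketch, with a hypothetical column list, data source and file format:
CREATE EXTERNAL TABLE ExtTable (Col1 int, Col2 varchar(100))   -- column list is an assumption
WITH (LOCATION = '/topfolder/',
      DATA_SOURCE = MyAdlsSource,       -- assumed external data source
      FILE_FORMAT = MyCsvFormat);       -- assumed external file format
-- A Hadoop external table (dedicated SQL pool) with this LOCATION also returns files in subfolders;
-- a native external table in a serverless SQL pool does not, unless the path ends in /** (e.g. '/topfolder/**').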
Chillem1900
Highly Voted
1 year ago
I believe the answer should be B.
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#arguments-create-external-table
upvoted 76 times
captainpike
7 months, 1 week ago
I tested and proved you right; the answer is B. Remember the question is referring to serverless SQL and not a dedicated SQL pool. "Unlike Hadoop
external tables, native external tables don't return subfolders unless you specify /** at the end of path. In this example, if
LOCATION='/webdata/', a serverless SQL pool query, will return rows from mydata.txt. It won't return mydata2.txt and mydata3.txt because
they're located in a subfolder. Hadoop tables will return all files within any subfolder."
upvoted 12 times
alain2
Highly Voted
1 year ago
"Serverless SQL pool can recursively traverse folders only if you specify /** at the end of path."
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-folders-multiple-csv-files
upvoted 17 times
Preben
11 months, 3 weeks ago
When you are quoting from Microsoft documentation, do not ADD in words to the sentence. 'Only' is not used.
upvoted 10 times
captainpike
7 months, 1 week ago
The answer is B, however. I could not make "/**" work. Can somebody confirm?
upvoted 2 times
amiral404
Most Recent
2 days, 21 hours ago
C is correct, as mentioned in the official documentation, which showcases a similar example: https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15&tabs=dedicated#location--folder_or_filepath
upvoted 1 times
Backy
1 week, 5 days ago
The question does not show the actual query so this is a problem
upvoted 1 times
Dothy
2 weeks ago
I believe the answer should be B.
upvoted 1 times
carloalbe
3 weeks ago
In this example, if LOCATION='/webdata/', a PolyBase query will return rows from mydata.txt and mydata2.txt. It won't return mydata3.txt because it's a file in a hidden folder. And it won't return _hidden.txt because it's a hidden file.
https://docs.microsoft.com/en-us/sql/t-sql/statements/media/aps-polybase-folder-traversal.png?view=sql-server-ver15
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15&tabs=dedicated
upvoted 1 times
BJPJowee
4 weeks ago
The answer is correct: C. See the link https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15&tabs=dedicated
upvoted 1 times
MS_Nikhil
4 weeks, 1 day ago
Ans is definitely B
upvoted 1 times
poundmanluffy
1 month, 4 weeks ago
Selected Answer: B
Option is definitely "B"
Unlike Hadoop external tables, native external tables don't return subfolders unless you specify /** at the end of path. In this example, if
LOCATION='/webdata/', a serverless SQL pool query, will return rows from mydata.txt. It won't return mydata2.txt and mydata3.txt because they're
located in a subfolder. Hadoop tables will return all files within any sub-folder.
upvoted 1 times
Ozren
2 months, 1 week ago
Selected Answer: B
This is not a recursive pattern like '.../**'. So the answer is B, not C.
upvoted 2 times
kamil_k
2 months, 2 weeks ago
this one is tricky, I found information here which would suggest answer C is indeed correct:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15&tabs=dedicated#arguments-2
upvoted 1 times
kamil_k
2 months, 2 weeks ago
ok I've done the test:
3. created container myfilesystem, subfolder topfolder and another subfolder topfolder under that
4. created two csv files and dropped one per folder, i.e. one in topfolder and the other in topfolder/topfolder
5. created the external data source, file format and external table:
CREATE EXTERNAL DATA SOURCE test
WITH (
LOCATION = 'https://[storage-account-name].blob.core.windows.net/myfilesystem'
);
CREATE EXTERNAL FILE FORMAT test
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = ',',
FIRST_ROW = 2
)
);
CREATE EXTERNAL TABLE ExtTable (Col1 int, Col2 varchar(100))  -- table name and column list not shown in the original comment
WITH (
LOCATION = '/topfolder/',
DATA_SOURCE = test,
FILE_FORMAT = test
);
The result was only records from File1.csv, which was located in the first "topfolder".
upvoted 2 times
kamil_k
2 months, 2 weeks ago
in other words, the answer C is incorrect. I forgot to mention I used the built-in serverless SQL Pool
upvoted 1 times
islamarfh
1 month, 4 weeks ago
This is from the document; it tells me that B is indeed correct:
In this example, if LOCATION='/webdata/', a PolyBase query will return rows from mydata.txt and mydata2.txt. It won't return mydata3.txt
because it's a file in a hidden folder. And it won't return _hidden.txt because it's a hidden file.
upvoted 1 times
RalphLiang
2 months, 3 weeks ago
Selected Answer: B
I believe the answer should be B.
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#arguments-create-external-table
upvoted 2 times
KosteK
2 months, 4 weeks ago
Selected Answer: B
Tested. Ans: B
upvoted 2 times
toms100
3 months, 2 weeks ago
Answer is C
If you specify LOCATION to be a folder, a PolyBase query that selects from the external table will retrieve files from the folder and all of its
subfolders.
Refer https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15&tabs=dedicated
upvoted 2 times
PallaviPatel
4 months ago
Selected Answer: B
B is correct answer.
upvoted 2 times
Sandip4u
4 months, 2 weeks ago
The answer is B. In the case of a serverless pool, a wildcard should be added to the location; otherwise it will not fetch the files from child folders.
upvoted 2 times
bharatnhkh10
4 months, 3 weeks ago
Selected Answer: B
as we need ** to pick up all files
upvoted 2 times
Question #5 Topic 1
HOTSPOT -
You are planning the deployment of Azure Data Lake Storage Gen2.
You have the following two reports that will access the data lake:
You need to recommend in which format to store the data in the data lake to support the reports. The solution must minimize read times.
What should you recommend for each report? To answer, select the appropriate options in the answer area.
Hot Area:
Correct Answer:
Report1: CSV -
Report2: AVRO -
Not Parquet, TSV: Not options for Azure Data Lake Storage Gen2.
Reference:
https://streamsets.com/documentation/datacollector/latest/help/datacollector/UserGuide/Destinations/ADLS-G2-D.html
alain2
Highly Voted
1 year ago
1: Parquet - column-oriented binary file format
https://youtu.be/UrWthx8T3UY
upvoted 92 times
azurestudent1498
1 month, 1 week ago
this is correct.
upvoted 1 times
terajuana
11 months, 2 weeks ago
the web is full of old information. timestamp support has been added to parquet
upvoted 5 times
vlad888
11 months ago
OK, but in the first case we need only 3 of 50 columns, and Parquet is a columnar format. In the second case, Avro, because it is ideal for reading full rows.
upvoted 12 times
Himlo24
Highly Voted
1 year ago
Shouldn't the answer for Report 1 be Parquet? Because Parquet format is Columnar and should be best for reading a few columns only.
upvoted 18 times
main616
Most Recent
1 week, 6 days ago
1. CSV (or JSON). CSV/JSON support query acceleration by selecting specified rows: https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-query-acceleration#overview
2. Avro
upvoted 1 times
Dothy
2 weeks ago
1: Parquet
2: AVRO
upvoted 1 times
RalphLiang
2 months, 3 weeks ago
Consider Parquet and ORC file formats when the I/O patterns are more read heavy or when the query patterns are focused on a subset of columns
in the records.
the Avro format works well with a message bus such as Event Hub or Kafka that write multiple events/messages in succession.
upvoted 3 times
ragz_87
3 months, 3 weeks ago
1. Parquet
2. Avro
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices
"Consider using the Avro file format in cases where your I/O patterns are more write heavy, or the query patterns favor retrieving multiple rows of
records in their entirety.
Consider Parquet and ORC file formats when the I/O patterns are more read heavy or when the query patterns are focused on a subset of columns
in the records."
upvoted 5 times
SebK
2 months ago
Thank you.
upvoted 1 times
MohammadKhubeb
4 months ago
Why NOT csv in report1 ?
upvoted 1 times
Sandip4u
4 months, 2 weeks ago
This has to be parquet and AVRO , got the answer from Udemy
upvoted 4 times
Mahesh_mm
5 months ago
1. Parquet
2. AVRO
upvoted 3 times
marcin1212
5 months, 2 weeks ago
The goal is: the solution must minimize read times.
9 million records:
Parquet ~150 MB
Avro ~700 MB
I checked:
- Parquet
- Parquet
upvoted 2 times
dev2dev
4 months, 3 weeks ago
how can be faster read is same as number of reads?
upvoted 1 times
Ozzypoppe
6 months ago
Solution says parquet is not supported for adls gen 2 but it actually is: https://docs.microsoft.com/en-us/azure/data-factory/format-parquet
upvoted 3 times
noranathalie
7 months, 1 week ago
An interesting and complete article that explains the different uses between parquet/avro/csv and gives answers to the question :
https://medium.com/ssense-tech/csv-vs-parquet-vs-avro-choosing-the-right-tool-for-the-right-job-79c9f56914a8
upvoted 4 times
elimey
10 months, 1 week ago
https://luminousmen.com/post/big-data-file-formats
upvoted 1 times
elimey
10 months, 1 week ago
Report 1 definitely Parquet
upvoted 1 times
noone_a
10 months, 3 weeks ago
Report 1 - Parquet, as it is columnar.
Report 2 - Avro, as it is row-based and can be compressed further than CSV.
upvoted 1 times
bsa_2021
11 months, 1 week ago
The provided answer and the answer from the discussion differ. Which one should we follow for the actual exam?
upvoted 1 times
Yaduvanshi
7 months, 2 weeks ago
Follow what feels logical after reading the answer and the discussion forum.
upvoted 2 times
bc5468521
12 months ago
1- Parquet
2- Parquet
Since they are all querying: Avro is good for writing/OLTP, Parquet is good for querying/reads.
upvoted 5 times
Question #6 Topic 1
You are designing the folder structure for an Azure Data Lake Storage Gen2 container.
Users will query data by using a variety of services including Azure Databricks and Azure Synapse Analytics serverless SQL pools. The data will be
secured by subject area. Most queries will include data from the current year or current month.
Which folder structure should you recommend to support fast queries and simplified folder security?
A.
/{SubjectArea}/{DataSource}/{DD}/{MM}/{YYYY}/{FileData}_{YYYY}_{MM}_{DD}.csv
B.
/{DD}/{MM}/{YYYY}/{SubjectArea}/{DataSource}/{FileData}_{YYYY}_{MM}_{DD}.csv
C.
/{YYYY}/{MM}/{DD}/{SubjectArea}/{DataSource}/{FileData}_{YYYY}_{MM}_{DD}.csv
D.
/{SubjectArea}/{DataSource}/{YYYY}/{MM}/{DD}/{FileData}_{YYYY}_{MM}_{DD}.csv
Correct Answer:
D
There's an important reason to put the date at the end of the directory structure. If you want to lock down certain regions or subject matters to
users/groups, then you can easily do so with the POSIX permissions. Otherwise, if there was a need to restrict a certain security group to
viewing just the UK data or certain planes, with the date structure in front a separate permission would be required for numerous directories
under every hour directory. Additionally, having the date structure in front would exponentially increase the number of directories as time went
on.
Note: In IoT workloads, there can be a great deal of data being landed in the data store that spans across numerous products, devices,
organizations, and customers. It's important to pre-plan the directory layout for organization, security, and efficient processing of the data for
down-stream consumers. A general template to consider might be the following layout:
{Region}/{SubjectMatter(s)}/{yyyy}/{mm}/{dd}/{hh}/
sagga
Highly Voted
1 year ago
D is correct
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices#batch-jobs-structure
upvoted 41 times
Dothy
Most Recent
2 weeks ago
D is correct
upvoted 1 times
Olukunmi
1 month ago
Selected Answer: D
D is correct
upvoted 1 times
Egocentric
1 month, 1 week ago
D is correct
upvoted 1 times
SebK
2 months ago
Selected Answer: D
D is correct
upvoted 2 times
RalphLiang
2 months, 3 weeks ago
Selected Answer: D
D is correct
upvoted 1 times
NeerajKumar
2 months, 3 weeks ago
Selected Answer: D
Correct
upvoted 1 times
PallaviPatel
4 months ago
Selected Answer: D
Correct
upvoted 1 times
Skyrocket
4 months ago
D is correct
upvoted 1 times
VeroDon
4 months, 3 weeks ago
Selected Answer: D
Thats correct
upvoted 2 times
Mahesh_mm
5 months ago
D is correct
upvoted 1 times
alexleonvalencia
5 months, 2 weeks ago
Correct answer: D.
upvoted 1 times
rashjan
5 months, 2 weeks ago
Selected Answer: D
D is correct
upvoted 1 times
ohana
7 months ago
Took the exam today, this question came out.
Ans: D
upvoted 4 times
Sunnyb
11 months, 3 weeks ago
D is absolutely correct
upvoted 2 times
Question #7 Topic 1
HOTSPOT -
Which file format should you use for each type of output? To answer, select the appropriate options in the answer area.
Hot Area:
Correct Answer:
Box 1: Parquet -
Parquet stores data in columns, while Avro stores data in a row-based format. By their very nature, column-oriented data stores are optimized
for read-heavy analytical workloads, while row-based databases are best for write-heavy transactional workloads.
Box 2: Avro -
Note: Azure Data Factory supports the following file formats (not GZip or TXT).
Avro format -
✑ Binary format
✑ Excel format
✑ JSON format
✑ ORC format
✑ Parquet format
✑ XML format
Reference:
https://www.datanami.com/2018/05/16/big-data-file-formats-demystified
Mahesh_mm
Highly Voted
5 months ago
Parquet and AVRO is correct option
upvoted 17 times
Dothy
Most Recent
2 weeks ago
agree with the answer
upvoted 2 times
RalphLiang
2 months, 3 weeks ago
Parquet and AVRO is correct option
upvoted 2 times
PallaviPatel
4 months ago
correct
upvoted 1 times
Skyrocket
4 months ago
Parquet and AVRO is right.
upvoted 2 times
edba
5 months ago
The GZip file format is one of the binary formats supported by ADF.
https://docs.microsoft.com/en-us/azure/data-factory/connector-file-system?tabs=data-factory#file-system-as-sink
upvoted 1 times
bad_atitude
5 months, 1 week ago
agree with the answer
upvoted 2 times
alexleonvalencia
5 months, 2 weeks ago
Correct answer: PARQUET & AVRO.
upvoted 1 times
Question #8 Topic 1
HOTSPOT -
You use Azure Data Factory to prepare data to be queried by Azure Synapse Analytics serverless SQL pools.
Files are initially ingested into an Azure Data Lake Storage Gen2 account as 10 small JSON files. Each file contains the same data attributes and
data from a subsidiary of your company.
You need to move the files to a different folder and transform the data to meet the following requirements:
How should you configure the Data Factory copy activity? To answer, select the appropriate options in the answer area.
Hot Area:
Correct Answer:
Box 1: Preserve hierarchy -
Compared to the flat namespace on Blob storage, the hierarchical namespace greatly improves the performance of directory management operations, which improves overall job performance.
Box 2: Parquet -
Azure Data Factory parquet format is supported for Azure Data Lake Storage Gen2.
Reference:
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction https://docs.microsoft.com/en-us/azure/data-
factory/format-parquet
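For context, once the small JSON files have been combined and written as Parquet, a serverless SQL pool would typically read the output with OPENROWSET. A sketch with a hypothetical storage path:
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/merged/subsidiaries.parquet',  -- hypothetical path
    FORMAT = 'PARQUET'
) AS [result];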
alain2
Highly Voted
1 year ago
1. Merge Files
2. Parquet
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-performance-tuning-guidance
upvoted 87 times
edba
5 months ago
just want to add a bit more reference regarding copyBehavior in ADF plus info mentioned in Best Practice doc, so it shall be MergeFile first.
https://docs.microsoft.com/en-us/azure/data-factory/connector-file-system?tabs=data-factory#file-system-as-sink
upvoted 3 times
kilowd
7 months, 1 week ago
Larger files lead to better performance and reduced costs.
Typically, analytics engines such as HDInsight have a per-file overhead that involves tasks such as listing, checking access, and performing
various metadata operations. If you store your data as many small files, this can negatively affect performance. In general, organize your data
into larger sized files for better performance (256 MB to 100 GB in size).
upvoted 3 times
Ameenymous
1 year ago
The smaller the files, the worse the performance, so Merge and Parquet seem to be the right answer.
upvoted 14 times
captainbee
Highly Voted
10 months, 3 weeks ago
It's frustrating just how many questions ExamTopics get wrong. Can't be helpful
upvoted 26 times
RyuHayabusa
10 months, 1 week ago
At least it helps with learning, as you have to research and think for yourself. On top of that, having these questions in the first place is immensely helpful.
upvoted 30 times
SebK
1 month, 4 weeks ago
Agree.
upvoted 2 times
gssd4scoder
7 months ago
Trying to understand if an answer is correct will help learn more
upvoted 3 times
Dothy
Most Recent
2 weeks ago
1. Merge Files
2. Parquet
upvoted 1 times
KashRaynardMorse
1 month, 1 week ago
A requirement was "Automatically infer the schema from the underlying files", meaning Preserve hierarchy is needed.
upvoted 2 times
gabdu
3 weeks, 3 days ago
It is possible that all or some schemas may be different; in that case we cannot merge.
upvoted 1 times
imomins
1 month, 3 weeks ago
Another key point is: you need to move the files to a different folder.
Eyepatch993
2 months ago
1. Preserve hierarchy - ADF is used only for processing and Synapse is the sink. Since Synapse has parallel processing power, it can process the files in different folders and thus improve performance.
2. Parquet
upvoted 1 times
kamil_k
2 months, 1 week ago
Are these answers the actual correct answers or guesses? Who highlights the correct answers?
upvoted 2 times
srakrn
4 months ago
"In general, we recommend that your system have some sort of process to aggregate small files into larger ones for use by downstream
applications."
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices
Sandip4u
4 months, 2 weeks ago
Merge and parquet will be the right option , also taken reference from Udemy
upvoted 2 times
Mahesh_mm
5 months ago
1. As the hierarchical namespace greatly improves the performance of directory management operations, which improves overall job performance, Preserve hierarchy looks correct. Also, there is overhead for merging files.
2. Parquet
upvoted 3 times
Boompiee
2 weeks, 3 days ago
The overhead for merging happens once, after that it's faster every time to query the files if they are merged.
upvoted 1 times
m2shines
5 months, 1 week ago
Merge Files and Parquet
upvoted 1 times
AM1971
6 months, 2 weeks ago
shouldn't a json file be flattened first? So I think the answer is: flatten and parquet
upvoted 1 times
RinkiiiiiV
7 months, 3 weeks ago
1. Preserve hierarchy
2. Parquet
upvoted 1 times
noobplayer
7 months, 2 weeks ago
Is this correct?
upvoted 2 times
Marcus1612
8 months, 2 weeks ago
The files are copied/transformed from one folder to another inside the same hierarchical account. The hierarchical property is defined during account creation, so the destination folder still has the hierarchical namespace. On the other hand, as mentioned by Microsoft: typically, analytics engines such as HDInsight and Azure Data Lake Analytics have a per-file overhead. If you store your data as many small files, this can negatively affect performance. In general, organize your data into larger sized files for better performance (256 MB to 100 GB in size).
upvoted 2 times
meetj
9 months ago
1. Merge for sure
elimey
10 months, 1 week ago
1. Merge Files: because the question says the 10 small JSON files are moved to a different folder.
2. Parquet
upvoted 5 times
Erte
11 months ago
Box 1: Preserve hierarchy
Compared to the flat namespace on Blob storage, the hierarchical namespace greatly improves the performance of directory management operations, which improves overall job performance.
Box 2: Parquet
Azure Data Factory parquet format is supported for Azure Data Lake Storage Gen2. Parquet supports the schema property.
Reference:
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction https://docs.microsoft.com/en-us/azure/data-
factory/format-parquet
upvoted 2 times
Question #9 Topic 1
HOTSPOT -
You have a data model that you plan to implement in a data warehouse in Azure Synapse Analytics as shown in the following exhibit.
All the dimension tables will be less than 2 GB after compression, and the fact table will be approximately 6 TB. The dimension tables will be
relatively static with very few data inserts and updates.
Which type of table should you use for each table? To answer, select the appropriate options in the answer area.
Hot Area:
Correct Answer:
Box 1: Replicated -
Replicated tables are ideal for small star-schema dimension tables, because the fact table is often distributed on a column that is not
compatible with the connected dimension tables. If this case applies to your schema, consider changing small dimension tables currently
implemented as round-robin to replicated.
Box 2: Replicated -
Box 3: Replicated -
Box 4: Hash-distributed -
For Fact tables use hash-distribution with clustered columnstore index. Performance improves when two hash tables are joined on the same
distribution column.
Reference:
https://azure.microsoft.com/en-us/updates/reduce-data-movement-and-make-your-queries-more-efficient-with-the-general-availability-of-
replicated-tables/ https://azure.microsoft.com/en-us/blog/replicated-tables-now-generally-available-in-azure-sql-data-warehouse/
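A sketch of the two table types the answer recommends. The table and column names are assumptions, since the data model is shown only in the exhibit.
-- Small, relatively static dimension (< 2 GB compressed): replicate a full copy to every Compute node
CREATE TABLE dbo.DimProduct (ProductKey int NOT NULL, ProductName nvarchar(100) NULL)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);
-- ~6 TB fact table: hash-distribute on a join key and use a clustered columnstore index
CREATE TABLE dbo.FactSales (ProductKey int NOT NULL, CustomerKey int NOT NULL, SalesAmount money NULL)
WITH (DISTRIBUTION = HASH(ProductKey), CLUSTERED COLUMNSTORE INDEX);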
ian_viana
Highly Voted
8 months, 1 week ago
The answer is correct.
The table category often determines which option to choose for distributing the table.
Fact -Use hash-distribution with clustered columnstore index. Performance improves when two hash tables are joined on the same distribution
column.
Dimension - Use replicated for smaller tables. If tables are too large to store on each Compute node, use hash-distributed.
Staging - Use round-robin for the staging table. The load with CTAS is fast. Once the data is in the staging table, use INSERT...SELECT to move the
data to production tables.
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-overview#common-distribution-
methods-for-tables
upvoted 30 times
GameLift
7 months, 3 weeks ago
Thanks, but where in the question does it indicate that the fact table has a clustered columnstore index?
upvoted 3 times
berserksap
6 months, 4 weeks ago
Normally for big tables we use clustered columnstore index for optimal performance and compression. Since the table mentioned here is in
TBs we can safely assume using this index is the best choice
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-index
upvoted 2 times
berserksap
6 months, 4 weeks ago
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-overview
upvoted 1 times
ohana
Highly Voted
7 months ago
Took the exam today, this question came out.
Dothy
Most Recent
2 weeks ago
The answer is correct.
upvoted 1 times
PallaviPatel
4 months ago
correct answer
upvoted 2 times
Pritam85
4 months ago
Got this question on 23/12/2021... the answer is correct.
upvoted 1 times
Mahesh_mm
5 months ago
Ans is correct
upvoted 2 times
alfonsodisalvo
6 months, 3 weeks ago
Dimension are Replicated :
"Since the table has multiple copies, replicated tables work best when the table size is less than 2 GB compressed."
"Replicated tables may not yield the best query performance when:
" We recommend using replicated tables instead of round-robin tables in most cases"
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/design-guidance-for-replicated-tables
upvoted 1 times
gssd4scoder
7 months ago
Correct: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-overview#common-distribution-methods-for-tables
upvoted 1 times
HOTSPOT -
Data is ingested into the container, and then transformed by a data integration application. The data is NOT modified after that. Users can read
files in the container but cannot modify the files.
You need to design a data archiving solution that meets the following requirements:
✑ Data that is older than five years is accessed infrequently but must be available within one second when requested.
✑ Data that is older than seven years is NOT accessed. After seven years, the data must be persisted at the lowest cost possible.
How should you manage the data? To answer, select the appropriate options in the answer area.
Hot Area:
Correct Answer:
Archive - Optimized for storing data that is rarely accessed and stored for at least 180 days with flexible latency requirements, on the order of
hours.
The following table shows a comparison of premium performance block blob storage, and the hot, cool, and archive access tiers.
Reference:
https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers
yobllip
Highly Voted
11 months, 3 weeks ago
Answer should be
1 - Cool
2 - Archive
The comparison table shows that the access time (TTFB) for the cool tier is milliseconds.
https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers#comparing-block-blob-storage-options
upvoted 48 times
r00s
1 day, 16 hours ago
Right. #1 is Cool because it's clearly mentioned in the documentation that "Older data sets that are not used frequently, but are expected to be
available for immediate access"
https://docs.microsoft.com/en-us/azure/storage/blobs/access-tiers-overview#comparing-block-blob-storage-options
upvoted 1 times
Justbu
Highly Voted
8 months, 1 week ago
Tricky question: it says data that is OLDER THAN (> 5 years) must be available within one second when requested.
But the first question asks about five-year-old data, which is = 5, so it could also be hot storage.
Dothy
Most Recent
2 weeks ago
ans is correct
upvoted 1 times
PallaviPatel
4 months ago
ans is correct
upvoted 1 times
ANath
4 months, 3 weeks ago
1. Cool Storage
2. Archive Storage
upvoted 1 times
Mahesh_mm
5 months ago
Answer is correct
upvoted 1 times
ssitb
11 months, 4 weeks ago
Answer should be
1-hot
2-archive
https://www.bmc.com/blogs/cold-vs-hot-data-storage/
Cold storage data retrieval can take much longer than hot storage. It can take minutes to hours to access cold storage data
upvoted 2 times
marcin1212
5 months, 1 week ago
https://www.bmc.com/blogs/cold-vs-hot-data-storage/
captainbee
11 months, 3 weeks ago
Cold storage takes milliseconds to retrieve
upvoted 5 times
syamkumar
11 months, 3 weeks ago
I also wonder whether it's hot storage and archive, because it's mentioned that 5-year-old data has to be retrieved within seconds, which is not possible via cold storage.
upvoted 1 times
savin
11 months, 1 week ago
But the cost factor is also there: keeping the data in the hot tier for 5 years vs the cool tier for 5 years would add a significant amount.
upvoted 1 times
DrC
12 months ago
Answer is correct
upvoted 8 times
DRAG DROP -
You need to create a partitioned table in an Azure Synapse Analytics dedicated SQL pool.
How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used
once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
Correct Answer:
Box 1: DISTRIBUTION -
Table distribution options include DISTRIBUTION = HASH ( distribution_column_name ), assigns each row to one distribution by hashing the
value stored in distribution_column_name.
Box 2: PARTITION -
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-azure-sql-data-warehouse
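A sketch of a completed statement of this shape. The column names, index and boundary values are illustrative assumptions, not the exam's exact exhibit.
CREATE TABLE dbo.FactSales
(
    SaleKey int NOT NULL,
    CustomerKey int NOT NULL,
    Amount money NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),                                  -- Box 1: DISTRIBUTION
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (SaleKey RANGE LEFT FOR VALUES (1, 1000000, 2000000))    -- Box 2: PARTITION
);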
Sunnyb
Highly Voted
11 months, 3 weeks ago
Answer is correct
upvoted 43 times
Sasha_in_San_Francisco
Highly Voted
6 months, 3 weeks ago
Correct answer, but how to remember it? The Distribution option comes before the Partition option because 'D' comes before 'P', or because the system needs to know the distribution algorithm (hash, round-robin, replicate) before it can start to partition or segment the data. (Seem reasonable?)
upvoted 26 times
Dothy
Most Recent
2 weeks ago
Answer is correct
upvoted 1 times
Egocentric
1 month, 1 week ago
provided answer is correct
upvoted 1 times
vineet1234
1 month, 3 weeks ago
D comes before P as in DP-203
upvoted 3 times
PallaviPatel
4 months ago
correct
upvoted 1 times
Jaws1990
4 months, 3 weeks ago
Wouldn't VALUES(1,1000000, 200000) create a partition for records with ID <= 1 which would mean 1 row?
upvoted 1 times
ploer
3 months, 2 weeks ago
Having three boundaries will lead to four partitions.
nastyaaa
3 months, 1 week ago
But only <= and >; it is RANGE LEFT FOR VALUES, right?
upvoted 1 times
Mahesh_mm
5 months ago
Answer is correct
upvoted 1 times
hugoborda
8 months ago
Answer is correct
upvoted 1 times
hsetin
8 months, 3 weeks ago
Indeed! Answer is correct
upvoted 1 times
You need to design an Azure Synapse Analytics dedicated SQL pool that meets the following requirements:
A.
as a temporal table
B.
as a SQL graph table
C.
as a degenerate dimension table
D.
as a Type 2 slowly changing dimension (SCD) table
Correct Answer:
D
A Type 2 SCD supports versioning of dimension members. Often the source system doesn't store versions, so the data warehouse load process
detects and manages changes in a dimension table. In this case, the dimension table must use a surrogate key to provide a unique reference to
a version of the dimension member. It also includes columns that define the date range validity of the version (for example, StartDate and
EndDate) and possibly a flag column (for example, IsCurrent) to easily filter by current dimension members.
Reference:
https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-
dimension-types
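A minimal sketch of a Type 2 SCD employee dimension as described above; the column names are assumptions.
CREATE TABLE dbo.DimEmployee
(
    EmployeeSK int NOT NULL,        -- surrogate key, one row per version (populated by the load process)
    EmployeeID int NOT NULL,        -- business key from the source system
    EmployeeName nvarchar(100) NOT NULL,
    JobTitle nvarchar(100) NULL,
    StartDate date NOT NULL,        -- start of this version's validity
    EndDate date NULL,              -- end of this version's validity (NULL for the current row)
    IsCurrent bit NOT NULL          -- flag for the current version
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);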
bc5468521
Highly Voted
12 months ago
Answer D; a temporal table would be better than SCD2, but it is not supported in Synapse yet.
upvoted 47 times
sparkchu
2 months, 2 weeks ago
though this not something relative to this question. temproal tables looks alike to delta table.
upvoted 1 times
Preben
11 months, 3 weeks ago
Here's the documentation for how to implement temporal tables in Synapse from 2019.
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-temporary
upvoted 1 times
mbravo
11 months, 2 weeks ago
Temporal tables and Temporary tables are two very distinct concepts. Your link has absolutely nothing to do with this question.
upvoted 11 times
Vaishnav
10 months, 2 weeks ago
https://docs.microsoft.com/en-us/azure/azure-sql/temporal-tables
berserksap
6 months, 4 weeks ago
I think synapse doesn't support temporal tables. Please check the below comment by hsetin.
upvoted 1 times
rashjan
Highly Voted
5 months, 2 weeks ago
Selected Answer: D
D is correct (voting in a comment so people don't always have to open the discussion; please upvote to help others).
upvoted 39 times
Dothy
Most Recent
2 weeks ago
Answer is correct
upvoted 2 times
Martin_Nbg
1 month ago
Temporal tables are not supported in Synapse so D is correct.
upvoted 2 times
sparkchu
1 month, 3 weeks ago
overall, you should use delta table :@
upvoted 1 times
PallaviPatel
4 months ago
Selected Answer: D
correct
upvoted 1 times
Adelina
4 months, 2 weeks ago
Selected Answer: D
D is correct
upvoted 2 times
dev2dev
4 months, 2 weeks ago
Confusing highly voted comment: D is SCD2, but the comment is talking about temporal tables. Either way, SCD2 is the right answer, which is choice D.
upvoted 1 times
VeroDon
4 months, 3 weeks ago
Selected Answer: D
Dedicated SQL Pools is the key
upvoted 3 times
Mahesh_mm
5 months ago
Answer is D
upvoted 1 times
hsetin
8 months, 3 weeks ago
Answer is D. Microsoft seems to have confirmed this.
https://docs.microsoft.com/en-us/answers/questions/130561/temporal-table-in-azure-
synapse.html#:~:text=Unfortunately%2C%20we%20do%20not%20support,submitted%20by%20another%20Azure%20customer.
upvoted 3 times
dd1122
9 months, 2 weeks ago
Answer D is correct. Temporal tables mentioned in the link below are supported in Azure SQL Database(PaaS) and Azure Managed Instance, where
as in this question Dedicated SQL Pools are mentioned so no temporal tables can be used. SCD Type 2 is the answer.
https://docs.microsoft.com/en-us/azure/azure-sql/temporal-tables
upvoted 4 times
escoins
11 months ago
Definitively answer D
upvoted 3 times
[Removed]
11 months, 1 week ago
The answer is A - Temporal tables
"Temporal tables enable you to restore row versions from any point in time."
https://docs.microsoft.com/en-us/azure/azure-sql/database/business-continuity-high-availability-disaster-recover-hadr-overview
upvoted 1 times
Dileepvikram
11 months, 3 weeks ago
The requirement says that the table should store latest information, so the answer should be temporal table, right? Because scd type 2 will store
the complete history.
upvoted 1 times
captainbee
11 months, 3 weeks ago
Also needs to return employee information from a given point in time? Full history needed for that.
upvoted 12 times
You have an enterprise-wide Azure Data Lake Storage Gen2 account. The data lake is accessible only through an Azure virtual network named
VNET1.
You are building a SQL pool in Azure Synapse that will use data from the data lake.
Your company has a sales team. All the members of the sales team are in an Azure Active Directory group named Sales. POSIX controls are used to assign the Sales group access to the files in the data lake.
You need to ensure that the SQL pool can load the sales data from the data lake.
Which three actions should you perform? Each correct answer presents part of the solution.
A.
Add the managed identity to the Sales group.
B.
Use the managed identity as the credentials for the data load process.
C.
Create a shared access signature (SAS).
D.
Add your Azure Active Directory (Azure AD) account to the Sales group.
E.
Use the shared access signature (SAS) as the credentials for the data load process.
F.
Create a managed identity.
Correct Answer:
ABF
The managed identity grants permissions to the dedicated SQL pools in the workspace.
Note: Managed identity for Azure resources is a feature of Azure Active Directory. The feature provides Azure services with an automatically managed identity in Azure AD.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/security/synapse-workspace-managed-identity
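A sketch of step B - using the managed identity as the credential for the load. The table name and storage path are hypothetical.
COPY INTO dbo.SalesStaging
FROM 'https://mydatalake.dfs.core.windows.net/sales/*.parquet'   -- hypothetical path in the data lake
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')   -- the workspace managed identity, added to the Sales group
);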
Diane
Highly Voted
1 year ago
correct answer is ABF https://www.examtopics.com/discussions/microsoft/view/41207-exam-dp-200-topic-1-question-56-discussion/
upvoted 61 times
AvithK
9 months, 2 weeks ago
yes but the order is different it is FAB
upvoted 24 times
gssd4scoder
7 months ago
Exactly, agree with you
upvoted 1 times
KingIlo
9 months, 1 week ago
The question didn't specify order or sequence
upvoted 9 times
IDKol
Highly Voted
10 months, 1 week ago
Correct Answer should be
B. Use the managed identity as the credentials for the data load process.
upvoted 20 times
Dothy
Most Recent
2 weeks ago
correct answer is ABF
upvoted 1 times
Egocentric
1 month, 1 week ago
ABF is correct
upvoted 1 times
praticewizards
2 months ago
Selected Answer: ABF
FAB - create, add to group, use to load data
upvoted 1 times
Backy
4 months, 2 weeks ago
Is answer A properly worded?
"Add the managed identity to the Sales group" should be "Add the Sales group to managed identity"
upvoted 3 times
lukeonline
4 months, 3 weeks ago
Selected Answer: ABF
FAB should be correct
upvoted 4 times
VeroDon
4 months, 3 weeks ago
Selected Answer: ABF
FAB is correct sequence
upvoted 2 times
SabaJamal2010AtGmail
4 months, 4 weeks ago
1. Create a managed identity.
2. Add the managed identity to the Sales group.
3. Use the managed identity as the credentials for the data load process.
upvoted 2 times
Mahesh_mm
5 months ago
FAB is correct sequence
upvoted 1 times
Lewistrick
5 months ago
Would it even be a good idea to have the data load process be part of the Sales team? They have separate responsibilities, so should be part of
another group. I know that's not possible in the answer list, but I'm trying to think best practices here.
upvoted 2 times
Aslam208
6 months ago
Selected Answer: ABF
Correct answer is F, A, B
upvoted 6 times
FredNo
6 months, 1 week ago
Selected Answer: ABF
use managed identity
upvoted 5 times
ohana
7 months ago
Took the exam today. Similar question came out. Ans: ABF.
Eniyan
8 months ago
It should be F, A, B; please refer to the following reference: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/quickstart-bulk-load-copy-tsql-examples
upvoted 3 times
AvithK
9 months, 2 weeks ago
I don't get why it doesn't start with F. The managed identity should be created first, right?
upvoted 2 times
Mazazino
7 months, 1 week ago
There's no mention of a sequence; the question is just about the right steps.
upvoted 2 times
MonemSnow
10 months, 3 weeks ago
A, C, F is the correct answer
upvoted 1 times
HOTSPOT -
You have an Azure Synapse Analytics dedicated SQL pool that contains the users shown in the following table.
User1 executes a query on the database, and the query returns the results shown in the following exhibit.
User1 is the only user who has access to the unmasked data.
Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic.
Hot Area:
Correct Answer:
Box 1: 0 -
The Default masking function: Full masking according to the data types of the designated fields
✑ Use a zero value for numeric data types (bigint, bit, decimal, int, money, numeric, smallint, smallmoney, tinyint, float, real).
Users with administrator privileges are always excluded from masking, and see the original data without any mask.
Reference:
https://docs.microsoft.com/en-us/azure/azure-sql/database/dynamic-data-masking-overview
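A sketch of the masking setup behind the result in the exhibit. The table and column names are assumptions; the default() behaviour and the UNMASK permission come from the reference above.
-- Apply the default mask to a money column; non-privileged readers then see 0
ALTER TABLE dbo.FactSales
ALTER COLUMN SalesAmount ADD MASKED WITH (FUNCTION = 'default()');
-- Granting UNMASK lets a user see the original data (users with administrator privileges are excluded from masking anyway)
GRANT UNMASK TO SomeUser;   -- hypothetical user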
hsetin
Highly Voted
8 months, 3 weeks ago
User1 is an admin, so he will see the value stored in the database.
1. 0
2. The value stored in the database
upvoted 48 times
azurearmy
7 months ago
2 is wrong
upvoted 2 times
rjile
Highly Voted
9 months ago
• Use a zero value for numeric data types (bigint, bit, decimal, int, money, numeric, smallint, smallmoney, tinyint, float, real).
• Use 01-01-1900 for date/time data types (date, datetime2, datetime, datetimeoffset, smalldatetime, time).
upvoted 13 times
berserksap
6 months, 4 weeks ago
The second question is about a query by User1, who is the admin.
upvoted 13 times
Dothy
Most Recent
2 weeks ago
1. 0
2. Value in database
upvoted 1 times
Egocentric
1 month, 1 week ago
on this question its just about paying attention to detail
upvoted 3 times
manan16
1 month, 2 weeks ago
How can User2 access the data if it is masked?
upvoted 1 times
manan16
1 month, 2 weeks ago
Can someone explain the first option? In the doc it says 0.
upvoted 1 times
Mahesh_mm
5 months ago
1. 0 (the default value for the money data type under the default masking function is returned when queried by User2)
Milan1988
7 months ago
CORRECT
upvoted 2 times
gssd4scoder
7 months ago
Agree with the answer, but I see a typo in the question: db_datereader should be db_datareader.
upvoted 3 times
Jiddu
7 months, 3 weeks ago
0 for money and 01-01-1900 for dates
https://docs.microsoft.com/en-us/sql/relational-databases/security/dynamic-data-masking?view=sql-server-ver15
upvoted 4 times
GervasioMontaNelas
8 months, 3 weeks ago
It's correct
upvoted 2 times
rjile
9 months ago
correct?
upvoted 2 times
Mazazino
7 months, 1 week ago
yes, it's correct
upvoted 2 times
Using PolyBase, you create an external table named [Ext].[Items] to query Parquet files stored in Azure Data Lake Storage Gen2 without importing
the data to the data warehouse.
You discover that the Parquet files have a fourth column named ItemID.
Which command should you run to add the ItemID column to the external table?
A.
B.
C.
D.
Correct Answer:
C
Incorrect Answers:
A, D: Only these Data Definition Language (DDL) statements are allowed on external tables:
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql
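Because external tables cannot be altered, the pattern behind answer C is to drop and recreate the table with the new column. The sketch below is illustrative only: the column list, location, data source, and file format names are assumptions, since the original exhibit is not reproduced here.
DROP EXTERNAL TABLE [Ext].[Items];
CREATE EXTERNAL TABLE [Ext].[Items]
(
    [ItemName] varchar(50) NULL,
    [ItemType] varchar(20) NULL,
    [ItemDescription] nvarchar(250) NULL,
    [ItemID] int NULL               -- the newly discovered fourth column
)
WITH
(
    LOCATION = '/Items/',           -- assumed folder in Azure Data Lake Storage Gen2
    DATA_SOURCE = MyDataLake,       -- assumed external data source
    FILE_FORMAT = ParquetFileFormat -- assumed Parquet external file format
);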
Chien_Nguyen_Van
Highly Voted
8 months, 3 weeks ago
C is correct
https://www.examtopics.com/discussions/microsoft/view/19469-exam-dp-200-topic-1-question-27-discussion/
upvoted 29 times
Ozren
Most Recent
2 months, 1 week ago
It's a good thing the details are shown here: "The external table has three columns." Yet the solution reveals the column details. This doesn't make any sense to me. If C is the correct answer (the only one that seems acceptable), then the question itself is flawed.
upvoted 2 times
PallaviPatel
3 months, 4 weeks ago
c is correct.
upvoted 1 times
hugoborda
8 months ago
Answer is correct
upvoted 2 times
HOTSPOT -
You have two Azure Storage accounts named Storage1 and Storage2. Each account holds one container and has the hierarchical namespace
enabled. The system has files that contain data stored in the Apache Parquet format.
You need to copy folders and files from Storage1 to Storage2 by using a Data Factory copy activity. The solution must meet the following
requirements:
How should you configure the copy activity? To answer, select the appropriate options in the answer area.
Hot Area:
Correct Answer:
Box 1: Parquet -
For Parquet datasets, the type property of the copy activity source must be set to ParquetSource.
Box 2: PreserveHierarchy -
PreserveHierarchy (default): Preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical
to the relative path of the target file to the target folder.
Incorrect Answers:
✑ FlattenHierarchy: All files from the source folder are in the first level of the target folder. The target files have autogenerated names.
✑ MergeFiles: Merges all files from the source folder to one file. If the file name is specified, the merged file name is the specified name.
Otherwise, it's an autogenerated file name.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/format-parquet https://docs.microsoft.com/en-us/azure/data-factory/connector-azure-
data-lake-storage
EddyRoboto
Highly Voted
9 months ago
This could be Binary as both source and sink, since there are no transformations on the files. I tend to believe Binary would be the correct answer.
upvoted 43 times
michalS
8 months, 3 weeks ago
I agree. If it's just copying then binary is fine and would probably be faster
upvoted 6 times
iooj
3 months, 1 week ago
Agree. I've checked it. With binary source and sink datasets it works.
upvoted 2 times
rav009
8 months ago
agree. When using a Binary dataset, the service does not parse file content but treats it as-is.
GameLift
7 months, 1 week ago
But the doc says "When using Binary dataset in copy activity, you can only copy from Binary dataset to Binary dataset." So I guess it's parquet
then?
upvoted 3 times
captainpike
7 months, 1 week ago
This note refers to the fact that, in the template, you have to specify "BinarySink" as the type for the target sink, and that's exactly what the Copy Data tool does (you can check this by editing the created copy pipeline and viewing the code). Choosing Binary and PreserveHierarchy copies all files as they are, perfectly.
upvoted 3 times
AbhiGola
Highly Voted
8 months, 3 weeks ago
The answer seems correct: the data is stored as Parquet already and the requirement is to do no transformation, so the answer is right.
upvoted 34 times
NintyFour
Most Recent
1 week, 2 days ago
As the question mentions, minimize the time required to perform the copy activity.
AzureRan
2 weeks ago
Is it binary or parquet?
upvoted 1 times
DingDongSingSong
2 months ago
So what is the answer to this question? Binary or Parquet? The file is a ParquetFile. If you're simply copying a file, you just need to define the right
source type (i.e. Parquet) in this instance. Why would you even consider Binary when the file isn't Binary type
upvoted 2 times
kamil_k
2 months, 1 week ago
I've just tested it in Azure, created two Gen2 storage accounts, used Binary as source and destination, placed two parquet files in account one.
Created pipeline in ADF, added copy data activity and then defined first binary as source with wildcard path (*.parquet) and the sink as binary, with
linked service for account 2, selected PreserveHierarchy. It worked.
upvoted 6 times
AnshulSuryawanshi
2 months, 3 weeks ago
When using Binary dataset in copy activity, you can only copy from Binary dataset to Binary dataset
upvoted 1 times
Sandip4u
4 months, 2 weeks ago
this should be binary
upvoted 1 times
VeroDon
4 months, 3 weeks ago
The type property of the dataset must be set to Parquet
https://docs.microsoft.com/en-us/azure/data-factory/format-parquet#parquet-as-source
upvoted 1 times
Mahesh_mm
5 months ago
I think it is Parquet, as when using a Binary dataset in a copy activity, you can only copy from a Binary dataset to a Binary dataset.
upvoted 2 times
Canary_2021
5 months, 1 week ago
If you only copy files from one storage account to another and don't need to read the data inside the files, Binary should be selected for better performance.
upvoted 5 times
m2shines
5 months, 1 week ago
Binary and Preserve Hierarchy should be the answer
upvoted 4 times
Lucky_me
5 months, 1 week ago
The answers are correct! Binary doesn't work; I just tried.
upvoted 5 times
kamil_k
2 months, 1 week ago
hmm what did you try? I literally created it the same way as described i.e. two gen2 storage accounts. I chose gen2 as source linked service with
binary as file type and the same for destination. In the copy data activity in ADF pipeline I specified preserve hierarchy and it worked as
expected.
upvoted 1 times
Ozzypoppe
5 months, 2 weeks ago
https://docs.microsoft.com/en-us/azure/data-factory/format-parquet#parquet-as-source
upvoted 2 times
medsimus
7 months, 2 weeks ago
The correct answer is Binary; I tested it.
upvoted 8 times
You have an Azure Data Lake Storage Gen2 container that contains 100 TB of data.
You need to ensure that the data in the container is available for read workloads in a secondary region if an outage occurs in the primary region.
The solution must minimize costs.
A.
geo-redundant storage (GRS)
B.
read-access geo-redundant storage (RA-GRS)
C.
zone-redundant storage (ZRS)
D.
locally-redundant storage (LRS)
Correct Answer:
B
Geo-redundant storage (with GRS or GZRS) replicates your data to another physical location in the secondary region to protect against regional
outages.
However, that data is available to be read only if the customer or Microsoft initiates a failover from the primary to secondary region. When you
enable read access to the secondary region, your data is available to be read at all times, including in a situation where the primary region
becomes unavailable.
Incorrect Answers:
A: While Geo-redundant storage (GRS) is cheaper than Read-Access Geo-Redundant Storage (RA-GRS), GRS does NOT initiate automatic
failover.
C, D: Locally redundant storage (LRS) and Zone-redundant storage (ZRS) provides redundancy within a single region.
Reference:
https://docs.microsoft.com/en-us/azure/storage/common/storage-redundancy
meetj
Highly Voted
9 months ago
B is right
Geo-redundant storage (with GRS or GZRS) replicates your data to another physical location in the secondary region to protect against regional
outages. However, that data is available to be read only if the customer or Microsoft initiates a failover from the primary to secondary region.
When you enable read access to the secondary region, your data is available to be read at all times, including in a situation where the primary
region becomes unavailable.
upvoted 53 times
dev2dev
4 months, 2 weeks ago
A looks like the correct answer. With RA-GRS the secondary is always available for reads, but that is not what the question asks for; more importantly, the question is about reducing cost, which favors GRS.
BK10
3 months, 1 week ago
It should be A for two reasons:
1. Minimize cost
Sasha_in_San_Francisco
Highly Voted
6 months, 3 weeks ago
In my opinion, the answer is A, and this is why.
The question states "...available for read workloads in a secondary region IF AN OUTAGE OCCURS in the primary...". Answer B (RA-GRS) is, per the Microsoft documentation, for when "...your data is available to be read AT ALL TIMES, including in a situation where the primary region becomes unavailable."
To me, the nature of the question is: what is the cheapest solution that allows failover for read workloads when there is an outage? Answer (A).
Common sense would point to A too, because that is probably the most common real-life use case.
upvoted 40 times
SabaJamal2010AtGmail
4 months, 4 weeks ago
It's not about common sense, but about the technology. With GRS, data remains durable even if an entire data center becomes unavailable or there is a widespread regional failure, but there would be downtime while a region is unavailable. Alternatively, you could implement read-access geo-redundant storage (RA-GRS), which provides read access to the data in the secondary location.
upvoted 2 times
prathamesh1996
Most Recent
6 days, 16 hours ago
A is correct: it minimizes cost, and the data only needs to be readable when the primary is unavailable.
upvoted 1 times
Andushi
4 weeks, 1 day ago
Selected Answer: A
A, because of the cost aspect
upvoted 2 times
muove
1 month, 1 week ago
A is correct because of cost: RA-GRS will cost $5,910.73, while GRS will cost $4,596.12
upvoted 2 times
Egocentric
1 month, 2 weeks ago
GRS is the correct answer; the key in the question is reducing costs
upvoted 1 times
Somesh512
1 month, 2 weeks ago
Selected Answer: A
To reduce cost, GRS should be the right option
upvoted 1 times
KosteK
1 month, 4 weeks ago
Selected Answer: A
GRS is cheaper
upvoted 1 times
praticewizards
2 months ago
Selected Answer: B
The explanation is right. The given answer is wrong
upvoted 1 times
DingDongSingSong
2 months ago
B is incorrect. The answer is A. GRS is cheaper than RA-GRS. GRS read access is available ONLY once primary region failover occurs (therefore lower
cost). The requirement is for read-access availability in secondary region at lower cost WHEN a failover occurs in primary. Therefore, A is the answer
upvoted 2 times
phdphd
2 months, 1 week ago
Selected Answer: A
Got this question on the exam. RA-GRS was not an option, so it should be A.
upvoted 10 times
vineet1234
2 months, 2 weeks ago
A is right. GRS means secondary is available ONLY when primary is down. And it is cheaper than RA-GRS (where secondary read access is always
available). The question sneaks in the word 'read workloads' just to confuse.
upvoted 2 times
Sgarima
2 months, 3 weeks ago
Selected Answer: B
B is correct.
Geo-redundant storage (with GRS or GZRS) replicates your data to another physical location in the secondary region to protect against regional
outages. However, that data is available to be read only if the customer or Microsoft initiates a failover from the primary to secondary region.
When you enable read access to the secondary region, your data is available to be read at all times, including in a situation where the primary
region becomes unavailable. For read access to the secondary region, enable read-access geo-redundant storage (RA-GRS) or read-access geo-
zone-redundant storage (RA-GZRS).
upvoted 1 times
NamitSehgal
3 months ago
A should be the answer, as we need the data in the read-only secondary only when something happens to the primary region, not always.
upvoted 2 times
MANESH_PAI
3 months, 1 week ago
Selected Answer: A
It is GRS because GRS is cheaper than RA-GRS
https://azure.microsoft.com/en-gb/pricing/details/storage/blobs/
upvoted 3 times
PallaviPatel
3 months, 4 weeks ago
Selected Answer: B
correct
upvoted 2 times
Tinaaaaaaa
4 months ago
Selected Answer: B
While Geo-redundant storage (GRS) is cheaper than Read-Access Geo-Redundant Storage (RA-GRS), GRS does NOT initiate automatic failover.
upvoted 2 times
You need to ensure that the data lake will remain available if a data center fails in the primary Azure region. The solution must minimize costs.
Which type of replication should you use for the storage account?
A.
geo-redundant storage (GRS)
B.
geo-zone-redundant storage (GZRS)
C.
locally-redundant storage (LRS)
D.
zone-redundant storage (ZRS)
Correct Answer:
D
Zone-redundant storage (ZRS) copies your data synchronously across three Azure availability zones in the primary region.
Incorrect Answers:
C: Locally redundant storage (LRS) copies your data synchronously three times within a single physical location in the primary region. LRS is the
least expensive replication option, but is not recommended for applications requiring high availability or durability
Reference:
https://docs.microsoft.com/en-us/azure/storage/common/storage-redundancy
JohnMasipa
Highly Voted
9 months ago
This can't be correct. Should be D.
upvoted 70 times
JayBird
9 months ago
Why, LRS is cheaper?
upvoted 1 times
Vitality
8 months, 2 weeks ago
It is cheaper but LRS helps to replicate data in the same data center while ZRS replicates data synchronously across three storage clusters in
one region. So if one data center fails you should go for ZRS.
upvoted 8 times
azurearmy
7 months ago
Also, note that the question talks about failure in "a data center". As long as other data centers are running fine(as in ZRS which will have
many), ZRS would be the least expensive option.
upvoted 6 times
MadEgg
Highly Voted
4 months, 3 weeks ago
Selected Answer: D
First, about the Question:
What fails? -> The (complete) DataCenter, not the region and not components inside a DataCenter.
LRS: "..copies your data synchronously three times within a single physical location in the primary region." Important is here the SINGLE PHYSICAL
LOCATION (meaning inside the same Data Center. So in our scenario all copies wouldn't work anymore.)
-> C is wrong.
ZRS: "...copies your data synchronously across three Azure availability zones in the primary region" (meaning, in different Data Centers. In our
scenario this would meet the requirements)
-> D is right
GRS/GZRS: are like LRS/ZRS but with the Data Centers in different azure regions. This works too but is more expensive than ZRS. So ZRS is the right
answer.
https://docs.microsoft.com/en-us/azure/storage/common/storage-redundancy
upvoted 30 times
Ozren
2 months, 1 week ago
Yes, well said, that's the correct answer.
upvoted 1 times
Narasimhap
3 months, 1 week ago
Well explained!
upvoted 1 times
DrTaz
4 months, 3 weeks ago
I agree.
olavrab8
Most Recent
2 weeks, 2 days ago
Selected Answer: D
D -> Data is replicated synchronously
upvoted 1 times
Egocentric
1 month, 1 week ago
D is correct
upvoted 2 times
ravi2931
1 month, 2 weeks ago
it should be D
upvoted 1 times
ravi2931
1 month, 2 weeks ago
see this explained clearly -
LRS is the lowest-cost redundancy option and offers the least durability compared to other options. LRS protects your data against server rack
and drive failures. However, if a disaster such as fire or flooding occurs within the data center, all replicas of a storage account using LRS may be
lost or unrecoverable. To mitigate this risk, Microsoft recommends using zone-redundant storage (ZRS), geo-redundant storage (GRS), or geo-
zone-redundant storage (GZRS)
upvoted 1 times
ASG1205
1 month, 2 weeks ago
Selected Answer: D
The answer should be D, as LRS won't be helpful in case of a whole data center failure.
upvoted 1 times
Andy91
2 months ago
Selected Answer: D
This is the correct answer indeed
upvoted 2 times
bhanuprasad9331
2 months, 4 weeks ago
Selected Answer: C
Answer is LRS.
LRS replicates data in a single AZ. An AZ can contain one or more data centers. So, even if one data center fails, data can be accessed through
other data centers in the same AZ.
https://docs.microsoft.com/en-us/azure/availability-zones/az-overview#availability-zones
https://docs.microsoft.com/en-us/azure/storage/common/storage-redundancy#redundancy-in-the-primary-region
upvoted 1 times
PallaviPatel
3 months, 4 weeks ago
Selected Answer: D
D is correct.
upvoted 3 times
vimalnits
4 months ago
Correct answer is D.
upvoted 2 times
Tinaaaaaaa
4 months ago
LRS helps to replicate data in the same data center while ZRS replicates data synchronously across three storage clusters in one region
upvoted 1 times
Shatheesh
4 months ago
D is the correct answer. The question clearly states that if a data center fails, the data should still be available. LRS stores everything in the same data center, so it's not the correct answer; the next cheapest option is ZRS.
upvoted 1 times
Jaws1990
4 months, 3 weeks ago
Selected Answer: D
Mentions data centre (Availability Zone) failure, not rack failure, so should be Zone Redundant Storage.
upvoted 3 times
DrTaz
4 months, 3 weeks ago
Selected Answer: D
VeroDon
4 months, 3 weeks ago
After reading all the comments I'll go with LRS. It doesn't mention a disaster. "LRS protects your data against server rack and drive failures"
https://docs.microsoft.com/en-us/azure/storage/common/storage-redundancy.
upvoted 1 times
ArunMonika
4 months, 4 weeks ago
I will go with D
upvoted 1 times
Mahesh_mm
5 months ago
Answer is D
upvoted 1 times
HOTSPOT -
You plan to load data from Azure Blob storage to a staging table. Approximately 1 million rows of data will be loaded daily. The table will be
truncated before each daily load.
You need to create the staging table. The solution must minimize how long it takes to load the data to the staging table.
How should you configure the table? To answer, select the appropriate options in the answer area.
Hot Area:
Correct Answer:
Box 1: Hash -
Hash-distributed tables improve query performance on large fact tables. They can have very large numbers of rows and still achieve high
performance.
Incorrect Answers:
When creating partitions on clustered columnstore tables, it is important to consider how many rows belong to each partition. For optimal
compression and performance of clustered columnstore tables, a minimum of 1 million rows per distribution and partition is needed.
Box 3: Date -
Table partitions enable you to divide your data into smaller groups of data. In most cases, table partitions are created on a date column.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-partition
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
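Reflecting the community consensus below (round-robin distribution, heap, no partitions), here is a minimal T-SQL sketch of such a staging table; the column list is hypothetical because the exhibit is not reproduced here.
CREATE TABLE dbo.StageDailyLoad
(
    EventDate  date          NOT NULL,
    EventType  varchar(50)   NULL,
    EventValue decimal(18,2) NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,  -- simplest distribution, fastest to load into
    HEAP                         -- no index maintenance during the bulk load
);
-- The table is truncated before each daily load, as the scenario states.
TRUNCATE TABLE dbo.StageDailyLoad;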
A1000
Highly Voted
9 months ago
Round-Robin
Heap
None
upvoted 156 times
Narasimhap
3 months, 1 week ago
Round- Robin
Heap
None.
No brainer for this question.
upvoted 3 times
anto69
4 months, 2 weeks ago
I agree too
upvoted 2 times
gssd4scoder
7 months ago
Agree 100%.
DrTaz
4 months, 3 weeks ago
Also agree 100%
upvoted 2 times
laszek
Highly Voted
8 months, 4 weeks ago
Round-robin - this is the simplest distribution model, not great for querying but fast to process
No partitions - this is a staging table, why add effort to partition, when truncated daily?
upvoted 29 times
Vardhan_Brahmanapally
6 months, 3 weeks ago
Can you explain to me why we should use a heap?
upvoted 1 times
DrTaz
4 months, 3 weeks ago
The term heap basically refers to a table without a clustered index. Adding a clustered index to a temp table makes absolutely no sense and
is a waste of compute resources for a table that would be entirely truncated daily.
SQLDev0000
3 months ago
DrTaz is right, in addition, when you populate an indexed table, you are also writing to the index, so this adds an additional overhead in
the write process
upvoted 2 times
berserksap
6 months, 4 weeks ago
I had doubts about why there is no need for a partition. While what you suggested is true, wouldn't it be better to have a date partition for truncating the table?
upvoted 1 times
andy_g
3 months, 1 week ago
There is no filter on a truncate statement so no benefit in having a partition
upvoted 1 times
SandipSingha
Most Recent
2 weeks, 3 days ago
Round-Robin
Heap
None
upvoted 1 times
Sandip4u
4 months, 2 weeks ago
Round-robin,heap,none
upvoted 2 times
Mahesh_mm
4 months, 4 weeks ago
Round-Robin
Heap
None
upvoted 2 times
ArunMonika
4 months, 4 weeks ago
Answer: Round-Robin (1), Heap (2), None (3).
upvoted 1 times
m2shines
5 months, 1 week ago
Round-robin, Heap and None
upvoted 1 times
Sasha_in_San_Francisco
6 months, 3 weeks ago
Answer: Round-Robin (1), Heap (2), None (3).
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-overview
#2. Search for: “A heap table can be especially useful for loading data, such as a staging table,…”
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-partition?context=/azure/synapse-
analytics/context/context
#3. Partitioning by date is useful when stage destination has data because you can hide the inserting data’s new partition (to keep users from
hitting it), complete the load and then unhide the new partition.
However, in this question it states, “the table will be truncated before each daily load”, so, it appears it’s a true Staging table and there are no users
with access, no existing data, and I see no reason to have a Date partition. To me, such a partition would do nothing but slow the load.
upvoted 12 times
Sasha_in_San_Francisco
6 months, 3 weeks ago
Answer: Round-Robin (1), Heap (2), None (3).
upvoted 1 times
Aslam208
6 months, 3 weeks ago
Round-Robin, Heap, None.
A polite request to the moderator: please verify these answers and correct them. For some people, wrong answers will be detrimental.
upvoted 6 times
Vardhan_Brahmanapally
7 months ago
Many of the answers provided on this website are incorrect
upvoted 5 times
dJeePe
3 weeks, 5 days ago
Did MS hack this site to make it give wrong answers ? ;-)
upvoted 1 times
itacshish
7 months, 1 week ago
Round-Robin
Heap
None
upvoted 2 times
HaliBrickclay
7 months, 1 week ago
as per Microsoft document
To achieve the fastest loading speed for moving data into a data warehouse table, load data into a staging table. Define the staging table as a heap
and use round-robin for the distribution option.
Consider that loading is usually a two-step process in which you first load to a staging table and then insert the data into a production data
warehouse table. If the production table uses a hash distribution, the total time to load and insert might be faster if you define the staging table
with the hash distribution. Loading to the staging table takes longer, but the second step of inserting the rows to the production table does not
incur data movement across the distributions.
upvoted 4 times
VeroDon
4 months, 3 weeks ago
It doesn't mention the production table, only the staging table. So Round-Robin/Heap is the answer, correct? Tricky questions.
:)
upvoted 1 times
estrelle2008
7 months, 2 weeks ago
Please correct the answers ExamTopics, as Microsoft itself recently published best practices on data loading in Synapse, and describes staging as
100% FAB answers is correct instead of ADF. https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/data-loading-best-practices
upvoted 2 times
RinkiiiiiV
7 months, 3 weeks ago
Round-Robin
Heap
None
upvoted 1 times
hugoborda
8 months ago
Round-Robin
Heap
None
upvoted 2 times
hsetin
8 months, 2 weeks ago
Why heap and not CCI?
upvoted 1 times
You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool. The table contains purchases from
suppliers for a retail store. FactPurchase will contain the following columns.
FactPurchase will have 1 million rows of data added daily and will contain three years of data.
SELECT -
FROM FactPurchase -
A.
replicated
B.
hash-distributed on PurchaseKey
C.
round-robin
D.
hash-distributed on IsOrderFinalized
Correct Answer:
B
✑ Has many unique values. The column can have duplicate values. All rows with the same value are assigned to the same distribution. Since
there are 60 distributions, some distributions can have > 1 unique values while others may end with zero values.
Incorrect Answers:
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
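As a sketch of what answer B implies, the table could be created as follows; the column list is abbreviated and the data types are assumptions, since the full exhibit is not shown.
CREATE TABLE dbo.FactPurchase
(
    PurchaseKey      bigint NOT NULL,
    DateKey          int    NOT NULL,
    SupplierKey      int    NOT NULL,
    StockItemKey     int    NOT NULL,
    IsOrderFinalized bit    NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(PurchaseKey),   -- high-cardinality column spreads rows evenly
    CLUSTERED COLUMNSTORE INDEX
);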
FredNo
Highly Voted
6 months ago
Selected Answer: B
Correct
upvoted 18 times
GameLift
Highly Voted
8 months, 2 weeks ago
Is it hash-distributed on PurchaseKey and not on IsOrderFinalized because IsOrderFinalized yields fewer distinct values (rows contain either yes or no) compared to PurchaseKey?
upvoted 7 times
Podavenna
8 months, 2 weeks ago
Yes, your logic is correct!
upvoted 4 times
Dothy
Most Recent
2 weeks ago
B Correct
upvoted 1 times
SandipSingha
2 weeks, 3 days ago
B Correct
upvoted 1 times
sarapaisley
1 month, 2 weeks ago
Selected Answer: B
B is correct
upvoted 1 times
Anshul2910
2 months, 2 weeks ago
Selected Answer: B
CORRECT
upvoted 1 times
Istiaque
3 months, 2 weeks ago
Selected Answer: B
A round-robin distributed table distributes table rows evenly across all distributions. The assignment of rows to distributions is random. Unlike
hash-distributed tables, rows with equal values are not guaranteed to be assigned to the same distribution.
As a result, the system sometimes needs to invoke a data movement operation to better organize your data before it can resolve a query. This extra
step can slow down your queries.
upvoted 2 times
PallaviPatel
3 months, 4 weeks ago
Selected Answer: C
The options do not have the correct key selected for hash distribution, and query performance will improve only if the correct distribution column is selected. Also, the question says 1 million rows, but how much those rows translate into actual GB of data is a question; the data types are mostly int, which isn't bulky. Hence I will go for round-robin instead of hash distribution.
upvoted 2 times
vineet1234
2 months, 2 weeks ago
Incorrect.. 1 million rows added per day. And the table has 3 years of data. So it's a large fact table. So Hash distributed. On purchase key (not
on IsOrderFinalized, as it's very low cardinality)
upvoted 2 times
Canary_2021
4 months, 3 weeks ago
Selected Answer: D
The hash field should be used in JOIN, GROUP BY, or HAVING. SupplierKey, StockItemKey, and IsOrderFinalized are GROUP BY fields. PurchaseKey doesn't exist in the query, so why select PurchaseKey as the hash key?
I select D. IsOrderFinalized may only provide two distributions, not as good as SupplierKey and StockItemKey, but at least it is a GROUP BY column.
upvoted 2 times
Canary_2021
4 months, 3 weeks ago
To balance the parallel processing, select a distribution column that:
Based on these descriptions, maybe B is the right answer. But PurchaseKey is not part of the query; will it still improve performance of this specific query?
upvoted 4 times
Mahesh_mm
4 months, 4 weeks ago
B is correct
upvoted 2 times
kahei
5 months, 2 weeks ago
Selected Answer: B
upvoted 1 times
alexleonvalencia
5 months, 2 weeks ago
Selected Answer: B
B is the correct answer.
upvoted 2 times
stuard
7 months, 2 weeks ago
Hash-distributed on PurchaseKey and round-robin are going to provide the same result for this query (in case PurchaseKey has an even distribution), as this specific query does not use PurchaseKey. However, round-robin is going to provide a slightly faster loading time.
upvoted 6 times
RinkiiiiiV
7 months, 3 weeks ago
Yes Agree..
upvoted 1 times
Gilvan
8 months, 2 weeks ago
Correct
upvoted 4 times
HOTSPOT -
From a website analytics system, you receive data extracts about user interactions such as downloads, link clicks, form submissions, and video
plays.
You need to design a star schema to support analytical queries of the data. The star schema will contain four tables including a date dimension.
To which table should you add each column? To answer, select the appropriate options in the answer area.
Hot Area:
Correct Answer:
Box 1: DimEvent -
Box 2: DimChannel -
Box 3: FactEvents -
Fact tables store observations or events, and can be sales orders, stock balances, exchange rates, temperatures, etc
Reference:
https://docs.microsoft.com/en-us/power-bi/guidance/star-schema
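To make the proposed star schema concrete, here is a rough T-SQL sketch assuming the extract columns discussed in the comments below (eventCategory, eventAction, eventLabel, channelGrouping); all table and column names are illustrative, not part of the exam exhibit.
-- Dimension for event attributes (answer box 1)
CREATE TABLE dbo.DimEvent   (EventKey int IDENTITY(1,1), EventCategory varchar(50), EventAction varchar(50), EventLabel varchar(200));
-- Dimension for the traffic channel (answer box 2)
CREATE TABLE dbo.DimChannel (ChannelKey int IDENTITY(1,1), ChannelGrouping varchar(50));
-- Date dimension required by the scenario
CREATE TABLE dbo.DimDate    (DateKey int NOT NULL, CalendarDate date NOT NULL);
-- Fact table holding the measurable events (answer box 3)
CREATE TABLE dbo.FactEvents (DateKey int, EventKey int, ChannelKey int, EventCount int);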
gssd4scoder
Highly Voted
7 months ago
It seems to be correct
upvoted 29 times
DingDongSingSong
Highly Voted
2 months ago
What is this question? It is poorly written. I couldn't even understand what's being asked here. It talks about 4 tables, yet the answer shows 3. Then,
the columns mentioned in the question don't match the column/attributes shown in the 3 tables noted in the answer.
upvoted 7 times
Dothy
Most Recent
2 weeks ago
EventCategory -> dimEvent
JJdeWit
2 weeks, 4 days ago
EventCategory ==> dimEvent
Explanation:
A bit of knowledge of Google Analytics Universal helps to understand this question. eventCategory, eventAction and eventLabel all contain information about the event/action done on the website, and can logically be grouped together. ChannelGrouping is about how the user came to the website (through Google, an advertisement, an email link, etc.) and is not related to events at all. It therefore would make sense to put it in a second dim table.
upvoted 1 times
Mahesh_mm
4 months, 4 weeks ago
Answer is correct
upvoted 3 times
laszek
8 months, 4 weeks ago
I would add ChannelGrouping to the DimEvent table. What would the DimChannel table contain? Only one column? Makes no sense to me.
upvoted 3 times
manquak
8 months, 3 weeks ago
It is supposed to contain 4 tables: Date, Event, Fact, so the logical conclusion would be to include the channel dimension. If it were up to me, though, I'd use the channel as a degenerate dimension and store it in the fact table if it's the only information that we have provided.
upvoted 3 times
Seansmyrke
2 months, 3 weeks ago
I mean, if you think about it: ChannelName (facebook, google, youtube), ChannelType (paid media, free posts, ads), ChannelDelivery (chrome, etc. etc.). Just thinking out loud.
upvoted 1 times
Question #22 Topic 1
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain
description data that has an average length of 1.1 MB.
You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics.
You need to prepare the files to ensure that the data copies quickly.
A.
Yes
B.
No
Correct Answer:
A
All file formats have different performance characteristics. For the fastest load, use compressed delimited text files.
Reference:
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/guidance-for-loading-data
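The guidance above amounts to staging the data as compressed delimited text and bulk loading it. The COPY-statement sketch below is only one possible loading path (the question does not name one); the table name, storage path, and the omission of an explicit credential are assumptions for illustration.
COPY INTO dbo.StageDescriptions
FROM 'https://mystorageaccount.blob.core.windows.net/data/descriptions/*.csv.gz'
WITH
(
    FILE_TYPE = 'CSV',
    COMPRESSION = 'GZIP',     -- compressed delimited text gives the fastest load
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '0x0A'
);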
Fahd92
Highly Voted
7 months, 3 weeks ago
They said you need to prepare the files to copy; maybe they mean we should make the rows less than 1 MB? If so, it will be A; otherwise it would be B.
upvoted 12 times
ANath
4 months, 1 week ago
The answer should be A.
https://azure.microsoft.com/en-gb/blog/increasing-polybase-row-width-limitation-in-azure-sql-data-warehouse/
upvoted 1 times
Thij
7 months, 3 weeks ago
After reading the other questions on this topic, I go with A because the relevant part seems to be the compression.
upvoted 4 times
Muishkin
Most Recent
4 weeks, 1 day ago
A text file seems to be too simple an answer, however true as per the Microsoft link. I was thinking of Parquet/Avro files.
upvoted 1 times
Massy
2 months, 2 weeks ago
Selected Answer: B
From the question: "75% of the rows contain description data that has an average length of 1.1 MB". You can't:
From the documentation: "When you put data into the text files in Azure Blob storage or Azure Data Lake Store, they must have fewer than 1,000,000 bytes of data."
So 75% of the rows aren't suitable for delimited text files... why do you say the answer is yes?
upvoted 3 times
kamil_k
2 months, 2 weeks ago
I initially thought so too, however isn't this limit only relevant to PolyBase copy? It is not mentioned which method is used to transfer the data
so you could fit more than 1mb into a column in the table if you want to, you just have to use something else e.g. COPY command.
upvoted 2 times
PallaviPatel
3 months, 4 weeks ago
Selected Answer: A
correct answer.
upvoted 2 times
Mahesh_mm
4 months, 4 weeks ago
A is correct
upvoted 1 times
alexleonvalencia
5 months, 2 weeks ago
Selected Answer: A
Correcto
upvoted 2 times
rashjan
5 months, 2 weeks ago
Selected Answer: A
correct because of compression
upvoted 1 times
Odoxtoom
7 months, 1 week ago
Consider these three solutions as one question set:
Solution | Yes | No |
compressed | O | O |
columnstore | O | O |
> 1 MB | O | O |
HaliBrickclay
7 months, 1 week ago
As per Microsoft
PolyBase loads are limited to rows smaller than 1 MB. It cannot be used to load to VARCHAR(MAX), NVARCHAR(MAX), or VARBINARY(MAX). For
more information, see Azure Synapse Analytics service capacity limits.
When your source data has rows greater than 1 MB, you might want to vertically split the source tables into several small ones. Make sure that the
largest size of each row doesn't exceed the limit. The smaller tables can then be loaded by using PolyBase and merged together in Azure Synapse
Analytics.
upvoted 2 times
jamesraju
7 months, 2 weeks ago
The answer should be 'Yes'.
All file formats have different performance characteristics. For the fastest load, use compressed delimited text files. The difference between UTF-8
and UTF-16 performance is minimal.
upvoted 1 times
RinkiiiiiV
7 months, 3 weeks ago
correct Answer is B
upvoted 1 times
gk765
8 months ago
The correct answer is B. There is a limit of 1 MB when it comes to the row length. Hence you have to modify the files to ensure the row size is less than 1 MB.
upvoted 3 times
kolakone
8 months, 1 week ago
Answer is correct
upvoted 1 times
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that
might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain
description data that has an average length of 1.1 MB.
You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics.
You need to prepare the files to ensure that the data copies quickly.
Solution: You copy the files to a table that has a columnstore index.
A.
Yes
B.
No
Correct Answer:
B
Reference:
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/guidance-for-loading-data
Odoxtoom
Highly Voted
7 months, 1 week ago
Consider these three solutions as one question set:
Solution | Yes | No |
compressed | O | O |
columnstore | O | O |
> 1 MB | O | O |
Julius7000
7 months ago
Can you explain this in more detail?
upvoted 10 times
helly13
5 months, 2 weeks ago
I really didn't understand this, can you explain?
upvoted 5 times
Amsterliese
Most Recent
1 month, 2 weeks ago
Columnstore index would be used for faster reading, but the question is only about faster loading. So for faster loading you want the least possible
overhead. So the answer should be no. Am I right?
upvoted 2 times
Muishkin
4 weeks, 1 day ago
Yes, load to a table without indexes for a faster load, right?
upvoted 1 times
lionurag
2 months, 3 weeks ago
Selected Answer: B
B is correct
upvoted 2 times
bhanuprasad9331
3 months, 1 week ago
From the documentation, loads to heap table are faster than indexed tables. So, better to use heap table than columnstore index table in this case.
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-index#heap-tables
upvoted 4 times
PallaviPatel
3 months, 4 weeks ago
Selected Answer: B
B is correct.
upvoted 1 times
DE_Sanjay
4 months, 1 week ago
NO is the answer.
upvoted 1 times
Mahesh_mm
4 months, 4 weeks ago
B is correct
upvoted 1 times
rashjan
5 months, 2 weeks ago
Selected Answer: B
Correct Answer: No.
upvoted 2 times
sachabess79
7 months, 3 weeks ago
No. The index will increase the insertion time.
upvoted 3 times
michalS
8 months, 3 weeks ago
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/guidance-for-loading-data. "For the fastest load, use compressed
delimited text files."
upvoted 1 times
umeshkd05
8 months, 2 weeks ago
But the row size also needs to be < 1 MB.
Answer: No
upvoted 4 times
Julius7000
7 months ago
In other words, I think that 100 GB is much too much for the columnstore index memory-wise. The documentation is unclear in the context of this particular question, but I think the answer is No, as the given solution is the wrong idea anyway.
upvoted 1 times
Julius7000
7 months ago
Not row size; the number of rows (per rowgroup) has a maximum of 1,048,576.
"When there is memory pressure, the columnstore index might not be able to achieve maximum compression rates. This affects query performance."
upvoted 1 times
gk765
8 months ago
Correct answer should be NO
upvoted 2 times
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that
might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain
description data that has an average length of 1.1 MB.
You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics.
You need to prepare the files to ensure that the data copies quickly.
Solution: You modify the files to ensure that each row is more than 1 MB.
A.
Yes
B.
No
Correct Answer:
B
Reference:
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/guidance-for-loading-data
Gilvan
Highly Voted
8 months, 1 week ago
No, rows need to be less than 1 MB. A batch size between 100 K and 1 M rows is the recommended baseline for determining optimal batch size capacity.
upvoted 7 times
PallaviPatel
Most Recent
3 months, 4 weeks ago
Selected Answer: B
B is correct.
upvoted 4 times
amarG1996
4 months, 3 weeks ago
PolyBase can't load rows that have more than 1,000,000 bytes of data. When you put data into the text files in Azure Blob storage or Azure Data
Lake Store, they must have fewer than 1,000,000 bytes of data. This byte limitation is true regardless of the table schema.
upvoted 2 times
kamil_k
2 months, 2 weeks ago
is it stated anywhere that we have to use PolyBase? What about COPY command?
upvoted 1 times
amarG1996
4 months, 3 weeks ago
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/data-loading-best-practices#prepare-data-in-azure-storage
upvoted 1 times
Mahesh_mm
4 months, 4 weeks ago
Answer is No
upvoted 1 times
rashjan
5 months, 2 weeks ago
Selected Answer: B
Correct Answer: No.
upvoted 2 times
Odoxtoom
7 months, 1 week ago
Consider these three solutions as one question set:
Solution | Yes | No |
compressed | O | O |
columnstore | O | O |
> 1 MB | O | O |
Aslam208
6 months ago
@Odoxtoom, can you please explain your answer and specify, based on this matrix, which option is correct?
upvoted 3 times
Bishtu
5 months ago
Yes
No
No
upvoted 2 times
You build a data warehouse in an Azure Synapse Analytics dedicated SQL pool.
Analysts write a complex SELECT query that contains multiple JOIN and CASE statements to transform data for use in inventory reports. The
inventory reports will use the data and additional WHERE parameters depending on the report. The reports will be produced once daily.
You need to implement a solution to make the dataset available for the reports. The solution must minimize query times.
A.
an ordered clustered columnstore index
B.
a materialized view
C.
result set caching
D.
a replicated table
Correct Answer:
B
Materialized views for dedicated SQL pools in Azure Synapse provide a low maintenance method for complex analytical queries to get fast
performance without any query change.
Incorrect Answers:
C: One daily execution does not make use of result set caching.
Note: When result set caching is enabled, dedicated SQL pool automatically caches query results in the user database for repetitive use. This
allows subsequent query executions to get results directly from the persisted cache so recomputation is not needed. Result set caching
improves query performance and reduces compute resource usage. In addition, queries using cached results set do not use any concurrency
slots and thus do not count against existing concurrency limits.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/performance-tuning-materialized-views
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/performance-tuning-result-set-caching
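A minimal sketch of the materialized-view approach follows; the fact/dimension tables, columns, and aggregation are hypothetical stand-ins for the analysts' real join/CASE query, which is not shown.
CREATE MATERIALIZED VIEW dbo.mvInventorySummary
WITH (DISTRIBUTION = HASH(ProductKey))
AS
SELECT
    p.ProductKey,
    w.WarehouseKey,
    COUNT_BIG(*)    AS RowCnt,
    SUM(i.Quantity) AS TotalQuantity
FROM dbo.FactInventory AS i
JOIN dbo.DimProduct    AS p ON i.ProductKey   = p.ProductKey
JOIN dbo.DimWarehouse  AS w ON i.WarehouseKey = w.WarehouseKey
GROUP BY p.ProductKey, w.WarehouseKey;
-- Daily reports can then apply their own WHERE filters; results come from the
-- pre-computed view instead of re-running the joins for every report.
SELECT ProductKey, TotalQuantity
FROM dbo.mvInventorySummary
WHERE WarehouseKey = 42;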
ANath
Highly Voted
3 months, 2 weeks ago
B is correct.
These two features in dedicated SQL pool are used for query performance tuning. Result set caching is used for getting high concurrency and fast
response from repetitive queries against static data.
To use the cached result, the form of the cache requesting query must match with the query that produced the cache. In addition, the cached result
must apply to the entire query.
Materialized views allow data changes in the base tables. Data in materialized views can be applied to a piece of a query. This support allows the
same materialized views to be used by different queries that share some computation for faster performance.
upvoted 6 times
SandipSingha
Most Recent
2 weeks, 3 days ago
B materialized view
upvoted 1 times
Egocentric
1 month, 1 week ago
B is correct without a doubt
upvoted 1 times
DingDongSingSong
2 months ago
Why isn't the answer "A" when the query may have additional WHERE parameters depending on the report? That means the query isn't static and will change depending on the report. A clustered columnstore index would provide better query performance in the case of a complex query where the query isn't static.
upvoted 1 times
PallaviPatel
3 months, 4 weeks ago
Selected Answer: B
B correct.
upvoted 1 times
VeroDon
4 months, 3 weeks ago
Selected Answer: B
Correct
upvoted 3 times
Mahesh_mm
4 months, 4 weeks ago
B is correct
upvoted 1 times
bad_atitude
5 months, 1 week ago
B materialized view
upvoted 2 times
Canary_2021
5 months, 1 week ago
Selected Answer: B
B is the correct answer.
A materialized view is a database object that contains the results of a query. A materialized view is not simply a window on the base table. It is
actually a separate object holding data in itself. So querying data against a materialized view with different filters should be quick.
https://techdifferences.com/difference-between-view-and-materialized-view.html
upvoted 4 times
alexleonvalencia
5 months, 2 weeks ago
The correct answer is B, a materialized view.
upvoted 4 times
You have an Azure Synapse Analytics workspace named WS1 that contains an Apache Spark pool named Pool1.
You need to ensure that when tables are created in DB1, the tables are available automatically as external tables to the built-in serverless SQL
pool.
A.
CSV
B.
ORC
C.
JSON
D.
Parquet
Correct Answer:
D
Serverless SQL pool can automatically synchronize metadata from Apache Spark. A serverless SQL pool database will be created for each
database existing in serverless Apache Spark pools.
For each Spark external table based on Parquet or CSV and located in Azure Storage, an external table is created in a serverless SQL pool
database.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-storage-files-spark-tables
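As a quick illustration of this metadata sync (all names are hypothetical), a Parquet-backed table created from the Spark pool appears shortly afterwards as a table in the serverless SQL pool's replica of the database.
-- Run in the Spark pool (Spark SQL), assuming the Spark database is named DB1:
CREATE TABLE DB1.salesparquet (SaleID int, Amount decimal(10,2)) USING Parquet;
-- Shortly afterwards, query it from the built-in serverless SQL pool (T-SQL):
SELECT TOP 10 * FROM DB1.dbo.salesparquet;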
KevinSames
Highly Voted
5 months, 1 week ago
Both A and D are correct
"For each Spark external table based on Parquet or CSV and located in Azure Storage, an external table is created in a serverless SQL pool
database. As such, you can shut down your Spark pools and still query Spark external tables from serverless SQL pool."
upvoted 15 times
RehanRajput
Most Recent
3 days, 18 hours ago
Both A and D
upvoted 1 times
RehanRajput
3 days, 18 hours ago
https://docs.microsoft.com/en-us/azure/synapse-analytics/metadata/database
upvoted 1 times
MatiCiri
4 weeks ago
Selected Answer: D
Looks correct to me
upvoted 1 times
AhmedDaffaie
2 months, 2 weeks ago
JSON is also supported by Serverless SQL Pool but it is kinda complicated. Why is it not selected?
upvoted 2 times
Ajitk27
3 months ago
Selected Answer: D
Looks correct to me
upvoted 1 times
VijayMore
3 months ago
Selected Answer: D
Correct
upvoted 1 times
PallaviPatel
3 months, 4 weeks ago
Selected Answer: D
Both A and D are correct, as CSV and Parquet are both correct answers.
upvoted 1 times
Mahesh_mm
4 months, 4 weeks ago
Parquet and CSV are correct
upvoted 4 times
Nifl91
5 months, 2 weeks ago
I think A and D are both correct answers.
upvoted 3 times
alexleonvalencia
5 months, 2 weeks ago
The correct answer is Parquet.
upvoted 1 times
You are planning a solution to aggregate streaming data that originates in Apache Kafka and is output to Azure Data Lake Storage Gen2. The
developers who will implement the stream processing solution use Java.
Which service should you recommend using to process the streaming data?
A.
Azure Event Hubs
B.
Azure Data Factory
C.
Azure Stream Analytics
D.
Azure Databricks
Correct Answer:
D
The following tables summarize the key differences in capabilities for stream processing technologies in Azure.
General capabilities -
Integration capabilities -
Reference:
https://docs.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/stream-processing
Nifl91
Highly Voted
5 months, 2 weeks ago
Correct!
upvoted 13 times
NewTuanAnh
Most Recent
1 month, 2 weeks ago
why not C: Azure Stream Analytics?
upvoted 1 times
Muishkin
4 weeks, 1 day ago
Yes Azure stream Analytics for streaming data?
upvoted 1 times
NewTuanAnh
1 month, 2 weeks ago
I see, Azure Stream Analytics does not support Java
upvoted 1 times
sdokmak
2 days, 22 hours ago
or databricks
upvoted 1 times
sdokmak
2 days, 22 hours ago
kafka*
upvoted 1 times
PallaviPatel
3 months, 4 weeks ago
Selected Answer: D
correct.
upvoted 1 times
Mahesh_mm
4 months, 4 weeks ago
Answer is correct
upvoted 3 times
alexleonvalencia
5 months, 2 weeks ago
The correct answer is Azure Databricks.
upvoted 4 times
You plan to implement an Azure Data Lake Storage Gen2 container that will contain CSV files. The size of the files will vary based on the number
of events that occur per hour.
You need to ensure that the files stored in the container are optimized for batch processing.
A.
Convert the files to JSON
B.
Convert the files to Avro
C.
Compress the files
D.
Merge the files
Correct Answer:
B
Note: Avro is a framework developed within Apache's Hadoop project. It is a row-based storage format which is widely used for serialization. Avro stores its schema in JSON format, making it easy to read and interpret by any program. The data itself is stored in binary format, making it compact and efficient.
Reference:
https://www.adaltas.com/en/2020/07/23/benchmark-study-of-different-file-format/
VeroDon
Highly Voted
4 months, 3 weeks ago
You cannot merge the files if you don't know how many files exist in ADLS Gen2. In this case, you could easily create a file larger than 100 GB in size and decrease performance, so B is the correct answer: convert to Avro.
upvoted 25 times
Massy
3 weeks, 5 days ago
I can understand why you say not merge, but why avro?
upvoted 2 times
Canary_2021
Highly Voted
4 months, 3 weeks ago
Selected Answer: D
If you store your data as many small files, this can negatively affect performance. In general, organize your data into larger sized files for better
performance (256 MB to 100 GB in size).
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices#optimize-for-data-ingest
upvoted 7 times
SAYAK7
Most Recent
4 days, 15 hours ago
Selected Answer: D
Batch processing can support JSON or Avro; you should input one file by merging them all.
upvoted 1 times
sdokmak
1 day, 9 hours ago
They're CSV so you're saying answer is A
upvoted 1 times
sdokmak
1 day, 9 hours ago
B*, AVRO is faster than JSON
upvoted 1 times
RehanRajput
1 month ago
Selected Answer: D
You need to make sure that the files in the container are optimized for BATCH PROCESSING. In the case of batch processing it makes sense to merge files so as to reduce the amount of I/O listing operations.
Karthikj18
1 month, 3 weeks ago
Conversion adds extra load, so it's not a good idea to convert to Avro; merging is easier.
upvoted 1 times
SebK
2 months ago
Selected Answer: D
merge files
upvoted 2 times
adfgasd
3 months, 3 weeks ago
This question makes me very confused.
It says the file size depends on the number of events per hour, so I guess there is a file generated every hour. In the worst case, we have 5 GB * 24 h, which is greater than 100 GB...
But why is Avro a good choice??
upvoted 6 times
PallaviPatel
3 months, 4 weeks ago
Selected Answer: D
merge files is correct.
upvoted 3 times
vincetita
4 months ago
Selected Answer: D
Small-sized files will hurt performance. Optimal file size: 256MB to 100GB
upvoted 1 times
Tomi1234
4 months, 3 weeks ago
Selected Answer: D
In my opinion, for better batch processing, files should be no bigger than 100 GB but as big as possible.
upvoted 7 times
VeroDon
4 months, 3 weeks ago
One example of batch processing is transforming a large set of flat, semi-structured CSV or JSON files into a schematized and structured format
that is ready for further querying. Typically the data is converted from the raw formats used for ingestion (such as CSV) into binary formats that are
more performant for querying because they store data in a columnar format, and often provide indexes and inline statistics about the data.
https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/batch-processing
upvoted 1 times
edba
4 months, 4 weeks ago
I think it shall be D as well. Please check the link below. https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-
practices#file-size
upvoted 3 times
Mahesh_mm
4 months, 4 weeks ago
B is correct
upvoted 2 times
SabaJamal2010AtGmail
4 months, 3 weeks ago
Consider using the Avro file format in cases where your I/O patterns are more write heavy, or the query patterns favor retrieving multiple rows
of records in their entirety. For example, the Avro format works well with a message bus such as Event Hub or Kafka that write multiple
events/messages in succession.
upvoted 1 times
didixuecoding
5 months ago
Correct Answer should be D: Merge the files
upvoted 2 times
corebit
5 months ago
Please explain why it is D.
upvoted 1 times
alexleonvalencia
5 months, 2 weeks ago
The correct answer is Avro
upvoted 1 times
HOTSPOT -
You store files in an Azure Data Lake Storage Gen2 container. The container has the storage policy shown in the following exhibit.
Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic.
Hot Area:
Correct Answer:
Box 1: moved to the Cool access tier -
The ManagementPolicyBaseBlob.TierToCool property gets or sets the function to tier blobs to cool storage. It supports blobs currently in the Hot tier.
Box 2: container1/contoso.csv -
As defined by prefixMatch.
prefixMatch: An array of strings for prefixes to be matched. Each rule can define up to 10 case-sensitive prefixes. A prefix string must start with
a container name.
Reference:
https://docs.microsoft.com/en-us/dotnet/api/microsoft.azure.management.storage.fluent.models.managementpolicybaseblob.tiertocool
bad_atitude
Highly Voted
5 months, 1 week ago
correct
upvoted 17 times
adfgasd
5 months, 1 week ago
why the .csv?
upvoted 1 times
Lewistrick
5 months ago
It matches anything that starts with "container1/contoso" and the csv in the answer is the only one that matches.
upvoted 9 times
alexleonvalencia
Highly Voted
5 months, 2 weeks ago
Answer: Cool tier & container1/contoso.csv
upvoted 5 times
AJ01
Most Recent
4 months, 2 weeks ago
shouldn't the question be greater than 60 days?
upvoted 2 times
stunner85_
3 months, 3 weeks ago
The files get deleted after 60 days but after 30 days they are moved to the cool storage.
upvoted 3 times
Mahesh_mm
4 months, 4 weeks ago
correct
upvoted 3 times
You are designing a financial transactions table in an Azure Synapse Analytics dedicated SQL pool. The table will have a clustered columnstore
index and will include the following columns:
✑ Transactions analysis will typically summarize transactions by transaction type, customer segment, and/or account type
You need to recommend a partition strategy for the table to minimize query times.
A.
CustomerSegment
B.
AccountType
C.
TransactionType
D.
TransactionMonth
Correct Answer:
D
For optimal compression and performance of clustered columnstore tables, a minimum of 1 million rows per distribution and partition is
needed. Before partitions are created, dedicated SQL pool already divides each table into 60 distributed databases.
Example: Any partitioning added to a table is in addition to the distributions created behind the scenes. Using this example, if the sales fact
table contained 36 monthly partitions, and given that a dedicated SQL pool has 60 distributions, then the sales fact table should contain 60
million rows per month, or 2.1 billion rows when all months are populated. If a table contains fewer than the recommended minimum number of
rows per partition, consider using fewer partitions in order to increase the number of rows per partition.
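To illustrate the recommended strategy, here is a minimal sketch of a monthly-partitioned clustered columnstore fact table in a dedicated SQL pool. The table name, column list, and boundary values are assumptions for illustration only, since the full column list in the question is not shown.
CREATE TABLE dbo.FactFinancialTransactions
(
    TransactionKey     BIGINT        NOT NULL,
    TransactionTypeKey INT           NOT NULL,
    CustomerSegmentKey INT           NOT NULL,
    AccountTypeKey     INT           NOT NULL,
    TransactionMonth   INT           NOT NULL,  -- e.g. 202201 for January 2022
    Amount             DECIMAL(19,4) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(TransactionKey),
    CLUSTERED COLUMNSTORE INDEX,
    -- Partition on the column used in the monthly WHERE filter, not on the aggregation columns.
    PARTITION ( TransactionMonth RANGE RIGHT FOR VALUES (202201, 202202, 202203, 202204, 202205, 202206) )
);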
Lewistrick
Highly Voted
5 months ago
Anyone else think this is a very badly explained situation?
upvoted 12 times
Canary_2021
Highly Voted
5 months, 1 week ago
Selected Answer: D
Select D because analysts will most commonly analyze transactions for a given month.
upvoted 9 times
Youdaoud
Most Recent
1 month, 3 weeks ago
Selected Answer: D
correct answer D
upvoted 1 times
PallaviPatel
3 months, 4 weeks ago
Selected Answer: D
correct.
upvoted 3 times
SabaJamal2010AtGmail
4 months, 3 weeks ago
D is correct because "Transactions analysis will typically summarize transactions by transaction type, customer segment, and/or account type",
implying it's part of the WHERE clause. When choosing a distribution column, the aim is to ensure it is not used in the WHERE clause.
upvoted 6 times
ploer
3 months, 3 weeks ago
D is correct, but those columns will be used for aggregate functions. The TransactionMonth column will be used in the where-clause: "analysts will
most commonly analyze transactions for a given month", so the given month must be in the where clause. Partitioning on the where clause
column significantly reduces the amount of data to be processed which leads to increased performance. Do not confuse with distribution
column on hash partitioned tables. Using TransactionMonth column as distribution column here would be a really bad idea because all queried
data would be on one single node.
upvoted 15 times
Mahesh_mm
4 months, 4 weeks ago
D is correct
upvoted 2 times
Sonnie01
5 months, 2 weeks ago
Selected Answer: D
correct
upvoted 5 times
edba
4 months, 4 weeks ago
check this as well for explanation. https://www.linkedin.com/pulse/partitioning-distribution-azure-synapse-analytics-swapnil-mule
upvoted 2 times
gf2tw
5 months, 2 weeks ago
Agree with D, should not be confused with Distribution column for Hash-distributed tables.
upvoted 5 times
HOTSPOT -
You have an Azure Data Lake Storage Gen2 account named account1 that stores logs as shown in the following table.
You do not expect that the logs will be accessed during the retention periods.
You need to recommend a solution for account1 that meets the following requirements:
What should you include in the recommendation? To answer, select the appropriate options in the answer area.
Hot Area:
Correct Answer:
Box 1: Store the infrastructure logs in the Cool access tier and the application logs in the Archive access tier
For infrastructure logs: Cool tier - An online tier optimized for storing data that is infrequently accessed or modified. Data in the cool tier should
be stored for a minimum of 30 days. The cool tier has lower storage costs and higher access costs compared to the hot tier.
For application logs: Archive tier - An offline tier optimized for storing data that is rarely accessed, and that has flexible latency requirements, on
the order of hours.
Data in the archive tier should be stored for a minimum of 180 days.
Blob storage lifecycle management offers a rule-based policy that you can use to transition your data to the desired access tier when your
specified conditions are met. You can also use lifecycle management to expire data at the end of its life.
Reference:
https://docs.microsoft.com/en-us/azure/storage/blobs/access-tiers-overview
gf2tw
Highly Voted
5 months, 2 weeks ago
"Data must remain in the Archive tier for at least 180 days or be subject to an early deletion charge. For example, if a blob is moved to the Archive
tier and then deleted or moved to the Hot tier after 45 days, you'll be charged an early deletion fee equivalent to 135 (180 minus 45) days of
storing that blob in the Archive tier." <- from the sourced link.
This explains why we have to use two different access tiers rather than both as archive.
upvoted 26 times
Backy
Most Recent
2 weeks ago
The question says "You do not expect that the logs will be accessed during the retention periods" - so there is no reason to keep any of them as
Cool, so the correct answer should be to put them both in Archive
upvoted 1 times
sdokmak
2 days, 21 hours ago
Yeah, but because the infrastructure logs are kept <180 days before deletion, there is a considerable early-deletion fee if they are in archive, so it is not the cheapest
option.
upvoted 2 times
Muishkin
4 weeks, 1 day ago
But the question says 360 days and 60 days for the two logs, whereas the archive tier could store only up to 180 days. Also, the cool tier has a lower storage
cost per hour compared to the archive tier. So shouldn't the answer be the cool tier for both?
upvoted 1 times
Mahesh_mm
4 months, 4 weeks ago
Answers are correct
upvoted 2 times
ANath
5 months ago
The answers are correct.
Data must remain in the Archive tier for at least 180 days or be subject to an early deletion charge. For example, if a blob is moved to the Archive
tier and then deleted or moved to the Hot tier after 45 days, you'll be charged an early deletion fee equivalent to 135 (180 minus 45) days of
storing that blob in the Archive tier.
A blob in the Cool tier in a general-purpose v2 accounts is subject to an early deletion penalty if it is deleted or moved to a different tier before 30
days has elapsed. This charge is prorated. For example, if a blob is moved to the Cool tier and then deleted after 21 days, you'll be charged an early
deletion fee equivalent to 9 (30 minus 21) days of storing that blob in the Cool tier.
https://docs.microsoft.com/en-us/azure/storage/blobs/access-tiers-overview
upvoted 4 times
You plan to ingest streaming social media data by using Azure Stream Analytics. The data will be stored in files in Azure Data Lake Storage, and
then consumed by using Azure Databricks and PolyBase in Azure Synapse Analytics.
You need to recommend a Stream Analytics data output format to ensure that the queries from Databricks and PolyBase against the files
encounter the fewest possible errors. The solution must ensure that the files can be queried quickly and that the data type information is retained.
A.
JSON
B.
Parquet
C.
CSV
D.
Avro
Correct Answer:
B
Parquet stores data in a columnar format, retains data type information, and is supported by both Databricks and PolyBase, whereas Avro is not a supported PolyBase file format.
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-file-format-transact-sql
hrastogi7
Highly Voted
5 months, 1 week ago
Parquet can be retrieved quickly and maintains metadata within itself. Hence Parquet is the correct answer.
upvoted 10 times
Muishkin
Most Recent
4 weeks, 1 day ago
Isn't JSON good for batch processing/streaming?
upvoted 1 times
RehanRajput
3 days, 18 hours ago
Indeed. However, we also want to query the data using PolyBase. Polybase doesn't support Avro.
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/load-data-overview#polybase-external-file-formats
upvoted 2 times
AhmedDaffaie
1 month, 4 weeks ago
I am confused!
Avro has a self-describing schema and is good for quick loading (patching), so why Parquet?
upvoted 3 times
Boompiee
2 weeks, 1 day ago
Apparently, the deciding factor is the fact that PolyBase doesn't support AVRO, but it does support Parquet.
upvoted 3 times
PallaviPatel
3 months, 4 weeks ago
Selected Answer: B
correct.
upvoted 1 times
EmmettBrown
4 months ago
Selected Answer: B
Parquet is the correct answer
upvoted 1 times
alexleonvalencia
5 months, 2 weeks ago
The correct answer is Parquet.
upvoted 1 times
You have an Azure Synapse Analytics dedicated SQL pool named Pool1. Pool1 contains a partitioned fact table named dbo.Sales and a staging
table named stg.Sales that has the matching table and partition definitions.
You need to overwrite the content of the first partition in dbo.Sales with the content of the same partition in stg.Sales. The solution must minimize
load times.
A.
Insert the data from stg.Sales into dbo.Sales.
B.
Switch the first partition from dbo.Sales to stg.Sales.
C.
Switch the first partition from stg.Sales to dbo.Sales.
D.
Update dbo.Sales from stg.Sales.
Correct Answer:
B
A way to eliminate rollbacks is to use Metadata Only operations like partition switching for data management. For example, rather than execute
a DELETE statement to delete all rows in a table where the order_date was in October of 2001, you could partition your data monthly. Then you
can switch out the partition with data for an empty partition from another table.
Note: Syntax:
✑ Reassigns all data in one partition of a partitioned table to an existing non-partitioned table.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-dedicated-sql-pool
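As a minimal sketch of the switch itself (assuming, as the question states, that dbo.Sales and stg.Sales share matching column and partition definitions; the TRUNCATE_TARGET option shown is an assumption about how the target partition is overwritten):
-- Overwrite the first partition of dbo.Sales with the first partition of stg.Sales.
-- This is a metadata-only operation, so it minimizes load time.
ALTER TABLE stg.Sales SWITCH PARTITION 1 TO dbo.Sales PARTITION 1
WITH (TRUNCATE_TARGET = ON);  -- discards the existing rows in the target partition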
Aslam208
Highly Voted
5 months, 2 weeks ago
Selected Answer: C
The correct answer is C
upvoted 31 times
Nifl91
Highly Voted
5 months, 2 weeks ago
This must be C, since the need is to overwrite dbo.Sales with the content of stg.Sales.
SAYAK7
Most Recent
4 days, 4 hours ago
Selected Answer: C
Because we have to change dbo.Sales.
upvoted 1 times
kknczny
1 month ago
Selected Answer: B
As the partition in stg.Sales is the one we will be using to overwrite the first partition in dbo.Sales, shouldn't it be understood as "B. Switch the first
partition from dbo.Sales to stg.Sales."?
kanak01
3 months, 2 weeks ago
Seriously... who puts fact table data into a dimension table?!
upvoted 1 times
rockyc05
3 months ago
It's fact to stage table actually, according to the answer provided.
upvoted 1 times
PallaviPatel
3 months, 4 weeks ago
Selected Answer: C
C is correct as partition switching works from source to target.
upvoted 2 times
dev2dev
4 months, 2 weeks ago
Either way works.
upvoted 1 times
Sandip4u
4 months, 2 weeks ago
stg.Sales is a temp table which does not have any partitions, so C cannot be correct.
upvoted 2 times
ABExams
3 months, 1 week ago
It literally states it has the same partition definition.
upvoted 2 times
alex623
4 months, 2 weeks ago
The correct answer is C, because target table is dbo.sales
upvoted 1 times
Rickie85
4 months, 2 weeks ago
Selected Answer: C
C correct
upvoted 1 times
Jaws1990
4 months, 2 weeks ago
Selected Answer: C
B is the wrong way round.
upvoted 3 times
VeroDon
4 months, 3 weeks ago
Selected Answer: C
https://medium.com/@cocci.g/switch-partitions-in-azure-synapse-sql-dw-1e0e32309872
upvoted 4 times
Mahesh_mm
4 months, 4 weeks ago
C is correct answer
upvoted 1 times
ArunMonika
4 months, 4 weeks ago
I will go with C
upvoted 1 times
gitoxam686
5 months ago
Selected Answer: C
C is correct answer because we have to overwrite.
upvoted 2 times
adfgasd
5 months ago
Selected Answer: C
C for sure
upvoted 2 times
Will_KaiZuo
5 months, 1 week ago
Selected Answer: C
agree with C
upvoted 1 times
You are designing a slowly changing dimension (SCD) for supplier data in an Azure Synapse Analytics dedicated SQL pool.
Which three additional columns should you add to the data to create a Type 2 SCD? Each correct answer presents part of the solution.
A.
surrogate primary key
B.
effective start date
C.
business key
D.
last modified date
E.
effective end date
F.
foreign key
Correct Answer:
BCE
C: The Slowly Changing Dimension transformation requires at least one business key column.
BE: Historical attribute changes create new records instead of updating existing ones. The only change that is permitted in an existing record is
an update to a column that indicates whether the record is current or expired. This kind of change is equivalent to a Type 2 change. The Slowly
Changing Dimension transformation directs these rows to two outputs: Historical Attribute Inserts Output and New Output.
Reference:
https://docs.microsoft.com/en-us/sql/integration-services/data-flow/transformations/slowly-changing-dimension-transformation
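As a minimal sketch of what the supplier dimension could look like with the additional columns discussed in this thread (surrogate key, effective start date, effective end date); the table and column names are assumptions, and SupplierSystemID stands in for the business key already present in the source data:
CREATE TABLE dbo.DimSupplier
(
    SupplierKey        INT IDENTITY(1,1) NOT NULL,  -- surrogate primary key (new value per version)
    SupplierSystemID   INT               NOT NULL,  -- business key carried over from the source system
    SupplierName       NVARCHAR(200)     NOT NULL,
    EffectiveStartDate DATE              NOT NULL,  -- version validity start
    EffectiveEndDate   DATE              NULL       -- NULL (or a far-future date) marks the current version
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
);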
ItHYMeRIsh
Highly Voted
5 months, 2 weeks ago
Selected Answer: ABE
The answer is ABE. A type 2 SCD requires a surrogate key to uniquely identify each record when versioning.
See https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-
between-dimension-types under SCD Type 2 “ the dimension table must use a surrogate key to provide a unique reference to a version of the
dimension member.”
A business key is already part of this table - SupplierSystemID. The column is derived from the source data.
upvoted 35 times
CHOPIN
Highly Voted
4 months, 2 weeks ago
Selected Answer: BCE
WHAT ARE YOU GUYS TALKING ABOUT??? You are really misleading other people!!! No issue with the provided answer. Should be BCE!!!
https://docs.microsoft.com/en-us/sql/integration-services/data-flow/transformations/slowly-changing-dimension-transformation?view=sql-server-
ver15
"The Slowly Changing Dimension transformation requires at least one business key column."
dev2dev
4 months, 2 weeks ago
Search for "business keys" on that page, and make sure you wear specs :D
upvoted 2 times
assU2
4 months, 2 weeks ago
Yes, because SupplierSystemID is unique. But Microsoft questions are terribly misleading here. People think that SupplierSystemID is a business
key because of "Supplier" in its name. Also, there are some really poor and insufficient examples on Learn. See https://docs.microsoft.com/en-
us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types
upvoted 1 times
Mad_001
3 months ago
I don't understand.
1) What, in your opinion, should then be the business key? Can you explain, please?
2) SupplierSystemID is unique in the source system. Is there a requirement that the column also be unique in the data warehouse? If not,
there is the possibility to use it as the business key. Am I wrong?
upvoted 1 times
Onobhas01
1 month, 2 weeks ago
No, you're not wrong; the unique identifier from the ERP system is the business key.
upvoted 1 times
shrikantK
Most Recent
6 days, 20 hours ago
ABE seems correct. Why the business key is not needed has already been discussed. Why not a foreign key? One reason: foreign key constraints are not
supported in a dedicated SQL pool.
upvoted 1 times
gabdu
3 weeks, 2 days ago
Why is there no mention of a current flag?
upvoted 2 times
necktru
4 weeks ago
Selected Answer: ABE
SupplierSystemID is unique in the ERP; that does not mean it must be unique in our DW. That's why we need a surrogate primary key. Without one, a Type 2
SCD can't be implemented.
upvoted 1 times
Andushi
4 weeks ago
Selected Answer: ABE
https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-
dimension-types
upvoted 1 times
ladywhiteadder
1 month, 4 weeks ago
Selected Answer: ABE
ABE - a very clear answer if you know what a Type 2 SCD is. You will need a new surrogate key. The business key is already there - it's
SupplierSystemID - and will stay the same over time, i.e., it will not be unique when anything changes, as we will insert a new row then.
upvoted 4 times
kilowd
3 months, 3 weeks ago
Selected Answer: ABE
Type 2
In order to support type 2 changes, we need to add four columns to our table:
· Surrogate Key – the original ID will no longer be sufficient to identify the specific record we require, we therefore need to create a new ID that the
fact records can join to specifically.
· Current Flag – A quick method of returning only the current version of each record
· Start Date – The date from which the specific historical version is active
· End Date – The date to which the specific historical version record is active
https://adatis.co.uk/introduction-to-slowly-changing-dimensions-scd-types/
upvoted 4 times
KashRaynardMorse
1 month ago
Good answer! It's worth talking about the business key, since that is the controversial bit.
There needs to be something that functions as a business key, in this case it can be the SupplierSystemID.
The Current Flag is not strictly needed, the solution would function okay without it, but I would include it in real life anyway for performance
and ease of querying (it's also not shown as an option).
upvoted 2 times
KashRaynardMorse
1 month ago
And to add, the question is what "additional" columns are needed. So, emphasising: although a business key is definitely needed, the column
that serves its purpose is already present (albeit with a different column name), so it does not need adding again.
upvoted 1 times
stunner85_
3 months, 3 weeks ago
Here's why the answer is Surrogate Key and not Business key:
In a temporal database, it is necessary to distinguish between the surrogate key and the business key. Every row would have both a business key
and a surrogate key. The surrogate key identifies one unique row in the database, the business key identifies one unique entity of the modeled
world.
upvoted 2 times
m0rty
3 months, 4 weeks ago
Selected Answer: ABE
Correct.
upvoted 1 times
PallaviPatel
3 months, 4 weeks ago
Selected Answer: ABE
these are correct answers.
upvoted 1 times
Hervedoux
4 months, 1 week ago
Totally agree with Chopin. SCD Type 2 tables require at least a business key column and a start and end date to capture historical data, thus BCE is
the correct answer.
https://docs.microsoft.com/en-us/sql/integration-services/data-flow/transformations/slowly-changing-dimension-transformation
upvoted 1 times
Mahesh_mm
4 months, 4 weeks ago
ABE is correct
upvoted 2 times
gitoxam686
5 months ago
Selected Answer: ABE
A B E.... Surrogate Key s required.
upvoted 3 times
KevinSames
5 months ago
Selected Answer: ABE
ez ezez
upvoted 1 times
m2shines
5 months, 1 week ago
A, B and E
upvoted 1 times
Nifl91
5 months, 2 weeks ago
shouldn't it be ABE? we already have a business key! we need a surrogate to use as a primary key when a supplier with updated attributes is to be
inserted into the table
upvoted 1 times
assU2
4 months, 2 weeks ago
It's not a business key; it's unique. And a business key may not be unique because it's a Type 2 SCD. You can have multiple rows for one entity with
different start/end dates.
upvoted 2 times
necktru
4 weeks ago
It's unique in the ERP; in the DW it can be duplicated. That's why we need the surrogate PK, which must be unique. The answers are ABE.
upvoted 1 times
HOTSPOT -
You have a Microsoft SQL Server database that uses a third normal form schema.
You plan to migrate the data in the database to a star schema in an Azure Synapse Analytics dedicated SQL pool.
You need to design the dimension tables. The solution must optimize read operations.
What should you include in the solution? To answer, select the appropriate options in the answer area.
Hot Area:
Correct Answer:
Denormalization is the process of transforming higher normal forms to lower normal forms via storing the join of higher normal form relations
as a base relation.
Denormalization increases the performance in data retrieval at cost of bringing update anomalies to a database.
The collapsing relations strategy can be used in this step to collapse classification entities into component entities to obtain flat dimension
tables with single-part keys that connect directly to the fact table. The single-part key is a surrogate key generated to ensure it remains unique
over time.
Example:
Note: A surrogate key on a table is a column with a unique identifier for each row. The key is not generated from the table data. Data modelers
like to create surrogate keys on their tables when they design data warehouse models. You can use the IDENTITY property to achieve this goal
simply and effectively without affecting load performance.
Reference:
https://www.mssqltips.com/sqlservertip/5614/explore-the-role-of-normal-forms-in-dimensional-modeling/ https://docs.microsoft.com/en-
us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-identity
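For example, a minimal sketch of a denormalized dimension table that collapses several 3NF source tables into one flat table keyed by an IDENTITY surrogate key (the names and attributes are assumptions for illustration):
CREATE TABLE dbo.DimCustomer
(
    CustomerKey     INT IDENTITY(1,1) NOT NULL,  -- single-part surrogate key for the star schema
    CustomerID      INT               NOT NULL,  -- business key from the source system
    CustomerName    NVARCHAR(100)     NOT NULL,
    -- Attributes collapsed (denormalized) from related 3NF tables:
    City            NVARCHAR(50)      NULL,
    Country         NVARCHAR(50)      NULL,
    CustomerSegment NVARCHAR(50)      NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,        -- dimension tables are typically small enough to replicate
    CLUSTERED COLUMNSTORE INDEX
);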
JimZhang4123
1 week, 2 days ago
'The solution must optimize read operations.' means denormalization
upvoted 1 times
PallaviPatel
3 months, 4 weeks ago
Answer correct.
upvoted 3 times
Mahesh_mm
4 months, 4 weeks ago
Answers are correct
upvoted 1 times
PallaviPatel
5 months ago
answer is correct
upvoted 4 times
moreinva43
5 months, 2 weeks ago
While denormalizing does require implementing a lower level of normalization, the second normal form ONLY applies when a table has a
composite primary key. https://www.geeksforgeeks.org/second-normal-form-2nf/
upvoted 1 times
HOTSPOT -
You plan to develop a dataset named Purchases by using Azure Databricks. Purchases will contain the following columns:
✑ ProductID
✑ ItemPrice
✑ LineTotal
✑ Quantity
✑ StoreID
✑ Minute
✑ Month
✑ Hour
✑ Year
✑ Day
You need to store the data to support hourly incremental load pipelines that will vary for each Store ID. The solution must minimize storage costs.
How should you complete the code? To answer, select the appropriate options in the answer area.
Hot Area:
Correct Answer:
Box 1: partitionBy -
Example:
df.write.partitionBy("y","m","d")
.mode(SaveMode.Append)
.parquet("/data/hive/warehouse/db_name.db/" + tableName)
Box 3: parquet("/Purchases")
Reference:
https://intellipaat.com/community/11744/how-to-partition-and-write-dataframe-in-spark-without-deleting-partitions-with-no-new-data
sparkchu
1 month, 3 weeks ago
The answer should be saveAsTable; the format is defined by the format() method.
upvoted 1 times
assU2
4 months, 2 weeks ago
Can anyone explain why it's Partitioning and not Bucketing pls?
upvoted 2 times
KashRaynardMorse
1 month ago
Bucketing feature (part of data skipping index) was removed and microsoft recommends using DeltaLake, which uses the partition syntax.
https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/dataskipping-index
upvoted 1 times
bhanuprasad9331
3 months, 1 week ago
There should be a different folder for each store. Partitioning will create a separate folder for each StoreID. In bucketing, multiple stores having the
same hash value can be present in the same file, so multiple StoreIDs can be present in a single file.
upvoted 3 times
assU2
4 months, 2 weeks ago
Is it a question of correct syntax (numBuckets is the number of buckets to save) or is it something else?
upvoted 1 times
Mahesh_mm
4 months, 4 weeks ago
Answers are correct
upvoted 4 times
Aslam208
5 months, 2 weeks ago
correct
upvoted 4 times
You are designing a partition strategy for a fact table in an Azure Synapse Analytics dedicated SQL pool. The table has the following
specifications:
✑ Contain 2.4 billion records for the years 2019 and 2020.
Which number of partition ranges provides optimal compression and performance for the clustered columnstore index?
A.
40
B.
240
C.
400
D.
2,400
Correct Answer:
A
Each partition should have around 1 million records. Dedicated SQL pools already have 60 distributions.
Note: Having too many partitions can reduce the effectiveness of clustered columnstore indexes if each partition has fewer than 1 million rows.
Dedicated SQL pools automatically partition your data into 60 databases. So, if you create a table with 100 partitions, the result will be 6000
partitions.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-dedicated-sql-pool
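Worked out: 2,400,000,000 rows ÷ (60 distributions × 1,000,000 rows per partition per distribution) ≈ 40 partitions, which is why answer A fits the 1-million-row guideline.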
Aslam208
Highly Voted
5 months, 2 weeks ago
correct
upvoted 12 times
sdokmak
Most Recent
1 day, 9 hours ago
Selected Answer: A
quick maths
upvoted 1 times
MS_Nikhil
3 weeks, 6 days ago
Selected Answer: A
A is correct
upvoted 1 times
Egocentric
1 month, 2 weeks ago
correct
upvoted 1 times
Twom
2 months, 1 week ago
Selected Answer: A
Correct
upvoted 2 times
jskibick
3 months ago
Selected Answer: A
I am also confused.
So we have 2,400,000,000 rows that are already split across the 60 distributions of SQL DW. That makes 40 million rows per distribution.
Next, we know the partitions will be divided into CCI segments of ~1 million rows each. And here is my problem: CCI will auto-split data in
partitions into 1-million-row segments, so we do not have to do it on our own with partitions. I would split the data into monthly partitions, i.e. 24 for 2 years,
Justin_beswick
3 months, 2 weeks ago
Selected Answer: C
The Rule is Partitions= Records/(1 million * 60)
24,000,000,000/60,000,000 = 400
upvoted 4 times
helpaws
3 months ago
it's 2.4 billion, not 24 billion
upvoted 9 times
AlvaroEPMorais
3 months, 1 week ago
The Rule is Partitions= Records/(1 million * 60)
2,400,000,000/60,000,000 = 40
upvoted 8 times
dev2dev
4 months, 2 weeks ago
Are you suggesting creating 40 partitions on ProductId? This is confusing. If you create 40 partitions, the SQL pool will create 40*60 partitions, which is
2,400. And the documentation says to create fewer partitions. If we want to create partitions by year, then we can create 2 partitions for two years, which
internally creates 2*60 = 120 partitions, but 2 extra partitions for the outer boundaries will make it 4*60 = 240. So 240 partitions for 2.4 billion rows is
ideal. But what is confusing me is that we create only 4 partitions, which is not even in the options.
upvoted 2 times
Canary_2021
4 months, 2 weeks ago
A distributed table appears as a single table, but the rows are actually stored across 60 distributions.
Muishkin
4 weeks ago
So then how do we calculate the number of partitions? Isn't it user-driven?
upvoted 1 times
Canary_2021
4 months, 2 weeks ago
If you partition your data, each partition will need to have 1 million rows to benefit from a clustered columnstore index. For a table with 100
partitions, it needs to have at least 6 billion rows to benefit from a clustered columnstore (60 distributions × 100 partitions × 1 million rows).
upvoted 1 times
Mahesh_mm
4 months, 4 weeks ago
correct
upvoted 1 times
HOTSPOT -
You are creating dimensions for a data warehouse in an Azure Synapse Analytics dedicated SQL pool.
You create a table by using the Transact-SQL statement shown in the following exhibit.
Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic.
Hot Area:
Correct Answer:
Box 1: Type 2 -
A Type 2 SCD supports versioning of dimension members. Often the source system doesn't store versions, so the data warehouse load process
detects and manages changes in a dimension table. In this case, the dimension table must use a surrogate key to provide a unique reference to
a version of the dimension member. It also includes columns that define the date range validity of the version (for example, StartDate and
EndDate) and possibly a flag column (for example, IsCurrent) to easily filter by current dimension members.
Incorrect Answers:
A Type 1 SCD always reflects the latest values, and when changes in source data are detected, the dimension table data is overwritten.
A business key or natural key is an index which identifies uniqueness of a row based on columns that exist naturally in a table according to
business rules. For example business keys are customer code in a customer table, composite of sales order header number and sales order
item line number within a sales order details table.
Reference:
https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-
dimension-types
nkav
Highly Voted
1 year ago
product key is a surrogate key as it is an identity column
upvoted 98 times
111222333
1 year ago
Agree on the surrogate key, exactly.
"In data warehousing, IDENTITY functionality is particularly important as it makes easier the creation of surrogate keys."
Why ProductKey is certainly not a business key: "The IDENTITY value in Synapse is not guaranteed to be unique if the user explicitly inserts a
duplicate value with 'SET IDENTITY_INSERT ON' or reseeds IDENTITY". Business key is an index which identifies uniqueness of a row and here
Microsoft says that identity doesn't guarantee uniqueness.
References:
https://azure.microsoft.com/en-us/blog/identity-now-available-with-azure-sql-data-warehouse/
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-identity
upvoted 8 times
rikku33
8 months ago
Type 2
In order to support type 2 changes, we need to add four columns to our table:
· Surrogate Key – the original ID will no longer be sufficient to identify the specific record we require, we therefore need to create a new ID
that the fact records can join to specifically.
· Current Flag – A quick method of returning only the current version of each record
· Start Date – The date from which the specific historical version is active
· End Date – The date to which the specific historical version record is active
With these elements in place, our table will now look like:
upvoted 4 times
sagga
Highly Voted
1 year ago
Type2 because there are start and end columns and ProductKey is a surrogate key. ProductNumber seems a business key.
upvoted 29 times
DrC
12 months ago
The start and end columns are for when the product was being sold, not for metadata purposes. That makes it:
Type 1 – No History
Update the record directly; there is no record of historical values, only the current state.
upvoted 40 times
Kyle1
8 months, 1 week ago
When the product is not being sold anymore, it becomes a historical record. Hence Type 2.
upvoted 2 times
rockyc05
3 months ago
It is type 1 not 2
upvoted 1 times
Yuri1101
5 months, 1 week ago
With type 2, you normally don't update any column of a row other than row start date and end date.
upvoted 1 times
captainbee
11 months, 3 weeks ago
Exactly how I saw it
upvoted 1 times
SandipSingha
Most Recent
2 weeks, 3 days ago
product key is definitely a surrogate key
upvoted 1 times
dazero
3 weeks ago
Definitely Type 1. There are no columns to indicate the different versions of the same business key. The sell start and end date columns are actual
source columns from when the product was sold. The Insert and Update columns are audit columns that explain when a record was inserted for the
first time and when it was updated. So the insert date remains the same, but the updated column is updated every time a Type 1 update occurs.
upvoted 1 times
AlCubeHead
2 months ago
Product Key is surrogate NOT business key as it's a derived IDENTITY
Dimension is type 1 as it does not have a StartDate and EndDate associated with data changes and also does not have an IsCurrent flag
upvoted 5 times
Shrek66
3 months, 3 weeks ago
Agree with ploer
SCD Type 1
Surrogate
upvoted 4 times
ploer
3 months, 3 weeks ago
Surrogate key and Type 1 SCD. Type 1 SCD because sellenddate and sellstartdate are attributes of the product and not for versioning; rowupdated
and rowinserted are used for the SCD. And, as the naming indicates, the fact that both have a NOT NULL constraint leads to the conclusion that we have
no way to store which row is the current one. So it must be SCD Type 1.
upvoted 3 times
skkdp203
3 months, 3 weeks ago
SCD Type 1
Surrogate
https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-
dimension-types
It is critical that the primary key’s value of a dimension table remains unchanged. And it is highly recommended that all dimension tables use
surrogate keys as primary keys.
Surrogate keys are key generated and managed inside the data warehouse rather than keys extracted from data source systems.
upvoted 3 times
dev2dev
3 months, 3 weeks ago
An identity/surrogate key can be a business key in transactional tables, but in a DW it can be used only as a surrogate key.
upvoted 1 times
assU2
4 months, 2 weeks ago
Maybe it's type 2 because the logic is: we can have multiple rows with one productID, different surrogate keys and different start/end sale dates.
assU2
4 months, 2 weeks ago
Where are these answers from? Why are there so many mistakes? ProductKey is obviously a surrogate key.
upvoted 1 times
alex623
4 months, 2 weeks ago
It seems like Type 1: there is no flag to indicate whether the record is the current record; also, the date column is just a modified date.
upvoted 1 times
Boompiee
2 weeks, 1 day ago
The flag is commonly used, but not required.
upvoted 1 times
Mahesh_mm
4 months, 4 weeks ago
Type 2 and surrogate key
upvoted 2 times
Ayan3B
5 months, 2 weeks ago
When the table was created, the rowinsertdatetime and rowupdatedatetime attributes were kept along with an ETL identifier attribute, so no previous-
version data would be kept. So Type 1 is the answer. Type 2 keeps the older version information at row level along with a start date and end date, and Type
3 keeps a column-level, restricted old version of the data.
The second answer would be surrogate key, as the product key is generated with IDENTITY.
upvoted 4 times
satyamkishoresingh
5 months, 3 weeks ago
What is type 0 ?
upvoted 1 times
DrTaz
4 months, 3 weeks ago
SCD type 0 is a constant value that never changes.
upvoted 1 times
jay5518
6 months ago
This was on test today
upvoted 1 times
ohana
7 months ago
Took the exam today, this question came out.
You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool. The table contains purchases from
suppliers for a retail store. FactPurchase will contain the following columns.
FactPurchase will have 1 million rows of data added daily and will contain three years of data.
SELECT -
FROM FactPurchase -
A.
replicated
B.
hash-distributed on PurchaseKey
C.
round-robin
D.
hash-distributed on DateKey
Correct Answer:
B
Hash-distributed tables improve query performance on large fact tables, and are the focus of this article. Round-robin tables are useful for
improving loading speed.
Incorrect:
Not D: Do not use a date column. All data for the same date lands in the same distribution. If several users are all filtering on the same date,
then only 1 of the 60 distributions do all the processing work.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
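A minimal sketch of the recommended distribution (the column list is an assumption, since the exhibit with FactPurchase's full definition is not reproduced here):
CREATE TABLE dbo.FactPurchase
(
    PurchaseKey       BIGINT        NOT NULL,
    DateKey           INT           NOT NULL,
    SupplierKey       INT           NOT NULL,
    StockItemKey      INT           NOT NULL,
    TotalExcludingTax DECIMAL(19,4) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(PurchaseKey),  -- many unique values, spreads rows evenly, and is not a date column
    CLUSTERED COLUMNSTORE INDEX
);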
AugustineUba
Highly Voted
9 months, 3 weeks ago
From the documentation the answer is clear enough. B is the right answer.
When choosing a distribution column, select a distribution column that: "Is not a date column. All data for the same date lands in the same
distribution. If several users are all filtering on the same date, then only 1 of the 60 distributions do all the processing work."
upvoted 33 times
YipingRuan
7 months, 1 week ago
To minimize data movement, select a distribution column that:
YipingRuan
7 months, 1 week ago
Consider using the round-robin distribution for your table in the following scenarios:
If the table does not share a common join key with other tables
waterbender19
Highly Voted
9 months, 3 weeks ago
I think the answer should be D for that specific query. If you look at the datatypes, DateKey is an INT datatype not a DATE datatype.
upvoted 13 times
kamil_k
2 months, 1 week ago
n.b. if we look at the example query itself the date range is 31 days so we will use 31 distributions out of 60, and only process ~31 million
records
upvoted 1 times
waterbender19
9 months, 3 weeks ago
And the statement that the fact table will have 1 million rows added daily means that each DateKey value has an equal number of rows associated
with that value.
upvoted 5 times
Lucky_me
4 months, 2 weeks ago
But the DateKey is used in the WHERE clause.
upvoted 2 times
kamil_k
2 months, 1 week ago
I agree, date key is int, and besides, even if it was a date, when you query a couple days then 1 million rows per distribution is not that
much. So what if you are going to use only a couple distributions to do the job? Isn't it still faster than using all distributions to process all
of the records to get the required date range?
upvoted 1 times
AnandEMani
8 months, 3 weeks ago
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute — this link says date field,
NOT a date data type. B is correct.
upvoted 3 times
Ramkrish39
Most Recent
2 months, 1 week ago
Agree B is the right answer
upvoted 1 times
PallaviPatel
3 months, 4 weeks ago
Selected Answer: C
I will go with round robin.
''Consider using the round-robin distribution for your table in the following scenarios:
If the table does not share a common join key with other tables
yovi
4 months, 4 weeks ago
Anyone, when you finish an exam, do they give you the correct answers in the end?
upvoted 1 times
dev2dev
4 months, 2 weeks ago
Those who finished the exam will not know the answers, because the answers are not revealed.
upvoted 1 times
Mahesh_mm
4 months, 4 weeks ago
B is correct ans
upvoted 1 times
danish456
5 months ago
Selected Answer: B
It's correct
upvoted 1 times
trietnv
5 months ago
Selected Answer: B
1. Choose hash distribution because "joining a round-robin table usually requires reshuffling the rows, which is a performance hit".
refer:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
and
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
upvoted 2 times
Aslam208
5 months, 2 weeks ago
Selected Answer: B
B is correct
upvoted 1 times
Hervedoux
6 months ago
Selected Answer: B
It's clearly a hash on the PurchaseKey column.
upvoted 3 times
ohana
7 months ago
Took the exam today, this question came out.
Ans: B
upvoted 5 times
Marcus1612
8 months, 2 weeks ago
To optimize MPP, data has to be distributed evenly. DateKey is not a good candidate because all rows for a given day land in a single distribution.
In practice, if many users query the fact table to retrieve the data about the week before, only 7 nodes will process the queries instead of 60.
According to the Microsoft documentation: "To balance the parallel processing, select a distribution column that ... is not a date column. All data for
the same date lands in the same distribution. If several users are all filtering on the same date, then only 1 of the 60 distributions do all the
processing work."
upvoted 4 times
Marcus1612
8 months, 2 weeks ago
The correct answer is B.
upvoted 2 times
andimohr
10 months ago
The reference given in the answer is precise: Choose a distribution column with data that a) distributes evenly b) has many unique values c) does
not have NULLs or few NULLs and d) IS NOT A DATE COLUMN... definitely the best choice for the Hash distribution is on the Identity column.
upvoted 4 times
noone_a
10 months, 3 weeks ago
Although it's a fact table, replicated is the correct distribution in this case.
Each row is 141 bytes in size × 1,000,000 records ≈ 135 MB total size.
We have no further information regarding table growth, so this answer is based only on the info provided.
upvoted 1 times
noone_a
10 months, 3 weeks ago
Edit: this is incorrect, as it will have 1 million records added daily for 3 years, putting it over 2 GB.
upvoted 4 times
vlad888
11 months ago
Yes - do not use a date column - there is such a recommendation in the Synapse docs. But here we have a range search - potentially several nodes will be
used.
upvoted 1 times
vlad888
11 months ago
Actually it is clear that it should be hash distributed. BUT the product key brings no benefit for this query - it doesn't participate in it at all. So - DateKey.
Although it is unusual for Synapse.
upvoted 4 times
savin
11 months, 1 week ago
I don't think there is enough information to decide this. Also we can not decide it by just looking at one query. Only considering this query and if
we assume no other dimensions are connected to this fact table, good answer would be D.
upvoted 2 times
Data files will be produced by using Azure Data Factory and stored in Azure Data Lake Storage Gen2. The files will be consumed by an Azure
Synapse Analytics serverless SQL pool.
You need to minimize storage costs for the solution. What should you recommend?
A.
Use Snappy compression for files.
B.
Use OPENROWSET to query the Parquet files.
C.
Create an external table that contains a subset of columns from the Parquet files.
D.
Store all data as string in the Parquet files.
Correct Answer:
C
An external table points to data located in Hadoop, Azure Storage blob, or Azure Data Lake Storage. External tables are used to read data from
files or write data to files in Azure Storage. With Synapse SQL, you can use external tables to read external data using dedicated SQL pool or
serverless SQL pool.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables
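For reference, a minimal sketch of querying the Parquet files directly from a serverless SQL pool with OPENROWSET, one of the approaches debated below (the storage path and container name are assumptions):
SELECT TOP 10 *
FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/files/*.parquet',
        FORMAT = 'PARQUET'
     ) AS [rows];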
m2shines
Highly Voted
5 months, 1 week ago
Answer should be A, because this talks about minimizing storage costs, not querying costs
upvoted 22 times
assU2
4 months, 2 weeks ago
Isn't Snappy the default compressionCodec for Parquet in Azure?
https://docs.microsoft.com/en-us/azure/data-factory/format-parquet
upvoted 8 times
Aslam208
Highly Voted
5 months, 2 weeks ago
C is the correct answer, as an external table with a subset of columns with parquet files would be cost-effective.
upvoted 13 times
RehanRajput
3 days, 17 hours ago
This is not correct.
1. External tables are not saved in the database. (This is why they're external.)
2. You're assuming that serverless SQL pools have local storage. They don't --> https://docs.microsoft.com/en-us/azure/synapse-
analytics/sql/best-practices-serverless-sql-pool
upvoted 1 times
Massy
3 weeks, 5 days ago
In a serverless SQL pool you don't create a copy of the data, so how could it be cost-effective?
upvoted 1 times
sdokmak
Most Recent
1 day, 8 hours ago
Selected Answer: B
I agree with Canary2021
upvoted 1 times
rohitbinnani
1 month, 2 weeks ago
Selected Answer: C
Not A - the default compression for a Parquet file is Snappy, even in Python.
C - because an external table that contains a subset of columns from the Parquet files does not require re-saving them in the database, and that would
save storage costs.
upvoted 6 times
RehanRajput
3 days, 17 hours ago
This is not correct.
1. External tables are not saved in the database. (This is why they're external.)
2. You're assuming that serverless SQL pools have local storage. They don't --> https://docs.microsoft.com/en-us/azure/synapse-
analytics/sql/best-practices-serverless-sql-pool
upvoted 1 times
DingDongSingSong
2 months ago
The answer is NOT A. Snappy compression offers fast compression, but the file size at rest is larger, which translates into higher storage cost. The
answer is C, where an external table with the requisite columns is made available, which will reduce the amount of storage.
upvoted 4 times
cotillion
3 months, 2 weeks ago
Selected Answer: A
Only A has something to do with storage.
upvoted 1 times
PallaviPatel
3 months, 4 weeks ago
A looks to be correct.
upvoted 1 times
dev2dev
4 months, 1 week ago
Since this is a batch process, we can delete the files once they're loaded, but that can't avoid the initial/temporary storage cost of loading the data
in some form, even with the most optimized Parquet format and compression option. So the best approach would be to store only the required columns, which can
save storage. However, we can always use OPENROWSET if we are not interested in persisting the data. Yeah, like someone said, this is a shitty question
with shitty options.
upvoted 2 times
Ramkrish39
2 months, 1 week ago
OPENROWSET is for JSON files
upvoted 1 times
bhushanhegde
4 months, 2 weeks ago
As per the documentation, A is the correct answer
https://docs.microsoft.com/en-us/azure/data-factory/format-parquet#dataset-properties
upvoted 1 times
Jaws1990
4 months, 2 weeks ago
Selected Answer: A
Creating an external table with fewer columns than the file has no effect on the file itself and will actually fail, so it in no way helps with storage costs.
See the MS documentation: "The column definitions, including the data types and number of columns, must match the data in the external files. If
there's a mismatch, the file rows will be rejected when querying the actual data."
upvoted 6 times
Canary_2021
4 months, 3 weeks ago
Selected Answer: B
In order to query data from an external table, you need to create these 3 items. I feel that they all cost some storage.
If using OPENROWSET, you don't need to create anything, so I select B.
upvoted 5 times
ploer
3 months, 3 weeks ago
But this has nothing to do with storage costs. Only some bytes in the data dictionary are added and you are not even charged for this.
upvoted 1 times
sdokmak
1 day, 8 hours ago
no storage cost = WIN :)
upvoted 1 times
TestMitch
5 months, 1 week ago
This question is garbage.
upvoted 7 times
assU2
4 months, 2 weeks ago
Like many others...
upvoted 2 times
Jerrylolu
5 months, 1 week ago
That is correct. Looks like whoever put it here didn't remember it clearly.
upvoted 2 times
vijju23
5 months, 2 weeks ago
The answer is B, which is best for storage cost: we query the Parquet files only when needed, using OPENROWSET.
upvoted 7 times
DRAG DROP -
You need to build a solution to ensure that users can query specific files in an Azure Data Lake Storage Gen2 account from an Azure Synapse
Analytics serverless SQL pool.
Which three actions should you perform in sequence? To answer, move the appropriate actions from the list of actions to the answer area and
arrange them in the correct order.
NOTE: More than one order of answer choices is correct. You will receive credit for any of the correct orders you select.
Correct Answer:
You can create external tables in Synapse SQL pools via the following steps:
1. CREATE EXTERNAL DATA SOURCE to reference an external Azure storage and specify the credential that should be used to access the storage.
2. CREATE EXTERNAL FILE FORMAT to describe the format of CSV or Parquet files.
3. CREATE EXTERNAL TABLE on top of the files placed on the data source with the same file format.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables
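A minimal sketch of the three objects in that order for a serverless SQL pool (all names, the storage URL, and the column list are assumptions):
-- 1. External data source referencing the ADLS Gen2 account
CREATE EXTERNAL DATA SOURCE MyAdlsSource
WITH ( LOCATION = 'https://mystorageaccount.dfs.core.windows.net/mycontainer' );

-- 2. External file format describing the files
CREATE EXTERNAL FILE FORMAT MyParquetFormat
WITH ( FORMAT_TYPE = PARQUET );

-- 3. External table on top of the specific files
CREATE EXTERNAL TABLE dbo.MyExternalTable
(
    Id   INT,
    Name NVARCHAR(100)
)
WITH
(
    LOCATION    = '/data/2022/',
    DATA_SOURCE = MyAdlsSource,
    FILE_FORMAT = MyParquetFormat
);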
avijitd
Highly Voted
5 months, 2 weeks ago
Looks like the correct answer.
upvoted 12 times
SandipSingha
Most Recent
2 weeks, 3 days ago
correct
upvoted 1 times
lotuspetall
2 months, 1 week ago
correct
upvoted 1 times
PallaviPatel
3 months, 4 weeks ago
correct
upvoted 2 times
ANath
4 months, 3 weeks ago
Correct
upvoted 1 times
gf2tw
5 months, 2 weeks ago
Correct
upvoted 1 times
You are designing a data mart for the human resources (HR) department at your company. The data mart will contain employee information and
employee transactions.
From a source system, you have a flat extract that has the following fields:
✑ EmployeeID
✑ FirstName
✑ LastName
✑ Recipient
✑ GrossAmount
✑ TransactionID
✑ GovernmentID
✑ NetAmountPaid
✑ TransactionDate
You need to design a star schema data model in an Azure Synapse Analytics dedicated SQL pool for the data mart.
Which two tables should you create? Each correct answer presents part of the solution.
A.
a dimension table for Transaction
B.
a dimension table for EmployeeTransaction
C.
a dimension table for Employee
D.
a fact table for Employee
E.
a fact table for Transaction
Correct Answer:
CE
C: Dimension tables contain attribute data that might change but usually changes infrequently. For example, a customer's name and address
are stored in a dimension table and updated only when the customer's profile changes. To minimize the size of a large fact table, the customer's
name and address don't need to be in every row of a fact table. Instead, the fact table and the dimension table can share a customer ID. A query
can join the two tables to associate a customer's profile and transactions.
E: Fact tables contain quantitative data that are commonly generated in a transactional system, and then loaded into the dedicated SQL pool.
For example, a retail business generates sales transactions every day, and then loads the data into a dedicated SQL pool fact table for analysis.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-overview
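As a minimal sketch of the two tables (the split of the extract's fields between the dimension and the fact, and the data types, are assumptions for illustration):
CREATE TABLE dbo.DimEmployee
(
    EmployeeKey  INT IDENTITY(1,1) NOT NULL,  -- surrogate key
    EmployeeID   INT               NOT NULL,  -- business key from the flat extract
    FirstName    NVARCHAR(50)      NOT NULL,
    LastName     NVARCHAR(50)      NOT NULL,
    GovernmentID NVARCHAR(20)      NULL
)
WITH ( DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX );

CREATE TABLE dbo.FactTransaction
(
    TransactionID   BIGINT        NOT NULL,
    EmployeeKey     INT           NOT NULL,  -- joins to dbo.DimEmployee
    TransactionDate DATE          NOT NULL,
    Recipient       NVARCHAR(100) NULL,
    GrossAmount     DECIMAL(19,4) NOT NULL,
    NetAmountPaid   DECIMAL(19,4) NOT NULL
)
WITH ( DISTRIBUTION = HASH(TransactionID), CLUSTERED COLUMNSTORE INDEX );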
avijitd
Highly Voted
5 months, 2 weeks ago
Correct answer. Employee info as the dimension and the transaction table as the fact.
upvoted 7 times
SandipSingha
Most Recent
2 weeks, 3 days ago
correct
upvoted 1 times
tg2707
3 weeks, 2 days ago
Why not a fact table for Employee and a dimension table for Transaction?
upvoted 1 times
Egocentric
1 month, 1 week ago
CE is correct
upvoted 1 times
NewTuanAnh
1 month, 2 weeks ago
Selected Answer: CE
CE is the correct answer
upvoted 2 times
SebK
2 months ago
Selected Answer: CE
CE is correct
upvoted 1 times
surya610
3 months ago
Selected Answer: CE
Dimension for employee and fact for transactions.
upvoted 1 times
PallaviPatel
3 months, 4 weeks ago
Selected Answer: CE
correct
upvoted 1 times
gf2tw
5 months, 2 weeks ago
Correct
upvoted 2 times
You are designing a dimension table for a data warehouse. The table will track the value of the dimension attributes over time and preserve the
history of the data by adding new rows as the data changes.
A.
Type 0
B.
Type 1
C.
Type 2
D.
Type 3
Correct Answer:
C
A Type 2 SCD supports versioning of dimension members. Often the source system doesn't store versions, so the data warehouse load process
detects and manages changes in a dimension table. In this case, the dimension table must use a surrogate key to provide a unique reference to
a version of the dimension member. It also includes columns that define the date range validity of the version (for example, StartDate and
EndDate) and possibly a flag column (for example, IsCurrent) to easily filter by current dimension members.
Incorrect Answers:
B: A Type 1 SCD always reflects the latest values, and when changes in source data are detected, the dimension table data is overwritten.
D: A Type 3 SCD supports storing two versions of a dimension member as separate columns. The table includes a column for the current value
of a member plus either the original or previous value of the member. So Type 3 uses additional columns to track one key instance of history,
rather than storing additional rows to track each change like in a Type 2 SCD.
Reference:
https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-
dimension-types
gf2tw
Highly Voted
5 months, 2 weeks ago
Correct
upvoted 12 times
SandipSingha
Most Recent
2 weeks, 3 days ago
correct
upvoted 1 times
AZ9997989798979789798979789797
3 weeks, 1 day ago
Correct
upvoted 1 times
Onobhas01
1 month, 3 weeks ago
Selected Answer: C
Correct!
upvoted 1 times
surya610
3 months ago
Selected Answer: C
Correct
upvoted 1 times
PallaviPatel
3 months, 4 weeks ago
Selected Answer: C
correct
upvoted 1 times
saupats
4 months, 2 weeks ago
correct
upvoted 1 times
ANath
4 months, 3 weeks ago
correct
upvoted 1 times
DRAG DROP -
You have data stored in thousands of CSV files in Azure Data Lake Storage Gen2. Each file has a header row followed by a properly formatted
carriage return (\r) and line feed (\n).
You are implementing a pattern that batch loads the files daily into an enterprise data warehouse in Azure Synapse Analytics by using PolyBase.
You need to skip the header row when you import the files into the data warehouse. Before building the loading pattern, you need to prepare the
required database objects in Azure Synapse Analytics.
Which three actions should you perform in sequence? To answer, move the appropriate actions from the list of actions to the answer area and
arrange them in the correct order.
Correct Answer:
Step 1: Create an external data source that uses the abfs location
Create External Data Source to reference Azure Data Lake Store Gen 1 or 2
Step 2: Create an external file format and set the First_Row option.
Step 3: Use CREATE EXTERNAL TABLE AS SELECT (CETAS) and configure the reject options to specify reject values or percentages
To use PolyBase, you must create external tables to reference your external data.
Note: REJECT options don't apply at the time this CREATE EXTERNAL TABLE AS SELECT statement is run. Instead, they're specified here so that
the database can use them at a later time when it imports data from the external table. Later, when the CREATE TABLE AS SELECT statement
selects data from the external table, the database will use the reject options to determine the number or percentage of rows that can fail to
import before it stops the import.
Reference:
https://docs.microsoft.com/en-us/sql/relational-databases/polybase/polybase-t-sql-objects
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-as-select-transact-sql
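As a rough sketch of the three database objects in that order (the object names, container URL, and column list are invented for the example, and a database scoped credential would also be needed for non-public storage); the third step is shown here with CREATE EXTERNAL TABLE plus reject options, which is the variant the discussion below argues for:
-- 1) External data source that uses the abfs(s) location
CREATE EXTERNAL DATA SOURCE DailyCsvSource
WITH ( LOCATION = 'abfss://files@mystorageaccount.dfs.core.windows.net', TYPE = HADOOP );
-- 2) External file format that skips the header row
CREATE EXTERNAL FILE FORMAT CsvSkipHeader
WITH ( FORMAT_TYPE = DELIMITEDTEXT,
       FORMAT_OPTIONS ( FIELD_TERMINATOR = ',', FIRST_ROW = 2 ) );
-- 3) External table with reject options, referenced later by the daily load
CREATE EXTERNAL TABLE dbo.DailyLoadExternal
( Id INT, LoadDate DATE, Amount DECIMAL(18,2) )
WITH ( LOCATION = '/daily/', DATA_SOURCE = DailyCsvSource, FILE_FORMAT = CsvSkipHeader,
       REJECT_TYPE = VALUE, REJECT_VALUE = 0 );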
avijitd
Highly Voted
5 months, 2 weeks ago
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#create-external-data-source
Hadoop external data source in a dedicated SQL pool for Azure Data Lake Gen2, pointing at the container:
CREATE DATABASE SCOPED CREDENTIAL ADLS_credential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = 'sv=2018-03-28&ss=bf&srt=sco&sp=rl&st=2019-10-14T12%3A10%3A25Z&se=2061-12-31T12%3A10%3A00Z&sig=KlSU2ullCscyTS0An0nozEpo4tO5JAgGBvw%2FJX2lguw%3D'
GO
-- Please note the abfss endpoint when your account has secure transfer enabled
CREATE EXTERNAL DATA SOURCE AzureDataLakeStore
WITH
  ( LOCATION = 'abfss://data@newyorktaxidataset.dfs.core.windows.net' ,
    CREDENTIAL = ADLS_credential ,
    TYPE = HADOOP
  ) ;
2. Create the external data source.
Fer079
Highly Voted
3 months, 3 weeks ago
The right answer should be different: "Create external table as select (CETAS)" makes no sense in this case, because we would need a SELECT to fill the external table, whereas this data must come from files and not from other tables. An "external table" is not the same as an "external table as select": with the first, the data comes from files; with the second, the data comes from a SQL query and is exported into files.
upvoted 11 times
ravi2931
1 month, 3 weeks ago
I was thinking same and its obvious
upvoted 1 times
Genere
Most Recent
1 month, 3 weeks ago
"CETAS : Creates an external table and THEN EXPERTS, in parallel, the results of a Transact-SQL SELECT statement to Hadoop or Azure Blob
storage."
We are not looking here to export data but rather to consume data from ADLS.
wwdba
2 months ago
1. Create database scoped credential
You are implementing a pattern that batch loads the files daily...so "Create external table as select" is wrong because it'll load the data into Synapse
only once
upvoted 3 times
DingDongSingSong
2 months ago
According to this link, when using PolyBase: https://docs.microsoft.com/en-us/sql/relational-databases/polybase/polybase-configure-sql-server?view=sql-server-ver15
ovokpus
3 months ago
Why should you be the one to create the database scoped credential? You ought to have that already
upvoted 1 times
VeroDon
4 months, 3 weeks ago
"Azure Synapse Analytics uses a database scoped credential to access non-public Azure blob storage with PolyBase" The question doesnt mention
if the storage is or is not private
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-database-scoped-credential-transact-sql?view=sql-server-ver15
upvoted 3 times
ANath
4 months, 1 week ago
That clears up the confusion about whether a database scoped credential is needed in this context or not. Going through the link VeroDon provided, it is clear that a database scoped credential is needed for non-public Azure Blob storage.
upvoted 1 times
VeroDon
4 months, 3 weeks ago
Correct
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/load-data-overview
upvoted 2 times
alexleonvalencia
5 months, 2 weeks ago
Step 1 : Create External data source ...
alexleonvalencia
5 months, 2 weeks ago
Correction:
HOTSPOT -
You are building an Azure Synapse Analytics dedicated SQL pool that will contain a fact table for transactions from the first half of the year 2020.
You need to ensure that the table meets the following requirements:
✑ Minimizes the processing time to delete data that is older than 10 years
How should you complete the Transact-SQL statement? To answer, select the appropriate options in the answer area.
Hot Area:
Correct Answer:
Box 1: PARTITION -
Box 2: [TransactionDateID]
The following partition function partitions a table or index into 12 partitions, one for each month of a year's worth of values in a datetime
column.
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-partition-function-transact-sql
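For illustration, a hedged sketch of the kind of statement the boxes describe (everything other than PARTITION and [TransactionDateID] is assumed): partitioning the fact table on the date key lets old data be removed later with metadata-only partition switching instead of slow row-by-row deletes.
CREATE TABLE dbo.FactTransaction
(
    TransactionKey    INT   NOT NULL,
    TransactionDateID INT   NOT NULL,   -- date key such as 20200101
    Amount            MONEY NOT NULL
)
WITH
(
    DISTRIBUTION = HASH (TransactionKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ( TransactionDateID RANGE RIGHT FOR VALUES
        (20200101, 20200201, 20200301, 20200401, 20200501, 20200601) )
);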
gf2tw
Highly Voted
5 months, 2 weeks ago
Correct
upvoted 6 times
gabdu
Most Recent
3 weeks, 2 days ago
How are we ensuring "Minimizes the processing time to delete data that is older than 10 years"?
upvoted 2 times
wwdba
2 months, 2 weeks ago
Correct
upvoted 1 times
PallaviPatel
3 months, 3 weeks ago
correct
upvoted 1 times
saupats
4 months, 2 weeks ago
correct
upvoted 1 times
You are performing exploratory analysis of the bus fare data in an Azure Data Lake Storage Gen2 account by using an Azure Synapse Analytics
serverless SQL pool.
A.
Only CSV files in the tripdata_2020 subfolder.
B.
All files that have file names that begin with "tripdata_2020".
C.
All CSV files that have file names that contain "tripdata_2020".
D.
Only CSV files that have file names that begin with "tripdata_2020".
Correct Answer:
D
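The behaviour in question comes from the wildcard pattern (tripdata_2020*.csv, per the discussion below). A hedged sketch of what such a query could look like; the storage URL and folder are made up for the example:
SELECT TOP 10 *
FROM OPENROWSET(
        BULK 'https://mydatalake.dfs.core.windows.net/fares/csv/tripdata_2020*.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
     ) AS rows;
Only files whose names begin with tripdata_2020 and end in .csv match the pattern, which is why options B and C are too broad.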
gf2tw
Highly Voted
5 months, 2 weeks ago
Correct
upvoted 10 times
Egocentric
Most Recent
1 month, 1 week ago
on this one you need to pay attention to wording
upvoted 1 times
jskibick
1 month, 2 weeks ago
Selected Answer: D
D all good
upvoted 1 times
sarapaisley
1 month, 2 weeks ago
Selected Answer: D
D is correct
upvoted 1 times
SebK
2 months ago
Selected Answer: D
Correct
upvoted 1 times
DingDongSingSong
2 months ago
Why is option C not correct, when the code has "tripdata_2020*.csv", which means a wildcard is used with "tripdata_2020" CSV files? So, for example, tripdata_2020A.csv, tripdata_2020B.csv, and tripdata_2020YZ.csv would all be queried. Option D does not make sense, even grammatically.
upvoted 1 times
PallaviPatel
3 months, 3 weeks ago
Selected Answer: D
correct
upvoted 2 times
anto69
4 months, 2 weeks ago
No doubt it is correct; no doubt the answer is D.
upvoted 1 times
duds19
5 months, 2 weeks ago
Why not B?
upvoted 1 times
Nifl91
5 months, 1 week ago
Because of the .csv at the end
upvoted 3 times
DRAG DROP -
You use PySpark in Azure Databricks to parse the following JSON input.
How should you complete the PySpark code? To answer, drag the appropriate values to the correct targets. Each value may be used once, more
than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
Correct Answer:
Box 1: select -
Box 2: explode -
Box 3: alias -
pyspark.sql.Column.alias returns this column aliased with a new name or names (in the case of expressions that return more than one column,
such as explode).
Reference:
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.Column.alias.html
https://docs.microsoft.com/en-us/azure/databricks/sql/language-manual/functions/explode
galacaw
4 weeks ago
Correct
upvoted 2 times
HOTSPOT -
You are designing an application that will store petabytes of medical imaging data.
When the data is first created, the data will be accessed frequently during the first week. After one month, the data must be accessible within 30
seconds, but files will be accessed infrequently. After one year, the data will be accessed infrequently but must be accessible within five minutes.
You need to select a storage strategy for the data. The solution must minimize costs.
Which storage tier should you use for each time frame? To answer, select the appropriate options in the answer area.
Hot Area:
Correct Answer:
Box 1: Hot -
Hot tier - An online tier optimized for storing data that is accessed or modified frequently. The Hot tier has the highest storage costs, but the
lowest access costs.
Box 2: Cool -
Cool tier - An online tier optimized for storing data that is infrequently accessed or modified. Data in the Cool tier should be stored for a
minimum of 30 days. The
Cool tier has lower storage costs and higher access costs compared to the Hot tier.
Box 3: Cool -
Not Archive tier - An offline tier optimized for storing data that is rarely accessed, and that has flexible latency requirements, on the order of
hours. Data in the
Reference:
https://docs.microsoft.com/en-us/azure/storage/blobs/access-tiers-overview https://www.altaro.com/hyper-v/azure-archive-storage/
nefarious_smalls
2 weeks, 4 days ago
Why would it not be Hot, Cool, and Archive?
upvoted 1 times
SandipSingha
2 weeks, 2 days ago
After one year, the data will be accessed infrequently but must be accessible within five minutes.
upvoted 2 times
Guincimund
2 weeks, 3 days ago
"After one year, the data will be accessed infrequently but must be accessible within five minutes"
The latency to first byte is "hours" for the Archive tier. Because they want to be able to access the data within 5 minutes, you need to place it in Cool.
nefarious_smalls
2 weeks, 4 days ago
I don't know.
upvoted 1 times
Andy91
1 month ago
Correct answer!
You have an Azure Synapse Analytics Apache Spark pool named Pool1.
You plan to load JSON files from an Azure Data Lake Storage Gen2 container into the tables in Pool1. The structure and data types vary by file.
You need to load the files into the tables. The solution must maintain the source data types.
A.
Use a Conditional Split transformation in an Azure Synapse data flow.
B.
Use a Get Metadata activity in Azure Data Factory.
C.
Load the data by using the OPENROWSET Transact-SQL command in an Azure Synapse Analytics serverless SQL pool.
D.
Load the data by using PySpark.
Correct Answer:
C
Serverless SQL pool can automatically synchronize metadata from Apache Spark. A serverless SQL pool database will be created for each
database existing in serverless Apache Spark pools.
Serverless SQL pool enables you to query data in your data lake. It offers a T-SQL query surface area that accommodates semi-structured and
unstructured data queries.
To support a smooth experience for in place querying of data that's located in Azure Storage files, serverless SQL pool uses the OPENROWSET
function with additional capabilities.
The easiest way to see to the content of your JSON file is to provide the file URL to the OPENROWSET function, specify csv FORMAT.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-json-files
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-data-storage
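For the approach the suggested answer describes (the discussion below argues for PySpark instead), a hedged sketch of querying JSON with OPENROWSET from a serverless SQL pool; the path and property names are invented for the example:
SELECT JSON_VALUE(doc, '$.id')   AS id,
       JSON_VALUE(doc, '$.name') AS name
FROM OPENROWSET(
        BULK 'https://mydatalake.dfs.core.windows.net/raw/devices/*.json',
        FORMAT = 'CSV',            -- CSV parser with 0x0b delimiters reads each JSON line whole
        FIELDTERMINATOR = '0x0b',
        FIELDQUOTE = '0x0b'
     ) WITH (doc NVARCHAR(MAX)) AS rows;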
Ben_1010
4 days, 18 hours ago
Why PySpark?
upvoted 1 times
Andushi
2 weeks, 6 days ago
Selected Answer: D
Should be D, I agree with @galacaw
upvoted 1 times
galacaw
4 weeks ago
Should be D, it's about Apache Spark pool, not serverless SQL pool.
upvoted 4 times
You have an Azure Databricks workspace named workspace1 in the Standard pricing tier. Workspace1 contains an all-purpose cluster named
cluster1.
You need to reduce the time it takes for cluster1 to start and scale up. The solution must minimize costs.
A.
Configure a global init script for workspace1.
B.
Create a cluster policy in workspace1.
C.
Upgrade workspace1 to the Premium pricing tier.
D.
Create a pool in workspace1.
Correct Answer:
D
You can use Databricks Pools to Speed up your Data Pipelines and Scale Clusters Quickly.
Databricks Pools, a managed cache of virtual machine instances that enables clusters to start and scale 4 times faster.
Reference:
https://databricks.com/blog/2019/11/11/databricks-pools-speed-up-data-pipelines.html
Maggiee
1 week, 5 days ago
Answer should be C
upvoted 2 times
sdokmak
1 day, 8 hours ago
Answer is D:
Looking at the reference link, a pool works for this. Optimized scaling isn't needed just to reduce 'start and scale up' times.
upvoted 1 times
galacaw
4 weeks ago
Correct
upvoted 4 times
HOTSPOT -
You are building an Azure Stream Analytics job that queries reference data from a product catalog file. The file is updated daily.
The reference data input details for the file are shown in the Input exhibit. (Click the Input tab.)
The storage account container view is shown in the Refdata exhibit. (Click the Refdata tab.)
You need to configure the Stream Analytics job to pick up the new reference data.
What should you configure? To answer, select the appropriate options in the answer area.
Hot Area:
Correct Answer:
Box 1: {date}/product.csv -
Note: Path Pattern: This is a required property that is used to locate your blobs within the specified container. Within the path, you may choose
to specify one or more instances of the following 2 variables:
{date}, {time}
Example 1: products/{date}/{time}/product-list.csv
Example 2: products/{date}/product-list.csv
Example 3: product-list.csv -
Box 2: YYYY-MM-DD -
Note: Date Format [optional]: If you have used {date} within the Path Pattern that you specified, then you can select the date format in which
your blobs are organized from the drop-down of supported formats.
Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-use-reference-data
inotbf83
3 weeks, 2 days ago
I would change Box 2 to YYYY/MM/DD (as shown in the first exhibit). It's a bit confusing with the time format in Box 1.
upvoted 4 times
jackttt
4 weeks, 1 day ago
The file is updated daily, so I think `{date}/product.csv` is correct
upvoted 4 times
Lotusss
1 month ago
Wrong! Path Pattern: {date}/{time}/product.csv
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-use-reference-data
upvoted 2 times
KashRaynardMorse
3 weeks, 4 days ago
See that the file is stored under the date folder, and there is no time folder.
Your link does recommend the time part, but the link also says it's optional, and ultimately you need to answer the question, which states
the path without the time.
upvoted 4 times
HOTSPOT -
For each of the following statements, select Yes if the statement is true. Otherwise, select No.
Hot Area:
Correct Answer:
Box 1: No -
Note: You can now use a new extension of Azure Stream Analytics SQL to specify the number of partitions of a stream when reshuffling the
data.
The outcome is a stream that has the same partition scheme. Please see below for an example:
WITH
  step1 AS (SELECT * FROM [input1] PARTITION BY DeviceID INTO 10),
  step2 AS (SELECT * FROM [input2] PARTITION BY DeviceID INTO 10)
SELECT * INTO [output]
FROM step1 PARTITION BY DeviceID
UNION step2 PARTITION BY DeviceID
Note: The new extension of Azure Stream Analytics SQL includes a keyword INTO that allows you to specify the number of partitions for a
stream when performing reshuffling using a PARTITION BY statement.
Box 2: Yes -
When joining two streams of data explicitly repartitioned, these streams must have the same partition key and partition count.
Box 3: Yes -
Streaming Units (SUs) represents the computing resources that are allocated to execute a Stream Analytics job. The higher the number of SUs,
the more CPU and memory resources are allocated for your job.
In general, the best practice is to start with 6 SUs for queries that don't use PARTITION BY.
Note: Remember, Streaming Unit (SU) count, which is the unit of scale for Azure Stream Analytics, must be adjusted so the number of physical
resources available to the job can fit the partitioned flow. In general, six SUs is a good number to assign to each partition. In case there are
insufficient resources assigned to the job, the system will only apply the repartition if it benefits the job.
Reference:
https://azure.microsoft.com/en-in/blog/maximize-throughput-with-repartitioning-in-azure-stream-analytics/
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-streaming-unit-consumption
TacoB
2 weeks, 5 days ago
Reading https://docs.microsoft.com/en-us/stream-analytics-query/union-azure-stream-analytics and the second sample given in there I would
expect the first one to be No.
upvoted 1 times
Akshay_1995
3 weeks, 3 days ago
Correct
upvoted 1 times
HOTSPOT -
You are building a database in an Azure Synapse Analytics serverless SQL pool.
You have data stored in Parquet files in an Azure Data Lake Storage Gen2 container.
"id": 123,
"address_housenumber": "19c",
"applicant1_name": "Jane",
"applicant2_name": "Dev"
You need to build a table that includes only the address fields.
How should you complete the Transact-SQL statement? To answer, select the appropriate options in the answer area.
Hot Area:
Correct Answer:
An external table points to data located in Hadoop, Azure Storage blob, or Azure Data Lake Storage. External tables are used to read data from
files or write data to files in Azure Storage. With Synapse SQL, you can use external tables to read external data using a dedicated SQL pool or a serverless SQL pool.
Syntax:
CREATE EXTERNAL TABLE { database_name.schema_name.table_name | schema_name.table_name | table_name }
    ( <column_definition> [ ,...n ] )
    WITH (
        LOCATION = 'folder_or_filepath',
        DATA_SOURCE = external_data_source_name,
        FILE_FORMAT = external_file_format_name
    )
Box 2. OPENROWSET -
When using serverless SQL pool, CETAS is used to create an external table and export query results to Azure Storage Blob or Azure Data Lake
Storage Gen2.
Example:
AS -
FROM -
OPENROWSET(BULK
'https://azureopendatastorage.blob.core.windows.net/censusdatacontainer/release/us_population_county/year=*/*.parquet',
FORMAT='PARQUET') AS [r]
GO -
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables
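Pulling the fragments above together, a minimal hedged sketch of a CETAS statement that keeps only the address fields; the object names, location, and the data source and file format are assumptions for the example and must already exist:
CREATE EXTERNAL TABLE dbo.ApplicationAddresses
WITH (
    LOCATION    = 'addresses/',
    DATA_SOURCE = my_adls_source,        -- assumed external data source for the container
    FILE_FORMAT = my_parquet_format      -- assumed Parquet external file format
)
AS
SELECT id, address_housenumber           -- keep only the address-related fields
FROM OPENROWSET(
        BULK 'https://mydatalake.dfs.core.windows.net/applications/*.parquet',
        FORMAT = 'PARQUET'
     ) AS src;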
SandipSingha
2 weeks, 2 days ago
correct
upvoted 2 times
Feljoud
3 weeks ago
correct
upvoted 3 times
HOTSPOT -
You have an Azure Synapse Analytics dedicated SQL pool named Pool1 and an Azure Data Lake Storage Gen2 account named Account1.
You need to create a data source in Pool1 that you can reference when you create the external table.
How should you complete the Transact-SQL statement? To answer, select the appropriate options in the answer area.
Hot Area:
Correct Answer:
Box 1: blob -
The following example creates an external data source for Azure Data Lake Gen2
TYPE = HADOOP)
Box 2: HADOOP -
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables
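For reference, a hedged sketch in the style of the linked doc's ADLS Gen2 example (the account and container names are placeholders); note that the comments below debate whether the endpoint prefix should be dfs (abfss) rather than blob:
CREATE EXTERNAL DATA SOURCE Account1Source
WITH (
    LOCATION = 'abfss://container1@account1.dfs.core.windows.net',  -- ADLS Gen2 (dfs) endpoint
    TYPE = HADOOP
);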
Jmanuelleon
1 week, 1 day ago
It's confusing... the definition for LOCATION says to use dfs (https://docs.microsoft.com/es-es/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#location), but the example further down uses the opposite (https://docs.microsoft.com/es-es/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#example-for-create-external-data-source): "The following example creates an external data source for Azure Data Lake Gen2 that points to the publicly available New York data set: CREATE EXTERNAL DATA SOURCE YellowTaxi
hbad
1 week, 2 days ago
It is HADOOP and dfs. For dfs, see the Location section of the link below:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop
upvoted 2 times
LetsPassExams
2 weeks, 5 days ago
From https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#example-for-create-external-data-source
The following example creates an external data source for Azure Data Lake Gen2 pointing to the publicly available New York data set:
LetsPassExams
2 weeks, 5 days ago
I think the answer is correct:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#example-for-create-external-data-source
upvoted 1 times
shrikantK
3 weeks, 5 days ago
dfs is the answer, as the question is about Azure Data Lake Storage Gen2. If the question were about Blob storage, then the answer would have been blob.
upvoted 1 times
Andushi
3 weeks, 6 days ago
1. is DFS
upvoted 2 times
Andushi
4 weeks ago
I agree with galacaw: it is dfs and TYPE = HADOOP.
upvoted 2 times
galacaw
4 weeks ago
1. dfs (for Azure Data Lake Storage Gen2)
upvoted 4 times
You have an Azure subscription that contains an Azure Blob Storage account named storage1 and an Azure Synapse Analytics dedicated SQL pool
named
Pool1.
You need to store data in storage1. The data will be read by Pool1. The solution must meet the following requirements:
Enable Pool1 to skip columns and rows that are unnecessary in a query.
A.
JSON
B.
Parquet
C.
Avro
D.
CSV
Correct Answer:
B
Automatic creation of statistics is turned on for Parquet files. For CSV files, you need to create statistics manually until automatic creation of
CSV files statistics is supported.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-statistics
ClassMistress
1 week, 1 day ago
Selected Answer: B
Automatic creation of statistics is turned on for Parquet files. For CSV files, you need to create statistics manually until automatic creation of CSV
files statistics is supported.
upvoted 1 times
sdokmak
1 day, 7 hours ago
Good point, also better cost
upvoted 1 times
shachar_ash
2 weeks, 1 day ago
Correct
upvoted 2 times
DRAG DROP -
You plan to create a table in an Azure Synapse Analytics dedicated SQL pool.
Data in the table will be retained for five years. Once a year, data that is older than five years will be deleted.
You need to ensure that the data is distributed evenly across partitions. The solution must minimize the amount of time required to delete old
data.
How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used
once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
Correct Answer:
Box 1: HASH -
Box 2: OrderDateKey -
A way to eliminate rollbacks is to use Metadata Only operations like partition switching for data management. For example, rather than execute
a DELETE statement to delete all rows in a table where the order_date was in October of 2001, you could partition your data early. Then you can
switch out the partition with data for an empty partition from another table.
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-azure-sql-data-warehouse
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-dedicated-sql-pool
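A hedged sketch of the pattern the answer and references describe (the table, the ProductKey hash column, and the staging table are assumptions): distribute on a hash key for even distribution, partition on OrderDateKey, and age out old data with a metadata-only partition switch.
CREATE TABLE dbo.FactOrder
(
    OrderKey     BIGINT NOT NULL,
    ProductKey   INT    NOT NULL,
    OrderDateKey INT    NOT NULL,
    Amount       MONEY  NOT NULL
)
WITH
(
    DISTRIBUTION = HASH (ProductKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ( OrderDateKey RANGE RIGHT FOR VALUES
        (20190101, 20200101, 20210101, 20220101, 20230101) )
);
-- Yearly cleanup: switch the oldest partition out to an empty, identically partitioned table.
ALTER TABLE dbo.FactOrder SWITCH PARTITION 1 TO dbo.FactOrder_Old PARTITION 1;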
ClassMistress
1 week, 1 day ago
I think it is HASH because the question refers to a fact table.
upvoted 1 times
jebias
1 month ago
I think the first answer should be Round-Robin as it should be distributed evenly.
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
upvoted 2 times
Feljoud
1 month ago
While you are right that Round-Robin guarantees an even distribution, it is only recommended for small tables (< 2 GB, see your link). Using the hash of the ProductKey also gives an even distribution, but in a more efficient manner.
Also, the syntax here would be wrong if you inserted Round-Robin, as in that case it would only say "DISTRIBUTION = ROUND_ROBIN" (no ProductKey).
upvoted 10 times
nefarious_smalls
2 weeks, 4 days ago
You are exactly right.
upvoted 1 times
Muishkin
3 weeks, 6 days ago
yes i think so too
upvoted 1 times
Massy
3 weeks, 5 days ago
the syntax is ok only for HASH
upvoted 2 times
HOTSPOT -
You need to design a data archiving solution that meets the following requirements:
✑ Data that is older than five years is accessed infrequently but must be available within one second when requested.
How should you manage the data? To answer, select the appropriate options in the answer area.
Hot Area:
Correct Answer:
Archive - Optimized for storing data that is rarely accessed and stored for at least 180 days with flexible latency requirements, on the order of
hours.
The following table shows a comparison of premium performance block blob storage, and the hot, cool, and archive access tiers.
Reference:
https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers
sagur
Highly Voted
1 month ago
If "Data that is older than seven years is NOT accessed" then this data can be deleted to minimize the storage costs, right?
upvoted 5 times
Feljoud
1 month ago
Would agree, but the question states: "a data archiving solution", so maybe to keep the data was implied with this?
upvoted 2 times
noobprogrammer
1 month ago
Makes sense to me
upvoted 1 times
Massy
1 month ago
I agree, should be deleted
upvoted 1 times
KashRaynardMorse
3 weeks, 4 days ago
Deleting data older than 7 years is not an option available in the answer list. Be careful of the gotcha: 'Delete the blob' is an option, but it would delete all the data, including data that is e.g. 5 years old. So you can't choose that answer, and the next best thing to do is to put it into Archive.
upvoted 5 times
Boompiee
2 weeks, 1 day ago
I'm confused by your comment. It clearly does state an option to delete the blob after 7 years.
upvoted 1 times
Question #1 Topic 2
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that
might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You plan to create an Azure Databricks workspace that has a tiered structure. The workspace will contain the following three workloads:
✑ A workload for data engineers who will use Python and SQL.
✑ A workload for jobs that will run notebooks that use Python, Scala, and SQL.
✑ A workload that data scientists will use to perform ad hoc analysis in Scala and R.
The enterprise architecture team at your company identifies the following standards for Databricks environments:
✑ The job cluster will be managed by using a request process whereby data scientists and data engineers provide packaged notebooks for
deployment to the cluster.
✑ All the data scientists must be assigned their own cluster that terminates automatically after 120 minutes of inactivity. Currently, there are
three data scientists.
Solution: You create a Standard cluster for each data scientist, a High Concurrency cluster for the data engineers, and a Standard cluster for the
jobs.
Does this meet the goal?
A.
Yes
B.
No
Correct Answer:
B
Note:
Standard clusters are recommended for a single user. Standard can run workloads developed in any language: Python, R, Scala, and SQL.
A high concurrency cluster is a managed cloud resource. The key benefits of high concurrency clusters are that they provide Apache Spark-
native fine-grained sharing for maximum resource utilization and minimum query latencies.
Reference:
https://docs.azuredatabricks.net/clusters/configure.html
Amalbenrebai
Highly Voted
8 months, 4 weeks ago
- data engineers: high concurrency cluster
Egocentric
1 month, 1 week ago
agreed
upvoted 1 times
Julius7000
8 months, 1 week ago
Tell me one thing: is this answer (jobs) based on the text:
"A Single Node cluster has no workers and runs Spark jobs on the driver node.
In contrast, a Standard cluster requires at least one Spark worker node in addition to the driver node to execute Spark jobs."?
I don't understand the connection between worker nodes and the requirements given in the question about the jobs workload.
upvoted 1 times
gangstfear
Highly Voted
8 months, 4 weeks ago
The answer must be A!
upvoted 31 times
Eyepatch993
Most Recent
2 months ago
Selected Answer: B
Standard clusters do not have fault tolerance. Both the data scientist and data engineers will be using the job cluster for processing their
notebooks, so if a standard cluster is chosen and a fault occurs in the notebook of any one user, there is a chance that other notebooks might also
fail. Due to this a high concurrency cluster is recommended for running jobs.
upvoted 2 times
Boompiee
2 weeks, 1 day ago
It may not be a best practice, but the question asked is: does the solution meet the stated requirements? And it does.
upvoted 1 times
Hanse
2 months, 2 weeks ago
As per Link: https://docs.azuredatabricks.net/clusters/configure.html
Standard and Single Node clusters terminate automatically after 120 minutes by default. --> Data Scientists
A Standard cluster is recommended for a single user. --> Standard for Data Scientists & High Concurrency for Data Engineers
Standard clusters can run workloads developed in any language: Python, SQL, R, and Scala.
High Concurrency clusters can run workloads developed in SQL, Python, and R. The performance and security of High Concurrency clusters is
provided by running user code in separate processes, which is not possible in Scala. --> Jobs needs Standard
upvoted 2 times
ovokpus
3 months ago
Selected Answer: A
Yes it seems to be!
upvoted 2 times
PallaviPatel
3 months, 3 weeks ago
Selected Answer: A
correct
upvoted 2 times
kilowd
4 months ago
Selected Answer: A
Data Engineers - High Concurrency cluster, as it provides for sharing. It also caters for SQL, Python, and R.
Data Scientists - Standard clusters, which automatically terminate after 120 minutes and cater for Scala, SQL, Python, and R.
let_88
4 months ago
As per the doc in Microsoft the High Concurrency cluster doesn't support Scala.
High Concurrency clusters can run workloads developed in SQL, Python, and R. The performance and security of High Concurrency clusters is
provided by running user code in separate processes, which is not possible in Scala.
https://docs.microsoft.com/en-us/azure/databricks/clusters/configure#cluster-mode
upvoted 6 times
tesen_tolga
4 months, 1 week ago
Selected Answer: A
The answer must be A!
upvoted 2 times
SabaJamal2010AtGmail
4 months, 3 weeks ago
The solution does not meet the requirement because: "High Concurrency clusters work only for SQL, Python, and R. The performance and security
of High Concurrency clusters is provided by running user code in separate processes, which is not possible in Scala.
upvoted 1 times
FredNo
6 months ago
Selected Answer: A
Data scientists and jobs use Scala so they need standard cluster
upvoted 9 times
Aslam208
6 months, 3 weeks ago
Answer is A.
upvoted 4 times
gangstfear
9 months ago
Shouldn't the answer be A, as all the requirements are met:
Jobs - Standard
upvoted 13 times
satyamkishoresingh
8 months, 4 weeks ago
Yes , Given solution is correct.
upvoted 6 times
echerish
9 months ago
Questions 23 and 24 seem to have been swapped. The key is:
Jobs - Standard
upvoted 7 times
MoDar
9 months ago
Answer A
Scala is not supported in High Concurrency cluster --> Jobs & Data scientists --> Standard
damaldon
11 months, 1 week ago
Answer: B
-Data scientist should have their own cluster and should terminate after 120 mins - STANDARD
https://docs.microsoft.com/en-us/azure/databricks/clusters/configure
upvoted 6 times
Sunnyb
11 months, 3 weeks ago
B is the correct answer
Link below:
https://docs.microsoft.com/en-us/azure/databricks/clusters/configure
upvoted 3 times
Question #2 Topic 2
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that
might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You plan to create an Azure Databricks workspace that has a tiered structure. The workspace will contain the following three workloads:
✑ A workload for data engineers who will use Python and SQL.
✑ A workload for jobs that will run notebooks that use Python, Scala, and SQL.
✑ A workload that data scientists will use to perform ad hoc analysis in Scala and R.
The enterprise architecture team at your company identifies the following standards for Databricks environments:
✑ The job cluster will be managed by using a request process whereby data scientists and data engineers provide packaged notebooks for
deployment to the cluster.
✑ All the data scientists must be assigned their own cluster that terminates automatically after 120 minutes of inactivity. Currently, there are
three data scientists.
Solution: You create a Standard cluster for each data scientist, a High Concurrency cluster for the data engineers, and a High Concurrency cluster
for the jobs.
A.
Yes
B.
No
Correct Answer:
A
We need a High Concurrency cluster for the data engineers and the jobs.
Note: Standard clusters are recommended for a single user. Standard can run workloads developed in any language: Python, R, Scala, and SQL.
A high concurrency cluster is a managed cloud resource. The key benefits of high concurrency clusters are that they provide Apache Spark-
native fine-grained sharing for maximum resource utilization and minimum query latencies.
Reference:
https://docs.azuredatabricks.net/clusters/configure.html
dfdsfdsfsd
Highly Voted
1 year ago
High-concurrency clusters do not support Scala. So the answer is still 'No' but the reasoning is wrong.
https://docs.microsoft.com/en-us/azure/databricks/clusters/configure
upvoted 31 times
Preben
11 months, 3 weeks ago
I agree that High concurrency does not support Scala. But they specified using a Standard cluster for the jobs, which does support Scala. Why is
the answer 'No'?
upvoted 2 times
eng1
11 months, 1 week ago
Because the High Concurrency cluster for each data scientist is not correct, it should be standard for a single user!
upvoted 4 times
FRAN__CO_HO
Highly Voted
11 months, 1 week ago
Answer should be NO, which
ClassMistress
Most Recent
1 week, 1 day ago
Selected Answer: B
High Concurrency clusters is provided by running user code in separate processes, which is not possible in Scala.
upvoted 1 times
narendra399
1 month, 3 weeks ago
Questions 1 and 2 are the same, but the answers are different. Why?
upvoted 2 times
Hanse
2 months, 2 weeks ago
As per Link: https://docs.azuredatabricks.net/clusters/configure.html
Standard and Single Node clusters terminate automatically after 120 minutes by default. --> Data Scientists
A Standard cluster is recommended for a single user. --> Standard for Data Scientists & High Concurrency for Data Engineers
Standard clusters can run workloads developed in any language: Python, SQL, R, and Scala.
High Concurrency clusters can run workloads developed in SQL, Python, and R. The performance and security of High Concurrency clusters is
provided by running user code in separate processes, which is not possible in Scala. --> Jobs needs Standard
upvoted 2 times
lukeonline
4 months, 3 weeks ago
Selected Answer: B
high concurrency does not support scala
upvoted 2 times
rashjan
5 months, 2 weeks ago
Selected Answer: B
wrong: no
upvoted 1 times
FredNo
6 months ago
Selected Answer: B
Answer is no because high concurrency does not support scala
upvoted 5 times
Aslam208
6 months, 3 weeks ago
Answer is No
upvoted 2 times
damaldon
11 months, 1 week ago
Answer: NO
-Data scientist should have their own cluster and should terminate after 120 mins - STANDARD
https://docs.microsoft.com/en-us/azure/databricks/clusters/configure
upvoted 2 times
nas28
11 months, 3 weeks ago
The answer is correct: No. But the reason given is wrong; they want the data scientists' clusters to shut down automatically after 120 minutes, so Standard clusters, not High Concurrency.
upvoted 3 times
Sunnyb
11 months, 3 weeks ago
Answer is correct - NO
upvoted 2 times
Question #3 Topic 2
HOTSPOT -
You plan to create a real-time monitoring app that alerts users when a device travels more than 200 meters away from a designated location.
You need to design an Azure Stream Analytics job to process the data for the planned app. The solution must minimize the amount of code
developed and the number of technologies used.
What should you include in the Stream Analytics job? To answer, select the appropriate options in the answer area.
Hot Area:
Correct Answer:
You can process real-time IoT data streams with Azure Stream Analytics.
Function: Geospatial -
With built-in geospatial functions, you can use Azure Stream Analytics to build applications for scenarios such as fleet management, ride
sharing, connected cars, and asset tracking.
Note: In a real-world scenario, you could have hundreds of these sensors generating events as a stream. Ideally, a gateway device would run
code to push these events to Azure Event Hubs or Azure IoT Hubs.
Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-get-started-with-azure-stream-analytics-to-process-data-from-iot-devices
https://docs.microsoft.com/en-us/azure/stream-analytics/geospatial-scenarios
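A hedged sketch of the kind of Stream Analytics query the answer points to, using the built-in geospatial functions; the input/output names, field names, and reference coordinates are all invented for the example:
SELECT
    deviceId,
    ST_DISTANCE(CreatePoint(latitude, longitude),
                CreatePoint(47.6062, -122.3321)) AS distanceInMeters
INTO [alerts]
FROM [devicetelemetry]
WHERE ST_DISTANCE(CreatePoint(latitude, longitude),
                  CreatePoint(47.6062, -122.3321)) > 200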
Podavenna
Highly Voted
8 months, 1 week ago
Correct solution!
upvoted 22 times
ClassMistress
Most Recent
1 week, 1 day ago
Correct
upvoted 1 times
NewTuanAnh
1 month, 2 weeks ago
Correct!
upvoted 1 times
PallaviPatel
3 months, 3 weeks ago
Correct
upvoted 1 times
Question #4 Topic 2
A company has a real-time data analysis solution that is hosted on Microsoft Azure. The solution uses Azure Event Hub to ingest data and an
Azure Stream
Analytics cloud job to analyze the data. The cloud job is configured to use 120 Streaming Units (SU).
You need to optimize performance for the Azure Stream Analytics job.
Which two actions should you perform? Each correct answer presents part of the solution.
A.
Implement event ordering.
B.
Implement Azure Stream Analytics user-defined functions (UDF).
C.
Implement query parallelization by partitioning the data output.
D.
Scale the SU count for the job up.
E.
Scale the SU count for the job down.
F.
Implement query parallelization by partitioning the data input.
Correct Answer:
DF
D: Scale out the query by allowing the system to process each input partition separately.
F: A Stream Analytics job definition includes inputs, a query, and output. Inputs are where the job reads the data stream from.
Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-parallelization
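A hedged sketch of an embarrassingly parallel job as the linked article describes it, where the input, the query, and the output all align on the same partition key (names are placeholders):
SELECT
    PartitionId,
    COUNT(*) AS eventCount
INTO [partitionedoutput]
FROM [partitionedinput] TIMESTAMP BY eventTime
PARTITION BY PartitionId
GROUP BY PartitionId, TumblingWindow(minute, 1)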
manquak
Highly Voted
8 months, 3 weeks ago
Partition input and output.
REF: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-parallelization
upvoted 25 times
kolakone
8 months, 1 week ago
Agree. And partitioning Input and output with same number of partitions gives the best performance optimization..
upvoted 5 times
Lio95
Highly Voted
8 months ago
No event consumer was mentioned. Therefore, partitioning output is not relevant. Answer is correct
upvoted 11 times
Boompiee
2 weeks, 1 day ago
The stream analytics job is the consumer.
upvoted 1 times
nicolas1999
6 months, 1 week ago
Stream analytics ALWAYS has at least one output. There is no need to mention that. So correct answer is input and output
upvoted 2 times
Andushi
Most Recent
2 weeks, 6 days ago
Selected Answer: CF
I agree with @manquak.
upvoted 1 times
DingDongSingSong
2 months ago
I think the answer is correct. The two things you do are: 1. scale the SU count up and 2. partition the input. If this doesn't work, THEN you could partition the output as well.
upvoted 1 times
Dianova
3 months, 1 week ago
Selected Answer: DF
I think answer is correct, because:
Nothing is mentioned in the question about the output and some type of outputs do not support partitioning (like PowerBI), so it would be risky to
assume that we can partition the output to implement Embarrassingly parallel jobs.
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-parallelization#outputs
Implementing query parallelization by partitioning the data input would be an optimization but the total number of SUs depends on the number of
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-parallelization#calculate-the-maximum-streaming-units-of-a-job
upvoted 7 times
PallaviPatel
3 months, 3 weeks ago
Selected Answer: CF
ignore my previous answer C and F is correct.
upvoted 2 times
PallaviPatel
3 months, 3 weeks ago
Selected Answer: DF
correct
upvoted 1 times
assU2
4 months ago
Selected Answer: CF
Partitioning lets you divide data into subsets based on a partition key. If your input (for example Event Hubs) is partitioned by a key, it is highly
recommended to specify this partition key when adding input to your Stream Analytics job. Scaling a Stream Analytics job takes advantage of
partitions in the input and output.
assU2
4 months ago
Is scaling an optimization??
upvoted 1 times
DE_Sanjay
4 months, 1 week ago
C & F Should be the right answer.
upvoted 1 times
dev2dev
4 months, 1 week ago
Optimization is always about improving performance using existing resources. So definitely not increasing the SKU or SU count.
upvoted 4 times
alex623
4 months, 1 week ago
I think the answer is to partition input and output, because the target is to optimize regardless of computing capacity (#SUs).
upvoted 1 times
DingDongSingSong
2 months ago
Who says optimization is regardless of computing capacity? In fact, increasing computing capacity is ONE of the ways to optimize performance.
upvoted 1 times
Jaws1990
4 months, 2 weeks ago
Selected Answer: CF
Should always aim for embarrassingly parallel jobs (partitioning input, job, and output): https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-parallelization
Upping the computing power of a resource (SUs in this case) should never be classed as 'optimisation' like the question asks.
upvoted 5 times
dev2dev
4 months, 1 week ago
I agree
upvoted 1 times
trietnv
5 months ago
Selected Answer: BF
Choosing the number of required SUs for a particular job depends on the partition configuration for "the inputs" and "the query" that's defined
within the job. The Scale page allows you to set the right number of SUs. It is a best practice to allocate more SUs than needed.
ref: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-streaming-unit-consumption
upvoted 2 times
Sayour
5 months, 1 week ago
The answer is correct: you scale up the Streaming Units and partition the input so the input events are more efficient to process.
upvoted 3 times
m2shines
5 months, 1 week ago
C and F
upvoted 1 times
rashjan
5 months, 2 weeks ago
Selected Answer: CF
Partition input and output is the correct answer even if output is not mentioned because stream analytics always have at least one output.
upvoted 1 times
Question #5 Topic 2
You need to trigger an Azure Data Factory pipeline when a file arrives in an Azure Data Lake Storage Gen2 container.
A.
Microsoft.Sql
B.
Microsoft.Automation
C.
Microsoft.EventGrid
D.
Microsoft.EventHub
Correct Answer:
C
Event-driven architecture (EDA) is a common data integration pattern that involves production, detection, consumption, and reaction to events.
Data integration scenarios often require Data Factory customers to trigger pipelines based on events happening in storage account, such as the
arrival or deletion of a file in Azure
Blob Storage account. Data Factory natively integrates with Azure Event Grid, which lets you trigger pipelines on such events.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-event-trigger
https://docs.microsoft.com/en-us/azure/data-factory/concepts-pipeline-execution-triggers
jv2120
Highly Voted
5 months, 2 weeks ago
Correct. C
Azure Event Hubs – Multiple source big data streaming pipeline (think telemetry data)
medsimus
Highly Voted
8 months ago
Correct
https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-event-trigger?tabs=data-factory
upvoted 11 times
PallaviPatel
Most Recent
3 months, 3 weeks ago
Selected Answer: C
Correct.
upvoted 2 times
romanzdk
4 months, 1 week ago
But EventHub does not support ADLS, only Blob storage
upvoted 1 times
romanzdk
4 months, 1 week ago
https://docs.microsoft.com/en-us/azure/event-grid/overview
upvoted 2 times
Swagat039
4 months, 2 weeks ago
C. is correct.
You need a storage event trigger (for this, the Microsoft.EventGrid resource provider needs to be registered).
upvoted 1 times
Vardhan_Brahmanapally
6 months, 3 weeks ago
Why not eventhub?
upvoted 3 times
wijaz789
8 months, 3 weeks ago
Absolutely correct
upvoted 4 times
Question #6 Topic 2
A.
High Concurrency
B.
automated
C.
interactive
Correct Answer:
B
Automated Databricks clusters are the best for jobs and automated batch processing.
Note: Azure Databricks has two types of clusters: interactive and automated. You use interactive clusters to analyze data collaboratively with
interactive notebooks. You use automated clusters to run fast and robust automated jobs.
This scenario involves running batch job JARs and notebooks on a regular cadence through the Databricks platform.
The suggested best practice is to launch a new cluster for each run of critical jobs. This helps avoid any issues (failures, missing SLA, and so
on) due to an existing workload (noisy neighbor) on a shared cluster.
Reference:
https://docs.microsoft.com/en-us/azure/databricks/clusters/create
https://docs.databricks.com/administration-guide/cloud-configurations/aws/cmbp.html#scenario-3-scheduled-batch-workloads-data-engineers-running-etl-jobs
Podavenna
Highly Voted
8 months, 1 week ago
Correct!
upvoted 8 times
necktru
Most Recent
3 weeks ago
Selected Answer: B
correct
upvoted 1 times
PallaviPatel
3 months, 3 weeks ago
Selected Answer: B
correct.
upvoted 1 times
satyamkishoresingh
8 months, 3 weeks ago
What is an automated cluster?
upvoted 1 times
wijaz789
8 months, 3 weeks ago
There are 2 types of databricks clusters:
Swagat039
4 months, 2 weeks ago
Job cluster
upvoted 1 times
Question #7 Topic 2
HOTSPOT -
You are processing streaming data from vehicles that pass through a toll booth.
You need to use Azure Stream Analytics to return the license plate, vehicle make, and hour the last vehicle passed during each 10-minute window.
How should you complete the query? To answer, select the appropriate options in the answer area.
Hot Area:
Correct Answer:
Box 1: MAX -
The first step on the query finds the maximum time stamp in 10-minute windows, that is the time stamp of the last event for that window. The
second step joins the results of the first query with the original stream to find the event that match the last time stamps in each window.
Query:
WITH LastInWindow AS
(
    SELECT
        MAX(Time) AS LastEventTime
    FROM
        Input TIMESTAMP BY Time
    GROUP BY
        TumblingWindow(minute, 10)
)
SELECT
    Input.License_plate,
    Input.Make,
    Input.Time
FROM
    Input TIMESTAMP BY Time
    INNER JOIN LastInWindow
        ON DATEDIFF(minute, Input, LastInWindow) BETWEEN 0 AND 10
        AND Input.Time = LastInWindow.LastEventTime
Box 2: TumblingWindow -
Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals.
Box 3: DATEDIFF -
DATEDIFF is a date-specific function that compares and returns the time difference between two DateTime fields, for more information, refer to
date functions.
Reference:
https://docs.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics
rikku33
Highly Voted
8 months ago
correct
upvoted 16 times
Jerrylolu
5 months, 1 week ago
Why not Hopping Window??
upvoted 1 times
Wijn4nd
4 months, 2 weeks ago
Because a hopping window can overlap, and we need the data from 10 minute time frames that DON'T overlap
upvoted 3 times
PallaviPatel
Most Recent
3 months, 3 weeks ago
correct.
upvoted 1 times
BusinessApps
3 months, 3 weeks ago
HoppingWindow has a minimum of three arguments whereas TumblingWindow only takes two so considering the solution only has two
arguments it has to be Tumbling
https://docs.microsoft.com/en-us/stream-analytics-query/hopping-window-azure-stream-analytics
https://docs.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics
upvoted 3 times
DrTaz
4 months, 3 weeks ago
Answer is 100% correct.
upvoted 2 times
bubububox
4 months, 3 weeks ago
Definitely hopping, because the event (last car passing) can be part of more than one window. Thus it can't be tumbling.
upvoted 1 times
DrTaz
4 months, 3 weeks ago
the question defines non-overlapping windows, thus tumbling for sure 100%
upvoted 1 times
durak
5 months ago
Why not SELECT COUNT?
upvoted 1 times
DrTaz
4 months, 3 weeks ago
max is used to get "last" event.
upvoted 1 times
irantov
8 months, 1 week ago
I think it is correct. Although, we could also use hoppingwindow. But it would be better to use Tumblingwindow as time events are unique.
upvoted 3 times
TelixFom
7 months, 2 weeks ago
I was thinking TumblingWindow based on the term "each 10-minute window." This implies that the situation is not looking for a rolling max.
upvoted 2 times
elcholo
8 months, 3 weeks ago
Whaaat!
upvoted 4 times
GameLift
8 months, 2 weeks ago
very confusing
upvoted 4 times
Question #8 Topic 2
You have an Azure Data Factory instance that contains two pipelines named Pipeline1 and Pipeline2.
Pipeline1 has the activities shown in the following exhibit.
A.
Pipeline1 and Pipeline2 succeeded.
B.
Pipeline1 and Pipeline2 failed.
C.
Pipeline1 succeeded and Pipeline2 failed.
D.
Pipeline1 failed and Pipeline2 succeeded.
Correct Answer:
A
Activities are linked together via dependencies. A dependency has a condition of one of the following: Succeeded, Failed, Skipped, or
Completed.
Consider Pipeline1:
If we have a pipeline with two activities where Activity2 has a failure dependency on Activity1, the pipeline will not fail just because Activity1
failed. If Activity1 fails and Activity2 succeeds, the pipeline will succeed. This scenario is treated as a try-catch block by Data Factory.
Note:
If we have a pipeline containing Activity1 and Activity2, and Activity2 has a success dependency on Activity1, it will only execute if Activity1 is
successful. In this scenario, if Activity1 fails, the pipeline will fail.
Reference:
https://datasavvy.me/category/azure-data-factory/
echerish
Highly Voted
9 months ago
Pipeline2 executes Pipeline1 and, on success, sets a variable. Since Pipeline1 completes, it's a success.
In Pipeline1 the stored procedure fails and, on failure, sets a variable. Since the failure path is the expected outcome, the run succeeds and sets variable1.
SaferSephy
Highly Voted
8 months, 2 weeks ago
Correct answer is A. The trick is that Pipeline1 only has a Failure dependency between the activities. In this situation the pipeline is marked Succeeded even though the stored procedure failed.
If the Success connection were also linked to a follow-up activity and the SP failed, the pipeline would indeed be marked as failed.
So A.
upvoted 21 times
BK10
3 months, 1 week ago
well explained! A is right
upvoted 1 times
SebK
Most Recent
2 months ago
Selected Answer: A
Correct
upvoted 1 times
AngelJP
2 months, 1 week ago
Selected Answer: A
A correct:
https://docs.microsoft.com/en-us/azure/data-factory/tutorial-pipeline-failure-error-handling#try-catch-block
upvoted 2 times
PallaviPatel
3 months, 3 weeks ago
Selected Answer: A
A correct. I agree with SaferSephy's comments below.
upvoted 2 times
dev2dev
4 months, 1 week ago
A is correct. In Pipeline1 the Set Variable activity is connected to the Failure output. It's like handling exceptions/errors in a programming language. Without
the Failure path, the pipeline would be treated as failed.
upvoted 1 times
VeroDon
4 months, 3 weeks ago
Selected Answer: A
Correct
upvoted 1 times
JSSA
5 months ago
Correct answer is A
upvoted 1 times
rashjan
5 months, 2 weeks ago
Selected Answer: A
correct
upvoted 1 times
medsimus
7 months, 1 week ago
Correct answer, I tested it in Synapse. The first activity failed but the pipeline succeeded.
upvoted 5 times
Oleczek
8 months, 3 weeks ago
Just checked it myself on Azure, answer A is correct.
upvoted 4 times
wdeleersnyder
8 months, 4 weeks ago
I'm not seeing this... what's not being called out is whether Pipeline2 has a dependency on Pipeline1. It happens all the time that two pipelines run:
one succeeds, the other fails.
It should be D in my opinion.
upvoted 4 times
gangstfear
9 months ago
The answer must be B
upvoted 2 times
JohnMasipa
9 months ago
Can someone please explain why the answer is A?
upvoted 1 times
dev2dev
4 months, 1 week ago
If you look at the green and red squares, they are the Success and Failure outputs. In pseudocode, Pipeline1 can be read as "On Error, Set
Variable", whereas Pipeline2 has "On Success, Set Variable".
upvoted 1 times
Question #9 Topic 2
HOTSPOT -
A company plans to use Platform-as-a-Service (PaaS) to create the new data pipeline process. The process must meet the following requirements:
Ingest:
Store:
Which technologies should you use? To answer, select the appropriate options in the answer area.
Hot Area:
Correct Answer:
In Azure, the following services and tools will meet the core requirements for pipeline orchestration, control flow, and data movement: Azure
Data Factory, Oozie on HDInsight, and SQL Server Integration Services (SSIS).
Note: Data at rest includes information that resides in persistent storage on physical media, in any digital format. Microsoft Azure offers a
variety of data storage solutions to meet different needs, including file, disk, blob, and table storage. Microsoft also provides encryption to
protect Azure SQL Database, Azure Cosmos DB, and Azure Data Lake.
Azure Databricks provides enterprise-grade Azure security, including Azure Active Directory integration.
With Azure Databricks, you can set up your Apache Spark environment in minutes, autoscale and collaborate on shared projects in an
interactive workspace.
Azure Databricks supports Python, Scala, R, Java and SQL, as well as data science frameworks and libraries including TensorFlow, PyTorch and
scikit-learn.
Azure Synapse Analytics/ SQL Data Warehouse stores data into relational tables with columnar storage.
Azure SQL Data Warehouse connector now offers efficient and scalable structured streaming write support for SQL Data Warehouse. Access
SQL Data
Warehouse from Azure Databricks using the SQL Data Warehouse connector.
Note: As of November 2019, Azure SQL Data Warehouse is now Azure Synapse Analytics.
Reference:
https://docs.microsoft.com/bs-latn-ba/azure/architecture/data-guide/technology-choices/pipeline-orchestration-data-movement
https://docs.microsoft.com/en-us/azure/azure-databricks/what-is-azure-databricks
Podavenna
Highly Voted
8 months, 1 week ago
Correct solution!
upvoted 26 times
irantov
Highly Voted
8 months, 1 week ago
Correct!
upvoted 10 times
SebK
Most Recent
2 months ago
Correct
upvoted 1 times
Massy
2 months, 1 week ago
For the store, couldn't we also use Azure Blob Storage? It supports all three requirements.
upvoted 1 times
NewTuanAnh
1 month, 2 weeks ago
Because ADLS Gen2 supports big data workloads better.
upvoted 1 times
paras_gadhiya
3 months ago
Correct
upvoted 1 times
PallaviPatel
3 months, 3 weeks ago
Correct solution.
upvoted 1 times
joeljohnrm
4 months, 2 weeks ago
Correct Solution
upvoted 1 times
[Removed]
4 months, 3 weeks ago
For model and serve, HDInsight has all of this. Why Databricks?
upvoted 1 times
rockyc05
3 months ago
Also seamless integration with AAD
upvoted 1 times
rockyc05
3 months ago
Support for SQL
upvoted 1 times
corebit
5 months, 1 week ago
It would be best if people posting answers that go against the popular responses provided some reference instead of blindly saying "false".
upvoted 3 times
Akash0105
6 months, 2 weeks ago
Answer is correct.
Pratikh
6 months, 3 weeks ago
Databricks doesn't support Java, so Prep and Train should be HDInsight Apache Spark cluster.
upvoted 3 times
KOSTA007
6 months, 2 weeks ago
Azure Databricks supports Python, Scala, R, Java, and SQL, as well as data science frameworks and libraries including TensorFlow, PyTorch, and
scikit-learn.
upvoted 9 times
Aslam208
6 months, 3 weeks ago
Databricks does not support Java, so Prepare and Train should be Azure HDInsight Apache Spark cluster.
upvoted 1 times
Aslam208
5 months, 2 weeks ago
I would like to correct my answer here... java is supported in Azure Databricks, therefore Prepare and Train can be done with Azure Databricks
upvoted 3 times
Samanda
7 months, 1 week ago
False. Kafka on HDInsight is the correct option in the last box.
upvoted 1 times
datachamp
8 months, 1 week ago
Is this an ad?
upvoted 9 times
DRAG DROP -
You need to calculate the employee_type value based on the hire_date value.
How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used
once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
Correct Answer:
Box 1: CASE -
CASE evaluates a list of conditions and returns one of multiple possible result expressions.
CASE can be used in any statement or clause that allows a valid expression. For example, you can use CASE in statements such as SELECT,
UPDATE,
DELETE and SET, and in clauses such as select_list, IN, WHERE, ORDER BY, and HAVING.
CASE input_expression
WHEN when_expression THEN result_expression [ ...n ]
[ ELSE else_result_expression ]
END
Box 2: ELSE -
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/language-elements/case-transact-sql
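A minimal sketch of how the completed statement fits together; the table name, column names, and the cut-off date are assumptions, since the exhibit is not reproduced here:
SELECT
    employee_id,
    hire_date,
    CASE
        WHEN hire_date >= '2019-01-01' THEN 'New'   -- assumed cut-off date
        ELSE 'Standard'                             -- ELSE covers every remaining row
    END AS employee_type
FROM dbo.employees;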
MoDar
Highly Voted
8 months, 4 weeks ago
Correct
upvoted 24 times
NewTuanAnh
Most Recent
1 month, 2 weeks ago
the answer is correct
CASE ...
ELSE ...
upvoted 2 times
PallaviPatel
3 months, 3 weeks ago
correct
upvoted 1 times
steeee
9 months ago
The answer is correct. But, is this in the scope of this exam?
upvoted 4 times
anto69
4 months, 2 weeks ago
it seems
upvoted 1 times
mkutts
6 months, 4 weeks ago
Got this question yesterday so yes.
upvoted 5 times
parwa
9 months ago
Makes sense to me; a data engineer should be able to write queries.
upvoted 6 times
DRAG DROP -
You have an Azure Data Lake Storage Gen2 container that contains JSON-formatted files in the following format.
You need to use the serverless SQL pool in WS1 to read the files.
How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used
once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
Correct Answer:
Box 1: openrowset -
The easiest way to see the content of your CSV file is to provide the file URL to the OPENROWSET function and specify the csv FORMAT.
Example:
SELECT *
FROM OPENROWSET(
    BULK 'csv/population/population.csv',
    DATA_SOURCE = 'SqlOnDemandDemo',
    FORMAT = 'CSV',
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
) AS rows
Box 2: openjson -
You can access your JSON files from the Azure File Storage share by using the mapped drive.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-single-csv-file https://docs.microsoft.com/en-us/sql/relational-
databases/json/import-json-documents-into-sql-server
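A hedged sketch of how OPENROWSET and OPENJSON combine in a serverless SQL pool query, following the pattern in the Synapse documentation above; the storage path and the JSON property names are assumptions, since the file exhibit is not reproduced here:
SELECT
    j.EmployeeID,
    j.EmployeeName
FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/employees/*.json',  -- assumed path
        FORMAT = 'CSV',               -- csv format with 0x0b terminators reads each JSON document into one column
        FIELDTERMINATOR = '0x0b',
        FIELDQUOTE = '0x0b'
    )
    WITH (jsonDoc NVARCHAR(MAX)) AS rows
CROSS APPLY OPENJSON(jsonDoc)         -- OPENJSON then parses the document into relational columns
    WITH (
        EmployeeID INT,
        EmployeeName NVARCHAR(200)    -- assumed property names and types
    ) AS j;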
Maunik
Highly Voted
8 months, 2 weeks ago
Answer is correct
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-json-files
upvoted 25 times
Lrng15
8 months ago
answer is correct as per this link
upvoted 1 times
gf2tw
Highly Voted
8 months, 2 weeks ago
The question and answer seem out of place; there was no mention of CSV, and the query in the answer doesn't match up with OPENJSON at all.
upvoted 6 times
dev2dev
4 months, 1 week ago
Look at the WITH clause: the column read with the csv format can contain JSON data.
upvoted 1 times
anto69
4 months, 2 weeks ago
Agree with you.
upvoted 1 times
dead_SQL_pool
6 months ago
Actually, the csv format is specified if you're using OPENROWSET to read json files in Synapse. The OPENJSON is required if you want to parse
data from every array in the document. See the OPENJSON example in this link:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-json-files#query-json-files-using-openjson
upvoted 8 times
gf2tw
5 months, 2 weeks ago
Thanks, you're right:
"The easiest way to see to the content of your JSON file is to provide the file URL to the OPENROWSET function, specify csv FORMAT, and set
values 0x0b for fieldterminator and fieldquote."
upvoted 4 times
gssd4scoder
6 months, 3 weeks ago
agree with you, very misleading
upvoted 1 times
SebK
Most Recent
2 months ago
Correct
upvoted 1 times
PallaviPatel
3 months, 3 weeks ago
correct
upvoted 1 times
DRAG DROP -
You have an Apache Spark DataFrame named temperatures. A sample of the data is shown in the following table.
You need to produce the following table by using a Spark SQL query.
How should you complete the query? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once,
or not at all.
You may need to drag the split bar between panes or scroll to view content.
Correct Answer:
Box 1: PIVOT -
PIVOT rotates a table-valued expression by turning the unique values from one column in the expression into multiple columns in the output.
And PIVOT runs aggregations where they're required on any remaining column values that are wanted in the final output.
Incorrect Answers:
UNPIVOT carries out the opposite operation to PIVOT by rotating columns of a table-valued expression into column values.
Box 2: CAST -
If you want to convert an integer value to a DECIMAL data type in SQL Server, use the CAST() function.
Example:
SELECT CAST(12 AS DECIMAL(7, 2)) AS decimal_value;
decimal_value
12.00
Reference:
https://learnsql.com/cookbook/how-to-convert-an-integer-to-a-decimal-in-sql-server/ https://docs.microsoft.com/en-us/sql/t-sql/queries/from-
using-pivot-and-unpivot
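A minimal Spark SQL sketch of the completed query; the column names (year, month, temp) and the three months shown are assumptions, since the exhibit is not reproduced here:
SELECT * FROM (
    SELECT year, month, temp
    FROM temperatures
)
PIVOT (
    CAST(AVG(temp) AS DECIMAL(4, 1))              -- CAST converts the aggregated value to DECIMAL
    FOR month IN (1 AS JAN, 2 AS FEB, 3 AS MAR)   -- PIVOT turns the month values into columns
);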
SujithaVulchi
Highly Voted
8 months ago
Correct answer: PIVOT and CAST.
upvoted 22 times
ggggyyyyy
Most Recent
8 months ago
Correct. CAST, not CONVERT.
upvoted 3 times
You need to label each pipeline with its main purpose of either ingest, transform, or load. The labels must be available for grouping and filtering
when using the monitoring experience in Data Factory.
A.
a resource tag
B.
a correlation ID
C.
a run group ID
D.
an annotation
Correct Answer:
D
Annotations are additional, informative tags that you can add to specific factory resources: pipelines, datasets, linked services, and triggers. By
adding annotations, you can easily filter and search for specific factory resources.
Reference:
https://www.cathrinewilhelmsen.net/annotations-user-properties-azure-data-factory/
umeshkd05
Highly Voted
8 months, 2 weeks ago
Annotation
upvoted 16 times
anto69
4 months, 1 week ago
Because ADF pipelines are not first-class resources.
upvoted 1 times
AhmedDaffaie
Most Recent
2 months, 1 week ago
What is the difference between resource tags and annotations?
upvoted 1 times
paras_gadhiya
3 months ago
Correct!
upvoted 1 times
PallaviPatel
3 months, 3 weeks ago
Selected Answer: D
correct
upvoted 1 times
huesazo
4 months, 1 week ago
Selected Answer: D
Annotation
upvoted 1 times
aarthy2
7 months, 3 weeks ago
Yes, correct. Annotations provide label functionality that shows up in pipeline monitoring.
upvoted 2 times