GPU - Mid - Gradescope

9/19/23, 11:14 PM View Submission | Gradescope

Q1
4 Points

Consider a Reduction kernel that operates on 10,240 elements using a block size of 512 threads. (Assume each block calculates a partialSum as in our assignment 2.)

For an optimized reduction implementation, how many steps are divergent?
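One way to check this is a small Python simulation (not CUDA) of the contiguous "optimized" reduction pattern, assuming the assignment-2 shape where each 512-thread block first folds 1,024 elements; a step is divergent when some warp contains both active and inactive threads:

```python
# Sketch: simulate a contiguous ("optimized") tree reduction for one block of
# 512 threads over 1,024 elements (assumption: the assignment-2 pattern) and
# count steps in which a warp holds a mix of active and inactive threads.
WARP_SIZE = 32

def divergent_steps_optimized(block_size, warp_size=WARP_SIZE):
    divergent = 0
    stride = block_size  # step 1: thread tid adds elements tid and tid + block_size
    while stride >= 1:
        for warp_start in range(0, block_size, warp_size):
            active = [tid < stride for tid in range(warp_start, warp_start + warp_size)]
            if any(active) and not all(active):  # mixed warp -> divergent step
                divergent += 1
                break
        stride //= 2
    return divergent

print(divergent_steps_optimized(512))  # -> 5 (strides 16, 8, 4, 2, 1)
```

While the stride is a multiple of the warp size, every warp is either fully active or fully idle, so only the last five strides diverge.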

Q2
2 Points

Variables stored in registers are visible to:


All threads in a kernel
All warps in a thread block
All threads in a thread block
A single thread only

Q3
2 Points

For our tiled matrix multiplication kernel, if we use a 32x32 tile, what is the reduction of memory bandwidth usage for input matrices M and N?
~1/64 of the original usage
~1/8 of the original usage
~1/16 of the original usage
~1/32 of the original usage
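A back-of-the-envelope check of the bandwidth ratio can be sketched in Python; Width here is an assumed example size, chosen as a multiple of the tile size:

```python
# Sketch: global loads of M and N with and without tiling for a Width x Width
# multiply. The example width of 1024 is an assumption.
def global_loads(width, tile):
    untiled = 2 * width**3                  # each thread reads a full row of M and column of N
    tiled = 2 * width**2 * (width // tile)  # each input element is loaded once per tile phase
    return untiled, tiled

untiled, tiled = global_loads(1024, 32)
print(tiled / untiled)  # -> 0.03125, i.e. ~1/32 of the original usage
```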

https://www.gradescope.com/courses/461455/assignments/2397020/submissions/145232557 1/11
Q4
2 Points

Variables that are declared in Shared Memory are visible to:


threads across kernels
within a single thread only
threads in a thread block
threads across different thread blocks

Q5
2 Points

Which of the following memory types would be the fastest to access?
L2 Cache
Shared Memory
Global Memory
L1 Cache
Register

Q6
4 Points

In the basic Matrix Multiply code, if matrices M, N, and P are of size 100x100, how many total bytes of data are written to matrix P during the lifetime of execution?
Assume matrix multiply operates on double-precision (64-bit) floating point numbers.

80,000

Q7
4 Points


For a tiled single-precision (32-bit) matrix multiplication kernel, assume that each thread block is 32x32 and the system has a DRAM burst size of 128 bytes.
How many DRAM bursts will be delivered to the processor as a result of loading one M-matrix tile by a thread block (during one phase)?
Keep in mind that each single-precision floating point number is four bytes.

32
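The burst arithmetic can be sketched directly, assuming row-major storage so each tile row is contiguous in memory:

```python
# Sketch: DRAM bursts to load one 32x32 tile of 4-byte floats, assuming a
# 128-byte burst and row-major storage (each tile row is contiguous).
TILE = 32
ELEM_BYTES = 4    # single precision
BURST_BYTES = 128

bytes_per_row = TILE * ELEM_BYTES               # 128 bytes: one full burst per row
bursts = TILE * (bytes_per_row // BURST_BYTES)  # 32 rows x 1 burst each
print(bursts)  # -> 32
```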

Q8
3 Points

Assume a CUDA device's SM (streaming multiprocessor) can take up to 2,048 threads and up to 4 thread blocks.
Which of the following block configurations would result in the greatest number of threads in each SM?
(More than one correct answer is possible; select all correct answers.)

2048 threads per block

1,024 threads per block

256 threads per block

512 threads per block
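The resident-thread arithmetic can be cross-checked in Python, assuming the stated 2,048-thread / 4-block SM limits plus CUDA's usual 1,024-threads-per-block cap (the cap is an assumption, not stated in the question):

```python
# Sketch: threads resident per SM for each candidate block size, under an
# assumed 2,048-thread / 4-block SM limit and a 1,024-threads-per-block cap.
MAX_THREADS, MAX_BLOCKS, MAX_BLOCK_SIZE = 2048, 4, 1024

def threads_per_sm(block_size):
    if block_size > MAX_BLOCK_SIZE:
        return 0  # the kernel cannot launch with this block size
    blocks = min(MAX_BLOCKS, MAX_THREADS // block_size)
    return blocks * block_size

for bs in (2048, 1024, 512, 256):
    print(bs, threads_per_sm(bs))
```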

Q9
12 Points

Consider the following code block.
Assume that the variable netID stores the 3-digit numerical portion of your UCR netID.

{
    int index = threadIdx.x + 1;  // PC = A
    if (netID % index == 0)       // PC = B
    {
        // Do something here
    }
    else                          // PC = C
    {
        // Do something else here
    }
    // PC = D
    // Do more stuff here
}

Q9.1
2 Points

Enter your UCR netID here:

sgang011

Q9.2
5 Points

What is the active mask when the warp is executing the if statement basic block?

Assume a warp size of 4 threads and the left-most bit of the active mask is assigned to thread 0 and the right-most bit is thread 3.

1000

Q9.3
5 Points

What is the active mask when the warp is executing the else
statement basic block?

Assume a warp size of 4 threads and the left-most bit of the active mask is assigned to thread 0 and the right-most bit is thread 3.

0111
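The two masks can be cross-checked with a short Python sketch of the per-thread branch outcome, using the netID digits 011 from the answer above:

```python
# Sketch: recompute the active masks for netID 011 and a 4-thread warp,
# mirroring the if/else in the question (left-most mask bit = thread 0).
NET_ID = 11  # numeric portion of netID "sgang011"
WARP = 4

# Thread tid computes index = tid + 1 and takes the if-branch when
# NET_ID % index == 0, the else-branch otherwise.
if_mask   = "".join("1" if NET_ID % (tid + 1) == 0 else "0" for tid in range(WARP))
else_mask = "".join("0" if NET_ID % (tid + 1) == 0 else "1" for tid in range(WARP))
print(if_mask, else_mask)  # -> 1000 0111
```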

Q10
2 Points

cudaMalloc allocates memory in:

Host memory
Shared memory
Device memory
Virtual memory

Q11
2 Points

The active mask is used to keep track of which warp in an SM is active.
True
False

Q12
2 Points

Thread blocks are scheduled onto which hardware unit?


Execution Units
Caches
Streaming Multiprocessors (SM)
Warps

Q13
2 Points

Assume the following simple matrix multiplication kernel

__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
    int Row = blockIdx.y*blockDim.y + threadIdx.y;
    int Col = blockIdx.x*blockDim.x + threadIdx.x;

    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k) {
            Pvalue += M[Row*Width+k] * N[k*Width+Col];
        }
        P[Row*Width+Col] = Pvalue;
    }
}

Which of the following is true?


Accesses to N[] are not coalesced
None of the above
Accesses to P[] are not coalesced
Accesses to M[] are not coalesced
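The access patterns can be examined with a small Python sketch of the global addresses one warp touches at a fixed k and Row (threads in a warp have consecutive threadIdx.x, hence consecutive Col); the width of 1024 is an assumed example size:

```python
# Sketch: addresses touched by one 32-thread warp of the kernel above at a
# fixed k and Row. WIDTH is an assumed example size.
WIDTH, WARP, k, Row = 1024, 32, 0, 0

m_addrs = [Row * WIDTH + k] * WARP                    # independent of Col: one address (broadcast)
n_addrs = [k * WIDTH + Col for Col in range(WARP)]    # consecutive addresses
p_addrs = [Row * WIDTH + Col for Col in range(WARP)]  # consecutive addresses

print(len(set(m_addrs)), n_addrs[1] - n_addrs[0], p_addrs[1] - p_addrs[0])  # -> 1 1 1
```

Within a warp, M is read at a single broadcast address, while N and P are accessed at unit stride.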

Q14
4 Points

Consider a Reduction kernel that operates on 10,240 elements using a block size of 512 threads. (Assume each block calculates a partialSum as in our assignment 2.)

For a naïve reduction implementation, how many steps are divergent?
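As with the optimized case, this can be simulated in Python, assuming the assignment-2 interleaved pattern where thread tid is active when tid % stride == 0 and the stride doubles each step:

```python
# Sketch: simulate an interleaved ("naive") tree reduction for one block of
# 512 threads over 1,024 elements (assumption: the assignment-2 pattern) and
# count steps in which a warp holds a mix of active and inactive threads.
WARP_SIZE = 32

def divergent_steps_naive(block_size, warp_size=WARP_SIZE):
    divergent = 0
    stride = 1
    while stride <= block_size:
        # Thread tid is active iff tid % stride == 0 (interleaved pattern).
        for warp_start in range(0, block_size, warp_size):
            active = [tid % stride == 0 for tid in range(warp_start, warp_start + warp_size)]
            if any(active) and not all(active):  # mixed warp -> divergent step
                divergent += 1
                break
        stride *= 2
    return divergent

print(divergent_steps_naive(512))  # -> 9 (every step after the first)
```

Only the first step, where every thread is active, executes without a mixed warp.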

Q15
4 Points

Assume that a kernel is launched with 100 thread blocks, each with 512 threads.
If a variable, s_var, is declared as a shared memory variable, how many versions of s_var will be created throughout the lifetime of the execution of the kernel?

100

Q16
5 Points

Compare the following code blocks where we calculate the dot product for a row in M and a column in N.
Code block 1 calculates the dot product using a variable Pvalue then writes it to the output matrix P.
Code block 2 calculates the dot product by directly accumulating the results into the output matrix P.

Code block 1:

if ((Row < Width) && (Col < Width)) {
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k) {
        Pvalue += M[Row*Width+k]*N[k*Width+Col];
    }
    P[Row*Width+Col] = Pvalue;
}

Code block 2:

if ((Row < Width) && (Col < Width)) {
    P[Row*Width+Col] = 0;
    for (int k = 0; k < Width; ++k) {
        P[Row*Width+Col] += M[Row*Width+k]*N[k*Width+Col];
    }
}

Which code block will perform better, and why? Keep your answer concise and no longer than 5 sentences.

Code block 1 will be the faster one.
In code block 2, there are Width global-memory writes per thread, one for each loop iteration (plus the initialization).
In code block 1, each thread writes to global memory only once, because Pvalue is accumulated in a register.
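The difference can be made concrete by counting the global-memory writes each block issues per output element; WIDTH here is an assumed example size, not from the exam:

```python
# Sketch: global-memory writes per output element implied by each code block.
WIDTH = 100  # assumed example matrix width

writes_block1 = 1          # Pvalue lives in a register; a single store at the end
writes_block2 = 1 + WIDTH  # the zero-initialization plus one read-modify-write per k

print(writes_block1, writes_block2)  # -> 1 101
```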

Q17
4 Points

For a Reduction kernel that operates on 32,768 elements using a block size of 512 threads, how many steps does the reduction kernel take? (Assume each block calculates a partialSum as in assignment 2.)

10
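The step count can be sanity-checked, assuming each 512-thread block loads two elements per thread and halves the active range each step:

```python
# Sketch: tree-reduction step count, assuming each 512-thread block handles
# 2 * 512 = 1,024 elements (the assignment-2 pattern is an assumption).
import math

BLOCK_SIZE = 512
elements_per_block = 2 * BLOCK_SIZE
steps = int(math.log2(elements_per_block))  # 1024 -> 512 -> ... -> 1
print(steps)  # -> 10
```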

Q18
4 Points

In the basic Matrix Multiply code, if matrices M, N, and P are of size 100x100, how many total bytes of data are transferred from device to host during the lifetime of execution?
Assume matrix multiply operates on double-precision (64-bit) floating point numbers.

80,000
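The arithmetic behind this, assuming only the result matrix P is copied back to the host:

```python
# Sketch: bytes moved device-to-host, assuming only the 100x100 result matrix
# P is copied back and each element is a 64-bit (8-byte) double.
ROWS = COLS = 100
ELEM_BYTES = 8  # double precision

print(ROWS * COLS * ELEM_BYTES)  # -> 80000
```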

Q19
2 Points

Which of the following boundary condition checks will cause warp divergence? (Assume a warp size of 32.)


if(threadIdx.x > 5)
if(gridDim.x > 5)
if(blockDim.x > 5)
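A quick Python sketch shows which predicate can vary inside a warp; gridDim.x and blockDim.x are uniform across a launch (the example values are assumptions), so only the threadIdx.x test can split a warp:

```python
# Sketch: evaluate each predicate per thread in a 32-thread warp and see
# whether the outcomes differ. Launch-configuration values are assumed.
WARP = 32
gridDim_x, blockDim_x = 16, 128  # assumed launch configuration

per_thread = [{"threadIdx.x > 5": tid > 5,
               "gridDim.x > 5": gridDim_x > 5,
               "blockDim.x > 5": blockDim_x > 5} for tid in range(WARP)]

for cond in per_thread[0]:
    outcomes = {t[cond] for t in per_thread}
    print(cond, "divergent" if len(outcomes) > 1 else "uniform")
```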

Q20
2 Points

If we want to copy 1000 bytes of data from host array h_A (h_A is a pointer to element 0 of the source array) to device array d_A (d_A is a pointer to element 0 of the destination array), what would be an appropriate API call for this in CUDA?
cudaMemcpy(1000, d_A, h_A, cudaMemcpyHostToDevice);
cudaMemcpy(1000, h_A, d_A, cudaMemcpyHostToDevice);
cudaMemcpy(d_A, h_A, 1000, cudaMemcpyHostToDevice);
cudaMemcpy(h_A, d_A, 1000, cudaMemcpyDeviceToHost);

Q21
2 Points

Which of the following memory types would be the slowest to access?
Shared Memory
Global Memory
L2 Cache
L1 Cache
Register


Midterm Exam  Graded

Student
Sree Charan Reddy Gangireddy

Total Points
66 / 70 pts

Question 1
(no title) 4 / 4 pts

Question 2
(no title) 2 / 2 pts

Question 3
(no title) 2 / 2 pts

Question 4
(no title) 2 / 2 pts

Question 5
(no title) 2 / 2 pts

Question 6
(no title) 0 / 4 pts

Question 7
(no title) 4 / 4 pts

Question 8
(no title) 3 / 3 pts

Question 9
(no title) 12 / 12 pts

9.1 (no title) 2 / 2 pts

9.2 (no title) 5 / 5 pts

9.3 (no title) 5 / 5 pts

Question 10
(no title) 2 / 2 pts

Question 11


(no title) 2 / 2 pts

Question 12
(no title) 2 / 2 pts

Question 13
(no title) 2 / 2 pts

Question 14
(no title) 4 / 4 pts

Question 15
(no title) 4 / 4 pts

Question 16
(no title) 5 / 5 pts

Question 17
(no title) 4 / 4 pts

Question 18
(no title) 4 / 4 pts

Question 19
(no title) 2 / 2 pts

Question 20
(no title) 2 / 2 pts

Question 21
(no title) 2 / 2 pts

