GPU - Mid - Gradescope

9/19/23, 11:14 PM View Submission | Gradescope

Q1
4 Points

Consider a Reduction kernel that operates on 10,240 elements using a block size of 512 threads. (Assume each block calculates a partialSum as in our assignment 2.)

For an optimized reduction implementation, how many steps are divergent?
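One way to check this is a small Python simulation (not CUDA) of the contiguous "optimized" reduction pattern, assuming the assignment-2 shape where each 512-thread block first folds 1,024 elements; a step is divergent when some warp contains both active and inactive threads:

```python
# Sketch: simulate a contiguous ("optimized") tree reduction for one block of
# 512 threads over 1,024 elements (assumption: the assignment-2 pattern) and
# count steps in which a warp holds a mix of active and inactive threads.
WARP_SIZE = 32

def divergent_steps_optimized(block_size, warp_size=WARP_SIZE):
    divergent = 0
    stride = block_size  # step 1: thread tid adds elements tid and tid + block_size
    while stride >= 1:
        for warp_start in range(0, block_size, warp_size):
            active = [tid < stride for tid in range(warp_start, warp_start + warp_size)]
            if any(active) and not all(active):  # mixed warp -> divergent step
                divergent += 1
                break
        stride //= 2
    return divergent

print(divergent_steps_optimized(512))  # -> 5 (strides 16, 8, 4, 2, 1)
```

While the stride is a multiple of the warp size, every warp is either fully active or fully idle, so only the last five strides diverge.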

Q2
2 Points

Variables stored in registers are visible to:


All threads in a kernel
All warps in a thread block
All threads in a thread block
A single thread only

Q3
2 Points

For our tiled matrix multiplication kernel, if we use a 32x32 tile, what is the reduction of memory bandwidth usage for input matrices M and N?
~1/64 of the original usage
~1/8 of the original usage
~1/16 of the original usage
~1/32 of the original usage
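A back-of-the-envelope check of the bandwidth ratio can be sketched in Python; Width here is an assumed example size, chosen as a multiple of the tile size:

```python
# Sketch: global loads of M and N with and without tiling for a Width x Width
# multiply. The example width of 1024 is an assumption.
def global_loads(width, tile):
    untiled = 2 * width**3                  # each thread reads a full row of M and column of N
    tiled = 2 * width**2 * (width // tile)  # each input element is loaded once per tile phase
    return untiled, tiled

untiled, tiled = global_loads(1024, 32)
print(tiled / untiled)  # -> 0.03125, i.e. ~1/32 of the original usage
```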

https://www.gradescope.com/courses/461455/assignments/2397020/submissions/145232557 1/11
Q4
2 Points

Variables that are declared in Shared Memory are visible to:


threads across kernels
within a single thread only
threads in a thread block
threads across different thread blocks

Q5
2 Points

Which of the following memory types would be the fastest to access?
L2 Cache
Shared Memory
Global Memory
L1 Cache
Register

Q6
4 Points

In the basic Matrix Multiply code, if matrices M, N, and P are of size 100x100, how many total bytes of data are written to matrix P during the lifetime of execution?
Assume matrix multiply operates on double-precision (64-bit) floating point numbers.

80,000

Q7
4 Points


For a tiled single-precision (32-bit) matrix multiplication kernel, assume that each thread block is 32x32 and the system has a DRAM burst size of 128 bytes.
How many DRAM bursts will be delivered to the processor as a result of loading one M-matrix tile by a thread block (during one phase)?
Keep in mind that each single-precision floating point number is four bytes.

32
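The burst arithmetic can be sketched directly, assuming row-major storage so each tile row is contiguous in memory:

```python
# Sketch: DRAM bursts to load one 32x32 tile of 4-byte floats, assuming a
# 128-byte burst and row-major storage (each tile row is contiguous).
TILE = 32
ELEM_BYTES = 4    # single precision
BURST_BYTES = 128

bytes_per_row = TILE * ELEM_BYTES               # 128 bytes: one full burst per row
bursts = TILE * (bytes_per_row // BURST_BYTES)  # 32 rows x 1 burst each
print(bursts)  # -> 32
```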

Q8
3 Points

Assume a CUDA device's SM (streaming multiprocessor) can take up to 2,048 threads and up to 4 thread blocks.
Which of the following block configurations would result in the greatest number of threads in each SM?
(More than one correct answer is possible; select all correct answers.)

2048 threads per block

1,024 threads per block

256 threads per block

512 threads per block
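The resident-thread arithmetic can be cross-checked in Python, assuming the stated 2,048-thread / 4-block SM limits plus CUDA's usual 1,024-threads-per-block cap (the cap is an assumption, not stated in the question):

```python
# Sketch: threads resident per SM for each candidate block size, under an
# assumed 2,048-thread / 4-block SM limit and a 1,024-threads-per-block cap.
MAX_THREADS, MAX_BLOCKS, MAX_BLOCK_SIZE = 2048, 4, 1024

def threads_per_sm(block_size):
    if block_size > MAX_BLOCK_SIZE:
        return 0  # the kernel cannot launch with this block size
    blocks = min(MAX_BLOCKS, MAX_THREADS // block_size)
    return blocks * block_size

for bs in (2048, 1024, 512, 256):
    print(bs, threads_per_sm(bs))
```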

Q9
12 Points

Consider the following code block.
Assume that the variable netID stores the 3-digit numerical portion of your UCR netID.

{
    int index = threadIdx.x + 1;  // PC = A
    if (netID % index == 0)       // PC = B
    {
        // Do something here
    }
    else                          // PC = C
    {
        // Do something else here
    }
    // PC = D
    // Do more stuff here
}

Q9.1
2 Points

Enter your UCR netID here:

sgang011

Q9.2
5 Points

What is the active mask when the warp is executing the if statement basic block?

Assume a warp size of 4 threads and the left-most bit of the active mask is assigned to thread 0 and the right-most bit is thread 3.

1000

Q9.3
5 Points

What is the active mask when the warp is executing the else
statement basic block?

Assume a warp size of 4 threads and the left-most bit of the active mask is assigned to thread 0 and the right-most bit is thread 3.

0111
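The two masks can be cross-checked with a short Python sketch of the per-thread branch outcome, using the netID digits 011 from the answer above:

```python
# Sketch: recompute the active masks for netID 011 and a 4-thread warp,
# mirroring the if/else in the question (left-most mask bit = thread 0).
NET_ID = 11  # numeric portion of netID "sgang011"
WARP = 4

# Thread tid computes index = tid + 1 and takes the if-branch when
# NET_ID % index == 0, the else-branch otherwise.
if_mask   = "".join("1" if NET_ID % (tid + 1) == 0 else "0" for tid in range(WARP))
else_mask = "".join("0" if NET_ID % (tid + 1) == 0 else "1" for tid in range(WARP))
print(if_mask, else_mask)  # -> 1000 0111
```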

Q10
2 Points

cudaMalloc allocates memory in:

Host memory
Shared memory
Device memory
Virtual memory

Q11
2 Points

The active mask is used to keep track of which warp in an SM is active.
True
False

Q12
2 Points

Thread blocks are scheduled onto which hardware unit?


Execution Units
Caches
Streaming Multiprocessors (SM)
Warps

Q13
2 Points

Assume the following simple matrix multiplication kernel

__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
    int Row = blockIdx.y*blockDim.y + threadIdx.y;
    int Col = blockIdx.x*blockDim.x + threadIdx.x;

    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k) {
            Pvalue += M[Row*Width+k] * N[k*Width+Col];
        }
        P[Row*Width+Col] = Pvalue;
    }
}

Which of the following is true?


Accesses to N[] are not coalesced
None of the above
Accesses to P[] are not coalesced
Accesses to M[] are not coalesced
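The access patterns can be examined with a small Python sketch of the global addresses one warp touches at a fixed k and Row (threads in a warp have consecutive threadIdx.x, hence consecutive Col); the width of 1024 is an assumed example size:

```python
# Sketch: addresses touched by one 32-thread warp of the kernel above at a
# fixed k and Row. WIDTH is an assumed example size.
WIDTH, WARP, k, Row = 1024, 32, 0, 0

m_addrs = [Row * WIDTH + k] * WARP                    # independent of Col: one address (broadcast)
n_addrs = [k * WIDTH + Col for Col in range(WARP)]    # consecutive addresses
p_addrs = [Row * WIDTH + Col for Col in range(WARP)]  # consecutive addresses

print(len(set(m_addrs)), n_addrs[1] - n_addrs[0], p_addrs[1] - p_addrs[0])  # -> 1 1 1
```

Within a warp, M is read at a single broadcast address, while N and P are accessed at unit stride.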

Q14
4 Points

Consider a Reduction kernel that operates on 10,240 elements using a block size of 512 threads. (Assume each block calculates a partialSum as in our assignment 2.)

For a naïve reduction implementation, how many steps are divergent?
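As with the optimized case, this can be simulated in Python, assuming the assignment-2 interleaved pattern where thread tid is active when tid % stride == 0 and the stride doubles each step:

```python
# Sketch: simulate an interleaved ("naive") tree reduction for one block of
# 512 threads over 1,024 elements (assumption: the assignment-2 pattern) and
# count steps in which a warp holds a mix of active and inactive threads.
WARP_SIZE = 32

def divergent_steps_naive(block_size, warp_size=WARP_SIZE):
    divergent = 0
    stride = 1
    while stride <= block_size:
        # Thread tid is active iff tid % stride == 0 (interleaved pattern).
        for warp_start in range(0, block_size, warp_size):
            active = [tid % stride == 0 for tid in range(warp_start, warp_start + warp_size)]
            if any(active) and not all(active):  # mixed warp -> divergent step
                divergent += 1
                break
        stride *= 2
    return divergent

print(divergent_steps_naive(512))  # -> 9 (every step after the first)
```

Only the first step, where every thread is active, executes without a mixed warp.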

Q15
4 Points

Assume that a kernel is launched with 100 thread blocks, each with 512 threads.
If a variable, s_var, is declared as a shared memory variable, how many versions of s_var will be created throughout the lifetime of the execution of the kernel?

100

Q16
5 Points

Compare the following code blocks where we calculate the dot product for a row in M and a column in N.
Code block 1 calculates the dot product using a variable Pvalue then writes it to the output matrix P.
Code block 2 calculates the dot product by directly accumulating the results into the output matrix P.

Code block 1:

if ((Row < Width) && (Col < Width)) {
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k) {
        Pvalue += M[Row*Width+k]*N[k*Width+Col];
    }
    P[Row*Width+Col] = Pvalue;
}

Code block 2:

if ((Row < Width) && (Col < Width)) {
    P[Row*Width+Col] = 0;
    for (int k = 0; k < Width; ++k) {
        P[Row*Width+Col] += M[Row*Width+k]*N[k*Width+Col];
    }
}

Which code block will perform better, and why? Keep your answer concise and no longer than 5 sentences.

Code block 1 will be the faster one.
In code block 2, there are Width global-memory writes per thread, one for each loop iteration (plus the initialization).
In code block 1, each thread writes to global memory only once, because Pvalue is accumulated in a register.
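The difference can be made concrete by counting the global-memory writes each block issues per output element; WIDTH here is an assumed example size, not from the exam:

```python
# Sketch: global-memory writes per output element implied by each code block.
WIDTH = 100  # assumed example matrix width

writes_block1 = 1          # Pvalue lives in a register; a single store at the end
writes_block2 = 1 + WIDTH  # the zero-initialization plus one read-modify-write per k

print(writes_block1, writes_block2)  # -> 1 101
```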

Q17
4 Points

For a Reduction kernel that operates on 32,768 elements using a block size of 512 threads, how many steps does the reduction kernel take? (Assume each block calculates a partialSum as in assignment 2.)

10
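The step count can be sanity-checked, assuming each 512-thread block loads two elements per thread and halves the active range each step:

```python
# Sketch: tree-reduction step count, assuming each 512-thread block handles
# 2 * 512 = 1,024 elements (the assignment-2 pattern is an assumption).
import math

BLOCK_SIZE = 512
elements_per_block = 2 * BLOCK_SIZE
steps = int(math.log2(elements_per_block))  # 1024 -> 512 -> ... -> 1
print(steps)  # -> 10
```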

Q18
4 Points

In the basic Matrix Multiply code, if matrices M, N, and P are of size 100x100, how many total bytes of data are transferred from device to host during the lifetime of execution?
Assume matrix multiply operates on double-precision (64-bit) floating point numbers.

80,000
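The arithmetic behind this, assuming only the result matrix P is copied back to the host:

```python
# Sketch: bytes moved device-to-host, assuming only the 100x100 result matrix
# P is copied back and each element is a 64-bit (8-byte) double.
ROWS = COLS = 100
ELEM_BYTES = 8  # double precision

print(ROWS * COLS * ELEM_BYTES)  # -> 80000
```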

Q19
2 Points

Which of the following boundary condition checks will cause warp divergence? (Assume a warp size of 32.)


if(threadIdx.x > 5)
if(gridDim.x > 5)
if(blockDim.x > 5)
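A quick Python sketch shows which predicate can vary inside a warp; gridDim.x and blockDim.x are uniform across a launch (the example values are assumptions), so only the threadIdx.x test can split a warp:

```python
# Sketch: evaluate each predicate per thread in a 32-thread warp and see
# whether the outcomes differ. Launch-configuration values are assumed.
WARP = 32
gridDim_x, blockDim_x = 16, 128  # assumed launch configuration

per_thread = [{"threadIdx.x > 5": tid > 5,
               "gridDim.x > 5": gridDim_x > 5,
               "blockDim.x > 5": blockDim_x > 5} for tid in range(WARP)]

for cond in per_thread[0]:
    outcomes = {t[cond] for t in per_thread}
    print(cond, "divergent" if len(outcomes) > 1 else "uniform")
```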

Q20
2 Points

If we want to copy 1000 bytes of data from host array h_A (h_A is a pointer to element 0 of the source array) to device array d_A (d_A is a pointer to element 0 of the destination array), what would be an appropriate API call for this in CUDA?
cudaMemcpy(1000, d_A, h_A, cudaMemcpyHostToDevice);
cudaMemcpy(1000, h_A, d_A, cudaMemcpyHostToDevice);
cudaMemcpy(d_A, h_A, 1000, cudaMemcpyHostToDevice);
cudaMemcpy(h_A, d_A, 1000, cudaMemcpyDeviceToHost);

Q21
2 Points

Which of the following memory types would be the slowest to access?
Shared Memory
Global Memory
L2 Cache
L1 Cache
Register


Midterm Exam  Graded

Student
Sree Charan Reddy Gangireddy

Total Points
66 / 70 pts

Question 1
(no title) 4 / 4 pts

Question 2
(no title) 2 / 2 pts

Question 3
(no title) 2 / 2 pts

Question 4
(no title) 2 / 2 pts

Question 5
(no title) 2 / 2 pts

Question 6
(no title) 0 / 4 pts

Question 7
(no title) 4 / 4 pts

Question 8
(no title) 3 / 3 pts

Question 9
(no title) 12 / 12 pts

9.1 (no title) 2 / 2 pts

9.2 (no title) 5 / 5 pts

9.3 (no title) 5 / 5 pts

Question 10
(no title) 2 / 2 pts

Question 11


(no title) 2 / 2 pts

Question 12
(no title) 2 / 2 pts

Question 13
(no title) 2 / 2 pts

Question 14
(no title) 4 / 4 pts

Question 15
(no title) 4 / 4 pts

Question 16
(no title) 5 / 5 pts

Question 17
(no title) 4 / 4 pts

Question 18
(no title) 4 / 4 pts

Question 19
(no title) 2 / 2 pts

Question 20
(no title) 2 / 2 pts

Question 21
(no title) 2 / 2 pts

