Parallel Computing Lesson Plan
Preamble: Parallel computing uses multiple processors, cores, or accelerators working together to
solve problems faster and at larger scales than a single processor allows. With multicore CPUs,
clusters, and GPUs now standard from laptops to data centres, a solid grounding in parallel
hardware, shared-memory and distributed-memory programming models (OpenMP, MPI), GPU
programming (CUDA), and performance analysis is essential for building efficient software.
Prerequisites: Students are expected to have basic proficiency in C programming, along with a
fundamental understanding of computer organization, data structures, and operating systems.
Familiarity with process and thread management will be helpful for grasping concepts related
to memory models and synchronization in parallel computing.
Course Outcomes:
(Indicate levels of learning in accordance with Bloom’s Taxonomy)
Module – 1
Introduction to Parallel programming, Parallel hardware and parallel software:
Classifications of parallel computers, SIMD systems, MIMD systems, Interconnection
networks, Cache coherence, Shared-memory vs distributed memory, Coordinating the
processes/threads, Shared-memory, Distributed-memory. (10 hrs)
Textbook: Ch.1, 2: 2.3, 2.4(2.4.2-2.4.4)
Module – 2
GPU programming, Programming hybrid systems, MIMD systems, GPUs,
Performance: Speedup and efficiency in MIMD systems, Amdahl’s law, Scalability in
MIMD systems, Taking timings of MIMD programs, GPU performance. (10 hrs)
Textbook: Ch.2: 2.4 (2.4.5, 2.4.6), 2.5, 2.6
Module – 3
Distributed memory programming with MPI: MPI functions, The trapezoidal rule in
MPI, Dealing with I/O, Collective communication, MPI-derived datatypes, Performance
evaluation of MPI programs, A parallel sorting algorithm. (10 hrs)
Textbook 1: Ch.3: 3.1 – 3.7
Module – 4
Shared-memory programming with OpenMP: OpenMP pragmas and directives, The
trapezoidal rule, Scope of variables, The reduction clause, loop carried dependency,
scheduling, producers and consumers, Caches, cache coherence and false sharing in
OpenMP, tasking, thread safety. (10 hrs)
Textbook: Ch.5: 5.1 – 5.11
Module – 5
GPU programming with CUDA: GPUs and GPGPU, GPU architectures, Heterogeneous
computing, Threads, blocks, and grids, Nvidia compute capabilities and device
architectures, Vector addition, Returning results from CUDA kernels, CUDA trapezoidal
rule I, CUDA trapezoidal rule II: improving performance, CUDA trapezoidal rule III:
blocks with more than one warp. (10 hrs)
Textbook: Ch.6: 6.1-6.11, 6.13
REFERENCES:
MODULE – I
Lesson Schedule:
Class No.  Portions Covered                                                         Text
1          Introduction to Parallel programming                                     T1
2          Parallel hardware                                                        T1
3          Parallel software: Classifications of parallel computers                 T1
4          SIMD systems, MIMD systems                                               T1
5          Interconnection networks                                                 T1
6          Cache coherence, Shared-memory vs distributed memory                     T1
7          Coordinating the processes/threads, Shared-memory, Distributed-memory    T1
Level 1:
Level 2:
Level 3:
MODULE – II
Lesson Schedule:
Class No.  Portions Covered                                         Text
9          GPU programming                                          T1
10         Programming hybrid systems                               T1
11         MIMD systems, GPUs                                       T1
12         Performance: Speedup and efficiency in MIMD systems      T1
15         GPU performance                                          T1
Level 1:
1. Define MIMD (Multiple Instruction, Multiple Data) system.
2. What is speedup in the context of parallel computing?
3. State Amdahl’s Law.
4. List the key components of a hybrid system involving a CPU and a GPU.
Level 2:
Level 3:
1. Apply Amdahl’s Law to calculate the theoretical speedup when 80% of a task is
parallelized over 4 processors (a worked sketch follows this list).
2. Use timing functions to measure and compare execution time of a sequential vs. parallel
MIMD program.
3. Given a real-world task (e.g., image processing), show how to implement it using GPU
for performance gain.
4. Apply knowledge of hybrid systems to design a simple CPU-GPU cooperative task.
5. Demonstrate performance scaling by running a parallel code with increasing processor
count and plotting speedup.
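A worked sketch for question 1 above, using the usual statement of Amdahl’s Law, where p is the
parallelizable fraction of the work and n the number of processors:

    S(n) = 1 / ((1 - p) + p/n)

With p = 0.8 and n = 4:

    S(4) = 1 / (0.2 + 0.8/4) = 1 / (0.2 + 0.2) = 1 / 0.4 = 2.5

So the theoretical speedup is 2.5x rather than the ideal 4x, because the 20% serial fraction caps
the benefit of adding processors.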
MODULE – III
Lesson Schedule:
Class No.  Portions Covered                                            Text
17         Distributed memory programming with MPI: MPI functions      T1
20         Collective communication                                    T1
21         MPI-derived datatypes                                       T1
Level 1:
1. Name any four basic MPI functions used in every MPI program.
2. Define collective communication in MPI.
3. What is the purpose of MPI_Comm_rank and MPI_Comm_size? (A short sketch follows this list.)
4. List any two MPI collective communication functions.
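A minimal sketch, assuming the standard MPI C API, showing the four calls found in virtually
every MPI program and the roles of MPI_Comm_rank and MPI_Comm_size (questions 1 and 3):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the MPI runtime              */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id, 0 .. size-1     */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes          */

    printf("Hello from process %d of %d\n", rank, size);

    MPI_Finalize();                         /* shut down the MPI runtime          */
    return 0;
}

Typically compiled with mpicc and launched with something like mpiexec -n 4 ./a.out; the exact
commands depend on the local MPI installation.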
Level 2:
1. Explain the role of the trapezoidal rule in demonstrating parallel computation with MPI.
2. Describe how MPI handles communication between processes.
3. Explain how MPI-derived datatypes can help in structuring communication.
4. Summarize the difference between point-to-point and collective communication in MPI.
5. Describe the need for performance evaluation in MPI programs.
Level 3:
1. Apply MPI_Bcast to distribute input data from the root process to all other processes.
2. Implement the trapezoidal rule in MPI to approximate definite integrals in parallel (see the sketch after this list).
3. Use MPI functions to write a parallel program that sorts a set of numbers using a
distributed sorting algorithm.
4. Apply MPI file I/O routines to write output from multiple processes into a common file.
5. Evaluate the performance of an MPI-based matrix multiplication program by measuring
execution time with increasing process count.
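A minimal sketch for question 2, in the spirit of the textbook's approach of splitting the interval
across processes and combining partial sums with a collective reduction; the integrand f, the
interval, and the number of subintervals are illustrative assumptions:

#include <stdio.h>
#include <mpi.h>

/* Illustrative integrand: f(x) = x * x */
static double f(double x) { return x * x; }

/* Serial trapezoidal rule over [left, right] with n subintervals of width h */
static double trap(double left, double right, int n, double h) {
    double sum = (f(left) + f(right)) / 2.0;
    for (int i = 1; i < n; i++)
        sum += f(left + i * h);
    return sum * h;
}

int main(int argc, char *argv[]) {
    int rank, size;
    double a = 0.0, b = 1.0;   /* whole interval (assumed)     */
    int n = 1024;              /* total subintervals (assumed) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double h = (b - a) / n;            /* width of one trapezoid         */
    int local_n = n / size;            /* assumes size divides n evenly  */
    double local_a = a + rank * local_n * h;
    double local_b = local_a + local_n * h;

    double local_sum = trap(local_a, local_b, local_n, h);

    double total = 0.0;
    /* Collective reduction: every partial sum is added onto process 0 */
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Integral of f on [%g, %g] is approximately %.10f\n", a, b, total);

    MPI_Finalize();
    return 0;
}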
MODULE – IV
Lesson Schedule:
Class No.  Portions Covered                                                          Text
27         Shared-memory programming with OpenMP: OpenMP pragmas and directives      T1
28         The trapezoidal rule                                                      T1
29         Scope of variables                                                        T1
30         The reduction clause                                                      T1
31         Loop-carried dependency                                                   T1
32         Scheduling                                                                T1
33         Producers and consumers                                                   T1
34         Caches                                                                    T1
35         Cache coherence and false sharing in OpenMP                               T1
36         Tasking                                                                   T1
37         Thread safety                                                             T1
Level 1:
1. What is OpenMP used for in parallel programming?
2. List any four commonly used OpenMP directives.
3. Define the reduction clause in OpenMP.
4. What is loop-carried dependency?
5. What does the #pragma omp parallel directive do? (A short sketch follows this list.)
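A minimal sketch, assuming a C compiler with OpenMP support (for example gcc -fopenmp),
showing the effect of #pragma omp parallel: the structured block that follows is executed once by
each thread in the team.

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* Every thread in the team runs the block below exactly once */
    #pragma omp parallel
    {
        int id = omp_get_thread_num();         /* this thread's id   */
        int nthreads = omp_get_num_threads();  /* size of the team   */
        printf("Hello from thread %d of %d\n", id, nthreads);
    }
    return 0;
}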
Level 2:
1. Explain the purpose of the trapezoidal rule and how it's implemented in OpenMP.
2. Describe how variable scope (shared vs. private) affects parallel execution in OpenMP.
3. Explain the impact of false sharing on OpenMP program performance.
4. Summarize the scheduling types supported by OpenMP and their differences (a small sketch follows this list).
5. Explain how tasking is used in OpenMP and why it is useful.
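A small sketch for question 4, again assuming OpenMP support in the C compiler, showing how
the schedule clause controls the way loop iterations are handed out to threads (static, dynamic,
and guided are the common kinds; runtime defers the choice to the OMP_SCHEDULE environment
variable):

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* dynamic scheduling: threads grab chunks of 4 iterations as they finish,
       which helps when iteration costs are uneven */
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < 32; i++) {
        printf("iteration %2d done by thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}

Swapping schedule(dynamic, 4) for schedule(static) or schedule(guided) and comparing the
thread-to-iteration assignments is a simple way to illustrate the differences in class.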
Level 3:
1. Apply OpenMP pragmas to parallelize a trapezoidal rule-based numerical integration
program.
2. Write a parallel program using OpenMP to compute the sum of an array using the
reduction clause (see the sketch after this list).
3. Use scheduling strategies to optimize load balancing in a loop-heavy program.
4. Demonstrate false sharing using a parallel array update and suggest a fix.
5. Implement a producer-consumer problem using OpenMP sections or tasks with proper
synchronization.
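A minimal sketch for question 2, assuming OpenMP support in the C compiler; the reduction
clause gives each thread a private copy of sum and combines the copies when the loop ends,
avoiding a data race on the shared accumulator:

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    for (int i = 0; i < N; i++)
        a[i] = 1.0;

    double sum = 0.0;
    /* Each thread accumulates into its own private sum; the private copies
       are added together when the parallel loop finishes */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %.1f (expected %d)\n", sum, N);
    return 0;
}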
MODULE – V
Lesson Schedule:
Class No.  Portions Covered                                                  Text
38         GPU programming with CUDA: GPUs and GPGPU, GPU architectures      T1
39         Heterogeneous computing, Threads, blocks, and grids               T1
40         Nvidia compute capabilities and device architectures              T1
41         Vector addition                                                   T1
42         Returning results from CUDA kernels                               T1
43         CUDA trapezoidal rule I                                           T1
44         CUDA trapezoidal rule II                                          T1
45         Improving performance                                             T1
46         CUDA trapezoidal rule III: blocks with more than one warp         T1
Level 1:
1. List the three main components in CUDA thread hierarchy.
2. What is a CUDA kernel?
3. Name any two Nvidia compute capability versions.
4. Define a CUDA block and a CUDA grid.
Level 2:
1. Explain the difference between CPU and GPU architectures.
2. Describe the structure and role of threads, blocks, and grids in a CUDA program.
3. Summarize the concept of heterogeneous computing and its benefits.
4. Explain how CUDA kernels return results to the host (a short sketch follows this list).
5. Describe how the trapezoidal rule is parallelized in CUDA.
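A minimal sketch for question 4, assuming the standard CUDA runtime API: a kernel cannot
return a value directly, so the usual pattern is to write the result into device memory and copy it
back to the host with cudaMemcpy (the kernel and data here are illustrative):

#include <stdio.h>
#include <cuda_runtime.h>

/* Kernels return void; results come back through device pointers */
__global__ void square(const double *x, double *result) {
    *result = (*x) * (*x);
}

int main(void) {
    double h_x = 3.0, h_result = 0.0;
    double *d_x, *d_result;

    cudaMalloc((void **)&d_x, sizeof(double));
    cudaMalloc((void **)&d_result, sizeof(double));
    cudaMemcpy(d_x, &h_x, sizeof(double), cudaMemcpyHostToDevice);

    square<<<1, 1>>>(d_x, d_result);    /* 1 block, 1 thread */

    /* cudaMemcpy on the default stream waits for the kernel before copying */
    cudaMemcpy(&h_result, d_result, sizeof(double), cudaMemcpyDeviceToHost);

    printf("square(%g) = %g\n", h_x, h_result);

    cudaFree(d_x);
    cudaFree(d_result);
    return 0;
}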
Level 3:
1. Apply CUDA programming to implement a simple vector addition using kernel launch (see the sketch after this list).
2. Use CUDA to implement the trapezoidal rule for numerical integration (Version I).
3. Modify your trapezoidal rule implementation to improve performance by reducing
memory accesses (Version II).
4. Apply thread and block configurations for large input sizes using multiple warps
(Version III).
5. Demonstrate the impact of block size on performance by benchmarking different
configurations in a CUDA program.
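A minimal sketch for question 1, assuming the standard CUDA runtime API; the vector length
and block size are illustrative choices:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Each thread adds one element; threads past the end of the array do nothing */
__global__ void vec_add(const float *x, const float *y, float *z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = x[i] + y[i];
}

int main(void) {
    const int n = 1 << 20;              /* 1M elements (assumed)        */
    size_t bytes = n * sizeof(float);

    float *h_x = (float *)malloc(bytes);
    float *h_y = (float *)malloc(bytes);
    float *h_z = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y, *d_z;
    cudaMalloc((void **)&d_x, bytes);
    cudaMalloc((void **)&d_y, bytes);
    cudaMalloc((void **)&d_z, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    int block = 256;                    /* threads per block (assumed)  */
    int grid = (n + block - 1) / block; /* enough blocks to cover n     */
    vec_add<<<grid, block>>>(d_x, d_y, d_z, n);

    cudaMemcpy(h_z, d_z, bytes, cudaMemcpyDeviceToHost);
    printf("z[0] = %f, z[n-1] = %f\n", h_z[0], h_z[n - 1]);

    cudaFree(d_x); cudaFree(d_y); cudaFree(d_z);
    free(h_x); free(h_y); free(h_z);
    return 0;
}

Varying block (for example 128, 256, 512) and timing the kernel, e.g. with cudaEvent-based
timers, is one way to approach question 5.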
CO1:  2  -  -  2  -  -  -  -  -  -  -  2  2  -  2
Correlated levels:
High (H): 3
Medium (M): 2
Low (L): 1