HPA - Notes
Introduction
Architecture of the Central Processing Unit
Parts of a CPU:
1. ALU - arithmetic logic unit - executes all calculations within the CPU
2. CU - control unit - coordinates how data moves around and decodes
instructions
Registers are memory locations within the processor itself that work at
very fast speeds. They hold the instructions, addresses and data that are
waiting to be decoded or executed.
1. PC - program counter - stores the address of the -> next <-
instruction in RAM
2. MAR - memory address register - stores the address of the memory
location that is currently being read from or written to
3. MDR - memory data register - stores the data that is to be sent to
or has been fetched from memory
4. CIR - current instruction register - stores the actual instruction that is
being decoded and executed
5. ACC - accumulator - stores the result of calculations
Buses
1. address bus - carries the ADDRESS of the instruction or data
2. data bus - carries data between the processor and the memory
3. control bus - sends control signals such as: memory read,
memory write
• Memory holds both data and instructions
• The arithmetic/logic unit is capable of performing arithmetic and
logic operations on data
• A processor register is a quickly accessible location available to a
digital processor's central processing unit (CPU). Registers usually
consist of a small amount of fast storage, although some registers
have specific hardware functions, and may be read-only or write-only[3]
• The control unit controls the flow of data within the CPU (which is
the Fetch-Execute cycle)
• Input arrives into a CPU via a bus
• Output exits the CPU via a bus
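The way the registers and buses listed above cooperate during the Fetch-Execute cycle can be sketched with a toy simulator. This is only an illustration: the four-instruction set (LOAD, ADD, STORE, HALT), the tuple encoding, and the function name `run` are all hypothetical and not taken from any real processor.

```python
# Toy fetch-decode-execute loop using the registers named in these notes.
# The instruction set (LOAD/ADD/STORE/HALT) is hypothetical.

def run(memory):
    """Run the program stored in `memory`; return (memory, accumulator)."""
    pc = 0    # PC: address of the next instruction
    acc = 0   # ACC: result of calculations
    while True:
        # Fetch: PC -> MAR (address bus), memory[MAR] -> MDR (data bus)
        mar = pc
        mdr = memory[mar]
        cir = mdr          # CIR: instruction being decoded and executed
        pc += 1            # PC now points at the next instruction

        # Decode: the control unit splits the instruction into its parts
        opcode, operand = cir

        # Execute
        if opcode == "LOAD":
            mar = operand
            mdr = memory[mar]
            acc = mdr
        elif opcode == "ADD":
            mar = operand
            mdr = memory[mar]
            acc = acc + mdr    # the ALU performs the arithmetic
        elif opcode == "STORE":
            mar = operand
            mdr = acc
            memory[mar] = mdr
        elif opcode == "HALT":
            return memory, acc
        else:
            raise ValueError(f"unknown opcode: {opcode}")

# Memory holds both instructions (addresses 0-3) and data (addresses 4-6).
program = [
    ("LOAD", 4),
    ("ADD", 5),
    ("STORE", 6),
    ("HALT", None),
    2, 3, 0,
]
final_memory, acc = run(list(program))
print(acc, final_memory[6])
```

Each pass through the loop copies the PC into the MAR before touching memory, which is why the MAR always reflects the location being accessed while the PC already points one instruction ahead.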
Layers of Abstraction
HPA Page 1
Instruction Level Parallelism
Different instructions within a stream can
be executed in parallel
Pipelining, out-of-order execution,
speculative execution, VLIW
Dataflow
Data Parallelism
Different pieces of data can be operated on
in parallel
SIMD: Vector processing, array processing
Systolic arrays, streaming processors
Task Level Parallelism
Different “tasks/threads” can be executed
in parallel
Multithreading
Multiprocessing (multi-core)
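As a rough illustration of the difference between data parallelism and task-level parallelism, the sketch below uses Python's standard `concurrent.futures` module. The helper functions `square`, `word_count` and `total` are made up for this example. In CPython, threads mainly show the structure of the two styles; CPU-bound work would need processes to actually run on several cores.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

def word_count(text):
    return len(text.split())

def total(values):
    return sum(values)

with ThreadPoolExecutor(max_workers=4) as pool:
    # Data parallelism: the same operation applied to different pieces of data.
    squares = list(pool.map(square, [1, 2, 3, 4]))

    # Task parallelism: different tasks submitted to run concurrently.
    f1 = pool.submit(word_count, "different tasks can run in parallel")
    f2 = pool.submit(total, [10, 20, 30])
    task_results = (f1.result(), f2.result())

print(squares, task_results)
```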
Multiple Instruction, Multiple Data (MIMD)
• A type of parallel computer
• Multiple Instruction: Every processor may be executing a different instruction stream
• Multiple Data: Every processor may be working with a different data stream
• Execution can be synchronous or asynchronous, deterministic or non-deterministic
• Currently, the most common type of parallel computer - most modern supercomputers fall into
this category.
• Examples: most current supercomputers, networked parallel computer clusters and "grids",
multi-processor SMP computers, multi-core PCs.
Shared Memory
Advantages
• Global address space provides a user-friendly programming perspective to memory
• Data sharing between tasks is both fast and uniform due to the proximity of memory to
CPUs
Disadvantages
• Primary disadvantage is the lack of scalability between memory and CPUs. Adding
more CPUs can geometrically increase traffic on the shared memory-CPU path and,
for cache coherent systems, geometrically increase the traffic associated with
cache/memory management.
• Programmer responsibility for synchronization constructs that ensure "correct" access
of global memory.
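A minimal sketch of what a synchronization construct for "correct" access to shared memory can look like, using Python threads and a lock from the standard `threading` module. The counter and the function name `add_many` are illustrative only.

```python
import threading

counter = 0                 # shared state, visible to every thread
lock = threading.Lock()

def add_many(n):
    """Increment the shared counter n times, one thread at a time."""
    global counter
    for _ in range(n):
        with lock:          # only one thread may update the counter at once
            counter += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)
```

Without the lock, the read-modify-write in `counter += 1` could interleave between threads and lose updates; placing the lock is the programmer's job, not the hardware's.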
Distributed Memory
General Characteristics
• Like shared memory systems, distributed memory systems vary widely but share a
common characteristic. Distributed memory systems require a communication network
to connect inter-processor memory.
• Processors have their own local memory. Memory addresses in one processor do not
map to another processor, so there is no concept of global address space across all
processors.
• Because each processor has its own local memory, it operates independently.
Changes it makes to its local memory have no effect on the memory of other
processors. Hence, the concept of cache coherency does not apply.
• When a processor needs access to data in another processor, it is usually the task of
the programmer to explicitly define how and when data is communicated.
Synchronization between tasks is likewise the programmer's responsibility.
• The network "fabric" used for data transfer varies widely, though it can be as simple as
Ethernet.
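The explicit communication described above can be sketched as follows. This is an in-process simulation under stated assumptions: two threads stand in for two nodes, a `queue.Queue` stands in for the network, and the names `node`, `link` and `report` are hypothetical. A real distributed memory program would typically use a message-passing library instead.

```python
import queue
import threading

def node(rank, local_data, link, report):
    """A node that works on its own data and communicates only by messages."""
    partial = sum(local_data)          # uses local memory only
    if rank == 0:
        remote = link.get()            # explicit, blocking receive from node 1
        report.put(partial + remote)   # hand the combined result to the driver
    else:
        link.put(partial)              # explicit send to node 0

link = queue.Queue()     # stands in for the network between the two nodes
report = queue.Queue()   # stands in for output returned to the user

n0 = threading.Thread(target=node, args=(0, [1, 2, 3], link, report))
n1 = threading.Thread(target=node, args=(1, [4, 5, 6], link, report))
n0.start()
n1.start()
n0.join()
n1.join()

grand_total = report.get()
print(grand_total)
```

Neither node ever reads the other's list directly; the only way data crosses between them is the send and receive pair the programmer wrote.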
Advantages
• Memory is scalable with the number of processors. Increase the number of processors
and the size of memory increases proportionately.
• Each processor can rapidly access its own memory without interference and without
the overhead incurred with trying to maintain global cache coherency.
• Cost effectiveness: can use commodity, off-the-shelf processors and networking.
Disadvantages
• The programmer is responsible for many of the details associated with data
communication between processors.
• It may be difficult to map existing data structures, based on global memory, to this
memory organization.
• Non-uniform memory access times - data residing on a remote node takes longer to
access than node local data.
Scalability Prospects
High performance computing (HPC) systems face several key scalability challenges as they continue to
grow in size and complexity:
Bandwidth Scaling
Maintaining sufficient memory bandwidth is critical for HPC performance. As core counts increase, the
memory bandwidth per core tends to decrease, leading to memory bandwidth becoming a bottleneck.
Techniques like 3D stacking, wide I/O, and high-bandwidth memory can help increase memory
bandwidth, but scaling bandwidth remains a major challenge.
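The per-core effect can be made concrete with a small calculation. The 200 GB/s total and the core counts below are hypothetical numbers chosen only to show the trend, and the sketch assumes the total bandwidth stays fixed and is shared evenly.

```python
def bandwidth_per_core(total_gb_per_s, cores):
    """Evenly shared memory bandwidth available to each core, in GB/s."""
    return total_gb_per_s / cores

TOTAL_GB_PER_S = 200.0   # hypothetical fixed memory bandwidth

for cores in (8, 16, 32, 64):
    print(cores, bandwidth_per_core(TOTAL_GB_PER_S, cores))
```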
Latency Scaling
Latency between processors and memory is another key challenge. As systems scale, the average
distance between processors and memory increases, leading to higher latency. Non-uniform memory
access (NUMA) designs, which keep data in memory close to the processor that uses it, can help
mitigate this, but latency will continue to be a concern.
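One simple way to reason about this is a weighted average of local and remote access times. The function name and the 100 ns / 300 ns figures are assumptions for illustration, not measurements.

```python
def average_latency_ns(local_fraction, local_ns, remote_ns):
    """Average access time when `local_fraction` of accesses hit local memory."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns

LOCAL_NS = 100.0    # hypothetical local memory latency
REMOTE_NS = 300.0   # hypothetical remote memory latency

for fraction in (0.5, 0.75, 1.0):
    print(fraction, average_latency_ns(fraction, LOCAL_NS, REMOTE_NS))
```

The closer the local fraction gets to 1, the closer the average gets to the local latency, which is why placing data near the processor that uses it matters.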
Cost Scaling
Building and operating large-scale HPC systems is extremely expensive. The costs of the hardware,
power, cooling, and facilities grow rapidly as systems scale. Reducing these costs while maintaining
performance is crucial for the continued growth of HPC.
Physical Scaling
There are physical limits to how large HPC systems can be built. Factors like the size of data centers,
power delivery, and cooling capacity constrain the maximum size. Innovative approaches to system
architecture and cooling will be needed to push the boundaries of physical scaling.
SIMT
• SIMT is the thread equivalent of SIMD. While the latter uses Execution Units or Vector Units, SIMT
extends the idea to threads. In SIMT, multiple threads perform the same instruction on
different data sets. The main advantage of SIMT is that it reduces instruction fetch overhead:
one instruction is fetched and decoded once and then shared by many threads.
• Normally, every time the GPU needs to execute a particular instruction, the data and
instruction are fetched from memory and then decoded and executed. In SIMT, all the data sets
(up to a certain limit) that need the same instruction are fetched together and executed
simultaneously using the various threads available to the processor.
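The saving can be illustrated with a toy model that counts instruction fetches. This is a sequential Python simulation, not GPU code; the function names are hypothetical, and the group size of 32 threads is an assumption that is not stated in these notes.

```python
def scalar_execute(instruction, per_thread_data):
    """Fetch and decode the instruction separately for every data item."""
    fetches = 0
    results = []
    for x in per_thread_data:
        fetches += 1                     # one fetch/decode per item
        results.append(instruction(x))
    return results, fetches

def simt_execute(instruction, per_thread_data):
    """Fetch and decode once, then apply to every thread's own data."""
    fetches = 1                          # shared by the whole group of threads
    results = [instruction(x) for x in per_thread_data]
    return results, fetches

def double(x):
    return 2 * x

GROUP_SIZE = 32                          # assumed number of threads in a group
data = list(range(GROUP_SIZE))

scalar_results, scalar_fetches = scalar_execute(double, data)
simt_results, simt_fetches = simt_execute(double, data)
print(scalar_fetches, simt_fetches, scalar_results == simt_results)
```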
Thread blocks are allocated (dynamically) to Streaming Multiprocessors (SMs) and run to completion.
Threads (warps) within a block run on the same SM, allowing them to share data and synchronize.
Different blocks in a grid cannot synchronize or share on-chip data with each other.
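A sequential sketch of this execution model is given below. It is a simulation in plain Python rather than CUDA: the round-robin placement of blocks on SMs is a simplification of the dynamic allocation, the names `run_grid` and `warps_per_block` are hypothetical, and the warp size of 32 is an assumption not stated in these notes.

```python
WARP_SIZE = 32   # assumed number of threads per warp

def warps_per_block(block_dim):
    """Number of warps needed to hold block_dim threads (rounded up)."""
    return (block_dim + WARP_SIZE - 1) // WARP_SIZE

def run_grid(data, block_dim, num_sms):
    """Sum the data block by block; return per-block sums and SM placement."""
    grid_dim = (len(data) + block_dim - 1) // block_dim   # blocks in the grid
    block_sums = []
    placement = []
    for block_idx in range(grid_dim):
        placement.append(block_idx % num_sms)   # simplified SM assignment
        shared_total = 0                        # visible only inside this block
        for thread_idx in range(block_dim):
            global_idx = block_idx * block_dim + thread_idx
            if global_idx < len(data):          # guard threads past the end
                shared_total += data[global_idx]
        block_sums.append(shared_total)
    return block_sums, placement

block_sums, placement = run_grid(list(range(10)), block_dim=4, num_sms=2)
print(block_sums, placement, warps_per_block(256))
```

Each block only ever touches its own `shared_total`, mirroring the rule that threads share data within a block while blocks stay independent of one another.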
Fermi Architecture
The NVIDIA Fermi GPU architecture, introduced in 2010, represents a significant advancement
in GPU design for high-performance computing. The key aspects of the Fermi architecture and
CUDA execution model are:
SMs
• Each Fermi GPU consists of multiple SMs, with each SM containing 32 CUDA cores
• The SMs are supported by a second-level cache, host interface, GigaThread scheduler, and
multiple DRAM interfaces
Memory Hierarchy
• Fermi introduced improvements to the memory hierarchy compared to previous GPU
architectures
• Each SM has its own L1 cache, instead of multiple SMs sharing a cache
• The memory was upgraded to GDDR5, capable of up to 144 GB/s of bandwidth
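A peak bandwidth figure like this is normally the product of the memory bus width and the data rate. The sketch below shows the arithmetic; the 384-bit bus width and 3.0 GT/s data rate are assumptions chosen to be consistent with the quoted 144 GB/s, not values given in these notes.

```python
def peak_bandwidth_gb_per_s(bus_width_bits, transfers_per_second):
    """Peak memory bandwidth in GB/s for a given bus width and data rate."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_per_second / 1e9

ASSUMED_BUS_WIDTH_BITS = 384        # assumption, not from the notes
ASSUMED_TRANSFERS_PER_S = 3.0e9     # assumption, not from the notes

peak = peak_bandwidth_gb_per_s(ASSUMED_BUS_WIDTH_BITS, ASSUMED_TRANSFERS_PER_S)
print(peak)
```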