ECE 338 Parallel Computer Architecture Spring 2022


ECE 338

Parallel Computer Architecture


Spring 2022

Administrivia
The need for Parallel Computing
Introduction to Parallel Computer Architecture

Nikos Bellas

Electrical and Computer Engineering Department


University of Thessaly

ECE 338: Parallel Computer Architecture 1


Administrivia
Instructor: Nikos Bellas (nbellas@uth.gr)

Class lectures - Labs: Tuesdays 11:00-13:00,
                       Thursdays 11:00-13:00
Location: #305

MS Teams page
Code: 9zj4f0d
To be used exclusively

Office: #422
Phone #: 24210-74704


Course prerequisites
• ECE232: Computer Organization
• Good knowledge of the C language
• Digital design and logic design
• Basic knowledge of Unix/Linux
• Very good command of English, especially the relevant terminology
• An appetite for work and a desire to learn about "Computer Architecture"
• The slides and the terminology will be in English


Curriculum I
• Introductory Material
– Introduction to Parallel Computer Architecture
– The need for parallel architectures
– Latest trends in system design
– Quick reminder of ECE232
• Instruction-Level Parallelism (ILP)
– Stalls and their effects on ILP
– Dynamic instruction scheduling. Tomasulo's algorithm. Reorder Buffers. Speculation
– Branch prediction
– Static instruction scheduling. VLIW technology. Loop scheduling. Software pipelining. Modulo scheduling
– Simultaneous Multithreading (SMT)
– Case study: Multicore architectures from Intel (Intel Core i7 and Itanium)


Curriculum II
• Memory Hierarchy
– Cache optimizations for performance improvements
– Virtual Memory
– DRAM organization and functionality
• Data Level Parallelism (DLP)
– Vector Architectures
– SIMD instructions in General Purpose CPUs
– GPU architectures and CUDA
• Thread Level Parallelism (TLP) - Multicores
– Centralized and Distributed shared memory multiprocessors
– Memory Coherence
– Memory Consistency
– Synchronization



Curriculum III

• Warehouse Scale Computing (WSC)


– Architectures for WSC
– Programming models and Workloads for WSC
– Efficiency and Cost of WSC
– Cloud computing
– Case study: Google WSC

• Various topics (time permitting)


– Domain specific architectures (Google’s TPU, FPGAs)
– Reliable computing. Approximate computing
– Memory Centric Computing
– Machine Learning and Computer Architecture



Textbooks
The course will be based on the 6th edition of the book
"Computer Architecture: A Quantitative Approach",
by J. Hennessy, D. Patterson, Morgan Kaufmann
Publishers, 6th edition, 2019.
Students will receive the 6th edition of the book,
which has been translated into Greek.

Selected publications from "Computer Architecture"
conferences such as ISCA, MICRO, HPCA, etc.

Tip: the internet has a huge number of resources on
"Computer Architecture". Use them.


Grading
• Final Exam: 50%
• You must score at least 5 to pass the class
• Homeworks: 50%
• Note: this is a demanding class. You have to devote a lot of time to keep
up with the lectures and to do the homework, the project, and the paper study.


Let's get started



The Past…

ENIAC, “US Army photo, around 1946”


The present…
Today a $500 laptop has more performance, more main memory, and more disk storage
than a computer bought in 1985 for $1M.

[Pictured: ARMv8 64-bit Microserver (Raspberry PI4, CS Lab@UTH); Android Smartphone;
Playstation 4; Facebook Datacenter; Xilinx Ultrascale+ FPGA]


What is Computer Architecture
• Computer architecture is a description of the
structure and the functionality of a computer system.

• Computer architecture comprises at least three main
subcategories:
• Instruction set architecture (ISA), also known as assembly language, is the lowest
point of control the programmer has over the processor.
• Microarchitecture, or Computer Organization, is a lower-level description of the
system: what the modules of the system are, how they interconnect, and how they
interact. Microarchitecture is beyond the control of the programmer. For example, the
number of functional units in a CPU is a microarchitectural detail.
• System Design, which includes all of the other hardware components within a
computing system, such as system interconnects, memory hierarchies, peripherals, etc.
System design is sometimes visible to the programmer.


ISA vs. Computer Architecture
• Old definition of computer architecture
= instruction set design
– Other aspects of computer design called implementation
– Insinuates implementation is uninteresting or less challenging
• Today computer architects do much more. The technical hurdles are
more challenging than in the early days
• Two very important trends:
– Implementation of microarchitecture has become critical, and more
so as technology scales down
– What really matters now is the whole system, NOT only the CPU
(end-to-end system design). Computer architecture is an integrated
approach
• All of these are on the plate of the computer architect.


So, what does a good computer architect do?
• Exploit quantitative principles of design
– Take Advantage of Parallelism
– Principle of Locality
– Focus on the Common Case
– Amdahl's Law
– The Processor Performance Equation
• Perform careful, quantitative comparisons
– Define, quantify, and summarize relative performance, cost,
dependability, and power dissipation of multiple solutions
• Anticipate and exploit advances in technology
• Define and thoroughly verify well-defined interfaces


1) Taking Advantage of Parallelism
Parallelism in Space
– Multiple CPUs running different threads of the program
– Carry-lookahead adders use parallelism to speed up computing sums from
linear to logarithmic in the number of bits per operand
– Multiple memory banks searched in parallel in set-associative caches
Parallelism in Time
– Overlap instruction execution (pipelining) to reduce the total time to
complete an instruction sequence
– Not every instruction depends on its immediate predecessor ⇒ executing
instructions completely/partially in parallel is possible
– Classic 5-stage pipeline:
1) Instruction Fetch (Ifetch),
2) Register Read (Reg),
3) Execute (ALU),
4) Data Memory Access (Dmem),
5) Register Write (Reg)
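The time benefit of pipelining can be sketched with a simple model (an illustrative sketch, not part of the slides): an n-instruction sequence on a k-stage pipeline finishes in k + (n - 1) cycles instead of n * k, so the speedup approaches k as n grows.

```python
def unpipelined_cycles(n_instructions, n_stages):
    # Each instruction occupies the whole datapath: n * k cycles.
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages):
    # The first instruction fills the pipeline (k cycles); after that,
    # one instruction completes per cycle.
    return n_stages + (n_instructions - 1)

# Four instructions on the classic 5-stage pipeline:
print(unpipelined_cycles(4, 5))  # 20
print(pipelined_cycles(4, 5))    # 8
```

With many instructions in flight, the ideal speedup tends toward the number of stages; hazards (next slides) keep real pipelines below this limit.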



Pipelined Instruction Execution

[Figure: four instructions in program order flowing through the 5-stage pipeline
(Ifetch, Reg, ALU, DMem, Reg) across clock cycles 1-7; each instruction starts one
cycle after its predecessor, so in steady state one instruction completes every cycle.]


Limits to pipelining

• Hazards prevent the next instruction from executing during its
designated clock cycle
– Structural hazards: attempt to use the same hardware to do two different
things at once
– Data hazards: an instruction depends on the result of a prior instruction still in the
pipeline
– Control hazards: caused by the delay between the fetching of instructions
and decisions about changes in control flow (branches and jumps)

[Figure: overlapped 5-stage pipeline diagram (Ifetch, Reg, ALU, DMem, Reg) for four
instructions in program order, illustrating where in-flight instructions can conflict.]
2) The Principle of Locality
• The Principle of Locality:
– Programs access a relatively small portion of the address space at any instant of time.
• Two Different Types of Locality:
– Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced
again soon (e.g., loops, reuse)
– Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are
close by tend to be referenced soon (e.g., straight-line code, array access)
• For the last 30 years, computer architecture has relied on locality for memory
performance

P → $ → MEM   (processor, cache, main memory)
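Locality is what makes the cache in the P → $ → MEM picture pay off. As an illustration (a toy sketch, not part of the course material), a tiny direct-mapped cache model shows why unit-stride array access, which has spatial locality, hits far more often than access strided by a whole block:

```python
def hit_rate(addresses, block_size=8, num_blocks=32):
    """Toy direct-mapped cache: one tag per cache frame."""
    tags = [None] * num_blocks
    hits = 0
    for addr in addresses:
        block = addr // block_size   # which memory block this word lives in
        index = block % num_blocks   # which cache frame that block maps to
        if tags[index] == block:
            hits += 1
        else:
            tags[index] = block      # miss: fill the frame with the new block
    return hits / len(addresses)

sequential = list(range(1024))           # unit stride: 7 of every 8 words hit
strided = list(range(0, 1024 * 8, 8))    # stride = block size: a new block every time
print(hit_rate(sequential))  # 0.875
print(hit_rate(strided))     # 0.0
```

With unit stride, only the first word of each block misses; with block-sized stride there is no spatial reuse at all, so every access goes to memory.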



Levels of the Memory Hierarchy

Level (upper, faster → lower, larger) | Capacity        | Access Time             | Cost         | Xfer Unit (staging)                        | Managed by
CPU Registers                         | 100s Bytes      | 300-500 ps (0.3-0.5 ns) |              | Instr. operands, 1-8 bytes                 | prog./compiler
L1 and L2 Cache                       | 10s-100s KBytes | ~1 ns - ~10 ns          | $1000s/GByte | Blocks, 32-64 bytes (L1), 64-128 bytes (L2)| cache cntl / cache-memory cntl
Main Memory                           | GBytes          | 80 ns - 200 ns          | ~$100/GByte  | Pages, 4K-8K bytes                         | OS
Disk                                  | 10s TBytes      | 10 ms (10,000,000 ns)   | ~$1/GByte    | Files, MBytes                              | user/operator
Storage Servers (in the Net)          | infinite        | sec-min                 | ~$1/GByte    |                                            |


3) Focus on the Common Case
• Common sense guides computer design
– Since we’re in engineering, common sense is valuable
• In making a design trade-off, favor the frequent case over
the infrequent case
– E.g., Instruction fetch and decode unit used more frequently than multiplier, so
optimize it first
• Frequent case is often simpler and can be done faster than
the infrequent case
– E.g., overflow is rare when adding 2 numbers, so improve performance by optimizing
more common case of no overflow
– May slow down overflow, but overall performance improved by optimizing for the
normal case
• What is the frequent case, and how much is performance improved
by making that case faster? ⇒ Amdahl's Law


4) Amdahl's Law

Total execution time T:
|----- Sequential Part -----|----- Parallelizable Part -----|
          (1-α)*T                         α*T

After the enhancement:
          (1-α)*T                  α*T / Speedup_enhanced

ExTime_new = ExTime_old * [ (1-α) + α / Speedup_enhanced ]

Speedup_overall = ExTime_old / ExTime_new
                = 1 / [ (1-α) + α / Speedup_enhanced ]

Best you could ever hope to do (perfect speedup):

Speedup_maximum = 1 / (1-α)
Amdahl's Law example
• New CPU 10X faster
• I/O-bound server, so 60% of the time is spent waiting for I/O
(the enhanced fraction is α = 0.4)

Speedup_overall = 1 / [ (1-α) + α / Speedup_enhanced ]
                = 1 / [ (1 - 0.4) + 0.4/10 ] = 1.56

• Apparently, it's human nature to be attracted by 10X faster,
vs. keeping in perspective that it's just 1.6X faster
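The arithmetic above is easy to mechanize (a small sketch, reproducing the slide's numbers):

```python
def speedup_overall(alpha, speedup_enhanced):
    # Amdahl's Law: alpha is the fraction of time that benefits
    # from the enhancement.
    return 1.0 / ((1.0 - alpha) + alpha / speedup_enhanced)

def speedup_maximum(alpha):
    # Limit as the enhancement becomes infinitely fast.
    return 1.0 / (1.0 - alpha)

# The slide's example: CPU 10x faster, but only 40% of time is CPU-bound.
print(round(speedup_overall(0.4, 10), 2))  # 1.56
print(round(speedup_maximum(0.4), 2))      # 1.67
```

Even an infinitely fast CPU would cap this server at about 1.67x, which is why the sequential (here, I/O) part dominates the outcome.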



Amdahl's Law for different parallelism granularities

Amdahl's law's main idea: the sequential part of an application
limits performance scaling.


5) Processor performance equation
"Iron Law of Performance"

CPU time = Seconds / Program
         = (Instructions / Program) * (Cycles / Instruction) * (Seconds / Cycle)
         =        inst count        *          CPI           *    cycle time

               Inst Count   CPI   Clock Rate
Program            X
Compiler           X        (X)
Inst. Set          X         X
Organization                 X        X
Technology                            X
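The Iron Law is a product of three factors, so it plugs directly into code (an illustrative sketch with made-up example numbers):

```python
def cpu_time(inst_count, cpi, clock_hz):
    # CPU time = instructions * (cycles / instruction) * (seconds / cycle)
    return inst_count * cpi / clock_hz

# e.g., 10^9 instructions at CPI 1.5 on a 2 GHz clock:
print(cpu_time(1e9, 1.5, 2e9))  # 0.75 (seconds)
```

The table shows why this decomposition is useful: each layer of the stack (program, compiler, ISA, organization, technology) moves a different factor of the product.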



What drives new architectures ?
Why are there so many of them?
• Technology
– Determines what is plausible and what is not
– What is cheap and what is expensive in terms of performance, cost
and power
– For example, memory hierarchy (e.g. cache memories) became
necessary due to DRAM technology being much slower than CPU
technology
• Applications
– High volume, mainstream applications drive architectural decisions
– SIMD parallelism driven by a market need for multimedia products



Technology drives new architectures

• General-purpose single cores have stopped their
historic performance scaling
• Why?
– Power consumption
– DRAM access latency
– Diminishing returns of more instruction-level
parallelism


Power consumption problem

Robert H. Dennard (photo from Wikipedia)

1 transistor = 1x energy → (after 2 yrs) 2 transistors = 1x energy → (after 2 yrs) 4 transistors = 1x energy


Dennard Scaling (all dimensions and voltages scaled down by factor S)
L' = L/S
W' = W/S
tox' = tox/S
Xj' = Xj/S   -- junction depth
Vdd' = Vdd/S
Vth' = Vth/S
Na', Nd' = S*Na, S*Nd   -- doping concentrations
Id(lin)' = Id(lin)/S
Id(sat)' = Id(sat)/S
P' = Id' * Vds' = (Id/S) * (Vds/S) = P/S²
Power density' = Power'/Area' = (P/S²) / ((W*L)/S²) = Power/Area
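The derivation can be checked numerically (a quick sketch with normalized, illustrative values): scaling by S divides both per-transistor power and area by S², so power density is unchanged.

```python
def dennard_scale(power, area, S):
    # Voltage and current each drop by S, so P' = P / S^2;
    # W and L each drop by S, so Area' = Area / S^2.
    return power / S**2, area / S**2

P, A = 1.0, 1.0                       # normalized transistor power and area
P2, A2 = dennard_scale(P, A, S=1.4)   # one ~0.7x linear shrink
print(round(P2 / A2, 6))              # 1.0 -> power density unchanged
```

This constant power density is exactly what broke down in the mid-2000s, as the next slides explain.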



What is the problem?

After the mid-2000s:
Transistors are still getting smaller (Moore's law), but
energy increases!
WHY?

1 transistor = 1x energy → (after 2 yrs) 2 transistors > 1x energy → (after 2 yrs) 4 transistors >> 1x energy


Dennard Scaling no more



Technology drives new architectures
» High power dissipation (CV²f) drives lower clock frequency
» Simpler, lower-frequency unicore architecture
» Seek performance through more, simpler, slower cores


Reducing Power: Frequency
Growth in clock frequency has stalled since 2003/04



Reducing Power: Multicores
Multiple cores instead of a single, complex core

Before:                         After:
  one Processor at f              two Processors at f/2, each handling half the Input
  Capacitance = C                 Capacitance = 2.2C
  Clock frequency = f             Clock frequency = f/2
  Voltage = V                     Voltage = 0.6V
  Power = CV²f                    Power = 2.2C * (0.6V)² * (f/2) = 0.396 CV²f

Slower processors allow for a lower Vdd voltage.
Emphasis on parallelism NOT on clock frequency
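Plugging the slide's numbers into the dynamic-power formula confirms the claim (a small sketch in normalized units):

```python
def dynamic_power(C, V, f):
    # Dynamic power of CMOS logic: P = C * V^2 * f
    return C * V**2 * f

before = dynamic_power(C=1.0, V=1.0, f=1.0)   # single core: CV^2 f
after = dynamic_power(C=2.2, V=0.6, f=0.5)    # two slower cores at lower Vdd
print(before, round(after, 3))  # 1.0 0.396
```

The quadratic dependence on V is the key: halving frequency lets Vdd drop to 0.6V, and 0.6² = 0.36 more than pays for doubling the capacitance.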
Reducing Power: Heterogeneous computing
Specialization

• Domain specific processors (Google’s TPU, Security processors)


• FPGAs
Reducing Power: Smart Software
– Turn off the clock (or even Vdd) when cores are idle
(turbo mode in modern multicores)
– Dynamic Voltage-Frequency Scaling (DVFS)
• Under the control of the Operating System
– Low power state for DRAM, disks
– Approximate computing



Technology drives new architectures
» Wire speed scales slower than transistor speed
» Wire delays drive localized computing == multicores
» Super-pipelining to account for data transfer!


Technology drives new architectures
DRAM access latency
– External memory accesses are becoming more and more expensive
– On the order of hundreds of cycles for high-performance processors
– Need for caches or local memories


Technology drives new architectures
• Diminishing returns of instruction-level parallelism

– 50% performance improvement every year in the 80's
– Due to pipelining: 5 CPI → 1 CPI
– Diminishing returns in the 90's
– More complexity to detect the last available ILP
– Superscalar, VLIW, branch prediction
– Due to ILP: 1 CPI → 0.3 CPI
– The multicore era in the 00's
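At a fixed instruction count and clock rate, the Iron Law says performance scales as 1/CPI, so the CPI milestones above translate directly into speedups (a simple sketch):

```python
def speedup_from_cpi(cpi_old, cpi_new):
    # CPU time = IC * CPI * cycle_time, so at fixed IC and clock rate
    # the speedup is just the ratio of CPIs.
    return cpi_old / cpi_new

print(speedup_from_cpi(5, 1))              # 5.0  -> pipelining in the 80's
print(round(speedup_from_cpi(1, 0.3), 2))  # 3.33 -> ILP techniques in the 90's
```

Note the diminishing return: the second, far more complex generation of techniques bought a smaller factor than simple pipelining did.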



Technology drives new architectures
General-purpose unicores have stopped historic performance scaling

[Figure: single-core performance over time, from Hennessy and Patterson,
Computer Architecture: A Quantitative Approach, 6th edition, 2017]


Tremendous change in Design Technology
• Intel 4004 (1971): 4-bit processor,
2250 transistors, 750 KHz,
10 micron PMOS process, 11 mm² chip

• RISC II (1983): 32-bit, 5-stage pipeline,
40,760 transistors, 3 MHz,
3 micron NMOS process, 60 mm² chip

• IBM Power 9 (2017): 24-core, 64-bit, 96
threads, 8 billion transistors, 14nm FinFET
Silicon On Insulator (SOI) process, 695 mm² chip

• State of the art is 7nm (0.007 micron) in 2019


Computer Architecture Today (I)
• Today is a very exciting time to study computer architecture
• Industry is in a large paradigm shift (to heterogeneous or
accelerator-based computing) – many different potential system
designs possible
• Machine Learning applications have rejuvenated hardware design
• Many difficult problems motivating and caused by the shift
– Huge hunger for data and new data-intensive applications (ML, Big Data,
Robotics)
– Power/energy/thermal constraints
– Complexity of design due to Heterogeneity
– Difficulties in technology scaling
– Memory wall/gap
– Reliability problems
– Programmability problems
– Security and privacy issues
Computer Architecture Today (II)
• These problems affect all parts of the computing stack - if we do
not change the way we design systems

[Figure: the computing stack, top to bottom - Problem, Algorithm, Program/Language,
Runtime System (VM, OS, MM), ISA, Microarchitecture, Logic, Circuits, Electrons.
Many new demands come from the top (look up): fast-changing demands and
personalities of users. Many new issues arise at the bottom (look down).]

• No clear, definitive answers to these problems
Computer Architecture Today (III)
• Computing landscape is very different from 10-20 years ago
• Both UP (software and humanity trends) and DOWN (technologies
and their issues), FORWARD and BACKWARD, and the resulting
requirements and constraints

[Figure: heterogeneous processors and accelerators, general-purpose GPUs,
hybrid main memory, persistent memory/storage. Every component and its
interfaces, as well as entire system designs, are being re-examined.]


Future trends
• All exponential laws must come to an end
– Dennard scaling (constant power density)
• Stopped by threshold voltage
– Disk capacity
• 30-100% per year to 5% per year
• Moore’s Law has slowed
• Most visible with DRAM capacity
• Only four foundries left producing state-of-the-art
logic chips
– Taiwan Semi (TSMC), Intel, Samsung, and Global
Foundries (IBM, AMD, etc).
• 7 nm now, 3 nm might be the limit

