L2-CA-Background and Motivation


Computer Architecture:

Background and Motivation


Reference: Behrooz Parhami, “Computer Architecture: From Microprocessors To Supercomputers”, Oxford
Univ. Press, New York, 2005.
Contents
1. Combinational Digital Circuits
2. Digital Circuits with Memory
3. Computer System Technology
4. Computer Performance

Slide 2
1 Combinational Digital Circuits

1.1 Signals, Logic Operators, and Gates


1.2 Boolean Functions and Expressions
1.3 Designing Gate Networks
1.4 Useful Combinational Parts
1.5 Programmable Combinational Parts
1.6 Timing and Circuit Considerations

Slide 3
1.1 Signals, Logic Operators, and Gates
Name                    NOT           AND                  OR                        XOR
Graphical symbol        (gate symbols not reproduced)
Operator sign           x′            xy                   x ∨ y                     x ⊕ y
  and alternate(s)      ¬x or x̄       x ∧ y                x + y                     x ≢ y
Output is 1 iff:        Input is 0    Both inputs are 1s   At least one input is 1   Inputs are not equal
Arithmetic expression   1 − x         x × y or xy          x + y − xy                x + y − 2xy

Add: 1 + 1 = 10
Figure 1.1 Some basic elements of digital logic circuits, with operator signs.

Slide 4
The Arithmetic Substitution Method
z′ = 1 − z                 NOT converted to arithmetic form
xy                         AND same as multiplication
                           (when doing the algebra, set z^k = z)
x ∨ y = x + y − xy         OR converted to arithmetic form
x ⊕ y = x + y − 2xy        XOR converted to arithmetic form

Example: Prove the identity xyz ∨ x′ ∨ y′ ∨ z′ ≡? 1

LHS = [xyz ∨ x′] ∨ [y′ ∨ z′]
    = [xyz + 1 − x − (1 − x)xyz] ∨ [1 − y + 1 − z − (1 − y)(1 − z)]
    = [xyz + 1 − x] ∨ [1 − yz]
    = (xyz + 1 − x) + (1 − yz) − (xyz + 1 − x)(1 − yz)
    = 1 + xy²z² − xyz
    = 1 = RHS
(In the arithmetic forms, + is addition, not logical OR.)
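The arithmetic substitution method can also be checked by brute force. The following is a minimal sketch, not part of the slides: the helper names NOT, AND, OR, XOR, and lhs are illustrative. It evaluates the arithmetic forms on every 0/1 assignment and confirms that the example expression always equals 1.

```python
from itertools import product

def NOT(z):
    return 1 - z              # z' = 1 - z

def AND(x, y):
    return x * y              # same as multiplication

def OR(x, y):
    return x + y - x * y      # arithmetic form of logical OR

def XOR(x, y):
    return x + y - 2 * x * y  # arithmetic form of XOR

def lhs(x, y, z):
    # xyz OR x' OR y' OR z', grouped as on the slide
    return OR(OR(AND(AND(x, y), z), NOT(x)), OR(NOT(y), NOT(z)))

# Exhaustive check over the 8 possible input combinations
identity_holds = all(lhs(x, y, z) == 1 for x, y, z in product((0, 1), repeat=3))
print(identity_holds)
```

Because the inputs are restricted to 0 and 1, the rule z^k = z is applied automatically by ordinary integer arithmetic.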

Slide 5
Variations in Gate Symbols

AND OR NAND NOR XNOR

Bubble = Inverter
Figure 1.2 Gates with more than two inputs and/or with
inverted signals at input or output.

Slide 6
Gates as Control Elements

[Figure: (a) AND gate for controlled transfer: with enable/pass signal e and data in x, data out is x or 0. (b) Tristate buffer: data out is x or "high impedance". (c) Model for AND switch. (d) Model for tristate buffer.]

Figure 1.3 An AND gate and a tristate buffer act as controlled switches or valves. An inverting buffer is logically the same as a NOT gate.

Slide 7
Wired OR and Bus Connections
[Figure: (a) Wired OR of product terms: inputs x, y, z with enables ex, ey, ez; data out is x, y, z, or 0. (b) Wired OR of tristate outputs: data out is x, y, z, or high impedance.]

Figure 1.4 Wired OR allows tying together of several controlled signals.

Slide 8
Control/Data Signals and Signal Bundles

[Figure: (a) 8 NOR gates, (b) 32 AND gates with a shared Enable input, (c) k XOR gates with a shared Compl input. Slashes on the lines mark bundle widths of 8, 32, and k.]

Figure 1.5 Arrays of logic gates represented by a single gate symbol.

Slide 9
1.2 Boolean Functions and Expressions
Ways of specifying a logic function

• Truth table: 2^n rows, "don't-care" in input or output

• Logic expression: w′ (x ∨ y ∨ z), product-of-sums, sum-of-products, equivalent expressions

• Word statement: Alarm will sound if the door is opened while the security system is engaged, or when the smoke detector is triggered

• Logic circuit diagram: Synthesis vs analysis

Slide 10
Manipulating Logic Expressions
Table 1.2 Laws (basic identities) of Boolean algebra.

Name of law     OR version                        AND version
Identity        x ∨ 0 = x                         x 1 = x
One/Zero        x ∨ 1 = 1                         x 0 = 0
Idempotent      x ∨ x = x                         x x = x
Inverse         x ∨ x′ = 1                        x x′ = 0
Commutative     x ∨ y = y ∨ x                     x y = y x
Associative     (x ∨ y) ∨ z = x ∨ (y ∨ z)         (x y) z = x (y z)
Distributive    x ∨ (y z) = (x ∨ y) (x ∨ z)       x (y ∨ z) = (x y) ∨ (x z)
DeMorgan's      (x ∨ y)′ = x′ y′                  (x y)′ = x′ ∨ y′

Slide 11
Proving the Equivalence of Logic Expressions
Example 1.1
• Truth-table method: Exhaustive verification

• Arithmetic substitution
    x ∨ y = x + y − xy
    x ⊕ y = x + y − 2xy
  Example: x ⊕ y ≡? x′y ∨ xy′
    x + y − 2xy ≡? (1 − x)y + x(1 − y) − (1 − x)yx(1 − y)

• Case analysis: two cases, x = 0 or x = 1

• Logic expression manipulation

Slide 12
1.3 Designing Gate Networks
• AND-OR, NAND-NAND, OR-AND, NOR-NOR

• Logic optimization: cost, speed, power dissipation

(a ∨ b ∨ c)′ = a′ b′ c′

[Figure: (a) AND-OR circuit, (b) Intermediate circuit, (c) NAND-NAND equivalent, each drawn on inputs x, y, z.]

Figure 1.6 A two-level AND-OR circuit and two equivalent circuits.

Slide 13
Seven-Segment Display of Decimal Digits
Figure 1.7 Seven-segment display of decimal digits. The three open segments (marked "Optional segment") may be optionally used. The digit 1 can be displayed in two ways, with the more common right-side version shown.

I swear I didn’t use a calculator!

Slide 14
BCD-to-Seven-Segment Decoder
Example 1.2
[Figure: a 4-bit input x3 x2 x1 x0 in [0, 9] drives logic that produces signals e0 to e6, which enable or turn on segments 0 to 6 of the display.]

Figure 1.8 The logic circuit that generates the enable signal for the lowermost segment (number 3) in a seven-segment display unit.

Slide 15
1.4 Useful Combinational Parts

 High-level building blocks

 Much like prefab parts used in building a house

 Arithmetic components (adders, multipliers, ALUs)

 Here we cover three useful parts:


multiplexers, decoders/demultiplexers, encoders

Slide 16
Multiplexers
[Figure: (a) 2-to-1 mux, (b) Switch view, (c) Mux symbol, (d) Mux array (32-bit inputs and output), (e) 4-to-1 mux with enable, (f) 4-to-1 mux design. Data inputs are x0 to x3, select inputs are y1 y0, the output is z, and e is the enable signal.]

Figure 1.9 Multiplexer (mux), or selector, allows one of several inputs to be selected and routed to output depending on the binary value of a set of selection or address signals provided to it.
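A multiplexer's behavior can be modeled in a few lines. The sketch below is illustrative only. The function names mux2 and mux4 are assumptions, and the 4-to-1 mux is built from three 2-to-1 muxes, which is one common construction and may differ in detail from the figure.

```python
def mux2(x0, x1, y):
    """2-to-1 mux on single bits: output x0 when y = 0, x1 when y = 1."""
    return ((1 - y) & x0) | (y & x1)

def mux4(x, y1, y0, e=1):
    """4-to-1 mux with enable; x is a sequence of four bits, y1 y0 the address."""
    low = mux2(x[0], x[1], y0)    # selects between x0 and x1
    high = mux2(x[2], x[3], y0)   # selects between x2 and x3
    return e & mux2(low, high, y1)

inputs = [0, 1, 1, 0]
selected = [mux4(inputs, y1, y0) for y1 in (0, 1) for y0 in (0, 1)]
print(selected)  # one output per address 00, 01, 10, 11
```

With the enable input at 0, the output is forced to 0 regardless of the address.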

Slide 17
Decoders/Demultiplexers
[Figure: (a) 2-to-4 decoder, (b) Decoder symbol, (c) Demultiplexer, or decoder with "enable". Address inputs are y1 y0, outputs are x0 to x3, and e is the enable signal.]

Figure 1.10 A decoder allows the selection of one of 2^a options using an a-bit address as input. A demultiplexer (demux) is a decoder that only selects an output if its enable signal is asserted.

Slide 18
Encoders
[Figure: (a) 4-to-2 encoder, (b) Encoder symbol. Inputs are x0 to x3 and outputs are y1 y0.]

Figure 1.11 A 2^a-to-a encoder outputs an a-bit binary number equal to the index of the single 1 among its 2^a inputs.
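The decoder and encoder are inverses of each other, which a short behavioral model makes concrete. This is a sketch under the assumption that inputs are well formed. The names decode and encode are illustrative, and a real encoder's output for invalid inputs depends on the circuit.

```python
def decode(address, a, enable=1):
    """a-to-2^a decoder: list of 2^a outputs, with a 1 only at index `address`.
    With enable = 0 (demultiplexer behavior), no output is selected."""
    return [enable if i == address else 0 for i in range(2 ** a)]

def encode(inputs):
    """2^a-to-a encoder: index of the single 1 among the inputs."""
    if sum(inputs) != 1:
        raise ValueError("encoder input must contain exactly one 1")
    return inputs.index(1)

print(decode(2, a=2))            # address 10 selects x2
print(encode(decode(3, a=2)))    # encoding a decoded address returns it
```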

Slide 19
1.5 Programmable Combinational Parts

A programmable combinational part can do the job of many gates or gate networks

Programmed by cutting existing connections (fuses) or establishing new connections (antifuses)

 Programmable ROM (PROM)

 Programmable array logic (PAL)

 Programmable logic array (PLA)

Slide 20
PROMs
[Figure: (a) Programmable OR gates with inputs w, x, y, z, (b) Logic equivalent of part a, (c) Programmable read-only memory (PROM): inputs feed a decoder whose outputs drive programmable OR gates that produce the outputs.]

Figure 1.12 Programmable connections and their use in a PROM.

Slide 21
PALs and PLAs
[Figure: (a) General programmable combinational logic: inputs feed an AND array (AND plane) followed by an OR array (OR plane) that produces the outputs. (b) PAL: programmable AND array, fixed OR array (8-input ANDs). (c) PLA: programmable AND and OR arrays (6-input ANDs, 4-input ORs).]

Figure 1.13 Programmable combinational logic: general structure and two classes known as PAL and PLA devices. Not shown is PROM with fixed AND array (a decoder) and programmable OR array.

Slide 22
1.6 Timing and Circuit Considerations

Changes in gate/circuit output, triggered by changes in its inputs, are not instantaneous

• Gate delay δ: a fraction of, to a few, nanoseconds

• Wire delay, previously negligible, is now important (electronic signals travel about 15 cm per ns)

• Circuit simulation to verify function and timing

Slide 23
2 Digital Circuits with Memory

2.1 Latches, Flip-Flops, and Registers


2.2 Finite-State Machines
2.3 Designing Sequential Circuits
2.4 Useful Sequential Parts
2.5 Programmable Sequential Parts
2.6 Clocks and Timing of Events

Slide 24
2.1 Latches, Flip-Flops, and Registers
[Figure: (a) SR latch, (b) D latch, (c) Master-slave D flip-flop, (d) D flip-flop symbol, (e) k-bit register.]

Figure 2.1 Latches, flip-flops, and registers.

Slide 25
Reading and Modifying FFs in the Same Cycle

[Figure: a k-bit register feeds a computation module (combinational logic) whose k-bit result is stored in a second register. The timing diagram marks the propagation delay and the combinational delay against the clock.]

Figure 2.3 Register-to-register operation with edge-triggered flip-flops.

Slide 26
2.2 Finite-State Machines
Example 2.1
State table (next state for each input):

Current state    Dime    Quarter    Reset
S 00             S 10    S 25       S 00
S 10             S 20    S 35       S 00
S 20             S 30    S 35       S 00
S 25             S 35    S 35       S 00
S 30             S 35    S 35       S 00
S 35             S 35    S 35       S 00

S 00 is the initial state
S 35 is the final state

[State diagram: the same transitions drawn as a graph on states S 00, S 10, S 20, S 25, S 30, S 35, with arcs labeled Dime, Quarter, and Reset, and Start pointing to S 00.]

Figure 2.4 State table and state diagram for a vending machine coin reception unit.
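The state table translates directly into a lookup structure. The following is a minimal simulation sketch: the dictionary NEXT_STATE copies the table of Figure 2.4, while the function name run and the lowercase input labels are illustrative choices.

```python
# Next-state table of the coin reception unit (Figure 2.4)
NEXT_STATE = {
    "S00": {"dime": "S10", "quarter": "S25", "reset": "S00"},
    "S10": {"dime": "S20", "quarter": "S35", "reset": "S00"},
    "S20": {"dime": "S30", "quarter": "S35", "reset": "S00"},
    "S25": {"dime": "S35", "quarter": "S35", "reset": "S00"},
    "S30": {"dime": "S35", "quarter": "S35", "reset": "S00"},
    "S35": {"dime": "S35", "quarter": "S35", "reset": "S00"},
}

def run(inputs, state="S00"):
    """Apply a sequence of inputs starting from the initial state S00."""
    for symbol in inputs:
        state = NEXT_STATE[state][symbol]
    return state

print(run(["dime", "dime", "dime", "quarter"]))  # ends in the final state
print(run(["quarter", "reset"]))                 # reset returns to S00
```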

Slide 27
Sequential Machine Implementation

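[Figure: n inputs and the present state feed the next-state logic, whose next-state excitation signals drive the l-bit state register. The output logic produces m outputs from the present state. A direct path from the inputs to the output logic exists only for a Mealy machine.]

Figure 2.5 Hardware realization of Moore and Mealy sequential machines.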

Slide 28
2.3 Designing Sequential Circuits
Example 2.3
[Figure: inputs q (quarter in) and d (dime in) drive three D flip-flops FF2, FF1, FF0. The output e is asserted when the final state is 1xx.]

Figure 2.7 Hardware realization of a coin reception unit (Example 2.3).

Slide 29
2.4 Useful Sequential Parts

 High-level building blocks

 Here we cover three useful parts:


shift register, register file (SRAM basics), counter

Slide 30
Shift Register

[Figure: a k-bit register holding 0 1 0 0 1 1 1 0. A mux controlled by Shift and Load chooses between parallel data in (input 0) and the k − 1 LSBs joined with serial data in (input 1). The register provides parallel data out, and its MSB is the serial data out.]

Figure 2.8 Register with single-bit left shift and parallel load capabilities. For logical left shift, serial data in line is connected to 0.

Slide 31
Register File and FIFO
[Figure: (a) Register file with random access: 2^h k-bit registers; a decoder driven by the h-bit write address and write enable selects the register that receives the k-bit write data; two muxes driven by read address 0 and read address 1 (h bits each, with read enable) deliver read data 0 and read data 1. (b) Graphic symbol for register file. (c) FIFO symbol: k-bit input and output, with Push, Pop, Full, and Empty signals.]

Figure 2.9 Register file with random access and FIFO.


Slide 32
SRAM

[Figure: (a) SRAM block diagram: g-bit data in, g-bit data out, h-bit address, with write enable, chip select, and output enable controls. (b) SRAM read mechanism: the row part of the h-bit address drives a row decoder into a square or almost square memory matrix; the selected row is placed in a row buffer, and the column part of the address drives a column mux that delivers g bits of data out.]

Figure 2.10 SRAM memory is simply a large, single-port register file.

Slide 33
Binary Counter
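[Figure: a mux controlled by IncrInit selects between the input (initial value) and x + 1, the output of an incrementer with c_in = 1 and c_out. The selected value is loaded into the count register, which holds x.]

Figure 2.11 Synchronous binary counter with initialization capability.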

Slide 34
2.5 Programmable Sequential Parts

A programmable sequential part contains gates and memory elements

Programmed by cutting existing connections (fuses) or establishing new connections (antifuses)

 Programmable array logic (PAL)

 Field-programmable gate array (FPGA)

 Both types contain macrocells and interconnects

Slide 35
PAL and FPGA
[Figure: (a) Portion of PAL with storable output: 8-input ANDs feed a D flip-flop, with muxes selecting the stored or direct output. (b) Generic structure of an FPGA: I/O blocks surrounding an array of configurable logic blocks (CLBs) joined by programmable connections.]

Figure 2.12 Examples of programmable sequential logic.


Slide 36
2.6 Clocks and Timing of Events
Clock is a periodic signal: clock rate = clock frequency
The inverse of clock rate is the clock period: 1 GHz ↔ 1 ns
Constraint: Clock period ≥ t_prop + t_comb + t_setup + t_skew

[Figure: FF1 (clocked by Clock1) feeds combinational logic, which also has other inputs, and the result is captured by FF2 (clocked by Clock2). The clock period, measured from when FF1 begins to change to when the change is observed, must be wide enough to accommodate worst-case delays.]

Figure 2.13 Determining the required length of the clock period.
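The clock period constraint is a simple sum of worst-case delays, so it is easy to evaluate numerically. The sketch below uses made-up delay values purely for illustration; none of the numbers come from the slides, and the function names are assumptions.

```python
def min_clock_period_ns(t_prop, t_comb, t_setup, t_skew):
    """Smallest clock period (ns) satisfying
    period >= t_prop + t_comb + t_setup + t_skew."""
    return t_prop + t_comb + t_setup + t_skew

def max_clock_rate_ghz(period_ns):
    """Clock rate is the inverse of the period: 1 ns corresponds to 1 GHz."""
    return 1.0 / period_ns

# Hypothetical worst-case delays, in nanoseconds
period = min_clock_period_ns(t_prop=0.2, t_comb=1.5, t_setup=0.2, t_skew=0.1)
print(period, max_clock_rate_ghz(period))
```

With these assumed delays, the period must be at least about 2 ns, which caps the clock rate at about 0.5 GHz.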

Slide 37
Synchronization
[Figure: (a) Simple synchronizer: one flip-flop turns the asynchronous input into a synchronous version. (b) Two-FF synchronizer: FF1 followed by FF2. (c) Input and output waveforms: clock, asynchronous input, and synchronous version.]

Figure 2.14 Synchronizers are used to prevent timing problems arising from untimely changes in asynchronous signals.

Slide 38
Level-Sensitive Operation

[Figure: three latches separated by two blocks of combinational logic, each with other inputs. The latches are clocked alternately by φ1, φ2, φ1. The timing diagram shows one clock period of φ1 and φ2, clocks with nonoverlapping highs.]

Figure 2.15 Two-phase clocking with nonoverlapping clock signals.

Slide 39
3 Computer System Technology
Interplay between architecture, hardware, and software
• Architectural innovations influence technology
• Technological advances drive changes in architecture

3.1 From Components to Applications


3.2 Computer Systems and Their Parts
3.3 Generations of Progress
3.4 Processor and Memory Technologies
3.5 Peripherals, I/O, and Communications
3.6 Software Systems and Applications

Slide 40
3.1 From Components to Applications
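[Figure: a spectrum running from the high-level view (software, application domains) to the low-level view (hardware, electronic components), spanned in order by the application designer, system designer, computer designer, logic designer, and circuit designer. Computer architecture covers the middle of the range, with computer organization toward its hardware side.]

Figure 3.1 Subfields or views in computer system engineering.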

Slide 41
What Is (Computer) Architecture?
[Figure: the architect sits at the crossing of two interfaces. Goals: client's requirements (function, cost, . . .) and client's taste (mood, style, . . .). Means: construction technology (material, codes, . . .) and the world of arts (aesthetics, trends, . . .). The left column is Engineering and the right column is Arts.]

Figure 3.2 Like a building architect, whose place at the engineering/arts and goals/means interfaces is seen in this diagram, a computer architect reconciles many conflicting or competing demands.

Slide 42
3.2 Computer Systems and Their Parts
[Figure: a classification tree rooted at Computer, with successive splits: Analog / Digital, Fixed-function / Stored-program, Electronic / Nonelectronic, General-purpose / Special-purpose, Number cruncher / Data manipulator.]

Figure 3.3 The space of computer systems, with what we normally mean by the word "computer" highlighted.
Slide 43
Price/Performance Pyramid
Class          Price range
Super          $Millions
Mainframe      $100s Ks
Server         $10s Ks
Workstation    $1000s
Personal       $100s
Embedded       $10s

Differences in scale, not in substance

Figure 3.4 Classifying computers by computational power and price range.

Slide 44
Automotive Embedded Computers

[Figure: a car with a central controller connected to impact sensors, brakes, airbags, engine, and navigation & entertainment.]

Figure 3.5 Embedded computers are ubiquitous, yet invisible. They are found in our automobiles, appliances, and many other places.

Slide 45
Personal Computers and Workstations

Figure 3.6 Notebooks, a common class of portable computers, are much smaller than desktops but offer substantially the same capabilities. What are the main reasons for the size difference?

Slide 46
Digital Computer Subsystems

[Figure: Memory, Processor (Control and Datapath, together the CPU), Link (to/from network), and Input/Output (Input and Output, together I/O).]

Figure 3.7 The (three, four, five, or) six main units of a digital computer. Usually, the link unit (a simple bus or a more elaborate network) is not explicitly included in such diagrams.
Slide 47
3.3 Generations of Progress
Table 3.2 The 5 generations of digital computers, and their ancestors.

Generation   Processor              Memory           I/O devices                      Dominant
(begun)      technology             innovations      introduced                       look & feel
0 (1600s)    (Electro-)mechanical   Wheel, card      Lever, dial, punched card        Factory equipment
1 (1950s)    Vacuum tube            Magnetic drum    Paper tape, magnetic tape        Hall-size cabinet
2 (1960s)    Transistor             Magnetic core    Drum, printer, text terminal     Room-size mainframe
3 (1970s)    SSI/MSI                RAM/ROM chip     Disk, keyboard, video monitor    Desk-size mini
4 (1980s)    LSI/VLSI               SRAM/DRAM        Network, CD, mouse, sound        Desktop/laptop micro
5 (1990s)    ULSI/GSI/WSI, SOC      SDRAM, flash     Sensor/actuator, point/click     Invisible, embedded

Slide 48
IC Production and Yield
[Figure: Silicon crystal ingot (15-30 cm across, 30-60 cm long) → Slicer → Blank wafer (0.2 cm thick) → Processing: 20-30 steps → Patterned wafer with defects (100s of simple or scores of complex processors) → Dicer → Die (~1 cm) → Die tester → Good die → Mounting → Microchip or other part (~1 cm) → Part tester → Usable part to ship.]

Figure 3.8 The manufacturing process for an IC part.


Slide 49
Effect of Die Size on Yield

[Figure: two wafers with the same defect pattern. Small dies: 120 dies, 109 good. Large dies: 26 dies, 15 good.]

Figure 3.9 Visualizing the dramatic decrease in yield with larger dies.

Die yield =def (number of good dies) / (total number of dies)

Die yield = Wafer yield × [1 + (Defect density × Die area) / a]^(−a)
Die cost = (cost of wafer) / (total number of dies × die yield)
         = (cost of wafer) × (die area / wafer area) / (die yield)
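The yield and cost formulas can be tried on the counts shown in Figure 3.9. In this sketch the die counts (120 dies with 109 good, 26 dies with 15 good) come from the figure, while the wafer cost of $1000 and the parameter values passed to die_yield are hypothetical and chosen only to make the example run.

```python
def die_yield(wafer_yield, defect_density, die_area, a):
    """Die yield = wafer yield * [1 + (defect density * die area) / a] ** (-a)."""
    return wafer_yield * (1 + defect_density * die_area / a) ** (-a)

def die_cost(wafer_cost, total_dies, yield_fraction):
    """Die cost = (cost of wafer) / (total number of dies * die yield)."""
    return wafer_cost / (total_dies * yield_fraction)

WAFER_COST = 1000.0  # hypothetical wafer cost in dollars

small_yield = 109 / 120  # measured yield for the small dies of Figure 3.9
large_yield = 15 / 26    # measured yield for the large dies
print(die_cost(WAFER_COST, 120, small_yield))  # cost per good small die
print(die_cost(WAFER_COST, 26, large_yield))   # cost per good large die

# Model-based yield for hypothetical parameters: larger area, lower yield
print(die_yield(1.0, 0.5, 1.0, 3.0), die_yield(1.0, 0.5, 4.0, 3.0))
```

Under these assumptions the larger die costs several times as much per good part, because fewer dies fit on the wafer and a smaller fraction of them work.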

Slide 50
3.4 Processor and Memory Technologies
[Figure: (a) 2D or 2.5D packaging now common: dies (CPU, memory) on PC boards that plug through connectors into a backplane carrying the bus. (b) 3D packaging of the future: stacked layers glued together, with interlayer connections deposited on the outside of the stack.]

Figure 3.11 Packaging of processor, memory, and other components.

Slide 51
Moore’s Law

[Plot: processor performance (kIPS, MIPS, GIPS, TIPS) and memory chip capacity (kb, Mb, Gb, Tb) versus calendar year, 1980 to 2010. Processor trend: ×1.6 / yr, ×2 / 18 mos, ×10 / 5 yrs, with data points 68000, 80286, 80386, 80486, 68040, Pentium, Pentium II, R10000. Memory trend: ×4 / 3 yrs, with data points 64kb, 256kb, 1Mb, 4Mb, 16Mb, 64Mb, 256Mb, 1Gb.]

Figure 3.10 Trends in processor performance and DRAM memory chip capacity (Moore’s law).
Slide 52
Pitfalls of Computer Technology Forecasting
“DOS addresses only 1 MB of RAM because we cannot
imagine any applications needing more.” Microsoft, 1980
“640K ought to be enough for anybody.” Bill Gates, 1981
“Computers in the future may weigh no more than 1.5
tons.” Popular Mechanics
“I think there is a world market for maybe five
computers.” Thomas Watson, IBM Chairman, 1943
“There is no reason anyone would want a computer in
their home.” Ken Olsen, DEC founder, 1977
“The 32-bit machine would be an overkill for a personal
computer.” Sol Libes, ByteLines
Slide 53
3.5 Input/Output and Communications

[Figure: (a) Cutaway view of a hard disk drive (platters typically 2-9 cm). (b) Some removable storage media: floppy disk, CD-ROM, magnetic tape cartridge.]

Figure 3.12 Magnetic and optical disk memory units.

Slide 54
Communication Technologies

[Plot: bandwidth (b/s), from 10^3 to 10^12, versus latency (s), from 10^−9 (ns) through 10^−6 (μs), 10^−3 (ms), and 1 up to 10^3 (min to h). Links in the same geographic location: processor bus, I/O network, system-area network (SAN), local-area network (LAN). Geographically distributed links: metro-area network (MAN), wide-area network (WAN).]

Figure 3.13 Latency and bandwidth characteristics of different classes of communication links.
Slide 55
3.6 Software Systems and Applications
Software
  Application: word processor, spreadsheet, circuit simulator, . . .
  System
    Operating system
      Manager: virtual memory, security, file system, . . .
      Enabler: disk driver, display driver, printing, . . .
      Coordinator: scheduling, load balancing, diagnostics, . . .
    Translator: MIPS assembler, C compiler, . . .

Figure 3.15 Categorization of software, with examples in each class.

Slide 56
High- vs Low-Level Programming
High level: more abstract, machine-independent; easier to write, read, debug, or maintain
Low level: more concrete, machine-specific, error-prone; harder to write, read, debug, or maintain

Level                       Expressed as                  Translated by    Correspondence to next level
Very high-level language    objectives or tasks           Interpreter      One task = many statements
High-level language         statements                    Compiler         One statement = several instructions
Assembly language           instructions, mnemonic        Assembler        Mostly one-to-one
Machine language            instructions, binary (hex)

Example
Very high-level:   Swap v[i] and v[i+1]
High-level:        temp=v[i]
                   v[i]=v[i+1]
                   v[i+1]=temp
Assembly           Machine (hex)
add $2,$5,$5       00a51020
add $2,$2,$2       00421020
add $2,$4,$2       00821020
lw $15,0($2)       8c620000
lw $16,4($2)       8cf20004
sw $16,0($2)       acf20000
sw $15,4($2)       ac620004
jr $31             03e00008

Figure 3.14 Models and abstractions in programming.


Slide 57
4 Computer Performance
Performance is key in design decisions; also cost and power
• It has been a driving force for innovation
• Isn’t quite the same as speed (higher clock rate)

4.1 Cost, Performance, and Cost/Performance


4.2 Defining Computer Performance
4.3 Performance Enhancement and Amdahl’s Law
4.4 Performance Measurement vs Modeling
4.5 Reporting Computer Performance
4.6 The Quest for Higher Performance

Slide 58
4.1 Cost, Performance, and Cost/Performance
[Plot: computer cost (axis marked $1, $1 K, $1 M, $1 G) versus calendar year (1960 to 2020).]
Slide 59
Cost/Performance

[Plot: performance versus cost, with three curves: Superlinear (economy of scale), Linear (ideal?), and Sublinear (diminishing returns).]

Figure 4.1 Performance improvement as a function of cost.

Slide 60
4.2 Defining Computer Performance
[Figure: a pipeline of Input, Processing, and Output stages, drawn once for a CPU-bound task and once for an I/O-bound task.]

Figure 4.2 Pipeline analogy shows that imbalance between processing power and I/O capabilities leads to a performance bottleneck.

Slide 61
Six Passenger Aircraft to Be Compared
B 747

DC-8-50

Slide 62
Performance of Aircraft: An Analogy
Table 4.1 Key characteristics of six passenger aircraft: all figures are approximate; some relate to a specific model/configuration of the aircraft or are averages of cited range of values.

Aircraft       Passengers   Range (km)   Speed (km/h)   Price ($M)
Airbus A310    250          8 300        895            120
Boeing 747     470          6 700        980            200
Boeing 767     250          12 300       885            120
Boeing 777     375          7 450        980            180
Concorde       130          6 400        2 200          350
DC-8-50        145          14 000       875            80

Speed of sound ≈ 1220 km / h

Slide 63
Different Views of Performance
Performance from the viewpoint of a passenger: Speed
Note, however, that flight time is but one part of total travel time.
Also, if the travel distance exceeds the range of a faster plane,
a slower plane may be better due to not needing a refueling stop
Performance from the viewpoint of an airline: Throughput
Measured in passenger-km per hour (relevant if ticket price were
proportional to distance traveled, which in reality it is not)
Airbus A310   250 × 895  = 0.224 M passenger-km/hr
Boeing 747    470 × 980  = 0.461 M passenger-km/hr
Boeing 767    250 × 885  = 0.221 M passenger-km/hr
Boeing 777    375 × 980  = 0.368 M passenger-km/hr
Concorde      130 × 2200 = 0.286 M passenger-km/hr
DC-8-50       145 × 875  = 0.127 M passenger-km/hr
Performance from the viewpoint of FAA: Safety
Slide 64
Cost Effectiveness: Cost/Performance
Table 4.1 Key characteristics of six passenger aircraft: all figures are approximate; some relate to a specific model/configuration of the aircraft or are averages of cited range of values.

Aircraft    Passengers   Range (km)   Speed (km/h)   Price ($M)   Throughput (M P km/hr)   Cost / Performance
A310        250          8 300        895            120          0.224                    536
B 747       470          6 700        980            200          0.461                    434
B 767       250          12 300       885            120          0.221                    543
B 777       375          7 450        980            180          0.368                    489
Concorde    130          6 400        2 200          350          0.286                    1224
DC-8-50     145          14 000       875            80           0.127                    630

Throughput: larger values better. Cost / Performance: smaller values better.
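The throughput and cost/performance columns can be recomputed from the passenger, speed, and price columns. The sketch below is an illustration, not the author's procedure: it assumes that throughput is rounded to three decimal places before dividing, which is the rounding that reproduces the tabulated values, and the names AIRCRAFT, throughput_k, and cost_performance are made up for the example.

```python
# (name, passengers, speed in km/h, price in $M), copied from Table 4.1
AIRCRAFT = [
    ("Airbus A310", 250, 895, 120),
    ("Boeing 747", 470, 980, 200),
    ("Boeing 767", 250, 885, 120),
    ("Boeing 777", 375, 980, 180),
    ("Concorde", 130, 2200, 350),
    ("DC-8-50", 145, 875, 80),
]

def throughput_k(passengers, speed):
    """Throughput in thousands of passenger-km/hr, rounded half up (integer math)."""
    return (passengers * speed + 500) // 1000

def cost_performance(price_musd, passengers, speed):
    """Price ($M) per unit of throughput (M passenger-km/hr), as an integer."""
    return round(price_musd * 1000 / throughput_k(passengers, speed))

table = {name: (throughput_k(p, s), cost_performance(c, p, s))
         for name, p, s, c in AIRCRAFT}
best = min(table, key=lambda name: table[name][1])  # smaller is better
for name, (thr_k, cp) in table.items():
    print(f"{name:12s} {thr_k / 1000:.3f} {cp}")
print("Most cost-effective:", best)
```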

Slide 65
Concepts of Performance and Speedup
Performance = 1 / Execution time is simplified to

Performance = 1 / CPU execution time

(Performance of M1) / (Performance of M2) = Speedup of M1 over M2
                                          = (Execution time of M2) / (Execution time of M1)

Terminology: M1 is x times as fast as M2 (e.g., 1.5 times as fast)
             M1 is 100(x – 1)% faster than M2 (e.g., 50% faster)

CPU time = Instructions × (Cycles per instruction) × (Secs per cycle)
         = Instructions × CPI / (Clock rate)
Instruction count, CPI, and clock rate are not completely independent, so improving one by a given factor may not lead to overall execution time improvement by the same factor.

Slide 66
Elaboration on the CPU Time Formula
CPU time = Instructions × (Cycles per instruction) × (Secs per cycle)
         = Instructions × Average CPI / (Clock rate)

Instructions: Number of instructions executed, not number of instructions in our program (dynamic count)

Average CPI: Is calculated based on the dynamic instruction mix and knowledge of how many clock cycles are needed to execute various instructions (or instruction classes)

Clock rate: 1 GHz = 10^9 cycles / s (cycle time 10^−9 s = 1 ns)
            200 MHz = 200 × 10^6 cycles / s (cycle time = 5 ns)
            The cycle time is the clock period.
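The CPU time formula can be written as a small function. In this sketch the two clock rates are the ones quoted on the slide, while the workload (10 million dynamic instructions at an average CPI of 2.5) is hypothetical, and the function names are illustrative.

```python
def cycle_time_ns(clock_rate_hz):
    """Clock period in nanoseconds for a clock rate in Hz."""
    return 1e9 / clock_rate_hz

def cpu_time_s(instructions, average_cpi, clock_rate_hz):
    """CPU time = instructions * average CPI / clock rate."""
    return instructions * average_cpi / clock_rate_hz

print(cycle_time_ns(1e9))     # 1 GHz clock
print(cycle_time_ns(200e6))   # 200 MHz clock

# Hypothetical workload: 10 million dynamic instructions, average CPI of 2.5
print(cpu_time_s(10_000_000, 2.5, 200e6))
```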

Slide 67
Dynamic Instruction Count
How many instructions are executed in this program fragment?
Each "for" consists of two instructions: increment index, check exit condition

250 instructions
for i = 1, 100 do           2 + 20 + 124,200 instructions, 100 iterations: 12,422,200 instructions in all
  20 instructions
  for j = 1, 100 do         2 + 40 + 1200 instructions, 100 iterations: 124,200 instructions in all
    40 instructions
    for k = 1, 100 do       2 + 10 instructions, 100 iterations: 1200 instructions in all
      10 instructions
    endfor
  endfor
endfor

Dynamic count = 250 + 12,422,200 = 12,422,450 instructions
Static count = 326
Other loop forms shown on the slide: for i = 1, n and while x > 0
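The counting argument can be replayed mechanically, working from the innermost loop outward. This is a minimal sketch in which the name loop_count and the constant FOR_OVERHEAD are illustrative; the instruction counts are the ones given on the slide.

```python
FOR_OVERHEAD = 2  # each "for": increment index, check exit condition

def loop_count(body_instructions, iterations=100):
    """Instructions executed by a loop whose body runs `iterations` times."""
    return (FOR_OVERHEAD + body_instructions) * iterations

inner = loop_count(10)             # for k
middle = loop_count(40 + inner)    # for j
outer = loop_count(20 + middle)    # for i
dynamic_count = 250 + outer

# Static count: every instruction in the program text counted once
static_count = 250 + 20 + 40 + 10 + 3 * FOR_OVERHEAD

print(inner, middle, outer, dynamic_count, static_count)
```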
Slide 68
Faster Clock ≠ Shorter Running Time
Suppose addition takes 1 ns
Clock period = 1 ns; 1 cycle
Clock period = ½ ns; 2 cycles
In this example, addition time does not improve in going from 1 GHz to 2 GHz clock

[Figure: the same distance covered in 4 steps (labeled 1 GHz) and in 20 steps (labeled 2 GHz), with the panel titled Solution.]

Figure 4.3 Faster steps do not necessarily mean shorter travel time.
Slide 69
4.3 Performance Enhancement: Amdahl’s Law
f = fraction unaffected
p = speedup of the rest

s = 1 / [f + (1 – f)/p] ≤ min(p, 1/f)

[Plot: speedup (s), from 0 to 50, versus enhancement factor (p), from 0 to 50, with one curve each for f = 0, f = 0.01, f = 0.02, f = 0.05, and f = 0.1.]

Figure 4.4 Amdahl’s law: speedup achieved if a fraction f of a task is unaffected and the remaining 1 – f part runs p times as fast.

Slide 70
Amdahl’s Law Used in Design
Example 4.1
A processor spends 30% of its time on flp addition, 25% on flp mult,
and 10% on flp division. Evaluate the following enhancements, each
costing the same to implement:

a. Redesign of the flp adder to make it twice as fast.
b. Redesign of the flp multiplier to make it three times as fast.
c. Redesign of the flp divider to make it 10 times as fast.

Solution

a. Adder redesign speedup = 1 / [0.7 + 0.3 / 2] = 1.18


b. Multiplier redesign speedup = 1 / [0.75 + 0.25 / 3] = 1.20
c. Divider redesign speedup = 1 / [0.9 + 0.1 / 10] = 1.10

What if both the adder and the multiplier are redesigned?
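The three speedups in Example 4.1, and the combined case raised in the closing question, can be evaluated with a short script. This is a sketch rather than the slide's own solution: the function names are illustrative, and the combined case assumes that the two enhancements act independently on their own fractions of the running time.

```python
def amdahl_speedup(f, p):
    """f = fraction unaffected, p = speedup of the remaining 1 - f."""
    return 1.0 / (f + (1.0 - f) / p)

def combined_speedup(parts):
    """parts: (fraction, speedup factor) pairs whose fractions sum to 1."""
    return 1.0 / sum(fraction / factor for fraction, factor in parts)

adder = amdahl_speedup(0.70, 2)        # 30% of time on flp addition
multiplier = amdahl_speedup(0.75, 3)   # 25% of time on flp multiplication
divider = amdahl_speedup(0.90, 10)     # 10% of time on flp division

# Adder and multiplier both redesigned; the remaining 45% is unaffected
both = combined_speedup([(0.45, 1), (0.30, 2), (0.25, 3)])

print(round(adder, 2), round(multiplier, 2), round(divider, 2), round(both, 2))
```

Under that assumption, redesigning both units gives a speedup of roughly 1.46, which is more than either enhancement alone but less than the product of the two individual speedups would suggest if applied to the whole program.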

Slide 71
Amdahl’s Law Used in Management
Example 4.2
Members of a university research group frequently visit the library.
Each library trip takes 20 minutes. The group decides to subscribe
to a handful of publications that account for 90% of the library trips;
access time to these publications is reduced to 2 minutes.

a. What is the average speedup in access to publications?


b. If the group has 20 members, each making two weekly trips to
the library, what is the justifiable expense for the subscriptions?
Assume 50 working weeks/yr and $25/h for a researcher’s time.

Solution

a. Speedup in publication access time = 1 / [0.1 + 0.9 / 10] = 5.26


b. Time saved = 20 × 2 × 50 × 0.9 × (20 – 2) = 32,400 min = 540 h
   Cost recovery = 540 × $25 = $13,500 = Max justifiable expense
Slide 72
4.4 Performance Measurement vs Modeling
Execution time

Machine 1

Machine 2

Machine 3

Program
A B C D E F

Figure 4.5 Running times of six programs on three machines.

Slide 73
Generalized Amdahl’s Law

Original running time of a program = 1 = f1 + f2 + . . . + fk

New running time after the fraction fi is speeded up by a factor pi
    = f1/p1 + f2/p2 + . . . + fk/pk

Speedup formula
    S = 1 / (f1/p1 + f2/p2 + . . . + fk/pk)

If a particular fraction is slowed down rather than speeded up, use sj fj instead of fj / pj, where sj > 1 is the slowdown factor

Slide 74
Performance Benchmarks
Example 4.3
You are an engineer at Outtel, a start-up aspiring to compete with Intel
via its new processor design that outperforms the latest Intel processor
by a factor of 2.5 on floating-point instructions. This level of performance
was achieved by design compromises that led to a 20% increase in the
execution time of all other instructions. You are in charge of choosing
benchmarks that would showcase Outtel’s performance edge.
a. What is the minimum required fraction f of time spent on floating-point
instructions in a program on the Intel processor to show a speedup of
2 or better for Outtel?
Solution
a. We use a generalized form of Amdahl’s formula in which a fraction f is speeded up by a given factor (2.5) and the rest is slowed down by another factor (1.2): 1 / [1.2(1 – f) + f / 2.5] ≥ 2 ⇒ f ≥ 0.875
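The threshold in Example 4.3 follows from solving the inequality for f, and the algebra can be checked numerically. In this sketch the function names and default arguments are illustrative; the factors 2.5 and 1.2 and the target speedup of 2 are from the example.

```python
def outtel_speedup(f, fp_factor=2.5, slowdown=1.2):
    """Speedup when fraction f runs fp_factor times as fast and the
    remaining 1 - f takes `slowdown` times as long."""
    return 1.0 / (slowdown * (1.0 - f) + f / fp_factor)

def min_fp_fraction(target, fp_factor=2.5, slowdown=1.2):
    """Solve slowdown * (1 - f) + f / fp_factor = 1 / target for f."""
    return (slowdown - 1.0 / target) / (slowdown - 1.0 / fp_factor)

f_min = min_fp_fraction(2.0)
print(f_min, outtel_speedup(f_min))
```

A benchmark with no floating-point work at all (f = 0) would run slower on the new design, with a speedup of 1/1.2.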
Slide 75
Performance Estimation
Average CPI = Σ over all instruction classes of (Class-i fraction) × (Class-i CPI)
Machine cycle time = 1 / Clock rate
CPU execution time = Instructions × (Average CPI) / (Clock rate)

Table 4.3 Usage frequency, in percentage, for various instruction classes in four representative applications.

Instr’n class     Data          C language   Reactor      Atomic motion
                  compression   compiler     simulation   modeling
A: Load/Store     25            37           32           37
B: Integer        32            28           17           5
C: Shift/Logic    16            13           2            1
D: Float          0             0            34           42
E: Branch         19            13           9            10
F: All others     8             9            6            4
Slide 76
CPI and IPS Calculations
Example 4.4 (2 of 5 parts)
Consider two implementations M1 (600 MHz) and M2 (500 MHz) of
an instruction set containing three classes of instructions:
Class CPI for M1 CPI for M2 Comments
F 5.0 4.0 Floating-point
I 2.0 3.8 Integer arithmetic
N 2.4 2.0 Nonarithmetic
a. What are the peak performances of M1 and M2 in MIPS?
b. If 50% of instructions executed are class-N, with the rest divided
equally among F and I, which machine is faster? By what factor?
Solution
a. Peak MIPS for M1 = 600 / 2.0 = 300; for M2 = 500 / 2.0 = 250
b. Average CPI for M1 = 5.0 / 4 + 2.0 / 4 + 2.4 / 2 = 2.95;
   for M2 = 4.0 / 4 + 3.8 / 4 + 2.0 / 2 = 2.95 ⇒ M1 is faster; factor 1.2
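Both parts of Example 4.4 reduce to a few table lookups. The sketch below copies the CPI values, clock rates, and instruction mix from the example; the dictionary and function names are illustrative.

```python
CPI = {
    "M1": {"F": 5.0, "I": 2.0, "N": 2.4},
    "M2": {"F": 4.0, "I": 3.8, "N": 2.0},
}
CLOCK_MHZ = {"M1": 600, "M2": 500}
MIX = {"F": 0.25, "I": 0.25, "N": 0.50}  # fractions of executed instructions

def peak_mips(machine):
    """Peak rate: every instruction belongs to the class with the lowest CPI."""
    return CLOCK_MHZ[machine] / min(CPI[machine].values())

def average_cpi(machine):
    return sum(MIX[c] * CPI[machine][c] for c in MIX)

def mips(machine):
    return CLOCK_MHZ[machine] / average_cpi(machine)

print(peak_mips("M1"), peak_mips("M2"))
print(average_cpi("M1"), average_cpi("M2"))
print(mips("M1") / mips("M2"))  # how many times as fast M1 is
```

Because the two average CPIs are equal for this mix, the speed ratio equals the ratio of the clock rates.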
Slide 77
MIPS Rating Can Be Misleading
Example 4.5
Two compilers produce machine code for a program on a machine
with two classes of instructions. Here are the number of instructions:
Class CPI Compiler 1 Compiler 2
A 1 600M 400M
B 2 400M 400M
a. What are run times of the two programs with a 1 GHz clock?
b. Which compiler produces faster code and by what factor?
c. Which compiler’s output runs at a higher MIPS rate?
Solution
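a. Running time for compiler 1 = (600M × 1 + 400M × 2) / 10^9 = 1.4 s
   Running time for compiler 2 = (400M × 1 + 400M × 2) / 10^9 = 1.2 s
b. Compiler 2’s output runs 1.4 / 1.2 = 1.17 times as fast
c. MIPS rating for compiler 1 (CPI = 1.4) = 1000 / 1.4 = 714
   MIPS rating for compiler 2 (CPI = 1.5) = 1000 / 1.5 = 667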
Slide 78
4.5 Reporting Computer Performance
Table 4.4 Measured or estimated execution times for three programs.

                Time on      Time on      Speedup of
                machine X    machine Y    Y over X
Program A       20           200          0.1
Program B       1000         100          10.0
Program C       1500         150          10.0
All 3 prog’s    2520         450          5.6

Analogy: If a car is driven to a city 100 km away at 100 km/hr and returns at 50 km/hr, the average speed is not (100 + 50) / 2 but is obtained from the fact that it travels 200 km in 3 hours.

Slide 79
Comparing the Overall Performance

Table 4.4 Measured or estimated execution times for three programs.

                   Time on      Time on      Speedup of   Speedup of
                   machine X    machine Y    Y over X     X over Y
Program A          20           200          0.1          10
Program B          1000         100          10.0         0.1
Program C          1500         150          10.0         0.1

Arithmetic mean                              6.7          3.4
Geometric mean                               2.15         0.46

Geometric mean does not yield a measure of overall speedup, but provides an indicator that at least moves in the right direction
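The means in the table are easy to recompute, which also shows why the arithmetic mean of speedups is misleading. This sketch takes the execution times from Table 4.4; the function names are illustrative, and math.prod requires Python 3.8 or later.

```python
import math

TIMES_X = [20, 1000, 1500]   # programs A, B, C on machine X
TIMES_Y = [200, 100, 150]    # programs A, B, C on machine Y

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    return math.prod(values) ** (1.0 / len(values))

y_over_x = [tx / ty for tx, ty in zip(TIMES_X, TIMES_Y)]  # speedup of Y over X
x_over_y = [ty / tx for tx, ty in zip(TIMES_X, TIMES_Y)]  # speedup of X over Y

print(arithmetic_mean(y_over_x), arithmetic_mean(x_over_y))
print(geometric_mean(y_over_x), geometric_mean(x_over_y))
print(sum(TIMES_X) / sum(TIMES_Y))  # overall speedup of Y from total times
```

The arithmetic means claim that each machine is faster than the other, which cannot be true. The two geometric means are reciprocals of each other, so they at least agree on which machine comes out ahead.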

Slide 80
Effect of Instruction Mix on Performance
Example 4.6 (1 of 3 parts)
Consider two applications DC and RS and two machines M1 and M2:

Class          Data Comp.   Reactor Sim.   M1’s CPI   M2’s CPI
A: Ld/Str      25%          32%            4.0        3.8
B: Integer     32%          17%            1.5        2.5
C: Sh/Logic    16%          2%             1.2        1.2
D: Float       0%           34%            6.0        2.6
E: Branch      19%          9%             2.5        2.2
F: Other       8%           6%             2.0        2.3

a. Find the effective CPI for the two applications on both machines.
Solution
a. CPI of DC on M1: 0.25 × 4.0 + 0.32 × 1.5 + 0.16 × 1.2 + 0 × 6.0 + 0.19 × 2.5 + 0.08 × 2.0 = 2.31
   DC on M2: 2.54    RS on M1: 3.94    RS on M2: 2.89
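The four effective CPI values are weighted sums over the same table, so one function covers all of them. The sketch below copies the instruction mixes and per-class CPIs from Example 4.6; the dictionary and function names are illustrative.

```python
# Instruction mix (fraction of executed instructions) per application
MIX = {
    "DC": {"A": 0.25, "B": 0.32, "C": 0.16, "D": 0.00, "E": 0.19, "F": 0.08},
    "RS": {"A": 0.32, "B": 0.17, "C": 0.02, "D": 0.34, "E": 0.09, "F": 0.06},
}
# Cycles per instruction for each class on each machine
CPI = {
    "M1": {"A": 4.0, "B": 1.5, "C": 1.2, "D": 6.0, "E": 2.5, "F": 2.0},
    "M2": {"A": 3.8, "B": 2.5, "C": 1.2, "D": 2.6, "E": 2.2, "F": 2.3},
}

def effective_cpi(app, machine):
    """Sum over classes of (class fraction) * (class CPI)."""
    return sum(MIX[app][c] * CPI[machine][c] for c in MIX[app])

for app in ("DC", "RS"):
    for machine in ("M1", "M2"):
        print(app, machine, f"{effective_cpi(app, machine):.3f}")
```

The unrounded sums are about 2.307, 2.544, 3.944, and 2.885, which the slide reports to two decimal places.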
Slide 81
4.6 The Quest for Higher Performance
State of available computing power ca. the early 2000s:

Gigaflops on the desktop


Teraflops in the supercomputer center
Petaflops on the drawing board

Note on terminology (see Table 3.1)

Prefixes for large units:


Kilo = 10^3, Mega = 10^6, Giga = 10^9, Tera = 10^12, Peta = 10^15
For memory:
K = 2^10 = 1024, M = 2^20, G = 2^30, T = 2^40, P = 2^50
Prefixes for small units:
micro = 10^−6, nano = 10^−9, pico = 10^−12, femto = 10^−15

Slide 82
Performance Trends and Obsolescence
[Plot, repeated from earlier in the deck: processor performance (kIPS to TIPS) and memory chip capacity (kb to Tb) versus calendar year, 1980 to 2010. Processor trend: ×1.6 / yr, ×2 / 18 mos, ×10 / 5 yrs. Memory trend: ×4 / 3 yrs.]

Figure 3.10 Trends in processor performance and DRAM memory chip capacity (Moore’s law).

Cartoon caption: “Can I call you back? We just bought a new computer and we’re trying to set it up before it’s obsolete.”

Slide 83
Supercomputers

[Plot: supercomputer performance (MFLOPS, GFLOPS, TFLOPS, PFLOPS) versus calendar year, 1980 to 2010. Vector supercomputers: Cray X-MP, Y-MP. Massively parallel processors: CM-2, CM-5, $30M MPPs, $240M MPPs.]

Figure 4.7 Exponential growth of supercomputer performance.

Slide 84
The Most Powerful Computers
Milestone      Performance    Memory
ASCI Red       1+ TFLOPS      0.5 TB
ASCI Blue      3+ TFLOPS      1.5 TB
ASCI White     10+ TFLOPS     5 TB
ASCI Q         30+ TFLOPS     10 TB
ASCI Purple    100+ TFLOPS    20 TB

[Plot: performance (TFLOPS), from 1 to 1000, versus calendar year, 1995 to 2010, with each milestone passing through Plan, Develop, and Use phases.]

Figure 4.8 Milestones in the DOE’s Accelerated Strategic Computing Initiative (ASCI) program with extrapolation up to the PFLOPS level.

Slide 85
Performance is Important, But It Isn’t Everything
[Plot: performance (kIPS, MIPS, GIPS, TIPS) versus calendar year, 1980 to 2010, with three curves: absolute processor performance, GP processor performance per Watt, and DSP performance per Watt.]

Figure 25.1 Trend in computational performance per watt of power used in general-purpose processors and DSPs.

Slide 86
