L2-CA-Background and Motivation

Computer Architecture:
Background and Motivation

Reference: Behrooz Parhami, “Computer Architecture: From Microprocessors To Supercomputers”, Oxford
Univ. Press, New York, 2005.
Contents
1. Combinational Digital Circuits
2. Digital Circuits with Memory
3. Computer System Technology
4. Computer Performance
Slide 2
1 Combinational Digital Circuits
1.1 Signals, Logic Operators, and Gates

1.2 Boolean Functions and Expressions
1.3 Designing Gate Networks
1.4 Useful Combinational Parts
1.5 Programmable Combinational Parts
1.6 Timing and Circuit Considerations
Slide 3
1.1 Signals, Logic Operators, and Gates
Name   Operator sign (alternates)   Output is 1 iff:          Arithmetic expression
NOT    x′ (¬x or x̄)                 Input is 0                1 – x
AND    xy (x ∧ y)                   Both inputs are 1s        x × y or xy
OR     x ∨ y (x + y)                At least one input is 1   x + y – xy
XOR    x ⊕ y (x ≢ y)                Inputs are not equal      x + y – 2xy
[Graphical gate symbols appear in the original figure.]
Add: 1 + 1 = 10
Figure 1.1 Some basic elements of digital logic circuits, with operator signs.
Slide 4
The Arithmetic Substitution Method
z′ = 1 – z   NOT converted to arithmetic form
xy   AND same as multiplication (when doing the algebra, set z^k = z)
x ∨ y = x + y – xy   OR converted to arithmetic form
x ⊕ y = x + y – 2xy   XOR converted to arithmetic form
Example: Prove the identity xyz ∨ x′ ∨ y′ ∨ z′ ≡? 1
LHS = [xyz ∨ x′] ∨ [y′ ∨ z′]
= [xyz + 1 – x – (1 – x)xyz] ∨ [1 – y + 1 – z – (1 – y)(1 – z)]
= [xyz + 1 – x] ∨ [1 – yz]
= (xyz + 1 – x) + (1 – yz) – (xyz + 1 – x)(1 – yz)
= 1 + xy²z² – xyz
= 1 = RHS
Note: the + in these arithmetic forms is addition, not logical OR.
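The substitution rules above can also be checked numerically. The following is a minimal sketch, not part of the original slides: the helper names (NOT, AND, OR, XOR, lhs) are chosen here for illustration, and brute-force evaluation over all eight input combinations stands in for the symbolic simplification.

```python
from itertools import product

# Arithmetic forms of the logic operators, valid for operands in {0, 1}.
def NOT(z):
    return 1 - z

def AND(x, y):
    return x * y

def OR(x, y):
    return x + y - x * y

def XOR(x, y):
    return x + y - 2 * x * y

def lhs(x, y, z):
    # [xyz OR x'] OR [y' OR z'], using ordinary integer arithmetic only
    return OR(OR(AND(AND(x, y), z), NOT(x)), OR(NOT(y), NOT(z)))

# The identity holds if the left-hand side is 1 for every input combination.
identity_holds = all(lhs(x, y, z) == 1 for x, y, z in product((0, 1), repeat=3))
print(identity_holds)
```

Because every operand is 0 or 1, the rule z^k = z is applied implicitly by the integer arithmetic.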
Slide 5
Variations in Gate Symbols
AND OR NAND NOR XNOR
Bubble = Inverter
Figure 1.2 Gates with more than two inputs and/or with
inverted signals at input or output.
Slide 6
Gates as Control Elements
[Figure panels: (a) AND gate for controlled transfer, with enable/pass signal e; data out is x or 0. (b) Tristate buffer, with enable/pass signal e; data out is x or “high impedance”. (c) Model for AND switch. (d) Model for tristate buffer.]
Figure 1.3 An AND gate and a tristate buffer act as controlled switches or valves. An inverting buffer is logically the same as a NOT gate.
Slide 7
Wired OR and Bus Connections
[Figure panels: (a) Wired OR of product terms; data out is x, y, z, or 0. (b) Wired OR of tristate outputs; data out is x, y, z, or high impedance.]
Figure 1.4 Wired OR allows tying together of several controlled signals.
Slide 8
Control/Data Signals and Signal Bundles
[Figure panels: (a) 8 NOR gates; (b) 32 AND gates; (c) k XOR gates. Control signals are labeled Enable and Compl; bundle widths are 8, 32, and k.]
Figure 1.5 Arrays of logic gates represented by a single gate symbol.
Slide 9
1.2 Boolean Functions and Expressions
Ways of specifying a logic function
• Truth table: 2^n rows, “don’t-care” in input or output
• Logic expression: w′(x ∨ y ∨ z), product-of-sums, sum-of-products, equivalent expressions
• Word statement: Alarm will sound if the door is opened while the security system is engaged, or when the smoke detector is triggered
• Logic circuit diagram: Synthesis vs analysis
Slide 10
Manipulating Logic Expressions
Table 1.2 Laws (basic identities) of Boolean algebra.
Name of law    OR version                        AND version
Identity       x ∨ 0 = x                         x 1 = x
One/Zero       x ∨ 1 = 1                         x 0 = 0
Idempotent     x ∨ x = x                         x x = x
Inverse        x ∨ x′ = 1                        x x′ = 0
Commutative    x ∨ y = y ∨ x                     x y = y x
Associative    (x ∨ y) ∨ z = x ∨ (y ∨ z)         (x y) z = x (y z)
Distributive   x ∨ (y z) = (x ∨ y) (x ∨ z)       x (y ∨ z) = (x y) ∨ (x z)
DeMorgan’s     (x ∨ y)′ = x′ y′                  (x y)′ = x′ ∨ y′
Slide 11
Proving the Equivalence of Logic Expressions
Example 1.1
• Truth-table method: Exhaustive verification
• Arithmetic substitution
  x ∨ y = x + y – xy
  x ⊕ y = x + y – 2xy
  Example: x ⊕ y ≡? x′y ∨ xy′
  x + y – 2xy =? (1 – x)y + x(1 – y) – (1 – x)yx(1 – y)
• Case analysis: two cases, x = 0 or x = 1
• Logic expression manipulation
Slide 12
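The truth-table method listed on the preceding slide is easy to mechanize. The sketch below is an illustration under assumptions: the function name `equivalent` and the lambda names are hypothetical, and Python booleans stand in for logic signals.

```python
from itertools import product

def equivalent(f, g, num_vars):
    """Exhaustive verification: True if f and g agree on all 2**num_vars rows."""
    return all(f(*row) == g(*row)
               for row in product((False, True), repeat=num_vars))

# x XOR y versus x'y OR xy' (the example on the slide)
xor_direct = lambda x, y: x != y
xor_sop = lambda x, y: (not x and y) or (x and not y)

# DeMorgan's law, OR version: (x OR y)' = x'y'
demorgan_lhs = lambda x, y: not (x or y)
demorgan_rhs = lambda x, y: (not x) and (not y)

print(equivalent(xor_direct, xor_sop, 2))
print(equivalent(demorgan_lhs, demorgan_rhs, 2))
```

The cost grows as 2^n rows, so this approach is practical only for functions of a modest number of variables.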
1.3 Designing Gate Networks
• AND-OR, NAND-NAND, OR-AND, NOR-NOR
• Logic optimization: cost, speed, power dissipation
(a ∨ b ∨ c)′ = a′b′c′
[Figure panels: (a) AND-OR circuit; (b) Intermediate circuit; (c) NAND-NAND equivalent]
Figure 1.6 A two-level AND-OR circuit and two equivalent circuits.
Slide 13
Seven-Segment Display of Decimal Digits
[Figure label: Optional segment]
Figure 1.7 Seven-segment display of decimal digits. The three open segments may be optionally used. The digit 1 can be displayed in two ways, with the more common right-side version shown.
I swear I didn’t use a calculator!
Slide 14
BCD-to-Seven-Segment Decoder
Example 1.2
[Figure: the 4-bit input x3 x2 x1 x0, in [0, 9], feeds logic that produces the signals e0 through e6 to enable or turn on segments 0 through 6.]
Figure 1.8 The logic circuit that generates the enable signal for the lowermost segment (number 3) in a seven-segment display unit.
Slide 15
1.4 Useful Combinational Parts
• High-level building blocks
• Much like prefab parts used in building a house
• Arithmetic components (adders, multipliers, ALUs)
• Here we cover three useful parts: multiplexers, decoders/demultiplexers, encoders
Slide 16
Multiplexers
[Figure panels: (a) 2-to-1 mux; (b) Switch view; (c) Mux symbol; (d) Mux array; (e) 4-to-1 mux with enable; (f) 4-to-1 mux design]
Figure 1.9 Multiplexer (mux), or selector, allows one of several inputs to be selected and routed to output depending on the binary value of a set of selection or address signals provided to it.
Slide 17
Decoders/Demultiplexers
[Figure panels: (a) 2-to-4 decoder; (b) Decoder symbol; (c) Demultiplexer, or decoder with “enable”]
Figure 1.10 A decoder allows the selection of one of 2^a options using an a-bit address as input. A demultiplexer (demux) is a decoder that only selects an output if its enable signal is asserted.
Slide 18
Encoders
[Figure panels: (a) 4-to-2 encoder; (b) Encoder symbol]
Figure 1.11 A 2^a-to-a encoder outputs an a-bit binary number equal to the index of the single 1 among its 2^a inputs.
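The behavior of the three parts in Figures 1.9 to 1.11 can be summarized in a few lines of code. This is a behavioral sketch only, with hypothetical function names and Python lists standing in for signal bundles; it says nothing about the gate-level designs in the figures.

```python
def mux(inputs, select):
    """Multiplexer: route the input chosen by the select value to the output."""
    return inputs[select]

def decoder(address, a, enable=1):
    """a-to-2**a decoder with enable (a demux): one-hot output, all 0s if disabled."""
    return [int(bool(enable) and i == address) for i in range(2 ** a)]

def encoder(lines):
    """2**a-to-a encoder: index of the single 1 among the input lines."""
    if sum(lines) != 1:
        raise ValueError("exactly one input line must be 1")
    return lines.index(1)

print(mux(["x0", "x1", "x2", "x3"], 2))   # selects x2
print(decoder(2, a=2))                    # one-hot pattern for address 2
print(encoder(decoder(3, a=2)))           # encoder undoes the decoder
```

Feeding a decoder's output into an encoder recovers the original address, which is a convenient sanity check on both models.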
Slide 19
1.5 Programmable Combinational Parts
A programmable combinational part can do the job of many gates or gate networks
Programmed by cutting existing connections (fuses) or establishing new connections (antifuses)
• Programmable ROM (PROM)
• Programmable array logic (PAL)
• Programmable logic array (PLA)
Slide 20
PROMs
[Figure panels: (a) Programmable OR gates, with inputs w, x, y, z; (b) Logic equivalent of part a; (c) Programmable read-only memory (PROM), with inputs feeding a decoder and programmable connections to the outputs]
Figure 1.12 Programmable connections and their use in a PROM.
Slide 21
PALs and PLAs
[Figure panels: (a) General programmable combinational logic: inputs, AND array (AND plane), OR array (OR plane), outputs; (b) PAL: programmable AND array, fixed OR array; (c) PLA: programmable AND and OR arrays. Gate labels in the figure: 8-input ANDs, 6-input ANDs, 4-input ORs.]
Figure 1.13 Programmable combinational logic: general structure and two classes known as PAL and PLA devices. Not shown is PROM with fixed AND array (a decoder) and programmable OR array.
Slide 22
1.6 Timing and Circuit Considerations
Changes in gate/circuit output, triggered by changes in its inputs, are not instantaneous
• Gate delay d: from a fraction of a nanosecond to a few nanoseconds
• Wire delay, previously negligible, is now important (electronic signals travel about 15 cm per ns)
• Circuit simulation to verify function and timing
Slide 23
2 Digital Circuits with Memory
2.1 Latches, Flip-Flops, and Registers

2.2 Finite-State Machines
2.3 Designing Sequential Circuits
2.4 Useful Sequential Parts
2.5 Programmable Sequential Parts
2.6 Clocks and Timing of Events
Slide 24
2.1 Latches, Flip-Flops, and Registers
[Figure panels: (a) SR latch; (b) D latch; (c) Master-slave D flip-flop; (d) D flip-flop symbol; (e) k-bit register]
Figure 2.1 Latches, flip-flops, and registers.
Slide 25
Reading and Modifying FFs in the Same Cycle
[Figure: two k-bit registers of flip-flops connected through a computation module (combinational logic); the clock timing diagram marks the propagation delay and the combinational delay.]
Figure 2.3 Register-to-register operation with edge-triggered flip-flops.
Slide 26
2.2 Finite-State Machines
Example 2.1
State table (entries are next states):
Current state   Dime   Quarter   Reset
S00             S10    S25       S00
S10             S20    S35       S00
S20             S30    S35       S00
S25             S35    S35       S00
S30             S35    S35       S00
S35             S35    S35       S00
S00 is the initial state
S35 is the final state
[State diagram: the same transitions drawn as a graph, with Start entering S00 and arcs labeled Dime, Quarter, and Reset.]
Figure 2.4 State table and state diagram for a vending machine coin reception unit.
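The state table of Figure 2.4 translates directly into a lookup structure. The sketch below is illustrative: the dictionary layout, the lowercase input names, and the `run` helper are choices made here, not part of the source.

```python
# Next-state table from Figure 2.4: NEXT_STATE[current][input] -> next state.
NEXT_STATE = {
    "S00": {"dime": "S10", "quarter": "S25", "reset": "S00"},
    "S10": {"dime": "S20", "quarter": "S35", "reset": "S00"},
    "S20": {"dime": "S30", "quarter": "S35", "reset": "S00"},
    "S25": {"dime": "S35", "quarter": "S35", "reset": "S00"},
    "S30": {"dime": "S35", "quarter": "S35", "reset": "S00"},
    "S35": {"dime": "S35", "quarter": "S35", "reset": "S00"},
}

def run(inputs, state="S00"):
    """Apply a sequence of inputs, starting from the initial state S00."""
    for symbol in inputs:
        state = NEXT_STATE[state][symbol]
    return state

print(run(["dime", "dime", "dime"]))   # three dimes
print(run(["quarter", "dime"]))        # reaches the final state
```

Once S35 is reached, only a reset leaves it, which matches its role as the final state.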
Slide 27
Sequential Machine Implementation
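[Figure: the inputs (n bits) and the present state feed the next-state logic, whose next-state excitation signals (l bits) drive the state register; the output logic produces the outputs (m bits). The connection from the inputs to the output logic is marked “Only for Mealy machine”.]
Figure 2.5 Hardware realization of Moore and Mealy sequential machines.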
Slide 28
2.3 Designing Sequential Circuits
Example 2.3
[Figure: inputs q (Quarter in) and d (Dime in) drive the logic for three D flip-flops, FF2, FF1, and FF0; output e; the final state is 1xx.]
Figure 2.7 Hardware realization of a coin reception unit (Example 2.3).
Slide 29
2.4 Useful Sequential Parts
• High-level building blocks
• Here we cover three useful parts: shift register, register file (SRAM basics), counter
Slide 30
Shift Register
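[Figure: a k-bit register of flip-flops fed by a mux, controlled by Shift and Load, that selects between the parallel data in and the k – 1 LSBs combined with the serial data in. Outputs are the parallel data out and the serial data out (MSB). Example register contents: 0 1 0 0 1 1 1 0.]
Figure 2.8 Register with single-bit left shift and parallel load capabilities. For logical left shift, serial data in line is connected to 0.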
Slide 31
Register File and FIFO
[Figure panels: (a) Register file with random access: 2^h k-bit registers, a decoder driven by the write address (h bits) and write enable, write data (k bits), and muxes that select read data 0 and read data 1 (k bits each) using read address 0 and read address 1 (h bits each), with read enable; (b) Graphic symbol for register file; (c) FIFO symbol, with Input and Output (k bits each), Push, Pop, Full, and Empty]
Figure 2.9 Register file with random access and FIFO.

Slide 32
SRAM
[Figure panels: (a) SRAM block diagram: data in and data out (g bits each), address (h bits), write enable, chip select, output enable; (b) SRAM read mechanism: the address drives a row decoder into a square or almost square memory matrix; the selected row goes to a row buffer, and a column mux delivers g bits of data out]
Figure 2.10 SRAM memory is simply a large, single-port register file.
Slide 33
Binary Counter
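[Figure: a count register holding x, an incrementer producing x + 1 (with carry signals c_in and c_out), and a mux controlled by IncrInit that selects the value to be loaded; other signals are Input and Load.]
Figure 2.11 Synchronous binary counter with initialization capability.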
Slide 34
2.5 Programmable Sequential Parts
A programmable sequential part contains gates and memory elements
Programmed by cutting existing connections (fuses) or establishing new connections (antifuses)
• Programmable array logic (PAL)
• Field-programmable gate array (FPGA)
• Both types contain macrocells and interconnects
Slide 35
PAL and FPGA
[Figure panels: (a) Portion of PAL with storable output: 8-input ANDs, a D flip-flop, and muxes; (b) Generic structure of an FPGA: I/O blocks, configurable logic blocks (CLBs), and programmable connections]
Figure 2.12 Examples of programmable sequential logic.

Slide 36
2.6 Clocks and Timing of Events
Clock is a periodic signal: clock rate = clock frequency
The inverse of clock rate is the clock period: 1 GHz ↔ 1 ns
Constraint: Clock period ≥ t_prop + t_comb + t_setup + t_skew
[Figure: FF1 (clocked by Clock1) feeds combinational logic, along with other inputs, and the result is captured by FF2 (clocked by Clock2). The clock period must be wide enough to accommodate worst-case delays between the time FF1 begins to change and the time the change is observed.]
Figure 2.13 Determining the required length of the clock period.
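The clock-period constraint can be turned into a small calculation. In the sketch below, the function names and all four delay values are hypothetical, chosen only to show how the sum bounds the clock rate.

```python
def min_clock_period_ns(t_prop, t_comb, t_setup, t_skew):
    """Smallest clock period (ns) allowed by the constraint on the slide."""
    return t_prop + t_comb + t_setup + t_skew

def max_clock_rate_ghz(period_ns):
    """Clock rate is the inverse of the clock period (1 ns <-> 1 GHz)."""
    return 1.0 / period_ns

# Hypothetical delays in nanoseconds, for illustration only.
delays = dict(t_prop=0.2, t_comb=1.5, t_setup=0.2, t_skew=0.1)
period = min_clock_period_ns(**delays)
print(f"min period = {period:.2f} ns, max rate = {max_clock_rate_ghz(period):.2f} GHz")
```

With these assumed numbers the combinational delay dominates, so shortening the logic path is the most direct way to raise the clock rate.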
Slide 37
Synchronization
[Figure panels: (a) Simple synchronizer: one flip-flop turns the asynch input into a synch version; (b) Two-FF synchronizer: FF1 and FF2 in series; (c) Input and output waveforms: clock, asynch input, synch version]
Figure 2.14 Synchronizers are used to prevent timing problems arising from untimely changes in asynchronous signals.
Slide 38
Level-Sensitive Operation
[Figure: three latches separated by two blocks of combinational logic (each with other inputs); the latches are clocked by φ1, φ2, and φ1 in turn. The timing diagram shows φ1 and φ2 over one clock period: clocks with nonoverlapping highs.]
Figure 2.15 Two-phase clocking with nonoverlapping clock signals.
Slide 39
3 Computer System Technology
Interplay between architecture, hardware, and software
• Architectural innovations influence technology
• Technological advances drive changes in architecture
3.1 From Components to Applications

3.2 Computer Systems and Their Parts
3.3 Generations of Progress
3.4 Processor and Memory Technologies
3.5 Peripherals, I/O, and Communications
3.6 Software Systems and Applications
Slide 40
3.1 From Components to Applications
[Figure: a spectrum from software (high-level view) to hardware (low-level view): application domains, application designer, system designer, computer designer, logic designer, circuit designer, electronic components. Computer architecture and computer organization span the middle of the spectrum.]
Figure 3.1 Subfields or views in computer system engineering.
Slide 41
What Is (Computer) Architecture?
[Figure: the architect sits where two interfaces cross. Goals: client’s requirements (function, cost, . . .) and client’s taste (mood, style, . . .). Means: construction technology (material, codes, . . .) and the world of arts (aesthetics, trends, . . .). Engineering lies on one side, arts on the other.]
Figure 3.2 Like a building architect, whose place at the engineering/arts and goals/means interfaces is seen in this diagram, a computer architect reconciles many conflicting or competing demands.
Slide 42
3.2 Computer Systems and Their Parts
[Figure: classification tree. Computer: analog or digital; fixed-function or stored-program; electronic or nonelectronic; general-purpose or special-purpose; number cruncher or data manipulator.]
Figure 3.3 The space of computer systems, with what we normally mean by the word “computer” highlighted.
Slide 43
Price/Performance Pyramid
Class         Price range
Super         $Millions
Mainframe     $100s Ks
Server        $10s Ks
Workstation   $1000s
Personal      $100s
Embedded      $10s
Differences in scale, not in substance
Figure 3.4 Classifying computers by computational power and price range.
Slide 44
Automotive Embedded Computers
[Figure: a car with a central controller connected to impact sensors, brakes, airbags, engine, and navigation & entertainment.]
Figure 3.5 Embedded computers are ubiquitous, yet invisible. They are found in our automobiles, appliances, and many other places.
Slide 45
Personal Computers and Workstations
Figure 3.6 Notebooks, a common class of portable computers, are much smaller than desktops but offer substantially the same capabilities. What are the main reasons for the size difference?
Slide 46
Digital Computer Subsystems
[Figure: memory; processor (CPU), made up of control and datapath; link (to/from network); input/output (I/O), made up of input and output.]
Figure 3.7 The (three, four, five, or) six main units of a digital computer. Usually, the link unit (a simple bus or a more elaborate network) is not explicitly included in such diagrams.
Slide 47
3.3 Generations of Progress
Table 3.2 The 5 generations of digital computers, and their ancestors.
Generation (begun)   Processor technology   Memory innovations   I/O devices introduced          Dominant look & feel
0 (1600s)            (Electro-)mechanical   Wheel, card          Lever, dial, punched card       Factory equipment
1 (1950s)            Vacuum tube            Magnetic drum        Paper tape, magnetic tape       Hall-size cabinet
2 (1960s)            Transistor             Magnetic core        Drum, printer, text terminal    Room-size mainframe
3 (1970s)            SSI/MSI                RAM/ROM chip         Disk, keyboard, video monitor   Desk-size mini
4 (1980s)            LSI/VLSI               SRAM/DRAM            Network, CD, mouse, sound       Desktop/laptop micro
5 (1990s)            ULSI/GSI/WSI, SOC      SDRAM, flash         Sensor/actuator, point/click    Invisible, embedded
Slide 48
IC Production and Yield
[Figure: silicon crystal ingot, slicer, blank wafer, processing (20-30 steps), patterned wafer with defects (100s of simple or scores of complex processors), dicer, die (~1 cm), die tester, good die, mounting, microchip or other part (~1 cm), part tester, usable part to ship. Dimension labels on the ingot and wafer: 15-30 cm, 30-60 cm, 0.2 cm.]
Figure 3.8 The manufacturing process for an IC part.

Slide 49
Effect of Die Size on Yield
[Figure: two wafers, one with 120 dies, 109 good; the other with 26 dies, 15 good.]
Figure 3.9 Visualizing the dramatic decrease in yield with larger dies.
Die yield =def (number of good dies) / (total number of dies)
Die yield = Wafer yield × [1 + (Defect density × Die area) / a]^(–a)
Die cost = (cost of wafer) / (total number of dies × die yield)
= (cost of wafer) × (die area / wafer area) / (die yield)
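The yield and cost formulas above can be exercised with a short script. This is a sketch under assumptions: the function names are hypothetical, and only the two die counts from Figure 3.9 come from the source.

```python
def die_yield(wafer_yield, defect_density, die_area, a):
    """Die yield = wafer yield x [1 + (defect density x die area) / a]^(-a)."""
    return wafer_yield * (1.0 + defect_density * die_area / a) ** (-a)

def die_cost(wafer_cost, wafer_area, die_area, yield_fraction):
    """Die cost = wafer cost x (die area / wafer area) / die yield."""
    return wafer_cost * (die_area / wafer_area) / yield_fraction

# Measured yields from the two wafers in Figure 3.9 (good dies / total dies).
small_die_yield = 109 / 120
large_die_yield = 15 / 26
print(f"small dies: {small_die_yield:.2f}, large dies: {large_die_yield:.2f}")
```

Because yield falls as die area grows while the number of dies per wafer also falls, die cost rises faster than linearly with die area.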
Slide 50
3.4 Processor and Memory Technologies
[Figure panels: (a) 2D or 2.5D packaging now common: dies (CPU, memory) mounted on PC boards that plug through connectors into a backplane carrying the bus; (b) 3D packaging of the future: stacked layers glued together, with interlayer connections deposited on the outside of the stack]
Figure 3.11 Packaging of processor, memory, and other components.
Slide 51
[Plot: processor performance (kIPS to TIPS) and memory chip capacity (kb to Tb) versus calendar year, 1980 to 2010. Processor points: 68000, 80286, 80386, 80486, 68040, Pentium, Pentium II, R10000; trend ×1.6 / yr, ×2 / 18 mos, ×10 / 5 yrs. Memory points: 64kb, 256kb, 1Mb, 4Mb, 16Mb, 64Mb, 256Mb, 1Gb; trend ×4 / 3 yrs.]
Figure 3.10 Trends in processor performance and DRAM memory chip capacity (Moore’s law).
Slide 52
Pitfalls of Computer Technology Forecasting
“DOS addresses only 1 MB of RAM because we cannot
imagine any applications needing more.” Microsoft, 1980
“640K ought to be enough for anybody.” Bill Gates, 1981
“Computers in the future may weigh no more than 1.5
tons.” Popular Mechanics
“I think there is a world market for maybe five
computers.” Thomas Watson, IBM Chairman, 1943
“There is no reason anyone would want a computer in
their home.” Ken Olsen, DEC founder, 1977
“The 32-bit machine would be an overkill for a personal
computer.” Sol Libes, ByteLines
Slide 53
3.5 Input/Output and Communications
[Figure panels: (a) Cutaway view of a hard disk drive (typically 2-9 cm); (b) Some removable storage media: floppy disk, CD-ROM, magnetic tape cartridge]
Figure 3.12 Magnetic and optical disk memory units.
Slide 54
[Plot: bandwidth (b/s), from 10^3 to 10^12, versus latency (s), from 10^–9 (ns) to 10^3. Communication technologies shown: processor bus, I/O network, system-area network (SAN), local-area network (LAN), metro-area network (MAN), wide-area network (WAN), grouped into “same geographic location” and “geographically distributed”.]
Figure 3.13 Latency and bandwidth characteristics of different classes of communication links.
Slide 55
3.6 Software Systems and Applications
Software
  Application: word processor, spreadsheet, circuit simulator, . . .
  System
    Operating system
      Manager: virtual memory, security, file system, . . .
      Enabler: disk driver, display driver, printing, . . .
      Coordinator: scheduling, load balancing, diagnostics, . . .
    Translator: MIPS assembler, C compiler, . . .
Figure 3.15 Categorization of software, with examples in each class.
Slide 56
High- vs Low-Level Programming
Toward the high-level end: more abstract, machine-independent; easier to write, read, debug, or maintain
Toward the low-level end: more concrete, machine-specific, error-prone; harder to write, read, debug, or maintain
Levels and translators: very high-level language objectives or tasks (Interpreter), high-level language statements (Compiler), assembly language instructions, mnemonic (Assembler), machine language instructions, binary (hex)

Very high-level language:   Swap v[i] and v[i+1]
High-level language:        temp=v[i]
                            v[i]=v[i+1]
                            v[i+1]=temp
Assembly language   Machine language (hex)
add $2,$5,$5        00a51020
add $2,$2,$2        00421020
add $2,$4,$2        00821020
lw $15,0($2)        8c620000
lw $16,4($2)        8cf20004
sw $16,0($2)        acf20000
sw $15,4($2)        ac620004
jr $31              03e00008

One task = many statements; one statement = several instructions; assembly to machine language is mostly one-to-one
Figure 3.14 Models and abstractions in programming.

Slide 57
4 Computer Performance
Performance is key in design decisions; also cost and power
• It has been a driving force for innovation
• Isn’t quite the same as speed (higher clock rate)
4.1 Cost, Performance, and Cost/Performance

4.2 Defining Computer Performance
4.3 Performance Enhancement and Amdahl’s Law
4.4 Performance Measurement vs Modeling
4.5 Reporting Computer Performance
4.6 The Quest for Higher Performance
Slide 58
4.1 Cost, Performance, and Cost/Performance
[Plot: computer cost, from $1 to $1 G, versus calendar year, 1960 to 2020.]
Slide 59
Cost/Performance
[Plot: performance versus cost, with three curves: superlinear (economy of scale), linear (ideal?), and sublinear (diminishing returns).]
Figure 4.1 Performance improvement as a function of cost.
Slide 60
4.2 Defining Computer Performance
[Figure: a pipeline with Input, Processing, and Output stages, drawn for a CPU-bound task and for an I/O-bound task.]
Figure 4.2 Pipeline analogy shows that imbalance between processing power and I/O capabilities leads to a performance bottleneck.
Slide 61
Six Passenger Aircraft to Be Compared
B 747
DC-8-50
Slide 62
Performance of Aircraft: An Analogy
Table 4.1 Key characteristics of six passenger aircraft: all figures are approximate; some relate to a specific model/configuration of the aircraft or are averages of cited range of values.
Aircraft      Passengers   Range (km)   Speed (km/h)   Price ($M)
Airbus A310   250          8 300        895            120
Boeing 747    470          6 700        980            200
Boeing 767    250          12 300       885            120
Boeing 777    375          7 450        980            180
Concorde      130          6 400        2 200          350
DC-8-50       145          14 000       875            80
Speed of sound ≈ 1220 km / h
Slide 63
Different Views of Performance
Performance from the viewpoint of a passenger: Speed
Note, however, that flight time is but one part of total travel time.
Also, if the travel distance exceeds the range of a faster plane, a slower plane may be better due to not needing a refueling stop.
Performance from the viewpoint of an airline: Throughput
Measured in passenger-km per hour (relevant if ticket price were proportional to distance traveled, which in reality it is not)
Airbus A310   250 × 895 = 0.224 M passenger-km/hr
Boeing 747    470 × 980 = 0.461 M passenger-km/hr
Concorde      130 × 2200 = 0.286 M passenger-km/hr
DC-8-50       145 × 875 = 0.127 M passenger-km/hr
Performance from the viewpoint of FAA: Safety
Slide 64
Cost Effectiveness: Cost/Performance
Table 4.1 Key characteristics of six passenger aircraft: all figures are approximate; some relate to a specific model/configuration of the aircraft or are averages of cited range of values.
Larger values are better for throughput; smaller values are better for cost/performance.
Aircraft   Passengers   Range (km)   Speed (km/h)   Price ($M)   Throughput (M P km/hr)   Cost / Performance
A310       250          8 300        895            120          0.224                    536
B 747      470          6 700        980            200          0.461                    434
B 767      250          12 300       885            120          0.221                    543
B 777      375          7 450        980            180          0.368                    489
Concorde   130          6 400        2 200          350          0.286                    1224
DC-8-50    145          14 000       875            80           0.127                    630
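The throughput and cost/performance columns follow from the passenger count, speed, and price. The sketch below recomputes them; the dictionary layout and function names are illustrative choices, and small differences from the table arise because the table divides by throughput values already rounded to three decimals.

```python
# (passengers, speed in km/h, price in $M), copied from Table 4.1
AIRCRAFT = {
    "A310":     (250, 895, 120),
    "B 747":    (470, 980, 200),
    "B 767":    (250, 885, 120),
    "B 777":    (375, 980, 180),
    "Concorde": (130, 2200, 350),
    "DC-8-50":  (145, 875, 80),
}

def throughput(passengers, speed_kmh):
    """Throughput in millions of passenger-km per hour."""
    return passengers * speed_kmh / 1e6

def cost_performance(price_musd, thr):
    """Price ($M) per unit of throughput; smaller is better."""
    return price_musd / thr

for name, (passengers, speed, price) in AIRCRAFT.items():
    thr = throughput(passengers, speed)
    print(f"{name:9s} {thr:.3f} {cost_performance(price, thr):.0f}")
```

By this measure the Boeing 747 is the most cost-effective of the six and the Concorde the least, as in the table.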
Slide 65
Concepts of Performance and Speedup
Performance = 1 / Execution time is simplified to
Performance = 1 / CPU execution time
(Performance of M1) / (Performance of M2) = Speedup of M1 over M2
= (Execution time of M2) / (Execution time of M1)
Terminology: M1 is x times as fast as M2 (e.g., 1.5 times as fast)
M1 is 100(x – 1)% faster than M2 (e.g., 50% faster)
CPU time = Instructions × (Cycles per instruction) × (Secs per cycle)
= Instructions × CPI / (Clock rate)
Instruction count, CPI, and clock rate are not completely independent, so improving one by a given factor may not lead to overall execution time improvement by the same factor.
Slide 66
Elaboration on the CPU Time Formula
CPU time = Instructions × (Cycles per instruction) × (Secs per cycle)
= Instructions × Average CPI / (Clock rate)
Instructions: Number of instructions executed, not number of instructions in our program (dynamic count)
Average CPI: Is calculated based on the dynamic instruction mix and knowledge of how many clock cycles are needed to execute various instructions (or instruction classes)
Clock rate: 1 GHz = 10^9 cycles / s (cycle time 10^–9 s = 1 ns)
200 MHz = 200 × 10^6 cycles / s (cycle time = 5 ns)
[Figure label: Clock period]
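The CPU time formula and the speedup terminology can be written out as small functions. This is a sketch with hypothetical names and made-up workload numbers (10^9 instructions, CPI of 2, two clock rates); it also assumes, for simplicity, that instruction count and CPI stay fixed when the clock rate changes, which the slides caution is not always the case.

```python
def cpu_time(instructions, average_cpi, clock_rate_hz):
    """CPU time (s) = instructions x average CPI / clock rate."""
    return instructions * average_cpi / clock_rate_hz

def speedup(time_m2, time_m1):
    """Speedup of M1 over M2 = execution time of M2 / execution time of M1."""
    return time_m2 / time_m1

def percent_faster(x):
    """'x times as fast' expressed as '100(x - 1)% faster'."""
    return 100.0 * (x - 1.0)

# Hypothetical workload on two hypothetical machines.
t_m1 = cpu_time(1e9, 2.0, 1.5e9)   # 1.5 GHz clock
t_m2 = cpu_time(1e9, 2.0, 1.0e9)   # 1 GHz clock
x = speedup(t_m2, t_m1)
print(f"M1 is {x:.2f} times as fast as M2, i.e. {percent_faster(x):.0f}% faster")
```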
Slide 67
Dynamic Instruction Count
How many instructions are executed in this program fragment?
Each “for” consists of two instructions: increment index, check exit condition

250 instructions
for i = 1, 100 do
    20 instructions
    for j = 1, 100 do
        40 instructions
        for k = 1, 100 do
            10 instructions
        endfor
    endfor
endfor

k loop: 2 + 10 instructions per iteration, 100 iterations, 1200 instructions in all
j loop: 2 + 40 + 1200 instructions per iteration, 100 iterations, 124,200 instructions in all
i loop: 2 + 20 + 124,200 instructions per iteration, 100 iterations, 12,422,200 instructions in all
Total: 12,422,450 instructions
[Side labels in the original figure: for i = 1, n; while x > 0]
Static count = 326
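The counts on this slide can be reproduced by working outward from the innermost loop. The helper name `loop_total` is hypothetical; the two-instruction loop overhead is the one stated on the slide.

```python
LOOP_OVERHEAD = 2  # each "for": increment index, check exit condition

def loop_total(body_instructions, iterations=100):
    """Dynamic instruction count of one loop, including per-iteration overhead."""
    return iterations * (LOOP_OVERHEAD + body_instructions)

k_loop = loop_total(10)                 # innermost loop
j_loop = loop_total(40 + k_loop)        # middle loop contains the k loop
i_loop = loop_total(20 + j_loop)        # outer loop contains the j loop
dynamic_count = 250 + i_loop            # plus the straight-line prologue

# Static count: every instruction counted once, regardless of iterations.
static_count = 250 + (2 + 20) + (2 + 40) + (2 + 10)

print(k_loop, j_loop, i_loop, dynamic_count, static_count)
```

The gap between the static count (hundreds) and the dynamic count (millions) is why the CPU time formula uses the number of instructions executed.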
Slide 68
Faster Clock ≠ Shorter Running Time
Suppose addition takes 1 ns
Clock period = 1 ns; 1 cycle
Clock period = ½ ns; 2 cycles
In this example, addition time does not improve in going from 1 GHz to 2 GHz clock
[Figure: a path to “Solution” covered in 4 steps or in 20 steps; clock labels 1 GHz and 2 GHz.]
Figure 4.3 Faster steps do not necessarily mean shorter travel time.
Slide 69
4.3 Performance Enhancement: Amdahl’s Law
[Plot: speedup s (0 to 50) versus enhancement factor p (0 to 50), with curves for f = 0, 0.01, 0.02, 0.05, and 0.1.]
f = fraction unaffected
p = speedup of the rest
s = 1 / [f + (1 – f)/p] ≤ min(p, 1/f)
Figure 4.4 Amdahl’s law: speedup achieved if a fraction f of a task is unaffected and the remaining 1 – f part runs p times as fast.
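Amdahl's formula and its upper bound can be checked with a few lines of code. The function name is a hypothetical choice, and the sample points are taken from the f values plotted in Figure 4.4.

```python
def amdahl_speedup(f, p):
    """Speedup when a fraction f is unaffected and the rest runs p times as fast."""
    return 1.0 / (f + (1.0 - f) / p)

for f in (0.01, 0.02, 0.05, 0.1):
    s = amdahl_speedup(f, 50)
    # The speedup can exceed neither the enhancement factor nor 1/f.
    print(f"f = {f}: speedup at p = 50 is {s:.1f} (bound {min(50, 1 / f):.0f})")
```

Even a small unaffected fraction limits the benefit: with f = 0.1, no enhancement factor can push the speedup past 10.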
Slide 70
Amdahl’s Law Used in Design
Example 4.1
A processor spends 30% of its time on flp addition, 25% on flp mult,
and 10% on flp division. Evaluate the following enhancements, each
costing the same to implement:
a. Redesign of the flp adder to make it twice as fast.

b. Redesign of the flp multiplier to make it three times as fast.
c. Redesign the flp divider to make it 10 times as fast.
Solution
a. Adder redesign speedup = 1 / [0.7 + 0.3 / 2] = 1.18

b. Multiplier redesign speedup = 1 / [0.75 + 0.25 / 3] = 1.20
c. Divider redesign speedup = 1 / [0.9 + 0.1 / 10] = 1.10
What if both the adder and the multiplier are redesigned?
Slide 71
Amdahl’s Law Used in Management
Example 4.2
Members of a university research group frequently visit the library.
Each library trip takes 20 minutes. The group decides to subscribe
to a handful of publications that account for 90% of the library trips;
access time to these publications is reduced to 2 minutes.
a. What is the average speedup in access to publications?

b. If the group has 20 members, each making two weekly trips to
the library, what is the justifiable expense for the subscriptions?
Assume 50 working weeks/yr and $25/h for a researcher’s time.
Solution
a. Speedup in publication access time = 1 / [0.1 + 0.9 / 10] = 5.26
b. Time saved = 20 × 2 × 50 × 0.9 × (20 – 2) = 32,400 min = 540 h
Cost recovery = 540 × $25 = $13,500 = Max justifiable expense
Slide 72
4.4 Performance Measurement vs Modeling
[Plot: execution time of programs A through F on Machine 1, Machine 2, and Machine 3.]
Figure 4.5 Running times of six programs on three machines.
Slide 73
Generalized Amdahl’s Law
Original running time of a program = 1 = f1 + f2 + . . . + fk
New running time after the fraction fi is speeded up by a factor pi:
f1/p1 + f2/p2 + . . . + fk/pk
Speedup formula:
S = 1 / (f1/p1 + f2/p2 + . . . + fk/pk)
If a particular fraction is slowed down rather than speeded up, use sj fj instead of fj / pj, where sj > 1 is the slowdown factor
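The generalized formula handles several enhancements at once. As a sketch (the function name and the list-of-pairs input format are assumptions made here), it can answer the question left open in Example 4.1 about redesigning both the adder and the multiplier; a slowdown by a factor s is expressed as a speedup factor of 1/s.

```python
def generalized_speedup(parts):
    """parts: list of (fraction, factor) pairs; factor < 1 means a slowdown."""
    if abs(sum(f for f, _ in parts) - 1.0) > 1e-9:
        raise ValueError("fractions must add up to 1")
    return 1.0 / sum(f / p for f, p in parts)

# Example 4.1 fractions: 30% flp addition, 25% flp multiplication, 45% everything else.
both = generalized_speedup([(0.30, 2), (0.25, 3), (0.45, 1)])
print(f"adder and multiplier both redesigned: speedup = {both:.2f}")
```

The combined speedup is smaller than the product of the two individual speedups because each enhancement applies to a different fraction of the original running time.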
Slide 74
Performance Benchmarks
Example 4.3
You are an engineer at Outtel, a start-up aspiring to compete with Intel
via its new processor design that outperforms the latest Intel processor
by a factor of 2.5 on floating-point instructions. This level of performance
was achieved by design compromises that led to a 20% increase in the
execution time of all other instructions. You are in charge of choosing
benchmarks that would showcase Outtel’s performance edge.
a. What is the minimum required fraction f of time spent on floating-point
instructions in a program on the Intel processor to show a speedup of
2 or better for Outtel?
Solution
a. We use a generalized form of Amdahl’s formula in which a fraction f is speeded up by a given factor (2.5) and the rest is slowed down by another factor (1.2): 1 / [1.2(1 – f) + f / 2.5] ≥ 2 ⇒ f ≥ 0.875
Slide 75
Performance Estimation
Average CPI = Σ over all instruction classes of (Class-i fraction) × (Class-i CPI)
Machine cycle time = 1 / Clock rate
CPU execution time = Instructions × (Average CPI) / (Clock rate)
Table 4.3 Usage frequency, in percentage, for various instruction classes in four representative applications.
Instr’n class    Data compression   C language compiler   Reactor simulation   Atomic motion modeling
A: Load/Store    25                 37                    32                   37
B: Integer       32                 28                    17                   5
C: Shift/Logic   16                 13                    2                    1
D: Float         0                  0                     34                   42
E: Branch        19                 13                    9                    10
F: All others    8                  9                     6                    4
Slide 76
CPI and IPS Calculations
Example 4.4 (2 of 5 parts)
Consider two implementations M1 (600 MHz) and M2 (500 MHz) of
an instruction set containing three classes of instructions:
Class CPI for M1 CPI for M2 Comments
F 5.0 4.0 Floating-point
I 2.0 3.8 Integer arithmetic
N 2.4 2.0 Nonarithmetic
a. What are the peak performances of M1 and M2 in MIPS?
b. If 50% of instructions executed are class-N, with the rest divided
equally among F and I, which machine is faster? By what factor?
Solution
a. Peak MIPS for M1 = 600 / 2.0 = 300; for M2 = 500 / 2.0 = 250
b. Average CPI for M1 = 5.0 / 4 + 2.0 / 4 + 2.4 / 2 = 2.95;
for M2 = 4.0 / 4 + 3.8 / 4 + 2.0 / 2 = 2.95 ⇒ M1 is faster; factor 1.2 (the average CPIs are equal, so the clock-rate ratio 600 / 500 decides)
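The two parts of Example 4.4 can be recomputed from the table. The data layout and function names below are illustrative; the clock rates, CPIs, and instruction mix are those given in the example.

```python
CLOCK_MHZ = {"M1": 600, "M2": 500}
CPI = {
    "M1": {"F": 5.0, "I": 2.0, "N": 2.4},
    "M2": {"F": 4.0, "I": 3.8, "N": 2.0},
}
MIX = {"F": 0.25, "I": 0.25, "N": 0.50}   # part b: half class-N, rest split equally

def peak_mips(machine):
    """Peak rate: every instruction is from the class with the smallest CPI."""
    return CLOCK_MHZ[machine] / min(CPI[machine].values())

def average_cpi(machine):
    return sum(MIX[c] * CPI[machine][c] for c in MIX)

def mips(machine):
    return CLOCK_MHZ[machine] / average_cpi(machine)

print(peak_mips("M1"), peak_mips("M2"))
print(f"M1 over M2: {mips('M1') / mips('M2'):.2f}")
```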
Slide 77
MIPS Rating Can Be Misleading
Example 4.5
Two compilers produce machine code for a program on a machine
with two classes of instructions. Here are the number of instructions:
Class CPI Compiler 1 Compiler 2
A 1 600M 400M
B 2 400M 400M
a. What are run times of the two programs with a 1 GHz clock?
b. Which compiler produces faster code and by what factor?
c. Which compiler’s output runs at a higher MIPS rate?
Solution
a. Running time for compiler 1 = (600M × 1 + 400M × 2) / 10^9 = 1.4 s; for compiler 2 = (400M × 1 + 400M × 2) / 10^9 = 1.2 s
b. Compiler 2’s output runs 1.4 / 1.2 = 1.17 times as fast
c. MIPS rating for compiler 1 (CPI = 1.4) = 1000 / 1.4 = 714; for compiler 2 (CPI = 1.5) = 1000 / 1.5 = 667
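Example 4.5 is short enough to recompute directly, which makes the mismatch between running time and MIPS rating easy to see. The names below are illustrative; the instruction counts, CPIs, and 1 GHz clock are those of the example.

```python
CLASS_CPI = {"A": 1, "B": 2}
COUNTS = {
    1: {"A": 600e6, "B": 400e6},   # compiler 1
    2: {"A": 400e6, "B": 400e6},   # compiler 2
}
CLOCK_HZ = 1e9

def run_time(compiler):
    cycles = sum(n * CLASS_CPI[c] for c, n in COUNTS[compiler].items())
    return cycles / CLOCK_HZ

def mips_rating(compiler):
    instructions = sum(COUNTS[compiler].values())
    return instructions / run_time(compiler) / 1e6

for c in (1, 2):
    print(f"compiler {c}: {run_time(c):.1f} s, {mips_rating(c):.0f} MIPS")
```

Compiler 2's code finishes sooner yet earns the lower MIPS rating, because it executes fewer of the cheap class-A instructions.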
Slide 78
4.5 Reporting Computer Performance
Table 4.4 Measured or estimated execution times for three programs.
               Time on machine X   Time on machine Y   Speedup of Y over X
Program A      20                  200                 0.1
Program B      1000                100                 10.0
Program C      1500                150                 10.0
All 3 prog’s   2520                450                 5.6
Analogy: If a car is driven to a city 100 km away at 100 km/hr and returns at 50 km/hr, the average speed is not (100 + 50) / 2 but is obtained from the fact that it travels 200 km in 3 hours.
Slide 79
Comparing the Overall Performance
Table 4.4 Measured or estimated execution times for three programs.
                  Time on machine X   Time on machine Y   Speedup of Y over X   Speedup of X over Y
Program A         20                  200                 0.1                   10
Program B         1000                100                 10.0                  0.1
Program C         1500                150                 10.0                  0.1
Arithmetic mean                                           6.7                   3.4
Geometric mean                                            2.15                  0.46
Geometric mean does not yield a measure of overall speedup, but provides an indicator that at least moves in the right direction
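The means in the table can be recomputed to see why the arithmetic mean is misleading here. This sketch uses `statistics.geometric_mean`, which is available in Python 3.8 and later; the variable names are illustrative.

```python
from statistics import geometric_mean, mean

time_x = [20, 1000, 1500]    # programs A, B, C on machine X
time_y = [200, 100, 150]     # programs A, B, C on machine Y

y_over_x = [tx / ty for tx, ty in zip(time_x, time_y)]   # speedup of Y over X
x_over_y = [ty / tx for tx, ty in zip(time_x, time_y)]   # speedup of X over Y

print(f"arithmetic means: {mean(y_over_x):.1f} and {mean(x_over_y):.1f}")
print(f"geometric means:  {geometric_mean(y_over_x):.2f} and {geometric_mean(x_over_y):.2f}")
print(f"total-time speedup of Y over X: {sum(time_x) / sum(time_y):.1f}")
```

The arithmetic means suggest that each machine is faster than the other, while the two geometric means are reciprocals of each other and therefore at least agree on the direction.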
Slide 80
Effect of Instruction Mix on Performance
Example 4.6 (1 of 3 parts)
Consider two applications DC and RS and two machines M1 and M2:
Class Data Comp. Reactor Sim. M1’s CPI M2’s CPI
A: Ld/Str 25% 32% 4.0 3.8
B: Integer 32% 17% 1.5 2.5
C: Sh/Logic 16% 2% 1.2 1.2
D: Float 0% 34% 6.0 2.6
E: Branch 19% 9% 2.5 2.2
F: Other 8% 6% 2.0 2.3
a. Find the effective CPI for the two applications on both machines.
Solution
a. CPI of DC on M1: 0.25 × 4.0 + 0.32 × 1.5 + 0.16 × 1.2 + 0 × 6.0 + 0.19 × 2.5 + 0.08 × 2.0 = 2.31
DC on M2: 2.54   RS on M1: 3.94   RS on M2: 2.89
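All four effective CPI values in Example 4.6 come from the same weighted sum. The sketch below recomputes them; the list layout (classes A through F in order) and the function name are choices made for illustration.

```python
# Instruction-class fractions (A..F) for the two applications.
MIX = {
    "DC": [0.25, 0.32, 0.16, 0.00, 0.19, 0.08],   # data compression
    "RS": [0.32, 0.17, 0.02, 0.34, 0.09, 0.06],   # reactor simulation
}
# Per-class CPIs (A..F) for the two machines.
CPI = {
    "M1": [4.0, 1.5, 1.2, 6.0, 2.5, 2.0],
    "M2": [3.8, 2.5, 1.2, 2.6, 2.2, 2.3],
}

def effective_cpi(app, machine):
    """Weighted sum of class CPIs, using the application's instruction mix."""
    return sum(f * c for f, c in zip(MIX[app], CPI[machine]))

for app in MIX:
    for machine in CPI:
        print(f"{app} on {machine}: {effective_cpi(app, machine):.2f}")
```

The floating-point-heavy reactor simulation favors M2, whose Float CPI is much lower, while data compression, with no floating-point instructions, favors M1.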
Slide 81
4.6 The Quest for Higher Performance
State of available computing power ca. the early 2000s:
Gigaflops on the desktop
Teraflops in the supercomputer center
Petaflops on the drawing board
Note on terminology (see Table 3.1)
Prefixes for large units:
Kilo = 10^3, Mega = 10^6, Giga = 10^9, Tera = 10^12, Peta = 10^15
For memory:
K = 2^10 = 1024, M = 2^20, G = 2^30, T = 2^40, P = 2^50
Prefixes for small units:
micro = 10^–6, nano = 10^–9, pico = 10^–12, femto = 10^–15
Slide 82
Performance Trends and Obsolescence
[Plot: processor performance (kIPS to TIPS) and memory chip capacity (kb to Tb) versus calendar year, 1980 to 2010, as in the earlier Figure 3.10 slide.]
“Can I call you back? We just bought a new computer and we’re trying to set it up before it’s obsolete.”
Figure 3.10 Trends in processor performance and DRAM memory chip capacity (Moore’s law).
Slide 83
[Plot: supercomputer performance (MFLOPS to PFLOPS) versus calendar year, 1980 to 2010. Vector supercomputers: Cray X-MP, Y-MP. Massively parallel processors: CM-2, CM-5, $30M MPPs, $240M MPPs.]
Figure 4.7 Exponential growth of supercomputer performance.
Slide 84
The Most Powerful Computers
Figure 4.8 Milestones in the DOE’s Accelerated Strategic Computing
Initiative (ASCI) program with extrapolation up to the PFLOPS level.
[Plot: performance (TFLOPS, 1 to 1000) versus calendar year, 1995 to
2010, with Plan, Develop, and Use phases for each machine.
ASCI Red: 1+ TFLOPS, 0.5 TB
ASCI Blue: 3+ TFLOPS, 1.5 TB
ASCI White: 10+ TFLOPS, 5 TB
ASCI Q: 30+ TFLOPS, 10 TB
ASCI Purple: 100+ TFLOPS, 20 TB]
Slide 85
Performance is Important, But It Isn’t Everything
Figure 25.1 Trend in computational performance per watt of power
used in general-purpose processors and DSPs.
[Plot: performance (kIPS to TIPS) versus calendar year, 1980 to 2010.
Curves: absolute processor performance, GP processor performance
per Watt, DSP performance per Watt.]
Slide 86

