L2-CA-Background and Motivation
Slide 2
1 Combinational Digital Circuits
Slide 3
1.1 Signals, Logic Operators, and Gates
Name                              NOT           AND             OR           XOR
Operator sign and alternate(s)    x′ (or x̄)     x ∧ y (or xy)   x ∨ y        x ⊕ y
[Graphical gate symbols appear in the original figure]
Slide 4
The Arithmetic Substitution Method
z′ = 1 − z                NOT converted to arithmetic form
x ∧ y = xy                AND is the same as multiplication
                          (when doing the algebra, set z^k = z)
x ∨ y = x + y − xy        OR converted to arithmetic form
x ⊕ y = x + y − 2xy       XOR converted to arithmetic form

Example:
LHS = [xyz ∨ x′] ∨ [y′ ∨ z′]
    = [xyz + 1 − x − (1 − x)xyz] ∨ [1 − y + 1 − z − (1 − y)(1 − z)]
    = [xyz + 1 − x] ∨ [1 − yz]
    = (xyz + 1 − x) + (1 − yz) − (xyz + 1 − x)(1 − yz)     ← this "+" is arithmetic addition, not logical OR
    = 1 + xy²z² − xyz
    = 1 = RHS
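The same identity can also be checked by brute force. A minimal Python sketch (helper names are illustrative) that substitutes every 0/1 combination into the arithmetic forms:

```python
from itertools import product

# Arithmetic forms of the logic operators (inputs restricted to 0 and 1)
def NOT(a): return 1 - a
def AND(a, b): return a * b
def OR(a, b): return a + b - a * b

def lhs(x, y, z):
    # [xyz OR x'] OR [y' OR z']
    return OR(OR(AND(AND(x, y), z), NOT(x)), OR(NOT(y), NOT(z)))

# Exhaustive check over all eight 0/1 input combinations
assert all(lhs(x, y, z) == 1 for x, y, z in product((0, 1), repeat=3))
print("LHS = 1 for every input combination")
```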
Slide 5
Variations in Gate Symbols
Bubble = Inverter
Figure 1.2 Gates with more than two inputs and/or with
inverted signals at input or output.
Slide 6
Gates as Control Elements
[Figure panels: an AND gate and a tristate buffer with enable e — when e = 0 the outputs are 0 and "no data," respectively; when e = 1 both pass x through. (c) Model for AND switch. (d) Model for tristate buffer.]
Figure 1.3 An AND gate and a tristate buffer act as controlled switches
or valves. An inverting buffer is logically the same as a NOT gate.
Slide 7
Wired OR and Bus Connections
[Figure: wired OR of AND-gated signals ex·x, ey·y, ez·z gives data out (x, y, z, or 0); the same connection built with tristate buffers gives data out (x, y, z, or high impedance) on a shared bus.]
Slide 8
Control/Data Signals and Signal Bundles
[Figure: signal bundles (slash-and-width notation) and shared control signals (Enable, Compl) applied to gate arrays — (a) 8 NOR gates, (b) 32 AND gates, (c) k XOR gates.]
Slide 9
1.2 Boolean Functions and Expressions
Ways of specifying a logic function
Slide 10
Manipulating Logic Expressions
Table 1.2 Laws (basic identities) of Boolean algebra.
Associative     (x ∨ y) ∨ z = x ∨ (y ∨ z)           (x ∧ y) ∧ z = x ∧ (y ∧ z)
Distributive    x ∨ (y ∧ z) = (x ∨ y) ∧ (x ∨ z)     x ∧ (y ∨ z) = (x ∧ y) ∨ (x ∧ z)
Slide 11
Proving the Equivalence of Logic Expressions
Example 1.1
Truth-table method: Exhaustive verification
Arithmetic substitution
x ∨ y = x + y − xy
x ⊕ y = x + y − 2xy
Example: x ⊕ y ≟ x′y ∨ xy′
x + y − 2xy ≟ (1 − x)y + x(1 − y) − (1 − x)y·x(1 − y)
           = y − xy + x − xy − 0 = x + y − 2xy ✓   (the last term vanishes because x(1 − x) = 0)
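The truth-table method amounts to an exhaustive check over all input combinations; a minimal sketch for Example 1.1:

```python
from itertools import product

# Truth-table check of Example 1.1: x XOR y  vs.  x'y OR xy'
for x, y in product((0, 1), repeat=2):
    xor_form   = x ^ y                           # x XOR y
    sop_form   = ((1 - x) & y) | (x & (1 - y))   # x'y OR xy'
    arith_form = x + y - 2 * x * y               # arithmetic form of XOR
    assert xor_form == sop_form == arith_form
print("The two expressions agree on all four input combinations")
```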
Slide 12
1.3 Designing Gate Networks
AND-OR, NAND-NAND, OR-AND, NOR-NOR
Slide 13
Seven-Segment Display of Decimal Digits
[Figure: the ten decimal digits on seven-segment displays, with one segment marked as optional.]
Slide 14
BCD-to-Seven-Segment Decoder
Example 1.2
[Figure: a 4-bit input x3 x2 x1 x0 in [0, 9] feeds logic that produces signals e0–e6 to enable (turn on) the seven display segments, numbered 0–6.]
Figure 1.8 The logic circuit that generates the enable signal for the
lowermost segment (number 3) in a seven-segment display unit.
Slide 15
1.4 Useful Combinational Parts
Slide 16
Multiplexers
[Figure panels: (a) 2-to-1 mux — output z = x0 when y = 0 and x1 when y = 1; (b) switch view; (c) mux symbol; (d) mux array with 32-bit data paths; (e) 4-to-1 mux with enable e; (f) 4-to-1 mux design built from 2-to-1 muxes selected by y1 y0.]
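A behavioral sketch of the 2-to-1 mux and of a 4-to-1 mux built from three 2-to-1 muxes, as in panel (f); function names are illustrative:

```python
def mux2(x0, x1, y):
    """2-to-1 mux: pass x0 when y = 0, x1 when y = 1."""
    return x1 if y else x0

def mux4(x0, x1, x2, x3, y1, y0):
    """4-to-1 mux as a tree of three 2-to-1 muxes (panel f)."""
    return mux2(mux2(x0, x1, y0), mux2(x2, x3, y0), y1)

# Select input number y1y0 = 10 (binary), i.e., x2
assert mux4(10, 11, 12, 13, y1=1, y0=0) == 12
```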
Slide 17
Decoders/Demultiplexers
[Figure panels: (a) 2-to-4 decoder — inputs y1 y0 select one of the outputs x0–x3; (b) decoder symbol; (c) demultiplexer, or decoder with "enable" input e.]
Slide 18
Encoders
[Figure panels: (a) 4-to-2 encoder — inputs x0–x3 produce the output code y1 y0; (b) encoder symbol.]
Slide 19
1.5 Programmable Combinational Parts
Slide 20
PROMs
[Figure panels: (a) programmable OR gates over inputs w, x, y, z; (b) logic equivalent of part a; (c) programmable read-only memory (PROM) — a decoder on the inputs drives programmable output lines.]
Slide 21
PALs and PLAs
[Figure panels: (a) general programmable combinational logic — an AND array (AND plane) feeding an OR array (OR plane); (b) PAL: programmable AND array, fixed OR array (8-input ANDs); (c) PLA: programmable AND and OR arrays (6-input ANDs, 4-input ORs).]
Slide 22
1.6 Timing and Circuit Considerations
Slide 23
2 Digital Circuits with Memory
Slide 24
2.1 Latches, Flip-Flops, and Registers
[Figure panels: (a) SR latch with inputs R and S and outputs Q and Q′; (b) D latch with data input D and clock C.]
[Figure: graphic symbols for the D latch, D flip-flop (FF), and k-bit register, each with D and C inputs and Q, Q′ outputs; k-bit data lines carry a slash-k marking.]
Slide 25
Reading and Modifying FFs in the Same Cycle
[Figure: two k-bit flip-flop registers around a computation module (combinational logic); the clock period must cover the flip-flop propagation delay plus the combinational delay.]
Slide 26
2.2 Finite-State Machines
Example 2.1
Next-state table (one column per input):

Current state    Dime    Quarter   Reset
S00              S10     S25       S00
S10              S20     S35       S00
S20              S30     S35       S00
S25              S35     S35       S00
S30              S35     S35       S00
S35              S35     S35       S00

S00 is the initial state; S35 is the final state.
[State diagram: the same transitions drawn as a graph, with Start entering S00.]
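The state table can be exercised directly; a minimal simulation sketch (state and input names follow the table, the function name is illustrative):

```python
# Next-state table of Example 2.1 (states name the amount deposited so far, in cents)
next_state = {
    "S00": {"dime": "S10", "quarter": "S25"},
    "S10": {"dime": "S20", "quarter": "S35"},
    "S20": {"dime": "S30", "quarter": "S35"},
    "S25": {"dime": "S35", "quarter": "S35"},
    "S30": {"dime": "S35", "quarter": "S35"},
    "S35": {"dime": "S35", "quarter": "S35"},
}

def run(inputs, state="S00"):
    """Apply a sequence of inputs, starting from the initial state S00."""
    for event in inputs:
        state = "S00" if event == "reset" else next_state[state][event]
    return state

assert run(["dime", "quarter"]) == "S35"        # 10 + 25 >= 35 cents: final state
assert run(["dime", "dime", "dime"]) == "S30"   # 30 cents: not yet final
assert run(["quarter", "reset"]) == "S00"       # reset returns to the initial state
```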
Slide 27
Sequential Machine Implementation
Slide 28
2.3 Designing Sequential Circuits
Example 2.3
[Figure: circuit implementing the coin-reception FSM — inputs "quarter in" (q) and "dime in" (d), state held in D flip-flops FF2, FF1, FF0, and an output e that is 1 when the state is 1xx (the final state).]
Slide 29
2.4 Useful Sequential Parts
Slide 30
Shift Register
[Figure: k-bit shift register — a mux selects Shift or (parallel) Load for each bit; serial data in and serial data out, parallel data in and out, with the k − 1 LSBs and the MSB labeled; example contents 0 1 0 0 1 1 1 0.]
Figure 2.8 Register with single-bit left shift and parallel load
capabilities. For logical left shift, serial data in line is connected to 0.
Slide 31
Register File and FIFO
[Figure panels: (a) register file with random access — 2^h k-bit registers built from D flip-flops, with a write port (write data, h-bit write address, write enable, decoder) and two read ports (read address 0/1, read enable, muxes producing read data 0/1); (b) graphic symbol for the register file; (c) FIFO symbol with Input/Output, Push/Pop, and Full/Empty signals.]
Slide 32
[Figure: memory (SRAM) organization — an h-bit address is split into row and column parts; the row decoder selects a row of a square or almost-square memory matrix, the row buffer holds it, and a column mux delivers the g data-out bits; other signals include g-bit data in, write enable, chip select, and output enable.]
Slide 33
Binary Counter
[Figure: a mux (controlled by Load / IncrInit) selects either an initialization input or the incremented value; the count register holds x, and an incrementer with carry-in c_in and carry-out c_out produces x + 1.]
Figure 2.11 Synchronous binary counter with initialization capability.
Slide 34
2.5 Programmable Sequential Parts
Slide 35
PAL and FPGA
[Figure panels: (a) portion of a PAL with storable output — an 8-input AND plane feeding an output mux, a D flip-flop, and a 0/1 output select; (b) generic structure of an FPGA — configurable logic blocks (CLBs), I/O blocks, and programmable connections.]
Slide 36
[Figure: flip-flops FF1 and FF2 with combinational logic (and other inputs) between them, clocked by Clock1 and Clock2; the clock period is marked on the accompanying waveform.]
Slide 37
Synchronization
[Figure panels: (a) a single flip-flop producing a synchronized version of an asynchronous input; (b) a two-FF synchronizer (FF1, FF2); (c) waveforms of the asynch input and its synch version.]
Slide 38
Level-Sensitive Operation
[Figure: two-phase, level-sensitive clocking — latches clocked alternately by phases 1 and 2, with combinational logic between successive latches.]
Slide 39
3 Computer System Technology
Interplay between architecture, hardware, and software
• Architectural innovations influence technology
• Technological advances drive changes in architecture
Slide 40
3.1 From Components to Applications
[Figure: the path from electronic components (hardware) up to application domains (software), with the designers in between — circuit designer, logic designer, computer designer, system designer, application designer; the low-level view corresponds to computer organization and the high-level view to computer architecture.]
Slide 41
What Is (Computer) Architecture?
[Figure: the building-architect analogy — the client's requirements (function, cost, ...) and taste (mood, style, ...) set the goals; the architect devises an interface between these goals and the available means.]
Slide 42
3.2 Computer Systems and Their Parts
[Figure: the space of computer systems — analog vs. digital, fixed-function vs. stored-program, electronic vs. nonelectronic, general-purpose vs. special-purpose; example price classes include embedded ($10s) and personal ($100s).]
Slide 44
Automotive Embedded Computers
[Figure: embedded computers and sensors in a car — impact sensors, brakes, airbags, navigation & entertainment.]
Slide 45
Personal Computers and Workstations
Slide 46
Digital Computer Subsystems
[Figure: the main subsystems of a digital computer — memory, control, datapath, input, and output.]
Slide 48
IC Production and Yield
[Figure: IC production flow — a silicon crystal ingot (15–30 cm) is sliced into blank wafers (30–60 cm), patterned in 20–30 processing steps (leaving some defective dies), diced, and die-tested; good dies (~1 cm on a side) are mounted into parts, which are tested again before usable parts are shipped.]
Slide 50
3.4 Processor and Memory Technologies
[Figure: packaging levels — dies on a PC board connected by a bus to the CPU and a backplane; 3D stacking with interlayer connections deposited on the outside of the stack.]
Slide 51
Moore's Law
[Figure: processor performance (kIPS to TIPS) and memory chip capacity (kb, 64 kb, ... up to Tb) versus calendar year, 1980–2010; processor performance grows about 1.6× per year (2× per 18 months, 10× per 5 years).]
Figure 3.10 Trends in processor performance and DRAM
memory chip capacity (Moore’s law).
Slide 52
Pitfalls of Computer Technology Forecasting
“DOS addresses only 1 MB of RAM because we cannot
imagine any applications needing more.” Microsoft, 1980
“640K ought to be enough for anybody.” Bill Gates, 1981
“Computers in the future may weigh no more than 1.5
tons.” Popular Mechanics
“I think there is a world market for maybe five
computers.” Thomas Watson, IBM Chairman, 1943
“There is no reason anyone would want a computer in
their home.” Ken Olsen, DEC founder, 1977
“The 32-bit machine would be an overkill for a personal
computer.” Sol Libes, ByteLines
Slide 53
3.5 Input/Output and Communications
[Figure panels: (a) cutaway view of a hard disk drive; (b) some removable storage media, including a magnetic tape cartridge.]
Slide 54
Communication Technologies
[Figure: bandwidth (b/s, 10³ to 10¹²) versus latency (nanoseconds to hours) for communication technologies — processor bus, I/O network, system-area network (SAN), local-area network (LAN), metro-area network (MAN), and wide-area network (WAN); the first few share the same geographic location, while the WAN end is geographically distributed.]
Slide 55
[Figure: categories of software — Application (word processor, spreadsheet, circuit simulator, ...); Operating system, acting as manager (virtual memory, security, file system, ...), enabler (disk driver, display driver, printing, ...), and coordinator (scheduling, load balancing, diagnostics, ...); Translator (MIPS assembler, C compiler, ...).]
Slide 56
High- vs Low-Level Programming
More abstract, machine-independent; easier to write, read, debug, or maintain
More concrete, machine-specific, error-prone; harder to write, read, debug, or maintain
[Figure: the spectrum from high-level programs down to machine instructions or tasks, bridged by compiler, interpreter, and assembler.]
Slide 58
4.1 Cost, Performance, and Cost/Performance
[Figure: computer cost versus calendar year, 1960–2020, on a scale from $1 to $1G (with $1K and $1M gridlines).]
Slide 59
Cost/Performance
[Figure: performance versus cost — superlinear (economy of scale), linear (the ideal?), and sublinear (diminishing returns) growth curves.]
Slide 60
4.2 Defining Computer Performance
CPU-bound task
I/O-bound task
Figure 4.2 Pipeline analogy shows that imbalance between processing
power and I/O capabilities leads to a performance bottleneck.
Slide 61
Six Passenger Aircraft to Be Compared
[Photos of the six aircraft, including the Boeing 747 and the DC-8-50.]
Slide 62
Performance of Aircraft: An Analogy
Table 4.1 Key characteristics of six passenger aircraft: all figures
are approximate; some relate to a specific model/configuration of
the aircraft or are averages of cited range of values.
Slide 63
Different Views of Performance
Performance from the viewpoint of a passenger: Speed
Note, however, that flight time is but one part of total travel time.
Also, if the travel distance exceeds the range of a faster plane,
a slower plane may be better due to not needing a refueling stop
Performance from the viewpoint of an airline: Throughput
Measured in passenger-km per hour (relevant if ticket price were
proportional to distance traveled, which in reality it is not)
Airbus A310:  250 × 895  = 0.224 M passenger-km/hr
Boeing 747:   470 × 980  = 0.461 M passenger-km/hr
Boeing 767:   250 × 885  = 0.221 M passenger-km/hr
Boeing 777:   375 × 980  = 0.368 M passenger-km/hr
Concorde:     130 × 2200 = 0.286 M passenger-km/hr
DC-8-50:      145 × 875  = 0.127 M passenger-km/hr
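These throughput figures are simply passengers × speed; a quick check:

```python
aircraft = {                  # (passengers, speed in km/h)
    "Airbus A310": (250,  895),
    "Boeing 747":  (470,  980),
    "Boeing 767":  (250,  885),
    "Boeing 777":  (375,  980),
    "Concorde":    (130, 2200),
    "DC-8-50":     (145,  875),
}
for name, (passengers, speed) in aircraft.items():
    print(f"{name}: {passengers * speed / 1e6:.3f} M passenger-km/hr")
```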
Performance from the viewpoint of FAA: Safety
Slide 64
Cost Effectiveness: Cost/Performance
Table 4.1 Key characteristics of six passenger aircraft: all figures are
approximate; some relate to a specific model/configuration of the aircraft
or are averages of cited range of values. (Arrows in the original table mark,
column by column, whether larger or smaller values are better.)

Aircraft | Passengers | Range (km) | Speed (km/h) | Price ($M) | Throughput (M P-km/hr) | Cost / Performance
Slide 65
Concepts of Performance and Speedup
Performance = 1 / Execution time, which for our purposes is simplified to Performance = 1 / CPU execution time
Slide 66
Elaboration on the CPU Time Formula
CPU time = Instructions × (Cycles per instruction) × (Seconds per cycle)
         = Instructions × Average CPI / (Clock rate)
Slide 67
Dynamic Instruction Count
How many instructions are executed in this program fragment?
Each "for" consists of two instructions: increment index, check exit condition.

    250 instructions
    for i = 1, 100 do
        20 instructions
        for j = 1, 100 do
            40 instructions
            for k = 1, 100 do
                10 instructions
            endfor
        endfor
    endfor

Innermost (k) loop:  100 iterations × (2 + 10) = 1200 instructions in all
Middle (j) loop:     100 iterations × (2 + 40 + 1200) = 124,200 instructions in all
Outer (i) loop:      100 iterations × (2 + 20 + 124,200) = 12,422,200 instructions in all
Dynamic count: 250 + 12,422,200 = 12,422,450 instructions;  static count = 326
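The counts can be reproduced mechanically, working from the innermost loop outward; a small sketch (the helper name is illustrative):

```python
def loop_count(body_instrs, iterations, overhead=2):
    """Dynamic instruction count of a counted loop (2 overhead instructions per iteration)."""
    return iterations * (overhead + body_instrs)

k_loop = loop_count(10, 100)            # 100 * (2 + 10)      =      1,200
j_loop = loop_count(40 + k_loop, 100)   # 100 * (2 + 1,240)   =    124,200
i_loop = loop_count(20 + j_loop, 100)   # 100 * (2 + 124,220) = 12,422,200
print(250 + i_loop)                     # plus 250 straight-line instructions: 12,422,450
```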
Slide 68
Faster Clock ≠ Shorter Running Time
Suppose addition takes 1 ns
Clock period = 1 ns; addition takes 1 cycle
Clock period = ½ ns; addition takes 2 cycles
[Figure: a solution taking 4 steps vs. one taking 20 steps (1 GHz clock), illustrating that a higher clock rate does not by itself shorten running time.]
Slide 69
[Figure: speedup s versus enhancement factor p (0 to 50), with curves for f = 0.01, 0.02, 0.05, and 0.1.]
s = 1 / [f + (1 − f)/p]  ≤  min(p, 1/f)
Figure 4.4 Amdahl’s law: speedup achieved if a fraction f of a
task is unaffected and the remaining 1 – f part runs p times as fast.
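The curves in Figure 4.4 come directly from this formula; a minimal sketch that also reports the min(p, 1/f) bound:

```python
def amdahl_speedup(f, p):
    """Speedup when a fraction f is unaffected and the rest runs p times as fast."""
    return 1.0 / (f + (1.0 - f) / p)

for f in (0.01, 0.02, 0.05, 0.1):
    s = amdahl_speedup(f, p=50)
    print(f"f = {f:<4}: speedup = {s:5.1f}   (upper bound 1/f = {1/f:.0f})")
```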
Slide 70
Amdahl’s Law Used in Design
Example 4.1
A processor spends 30% of its time on floating-point addition, 25% on
floating-point multiplication, and 10% on floating-point division. Evaluate
the following enhancements, each costing the same to implement:
Solution
Slide 71
Amdahl’s Law Used in Management
Example 4.2
Members of a university research group frequently visit the library.
Each library trip takes 20 minutes. The group decides to subscribe
to a handful of publications that account for 90% of the library trips;
access time to these publications is reduced to 2 minutes.
Solution
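A sketch of the arithmetic implied by the stated numbers (90% of trips drop from 20 to 2 minutes, i.e., unaffected fraction f = 0.1 and speedup factor p = 10):

```python
trip_before = 20                       # minutes per library trip
trip_after  = 0.1 * 20 + 0.9 * 2       # 10% of trips still take 20 min, 90% now take 2 min
speedup     = trip_before / trip_after
print(f"average trip {trip_after:.1f} min, speedup = {speedup:.2f}")   # about 5.26

# Same result from Amdahl's formula with f = 0.1 and p = 20/2 = 10
assert abs(1 / (0.1 + 0.9 / 10) - speedup) < 1e-9
```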
Slide 72
[Chart: running times or performance of six programs (A–F) on Machine 1, Machine 2, and Machine 3.]
Slide 73
Generalized Amdahl’s Law
Speedup formula:
S = 1 / (f1/p1 + f2/p2 + ... + fk/pk)
If a particular fraction fj is slowed down rather than speeded up, use sj·fj instead of fj/pj, where sj > 1 is the slowdown factor.
Slide 74
Performance Benchmarks
Example 4.3
You are an engineer at Outtel, a start-up aspiring to compete with Intel
via its new processor design that outperforms the latest Intel processor
by a factor of 2.5 on floating-point instructions. This level of performance
was achieved by design compromises that led to a 20% increase in the
execution time of all other instructions. You are in charge of choosing
benchmarks that would showcase Outtel’s performance edge.
a. What is the minimum required fraction f of time spent on floating-point
instructions in a program on the Intel processor to show a speedup of
2 or better for Outtel?
Solution
a. We use a generalized form of Amdahl's formula in which a fraction f
   is speeded up by a given factor (2.5) and the rest is slowed down by
   another factor (1.2):   1 / [1.2(1 − f) + f / 2.5] ≥ 2  ⇒  f ≥ 0.875
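A quick numerical check of the 0.875 threshold (function name is illustrative):

```python
def outtel_speedup(f):
    """Speedup over the Intel chip when a fraction f runs 2.5x faster and the rest 1.2x slower."""
    return 1.0 / (1.2 * (1.0 - f) + f / 2.5)

assert outtel_speedup(0.875) >= 2.0 - 1e-9    # exactly at the 0.875 threshold
assert outtel_speedup(0.874) < 2.0            # just below the threshold falls short
```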
Slide 75
Performance Estimation
Average CPI = Σ over all instruction classes of (Class-i fraction) × (Class-i CPI)
Machine cycle time = 1 / Clock rate
CPU execution time = Instructions × (Average CPI) / (Clock rate)
Usage frequency (%) of instruction classes in four sample applications:

Class            App 1   App 2   App 3   App 4
A: Load/Store      25      37      32      37
B: Integer         32      28      17       5
C: Shift/Logic     16      13       2       1
D: Float            0       0      34      42
E: Branch          19      13       9      10
F: All others       8       9       6       4
Slide 76
CPI and IPS Calculations
Example 4.4 (2 of 5 parts)
Consider two implementations M1 (600 MHz) and M2 (500 MHz) of
an instruction set containing three classes of instructions:
Class CPI for M1 CPI for M2 Comments
F 5.0 4.0 Floating-point
I 2.0 3.8 Integer arithmetic
N 2.4 2.0 Nonarithmetic
a. What are the peak performances of M1 and M2 in MIPS?
b. If 50% of instructions executed are class-N, with the rest divided
equally among F and I, which machine is faster? By what factor?
Solution
a. Peak MIPS for M1 = 600 / 2.0 = 300; for M2 = 500 / 2.0 = 250
b. Average CPI for M1 = 5.0 / 4 + 2.0 / 4 + 2.4 / 2 = 2.95;
   for M2 = 4.0 / 4 + 3.8 / 4 + 2.0 / 2 = 2.95. With equal CPIs, M1 is faster
   by the clock-rate ratio 600 / 500 = 1.2.
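A minimal sketch reproducing parts a and b:

```python
cpi = {"M1": {"F": 5.0, "I": 2.0, "N": 2.4},
       "M2": {"F": 4.0, "I": 3.8, "N": 2.0}}
clock_mhz = {"M1": 600, "M2": 500}
mix = {"F": 0.25, "I": 0.25, "N": 0.50}   # 50% class N, the rest split between F and I

for m in ("M1", "M2"):
    peak_mips = clock_mhz[m] / min(cpi[m].values())
    avg_cpi = sum(mix[c] * cpi[m][c] for c in mix)
    print(f"{m}: peak {peak_mips:.0f} MIPS, average CPI {avg_cpi:.2f}, "
          f"{clock_mhz[m] / avg_cpi:.0f} MIPS on the given mix")
# M1: peak 300 MIPS, CPI 2.95, ~203 MIPS;  M2: peak 250 MIPS, CPI 2.95, ~169 MIPS
# Equal CPIs, so M1 is faster by the clock ratio 600/500 = 1.2
```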
Slide 77
MIPS Rating Can Be Misleading
Example 4.5
Two compilers produce machine code for a program on a machine
with two classes of instructions. Here are the number of instructions:
Class CPI Compiler 1 Compiler 2
A 1 600M 400M
B 2 400M 400M
a. What are run times of the two programs with a 1 GHz clock?
b. Which compiler produces faster code and by what factor?
c. Which compiler’s output runs at a higher MIPS rate?
Solution
a. Running time for compiler 1 = (600M × 1 + 400M × 2) / 10⁹ = 1.4 s;
   for compiler 2 = (400M × 1 + 400M × 2) / 10⁹ = 1.2 s
b. Compiler 2's output runs 1.4 / 1.2 ≈ 1.17 times as fast
c. MIPS rating for compiler 1 (CPI = 1.4) = 1000 / 1.4 = 714;
   for compiler 2 (CPI = 1.5) = 1000 / 1.5 = 667 — the slower code gets the higher MIPS rating
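A minimal sketch reproducing the run times and MIPS ratings:

```python
instr_counts = {"Compiler 1": {"A": 600e6, "B": 400e6},
                "Compiler 2": {"A": 400e6, "B": 400e6}}
cpi = {"A": 1, "B": 2}
clock_hz = 1e9

for name, counts in instr_counts.items():
    cycles = sum(counts[c] * cpi[c] for c in counts)
    runtime = cycles / clock_hz
    mips = sum(counts.values()) / runtime / 1e6
    print(f"{name}: {runtime:.1f} s, {mips:.0f} MIPS")
# Compiler 1: 1.4 s, 714 MIPS;  Compiler 2: 1.2 s, 667 MIPS
# The higher-MIPS code (compiler 1) is actually the slower one
```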
Slide 78
4.5 Reporting Computer Performance
Table 4.4 Measured or estimated execution times for three programs.
Slide 79
Comparing the Overall Performance
Slide 80
Effect of Instruction Mix on Performance
Example 4.6 (1 of 3 parts)
Consider two applications DC and RS and two machines M1 and M2:
Class Data Comp. Reactor Sim. M1’s CPI M2’s CPI
A: Ld/Str 25% 32% 4.0 3.8
B: Integer 32% 17% 1.5 2.5
C: Sh/Logic 16% 2% 1.2 1.2
D: Float 0% 34% 6.0 2.6
E: Branch 19% 9% 2.5 2.2
F: Other 8% 6% 2.0 2.3
a. Find the effective CPI for the two applications on both machines.
Solution
a. CPI of DC on M1: 0.25 × 4.0 + 0.32 × 1.5 + 0.16 × 1.2 + 0 × 6.0 + 0.19 × 2.5 + 0.08 × 2.0 = 2.31
   Similarly, DC on M2: 2.54;  RS on M1: 3.94;  RS on M2: 2.89
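A minimal sketch reproducing all four effective CPI values:

```python
mix = {"DC": {"A": 0.25, "B": 0.32, "C": 0.16, "D": 0.00, "E": 0.19, "F": 0.08},
       "RS": {"A": 0.32, "B": 0.17, "C": 0.02, "D": 0.34, "E": 0.09, "F": 0.06}}
cpi = {"M1": {"A": 4.0, "B": 1.5, "C": 1.2, "D": 6.0, "E": 2.5, "F": 2.0},
       "M2": {"A": 3.8, "B": 2.5, "C": 1.2, "D": 2.6, "E": 2.2, "F": 2.3}}

for app in mix:
    for m in cpi:
        eff = sum(mix[app][c] * cpi[m][c] for c in mix[app])
        print(f"{app} on {m}: effective CPI = {eff:.3f}")
# 2.307, 2.544, 3.944, 2.885 — i.e., 2.31, 2.54, 3.94, and about 2.89, as on the slide
```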
Slide 81
4.6 The Quest for Higher Performance
State of available computing power ca. the early 2000s:
Slide 82
Performance Trends and Obsolescence
[Figure 3.10 repeated: processor performance (kIPS to TIPS; ≈1.6× per year, 2× per 18 months, 10× per 5 years) and DRAM chip capacity (kb to Tb), 1980–2010.]
Cartoon caption: "Can I call you back? We just bought a new computer and we're trying to set it up before it's obsolete."
Figure 3.10 Trends in processor performance and DRAM memory chip capacity (Moore's law).
Slide 83
[Figure: supercomputer performance, MFLOPS to PFLOPS, 1980–2010 — one curve for vector supercomputers (Cray X-MP, Y-MP) and one for massively parallel computers (CM-2, CM-5, $30M and $240M MPPs).]
Slide 84
The Most Powerful Computers
[Figure: performance (TFLOPS) versus calendar year 1995–2010, with plan/develop/use phases for each machine — ASCI Red (1+ TFLOPS, 0.5 TB), ASCI Blue (3+ TFLOPS, 1.5 TB), ASCI White (10+ TFLOPS, 5 TB), ASCI Q (30+ TFLOPS, 10 TB), ASCI Purple (100+ TFLOPS, 20 TB).]
Figure 4.8 Milestones in the DOE’s Accelerated Strategic Computing
Initiative (ASCI) program with extrapolation up to the PFLOPS level.
Slide 85
Performance is Important, But It Isn’t Everything
[Figure 25.1 Trend in computational performance per watt of power used in general-purpose processors and DSPs: curves for absolute processor performance, GP processor performance per watt, and DSP performance per watt, from kIPS to TIPS, 1980–2010.]
Slide 86