Pipeline Architecture
C. V. Ramamoorthy
Computer Science Division, Department of Electrical Engineering and Computer Sciences and the
Electronics Research Laboratory, University of California, Berkeley, Berkeley, California 94720
and
H.F. Li
Department of Electrical Engineering and the Coordinated Science Laboratory, University of Illinois at
Champaign-Urbana, Urbana, Illinois 61801
Pipelined computer architecture has received considerable attention since the 1960s
when the need for faster and more cost-effective systems became critical. The
merit of pipelining is that it can help to match the speeds of various subsystems
without duplicating the cost of the entire system involved. As technology evolves,
faster and cheaper LSI circuits become available, and the future of pipelining,
either in a simple or complex form, becomes more promising.
This paper reviews the many theoretical considerations and problems behind
pipelining, surveying and comparing various representative pipeline machines that
operate in either sequential or vector pipeline mode, the practical solutions
adopted, and the tradeoffs involved. The performance of a simple pipe, the physical
speed limitation, and the control structures for penalty-incurring events are
analyzed separately. The problems faced by the system designers are tackled,
including buffering, busing structure, branching, and interrupt handling. Aspects
of sequential and vector processing are studied. Fundamental advantages of vector
processing are unveiled, and additional requirements (costs) are discussed to
establish a criterion for the tradeoff between sequential and vector pipeline
processing. Finally, two recent machines (the CRAY-1 and the Amdahl 470 V/6
systems) are presented to demonstrate how complex pipeline techniques can be
used and how simple but advantageous pipeline concepts can be exploited.
1. INTRODUCTION
The principle of pipelining has emerged as a major architectural attribute of most
present computer systems. In particular, supermachines such as the Texas Instruments
TI ASC, Burroughs PEPE, IBM System/360 Models 91 and 195, Cray Research CRAY-1,
CDC STAR-100, Amdahl 470 V/6, CDC 6600, and CDC 7600 have distinct pipeline
processing capabilities, either in the form of internally pipelined instruction and
arithmetic units or in the form of pipelined special purpose functional units [1-4].
* Research sponsored by US Army Research Office Contract DA-ARO-D-31-124-73-G157.
Copyright 1977, Association for Computing Machinery, Inc. General permission to republish, but
not for profit, all or part of this material is granted, provided that ACM's copyright notice is given
and that reference is made to the publication, to its date of issue, and to the fact that reprinting
privileges were granted by permission of the Association for Computing Machinery.
FIGURE 1a. Non-pipelined processor: instruction processing (IF, ID, OF, EXEC) repeated serially over time.
speed achievable arises because the technique requires the insertion of latches of
finite delay. It is shown that this delay
plays a significant role in determining the
bound on the fastest speed achievable.
On the other hand, when a pipeline
operates on tasks with precedence constraints, the space-time measure for the
ideal situation is not directly applicable.
Section 1.3, Performance Characteristics,
analyzes the performance of such a pipe
when the precedence relationships are in
the form of a tree. Appropriate bounds are
provided which reflect that the pipe sometimes has a throughput rate close to its
segment time and at other times has a
rate close to its flush time. The dominating
role played by task relationships in an
actual pipeline is thus apparent.
After the analytical evaluation of a pipeline's performance, the various applicable
control schemes are classified and compared with respect to the flow of instructions and the resolution of conflicts. This
classification covers most of the schemes
existing in pipelined systems as well as
some theoretically feasible combinations.
In Section 1.4, Control Structure, Hazards,
and Penalties, three kinds of hazards are
formally classified. The detection and resolution techniques for these hazards under
either "streamline" or "fully asynchronous" control are analyzed according to
the incurred cost in hardware and incurred
delay penalties in runtime. Section 1.5,
Sequencing Control, presents a simple
sequencing control using shift registers as
an example of synchronous pipelines whose
collisions are predeterminable. This scheme
is useful for controlling lower level pipelines
such as arithmetic pipes for which external
conditions or precedence constraints are
rare. Finally, in Section 1.6, Software
Aspects, some software problems related
to the efficient code generation of a vector
pipeline are discussed.
In Section 2, Structure of a Pipelined
Processor, the problems and solutions associated with a sequential pipelined system
are examined more carefully. Three systems
are used as examples to make cross-comparisons in several practical problems.
These problems include buffering, busing structure, branching, and interrupt handling.
FIGURE 1e. Flow of control.
Different design and control strategies classify a
pipelined module into one of two forms; it
can be either a static or a dynamic pipe.
Sometimes a pipelined module only serves a
single dedicated function--for example, the
pipelined adder or multiplier in the IBM
360/91. Naturally, it can be termed a
unifunctional pipe with a static configuration. On the other hand, a pipelined module
can serve a set of functions, each with a
distinguishable configuration. For example,
in the TI ASC system the arithmetic unit
in the processor is a pipe that has different
configurations (interconnections of modules)
for performing different types of arithmetic
operations. Such a pipe is called a multifunctional pipe. A multifunctional pipe can
be either static or dynamic. In the static
case, at any time instant only one configuration is active, therefore pipelining (overlapped processing) is permissible only if
the tasks (instructions) involve the same
configuration. Most, if not all, multifunctionM pipes in arithmetic units of existing machines fall into this classification
because static pipes are easier to control, as
will become clearer later on. Dynamic multifunctional pipes permit overlapped processing among several active configurations
simultaneously. Throughput may be further
enhanced, but more elaborate control and
sequencing are required. This classification
of static and dynamic pipes will be very
useful when we consider and evaluate pipelined processor architecture in subsequent
sections.
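The admission rule separating the two cases can be sketched as follows. This is a minimal model with hypothetical names (`can_overlap`, the mode strings); it does not describe any real machine's control logic.

```python
def can_overlap(mode, active_configs, new_config):
    """Return True if a task needing `new_config` may enter the pipe
    while tasks using `active_configs` are still in flight."""
    if not active_configs:          # an empty pipe always accepts
        return True
    if mode == "static":            # only one configuration active at a time
        return all(c == new_config for c in active_configs)
    if mode == "dynamic":           # several configurations may coexist
        return True
    raise ValueError(mode)

# A static pipe must drain its 'add' tasks before a 'multiply' may enter;
# a dynamic pipe may interleave the two configurations.
assert can_overlap("static", ["add", "add"], "add")
assert not can_overlap("static", ["add"], "multiply")
assert can_overlap("dynamic", ["add"], "multiply")
```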
FIGURE 1h. Paralleling of facility 2.
[Figure: overlapped instruction processing. For successive instructions 1-4, the stages generate instruction address, instruction access, decode, generate operand address, operand access, execute, and result delivery proceed in overlapped fashion.]

Computing Surveys, Vol. 9, No. 1, March 1977
In a synchronous pipeline design, all segments have to be
synchronized by the same clock for propagating the data through the pipe.
The study of a maximum clock rate
serves to place a practical bound on the
throughput achievable in a pipeline system
limited by the propagation delays of the
logic gates used. Several studies have been
carried out to examine this problem under
various assumptions of timing parameters.
In all cases, three necessary conditions of
signal balancing exist:
1) The data must be gated by a clock
wide enough to insure a properly
stabilized output;
2) The clock should not be so wide as to
allow data to pass through two or
more segments within the same clock
period;
and
3) The data that passes through a segment should arrive at the next segment before the next clock begins.
Initially Cotten [27] tested this data rate
and latching problem by using a hypothetical circuit as in Figure 3. The clock for
various segments may have a skew Sc, defined as the time difference between the
arrival of the same pulse at different gates.
The latch register is assumed to be composed of two gate levels with feedback connections. Then, under conditions 1) and
2), Cotten's clock requirements are:

    CT − Sc ≥ 3tmax − tmin,    (1)
    CT + Sc < 4tmin,    (2)

where CT = clock width, Sc = clock skew,
tmax = maximum single gate delay, and
tmin = minimum single gate delay.
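Under the reading of conditions (1) and (2) as CT − Sc ≥ 3tmax − tmin and CT + Sc < 4tmin, a candidate clock can be checked mechanically. The sketch below is illustrative; the function name and units are not from the paper.

```python
def clock_ok(CT, Sc, t_max, t_min):
    """Check a candidate clock width CT (with skew Sc) against the two
    latching conditions: wide enough to stabilize the two-level latch,
    narrow enough that data cannot race through two segments."""
    wide_enough = CT - Sc >= 3 * t_max - t_min
    narrow_enough = CT + Sc < 4 * t_min
    return wide_enough and narrow_enough

# With equal gate delays t = 1 and no skew, any CT in [2, 4) is legal:
assert clock_ok(CT=2.5, Sc=0.0, t_max=1.0, t_min=1.0)
assert not clock_ok(CT=4.0, Sc=0.0, t_max=1.0, t_min=1.0)  # data races through
assert not clock_ok(CT=1.5, Sc=0.0, t_max=1.0, t_min=1.0)  # latch not stabilized
```

Skew narrows the legal window from both sides, which is why minimizing Sc raises the achievable clock rate.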
FIGURE 3. Hypothetical latching circuit: data passes from Register 1 to Register 2 over a data line, with a skew generator modeling clock skew.

Under condition 3), with CE denoting the delay through a latch register, further constraints (3)-(6) bound CT and CE from below in terms of tmax, tmin, and the skew. Consequently the minimum clock period can be derived to be (CT + CE), which satisfies the above constraints and also CT + CE ≥ 2 min CT (that is, the period must be long enough to allow the data to propagate through a latch and then remain stable for min CT). Under zero skew and tmax = tmin = t,

    CT + CE ≥ 4t.    (7)
(m − 1) computations are left, with the
needed (intermediate) results residing in
the m segments. They will take an additional
(m log2 m + m − 1) units to complete. So
altogether, the (n − 1) computations take
(n + m log2 m − 1) units. On the other
hand, if 2 < n < 2m, (n/2 + m log2 n − 1)
units are required. The corresponding efficiencies, defined by the ratio of the total
busy segment times to the total segment
time span, can then be derived easily as
(n − 1)/((n − 1) + m log2 m) and (n − 1)/
(n/2 + m log2 n − 1), respectively. This
implies that for n >> m the pipe of m segments functions almost like a nonpiped
processor with speed of one segment time
instead of m segment cycles (the total time
is O(n), not O(mn)). However, for smaller
n the time is O(m log2 n).
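The two timing regimes and their efficiencies can be evaluated directly from the formulas above. This is a sketch that assumes n and m are powers of two, so the log terms are exact.

```python
import math

def merge_time(n, m):
    """Total segment times to merge n operands into one result on an
    m-segment pipe, following the text's two regimes."""
    if n >= 2 * m:
        return n + m * math.log2(m) - 1
    if n > 2:
        return n / 2 + m * math.log2(n) - 1
    raise ValueError("bounds quoted for n > 2 only")

def efficiency(n, m):
    """Ratio of busy segment time to total segment time span."""
    return (n - 1) / merge_time(n, m)

# For n >> m the pipe delivers near one result per segment time: O(n).
assert merge_time(1024, 8) == 1024 + 8 * 3 - 1      # 1047
assert efficiency(1024, 8) > 0.97
# For small n the flush time dominates: O(m log2 n).
assert merge_time(8, 8) == 8 / 2 + 8 * 3 - 1        # 27.0
```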
The previous special case assumes a set
of uniform operations on a set of data,
merging them into one result where the exact
order of merging is unimportant. In the
case where a specific tree is to be followed
(the precedence structure is fixed), other
lower bounds can be derived in a similar
fashion. It can be shown that, for a general
tree (not necessarily binary), if each node i
is labeled by g(i) corresponding to its distance from the root, then execution of the
nodes according to priorities in descending
order of g(i) in a pipeline environment with
identical pipeline characteristics is optimal.
Therefore the shortest execution time can
be achieved if the nodes are executed according to priorities corresponding to their
level labels. However, if the pipes have
different structures and/or capabilities, the
problem becomes NP-complete. Without
going into the latter case, the time bound
for the former case can be derived, given a
tree structure and a pipe structure (latency
and flush time).
First let Lj be the number of nodes of the
tree with label j, where l ≥ j ≥ 0. For the
simple case that there exists a j0 such that
for l ≥ j > j0, Lj ≥ m, and for j ≤ j0,
Lj < m, the time bound is given by
    Σ_{j=j0+1}^{l} Lj + m(j0 + 1) − 1,

and this time bound is exact (from the
optimality of the level algorithm).

FIGURE 4. Levels of a tree labeled by distance from the root.
The execution time of one instruction may be very different from that of another, and it is only
natural to allow a subsequent short instruction to finish ahead of a preceding (but independent) long instruction.
The first type of control structure is used
by such systems as the IBM 7094 and
360/75, and even the apparently more powerful TI ASC. Representative of the second
type of control structure are the CDC
6600, the IBM 360/91, and the STAR-100
systems.
We now look at the fundamental problems
to be solved by either type of control as
well as the means and complexities involved.
For any asynchronous system, three sources
of control problems exist: 1) Read after
write, 2) write after write, and 3) write
after read. Their significance is worth more
elaborate explanation.
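The three hazard types can be detected by comparing the read and write sets of an instruction pair. The sketch below is illustrative; the register names are hypothetical.

```python
def hazards(read_i, write_i, read_j, write_j):
    """Return the set of hazards a later instruction j poses against an
    earlier instruction i, given each one's read/write register sets."""
    found = set()
    if write_i & read_j:
        found.add("RAW")   # j reads what i has yet to write
    if write_i & write_j:
        found.add("WAW")   # j may overwrite i's result too early
    if read_i & write_j:
        found.add("WAR")   # j may overwrite what i still must read
    return found

# R3 = R1 + R2 followed by R4 = R3 * R5: a read-after-write on R3.
assert hazards({"R1", "R2"}, {"R3"}, {"R3", "R5"}, {"R4"}) == {"RAW"}
# R3 = R1 + R2 followed by R1 = R6 - R7: a write-after-read on R1.
assert hazards({"R1", "R2"}, {"R3"}, {"R6", "R7"}, {"R1"}) == {"WAR"}
```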
Read After Write
FIGURE 5. Hazard detection and resolution: instruction j is to be deferred behind instruction i, with instruction-hazard and operand-hazard resolving logic monitoring the instruction address and the request buffer.
according to tag values, the read-after-write can be monitored properly. A CDB
can be depicted by S sources (that generate
or forward operands) and T sinks (that need
the operands). Thus the added control
involves O(T) tag registers, each of length
log2 S, in addition to the comparison circuit
to route operands (T comparators plus
gating control).
A centralized scheme involves similar
complexity. The disadvantage of the centralized scheme is that it can reinitiate (update) only one sink at a time so that the
delay t~ can be longer than that in the decentralized case with parallel updating
(same tag).
The penalty of hazards in fully asynchronous systems is less severe and less well
defined. One possible way to view this
penalty is to represent it by the waiting
time of the instruction causing the hazard,
which is a random variable depending on
the completion time of its predecessor instruction. Such a stochastic characterization
is omitted here.
FIGURE 6a. A pipeline.
FIGURE 6b. Reservation table for Fig. 6a (segments S1-S4 against times t1-t7).
FIGURE 6c. Shift register controller.
FIGURE 6d. State diagram.
and leading to the initial node, indicating
that, if more than n units of time elapse
between the initiations, then the shift
register will return to the state represented
by the collision vector itself.
Cycles in the state diagram correspond to
possible cycles of collision-free initiations
in the pipeline. A cycle may be specified
completely by the nodes passed through and
the latencies of (or the time taken by) the
arcs traversed from node to node in sequence. From the state diagram in Figure
6(d), it can be observed that there is a
cycle consisting of states (1010), (1101),
(1011), and (1001) with latencies of 1, 2, 3,
and 2 time units, respectively. This cycle
can be entered through the state (1000).
At each of these states a new computation
with a new set of operands can be initiated.
Since four new computations can be initiated during each traversal of the cycle,
there is an average latency of two, i.e., one
result per two time units. One can find
cycles that produce maximum throughput
rates (minimum average latency cycles).
In this example, the two cycles that produce
minimum average latencies are (1000),
(1100), (1110), (1111) and (1010), (1101),
(1011), (1001), each having a latency of 2
time units.
In general, the problem of efficient sequencing control of a pipeline reduces to
the discovery of minimum latency cycles in
the state diagram. In the case of more complex or multifunctional (assuming a certain
instruction mix) pipelines, the discovery of
the minimum latency cycles becomes quite
difficult. Nevertheless, such a shift register
control is applicable to properly avoiding
any resource (facility) conflicts due to the
existence of multiple paths or loops, in a
completely synchronous sense.
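The shift-register controller itself can be sketched compactly, with each state kept as a bit vector whose set bits mark forbidden initiation latencies. The collision vector used below (forbidding latencies 1 and 5) is illustrative, not the one of Figure 6.

```python
def step(state, k, C):
    """Initiate a new computation k time units after the previous one.
    Bit i of a state marks latency i as forbidden; the next state is the
    shifted state ORed with the collision vector C, or None on collision."""
    if (state >> k) & 1:
        return None                      # latency k would cause a collision
    return (state >> k) | C

def cycle_average_latency(latencies, C):
    """Verify a latency cycle is collision-free and return its average
    latency (time units per initiation)."""
    state = C                            # state just after an initiation
    for k in latencies:
        state = step(state, k, C)
        assert state is not None, "cycle collides"
    return sum(latencies) / len(latencies)

C = 0b100010                             # hypothetical: latencies 1 and 5 forbidden
assert step(C, 1, C) is None             # immediate reinitiation collides
assert cycle_average_latency([2, 4, 2, 4], C) == 3.0
```

Searching the reachable states for the cycle minimizing this average is exactly the "minimum latency cycle" discovery the text describes.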
Note that, in a multifunctional pipe, a reconfiguration is required to change its function, which, in turn, involves a time delay.
Unoptimized Code (requires four reconfigurations):
K = A*B
F = B/C
L = D+E
P = F*C
H = P+A

Optimized Code (requires two reconfigurations):
F = B/C
K = A*B
P = F*C
L = D+E
H = P+A

b) Special machine instructions. A
FORTRAN program for polynomial evaluation
is shown below:

LIMIT = N+1
DO 10 J = 1,M
VALUE(J) = X(J)*A(1)
DO 10 I = 2,LIMIT
10 VALUE(J) = VALUE(J) + A(I)*X(J)
This is equivalent to one machine instruction in the STAR-100. Therefore the compiler has to "recognize" the high level
language statement sequence and replace it
with the appropriate machine instruction.
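The reconfiguration counts quoted in the code-reordering example above follow from a simple scan over the configuration each instruction needs. In this sketch the operator symbols stand in for pipe configurations.

```python
def reconfigurations(ops):
    """Number of configuration changes when the instruction sequence
    `ops` (one pipe configuration per instruction) runs in order."""
    changes = 0
    for prev, cur in zip(ops, ops[1:]):
        if cur != prev:                # adjacent instructions differ:
            changes += 1               # the static pipe must reconfigure
    return changes

unoptimized = ["*", "/", "+", "*", "+"]   # K=A*B; F=B/C; L=D+E; P=F*C; H=P+A
optimized = ["/", "*", "*", "+", "+"]     # F=B/C; K=A*B; P=F*C; L=D+E; H=P+A
assert reconfigurations(unoptimized) == 4
assert reconfigurations(optimized) == 2
```

Grouping instructions that share a configuration, subject to data dependences, is the compiler's optimization target here.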
c) Vectorization. Another type of optimization is to recognize sequential program
statements that represent vector operations
and translate them into powerful vector
arithmetic instructions.
FORTRAN program:

DO 10 I = 4,100
C(I) = A(I)+B(I-3)
10 CONTINUE

Equivalent compiler generated text for
pipeline machines:

VECT_BEGIN
A,C: VECTOR(4..100);
B: VECTOR(1..97);
C = A+B
VECT_END.
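The equivalence the vectorizer must preserve can be checked in miniature. This sketch uses 1-based lists (index 0 unused) to mirror the FORTRAN indexing; the function names are illustrative.

```python
def scalar_loop(A, B, C):
    # DO 10 I = 4,100 / C(I) = A(I) + B(I-3), with 1-based indices
    for i in range(4, 101):
        C[i] = A[i] + B[i - 3]
    return C

def vector_op(A, B, C):
    # C(4..100) = A(4..100) + B(1..97), as a single vector add
    C[4:101] = [a + b for a, b in zip(A[4:101], B[1:98])]
    return C

A = [0] + [2 * i for i in range(1, 101)]     # index 0 unused (1-based data)
B = [0] + [3 * i for i in range(1, 101)]
assert scalar_loop(A, B, [0] * 101) == vector_op(A, B, [0] * 101)
```

The shifted operand ranges (B starting at 1 while A starts at 4) are exactly what the compiler must derive from the loop's index expressions.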
2. STRUCTURE OF A PIPELINED PROCESSOR
[Figure: structure of a typical pipelined processor. The instruction processing unit (IPU) generates the instruction address, accesses and decodes the instruction, and generates operand addresses on a basic time interval; memory operations and the function-unit execution stages under execution control are of variable duration.]
The importance of reducing memory access time has been demonstrated. Even
after the memory accessing problem has
been solved, another bottleneck in the
pipeline may emerge. This bottleneck is
the execution unit. Usually many arithmetic operations, especially floating point
operations, require considerable delay because of their implicit internal circuit delay
requirement or iterative characteristics. If
there is only one execution station to serve
the entire instruction stream coming in, the
speed of the execution unit may not be
compatible with the input rate, thereby
unnecessarily slowing down the computation. One alternative is to provide multiple
physical execution stations to perform
different types of operations. In the 360/91,
there is a fixed point execution area and a
floating point execution area. With this
arrangement, floating and fixed point operations can be performed asynchronously but
in parallel. But within each execution area,
the multiplicity of execution stations can
be increased. This is equivalent to increasing
the throughput of the execution unit as an
entity. For example, the floating point area
in the 360/91 has two function units: a
pipelined adder and a multiply/divide pipe.
We have shown the essential structures of
a pipelined processor. Next attention will
be paid to studying some design and operational problems associated with a typical
pipeline. Included are the following topics:
1) Buffering: the concept and urgency of
buffering in a pipeline and the ways
it can be accomplished.
2) Busing structure: for communication
between segments and operand supply
to allow processing to proceed or resume as quickly as possible.
3) Branching: effect of branching on
throughput and the ways to alleviate
the inefficiency in existing systems.
4) Interrupt handling: how interrupts
are handled in sequential and vector
pipes.
5) Pipeline processing of arithmetic functions.
Taken together these five topic areas
represent the major design constituents to
be added to the basic structure already
discussed. Their importance and effects
actually can decide the efficiency and performance of the resulting design.

2.2 Buffering

Buffering is the process of storing results
(outputs) of a segment temporarily before
forwarding them to the next segment. It is
essential in smoothing out the flow of a
computation process when the timing for
each processing module (segment) involved
is not fixed. The impact of buffering can be
visualized in a common assembly line, say
in the car industry. Occasionally a station
(segment) of the pipe (assembly line) may
be slowed down for one of many reasons,
which could prevent the continuous input
of cars to the next station. If there is sufficient storage space between this segment
and its predecessor, the latter can continue
its operation on other cars and transfer
them to the storage space until it is full.
When the station resumes normal service
it can try to clear up the cars in the input
storage place, perhaps at a faster speed.
Therefore buffering may be needed before or after any segment whose processing
speed is not fixed. In a pipelined processor
this means 1) memory storage access related stations, including instruction fetch
and operand fetch, and 2) execution unit
stations. In a typical pipe like the 360/91,
the instruction buffer can hold eight words
of instructions to be followed in the sequence. In the execution unit, for the fixed
point execution area, a buffer of six words
of instructions (pseudo) and six words of
operands is available, whereas in the floating point area a buffer of six instructions
and six operands (from storage) is also
provided. These buffers serve the purpose
of continuing the supply of instructions or
operands to the appropriate units whenever
a variable speed occurs. Similar buffers in
other pipelined processors can be found.
In the STAR-100 system, whose configuration is shown in Figure 8, a 64 quarterword
(superword) buffer exists in the stream unit
to buffer the data and to align the two
operand vectors (in vector processing mode)
for streaming in the operations involved.
In addition, there is of course the instruction buffer holding four swords of instructions.
FIGURE 8. Basic CDC STAR-100 configuration (I/O channels and direct access channel).
FIGURE 9. ASC system configuration (instruction processing units, memory buffer units, arithmetic units, memory control unit, and auxiliary processor).
FIGURE 10. Floating point unit of IBM Model 91 with CDB and reservation stations: the storage bus feeds six floating point buffers (FLB); the floating point registers (FLR) carry busy bits and tags; operands travel over the FLB bus, the FLR bus, and the common data bus (CDB) with its sink and source tags.
[Figure: floating point pipeline units with shortstop (result forwarding) paths for the A and B operands: an add unit (exponent compare, coefficient alignment shift, add, normalize), multiply and merge segments, a 24-segment multipurpose unit, and a divide unit.]
Branching
Branching is more damaging to the pipeline performance than instruction dependency. When a conditional branch is
encountered, one cannot tell which sequence
of instructions will follow until the deciding
result is available at the output. Therefore
a conditional branch not only delays further
execution but also affects the entire pipe
starting from the instruction fetch segment.
Instructions and operands fetched along an incorrect branch path may create a discontinuity of instruction supply.
To remedy the effect of branching, different techniques can be employed to provide
mechanisms whereby processing can resume even if an unexpected branch occurs.
Decode Phase

Generation of Summands

[Figure: pipelined multiplication phases — decode, summand generation, addition with carry propagation, product.]

Multiplier group    Summand operation
0000                0
0001                D
0010                2D
0011                4D−D, 2D+D
0100                4D
0101                4D+D
0110                8D−2D, 4D+2D
0111                8D−D
1000                8D
1001                8D+D
1010                8D+2D
1011                16D−4D−D
1100                16D−4D
1101                16D−4D+D
1110                16D−2D
1111                16D−D
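The table expresses each 4-bit multiplier group as a small combination of multiples of D (e.g., 0111 as 8D − D). Numerically, the group decomposition can be checked as follows; this is a sketch in which an ordinary sum stands in for the carry-save adder tree that a hardware pipeline would use.

```python
def multiply_by_groups(multiplier, D, n_bits=16):
    """Decompose an n-bit multiplier into 4-bit groups, generate one
    summand (a multiple of D) per group, and sum the summands."""
    total = 0
    for i in range(0, n_bits, 4):
        group = (multiplier >> i) & 0xF     # 4-bit multiplier group
        summand = group * D                 # e.g. 0b0111 -> 8D - D
        total += summand << i               # weight by group position
    return total

assert multiply_by_groups(0xABCD, 3) == 0xABCD * 3
assert multiply_by_groups(65535, 40000) == 65535 * 40000
```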
[Figure: carry-save adder (CSA) tree summing the selected multiples of D into the product.]
Performance Analysis

Let N be the number of bits in the multiplier, and let tc be the delay through a
CSA. Since the latter can be realized by
two levels of combinational logic, tc will be
equivalent to two logic gate delays. The
delay through the full binary adder, tFBA,
will vary with the size of the operands and
its design. Then the total time for multiplication of N bits (from the time the inputs
are introduced at the first carry save adder)
is

    t(multiply) = ⌈N/4⌉tc + 4t + tFBA.

If tc is equal to two gate delays of 20 nsec
each and tFBA for 32-bit operands using
carry look ahead logic is around 70 nsec,
then the total multiply time with 16-bit
operands to generate a 32-bit product is
270 nsec.
Extensions

The previous procedure using 4-bit multiplier groups can be extended to 8-bit multiplier groups, thereby almost doubling the
speed.

FIGURE 12e. Timing diagram and reservation table for pipeline multiplication (segments CSA1, CSA2, CSA3, and FBA against times t1-t9).
configuration is formed. While in the dynamic case more complicated control and
routing overhead is involved, the throughput may be higher because of the simultaneous existence of several configurations.
In reality, static vector pipes are more
common, as is illustrated in the TI ASC and
CDC STAR-100 examples to follow.
For a vector machine with the two levels of pipeline action, appropriate vector instructions have to be designed and implemented to denote the operations on some
ordered data in vector or array form.
Generally, in the first level, a vector instruction is fetched, decoded, and the
necessary control paths connected before
the needed elements of the vector are
fetched from consecutive storage locations
over a specified address range. The second
level execution unit pipe carries out the
specified operations on these elements,
normally being supervised by a control
ROM. Sometimes the results generated are
stored back to certain consecutive addresses
of a result field, and sometimes other
needed indicators are generated and stored
in the register file in the processor for future
usage. The exact procedures and mechanisms to accomplish all these functions vary
from machine to machine. For later comparison and analysis, an example of vector
instruction execution is provided here.
Before the execution of a vector instruction starts, certain additional information
pertinent to the mode of processing has to
be furnished to the system. Such information can be quite varied and detailed, such
as the starting (base) address of each source
vector and result vector involved (usually
two source vectors and one result vector)
and the control over what elements of the
vectors should be operated upon. The
method by which the STAR-100 handles
this is demonstrated first. The similarity
with the control of an array processing
system can be observed. Then similar and
different features in the ASC system are
noted. Finally the vector processing powers
of the two systems are compared.
The schematic diagram of the central
processing unit for the STAR-100 system is
shown in Figure 8. Basically it consists of
FIGURE 14. STAR-100 vector instruction format: fields F (function), G (subfunction), X (offset for A), A (field length and base address), Y (offset for B), B (field length and base address), Z (control vector base address), and C (offset, field length, and base address of the result field).
FIGURE 15. Vector addressing example in the STAR-100: the A source vector (base address 10000), B source vector, and C vector (base address 30000) each have an offset determining the starting address; the effective field length and the bits of the control vector determine which result elements Ci are actually replaced by the computed sums.
The ASC central processor consists of the Instruction Processing Unit (IPU), the Memory Buffer Unit (MBU), and the Arithmetic
Unit (AU). The IPU is analogous to the
stream unit in the STAR-100, the MBU is
analogous to the load/store unit, and the AU
actually processes the data. In vector mode
actually processes the data. In vector mode
the IPU fetches and decodes the instruction and calculates the effective addresses
for the vector fields. After receiving the
needed information from the IPU, the
MBU starts fetching source operands and
pairing those to be sent into an AU pipe
(the AU can have one to four identical
pipes). Each AU pipe has different configurations for performing different arithmetic operations (including integers) as in
a typical static multifunctional pipeline.
The two levels of pipeline action are quite
apparent in this case.
A vector instruction in the ASC has some
outstanding characteristics; the instruction format is depicted in Figure 16. Particular registers for fetching operand address and control information do not have
to be specified, however. Some registers in
the IPU, forming the Vector Parameter File
(VPF), are dedicated to vector processing.
The VPF consists of eight 32-bit registers
whose individual functions or interpretations have been permanently assigned,
as shown in Figure 17. This fixed organization has the advantage that registers can
be hardwired to the input of the control
ROM or other logic units for fast operation,
without having to worry about access conflicts among them. The first register contains the operation code and the type and
length of the vector considered (single or
two-dimensional). Then the base address
and the register containing the index (offset) are specified for each operand vector in
the subsequent register in the VPF. The
fifth and sixth registers are used to specify
the increment for each vector and the
number of iterations (field length) in this
inner loop. For the outer-loop (two-dimensional vectors), similar information about
the increments and number of iterations is
included in registers seven and eight. The
vector instruction, after having been de-
FIGURE 16. ASC vector instruction format (fields OP, R, T, M).

FIGURE 17. The Vector Parameter File (VPF): eight registers with permanently assigned fields for the operation code, vector type and length, base addresses and index (offset) registers for each operand vector, increments, and inner- and outer-loop iteration counts.
FIGURE 18. Control-vector example: elements of an initial vector, with the elements marked (R) selected according to the bits of the control vector.
STAR-100 / TI ASC comparison:
Vector parameter file fixed, therefore easy to reference and store.
Strong vector instruction set.
Sparse vector not included.
Variable vector increment allowed.
No control vector used.
[...]spicuous aspects in a vector machine is appropriate. From the previous description, one notices at least four aspects:

1) There is some setup time involved before executing a vector.

2) Additional control in configuring the execution pipe and monitoring operand admission and traversal is needed.

3) Richer instruction sets and intelligent compilers are prerequisites for producing optimized code for vector machines.

4) An intrinsic tradeoff between sequential and vector processing can be derived from the above considerations.

These four observations are discussed in this section.

1) Setup Time and Flush Time

As demonstrated in the ASC and STAR-100 systems, each vector instruction involves a set of vector parameter registers or control vectors to hold the information needed before the instruction can be initiated. The contents of these parameter registers are used to control the addressing operation and storage of result operands, as well as the final termination. In the STAR-100 system, they are used by the MIC and later by other buffers in the stream unit for the continuous initiation of operand fetches and execution until a termination condition is detected by the MIC. In the case of the ASC processor, they are used by the IPU for address calculation, by the MBU for memory references, and by the MIC (in the MBU) for monitoring subsequent execution activities. These parameter registers can be loaded from memory. In doing so, many additional memory fetches (register loading) have to be performed before the vector instruction can be started. These fetches represent an overhead in time: the setup time. If the vector involved has a relatively short field length (the number of iterations to be executed is small), the setup time may be comparable to the actual processing time of the vectors.

Besides the setup time, there is another time measure of interest: the flushing time. The flushing time is the period between the initial operation (decode) of the instruction and the exit of the result (for vectors, the first result element) through the entire pipe. Therefore it directly measures the sum of the execution times of all the facilities that the instruction and an operand pair have to go through. Sometimes it is interesting to compare the flush times of a vector pipe to those of a sequential pipe. A vector pipe often has to perform more activities, such as checking the termination condition, checking the control vector, etc. (though some of them can be overlapped with other operations). Therefore it is not surprising to discover that a vector pipe may have a longer flush time than its sequential counterpart.

Here an attempt is made to compare analytically sequential and vector pipeline processing in terms of time efficiency. For a vector pipe, the memory operand supply rate is usually fast enough to meet the speed of the execution pipe(s). For example, in the ASC system, the eight interleaved memory modules can maintain a total data transfer rate of 400M words per second, twice that required to support a central processor with four arithmetic unit pipes when processing vector instructions [13]. Therefore, for an effective vector field length of l, the execution time of the vector instruction can be expressed analytically as (assuming the bottleneck is in the execution units):

    t_vp = t_s + t_vf + (l - 1) t_e

where t_vp is the vector instruction processing time; t_s is the setup time; t_vf is the vector pipe flush time, including decode, address calculation, operand fetch and pairing, termination check, and execution; and t_e is the speed of the bottleneck segment of the execution unit pipe (in the case of the ASC, all eight segments have the same speed, namely 1 minor cycle = 60 nsec).

The same situation in a sequential pipe can be analogously analyzed. Suppose the same instruction has to be executed on a vector in this case. Without vector processing power, this instruction has to be invoked l times; that is, it must go through the entire pipe l times. Even if the execution unit is fast enough here, it is probable that
Computing Surveys, Vol. 9, No. 1, March 1977
the fetching of operands is less efficiently performed. (In vector machines, consecutive storage locations for operands are fetched.) The processing time of the l instructions may be expressed as:

    t_sp = t_sf + (l - 1) t_b

where t_sp is the sequential (pipeline) processing time; t_sf is the sequential pipe flush time; and t_b is the speed of the bottleneck in the pipe, most likely in fetching operands if the execution unit is fast enough, because more interference results from unstructured memory references for instructions and operands.

Comparing t_vp and t_sp yields

    t_s + t_vf + (l - 1) t_e < t_sf + (l - 1) t_b

if and only if

    t_s + t_vf - t_sf <= (l - 1)(t_b - t_e).

This relation reveals that, if the vector length is reasonably large, vector processing is beneficial, considering the time advantage. If the setup and differential flush times are large compared to the difference of the speeds of the bottlenecks of the two pipes, then a large vector field length is needed to justify processing in the vector form. Usually (t_b - t_e) has been about a tenth of t_s + t_vf - t_sf; so vector processing provides time efficiency in pipelined processors.

2) Additional Control and Hardware

Vector pipes are designed to be cost-effective. They are implemented with sufficient flexibility and power to match the speed of an array processor (which usually is more expensive). For those vector machines with multifunctional pipes, additional control to establish the desirable configurations and routing of the operands between pipe segments is needed. These needs are usually fulfilled by using microcoded control to allow flexibility and simpler circuitry. The hardware and firmware cost so introduced represents a portion of the cost of vector processing. These control functions sometimes are not very conspicuous, but they do require a considerable amount of hardware support.

In addition, some other costs arise indirectly. The vector parameter file or registers represent part of the indirect hardware needed. Larger instruction sets to cope with vector processing also demand longer word lengths, a result that affects the cost throughout the entire system. For smaller word length machines, one can try to get around the problem by using techniques such as the dedicated VPF in the ASC. Because of its cost-effectiveness and speed advantages, vector processing power may prove adaptable to medium scale systems.

To keep up the execution speed, additional memory buffers (like the MBU) may be necessary to maintain an effective memory supply rate. Memory management problems, though out of the scope of this paper, present a rich area to be explored for vector machines. All this direct and indirect control cost marks the space overhead incurred in vector processing and should be evaluated appropriately in tradeoff considerations.

3) Richer Instruction Set and Intelligent Compilers

Once the skeleton processor is assigned, the instruction set has to be designed carefully. As in the case of the STAR-100, suitable higher level vector macro and sparse vector instructions can be implemented (with proper hardware support) so that some application algorithms can be easily handled (fewer instruction and operand fetches and other conflicts). Without such well designed instruction sets, the power of the processor may depreciate many times, because inefficient operations, redundant or excessive memory references, and poorly utilized facilities may result.

Since many of the rich instructions are by no means conventional, how to use them effectively in programs becomes a prime concern. For assembly language program writing, the user has to familiarize himself not only with the algorithm he is going to implement, but first with the details of these unconventional instructions [19]. Because of the various architectural aspects involved, he has to choose a suitable algorithm carefully. Often a theoretically fast algorithm
[Figure: processing time of a vector instruction versus its scalar counterpart as a function of field length; the vector line, offset by the setup and flush overhead, grows with a smaller slope and crosses below the scalar bound for long vectors. The scalar loop equivalent of one vector instruction comprises: Update Pointer; Basic Instruction (could involve more than one scalar instruction depending on instruction format); Test Pointer and Branch.]
5) Tradeoff Summary
In this section, we have discussed the time and the space overhead needed in vector processing as compared to a sequential pipelined processor (such as the IBM 360/91). The advantages of vector processing are its speed improvement for reasonably long vectors and its more orderly management, and thus better utilization, of the memory system and other resources when dealing with vectors. The costs it incurs are the needed firmware control and additional software facilities to utilize its power. When the latter problems have been solved successfully at less cost, vector processing may be generalized and extended to smaller scale processing systems.
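The break-even condition derived in Subsection 1 above can be illustrated numerically. The following Python sketch computes the smallest field length l at which the vector pipe wins; all parameter values are illustrative assumptions, except t_e = 60 nsec, which the text gives for the ASC execution pipe.

```python
# Break-even field length for vector vs. sequential pipeline processing,
# following  t_vp = t_s + t_vf + (l - 1) t_e  and  t_sp = t_sf + (l - 1) t_b.
# Parameter values below are illustrative, except t_e = 60 nsec (ASC).

def t_vp(l, t_s, t_vf, t_e):
    """Vector processing time for field length l."""
    return t_s + t_vf + (l - 1) * t_e

def t_sp(l, t_sf, t_b):
    """Sequential pipeline time for the same l operations."""
    return t_sf + (l - 1) * t_b

def break_even(t_s, t_vf, t_sf, t_e, t_b):
    """Smallest l with t_vp < t_sp, i.e. t_s + t_vf - t_sf < (l - 1)(t_b - t_e)."""
    assert t_b > t_e, "vector mode never wins if its bottleneck is no faster"
    l = 1
    while t_vp(l, t_s, t_vf, t_e) >= t_sp(l, t_sf, t_b):
        l += 1
    return l

if __name__ == "__main__":
    # Assumed: setup 600 ns, vector flush 900 ns, sequential flush 500 ns,
    # execution bottleneck 60 ns/result (ASC minor cycle), memory bottleneck 160 ns.
    print(break_even(t_s=600, t_vf=900, t_sf=500, t_e=60, t_b=160))
```

With these assumed values the setup-plus-differential-flush overhead is ten times (t_b - t_e), so the crossover lands at a field length of about a dozen, matching the paper's observation that (t_b - t_e) is usually about a tenth of t_s + t_vf - t_sf.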
4. OVERVIEW OF TWO RECENT MACHINES
4.1 The Asynchronous CRAY-1 Computer
This subsection examines the computer CRAY-1 of Cray Research Corporation [26]. Several unique features of this machine are explored to supplement the ideas in Section 3 and to illustrate the current trend of progress.
The CRAY-1 design philosophy follows closely the tradition of the CDC 6600 and 7600. The twelve functional units incorporate vector processing capabilities and are "connectable" to form efficient chains, thereby maximizing overlapped vector processing. These units represent a deviation from the universal (multifunctional) pipe approach adopted by the ASC and STAR-100. However, the tradeoff is quite apparent: the control here is more complex. Some specific features of the CRAY-1 include:
Operating Registers
Figure 20 illustrates the register organization of this computer. The primary operating registers are the scalar and vector
registers, called S and V registers, respectively. Each of the eight V registers holds sixty-four 64-bit elements. A scalar instruction may perform
some function, such as addition, obtaining
operands from two S registers and entering
the result into another S register. A vector
instruction performs the same function in
an analogous fashion, obtaining a new pair
of operands each clock cycle of 12.5 nsec
from two V registers and storing the result
into another V register. The contents of
the vector length (VL) register determine
the number of operations performed by the
vector instruction. Eight 24-bit A registers
are used as address registers for memory
references and as index registers. The A and
S registers are each supported by 64 rapid
access temporary storage registers called
B and T registers. Data can be transferred
between A, B, S, T, or V registers and
memory.
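The role of the vector length (VL) register described above can be sketched as a minimal Python model. The eight V registers of sixty-four elements and the one-result-per-clock discipline come from the text; the class itself, its names, and the element values are purely illustrative.

```python
# Toy model of CRAY-1 style V registers: eight registers of 64 elements,
# with the vector length (VL) register limiting how many element pairs
# a register-to-register vector instruction processes. Illustrative only.

class VectorUnit:
    def __init__(self):
        self.V = [[0] * 64 for _ in range(8)]  # eight 64-element V registers
        self.VL = 0                            # vector length register

    def vadd(self, i, j, k):
        """Vi <- Vj + Vk: one element pair per clock period, VL elements total."""
        for n in range(self.VL):
            self.V[i][n] = self.V[j][n] + self.V[k][n]

u = VectorUnit()
u.V[1][:4] = [1, 2, 3, 4]
u.V[2][:4] = [10, 20, 30, 40]
u.VL = 3                  # only the first three element pairs take part
u.vadd(0, 1, 2)
print(u.V[0][:4])         # the fourth element of V0 is untouched
```

Note how the same opcode serves any vector length up to 64: the instruction names only registers, and VL supplies the iteration count, much as the vector parameter registers do in the ASC and STAR-100.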
Instruction Buffers
Instructions, which are either 16 or 32 bits,
are executed from four instruction buffers,
each consisting of 64 16-bit registers. Associated with each instruction buffer is a
base address register that is used to determine if the current instruction resides in
a buffer. Forward and backward branching
within the buffers is possible, and the program segments may be discontinuous in
the program buffer. When the current instruction does not reside in a buffer, one
of the instruction buffers is filled from
memory. Four memory words are read per
clock period to the least recently filled instruction buffer. To allow the current instruction to be issued as soon as possible,
the memory word containing the current
instruction is among the first to be read.
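The residency test and refill policy just described (a base address register per buffer, with the least recently filled buffer replaced on a miss) can be sketched as follows. The four buffers of 64 parcels and the replacement policy come from the text; the code structure, window placement on refill, and method names are assumptions.

```python
# Sketch of CRAY-1 style instruction buffers: four buffers of 64 parcels,
# each with a base address register; on a miss, the least recently filled
# buffer is refilled. Simplified: the new window starts at the missed
# address, and refill timing is ignored. Illustrative only.

BUFFER_PARCELS = 64

class InstructionBuffers:
    def __init__(self):
        self.base = [None] * 4   # base address register per buffer
        self.stamp = [0] * 4     # fill-order stamps (0 = never filled)
        self.clock = 0

    def resident(self, addr):
        """True if addr falls inside some buffer's current window."""
        return any(b is not None and b <= addr < b + BUFFER_PARCELS
                   for b in self.base)

    def fetch(self, addr):
        """Return True on a buffer hit; on a miss, refill and return False."""
        if self.resident(addr):
            return True
        victim = min(range(4), key=lambda i: self.stamp[i])  # least recently filled
        self.base[victim] = addr
        self.clock += 1
        self.stamp[victim] = self.clock
        return False

ib = InstructionBuffers()
ib.fetch(100)          # cold-start miss fills one buffer from address 100
print(ib.fetch(130))   # hit: inside the 64-parcel window
```

Because residency is a simple base-plus-length check, branches forward or backward that land inside any buffer's window cost nothing, which is exactly why discontinuous program segments can coexist in the buffers.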
Functional Units
The CRAY-1 CPU has twelve functional
units, each of which is independent of the
others and therefore capable of parallel
operation. A functional unit receives operands from registers and delivers each
result to a register when the operation is
completed. The functional units retain no
information regarding their past operation.
The three functional units that provide
24-bit results to A registers are Integer
Add, Integer Multiply, and Population
Count. The three functional units that
provide 64-bit results to the S registers are
Integer Add, Shift, and Logical. The three
functional units providing 64-bit results
to the V registers only are Integer Add,
Shift, and Logical. The three functional
units that provide 64-bit results to either
the S or V registers are Floating Add,
Floating Multiply, and Reciprocal Approximation. All functional units are buffered,
perform their algorithms in a fixed amount
of time, and produce one result per clock
period.
Memory
Up to one million 64-bit words are arranged
in 16 banks with a bank cycle time of 4
clock periods. The memory is constructed of
bipolar 1024-bit LSI chips.
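The consequence of the 16-bank arrangement with a 4-clock-period bank cycle can be sketched with a standard simplified conflict model: stride-1 streams never revisit a bank before it recovers, while a stride equal to the number of banks hits the same bank every reference. This is a textbook-style simplification, not a description of the actual CRAY-1 conflict-resolution logic.

```python
# Simplified bank-conflict model for a 16-bank memory with a bank cycle
# (busy) time of 4 clock periods. A reference to a busy bank stalls until
# the bank recovers; otherwise one reference issues per clock period.

N_BANKS, BANK_CYCLE = 16, 4

def access_time(addresses):
    """Total clock periods to issue all references, stalling on bank conflicts."""
    busy_until = [0] * N_BANKS      # earliest period each bank is free again
    t = 0
    for addr in addresses:
        bank = addr % N_BANKS
        t = max(t, busy_until[bank])  # stall while the bank is still cycling
        busy_until[bank] = t + BANK_CYCLE
        t += 1                        # next reference may issue next period
    return t

print(access_time(range(0, 64)))        # stride 1: conflict-free
print(access_time(range(0, 1024, 16)))  # stride 16: every reference, same bank
```

With stride 1 the 64 references complete in 64 periods; with stride 16 each reference waits out the 4-period bank cycle, roughly quadrupling the total, which is why vector strides that are multiples of the bank count are costly on interleaved memories.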
Vector Operations
Because of the instruction formats adopted,
vector instructions are of four types. One
type of vector instruction obtains operands
FIGURE 20. Register block diagram, CRAY-1. [Diagram shows the vector registers V0–V7 feeding the vector functional units under vector control (vector length, vector mask, real-time clock); the scalar S registers with backup registers T00–T77 and the floating-point functional units; the address A registers with backup B registers and the address functional units; memory; and the instruction buffers. Detail not recovered.]
[...each pair of operands enters the] functional unit each clock period, and the corresponding results emerge from the functional unit n periods later, where n is the execution time. The results are entered
FIGURES 21a–21d. [The four types of CRAY-1 vector instructions, showing the operand registers (Vj, Vk) and the result register; diagrams not recovered.]
The first pair of operands (V1₀ and V2₀) is transmitted to the add functional unit, where it arrives at time t₁. The function is executed in six clock periods, and the first result exits from the functional unit at clock period t₇. The second pair of operands (V1₁ and V2₁) arrives at the functional unit at t₂, and so on.
Parallel Operations
When a vector instruction is issued, the
required functional unit and the operand
registers are reserved for the number of
clock periods determined by the vector
length. A subsequent vector instruction requiring the same resources (functional units
and registers) cannot be executed until the
resources are released; however, parallel
(simultaneous) execution of neighboring
instructions that do not interfere in their
resource requirements is permitted.
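The issue rule just described, under which a vector instruction reserves its functional unit and registers and a successor may execute in parallel only if its resource set is disjoint, can be sketched as a small scoreboard. The rule is from the text; the data representation and register names are illustrative.

```python
# Sketch of the CRAY-1 parallel-issue rule for vector instructions: an
# instruction reserves its functional unit and its operand/result
# registers; a later instruction may execute simultaneously only if the
# two resource sets are disjoint. Illustrative scoreboard only.

def can_issue(instr, reserved):
    """instr = (unit, registers); reserved = set of currently busy resources."""
    unit, regs = instr
    return unit not in reserved and not (set(regs) & reserved)

def issue(instr, reserved):
    """Reserve the instruction's functional unit and registers."""
    unit, regs = instr
    reserved |= {unit} | set(regs)

reserved = set()
add = ("vector_add", ["V1", "V2", "V0"])      # V0 <- V1 + V2
mul = ("float_multiply", ["V3", "V4", "V5"])  # disjoint resources
shf = ("shift", ["V0", "V6"])                 # needs V0, still reserved

issue(add, reserved)
print(can_issue(mul, reserved))   # parallel execution permitted
print(can_issue(shf, reserved))   # must wait until V0 is released
```

The disjointness test is what distinguishes ordinary parallel issue from chaining, described next, where a deliberate overlap on one register is exploited instead of avoided.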
Chaining
The CRAY-1 machine has the unique ability
to combine several pipeline executions in a
sequence by chaining. In the chaining process
a result register which receives the result of
a vector instruction can become the operand
register of a succeeding instruction. The
succeeding instruction is started as soon as
the first result arrives for use as an operand.
Figure 23(a) shows a chain of four instructions reading a vector of integers from
memory, adding that vector to another,
shifting the sum, and finally forming the
logical product of the shifted sum. [Figure: timing of a vector operation, successive operand pairs (V1ᵢ, V2ᵢ) entering the functional unit on consecutive clock periods and results V0ᵢ emerging correspondingly later; detail not recovered.]
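The timing advantage of chaining, in which each succeeding instruction starts as soon as the first result element of its predecessor arrives rather than after the whole vector completes, can be sketched as follows. The chain structure (memory read, add, shift, logical) is from the text; the per-unit latencies are illustrative assumptions.

```python
# Timing sketch of chaining. Without chaining, each instruction in the
# chain waits for its predecessor to finish all l elements; with chaining,
# it starts when the predecessor's first result element arrives. After a
# unit's first result, one result emerges per clock period.
# Unit latencies below are assumed, not actual CRAY-1 figures.

def unchained_time(l, unit_times):
    """Each stage processes all l elements before the next stage starts."""
    return sum(t + (l - 1) for t in unit_times)

def chained_time(l, unit_times):
    """Each stage starts on its predecessor's first result element."""
    return sum(unit_times) + (l - 1)

l = 64
units = [7, 6, 4, 3]   # memory read, add, shift, logical (assumed latencies)
print(unchained_time(l, units))
print(chained_time(l, units))
```

Chaining collapses the chain into a single long pipeline: the (l - 1) streaming term is paid once instead of once per instruction, which is why chained vector loops on the CRAY-1 approach one result per clock period regardless of chain depth.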
Performance
FIGURE 23c. [Timing chart of the four-instruction chain over clock periods t₀–t₃₁: the memory READ; transit of the sum from the integer add functional unit to an element of V2; the shift operation performed by the shift functional unit; transit of the shifted sum from the shift functional unit to an element of V3; and the logical unit. Successive elements V5₀–V5₈ are produced on successive clock periods. The accompanying performance text, built around the inner product c_ij = Σ_m a_im · b_mj, is not recovered.]
[Figure: CRAY-1 performance in MFLOPS as a function of matrix dimension, comparing scalar and vector execution for operations including alog, cos, sqrt, and exp; curves not recovered.]
CONCLUSION

[The concluding text is not recovered in this extraction.]
REFERENCES

[Entries [11]–[22] of the reference list are not recovered in this extraction; citations such as [13], [19], and [26] in the text refer to it.]