15th Lecture 6. Future Processors To Use Coarse-Grain Parallelism
15th Lecture
6. Future Processors to use Coarse-Grain Parallelism
Chip Multiprocessors (CMP)
Multithreaded processors
Simultaneous multithreading
- Instructions are simultaneously issued from multiple threads to the
FUs of a superscalar processor.
- The SMT approach combines a wide superscalar instruction issue with
the multithreading approach
- by providing several register sets on the processor
- and issuing instructions from several instruction queues
simultaneously.
- The issue slots of a wide issue processor can be filled by operations of
several threads.
Latencies occurring in the execution of single threads are bridged by issuing
operations of the remaining threads loaded on the processor.
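A minimal sketch of how issue slots left empty by one stalled thread can be filled by another. The function name, the per-cycle "ready operation" counts, and the fixed thread priority order are illustrative assumptions, not part of any simulator described in this lecture.

```python
def issued_per_cycle(ready_by_thread, issue_width):
    """Return the number of issue slots filled in each cycle.

    ready_by_thread[t][c] is the (hypothetical) number of operations
    thread t could issue in cycle c. Threads are served in list order.
    """
    cycles = len(ready_by_thread[0])
    totals = []
    for c in range(cycles):
        free = issue_width
        for ready in ready_by_thread:
            take = min(ready[c], free)  # a thread cannot use more slots than are free
            free -= take
        totals.append(issue_width - free)
    return totals


thread_a = [3, 0, 0, 2]  # stalls in cycles 1 and 2, e.g. on a cache miss
thread_b = [1, 2, 3, 1]

print(issued_per_cycle([thread_a], 4))            # [3, 0, 0, 2]
print(issued_per_cycle([thread_a, thread_b], 4))  # [4, 2, 3, 3]
```

With a single thread the stall cycles issue nothing; with a second thread loaded on the processor the same cycles still issue operations.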
Simultaneous Multithreading (SMT)
SMT fetch unit can take advantage of the interthread competition for
instruction bandwidth in two ways:
First, it can partition fetch bandwidth among the threads and fetch from several
threads each cycle.
Goal: increasing the probability of fetching only non-speculative instructions.
Second, the fetch unit can be selective about which threads it fetches.
The main drawback of simultaneous multithreading may be that it
complicates the instruction issue stage, which always remains central to
all threads.
A functional partitioning, as demanded for processors of the
10^9-transistor era, is therefore not easily reached.
To date, no simultaneous multithreaded processor exists, only simulations.
General opinion: SMT will be in next generation microprocessors.
Announcement of Microprocessor Forum Oct. 1999:
DEC/Compaq Alpha EV8: (21464) will be four-threaded SMT
SMT at the Universities of Washington and San Diego
Hypothetical out-of-order issue superscalar microprocessor that resembles
MIPS R10000 and HP PA-8000.
8 threads and 8-issue superscalar organization are assumed.
Eight instructions are decoded, renamed and fed to either the integer or
floating-point instruction window.
Unified buffers are used
When operands become available, up to 8 instructions are issued
out-of-order per cycle, executed and retired.
Each thread can address 32 architectural integer (and floating-point)
registers. These registers are renamed to a large physical register file of 356
physical registers.
[Figure: SMT processor organization. Pipeline stages: instruction fetch,
instruction decode, instruction issue, execution pipelines. Components: fetch
unit with PC, I-cache, decode, register renaming, integer and floating-point
instruction queues, integer and floating-point register files, integer
load/store units, floating-point units, D-cache.]
SMT at the Universities of Washington and San Diego -
Instruction fetching schemes
Basic: Round-robin: RR.2.8 fetching scheme, i.e., in each cycle, two
times 8 instructions are fetched in round-robin policy from two different
threads;
superior to other schemes such as RR.1.8, RR.4.2, and RR.2.4
Other fetch policies:
BRCOUNT scheme gives highest priority to those threads that are least likely
to be on a wrong path,
MISSCOUNT scheme gives priority to the threads that have the fewest
outstanding D-cache misses
IQPOSN policy gives lowest priority to the oldest instructions by penalizing
those threads with instructions closest to the head of either the integer or the
floating-point queue
ICOUNT feedback technique gives highest fetch priority to the threads with
the fewest instructions in the decode, renaming, and queue pipeline stages
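The difference between round-robin and ICOUNT thread selection can be sketched as follows. This is a simplified illustration under stated assumptions: the function names, the dictionary of per-thread front-end instruction counts, and the tie-breaking by thread id are hypothetical and not taken from the simulator.

```python
def icount_select(front_end_counts, threads_per_cycle=2):
    """Pick the threads to fetch from in this cycle (ICOUNT-style).

    front_end_counts maps a thread id to the number of its instructions
    currently in the decode, renaming, and queue stages (assumed input).
    """
    # Fewest front-end instructions first; ties go to the lower thread id.
    ranked = sorted(front_end_counts, key=lambda t: (front_end_counts[t], t))
    return ranked[:threads_per_cycle]


def round_robin_select(num_threads, cycle, threads_per_cycle=2):
    """RR baseline: rotate through the threads regardless of their state."""
    start = (cycle * threads_per_cycle) % num_threads
    return [(start + i) % num_threads for i in range(threads_per_cycle)]


counts = {0: 12, 1: 3, 2: 7, 3: 3}
print(icount_select(counts))            # [1, 3]
print(round_robin_select(4, cycle=1))   # [2, 3]
```

In an x.2.8 scheme, up to 8 instructions would then be fetched from each of the two selected threads.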
The ICOUNT policy proved superior.
The ICOUNT.2.8 fetching strategy reached an IPC of about 5.4 (RR.2.8
reached only about 4.2).
Most interesting is that the best fetching strategy targets neither
mispredicted branches (BRCOUNT) nor blocking due to cache misses
(MISSCOUNT) alone, but a mix of both and perhaps some other effects.
Recently, simultaneous multithreading has been evaluated with
SPEC95
database workloads
and multimedia workloads (Oehring, Sigmund, Ungerer 1999/2000),
both achieving roughly a 3-fold IPC increase with an eight-threaded SMT
over a single-threaded superscalar with similar resources.
SMT Processor with Multimedia-Enhancement
- Combining Simultaneous Multithreading and Multimedia
Start with a wide-issue superscalar general-purpose processor
Enhance by simultaneous multithreading
Enhance by multimedia unit(s)
Utilization of subword parallelism
(data parallel instructions, SIMD)
Saturation arithmetic
Additional arithmetic, masking
and selection, reordering and
conversion instructions
Enhance by additional features useful for multimedia processing,
e.g. on-chip RAM memory, special cache techniques
For more information see:
http://goethe.ira.uka.de/people/ungerer/smt-mm/SM-MM-processor.html
[Figure: The SMT Multimedia Processor Model. Pipeline stages: IF, ID, RI, RT,
WB. Units: branch, simple integer, complex integer, thread control, global
load/store, local load/store. Other components: rename registers, ICache,
BTAC, DCache, local memory, I/O, memory interface (to memory).]
Maximum Processor Configuration
- IPCs of 8-threaded 8-issue cases
Initial maximum configuration: 2.28
16-entry reservation stations for thread, global and local load/store units
(instead of 256): 2.96
one common 256-entry reservation station unit for all integer/multimedia
units (instead of 256-entry reservation stations each): 3.27
loads and stores may pass blocked load/stores of other threads: 4.1
highest-priority-first, non-speculative-instruction-first, non-saturated-first
strategies for issue, dispatch, and retire stages: 4.34
32-entry reorder buffer (instead of 256): 4.69
second local load/store unit (because of 20.1% local load/stores): 6.07
(6.32 with dynamic branch prediction)
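To see how much each configuration change contributes, the relative IPC gain per step can be computed from the figures listed above. The short step labels and the helper function are illustrative; only the IPC values come from the slide.

```python
# IPC after each successive configuration change (values from the list above).
steps = [
    ("initial maximum configuration", 2.28),
    ("16-entry reservation stations for thread and load/store units", 2.96),
    ("one common reservation station for integer/multimedia units", 3.27),
    ("loads/stores may pass blocked load/stores of other threads", 4.10),
    ("priority strategies for issue, dispatch, and retire", 4.34),
    ("32-entry reorder buffer", 4.69),
    ("second local load/store unit", 6.07),
]


def relative_gains(steps):
    """Return (step name, fractional IPC gain over the previous step)."""
    gains = []
    for (_, before), (name, after) in zip(steps, steps[1:]):
        gains.append((name, after / before - 1.0))
    return gains


for name, gain in relative_gains(steps):
    print(f"{name}: +{gain:.1%}")

overall = steps[-1][1] / steps[0][1]
print(f"overall: {overall:.2f}x")
```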
IPC of Maximum Processor with On-chip RAM and
Two Local Load/Store Units (MTEAC 2000)
4 MB I-cache, D-cache fill burst rate of 6:2:2:2
IPC by number of threads (rows) and issue bandwidth (columns):
Threads   Issue 8   Issue 6   Issue 4   Issue 2   Issue 1
8         6.32      5.56      3.84      1.98      1
6         6.33      5.64      3.89      1.99      1
4         5.67      5.34      3.91      1.99      1
2         3.53      3.52      3.27      1.96      1
1         1.86      1.86      1.86      1.57      0.96
More Realistic Processor
D-cache fill burst rate
of 32:4:4:4
issue bandwidth 8
Speedup
[Figure: speedup of the maximum processor and of the realistic processor]
A threefold speedup
6.5 SMT vs. Multiprocessor Chip
- IPC-Performance of SMT and CMP (1)
SPEC92-simulations of Tullsen et al. 1995 vs. Sigmund, Ungerer 1995
IPC-Performance of SMT and CMP (2)
SPEC95-simulations of Eggers et al. 1997
CMP2: 2 processors, each 4-issue superscalar 2*(1,4)
CMP4: 4 processors, each 2-issue superscalar 4*(1,2)
SMT: 8 threads, 8-issue superscalar 1*(8,8)
IPC-Performance of SMT and CMP
SPEC95-simulations of Hammond et al. 1997
Performance is given relative to a single 2-issue superscalar processor as
baseline processor
Comments to the Simulation Results of Hammond,
Nayfeh, Olukotun 1997
A CMP (eight 2-issue processors) outperforms a 12-issue superscalar and a
12-issue, 8-threaded SMT processor on four SPEC95 benchmark programs
(parallelized by hand for CMP and SMT).
The CMP achieved higher performance than the SMT because it has a total of
16 issue slots instead of the 12 issue slots of the SMT.
Hammond et al. argue that the design complexity of a 16-issue CMP is similar
to that of a 12-issue superscalar or a 12-issue SMT processor.
SMT vs. Multiprocessor Chip
(Univ. of Washington, Eggers et al. 1997)
SMT obtained better speedups than the chip multiprocessors (CMPs)
- in contrast to the results of Hammond et al.!
Eggers et al. compared an 8-issue, 8-threaded SMT with a CMP of four
2-issue processors.
Hammond et al. compared a 12-issue, 8-threaded SMT with a CMP of eight
2-issue processors.
Eggers et al.:
Speedups on the CMP were hindered by the fixed partitioning of its
hardware resources across the processors.
In the CMP, processors were idle when thread-level parallelism was insufficient.
Exploiting large amounts of instruction-level parallelism in the unrolled loops
of individual threads was not possible because of the smaller issue bandwidth
of the CMP processors.
An SMT processor dynamically partitions its resources among threads,
and therefore can respond well to variations in both types of parallelism,
exploiting them interchangeably.
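The argument about fixed versus dynamic partitioning can be illustrated with a deliberately crude throughput model. Everything here is an assumption for illustration: each thread is reduced to a single number (operations it could issue per cycle), thread i is bound to processor i on the CMP, and no cache or fetch effects are modelled. It is not the simulation setup of Eggers et al.

```python
def cmp_ipc(ready_per_thread, processors, width):
    """Static partitioning: thread i runs on processor i, capped at `width` slots."""
    return sum(min(ready, width) for ready in ready_per_thread[:processors])


def smt_ipc(ready_per_thread, width):
    """Dynamic partitioning: all threads compete for the same `width` slots."""
    return min(sum(ready_per_thread), width)


# Little thread-level parallelism, but one thread with high ILP (unrolled loop).
few_threads = [6, 1]
# Thread-level parallelism that happens to match the fixed partition exactly.
balanced = [2, 2, 2, 2]

for workload in (few_threads, balanced):
    print(workload,
          "CMP4:", cmp_ipc(workload, processors=4, width=2),
          "SMT:", smt_ipc(workload, width=8))
```

In this model the four 2-issue processors reach 3 operations per cycle on the first workload while the 8-issue SMT reaches 7; on the balanced workload both reach 8.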
Conclusions
The performance race between SMT and CMP is not yet decided.
CMP is easier to implement, but only SMT has the ability to hide latencies.
A functional partitioning is not easily reached within an SMT processor due to
the centralized instruction issue.
A separation of the thread queues is a possible solution, although it does not
remove the central instruction issue.
A combination of simultaneous multithreading with the CMP may be superior.
We favor a CMP consisting of moderately equipped (e.g., 4-threaded 4-issue
superscalar) SMTs.
Future research: combine SMT or CMP organization with the ability to create
threads with compiler support or fully dynamically out of a single thread:
- thread-level speculation
- close to multiscalar
Processors that Use Thread-level Speculation to Boost
Single-threaded Programs
Multiscalar processors divide a program into a collection of tasks
that are distributed to a number of parallel processing units under
control of a single hardware sequencer.
Trace processors facilitate high ILP and a fast clock by breaking
up the processor into multiple distinct cores (similar to multiscalar!),
and breaking up the program into traces (dynamic sequences of
instructions).
DataScalar processors run the same sequential program
redundantly across multiple processors using distributed data sets.
5.4 Multiscalar Processors
A program is represented as a control flow graph (CFG), where basic blocks
are nodes, and arcs represent flow of control.
Program execution can be viewed as walking through the CFG with a high
level of parallelism.
A multiscalar processor walks through the CFG speculatively, taking
task-sized steps, without pausing to inspect any of the instructions within a task.
The primary constraint: it must preserve the sequential program semantics.
A program is statically partitioned into tasks which are marked by annotations
of the CFG.
The tasks are distributed to a number of parallel PEs within a processor.
Each PE fetches and executes instructions belonging to its assigned task.
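The task distribution just described can be sketched as a small model of the sequencer. This is a hypothetical illustration: the class and method names are invented, tasks are plain labels, and the squash of a mispredicted task is an assumed example rather than a description of the actual hardware.

```python
from collections import deque


class Sequencer:
    """Toy model: assigns predicted tasks to PEs in sequential order."""

    def __init__(self, num_pes):
        self.free = deque(range(num_pes))
        self.active = deque()  # (pe, task) pairs from head (oldest) to tail (youngest)

    def assign(self, task):
        """Speculatively assign the next predicted task; None if no PE is free."""
        if not self.free:
            return None
        pe = self.free.popleft()
        self.active.append((pe, task))
        return pe

    def retire_head(self):
        """Commit the oldest task and free its PE."""
        pe, task = self.active.popleft()
        self.free.append(pe)
        return task

    def squash_from(self, task):
        """Discard `task` and all younger tasks (youngest first); return them in order."""
        squashed = []
        while self.active:
            pe, t = self.active.pop()
            self.free.appendleft(pe)
            squashed.append(t)
            if t == task:
                break
        return squashed[::-1]


seq = Sequencer(4)
print([seq.assign(t) for t in "ABDE"])  # [0, 1, 2, 3]
print(seq.squash_from("D"))             # ['D', 'E'], assuming the step to D was mispredicted
print(seq.assign("C"))                  # 2: the corrected task reuses the freed PE
print(seq.retire_head())                # A
```

Sequential semantics are preserved because only the head task ever commits, and a wrong step discards every task that followed it.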
Multiscalar mode of execution
[Figure: control flow graph with basic blocks A, B, C, D, and E. Task A runs on
PE 0, Task B on PE 1, Task D on PE 2, and Task E on PE 3. Label: data values.]
Multiscalar processor
[Figure: multiscalar processor. A sequencer with head and tail pointers
controls processing elements 1 to n, each with an I-cache, a processing unit,
and a register file. The processing elements are connected by a unidirectional
ring. An interconnect links them to the ARB, the D-cache, and the data banks.]
Proper resolution of inter-task data dependences
Concerns in particular data that is passed between instructions via registers
and via memory.
To maintain a sequential appearance a twofold strategy is employed.
First, each processing element adheres to sequential execution semantics for
the task assigned to it.
Second, a loose sequential order is enforced over the collection of processing
elements, which in turn imposes a sequential order of the tasks.
The sequential order on the processing elements is maintained by
organizing the elements into a circular queue.
The appearance of a single logical register file is maintained although
copies are distributed to each parallel PE.
Register results are dynamically routed among the many parallel processing
elements with the help of compiler-generated masks.
For memory operations: An address resolution buffer (ARB) is provided
to hold speculative memory operations and to detect violations of memory
dependences.
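How an address resolution buffer can detect a violated memory dependence is sketched below. This is a toy model under stated assumptions: tasks are identified by their position in sequential order, capacity is unlimited, and commit and squash handling are omitted. The class and its methods are hypothetical and do not reproduce the actual ARB organization.

```python
class AddressResolutionBuffer:
    """Toy ARB: a smaller task number means earlier in sequential program order."""

    def __init__(self, memory):
        self.memory = dict(memory)  # committed state: address -> value
        self.stores = {}            # address -> {task: speculative value}
        self.loads = {}             # address -> {task: supplying task, None = memory}

    def load(self, task, addr):
        """Read from the closest store by this or an earlier task, else from memory."""
        stores = self.stores.get(addr, {})
        earlier = [t for t in stores if t <= task]
        source = max(earlier) if earlier else None
        # Keep only the first load per task: it is the one that saw the oldest data.
        self.loads.setdefault(addr, {}).setdefault(task, source)
        return stores[source] if earlier else self.memory.get(addr, 0)

    def store(self, task, addr, value):
        """Buffer the store; return the later tasks that already loaded a stale value."""
        violators = sorted(
            t for t, source in self.loads.get(addr, {}).items()
            if t > task and (source is None or source <= task)
        )
        self.stores.setdefault(addr, {})[task] = value
        return violators


arb = AddressResolutionBuffer({0x10: 5})
print(arb.load(2, 0x10))      # 5: task 2 loads speculatively from memory
print(arb.store(1, 0x10, 9))  # [2]: the earlier task 1 stores afterwards, task 2 must be squashed
```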
The multiscalar paradigm has at least two forms of
speculation:
control speculation, which is used by the task sequencer, and
data dependence speculation, which is performed by each PE.
It could also use other forms of speculation, such as data value
speculation, to alleviate inter-task dependences.
Alternatives with Thread-level Speculation
Static Thread-level speculation: multiscalar
Dynamic thread-level speculation: threads are generated during run-time
from:
Loops
Subprogram invocations
Specific start or stop instruction
Traces: Trace processors
Datascalar
5.5 Trace Processors
The focus of a Trace processor is the combination of the Trace
cache with multiscalar.
Idea: Create subsystems similar in complexity to today's
superscalar processors and combine replicated subsystems into a
full processor.
A trace is a sequence of instructions that potentially covers several
basic blocks starting at any point in the dynamic instruction stream.
A trace cache is a special I-cache that captures dynamic instruction
sequences in contrast to the I-cache that contains static instruction
sequences.
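The distinction between static and dynamic instruction sequences can be made concrete with a small sketch. It assumes a trace is identified by its start address plus the outcomes of the branches inside it; that indexing, the toy program encoding, and all names are assumptions for illustration, not details given in this lecture.

```python
class TraceCache:
    """Toy trace cache keyed by (start pc, branch outcomes)."""

    def __init__(self):
        self.traces = {}

    def lookup(self, start_pc, predicted_outcomes):
        return self.traces.get((start_pc, tuple(predicted_outcomes)))

    def fill(self, start_pc, outcomes, instructions):
        self.traces[(start_pc, tuple(outcomes))] = list(instructions)


def build_trace(program, start_pc, outcomes, max_len=16):
    """Follow the dynamic instruction stream across basic-block boundaries.

    program maps pc -> (kind, target), where kind is "op" or "br";
    outcomes lists taken (True) / not taken (False) for each branch in turn.
    """
    trace, pc, remaining = [], start_pc, list(outcomes)
    while pc in program and len(trace) < max_len:
        kind, target = program[pc]
        if kind == "br" and not remaining:
            break  # no outcome left for this branch: the trace ends before it
        trace.append(pc)
        if kind == "br" and remaining.pop(0):
            pc = target  # taken branch: continue at the target
        else:
            pc += 1      # ordinary instruction or not-taken branch
    return trace


program = {0: ("op", None), 1: ("br", 4), 2: ("op", None), 3: ("op", None),
           4: ("op", None), 5: ("br", 0), 6: ("op", None)}

cache = TraceCache()
print(cache.lookup(0, [True, False]))   # None: miss, the trace must be built
trace = build_trace(program, 0, [True, False])
cache.fill(0, [True, False], trace)
print(cache.lookup(0, [True, False]))   # [0, 1, 4, 5, 6]
```

The stored trace skips addresses 2 and 3, which a conventional I-cache line would hold contiguously with 0 and 1.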
See: J.E. Smith, S. Vajapeyam: Trace processors: Moving to Fourth-Generation
Microarchitectures. IEEE Computer, Sept. 1997, pp. 68-74
Trace Processor
[Figure: trace processor. Components: branch prediction, trace construction,
instruction preprocessing, trace cache, next-trace prediction, data-value
prediction, global registers, and processing elements 1 to n, each with an
instruction buffer, local registers, and functional units.]
Instruction fetch hardware fetches instructions from the I-cache and
simultaneously generates traces of 8 to 32 instructions including predicted
conditional branches.
Traces are built as the program executes and they are stored in a trace cache.
A trace fetch unit reads traces from the trace cache and parcels them out to
the parallel PEs.
Next trace prediction speculates on the next traces to be executed.
Next trace prediction predicts multiple branches per cycle.
Data prediction speculates on the trace's input data values.
Constant value prediction is correct in 80% of the cases for gcc.
A trace cache miss causes a trace to be built through conventional instruction
fetching with branch prediction.
A trace processor is similar to a multiscalar processor except for its use of
hardware-generated dynamic traces rather than compiler-generated static tasks.
5.6 DataScalar Processors
The DataScalar model of execution runs the same
sequential program redundantly across multiple processors.
The data set is distributed across physical memories that are
tightly coupled to their distinct processors.
Each processor broadcasts operands that it loads from its
local memory to all other processors.
Instead of explicitly accessing a remote memory, processors
wait until the requested value is broadcast.
Stores are completed only by the processor that owns the
operand, and are dropped by the others.
Address Space
The address space is divided into a replicated and a
communicated section:
The communicated section holds values that only exist in
single copies and are owned by the respective processor.
The communicated section of the address space is
distributed among the processors.
Replicated pages are mapped in each processor's local
memory.
Access to a replicated page requires no data communication.
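The load, store, and address-space rules above can be sketched in a few lines. This is a simplified model under stated assumptions: the page size, the class and its fields, and the lock-step execution of all processors are hypothetical, and timing is ignored entirely.

```python
PAGE_SIZE = 4096  # hypothetical page size


class DataScalarSystem:
    """Toy model: every processor runs the same loads and stores."""

    def __init__(self, num_procs, replicated_pages, owner_of_page):
        self.num_procs = num_procs
        self.replicated = set(replicated_pages)
        self.owner_of_page = dict(owner_of_page)  # communicated page -> owning processor
        self.local = [dict() for _ in range(num_procs)]  # per-processor local memory
        self.broadcasts = 0

    def store(self, addr, value):
        """All processors execute the store; only holders of the page complete it."""
        page = addr // PAGE_SIZE
        for p in range(self.num_procs):
            if page in self.replicated or self.owner_of_page[page] == p:
                self.local[p][addr] = value
            # otherwise processor p drops the store

    def load(self, addr):
        """Return the value seen by each processor."""
        page = addr // PAGE_SIZE
        if page in self.replicated:
            # Replicated page: every processor reads its own copy, no communication.
            return [self.local[p].get(addr, 0) for p in range(self.num_procs)]
        owner = self.owner_of_page[page]
        value = self.local[owner].get(addr, 0)
        self.broadcasts += 1  # one-way broadcast by the owner, no request message
        return [value] * self.num_procs


ds = DataScalarSystem(2, replicated_pages=[0], owner_of_page={1: 0, 2: 1})
ds.store(8, 11)              # replicated page: kept by both processors
ds.store(PAGE_SIZE + 8, 22)  # communicated page owned by processor 0
print(ds.load(8), ds.broadcasts)              # [11, 11] 0
print(ds.load(PAGE_SIZE + 8), ds.broadcasts)  # [22, 22] 1
```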
Accesses of DataScalar processors to memory
[Figure: Processor 1 and Processor 2, each with a local memory divided into a
replicated and a communicated part. Both processors perform load-1, store-1,
load-2, and store-2. Label: broadcast.]
Why DataScalar?
The main goal of the DataScalar model of execution is the improvement of
memory system performance by introducing redundancy in execution by
replicating processors and part of the data storage.
Since all physical memory is local to at least one processor, a request for a
remote operand is never sent.
Memory access latency and bus traffic are reduced.
All communication is one-way.
Writes never appear on the global bus.
Execution Mode
The processors execute the same program in slightly different time steps.
The lead processor runs slightly ahead of the others, especially when it is
broadcasting while the others wait for the broadcast value.
When the program execution accesses an operand that is not owned by
the lead processor, a lead change occurs.
All processors stall until the new lead processor catches up and broadcasts
its operands.
The capability that each processor may run ahead on computation that
involves operands owned by the processor is called datathreading.
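Since every lead change stalls all processors, the number of lead changes depends on how accesses to communicated operands are ordered. A minimal sketch, assuming the access stream is reduced to the sequence of owning processors (the function and its input format are hypothetical):

```python
def count_lead_changes(access_owners, initial_lead=0):
    """Count lead changes for a sequence of communicated accesses.

    access_owners[i] is the processor that owns the operand of access i.
    """
    lead, changes = initial_lead, 0
    for owner in access_owners:
        if owner != lead:
            lead = owner  # the owner of the operand becomes the new lead
            changes += 1
    return changes


print(count_lead_changes([0, 0, 0, 1, 1, 0]))  # 2
print(count_lead_changes([0, 1, 0, 1, 0, 1]))  # 5
```

Long runs of accesses to operands of the same owner allow that processor to run ahead; alternating owners force a lead change on nearly every access.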
Pros and Cons
The DataScalar model is primarily a memory system optimization
intended for codes that are performance limited by the memory system and
difficult to parallelize.
Simulation results: the DataScalar model of execution works best with
codes for which traditional parallelization techniques fail.
Six unmodified SPEC95 binaries ran from 7% slower to 50% faster on two
nodes, and from 9% to 100% faster on four nodes, than on a system with
a comparable, more traditional memory system.
Current technological parameters do not make DataScalar systems a
cost-effective alternative to today's microprocessors.
For a DataScalar system to be more cost-effective than the alternatives, the
following three conditions must hold:
Processing power must be cheap, the dominant cost of each node should be
memory.
Remote memory accesses should be slower than local memory accesses.
Broadcasts should not be prohibitively expensive.
Three possible candidates:
IRAM: because remote memory accesses to other IRAM chips will be more
expensive than on-chip memory accesses.
extending concept of CMP: CMPs access operands from a processor-local
memory faster than requesting an operand from a remote processor memory
across the chip due to wiring delays.
NOWs (networks of workstations): alternative to paging, provided that
broadcasts are efficient.