Unit 5 Advanced Processors
2 MGR,ECE,RVCE
ARM based products
iPOD
Features that drove the ARM processor design
Portable embedded systems require some form of battery power, so low power consumption matters.
High code density: embedded devices have limited memory.
Embedded systems are price sensitive.
The area of the die taken by the embedded processor should be small.
Time to market (development time and testing time).
Why is 32-bit outgrowing the MCU
market?
Customers ask for more
Questionnaire:
Why is ARM outgrowing the 32-bit
market?
Balanced performance / code
density with ARM & Thumb
instruction sets
Cost-effective performance in
embedded systems:
Wide range of peripherals
Low-cost, high-speed memory
implementation
Low power consumption
Supported by a huge ecosystem
[Figure: ARM MCUs (Mpcs)]
Design goals
Register file
-Large general-purpose register set
-Orthogonal instruction set
-No register-specific instructions
-Any register can be used to store either data or an address.
Load-store architecture
-Separate load and store instructions transfer data between the register bank and external memory.
ARM features
The ARM architecture includes the following RISC features:
Load/store architecture.
No support for misaligned memory accesses (not all
cores)
Uniform 16 × 32-bit register file.
Fixed instruction width of 32 bits to ease decoding and
pipelining, at the cost of decreased code density.
-Later, "Thumb mode" increased code density.
Mostly single-cycle execution
Additional design features
Conditional execution of most instructions,
reducing branch overhead and compensating
for the lack of a branch predictor.
It avoids branch instructions when generating
code for small if statements
In the C programming language, the loop is:
    while (i != j)
    {
        if (i > j)
            i -= j;
        else
            j -= i;
    }
In ARM assembly, the loop is:
    loop:   CMP    Ri, Rj       ; set condition "NE" if (i != j),
                                ;               "GT" if (i > j),
                                ;            or "LT" if (i < j)
            SUBGT  Ri, Ri, Rj   ; if "GT" (greater than), i = i-j;
            SUBLT  Rj, Rj, Ri   ; if "LT" (less than), j = j-i;
            BNE    loop         ; if "NE" (not equal), then loop
which avoids the branches around the then and else clauses.
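The loop above is the subtraction form of Euclid's greatest-common-divisor algorithm. A minimal C sketch that wraps it in a function is given here so the loop can be tried out. The function name gcd_subtract is hypothetical (it does not appear on the slides), and the sketch assumes both inputs are positive.

```c
/* Subtraction-based GCD: the same loop as on the slide, wrapped in a
   function. The name gcd_subtract is illustrative only.
   Assumes i > 0 and j > 0; a zero input would make the loop spin forever. */
int gcd_subtract(int i, int j)
{
    while (i != j) {        /* CMP Ri, Rj ... BNE loop */
        if (i > j)
            i -= j;         /* corresponds to SUBGT Ri, Ri, Rj */
        else
            j -= i;         /* corresponds to SUBLT Rj, Rj, Ri */
    }
    return i;
}
```

For example, gcd_subtract(48, 18) walks through (30, 18), (12, 18), (12, 6), (6, 6) and returns 6.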
Data Sizes and Instruction Sets
The ARM is a 32-bit architecture.
Most ARM cores implement two instruction sets:
◦ 32-bit ARM Instruction Set
◦ 16-bit Thumb Instruction Set
ARM core: Data flow model
Core: Data flow
Data enters the core through the data bus. The
data may be an instruction to execute or a data
item(operand).
Von Neumann (ARM7) or Harvard (ARM9) bus structure.
Load/store architecture
The instruction decoder translates instructions
before they are executed.
37 32-bit integer registers in total (16 visible at any one time)
The sign-extend hardware converts signed 8-bit or 16-bit numbers to 32-bit values.
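What the sign-extend step computes can be sketched in C. This is only a software model of the result, not a description of the hardware; the function names sign_extend8 and sign_extend16 are hypothetical.

```c
#include <stdint.h>

/* Sign-extend the low 8 bits of a word to a signed 32-bit value.
   The low 7 bits carry the magnitude; bit 7 has weight -128. */
int32_t sign_extend8(uint32_t value)
{
    return (int32_t)(value & 0x7Fu) - (int32_t)(value & 0x80u);
}

/* Sign-extend the low 16 bits of a word to a signed 32-bit value.
   The low 15 bits carry the magnitude; bit 15 has weight -32768. */
int32_t sign_extend16(uint32_t value)
{
    return (int32_t)(value & 0x7FFFu) - (int32_t)(value & 0x8000u);
}
```

With this model, the byte 0xFF becomes -1 and the byte 0x7F stays 127, which is the behaviour a signed byte load is expected to give.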
Core: Data flow
ARM instructions typically have two source registers, Rn and Rm, and a single result or destination register, Rd.
The ALU or MAC (multiply-accumulate unit) takes the register values Rn and Rm and computes a result.
Load and store instructions use the ALU to generate an address to be held in the address register and broadcast on the address bus.
One important feature of the ARM is that register Rm can optionally be preprocessed in the barrel shifter before it enters the ALU.
Together, the barrel shifter and ALU can calculate a wide range of expressions and addresses.
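As an illustration of Rm passing through the barrel shifter before the ALU, the ARM instruction ADD Rd, Rn, Rm, LSL #2 can be modelled in C as below. The function name add_shifted is hypothetical, and the sketch assumes a shift amount between 0 and 31.

```c
#include <stdint.h>

/* Models ADD Rd, Rn, Rm, LSL #shift: Rm goes through the barrel
   shifter first, then the ALU adds the shifted value to Rn.
   Assumes 0 <= shift <= 31. */
uint32_t add_shifted(uint32_t rn, uint32_t rm, unsigned shift)
{
    uint32_t shifted = rm << shift;   /* barrel shifter stage */
    return rn + shifted;              /* ALU stage */
}
```

A typical use is address arithmetic: add_shifted(base, index, 2) gives the address of element index in an array of 4-byte words starting at base.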
Barrel Shifter
In a conventional processor, shifting is normally implemented by a serial shift register.
This operation takes one clock cycle for every single-bit shift.
A barrel shifter implements a shift by several bits in one cycle.
Barrel shifter:
Implementation
The barrel shifter connects the input lines
representing a word to group of output lines
with the required shift determined by
control inputs.
If the input word has n bits, and shifts from 0 to n-1 bit positions to the right or left are to be implemented, the control input requires log2(n) lines to determine the number of bits to be shifted.
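One common way to see why log2(n) control lines are enough is to build the shift from log2(n) stages, where control bit k decides whether stage k shifts by 2^k positions. The C sketch below models a 32-bit logical right shift this way. It is an assumption about the organization (the slides do not say which circuit is used), the function name barrel_shift_right is hypothetical, and the loop models the stages of a combinational circuit, not clock cycles.

```c
#include <stdint.h>

/* Logical right shift of a 32-bit word built from log2(32) = 5 stages.
   Control bit k selects whether stage k shifts by 2^k positions.
   Only the low 5 bits of 'control' are used (shift amounts 0..31). */
uint32_t barrel_shift_right(uint32_t word, unsigned control)
{
    for (unsigned k = 0; k < 5; k++) {
        if (control & (1u << k))
            word >>= (1u << k);   /* stage k: shift by 1, 2, 4, 8 or 16 */
    }
    return word;
}
```

A shift by 13, for instance, enables the stages for 1, 4 and 8 and bypasses the stages for 2 and 16.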
[Figure: 4-bit shift-right barrel shifter]
[Figure: barrel shifter operation]
Multiply & Accumulate (MAC) unit
Most DSP applications require the accumulation of a series of successive multiplications.
MAC unit components:
-Multiplier
-Adder/Subtractor
-Accumulator
MAC
If N products are to be accumulated, N-1 multiplies can overlap with accumulations.
During the very first multiply, the accumulator is idle since there is nothing to accumulate.
Similarly, during the very last accumulation, the multiplier is idle since all N products have been computed.
A sum of products of N multiplications therefore takes N+1 cycles.
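The sum-of-products operation and the cycle count above can be sketched in C. The function names mac_sum and mac_cycles_pipelined are hypothetical, and the cycle model assumes that each multiply and each accumulate takes exactly one cycle, which the slides imply but do not state.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of products: acc = a[0]*b[0] + ... + a[n-1]*b[n-1]. */
int64_t mac_sum(const int32_t *a, const int32_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t k = 0; k < n; k++)
        acc += (int64_t)a[k] * b[k];   /* multiply, then accumulate */
    return acc;
}

/* Cycle count when the accumulate of product k overlaps the multiply
   of product k+1: one cycle for the first multiply alone, N-1 cycles of
   overlapped work, and one cycle for the last accumulate alone.
   Assumes one cycle per multiply and per accumulate. */
size_t mac_cycles_pipelined(size_t n)
{
    return n == 0 ? 0 : n + 1;
}
```

Under the same one-cycle assumption, doing the N multiplies and N accumulations strictly one after another would take 2N cycles, so the overlap saves N-1 cycles.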
ARM register set (User mode)
Register structure depends on mode
of operation
Sixteen 32-bit integer registers, R0 - R15, are available in ARM mode.
R0 - R12 are general-purpose registers
R13 is the Stack Pointer (SP)
R14 is the subroutine Link Register (LR)
R15 is the Program Counter (PC)
The state register is the CPSR (Current Program Status Register), listed on the slide as R16
R0 to R12 are orthogonal.
Pipeline
ARM CPUs are based on pipelined execution of instructions.
The pipeline design differs for each ARM family.
Pipeline
As the pipeline length increases, the amount of work done at each stage is reduced, which allows the processor to attain a higher clock frequency.
The ARM9 processes on average 1.1 Dhrystone MIPS per MHz (instruction throughput increased by around 13% compared with the ARM7).
The ARM10 processes on average 1.3 Dhrystone MIPS per MHz (instruction throughput increased by around 34% compared with the ARM7).
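The percentages above follow from dividing each Dhrystone MIPS per MHz figure by the ARM7 baseline. The slides do not give that baseline; the sketch below takes it as a parameter, and a value of about 0.97 Dhrystone MIPS per MHz for the ARM7 is an assumption that reproduces the quoted 13% and 34%. The function name throughput_gain_percent is hypothetical.

```c
/* Percentage gain of one core over a baseline core, with both figures
   given in Dhrystone MIPS per MHz. */
double throughput_gain_percent(double dmips_per_mhz, double baseline)
{
    return (dmips_per_mhz / baseline - 1.0) * 100.0;
}
```

With an assumed ARM7 baseline of 0.97, the call throughput_gain_percent(1.1, 0.97) gives roughly 13 and throughput_gain_percent(1.3, 0.97) gives roughly 34.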
Pipeline Execution Characteristics
The execution of a branch instruction, or branching by direct modification of the PC, causes the ARM core to flush the pipeline.
Some cores (ARM10) use branch prediction, which reduces the effect of a pipeline flush by predicting possible branches.
An instruction in the execute stage will complete even though an interrupt has been raised.
Multi-core
architectures
Single-core computer
Single-core CPU chip
[Figure: CPU chip containing the single core]
Multi-core architectures
This lecture is about a new trend in computer architecture:
Replicate multiple processor cores on a single die.
[Figure: Core 1, Core 2, Core 3 and Core 4 on a single die]
Multi-core CPU chip
The cores fit on a single processor socket
Also called CMP (Chip Multi-Processor)
[Figure: cores 1 to 4 on one chip]
The cores run in parallel
[Figure: threads 1 to 4, one running on each of cores 1 to 4]
Within each core, threads are time-sliced (just like on a uniprocessor)
[Figure: several threads on each of cores 1 to 4]
Interaction with the Operating
System
Why multi-core ?
Difficult to make single-core clock
frequencies even higher
Deeply pipelined circuits:
heat problems (Cooling Systems)
difficult design and verification
large design teams necessary
Many new applications are
multithreaded
General trend in computer architecture (shift towards more parallelism)
Why multi-core?
[Figure: multiple cores]
Thread-level parallelism (TLP)
This is parallelism on a coarser scale
E.g. Server can serve each client in a separate
thread (Web server, database server)
A computer game can do AI, graphics, and
physics in three separate threads
Single-core superscalar processors cannot fully
exploit TLP
Multi-core architectures are the next step in
processor evolution: explicitly exploiting TLP
Question ?
What is the difference between
Core2 duo & Dual core?
The cache coherence problem
Since we have private caches:
How to keep the data consistent across
caches?
Each core should perceive the memory as a
monolithic array, shared by all the cores
The cache coherence problem
Suppose variable x initially contains 15213.
[Figure: multi-core chip with Core 1 to Core 4; main memory holds x=15213]
Core 1 reads x.
[Figure: multi-core chip; main memory holds x=15213]
Core 2 reads x.
[Figure: multi-core chip; main memory holds x=15213]
Core 1 writes to x, setting it to 21660.
[Figure: multi-core chip; main memory holds x=21660, assuming write-through caches]
Core 2 attempts to read x and gets a stale copy.
[Figure: multi-core chip with Core 1 to Core 4; main memory holds x=21660]
Solutions for cache
coherence
This is a general problem with multiprocessors,
not limited just to multi-core
There exist many solution algorithms, coherence
protocols, etc.
A simple solution:
invalidation-based protocol with snooping
Invalidation protocol with
snooping
Invalidation:
If a core writes to a data item, all other
copies of this data item in other caches are
invalidated
Snooping:
All cores continuously “snoop” (monitor)
the bus connecting the cores.
The cache coherence problem
Revisited: Cores 1 and 2 have both read x.
[Figure: multi-core chip with Core 1 to Core 4; main memory holds x=15213]
Core 1 writes to x, setting it to 21660, and sends an invalidation request over the inter-core bus. The copy in the other cache is marked INVALIDATED.
[Figure: multi-core chip; main memory holds x=21660, assuming write-through caches]
After invalidation:
[Figure: multi-core chip; main memory holds x=21660]
Core 2 reads x. The cache misses and loads the new copy.
[Figure: multi-core chip; main memory holds x=21660]
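The invalidation sequence above can be replayed with a toy C model. It is a sketch under simplifying assumptions: a single shared variable x, one cache line per core, write-through caches, and invalidation applied instantly to every other core (standing in for the snooped bus request). All names (struct chip, chip_init, core_read, core_write) are hypothetical, and cores are indexed from 0, so Core 1 on the slides is index 0.

```c
#include <stdbool.h>

#define NUM_CORES 4

/* Toy model: one shared variable x and one cache line per core. */
struct chip {
    int memory_x;               /* copy of x in main memory */
    int cached_x[NUM_CORES];    /* each core's private copy of x */
    bool valid[NUM_CORES];      /* true if the private copy is usable */
};

void chip_init(struct chip *s, int initial)
{
    s->memory_x = initial;
    for (int c = 0; c < NUM_CORES; c++) {
        s->cached_x[c] = 0;
        s->valid[c] = false;
    }
}

/* Read: hit if the line is valid, otherwise miss and load from memory. */
int core_read(struct chip *s, int core)
{
    if (!s->valid[core]) {
        s->cached_x[core] = s->memory_x;
        s->valid[core] = true;
    }
    return s->cached_x[core];
}

/* Write: update the writer's cache, write through to memory, and
   invalidate every other core's copy (the effect of the snooped
   invalidation request). */
void core_write(struct chip *s, int core, int value)
{
    s->cached_x[core] = value;
    s->valid[core] = true;
    s->memory_x = value;
    for (int c = 0; c < NUM_CORES; c++)
        if (c != core)
            s->valid[c] = false;
}
```

Replaying the slides: after chip_init with 15213, both core_read(&s, 0) and core_read(&s, 1) return 15213; after core_write(&s, 0, 21660), the next core_read(&s, 1) misses and returns 21660 instead of the stale value.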
Alternative to invalidate protocol: update protocol
Core 1 writes x=21660 and broadcasts the updated value over the inter-core bus.
[Figure: multi-core chip; main memory holds x=21660, assuming write-through caches]
Invalidation vs update
• Multiple writes to the same location
- invalidation: a request is sent only on the first write
- update: every write must be broadcast (including the new variable value)
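The comparison above can be put as a simple message count. The sketch below counts coherence messages only, for one core writing repeatedly to the same location while no other core touches it in between; write-through traffic to main memory is ignored. Both function names are hypothetical.

```c
/* Coherence messages sent for 'writes' consecutive writes by one core
   to the same location, with no access by other cores in between. */
unsigned invalidate_messages(unsigned writes)
{
    return writes > 0 ? 1u : 0u;   /* only the first write invalidates */
}

unsigned update_messages(unsigned writes)
{
    return writes;                 /* every write broadcasts the new value */
}
```

For ten such writes this gives one invalidation request against ten update broadcasts, which is why invalidation tends to be cheaper for data that one core writes repeatedly.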
Pentium Processor
The Pentium processor uses a superscalar architecture and hence can issue multiple instructions per cycle.
All application software written for the Intel386 and Intel486
family microprocessors will run on the Pentium processors
without modification.
The on-chip memory management unit (MMU) is completely
compatible with the Intel386 family and Intel486 family of CPUs.
The Pentium processors implement several enhancements to
increase performance. The two instruction pipelines and floating-
point unit on Pentium processors are capable of independent
operation.
Branch prediction is implemented in the Pentium processors. To support this, Pentium processors implement two prefetch buffers, one to prefetch code in a linear fashion, and one that prefetches code according to the Branch Target Buffer (BTB), so the needed code is almost always prefetched before it is needed for execution.
Pentium..
The floating-point unit has been completely
redesigned over the Intel486 CPU. Faster
algorithms provide up to 10X speed-up for
common operations including add, multiply, and
load.
Pentium processors include separate code and
data caches integrated on-chip to meet
performance goals. Each cache is 8 Kbytes in
size.
The Pentium processors have increased the data
bus to 64 bits.
[Figure: Pentium processor block diagram]
The block diagram shows the two instruction pipelines, the "u" pipe and the "v" pipe.
The u-pipe can execute all integer and floating
point instructions. The v-pipe can execute
simple integer instructions.
The code cache, branch target buffer and
prefetch buffers are responsible for getting
raw instructions into the execution units of the
Pentium processor.
Branch Prediction
A branch can be predicted either statically, based on the branch code type, or dynamically, based on branch history during program execution.
Static prediction requires collecting the frequency and probabilities of branches taken and of branch types across a large number of program traces.
Lee & Smith have shown the use of a Branch Target Buffer (BTB) to implement branch prediction. The BTB is used to hold recent branch information, including the address of the branch target used.
The state diagram shown on the next slide is used for backtracking the last two branches in a given program.
Branch Prediction
The BTB entry contains the backtracking
information which will guide prediction.
Prediction information is updated upon
completion of the current branch.
The BTB can be extended to store not only
the branch target address but also the target
instruction itself and a few of its successor
instructions, in order to allow zero delay in
converting conditional branches to
unconditional branches.
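History-based prediction of this kind is often illustrated with a two-bit saturating counter per branch. The C sketch below shows that scheme; it is a commonly used stand-in and not necessarily the state diagram from the slides, which did not survive in this text. The type and function names are hypothetical.

```c
#include <stdbool.h>

/* Two-bit saturating counter for one branch:
   states 0 and 1 predict not taken, states 2 and 3 predict taken. */
typedef struct {
    unsigned state;   /* always in the range 0..3 */
} predictor2;

bool predict_taken(const predictor2 *p)
{
    return p->state >= 2;
}

/* Update the counter once the branch has completed and its real
   outcome is known. */
void predictor_update(predictor2 *p, bool taken)
{
    if (taken) {
        if (p->state < 3)
            p->state++;
    } else {
        if (p->state > 0)
            p->state--;
    }
}
```

The design point is hysteresis: starting from state 3, a single not-taken outcome (such as a loop exit) moves the counter to state 2, so the branch is still predicted taken the next time the loop is entered.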
Branch Target Buffer Organization
[Figure: branch target buffer organization]