Unit 5: Advanced Processors & Controllers

Syllabus
Parallel Architectures, Pentium, Multicore
Architectures, Cache Coherence issues, ARM

2 MGR,ECE,RVCE
ARM based products
Examples: washing machine, coffee machine, iPod

Features driving the ARM processor design
Low power: portable embedded systems require some form of battery power
High code density: embedded devices have limited memory
Low cost: embedded systems are price sensitive
Small die area taken up by the embedded processor
Short time to market (development time and testing time)

Why is 32-bit outgrowing the MCU market?
• Customers ask for more performance
• Advanced communication
• Advanced human interfaces
• 8-bit loses its price advantage
• Old process technology vs advanced 32-bit processes
• Difference in the silicon area of 8-bit vs 32-bit cores shrinks to nothing
32-bit offers more value for money!

Questionnaire: "My current embedded project's main processor is …"
Survey years: 2009 (N = 1,533), 2008 (N = 1,067), 2007 (N = 938), 2006 (N = 917)
8-bit processor:    16%, 13%, 16%, 18%
16-bit processor:   17%, 19%, 18%, 19%
32-bit processor:   60%, 58%, 61%, 54%
64-bit processor:   6%, 7%, 4%, 6%
Don't know:         1%, 3%, 1%, 2%

Why is ARM outgrowing the 32-bit market?
• Balanced performance / code density with ARM & Thumb instruction sets
• Cost-effective performance in embedded systems:
• Wide range of peripherals
• Low-cost, high-speed memory implementation
• Low power consumption
• Supported by huge ecosystem
ARM is the leading 32-bit MCU architecture
[Chart: ARM MCUs (Mpcs)]

What is ARM?
• ARM is a 32-bit reduced instruction set computer (RISC) instruction set architecture (ISA) developed by ARM.
• The name originally stood for Acorn RISC Machine and later for Advanced RISC Machine.
• The ARM core is a 32-bit general-purpose microprocessor.
• ARM's business has always been to sell IP cores, which licensees use to create microcontrollers and CPUs based on these cores.

ARM licensees
IP cores are delivered in two forms:
-Gate netlist (hard)
-Synthesizable RTL code (soft)
With the synthesizable RTL, the customer can perform architectural-level optimizations and extensions.
This allows the designer to achieve exotic design goals not otherwise possible with an unmodified netlist.
While ARM does not grant the licensee the right to resell the ARM architecture itself, licensees may freely sell manufactured products (chip devices, evaluation boards, complete systems, etc.).

RISC design

The ARM core uses a RISC architecture.
The RISC design goals
Instructions
-Reduced number of instructions
-Simple instructions, each executing in a single cycle
-Fixed-length instructions
-Complex operations are left to software, e.g. a divide operation built from repeated subtraction
-Compiler design is more complex
-Orthogonal instruction set

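The divide-by-repeated-subtraction idea mentioned above can be sketched in C. This is only an illustration of the principle, not how any particular ARM run-time library implements division; the function name `div_by_subtraction` is hypothetical, and the caller is assumed to pass a non-zero divisor.

```c
#include <stdint.h>

/* Hypothetical helper: unsigned division by repeated subtraction.
   Returns the quotient and stores the remainder in *rem.
   Assumption: the caller guarantees divisor != 0. */
uint32_t div_by_subtraction(uint32_t dividend, uint32_t divisor, uint32_t *rem)
{
    uint32_t quotient = 0;
    while (dividend >= divisor) {   /* one subtraction per loop iteration */
        dividend -= divisor;
        quotient++;
    }
    *rem = dividend;                /* what is left is the remainder */
    return quotient;
}
```

For example, dividing 17 by 5 this way performs three subtractions and leaves a remainder of 2. The running time grows with the quotient, which is why real software division routines usually use shift-and-subtract instead.
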
Design goals
Pipelines
-With fixed-length instructions, pipeline design is simpler
-Ideally the pipeline advances one step on each cycle for maximum throughput

Design goals

Register file
-Large general-purpose register set
-Orthogonal instruction set
-No register-specific instructions
-Any register can hold either data or an address
Load-Store Architecture
-Separate load and store instructions transfer data between the register bank and external memory.

ARM features
The ARM architecture includes the following RISC features:
Load/store architecture.
No support for misaligned memory accesses (this restriction does not apply to all cores).
Uniform 16 × 32-bit register file.
Fixed instruction width of 32 bits to ease decoding and pipelining, at the cost of decreased code density.
-Later, "Thumb mode" increased code density.
Mostly single-cycle execution.

Additional design features
Conditional execution of most instructions, reducing branch overhead and compensating for the lack of a branch predictor.
It avoids branch instructions when generating code for small if statements.
In the C programming language, the loop is:
while (i != j)
{
    if (i > j)
        i -= j;
    else
        j -= i;
}
In ARM assembly, the loop is:
loop:   CMP   Ri, Rj        ; set condition "NE" if (i != j),
                            ;               "GT" if (i > j),
                            ;            or "LT" if (i < j)
        SUBGT Ri, Ri, Rj    ; if "GT" (greater than), i = i-j;
        SUBLT Rj, Rj, Ri    ; if "LT" (less than), j = j-i;
        BNE   loop          ; if "NE" (not equal), then loop
which avoids the branches around the then and else clauses.

Additional design features

Another feature of the instruction set is the ability to fold shifts and rotates into the data processing instructions. For example, the C statement
a += (j << 2);
could be rendered as a single-word, single-cycle instruction on the ARM:
ADD Ra, Ra, Rj, LSL #2
• This results in the typical ARM program being denser than expected, with fewer memory accesses; thus the pipeline is used more efficiently.

Additional design features

Enhanced DSP instructions were added to the standard ARM instruction set to support fast multiplier operations, e.g. MLA (multiply and accumulate instruction).
Arithmetic instructions alter condition codes only when desired.
32-bit barrel shifter which can be used without performance penalty with most arithmetic instructions and address calculations.
A link register for fast function calls.
2-priority-level interrupt subsystem with switched register banks.

Additional design features
The Advanced Microcontroller Bus Architecture (AMBA) is widely used as the on-chip bus for ARM processors.

Additional design features

The architecture provides a non-intrusive way of extending the instruction set using "coprocessors".
The coprocessor space is divided logically into 16 coprocessors.
All ARM peripherals are memory mapped.
-ARM memory space
-Coprocessor memory space

Data Sizes and Instruction Sets
The ARM is a 32-bit architecture.
Most ARM cores implement two instruction sets
◦ 32-bit ARM Instruction Set
◦ 16-bit Thumb Instruction Set

ARM core: Data flow model
[Figure: ARM core data flow model]

Core: Data flow
Data enters the core through the data bus. The data may be an instruction to execute or a data item (operand).
Von Neumann-type (ARM7) or Harvard (ARM9) bus structure.
Load/store architecture.
The instruction decoder translates instructions before they are executed.
37 32-bit integer registers in total (16 available at any one time).
The sign-extend hardware converts signed 8-bit or 16-bit numbers to 32-bit values.

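What the sign-extend hardware does can be illustrated with a small C sketch. The function names `sign_extend8` and `sign_extend16` are hypothetical, and the arithmetic trick used here is just one portable way to express the operation in software; the hardware simply copies the top bit of the narrow value into all higher bits.

```c
#include <stdint.h>

/* Sign-extend an 8-bit value to 32 bits: bit 7 decides the sign. */
int32_t sign_extend8(uint8_t value)
{
    int32_t v = value;              /* zero-extended: 0..255 */
    return (v ^ 0x80) - 0x80;       /* maps 0x80..0xFF to -128..-1 */
}

/* Sign-extend a 16-bit value to 32 bits: bit 15 decides the sign. */
int32_t sign_extend16(uint16_t value)
{
    int32_t v = value;              /* zero-extended: 0..65535 */
    return (v ^ 0x8000) - 0x8000;   /* maps 0x8000..0xFFFF to -32768..-1 */
}
```

For instance, the byte 0xFF becomes the 32-bit value -1, while 0x7F stays 127.
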
Core: Data flow
ARM instructions typically have two source registers, Rn and Rm, and a single result or destination register, Rd.
• The ALU or MAC (multiply-accumulate unit) takes the register values Rn and Rm and computes a result.
Load and store instructions use the ALU to generate an address to be held in the address register and broadcast on the address bus.
One important feature of the ARM is that register Rm can alternatively be preprocessed in the barrel shifter before it enters the ALU.
Together the barrel shifter and ALU can calculate a wide range of expressions and addresses.

Barrel Shifter
• In a conventional processor, shifting is normally implemented by a serial shift register.
• That operation takes one clock cycle for every single-bit shift.
• A barrel shifter implements a shift by several bits in one cycle.

Barrel shifter: Implementation
The barrel shifter connects the input lines representing a word to a group of output lines, with the required shift determined by control inputs.
If the input word has n bits, and shifts from 0 to n-1 bit positions to the right or left are to be implemented, the control input requires log2(n) lines to determine the number of bits to be shifted.

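The role of the log2(n) control lines can be sketched in C for n = 32. This is a software model under an assumption: it uses a cascade of stages, where stage k shifts by 2^k positions when its control line is set, which is one common way to build a barrel shifter and not necessarily the structure these slides have in mind. In hardware all stages are combinational, so the whole shift still completes in one cycle. The function name `barrel_shift_right` is hypothetical.

```c
#include <stdint.h>

/* Model of a 32-bit logical-shift-right barrel shifter.
   The 5 low bits of `control` are the log2(32) = 5 control lines;
   stage k either passes the word through or shifts it by 2^k. */
uint32_t barrel_shift_right(uint32_t word, unsigned control)
{
    for (unsigned stage = 0; stage < 5; stage++) {
        if (control & (1u << stage)) {   /* control line for this stage */
            word >>= (1u << stage);      /* shift by 1, 2, 4, 8 or 16 */
        }
    }
    return word;
}
```

With control = 31 all five stages are active and the total shift is 1 + 2 + 4 + 8 + 16 = 31 positions.
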
[Figure: 4-bit shift-right barrel shifter]
[Figure: barrel shifter operation]

Multiply & Accumulate (MAC) unit
Most DSP applications require the accumulation of a series of successive multiplications.

MAC units:
-Multiplier
-Adder/Subtractor
-Accumulator

MAC
If N products are to be accumulated, N-1 multiplies can overlap with accumulations.
During the very first multiply, the accumulator is idle since there is nothing to accumulate.
Similarly, during the very last accumulation the multiplier is idle since all N products have been computed.
The sum of N products therefore takes N+1 cycles.

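The N+1 cycle count can be checked with a small C model of the overlap. This is a sketch under the assumption that the multiplier and the accumulator each take exactly one cycle; the function name `mac_sum_of_products` and its cycle counter are hypothetical and only serve the illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of a two-stage MAC unit: in each cycle the multiplier forms
   product k while the accumulator adds product k-1.
   Returns the sum of products and stores the cycle count in *cycles. */
int64_t mac_sum_of_products(const int32_t *a, const int32_t *b, size_t n,
                            unsigned *cycles)
{
    int64_t acc = 0;
    int64_t pending = 0;        /* product waiting to be accumulated */
    int have_pending = 0;
    *cycles = 0;

    for (size_t cycle = 0; n > 0 && cycle <= n; cycle++) {
        if (have_pending) {     /* accumulator stage (idle in cycle 0) */
            acc += pending;
            have_pending = 0;
        }
        if (cycle < n) {        /* multiplier stage (idle in the last cycle) */
            pending = (int64_t)a[cycle] * b[cycle];
            have_pending = 1;
        }
        (*cycles)++;
    }
    return acc;
}
```

For three products (N = 3) the model runs for four cycles: the accumulator is idle in the first and the multiplier is idle in the last.
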
ARM register set (User mode)
• The register structure depends on the mode of operation.
• Sixteen 32-bit integer registers, R0 - R15, are available in ARM mode.
• R0 - R12 are general-purpose registers.
• R13 is the Stack Pointer (SP).
• R14 is the subroutine Link Register (LR).
• R15 is the Program Counter (PC).
• A further state register, the CPSR (Current Program Status Register), holds the processor status.
• R0 to R12 are orthogonal.

Pipeline
• ARM CPUs are based on pipelined execution of instructions.
• The pipeline design differs for each ARM family.

Pipeline
As the pipeline length increases, the amount of work done at each stage is reduced, which allows the processor to attain a higher clock frequency.
ARM9 processes on average 1.1 Dhrystone MIPS per MHz (instruction throughput increased by around 13% compared with ARM7).
ARM10 processes on average 1.3 Dhrystone MIPS per MHz (instruction throughput increased by around 34% compared with ARM7).

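As a quick consistency check on the two figures above, the implied ARM7 throughput can be recovered from either line. This is a sketch that assumes both percentages are measured against the same ARM7 baseline, which the slides do not state explicitly.

```latex
\text{ARM7} \approx \frac{1.1}{1.13} \approx 0.97\ \text{DMIPS/MHz},
\qquad
\text{ARM7} \approx \frac{1.3}{1.34} \approx 0.97\ \text{DMIPS/MHz}
```

Both lines point to roughly the same baseline, so the two percentages are mutually consistent.
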
Pipeline Execution Characteristics
The execution of a branch instruction, or branching by direct modification of the PC, causes the ARM core to flush the pipeline.
Some architectures (ARM10) use branch prediction, which reduces the effect of a pipeline flush by predicting possible branches.
An instruction in the execute stage will complete even though an interrupt has been raised.

Multi-core architectures

Single-core computer
[Figure: single-core computer; the CPU chip contains the single core]

Multi-core architectures
This lecture is about a new trend in computer architecture:
Replicate multiple processor cores on a single die.
[Figure: multi-core CPU chip with Core 1, Core 2, Core 3 and Core 4]

Multi-core CPU chip
The cores fit on a single processor socket
Also called CMP (Chip Multi-Processor)
[Figure: chip with core 1, core 2, core 3 and core 4]

The cores run in parallel
[Figure: thread 1 runs on core 1, thread 2 on core 2, thread 3 on core 3, thread 4 on core 4]

Within each core, threads are time-sliced (just like on a uniprocessor)
[Figure: each of cores 1 to 4 runs several threads]

Interaction with the Operating System
OS perceives each core as a separate processor
OS scheduler maps threads/processes to different cores
Most major OSes support multi-core today: Windows, Linux, Mac OS X, …

Why multi-core?
Difficult to make single-core clock frequencies even higher
Deeply pipelined circuits:
-heat problems (cooling systems)
-difficult design and verification
-large design teams necessary
Many new applications are multithreaded
General trend in computer architecture (shift towards more parallelism)

Why Multicore?

                    Single Core   Dual Core   Quad Core
Core area           A             ~A/2        ~A/4
Core power          W             ~W/2        ~W/4
Chip power          W + O         W + O’      W + O’’
Core performance    P             0.9P        0.8P
Chip performance    P             1.8P        3.2P

Instruction-level parallelism
Parallelism at the machine-instruction level
The processor can re-order and pipeline instructions, split them into microinstructions, do aggressive branch prediction, etc.
Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years

Thread-level parallelism (TLP)
This is parallelism on a coarser scale
E.g. a server can serve each client in a separate thread (Web server, database server)
A computer game can do AI, graphics, and physics in three separate threads
Single-core superscalar processors cannot fully exploit TLP
Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP

Question?
What is the difference between Core 2 Duo and dual core?

The cache coherence problem
Since we have private caches:
How to keep the data consistent across
caches?
Each core should perceive the memory as a
monolithic array, shared by all the cores

The cache coherence problem
Suppose variable x initially contains 15213. The multi-core chip has four cores (Core 1 to Core 4), each with one or more levels of cache; main memory holds x=15213.
Core 1 reads x: Core 1's cache now holds x=15213.
Core 2 reads x: the caches of Core 1 and Core 2 both hold x=15213.
Core 1 writes to x, setting it to 21660: Core 1's cache holds x=21660, Core 2's cache still holds x=15213, and main memory holds x=21660 (assuming write-through caches).
Core 2 attempts to read x… gets a stale copy (x=15213).

Solutions for cache coherence
This is a general problem with multiprocessors, not limited just to multi-core
There exist many solution algorithms, coherence protocols, etc.

A simple solution:
invalidation-based protocol with snooping

Invalidation protocol with snooping
Invalidation:
If a core writes to a data item, all other copies of this data item in other caches are invalidated
Snooping:
All cores continuously "snoop" (monitor) the bus connecting the cores.

The cache coherence problem
Revisited: Cores 1 and 2 have both read x, so both caches hold x=15213 and main memory holds x=15213.
Core 1 writes to x, setting it to 21660, and sends an invalidation request over the inter-core bus. Core 2's copy (x=15213) is INVALIDATED; main memory holds x=21660, assuming write-through caches.
After invalidation: only Core 1's cache holds x=21660; Core 2's cache no longer holds x.
Core 2 reads x. Its cache misses and loads the new copy, x=21660.

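The walk-through above can be replayed with a small C model. This is a deliberately simplified sketch, not a real coherence controller: it assumes a single shared variable x, one private cache line per core, write-through caches, and a bus that every cache snoops. All names (`System`, `core_read`, `core_write`) are hypothetical, and cores are indexed from 0, so Core 1 on the slides is index 0.

```c
#include <stdbool.h>

#define NUM_CORES 4

/* Hypothetical model: one variable x, one private cache line per core. */
typedef struct {
    bool valid[NUM_CORES];   /* does core c hold a copy of x? */
    int  cached[NUM_CORES];  /* the cached value, meaningful if valid */
    int  memory;             /* x in main memory */
} System;

void system_init(System *s, int x)
{
    for (int c = 0; c < NUM_CORES; c++) {
        s->valid[c] = false;
        s->cached[c] = 0;
    }
    s->memory = x;
}

/* Read x on the given core: a miss loads the value from main memory. */
int core_read(System *s, int core)
{
    if (!s->valid[core]) {
        s->cached[core] = s->memory;
        s->valid[core] = true;
    }
    return s->cached[core];
}

/* Write x on the given core: write through to memory, then every other
   cache sees the write on the bus and invalidates its own copy. */
void core_write(System *s, int core, int value)
{
    s->cached[core] = value;
    s->valid[core] = true;
    s->memory = value;                 /* write-through assumption */
    for (int c = 0; c < NUM_CORES; c++) {
        if (c != core) {
            s->valid[c] = false;       /* snooped invalidation */
        }
    }
}
```

Replaying the slides: after `system_init(&s, 15213)`, reads on cores 0 and 1 both return 15213; `core_write(&s, 0, 21660)` invalidates core 1's copy, so the next `core_read(&s, 1)` misses and returns 21660 instead of the stale value.
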
Alternative to invalidate protocol: update protocol
Core 1 writes x=21660 and broadcasts the updated value over the inter-core bus.
Core 2's copy is UPDATED: the caches of Core 1 and Core 2 both hold x=21660.
Main memory holds x=21660, assuming write-through caches.

Invalidation vs update
• Multiple writes to the same location
– invalidation: only the first time
– update: must broadcast each write (which includes new variable value)

• Invalidation generally performs better: it generates less bus traffic

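The difference in bus traffic can be made concrete with a toy cost model in C. It is a sketch under stated assumptions: one core writes repeatedly to a location that one other core has cached, nobody re-reads the location in between, and only coherence messages are counted (not the write-through traffic to memory). The names `Protocol` and `bus_messages` are hypothetical.

```c
/* Hypothetical cost model: count coherence messages on the bus for
   `writes` consecutive writes by one core to a shared location. */
typedef enum { INVALIDATE, UPDATE } Protocol;

unsigned bus_messages(Protocol protocol, unsigned writes)
{
    unsigned messages = 0;
    int other_copy_valid = 1;          /* another cache holds the line */

    for (unsigned i = 0; i < writes; i++) {
        if (protocol == UPDATE) {
            messages++;                /* every write broadcasts the new value */
        } else if (other_copy_valid) {
            messages++;                /* first write sends one invalidation */
            other_copy_valid = 0;      /* later writes find no remote copy */
        }
    }
    return messages;
}
```

Under these assumptions, ten writes cost one message with invalidation and ten with update. If the other core re-reads the value between writes, the gap narrows, which is why the slide says "generally".
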
Programming for multi-core
Programmers must use threads or processes
Spread the workload across multiple cores
Write parallel algorithms
OS will map threads/processes to cores

Pentium Processor
• The Pentium processor uses a superscalar architecture and hence can issue multiple instructions per cycle.
• All application software written for the Intel386 and Intel486 family microprocessors will run on the Pentium processors without modification.
• The on-chip memory management unit (MMU) is completely compatible with the Intel386 family and Intel486 family of CPUs.
• The Pentium processors implement several enhancements to increase performance. The two instruction pipelines and floating-point unit on Pentium processors are capable of independent operation.
• Branch prediction is implemented in the Pentium processors. To support this, Pentium processors implement two prefetch buffers, one to prefetch code in a linear fashion, and one that prefetches code according to the Branch Target Buffer (BTB), so the needed code is almost always prefetched before it is needed for execution.

Pentium (continued)
The floating-point unit has been completely redesigned compared with the Intel486 CPU. Faster algorithms provide up to 10X speed-up for common operations including add, multiply, and load.
Pentium processors include separate code and data caches integrated on-chip to meet performance goals. Each cache is 8 Kbytes in size.
The Pentium processors have increased the data bus to 64 bits.

[Figure: Pentium block diagram]

The block diagram shows the two instruction pipelines, the "u" pipe and the "v" pipe.
The u-pipe can execute all integer and floating-point instructions. The v-pipe can execute simple integer instructions.
The code cache, branch target buffer and prefetch buffers are responsible for getting raw instructions into the execution units of the Pentium processor.

Branch Prediction
A branch can be predicted either statically, based on branch code types, or dynamically, based on branch history during program execution.
The static approach requires collecting the frequency and probabilities of branch taken and branch types across a large number of program traces.
Lee & Smith have shown the use of a Branch Target Buffer (BTB) to implement branch prediction. The BTB is used to hold recent branch information, including the address of the branch target used.
The state diagram shown in the next slide is used for backtracking the last two branches in a given program.

Branch Prediction
The BTB entry contains the backtracking information which will guide prediction.
Prediction information is updated upon completion of the current branch.
The BTB can be extended to store not only the branch target address but also the target instruction itself and a few of its successor instructions, in order to allow zero delay in converting conditional branches to unconditional branches.

Branch Target Buffer Organization

Branch instruction address   Branch prediction statistics   Branch target address
. . .                        . . .                          . . .
. . .                        . . .                          . . .

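The three-field BTB organization above can be sketched in C. Several parts of this sketch are assumptions rather than anything stated on the slides: the prediction statistics are modelled as a 2-bit saturating counter (a common scheme that stands in for the state diagram, which is not reproduced in this text), the buffer is direct-mapped, and its size of 64 entries is arbitrary. All identifiers are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 64   /* assumed size, chosen only for illustration */

/* One BTB entry, following the three fields named on the slide. */
typedef struct {
    bool     valid;
    uint32_t branch_addr;   /* branch instruction address */
    uint8_t  counter;       /* prediction statistics: 2-bit counter, 0..3 */
    uint32_t target_addr;   /* branch target address */
} BtbEntry;

static BtbEntry btb[BTB_ENTRIES];   /* zero-initialized: all entries invalid */

static BtbEntry *btb_slot(uint32_t branch_addr)
{
    return &btb[(branch_addr >> 2) % BTB_ENTRIES];   /* direct-mapped index */
}

/* Predict: returns true and sets *target if the branch is predicted taken. */
bool btb_predict(uint32_t branch_addr, uint32_t *target)
{
    BtbEntry *e = btb_slot(branch_addr);
    if (e->valid && e->branch_addr == branch_addr && e->counter >= 2) {
        *target = e->target_addr;
        return true;
    }
    return false;   /* unknown branch, or counter in a not-taken state */
}

/* Update the entry once the current branch has completed. */
void btb_update(uint32_t branch_addr, bool taken, uint32_t target)
{
    BtbEntry *e = btb_slot(branch_addr);
    if (!e->valid || e->branch_addr != branch_addr) {
        e->valid = true;                 /* allocate, replacing any old entry */
        e->branch_addr = branch_addr;
        e->counter = taken ? 2 : 1;      /* start in a weak state */
        e->target_addr = target;
        return;
    }
    if (taken) {
        if (e->counter < 3) e->counter++;
        e->target_addr = target;
    } else if (e->counter > 0) {
        e->counter--;
    }
}
```

In use, the fetch stage would call `btb_predict` with the address of each fetched instruction, and `btb_update` would be called when the branch resolves, matching the slide's note that prediction information is updated upon completion of the current branch.
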
