COMPUTER ORGANIZATION
AND ARCHITECTURE
UNIT-5
Contents
• Parallelism: Need, types, applications and challenges
• Architecture of Parallel Systems-Flynn’s classification
• ARM Processor: The thumb instruction set
• Processor and CPU cores, Instruction Encoding format
• Memory load and Store instruction
• Basics of I/O operations.
• Case study: ARM 5 and ARM 7 Architecture
Parallelism: Need, types, applications and challenges
Parallelism
• Executing two or more operations at the same time is known as
parallelism.
• Parallel processing is a method to improve computer system performance by executing two or more instructions simultaneously.
• A parallel computer is a set of processors that are able to work cooperatively to solve a computational problem.
• Two or more ALUs in the CPU can work concurrently to increase throughput.
• The system may have two or more processors operating concurrently.
Goals of parallelism
• To increase the computational speed, i.e., to reduce the amount of time you need to wait for a problem to be solved
• To increase throughput, i.e., the amount of processing that can be accomplished during a given interval of time
• To improve the performance of the computer for a given clock speed
• To solve bigger problems that might not fit in the limited memory of a single CPU
Applications of Parallelism
• Numeric weather prediction
• Socio economics
• Finite element analysis
• Artificial intelligence and automation
• Genetic engineering
• Weapon research and defence
• Medical Applications
• Remote sensing applications
Types of parallelism
1. Hardware Parallelism
2. Software Parallelism
• Hardware Parallelism:
The main objective of hardware parallelism is to increase the processing speed. Based on the hardware architecture, hardware parallelism can be divided into two types: processor parallelism and memory parallelism.
• Processor parallelism
Processor parallelism means that the computer architecture has multiple nodes, multiple CPUs or multiple sockets, multiple cores, and multiple threads.
• Memory parallelism means shared memory, distributed memory, hybrid distributed-shared memory, multilevel pipelines, etc. Sometimes it is also called a parallel random access machine (PRAM): “an abstract model for parallel computation which assumes that all the processors operate synchronously under a single clock and are able to randomly access a large shared memory. In particular, a processor can execute an arithmetic, logic, or memory access operation within a single clock cycle.” In other words, parallelism is achieved by overlapping or pipelining instructions.
Hardware Parallelism
• One way to characterize the parallelism in a processor is by the
number of instruction issues per machine cycle.
• If a processor issues k instructions per machine cycle, then it is called
a k-issue processor.
• In a modern processor, two or more instructions can be issued per
machine cycle.
• A conventional processor takes one or more machine cycles to issue a
single instruction. These types of processors are called one-issue
machines, with a single instruction pipeline in the processor.
• A multiprocessor system built with n k-issue processors should be able to handle a maximum of nk instruction threads simultaneously; for example, four 2-issue processors can sustain at most eight instruction streams.
Software Parallelism
• It is defined by the control and data dependence of programs.
• The degree of parallelism is revealed in the program flow graph.
• Software parallelism is a function of algorithm, programming style,
and compiler optimization.
• The program flow graph displays the patterns of simultaneously
executable operations.
• Parallelism in a program varies during the execution period.
• It limits the sustained performance of the processor.
Software Parallelism - types
Instruction level parallelism
• Instruction-level parallelism (ILP) is a measure of how many operations can be performed simultaneously in a computer.
Eg. Instruction level parallelism
Consider the following example:
1. x = a + b
2. y = c - d
3. z = x * y
Operation 3 depends on the results of 1 and 2, so z cannot be calculated until x and y have been calculated. But operations 1 and 2 do not depend on any other operation, so they can be computed simultaneously.
• If we assume that each operation can be completed in one unit of time, then these three operations can be completed in two units of time.
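A minimal C sketch of this example (the function name ilp_example is illustrative, not from the slides); the comments mark which statements a superscalar processor with two ALUs could issue in the same cycle:

/* Statements 1 and 2 have no dependence on each other, so they can be
   issued together; statement 3 must wait for both results. */
int ilp_example(int a, int b, int c, int d) {
    int x = a + b;   /* 1: independent of 2                   */
    int y = c - d;   /* 2: independent of 1                   */
    int z = x * y;   /* 3: depends on the results of 1 and 2  */
    return z;
}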
Data-level parallelism (DLP)
• Data parallelism is parallelization across multiple processors in parallel computing environments: the data is distributed across the processors, and each processor applies the same operation to its portion of the data.
DLP - example
• Let us assume we want to sum all the elements of a given array of size n, and that a single addition operation takes Ta time units.
• With one processor the sum takes roughly n·Ta time units; if the array is split across p processors, each computes a partial sum in about (n/p)·Ta time units, and the partial sums are then combined.
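A minimal data-parallel sketch of this array sum in C, assuming OpenMP is available (compile with -fopenmp); the name parallel_sum and the use of double elements are illustrative choices, not part of the slides:

#include <stddef.h>

/* Each thread sums a slice of the array; the reduction clause combines
   the per-thread partial sums into the final result. */
double parallel_sum(const double *a, size_t n) {
    double sum = 0.0;
    long i;
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < (long)n; i++)
        sum += a[i];
    return sum;
}

Without OpenMP the pragma is ignored and the loop simply runs serially.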
DLP in Adding elements of array
DLP in matrix multiplication
• The locality of data references plays an important part in
evaluating the performance of a data parallel programming
model.
Flynn’s Classification
• This taxonomy distinguishes multi-processor computer architectures
according to the two independent dimensions of Instruction stream
and Data stream.
• An instruction stream is a sequence of instructions executed by the machine.
• A data stream is a sequence of data, including input and partial or temporary results, used by the instruction stream.
• Each of these dimensions can have only one of two possible states: Single or Multiple.
• Flynn’s classification depends on the distinction between the performance of the control unit and the data processing unit rather than on their operational and structural interconnections.
SISD
• A SISD computer has one control unit, one processor unit and a single memory unit.
• They are also called scalar processors, i.e., they execute one instruction at a time and each instruction has only one set of operands.
• Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle.
• Single data: only one data stream is being used as input during any one clock cycle.
• Deterministic execution.
• Instructions are executed sequentially.
SIMD
• A type of parallel computer.
• Single instruction: all processing units execute the same instruction issued by the control unit at any given clock cycle; a single instruction is executed by different processing units on different sets of data.
• Multiple data: each processing unit can operate on a different data element; the processors are connected to shared memory or an interconnection network providing multiple data to the processing units.
MISD
• A single data stream is fed into multiple processing units: the same data flows through a linear array of processors executing different instruction streams.
• Each processing unit operates on the data independently via independent instructions.
• The single data stream is forwarded to different processing units, each connected to a different control unit, and each executes the instructions given to it by the control unit to which it is attached.
MIMD
• Multiple instruction: every processor may be executing a different instruction stream.
• Multiple data: every processor may be working with a different data stream.
• Different processors may each be processing a different task.
• Execution can be synchronous or asynchronous, deterministic or nondeterministic.
ARM Features
Thumb instruction set (T variant)
ARM Core dataflow model
Single-core computer
[Figure: a single-core CPU chip, with the single core highlighted on the die]
Multi-core architectures
• Replicate multiple processor cores on a
single die.
[Figure: a single die containing Core 1, Core 2, Core 3 and Core 4]
The cores run in parallel
[Figure: thread 1, thread 2, thread 3 and thread 4 each running on its own core (core 1 to core 4)]
Within each core, threads are time-sliced (just like on a uniprocessor)
[Figure: several threads multiplexed onto each of cores 1 to 4]
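A small C sketch of this idea using POSIX threads (compile with -pthread); NUM_THREADS = 8 is an arbitrary choice, so on a 4-core chip the OS must time-slice the threads onto the cores:

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 8   /* more threads than cores forces time-slicing */

static void *worker(void *arg) {
    long id = (long)arg;
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[NUM_THREADS];
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);   /* start the workers */
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);                         /* wait for them to finish */
    return 0;
}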
Instruction Encoding
• Remember that in a stored program computer, instructions are stored in
memory (just like data)
• Each instruction is fetched (according to the address specified in the PC),
decoded, and executed by the CPU
• The ISA defines the format of an instruction (syntax) and its meaning
(semantics)
• An ISA will define a number of different instruction formats.
• Each format has different fields
• The OPCODE field says what the instruction does (e.g. ADD)
• The OPERAND field(s) say where to find inputs and outputs of the instruction.
MIPS Instruction Encoding
The nice thing about MIPS (and other RISC machines) is that it has very few
instruction formats (basically just 3)
• All instructions are the same size (32 bits = 1 word)
• The formats are consistent with each other (i.e. the OPCODE field is
always in the same place, etc.)
• The three formats:
1. I-type (immediate)
2. R-type (register)
3. J-type (jump)
I-type (immediate)
• An immediate instruction has the form:
XXXI rt, rs, immed
• Recall that we have 32 registers, so we need 5 bits each to specify the rt and rs registers
• We allow 6 bits for the opcode (this implies a maximum of 64 opcodes, but there are actually more, see later)
• This leaves 16 bits for the immediate field

OPC [31:26] | rs [25:21] | rt [20:16] | immed [15:0]
I-type Example
• Example:
ADDI $a0, $12, 33 # a0 <- r12 + 33
The ADDI opcode is 8, register a0 is register # 4
OPC = 8 [31:26] | rs = 12 [25:21] | rt = 4 [20:16] | immed = 33 [15:0]
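A small C sketch that packs these fields exactly as in the layout above; encode_itype is a hypothetical helper, not a MIPS tool:

#include <stdint.h>
#include <stdio.h>

/* OPC[31:26] | rs[25:21] | rt[20:16] | immed[15:0] */
static uint32_t encode_itype(uint32_t opc, uint32_t rs, uint32_t rt, uint16_t imm) {
    return (opc << 26) | (rs << 21) | (rt << 16) | imm;
}

int main(void) {
    /* ADDI $a0, $12, 33 -> opcode 8, rs = 12, rt = 4 ($a0), immed = 33 */
    printf("0x%08x\n", (unsigned)encode_itype(8, 12, 4, 33));   /* prints 0x21840021 */
    return 0;
}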
Load-Store Formats
•A memory address is 32 bits, so it cannot be directly
encoded in an instruction
• Recall the use of a base register + offset (16-bits) in the
load-store instructions
• Thus, we need an OPCODE, a destination/source
register (destination for load, source for store), a base
register, and an offset
• This sounds very similar to the I-type format... example:
LW $14, 8($sp)    # r14 is loaded from stack+8
• The LW opcode is 35 (0x23)

OPC = 35 [31:26] | rs = 29 ($sp) [25:21] | rt = 14 [20:16] | offset = 8 [15:0]
R-type (register) format
• General form:
XXX rd, rs, rt
• Arithmetic-logical and comparison instructions require the encoding of 3 registers; the remaining bits can be used to specify the OPCODE.
• To keep the format as regular as possible, the OPCODE has a
primary “opcode” and a “function” field.
• We also need 5 bits for the shift-amount, in case of SHIFT
instructions.
• The 16 bits used for the immediate field in the I-type instruction
are split into 5 bits for rd, 5 bits for shift-amount, and 6 bits for
function (the other fields are the same).
OPC [31:26] | rs [25:21] | rt [20:16] | rd [15:11] | sht [10:6] | funct [5:0]
R-type Example
OPC = 0 [31:26] | rs = 8 [25:21] | rt = 9 [20:16] | rd = 7 [15:11] | sht = 0 [10:6] | funct = 34 [5:0]
• With opcode 0 and funct 34 (subtract), this is the encoding of SUB $7, $8, $9.
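Going the other way, a hypothetical C decoder that pulls the six R-type fields back out of a 32-bit word using the field positions shown above:

#include <stdint.h>
#include <stdio.h>

/* OPC[31:26] | rs[25:21] | rt[20:16] | rd[15:11] | sht[10:6] | funct[5:0] */
static void decode_rtype(uint32_t w) {
    unsigned opc   = (w >> 26) & 0x3F;
    unsigned rs    = (w >> 21) & 0x1F;
    unsigned rt    = (w >> 16) & 0x1F;
    unsigned rd    = (w >> 11) & 0x1F;
    unsigned sht   = (w >>  6) & 0x1F;
    unsigned funct =  w        & 0x3F;
    printf("opc=%u rs=%u rt=%u rd=%u sht=%u funct=%u\n", opc, rs, rt, rd, sht, funct);
}

int main(void) {
    /* the example above: opc 0, rs 8, rt 9, rd 7, sht 0, funct 34 */
    decode_rtype((0u << 26) | (8u << 21) | (9u << 16) | (7u << 11) | (0u << 6) | 34u);
    return 0;
}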
J-type (Jump) Format
• For a jump, we only need to specify the opcode, and we can use the other
bits for an address:
OPC [31:26] | address [25:0]
• We only have 26 bits for the address, but MIPS addresses are 32 bits
long...
• Because the address must reference an instruction, which is a word
address, we can shift the address left by 2 bits (giving us 28 bits). We get
the other 4 bits by combining with the 4 high-order bits of the PC.
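A one-line C sketch of that target calculation (jump_target is an illustrative helper, not part of any MIPS toolchain):

#include <stdint.h>

/* Shift the 26-bit address field left by 2 and take the upper 4 bits
   from the current PC, as described above. */
static uint32_t jump_target(uint32_t pc, uint32_t addr26) {
    return (pc & 0xF0000000u) | ((addr26 & 0x03FFFFFFu) << 2);
}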
Branch Addressing
There are 2 kinds of branches:
1.EQ/NEQ family (compares 2 regs for (in)equality), example:
BEQ $14, $8, 1000
2. Compare-to-zero family (compares 1 reg to zero),
example:
BGEZ $14, 1000
• Both “families” require an OPCODE, an rs register, and an offset
(1.) requires an additional register (rt)
(2.) requires some encoding for the condition (>=, <, etc.)

OPC [31:26] | rs [25:21] | rt or condition code (for >=, <, etc.) [20:16] | offset/4 [15:0]
Branch example
BEQ $14, $8, 1000 # PC := PC+1000 if r14==r8
BGEZ $14, 20 # PC := PC+20 if r14 >= 0
• The opcode for BEQ is 4; the opcode for BGEZ is 1, and the code for >= is 1

BEQ $14, $8, 1000:  OPC = 4 [31:26] | rs = 14 [25:21] | rt = 8 [20:16] | offset/4 = 250 [15:0]
BGEZ $14, 20:       OPC = 1 [31:26] | rs = 14 [25:21] | code = 1 [20:16] | offset/4 = 5 [15:0]
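A C sketch of the PC-relative calculation behind these encodings (branch_target is illustrative); note that real MIPS hardware adds the offset to the address of the following instruction (PC + 4), while the slide comments use PC for simplicity:

#include <stdint.h>

/* The 16-bit field holds a word offset: sign-extend it and multiply by 4. */
static uint32_t branch_target(uint32_t pc, uint16_t offset_field) {
    int32_t words = (offset_field & 0x8000) ? (int32_t)offset_field - 0x10000
                                            : (int32_t)offset_field;
    return pc + (uint32_t)(words * 4);   /* offset_field = 250 gives PC + 1000 */
}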
Memory Load and Store Operation
Overview
ARM Load/Store Instructions
• The ARM is a Load/Store Architecture:
• Only load and store instructions can access memory
• It does not support memory-to-memory data processing operations.
• Data values must be moved into registers before they are used.
Types of instructions
ARM Load/Store Instructions
• ARM has three sets of instructions which interact with main memory. These are:
• Single register data transfer (LDR/STR)
• Block data transfer (LDM/STM)
• Single Data Swap (SWP)
Basic Load and Store Instruction
Syntax and Example
Program Status Register format: N Z C V Q | J | (undefined) | I F T | mode, divided into the four fields f, s, x and c
• Condition code flags
• N = Negative result from ALU
• Z = Zero result from ALU
• C = ALU operation Carried out
• V = ALU operation oVerflowed
• Sticky Overflow flag - Q flag
• Architecture 5TE/J only
• Indicates if saturation has occurred
• J bit
• Architecture 5TEJ only
• J = 1: Processor in Jazelle state
• Interrupt Disable bits
• I = 1: Disables the IRQ
• F = 1: Disables the FIQ
• T Bit
• Architecture xT only
• T = 0: Processor in ARM state
• T = 1: Processor in Thumb state
• Mode bits
• Specify the processor mode
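A hypothetical C helper that unpacks a CPSR value using the bit positions listed above (N = 31, Z = 30, C = 29, V = 28, I = 7, F = 6, T = 5, mode = bits 4:0):

#include <stdint.h>

typedef struct { int n, z, c, v, i, f, t; unsigned mode; } cpsr_bits;

/* Extract the condition flags, interrupt-disable bits, T bit and mode. */
static cpsr_bits cpsr_unpack(uint32_t cpsr) {
    cpsr_bits b;
    b.n    = (cpsr >> 31) & 1;   /* Negative        */
    b.z    = (cpsr >> 30) & 1;   /* Zero            */
    b.c    = (cpsr >> 29) & 1;   /* Carry           */
    b.v    = (cpsr >> 28) & 1;   /* oVerflow        */
    b.i    = (cpsr >> 7)  & 1;   /* IRQ disable     */
    b.f    = (cpsr >> 6)  & 1;   /* FIQ disable     */
    b.t    = (cpsr >> 5)  & 1;   /* Thumb state     */
    b.mode = cpsr & 0x1F;        /* processor mode  */
    return b;
}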
Program Counter (r15)
■ When the processor is executing in ARM state:
■ All instructions are 32 bits wide
■ All instructions must be word aligned
■ Therefore the pc value is stored in bits [31:2], with bits [1:0] undefined (as instructions cannot be halfword or byte aligned).
Branch instructions (B, BL)
Cond [31:28] | 1 0 1 [27:25] | L [24] | Offset [23:0]
• The processor core shifts the offset field left by 2 positions, sign-extends it and adds it to
the PC
• ± 32 Mbyte range
• How to perform longer branches?
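A C sketch of this offset arithmetic (arm_branch_target is illustrative); 24 signed bits of word offset times 4 bytes is where the ± 32 Mbyte range comes from:

#include <stdint.h>

static uint32_t arm_branch_target(uint32_t pc, uint32_t offset24) {
    /* sign-extend the 24-bit offset field to 32 bits */
    uint32_t ext = (offset24 & 0x00800000u) ? (offset24 | 0xFF000000u) : offset24;
    /* shift left by 2 and add to the PC */
    return pc + (ext << 2);
}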
Data processing Instructions
• Consist of :
• Arithmetic: ADD ADC SUB SBC RSB RSC
[Figure: data-processing syntax and datapath; the operands pass through the ALU to produce the Result, with CF denoting the carry flag]
• Immediate value
• 8-bit number, with a range of 0-255
• Rotated right through an even number of positions
• Allows an increased range of 32-bit constants to be loaded directly into registers
Immediate constants
• Example: an 8-bit value rotated right by 30 positions (ror #30) yields constants in the range 0-0x000003fc, in steps of 0x00000004
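A hypothetical C check for this rule: a constant is encodable as a data-processing immediate if some even right-rotation of an 8-bit value produces it:

#include <stdint.h>

static int is_arm_immediate(uint32_t value) {
    if (value <= 0xFF)
        return 1;                         /* fits with rotation 0 */
    for (unsigned rot = 2; rot < 32; rot += 2) {
        /* rotate left by rot to undo a right rotation by rot */
        uint32_t undone = (value << rot) | (value >> (32u - rot));
        if (undone <= 0xFF)
            return 1;
    }
    return 0;
}

For example, is_arm_immediate(0x000003fc) succeeds (0xFF rotated right by 30), while is_arm_immediate(0x00000101) fails.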
Multiply
• Cycle time
• Basic MUL instruction
• 2-5 cycles on ARM7TDMI
• 1-3 cycles on StrongARM/XScale
• 2 cycles on ARM9E/ARM102xE
• +1 cycle for ARM9TDMI (over ARM7TDMI)
• +1 cycle for accumulate (not on 9E though result delay is one cycle longer)
• +1 cycle for “long”
• Above are “general rules” - refer to the TRM for the core you are using for the
exact details
Single register data transfer
LDR STR Word
LDRB STRB Byte
LDRH STRH Halfword
LDRSB Signed byte load
LDRSH Signed halfword load
• Syntax:
• LDR{<cond>}{<size>} Rd, <address>
• STR{<cond>}{<size>} Rd, <address>
e.g. LDREQB
Address accessed
• Address accessed by LDR/STR is specified by a base
register plus an offset
• For word and unsigned byte accesses, offset can be
• An unsigned 12-bit immediate value (ie 0 - 4095 bytes).
LDR r0,[r1,#8]
[Figure: the offset is added to the base register (r1) to form the effective address; the addressed word is loaded into r0, and for STR r0 is the source register. E.g. with base register r1 = 0x200 and offset 12, address 0x20c is accessed and the value 0x5 is transferred.]
Block data transfer (LDM/STM)
LDMxx r10, {r0,r1,r4}
STMxx r10, {r0,r1,r4}
[Figure: how r0, r1 and r4 are laid out in memory around the base register Rb (r10) for the four addressing modes IA, IB, DA and DB, with addresses increasing upward]
Software Interrupt (SWI)
Cond [31:28] | 1 1 1 1 [27:24] | SWI number [23:0]
• Causes an exception trap to the SWI hardware
vector
• The SWI handler can examine the SWI number
to decide what operation has been requested.
• By using the SWI mechanism, an operating
system can implement a set of privileged
operations which applications running in user
mode can request.
• Syntax:
• SWI{<cond>} <SWI number>
PSR Transfer Instructions
• The MRS and MSR instructions copy a PSR to or from a general-purpose register,
where
• <psr> = CPSR or SPSR
• [_fields] = any combination of ‘fsxc’
• The PSR fields are f = flags [31:24] (N Z C V Q J), s = status [23:16], x = extension [15:8], c = control [7:0] (I F T and the mode bits)
ARM vs. Thumb code size
• ADDS r2,r2,#1 is a 32-bit ARM instruction (bits 31:0).
• For most instructions generated by a compiler:
■ Conditional execution is not used
■ Source and destination registers are identical
■ Only Low registers are used
■ Constants are of limited size
■ The inline barrel shifter is not used
• The same operation can therefore be encoded as ADD r2,#1, a 16-bit Thumb instruction (bits 15:0).
Example ARM-based System
[Block diagram: an ARM core sits on an AMBA bus together with an arbiter, TIC, decoder, on-chip RAM, an 8-bit ROM and a remap/pause controller with reset; a bridge connects the peripheral side, which carries a timer, an interrupt controller (driving nIRQ/nFIQ to the core) and other I/O peripherals, while an external bus interface connects external ROM and external RAM.]
• AMBA: Advanced Microcontroller Bus Architecture
• ADK: Complete AMBA Design Kit
• ACT: AMBA Compliance Testbench
• PrimeCell: ARM’s AMBA compliant peripherals