COMPUTER ORGANIZATION
AND ARCHITECTURE
UNIT-5
Contents
• Parallelism: Need, types, applications and challenges
• Architecture of Parallel Systems-Flynn’s classification
• ARM Processor: The thumb instruction set
• Processor and CPU cores, Instruction Encoding format
• Memory load and Store instruction
• Basics of I/O operations.
• Case study: ARM 5 and ARM 7 Architecture
Parallelism: Need, types, applications and challenges
Parallelism
• Executing two or more operations at the same time is known as
parallelism.
• Parallel processing is a method to improve computer system performance by executing two or more instructions simultaneously.
• A parallel computer is a set of processors that are able to work cooperatively to solve a computational problem.
• Two or more ALUs in the CPU can work concurrently to increase throughput.
• The system may have two or more processors operating concurrently.
Goals of parallelism
• To increase the computational speed, i.e., to reduce the amount of time you need to wait for a problem to be solved
• To increase throughput, i.e., the amount of processing that can be accomplished during a given interval of time
• To improve the performance of the computer for a given clock speed
• To solve bigger problems that might not fit in the limited memory of a single CPU
Applications of Parallelism
• Numeric weather prediction
• Socio economics
• Finite element analysis
• Artificial intelligence and automation
• Genetic engineering
• Weapon research and defence
• Medical Applications
• Remote sensing applications
Types of parallelism
1. Hardware Parallelism
2. Software Parallelism
• Hardware Parallelism:
The main objective of hardware parallelism is to increase the processing speed. Based on the hardware architecture, hardware parallelism can be divided into two types: processor parallelism and memory parallelism.
• Processor parallelism
Processor parallelism means that the computer architecture has multiple nodes, multiple CPUs or multiple sockets, multiple cores, and multiple threads.
• Memory parallelism means shared memory, distributed memory, hybrid distributed-shared memory, multilevel pipelines, etc. Sometimes it is also called a parallel random access machine (PRAM): “an abstract model for parallel computation which assumes that all the processors operate synchronously under a single clock and are able to randomly access a large shared memory. In particular, a processor can execute an arithmetic, logic, or memory access operation within a single clock cycle.” In other words, parallelism is achieved by overlapping or pipelining instructions.
Hardware Parallelism
• One way to characterize the parallelism in a processor is by the
number of instruction issues per machine cycle.
• If a processor issues k instructions per machine cycle, then it is called
a k-issue processor.
• In a modern processor, two or more instructions can be issued per
machine cycle.
• A conventional processor takes one or more machine cycles to issue a
single instruction. These types of processors are called one-issue
machines, with a single instruction pipeline in the processor.
• A multiprocessor system built with n k-issue processors should be able to handle a maximum of nk instruction threads simultaneously; for example, four 2-issue processors can sustain at most eight instruction streams.
Software Parallelism
• It is defined by the control and data dependence of programs.
• The degree of parallelism is revealed in the program flow graph.
• Software parallelism is a function of algorithm, programming style,
and compiler optimization.
• The program flow graph displays the patterns of simultaneously
executable operations.
• Parallelism in a program varies during the execution period.
• It limits the sustained performance of the processor.
Software Parallelism - types
Instruction level parallelism
• Instruction-level parallelism (ILP) is a measure of how many operations can be performed simultaneously in a computer.
Eg. Instruction level parallelism
Consider the following example:
1. x = a + b
2. y = c - d
3. z = x * y
Operation 3 depends on the results of 1 and 2, so z cannot be calculated until x and y have been calculated. But operations 1 and 2 do not depend on any other operation, so they can be computed simultaneously.
• If we assume that each operation can be completed in one unit of time, then these three operations can be completed in two units of time.
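A minimal C sketch of this example (the function name ilp_example is illustrative, not from the slides); the comments mark which statements a superscalar processor with two ALUs could issue in the same cycle:

/* Statements 1 and 2 have no dependence on each other, so they can be
   issued together; statement 3 must wait for both results. */
int ilp_example(int a, int b, int c, int d) {
    int x = a + b;   /* 1: independent of 2                   */
    int y = c - d;   /* 2: independent of 1                   */
    int z = x * y;   /* 3: depends on the results of 1 and 2  */
    return z;
}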
Data-level parallelism (DLP)
• Data parallelism is parallelization across multiple processors in parallel computing environments: the data is distributed across the processors, and each processor applies the same operation to its portion of the data.
DLP - example
• Let us assume we want to sum all the elements of a given array of size n, and that a single addition operation takes Ta time units.
• With one processor the sum takes roughly n·Ta time units; if the array is split across p processors, each computes a partial sum in about (n/p)·Ta time units, and the partial sums are then combined.
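A minimal data-parallel sketch of this array sum in C, assuming OpenMP is available (compile with -fopenmp); the name parallel_sum and the use of double elements are illustrative choices, not part of the slides:

#include <stddef.h>

/* Each thread sums a slice of the array; the reduction clause combines
   the per-thread partial sums into the final result. */
double parallel_sum(const double *a, size_t n) {
    double sum = 0.0;
    long i;
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < (long)n; i++)
        sum += a[i];
    return sum;
}

Without OpenMP the pragma is ignored and the loop simply runs serially.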
DLP in Adding elements of array
DLP in matrix multiplication
• The locality of data references plays an important part in
evaluating the performance of a data parallel programming
model.
Flynn’s Classification
• This taxonomy distinguishes multi-processor computer architectures
according to the two independent dimensions of Instruction stream
and Data stream.
• An instruction stream is a sequence of instructions executed by the machine.
• A data stream is a sequence of data, including input and partial or temporary results, used by the instruction stream.
• Each of these dimensions can have only one of two possible states: Single or Multiple.
• Flynn’s classification depends on the distinction between the performance of the control unit and the data processing unit rather than on their operational and structural interconnections.
SISD
• A SISD computer has one control unit, one processor unit and a single memory unit.
• They are also called scalar processors, i.e., they execute one instruction at a time and each instruction has only one set of operands.
• Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle.
• Single data: only one data stream is being used as input during any one clock cycle.
• Deterministic execution.
• Instructions are executed sequentially.
SIMD
• A type of parallel computer.
• Single instruction: all processing units execute the same instruction issued by the control unit at any given clock cycle; a single instruction is executed by different processing units on different sets of data.
• Multiple data: each processing unit can operate on a different data element; the processors are connected to shared memory or an interconnection network providing multiple data to the processing units.
MISD
• A single data stream is fed into multiple processing units: the same data flows through a linear array of processors executing different instruction streams.
• Each processing unit operates on the data independently via independent instructions.
• The single data stream is forwarded to different processing units, each connected to a different control unit, and each executes the instructions given to it by the control unit to which it is attached.
MIMD
• Multiple instruction: every processor may be executing a different instruction stream.
• Multiple data: every processor may be working with a different data stream.
• Different processors may each be processing a different task.
• Execution can be synchronous or asynchronous, deterministic or nondeterministic.
ARM Features
Thumb instruction set (T variant)
ARM Core dataflow model
Single-core computer
[Figure: a single-core CPU chip, with the single core highlighted on the die]
Multi-core architectures
• Replicate multiple processor cores on a
single die.
[Figure: a single die containing Core 1, Core 2, Core 3 and Core 4]
The cores run in parallel
[Figure: thread 1, thread 2, thread 3 and thread 4 each running on its own core (core 1 to core 4)]
Within each core, threads are time-sliced (just like on a uniprocessor)
[Figure: several threads multiplexed onto each of cores 1 to 4]
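A small C sketch of this idea using POSIX threads (compile with -pthread); NUM_THREADS = 8 is an arbitrary choice, so on a 4-core chip the OS must time-slice the threads onto the cores:

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 8   /* more threads than cores forces time-slicing */

static void *worker(void *arg) {
    long id = (long)arg;
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[NUM_THREADS];
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);   /* start the workers */
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);                         /* wait for them to finish */
    return 0;
}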
Instruction Encoding
• Remember that in a stored program computer, instructions are stored in
memory (just like data)
• Each instruction is fetched (according to the address specified in the PC),
decoded, and executed by the CPU
• The ISA defines the format of an instruction (syntax) and its meaning
(semantics)
• An ISA will define a number of different instruction formats.
• Each format has different fields
• The OPCODE field says what the instruction does (e.g. ADD)
• The OPERAND field(s) say where to find inputs and outputs of the instruction.
MIPS Instruction Encoding
The nice thing about MIPS (and other RISC machines) is that it has very few
instruction formats (basically just 3)
• All instructions are the same size (32 bits = 1 word)
• The formats are consistent with each other (i.e. the OPCODE field is
always in the same place, etc.)
• The three formats:
1. I-type (immediate)
2. R-type (register)
3. J-type (jump)
I-type (immediate)
• An immediate instruction has the form:
XXXI rt, rs, immed
• Recall that we have 32 registers, so we need 5 bits each to specify the rt and rs registers
• We allow 6 bits for the opcode (this implies a maximum of 64 opcodes, but there are actually more, see later)
• This leaves 16 bits for the immediate field

OPC [31:26] | rs [25:21] | rt [20:16] | immed [15:0]
I-type Example
• Example:
ADDI $a0, $12, 33 # a0 <- r12 + 33
The ADDI opcode is 8, register a0 is register # 4
OPC = 8 [31:26] | rs = 12 [25:21] | rt = 4 [20:16] | immed = 33 [15:0]
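A small C sketch that packs these fields exactly as in the layout above; encode_itype is a hypothetical helper, not a MIPS tool:

#include <stdint.h>
#include <stdio.h>

/* OPC[31:26] | rs[25:21] | rt[20:16] | immed[15:0] */
static uint32_t encode_itype(uint32_t opc, uint32_t rs, uint32_t rt, uint16_t imm) {
    return (opc << 26) | (rs << 21) | (rt << 16) | imm;
}

int main(void) {
    /* ADDI $a0, $12, 33 -> opcode 8, rs = 12, rt = 4 ($a0), immed = 33 */
    printf("0x%08x\n", (unsigned)encode_itype(8, 12, 4, 33));   /* prints 0x21840021 */
    return 0;
}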
Load-Store Formats
•A memory address is 32 bits, so it cannot be directly
encoded in an instruction
• Recall the use of a base register + offset (16-bits) in the
load-store instructions
• Thus, we need an OPCODE, a destination/source
register (destination for load, source for store), a base
register, and an offset
• This sounds very similar to the I-type format... example:
LW $14, 8($sp)    # r14 is loaded from stack+8
• The LW opcode is 35 (0x23)

OPC = 35 [31:26] | rs = 29 ($sp) [25:21] | rt = 14 [20:16] | offset = 8 [15:0]
R-type (register) format
• General form:
XXX rd, rs, rt
• Arithmetic-logical and comparison instructions require the encoding of 3 registers; the remaining bits can be used to specify the OPCODE.
• To keep the format as regular as possible, the OPCODE has a
primary “opcode” and a “function” field.
• We also need 5 bits for the shift-amount, in case of SHIFT
instructions.
• The 16 bits used for the immediate field in the I-type instruction
are split into 5 bits for rd, 5 bits for shift-amount, and 6 bits for
function (the other fields are the same).
OPC [31:26] | rs [25:21] | rt [20:16] | rd [15:11] | sht [10:6] | funct [5:0]
R-type Example
OPC = 0 [31:26] | rs = 8 [25:21] | rt = 9 [20:16] | rd = 7 [15:11] | sht = 0 [10:6] | funct = 34 [5:0]
• With opcode 0 and funct 34 (subtract), this is the encoding of SUB $7, $8, $9.
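Going the other way, a hypothetical C decoder that pulls the six R-type fields back out of a 32-bit word using the field positions shown above:

#include <stdint.h>
#include <stdio.h>

/* OPC[31:26] | rs[25:21] | rt[20:16] | rd[15:11] | sht[10:6] | funct[5:0] */
static void decode_rtype(uint32_t w) {
    unsigned opc   = (w >> 26) & 0x3F;
    unsigned rs    = (w >> 21) & 0x1F;
    unsigned rt    = (w >> 16) & 0x1F;
    unsigned rd    = (w >> 11) & 0x1F;
    unsigned sht   = (w >>  6) & 0x1F;
    unsigned funct =  w        & 0x3F;
    printf("opc=%u rs=%u rt=%u rd=%u sht=%u funct=%u\n", opc, rs, rt, rd, sht, funct);
}

int main(void) {
    /* the example above: opc 0, rs 8, rt 9, rd 7, sht 0, funct 34 */
    decode_rtype((0u << 26) | (8u << 21) | (9u << 16) | (7u << 11) | (0u << 6) | 34u);
    return 0;
}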
J-type (Jump) Format
• For a jump, we only need to specify the opcode, and we can use the other
bits for an address:
OPC [31:26] | address [25:0]
• We only have 26 bits for the address, but MIPS addresses are 32 bits
long...
• Because the address must reference an instruction, which is a word
address, we can shift the address left by 2 bits (giving us 28 bits). We get
the other 4 bits by combining with the 4 high-order bits of the PC.
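A one-line C sketch of that target calculation (jump_target is an illustrative helper, not part of any MIPS toolchain):

#include <stdint.h>

/* Shift the 26-bit address field left by 2 and take the upper 4 bits
   from the current PC, as described above. */
static uint32_t jump_target(uint32_t pc, uint32_t addr26) {
    return (pc & 0xF0000000u) | ((addr26 & 0x03FFFFFFu) << 2);
}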
Branch Addressing
There are 2 kinds of branches:
1.EQ/NEQ family (compares 2 regs for (in)equality), example:
BEQ $14, $8, 1000
2. Compare-to-zero family (compares 1 reg to zero),
example:
BGEZ $14, 1000
• Both “families” require an OPCODE, an rs register, and an offset
(1.) requires an additional register (rt)
(2.) requires some encoding for the condition (>=, <, etc.)

OPC [31:26] | rs [25:21] | rt or condition code (for >=, <, etc.) [20:16] | offset/4 [15:0]
Branch example
BEQ $14, $8, 1000 # PC := PC+1000 if r14==r8
BGEZ $14, 20 # PC := PC+20 if r14 >= 0
• The opcode for BEQ is 4; the opcode for BGEZ is 1, and the code for >= is 1

BEQ $14, $8, 1000:  OPC = 4 [31:26] | rs = 14 [25:21] | rt = 8 [20:16] | offset/4 = 250 [15:0]
BGEZ $14, 20:       OPC = 1 [31:26] | rs = 14 [25:21] | code = 1 [20:16] | offset/4 = 5 [15:0]
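A C sketch of the PC-relative calculation behind these encodings (branch_target is illustrative); note that real MIPS hardware adds the offset to the address of the following instruction (PC + 4), while the slide comments use PC for simplicity:

#include <stdint.h>

/* The 16-bit field holds a word offset: sign-extend it and multiply by 4. */
static uint32_t branch_target(uint32_t pc, uint16_t offset_field) {
    int32_t words = (offset_field & 0x8000) ? (int32_t)offset_field - 0x10000
                                            : (int32_t)offset_field;
    return pc + (uint32_t)(words * 4);   /* offset_field = 250 gives PC + 1000 */
}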
Memory Load and Store Operation
Overview
ARM Load/Store Instructions
• The ARM is a Load/Store Architecture:
• Only load and store instructions can access memory
• It does not support memory-to-memory data processing operations.
• Data values must be moved into registers before they are used.
Types of instructions
ARM Load/Store Instructions
• ARM has three sets of instructions which interact with main memory. These are:
• Single register data transfer (LDR/STR)
• Block data transfer (LDM/STM)
• Single Data Swap (SWP)
Basic Load and Store Instruction
Syntax and Example
Program Status Register format: N Z C V Q | J | (undefined) | I F T | mode, divided into the four fields f, s, x and c
• Condition code flags
• N = Negative result from ALU
• Z = Zero result from ALU
• C = ALU operation Carried out
• V = ALU operation oVerflowed
• Sticky Overflow flag - Q flag
• Architecture 5TE/J only
• Indicates if saturation has occurred
• J bit
• Architecture 5TEJ only
• J = 1: Processor in Jazelle state
• Interrupt Disable bits
• I = 1: Disables the IRQ
• F = 1: Disables the FIQ
• T Bit
• Architecture xT only
• T = 0: Processor in ARM state
• T = 1: Processor in Thumb state
• Mode bits
• Specify the processor mode
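A hypothetical C helper that unpacks a CPSR value using the bit positions listed above (N = 31, Z = 30, C = 29, V = 28, I = 7, F = 6, T = 5, mode = bits 4:0):

#include <stdint.h>

typedef struct { int n, z, c, v, i, f, t; unsigned mode; } cpsr_bits;

/* Extract the condition flags, interrupt-disable bits, T bit and mode. */
static cpsr_bits cpsr_unpack(uint32_t cpsr) {
    cpsr_bits b;
    b.n    = (cpsr >> 31) & 1;   /* Negative        */
    b.z    = (cpsr >> 30) & 1;   /* Zero            */
    b.c    = (cpsr >> 29) & 1;   /* Carry           */
    b.v    = (cpsr >> 28) & 1;   /* oVerflow        */
    b.i    = (cpsr >> 7)  & 1;   /* IRQ disable     */
    b.f    = (cpsr >> 6)  & 1;   /* FIQ disable     */
    b.t    = (cpsr >> 5)  & 1;   /* Thumb state     */
    b.mode = cpsr & 0x1F;        /* processor mode  */
    return b;
}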
Program Counter (r15)
■ When the processor is executing in ARM state:
■ All instructions are 32 bits wide
■ All instructions must be word aligned
■ Therefore the pc value is stored in bits [31:2], with bits [1:0] undefined (as instructions cannot be halfword or byte aligned).
Branch instructions (B, BL)
Cond [31:28] | 1 0 1 [27:25] | L [24] | Offset [23:0]
• The processor core shifts the offset field left by 2 positions, sign-extends it and adds it to
the PC
• ± 32 Mbyte range
• How to perform longer branches?
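A C sketch of this offset arithmetic (arm_branch_target is illustrative); 24 signed bits of word offset times 4 bytes is where the ± 32 Mbyte range comes from:

#include <stdint.h>

static uint32_t arm_branch_target(uint32_t pc, uint32_t offset24) {
    /* sign-extend the 24-bit offset field to 32 bits */
    uint32_t ext = (offset24 & 0x00800000u) ? (offset24 | 0xFF000000u) : offset24;
    /* shift left by 2 and add to the PC */
    return pc + (ext << 2);
}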
Data processing Instructions
• Consist of :
• Arithmetic: ADD ADC SUB SBC RSB RSC
[Figure: data-processing syntax and datapath; the operands pass through the ALU to produce the Result, with CF denoting the carry flag]
• Immediate value
• 8-bit number, with a range of 0-255
• Rotated right through an even number of positions
• Allows an increased range of 32-bit constants to be loaded directly into registers
Immediate constants
• Example: an 8-bit value rotated right by 30 positions (ror #30) yields constants in the range 0-0x000003fc, in steps of 0x00000004
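A hypothetical C check for this rule: a constant is encodable as a data-processing immediate if some even right-rotation of an 8-bit value produces it:

#include <stdint.h>

static int is_arm_immediate(uint32_t value) {
    if (value <= 0xFF)
        return 1;                         /* fits with rotation 0 */
    for (unsigned rot = 2; rot < 32; rot += 2) {
        /* rotate left by rot to undo a right rotation by rot */
        uint32_t undone = (value << rot) | (value >> (32u - rot));
        if (undone <= 0xFF)
            return 1;
    }
    return 0;
}

For example, is_arm_immediate(0x000003fc) succeeds (0xFF rotated right by 30), while is_arm_immediate(0x00000101) fails.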
Multiply
• Cycle time
• Basic MUL instruction
• 2-5 cycles on ARM7TDMI
• 1-3 cycles on StrongARM/XScale
• 2 cycles on ARM9E/ARM102xE
• +1 cycle for ARM9TDMI (over ARM7TDMI)
• +1 cycle for accumulate (not on 9E though result delay is one cycle longer)
• +1 cycle for “long”
• Above are “general rules” - refer to the TRM for the core you are using for the
exact details
Single register data transfer
LDR STR Word
LDRB STRB Byte
LDRH STRH Halfword
LDRSB Signed byte load
LDRSH Signed halfword load
• Syntax:
• LDR{<cond>}{<size>} Rd, <address>
• STR{<cond>}{<size>} Rd, <address>
e.g. LDREQB
Address accessed
• Address accessed by LDR/STR is specified by a base
register plus an offset
• For word and unsigned byte accesses, offset can be
• An unsigned 12-bit immediate value (ie 0 - 4095 bytes).
LDR r0,[r1,#8]
[Figure: the offset is added to the base register (r1) to form the effective address; the addressed word is loaded into r0, and for STR r0 is the source register. E.g. with base register r1 = 0x200 and offset 12, address 0x20c is accessed and the value 0x5 is transferred.]
Block data transfer (LDM/STM)
LDMxx r10, {r0,r1,r4}
STMxx r10, {r0,r1,r4}
[Figure: how r0, r1 and r4 are laid out in memory around the base register Rb (r10) for the four addressing modes IA, IB, DA and DB, with addresses increasing upward]
Software Interrupt (SWI)
Cond [31:28] | 1 1 1 1 [27:24] | SWI number [23:0]
• Causes an exception trap to the SWI hardware
vector
• The SWI handler can examine the SWI number
to decide what operation has been requested.
• By using the SWI mechanism, an operating
system can implement a set of privileged
operations which applications running in user
mode can request.
• Syntax:
• SWI{<cond>} <SWI number>
PSR Transfer Instructions
• The MRS and MSR instructions copy a PSR to or from a general-purpose register,
where
• <psr> = CPSR or SPSR
• [_fields] = any combination of ‘fsxc’
• The PSR fields are f = flags [31:24] (N Z C V Q J), s = status [23:16], x = extension [15:8], c = control [7:0] (I F T and the mode bits)
ARM vs. Thumb code size
• ADDS r2,r2,#1 is a 32-bit ARM instruction (bits 31:0).
• For most instructions generated by a compiler:
■ Conditional execution is not used
■ Source and destination registers are identical
■ Only Low registers are used
■ Constants are of limited size
■ The inline barrel shifter is not used
• The same operation can therefore be encoded as ADD r2,#1, a 16-bit Thumb instruction (bits 15:0).
Example ARM-based System
[Block diagram: an ARM core sits on an AMBA bus together with an arbiter, TIC, decoder, on-chip RAM, an 8-bit ROM and a remap/pause controller with reset; a bridge connects the peripheral side, which carries a timer, an interrupt controller (driving nIRQ/nFIQ to the core) and other I/O peripherals, while an external bus interface connects external ROM and external RAM.]
• AMBA: Advanced Microcontroller Bus Architecture
• ADK: Complete AMBA Design Kit
• ACT: AMBA Compliance Testbench
• PrimeCell: ARM’s AMBA compliant peripherals