
ECSE324 : Computer Organization

Instruction Set Architecture


Chapter 2, Appendix D

Brett H. Meyer
Winter 2024

Revision history:
Warren Gross – 2017
Christophe Dubach – W2020, F2020, F2021, F2022, F2023
Brett H. Meyer – W2021, W2022, W2023, W2024
Some material from Hamacher, Vranesic, Zaky, and Manjikian, Computer Organization and Embedded Systems, 6th ed, 2012, McGraw Hill
and Patterson and Hennessy, Computer Organization and Design, ARM Edition, Morgan Kaufmann, 2017, and notes by A. Moshovos

Timestamp: 2024/01/29 10:53:00

Disclaimer

It is possible (and even likely) that I will (sometimes) make mistakes and give incorrect information during the live lectures. If you have any doubts, please check the textbook, or ask for clarification online.

Introduction
Instruction Set Architecture

Each processor has a predefined set of instructions that it implements, called the instruction set.

Instruction Set Architecture (ISA)


The ISA, or programming model, consists of the instruction set,
information about how memory is organized, how to access
memory, etc.

Instruction Set Architecture

• The ISA forms a contract between the machine and the programmer, defining features that
• software may use, and
• hardware will implement
• In general, multiple processors implement any given ISA
• E.g., consider: x86-64, ARMv7-A, Power ISA 3.0, RISC-V

Note: the ISA need not define how hardware will implement any
given feature.

Different Implementations of an ISA

Machine language software (assembly) is portable between two processors if they implement the same ISA.

• The ISA is the interface between the hardware and software


• The ISA tells you what the processor does; the ISA is a public
specification
• The implementation is how it does it; the implementation is
private (trade secrets, etc)

ISAs may be used for a long time because of legacy software.

• x86 was introduced in 1978


• x86-64 extended x86 to support 64-bit operations in 2001
• Consequently, x86 software written for the 8086 in 1978 runs on
Core i9 (x86-64) in 202

The ARM Architecture

• A family of RISC processors used in many devices, especially smartphones and tablets
• There have been 200 billion ARM processors shipped as of 2021, and ∼29 billion for 2021 alone!
• ARM provides the processor design to chip manufacturers, who fabricate it in their own products:
  • e.g., Apple A5 chip has a dual-core ARM Cortex-A9 processor
  • e.g., Nvidia Tegra 2 SoC also has the same ARM processor

[Figure: Nvidia Tegra 2 SoC; source: www.anadtech.com]

ARM ISA

ARM has developed several ISAs, and many different implementations based on each ISA.

• ARMv7-A is the ISA for the ARM Cortex-A9 processors in Apple A5 (iPhone 4S) and the Altera Cyclone V SoC (the one from the labs!)

[Figure: DE1-SoC Altera Cyclone V]

There are other implementations of the ARMv7-A ISA that have different characteristics: speed, power, cost, fault-tolerance, etc.

In the lab you will program an ARM Cortex-A9 processor
implementing the ARMv7-A ISA.

• The “Introduction to the ARM Processor Using Altera Toolchain” document contains most of what you need for this course.
• Appendix D of the textbook describes ARMv4, which is very similar, and should be adequate for this course. Some of the terminology is slightly different and I will use the correct terms in the lecture slides.
• The complete ISA is described in the ARMv7-AR Architecture Reference Manual.
  • The interesting parts for us are: A1–A4.

From now on, I will just refer to “ARM ISA” or “ARM assembly
language.”

ARM ISA

Overview
Textbook§D.1, D.2
ARM ISA Basics

The processor word length is 32 bits: processor registers are 32 bits; the address size is 32 bits.
The ISA is (mostly) RISC:

• All∗ instructions are 32 bits long.
• Only load and store instructions access memory.
• All arithmetic and logic instructions operate on registers.
• But there are some features which normally are seen in CISC ISAs.

∗ The ARM ISA also supports 16-bit wide Thumb-2 instructions.
ARM ISA Memory

• Memory is byte-addressable using 32-bit addresses
• Memory is little-endian
• Word, half-word, and byte data transfers to and from processor registers are supported (SW’s perspective)
• All memory accesses are word-aligned (HW’s implementation)

What sort of memory request in SW takes two memory accesses in HW?
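One way to explore this question is to count how many word-aligned locations a software request touches. The sketch below is only an illustration: the helper name `word_accesses` is hypothetical, and it assumes the hardware always transfers whole aligned 32-bit words.

```c
#include <stdint.h>

/* Number of aligned 32-bit words touched by a request of `size` bytes
   (1, 2, or 4) that starts at byte address `addr`.
   Hypothetical helper for illustration; not part of any ARM toolchain. */
unsigned word_accesses(uint32_t addr, uint32_t size)
{
    uint32_t first_word = addr / 4u;               /* word holding the first byte */
    uint32_t last_word  = (addr + size - 1u) / 4u; /* word holding the last byte  */
    return (unsigned)(last_word - first_word + 1u);
}
```

Trying a word request at an address that is a multiple of 4, and then at one that is not, shows when a second access becomes necessary.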

ARM Programmer-visible Registers

ARM implements sixteen 32-bit processor registers labeled R0 through R15.

• R15 is the program counter (PC)
• R14 is the link register (LR)
• R13 is the stack pointer (SP)

In general, we use only∗ R0...R12 as General Purpose Registers (GPRs) and only use and refer to R13, R14, and R15 as SP, LR, and PC.

∗ In practice, additional guidelines further limit the use of registers by programmers and compilers. Curious? See the ARM Architecture Procedure Call Standard.

There is also a special status register called the Current Program Status Register (CPSR) that indicates various useful information, such as ALU flags! (More on this later).
ARM ISA

Syntax
Textbook§2.5, D.4
Assembly Language Syntax

Assembly language consists of shorthand instruction names called mnemonics, a syntax for using them, and other directives for organizing them.
A program called an assembler translates the mnemonics into machine language instructions (binary; more later).
Here is a (very short) ARM assembly program:
ADD R1, R2, R3 // R1 <-- R2 + R3

• ADD is a mnemonic
• R1 is a destination register; the first operand
• R2 and R3 are source registers; the second and third operand
• // R1 <-- R2 + R3 is a comment (not a very useful one)

There are different ways to use each instruction.

ADD R1, R2, R3 // R1 <-- R2 + R3

Here, the syntax of the instruction is ADD Rd, Rn, Rm where

• Rd specifies the destination register


• Rn and Rm specify the source registers

ADD R4, R5, #24 // R4 <-- R5 + 24

Here, the syntax of the instruction is ADD Rd, Rn, Imm where

• Rd specifies the destination register


• Rn specifies the source register
• Imm specifies an immediate value (constant)

Instruction Format and Operands

Bits:   31-28   27-21    20   19-16   15-12   11-0
Field:  Cond    OPcode   S    Rn      Rd      Operand2

Assembly instructions ultimately become machine instructions; above, a 32-bit instruction is divided into several fields that determine its operation:

• Cond: condition codes, corresponding to ALU flags; more on this later
• OPcode: specifies the operation to be executed
• Rn, Rd, Operand2: operands the operation works with/on

Each operand has a limited set of allowable uses:

• Rd refers to a destination register to which results are written


• Rn and Rm refer to source registers; their value does not change
(unless the register is the same as Rd)
• Imm refers to an immediate value (the maximum number of bits
might be specified, e.g., Imm16 for a 16-bit value); immediates
are saved in the instruction itself
• Op2 refers to a flexible source operand, which is either:
• an 8-bit immediate value Imm8
• a register Rm (with optional rotation or shift)
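The field layout can be made concrete by packing the fields into a 32-bit word. This is a minimal sketch, not an assembler: the function name `encode_dp` is hypothetical, and it assumes the standard ARM data-processing encoding in which the OPcode area holds an immediate-select bit at bit 25 and a 4-bit operation code in bits 24-21 (ADD = 0100, MOV = 1101), with Cond = 1110 meaning "always".

```c
#include <stdint.h>

/* Pack the fields of an ARM data-processing instruction into one word.
   i_bit = 1 selects an immediate Operand2, 0 selects a register Operand2.
   Hypothetical helper for illustration only. */
uint32_t encode_dp(uint32_t cond, uint32_t i_bit, uint32_t opcode,
                   uint32_t s, uint32_t rn, uint32_t rd, uint32_t operand2)
{
    return ((cond   & 0xFu) << 28) |   /* Cond:     bits 31-28 */
           ((i_bit  & 0x1u) << 25) |   /* I:        bit 25     */
           ((opcode & 0xFu) << 21) |   /* opcode:   bits 24-21 */
           ((s      & 0x1u) << 20) |   /* S:        bit 20     */
           ((rn     & 0xFu) << 16) |   /* Rn:       bits 19-16 */
           ((rd     & 0xFu) << 12) |   /* Rd:       bits 15-12 */
           (operand2 & 0xFFFu);        /* Operand2: bits 11-0  */
}
```

For example, `encode_dp(0xE, 0, 0x4, 0, 2, 1, 3)` packs `ADD R1, R2, R3`, with R3 as a plain register Operand2.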

ARM ISA

General Data Processing Instructions


Textbook§2.8, D.4
Move Instructions

These instructions copy data into registers from other registers or immediate values.
Where are immediate values stored?
MOV Rd, Op2      // MOVes value of Op2 into Rd
MOV Rd, #Imm16   // MOVes immediate 16-bit value into Rd
MVN Rd, Op2      // MOVes complement (Not) of Op2 value into Rd
MOVT Rd, #Imm16  // MOVes Top: moves a 16-bit constant into
                 // the high-order 16 bits of Rd and leaves
                 // the lower bits unchanged

Why is the last instruction useful?
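One possible answer can be sketched in C: a MOV with a 16-bit immediate followed by MOVT builds a full 32-bit constant in two instructions. The function names below are hypothetical models of the two instructions, not real APIs.

```c
#include <stdint.h>

/* Model of MOV Rd, #Imm16: Rd becomes the zero-extended 16-bit constant. */
uint32_t mov_imm16(uint16_t imm16)
{
    return imm16;
}

/* Model of MOVT Rd, #Imm16: replace the high-order 16 bits of Rd
   and leave the low-order 16 bits unchanged. */
uint32_t movt(uint32_t rd, uint16_t imm16)
{
    return (rd & 0x0000FFFFu) | ((uint32_t)imm16 << 16);
}
```

Calling `movt(mov_imm16(low), high)` mirrors the two-instruction sequence: the first call fills the low half, the second fills the high half.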

Logic Instructions

These instructions perform binary logic operations on operands,


useful for testing conditions, manipulating data, etc.
AND Rd, Rn, Op2 // bitwise AND operation
ORR Rd, Rn, Op2 // bitwise OR operation
EOR Rd, Rn, Op2 // bitwise Exclusive OR (XOR) operation
BIC Rd, Rn, Op2 // BIt Clear: Rd <-- Rn AND NOT(Op2)
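For readers who know C, each of these instructions corresponds to a familiar bitwise operator. The sketch below uses hypothetical function names and treats registers as 32-bit unsigned values.

```c
#include <stdint.h>

uint32_t and_op(uint32_t rn, uint32_t op2) { return rn & op2;  } /* AND */
uint32_t orr_op(uint32_t rn, uint32_t op2) { return rn | op2;  } /* ORR */
uint32_t eor_op(uint32_t rn, uint32_t op2) { return rn ^ op2;  } /* EOR */
uint32_t bic_op(uint32_t rn, uint32_t op2) { return rn & ~op2; } /* BIC: clear the bits set in op2 */
```

BIC is handy for clearing selected bits: the second operand marks the bits to clear, rather than the bits to keep.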

Shift and Rotate Instructions

Shift and rotate instructions change the positions of bits within a


register, moving them left or right.
Note: Last operand can be a register or an immediate value, as with
logic operations.
LSL R1, R2, #5 // Logical shift left
LSR R1, R2, R3 // Logical shift right
ASR R1, R2, #4 // Arithmetic shift right

• Logical ⇒ pad with 0s, Arithmetic ⇒ extend sign bit

ROR R1, R2, #2 // Circular rotate right

• Less significant bits (on the right of the register) are moved into
the most significant positions (on the left of the register).
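The difference between logical, arithmetic, and rotate operations is easiest to see side by side. This is a rough C model with hypothetical function names; it assumes shift amounts between 0 and 31 and writes the arithmetic shift out by hand, because C leaves right shifts of negative signed values implementation-defined.

```c
#include <stdint.h>

/* Shift amounts n are assumed to be in the range 0..31. */
uint32_t lsl(uint32_t x, unsigned n) { return x << n; } /* pad with 0s on the right */
uint32_t lsr(uint32_t x, unsigned n) { return x >> n; } /* pad with 0s on the left  */

/* Arithmetic shift right: vacated bits on the left copy the sign bit. */
uint32_t asr(uint32_t x, unsigned n)
{
    uint32_t shifted = x >> n;
    if ((x & 0x80000000u) && n > 0)
        shifted |= ~(0xFFFFFFFFu >> n);
    return shifted;
}

/* Rotate right: bits shifted out on the right re-enter on the left. */
uint32_t ror(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x >> n) | (x << (32u - n)) : x;
}
```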

Arithmetic Instructions

Addition/subtraction instructions:
ADD R0, R1, R2 // R0 <-- R1 + R2
ADD R0, R1, #-24 // R0 <-- R1 + (-24)
SUB R0, R1, #24 // R0 <-- R1 - (24)
ADD R0, R1, R2, LSL#2 // R0 <-- R1 + R2*4

What are the uses of LSL in this case?


Multiply instruction
MUL R2, R3, R4 // R2 <-- R3 * R4

Multiply-accumulate instruction
MLA R2, R3, R4, R5 // R2 <-- (R3 * R4) + R5

These multiply instructions only return the 32 least significant bits.


There are other, more complex arithmetic instructions; they are not
covered in this course.
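The shifted-operand ADD and the 32-bit truncation of MUL and MLA can be modelled in a few lines of C. The function names are hypothetical; the 64-bit intermediate only serves to make the truncation explicit.

```c
#include <stdint.h>

/* ADD R0, R1, R2, LSL#2: the shift multiplies R2 by 4 before the add,
   which matches indexing an array of 4-byte elements. */
uint32_t add_scaled(uint32_t r1, uint32_t r2)
{
    return r1 + (r2 << 2);
}

/* MUL: only the 32 least significant bits of the product are kept. */
uint32_t mul32(uint32_t a, uint32_t b)
{
    return (uint32_t)((uint64_t)a * b);
}

/* MLA: multiply, add the accumulator, keep the low 32 bits. */
uint32_t mla32(uint32_t a, uint32_t b, uint32_t acc)
{
    return (uint32_t)((uint64_t)a * b + acc);
}
```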
Arithmetic Instructions

What about division?


UDIV R0, R1, R2 // R0 <-- R1 / R2, R1 and R2 unsigned
SDIV R0, R1, R2 // R0 <-- R1 / R2, R1 and R2 signed

But many processors do not implement division.
Division hardware is

• Complex, and therefore costly;
• Slow; and,
• Used infrequently.

Consequently, it is often performed in software using an ARM-provided library subroutine (e.g., __aeabi_idiv()).
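To give a feel for what a software division routine does, here is a minimal shift-and-subtract sketch for unsigned operands. It is an illustration only and is not the ARM library implementation; the name `udiv_soft` is hypothetical, and the caller is assumed to pass a non-zero divisor.

```c
#include <stdint.h>

/* Unsigned long division, one quotient bit per iteration.
   Assumes divisor != 0. Illustrative only. */
uint32_t udiv_soft(uint32_t dividend, uint32_t divisor)
{
    uint32_t quotient = 0;
    uint64_t remainder = 0;   /* 64 bits so the left shift cannot overflow */

    for (int bit = 31; bit >= 0; bit--) {
        /* Bring down the next dividend bit. */
        remainder = (remainder << 1) | ((dividend >> bit) & 1u);
        if (remainder >= divisor) {
            remainder -= divisor;
            quotient |= (uint32_t)1 << bit;
        }
    }
    return quotient;
}
```

The loop always runs 32 iterations, which hints at why division is slow compared with addition or a shift.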

ARM ISA

Memory Instructions
Textbook§2.4, D.3
Arrays in C (Review)

short arr[5] = {1, 2, 3, 4, 5};

Array elements are allocated one after the other in memory. (Remember endianness!)

For a 1D array, arr[i] is stored at address: &arr[0]+sizeof(TYPE)*i

where
• & means address of
• &arr[0] is the address of the first array element, and base (starting) address of the array
• sizeof returns the number of bytes required by TYPE
• sizeof(TYPE)*i is therefore the offset of element i

Byte view
Address  Content
...
0x1000   0x01
0x1001   0x00
0x1002   0x02
0x1003   0x00
0x1004   0x03
0x1005   0x00
0x1006   0x04
0x1007   0x00
0x1008   0x05
0x1009   0x00
...

Half-word view
Address  Content
...
0x1000   1
0x1002   2
0x1004   3
0x1006   4
0x1008   5
...
Array Access Example

Consider the following C code snippet:

int arr[8] = {17, 58, 79, 15, ...}; // sizeof(int) = 4 bytes
...
for (int i = 0; i < 8; i++) {
    v = arr[i];
    ...
    arr[i] = v;
}

When reading from an array, we need to:

• Get the base address (&arr);


• Multiply the index by the element size (i*4) to get the offset;
• Add to calculate the address of the element; and, then, finally
• Access memory!
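The four steps above can be written out explicitly in C, doing by hand what `arr[i]` does implicitly. This is a sketch under the usual assumption that addresses are byte addresses; the function name `load_element` is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Read arr[i] by computing the element's byte address manually. */
int load_element(const int *base, size_t i)
{
    const unsigned char *base_addr = (const unsigned char *)base; /* 1: base address   */
    size_t offset = i * sizeof(int);                              /* 2: index * size   */
    const unsigned char *elem_addr = base_addr + offset;          /* 3: base + offset  */
    int value;
    memcpy(&value, elem_addr, sizeof value);                      /* 4: access memory  */
    return value;
}
```

Each numbered comment corresponds to one bullet in the list; in assembly, each becomes roughly one instruction.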

To access arr we need an instruction that can read from memory:


LDR Rd, [Rn] // Rd <-- Mem[Rn], Rn = address in bytes

Our C code is implemented in part with the following assembly:


// R0 = variable i
// R1 = base address of arr (&arr)
MOV R2, #4 // R2 = 4
MUL R2, R0, R2 // R2 = i*4 -- calculate offset for index i
ADD R3, R1, R2 // R3 = arr + i*4 -- absolute address of arr[i]
LDR R4, [R3] // R4 = arr[i] -- R4 <-- Mem[R1+i*4]

Assume the base address of arr is 0x1000 (R1) and i=3 (R0). After execution of the load:

Address  Content
...
0x0100   MOV R2,#4
0x0104   MUL R2,R0,R2
0x0108   ADD R3,R1,R2
0x010C   LDR R4,[R3]
...
0x1000   17
0x1004   58
0x1008   79
0x100C   15
...

Registers
R0   0x00000003
R1   0x00001000
R2   0x0000000C
R3   0x0000100C
R4   0x0000000F
Load and Store Instructions

Memory accesses commonly∗ access words and take the form of:
LDR Rd, <EA> // Rd <-- Mem[EA]; reads a 32-bit word
STR Rm, <EA> // Mem[EA] <-- Rm; writes a 32-bit word

Loads and stores do not generally specify a memory address explicitly; instead, they compute an effective address (EA) from a base address and an offset.

Effective Address Calculation
EA = base + offset

Calculating an EA is very convenient for implementing common program structures: e.g., loops and arrays; and, complex objects.

∗ Other load and store instructions access bytes or half words, doubles, or multiple words, and manipulate addresses in more complex ways.
Effective Address Calculation

• The base address is always stored in a register (Rn)


• There are three kinds of offset:
• Immediate: a 12-bit number that is added to or subtracted from
the base address
• Index register: the offset is stored in a register (Rm)
• Scaled index register: the value in the index register is shifted by a
specified immediate value, then added to or subtracted from the
base address

Methods for Calculating the Effective Address

Name               Assembler syntax    Address generation
register indirect  [Rn]                EA = Rn
immediate offset   [Rn, #offset]       EA = Rn + offset
offset in Rm       [Rn, ±Rm, shift]    EA = Rn ± shifted(Rm)
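The three address-generation rules in the table translate directly into arithmetic. The C sketch below uses hypothetical function names and, for simplicity, models only the LSL form of the shifted offset with shift amounts between 0 and 31.

```c
#include <stdint.h>

/* [Rn]: EA = Rn */
uint32_t ea_register_indirect(uint32_t rn)
{
    return rn;
}

/* [Rn, #offset]: EA = Rn + offset (offset may be negative) */
uint32_t ea_immediate(uint32_t rn, int32_t offset)
{
    return rn + (uint32_t)offset;
}

/* [Rn, ±Rm, LSL #shift]: EA = Rn ± (Rm << shift).
   subtract != 0 selects the minus form. */
uint32_t ea_scaled_lsl(uint32_t rn, uint32_t rm, unsigned shift, int subtract)
{
    uint32_t shifted = rm << shift;
    return subtract ? rn - shifted : rn + shifted;
}
```

With a base of 0x1000 in Rn and an index of 3 in Rm, `ea_scaled_lsl(0x1000, 3, 2, 0)` gives the address of the fourth 4-byte element of an array.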

Back to our Example
...
v = arr[i];
...

Immediate (with #0): EA = R3
// R0 = variable i
// R1 = base address of arr (&arr)
MOV R2, #4         // R2 <-- 4
MUL R2, R0, R2     // R2 <-- i*4 -- calculate offset for index i
ADD R3, R1, R2     // R3 <-- arr + i*4 -- absolute address of arr[i]
LDR R4, [R3, #0]   // R4 <-- Mem[R3]

Index: EA = R1 + R2
MOV R2, #4         // R2 <-- 4
MUL R2, R0, R2     // R2 <-- i*4
LDR R4, [R1, R2]   // R4 <-- Mem[R1+R2]

Scaled Index: EA = R1 + (R0 << 2) = R1 + (R0 × 4)
LDR R4, [R1, R0, LSL#2] // R4 <-- Mem[R1+(R0<<2)]
Store Instructions Calculate EA the Same Way

Here’s our C code again, but this time we’re copying into arr:
int arr[8] = {17, 58, 79, 15, ...}; // sizeof(int) = 4 bytes
...
for (int i = 0; i < 8; i++) {
    v = arr[i];
    ...
    arr[i] = v;
}

We have the same options for calculating the effective address as we do for load instructions, e.g.:
Scaled Index: EA = R1 + (R0 << 2) = R1 + (R0 × 4)
// R0 = variable i
// R1 = base address of arr (&arr)
// R4 = v
STR R4, [R1, R0, LSL#2] // Mem[R1+R0<<2] <-- R4

Checkpoint

For each instruction below, calculate the EA (Effective Address) given


the following register content:
R2 = 0x1A4DDA38
R6 = 0x10004008
R8 = 0x10004000
R10 = 0x00000002

LDR R2, [R6, #-4]

LDR R2, [R6, #0x200]

STR R2, [R6, -R8]

STR R2, [R8]

LDR R2, [R8, R10, LSL#3]

Pointers in C (Review)

• A pointer (int *ptr;) is an address


• You can perform pointer arithmetic to change the address
• E.g., ptr = ptr+2;
• Also, pre-increment (++ptr), and post-increment (ptr++)
• You can dereference a pointer (*ptr) to access the data at the
address
In C, you declare that a variable is a pointer with *
int *p; // p is a pointer to an integer
        // i.e. p is the memory address of a 32-bit variable
        // since p contains an address, it is also 32 bits
        // NB that "int * p;" and "int* p;" also do the same thing
int x;
int a[5] = {20, 35, 0, 42, 12};

p = &a[3]; // the address of the 4th element of a is stored in p

x = *p; // here, * means indirection (the value addressed by p)
        // it's tricky! C uses * to mean different things in context!

What is the value stored in x?


In C we dereference a pointer to access the value at its address:
x = *p;

This is accomplished with the following assembly:
LDR R0, p      // Load the value of p (&a[3]) into R0
LDR R1, [R0]   // R1 <-- Mem[R0]
STR R1, x      // x <-- R1

Why is it important to know the pointer type?
int *p;

Because we can do arithmetic on the pointer:
p = 0x1000;

What is p+1?
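The key point is that pointer arithmetic is scaled by the size of the pointed-to type. A small C sketch can model this with plain integers standing in for addresses; both function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Address produced by "p + n" when p holds `address` and points to
   elements of `elem_size` bytes. */
uint32_t pointer_plus(uint32_t address, int32_t n, uint32_t elem_size)
{
    return address + (uint32_t)n * elem_size;
}

/* Byte distance between p and p+1 for a real C int pointer. */
size_t int_pointer_step(void)
{
    int pair[2];
    return (size_t)((char *)(pair + 1) - (char *)pair);
}
```

On a machine where `int` is 4 bytes, `int_pointer_step()` returns 4, so adding 1 to an `int *` advances the address by one whole element rather than one byte.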

int arr[8] = {56, 26, 88, 45, -45, 77, 98, 13};
print(arr);
print(&arr[1]);
...
int *ptr = &arr[1];
print(ptr);
print(*ptr);

print(ptr + 2);
print(*(ptr + 2));

print(ptr++);
print(ptr);

print(++ptr);
print(ptr);
...
print(*(ptr++));
print(*(++ptr));

Address  Content
...
0x1000   56
0x1004   26
0x1008   88
0x100C   45
0x1010   -45
0x1014   77
0x1018   98
0x101C   13
...

If arr starts at address 0x1000, what is printed by this C code?

Pointers in Assembly
C code without pointers:
int arr[8] = ...;
for (int i = 0; i < 8; i++) {
    v = arr[i];
    ...
}

Loop body in assembly:
// R0 = i
// R1 = base address of arr
// R2 = v
LDR R2,[R1,R0,LSL#2] // v=arr[i]
ADD R0,R0,#1         // i++

C code with pointers:
int arr[8] = ...;
int *ptr = arr;
while (ptr < (arr + 8)) {
    v = *(ptr++);
    ...
}

Loop body in assembly:
// R0 = ptr
// R1 = v
LDR R1, [R0]   // v = *ptr
ADD R0, R0, #4 // ptr=ptr+4

Using a pointer instead of arr[i] uses one fewer register in assembly! This is good for performance. A good compiler will automatically transform code to use pointers.
Post/Pre-indexed Addressing Mode

ARM includes methods for automatically updating addresses after memory accesses, improving performance.
Recall register indirect addressing:
// R0 = ptr
// R1 = v
LDR R1, [R0]   // v = *ptr
ADD R0, R0, #4 // ptr=ptr+4

Post-indexed addressing performs the access then updates:
LDR R1,[R0],#4 // v = *(ptr++)

Pre-indexed addressing updates (!) then performs the access:
LDR R1,[R0,#4]! // v = *(++ptr)

Using one instruction to read memory and increment the pointer:


• saves time (fewer instructions are executed),
• saves energy (fewer instructions are read from memory), and
• reduces system costs (less program memory is needed).
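The ordering of "access" and "update" in the two modes can be modelled with a tiny simulated memory. Everything here is an assumption for illustration: the base address `MEM_BASE`, the sample contents, and the function names `ldr_post` and `ldr_pre`.

```c
#include <stdint.h>

#define MEM_BASE 0x1000u

/* Small simulated word memory starting at MEM_BASE (sample values). */
static const int32_t mem_words[] = { 56, 26, 88, 45, -45 };

static int32_t read_word(uint32_t addr)
{
    return mem_words[(addr - MEM_BASE) / 4u];
}

/* LDR Rd, [Rn], #offset : access at Rn first, then Rn <- Rn + offset */
int32_t ldr_post(uint32_t *rn, int32_t offset)
{
    int32_t value = read_word(*rn);
    *rn += (uint32_t)offset;
    return value;
}

/* LDR Rd, [Rn, #offset]! : Rn <- Rn + offset first, then access at Rn */
int32_t ldr_pre(uint32_t *rn, int32_t offset)
{
    *rn += (uint32_t)offset;
    return read_word(*rn);
}
```

Both functions leave the base register with the same final value; they differ only in which address is used for the access.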
Assuming R0=0x1008 before the LDR instruction executes, what’s the content of R0 and R1 after the instruction executes?

Post-indexed: LDR R1,[R0],#4    // v = *(ptr++)
Pre-indexed:  LDR R1,[R0,#4]!   // v = *(++ptr)

Address  Content
...
0x1000   56
0x1004   26
0x1008   88
0x100C   45
0x1010   -45
...
Load/Store Addressing Mode Summary
Name                 Assembler Syntax    Address Generation

Register indirect:   [Rn]                Address = Rn

Offset:
  immediate offset   [Rn,#offset]        Address = Rn + offset
  offset in Rm       [Rn,±Rm,shift]      Address = Rn ± shifted(Rm)

Pre-indexed:
  immediate offset   [Rn,#offset]!       Address = Rn + offset; Rn ← Address
  offset in Rm       [Rn,±Rm,shift]!     Address = Rn ± shifted(Rm); Rn ← Address

Post-indexed:
  immediate offset   [Rn],#offset        Address = Rn; Rn ← Rn + offset
  offset in Rm       [Rn],±Rm,shift      Address = Rn; Rn ← Rn ± shifted(Rm)

• offset = a signed number (12 bits)
• shift = direction #integer, where direction is LSL for left shift or LSR for right shift, and integer is a 5-bit unsigned number specifying the shift amount
Loading and Storing Byte and Half-words

Dedicated instructions load/store values smaller than a word:

• LDRB (Load Register Byte) – zero-extended to 32 bits
• LDRH (Load Register Halfword) – zero-extended to 32 bits
• LDRSB (Load Register Signed Byte) – sign-extended to 32 bits
• LDRSH (Load Register Signed Halfword) – sign-extended to 32 bits
• STRB (Store Register Byte) – stores the low byte of Rd
• STRH (Store Register Halfword) – stores the low halfword of Rd
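The difference between zero extension and sign extension matters as soon as the loaded value has its top bit set. The following C sketch models what ends up in the 32-bit register; the function names simply mirror the mnemonics and are not real APIs.

```c
#include <stdint.h>

/* Zero extension: the upper bits of the register become 0. */
uint32_t ldrb(uint8_t b)  { return b; }
uint32_t ldrh(uint16_t h) { return h; }

/* Sign extension: the upper bits copy the top bit of the loaded value. */
uint32_t ldrsb(uint8_t b)
{
    return (b & 0x80u) ? (0xFFFFFF00u | b) : b;
}

uint32_t ldrsh(uint16_t h)
{
    return (h & 0x8000u) ? (0xFFFF0000u | h) : h;
}

/* STRB stores only the low byte of the register. */
uint8_t strb(uint32_t rd)
{
    return (uint8_t)(rd & 0xFFu);
}
```

A byte holding 0xF8 therefore loads as 248 with LDRB but as -8 (0xFFFFFFF8) with LDRSB, so the choice depends on whether the C type is unsigned or signed.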

Loading and Storing Multiple Words

LDM and STM load and store blocks of words in consecutive memory
addresses into multiple registers.
STM: registers are accessed in order from largest-to-smallest index
(R15..R0)
LDM: registers are accessed in order from smallest to largest index
(R0..R15)
The direction in which memory addresses are computed, and when the address is updated, is selected with one of the following suffixes on the mnemonic:
• IA – Increment After the transfer (default)
• IB – Increment Before the transfer
• DA – Decrement After the transfer
• DB – Decrement Before the transfer
Registers need not be consecutive, e.g.,: LDMIA R8, {R0,R2,R9}.
Example:
LDMIA R3!, {R4, R6-R8, R10}

R4 ← Mem[R3]
R6 ← Mem[R3 + 4]
R7 ← Mem[R3 + 8]
R8 ← Mem[R3 + 12]
R10 ← Mem[R3 + 16]
R3 ← R3 + 20 // increment after
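The register transfers listed above can be reproduced with a short C model of LDMIA with writeback. This is a sketch under several assumptions: the register list is passed as a 16-bit mask (bit r set means Rr is in the list), memory is a plain array of words starting at a made-up address `mem_base`, and the base register is not itself in the list.

```c
#include <stdint.h>

/* Model of LDMIA Rn!, {list}: load listed registers from consecutive
   words, lowest register number first, then write the final address
   back to Rn. Hypothetical helper for illustration. */
void ldmia_writeback(uint32_t regs[16], unsigned rn, uint16_t reg_list,
                     const uint32_t *mem, uint32_t mem_base)
{
    uint32_t addr = regs[rn];

    for (unsigned r = 0; r < 16; r++) {
        if (reg_list & (1u << r)) {
            regs[r] = mem[(addr - mem_base) / 4u]; /* Rr <- Mem[addr] */
            addr += 4u;                            /* increment after */
        }
    }
    regs[rn] = addr; /* the "!" requests writeback */
}
```

For the example on this slide the mask would have bits 4, 6, 7, 8, and 10 set, and R3 ends up 20 bytes past its starting value.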

PC-relative Addressing

• The PC can be used as the base register to access memory locations in terms of their distance relative to PC+8
• Recall pipelining
  • The CPU updates PC ← PC+4 upon fetching instruction i
  • While i is being decoded, i + 1 is fetched and PC ← PC+4 again
  • When i is executing, the CPU is fetching i + 2 at PC+8!
• PC-relative addressing is used when accessing variables declared statically

Address  Content
...
0x0FF0   96
0x0FF4   -8
0x0FF8   78
0x0FFC   26
0x1000   LDR R0, [PC,#-16]
...

What’s the content of R0 after executing this instruction?
LDR R0, [PC, #-16]

The SP may be used in a similar way to access data on the stack (more on this later, too).
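The "+8" rule can be captured in a single line of arithmetic. In this sketch the function name `pc_relative_ea` is hypothetical, and `instr_addr` is the address at which the load instruction itself is stored.

```c
#include <stdint.h>

/* Effective address of LDR Rd, [PC, #offset] for an instruction located
   at instr_addr: when it executes, the PC reads as instr_addr + 8. */
uint32_t pc_relative_ea(uint32_t instr_addr, int32_t offset)
{
    return instr_addr + 8u + (uint32_t)offset;
}
```

For instance, a load at address 0x0 with offset -4 reads from address 0x4, the word immediately after the instruction.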
ARM ISA

Assembling Simple Programs


Textbook§2.5, 2.9, D.5
Assembler Directives

We are almost ready to write our first assembly language program!
The assembler also accepts commands about how it should assemble your program. These are not machine instructions and are never translated to executable machine language.
Some common ones (see the Altera documentation for more):
.global symbol // makes symbol visible outside object file
.word expression // allocates a 32-bit variable in memory
.equ name, value // name is replaced with value in this file
.text // marks the beginning of the code
.end // marks the end of the code

• Text section = where code goes


• Data section = where data goes (everything* except code)

See more examples of GNU ARM assembly directives.

Loading 32-bit Constants into Registers

We often need a way to load large constant values into registers, e.g.,
32-bit addresses. The assembler uses a pseudo-instruction to do this.
LDR Rd, =value // pseudo-instruction: is it a load? a mov?

• If the value fits within the range allowed in a MOV instruction,


the assembler will produce a MOV instruction
• Otherwise, the assembler places the constant value into a
literal pool in memory, in the same text section as the
instruction, where it can be read at runtime:

LDR Rd, [PC, #offset]

where Mem[PC + offset] = value.
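Whether a value "fits" in a MOV depends on the immediate encoding. The sketch below assumes the classic ARM rule that an immediate operand is an 8-bit value rotated right by an even number of bits; a real assembler may also try other encodings (for example MVN or a 16-bit MOV), so treat this as an approximation. The function name `fits_arm_immediate` is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

static uint32_t rotate_left(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x << n) | (x >> (32u - n)) : x;
}

/* True if value can be written as an 8-bit constant rotated right by an
   even amount (0, 2, ..., 30). Undoing the rotation must leave a number
   that fits in 8 bits. */
bool fits_arm_immediate(uint32_t value)
{
    for (unsigned rot = 0; rot < 32u; rot += 2u) {
        if (rotate_left(value, rot) <= 0xFFu)
            return true;
    }
    return false;
}
```

Under this rule 0x20 and 0x12 fit in a single MOV, while 0x1234 and 0xF0F0F0F0 do not and would be placed in a literal pool.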

Example of 32-bit Constants (and our first programs!)

Loading a small constant:


.global _start
.text
_start: LDR R0, =0x00000020
.end

address content code


0x00000000 0xE3A00020 MOV R0, #32

Loading a large constant:


.global _start
.text
_start: LDR R0, =0xF0F0F0F0
.end

address content code


0x00000000 0xE51F0004 LDR R0, [PC, #-4]
0x00000004 0xF0F0F0F0 .word 0xF0F0F0F0
Declaring and initializing a variable, and defining expressions:
.global _start
n: .word 7
.equ m, 0x12
.equ o, 0x1234
_start:
LDR R0, n // R0 <-- Mem[n]
LDR R1, =m // R1 <-- m
LDR R2, =o // R2 <-- o

address content code


0x00000000 0x00000007 .word 7
0x00000004 0xE51F000C LDR R0, [PC, #-12]
0x00000008 0xE3A01012 MOV R1, #18
0x0000000C 0xE51F2004 LDR R2, [PC, #-4]
0x00000010 0x00001234 .word 0x00001234
• LDR R0,n is a real instruction where the label n = PC-12
• LDR R1,=m and LDR R2,=o are pseudo-instructions
What values are in each register after execution?
ARM ISA

CPSR & Branching


Textbook§D.9
Current Program Status Register (CPSR)
Bits:   31  30  29  28  ...  7  6  ...  4-0
Field:  N   Z   C   V        I  F       M[4:0]
        (condition code flags) (interrupts) (processor mode)

• Condition code flag bits are set to 1 when the condition is true
  • Recall ALU flags: N = Negative, Z = Zero, C = Carry, V = Overflow
• Interrupt flags
  • I = IRQ mask bit, F = FIQ (Fast interrupt) mask bit
• Processor mode
• 10000 = User (most of user code)
• 10001 = Serving fast interrupt (when dealing with I/O)
• 10010 = Serving normal interrupt (when dealing with I/O)
• 10011 = Supervisor (used by the Operating System)
CPSR is not a general-purpose register
Special instructions modify the CPSR, directly or as a side-effect,
while others will behave differently depending on CPSR state.

Condition Codes

Combinations of condition code flags are used to determine if the result of an instruction satisfies a particular inequality.

Suffix  Meaning                               CPSR Flags
EQ      EQual (zero)                          Z=1
NE      Not Equal (nonzero)                   Z=0
CS/HS   Carry Set / unsigned Higher or Same   C=1
CC/LO   Carry Clear / unsigned Lower          C=0
MI      MInus (negative)                      N=1
PL      PLus (positive or zero)               N=0
VS      oVerflow Set                          V=1
VC      oVerflow Clear                        V=0
HI      unsigned Higher                       C=1 AND Z=0
LS      unsigned Lower or Same                C=0 OR Z=1
GE      signed Greater or Equal               N=V
LT      signed Less Than                      N!=V
GT      signed Greater Than                   Z=0 AND (N=V)
LE      signed Less or Equal                  Z=1 OR (N!=V)
AL      ALways executed                       None tested
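The table can be read as a set of boolean expressions over the four flags. The C sketch below encodes each row directly; the enum and function names are hypothetical, and each flag is passed as 0 or 1.

```c
#include <stdbool.h>

typedef enum { EQ, NE, CS, CC, MI, PL, VS, VC, HI, LS, GE, LT, GT, LE, AL } Cond;

/* Evaluate a condition suffix against the N, Z, C, V flags (0 or 1 each). */
bool cond_holds(Cond cond, int n, int z, int c, int v)
{
    switch (cond) {
    case EQ: return z == 1;
    case NE: return z == 0;
    case CS: return c == 1;            /* also HS */
    case CC: return c == 0;            /* also LO */
    case MI: return n == 1;
    case PL: return n == 0;
    case VS: return v == 1;
    case VC: return v == 0;
    case HI: return c == 1 && z == 0;
    case LS: return c == 0 || z == 1;
    case GE: return n == v;
    case LT: return n != v;
    case GT: return z == 0 && n == v;
    case LE: return z == 1 || n != v;
    case AL: return true;
    }
    return true;
}
```

Note that the signed conditions (GE, LT, GT, LE) compare N with V, while the unsigned ones (HI, LS, CS, CC) look at C.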

Branch Instructions

Branch instructions read the condition code flags to determine whether to jump to a label or continue with the next instruction.
B{cond} LABEL

• The condition cond specifies a test of the condition code bits


• If the condition is true, the next instruction executed will be at
address LABEL, the target
• If the condition is false, the processor simply executes the next
instruction (fall-through)

Branch instructions enable control flow


Branch instructions are essential for control flow operations in
software: e.g., if/else, loops, function calls, etc.

Test & Compare Instructions

Some instructions are designed specifically to set condition flags:


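TST Rs, Op2

Condition code flags set to result of Rs AND Op2; Z = 1 if the result is zero (Rs unchanged)

TEQ Rs, Op2

Condition code flags set to result of Rs XOR Op2; Z = 1 if the result is zero, i.e., the operands are equal (Rs unchanged)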

CMP Rs, Op2

Condition code flags set to result of Rs - Op2 (Rs unchanged)

CMN Rs, Op2

Condition code flags set to result of Rs + Op2 (Rs unchanged)
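To see how CMP derives the four flags from a subtraction, here is a small C model. The struct and function names are hypothetical; the carry rule follows the ARM convention that C = 1 after a subtraction means no borrow occurred.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool n, z, c, v; } Flags;

/* Flags produced by CMP Rs, Op2, i.e. by computing Rs - Op2
   and discarding the result. */
Flags cmp_flags(uint32_t rs, uint32_t op2)
{
    uint32_t result = rs - op2;
    Flags f;
    f.n = (result >> 31) != 0;                        /* result negative        */
    f.z = (result == 0);                              /* result zero            */
    f.c = (rs >= op2);                                /* no borrow (unsigned)   */
    f.v = (((rs ^ op2) & (rs ^ result)) >> 31) != 0;  /* signed overflow        */
    return f;
}
```

Comparing 2 with 3, for example, gives a negative result with no signed overflow, so N differs from V and the LT and LE conditions hold.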

Example

C code:
if (x > 3)
    y = 7;
else
    y = 13;

Corresponding ARM assembly:
      LDR R0, X      // R0 <-- Mem[X]
      CMP R0, #3     // R0-#3, only update CPSR
      BLE ELSE       // if R0-#3<=0 then branch
      MOV R1, #7     // ** if code **
      B END          // branch to END
ELSE: MOV R1, #13    // ** else code **
END:  STR R1, Y      // Mem[Y] <-- R1

As an exercise, determine the contents of each register and CPSR
after each instruction, assuming:
1) x = 6,
2) x = 2, and
3) x = 3

49
Setting Condition Codes with the S Suffix

Data processing instructions (arithmetic, logic, move) affect the
condition codes if the suffix S is appended to the mnemonic.
Example:
ADDS R0, R1, R2 // sets condition codes
ADD R0, R1, R2 // does not

Condition codes are set based on the result of the data processing
instruction.
Note that the following two instructions set condition codes in the
same manner:
SUBS R0, R1, R2
CMP R1, R2

Unless the result of the subtraction is required, CMP is preferred,
since it uses one fewer register.

50
Conditional Execution

Branch instructions are executed when the stated condition is true.


Most ARM instructions can be executed conditionally, too.
Instruction format: OP{cond}{S} Rd, Rn, Op2

C code:
    if (x > 3)
        y = 7;
    else
        y = 13;

ARM assembly:
    LDR   R0, X
    CMP   R0, #3     // set flags
    MOVGT R1, #7     // if R0-3 > 0
    MOVLE R1, #13    // if R0-3 <= 0
    STR   R1, Y

If the condition is true, then the instruction executes, otherwise the
instruction has no effect. This can save some branches, resulting in
compact and fast code.
This is a pretty advanced and ARM-specific technique. For now,
thinking in terms of branches keeps things simple.

51
ARM ISA

Putting it all together:
calculating a dot product in assembly
Dot Product

The dot product of two vectors A and B is defined as:

$$\sum_{i=0}^{n-1} A(i) \cdot B(i)$$

C program for the dot product of two vectors of six integers:

    void main() {
        int n = 6;
        int vectorA[6] = {5, 3, -6, 19, 8, 12};
        int vectorB[6] = {2, 14, -3, 2, -5, 36};
        int dotP;
        int i;

        dotP = 0;
        for (i = 0; i < n; i++)
            dotP += vectorA[i] * vectorB[i];

        printf("Dot product = %d\n", dotP);
    }
52
C variable declarations:
    int n = 6;
    int vectorA[6] = {5, 3, -6, 19, 8, 12};
    int vectorB[6] = {2, 14, -3, 2, -5, 36};
    int dotP;
    int i;

Assembly memory allocation:
    n:       .word 6
    vectorA: .word 5, 3, -6, 19, 8, 12
    vectorB: .word 2, 14, -3, 2, -5, 36
    dotP:    .space 4
    // i will be stored in a register, no memory allocation needed

• .word a, b, c, ...
allocate storage for 1 or more words (4 bytes each) and initialize
with the values a, b, c, ...
• .space 4
allocate 4 bytes without initialization
• n, vectorA, ... are addresses (labels) corresponding to the start
of the allocated space
53
The for loop expands to a number of initialization instructions and
other code that is repeated once each iteration.
    dotP = 0;
    for (i = 0; i < n; i++)
        dotP += vectorA[i] * vectorB[i];

          MOV R3, #0           // register R3 will accumulate the product
          LDR R0, =vectorA     // R0 <-- vectorA base address (pseudo-instruction)
          LDR R1, =vectorB     // R1 <-- vectorB base address (pseudo-instruction)
          LDR R2, n            // R2 <-- Mem[n] = 6
          MOV R6, #0           // initialize iteration variable i
    LOOP:
          CMP R6, R2           // do i-n and set flags accordingly
          BGE END              // we're done if i-n >= 0 (if i >= n)
          LDR R4, [R0], #4     // get vectorA[i]; post-index increments R0 after
          LDR R5, [R1], #4     // get vectorB[i]; post-index increments R1 after
          MLA R3, R4, R5, R3   // R3 <-- (R4 * R5) + R3
          ADD R6, R6, #1       // i++
          B   LOOP
    END:
          STR R3, dotP         // Mem[dotP] <-- R3

54
A more efficient approach uses SUBS to count down:
    dotP = 0;
    i = n;
    do {   // assumes there is at least one element in each array
        dotP += vectorA[n - i] * vectorB[n - i];
        i--;
    } while (i > 0);

          MOV  R3, #0           // register R3 will accumulate the product
          LDR  R0, =vectorA     // R0 = vectorA base address (pseudo-instruction)
          LDR  R1, =vectorB     // R1 = vectorB base address (pseudo-instruction)
          LDR  R2, n            // R2 = 6 (R2 is i this time)
    LOOP:
          LDR  R4, [R0], #4     // get next element of vectorA; post-index increments R0 after
          LDR  R5, [R1], #4     // get next element of vectorB; post-index increments R1 after
          MLA  R3, R4, R5, R3   // R3 = (R4 * R5) + R3
          SUBS R2, R2, #1       // i-- and set condition flags
          BGT  LOOP             // we're not done if i > 0

          STR  R3, dotP

• One fewer register used
• 5 vs 7 instructions in the loop body
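
As a cross-check of the count-down loop, here is a minimal sketch in C that mirrors it: two pointers are post-incremented while a counter runs down to zero. The function name dot_product is an assumption made for this illustration; the comments map each C statement to the assembly instruction it stands for.

```c
#include <assert.h>

/* Dot product written to mirror the count-down assembly loop. */
int dot_product(const int *a, const int *b, int n) {
    assert(n > 0);              /* like the do-while, needs at least one element */
    int acc = 0;                /* plays the role of R3 */
    do {
        acc += *a++ * *b++;     /* two post-indexed LDRs, then MLA */
        n--;                    /* SUBS R2, R2, #1 */
    } while (n > 0);            /* BGT LOOP */
    return acc;
}
```

With the vectors used in this example, {5, 3, -6, 19, 8, 12} and {2, 14, -3, 2, -5, 36}, the products are 10, 42, 18, 38, -40 and 432, which sum to 500.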
55
Last bit, printing the result:
    printf("Dot product = %d\n", dotP);

We have to call a library subroutine to print the results. This usually
requires an operating system to print information on a terminal, or
direct access to an I/O device in assembly (e.g., a screen). We will see
that in another lecture.

56
Full dot product code in ARM assembly
    .global _start   // tells the assembler/linker where to start execution

    n:       .word 6
    vectorA: .word 5, 3, -6, 19, 8, 12
    vectorB: .word 2, 14, -3, 2, -5, 36
    dotP:    .space 4

    _start:
          MOV  R3, #0           // register R3 will accumulate the product
          LDR  R0, =vectorA     // R0 = vectorA base address (pseudo-instruction)
          LDR  R1, =vectorB     // R1 = vectorB base address (pseudo-instruction)
          LDR  R2, n            // R2 = 6 (R2 is our loop iteration variable i)

    LOOP:
          LDR  R4, [R0], #4     // get next element of vectorA; post-index increments R0 after
          LDR  R5, [R1], #4     // get next element of vectorB; post-index increments R1 after
          MLA  R3, R4, R5, R3   // R3 = (R4 * R5) + R3
          SUBS R2, R2, #1       // i-- and set condition flags
          BGT  LOOP             // we're not done if i > 0

          STR  R3, dotP         // save our result in memory

    STOP:
          B    STOP             // infinite loop once we're done
57
ARM ISA

Subroutine Calls
Textbook §2.6, 2.7, D.4
Subroutines

It is typical programming practice to reuse blocks of code in a
subroutine (i.e., procedure, function, method) that can be called from
many places in a program.
    int add3(int a, int b, int c) {
        return a + b + c;
    }

    void main() {
        int sum = 0;

        sum += add3(1, 2, 3);
        sum += 10;
        sum += add3(10, 20, 30);

        printf("Sum = %d\n", sum);
    }

58
Requirements for calling subroutines:
• We should be able to call a subroutine from anywhere in our
  program, i.e., change the PC so that the routine is executed
• A subroutine must be able to return, i.e., change the PC so that
  execution continues immediately after the point where it was called
• We should be able to pass parameters (or arguments) that may take
  different values across different calls
• A subroutine must be able to return a value

    int add3(int a, int b, int c) {
        return a + b + c;
    }

    void main() {
        int sum = 0;

        sum += add3(1, 2, 3);
        sum += 10;
        sum += add3(10, 20, 30);

        printf("Sum = %d\n", sum);
    }

59
Calling and Returning

A subroutine call is implemented with the Branch and Link
instruction BL that stores the address of the next instruction (return
address) in the link register LR (R14).
    BL addr    // LR <-- PC + 4; PC <-- addr

To return, branch to the address stored in the link register with the
BX instruction (branches to the address in a register).
    BX Rn      // PC <-- Rn

C code:
    boo() {
        coo();
        ...
    }
    coo() {
        ...
        return;
    }

ARM assembly:
    boo: BL coo    // LR <-- PC + 4; PC <-- coo
         ...
    coo: ...
         BX LR     // PC <-- LR
60
Nested Subroutine Calls

    boo() {
        coo();
    B1: doo();
    B2: return;
    }
    coo() {
        doo();
    C:  return;
    }
    doo() {
        return;
    }

• These calls are nested: boo calls coo, coo calls doo
• If we save return addresses in LR, calling doo from coo overwrites
  the return address back to boo!
• doo() is called from two different places, and is expected to return
  to different places for each call
• How do we remember the return addresses for each call, in the
  correct order? (I.e., the reverse call order.)

    boo calls coo         save B1
    coo calls doo         save C
    doo returns to coo    PC ← C
    coo returns to boo    PC ← B1
    boo calls doo         save B2
    doo returns to boo    PC ← B2

Which data structure shall we use to save these addresses?
61
We need a way to recall return addresses
(and later, other things) in the opposite
order they were saved.
We will use a Last-in-First-out (LIFO) data
structure called a stack! The stack is
saved in main memory, and accessed
with special load and store instructions.

source: Mk2010 / CC 4.0 BY-SA

62
Stack Operations

• push(value): adds new item value to top of the stack (TOS)


• value = pop(): returns and removes the top element
• value = peek(distance): returns (but does not remove)
the value of an element at a distance relative to TOS;
peek(0) returns the element at the TOS

source: Maxtremus / CC0

63
ARM Memory Layout

• Recall: text is where compiled code goes
• Recall: data is where compile-time statically allocated data goes
• The sizes of the text and data sections are fixed at compile-time
• The heap is where dynamically allocated (e.g., using new or malloc)
  data goes
• The heap starts at lower addresses and grows “downward” toward
  higher addresses
• The bottom of the stack is at a fixed address and the top of stack
  grows “upward,” towards lower memory addresses

[Figure: memory map from address 0x00000000 (top) to 0xffffffff
(bottom): text, data, heap, free space, stack]

64
The Stack in ARM

• The stack is used to support subroutines: saving return addresses,
  function arguments, etc.
• The data elements on the stack are always∗ words; memory accesses
  to the stack are always∗ aligned
• Register R13 is the stack pointer (SP); it points to TOS

[Figure: stack contents from top to bottom: -28 (top of stack,
pointed to by SP), 17, 739, ..., stack bottom]

65
∗ By convention; breaking from convention may break your code.

65
Stack Operations in ARM

Push from Rj
    STR Rj, [SP, #-4]!
    SP ← SP - 4
    Mem[SP] ← Rj

Pop into Rj
    LDR Rj, [SP], #4
    Rj ← Mem[SP]
    SP ← SP + 4

Peek(i) into Rj
    LDR Rj, [SP, #const]    where const = i ∗ 4
    Rj ← Mem[SP+const]

[Figure: stack contents from top to bottom: -28 (pointed to by SP),
17, 739, ...]

Assuming Rj=19, SP=0xFFFFABCC and i=2, what's the content of the
stack, register Rj, and SP, after each instruction executes?
(consider them separately)
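
The three operations can be modelled in a few lines of C. This is only a sketch under stated assumptions: the ArmStack type, the 64-word capacity, and the function names are hypothetical, and SP is kept as a byte offset into a small array rather than a real address such as 0xFFFFABCC.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 64                   /* hypothetical stack capacity */

typedef struct {
    uint32_t mem[STACK_WORDS];           /* backing storage, one word per slot */
    uint32_t sp;                         /* byte offset of TOS within mem */
} ArmStack;

/* An empty stack has SP just past the highest slot (the stack bottom). */
void stack_init(ArmStack *s) { s->sp = STACK_WORDS * 4; }

/* STR Rj, [SP, #-4]! : decrement SP first, then store. */
void stack_push(ArmStack *s, uint32_t rj) {
    assert(s->sp >= 4);
    s->sp -= 4;
    s->mem[s->sp / 4] = rj;
}

/* LDR Rj, [SP], #4 : load first, then increment SP. */
uint32_t stack_pop(ArmStack *s) {
    assert(s->sp < STACK_WORDS * 4);
    uint32_t rj = s->mem[s->sp / 4];
    s->sp += 4;
    return rj;
}

/* LDR Rj, [SP, #i*4] : read the i-th element below TOS; SP is unchanged. */
uint32_t stack_peek(const ArmStack *s, uint32_t i) {
    assert(s->sp + i * 4 < STACK_WORDS * 4);
    return s->mem[(s->sp + i * 4) / 4];
}
```

With -28, 17 and 739 on the stack (TOS first), pushing 19 lowers SP by 4 and makes 19 the new TOS, popping returns -28 and raises SP by 4, and peek(2) returns 739 without changing SP.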

66
Pushing and Popping Multiple Elements

Often, several elements need to be pushed/popped onto/from the
stack, e.g., at the start and end of subroutines.
There are two pseudoinstructions that are useful aliases for STM and
LDM (slide 38):

• PUSH {R1, R3-R5} is equivalent to
  STMDB SP!, {R1, R3-R5}
  (R5 is pushed first, and R1 ends up at the top of the stack)
• POP {R1, R3-R5} is equivalent to
  LDMIA SP!, {R1, R3-R5}
  (top of the stack ends up in R1)

67
Nested Subroutine Calls, Revisited

Subroutines that might call another subroutine must follow this
convention:

• Before you call a subroutine: push the return address stored in LR
  onto the stack
• When the subroutine returns: pop the return address off the stack
  into LR

    main() {
        boo();
    A:  ...;
    }
    boo() {
        push(LR);
        coo();
    B1: doo();
    B2: LR = pop();
        return;
    }
    coo() {
        push(LR);
        doo();
    C:  LR = pop();
        return;
    }
    doo() {
        return;
    }

    Action             Stack (TOS on left)   LR
    main calls boo                           A
    boo saves LR       A                     A
    boo calls coo      A                     B1
    coo saves LR       B1 A                  B1
    coo calls doo      B1 A                  C
    doo returns        B1 A                  C
    coo restores LR    A                     B1
    coo returns        A                     B1
    boo calls doo      A                     B2
    doo returns        A                     B2
    boo restores LR                          A
    boo returns                              A
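
The trace in the table can be replayed with a small C model. This is a sketch under stated assumptions: the return addresses A, B1, B2 and C are represented by hypothetical enum constants, LR is an ordinary global variable, and ret() merely logs the address that a real BX LR would jump to.

```c
#include <assert.h>

enum { RET_A = 1, RET_B1, RET_B2, RET_C };  /* stand-ins for return addresses */

int lr;                      /* models the link register */
int stack[8], depth;         /* models the stack of saved return addresses */
int returned_to[8], nret;    /* log of every address a routine returned to */

void push(int v) { assert(depth < 8); stack[depth++] = v; }
int  pop(void)   { assert(depth > 0); return stack[--depth]; }
void ret(void)   { returned_to[nret++] = lr; }    /* BX LR */

void doo(void) { ret(); }    /* leaf routine: LR is never overwritten */

void coo(void) {
    push(lr);                /* save our own return address */
    lr = RET_C;  doo();      /* BL doo */
    lr = pop();              /* restore LR before returning */
    ret();
}

void boo(void) {
    push(lr);
    lr = RET_B1; coo();      /* BL coo */
    lr = RET_B2; doo();      /* BL doo */
    lr = pop();
    ret();
}

/* main calls boo with return address A; returns the final stack depth. */
int run(void) {
    depth = 0; nret = 0;
    lr = RET_A;  boo();      /* BL boo */
    return depth;
}
```

Running the model logs the returns in the order C, B1, B2, A and leaves the stack empty, matching the table. Removing the push/pop pair from coo makes coo return to C instead of B1, which is the overwriting problem the convention avoids.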
68
Passing parameters and returning values

For a small number of parameters, the ARM APCS recommends using:


• R0 – R3 (A1 – A4) for passing parameters, and
• R0 (A1) for the return value
    int add3(int a, int b, int c) {
        return a + b + c;
    }

          MOV R0, #1
          MOV R1, #2
          MOV R2, #3
          PUSH {LR}       // STR LR,[SP,#-4]!; saves return address
          BL add3
          STR R0, SUM     // return value is in R0
          POP {LR}        // LDR LR,[SP],#4; restores return address
          ...

    add3: ADD R0, R0, R1
          ADD R0, R0, R2
          BX LR
69
ARM APCS Uses the Callee-save Convention
add3: ADD R0, R0, R1
ADD R0, R0, R2
BX LR

• In the previous example, the callee overwrote R0, which was OK,
since the caller knew that the return value would be in R0
• In general, the caller may need the register values after the
callee returns, so the rule is a callee is responsible for leaving
the registers as it found them
Callee-save convention:
A subroutine should save any∗ registers it wants to use on the stack
and then restore the original values to the registers after it is
finished using them.

70
∗ The ARM APCS states that argument registers A1 – A4 need not be
saved, but remember: they might be changed inside of subroutines!
70
Registers in the ARM Architecture Procedure Call Standard

Most registers are callee-saved: if a subroutine is going to use them,
their state must first be saved (on the stack), and later restored (from
the stack).

Register  Synonym  Special    Role in the AAPCS
r15                PC         Program counter
r14                LR         Link register
r13                SP         Stack pointer
r12                IP         Intra-procedure scratch register
r11       v8       FP         Frame pointer OR variable register 8
r10       v7                  Variable register 7
r9                 v6/SB/TR   Platform register
r8        v5                  Variable register 5
r7        v4                  Variable register 4
r6        v3                  Variable register 3
r5        v2                  Variable register 2
r4        v1                  Variable register 1
r3        a4                  Argument / scratch register 4
r2        a3                  Argument / scratch register 3
r1        a2                  Argument / result / scratch register 2
r0        a1                  Argument / result / scratch register 1

71
Passing Parameters On the Stack

When you have more than four parameters, you can pass four in
registers, and the additional ones on the stack. (This is what
compilers do, and what the APCS recommends.)
Or, you can pass all parameters and the return value on the stack.
Passing parameters in registers will always be faster. Why?
When you want to pass a data structure that does not fit into four
words, you must use the stack (for at least part of it). Example:
    struct largeDataStruct {
        int a;
        int b;
        int c;
        int d;
        int e;
    };

Let’s see how to pass everything on the stack with a program that
sums a list of numbers.
72
    ARRAY:   .word 6, 5, 4, 3, 2, 1, 14, 13, 12, 11, 10, 9, 8, 7  // sum these
    N:       .word 14               // this many of them
    SUM:     .space 4               // result goes here
             .global _start
    _start:  LDR  A1, =ARRAY        // A1 points to ARRAY
             LDR  A2, N             // A2 contains number of elements to add
             PUSH {A1, A2, LR}      // push parameters and LR (A1 is TOS)
             BL   listadd           // call subroutine
             LDR  A1, [SP, #0]      // return is at TOS
             STR  A1, SUM           // store it in memory
             ADD  SP, SP, #8        // clear parameters
             POP  {LR}              // restore LR
    stop:    B    stop
    listadd: PUSH {V1-V3}           // callee-save registers listadd uses
             LDR  V1, [SP, #16]     // load param N from stack
             LDR  V2, [SP, #12]     // load param ARRAY from stack
             MOV  A1, #0            // clear R0 (sum)
    loop:    LDR  V3, [V2], #4      // get next value from ARRAY
             ADD  A1, A1, V3        // form the partial sum
             SUBS V1, V1, #1        // decrement loop counter
             BGT  loop
             STR  A1, [SP, #12]     // store sum on stack, replacing ARRAY
             POP  {V1-V3}           // restore registers
             BX   LR

73
Passing by Value, Passing by Reference

Recap from C:

• Passing by value: a copy of the value is passed to the callee. If
  the copy is modified, there is no effect on the caller side.
• Passing by reference: an address in memory where the value is
  stored is passed. The callee may modify the value.

    int add3Val(int a) {
        a = a + 3;
        return a;
    }
    void add3Ref(int *a) {
        *a = (*a) + 3;
    }
    void main() {
        int i = 77;
        int j;

        j = add3Val(i);
        print(i);
        print(j);

        add3Ref(&i);
        print(i);
        print(j);
    }

74
ARRAY: .word 6,5,4,3,2,1,14,13,12,11,10,9,8,7
N: .word 14
...

LDR A1, =ARRAY // A1 points to ARRAY


LDR A2, N // A2 contains number of elements to add
PUSH {A1, A2, LR} // push parameters and LR (A1 is TOS)
BL listadd // call subroutine

• The parameter N was passed by value, i.e., the actual value of N
  (14) was passed to the subroutine; although the subroutine modified
  its copy (decrementing it as the loop counter), the value in memory
  was not changed.
• The parameter ARRAY was passed by reference, i.e., a pointer to
  the first element of the array was passed; if we’d changed
  elements, they’d have been changed in memory.

75
Stack Frame

• The subroutine can∗ also allocate local variables, only accessible
  by the subroutine, on the stack.
• Using a frame pointer (R11) gives a consistent reference to
  parameters [FP, #const] and local variables [FP, #-const]
• When nesting, the stack frame also includes the return address and
  frame pointer
• FP is not strictly required; it is mainly used to make assembly
  programs easier to write, and to help with the debugger
• FP remains constant while in the same subroutine

Stack frame layout (top of stack first):
    ...
    localvar3    <-- SP
    localvar2
    localvar1
    saved R4
    saved R5
    saved R6
    saved LR
    saved FP     <-- FP
    param1
    param2
    param3
    param4
    ...          (old TOS)
    ...
76
∗ Most local variables are actually allocated this way, reducing the
total memory required by a program.
76
ARM Instruction Encoding
Textbook §2.13
ARM Assembly vs. Binary

Machine language instructions are encoded as binary, with 32∗ bits
per instruction (ARM ISA is RISC).
The binary representation of an instruction is divided into fields.
Each field encodes different information about the instruction.
The general format for most instructions:

[Figure: 32-bit instruction word, bits 31 down to 0; the Cond field
occupies bits 31-28]

source: https://alisdair.mcdiarmid.org/arm-immediate-value-encoding/

77
∗ 16-bit versions are available for many instructions, but such
instructions tend to be less flexible.

77
Condition Field

Cond. field  Suffix   Meaning                               CPSR Flags
0000         EQ       EQual (zero)                          Z=1
0001         NE       Not Equal (nonzero)                   Z=0
0010         CS/HS    Carry Set / unsigned Higher or Same   C=1
0011         CC/LO    Carry Clear / unsigned Lower          C=0
0100         MI       MInus (negative)                      N=1
0101         PL       PLus (positive or zero)               N=0
0110         VS       oVerflow Set                          V=1
0111         VC       oVerflow Clear                        V=0
1000         HI       unsigned Higher                       C=1 AND Z=0
1001         LS       unsigned Lower or Same                C=0 OR Z=1
1010         GE       signed Greater or Equal               N=V
1011         LT       signed Less Than                      N!=V
1100         GT       signed Greater Than                   Z=0 AND (N=V)
1101         LE       signed Less or Equal                  Z=1 OR (N!=V)
1110         AL       ALways executed                       None tested

1111 is not used.

78
Data Processing Instruction Encoding

Bits    31-28  27-26  25  24-21  20  19-16  15-12  11-0
Field   Cond   0 0    I   0100   S   Rn     Rd     Operand2

source: https://alisdair.mcdiarmid.org/arm-immediate-value-encoding/

Examples:
ADDGES R1, R2, R3

Cond=1010, I=0, S=1, Rn=0010, Rd=0001, Operand2[3-0]=0011

ADD R1, R2, #15

Cond=1110, I=1, S=0, Rn=0010, Rd=0001, Operand2=000000001111

Why are the register fields 4 bits wide?
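
Packing the fields into a word is a matter of shifts and ORs. The following is a minimal sketch in C; the function name encode_dp and its parameter order are assumptions made for this illustration, and the 4-bit value in bits 24-21 is treated as an opcode parameter (0100 in the ADD examples above).

```c
#include <assert.h>
#include <stdint.h>

/* Packs the fields of a data-processing instruction into a 32-bit word.
 * opcode is the 4-bit value placed in bits 24-21 (0100 for ADD). */
uint32_t encode_dp(uint32_t cond, uint32_t i, uint32_t opcode, uint32_t s,
                   uint32_t rn, uint32_t rd, uint32_t operand2) {
    assert(cond < 16 && i < 2 && opcode < 16 && s < 2);
    assert(rn < 16 && rd < 16 && operand2 < 4096);
    return (cond << 28) | (i << 25) | (opcode << 21) | (s << 20)
         | (rn << 16) | (rd << 12) | operand2;
}
```

Under these assumptions, ADDGES R1, R2, R3 packs to 0xA0921003 and ADD R1, R2, #15 packs to 0xE282100F.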

79
Immediate Value Encoding

12 bits (Operand2) are available to encode an immediate value.
However, the largest value is not what you might think.
The ARM ISA has a very clever way of generating a lot of useful 32-bit
constants: 16 possible rotations of an 8-bit value

Bits    11-8     7-0
Field   Rotate   Immediate

source: https://alisdair.mcdiarmid.org/arm-immediate-value-encoding/

80
[Figure: for each Rotate value 0x0 to 0xF, the position of the 8
immediate bits (7 down to 0) within the 32-bit word; with rotation 0x0
they occupy bits 7-0, and with rotations 0x1 to 0x3 they wrap around
from the low end of the word to the high end]

source: https://alisdair.mcdiarmid.org/arm-immediate-value-encoding/

The 8-bit immediate is rotated right by twice the value of the Rotate
field, i.e., by an even number of bit positions within a 32-bit word
(0, 2, ..., 30).
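
The rotation scheme can be checked with a short C sketch. The function names ror32, decode_imm12 and encode_imm12 are hypothetical; the encoder simply tries all 16 rotations and reports failure when no 8-bit value fits, which is an illustration of the idea rather than how any particular assembler is implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Rotate a 32-bit value right by r bits (0 <= r < 32). */
uint32_t ror32(uint32_t v, unsigned r) {
    assert(r < 32);
    return r == 0 ? v : (v >> r) | (v << (32 - r));
}

/* Expands a 12-bit immediate field: imm8 rotated right by 2 * rotate. */
uint32_t decode_imm12(uint32_t imm12) {
    assert(imm12 < 4096);
    unsigned rotate = imm12 >> 8;
    return ror32(imm12 & 0xFFu, 2 * rotate);
}

/* Searches for an encoding of value; returns false if none exists. */
bool encode_imm12(uint32_t value, uint32_t *imm12) {
    for (unsigned rotate = 0; rotate < 16; rotate++) {
        /* Undo the right-rotation by rotating left by the same amount. */
        uint32_t imm8 = ror32(value, (32 - 2 * rotate) % 32);
        if (imm8 <= 0xFFu) {
            *imm12 = (rotate << 8) | imm8;
            return true;
        }
    }
    return false;
}
```

For instance, 0xFF000000 is encodable (Rotate = 0x4, Immediate = 0xFF), whereas 0x101 is not, because its two set bits are nine positions apart and cannot fit in any rotated 8-bit window.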
81
Load/Store Instruction Encoding

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Cond OPcode S Rn Rd Operand2

Rn is the base address.
Operand2 is the offset: an immediate value, or register value (four
LSBs), or register (four LSBs) and shift amount (five MSBs).
Note that:

• Not every addressing mode is available for every load/store
  instruction.
• The range of permitted immediate values and the options for
  scaled registers vary from instruction to instruction.

82
Branch Instruction Encoding
Bits    31-28  27-25  24  23-0
Field   Cond   1 0 1  L   offset

Since the offset field is limited to 24 bits:

• the branch target address is relative to the current value of PC,
• the offset is left-shifted by two bits (offset is in words, not bytes)
L=1 is used for the BL instruction.

    Address (decimal)   Contents
    ...
    1000                BEQ LABEL
    1004                Fall-through
    ...
    1008                (updated PC)
    ...
    1100                LABEL: Target
    ...

In this example, we want to jump to address 1100, which is 100 bytes
away. The relative offset is 92 bytes (100 − 8)
= 23 words
= 0000 0000 0000 0000 0001 0111.
The condition field is EQ = 0000.
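
The offset calculation can be written out as a small C sketch. The function name encode_branch and its parameters are assumptions made for this illustration; it follows the rule stated above, i.e., the offset is measured from the updated PC (the branch address plus 8) and stored in words.

```c
#include <assert.h>
#include <stdint.h>

/* Builds the 32-bit encoding of a B{cond} (link = 0) or BL{cond}
 * (link = 1) located at byte address addr that jumps to target. */
uint32_t encode_branch(uint32_t cond, uint32_t link,
                       uint32_t addr, uint32_t target) {
    assert(cond < 16 && link < 2);
    assert(addr % 4 == 0 && target % 4 == 0);
    /* The PC has already advanced by 8 when the branch executes. */
    int32_t byte_offset = (int32_t)(target - (addr + 8));
    int32_t word_offset = byte_offset / 4;   /* exact: multiple of 4 */
    assert(word_offset >= -(1 << 23) && word_offset < (1 << 23));
    uint32_t offset24 = (uint32_t)word_offset & 0x00FFFFFFu;
    return (cond << 28) | (0x5u << 25) | (link << 24) | offset24;
}
```

For the example above (BEQ at address 1000, target 1100) the offset field is 23 and the whole instruction word is 0x0A000017. A backward branch gives a negative offset in two's complement, e.g., an unconditional branch to itself has offset -2, i.e., 0xFFFFFE.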
83
Conclusions

This set of lectures has presented the ARM ISA and introduced:

• the major classes of instructions


• the different addressing modes used by memory accesses
• the way ARM branches work
• the way subroutine calls are implemented in assembly with the
stack
• the encoding of instructions in binary

The next lecture will look at:

• the software toolchain used to translate high-level languages to
  machine code
• the role of the operating system software

84
