A Brief History of ARM

ARM was founded in 1990 as a separate company (Apple had a 43% stake). The first ARM prototype came alive on 26 April 1985: 24,800 transistors, 50 mm2, consuming 120 mW of power. Acorn's commercial ARM2 processor: 8 MHz, 26-bit addressing, 3-stage pipeline. ARM610 for Newton in 1992, ARM7TDMI for Nokia in 1994. The Cortex-A9 introduces out-of-order instruction issue and completion, register renaming to enable execution speculation, and a non-blocking memory system.

Uploaded by

Shilpa Gireesh
Copyright
© Attribution Non-Commercial (BY-NC)
ARM Architecture & NEON

Ian Rickards
Stanford University 28 Apr 2010

A brief history of ARM

 First ARM prototype came alive on 26-Apr-1985, 3um technology, 24800 transistors, 50mm2, consumed 120mW of power
 Acorn’s commercial ARM2 processor: 8-MHz, 26-bit addressing, 3-stage pipeline
 ARM founded in October 1990, separate company (Apple had 43% stake)
 ARM610 for Newton in 1992, ARM7TDMI for Nokia in 1994

1 2

ARM 25 years later: Cortex-A9 MP

 1-4 way MP with optimized MESI
 16KB, 32KB, 64KB I & D caches
 128KB-8MB L2
 Multi-issue, Speculation, Renaming, OOO
 High performance FPU option
 NEON SIMD option
 Thumb-2
 AXI bus
 Gatecount: 500K (32KB I/D L1’s), 600K (core), 500K (NEON)
 40G “Low Power” macro: ~5mm2, 800MHz, 0.5W
 40G “High Performance” macro: ~7mm2 2GHz (typ), 2W

Cortex-A9 Processor Microarchitecture

 Introduces out-of-order instruction issue and completion
 Register renaming to enable execution speculation
 Non-blocking memory system with load-store forwarding
 Fast loop mode in instruction pre-fetch to lower power consumption

3 4
Cortex-A9 MPCore Multicore Structure

 Configurable between 1 and 4 CPUs with optional NEON and/or Floating-point Unit
 Hardware Coherence for Cache, MMU and TLB maintenance operations
 Coherent access to processor caches from accelerators and DMA
 Flexible configuration and power-aware interrupt controller
 Design flexibility over memory throughput and latency
 Secure and Virtualization aware interrupt and IPI communications

[Block diagram: four Cortex-A9 CPUs, each with FPU/NEON, TRACE, Instruction Cache and Data Cache; Snoop Control Unit (SCU) with Generalized Interrupt Control and Distribution, Accelerator Coherence, Cache-2-Cache Transfers, Snoop Filtering and Timers; Advanced Bus Interface Unit; L2 Cache Controller (PL310) 128K-8MB; Primary AMBA 3 64bit Interface; Optional 2nd I/F with Address Filtering]

Hard Macro Configuration and Floorplan

 Osprey configuration includes level 2 cache controller and Cortex A9 integration level
 Top level includes Coresight PTM, CTI and CTM
 Implementation using r1p1 version of Cortex A9
 Dual core
 32k I$ and D$
 NEON present on both cores
 PTM interface present
 128 interrupts
 ACP present
 Two AXI master ports
 Level 2 cache memories external (interface exposed)

[Figures: falcon_cpu floorplan; Elba top level floorplan]

5 6

Why is ARM looking at “G” processes?

 “G” can achieve around double the MHz of “LP”
 Active power is lower on “G” than “LP”
 Example: push 40LP to 800MHz, to compare with 800MHz MID macro
 The estimated LP numbers correlate to an accelerated implementation of an A8
 G is close in terms of power if lowered to same performance as on LP.
 G can scale much higher in terms of performance than LP.

Understanding power

 Fundamental power parameters
 Average power => battery life
 Thermal Power: sustained power @ max performance
 Key requirement is “run and power off” quickly

[Figure: power over time for a workload of GUI updates, web page render and music, on a traditional LP process, on the 40G process (2-3x faster clock) and on Osprey, which powers off between bursts]

7 8
Power Domains

 HiP and MID macros have same power domains
 Both use distributed coarse grain power switches
 Power plan for CPUs is symmetric
 A9 core and its L1 is power gated in lockstep
 Note that all power domains are only ON or OFF, there is no hardware retention mode
 Software routine enables retention to RAM

[Figure: “Osprey macro” power domains A9_PL310 and A9_PL310_noram, containing Data Engine 0, Data Engine 1, PTM/Debug, A9 CORE 0 + 32K I/D, A9 CORE 1 + 32K I/D, SCU + PL310_noram, and L2 Cache RAM 512/1024KB]

Single-thread Coremarks/MHz

 Single-thread performance is key for GUI based applications

Processor    Coremarks/MHz
Atom         1.85
Cortex-A9    2.95
Cortex-A8    2.72
1004K        2.33
74K          2.30

9 10

Floating Point Performance

[Chart not recovered]

Higher Flash Actionscript from A9

[Chart not recovered; one series labelled Intel]

11 12
ARM Architecture evolution

 Some not-entirely-RISC features
   LDM / STM
   Full predicated execution (ADDEQ r0, r1, r2)
   Carefully designed with customer/partner input considering gatecount
 Thumb
   16-bit instruction set (mostly using r0-r7) selected for compiler requirements
   Design goals: performance from 16-bit wide ROM, codesize
   Thumb-2 in Cortex-A extends original Thumb (allows 32-bit/16-bit mix)
   Beneficial today – better performance from small caches
 Jazelle
   CPU mode allows direct execution of Java bytecodes
   ~60% of Java bytecodes directly executed by datapath
   Top of Java stack stored in registers
   Widely used in Nokia & DoCoMo handsets
…

Dummies’ guide to Si implementation

 Basic Fab tech
   65nm, 40nm, 32nm, 28nm, etc.
 G vs. LP technology
   40G is 0.9V process, 40LP is 1.1V process
   Much lower leakage with LP, but half the performance
   Intermediate “LPG” from TSMC too! Island of G within LP
 Vt’s – each Vt requires additional mask step
   HVt – lower leakage, but slower
   RVt – regular Vt
   LVt – faster, but high leakage esp. at high temperature
 Cell library track size
   9-track, 12-track, 15-track (bigger => more powerful)
   Backed off implementation vs. pushed implementation
 High-K metal Gate
 Clock gating
 Well biasing…

13 14

ARM Architecture Evolution

[Figure: Key Technology Additions by Architecture Generation. V5 (ARM9, ARM10): VFPv2, Jazelle®. V6 (ARM11): SIMD, Thumb®-2, TrustZone™. V7 A&R: Thumb-EE (Execution Environments: Improved memory use), VFPv3, NEON™ Adv SIMD (Improved Media and DSP). V7 M: Thumb-2 Only, Low Cost MCU.]

What is NEON?

 NEON is a wide SIMD data processing architecture
   Extension of the ARM instruction set
   32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide)
 NEON Instructions perform “Packed SIMD” processing
   Registers are considered as vectors of elements of the same data type
   Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single prec. float
   Instructions perform the same operation in all lanes

[Figure: Source Registers Dn and Dm are split into Elements; the Operation is applied per Lane and written to the Destination Register Dd]

15 16
Data Types

 NEON natively supports a set of common data types
   Integer and Fixed-Point; 8-bit, 16-bit, 32-bit and 64-bit
   32-bit Single-precision Floating-point
 Data types are represented using a bit-size and format letter

Element size   Suffixes                   Meaning
8-bit          .8 .I8 .S8 .U8 .P8         Signed, Unsigned Integers; Polynomials
16-bit         .16 .I16 .S16 .U16 .P16    Signed, Unsigned Integers; Polynomials
32-bit         .32 .I32 .S32 .U32 .F32    Signed, Unsigned Integers; Floats
64-bit         .64 .I64 .S64 .U64         Signed, Unsigned Integers

Registers

 NEON provides a 256-byte register file
   Distinct from the core registers
   Extension to the VFPv2 register file (VFPv3)
 Two explicitly aliased views
   32 x 64-bit registers (D0-D31)
   16 x 128-bit registers (Q0-Q15)
 Enables register trade-off
   Vector length
   Available registers
 Also uses the summary flags in the VFP FPSCR
   Adds a QC integer saturation summary flag
   No per-lane flags, so ‘carry’ handled using wider result (16bit+16bit -> 32-bit)

[Figure: Q0 aliases D0 and D1, Q1 aliases D2 and D3, ..., Q15 aliases D30 and D31]
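The two aliased register views and the "wider result instead of a carry flag" point on the Registers slide can be modelled in plain C. This is a minimal sketch, not ARM code: `neon_regfile`, `q_half` and `add_long_u16` are hypothetical names introduced here for illustration, and the union only imitates how Qn overlays D(2n) and D(2n+1).

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model only: one 256-byte register file seen two ways. */
typedef union {
    uint64_t d[32];     /* D0-D31: 32 x 64-bit view                      */
    uint64_t q[16][2];  /* Q0-Q15: 16 x 128-bit view, [0] = low 64 bits  */
} neon_regfile;

/* Read the low (high = 0) or high (high = 1) half of Qn via the Q view. */
uint64_t q_half(const neon_regfile *rf, unsigned n, unsigned high)
{
    assert(rf != 0 && n < 16 && high < 2);
    return rf->q[n][high];
}

/* One lane of a widening add: 16-bit + 16-bit -> 32-bit. The carry out
   of bit 15 is kept in bit 16 of the result, so no per-lane flag is
   needed. */
uint32_t add_long_u16(uint16_t a, uint16_t b)
{
    return (uint32_t)a + (uint32_t)b;
}
```

Writing `d[2]` and `d[3]` and then reading Q1 through `q_half` returns the same two values, which is the register trade-off the slide describes: fewer, longer vectors or more, shorter ones over the same storage.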

17 18

Vectors and Scalars

 Registers hold one or more elements of the same data type
   Vn can be used to reference either a 64-bit Dn or 128-bit Qn register
   A register, data type combination describes a vector of elements

[Figure: 64-bit Dn examples: D0 as 1 x I64, D7 as 2 x S32. 128-bit Qn examples: Q0 as 4 x F32, Q7 as 16 x S8]

 Some instructions can reference individual scalar elements
   Scalar elements are referenced using the array notation Vn[x]

[Figure: Q0 as four F32 lanes, labelled Q0[3] Q0[2] Q0[1] Q0[0] from most to least significant]

 Array ordering is always from the least significant bit

NEON in Audio

 FFT: 256-point, 16-bit signed complex numbers
 FFT is a key component of AAC, Voice/pattern recognition etc.
 Hand optimized assembler in both cases

FFT time                              No NEON (v6 SIMD asm)   With NEON (v7 NEON asm)
Cortex-A8 500MHz (actual silicon)     15.2 us                 3.8 us (x 4.0 performance)

 Extreme example: FFT in ffmpeg: 12x faster
   C code -> handwritten asm
   Scalar -> vector processing
   Unpipelined FPU -> pipelined NEON single precision FPU

19 20
For NEON instruction reference

 Official NEON instruction Set reference is “Advanced SIMD” in ARM Architecture Reference Manual v7 A & R edition
 Available to partners & www.arm.com request system

How to use NEON

OpenMAX DL library
 Library of common codec components and signal processing routines
 Status: Released on http://www.arm.com/products/esd/openmax_home.html
Vectorizing Compilers
 Exploits NEON SIMD automatically with existing source code
 Status: Released (in RVDS 3.1 Professional and later)
 Status: Codesourcery 2007q3 gcc and later

C Intrinsics
 C function call interface to NEON operations
 Supports all data types and operations supported by NEON
 Status: Released (in RVDS 3.0+ and Codesourcery 2007q3 gcc)

Assembler
 For those who really want to optimize at the lowest level
 Status: Released (in RVDS 3.0+ & Codesourcery 2007q3 gcc/gas)

21 22

ARM RVDS & gcc vectorising compiler

    int a[256], b[256], c[256];
    void foo(void) {
        int i;
        for (i=0; i<256; i++){
            a[i] = b[i] + c[i];
        }
    }

armcc -S --cpu cortex-a8 -O3 -Otime --vectorize test.c

    |L1.16|
        VLD1.32 {d0,d1},[r0]!
        SUBS r3,r3,#1
        VLD1.32 {d2,d3},[r1]!
        VADD.I32 q0,q0,q1
        VST1.32 {d0,d1},[r2]!
        BNE |L1.16|

gcc -S -O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -ftree-vectorizer-verbose=6 test.c

    .L2:
        add r1, r0, ip
        add r3, r0, lr
        add r2, r0, r4
        add r0, r0, #8
        cmp r0, #1024
        fldd d7, [r3, #0]
        fldd d6, [r2, #0]
        vadd.i32 d7, d7, d6
        fstd d7, [r1, #0]
        bne .L2

 armcc generates better NEON code
(gcc can use Q-regs with ‘-mvectorize-with-neon-quad’ )

Intrinsics

 Include intrinsics header file

    #include <arm_neon.h>

 Use special NEON data types which correspond to D and Q registers, e.g.

    int8x8_t    D-register containing 8x 8-bit elements
    int16x4_t   D-register containing 4x 16-bit elements
    int32x4_t   Q-register containing 4x 32-bit elements

 Use special intrinsics versions of NEON instructions

    vin1 = vld1q_s32(ptr);
    vout = vaddq_s32(vin1, vin2);
    vst1q_s32(ptr, vout);

 Strongly typed!
 Use vreinterpret_s16_s32( ) to change the type
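The `a[i] = b[i] + c[i]` loop from the compiler slide can also be written by hand with the intrinsics listed on the Intrinsics slide. The sketch below is one possible shape, not the code any compiler emits: `add_arrays` and `HAVE_NEON` are hypothetical names, and the `__ARM_NEON` check with a scalar fallback is an assumption added so the same file also builds on targets without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* dst[i] = x[i] + y[i] for n elements. */
void add_arrays(int32_t *dst, const int32_t *x, const int32_t *y, size_t n)
{
    size_t i = 0;
    assert(dst != 0 && x != 0 && y != 0);
#if HAVE_NEON
    /* Four 32-bit lanes per iteration, held in Q registers. */
    for (; i + 4 <= n; i += 4) {
        int32x4_t vx = vld1q_s32(x + i);
        int32x4_t vy = vld1q_s32(y + i);
        vst1q_s32(dst + i, vaddq_s32(vx, vy));
    }
#endif
    /* Scalar tail, or the whole loop when NEON is unavailable. */
    for (; i < n; i++)
        dst[i] = x[i] + y[i];
}
```

The tail loop handles lengths that are not a multiple of four, which the 256-element example on the slide never needs but general callers would.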
23 24
NEON in opensource

 Bluez – official Linux Bluetooth protocol stack
   NEON sbc audio encoder
 Pixman (part of cairo 2D graphics library)
   Compositing/alpha blending
   X.Org, Mozilla Firefox, fennec, & Webkit browsers
   e.g. fbCompositeSolidMask_nx8x0565neon 8x faster using NEON
 ffmpeg – libavcodec
   LGPL media player used in many Linux distros
   NEON Video: MPEG-2, MPEG-4 ASP, H.264 (AVC), VC-1, VP3, Theora
   NEON Audio: AAC, Vorbis, WMA
 x264 – Google Summer Of Code 2009
   GPL H.264 encoder – e.g. for video conferencing
 Android – NEON optimizations
   Skia library, S32A_D565_Opaque 5x faster using NEON
   Available in Google Skia tree from 03-Aug-2009
 Eigen2 linear algebra library
 Ubuntu 09.04 – supports NEON
   NEON versions of critical shared-libraries

Many different levels of parallelism

 Multi-issue parallelism
 NEON SIMD parallelism
 Multi-core parallelism

25 26

ffmpeg (libavcodec) performance

 git.ffmpeg.org snapshot 21-Sep-09
 YouTube HQ video decode
   480x270, 30fps
   Including AAC audio
 Real silicon measurements
   OMAP3 Beagleboard
   ARM A9TC
 NEON ~2x overall performance

Scalability with SMP on Cortex-A9

[Chart not recovered]

27 28
NEON optimization example

Skia library S32A_D565_Opaque

Size   Reference C   Google v6 asm   NEON asm   RVDS
60     100%          128%            24%        64%
64     100%          128%            22%        68%
68     100%          127%            23%        63%
980    100%          73%             23%        58%
986    100%          73%             23%        58%

29 30

Processing code

    vmovn.u16 d4, q12
    vshr.u16 q11, q12, #5
    vshr.u16 q10, q12, #6+5
    vmovn.u16 d5, q11
    vmovn.u16 d6, q10
    vshl.u8 d4, d4, #3
    vshl.u8 d5, d5, #2
    vshl.u8 d6, d6, #3

    vmovl.u8 q14, d31
    vmovl.u8 q13, d31
    vmovl.u8 q12, d31

    vmvn.8 d30, d3
    vmlal.u8 q14, d30, d6
    vmlal.u8 q13, d30, d5
    vmlal.u8 q12, d30, d4

    vshr.u16 q8, q14, #5
    vshr.u16 q9, q13, #6
    vaddhn.u16 d6, q14, q8
    vshr.u16 q8, q12, #5
    vaddhn.u16 d5, q13, q9
    vqadd.u8 d6, d6, d0
    vaddhn.u16 d4, q12, q8

    vqadd.u8 d6, d6, d0
    vqadd.u8 d5, d5, d1
    vqadd.u8 d4, d4, d2

    vshll.u8 q10, d6, #8
    vshll.u8 q3, d5, #8
    vshll.u8 q2, d4, #8
    vsri.u16 q10, q3, #5
    vsri.u16 q10, q2, #11

Cortex-A8 TRM

[Slide content not recovered]
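The last five instructions of the processing code (three VSHLL.U8 #8 followed by two VSRI.U16) repack three 8-bit channels into one 5-6-5 pixel. A per-lane scalar model makes the bit movement visible. This is a sketch under assumptions: the function names are hypothetical, and the channel names `hi`, `mid` and `lo` simply follow the d6, d5, d4 order of the listing rather than asserting which colour each register holds.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of VSHLL.U8 #8: widen 8 -> 16 bits, shifted left by 8. */
uint16_t vshll8_lane(uint8_t x)
{
    return (uint16_t)((uint16_t)x << 8);
}

/* One lane of VSRI.U16: shift src right by `shift` and insert it into
   dst, leaving the top `shift` bits of dst untouched. */
uint16_t vsri16_lane(uint16_t dst, uint16_t src, unsigned shift)
{
    assert(shift >= 1 && shift <= 16);
    uint16_t inserted = (uint16_t)(0xFFFFu >> shift); /* bits overwritten */
    return (uint16_t)((dst & (uint16_t)~inserted) | (src >> shift));
}

/* Pack three 8-bit channels into 5-6-5 with the same three steps. */
uint16_t pack565_lane(uint8_t hi, uint8_t mid, uint8_t lo)
{
    uint16_t px = vshll8_lane(hi);               /* hi in bits 15..8     */
    px = vsri16_lane(px, vshll8_lane(mid), 5);   /* keep top 5 bits      */
    px = vsri16_lane(px, vshll8_lane(lo), 11);   /* keep top 11 bits     */
    return px;
}
```

The result equals `((hi >> 3) << 11) | ((mid >> 2) << 5) | (lo >> 3)`, so the shift-and-insert pair replaces the separate mask, shift and OR steps a scalar version would need.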

31 32
Quick review of NEON instructions

Multiple 1-Element Structure Access

 VLD1, VST1 provide standard array access
   An array of structures containing a single component is a basic array
   List can contain 1, 2, 3 or 4 consecutive registers
   Transfer multiple consecutive 8, 16, 32 or 64-bit elements

[Figure: VLD1.16 {D7}, [R4], R3 loads x0, x1, x2, x3 from [R4], +2, +4, +6 into D7 (x3 x2 x1 x0), then post-increments R4 by R3]

[Figure: VST1.16 {D3,D4}, [R1] stores D3 (x3 x2 x1 x0) and D4 (x7 x6 x5 x4) as x0 to x7 at [R1], +2, +4, ..., +14]

33 34

Addition: Basic

 NEON supports various useful forms of basic addition
 Normal Addition - VADD, VSUB
   Floating-point
   Integer (8-bit to 64-bit elements)
   64-bit and 128-bit registers

    VADD.I16 D0, D1, D2
    VSUB.F32 Q7, Q1, Q4
    VADD.I8 Q15, Q14, Q15
    VSUB.I64 D0, D30, D5

 Long Addition - VADDL, VSUBL
   Promotes both inputs before operation
   Signed/unsigned (8-bit to 32-bit source elements)

    VADDL.U16 Q1, D7, D8
    VSUBL.S32 Q8, D1, D5

 Wide Addition - VADDW, VSUBW
   Promotes one input before operation
   Signed/unsigned (8-bit to 32-bit source elements)

    VADDW.U8 Q1, Q7, D8
    VSUBW.S16 Q8, Q1, D5

Example – adding all lanes

 Input in Q0 (D0 and D1)
 u16 input values

    VPADDL.U16 Q0, Q0

 Now Q0 contains 4x u32 values (with 15 headroom bits)

    VPADD.U32 D0, D0, D1

 Reducing/folding operation needs 1 bit of headroom

    VPADDL.U32 D0, D0

[Figure: register contents after each step: D0 D1, then D0 D1, then D0, then D0]
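The three-instruction "adding all lanes" sequence can be traced with a scalar model of each step. This is an illustrative sketch, assuming lane 0 is the least significant element; the function names are hypothetical and chosen to mirror the mnemonics.

```c
#include <assert.h>
#include <stdint.h>

/* VPADDL.U16 Q0, Q0: add adjacent u16 pairs, widening to four u32. */
void vpaddl_u16_q(uint32_t out[4], const uint16_t in[8])
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint32_t)in[2 * i] + (uint32_t)in[2 * i + 1];
}

/* VPADD.U32 D0, D0, D1: pairwise add without widening. The low result
   lane comes from the first operand, the high lane from the second. */
void vpadd_u32_d(uint32_t out[2], const uint32_t d0[2], const uint32_t d1[2])
{
    uint32_t lo = d0[0] + d0[1];
    uint32_t hi = d1[0] + d1[1];
    out[0] = lo;
    out[1] = hi;
}

/* VPADDL.U32 D0, D0: add the remaining pair, widening to one u64. */
uint64_t vpaddl_u32_d(const uint32_t d[2])
{
    return (uint64_t)d[0] + (uint64_t)d[1];
}

/* Sum eight u16 lanes using the three steps in order. */
uint64_t sum_all_lanes(const uint16_t in[8])
{
    uint32_t q0[4], d0[2];
    assert(in != 0);
    vpaddl_u16_q(q0, in);              /* 8 x u16 -> 4 x u32 */
    vpadd_u32_d(d0, &q0[0], &q0[2]);   /* D0 = q0[0..1], D1 = q0[2..3] */
    return vpaddl_u32_d(d0);           /* 2 x u32 -> 1 x u64 */
}
```

Because the first step already widens to 32 bits, the non-widening VPADD in the middle cannot overflow: each input is at most 17 bits wide, which is the headroom the slide refers to.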

35 36
Exercise 2 - summing a vector

[Figure: adder tree that sums the lanes of D0 and D1 pairwise, first into D0 and D1, then into D0, leaving the total in D0]

Some NEON clever features

37 38

Data Movement: Table Lookup

 Uses byte indexes to control byte look up in a table
 Table is a list of 1,2,3 or 4 adjacent registers

    VTBL.8 D0, {D1, D2}, D3

    Indexes D3 (lane 7 .. lane 0):       11  4  8 13 26  8  0  3
    Table {D1,D2} (byte 15 .. byte 0):   p o n m l k j i h g f e d c b a
    Result D0 (lane 7 .. lane 0):        l  e  i  n  0  i  a  d

 VTBL : out of range indexes generate 0 result
 VTBX : out of range indexes leave destination unchanged

Element Load Store Instructions

 All treat memory as an array of structures (AoS)
 SIMD registers are treated as structure of arrays (SoA)
 Enables interleaving/de-interleaving for efficient SIMD processing
 Transfer up to 256-bits in a single instruction

[Figure: memory laid out as x0 y0 z0 x1 y1 z1 x2 y2 z2 x3 ..., with one element and one 3-element structure marked]

 Three forms of Element Load Store instructions are provided
 Forms distinguished by type of register list provided
   Multiple Structure Access e.g. {D0, D1}
   Single Structure Access e.g. {D0[2], D1[2]}
   Single Structure Load to all lanes e.g. {D0[], D1[]}
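The difference between VTBL and VTBX on the Table Lookup slide comes down to what happens to out-of-range indexes. A scalar model shows it directly; `vtbl8` and `vtbx8` are hypothetical helper names, and the table is passed as a flat byte array standing in for the 1 to 4 adjacent D registers.

```c
#include <assert.h>
#include <stdint.h>

/* VTBL.8 model: dst[i] = table[idx[i]], or 0 when idx[i] is out of
   range. table_len is 8, 16, 24 or 32 bytes (1 to 4 D registers). */
void vtbl8(uint8_t dst[8], const uint8_t *table, unsigned table_len,
           const uint8_t idx[8])
{
    assert(table_len >= 8 && table_len <= 32 && table_len % 8 == 0);
    for (int i = 0; i < 8; i++)
        dst[i] = (idx[i] < table_len) ? table[idx[i]] : 0;
}

/* VTBX.8 model: same lookup, but an out-of-range index leaves the
   destination lane unchanged. */
void vtbx8(uint8_t dst[8], const uint8_t *table, unsigned table_len,
           const uint8_t idx[8])
{
    assert(table_len >= 8 && table_len <= 32 && table_len % 8 == 0);
    for (int i = 0; i < 8; i++)
        if (idx[i] < table_len)
            dst[i] = table[idx[i]];
}
```

With the 16-byte table `a` to `p` and the slide's indexes written lane 0 first (3, 0, 8, 26, 13, 8, 4, 11), `vtbl8` yields d, a, i, 0, n, i, e, l, matching the result register in the figure when read from the right.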

39 40
Multiple 2-Element Structure Access

 VLD2, VST2 provide access to multiple 2-element structures
   List can contain 2 or 4 registers
   Transfer multiple consecutive 8, 16, or 32-bit 2-element structures

[Figure: VLD2.16 {D2,D3}, [R1] reads x0 y0 x1 y1 x2 y2 x3 y3 from [R1], +2, ..., +14 and de-interleaves them into D2 = x3 x2 x1 x0 and D3 = y3 y2 y1 y0]

[Figure: VLD2.16 {D0,D1,D2,D3}, [R3]! reads x0 y0 ... x7 y7 from [R3], +2, ..., +30 into D0 = x3 x2 x1 x0, D1 = x7 x6 x5 x4, D2 = y3 y2 y1 y0, D3 = y7 y6 y5 y4, with writeback to R3]

Multiple 3/4-Element Structure Access

 VLD3/4, VST3/4 provide access to 3 or 4-element structures
   Lists contain 3/4 registers; optional space for building 128-bit vectors
   Transfer multiple consecutive 8, 16, or 32-bit 3/4-element structures

[Figure: VLD3.16 {D0,D2,D4}, [R1]! reads x0 y0 z0 x1 y1 z1 ... y3 z3 from [R1], +2, ..., +22 into D0 = x3 x2 x1 x0, D2 = y3 y2 y1 y0, D4 = z3 z2 z1 z0, leaving D1 and D3 free, with writeback to R1]

[Figure: VST3.16 {D3,D4,D5}, [R1] interleaves D3 = x3 x2 x1 x0, D4 = y3 y2 y1 y0, D5 = z3 z2 z1 z0 back to memory as x0 y0 z0 x1 y1 z1 ... y3 z3]
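The array-of-structures to structure-of-arrays conversion that VLD3 and VST3 perform can be written out as two short scalar loops. This is a sketch of the data movement only: `vld3_16` and `vst3_16` are hypothetical names, each output array stands for one D register holding four 16-bit lanes, and address writeback is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* VLD3.16 model: read four {x, y, z} structures from memory and
   de-interleave them into three 4-lane vectors. */
void vld3_16(uint16_t x[4], uint16_t y[4], uint16_t z[4],
             const uint16_t *mem)
{
    assert(mem != 0);
    for (int i = 0; i < 4; i++) {
        x[i] = mem[3 * i + 0];
        y[i] = mem[3 * i + 1];
        z[i] = mem[3 * i + 2];
    }
}

/* VST3.16 model: the inverse, interleaving three vectors back into
   four {x, y, z} structures. */
void vst3_16(uint16_t *mem, const uint16_t x[4], const uint16_t y[4],
             const uint16_t z[4])
{
    assert(mem != 0);
    for (int i = 0; i < 4; i++) {
        mem[3 * i + 0] = x[i];
        mem[3 * i + 1] = y[i];
        mem[3 * i + 2] = z[i];
    }
}
```

Once the components sit in separate vectors, one SIMD operation can process all four x values (or y, or z) at once, which is why the slides describe these loads as enabling efficient SIMD processing.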

41 42

Logical

 NEON supports bitwise logical operations
 VAND, VBIC, VEOR, VORN, VORR
   Bitwise logical operation
   Independent of data type
   64-bit and 128-bit registers

    VAND D0, D0, D1
    VORR Q0, Q1, Q15
    VEOR Q7, Q1, Q15
    VORN D15, D14, D1
    VBIC D0, D30, D2

 VBIT, VBIF, VBSL
   Bitwise multiplex operations
   Insert True, Insert False, Select
   3 versions overwrite different registers
   64-bit and 128-bit registers
   Used with masks to provide selection

    VBIT D1, D0, D2

[Figure: VBIT with source D0, destination D1 and mask D2 = 0 1 0 1 1 0; D1 takes bits from D0 where the mask is 1]

Alignment hints on NEON load/store

 NEON data load/store: VLDn/VSTn
 Full unaligned support for NEON data access
 Instruction contains ‘alignment hint’ which permits implementations to be faster when address is aligned and hint is specified.
   Usage: base address specified as [<Rn>:<align>]
   Note it is a programming error to specify hint, but use incorrectly aligned address
 Alignment hint can be :64, :128, :256 (bits) depending on number of D-regs

    VLD1.8 {D0}, [R1:64]
    VLD1.8 {D0,D1}, [R4:128]!
    VLD1.8 {D0,D1,D2,D3}, [R7:256], R2

 ARM ARM uses “@” but this is not recommended in source code
 GNU gas currently only accepts “[Rn,:128]” syntax – note extra “,”
 Applies to both Cortex-A8 and Cortex-A9 (see TRM for detailed instruction timing)
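The three bitwise multiplex operations on the Logical slide differ only in which register acts as the mask and which one is overwritten. A scalar model on one 64-bit D register makes that explicit; the function names are hypothetical and each returns the new value of the destination register Dd.

```c
#include <assert.h>
#include <stdint.h>

/* VBIT Dd, Dn, Dm (Insert True): copy bits of Dn into Dd where the
   mask Dm is 1; other bits of Dd are kept. */
uint64_t vbit64(uint64_t dd, uint64_t dn, uint64_t dm)
{
    return (dn & dm) | (dd & ~dm);
}

/* VBIF Dd, Dn, Dm (Insert False): copy bits of Dn into Dd where the
   mask Dm is 0; other bits of Dd are kept. */
uint64_t vbif64(uint64_t dd, uint64_t dn, uint64_t dm)
{
    return (dd & dm) | (dn & ~dm);
}

/* VBSL Dd, Dn, Dm (Select): Dd itself holds the mask on entry; the
   result takes Dn where the mask is 1 and Dm where it is 0. */
uint64_t vbsl64(uint64_t dd_mask, uint64_t dn, uint64_t dm)
{
    return (dn & dd_mask) | (dm & ~dd_mask);
}
```

A typical use, assuming a mask produced by an earlier compare, is a branch-free per-lane select: `vbsl64(mask, a, b)` picks `a` in lanes where the compare was true and `b` elsewhere.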

43 44
Dual issue [Cortex-A8 only]

 NEON can dual issue NEON in the following circumstances
   No register operand/result dependencies
   NEON data processing (ALU) instruction
   NEON load/store or NEON byte permute instruction or MRC/MCR
   VLDR/VSTR, VLDn/VSTn, VMOV, VTRN, VSWP, VZIP, VUZP, VEXT, VTBL, VTBX

    VLD1.8 {D0}, [R1]!
    VMLAL.S8 Q2, D3, D2

    VEXT.8 D0, D1, D2, #1
    SUBS r12, r12, #1

 Also can dual-issue NEON with ARM instructions

    VLD1.8 {D0}, [R1]!
    SUBS r12, r12, #1

Thank you!

 ARM Architecture has evolved with a balance of pure RISC and customer driven input
 NEON offers a clean architecture targeted at compiler code generation, offering
   Unaligned access
   Structure load/store operations
   Dual D-register/Q-register view to optimize register bank
   Balance of performance vs. gatecount
 Cortex-A9 and ARM’s hard macros provide a scalable low-power solution that is suitable for a wide range of high-performance consumer applications

45 46
