ARM Architecture & NEON
Ian Rickards
Stanford University 28 Apr 2010

A Brief History of ARM
First ARM prototype came alive on 26-Apr-1985: 3 µm technology, 24,800 transistors, 50 mm², consumed 120 mW of power
Acorn’s commercial ARM2 processor: 8 MHz, 26-bit addressing, 3-stage pipeline
ARM founded in October 1990 as a separate company (Apple had 43% stake)
ARM610 for Newton in 1992, ARM7TDMI for Nokia in 1994
Gatecount: 500K (32KB I/D L1’s), 600K (core), 500K (NEON)
40G “Low Power” macro: ~5 mm², 800 MHz, 0.5 W
40G “High Performance” macro: ~7 mm², 2 GHz (typ), 2 W consumption
Fast loop mode in instruction pre-fetch to lower power
Cortex-A9 MPCore Multicore Structure
Configurable between 1 and 4 CPUs with optional NEON and/or Floating-point Unit
Flexible configuration and power-aware interrupt controller
Hardware coherence for cache, MMU and TLB maintenance operations
Coherent access to processor caches from accelerators and DMA
Design flexibility over memory throughput and latency
Secure and Virtualization aware interrupt and IPI communications
[Block diagram: four Cortex-A9 CPUs, each with Instruction Cache and Data Cache; Snoop Control Unit (SCU) with Generalized Interrupt Control and Distribution, Accelerator Coherence Port, Cache-2-Cache Transfers, Snoop Filtering and Timers; Advanced Bus Interface Unit; L2 Cache Controller (PL310) 128K-8MB; Primary AMBA 3 64bit Interface; Optional 2nd I/F with Address Filtering]

Hard Macro Configuration and Floorplan
Osprey configuration includes level 2 cache controller and Cortex A9 integration level
Top level includes Coresight PTM, CTI and CTM
Implementation using r1p1 version of Cortex A9
  Dual core
  32k I$ and D$
  NEON present on both cores
  PTM interface present
  128 interrupts
  ACP present
  Two AXI master ports
Level 2 cache memories external (interface exposed)
[Figures: falcon_cpu floorplan; Elba top level floorplan]
Power Domains
HiP and MID macros have same power domains
Both use distributed coarse grain power switches
Power plan for CPUs is symmetric
A9 core and its L1 is power gated in lockstep
Note that all power domains are only ON or OFF, there is no hardware retention mode
Software routine enables retention to RAM
[Diagram labels: A9_PL310; A9_PL310_noram; “Osprey macro”; Data Engine 0; Data Engine 1; PTM/Debug; A9 CORE 0 + 32K I/D; A9 CORE 1 + 32K I/D; SCU + PL310_noram]

Single-thread Coremarks/MHz
Single-thread performance is key for GUI based applications
  Atom        1.85
  Cortex-A9   2.95
  Cortex-A8   2.72
  1004K       2.33
ARM Architecture evolution
Some not-entirely-RISC features
  LDM / STM
  ~60% of Java bytecodes directly executed by datapath
  Top of Java stack stored in registers
  Widely used in Nokia & DoCoMo handsets
  …

Dummies’ guide to Si implementation
Basic Fab tech
  65nm, 40nm, 32nm, 28nm, etc.
Backed off implementation vs. pushed implementation
High-K metal Gate
Clock gating
Well biasing…
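[Figure: ARM architecture evolution across V5, V6, V7 A&R and V7 M; labels: ARM10, VFPv2, Jazelle®, SIMD, Thumb-2 Only, Low Cost MCU]
[Figure: NEON lane operation; labels: Dm, Operation, Dd, Destination Register, Lane]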
Data Types
NEON natively supports a set of common data types
  Integer and Fixed-Point; 8-bit, 16-bit, 32-bit and 64-bit
  32-bit Single-precision Floating-point
Type specifiers by element size
  8-bit   Signed, Unsigned Integers; Polynomials   .8   .I8   .S8   .U8   .P8
  16-bit  Signed, Unsigned Integers; Polynomials   .16  .I16  .S16  .U16  .P16
  32-bit  Signed, Unsigned Integers; Floats        .32  .I32  .S32  .U32  .F32
  64-bit  Signed, Unsigned Integers                .64  .I64  .S64  .U64

Registers
NEON provides a 256-byte register file
  Distinct from the core registers
  Extension to the VFPv2 register file (VFPv3)
Two explicitly aliased views
  32 x 64-bit registers (D0-D31)
  16 x 128-bit registers (Q0-Q15)
Enables register trade-off
  Vector length
  Available registers
[Diagram: Q0 = D0, D1; Q1 = D2, D3; ...; Q15 = D30, D31]
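
The two aliased register views can be sketched in portable C. This is only a minimal model for illustration, not how a compiler or the hardware exposes the registers: the type `neon_regfile_model` and the helpers `write_d` and `read_q_half` are hypothetical names. It relies on the architectural mapping in which Qn overlays D(2n) as its low half and D(2n+1) as its high half.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the 256-byte NEON register file.
   The same storage is visible as 32 D registers or 16 Q registers. */
typedef union {
    uint8_t  bytes[256];
    uint64_t d[32];     /* D0-D31, 64 bits each */
    uint64_t q[16][2];  /* Q0-Q15: q[n][0] is the low half, q[n][1] the high half */
} neon_regfile_model;

/* Write a 64-bit value through the D-register view. */
void write_d(neon_regfile_model *rf, unsigned n, uint64_t value)
{
    assert(n < 32);
    rf->d[n] = value;
}

/* Read one 64-bit half of a Q register through the Q-register view. */
uint64_t read_q_half(const neon_regfile_model *rf, unsigned qn, unsigned half)
{
    assert(qn < 16 && half < 2);
    return rf->q[qn][half];
}
```

Writing D2 and D3 and then reading Q1 returns the same data, which is the trade-off the slide describes: a routine can use many short vectors (D view) or fewer long ones (Q view) over the same storage.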
How to use NEON
OpenMAX DL library
  Library of common codec components and signal processing routines
  Status: Released on http://www.arm.com/products/esd/openmax_home.html
Vectorizing Compilers
  Exploits NEON SIMD automatically with existing source code
  Status: Released (in RVDS 3.1 Professional and later)
  Status: Codesourcery 2007q3 gcc and later
C Intrinsics
  C function call interface to NEON operations
  Supports all data types and operations supported by NEON
  Status: Released (in RVDS 3.0+ and Codesourcery 2007q3 gcc)
Assembler
  For those who really want to optimize at the lowest level
  Status: Released (in RVDS 3.0+ & Codesourcery 2007q3 gcc/gas)

For NEON instruction reference
Official NEON instruction set reference is “Advanced SIMD” in ARM Architecture Reference Manual v7 A & R edition
Available to partners & www.arm.com request system
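
As an illustration of the C intrinsics route listed under "How to use NEON", here is a small sketch of a saturating byte add. The function name `add_saturate_u8` and the `HAVE_NEON` macro are made up for this example. The intrinsics `vld1q_u8`, `vqaddq_u8` and `vst1q_u8` come from `arm_neon.h`, and a scalar loop handles the tail and any target without NEON, so the same source builds anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* dst[i] = min(a[i] + b[i], 255) for i in [0, n). */
void add_saturate_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
#if HAVE_NEON
    /* 16 bytes per iteration in one Q register. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(dst + i, vqaddq_u8(va, vb));  /* unsigned saturating add */
    }
#endif
    /* Scalar tail, and the whole loop when NEON is not available. */
    for (; i < n; i++) {
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

A vectorizing compiler may produce similar code from the scalar loop alone. The intrinsic form simply makes the choice of instruction explicit.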
git.ffmpeg.org snapshot 21-Sep-09
Skia library S32A_D565_Opaque
  Size   Reference C   Google v6 asm   NEON asm   RVDS
  60     100%          128%            24%        64%
Quick review of NEON instructions

Multiple 1-Element Structure Access
VLD1, VST1 provide standard array access
  An array of structures containing a single component is a basic array
  List can contain 1, 2, 3 or 4 consecutive registers
  Transfer multiple consecutive 8, 16, 32 or 64-bit elements
Example: VLD1.16 {D7}, [R4], R3
  Memory: [R4] x0, +2 x1, +4 x2, +6 x3
  Result: D7 = x3 x2 x1 x0; R4 is then incremented by R3
Example with a two-register list
  Memory: [R1] x0, +2 x1, +4 x2, +6 x3, +8 x4, +10 x5, +12 x6, +14 x7
  Result: D3 = x3 x2 x1 x0, D4 = x7 x6 x5 x4
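
The VLD1.16 {D7}, [R4], R3 example can be modelled in portable C as a load of four 16-bit elements followed by a register post-increment. This is a sketch for reasoning about the addressing only: `d_reg_u16` and `vld1_16_postinc` are hypothetical names, and the model assumes the little-endian layout drawn in the diagram.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one D register viewed as four 16-bit lanes.
   lane[0] is the least significant element (x0 in the diagram). */
typedef struct {
    uint16_t lane[4];
} d_reg_u16;

/* Model of "VLD1.16 {Dd}, [Rn], Rm":
   load 4 consecutive 16-bit elements from addr into d,
   then return the post-incremented address (addr + stride bytes). */
const uint8_t *vld1_16_postinc(d_reg_u16 *d, const uint8_t *addr, ptrdiff_t stride)
{
    memcpy(d->lane, addr, sizeof d->lane);
    return addr + stride;
}
```

Calling it in a loop with a stride of 8 bytes walks a plain array four elements at a time. A different stride skips through memory, for instance to step down the rows of an image.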
Exercise 2 - summing a vector
[Figure: diagram of "+" operations combining the vector elements into a single sum]
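
One possible approach to the vector-sum exercise, written as a portable C model rather than as the deck's own solution: keep four independent lane accumulators while streaming through the data, then reduce the lanes with pairwise additions at the end. On NEON the final step would typically map to pairwise-add instructions such as VPADD. The function name `sum_u16_lanes` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum n 16-bit values using 4 lane accumulators, then a pairwise reduction. */
uint32_t sum_u16_lanes(const uint16_t *x, size_t n)
{
    uint32_t acc[4] = {0, 0, 0, 0};  /* models four 32-bit lanes of a Q register */
    size_t i = 0;

    /* Vector body: each lane accumulates every 4th element. */
    for (; i + 4 <= n; i += 4) {
        for (int lane = 0; lane < 4; lane++) {
            acc[lane] += x[i + lane];
        }
    }

    /* Horizontal reduction: two pairwise adds, then one more. */
    uint32_t p0 = acc[0] + acc[1];
    uint32_t p1 = acc[2] + acc[3];
    uint32_t total = p0 + p1;

    /* Scalar tail for lengths that are not a multiple of 4. */
    for (; i < n; i++) {
        total += x[i];
    }
    return total;
}
```

Accumulating into wider lanes than the input avoids overflow for long vectors, which is why the model widens 16-bit inputs to 32-bit accumulators.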
VTBL.8 D0, {D1, D2}, D3
  Table {D1,D2}: p o n m l k j i h g f e d c b a (0 for out-of-range indexes)
  Result D0: l e i n 0 i a d
VTBL : out of range indexes generate 0 result
VTBX : out of range indexes leave destination unchanged

Three forms of Element Load Store instructions are provided
Forms distinguished by type of register list provided
  Multiple Structure Access e.g. {D0, D1}
  Single Structure Access e.g. {D0[2], D1[2]}
  Single Structure Load to all lanes e.g. {D0[], D1[]}
[Diagram: memory holding 3-element structures x0 y0 z0 x1 y1 z1 x2 y2 z2 x3 ..., with "element" and "3-element structure" labelled]
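
The difference between VTBL and VTBX can be shown with a scalar C model of the byte table lookup. The function names `vtbl8_model` and `vtbx8_model` are hypothetical. The index values in the usage note are inferred from the result row shown for VTBL.8 D0, {D1, D2}, D3 (with 255 standing in for the out-of-range index) and are not copied from the slide.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of VTBL.8: out-of-range indexes produce 0. */
void vtbl8_model(uint8_t dst[8], const uint8_t *table, unsigned table_len,
                 const uint8_t idx[8])
{
    for (int i = 0; i < 8; i++) {
        dst[i] = (idx[i] < table_len) ? table[idx[i]] : 0;
    }
}

/* Model of VTBX.8: out-of-range indexes leave the destination lane unchanged. */
void vtbx8_model(uint8_t dst[8], const uint8_t *table, unsigned table_len,
                 const uint8_t idx[8])
{
    for (int i = 0; i < 8; i++) {
        if (idx[i] < table_len) {
            dst[i] = table[idx[i]];
        }
    }
}
```

With the 16-byte table "abcdefghijklmnop" and indexes {3, 0, 8, 255, 13, 8, 4, 11}, the VTBL model yields d a i 0 n i e l from lane 0 upward, matching the D0 row above when read right to left. The VTBX model keeps whatever was already in lane 3.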
Multiple 2-Element Structure Access
VLD2, VST2 provide access to multiple 2-element structures
  List can contain 2 or 4 registers
  Transfer multiple consecutive 8, 16, or 32-bit 2-element structures
Two-register example
  Memory: [R1] x0, +2 y0, +4 x1, +6 y1, +8 x2, +10 y2, +12 x3, +14 y3
  Result: D2 = x3 x2 x1 x0, D3 = y3 y2 y1 y0
Four-register example
  Memory: [R3] x0, +2 y0, +4 x1, +6 y1, +8 x2, +10 y2, +12 x3, ..., +28 x7, +30 y7
  Result: D0 = x3 x2 x1 x0, D1 = x7 x6 x5 x4, D2 = y3 y2 y1 y0, D3 = y7 y6 y5 y4
The diagrams mark address writeback with !

Multiple 3/4-Element Structure Access
VLD3/4, VST3/4 provide access to 3 or 4-element structures
  Lists contain 3/4 registers; optional space for building 128-bit vectors
  Transfer multiple consecutive 8, 16, or 32-bit 3/4-element structures
3-element example
  Memory: [R1] x0, +2 y0, +4 z0, +6 x1, +8 y1, +10 z1, +12 x2, ..., +20 y3, +22 z3
  Consecutive registers: D3 = x3 x2 x1 x0, D4 = y3 y2 y1 y0, D5 = z3 z2 z1 z0
  Spaced registers: D0 = x3 x2 x1 x0, D2 = y3 y2 y1 y0, D4 = z3 z2 z1 z0 (D1 and D3 skipped)
Dual issue [Cortex-A8 only]
NEON can dual issue NEON instructions in the following circumstances
  No register operand/result dependencies
  NEON data processing (ALU) instruction
  NEON load/store or NEON byte permute instruction or MRC/MCR
    VLDR/VSTR, VLDn/VSTn, VMOV, VTRN, VSWP, VZIP, VUZP, VEXT, VTBL, VTBX
  VLD1.8 {D0}, [R1]!
  VMLAL.S8 Q2, D3, D2

  VEXT.8 D0, D1, D2, #1
  SUBS r12, r12, #1
Also can dual-issue NEON with ARM instructions
  VLD1.8 {D0}, [R1]!
  SUBS r12, r12, #1

Thank you!
ARM Architecture has evolved with a balance of pure RISC and customer driven input
NEON offers a clean architecture targeted at compiler code generation, offering
  Unaligned access
  Structure load/store operations
  Dual D-register/Q-register view to optimize register bank
  Balance of performance vs. gatecount
Cortex-A9 and ARM’s hard macros provide a scalable low-power solution that is suitable for a wide range of high-performance consumer applications