A Brief History of ARM


Uploaded by

Shilpa Gireesh

ARM Architecture & NEON

Ian Rickards
Stanford University 28 Apr 2010

A brief history of ARM

First ARM prototype came alive on 26-Apr-1985, 3um technology, 24800 transistors, 50mm2, consumed 120mW of power
Acorn’s commercial ARM2 processor: 8-MHz, 26-bit addressing, 3-stage pipeline
ARM founded in October 1990, separate company (Apple had 43% stake)
ARM610 for Newton in 1992, ARM7TDMI for Nokia in 1994

ARM 25 years later: Cortex-A9 MP

1-4 way MP with optimized MESI
16KB, 32KB, 64KB I & D caches
128KB-8MB L2
Multi-issue, Speculation, Renaming, OOO
High performance FPU option
NEON SIMD option
Thumb-2
AXI bus
Gatecount: 500K (32KB I/D L1’s), 600K (core), 500K (NEON)
40G “Low Power” macro: ~5mm2, 800MHz, 0.5W
40G “High Performance” macro: ~7mm2 2GHz (typ), 2W

Cortex-A9 Processor Microarchitecture

Introduces out-of-order instruction issue and completion
Register renaming to enable execution speculation
Non-blocking memory system with load-store forwarding
Fast loop mode in instruction pre-fetch to lower power consumption
Cortex-A9 MPCore Multicore Structure

Configurable between 1 and 4 CPUs with optional NEON and/or Floating-point Unit
Hardware coherence for cache, MMU and TLB maintenance operations
Coherent access to processor caches from accelerators and DMA
Flexible configuration and power-aware interrupt controller
Design flexibility over memory throughput and latency
Secure and virtualization aware interrupt and IPI communications
[Block diagram: four Cortex-A9 CPUs, each with FPU/NEON, trace, instruction cache and data cache; Snoop Control Unit (SCU) with generalized interrupt control and distribution, accelerator coherence port, cache-2-cache transfers, snoop filtering and timers; Advanced Bus Interface Unit; L2 Cache Controller (PL310) 128K-8MB; primary AMBA 3 64-bit interface; optional 2nd I/F with address filtering]

Hard Macro Configuration and Floorplan

Osprey configuration includes level 2 cache controller and Cortex A9 integration level
Top level includes Coresight PTM, CTI and CTM
Implementation using r1p1 version of Cortex A9
  Dual core
  32k I$ and D$
  NEON present on both cores
  PTM interface present
  128 interrupts
  ACP present
  Two AXI master ports
Level 2 cache memories external (interface exposed)
[Figures: falcon_cpu floorplan; Elba top level floorplan]

Why is ARM looking at “G” processes?

“G” can achieve around double the MHz than “LP”
Active power is lower on “G” than “LP”
Example, Push 40LP to 800MHz, to compare with 800MHz MID macro
The estimated LP numbers correlate to an accelerated implementation of an A8
G is close in terms of power if lowered to same performance as on LP.
G can scale much higher in terms of performance than LP.
[Chart: power, Osprey]

Understanding power

Fundamental power parameters
  Average power => battery life
  Thermal Power sustained power @ max performance
Key requirement is “run and power off” quickly
[Figure: power over time for GUI updates, web page render and music; traditional LP process compared with 40G process running a 2-3x faster clock and powering off between bursts]
Power Domains

HiP and MID macros have same power domains
Both use distributed coarse grain power switches
Power plan for CPUs is symmetric
A9 core and its L1 is power gated in lockstep
Note that all power domains are only ON or OFF, there is no hardware retention mode
Software routine enables retention to RAM
[Diagram: “Osprey macro” power domains: A9_PL310 and A9_PL310_noram; Data Engine 0 and Data Engine 1; PTM/Debug; A9 CORE 0 + 32K I/D and A9 CORE 1 + 32K I/D; SCU + PL310_noram; L2 Cache RAM 512/1024KB]

Single-thread Coremarks/MHz

Single-thread performance is key for GUI based applications

Processor    Coremarks/MHz
Atom         1.85
Cortex-A9    2.95
Cortex-A8    2.72
1004K        2.33
74K          2.30

Floating Point Performance
[Chart]

Higher Flash Actionscript from A9
[Chart; comparison label: Intel]
ARM Architecture evolution

Some not-entirely-RISC features
  LDM / STM
  Full predicated execution (ADDEQ r0, r1, r2)
Carefully designed with customer/partner input considering gatecount
Thumb
  16-bit instruction set (mostly using r0-r7) selected for compiler requirements
  Design goals: performance from 16-bit wide ROM, codesize
  Thumb-2 in Cortex-A extends original Thumb (allows 32-bit/16-bit mix)
  Beneficial today – better performance from small caches
Jazelle
  CPU mode allows direct execution of Java bytecodes
  ~60% of Java bytecodes directly executed by datapath
  Top of Java stack stored in registers
  Widely used in Nokia & DoCoMo handsets
…

Dummies’ guide to Si implementation

Basic Fab tech
  65nm, 40nm, 32nm, 28nm, etc.
G vs. LP technology
  40G is 0.9V process, 40LP is 1.1V process
  Much lower leakage with LP, but half the performance
  Intermediate “LPG” from TSMC too! Island of G within LP
Vt’s – each Vt requires additional mask step
  HVt – lower leakage, but slower
  RVt – regular Vt
  LVt – faster, but high leakage esp. at high temperature
Cell library track size
  9-track, 12-track, 15-track (bigger => more powerful)
Backed off implementation vs. pushed implementation
High-K metal Gate
Clock gating
Well biasing…

ARM Architecture Evolution

[Diagram: key technology additions by architecture generation (V5, V6, V7 A&R, V7 M). Labels: ARM9, ARM10, ARM11; Jazelle®, VFPv2; SIMD, Thumb®-2, TrustZone™; Thumb-EE execution environments (improved memory use), VFPv3, NEON™ Adv SIMD (improved media and DSP); Thumb-2 only, low cost MCU]

What is NEON?

NEON is a wide SIMD data processing architecture
  Extension of the ARM instruction set
  32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide)
NEON Instructions perform “Packed SIMD” processing
  Registers are considered as vectors of elements of the same data type
  Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single prec. float
  Instructions perform the same operation in all lanes
[Diagram: elements of source registers Dn and Dm pass lane by lane through the operation into destination register Dd]
Data Types

NEON natively supports a set of common data types
  Integer and Fixed-Point; 8-bit, 16-bit, 32-bit and 64-bit
  32-bit Single-precision Floating-point
Data types are represented using a bit-size and format letter

Size      Specifiers                     Contents
8-bit     .8  .I8  .S8  .U8  .P8         Signed, unsigned integers; polynomials
16-bit    .16 .I16 .S16 .U16 .P16        Signed, unsigned integers; polynomials
32-bit    .32 .I32 .S32 .U32 .F32        Signed, unsigned integers; floats
64-bit    .64 .I64 .S64 .U64             Signed, unsigned integers

Registers

NEON provides a 256-byte register file
  Distinct from the core registers
  Extension to the VFPv2 register file (VFPv3)
Two explicitly aliased views
  32 x 64-bit registers (D0-D31)
  16 x 128-bit registers (Q0-Q15)
Enables register trade-off
  Vector length
  Available registers
Also uses the summary flags in the VFP FPSCR
  Adds a QC integer saturation summary flag
  No per-lane flags, so ‘carry’ handled using wider result (16bit+16bit -> 32-bit)
[Diagram: D0 and D1 alias Q0, D2 and D3 alias Q1, ..., D30 and D31 alias Q15]
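The two aliased register views can be modelled in portable C. This is only a sketch of the layout described on the Registers slide: `neon_regfile` and its helper functions are hypothetical names invented for illustration, not part of any ARM header.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the 256-byte NEON register file. The same bytes are
   viewed either as 32 x 64-bit D registers or as 16 x 128-bit Q registers. */
typedef struct {
    uint8_t bytes[256];
} neon_regfile;

/* Byte offset of Dn inside the register file. */
static size_t d_offset(unsigned n) { return (size_t)n * 8; }

/* Byte offset of Qn; Qn overlays D(2n) (low half) and D(2n+1) (high half). */
static size_t q_offset(unsigned n) { return (size_t)n * 16; }

/* Write a 64-bit value through the D view. */
static void write_d(neon_regfile *rf, unsigned n, uint64_t value) {
    memcpy(&rf->bytes[d_offset(n)], &value, sizeof value);
}

/* Read a 64-bit value through the D view. */
static uint64_t read_d(const neon_regfile *rf, unsigned n) {
    uint64_t value;
    memcpy(&value, &rf->bytes[d_offset(n)], sizeof value);
    return value;
}

/* Read the low (half = 0) or high (half = 1) 64 bits of Qn through the Q view. */
static uint64_t read_q_half(const neon_regfile *rf, unsigned n, unsigned half) {
    uint64_t value;
    memcpy(&value, &rf->bytes[q_offset(n) + 8 * (size_t)half], sizeof value);
    return value;
}
```

Writing D2 and D3 and then reading Q1 returns the same bits, which is the trade-off the slide names: longer vectors through the Q view, or more registers through the D view.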

Vectors and Scalars

Registers hold one or more elements of the same data type
  Vn can be used to reference either a 64-bit Dn or 128-bit Qn register
  A register, data type combination describes a vector of elements
[Diagram: 64-bit Dn (bits 63..0): one I64 in D0, two S32 in D7; 128-bit Qn (bits 127..0): four F32 in Q0, sixteen S8 in Q7]
Some instructions can reference individual scalar elements
  Scalar elements are referenced using the array notation Vn[x]
[Diagram: four F32 lanes of Q0, labelled Q0[3] Q0[2] Q0[1] Q0[0]]
Array ordering is always from the least significant bit

NEON in Audio

FFT: 256-point, 16-bit signed complex numbers
FFT is a key component of AAC, Voice/pattern recognition etc.
Hand optimized assembler in both cases

FFT time                              No NEON (v6 SIMD asm)    With NEON (v7 NEON asm)
Cortex-A8 500MHz (actual silicon)     15.2 us                  3.8 us (x 4.0 performance)

Extreme example: FFT in ffmpeg: 12x faster
  C code -> handwritten asm
  Scalar -> vector processing
  Unpipelined FPU -> pipelined NEON single precision FPU
How to use NEON

OpenMAX DL library
  Library of common codec components and signal processing routines
  Status: Released on http://www.arm.com/products/esd/openmax_home.html
Vectorizing Compilers
  Exploits NEON SIMD automatically with existing source code
  Status: Released (in RVDS 3.1 Professional and later)
  Status: Codesourcery 2007q3 gcc and later
C Intrinsics
  C function call interface to NEON operations
  Supports all data types and operations supported by NEON
  Status: Released (in RVDS 3.0+ and Codesourcery 2007q3 gcc)
Assembler
  For those who really want to optimize at the lowest level
  Status: Released (in RVDS 3.0+ & Codesourcery 2007q3 gcc/gas)

For NEON instruction reference

Official NEON instruction Set reference is “Advanced SIMD” in ARM Architecture Reference Manual v7 A & R edition
Available to partners & www.arm.com request system

ARM RVDS & gcc vectorising compiler

Source (test.c):

    int a[256], b[256], c[256];
    void foo (void) {
        int i;
        for (i=0; i<256; i++){
            a[i] = b[i] + c[i];
        }
    }

armcc -S --cpu cortex-a8 -O3 -Otime --vectorize test.c

    |L1.16|
        VLD1.32 {d0,d1},[r0]!
        SUBS r3,r3,#1
        VLD1.32 {d2,d3},[r1]!
        VADD.I32 q0,q0,q1
        VST1.32 {d0,d1},[r2]!
        BNE |L1.16|

gcc -S -O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -ftree-vectorizer-verbose=6 test.c

    .L2:
        add r1, r0, ip
        add r3, r0, lr
        add r2, r0, r4
        add r0, r0, #8
        cmp r0, #1024
        fldd d7, [r3, #0]
        fldd d6, [r2, #0]
        vadd.i32 d7, d7, d6
        fstd d7, [r1, #0]
        bne .L2

armcc generates better NEON code
(gcc can use Q-regs with ‘-mvectorize-with-neon-quad’ )

Intrinsics

Include intrinsics header file

    #include <arm_neon.h>

Use special NEON data types which correspond to D and Q registers, e.g.
  int8x8_t D-register containing 8x 8-bit elements
  int16x4_t D-register containing 4x 16-bit elements
  int32x4_t Q-register containing 4x 32-bit elements
Use special intrinsics versions of NEON instructions

    vin1 = vld1q_s32(ptr);
    vout = vaddq_s32(vin1, vin2);
    vst1q_s32(ptr, vout);

Strongly typed!
  Use vreinterpret_s16_s32( ) to change the type
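As a rough sketch, the vectorised loop from the compiler slide can be written by hand with the three intrinsics shown on the Intrinsics slide. The function name `add_arrays` and the scalar fallback are assumptions added here so that the file also builds on targets without NEON. The `__ARM_NEON` macro is the usual compiler-defined feature test.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* a[i] = b[i] + c[i] for i in [0, n). add_arrays is a hypothetical name. */
void add_arrays(int32_t *a, const int32_t *b, const int32_t *c, size_t n) {
    size_t i = 0;
#if defined(__ARM_NEON)
    /* One Q register holds four 32-bit lanes, so step four elements at a time. */
    for (; i + 4 <= n; i += 4) {
        int32x4_t vb = vld1q_s32(b + i);
        int32x4_t vc = vld1q_s32(c + i);
        vst1q_s32(a + i, vaddq_s32(vb, vc)); /* destination pointer first, then the vector */
    }
#endif
    /* Scalar tail; this is also the whole loop when NEON is not available. */
    for (; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}
```

Because the intrinsic types are strongly typed, passing an `int16x8_t` to `vaddq_s32` would fail to compile, and the `vreinterpret` family is the intended way to change the lane type.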
NEON in opensource

Bluez – official Linux Bluetooth protocol stack
  NEON sbc audio encoder
Pixman (part of cairo 2D graphics library)
  Compositing/alpha blending
  X.Org, Mozilla Firefox, fennec, & Webkit browsers
  e.g. fbCompositeSolidMask_nx8x0565neon 8x faster using NEON
ffmpeg – libavcodec
  LGPL media player used in many Linux distros
  NEON Video: MPEG-2, MPEG-4 ASP, H.264 (AVC), VC-1, VP3, Theora
  NEON Audio: AAC, Vorbis, WMA
x264 – Google Summer Of Code 2009
  GPL H.264 encoder – e.g. for video conferencing
Android – NEON optimizations
  Skia library, S32A_D565_Opaque 5x faster using NEON
  Available in Google Skia tree from 03-Aug-2009
Eigen2 linear algebra library
Ubuntu 09.04 – supports NEON
  NEON versions of critical shared-libraries

Many different levels of parallelism

Multi-issue parallelism
NEON SIMD parallelism
Multi-core parallelism

ffmpeg (libavcodec) performance

git.ffmpeg.org snapshot 21-Sep-09
YouTube HQ video decode
  480x270, 30fps
  Including AAC audio
Real silicon measurements
  OMAP3 Beagleboard
  ARM A9TC
NEON ~2x overall performance

Scalability with SMP on Cortex-A9
[Chart]
NEON optimization example

Skia library S32A_D565_Opaque

Size    Reference C    Google v6 asm    NEON asm    RVDS
60      100%           128%             24%         64%
64      100%           128%             22%         68%
68      100%           127%             23%         63%
980     100%           73%              23%         58%
986     100%           73%              23%         58%

Processing code

    vmovn.u16 d4, q12
    vshr.u16 q11, q12, #5
    vshr.u16 q10, q12, #6+5
    vmovn.u16 d5, q11
    vmovn.u16 d6, q10
    vshl.u8 d4, d4, #3
    vshl.u8 d5, d5, #2
    vshl.u8 d6, d6, #3

    vmovl.u8 q14, d31
    vmovl.u8 q13, d31
    vmovl.u8 q12, d31

    vmvn.8 d30, d3
    vmlal.u8 q14, d30, d6
    vmlal.u8 q13, d30, d5
    vmlal.u8 q12, d30, d4

    vshr.u16 q8, q14, #5
    vshr.u16 q9, q13, #6
    vaddhn.u16 d6, q14, q8
    vshr.u16 q8, q12, #5
    vaddhn.u16 d5, q13, q9
    vqadd.u8 d6, d6, d0
    vaddhn.u16 d4, q12, q8

    vqadd.u8 d6, d6, d0
    vqadd.u8 d5, d5, d1
    vqadd.u8 d4, d4, d2

    vshll.u8 q10, d6, #8
    vshll.u8 q3, d5, #8
    vshll.u8 q2, d4, #8
    vsri.u16 q10, q3, #5
    vsri.u16 q10, q2, #11

Cortex-A8 TRM
[Figure]
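The last five instructions of the processing code (three `vshll.u8` followed by two `vsri.u16`) pack three 8-bit channels into one 16-bit 5:6:5 value per lane. The scalar model below is a sketch of a single lane. The helper names and the `hi`/`mid`/`lo` parameter names are assumptions, since the slide does not say which colour channel is held in d6, d5 and d4.

```c
#include <stdint.h>

/* One 16-bit lane of VSHLL.U8 #8: widen an 8-bit value and shift it left by 8. */
static uint16_t vshll_u8_by_8(uint8_t x) {
    return (uint16_t)(x << 8);
}

/* One 16-bit lane of VSRI.U16 (shift right and insert), for n in 1..15:
   src is shifted right by n and written into dst, whose top n bits are kept. */
static uint16_t vsri_u16(uint16_t dst, uint16_t src, unsigned n) {
    uint16_t keep = (uint16_t)~(0xFFFFu >> n);
    return (uint16_t)((dst & keep) | (src >> n));
}

/* Pack three 8-bit channels into 5:6:5, mirroring the instruction sequence.
   hi, mid and lo stand for the lanes taken from d6, d5 and d4 (assumed naming). */
static uint16_t pack_565(uint8_t hi, uint8_t mid, uint8_t lo) {
    uint16_t out = vshll_u8_by_8(hi);              /* vshll.u8 q10, d6, #8  */
    out = vsri_u16(out, vshll_u8_by_8(mid), 5);    /* vsri.u16 q10, q3, #5  */
    out = vsri_u16(out, vshll_u8_by_8(lo), 11);    /* vsri.u16 q10, q2, #11 */
    return out;
}
```

The result is the same as the shift-and-or expression `((hi >> 3) << 11) | ((mid >> 2) << 5) | (lo >> 3)`, but the insert form needs no separate masking step.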
Quick review of NEON instructions

Multiple 1-Element Structure Access

VLD1, VST1 provide standard array access
  An array of structures containing a single component is a basic array
  List can contain 1, 2, 3 or 4 consecutive registers
  Transfer multiple consecutive 8, 16, 32 or 64-bit elements

VLD1.16 {D7}, [R4], R3
  Memory: [R4] x0, +2 x1, +4 x2, +6 x3
  D7 = x3 x2 x1 x0; R4 is then post-incremented by R3

VST1.16 {D3,D4}, [R1]
  D3 = x3 x2 x1 x0, D4 = x7 x6 x5 x4
  Memory: [R1] x0, +2 x1, +4 x2, +6 x3, +8 x4, +10 x5, +12 x6, +14 x7

Addition: Basic

NEON supports various useful forms of basic addition
Normal Addition - VADD, VSUB
  Floating-point
  Integer (8-bit to 64-bit elements)
  64-bit and 128-bit registers
    VADD.I16 D0, D1, D2
    VSUB.F32 Q7, Q1, Q4
    VADD.I8 Q15, Q14, Q15
    VSUB.I64 D0, D30, D5
Long Addition - VADDL, VSUBL
  Promotes both inputs before operation
  Signed/unsigned (8-bit to 32-bit source elements)
    VADDL.U16 Q1, D7, D8
    VSUBL.S32 Q8, D1, D5
Wide Addition - VADDW, VSUBW
  Promotes one input before operation
  Signed/unsigned (8-bit to 32-bit source elements)
    VADDW.U8 Q1, Q7, D8
    VSUBW.S16 Q8, Q1, D5

Example – adding all lanes

Input in Q0 (D0 and D1): u16 input values
    VPADDL.U16 Q0, Q0
Now Q0 contains 4x u32 values (with 15 headroom bits)
    VPADD.U32 D0, D0, D1
Reducing/folding operation needs 1 bit of headroom
    VPADDL.U32 D0, D0
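The three-step lane reduction in the "adding all lanes" example can be followed with a scalar model. This sketch uses plain arrays in place of registers, with lane 0 as the least significant element. The function names mirror the mnemonics but are hypothetical helpers, not intrinsics.

```c
#include <stdint.h>

/* VPADDL.U16 Q0, Q0: add adjacent pairs of u16 lanes into widened u32 lanes. */
static void vpaddl_u16(const uint16_t in[8], uint32_t out[4]) {
    for (int i = 0; i < 4; i++) {
        out[i] = (uint32_t)in[2 * i] + in[2 * i + 1];
    }
}

/* VPADD.U32 D0, D0, D1: pairwise add inside each 64-bit half, without widening. */
static void vpadd_u32(const uint32_t q[4], uint32_t d[2]) {
    d[0] = q[0] + q[1]; /* the pair held in D0 */
    d[1] = q[2] + q[3]; /* the pair held in D1 */
}

/* VPADDL.U32 D0, D0: add the two u32 lanes into a single u64 lane. */
static uint64_t vpaddl_u32(const uint32_t d[2]) {
    return (uint64_t)d[0] + d[1];
}

/* Sum all eight u16 lanes of a Q register (sum_u16x8 is a hypothetical name). */
static uint64_t sum_u16x8(const uint16_t in[8]) {
    uint32_t q[4];
    uint32_t d[2];
    vpaddl_u16(in, q);
    vpadd_u32(q, d);
    return vpaddl_u32(d);
}
```

The non-widening middle step is safe because each u32 lane holds at most a 17-bit sum after the first step, which is where the 15 headroom bits come from.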
Exercise 2 - summing a vector
[Diagram: tree of pairwise additions folding the lanes of D0 and D1 down to a single result in D0]

Some NEON clever features

Data Movement: Table Lookup

Uses byte indexes to control byte look up in a table
Table is a list of 1,2,3 or 4 adjacent registers

    VTBL.8 D0, {D1, D2}, D3

    D3 (indexes)       11  4  8 13 26  8  0  3
    {D1,D2} (table)     p  o  n  m  l  k  j  i  h  g  f  e  d  c  b  a
    D0 (result)         l  e  i  n  0  i  a  d

VTBL : out of range indexes generate 0 result
VTBX : out of range indexes leave destination unchanged

Element Load Store Instructions

All treat memory as an array of structures (AoS)
SIMD registers are treated as structure of arrays (SoA)
Enables interleaving/de-interleaving for efficient SIMD processing
Transfer up to 256-bits in a single instruction
[Diagram: memory holds x0 y0 z0 x1 y1 z1 x2 y2 z2 x3 ...; each x, y or z is an element, and each x/y/z group is a 3-element structure]
Three forms of Element Load Store instructions are provided
Forms distinguished by type of register list provided
  Multiple Structure Access e.g. {D0, D1}
  Single Structure Access e.g. {D0[2], D1[2]}
  Single Structure Load to all lanes e.g. {D0[], D1[]}
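The table-lookup example `VTBL.8 D0, {D1, D2}, D3` can be reproduced with a scalar model. This is a sketch only: `vtbl8` and `vtbx8` are hypothetical helper names, and the arrays are indexed from the least significant lane, so the slide's right-to-left register pictures appear reversed in the initialisers.

```c
#include <stddef.h>
#include <stdint.h>

/* VTBL.8 model: each index byte selects a table byte; an out-of-range index gives 0. */
static void vtbl8(uint8_t dst[8], const uint8_t *table, size_t table_len,
                  const uint8_t idx[8]) {
    for (int lane = 0; lane < 8; lane++) {
        dst[lane] = (idx[lane] < table_len) ? table[idx[lane]] : 0;
    }
}

/* VTBX.8 model: same lookup, but an out-of-range index leaves dst[lane] unchanged. */
static void vtbx8(uint8_t dst[8], const uint8_t *table, size_t table_len,
                  const uint8_t idx[8]) {
    for (int lane = 0; lane < 8; lane++) {
        if (idx[lane] < table_len) {
            dst[lane] = table[idx[lane]];
        }
    }
}

/* The slide's data: {D1,D2} holds 'a'..'p', and D3 holds the indexes
   11 4 8 13 26 8 0 3 when read from the most significant lane down. */
static const uint8_t slide_table[16] = {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h',
                                        'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p'};
static const uint8_t slide_idx[8] = {3, 0, 8, 26, 13, 8, 4, 11};
```

With the slide's inputs, `vtbl8` yields the bytes d, a, i, 0, n, i, e, l from lane 0 upward, matching the D0 row when read right to left. Index 26 is past the 16-byte table, so that lane becomes 0.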
Multiple 2-Element Structure Access

VLD2, VST2 provide access to multiple 2-element structures
  List can contain 2 or 4 registers
  Transfer multiple consecutive 8, 16, or 32-bit 2-element structures

VLD2.16 {D2,D3}, [R1]
  Memory: [R1] x0, +2 y0, +4 x1, +6 y1, +8 x2, +10 y2, +12 x3, +14 y3
  D2 = x3 x2 x1 x0, D3 = y3 y2 y1 y0

VLD2.16 {D0,D1,D2,D3}, [R3]!
  Memory: [R3] x0, +2 y0, +4 x1, +6 y1, +8 x2, +10 y2, +12 x3, ..., +28 x7, +30 y7
  D0 = x3 x2 x1 x0, D1 = x7 x6 x5 x4, D2 = y3 y2 y1 y0, D3 = y7 y6 y5 y4

Multiple 3/4-Element Structure Access

VLD3/4, VST3/4 provide access to 3 or 4-element structures
  Lists contain 3/4 registers; optional space for building 128-bit vectors
  Transfer multiple consecutive 8, 16, or 32-bit 3/4-element structures

VST3.16 {D3,D4,D5}, [R1]
  D3 = x3 x2 x1 x0, D4 = y3 y2 y1 y0, D5 = z3 z2 z1 z0
  Memory: [R1] x0, +2 y0, +4 z0, +6 x1, +8 y1, +10 z1, +12 x2, ..., +20 y3, +22 z3

VLD3.16 {D0,D2,D4}, [R1]!
  Memory: [R1] x0, +2 y0, +4 z0, +6 x1, +8 y1, +10 z1, +12 x2, ..., +20 y3, +22 z3
  D0 = x3 x2 x1 x0, D2 = y3 y2 y1 y0, D4 = z3 z2 z1 z0 (D1 and D3 are left free)
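The array-of-structures to structure-of-arrays conversion performed by `VLD3.16` and `VST3.16` can be sketched in scalar C. The helper names are hypothetical, and the three output arrays stand in for the three D registers of the register list.

```c
#include <stdint.h>

/* Model of VLD3.16 {Dx,Dy,Dz}, [Rn]!: read four 3-element structures and split
   them into three 4-lane vectors. The returned pointer models the "!" writeback. */
static const uint16_t *vld3_u16(const uint16_t *mem,
                                uint16_t x[4], uint16_t y[4], uint16_t z[4]) {
    for (int i = 0; i < 4; i++) {
        x[i] = mem[3 * i];
        y[i] = mem[3 * i + 1];
        z[i] = mem[3 * i + 2];
    }
    return mem + 12; /* 4 structures x 3 elements consumed */
}

/* Model of VST3.16 {Dx,Dy,Dz}, [Rn]: interleave three vectors back into memory. */
static void vst3_u16(uint16_t *mem,
                     const uint16_t x[4], const uint16_t y[4], const uint16_t z[4]) {
    for (int i = 0; i < 4; i++) {
        mem[3 * i] = x[i];
        mem[3 * i + 1] = y[i];
        mem[3 * i + 2] = z[i];
    }
}
```

After the load, an operation such as scaling every x coordinate is a single instruction on one register, which is the reason the slides describe de-interleaving as enabling efficient SIMD processing.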

Logical

NEON supports bitwise logical operations
VAND, VBIC, VEOR, VORN, VORR
  Bitwise logical operation
  Independent of data type
  64-bit and 128-bit registers
    VAND D0, D0, D1
    VORR Q0, Q1, Q15
    VEOR Q7, Q1, Q15
    VORN D15, D14, D1
    VBIC D0, D30, D2
VBIT, VBIF, VBSL
  Bitwise multiplex operations
  Insert True, Insert False, Select
  3 versions overwrite different registers
  64-bit and 128-bit registers
  Used with masks to provide selection
    VBIT D1, D0, D2
[Diagram: bits of D0 are inserted into D1 wherever the mask in D2 (0 1 0 1 1 0) is 1]

Alignment hints on NEON load/store

NEON data load/store: VLDn/VSTn
Full unaligned support for NEON data access
Instruction contains ‘alignment hint’ which permits implementations to be faster when address is aligned and hint is specified.
  Usage: base address specified as [<Rn>:<align>]
  Note it is a programming error to specify hint, but use incorrectly aligned address
  Alignment hint can be :64, :128, :256 (bits) depending on number of D-regs
    VLD1.8 {D0}, [R1:64]
    VLD1.8 {D0,D1}, [R4:128]!
    VLD1.8 {D0,D1,D2,D3}, [R7:256], R2
ARM ARM uses “@” but this is not recommended in source code
GNU gas currently only accepts “[Rn,:128]” syntax – note extra “,”
Applies to both Cortex-A8 and Cortex-A9 (see TRM for detailed instruction timing)
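The three bitwise multiplex operations from the Logical slide differ only in which register serves as the mask and which one is overwritten. The sketch below models one 64-bit D register per operand. The function names follow the mnemonics but are hypothetical helpers, and each returns the new value of the destination `Dd`.

```c
#include <stdint.h>

/* VBIT Dd, Dn, Dm (insert if true): take Dn's bit where mask Dm is 1, else keep Dd. */
static uint64_t vbit(uint64_t dd, uint64_t dn, uint64_t dm) {
    return (dn & dm) | (dd & ~dm);
}

/* VBIF Dd, Dn, Dm (insert if false): take Dn's bit where mask Dm is 0, else keep Dd. */
static uint64_t vbif(uint64_t dd, uint64_t dn, uint64_t dm) {
    return (dd & dm) | (dn & ~dm);
}

/* VBSL Dd, Dn, Dm (select): Dd holds the mask on entry and is overwritten;
   take Dn's bit where the mask is 1, and Dm's bit where it is 0. */
static uint64_t vbsl(uint64_t dd, uint64_t dn, uint64_t dm) {
    return (dn & dd) | (dm & ~dd);
}
```

Choosing among the three is mostly a register-allocation decision: use whichever form overwrites the operand that is no longer needed afterwards.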
Dual issue [Cortex-A8 only]

NEON can dual issue NEON in the following circumstances
  No register operand/result dependencies
  NEON data processing (ALU) instruction
  NEON load/store or NEON byte permute instruction or MRC/MCR
  VLDR/VSTR, VLDn/VSTn, VMOV, VTRN, VSWP, VZIP, VUZP, VEXT, VTBL, VTBX
    VLD1.8 {D0}, [R1]!
    VMLAL.S8 Q2, D3, D2
    VEXT.8 D0, D1, D2, #1
    SUBS r12, r12, #1
Also can dual-issue NEON with ARM instructions
    VLD1.8 {D0}, [R1]!
    SUBS r12, r12, #1

Thank you!

ARM Architecture has evolved with a balance of pure RISC and customer driven input
NEON offers a clean architecture targeted at compiler code generation, offering
  Unaligned access
  Structure load/store operations
  Dual D-register/Q-register view to optimize register bank
  Balance of performance vs. gatecount
Cortex-A9 and ARM’s hard macros provide a scalable low-power solution that is suitable for a wide range of high-performance consumer applications
