ARM System Architectures 09-02-2016
http://en.wikipedia.org/wiki/Advanced_Microcontroller_Bus_Architecture
Overview of the AMBA protocol family (based on [])
[Figure: the AMBA protocol families — from ASB and APB (APB™, APB2, APB v1.0, APB v2.0) through AHB and AXI to the AMBA Coherency Extensions (ACE™, ACE-Lite) and the Coherent Hub Interface (CHI)]
5.2.2 The ASB bus (Advanced System Bus) []
Main features of the ASB bus
a) The ASB bus is a high-performance parallel bus; more precisely, it is a high-performance
bus IP that SoC designers can use.
b) Bus operation supports
• multiple masters,
• burst transfers and
• two-stage pipelining (bus granting and bus transfers may be performed
in parallel).
Nevertheless, the ASB bus has the limitation that only a single master
may be active at a time.
c) The interface signals include
• mostly uni-directional lines, like the address, control or transfer response lines,
• but the data lines carrying write and read data between the masters and the
slaves are bi-directional.
In addition, the data lines of the APB bus are also bi-directional.
d) The ASB protocol makes use of both edges of the clock signal.
Interface signals of ASB masters []
Interface signals of ASB slaves []
Principle of operation of the ASB bus-1 []
The ASB protocol supports the following transfer sizes:
• 8-bit (byte),
• 16-bit (halfword) and
• 32-bit (word).
They are encoded in the BSIZE[1:0] signals, which are driven by the active bus master
and have the same timing as the address bus [a].
By contrast, the AHB protocol allows in addition significantly wider data buses,
as discussed later.
Multi-master operation []
• A simple two wire request/grant mechanism is implemented between the arbiter
and each bus master.
• The arbiter ensures that only a single bus master may be active on the bus and
also ensures that when no masters are requesting the bus a default master
is granted.
• The specification also supports a shared lock signal.
This signal allows bus masters to indicate that the current transfer is indivisible
from the subsequent transfer and will prevent other bus masters from
gaining access to the bus until the locked transfer has completed.
• The arbitration protocol is defined, but the prioritization scheme is left to the
application.
Black box layout of the ASB arbiter assuming three bus masters []
Description of the operation of the arbiter (simplified)
• The ASB bus protocol supports a straightforward form of pipelined operation,
such that arbitration for the next transfer is performed during the current
transfer.
• The ASB bus can be re-arbitrated on every clock cycle.
• The arbiter samples all the request signals (AREQx) on the falling edge of BCLK,
and during the low phase of BCLK it asserts the appropriate grant signal
(AGNTx) based on its internal priority scheme and the value of the lock signal
(BLOK), as sketched below.
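The arbitration behaviour just described can be illustrated with a minimal C sketch. The fixed-priority order, the default-master choice and the plain-variable signal names (AREQx, AGNTx, BLOK) are assumptions made only for illustration, since the ASB specification leaves the prioritization to the application.

#include <stdbool.h>

#define NUM_MASTERS    3    /* e.g. three bus masters, as in the figure     */
#define DEFAULT_MASTER 0    /* assumed default master when nobody requests  */

/* One arbitration step, evaluated once per BCLK cycle (illustrative model).
 * areq[i] : request line AREQx of master i
 * blok    : lock signal BLOK driven by the currently granted master
 * current : index of the currently granted master
 * Returns the index of the master that receives AGNTx for the next transfer. */
int asb_arbitrate(const bool areq[NUM_MASTERS], bool blok, int current)
{
    /* A locked transfer keeps the bus with the current master. */
    if (blok)
        return current;

    /* Assumed fixed-priority scheme: lower index = higher priority.
     * The real prioritization is implementation defined. */
    for (int i = 0; i < NUM_MASTERS; i++)
        if (areq[i])
            return i;

    /* No master requests the bus: grant the default master. */
    return DEFAULT_MASTER;
}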
The APB bus (Advanced Peripheral Bus)-1 []
It appears as a local secondary bus encapsulated as a single slave device,
as indicated below.
For highest performance, typical designs based on ASB use an ARM processor with
a write-back cache. A write-back cache is a cache algorithm that allows data to be
written into the cache without updating the system memory. Since ASB does not have
any provisions for maintaining cache coherency between multiple caching bus masters,
only one processor can be used on ASB.
Jason R. Andrews Co-Verification of Hardware and Software for ARM SoC Design
Elsevier Inc. 2005
5.3 The AMBA 2 protocol family
5.3.1 Overview (based on [])
[Figure: the AMBA protocol family releases — (1996), (5/1999), (6/2003), (3/2010), (6/2013) — with the APB variants APB™, APB2, APB v1.0 and APB v2.0, the ACE™/ACE-Lite coherency extensions and the Coherent Hub Interface (CHI)]
Jason R. Andrews Co-Verification of Hardware and Software for ARM SoC Design
Elsevier Inc. 2005
a) Split transactions-1
Transactions are split into two phases, into the address and the data phases,
as shown below.
Splitting the transfer into two phases allows overlapping the address phase
of any transfer with the data phase of the previous transfer, as discussed later.
Wait states
Figure: Example of a split read or write transaction with two wait states []
Split transactions-3
Overlapping the address and data phases of different transfers (as shown below)
increases the pipeline depth of the bus operation from two to three and thus
contributes to higher performance.
Table: Transfer sizes in the AHB protocol indicated by the HSIZE[2:0] signals []
Of the available data bus width options, in practice only the 32-, 64-, or 128-bit
wide alternatives are used.
d) Using only uni-directional signals-1
The AHB protocol makes use only of uni-directional data buses, as shown below.
http://www.design-reuse.com/articles/24123/amba-ahb-to-axi-bus-comparison.html
Example operation of the AHB bus for three masters and four slaves []
Nevertheless, the APB2 protocol still does not support any pipelining of the
address and control signals.
Example APB2 write transfer
• The write transfer starts with the address, write data, write signal and select
signal all changing after the rising edge of the clock.
• After the following clock edge the enable signal (PENABLE) is asserted, indicating
that the ENABLE cycle is taking place.
• The address, data and control signals all remain valid through the ENABLE cycle.
• The transfer completes at the end of this cycle, when the PENABLE signal is
deasserted, as modelled in the sketch below.
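The two-cycle APB2 write just described can be modelled with a small C sketch; the structure, function name and printed messages are illustrative assumptions, not ARM code.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Illustrative model of the APB signals driven by the bridge. */
struct apb_bus {
    uint32_t paddr;
    uint32_t pwdata;
    bool     pwrite;
    bool     psel;
    bool     penable;
};

/* One APB2 write transfer takes two clock cycles:
 * cycle 1 (SETUP) : address, write data, PWRITE and PSEL change, PENABLE low
 * cycle 2 (ENABLE): PENABLE is asserted; all other signals stay valid
 * After the ENABLE cycle PENABLE and PSEL are deasserted again. */
void apb2_write(struct apb_bus *bus, uint32_t addr, uint32_t data)
{
    /* SETUP cycle */
    bus->paddr = addr; bus->pwdata = data;
    bus->pwrite = true; bus->psel = true; bus->penable = false;
    printf("SETUP : PADDR=0x%08x PWDATA=0x%08x\n", addr, data);

    /* ENABLE cycle */
    bus->penable = true;
    printf("ENABLE: transfer completes at the end of this cycle\n");

    /* back to idle */
    bus->psel = false; bus->penable = false;
}

int main(void)
{
    struct apb_bus bus = {0};
    apb2_write(&bus, 0x40001000u, 0xCAFEBABEu);   /* hypothetical peripheral address */
    return 0;
}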
5.3.4 The AHB-Lite bus extension
In 2001 ARM extended the original AHB bus in two directions [], as shown below:
http://www.design-reuse.com/news/856/arm-multi-layer-ahb-ahb-lite.html
The AHB-Lite specification
• The AHB-Lite bus was launched along with the Multi-layer AHB specification
in 2001 as an extension of the AHB bus [a].
Subsequently it was specified in a stand alone document in 2006 [b].
• The AHB-Lite bus is considered as being part of the AMBA 2 protocol family.
[a] http://www.design-reuse.com/news/856/arm-multi-layer-ahb-ahb-lite.html
[b]
Key features of the AHB-Lite bus []
• AHB-Lite is a subset of AHB.
• It simplifies platform designs that include only a single master.
Key features:
• Single master
• Simple slaves
• Easier module design/debug
• No arbitration issues
http://www.hipeac.net/system/files/cm0ds_2_0.pdf
An example AMBA system based on the AHB-Lite bus []
http://web.mit.edu/clarkds/www/Files/slides1.pdf
Block diagram of an example AMBA system based on the AHB-Lite bus []
http://www.hipeac.net/system/files/cm0ds_2_0.pdf
5.3.5 The Multi-layer AHB bus
In 2001 ARM extended the original AHB bus in two directions [], as shown below:
http://www.design-reuse.com/news/856/arm-multi-layer-ahb-ahb-lite.html
The AHB bus interconnect
[Figure: single-layer AHB interconnect vs. multi-layer AHB interconnect]
Multi-layer AHB Overview, DVI 0045A, ARM Limited, 2001
Block diagram of a three Masters/four Slaves multi-layer interconnect []
(Only the Master to Slave direction is shown)
Multi-layer AHB Overview, DVI 0045A, ARM Limited, 2001
http://pdf.datasheetarchive.com/indexerfiles/Datasheets-SL1/DSASL001562.pdf
Example operation of a three Masters/four Slaves multi-layer interconnect []
(Only the Master to Slave direction is shown)
http://www.13thmonkey.org/documentation/ARM/multilayerAHB.pdf
Main benefits of a multi-layer AHB interconnect []
• It allows multiple transactions from multiple masters to different slaves at a time,
in effect implementing a crossbar interconnect, as indicated in the next Figure.
This results in increased bandwidth.
• Standard AHB master and slave modules can be used without modification.
The only hardware that has to be added to the standard AHB solution is the
multiplexor block needed to connect multiple masters to the slaves.
Multi-layer AHB Overview, DVI 0045A, ARM Limited, 2001
http://pdf.datasheetarchive.com/indexerfiles/Datasheets-SL1/DSASL001562.pdf
5.4 The AMBA 3 protocol family
5.4.1 Overview
[Figure: evolution of the AMBA protocol family, as shown in the overview of Section 5.3.1]
The Read data channel provides the read data sent during the burst transfer from the
slave to the master.
Write channels-1
The AXI protocol defines the following channels for writes:
The VALID signal indicates the validity of the information sent from the master
to the slave whereas the READY signal acknowledges the receipt of the information.
This straightforward synchronization mechanism simplifies the interface design, as the sketch below illustrates.
http://www.doulos.com/knowhow/arm/Migrating_from_AHB_to_AXI/
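As a minimal illustration of the VALID/READY rule (information is transferred in any cycle where both signals are high), consider the following C sketch; the cycle-by-cycle model and the names used are assumptions for illustration only.

#include <stdbool.h>
#include <stdio.h>

/* One channel direction: the sender drives VALID and the payload,
 * the receiver drives READY. A beat is transferred in any cycle where
 * both VALID and READY are high. */
struct channel {
    bool valid;
    bool ready;
    int  payload;
};

/* Evaluate one clock cycle; returns true when the beat is accepted. */
bool channel_cycle(const struct channel *ch)
{
    bool handshake = ch->valid && ch->ready;
    if (handshake)
        printf("beat accepted, payload = %d\n", ch->payload);
    return handshake;
}

int main(void)
{
    struct channel ch = { .valid = true, .ready = false, .payload = 42 };
    channel_cycle(&ch);     /* no transfer: receiver not ready            */
    ch.ready = true;
    channel_cycle(&ch);     /* transfer: both VALID and READY are high    */
    return 0;
}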
Providing sets of handshake and LAST signals for each channel
There is a different set of handshake and LAST signals for each of the channels,
e.g. the Write address channel has the following set of handshake and LAST signals:
http://www.doulos.com/knowhow/arm/Migrating_from_AHB_to_AXI/
b4) Identifying different phases of the same transaction-1
• Each transaction is identified by an ID tag that allows related transaction
phases to be associated with an individual read or write burst.
• ID tags support multi-master out-of-order transactions for increased data
throughput, since out-of-order transactions can be sorted out at the destination.
b4) Identifying different phases of the same transaction-2
There are individual four-bit ID tags for each of the five transaction channels,
as follows:
• AWID: The ID tag for the write address group of signals.
• WID: The write ID tag for a write burst.
Along with the write data, the master transfers a WID to match the
AWID of the corresponding address.
• BID: The ID tag for the write response.
• The write response (BRESP) indicates the status of the write burst
performed (OK etc.).
The slave transfers a BID to match the AWID and WID of the transaction
to which it is responding.
• ARID: The ID tag for the read address group of signals.
• RID: The read ID tag for a read burst.
The slave transfers an RID to match the ARID of the transaction to which
it is responding.
b4) Identifying different phases of the same transaction-3
All transaction phases with a given ID tag must be ordered within an individual read or
write burst (as indicated in the next Figure), but transactions with different ID
tags need not be ordered, as the sketch below also illustrates.
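A minimal C sketch of this ordering rule follows: responses carrying the same ID must arrive in issue order, while different IDs may interleave freely. The data structures and the checking function are illustrative assumptions, not part of the AXI specification.

#include <stdbool.h>
#include <stdio.h>

/* completed[] lists transaction numbers in the order their responses arrived;
 * transaction numbers reflect issue order (0 was issued first).
 * id[] gives the ID tag of each transaction.
 * Legal iff, for every pair with equal ID, the earlier-issued one completes first. */
bool order_is_legal(const int completed[], const int id[], int n)
{
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            int a = completed[i], b = completed[j];
            /* same ID but the later-issued transaction completed first -> illegal */
            if (id[a] == id[b] && a > b)
                return false;
        }
    return true;
}

int main(void)
{
    /* transactions 0..3 issued in that order; 0 and 2 share ID 1, 1 and 3 share ID 7 */
    int id[]  = { 1, 7, 1, 7 };
    int ok[]  = { 1, 0, 3, 2 };   /* legal: per-ID order preserved              */
    int bad[] = { 2, 0, 1, 3 };   /* illegal: ID 1 completes 2 before 0         */
    printf("%d %d\n", order_is_legal(ok, id, 4), order_is_legal(bad, id, 4));
    return 0;
}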
Example: Identification of the three phases of an AXI write burst []
[Figure: write address transaction, write data transaction and write response transaction of the same burst tied together by matching ID tags]
http://www.doulos.com/knowhow/arm/Migrating_from_AHB_to_AXI/
c) Support of out-of-order transactions
The AXI protocol (AXI3 and its revisions, such as the AXI4) allows out-of-order
transactions to provide higher performance compared with the AHB protocol.
Out-of-order transactions are supported when the bus protocol allows
• issuing multiple outstanding transfers and
• completing transactions out-of-order,
as indicated below.
Out-of-order transactions
The AHB protocol assumes a shared bus interconnect that performs
• arbitration between multiple bus masters and
• multiplexing the signals of the masters and the slaves,
as the next Figure shows.
In addition, the AHB protocol allows overlapping the address and data phases of
transactions of different masters, yielding a two-stage pipelined bus operation
(beyond the overlapped bus-granting operation as a third stage).
By contrast, the AXI3 protocol and its further revisions assume that masters and
slaves are connected together in a more flexible way by some sort of an interconnect,
as shown below.
Figure: Assumed interconnect between masters and slaves in the AXI3 protocol []
Shared buses have the limitation of allowing only a single transaction from a granted
source to a specified destination at a time, whereas
crossbar switches allow multiple transactions from multiple sources to different
destinations at a time, as shown in the next Figures.
Interconnecting AXI masters and slaves-4 – the master to slave direction []
[Figure: masters M0–M2 connected to slaves S0–S2; e.g. address/control or write data transactions are routed through a multiplexer per slave]
Interconnecting AXI masters and slaves-5 – the slave to master direction []
[Figure: slaves S0–S2 connected back to masters M0–M2; e.g. read data transactions are routed through a multiplexer per master]
Interconnecting AXI masters and slaves-6 []
• Crossbar switches provide multiple transfers at a time but impose much higher
complexity and implementation cost than shared buses.
• On the other hand different transaction channels in an AXI interconnect carry
different data volumes.
E.g. read or write data channels will transfer more than a single data item in a
transaction, whereas for example, read or write response channels or address
channels transmit only a single data item per transaction.
• Given that the AXI specification defines five independent transaction channels,
an AXI implementation can obviously choose different interconnect types
(shared bus or crossbar) for different transaction channels, depending on the
expected data volume, to optimize cost vs. performance.
• Based on the above considerations, read and write data channels can be expected
to be routed via crossbar switches whereas address and response channels
via shared buses.
http://www.doulos.com/knowhow/arm/Migrating_from_AHB_to_AXI/
Remarks to the system layout []
In an actual system layout all components of an AXI system need to agree on
certain parameters, such as write buffer capability, read data reordering depth and
many others.
http://www.doulos.com/knowhow/arm/Migrating_from_AHB_to_AXI/
Throughput comparison AHB vs. AXI []
• For assessing the throughput of the AHB and the AXI bus it is appropriate
to compare the main features of these buses, as follows:
• The AHB bus is a single-channel shared bus, whereas the AXI bus is a
multi-channel read/write optimized bus.
• In case of the single-layer AHB bus all bus masters or requesting bus ports
may use the same single-channel shared bus.
• In case of a multi-layer AHB bus each bus master or requesting port may use a
different interconnect layer unless they request the same destinations.
• For the AXI bus each bus master or requesting bus may use one of the five
channels (Read address channel, Read data channel, Write address channel,
Write data channel, and Write response channel).
Nevertheless, it is implementation dependent whether individual channels are
built up as shared buses or as crossbars (multi-layer interconnects).
http://www.design-reuse.com/articles/24123/amba-ahb-to-axi-bus-comparison.html
Case example for comparing the bandwidth provided by the AHB and AXI
buses – Digital camera []
http://rtcgroup.com/arm/2007/presentations/179%20-%20Comparative%20Analysis%20of%20AMBA%202.0%20and%20AMBA%203%20AXI%20Protocol-Based%20Subsystems.pdf
System requirements: Digital camera []
http://rtcgroup.com/arm/2007/presentations/179%20-%20Comparative%20Analysis%20of%20AMBA%202.0%20and%20AMBA%203%20AXI%20Protocol-Based%20Subsystems.pdf
AHB implementation []
http://rtcgroup.com/arm/2007/presentations/179%20-%20Comparative%20Analysis%20of%20AMBA%202.0%20and%20AMBA%203%20AXI%20Protocol-Based%20Subsystems.pdf
AXI3 implementation []
http://rtcgroup.com/arm/2007/presentations/179%20-%20Comparative%20Analysis%20of%20AMBA%202.0%20and%20AMBA%203%20AXI%20Protocol-Based%20Subsystems.pdf
5.4.3 The ATB bus (Advanced Trace Bus)
• The ATB bus was first described as part of the CoreSight on-chip debug and trace
tool for AMBA 3 based SOCs, termed as the AMBA 3 ATB protocol in 2004 [a].
Subsequently it was specified in a stand alone document in 2006 [] and
designated as the AMBA 3 ATB protocol v1.0.
• This version is considered as being part of the AMBA 3 protocol family.
• It allows on-chip debugging and trace analysis for AMBA-based SoCs.
• Each IP in the SoC that has trace capabilities is connected to the ATB.
• Master interfaces write trace data onto the ATB bus, while slave interfaces
receive trace data from the ATB [c].
Here we do not want to go into any details of the ATB bus.
[a] http://common-codebase.googlecode.com/svn/trunk/others/Cortex_M0_M3/CoreSight_Architecture_Specification.pdf
http://web.eecs.umich.edu/~prabal/teaching/eecs373-f11/readings/ARM_AMBA3_APB.pdf
ACP (Accelerator Coherency Port) [], []
The ACP port is a standard (64 or 128-bit wide) AXI slave port provided for
non-cached AXI master peripherals, such as DMA Engines or Cryptographic Engines.
It is optional in the ARM11 MPCore and mandatory in subsequent Cortex processors
(except low-cost oriented processors, such as the Cortex-A7 MPCore).
The AXI 64 slave port allows a device, such as an external DMA, direct access to
coherent data held in the processor’s caches or in the memory, so device drivers
that use ACP do not need to perform cache cleaning or flushing to ensure cache
coherency.
Implementing DMA on ARM SMP Systems Application Note 228 8/2009 ARM
ACP is an implementation of an AMBA 3 AXI slave interface. It supports memory coherent
accesses to the Cortex-A15 MPCore memory system, but cannot receive coherent requests,
barriers or distributed virtual memory messages.
[b] http://www.em.avnet.com/en-us/design/trainingandevents/Documents/X-Tech%202012%20Presentations/XTECH_B_AXI4_Technical_Seminar.pdf
5.5.2.2 The AXI4 interface (Advanced eXtensible Interface)
Main updates to AXI3 include []:
• support for burst lengths of up to 256 beats for incrementing bursts
• Quality of Service (QoS) signaling
• updated write response requirements
• additional information on ordering requirements
• optional user signaling
• removal of locked transactions
• removal of write interleaving.
Here we do not go into details of the updates but refer to the given reference.
Nevertheless, we recap key features of the AXI4 interface subsequently, as
a reference for pointing out main differences to AXI4-Lite and AXI4-Streams.
http://www.em.avnet.com/en-us/design/trainingandevents/Documents/X-Tech%202012%20Presentations/XTECH_B_AXI4_Technical_Seminar.pdf
Example for AXI4 transactions []
[Figure: AXI4 read and write transactions between a Master and a Slave]
http://www.em.avnet.com/en-us/design/trainingandevents/Documents/X-Tech%202012%20Presentations/XTECH_B_AXI4_Technical_Seminar.pdf
Key features of the AXI4 interface []
http://www.em.avnet.com/en-us/design/trainingandevents/Documents/X-Tech%202012%20Presentations/XTECH_B_AXI4_Technical_Seminar.pdf
5.5.2.3 The AXI 4-Lite interface
Key features []
http://www.em.avnet.com/en-us/design/trainingandevents/Documents/X-Tech%202012%20Presentations/XTECH_B_AXI4_Technical_Seminar.pdf
5.5.2.4 The AXI 4-Stream interface
Key features []
http://www.em.avnet.com/en-us/design/trainingandevents/Documents/X-Tech%202012%20Presentations/XTECH_B_AXI4_Technical_Seminar.pdf
5.5.3 The APB bus v2.0 (APB4 bus)
Both updates include only minor differences to the previous releases, as documented
in [].
We do not go here into details but refer to the cited publication.
The second release of the ATB bus includes only minor differences to the original
release, as documented in [].
Here, we do not go into details but refer to the cited publication.
http://www.arm.com/files/pdf/CacheCoherencyWhitepaper_6June2011.pdf
All shared transactions are controlled by the ACE coherent interconnect.
ARM has developed the CCI-400 Cache Coherent Interconnect to support
coherency for up to two CPU clusters and three additional ACE-Lite I/O coherent
masters, as indicated in the next Figure.
http://www.arm.com/files/pdf/CacheCoherencyWhitepaper_6June2011.pdf
Key features of the ACE protocol []
The ACE protocol provides a framework for maintaining system level coherency
while leaving the freedom for system designers to determine
• the ranges of memory that are coherent,
• the memory system components for implementing the coherency extensions and
also
• the software models used for the communication between system components.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ihi0022d/index.html
Implementation of the ACE protocol
http://www.ece.cmu.edu/~ece742/lib/exe/fetch.php?media=arm_multicore_and_system_coherence_-_cmu.pdf
About snoop filtering
Snoop filtering tracks the cache lines that are allocated in a master's cache. To support an external snoop filter, a cached master must be able to broadcast which cache lines are allocated and which are evicted.
Support for an external snoop filter is optional within the ACE protocol. A master component must state in its data sheet if it provides support. See Chapter C10 Optional External Snoop Filtering for the mechanism the ACE protocol supports for the construction of an external snoop filter.
For a master component that does not support an external snoop filter, the cache line states permitted after a transaction has completed are less strict.
2.3.3 Introducing the AMBA 4 ACE interface
Evolution of the AMBA specifications (based on [])
[Figure: timeline of the AMBA specifications and their buses/interfaces — Advanced System Bus (ASB), Advanced High-Performance Bus (AHB), Advanced Peripheral Bus (APB), Advanced Trace Bus (ATB), Advanced eXtensible Interface (AXI), Accelerator Coherency Port (ACP), AXI Coherency Extensions (ACE) and the Coherent Hub Interface (CHI)]
http://www.rapidio.org/wp-content/uploads/2014/10/OSS-2014-ARM64-Coherent-Scale-Out-over-RapidIO-V4.pdf
The ACP (Accelerator Coherency Port) provided in the Cortex-A5 MPCore and Cortex-A9 MPCore
is replaced by ACE in the Cortex-A7 MPCore and subsequent processors.
Cortex-A9 MPCore: an optional Accelerator Coherency Port (ACP) suitable for coherent memory transfers.
Cortex-A5 MPCore: an Accelerator Coherency Port (ACP), an optional AXI 64-bit slave port that can be
connected to a non-cached peripheral such as a DMA engine.
ARM extended AMBA 3 AXI to AMBA 4 ACE (AMBA with Coherency Extensions) by
three additional channels and a number of additional signals
in order to implement system-wide coherency, as the next Figure indicates.
Extension of the AMBA 3 (AXI) interface with snoop channels (ACADDR, CRRESP, CDDATA)
and additional signals to form the AMBA 4 (ACE) interface []
http://www.ece.cmu.edu/~ece742/lib/exe/fetch.php?media=arm_multicore_and_system_coherence_-_cmu.pdf
Use of the additional snoop channels []
• The ACADDR channel is a snoop address input to the master.
• The CRRESP channel is used by the master to signal the response to snoops
to the interconnect.
• The CDDATA channel is output from the master, transferring snoop data to the
originating master and/or external memory; a much simplified sketch of a master's snoop handling is given below.
http://www.arm.com/files/pdf/CacheCoherencyWhitepaper_6June2011.pdf
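The sketch below shows, in C, how a master might react to a snoop received on the ACADDR channel. The cache-state model, the response fields and the helper names are illustrative assumptions; they do not reproduce the exact ACE CRRESP encodings, and a read-shared type of snoop is assumed.

#include <stdbool.h>

/* Simplified local cache line state of the snooped master. */
enum line_state { INVALID, SHARED_CLEAN, UNIQUE_CLEAN, SHARED_DIRTY, UNIQUE_DIRTY };

/* Abstract snoop response: roughly what the master reports on CRRESP,
 * plus whether data is returned on CDDATA. */
struct snoop_response {
    bool data_transfer;   /* master supplies the line on CDDATA            */
    bool pass_dirty;      /* responsibility for writing back is passed on  */
    bool is_shared;       /* master keeps a (shared) copy                  */
};

/* React to a snoop address received on ACADDR (illustrative only). */
struct snoop_response handle_snoop(enum line_state *line)
{
    struct snoop_response r = { false, false, false };

    if (*line == INVALID)
        return r;                          /* nothing cached: miss response */

    r.data_transfer = true;                /* provide the data on CDDATA    */
    if (*line == UNIQUE_DIRTY || *line == SHARED_DIRTY)
        r.pass_dirty = true;               /* hand over the dirty copy      */

    *line = SHARED_CLEAN;                  /* assumed read-shared snoop:    */
    r.is_shared = true;                    /* keep a shared, clean copy     */
    return r;
}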
2.2.2 ARM's first-generation cache coherency management (based on [])
[Figure: ARM MPCore (MPC) processors by DMIPS/fc and announcement/release dates]
[a]
[b]
[c] http://www.mpsoc-forum.org/previous/2003/slides/MPSoC_ARM_MP_Architecture.pdf
Principle of ARM’s 1. generation cache coherency management []
• To achieve cache coherency, ARM developed a specific scheme unlike the usual
snooping or directory based approaches.
• In this scheme the cores send read/write requests to a central coherency
control unit (the SCU) via the AHB bus and augment these requests with relevant
cache state information sent over a dedicated bus, called the CCB Bus
(Coherency Control Bus), as shown below.
Note that in usual implementations the coherency control unit observes the read/write
requests of the cores (and external I/O channels) and if needed sends snoop
requests to the cache controllers to be informed about the state of the referenced
cache line. ([]: [b] on the previous page)
(cont.)
• The additional information sent over the CCB bus to the SCU specify e.g.
whether or not data requested are held in the caches, what the status of the
referenced cache line is, etc.
• Based on the cache coherency model chosen and by taking into account the
additional information delivered by the CCB signals, the SCU decides on the
required actions needed to maintain cache coherency for read and write requests
of the cores and sends the appropriate coherency commands to the cache
controllers via the CCB bus.
Here we note that both the patent description and its first implementation in the
ARM11 MPCore make use of the MESI protocol.
Remark
To outline the signals carried over the CCB bus, subsequently we cite an excerpt
from the patent description [], with minor modifications to increase readability.
“Coherency request signals are characterizing the nature of a memory access being
requested such that the coherency implications associated with that memory
access request can be handled by the snoop control unit.
As an example, line fill read requests for the cache memory associated with a
coherent multi-processing core may be augmented to indicate whether they are a
simple line fill request or a line fill and invalidate request whereby the snoop
control unit should invalidate other copies of the data value concerned which are
held elsewhere.
In a similar way, different types of Write request may be distinguished between by
the coherency request signals on the CCB in a manner which can then be acted
upon by the snoop control unit.
The core status signals pass coherency related information from the core to the
snoop control unit such as, for example, signals indicating whether or not a
particular core is operating in a coherent multi-processing mode, is ready to
receive a coherency command from the snoop control unit, and does or does not
have a data value which is being requested from it by the snoop control unit.”
The core sideband signals passed from the core to the snoop control unit via the
CCB include signals indicating that the data being sent by the core is current valid
data and can be sampled, that the data being sent is “dirty” and needs to
be written back to its main stored location, and elsewhere as appropriate, that the
data concerned is within an eviction Write buffer and is no longer present within
the cache memory of the core concerned, and other signals as may be required.
The snoop control unit coherency commands passed from the snoop control unit to
the processor core include command specifying operations relating to coherency
management which are required to be performed by the processor core under
instruction of the snoop control unit. As an example, a forced change in the status
value associated with a data value being held within a cache memory of a
processor core may be instructed such as to change that status from modified or
exclusive status to invalid or shared in accordance with the applied coherency
protocol.
Other commands may instruct the processor core to provide a copy of a current
data value to the snoop control unit such that this may be forwarded to
another processor core to service a memory read request, from that processor
core. Other commands include, for example, a clean command.”
Implementation of ARM's first-generation cache coherency management concept
in the ARM11 MPCore processor
The cache management technique described in the patents [], [] was first implemented
in the ARM11 MPCore processor (2005), which may include up to four cores.
Block diagram and key features of the implemented coherency management
technique are shown in the next Figure.
Block diagram of the ARM11 MPCore processor []
[Figure: up to four cores with private L1D caches, the SCU with copies of the L1D tag RAMs (L1D tagRAM 0–3), the CCB (Coherency Control Bus) between the cores and the SCU, a 64-bit AXI3 master port plus an optional second one, a shared L2 cache controller (L2C310) with L2 cache, and the memory (SDRAM/DDR/LPDDR)]
Key features of the implemented coherency management:
• Enhanced MESI
• The SCU holds copies of each L1D directory to reduce snoop traffic between the L1D caches and the L2
• Direct cache-to-cache transfers are supported
Block diagram of the Cortex-A9 MPCore processor []
[Figure: Cortex-A9 MPCore with Generic Interrupt Controller, SCU, CCB, L2 cache controller (L2C-310) and AXI3 master ports (optional low-power interface) to the memory (SDRAM/DDR/LPDDR)]
Introduction of a network interconnect along with the Cortex-A9 MPCore
[Figure: Cortex-A9 and Cortex-A5 MPCore based systems (CPU0–CPU3, GPU) attached through 64-bit AXI3 ports to a network interconnect and a DFI2 memory interface; later designs move to 128-bit ACE/ACE-Lite ports with DFI2.1, and subsequently to ACE or CHI ports]
http://www.iet-cambridge.org.uk/arc/seminar07/slides/JohnGoodacre.pdf
ARM's key extensions to the MESI protocol
Already in their patent applications [], [], ARM made three key extensions to the
MESI protocol, as described next:
a) Direct Data Intervention (DDI)
b) Duplicating tag RAMs
c) Migratory line.
These extensions were implemented in the ARM11 MPCore and ARM's subsequent
multicore processors, as follows.
US 7,162,590 B2
a) Direct Data Intervention (DDI) [c]
• Operating systems often migrate tasks from one core to another.
In this case the migrated task needs to access data that is stored in the L1
cache of another core.
• Without a snooping mechanism (ARM's cache management technique avoids
snooping for core requests), task migration becomes a complicated and lengthy process.
First the original core needs to invalidate and clean the relevant cache lines out
to the next level of the memory architecture.
Subsequently, once the data is available from the next level of the memory
architecture (e.g. from the L2 or main memory), the data has to be loaded
into the new core’s data cache, as indicated in the right side of the next Figure).
• DDI eliminates this problem as with DDI the SCU will receive the cache line from
the owner cache immediately and will forward it to the requesting core without
accessing the next level of the memory hierarchy.
Figure: Accessing cached data during core migration without cache-to-cache transfer
Figure: Accessing cached data during core migration with cache-to-cache transfer
The OS must ensure that cache lines are cleaned before an outgoing DMA transfer is started, and invalidated before a memory range affected by an incoming DMA transfer is accessed. This causes some overhead for DMA operations, since such operations are usually performed by loops directed by the OS.
Alternatively, accesses to shared DMA memory regions are routed to the cache controller, which cleans the relevant cache lines for DMA reads or invalidates them for DMA writes. A minimal sketch of the software-managed variant follows below.
Implementing DMA on ARM SMP Systems Application Note 228 8/2009 ARM
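The software-managed I/O coherency just described can be sketched in C as follows. clean_dcache_range(), invalidate_dcache_range() and the start_dma_* calls are hypothetical helpers assumed only for illustration; on a real system they would map to the corresponding cache maintenance operations and DMA driver calls.

#include <stddef.h>

/* Hypothetical cache maintenance and DMA primitives (assumed, not a real API). */
void clean_dcache_range(void *addr, size_t len);       /* write dirty lines back   */
void invalidate_dcache_range(void *addr, size_t len);  /* drop stale cached copies */
void start_dma_to_device(const void *buf, size_t len);
void start_dma_from_device(void *buf, size_t len);

/* Outgoing DMA: the device reads from memory, so dirty cache lines
 * must be cleaned (written back) before the transfer starts. */
void dma_send(const void *buf, size_t len)
{
    clean_dcache_range((void *)buf, len);
    start_dma_to_device(buf, len);
}

/* Incoming DMA: the device writes to memory, so the CPU's cached copies
 * of the buffer must be invalidated before the received data is read. */
void dma_receive(void *buf, size_t len)
{
    invalidate_dcache_range(buf, len);
    start_dma_from_device(buf, len);
    /* ... wait for DMA completion, then the CPU may read buf ... */
}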
Example: Maintaining software managed I/O coherency for writes []
[Figure: timeline of ARM MPCore (MPC) processors, announcement/release dates]
http://www.mpsoc-forum.org/previous/2008/slides/8-6%20Goodacre.pdf
Block diagram of the Cortex-A9 MPCore (4/2008)
[Figure: Cortex-A9 MPCore with SCU, L1D Tag RAMs, an ACP slave port (AXI3, 64-bit) for accelerators/DMA engines, and AXI3-64 master ports (Master 0, optional Master 1) to the memory]
• Enhanced MESI
• The SCU holds copies of each L1D directory to reduce snoop traffic between L1D and L2
• Cache-to-cache transfers supported
Support of attaching accelerators to the system architecture effectively
via a dedicated, cache coherent port
Alternative ways to attach accelerators to SoC designs []
Basically, there are three alternatives to attach accelerators to a system architecture,
as follows:
• Attaching accelerators over a system interconnect,
• attaching accelerators tightly coupled to the processor, and
• attaching accelerators on die
as indicated on the next Figure.
2008
Figure: Alternative ways to attach accelerators to a system architecture []
a) Principle of attaching an accelerator to the system via an interconnect-1 []
Principle of attaching an accelerator to the system via an interconnect-2 []
http://mobile.arm.com/markets/mobile/computing.php
Note
In this case cache coherency needs to be maintained by software.
System example 2: Attaching an accelerator (Mali GPU) and DMA
to the system via an interconnect (NIC-400 X-bar switch) []
Note http://www.arm.com/products/processors/cortex-a/cortex-a5.php
2008
c) Attaching an accelerator directly to the processor via the ACP port []
2008
The ACP (Accelerator Coherency Port) [], []
• The ACP port is a standard AMBA 3 AXI (AXI3) slave port provided for non-cached
peripherals, such as DMA Engines or Cryptographic Engines.
It is an interconnect point for a range of peripherals that for overall system
performance, power consumption or software simplification are better interfaced
directly with the processor.
• The ACP port allows a device direct access to coherent data held in the processor’s
caches or in the memory, so device drivers that use ACP do not need to perform
cache maintenance, i.e. cache cleaning or flushing, to ensure cache coherency.
• It is 64 bits wide in ARM's second-generation cache coherency implementations
(in the ARM Cortex-A9/A5 MPCore processors), but 64 or 128 bits wide in the subsequent
third-generation implementations.
• The ACP port is optional in the ARM Cortex-A9 MPCore and mandatory in
subsequent Cortex-A processors (except low-cost oriented processors, such as
the Cortex-A7 MPCore).
A9 MPCore trm
Implementing DMA on ARM SMP Systems Application Note 228 8/2009 ARM
System example for using the ACP port []
http://www.iet-cambridge.org.uk/arc/seminar07/slides/JohnGoodacre.pdf
Comparing the traditional way of attaching accelerators with attaching
accelerators via the ACP port []
2008
Benefits of attaching accelerators via the ACP port []
• About 25 % reduction in memory transactions due to reduction of cache flushes,
• software no longer needs to be concerned with cache flushes, which can be
particularly troublesome on a multicore processor.
2008
Remark
• In 10/2006 AMD announced their Fusion project to integrate the CPU and GPU
(i.e. an accelerator) on the same die.
• Nevertheless, at the same time AMD did not reveal any concept how to solve
the data communication overhead between the CPU and the GPU.
• This missing concept was then introduced as the Heterogeneous System
Architecture (HSA) concept in 6/2011 (at AMD's Fusion Developer Summit).
Originally HSA was termed the Fusion System Architecture (FSA), but it was
renamed HSA in 1/2012 to give the concept a chance to become an open
standard.
Actually, HSA is an optimized platform architecture for OpenCL providing a unified
coherent address space allowing heterogeneous cores working together
seamlessly in coherent memory.
Remark on the implementation of cache coherency management in the
Cortex-A5 MPCore (5/2010)
It is basically the same implementation as in the Cortex-A9 MPCore; nevertheless,
the L1D cache coherency protocol was changed from enhanced MESI to enhanced
MOESI [].
[Cortex-A5 MPCore TRM]
2.2.4 ARM's third-generation cache coherency management (based on [])
[Figure: ARM MPCore (MPC) processors by DMIPS/fc and announcement/release dates]
[Cortex-A15 TRM], [Cortex-A7 TRM]
b) Enhanced master port carrying external snoop requests
With the third generation cache coherency technology the master port became a
single
• ACE 64/128 bit master port for the Cortex-A15/A7/A12 processors and
• ACE or CHI master interface for the Cortex-A57/53 processors,
instead of one standard and one optional AMBA 3 AXI-64 ports provided by the
previous generation cache coherency technique.
Note that the key difference concerning the port provision is that the ACE or CHI
buses carry snoop requests from external sources, whereas the AXI3 bus did not.
This is a major enhancement, since it provides the possibility to build cache coherent
systems of multiple heterogeneous processors, including processor core clusters.
Consequently, the new design has to handle both internal and external snoop
requests.
Case example 1:
ARM’s Cortex-A57 processor []
SCU
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0488c/DDI0488C_cortex_a57_mpcore_r1p0_trm.pdf
Case example 2: Cache coherent system based on a Cache Coherent
Network controller []
Snoop requests
http://www.ece.cmu.edu/~ece742/lib/exe/fetch.php?media=arm_multicore_and_system_coherence_-_cmu.pdf
http://www.arm.com/files/pdf/CacheCoherencyWhitepaper_6June2011.pdf
Relationship between ACE states and MOESI states []
• The ACE states can be mapped directly onto the MOESI cache coherency states.
Nevertheless, ACE is designed to support components that may use different
cache state models, including MESI, MOESI, MEI etc..
Thus some components may not support all ACE transactions, e.g. the ARM
Cortex-A15 internally makes use of the MESI states for providing cache coherency
for the L1 data caches, meaning that the cache cannot be in the SharedDirty
(Owned) state.
• To emphasize that ACE is not restricted to the MOESI cache state model, ACE
does not use the familiar MOESI terminology.
http://www.arm.com/files/pdf/CacheCoherencyWhitepaper_6June2011.pdf
ACE cache line states and their alternative MOESI naming []
http://www.arm.com/files/pdf/CacheCoherencyWhitepaper_6June2011.pdf
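The naming correspondence in the table referenced above can be summarized in a small C enum. The mapping follows the ARM cache coherency whitepaper cited here; the enum itself is only an illustrative convention.

/* ACE cache line states and their conventional MOESI names.
 * ACE describes a line along the axes Valid/Invalid, Unique/Shared
 * and Clean/Dirty instead of using the MOESI terminology. */
enum ace_state {
    ACE_INVALID,        /* MOESI: Invalid   */
    ACE_UNIQUE_CLEAN,   /* MOESI: Exclusive */
    ACE_UNIQUE_DIRTY,   /* MOESI: Modified  */
    ACE_SHARED_CLEAN,   /* MOESI: Shared    */
    ACE_SHARED_DIRTY    /* MOESI: Owned     */
};

const char *moesi_name(enum ace_state s)
{
    switch (s) {
    case ACE_UNIQUE_DIRTY:  return "Modified";
    case ACE_SHARED_DIRTY:  return "Owned";
    case ACE_UNIQUE_CLEAN:  return "Exclusive";
    case ACE_SHARED_CLEAN:  return "Shared";
    default:                return "Invalid";
    }
}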
Implementing different system wide and processor wide cache coherency
protocols
While ARM makes use of the MOESI protocol to provide system-wide cache
coherency, it prefers to employ the MESI protocol for processor-wide
cache coherency in most of its multicore processors.
This can easily be implemented as follows.
The cache lines incorporate state information for the 5-state MOESI protocol,
but for maintaining multicore coherency only four states are used, such that the
SharedDirty and SharedClean states are both considered as SharedDirty
states, meaning that they need to be written back to the memory.
2.3.4 Supporting two types of coherency, full coherency and I/O coherency
Types of coherency []
http://community.arm.com/groups/processors/blog/2013/12/03/extended-system-coherency--part-1--cache-coherency-fundamentals
Remark
The AMBA 4 ACE-Lite interface has the additional signals on the existing channels
but not the new snoop channels.
ACE-Lite masters can snoop ACE-compliant masters, but cannot themselves be snooped.
http://www.arm.com/files/pdf/CacheCoherencyWhitepaper_6June2011.pdf
Example 1: Full coherency for processors, I/O coherency for interfaces and
accelerators []
Example 2: Snooping transactions in case of full coherency []
http://www.ece.cmu.edu/~ece742/lib/exe/fetch.php?media=arm_multicore_and_system_coherence_-_cmu.pdf
Example 3: Snooping transactions in case of I/O coherency []
http://www.ece.cmu.edu/~ece742/lib/exe/fetch.php?media=arm_multicore_and_system_coherence_-_cmu.pdf
2.3.5 Introducing the concept of domains
2.3.5 Introducing the concept of domains
In order to increase the efficiency of their system wide cache coherency solution
ARM introduced also coherency domains along with their AMBA 4 ACE interface.
This allows coarse-grain filtering of snoops and directs cache maintenance and DVM
requests in a system with partitioned memory.
Domain types
http://www.ece.cmu.edu/~ece742/lib/exe/fetch.php?media=arm_multicore_and_system_coherence_-_cmu.pdf
Note
• The inner domain shares both code and data
(i.e. it runs the same operating system), whereas
• the outer domain shares data but not code.
http://www.arm.com/files/pdf/CacheCoherencyWhitepaper_6June2011.pdf
Domain control pins []
There are three domain control pins in the Cortex-A7/A15 or the Cortex-A57/A53
processors that direct how the snoop and maintenance requests to the system
will be handled by the SCU, as follows in case of the Cortex-A7 processor:
The BROADCASTINNER pin
It controls issuing coherent transactions targeting the Inner Shareable domain on
the coherent interconnect.
When asserted, the processor is considered to be part of an Inner Shareable domain
that extends beyond the processor and Inner shareable snoop and maintenance
operations are broadcast to other masters in this domain on the ACE or CHI
interface.
When BROADCASTINNER is asserted, BROADCASTOUTER must also be asserted.
When BROADCASTINNER is deasserted, the processor does not issue DVM requests
on the ACE AR channel or CHI TXREQ channel.
The BROADCASTOUTER pin
BROADCASTOUTER controls issuing coherent transactions targeting the Outer
shareability domain on the coherent interconnect.
When asserted, the processor is considered to be part of the Outer Shareable
domain and Outer shareable snoop and maintenance operations in this domain
are broadcast externally on the ACE or CHI interface, else not.
A57 trm
The BROADCASTCACHEMAINT pin
It controls issuing L3 cache maintenance transactions, such as CleanShared and
CleanInvalid, on the coherent interconnect.
When set to 1, this indicates to the processor that there are external downstream
caches (L3 cache) and maintenance operations are broadcast externally; otherwise they are not.
(A small consistency check of these pins is sketched below.)
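A small C helper can make the dependency between the pins explicit: BROADCASTINNER may only be asserted together with BROADCASTOUTER, while BROADCASTCACHEMAINT is independent. The function only illustrates the rule quoted above; the full set of supported configurations is given in the TRM table referenced next.

#include <stdbool.h>

/* Static configuration pins of the processor (tied at integration time). */
struct domain_pins {
    bool broadcastinner;       /* broadcast Inner Shareable snoops/maintenance */
    bool broadcastouter;       /* broadcast Outer Shareable snoops/maintenance */
    bool broadcastcachemaint;  /* broadcast L3 maintenance (CleanShared, ...)  */
};

/* BROADCASTINNER asserted requires BROADCASTOUTER asserted as well;
 * other restrictions are left to the configuration table in the TRM. */
bool domain_pins_consistent(const struct domain_pins *p)
{
    if (p->broadcastinner && !p->broadcastouter)
        return false;
    return true;
}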
Permitted combinations of the domain control signals and supported
configurations in the Cortex A7 processors []
Note
Similar interpretations hold for the Cortex-A15/A57/A53 processors.
A7 trm
2.3.6 Providing support for system-wide page tables
Support for system-wide page tables is also termed Distributed Virtual Memory (DVM) support [].
• Multi-cluster CPU systems share a single set of MMU page tables in the memory.
• A TLB (Translation Look-Aside Buffer) is a cache of MMU page tables being in the
memory.
• When one master updates page tables it needs to invalidate TLBs that may contain
a stale copy of the MMU page table entry.
• DVM support of AMBA 4 (ACE) does facilitate this by providing broadcast
invalidation messages.
• DVM messages are sent on the Read channel using the ARSNOOP signaling.
A system MMU may make use of the TLB invalidation messages to ensure that its entries
are up to date.
Example for DVM messages []
http://www.ece.cmu.edu/~ece742/lib/exe/fetch.php?media=arm_multicore_and_system_coherence_-_cmu.pdf
Remarks on the address translation in ARM’s Cortex-Ax processors []
In order to give a glimpse into the address translation process, we briefly summarize
its main steps as performed by the Cortex-A15 processor, strongly simplified by
neglecting issues such as access permissions, protection, etc.
Principle of the address translation
• The ARM Cortex-A series processors implement a virtual-to-physical address
translation that is based on two-level page tables kept in the main memory.
• The address translation process performed by means of the page tables is
usually termed as table walk.
• Here we do not want to go into details of the page tables or of the table walk
process, rather we will only illustrate this in the next Figure.
(We note that a good overview of virtual to physical address translation is given
e.g. in []).
Principle of virtual to physical address translation based on two-level
page tables (based on [])
http://www.cs.rutgers.edu/~pxk/416/notes/09a-paging.html
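A minimal C sketch of such a two-level table walk is given below. It is written in the spirit of the ARMv7-A short-descriptor format (a 4096-entry first-level table indexed by VA[31:20], 256-entry second-level tables indexed by VA[19:12], 4 KB pages), but the data layout is a simplifying assumption: sections, permissions and memory attributes are ignored, and the tables are modelled as plain C arrays.

#include <stdint.h>
#include <stdbool.h>

#define L1_ENTRIES 4096u   /* indexed by VA[31:20] */
#define L2_ENTRIES 256u    /* indexed by VA[19:12] */

struct l2_table { uint32_t pte[L2_ENTRIES]; };        /* bit 0: valid, bits [31:12]: page base */
struct l1_table { struct l2_table *l2[L1_ENTRIES]; }; /* NULL: no second-level table present   */

/* Translate a 32-bit virtual address; returns false on a translation fault. */
bool table_walk(const struct l1_table *l1, uint32_t va, uint32_t *pa)
{
    uint32_t l1_index = (va >> 20) & 0xFFFu;
    const struct l2_table *l2 = l1->l2[l1_index];
    if (l2 == NULL)
        return false;                               /* first-level fault  */

    uint32_t l2_index = (va >> 12) & 0xFFu;
    uint32_t pte = l2->pte[l2_index];
    if ((pte & 1u) == 0)
        return false;                               /* second-level fault */

    *pa = (pte & 0xFFFFF000u) | (va & 0xFFFu);      /* page base + offset */
    return true;
}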
Use of TLBs to speed up the address translation process
Traditionally, the address translation process is speeded up by caching the latest
translations.
The ARM Cortex-A series processors (and a few previous lines, such as the ARM9xx
or ARM11xx lines) implement caching of the latest address translations by means
of a two level TLB (Translation Look-aside Buffer) cache structure, consisting of
Level 1 and Level 2 TLBs, as indicated in the next Figure.
Example of two-level TLB caching (actually that of the A15) (based on [])
[Figure: misses in the L1 (micro) TLBs are forwarded to the L2 TLB; hits return the physical address]
First level TLBs-1
• The first level of caching is performed by the L1 TLBs (dubbed also as Micro TLBs).
• Both the instruction and the data caches have their own Micro TLBs.
• Micro TLBs operate fully associative and perform a look up in a single cycle.
As an example the Cortex-A15 has a 32 entry TLB for the L1 instruction cache and
two separate L1 TLBs with 32-32 entries for data reads and writes, as indicated
in the next Figure.
Example of a two-level TLB system (actually that of the A15) (based on [])
[Figure: a 32-entry instruction micro TLB and two 32-entry data (read/write) micro TLBs in front of the unified L2 TLB; hits return the physical address]
First level TLBs-2
• A hit in an L1 TLB returns the physical address to the associated L1 (instruction
or data cache) for comparison.
• Misses in the L1 TLBs are handled by a unified L2 TLB.
(Unified means that the same TLB is used for both instructions and data).
http://www.cs.rutgers.edu/~pxk/416/notes/09a-paging.html
The second level TLB-1
• The second level TLB is a 512 entry 4-way set associative cache in the Cortex-
A15 (called the main TLB), as shown next.
Example of a two-level TLB system (actually that of the A15) (based on [])
[Figure: the 32-entry micro TLBs are backed by the 512-entry, 4-way set associative main (L2) TLB; hits return the physical address]
The second level TLB-2
• Accesses to the L2 TLB take a number of cycles.
• A hit in the L2 TLB returns the physical address to the associated L1 cache for
comparison.
• A miss in the L2 TLB invokes a hardware translation table walk, as indicated
in the next Figure.
Principle of the virtual to physical address translation in the A15 (based on [])
[Figure: a miss in the main TLB invokes the hardware table walk in the page tables held in physical memory; the resulting physical address is returned to the requesting cache]
The hardware table-walk
• It retrieves the address translation information from the translation tables in the
physical memory, as shown in the Figure, but this process is not detailed here.
• Once retrieved, the translation information is placed into the associated TLBs,
possibly by overwriting existing entries.
• The entry to be overwritten is typically chosen in a round-robin fashion, as the
sketch below also indicates.
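The TLB behaviour described above can be approximated with the following C sketch: a fully associative 32-entry micro TLB with round-robin refill in front of the rest of the hierarchy. The entry count matches the Cortex-A15 figures quoted above; everything else (4 KB pages, the data layout, and the placeholder main_tlb_or_table_walk() standing in for the main TLB and the hardware table walk) is an illustrative assumption.

#include <stdint.h>
#include <stdbool.h>

#define MICRO_TLB_ENTRIES 32    /* fully associative L1 (micro) TLB */

struct tlb_entry { bool valid; uint32_t vpn; uint32_t pfn; };

struct micro_tlb {
    struct tlb_entry e[MICRO_TLB_ENTRIES];
    unsigned next;              /* round-robin replacement pointer */
};

/* Placeholder for the unified L2 (main) TLB lookup and, on a miss there,
 * the hardware table walk; assumed to always succeed in this sketch. */
extern uint32_t main_tlb_or_table_walk(uint32_t vpn);

uint32_t micro_tlb_translate(struct micro_tlb *tlb, uint32_t va)
{
    uint32_t vpn = va >> 12;                        /* 4 KB pages assumed */

    /* single-cycle, fully associative lookup */
    for (unsigned i = 0; i < MICRO_TLB_ENTRIES; i++)
        if (tlb->e[i].valid && tlb->e[i].vpn == vpn)
            return (tlb->e[i].pfn << 12) | (va & 0xFFFu);

    /* miss: ask the main TLB / table walk, then refill in round-robin order */
    uint32_t pfn = main_tlb_or_table_walk(vpn);
    tlb->e[tlb->next] = (struct tlb_entry){ true, vpn, pfn };
    tlb->next = (tlb->next + 1) % MICRO_TLB_ENTRIES;
    return (pfn << 12) | (va & 0xFFFu);
}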
Note
In an operating system like Linux, each process has its own set of tables, and
the OS switches to the right tables on a process (context) switch by
rewriting the Translation Table Base Register (TTBR) [].
http://www.slideshare.net/prabindh/enabling-two-level-translation-tables-in-armv7-mmu
2.3.7 Providing cache coherent interconnects
2.3.7.1 Overview of ARM’s cache coherent interconnects
The role of an interconnect in the system architecture
An interconnect provides the needed interconnections between the major system
components, such as cores, accelerators, memory, I/O, etc., as in the Figure below [].
Interface ports receiving data requests e.g. from processor cores, the GPU, DMAs
or the LCD, are called Slaves whereas those initiating data requests e.g. to the
memory or other peripherals are designated as Masters, as indicated in the Figure
above.
http://www.ece.cmu.edu/~ece742/lib/exe/fetch.php?media=arm_multicore_and_system_coherence_-_cmu.pdf
Interconnect topologies of processor platforms
[Figure: example platform topologies — processors (P), GPU, north bridge (NB), south bridge (SB), memory (M) and peripherals (Per.) connected via a crossbar or a ring]
Figure: Example of a Sandy Bridge based desktop platform with the H67 chipset vs. a crossbar-based ARM platform
On the left side there is a double-star-like, chipset-based platform topology (actually
a Sandy Bridge based desktop platform with the H67 chipset from Intel), whereas
on the right there is a crossbar interconnect based platform topology (actually an NIC-301
interconnect with a Cortex-A9 processor from ARM).
Overview of ARM's recent on-die interconnect solutions
Examples: NIC-301 (~2006), CCI-400 (2010), NIC-400 (2011), CCN-504 (2012), CCN-508 (2013)
[Table: number of master ports: configurable (1–64); type of master ports: AXI3 only vs. AXI3/AHB-Lite/APB2/3 vs. AXI3/AXI4/AHB-Lite/APB2/3/4; width of master ports: 32/64-bit vs. 32/64/128/256-bit (APB only 32-bit); integrated snoop filter: no]
https://www.arm.com/products/system-ip/interconnect/corelink-cci-550-cache-coherent-interconnect.php
High level block diagram of ARM's first (AXI3-based) PL-300 interconnect []
http://www.arm.com/files/pdf/AT_-_Building_High_Performance_Power_Efficient_Cortex_and_Mali_systems_with_ARM_CoreLink.pdf
Example 2: Cortex-A7 MP/NIC-400 based entry-level smartphone platform []
http://www.arm.com/products/processors/cortex-a/cortex-a7.php
ARM CoreLink 400 & 500 Series System IP 12/2012
2.3.7.2 ARM's non cache coherent interconnects used for mobiles
Overview of ARM’s cache coherent interconnects
[Chart: ARM's cache coherent interconnects by number of supported CPU clusters and announcement date]
• CCI-400 (10/2010): 2 ACE-128 ports, 3 ACE-Lite ports, no integrated L3, supports Cortex-A15/A7, 32/28 nm, 2 DDR3/2 channels
• CCN-504 (10/2012): 4 ACE ports, 18 AXI4/ACE-Lite interfaces, 8/16 MB L3 with snoop filter, supports Cortex-A15/A12/A7/A57/A53, 28 nm, 2 DDR4/3 channels
• CCN-508 (announced 10/2013): up to 24 AMBA interfaces for I/O-coherent accelerators and I/O, 1–32 MB L3 with snoop filter, supports Cortex-A57/A53, 4 DDR4 channels
• The CCI-400 is the first ARM interconnect to support cache coherency, using the
AMBA 4 ACE (AMBA Coherency Extensions) interconnect bus.
• The CCI-400 Cache Coherent Interconnect is part of the CoreLink 400 System,
as indicated in the next Table.
CoreLink 400 System components (based on [])
Cache Coherent Interconnect with snoop filter — CCI-5xx: cache coherent interconnects with snoop filters to reduce snoop traffic
https://www.arm.com/products/system-ip/interconnect/corelink-cci-550-cache-coherent-interconnect.php
Block diagram of the CCI-400 []
http://www.ece.cmu.edu/~ece742/lib/exe/fetch.php?media=arm_multicore_and_system_coherence_-_cmu.pdf
Internal architecture of the CCI-400 Cache Coherent Interconnect []
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0470c/DDI0470C_cci400_r0p2_trm.pdf
Example 1: Cache coherent SOC based on the CCI-400 interconnect []
[Figure labels: NIC (Network Interconnect), MMU (Memory Management Unit), DVM (Distributed Virtual Memory)]
http://www.arm.com/files/pdf/CacheCoherencyWhitepaper_6June2011.pdf
Hardware Coherency and Snoops
The simplest implementation of cache coherency is to broadcast a snoop to all processor
caches to locate shared data on-demand. When a cache receives a snoop request, it performs a
tag array lookup to determine whether it has the data, and sends a reply accordingly.
For example in the image above we can see arrows showing snoops between big and LITTLE
processor clusters, and from IO interfaces into both processor clusters. These snoops are
required for accessing any shared data to ensure their caches are hardware cache coherent. In
other words, to ensure that all processors and IO see the same consistent view of memory.
For most workloads the majority of lookups performed as a result of snoop requests will miss, that is, they fail to find copies of the requested data in cache. This means that many snoop-induced lookups may be an unnecessary use of bandwidth and energy. Of course we have removed the much higher cost of software cache maintenance, but maybe we can optimize this further?
https://community.arm.com/groups/processors/blog/2015/02/03/extended-system-coherency--part-3--corelink-cci-500
Introduction of snoop filters in the CCI-500
Snoop filter: Supported. The multiprocessor device provides support for an external snoop filter in an interconnect. It indicates when clean lines are evicted from the processor by sending Evict transactions on the ACE write channel. However, there are some cases where incorrect software can prevent an Evict transaction from being sent; therefore you must ensure that any external snoop filter is built to handle a capacity overflow that sends a back-invalidation to the processor if it runs out of storage.
[Cortex-A7 MPCore TRM]
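A directory-style snoop filter of the kind described above can be sketched in C as follows. The structure (a small direct-mapped table of line addresses with a sharer bitmask per entry) and the back_invalidate() callback are illustrative assumptions, not the CCI-500 implementation.

#include <stdint.h>
#include <stdbool.h>

#define FILTER_ENTRIES 1024u   /* capacity of the (illustrative) snoop filter */

struct filter_entry {
    bool     valid;
    uint64_t line_addr;   /* cache line address                  */
    uint32_t sharers;     /* bitmask: which masters may hold it  */
};

static struct filter_entry filter[FILTER_ENTRIES];

/* Hypothetical callback: ask a master to invalidate a line when the
 * filter runs out of storage (capacity back-invalidation). */
extern void back_invalidate(unsigned master, uint64_t line_addr);

/* Record that a master allocated a line (e.g. observed on a read). */
void filter_allocate(unsigned master, uint64_t line_addr)
{
    unsigned idx = (unsigned)(line_addr % FILTER_ENTRIES);
    struct filter_entry *e = &filter[idx];

    if (e->valid && e->line_addr != line_addr) {
        /* capacity conflict: evict the old entry and back-invalidate its sharers */
        for (unsigned m = 0; m < 32; m++)
            if (e->sharers & (1u << m))
                back_invalidate(m, e->line_addr);
        e->sharers = 0;
    }
    e->valid = true;
    e->line_addr = line_addr;
    e->sharers |= 1u << master;
}

/* On an Evict transaction the master signals that a clean line left its cache. */
void filter_evict(unsigned master, uint64_t line_addr)
{
    unsigned idx = (unsigned)(line_addr % FILTER_ENTRIES);
    struct filter_entry *e = &filter[idx];
    if (e->valid && e->line_addr == line_addr) {
        e->sharers &= ~(1u << master);
        if (e->sharers == 0)
            e->valid = false;
    }
}

/* Only masters marked in the filter need to be snooped for this line. */
uint32_t filter_lookup(uint64_t line_addr)
{
    unsigned idx = (unsigned)(line_addr % FILTER_ENTRIES);
    const struct filter_entry *e = &filter[idx];
    return (e->valid && e->line_addr == line_addr) ? e->sharers : 0;
}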
Example 1: Cache coherent SOC based on the CCI-500 interconnect []
http://www.arm.com/products/system-ip/interconnect/corelink-cci-500.php
Example 1: Cache coherent SOC based on the CCI-550 interconnect []
ARM
Use of ARM's interconnect IP by major SOC providers
Extended System Coherency - Part 2 - Implementation, big.LITTLE, GPU Compute and Enterprise
Posted by neilparris in ARM Processors on 17 Feb 2014
Example 2: Cache coherent SOC with Cortex-A12 and MaliT622
based on the CCI-400 interconnect []
http://www.dailytech.com/ARM+Reveals+Cortex+A12+With+Malli+T600+GPU+OpenGL+ES+30+for+MidRange+Mobile/article31677c.htm 6/2013
Further technologies supported by the CCI-400
• big.LITTLE technology
• Barriers
• Virtualization
• End-to-end QoS (Quality of Service)
(Allows traffic and latency regulations, removes traffic blocking, as indicated
in the next Figure).
End-to-end QoS (Quality of Service) []
ARM
Example 3: High-End mobile platform based on the CCI-400 interconnect
with dual Cortex-A57 and quad Cortex-A53 clusters and Mali GPU []
2.3.7.3 ARM's cache coherent interconnects for enterprise computing
Recently, there are four implementations.
Example 1: SOC based on the cache coherent CCN-504 interconnect []
DMC-520 Dynamic Memory Controller:
• 2-channel (2x40/72-bit) DDR3/DDR4/LPDDR3 memory controller
• 128-bit system interface
• DFI 3.1 memory interface
• memory transfer rate up to DDR4-3200
http://www.arm.com/products/system-ip/interconnect/corelink-ccn-504-cache-coherent-network.php
The ring interconnect fabric of the CCN-504 (dubbed Dickens) []
Remark: The Figure indicates only 15 ACE-Lite slave ports and 1 master port
whereas ARM's specifications show 18 ACE-Lite slave ports and 2 master ports.
http://www.freescale.com/files/training/doc/dwf/DWF13_APF_NET_T0795.pdf
Example 2: SOC based on the cache coherent CCN-508 interconnect []
https://www.arm.com/products/system-ip/interconnect/corelink-ccn-508.php
The ring interconnect fabric of the CCN-508 []
http://www.hotchips.org/wp-content/uploads/hc_archives/hc26/HC26-11-day1-epub/HC26.11-4-ARM-Servers-epub/HC26.11.420-Network-Flippo-ARM_LSI%20HC2014%20v0.12.pdf
Example 3: SOC based on the cache coherent CCN-512 interconnect []
https://www.arm.com/products/system-ip/interconnect/corelink-ccn-512.php
Use of ARM's interconnect IP by major SOC providers
http://www.theregister.co.uk/2014/05/06/arm_corelink_ccn_5xx_on_chip_interconnect_microarchitecture/?page=2
Main features of ARM's second-generation cache coherent interconnects
Comparison of CCN-504 (announced 10/2012) and CCN-508 (announced 10/2013):
• Number of supported CPU clusters: 4 vs. 8
• Master ports for CPU clusters: 4 (128-bit ACE or CHI) vs. 8 (CHI)
• AMBA interfaces (ACE-Lite AXI3/4) for I/O-coherent accelerators and I/O: 18 vs. 24
• Integrated L3 cache: 8/16 MB vs. 1–32 MB, both with a snoop filter
• Supported processors (Cortex-Ax): A15/A12/A7/A57/A53 vs. A57/A53
• Technology: 28 nm vs. n.a.
http://www.arm.com/products/system-ip/interconnect/corelink-ccn-508.php
Intermixing cache coherent and non-coherent subsystems
It is feasible to build a SOC from cache coherent subsystems (based on the cache coherent
interconnects, e.g. CCI-400 or CCN-504/CCN-508) and non cache coherent
subsystems (based on the non cache coherent interconnects NIC-301/NIC-400),
as the next Figure shows.
SOC consisting of a cache coherent and a non cache coherent subsystem []
http://115.28.165.193/down/arm/ve/Datasheet_CoreTileExpress_V2P-CA15x2_CA7x3.pdf
ARM Analyst Day May 2014