
Compute Caches

Shaizeen Aga, Supreet Jeloka, Arun Subramaniyan, Satish Narayanasamy, David Blaauw, and Reetuparna Das
University of Michigan, Ann Arbor
{shaizeen, sjeloka, arunsub, nsatish, blaauw, reetudas}@umich.edu

Abstract—This paper presents the Compute Cache architecture that enables in-place computation in caches. Compute Caches use emerging bit-line SRAM circuit technology to re-purpose existing cache elements and transform them into active, very large vector computational units. They also significantly reduce the overhead of moving data between different levels in the cache hierarchy.

Solutions to satisfy new constraints imposed by Compute Caches, such as operand locality, are discussed. Also discussed are simple solutions to the problems of integrating them into a conventional cache hierarchy while preserving properties such as coherence, consistency, and reliability.

Compute Caches increase performance by 1.9× and reduce energy by 2.4× for a suite of data-centric applications, including text and database query processing, cryptographic kernels, and in-memory checkpointing. Applications with a larger fraction of Compute Cache operations could benefit even more, as our micro-benchmarks indicate (54× throughput, 9× dynamic energy savings).

I. INTRODUCTION

As computing today is dominated by data-centric applications, there is a strong impetus for specialization for this important domain. Conventional processors' narrow vector units fail to exploit the high degree of data-parallelism in these applications. They also expend a disproportionately large fraction of time and energy in moving data over the cache hierarchy and in instruction processing, as compared to the actual computation [1].

We present the Compute Cache architecture for dramatically reducing these inefficiencies through in-place (in-situ) processing in caches. A modern processor devotes a large fraction (40-60%) of die area to caches, which are used for storing and retrieving data. Our key idea is to re-purpose and transform the elements used in caches into active computational units. This enables computation in-place within a cache sub-array, without transferring data in or out of it. Such a transformation can unlock massive data-parallel compute capabilities, dramatically reduce the energy spent in data movement over the cache hierarchy, and thereby directly address the needs of data-centric applications.

Our proposed architecture uses an emerging SRAM circuit technology, which we refer to as bit-line computing [2], [3]. By simultaneously activating multiple word-lines and sensing the resulting voltage over the shared bit-lines, several important operations over the data stored in the activated bit-cells can be accomplished without data corruption. A recently fabricated chip [2] demonstrates the feasibility of bit-line computing. It also shows robustness of more than six sigma in Monte Carlo simulations, which is considered the industry standard for robustness against process variations.

Past processing-in-memory (PIM) solutions proposed to move processing logic near the cache [4], [5] or main memory [6], [7]; 3D stacking can make this possible [8]. Compute Caches significantly push the envelope by enabling in-place processing using existing cache elements. It is an effective optimization for data-centric applications, where at least one of the operands used in computation (e.g., the dictionary in WordCount) has cache locality.

The efficiency of Compute Caches arises from two main sources: massive parallelism and reduced data movement. A cache is typically organized as a set of sub-arrays; as many as hundreds of sub-arrays, depending on the cache level. These sub-arrays can potentially compute concurrently on the data stored in them (KBs of data) with small extensions to the existing cache structures (8% cache area overhead). Thus, caches can effectively function as large vector computational units, whose operand sizes are orders of magnitude larger than those of conventional SIMD units (KBs vs. bytes). To achieve similar capability, the logic close to memory in a conventional PIM solution would need to provision more than a hundred additional vector functional units. The second benefit of Compute Caches is that they avoid the energy and performance cost incurred not only for transferring data between the cores and different levels of the cache hierarchy (through the network-on-chip), but even between a cache's sub-array and its controller (through the in-cache interconnect).

This paper addresses several problems in realizing the Compute Cache architecture, discusses ISA and system software extensions, and re-designs several data-centric applications to take advantage of the new processing capability.

An important problem in using Compute Caches is satisfying the operand locality constraint. Bit-line computing requires that the data operands are stored in rows that share the same set of bit-lines. We architect a cache geometry where the ways in a set are judiciously mapped to a sub-array, so that software can easily satisfy operand locality. Our design allows a compiler to ensure operand locality simply by placing operands at addresses that are page aligned (same page offset). It avoids exposing the internals of a cache, such as its size or geometry, to software.
When in-place processing is not possible for an operation due to lack of operand locality, we propose to use near-place Compute Caches. In the near-place design, the source operands are read out from the cache sub-arrays, the operation is performed in a logic unit placed close to the cache controller, and the result may be written back to the cache.

Besides operand locality, Compute Caches bring forth several interesting questions. How to orchestrate concurrent computation over operands spread across multiple cache sub-arrays? How to ensure coherence between compute-enabled caches? How to ensure consistency model constraints when computation is spread between cores and caches? Soft errors are a significant concern in modern processors; can ECC be used for Compute Caches, and when that is not possible, what are the alternative solutions? We discuss relatively simple solutions to address these problems.

Figure 1: Compute Cache overview. (a) Cache hierarchy. (b) Cache geometry. (c) In-place compute in a sub-array.

Compute Caches support several in-place vector operations: copy, search, compare, and logical operations (and, or, xor, and not), which can accelerate a wide variety of applications. We study two text processing applications (word count, string matching), database query processing with bitmap indexing, copy-on-write checkpointing in the OS, and bit matrix multiplication (BMM), a critical primitive used in cryptography, bioinformatics, and image processing. We re-designed these applications to efficiently express their computation in terms of the Compute Cache supported vector operations. Section V identifies a number of additional domains that can benefit from Compute Caches: data analytics, search, network processing, etc.

We evaluate the merits of Compute Caches for a multi-core processor modeled after Intel's SandyBridge [9] processor with eight cores, three levels of caches, and a ring interconnect. For the applications we study, on average, Compute Caches improve performance by 1.9× and reduce energy by 2.4× compared to a conventional processor with 32-byte wide vector units. Applications with a higher fraction of Compute Cache operations can benefit significantly more. Through micro-benchmarks that manipulate 4KB operands, we show that Compute Caches provide 9× dynamic energy savings over a baseline using 32-byte SIMD units, while providing 54× better throughput on average.

In summary, this paper makes the following contributions:

• We make a case for caches that can compute. Using bit-line computing, our Compute Caches naturally support vector processing over large data operands (several KBs). This dramatically reduces the overhead due to data movement between caches and cores. Furthermore, in-place computing even avoids data transfer between a cache's sub-array and its controller.

• We present the Compute Cache architecture, which addresses various architectural problems: operand locality, managing parallelism across various cache levels and banks, coherency, consistency, and reliability.

• To support Compute Cache operations without operand locality, we study near-place processing in the cache.

• We re-designed several important applications (text processing, databases, checkpointing) to utilize Compute Cache operations. We demonstrate significant speedup (1.9×) and energy savings (2.4×) compared to processors with conventional SIMD units. While our savings for applications are limited by the fraction of their computation that can be accelerated using Compute Caches (Amdahl's law), our micro-benchmarks demonstrate that applications with a larger fraction of Compute Cache operations could benefit even more (54× throughput, 9× dynamic energy).

II. BACKGROUND

This section provides a brief background on the cache hierarchy, cache geometry, and bit-line computing in SRAM.

A. Cache Hierarchy and Geometry

Figure 1 (a) illustrates a multi-core processor modeled loosely after Intel's Sandybridge [9]. It has a three-level cache hierarchy comprising private L1 and L2 caches and a shared L3. The shared L3 cache is distributed into slices which are connected to the cores via a shared ring interconnect. A cache consists of a cache controller and several banks (Figure 1 (b)). Each bank has several sub-arrays connected by an H-Tree interconnect. For example, a 2 MB L3 cache slice has a total of 64 sub-arrays distributed across 16 banks.

A sub-array in a cache bank is organized into multiple rows of data-storing bit-cells. The bit-cells in the same row are connected to a word-line. The bit-cells along a column share the same bit-line. Typically, in any cycle, one word-line is activated, from which a data block is either read or written through the column bit-lines.
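To make the geometry above concrete, the following back-of-the-envelope sketch (an illustration we add here, not part of the original paper) derives the sub-array count of a 2 MB L3 slice, assuming the 512 × 512-bit sub-array dimension reported later in Section VI-C.

```c
#include <assert.h>
#include <stdio.h>

/* Back-of-the-envelope check of the cache geometry in Section II-A,
 * assuming the 512 x 512-bit L3 sub-array dimension from Section VI-C. */
int main(void) {
    const long slice_bytes    = 2L * 1024 * 1024;   /* 2 MB L3 slice */
    const long subarray_bits  = 512L * 512;         /* one sub-array */
    const long subarray_bytes = subarray_bits / 8;  /* = 32 KB       */

    long subarrays_per_slice = slice_bytes / subarray_bytes;
    long banks_per_slice     = 16;
    long subarrays_per_bank  = subarrays_per_slice / banks_per_slice;

    /* Matches the paper: 64 sub-arrays spread over 16 banks per slice;
     * a 16 MB L3 (8 such slices) therefore has 512 sub-arrays, which the
     * paper reports can operate on 8 KB operands concurrently. */
    assert(subarrays_per_slice == 64 && subarrays_per_bank == 4);
    printf("%ld sub-arrays per 2 MB slice, %ld per bank\n",
           subarrays_per_slice, subarrays_per_bank);
    return 0;
}
```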
B. Bit-line Computing

Compute Caches use emerging bit-line computing technology in SRAMs [2], [3] (Figure 2), which observes that when multiple word-lines are activated simultaneously, the shared bit-lines can be sensed to produce the result of and and nor on the data stored in the two activated rows. Data corruption due to multi-row access is prevented by lowering the word-line voltage to bias against writes to the SRAM array. Jeloka et al.'s [2] measurements across 20 fabricated test chips demonstrate that data corruption does not occur even when 64 word-lines are simultaneously activated during such an in-place computation. They also show robustness of more than six sigma in Monte Carlo simulations, which is considered the industry standard for robustness against process variations. Note also that, by lowering the word-line voltage further, robustness can be traded off for an increase in delay. Even then, Compute Caches would still deliver significant savings given their potential (Section VI, 54× throughput, 9× dynamic energy savings).

Figure 2: SRAM circuit for in-place operations. Two rows (WLi and WLj) are activated. An AND operation is performed by sensing the bit-line (BL). All the bit-lines are initially pre-charged to '1'. If both the activated bits in a column hold a '1' (column 'n'), then the BL stays high and is sensed as a '1'. If either of the bits is '0', it lowers the BL voltage below Vref and is sensed as a '0'. A NOR operation can be performed by sensing the bit-line bar (BLB).

Section IV-B discusses our extensions to bit-line computing enabled SRAM to support additional operations: copy, xor, equality comparison, search, and carryless multiplication (clmul).

III. A CASE FOR COMPUTE CACHES

In-place Compute Caches have the potential to provide massive data-parallelism, while also dramatically reducing the instruction processing and on-chip data movement overheads. Figure 3 pictorially depicts these benefits by comparing a scalar core, a SIMD core with vector processing support, and Compute Caches.

Figure 3: Proportion of energy (top) for a bulk comparison operation and area (bottom) for (a) a scalar core, (b) a SIMD core, and (c) a Compute Cache. The red dot depicts logic capability.

The bottom half of Figure 3 depicts the area proportioning and processing capability of the three architectures. A significant fraction of die area in a conventional processor is devoted to caches. A Compute Cache re-purposes the elements used in this large area into compute units for a small area overhead (8% of cache area). A typical last-level cache consists of hundreds of sub-arrays distributed across different banks, which can potentially compute concurrently on the cache blocks stored in them. This enables us to exploit large-scale data-level parallelism (e.g., a 16MB L3 has 512 sub-arrays and can support 8 KB operands), dwarfing even a SIMD core.

The top row of Figure 3 shows the relative energy consumption for a comparison operation over several blocks of 4KB operands (Section VI-D). In a scalar core, less than 1% of the energy is expended on the ALU operation, while nearly three quarters of the energy is spent in processing instructions in the core, and one-fourth is spent on data movement. While vector processing (SIMD) support (Figure 3 (b)) in general-purpose cores and data-parallel accelerators reduces the instruction processing overhead to some degree, it does not help address the data movement overhead. The Compute Cache architecture (Figure 3 (c)) can reduce the instruction processing overheads by an order of magnitude, by supporting SIMD operations on large operands (tens of KB). It also avoids the energy and performance cost due to data movement.

Cache      cache-ic (H-Tree)   cache-access
L1-D       179 pJ              116 pJ
L2         675 pJ              127 pJ
L3-slice   1985 pJ             467 pJ

Table I: Cache energy per read access.

In-place Compute Caches reduce the on-chip data movement overhead, which consists of two components. The first is the energy spent on data transfer. This includes not only the significant energy spent on the processor interconnect's wires and routers, but also the H-Tree interconnect used for data transfer within a cache. A near-place Compute Cache solution can solve the former but not the latter. As shown in Table I, the H-Tree consumes nearly 80% of the cache energy spent in reading from a 2MB L3 cache slice.

The second is the energy spent when reading and writing in the higher-level caches. In a conventional processor, a data block trickles up the cache hierarchy all the way from the L3 to the L1 cache, and into a core's registers, before it can be operated upon. An L3 Compute Cache can eliminate all this overhead. A shared L3 Compute Cache can also reduce the cost of sharing data between two cores, as it would avoid a write-back from a source core's L1 to the shared L3, and then a transfer back to a destination core's L1.
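As a quick sanity check on the Table I numbers (our own arithmetic, added here for illustration), the H-Tree share of an L3-slice read and the cache-side cost of staging one block up the hierarchy can be computed directly:

```c
#include <stdio.h>

/* Per-read energies from Table I (pJ), split into in-cache interconnect
 * (H-Tree) and sub-array access components. */
int main(void) {
    double l3_htree = 1985, l3_access = 467;
    double l2_htree = 675,  l2_access = 127;
    double l1_htree = 179,  l1_access = 116;

    double l3_read = l3_htree + l3_access;   /* 2452 pJ per L3-slice read */
    printf("H-Tree share of an L3 read: %.0f%%\n",
           100.0 * l3_htree / l3_read);      /* ~81%: the "nearly 80%" claim */

    /* Cache-side energy just to stage one block from L3 into L1 in the
     * baseline (one read at each level; writes, NoC and core energy are
     * extra and not counted here). */
    double trickle_up = l3_read + (l2_htree + l2_access) + (l1_htree + l1_access);
    printf("L3->L2->L1 read energy per block: %.0f pJ\n", trickle_up);
    return 0;
}
```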
IV. COMPUTE CACHE ARCHITECTURE

Figure 1 illustrates the Compute Cache (CC) architecture. We enhance all the levels in the cache hierarchy with in-place compute capability. Computation is done at the highest level where the application exhibits significant locality. In-place compute is based on the bit-line computing technology discussed in Section II. We enhance these basic in-place compute capabilities to support xor and several in-place operations (copy, search, comparison, and carryless multiplication).

In-place computing is possible only when operands are mapped to sub-arrays such that they share the same bit-lines. We refer to this requirement as operand locality. We discuss a cache geometry that allows a compiler to satisfy operand locality by ensuring that the operands are page-aligned.

Each cache controller is extended to manage the parallel execution of CC instructions across its several banks. It also decides the cache level at which to perform the computation and fetches the operands to that level. Given that a Compute Cache can modify data, we discuss the implications for ensuring coherence and consistency properties. Finally, we discuss design alternatives for supporting ECC in Compute Caches.

In the absence of operand locality, we propose to compute near-place in the cache. For this, we add a logic unit in the cache controller. Although near-place cache computing requires additional functional units and cannot save H-Tree interconnect energy inside caches, it successfully helps reduce the energy spent in transferring and storing data in the higher-level caches.

A. Instruction Set Architecture (ISA)

The Compute Cache (CC) ISA extensions are listed in Table II. The ISA supports several vector instructions, whose operands are specified using register indirect addressing. Operand sizes are specified through immediate values and can be as large as 16K. It supports vector copying, zeroing, and logical operations. It also supports a vector carryless multiply instruction (cc_clmul) at single/double/quad-word granularity, as well as equality comparison and search. We limit the operand size (n) of the comparison and search instructions to 64 words (512 bytes), so that the result can be returned as a 64-bit value to a processor core's register. For the search instruction, the key size is set to 64 bytes. For smaller keys, the programmer can either duplicate the key multiple times starting from the key's address (if its size is a word multiple), or pad the key and the source data operands to 64 bytes.

Opcode      Src1  Src2  Dest  Size  Description
cc_copy     a     -     b     n     b[i] = a[i]
cc_buz      a     -     -     n     a[i] = 0
cc_cmp      a     b     r     n     r[i] = (a[i] == b[i])
cc_search   a     k     r     n     r[i] = (a[i] == k)
cc_and      a     b     c     n     c[i] = a[i] & b[i]
cc_or       a     b     c     n     c[i] = a[i] | b[i]
cc_xor      a     b     c     n     c[i] = a[i] ⊕ b[i]
cc_clmulX   a     b     c     n     c[i] = ⊕(a[i] & b[i])
cc_not      a     -     b     n     b[i] = !(a[i])
a, b, c, k: addresses; r: register; ∀i, i ∈ [1, n]; X = 64/128/256

Table II: Compute Cache ISA.

B. Cache Sub-arrays with In-Place Compute

Compute Caches are made possible by our SRAM sub-array design that facilitates in-place computation. We start with the basic circuit framework proposed by Jeloka et al. [2], which supports logical and and nor operations. To a conventional cache's sub-array, we add an additional decoder to allow activating two word-lines, one for each operand. The two single-ended sense amplifiers required for separately sensing both the bit-lines attached to a bit-cell are obtained by re-configuring the original differential sense amplifier.

In addition to the and and nor operations, we extend the circuit to support xor by NOR-ing the bit-line and the bit-line complement. We realize compound operations such as compare and search by using the results of bitwise xor. To compare two words, the individual bit-wise xor results are combined using a wired-NOR. Comparison is utilized to do iterative search over the cache blocks stored in sub-arrays.

By feeding the result of the sense amplifiers back to the bit-lines, one word-line can be copied to another without ever latching the source operand. We leverage the fact that the last read value is the same as the data to be written in the next cycle, and coalesce the read-write operation to enable a more energy-efficient copy operation, as shown in Figure 4. By resetting the input data latch before a write, we can enable in-place zeroing of a cache block.

Figure 4: In-place copy operation (from row i to j).

Finally, the carryless multiplication (clmul) operation is done using a logical and on two sub-array rows, followed by an xor reduction of all the resultant bits. This is supported by adding an xor reduction tree to each sub-array.
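The bit-line-level behavior above reduces to simple word-level semantics. The following C model is an illustrative sketch we add here (not the authors' code): it mirrors how the Table II operations can be derived from the two sensed values, AND on the bit-line and NOR on its complement. The parity builtin is a GCC/Clang extension.

```c
#include <stdint.h>
#include <stdio.h>

/* Word-level model of in-place bit-line computing (Sections II-B, IV-B).
 * Activating two rows lets the sense amplifiers observe, per column,
 * AND on the bit-line (BL) and NOR on the bit-line complement (BLB). */
typedef struct { uint64_t and_bl, nor_blb; } sensed_t;

static sensed_t sense(uint64_t a, uint64_t b) {
    sensed_t s = { a & b, ~(a | b) };
    return s;
}

/* xor is obtained by NOR-ing the two sensed values (Section IV-B). */
static uint64_t cc_xor(uint64_t a, uint64_t b) {
    sensed_t s = sense(a, b);
    return ~(s.and_bl | s.nor_blb);
}

/* Word compare: wired-NOR of the xor bits -> 1 iff the words are equal. */
static int cc_cmp_word(uint64_t a, uint64_t b) {
    return cc_xor(a, b) == 0;
}

/* cc_clmul64 semantics from Table II: xor-reduction of (a & b), i.e. the
 * GF(2) inner product of two 64-bit words (GCC/Clang builtin used here). */
static uint64_t cc_clmul64(uint64_t a, uint64_t b) {
    return __builtin_parityll(a & b);
}

int main(void) {
    uint64_t a = 0x00000000DEADBEEFULL, b = 0x00000000DEADBEEFULL;
    printf("xor=%016llx cmp=%d clmul=%llu\n",
           (unsigned long long)cc_xor(a, b ^ 1),
           cc_cmp_word(a, b),
           (unsigned long long)cc_clmul64(a, b));
    return 0;
}
```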
Our extensions have a negligible impact on the baseline read/write accesses, as they use the same circuit as the baseline, including differential sensing. An in-place operation takes longer than a single read or write sub-array access, as it requires a longer word-line pulse to activate and sense two rows, compensating for the lower word-line voltage. Sensing time also increases due to the use of single-ended sense amplifiers, as opposed to differential sensing. However, note that this is still less than the delay the baseline would incur to accomplish an equivalent operation, as it would require multiple read accesses and/or a write access. Section VI-C provides the detailed delay, energy, and area parameters for compute-capable cache sub-arrays.

C. Operand Locality

For in-place operations, the operands need to be physically stored in a sub-array such that they share the same set of bit-lines. We term this requirement operand locality. In this section, we discuss the cache organization and software constraints that can together satisfy this property. Fortunately, we find that software can ensure operand locality as long as operands are page-aligned, i.e., have the same page offset. Besides this, the programmer or the compiler does not need to know about any other specifics of the cache geometry.

Figure 5: Cache organization example {1 bank, 16 sets (S0-S15), 4 ways per set}, address decoding ([i][j] = set i, way j), and alternate address decoding for parallel tag-data access caches.

Operand-locality-aware cache organization: Figure 5 illustrates a simple cache with one bank and four sub-arrays. Rows in a sub-array share the same set of bit-lines. We define a new term, Block Partition (BP): the block partition of a sub-array is the group of cache blocks in that sub-array that share the same bit-lines. In-place operation is possible between any two cache blocks stored within a block partition. In our example, since each row in a sub-array holds two cache blocks, there are two block partitions per sub-array. In total, there are eight block partitions (BP0-BP7). In-place compute is possible between any blocks that map to the same block partition (e.g., blocks in sets S0 and S2).

We make two design choices for our cache organization to simplify the operand locality constraint. First, all the ways in a set are mapped to the same block partition, as shown in Figure 5(a). This ensures that operand locality is not affected by which way is chosen for a cache block. Second, we use a portion of the set-index bits to select the block's bank and block partition, as shown in Figure 5(b). As long as these are the same for two operands, they are guaranteed to be mapped to the same block partition.

Cache      Banks  BP  Block size  Min. address bits to match
L1-D       2      2   64          8
L2         8      2   64          10
L3-slice   16     4   64          12

Table III: Cache geometry and operand locality constraint.

Software requirement: The number of address bits that must match for operand locality varies based on the cache size. As shown in Table III, even the largest cache (L3) in our model requires only that the least significant 12 bits are the same for two operands (we assume pages are mapped to the NUCA slice closest to the core actively accessing them). Given that our pages are 4KB in size, we observe that as long as the operands are page aligned, i.e., have the same page offset, they will be placed in the address space such that the least significant bits (12 for a 4 KB page) of their addresses (both virtual and physical) match. This trivially satisfies the operand locality requirement for all the cache levels and sizes we study. Note that we only require operands to be placed at the same offset within 4KB memory regions; it is not necessary to place them in separate pages. For super-pages that are larger than 4KB, operands can be placed within a page while ensuring 12-bit address alignment.

We expect that for data-intensive regular applications that operate on large chunks of data, it is possible to satisfy this property. Many operating system operations that involve copying from one page to another are guaranteed to exhibit operand locality in our system. Compilers and dynamic memory allocators could be extended to optimize for this property in the future.

Finally, a binary compiled with a given address-bit alignment requirement (12 bits in our work) is portable across a wide range of cache architectures as long as the number of address bits to be aligned is equal to or less than what it was compiled for. If the cache geometry changes such that it requires greater alignment, then the programs would have to be recompiled to satisfy that stricter constraint.
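As a concrete illustration of the software requirement (our own sketch, not from the paper), operand locality for the geometries studied reduces to a 12-bit page-offset check, and software can enforce it by co-locating both operands at the same offset within 4 KB-aligned regions:

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

/* Operand locality check from Section IV-C: for the largest cache studied
 * (L3, Table III) the low 12 address bits of the two operands must match. */
static int operand_local(const void *a, const void *b, unsigned match_bits) {
    uintptr_t mask = ((uintptr_t)1 << match_bits) - 1;
    return ((uintptr_t)a & mask) == ((uintptr_t)b & mask);
}

int main(void) {
    /* One way to satisfy the constraint: place each operand at offset 0 of
     * its own 4 KB-aligned region (C11 aligned_alloc). */
    char *src = aligned_alloc(4096, 4096);
    char *dst = aligned_alloc(4096, 4096);
    if (!src || !dst) return 1;

    printf("page-aligned operands local? %d\n",
           operand_local(src, dst, 12));       /* expect 1 */
    printf("mis-aligned operands local?  %d\n",
           operand_local(src, dst + 64, 12));  /* expect 0 */

    free(src);
    free(dst);
    return 0;
}
```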
Column Multiplexing: With column multiplexing, multiple adjacent bit-lines are multiplexed to a single-bit data output, which is then observed using one sense amplifier. This keeps the area overhead of the peripherals in check and improves resilience to particle strikes. Fortunately, in column-multiplexed sub-arrays, adjacent bits in a cache block are interleaved across different sub-arrays such that their bit-lines are not multiplexed. In this case, the logical block partition that we defined would be interleaved across the sub-arrays. Thus, an entire cache block can be accessed in parallel. Given this, in-place concurrent operation on all the bits in a cache block is possible even with column multiplexing. Our design choice of placing the ways of a set within a block partition does not affect the degree of column multiplexing, as we interleave cache blocks of different sets instead.

Way Mapping vs. Parallel Tag-Data Access: We chose to place all the ways of a set within a block partition, so that operand locality does not depend on which way is chosen for a block at runtime. However, this prevents us from supporting parallel tag-data access, where all the cache blocks in a set are proactively read in parallel with the tag match. This optimization is typically used for the L1, as it can reduce the read latency by overlapping the tag match with the read. But it incurs a high read energy overhead (4.7× higher energy per access for the L1 cache) for a modest performance gain (2.5% for SPLASH-2 [10]). Given the significant benefits of an L1 Compute Cache, we think it is a worthy trade-off to forgo this optimization for the L1.

D. Managing Parallelism

Cache controllers are extended to provision CC controllers, which orchestrate the execution of CC instructions. The CC controller breaks a CC instruction into multiple simple vector operations whose operands span at most a single cache block and issues them to the sub-arrays. Since a typical cache hierarchy can have hundreds of sub-arrays (a 16MB L3 cache has 512 sub-arrays), we can potentially issue hundreds of concurrent operations. This is limited only by two factors. First, the bandwidth of the shared interconnects used to transmit addresses and commands; note that we do not replicate the address bus in our H-Tree interconnects. Second, the number of sub-arrays activated at the same time can be limited to cap the peak power drawn.

The controller at the L1-cache uses an instruction table to keep track of the pending CC instructions. The simple vector operations are tracked in the operation table. The instruction table tracks metadata associated at the instruction level (i.e., result, count of simple vector operations completed, next simple vector operation to be generated). The operation table, on the other hand, tracks the status of each operand associated with an operation and issues a request to fetch the operand if it is not present in the cache (Section IV-E). When all operands are in the cache, we issue the operation to the cache sub-array. As operations complete, they update the instruction table, and the L1-cache controller notifies the core when an instruction is complete.

To support the search instruction, the CC controller replicates the key in all the block partitions where the source data resides. To avoid doing this again for the same instruction, we track such replications per instruction in a key table.

Finally, if the address range of any operand of a CC instruction spans multiple pages, it raises a pipeline exception. The exception handler splits the instruction into multiple CC operations such that each of its operands is within a page.
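To make the bookkeeping above concrete, the sketch below (our illustration; the names and structure are assumptions, not the authors' microarchitecture) breaks one CC instruction into cache-block-granular operations, mirroring Section IV-D:

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BYTES 64u

/* One cache-block-granular sub-operation, as issued to a sub-array. */
typedef struct { uintptr_t src1, src2, dst; } cc_block_op_t;

/* Break a two-source CC instruction (e.g., cc_and a, b, c, n) into simple
 * vector operations that each span at most one cache block (Section IV-D).
 * If an operand's full range crossed a page boundary, the pipeline
 * exception handler would first split the instruction per page.
 * Returns the number of block operations emitted. */
static unsigned cc_split(uintptr_t a, uintptr_t b, uintptr_t c,
                         unsigned n_words, cc_block_op_t *out) {
    unsigned nbytes  = n_words * 8;  /* Table II operands are 8-byte words */
    unsigned nblocks = (nbytes + BLOCK_BYTES - 1) / BLOCK_BYTES;
    for (unsigned i = 0; i < nblocks; i++) {
        uintptr_t off = (uintptr_t)i * BLOCK_BYTES;
        out[i] = (cc_block_op_t){ a + off, b + off, c + off };
    }
    return nblocks;
}

int main(void) {
    cc_block_op_t ops[256];
    /* cc_and over 8 words (one 64-byte block), as in the Figure 6
     * walkthrough that follows, and over a 4 KB operand (64 blocks). */
    printf("8 words   -> %u block op(s)\n",
           cc_split(0x10000, 0x20000, 0x30000, 8, ops));
    printf("512 words -> %u block op(s)\n",
           cc_split(0x10000, 0x20000, 0x30000, 512, ops));
    return 0;
}
```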
E. Fetching In-Place Operands

The Compute Cache (CC) controllers are responsible for deciding the level in the cache hierarchy where CC operations need to be performed, and for issuing commands to the cache sub-arrays to execute them. To simplify our design, in our study, the CC controller always performs the operations at the highest-level cache where all the operands are present. If any of the operands are not cached, then the operation is performed at the lowest-level cache (L3). The cache allocation policy can be improved in the future by enhancing our CC controller with a cache block reuse predictor [11].

Once a cache level is picked, the CC controller fetches any missing operands to that level. The controller also pins the cache lines the operands are fetched into while the CC operation is under way. To avoid the eviction of operands while waiting for missing operands, we promote the cache blocks of those operands to the MRU position in the LRU chain. However, on receiving a forwarded coherence request, we release the lock to avoid deadlock and re-fetch the operand. Getting a forwarded request to a locked cache line will be rare for two reasons. First, in DRF [12] compliant programs, only one thread will be able to operate on a cache block while holding its software lock. Second, as the operands of a single CC operation are cache-block wide, false sharing will be low. Nevertheless, to avoid starvation in pathological scenarios, if a CC operation fails to get permission after repeated attempts (set to two), the processor core will translate and execute the CC operation as RISC operations.

Figure 6 shows a working example. The core issues the operation cc_and with address operands A, B, and C to the L1 controller (1). Each is of size 64 bytes (8 words), spanning an entire cache block. For clarity, only one cache set in each cache level is shown. None of the operands are present in the L1 cache. Operand B is in the L2 cache and is dirty. The L3 cache has a clean copy of A and a stale copy of B. C is not in any cache.

The L3 cache is chosen for the CC operation, as it is the highest cache level where all operands are present. The L1 and L2 controllers forward this operation to the L3 (2, 3). Before doing so, the L2 cache first writes back B to the L3. Note that caches already write back dirty data to the next cache level on eviction, and we use this existing mechanism. On receiving the command, the L3 fetches C from memory (4). Note that, as an optimization, C need not be fetched from memory, as it will be over-written entirely. Once all the operands are ready, the L3 performs the CC operation (5) and subsequently notifies the L1 controller (6) of its completion, which in turn notifies the core (7).

Figure 6: Compute Caches in action.

F. Cache Coherence

The Compute Cache optimization interacts with the cache coherence protocol minimally and as a result does not introduce any new race conditions. As discussed above, while the controller locks cache lines while performing a CC operation, on receipt of a forwarded coherence request, the controller releases the lock and responds to the request. Thus, a forwarded coherence request is always responded to in cases where it would be responded to in the baseline design.

Typically, higher-level caches write back dirty data to the next-level cache on evictions. Coherence protocols already support such writebacks. In the Compute Cache architecture, when a cache level is skipped to perform CC operations, any dirty operands in the skipped level need to be written back to the next level of cache to ensure correctness. To do this, we use the existing writeback mechanism and thus require no change to the underlying coherence protocol.

G. Consistency Model Implications

Current language consistency models (C++ and Java) are variants of the DRF model [12], and therefore a processor only needs to adhere to the RMO memory model. While ISAs providing stronger guarantees (x86) exist, they can be exploited only by writing assembly programs. As a consequence, while we believe stronger memory model guarantees for Compute Caches are an interesting problem (to be explored in future work), we assume the RMO model in our design. In RMO, no memory ordering is needed between data reads and writes, including all CC operations. Individual operations within a vector CC instruction can also be performed in parallel by the CC controller.

Programmers use fence instructions to order memory operations, which is sufficient in the presence of CC instructions. The processor stalls the commit of a fence operation until all preceding pending operations are completed, including CC operations. As with conventional vector instructions, it is not possible to specify a fence between scalar operations within a single vector CC instruction.

H. Memory Disambiguation and Store Coalescing

Similar to SIMD instructions, Compute Cache (CC) vector instructions require additional support in the processor core for memory ordering. We classify instructions in the CC ISA into two types. CC-R type instructions (cc_cmp, cc_search) only read from memory. The rest of the instructions are CC-RW type, as they both read and write memory. Under the RMO memory model, CC-R can be executed out-of-order, whereas CC-RW behaves like a store. In the following discussion, we refer to CC-R as a load, and CC-RW as a store.

Conventional processor cores use a load-store queue (LSQ) to check for address conflicts between a load and the preceding uncommitted stores. As vector instructions can access more than a word, it is necessary to enhance the LSQ with support for checking address ranges, instead of just one address. For this reason, we use a dedicated vector LSQ, where each entry has additional space to keep track of the address ranges of the operands of a vector instruction. Similar to the LSQ, we also split the store buffer into two, one for scalar stores and another for vector stores. The vector store buffer supports address range checks (at most 12 comparisons per entry). Our scalar store buffer permits coalescing. However, it is not possible to coalesce CC-RW instructions with any store, because their output is not known until they are performed in a cache. As the vector store buffer is non-coalescing, it is possible for the two store buffers to contain stores to the same location. If such a scenario is detected, the conflicting store is stalled until the preceding store completes, which ensures program order between stores to the same location. We augment the store buffer with a field which points to any successor store and a stall bit. The stall bit is reset when the predecessor store completes.

Data values are not forwarded from vector stores to any loads, or from any store to a vector load. Code segments where both vector and scalar operations access the same location within a short time span are likely to be rare. If such a code segment is frequently executed, the compiler can choose to not employ the Compute Cache optimization.
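The range checks above are ordinary interval-overlap tests. Below is a minimal sketch of ours (the entry layout is an assumption) of how a vector LSQ entry's operand ranges could be checked against a scalar access:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Address range covered by one operand of a vector CC instruction. */
typedef struct { uintptr_t base; size_t len; } addr_range_t;

/* A vector LSQ / vector store-buffer entry tracks the ranges of all
 * operands of one CC instruction (Section IV-H). */
typedef struct { addr_range_t ranges[3]; unsigned n; } vec_lsq_entry_t;

static int ranges_overlap(addr_range_t r, uintptr_t addr, size_t len) {
    return addr < r.base + r.len && r.base < addr + len;
}

/* Conflict check between a scalar access and a pending CC-RW entry: any
 * overlap means the younger access must wait, since values are never
 * forwarded from vector stores. */
static int conflicts(const vec_lsq_entry_t *e, uintptr_t addr, size_t len) {
    for (unsigned i = 0; i < e->n; i++)
        if (ranges_overlap(e->ranges[i], addr, len)) return 1;
    return 0;
}

int main(void) {
    /* Pending cc_and A, B, C with 4 KB operands. */
    vec_lsq_entry_t e = { { {0x10000, 4096}, {0x20000, 4096}, {0x30000, 4096} }, 3 };
    printf("scalar load of C[8] conflicts? %d\n", conflicts(&e, 0x30040, 8)); /* 1 */
    printf("unrelated load conflicts?      %d\n", conflicts(&e, 0x50000, 8)); /* 0 */
    return 0;
}
```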
I. Error Detection and Correction

Systems with strong reliability requirements employ Error Correction Codes (ECC) for caches. ECC protection for conventional and near-place operations is unaffected in our design. For cc_copy, simply copying the ECC from source to destination suffices. For cc_buz, the ECC of the zeroed blocks can be updated. For comparison and search, the ECC check can be performed by comparing the ECCs of the source operands. An error is detected if the data bits match but the ECC bits do not, or vice versa.

For in-place logical operations (cc_and, cc_or, cc_xor, cc_clmul, and cc_not), it is challenging to perform the check and compute the ECC for the result. We propose two alternatives. One alternative is to read out the xor of the two operands and their ECCs, and check the integrity at the ECC logic unit (ECC(A xor B) = ECC(A) xor ECC(B)). This unit also computes the ECC of the result. Our sub-array design permits computing the xor operation alongside any logical operation. Although the logical operation is still done in-place, this method will incur extra data transfers to and from the ECC logic unit. Cache scrubbing during cache idle cycles [13] is a more attractive option. Since soft errors in caches are infrequent (0.7 to 7 errors/year [14]), periodic scrubbing can be effective while keeping performance and energy overheads low.

J. Near-Place Compute Caches

In the absence of operand locality, we propose to compute instructions "near" the cache. Our controller is provisioned with additional logic units (not arithmetic units) and registers to temporarily store the operands. The source operands are read from the cache sub-array into the registers at the controller, and the computed results are then written back to the cache. In-place computation has two benefits over near-place computation. First, it provides massive compute capability for almost no additional area overhead. For example, a 16 MB L3 with 512 sub-arrays allows 8KB of data to be computed on in parallel. To support equivalent computational capability, we would need 128 vector ALUs, each 64 bytes wide. This is not a trivial overhead. We assume one vector logic unit per cache controller in our near-cache design. Second, in-place compute avoids data transfer over the H-Tree wires. This reduces in-place compute latency (14 cycles) compared to near-cache (22 cycles). Also, 60%-80% of the total cache read energy is due to H-Tree wire transfer (see Table I), which is eliminated with in-cache computation. Nevertheless, near-cache computing retains the other benefits of Compute Caches, by avoiding the transfer of data to the higher-level caches and the core.

V. APPLICATIONS

Our Compute Cache design supports simple but common operations, which can be utilized to accelerate a diverse set of data-intensive applications.

Search and Compare Operations: Compare and search are common operations in many emerging applications, especially text processing. Intel recently added seven new instructions to the x86 SSE 4.2 vector support that efficiently perform character searches and comparisons [15]. The Compute Cache architecture can significantly improve the efficiency of these instructions. Similar to specialized CAM accelerators [16], our search functionality can be utilized to speed up applications such as search engines, decision tree training, and compression and encoding schemes.

Logical Operations: Compute Cache logical operations can speed up the processing of commonly used bit manipulation primitives such as bitmaps. Bitmaps are used in graph and database indexing/query processing. Query processing on databases with bitmap indexing requires logical operations on large bitmaps. Compute Caches can also accelerate binary bit matrix multiplication (BMM), which has uses in numerous applications such as error correcting codes, cryptography, bioinformatics, and the Fast Fourier Transform (FFT). Given its importance, it was implemented as a dedicated instruction in Cray supercomputers [17], and Intel processors provision an x86 carryless multiply (clmul) instruction to speed it up. The inherent cache locality in matrix multiplication makes BMM suitable for Compute Caches. Further, our large vector operations allow BMM to scale to large matrices.

Copy Operation: Prior research [7] makes a strong case for optimizing copy performance, which is a common operation in many applications in system software and warehouse-scale computing [18]. The operating system spends a considerable chunk of its time (more than 50%) copying bulk data [19]. For instance, copying is necessary for frequently used system calls like fork, inter-process communication, virtual machine cloning and deduplication, and file system and network management. Our copy operation can accelerate checkpointing, which has a wide range of uses, including fault tolerance and time-travel debugging. Finally, our copy primitive can also be employed in bulk zeroing, which is an important primitive required for memory safety [20].
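To show how the cc_clmul semantics of Table II map onto BMM (a sketch we add for illustration; the helpers below are a software model, not the hardware), each output bit of a bit-matrix multiply is the GF(2) inner product of a row of A and a column of B, i.e., an xor-reduction of a bitwise and:

```c
#include <stdint.h>
#include <stdio.h>

/* Software model of the cc_clmul64 semantics from Table II:
 * xor-reduction of (a & b), i.e., a GF(2) inner product.
 * __builtin_parityll is a GCC/Clang builtin. */
static uint64_t cc_clmul64(uint64_t a, uint64_t b) {
    return __builtin_parityll(a & b);
}

/* 64x64 bit-matrix multiply, C = A x B over GF(2), with one matrix row
 * packed per 64-bit word and B supplied transposed, so that each output
 * bit is exactly one cc_clmul64 over two co-resident words. */
static void bmm64(const uint64_t A[64], const uint64_t Bt[64], uint64_t C[64]) {
    for (int i = 0; i < 64; i++) {
        uint64_t row = 0;
        for (int j = 0; j < 64; j++)
            row |= cc_clmul64(A[i], Bt[j]) << j;
        C[i] = row;
    }
}

int main(void) {
    uint64_t I[64], Bt[64], C[64];
    for (int i = 0; i < 64; i++) {           /* A = identity, B arbitrary */
        I[i]  = 1ULL << i;
        Bt[i] = 0x0123456789ABCDEFULL * (uint64_t)(i + 1);
    }
    bmm64(I, Bt, C);                         /* identity x B must equal B */
    int ok = 1;
    for (int i = 0; i < 64 && ok; i++)
        for (int j = 0; j < 64 && ok; j++)
            ok = (((C[i] >> j) & 1) == ((Bt[j] >> i) & 1));
    printf("identity check: %s\n", ok ? "pass" : "fail");
    return 0;
}
```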
VI. EVALUATION

In this section we demonstrate the efficacy of Compute Caches (CC) using both a micro-benchmark study and a suite of data-intensive applications.

A. Simulation Methodology

We model a multi-core processor using SniperSim [21], a Pin-based simulator, per Table IV. We use McPAT [22] to model power consumption in both cores and caches.

Configuration   8-core CMP
Processor       2.66 GHz out-of-order core, 48-entry LQ, 32-entry SQ
L1-I Cache      32KB, 4-way, 5 cycle access
L1-D Cache      32KB, 8-way, 5 cycle access
L2 Cache        inclusive, private, 256KB, 8-way, 11 cycle access
L3 Cache        inclusive, shared, 8 NUCA slices, 2MB each, 16-way, 11 cycle + queuing delay
Interconnect    ring, 3 cycle hop latency, 256-bit link width
Coherence       directory based, MESI
Memory          120 cycle latency

Table IV: Simulator parameters.

B. Application Customization and Setup

In this section we describe how we redesigned the applications in our study to utilize CC instructions.

WordCount: WordCount [23] reads a text file (10MB) and builds a dictionary of unique words and their frequency of appearance in the file. While the baseline does a binary search over the dictionary to check whether a new word has been found, we model the dictionary as an alphabet-indexed (first two letters of a word) CAM (1KB each). As the dictionary is large (719KB), we perform search operations in the L3 cache. The CC search instruction returns a bit vector indicating match/mismatch for multiple words, and hence we also model additional mask instructions which report match/mismatch per word.

StringMatch: StringMatch [23] reads words from a text file (50MB), encrypts them, and compares them to a list of encrypted keys. Encryption cannot be offloaded to the cache; hence, the encrypted words are present in the L1-cache and we perform the CC search there. By replicating an encrypted key across all sub-arrays in the L1, a single search instruction can compare it against multiple encrypted words. Similar to WordCount, we also model mask instructions.

DB-BitMap: We also model FastBit [24], a bitmap index library. The input database index is created using data sets obtained from a real physics experiment, STAR [25]. A sample query performs a logical OR or AND of large bitmap bins (several hundred KBs each). We modify the query to use cc_or operations (each processes 2KB of data). We measure the average query processing time for a sample query mix running over uncompressed bitmap indexes.

BMM: Our optimized baseline BMM implementation (Section V) uses blocking and x86 CLMUL instructions. Given the reuse of the matrix, we perform cc_clmul in the L1-cache. We model 256 × 256 bit matrices.

Checkpointing: We model in-memory copy-on-write checkpointing support at page granularity for the SPLASH-2 [10] benchmark suite (checkpointing interval of 100,000 application instructions).

C. Compute Sub-Array: Delay and Area Impact

Compute Caches have a negligible impact on the baseline read/write accesses as we still support differential sensing. To get delay and energy estimates, we perform SPICE simulations on a 28nm SOI CMOS process based sub-array, using standard foundry 6T bit-cells (Footnote 1). An and/or/xor 64-byte in-place operation is 3× longer compared to a single sub-array access, while the rest of the CC operations are 2× longer. In terms of energy, cmp/search/clmul are 1.5×, copy/buz/not are 2×, and the rest are 2.5× the baseline sub-array access. The area overhead is 8% for a sub-array of size 512 × 512 (Footnote 2). Note that our estimates account for technology variations and process, voltage, and temperature changes. Further, these estimates are conservative when compared to measurements on silicon [2], in order to provision a robust margin against read disturbs and to account for circuit parameter variation across technology nodes.

Footnote 1: The SRAM arrays we model are 6T-cell based. Lower-level caches (L2/L3) are optimized for density and employ 6T-based arrays. However, the L1-cache can employ 8T-cell based designs; to support in-place operations in such a design, a differential read-disturb resilient 8T design [26] can be used.

Footnote 2: The optimal sub-array dimensions for the L3 and L2 caches we model are 512 × 512 and 128 × 512 bits, respectively.

We use the above parameters in conjunction with the energy per cache access from McPAT to determine the energy of CC operations (Table V). CC operations cost more in lower-level caches as they employ larger sub-arrays. However, they also deliver higher savings (compared to the baseline read/write(s) needed) as they have larger in-cache interconnect components. For search, we assume a write operation for the key; this cost is amortized over large searches.

Cache   write  read  cmp  copy  search  not   logic
L3      2852   2452  840  1340  3692    1340  1672
L2      1154   802   242  608   1396    608   704
L1      375    295   186  324   561     324   387

Table V: Cache energy (pJ) per cache block (64 bytes).

D. Microbenchmark Study

To demonstrate the efficacy of Compute Caches, we model four microbenchmarks: copy, compare, search, and logical-or. We compare Compute Caches to a baseline (Base_32) which supports 32-byte SIMD loads and stores.

Figure 7 (a) depicts the throughput attained for the different operations for an operand size of 4KB. For this experiment, all operands are in the L3 cache and the Compute Cache operation is performed therein. Among the operations, for the baseline, search achieves the highest throughput as it incurs a single cache miss for the key and subsequent cache misses are only for data. Compute Caches accelerate throughput for all operations: 54× over Base_32 averaged across the four kernels. Our throughput improvement has two primary sources: massive data parallelism exposed in the presence of independent sub-arrays to compute in, and latency reduction due to avoiding data movement to the core. For instance, for the copy operation, data parallelism exposes a 32× and latency reduction a 1.55× throughput improvement.

Figure 7 (b) depicts the dynamic energy consumed for an operand size of 4KB. The dynamic energy depicted is broken down into core, cache data access (cache-access), cache interconnect (cache-ic), and network-on-chip (noc) components. We term data movement energy to be everything except the core component. Overall, CC provides dynamic energy savings of 90%, 89%, 71% and 92% for the copy, compare, search, and logical (OR) kernels relative to Base_32. Large vector CC instructions help bring down the core component of energy. Further, CC successfully eliminates all the components of data movement. Writes incurred due to key replication limit the efficacy of the search CC operation in bringing down the L3 cache energy components. As the size of the data to be searched increases, the key replication overhead is amortized, increasing the effectiveness of CC.

Figure 7 (c) depicts the total energy consumed, broken down into static and dynamic components. Due to the reduction in execution time, CC can significantly reduce static energy.
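As a rough, cache-side-only illustration of why these operations save energy (our own arithmetic using only the Table V numbers; core and NoC energy are ignored), an in-place operation on two co-resident 64-byte blocks can be contrasted with the block reads and writes the baseline would need at the L3 slice:

```c
#include <stdio.h>

/* Per-64-byte-block L3-slice energies from Table V (pJ). */
int main(void) {
    double l3_read = 2452, l3_write = 2852, l3_cmp = 840, l3_copy = 1340;

    /* Baseline compare: at least two block reads out of the L3 slice. */
    printf("compare: %.1fx cache-side saving\n", 2 * l3_read / l3_cmp);
    /* Baseline copy: one read plus one write at the L3 slice. */
    printf("copy:    %.1fx cache-side saving\n", (l3_read + l3_write) / l3_copy);
    return 0;
}
```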
Figure 7: Benefit of CC for a 4KB operand. (a) Throughput. (b) Dynamic energy. (c) Total energy.

Figure 8: (a) Total energy of in-place vs. near-place for a 4KB operand. (b) Savings in dynamic energy for a 4KB operand at different cache levels.

Figure 9: (a) Total energy benefit and (b) performance improvement of CC for applications.

Overall, averaged across the four kernels studied, CC provides 91% total energy savings relative to Base_32.

Near-place design: In our analysis so far, we have assumed perfect operand locality, i.e., all Compute Cache operations are performed in-place. Figure 8 (a) depicts the total energy for the near-place and in-place CC configurations. Recall that in-place computation enables far more parallelism than near-place and offers larger savings in terms of performance and hence total energy. For example, our L3 cache allows 8KB of data to be operated on in parallel. A near-place design would need 128 64-byte-wide logical units to provide equivalent data parallelism. This is not a trivial overhead. As such, for 4KB operands, in-cache computing provides 3.6× total energy savings and 16× throughput improvement on average over near-place. Note, however, that near-place can still offer considerable benefits over the baseline architecture.

Computing at different cache levels: We next evaluate the efficacy of Compute Caches when operands are present at different cache levels. Figure 8 (b) depicts the difference in dynamic energy between the CC configurations and their corresponding Base_32 configurations. As expected, the absolute savings are higher when operands are in lower-level caches. However, we find that doing Compute Cache operations in the L1 or L2 cache can also provide significant savings. As the number of CC instructions stays the same regardless of cache level, the core energy savings are equal for all cache levels. Overall, CC provides savings of 95% and 34% for the L1 and L2 caches, respectively, relative to Base_32.

E. Application Benchmarks

In this section we study the benefits of Compute Caches for five applications. Figure 9 (b) shows the overall speedup of Compute Caches for four of these applications. We see a performance improvement of 2× for WordCount, 1.5× for StringMatch, 3.2× for BMM, and 1.6× for DB-BitMap. Figure 9 (a) shows the ratio of the total energy of CC to a baseline processor with 32-byte SIMD units. We observe average energy savings of 2.7× across these applications. The majority of the benefits come from three sources: data parallelism exposed by large vector operations, reduction in the number of instructions, and reduced data movement.

For instance, recall that while the baseline WordCount does a binary search over the dictionary of unique words, Compute Cache does a CAM search using cc_search instructions. Superficially it may seem that binary search will outperform CAM search. However, we find that the CC version has 87% fewer instructions by doing away with the bookkeeping instructions of binary search. Further, our vector cc_search enables energy-efficient CAM searches. These benefits are also evident in StringMatch, BMM and DB-BitMap (32%, 98% and 43% instruction reduction, respectively). The massive data-level parallelism we enable benefits the data-intensive range and join queries in the DB-BitMap application. Recall that this benchmark performs many independent logical OR operations over large bitmap bins. Since these operations are independent, many of them can be issued in parallel.
Significant cache locality exhibited by these applications makes them highly suitable for Compute Caches. As cache accesses are cheaper than memory accesses, computation in the cache is more profitable for data with high locality or reuse. The dictionary in WordCount has high locality. BMM has inherent locality due to the nature of matrix multiplication. In DB-BitMap, there is significant reuse within a query due to the aggregation of results into a single bitmap bin, and there is potential reuse of bitmaps across queries. In StringMatch, locality comes from the repeated use of encrypted keys.

Figure 10 depicts the overall checkpointing overhead for SPLASH-2 applications as compared to a baseline with no checkpointing. In the absence of SIMD support, this overhead can be as high as 68%, while in its presence the average overhead is 30%. By further reducing instruction count and avoiding data movement, CC brings this overhead down to a mere 6%. CC successfully relegates checkpointing to the cache, avoids data pollution of the higher-level caches, and relieves the processor of any checkpointing overhead. Figure 11 shows significant energy savings due to Compute Caches. Note that, for checkpointing, all operations are page-aligned and hence we achieve perfect operand locality.

Figure 10: Performance overhead of CC for checkpointing.

Figure 11: Total energy with and without checkpointing.

VII. RELATED WORK

Past processing-in-memory (PIM) solutions move compute near the memory [6]. This can be accomplished using recent advancements in 3D die-stacking [8]. There have also been a few proposals for adding hardware structures near the cache that track information to help improve the efficiency of copy [5] and atomic operations [27]. The associative processor [28] uses CAMs (area and energy inefficient compared to SRAM caches) as caches and augments them with logic around the CAM to orchestrate computation. None of these solutions exploit the benefits of in-place bit-line cache computing noted in Section III. We get a massive number of compute units by re-purposing cache elements that already exist. Also, in-place Compute Caches reduce the data movement overhead between a cache's sub-arrays and its controller. On the flip side, in-place cache computing imposes restrictions on the type of operations that can be supported and on the placement of operands, which we address in this paper. When an in-place operation is not possible, we use near-place Compute Caches for copy, logical, and search operations, which has also not been studied in the past.

Row-clone [7] enabled data copy from a source DRAM row to a row buffer and then to a destination row. Thereby, it avoided data movement over the memory channels. A subsequent CAL article [29] suggested that data could be copied to a temporary buffer in DRAM, from where logical operations could be performed. Row-clone's approach is also a form of near-place computing, which requires that all operands are copied to new DRAM rows before they can be operated upon. Bit-line in-place operations may not be feasible in DRAM, as DRAM reads are destructive (one of the reasons why DRAMs need refreshing).

Recent research enhanced non-volatile memory technology to support certain in-memory CAM [16] and bitwise logic operations [30]. The Compute Cache architecture is more efficient when at least one of the operands has cache locality (e.g., the dictionary in word count). Ultimately, the locality characteristics of an application should guide at which level of the memory hierarchy the computation is performed.

Bit-line computing in SRAMs has been used to implement custom accelerators: approximate dot products in the analog domain for pattern recognition [31] and CAMs [32]. However, it has not been used to architect a compute cache in a conventional cache hierarchy, where we need general solutions to problems such as operand locality, coherence, and consistency, which are addressed in this paper. We also demonstrated the utility of our Compute Cache enabled operations in accelerating a fairly diverse range of applications (databases, cryptography, data analytics).

VIII. CONCLUSION

In this paper we propose the Compute Cache (CC) architecture, which unlocks hitherto untapped computational capability present in on-chip caches by exploiting emerging SRAM circuit technology. Using bit-line computing enabled caches, we can perform several simple operations in-place in the cache over very wide operands. This exposes massive data parallelism, saving instruction processing, cache interconnect, and intra-cache energy expenditure. We present solutions to several challenges exposed by such an architecture. We demonstrate the efficacy of our architecture using a suite of data-intensive benchmarks and micro-benchmarks.
IX. ACKNOWLEDGMENTS

We thank the anonymous reviewers for their comments, which helped improve this paper. This work was supported in part by the NSF under the CAREER-1149773 and SHF-1527301 awards and by C-FAR, one of the six SRC STARnet Centers sponsored by MARCO and DARPA.

REFERENCES

[1] B. Dally, “Power, programmability, and granularity: The challenges of exascale computing,” in Parallel Distributed Processing Symposium (IPDPS), 2011 IEEE International, 2011.
[2] S. Jeloka, N. B. Akesh, D. Sylvester, and D. Blaauw, “A 28 nm configurable memory (TCAM/BCAM/SRAM) using push-rule 6T bit cell enabling logic-in-memory,” IEEE Journal of Solid-State Circuits, 2016.
[3] M. Kang, E. P. Kim, M. S. Keel, and N. R. Shanbhag, “Energy-efficient and high throughput sparse distributed memory architecture,” in 2015 IEEE International Symposium on Circuits and Systems (ISCAS), 2015.
[4] P. A. La Fratta and P. M. Kogge, “Design enhancements for in-cache computations,” in Workshop on Chip Multiprocessor Memory Systems and Interconnects, 2009.
[5] F. Duarte and S. Wong, “Cache-based memory copy hardware accelerator for multicore systems,” IEEE Transactions on Computers, vol. 59, no. 11, 2010.
[6] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick, “A case for intelligent RAM,” IEEE Micro, 1997.
[7] V. Seshadri, Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry, “RowClone: Fast and energy-efficient in-DRAM bulk data copy and initialization,” in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-46.
[8] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture, ser. ISCA ’15, 2015.
[9] O. L. Lempel, “2nd generation Intel Core processor family: Intel Core i7, i5 and i3,” ser. HotChips ’11, 2011.
[10] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, “The SPLASH-2 programs: Characterization and methodological considerations,” in Proceedings of the 22nd Annual International Symposium on Computer Architecture, 1995.
[11] J. Jalminger and P. Stenstrom, “A novel approach to cache block reuse predictions,” in Proceedings of the 2003 International Conference on Parallel Processing, 2003.
[12] S. V. Adve and M. D. Hill, “Weak ordering—a new definition,” in Proceedings of the 17th Annual International Symposium on Computer Architecture, ser. ISCA ’90.
[13] J. B. Sartor, W. Heirman, S. M. Blackburn, L. Eeckhout, and K. S. McKinley, “Cooperative cache scrubbing,” in Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, 2014.
[14] M. Wilkening, V. Sridharan, S. Li, F. Previlon, S. Gurumurthi, and D. R. Kaeli, “Calculating architectural vulnerability factors for spatial multi-bit transient faults,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014.
[15] “XML parsing accelerator with Intel Streaming SIMD Extensions 4 (Intel SSE4),” Intel Developer Zone, 2015.
[16] Q. Guo, X. Guo, Y. Bai, and E. İpek, “A resistive TCAM accelerator for data-intensive computing,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-44, 2011.
[17] “Cray Assembly Language (CAL) for Cray X1 Systems Reference Manual, version 1.2,” Cray Inc., 2003.
[18] S. Kanev, J. P. Darago, K. Hazelwood, P. Ranganathan, T. Moseley, G.-Y. Wei, and D. Brooks, “Profiling a warehouse-scale computer,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture, ser. ISCA ’15.
[19] M. Calhoun, S. Rixner, and A. Cox, “Optimizing kernel block memory operations,” in Workshop on Memory Performance Issues, 2006.
[20] X. Yang, S. M. Blackburn, D. Frampton, J. B. Sartor, and K. S. McKinley, “Why nothing matters: The impact of zeroing,” ser. OOPSLA ’11.
[21] T. Carlson, W. Heirman, and L. Eeckhout, “Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation,” in High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for, 2011.
[22] S. Li, J. H. Ahn, R. Strong, J. Brockman, D. Tullsen, and N. Jouppi, “McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures,” in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-42, 2009.
[23] R. M. Yoo, A. Romano, and C. Kozyrakis, “Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system,” in Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC), ser. IISWC ’09, 2009.
[24] “FastBit: An efficient compressed bitmap index technology,” https://sdm.lbl.gov/fastbit/, 2015.
[25] “The STAR experiment,” http://www.star.bnl.gov/.
[26] J.-J. Wu, Y.-H. Chen, M.-F. Chang, P.-W. Chou, C.-Y. Chen, H.-J. Liao, M.-B. Chen, Y.-H. Chu, W.-C. Wu, and H. Yamauchi, “A large σVTH/VDD tolerant zigzag 8T SRAM with area-efficient decoupled differential sensing and fast write-back scheme,” IEEE Journal of Solid-State Circuits, 2011.
[27] J. H. Lee, J. Sim, and H. Kim, “BSSync: Processing near memory for machine learning workloads with bounded staleness consistency models,” in Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT), ser. PACT ’15.
[28] L. Yavits, A. Morad, and R. Ginosar, “Computer architecture with associative processor replacing last-level cache and SIMD accelerator,” IEEE Transactions on Computers, 2015.
[29] V. Seshadri, K. Hsieh, A. Boroumand, D. Lee, M. Kozuch, O. Mutlu, P. Gibbons, and T. Mowry, “Fast bulk bitwise AND and OR in DRAM,” Computer Architecture Letters, 2015.
[30] S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, and Y. Xie, “Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories,” in Proceedings of the 53rd Annual Design Automation Conference, ser. DAC ’16.
[31] M. Kang, M. S. Keel, N. R. Shanbhag, S. Eilert, and K. Curewitz, “An energy-efficient VLSI architecture for pattern recognition via deep embedding of computation in SRAM,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
[32] K. Pagiamtzis and A. Sheikholeslami, “Content-addressable memory (CAM) circuits and architectures: A tutorial and survey,” IEEE Journal of Solid-State Circuits, 2006.
