Compute Caches
Shaizeen Aga, Supreet Jeloka, Arun Subramaniyan, Satish Narayanasamy, David Blaauw, and Reetuparna Das
University of Michigan, Ann Arbor
{shaizeen, sjeloka, arunsub, nsatish, blaauw, reetudas}@umich.edu
Abstract—This paper presents the Compute Cache architecture that enables in-place computation in caches. Compute Caches use emerging bit-line SRAM circuit technology to re-purpose existing cache elements and transform them into active, very large vector computational units. They also significantly reduce the overheads of moving data between different levels of the cache hierarchy.

We discuss solutions to satisfy the new constraints imposed by Compute Caches, such as operand locality, as well as simple solutions for integrating them into a conventional cache hierarchy while preserving properties such as coherence, consistency, and reliability.

Compute Caches increase performance by 1.9× and reduce energy by 2.4× for a suite of data-centric applications, including text and database query processing, cryptographic kernels, and in-memory checkpointing. Applications with a larger fraction of Compute Cache operations could benefit even more, as our micro-benchmarks indicate (54× throughput, 9× dynamic energy savings).

I. INTRODUCTION

As computing today is dominated by data-centric applications, there is a strong impetus for specialization for this important domain. Conventional processors' narrow vector units fail to exploit the high degree of data-parallelism in these applications. They also expend a disproportionately large fraction of time and energy moving data over the cache hierarchy and in instruction processing, as compared to the actual computation [1].

We present the Compute Cache architecture for dramatically reducing these inefficiencies through in-place (in-situ) processing in caches. A modern processor devotes a large fraction (40-60%) of die area to caches, which are used for storing and retrieving data. Our key idea is to re-purpose and transform the elements used in caches into active computational units. This enables computation in-place within a cache sub-array, without transferring data in or out of it. Such a transformation can unlock massive data-parallel compute capabilities, dramatically reduce the energy spent in data movement over the cache hierarchy, and thereby directly address the needs of data-centric applications.

Our proposed architecture uses an emerging SRAM circuit technology, which we refer to as bit-line computing [2], [3]. By simultaneously activating multiple word-lines and sensing the resulting voltage over the shared bit-lines, several important operations over the data stored in the activated bit-cells can be accomplished without data corruption. A recently fabricated chip [2] demonstrates the feasibility of bit-line computing. It also shows stability of more than six sigma robustness in Monte Carlo simulations, which is considered the industry standard for robustness against process variations.

Past processing-in-memory (PIM) solutions proposed to move processing logic near the cache [4], [5] or main memory [6], [7]; 3D stacking can make this possible [8]. Compute Caches significantly push the envelope by enabling in-place processing using existing cache elements. It is an effective optimization for data-centric applications, where at least one of the operands used in computation (e.g., the dictionary in WordCount) has cache locality.

The efficiency of Compute Caches arises from two main sources: massive parallelism and reduced data movement. A cache is typically organized as a set of sub-arrays; as many as hundreds of sub-arrays, depending on the cache level. These sub-arrays can potentially compute concurrently on the data stored in them (KBs of data) with small extensions to the existing cache structures (8% cache area overhead). Thus, caches can effectively function as large vector computational units, whose operand sizes are orders of magnitude larger than those of conventional SIMD units (KBs vs. bytes). To achieve similar capability, the logic close to memory in a conventional PIM solution would need to provision more than a hundred additional vector functional units. The second benefit of Compute Caches is that they avoid the energy and performance cost incurred not only for transferring data between the cores and different levels of the cache hierarchy (through the network-on-chip), but even between a cache's sub-arrays and its controller (through the in-cache interconnect).

This paper addresses several problems in realizing the Compute Cache architecture, discusses ISA and system software extensions, and re-designs several data-centric applications to take advantage of the new processing capability.

An important problem in using Compute Caches is satisfying the operand locality constraint. Bit-line computing requires that the data operands are stored in rows that share the same set of bit-lines. We architect a cache geometry where the ways in a set are judiciously mapped to a sub-array, so that software can easily satisfy operand locality. Our design allows a compiler to ensure operand locality simply by placing operands at addresses that are page aligned (same page offset). It avoids exposing the internals of a cache, such as its size or geometry, to software.
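To make the page-offset requirement concrete, the sketch below is a minimal illustration of the check a compiler or allocator would have to satisfy before using an in-place operation on two operands; it assumes 4KB pages (12 offset bits), and the helper name and constants are ours, not an interface defined by this design.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch: with 4KB pages, operand locality reduces to the two
 * operands having the same offset within their 4KB regions, i.e., the low
 * 12 address bits must match. Names and constants are assumptions made for
 * this example only. */
#define CC_ALIGN_BITS 12u
#define CC_ALIGN_MASK ((1u << CC_ALIGN_BITS) - 1u)

static bool cc_operands_colocated(const void *a, const void *b)
{
    /* Same offset within a 4KB region (virtual and, with 4KB pages,
     * physical as well). */
    return ((uintptr_t)a & CC_ALIGN_MASK) == ((uintptr_t)b & CC_ALIGN_MASK);
}
```

An allocator can guarantee this by carving both operands out of 4KB-aligned allocations at the same starting offset; as discussed later, the operands do not need to live in separate pages.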
When in-place processing is not possible for an operation due to lack of operand locality, we propose to use near-place Compute Caches. In the near-place design, the source operands are read out from the cache sub-arrays, the operation is performed in a logic unit placed near the cache controller, and the result is written back to the cache.

An in-place operation takes longer than a single read or write sub-array access, as it requires a longer word-line pulse to activate and sense two rows, to compensate for the lower word-line voltage. Sensing time also increases due to the use of single-ended sense amplifiers, as opposed to differential sensing. However, note that this is still less than the delay the baseline would incur to accomplish an equivalent operation, as that would require multiple read accesses and/or a write access. Section VI-C provides the detailed delay, energy and area parameters for compute-capable cache sub-arrays.

Figure 2: SRAM circuit for in-place operations. Two rows (WLi and WLj) are activated. An AND operation is performed by sensing the bit-line (BL). All bit-lines are initially pre-charged to '1'. If both activated bits in a column have a '1' (column 'n'), the BL stays high and is sensed as a '1'. If any one of the bits is '0', it lowers the BL voltage below Vref and is sensed as a '0'. A NOR operation can be performed by sensing the bit-line bar (BLB).

Figure 3: Proportion of energy (top) and area (bottom) for a bulk comparison operation on (a) a scalar core, (b) a SIMD core, and (c) Compute Cache. The red dot depicts logic capability.

Cache      cache-ic (H-tree)   cache-access
L1-D       179 pJ              116 pJ
L2         675 pJ              127 pJ
L3-slice   1985 pJ             467 pJ
Table I: Per-access cache energy, split into in-cache interconnect (H-tree) and sub-array access components.

C. Operand Locality

For in-place operations, the operands need to be physically stored in a sub-array such that they share the same set of bitlines. We term this requirement operand locality. In this section, we discuss cache organization and software constraints that can together satisfy this property. Fortunately, we find that software can ensure operand locality as long as operands are page-aligned, i.e., have the same page offset. Besides this, the programmer or the compiler does not need to know about any other specifics of the cache geometry.

Figure 5: Cache organization example (1 bank, 16 sets S0-S15, 4 ways per set; block partitions BP0-BP7; sub-arrays connected by an H-tree), address decoding ([i][j] = set i, way j), and the alternate address decoding used for parallel tag-data access caches.

Operand locality aware cache organization: Figure 5 illustrates a simple cache with one bank and four sub-arrays. Rows in a sub-array share the same set of bitlines. We define a new term, Block Partition (BP). The block partition of a sub-array is the group of cache blocks in that sub-array that share the same bitlines. In-place operation is possible between any two cache blocks stored within a block partition. In our example, since each row in a sub-array holds two cache blocks, there are two block partitions per sub-array. In total, there are eight block partitions (BP0-BP7). In-place compute is possible between any blocks that map to the same block partition (e.g., blocks in sets S0 and S2).

We make two design choices for our cache organization to simplify the operand locality constraint. First, all the ways in a set are mapped to the same block partition, as shown in Figure 5(a). This ensures that operand locality is not affected by which way is chosen for a cache block. Second, we use a portion of the set-index bits to select the block's bank and block partition, as shown in Figure 5(b). As long as these bits are the same for two operands, they are guaranteed to be mapped to the same block partition.

Table III: Cache geometry and operand locality constraint.

Software requirement: The number of address bits that must match for operand locality varies based on the cache size. As shown in Table III, even the largest cache (L3) in our model requires only that the least significant 12 bits be the same for two operands (we assume pages are mapped to the NUCA slice closest to the core actively accessing them). Given that our pages are 4KB in size, we observe that as long as the operands are page aligned, i.e., have the same page offset, they will be placed in the address space such that the least significant bits (12 for a 4KB page) of their addresses (both virtual and physical) match. This trivially satisfies the operand locality requirement for all the cache levels and sizes we study. Note that we only require operands to be placed at the same offset within 4KB memory regions; it is not necessary to place them in separate pages. For super-pages larger than 4KB, operands can be placed within a page while ensuring 12-bit address alignment.

We expect that for data-intensive regular applications that operate on large chunks of data, it is possible to satisfy this property. Many operating system operations that involve copying from one page to another are guaranteed to exhibit operand locality in our system. Compilers and dynamic memory allocators could be extended to optimize for this property in the future.

Finally, a binary compiled with a given address bit alignment requirement (12 bits in our work) is portable across a wide range of cache architectures as long as the number of address bits to be aligned is equal to or less than what it was compiled for. If the cache geometry changes such that it requires greater alignment, then the programs would have to be recompiled to satisfy that stricter constraint.

Column Multiplexing: With column multiplexing, multiple adjacent bit-lines are multiplexed to a single bit data output, which is then observed using one sense-amplifier. This keeps the area overhead of the peripherals in check and improves resilience to particle strikes. Fortunately, in column multiplexed sub-arrays, adjacent bits in a cache block are interleaved across different sub-arrays such that their bitlines
are not multiplexed. In this case, the logical block partition that we defined would be interleaved across the sub-arrays. Thus, an entire cache block can be accessed in parallel. Given this, in-place concurrent operation on all the bits in a cache block is possible even with column multiplexing.

Our design choice of placing the ways of a set within a block partition does not affect the degree of column multiplexing, as we interleave cache blocks of different sets instead.

Way Mapping vs. Parallel Tag-Data Access: We chose to place all the ways of a set within a block partition, so that operand locality does not depend on which way is chosen for a block at runtime. However, this prevents us from supporting parallel tag-data access, where all the cache blocks in a set are pro-actively read in parallel with the tag match. This optimization is typically used for L1, as it can reduce the read latency by overlapping the tag match with the read. But it incurs a high read energy overhead (4.7× higher energy per access for the L1 cache) for a modest performance gain (2.5% for SPLASH-2 [10]). Given the significant benefits of an L1 Compute Cache, we think it is a worthy trade-off to forgo this optimization for L1.

D. Managing Parallelism

Cache controllers are extended to provision CC controllers, which orchestrate the execution of CC instructions. The CC controller breaks a CC instruction into multiple simple vector operations whose operands span at most a single cache block, and issues them to the sub-arrays. Since a typical cache hierarchy can have hundreds of sub-arrays (a 16MB L3 cache has 512 sub-arrays), we can potentially issue hundreds of concurrent operations. This is limited by only two factors: first, the bandwidth of the shared interconnects used to transmit addresses and commands (note that we do not replicate the address bus in our H-tree interconnects); second, the number of sub-arrays activated at the same time, which can be capped to limit the peak power drawn.

The controller at the L1 cache uses an instruction table to keep track of pending CC instructions. The simple vector operations are tracked in an operation table. The instruction table tracks metadata at the instruction level (i.e., the result, the count of simple vector operations completed, and the next simple vector operation to be generated). The operation table, on the other hand, tracks the status of each operand associated with an operation and issues a request to fetch an operand if it is not present in the cache (Section IV-E). When all operands are in the cache, we issue the operation to the cache sub-array. As operations complete, they update the instruction table, and the L1 cache controller notifies the core when an instruction is complete.

To support the search instruction, the CC controller replicates the key in all the block partitions where the source data resides. To avoid doing this again for the same instruction, we track such replications per instruction in a key table.

Finally, if the address range of any operand of a CC instruction spans multiple pages, it raises a pipeline exception. The exception handler splits the instruction into multiple CC operations such that each of its operands is within a page.

Figure 6: Compute Caches in action (example instruction cc_and A, B, C, 8): (1) the core issues cc_and to the L1 controller; (2) L1 forwards the operation to L2; (3) L2 writes back B to L3 and forwards the operation to L3; (4) L3 fetches C from memory; (5) L3 performs the operation; (6) L3 notifies L1; (7) the L1 controller notifies the core of the operation's completion.

E. Fetching In-Place Operands

The Compute Cache (CC) controllers are responsible for deciding the level in the cache hierarchy at which CC operations are performed, and for issuing commands to the cache sub-arrays to execute them. To simplify our design, in our study the CC controller always performs an operation at the highest-level cache where all the operands are present. If any of the operands are not cached, the operation is performed at the lowest-level cache (L3). The cache allocation policy could be improved in the future by enhancing our CC controller with a cache block reuse predictor [11].

Once a cache level is picked, the CC controller fetches any missing operands to that level. The controller also pins the cache lines the operands are fetched into while the CC operation is under way. To avoid eviction of operands while waiting for missing operands, we promote the cache blocks of those operands to the MRU position in the LRU chain. However, on receiving a forwarded coherence request, we release the lock to avoid deadlock and re-fetch the operand. Getting a forwarded request to a locked cache line will be rare for two reasons. First, in DRF [12] compliant programs, only one thread will be able to operate on a cache block while holding its software lock. Second, as the operands of a single CC operation are cache block wide, false sharing will be low. Nevertheless, to avoid starvation in pathological scenarios, if a CC operation fails to get permission after repeated attempts (set to two), the processor core will translate and execute the CC operation as RISC operations.

Figure 6 shows a working example. The core issues the operation
cc_and with address operands A, B, and C to the L1 controller (step 1). Each operand is 64 bytes (8 words), spanning an entire cache block. For clarity, only one cache set in each cache level is shown. None of the operands are present in the L1 cache. Operand B is in the L2 cache and is dirty. The L3 cache has a clean copy of A and a stale copy of B. C is not in any cache.

The L3 cache is chosen for the CC operation, as it is the only cache level at which all the operands can be present (C is not cached anywhere). The L1 and L2 controllers forward the operation to L3 (steps 2 and 3). Before doing so, the L2 cache first writes back B to L3. Note that caches already write back dirty data to the next cache level on eviction, and we use this existing mechanism. On receiving the command, L3 fetches C from memory (step 4). Note that, as an optimization, C need not be fetched from memory, as it will be entirely over-written. Once all the operands are ready, L3 performs the CC operation (step 5) and subsequently notifies the L1 controller of its completion (step 6), which in turn notifies the core (step 7).

F. Cache Coherence

The Compute Cache optimization interacts minimally with the cache coherence protocol and, as a result, does not introduce any new race conditions. As discussed above, although the controller locks cache lines while performing a CC operation, on receipt of a forwarded coherence request the controller releases the lock and responds to the request. Thus, a forwarded coherence request is always responded to in the cases where it would be responded to in the baseline design.

Typically, higher-level caches write back dirty data to the next-level cache on evictions, and coherence protocols already support such writebacks. In the Compute Cache architecture, when a cache level is skipped to perform CC operations, any dirty operands in the skipped level need to be written back to the next level of cache to ensure correctness. To do this, we use the existing writeback mechanism and thus require no change to the underlying coherence protocol.

G. Consistency Model Implications

Current language consistency models (C++ and Java) are variants of the DRF model [12], and therefore a processor only needs to adhere to the RMO memory model. While ISAs providing stronger guarantees (x86) exist, they can be exploited only by writing assembly programs. As a consequence, while we believe stronger memory model guarantees for Compute Caches are an interesting problem (to be explored in future work), we assume the RMO model in our design. In RMO, no memory ordering is needed between data reads and writes, including all CC operations. Individual operations within a vector CC instruction can also be performed in parallel by the CC controller.

Programmers use fence instructions to order memory operations, which is sufficient in the presence of CC instructions. The processor stalls the commit of a fence operation until the preceding pending operations are completed, including CC operations. Similar to conventional vector instructions, it is not possible to specify a fence between scalar operations within a single vector CC instruction.

H. Memory Disambiguation and Store Coalescing

Similar to SIMD instructions, Compute Cache (CC) vector instructions require additional support in the processor core for memory ordering. We classify instructions in the CC ISA into two types. CC-R type instructions (cc_cmp, cc_search) only read from memory. The rest of the instructions are CC-RW type, as they both read from and write to memory. Under the RMO memory model, CC-R can be executed out-of-order, whereas CC-RW behaves like a store. In the following discussion, we refer to CC-R as a load, and CC-RW as a store.

Conventional processor cores use a load-store queue (LSQ) to check for address conflicts between a load and the preceding uncommitted stores. As vector instructions can access more than a word, it is necessary to enhance the LSQ with support for checking address ranges, instead of just one address. For this reason, we use a dedicated vector LSQ, where each entry has additional space to keep track of the address ranges of the operands of a vector instruction. Similar to the LSQ, we also split the store buffer into two: one for scalar stores and another for vector stores. The vector store buffer supports address range checks (at most 12 comparisons per entry). Our scalar store buffer permits coalescing. However, it is not possible to coalesce CC-RW instructions with any store, because their output is not known until they are performed in a cache. As the vector store buffer is non-coalescing, it is possible for the two store buffers to contain stores to the same location. If such a scenario is detected, the conflicting store is stalled until the preceding store completes, which ensures program order between stores to the same location. We augment the store buffer with a field that points to any successor store and a stall bit. The stall bit is reset when the predecessor store completes.

Data values are not forwarded from vector stores to any loads, or from any store to a vector load. Code segments where both vector and scalar operations access the same location within a short time span are likely to be rare. If such a code segment is frequently executed, the compiler can choose not to employ the Compute Cache optimization.

I. Error Detection and Correction

Systems with strong reliability requirements employ Error Correction Codes (ECC) for caches. ECC protection for conventional and near-place operations is unaffected in our design. For cc_copy, simply copying the ECC from source to destination suffices. For cc_buz, the ECC of the zeroed blocks can be updated. For comparison and search, an ECC check can be performed by comparing the ECCs of the source operands: an error is detected if the data bits match but the ECC bits do not, or vice versa.
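The following is a minimal software sketch of the detection rule just described for compare and search; it is not the hardware implementation, and the block layout and 8-bit code field are assumptions made only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustrative model of the ECC check for in-cache compare/search.
 * The 64-byte block and the per-block 8-bit code are assumptions for this
 * sketch; any code stored alongside the data works the same way. */
typedef struct {
    uint8_t data[64];   /* one cache block            */
    uint8_t ecc;        /* code stored with the block */
} cc_block_t;

/* Returns the compare result; *error is set when the data comparison and the
 * ECC comparison disagree (data match but ECCs differ, or vice versa), which
 * is the condition the text uses to signal a detected error. */
static bool cc_compare_checked(const cc_block_t *a, const cc_block_t *b,
                               bool *error)
{
    bool data_match = (memcmp(a->data, b->data, sizeof a->data) == 0);
    bool ecc_match  = (a->ecc == b->ecc);
    *error = (data_match != ecc_match);
    return data_match;
}
```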
For in-place logical operations (cc_and, cc_or, cc_xor, cc_clmul, and cc_not), it is challenging to perform the check and compute the ECC for the result. We propose two alternatives. One alternative is to read out the XOR of the two operands and their ECCs, and check the integrity at the ECC logic unit (ECC(A xor B) = ECC(A) xor ECC(B)). This unit also computes the ECC of the result. Our sub-array design permits computing the XOR operation alongside any logical operation. Although the logical operation is still done in-place, this method incurs extra data transfers to and from the ECC logic unit. Cache scrubbing during cache idle cycles [13] is a more attractive option. Since soft errors in caches are infrequent (0.7 to 7 errors/year [14]), periodic scrubbing can be effective while keeping performance and energy overheads low.

J. Near-Place Compute Caches

In the absence of operand locality, we propose to compute instructions "near" the cache. Our controller is provisioned with additional logic units (not arithmetic units) and registers to temporarily store the operands. The source operands are read from the cache sub-array into the registers at the controller, and then the computed results are written back to the cache. In-place computation has two benefits over near-place computation. First, it provides massive compute capability for almost no additional area overhead. For example, a 16 MB L3 with 512 sub-arrays allows 8KB of data to be computed on in parallel. To support equivalent computational capability, we would need 128 vector ALUs, each 64 bytes wide, which is not a trivial overhead. We assume one vector logic unit per cache controller in our near-cache design. Second, in-place compute avoids data transfer over the H-tree wires. This reduces in-place compute latency (14 cycles) compared to near-cache (22 cycles). Also, 60%-80% of total cache read energy is due to H-tree wire transfer (see Table I), which is eliminated with in-cache computation. Nevertheless, near-cache computing retains the other benefits of Compute Caches, by avoiding transferring data to the higher-level caches and the core.

V. APPLICATIONS

Our Compute Cache design supports simple but common operations, which can be utilized to accelerate a diverse set of data intensive applications.

Search and Compare Operations: Compare and search are common operations in many emerging applications, especially text processing. Intel recently added seven new instructions to the x86 SSE 4.2 vector support that efficiently perform character searches and comparisons [15]. The Compute Cache architecture can significantly improve the efficiency of these instructions. Similar to specialized CAM accelerators [16], our search functionality can be utilized to speed up applications such as search engines, decision tree training, and compression and encoding schemes.

Logical Operations: Compute Cache logical operations can speed up processing of commonly used bit manipulation primitives such as bitmaps. Bitmaps are used in graph and database indexing/query processing. Query processing on databases with bitmap indexing requires logical operations on large bitmaps. Compute Caches can also accelerate binary bit matrix multiplication (BMM), which has uses in numerous applications such as error correcting codes, cryptography, bioinformatics, and the Fast Fourier Transform (FFT). Given its importance, it was implemented as a dedicated instruction in Cray supercomputers [17], and Intel processors provision an x86 carry-less multiply (clmul) instruction to speed it up. The inherent cache locality of matrix multiplication makes BMM suitable for Compute Caches. Further, our large vector operations can allow BMM to scale to large matrices.

Copy Operation: Prior research [7] makes a strong case for optimizing copy performance, which is a common operation in many applications in system software and warehouse scale computing [18]. The operating system spends a considerable chunk of its time (more than 50%) copying bulk data [19]. For instance, copying is necessary for frequently used system calls like fork, for inter-process communication, virtual machine cloning and deduplication, and file system and network management. Our copy operation can accelerate checkpointing, which has a wide range of uses, including fault tolerance and time-travel debugging. Finally, our copy primitive can also be employed in bulk zeroing, which is an important primitive required for memory safety [20].

VI. EVALUATION

In this section we demonstrate the efficacy of Compute Caches (CC) using both a micro-benchmark study and a suite of data-intensive applications.

A. Simulation Methodology

We model a multi-core processor using SniperSim [21], a Pin-based simulator, per Table IV. We use McPAT [22] to model power consumption in both cores and caches.

Configuration   8 core CMP
Processor       2.66 GHz out-of-order core, 48 entry LQ, 32 entry SQ
L1-I Cache      32KB, 4-way, 5 cycle access
L1-D Cache      32KB, 8-way, 5 cycle access
L2 Cache        inclusive, private, 256KB, 8-way, 11 cycle access
L3 Cache        inclusive, shared, 8 NUCA slices, 2MB each, 16-way, 11 cycle + queuing delay
Interconnect    ring, 3 cycle hop latency, 256-bit link width
Coherence       directory based, MESI
Memory          120 cycle latency
Table IV: Simulator Parameters

B. Application Customization and Setup

In this section we describe how we redesigned the applications in our study to utilize CC instructions.

WordCount: WordCount [23] reads a text file (10MB) and builds a dictionary of unique words and their frequency of appearance in the file. While the baseline does
a binary search over the dictionary to check if a new word has been found, we model the dictionary as an alphabet-indexed (first two letters of the word) CAM (1KB each). As the dictionary is large (719KB), we perform the search operations in the L3 cache. The CC search instruction returns a bit vector indicating match/mismatch for multiple words, and hence we also model additional mask instructions that report match/mismatch per word.

StringMatch: StringMatch [23] reads words from a text file (50MB), encrypts them, and compares them to a list of encrypted keys. Encryption cannot be offloaded to the cache; hence, the encrypted words are present in the L1 cache and we perform the CC search there. By replicating an encrypted key across all sub-arrays in L1, a single search instruction can compare it against multiple encrypted words. Similar to WordCount, we also model mask instructions.

DB-BitMap: We also model FastBit [24], a bitmap index library. The input database index is created using data sets obtained from a real physics experiment, STAR [25]. A sample query performs a logical OR or AND of large bitmap bins (several hundred KBs each). We modify the query to use cc_or operations (each processing 2KB of data). We measure the average query processing time for a sample query mix running over uncompressed bitmap indexes.

BMM: Our optimized baseline BMM implementation (Section V) uses blocking and x86 CLMUL instructions. Given the reuse of the matrix, we perform cc_clmul in the L1 cache. We model 256 × 256 bit matrices.

Checkpointing: We model in-memory copy-on-write checkpointing support at page granularity for the SPLASH-2 [10] benchmark suite (checkpointing interval of 100,000 application instructions).

C. Compute Sub-Array: Delay and Area Impact

Compute Caches have negligible impact on baseline read/write accesses, as we still support differential sensing. To get delay and energy estimates, we perform SPICE simulations of a 28nm SOI CMOS process based sub-array, using standard foundry 6T bit-cells.¹ An and/or/xor 64-byte in-place operation takes 3× longer than a single sub-array access, while the rest of the CC operations take 2× longer. In terms of energy, cmp/search/clmul are 1.5×, copy/buz/not are 2×, and the rest are 2.5× a baseline sub-array access. The area overhead is 8% for a sub-array of size 512 × 512.² Note that our estimates account for technology variations and process, voltage and temperature changes. Further, these estimates are conservative compared to measurements on silicon [2], in order to provision a robust margin against read disturbs and to account for circuit parameter variation across technology nodes.

¹ The SRAM arrays we model are 6T cell based. Lower-level caches (L2/L3) are optimized for density and employ 6T-based arrays. However, the L1 cache can employ 8T cell based designs. To support in-place operations in such a design, a differential read-disturb resilient 8T design [26] can be used.
² The optimal sub-array dimensions for the L3 and L2 caches we model are 512 × 512 and 128 × 512 bits respectively.

We use the above parameters in conjunction with the energy per cache access from McPAT to determine the energy of CC operations (Table V). CC operations cost more in lower-level caches, as those employ larger sub-arrays. However, they also deliver higher savings (compared to the baseline reads/writes needed), as they have larger in-cache interconnect components. For search, we assume a write operation for the key; this cost is amortized over large searches.

Cache   write   read   cmp   copy   search   not    logic
L3      2852    2452   840   1340   3692     1340   1672
L2      1154    802    242   608    1396     608    704
L1      375     295    186   324    561      324    387
Table V: Cache energy (pJ) per 64-byte cache block

D. Microbenchmark Study

To demonstrate the efficacy of Compute Caches, we model four microbenchmarks: copy, compare, search and logical-or. We compare Compute Caches to a baseline (Base_32) which supports 32-byte SIMD loads and stores.

Figure 7 (a) depicts the throughput attained for the different operations with an operand size of 4KB. For this experiment, all operands are in the L3 cache and the Compute Cache operation is performed there. Among the operations, for the baseline, search achieves the highest throughput, as it incurs a single cache miss for the key and subsequent cache misses are only for data. Compute Cache accelerates throughput for all operations: 54× over Base_32, averaged across the four kernels. Our throughput improvement has two primary sources: the massive data parallelism exposed by having independent sub-arrays to compute in, and the latency reduction from avoiding data movement to the core. For instance, for the copy operation, data parallelism contributes a 32× and latency reduction a 1.55× throughput improvement.

Figure 7 (b) depicts the dynamic energy consumed for an operand size of 4KB. The dynamic energy is broken down into core, cache data access (cache-access), cache interconnect (cache-ic) and network-on-chip (noc) components. We term everything except the core component data movement energy. Overall, CC provides dynamic energy savings of 90%, 89%, 71% and 92% for the copy, compare, search and logical (OR) kernels relative to Base_32. Large vector CC instructions help bring down the core component of energy. Further, CC successfully eliminates all the components of data movement. Writes incurred due to key replication limit the efficacy of the search CC operation in bringing down the L3 cache energy components. As the data size to be searched increases, the key replication overhead is amortized, increasing the effectiveness of CC.

Figure 7 (c) depicts the total energy consumed, broken down into static and dynamic components. Due to the reduction in execution time, CC can significantly reduce static energy.
Overall, averaged across the four kernels studied, CC provides 91% total energy savings relative to Base_32.

Figure 7: Benefit of CC for 4KB operand. a) Throughput b) Dynamic energy c) Total energy

Figure 8: a) Total energy of in-place vs. near-place for 4KB operand b) Savings in dynamic energy for 4KB operand for different cache levels

Near-place design: In our analysis so far, we have assumed perfect operand locality, i.e., all Compute Cache operations are performed in-place. Figure 8 (a) depicts the total energy for near-place and in-place CC configurations. Recall that in-place computation enables far more parallelism than near-place and offers larger savings in terms of performance and hence total energy. For example, our L3 cache allows 8KB of data to be operated on in parallel. A near-place design would need 128 64-byte wide logical units to provide equivalent data parallelism, which is not a trivial overhead. As such, for 4KB operands, in-cache computing provides 3.6× total energy savings and 16× throughput improvement on average over near-place. Note, however, that near-place can still offer considerable benefits over the baseline architecture.

Computing at different cache levels: We next evaluate the efficacy of Compute Caches when operands are present in different cache levels. Figure 8 (b) depicts the difference in dynamic energy between CC configurations and their corresponding Base_32 configurations. As expected, the absolute savings are higher when operands are in lower-level caches. However, we find that doing Compute Cache operations in the L1 or L2 cache can also provide significant savings. As the number of CC instructions stays the same regardless of cache level, core energy savings are equal for all cache levels. Overall, CC provides savings of 95% and 34% for the L1 and L2 caches respectively, relative to Base_32.

Figure 9: a) Total energy benefit b) Performance improvement of CC for applications

E. Application Benchmarks

In this section we study the benefits of Compute Caches for five applications. Figure 9 (b) shows the overall speedup of Compute Caches for four of these applications. We see a performance improvement of 2× for WordCount, 1.5× for StringMatch, 3.2× for BMM, and 1.6× for DB-BitMap. Figure 9 (a) shows the ratio of total energy of CC to a baseline processor with 32-byte SIMD units. We observe average energy savings of 2.7× across these applications. The majority of the benefits come from three sources: data parallelism exposed by large vector operations, reduction in the number of instructions, and reduced data movement.

For instance, recall that while baseline WordCount does a binary search over a dictionary of unique words, Compute Cache does a CAM search using cc_search instructions. Superficially it may seem that binary search will outperform CAM search. However, we find that the CC version has 87% fewer instructions by doing away with the book-keeping instructions of binary search. Further, our vector cc_search enables energy efficient CAM searches. These benefits are also evident in StringMatch, BMM and DB-BitMap (32%, 98% and 43% instruction reduction respectively). The massive data level parallelism we enable benefits data intensive range and join queries in the DB-BitMap application. Recall that this benchmark performs many independent logical OR
operations over large bitmap bins. Since these operations are independent, many of them can be issued in parallel.

The significant cache locality exhibited by these applications makes them highly suitable for Compute Caches. As cache accesses are cheaper than memory accesses, computation in cache is more profitable for data with high locality or reuse. The dictionary in WordCount has high locality. BMM has inherent locality due to the nature of matrix multiplication. In DB-BitMap, there is significant reuse within a query due to the aggregation of results into a single bitmap bin, and there is potential reuse of bitmaps across queries. In StringMatch, locality comes from the repeated use of encrypted keys.

Figure 10 depicts the overall checkpointing overhead for SPLASH-2 applications as compared to a baseline with no checkpointing. In the absence of SIMD support, this overhead can be as high as 68%, while in its presence the average overhead is 30%. By further reducing instruction count and avoiding data movement, CC brings this overhead down to a mere 6%. CC successfully relegates checkpointing to the cache, avoids data pollution of higher level caches, and relieves the processor of any checkpointing overhead. Figure 11 shows significant energy savings due to Compute Caches. Note that, for checkpointing, all operations are page-aligned and hence we achieve perfect operand locality.

Figure 10: Performance overhead of CC for checkpointing

Figure 11: Total energy with and without checkpointing

VII. RELATED WORK

Past processing-in-memory (PIM) solutions move compute near the memory [6]. This can be accomplished using recent advancements in 3D die-stacking [8]. There have also been a few proposals that add hardware structures near the cache which track information that helps improve the efficiency of copy [5] and atomic operations [27]. The associative processor [28] uses CAMs (area and energy inefficient compared to SRAM caches) as caches and augments them with more logic around the CAM to orchestrate computation. None of these solutions exploit the benefits of the in-place bit-line computing cache noted in Section III. We get a massive number of compute units by re-purposing cache elements that already exist. Also, the in-place Compute Cache reduces the data movement overhead between a cache's sub-arrays and its controller. On the flip side, in-place cache computing imposes restrictions on the types of operations that can be supported and on the placement of operands, which we address in this paper. When an in-place operation is not possible, we use near-place Compute Cache for copy, logical, and search operations, which has also not been studied in the past.

Row-clone [7] enabled data copy from a source DRAM row to a row buffer and then to a destination row, thereby avoiding data movement over the memory channels. A subsequent CAL article [29] suggested that data could be copied to a temporary buffer in DRAM, from where logical operations could be performed. Row-clone's approach is also a form of near-place computing, which requires that all operands be copied to new DRAM rows before they can be operated upon. Bit-line in-place operations may not be feasible in DRAM, as DRAM reads are destructive (one of the reasons why DRAMs need refreshing).

Recent research enhanced non-volatile memory technology to support certain in-memory CAM [16] and bitwise logic operations [30]. The Compute Cache architecture is more efficient when at least one of the operands has cache locality (e.g., the dictionary in word count). Ultimately, the locality characteristics of an application should guide the level of the memory hierarchy at which the computation is performed.

Bit-line computing in SRAMs has been used to implement custom accelerators: approximate dot products in the analog domain for pattern recognition [31], and CAMs [32]. However, it has not been used to architect a compute cache in a conventional cache hierarchy, where we need general solutions to problems such as operand locality, coherence and consistency, which are addressed in this paper. We also demonstrate the utility of our Compute Cache enabled operations in accelerating a fairly diverse range of applications (databases, cryptography, data analytics).

VIII. CONCLUSION

In this paper we propose the Compute Cache (CC) architecture, which unlocks hitherto untapped computational capability present in on-chip caches by exploiting emerging SRAM circuit technology. Using bit-line computing enabled caches, we can perform several simple operations in-place in the cache over very wide operands. This exposes massive data parallelism, saving instruction processing, cache interconnect and intra-cache energy expenditure. We present solutions to several challenges exposed by such an architecture. We demonstrate the efficacy of our architecture using a suite of data intensive benchmarks and micro-benchmarks.
IX. ACKNOWLEDGMENTS

We thank the anonymous reviewers for their comments, which helped improve this paper. This work was supported in part by the NSF under the CAREER-1149773 and SHF-1527301 awards and by C-FAR, one of the six SRC STARnet Centers sponsored by MARCO and DARPA.

REFERENCES

[1] B. Dally, "Power, programmability, and granularity: The challenges of exascale computing," in Parallel Distributed Processing Symposium (IPDPS), 2011 IEEE International, 2011.
[2] S. Jeloka, N. B. Akesh, D. Sylvester, and D. Blaauw, "A 28 nm configurable memory (TCAM/BCAM/SRAM) using push-rule 6T bit cell enabling logic-in-memory," IEEE Journal of Solid-State Circuits, 2016.
[3] M. Kang, E. P. Kim, M.-S. Keel, and N. R. Shanbhag, "Energy-efficient and high throughput sparse distributed memory architecture," in 2015 IEEE International Symposium on Circuits and Systems (ISCAS), 2015.
[4] P. A. La Fratta and P. M. Kogge, "Design enhancements for in-cache computations," in Workshop on Chip Multiprocessor Memory Systems and Interconnects, 2009.
[5] F. Duarte and S. Wong, "Cache-based memory copy hardware accelerator for multicore systems," IEEE Transactions on Computers, vol. 59, no. 11, 2010.
[6] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick, "A case for intelligent RAM," IEEE Micro, 1997.
[7] V. Seshadri, Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry, "RowClone: Fast and energy-efficient in-DRAM bulk data copy and initialization," in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-46.
[8] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, "PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture," in Proceedings of the 42nd Annual International Symposium on Computer Architecture, ser. ISCA '15, 2015.
[9] O. L. Lempel, "2nd generation Intel Core processor family: Intel Core i7, i5 and i3," ser. HotChips '11, 2011.
[10] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The SPLASH-2 programs: Characterization and methodological considerations," in Proceedings of the 22nd Annual International Symposium on Computer Architecture, 1995.
[11] J. Jalminger and P. Stenstrom, "A novel approach to cache block reuse predictions," in Parallel Processing, 2003. Proceedings. 2003 International Conference on, 2003.
[12] S. V. Adve and M. D. Hill, "Weak ordering—a new definition," in Proceedings of the 17th Annual International Symposium on Computer Architecture, ser. ISCA '90.
[13] J. B. Sartor, W. Heirman, S. M. Blackburn, L. Eeckhout, and K. S. McKinley, "Cooperative cache scrubbing," in Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, 2014.
[14] M. Wilkening, V. Sridharan, S. Li, F. Previlon, S. Gurumurthi, and D. R. Kaeli, "Calculating architectural vulnerability factors for spatial multi-bit transient faults," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014.
[15] "XML parsing accelerator with Intel Streaming SIMD Extensions 4 (Intel SSE4)," Intel Developer Zone, 2015.
[16] Q. Guo, X. Guo, Y. Bai, and E. İpek, "A resistive TCAM accelerator for data-intensive computing," in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-44, 2011.
[17] "Cray Assembly Language (CAL) for Cray X1 Systems Reference Manual, version 1.2," Cray Inc., 2003.
[18] S. Kanev, J. P. Darago, K. Hazelwood, P. Ranganathan, T. Moseley, G.-Y. Wei, and D. Brooks, "Profiling a warehouse-scale computer," in Proceedings of the 42nd Annual International Symposium on Computer Architecture, ser. ISCA '15.
[19] M. Calhoun, S. Rixner, and A. Cox, "Optimizing kernel block memory operations," in Workshop on Memory Performance Issues, 2006.
[20] X. Yang, S. M. Blackburn, D. Frampton, J. B. Sartor, and K. S. McKinley, "Why nothing matters: The impact of zeroing," ser. OOPSLA '11.
[21] T. Carlson, W. Heirman, and L. Eeckhout, "Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation," in High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for, 2011.
[22] S. Li, J. H. Ahn, R. Strong, J. Brockman, D. Tullsen, and N. Jouppi, "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," in Microarchitecture, 2009. MICRO-42. 42nd Annual IEEE/ACM International Symposium on, 2009.
[23] R. M. Yoo, A. Romano, and C. Kozyrakis, "Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system," in Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC), ser. IISWC '09, 2009.
[24] "FastBit: An efficient compressed bitmap index technology," https://sdm.lbl.gov/fastbit/, 2015.
[25] "The STAR experiment," http://www.star.bnl.gov/.
[26] J.-J. Wu, Y.-H. Chen, M.-F. Chang, P.-W. Chou, C.-Y. Chen, H.-J. Liao, M.-B. Chen, Y.-H. Chu, W.-C. Wu, and H. Yamauchi, "A large sigma Vth/VDD tolerant zigzag 8T SRAM with area-efficient decoupled differential sensing and fast write-back scheme," IEEE Journal of Solid-State Circuits, 2011.
[27] J. H. Lee, J. Sim, and H. Kim, "BSSync: Processing near memory for machine learning workloads with bounded staleness consistency models," in Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT), ser. PACT '15.
[28] L. Yavits, A. Morad, and R. Ginosar, "Computer architecture with associative processor replacing last-level cache and SIMD accelerator," IEEE Transactions on Computers, 2015.
[29] V. Seshadri, K. Hsieh, A. Boroum, D. Lee, M. Kozuch, O. Mutlu, P. Gibbons, and T. Mowry, "Fast bulk bitwise AND and OR in DRAM," Computer Architecture Letters, 2015.
[30] S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, and Y. Xie, "Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories," in Proceedings of the 53rd Annual Design Automation Conference, ser. DAC '16.
[31] M. Kang, M. S. Keel, N. R. Shanbhag, S. Eilert, and K. Curewitz, "An energy-efficient VLSI architecture for pattern recognition via deep embedding of computation in SRAM," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
[32] K. Pagiamtzis and A. Sheikholeslami, "Content-addressable memory (CAM) circuits and architectures: A tutorial and survey," IEEE Journal of Solid-State Circuits, 2006.