Temporal Streaming of Shared Memory
Thomas F. Wenisch, Stephen Somogyi, Nikolaos Hardavellas, Jangwoo Kim, Anastassia Ailamaki and Babak Falsafi
Computer Architecture Laboratory (CALCM)
Carnegie Mellon University
http://www.ece.cmu.edu/~puma2
Abstract
Coherent read misses in shared-memory multiprocessors account for a substantial fraction of execution time in many important scientific and commercial workloads. We propose Temporal Streaming, a technique to eliminate coherent read misses by streaming data to a processor in advance of the corresponding memory accesses. Temporal streaming dynamically identifies address sequences to be streamed by exploiting two common phenomena in shared-memory access patterns: (1) temporal address correlation: groups of shared addresses tend to be accessed together and in the same order, and (2) temporal stream locality: recently-accessed address streams are likely to recur. We present a practical design for temporal streaming. We evaluate our design using a combination of trace-driven and cycle-accurate full-system simulation of a cache-coherent distributed shared-memory system. We show that temporal streaming can eliminate 98% of coherent read misses in scientific applications, and between 43% and 60% in database and web server workloads. Our design yields speedups of 1.07 to 3.29 in scientific applications, and 1.06 to 1.21 in commercial workloads.
1. Introduction
Technological advancements in semiconductor fabrication along with microarchitectural and circuit innovation have led to phenomenal increases in processor speed over the past decades. During the same period, memory (and interconnect) speed has not kept pace with the rapid acceleration of processors, resulting in an ever-growing processor/memory performance gap. This gap is exacerbated in scalable shared-memory multiprocessors, where a cache-coherent access often requires traversing multiple cache hierarchies and incurs several network round-trip delays. There are a myriad of proposals for reducing or hiding the coherence miss latency. Techniques to relax memory order [1,10] have been shown to hide virtually all of the coherent write miss latency. In contrast, prior proposals to mitigate the impact of coherent read misses have fallen short of effectively hiding the read miss latency. Techniques targeting coherence optimization (e.g., [13,15,18,19,21,22,29]) can only hide part of the read latency. Prefetching [26] or forwarding [17] techniques seek to hide the entire cache (read) miss latency. These techniques have been shown to be effective for workloads with regular (e.g., strided) memory access patterns. Unfortunately, memory access patterns in
many important commercial [3] and scientific [23] workloads are often highly irregular and not amenable to simple predictive and prefetching schemes. As such, coherent read misses remain a key performance-limiting bottleneck in these workloads [2,23].

Recent research [3] advocates fetching data in the form of streams, i.e., sequences of cache blocks that occur together, rather than as individual blocks. Streaming not only enables accurate data fetching by correlating a recurring sequence of addresses, but also significantly enhances fetch lookahead, commensurately with the sequence length. These results indicate that streaming can hide the read miss latency even in workloads with long chains of dependent cache misses (e.g., online transaction processing, OLTP). Unfortunately, the prior proposal [3] for generalized streaming requires a sophisticated hierarchical compression algorithm to analyze whole-program memory address traces, which may only be practical when run offline and is prohibitively complex to implement in hardware.

In this paper, we propose Temporal Streaming, a technique to hide coherent read miss latency in shared-memory multiprocessors. Temporal streaming is based on the observation that recent sequences of shared data accesses often recur in the same precise order. Temporal streaming uses the miss history from recent sharers to extract temporal streams and move data to a subsequent sharer in advance of data requests, at a transfer rate that matches the consumption rate. Unlike prior proposals for streaming [3] that require persistent stream behavior throughout program execution to enable offline analysis, temporal streaming can exploit streams with temporal (but not necessarily persistent) behavior by identifying streams on the fly, directly in hardware.

Through a combination of memory trace analysis and cycle-accurate full-system simulation [12] of a cache-coherent distributed shared-memory (DSM) system running scientific, OLTP (TPC-C on DB2 and Oracle) and web server (SPECweb on Apache and Zeus) workloads, we contribute the following.

Temporal address correlation & stream locality: We investigate the inherent properties of our workload suite, and show that (1) shared addresses are accessed in repetitive sequences, and (2) recently followed sequences are likely to recur system-wide. More than 93% of coherent read misses in scientific applications, and 40% to 65% in commercial workloads, precisely follow a recent sequence.

Temporal streaming engine: We propose a design for temporal streaming with practical hardware mechanisms to record and follow streams. Our design yields speedups of 1.07 to 3.29 in scientific applications, 1.11 to 1.21 in online transaction processing workloads, and 1.06 in web server workloads.
The rest of this paper is organized as follows. We introduce temporal streaming in Section 2, and show how to exploit it to hide coherent read latency. Section 3 presents the Temporal Streaming Engine, our hardware realization of temporal streaming. We describe our evaluation methodology in Section 4, and quantitatively evaluate the temporal streaming phenomena and our hardware design in Section 5. We present related work in Section 6 and conclude in Section 7.
FIGURE 1: Temporal streaming. (Node i records its coherence miss order {A,B,C,D,E}; when Node j misses on B, the directory node locates B at the most recent consumer, Node i, which forwards the stream {C,D,E}; Node j then fetches the data for C, D, and E.)
2. Temporal Streaming
In this paper, we propose Temporal Streaming, a technique to identify and communicate streams of shared data dynamically in DSM multiprocessors. The objective of temporal streaming is to hide communication latency by streaming data to consuming nodes in advance of processor requests for the data. Unlike conventional DSM systems, where shared data are communicated throughout the system individually, temporal streaming exploits the correlation between recurring access sequences to communicate data in streams. While temporal streaming applies to generalized address streams, in this paper we focus on coherent read misses because they present a performance-limiting bottleneck in many workloads and their detrimental effect is aggravated as cache sizes increase [2].

Temporal streaming exploits two properties common in shared-memory access patterns: (1) temporal address correlation, where groups of shared addresses tend to be accessed together and in the same order, and (2) temporal stream locality, where recently-accessed address streams are likely to recur. In this paper, we use the term temporal correlation to encompass both properties.

Temporal address correlation arises primarily from shared data access patterns. When data structures are stable (although their contents may be changing), access patterns repeat, and coherence miss sequences exhibit temporal address correlation. Thus, temporal address correlation can be found in accesses to generalized data structures, such as linked data structures (e.g., lists and trees) and arrays. In contrast, spatial or stride locality, commonly exploited by conventional prefetching techniques, relies on a data structure's layout in memory, which is characteristic only of array-based data structures.

Temporal stream locality arises because recently accessed data structures are likely to be accessed again; therefore, address sequences that were recently followed are likely to recur. In applications with migratory sharing patterns (most commercial and some scientific applications), this type of locality occurs system-wide, as the migratory data are accessed in the same way by all nodes.

Figure 1 illustrates an example of temporal streaming in a DSM. Node i incurs coherent read misses and records the sequence of misses {A,B,C,D,E}, which we refer to as its coherence miss order. We define a stream¹ to be a sub-sequence of addresses in a node's order. Node j later misses on address B, and requests the data from the directory node. The directory node responds to this request through the baseline coherence mechanism, and additionally requests a stream (following B) from the most recent consumer, Node i. We call the initial miss address, B, a stream head. Node i looks up address B in its order and assumes that requests to the subsequent addresses {C,D,E} are likely to follow. Thus, it forwards the stream {C,D,E} to Node j. Upon receipt of the stream, Node j retrieves the data for each block. Subsequent accesses to these addresses hit locally and avoid long-latency coherence misses.

Temporal streaming requires three capabilities: (1) recording the order of a node's coherent read misses, (2) locating a stream in a node's order, and (3) streaming data to the requesting processor at a rate that matches its consumption rate.
1. Throughout this paper, we use "stream" as a noun to refer to a sequence of addresses, and "stream" as a verb to refer to moving a sequence of either addresses or data.
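To make these three capabilities concrete, the following Python sketch models the end-to-end flow of Figure 1. All names are our own illustration (the real mechanisms are hardware, described in Section 3); the directory is reduced to a last-consumer map, and the stream length is fixed at three blocks to mirror the figure.

```python
# Illustrative model of the temporal streaming flow in Figure 1.
# All class/function names are ours, not part of the hardware design.

STREAM_LENGTH = 3   # blocks forwarded per stream request, as in the figure

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.order = []   # coherence miss order: addresses, in miss order

    def record_miss(self, addr):
        # Capability 1: record the order of the node's coherent read misses.
        self.order.append(addr)

    def find_stream(self, head):
        # Capability 2: locate the most recent occurrence of `head` in the
        # order, and return the addresses that followed it.
        for i in range(len(self.order) - 1, -1, -1):
            if self.order[i] == head:
                return self.order[i + 1 : i + 1 + STREAM_LENGTH]
        return []

last_consumer = {}    # directory state: address -> most recent consumer node

def coherent_read_miss(node, addr):
    # The directory services the miss normally, and additionally asks the
    # most recent consumer for the stream following the stream head `addr`.
    node.record_miss(addr)
    stream = []
    if addr in last_consumer and last_consumer[addr] is not node:
        stream = last_consumer[addr].find_stream(addr)
    last_consumer[addr] = node
    return stream     # Capability 3 (rate-matched data fetch) not modeled here

i, j = Node(0), Node(1)
for a in "ABCDE":
    coherent_read_miss(i, a)       # Node i records its order {A,B,C,D,E}
print(coherent_read_miss(j, "B"))  # Node j misses on B -> ['C', 'D', 'E']
```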
FIGURE 3: Recording the order. (On a load miss to X at the recording node, the L2 issues a read to the directory node; the protocol controller appends X to the CMOB.)
Addresses are appended to the order only as loads retire, because the processor does not know if or where each address will be appended until the load instruction retires. The required CMOB capacity depends on the size of the application's active shared-data working set, and may be quite large. Therefore, we place the CMOB in a private region of main memory, which also allows us to tailor its capacity to fit an application's requirements. TSE can tolerate the resulting high access latency of a memory-resident CMOB because write accesses (to append the packetized blocks of addresses to the order) occur in the background, off the processor's critical path, and read accesses (to locate or follow streams) are either amortized (on the initial miss) or overlapped through streaming lookahead. We report CMOB capacity requirements for our application suite in Section 5.4.
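The following sketch models CMOB recording under the assumptions above: a memory-resident circular buffer filled by background, packetized appends. The capacity, packet size, and method names are our own illustration, not taken from the paper.

```python
# Illustrative sketch of CMOB recording (names and sizes are assumptions).

CMOB_CAPACITY = 1 << 21    # entries; sized to the application's working set
PACKET_SIZE = 8            # addresses staged on-chip before a background write

class CMOB:
    def __init__(self, capacity=CMOB_CAPACITY):
        self.buffer = [None] * capacity  # circular buffer in private main memory
        self.tail = 0
        self.pending = []                # on-chip staging; off the critical path

    def append(self, addr):
        """Called as each consumption (coherent read miss) retires."""
        self.pending.append(addr)
        if len(self.pending) == PACKET_SIZE:
            self._writeback()

    def _writeback(self):
        # Background write of one packetized group of addresses to memory.
        for addr in self.pending:
            self.buffer[self.tail % len(self.buffer)] = addr
            self.tail += 1
        self.pending.clear()

    def read_stream(self, offset, n):
        """Read used when locating or following a stream; its latency is
        amortized on the initial miss or overlapped via stream lookahead."""
        return [self.buffer[(offset + k) % len(self.buffer)] for k in range(n)]
```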
FIGURE 4: Locating and forwarding address streams. (On a miss to X, Node i requests X from the directory node, which forwards a stream request to the CMOB of the most recent consumer, Node j; the addresses following X are streamed back to Node i.)

FIGURE 5: Stream engine and streamed value buffer. (The stream engine holds FIFO stream queues of addresses; each SVB entry holds a valid bit, address, data, and stream queue id.)
Forwarding streams of addresses rather than data has several advantages. First, it requires no modification to the baseline cache coherence protocol. Second, streams of addresses do not incur any coherence overhead, whereas erroneously-streamed data blocks incur additional invalidation messages. Finally, sending streams of addresses allows the stream engine to identify temporal streams (i.e., consisting of temporally-correlated addresses) which are likely to result in hits.

The directory management mechanisms in DSM offer a natural solution for CMOB pointer storage and lookup. By extending each directory entry with one or more CMOB pointers, TSE enables random-access lookups within a CMOB. Each CMOB pointer in the directory includes a node ID and an offset within the CMOB where the address is located, with a storage overhead of (number of CMOB pointers) × (log2(nodes) + log2(CMOB size)) bits; for example, one pointer in a 16-node system with 2M-entry CMOBs occupies 4 + 21 = 25 bits. As such, CMOBs can be relatively large structures (e.g., millions of entries) residing in main memory. In contrast, prior proposals for prefetching based on recording address sequences in uniprocessors (e.g., [25]) resort to complex on-chip address hashing schemes and limited address history buffers.

When a consumption matches a FIFO head, the matching entries in the FIFO queues are removed. When the FIFO heads disagree, indicating low temporal correlation, the stream engine stalls further data requests to avoid wasting bandwidth. However, the engine continues to monitor all off-chip memory requests to check for matches against the stalled FIFO heads. Upon a match, the processor is likely repeating the miss sequence recorded in the matching FIFO. Therefore, the stream engine discards the contents of all other (disagreeing) FIFOs and resumes fetching data using only the selected stream. We have investigated complex schemes that examine more than just the FIFO heads, but found that they provide no advantage.

When a stream queue is half empty, the stream engine requests additional addresses from the source CMOB. The ability to follow long streams by periodically requesting additional addresses distinguishes TSE from prefetching approaches that retrieve only a constant number of blocks in response to a miss [25]. Without this ability, the system incurs one miss for each group of fetched blocks, even if the entire miss sequence exhibits temporal address correlation.

Figure 5 (right) depicts the anatomy of the SVB, a small fully-associative buffer for storing streamed data. Each SVB entry includes a valid bit, address, data, and the identity of the queue from which it was streamed. When a processor access hits in the SVB, the entry is moved to the L1 data cache, and the stream engine is notified to retrieve a subsequent cache block from the corresponding stream queue. SVB entries contain only clean data, and are invalidated upon a write to the corresponding block by any (including the local) processor. SVB entries are replaced using an LRU policy.

The SVB serves a number of purposes. First, it serves as custom storage for streamed data, to avoid direct storage in, and inadvertent pollution of, the cache hierarchy when the addresses are not temporally correlated. Second, it allows for direct bookkeeping and management of streamed data, and obviates the need for modifications to the baseline cache hierarchy. Finally, it serves as a window that mitigates small (e.g., a few cache blocks) deviations in the sequence of stream accesses by the processor (e.g., due to control-flow irregularities in programs). By presenting multiple blocks from a stream simultaneously in a fully-associative buffer, the SVB allows the processor to skip or request cache blocks slightly out of stream order.

The SVB size dictates the maximum allowable stream lookahead, i.e., a constant number of blocks outstanding in the SVB for each active stream. Ideally, the stream engine retrieves blocks such that they arrive immediately in advance of consumption by the processor. Therefore, effective streaming requires that
the SVB holds enough blocks (i.e., allows for enough lookahead) to satisfy a burst of coherent read requests by the processor while subsequent blocks are being retrieved. We explore the issues involved in choosing the lookahead throughout Section 5. We show that in practice a small (e.g., tens of entries) SVB allows for enough lookahead to achieve near-optimal coverage while enabling quick lookup.
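For illustration, the following sketch models the stream engine's FIFO-head comparison, stall/resume policy, and SVB bookkeeping described above. The class and constant names are ours; SVB invalidation on writes, LRU replacement, and the half-empty CMOB refill are omitted for brevity. This is a behavioral sketch, not the hardware design.

```python
# Behavioral sketch of the stream engine and SVB (names are assumptions).
from collections import deque

LOOKAHEAD = 8       # per-stream blocks kept outstanding in the SVB
SVB_ENTRIES = 32    # 32 x 64-byte blocks = 2 KB

def read_block(addr):
    return f"<data for {addr}>"   # stand-in for retrieving the cache block

class StreamEngine:
    def __init__(self, num_queues=4):
        self.queues = [deque() for _ in range(num_queues)]  # FIFO stream queues
        self.svb = {}             # addr -> (data, source queue id)
        self.stalled = False

    def install_stream(self, qid, addrs):
        self.queues[qid].extend(addrs)
        self._refill(qid)

    def on_miss(self, addr):
        """Compare an off-chip miss against the FIFO heads."""
        matches = [i for i, q in enumerate(self.queues) if q and q[0] == addr]
        if not matches:
            if not self.stalled and any(self.queues):
                self.stalled = True       # heads disagree: stall, save bandwidth
            return
        qid = matches[0]
        if self.stalled:                  # processor repeats a recorded sequence:
            for i, q in enumerate(self.queues):
                if i != qid:
                    q.clear()             # discard the disagreeing streams
            self.stalled = False
        self.queues[qid].popleft()        # matching head entry is removed
        self._refill(qid)

    def lookup(self, addr):
        """SVB hit: hand the block to L1 and pull the next block of the stream."""
        if addr in self.svb:
            data, qid = self.svb.pop(addr)
            self._refill(qid)
            return data
        return None

    def _refill(self, qid):
        q = self.queues[qid]
        inflight = sum(1 for _, src in self.svb.values() if src == qid)
        while (q and not self.stalled and inflight < LOOKAHEAD
               and len(self.svb) < SVB_ENTRIES):
            a = q.popleft()
            self.svb[a] = (read_block(a), qid)
            inflight += 1
```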
4. Methodology
We quantify temporal address correlation and stream locality, and evaluate our proposed hardware design across a range of scientific and commercial applications. We collect our results using a combination of trace-driven and cycle-accurate full-system simulation of a distributed shared-memory multiprocessor using SIMFLEX [12]. SIMFLEX is a simulation framework that uses modular component-based design and rigorous statistical sampling to enable the development of complex models and ensure representative measurement results with fast simulation turnaround. SIMFLEX builds on Virtutech Simics [20], a full-system simulator that allows functional emulation of unmodified commercial applications and operating systems. SIMFLEX furnishes Simics with cycle-accurate models of an out-of-order processor core, cache hierarchy, microcoded coherence protocol engine, multi-banked distributed memory, and a 2D torus interconnect. We implement a low-occupancy, directory-based, NACK-free cache-coherence protocol. We simulate a 16-processor distributed shared-memory system with 3 GB of memory running Solaris 8. We implement an aggressive version of the total store order memory consistency model [1]. We perform speculative load and store prefetching as described by Gharachorloo et al. [8], and speculatively relax memory ordering constraints at memory barrier and atomic read-modify-write memory operations [10]. We list other relevant parameters of our system model in Table 1.

Table 2 describes the applications and parameters we use in this study. We target our study at commercial workloads, but include a representative group of scientific applications for comparison. We choose scientific applications which are (1) scalable to large data sets, and (2) maintain a high sensitivity to memory system performance when scaled.

Table 1. DSM system parameters.
Processing Nodes: UltraSPARC III ISA; 4 GHz 8-stage pipeline; out-of-order execution; 8-wide dispatch/retirement; 256-entry ROB, LSQ and store buffer
L1 Caches: Split I/D, 64 KB 2-way, 2-cycle load-to-use; 4 ports, 32 MSHRs
L2 Cache: Unified, 8 MB 8-way, 25-cycle hit latency; 1 port, 32 MSHRs
Main Memory: 60 ns access latency; 64 banks per node; 64-byte coherence unit
Protocol Controller: 1 GHz microcoded controller; 64 transaction contexts
Interconnect: 4x4 2D torus; 25 ns latency per hop; 128 GB/s peak bisection bandwidth
We include em3d [6], an electromagnetic force simulation; moldyn [23], a molecular dynamics simulation; and ocean [30], a current simulation. We evaluate two database management systems, IBM DB2 v7.2 EEE and Oracle 10g Enterprise Database Server, running the TPC-C v3.0 online transaction processing workload.¹ We use an optimized TPC-C toolkit provided by IBM for DB2. For Oracle, we developed and optimized our own toolkit. We tuned the number of client processes and other database parameters in our detailed timing model, and chose the client and database configuration that maximized baseline system performance for each database management system. Client processes are configured with no think time, and database data and log files are striped across multiple disks to eliminate I/O bottlenecks.

We evaluate the performance of WWW servers running the SPECweb99 benchmark on Apache HTTP Server v2.0 and Zeus Web Server v4.3. We simulate an 8-processor client system that sustains 16,000 simultaneous web connections to our 16-processor server via a simulated Ethernet network. We run the client processors at a fixed IPC of 8.0 with a 4 GHz clock, and provide sufficient bandwidth on the Ethernet link to ensure that neither client performance nor available network bandwidth limits server performance. We collect memory traces and performance results on the server system only.

Our trace-based analyses use memory access traces collected from SIMFLEX with in-order execution, no memory system stalls, and a fixed IPC of 1.0. We analyze traces of at least ten iterations for the scientific applications. We warm the commercial applications for at least 5,000 transactions (or completed web requests) prior to starting traces, and then trace at least 500 transactions. We use the first iteration of each scientific application, and the first 100 million instructions (per processor) of each commercial application, to warm trace-based simulations prior to measurement. Our timing results for the scientific applications are derived from measurements of a single iteration started with warmed cache, branch predictor, and CMOB state. We use iteration runtime as our measure of performance.

Table 2. Applications and parameters.
Scientific Applications
em3d: 400K nodes, degree 2, span 5, 15% remote
moldyn: 19652 molecules, boxsize 17, 2.56M max interactions
ocean: 514x514 grid, 9600s relaxations, 20K res., err. tol. 1e-07
Commercial Applications
Apache: 16K connections, fastCGI, worker threading model
DB2: 100 warehouses (10 GB), 64 clients, 450 MB buffer pool
Oracle: 100 warehouses (10 GB), 16 clients, 1.4 GB SGA
Zeus: 16K connections, fastCGI
1. Solaris, TPC, Oracle, Zeus, DB2 and other trademarks are the property of their respective owners. None of the results presented in this paper should be construed to indicate the absolute or relative performance of any of the commercial systems used.
For the commercial applications, we use a systematic sampling approach developed in accordance with SMARTS [31]. SMARTS is a rigorous statistical sampling methodology which prescribes a procedure for determining sample sizes, warm-up, and measurement periods based on an analysis of the variance of target metrics (e.g., IPC), to obtain the best statistical confidence in results with minimal simulation. We collect approximately 100 brief measurements of 400,000 cycles each. We launch measurements from checkpoints with warmed caches, branch predictors, and CMOBs, then run for 200,000 cycles to warm queue and interconnect state prior to collecting statistics.

We use the aggregate number of user instructions committed per cycle (i.e., user IPC summed over the 16 processors) as our performance metric. We exclude system commits from this metric because we cannot distinguish system commits that represent forward progress from those that do not (e.g., the idle loop). We have independently corroborated Hankins et al.'s [11] results that the number of user instructions per transaction in the TPC-C workload remains constant over a wide range of database configurations (whereas system commits per transaction do not). Thus, aggregate user IPC is proportional to database throughput.
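Under this methodology, the reported confidence intervals reduce to standard sample statistics over the per-measurement aggregate user IPCs. The sketch below shows the arithmetic on hypothetical sample values; SMARTS [31] itself prescribes the sample size, warm-up, and measurement periods.

```python
# Sample statistics behind the reported confidence intervals (sample values
# here are hypothetical; only the final arithmetic is shown).
import math
import random

random.seed(1)
ipc_samples = [random.gauss(12.0, 1.5) for _ in range(100)]  # ~100 measurements

n = len(ipc_samples)
mean = sum(ipc_samples) / n
var = sum((x - mean) ** 2 for x in ipc_samples) / (n - 1)  # sample variance
ci95 = 1.96 * math.sqrt(var / n)   # normal approximation, reasonable for n ~ 100
print(f"aggregate user IPC = {mean:.2f} +/- {ci95:.2f} (95% confidence)")
```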
5. Results
In this section, we investigate the opportunity for temporal streaming and the effectiveness of the Temporal Streaming Engine. Throughout our results, we report the effectiveness of TSE at eliminating consumptions, which we define as read requests that incur a coherence miss but are not a spin on a contended lock or barrier variable. We exclude coherent read misses that occur during spins because there is no performance advantage to predicting or streaming them.

FIGURE 6: Opportunity to exploit temporal correlation. (Cumulative percentage of consumptions vs. reorder window size, 1 to 16, for Apache, DB2, Oracle, Zeus, em3d, moldyn, and ocean.)

Figure 6 shows the cumulative percentage of consumptions that follow a recently observed sequence, as a function of a reorder window (which corresponds roughly to stream lookahead) of up to 16. All scientific applications in our suite exhibit near-perfect correlation, as they repeat the same data access pattern across all iterations. The commercial applications access data structures that change over time. Nevertheless, more than 40% of all consumptions in commercial applications are perfectly correlated, indicating that a significant portion of data structures and access patterns remain stable. Allowing for reordering of up to eight blocks increases the fraction to 49% to 63% of consumptions. These results indicate that temporal streaming has the potential to eliminate nearly all coherent read misses in scientific applications, and almost half in commercial workloads.
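As one plausible formalization of this measurement (the paper's exact trace tooling is not specified), the sketch below counts a consumption as correlated when it appears within a reorder window after the most recent earlier occurrence of its predecessor in the trace:

```python
# Plausible reconstruction of the Figure 6 analysis; the metric definition
# is our assumption, not the paper's published tool.

def correlated_fraction(trace, window=8):
    """trace: one node's consumption (coherent read miss) addresses, in order."""
    hits = 0
    for k in range(1, len(trace)):
        prev, cur = trace[k - 1], trace[k]
        for j in range(k - 2, -1, -1):          # most recent earlier `prev`
            if trace[j] == prev:
                if cur in trace[j + 1 : j + 1 + window]:
                    hits += 1
                break
    return hits / max(1, len(trace) - 1)

# The repeated tail of the sequence is recognized as temporally correlated:
print(correlated_fraction(list("ABCDE") + list("XBCDE"), window=1))  # 3/9
```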
FIGURE 7: TSE sensitivity to the number of compared streams. (Coverage and discards as a percentage of consumptions, for one to four streams per application; bars exceeding the scale are annotated, up to 239%.)
Effective streaming requires a stream lookahead sufficiently high to enable the SVB to satisfy consumption bursts by the processor. However, a stream lookahead higher than required for effective streaming may erroneously stream too many blocks (i.e., discards) and degrade streaming accuracy. Figure 8 shows the effect of the stream lookahead on discards. For the scientific applications, which all exhibit near-perfect temporal correlation, even a high stream lookahead results in few discards. For the commercial applications, discards grow linearly with lookahead. In contrast, TSE coverage grows only slightly with increasing stream lookahead, as Figure 6 suggests. Thus, the ideal stream lookahead is the minimum sufficient to satisfy consumption bursts by the processor. We describe how to determine the value for the stream lookahead in Section 5.6.

Intuitively, we expect little difference between streams obtained from two previous consumers, or from two moments in time. We tested our intuition experimentally, and found no sensitivity to the number of stream queues. Nevertheless, providing multiple stream queues in a TSE implementation compensates for the delays and event reorderings that occur in a real system. Most importantly, additional stream queues are necessary to avoid stream thrashing [28], where potentially useful streams are overwritten with useless streams from a non-correlated miss.

Our results show that applications typically follow one perfectly correlated stream at a time. Thus, the required SVB capacity in number of blocks is equal to the stream lookahead. For a stream lookahead of eight, the required SVB capacity is 512 bytes. Figure 9 confirms that there is little increase in coverage when moving from a 512-byte to an infinite SVB. The small increase in coverage results from the rare case of blocks that are accessed long after they are retrieved. We choose a 32-entry (2 KB) SVB because it offers near-optimal performance, and a low-latency fully-associative buffer of this size is easy to implement.
FIGURE 8: Effect of stream lookahead on discards. Discards are normalized to true consumptions. (x-axis: stream lookahead, up to 25.)
FIGURE 9: Sensitivity to SVB size. inf indicates infinite storage. (Coverage and discards as a percentage of consumptions for SVB sizes of 512 bytes, 2 KB, 8 KB, and infinite, for each application.)
Figure 11 shows the interconnect bisection bandwidth overhead associated with TSE. Each bar represents the bandwidth consumed by TSE overhead traffic (correctly streamed cache blocks replace the baseline system's processor coherent read misses one-for-one). The annotation above each bar indicates the ratio of overhead traffic to traffic in the base system. The dominant component of TSE's bandwidth overhead arises from streaming addresses between nodes. The bandwidth overhead of TSE is a small fraction of the available bandwidth in current multiprocessor systems. The HP GS1280 multiprocessor system provides 49.6 GB/s interconnect bisection bandwidth in a 16-processor 2D-torus configuration [7]. Thus, the interconnect bandwidth overhead of TSE is less than 7% of available bandwidth in current technology, and less than 3% of the bandwidth available in our DSM timing model.

We compare TSE against a stride prefetcher, which identifies sequences of misses whose addresses are separated by the same stride, and prefetches eight blocks in advance of a processor request. Prefetched blocks are stored in a small cache identical to TSE's SVB. We also compare against the Global History Buffer (GHB) prefetcher proposed by Nesbit and Smith [25]. GHB was recently shown to outperform a wide variety of other prefetching mechanisms on SPEC applications [26]. In GHB, consumption misses are recorded in an on-chip circular buffer similar to the CMOB, and are located using an on-chip fully-associative index table. GHB supports several indexing options. We evaluate global distance correlation (G/DC), as advocated by [26], and global address correlation (G/AC), as the latter is more similar to TSE. We use a 512-entry history buffer and fetch eight blocks per prefetch operation. We compare to TSE with a 1.5 MB CMOB and other parameters as previously described. Because TSE targets only consumptions, we configure the other prediction mechanisms to train and predict only on consumptions.

Figure 12 shows that TSE outperforms the other techniques, eliminating 43% to 100% of consumptions. Because none of the applications exhibit significant strided access patterns, the stride prefetcher rarely prefetches, resulting in both low coverage and low discards. Address-correlating GHB (G/AC) outperforms distance correlation (G/DC) in terms of discards across the commercial applications, but falls short of TSE coverage because its 512-entry consumption history is too small to capture repetitive consumption sequences.
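For concreteness, here is a minimal software model of the address-correlating (G/AC) lookup in the spirit of Nesbit and Smith [25]. The real GHB chains history entries with link pointers reached through the index table; this sketch flattens that into a position map. Names and structure are our own.

```python
# Minimal model of an address-correlating (G/AC) global history buffer.
from collections import deque

HISTORY = 512   # history buffer entries (as configured above)
DEGREE = 8      # blocks fetched per prefetch operation

class GHB:
    def __init__(self):
        self.buffer = deque(maxlen=HISTORY)  # global miss history (circular)
        self.index = {}      # addr -> global position of its last occurrence
        self.count = 0       # total misses appended so far

    def on_consumption(self, addr):
        prefetches = []
        if addr in self.index:
            start = self.count - len(self.buffer)  # global position of buffer[0]
            pos = self.index[addr]
            if pos >= start:                       # last occurrence still in history
                hist = list(self.buffer)
                # Prefetch the addresses that followed `addr` last time.
                prefetches = hist[pos - start + 1 : pos - start + 1 + DEGREE]
        self.buffer.append(addr)
        self.index[addr] = self.count
        self.count += 1
        return prefetches
```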
FIGURE 10: TSE coverage, as a percentage of peak coverage, vs. CMOB capacity (12 to 3M entries, and infinite).

FIGURE 11: Interconnect bisection bandwidth overhead (GB/s). The annotation above each bar indicates the ratio of overhead traffic to traffic in the base system.
FIGURE 12: TSE compared to recent prefetchers. (Coverage and discards as a percentage of consumptions for Stride, G/DC, G/AC, and TSE on each benchmark.) G/DC refers to the distance-correlating Global History Buffer; G/AC refers to the address-correlating Global History Buffer.
The data-dependent nature of the commercial workloads [27] and instruction window constraints may restrict the processor's ability to issue multiple outstanding consumptions. Whereas the processor may quickly stall, TSE can retrieve all blocks within a stream in parallel, thereby eliminating consumptions despite short stream lengths. To verify our hypothesis, we measure the consumption memory-level parallelism (MLP) [4], the average number of coherent read misses outstanding when at least one is outstanding, in our baseline timing model, and report the results in Table 3. Our results show that, in general, the commercial applications issue consumptions serially. The latency to fill the consumption miss that triggers the stream lookup is approximately the same as the latency to retrieve streams and initiate streaming. Thus, streaming can begin at the time the processor requests the first block on the stream without sacrificing timeliness.

We determine the appropriate stream lookaheads for em3d and moldyn by first calculating the rate at which consumption misses would be issued in our base system if all coherent read latency were removed. We then divide the stream retrieval round-trip latency (i.e., the 3-hop coherence miss latency) by the no-wait consumption rate. For ocean, this simple approach fails because all coherence activity occurs in bursts, as evidenced by its high consumption MLP in the baseline system. To improve cache locality, ocean blocks its computation, which, as a side effect, groups consumptions into bursts. We set the stream lookahead for ocean to a maximal reasonable value of 24, based on the number of available L2 MSHRs in our system model. There is relatively little sensitivity to stream lookahead in the commercial applications because of their low consumption MLP. We found that a lookahead of eight works well across these applications.

Table 3 shows the effect of streaming timeliness on TSE coverage, using both trace analysis and cycle-accurate simulation. Trace Cov. indicates consumptions eliminated by TSE as reported by our trace analysis. Full Cov. indicates consumptions eliminated completely by TSE in the cycle-accurate simulation. Partial Cov. indicates consumptions whose latency was partially covered by TSE: the processor issued a request while a streamed value was still in flight.
TSE on the cycle-accurate simulator attains lower coverage relative to the trace analysis because streams may arrive late, after the processor has issued requests for the addresses in the stream. With the exception of ocean, most of the trace-measured coverage is timely (the consumptions are fully covered) in the cycle-accurate simulation of TSE, while the remaining consumptions are partially covered. We measured that partially covered consumptions hide, on average, 40% of the consumption latency in the commercial workloads, and 60% to 75% in the scientific applications. In the case of ocean, partial coverage is particularly high. Even a stream lookahead of 24 blocks is insufficient to fully hide all coherent read misses, as the communication bursts in ocean are bandwidth-bound.
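Reading the no-wait consumption "rate" as an inter-consumption time in cycles, the lookahead calculation for em3d and moldyn described above amounts to the following (notation ours):

\[
\text{lookahead} \;=\; \left\lceil \frac{L_{3\text{-hop}}}{T_{\text{no-wait}}} \right\rceil
\]

where \(L_{3\text{-hop}}\) is the stream retrieval round-trip latency (the 3-hop coherence miss latency, in cycles) and \(T_{\text{no-wait}}\) is the average number of cycles between consumption misses in the base system with all coherent read latency removed.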
5.7 Performance

We measure the performance impact of TSE using our cycle-accurate full-system timing model of a DSM multiprocessor. Figure 14 (left) illustrates the opportunity and effectiveness of TSE at eliminating stalls caused by coherent read misses. The base and TSE time breakdowns are normalized to represent the same amount of completed work. Figure 14 (right) reports the speedup achieved by TSE, with 95% confidence intervals for the sample-derived commercial application results.

TSE eliminates nearly all coherent read stalls in em3d and moldyn. TSE provides a drastic speedup of nearly 3.3 in communication-bound em3d. Despite high coverage, TSE eliminates only ~40% of coherent read stalls in ocean, as the majority of coherent read misses are only partially hidden. Although partially covered consumptions in ocean hide on average 60% of the consumption latency, much of the miss latency is overlapped in the baseline case as well because of the high MLP.

The commercial applications spend between 30% and 35% of overall execution time on coherent read stalls. TSE's performance impact is particularly large in DB2 because coherent read stalls are more prevalent in user (as opposed to OS) code than in the other commercial applications. User coherent read stalls have a disproportionately large impact on database throughput because misses in database code form long dependence chains [27], and are thus on the critical execution path. DB2 spends 43% of user execution time on coherent read stalls. TSE is particularly effective on these misses, eliminating 53% of user coherent read stalls. As cache sizes continue to increase in future processors, coherence misses will become a larger fraction of long-latency off-chip accesses [2], and the performance impact of TSE and similar techniques will grow.

FIGURE 14: Performance improvement from TSE. The left figure shows an execution time breakdown. The right figure shows the speedup of TSE over the base system, with 95% confidence intervals for commercial application speedups.
6. Related Work
Prior correlation-based prefetching approaches (e.g., Markov predictors [14] and the Global History Buffer [25]) considered only locality and address correlation local to one node. In contrast, temporal streaming finds candidate streams by locating the most recent occurrence of a stream head across all nodes in the system.

Thread-based prefetching techniques [5] use idle contexts on a multithreaded processor to run helper threads that overlap misses with speculative execution. However, the spare resources the helper threads require (e.g., idle thread contexts, fetch and execution bandwidth) may not be available when the processor executes an application exhibiting high thread-level parallelism (e.g., OLTP). TSE, on the contrary, does not occupy processor resources.

Huh et al. [13] split a traditional cache coherence protocol into a fast protocol that addresses performance, and a backing protocol that ensures correctness. Unlike their scheme, which relies on detecting a tag match to an invalidated cache line, TSE directly identifies coherent read misses using directory information, thus ensuring independence from the employed cache size. Moreover, coherent reads in [13] remain speculative for the entire length of a long-latency coherence miss and therefore stress the ROB, while our scheme allows coherent read references that hit in the SVB to retire immediately.

Keleher [16] describes the design and use of Tapeworm, a mechanism implemented as a software library that records updates to shared data within a critical section, and pushes those updates to the next acquirer of the lock. While Tapeworm can be efficiently implemented in software distributed shared-memory systems, a hardware-only realization requires either the introduction of a race-prone speculative data-push operation in the coherence protocol, or a split performance/correctness protocol as in [13]. Instead, our technique relies on streaming to communicate shared data to consumers, without changes to the coherence protocol or application modifications.

Recent research has also aimed at making processors more tolerant of long-latency misses. Mutlu et al. [24] allow MLP to break past ROB limits by speculatively ignoring dependencies and continuing execution of the thread upon a miss, to issue prefetches. However, their method is constrained by branch prediction accuracy and hides only part of the latency, as the runahead thread may not be able to execute far enough in advance during the time it takes to satisfy a miss. Techniques seeking to exceed the dataflow limit through value prediction, or to increase MLP at the processor (e.g., SMT) or chip level (e.g., CMP), are complementary to our work.
7. Conclusion
In this paper, we presented temporal streaming, a novel approach to eliminate coherent read misses in distributed shared-memory systems. Temporal streaming exploits two phenomena common in the shared memory access patterns of scientific and commercial multiprocessor workloads: temporal address correlation, that sequences of shared addresses are repetitively accessed together and in the same order; and temporal stream locality, that recently-accessed streams are likely to recur. We showed that temporal streaming has the potential to eliminate 98% of coherent read misses in scientific applications, and 43% to 60% in OLTP and web server applications. Through cycle-accurate full-system simulation of a cache-coherent distributed shared-memory multiprocessor, we demonstrated that our hardware realization of temporal streaming yields speedups of 1.07 to 3.29 in scientific applications, and 1.06 to 1.21 in commercial workloads, while incurring overhead of less than 7% of available bandwidth in current technology.
Acknowledgements
The authors would like to thank Sumanta Chatterjee and Karl Haas for their assistance with Oracle, and the members of the Carnegie Mellon Impetus group and the anonymous reviewers for their feedback on earlier drafts of this paper. This work was partially supported by grants and equipment from IBM and Intel corporations, the DARPA PAC/C contract F336150214004-AF, an NSF CAREER award, an IBM faculty partnership award, a Sloan research fellowship, and NSF grants CCR-0113660, IIS-0133686, and CCR-0205544.
References

[1] S. V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial. IEEE Computer, 29(12):66–76, Dec. 1996.
[2] L. A. Barroso, K. Gharachorloo, and E. Bugnion. Memory system characterization of commercial workloads. In Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 3–14, June 1998.
[3] T. M. Chilimbi and M. Hirzel. Dynamic hot data stream prefetching for general-purpose programs. In Proceedings of the SIGPLAN '02 Conference on Programming Language Design and Implementation (PLDI), June 2002.
[4] Y. Chou, B. Fahs, and S. Abraham. Microarchitecture optimizations for exploiting memory-level parallelism. In Proceedings of the 31st Annual International Symposium on Computer Architecture, June 2004.
[5] J. D. Collins, D. M. Tullsen, H. Wang, and J. P. Shen. Dynamic speculative precomputation. In Proceedings of the 34th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 34), December 2001.
[6] D. E. Culler, A. Dusseau, S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick. Parallel programming in Split-C. In Proceedings of Supercomputing '93, pages 262–273, Nov. 1993.
[7] Z. Cvetanovic. Performance analysis of the Alpha 21364-based HP GS1280 multiprocessor. In Proceedings of the 30th Annual International Symposium on Computer Architecture, pages 218–229, June 2003.
[8] K. Gharachorloo, A. Gupta, and J. Hennessy. Two techniques to enhance the performance of memory consistency models. In Proceedings of the 1991 International Conference on Parallel Processing (Vol. I: Architecture), pages I:355–364, Aug. 1991.
[9] C. Gniady and B. Falsafi. Speculative sequential consistency with little custom storage. In Proceedings of the 10th International Conference on Parallel Architectures and Compilation Techniques, Sept. 2002.
[10] C. Gniady, B. Falsafi, and T. N. Vijaykumar. Is SC + ILP = RC? In Proceedings of the 26th Annual International Symposium on Computer Architecture, pages 162–171, May 1999.
[11] R. Hankins, T. Diep, M. Annavaram, B. Hirano, H. Eri, H. Nueckel, and J. P. Shen. Scaling and characterizing database workloads: Bridging the gap between research and practice. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 36), Dec. 2003.
[12] N. Hardavellas, S. Somogyi, T. F. Wenisch, R. E. Wunderlich, S. Chen, J. Kim, B. Falsafi, J. C. Hoe, and A. G. Nowatzyk. SimFlex: A fast, accurate, flexible full-system simulation framework for performance evaluation of server architecture. SIGMETRICS Performance Evaluation Review, 31(4):31–35, April 2004.
[13] J. Huh, J. Chang, D. Burger, and G. S. Sohi. Coherence decoupling: Making use of incoherence. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XI), October 2004.
[14] D. Joseph and D. Grunwald. Prefetching using Markov predictors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 252–263, June 1997.
[15] S. Kaxiras and C. Young. Coherence communication prediction in shared memory multiprocessors. In Proceedings of the 6th IEEE Symposium on High-Performance Computer Architecture, January 2000.
[16] P. Keleher. Tapeworm: High-level abstractions of shared accesses. In Proceedings of the 3rd Symposium on Operating Systems Design and Implementation (OSDI), February 1999.
[17] D. A. Koufaty, X. Chen, D. K. Poulsen, and J. Torrellas. Data forwarding in scalable shared-memory multiprocessors. In Proceedings of the 1995 International Conference on Supercomputing, July 1995.
[18] A.-C. Lai and B. Falsafi. Memory sharing predictor: The key to a speculative coherent DSM. In Proceedings of the 26th Annual International Symposium on Computer Architecture, May 1999.
[19] A.-C. Lai and B. Falsafi. Selective, accurate, and timely self-invalidation using last-touch prediction. In Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000.
[20] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. IEEE Computer, 35(2):50–58, February 2002.
[21] M. K. Martin, M. D. Hill, and D. A. Wood. Token coherence: Decoupling performance and correctness. In Proceedings of the 30th Annual International Symposium on Computer Architecture, June 2003.
[22] S. S. Mukherjee and M. D. Hill. Using prediction to accelerate coherence protocols. In Proceedings of the 25th Annual International Symposium on Computer Architecture, June 1998.
[23] S. S. Mukherjee, S. D. Sharma, M. D. Hill, J. R. Larus, A. Rogers, and J. Saltz. Efficient support for irregular applications on distributed-memory machines. In 5th ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPOPP), pages 68–79, July 1995.
[24] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt. Runahead execution: An effective alternative to large instruction windows. IEEE Micro, 23(6):20–25, November/December 2003.
[25] K. J. Nesbit and J. E. Smith. Data cache prefetching using a global history buffer. In Proceedings of the 10th IEEE Symposium on High-Performance Computer Architecture, Feb. 2004.
[26] D. G. Perez, G. Mouchard, and O. Temam. MicroLib: A case for the quantitative comparison of micro-architecture mechanisms. In Proceedings of the 3rd Annual Workshop on Duplicating, Deconstructing, and Debunking (WDDD-04), June 2004.
[27] P. Ranganathan, K. Gharachorloo, S. V. Adve, and L. A. Barroso. Performance of database workloads on shared-memory systems with out-of-order processors. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VIII), pages 307–318, Oct. 1998.
[28] T. Sherwood, S. Sair, and B. Calder. Predictor-directed stream buffers. In Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 33), pages 42–53, December 2000.
[29] S. Somogyi, T. F. Wenisch, N. Hardavellas, J. Kim, A. Ailamaki, and B. Falsafi. Memory coherence activity prediction in commercial workloads. In 3rd Workshop on Memory Performance Issues, June 2004.
[30] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, July 1995.
[31] R. E. Wunderlich, T. F. Wenisch, B. Falsafi, and J. C. Hoe. SMARTS: Accelerating microarchitecture simulation via rigorous statistical sampling. In Proceedings of the 30th Annual International Symposium on Computer Architecture, June 2003.