Ricardo Bianchini, Raquel Pinto, and Claudio L. Amorim
COPPE Systems Engineering
Federal University of Rio de Janeiro
Rio de Janeiro, Brazil 21945-970
FAX/Phone: +55 21 590-2552
Technical Report ES-401/96, July 1996, COPPE/UFRJ
This research was supported by Brazilian FINEP and CNPq.
Abstract
Prefetching strategies can conceivably be used to reduce the high remote data access latencies of software-only distributed shared-memory systems (DSMs). However, in order to design effective prefetching techniques, one must understand the page fault behavior of parallel applications running on top of these systems. In this paper we study this behavior according to its spatial, temporal, and sharing-related characteristics. Among other important observations, our study shows that: a) the amount of useful computation between the earliest time a prefetch can be issued and the actual use of the page is enough to hide most or all of the latency of fetching remote data, which means that prefetching techniques have the potential to be effective; b) page fault patterns change significantly throughout execution for several applications, which means that prefetching techniques based on dynamic, recent-history information may not be effective; c) page faults are frequently spatially clustered but not temporally clustered, which means that sequential prefetching as in hardware DSMs may also be profitable for software DSMs; and d) the set of page invalidations received at synchronization points is often a poor indication of near-future page accesses, which means that the invalidations should not be used to guide prefetching. Based on this study we propose and evaluate five prefetching techniques for the TreadMarks system running on our simulated network of workstations. Our results show that the prefetching techniques we study can deliver performance improvements of up to 30%, but no one technique is consistently effective for all applications.
1 Introduction
Software-only distributed shared-memory systems (DSMs) combine the ease of shared-memory programming with the low cost of message-passing architectures. However, these systems often exhibit high remote data access latencies when running real parallel applications. Prefetching strategies can conceivably be used to reduce these latencies.
In order to design effective prefetching techniques, one must understand the page fault behavior of parallel applications running on top of software DSMs. Such an understanding should provide insight into whether prefetching in general and specific prefetching strategies in particular can be effective. Thus, this paper presents a study of the page fault behavior of applications running on software DSMs and uses its results to guide the design of several prefetching techniques. We concentrate on the spatial, temporal, and sharing-related characteristics of the sequence of page faults that require remote data fetches. Basically, we are interested in answering several important questions about the page fault behavior and its relationship to prefetching: Is there a significant amount of useful computation that can be used to tolerate the latency of fetching remote data? Do we need sophisticated compilers to insert prefetch calls to the runtime system in applications, or do dynamic, runtime-only techniques suffice? Are page faults spatially and/or temporally clustered? Can prefetching techniques use page invalidations at lock and barrier operations to guide prefetching? This paper answers these questions for the first time, as far as we are aware.

Our simulation results of parallel applications running on top of TreadMarks [8] show that the average amount of useful computation between the earliest time a prefetch can be issued and the actual use of the page is enough to hide most or all of the latency of the remote data operation for all our applications. Our results also demonstrate that page fault patterns change significantly throughout execution for several applications, which means that prefetching techniques based on dynamic, recent-history information may not be effective; adaptive or compiler-based techniques might be necessary. We find that page faults are frequently spatially clustered but not temporally clustered, which means that sequential prefetching as in hardware DSMs may also be profitable for software DSMs. In addition, our results show that the set of page invalidations received at synchronization points is often a poor indication of near-future page accesses, which means that the invalidations should not be used to guide prefetching.

Based on these results we propose and evaluate five different prefetching techniques. Among other characteristics, the techniques vary in terms of the aggressiveness with which to prefetch, the use of invalidation notices to guide prefetching, and the use of compiler-inserted prefetching calls in applications. We evaluate the techniques when implemented in TreadMarks. Our results show that the prefetching techniques can deliver performance improvements of up to 30%, but no one technique performs consistently well for all applications.

The remainder of this paper is organized as follows. The next section motivates the paper by describing the main characteristics of TreadMarks and showing that remote data fetch overheads significantly degrade the performance of applications running on top of it. Section 3 describes our simulation methodology and workload. In section 4 we discuss our page fault behavior results.
Section 5 describes the prefetching techniques we propose based on the observed page fault behavior. The section also presents results on the performance of these techniques. In section 6 we describe related work. Finally, section 7 draws our conclusions.
The details of the simulation and application characteristics that led to these figures will be presented in section 3.
[Figure 1: Speedup versus number of processors (2 to 16). Figure 2: Breakdown of execution time (%) for the applications on TreadMarks.]
…significant of these overheads, cache miss latency. The busy time represents the amount of useful work performed by the computation processor. Data fetch latency is a combination of coherence processing time and network latencies involved in fetching pages and diffs as a result of page access violations. Synchronization time represents the delays involved in waiting at barriers and lock acquires/releases, including the overhead of interval and write notice processing. IPC overhead accounts for the time the computation processor spends servicing requests coming from remote processors. Figure 2 shows that TreadMarks suffers severe remote data fetch and synchronization overheads. IPC overheads are not as significant, since they are often hidden by data fetch and synchronization latencies. However, IPC overheads gain importance when prefetching is used. This study seeks a complete understanding of each processor's page fault behavior, which can lead to the design of prefetching techniques for significantly reducing the overhead of remote data fetches.
System Constant Name                  Default Value
Number of processors                  16
TLB size                              128 entries
TLB fill service time                 100 cycles
All interrupts                        400 cycles
Page size                             4K bytes
Total cache per processor             128K bytes
Write buffer size                     4 entries
Cache line size                       32 bytes
Memory setup time                     10 cycles
Memory access time (after setup)      3 cycles/word
PCI setup time                        10 cycles
PCI burst access time (after setup)   3 cycles/word
Network path width                    8 bits (bidirectional)
Messaging overhead                    200 cycles
Switch latency                        4 cycles
Wire latency                          2 cycles
List processing                       6 cycles/element
Page twinning                         5 cycles/word + memory accesses
Diff application and creation         7 cycles/word + memory accesses

Table 1: Default Values for System Parameters. 1 cycle = 10 ns.

…of the memory system and control flow within a processor can change as a result of the timing of memory references. We simulate a network of workstations with 16 nodes in detail. Each node consists of a computation processor, a write buffer, a first-level direct-mapped data cache (all instructions are assumed to take 1 cycle), local memory, and a mesh network router (using wormhole routing). Table 1 summarizes the default parameters used in our simulations. All times are given in 10-ns processor cycles.
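As a rough illustration of the latencies prefetching must hide, the back-of-envelope sketch below estimates the cost of fetching a whole 4KB page from a remote node using the Table 1 parameters. The protocol model (a single request/reply pair, an assumed 4-hop distance on the mesh, one byte per cycle on the 8-bit path) is our simplifying assumption for this sketch, not the simulator's actual protocol.

```c
#include <stdio.h>

/* Illustrative estimate of a remote 4KB page fetch using Table 1 values.
 * The cost model below (hop count, one request + one reply, ~1 byte per
 * cycle on the 8-bit path) is an assumption, not the simulator's model. */
int main(void) {
    const int msg_overhead = 200;             /* per message (cycles) */
    const int switch_lat = 4, wire_lat = 2;   /* per hop (cycles) */
    const int hops = 4;                       /* assumed average distance */
    const int page_bytes = 4096;
    const int mem_setup = 10, mem_per_word = 3, word_bytes = 4;

    int request   = msg_overhead + hops * (switch_lat + wire_lat);
    int read_page = mem_setup + mem_per_word * (page_bytes / word_bytes);
    int reply     = msg_overhead + hops * (switch_lat + wire_lat)
                    + page_bytes;             /* 8-bit path: ~1 byte/cycle */
    int total     = request + read_page + reply;

    printf("~%d cycles, i.e. ~%d us at 10 ns/cycle\n", total, total * 10 / 1000);
    return 0;
}
```

Even under these optimistic assumptions the fetch costs on the order of 75 microseconds, which is the window that prefetching has to overlap with useful computation.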
3.2 Workload
We report results for four representative parallel programs: Em3d, FFT, Radix, and Ocean. Em3d [4] is from UC Berkeley. FFT is from Rice University and comes with the TreadMarks distribution. Ocean and Radix are from the Splash-2 suite [11]. These applications were run on the default problem sizes for 32 processors, as suggested by the Stanford researchers. Table 2 lists the applications and their input sizes.

Em3d simulates electromagnetic wave propagation through 3D objects. We simulate 40064 electric and magnetic objects connected randomly, with a 10% probability that neighboring objects reside in different nodes. The interactions between objects are simulated for 6 iterations. FFT performs a complex 1-D FFT that is optimized to reduce interprocessor communication. The data set consists of 65,536 data points to be transformed, and another group of 65,536 points called roots of unity. Each of these groups of points is organized as a 256 × 256 matrix. Ocean studies large-scale ocean movements based on eddy and boundary currents. We simulate a 258 × 258 ocean grid. Radix is an integer radix sort kernel. The algorithm is iterative, performing one iteration per digit of the 1M keys.

Application   Input Size
Em3d          40064 nodes, 10% remote
FFT           65,536 data points
Ocean         258 × 258 ocean
Radix         1M integers, radix 1024

Table 2: Applications and Input Sizes.
4.1 Overview
Given that realistic parallel applications generate an enormous amount of page fault information, in this section we concentrate on representative snapshots of the execution of our applications. Figure 3 presents three consecutive snapshots of the sequence of page faults taken by each of the applications in our suite. Each of the snapshots corresponds to a phase of execution (the time period in between two consecutive barrier events) on one of the processors (processor 7) in the system. The lower and upper limits on the Y-axis of the graphs for each application represent the first and last pages of its shared data area, respectively. Lock and unlock operations are represented by full and dashed vertical lines, respectively. The graphs in the figure show several important characteristics of the page fault behavior of our applications:
Figure 3: Overview of Page Fault Behavior for Em3d, FFT, Ocean, and Radix (from top to bottom). (Each panel plots the shared page numbers faulted on by processor 7 against time.)
- The number of page faults within a phase of execution may be significant, as in FFT and Radix, and may vary widely across phases, as in Radix.
- Critical sections delimited by lock and unlock operations have very different page fault behavior than the rest of the phase. This can be clearly observed in Radix, but also happens for Ocean; the other two applications do not use lock synchronization.
- Page faults are fairly clustered inside small chunks of the shared address space in most cases.
- Page faults are frequently spread more or less evenly throughout each snapshot, i.e. faults are not usually clustered in time.

Although interesting, these observations are too superficial to be useful when designing and evaluating prefetching strategies. For a more precise analysis of these and other characteristics of the stream of page faults experienced by a processor, we must "zoom in" on the spatial, temporal, and sharing-related characteristics of our applications. In this detailed analysis we study the page faults occurring inside and outside of critical sections separately, given the significant difference in their associated page fault behaviors.
Figure 4: Spatial Distribution of Faults for Em3d, FFT, Ocean, and Radix (from top to bottom). (The right-hand graphs plot the average stride between faulted pages in each phase.)
…page number and the next in the sorted list. These results are shown in the graphs on the right side of figure 4. The discontinuities in the FFT and Radix graphs reflect the fact that certain phases in these applications have fewer than two page faults, making it impossible to compute strides. These graphs show that FFT exhibits relatively large strides in its first phase, but faults on a sequential group of pages during the third, fourth, and fifth phases of its execution. Em3d and Radix also exhibit very tight clustering of page faults in some of their phases, but not all of them. Ocean's average fault stride of around 8 persists for most of its phases, but in the remaining phases strides can be as large as 500.

Page faults that occur inside of critical sections exhibit much simpler behavior. Critical sections tend to be relatively short in the applications we study. In the vast majority of cases, either one or two pages experience faults within the sections. Different executions of a critical section exhibit a clear pattern, as they tend always to access the same pages.
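As an illustration of how the per-phase average stride in the right-hand graphs of figure 4 can be computed, the sketch below sorts the page numbers faulted on during a phase and averages the distances between neighbors. This is our reconstruction of the metric described above, not code from the simulator.

```c
#include <stdlib.h>

/* Average stride between consecutive faulted pages within one phase:
 * sort the faulted page numbers and average the distance between each
 * page number and the next in the sorted list. */
static int cmp_ulong(const void *a, const void *b) {
    unsigned long x = *(const unsigned long *)a, y = *(const unsigned long *)b;
    return (x > y) - (x < y);
}

double average_stride(unsigned long *pages, int n) {
    if (n < 2)
        return -1.0;                 /* stride undefined (gaps in figure 4) */
    qsort(pages, n, sizeof pages[0], cmp_ulong);
    unsigned long sum = 0;
    for (int i = 1; i < n; i++)
        sum += pages[i] - pages[i - 1];
    return (double)sum / (n - 1);
}
```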
Figure 5: Temporal Distribution of Faults for Em3d, FFT, Ocean, and Radix (from top to bottom).
4.5 Discussion
The results we presented in the previous sections describe the page fault behavior of applications running on top of multiple-writer, lazy release-consistent software DSMs. Our understanding of this fault behavior allows us to guide the design of our prefetching techniques based on several observations:

1. The significant number of page faults occurring in certain phases of execution suggests that prefetching for all of the corresponding pages at once might cause excessive resource contention.

2. The widely different fault behavior across phases of the execution of Radix suggests that fixed, runtime-only prefetching techniques may fail to improve performance for some applications.
Figure 6: Relationship Between Faults Outside of Critical Sections and Sharing for Em3d, FFT, Radix, and Ocean (clockwise from top left corner).

Figure 7: Relationship Between Faults Inside of Critical Sections and Sharing for Ocean (left) and Radix (right).
Some form of adaptive, runtime technique might be a reasonable option, but may require a relatively large number of page faults before adaptation is complete. Compiler analysis should help when fault patterns change or when a large percentage of the misses is not due to sharing, but static techniques are not always applicable or even possible.

3. The differences between the fault behavior of pages faulted on outside and inside of critical sections suggest that different prefetching policies for these two types of faults might be appropriate. The fact that very few pages are accessed inside most critical sections and that these accesses occur almost immediately after the lock acquire is completed (when it is safe to prefetch) indicates that prefetching for these pages is not profitable. However, this does not necessarily mean that we should not prefetch at lock acquire points, since the diffs prefetched then might be used later, outside of the critical section.

4. The fact that there is always a significant amount of time between a barrier event and each page fault indicates that the latency of the required prefetches can be completely hidden, if prefetches are started at the barrier.

5. Prefetching on a page fault is also a viable option, provided that the processor can guess the very next page that will be required with reasonable accuracy. Given the spatial clustering of faults in FFT, Em3d, and Radix, sequential prefetching might provide appropriate guesses.

6. Prefetching based on the write notices received at synchronization points might not be a good strategy in some cases, since they often do not provide a good indication of future page accesses (a sketch of such a write-notice-driven scheme follows this list).
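To make the following discussion concrete, here is a minimal sketch of the kind of write-notice-driven prefetching that the Naive technique discussed below relies on: at every synchronization operation, a diff prefetch is issued for each page invalidated by the incoming write notices. All helper names and data structures are hypothetical, not TreadMarks interfaces.

```c
/* Minimal sketch of write-notice-driven ("Naive"-style) prefetching:
 * at each lock acquire or barrier, prefetch the diffs of every page the
 * incoming write notices invalidate. Names below are hypothetical. */
typedef struct {
    unsigned long *pages;   /* pages invalidated at this synchronization op */
    int            count;
} write_notice_set;

extern void issue_diff_prefetch(unsigned long page);  /* assumed runtime hook */

void naive_prefetch_at_synch(const write_notice_set *wn) {
    for (int i = 0; i < wn->count; i++)
        issue_diff_prefetch(wn->pages[i]);  /* all prefetches clustered in time */
}
```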
Although simple and intuitive, the Naive strategy can cause four types of performance problems. As suggested by observation 6, one serious problem is that this technique can generate an enormous number of useless prefetches (prefetches for pages that are invalidated before being used) when the write notices received at synchronization operations are not a good representation of near-future page accesses. A second problem with the Naive technique is that prefetches are all clustered in time and, as mentioned in observation 1, may cause several processors to compete for service at remote nodes. More specifically, this situation can hurt the performance of the remote processor as its associated network interface (during diff sends and receives) and the processor itself (during diff generation) contend for access to memory.² The third type of performance problem that can result from the Naive technique is that prefetches are issued even at the start of short critical sections, which might delay the lock release operation as the processor has to wait for the prefetches to complete before effectively freeing the lock. The fourth potential problem with our Naive strategy is that it issues prefetches in an order that is not necessarily similar to the order in which page faults will be taken.

The OPT1 technique is intended to eliminate the problems mentioned above. The technique still uses write notices to guide diff prefetching, but associates two counters with each page that are used to determine how often prefetches of the page are useful. Prefetches are only issued for pages that experience 50% or more useful prefetches, provided that they are currently valid at the local node. In addition, inspired by observations 3 and 5, this technique spreads prefetches out in time by only issuing prefetches at the time of a page fault outside of a critical section; at each such fault, the diffs for a single page are prefetched.³ The page to prefetch for is determined by dequeuing an element of one of two lists: the list containing the numbers of the pages that experienced faults in the previous phase of execution, and the list that records the faults of the phase before the previous one. The order in which page numbers appear in each of the lists is the same as the sequence of page faults during the phase. The similarity between the second and third phases (section 4.2) determines which of the lists is to be used throughout the rest of the execution: if these phases are similar, the list corresponding to the previous phase is used; otherwise, the list corresponding to the phase before the last is chosen. This strategy attempts to adjust to the common fault behavior of Em3d, where similar phases alternate during execution.

The main difference between the OPT1 and OPT2 techniques is that OPT2 avoids using write notices to guide prefetching altogether, since when the notices are poor descriptions of future faults OPT1 will simply reduce the number of useless prefetches, not increase the number of useful ones. Instead of the write notices, one of the lists of pages faulted on is itself used as a description of future faults. OPT2 does not consider the utility of prefetches either.
² In our simulations we assume that, once started, the network interface DMA can complete its transfers without having to compete for resources with the processor.
³ In our system, the first access to a prefetched page also causes a violation, even if the page is all set to be used. Thus, we also start prefetches on these events when prefetching on faults.
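The per-fault decision logic of OPT1, as we read the description above, can be sketched as follows. Where the text is ambiguous (for example, exactly how the write notices interact with the fault-history lists), we pick one plausible interpretation; all names are hypothetical helpers, not the actual TreadMarks implementation.

```c
/* Sketch of OPT1's per-fault decision, reconstructed from the text above. */
#define NPAGES (1 << 20)           /* assumed bound on shared page numbers */

static unsigned useful[NPAGES];    /* prefetches of this page later used */
static unsigned issued[NPAGES];    /* prefetches of this page issued     */

typedef struct { unsigned long *page; int head, tail; } fault_list;
static fault_list prev_phase, phase_before_prev;
static int use_prev_phase;         /* fixed once, from the similarity of phases 2 and 3 */

extern int  page_has_write_notice(unsigned long page);   /* assumed hooks */
extern int  page_locally_valid(unsigned long page);
extern void issue_diff_prefetch(unsigned long page);

/* Called on a page fault taken outside of a critical section:
 * prefetch the diffs of at most one page from the chosen history list. */
void opt1_prefetch_on_fault(void) {
    fault_list *l = use_prev_phase ? &prev_phase : &phase_before_prev;
    while (l->head < l->tail) {
        unsigned long p = l->page[l->head++];
        if (!page_has_write_notice(p))        /* still guided by write notices */
            continue;
        if (issued[p] > 0 && 2 * useful[p] < issued[p])
            continue;                         /* fewer than 50% useful so far  */
        if (!page_locally_valid(p))           /* condition as stated in the text */
            continue;
        issued[p]++;
        issue_diff_prefetch(p);
        break;                                /* one page per fault */
    }
}
```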
[Table 3: Main Characteristics of Prefetching Techniques. Columns: prefetching at synch ops, write notice hints, adaptive prefetching, prefetching on faults, sequential prefetches, no prefetching in critical sections, static analysis.]

Inspired by observation 5, the Sequential technique issues diff prefetches for two pages sequentially following or preceding the page faulted on in memory, provided that the fault did not happen inside of a critical section and the candidate page is not currently valid, has been referenced in the past, and does not have an outstanding or completed prefetch for it. 32 candidate pages are tested for these characteristics, and if none of them passes the test the system decides not to prefetch on the current fault. In contrast with Naive, OPT1, and OPT2, this technique could also be implemented to prefetch pages a processor has never referenced before. However, we avoid prefetching these pages, since this could lead to a large number of useless prefetches when the pages prefetched by a processor are not at all part of its working set.

As mentioned in observation 2, compiler analysis might be required to prefetch efficiently for certain applications. The Compiler-based strategy is similar to software prefetching for hardware DSMs [2, 9]. In this technique, prefetch calls are inserted in the application code manually to orchestrate page and diff prefetching. Each prefetch call determines prefetching for one or more pages. Before inserting the calls, we traced the page faults to count the number of faults associated with each page. In order to prefetch without generating substantial overhead for specifying the corresponding pages, we inserted prefetch calls only for the pages that experience significant numbers of faults and performed loop unrolling and splitting wherever necessary. Note that the accuracy of this technique can never be matched by a real compiler; the technique is used to evaluate the potential of compiler-based solutions to prefetching. Table 3 summarizes the main characteristics of the prefetching techniques we propose.
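A sketch of the Sequential technique's candidate selection, following the description above. The alternating search order (one page after, one before, and so on, up to 32 candidates) and the helper predicates are our assumptions, not TreadMarks code.

```c
/* Sketch of the Sequential technique's candidate selection. */
extern int  page_locally_valid(unsigned long page);          /* assumed hooks */
extern int  page_referenced_before(unsigned long page);
extern int  prefetch_outstanding_or_done(unsigned long page);
extern void issue_diff_prefetch(unsigned long page);

/* Called at a page fault on 'faulted' outside of a critical section. */
void sequential_prefetch_on_fault(unsigned long faulted) {
    int issued = 0;
    /* Examine up to 32 candidates, alternating after/before the faulted page:
     * +1, -1, +2, -2, ... (the exact search order is our assumption). */
    for (int d = 1; d <= 16 && issued < 2; d++) {
        long cands[2] = { (long)faulted + d, (long)faulted - d };
        for (int j = 0; j < 2 && issued < 2; j++) {
            long p = cands[j];
            if (p < 0) continue;
            if (page_locally_valid((unsigned long)p)) continue;
            if (!page_referenced_before((unsigned long)p)) continue;
            if (prefetch_outstanding_or_done((unsigned long)p)) continue;
            issue_diff_prefetch((unsigned long)p);
            issued++;
        }
    }
    /* If no candidate passes the tests, nothing is prefetched on this fault. */
}
```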
Figure 8: Prefetch Utilization for Naive, OPT1, OPT2, Sequential, and Compiler-based Techniques (from left to right).

The OPT1 and OPT2 techniques indeed reduce the number of useless prefetches in Naive significantly. As we expected, Radix represents a major problem for the OPT1 and OPT2 techniques, since most of its faults are not sharing-related and fault patterns do not repeat throughout the execution. OPT1 and OPT2 perform roughly the same for all applications, except for Ocean. For this application, the techniques decide that alternate phases are more similar than consecutive ones based on the initial fault behavior of the application, which turns out to be a poor choice for the later phases of execution. OPT2 suffers much more from this bad decision, since its prefetches are driven by the lists of page faults, while the prefetches in OPT1 are driven by write notices. The Sequential technique performs well for the FFT, Ocean, and Radix applications, where it issues at least as many useful prefetches as the more aggressive Naive technique. The difference in useful prefetches between Sequential and Naive is particularly large for Radix. As we have seen in section 4, in this application past history is a poor description of future accesses, so both the Sequential and Compiler-based techniques achieve a much greater percentage of useful prefetches than all other techniques. In fact, the Compiler-based technique is the best one overall, since it entails the largest number of useful prefetches, except in the case of Ocean. For this application, the technique does not issue prefetches for all pages that cause access faults, to avoid excessive computation overhead.

Execution Time. Figure 9 presents a detailed view of the execution time performance of our applications running on 16 processors. The time categories in the graphs are the same as in figure 2. The leftmost bar in each graph, Base, represents the standard TreadMarks running time, while the other bars represent the Naive, OPT1, OPT2, Sequential, and Compiler-based prefetching techniques, from left to right.

Figure 9: Running Time Performance of Em3d, FFT, Radix, and Ocean (clockwise from top left corner).

The graphs show that the prefetching techniques we study improve the performance of Em3d, FFT, Ocean, and Radix by as much as 9%, 3%, 29%, and 30%, respectively. All techniques are successful
at reducing the data fetch overheads of TreadMarks, except in the case of Radix, where the Compiler-based prefetching strategy is the only one to do so. In general, the Naive and Compiler-based techniques are the most successful in this respect. Naive achieves an excellent running time performance for Ocean. However, for the other applications the gains of Naive are usually surpassed by much higher IPC and synchronization overheads. Synchronization times increase because prefetching makes short critical sections extremely expensive. IPC times increase as a result of prefetching when nodes guess their future access patterns incorrectly and end up prefetching pages they will not actually use. Each useless prefetch causes node interference (in the form of IPC time) that would not occur in Base. In addition, IPC times can increase as a result of competition between the processor and the network interface for access to memory. Sequential, OPT1, and OPT2 do not increase IPC and synchronization overheads as much as Naive, but do not deliver consistently good performance either. The Compiler-based strategy delivers excellent performance for Radix, but is not the ideal choice for the other applications. The main reasons for this result are that the Compiler-based technique suffers the computation overhead of specifying and calling prefetch routines, and also generates increased IPC overheads.

In summary, we find that, to be consistently effective, any one technique must be able to reduce data fetch overheads without increasing the IPC and synchronization latencies significantly.
6 Related Work
As far as we know, no other study has explicitly addressed the page fault behavior of applications running on software DSMs. Prefetching for software DSMs has received little attention so far [5, 7, 6, 1].

Among other techniques, Dwarkadas et al. [5] studied diff prefetching at lock acquire operations. Their cross-synchronization prefetching strategy can be very precise about future page access patterns, but only sends prefetches to the last lock releaser, which might not have an up-to-date copy of the data prefetched. In other work, Dwarkadas et al. [6] combine compile-time analysis with runtime support for several sophisticated techniques, including diff prefetching. Their prefetching techniques aggregate the diffs of several pages in a single reply message whenever possible, therefore achieving a reduction in the number of messages in addition to the overhead tolerance provided by prefetching. The techniques we considered in this paper do not seek to reduce the number of messages, but simply to tolerate data access overheads.

The Sparks DSM construction library [7] provides a clean interface for recording page fault histories as used in two of our prefetching techniques. Based on this interface, Keleher describes a technique called prefetch playback that identifies pairs of producer and consumer processors (if any exist) and sends the corresponding updates at barrier events. The content of the updates is determined by histories of faults recorded throughout the previous phase of execution. This strategy relies on page fault patterns repeating in all phases, which we have shown does not always happen. The interface is general enough, however, that more sophisticated prefetch techniques can be implemented without much difficulty.

Our previous work [1] proposed the use of simple hardware support for aggressively tolerating overheads in software DSMs. In the context of that work we evaluated the Naive prefetching technique both under standard TreadMarks and under a modified version of the system that takes advantage of the extra hardware. Our experiments detected the performance problems of the Naive strategy, but showed that it can profit substantially from our hardware support. All the other techniques studied in this paper should benefit from this support even more than Naive.
7 Conclusions
In this paper we assessed the page fault behavior of parallel applications running on top of software DSMs. Based on several important observations about this behavior, we proposed and evaluated five diff prefetching techniques for the TreadMarks DSM. Simulation results of this system running on a network of workstations showed that our prefetching techniques can deliver performance improvements over standard TreadMarks of up to 30%. However, no technique was consistently effective for all applications. Nevertheless, we only covered a restricted set of prefetching techniques, so using the behavior information we provide in this paper might lead to new and more profitable techniques. Given the initial results we presented, however, our conclusion is that prefetching techniques for software DSMs should only be consistently profitable with hardware support for alleviating IPC and synchronization overheads.
Acknowledgements
We would like to thank Leonidas Kontothanassis for contributing to our simulation infrastructure and for numerous discussions on topics related to the research presented in this paper. We would also like to thank Cristiana Seidel for comments that helped improve this paper.
References
[1] R. Bianchini, L. Kontothanassis, R. Pinto, M. De Maria, M. Abud, and C. L. Amorim. Hiding Communication Latency and Coherence Overhead in Software DSMs. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, October 1996.
[2] D. Callahan, K. Kennedy, and A. Porterfield. Software Prefetching. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 40-52, April 1991.
[3] J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Implementation and Performance of Munin. In Proceedings of the 13th Symposium on Operating Systems Principles, October 1991.
[4] D. Culler et al. Parallel Programming in Split-C. In Proceedings of Supercomputing '93, pages 262-273, November 1993.
[5] S. Dwarkadas, A. Cox, H. Lu, and W. Zwaenepoel. Compiler-Directed Selective Update Mechanisms for Software Distributed Shared Memory. Technical Report TR95-253, Department of Computer Science, Rice University, 1995.
[6] S. Dwarkadas, A. Cox, and W. Zwaenepoel. An Integrated Compile-Time/Run-Time Software Distributed Shared Memory System. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, October 1996.
[7] P. Keleher. Coherence as an Abstract Type. Technical Report CS-TR-3544, Department of Computer Science, University of Maryland, October 1995.
[8] P. Keleher, S. Dwarkadas, A. Cox, and W. Zwaenepoel. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. In Proceedings of the USENIX Winter '94 Technical Conference, pages 17-21, January 1994.
[9] T. Mowry and A. Gupta. Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors. Journal of Parallel and Distributed Computing, 12(2):87-106, June 1991.
[10] J. E. Veenstra and R. J. Fowler. MINT: A Front End for Efficient Simulation of Shared-Memory Multiprocessors. In Proceedings of the 2nd International Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 1994.
[11] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 24-36, May 1995.