
Cilk: An Efficient Multithreaded Runtime System

Robert D. Blumofe Christopher F. Joerg Bradley C. Kuszmaul


Charles E. Leiserson Keith H. Randall Yuli Zhou
MIT Laboratory for Computer Science
545 Technology Square
Cambridge, MA 02139

Abstract

Cilk (pronounced "silk") is a C-based runtime system for multithreaded parallel programming. In this paper, we document the efficiency of the Cilk work-stealing scheduler, both empirically and analytically. We show that on real and synthetic applications, the "work" and "critical path" of a Cilk computation can be used to accurately model performance. Consequently, a Cilk programmer can focus on reducing the work and critical path of his computation, insulated from load balancing and other runtime scheduling issues. We also prove that for the class of "fully strict" (well-structured) programs, the Cilk scheduler achieves space, time, and communication bounds all within a constant factor of optimal.

The Cilk runtime system currently runs on the Connection Machine CM5 MPP, the Intel Paragon MPP, the Silicon Graphics Power Challenge SMP, and the MIT Phish network of workstations. Applications written in Cilk include protein folding, graphic rendering, backtrack search, and the ⋆Socrates chess program, which won third prize in the 1994 ACM International Computer Chess Championship.

Figure 1: The Cilk model of multithreaded computation. Threads are shown as circles, which are grouped into procedures. Each downward edge corresponds to a spawn of a child, each horizontal edge corresponds to a spawn of a successor, and each upward, curved edge corresponds to a data dependency. The numbers in the figure indicate the levels of procedures in the spawn tree.
1 Introduction

Multithreading has become an increasingly popular way to implement dynamic, highly asynchronous, concurrent programs [1, 8, 9, 10, 11, 12, 15, 19, 21, 22, 24, 25, 28, 33, 34, 36, 39, 40]. A multithreaded system provides the programmer with a means to create, synchronize, and schedule threads. Although the schedulers in many of these runtime systems seem to perform well in practice, none provide users with a guarantee of application performance. Cilk is a runtime system whose work-stealing scheduler is efficient in theory as well as in practice. Moreover, it gives the user an algorithmic model of application performance based on the measures of "work" and "critical path" which can be used to predict the runtime of a Cilk program accurately.

A Cilk multithreaded computation can be viewed as a directed acyclic graph (dag) that unfolds dynamically, as is shown schematically in Figure 1. A Cilk program consists of a collection of Cilk procedures, each of which is broken into a sequence of threads, which form the vertices of the dag. Each thread is a nonblocking C function, which means that it can run to completion without waiting or suspending once it has been invoked. As one of the threads from a Cilk procedure runs, it can spawn a child thread which begins a new child procedure. In the figure, downward edges connect threads and their procedures with the children they have spawned. A spawn is like a subroutine call, except that the calling thread may execute concurrently with its child, possibly spawning additional children. Since threads cannot block in the Cilk model, a thread cannot spawn children and then wait for values to be returned. Rather, the thread must additionally spawn a successor thread to receive the children's return values when they are produced. A thread and its successors are considered to be parts of the same Cilk procedure. In the figure, sequences of successor threads that form Cilk procedures are connected by horizontal edges. Return values, and other values sent from one thread to another, induce data dependencies among the threads, where a thread receiving a value cannot begin until another thread sends the value. Data dependencies are shown as upward, curved edges in the figure. Thus, a Cilk computation unfolds as a spawn tree composed of procedures and the spawn edges that connect them to their children, but the execution is constrained to follow the precedence relation determined by the dag of threads.

This research was supported in part by the Advanced Research Projects Agency under Grants N00014-94-1-0985 and N00014-92-J-1310. Robert Blumofe is supported in part by an ARPA High-Performance Computing Graduate Fellowship. Keith Randall is supported in part by a Department of Defense NDSEG Fellowship.

To appear in the Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '95), Santa Barbara, California, July 19-21, 1995. Also available on the web as ftp://ftp.lcs.mit.edu/pub/cilk/PPoPP95.ps.Z.
The execution time of any Cilk program on a parallel computer with P processors is constrained by two parameters of the computation: the work and the critical path. The work, denoted T1, is the time used by a one-processor execution of the program, which corresponds to the sum of the execution times of all the threads. The critical path length, denoted T∞, is the total amount of time required by an infinite-processor execution, which corresponds to the largest sum of thread execution times along any path. With P processors, the execution time cannot be less than T1/P or less than T∞. The Cilk scheduler uses "work stealing" [3, 7, 13, 14, 15, 19, 27, 28, 29, 34, 40] to achieve execution time very near to the sum of these two measures. Off-line techniques for computing such efficient schedules have been known for a long time [5, 16, 17], but this efficiency has been difficult to achieve on-line in a distributed environment while simultaneously using small amounts of space and communication.

We demonstrate the efficiency of the Cilk scheduler both empirically and analytically. Empirically, we have been able to document that Cilk works well for dynamic, asynchronous, tree-like, MIMD-style computations. To date, the applications we have programmed include protein folding, graphic rendering, backtrack search, and the ⋆Socrates chess program, which won third prize in the 1994 ACM International Computer Chess Championship. Many of these applications pose problems for more traditional parallel environments, such as message passing [38] and data parallel [2, 20], because of the unpredictability of the dynamic workloads on processors. Analytically, we prove that for "fully strict" (well-structured) programs, Cilk's work-stealing scheduler achieves execution space, time, and communication bounds all within a constant factor of optimal. To date, all of the applications that we have coded are fully strict.
The Cilk language is an extension to C that provides an abstraction of threads in explicit continuation-passing style. A Cilk program is preprocessed to C and then linked with a runtime library to run on the Connection Machine CM5 MPP, the Intel Paragon MPP, the Silicon Graphics Power Challenge SMP, or the MIT Phish [4] network of workstations. In this paper, we focus on the Connection Machine CM5 implementation of Cilk. The Cilk scheduler on the CM5 is written in about 30 pages of C, and it performs communication among processors using the Strata [6] active-message library.

The remainder of this paper is organized as follows. Section 2 describes Cilk's runtime data structures and the C language extensions that are used for programming. Section 3 describes the work-stealing scheduler. Section 4 documents the performance of several Cilk applications. Section 5 shows how the work and critical path of a Cilk computation can be used to model performance. Section 6 shows analytically that the scheduler works well. Finally, Section 7 offers some concluding remarks and describes our plans for the future.
2 The Cilk programming environment and implementation

In this section we describe a C language extension that we have developed to ease the task of coding Cilk programs. We also explain the basic runtime data structures that Cilk uses.

In the Cilk language, a thread T is defined in a manner similar to a C function definition:

    thread T (arg-decls ...) { stmts ... }

The Cilk preprocessor translates T into a C function of one argument and void return type. The one argument is a pointer to a closure data structure, illustrated in Figure 2, which holds the arguments for T. A closure consists of a pointer to the C function for T, a slot for each of the specified arguments, and a join counter indicating the number of missing arguments that need to be supplied before T is ready to run. A closure is ready if it has obtained all of its arguments, and it is waiting if some arguments are missing. To run a ready closure, the Cilk scheduler invokes the thread as a procedure using the closure itself as its sole argument. Within the code for the thread, the arguments are copied out of the closure data structure into local variables. The closure is allocated from a simple runtime heap when it is created, and it is returned to the heap when the thread terminates.

Figure 2: The closure data structure.

The Cilk language supports a data type called a continuation, which is specified by the type modifier keyword cont. A continuation is essentially a global reference to an empty argument slot of a closure, implemented as a compound data structure containing a pointer to a closure and an offset that designates one of the closure's argument slots. Continuations can be created and passed among threads, which enables threads to communicate and synchronize with each other. Continuations are typed with the C data type of the slot in the closure.
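To make these two structures concrete, the following C sketch shows one plausible layout. The field names, the fixed-size argument array, and the level field are illustrative assumptions of ours, not the actual CM5 runtime's representation.

    /* A minimal sketch of a Cilk-style closure and continuation (illustrative only). */
    struct closure;
    typedef void (*thread_fn)(struct closure *);  /* a compiled thread takes its closure, returns void */

    struct closure {
        thread_fn       code;          /* pointer to the C function for the thread T      */
        int             join_counter;  /* number of argument slots still missing          */
        int             level;         /* spawn depth; used by the scheduler in Section 3 */
        struct closure *next;          /* link in one level's list of ready closures      */
        long            args[8];       /* argument slots (fixed size here for brevity)    */
    };

    struct continuation {
        struct closure *target;        /* closure that owns the empty slot */
        int             slot;          /* offset of that slot within args  */
    };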
At runtime, a thread can spawn a child thread by creating a closure for the child. Spawning is specified in the Cilk language as follows:

    spawn T (args ...)

This statement creates a child closure, fills in all available arguments, and initializes the join counter to the number of missing arguments. Available arguments are specified as in C. To specify a missing argument, the user specifies a continuation variable (type cont) preceded by a question mark. For example, if the second argument is ?k, then Cilk sets the variable k to a continuation that refers to the second argument slot of the created closure. If the closure is ready, that is, it has no missing arguments, then spawn causes the closure to be immediately posted to the scheduler for execution. In typical applications, child closures are usually created with no missing arguments.

To create a successor thread, a thread executes the following statement:

    spawn_next T (args ...)

This statement is semantically identical to spawn, but it informs the scheduler that the new closure should be treated as a successor, as opposed to a child. Successor closures are usually created with some missing arguments, which are filled in by values produced by the children.

A Cilk procedure does not ever return values in the normal way to a parent procedure. Instead, the programmer must code the parent procedure as two threads. The first thread spawns the child procedure, passing it a continuation pointing to the successor thread's closure. The child sends its "return" value explicitly as an argument to the waiting successor. This strategy of communicating between threads is called explicit continuation passing. Cilk provides primitives of the following form to send values from one closure to another:

    send_argument (k, value)

This statement sends the value value to the argument slot of a waiting closure specified by the continuation k. The types of the continuation and the value must be compatible. The join counter of the waiting closure is decremented, and if it becomes zero, then the closure is ready and is posted to the scheduler.
    thread fib (cont int k, int n)
    { if (n<2)
        send_argument (k, n);
      else
      { cont int x, y;
        spawn_next sum (k, ?x, ?y);
        spawn fib (x, n-1);
        spawn fib (y, n-2);
      }
    }

    thread sum (cont int k, int x, int y)
    { send_argument (k, x+y);
    }

Figure 3: A Cilk procedure, consisting of two threads, to compute the nth Fibonacci number.

Figure 3 shows the familiar recursive Fibonacci procedure written in Cilk. It consists of two threads, fib and its successor sum. Reflecting the explicit continuation-passing style that Cilk supports, the first argument to each thread is the continuation specifying where the "return" value should be placed.

When the fib function is invoked, it first checks to see if the boundary case has been reached, in which case it uses send_argument to "return" the value of n to the slot specified by continuation k. Otherwise, it spawns the successor thread sum, as well as two children to compute the two subcases. Each of these two children is given a continuation specifying to which argument in the sum thread it should send its result. The sum thread simply adds the two arguments when they arrive and sends this result to the slot designated by k.
Although writing in explicit continuation-passing style is somewhat onerous for the programmer, the decision to break procedures into separate nonblocking threads simplifies the Cilk runtime system. Each Cilk thread leaves the C runtime stack empty when it completes. Thus, Cilk can run on top of a vanilla C runtime system. A common alternative [19, 25, 32, 34] is to support a programming style in which a thread suspends whenever it discovers that required values have not yet been computed, resuming when the values become available. When a thread suspends, however, it may leave temporary values on the runtime stack which must be saved, or each thread must have its own runtime stack. Consequently, this alternative strategy requires changes to the runtime system that depend on the C calling-stack layout and register-usage conventions. Another advantage of Cilk's strategy is that it allows multiple children to be spawned from a single nonblocking thread, which saves on context switching. In Cilk, r children can be spawned and executed with only r + 1 context switches, whereas the alternative of suspending whenever a thread is spawned causes 2r context switches. Since our primary interest is in understanding how to build efficient multithreaded runtime systems, but without redesigning the basic C runtime system, we chose the alternative of burdening the programmer with a requirement which is perhaps less elegant linguistically, but which yields a simple and portable runtime implementation.

Cilk supports a variety of features that give the programmer greater control over runtime performance. For example, when the last action of a thread is to spawn a ready thread, the programmer can use the keyword call instead of spawn; call produces a "tail call" that runs the new thread immediately without invoking the scheduler. Cilk also allows arrays and subarrays to be passed as arguments to closures. Other features include various abilities to override the scheduler's decisions, including on which processor a thread should be placed and how to pack and unpack data when a closure is migrated from one processor to another.

3 The Cilk work-stealing scheduler

Cilk's scheduler uses the technique of work-stealing [3, 7, 13, 14, 15, 19, 27, 28, 29, 34, 40] in which a processor (the thief) who runs out of work selects another processor (the victim) from whom to steal work, and then steals the shallowest ready thread in the victim's spawn tree. Cilk's strategy is for thieves to choose victims at random [3, 27, 37].

At runtime, each processor maintains a local ready queue to hold ready closures. Each closure has an associated level, which corresponds to the number of spawn's (but not spawn_next's) on the path from the root of the spawn tree. The ready queue is an array in which the Lth element contains a linked list of all ready closures having level L.

Cilk begins executing the user program by initializing all ready queues to be empty, placing the root thread into the level-0 list of Processor 0's queue, and then starting a scheduling loop on each processor. Within a scheduling loop, a processor first checks to see whether its ready queue is empty. If it is, the processor commences "work stealing," which will be described shortly. Otherwise, the processor performs the following steps:

1. Remove the thread at the head of the list of the deepest nonempty level in the ready queue.
2. Extract the thread from the closure, and invoke it.

As a thread executes, it may spawn or send arguments to other threads. When the thread terminates, control returns to the scheduling loop.
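A minimal sketch of this loop is shown below. It builds on the illustrative closure layout from Section 2; MAX_LEVEL and the helper functions (steal_work, free_closure) are placeholders of ours, not the runtime's real interface.

    /* Sketch of the per-processor scheduler state and loop (illustrative only). */
    #define MAX_LEVEL 1024                          /* illustrative bound on spawn depth */

    struct ready_queue {
        struct closure *level[MAX_LEVEL];           /* level[L]: list of ready closures at level L */
        int             deepest;                    /* deepest possibly-nonempty level, -1 if none */
    };

    extern struct closure *steal_work(void);        /* thief protocol, sketched later in this section */
    extern void            free_closure(struct closure *c);

    static struct closure *pop_deepest(struct ready_queue *q)
    {
        for (int L = q->deepest; L >= 0; L--) {
            if (q->level[L] != NULL) {
                struct closure *c = q->level[L];    /* step 1: head of the deepest nonempty list */
                q->level[L] = c->next;
                q->deepest = L;
                return c;
            }
        }
        return NULL;                                /* ready queue is empty */
    }

    void scheduling_loop(struct ready_queue *q)
    {
        for (;;) {
            struct closure *c = pop_deepest(q);
            if (c == NULL)
                c = steal_work();                   /* empty queue: commence work stealing */
            if (c == NULL)
                continue;
            c->code(c);                             /* step 2: invoke the thread on its closure */
            free_closure(c);                        /* closure returns to the heap when its thread ends */
        }
    }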
When a thread at level L spawns a child thread T, the scheduler executes the following operations:

1. Allocate and initialize a closure for T.
2. Copy the available arguments into the closure, initialize any continuations to point to missing arguments, and initialize the join counter to the number of missing arguments.
3. Label the closure with level L + 1.
4. If there are no missing arguments, post the closure to the ready queue by inserting it at the head of the level-(L + 1) list.

Execution of spawn_next is similar, except that the closure is labeled with level L and, if it is ready, posted to the level-L list.
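Under the same illustrative representation, the spawn bookkeeping might look roughly like the sketch below; alloc_closure, post_ready, and the argument-passing convention are placeholders of ours, and the setup of continuations for missing slots is only indicated in comments.

    /* Sketch of the bookkeeping behind `spawn T (args ...)` executed by a thread at level L. */
    extern struct closure *alloc_closure(void);
    extern void post_ready(struct ready_queue *q, struct closure *c);  /* insert at head of its level's list */

    struct closure *spawn_child(struct ready_queue *q, thread_fn T, int L,
                                const long *avail, int navail, int nmissing)
    {
        struct closure *c = alloc_closure();    /* 1. allocate and initialize a closure for T      */
        c->code = T;
        for (int i = 0; i < navail; i++)        /* 2. copy the available arguments; continuations  */
            c->args[i] = avail[i];              /*    for the missing slots (the ?k variables) are */
        c->join_counter = nmissing;             /*    handed back to the caller, and the join      */
                                                /*    counter records how many slots are missing   */
        c->level = L + 1;                       /* 3. label the closure with level L + 1           */
        if (nmissing == 0)
            post_ready(q, c);                   /* 4. no missing arguments: post it immediately    */
        return c;                               /* spawn_next differs only in keeping level L      */
    }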
A processor that executes send_argument(k, value) performs the following steps:

1. Find the closure and argument slot referenced by the continuation k.
2. Place value in the argument slot, and decrement the join counter of the closure.
3. If the join counter goes to zero, post the closure to the ready queue at the appropriate level.

When the continuation k refers to a closure on a remote processor, network communication ensues. The processor that initiated the send_argument function sends a message to the remote processor to perform the operations. The only subtlety occurs in step 3. If the closure must be posted, it is posted to the ready queue of the initiating processor, rather than to that of the remote processor. This policy is necessary for the scheduler to be provably good, but as a practical matter, we have also had success with posting the closure to the remote processor's queue, which can sometimes save a few percent in overhead.
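In the local case, send_argument reduces to the following sketch (using the same illustrative structures as above); in the remote case, the runtime ships these operations to the processor that owns the closure in an active message, and a closure that becomes ready is posted on the initiating processor's queue, as just described.

    /* Sketch of send_argument(k, value) when k refers to a local closure (illustrative only). */
    extern void post_ready(struct ready_queue *q, struct closure *c);

    void send_argument_local(struct ready_queue *q, struct continuation k, long value)
    {
        struct closure *c = k.target;           /* 1. find the closure and argument slot       */
        c->args[k.slot] = value;                /* 2. fill the slot ...                        */
        if (--c->join_counter == 0)             /*    ... and decrement the join counter       */
            post_ready(q, c);                   /* 3. counter hit zero: the closure is ready   */
    }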
If the scheduler attempts to remove a thread from an empty ready queue, the processor becomes a thief and commences work stealing as follows:

1. Select a victim processor uniformly at random.
2. If the victim's ready queue is empty, go to step 1.
3. If the victim's ready queue is nonempty, extract a thread from the tail of the list in the shallowest nonempty level of the ready queue, and invoke it.

Work stealing is implemented with a simple request-reply communication protocol between the thief and victim.
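The thief side of the protocol can be sketched as follows; the victim-selection and request-reply helpers are placeholders of ours standing in for the CM5 active-message protocol rather than its actual interface.

    /* Sketch of the thief side of work stealing (illustrative only). */
    #include <stdlib.h>

    struct closure;
    extern int             num_processors(void);
    extern struct closure *request_shallowest_tail(int victim);  /* request-reply message; NULL if victim is empty */

    struct closure *steal_work(void)
    {
        for (;;) {
            int victim = rand() % num_processors();          /* 1. choose a victim uniformly at random     */
            struct closure *c = request_shallowest_tail(victim);
            if (c == NULL)
                continue;                                    /* 2. victim's ready queue was empty: retry   */
            return c;                                        /* 3. got the tail of the shallowest nonempty */
        }                                                    /*    level of the victim's queue             */
    }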
Why steal work from the shallowest level of the ready queue? The reason is two-fold. First, we would like to steal large amounts of work, and shallow closures are likely to execute for longer than deep ones. Stealing large amounts of work tends to lower the communication cost of the program, because fewer steals are necessary. Second, the closures at the shallowest level of the ready queue are also the ones that are shallowest in the dag, a key fact proven in Section 6. Consequently, if processors are idle, the work they steal tends to make progress along the critical path.

4 Performance of Cilk applications

This section presents several applications that we have used to benchmark the Cilk scheduler. We also present empirical evidence from experiments run on a CM5 supercomputer to document the efficiency of our work-stealing scheduler. The CM5 is a massively parallel computer based on 32MHz SPARC processors with a fat-tree interconnection network [30].

The applications are described below:

• fib is the same as was presented in Section 2, except that the second recursive spawn is replaced by a "tail call" that avoids the scheduler. This program is a good measure of Cilk overhead, because the thread length is so small.

• queens is a backtrack-search program that solves the problem of placing N queens on an N × N chessboard so that no two queens attack each other. The Cilk program is based on serial code by R. Sargent of the MIT Media Laboratory. Thread length was enhanced by serializing the bottom 7 levels of the search tree.

• pfold is a protein-folding program [35] written in conjunction with V. Pande of MIT's Center for Material Sciences and Engineering. This program finds hamiltonian paths in a three-dimensional grid using backtrack search. It was the first program to enumerate all hamiltonian paths in a 3 × 4 × 4 grid. We timed the enumeration of all paths starting with a certain sequence.

• ray is a parallel program for graphics rendering based on the serial POV-Ray program, which uses a ray-tracing algorithm. The entire POV-Ray system contains over 20,000 lines of C code, but the core of POV-Ray is a simple doubly nested loop that iterates over each pixel in a two-dimensional image. For ray we converted the nested loops into a 4-ary divide-and-conquer control structure using spawns.¹ Our measurements do not include the approximately 2.4 seconds of startup time required to read and process the scene description file.

• knary(k,n,r) is a synthetic benchmark whose parameters can be set to produce a variety of values for work and critical path. It generates a tree of branching factor k and depth n in which the first r children at every level are executed serially and the remainder are executed in parallel. At each node of the tree, the program runs an empty "for" loop for 400 iterations.

• ⋆Socrates is a parallel chess program that uses the Jamboree search algorithm [23, 29] to parallelize a min-max tree search. The work of the algorithm varies with the number of processors, because it does speculative work that may be aborted during runtime. ⋆Socrates is a production-quality program that won third prize in the 1994 ACM International Computer Chess Championship running on the 512-node CM5 in the National Center for Supercomputing Applications at the University of Illinois, Urbana-Champaign.

¹ Initially, the serial POV-Ray program was about 5 percent slower than the Cilk version running on one processor. The reason was that the divide-and-conquer decomposition performed by the Cilk code provides better locality than the doubly nested loop of the serial code. Modifying the serial code to imitate the Cilk decomposition improved its performance. Timings for the improved version are given in the table.

Table 4 shows typical performance measures for these Cilk applications. Each column presents data from a single run of a benchmark application. We adopt the following notations, which are used in the table. For each application, we have an efficient serial C implementation, compiled using gcc -O2, whose measured runtime is denoted Tserial. The work T1 is the measured execution time for the Cilk program running on a single node of the CM5.² The critical path length T∞ of the Cilk computation is measured by timestamping each thread and does not include scheduling or communication costs. The measured P-processor execution time of the Cilk program running on the CM5 is given by TP, which includes all scheduling and communication costs. The row labeled "threads" indicates the number of threads executed, and "thread length" is the average thread length (work divided by the number of threads).

² For the ⋆Socrates program, T1 is not the measured execution time, but rather it is an estimate of the work obtained by summing the execution times of all threads, which yields a slight underestimate. ⋆Socrates is an unusually complicated application, because its speculative execution yields unpredictable work and critical path. Consequently, the measured runtime on one processor does not accurately reflect the work on P > 1 processors.

Certain derived parameters are also displayed in the table. The ratio Tserial/T1 is the efficiency of the Cilk program relative to the C program. The ratio T1/T∞ is the average parallelism. The value T1/P + T∞ is a simple model of the runtime, which will be discussed in the next section. The speedup is T1/TP, and the parallel efficiency is T1/(P·TP). The row labeled "space/proc." indicates the maximum number of closures allocated at any time on any processor. The row labeled "requests/proc." indicates the average number of steal requests made by a processor during the execution, and "steals/proc." gives the average number of closures actually stolen.
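To make the model row concrete, it can be computed for, say, the knary(10,5,2) column of Table 4 at P = 32 directly from the work and critical-path rows:

\[
T_1/P + T_\infty \;=\; 314.6/32 + 4.458 \;\approx\; 14.3 \ \text{seconds},
\]

which matches the table's 14.28 up to rounding and is within about 6 percent of the measured TP = 15.13 seconds.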
                  fib          queens      pfold       ray         knary       knary       ⋆Socrates    ⋆Socrates
(parameters)      (33)         (15)        (3,3,4)     (500,500)   (10,5,2)    (10,4,1)    (depth 10)   (depth 10)
                                                                                           (32 proc.)   (256 proc.)

Tserial           8.487        252.1       615.15      729.2       288.6       40.993      1665         1665
T1                73.16        254.6       647.8       732.5       314.6       45.43       3644         7023
Tserial/T1        0.116        0.9902      0.9496      0.9955      0.9174      0.9023      0.4569       0.2371
T∞                0.000326     0.0345      0.04354     0.0415      4.458       0.255       3.134        3.24
T1/T∞             224417       7380        14879       17650       70.56       178.2       1163         2168
threads           17,108,660   210,740     9,515,098   424,475     5,859,374   873,812     26,151,774   51,685,823
thread length     4.276 µs     1208 µs     68.08 µs    1726 µs     53.69 µs    51.99 µs    139.3 µs     135.9 µs

(32-processor experiments)
TP                2.298        8.012       20.26       21.68       15.13       1.633       126.1        -
T1/P + T∞         2.287        7.991       20.29       22.93       14.28       1.675       117.0        -
T1/TP             31.84        31.78       31.97       33.79       20.78       27.81       28.90        -
T1/(P·TP)         0.9951       0.9930      0.9992      1.0558      0.6495      0.8692      0.9030       -
space/proc.       70           95          47          39          41          42          386          -
requests/proc.    185.8        48.0        88.6        218.1       92639       3127        23484        -
steals/proc.      56.63        18.47       26.06       79.25       18031       1034        2395         -

(256-processor experiments)
TP                0.2892       1.045       2.590       2.765       8.590       0.4636      -            34.32
T1/P + T∞         0.2861       1.029       2.574       2.903       5.687       0.4325      -            30.67
T1/TP             253.0        243.7       250.1       265.0       36.62       98.00       -            204.6
T1/(P·TP)         0.9882       0.9519      0.9771      1.035       0.1431      0.3828      -            0.7993
space/proc.       66           76          47          32          48          40          -            405
requests/proc.    73.66        80.40       97.79       82.75       151803      7527        -            30646
steals/proc.      24.10        21.20       23.05       18.34       6378        550         -            1540

Table 4: Performance of Cilk on various applications. All times are in seconds, except where noted.

The data in Table 4 shows two important relationships: one between efficiency and thread length, and another between speedup and average parallelism.

Considering the relationship between efficiency Tserial/T1 and thread length, we see that for programs with moderately long threads, the Cilk scheduler induces very little overhead. The queens, pfold, ray, and knary programs have threads with average length greater than 50 microseconds and have efficiency greater than 90 percent. On the other hand, the fib program has low efficiency, because the threads are so short: fib does almost nothing besides spawn and send_argument.

Despite its long threads, the ⋆Socrates program has low efficiency, because its parallel Jamboree search algorithm [29] is based on speculatively searching subtrees that are not searched by a serial algorithm. Consequently, as we increase the number of processors, the program executes more threads and, hence, does more work. For example, the 256-processor execution did 7023 seconds of work whereas the 32-processor execution did only 3644 seconds of work. Both of these executions did considerably more work than the serial program's 1665 seconds of work. Thus, although we observe low efficiency, it is due to the parallel algorithm and not to Cilk overhead.

Looking at the speedup T1/TP measured on 32 and 256 processors, we see that when the average parallelism T1/T∞ is large compared with the number P of processors, Cilk programs achieve nearly perfect linear speedup, but when the average parallelism is small, the speedup is much less. The fib, queens, pfold, and ray programs, for example, have in excess of 7000-fold parallelism and achieve more than 99 percent of perfect linear speedup on 32 processors and more than 95 percent of perfect linear speedup on 256 processors.³ The ⋆Socrates program exhibits somewhat less parallelism and also somewhat less speedup. On 32 processors the ⋆Socrates program has 1163-fold parallelism, yielding 90 percent of perfect linear speedup, while on 256 processors it has 2168-fold parallelism yielding 80 percent of perfect linear speedup. With even less parallelism, as exhibited in the knary benchmarks, less speedup is obtained. For example, the knary(10,5,2) benchmark exhibits only 70-fold parallelism, and it realizes barely more than 20-fold speedup on 32 processors (less than 65 percent of perfect linear speedup). With 178-fold parallelism, knary(10,4,1) achieves 27-fold speedup on 32 processors (87 percent of perfect linear speedup), but only 98-fold speedup on 256 processors (38 percent of perfect linear speedup).

³ In fact, the ray program achieves superlinear speedup even when comparing to the efficient serial implementation. We suspect that cache effects cause this phenomenon.

Although these speedup measures reflect the Cilk scheduler's ability to exploit parallelism, to obtain application speedup, we must factor in the efficiency of the Cilk program compared with the serial C program. Specifically, the application speedup Tserial/TP is the product of efficiency Tserial/T1 and speedup T1/TP. For example, applications such as fib and ⋆Socrates with low efficiency generate correspondingly low application speedup. The ⋆Socrates program, with efficiency 0.2371 and speedup 204.6 on 256 processors, exhibits an application speedup of 0.2371 × 204.6 = 48.51. For the purpose of performance prediction, we prefer to decouple the efficiency of the application from the efficiency of the scheduler.

Looking more carefully at the cost of a spawn in Cilk, we find that it takes a fixed overhead of about 50 cycles to allocate and initialize a closure, plus about 8 cycles for each word argument. In comparison, a C function call on a CM5 processor takes 2 cycles of fixed overhead (assuming no register window overflow) plus 1 cycle for each word argument (assuming all arguments are transferred in registers). Thus, a spawn in Cilk is roughly an order of magnitude more expensive than a C function call. This Cilk overhead is quite apparent in the fib program, which does almost nothing besides spawn and send_argument. Based on fib's measured efficiency of 0.116, we can conclude that the aggregate average cost of a spawn/send_argument in Cilk is between 8 and 9 times the cost of a function call/return in C (since 1/0.116 ≈ 8.6).

Efficient execution of programs with short threads requires a low-overhead spawn operation. As can be observed from Table 4, the vast majority of threads execute on the same processor on which they are spawned. For example, the fib program executed over 17 million threads but migrated only 6170 (24.10 per processor) when run with 256 processors. Taking advantage of this property, other researchers [25, 32] have developed techniques for implementing spawns such that when the child thread executes on the same processor as its parent, the cost of the spawn operation is roughly equal to the cost of a C function call. We hope to incorporate such techniques into future implementations of Cilk.

Finally, we make two observations about the space and communication measures in Table 4.

Looking at the "space/proc." rows, we observe that the space per processor is generally quite small and does not grow with the number of processors. For example, ⋆Socrates on 32 processors executes over 26 million threads, yet no processor ever has more than 386 allocated closures. On 256 processors, the number of executed threads nearly doubles to over 51 million, but the space per processor barely changes. In Section 6 we show formally that for Cilk programs, the space per processor does not grow as we add processors.

Looking at the "requests/proc." and "steals/proc." rows in Table 4, we observe that the amount of communication grows with the critical path but does not grow with the work. For example, fib, queens, pfold, and ray all have critical paths under a tenth of a second long and perform fewer than 220 requests and 80 steals per processor, whereas knary(10,5,2) and ⋆Socrates have critical paths more than 3 seconds long and perform more than 20,000 requests and 1500 steals per processor. The table does not show any clear correlation between work and either requests or steals. For example, ray does more than twice as much work as knary(10,5,2), yet it performs two orders of magnitude fewer requests. In Section 6, we show that for "fully strict" Cilk programs, the communication per processor grows linearly with the critical path length and does not grow as a function of the work.
5 Modeling performance

In this section, we further document the effectiveness of the Cilk scheduler by showing empirically that it schedules applications in a near-optimal fashion. Specifically, we use the knary synthetic benchmark to show that the runtime of an application on P processors can be accurately modeled as TP ≈ T1/P + c∞·T∞, where c∞ ≈ 1.5. This result shows that we obtain nearly perfect linear speedup when the critical path is short compared with the average amount of work per processor. We also show that a model of this kind is accurate even for ⋆Socrates, which is our most complex application programmed to date and which does not obey all the assumptions assumed by the theoretical analyses in Section 6.

A good scheduler should run an application with T1 work in T1/P time on P processors. Such perfect linear speedup cannot be obtained whenever T∞ > T1/P, since we always have TP ≥ T∞, or more generally, TP ≥ max(T1/P, T∞). The critical path T∞ is the stronger lower bound on TP whenever P exceeds the average parallelism T1/T∞, and T1/P is the stronger bound otherwise. A good scheduler should meet each of these bounds as closely as possible.

In order to investigate how well the Cilk scheduler meets these two lower bounds, we used our knary benchmark (described in Section 4), which can exhibit a range of values for work and critical path.
of the overall execution time. The real measure of the quality of a
5 Modeling performance scheduler is how much larger T1 =T1 must be than P before TP
shows substantial influence from the critical path. One can see from
In this section, we further document the effectiveness of the Cilk Figure 5 that if the average parallelism exceeds P by a factor of 10,
scheduler by showing empirically that it schedules applications in the critical path has almost no impact on the running time.
a near-optimal fashion. Specifically, we use the knary synthetic To confirm our simple model of the Cilk scheduler’s performance
benchmark to show that the runtime of an application on P processors on a real application, we ran ?Socrates on a variety of chess positions.
can be accurately modeled as TP  T1 =P + c1 T1 , where c1  Figure 6 shows the results of our study, which confirm the results
1:5. This result shows that we obtain nearly perfect linear speedup from the knary synthetic benchmarks. The curve shown is the best
fit to TP = c1 (T1 =P ) + c1 (T1 ), where c1 = 1:067 0:0141 

when the critical path is short compared with the average amount
of work per processor. We also show that a model of this kind is and c1 = 1:042 0:0467 with 95 percent confidence. The R2
accurate even for ?Socrates, which is our most complex application correlation coefficient of the fit is 0:9994, and the mean relative error
programmed to date and which does not obey all the assumptions is 4:05 percent.
assumed by the theoretical analyses in Section 6. Indeed, as some of us were developing and tuning heuristics to
A good scheduler should to run an application with T1 work in increase the performance of ?Socrates, we used work and critical
T1 =P time on P processors. Such perfect linear speedup cannot be path as our measures of progress. This methodology let us avoid
obtained whenever T1 > T1 =P , since we always have TP  T1 , being trapped by the following interesting anomaly. We made an
or more generally, TP  f g
max T1 =P; T1 . The critical path T1 “improvement” that sped up the program on 32 processors. From
is the stronger lower bound on TP whenever P exceeds the average our measurements, however, we discovered that it was faster only
parallelism T1 =T1 , and T1 =P is the stronger bound otherwise. A because it saved on work at the expense of a much longer critical path.
good scheduler should meet each of these bounds as closely as possible. Using the simple model TP = T1 =P + T1 , we concluded that on
In order to investigate how well the Cilk scheduler meets these two a 512-processor machine, which was our platform for tournaments,
lower bounds, we used our knary benchmark (described in Section 4), the “improvement” would yield a loss of performance, a fact that
which can exhibit a range of values for work and critical path. we later verified. Measuring work and critical path enabled us to
Figure 5 shows the outcome of many experiments of running use experiments on a 32-processor machine to improve our program
knary with various values for k, n, r, and P . The figure plots for the 512-processor machine, but without using the 512-processor
To confirm our simple model of the Cilk scheduler's performance on a real application, we ran ⋆Socrates on a variety of chess positions. Figure 6 shows the results of our study, which confirm the results from the knary synthetic benchmarks. The curve shown is the best fit to TP = c1(T1/P) + c∞(T∞), where c1 = 1.067 ± 0.0141 and c∞ = 1.042 ± 0.0467 with 95 percent confidence. The R² correlation coefficient of the fit is 0.9994, and the mean relative error is 4.05 percent.

Figure 6: Normalized speedups for the ⋆Socrates chess program. (As in Figure 5, the measured values are plotted against the critical-path and linear-speedup bounds, the model curves TP = 1.000·T1/P + 1.000·T∞ and TP = 1.000·T1/P + 2.000·T∞, and the curve fit TP = 1.067·T1/P + 1.042·T∞.)

Indeed, as some of us were developing and tuning heuristics to increase the performance of ⋆Socrates, we used work and critical path as our measures of progress. This methodology let us avoid being trapped by the following interesting anomaly. We made an "improvement" that sped up the program on 32 processors. From our measurements, however, we discovered that it was faster only because it saved on work at the expense of a much longer critical path. Using the simple model TP = T1/P + T∞, we concluded that on a 512-processor machine, which was our platform for tournaments, the "improvement" would yield a loss of performance, a fact that we later verified. Measuring work and critical path enabled us to use experiments on a 32-processor machine to improve our program for the 512-processor machine, but without using the 512-processor machine, on which computer time was scarce.
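The arithmetic behind this conclusion is easy to reproduce from the model TP = T1/P + T∞. The numbers below are hypothetical, chosen only to illustrate the shape of the trade-off; they are not the ⋆Socrates measurements:

\[
\begin{aligned}
\text{before: } T_1 = 2048,\ T_\infty = 1 &: & T_{32} &= 2048/32 + 1 = 65, & T_{512} &= 2048/512 + 1 = 5,\\
\text{after: } T_1 = 1600,\ T_\infty = 8 &: & T_{32} &= 1600/32 + 8 = 58, & T_{512} &= 1600/512 + 8 \approx 11.1.
\end{aligned}
\]

On 32 processors the reduced work dominates and the "improvement" looks faster, but on 512 processors the longer critical path dominates and the same change loses performance.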
6 A theoretical analysis of the Cilk scheduler

In this section we use algorithmic analysis techniques to prove that for the class of "fully strict" Cilk programs, Cilk's work-stealing scheduling algorithm is efficient with respect to space, time, and communication. A fully strict program is one for which each thread sends arguments only to its parent's successor threads. For this class of programs, we prove the following three bounds on space, time, and communication:

Space: The space used by a P-processor execution is bounded by SP ≤ S1·P, where S1 denotes the space used by the serial execution of the Cilk program. This bound is existentially optimal to within a constant factor [3].

Time: With P processors, the expected execution time, including scheduling overhead, is bounded by TP = O(T1/P + T∞). Since both T1/P and T∞ are lower bounds for any P-processor execution, our expected time bound is within a constant factor of optimal.

Communication: The expected number of bytes communicated during a P-processor execution is O(T∞·P·Smax), where Smax denotes the largest size of any closure. This bound is existentially optimal to within a constant factor [41].

The expected time bound and the expected communication bound can be converted into high-probability bounds at the cost of only a small additive term in both cases. Proofs of these bounds use generalizations of the techniques developed in [3]. We defer complete proofs and give outlines here.
Figure 7: The closures at some time during a 1-processor execution. Data-dependency edges are not shown. The black nodes represent ready closures, the gray nodes represent waiting closures, and white nodes represent closures that have already been executed. The black and gray closures are allocated and consume space, but the white closures have been deallocated. Gray, curved edges represent the additional edges in D′ that do not also belong to D.

The space bound follows from the "busy-leaves" property which characterizes the allocated closures at all times during the execution. At any given time during the execution, we say that a closure is a leaf if it has no allocated child closures, and we say that a leaf closure is a primary leaf if, in addition, it has no left-sibling closures allocated. In Figure 7, which shows the allocated closures at some time during an execution, closure a is the only primary leaf. Closure b is a leaf, but it is not primary, since it has left siblings, and closure c is not a leaf, because a and its two siblings are counted as children of c. The busy-leaves property states that every primary leaf closure has a processor working on it. To prove the space bound, we show that Cilk's scheduler maintains the busy-leaves property, and then we show that the busy-leaves property implies the space bound.
Theorem 1. For any fully strict Cilk program, if S1 is the space used to execute the program on 1 processor, then with any number P of processors, Cilk's work-stealing scheduler uses at most S1·P space.

Proof: We first show by induction on execution time that Cilk's work-stealing scheduler maintains the busy-leaves property. We then show that the busy-leaves property implies the space bound.

To see that Cilk's scheduler maintains the busy-leaves property, we consider the three possible ways that a primary-leaf closure can be created. First, when a thread spawns children, the leftmost of these children is a primary leaf. Second, when a thread completes and its closure is freed, if that closure has a right sibling and that sibling has no children, then the right-sibling closure becomes a primary leaf. And third, when a thread completes and its closure is freed, if that closure has no allocated siblings, then the leftmost closure of its parent's successor threads is a primary leaf. The induction follows by observing that in all three of these cases, Cilk's scheduler guarantees that a processor works on the new primary leaf. In the third case we use the fact that a newly activated closure is posted on the processor that activated it and not on the processor on which it was residing.

The space bound SP ≤ S1·P is obtained by showing that every allocated closure can be associated with a primary leaf and that the total space of all closures assigned to a given primary leaf is at most S1. Since Cilk's scheduler keeps all primary leaves busy, with P processors we are guaranteed that at every time during the execution, at most P primary-leaf closures can be allocated, and hence the total amount of space is at most S1·P.

We associate each allocated closure with a primary leaf as follows. If the closure is a primary leaf, it is assigned to itself. Otherwise, if the closure has any allocated children, then it is assigned to the same primary leaf as its leftmost child. If the closure is a leaf but has some left siblings, then the closure is assigned to the same primary leaf as its leftmost sibling. In this recursive fashion, we assign every allocated closure to a primary leaf. Now, we consider the set of closures assigned to a given primary leaf. The total space of these closures is at most S1, because this set of closures is a subset of the closures that are allocated during a 1-processor execution when the processor is executing this primary leaf, which completes the proof.

We now give the theorems bounding execution time and communication cost. Proofs for these theorems generalize the results of [3] for a more restricted model of multithreaded computation. As in [3], these proofs assume a communication model in which messages are delayed only by contention at destination processors, but no assumptions are made about the order in which contending messages are delivered [31]. The bounds given by these theorems assume that no thread has more than one successor thread.

The proofs of these theorems are analogous to the proofs of Theorems 12 and 13 in [3]. We show that certain "critical" threads are likely to be executed after only a modest number of steal requests, and that executing a critical thread guarantees progress on the critical path of the dag.

We first construct an augmented dag D′ that will be used to define the critical threads. The dag D′ is constructed by adding edges to the original dag D of the computation. For each child procedure v of a thread t, we add an edge to D′ from the first thread of v to the first thread of the next child procedure spawned by t after v is spawned. We make the technical assumption that the first thread of each procedure executes in zero time, since we can add a zero-time thread to the beginning of each procedure without affecting work or depth. An example of the dag D′ is given in Figure 7, where the additional edges are shown gray and curved. We draw the children spawned by a node in right-to-left order in the figure, because the execution order by the local processor is left to right, corresponding to LIFO execution. The dag D′ is constructed for analytic purposes only and has no effect on the scheduling of the threads. An important property of D′ is that its critical path is the same as the critical path of the original dag D.

We next define the notion of a critical thread formally. We have already defined a ready thread as a thread all of whose predecessors in D have been executed. Similarly, a critical thread is a thread all of whose predecessors in D′ have been executed. A critical thread must be ready, but a ready thread may or may not be critical. We now state a lemma which shows that a critical thread must be the shallowest thread in a ready queue.

Lemma 2. During the execution of any fully strict Cilk program for which no thread has more than one successor thread, any critical thread must be the shallowest thread in a ready queue. Moreover, the critical thread is also first in the steal order.

Proof: For a thread t to be critical, the following conditions must hold for the ready queue on the processor in which t is enqueued:

1. No right siblings of t are in the ready queue. If a right-sibling procedure v of t were in the ready queue, then the first thread of v would not have been executed, and because the first thread of v is a predecessor of t in D′, t would not be critical.
2. No right siblings of any of t's ancestors are in the ready queue. This fact follows from the same reasoning as above.
3. No left siblings of any of t's ancestors are in the ready queue. This condition must hold because all of these siblings occur before t's parent in the local execution order, and t's parent must have been executed for t to be critical.
4. No successor threads of t's ancestors are enabled. This condition must be true, because any successor thread must wait for all children to complete before it is enabled. Since t has not completed, no successor threads of t's ancestors are enabled. This condition makes use of the fact that the computation is fully strict, which implies that the only thread to which t can send its result is t's parent's unique successor.

A consequence of these conditions is that no thread could possibly be above t in the ready queue, because all threads above t are either already executed, stolen, or not enabled. In t's level, t is first in the work-stealing order, because it is the rightmost thread at that level.
Theorem 3. For any number P of processors and any fully strict Cilk program in which each thread has at most one successor, if the program has work T1 and critical path length T∞, then Cilk's work-stealing scheduler executes the program in expected time E[TP] = O(T1/P + T∞). Furthermore, for any ε > 0, the execution time is TP = O(T1/P + T∞ + lg P + lg(1/ε)) with probability at least 1 − ε.

Proof: This proof is just a straightforward application of the techniques in [3], using our Lemma 2 as a substitute for Lemma 9 in [3]. Because the critical threads are first in the work-stealing order, they are likely to be stolen (or executed locally) after a modest number of steal requests. This fact can be shown formally using a delay-sequence argument.

Theorem 4. For any number P of processors and any fully strict Cilk program in which each thread has at most one successor, if the program has critical path length T∞ and maximum closure size Smax, then Cilk's work-stealing scheduler incurs expected communication O(T∞·P·Smax). Furthermore, for any ε > 0, the communication cost is O((T∞ + lg(1/ε))·P·Smax) with probability at least 1 − ε.

Proof: This proof is exactly analogous to the proof of Theorem 13 in [3]. We observe that at most O(T∞·P) steal attempts occur in an execution, and all communication costs can be associated with one of these steal requests such that at most O(Smax) communication is associated with each steal request. The high-probability bound is analogous.
7 Conclusion

To produce high-performance parallel applications, programmers often focus on communication costs and execution time, quantities that are dependent on specific machine configurations. We argue that a programmer should think instead about work and critical path, abstractions that can be used to characterize the performance of an algorithm independent of the machine configuration. Cilk provides a programming model in which work and critical path are observable quantities, and it delivers guaranteed performance as a function of these quantities. Work and critical path have been used in the theory community for years to analyze parallel algorithms [26]. Blelloch [2] has developed a performance model for data-parallel computations based on these same two abstract measures. He cites many advantages to such a model over machine-based models. Cilk provides a similar performance model for the domain of asynchronous, multithreaded computation.

Although Cilk offers performance guarantees, its current capabilities are limited, and programmers find its explicit continuation-passing style to be onerous. Cilk is good at expressing and executing dynamic, asynchronous, tree-like, MIMD computations, but it is not yet ideal for more traditional parallel applications that can be programmed effectively in, for example, a message-passing, data-parallel, or single-threaded, shared-memory style. We are currently working on extending Cilk's capabilities to broaden its applicability. A major constraint is that we do not want new features to destroy Cilk's guarantees of performance. Our current research focuses on implementing "dag-consistent" shared memory, which allows programs to operate on shared memory without costly communication or hardware support; on providing a linguistic interface that produces continuation-passing code for our runtime system from a more traditional call-return specification of spawns; and on incorporating persistent threads and less strict semantics in ways that do not destroy the guaranteed performance of our scheduler. Recent information about Cilk is maintained on the World Wide Web in page http://theory.lcs.mit.edu/~cilk.

Acknowledgments

We gratefully acknowledge the inspiration of Michael Halbherr, now of the Boston Consulting Group in Zurich, Switzerland. Mike's PCM runtime system [18] developed at MIT was the precursor of Cilk, and many of the design decisions in Cilk are owed to him. We thank Shail Aditya and Sivan Toledo of MIT and Larry Rudolph of Hebrew University for helpful discussions. Xinmin Tian of McGill University provided helpful suggestions for improving the paper. Rolf Riesen of Sandia National Laboratories ported Cilk to the Intel Paragon MPP running under the SUNMOS operating system, John Litvin and Mike Stupak ported Cilk to the Paragon running under OSF, and Andy Shaw of MIT ported Cilk to the Silicon Graphics Power Challenge SMP. Thanks to Matteo Frigo and Rob Miller of MIT for their many contributions to the Cilk system. Thanks to the Scout project at MIT and the National Center for Supercomputing Applications at the University of Illinois, Urbana-Champaign for access to their CM5 supercomputers for running our experiments. Finally, we acknowledge the influence of Arvind and his dataflow research group at MIT. Their pioneering work attracted us to this path, and their vision continues to draw us forward.
References

[1] Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska, and Henry M. Levy. Scheduler activations: Effective kernel support for the user-level management of parallelism. In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, pages 95–109, Pacific Grove, California, October 1991.

[2] Guy E. Blelloch. Programming parallel algorithms. In Proceedings of the 1992 Dartmouth Institute for Advanced Graduate Studies (DAGS) Symposium on Parallel Computation, pages 11–18, Hanover, New Hampshire, June 1992.

[3] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science, pages 356–368, Santa Fe, New Mexico, November 1994.

[4] Robert D. Blumofe and David S. Park. Scheduling large-scale parallel computations on networks of workstations. In Proceedings of the Third International Symposium on High Performance Distributed Computing, pages 96–105, San Francisco, California, August 1994.

[5] Richard P. Brent. The parallel evaluation of general arithmetic expressions. Journal of the ACM, 21(2):201–206, April 1974.

[6] Eric A. Brewer and Robert Blumofe. Strata: A multi-layer communications library. Technical Report to appear, MIT Laboratory for Computer Science. Available as ftp://ftp.lcs.mit.edu/pub/supertech/strata/strata.tar.Z.

[7] F. Warren Burton and M. Ronan Sleep. Executing functional programs on a virtual tree of processors. In Proceedings of the 1981 Conference on Functional Programming Languages and Computer Architecture, pages 187–194, Portsmouth, New Hampshire, October 1981.

[8] Martin C. Carlisle, Anne Rogers, John H. Reppy, and Laurie J. Hendren. Early experiences with Olden. In Proceedings of the Sixth Annual Workshop on Languages and Compilers for Parallel Computing, Portland, Oregon, August 1993.

[9] Rohit Chandra, Anoop Gupta, and John L. Hennessy. COOL: An object-based language for parallel programming. IEEE Computer, 27(8):13–26, August 1994.

[10] Jeffrey S. Chase, Franz G. Amador, Edward D. Lazowska, Henry M. Levy, and Richard J. Littlefield. The Amber system: Parallel programming on a network of multiprocessors. In Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, pages 147–158, Litchfield Park, Arizona, December 1989.

[11] Eric C. Cooper and Richard P. Draves. C Threads. Technical Report CMU-CS-88-154, School of Computer Science, Carnegie-Mellon University, June 1988.

[12] David E. Culler, Anurag Sah, Klaus Erik Schauser, Thorsten von Eicken, and John Wawrzynek. Fine-grain parallelism with minimal hardware support: A compiler-controlled threaded abstract machine. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 164–175, Santa Clara, California, April 1991.

[13] Rainer Feldmann, Peter Mysliwietz, and Burkhard Monien. Studying overheads in massively parallel min/max-tree evaluation. In Proceedings of the Sixth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 94–103, Cape May, New Jersey, June 1994.

[14] Raphael Finkel and Udi Manber. DIB—a distributed implementation of backtracking. ACM Transactions on Programming Languages and Systems, 9(2):235–256, April 1987.

[15] Vincent W. Freeh, David K. Lowenthal, and Gregory R. Andrews. Distributed Filaments: Efficient fine-grain parallelism on a cluster of workstations. In Proceedings of the First Symposium on Operating Systems Design and Implementation, pages 201–213, Monterey, California, November 1994.

[16] R. L. Graham. Bounds for certain multiprocessing anomalies. The Bell System Technical Journal, 45:1563–1581, November 1966.

[17] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17(2):416–429, March 1969.

[18] Michael Halbherr, Yuli Zhou, and Chris F. Joerg. MIMD-style parallel programming with continuation-passing threads. In Proceedings of the 2nd International Workshop on Massive Parallelism: Hardware, Software, and Applications, Capri, Italy, September 1994.

[19] Robert H. Halstead, Jr. Multilisp: A language for concurrent symbolic computation. ACM Transactions on Programming Languages and Systems, 7(4):501–538, October 1985.

[20] W. Hillis and G. Steele. Data parallel algorithms. Communications of the ACM, 29(12):1170–1183, December 1986.

[21] Wilson C. Hsieh, Paul Wang, and William E. Weihl. Computation migration: Enhancing locality for distributed-memory parallel systems. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 239–248, San Diego, California, May 1993.

[22] Suresh Jagannathan and Jim Philbin. A customizable substrate for concurrent languages. In Proceedings of the ACM SIGPLAN ’92 Conference on Programming Language Design and Implementation, pages 55–67, San Francisco, California, June 1992.

[23] Chris Joerg and Bradley C. Kuszmaul. Massively parallel chess. In Proceedings of the Third DIMACS Parallel Implementation Challenge, Rutgers University, New Jersey, October 1994. Available as ftp://theory.lcs.mit.edu/pub/cilk/dimacs94.ps.Z.

[24] L. V. Kalé. The Chare kernel parallel programming system. In Proceedings of the 1990 International Conference on Parallel Processing, Volume II: Software, pages 17–25, August 1990.

[25] Vijay Karamcheti and Andrew Chien. Concert—efficient runtime support for concurrent object-oriented programming languages on stock hardware. In Supercomputing ’93, pages 598–607, Portland, Oregon, November 1993.

[26] Richard M. Karp and Vijaya Ramachandran. Parallel algorithms for shared-memory machines. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science—Volume A: Algorithms and Complexity, chapter 17, pages 869–941. MIT Press, Cambridge, Massachusetts, 1990.

[27] Richard M. Karp and Yanjun Zhang. Randomized parallel algorithms for backtrack search and branch-and-bound computation. Journal of the ACM, 40(3):765–789, July 1993.

[28] David A. Kranz, Robert H. Halstead, Jr., and Eric Mohr. Mul-T: A high-performance parallel Lisp. In Proceedings of the SIGPLAN ’89 Conference on Programming Language Design and Implementation, pages 81–90, Portland, Oregon, June 1989.

[29] Bradley C. Kuszmaul. Synchronized MIMD Computing. Ph.D. thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1994. Available as MIT Laboratory for Computer Science Technical Report MIT/LCS/TR-645 or ftp://theory.lcs.mit.edu/pub/bradley/phd.ps.Z.

[30] Charles E. Leiserson, Zahi S. Abuhamdeh, David C. Douglas, Carl R. Feynman, Mahesh N. Ganmukhi, Jeffrey V. Hill, W. Daniel Hillis, Bradley C. Kuszmaul, Margaret A. St. Pierre, David S. Wells, Monica C. Wong, Shaw-Wen Yang, and Robert Zak. The network architecture of the Connection Machine CM-5. In Proceedings of the Fourth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 272–285, San Diego, California, June 1992.

[31] Pangfeng Liu, William Aiello, and Sandeep Bhatt. An atomic model for message-passing. In Proceedings of the Fifth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 154–163, Velen, Germany, June 1993.

[32] Eric Mohr, David A. Kranz, and Robert H. Halstead, Jr. Lazy task creation: A technique for increasing the granularity of parallel programs. IEEE Transactions on Parallel and Distributed Systems, 2(3):264–280, July 1991.

[33] Rishiyur S. Nikhil. A multithreaded implementation of Id using P-RISC graphs. In Proceedings of the Sixth Annual Workshop on Languages and Compilers for Parallel Computing, number 768 in Lecture Notes in Computer Science, pages 390–405, Portland, Oregon, August 1993. Springer-Verlag.

[34] Rishiyur S. Nikhil. Cid: A parallel, shared-memory C for distributed-memory machines. In Proceedings of the Seventh Annual Workshop on Languages and Compilers for Parallel Computing, August 1994.

[35] Vijay S. Pande, Christopher F. Joerg, Alexander Yu Grosberg, and Toyoichi Tanaka. Enumerations of the Hamiltonian walks on a cubic sublattice. Journal of Physics A, 27, 1994.

[36] Martin C. Rinard, Daniel J. Scales, and Monica S. Lam. Jade: A high-level, machine-independent language for parallel programming. Computer, 26(6):28–38, June 1993.

[37] Larry Rudolph, Miriam Slivkin-Allalouf, and Eli Upfal. A simple load balancing scheme for task allocation in parallel machines. In Proceedings of the Third Annual ACM Symposium on Parallel Algorithms and Architectures, pages 237–245, Hilton Head, South Carolina, July 1991.

[38] V. S. Sunderam. PVM: A framework for parallel distributed computing. Concurrency: Practice and Experience, 2(4):315–339, December 1990.

[39] Andrew S. Tanenbaum, Henri E. Bal, and M. Frans Kaashoek. Programming a distributed system using shared objects. In Proceedings of the Second International Symposium on High Performance Distributed Computing, pages 5–12, Spokane, Washington, July 1993.

[40] Mark T. Vandevoorde and Eric S. Roberts. WorkCrews: An abstraction for controlling parallelism. International Journal of Parallel Programming, 17(4):347–366, August 1988.

[41] I-Chen Wu and H. T. Kung. Communication complexity for parallel divide-and-conquer. In Proceedings of the 32nd Annual Symposium on Foundations of Computer Science, pages 151–162, San Juan, Puerto Rico, October 1991.