queens is a backtrack search program that solves the problem of placing N queens on an N×N chessboard so that no two queens attack each other. The Cilk program is based on serial code by R. Sargent of the MIT Media Laboratory. Thread length was enhanced by serializing the bottom 7 levels of the search tree.

pfold is a protein-folding program [35] written in conjunction with V. Pande of MIT's Center for Material Sciences and Engineering. This program finds hamiltonian paths in a three-dimensional grid using backtrack search. It was the first program to enumerate all hamiltonian paths in a 3×4×4 grid. We timed the enumeration of all paths starting with a certain sequence.

ray is a parallel program for graphics rendering based on the serial POV-Ray program, which uses a ray-tracing algorithm. The entire POV-Ray system contains over 20,000 lines of C code, but the core of POV-Ray is a simple doubly nested loop that iterates over each pixel in a two-dimensional image. For ray we converted the nested loops into a 4-ary divide-and-conquer control structure using spawns.1 Our measurements do not include the approximately 2.4 seconds of startup time required to read and process the scene description file.

knary(k,n,r) is a synthetic benchmark whose parameters can be set to produce a variety of values for work and critical path. It generates a tree of branching factor k and depth n in which the first r children at every level are executed serially and the remainder are executed in parallel. At each node of the tree, the program runs an empty "for" loop for 400 iterations.
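The structure of knary is simple enough to sketch in a few lines. The following plain-C rendering is only a sketch (the function signature is ours, and the Cilk version spawns the parallel children as threads rather than calling them directly):

    /* Schematic rendering of knary(k, n, r); the function signature is ours.
     * In the Cilk benchmark the children after the first r are spawned as
     * parallel threads; here every child is shown as an ordinary call.     */
    void knary(int k, int n, int r)
    {
        /* the "work" performed at every node of the tree */
        for (volatile int i = 0; i < 400; i++)
            ;
        if (n <= 1)       /* assuming depth is counted so that n = 1 is a leaf */
            return;
        for (int c = 0; c < k; c++) {
            /* the first r children run serially; in Cilk, the remaining
             * k - r children are spawned and may run in parallel          */
            knary(k, n - 1, r);
        }
    }

Increasing r serializes more of each node's subtrees and so lengthens the critical path, while k and n together determine the total work.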
1 Initially, the serial POV-Ray program was about 5 percent slower than the Cilk version running on one processor. The reason was that the divide-and-conquer decomposition performed by the Cilk code provides better locality than the doubly nested loop of the serial code. Modifying the serial code to imitate the Cilk decomposition improved its performance. Timings for the improved version are given in the table.

2 For the ⋆Socrates program, T1 is not the measured execution time, but rather it is an estimate of the work obtained by summing the execution times of all threads, which yields a slight underestimate. ⋆Socrates is an unusually complicated application, because its speculative execution yields unpredictable work and critical path. Consequently, the measured runtime on one processor does not accurately reflect the work on P > 1 processors.
                    fib          queens      pfold       ray         knary       knary       ⋆Socrates    ⋆Socrates
                    (33)         (15)        (3,3,4)     (500,500)   (10,5,2)    (10,4,1)    (depth 10)   (depth 10)
                                                                                             (32 proc.)   (256 proc.)
(application parameters)
Tserial             8.487        252.1       615.15      729.2       288.6       40.993      1665         1665
T1                  73.16        254.6       647.8       732.5       314.6       45.43       3644         7023
Tserial/T1          0.116        0.9902      0.9496      0.9955      0.9174      0.9023      0.4569       0.2371
T∞                  0.000326     0.0345      0.04354     0.0415      4.458       0.255       3.134        3.24
T1/T∞               224417       7380        14879       17650       70.56       178.2       1163         2168
threads             17,108,660   210,740     9,515,098   424,475     5,859,374   873,812     26,151,774   51,685,823
thread length       4.276µs      1208µs      68.08µs     1726µs      53.69µs     51.99µs     139.3µs      135.9µs
(32-processor experiments)
TP                  2.298        8.012       20.26       21.68       15.13       1.633       126.1        -
T1/P + T∞           2.287        7.991       20.29       22.93       14.28       1.675       117.0        -
T1/TP               31.84        31.78       31.97       33.79       20.78       27.81       28.90        -
T1/(P·TP)           0.9951       0.9930      0.9992      1.0558      0.6495      0.8692      0.9030       -
space/proc.         70           95          47          39          41          42          386          -
requests/proc.      185.8        48.0        88.6        218.1       92639       3127        23484        -
steals/proc.        56.63        18.47       26.06       79.25       18031       1034        2395         -
(256-processor experiments)
TP                  0.2892       1.045       2.590       2.765       8.590       0.4636      -            34.32
T1/P + T∞           0.2861       1.029       2.574       2.903       5.687       0.4325      -            30.67
T1/TP               253.0        243.7       250.1       265.0       36.62       98.00       -            204.6
T1/(P·TP)           0.9882       0.9519      0.9771      1.035       0.1431      0.3828      -            0.7993
space/proc.         66           76          47          32          48          40          -            405
requests/proc.      73.66        80.40       97.79       82.75       151803      7527        -            30646
steals/proc.        24.10        21.20       23.05       18.34       6378        550         -            1540

Table 4: Performance of Cilk on various applications. All times are in seconds, except where noted.
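As a reading aid (our arithmetic, using the table's own entries): the row labeled T1/(P·TP) gives the fraction of perfect linear speedup achieved. For queens on 32 processors, T1/(P·TP) = 254.6/(32 × 8.012) ≈ 0.993, that is, about 99 percent of perfect linear speedup.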
The row labeled "space/proc." gives the maximum number of closures allocated at any time on any processor. The row labeled "requests/proc." indicates the average number of steal requests made by a processor during the execution, and "steals/proc." gives the average number of closures actually stolen.

The data in Table 4 shows two important relationships: one between efficiency and thread length, and another between speedup and average parallelism.

Considering the relationship between efficiency Tserial/T1 and thread length, we see that for programs with moderately long threads, the Cilk scheduler induces very little overhead. The queens, pfold, ray, and knary programs have threads with average length greater than 50 microseconds and have efficiency greater than 90 percent. On the other hand, the fib program has low efficiency, because the threads are so short: fib does almost nothing besides spawn and send_argument.

Despite its long threads, the ⋆Socrates program has low efficiency, because its parallel Jamboree search algorithm [29] is based on speculatively searching subtrees that are not searched by a serial algorithm. Consequently, as we increase the number of processors, the program executes more threads and, hence, does more work. For example, the 256-processor execution did 7023 seconds of work whereas the 32-processor execution did only 3644 seconds of work. Both of these executions did considerably more work than the serial program's 1665 seconds of work. Thus, although we observe low efficiency, it is due to the parallel algorithm and not to Cilk overhead.

Looking at the speedup T1/TP measured on 32 and 256 processors, we see that when the average parallelism T1/T∞ is large compared with the number P of processors, Cilk programs achieve nearly perfect linear speedup, but when the average parallelism is small, the speedup is much less. The fib, queens, pfold, and ray programs, for example, have in excess of 7000-fold parallelism and achieve more than 99 percent of perfect linear speedup on 32 processors and more than 95 percent of perfect linear speedup on 256 processors.3 The ⋆Socrates program exhibits somewhat less parallelism and also somewhat less speedup. On 32 processors the ⋆Socrates program has 1163-fold parallelism, yielding 90 percent of perfect linear speedup, while on 256 processors it has 2168-fold parallelism, yielding 80 percent of perfect linear speedup. With even less parallelism, as exhibited in the knary benchmarks, less speedup is obtained. For example, the knary(10,5,2) benchmark exhibits only 70-fold parallelism, and it realizes barely more than 20-fold speedup on 32 processors (less than 65 percent of perfect linear speedup). With 178-fold parallelism, knary(10,4,1) achieves 27-fold speedup on 32 processors (87 percent of perfect linear speedup), but only 98-fold speedup on 256 processors (38 percent of perfect linear speedup).

3 In fact, the ray program achieves superlinear speedup even when comparing to the efficient serial implementation. We suspect that cache effects cause this phenomenon.

Although these speedup measures reflect the Cilk scheduler's ability to exploit parallelism, to obtain application speedup, we must factor in the efficiency of the Cilk program compared with the serial C program. Specifically, the application speedup Tserial/TP is the product of efficiency Tserial/T1 and speedup T1/TP. For example, applications such as fib and ⋆Socrates with low efficiency generate correspondingly low application speedup. The ⋆Socrates program, with efficiency 0.2371 and speedup 204.6 on 256 processors, exhibits application speedup of 0.2371 × 204.6 = 48.51. For the purpose of performance prediction, we prefer to decouple the efficiency of the application from the efficiency of the scheduler.

Looking more carefully at the cost of a spawn in Cilk, we find that it takes a fixed overhead of about 50 cycles to allocate and initialize a closure, plus about 8 cycles for each word argument. In comparison, a C function call on a CM5 processor takes 2 cycles of fixed overhead (assuming no register window overflow) plus 1 cycle for each word argument (assuming all arguments are transferred in registers). Thus, a spawn in Cilk is roughly an order of magnitude more expensive than a C function call. This Cilk overhead is quite apparent in the fib program, which does almost nothing besides spawn and send_argument. Based on fib's measured efficiency of 0.116, we can conclude that the aggregate average cost of a spawn/send_argument in Cilk is between 8 and 9 times the cost of a function call/return in C.
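To see where this factor comes from: fib does essentially one spawn/send_argument for every function call/return in the serial code, so the ratio of parallel work to serial work approximates the per-operation cost ratio, T1/Tserial = 1/(Tserial/T1) = 1/0.116 ≈ 8.6, which indeed lies between 8 and 9.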
Efficient execution of programs with short threads requires a low-overhead spawn operation. As can be observed from Table 4, the vast majority of threads execute on the same processor on which they are spawned. For example, the fib program executed over 17 million threads but migrated only 6170 (24.10 per processor) when run with 256 processors. Taking advantage of this property, other researchers [25, 32] have developed techniques for implementing spawns such that when the child thread executes on the same processor as its parent, the cost of the spawn operation is roughly equal to the cost of a C function call. We hope to incorporate such techniques into future implementations of Cilk.

Finally, we make two observations about the space and communication measures in Table 4.

Looking at the "space/proc." rows, we observe that the space per processor is generally quite small and does not grow with the number of processors. For example, ⋆Socrates on 32 processors executes over 26 million threads, yet no processor ever has more than 386 allocated closures. On 256 processors, the number of executed threads nearly doubles to over 51 million, but the space per processor barely changes. In Section 6 we show formally that for Cilk programs, the space per processor does not grow as we add processors.

Looking at the "requests/proc." and "steals/proc." rows in Table 4, we observe that the amount of communication grows with the critical path but does not grow with the work. For example, fib, queens, pfold, and ray all have critical paths under a tenth of a second long and perform fewer than 220 requests and 80 steals per processor, whereas knary(10,5,2) and ⋆Socrates have critical paths more than 3 seconds long and perform more than 20,000 requests and 1500 steals per processor. The table does not show any clear correlation between work and either requests or steals. For example, ray does more than twice as much work as knary(10,5,2), yet it performs two orders of magnitude fewer requests. In Section 6, we show that for "fully strict" Cilk programs, the communication per processor grows linearly with the critical path length and does not grow as a function of the work.
5 Modeling performance scheduler is how much larger T1 =T1 must be than P before TP
In this section, we further document the effectiveness of the Cilk scheduler by showing empirically that it schedules applications in a near-optimal fashion. Specifically, we use the knary synthetic benchmark to show that the runtime of an application on P processors can be accurately modeled as TP ≈ T1/P + c∞·T∞, where c∞ ≈ 1.5. This result shows that we obtain nearly perfect linear speedup when the critical path is short compared with the average amount of work per processor. We also show that a model of this kind is accurate even for ⋆Socrates, which is our most complex application programmed to date and which does not obey all the assumptions made by the theoretical analyses in Section 6.

A good scheduler should run an application with T1 work in T1/P time on P processors. Such perfect linear speedup cannot be obtained whenever T∞ > T1/P, since we always have TP ≥ T∞, or more generally, TP ≥ max{T1/P, T∞}. The critical path T∞ is the stronger lower bound on TP whenever P exceeds the average parallelism T1/T∞, and T1/P is the stronger bound otherwise. A good scheduler should meet each of these bounds as closely as possible.

In order to investigate how well the Cilk scheduler meets these two lower bounds, we used our knary benchmark (described in Section 4), which can exhibit a range of values for work and critical path.

Figure 5 shows the outcome of many experiments of running knary with various values for k, n, r, and P. The figure plots the speedup T1/TP for each run against the machine size P for that run. In order to compare the outcomes for runs with different parameters, we have normalized the data by dividing the plotted values by the average parallelism T1/T∞. Thus, the horizontal position of each datum is P/(T1/T∞), and the vertical position of each datum is (T1/TP)/(T1/T∞) = T∞/TP. Consequently, on the horizontal axis, the normalized machine size is 1.0 when the average available parallelism is equal to the machine size. On the vertical axis, the normalized speedup is 1.0 when the runtime equals the critical path, and it is 0.1 when the runtime is 10 times the critical path. We can draw the two lower bounds on time as upper bounds on speedup. The horizontal line at 1.0 is the upper bound on speedup obtained from the critical path, and the 45-degree line is the upper bound on speedup obtained from the work per processor. As can be seen from the figure, on the knary runs for which the average parallelism exceeds the number of processors (normalized machine size < 1), the Cilk scheduler obtains nearly perfect linear speedup. In the region where the number of processors is large compared to the average parallelism (normalized machine size > 1), the data is more scattered, but the speedup is always within a factor of 4 of the critical-path upper bound.
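As a concrete instance of this normalization (our arithmetic, using the Table 4 entries for knary(10,5,2) on 32 processors, which has T1 = 314.6, T∞ = 4.458, and TP = 15.13): the normalized machine size is P/(T1/T∞) = 32/70.56 ≈ 0.45, and the normalized speedup is T∞/TP = 4.458/15.13 ≈ 0.29. This run therefore sits slightly to the left of normalized machine size 1.0 and below both upper bounds, consistent with its roughly 65 percent of perfect linear speedup.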
The theoretical results from Section 6 show that the expected running time of an application on P processors is TP = O(T1/P + T∞). Thus, it makes sense to try to fit the data to a curve of the form TP = c1·(T1/P) + c∞·(T∞). A least-squares fit to the data to minimize the relative error yields c1 = 0.9543 ± 0.1775 and c∞ = 1.54 ± 0.3888 with 95 percent confidence. The R² correlation coefficient of the fit is 0.989101, and the mean relative error is 13.07 percent. The curve fit is shown in Figure 5, which also plots the simpler curves TP = T1/P + T∞ and TP = T1/P + 2·T∞ for comparison. As can be seen from the figure, little is lost in the linear speedup range of the curve by assuming that c1 = 1. Indeed, a fit to TP = T1/P + c∞·(T∞) yields c∞ = 1.509 ± 0.3727 with R² = 0.983592 and a mean relative error of 4.04 percent, which is in some ways better than the fit that includes a c1 term. (The R² measure is a little worse, but the mean relative error is much better.)
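The fitting procedure itself is not shown in the paper; the following C sketch (function and variable names are ours) reproduces one way to perform such a fit. Minimizing relative error means minimizing the sum over all runs of ((c1·(T1/P) + c∞·T∞ − TP)/TP)², which after dividing each residual by TP is an ordinary linear least-squares problem with target 1, solved here via its 2×2 normal equations:

    /* One way to fit TP ≈ c1*(T1/P) + cinf*T∞ over a set of measured runs.
     * For run i, a_i = (T1_i/P_i)/TP_i and b_i = T∞_i/TP_i, and we minimize
     * sum_i (c1*a_i + cinf*b_i - 1)^2 via its normal equations (Cramer's rule). */
    void fit_model(const double *t1, const double *tinf, const double *tp,
                   const int *procs, int nruns, double *c1, double *cinf)
    {
        double saa = 0, sab = 0, sbb = 0, sa = 0, sb = 0;
        for (int i = 0; i < nruns; i++) {
            double a = (t1[i] / procs[i]) / tp[i];   /* (T1/P) / TP */
            double b = tinf[i] / tp[i];              /*   T∞ / TP   */
            saa += a * a;  sab += a * b;  sbb += b * b;
            sa  += a;      sb  += b;
        }
        double det = saa * sbb - sab * sab;
        *c1   = (sa * sbb - sb * sab) / det;
        *cinf = (sb * saa - sa * sab) / det;
    }

The constrained fit TP = T1/P + c∞·T∞ is the same computation with c1 fixed at 1, for which the minimizer is c∞ = Σ b_i·(1 − a_i) / Σ b_i².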
It makes sense that the data points become more scattered when P is close to or exceeds the average parallelism. In this range, the amount of time spent in work stealing becomes a significant fraction of the overall execution time. The real measure of the quality of a scheduler is how much larger T1/T∞ must be than P before TP shows substantial influence from the critical path. One can see from Figure 5 that if the average parallelism exceeds P by a factor of 10, the critical path has almost no impact on the running time.

To confirm our simple model of the Cilk scheduler's performance on a real application, we ran ⋆Socrates on a variety of chess positions. Figure 6 shows the results of our study, which confirm the results from the knary synthetic benchmarks. The curve shown is the best fit to TP = c1·(T1/P) + c∞·(T∞), where c1 = 1.067 ± 0.0141 and c∞ = 1.042 ± 0.0467 with 95 percent confidence. The R² correlation coefficient of the fit is 0.9994, and the mean relative error is 4.05 percent.

Indeed, as some of us were developing and tuning heuristics to increase the performance of ⋆Socrates, we used work and critical path as our measures of progress. This methodology let us avoid being trapped by the following interesting anomaly. We made an "improvement" that sped up the program on 32 processors. From our measurements, however, we discovered that it was faster only because it saved on work at the expense of a much longer critical path. Using the simple model TP = T1/P + T∞, we concluded that on a 512-processor machine, which was our platform for tournaments, the "improvement" would yield a loss of performance, a fact that we later verified. Measuring work and critical path enabled us to use experiments on a 32-processor machine to improve our program for the 512-processor machine, but without using the 512-processor machine, on which computer time was scarce.

6 A theoretical analysis of the Cilk scheduler

In this section we use algorithmic analysis techniques to prove that for the class of "fully strict" Cilk programs, Cilk's work-stealing scheduling algorithm is efficient with respect to space, time, and communication. A fully strict program is one for which each thread sends arguments only to its parent's successor threads. For this class of programs, we prove the following three bounds on space, time, and communication:

Space The space used by a P-processor execution is bounded by SP ≤ S1·P, where S1 denotes the space used by the serial execution of the Cilk program. This bound is existentially optimal to within a constant factor [3].

Time With P processors, the expected execution time, including scheduling overhead, is bounded by TP = O(T1/P + T∞). Since both T1/P and T∞ are lower bounds for any P-processor execution, our expected time bound is within a constant factor of optimal.

Communication The expected number of bytes communicated during a P-processor execution is O(T∞·P·Smax), where Smax denotes the largest size of any closure. This bound is existentially optimal to within a constant factor [41].

The expected time bound and the expected communication bound can be converted into high-probability bounds at the cost of only a small
[Figure 5 (plot): normalized speedup versus normalized machine size for the knary runs, showing the critical path bound, the linear speedup bound, the measured values, and the model curve TP = 1.000·T1/P + 1.000·T∞.]

[Figure 6 (plot): the same axes, bounds, measured values, and model curve for the ⋆Socrates runs.]
Figure 6: Normalized speedups for the ⋆Socrates chess program.
[Figure: closures a, b, and c in a spawn tree connected by spawn and spawn_next edges.]

We associate each allocated closure with a primary leaf as follows. If the closure is a primary leaf, it is assigned to itself. Otherwise, if the closure has any allocated children, then it is assigned to the same primary leaf as its leftmost child. If the closure is a leaf but has some left siblings, then the closure is assigned to the same primary leaf as its leftmost sibling. In this recursive fashion, we assign every allocated closure to a primary leaf. Now, we consider the set of closures assigned to a given primary leaf. The total space of these closures is at most S1, because this set of closures is a subset of the closures that are allocated during a 1-processor execution when the processor is executing this primary leaf, which completes the proof.
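The assignment rule can be stated operationally. The sketch below uses a hypothetical closure representation (the Cilk runtime's actual data structures are not shown in this excerpt) and assumes, for the purpose of the sketch, that a primary leaf is an allocated leaf closure with no allocated left siblings, which is the base case of the rule:

    /* Hypothetical representation of an allocated closure in the spawn tree;
     * children are kept in left-to-right spawn order.                        */
    struct closure {
        struct closure *leftmost_child;    /* NULL if no allocated children           */
        struct closure *leftmost_sibling;  /* NULL if the closure has no left sibling */
    };

    /* Returns the primary leaf to which closure c is assigned by the rule above. */
    struct closure *primary_leaf(struct closure *c)
    {
        if (c->leftmost_child != NULL)              /* has allocated children:       */
            return primary_leaf(c->leftmost_child);    /* same leaf as leftmost child   */
        if (c->leftmost_sibling != NULL)            /* a leaf with left siblings:    */
            return primary_leaf(c->leftmost_sibling);  /* same leaf as leftmost sibling */
        return c;                                   /* a primary leaf: assigned to itself */
    }

Because each recursive step moves either down to a leftmost child or left to a leftmost sibling, the recursion terminates at a leaf with no left siblings, so every allocated closure is assigned to exactly one primary leaf.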