Parallel Prefix Sum
CS2101
Plan
1 Problem Statement and Applications
2 Algorithms
3 Applications
4 Implementation in Julia
Problem Statement and Applications
Overview
This chapter is the first one dedicated to the applications of a
parallel algorithm.
This algorithm, called the parallel scan, a.k.a. the parallel prefix sum, is
a beautiful idea with surprising uses: it is a powerful recipe for turning
serial computations into parallel ones.
Watch closely what is being optimized for: this is an amazing lesson
in parallelization.
Applications of the parallel scan are numerous:
• it is used in program compilation, scientific computing, and more;
• we already met the prefix sum in the counting-sort algorithm!
Problem Statement and Applications
Prefix sum
Given x = (x_1, x_2, ..., x_n), the prefix sum of x is the sequence y = (y_1, y_2, ..., y_n)
defined by y_1 = x_1 and y_k = y_{k-1} + x_k for 2 ≤ k ≤ n; equivalently, y_k = x_1 + x_2 + ... + x_k.
Remark
So a Julia implementation of the above specification would be:
function prefixSum(x)
    n = length(x)
    y = fill(x[1], n)        # y[1] = x[1]
    for i = 2:n
        y[i] = y[i-1] + x[i] # each entry depends on the previous one
    end
    y
end
n = 10
x = rand(1:100, n)   # example input (assumed; the slide does not show how x is built)
prefixSum(x)
Comments (1/2)
The i-th iteration of the loop depends directly on the result of the (i − 1)-th
iteration.
Impossible to parallelize, right?
Problem Statement and Applications
Comments (2/2)
Consider again x = (1, 2, 3, 4, 5, 6, 7, 8) and its prefix sum
y = (1, 3, 6, 10, 15, 21, 28, 36).
Is there any value in adding, say, 4 + 5 + 6 + 7 on its own?
If we separately have 1+2+3, what can we do?
Suppose we added 1+2, 3+4, etc. pairwise, what could we do?
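To make these hints concrete, here is a minimal Julia sketch (using the example x = (1, 2, ..., 8) above): the prefix sums of the pairwise sums give every second entry of y, and one extra subtraction per entry recovers the rest.

x = collect(1:8)
pairs  = [x[k-1] + x[k] for k in 2:2:8]   # pairwise sums: [3, 7, 11, 15]
even_y = cumsum(pairs)                    # [3, 10, 21, 36] = y[2], y[4], y[6], y[8]
odd_y  = even_y .- x[2:2:8]               # [1, 6, 15, 28] = y[1], y[3], y[5], y[7]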
Algorithms
function prefixSum(x)
    n = length(x)
    y = fill(x[1], n)
    for i = 2:n
        y[i] = y[i-1] + x[i]
    end
    y
end
Comments
Recall that this is similar to the cumulative frequency computation
that is done in the counting-sort algorithm.
Observe that this sequential algorithm performs n − 1 additions.
Algorithms
Principles
Assume the input array has n entries and that we have n workers at our disposal.
We aim at doing as much work as possible per parallel step. For simplicity, we assume
that n is a power of 2.
Hence, during the first parallel step, each worker (except the first one) adds the
value it owns to that of its left neighbour: this allows us to compute all sums of the
form x_{k-1} + x_{k-2}, for 2 ≤ k ≤ n.
For this to happen, we need to work out of place. More precisely, we need an
auxiliary array with n entries.
Algorithms
Principles
Recall that the k-th slot, for 2 ≤ k ≤ n, holds x_{k-1} + x_{k-2}.
If n = 4, we can conclude by adding Slot 3 and Slot 1 on one hand, and Slot 4 and
Slot 2 on the other.
More generally, we can perform a second parallel step by adding Slot k and Slot
k − 2, for 3 ≤ k ≤ n.
Algorithms
Principles
Now the k-th slot, for 4 ≤ k ≤ n, holds x_{k-1} + x_{k-2} + x_{k-3} + x_{k-4}.
If n = 8, we can conclude by adding Slot 5 and Slot 1, Slot 6 and Slot 2, Slot 7
and Slot 3, Slot 8 and Slot 4.
More generally, we can perform a third parallel step by adding Slot k and Slot
k − 4 for 5 ≤ k ≤ n.
Algorithms
Pseudo-code
Active Processors P[1], ..., P[n];   // id is the active processor index
for d := 0 to (log(n) - 1) do
    if d is even then
        if id > 2^d then
            M[n + id] := M[id] + M[id - 2^d]
        else
            M[n + id] := M[id]
        end if
    else
        if id > 2^d then
            M[id] := M[n + id] + M[n + id - 2^d]
        else
            M[id] := M[n + id]
        end if
    end if
    if d is odd then M[n + id] := M[id] end if
end for
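As a sanity check, here is a minimal Julia sketch of this naive scheme, simulated sequentially: at step d, slot k (for k > 2^d) receives M[k] + M[k - 2^d], the inner loop standing in for what the n workers do simultaneously, and the buffer B playing the role of the auxiliary half M[n+1], ..., M[2n].

function naiveScan(x)
    n = length(x)                   # assumed to be a power of 2
    M = copy(x)
    for d in 0:(Int(log2(n)) - 1)
        B = copy(M)                 # work out of place within a parallel step
        for k in (2^d + 1):n
            B[k] = M[k] + M[k - 2^d]
        end
        M = B
    end
    return M
end

naiveScan(collect(1:8))             # returns [1, 3, 6, 10, 15, 21, 28, 36]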
Observations
M[n+1], . . . , M[2n] are used to hold the intermediate results at Steps
d = 0, 2, 4, . . . , (log(n) − 2).
Note that at Step d, (n − 2^d) processors are performing an addition.
Moreover, at Step d, the distance between two operands in a sum is 2^d.
Algorithms
Recall
M[n+1], . . . , M[2n] are used to hold the intermediate results at
Steps d = 0, 2, 4, . . . , (log(n) − 2).
Note that at Step d, (n − 2^d) processors are performing an addition.
Moreover, at Step d, the distance between two operands in a sum is
2^d.
Analysis
It follows from the above that the naive parallel algorithm performs
log(n) parallel steps.
Moreover, at each parallel step, at least n/2 additions are performed.
Therefore, this algorithm performs at least (n/2) log(n) additions.
Thus, this algorithm is not work-efficient, since the work of our serial
algorithm is simply n − 1 additions.
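For the record, summing the per-step counts gives the exact total, which confirms the (n/2) log(n) lower bound claimed above:

\sum_{d=0}^{\log_2(n)-1} \left(n - 2^d\right) \;=\; n\log_2(n) - (n-1) \;\ge\; \frac{n}{2}\,\log_2(n) \qquad (n \ge 2).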
Algorithms
Algorithm
Input: x[1], x[2], . . . , x[n], where n is a power of 2.
Step 1: (x[k], x[k − 1]) = (x[k] + x[k − 1], x[k]) for all even k.
Step 2: Recursive call on x[2], x[4], . . . , x[n].
Step 3: x[k − 1] = x[k] − x[k − 1] for all even k.
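A minimal Julia sketch of these three steps (an illustrative in-place version, assuming the length of x is a power of 2; the name recursiveScan! is not from the slides):

function recursiveScan!(x)
    n = length(x)
    n == 1 && return x
    # Step 1: x[k] <- x[k] + x[k-1] and x[k-1] <- old x[k], for all even k
    for k in 2:2:n
        x[k], x[k-1] = x[k] + x[k-1], x[k]
    end
    # Step 2: recursive call on the even-indexed entries (the pairwise sums)
    recursiveScan!(view(x, 2:2:n))
    # Step 3: x[k-1] <- x[k] - x[k-1] recovers the remaining prefix sums
    for k in 2:2:n
        x[k-1] = x[k] - x[k-1]
    end
    return x
end

recursiveScan!(collect(1:8))   # returns [1, 3, 6, 10, 15, 21, 28, 36]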
Algorithms
Analysis
Since the recursive call is applied to an array of size n/2, the total number of
recursive calls is log(n).
Before the recursive call, one performs n/2 additions.
After the recursive call, one performs n/2 subtractions.
Elementary calculations show that this recursive algorithm performs at most a
total of 2n additions and subtractions.
Thus, this algorithm is work-efficient. In addition, it can run in 2 log(n)
parallel steps.
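To spell out the elementary calculation, let T(n) be the number of additions and subtractions; Steps 1 and 3 contribute n/2 operations each, so, with T(1) = 0,

T(n) \;=\; \frac{n}{2} + \frac{n}{2} + T\!\left(\frac{n}{2}\right) \;=\; n + T\!\left(\frac{n}{2}\right) \;\Longrightarrow\; T(n) \;=\; n + \frac{n}{2} + \cdots + 2 \;=\; 2n - 2 \;<\; 2n .

Likewise, each level of recursion contributes two parallel steps (Step 1 and Step 3), for 2 log(n) parallel steps in total.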
Applications
Implementation in Julia
function prefixSum(x)
    n = length(x)
    y = fill(x[1], n)
    for i = 2:n
        y[i] = y[i-1] + x[i]
    end
    y
end

n = 10
x = rand(1:100, n)   # example input (assumed; the slide does not show how x is built)
prefixSum(x)
Implementation in Julia
julia> boring(a,b)=a
# methods for generic function boring
boring(a,b) at none:1
julia> boring2(a,b)=b
# methods for generic function boring2
boring2(a,b) at none:1
Comments
First, we test Julia’s reduce function with different operations.
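For reference, a minimal sketch of such experiments in current Julia (the input 1:8 is illustrative, not from the slides): since boring keeps its left argument and boring2 its right one, reducing with them picks out the first and the last element, whatever the association.

boring(a, b) = a
boring2(a, b) = b
reduce(boring, 1:8)    # 1: every combination keeps the left operand
reduce(boring2, 1:8)   # 8: every combination keeps the right operand
reduce(+, 1:8)         # 36: the usual sum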
Implementation in Julia
julia> Hadamard(3)
8x8 Array{Int64,2}:
1 1 1 1 1 1 1 1
1 -1 1 -1 1 -1 1 -1
1 1 -1 -1 1 1 -1 -1
1 -1 -1 1 1 -1 -1 1
1 1 1 1 -1 -1 -1 -1
1 -1 1 -1 -1 1 -1 1
1 1 -1 -1 -1 -1 1 1
1 -1 -1 1 -1 1 1 -1
Comments
Next, we compute Fibonacci numbers and Hadamard matrices via prefix sum.
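A hedged sketch of how such computations can be expressed with prefix operations in current Julia (illustrative, not the slides' exact code; accumulate is the built-in prefix "sum" for an arbitrary operation):

using LinearAlgebra   # for kron

# Fibonacci via a prefix "sum" whose operation is 2x2 matrix multiplication:
# the (1,1) entry of the k-th partial product of F is the (k+1)-st Fibonacci number.
F = [1 1; 1 0]
partial = accumulate(*, fill(F, 8))
fib = [P[1, 1] for P in partial]               # [1, 2, 3, 5, 8, 13, 21, 34]

# Hadamard(k): the 2^k-by-2^k Hadamard matrix; Hadamard(3) is the 8x8 matrix above.
Hadamard(k) = reduce(kron, fill([1 1; 1 -1], k))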
Implementation in Julia
julia> printnice(x)=println(round(x,3))
# methods for generic function printnice
printnice(x) at none:1
julia> printnice(M[4]*M[3]*M[2]*M[1])
-.466 .906
1.559 -3.447
Comments
In the above we do a prefix multiplication with random matrices.
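An assumed modern equivalent of this experiment (the names are illustrative): build the prefix products M[k] * ... * M[1] with accumulate and compare the last one against the explicit product printed above.

M = [randn(2, 2) for _ in 1:4]
P = accumulate((a, b) -> b * a, M)   # P[k] == M[k] * M[k-1] * ... * M[1]
P[4] ≈ M[4] * M[3] * M[2] * M[1]     # true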
Implementation in Julia
Comments
Here we apply reduce() to function composition, as sketched below.
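A minimal sketch of this idea with illustrative functions (not the slides' code): reducing a list of functions with the composition operator ∘ yields their overall composition.

h = reduce(∘, [sin, cos, tan])   # behaves like sin ∘ cos ∘ tan
h(0.5) == sin(cos(tan(0.5)))     # true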
Implementation in Julia
Comments
We prepare a prefix-sum computation with 8 workers and 8 matrices to
multiply.
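A hedged sketch of this kind of setup in current Julia, using the Distributed standard library (the names n and t are chosen to match the transcript below, but the exact construction is an assumption, not the slides' code):

using Distributed
addprocs(8)                                        # 8 worker processes
n = 2048
t = [@spawnat w randn(n, n) for w in workers()]    # one random n-by-n matrix per worker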
Implementation in Julia
julia> # The serial version requires 7 operations. The parallel version uses
Implementation in Julia
julia> n=2048
2048
julia> tic(); @sync prefix8!(t, *); t_par = toc() #Caution: race condition bug #4330
elapsed time: 7.434856303 seconds
7.434856303
julia> @printf("Serial: %.3f sec Parallel: %.3f sec speedup: %.3fx (theory=1.4x)", t_ser, t_par, t_ser/t_par)
Serial: 10.680 sec Parallel: 7.435 sec speedup: 1.436x (theory=1.4x)
Comments
Now let’s run prefix in parallel on 8 processors.
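Note that tic()/toc() come from pre-1.0 Julia; a hedged modern way to time the serial product of 8 random n-by-n matrices, for comparison, would be:

using Printf
n = 2048
A = [randn(n, n) for _ in 1:8]
t_ser = @elapsed reduce(*, A)
@printf("Serial: %.3f sec\n", t_ser)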