217-lec10
Programming
Lecture 10
Reduction Trees
Partition and Summarize
• A commonly used strategy for processing large input
data sets
– There is no required order of processing elements in a data
set (associative and commutative)
– Partition the data set into smaller chunks
– Have each thread process a chunk
– Use a reduction tree to summarize the results from each
chunk into the final answer
• We will focus on the reduction tree step for now.
• Google and Hadoop MapReduce frameworks are
examples of this pattern
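The partition-and-summarize pattern above can be sketched in host C code; a minimal sequential stand-in where each loop iteration plays the role of one thread's chunk (the function and variable names are illustrative, not from the slides):

```c
#include <stddef.h>

/* Sum one chunk of the data set; each call stands in for one thread. */
static long chunk_sum(const int *data, size_t lo, size_t hi) {
    long s = 0;
    for (size_t i = lo; i < hi; i++)
        s += data[i];
    return s;
}

/* Partition the data into n_chunks pieces, summarize each, and
 * combine the partial results into the final answer. */
long partition_and_summarize(const int *data, size_t n, size_t n_chunks) {
    long total = 0;
    size_t chunk = (n + n_chunks - 1) / n_chunks;   /* ceiling division */
    for (size_t c = 0; c < n_chunks; c++) {
        size_t lo = c * chunk;
        size_t hi = (lo + chunk < n) ? lo + chunk : n;
        if (lo < n)
            total += chunk_sum(data, lo, hi);  /* combine partial results */
    }
    return total;
}
```

In a real MapReduce-style framework the chunk sums would run in parallel and the final combination would itself be a reduction tree.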
Reduction enables other techniques
• Reduction is also needed to clean up after some
commonly used parallelizing transformations
• Privatization
– Multiple threads write into an output location
– Replicate the output location so that each thread has a
private output location
– Use a reduction tree to combine the values of private
locations into the original output location
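Privatization as described above can be sketched with a histogram, a common target for this transformation. This is a sequential stand-in where each loop over `t` models one thread working on its private copy (names and sizes are illustrative, not from the slides):

```c
#include <stddef.h>

#define N_THREADS 4
#define N_BINS 8

/* Each "thread" accumulates into its own private copy of the bins,
 * then a combining pass (the reduction) folds the private copies
 * into the original output location. */
void histogram_privatized(const int *data, size_t n, int bins[N_BINS]) {
    int priv[N_THREADS][N_BINS] = {{0}};

    /* Phase 1: thread t processes a strided slice into its private copy. */
    for (int t = 0; t < N_THREADS; t++)
        for (size_t i = t; i < n; i += N_THREADS)
            priv[t][data[i] % N_BINS]++;

    /* Phase 2: reduce the private copies into the shared output. */
    for (int b = 0; b < N_BINS; b++) {
        bins[b] = 0;
        for (int t = 0; t < N_THREADS; t++)
            bins[b] += priv[t][b];
    }
}
```

The benefit on a GPU is that phase 1 needs no atomic operations on the shared output; only the final reduction touches it.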
What is a reduction computation?
• Summarize a set of input values into one value using a
“reduction operation”
– Max
– Min
– Sum
– Product
– Often with a user-defined reduction operation function, as long
as the operation
• Is associative and commutative
• Has a well-defined identity value (e.g., 0 for sum)
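A minimal sketch of such a user-defined reduction, using a function pointer for the operation plus its identity value (the names `reduce_op`, `op_max`, and `op_sum` are illustrative, not from the slides):

```c
#include <limits.h>

/* A user-defined reduction operation: must be associative and
 * commutative, with a well-defined identity value. */
typedef int (*reduce_op)(int, int);

static int op_max(int a, int b) { return a > b ? a : b; }
static int op_sum(int a, int b) { return a + b; }

/* Initialize the result to the identity, then fold in every input. */
int reduce(const int *data, int n, reduce_op op, int identity) {
    int result = identity;   /* e.g., 0 for sum, INT_MIN for max */
    for (int i = 0; i < n; i++)
        result = op(result, data[i]);
    return result;
}
```

Associativity and commutativity are what allow the same `reduce` to be reordered into a tree without changing the answer.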
An efficient sequential reduction
algorithm performs N operations - O(N)
• Initialize the result as an identity value for the
reduction operation
– Smallest possible value for max reduction
– Largest possible value for min reduction
– 0 for sum reduction
– 1 for product reduction
[Figure: a max-reduction tree over the inputs 3, 7, 4, 6. The first level computes max(3,7) = 7 and max(4,6) = 6; the second level computes max(7,6) = 7.]
A tournament is a reduction tree
What is the reduction operation?
A Quick Analysis
• For N input values, the reduction tree performs
– (1/2)N + (1/4)N + (1/8)N + … + (1/N)N = (1 - 1/N)N = N - 1
operations
– in log(N) steps: 1,000,000 input values take 20 steps
• Assuming that we have enough execution resources
– Average parallelism is (N - 1)/log(N)
• For N = 1,000,000, average parallelism is 50,000
• However, the peak resource requirement is 500,000!
• This is a work-efficient parallel algorithm
– The amount of work done is comparable to the sequential algorithm
– But it is not resource efficient
• Many parallel algorithms are not work efficient
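The operation and step counts above can be checked with a short sketch that walks the levels of the tree (the function name is illustrative):

```c
/* Count the operations a tree reduction performs: at each step the
 * number of remaining partial results halves, so for N = 2^k inputs
 * the tree does N/2 + N/4 + ... + 1 = N - 1 operations in k steps. */
void tree_reduction_cost(long n, long *ops, int *steps) {
    *ops = 0;
    *steps = 0;
    while (n > 1) {
        *ops += n / 2;   /* pairwise combines at this level */
        n /= 2;
        (*steps)++;
    }
}
```

For N = 1,048,576 (2^20) this gives 1,048,575 operations in 20 steps, matching the slide's figures.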
A Sum Reduction Example
• Parallel implementation:
– Recursively halve # of threads, add two values per thread
in each step
– Takes log(n) steps for n elements, requires n/2 threads
[Figure: a three-step tree sum over the data 3 1 7 0 4 1 6 3. Step 1 produces the partial sums 4 7 5 9, step 2 produces 11 14, and step 3 produces the final sum 25; the highlighted entries are the active partial-sum elements at each step.]
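The example can be reproduced with a sequential sketch of the in-place tree sum, where the inner loop stands in for the threads active at one step (illustrative, not the slides' kernel):

```c
/* In-place tree sum with the simple even-index mapping: after the
 * step with a given stride, partial sums live at indices that are
 * multiples of 2*stride. */
float tree_sum(float *data, int n) {
    for (int stride = 1; stride < n; stride *= 2)       /* one tree level per step */
        for (int i = 0; i + stride < n; i += 2 * stride)
            data[i] += data[i + stride];                 /* owner adds its partner */
    return data[0];
}
```

On the data 3 1 7 0 4 1 6 3, the owners' values after each step are 4 7 5 9, then 11 14, then 25, as in the figure.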
Simple Thread Index to Data Mapping
• Each thread is responsible for an even-index location
of the partial sum vector
– One of its two inputs is the value at that location
A Simple Thread Block Design
• Each thread block takes 2 * blockDim.x input elements
• Each thread loads 2 elements into shared memory
__shared__ float partialSum[2*BLOCK_SIZE];
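The loading phase might look like the following kernel fragment; a sketch assuming `BLOCK_SIZE` is a compile-time macro matching the launch configuration and the input length is a multiple of 2 * blockDim.x (no bounds checks):

```cuda
#define BLOCK_SIZE 256   // assumed tile size; must match blockDim.x at launch

__global__ void sumReductionLoad(float *input, float *output) {
    __shared__ float partialSum[2 * BLOCK_SIZE];

    unsigned int t = threadIdx.x;
    unsigned int start = 2 * blockIdx.x * blockDim.x;

    // Each thread loads two input elements into shared memory.
    partialSum[t] = input[start + t];
    partialSum[blockDim.x + t] = input[start + blockDim.x + t];

    // ... reduction steps operate on partialSum ...
}
```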
The Reduction Steps
• No more than half of the threads will be executing after the first step
– All odd-index threads are disabled after the first step
– After the 5th step, entire warps in each block will fail the if test:
poor resource utilization, but no divergence
• This can go on for a while, up to 5 more steps (1024/32 = 32 = 2^5),
where each active warp has only one productive thread, until all warps
in a block retire
– Some warps will still pass the if test, but with divergence, since
only one thread per warp is productive
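A sketch of the whole kernel with this simple mapping, assuming `BLOCK_SIZE` matches the launch configuration and the input length is a multiple of 2 * blockDim.x; this is a reconstruction consistent with the slides, not their exact code:

```cuda
#define BLOCK_SIZE 256   // assumed tile size; must match blockDim.x at launch

__global__ void naiveSumReduction(float *input, float *output) {
    __shared__ float partialSum[2 * BLOCK_SIZE];

    unsigned int t = threadIdx.x;
    unsigned int start = 2 * blockIdx.x * blockDim.x;
    partialSum[t] = input[start + t];
    partialSum[blockDim.x + t] = input[start + blockDim.x + t];

    // Thread t owns even-index location 2*t. The stride doubles each
    // step, and only threads whose index is a multiple of the stride
    // stay active: survivors are scattered, so for large strides some
    // warps fail the if test wholesale (no divergence) while warps
    // containing a survivor diverge.
    for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
        __syncthreads();
        if (t % stride == 0)
            partialSum[2 * t] += partialSum[2 * t + stride];
    }

    if (t == 0)
        output[blockIdx.x] = partialSum[0];
}
```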
Thread Index Usage Matters
• In some algorithms, one can shift the index usage to
improve the divergence behavior
– Commutative and associative operators
A Better Strategy
• Always compact the partial sums into the first
locations in the partialSum[] array
An Example of 16 threads
[Figure: 16 threads over 32 elements. In the first step, thread i adds element i+16 into element i: thread 0 computes 0+16, thread 1 computes 1+17, …, thread 15 computes 15+31.]
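The first compacting step of this example can be sketched sequentially, with the loop index standing in for the thread index (illustrative, not the slides' code):

```c
/* First step of the compacting strategy for 16 threads over 32
 * elements: thread i adds element i+16 into element i, leaving all
 * partial sums in the first 16 locations. */
void first_compacting_step(float *x) {
    for (int i = 0; i < 16; i++)   /* i plays the role of thread i */
        x[i] += x[i + 16];
}
```

Because the surviving partial sums are contiguous, the next step can use the same pattern with 8 threads over the first 16 locations, and so on.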
A Better Reduction Kernel
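The kernel on this slide did not survive extraction; the following is a hedged reconstruction consistent with the compacting strategy just described, assuming `BLOCK_SIZE` matches the launch configuration and the input length is a multiple of 2 * blockDim.x:

```cuda
#define BLOCK_SIZE 256   // assumed tile size; must match blockDim.x at launch

__global__ void betterSumReduction(float *input, float *output) {
    __shared__ float partialSum[2 * BLOCK_SIZE];

    unsigned int t = threadIdx.x;
    unsigned int start = 2 * blockIdx.x * blockDim.x;
    partialSum[t] = input[start + t];
    partialSum[blockDim.x + t] = input[start + blockDim.x + t];

    // Halve the stride each step: active threads stay contiguous at
    // the front of the block, so whole warps retire together and there
    // is no divergence until fewer than 32 threads remain active.
    for (unsigned int stride = blockDim.x; stride >= 1; stride >>= 1) {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }

    if (t == 0)
        output[blockIdx.x] = partialSum[0];
}
```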
A Quick Analysis
A Story about an Old Engineer
• From Hwu/Yale Patt
Parallel Algorithm Overhead
• Although the number of "operations" is N, each "operation"
involves much more complex address calculation and
intermediate result manipulation.
[Figure: a fragment of the sum-reduction tree (partial sums 4 7 5 9) overlaid on the kernel's shared-memory declaration __shared__ float partialSum[2*BLOCK_SIZE];]
ANY MORE QUESTIONS?