217-lec10


CS/EE 217 GPU Architecture and Parallel Programming

Lecture 10: Reduction Trees

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, 2007-2012

1
Objective
• To master Reduction Trees, arguably the most widely
used parallel computation pattern
– Basic concept
– Performance analysis
• Memory coalescing
• Control divergence
• Thread utilization

2
Partition and Summarize
• A commonly used strategy for processing large input
data sets
– There is no required order of processing elements in a data
set (associative and commutative)
– Partition the data set into smaller chunks
– Have each thread process a chunk (sketched after this slide)
– Use a reduction tree to summarize the results from each
chunk into the final answer
• We will focus on the reduction tree step for now.
• Google and Hadoop MapReduce frameworks are
examples of this pattern
3
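A minimal CUDA sketch of the partition step, assuming a hypothetical kernel name partitionAndSum and a global partials[] array with one slot per thread; each thread privately sums a strided chunk of the input, and a reduction tree (introduced in the following slides) then combines the per-thread partials.

__global__ void partitionAndSum(const float* input, float* partials, int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int totalThreads = gridDim.x * blockDim.x;

  float mySum = 0.0f;                    // identity value for sum
  for (int i = tid; i < n; i += totalThreads)
    mySum += input[i];                   // process this thread's chunk

  partials[tid] = mySum;                 // one partial result per thread
}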
Reduction enables other techniques
• Reduction is also needed to clean up after some
commonly used parallelizing transformations

• Privatization
– Multiple threads write into an output location
– Replicate the output location so that each thread has a
private output location
– Use a reduction tree to combine the values of the private locations into the original output location (a sketch follows this slide)

4
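As a hedged illustration of privatization (not from the slides), the sketch below replicates a single contended counter per thread block: each block accumulates into its own shared-memory copy, and one thread per block folds that private copy into the global output. The kernel name is hypothetical, and the final combine uses atomicAdd for brevity; as the slide notes, that combine step can instead be a reduction tree.

__global__ void privatizedCount(const int* input, int n, int* globalCount) {
  __shared__ int blockCount;               // private copy of the output location
  if (threadIdx.x == 0) blockCount = 0;
  __syncthreads();

  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n && input[i] > 0)
    atomicAdd(&blockCount, 1);             // contention stays within the block

  __syncthreads();
  if (threadIdx.x == 0)
    atomicAdd(globalCount, blockCount);    // combine private copies into the output
}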
What is a reduction computation
• Summarize a set of input values into one value using a
“reduction operation”
– Max
– Min
– Sum
– Product
– Often with a user-defined reduction operation (example below), as long as the operation
• Is associative and commutative
• Has a well-defined identity value (e.g., 0 for sum)

5
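For instance, a hypothetical user-defined max operator with its identity value might be written as:

#include <cfloat>

// Max as the reduction operation; its identity is the smallest float value,
// since fmaxf(-FLT_MAX, x) == x for any finite x.
__device__ float reduceOp(float a, float b) { return fmaxf(a, b); }
#define REDUCE_IDENTITY (-FLT_MAX)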
An efficient sequential reduction
algorithm performs N operations - O(N)
• Initialize the result as an identity value for the
reduction operation
– Smallest possible value for max reduction
– Largest possible value for min reduction
– 0 for sum reduction
– 1 for product reduction

• Scan through the input and perform the reduction operation between the result value and the current input value (a minimal sketch follows)
6
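A minimal host-side sketch of this sequential algorithm for a sum reduction (function name assumed):

// Initialize with the identity value, then scan the input, performing one
// reduction operation per element: N operations in total.
float sequentialSum(const float* input, int n) {
  float result = 0.0f;                   // identity value for sum
  for (int i = 0; i < n; ++i)
    result = result + input[i];
  return result;
}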
A parallel reduction tree algorithm
performs N-1 Operations in log(N) steps
[Figure: max-reduction tree: 3 1 7 0 4 1 6 3 → 3 7 4 6 → 7 6 → 7]

7
A tournament is a reduction tree
• What is the reduction operation?

8
A Quick Analysis
• For N input values, the reduction tree performs
– (1/2)N + (1/4)N + (1/8)N + … + (1/N)N = (1 - 1/N)N = N-1 operations (worked out below)
– In log(N) steps: 1,000,000 input values take 20 steps
• Assuming that we have enough execution resources
– Average parallelism is (N-1)/log(N)
• For N = 1,000,000, average parallelism is 50,000
• However, the peak resource requirement is 500,000!
• This is a work-efficient parallel algorithm
– The amount of work done is comparable to the sequential algorithm
• Many parallel algorithms are not work efficient
– But not resource efficient…

9
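Written out as a geometric series:

\frac{N}{2} + \frac{N}{4} + \frac{N}{8} + \dots + 1
  \;=\; N \sum_{k=1}^{\log_2 N} 2^{-k}
  \;=\; N\left(1 - 2^{-\log_2 N}\right)
  \;=\; N - 1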
A Sum Reduction Example
• Parallel implementation:
– Recursively halve # of threads, add two values per thread
in each step
– Takes log(n) steps for n elements, requires n/2 threads

• Assume an in-place reduction using shared memory


– The original vector is in device global memory
– The shared memory is used to hold a partial sum vector
– Each step brings the partial sum vector closer to the sum
– The final sum will be in element 0
– Reduces global memory traffic due to partial sum values
10
Vector Reduction with Branch Divergence
Thread 0   Thread 1   Thread 2   Thread 3   Thread 4   Thread 5

Data:    0    1    2    3    4    5    6    7    8    9   10   11

Step 1:  0+1       2+3       4+5       6+7       8+9       10+11
Step 2:  0...3               4...7               8...11
Step 3:  0...7                                   8...15

(Partial sum elements)

11


A Sum Example
Thread 0   Thread 1   Thread 2   Thread 3

Data:    3    1    7    0    4    1    6    3

Step 1:  4         7         5         9
Step 2:  11                  14
Step 3:  25

(Active partial sum elements at each step)
12
Simple Thread Index to Data Mapping
• Each thread is responsible for an even-index location of the partial sum vector
– One input is the location of responsibility
• After each step, half of the threads are no longer needed
• In each step, one of the inputs comes from an increasing distance away
13
A Simple Thread Block Design
• Each thread block takes 2*blockDim.x input elements
• Each thread loads 2 elements into shared memory

__shared__ float partialSum[2*BLOCK_SIZE];

unsigned int t = threadIdx.x;
unsigned int start = 2*blockIdx.x*blockDim.x;   // first input element for this block

partialSum[t] = input[start + t];                             // first element
partialSum[blockDim.x + t] = input[start + blockDim.x + t];   // second element

14
The Reduction Steps

for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2)
{
  __syncthreads();      // wait for all partial sums of the previous step
  if (t % stride == 0)
    partialSum[2*t] += partialSum[2*t + stride];
}

Why do we need __syncthreads()?
15
Back to the Global Picture
• Thread 0 in each thread block writes the sum of the block, held in partialSum[0], into an output vector indexed by blockIdx.x
• There can be a large number of such sums if the original vector is very large
– The host code may iterate and launch another kernel (sketched below)
• If there are only a small number of sums, the host can simply transfer the data back and add them together.
16
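A hedged sketch of this global structure, assembling the per-block loading code and the improved reduction loop from the surrounding slides into one kernel; the kernel name blockSum, the output array, the bounds checks (padding with the sum identity 0), and the commented host loop are assumptions rather than the course's reference code.

#define BLOCK_SIZE 256     // assumed threads per block

// Each block reduces 2*blockDim.x input elements; thread 0 writes the
// block's sum into output[blockIdx.x].
__global__ void blockSum(const float* input, float* output, unsigned int n) {
  __shared__ float partialSum[2*BLOCK_SIZE];

  unsigned int t = threadIdx.x;
  unsigned int start = 2*blockIdx.x*blockDim.x;

  partialSum[t] = (start + t < n) ? input[start + t] : 0.0f;
  partialSum[blockDim.x + t] =
      (start + blockDim.x + t < n) ? input[start + blockDim.x + t] : 0.0f;

  for (unsigned int stride = blockDim.x; stride > 0; stride /= 2) {
    __syncthreads();
    if (t < stride)
      partialSum[t] += partialSum[t + stride];
  }

  if (t == 0)
    output[blockIdx.x] = partialSum[0];   // one partial sum per block
}

// Host side (assumed): launch repeatedly, shrinking the problem by a factor
// of 2*BLOCK_SIZE each time, until a single value remains:
//   while (len > 1) {
//     unsigned int numBlocks = (len + 2*BLOCK_SIZE - 1) / (2*BLOCK_SIZE);
//     blockSum<<<numBlocks, BLOCK_SIZE>>>(d_in, d_out, len);
//     std::swap(d_in, d_out);
//     len = numBlocks;
//   }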
Some Observations
• In each iteration, two control flow paths will be sequentially traversed for each warp
– Threads that perform addition and threads that do not
– Threads that do not perform addition still consume execution resources
• No more than half of the threads will be executing after the first step
– All odd-index threads are disabled after the first step
– After the 5th step, entire warps in each block will fail the if test: poor resource utilization, but no divergence
• This can go on for a while, up to 5 more steps (1024/32 = 32 = 2^5), where each active warp has only one productive thread, until all warps in a block retire
– Some warps will still succeed, but with divergence, since only one thread will succeed

17
Thread Index Usage Matters
• In some algorithms, one can shift the index usage to
improve the divergence behavior
– Commutative and associative operators

• Example: given an array of values, “reduce” them to a single value in parallel
– Sum reduction: sum of all values in the array
– Max reduction: maximum of all values in the array
– …

18
A Better Strategy
• Always compact the partial sums into the first
locations in the partialSum[] array

• Keep the active threads consecutive

19
An Example of 16 threads
Thread 0   Thread 1   Thread 2   …   Thread 14   Thread 15

In the first step, each thread t adds the element 16 positions away into element t:

Step 1:  0+16   1+17   2+18   …   14+30   15+31

20
A Better Reduction Kernel

for (unsigned int stride = blockDim.x; stride > 0; stride /= 2)
{
  __syncthreads();                 // previous step must be complete
  if (t < stride)                  // active threads stay consecutive
    partialSum[t] += partialSum[t + stride];
}

21
A Quick Analysis

• For a 1024-thread block (reducing 2*1024 elements)
– No divergence in the first 6 steps
– 1024, 512, 256, 128, 64, 32 consecutive threads are active in these steps (always whole warps)
– The final 5 steps will still have divergence

22
A Story about an Old Engineer
• From Hwu/Yale Patt

23
Parallel Algorithm Overhead
__shared__ float partialSum[2*BLOCK_SIZE];

unsigned int t = threadIdx.x;
unsigned int start = 2*blockIdx.x*blockDim.x;
partialSum[t] = input[start + t];
partialSum[blockDim.x + t] = input[start + blockDim.x + t];

for (unsigned int stride = blockDim.x; stride >= 1; stride >>= 1)
{
  __syncthreads();
  if (t < stride)
    partialSum[t] += partialSum[t + stride];
}
24
Parallel Execution Overhead
[Figure: the reduction tree from the earlier example (3 1 7 0 4 1 6 3), now with + as the operator]

Although the number of “operations” is N, each “operation” involves much more complex address calculation and intermediate result manipulation.

If the parallel code is executed on single-thread hardware, it would be significantly slower than the code based on the original sequential algorithm.

26
ANY MORE QUESTIONS?

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, 2007-2012

27
