
ACM / IEEE Supercomputing Conference 2004, November 06-12, Pittsburgh, PA

GPU Cluster for High Performance Computing


Zhe Fan, Feng Qiu, Arie Kaufman, Suzanne Yoakum-Stover
{fzhe, qfeng, ari, suzi}@cs.sunysb.edu
Center For Visual Computing and Department of Computer Science
Stony Brook University
Stony Brook, NY 11794-4400
ABSTRACT
Inspired by the attractive Flops/dollar ratio and the incredible
growth in the speed of modern graphics processing units (GPUs), we
propose to use a cluster of GPUs for high performance scientific
computing. As an example application, we have developed a parallel
flow simulation using the lattice Boltzmann model (LBM) on a GPU
cluster and have simulated the dispersion of airborne contaminants
in the Times Square area of New York City. Using 30 GPU nodes, our
simulation can compute a 480x400x80 LBM in 0.31 second/step, a
speed which is 4.6 times faster than that of our CPU cluster
implementation. Besides the LBM, we also discuss other potential
applications of the GPU cluster, such as cellular automata, PDE
solvers, and FEM.
Keywords: GPU cluster, data intensive computing, lattice Boltzmann
model, urban airborne dispersion, computational fluid dynamics
1 INTRODUCTION
The GPU, which refers to the commodity off-the-shelf 3D graphics
card, is specifically designed to be extremely fast at processing
large graphics data sets (e.g., polygons and pixels) for rendering
tasks. Recently, the use of the GPU to accelerate non-graphics
computation has drawn much attention [6, 16, 3, 29, 10, 28]. This
kind of research is propelled by two essential considerations:
Price/Performance Ratio: The computational power of today's
commodity GPUs has exceeded that of PC-based CPUs. For example,
the recently released nVIDIA GeForce 6800 Ultra has been observed
to reach 40 GFlops in fragment processing [11]. In comparison, the
theoretical peak performance of the Intel 3GHz Pentium4 using SSE
instructions is only 6 GFlops. This high GPU performance results
from the following: (1) A current GPU has up to 16 pixel processors
and 6 vertex processors that execute 4-dimensional vector floating
point instructions in parallel; (2) a pipeline constraint is
enforced to ensure that data elements stream through the processors
without stalls [29]; and (3) unlike the CPU, which has long been
recognized to have a memory bottleneck for massive computation [2],
the GPU uses fast on-board texture memory which has one order of
magnitude higher bandwidth (e.g., 35.2GB/sec on the GeForce 6800
Ultra). At the same time, the booming market for computer games
drives high volume sales of graphics cards, which keeps prices low
compared to other specialty hardware. As a result, the GPU has
become a commodity SIMD machine on the desktop that is ready to be
exploited for computation exhibiting high compute parallelism and
requiring high memory bandwidth.
Evolution Speed: Driven by the game industry, GPU performance has
approximately doubled every 6 months since the mid-1990s [15]. This
is much faster than the growth rate of CPU performance, which
doubles every 18 months on average (Moore's law), and the trend is
expected to continue. It is made possible by the explicit
parallelism exposed in the graphics hardware. As semiconductor
fabrication technology advances, GPUs can use additional
transistors much more efficiently for computation than CPUs by
increasing the number of pipelines.
Recently, the development of GPUs has reached a new high point with
the addition of single-precision 32-bit floating point capabilities
and a high-level language programming interface called Cg [20].
The developments mentioned above have facilitated the abstraction
of the modern GPU as a stream processor. Consequently, mapping
scientific computation onto the GPU has evolved from hardware
hacking into more of a high-level design task. Many kinds of
computations can be accelerated on GPUs, including sparse linear
system solvers, physical simulation, linear algebra operations,
partial differential equations, fast Fourier transform, level-set
computation, computational geometry problems, and also
non-traditional graphics, such as volume rendering, ray-tracing,
and flow visualization. (We refer the reader to the web site of
General-Purpose Computation Using Graphics Hardware (GPGPU) [1] for
more information.) Whereas all of this work has been limited to
computing small-scale problems on a single GPU, in this paper we
address large-scale computation on a GPU cluster.
Inspired by the attractive Flops/$ ratio and the projected
development of the GPU, we believe that a GPU cluster is promising
for data-intensive scientific computing and can substantially
outperform a CPU cluster at the equivalent cost. Although there
have been some efforts to exploit the parallelism of a graphics PC
cluster for interactive graphics tasks [9, 13, 14], to the best of
our knowledge we are the first to develop a scalable GPU cluster
for high performance scientific computing and large-scale
simulation. We have built a cluster with 32 computation nodes
connected by a 1 Gigabit Ethernet switch. Each node consists of a
dual-CPU HP PC with an nVIDIA GeForce FX 5800 Ultra, a GPU that
cost $399 in April 2003. By adding 32 GPUs to this cluster, we have
increased the theoretical peak performance of the cluster by 512
Gflops at a cost of only $12,768.
As an example application, we have simulated airborne contaminant
dispersion in the Times Square area of New York City. To model
transport and dispersion, we use the computational fluid dynamics
(CFD) model known as the Lattice Boltzmann Method (LBM), which is
second-order accurate and can easily accommodate complex-shaped
boundaries. Beyond enhancing our understanding of the fluid
dynamics processes governing dispersion, this work will support the
prediction of airborne contaminant propagation so that emergency
responders can more effectively engage their resources in response
to urban accidents or attacks. For large scale simulations of this
kind, the combined computational speed of the GPU cluster and the
linear nature of the LBM model create a powerful tool that can meet
the requirements of both speed and accuracy.
In the context of modeling contaminant transport, Brown et al.
[4, 5] have presented an approach for computing wind fields and
simulating contaminant transport on three different scales:
mesoscale, urban scale and building scale. The system they
developed, called HIGRAD, computes the flow field by using a
second-order accurate finite difference approximation of the
Navier-Stokes equations and doing large eddy simulation with a
small time step to resolve turbulent eddies. These simulations
required a few hours on a supercomputer or cluster to solve a
1.6 km x 1.5 km area in Salt Lake City at a grid spacing of 10
meters (grid resolution: 160x150x36). In comparison, our method is
also second-order accurate, incorporates a more detailed city
model, and can simulate the Times Square area in New York City at a
grid spacing of 3.8 meters (grid resolution: 480x400x80) with small
vortices in less than 20 minutes.
This paper is organized as follows: Section 2 illustrates how the
GPU can be used for non-graphics computing. Section 3 presents our
GPU cluster, called the Stony Brook Visual Computing Cluster. In
Section 4, we detail our LBM implementation on the GPU cluster,
followed by the performance results and a comparison with our CPU
cluster. Section 5 presents our dispersion simulation in the Times
Square area of New York City. In Section 6, we discuss other
potential usage of the GPU cluster for scientific computations.
Finally, we conclude in Section 7.
2 GPU COMPUTING MODEL
A graphics task such as rendering a 3D scene on the GPU involves a
sequence of processing stages that run in parallel and in a fixed
order, known as the graphics hardware pipeline (see Figure 1). The
first stage of the pipeline is the vertex processing. The input to
this stage is a 3D polygonal mesh. The 3D world coordinates of each
vertex of the mesh are transformed to a 2D screen position. Color
and texture coordinates associated with each vertex are also
evaluated. In the second stage, the transformed vertices are
grouped into rendering primitives, such as triangles. Each
primitive is scan-converted, generating a set of fragments in
screen space. Each fragment stores the state information needed to
update a pixel. In the third stage, called the fragment processing,
the texture coordinates of each fragment are used to fetch colors
of the appropriate texels (texture pixels) from one or more
textures. Mathematical operations may also be performed to
determine the ultimate color for the fragment. Finally, various
tests (e.g., depth and alpha) are conducted to determine whether
the fragment should be used to update a pixel in the frame buffer.
[Figure 1 diagram: Vertices in 3D -> Vertex Processing ->
Transformed vertices in screen position -> Scan-converting ->
Fragments -> Fragment Processing (Fetching Texels) -> Fragments
with colors]
Figure 1: A simplified illustration of the graphics hardware
pipeline.
To support extremely fast processing of large graphics data sets
(vertices and fragments), modern GPUs (e.g., nVIDIA GeForce and ATI
Radeon family cards) employ a stream processing model with
parallelism. Currently, up to 6 vertices in the vertex processing
stage, and up to 16 fragments in the fragment processing stage can
be processed in parallel by multi-processors. The GPU hardware
supports 4-dimensional vectors (representing homogeneous
coordinates or the RGBA color channels) and a 4-component vector
floating point SIMD instruction set for computation. In addition,
the pipeline discipline is enforced that every element in the
stream is processed by the same function and independently of the
other elements. This ensures that data elements stream through the
pipeline without stalls, and largely accounts for the high
performance gains associated with processing large data sets [29].
Currently, most of the techniques for non-graphics computation on
the GPU take advantage of the programmable fragment processing
stage. Using the C-like, high-level language Cg [20], programmers
can write fragment programs to implement general-purpose
operations. Since fragment programs can fetch texels from arbitrary
positions in textures residing in texture memory, a gather
operation is supported. Note, however, that while the vertex stage
is also programmable, it does not support the gather operation. The
steps involved in mapping a computation on the GPU are as follows:
(1) The data are laid out as texel colors in textures; (2) Each
computation step is implemented with a user-defined fragment
program which can include gather and mathematical operations. The
results are encoded as pixel colors and rendered into a
pixel-buffer (a buffer in GPU memory which is similar to a
frame-buffer); (3) Results that are to be used in subsequent
calculations are copied to textures for temporary storage.
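The three mapping steps can be emulated on the CPU to make the data flow concrete. The following is only a sketch in Python, not Cg and not the code used in this work: the function names, the clamp-to-edge addressing, and the 4-neighbor averaging "fragment program" are all illustrative assumptions.

```python
# Illustrative CPU emulation of the three mapping steps (hypothetical names).

def run_pass(texture, fragment_program):
    """Apply fragment_program at every texel position and return a new buffer.

    The input texture is only read during the pass (gather, no scatter),
    mirroring how a fragment program writes only to its own output pixel.
    """
    height, width = len(texture), len(texture[0])
    return [[fragment_program(texture, x, y) for x in range(width)]
            for y in range(height)]


def average_neighbors(texture, x, y):
    """Hypothetical fragment program: gather 4 neighbors and average them."""
    height, width = len(texture), len(texture[0])

    def fetch(tx, ty):
        # Clamp-to-edge addressing (an assumed texture addressing mode).
        tx = min(max(tx, 0), width - 1)
        ty = min(max(ty, 0), height - 1)
        return texture[ty][tx]

    return (fetch(x - 1, y) + fetch(x + 1, y) +
            fetch(x, y - 1) + fetch(x, y + 1)) / 4.0


def simulate(texture, steps):
    """Repeat step (2), render to a buffer, and step (3), reuse it as input."""
    for _ in range(steps):
        pixel_buffer = run_pass(texture, average_neighbors)  # step (2)
        texture = pixel_buffer                               # step (3)
    return texture


# Step (1): lay the data out as a 2D array of texel values.
initial = [[0.0, 0.0, 0.0],
           [0.0, 4.0, 0.0],
           [0.0, 0.0, 0.0]]
result = simulate(initial, 1)
```

Because each output value depends only on reads from the previous buffer, every position could be evaluated in parallel, which is the property the fragment stage exploits.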
For general-purpose computation on the GPU, an essential
requirement is that the data structure can be arranged in arrays in
order to be stored in a 2D texture or a stack of 2D textures. For a
matrix or a structured grid, this layout in texture is natural.
Accommodating more complicated data structures may require the use
of indirection textures that store texture coordinates used to
fetch texels from other textures. For example, to store a static 2D
binary tree, all the nodes can be packed into a 2D texture in
row-priority order according to the node IDs. Using two indirection
textures, the texture coordinates of each node's left child and
right child can be stored. However, the lack of pointers in GPU
programs makes computations that use some other complex data
structures (e.g., dynamic linked lists) difficult for the GPU. GPU
computation may also be inefficient in cases where the program
control flow is complex. It is also the case that the GPU on-board
texture memory is relatively small (currently the maximum size is
256MB). In our previous work with LBM simulation on a single
GeForce FX 5800 Ultra with 128MB texture memory, we found that at
most 86MB texture memory can actually be used to store the
computational lattice data. As a result, our maximum lattice size
was 92^3. Fortunately, many massive computations exhibit the
feature that they only require simple data structures and simple
program control flows. By using a cluster of GPUs, these
computations can reap the benefits of GPU computing while avoiding
its limitations.
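The binary tree example can be sketched as follows, with Python lists standing in for textures. The texture width, the NO_CHILD sentinel, and all function names are assumptions made for illustration; only the row-priority packing and the two indirection textures come from the description above.

```python
# Hypothetical layout helper; names and the NO_CHILD sentinel are assumptions.
WIDTH = 4                      # texture width in texels (illustrative)
NO_CHILD = (-1, -1)            # sentinel texture coordinate for a missing child


def node_to_texcoord(node_id, width=WIDTH):
    """Row-priority packing: node IDs fill the texture row by row."""
    return (node_id % width, node_id // width)


def build_textures(values, left, right, width=WIDTH):
    """Pack node values and child links into three same-sized 2D arrays."""
    height = (len(values) + width - 1) // width
    value_tex = [[None] * width for _ in range(height)]
    left_tex = [[NO_CHILD] * width for _ in range(height)]
    right_tex = [[NO_CHILD] * width for _ in range(height)]
    for node_id, value in enumerate(values):
        x, y = node_to_texcoord(node_id, width)
        value_tex[y][x] = value
        if left[node_id] is not None:
            left_tex[y][x] = node_to_texcoord(left[node_id], width)
        if right[node_id] is not None:
            right_tex[y][x] = node_to_texcoord(right[node_id], width)
    return value_tex, left_tex, right_tex


def fetch(tex, coord):
    """Texel fetch at an (x, y) texture coordinate."""
    x, y = coord
    return tex[y][x]


# A static 6-node tree: node 0 is the root, children are given by node ID.
values = [50, 30, 70, 20, 40, 60]
left = [1, 3, 5, None, None, None]
right = [2, 4, None, None, None, None]
value_tex, left_tex, right_tex = build_textures(values, left, right)

# Dependent fetches: root -> left child -> right child, with no pointers.
root = node_to_texcoord(0)
coord = fetch(right_tex, fetch(left_tex, root))
found = fetch(value_tex, coord)
```

Traversal becomes a chain of dependent texture fetches. This works for a static tree, but inserting or deleting nodes would require rewriting the textures, which is one reason dynamic structures are awkward in this model.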
3 OUR GPU CLUSTER
The Stony Brook Visual Computing Cluster (Figure 2) is our GPU
cluster built for two main purposes: as a GPU cluster for graphics
and computation and as a visualization cluster for rendering large
volume data sets. It has 32 nodes connected by a 1 Gigabit Ethernet
switch. (Actually, the cluster has 35 nodes, but only 32 are used
in this project.) Each node is an HP PC equipped with two Pentium
Xeon 2.4GHz processors and 2.5GB memory. Each node has a GPU, the
GeForce FX 5800 Ultra with 128MB memory, used for GPU cluster
computation. Each node also has volume rendering hardware
(VolumePro 1000), and currently 9 of the nodes also have HP
Sepia-2A compositing cards with fast ServerNet [25] for rendering
large volume data sets. Each node can boot under Windows XP or
Linux, although our current application of the GPU cluster runs on
Windows XP.
Figure 2: The Stony Brook Visual Computing Cluster.
The architecture of our GPU cluster is shown in Figure 3. We use
MPI for data transfer across the network during execution. Each
port of the switch has 1 Gigabit bandwidth. Besides network
transfer, data transfer includes upstreaming data from GPU to PC
memory and downstreaming data from PC memory to GPU for the next
computation. This communication occurs over an AGP 8x bus, which is
well known to have an asymmetric bandwidth (2.1GB/sec peak for
downstream and 133MB/sec peak for upstream). The asymmetric
bandwidth reflects the need for the GPU to push vast quantities of
graphics data at high speed and to read back only a small portion
of data. As shown in Section 4.4, the slower upstream transfer rate
slows down the entire communication. Recent exciting news indicates
that this situation will be improved with the PCI-Express bus to be
available later this year [30]. By connecting with a x16
PCI-Express slot, a graphics card can communicate with the system
at 4GB/sec in both upstream and downstream directions. Moreover,
the PCI-Express will allow multiple GPUs to be plugged into one PC.
The interconnection of these GPUs will greatly reduce the network
load.
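The effect of the asymmetric bus can be estimated with simple arithmetic on the peak figures quoted above. The sketch below assumes decimal units (1 GB = 1000 MB) and a hypothetical 1 MB payload; it gives ideal lower bounds only and does not model driver or read-back setup overhead.

```python
# Peak figures quoted in the text; decimal units (1 GB = 1000 MB) are assumed.
AGP8X_DOWNSTREAM_MB_S = 2100.0     # PC memory -> GPU
AGP8X_UPSTREAM_MB_S = 133.0        # GPU -> PC memory
PCIE_X16_MB_S = 4000.0             # either direction


def transfer_ms(megabytes, bandwidth_mb_s):
    """Ideal transfer time in milliseconds at a given peak bandwidth."""
    return 1000.0 * megabytes / bandwidth_mb_s


payload_mb = 1.0   # hypothetical payload, not a measured value from the paper
up_ms = transfer_ms(payload_mb, AGP8X_UPSTREAM_MB_S)
down_ms = transfer_ms(payload_mb, AGP8X_DOWNSTREAM_MB_S)
asymmetry = AGP8X_DOWNSTREAM_MB_S / AGP8X_UPSTREAM_MB_S
pcie_gain_up = PCIE_X16_MB_S / AGP8X_UPSTREAM_MB_S
```

Under these assumptions the upstream direction is roughly 16 times slower than the downstream direction at peak, and a x16 PCI-Express link would raise the peak upstream rate by about a factor of 30.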
[Figure 3 diagram: Node 0 through Node 31 each connect to a Gigabit
network switch at 1 Gbit/sec. Inside a node, the CPU, PC memory and
network card connect to the GPU and its texture memory over AGP 8x
(2.1GB/sec downstream, 133MB/sec upstream).]
Figure 3: The architecture of our GPU cluster. (Although all 32
nodes have the same configuration, we show only node 0 in detail.)
Currently, we only use the fragment processing stage of the GeForce
FX 5800 Ultra for computing, which features a theoretical peak of
16 Gflops, while the dual-processor Pentium Xeon 2.4GHz reaches
approximately 10 Gflops. The theoretical peak performance of our
GPU cluster is (16 + 10) x 32 = 832 Gflops. Although the whole GPU
cluster cost was about $136,000 (excluding the VolumePro cards and
the Sepia cards which are not used here), this price can be
decreased by designing the system specifically for the purpose of
GPU cluster computation, since the large memory configurations and
the dual processors of the PCs in this cluster actually do not
improve the performance of GPU computing. Stated another way, by
plugging 32 GPUs into this cluster, we increase its theoretical
peak performance by 16 x 32 = 512 Gflops at a price of
$399 x 32 = $12,768. We therefore get in principle 41.1 Mflops
peak/$.
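The peak-performance arithmetic above can be checked with a few lines of Python. The constant names are illustrative. The quoted 41.1 Mflops/$ appears to correspond to a binary conversion (1 Gflops = 1024 Mflops); this is an inference from the numbers rather than something stated in the text, so both conventions are shown.

```python
GPU_GFLOPS = 16          # fragment stage peak of one GeForce FX 5800 Ultra
CPU_GFLOPS = 10          # approximate peak of one dual-Xeon node
NODES = 32
GPU_PRICE_USD = 399      # April 2003 price quoted in the text

cluster_peak = (GPU_GFLOPS + CPU_GFLOPS) * NODES      # total Gflops
added_peak = GPU_GFLOPS * NODES                       # Gflops added by GPUs
added_cost = GPU_PRICE_USD * NODES                    # USD spent on GPUs

# Assumption: the paper's figure uses 1 Gflops = 1024 Mflops.
mflops_per_dollar_binary = added_peak * 1024 / added_cost
# Decimal convention (1 Gflops = 1000 Mflops) for comparison.
mflops_per_dollar_decimal = added_peak * 1000 / added_cost
```

With the binary convention the ratio rounds to 41.1 Mflops/$, and with the decimal convention it rounds to 40.1 Mflops/$.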
4 PARALLEL LBM COMPUTATION ON THE GPU CLUSTER
In this section we describe the first example application, parallel
LBM computation that we developed on the GPU cluster. We begin this
section with a brief introduction to the LBM model and then review
our previous work of mapping the computation onto a single GPU.
Afterwards, we present the algorithm and network optimization
techniques for scaling the model onto our GPU cluster and report
the performance in comparison with the same model executed on the
CPU cluster.
4.1 LBM Flow Model
The LBM is a relatively new approach in computational fluid
dynamics for modeling gases and fluids [26]. Developed principally
by the physics community, the LBM has been applied to problems of
flow and reactive transport in porous media, environmental science,
national security, and others. The numerical method is highly
parallelizable, and most notably, it affords great flexibility in
specifying boundary shapes. Even moving and time-dependent
boundaries can be accommodated with relative ease [24].
The LBM models Boltzmann dynamics of flow particles on a regular
lattice. Figure 4 shows a unit cell of the D3Q19 lattice, which
includes 19 velocity vectors in three dimensions (the zero velocity
in the center site and the 18 velocities represented by the 6
nearest axial and 12 second-nearest minor diagonal neighbor links).
Associated with each lattice site, and corresponding to each of the
19 velocities, are 19 floating point variables, f_i, representing
velocity distributions. Each distribution represents the
probability of the presence of a fluid particle with the associated
velocity.
Figure 4: The D3Q19 LBM lattice geometry. The velocity distribution
f_i is associated with the link vector c_i.
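The 19 link vectors of the D3Q19 lattice can be enumerated directly, which confirms the count of one rest vector, 6 axial links and 12 minor diagonal links. This is a small illustrative sketch; the variable names and the ordering of the vectors are assumptions, not the ordering used in the implementation described here.

```python
from itertools import product

# Enumerate all integer vectors in {-1, 0, 1}^3 and keep those whose squared
# length is 0 (rest), 1 (axial links) or 2 (minor diagonal links). The 8 cube
# corners (squared length 3) are excluded.
velocities = [c for c in product((-1, 0, 1), repeat=3)
              if sum(k * k for k in c) <= 2]

rest = [c for c in velocities if sum(k * k for k in c) == 0]
axial = [c for c in velocities if sum(k * k for k in c) == 1]
diagonal = [c for c in velocities if sum(k * k for k in c) == 2]
```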
The Boltzmann equation expresses how the average number of flow
particles moves between neighboring sites due to inter-particle
interactions and ballistic motion. This dynamics can be represented
as a two-step process of collision and streaming. Particles stream
synchronously along links in discrete time steps. Between streaming
steps, the Bhatnagar-Gross-Krook (BGK) model is used to model
collisions as a statistical redistribution of momentum, which
locally drives the system toward equilibrium while conserving mass
and momentum [31]. Complex shaped boundaries such as curves and
porous media can be represented by the location of the intersection
of the boundary surfaces with the lattice links [24]. The LBM is
second-order accurate in both time and space, with an
advection-limited time step. In the limit of zero time step and
lattice spacing, LBM yields the Navier-Stokes equation for an
incompressible fluid.
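A single-site BGK collision can be sketched in a few lines to show how relaxation toward equilibrium conserves mass and momentum. The equilibrium form and the weights 1/3, 1/18 and 1/36 used below are the standard textbook choice for D3Q19 in lattice units and are an assumption here, since the text does not state them; the relaxation time and the initial state are hypothetical values.

```python
from itertools import product

# D3Q19 link vectors and the standard weights (assumed, textbook values).
C = [c for c in product((-1, 0, 1), repeat=3) if sum(k * k for k in c) <= 2]
W = [{0: 1 / 3, 1: 1 / 18, 2: 1 / 36}[sum(k * k for k in c)] for c in C]


def moments(f):
    """Density and velocity of one lattice site from its 19 distributions."""
    rho = sum(f)
    u = tuple(sum(fi * c[a] for fi, c in zip(f, C)) / rho for a in range(3))
    return rho, u


def equilibrium(rho, u):
    """Standard second-order equilibrium distribution (lattice units)."""
    uu = sum(a * a for a in u)
    feq = []
    for w, c in zip(W, C):
        cu = sum(ca * ua for ca, ua in zip(c, u))
        feq.append(w * rho * (1 + 3 * cu + 4.5 * cu * cu - 1.5 * uu))
    return feq


def bgk_collide(f, tau):
    """Relax each distribution toward equilibrium with relaxation time tau."""
    rho, u = moments(f)
    feq = equilibrium(rho, u)
    return [fi - (fi - fe) / tau for fi, fe in zip(f, feq)]


# Hypothetical non-equilibrium state at a single site.
f_before = [w * (1.0 + 0.05 * i) for i, w in enumerate(W)]
f_after = bgk_collide(f_before, tau=0.8)
rho_before, u_before = moments(f_before)
rho_after, u_after = moments(f_after)
```

The collision changes the individual distributions but leaves the density and velocity of the site unchanged, up to floating point rounding. Streaming would then move each f_i to the neighboring site along its link vector c_i.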
The LBM model can be further extended to capture thermal effects as
in convective flows. A hybrid thermal model has been recently
developed [17]. The hybrid thermal LBM (HTLBM) abandons the BGK
collision model for the more stable Multiple Relaxation Time (MRT)
collision model [8]. Temperature, modeled with a standard
diffusion-advection equation implemented as a finite difference
equation, is coupled to the MRT LBM via an energy term. Ultimately,
the implementation of the HTLBM is similar to the earlier LBM,
requiring only two additional matrix multiplications.
4.2 LBM on a Single GPU
In a previous work [18], our group implemented a BGK LBM simulation
on the nVIDIA GeForce4 GPU, which has a non-programmable fragment
processor, using complex texture operations. Since then we have
ported the BGK LBM computation to newer graphics hardware, the
GeForce FX, and have achieved about 8 times faster speed on the
GeForce FX 5900 Ultra compared to the software version running on a
Pentium IV 2.53GHz without using SSE instructions. The
programmability of the GeForce FX makes porting to the GPU
straightforward and efficient. Because our latest parallel version
on the GPU cluster is based on it, we briefly review the single GPU
implementation on the GeForce FX.
As shown in Figure 5, to lay out the LBM data, the lattice sites
are divided into several volumes. Each volume contains data
associated with a given state variable and has the same resolution
as the LBM lattice. For example, each of the 19 velocity
distributions f_i in D3Q19 LBM is represented in a volume. To use
the GPU vector operations and save storage space, we pack four
volumes into one stack of 2D textures (note that a fragment or a
texel has 4 color components). Thus, the 19 distribution values are
packed into 5 stacks of textures. Flow densities and flow
velocities at the lattice sites are packed into one stack of
textures in a similar fashion.
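The packing of 19 distribution volumes into RGBA texture stacks can be expressed as a simple index mapping. The sketch below is illustrative only: the order in which distributions are assigned to stacks and channels is an assumption, since the text states only that four volumes share one stack.

```python
CHANNELS = "RGBA"            # four color components per texel
NUM_DISTRIBUTIONS = 19       # D3Q19


def pack_slot(i):
    """Map distribution index i to (texture stack index, color channel).

    The assignment order is an assumption made for illustration.
    """
    return i // 4, CHANNELS[i % 4]


num_stacks = (NUM_DISTRIBUTIONS + 3) // 4          # ceil(19 / 4)
layout = {i: pack_slot(i) for i in range(NUM_DISTRIBUTIONS)}
unused_channels = num_stacks * 4 - NUM_DISTRIBUTIONS
```

Five stacks provide 20 channels, so one channel of the last stack is left unused by the distributions.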
[Figure 5 diagram: D3Q19 LBM direction volumes (+X, +Y, +XY, -XY,
...) packed four at a time into a stack of 2D textures.]
Figure 5: Each velocity distribution f_i, associated with a given
direction, is grouped into a volume. We pack every four volumes
into one stack of 2D textures.
Boundary link information (e.g., flags indicating whether
the lattice links intersect with boundary surfaces along with
the intersection positions) is also stored in textures. How-
ever, since most links do not intersect the boundary surface,
we do not store boundary information for the whole lattice.
Instead, we cover the boundary regions of each Z slice using
multiple small rectangles. Thus, we need to store the bound-
ary information only inside those rectangles in 2D textures.
The LBM operations (e.g., streaming, collision, and
boundary conditions) are translated into fragment programs
to be executed in the rendering passes. For each fragment
in a given pass, the fragment program fetches any required
current lattice state information from the appropriate tex-
tures, computes the LBM equations to evaluate the new lat-
tice states, and renders the results to a pixel buffer. When the
pass is completed, the results are copied back to textures for
use in the next step.
4.3 Scaling LBM onto the GPU Cluster
To scale LBM onto the GPU cluster, we choose to decom-
pose the LBM lattice space into sub-domains, each of which
is a 3D block. As shown in Figure 6, each GPU node com-
putes one sub-domain. In every computation step, velocity
distributions at border sites of the sub-domain may need to
stream to adjacent nodes. This kind of streaming involves
three steps: (1) Distributions are read out from the GPU;
(2) They are transferred through the network to appropriate
neighboring nodes; (3) They are then written to the GPU
in the neighboring nodes. For ease of discussion, we di-
vide these across-network streaming operations into two cat-
egories: streaming axially to nearest neighbors (represented
by black arrows in Figure 6) and streaming diagonally to
second-nearest neighbors (represented by blue arrows). Note
that although Figure 6 only demonstrates 9 sub-domains ar-
ranged in 2 dimensions, our implementation is scalable and
functions in a similar fashion for sub-domains arranged in 3
dimensions.
Figure 6: Each block represents a sub-domain of the LBM lat-
tice processed by one GPU. Velocity distributions at border sites
stream to adjacent nodes at every computation step. Black ar-
rows indicate velocity distributions that stream axially to near-
est neighbor nodes while blue arrows indicate velocity distribu-
tions that stream diagonally to second-nearest neighbor nodes.
The primary challenge in scaling LBM computation onto the GPU
cluster is to minimize the communication cost, that is, the time
taken for network communication and for transferring data between
the GPU and the PC memory. Overlapping network communication time
with the computation
time is feasible, since the CPU and the network card are all
standing idle while the GPU is computing. However, be-
cause each GPU can compute its sub-domain quickly, op-
timizing network performance to keep communication time
from becoming the bottleneck is still necessary. Intuitively
one might want to minimize the size of transferred data. One
way to do this is to make the shape of each sub-domain as
close as possible to a cube, since for block shapes the cube
has the smallest ratio between boundary surface area and vol-
ume. Another idea that we have not yet studied is to employ
lossless compression of transferred data by exploiting space
coherence or data coherence between computation steps. We
have found, however, that other issues actually dominate the
communication performance.
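The geometric argument for cube-shaped sub-domains can be checked numerically. The sketch below compares three hypothetical block shapes of equal volume; the shapes and names are chosen for illustration and are not configurations reported in this paper.

```python
def surface_to_volume(nx, ny, nz):
    """Ratio of boundary surface area to volume for an nx x ny x nz block."""
    surface = 2 * (nx * ny + ny * nz + nx * nz)
    volume = nx * ny * nz
    return surface / volume


# Three hypothetical sub-domain shapes with the same volume (512,000 sites).
cube = surface_to_volume(80, 80, 80)
slab = surface_to_volume(160, 80, 40)
pencil = surface_to_volume(320, 40, 40)
```

The cube gives the smallest ratio (0.075), followed by the slab (0.0875) and the pencil (0.10625), so for a fixed amount of computation the cube exposes the fewest border sites.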
The communication switching time has a significant impact on
network performance. We performed experiments on
the GPU cluster using MPI and replicated these experiments
using communication code that we developed using TCP/IP
sockets. The results were the same: (1) During the time
when a node is sending data to another node, if a third node
tries to send data to either of those nodes, the interruption
will break the smooth data transfer and may dramatically re-
duce the performance; (2) Assuming the total communica-
tion data size is the same, a simulation in which each node
transfers data to more neighbors has a considerably larger
communication time than a simulation in which each node
transfers to fewer neighbors.
To address these issues, we have designed communication schedules
[27] that reduce the likelihood of interruptions. We have also
further simplified the communication pattern of the parallel LBM
simulation. In our design, the communication is scheduled in
multiple steps and in each step certain pairs of nodes exchange
data. This schedule and pattern are illustrated in Figure 7 for 16
nodes arranged in 2 dimensions. The same procedure works for
configurations with more nodes and for 3D arrangements as well. The
different colors represent the different steps. In the first step,
all nodes in the (2i)th columns exchange data with their neighbors
to the left. In the second step, these nodes exchange data with
neighbors to the right. In the third and fourth steps, nodes in the
(2i)th rows exchange data with their neighbors above and below,
respectively. Note that LBM computation requires nodes to exchange
data with their second-nearest neighbors too. There are as many as
4 second-nearest neighbors in 2D arrangements and up to 12 in 3D
D3Q19 arrangements. To keep the communication pattern from becoming
too complicated, and to avoid additional overhead associated with
more steps, we do not allow direct data exchange diagonally between
second-nearest neighbors. Instead, we transfer those data
indirectly in a two-step process. For example, as shown in Figure
7, data that node B wants to send to node E will first be sent to
node A in step 1, then be sent by node A to node E in step 3. If
the sub-domain in a GPU node is a lattice of size N^3, the size of
the data that it sends to a nearest neighbor is 5N^2, while the
data it sends to a second-nearest neighbor has size of only N.
Using the indirect pattern increases the packet size between
nearest neighbors only by a fraction c/(5N) (c is 1 or 2 for 2D
arrangement and 1-4 for 3D arrangement). Since the communication
pattern is also greatly simplified, particularly for 3D node
arrangements, the network performance is greatly improved.
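The four-step schedule for a 2D arrangement can be generated programmatically. The sketch below is an illustration under stated assumptions, not the scheduling code used in this work: nodes are addressed as (row, column) with 1-based indices so that even indices correspond to the (2i)th columns and rows, and diagonal traffic is relayed horizontally first and vertically second, as in the B to A to E example.

```python
def schedule(rows, cols):
    """Return four lists of node pairs, one list per communication step.

    Nodes are (row, col) with 1-based indices, so even-numbered columns and
    rows play the role of the (2i)th columns and rows in the text.
    """
    steps = [[], [], [], []]
    for r in range(1, rows + 1):
        for c in range(2, cols + 1, 2):
            steps[0].append(((r, c), (r, c - 1)))          # step 1: left
            if c + 1 <= cols:
                steps[1].append(((r, c), (r, c + 1)))      # step 2: right
    for r in range(2, rows + 1, 2):
        for c in range(1, cols + 1):
            steps[2].append(((r, c), (r - 1, c)))          # step 3: above
            if r + 1 <= rows:
                steps[3].append(((r, c), (r + 1, c)))      # step 4: below
    return steps


def diagonal_route(src, dst):
    """Two-hop route to a second-nearest neighbor: horizontal, then vertical."""
    relay = (src[0], dst[1])
    return [src, relay, dst]


def relative_packet_growth(n, c):
    """Growth c*N / (5*N^2) = c / (5N) of a nearest-neighbor packet."""
    return c / (5 * n)


steps = schedule(4, 4)
# Node B = (1, 2) reaches node E = (2, 1) through node A = (1, 1).
route_b_to_e = diagonal_route((1, 2), (2, 1))
growth = relative_packet_growth(80, 2)
```

For a 4 x 4 arrangement this yields 8, 4, 8 and 4 exchanging pairs in the four steps, which together cover all 24 nearest-neighbor links, and no node takes part in more than one exchange per step. For N = 80 and c = 2, the indirect pattern enlarges a nearest-neighbor packet by 0.5%.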
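[Figure 7 diagram: a 4 x 4 grid of nodes labeled in row order
(A B C D / E F G H / I J K L / M N O P), with the exchanges of
Step 1, Step 2, Step 3 and Step 4 drawn in different colors.]
Figure 7: The communication schedule and pattern of parallel LBM
Simulation. Different colors indicate the different steps in the
schedule.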
We also found that for simulations with a small number of nodes
(fewer than 16), synchronizing the nodes by calling MPI_Barrier()
at each scheduled step improves the network performance. However,
if more than 16 nodes are used, the overhead of the synchronization
overwhelms the performance gained from the synchronized schedule.
The data transfer speed from GPU to CPU represents an-
other bandwidth limitation. Because of the way that we map
the data to textures (described in Section 4.2), the velocity
distributions that stream out of the sub-domain are stored in
different texels and different channels in multiple textures.
We have designed fragment programs which run in every
time step to gather together into a texture all these data. Then
they are read from the GPU in a single read operation (e.g.,
OpenGL function glGetTexImage()). In so doing, we mini-
mize the overhead of initializing the read operations. As de-
scribed in Section 3, this bandwidth limitation will be ame-
liorated later this year when the PCI-Express bus becomes
available on the PC platform.
4.4 Performance of LBM on the GPU Cluster
In addition to the GPU cluster implementation, we have implemented
the parallel LBM on the same cluster using the CPUs. The time and
work taken to develop and optimize these two implementations were
similar (about 3 man-months each). Note that although each node has
two CPUs, for the purpose of a fair comparison, we used only one
thread (hence one CPU) per node for computation.

Table 1: Per step execution time (in ms) for CPU and GPU clusters
and the GPU cluster / CPU cluster speedup factor. Each node
computes an 80^3 sub-domain of the lattice.

Number    CPU cluster  GPU cluster  GPU cluster    GPU cluster network      GPU cluster  Speedup
of nodes  total        computation  GPU and CPU    communication:           total
                                    communication  non-overlapping (total)
 1        1420         214          -              -                        214          6.64
 2        1424         216          13             0 (38)                   229          6.22
 4        1430         224          42             0 (47)                   266          5.38
 8        1429         222          50             0 (68)                   272          5.25
12        1431         230          50             0 (80)                   280          5.11
16        1433         235          50             0 (85)                   285          5.03
20        1436         237          50             0 (87)                   287          5.00
24        1437         238          50             0 (90)                   288          4.99
28        1439         237          50             11 (131)                 298          4.83
30        1440         237          50             25 (145)                 312          4.62
32        1440         237          49             31 (151)                 317          4.54
In Table 1, we report the simulation execution time per step (averaged over 500 steps) in milliseconds on both the CPU cluster and the GPU cluster with 1, 2, 4, 8, 12, 16, 20, 24, 28, 30 and 32 nodes. Each node evaluates an 80^3 sub-domain and the sub-domains are arranged in 2 dimensions. The timing for the CPU cluster simulation (shown in column 2 of Table 1) includes only computation time because the network communication time was overlapped with the computation by using a second thread for network communication. The timing for the GPU cluster simulation (shown in column 6) includes computation time, GPU and CPU communication time, and non-overlapping network communication time. Note that the computation time also includes the time for boundary condition evaluation for the city model described in Section 5. As the boundary condition evaluation time is only a small portion of the computation time, the computation time is similar for all the nodes. Network communication time (plotted as a function of the number of nodes in Figure 8) was partially overlapped with the computation because we let each GPU compute the collision operation on the inner cells of its sub-domain (which takes roughly 120 ms) simultaneously with network communication. If the network communication time exceeds 120 ms, the remainder is non-overlapping and adds to the simulation time. In column 5 we show this remainder cost along with the total network communication time in parentheses.
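The overlap accounting described above can be checked against Table 1 with a short sketch. The function names are illustrative, and the fixed 120 ms overlap window is an assumption taken from the approximate figure quoted in the text; this is not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): per-step time model from Table 1.
OVERLAP_WINDOW_MS = 120  # approximate inner-cell collision time quoted in the text


def non_overlapping_cost(network_ms, window_ms=OVERLAP_WINDOW_MS):
    """Part of the network time that cannot be hidden behind computation."""
    return max(0, network_ms - window_ms)


def gpu_step_time(compute_ms, gpu_cpu_ms, network_ms):
    """Total per-step time: computation + GPU/CPU transfer + exposed network time."""
    return compute_ms + gpu_cpu_ms + non_overlapping_cost(network_ms)


# Rows of Table 1 for 24, 28, 30 and 32 nodes:
# (nodes, CPU total, GPU computation, GPU/CPU communication, total network time)
rows = [
    (24, 1437, 238, 50, 90),
    (28, 1439, 237, 50, 131),
    (30, 1440, 237, 50, 145),
    (32, 1440, 237, 49, 151),
]

for nodes, cpu_ms, compute_ms, gpu_cpu_ms, network_ms in rows:
    total = gpu_step_time(compute_ms, gpu_cpu_ms, network_ms)
    print(nodes, total, round(cpu_ms / total, 2))
```

Under this model the exposed network cost is zero up to 24 nodes and grows once the total network time passes the overlap window, which is the behavior reported in column 5.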
The GPU cluster / CPU cluster speedup factor is plotted as a function of the number of nodes in Figure 9. When only a single node is used, the speedup factor is 6.64. This value represents the theoretical maximum GPU cluster / CPU cluster speedup factor that could be reached if all communication bottlenecks were eliminated by a better optimized network and larger GPU/CPU bandwidth. When the number of nodes is below 28, the network communication is totally overlapped with the computation. Accordingly, the growth of the number of nodes only marginally increases the execution time due to the GPU/CPU communication, and the curve flattens at approximately 5. When the number of nodes increases to 28 or above, the network communication cannot be totally overlapped, resulting in a drop in the curve.

[Figure 8 plot. x-axis: Number of Nodes (0 to 32); y-axis: Network Communication Time (0 to 160 ms); regions: Overlapping, Non-overlapping.]
Figure 8: The network communication time measured in ms. The area under the blue line represents the part of network communication time which was overlapped with computation. The shaded area represents the remainder.
Three enhancements can further improve this speedup factor without changing the way that we map the LBM computation onto the GPU cluster: (1) Using a faster network, such as Myrinet. (2) Using the PCI-Express bus that will be available later this year to achieve faster communication between the GPU and the system and to plug multiple GPUs into each PC. (3) Using GPUs with larger texture memories (currently, larger memories of 256MB are available) so that each GPU can compute a larger sub-domain of the lattice and thereby increase the computation/communication ratio. Further GPU development, and the consequent increase in performance, will serve to improve the speedup factor even further. (Note that today's GeForce 6800 Ultra, which has been observed to reach 40 GFlops in fragment processing, is already at least 2.5 times faster than the GeForce FX 5800 Ultra in our cluster.) On the other hand, our CPU cluster implementation could be further optimized too by using SSE instructions, which we are going to implement in the near future. With this optimization, the CPU cluster computation is expected to be about 2 to 3 times faster.

[Figure 9 plot. x-axis: Number of Nodes (0 to 32); y-axis: Speedup Factor: GPU Cluster / CPU Cluster (0 to 7).]
Figure 9: Speedup factor of the GPU cluster compared with the CPU cluster.
To quantify the scalability of the GPU cluster, Table 2 shows the computed efficiency of the GPU cluster as a function of the number of nodes. The efficiency values are also plotted in Figure 10.

Table 2: The GPU cluster computational power and the efficiency with respect to the number of nodes.

Number    Number of cells      Speedup  Efficiency
of Nodes  computed per second
1         2.3M                 -        -
2         4.3M                 1.87     93.5%
4         7.3M                 3.17     79.3%
8         14.4M                6.26     78.3%
12        20.9M                9.09     75.8%
16        27.4M                11.91    74.4%
20        34.0M                14.78    73.9%
24        40.7M                17.70    73.8%
28        45.9M                19.96    71.3%
30        47.0M                20.43    68.1%
32        49.2M                21.39    66.8%
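As a rough check of Table 2, the speedup and efficiency columns can be recomputed from the throughput column. The sketch below assumes that speedup is the throughput relative to one node and that efficiency is speedup divided by the number of nodes; using the rounded throughput values from the table appears to reproduce the published figures to within rounding, but the exact procedure the authors used is not stated.

```python
# Illustrative sketch: reproduce the Speedup and Efficiency columns of Table 2.
# Throughput values (millions of cells per second) are copied from the table.
throughput_m = {1: 2.3, 2: 4.3, 4: 7.3, 8: 14.4, 12: 20.9, 16: 27.4,
                20: 34.0, 24: 40.7, 28: 45.9, 30: 47.0, 32: 49.2}


def speedup(nodes):
    """Throughput on `nodes` GPU nodes relative to a single GPU node."""
    return throughput_m[nodes] / throughput_m[1]


def efficiency(nodes):
    """Fraction of ideal linear scaling achieved on `nodes` nodes."""
    return speedup(nodes) / nodes


for n in sorted(throughput_m):
    print(f"{n:2d} nodes: speedup {speedup(n):5.2f}, efficiency {efficiency(n):6.1%}")
```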
[Figure 10 plot. x-axis: Number of Nodes (0 to 32); y-axis: Efficiency of GPU Cluster (0% to 100%).]
Figure 10: Efficiency of the GPU cluster with respect to the number of nodes.

Our simulation computes 640 × 320 × 80 = 15.6M LBM cells in 0.317 second/step using 32 GPU nodes, resulting in 49.2M cells/second. This performance is comparable with supercomputers [21, 22, 23]. In the work of Martys et al. [21], 128 × 128 × 256 = 4M LBM cells were computed in about 5 seconds/step on an IBM SP2 using 16 processors, which corresponds to 0.8M cells/second. In 2002, Massaioli and Amati [22] reported the optimized D3Q19 BGK LBM running on 16 IBM SP Nodes (16-way Nighthawk II nodes, Power3@375MHz) with 16GB shared memory using OpenMP. They computed 128 × 128 × 256 = 4M LBM cells in 0.26 second/step, which is 15.4M cells/second. They were able to further increase this performance to 20.0M cells/second using more sophisticated optimization techniques, such as (1) fusing the streaming and collision steps to reduce the memory accesses; (2) keeping distributions at rest in memory and implementing the streaming by index translation; (3) bundling the distributions in a way that relieves the Segment Lookaside Buffer (SLB) and Translation Lookaside Buffer (TLB) activities during address translation. In 2004, by using the above optimization techniques and further taking advantage of vector codes, they achieved a performance of 108.1M cells/second on 32 processors of an IBM Power4 system [23]. Still, the GPU cluster is competitive with supercomputers at a substantially lower price.
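The throughput figures quoted above follow from dividing the lattice size by the time per step. The sketch below is a minimal illustration; reading "M" as 2^20 cells is an assumption that appears consistent with 640 × 320 × 80 = 15.6M and 128 × 128 × 256 = 4M, and the computed GPU cluster value differs slightly from 49.2M because the text rounds the cell count first.

```python
# Illustrative sketch: lattice throughput in cells per second.
MEGA = 2 ** 20  # assumption: "M" in the text is read as 2**20 cells


def cells_per_second(dims, seconds_per_step):
    """Number of lattice cells updated per second for a lattice of size dims."""
    nx, ny, nz = dims
    return nx * ny * nz / seconds_per_step


gpu_cluster = cells_per_second((640, 320, 80), 0.317)   # 32 GPU nodes
sp2 = cells_per_second((128, 128, 256), 5.0)            # Martys et al. [21]
sp_openmp = cells_per_second((128, 128, 256), 0.26)     # Massaioli and Amati [22]

for name, rate in [("GPU cluster", gpu_cluster), ("IBM SP2", sp2), ("IBM SP", sp_openmp)]:
    print(f"{name}: {rate / MEGA:.1f}M cells/second")
```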
In the above discussion, we have chosen to fix the size of each sub-domain so as to maximize the performance of each GPU node. This means that by using more nodes we can obtain more cycles to compute larger lattices within a similar time frame. Another performance criterion for a cluster is to keep the problem size fixed but increase the number of nodes to achieve a faster speed. However, we have found that in doing so, the sub-domains become smaller, resulting in a low computation/communication ratio. As a consequence, the network performance becomes the bottleneck. We thus may need a faster network in order to better exploit the computational power of the GPUs. We have tested this performance criterion with a 160 × 160 × 80 lattice, starting with 4 nodes. When the number of nodes increases from 4 to 16, the GPU cluster / CPU cluster speedup factor drops from 5.3 to 2.4. When more nodes are used, the GPU cluster and the CPU cluster gradually converge to achieve comparable performance.
5 DISPERSION SIMULATION IN NEW YORK CITY
Using the LBM, we have simulated on our GPU cluster the transport of airborne contaminants in the Times Square area of New York City. As shown in Figure 11, this area extends north from 38th Street to 59th Street, and east from 8th Avenue to Park Avenue.
Figure 11: The simulation area shown on the Manhattan map, enclosed by the blue contour. This area extends north from 38th Street to 59th Street, and east from 8th Avenue to Park Avenue. It covers an area of about 1.66 km × 1.13 km, consisting of 91 blocks and roughly 850 buildings.
The geometric model of the Times Square area that we use is a 3D polygonal mesh that has considerable detail and accuracy (see Figure 12). It covers an area of about 1.66 km × 1.13 km, consisting of 91 blocks and roughly 850 buildings. We model the flow using the D3Q19 BGK LBM with a 480 × 400 × 80 lattice. This simulation is executed on 30 nodes of the GPU cluster (each node computes an 80^3 sub-domain). The urban model is rotated to align it with the LBM domain axes. It occupies a lattice area of 440 × 300 on the ground. As a result, the simulation resolution is about 3.8 meters / lattice spacing. We simulate a northeasterly wind with a velocity boundary condition on the right side of the LBM domain. The LBM flow model runs at 0.31 second/step on the GPU cluster. After 1000 steps of LBM computation, the pollution tracer particles begin to propagate along the LBM lattice links according to transition probabilities obtained from the LBM velocity distributions [19].
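The node count and resolution quoted above can be recovered from the lattice dimensions with a few lines of arithmetic. The sketch below is illustrative only; it assumes that the 1.66 km side of the modeled area maps to the 440-cell side of the ground footprint and the 1.13 km side to the 300-cell side, which the text does not state explicitly.

```python
# Illustrative sketch: sub-domain layout and spatial resolution of the city run.
SUB = 80  # edge length of the cubic sub-domain computed by each node


def node_grid(lattice, sub=SUB):
    """Number of sub-domains along each lattice axis (assumes exact divisibility)."""
    assert all(n % sub == 0 for n in lattice), "lattice must be a multiple of sub"
    return tuple(n // sub for n in lattice)


def resolution_m(extent_m, cells):
    """Physical size of one lattice spacing, in meters."""
    return extent_m / cells


grid = node_grid((480, 400, 80))        # (6, 5, 1): a 2D arrangement of nodes
nodes = grid[0] * grid[1] * grid[2]     # 30 nodes in total
dx = resolution_m(1660.0, 440)          # long side of the modeled area
dy = resolution_m(1130.0, 300)          # short side of the modeled area
print(grid, nodes, round(dx, 2), round(dy, 2))
```

Both directions give a spacing close to 3.8 meters, and the single layer of sub-domains in the vertical direction matches the 2D arrangement described in Section 4.4.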
Figure 12 shows the velocity field visualized with streamlines at time step 1000. The blue color of the streamlines indicates that the direction of velocity is approximately horizontal, while the white color indicates a vertical component in the velocity as the flow passes over the buildings. Figure 13 shows a snapshot of the dispersion simulation with volume rendering of the contaminant density.
Currently, we render the images off-line. In the future, we plan to make better use of the GPUs by rendering the results on-line. A potential advantage of the GPU cluster is that on-line visualization is feasible and efficient. Since the simulation results already reside in the GPUs, each node could rapidly render its contents, and the images could then be transferred through a specially designed composing network to form the final image. HP is already developing new technology [12] for its Sepia PCI cards [25] that can read out data from the GPU through the DVI port and transfer them at a rate of 450-500 MB/second in its composing network. This feature will enable immediate visual feedback for computational steering.
6 DISCUSSION: OTHER COMPUTATIONS ON THE
GPU CLUSTER
As discussed in Section 1, many kinds of computations have been ported to the GPU. Many of these have the potential to run on a GPU cluster as well. The limitations lie in the inability of the GPU to efficiently handle complex data structures and complex control flows. One approach to this problem is to let the GPU and CPU work together, each doing the job that it does best. This has been illustrated by Carr et al. [7], who used the CPU to organize the data structure and the GPU to compute ray-triangle intersections. This hybrid computation makes it possible to apply the GPU cluster to more computational problems. Since our main focus is flow simulation, in the following we discuss the possibility of computing cellular automata, explicit and implicit PDE methods, and FEM on the GPU cluster.
Since the LBM is a kind of explicit numerical method on a structured grid, we expect that GPU cluster computing can be applied to the entire class of explicit methods on structured grids and to cellular automata as well. For explicit methods on unstructured grids, the main challenge is to represent the grid in textures. If the grid connectivity does not change during computation, the structure can be laid out in textures in a preprocessing step. The data associated with the grid points can be laid out in textures in the order of point IDs. Using indirection textures, the texture coordinates of the neighbors of each point can also be stored. Hence, accessing neighbor variables will require two texture fetch operations. The first operation fetches the texture coordinates of the neighbor. Using the coordinates, the second operation fetches the neighbor variables.
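The two-fetch access pattern can be illustrated on the CPU with plain arrays standing in for textures. The sketch below is only an analogy: the 4-point connectivity, the names `values` and `neighbor_coords`, and the averaging update are all hypothetical and are not taken from any GPU implementation.

```python
# Illustrative sketch: neighbor access on an unstructured grid through an
# indirection table, with flat Python lists standing in for textures.

# "Data texture": one value per grid point, stored in point-ID order.
values = [10.0, 20.0, 30.0, 40.0]

# "Indirection texture": for each point, the storage locations of its neighbors.
# This 4-point connectivity is hypothetical and fixed before the computation.
neighbor_coords = [
    [1, 2],  # neighbors of point 0
    [0, 3],  # neighbors of point 1
    [0, 3],  # neighbors of point 2
    [1, 2],  # neighbors of point 3
]


def gather_neighbors(point_id):
    """Two dependent fetches per neighbor: coordinates first, then the value."""
    result = []
    for slot in range(len(neighbor_coords[point_id])):
        coord = neighbor_coords[point_id][slot]  # fetch 1: where the neighbor lives
        result.append(values[coord])             # fetch 2: the neighbor's variable
    return result


def average_step():
    """One explicit update: replace each value by the mean of its neighbors."""
    return [sum(gather_neighbors(p)) / len(neighbor_coords[p])
            for p in range(len(values))]


print(average_step())
```

Because the connectivity table is built once in a preprocessing step, only the data array changes between iterations.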
To parallelize explicit methods on the GPU cluster, the domain can be decomposed into local sub-domains (see Figure 14). For each GPU node, we denote the grid points inside its sub-domain as local points and the grid points outside its sub-domain whose variables need to be accessed as neighbor points. All other points are called external points. Non-local gather operations, which involve accessing the data of neighbor points, can be achieved as local gather operations by adding proxy points at the computation boundary to store the variables of neighbor points obtained over the network.
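A minimal one-dimensional sketch of the proxy-point idea follows. It assumes a three-point averaging stencil and two sub-domains, and it replaces the network exchange with an in-process copy; all function names are illustrative. The point of the example is that, once the proxy points are filled, each sub-domain can be updated with purely local gathers and still match the undecomposed result.

```python
# Illustrative sketch: a 1D explicit stencil split over two "nodes" with proxy
# points. Plain lists stand in for GPU textures; the exchange is an in-process
# copy standing in for network communication.


def stencil_step(u):
    """Explicit averaging stencil on interior points; end points are held fixed."""
    return [u[0]] + [(u[i - 1] + u[i] + u[i + 1]) / 3.0
                     for i in range(1, len(u) - 1)] + [u[-1]]


def split_with_proxies(u):
    """Give each half its local points plus one proxy point from the other half."""
    mid = len(u) // 2
    left = u[:mid] + [u[mid]]        # last entry is a proxy for the right neighbor
    right = [u[mid - 1]] + u[mid:]   # first entry is a proxy for the left neighbor
    return left, right


def decomposed_step(u):
    """Exchange proxies, run the purely local stencil, then drop the proxies."""
    left, right = split_with_proxies(u)  # "network" exchange of neighbor values
    new_left = stencil_step(left)[:-1]   # discard the proxy slot
    new_right = stencil_step(right)[1:]  # discard the proxy slot
    return new_left + new_right


u0 = [0.0, 0.0, 3.0, 6.0, 0.0, 0.0]
print(decomposed_step(u0) == stencil_step(u0))
```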
Figure 12: A snapshot of the simulation of air flow in the Times Square area of New York City at time step 1000, visualized by streamlines. The blue color indicates that the direction of velocity is approximately horizontal, while the white color indicates a vertical component in the velocity as the flow passes over buildings. Red points indicate streamline origins. Simulation lattice size is 480 × 400 × 80. (Only a portion of the simulation volume is shown in this image.)

Implicit finite differences and FEM require the solution of a large sparse linear system, Ax = y. Krüger and Westermann [16] and Bolz et al. [3] have implemented iterative methods for solving sparse linear systems, such as conjugate gradient and Gauss-Seidel, on the GPU. To scale their approach to the GPU cluster, in addition to decomposing the domain, the matrix and vector need to be decomposed so that matrix vector multiplies can be executed in parallel. In the case of a sparse linear system, the matrix and vector may be decomposed using an approach similar to one developed for a CPU cluster [32]. In each cluster node, the local matrix includes those matrix rows which correspond to local points, and the local vector includes those vector elements which correspond to the local and neighbor (proxy) points (see Figure 15). In each iteration step, network communication is needed to read the vector elements corresponding to neighbor points in order to update the proxy point elements in the local vector. Then, the local matrix and local vector multiplication is executed, and the result is the vector corresponding to local points. Since each time-step takes several iteration steps, although the network communication to local computation ratio is still on the order of O(1/N), the actual value of this ratio may be larger than for explicit methods on the GPU cluster.
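The row-wise decomposition described above can be sketched for a small system. The example below is a hedged illustration in plain Python: the 4 × 4 tridiagonal matrix, the dictionary-based sparse rows, the two-node ownership layout, and the function names are all hypothetical, and reading proxy entries from a shared list stands in for the per-iteration network exchange.

```python
# Illustrative sketch: row-wise decomposition of y = A x over two "nodes".
# Sparse rows are dicts {global column: value}; the layout is hypothetical.

A = [
    {0: 2.0, 1: -1.0},
    {0: -1.0, 1: 2.0, 2: -1.0},
    {1: -1.0, 2: 2.0, 3: -1.0},
    {2: -1.0, 3: 2.0},
]
x = [1.0, 2.0, 3.0, 4.0]
owned = [[0, 1], [2, 3]]  # global point IDs local to node 0 and node 1


def build_local_system(node):
    """Local matrix rows plus the list of neighbor (proxy) points they touch."""
    local = owned[node]
    rows = [A[i] for i in local]
    proxies = sorted({j for row in rows for j in row} - set(local))
    return rows, local + proxies  # column order of the local vector


def local_matvec(node, x_global):
    """Gather local and proxy entries, then multiply using local data only."""
    rows, columns = build_local_system(node)
    # Reading proxy entries stands in for the per-iteration network exchange.
    x_local = {j: x_global[j] for j in columns}
    return [sum(v * x_local[j] for j, v in row.items()) for row in rows]


y = local_matvec(0, x) + local_matvec(1, x)
print(y)
```

Each node needs only one proxy entry here, which reflects why the communication volume scales with the sub-domain boundary rather than its interior.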
7 CONCLUSIONS
In this paper, we propose the use of a cluster of commodity GPUs for high performance scientific computing. Adding 32 GPUs to a CPU cluster for computation increases the theoretical peak performance by 512 Gflops at the cost of $12,768. To demonstrate the GPU cluster performance, we used the LBM to simulate the transport of airborne contaminants in the Times Square area of New York City with a resolution of 3.8 meters and performance of 0.31 second/step on 30 nodes. Compared to the same model implemented on the CPU cluster, the speed-up is above 4.6 and better performance is anticipated. Considering the rapid evolution of GPUs, we believe that the GPU cluster is a very promising machine for scientific computation. Our approach is not limited to LBM, and we also discussed methods for implementing other numerical methods on the GPU cluster including cellular automata, finite differences, and FEM.
Figure 13: A snapshot of the simulation of air flow in the Times Square area with dispersion density volume rendered.
8 ACKNOWLEDGEMENTS
This work has been supported by an NSF grant CCR-
0306438 and a grant from the Department of Homeland Se-
curity, Environment Measurement Lab. We would like to
thank Bin Zhang for setting up and maintaining the Stony
Brook Visual Computing Cluster. We also thank Li Wei for
his early work on the single GPU accelerated LBM, and Ye
Zhao, Xiaoming Wei and Klaus Mueller for helpful discus-
sions on LBM related issues. Finally, we would like to ac-
knowledge HP and Terarecon for their contributions and help
with our cluster.
REFERENCES
[1] General-Purpose Computation Using Graphics Hardware
(GPGPU). http://www.gpgpu.org.
[2] J. Backus. Can programming be liberated from the von Neu-
mann style? A functional style and its algebra of programs.
ACM Turing Award Lecture, 1977.
[3] J. Bolz, I. Farmer, E. Grinspun, and P. Schröder. Sparse matrix
solvers on the GPU: conjugate gradients and multigrid. ACM
Trans. Graph. (SIGGRAPH), 22(3):917-924, 2003.
[4] M. Brown, M. Leach, R. Calhoun, W.S. Smith, D. Stevens,
J. Reisner, R. Lee, N.-H. Chin, and D. DeCroix. Multiscale
modeling of air flow in Salt Lake City and the surrounding
region. ASCE Structures Congress, 2001. LA-UR-01-509.
[5] M. Brown, M. Leach, J. Reisner, D. Stevens, S. Smith, H.-
N. Chin, S. Chan, and B. Lee. Numerical modeling from
mesoscale to urban scale to building scale. AMS 3rd Urb.
Env. Symp., 2000.
[6] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian,
M. Houston, and P. Hanrahan. Brook for GPUs: Stream
Computing on Graphics Hardware. ACM Trans. Graph. (SIG-
GRAPH), to appear, 2004.
[7] N. A. Carr, J. D. Hall, and J. C. Hart. The ray engine. Proceedings
of Graphics Hardware, pages 37-46, September 2002.
[8] D. d'Humières, M. Bouzidi, and P. Lallemand. Thirteen-
velocity three-dimensional lattice Boltzmann model. Phys.
Rev. E, 63(066702), 2001.
[9] N. K. Govindaraju, A. Sud, S.-E. Yoon, and D. Manocha.
Interactive visibility culling in complex environments using
occlusion-switches. In Proceedings Symposium on Interactive
3D Graphics, pages 103-112, 2003.
[10] M. Harris, G. Coombe, T. Scheuermann, and A. Lastra.
Physically-based visual simulation on graphics hardware.
SIGGRAPH/Eurographics Workshop on Graphics Hardware,
pages 109-118, September 2002.
[11] M. J. Harris. GPGPU: Beyond graphics. Eurographics Tuto-
rial, August 2004.
[12] A. Heirich, P. Ezolt, M. Shand, E. Oertli, and G. Lupton. Performance
scaling and depth/alpha acquisition in DVI graphics
clusters. In Proc. Workshop on Commodity-Based Visualization
Clusters CCViz02, 2002.

[Figure 14 diagram. Labels: Local Sub-domain, Local points, Neighbor points, External points, Proxy points, Adding proxy points.]
Figure 14: Decomposing the grid and adding proxy points to support non-local gather operations.

[13] G. Humphreys, M. Eldridge, I. Buck, G. Stoll, M. Everett,
and P. Hanrahan. WireGL: a scalable graphics system for
clusters. In Proceedings of the 28th Annual Conference on
Computer Graphics and Interactive Techniques (SIGGRAPH),
pages 129-140, 2001.
[14] G. Humphreys, M. Houston, R. Ng, R. Frank, S. Ahern,
P. D. Kirchner, and J. T. Klosowski. Chromium: a stream-
processing framework for interactive rendering on clusters.
In Proceedings of the 29th Annual Conference on Computer
Graphics and Interactive Techniques (SIGGRAPH), pages
693-702, 2002.
[15] D. Kirk. Innovation in graphics technology. Talk in Canadian
Undergraduate Technology Conference, 2004.
[16] J. Krüger and R. Westermann. Linear algebra operators for
GPU implementation of numerical algorithms. ACM Trans.
Graph. (SIGGRAPH), 22(3):908-916, 2003.
[17] P. Lallemand and L. Luo. Theory of the lattice Boltzmann
method: Acoustic and thermal properties in two and three di-
mensions. Phys. Rev. E, 68(036706), 2003.
[18] W. Li, X. Wei, and A. Kaufman. Implementing lattice Boltz-
mann computation on graphics hardware. Visual Computer,
19(7-8):444-456, December 2003.
[19] C. P. Lowe and S. Succi. Go-with-the-flow lattice Boltzmann
methods for tracer dynamics, chapter 9. Lecture Notes in
Physics. Springer-Verlag, 2002.
[20] W. R. Mark, R. S. Glanville, K. Akeley, and M. J. Kilgard.
Cg: a system for programming graphics hardware in a C-like
language. ACM Trans. Graph. (SIGGRAPH), 22(3):896-907,
2003.
[21] N. Martys, J. Hagedorn, D. Goujon, and J. Devaney. Large
scale simulations of single and multi-component flow in
porous media. Proceedings of The International Symposium
on Optical Science, Engineering, and Instrumentation, June
1999.

[Figure 15 diagram. Labels: Local Sub-domain, Local points, Proxy points, Constructing local system, Matrix, Vector, Result.]
Figure 15: Decomposition of a matrix and a vector to implement matrix vector multiplies in parallel.

[22] F. Massaioli and G. Amati. Optimization and scaling of an
OpenMP LBM code on IBM SP nodes. Scicomp06 Talk, Au-
gust 2002.
[23] F. Massaioli and G. Amati. Performance portability of a lattice
Boltzmann code. Scicomp09 Talk, March 2004.
[24] R. Mei, W. Shyy, D. Yu, and L. S. Luo. Lattice Boltzmann
method for 3-D flows with curved boundary. J. Comput.
Phys., 161:680-699, March 2000.
[25] L. Moll, A. Heirich, and M. Shand. Sepia: scalable 3D
compositing using PCI pamette. In Proc. IEEE Symposium
on Field Programmable Custom Computing Machines, pages
146-155, April 1999.
[26] S. Succi. The Lattice Boltzmann Equation for Fluid Dynamics
and Beyond. Numerical Mathematics and Scientific Computation.
Oxford University Press, 2001.
[27] A. T. C. Tam and C.-L. Wang. Contention-aware communi-
cation schedule for high-speed communication. Cluster Com-
puting, (4), 2003.
[28] C. J. Thompson, S. Hahn, and M. Oskin. Using modern graph-
ics architectures for general-purpose computing: A frame-
work and analysis. International Symposium on Microarchi-
tecture (MICRO), November 2002.
[29] S. Venkatasubramanian. The graphics card as a stream com-
puter. SIGMOD Workshop on Management and Processing of
Massive Data, June 2003.
[30] A. Wilen, J. Schade, and R. Thornburg. Introduction to
PCI Express: A Hardware and Software Developer's Guide.
2003.
[31] D. A. Wolf-Gladrow. Lattice Gas Cellular Automata and Lat-
tice Boltzmann Models: an Introduction. Springer-Verlag,
2000.
[32] F. Zara, F. Faure, and J-M. Vincent. Physical cloth simulation
on a PC cluster. In Proceedings of the Fourth Eurographics
Workshop on Parallel Graphics and Visualization, pages 105-112,
2002.
