Figure 8: The network communication time on the GPU cluster, measured in ms (y-axis: time; legend: non-overlapping vs. overlapping). The area under the blue line represents the part of the network communication time that was overlapped with computation; the shaded area represents the remainder.
When only a single node is used, the speedup factor is 6.64. This
value projects the theoretical maximum GPU cluster / CPU
cluster speedup factor that could be reached if all communication
bottlenecks were eliminated by a better-optimized
network and larger GPU/CPU bandwidth. When the number
of nodes is below 28, the network communication is totally
overlapped with the computation. Accordingly, growth in the
number of nodes only marginally increases the execution time,
due to the GPU/CPU communication, and the curve flattens at
approximately 5. When the number of nodes increases to 28 or
above, the network communication can no longer be totally
overlapped, resulting in a drop in the curve.
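The overlap behavior described above can be sketched with a toy cost model (not from the paper; the constants below are invented purely so that the crossover lands near the 28 nodes observed in the experiments): with full overlap, the per-step time is max(T_comp, T_comm) rather than their sum, so communication is free until it outgrows the shrinking per-node computation.

```python
# Toy model (illustrative, assumed constants): with overlapped communication,
# per-step time is max(T_comp, T_comm); without overlap it would be the sum.

def comp_time(n_nodes, total_work=100.0):
    """Per-node computation time; the work divides across nodes."""
    return total_work / n_nodes

def comm_time(n_nodes, cost_per_node=0.13):
    """Per-step communication time; boundary traffic grows with node count."""
    return cost_per_node * n_nodes

def overlapped_step(n_nodes):
    """Per-step time when communication runs concurrently with computation."""
    return max(comp_time(n_nodes), comm_time(n_nodes))

# Communication stays fully hidden while T_comm <= T_comp; with these
# toy constants the crossover (where comm starts to dominate) is 28 nodes.
crossover = min(n for n in range(1, 33) if comm_time(n) > comp_time(n))
print(crossover)  # → 28
```

The shape matches the curve described in the text: flat while communication is hidden, then declining once the communication term dominates the max().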
Three enhancements can further improve this speedup factor
without changing the way we map the LBM computation
onto the GPU cluster: (1) using a faster network, such as
Myrinet; (2) using the PCI-Express bus, which will become
available later this year, to achieve faster communication
between the GPU and the system and to plug multiple GPUs
into each PC; and (3) using GPUs with larger texture memories.
Figure: Speedup factor, GPU cluster / CPU cluster, plotted against the number of nodes (x-axis: 0 to 32; y-axis: 0 to 7).

Figure 15: Decomposition of a matrix and a vector to implement matrix-vector multiplies in parallel (legend: local points, proxy points, vector, result).
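A minimal serial sketch of the decomposition idea in the Figure 15 caption (the row-block split, the `parallel_matvec` name, and the example data are assumptions for illustration; the handling of proxy/ghost points exchanged between nodes is omitted): each node holds a block of matrix rows plus the vector, multiplies locally, and the partial results are concatenated.

```python
# Sketch (assumed decomposition): split the matrix into row blocks, one per
# "node", multiply each block by the shared vector, and concatenate.

def matvec_block(rows, vec):
    """Local multiply: one node's row block times the shared vector."""
    return [sum(a * x for a, x in zip(row, vec)) for row in rows]

def parallel_matvec(matrix, vec, n_nodes):
    """Emulate the per-node decomposition serially, one row block per node."""
    block = (len(matrix) + n_nodes - 1) // n_nodes  # rows per node, rounded up
    result = []
    for node in range(n_nodes):
        rows = matrix[node * block:(node + 1) * block]
        result.extend(matvec_block(rows, vec))  # each node contributes its slice
    return result

A = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 1]
print(parallel_matvec(A, x, 2))  # → [3, 7, 11, 15]
```

In the real cluster setting each block multiply would run on a different node, with only the vector (and any proxy points) communicated.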