
IEEE ANTENNAS AND WIRELESS PROPAGATION LETTERS, VOL. 9, 2010

Low-Frequency MLFMA on Graphics Processors

M. Cwikla, Member, IEEE, J. Aronsson, Student Member, IEEE, and V. Okhmatovski, Senior Member, IEEE

Manuscript received October 19, 2009; revised December 10, 2009. Date of publication January 15, 2010; date of current version March 05, 2010. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC). The authors are with the Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, MB R3T5V6, Canada (e-mail: mcwikla@ieee.org). Digital Object Identifier 10.1109/LAWP.2010.2040571

Abstract—A parallelization of the low-frequency multilevel fast multipole algorithm (MLFMA) for graphics processing units (GPUs) is presented. The implementation exhibits speedups between 10 and 30 compared to a serial CPU implementation of the algorithm. The error of the MLFMA on the GPU is controllable down to machine precision. Under the typical method-of-moments (MoM) error requirement of three correct digits, modern GPUs are shown to handle problems with up to 7.5 million degrees of freedom in dense matrix approximation.

Index Terms—CUDA, graphics processing unit (GPU), low-frequency fast multipole method, fast algorithms, multiscale modeling.
I. INTRODUCTION

THE multilevel fast multipole algorithm (MLFMA) is today's most powerful method for solving large-scale electromagnetic problems with the boundary-element method of moments (MoM) [1]. It reduces the computational complexity and memory requirement of the MoM's dense matrix-vector products from $O(N^2)$ to $O(N \log N)$, $N$ being the number of unknowns in the MoM discretization. While the high-frequency MLFMA is computationally the fastest version of multipole-centric algorithms, it suffers from the low-frequency breakdown [2]. To remedy this limitation, several low-frequency versions of the MLFMA have recently been proposed [2]–[5]. Combined with the high-frequency MLFMA, they can be used in the construction of broadband MLFMAs for large problems featuring multiscale geometries.
As the MLFMA is intended for solving large-scale problems, various parallel versions of the algorithm have been developed in the past, predominantly for CPU-based clusters of workstations. With the advent of graphics processing unit (GPU) technology, which in recent years has become more than an order of magnitude faster than the CPU in terms of floating-point operations per second (FLOPS), a high interest in the parallelization of computationally intensive algorithms for GPUs arose in the computational community [6]. However, since the hardware is optimized for graphics applications, special care must be taken in the parallelization of such algorithms for the GPU.

In the case of the MLFMA, its first parallelization for GPUs was proposed in [7] for single-precision arithmetic and the Laplace kernel. There, speedups of 30 to 60 were achieved when compared with serial CPU runs. To date, this remains the only publication on MLFMA parallelization for GPUs. In this letter, we describe a GPU implementation of the low-frequency MLFMA with the Helmholtz kernel [2] using double-precision arithmetic. A parallelization pattern similar to the one discussed in [7] is utilized. The methodology achieves speedups of over 30 compared to a conventional serial implementation of the low-frequency MLFMA on a CPU.

II. FIELD EXPANSIONS IN LOW-FREQUENCY MLFMA

Consider a spatial distribution of $N$ time-harmonic point sources located at $\mathbf{r}_1, \ldots, \mathbf{r}_N$ and having magnitudes $q_1, \ldots, q_N$, respectively. The field produced by such sources at an observation point $\mathbf{r}$ is given by [9]

$\psi(\mathbf{r}) = \sum_{i=1}^{N} q_i \, \dfrac{e^{ik|\mathbf{r} - \mathbf{r}_i|}}{|\mathbf{r} - \mathbf{r}_i|}$   (1)

where $\lambda$ is the wavelength in free space and $k = 2\pi/\lambda$. In (1) and everywhere below, the time-harmonic dependence $e^{-i\omega t}$ is assumed and suppressed for brevity.

The MLFMA represents the field in (1) using truncated multipole expansions in spherical coordinates as

$\psi(\mathbf{r}) = \sum_{n=0}^{p} \sum_{m=-n}^{n} M_n^m \, h_n(kr) \, Y_n^m(\theta,\varphi)$   (2)

for observation points outside a spherical domain enclosing all the sources, and

$\psi(\mathbf{r}) = \sum_{n=0}^{p} \sum_{m=-n}^{n} L_n^m \, j_n(kr) \, Y_n^m(\theta,\varphi)$   (3)

for observation points inside a spherical domain that does not contain the sources. In (2) and (3), $j_n$ and $h_n$ are the spherical Bessel and Hankel functions of order $n$, respectively, $Y_n^m$ are the spherical harmonics of degree $n$ and order $m$, and $p$ is the truncation number. Expansions of the form (2) and (3) are known as $M$-expansions and $L$-expansions, respectively [9]. The coefficients $M_n^m$ and $L_n^m$ in (2) and (3) depend on the distribution of sources as follows:
$M_n^m = ik \sum_{i=1}^{N} q_i \, j_n(k r_i) \, Y_n^{-m}(\theta_i,\varphi_i)$   (4)

$L_n^m = ik \sum_{i=1}^{N} q_i \, h_n(k r_i) \, Y_n^{-m}(\theta_i,\varphi_i)$   (5)

where $(r_i, \theta_i, \varphi_i)$ are the spherical coordinates of the source location $\mathbf{r}_i$ with respect to the expansion center, and the spherical harmonics are normalized as in [9].
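Equation (1) is the dense matrix-vector product that the MLFMA accelerates. A minimal host-side sketch of its direct evaluation is given below; it is used in this letter's register only as an illustration and as an accuracy reference for Section V, and the identifiers (directField, Point) are illustrative rather than taken from [8].

#include <cmath>
#include <complex>
#include <vector>

struct Point { double x, y, z; };

// Direct evaluation of the field (1): psi(r) = sum_i q_i * exp(ik|r - r_i|) / |r - r_i|.
// Cost is O(N) per observation point; this dense sum is what the MLFMA replaces.
std::complex<double> directField(const Point& r,
                                 const std::vector<Point>& src,
                                 const std::vector<std::complex<double>>& q,
                                 double k) {
    const std::complex<double> I(0.0, 1.0);
    std::complex<double> psi(0.0, 0.0);
    for (size_t i = 0; i < src.size(); ++i) {
        double dx = r.x - src[i].x, dy = r.y - src[i].y, dz = r.z - src[i].z;
        double R = std::sqrt(dx * dx + dy * dy + dz * dz);
        psi += q[i] * std::exp(I * (k * R)) / R;  // Green's function e^{ikR}/R
    }
    return psi;
}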

The MLFMA subdivides the computational domain recursively [9]. At the $l$th level of subdivision, the domain is split into $8^l$ cubes (boxes), $l = 0, 1, \ldots, l_{\max}$, where level $l = 0$ corresponds to the original domain and $l = l_{\max}$ to the leaf nodes in the tree structure. To represent the field outside a box due to the sources contained inside of it, the MLFMA expands the box's field according to an $M$-expansion (2) defined by coefficients $M_n^m$ at levels $l = 2, \ldots, l_{\max}$. It also computes the coefficients $L_n^m$ of an $L$-expansion (3) for the field inside each box due to the sources outside of it at levels $l = 2, \ldots, l_{\max}$ [9].
The translation of the coefficients $M_n^m$ and $L_n^m$ from level to level can be performed via the rotate-translate-rotate (or "point-and-shoot") method [9]. This method first rotates the coordinate system so that the translation occurs along the $z$-axis in the new coordinate system (forward rotation). It then performs the translation of the expansion coefficients along the new $z$-axis (coaxial translation) and, finally, rotates the coordinate system back to its original orientation (backward rotation). This reduces the complexity of translations from $O(p^4)$ (using the naive method) to $O(p^3)$. The forward rotation, coaxial translation, and backward rotation are formalized as

$\widetilde{C}_n^m = \sum_{m'=-n}^{n} A_n^{m,m'} \, C_n^{m'}$   (6)

$\widehat{C}_n^m = \sum_{n'=|m|}^{p} B_{n,n'}^{m} \, \widetilde{C}_{n'}^{m}$   (7)

$C_n^{\prime\, m} = \sum_{m'=-n}^{n} \widetilde{A}_n^{m,m'} \, \widehat{C}_n^{m'}$   (8)

Each set of coefficients $A_n^{m,m'}$, $B_{n,n'}^{m}$, and $\widetilde{A}_n^{m,m'}$ can be computed in $O(p^3)$ time via a recurrence relation [9]. In (6), the input coefficients $C_n^m$ stand for $M_n^m$ or $L_n^m$, depending on whether translation to the $M$-expansion or to the $L$-expansion is considered; likewise, in (8), the output coefficients $C_n^{\prime\, m}$ are the translated $M$-expansion or $L$-expansion coefficients, respectively. Detailed computational procedures for both the rotation and translation coefficients are given in [9].
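The three stages of (6)-(8) are each a small dense matrix-vector product, which is what makes the operation GPU-friendly. The following is a minimal host-side sketch of one point-and-shoot translation, assuming the operators $A$, $B$, and $\widetilde{A}$ (here rot_fwd, coax, and rot_bwd) have already been precomputed, e.g., by the recurrences of [9]; the packing helper idx and all identifiers are assumptions of this sketch, not the authors' code.

#include <complex>
#include <cstdlib>
#include <vector>

using cplx = std::complex<double>;

// Index of coefficient (n, m), m = -n..n, packed as n*n + (m + n).
inline int idx(int n, int m) { return n * n + (m + n); }

// Apply one rotate-translate-rotate translation to the coefficients C of a
// truncated expansion with truncation number p.  rot_fwd[n] and rot_bwd[n]
// are (2n+1)x(2n+1) rotation matrices (A and A~); coax[m+p] holds the
// coaxial translation operator B for order m.  Each stage costs O(p^3).
std::vector<cplx> rotateTranslateRotate(
    const std::vector<cplx>& C, int p,
    const std::vector<std::vector<cplx>>& rot_fwd,   // forward rotation, (6)
    const std::vector<std::vector<cplx>>& coax,      // coaxial translation, (7)
    const std::vector<std::vector<cplx>>& rot_bwd)   // backward rotation, (8)
{
    const int ncoef = (p + 1) * (p + 1);
    std::vector<cplx> Ct(ncoef), Ch(ncoef), Cout(ncoef);

    // (6) forward rotation: mixes orders m within each degree n.
    for (int n = 0; n <= p; ++n)
        for (int m = -n; m <= n; ++m) {
            cplx acc = 0.0;
            for (int mp = -n; mp <= n; ++mp)
                acc += rot_fwd[n][(m + n) * (2 * n + 1) + (mp + n)] * C[idx(n, mp)];
            Ct[idx(n, m)] = acc;
        }

    // (7) coaxial translation: mixes degrees n within each fixed order m.
    for (int m = -p; m <= p; ++m)
        for (int n = std::abs(m); n <= p; ++n) {
            cplx acc = 0.0;
            for (int np = std::abs(m); np <= p; ++np)
                acc += coax[m + p][n * (p + 1) + np] * Ct[idx(np, m)];
            Ch[idx(n, m)] = acc;
        }

    // (8) backward rotation: undoes the forward rotation.
    for (int n = 0; n <= p; ++n)
        for (int m = -n; m <= n; ++m) {
            cplx acc = 0.0;
            for (int mp = -n; mp <= n; ++mp)
                acc += rot_bwd[n][(m + n) * (2 * n + 1) + (mp + n)] * Ch[idx(n, mp)];
            Cout[idx(n, m)] = acc;
        }
    return Cout;
}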

III. GPU ARCHITECTURE AND PROGRAMMING MODEL

Before detailing the GPU parallelization of the key MLFMA operations in (1)–(8), we overview the key features of today's GPU architecture and the associated programming considerations.

A typical GPU card architecture, its interface with the CPU, as well as the available computational resources are depicted in Fig. 1. GPU cards feature a large amount of high-latency global memory (up to 4 GB) and a small amount of low-latency shared memory. To mitigate the effects of high global memory access latency, the following techniques can be exercised in the development of a parallel MLFMA for the GPU.

Fig. 1. Typical architecture of a GPU card and its interfacing to the CPU.

Coalescing Global Memory Access: As access to global memory incurs a latency of 400 to 600 clock cycles [10], it is important to coalesce such operations whenever possible. A single sequence of instructions on a GPU is referred to as a thread. Threads are grouped into warps of up to 32 threads each. On an nVidia GPU, one of the conditions required for optimal memory bandwidth is to have the first 16 or last 16 threads of a warp access global memory in a coalesced fashion [10]. Examples of coalesced and uncoalesced memory accesses are shown in Fig. 2. For optimal performance during kernel executions, the coefficients $M_n^m$ and $L_n^m$ are stored in global memory in a manner such that read and write operations can be coalesced. As an example, the coefficients $M_0^0$ for contiguously numbered boxes are stored in the first contiguous locations in global memory referenced by a pointer, followed by the coefficients $M_1^{-1}$, and so on.

Fig. 2. Examples of coalesced and uncoalesced memory accesses.
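The coefficient-major storage described above can be sketched as follows. This is a minimal CUDA illustration, assuming one flat array of $(p+1)^2$ complex coefficients per box; the names coefIndex and scaleCoeffs are illustrative only. Because consecutive threads handle consecutive boxes, each coefficient read by a warp touches consecutive addresses and can be coalesced.

#include <cuComplex.h>

// Coefficient-major layout: all boxes' (n,m) = (0,0) coefficients first,
// then all boxes' (1,-1) coefficients, and so on.  For box b and packed
// coefficient index c (c = n*n + m + n), the element lives at c*numBoxes + b.
__host__ __device__ inline size_t coefIndex(int c, int b, int numBoxes) {
    return static_cast<size_t>(c) * numBoxes + b;
}

// Example kernel: one thread per box.  Threads of a warp process
// consecutive boxes, so each access M[c*numBoxes + b] is coalesced.
__global__ void scaleCoeffs(cuDoubleComplex* M, int numBoxes, int numCoeffs,
                            double alpha) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= numBoxes) return;
    for (int c = 0; c < numCoeffs; ++c) {
        size_t k = coefIndex(c, b, numBoxes);
        M[k] = make_cuDoubleComplex(alpha * cuCreal(M[k]), alpha * cuCimag(M[k]));
    }
}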
Caching Repetitively Used Data: Some frequently accessed data in the MLFMA, such as the rotation and translation coefficients of the rotate-translate-rotate method [9], can be stored in the low-latency shared memory available on each multiprocessor of the GPU.

Latency Hiding: Multiple blocks should be kept active on each multiprocessor so that, while a certain block waits for a memory access, another block on the same multiprocessor can continue to execute.
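A short sketch of both ideas is given below, with the usual caveat that it is an illustration rather than the implementation of [8]: a small tile of repeatedly used translation coefficients is staged in shared memory, and the tile is kept small so that several blocks can remain resident per multiprocessor (latency hiding). The identifiers and the toy arithmetic are assumptions of this sketch.

#include <cuComplex.h>

constexpr int TILE = 256;  // a few hundred cached values keeps occupancy high

__global__ void translateWithCachedCoeffs(const cuDoubleComplex* __restrict__ coax,
                                          const cuDoubleComplex* __restrict__ src,
                                          cuDoubleComplex* dst,
                                          int numCoeffs, int numPairs) {
    __shared__ cuDoubleComplex cache[TILE];

    // Cooperative, coalesced load of the first TILE coefficients.
    for (int i = threadIdx.x; i < TILE && i < numCoeffs; i += blockDim.x)
        cache[i] = coax[i];
    __syncthreads();

    int pair = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per pair
    if (pair >= numPairs) return;

    // Toy use of the cached data: a weighted accumulation of the source
    // coefficients (it stands in for the real translation arithmetic).
    cuDoubleComplex acc = make_cuDoubleComplex(0.0, 0.0);
    for (int c = 0; c < numCoeffs && c < TILE; ++c)
        acc = cuCadd(acc, cuCmul(cache[c], src[(size_t)c * numPairs + pair]));
    dst[pair] = acc;
}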
IV. PARALLEL LOW-FREQUENCY MLFMA ON GPU

A. Preprocessing on CPU

The parallel MLFMA implementation on the GPU begins with preprocessing steps that are performed on the CPU prior to the traversal of the tree. First, the nonempty leaf boxes are sequentially indexed. The sources inside them are then renumbered so that the point sources situated in the first leaf box come first, those in the second leaf box next, and so on; in this way, the sources of each leaf box occupy a contiguous range of indices. In the second preprocessing step, the translation and rotation coefficients appearing in (6)–(8) are precomputed on the CPU for all levels and copied to the GPU's global memory (Fig. 1). This step requires less than 2 s to complete on Intel's Xeon E5520 CPU running at 2.26 GHz for a seven-level tree with truncation numbers reaching $p = 30$ at the coarsest tree level. The preprocessing steps are run on the CPU because of their minimal impact on the overall run time. Upon completion of the preprocessing stage, the traversal of the tree begins.
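The source renumbering step can be sketched on the CPU as follows, assuming each source already carries the index of the (nonempty) leaf box that contains it; the Source structure, renumberSourcesByLeafBox, and boxStart are names introduced only for this sketch.

#include <algorithm>
#include <cstdint>
#include <vector>

struct Source {
    double x, y, z;      // position
    double qRe, qIm;     // complex magnitude
    uint32_t leafBox;    // index of the nonempty leaf box containing it
};

// Renumber the sources so that all sources of leaf box 0 come first, then
// those of leaf box 1, and so on.  After this step the sources of a given box
// occupy one contiguous index range, which is what allows the GPU kernels to
// read them with coalesced accesses.  boxStart[b] is the index of the first
// source of box b; boxStart[b+1] - boxStart[b] is its source count.
std::vector<int> renumberSourcesByLeafBox(std::vector<Source>& sources,
                                          int numLeafBoxes) {
    std::stable_sort(sources.begin(), sources.end(),
                     [](const Source& a, const Source& b) {
                         return a.leafBox < b.leafBox;
                     });
    std::vector<int> boxStart(numLeafBoxes + 1, 0);
    for (const Source& s : sources) ++boxStart[s.leafBox + 1];
    for (int b = 0; b < numLeafBoxes; ++b) boxStart[b + 1] += boxStart[b];
    return boxStart;  // prefix sums: CSR-style offsets into the source array
}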


B. Leaf Level Aggregation Kernel

Algorithm 1 Parallel Leaf Box Aggregation on GPU
1: b ← 0 {initialize box count}
2: while b < number of nonempty leaf boxes {for all boxes at leaf level} do
3:   for all boxes b, ..., b + T − 1 {for T boxes in parallel} do
4:     for each point i in the current box do
5:       ρ_i ← r_i − c_b {r_i is the ith point position, c_b is the box center; ρ_i is expressed in spherical coordinates of the box, i.e., (ρ_i, θ_i, φ_i)}
6:       for n = 0 to p do
7:         evaluate j_n(kρ_i)
8:         for m = −n to n do
9:           accumulate the contribution of source i to M_n^m of the current box according to (4)
10:        end for
11:      end for
12:    end for
13:    b ← b + T {increment by # of threads}
14:  end for
15: end while

The procedure for parallel leaf level aggregation is given by Algorithm 1. The coefficients $M_n^m$ are computed on the GPU according to (4) in parallel for several boxes at a time.¹ GPU methods for computing the spherical Bessel and Hankel functions are implemented according to [11].

¹Although the user specifies the number of blocks and threads per block, the GPU hardware is ultimately responsible for scheduling threads in parallel.
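A minimal CUDA sketch of the leaf-level aggregation (one thread per leaf box, in the spirit of Algorithm 1) is shown below. The helpers sphBesselJ and sphHarmonicConj are placeholders standing in for the special-function routines of [11], and all identifiers are assumptions of this sketch; the coefficient layout follows the coefficient-major storage of Section III.

#include <cuComplex.h>

// Placeholder special functions: a real implementation would evaluate the
// spherical Bessel function j_n and the conjugated spherical harmonic per [11].
__device__ double sphBesselJ(int n, double x) {
    return (n == 0 && x != 0.0) ? sin(x) / x : 0.0;   // correct only for n = 0
}
__device__ cuDoubleComplex sphHarmonicConj(int n, int m, double th, double ph) {
    return make_cuDoubleComplex(n == 0 ? 0.28209479177387814 : 0.0, 0.0);  // stub
}

// One thread per leaf box: accumulate the box's M-expansion coefficients
// from the sources it contains, following (4).
__global__ void leafAggregation(const double3* __restrict__ pos,       // source positions
                                const cuDoubleComplex* __restrict__ q, // source magnitudes
                                const int* __restrict__ boxStart,      // CSR offsets per box
                                const double3* __restrict__ boxCenter,
                                cuDoubleComplex* M,                    // coefficient-major output
                                int numBoxes, int p, double k) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= numBoxes) return;

    for (int i = boxStart[b]; i < boxStart[b + 1]; ++i) {
        double dx = pos[i].x - boxCenter[b].x;
        double dy = pos[i].y - boxCenter[b].y;
        double dz = pos[i].z - boxCenter[b].z;
        double r = sqrt(dx * dx + dy * dy + dz * dz);
        double theta = (r > 0.0) ? acos(dz / r) : 0.0;
        double phi = atan2(dy, dx);

        for (int n = 0; n <= p; ++n) {
            double jn = sphBesselJ(n, k * r);
            for (int m = -n; m <= n; ++m) {
                int c = n * n + m + n;                  // packed (n, m) index
                cuDoubleComplex term = cuCmul(q[i], sphHarmonicConj(n, m, theta, phi));
                term = make_cuDoubleComplex(jn * cuCreal(term), jn * cuCimag(term));
                size_t dst = (size_t)c * numBoxes + b;  // coefficient-major, coalesced
                M[dst] = cuCadd(M[dst], term);
            }
        }
    }
}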
C. Aggregation and Disaggregation Kernels

The nonleaf level aggregation procedure is given by Algorithm 2, which is executed as one thread per parent box. The disaggregation procedure is organized similarly to Algorithm 2, with the exception that it is executed as one thread per child box and that each child's $L$-expansion is directly assigned only once rather than accumulated.

Algorithm 2 Parallel Nonleaf Aggregation on GPU
The forward rotation, translation, and backward rotation coefficients $A_n^{m,m'}$, $B_{n,n'}^{m}$, and $\widetilde{A}_n^{m,m'}$, respectively, are precomputed on the CPU and copied into the shared memory of the GPU prior to execution.
1: for l = l_max − 1 down to 2 {for all nonleaf levels} do
2:   b ← 0 {initialize box count}
3:   while b < number of boxes at level l {for all boxes at level l} do
4:     for all boxes b, ..., b + T − 1 {for T boxes in parallel} do
5:       for each nonempty child of the current box do
6:         rotate the child's M-expansion coefficients forward via (6)
7:         translate the rotated coefficients coaxially to the parent's center via (7)
8:         rotate the result backward via (8)
9:         add the translated coefficients to the parent's M-expansion
10:      end for
11:    end for
12:    b ← b + T {increment by # of threads}
13:  end while
14: end for
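The upward pass at one nonleaf level can be sketched in CUDA as follows, with one thread per parent box as in Algorithm 2. The rotate-translate-rotate arithmetic of (6)-(8) is reduced to a placeholder (pointAndShoot), the child connectivity is assumed to be a CSR list, and the local-array bound assumes a modest truncation number; all names are illustrative, not the authors' identifiers.

#include <cuComplex.h>

// Placeholder for the rotate-translate-rotate step of (6)-(8); a real
// implementation would apply the precomputed A, B, and A~ coefficients.
__device__ void pointAndShoot(const cuDoubleComplex* childM, cuDoubleComplex* out,
                              int p, int childOctant) {
    for (int c = 0; c < (p + 1) * (p + 1); ++c) out[c] = childM[c];  // identity stub
}

// Upward pass at one nonleaf level: one thread per parent box.
// childPtr/childIdx form a CSR list of the nonempty children of each parent.
__global__ void nonleafAggregation(const cuDoubleComplex* __restrict__ childM,
                                   cuDoubleComplex* parentM,
                                   const int* __restrict__ childPtr,
                                   const int* __restrict__ childIdx,
                                   const int* __restrict__ childOctant,
                                   int numParents, int numChildBoxes, int p) {
    int parent = blockIdx.x * blockDim.x + threadIdx.x;
    if (parent >= numParents) return;
    int numCoeffs = (p + 1) * (p + 1);
    if (numCoeffs > 64) return;  // sketch guard: assumes (p+1)^2 <= 64 here

    cuDoubleComplex local[64], shifted[64];
    for (int j = childPtr[parent]; j < childPtr[parent + 1]; ++j) {
        int child = childIdx[j];
        // Gather the child's coefficients (coefficient-major storage).
        for (int c = 0; c < numCoeffs; ++c)
            local[c] = childM[(size_t)c * numChildBoxes + child];
        pointAndShoot(local, shifted, p, childOctant[j]);
        // Accumulate into the parent's M-expansion.
        for (int c = 0; c < numCoeffs; ++c) {
            size_t dst = (size_t)c * numParents + parent;
            parentM[dst] = cuCadd(parentM[dst], shifted[c]);
        }
    }
}

A host-side loop over the nonleaf levels, from the leaves toward level 2, launches this kernel once per level.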
D. Translation Kernel

One of the most time-consuming operations within the MLFMA is the translation of $M$-expansions to $L$-expansions. In this step, each of the boxes at level $l$ is considered to be a destination box. There are up to 189 source boxes at the same level in the destination box's interaction list [9]. The $M$-expansion in each source box within the interaction list is converted to an $L$-expansion (3) with respect to the center of the destination box. Parallelization for the GPU is as follows.

Organization of the interaction set at a given tree level begins with collecting all unique translation vectors from the centers of source boxes to the centers of destination boxes [9]; there are 316 such vectors. In this implementation, we do not consider additional symmetries [7]. For each unique vector at level $l$, we collect all of the "source box–destination box" pairs that correspond to it and sort the pairs with respect to the source box. We then iterate through each of the 316 vectors. Each iteration has as many translations to perform as there are box pairs associated with the current vector; each translation is assigned to one thread, with a large number of threads executed concurrently. This approach allows read operations from global memory to be coalesced when the coefficients appearing in (6)–(8) are read repeatedly. The amount of shared memory allocated is enough for just a few hundred coefficients to be cached, which allows multiple blocks to be active per multiprocessor. Each time a loop on $n$ or $m$ completes, the algorithm checks whether the cache needs refreshing with new coefficients and does so if necessary. Additionally, a portion of shared memory is used to cache values of the sines and cosines of multiples of the rotation angles. These are used in the "on-the-fly" computation of the rotation coefficients. Before any rotations begin, all threads compute an approximately equal number of said sines and cosines and write the results into shared memory. The above procedure is summarized in Algorithm 3.

Algorithm 3 Parallel Translation on GPU
The interaction list is computed and sorted for all levels prior to the execution of this kernel. Precomputation of the coefficients $A_n^{m,m'}$, $B_{n,n'}^{m}$, and $\widetilde{A}_n^{m,m'}$ is similar to Algorithm 2.
1: for l = 2 to l_max {for all levels} do
2:   for t = 0 to 315 {for all translation vectors} do
3:     load the rotation and translation coefficients associated with vector t and the sorted list of its source–destination pairs
4:     j ← 0 {initialize interaction pair counter for vector t}
5:     while j < number of pairs for vector t do
6:       for all pairs j, ..., j + T − 1 {for T threads in parallel (one thread per pair)} do
7:         read the source box's M-expansion coefficients from global memory (coalesced)
8:         refresh the shared-memory coefficient cache if required
9:         rotate the coefficients forward via (6)
10:        translate them coaxially via (7)
11:        rotate the result backward via (8)
12:        add the translated coefficients to the destination box's L-expansion
13:        write the updated L-expansion coefficients back to global memory (coalesced)
14:      end for
15:      j ← j + T {increment by # of threads}
16:    end while
17:  end for
18: end for
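The per-vector pass of Algorithm 3 can be sketched in CUDA as follows, under simplifying assumptions: the rotate-translate-rotate arithmetic is reduced to a placeholder (m2lPointAndShoot), the shared-memory caching is omitted, and all identifiers are illustrative rather than taken from [8]. Because the pairs handled by a launch are sorted by source box, neighboring threads read nearby coefficient addresses, which is what enables the coalesced loads described above; for a fixed translation vector each destination box occurs at most once, so no atomics are needed within a launch.

#include <cuComplex.h>

// Placeholder for converting a source box's M-expansion into an L-expansion
// contribution for one fixed translation vector via (6)-(8).
__device__ void m2lPointAndShoot(const cuDoubleComplex* srcM, cuDoubleComplex* dstL,
                                 int p, int vectorId) {
    for (int c = 0; c < (p + 1) * (p + 1); ++c) dstL[c] = srcM[c];  // identity stub
}

// M-to-L translation for ONE unique translation vector:
// one thread per (source box, destination box) pair.
__global__ void m2lForVector(const cuDoubleComplex* __restrict__ M,
                             cuDoubleComplex* L,
                             const int* __restrict__ srcBox,
                             const int* __restrict__ dstBox,
                             int numPairs, int numBoxes, int p, int vectorId) {
    int pair = blockIdx.x * blockDim.x + threadIdx.x;
    if (pair >= numPairs) return;
    int numCoeffs = (p + 1) * (p + 1);
    if (numCoeffs > 64) return;  // sketch guard: assumes (p+1)^2 <= 64 here

    cuDoubleComplex srcCoef[64], dstCoef[64];
    int s = srcBox[pair], d = dstBox[pair];
    for (int c = 0; c < numCoeffs; ++c)
        srcCoef[c] = M[(size_t)c * numBoxes + s];   // pairs sorted by source box
    m2lPointAndShoot(srcCoef, dstCoef, p, vectorId);
    for (int c = 0; c < numCoeffs; ++c) {
        size_t k = (size_t)c * numBoxes + d;
        L[k] = cuCadd(L[k], dstCoef[c]);            // each d occurs once per vector
    }
}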

E. Leaf-Level Disaggregation Kernel

The MLFMA evaluates the far-field contribution at each point from the $L$-expansion coefficients at the leaf level using (3). Parallelization of these computations is similar to Algorithm 1, with the exception that it is executed as one thread per particle and that (3) replaces (4).
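A corresponding CUDA sketch, one thread per particle evaluating the truncated sum (3) of its leaf box, is given below. As before, the special-function helpers are placeholders for the routines of [11], and the identifiers are assumptions of this sketch.

#include <cuComplex.h>

// Placeholder special functions (see the remark after Algorithm 1).
__device__ double sphBesselJ2(int n, double x) {
    return (n == 0 && x != 0.0) ? sin(x) / x : 0.0;   // correct only for n = 0
}
__device__ cuDoubleComplex sphHarm(int n, int m, double th, double ph) {
    return make_cuDoubleComplex(n == 0 ? 0.28209479177387814 : 0.0, 0.0);  // stub
}

// Leaf-level disaggregation: one thread per particle evaluates the far-field
// contribution from its leaf box's L-expansion, i.e. the truncated sum (3).
__global__ void leafDisaggregation(const double3* __restrict__ pos,
                                   const int* __restrict__ particleBox,
                                   const double3* __restrict__ boxCenter,
                                   const cuDoubleComplex* __restrict__ L,
                                   cuDoubleComplex* field,
                                   int numParticles, int numBoxes, int p, double k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numParticles) return;
    int b = particleBox[i];

    double dx = pos[i].x - boxCenter[b].x;
    double dy = pos[i].y - boxCenter[b].y;
    double dz = pos[i].z - boxCenter[b].z;
    double r = sqrt(dx * dx + dy * dy + dz * dz);
    double theta = (r > 0.0) ? acos(dz / r) : 0.0;
    double phi = atan2(dy, dx);

    cuDoubleComplex acc = make_cuDoubleComplex(0.0, 0.0);
    for (int n = 0; n <= p; ++n) {
        double jn = sphBesselJ2(n, k * r);
        for (int m = -n; m <= n; ++m) {
            int c = n * n + m + n;
            cuDoubleComplex term = cuCmul(L[(size_t)c * numBoxes + b],
                                          sphHarm(n, m, theta, phi));
            acc = cuCadd(acc, make_cuDoubleComplex(jn * cuCreal(term), jn * cuCimag(term)));
        }
    }
    field[i] = cuCadd(field[i], acc);  // add the far-field part to the output
}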

F. Near-Field Calculation Kernel

The near-field contributions due to points contained inside a box and within its adjacent neighbor boxes are computed directly using (1). The presented implementation can handle up to 320 sources per box. This limitation is due to the number of registers available per GPU multiprocessor, as one thread is assigned to handle one particle within a box. More details on these kernels can be found in [8].
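A minimal CUDA sketch of this direct near-field evaluation is shown below: one thread per particle sums the contributions (1) of the sources in its own leaf box and in the adjacent neighbor boxes. The CSR neighbor list nbrPtr/nbrIdx (assumed to include the box itself) and the other identifiers are assumptions of this sketch.

#include <cuComplex.h>

__global__ void nearField(const double3* __restrict__ pos,
                          const cuDoubleComplex* __restrict__ q,
                          const int* __restrict__ particleBox,
                          const int* __restrict__ boxStart,   // contiguous source ranges
                          const int* __restrict__ nbrPtr,     // CSR neighbor list,
                          const int* __restrict__ nbrIdx,     // including the box itself
                          cuDoubleComplex* field,
                          int numParticles, double k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numParticles) return;
    int b = particleBox[i];

    cuDoubleComplex acc = make_cuDoubleComplex(0.0, 0.0);
    for (int nb = nbrPtr[b]; nb < nbrPtr[b + 1]; ++nb) {
        int box = nbrIdx[nb];
        for (int j = boxStart[box]; j < boxStart[box + 1]; ++j) {
            if (j == i) continue;                   // skip the self-interaction
            double dx = pos[i].x - pos[j].x;
            double dy = pos[i].y - pos[j].y;
            double dz = pos[i].z - pos[j].z;
            double R = sqrt(dx * dx + dy * dy + dz * dz);
            // Green's function e^{ikR}/R of (1), split into real/imaginary parts.
            double s, c;
            sincos(k * R, &s, &c);
            cuDoubleComplex g = make_cuDoubleComplex(c / R, s / R);
            acc = cuCadd(acc, cuCmul(q[j], g));
        }
    }
    field[i] = cuCadd(field[i], acc);  // add the near-field part to the output
}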
V. NUMERICAL RESULTS

In the numerical experiments, we considered uniform distributions of between 0.5 and 7.3 million sources confined to a cubic volume, as well as sources distributed on a spherical surface. The magnitudes $q_i$, $i = 1, \ldots, N$, were chosen randomly. The performance of the proposed GPU implementation was compared against its CPU counterpart. Table I summarizes the parameters and numerical results for the two benchmarks (Benchmark A and Benchmark B). The results presented in Table II were run on a Quadro FX 5800 GPU with 4 GB of memory, while the CPU used was an Intel Xeon E5520 running at 2.26 GHz. Error levels listed in Table I were computed using a relative error norm. For the translation procedure, caching data in the manner described in Algorithm 3 was found to be advantageous only for some of the tree levels, where speedups of approximately 10 were realized. At the remaining levels, an alternative to the translation procedure in Algorithm 3 was used, which handled one destination box per thread. The GPU code that implemented the coalescing and caching optimizations discussed in Section III yielded speedups of 1.1 to 1.6 in total GPU run time versus code that did not use the said optimizations. Since no more than 320 sources per leaf box can be handled due to the number of available registers on each multiprocessor, this implementation is limited to problem sizes of between 7.0 and 7.5 million sources for a volumetric distribution; a corresponding limit applies to surface distributions of sources.

TABLE I. BENCHMARK PARAMETERS

TABLE II. BENCHMARK RUN TIMES IN SECONDS AND SPEEDUP
using the -norm. For the translation procedure, caching data MLFMA,” Prog. Electromagn. Res., vol. 55, pp. 47–78, 2005.
in the manner described in the Algorithm 3 was only found to [5] I. Bogaert et al., “A nondirective plane wave MLFMA stable at low
frequencies,” IEEE Trans. Antennas Propag., vol. 56, no. 12, pp.
be advantageous for levels , where speedups of approx- 3752–3767, Dec. 2008.
imately 10 were realized for . At the remaining levels [6] W. Hwu et al., “Compute unified device architecture application suit-
, an alternative to the translation procedure in Algo- ability,” IEEE Comput. Sci. Eng., vol. 11, no. 3, pp. 16–25, May/Jun.
2009.
rithm 3 was used, which handled one destination box per thread. [7] N. Gumerov and R. Duraiswami, “Fast multipole methods on graphics
The GPU code that implemented the coalescing and caching op- processors,” J. Comput. Phys., vol. 227, pp. 8290–8313, 2005.
timizations discussed in Section III yielded speedups of 1.1 to [8] M. Cwikla, “The low-frequency MLFMM on graphics processors,”
1.6 in total GPU run time versus code that did not use the said M.S. thesis, Dept. ECE, Univ. Manitoba, Winnipeg, MB, Canada,
2009.
optimizations. Since no more than 320 sources per leaf box can [9] N. A. Gumerov and R. Duraiswami, Fast Multipole Methods for the
be handled due to the number of available registers on each mul- Helmholtz Equation in Three Dimensions. Amsterdam, The Nether-
tiprocessor, this implementation is limited to problem sizes lands: Elsevier, 2006.
[10] “nVidia CUDA Programming Guide,” ver. 2.3, nVidia Corp., May
between 7.0 and 7.5 million and for a volumetric dis- 2009.
tribution of sources. For surface distributions, the upper limit is [11] S. Zhang and J.-M. Jin, Computation of Special Functions. New
approximately million for . York: Wiley, 1996.
