the channel flow simulated here, the result is that the number of degrees of freedom in a DNS, and therefore the memory requirement, scales with Re³, while the computational cost scales as Re⁴[22]. DNS of turbulence is thus a natural supercomputing problem, which is why the National Science Foundation included turbulence DNS as an essential performance benchmark for the Blue Waters system[15].

The DNS of turbulent channel flow described here was undertaken to address questions about the nature of wall-bounded turbulence at high Reynolds number. The Reynolds number simulated (Re_τ = 5200, based on the friction velocity) was selected to be large enough to display characteristics of very high Reynolds number wall-bounded turbulence. This Reynolds number requires a simulation with 242 billion degrees of freedom, which is, to the authors' knowledge, the largest scientific DNS yet conducted, with 3.5 times more degrees of freedom than the 2004 isotropic homogeneous turbulence simulation of Kaneda et al.[14], and 15 times more than the previously largest channel DNS, conducted in 2006 by Hoyas and Jimenez[10].
2. THE COMPUTATIONAL PROBLEM Kutta [23], with the nonlinear (convective) terms treated
Simulated here is turbulent channel flow: the flow between infinite parallel planes driven by a pressure gradient, as shown in figure 1. For computational purposes, the infinite domains in the directions parallel to the walls are truncated to a finite box with periodic boundary conditions in the streamwise (x) and spanwise (z) directions. Zero-velocity boundary conditions are imposed at the walls.

The velocity is represented in space as a truncated Fourier series in x and z, and in terms of 7th-order basis splines (B-splines) in the wall-normal direction (y). B-splines were selected for their excellent error characteristics[17] as well as their straightforward formulation using the recursive relation of DeBoor[5]. A Fourier spectral formulation is preferred for turbulence DNS, when applicable, because of its superior resolution properties[2], despite the resulting algorithmic need for expensive all-to-all communication. The communication cost is more than compensated for by the reduced grid size needed by spectral methods for a given accuracy, and by the reduced cost of solving the Poisson equation that necessarily arises when solving the incompressible Navier-Stokes equations.

2.1 Numerical Formulation
The incompressible Navier-Stokes equations are solved using the well-known formulation of Kim, Moin and Moser[16]. First, the incompressible Navier-Stokes equations,

\[
\frac{\partial \mathbf{u}}{\partial t} + \nabla\cdot(\mathbf{u}\mathbf{u}) = -\nabla p + \nu \nabla^{2}\mathbf{u},
\qquad \nabla\cdot\mathbf{u} = 0,
\tag{1}
\]

are recast as two scalar evolution equations,

\[
\frac{\partial \omega_y}{\partial t} = h_g + \nu \nabla^{2}\omega_y,
\qquad
\frac{\partial \phi}{\partial t} = h_v + \nu \nabla^{2}\phi,
\tag{2}
\]

where ν is the kinematic viscosity, h_g and h_v arise from the nonlinear terms, ω_y is the wall-normal component of the vorticity, and φ is related to the wall-normal component of the velocity v through φ = ∇²v.

These equations are reduced to ordinary differential equations in time for the Fourier-B-spline coefficients of ω_y and φ, using a Fourier-Galerkin method in x and z and a B-spline collocation method in y. The Galerkin quadratures for the nonlinear terms h_g and h_v are performed using Fourier-Gauss quadrature, which, since the nonlinearities are quadratic, requires 3/2 as many uniformly spaced quadrature points in x and z as there are Fourier modes in those directions (commonly called dealiasing)[20]. The evaluation of the velocities on the quadrature grid and the computation of the Galerkin integrals are done using the fast Fourier transform (FFT) algorithm.
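To illustrate the 3/2-rule padding concretely, the sketch below zero-pads one line of N complex Fourier modes (stored in the usual FFTW wrap-around order) into a buffer of M = 3N/2 points before an inverse transform. This is a minimal sketch under assumed conventions, not the production code; the function and variable names are invented for illustration.

    #include <string.h>
    #include <fftw3.h>

    /* Zero-pad N Fourier modes into a length M = 3N/2 buffer, then
       inverse-transform to M quadrature points.  `inv` must be an
       in-place c2c plan of length M, created for buffers with the same
       alignment as `work`.  Illustrative only. */
    void dealiased_inverse_fft(const fftw_complex *hat, int N,
                               fftw_complex *work, int M, fftw_plan inv)
    {
        memset(work, 0, sizeof(fftw_complex) * M);
        /* non-negative wavenumbers k = 0 .. N/2-1 keep their positions */
        memcpy(work, hat, sizeof(fftw_complex) * (N / 2));
        /* negative wavenumbers move to the tail of the larger buffer */
        memcpy(work + M - N / 2, hat + N / 2,
               sizeof(fftw_complex) * (N / 2));
        fftw_execute_dft(inv, work, work);
    }

After the nonlinear products are formed on the M-point grid, the forward transform is truncated back to N modes, discarding the aliased range.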
Time advancement is performed using a mixed implicit-explicit (IMEX) scheme based on a third-order low-storage Runge-Kutta method[23], with the nonlinear (convective) terms treated explicitly and the viscous terms implicitly. Both the implicit time-stepping and the determination of v from φ require the solution of linear algebraic systems whose dimension is determined by the size of the B-spline basis. For each wavenumber in the Fourier representation, three such linear systems must be solved, each of which is equivalent to solving a second-order, two-point boundary-value problem in y using B-spline collocation.
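To make the structure of such a substep concrete, a generic low-storage IMEX Runge-Kutta update for either equation in (2) can be written as follows. The coefficient names α_k, β_k, γ_k, ζ_k are schematic here (the actual values are those of the scheme in [23]), so this shows the form of the update rather than the exact scheme used:

\[
\left( I - \beta_k \,\Delta t\, \nu \nabla^{2} \right) q^{(k+1)}
= q^{(k)} + \Delta t \left( \alpha_k\, \nu \nabla^{2} q^{(k)}
+ \gamma_k\, h^{(k)} + \zeta_k\, h^{(k-1)} \right),
\]

where q stands for ω_y or φ and h for the corresponding nonlinear term. Each substep therefore requires inverting a Helmholtz-like operator of precisely the form appearing in equation (3) below.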
2.2 Parallel decomposition
The most effective way to perform the required FFTs and linear algebra in parallel is to have each parallel process work on the FFT in one direction (x or z), or on the linear algebra (in y), with all data in that direction local, and to decompose the other two spatial directions across processes (MPI tasks). The result is that each process works on a pencil of data that is long in the direction in which the transform or linear algebra is being performed. The algorithm performs transforms in each of the horizontal directions and linear algebra in the wall-normal direction, requiring that the data be repeatedly redistributed across processes to change the orientation of the pencils[7]. A schematic of the data partitioning is shown in figure 2. This reshuffling is essentially a global transpose and involves a communication pattern like that implemented in MPI_Alltoall. The pencil decomposition is used rather than the alternative planar decomposition because it provides far greater flexibility with respect to possible MPI communicator topologies and node counts.
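As a minimal illustration of this communication pattern (a sketch, not the production implementation), the following exchanges equal-sized blocks among all ranks of a communicator, which is the core of the pencil reorientation; the block size and the surrounding pack/unpack steps (the on-node reordering discussed in section 4.2) are omitted:

    #include <mpi.h>

    /* Redistribute pencil data among the P ranks of `comm`: each rank
       sends one block of `blk` doubles to every rank and receives one
       from every rank.  Packing before, and unpacking after, reorder
       the local data into and out of the new pencil orientation. */
    void pencil_transpose(const double *sendbuf, double *recvbuf,
                          int blk, MPI_Comm comm)
    {
        MPI_Alltoall(sendbuf, blk, MPI_DOUBLE,
                     recvbuf, blk, MPI_DOUBLE, comm);
    }

In practice such transposes are performed within row and column subcommunicators of a two-dimensional process grid, so only one of the two distributed directions is exchanged at a time.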
2.3 Simulation steps
Our program starts at the beginning of a time step with the
Fourier coefficients for the three velocity components u, v
and w evaluated at the collocation points in y, distributed
in the y-pencils data configuration. Following is an outline
of the operations required for a single Runge-Kutta substep
starting at this point.
(a) Transpose the data for the three velocity components from y-pencils to z-pencils.
(b) Pad the data in z with zeros for an FFT on a 3/2-larger data vector.
(c) Perform the inverse Fourier transform in the z direction.
(d) Transpose the data from z-pencils to x-pencils.
(e) Pad the data in x with zeros for an FFT on a 3/2-larger data vector.
(f) Perform the inverse Fourier transform in the x direction. The three velocity components are now evaluated on a 3D grid distributed in the x-pencil configuration.
(g) Compute the five quadratic products of u, v and w at each point.
(h) Repeat steps (f) to (a), in reverse order, for the five fields of nonlinear terms. Here the forward Fourier transform is performed in (f) and (c), and the data arrays are truncated rather than padded in (e) and (b).
(i) Perform the linear algebra needed to time advance the equations for the Fourier-B-spline coefficients of ω_y and φ.
(j) Perform the linear algebra needed to evaluate the new values of the velocity components at the collocation points in y.

A schematic driver for this sequence is sketched below.
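The following is a compact sketch of the substep as a driver routine. Every name here is a hypothetical stub standing in for the corresponding operation above; none of this is the actual code.

    /* Hypothetical stubs for steps (a)-(j); `field` is a placeholder
       for a distributed data structure. */
    typedef struct { double *data; } field;

    static void transpose_y_to_z(field *f)             { (void)f; }
    static void pad_and_ifft_z(field *f)               { (void)f; }
    static void transpose_z_to_x(field *f)             { (void)f; }
    static void pad_and_ifft_x(field *f)               { (void)f; }
    static void nonlinear_products(field *u, field *n) { (void)u; (void)n; }
    static void fft_truncate_and_return(field *n)      { (void)n; }
    static void advance_coefficients(field *n)         { (void)n; }
    static void recover_velocity(field *u)             { (void)u; }

    void rk_substep(field *vel, field *nl)
    {
        transpose_y_to_z(vel);        /* (a) */
        pad_and_ifft_z(vel);          /* (b)-(c) */
        transpose_z_to_x(vel);        /* (d) */
        pad_and_ifft_x(vel);          /* (e)-(f) */
        nonlinear_products(vel, nl);  /* (g) five quadratic products */
        fft_truncate_and_return(nl);  /* (h) forward FFTs, truncation,
                                             transposes back to y-pencils */
        advance_coefficients(nl);     /* (i) banded solves per wavenumber */
        recover_velocity(vel);        /* (j) velocities at collocation
                                             points in y */
    }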
3. BENCHMARK SYSTEMS
The DNS code described here was benchmarked on four HPC platforms with different system architectures. They are:

1. Mira: Argonne National Laboratory; IBM Blue Gene/Q, Power BQC 16C 1.60 GHz, custom 5D torus interconnect
2. Stampede: Texas Advanced Computing Center; PowerEdge C8220, Xeon E5-2680 8C 2.70 GHz, InfiniBand FDR, Intel Xeon Phi
3. Lonestar: Texas Advanced Computing Center; Dell PowerEdge M610 cluster, Xeon 5680 3.3 GHz, InfiniBand QDR
4. Blue Waters: National Center for Supercomputing Applications; Cray XE6, AMD 6276 (Interlagos) 2.3 GHz, Cray Gemini 3D torus interconnect

Most of the performance measurements were conducted on Mira because it is the platform being used for the full-scale scientific runs. On Stampede and Blue Waters, the benchmarks were conducted without using the available accelerators, though another study[24] found excellent scaling properties on the Intel MIC using a compute kernel from this simulation. The benchmarks were built with the IBM XL compilers for Blue Gene on Mira (version 14.1), the Intel compilers on Lonestar and Stampede (versions 11.1 and 13.0, respectively), and the Cray compilers (version 8.1.4) on Blue Waters. Elapsed time was measured on all platforms with the MPI_Wtime() function.

4. CODE OPTIMIZATION
A number of optimizations were pursued for the channel DNS code described here to address performance at the single-core level, the node level and the system level.

4.1 Single core optimization
There are two primary computational kernels whose single-core performance is important: the one-dimensional FFTs and the linear algebra required to time advance the Navier-Stokes equations. We use the FFTW 3.3 library[9] for the FFTs, while a custom routine for the linear algebra was developed and optimized as described below.

4.1.1 Linear Algebra
As discussed in section 2.1, time advancing the Navier-Stokes equations with the IMEX scheme reduces to solving a pair of two-point boundary-value problems, for ω_y and φ, for each wavenumber in the Fourier representation. These problems are of the form

\[
\left[ I - \frac{\Delta t\, \beta\, \nu}{2} \left( \frac{d^{2}}{dy^{2}} - k^{2} I \right) \right] \tilde{q} = R,
\tag{3}
\]

where the unknown q̃ stands for the Fourier coefficients of either ω_y or φ at the next time level for this wavenumber, I is the identity operator, ∆t is the time step, β is a coefficient determined by the time-discretization scheme, k² = k_x² + k_z² is the sum of the squares of the Fourier wavenumbers in x and z, and R is the accumulation of all the terms on the right-hand side. In addition, determining the wall-normal velocity v from φ requires the solution of the following ODE:

\[
\left( \frac{d^{2}}{dy^{2}} - k^{2} I \right) v = \phi.
\tag{4}
\]

Applying the B-spline collocation method to these equations results in a system of linear algebraic equations, with the matrix representing the operator in brackets in (3) or (4). In both cases the matrix is banded, with extra nonzero values in the first and last few rows, as shown on the left side of figure 3.

Figure 3: Banded matrix storage (left: original full matrix; center: format for a conventional LAPACK banded solver; right: format for the customized solver)

A matrix with this structure can be solved with a general banded matrix solver, as in LAPACK, but this is not optimal. First, the full matrix would need to be stored in the form shown in the center of figure 3, and many unnecessary computations would be performed on elements containing zeros. Second, in our case the matrix is real, while the solution vector and right-hand side are complex; this can be handled with the DGBTRF/DGBTRS routines by rearranging each complex-valued vector into two sequential real-valued vectors, and rearranging back after the solve. To circumvent these shortcomings and eliminate unnecessary floating-point operations, a customized linear algebra solver was developed, based on a custom matrix data structure in which the extra nonzero elements are moved into otherwise-empty elements of the band storage, as shown in the right panel of figure 3. In this form the memory requirement is reduced by half, which is important for cache management. In addition, it was found that compilers optimized the low-level operations on matrix elements in the LU decomposition inefficiently, so loops were unrolled by hand to improve reuse of data in cache.
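For reference, the baseline LAPACK approach described above — a real banded factorization applied to a complex right-hand side by splitting it into real and imaginary parts — looks roughly like the following with LAPACKE. The array layouts and the workspace convention are assumptions of this sketch, not the custom solver itself.

    #include <complex.h>
    #include <lapacke.h>

    /* Factor a real banded matrix with DGBTRF and solve for a complex
       right-hand side with DGBTRS by treating its real and imaginary
       parts as two real RHS columns.  `ab` holds the matrix in LAPACK
       band storage (center of figure 3) with ldab >= 2*kl + ku + 1;
       `work` must hold 2*n doubles. */
    int solve_banded_complex_rhs(lapack_int n, lapack_int kl, lapack_int ku,
                                 double *ab, lapack_int ldab,
                                 lapack_int *ipiv, double complex *rhs,
                                 double *work)
    {
        for (lapack_int i = 0; i < n; i++) {
            work[i]     = creal(rhs[i]);   /* column 1: real part */
            work[n + i] = cimag(rhs[i]);   /* column 2: imaginary part */
        }
        lapack_int info = LAPACKE_dgbtrf(LAPACK_COL_MAJOR, n, n, kl, ku,
                                         ab, ldab, ipiv);
        if (info != 0) return (int)info;
        info = LAPACKE_dgbtrs(LAPACK_COL_MAJOR, 'N', n, kl, ku, 2,
                              ab, ldab, ipiv, work, n);
        if (info != 0) return (int)info;
        for (lapack_int i = 0; i < n; i++)
            rhs[i] = work[i] + work[n + i] * I;
        return 0;
    }

The customized solver instead keeps the compact storage on the right of figure 3 and an LU factorization specialized to it, avoiding both the doubled storage and the operations on structural zeros.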
This customized linear algebra solver was tested on Lonestar and Mira with problem size N = 1024, and the results are presented in Table 1, along with benchmarks of the implementations in the Intel MKL[4] library (Lonestar) and IBM ESSL[11] (Mira). Note that all times are normalized by the elapsed time for Netlib LAPACK[1]. On both Lonestar and Mira, the customized solver is about four times faster than the Intel MKL or IBM ESSL library. This speed-up is due to eliminating unnecessary floating-point operations and optimizing cache reuse.

Table 1: Elapsed time for solving a linear system

               Lonestar                   Mira
Bandwidth   MKL-R   MKL-C   Custom    ESSL   Custom
    3        0.67    0.65    0.14     0.81    0.16
    5        0.55    0.61    0.12     0.85    0.19
    7        0.53    0.58    0.11     0.81    0.19
    9        0.53    0.56    0.10     0.84    0.19
   11        0.47    0.56    0.10     0.88    0.19
   13        0.45    0.55    0.11     0.74    0.21
   15        0.41    0.53    0.11     0.71    0.20

* N = 1024.
** Times are normalized by the elapsed time using ZGBTRF/ZGBTRS from Netlib LAPACK.
*** R: DGBTRF/DGBTRS; C: ZGBTRF/ZGBTRS. For ESSL, DGBF/DGBS, which are equivalent to DGBTRF/DGBTRS, are used.
4.1.2 Single Core Performance
The single-core performance in solving equations (3)-(4) for the time advancement of the Navier-Stokes equations was measured with the Hardware Performance Monitor (HPM) counter module from IBM on Mira. The quoted theoretical peak is 12.8 GFlops per core, with the LINPACK benchmark attaining 10.38 GFlops. As shown in Table 2, our PDE solver achieves about 9% of theoretical peak performance. Here 98.2% of instructions access data in the L1 cache (including L1 prefetch). Even though this implies highly efficient reuse of data in cache, the performance of this kernel is clearly limited by memory bandwidth, given the low number of instructions per cycle and DDR traffic near the saturation value of 18 bytes/cycle measured by STREAM[27]. The effect of compiling with the SIMD option was also investigated. While compilation with SIMD increased the aggregate Flops, it also increased the computation time, which almost certainly means that compiling with SIMD introduced unfavorable code optimizations that reduced performance. As a result of this finding, all benchmark data in this paper are from code compiled without the SIMD option.

Table 2: Single core performance of Navier-Stokes time advance on Mira

                               SIMD           No SIMD
GFlops                         4.96 (38.8%)   1.16 (9.05%)
Instructions per cycle         1.22           0.89
Load hits in L1 (%)            98.01          98.2
Load hits in L2 (%)            1.45           0.92
Loads from DDR (%)             0.53           0.88
DDR traffic (bytes/cycle)      14.2 (79%)     16.8 (93%)
Elapsed time (sec)             3.96           3.34
4.2 Threading
There were three major opportunities to exploit on-node parallelism through OpenMP threads, rather than relying on MPI, to enhance single-node performance: the FFTs, the Navier-Stokes time advance, and the on-node data reordering needed as part of the transpose process. There are two potential advantages to this approach. First, it reduces the total number of MPI messages and thus increases the message size, which reduces the latency cost of the MPI communication. Second, as will be shown below, the optimum degree of on-node parallelism is not the same for all three of these functions; using threads rather than MPI allows the degree of parallelism to differ between functions.

It is straightforward to thread the FFT calculations because the transform of each full line of data is independent of the other data lines. The FFTW 3.3 library is thread-safe, so the FFTW calls are simply embedded in OpenMP threads. FFTW can also use internal threading when processing multiple FFTs at once; this was not used, since it is more efficient to embed the FFTW calls together with other threaded operations, such as the padding for the 3/2 expansion of the data lines for dealiasing, or the computation of the nonlinear terms. Indeed, a single OpenMP threaded block spans the inverse x transform, the computation of the nonlinear terms and the forward x transform. This makes the cost of computing the nonlinear products negligible, because the data remain in cache across all three operations.
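A minimal sketch of this pattern, under assumed data layout and names, is shown below. Plan creation is not thread-safe in FFTW, so one plan is made up front; fftw_execute_dft, the thread-safe "new-array" execute function, then applies it to each line inside the parallel loop.

    #include <stddef.h>
    #include <fftw3.h>

    /* Execute many independent in-place 1-D inverse FFTs from an OpenMP
       loop.  FFTW_ESTIMATE is used so planning does not overwrite the
       data; each line must have the same alignment as the buffer the
       plan was created with.  Illustrative only. */
    void inverse_fft_lines(fftw_complex *data, ptrdiff_t nlines, int n)
    {
        fftw_plan plan = fftw_plan_dft_1d(n, data, data,
                                          FFTW_BACKWARD, FFTW_ESTIMATE);
        #pragma omp parallel for schedule(static)
        for (ptrdiff_t i = 0; i < nlines; i++)
            fftw_execute_dft(plan, data + i * n, data + i * n);
        fftw_destroy_plan(plan);
    }

In the production code the same parallel region also performs the 3/2 padding and the nonlinear products, as described above, so each line stays in cache across all three operations.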
As with the FFTs, the linear algebra for the Navier-Stokes time advance is performed on a single data line (in y, for a single wavenumber), with no dependence on other data lines. The entire time-advance calculation can thus be performed in a single OpenMP threaded block, which also allows the data to remain in cache throughout the calculation.

As part of the global transpose process, the data on each node need to be reordered in an on-node transpose, which takes the form A(i, j, k) → A(j, k, i). The speed of this on-node transpose is clearly limited by memory bandwidth, as nothing but memory access is occurring. By dividing the transpose into independent pieces and threading across the pieces with OpenMP, one can improve the utilization of the memory bandwidth by maintaining multiple data streams from memory.
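A sketch of this threaded reordering follows; the extents and layout (first index fastest) are assumptions of the sketch. Threading the outer loops keeps several independent read/write streams in flight, which is what improves bandwidth utilization.

    #include <stddef.h>

    /* On-node reordering A(i,j,k) -> A(j,k,i), out of place, with the
       outer two loops threaded across OpenMP threads.  ni, nj, nk are
       the local pencil extents.  Illustrative only. */
    void node_transpose(const double *restrict in, double *restrict out,
                        int ni, int nj, int nk)
    {
        #pragma omp parallel for collapse(2) schedule(static)
        for (int k = 0; k < nk; k++)
            for (int j = 0; j < nj; j++)
                for (int i = 0; i < ni; i++)
                    out[(size_t)j + (size_t)nj *
                        ((size_t)k + (size_t)nk * (size_t)i)] =
                        in[(size_t)i + (size_t)ni *
                           ((size_t)j + (size_t)nj * (size_t)k)];
    }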
Table 3: Single node performance of FFT/Navier-Stokes time advance

Cores              2                 3                 4                 5                6
Speed-up      2.03 / 1.99       3.18 / 2.98       4.07 / 3.65       4.88 / 4.77      5.49 / 5.70
Efficiency   101.5% / 99.4%    105.9% / 99.3%    101.8% / 91.2%    97.6% / 95.5%    91.4% / 95.1%
[Figure: profiles of mean velocity U/u_τ and Reynolds stresses ⟨u_i u_j⟩/u_τ²; only the axis labels survived extraction.]

wall, and the outer flow that is only weakly influenced by the wall. A single simulation is currently being run on 32 racks (524,288 cores) of the BG/Q system Mira at the ALCF. A Fourier-B-spline representation with 10,240 modes in the