A Performance Guide
For HPC Applications
On the IBM System x iDataPlex
dx360 M4 System
Release 1.0.2
June 19, 2012
IBM Systems and Technology Group
Contents
Contributors...................................................................................................... 8
Introduction...................................................................................................... 9
1 iDataPlex dx360 M4 ................................................................................. 11
1.1 Processor .....................................................................................................11
1.1.1 Supported Processor Models ................................................................................. 12
1.1.2 Turbo Boost 2.0 ...................................................................................................... 13
1.2 System .........................................................................................................16
1.2.1 I/O and Locality Considerations.............................................................................. 17
1.2.2 Memory Subsystem................................................................................................ 17
1.2.3 UEFI......................................................................................................................... 20
1.3 Mellanox InfiniBand Interconnect ................................................................23
1.3.1 References .............................................................................................................. 27
4 MPI........................................................................................................... 61
4.1 Intel MPI ......................................................................................................61
4.1.1 Compiling................................................................................................................ 61
4.1.2 Running Parallel Applications ................................................................................. 62
4.1.3 Processor Binding ................................................................................................... 63
4.2 IBM Parallel Environment ............................................................................64
4.2.1 Building an MPI program ........................................................................................ 64
4.2.2 Selecting MPICH2 libraries...................................................................................... 65
4.2.3 Optimizing for Short Messages............................................................................... 65
4.2.4 Optimizing for Intranode Communications ............................................................ 65
4.2.5 Optimizing for Large Messages............................................................................... 65
4.2.6 Optimizing for Intermediate-Size Messages........................................................... 66
4.2.7 Optimizing for Memory Usage ............................................................................... 66
4.2.8 Collective Offload in MPICH2.................................................................................. 66
4.2.9 MPICH2 and PEMPI Environment Variables ........................................................... 67
4.2.10 IBM PE Standalone POE Affinity ............................................................................. 69
4.2.11 OpenMP Support .................................................................................................... 70
4.3 Using LoadLeveler with IBM PE ....................................................................70
4.3.1 Requesting Island Topology for a LoadLeveler Job................................................. 70
4.3.2 How to run OpenMPI and INTEL MPI jobs with LoadLeveler ................................. 71
4.3.3 LoadLeveler JCF (Job Command File) Affinity Settings ........................................... 71
4.3.4 Affinity Support in LoadLeveler .............................................................................. 72
Figures
Figure 1-1 Processor Ring Diagram ................................................................................................ 12
Figure 1-2 dx360 M4 Block Diagram with Data Buses ................................................................... 16
Figure 1-3 Relative Memory Latency by Clock Speed ..................................................................... 19
Figure 1-4 Relative Memory Throughput by Clock Speed............................................................... 20
Figure 6-1 Comparing actual Linpack and system peak performance (GFlops) for different
numbers of nodes ........................................................................................................................... 91
Figure 6-2 Comparing measured Linpack and system peak performance (PFlops) for large
numbers of nodes ........................................................................................................................... 92
Figure 6-3 Measured Bandwidth (MB/s) for single-core STREAM tests using GCC ........................ 94
Figure 6-4 Measured Bandwidth (MB/s) for single-core STREAM tests using Intel icc................... 96
Figure 6-5 Measured Bandwidth (MB/s) for single-core STREAM tests comparing Intel icc and GCC
........................................................................................................................................................ 96
Figure 6-6 Measured Bandwidth (MB/s) for single-core STREAM tests comparing Intel icc without
streaming stores and GCC .............................................................................................................. 98
Figure 6-7 Single core memory bandwidth as a function of core frequency .................................. 99
Figure 6-8 Memory Bandwidth (MB/s) over 16 cores GCC throughput benchmark .................. 100
Figure 6-9 Memory bandwidth (MB/s) minimum number of sockets 16-way OpenMP
benchmark.................................................................................................................................... 101
Figure 6-10 Memory bandwidth (MB/s) performance of 8 threads on 1 or 2 sockets............... 102
Figure 6-11 Memory bandwidth (MB/s) split threads between two sockets............................. 103
Figure 6-12 Memory bandwidth (MB/s) vs stride length for 1 to 16 threads .............................. 106
Figure 7-1 Using the low 128-bits of the YMMn registers for XMMn........................................... 113
Figure 7-2 Scalar and vector operations....................................................................................... 114
Figure 7-3 Sandy Bridge block diagram emphasizing SIMD AVX functional units ........................ 116
Figure 8-1 Functional block diagram of the Tesla Fermi GPU ...................................................... 124
Figure 8-2 Tesla Fermi SM block diagram .................................................................................... 125
Figure 8-3 Cuda core..................................................................................................................... 126
Figure 8-4 NVIDIA GPU memory hierarchy ................................................................................... 128
Figure 9-1 Server Power States..................................................................................................... 136
Figure 9-2 The effect of VRD voltage ............................................................................................ 142
Figure 9-3 Relative influence of power saving features................................................................ 143
Tables
Table 1-1 Sandy Bridge Feature Overview Compared to Xeon E5600 ............................................ 11
Table 1-2 Supported Sandy Bridge Processor Models .................................................................... 13
Table 1-3 Maximum Turbo Upside by Sandy Bridge CPU model ................................................. 14
Table 1-4 Supported DIMM types................................................................................................... 18
Table 1-5 Common UEFI Performance Tunings .............................................................................. 22
Table 2-1 GNU compiler processor-specific optimization options .................................................. 30
Table 2-2 A mapping between GCC and Intel compiler options for processor architectures.......... 30
Table 2-3 General GNU compiler optimization options.................................................................. 31
Table 2-4 General Intel compiler optimization options .................................................................. 34
Table 2-5 Intel compiler options that control vectorization ........................................................... 38
Table 2-6 Intel compiler options that enhance vectorization ........................................................ 38
Table 2-7 Intel compiler options for reporting on optimization...................................................... 39
Table 2-8 Global (inter-procedural) optimization options for the GNU compiler suite.................. 40
Table 2-9 Global (inter-procedural) optimization options for the Intel compiler suite................... 41
Table 2-10 Automatic parallelization for the Intel compiler........................................................... 43
Table 2-11 OpenMP options for the Intel compiler suite................................................................ 44
Table 2-12 GNU OpenMP runtime environment variables recognized by the Intel compiler
toolchain......................................................................................................................................... 45
Table 2-13 GNU compiler options for CAF ...................................................................................... 46
Table 2-14 Intel compiler options for CAF ...................................................................................... 46
Table 3-1 OpenMP binding options ................................................................................................ 56
Table 4-1 Intel MPI wrappers for GNU and Intel compiler ............................................................. 61
Table 4-2 Intel MPI settings for I_MPI_FABRICS............................................................................. 62
Table 5-1 Event modifiers for perf -e <event>:<mod> ................................................... 74
Table 6-1 LINPACK Job Parameters ................................................................................................ 88
Table 6-2 HPL performance on up to 18 iDataPlex dx360 M4 islands ........................................... 92
Table 6-3 Single core memory bandwidth as a function of core frequency.................................... 99
Table 6-4 Memory Bandwidth (MB/s) over 16 cores throughput benchmark........................... 100
Table 6-5 Memory bandwidth (MB/s) over 16 cores OpenMP benchmark with icc 20M....... 101
Table 6-6 Memory bandwidth (MB/s) over 16 cores OpenMP benchmark with icc 200M..... 101
Table 6-7 Memory bandwidth (MB/s) minimum number of sockets OpenMP benchmark with
icc.................................................................................................................................................. 101
Table 6-8 Memory bandwidth (MB/s) split threads between two sockets OpenMP benchmark
with icc.......................................................................................................................................... 102
Table 6-9 Memory bandwidth (MB/s) minimum number of sockets OpenMP benchmark with
gcc ................................................................................................................................................ 103
Table 6-10 Memory bandwidth (MB/s) divide threads between two sockets OpenMP
benchmark with gcc...................................................................................................................... 103
Table 6-11 Strided memory bandwidth (MB/s) 16 threads ....................................................... 104
Table 6-12 Strided memory bandwidth (MB/s) 8 threads ......................................................... 104
Table 6-13 Strided memory bandwidth (MB/s) 4 threads ......................................................... 105
Table 6-14 Strided memory bandwidth (MB/s) 2 threads ......................................................... 105
Table 6-15 Strided memory bandwidth (MB/s) 1 thread........................................................... 105
Table 6-16 Reverse order (stride=-1) memory bandwidth (MB/s) 1 to 16 threads.................... 106
Table 6-17 Stride 1 memory bandwidth (MB/s) 1 to 16 threads ............................................... 106
Table 6-18 Strided memory bandwidth (MB/s) with indexed loads 1 thread............................ 107
Table 6-19 Strided memory bandwidth (MB/s) with indexed loads 16 threads ........................ 107
Table 6-20 Strided memory bandwidth (MB/s) with indexed stores 1 thread........................... 107
Table 6-21 Strided memory bandwidth (MB/s) with indexed stores 16 threads ....................... 108
Table 6-22 Best values of HPL N,P,Q for different numbers of total available cores................ 109
Table 6-23 HPCC performance on 1 to 32 nodes ......................................................................... 110
Table 6-24 NAS PB Class D performance on 1 to 32 nodes........................................................... 112
Table 8-1 HPL performance on GPUs............................................................................................ 130
Table 9-1 Global server states ...................................................................................................... 137
Table 9-2 Sleep states................................................................................................................... 138
Table 9-3 CPU idle power-saving states ....................................................................................... 139
Table 9-4 CPU idle states for each core and socket ...................................................................... 140
Table 9-5 CPU Performance states ............................................................................................... 141
Table 9-6 Subsystem power states ............................................................................................... 142
Table 9-7 Memory power states................................................................................................... 143
Contributors
Authors
Charles Archer
Mark Atkins
Torsten Bloth
Achim Boemelburg
George Chochia
Don DeSota
Brad Elkin
Dustin Fredrickson
Julian Hammer
Jarrod B. Johnson
Swamy Kandadai
Peter Mayes
Eric Michel
Raj Panda
Karl Rister
Ananthanarayanan Sugavanam
Nicolas Tallet
Francois Thomas
Robert Wolford
Dave Wootton
Introduction
In March of 2012, IBM introduced a petaflop-class supercomputer, the iDataPlex dx360
M4. Supercomputers are used for simulations, design and for solving very large, complex
problems in various domains including science, engineering and economics.
Supercomputing data centers like the Leibniz Rechenzentrum (LRZ) in Germany are
looking for petaflop-class systems with two important qualities:
1. systems that are highly dense to save on data center space
2. systems that are power efficient to save on energy costs, which can run into
millions of dollars over the lifetime of a supercomputer.
IBM designed the latest generation of its iDataPlex-class systems to meet the
performance, density, power and cooling requirements of a supercomputing data center
such as LRZ.
[Diagram: components of the iDataPlex solution - servers (iDataPlex), OS, management, interconnect, software (GPFS), and storage - provided by IBM.]
The true benefit of a supercomputer is realized only when the user community acquires
and uses the special skills needed to maximize the performance of their applications. With
this singular objective in mind, this document has been created to help application
specialists to wring out the last FLOP from their applications. The document is structured
as a guide that provides pointers and references to more detailed sources of information
on a given topic, rather than as a collection of self-contained recipes for performance
optimization.
In the iDataPlex system, the dx360 M4 is a 2-socket SMP node that is the
computational nucleus of the supercomputer. Chapter 1 provides a high-level description
of the dx360 M4 node as well as the InfiniBand interconnect. Intel's latest-generation
Sandy Bridge server processor is used in the dx360 M4. In this chapter, processor and
system-level information that is essential for tuning is provided.
Two different types of compilers, GNU and Intel, are covered as part of processor-level
performance tuning in chapter 2. Various compiler options, including a set of
recommended options, vectorization, shared-memory parallelization, and the use of math
libraries are some of the topics covered in this chapter.
The iDataPlex supports Red Hat Enterprise Linux (RHEL) and SUSE Linux Enterprise
Server (SLES). Various aspects of operating system tuning that can benefit application
performance are covered in chapter 3. Memory affinitization, process and thread binding,
as well as tools for monitoring performance are some of the key topics in this chapter.
Chapter 6 provides performance results on some of the standard benchmarks that are
frequently used in supercomputing, namely LINPACK and STREAM. Additionally, results
on HPCC and the NAS Parallel Benchmarks on the iDataPlex system are reported.
These first six chapters cover the essentials of performance tuning on the iDataPlex
dx360 M4. However, for those readers who want to go the extra mile in tuning on this
system, a few additional topics are covered in the remaining chapters.
A 256-bit SIMD unit called AVX is provided in the Sandy Bridge class of microprocessors.
SIMD programming is covered in chapter 7.
The dx360 M4 node can also accommodate Nvidia GPGPUs. How to compile
and run on Nvidia GPGPUs is covered in chapter 8.
Power consumption in supercomputers has become a serious concern for data center
operators because of the high operating expenses. Consequently, supercomputing
application developers and users have become sensitive to the power consumption
behavior of their applications. Therefore, a document of this nature is not complete
without a discussion of power consumption, which is treated in the last chapter.
1 iDataPlex dx360 M4
The dx360 M4 is the latest rack-dense, compute node cluster offering in the iDataPlex
product line, offering numerous hardware features to optimize system and cluster
performance:
- Up to two Intel Xeon processor E5-2600 series processors, each providing up to 8 cores and 16 threads, core speeds up to 2.7 GHz, up to 20 MB of L3 cache, and QPI interconnect links of up to 8 GT/s.
- Optimized support of Intel Turbo Boost Technology 2.0, allowing CPU cores to run above rated speeds during peak workloads.
- 4 DIMM channels per processor, offering sixteen DIMMs of registered DDR3 ECC memory, able to operate at 1600 MHz and with up to 256 GB per node.
- Support of solid-state drives (SSDs), enabling improved I/O performance for many workloads.
- PCI Express 3.0 I/O capability, enabling high-bandwidth interconnect support.
- 10 Gb Ethernet and FDR10 mezzanine cards offering high interconnect performance without consuming a PCIe slot.
- Support for high-performance GPGPU adapters.
Additional details on the dx360 M4, including supported configurations of hardware and
operating systems, can be found in the dx360 M4 Product Guide, located here.
1.1 Processor
The dx360 M4 is built around the high performance capabilities of the Intel E5-2600
family of processors, code named Sandy Bridge. As a major microarchitecture update
from the previous generation E5600 series of CPUs, Sandy Bridge provides many key
specification improvements, as noted in Table 1-1.
In addition, Sandy Bridge also introduces support for AVX extensions within an updated
execution stack, enabling 256-bit floating point (FP) operations to be decoded and
executed as a single micro-operation (uOp). The effect of this is a doubling in peak FP
capability, sustaining 8 double precision FLOPs/cycle.
In order to provide sufficient data bandwidth to efficiently utilize the additional processing
capability, the Sandy Bridge processor integrates a high performance, bidirectional ring
architecture similar to that used in the E7 family of CPUs. This high performance ring
interconnects the CPU cores, Last Level Cache (LLC, or L3), PCIe, QPI, and memory
controller, as shown in Figure 1-1.
[Figure 1-1 Processor Ring Diagram: CPU cores with their L1/L2 caches and 2.5 MB LLC slices, the memory controller, QPI, and PCIe, all connected as stops on the bidirectional ring.]
While each physical LLC segment is loosely associated with a corresponding core, this
cache is shared as a logical unit, and any core can access any part of this cache.
Though access latency around the ring is dependent on the number of 1-cycle hops that
must be traversed, the routing architecture guarantees the shortest path will be taken.
With 32B of data able to be returned on each cycle, and with the ring and LLC clocked
with the CPU core, cache and memory latencies have dropped as compared to the
previous generation architecture, while cache and memory bandwidths are significantly
improved. Since the ring is clocked at the core frequency, however, it's important to note
that sustainable memory and cache performance is directly dependent on the speed of
the CPU cores.
Another key performance improvement in the Sandy Bridge family of CPUs is the
migration of the I/O controller into the CPU itself. While I/O adapters were previously
connected via PCIe to an I/O Hub external to the processor, Sandy Bridge has moved the
controller inside the CPU and has made it a stop on the high bandwidth ring. This feature
not only enables extremely high I/O bandwidth supporting the fastest Gen3 PCIe speeds,
but also enables I/O latency reductions of up to 30% as compared to Xeon E5600-based
architectures.
1.1.1 Supported Processor Models
Table 1-2 Supported Sandy Bridge Processor Models

Category    Model      Core Speed  L3 Cache  Cores  TDP    QPI Speed  Max Memory  HT   Turbo Boost
Advanced    E5-2680    2.7 GHz     20 MB     8      130 W  8.0 GT/s   1600 MHz    Yes  Ver 2.0
Advanced    E5-2670    2.6 GHz     20 MB     8      115 W  8.0 GT/s   1600 MHz    Yes  Ver 2.0
Advanced    E5-2665    2.4 GHz     20 MB     8      115 W  8.0 GT/s   1600 MHz    Yes  Ver 2.0
Advanced    E5-2660    2.2 GHz     20 MB     8      95 W   8.0 GT/s   1600 MHz    Yes  Ver 2.0
Advanced    E5-2650    2.0 GHz     20 MB     8      95 W   8.0 GT/s   1600 MHz    Yes  Ver 2.0
Standard    E5-2640    2.5 GHz     15 MB     6      95 W   7.2 GT/s   1333 MHz    Yes  Ver 2.0
Standard    E5-2630    2.3 GHz     15 MB     6      95 W   7.2 GT/s   1333 MHz    Yes  Ver 2.0
Standard    E5-2620    2.0 GHz     15 MB     6      95 W   7.2 GT/s   1333 MHz    Yes  Ver 2.0
Basic       E5-2667    2.9 GHz     15 MB     6      130 W  8.0 GT/s   1600 MHz    Yes  Ver 2.0
Low Power   E5-2650L   1.8 GHz     20 MB     8      70 W   8.0 GT/s   1600 MHz    Yes  Ver 2.0
Low Power   E5-2630L   2.0 GHz     15 MB     6      60 W   7.2 GT/s   1333 MHz    Yes  Ver 2.0
1.1.2 Turbo Boost 2.0
Activated when the operating system transitions to the highest performance state (P0),
and using the integrated power and thermal monitoring capabilities of the Sandy Bridge
processor, Turbo Boost exploits the available power and thermal headroom of the CPU to
increase the operating frequency on one or more cores. The maximum Turbo frequency
that a core is able to run at is limited by the processor model and dependent on the
number of cores that are actively running on a processor socket. When more cores are
inactive, and are therefore able to be put to sleep, more power and thermal headroom
becomes available in the processor and higher frequencies are possible for the remaining
cores. Thus, the maximum turbo frequency is possible when just 1 core is active, and the
remaining cores are able to sleep. When all cores are active, as is common in many
cluster applications, a frequency falling between the Max Turbo 1-Core Active and the
processor's Rated Core Speed is achieved, as illustrated in Table 1-3:
With Sandy Bridge and Turbo Boost 2.0, the core is permitted to operate above the
processor's Thermal Design Power (TDP) for brief intervals provided the CPU has
thermal headroom, is operating within current limits, and is not already operating at its
Max Turbo frequency. The amount of time the core is allowed to operate above TDP is
dependent on the application-specific power consumption measured before and during
the above TDP interval, where energy credits are allowed to build up when operating
below TDP, then get exhausted when operating above TDP. In practice, only highly
optimized floating-point-intensive routines, often exploiting AVX optimization, can stress
the core enough to push above TDP, and the duration it can run above TDP generally
lasts only a couple of seconds, but this depends largely on the workload characteristics.
For compute workloads like Linpack, which operate at sustained levels for extended
periods, the brief period of increased frequency while running above TDP returns a
minimal net performance gain. This is because the brief duration of increased frequency
is only a small part of the overall workload time, and the high steady-state loading never
drops significantly below TDP during the measurement interval. Since this sustained
high (TDP) loading prevents energy credits from building back up, the processor is
unable to exceed TDP throughout the remainder of the benchmark measurement interval.
For more bursty, real world applications, the ability to operate above TDP for brief
intervals can return an incremental performance boost. In this bursty application
scenario, the processor spends short intervals below TDP where energy credits are able
to build up, then exhausts those energy credits when operating above TDP. Because
more time is spent above TDP for this case, the performance gains realized for Turbo
Boost are greater.
It is important to note that the maximum Turbo Boost upsides listed in Table 1-3 are not
guaranteed for all workloads. For workloads with heavy power and thermal
characteristics, specifically AVX-optimized routines like Linpack, a processor may run at
a frequency lower than its listed Max Turbo frequency. In these specific high-load
workload cases, the core will run as fast as it can while staying at or under its TDP. The
only frequency guaranteed for all workloads is the processors rated frequency, though
in practice some portion of the Turbo Boost capability is still possible even with highly
optimized AVX codes.
Finally, since any level of Turbo Boost above the All Cores Active frequency is
dependent on at least some of the cores being in ACPI C2 or C3 sleep states, these
C-States must remain enabled in system setup.
1.2 System
The dx360 M4 introduces some key system-level features enabling maximum
performance levels to be sustained. By subsystem, these include:
Memory:
- 4x 1600 MHz capable DDR3 memory channels per processor
- 2 DIMMs per memory channel
- 16 total DIMMs supporting a total capacity of up to 256 GB
Processor Interconnect:
- Dual QPI interconnects, operating at up to 8 GT/s
Expansion Cards:
- Each processor supplies 24 lanes of Gen3 PCIe to a PCIe riser card
- Each PCIe Gen3 riser provides one x16 slot (1U riser), or one x16 slot and one x8 slot (2U riser)
Communications:
- Integrated dual-port Intel 1 gigabit Ethernet controller for basic connectivity needs
- Mezzanine card options of either 10 Gbit Ethernet or FDR InfiniBand, without consuming a PCIe slot
The physical topology of these parts is depicted in the following block diagram. Note that
the interconnecting buses are also shown, since understanding which CPU these buses
connect to can be the key to tuning the locality of system resources.
[Figure 1-2 dx360 M4 Block Diagram with Data Buses: two Sandy Bridge (SDB) processors connected by dual QPI interconnects, with DIMM slots 1-8 attached to CPU 0 and slots 9-16 attached to CPU 1; the front of the system is at the bottom of the diagram.]
Not explicitly shown in this diagram, but key to many workloads, is the storage subsystem.
Note also that the dx360 M4 uses dual coherent QPI links to interconnect the CPUs.
Data traffic across these links is automatically load balanced to ensure maximum
performance. Combined with up to 8 GT/s speeds, this capability enables significantly
higher remote node data bandwidths than prior generation platforms.
This Non-Uniform I/O architecture enables very high performance and low latency I/O
accesses to a given processor's I/O resources, but the possibility does exist for I/O
access to a remote processor's resources, requiring traversal of the QPI links. While
this worst-case remote I/O access is still generally faster than the best-case performance
of the E5600, it is important to understand that some I/O accesses can be faster than
others with this architecture. With that in mind, the end user may choose to implement I/O
tuning techniques to pin the system software to local I/O resources, if maximum I/O
performance is required. This may be especially important for those environments
implementing GPU solutions.
Additional detail covering supported I/O adapters and GPUs is available in the dx360 M4
Product Guide
The speed that the entire memory subsystem is clocked at is determined by the lower of:
1) the CPU's maximum supported memory speed, as indicated in Table 1-2
2) the speed of the slowest DIMM channel on the system.
The maximum operating speed of each DIMM channel is dependent on the capability of
the DIMMs used, the speed and voltage that the DIMM is configured to run at, and the
number of DIMMs on the memory channel.
A list of the dx360 M4's supported DIMMs and the maximum frequencies at which these
DIMMs can operate in various configurations and power settings is provided in Table 1-4.
* These part details were not available during the writing of this paper, but are expected
to be announced by the time this paper is published. See the product pages for further
DIMM details.
Note that 1.35 V DIMMs are able to operate at 1.5 V, and this will occur in configurations
which mix 1.35 V and 1.5 V DIMMs. Memory speed and power settings are available from
within the UEFI configuration menus, under System Settings -> Memory, or via ASU as
discussed in section 1.2.3.1.
Eight DIMMs of identical size and type should be installed at a time, one
per memory channel, in order to achieve maximum memory performance.
Using the DIMM slot numbering indicated in Figure 1-2 above, identical DIMMs should
be installed within each of the following 8-DIMM groups:
1) DIMMs 1, 3, 6, 8 (CPU0), and DIMMs 9, 11, 14, 16 (CPU1)
2) DIMMs 2, 4, 5, 7 (CPU0), and DIMMs 10, 12, 13, 15 (CPU1)
[Figure 1-3 Relative Memory Latency by Clock Speed: relative memory latency for a Xeon X5670 at 2.93 GHz compared with Sandy Bridge (SDB) processors running at core frequencies from 2.7 GHz down to 1.2 GHz.]
As shown, the top Sandy Bridge processor frequencies have up to 15% lower latency
than a prior generation Xeon X5670. With Turbo Boost enabled, this same Sandy Bridge
processor is able to reach even higher clock speeds, reducing latencies by another 5+%.
However, when clock frequencies are reduced, this has a direct
impact on the memory subsystem, and latencies can increase drastically.
[Figure 1-4 Relative Memory Throughput by Clock Speed: relative memory throughput for Sandy Bridge (SDB) processors at core frequencies from 2.7 GHz down to 1.2 GHz.]
While processor ratings less than 1.8 GHz are not supported on the dx360 M4, the lower
frequencies shown are possible when processor P-states are enabled, which enables a
power-saving, down-clocking of the processor. While the active usage of a low frequency
P-state by the OS will generally occur only during periods which lack performance
sensitivity, there are specific cases where this can become an issue in real workloads.
Consider the case where an application may only exercise one processor socket (NUMA
node) at a time. In this case, the 2nd processor socket may be allowed to downclock, or
even sleep, assuming these capabilities are enabled in the System Settings and the OS.
However, cache coherency operations and remote memory requests may still take place
to the 2nd processor, which now has a critical component of its cache and memory
subsystem, the ring bus, being clocked at a reduced speed. For this reason,
environments which may have this sort of unbalanced processor loading occurring,
specifically while demanding peak memory performance, may consider disabling
processor P-states within System Settings, or setting the minimum processor frequency
within the OS. This latter method is explained for Linux OS here.
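For illustration, a minimal sketch of setting a floor on the processor frequency from within Linux, assuming the kernel exposes the cpufreq sysfs interface; the 2.7 GHz value (expressed in kHz) and the CPU glob are examples only, and the commands must be run as root:

# Sketch: raise the minimum frequency on every CPU so the second socket
# cannot down-clock while it is still servicing remote memory requests.
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    echo 2700000 > ${cpu}/cpufreq/scaling_min_freq   # value in kHz
done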
1.2.3 UEFI
The platform UEFI, working in conjunction with the Integrated Management Module, is
responsible for control of the low level hardware system settings. Many of the tunings
used to optimize performance are available within the F1 System Setup menu,
presented during boot time. These settings are also available from a command line
interface using the IBM Advanced Settings Utility, or ASU.
Since the dx360 M4 is used in cluster deployments, this section will first introduce the
ASU scripting tool, then provide UEFI tunings as implemented in ASU.
1.2.3.1 ASU
The Advanced Settings Utility is a key platform tuning and management mechanism to
read and write system configuration settings without having to manually enter F1 Setup
menus at boot time. Though changes to ASU settings still generally require a system
reboot to apply, this utility allows a consistent tuning of platforms and clusters using either
manual command line execution, or automated scripting.
Further information on ASU, including links to download Linux and Windows versions of
the tool are located here.
Many environments will not need explicit control over individual platform settings with
Sandy Bridge, as the dx360 M4 can be optimized for most environments with just a
couple of simple settings. The UEFI provides four Operating
Modes which cover the tuning requirements for many of the common usage
scenarios. These are:
1) Minimal Power
This mode reduces memory and QPI speeds, and activates the most aggressive
power management features, the combination of which will have steep
performance implications. This setting would be recommended for those
environments which must minimize power consumption at all costs.
2) Efficiency Favor Power
This mode also limits QPI and memory speeds, but not as aggressively as
Minimal Power mode. Turbo Mode remains disabled in this operating mode.
3) Efficiency Favor Performance (System Default)
This is the default operating mode. Memory and QPI buses are run at their
maximum hardware supported speeds, though most power management
features remain enabled. Turbo mode is enabled, and processor sleep states
(C-States) are enabled and allowed to enter the deepest sleep levels. This mode
generally enables an optimal balance of performance/watt on Sandy Bridge.
4) Maximum Performance
This mode generally allows the highest performance levels for most applications,
though exceptions do occur. Processor C-States continue to remain enabled, as
these are necessary for optimal Turbo Mode performance. The C-State Limit is
reduced to ACPI C2 in this mode to minimize the latency associated with waking
the sleeping cores, and C1 Enhanced Mode is disabled. Processor Performance
States are disabled in this mode, ensuring the CPU is always operating at or
above rated frequency.
There is also a fifth mode, Custom Mode, which enables all of the UEFI features to be set
individually for specific workload conditions. The following table (Table 1-5) covers the most
common performance-specific parameters, listed in ASU setting format:
Each of these parameters can be set from the ASU as well as from the corresponding
menu in F1 System Setup.
Fundamentally, setting the Operating Mode to either the default of Efficiency Favor
Performance or Maximum Performance, depending on the performance and power
sensitivity of the environment, combined with application-driven optimization of the
Hyper-Threading setting, will provide excellent performance results for a given application
environment.
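As an example, a hedged sketch of applying an Operating Mode with ASU; the setting names shown (OperatingModes.ChooseOperatingMode and the Hyper-Threading setting) are assumptions that should be confirmed against an "asu64 show" listing on the target firmware level:

# Sketch only -- verify the exact setting names with "asu64 show" first.
./asu64 show all
./asu64 set OperatingModes.ChooseOperatingMode "Maximum Performance"
./asu64 set Processors.Hyper-Threading Disable    # example per-application tuning
# A reboot is generally required before the new values take effect.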
1.3 Mellanox InfiniBand Interconnect
Above, you will have noted that different HCAs are available at different InfiniBand technology
speeds: FDR10 and FDR. First, InfiniBand speed terminology will be described. Then, the
supported switches and the considerations for routing, fabric tuning, and monitoring are covered.
The base data rate for InfiniBand technology has been the single data rate (SDR), which
is 2.5Gbps per lane or bit in a link. Previous interconnect technologies, up to FDR10,
have used this SDR reference speed. The standard width of the interface is 4X or 4 bits
wide. Therefore, the standard SDR bandwidth or speed of a link is 2.5Gbps times 4 bit
lanes, or 10Gbps. DDR, or double data rate, is 20Gbps per 4x link. QDR, or quad data
rate, is 40Gbps per link.
FDR10 is based on FDR technology, with the 10 appended to FDR to directly indicate
the bit lane speed of 10Gbps. An important difference between FDR10 and QDR is that
FDR10 is more efficient in its data transfer than QDR, because of certain FDR
characteristics.
The FDR nomenclature begins to deviate from basing the speed on a multiple of
2.5Gbps. FDR stands for fourteen data rate, or a bit speed of 14Gbps, which translates
into a 4x link speed of 56Gbps.
While FDR10 is nominally the same speed (40Gbps per 4x link) as the previous
generation (QDR), there is a different encoding of the data on the link that allows for
more efficient use of the link while still providing data protection. The QDR technology ran
an 8/10 bit encoding which yields 80% efficiency for every bit of data payload sent across
the link. FDR10 uses a 64/66 bit encoding which yields 97% efficiency. In other words,
the effective rate of a QDR link is 32Gbps; whereas, FDR10 has an effective data rate of
38.8 Gbps. In both cases, a modest portion of the effective data rate is consumed by bits
implementing packet overhead.
By using the same nominal speed as QDR, FDR10 can use the same basic cable
technology as QDR, which has helped with getting the improved FDR link efficiency to
market quicker.
To achieve FDR10 efficiencies, the HCAs must be attached to switches that support FDR
bit encoding. If they are attached to QDR switches, the HCAs will operate, but at QDR
rates and efficiencies.
FDR operates at 56Gbps per 4x link. It maintains the same bit encoding as FDR10 and
therefore the same 97% efficiency of the link. This yields an effective data rate of 54.3
Gbps. To achieve full FDR rates, the HCAs must be attached to switches that support full
FDR rates. If they are attached to switches that support a maximum of QDR or FDR10
rates, the HCAs will operate, but at the lower speeds.
The Mellanox model numbers for currently supported FDR10/FDR switches are:
SX6036 = a 36 port Edge switch.
SX6536 = a 648 port Director switch, which scales from 8.64 Tbps up to 72.52 Tbps
of bandwidth in a single enclosure.
Both switch models are non-blocking. Both switch models can support any speed up to
FDR (including FDR10).
Edge switches are typically used for small clusters of servers or as top-of-rack switches
that provide edge or leaf connectivity to one or more Director switches implemented as
core switches. This allows for scaling beyond 648 nodes.
It is also possible to use the SX6536 to connect up to 648 HCAs in a single InfiniBand
subnet, or plane.
The typical large scale solution is implemented as a fat-tree to maintain 100%
bisectional bandwidth for any-to-any node communication. For example, for a cluster of
1296 nodes, each with one connection into a plane, the typical topology would be to have
72 SX6036 Edge switches distributed amongst the frames of DX360 M4 servers. This
has 18 servers connected to each Edge switch. The Edge switches will then connect to
two SX6536 Director switches with 9 cables from each of the Edge switches connecting
to each of the Director switches.
Various subnet managers typically have several possible routing algorithms such as
Minimum number of Hops (default), up-down, fat tree, and so on. It is recommended that
the various options be discussed with a routing expert before deviating from the default
algorithm. Parameters like the types of applications, the chosen topology and the
experiences of a particular algorithm in the field should be considered.
MINHOP or shortest path optimizes routing to achieve the shortest path between two
nodes. It balances routing based on the number of paths using each port in the fabric.
This is the default algorithm.
UPDN or up-down provides a shortest path optimization, but also considers a set of
ranking rules. This is designed for a topology that is not a pure Fat tree, and has potential
deadlock loops that must be avoided.
FAT TREE can be used for various fat-tree topologies as it optimizes for various
congestion-free communication patterns. It is similar to UPDN in that it also employs
ranking rules.
LASH or layered shortest path uses InfiniBand virtual layers (SL) to provide deadlock-free
shortest path routing.
DOR or dimension ordered routing provides deadlock free routes for hypercube and
mesh topologies.
FILE or file-based loads the routing information directly from a file. This would be for very
specialized applications and has disadvantages in that it restricts the subnet manager's
ability to dynamically react to changes in the topology.
UFM TARA or traffic aware routing is unique to Mellanox's Unified Fabric Manager
(UFM) and combines UPDN with traffic-aware balancing that includes application
topology and weighted patterns. This requires UFM to work in concert with applications
and job managers to maintain awareness of traffic patterns so that it can dynamically
optimize routing. Therefore, it may not be possible to use this algorithm with all solutions.
The Mellanox subnet manager also includes support for adaptive routing. Adaptive
routing allows the switch to choose how to route a packet based on availability of the
optional ports used to get from one node to another. This works as you traverse the fabric
from the source node to halfway out in the fabric. Once you reach the halfway point, the
remainder of the path is predestined and no more choices are available. If the congestion
pattern tends to be in the first half of the route, this can be an effective tool. If the
congestion pattern is in the back half of the route, adaptive routing is less effective; for
example, for many-to-one patterns, the congestion starts at the destination node end and
backs up into the fabric.
The Mellanox subnet manager also includes support for Quality of Service (QoS). It uses
service lanes (or virtual lanes) and a weight factor for each lane to ensure that higher
priority data traffic is separated from and takes precedence over lower priority data traffic.
In this way, the higher priority traffic avoids being delayed by lower priority traffic. To take
advantage of QoS, the applications, MPI and RDMA stack must be implemented in a way
that uses service lanes. As this is not always the case, some solutions are limited to
separating IP over InfiniBand (IPoIB) traffic from RDMA traffic by taking advantage of the
ability to assign a non-default service lane for IPoIB.
Note: The InfiniBand MTU is different from an IP MTU. It is a maximum transmission unit
at the physical layer, or the size of packets in the fabric itself. Larger IP packets bound by
the IP MTU are broken down into smaller packets in the physical layer bound by the
InfiniBand MTU.
While it is not often used in the industry, smaller fabric solutions can sometimes benefit
from LMC (LID mask control) being set to 1 or 2. A non-zero LMC causes the SM to
assign multiple LIDs to each device, and then generate a different path to each LID.
While originally envisioned as a failover mechanism, this also allows for upper layer
protocols to scatter traffic over several paths with the intention of reducing congestion.
This requires an RDMA stack that is aware of the multiple paths provided by a non-zero
LMC so that the path can be periodically switched according to some algorithm (like
round-robin, or least recently used). There is a cost associated with LMC > 0, in that each
port is assigned multiple LIDs (local identifiers) and this will take up more buffer and
memory space. It will also affect start-up time for RC (reliable) connections. As a cluster
is scaled up, the impact becomes more noticeable. In fact, if a cluster gets large enough,
the hardware may run out of space to support the number of buffers required for LMC >
0. Typically, a performance expert should be consulted on the MPI solution to see if there
is any benefit to LMC > 0.
Finally, a pro-active method for monitoring the health of the fabric can help maintain
performance. With the current InfiniBand architecture, errors in the fabric require a
retransmission from source to destination. This can be costly in terms of latency and lost
effective bandwidth as packets are retransmitted. Again, the recommendation is to
consult with an expert in fabric monitoring to understand how best to determine that the
network fabric is healthy. Some considerations are:
Monitoring error counters and choosing thresholds that are appropriate for
the expected bit error rate (BER) of the fabric technology.
Typically, default thresholds are only adequate for links that greatly exceed the
acceptable BER. For links that are noisy and can impact performance, but are
barely over the acceptable BER, default thresholds are likely to be inadequate. A
time based threshold is preferred. However, many basic monitoring tools only
have count based thresholds (ignoring the bit error rate), which leads to the need
to develop a local strategy for regularly clearing the error counters to impose a
rough time-base to the count threshold. An expert should be consulted for the
appropriate bit error rate. In many cases, this is currently in the 10^-15 to 10^-14
range.
Monitoring for lost links in the fabric that can lead to imbalanced routes
and congestion.
Monitoring link speeds and widths.
When a link is established it is trained to the highest speed at which it can
operate based on both the inherent limitations of the technology (FDR10, FDR,
QDR, and so on) as well as the particular instance of hardware on the link. A
noisy cable or port may be tuned to a lower speed or smaller width to allow it to
operate. Quite often the switch technology will allow a system administrator to
override this and force a link to operate only at the maximum. However, the
default case is to tune to whatever the link can handle. Therefore, unless the
default is overridden, the system administrator will want to be sure to monitor for
speed or width changes (particularly across node or switch reboots). For FDR10, it
is particularly important to be observant regarding whether or not the link has
come up at FDR10 versus QDR. It is possible that the link can handle 10Gbps
per bit lane, but not tune to the 64/66 bit encoding. Tools vary in whether they
report a 10Gbps link as QDR or FDR10; some will only report whether it is sub-optimal.
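As a starting point, the standard OFED diagnostic utilities can be used to check the items above; the sketch below assumes the infiniband-diags tools (ibstat, iblinkinfo, perfquery) are installed on a node with fabric access, and any thresholds applied to the counters remain site-specific:

# Sketch: basic fabric health checks with common OFED utilities
ibstat                  # local HCA port state, trained width and rate
iblinkinfo              # every link in the fabric with its speed and width
perfquery -a            # error counters on all ports of the local HCA
perfquery -a -R         # read, then reset, the counters to approximate a time-based threshold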
1.3.1 References
[1] InfiniBand Linux SW Stack
[2] Routing Algorithm Info (Requires a userid with Mellanox support access).
Along with compilers, mathematical libraries are heavily used in HPC codes and are
critical for high performance. Section 2.2 introduces the Intel MKL library optimized for
Intel processors and briefly mentions the other alternatives.
The most important languages for High Performance Computing are FORTRAN, C, and
C++ because they are the ones used by the vast majority of codes. This chapter focuses
on those languages, though compiler providers also support others.
All of the descriptions of the GNU and Intel compiler options included in the tables in this
chapter are taken from the GNU Optimization Options Guide and the Intel Fortran
Compiler User and Reference Guide.
2.1 Compilers
Support for the AVX vector units introduced with the Sandy Bridge processor was
announced with the 4.6.3 release of the GCC compilers. GCC version 4.7.0 implements several
new features, detailed in the compiler options section below.
GCC 4.7.0 has been successfully built on a Sandy Bridge system by using the following
configure options:
export BASEDIR=/path/to/GCC-4.7.0
export LD_LIBRARY_PATH=${BASEDIR}/dlibs/lib:$LD_LIBRARY_PATH
${BASEDIR}/gcc-4.7.0-RC-20120314/configure \
--prefix=${BASEDIR}/install \
--with-gmp=${BASEDIR}/dlibs \
--with-gmp-lib=${BASEDIR}/dlibs/lib \
--with-mpfr-lib=${BASEDIR}/dlibs/lib \
--with-mpfr=${BASEDIR}/dlibs \
--with-mpc-lib=${BASEDIR}/dlibs/lib \
--with-mpc=${BASEDIR}/dlibs \
--with-ppl=${BASEDIR}/dlibs \
--with-ppl-lib=${BASEDIR}/dlibs/lib \
--with-libelf=${BASEDIR}/dlibs \
--enable-languages=c,c++,fortran --enable-shared --enable-threads=posix \
--enable-checking=release --with-system-zlib --enable-__cxa_atexit \
--disable-libunwind-exceptions --enable-libgcj-multifile
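After configuring, the build and installation follow the usual GCC procedure; a minimal sketch (the parallel make width is only an example):

make -j 16        # parallel build; adjust to the number of available cores
make install      # installs into ${BASEDIR}/install, as set by --prefix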
Several additional tools are available to help analyze and optimize HPC codes on
Intel systems. These tools are described in Chapter 5.
2.1.3.1 Architecture
In order to efficiently optimize a program on a specific processor, the compiler needs
information on the architecture's details. It is then able to use adapted parameters for its
internal optimization engines and generate optimized assembly code matching the
identified hardware as closely as possible: cache sizes, vector unit details, prefetching
engines, etc.
If compiling on the same architecture as the one used for the computation, compilers
(through usage of compiler options) can also automatically detect the processor details
and then generate optimal settings without user interaction.
For cross compiling, the user has to specify the target architecture to the compiler.
Another case is when the user wants to have a binary that can run on all architectures in the
processor family (for instance, the x86 family). This is best applied for creating binaries
for pre- or post-processing programs that don't require significant computing power but
can be conveniently run on various systems within the same processor family without
recompilation.
2.1.3.1.1 GNU
The following options are used by the GNU compiler to specify the hardware architecture to
be used for code generation:
-march= Generate code for given CPU
-mtune= Schedule code for given CPU
The corei7-avx argument tells the compiler to generate code for Sandy Bridge and use
AVX instructions:
-march=corei7-avx
By default, -march=corei7-avx also enables the -mavx compiler flag, which allows the vectorizer to use AVX instructions.
2.1.3.1.2 Intel
For local or cross compilation, the -xcode option specifies the hardware architecture
(code in this example) to be used for code generation. For the Sandy Bridge
architecture, using
-xAVX
may generate SIMD instructions for Intel processors.
If the code is being compiled on the same processor that will be used for computation
(local compilation), the following option produces optimized code:
-xhost
It tells the compiler to generate instructions for the most complete instruction set available
on the host processor.
For compatibility with GCC, the Intel compiler accepts GNU syntax for some options; the
-mavx and -march=corei7-avx flags are equivalent to -xAVX
Options -x and -m are mutually exclusive. If both are specified, the compiler uses the last
one specified and generates a warning.
The Intel compiler ignores the options in Table 2-2. These options only generate a
warning message. The suggested replacement options should be used instead.
Table 2-2 A mapping between GCC and Intel compiler options for processor architectures
GCC Compatibility Option Suggested Replacement Option
-mfma -march=core-avx2
-mvzeroupper -march=corei7-avx
The Intel compiler includes many compiler options that can affect the optimization and
the subsequent performance of the code, but this section only touches on the most
common options.
2.1.3.2.1 GNU
The GCC/GFortran compiler has to be configured and compiled on a specific target
system, so it may not support some features and compiler technologies depending on the
configure arguments. The compiler explicitly reports on the exact set of optimizations that
are enabled at each level by including the -Q --help=optimizers option:
$ gfortran -Q --help=optimizers
The general optimization levels are listed in Table 2-3
The Optimization Options guide [15] on the GNU web site provides more details.
2.1.3.2.2 Intel
Table 2-4 lists the general levels for optimization that are available for the Intel compiler.
The Intel Fortran Compiler User and Reference guide [2] provides more information and
a complete list of optimization flags.
2.1.3.3 Vectorization
All processor manufacturers have introduced SIMD (vector) units to improve the
computing capabilities of their processors. Intel introduced the AVX (Advanced Vector
Extensions) unit working on 256 bit wide data with the Sandy Bridge processor. More
information is available in Chapter 7.
In order to enable access to this additional compute power, compilers must produce
instructions specific for this hardware. The most important compiler options to enable
SIMD instructions are presented next.
2.1.3.3.1 GNU
The GCC compiler enables autovectorization with
-ftree-vectorize
It is enabled by default at -O3.
The architecture flags select the type of SIMD instructions used by the
autovectorizer. For the Sandy Bridge processor, the recommended options are:
-mavx
or
-march=corei7-avx
to use AVX (for 256-bit data) instructions.
So the recommended compiler options for the GNU compiler for Sandy Bridge
processors are:
-O3 -march=corei7-avx (or -O3 -mavx)
Information on which loops were or were not vectorized and why, can be obtained using
the flag -ftree-vectorizer-verbose=<level>
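For example, a hedged compile line that combines the recommended flags with a vectorization report; the source file name is illustrative:

# Sketch: generate AVX code and report which loops were vectorized
gfortran -O3 -march=corei7-avx -ftree-vectorizer-verbose=2 -c compute_kernel.f90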
2.1.3.3.2 Intel
Vectorization is automatically enabled with -O2.
The architecture flags select the type of SIMD instructions used and also enable
autovectorization. For the Sandy Bridge processor, the recommended options are:
-xAVX
to allow for 256-bit vector instructions and enable autovectorization or
-xSSE4.2
to select the latest instruction set for 128-bit vector data and enable autovectorization.
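For example, a sketch of an Intel compile line using these flags together with a vectorization report; the -vec-report level and the file name are illustrative:

# Sketch: AVX vectorization with a loop-by-loop vectorization report
ifort -O3 -xAVX -vec-report2 -c compute_kernel.f90
# Or, when compiling directly on the compute nodes themselves:
ifort -O3 -xhost -c compute_kernel.f90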
Note:
User-mandated SIMD vectorization directives supplement automatic vectorization just as
OpenMP parallelization supplements automatic parallelization. SIMD vectorization uses
the !DIR$ SIMD directive to effect loop vectorization. The directive must be added before
a loop and the loop must be recompiled to become vectorized (the option -simd is
enabled by default).
To disable the SIMD transformations for vectorization, specify option -no-simd.
To disable transformations that enable more vectorization, specify options -no-vec -no-simd.
Complete information is available in [2].
Additional compiler options allow the compiler to perform a more comprehensive analysis
and better vectorization:
Sometimes code blocks are not optimized or vectorized. Producing compiler reports
provides diagnostic information with hints on how to tune the source code further. This is
done using the flags listed in Table 2-7.
2.1.3.4.1 GNU
A complete inter-procedural analysis has only been part of the GNU compiler since
version 4.5. It was previously limited to inlining functions into a single file using the
-finline-functions compiler flag. Now there is a more sophisticated process
available by using the -flto option, which performs link-time optimization across multiple
files.
Table 2-8 Global (inter-procedural) optimization options for the GNU compiler suite
-finline-functions    Consider all functions for inlining, even if they
                      are not declared inline. The compiler
                      heuristically decides which functions are worth
                      integrating in this way.
                      If all calls to a given function are integrated,
                      and the function is declared static, then the
                      function is normally not output as assembler code
                      in its own right.
                      Enabled at level -O3.
2.1.3.4.2 Intel
The Intel implementation of Inter Procedural Optimization supports two models: single-file
compilation using the -ip compiler option, and multi-file compilation using the -ipo compiler flag.
The optimizations that can be done by the Intel compiler when using interprocedural analysis are:
Table 2-9 Global (inter-procedural) optimization options for the Intel compiler suite
-ip This option determines whether additional
interprocedural optimizations for single-file
compilation are enabled.
Automatic parallelization:
Using a specific compiler option, the user tells the compiler to automatically parallelize
the sections that will support it. This parallelization is implemented through shared
memory mechanisms and is very similar to OpenMP threading. Parallel execution is
managed through environment variables, often the same as are used explicitly with
OpenMP.
Thread Affinity:
Thread affinity is a critical concept when using threads for computing: the locality of the
data used by the threads must be managed. The performance of a core will be higher if
the data used by the thread running on that core is located in the nearest hardware
memory DIMMs. To ensure this locality, one has to bind the thread to run on a particular
core, using the data located in the DIMM physically attached to the processor chip
containing this core. Environment variables and tools can control this binding. More
details are available in Section 3.4.
2.1.3.5.1 GNU
Automatic parallelization
The -ftree-parallelize-loops=<n> compiler flag generates multithreaded code automatically,
where <n> is the number of threads to use. This flag must be passed to both the compile and link steps.
OpenMP
The OpenMP directives are processed with the -fopenmp compiler flag.
When using this flag, all local arrays will be made automatic and then allocated on the
stack. This can be a source of segmentation faults during execution, because of too small
a limit on the stack size. The stack size can be changed by using the following
environment variables:
OMP_STACKSIZE from the OpenMP standard
GOMP_STACKSIZE from the GNU implementation
Both variables change the default size of the stack allocated by each thread.
The size is limited by the value of the user's stack limit, as reported by
ulimit -s
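For example, the limits could be raised as follows (the 512 MB value is only illustrative):
ulimit -s unlimited
export OMP_STACKSIZE=512M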
Thread affinity
GOMP_CPU_AFFINITY : Binds threads to specific CPUs.
Syntax: GOMP_CPU_AFFINITY="0 3 1-2 4-15:2" will bind the initial thread to CPU 0, the
second to CPU 3, the third to CPU 1, the fourth to CPU 2, the fifth to CPU 4, the sixth
through tenth to CPUs 6, 8, 10, 12, and 14 respectively and then start assigning back
from the beginning of the list.
2.1.3.5.2 Intel
Automatic parallelization
The -parallel compiler flag automatically enables loops to use multithreading. This flag
must be passed to both the compile and link steps.
This option must be used with optimization levels -O2 or -O3 (The -O3 option also sets
the -opt-matmul flag).
OpenMP
The -openmp flag allows the compiler to recognize OpenMP directives in the source
code.
This flag must be passed to both the compile and link steps.
Table 2-11 OpenMP options for the Intel compiler suite
-openmp Enables the parallelizer to generate multi-threaded
code based on the OpenMP* directives. The code can be
executed in parallel on both uniprocessor and
multiprocessor systems.
If you use this option, multithreaded libraries are
used, but the -fpp option is not automatically invoked.
This option sets the -automatic option.
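A minimal compile-and-link sketch (the source file name is hypothetical):
$ ifort -O3 -openmp -o myprog myprog.f90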
Thread affinity
The Intel runtime library has the ability to bind OpenMP threads to physical processing
units. The interface is controlled using the KMP_AFFINITY environment variable.
The syntax of this variable is extensive and covers many use cases. Reference
[2] has all of the details.
One way to explicitly control the way the threads are assigned to physical or virtual cores
in a system is to use the explicit type in conjunction with the proclist modifier.
For instance, to bind 16 OpenMP threads to 16 physical cores (numbered from 0 to 15)
on a Hyper-Threaded system with 16 cores and 32 logical CPUs, either of the following
settings can be used:
export KMP_AFFINITY="proclist=[0-15:1],granularity=fine,explicit"
export KMP_AFFINITY=\
"proclist=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],granularity=fine,explicit"
Table 2-12 GNU OpenMP runtime environment variables recognized by the Intel compiler toolchain
GOMP_STACKSIZE GNU extension recognized by the Intel OpenMP
compatibility library. Same as OMP_STACKSIZE.
KMP_STACKSIZE overrides GOMP_STACKSIZE, which
overrides OMP_STACKSIZE.
GOMP_CPU_AFFINITY and KMP_AFFINITY with the explicit type have the same syntax.
export GOMP_CPU_AFFINITY="0-15:1"
export KMP_AFFINITY="proclist=[0-15:1],granularity=fine,explicit"
Recently, Coarrays have emerged as an alternative method integrated into the Fortran
compiler and may be a future contender to address distributed parallelism. Coarrays
(also called Coarray Fortran, or CAF) allow parallel programs to use a Partitioned Global
Address Space (PGAS) following the SPMD (single program, multiple data)
parallelization paradigm. Each process (called an image) has its own private variables.
Variables which have a so-called codimension are addressable from other images. This
extension is part of the Fortran 2008 standard.
Various compilers implement Coarrays but do not have the same functionality.
2.1.3.6.1 GNU
The implementation of CAF in the GNU Fortran compiler is very new and almost useless.
The latest information and status are available at http://gcc.gnu.org/wiki/Coarray and
http://gcc.gnu.org/wiki/CoarrayLib.
Reported from the Current Implementation Status in GCC Fortran on the GCC Trunk
[4.7 (experimental)]
GCC 4.6: Only single-image support (i.e. num_images() == 1) but many
features do not work.
GCC 4.7: Includes multi-image support via a communication library. There is
comprehensive support for a single image, but most features do not yet work with
num_images() > 1.
To enable a Fortran code to use CAF with the GNU compiler, the user has to specify the
-fcoarray switch (-fcoarray=single for single-image code, or -fcoarray=lib to use the
communication library for multi-image support).
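For instance, a single-image build could look like the following (the file name is hypothetical):
$ gfortran -fcoarray=single -O2 caf_test.f90 -o caf_test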
2.1.3.6.2 Intel
The CAF implementation in the Intel compiler is more mature and allows compiling and
running with coarrays on local and remote nodes. It uses shared memory transfers for
intra-node accesses/transfers and Intel MPI for inter-node exchanges. No comparison of
the CAF implementation and the native MPI implementation has been done.
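As an illustrative sketch only (the exact options should be checked against [2]; the file name is hypothetical), a distributed-memory build might look like:
$ ifort -coarray=distributed -coarray-num-images=16 caf_test.f90 -o caf_test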
2.1.4 Alternatives
The following compiler suites include FORTRAN, C, and C++ compilers supporting the
Intel Sandy Bridge features and also provide tools for debugging, optimization, automatic
parallelization, etc. They have not been assessed recently enough to be included in this
document.
The PGI Workstation compilers from the Portland Group [11]
The PathScale EKOPath 4 compilers from PathScale [10]
2.2 Libraries
HPC libraries are fundamental tools for scientists during code development: They provide
standardized interfaces to tested implementations of algorithms, methods, and solvers.
They are easy to use and more efficient than manually coding the equivalent
functionality. Frequently, they are already vectorized and parallelized to take advantage
of modern HPC architectures.
HPC codes are usually developed following open standards for the libraries, but used in
production with highly optimized math libraries like MKL for Intel processors.
Sparse BLAS and solvers are also available in MKL library. It supports CSR, CSC, BSR,
DIA and Skyline data storage as well as NIST and SparseKit style interfaces.
The Vector Statistical Library (VSL) is an integral part of the MKL library and provides a number of generator
subroutines implementing commonly used continuous and discrete distributions to help improve application performance.
All these distributions are based on the highly optimized Basic Random Number
Generators (BRNGs) and VML.
For instance, for the following configuration: Linux + Intel Fortran compiler + SMP version
of MKL + SCALAPACK + BLACS + Fortran95 interface for BLAS and LAPACK, this tool
provides the following information:
Compiler options:
-I$(MKLROOT)/include/intel64/lp64 -I$(MKLROOT)/include
For the link line:
-L$(MKLROOT)/lib/intel64
$(MKLROOT)/lib/intel64/libmkl_blas95_lp64.a
$(MKLROOT)/lib/intel64/libmkl_lapack95_lp64.a
-lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_intel_thread
-lmkl_core -lmkl_blacs_intelmpi_lp64 -openmp -lpthread -lm
The GNU compiler, MPICH, 32bit and 64bit are some of the possibilities to use with the
MKL library.
2.2.2.1 BLAS
The BLAS (Basic Linear Algebra Subprograms) [16] are routines that provide standard
building blocks for performing basic vector and matrix operations. The Level 1 BLAS
perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-
vector operations, and the Level 3 BLAS perform matrix-matrix operations. Because the
BLAS are efficient, portable, and widely available, they are commonly used in the
development of high quality linear algebra software, LAPACK for example.
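As a small sketch of a Level 3 BLAS call through the C interface (this assumes MKL's CBLAS header; any CBLAS implementation would be used the same way, with the link line shown earlier):
#include <stdio.h>
#include <mkl_cblas.h>  /* CBLAS interface shipped with MKL */

int main(void)
{
    /* C = 1.0*A*B + 0.0*C for 2x2 row-major matrices */
    double A[4] = {1.0, 2.0, 3.0, 4.0};
    double B[4] = {5.0, 6.0, 7.0, 8.0};
    double C[4] = {0.0, 0.0, 0.0, 0.0};
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);
    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
    return 0;
}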
2.2.2.2 LAPACK
LAPACK [17] is written in Fortran 90 and provides routines for solving systems of
simultaneous linear equations, least-squares solutions of linear systems of equations,
eigenvalue problems, and singular value problems. The associated matrix factorizations
(LU, Cholesky, QR, SVD, Schur, and generalized Schur) are also provided, as are
related computations such as reordering of the Schur factorizations and estimating
condition numbers. Dense and banded matrices are handled, but not general sparse
matrices. In all areas, similar functionality is provided for real and complex matrices, in
both single and double precision.
2.2.2.3 SCALAPACK
The ScaLAPACK [18] (or Scalable LAPACK) library includes a subset of LAPACK
routines redesigned for distributed memory MIMD parallel computers. It is currently
written in a Single-Program-Multiple-Data style using explicit message passing for inter
processor communication. It assumes matrices are laid out in a two-dimensional block
cyclic decomposition.
2.2.2.4 ATLAS
The ATLAS [19] (Automatically Tuned Linear Algebra Software) project is an ongoing
research effort focusing on applying empirical techniques in order to provide portable
performance. At present, it provides C and Fortran77 interfaces to a portably efficient
BLAS implementation, as well as a few routines from LAPACK.
2.2.2.6 FFTW
FFTW [20] is a C subroutine library for computing the discrete Fourier transform (DFT) in
one or more dimensions, of arbitrary input size, and of both real and complex data (as
well as of even/odd data, i.e. the discrete cosine/sine transforms or DCT/DST).
The latest official release of FFTW is version 3.3.1 and introduces support for the AVX
x86 extensions.
2.2.2.7 GSL
The GNU Scientific Library (GSL) [21] is a numerical library for C and C++ programmers.
It is free software under the GNU General Public License.
The library provides a wide range of mathematical routines such as random number
generators, special functions and least-squares fitting. There are over 1000 functions in
total with an extensive test suite.
2.3 References
All of the descriptions of the GNU and Intel compiler options included in the tables in this
chapter are taken from the GNU Optimization Options Guide and the Intel Fortran
Compiler User and Reference Guide.
[1] Intel Composer XE web page: http://software.intel.com/en-us/articles/intel-composer-xe/
[2] Intel Fortran Compiler User and Reference Guide:
http://software.intel.com/sites/products/documentation/hpc/composerxe/en-
us/2011Update/fortran/lin/index.htm
[3] Intel AVX web page: http://software.intel.com/en-us/avx/
[4] Intel MKL: web page: http://software.intel.com/en-us/articles/intel-mkl/
3 Linux
The iDataPlex dx360 M4 supports the following versions of 64-bit (x86-64) Red Hat
Enterprise Linux (RHEL) and SUSE Linux Enterprise Server (SLES): RHEL 6 and SLES 11.
The processor frequency settings can be controlled through either hardware or software.
The hardware configuration is controllable through the system's UEFI interface which is
available during system initialization. It is also possible to adjust the UEFI configuration
using the IBM Advanced Settings Utility (ASU). Detailed information on ASU is available
in section 1.2.3.1.
In addition to the available hardware controls, Linux provides its own clock frequency
management. Linux uses what are referred to as CPU governors to manage clock
frequency. The two most common governors are performance and ondemand. The
performance governor always runs the processor at its nominal clock frequency. In
contrast, the ondemand governor will vary the clock frequency depending on the
processor utilization levels of the system. The method of controlling the clock frequency
management in Linux varies from one distribution to the next (for example, between RHEL 6
and SLES 11), so it is best to consult the distribution documentation for details.
More information is included in the cpufrequtils packages available on RHEL and SLES.
To find the exact package needed on RHEL, try
$ yum search cpufreq
On SLES, try
$ zypper search cpufreq
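For example, the governor for a core can be inspected through sysfs and changed (as root) with the cpufrequtils tools; the core number and governor below are only illustrative:
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# cpufreq-set -c 0 -g performance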
Using the 2 MB huge page can improve performance by reducing pressure on the
processor's translation lookaside buffer (TLB) which typically has a fixed number of
elements that it can cache. By increasing the page size, the TLB is capable of caching
entries that address larger amounts of memory than when small pages are used.
Historically, access to the 2 MB page size has been restricted to applications specifically
coded to use it, which has limited the ability to take advantage of this feature.
Two recent additions to Linux have increased the viability of using huge pages for
applications without specifically modifying them to do so. The libhugetlbfs project
enables applications to explicitly make use of huge pages when the user requests them.
There are usability concerns with libhugetlbfs since the huge pages must be allocated
ahead of time by the system administrator. This is done using the following command
(allocating 30 huge pages in this case):
# echo 30 > /proc/sys/vm/nr_hugepages
In order for huge pages to be allocated in this manner, the operating system must be able
to find appropriately sized regions of contiguous free memory (2 MB in this case). This
can be problematic on systems which have been running for awhile, where memory
fragmentation has occurred.
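As a rough sketch, an unmodified program (the binary name is hypothetical) can then be run with its heap backed by the reserved huge pages through libhugetlbfs:
$ HUGETLB_MORECORE=yes LD_PRELOAD=libhugetlbfs.so ./myapp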
The number of allocated pages (and current usage) can be checked by running the
following command:
# grep HugePage /proc/meminfo
AnonHugePages: 4237312 kB
HugePages_Total: 30
HugePages_Free: 30
HugePages_Rsvd: 0
HugePages_Surp: 0
As shown in this output (line AnonHugePages), recent x86-64 Linux distributions
(including RHEL 6.2 and SLES 11.2) support a new kernel feature called transparent
huge pages (THP). This example shows over 4 GB of memory backed by huge pages
allocated via THP. THP allows for applications to automatically be backed by huge
pages without any special effort by the user. To enable THP, the Linux memory
management subsystem has been enhanced with a memory defragmenter. The
defragmenter increases the likelihood of large contiguous memory regions being
available after the system has been running for awhile. The presence of THP does not
preclude the use of explicit huge pages or libhugetlbfs. The presence of the new memory
defragmenter should make their use easier.
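The current THP mode can be checked through sysfs; the path below is the upstream location, while RHEL 6 uses /sys/kernel/mm/redhat_transparent_hugepage/enabled instead:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never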
3.3.1 Introduction
Linux optimizes for memory access performance by attempting to make memory
allocations in a NUMA (Non Uniform Memory Access) aware fashion. That is, Linux will
attempt to allocate local memory when possible and only resort to performing a remote
allocation if the local allocation fails.
While the allocation path attempts to behave in an optimal fashion, this behavior can be
offset by the kernel's task scheduler, which is not NUMA aware. It is possible (likely
even) that the task scheduler can move a process / thread to a core which is remote to
the already allocated memory. This increases the memory access latency and may
decrease the achievable memory bandwidth. This is one reason why it is recommended
that most HPC applications perform explicit process or thread binding.
To display the NUMA topology (memory and processor topology) on the dx360 M4:
% numactl --hardware
The output should be similar to the following, depending on the installed processors and
memory:
% numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 65514 MB
node 0 free: 61196 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 65536 MB
node 1 free: 56252 MB
node distances:
node 0 1
0: 10 11
1: 11 10
The numactl utility can be used to modify memory allocation in a variety of manners
such as:
Require allocation on a specific NUMA node(s)
Prefer allocation on a specific NUMA node(s)
Interleave allocation on specific NUMA nodes
Interleaving is particularly useful when memory allocation is performed by a master
thread and application source code modification is not possible.
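For example (the binary name is hypothetical), allocation can be required on node 0, preferred on node 1, or interleaved across all nodes:
% numactl --membind=0 --cpunodebind=0 ./myapp
% numactl --preferred=1 ./myapp
% numactl --interleave=all ./myapp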
For further information and detailed command argument documentation, see man numactl.
3.4.1 taskset
taskset is the Linux system command that:
sets the processor affinity of a running process
sets the processor affinity when launching a new command
retrieves the processor affinity of a running process
As such, taskset is a low-level mechanism for managing processor affinity.
Practically speaking, the typical usages of the taskset command are the following:
In the context of an MPI parallel execution, the taskset command must be integrated
into a user-defined script that will be responsible for performing an automatic mapping
between a given MPI rank and a unique processor ID for each process instance.
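Typical invocations (the core numbers and PID are only illustrative) launch a command bound to cores 0-3, rebind a running process, and query its affinity:
$ taskset -c 0-3 ./myapp
$ taskset -p -c 8 12345
$ taskset -p 12345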
3.4.2 numactl
numactl is the Linux system command that allows processes to run with a specific
NUMA scheduling or memory placement policy. Its coverage is broader than that of
the taskset system command, as it also manages the memory placement for a process.
In the context of an MPI parallel execution, as is the case with the taskset command,
the numactl command must be integrated into a user-defined script that is to be
responsible for performing an automatic mapping between a given MPI rank and a
unique processor ID for each process instance.
GNU
The environment variable GOMP_CPU_AFFINITY is used to specify an explicit binding
for OpenMP threads.
Intel
The Intel compilers OpenMP runtime library provides the Intel OpenMP Thread Affinity
Interface, which is made up of three levels:
1. High-level affinity interface
this interface is entirely controlled by one single environment variable
(KMP_AFFINITY), which is used to determine the machine topology and to assign
OpenMP threads to the processors based upon their physical location in the
machine.
KMP_AFFINITY=[<modifier>,]<type>
where:
<modifier>
o proclist= {<proc-list>}
Specify a list of processor IDs for explicit binding.
o granularity= {core [default] | thread}
Specify the lowest levels that OpenMP threads are allowed to float within a
topology map.
o verbose | noverbose
<type>
o none [default]
Do not bind OpenMP threads to particular thread contexts. Specify
KMP_AFFINITY= [verbose, none] to list a machine topology map.
o compact
Assign the OpenMP thread <n>+1 to a free thread context as close as
possible to the thread context where the <n> OpenMP thread was placed.
o disabled
Completely disable the thread affinity interfaces.
o explicit
Assign OpenMP threads to a list of processor IDs that have been explicitly
specified by using the proclist modifier.
o scatter
Distribute the threads as evenly as possible across the entire system
(opposite of compact).
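For example, the compact and scatter types can be requested as follows:
export KMP_AFFINITY="granularity=fine,compact"
export KMP_AFFINITY="verbose,scatter"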
In summary, the OpenMP thread binding options to use depend on which compiler was used
to build the executable.
3.4.4 LoadLeveler
When using both LoadLeveler as a workload scheduler and Parallel Environment as MPI
library, processor affinity can be requested directly at LoadLeveler level through the
task_affinity keyword.
This keyword has the same syntax as the Parallel Environment MP_TASK_AFFINITY
environment variable.
These settings are low level hardware configuration details that are not normally visible
from within Linux. Modification of these controls can be accomplished during system
initialization by entering the UEFI control interface or by using the ASU utility presented in
Section 1.2.3.1. To see what prefetch controls can be modified, run the following
command:
# asu64 show all | grep Prefetch
Processors.HardwarePrefetcher=Enable
Processors.AdjacentCachePrefetch=Enable
Processors.DCUStreamerPrefetcher=Disable
Processors.DCUIPPrefetcher=Enable
Each of these controls can be modified using the following ASU syntax:
# asu64 set <property> <Enable|Disable>
For example:
# asu64 set Processors.HardwarePrefetcher Disable
To activate changes to the prefetch controls a system reboot is required. Since these
properties are not normally visible from Linux, verifying the current settings requires the
use of ASU.
It is recommended to ignore the first sample of data from many of these utilities since that
data point represents an average of all data collected since the system was booted
instead of the previous interval.
3.7.1 Top
The top utility is a universally available Linux utility for monitoring the state of individual
processes and the system as a whole. It is primarily a tool used to focus on CPU
utilization and memory consumption but it does expose additional information such as
scheduler priorities and page fault statistics.
Typical output:
# top
top - 15:09:46 up 55 min, 6 users, load average: 0.04, 0.01, 0.00
Tasks: 397 total, 1 running, 396 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si,
0.0%st
Mem: 24596592k total, 602736k used, 23993856k free, 24112k buffers
Swap: 26836984k total, 0k used, 26836984k free, 135456k cached
3.7.2 vmstat
vmstat is a tool that is widely available across various Unix like operating systems. On
Linux it can display disk and memory statistics, CPU utilization, interrupt rate, and
process scheduler information in compact form.
The first sample of data from vmstat should be ignored since it represents data
collected since the system booted, not during the previous interval.
Typical output:
# vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 718196 209928 443204 0 0 0 1 0 5 1 0 99 0 0
0 0 0 718188 209928 443204 0 0 0 2 179 172 3 0 97 0 0
0 0 0 718684 209928 443204 0 0 0 51 185 161 4 0 96 0 0
3.7.3 iostat
The iostat tool provides detailed input/output statistics for block devices as well as
system level CPU utilization. Using the extended statistics option (-x) and displaying the
data in kilobytes per second (-k) instead of blocks per second are recommended options
for this utility.
The first sample of data from iostat should be ignored since it represents data
collected since the system booted, not during the previous interval.
Typical output:
# iostat -xk 5
Linux 2.6.32-220.el6.x86_64 (host) 03/19/2012 _x86_64_ (2 CPU)
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.01 0.19 0.01 0.48 0.14 2.38 10.34 0.00 0.95 0.54 0.03
dm-0 0.00 0.00 0.01 0.60 0.14 2.38 8.28 0.00 1.37 0.44 0.03
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 8.00 0.00 1.03 0.84 0.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.60 0.00 0.20 0.00 3.20 32.00 0.00 3.00 3.00 0.06
dm-0 0.00 0.00 0.00 0.80 0.00 3.20 8.00 0.00 3.00 0.75 0.06
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
3.7.4 mpstat
mpstat is a utility for displaying detailed CPU utilization statistics for the entire system or
individual processors. It is also capable of displaying detailed interrupt statistics if
requested. Monitoring the CPU utilization of individual processors is done by specifying
the "-P ALL" parameter and can be useful when processor affinity is in use.
The mpstat utility waits for the specified interval before printing a sample rather than
initially presenting data since the system was booted. This means that no samples need
to be ignored when using mpstat to monitor the system.
Typical output:
# mpstat -P ALL 5
Linux 2.6.32-220.el6.x86_64 (host) 03/19/2012 _x86_64_ (2 CPU)
03:56:22 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
03:56:27 PM all 0.10 0.00 0.10 0.00 0.00 0.00 0.00 0.00 99.80
03:56:27 PM 0 0.20 0.00 0.20 0.00 0.00 0.00 0.00 0.00 99.60
03:56:27 PM 1 0.00 0.00 0.20 0.00 0.00 0.00 0.00 0.00 99.80
03:56:27 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
03:56:32 PM all 2.34 0.00 0.71 0.00 0.00 0.00 0.00 0.00 96.95
03:56:32 PM 0 4.21 0.00 0.60 0.00 0.00 0.00 0.00 0.00 95.19
03:56:32 PM 1 0.41 0.00 0.62 0.00 0.00 0.00 0.00 0.00 98.97
4 MPI
This section is not a replacement for the Intel documentation. The Intel documents:
GettingStarted.pdf
Reference_Manual.pdf
can be found in the doc/ directory included with Intel MPI (/opt/intel/impi/<version>/doc).
The following steps are involved in using Intel MPI to run parallel applications:
Step 1: Compile and Link
Step 2: Selecting a network interface
Step 3: Running the application
4.1.1 Compiling
To compile a parallel application using Intel MPI, one needs to make sure that Intel MPI
is in the path. Intel provides scripts in the bin/ and bin64/ directories to accomplish this
task (mpivars.sh/mpivars.csh depending on the shell being used). In addition, Intel MPI
provides wrappers for C, C++ and Fortran compilers. Table 4-1 lists some of the
wrappers:
Table 4-1 Intel MPI wrappers for GNU and Intel compiler
mpicc Wrappers for GNU C compilers
mpicxx Wrappers for GNU C++ compiler
mpif77 Wrappers for g77 compiler
mpif90 Wrappers for gfortran compiler
mpiicc Wrappers for Intel C compiler (icc)
mpiicpc Wrappers for Intel C++ compiler (icpc)
mpiifort Wrappers for Intel Fortran compiler
(fortran77/fortran95)
To compile a C-program using the Intel C-compiler:
mpiicc -o myprog -O3 test.c
Before compiling using Intel compilers, make sure that the Intel compilers are in your
path. Intel provides scripts to accomplish this (bin/compilervars.sh or
/bin/compilervars.csh).
or
mpiifort -o myprog -O3 -openmp test.f (or test.f90)
for hybrid applications.
To make sure that the mpdboot ran successfully, one can query the nodes in the parallel
partition using
mpdtrace
which will list all of the nodes in the parallel partition. To force mpdboot to maintain the
same order as in the hostfile, one can add the --ordered flag to the mpdboot
command.
One can combine the functions of mpdboot and mpiexec by using mpirun instead.
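For example, a run over the hosts listed in a host file might look roughly like the following (the host file name and process count are illustrative):
$ mpirun -f mpd.hosts -n 64 ./myprog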
Intel MPI dynamically selects the appropriate fabric for communication among the MPI
processes. To select a specific fabric for communication, set the environmental variable
I_MPI_FABRICS. Table 4-2 provides some communication fabric settings.
1 0,1,2,3,4,5,6,7 (8,24)(9,25)(10,26)(11,27)(12,28)(13,29)(14,30)(15,31)
For example:
I_MPI_PIN_PROCESSOR_LIST='0,1,2,3'
pins 4 MPI tasks to logical processors 0,1,2 and 3. Setting the environmental variable
I_MPI_DEBUG=10
or higher gives additional information about binding done by Intel MPI.
For hybrid applications (OpenMP+MPI), the binding of MPI tasks is done to a domain as
opposed to a processor. This is done using I_MPI_PIN_DOMAIN so that all child threads of
a given MPI task will run in the same domain.
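A minimal sketch of such a hybrid launch (the process and thread counts, and the binary name, are illustrative):
$ export OMP_NUM_THREADS=8
$ export I_MPI_PIN_DOMAIN=omp
$ mpirun -np 4 ./hybrid_app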
Libraries built with the GCC and Intel compilers are provided for each implementation.
and the MPICH2 library. The PEMPI library is selected if the -mpilib ibm_pempi option
is used.
To select lockless mode in a PEMPI program initialized by MPI_Init(), one has to set
export MP_SINGLE_THREAD=yes
export MP_LOCKLESS=yes
in the environment. Note that these settings will not work with jobs that use one-sided
MPI or MPI-IO functions.
All of these optimizations are valid for any message size, but they have the greatest
impact on short messages.
The best latency is achieved with the environment setting MP_INSTANCES=1 (default).
Short messages are defined as those below the lower bound of MP_BULK_MIN_MSG_SIZE (4 KB).
The Maximum Transmission Unit (MTU) may impact latency. Setting
MP_FIFO_MTU=2K is typically better for messages under 4KB.
At large scale, a job may create a large number of RC connections, which consume memory in
proportion to their number. The default number of connections can be overridden
by setting MP_RC_MAX_QP.
When the number of MPI tasks sharing a node is small, setting MP_INSTANCES to a
value larger than one may help to improve bandwidth. Increasing the number of
MP_INSTANCES will increase the number of RC connections accordingly.
By default, messages above 16KB are transmitted in RDMA mode (qualifying them as
large messages). This is the crossover point between FIFO and RDMA modes. It can be
overridden by setting MP_BULK_MIN_MSG_SIZE.
On the send side, short messages are copied to a retransmission buffer. This allows MPI
to return control to the user program while a message is in transit. The default buffer size can be
overridden by setting MP_REXMIT_BUF_SIZE.
As implied by sections 4.2.3 and 4.2.5, intermediate-size messages are between 4KB
and the RDMA crossover point (default 16KB, as set by MP_BULK_MIN_MSG_SIZE).
For a small task count per node, FIFO mode is less efficient than RDMA.
The list of supported collective operations includes:
MPI_Bcast
MPI_Allgather
MPI_Allgatherv
The list of supported data types includes:
All data types for C language bindings, except MPI_LONG_DOUBLE
All data types for C reduction functions (C reduction types).
The following data types for FORTRAN language bindings: MPI_INTEGER,
MPI_INTEGER2, MPI_INTEGER4, MPI_INTEGER8, MPI_REAL, MPI_REAL4
and MPI_REAL8
FCA does not support data types for FORTRAN reduction functions (FORTRAN
reduction types).
By default, collective offload is turned off. To enable it, the environment variable
MP_COLLECTIVE_OFFLOAD=[all | yes] must be set. Setting
MP_COLLECTIVE_OFFLOAD=[none| no] disables collective offload. Once enabled, the
FCA collective algorithm will be the first one MPICH2 will try. If the FCA algorithm cannot
run at this time, a default MPICH2 algorithm will be executed. The FCA algorithm may not
be available if a node has no FCA software installed, does not have a valid license, the
FCA daemon is not running on the network, etc. FCA support is limited to 2K MPI
communicators per network.
MP_EUIDEVELOP=min
When set to min, selects the optimized MPICH2 library. The optimized library helps to
reduce latency. The standard MPICH2 library is selected by default. (Note that the
PEMPI library will skip some parameter checking when MP_EUIDEVELOP=min is used.)
MP_SINGLE_THREAD=[yes|no]
Avoids some PAMI locking overhead and improves short message latency when set to
yes. MP_SINGLE_THREAD=yes is valid only for user programs which make MPI calls
from a single thread. For multithreaded processes with threads making concurrent MPI
calls, setting MP_SINGLE_THREAD=yes will cause inconsistent results. The default
value is no.
MP_SHARED_MEMORY=[yes|no]
Specifies the use of shared memory for intranode communications rather than the network.
The default value is yes. In a few cases disabling shared memory improves
performance.
MP_SHMEM_PT2PT=[yes|no]
Specifies whether intranode point-to-point MPICH2 communication should use optimized,
shared-memory protocols. Allowable values are yes and no. The default value is
yes.
MP_EAGER_LIMIT=<INTEGER>
Changes the message size threshold above which rendezvous protocol is used. This
environment variable may be useful in reducing the latency for medium-size messages.
Larger values increase the memory overhead.
MP_REXMIT_BUF_SIZE=<INTEGER>
Specifies the size of a local retransmission buffer (send side). The recommended value is
the size of MP_EAGER_LIMIT plus 1K. It may help to reduce the latency of medium-size
messages. Larger values increase memory overhead.
MP_FIFO_MTU=[2K|4K]
If a chassis MTU on the InfiniBand switch is 4K, the environment variable can be set to
4K. This will improve bandwidth for medium and large messages if a job is running in
FIFO mode (MP_USE_BULK_XFER=no). It may have a negative impact on the latency
of messages below 4K. The default value is 2K.
MP_RDMA_MTU=[2K|4K]
If a chassis MTU on the InfiniBand switch is 4K, the environment variable can be set to
4K. This may improve bandwidth for medium and large messages if a job is running in
RDMA mode (MP_USE_BULK_XFER=yes). The default value is 2K.
MP_PULSE=<INTEGER>
The interval (in seconds) at which POE checks the remote nodes to ensure that they are
communicating with the home node. Setting to 0 reduces jitter. The default value is
600.
MP_INSTANCES=<INTEGER>
The number of instances corresponds to the number of IB Queue Pairs (QP) over which
a single MPI task can stripe. Striping over multiple QPs improves network bandwidth in
RDMA mode when a single instance does not saturate the link bandwidth. The default is
one, which is usually sufficient when there are multiple MPI tasks per node.
MP_USE_BULK_XFER=[yes|no]
Enables bulk message transfer (RDMA mode). RDMA mode requires RC connections
between each pair of communicating tasks which takes memory resources. The value
no will turn on FIFO mode which is scalable. In some cases FIFO bandwidth
outperforms RDMA bandwidth due to reduced contention in the switch. The default value
is yes.
MP_BULK_MIN_MSG_SIZE=<INTEGER>
Sets the minimum message length for bulk transfer (RDMA mode). A valid range is from
4096 to 2147483647 (INT_MAX). Note, that for PEMPI, MP_EAGER_LIMIT value takes
precedence if it is larger. MPICH2 ignores the value of MP_BULK_MIN_MSG_SIZE.
This environment variable can help optimize the crossover point between FIFO and
RDMA modes.
MP_RC_MAX_QP=<INTEGER>
Specifies the maximum number of Reliable Connected Queue Pairs (RC QPs) that can
be created. The purpose of MP_RC_MAX_QP is to limit the amount of memory that is
consumed by RC QPs. This is recommended for applications which are close to or
exceed the memory limit.
MP_RFIFO_SIZE=<INTEGER>
The default size of the receive FIFO used by each MPI task is 4MB. Larger jobs are
recommended to use the maximum size receive FIFO (16MB) by setting
MP_RFIFO_SIZE=16777216.
MP_BUFFER_MEM=<INTEGER>
Specifies the size of the Early Arrival buffer that is used by the communication subsystem
to buffer eager messages arriving before a matching receive is posted. Setting
MP_BUFFER_MEM can address the message "MPCI_MSG: ATTENTION: Due to memory limitation
eager limit is reduced to X". MP_BUFFER_MEM applies to PEMPI only.
MP_POLLING_INTERVAL=<INTEGER>
This defines the interval in microseconds at which the LAPI timer thread runs. Setting the
polling interval equal to 800000 defines an 800 millisecond timer. The default is 400
milliseconds.
MP_RETRANSMIT_INTERVAL=<INTEGER>
PAMI will retransmit packets if an acknowledgement is not received in time.
Retransmissions are costly and often unnecessary, generating duplicate packets. Setting
a higher value will allow PAMI to tolerate larger delays before the retransmission logic
kicks in.
available for both Intel and GNU OpenMP implementations. Since PE does not
know the OpenMP implementation in use (Intel or GNU), PE has to set both
GOMP_CPU_AFFINITY and KMP_AFFINITY for each task.
For non-OpenMP jobs - using MP_TASK_AFFINITY=cpu, core, primary, or
mcm - POE will examine the x86 device tree and determine the cpusets to which
the tasks will be attached, using system level affinity API calls of
sched_setaffinity.
Adapter affinity (MP_TASK_AFFINITY=sni) is not supported on x86 systems.
When OMP_NUM_THREADS is not exported, POE will use the value of n in the
MP_TASK_AFFINITY = core:n, cpu:n, or primary:n as the number of parallel threads.
For Intel, PE will set the KMP_AFFINITY variable, with a list of CPUs in the proclist
sub-option value. Note that POE has to allow for a user-specified KMP_AFFINITY variable,
appending the list of CPUs to any existing options. If the user has already specified a proclist
sub-option value, POE will override the user-specified value, while displaying an
appropriate message. If MP_INFOLEVEL is set to 4 or higher, POE will also add the
verbose option to KMP_AFFINITY. An example of the KMP_AFFINITY format POE will
set (for MP_TASK_AFFINITY=cpu:4) is: KMP_AFFINITY=proclist=[3,2,1,0],explicit .
When MP_BINDPROC = yes is specified, POE will bind/attach the tasks based on the
list of CPUs in the KMP_AFFINITY and GOMP_CPU_AFFINITY values.
island_count specifies the minimum and maximum number of islands to select for this job step. A value
of -1 represents all islands in the cluster.
If island_count is not specified, all machines will be selected from a common island.
The llstatus and llrstatus commands will be enhanced to show which island contains
the machine or machine_group.
A script that will launch a job built with an Open MPI version older than 1.5.4 is given below:
#!/bin/ksh
# LoadLeveler JCF file for running an Open MPI job
# @ job_type = MPICH
# @ node = 4
# @ tasks_per_node = 2
# @ output = ompi_test.$(cluster).$(process).out
# @ error = ompi_test.$(cluster).$(process).err
# @ queue
export LD_LIBRARY_PATH=/opt/openmpi/lib
/opt/openmpi/bin/mpirun --leave-session-attached --mca plm_rsh_agent "llspawn.stdio : ssh" \
-n 8 -machinefile $LOADL_HOSTFILE mpi_hello_sleep_openmpi
mcm_distribute tells the central manager to distribute tasks across all available MCMs
on a machine.
When CORE(n) or CPU(n) is specified, the central manager will assign n physical cores
or n logical CPUs to each job task. (Note that a physical core can have multiple logical
CPUs.)
cpus_per_core specifies the number of logical CPUs per processor core that should
be allocated to each task of a job with the processor-core affinity requirement
(#@task_affinity = CORE). This requirement can be only satisfied by nodes configured in
SMT mode.
parallel_threads=m will bind m OpenMP threads to the CPUs selected for the task by
task_affinity = CPU(n) keyword, where m <= n. If task_affinity = CORE(n) is specified, m
OpenMP threads will be bound to m CPUs, one CPU per core.
5.1.1 Ulimit
The ulimit command controls user limits on jobs run interactively. Large HPC jobs often
use more hardware resources than are given to users by default. These limits are put in
place to make sure that program bugs and user errors don't make the system unusable
or crash. However, for benchmarking and workload characterization, it is best to set all
resource limits to unlimited (or the highest allowed value) before running. This is
especially true for resources like stack, memory and CPU time.
ulimit -s unlimited
ulimit -m unlimited
ulimit -t unlimited
For a description of the events available to count see chapter 19 in volume 3b of the Intel
architecture manual.
For a guide on how to use these counters for analysis see appendix B.3 of the Intel
Software Optimization guide. The general optimization material in the rest of the
document is also recommended.
There is a good paper on bottleneck analysis using performance counters on the x86
5500 series processors, much of which is applicable to Sandy Bridge processors. It can
be found here.
If the perf command returns its help message, it is already installed; if not, it has to be
installed. The installation varies by distribution; to install on RHEL (as root) enter
$ yum install perf
Install on SLES (as root) with the command
$ zypper install perf
Also make sure that, if a new kernel is installed, perf gets updated to match the kernel.
To use perf for collecting performance counters, the perf list and perf stat
subcommands are used. See 'perf help' for more information on the subcommands
available for perf. A tutorial is available here. For more information on any command
enter perf help COMMAND.
Using the latest available version of perf is strongly recommended. Later versions
provide more features.
The profiling aspect of perf will be covered in a later section. perf list is used to show the
events available for the hardware being used. Any of the events can be appended with a
colon followed by one or more modifiers that further qualify what will be counted. The
modifiers are shown in the table below. Raw events can be modified with the same
syntax as the symbolic events provided by perf list.
5.2.1.1 Example 1
The first example demonstrates how to collect counts on built-in events with the
command
$ perf stat
Events will be counted for the standard benchmark program stream. Download stream.c
from here.
A typical output is
Performance counter stats for 'numactl --physcpubind 2 ./stream':
5,906,710,542 L1-dcache-loads
1,107,616,593 L1-dcache-load-misses # 18.75% of all L1-dcache hits
18,073,237,009 cycles # 0.000 GHz
20,100,714,663 instructions # 1.11 insns per cycle
There are a couple of reasons that would cause problems with getting valid counter data
from perf.
The first is if the oprofile daemon is running. If it is, running perf will give the following
error:
Error: open_counter returned with 16 (Device or resource busy).
/bin/dmesg may provide additional information.
5.2.1.2 Example 2
The next example includes the syntax for a raw event. UOPS_ISSUED.ANY is collected in
addition to the counters above. Section 19.3 of the architecture manual 3b provides the
umask 01 and the event number 0x0e. The raw code concatenates the two - 010e.
An alternative way to get the mask and event code is to use libpfm4. Install libpfm4 and
go into the examples directory. The utility showevtinfo will give the event codes and
umasks for the current processor.
24,228,890,952 r10e
5,906,666,765 L1-dcache-loads
1,107,547,809 L1-dcache-load-misses # 18.75% of all L1-dcache hits
18,063,959,547 cycles # 0.000 GHz
20,100,581,873 instructions # 1.11 insns per cycle
Notice the output gives the raw code instead of the event name. Using libpfm4 and a
couple of scripts provides the translation from a raw code to a symbolic name. The two
scripts are shown below.
Code Listing 1 get_event_dict.awk
{
if ( $1 == "PMU") {
if (mfound == 0) {
if ((type == "wsm_dp") || (type == "ix86arch"))
printf(" \'0x%s\':\'%s\',\n", code1, name);
}
mfound = 0;
type = $4
}
if ( $1 == "Name") {
name = $3;
}
if ( $1 == "Code") {
code1 = substr($3,3);
code = code1;
if (length(code1) == 1) {
code = "0" code1
}
}
if ( substr($1,1,5) == "Umask" ) {
mfound = 1;
mask=substr($3,3);
if(index(mask,"0") == 1)
mask = substr(mask,2)
qual=substr($7,2);
sub(/\]/,"",qual);
if ((type == "wsm_dp") || (type == "ix86arch"))
printf " \'0x%s%s\':\'%s.%s\',\n", mask, code, name, qual;
}
}
Code Listing 2 get_events.py
# get_events.py
import sys

# event_names maps raw event codes (for example '0x010e') to symbolic names.
# Populate it with the dictionary entries written to evt_dict by get_event_dict.awk.
event_names = {}

def loadData(infile):
file = open(infile, 'r')
for line in file:
if not line.strip():
continue
line_data = line.split()
if line_data[0].isdigit():
if (len(line_data) > 1):
if line_data[1] == "raw":
if line_data[2] in event_names:
name = event_names[line_data[2]]
else:
print "could not find", line_data[2]
else:
name = line_data[1]
else:
name = "Group"
print line_data[0], ", ", name
file.close()
def main():
if (len(sys.argv)) < 2:
print "Usage:\npython get_events.py infile"
exit()
pass
data = loadData(sys.argv[1])
# munge the data around
return
main()
To translate the raw code, the counter output must be saved to a file, e.g.
$HOME/counter_output_filename. Then the scripts should be copied (as root) to the
examples/ directory in libpfm4. From the examples/ directory issue:
./showevtinfo | awk -f get_event_dict.awk > evt_dict (as root)
python get_events.py $HOME/counter_output_filename > $HOME/counters.csv
counters.csv can now be loaded into a spreadsheet and will provide the counts and
symbolic names.
SYNOPSIS
perf stat [-e <EVENT> | --event=EVENT] [-S] [-a] <command>
perf stat [-e <EVENT> | --event=EVENT] [-S] [-a] <command> [<options>]
DESCRIPTION
This command runs a command and gathers performance counter statistics from it.
OPTIONS
<command>...
Any command you can specify in a shell.
-e, --event=
Select the PMU event. Selection can be a symbolic event name (use perf list to list
all events) or a raw PMU event (eventsel+umask) in the form of rNNN where NNN is a
hexadecimal event descriptor.
-i, --inherit
child tasks inherit counters
-p, --pid=<pid>
stat events on existing pid
-a
system-wide collection
-c
scale counter values
The -e option is used to specify the events to count. The Sandy Bridge processor
supports 4 programmable and 3 fixed counters. The fixed events are
UNHALTED_CORE_CYCLES (cycles), INSTRUCTION_RETIRED (instructions) and
UNHALTED_REFERENCE_CYCLES. If more than 4 events requiring the programmable
counters are specified, perf will multiplex the counters. The -c (--scale) option will
normalize the multiplexed counts to the entire collection period and will provide an
indication of how much of this time period was spent collecting each of the counters.
Counters need to be collected for a long enough period to get good samples to represent
the entire application. The time required to get a good sample will vary depending on how
steady the application is. The best way to ensure that the samples are large enough is to
run with two different collection periods. (For example, run the benchmark twice as long
the second time through.) The sample period is long enough if the counts are
proportionately similar. Keep in mind that multiplexing influences the total sample time for
each event and it must be taken into account in future collections for that application.
5.2.1.4 Example 3
An example using multiplexed event counters is:
numactl --physcpubind 2 perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetches,L1-dcache-prefetch-misses ./stream_gcc
The output reports a total runtime of 4.74 seconds. The percentage in the square
brackets shows what percentage of 4.74 seconds was used while collecting for that
particular counter. The perf output automatically adjusts for multiplexing so that the count
displayed represents the actual count divided by the percentage collected.
To collect events for a job that is already running (under numactl binding control) or for a
certain process of a job, perf can be used with the -p and -a options while running the
sleep command. If the job is multitasked or multithreaded, using the -i flag collects event
counts for all child processes.
perf stat -p <pid> -a -i sleep <data collection interval>
Attaching to an already running job is the technique to use if
the job runs for a long time and event data can be collected for a shorter time
(the length of the sleep)
the job has a warm up period that needs to be excluded from the event counts.
a specific job phase has to be reached before collecting event counts.
These are the basics of collecting counter data. The question now becomes which events
should be collected for analysis? Start out with the basic events needed for CPI,
instruction mix, cache miss rates, ROB and RAT stalls, and branch prediction metrics.
For programs that use large arrays, data on L1TLB and L2TLB misses is also useful.
Some other things to consider are the use of vector instructions and the effectiveness of
prefetching. What is needed will depend on what performance issues need to be
understood.
A high level overview of PAPI is here. The examples/ and ctests/ subdirectories in the
PAPI tree have additional useful information. The instrumentation examples in this
document use the low-level API which is documented here. There is no support for this
library, so, as mentioned in the previous section, if there are problems collecting
performance data, look first at conflicts with oprofile or the nmi_watchdog_timer.
The above examples directly call the code to be tested. However, the recommended
method to use is to create a custom library for local use that has the functions
papi_init(), thread_init(), start_counters(), stop_counters(), restart_counters(), and
print_counters(). An associated header file (called papi_interface.h below) including the
declarations for these custom functions is also needed. Using these functions will
minimize the changes needed to instrument the code. Include the .h file in the
instrumented code and add the above functions as needed. Using #ifdefs isolates the
PAPI changes from the rest of the code to make testing more convenient.
The example below illustrates how to alter code so that it is instrumented for collecting
performance counts for my_func2(), not my_func1() or my_func3().
#ifdef PAPI_LOCAL
#include "papi_interface.h"
#endif
int main ( ) {
#ifdef PAPI_LOCAL
papi_init();
thread_init();
#endif
my_func1();
#ifdef PAPI_LOCAL
start_counters();
#endif
my_func2();
#ifdef PAPI_LOCAL
stop_counters();
print_counters();
#endif
my_func3();
}
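A possible build line for the instrumented code (the install path and the papi_interface.c helper are assumptions specific to this sketch; -lpapi links the PAPI library):
$ gcc -O2 -DPAPI_LOCAL -I/usr/local/papi/include -c myprog.c papi_interface.c
$ gcc -o myprog myprog.o papi_interface.o -L/usr/local/papi/lib -lpapi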
Several tools are available for profiling. The most frequently used is gprof, but perf and
oprofile are also used.
To get gprof-compatible output, first binaries need to be compiled and created with the
added -pg option (additional options like optimization level, -On, can also be added):
$ gcc -pg -o myprog.exe myprog.c
or
$ gfortran -pg -o myprog.exe myprog.f
When the program is executed, a gmon.out file is generated (or, for a parallel job,
several gmon.<#>.out files are generated, one per task). To get the human-readable
profile, run:
$ gprof myprog.exe gmon.out > myprog.gprof
or
$ gprof myprog.exe gmon.*.out > myprog.gprof
The first part of an example output from gprof is:
Flat profile:
In the above profile, the function rand_read accounts for 69% of the time, even though it
is only called once. The function get_block_index is called almost 17 million times, but
only accounts for 5% of the time. The obvious routine to focus on for optimization is the
function rand_read.
granularity: each sample hit covers 2 byte(s) for 0.32% of 3.16 seconds
-----------------------------------------------
2.17 0.00 1/1 run_concurrent [2]
[3] 68.7 2.17 0.00 1 rand_read [3]
-----------------------------------------------
0.50 0.16 1/1 main [1]
[4] 20.9 0.50 0.16 1 gen_indices [4]
0.16 0.00 16777216/16777216 get_block_index [5]
-----------------------------------------------
The number of times each line of code is executed can also be determined. This is
done using the gcov tool. See the documentation here for more details.
5.3.2 Microprofiling
Microprofiling is defined as charging counter events to instructions (in contrast to event
counts for an entire program as discussed in Section 5.2.1, or by function as discussed in
Section 5.3.1). This is typically done with a sampling-based profile. Sampling-based
profiling uses the overflow bit out of a counter to generate an interrupt and capture an
instruction address. The profiling tool can be set up to interrupt after a specified number
of cycles. Based on the number of times an instruction address shows up versus the total
number of samples, the instruction is assigned that percentage of the total number of
event occurrences.
There are three main tools used for microprofiling: vtune, perf and oprofile. All can use
cycles (time) or another counter event to do profiling. Only perf and oprofile will be
covered in this document since vtune requires a license to run.
perf uses cycles as its trigger event by default. This provides a list of instructions where
the program is spending time. To sample the program with the cycles event, enter
perf record [prog_name] [prog_args]
perf outputs some statistics and a file called perf.data. One key point is that the perf
command has to be bound to a CPU to get reproducible results.
perf annotate can be used to see where the program is spending its time. To get more
detail on the rand_read function, enter
perf annotate rand_read
The output is:
: /* j gets set to a random index in rarray */
: j = indices[i];
0.05 : 401209: 8b 45 f8 mov -0x8(%rbp),%eax
0.00 : 40120c: 48 98 cltq
0.00 : 40120e: 48 c1 e0 02 shl $0x2,%rax
1.75 : 401212: 48 03 45 e0 add -0x20(%rbp),%rax
0.00 : 401216: 8b 00 mov (%rax),%eax
0.09 : 401218: 89 45 f0 mov %eax,-0x10(%rbp)
: k += rarray[j];
0.05 : 40121b: 8b 45 f0 mov -0x10(%rbp),%eax
1.85 : 40121e: 48 98 cltq
0.00 : 401220: 48 c1 e0 03 shl $0x3,%rax
0.00 : 401224: 48 03 45 e8 add -0x18(%rbp),%rax
0.00 : 401228: 48 8b 00 mov (%rax),%rax
87.86 : 40122b: 89 c2 mov %eax,%edx
2.03 : 40122d: 8b 45 f4 mov -0xc(%rbp),%eax
0.09 : 401230: 01 d0 add %edx,%eax
1.57 : 401232: 89 45 f4 mov %eax,-0xc(%rbp)
The line
k += rarray[j];
is taking most of the time, with the assembly instruction
40122b: 89 c2 mov %eax,%edx
getting assigned 88% of the total time.
The -e option on perf record allows other events besides cycles to be used. This is
useful to figure out which specific lines of code are strongly associated with events like
cache-misses. Call chain data is output by using the perf record -g option followed by perf
report.
With higher levels of compiler optimization, the compiler can inline functions and reorder
instructions.
A typical sequence to use for gathering an event profile with opcontrol/opreport is:
$ opcontrol --deinit
$ opcontrol --init
$ opcontrol --reset
$ opcontrol --image all
$ opcontrol --separate none
$ opcontrol --start-daemon --event=CPU_CLK_UNHALTED:100000
--event=INST_RETIRED:100000
$ opcontrol --start
$ [command_name] [command_args]
$ opcontrol --dump
$ opcontrol -h
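A report can then be generated with opreport; for example, a per-symbol breakdown:
$ opreport --symbols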
For instance, when compiling the program petest, the commands to use are
$ gcc -c -g -I/opt/ibmhpc/ppedev.hpct/include petest.c
$ gcc -o petest petest.o -g -L/opt/ibmhpc/ppedev.hpct/lib64 -lhpc
Before running the application, the HPM_EVENT_SET environment variable has to be
set to the correct hardware counter group. The hpccount -l command lists available
groups.
It also provides a time-based trace view which shows a trace of MPI function calls in the
application. This trace can be used to examine MPI communication patterns. Individual
trace events provide information about the time spent in that MPI function call and, for
communication calls, the number of bytes transferred for that trace event.
To use the MPI profiling and trace tool, the application must be relinked with the profiling
and trace library. A programming API is provided so that an application can be
instrumented to selectively trace a subset of MPI function calls.
When compiling an application, it should be compiled with the -g flag. When linking the
application, link with the libmpitrace library.
For instance, when compiling the program petest, the commands to use are
$ gcc -c -g petest.c
$ gcc -o petest petest.o -g -L/opt/ibmhpc/ppedev.hpct/lib64 -lmpitrace
It also provides a time-based trace view which shows more detailed information about the
I/O system calls.
To use the I/O profiling and trace tool, you must re-link your application with the profiling
and trace library.
When you compile your application, it should be compiled with the -g flag. When you link
your application, you must link with the libtkio library.
Current documentation for the HPC Toolkit can be found on the HPC Central website.
The documentation web page is here. Click the Attachments tab and download the
latest version of the documentation.
The HPC Toolkit is part of the IBM Parallel Environment (PE) Developer Edition product.
PE Developer Edition is an Eclipse-based IDE that you can use to edit, compile, debug
and run your application. PE Developer Edition also contains a plug-in for HPC Toolkit
that is integrated with the rest of the developer environment and provides the same
viewing capabilities as the X11-based viewer that is part of HPC Toolkit. Since the plug-in
for HPC Toolkit is integrated with Eclipse, an instrumented application can be run from
within the Eclipse IDE to obtain performance measurements.
Current documentation for the IDE contained in PE Developer Edition can be found here.
The HPC Toolkit is installed if the ppedev runtime RPM is present (rpm -qi
ppedev_runtime) on all of the nodes in the cluster and the ppedev_hpct RPM is
installed on the login nodes in the cluster.
When using the HPC Toolkit to analyze the performance of parallel applications, the IBM
Parallel Environment Runtime Edition product must be installed on all of the nodes of the
cluster.
PE Developer Edition, including HPC Toolkit, is supported on Red Hat 6 and SLES 11
SP1 x86 based Linux systems.
The Eclipse IDE environment provided by PE Developer Edition requires that either
Eclipse 3.7 SR2 (Indigo) is already installed or that the version of Eclipse 3.7 SR2
packaged with PE Developer Edition is installed. Also, for Windows- and Linux-based
systems, the IBM Java version 6 packaged with PE Developer Edition must be installed.
For Mac users, the version of Java provided by default is all that is required.
This version brings several improvements to the original version, including in particular:
The possibility to run the benchmark in hybrid mode (MPI + OpenMP).
The "As You Go" feature which reports information on achieved performance
level throughout the whole run. The feature evaluates the intrinsic quality of an
execution configuration without the need to wait for the end of the execution.
It is crucial to determine an optimal set of parameters in order to reach the best balance
between:
Computation to communication ratio.
Load unbalance between the computation cores.
The optimal configuration is basically established by using a guess-and-check
methodology (in which the As You Go feature is extremely useful).
The following hints might help though:
The matrix size (N) must be the largest possible with respect to computation
nodes memory size.
The optimal block size (NB) is said to be 160 or 168 when running with Intel
MKL.
A slightly rectangular shape (P = Q x 0.6) might prove to be optimal, but this
highly depends on the platform architecture (including interconnect).
Other input settings are considered as having a very limited impact on the overall
performance. The following set of parameters can be taken as-is:
16.0 threshold
1 # of panel fact
0 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
0 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
256 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
0 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
Compilation
Most of the LINPACK performance depends on the selected mathematical library. As such,
compilation options do not play a significant role in the overall performance.
CPU Speed
CPU Speed is the internal system mechanism that allows the Turbo Mode capability to
be exploited. It needs to be properly configured to set the CPU frequency to the
maximum allowed and to prevent this frequency from being reduced.
Measured Performance
Figure 6-1 presents the measured Linpack compute rate and the peak performance for
different numbers of nodes (1 to 32) using pure-MPI interprocess communication.
Figure 6-1 Comparing actual Linpack and system peak performance (GFlops) for different numbers of nodes
# Nodes              1       2       4       8       16       32
Linpack (GFlops)   294.9   584.3  1167.0  2305.0   4584.0   8966.0
Peak (GFlops)      332.8   665.6  1331.2  2662.4   5324.8  10649.6
Figure 6-2 Comparing measured Linpack and system peak performance (PFlops) for large numbers of nodes
The chart plots actual and peak performance in Petaflops for 4096, 7168, and 9216 dx360 M4 nodes.
6.2 STREAM
STREAM is a simple synthetic benchmark program that measures memory bandwidth in
MB/s and the computation rate for simple vector kernels. It was developed by John
McCalpin while he was a professor at the University of Delaware. The benchmark is
specifically designed to work with large data sets, larger than the Last Level Cache
(LLC) on the target system, so that the results are indicative of very large vector-
oriented applications. It has emerged as a de facto industry standard benchmark. It is
available in both Fortran and C, in single-processor and multi-processor versions,
OpenMP- and MPI-parallel.
The STREAM results presented below were obtained by running the application in serial
mode on a single processor and in OpenMP mode. The four kernels are (see the sketch
after this list):
Copy: memory access to 2 double precision words (16 bytes) and no floating point
operations per iteration
Scale: memory access to 2 double precision words (16 bytes) and one floating point
operation per iteration
Add: memory access to 3 double precision words (24 bytes) and one floating point
operation per iteration
Triad: memory access to 3 double precision words (24 bytes) and two floating point
operations per iteration
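A minimal sketch of the four kernels is shown below. The array names follow the reference STREAM source, but the wrapper function and the choice of scalar value are assumptions of this sketch, not a replacement for the benchmark itself.

#define N 20000000              /* per McCalpin's rule for this system */
static double a[N], b[N], c[N];

void stream_kernels(void)
{
    double scalar = 3.0;
    long j;
    for (j = 0; j < N; j++) c[j] = a[j];                 /* Copy  */
    for (j = 0; j < N; j++) b[j] = scalar * c[j];        /* Scale */
    for (j = 0; j < N; j++) c[j] = a[j] + b[j];          /* Add   */
    for (j = 0; j < N; j++) a[j] = b[j] + scalar * c[j]; /* Triad */
}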
The general rule for STREAM is that each array must be at
least 4x the size of the sum of all the last-level caches used
in the run, or 1 Million elements -- whichever is larger.
To improve performance one can test different offsets, which have not been considered
here.
all: stream.exe
stream.exe: stream.c
$(CC) $(CFLAGS) stream.c -o stream.exe
clean:
rm -f stream.exe *.o
The array dimensions have been set, as per McCalpin's rule, to the minimum required
size of 20 million elements, or 160 MB per array.
Figure 6-3 Measured Bandwidth (MB/s) for single-core STREAM tests using GCC
The chart shows the measured single-core bandwidth (MB/s) for the copy, scale, add, and triad kernels compiled with GCC.
Following the recommendations given there, the following compile options are added:
-opt-streaming-stores always
-ffreestanding
all: stream.exe
stream.exe: stream.c
	$(CC) $(CFLAGS) stream.c -o stream.exe      # with icc and the options listed above
clean:
	rm -f stream.exe *.o
Running the binary built with the Intel C compiler results in:
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 20000000, Offset = 0
Total memory required = 457.8 MB.
...
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 11572.7884 0.0277 0.0277 0.0277
Scale: 6919.6523 0.0463 0.0462 0.0463
Add: 9040.9096 0.0531 0.0531 0.0531
Triad: 9121.3157 0.0527 0.0526 0.0527
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
The Intel Fortran compiler yields a result very similar to the Intel C compiler:
...
----------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 11581.2764 0.0277 0.0276 0.0277
Scale: 6921.3294 0.0463 0.0462 0.0463
Add: 9045.3372 0.0531 0.0531 0.0531
Triad: 9122.6796 0.0527 0.0526 0.0527
----------------------------------------------------
Solution Validates!
----------------------------------------------------
Figure 6-4 Measured Bandwidth (MB/s) for single-core STREAM tests using Intel icc
The chart shows the measured single-core bandwidth (MB/s) for the copy, scale, add, and triad kernels compiled with Intel icc.
Comparing the two sets of results shows the binaries produced by the GCC compiler are
faster:
Figure 6-5 Measured Bandwidth (MB/s) for single-core STREAM tests comparing Intel icc and GCC
The chart compares the single-core bandwidth (MB/s) of the GCC and icc binaries for the copy, scale, add, and triad kernels.
From a private communication with Andrey Semin from Intel, changing the compiler options (building the icc binary without streaming stores, shown as icc* below) changes this picture:
Figure 6-6 Measured Bandwidth (MB/s) for single-core STREAM tests comparing Intel icc without
streaming stores and GCC
The chart compares the single-core bandwidth (MB/s) of the GCC binary, the icc binary, and the icc binary built without streaming stores (icc*) for the copy, scale, add, and triad kernels.
A further chart plots the measured bandwidth (MB/s) of the copy, scale, add, and triad kernels at different CPU core frequencies (GHz).
# run 16 independent copies of STREAM, one pinned to each core
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
do
taskset -c $i ./stream.exe > thruput.$i &
done
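The per-kernel statistics can then be extracted from the output files with a small post-processing command; the sketch below (an illustration, using the file names produced by the loop above) collects the Triad figures:

grep '^Triad:' thruput.* | awk '{
    s += $2
    if (n == 0 || $2 < min) min = $2
    if ($2 > max) max = $2
    n++
}
END { printf "Triad  avg %.1f  min %.1f  max %.1f MB/s\n", s/n, min, max }'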
The following table gives the average, minimum and maximum over all 16 processes:
Figure 6-8 Memory Bandwidth (MB/s) over 16 cores GCC throughput benchmark
The chart shows the aggregate bandwidth (MB/s) over all 16 single-threaded STREAM copies (GCC binary) for the copy, scale, add, and triad kernels.
Table 6-5 Memory bandwidth (MB/s) over 16 cores OpenMP benchmark with icc 20M
Function Rate (MB/s) Avg time Min time Max time
copy 74055.2461 0.0043 0.0043 0.0044
scale 73511.7362 0.0044 0.0044 0.0044
add 76419.2796 0.0063 0.0063 0.0064
triad 75852.0805 0.0064 0.0063 0.0064
Figure 6-9 Memory bandwidth (MB/s) minimum number of sockets 16-way OpenMP benchmark
The chart shows the 16-way OpenMP bandwidth (MB/s) with icc for the copy, scale, add, and triad kernels (see Table 6-5).
Table 6-7 Memory bandwidth (MB/s) minimum number of sockets OpenMP benchmark with icc
Function \ Threads    16      8      4      2      1
Copy                75193  36546  26072  13628   6943
Scale               74052  35794  25675  13575   6962
Add                 77112  37716  32867  17701   9070
Triad               76229  37136  31558  17523   9034
Binding half the threads to one socket and the other half to the other socket improves
the picture, especially for 8 cores:
Table 6-8 Memory bandwidth (MB/s) split threads between two sockets OpenMP benchmark with icc
Function \ Threads      8      4      2
Copy                55375  29288  13662
Scale               54543  29220  13649
Add                 69763  37665  17752
Triad               67197  37245  17712
Figure 6-10 Memory bandwidth (MB/s) performance of 8 threads on 1 or 2 sockets
The chart compares the 8-thread OpenMP bandwidth (MB/s) with all threads on one socket versus split across two sockets for the copy, scale, add, and triad kernels.
So with STREAM:
16 cores get ~76 GB/s, or 4.75 GB/s per core
8 cores on a single socket get ~37 GB/s, or 4.63 GB/s per core
8 cores using both sockets get ~62 GB/s, or 7.7 GB/s per core
Figure 6-11 shows the saturation when splitting different numbers of OpenMP threads
between both sockets. It reveals that, in general, going from four to eight threads using
both sockets scales nicely whereas the memory bandwidth becomes saturated going
from 8 to 16 threads.
Figure 6-11 Memory bandwidth (MB/s) split threads between two sockets
The chart shows the bandwidth (MB/s) for 1, 2, 4, 8, and 16 OpenMP threads split between both sockets for the copy, scale, add, and triad kernels.
Table 6-10 Memory bandwidth (MB/s) divide threads between two sockets OpenMP benchmark
with gcc
Function 8 4 2
Copy 54450 43233 24226
Scale 54803 42520 24139
Add 61337 48836 26044
Triad 61721 49789 26501
For single-core performance, gcc is clearly ahead of the Intel executable, but for the
OpenMP version this picture reverses. Furthermore, 8 cores distributed over both sockets
can nearly exhaust the memory bandwidth. Binding is therefore absolutely necessary.
6.2.5.1 Stride
Memory bandwidth for 1 to 16 threads as a function of the array stride through memory:
The chart plots the bandwidth (MB/s) for 1, 2, 4, 8, and 16 threads at array strides from 2 to 40.
6.2.5.3 Indexed
There are two cases: loads via an index array and stores via an index array.
To achieve this, the STREAM code is modified to include an additional index array, where
index[i] = (ix + iy * i) % N
and the offset variable ix and the stride variable iy are read in at runtime.
All results have been generated from binaries created with the Intel compiler.
The results showed that the initial offset ix does not change the performance, so ix = 0 is
used for all runs.
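A minimal sketch of how the index array might be filled is shown below; the names ix, iy, index and N follow the formula above, but the initialization loop itself is an assumption of this sketch.

long j;
for (j = 0; j < N; j++)
    index[j] = (ix + iy * j) % N;   /* ix = offset, iy = stride */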
load case:
These runs measure the performance of indexed loads - for example, the triad case is
modified to become:
for (j=0; j<N; j++)
a[j] = b[index[j]]+scalar*c[index[j]];
OMP_NUM_THREADS=1
Table 6-18 Strided memory bandwidth (MB/s) with indexed loads 1 thread
stride copy scale add triad
1 5773 5776 10327 10137
2 5000 5011 6882 6735
3 4291 4284 5185 5049
4 3693 3694 4084 4061
5 3158 3158 3395 3408
6 2737 2738 2942 2955
7 2362 2363 2587 2581
8 2062 2062 2293 2280
9 1899 1898 2114 2091
10 1777 1777 1958 1937
OMP_NUM_THREADS=16
Table 6-19 Strided memory bandwidth (MB/s) with indexed loads 16 threads
stride copy scale add triad
1 52100 51908 49500 49438
2 44120 43960 36057 36097
3 31340 31368 28352 28313
4 33826 33868 31453 31356
5 23580 23583 19623 19613
6 16698 16712 14117 14104
7 16999 17006 13932 13933
8 26357 26378 24269 24554
9 12432 12431 10339 10345
10 11300 11302 9161 9193
store case:
These runs measure the performance of indexed stores; for example, the triad case becomes:
for (j=0; j<N; j++)
a[index[j]] = b[j]+scalar*c[j];
OMP_NUM_THREADS=1
Table 6-20 Strided memory bandwidth (MB/s) with indexed stores 1 thread
stride copy scale add triad
1 9340 9368 10148 10070
2 6107 5992 7486 7445
Table 6-21 Strided memory bandwidth (MB/s) with indexed stores 16 threads
stride copy scale add triad
1 41807 41915 49940 49967
2 24076 24023 34037 34050
3 19998 19963 27335 27359
4 22477 22465 31790 31830
5 12538 12534 17957 17953
6 9908 9904 14470 14477
7 9050 9056 13174 13177
8 16971 17608 24840 24853
9 7412 7414 10865 10867
10 6691 6693 9838 9836
Observations:
Remember: index[i] = (ix + iy * i) % N.
For the case ix = 0, iy = 1 we have the same memory access pattern as for the
standard STREAM, but, for example, the 16-thread triad store case reaches only
49967 / 76229 = 65.5% of the standard performance. Surprisingly, with
OMP_NUM_THREADS=1 the indexed version gives slightly better performance than
the standard STREAM (comparing results from the Intel compiler).
6.3 HPCC
The HPC Challenge Benchmark is composed of seven individual benchmarks combined
into a single executable program, limiting the tuning possibilities for any specific part. It is
made up of the following tests:
1. HPL (High Performance Linpack) measures the floating-point execution rate by
solving a linear system of equations.
2. PTRANS (Parallel Matrix Transpose) measures the network capacity by
requiring paired communications across all processors in parallel.
3. DGEMM (double-precision General Matrix Multiply) measures the floating-
point execution rate by using the corresponding matrix multiplication kernel
included in the BLAS library.
4. STREAM measures the sustainable memory bandwidth and the corresponding
computation rate for simple vector kernels.
5. RandomAccess measures the integer operation rate on random memory
locations.
6. FFT measures the floating-point execution rate by executing a one-dimensional
complex Discrete Fourier Transformation (DFT), which may be implemented
using an FFT library.
7. Communication: a combination of tests to measure network bandwidth and
latency by simulating various parallel communication patterns.
In the case of DGEMM, RandomAccess and FFT, three types of jobs are run:
1. single: one single thread
2. star: also known as embarrassingly parallel; parallel execution without inter-
processor communication (multiple serial runs)
3. MPI: parallel execution with inter-processor communication
Table 6-22 Best values of HPL N,P,Q for different numbers of total available cores
Number of cores N P Q
16 219326 4 4
32 310173 4 8
64 438651 8 8
128 620346 8 16
256 877302 16 16
512 1240692 16 32
1024 1754604 32 32
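Note that the N values in the table scale with the square root of the total core count, which keeps the memory used per node constant. For example, N for 64 cores is 219326 x sqrt(64/16) = 219326 x 2 = 438652, which matches the tabulated 438651 up to rounding.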
6.3.4 Results
Table 6-23 HPCC performance on 1 to 32 nodes
Cores (16 per node)
16 32 64 128 256
HPL in TFLOP/s    0.299   0.591   1.166   2.312   4.480
PTRANS in GB/s    5.173   9.745  12.093  23.296  42.947
DGEMM in GFLOP/s
Single 20.238 20.16 20.232 20.269 20.207
Star 19.783 18.64 19.275 19.362 19.605
Single Stream in GB/s
Copy 7.584 7.606 7.603 7.539 7.535
Scale 7.600 7.618 7.615 7.549 7.546
Add 9.820 9.843 9.834 9.745 9.739
Triad 9.789 9.811 9.796 9.710 9.707
Star Stream in GB/s
Copy 4.680 4.693 4.841 4.746 4.713
Scale 4.608 4.618 4.793 4.703 4.632
Add 4.882 4.889 5.375 5.134 4.883
Triad 4.817 4.818 5.369 5.098 4.819
RandomAccess in GUP/s
Single 0.035 0.035 0.035 0.035 0.035
Star 0.018 0.018 0.018 0.018 0.018
MPI 0.177 0.301 0.495 0.763 1.262
FFT in GFLOP/s
Single 3.500 3.552 3.534 3.530 3.561
Star 2.842 2.630 2.464 2.521 2.593
MPI 19.58 24.833 41.435 74.95 147.63
PingPong Latency in us
Min 0.367 0.517 0.775 0.951 0.876
Avg 0.866 1.233 1.534 1.967 2.134
Max 2.517 2.293 2.724 3.838 2.962
PingPong Bandwidth in GB/s
6.4 NAS Parallel Benchmarks
Kernels:
CG: Conjugate Gradient; irregular memory access and communication
MG: Multi-Grid V-Cycle; long- and short-distance communication, memory intensive
FT: (Fast) Fourier Transformation solving a partial differential equation (PDE) in 3D space; all-to-all communication
Pseudo-applications that solve nonlinear PDEs using the following algorithms:
BT: Block Tri-diagonal solver; non-blocking communication
SP: Scalar Penta-diagonal solver; non-blocking communication
LU: Lower-Upper Gauss-Seidel solver
Compilation was done directly on a Sandy Bridge node with the following flags:
-O3 -xAVX -xhost
In case of class E with small process count, the following flags were added:
-mcmodel=medium -shared-intel
6.4.2 Results
Results are in total giga-operations per second with class D problem size:
The Intel64 SIMD hardware and instruction set support in previous architectures provides:
8 architected 64-bit (MMX) and 16 architected 128-bit (SSE) registers
arithmetic, bit-shuffling and logical operations for
1-, 2-, 4-, and 8-byte integers
4- and 8-byte floating point data
The AVX floating point architecture adds the following new capabilities:
Wider Vectors
The 16 128-bit registers (named XMM0-15) have been extended to 32 bytes (256
bits). The new architected registers are named YMM0-15. They can hold 8
single-precision or 4 double-precision floating point values.
Figure 7-1 Using the low 128-bits of the YMMn registers for XMMn
Scalar processors perform operations that manipulate single data elements such as
fixed-point or floating-point numbers. For example, scalar processors usually have an
instruction that adds two integers to produce a single-integer result.
Figure 7-2 illustrates the difference between scalar and vector operations.
Figure 7-2 Scalar add operation (one result per add) versus vector add operation (several results per add)
Processor designers are continually looking for ways to improve application performance.
The addition of vector operations to a processor's architecture is one method that a
processor designer can use to make it easier to improve the peak performance of a
processor. However, the actual performance improvements that can be obtained for a
specific application depend on how well the application can exploit vector operations and
avoid other system bottlenecks like memory bandwidth.
The concept of vector processing has existed since the 1950s. Early implementations of
vector processing (known as array processing) were installed in the 1960s. They used
special purpose peripherals attached to general purpose computers. An example is the
IBM 2938 Array Processor, which could be attached to some models of the IBM
System/360. This was followed by the IBM 3838 Array Processor in later years.
By the mid-1970s, vector processing became an integral part of the main processor in
large supercomputers manufactured by companies such as Cray Research. By the mid-
1980s, vector processing became available as an optional feature on large general-
purpose computers such as the IBM 3090.
Other SIMD instruction-set extensions include AMD 3DNow! and IBM VSX.
Figure 7-3 Sandy Bridge block diagram emphasizing SIMD AVX functional units
The diagram shows the instruction fetch and allocate/rename stages (up to 4 micro-ops per cycle), the execution ports (ALUs, vector-integer multiply and add units, two 16-byte load ports and one 16-byte store port), the memory control logic (32 bytes read and 16 bytes written per cycle, new for AVX), and the 32 KB L1 data cache.
Notes for the programmer:
1. The execution pipeline can sustain up to four instructions fetched, dispatched,
executed and completed in any given cycle. Up to three AVX instructions can be
issued in a given cycle.
2. The peak dispatching rate is 6 micro-ops per cycle, to increase the likelihood that
the execution pipeline will not stall because no instructions are available to
decode.
3. There is no fused multiply-add (FMA). Instead, the maximum performance of 16
FP ops per cycle is reached by issuing an independent AVX FP multiply and an
AVX FP add.
4. A 128-bit AVX load can take one cycle. An AVX store can complete in two cycles
5. 256-bit AVX registers are architected as YMM0-YMM15. The registers can also
handle 128-bit AVX vectors, using XMM0-XMM15.
6. AVX shuffles are different than AVX blends. AVX shuffles are byte-permuting
operations. AVX blends mix bytes from two vectors but preserve order. AVX
shuffles are only executed by Port 5, so minimize the number of times shuffles
are needed.
7. Integer SSE instructions are only supported as 128-bit AVX instructions.
8. There is a 1 cycle penalty to move data from the INT stack to the FP stack.
9. AVX supports unaligned memory accesses, but the performance is better when
accessing 32-byte aligned data.
So how does SIMD code differ from scalar code? Compare the code fragments from
Examples 7-1 and 7-2; an illustrative sketch of the difference follows.
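As an illustration only (this is a sketch, not the original Examples 7-1 and 7-2), a scalar loop and an equivalent version written with AVX intrinsics might look like this:

#include <immintrin.h>

/* Scalar version: one addition per iteration */
void add_scalar(float *a, const float *b, const float *c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* AVX version: eight single-precision additions per iteration
   (n is assumed to be a multiple of 8 in this sketch)          */
void add_avx(float *a, const float *b, const float *c, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vc = _mm256_loadu_ps(&c[i]);
        _mm256_storeu_ps(&a[i], _mm256_add_ps(vb, vc));
    }
}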
Intel AVX functionality targets a diverse set of applications in the following areas:
Video editing/ post production
Audio processing
Image processing
Animation
Bioinformatics
A broad range of scientific applications in physics and chemistry.
A broad range of engineering applications dedicated to solving partial differential equations.
7.5 Auto-vectorization
Translating scalar code into vector intrinsics is beyond the scope of this discussion.
However, it is relatively straightforward to get the Intel compilers to automatically
vectorize code and report on which loops have been vectorized. The recommended
options are:
For C/C++:
icc -O3 -xAVX -vec-report1 -vec-report2
For Fortran:
ifort -O3 -xAVX -vec-report1 -vec-report2
These options would be used in addition to other optimization options, such as -finline or
-opt-streaming-stores.
-xAVX is the option that explicitly asks the compiler to auto-vectorize loops with
AVX instructions. The compiler will try auto-vectorization by default at
optimization levels -O2 and above.
-vec-report1 reports when the compiler has vectorized a loop.
-vec-report2 provides reasons why the compiler failed to vectorize a loop.
#define ITER 10
void foo(int size)
{
    int i, j;
    float *x, *y, *a;
    int iter_count = 1024;
    ...
    ...
    for (j = 0; j < ITER; j++) {
        for (i = 0; i < iter_count; i += 1) {
            x[i] = y[i] + a[i+1];   /* inner loop is a candidate for auto-vectorization */
        }
    }
}
After building a program with auto-vectorization, program performance should be tested.
If the performance is not as expected, a programmer can refer to comments in the listing
provided by -vec-report2 to identify reasons for loops that failed to auto-vectorize
and give the programmer direction on how to correct code to auto-vectorize properly.
In every iteration i of such a loop, c[i-1] is read, and it was written in iteration i-1.
When such a loop is transformed using AVX instructions, it results in incorrect values
being computed in the c array.
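For example, a loop with this kind of flow dependence (an illustrative sketch, not the original example) would be:

for (i = 1; i < n; i++)
    c[i] = c[i-1] + b[i];   /* c[i-1] was produced by the previous iteration */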
The compiler may, in certain situations (where memory overlap pragmas are not
provided or the compiler cannot safely analyze overlaps), generate versioned code to
handle the overlap at runtime; i.e., it inserts code to test for memory overlap and
executes either the auto-vectorized or the serial version, depending on whether the test
passes or fails at runtime.
Non-unit-stride memory accesses may increase the cost of implementing auto-vectorization,
and the compiler may pre-empt the decision to auto-vectorize those loops based on
heuristic or profile-driven cost analyses. In the code snippet shown below, the accesses
of array b[] are non-unit stride, since another array, idx[], is used to index into b[]. The
compiler does not auto-vectorize such a loop.
for (i = 0; i < N; i++)
    a[i] = b[idx[i]];
These are less portable, and can't be interchanged with malloc() and free().
8 Hardware Accelerators
8.1 GPGPUs
The Graphics Processing Unit (GPU) has evolved into a highly parallel, multithreaded
many core processor with exceptional compute power and with very high local memory
bandwidth. Originally, GPUs were designed for 3-D rendering of large sets of pixels, and
vertices mapped naturally to parallel threads. Many high performance computing (HPC)
applications are capable of taking advantage of these threads in the GPUs. Both NVIDIA
and AMD with Fusion/ATI have been offering these GPUs for high performance
computing applications for the last four years. GPUs that can be used for HPC
applications are commonly referred to as GPGPUs (General Purpose GPUs).
IBM has been selling NVIDIA GPUs in iDataPlex and blade server nodes since 2011.
There has been a lot of interest from customers in GPU-based servers.
The purpose of this chapter is to examine the role of GPUs on IBM iDataPlex systems,
details of the GPUs, the software available for GPUs, and programming the GPUs for
performance. This chapter discusses how to run HPL with GPUs, and the tools available
for GPU performance.
A grid is an array of thread blocks that executes the same kernel, reads inputs from
global memory, writes results to global memory, and synchronizes between dependent
kernel calls.
Each thread block has a per-Block shared memory space used for inter-thread
communication, data sharing, and result sharing in parallel algorithms. Grids of thread
blocks share results in Global memory space after kernel-wide global synchronization.
CUDA's hierarchy of threads maps to a hierarchy of processors on the GPU: the GPU
executes one or more kernel grids; a streaming multiprocessor (SM) executes one
or more thread blocks; and CUDA cores and other execution units in the SM execute
threads. The SM executes threads in groups of 32 threads called a warp.
There are 512 CUDA cores, organized into 16 SMs of 32 cores each. The GPU has six
64-bit memory partitions, for a 384-bit memory interface, supporting a total of 6 GB of
GDDR5 DRAM memory. A host interface connects the GPU to the CPU via PCI Express.
The GigaThread global scheduler distributes thread blocks to the SM thread schedulers.
Figure 8-1 shows the GPU layout for the Fermi GPU (the code name for the latest Tesla
product as of March 2012). One of the SMs is outlined in red.
The diagram shows the 16 SMs arranged around a shared L2 cache, the GigaThread scheduler, the host interface, and the six DRAM interfaces.
Figure 8-2 shows a block diagram of the Fermi Streaming Multiprocessor (SM). Each SM
has 16 load/store units (it can load data for 16 threads at a time). There are 4 Special
Function Units (SFUs) for functions such as sine, cosine, reciprocal, and square root.
And each SM has a fundamental computational block of 32 CUDA cores. One CUDA
core is outlined in red.
As shown in Figure 8-3, each CUDA core has an integer arithmetic logic unit (ALU) and a
floating point unit (FPU) capable of providing a fused multiply-add (FMA) instruction for
both single and double precision arithmetic. Each FPU takes 1 clock to deliver
single-precision results, 2 clocks for double-precision results.
Figure 8-3 A CUDA core: dispatch port, operand collector, FPU and ALU, and result queue
Each SM has 64 KB of configurable (16/48 KB) shared memory and L1 cache. This on-
chip shared memory enables threads in the same block to cooperate. On the Tesla
M2090 GPU, there is also a 768 KB L2 cache, shared by all the SMs, that can be written
by any thread.
The basic programming approach is as follows. If a code will benefit from the data-
parallel model using multiple threads, that kernel is identified, and dispatched to be
performed in the GPUs. There is an overhead associated with the data transfer from and
to the GPUs from the host. From the architecture point of view, GPUs are a lot simpler
since all the threads are doing the same operations, with no context switching. The GPU
computing flow is roughly as follows:
Copy data from CPU memory to the GPU memory
The CPU instructs (by a calling a CUDA function) GPU to perform kernel
computations
The GPU executes the kernel instructions in every one of its cores
Resulting data is copied from the GPU memory to the CPU memory
CUDA C extends the C language by allowing the programmer to define C functions
known as kernels, which are executed N times in parallel by N different CUDA threads
Serial SAXPY in C:
void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
// driver invocation of the serial saxpy function
saxpy_serial(n, 5.0, x, y);
Driver invocation:
// driver invocation of the cuda saxpy kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_cuda<<<nblocks, 256>>>(n, 5.0, x, y);
With this driver invocation of the CUDA kernel for SAXPY, these computations are done
simultaneously.
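The kernel itself is not shown in the text; a minimal sketch of what a CUDA SAXPY kernel typically looks like (an illustration, not the document's original kernel) is:

__global__ void saxpy_cuda(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n)
        y[i] = a*x[i] + y[i];
}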
FORTRAN support
PGI provides support for GPUs through multiple mechanisms: in addition to CUDA C,
there is CUDA FORTRAN from PGI, and there is directive-based support in the PGI
Accelerator model; these directives are very similar to OpenMP directives for
parallelization [3]. There is also the CAPS HMPP compiler, which provides support for
FORTRAN codes [4].
Figure 8-4 shows the basic memory hierarchy of the NVIDIA GPUs. To reiterate: each
SM has a 64 KB L1 cache on chip, the SMs share a 768 KB L2 cache, there is 6 GB of
global GPU memory, and the GPU has a PCI Express interface to the system. The data
transfer rates vary widely in CUDA computing:
within the device (50-80 GB/sec)
asynchronous host-to-device with pinned memory (10-20 GB/sec)
PCIe transfer rate (4-6 GB/sec)
Because of the small size of the caches, and the possibility of memory bank conflicts,
CUDA programming strongly discourages any cache blocking when tuning for
performance. The desired tuning procedure is known as memory coalescing. The details
of this may be found in the NVIDIA tutorials. Here is an example of memory coalescing
in CUDA for a 2-D transpose [5].
In the naive transpose, loads are coalesced; stores are not (the stride runs over the
column index). In addition, there are other read-only memories, known as texture
memory and constant memory, available in the GPUs.
__global__ void transposeNaive(float *odata, float *idata, int width, int height)
{
    int xIndex = blockIdx.x * blockDim.x + threadIdx.x;  // Convert thread indices to
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y;  // coordinates in matrix
    int index_in  = xIndex + width * yIndex;             // Convert matrix coordinates
    int index_out = yIndex + height * xIndex;            // to flattened array indices
    odata[index_out] = idata[index_in];
}
In the coalesced method using shared memory, there are two steps:
1. transpose the submatrix into Shared Memory
2. Write rows of the transposed submatrix back to global memory
And the resulting code looks like this:
__global__ void transposeCoalesced(float *odata, float *idata, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM+1];
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in = xIndex + (yIndex)*width;
    tile[threadIdx.y][threadIdx.x] = idata[index_in];
    __syncthreads();
    /* step 2 (completing the fragment): write rows of the transposed tile back to global memory */
    xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
    yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
    int index_out = xIndex + (yIndex)*height;
    odata[index_out] = tile[threadIdx.x][threadIdx.y];
}
In addition to memory coalescing, there are other memory optimization procedures like
using texture memory and execution configuration (determining the best thread block
dimension). In [6], implementation of the Himeno benchmark that solves the 3-D Poisson
equation is discussed with the details of optimizing for performance on GPUs. Using
finite-differences, the Poisson equation is discretized in space yielding a 19-point stencil.
The discretized equations are solved iteratively using Jacobi relaxation. This benchmark
is designed to run with various problem sizes to fit the system. On the Himeno code,
optimized for GPUs, memory coalescing gives a 57% performance improvement. Use of
Texture cache improves the performance by an additional 33%. Other optimizations
(removing logic, and branching) improve the performance by an additional 18% [6].
This example illustrates a case where 2 Tesla M2070 GPGPUs are attached to each 12-
core Intel Westmere node. These are some of the runtime environment variables:
CPU_CORES_PER_GPU=6
export MKL_NUM_THREADS=$CPU_CORES_PER_GPU
export OMP_NUM_THREADS=$CPU_CORES_PER_GPU
export CUDA_DGEMM_SPLIT=0.85
export CUDA_DTRSM_SPLIT=0.75
The last 2 environmental variables distribute the work load between the CPUs and the
GPUs.
Processor binding has a significant impact on performance. With Open MPI, a rankfile is
an effective binding tool.
The HPL-GPU code on the CPU side is a hybrid parallel code. It is best run with a
command like this:
$ mpirun -machinefile host.list -np 8 --mca btl_openib_flags 1 --rankfile rankfile \
  $HPL_DIR/bin/CUDA_pinned/xhpl | tee out_8
where the rankfile looks like this:
rank 0=i04n201 slot=0,1,2,3,4,5
rank 1=i04n201 slot=6,7,8,9,10,11
rank 2=i04n202 slot=0,1,2,3,4,5
rank 3=i04n202 slot=6,7,8,9,10,11
rank 4=i04n203 slot=0,1,2,3,4,5
rank 5=i04n203 slot=6,7,8,9,10,11
rank 6=i04n204 slot=0,1,2,3,4,5
rank 7=i04n204 slot=6,7,8,9,10,11
Also, it is important to choose the right problem size for peak performance. This is
generally based on the available node memory, as recommended in the HPL download
site ([8]).
The following table shows GPU HPL performance on a customer system that has 2
GPUs per node on 32 nodes with 48 GB of memory per node.
The scaling is roughly linear. This is slightly above 50% of the peak performance of the
system, and typically that is the expected performance.
8.1.7 OpenACC
OpenACC is a directive-based, OpenMP-like programming standard that simplifies the
development of applications for GPUs. It was developed by Cray, NVIDIA, PGI and
CAPS. The details may be found in [10].
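The GPU status information shown below is typically obtained with the NVIDIA system management interface; for example (the exact output format depends on the driver version):
$ nvidia-smi -q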
==============NVSMI LOG==============
Attached GPUs : 2
GPU 0:14:0
Product Name : Tesla M2090
Display Mode : Disabled
Persistence Mode : Disabled
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0322411030473
GPU UUID : GPU-891aa48aaa2cffee-966d8d4e-e0c73aa5-
7646437
9-4a364676f8c384cd02804ab0
Inforom Version
OEM Object : 1.1
ECC Object : 2.0
Power Management Object : 4.0
PCI
Bus : 14
Device : 0
Domain : 0
Device Id : 109110DE
Bus Id : 0:14:0
Fan Speed : N/A
Memory Usage
Total : 6143 Mb
Used : 10 Mb
Free : 6132 Mb
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Ecc Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Total : 0
Temperature
Gpu : N/A
Power Readings
Power State : P12
Power Management : Supported
Power Draw : 32.58 W
Power Limit : 225 W
Clocks
Graphics : 50 MHz
SM : 100 MHz
Memory : 135 MHz
Similarly, it provides information for the second GPU, with its bus Id labeled as 0:15:0.
If the GPUs are being used, the Utilization section shows the GPU and GPU-memory
utilization as percentages, which change dynamically as the program runs on the GPUs.
8.1.9 References
9 Power Consumption
The power consumption of a large HPC cluster can reach hundreds, if not thousands, of
kilowatts and customers can be limited in their deployments by the amount of electrical
power that is available in their data center.
The need for ever growing computing power means that the energy efficiency of HPC
clusters must improve for the power envelope to remain under control and not become
the dominant factor limiting the size of a deployment. This is well understood by all
manufacturers who have put energy efficiency at the top of their priorities, whether at the
chip level, the node level, the rack level or the data center level.
From one generation to the next, improvements in the chip manufacturing process have
allowed the power consumption of processors to stay roughly stable while the computing
power has increased, by adding more cores instead of increasing the processor clock
speed.
The Intel Sandy Bridge processor supports a variety of power-saving features like
multiple sleep states and DVFS (Dynamic Voltage and Frequency Scaling) which can be
leveraged by system and application software to reduce the energy consumption of HPC
workloads.
The iDataPlex dx360 M4 uses all these techniques and provides enhanced power
management functions like power capping and power trending.
There are two simple ways of measuring the power consumption of an iDataPlex dx360
M4 server: using the rvitals command from xCAT (see the xCAT rvitals man page), or
using a simple Linux script to format the data collected by the ibmaem kernel module.
Here is the output of the rvitals command on a dx360 M4 node with hostname n010.
therm-stc:~ # rvitals n010 power
n010: AC Avg Power: 145 Watts (495 BTUs/hr)
n010: CPU Avg Power: 32 Watts (109 BTUs/hr)
n010: Domain A AvgPwr: 50 Watts (171 BTUs/hr)
n010: Domain B AvgPwr: 50 Watts (171 BTUs/hr)
n010: MEM Avg Power: 6 Watts (20 BTUs/hr)
n010: Power Status: on
n010: AC Energy Usage: 90.4085 kWh +/-5.0%
n010: DC Energy Usage: 39.6219 kWh +/-2.5%
To measure the energy consumption of a given workload on an iDataPlex dx360 M4
server, one would use:
$ rvitals n010 power
$ ./workload.sh
$ rvitals n010 power
and work out the difference between the AC Energy Usage readings before and after.
If rvitals is not available, a simple script to process the information provided by the
ibmaem Linux kernel module can yield similar, although less detailed, information.
Provided the ibmaem kernel module is loaded (modprobe ibmaem), one could use this
script:
$ cat nrg.sh
#!/bin/bash
v=1
if [ ! -r /sys/devices/platform/aem.1 ] ; then
v=0
fi
# the aem energy counters are reported in microjoules
BDC=`cat /sys/devices/platform/aem.$v/energy1_input`
BAC=`cat /sys/devices/platform/aem.$v/energy2_input`
b=$(date "+%s%N")
$*
e=$(date "+%s%N")
ADC=`cat /sys/devices/platform/aem.$v/energy1_input`
AAC=`cat /sys/devices/platform/aem.$v/energy2_input`
RT=$(echo "($e-$b)/1000000"|bc)
DC=$((ADC - BDC))
AC=$((AAC - BAC))
DCP=$(echo "$DC/1000/$RT"|bc -l|cut -d'.' -f1)
ACP=$(echo "$AC/1000/$RT"|bc -l|cut -d'.' -f1)
echo "Energy: $(( (DC) / 1000000 ))J (DC) Time:$RT (ms) AvgPower: ${DCP}W (DC)"
echo "Energy: $(( (AC) / 1000000 ))J (AC) Time:$RT (ms) AvgPower: ${ACP}W (AC)"
With any of the power-saving states, there is a tradeoff between power savings and
latency. For example, enabling the CPU C6 state allows CPU cores to be completely
turned off, which saves power. But since the CPU cores are powered down, it takes
additional time to restore their state when they transition back to the C0 state. If
maximum overall performance is desired, all power-saving states can be disabled. This
will minimize latencies to transition into and out of the power states, but at the same time
power will be increased dramatically. At the other extreme, if power settings are
optimized for maximum power savings, performance can suffer due to long latencies.
For most applications, the default system settings offer a good balance between
performance and efficiency. If necessary, the defaults can be changed if increased
performance or power savings are desired.
The diagram below shows an overview of the various power states in the server.
There is a hierarchy among the power states. At the highest level, the G-states represent
the overall state of the server. The G-states map to the S-states (system sleep states).
Progressing to the right, there are subsystem power states that represent the current
state of the CPU, memory, and subsystem devices. As shown by the arrows, certain
power states cannot be entered if higher level power states are not active. For example,
for a CPU core to be in P1 state, the CPU core also has to be in C0 state, the system has
to be in S0 system sleep state, and the overall server has to be in G0 state.
For additional information on system power states, refer to the ACPI (Advanced
Configuration and Power Interface) specification.
9.3.1 G-States
G-states are global server states that define the operational state of the entire server. As
the number of the G-state increases, there is additional power saved. However, as
shown below, the latency to move back to G0 state also increases.
9.3.2 S-States
S-states define the sleep state of the entire system. The table below describes the
various sleep states.
9.3.3 C-States
C-states are CPU idle power-saving states. C-states higher than C0 only become active
when a CPU core is idle. If a process is running on a CPU core, the core is always in C0
state. If Hyper-Threading is enabled, the C-state resolves down to the physical core. For
example, if one Hyper-Thread is active and another Hyper-Thread is idle on the same
core, the core will remain in C0 state.
C-states can operate on each core separately or the entire CPU package. The CPU
package is the physical chip in which the CPU cores reside. It includes the CPU cores,
caches, memory controllers, PCI Express interfaces, and miscellaneous logic. The non-
CPU core hardware inside the package is commonly referred to as the uncore.
Core C-states transitions are driven by interrupts or the operating system scheduler with
MWAIT commands. The number of cores in C3 or C6 also impacts the maximum turbo
frequency that is available. If maximum peak performance is desired, all CPU C-states
should be enabled.
Note that CPU C-states do not directly map to ACPI C-states. The reason for this is
historical. ACPI C-states range from C0 to C3. At the time when they were defined,
there were no CPUs that supported the C6 state, so the mapping was 1:1 (ACPI C0 =
CPU C0, ACPI C1 = CPU C1, etc.). Newer CPUs, however, support the C6 state. In
order to get the maximum power savings when going to the ACPI C3 state, the CPU C6
state gets mapped to ACPI C3 and the CPU C3 state gets mapped to ACPI C2.
Note 4: The number of C-states and the specific power savings associated with each C-state depend on the specific type and SKU of CPU installed.
Note 5: VRD stands for voltage regulator device.
9.3.4 P-States
P-states are defined as the CPU performance states. Each CPU core supports multiple
P-states and each P-state corresponds to a frequency. Note, that P0 can run above the
rated frequency for short periods of time if turbo mode is enabled. The exact turbo
frequency for P0 and the amount of time the core runs at the turbo frequency is controlled
autonomously in hardware.
Like core C-states, P-states are controlled by the operating system scheduler. The OS
scheduler places a CPU core in a specific P-state depending on the amount of
performance needed to complete the current task. For example, if a 2GHz CPU core
only needs to run at 1GHz to complete a task, the OS scheduler will place the CPU into a
higher numbered P-state.
Each CPU core can be placed in a different P-state. Multiple threads on one core (e.g.
Hyper-Threading) are resolved to a single P-state. P-states are only valid when the CPU
core is in the C0 state. P-states are sometimes referred to as DVFS (dynamic voltage
and frequency scaling) or EIST (Enhanced Intel Speedstep Technology).
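On Linux, the P-state behavior requested for a core can be inspected through the cpufreq sysfs interface; for example (the paths below assume the standard cpufreq sysfs layout):
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq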
In addition to controlling the core frequency, P-states also indirectly control the voltage
level of the VRD (voltage regulator device) that is supplying power to the CPU cores. As
the core frequency is reduced from its maximum value, the VRD voltage is automatically
reduced down to a certain point. Eventually, the VRD will be operating at the minimum
voltage that the CPU cores can tolerate. If the core frequency is lowered beyond this
point, the VRD voltage will remain at the minimum voltage. This is illustrated in the
diagram below.
The diagram plots the operating point against core frequency: above the minimum VRD voltage both frequency and voltage scale down together, while below that point only the frequency scales and the CPU voltage stays at its minimum.
Typically, the most efficient operating point is at the peak of the curve.
9.3.5 D-States
D-states are subsystem power-saving states. They are applicable to devices such as
LAN, SAS, and USB. The operating system can transition to different D-states after a
period of time or when requested by a device driver. All D-states occur when the server
is in S0 state.
9.3.6 M-States
M-states control the memory power savings. The memory controller automatically
transitions memory to the M1 or M2 state when the memory is idle for a period of time.
M-states are only defined when the server is in S0 state.
Figure 9-3 illustrates the relative influence of each power saving feature. The vertical
axis is the system utilization, ranging from 0% (idle) to 100% (maximum utilization). The
width of each polygon at any utilization level represents the relative benefit of each group
of power saving features. For example, at 50% utilization, the power supply and VRD
efficiency have a very large influence on overall system efficiency. This is because the
blue polygon is very wide at the 50% utilization point. In contrast to this, energy-efficient
Ethernet has little benefit at 50% utilization and the idle power-saving features have no
benefit at that utilization level.
It is important to understand what portion of the utilization curve the server will be
operating in. In this manner, it is possible to understand which power-saving features are
influencing the overall performance/watt efficiency of the server for the target workload.
The composite effect can be measured with industry standard efficiency benchmarks
such as SPEC SPECpower.
Electrical conversion efficiency (ECE) measures how much power is lost to convert from
one power level to another (e.g. AC-to-DC or DC-to-DC conversion). If a power supply
converts 220V AC to 12V DC and it is 95% efficient for a 500W load, 5% of the input
power is converted to heat and is typically dissipated with a fan built into the power
supply. In this example, 526W AC is required, 500W is delivered to the load, and 26W is
dissipated as heat. Power supply and VRD efficiency has improved dramatically in
recent years but no electrical circuit is ideal and some power is always dissipated.
ECE = Power Out / Power In
Power usage effectiveness (PUE) measures how much power is lost in the datacenter
relative to actual IT equipment power. The overall PUE depends on how close to the true
compute power load that the power out measurement is taken and also what ancillary
loads are included in the calculation (e.g. lights, humidification, UPS, CRACs, chillers,
etc.)
PUE = Total Facility Power / IT Equipment Power = 1 / Datacenter Efficiency
Performance/watt efficiency (P/W E) is defined as how much performance can be
achieved for every watt of power consumed.
P/W E = Performance / Power
P/W E focuses on the server, chassis, and rack efficiency. By comparison, ECE or PUE
can be extended to the datacenter level or power station level.
See the xCAT pages on Energy Management and renergy for more details.
Appendix B: Acknowledgements
Nagarajan Karthikesan (k.nagarajan@in.ibm.com) provided information on GCC 4.7.0
compilation.
Luigi Brochard supported this project by helping to gather the resources and people
needed.
Steve Stevens and Lisa Maurice provided the encouragement and managerial
sponsorship to complete this project.
Appendix D: Notices
IBM may not offer the products, services, or features discussed in this document in other
countries. Consult your local IBM representative for information on the products and
services currently available in your area. Any reference to an IBM product, program, or
service is not intended to state or imply that only that IBM product, program, or service
may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user's
responsibility to evaluate and verify the operation of any non-IBM product, program, or
service.
IBM may have patents or pending patent applications covering subject matter described
in this document. The furnishing of this document does not give you any license to these
patents. You can send license inquiries, in writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-
1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other
country where such provisions are inconsistent with local law: INTERNATIONAL
BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS"
WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING,
BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do
not allow disclaimer of express or implied warranties in certain transactions, therefore,
this statement may not apply to you.
Any references in this information to non-IBM Web sites are provided for convenience
only and do not in any manner serve as an endorsement of those Web sites. The
materials at those Web sites are not part of the materials for this IBM product and use of
those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes
appropriate without incurring any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those
products, their published announcements or other publicly available sources. IBM has not
tested those products and cannot confirm the accuracy of performance, compatibility or
any other claims related to non-IBM products. Questions on the capabilities of non-IBM
products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business
operations. To illustrate them as completely as possible, the examples include the names
of individuals, companies, brands, and products. All of these names are fictitious and
any similarity to the names and addresses used by an actual business enterprise is
entirely coincidental.
COPYRIGHT LICENSE:
Appendix E: Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International
Business Machines Corporation in the United States, other countries, or both. These and
other IBM trademarked terms are marked on their first occurrence in this information with
the appropriate symbol (® or ™), indicating US registered or common law trademarks
owned by IBM at the time this information was published. Such trademarks may also be
registered or common law trademarks in other countries. A current list of IBM trademarks
is available on the Web.
The following terms are trademarks of the International Business Machines Corporation
in the United States, other countries, or both:
1350 IBM Process Rational Unified
AIX 5L Reference Model Process
AIX for Rational
alphaWorks IT Redbooks
Ascendant IBM Systems Redbooks (logo)
BetaWorks Director Active RS/6000
BladeCenter Energy RUP
CICS Manager S/390
Cool Blue IBM Sametime
DB2 iDataPlex Summit Ascendant
developerWorks IntelliStation Summit
Domino Lotus Notes System i
EnergyScale Lotus System p
Enterprise MQSeries System Storage
Storage Server MVS System x
Enterprise Netfinity System z
Workload Notes System z10
Manager OS/390 System/360
eServer Parallel Sysplex System/370
Express PartnerWorld Tivoli
Portfolio POWER TotalStorage
FlashCopy POWER VM/ESA
GDPS POWER4 VSE/ESA
General Parallel POWER5 WebSphere
File System POWER6 Workplace
Geographically POWER7 Workplace
Dispersed Parallel PowerExecutive Messaging
Sysplex Power Systems X-Architecture
Global Innovation PowerPC xSeries
Outlook PowerVM z/OS
GPFS PR/SM z/VM
HACMP pSeries z10
HiperSockets QuickPlace zSeries
HyperSwap RACF
i5/OS Rational Summit
The following terms are trademarks of other companies:
AMD, AMD Opteron, the AMD Arrow logo, and combinations thereof, are
trademarks of Advanced Micro Devices, Inc.
InfiniBand, and the InfiniBand design marks are trademarks and/or service
marks of the InfiniBand Trade Association.
ITIL is a registered trademark, and a registered community trademark of