
Performance Guide
For HPC Applications
On IBM Power 755 System




Jim Abeles, Luigi Brochard, Louis Capps, Don DeSota, Jim Edwards, Brad Elkin, John Lewars, Eric Michel, Raj Panda, Rajan Ravindran, Joe Robichaux, Swamy Kandadai and Sid Vemuganti


Release 1.0
June 4, 2010
IBM Systems and Technology Group
Copyright 2010 IBM Corporation
Contents
1 Introduction ................................................................................................ 7
2 POWER7 Microarchitecture Overview ........................................................ 9
3 Performance Optimization on POWER7 .................................................... 14
3.1 Compiler Versions and Options .................................................................... 14
3.2 ESSL ............................................................................................................. 14
3.3 SMT ............................................................................................................. 15
3.3.1 Introduction ............................................................................................................ 15
3.3.2 Hardware Resources ............................................................................................... 16
3.3.3 SMT Control ............................................................................................................ 18
3.3.4 Performance Results ............................................................................................... 20
3.4 Operating Systems and System Configuration .............................................. 22
3.4.1 Page Size ................................................................................................................. 22
3.4.2 Memory Affinity ...................................................................................................... 23
3.4.3 Process Binding ....................................................................................................... 23
3.4.4 Hardware Prefetch Control ..................................................................................... 24
4 AIX ............................................................................................................ 26
4.1 Page Modes ................................................................................................. 26
4.1.1 Linker Options ......................................................................................................... 26
4.1.2 Using ldedit ............................................................................................................. 27
4.1.3 Using the LDR_CNTRL Environment Variable ......................................................... 27
4.1.4 Configuring a System for Large Page Support ........................................................ 27
4.1.5 Querying Page Use .................................................................................................. 28
4.1.6 4KB & 64KB Page Pools ........................................................................................... 28
4.2 Memory Affinity Settings ............................................................................. 28
4.3 Process and Thread Binding ......................................................................... 29
4.4 Task Binding ................................................................................................. 29
4.4.1 Using bindprocessor ............................................................................................... 30
4.4.2 Using rsets (execrset) ............................................................................................. 30
4.4.3 Using launch ........................................................................................................... 30
4.4.4 Mixed (Thread and Task) Binding ........................................................................... 31
4.5 SMT Control ................................................................................................. 31
4.5.1 Querying SMT Configuration .................................................................................. 31
4.5.2 Specifying the SMT Configuration .......................................................................... 31
4.6 Hardware Prefetch Control........................................................................... 32
5 Linux ......................................................................................................... 33
5.1 SMT Management ........................................................................................ 33
5.1.1 Linux SMT Performance .......................................................................................... 33
5.2 Memory Pages ............................................................................................. 34
5.3 Memory Affinity ........................................................................................... 34
5.3.1 Introduction ............................................................................................................ 34
5.3.2 Using numactl ........................................................................................................ 35
5.3.3 Linux Memory Affinity Performance ...................................................................... 35
5.4 Process and Thread Binding ......................................................................... 37
5.4.1 taskset ..................................................................................................................... 37
5.4.2 numactl ................................................................................................................... 37
5.4.3 Compiler Environment Variables for OpenMP Threads ......................................... 37
5.4.4 LoadLeveler ............................................................................................................. 37
5.5 Hardware Prefetch Control........................................................................... 37
5.6 Compilers ..................................................................................................... 38
5.6.1 GNU Compilers ....................................................................................................... 38
5.6.2 IBM Compilers ........................................................................................................ 39
5.6.3 IBM Compiler Performance Comparison ................................................................ 40
5.6.4 Porting x86 Applications to Linux on POWER ......................................................... 40
5.7 MPI .............................................................................................................. 40
5.8 OpenMPI ...................................................................................................... 41
5.8.1 Introduction ............................................................................................................ 41
5.8.2 OpenMPI Binding Parameters ................................................................................ 41
5.8.3 Using OpenMPI with LoadLeveler ........................................................................... 41
5.9 Monitoring Tools for Linux ........................................................................... 42
5.9.1 top .......................................................................................................................... 42
5.9.2 nmon ....................................................................................................................... 42
5.9.3 oprofile ................................................................................................................... 43
6 MPI Performance on the Power 755 System ............................................. 45
6.1 MPI Performance Considerations ................................................................. 45
6.2 LoadLeveler JCF (Job Command File) Affinity Settings ................................... 45
6.3 IBM PE (Parallel Environment) Affinity Support ............................................ 47
6.4 Environment Variables and LoadLeveler Keywords for Optimal MPI
Performance .......................................................................................................... 47
6.4.1 Details on Environment Variables .......................................................................... 48
6.4.2 Other Possible Environment Variables to Try ......................................................... 51
7 Performance Analysis Tools on AIX ........................................................... 52
7.1 Runtime Environment Control ...................................................................... 52
7.1.1 ulimit ....................................................................................................................... 52
7.1.2 Memory Pages ........................................................................................................ 52
7.2 Profiling Tools .............................................................................................. 53
7.2.1 gprof ....................................................................................................................... 53
7.2.2 tprof ........................................................................................................................ 54
7.3 MPI Performance Tools ................................................................................ 56
7.3.1 MPI Summary Statistics .......................................................................................... 57
7.3.2 MPI Profiling ........................................................................................................... 58
7.4 Hardware Performance Tools ....................................................................... 59
7.4.1 hpmcount ............................................................................................................... 59
7.4.2 libhpm ..................................................................................................................... 60
7.4.3 Profiling Hardware Events with tprof ..................................................................... 62
7.5 Other Useful Tools ....................................................................................... 62
8 Performance Results ................................................................................. 64
8.1 HPC Benchmarks on AIX ............................................................................... 64
8.1.1 STREAM, Linpack, SPEC CPU2006 ........................................................................... 64
8.1.2 NAS Parallel Benchmarks Class D (MPI) .................................................................. 64
8.1.3 Weather Benchmarks ............................................................................................. 65
8.2 HPC Benchmarks on Linux ............................................................................ 65
8.2.1 Linpack .................................................................................................................... 65
8.2.2 NAS Parallel Benchmarks Class D............................................................................ 65
9 VSX - Vector Scalar Extensions ................................. 69
9.1 VSX Architecture .......................................................................................... 69
9.1.1 Note on Terminology: ............................................................................................. 69
9.2 A Short Vector Processing History ................................................................ 69
9.3 VSU Overview .............................................................................................. 71
9.4 Compiler Options ......................................................................................... 74
9.5 Vectorization Overview ................................................................................ 74
10 Auto-vectorization ................................................................................. 77
10.1 Inhibitors of Auto-vectorization ................................................................ 77
10.1.1 Loop-carried Data Dependencies ........................................................................... 78
10.1.2 Memory Aliasing ..................................................................................................... 78
10.1.3 Non-stride-1 Accesses ............................................................................................ 78
10.1.4 Complexities Associated with the Structure of the Loop ....................................... 79
10.1.5 Data Alignment ....................................................................................................... 79
11 VSX and Altivec Programming ............................................................... 82
11.1 Handling Data Loads ................................................................................. 82
11.2 Performance Improvement of VSX/Altivec-enabled Code Over Scalar Code
82
11.3 Memory Alignment ................................................................................... 82
11.3.1 AIX ........................................................................................................................... 82
11.3.2 Linux........................................................................................................................ 83
11.3.3 Multiple Array Offsets in a Loop ............................................................................. 83
11.3.4 Multidimensional Arrays ........................................................................................ 83
11.4 Vector Programming Strategies ................................................................ 83
11.4.1 Some Requirements for Efficient Loop Vectorization ............................................ 84
11.4.2 OpenMP Loops ....................................................................................................... 84
11.4.3 Example: Vectorizing a Simple Loop ....................................................................... 85
11.4.4 Local Algorithms ..................................................................................................... 86
11.4.5 Global Restructuring ............................................................................................... 86
11.5 Conclusions .............................................................................................. 88
12 Power Consumption .............................................................................. 89
12.1 Static Power Saver: "SPS" ......................................................................... 89
12.2 Dynamic Power Saver: "DPS" .................................................................... 89
12.3 Dynamic Power Saver - Favor Performance: "DPS-FP" ............................... 90
12.4 Performance Versus Power Consumption ................................................. 90
Appendix A: POWER7 and POWER6 Hardware Comparison ....................... 92
Appendix B: IBM System Power 755 Compute Node ................................. 93
Appendix C: Script petaskbind.sh............................................................... 94
Appendix D: Script petaskbind-rset.sh ....................................................... 96
Appendix E: Enabling Huge Pages on SLES11 Power 755 systems .............. 98
Appendix F: Flushing Linux I/O buffers ...................................................... 99
Appendix G: Compiler Flags and Environment Settings for NAS Parallel
Benchmarks 100
Appendix H: Example Program Listing for Using the dscr_ctl System Call 101
Appendix I: Scramble Program for 64K Page Creation ............................... 102
Appendix J: Runtime Environment for the GFS Application ........................ 104
Appendix K: Acknowledgements ............................................................. 105
Appendix L: Abbreviations Used .............................................................. 106
Appendix M: Notices ................................................................................ 107
Appendix N: Trademarks ......................................................................... 109

Figures
Figure 2-1. POWER7 Chip Architecture .......................................................................................... 11
Figure 2-2. POWER7 Cache Hierarchy Summary ............................................................. 12
Figure 2-3 POWER7 Bus Bandwidths per chip ................................................................................ 12
Figure 2-4. POWER7 Hybrid "Fluid" L3 Cache Structure ................................................................ 13
Figure 3-1 How symmetric multithreading uses functional units ................................................... 15
Figure 3-2 Thread designations for SMT2 and SMT4 ..................................................................... 19
Figure 3-3 SMT2 and SMT4 performance comparison for SPEC CFP2006 ...................................... 21
Figure 3-4 NAS Parallel Benchmark Class B (OpenMP) SMT Gain .................................................. 21
Figure 3-5 NAS Parallel Benchmarks Class C (MPI) SMT Gain ........................................................ 22
Figure 5-1 SMT effect on an HPC kernel ......................................................................................... 34
Figure 5-2 Effects of process binding, memory page size and memory affinity on STREAM triad . 36
Figure 5-3 SIMD performance on an HPC kernel ............................................................................ 40
Figure 5-4 nmon display ................................................................................................................. 43
Figure 9-1 Scalar and vector operations ......................................................................................... 70
Figure 9-2 POWER7 with VSU block diagram ................................................................................. 72
Figure 10-1 Handling unaligned data ............................................................................................. 80
Figure 12-1 Active Energy Manager GUI ....................................................................................... 90
Figure 12-2 Correlation of performance and power consumption ................................................. 91

Tables
Table 1-1 Early POWER7 systems .................................................................................................... 8
Table 3-1 Hardware resources for the different threading modes ................................................. 17
Table 3-2 Mapping logical cores to physical cores in ST, SMT2 and SMT4 mode .......................... 19
Table 6-1 Mapping MP_TASK_AFFINITY to LoadLeveler JCF statements ....................................... 47
Table 8-1 Power 755 performance ................................................................................................. 64
Table 8-2 NAS Parallel Benchmark Performance on Power 755 Cluster ......................................... 64
Table 8-3 GFS performance comparison ......................................................................................... 65
Table 8-4 Linpack performance for SMT off, SMT2 and SMT4 ....................................................... 65
Table 12-1 Performance characteristics of selected SPEC applications .......................................... 91

1 Introduction
In February of 2010, IBM introduced the IBM Power 750/755¹ entry computers and the IBM Power 770/780 midrange computers, the first systems using the new POWER7 microprocessor. Based upon proven Simultaneous Multi-threading (SMT) and multi-core technology from the POWER5 and POWER6 architectures, the POWER7 processor extends IBM's leadership position by coupling innovations such as eight cores per socket and on-chip eDRAM technology with a cache hierarchy and memory subsystem specifically balanced and tuned for ultra high frequency multi-threaded cores, in addition to a new SIMD unit capable of eight double-precision floating point operations per cycle, enhanced efficiency, and automated power management.
In particular, the 32-core IBM Power 755 is specifically enabled for HPC with a full HPC software stack targeted at workloads like weather and climate modeling, computational chemistry, physics and petroleum reservoir modeling.
Later in 2010 IBM plans² to extend the POWER7 system family with single and dual socket blade systems as well as a large 256-core SMP system targeted for high-end business and commercial computing. In 2011 IBM will ship an ultra-dense cluster system with an integrated interconnect that will enable scaling the cluster system to multi-petaflops with thousands of POWER7 cores.
It is the goal of this paper to explore in detail performance aspects of HPC-type workloads on the POWER7 microprocessor. Much of the work and knowledge in this paper is derived from running workloads on the IBM Power 755 and Power 780 systems. While the Power 780 is a POWER7 64-core SMP system based on 16-core scalable building blocks targeted mainly at high performance database and other commercial applications, because of its extremely well balanced design the Power 780 also performs well with HPC workloads and is an excellent platform for this performance study. Moreover, our work on the Power 755 and 780 will form a good bridge to performance work to be done on the ultra-dense cluster system in 2011.
To help the intended technical readers of this paper, such as HPC application specialists
or customers considering purchases of these systems, we have included suitable
examples to articulate the features of the POWER7 microprocessor that are critical for
realizing performance benefits on HPC workloads.
After a brief and high-level overview of the POWER7 micro-architecture in Chapter 2, we cover in Chapter 3 performance optimization on POWER7-based systems when using IBM compilers, libraries, etc. Operating system level tuning that is needed for performance optimization of HPC applications is covered in Chapter 4 for AIX and in Chapter 5 for Linux. High performance clusters can be built with Power 755 nodes using InfiniBand as the interconnect, and IBM supports a software stack to enable MPI applications on Power 755/IB clusters. Aspects of tuning that are needed for MPI application performance optimization are explained in Chapter 6. Analysis tools that are commonly used for performance tuning are covered in Chapter 7. To provide a feel for realizable performance on a single node of Power 755 as well as on a Power 755 cluster, we have selected several frequently used HPC benchmarks and provide benchmark results in Chapter 8. A new single instruction multiple data (SIMD) instruction set called VSX is introduced in the POWER7 processor. This is in addition to the AltiVec™ SIMD instruction set which was previously introduced in the PPC970-based system, the JS20 blade server. We cover key aspects of the SIMD architecture and performance optimization in Chapters 9 through 11. Chapter 12 covers power management features in the POWER7 processor and their implications for HPC workloads.

² All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Statements regarding SMP servers do not imply that IBM will introduce a server with this capability.

Given the scope of the current document, which discusses performance optimization of HPC workloads, little detail is given on the various hardware features of the Power 755 and 780 systems (operating environment, physical package, system features, etc.) except where it is essential. However, we provide here two references that are fairly comprehensive and give excellent coverage of these topics to complement the work in this document.
Table 1-1 Early POWER7 systems

For a more in-depth description of the Power 755 system, please refer to the IBM Power 750 and 755 Technical Overview and Introduction.
For a more in-depth description of the Power 780 system, please refer to the IBM Power 770 and 780 Technical Overview and Introduction.


2 POWER7 Microarchitecture Overview
The POWER7 design represents a significant innovation in multiprocessor architecture,
implementing groundbreaking logic and circuit design. These innovations allow the
POWER7 design to move from two cores per chip to eight cores while increasing the
number of available SMT (hardware-supported) threads to four per core. Even with these
changes, the POWER7 architecture maintains the traditional POWER execution
frequency/memory balance by integrating custom eDRAM L3 memory on-chip for the first
time. This approach combines the dense, low power attributes of eDRAM with the speed
and bandwidth of SRAM. The 32MB of L3 on a chip also permits the use of caching
techniques to reconfigure and optimize access to L3 cache on the fly.
To further maintain system balance, the POWER7 design adds aggressive out-of-order
execution, improving the dynamic allocation of core resources. The private 256KB L2
cache is moved into the core logic, reducing coherence traffic and providing a scalable
interconnect. In addition, dual on-chip memory controllers and an enhanced SMP
interconnect provide bandwidth to help keep the cores operating at peak capacity and
provide the ability to interconnect up to 32 sockets (or 256 cores) directly with no glue
logic. The large victim L3 cache is split so that each core has its own private section, but each core can also access other cores' L3 sections if needed.
Other key innovations are reflected in the SMP interconnect fabric and associated logical
system topology, to enable improved RAS, virtualization, and dynamic power
management capabilities.
The POWER7 processor³ is based on 64-bit PowerPC architecture and incorporates:
o Eight high frequency processor cores with up to four SMT (Simultaneous Multithreading) threads per core (thirty-two threads per chip)
  - 12 execution units per core: 2 fixed point, 2 load/store, 4 DP floating point, 1 vector, 1 branch, 1 condition register, 1 decimal
  - 6-wide in-order dispatch: two branches and four non-branch instructions; a second branch terminates the group
  - 8-wide out-of-order issue: two load or store ops; two fixed-point ops; two scalar floating-point, two VSX, or two VMX/AltiVec ops (one must be a permute op) or one DFP op; one branch op; one condition register op
  - 4-way SMT per core (SMT4)
  - Aggressive out-of-order execution
o 32KB instruction cache and 32KB data cache per core
o One 128-bit VMX/AltiVec (Vector Multimedia Extension) unit per core
o Four FP units combined into two 128-bit VSX (Vector/Scalar eXtension) units per core
o Private 256KB L2 cache for each processor core
o 4MB Fluid L3 on-chip cache per core, shareable by the eight processor cores
o Up to two integrated memory controllers (one for Power 755)
o Integrated I/O controller
o An integrated SMP coherence and data interconnect switch that enables scalable connectivity for up to 32 POWER7 chips. This allows a large SMP server to re-order coherent storage updates from different threads, exploit the high bandwidth coherence transport structures available in POWER7 chips and sustain high system throughput.
o Support logic for dynamic power management, dynamic configuration and recovery, and system monitoring
  - Dynamically enable and disable up to 4 threads per core
  - Dynamically turn cores on and off
  - Dynamically vary individual core frequencies (can provide up to 10% over standard system frequency)

³ Different versions of the POWER7 processor are used in different POWER7 server systems. In general, POWER7 processors differ from one another in terms of core frequency, number of memory controllers, fabric bus characteristics, etc.
Figure 2-1. POWER7 Chip Architecture

A primary goal for the POWER7 design is to maintain the high operating frequencies
introduced with POWER6 while significantly increasing core count and driving high
performance. The high level design to achieve these goals uses out-of-order execution
with aggressive power management at the chip level in addition to doubling the floating-
point capability per cycle. To improve CPI and ensure efficient use of this design point:
o Changes have been made in the areas of prefetch, predecode, and L1 cache design to keep the pipelines busy, in addition to changes in all areas of the core architecture including instruction dispatch, issue and grouping.
o The POWER7 cache structure is greatly enhanced by reducing the L2 latency to the core by a factor of three compared to POWER6. In addition, the L3 cache is brought on-chip using embedded DRAM, which provides 4 times the bandwidth per core and one-sixth the latency of POWER6.
Figure 2-2. POWER7 Cache Hierarchy Summary

The new design point for the POWER7 chip provides greater than a 4-times speedup
over a POWER6 chip due to improvements in the number of cores and core design, as
well as chip enhancements to the interconnect, storage architecture and integrated
memory controllers. In some cases, customers can further improve performance by using
the new VSX SIMD capabilities of the POWER7 cores.
Figure 2-3 POWER7 Bus Bandwidths per chip

Note: Bandwidths shown are raw peak numbers
The POWER7 Fluid L3 cache structure provides higher performance by:
o Automatically cloning shared data to multiple private regions
o Automatically migrating private footprints (up to 4MB) to the local region (per core) at ~5X lower latency than the full L3 cache
o Keeping multiple footprints at 3X lower latency than local memory
o Allowing a subset of the cores to use the entire L3 cache when some cores are idle
The L2 Turbo cache keeps a tight 256KB working set at extremely low latency (3X lower than the local L3 region) and with high bandwidth, reducing L3 power and boosting performance.
Figure 2-4. POWER7 Hybrid "Fluid" L3 Cache Structure
(Panels: "L3 Cache Before" and "L3 Cache After".)
3 Performance Optimization on POWER7
We cover several general aspects of performance optimization on POWER7-based systems. The first is using the right level of compilers and options, followed by using tuned libraries. We also introduce the reader to the SMT implementation in POWER7. Operating system level tuning for optimal application performance is the last topic of discussion in this chapter.
3.1 Compiler Versions and Options
The minimum compiler versions that explicitly support the POWER7 architecture are:
1. XL C/C++ version 11.1
2. XL Fortran version 13.1
Documentation for XL C/C++ will be found here (AIX) and here (Linux) when released.
Documentation for XL Fortran on AIX and Linux will be found here when released.
Code generated by previous versions of the compilers will run on POWER7; however, the compilers that explicitly support the POWER7 architecture are required to fully exploit POWER7 features such as VSX. Note that any source code that includes VSX intrinsics (see Chapters 9 and 10) will fail to compile with earlier compiler versions.
A reasonable starting point for options to compile POWER7-tuned binaries is:
For C:
xlc -qarch=pwr7 -qtune=pwr7 -O3 -qhot
For Fortran:
xlf -qarch=pwr7 -qtune=pwr7 -O3 -qhot
If using earlier compiler versions, a reasonable option is -qarch=pwr5. Note that binaries created using the compiler option -qarch=pwr6e will not run on POWER7.
Of course, the compilers still support the -O4 and -O5 optimization levels as well as many of the options available in earlier compiler versions. The biggest change is to compiling binaries that support VSX SIMD instructions. See sections 9.4 and 10.1.
3.2 ESSL
ESSL v5.1 includes support for POWER7.
Documentation for ESSL will be found here when released.
Many ESSL routines can automatically use single-precision SIMD algorithms on
POWER7 (and older) systems. The complete list is here. In addition, several ESSL
routines can automatically use double-precision VSX algorithms on POWER7 systems.
These routines will be part of the ESSL 5.1 documentation when released this summer.
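As a quick illustration, a typical AIX link line for a Fortran program that calls ESSL might look like the following. Treat this as a sketch rather than a definitive invocation: the library to link (-lessl for the serial library, -lesslsmp with xlf_r and -qsmp for the SMP version) and any additional paths depend on the local ESSL installation.

# Compile and link a Fortran program against the serial ESSL library,
# tuned for POWER7 (library name and options are assumptions; adjust
# for the local ESSL installation).
xlf -O3 -qarch=pwr7 -qtune=pwr7 -o mysolver mysolver.f -lessl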
3.3 SMT
3.3.1 Introduction
Simultaneous Multi-Threading (SMT) was introduced with the POWER5 processor. SMT is a technology that allows separate instruction streams (threads) to run concurrently on the same physical processor, improving overall throughput in most cases. POWER5 and POWER6 processors implemented 2-way SMT, which allowed two threads to run concurrently on the same core. The POWER7 processor introduces 4-way SMT, which allows 4 threads to execute simultaneously on the same core. SMT4 makes one physical core appear as 4 logical cores to the operating system and applications.
When threads execute on cores they can stall waiting for a resource or data. SMT allows multiple threads to share the core, thus improving the utilization of the core resources in the presence of stalls. POWER7 cores can operate in three threading modes:
o ST - 1 thread executing on the core at any given time
o SMT2 - 2 threads executing on the core concurrently
o SMT4 - 3 or 4 threads executing on the core concurrently
The diagram below depicts pipeline use in the different threading modes. In ST mode
only one thread uses the pipelines. In SMT2 mode threads 0 and 1 share the pipelines. In
SMT4 mode threads 0 and 1 share FX0, LS0, threads 2 and 3 share FX1 and LS1,
threads 0 and 2 share VS0, threads 1 and 3 share VS1 and all threads share the BRX
and CRL pipes.
Figure 3-1 How simultaneous multithreading uses functional units
(The figure shows the FX0, FX1, VS0, VS1, LS0, LS1, BRX and CRL pipelines and which threads use them in ST, SMT2 and SMT4 modes, as described above.)
3.3.2 Hardware Resources
To support multiple contexts executing concurrently, the hardware must provide resources for all the threads. Depending on the resource, it may be replicated for each thread, split between threads, or shared between the threads. In general, the POWER7 register resources are replicated, the microarchitecture queues are split or shared, and the nest is shared. The table below provides details on the resources for the different threading modes in the units of the core.
Table 3-1 Hardware resources for the different threading modes

Item | POWER7-ST | POWER7-SMT2 | POWER7-SMT4

IFU
ICache | Used by single thread | Shared by 2 threads | Shared by 4 threads
I-ERAT | 64-entry, 2-way | Two copies, each 64-entry, 2-way | (64-entry, 2-way shared by 2 threads) x 2; T0/T2 share one, T1/T3 share the other
I-ERAT misses | Two outstanding misses to LSU | Two outstanding misses to LSU | Two outstanding misses to LSU
3 BHTs | Used by single thread | Shared by 2 threads | Shared by 4 threads
BTAC (only in ST mode) | Available | Not used in SMT2 | Not used in SMT4
EAT (tracking instructions) | 14 entries | Partitioned: 14 per thread | Partitioned: 7 per thread
IBUF | 10x4 instr | Partitioned: 10x4 instr/thread | Partitioned: 5x4 instr/thread
Count cache | 128 entries | 128-entry shared by 2 threads | 128-entry shared by 4 threads
Thread priority: fetch BW, dispatch BW | Single thread | Dynamic: keep IBUFs full and increase throughput | Dynamic: keep IBUFs full and increase throughput

LSU
DCache | Single thread | Shared | Shared
D-ERAT | 64 entries | Shared: 64-entry shared by 2 threads | (64-entry shared by 2 threads) x 2; T0/T1 share one 64-entry ERAT, T2/T3 share the other
LRQ / SRQ | Single thread | Dynamically shared | Dynamically shared
LMQ | Single thread | Dynamically shared | Dynamically shared
TLB (hardware managed) | Single thread | Dynamically shared | Dynamically shared
TLB (software managed) | Single thread | Software can partition | Software can partition
Table walk serialization | - | Yes (two concurrent table walks) | Yes (two concurrent table walks)
Hardware prefetch streams | Single thread | Dynamically shared | Dynamically shared
Software prefetch streams | Single thread | Dynamically shared | Dynamically shared
Reservation in L2 (lock) | One | One per thread | One per thread

ISU
Mapper / register renaming | Single thread | Shared by 2 threads | Shared by 4 threads
GPRs | 36 | 36/thread | 36/thread
GPR renames | 77 | 41 shared by 2 threads | 2x41 shared by 2 threads
VSR | 64 | 64/thread | 64/thread
VSR renames | 80 | 80/thread | 46 shared per pair of threads
Issue queue | 48 entries | 48 entries shared by 2 threads | (24 entries shared by 2 threads) x 2
Issue bandwidth | Single thread | Non-blocking: dynamically shared, based on oldest ready | Non-blocking: each half dynamically shared by two threads, based on oldest ready
Balance and dispatch flush | Single thread | Yes; disallows resource hogging by a thread, improves throughput | Yes; disallows resource hogging by a thread, improves throughput
Completion bandwidth | - | 2 groups per cycle, 1 from each thread | 2 groups per cycle, 1 from T0/T2 and 1 from T1/T3

Execution Units
Branch | - | Shared, no time-multiplexing | Shared, no time-multiplexing
FXU | 2 FXU | 2 FXU shared by two threads | (1 FXU shared by 2 threads) x 2
LSU | 2 LSU | 2 LSU shared by two threads | (1 LSU shared by 2 threads) x 2
FPU | 2 FPU | 2 FPU shared by two threads | 2 FPU shared by four threads
VSX | 2 VSX | 2 VSX shared by two threads | 2 VSX shared by four threads
VMX/AltiVec | 1 VMX | 1 VMX shared by two threads | 1 VMX shared by four threads


3.3.3 SMT Control
The SMT mode can be controlled with Linux and AIX commands, as described in sections 4.5 and 5.1. Judiciously binding threads to the correct logical cores can also change the SMT mode, by allowing the OS to switch the SMT mode for the affected hardware cores. Binding is discussed in sections 4.4 (AIX) and 5.4 (Linux). Within the OS each SMT thread is treated as a separate core. This is referred to as a logical core. The actual core is referred to as a physical core. The threads on a physical core have different designations depending on the thread number. In SMT2 or SMT4, thread 0 is referred to as the primary thread and thread 1 is the secondary thread. In SMT4, threads 2 and 3 are referred to as the tertiary threads. The figure below illustrates the designations.
Figure 3-2 Thread designations for SMT2 and SMT4


Note that AIX and Linux currently place new processes and software threads on logical CPUs as they become available; no distinction is made between primary, secondary and tertiary hardware threads. However, both the AIX and Linux schedulers are being enhanced to preferentially use the primary threads in a core when available. AIX and Linux order the logical cores with the threads from physical core 0 first, followed by the threads from physical core 1, and so on. The table below details the standard ordering of logical cores for 8 physical cores. Note that dynamically changing the SMT mode of a system without rebooting can change the order from that shown in Table 3-2 (see section 4.5.2).
Table 3-2 Mapping logical cores to physical cores in ST, SMT2 and SMT4 mode

ST
physical core   0        1        2        3        4        5        6        7
thread          0        0        0        0        0        0        0        0
designation     primary  primary  primary  primary  primary  primary  primary  primary
logical core    0        1        2        3        4        5        6        7

SMT2
physical core   0        0          1        1          2        2          3        3
thread          0        1          0        1          0        1          0        1
designation     primary  secondary  primary  secondary  primary  secondary  primary  secondary
logical core    0        1          2        3          4        5          6        7

SMT2 (continued)
physical core   4        4          5        5          6        6          7        7
thread          0        1          0        1          0        1          0        1
designation     primary  secondary  primary  secondary  primary  secondary  primary  secondary
logical core    8        9          10       11         12       13         14       15

SMT4
physical core   0        0          0         0         1        1          1         1
thread          0        1          2         3         0        1          2         3
designation     primary  secondary  tertiary  tertiary  primary  secondary  tertiary  tertiary
logical core    0        1          2         3         4        5          6         7

SMT4 (continued)
physical core   2        2          2         2         3        3          3         3
thread          0        1          2         3         0        1          2         3
designation     primary  secondary  tertiary  tertiary  primary  secondary  tertiary  tertiary
logical core    8        9          10        11        12       13         14        15

SMT4 (continued)
physical core   4        4          4         4         5        5          5         5
thread          0        1          2         3         0        1          2         3
designation     primary  secondary  tertiary  tertiary  primary  secondary  tertiary  tertiary
logical core    16       17         18        19        20       21         22        23

SMT4 (continued)
physical core   6        6          6         6         7        7          7         7
thread          0        1          2         3         0        1          2         3
designation     primary  secondary  tertiary  tertiary  primary  secondary  tertiary  tertiary
logical core    24       25         26        27        28       29         30        31
If the OS is in either SMT2 or SMT4 mode, it will switch the core to use ST mode if the
secondary and tertiary threads are not used. If it is in SMT4 mode, the OS will switch the
core to SMT2 mode if a tertiary thread is not used. The user can control this by binding
jobs to given logical CPUs. Note that, if a job binds to a tertiary thread, the core will
remain in SMT4 mode even though the primary and secondary threads are not used.
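As a concrete illustration of the default ordering, the following shell fragment computes which physical core and hardware thread a given logical CPU corresponds to when the system uses the SMT4 layout of Table 3-2. It is a minimal sketch that simply inverts that mapping and assumes the ordering has not been changed by dynamic SMT mode switches.

# Map a logical CPU number to its physical core and hardware thread,
# assuming the default SMT4 ordering (4 threads per physical core).
LOGICAL_CPU=13
THREADS_PER_CORE=4
PHYSICAL_CORE=$(( LOGICAL_CPU / THREADS_PER_CORE ))   # 13 / 4 = 3
HW_THREAD=$(( LOGICAL_CPU % THREADS_PER_CORE ))       # 13 % 4 = 1 (secondary)
echo "logical CPU $LOGICAL_CPU -> physical core $PHYSICAL_CORE, thread $HW_THREAD"

For example, binding one task per physical core in SMT4 mode means binding to logical CPUs 0, 4, 8, and so on, i.e. only the primary threads, which also allows the OS to switch those cores back to ST mode.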
3.3.4 Performance Results
The performance gain from SMT will vary depending on the program executing and its
execution model, the threading mode being used on the processor, and the resource
utilization of the program. Sharing the core resources with other threads will impact the throughput of each thread. For a parallel application to see a benefit, the gain from multithreading must be greater than the additional application overhead of increasing the thread count. This section shows performance on a few benchmarks for the various parallel execution models.
The graph below illustrates the gains seen on throughput-type workloads that use all available logical CPUs of a Power 755 system. The SPEC CFP2006 suite is used to compare ST, SMT2, and SMT4 performance. The ST run uses 32 jobs, SMT2 uses 64 jobs and SMT4 uses 128 jobs. The performance (measured as SPECfp_rate) between modes is scaled appropriately to show the net benefit from enabling more SMT threads per core. As can be seen, the result of using SMT2 and SMT4 varies; in this case it ranges from 0.78 to 1.38 times for SMT2 and 0.8 to 1.64 times for SMT4. In some cases the throughput can degrade with the use of SMT. This is generally the result of some resource being fully utilized by a single thread, or the effect of sharing the caches between multiple threads. Although the graph does not show it, another point to make is that the SMT2 gain for POWER7 is lower than that of POWER6. This is due to the POWER7 core switching back to an out-of-order pipeline versus the in-order pipeline of the POWER6 core.
Figure 3-3 SMT2 and SMT4 performance comparison for SPEC CFP2006
(Chart "SPEC FP SMT Comparison": performance relative to ST for each SPEC CFP2006 benchmark, with SMT2 and SMT4 series.)

The graph below shows the SMT gains for an OpenMP workload. The ST run uses 32 threads, SMT2 uses 64 threads and SMT4 uses 128 threads. The SMT2 and SMT4 gains for the NAS Parallel Benchmarks Class B are shown for 1 to 8 cores. Note that this is a strong scaling study: the work remains constant as the number of cores used increases. As can be seen, the results vary. For CG the gain is large, while for some of the workloads the gain is less than one. As the number of cores used increases, the gain becomes smaller due to additional overhead in the program and effects in the hardware. The LU case is an example where the gain with SMT4 at 8 cores goes below 1, which is a typical observation when using SMT4 with 8 cores. In most cases the SMT4 gain is lower than the SMT2 gain. This is expected due to the amount of resource sharing among 4 threads.
Figure 3-4 NAS Parallel Benchmark Class B (OpenMP) SMT Gain
(Chart "NASPB OMP Class B SMT Gain": performance relative to ST for CG, MG, FT, LU, BT and SP at 1, 2, 4 and 8 cores, with SMT2 and SMT4 series.)
The chart below shows the performance gain on an MPI workload for SMT2 and SMT4
modes. The ST run uses 32 tasks, SMT2 uses 64 tasks and SMT4 uses 128 tasks. The
benchmarks shown are part of the NASPB Class C suite. As can be seen, the gains from SMT4 are better for MPI. CG shows unexpected behavior in that SMT2 shows degradation while SMT4 sees a modest gain. This behavior has also been seen on other benchmarks with a high memory demand. In cases such as FT, where there is a large amount of communication, the SMT gain for the compute portion of the workload does not overcome the added communication time of the additional MPI tasks.
Figure 3-5 NAS Parallel Benchmarks Class C (MPI) SMT Gain
(Chart "NASPB MPI Class C": performance relative to ST mode for CG, FT, LU and MG, with SMT2 and SMT4 series.)

For hybrid MPI-OMP workloads, there was no measured performance gain in going from
32 (SMT off) to 64 (SMT2) threads.
In conclusion, SMT gains will vary. Throughput-type workloads offer the best opportunity to see gains from SMT. In cases with high memory traffic, SMT may not deliver the gains expected. In a number of cases SMT offers the ability to get better performance from the same resources. The best way to determine what SMT will offer for your application is to measure. In any case, using SMT is not as important as it was on POWER6.
3.4 Operating Systems and System Configuration
The operating system is responsible for memory management and process scheduling, which can have a critical impact on performance. Both the AIX and Linux operating systems are supported on the Power 755. Some of the relevant features are discussed in the following sections.
3.4.1 Page Size
The POWER7 processor supports page sizes of 4KB, 64KB, 16MB, and 16GB. Page sizes of 16MB and 16GB require reserving, or pinning, memory for support. Generally the 16GB page size is not used in common HPC configurations. Specifying the 64KB or 16MB page size for application data can significantly improve application performance by minimizing TLB misses, and often improves memory streaming performance.
There are several advantages to a 64KB base page size, compared to 4KB:
o The amount of memory that can be accessed without causing a TLB miss (the "TLB reach") is expanded by a factor of 16, from 2MB to 32MB on POWER7. This reduces the TLB miss rate and improves performance, particularly for programs with a working set between 4MB and 64MB. It also reduces D-ERAT misses.
o Prefetching engines in POWER microprocessors stop their work at memory page boundaries, so using larger memory pages allows more efficient and continuous prefetching, which enables memory-bandwidth-bound applications to reach higher performance.
There are also some disadvantages:
o Files are cached in kernel memory in units of the base page size. That is, a one-byte file will take up one page of kernel memory when it is cached in the kernel page cache, which happens whenever the file contents are being accessed. Thus, if the workload involves access to a large number of small files (tens of KB or smaller), the 64KB page size will result in more memory being wasted in the kernel page cache than a 4KB page size would. This wastage is sometimes referred to as "internal fragmentation".
o Programs that assume that the page size is 4KB may behave incorrectly. Sometimes the assumption is implicit rather than explicit. For example, a program may request a 12KB stack for a newly created thread and then fail when it finds that the thread has been given a 64KB stack rather than 4KB. Fortunately this has not proved to be a major problem in practice.
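A quick way to see which page sizes a node actually supports is to query the operating system; a minimal sketch (output will vary with the system configuration):

# AIX: list all page sizes supported by the system and hardware.
pagesize -a

# Linux: show the base page size and the configured huge page size.
getconf PAGESIZE
grep Hugepagesize /proc/meminfo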
3.4.2 Memory Affinity
The Power 755 system is an SMP architecture with an all-to-all interconnect across the four sockets of the system (see Figure B-1 in Appendix B).
However, processes have faster access to memory associated with the memory controller local to the chip they are running on. Higher latency and lower bandwidth should be expected if a memory reference is to a remote memory location.
There are four memory domains on the Power 755, each local to a specific chip. The
ratio of local to remote memory bandwidth can be greater than a factor of two, so it can
be advantageous to ensure that an application always accesses local memory as much
as possible.
All modern operating systems have some mechanism to control page placement. Usually
the default memory allocation policy is to allocate pages in a round robin or random
manner across all memory domains on a system. Both AIX and Linux have options that
attempt to allocate memory pages exclusively on the memory controller where a
process/thread is currently running, typically referred to as memory affinity.
The relevant settings are covered in sections 4.2 for AIX and 5.3 for Linux.
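As a preview of those settings, a minimal sketch of requesting local memory allocation on each operating system is shown below; the details and defaults are covered in sections 4.2 and 5.3.

# AIX: request that memory be allocated local to the chip where the
# process or thread runs (see section 4.2).
export MEMORY_AFFINITY=MCM

# Linux: run a program with a local-allocation NUMA policy using
# numactl (see section 5.3).
numactl --localalloc ./myapp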
3.4.3 Process Binding
The operating system is responsible for process scheduling, and in the course of a long-running job it can reschedule a process on different cores of a multiprocessor system. Rescheduling a process may incur a performance degradation due to the need to reload the L2 and L3 caches, and a loss of memory affinity if the process is rescheduled on another chip.
Both Linux and AIX have features that allow the user to bind, or affinitize, a process or thread to a specific core or even a set of cores.
The relevant settings are covered in sections 4.3 for AIX and 5.4 for Linux.
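As a simple preview of binding (the full set of options is covered in sections 4.3, 4.4 and 5.4), the sketch below pins a process to a single logical CPU on each operating system; CPU 0 and the PID are just placeholders.

# Linux: start a program bound to logical CPU 0.
taskset -c 0 ./myapp

# AIX: bind an already running process (PID 12345) to logical CPU 0.
bindprocessor 12345 0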
3.4.4 Hardware Prefetch Control
The POWER architecture supports the dcbt instruction to prefetch a cache line for a load
into the level 1 cache, and the dcbtst instruction that touches a cache line in the level 3
cache in anticipation of a store to the cache line. These prefetch instructions can be
used to mask latencies of memory requests to the memory controller.
The POWER7 chip has the ability to recognize streaming memory access patterns with a unit stride or a stride of N, and to initiate dcbt or dcbtst prefetch operations automatically. How aggressively the hardware prefetches, i.e. how many cache lines are prefetched for a given reference, is controlled by the Data Streams Control Register (DSCR).
The DSCR was introduced in the POWER6 architecture. It is a 64-bit register, but only the last 5 bits control the features of the hardware prefetch mechanism. A description of the DSCR register settings with example bit masks is shown below.

+----------------------------------------+------+-----+------+
| ///                                    | SNSE | SSE | DPFD |
+----------------------------------------+------+-----+------+
  0                                        59     60    61:63

Bit(s)  Name   Description
59      SNSE   Stride-N Stream Enable (POWER7 and up)
               Bitmask DSCR_SNSE = 16

60      SSE    Store Stream Enable
               SSE enables hardware detection and initiation of store streams.
               Bitmask DSCR_SSE = 8

61:63   DPFD   Default Prefetch Depth
               Depth value applied for hardware-detected streams and
               software-defined streams. Values and their meanings are as follows:
               Bitmask           Value  Description
               DPFD_DEFAULT      0      Use system default
               DPFD_NONE         1      None, disable prefetch
               DPFD_SHALLOWEST   2      Shallowest
               DPFD_SHALLOW      3      Shallow
               DPFD_MEDIUM       4      Medium
               DPFD_DEEP         5      Deep
               DPFD_DEEPER       6      Deeper
               DPFD_DEEPEST      7      Deepest

Some examples of setting the DSCR register to enable specific features are shown below, with the decimal and hexadecimal values.

Medium prefetch depth, no store streams, no stride-N prefetch:
DSCR = DPFD_MEDIUM = 4 (0x4)

Medium prefetch depth, enable store streams, no stride-N prefetch:
DSCR = DSCR_SSE | DPFD_MEDIUM = 12 (0xC)

Medium prefetch depth, enable store streams, and stride-N prefetch:
DSCR = DSCR_SNSE | DSCR_SSE | DPFD_MEDIUM = 28 (0x1C)
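The example values above can be derived mechanically from the bit masks; a small shell sketch of the arithmetic (the variable names simply mirror the bitmask names in the description above):

# Combine the DSCR bit fields into a single register value.
DSCR_SNSE=16    # bit 59: stride-N stream enable
DSCR_SSE=8      # bit 60: store stream enable
DPFD_MEDIUM=4   # bits 61:63: medium prefetch depth
echo $(( DSCR_SNSE | DSCR_SSE | DPFD_MEDIUM ))   # prints 28 (0x1C)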
In general, most HPC applications will benefit from the most aggressive prefetch depth and from enabling the store stream and stride-N features. However, there are cases, such as code segments with irregular memory accesses, where aggressive prefetching can inhibit performance because the prefetched lines may never be referenced, so prefetching only adds bandwidth demand. In such cases the user may want to change the DSCR value. Since the DSCR is a protected register, an operating system interface is required to change it.
The relevant settings are covered in sections 4.6 for AIX and 5.5 for Linux.
4 AIX
AIX, short for Advanced Interactive eXecutive, was introduced in January 1986, and provided IBM with an entrée to the world of Ethernet, TCP/IP and open systems and standards. Version 1 was implemented for IBM mainframes, and was based on AT&T UNIX System V Release 3 and 4.3 BSD. A version of AIX for the PC/RT, AIX/RT-2, was released in 1986. In February 1990 IBM introduced AIX version 3.0, along with the POWER processor. Since then AIX has been a robust operating system for POWER-based computer systems.
The following versions of AIX support Power 755 systems:
o AIX 5.3 with the 5300-11 Technology Level and Service Pack 2, or later, available in March 2010
o AIX 5.3 with the 5300-10 Technology Level and Service Pack 4, or later, available May 28, 2010
o AIX 5.3 with the 5300-09 Technology Level and Service Pack 7, or later, available May 28, 2010
o AIX 6.1 with the 6100-04 Technology Level and Service Pack 3, or later
o AIX 6.1 with the 6100-03 Technology Level and Service Pack 5, or later, available June 25, 2010
o AIX 6.1 with the 6100-02 Technology Level and Service Pack 8, or later, available June 25, 2010
Note that the Power 755 only runs AIX 6.1 6100-04 (61H) SP2 when it is first generally
available.
Since Linux is also a widely used operating system, there is a suite of libraries and commands that facilitates development of applications for both AIX and Linux. The AIX Toolbox for Linux Applications contains a collection of open source and GNU software built for AIX on IBM systems. These tools provide the basis of the development environment of choice for many Linux application developers. More information can be found in the AIX Toolbox for Linux Applications.
The following sections summarize features of the AIX operating system that can be utilized to improve the performance of HPC applications.
Detailed information on AIX can be found at the IBM System p and AIX Information Center (http://publib16.boulder.ibm.com/pseries/index.htm).
Specific information on performance tuning for AIX 6.1 can be obtained in the Performance Tools Guide and Reference.
4.1 Page Modes
The default page size for all application memory segments is 4KB. There are multiple
ways to specify the page mode to be used for an application.
4.1.1 Linker Options
The link options -bdatapsize=<size>, -bstackpsize=<size>, and -btextpsize=<size> can be used to specify the page size to use for the data, stack, and text segments respectively. For example, the linker/compile line to specify that 64KB pages should be used for all segments would look like
ld -o myexe -bdatapsize=64K -bstackpsize=64K -btextpsize=64K
More details on page modes can be found in the Guide to Multiple Page Size Support on
AIX.
4.1.2 Using ldedit
The ldedit command can also be used to change the default page size settings for a
binary. This is convenient since it does not require relinking object files; it modifies the
binary executable directly. The equivalent ldedit command for the ld example above is
ldedit -bdatapsize=64K -bstackpsize=64K -btextpsize=64K myexe
To switch back to default (4KB) pages, use
ldedit -bdatapsize=4K -bstackpsize=4K -btextpsize=4K myexe
The dump command can be used to show the page size information for the myexe executable, e.g.
dump -Xany -ov <myexe>
will show the text, data and stack page sizes.
The linker option -blpdata can be used to specify that 16MB (large) pages should be used for the data section, but it is recommended that the -bdatapsize=16M option be used instead, as the -blpdata option may be deprecated in the future. Note that the stack and text regions can only use 4KB or 64KB pages, e.g.
ldedit -bdatapsize=16M -bstackpsize=64K -btextpsize=64K myexe
4.1.3 Using the LDR_CNTRL Environment Variable
Memory page sizes can also be specified using the LDR_CNTRL environment variable. This variable has similar functionality to ldedit, but it applies to all subsequent commands in a script, so it must be used carefully (it can waste resources if a user runs cp or a bash shell in 16MB pages). An example setting that specifies 64KB pages for all memory regions is shown below.
In Korn or Bourne shell:
export LDR_CNTRL=DATAPSIZE=64K@STACKPSIZE=64K@TEXTPSIZE=64K
In C shell:
setenv LDR_CNTRL DATAPSIZE=64K@STACKPSIZE=64K@TEXTPSIZE=64K
4.1.4 Configuring a System for Large Page Support
Sometimes, an HPC application will benefit from an even larger page size. 16MB pages
are also available in AIX. In this case, the system also needs large pages allocated,
which is done with the vmo command (this can only be used as root).
vmo [-r] -o lgpg_regions=300 -o lgpg_size=16777216
Note that the memory specified for large pages is pinned and is not available for data regions that use 4KB or 64KB pages. Issuing the above command without -r creates 300 large pages (4.8 GB) dynamically. This is typically not recommended for HPC systems; it is suggested to use the -r option so that the large page region is created at the next reboot. In this case, bosboot -a should be run before rebooting the system.
Choose the number of large pages wisely. Any binary not using large pages will be unable to access that memory, and any binary that needs more large-page memory than was allocated will fall back to 4KB pages. Allocating too many large pages can therefore reduce the memory available to most applications and increase swapping, while allocating too few can limit the benefit of using 16MB pages in the first place.
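As a rough sizing sketch (the 32 GB figure below is purely an illustrative assumption): if about 32 GB of application data is expected to live in 16MB pages, the number of regions would be 32 GB / 16 MB = 2048, which could be configured for the next reboot with
vmo -r -o lgpg_regions=2048 -o lgpg_size=16777216
bosboot -a
followed by a reboot of the partition.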
4.1.5 Querying Page Use
The vmstat command can be used to determine page use activity for each page size configured on the system. The -p all or -P all options display paging activity for each page size. Below are some examples:
# vmstat -p all

System configuration: lcpu=16 mem=128000MB

 kthr    memory              page              faults        cpu
 ----- ----------- ------------------------ ------------ -----------
  r  b      avm      fre  re  pi  po  fr  sr  cy  in   sy   cs us sy id wa
  3  1  1947834 11808993   0   0   0   0   0   0  12 1389 2219  1  0 99  0

   psz      avm      fre  re  pi  po  fr  sr  cy      siz
    4K  1287131  4942320   0   0   0   0   0   0  8463408
   64K    41294   429167   0   0   0   0   0   0   470461
   16M        0     4096   0   0   0   0   0   0     4096


# vmstat -P all

System configuration: mem=128000MB

 pgsz            memory                           page
 ----- -------------------------- ------------------------------------
            siz      avm      fre    re    pi    po    fr    sr    cy
    4K  8463408  1287097  4942194     0     0     0     0     0     0
   64K   470461    40820   429641     0     0     0     0     0     0
   16M     4096        0     4096     0     0     0     0     0     0

Page usage information can also be displayed for a specific process using the -Z and -T options of the ps command. The following example shows that process 90250 is using 64KB pages for the data segment (DPGSZ) and 4KB pages for the stack, text, and shared memory segments (SPGSZ, TPGSZ, and SHMPGSZ respectively).
# ps -Z -T 90250
   PID   TTY  TIME  DPGSZ SPGSZ TPGSZ SHMPGSZ CMD
 90250     - 14:02    64K    4K    4K      4K myexe
4.1.6 4KB & 64KB Page Pools
Both 4KB and 64KB pages share the same page pool. One can think of 64KB pages as a
collection of 4KB pages and the OS monitors how many of each page size are actually
allocated. If an application needs more 64KB pages than are already defined, AIX will
coalesce unused 4KB pages to create more 64KB pages. This coalescing operation can
impact performance.
If it is known that most applications run on a system will utilize 64KB pages, running a
program that attempts to touch as many 64KB pages as possible can help minimize the
performance impact of the 64KB page creation process. A sample program called
scramble is provided in Appendix I.
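The following is a minimal sketch of the idea behind such a page-touching program; it is not the scramble program from Appendix I, and the 4 GB region size is only an illustrative assumption that should be adjusted to the memory jobs are expected to use.

/* touch64k.c - pre-fault a large region so that AIX creates 64KB pages up front.
 * Illustrative sketch only; see the scramble program in Appendix I.
 * Build and mark for 64KB data pages, e.g.:
 *   xlc -q64 -o touch64k touch64k.c
 *   ldedit -bdatapsize=64K -bstackpsize=64K touch64k
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const unsigned long long region = 4ULL * 1024 * 1024 * 1024;  /* 4 GB, illustrative */
    const unsigned long long stride = 64 * 1024;                  /* one write per 64KB page */
    unsigned long long off;
    char *buf = malloc(region);

    if (buf == NULL) {
        perror("malloc");
        return 1;
    }
    /* Writing one byte per 64KB page instantiates the pages now, so later
     * jobs do not pay the 4KB-to-64KB coalescing cost. */
    for (off = 0; off < region; off += stride)
        buf[off] = 1;
    printf("touched %llu MB\n", region >> 20);
    free(buf);
    return 0;
}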
4.2 Memory Affinity Settings
As mentioned in section 3.4.2, memory accesses on a POWER7 system cover four
memory domains. For HPC workloads, the typical way to use a system is to allow at most
one software thread or process to use an available hardware thread (e.g. by using the
appropriate settings in LoadLeveler). The most effective way to utilize the available
memory bandwidth for any particular application is to use process binding and memory
affinity.
The default memory affinity settings for different memory segments are specified by the
various vmo memplace* options. In particular the memplace_data option specifies the
default for the user data segment. A default value of memplace_data=2 specifies a
round-robin page allocation. A value of memplace_data=1 would specify a first-touch or
affinitized page placement setting. It is usually not recommended to change the default
memplace_data options since many of the system daemons are not bound and memory
affinity may affect system performance.
Setting the MEMORY_AFFINITY environment variable specifies that all subsequent commands will allocate memory with the first-touch or affinitized memory allocation policy:
export MEMORY_AFFINITY=MCM
Further control can be provided through the execrset command, which is similar to
numactl on Linux.
4.3 Process and Thread Binding
If memory affinity is important for optimal performance one has to also ensure process
and thread affinity. Processes and threads can be bound to a specific processor using
the bindprocessor command or subroutine (see the AIX Inforcenter for more information).
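As a minimal illustration of the subroutine form (the resource set facilities described below are generally preferred), a process could bind itself roughly as follows; the logical CPU number 4 is an arbitrary assumption, and the CPU would normally be chosen from those reported by bindprocessor -q.

/* Sketch: bind the calling process to one logical CPU using bindprocessor(). */
#include <stdio.h>
#include <unistd.h>          /* getpid() */
#include <sys/processor.h>   /* bindprocessor(), BINDPROCESS */

int main(void)
{
    int cpu = 4;  /* illustrative logical CPU id */

    /* Bind this process (and any threads it creates later) to logical CPU 4. */
    if (bindprocessor(BINDPROCESS, getpid(), cpu) != 0) {
        perror("bindprocessor");
        return 1;
    }
    /* ... run the compute work here ... */
    return 0;
}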
AIX also includes a feature called resource sets that can be used to specify that a
process or thread can run on a set of resources, such as a set of processors, or even
specific memory domains. A detailed discussion of resource sets is beyond the scope of
this document but information on commands such as execrset and the ra_attachrset
routine can be found in the AIX Info Center documentation.
Note that an AIX thread will inherit the binding information from its parent process. By
specifying binding for a process, all threads will also be bound to the same processor,
which may not be desirable if parallelism is exploited using threads.
The XL compilers' SMP runtime library supports thread binding through the XLSMPOPTS environment variable. This variable supports two related options: startproc, which specifies the starting logical processor number for binding, and stride, which is the processor increment between successive threads.
For example, the following environment variable setting would bind the four OpenMP
threads of myexe to logical processors 0, 1, 2, & 3.
export OMP_NUM_THREADS=4
export XLSMPOPTS="startproc=0:stride=1"
myexe
Setting stride to 2 would bind the threads to logical processors 0, 2, 4, & 6.
As with many common multi-chip multi-core architectures, process and thread binding
can be very complicated. The compiler SMP runtime and features of LoadLeveler
facilitate specifying process and thread binding for parallel applications written with OpenMP or MPI. A summary of the LoadLeveler affinity features can be found in
sections 6.2 and 6.4.
4.4 Task Binding
Under AIX, all MPI tasks are processes, with their own process IDs. Binding them to
logical CPUs is critical to reproducibly measuring performance and getting the best
performance possible. There are two command-line utilities in AIX to do this binding,
bindprocessor and execrset. Several other utilities, of which launch is an example,
are also available. For our purposes (binding MPI tasks), these can be considered
equivalent.
Note that all of these task binding methods will also handle serial jobs as a special case.
4.4.1 Using bindprocessor
An example script, petaskbind.sh (see Appendix C), shows the explicit mechanism for binding AIX tasks (started by poe) with bindprocessor.
The following script will execute an 8-way job distributed evenly across the target set of logical CPUs.
export MP_PROCS=8
export PEBND_PE_STRIDE=-1
export PEBND_PE_START=0
/usr/bin/poe ./petaskbind.sh ./myexe < ./myinp > ./myout
4.4.2 Using rsets (execrset)
Another example script, petaskbind-rset.sh (see Appendix D), shows the explicit mechanism for binding AIX tasks (started by poe) with attachrset/execrset.
The following script will execute an 8-way job distributed evenly across the target set of logical CPUs.
export MP_PROCS=8
export PEBND_PE_STRIDE=-1
export PEBND_PE_START=0
/usr/bin/poe ./petaskbind-rset.sh ./myexe < ./myinp > ./myout
4.4.3 Using launch
The launch utility associates AIX processes with specific logical CPUs. The launch utility
is available by request. Contact IBM Technical support.
In addition to the normal poe environment variables and commands, specify one of two environment variables to tell launch what logical CPUs to bind to:
Set and export TARGET_CPU=<#> to bind a serial executable to a logical CPU.
Set and export TARGET_CPU_RANGE=<# [#] ...> to bind tasks to a specific list of logical CPUs on each node used.
The following script fragment will execute a 32-way job distributed evenly across the target machines in the poe hostfile:
export MP_PROCS=32
export TARGET_CPU_RANGE=-1
/usr/bin/poe ./launch ./myexe < ./myinp > ./myout
And the script fragment
export TARGET_CPU_RANGE="0 4 8 12"
/usr/bin/poe ./launch ./myexe < ./myinp > ./myout
will execute an MPI job distributed on CPUs 0,4,8,12 of every node being used. In this
case, MP_PROCS must be a multiple of 4 and the poe hostfile must have enough entries
to account for all tasks across all nodes.
4.4.4 Mixed (Thread and Task) Binding
This is commonly referred to as hybrid binding because it applies to hybrid applications
which mix thread-level (including OpenMP) and MPI-style parallelism. hybrid_launch,
another binding utility, is available to bind these sorts of applications under AIX. See
http://benchmarks.pbm.ihost.com/technical/techhottips/AIX/binding/ or contact IBM
Technical support.
The following script fragment will execute a 32-way (8 tasks of 4 threads each) job
distributed evenly across the target machines in the poe hostfile:
export MP_PROCS=8
export OMP_NUM_THREADS=4
export TARGET_CPU_RANGE=-1
/usr/bin/poe ./hybrid_launch ./myexe < ./myinp > ./myout

4.5 SMT Control
The basics of SMT are described in section 3.3, and the logical processor mapping in section 3.3.3. AIX provides commands to query the SMT configuration of the system and the logical processor mapping.
4.5.1 Querying SMT Configuration
The lparstat command can be used to query the state of the SMT configuration. A sample command output is shown below for a system configured in SMT 2 mode. The smt field can be Off, On, or 4, corresponding to ST mode, SMT 2, and SMT 4 respectively.
# lparstat

System configuration: type=Dedicated mode=Capped smt=On lcpu=64 mem=128000MB

%user  %sys  %wait  %idle  %nsp
-----  ----  -----  -----  ----
  0.9   0.1    0.0   99.1    84
Often it is necessary, for process binding, to know the logical processor IDs of the primary, secondary, and tertiary threads. The -s option of the bindprocessor command can be used to determine the logical processor numbers corresponding to the primary, secondary, and tertiary threads as described below:
Primary threads: bindprocessor -s 0
Secondary threads: bindprocessor -s 1
Primary's tertiary threads: bindprocessor -s 2
Secondary's tertiary threads: bindprocessor -s 3
4.5.2 Specifying the SMT Configuration
The system administrator can change the SMT configuration of the system with the smtctl command.
The suggested form of the smtctl command to use on a Power 755 system is
smtctl [-t #SMT [-w boot | now]]
where #SMT is 1, 2, or 4 for ST mode (SMT1), SMT 2, or SMT 4 mode respectively.
The SMT state can be changed dynamically (the -w now option). However, this will likely
change the logical processor to physical processor mapping. Hence, it is suggested to set the SMT mode for the next boot (-w boot), run bosboot -a, and then reboot the partition.
Running the smtctl command with no options displays the logical processor to physical core mapping. This may be useful for diagnosing affinity-related issues.
4.6 Hardware Prefetch Control
Hardware prefetch and the DSCR register are described in section 3.4.4. AIX provides interfaces for accessing and setting the DSCR register at the system level and from a user application.
The dscrctl command can be used to query and set the system-wide DSCR value:
# dscrctl -q
Current DSCR settings:
Data Streams Version = V2.06
number_of_streams = 16
platform_default_pd = 0x5 (DPFD_DEEP)
os_default_pd = 0x0 (DPFD_DEFAULT)

The platform_default_pd value defines the default value of the DSCR specified in the firmware, usually 0x5. The os_default_pd value is the current value and can be overridden. Note that a value of 0 means the platform default value should be used.
A system administrator can change the system-wide value using the dscrctl command:
dscrctl [-n | -b] -s <value>
The -n option changes the value immediately, while the -b option changes the value at the next reboot.
The system-wide DSCR setting will be the default value used by all applications running on the system. A user can change the DSCR value used by a process with the dscr_ctl system call. A full description of this call is beyond the scope of this document; however, a program that provides an example of using the dscr_ctl API is provided in Appendix H.
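As a rough, assumption-laden sketch only (the authoritative example is the Appendix H program), a process might raise its own prefetch depth along the following lines. The header name, the DSCR_WRITE operation, and the argument types are assumptions that should be verified against the dscr_ctl documentation and Appendix H.

/* Sketch: set this process's DSCR via the dscr_ctl() system call (verify against Appendix H). */
#include <stdio.h>
#include <sys/machine.h>   /* assumed header for dscr_ctl() and DSCR_WRITE */

int main(void)
{
    long long dscr = 0x7;  /* deepest prefetch depth; store stream / stride-N bits left off */

    /* DSCR_WRITE is assumed to take a pointer to the new 64-bit DSCR value. */
    if (dscr_ctl(DSCR_WRITE, &dscr, sizeof(dscr)) != 0) {
        perror("dscr_ctl");
        return 1;
    }
    /* ... run the memory-intensive kernel with the new prefetch setting ... */
    return 0;
}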

5 Linux
The only supported Linux distributions on POWER7 are Red Hat Enterprise Linux 5 (RHEL5) and SUSE Linux Enterprise Server 11 (SLES11).
5.1 SMT Management
On Linux, the default behavior is to use SMT: when you install Linux on a system and log in, you will see a multiple of the number of physical cores present in the system. On POWER6 systems there are 2 logical CPUs per physical core, while on POWER7 there can be 2 or 4.
SMT can be managed in three ways:
at the yaboot "boot:" prompt, as a kernel command line option:
o boot: linux smt-enabled=off
by offlining the threads after boot with:
for i in `seq 1 2 127`
do
echo "0" > /sys/devices/system/cpu/cpu$i/online
done
and replacing the "0" with a "1" to turn the threads back on.
by changing the SMT configuration of the system with the ppc64_cpu command. The suggested form of the ppc64_cpu command to use on a Power 755 system is
ppc64_cpu [--smt={on|2|4}]
The best use of SMT is achieved by binding processes or threads. This is covered in more detail in section 5.4.
5.1.1 Linux SMT Performance
Figure 5-1 shows the impact of using SMT2 and SMT4 on a financial risk analysis kernel
which benefits from SMT (on top of vectorization). This kernel is written in C and
parallelized with OpenMP. The data are for one core running 1, 2 or 4 OpenMP threads.
The best performing case runs at about 15 GFLOPS with 2 threads (SMT2 mode).
Figure 5-1 SMT effect on an HPC kernel (financial kernel)

                             ST     SMT2    SMT4
total elapsed time (s)       41.3   35.3    36.9
relative perf improvement     0%     15%     11%

5.2 Memory Pages
The Linux virtual memory subsystem currently supports only two page sizes. On
POWER7 systems, supported Linux distributions use a base page size of 64KB. In
addition, Linux "huge" pages of 16MB (known as "large pages" on AIX) can be made
available. See section 3.4.1 for a general discussion of page sizes.
A related Linux community project on SourceForge called libhugetlbfs provides transparent access to the huge pages for compiled executables. This approach is still limited by the resource availability of the 16MB pages and the administrative burden of allocating a sufficient number of huge pages for the application.
Assuming libhugetlbfs is installed, the HugePage strings can be found in /proc/meminfo:
# cat /proc/meminfo | grep HugePage
HugePages_Total: 0
HugePages_Free: 0
In this case, huge pages can be allocated as follows:
# echo 0 > /proc/sys/vm/nr_hugepages
# echo 30 > /proc/sys/vm/nr_hugepages
Looking at /proc/meminfo again confirms the pages are allocated:
# cat /proc/meminfo | grep HugePage
HugePages_Total: 30
HugePages_Free: 30
See the "libhuge short and simple" article on IBM developerWorks for more details.
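Once huge pages have been allocated, a common way to let an unmodified binary use them is to preload the library and back the heap with huge pages; a hedged example (the executable name is illustrative, and the exact variable names should be checked against the libhugetlbfs documentation):
LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes ./myexe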
5.3 Memory Affinity
5.3.1 Introduction
By default, Linux enables memory affinity. When a process is scheduled to start on a given core, its data is written into memory close to that core on first use (usually during initialization), and this memory location remains the same until the process terminates. So if the scheduler later decides to move the process or thread to another core, the time to access
this memory location will change. This is why it is recommended that most HPC
applications bind processes and threads on Linux.
5.3.2 Using numactl
While Linux enables memory affinity by default, it can be useful to manage this behavior manually. The numactl command is one way to do so.
To show the NUMA nodes (memory pools) defined on a Power 755 system:
numactl --hardware
Example command output on a Power 755 with 32 cores (ST mode) and 256GB of total memory:
# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 4 8 12 16 20 24 28
node 0 size: 60928 MB
node 0 free: 59797 MB
node 1 cpus: 32 36 40 44 48 52 56 60
node 1 size: 64768 MB
node 1 free: 64412 MB
node 2 cpus: 64 68 72 76 80 84 88 92
node 2 size: 64768 MB
node 2 free: 64433 MB
node 3 cpus: 96 100 104 108 112 116 120 124
node 3 size: 64768 MB
node 3 free: 64470 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

To use round-robin memory allocation across all memory pools:
numactl --interleave=all <exe>
(This is useful for OpenMP applications when data allocation is done by a master thread and the source code cannot be modified.)
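numactl can also constrain both the CPUs and the memory of a process to a single domain; a hedged example, with node 0 chosen arbitrarily:
numactl --cpunodebind=0 --membind=0 <exe>
runs <exe> only on the CPUs of node 0 and allocates its memory only from node 0's pool.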
For more information, see man numactl.
5.3.3 Linux Memory Affinity Performance
Here we compare the effects of process binding, memory affinity, and different memory page sizes on the STREAM triad kernel benchmark, and we also compare AIX and Linux behavior:
Figure 5-2 Effects of process binding, memory page size and memory affinity on STREAM triad

STREAM TRIAD on POWER7 HV32 @ 3.3GHz (AIX 6.1), MB/s:
  ST, SP, no ma, no bind:          56428
  ST, SP, no ma, bind:             56746
  ST, SP, ma, no bind:             94922
  ST, SP, ma, bind:               120242
  ST, MP, ma, bind:               119177

STREAM TRIAD on POWER7 HV32 @ 3.3GHz (SLES11), MB/s:
  ST, MP, no ma, no bind:          16157
  ST, MP, ma, no bind:             33700
  ST (32 cores), MP, ma, bind:    117292
  ST (16 cores), MP, ma, bind:    121615
  SMT2, MP, ma, bind:             113245
  SMT4, MP, ma, bind:             109870
ST, SMT2 and SMT4 refer to the different SMT modes. SP and MP refer to small (4KB)
and medium (64KB) page sizes. ma refers to memory affinity on (no ma is off). bind
and no bind designate whether CPU binding was used.
On a Linux SLES11 system, 64KB memory pages are enabled by default.
The same performance is available with both AIX and SLES11, but the behavior is slightly different:
o AIX: the best performance is obtained with Small Pages.
o Linux: the best performance is obtained with Medium Pages (no Small Pages are available) when using 16 out of 32 cores.
o Linux: SMT does not provide a performance improvement (with either 2 or 4 threads).
5.4 Process and Thread Binding
5.4.1 taskset
This command is included in Linux and can be used to manage process binding. For instance, using:
taskset -c 0-2,5 <exe>
will force the process <exe> to run on logical CPUs 0, 1, 2, and 5.
5.4.2 numactl
This command is available in Linux and can be used similarly to taskset, with a slight syntax change. Following the example above, using:
numactl -C 0-2,5 <exe>
will force the process <exe> to run on logical CPUs 0, 1, 2, and 5.
5.4.3 Compiler Environment Variables for OpenMP Threads
For a code compiled using IBM compilers, OpenMP threads can be bound using the
XLSMPOPTS environment variable from the IBM compiler runtime environment.
For instance:
export OMP_NUM_THREADS=4
export XLSMPOPTS=startproc=0:stride=2
binds 4 OpenMP threads starting at logical CPU 0 with a stride of 2; the threads are bound to CPUs 0, 2, 4, and 6.
Full details on the IBM implementation of XLSMPOPTS are found here.
When using GNU compilers, you can bind OpenMP threads using the following variable:
GOMP_CPU_AFFINITY
For example,
export GOMP_CPU_AFFINITY="0 3 1-2 4-15:2"
will bind the initial thread to CPU 0, the second to CPU 3, the third to CPU 1, the fourth to CPU 2, the fifth to CPU 4, the sixth through tenth to CPUs 6, 8, 10, 12, and 14 respectively, and then start assigning again from the beginning of the list.
export GOMP_CPU_AFFINITY=0
binds all threads to CPU 0.
5.4.4 LoadLeveler
The Tivoli Workload Scheduler LoadLeveler is the preferred solution to bind processes in
a cluster environment. It is used in the same way under Linux as it is with AIX. See
section 5.8.3 and section 6.2.
Hardware Prefetch Control
Hardware prefetch and the DSCR register are described in section 3.4.4. Linux provides an interface for accessing and setting the DSCR register at the system level, but (as of the distributions noted below) not from within a user
application.
The ppc64_cpu command can be used to query and set the system-wide DSCR value:
# ppc64_cpu --dscr

Current DSCR settings:
Data Streams Version = V2.06
number_of_streams = 16
platform_default_pd = 0x5 (DPFD_DEEP)
os_default_pd = 0x0 (DPFD_DEFAULT)

The platform_default_pd value defines the default value of the DSCR specified in the
firmware, usually 0x5. The os_default_pd value is the current value, and can be
overridden. Note a value of 0 means that the platform default value should be used.
A system administrator can change the system-wide value using the ppc64_cpu command:
ppc64_cpu --dscr=<value>
The system wide DSCR setting will be the default value used by all applications running
on the system. As of SLES 11 and RHEL 5.4, there is no way for a user to control the
DSCR setting from within a program.
5.5 Compilers
When compiling on POWER7 systems, there is a choice between two solutions:
the standard compilers of the Linux world: the GNU compilers gcc, g++, and gfortran (version 4.5 or later)
the IBM compilers: VAC (version 11.1), VACPP (version 11.1), and XLF (version 13.1)
GCC offers robust portability of code intended for compilation on Linux, while the IBM XL compilers offer a substantial performance increase over GCC when higher levels of optimization are used.
While IBM recommends its own compilers to take full advantage of IBM hardware features and IBM software, the use of the GNU tool chain is sometimes mandatory, or the easier choice when performance is not a concern (small utilities, etc.).
The following sections cover the main guidelines for obtaining an optimized executable
on Linux on POWER with either solution.
5.5.1 GNU Compilers
5.5.1.1 Porting
In general, porting code with GCC compilers should be straightforward. In most cases it's
a simple recompile and is as easy as typing the make command. Architectures may vary,
and occasionally library version discrepancies may arise. But for the most part it doesn't
matter which architecture it runs on.
On all architectures, libraries must be compiled with -fPIC; x86 has this flag applied by default with GCC. On POWER, this flag specifies that the generated code will be used in a shared object.
5.5.1.2 Recommended compiler options
When you want or need to use GNU compilers to build your application on POWER7
systems, the following tips will improve performance:
Start with the latest versions of GNU compilers
Be aware of default 32-bit or 64-bit modes
-mcpu and -mtune compiler options
Consider using the Advance Toolchain
-Bsymbolic
Consider libhugetlbfs for run-time improvements
Consider alternative malloc routines
All these points are covered in detail in the hpccentral wiki.
The latest (currently 4.5) beta version of the GNU compiler introduces POWER7 support and optimization through specific compiler flags:
-mcpu=power7 -mtune=power7
-maltivec
-mvsx
These flags have not been tested yet or compared to alternatives that would still produce binaries able to run on a POWER7 system (e.g. -mcpu=power5).
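Putting these flags together, a possible compile line for a POWER7-tuned binary (untested, per the note above; the file names are illustrative) would be:
gcc -O3 -mcpu=power7 -mtune=power7 -maltivec -mvsx -o myexe myprog.c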
5.5.2 IBM Compilers
5.5.2.1 Porting
IBM XL compilers offer a high-performance alternative to GCC as well as a number of
additional features. Porting applications natively developed using the GNU compiler is
usually straightforward.
Fortran
Porting a code from GNU Fortran to IBM XLF is often a simple recompilation.
C/C++
Fortunately, XL C/C++ uses the GNU C and C++ headers, and the resulting application is linked with the C and C++ run time libraries provided with GCC. This means that the XL C/C++ compilers produce GNU ELF objects, which are fully compatible with the objects the GCC compilers produce. XL C/C++ ships the SMP run time library to support the automatic parallelization and OpenMP features of the XL C/C++ compilers.
Moving from GCC to XL C/C++ for Linux on POWER is straightforward. XL C/C++ assists with the task by providing an option, -qinfo=por, to help you filter the generated diagnostic messages to show only those that pertain to portability issues. In addition, a subset of the GNU extensions to gcc and gcc-c++ are supported by XL C/C++. See "XL C/C++ for Linux on pSeries Compiler Reference" for a complete list of features that are supported, as well as those that are accepted but have semantics that are ignored.
To use supported features with your C code, specify either -qlanglvl=extended or -qlanglvl=extc89. In C++, all supported GNU gcc/gcc-c++ features are accepted by default. Furthermore, gxlc and gxlc++ help to minimize the changes to makefiles for existing applications built with the GNU compilers.
5.5.2.2 Recommended compiler options
General use of the IBM compilers on Linux on POWER is exactly the same as on AIX; see section 3.1.
A summary of new general compiler flags for POWER7 on Linux (a sample compile line follows the list):
-qarch=pwr7 -qtune=pwr7: compile and tune for the POWER7 processor
-qsimd=auto: use the auto-vectorization feature
-qaltivec: needed to recognize vector types and the vector formatting extensions to the C language
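For example, a typical compile line combining these flags (the -O3 level and file names are illustrative assumptions, matching the example in section 5.5.3):
xlc -O3 -qarch=pwr7 -qtune=pwr7 -qsimd=auto -qaltivec -o myexe myprog.c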
5.5.3 IBM Compiler Performance Comparison
The kernel introduced in section 5.1.1 has been compiled with a beta version of the IBM compilers for POWER7 using the following compiler flags:
-O3 -qsimd=auto -qaltivec -qarch=pwr7 -qtune=pwr7
The "No Vector" version excludes the -qsimd=auto and -qaltivec options.
The compiler was able to generate many SIMD operations and increase the overall performance.
Figure 5-3 SIMD performance on an HPC kernel (VMX/VSX effect, financial kernel; elapsed times in seconds)

              total elapsed time   single precision loop   double precision loop
No Vector     92.4                 48.3                    44.0
Vector        41.0                 16.0                    25.0
Ratio (x)     2.3                  3.0                     1.8

Using the GNU compiler (version 4.3.2), we were not able to use auto-vectorization; this compiler version does not support POWER7. Therefore, the elapsed time is much larger: 159 sec.
The newer version of the GNU compiler (4.5 alpha) has not been tested yet.
The theoretical maximum possible gain for Altivec/VSX is 4x, which would occur if only one scalar floating point pipe were used effectively. It is possible that the scalar loop could be further optimized by hand to use both scalar floating point pipes effectively. See section 11.2 for a detailed discussion of expected Altivec/VSX performance.
5.5.4 Porting x86 Applications to Linux on POWER
A useful article has been produced to help migrate an application developed on x86 Linux
to Linux on POWER.
MPI
In the Linux community, open-source versions of MPI are well known and often used
instead of proprietary libraries. OpenMPI is becoming the first free choice for MPI on several architectures, including Linux on POWER. IBM has also made its Parallel Environment available for Linux on POWER. The following is a brief comparison of these two MPI products available on the Power 755.
OpenMPI
5.5.5 Introduction
OpenMPI is licensed under the new BSD license. OpenMPI currently runs on a wide variety of platforms, including POWER7 systems, and it natively supports the Infiniband network available for the Power 755 system.
5.5.6 OpenMPI Binding Parameters
OpenMPI has its own processor and memory affinity parameters. If the system supports processor and memory affinity, one can explicitly tell OpenMPI to use them when running MPI jobs.
This script fragment sets the processor affinity with a rankfile.
# mpirun -np 4 -hostfile hostfile -rf rankfile <executable>

# cat rankfile
rank 0=host1 slot=2
rank 1=host2 slot=1-3
This means that
rank 0 will run on host1, bound to CPU 2
rank 1 will run on host2, bound to CPUs 1 through 3
5.5.7 Using OpenMPI with LoadLeveler
OpenMPI is easily used within LoadLeveler. Here are some tips from the OpenMPI FAQ.
5.5.7.1 How to build OpenMPI with support for LoadLeveler
Support for LoadLeveler will be automatically included if the LoadLeveler libraries and headers are in the default path. If not, support must be explicitly requested with the --with-loadleveler command line switch to OpenMPI's configure script. In general, the procedure is the same as including support for high-speed interconnect networks, except that you use --with-loadleveler.
For example:
shell$ ./configure --with-loadleveler=/path/to/LoadLeveler/installation
After OpenMPI is installed, you should see one or more components named "loadleveler":
shell$ ompi_info | grep loadleveler
MCA ras: loadleveler (MCA v1.0, API v1.3, Component v1.3)
Specific frameworks and version numbers may vary, depending on your version of OpenMPI.
5.5.7.2 How to run with LoadLeveler
If support for LoadLeveler is included in your OpenMPI installation (which you can check
with the ompi_info command - look for components named "loadleveler"), OpenMPI will
automatically detect when it is running inside such jobs and will just "do the Right Thing."
Specifically, if you execute an mpirun command in a LoadLeveler job, it will automatically
determine what nodes and how many slots on each node have been allocated to the
current job. There is no need to specify what nodes to run on. OpenMPI will then attempt
to launch the job using whatever resource is available (on Linux rsh/ssh is used).
For example:
# Job to submit
shell$ cat job
#@output = job.out
#@error = job.err
#@job_type = parallel
#@node = 3
#@tasks_per_node = 4
mpirun a.out

# Submit batch job to LoadLeveler
shell$ llsubmit job
This will run 4 MPI processes per node on the 3 nodes which were allocated by
LoadLeveler for this job.
5.6 Monitoring Tools for Linux
The traditional tool for monitoring processes on Linux is top. The nmon tool has been
open-sourced recently and can be used on Linux on POWER. It can provide much more
detail and also collect data for later analysis.
5.6.1 top
Included in every Linux distribution, top displays information on the processes with the
highest CPU usage on your system.
Typical output:
top - 07:56:32 up 1 day, 3:33, 2 users, load average: 0.00, 0.01, 0.16
Tasks: 311 total, 1 running, 310 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 259900032k total, 11843072k used, 248056960k free, 635264k buffers
Swap: 4096384k total, 0k used, 4096384k free, 10950528k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
16312 root  20   0  6336 3840 2496 R    0  0.0  0:00.06 top
    1 root  20   0  3648 1088  832 S    0  0.0  0:06.49 init
    2 root  15  -5     0    0    0 S    0  0.0  0:00.02 kthreadd
    3 root  RT  -5     0    0    0 S    0  0.0  0:00.01 migration/0
    4 root  15  -5     0    0    0 S    0  0.0  0:00.47 ksoftirqd/0
5.6.2 nmon
nmon monitors all system CPUs, processes, filesystems, network interfaces, etc. It can be used to collect and show data for long runs.
Figure 5-4 nmon display



See the official website for the Linux nmon tool.
5.6.3 oprofile
oprofile is a system-wide profiler for Linux, capable of profiling all running code at low
overhead. It consists of a kernel driver and a daemon for collecting sample data, and
several post-processing tools for turning data into information. oprofile leverages the
hardware performance counters of the CPU to enable profiling of a wide variety of
interesting statistics, which can also be used for basic time-spent profiling.
All code is profiled: hardware and software interrupt handlers, kernel modules, the kernel,
shared libraries, and applications.
The main features are:
Unobtrusive
o No special recompilations, wrapper libraries or the like are necessary.
Even debug symbols (-g option to gcc) are not necessary unless you
want to produce annotated source.
o No kernel patch is needed - just insert the module.
System-wide profiling
o All code running on the system is profiled, enabling analysis of system
performance.
Performance counter support
o Enables collection of various low-level data, and association with
particular sections of code.
Call-graph support
o With an x86 or ARM 2.6 kernel, oprofile can provide gprof-style call-
graph profiling data.
Low overhead
o oprofile has a typical overhead of 1-8%, dependent on sampling
frequency and workload.
Post-profile analysis
o Profile data can be produced on the function-level or instruction-level
detail. Source trees annotated with profile information can be created. A
hit list of applications and functions that take the most time across the
whole system can be produced.
System support
o oprofile works across a range of CPUs, including the Intel range, AMD's Athlon and AMD64 processors, the Alpha, ARM, and more. oprofile will work against almost any 2.2, 2.4 and 2.6 kernel, and works on both UP and SMP systems from desktops to the scariest NUMAQ boxes.

The official website for oprofile is here.
6 MPI Performance on the Power 755 System
6.1 MPI Performance Considerations
Binding MPI processes to logical CPUs, physical cores, or MCM affinity domains can
improve both communication and computation performance.
Note that the term MCM does not always refer to a multichip module. In LoadLeveler and
Parallel Environment (PE), the use of MCM refers to an affinity domain that equates to a
single chip (8 cores) of a Power 755 node. MCM can also refer to an affinity domain in
other contexts. For example setting the AIX environment variable
MEMORY_AFFINITY=MCM requests AIX to allocate memory within a local affinity
domain, with the Power 755 having four such domains (eight cores per MCM domain).
Note the term rset is AIX-specific but we will use it here to refer to both AIX rsets and
Linux cpusets. LoadLeveler running under Linux uses the rset keywords for its binding
controls though the actual binding is done via Linux cpusets.
The use of affinity rsets is preferred over bindprocessor calls since rsets are more flexible
and there can be difficulties mapping between bindprocessor IDs, logical CPU IDs,
physical CPUs, and affinity domains. It is recommended to not use the bindprocessor
command, but there are still some cases in which bindprocessor may be faster (IBM is
working towards rset solutions for all such cases).
An important difference between POWER6 and POWER7 is that not all SMT threads on
POWER7 provide symmetric performance. Unlike POWER5 and POWER6, there is a
noticeable difference between the primary and secondary SMT threads, in that the core
cannot be ceded to run in single thread (ST) mode when a single threaded workload is
running on any other thread except the primary thread. Workloads requiring the best
single-thread performance will do best running on the primary SMT threads on POWER7.
IBM is considering changes to IBM PE and LoadLeveler to account for this behavior, but
it is currently possible to force the main compute thread of an MPI task to be bound to the
primary SMT threads via the following environment variables:
export XLSMPOPTS=parthds=1
export MP_TASK_AFFINITY=CORE:1
export MP_BINDPROC=yes

6.2 LoadLeveler JCF (Job Command File) Affinity Settings
In an IBM scheduler solution, LoadLeveler should provide the required rset support.
In order to use the LoadLeveler affinity scheduling support, the first step is to set the
rset_support keyword to value rset_mcm_affinity in the LoadLeveler configuration file.
After starting LoadLeveler with affinity scheduling support, the usage and availability of MCMs in the entire cluster can be queried using the llstatus -M command.
An important JCF keyword is the rset keyword, which should be set to rset_mcm_affinity to force an rset to be created for each MPI task:
#@rset = rset_mcm_affinity
Each task will be constrained to run in an MCM affinity domain if no other options are set,
but the mcm_affinity_options keyword will probably also need to be set to get good
performance. With the default value tasks will accumulate on the same MCM leaving
other MCMs idle (the default is being reviewed and may be changed). This may be good
for a small job that doesn't need more resources than a single MCM provides, but generally round-robin scheduling of the tasks across MCMs is preferable. A suggested value for mcm_affinity_options for MPI jobs running on Power 755 is:
#@mcm_affinity_options=mcm_distribute mcm_mem_req mcm_sni_none
The mcm_affinity_options keyword should define one of each of these three types of affinity options: task_mcm_allocation, memory_affinity, and adapter_affinity, which can have the following possible values:
task_mcm_allocation options (generally this should be set to mcm_distribute unless all tasks running on a Power 755 node can fit constrained within a single MCM):
mcm_accumulate (tasks are placed on the same MCM when possible)
mcm_distribute (tasks are round-robin distributed across available MCMs)
memory_affinity options (the default value should suffice here, but on AIX it is suggested that MEMORY_AFFINITY=MCM be exported into the job's environment to be sure that local memory requests are made):
mcm_mem_none (the job has no memory affinity requirement)
mcm_mem_pref (memory affinity preferred)
mcm_mem_req (memory affinity required)
adapter_affinity options:
mcm_sni_none (no network adapter affinity requirement; recommended for Power 755)
mcm_sni_pref (network adapter affinity preferred)
mcm_sni_req (network adapter affinity required)
There is no reason to use the network adapter affinity options mcm_sni_pref or
mcm_sni_req on Power 755 as the recommended setting is the default
mcm_sni_none option.
The LoadLeveler JCF task_affinity keyword allows constraining a task to run on a number of cores or logical CPUs, e.g.
#@task_affinity=core(x)
or
#@task_affinity=cpu(x)
The core keyword specifies that each task in the job is bound to run on as many
processor cores as specified by x. The cpu keyword indicates that each task of the job is
constrained to run on as many logical CPUs as defined by x. The CPUs/cores will not be
shared between tasks scheduled by LoadLeveler if scheduling affinity is enabled.
An optional new JCF keyword, cpus_per_core, allows the user to limit the number of
software threads assigned to a core by an MPI job:
#@cpus_per_core=x
Note that, for a POWER7 with SMT4 enabled, possible values for x are 1 through 4. This
x defines how many SMT threads the task can run on for every core it is assigned. It
should be noted that the order of SMT thread allocation is not currently guaranteed by
LoadLeveler, though this is being investigated (on POWER7, the primary threads perform
better than secondary threads, which perform better than tertiary threads).
When the new LoadLeveler task_affinity and cpus_per_core options are used, the rset
and mcm_affinity_options can be used in combination to request MCM-specific affinity
options in addition to the processor core affinity options.
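Putting the keywords above together, a sketch of a JCF fragment for an MPI job using per-core binding might look like the following; the node and task counts, and the choice of core(1) with one CPU per core, are illustrative assumptions:
#@job_type = parallel
#@node = 4
#@tasks_per_node = 8
#@rset = rset_mcm_affinity
#@mcm_affinity_options = mcm_distribute mcm_mem_req mcm_sni_none
#@task_affinity = core(1)
#@cpus_per_core = 1
#@queue
/usr/bin/poe ./myexe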
6.3 IBM PE (Parallel Environment) Affinity Support
IBM PE has extended its affinity support with additional rsets for cases in which
LoadLeveler affinity support is unavailable. PE's support of affinity domains is controlled by the MP_TASK_AFFINITY environment variable. It provides roughly equivalent functionality to LoadLeveler's JCF keywords, except that PE can provide only one core or one logical CPU to each MPI task.
The MP_TASK_AFFINITY settings are ignored for batch jobs. To give a batch job affinity, the appropriate LoadLeveler JCF keywords should be used.
For interactive jobs, POE will query the LoadLeveler API to determine if the resource
manager provides affinity support at the requested level. When running a version of
LoadLeveler with full affinity support, PE will simply convert the requested
MP_TASK_AFFINITY environment variable to the appropriate JCF settings as follows:
Table 6-1 Mapping MP_TASK_AFFINITY to LoadLeveler JCF statements

POE Environment Variable Setting              LoadLeveler JCF Lines that will Be Generated
MP_TASK_AFFINITY = MCM                        #@rset = rset_mcm_affinity
  (one chip, 8 cores)
MP_TASK_AFFINITY = CORE                       #@rset = rset_mcm_affinity
  (one core)                                  #@task_affinity = core(1)
MP_TASK_AFFINITY = CPU                        #@rset = rset_mcm_affinity
  (one logical CPU, usually one               #@task_affinity = cpu(1)
   hardware (SMT) thread of a single core)

6.4 Environment Variables and LoadLeveler Keywords for Optimal
MPI Performance
A key difference, compared to POWER6 Infiniband systems, is that the Power 755
system often will have two available Infiniband links connected to the same network.
So on Power 755 there may be no difference between requesting network adapters on a
single network (or switch):
# using the LoadLeveler keyword
network.MPI = sn_single,not_shared,us,,instances=1
# using the PE keyword/variable
MP_EUIDEVICE=sn_single
and requesting adapters on all available networks (switches):
# using the LoadLeveler keyword
network.MPI = sn_all,not_shared,us,,instances=1
# using the PE keyword
MP_EUIDEVICE=sn_all
If there are multiple links defined on the same network, then the instances keyword can
be used to allocate each MPI task multiple instances (windows) on multiple links. For
example, if both Infiniband links on a Power 755 are plugged into the same network
(switch), the following approaches can be used to allocate each MPI task the ability to
use both links:
# using the LoadLeveler keyword
#@network.mpi = sn_single,not_shared,us,,instances=2
# using the PE keyword/variable
export MP_INSTANCES=2
Note that the IBM communication protocol stack (specifically LAPI) will use only a single
link per MPI task for smaller FIFO messages. Each task will typically send FIFO
messages over one link, but may receive over any available link. The breakpoint at
which messages will change from FIFO to RDMA mode is defined by the maximum of the MP_EAGER_LIMIT and MP_BULK_MIN_MSG_SIZE settings.
For maximum bandwidth, RDMA traffic will be striped equally across all available links. A
maximum of two links are supported on the Power 755.
General environment variable recommendations:
export MP_FIFO_MTU=4K
export MP_RFIFO_SIZE=16777216
export MP_EUIDEVELOP=min
export MP_SINGLE_THREAD=YES   # do NOT use if more than one application thread
                              # (including OpenMP) is making communication calls
export MP_PULSE=0
export MP_USE_BULK_XFER=yes
export MP_BUFFER_MEM=512M
export MP_EUILIB=us

RDMA-specific tunables:
export MP_RDMA_MTU=4K
export MP_BULK_MIN_MSG_SIZE=64k
export MP_RC_MAX_QP=8192
export LAPI_DEBUG_RC_INIT_SETUP=yes   # try both =yes and =no
export LAPI_DEBUG_RC_SQ_SIZE=16

Other possible variables to experiment with (these tunables have no hard and fast recommendations, though initial starting points are listed):
export MP_INSTANCES=2                   # also try 4, but this may require a LoadLeveler
                                        # admin change -> max_protocols_instances=4
export MP_EAGER_LIMIT=32768             # try 64K, 128K also
export MP_BULK_MIN_MSG_SIZE=64k         # RDMA cross-over will never occur below the
                                        # MP_EAGER_LIMIT
export MP_REXMIT_BUF_SIZE=36000         # set big enough for the largest eager message
                                        # plus protocol headers
export MP_BULK_XFER_CHUNK_SIZE=1024k    # can try 256K
export MP_RETRANSMIT_INTERVAL=40000000  # can try smaller values too
export MP_SHMCC_EXCLUDE_LIST=all        # just a few applications see a benefit from
                                        # setting different values (see details below)
export MP_SHM_ATTACH_THRESH=500000      # often the default is better
6.4.1 Details on Environment Variables
All of these settings are intended for use with Power 755 clusters connected with
Infiniband network adapters.

MP_FIFO_MTU=4K
Unless this is set, FIFO/UD uses a default 2K packet. FIFO bandwidths can be improved
by setting this variable to 4K to enable 4KB MTU for FIFO packets. Note that the switch
chassis MTU must be enabled for 4KB MTU for this to work.

MP_RFIFO_SIZE=16777216
The default size of the receive FIFO used by each MPI task is 4MB. Larger jobs are
recommended to use the maximum size receive FIFO (16MB) by setting
MP_RFIFO_SIZE=16777216.
MP_SHM_ATTACH_THRESH=500000
LAPI has two modes of sending shared memory messages. For smaller messages, a
slot mode is used to copy messages from one MPI task to another. For larger messages,
and only on AIX, it's possible to map the shared memory segment of one task to another,
thereby saving a copy at the cost of some overhead of attaching. The
MP_SHM_ATTACH_THRESH variable defines the minimum size message for which
attach mode is used. Depending on the type of job, different cross-over points may
provide optimal performance, but 500000 is often a reasonable starting point when tuning
this value. The default depends on how many tasks are running on the node.
MP_EUIDEVELOP=min
The MPI layer will perform checking on the correctness of parameters according to the
value of MP_EUIDEVELOP. As these checks can have a significant impact on latency,
when not developing applications it is recommended that MP_EUIDEVELOP=min be set
to minimize the checking done at the message passing interface layer.
MP_SINGLE_THREAD=yes
Setting MP_SINGLE_THREAD=yes can improve latency by eliminating locking between
LAPI and MPI. This should not be set if a single MPI task (process) has multiple user
threads making MPI calls. Also, this should not be set equal to YES for hybrid
MPI/OpenMP applications if multiple OpenMP threads are making communications calls;
though it should be noted that by default such hybrid codes only make MPI calls from the
main (master) OpenMP thread.
MP_PULSE=0
POE and the Partition Manager use a pulse detection mechanism to periodically check
each remote node to ensure that it is actively communicating with the home node. You
specify the time interval (or pulse interval), with the MP_PULSE environment variable.
During an execution of a POE job, POE and the Partition Manager daemons check at the
interval you specify that each node is running. When a node failure is detected, POE
terminates the job on all remaining nodes and issues an error message. The default
value for MP_PULSE is 600 (600 seconds or 10 minutes). Setting MP_PULSE=0 disables the pulse mechanism.
MP_SHMCC_EXCLUDE_LIST=all
MP_SHMCC_EXCLUDE_LIST can be used to specify a list of routines for which MPI's collective communication (CC) layer will be avoided (instead the communication will be done via LAPI's point-to-point interface). In certain cases the MPI CC routines can result in worse performance, particularly when there is a lot of load imbalance (such that tasks do not arrive at collectives in a synchronous manner). It is often worth experimenting with MP_SHMCC_EXCLUDE_LIST=all to see if any improvements can be made by excluding MPI CC's use of all routines. If improvements are found with the all setting, then further experiments could optionally be done to isolate which CC routines are the source of the problem. The variable can be set to any colon-delimited list (e.g. MP_SHMCC_EXCLUDE_LIST=barrier:bcast). The full list of valid CC routine values is:
None", "All", "Barrier","Bcast","Reduce", "Allreduce","Reduce_scatter", "Gather",
"Gatherv", "Scatter", "Scatterv", "Allgather", Allgatherv", "Alltoall", "Alltoallv", "Alltoallw",
"Scan", "Exscan"};

MP_BUFFER_MEM=512M
Setting MP_BUFFER_MEM can increase the memory available to the protocols beyond the defaults. It can address cases in which the user-requested MP_EAGER_LIMIT cannot be satisfied by the protocols (reported as "MPCI_MSG: ATTENTION: Due to memory limitation eager limit is reduced to X").
MP_USE_BULK_XFER=yes
Setting MP_USE_BULK_XFER=yes will enable RDMA. On IB systems using RDMA will
generally give better performance at lower task counts when forcing RDMA QP
information to be exchanged in MPI_Init() (via setting
LAPI_DEBUG_RC_INIT_SETUP=yes). When the RDMA QP information is not
exchanged in MPI_Init(), there can be delays due to QP information exchange until all
tasks have synced-up.
The benefit of RDMA depends on the application and its use of buffers. For example,
applications that tend to re-use the same address space for sending and receiving data
will do best, as they avoid the overhead of repeatedly registering new areas of memory
for RDMA.
RDMA mode will use more memory than pure FIFO mode. Note that this can be
curtailed by setting MP_RC_MAX_QP to limit the number of RDMA QPs that are created.
MP_RDMA_MTU=4K
Unless this is set, RDMA uses the default of 2K packet. RDMA bandwidths can be
improved by setting this variable to 4K to enable 4KB MTU for RDMA packets. Note
that the switch chassis MTU must be enabled for 4KB MTU for this to work. In cases
where network latency dominates, and for certain message sizes, keeping the MTU size
at 2K will provide better performance.
MP_BULK_MIN_MSG_SIZE=64k
The minimum size message used for RDMA is defined to be the maximum of
MP_BULK_MIN_MSG_SIZE and MP_EAGER_LIMIT. So if MP_EAGER_LIMIT is
defined to be higher than MP_BULK_MIN_MSG_SIZE, the smallest RDMA message will
be limited by the eager limit.
MP_RC_MAX_QP=8192
This variable defines the maximum number of RDMA QPs (queue pairs) that will be opened by a given MPI task. An Infiniband QP refers to the low-level communication construct established over Infiniband; it can be thought of as the Infiniband-layer equivalent of what a socket is at the higher TCP/IP layer.
Depending on the size of the job and the number of tasks per node, it may be desirable
to limit the number of QPs used for RDMA. By default, when the limit of RDMA QPs is
reached, future connections will all use FIFO/UD mode for message passing.
LAPI_DEBUG_RC_SQ_SIZE=16
This variable defines the number of outstanding RDMA messages that can be in flight on
the send queue. Increasing this value beyond the default (4) can often improve
performance.
LAPI_DEBUG_RC_INIT_SETUP=yes
Setting this variable will force all RDMA QP exchange to occur in MPI_Init(). This can be
useful for smaller jobs (typically 256 MPI tasks or less)
6.4.2 Other Possible Environment Variables to Try
MP_POLLING_INTERVAL=800000
This defines the interval at which the LAPI timer thread runs. Setting the polling interval
equal to 800000 defines an 800 millisecond timer. The default is 400 milliseconds.
MP_RETRANSMIT_INTERVAL=8000000
This defines the number of loops through LAPI dispatch code before retransmission logic
kicks in. The polling interval also comes into play in the calculation of when
retransmission will occur. If the network is very stable, setting this value to 800000 or
higher (the default is 400000) can improve performance by eliminating unneeded
retransmission.
LAPI_DEBUG_SEND_FLIP=8
For FIFO traffic, this will allow LAPI to use more than one link, switching every X number
of packets between FIFOs. In some cases enabling this variable can increase
performance.
LAPI_DEBUG_ACK_THRESH=16
This defines how many packets are received before sending an ACK (the default is 31).
The sender can block on ACKs after sending 64 packets to a given destination without
receiving an ACK. In certain cases, setting this variable can improve performance.
MP_BULK_XFER_CHUNK_SIZE=1024k
This defines how large a chunk size the protocols use for RDMA transmissions. The
default value is 32768k but in some cases a smaller value, like 1024k may improve
performance.
7 Performance Analysis Tools on AIX
The AIX performance management and tuning guides cover many tools, most of which
are focused on overall system performance. This section focuses on tools for measuring
the performance and supporting the optimization of a specific application or workload.
The IBM System p and AIX Information Center (including the AIX 6.1 information center)
is a good place to start looking for information on any AIX performance tool or command.
7.1 Runtime Environment Control
To get the best reproducible performance, the runtime environment must be configured
correctly. The following environment variables and utility programs control those
conditions:
7.1.1 ulimit
The ulimit command controls user limits on jobs run interactively. Large HPC jobs often
use more hardware resources than are given to users by default. These limits are put in
place to make sure that program bugs and user errors do not make the system unusable
or crash. However, for benchmarking and workload characterization, it is best to set all
resource limits to unlimited (or the highest allowed value) before running. This is
especially true for resources like stack, memory and CPU time.
ulimit -s unlimited
ulimit -m unlimited
ulimit -t unlimited
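To confirm the limits actually in effect for the current shell before launching a run, the full
set can be listed with:
ulimit -a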
7.1.2 Memory Pages
The page size of an executable determines how much memory is mapped into a TLB
entry, and how much I/O is done when swapping out memory. The default is 4KB pages.
Using larger pages can reduce the number of TLB misses but can also increase memory
traffic when doing copy-on-write operations.
A binary can be modified with ldedit to use 64KB pages (see section 4.1.2) to see whether
that improves performance.
Sometimes, an HPC application will benefit from an even larger page size. 16MB pages
are also available in AIX by default. In this case, the system also needs large pages
allocated, which is done with the vmo command.
vmo -o lgpg_regions=300 -o lgpg_size=16M    (AIX 6.1 syntax)
The example above allocates 300 16MB pages (4.8 GB).
Large pages (also known as huge pages under Linux) must be used with caution. The
biggest drawback to using large pages is that they are pinned in real memory. Once a
job using large pages is started, it cannot be swapped out to disk. The memory is only
freed if the job finishes or is killed. If the job is suspended, the real memory is not
available for other jobs to use.
If a job requests large pages but none are available at runtime, memory is allocated from
the regular (4K) page pool.
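As a quick sketch of putting 16MB pages to use (assuming the executable was not already
linked with large-page support; see section 4.1), the binary can be marked to request large
pages for its data segment, and the large page pool can then be watched while the job runs:
ldedit -blpdata myprog.exe     # mark the binary to request 16MB data pages
vmstat -l 2                    # monitor active and free large pages during the run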
7.2 Profiling Tools
Profiling refers to charging CPU time to subroutines and micro-profiling refers to charging
CPU time to source program lines. Profiling is frequently used in benchmarking and
tuning activities to find out where the "hotspots" in a code are.
Several tools are available for profiling in UNIX in general, and AIX offers additional tools.
For many years, UNIX has included gprof, and this is also available in AIX. tprof is an
AIX-only alternative which can provide profiling information from the original binaries.
7.2.1 gprof
To get gprof-compatible output, binaries first need to be compiled and linked with the
added -pg option (additional options, such as an optimization level -O<n>, can also be added):
xlc -pg -o myprog.exe myprog.c
or
xlf -pg -o myprog.exe myprog.f
When the program is executed, a gmon.out file is generated (or, for a parallel job, several
gmon.<#>.out files are generated, one per task). To get the human-readable profile, run:
gprof myprog.exe gmon.out > myprog.gprof
or
gprof myprog.exe gmon.*.out > myprog.gprof
To get microprofiling information from gprof output, use the xprofiler tool. Full
documentation for gprof is available in the AIX documentation.
7.2.2 tprof
For additional information, see POK Hot Tips discussion, AIX 6.1 Documentation and the
HPC Wiki Page

7.2.2.1 Profiling with tprof
Description from Man Page
The tprof command reports CPU usage for individual
programs and the system as a whole. This command is a
useful tool for anyone with a Java, C, C++, or FORTRAN
program that might be CPU-bound and who wants to know
which sections of the program are most heavily using the CPU.
The tprof command can charge CPU time to object files,
processes, threads, subroutines (user mode, kernel mode and
shared library) and even to source lines of programs or
individual instructions.
tprof estimates the CPU time spent in a program, the kernel, a shared library, etc. by
sampling the instructions every 10 milliseconds. When a sample is taken, a "tic" is
charged to the components running at that time. By sampling over many intervals, the
time spent in the various processes and daemons running on the system can be
estimated.
For example:
tprof -x myprog
If present, the -x flag is always the last on the command line, and is followed by the
name of a command and its arguments.
This produces an additional file, myprog.prof, which contains the same sort of information
as found in the gprof output.
The output looks something like this:

Configuration information
=========================
System: AIX 5.3  Node: v20n18  Machine: 00C518EC4C00
Tprof command was:
tprof -x myprog
Trace command was:
/usr/bin/trace -ad -M -L 134785843 -T 500000 -j
000,00A,001,002,003,38F,005,006,134,139,5A2,5A5,465,234, -o -
Total Samples = 2538
Traced Time = 5.01s (out of a total execution time of 5.01s)

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Process                    Freq   Total  Kernel   User  Shared  Other
=======                    ====   =====  ======   ====  ======  =====
wait                         32   80.97   80.97   0.00    0.00   0.00
ipa64                         1   12.45    3.62   2.32    6.50   0.00
/usr/lpp/xlf/bin/xlfcode      1    6.54    0.47   5.32    0.75   0.00
/usr/bin/sh                   1    0.04    0.00   0.00    0.04   0.00
=======                    ====   =====  ======   ====  ======  =====
Total                        35  100.00   85.07   7.64    7.29   0.00



Process                 PID     TID     Total  Kernel   User  Shared  Other
=======                 ===     ===     =====  ======   ====  ======  =====
wait                   1362    1399     14.66   14.66   0.00    0.00   0.00
ipa64                120348  386811     12.45    3.62   2.32    6.50   0.00
wait                  18022   18059      7.29    7.29   0.00    0.00   0.00
/lpp/xlf/bin/xlfcode  70436  324381      6.54    0.47   5.32    0.75   0.00
wait                  16682   16719      2.01    2.01   0.00    0.00   0.00
wait                  22120   22157      2.01    2.01   0.00    0.00   0.00

(Similar lines removed)

wait                  17218   17255      1.97    1.97   0.00    0.00   0.00
wait                  21316   21353      1.97    1.97   0.00    0.00   0.00
wait                    826     863      1.77    1.77   0.00    0.00   0.00
/usr/bin/sh           71210  419561      0.04    0.00   0.00    0.04   0.00
=======                 ===     ===     =====  ======   ====  ======  =====
Total                                  100.00   85.07   7.64    7.29   0.00
This mode of operation, where tprof is run for the duration of an executing program and
a report is produced immediately, is known as "real time" or "online" profiling.
7.2.2.2 Source code level microprofiling with tprof
Microprofiling is a very useful feature of tprof. tprof supports microprofiling using the
object file or an annotated assembly listing. To produce assembly-code-level
microprofiling, the object files and executable should be compiled (or linked) with the
flags -g -qfullpath. The annotated listing can be obtained by compiling the source
files with -qlist and -qipa, which is implicit with the -O4 and -O5 flags (note that
-qipa=level=0 can be used). When linked with the -qlist and -qipa options, a file named
a.lst is generated which contains the assembly code of the entire executable after
ALL optimizations are applied.
Below is an example of how to produce object file micro-profiling using one of the NAS
parallel benchmarks
nproc=8
exe=lu.B.$nproc

tprof [-L a.lst] -m bin/$exe -Zusk -r $exe -x /bin/sh << EOF
export MP_PROCS=$nproc
export MP_HOSTFILE=`pwd`/host.list

poe bin/$exe
EOF
The options of interest are
-m Objectlist
Enables microprofiling of the Object - in this case an executable
-L Objectlist
Enables annotated-listing microprofiling; the Object can be an executable, an
annotated listing file (a.lst), or a shared library
-u
Enables user mode profiling
-k
Enables kernel profiling (may not be accurate if not running as root)
-s
Enables profiling of shared libraries (useful for estimating time in LAPI and
OpenMP libraries)
-Z
Shows CPU usage in tics
-r RootString
The string "RootString" is used as the prefix for all output files
-x Program
Specifies the program to be run by tprof; this must be the last flag on the
argument list. Note that if -r is not used, the program name supplied to -x is used
as the RootString.
In the above example several files will be produced:
RootString.prof
Holds the process, thread, object file and subroutine level profiling report. Usually
the hot subroutine list can be determined from this report.
RootString.source.mprof
Here source is the base name of a source file. If more than one source file has
the same base name, then a number to uniquely identify each file is appended to
the report file name, for example, RootString.Filename.c.mprof-1. The micro-
profiling report contains a hot line profile section, and a source line profile
section.
RootString.a.lst.prof
This is the name of the annotated listing profile.

7.2.2.3 Using tprof on an MPI parallel application
In order to profile an MPI parallel application, execute a command similar to the following:
tprof -u -l -x poe ../../execute/DLPOLY.Y -procs 16
Note that this ONLY runs on the first node in the host file or the first node selected by
LoadLeveler for a parallel job. It is not possible to say something like "poe tprof -x
a.out", as only one trace daemon can be active on a node at a time.
As with serial jobs, it is also possible to profile an application which is already running:
tprof -p DLPOLY.Y -S DIRECTORY -r dlpoly -u -l -x sleep 60
Here -p DLPOLY.Y selects the process name to be profiled, and -S selects the directory
containing the executable.
If poe is being run interactively, then locating the node running the processes of the job is
not a problem. If the job is being run using LoadLeveler, then it is likely that the tasks are
not running on the node from which the job was submitted. The simplest way to find the
relevant node is to run "llq", and then log on to the node specified in the "Running on"
column. This is the first node selected by LoadLeveler, and will be running MPI task zero.
If it is necessary to profile a specific MPI task, then this information would need to be
extracted from the "Task Instance" information from a long LoadLeveler listing "llq
-l jobname".
Running tprof on a single node and extracting profiling information for all the tasks
running on that node is normally useful and sufficient. There is no easy way to use tprof
to profile all the tasks of a multi-node parallel job. For that situation, it is best to look into
the PE Benchmarker toolkit for AIX (which is not discussed in this document).
7.3 MPI Performance Tools
For uncovering inter-task communication bottlenecks in MPI programs (especially when
the number of MPI tasks is large), the program can be linked with either an MPI-profiling
or an MPI-trace library. An alternative set of tools is the PE Benchmarker which provides
its utilities as part of the standard AIX release in the ppe.perf and ppe.pvt filesets. These
will not be covered here, but are discussed in PE Benchmarker for AIX
7.3.1 MPI Summary Statistics
The main objective for libmpitrace.a is to provide a very low overhead elapsed-time
measurement of MPI routines for applications written in any mixture of Fortran, C, and
C++. The overhead for the current version is less than 1 microsecond per call. The
read_real_time() routine is used to measure elapsed-time, with a direct method to convert
timebase structures into seconds. This is much faster than using rtc() or the
time_base_to_time() conversion routine.
To use the basic wrappers, link with libmpitrace.a, and then run the application. By
default you will get text files mpi_profile.0, mpi_profile.1, mpi_profile.2, etc. The output is
written when the application calls MPI_Finalize(). Instead of generating one small text file
from each MPI rank, there is an option to dramatically reduce the number of output files.
At the present time, this option can be enabled by setting an environment variable
SAVE_ALL_TASKS=no. With this option, detailed data is saved for MPI rank 0, and for
the ranks with the minimum, maximum, and median communication times. The text file
for rank 0 contains a summary for all other MPI ranks. It is hoped that this option
provides sufficient information, while keeping the number of files manageable for very
large task counts.
The default mode is to just collect timing summary data, which is printed when the
application calls MPI_Finalize().
An example of this output is as follows (taken from a SPECMPI2007 104.milc run)
Data for MPI rank 0 of 4:
Times and statistics from MPI_Init() to MPI_Finalize().
-----------------------------------------------------------------
MPI Routine                  #calls     avg. bytes      time(sec)
-----------------------------------------------------------------
MPI_Comm_size                   583            0.0          0.000
MPI_Comm_rank              55036888            0.0          2.296
MPI_Isend                      7996      1336747.2          1.632
MPI_Irecv                      7996      1336747.2          0.041
MPI_Wait                      15992            0.0          8.062
MPI_Bcast                         4          604.0          0.000
MPI_Barrier                       2            0.0          0.000
MPI_Allreduce                   590            7.8          0.653
-----------------------------------------------------------------
total communication time = 12.684 seconds.
total elapsed time       = 3287.987 seconds.
user cpu time            = 3284.443 seconds.
system time              = 0.319 seconds.
maximum memory size      = 2499520 KBytes.

-----------------------------------------------------------------
Message size distributions:

MPI_Isend           #calls     avg. bytes      time(sec)
                         8            4.0          0.000
                      1976       378000.0          0.357
                      3002       805862.8          0.429
                      1962      1586392.7          0.378
                       974      2726981.5          0.231
                         8      7056000.0          0.011
                        16     13104000.0          0.033
                        32     25200000.0          0.097
                        18     37856000.0          0.096

MPI_Irecv           #calls     avg. bytes      time(sec)
                         8            4.0          0.000
                      1976       378000.0          0.010
                      3002       805862.8          0.015
                      1962      1586392.7          0.009
                       974      2726981.5          0.007
                         8      7056000.0          0.000
                        16     13104000.0          0.000
                        32     25200000.0          0.000
                        18     37856000.0          0.000

MPI_Bcast           #calls     avg. bytes      time(sec)
                         4          604.0          0.000

MPI_Allreduce       #calls     avg. bytes      time(sec)
                        24            4.0          0.291
                       566            8.0          0.363

-----------------------------------------------------------------

Communication summary for all tasks:

minimum communication time = 11.337 sec for task 2
median  communication time = 12.123 sec for task 3
maximum communication time = 14.400 sec for task 1

MPI tasks sorted by communication time:
taskid   comm(s)  elapsed(s)    user(s)   size(KB)  switches
     2     11.34     3287.99    3284.71    2499392     20424
     3     12.12     3287.99    3285.75    2499392     20472
     0     12.68     3287.99    3284.44    2499520     20385
     1     14.40     3287.99    3285.55    2499392     20552
7.3.2 MPI Profiling
The main objective for libmpiprof.a was to provide an elapsed-time profile of MPI routines
including some call-graph information so that one can identify communication time on a
per-subroutine basis. For example, if an application has MPI calls in routines "main",
"exchange", and "transpose", the profile would show how much communication time was
spent in each of these routines, including a detailed breakdown by MPI function. This
provides a more detailed picture of message-passing time at the expense of more
overhead, ~5 microseconds per call. In some applications there are message-passing
wrappers, and one would like the profile to indicate the name of the routine that called the
wrapper, not the name of the routine that called the MPI function. In this case, one can
set an environment variable CALLGRAPH_LEVEL=2, and then run the application (which
must be compiled with either -g or -qtbtable=full, and linked with libmpiprof.a). It may
also be useful to try higher levels such as CALLGRAPH_LEVEL=3, which associates the
message-passing time with the great-grandparent in the call chain.
For MPI analysis it can be useful to distinguish between different calls to the same MPI
routine based, for example, on the instruction address, or location in the code. There are
several ways to get such data. The easiest is to use libmpiprof.a which associates time
spent in MPI calls with regions in the user code.
To use libmpiprof.a, the code must be compiled with "-qtbtable=full" or "-g" as an
additional compiler option. The wrappers in libmpiprof.a use a trace-back method to find
the name of the routine that called the MPI function, and this only works if there is a full
trace-back table. Just link the application with libmpiprof.a and run the code. Output is
written to files mpi_profile.0, ..., mpi_profile.<rank>, as before. This method associates time
in MPI routines with subroutine names in the user's application, instead of with function addresses.
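A corresponding sketch for libmpiprof.a (again, the library path is only a placeholder):
mpcc_r -qtbtable=full -O3 -o myprog myprog.c -L/usr/local/lib -lmpiprof
export CALLGRAPH_LEVEL=2    # attribute MPI time to the routine that called the wrapper
poe ./myprog -procs 64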
7.4 Hardware Performance Tools
7.4.1 hpmcount
Along with MPI profiling data (see 7.3.2), performance counter data from the hardware
performance monitor (HPM) can be used to characterize the performance of HPC codes
on POWER7 hardware, possibly with the intent of finding performance bottlenecks and
looking for opportunities to optimize the target code. Using hpmcount, one can monitor
the activity of all POWER7 subsystems (FPU efficiency, stalls, cache usage, flop counts,
etc.) during the application's execution on the system.
The command line syntax for the most useful options is illustrated by the following
example:
hpmcount -d -o hpm.out -g pm_utilization:uk myprog
The options are as follows:
The -d option asks for more detail in the output.
The -o option names an output file string. The actual output file for this hpmcount run will
start with hpm.out and add the process id and, in the case of a poe-based run, a task id
for each hpmcount file. A serial run just uses 0000 for the task id.
The -g <group list> option lists a predefined group of events or a comma-separated list of
event group names or numbers. When a comma-separated list of groups is used, the
counter multiplexing mode is selected. Each event group can be qualified by a counting
mode as follows:
event_group:counting_modes
where the counting mode can be user space (u), kernel space (k), and some other
options that will not be as important here.
7.4.1.1 Multiplexing with hpmcount
When more than one group is supplied to the -g option, counter multiplexing is activated.
Counting is distributed among the listed groups. The sum of the sampling times over all
groups should be a close match to the total CPU time.
In the following example:
hpmcount -g pm_utilization:uk,pm_vsu0:u myprog
the program execution time is split between the pm_utilization and pm_vsu0 groups.
7.4.1.2 Running hpmcount under the poe environment
Unlike tprof (see 7.2.2.3), hpmcount can be run for each task of a parallel job. Task
binding (see 4.4) by using the launch utility or something similar is recommended to
increase accuracy and reproducibility. An example might be:
export MP_PROCS=8
export TARGET_CPU_LIST=-1
export HPM_EVENT_GROUP=pm_utilization:u
poe launch hpmcount -d -o hpm.out myprog
Note that additional environment variables needed by poe are implied. An alternative way
of specifying the group(s) to be collected has been used here by setting the
HPM_EVENT_GROUP variable.
7.4.2 libhpm
The libhpm library is part of the bos.pmapi.lib fileset distributed with the standard AIX
release.
For event counts resolved by code region, calls to libhpm.a routines control the collection
of HPM group statistics; this is a quicker way to collect a large group of events than
profiling with tprof (see 7.4.3).
Using libhpm.a can be divided into 3 steps: modifying the source code base, building the
executable, and running.
libhpm.a supports calls from Fortran and C/C++ code; the examples will be in C.
1. There are 4 necessary calls that need to be inserted in the base source code:
hpmInit(1,"SomeString"); [Fortran f_hpminit(<#>,string) ]
at a point before any statistics are to be collected. Inserting this call at the start of
execution in the main program is the simplest course. The specified SomeString is
arbitrary, but it seems most sensible to derive it from the program name.
hpmTerminate(1); [Fortran f_hpmterminate(<#>,string) ]
This is the matching call to hpmInit(). The argument (an integer) must agree with the first
argument of the hpmInit() call. This call signals that HPM data collection has ended.
hpmStart(2,"String"); [Fortran f_hpmstart(<#>,string) ]
Insert this call at the start of every code section where one wants to collect HPM data.
The integer identifier has to be unique (there are usually many hpmStart() calls inside a
program; all have to be uniquely identified). The string is user-defined and it is
recommended that it briefly describe the code block being measured. The string (along
with the range of lines in the block) is reported as part of the libhpm output. There is a
significant execution overhead from this call, so it is best to put it outside loop blocks.
hpmStop(2); [Fortran f_hpmstop(<#>) ]
This is the matching call to hpmStart(). The argument must agree with the first argument
of the corresponding hpmStart() call. This call signals the end of HPM data collection for a
section of code.
2. Link with needed libraries
xlc -o myprog_hpm myprog.o -lhpm -lpmapi -lm

3. Use an appropriate execution environment
export HPM_OUTPUT_NAME=libhpm.out
export HPM_EVENT_GROUP=pm_utilization
./myprog_hpm
Note that myprog_hpm can be run as normal, without any call to hpmcount, but that the
hpmcount environment variables are still valid. The additional environment variable
HPM_OUTPUT_NAME is strongly recommended; it specifies a file for the libhpm output.
Otherwise, output goes to stdout.
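Pulling the three steps together, a minimal instrumented program might look like the
following sketch (the header file name and the region label are illustrative assumptions):
#include <stdio.h>
#include "libhpm.h"            /* assumed header providing the hpm* prototypes */

#define N 1000000
static double a[N], b[N], c[N];

int main(void)
{
    int i;

    hpmInit(1, "myprog_hpm");      /* begin HPM data collection                */

    hpmStart(2, "triad loop");     /* start of instrumented code section 2     */
    for (i = 0; i < N; i++)
        a[i] = b[i] + 2.0 * c[i];
    hpmStop(2);                    /* end of instrumented code section 2       */

    hpmTerminate(1);               /* end HPM data collection and write output */
    printf("a[0] = %f\n", a[0]);
    return 0;
}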
Here is an excerpt from an example output from a libhpm run:


Total execution time of instrumented code (wall time): 4.166042 seconds

######## Resource Usage Statistics ########

Total amount of time in user mode               : 4.157466 seconds
Total amount of time in system mode             : 0.002418 seconds
Maximum resident set size                       : 6372 Kbytes
Average shared memory use in text segment       : 66 Kbytes*sec
Average unshared memory use in data segment     : 22228 Kbytes*sec
Number of page faults without I/O activity      : 1572
Number of page faults with I/O activity         : 0
Number of times process was swapped out         : 0
Number of times file system performed INPUT     : 0
Number of times file system performed OUTPUT    : 0
Number of IPC messages sent                     : 0
Number of IPC messages received                 : 0
Number of signals delivered                     : 0
Number of voluntary context switches            : 6
Number of involuntary context switches          : 6

####### End of Resource Statistics ########

Instrumented section: 1 - Label: Scalar - process: 1
file: test_mul_nn_hpm.c, lines: 118 <--> 123
Count: 1
Wall Clock Time: 1.587333 seconds
Total time in user mode: 1.5830869784375 seconds

Group: 0
Counting duration: 1.587304013 seconds
PM_RUN_CYC (Run cycles)                       : 5065878331
PM_INST_CMPL (Instructions completed)         : 2680000246
PM_INST_DISP (Instructions dispatched)        : 2816747278
PM_CYC (Processor cycles)                     : 5065878331
PM_RUN_INST_CMPL (Run instructions completed) : 2682544499
PM_RUN_CYC (Run cycles)                       : 5079851163

Utilization rate       : 99.733 %
MIPS                   : 1688.367 MIPS
Instructions per cycle : 0.529

Instrumented section: 2 - Label: VMX - process: 1
file: test_mul_nn_hpm.c, lines: 159 <--> 164
Count: 1
Wall Clock Time: 1.277721 seconds
Total time in user mode: 1.2739698684375 seconds

Group: 0
Counting duration: 1.277698308 seconds
PM_RUN_CYC (Run cycles)                       : 4076703579
PM_INST_CMPL (Instructions completed)         : 1610000247
PM_INST_DISP (Instructions dispatched)        : 1624541340
PM_CYC (Processor cycles)                     : 4076703579
PM_RUN_INST_CMPL (Run instructions completed) : 1612330036
PM_RUN_CYC (Run cycles)                       : 4089025245

Utilization rate       : 99.706 %
MIPS                   : 1260.056 MIPS
Instructions per cycle : 0.395

Instrumented section: 3 - Label: VMXopt - process: 1
file: test_mul_nn_hpm.c, lines: 197 <--> 202
Count: 1
Wall Clock Time: 1.28022 seconds
Total time in user mode: 1.2764352996875 seconds
7.4.3 Profiling Hardware Events with tprof
The -E flag enables event-based profiling. Its argument is one of four software-based
events (EMULATION, ALIGNMENT, ISLBMISS, DSLBMISS) or a Performance Monitor
event (PM_*). By default, the profiling event is processor cycles (the familiar time-based
profile report discussed in 7.2.2). All Performance Monitor events are prefixed with PM_,
such as PM_CYC for processor cycles or PM_INST_CMPL for instructions completed.
The command
pmlist -g -1
lists all Performance Monitor events that are supported on the POWER7 processor. The
chosen Performance Monitor event must belong to a group that also contains the
PM_INST_CMPL event. Profiling marked events results in better
accuracy. Marked events have the PM_MRK_ prefix.

From the AIX 6 tprof command documentation:
-E [ mode ] Enables event-based profiling. The possible modes are:
PM_event
specifies the hardware event to profile. If no mode is specified for the -E flag, the
default event is processor cycles (PM_CYC).
EMULATION
enables the emulation profiling mode.
ALIGNMENT
enables the alignment profiling mode.
ISLBMISS
enables the Instruction Segment Lookaside Buffer miss profiling mode.
DSLBMISS
enables the Data Segment Lookaside Buffer miss profiling mode.
For example,
tprof -m bin/$exe -Zusk -r $exe -E PM_LD_MISS_L1 -x myprog
profiles the number of L1 D-cache misses across the source code.
7.5 Other Useful Tools
Several utilities are supplied by AIX to help monitor job execution. Use these in trial runs
to make sure that your jobs are running as expected. This ensures that the performance
data that you collect will be accurate and reproducible. These utilities generally report
system-wide statistics, so they are best used on a dedicated system. These are only
mentioned briefly here; more details are available through their respective AIX man
pages and the AIX 6.1 documentation
Common problems to watch out for during a program trial run:
Verify that the tasks are really bound properly to a fixed set of logical CPUs.
Verify that large pages are being used (if needed)
Verify that a job is not thrashing in memory (high swap activity, high wait time,
low CPU utilization)
Check the impact of I/O activity on application performance

1. topas
An analog to top for AIX. Pressing "c" twice shows utilization by logical CPU number.
Other displays can show disk I/O activity and swap activity. A high swap rate and a low
user CPU utilization can indicate a memory thrashing issue.
2. mpstat <time interval> <count>
A standard UNIX alternative to topas.
mpstat 2 will monitor CPU activity across all logical CPUs at 2 second intervals.
Use this utility to check that tasks are properly bound to the right logical CPUs. Check
that CPUs record a high utilization rate.
3. vmstat -l
This utility will monitor user CPU utilization, page swap activity and how many large
pages are in use.
4. iostat
A similar command line format to mpstat, iostat is a standard UNIX tool that reports disk
I/O activity.
5. svmon
svmon -P <pid> shows a detailed analysis of how each process uses memory, and
where the memory for each process is located. In other words, it can tell you how close a
task's memory is to the CPU that the task is executing on, and therefore how well
memory affinity is working.
6. dbx/pdbx
These are the preferred AIX debuggers for serial and parallel applications.
8 Performance Results
8.1 HPC Benchmarks on AIX
High Performance Computing is known for its diversity of algorithms and applications.
There are no industry standard benchmarks that are representative of production use of
HPC systems. One user's requirements can be quite different from another's, and it is
always best to evaluate an HPC system with one's own workloads. However, it is not
always practical to benchmark with a user's workload, given logistical difficulties such as
access to a benchmarking system. To provide insight into Power 755 cluster performance,
the following sections present performance results on several well-known benchmark
suites such as SPEC CPU2006 and the NAS Parallel Benchmarks.
8.1.1 STREAM, Linpack, SPEC CPU2006
Table 8-1 Power 755 performance


                                   Power 755
                                   AIX 6.1 TL3
Frequency (GHz)                    3.3
No. of logical cores/system        32
Peak performance (GFLOPS) (4)      844
STREAM (triad) (GB/s)              121.9
Linpack (HPL) (GFLOPS)             819.9
SPECfp_rate2006                    825
SPECint_rate2006                   1010
8.1.2 NAS Parallel Benchmarks Class D (MPI)
The NAS Parallel Benchmarks are a suite commonly used in HPC. We used a Power
755 cluster with an InfiniBand network for benchmarking. The cluster has 4 Power 755
nodes in it. cg is a conjugate gradient benchmark which results in sparse
matrix-vector computations stressing memory bandwidth within a 755 node. ft, on the
other hand, uses a 3D FFT method exercising the all-to-all communication pattern in the
cluster. mg is the multi-grid benchmark, while lu and sp are pseudo-application benchmarks
representative of structured grid computations used in CFD. Performance (Mop/s) as
reported by each individual parallel benchmark is shown in the table below.
Table 8-2 NAS Parallel Benchmark Performance on Power 755 Cluster
Cluster Size        MPI       bt.D     cg.D    ft.D    lu.D     mg.D     sp.D
(# of 755 nodes)    Tasks     Mop/s    Mop/s   Mop/s   Mop/s    Mop/s    Mop/s
1                   32 (a)    63282    9689    (*)     53176    38065    20117
2                   64        155703   17417   44487   134300   104711   48660
4                   128 (b)   285389   31135   73887   270683   192712   97529
NOTES:
(a) 25 tasks for bt.D, sp.D
(b) 121 tasks for bt.D, sp.D
(*) did not run due to insufficient memory in the 755 node


(4) This assumes that the core frequency stays at 3.3 GHz. The system can be forced to run at 3.61 GHz through
the Dynamic Power Savings mode. In this mode, the peak rate would be 924 GFLOPS.
8.1.3 Weather Benchmarks
A weather model from the National Centers for Environmental Prediction (NCEP) was
tested on a single Power 755 system. The Global Forecast System (GFS), a spectral
model (using spherical harmonic basis functions) was run at T190 resolution (about
70KM). Both systems, Power 755 and POWER6-575 were booted in SMT2 mode with 64
logical CPUs in each system. Results are presented in the table below for the Power 755
as well as a POWER6-575.
Table 8-3 GFS performance comparison
System        Cores   MPI tasks   Binding to Logical CPUs    Elapsed Time (s)
Power 755     16      16          0,4,8,12,...,60            322.32
POWER6-575    16      16          0,4,8,12,...,60            459.81
Power 755     16      32          0,1,4,5,8,9,...,62,63      300.68
POWER6-575    16      32          0,1,4,5,8,9,...,62,63      342.45
Power 755     32      32          0,2,4,6,...,62             174.10
POWER6-575    32      32          0,2,4,6,...,62             232.88
Power 755     32      64          0,1,2,3,...,63             172.60
POWER6-575    32      64          0,1,2,3,...,63             173.84

The compilation flags (for the makefile) and environment variable settings (for the runs)
are listed in Appendix J.
8.2 HPC Benchmarks on Linux
All benchmarks have been run on a 3.3 GHz Power 755 node running SLES11.1rc1.
8.2.1 Linpack
The Linpack benchmark is run for SMT off, SMT2 and SMT4 modes. The benchmark
settings are N=156016 (about 180 GB of memory) and NB=128, with P and Q adjusted
according to the table below.
Table 8-4 Linpack performance for SMT off, SMT2 and SMT4
                               Power 755      Power 755       Power 755
No. of logical cores/system    32              64              128
Peak performance (GFLOPS) (5)  844             844             844
Linpack (HPL) (GFLOPS)         811 (P=4,Q=8)   674 (P=4,Q=16)  661 (P=4,Q=32)
8.2.2 NAS Parallel Benchmarks Class D

8.2.2.1 MPI Benchmarks
All benchmarks were compiled with -q64 -O3 -qarch=auto -qtune=auto and run with
OpenMPI version 1.3.2. The runs use 32 (blue), 64 (red) and 128 (yellow) threads.
Mop/s can be thought of as a speed scale, so higher means faster performance.
Note that the ppc64_cpu utility normally changes the SMT mode by just activating or
deactivating the logical processors through the sysfs interface. By default, SMT2 mode
will activate logical CPUs N and N+1 and disable N+2 and N+3 for a physical processor

(5) This assumes that the core frequency stays at 3.3 GHz. The system can be forced to run at 3.61 GHz through
the Dynamic Power Savings mode. In this mode, the peak rate would be 924 GFLOPS. See chapter 12 for
more information on Dynamic Power Savings mode.
labeled number N (in POWER7, N is a multiple of 4). This set of logical CPUs does not work
correctly with XLSMPOPTS=STARTPROC=0:STRIDE=2, which assumes that the
available logical processor numbers are always a multiple of 2. So ppc64_cpu had
to be used to activate logical CPUs N and N+2 and disable N+1 and N+3.




8.2.2.2 OMP Benchmarks
All benchmarks were compiled with -q64 -O3 -qarch=auto -qtune=auto -qsmp=omp.
The OMP runs use the following (exported) environment variables:
1. OMP_DYNAMIC=FALSE - disable dynamic adjustment of the number of threads
2. XLSMPOPTS=SPINS=0:YIELDS=0 - lowered from the defaults of 100 and 10 to 0 so
that OMP threads neither busy-wait nor yield before sleeping
3. XLSMPOPTS=STACK=8000000 - raised from the default of 4194304 to allocate a
larger stack for each OMP thread
4. XLSMPOPTS=STARTPROC=0:STRIDE=N - to correctly bind OMP threads to logical
processors, where N depends on the SMT configuration used:
N=1 for 128 threads, 2 for 64 threads and 4 for 32 threads. A combined setting is sketched below.
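Putting these together, a sketch of the environment for a 64-thread SMT2 run (note that the
individual XLSMPOPTS settings listed above are combined into one colon-separated string;
OMP_NUM_THREADS is the standard OpenMP thread-count variable):
export OMP_DYNAMIC=FALSE
export OMP_NUM_THREADS=64
export XLSMPOPTS=SPINS=0:YIELDS=0:STACK=8000000:STARTPROC=0:STRIDE=2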
[Chart: NASPB3.3-OMP, Power7 SMT off/2/4 - total Mop/s for bt.D, ft.D, lu.D, mg.D and sp.D
with 32 threads (SMT off), 64 threads (SMT2) and 128 threads (SMT4)]

[Chart: NASPB3.3-OMP, Power7 SMT off/2/4 - total Mop/s for ep.D and is.D
with 32 threads (SMT off), 64 threads (SMT2) and 128 threads (SMT4)]
Note that the 128-thread lu.D benchmark failed with a segfault, so it is not reported. Also,
the 32-thread ua.D failed to finish, so that entire benchmark is not included.
8.2.2.3 SPECMPI2007 Benchmarks
The SPECMPI2007 scores are for 32(blue), 64(red) and 128(yellow) thread runs. SMT2
mode gets a better overall score, but the best SMT mode (off, SMT2, SMT4) depends
strongly on the application being run.


Note that some benchmarks, like 104.milc, 113.GemsFDTD, 128.GAPgeofem and 137.lu
show a performance decrease from SMT2 to SMT4. This may be rooted in a problem
with OpenMPI in which it does not distribute the MPI tasks evenly among processors.
9 VSX Vector Scalar Extensions
9.1 VSX Architecture
IBM's Vector/SIMD Multimedia eXtension (VMX, also known as AltiVec) is an extension to the
PowerPC Architecture. It defines additional registers and instructions to support single-
instruction multiple-data (SIMD) operations that accelerate data-intensive tasks.
The VMX/AltiVec extensions to the PowerPC Architecture were developed jointly by
Apple Computer, IBM, and Motorola. Apple Computer and Motorola use different
terminology to refer to the VMX/AltiVec extensions of the PowerPC Architecture.
Specifically, Motorola uses the term "AltiVec", and Apple uses the term "Velocity Engine".
From now on, this paper just uses the less cumbersome term "Altivec".
The Vector-Scalar floating point eXtension architecture (VSX) has been developed by
IBM to extend SIMD support to include two independent 2-way-SIMD double precision
floating point (FP) operations per cycle. The Altivec SIMD features are a subset of VSX.
Vector technology provides a software model that accelerates the performance of various
software applications and extends the instruction set architecture (ISA) of the PowerPC
architecture. The instruction set is based on separate vector/SIMD-style (single
instruction stream, multiple data streams) execution units that have a high degree of data
parallelism. This high data parallelism can perform operations on multiple data elements
in a single instruction.
9.1.1 Note on Terminology:
The term "vector", as used in chapters 9 and 10, refers to the spatial parallel processing
of short, fixed-length one-dimensional arrays performed by an execution unit. This is
the classical SIMD execution of multiple data streams with one instruction. It should not
be confused with the temporal parallel (pipelined) processing of long, variable length
vectors performed by classical vector machines. The definition is discussed further in the
next section.
For POWER systems, the VSX term has been used to highlight the double-precision
arithmetic instructions supported by the POWER7 hardware, with "Altivec" reserved for
the older single-precision (32-bit) arithmetic SIMD support.
9.2 A Short Vector Processing History
The basic concept behind vector processing is to enhance the performance of data-
intensive applications by providing hardware support for operations that can manipulate
an entire vector (or array) of data in a single operation. The number of data elements
operated upon at a time is called the vector length.
Scalar processors perform operations that manipulate single data elements such as
fixed-point or floating-point numbers. For example, scalar processors usually have an
instruction that adds two integers to produce a single-integer result.
Vector processors perform operations on multiple data elements arranged in groups
called vectors (or arrays). For example, a vector add operation to add two vectors
performs a pair-wise addition of each element of one source vector with the
corresponding element of the other source vector. It places the result in the
corresponding element of the destination vector. Typically a single vector operation on
vectors of length n is equivalent to performing n scalar operations.
Figure 9-1 illustrates the difference between scalar and vector operations.
Figure 9-1 Scalar and vector operations

Processor designers are continually looking for ways to improve application performance.
The addition of vector operations to a processors architecture is one method that a
processor designer can use to make it easier to improve the peak performance of a
processor. However, the actual performance improvements that can be obtained for a
specific application depend on how well the application can exploit vector operations and
avoid other system bottlenecks like memory bandwidth.
The concept of vector processing has existed since the 1950s. Early implementations of
vector processing (known as array processing) were installed in the 1960s. They used
special purpose peripherals attached to general purpose computers. An example is the
IBM 2938 Array Processor, which could be attached to some models of the IBM
System/360. This was followed by the IBM 3838 Array Processor in later years.
By the mid-1970s, vector processing became an integral part of the main processor in
large supercomputers manufactured by companies such as Cray Research. By the mid-
1980s, vector processing became available as an optional feature on large general-
purpose computers such as the IBM 3090.
In the 1990s, developers of microprocessors used in desktop computers adapted the
concept of vector processing to enhance the capability of their microprocessors when
running desktop multimedia applications. These capabilities were usually referred to as
Single Instruction Multiple Data (SIMD) extensions and operated on short vectors.
Examples of SIMD extensions in widespread use today include:
Intel Multimedia Extensions (MMX)
Intel Streaming SIMD Extensions (SSE)
AMD 3DNow!
Motorola AltiVec and IBM VMX/AltiVec
IBM VSX
The SIMD extensions found in microprocessors used in desktop computers operate on
short vectors of length 2, 4, 8, or 16. This is in contrast to the classic vector
supercomputers that can often exploit long vectors of length 64 or more.











[Figure 9-1 content: a "Scalar Add Operation" produces one result per instruction, while a
"Vector Add Operation" adds corresponding elements of two 4-element vectors in a single instruction]
9.3 VSU Overview
The POWER7 architecture continues to support the Altivec instruction set, and extends
support to double precision floating point operations with the VSX instruction set. The
VSU (Vector Scalar Unit) is the hardware that implements the Altivec, VSX and scalar
floating point instructions. There are no longer separate scalar FPUs in the POWER7
core.
There are 12 execution units in the POWER7 core
2 symmetric load/store units (LSU), also capable of executing simple fixed-point
ops
2 symmetric fixed-point units (FXU)
4 floating-point units (FPU), implemented as two 2-way SIMD operations for
double- and single-precision. Scalar binary floating point instructions can only
use two of these FPUs.
1 Altivec execution unit capable of executing simple FX, complex FX, permute
and 4-way SIMD single-precision floating point (FP) ops
1 decimal floating-point unit (DFU)
1 Branch execution unit (BR)
1 CR Logical execution unit (CRL)
The VSU is the combination of the Altivec execution unit and the 4 FPUs. It is divided into
two independent pipes, each of which can execute one instruction per cycle. Each pipe
can independently execute a scalar double-precision FP op or a SIMD double-precision
FP op. All SIMD operations are on 16-byte vectors of data.
Mimicking the behavior of the original Altivec unit, pipe0 handles the simple FX, complex
FX and 4-way SIMD single-precision FP ops and pipe1 handles the Altivec permute ops.
Figure 9-2 POWER7 with VSU block diagram
[Block diagram: the POWER7 core with the VSU, showing the IFU, ISU, global completion
table, LSU with store queue, the vector/scalar register file (VSR), the four floating-point units
(BFU0..3), the 128-bit simple/complex vector ALU, the 128-bit permute unit, and the DFU]
Notes for the programmer:
1. Up to two instructions can be issued to the VSU in a given cycle, one for each
pipeline.
2. Instructions executed by pipe0 can be a 128-bit simple fixed point operation
(Altivec), 128-bit complex fixed point operation (Altivec), 4-way SIMD single-
precision FPU operation (Altivec), a 2-way SIMD double-precision FPU operation
(VSX) or a scalar floating point (single or double precision) operation.
3. Instructions executed by pipe1 can be a 128-bit permute (Altivec or VSX
permute), a store, a scalar floating point (single or double precision) operation, or
a 2-way SIMD double-precision FPU operation (VSX).
4. This means there can be two VSX instructions executing simultaneously, each
handling 2 double-precision FP operations. Since each operation can be an FP
multiply-add (FMA), this gives a peak of 2x2x2 = 8 double-precision FP
operations per cycle. For a Power 755 core this is a peak rate of 26.4 GFLOPS from
the VSU (the arithmetic is spelled out after this list).
5. Note that, unlike previous implementations, the scalar and vector FP
operations are all executed within the VSU.
6. For Altivec operations, a vector permute can be issued in the same cycle as a
vector floating point operation, with the FP op using VSU pipe0 and the permute
using pipe1.
7. Because there are two scalar FXU pipelines independent of the VSU, two
additional FXU operations, for logical operations and/or array indexing, can be
executed at the same time as VSU operations.
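As a quick check of the peak figures quoted earlier, the arithmetic behind them is:
2 pipes x 2-way SIMD x 2 flops per FMA = 8 double-precision flops per cycle per core
8 flops/cycle x 3.3 GHz = 26.4 GFLOPS per core
26.4 GFLOPS x 32 cores = 844.8 GFLOPS per Power 755 system (quoted as 844 GFLOPS in Table 8-1)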
In ST mode (SMT off), there are now 64 16-byte registers and 80 16-byte rename
registers available for single-precision, double-precision, and vector operations; these are
referred to as the VSR (vector/scalar register file). These registers can also be used to hold
scalar data. In addition, for scalar operations, there are 32 GPRs and 77 rename GPRs.
See Table 3-1 for how these registers are divided up in SMT2 and SMT4.
The Altivec and VSX extensions to PowerPC Architecture define instructions that use the
VSU to manipulate scalars and vectors stored in the VSR. These instructions fall into
these categories:
Scalar floating point arithmetic instructions (on 32-bit and 64-bit real data)
Vector integer arithmetic instructions (on 8-bit, 16-bit, or 32-bit integers)
Vector floating-point arithmetic instructions (32-bit and 64 bit)
Vector load and store instructions
Vector permutation and formatting instructions
Processor control instructions used to read and write from the VSU status and control
register
Memory control instructions used to manage prefetch and caches
This technology can be thought of as a set of registers and execution units that have
been added to the PowerPC architecture in a manner analogous to the addition of
floating-point units. Floating-point units were added to provide support for high-precision
scientific calculations and the vector technology is added to the PowerPC architecture to
accelerate the next level of performance-driven, high-bandwidth communications and
computing applications.
For additional information about the topics presented in this chapter, the interested
reader can refer to
PowerPC Microprocessor Family: Vector/SIMD Multimedia Extension Technology Programming
Environments Manual
This manual is on the Web.
The VSX instruction set is described in detail in Book I, chapter 7 of
PowerISA AS Version 2.06
This document is on the Web, too.
Compiler Options
There are several options needed to build a hand-coded Altivec/VSX application. The
recommended command lines are:
For C:
xlc -qarch=pwr7 -qtune=pwr7 -O3 -qhot -qaltivec
For Fortran:
xlf -qarch=pwr7 -qtune=pwr7 -O3 -qhot
The options are broken down as follows:
1. Make sure the architecture supports VSX
-qarch=pwr7 -qtune=pwr7
2. Make sure VSX instructions are recognized at all compiler stages (this is only
needed for C/C++ source code)
-qaltivec
3. Provide a suitable level of optimization
-O3 -qhot
or
-O4
-O5 (highest optimization level)
Adding the options
-qsource -qlist -qreport
will generate a listing file including optimization information.
As noted in section 3.1, the first compiler versions that recognize the POWER7
architecture are XL C/C++ 11.1 and XL Fortran 13.1.
9.4 Vectorization Overview
So, repeating the point of section 9.2, the reason to care about vector technology is
performance: vector technology can provide dramatic performance gains, up to 2 times
the best scalar performance in some cases.
So how does SIMD code differ from scalar code? Compare the code fragments in
Examples 9-1 and 9-2.
Example 9-1 Scalar vector addition
float *a, *b, *c;
...
for (i = 0; i < n; i++)
{
    a[i] = b[i] + c[i];          // scalar version
}
Example 9-2 Vectorization of scalar addition
vector float *a, *b, *c;
...
for (i = 0; i < n / vector_size; i++)
{
    a[i] = vec_add(b[i], c[i]);  // vectorized version
}
In Example 9-2, the 32-bit scalar data type has been replaced by a vector data type. Note
that the loop range is no longer n iterations, but is reduced by the vector data type length
(vector_size = 4 for floats when using Altivec). Remember, an Altivec vector register can
hold 128 bits (16 bytes) of data. Therefore, the vector addition operation, a[i] =
vec_add(b[i], c[i]), can execute 4 add operations with a single instruction for each vector,
as opposed to multiple scalar instructions. With two FPU pipes for scalar addition and
one Altivec pipe for single-precision SIMD addition, the vectorized version can be up to 2
times faster than the scalar version.
Vector technology in the POWER7 supports a diverse set of applications in the following
areas:
Digital Signal Processing
Bioinformatics
A broad range of scientific applications in physics and chemistry.
A broad range of engineering applications dedicated to solving the partial
differential equations of their respective fields.
In summary, vector technology found in the POWER7 defines the following:
A fixed 128-bit wide vector length that can be subdivided into sixteen 8-bit bytes, eight
16-bit halfwords, four 32-bit words or two 64-bit doublewords.
The vector registers are separate from general-purpose registers (GPRs).
Scalar single and double-precision floating point arithmetic
Vector integer and floating-point arithmetic
Four operands for most instructions (three source operands and one result)
Saturation clamping, where unsigned results are clamped to zero on underflow and to the
maximum positive integer value (2^n - 1, for example, 255 for byte fields) on overflow. For
signed results, saturation clamps results to the smallest representable negative number
(-2^(n-1), for example, -128 for byte fields) on underflow, and to the largest representable
positive number (2^(n-1) - 1, for example, +127 for byte fields) on overflow.
No mode switching that would increase the overhead of using the instructions.
Operations are selected based on utility to digital signal processing algorithms (including
3D).
Vector instructions provide a vector compare and select mechanism to implement
conditional execution as the preferred way to control data flow in vector programs.
Enhanced cache and memory interface.
10 Auto-vectorization
The basic command lines for compiling C and Fortran source code for POWER7 are
discussed in section 9.4. To automatically vectorize code, the recommended command
lines are:
For C:
xlc -qarch=pwr7 -qtune=pwr7 -O3 -qhot -qaltivec -qsimd=auto
For Fortran:
xlf -qarch=pwr7 -qtune=pwr7 -O3 -qhot -qsimd=auto
The additional option, -qsimd=auto, explicitly asks the compiler to auto-
vectorize loops. The option -qsimd is an alias that means the same thing.
Note: -qhot=[no]simd as well as the -qenablevmx option have been deprecated in the
XLC v11 and XLF v13 releases.
SIMD vectorization is not enabled by default. However, if -qsimd=auto has been
used, the following directives will disable vectorization for a specific loop:
For a C loop:        #pragma nosimd
For a Fortran loop:  !IBM* NOSIMD
One can look at the vectorization report by adding the -qsource -qreport options. The
listing file (suffix .lst) prints the source code, points out loop candidates for auto-
vectorization and lists messages explaining all of the successful and unsuccessful
vectorization attempts. Alternatively, one can generate a listing file using the -qlist
option and look for success/failure of vectorization along with detailed reasons. For
example, one can expect reports of the form shown below for a successful vectorization
of the inner loop of a function in the figure below (some code isn't shown for readability):
1586-542 (I) Loop (loop index 2 with nest-level 1 and iteration count) at
foo.c <line xx> was SIMD vectorized.
1586-543 (I) <SIMD info> Total number of the innermost loops considered <"2">.
Total number of the innermost loops SIMD vectorized <"1">.

#define ITER 10
void foo(int size)
{
    int i, j;
    float *x, *y, *a;
    int iter_count = 1024;
    ...
    ...
    for (j = 0; j < ITER; j++) {
        for (i = 0; i < iter_count; i += 1) {
            x[i] = y[i] + a[i+1];
        }
    }
}
After building a program with auto-vectorization, test it out. If the performance is not as
expected, the programmer can refer to comments in the listing provided by -qsource
-qreport to identify why loops failed to auto-vectorize and for direction on how to correct
the code so that it auto-vectorizes properly.
Inhibitors of Auto-vectorization
Here are some common conditions which can prevent the compiler from performing auto-
vectorization.
10.1.1 Loop-carried Data Dependencies
In the presence of loop-carried data dependences, where data is read after it is written
(see section 11.4.2), auto-vectorization cannot be carried out. As shown in the simple
code snippet below, in every iteration i of the loop, c[i-1], which was written in iteration
i-1, is read. If such a loop were transformed using VSX instructions, it would result in
incorrect values being computed in the c array.
for (i = 0; i < N; i++)
    c[i] = c[i-1] + 1;
Certain compiler transformations applied to loops may help resolve certain kinds of loop-
carried data dependences and enable auto-vectorization; a contrasting loop with no such
dependence is sketched below.
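For contrast, a sketch of a similar loop with no loop-carried dependence (each iteration
reads and writes only its own elements), which the compiler is free to auto-vectorize:
/* every iteration is independent: c[i] depends only on a[i] and b[i] */
for (i = 0; i < N; i++)
    c[i] = a[i] + b[i];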
10.1.2 Memory Aliasing
When loops contain data accesses which can potentially overlap, the compiler may refrain from
auto-vectorization, as vectorizing could lead to incorrect results. For the code snippet shown below, if
the memory locations of the pointers a,b,c denoted by a[0..N-1], b[0..N-1] and c[0..N-1]
do not overlap, and the compiler can statically deduce this fact, the loop will be vectorized
and Altivec/VSX instructions generated for it. However, in general, it may be non-trivial
for the compiler to deduce this at compile-time and subsequently the loop may remain
serial. This can happen even when in reality the memory locations do not overlap.
double foo(double *a, double *b, double *c)
{
    for (i = 0; i < N; i++)
        a[i] = b[i] + c[i];
}
The user can help the compiler resolve the memory aliasing issues by one of the
following possible methods:
- use -O3 -qipa or -O5 (which performs interprocedural analysis)
- tell the compiler when memory is disjoint using: #pragma disjoint(*a, *b)
For the example above, two statements of the form #pragma disjoint(*a,*b) and #pragma
disjoint(*a,*c) will enable the compiler to make safe assumptions about the aliasing and
subsequently vectorize the loop, as sketched below.
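A sketch of how the pragmas might be placed in the function above (XL C requires
#pragma disjoint to appear after the pointers are declared and before their first use; the
added length parameter n and the return value are illustrative):
/* sketch: n is passed in; the document's snippet used a global N */
double foo(double *a, double *b, double *c, int n)
{
    int i;
#pragma disjoint(*a, *b)    /* promise the compiler that *a and *b never overlap */
#pragma disjoint(*a, *c)    /* promise the compiler that *a and *c never overlap */
    for (i = 0; i < n; i++)
        a[i] = b[i] + c[i];
    return a[0];
}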
In certain situations (where memory overlap pragmas are not provided or the compiler
cannot safely analyze overlaps) the compiler may generate versioned code to compute
the overlap at runtime, i.e., it inserts code to test for memory overlap and selects auto-
vectorized or serial code based on whether the test passes or fails at runtime. For the
loop in question, the compiler may generate a test of the form shown below:
if (((((long)a + n * 8) - 8 < (long)c | (n * 8 + (long)c) - 8 < (long)a) &
     ((n * 8 + (long)b) - 8 < (long)c | (n * 8 + (long)c) - 8 < (long)b))) {
    // auto-vectorized code
}
else
    // original serial code
10.1.3 Non-stride-1 Accesses
For auto-vectorization to happen, usually, the compiler expects the accesses of the data
to happen from contiguous locations in memory (what we refer to as stride-1 accesses)
so that once the vector stores and loads happen, the resultant data can be fed directly to
the computation operands. However, with non stride-1 accesses, data may not be loaded
or stored from contiguous locations, leading to several complexities in code generation.
Even if the non-contiguous loading/storing pattern is known, the compiler would now
need to generate certain extra instructions to pack/unpack the data into vector registers
before they can be operated upon. These extra instructions (not present in the serial
code) may increase the cost of implementing auto-vectorization and the compiler may
pre-empt the decision to auto-vectorize those loops based on heuristic/profile-driven cost
analyses. In the code snippet shown below, the accesses of array b are non-unit-stride
as another array c is used to index into b. The compiler does not auto-vectorize such a
loop.
for (i = 0; i < N; i++)
    a[i] = b[c[i]]

10.1.4 Complexities Associated with the Structure of the Loop
A loop which is a candidate for auto-vectorization may exhibit one or more of the
following characteristics, which may inhibit vectorization:
- contains control flow (a restricted form of if/then/else is allowed):
for (i = 0; i < n; i++) {
    if (i < 8)
        c[i] = a[i] + b[i];
    else
        c[i] = a[i] - b[i];
}
This kind of control-flow is handled by the compiler by splitting the loop into two separate
loops and vectorizing them:
for (i = 0; i < 8; i++) {
    c[i] = a[i] + b[i]
}
and
for (i = 8; i < n; i++) {
    c[i] = a[i] - b[i]
}
- contains function call: (no function calls)
- trip count too small: (short loops not profitable)
The user may help the compiler auto-vectorize such a loop by following some of the
policies noted below. Also, the current compiler does support loops with if-then-else
constructs in some cases by generating suitable vector instructions (e.g. vselect).
- inline function calls automatically or manually (-O3 -qipa/-O5 or by using inline
  #pragma/directives)
10.1.5 Data Alignment
General alignment issues are also discussed in section 11.3.
For auto-vectorization, the compiler tries to ensure that vector loads/stores are from
aligned memory locations to minimize the misalignment penalty as much as possible.
However, this may not (always) be possible and under such circumstances the compiler
may preempt auto-vectorization as the cost may be higher than serial code. To
understand the kind of alignment issues that may arise with auto-vectorization consider
this code:
for (i=0; i<100; i++)
    a[i] = b[i+1] + c[i+1];
In this code snippet, even when a[0], b[0] and c[0] are aligned on 16-byte boundaries, the
b[i+1] and c[i+1] accesses cause misalignment, and the compiler may re-align the data if
doing so reduces the misalignment penalty. This can be done through various mechanisms,
most of which require extra instructions to be generated. One strategy would be to read
multiple sets of 128 bits of aligned data from b and c and then use a separate instruction
(like vector permute) to combine the requisite parts. The figure below shows such a policy.
Figure 10-1 Handling unaligned data

The strategies discussed above fall into the category of static compile-time alignment
handling. In the case shown above, a[0], b[0] and c[0] are proven to be 16-byte (128-bit)
aligned by compiler analysis techniques or via attributes or keywords of the kinds shown
below:
- align data for the compiler:
    double a[256] __attribute__((aligned(16)));
- tell the compiler it is aligned:
    [C]       __alignx(16, p);
    [Fortran] call alignx(16, a(5))
- all dynamically allocated memory (malloc, alloca) is 16-byte aligned
- all global objects are 16-byte aligned
In situations where the compiler is unable to analyze the alignment of the data and no
alignment-related keywords are provided by the user, the compiler is forced to assume
that the data are unaligned. In such cases, the compiler can generate versioned code
based on the runtime alignment of the data. For the following example, the alignment of
the data arrays x and y is unknown to the compiler. This results in versioned code where
the vectorized path is executed only when both x[0] and y[0] are aligned on 16-byte
boundaries. This test is done via the (x&15 | y&15) check.
float a1 = 5.4;
for (j = 0; j < ITER; j++) {
    for (i = 0; i < iter_count; i+=1) {
        x[i] = a1 + y[i];
        count++;
    }
}

if (!(!((long) x & 15 | (long) y & 15))) goto lab_41;
@CIV1 = 0;
do {
    28 |   *(((vector float <4> *) x) + )
             = splat(5.40000009E+00) + *(((vector float <4> *) y) + );
    27 |   @CIV1 = @CIV1 + 1;
} while ((unsigned) @CIV1 < (unsigned) ((iter_count - 5) / 4 + 1));

lab_41:
@CIV0 = 0;
do {
    28 |   ((float *) x)[@CIV0] = 5.40000009E+00 + ((float *) y)[@CIV0];
    27 |   @CIV0 = @CIV0 + 1;
} while ((unsigned) @CIV0 < (unsigned) iter_count);   /* ~43 */
11 VSX and Altivec Programming
Currently, there isn't a widely accepted recipe for deciding whether to convert a code to
exploit vectorization. Many of the concepts that allow compilers to auto-vectorize code
also apply to hand-coded approaches to vectorizing code. In addition, the effort involved
in manually converting a code to use vector data types and operations raises additional
issues to be aware of.
11.1 Handling Data Loads
The VSU has two load/store execution pipelines with a 2-cycle load-to-use latency, so
32 bytes can be loaded from the D-cache (into vector registers) in each cycle. The
D-cache can also be loaded from the L2 at the same rate (32 bytes per cycle).
The best way to exploit the LSU for double precision (VSX) vectors is to load (and work
on) vectors in pairs. This is different from earlier SIMD architectures, where Altivec
supported at most one load per clock.
11.2 Performance Improvement of VSX/Altivec-enabled Code Over
Scalar Code
Given the various data types that can be used as vectors, the following performance
gains are possible:
On POWER7, there are 2 scalar floating point units (FPUs) and 2 scalar fixed point units
(FXUs) but only one VSU (with one embedded Altivec unit). So at peak performance,
VSX/Altivec code can be 2 times faster than scalar code for 64-bit floating point, 32-bit
floating point and integer arithmetic, 4 times faster for 16-bit (short) integer arithmetic and
8 times faster for 8-bit (byte) integer arithmetic.
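As a quick back-of-the-envelope check on the 2x figure for double precision, using the unit counts above and the FMA rates discussed later in section 11.3.7:

    scalar peak: 2 FPUs x 1 FMA/cycle x 2 flops/FMA                = 4 DP flops per cycle
    VSX peak:    2 vector FMAs/cycle x 2 doubles x 2 flops/FMA     = 8 DP flops per cycle
    ratio:       8 / 4                                             = 2x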
These performance gains assume both corresponding scalar functional units are fully
utilized. Performance is limited by how well the application can keep the VSU busy as
well as cache reuse, memory bandwidth and memory alignment considerations. These
issues are addressed in later sections. Less efficient scalar code will result in measuring
a greater speedup than expected. The most valid scalar-SIMD comparison is between
optimum scalar and SIMD code.
11.3 Memory Alignment
Many issues are already covered in section 10.1.5. Here are some additional suggestions
for handling dynamically allocated data that is unaligned.
11.3.1 AIX
The malloc() routine should already be allocating one-dimensional arrays on 16-byte
boundaries by default. But just in case, AIX 6L has a useful routine that explicitly
forces malloc()'d one-dimensional arrays to align on 16-byte boundaries: vec_malloc().
It can be used in IBM XL C/C++ routines in place of any standard malloc() call; in fact,
an easy way to ensure arrays are malloc'd properly is to use the C preprocessor:
#define malloc vec_malloc
Automatic and local arrays are already properly aligned, i.e. a[0] will be on a 16-byte
address boundary 0xXXXXXXXX0.
For XL Fortran programs running on AIX, an easy way to allocate aligned one-dimensional
arrays is to provide a Fortran wrapper that calls vec_malloc().
An alternative for any executable program is to set the MALLOCALIGN environment
variable:
export MALLOCALIGN=16
This will force all dynamically allocated arrays to align on 16-byte boundaries.
11.3.2 Linux
For Linux on POWER, programmers can use the memalign() library function to align
dynamically allocated one-dimensional arrays on 16-byte boundaries.
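A minimal sketch of such an allocation, assuming glibc's memalign() as declared in <malloc.h> (posix_memalign() would work equally well), might look like this:

#include <malloc.h>    /* memalign() */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n = 1024;
    /* request a 16-byte aligned buffer of n doubles */
    double *a = (double *) memalign(16, n * sizeof(double));
    if (a == NULL)
        return 1;
    /* low 4 address bits of zero confirm 16-byte alignment */
    printf("a = %p, 16-byte aligned: %s\n", (void *) a,
           ((unsigned long) a & 15UL) ? "no" : "yes");
    free(a);
    return 0;
}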
11.3.3 Multiple Array Offsets in a Loop
A common problem is where array references with different offsets (e.g. A(i) and A(i+1)
for single precision data) appear in the same loop. Usually, extra loads and permutes are
needed to manage the data. There is no guarantee that vector performance will be better
than scalar performance; the extra instruction overhead can offset any advantage of
using SIMD instructions.
11.3.4 Multidimensional Arrays
Working with multidimensional arrays is more of a challenge. Those arrays whose
leading dimensions will not allow a row (for C) or a column (for Fortran) to load evenly
into vectors require the overhead of additional instructions to handle partial vectors at the
matrix boundaries. This overhead can offset any potential performance gains from using
SIMD instructions.
To make it easier to align a multidimensional array, one suggestion is to allocate a large
enough one-dimensional array. Then, use an array of pointers or another language-
appropriate mechanism to reference the allocated space as if it were a multi-dimensional
array.
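As a sketch of this suggestion (assuming Linux and the hypothetical helper name alloc2d; on AIX, vec_malloc() could be used in place of memalign()):

#include <malloc.h>    /* memalign() */
#include <stdlib.h>

/* View one 16-byte aligned 1-D block as an nrows x ncols "2-D" array
   through a table of row pointers.  Each row is padded to a multiple
   of two doubles so that every row starts on a 16-byte boundary.     */
double **alloc2d(int nrows, int ncols)
{
    int ldc = (ncols + 1) & ~1;              /* padded leading dimension */
    double *block = (double *) memalign(16,
                        (size_t) nrows * ldc * sizeof(double));
    double **rows = (double **) malloc(nrows * sizeof(double *));
    int i;
    if (block == NULL || rows == NULL)
        return NULL;
    for (i = 0; i < nrows; i++)
        rows[i] = block + (size_t) i * ldc;
    return rows;                              /* rows[i][j] indexes the block */
}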
Vector Programming Strategies
A lot of confusion arises when talking about whether SIMD programming is worth the
effort or works. While Altivec and VSX have differences in programming details, they
have similar criteria when it comes to deciding how to convert a candidate (scalar)
program to exploit SIMD instructions to increase performance. To understand some of
the tradeoffs involved, this section classifies SIMD programming approaches into 3
categories: local loop changes, local algorithmic changes, and global data restructuring.
The order goes from easiest to implement to hardest.
It costs nothing to first try auto-vectorization and see if it improves code performance,
but oftentimes it won't. Providing the compiler with the flags -qreport -qlist will describe
the reasons that the compiler failed to auto-vectorize any particular loop. This is further
discussed in the introduction to chapter 10.
Note that the examples that follow focus on high calculation rates. They assume that the
dominant floating point arithmetic operations are multiplies and adds and similarly
pipelined floating point instructions. But the pursuit of high gigaflop rates is not the only
situation where Altivec/VSX operations have a potential performance advantage over
their scalar equivalents. If a significant fraction of the calculations are floating point
divides and/or square roots, the higher execution latency and the lack of pipelining give
a performance advantage to Altivec/VSX versions of these operations. Two examples of
codes that take advantage of VSX operations to improve performance are the SPEC
CPU2006 versions of WRF and CactusADM.

11.3.5 Some Requirements for Efficient Loop Vectorization
One way to look at SIMD programming is to list all of the things that can go wrong, and
the methods used to solve the problems.
Besides the aforementioned data flow dependencies, issues to be handled are:
1. Make sure the arrays that provide the data to form Altivec/VSX vectors are aligned
as discussed in section 10.1.5.
2. Iteration-dependent branching inside loops can remove the loop as a candidate
for SIMD conversion. The vec_sel() intrinsic can handle conditions that depend
on the loop iteration, but the performance can fall off rapidly with every if-test.
3. Iterate through arrays with a unit stride (i++). There are common situations
where iteration is not unit stride (e.g. red-black lattice order). In these cases, the
array elements can be reordered so that the most common iteration patterns can
be done over contiguous chunks of memory. In the case of red-black ordering,
the even elements could all be grouped in the first half of the array and the odd
elements in the second half (a sketch follows this list).
4. Minimize the number of loads relative to all other operations (loads are often the
main performance bottleneck in SIMD programming). This doesn't inhibit
vectorization, but it does limit performance. This can happen in nested loops,
like those found in matrix multiplies.
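For item 3, a minimal sketch of such a reordering is shown below; the function name and layout are illustrative assumptions rather than code from any particular application.

/* Regroup a red-black (alternating even/odd) ordered array so that the
   even-index ("red") sites fill the first half of the output and the
   odd-index ("black") sites fill the second half.  Sweeps over one color
   then become unit-stride and SIMD-friendly.  n is assumed to be even.  */
void reorder_red_black(const float *in, float *out, int n)
{
    int i, half = n / 2;
    for (i = 0; i < half; i++) {
        out[i]        = in[2 * i];        /* red sites  -> first half   */
        out[half + i] = in[2 * i + 1];    /* black sites -> second half */
    }
}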
11.3.6 OpenMP Loops
The most straightforward approach to enabling a code for VSX is to find the (hopefully
few) loops that use the most execution time and vectorize them. In legacy applications,
these loops are often the ones that are parallelized with OpenMP directives or similar
approaches. Many loops that can be successfully parallelized (in the thread-safe sense of
OpenMP) are also candidates for vectorization.
There are some differences between vectorizable loops and OpenMP loops.
1. A loop that includes branch conditions that depend on the iterator (loop index)
value is perfectly OK to implement in parallel threads, but can have a substantial
performance penalty when executed in a SIMD context.
2. A loop that references array elements across more than one iteration will create
race conditions in parallel execution, but may be perfectly fine for SIMD
execution. For example (assuming an incrementing iterator), consider the code
fragment:
A(I) = A(I+1) + B(I)
In a scalar loop, any given element of A() is modified after it is read. This
fragment vectorizes on I (though it does have alignment issues). In contrast, the
vectorized version of
A(I) = A(I-1) + B(I)
does not give the same results as the scalar form. The data to set A(I) comes
from an earlier write of A(I). This is an example of a recursive data
dependence. Note that if the index offset was 4 instead of 1, the data
dependence wouldn't be an issue for Altivec instructions, and if the offset was 2,
it wouldn't be an issue for VSX instructions.
Even though there are differences to watch out for between OpenMP-parallel and
vectorizable loops, these types of loops are the top-of-the-list candidates for
vectorization.
11.3.7 Example: Vectorizing a Simple Loop
As an example, let's see how the SAXPY/DAXPY loop is transformed. This could be
auto-vectorized, but hand coding is used for illustration.
Here is the scalar code snippet (the snippet just includes the highlights; the supporting
code is assumed):
a = (float *) malloc(arraylen*sizeof(float));
b = (float *) malloc(arraylen*sizeof(float));
for (i=0; i<arraylen; i++) {
    b[i] = alpha*a[i] + b[i];
}
And here is its [Altivec] counterpart
a = (float *) vec_malloc(arraylen*sizeof(float));
b = (float *) vec_malloc(arraylen*sizeof(float));
vAlpha = vec_splats(alpha);
for (i=0; i<arraylen; i+=4) {
    vA = (vector float *) &(a[i]);
    vB = (vector float *) &(b[i]);
    *vB = vec_madd(vAlpha, *vA, *vB);
}
Notes:
1. vec_malloc() is used in place of malloc() to force arrays to align on 16-byte
boundaries. Depending on the AIX release level, malloc() may also force arrays
to align properly, but using vec_malloc() always will.
2. vec_splats() is a new VSX intrinsic. It can be used to splat scalar data to all
vector data types. Another (now obsolete) way to do the same thing is the
vec_loadAndSplatScalar() function found at the Apple (Altivec) website.
3. Assigning the vector float pointers to appropriate elements of the arrays is one
way to have the compiler load the data into vector registers. For this case, no
explicit load intrinsics have to be included. However, explicit vector loads may
need to be included in other situations.
4. This is not the fastest way to code a SAXPY calculation. For example, unrolling
the loop (by 4 seems to be a good choice) yields better performance; a sketch
follows these notes. Other techniques can further improve performance. The
loop shown here will not achieve the highest performance possible on a Power 755.
5. The speed measured depends on the size of the arrays. As the arrays get longer,
both the SIMD and scalar calculation rates decrease.
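As an illustration of note 4, a minimal sketch of the base Altivec loop unrolled by 4 might look like the following (arraylen is assumed here to be a multiple of 16 floats so that no cleanup loop is needed):

vAlpha = vec_splats(alpha);
for (i=0; i<arraylen; i+=16) {
    vector float *vA0 = (vector float *) &(a[i]);
    vector float *vB0 = (vector float *) &(b[i]);
    vector float *vA1 = (vector float *) &(a[i+4]);
    vector float *vB1 = (vector float *) &(b[i+4]);
    vector float *vA2 = (vector float *) &(a[i+8]);
    vector float *vB2 = (vector float *) &(b[i+8]);
    vector float *vA3 = (vector float *) &(a[i+12]);
    vector float *vB3 = (vector float *) &(b[i+12]);
    /* four independent multiply-adds per iteration give the scheduler
       more work to overlap loads with computation */
    *vB0 = vec_madd(vAlpha, *vA0, *vB0);
    *vB1 = vec_madd(vAlpha, *vA1, *vB1);
    *vB2 = vec_madd(vAlpha, *vA2, *vB2);
    *vB3 = vec_madd(vAlpha, *vA3, *vB3);
}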
And here is its double-precision (VSX) analog.
a = (double *) vec_malloc(arraylen*sizeof(double));
b = (double *) vec_malloc(arraylen*sizeof(double));
vAlpha = vec_splats(alpha);
for (int i=0; i<arraylen; i+=4) {
    vA1 = (vector double*) &(a[i]);
    vB1 = (vector double*) &(b[i]);
    *vB1 = vec_madd(vAlpha, *vA1, *vB1);
    vA2 = (vector double*) &(a[i+2]);
    vB2 = (vector double*) &(b[i+2]);
    *vB2 = vec_madd(vAlpha, *vA2, *vB2);
}
Notes:
1. A double can be loaded and splatted similarly to a float.
2. The vector multiply-add intrinsic for double precision arithmetic (VSX) is the
same as for single precision (Altivec). Not all corresponding VSX and Altivec
instructions share the same intrinsic.
3. As noted in section 9.3 and elsewhere, a vector double holds only 2 doubles, but
VSX can still do 4 double precision FMAs at a time, by issuing two vec_madd()s
per cycle. The base loop is unrolled by 2 to make sure this happens. This is
analogous to the base SAXPY loop above.
4. Like the Altivec loop, this loop could benefit from more unrolling.
11.3.8 Local Algorithms
Experience has shown that many hot routines are VSX opportunities, but auto-vectorization
doesn't work well enough. To get optimal performance, the routines have to be rewritten
from scratch, probably using a different, more SIMD-friendly algorithm to further improve
the performance. For example, sometimes there are loop data flow dependencies that
inhibit vectorization and have to be reworked to allow vectorization. In any event, the
changes needed to get a large fraction of the runtime executing SIMD instructions are
local to the routines involved, so they are relatively quick to implement, though not as
quick as auto-vectorization, when it works.
The local algorithms approach has been applied to DGEMM (of course), most (if not all)
bioinformatics codes that have SIMD versions of the dynamic programming algorithm,
FFTs, specialized matrix multiplies (like those found in LQCD codes), DSP, video
processing, encryption and other signal processing tasks. Notable examples are the
Smith-Waterman algorithm in FASTA and the HMMER hidden Markov model algorithm.
Freescale maintains a collection of web pages with many useful SIMD algorithms,
techniques and code examples, originally targeted at 32-bit data, but readily adaptable to
handle double-precision VSX.
All local algorithms have one weakness: they can't speed up code that uses at least one
load per arithmetic operation, like the matrix-vector and vector-vector multiply kernels.
These are simply bounded by the POWER7 system memory bandwidth once the problem
size grows beyond the L2 cache. This means that unit tests that stay in cache often
predict too optimistic a performance gain compared to real-life workloads. Caution is in
order.
11.3.9 Global Restructuring
Frequently, a program can't reach peak theoretical VSX performance by using either of
the first two approaches. For example, many programs create arrays of structures (e.g.
3D cells in a finite element program, sites in a QCD lattice) and loop over selected
attributes in each array element. This requires accessing data with a non-unit (but usually
constant) stride through memory requiring aggressive prefetching and wasting memory
bandwidth (because adjacent structure attributes are loaded with the requested data if
they share the same cache line). If an application programmer is willing to put in enough
effort (or if an appropriate tool is available), the program can be transformed from using
an array of structures (AoS) to an array of structures of vectors (AoSoV) approach.
The structures in AoSoV are purposely built so that each element is the size of a 16-byte
vector type, whether the underlying data is single or double precision (32- or 64-bit). This
allows the AoSoV structures to exploit the same scalar algorithm already present, just by
changing the scalar variables to their vector equivalents and adjusting the loop iterators.
For example, the SU3 matrix from MILC (found in the SPEC benchmark suite, as well as
academic MILC) is a scalar structure:
typedef struct {
    float real;
    float imag;
} complex;
typedef struct { complex e[3][3]; } su3_matrix;
The program loops over arrays of a lattice (whose structure in turn contains this structure)
for most of its run time. This structure's size does not allow all array elements to align on
16-byte boundaries.
To illustrate the local algorithm change method, we will pad the structure out so that
each row is a multiple of 16 bytes. This is required for Altivec-enabled code, since the
structure has to be 16-byte aligned to get results that agree with the scalar version. An
altered structure that would work is:
typedef struct { complex e[3][4]; } su3_matrix_vec1;
Alternatively, the original structure can be modified by adding 8 bytes of padding. (This
minimizes unused data in a cache line). This forces more data juggling when loading the
rows in the matrix, but saves 2*8=16 bytes over a su3_matrix_vec1 structure.
Similar arguments apply to a double-precision version of the su3_matrix.
But a better choice for a structure to enable Altivec is:
#define VECTOR_LENGTH 4
typedef struct {
    float real[3][3][VECTOR_LENGTH];
    float imag[3][3][VECTOR_LENGTH];
} su3_matrix_blk_vmx;
and, for VSX,
#define VECTOR_LENGTH 2
typedef struct {
    double real[3][3][VECTOR_LENGTH];
    double imag[3][3][VECTOR_LENGTH];
} su3_matrix_blk_vsx;
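To make the claim concrete, a hedged sketch of one elementwise operation on the VSX blocked structure is shown below. The helper name is an illustrative assumption (not code from MILC), and it assumes the compiler's VSX vec_add() intrinsic for vector double and that the structures are allocated 16-byte aligned (e.g. with vec_malloc()). The loop body has the same shape as the scalar version, but each statement now operates on VECTOR_LENGTH lattice sites at once.

/* Elementwise add of two blocked SU3 matrices (VSX, double precision).
   Each real[i][j] / imag[i][j] row holds VECTOR_LENGTH=2 doubles, i.e.
   exactly one 16-byte vector double, so the scalar loop structure is
   preserved and only the data type changes.                            */
void add_su3_blk_vsx(const su3_matrix_blk_vsx *a,
                     const su3_matrix_blk_vsx *b,
                     su3_matrix_blk_vsx *c)
{
    int i, j;
    for (i = 0; i < 3; i++) {
        for (j = 0; j < 3; j++) {
            *(vector double *) c->real[i][j] =
                vec_add(*(const vector double *) a->real[i][j],
                        *(const vector double *) b->real[i][j]);
            *(vector double *) c->imag[i][j] =
                vec_add(*(const vector double *) a->imag[i][j],
                        *(const vector double *) b->imag[i][j]);
        }
    }
}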
Changing a basic structure within a code forces global changes that can require changing
most of the code. It can increase the performance gains but it is a costly approach in both
time and effort.

11.4 Conclusions
First, as discussed in chapter 10 and elsewhere, since auto-vectorization can be invoked
with little effort, it is always worthwhile to try. The -qreport -qlist -qsource options
will indicate why loops are or are not vectorized. The amount of effort that should be
invested in helping auto-vectorization succeed depends on the expected benefit, but it
should be less than the effort needed to hand-code Altivec/VSX intrinsics.
Keep in mind the criteria presented in section 11.4 while inspecting the hot
computational loops for vectorization opportunities.
For any given application, the performance gain that can be achieved from using
Altivec/VSX rather than scalar instructions is not guaranteed to reach the maximum. The
ultimate performance will depend on the specific application, the algorithms used for the
scalar and Altivec/VSX versions of the application, and several factors such as bandwidth
from caches and memory, data alignment, etc. For most HPC applications it is unlikely
that VSX speedup will approach the maximum of 2x.
In many cases where SIMD opportunities exist to improve code performance, the IBM
ESSL library can save time and effort for coding up vectorized versions of standard
algorithms, like FFTs.
It is important to be aware of codes where there are opportunities to benefit from
exploiting the Altivec/VSX capabilities of POWER7 systems like the Power 755, whether
it is through auto-vectorization or hand-coding. It is equally important to have realistic
expectations of the potential speedup available from vectorizing an application.
12 Power Consumption
In industries across the board energy consumption has become a top priority. The HPC
segment is no exception. In a business where customers often look to cluster densely
populated nodes, power and cooling can rapidly escalate operating expenses. In
response to these new challenges IBM offers a technology called IBM EnergyScale
which is available for IBM POWER7 processor-based systems. The goal of this chapter is
to highlight the features of this new technology and communicate to the user what kind of
behavior they can expect with respect to power consumption and performance.
EnergyScale offers the user three power management modes, which are configurable via
IBM Systems Director Active Energy Manager (AEM). The available modes are SPS:
Static Power Saver, DPS: Dynamic Power Saver, and DPS-FP: Dynamic Power Saver -
Favor Performance.
xCAT 2 (Extreme Cluster Administration Toolkit) is a tool for deploying and managing
clusters. xCAT 2 is a scalable distributed computing management and provisioning tool
that can be used to deploy and manage Power 755 clusters. For xCAT 2.3 and later there
is an Energy Management Plugin available that gives the administrator of Power clusters
the ability to query for power and change the EnergyScale modes for each 755 server in
the cluster. The actions are invoked using an xCAT command called renergy that can be
used either in scripts or from the command line.
Figure 12-1 shows the GUI for AEM.
12.1 Static Power Saver: "SPS"
The firmware sets the processor frequency to a predetermined fixed value of 30% below
the shipping nominal frequency, hence the term "static" power saver. This mode also
enables a feature called "folding." Folding triggers the OS to scan the processor core
utilization. If any of the cores are running idle, the OS will issue a command to the
hypervisor to either "Nap" or "Sleep" the cores, depending on the OS level the user is
running (AIX 6.1H supports Nap only; AIX 6.1J supports Sleep). For Power 755 systems,
when a core is being napped its frequency is set to 1.65 GHz. The sleep frequency is
set to 0 MHz. There is a minimal lag time when the cores come out of sleep or nap mode.
If the user is concerned that this might impact performance on their application, the
following command can be issued from AIX to disable folding, thus disabling nap or sleep:
schedo -o vpm_fold_policy=0

SPS mode offers the maximum power savings for the system at the cost of sacrificing
performance. This mode is ideal for periods when there is little or no activity on the
system such as weekends or evenings when the user is looking to maximize power
savings.
12.2 Dynamic Power Saver: "DPS"
The firmware will alter the processor frequency based upon the utilization of the
POWER7 cores. The processor frequency is ramped up as utilization goes up. When
processor cores are not utilized their frequency drops to 1.65 GHz. The maximum core
frequency that can be achieved in this mode is 90% of the nominal ship frequency of the
system given 100% core utilization. This feature prefers power savings over
performance.
12.3 Dynamic Power Saver - Favor Performance: "DPS-FP"
The firmware will alter the processor frequency based upon the utilization of the
POWER7 cores. The processor frequency is ramped up as utilization goes up. When
processor cores are not utilized their frequency drops to 1.65 GHz. The maximum core
frequency that can be achieved in this mode is 107% of the nominal ship frequency of the
system given 100% core utilization. This feature prefers maximum performance over
power savings.


Figure 12-1 Active Energy Manager GUI

12.4 Performance Versus Power Consumption
The degree of power consumption on a system depends on the performance
characteristics of the application. HPC applications display a very diverse set of
performance characteristics ranging from very core intensive applications with very little
data movement from memory over the course of computation to very memory bandwidth
intensive behavior where data is constantly moved to and from system memory.
In order to understand the correlation between performance and power consumption on
POWER7 based systems, we considered a set of 6 application benchmarks from the
SPEC CFP2006 suite of benchmarks. Three of the applications are very memory
bandwidth intensive and the other three are core intensive. The applications are shown in
Table 12-1.
We measured the system level power consumption of a Power 755, which includes all
the components such as processor chips, DIMMs, fans, etc. with the system in two
different modes, (1) nominal and (2) SPS. Each application is run in a throughput mode
on the system, whereby, 32 serial copies of a given application are started
simultaneously with the Power 755 system booted in single thread (ST) mode.
In Figure 12-2, the x-axis represents the reduction in performance relative to performance
in nominal mode when the system is switched to SPS mode. Similarly, the y-axis
represents the reduction in power relative to power in nominal mode when the system is
switched to SPS mode.
As it can be seen, the core intensive applications suffer a reduction in performance of
about 30%, though power consumption goes down by about 30%, too. For many HPC
users, this may not always be attractive since their applications suffer in performance by
using SPS mode. On the other hand, memory bandwidth intensive applications suffer
less than 5% reduction in performance while resulting in power savings of about 20%-
25%. Our study shows that users and data center managers with knowledge of
application performance behavior can fully utilize the EnergyScale power management
features provided in POWER7 based systems to reduce power consumption with minimal
impact to performance.
Table 12-1 Performance characteristics of selected SPEC applications
SPEC Benchmark    Performance Characteristic
416.gamess        Core intensive
433.milc          Mem. bandwidth intensive
435.gromacs       Core intensive
437.leslie3d      Mem. bandwidth intensive
444.namd          Core intensive
459.GemsFDTD      Mem. bandwidth intensive


Figure 12-2 Correlation of performance and power consumption
(Chart "Nominal to SPS: Performance vs Power": x-axis = Reduction in performance (%),
y-axis = Reduction in power (%), both axes 0% to 35%; series shown: Memory bandwidth
intensive, Core intensive.)
Appendix A: POWER7 and POWER6 Hardware
Comparison
Table A-1 Core Features

Feature: Registers
  POWER7: 64 FPR/GPR (with renaming)
  POWER6: 64 FPR/GPR (no renaming)

Feature: Cache
  POWER7: 32KB 4W I-Cache; 32KB 8W D-Cache (2RD/1WR); dedicated L2 reload buses for I and D; 32B D-Cache reload bus at core frequency; 4 MB L2 not shared; 32 MB L3
  POWER6: 64KB 4W I-Cache; 64KB 8W D-Cache (2RD/1WR); shared L2 reload bus for I and D; 32B reload bus at core frequency; 4 MB L2 not shared; 32 MB L3

Feature: Functional Units
  POWER7: 2FX, 2LS, 4FP, 1BR, 1CR, 1DP, 1 VSX/AltiVec
  POWER6: 2FX, 2LS, 2FP, 1BR/CR, 1DP, 1 AltiVec

Feature: FPU Pipeline
  POWER7: 2 eight-stage (6 execution)
  POWER6: 2 eight-stage (6 execution)

Feature: Threading
  POWER7: 1, 2 or 4 thread SMT; priority-based dispatch; alternating dispatch from 2 or 4 threads (6 instructions)
  POWER6: 2 thread SMT; priority-based dispatch; simultaneous dispatch from 2 threads (7 instructions)

Feature: Instruction Dispatch
  POWER7: 6 instruction dispatch per thread; two branches; four non-branch instructions
  POWER6: 5 instruction dispatch per thread; 7 instruction dispatch for 2 threads (SMT); 1 branch at any location in group; 2 threads per cycle; in flight: 1 thread - 120, 2 threads - 184

Feature: Instruction Issue
  POWER7: 8 instruction issue per thread; two load or store ops; two fixed-point ops; two scalar floating-point, two VSX, two AltiVec ops (one must be a permute op) or one DFP op; one branch op; one condition register op
  POWER6: 2FX, 2LS, 2FP/1DP, 1BR/CR; AltiVec uses FPQ and VIQ

Feature: Rename
  POWER7: Yes
  POWER6: No; Load Target Buffer (up to 20 loads)

Feature: Translation
  POWER7: I-ERAT = 64 entries, 2W (4KB, 64KB page); D-ERAT = 64 entries, fully set associative (4KB, 64KB, 16M page); SLB = 32 entries per thread; 68-bit VA, 46-bit RA; page size = 4KB, 64KB, 16MB, 16GB
  POWER6: I-ERAT = 128 entries, 2W (4KB, 64KB page); D-ERAT = 128 entries, fully set associative (4K, 64K, 16M page); SLB = 64 entries per thread; 68-bit VA, 48-bit RA; page size = 4KB, 64KB, 16MB, 16GB



Appendix B: IBM System Power 755 Compute Node
Figure B-1 Schematic of Power 755 Node

Appendix C: Script petaskbind.sh
#!/bin/sh
# Script is designed to bind tasks using the bindprocessor command
# for MPI tasks started via poe.
#
# First draft - no checking is done, so be careful
# TODO : make determination of ncpu more robust
#        i.e. check output of lparstat.
#
# Usage : poe petaskbind.sh a.out <args>
#
# Assumed env variables from PE :
#   MP_CHILD        - from POE, effectively the task rank
#   MP_COMMON_TASKS - from POE, colon delimited string containing number
#                     and list of mpi task ids running on the same node
#
# Envs to control binding
#   PEBND_PE_STRIDE - Stride between successive MPI tasks
#                     value of -1 will set stride = ncpus/ntasks
#                     Default value of -1
#   PEBND_PE_START  - Desired logical processor to start PE tasks
#                     Default value 0
#

getmin ()
{
  xmin=$1
  # xlist is colon delimited list of MPI tasks sharing same node
  xlist=$2
  for x in `echo $xlist | sed 's/:/ /g'`; do
    if [ $x -lt $xmin ]; then
      xmin=$x
    fi
  done
  echo "$xmin"
}

# Set defaults
PEBND_PE_STRIDE=${PEBND_PE_STRIDE:--1}
PEBND_PE_START=${PEBND_PE_START:-0}

# Get number of common tasks
ncom=${MP_COMMON_TASKS%%:*}
ntasks=`expr $ncom + 1`

# Get number of logical processors on node, assumes lparstat is available
ncpu=`lparstat | grep System | awk '{ print $6 }' | awk -F= '{ print $2 }'`

# Get list of common tasks, 1st element in this list is number of common tasks
# unless it is the only task

comlist=${MP_COMMON_TASKS#*:}
if [ $ncom -eq 0 ]; then
  comlist=""
fi
mytask=$MP_CHILD

# Determine smallest task id on node
mintask=`getmin $mytask $comlist`

# local index
start_index=`expr $mytask - $mintask`

if [ "x$PEBND_PE_STRIDE" = "x-1" ]; then
  stride=`expr $ncpu / $ntasks`
else
  stride="$PEBND_PE_STRIDE"
fi

start_proc=`expr $PEBND_PE_START + $start_index \* $stride`

# Debugging
debug=0
if [ $debug = 1 ]; then
  echo "start_proc $start_proc"
  echo "stride = $stride"
  echo "PEBND_PE_STRIDE $PEBND_PE_STRIDE"
  echo "PEBND_PE_START $PEBND_PE_START"
# echo "MP_COMMON_TASKS $MP_COMMON_TASKS"
# echo "comlist $comlist"
# echo "ncom $ncom"
fi

# Do the binding.
bindprocessor $$ $start_proc

# Execute command
exec "$@"
Appendix D: Script petaskbind-rset.sh
#!/bin/sh
# Script is designed to bind tasks using the execrset command
# for MPI tasks started via poe.
#
# First draft - no checking is done, so be careful
# TODO : make determination of ncpu more robust
#        i.e. check output of lparstat.
#
# Usage : poe petaskbind-rset.sh a.out <args>
#
# Assumed env variables from PE :
#   MP_CHILD        - from POE, effectively the task rank
#   MP_COMMON_TASKS - from POE, colon delimited string containing number
#                     and list of mpi task ids running on the same node
#
# Envs to control binding
#   PEBND_PE_STRIDE - Stride between successive MPI tasks
#                     value of -1 will set stride = ncpus/ntasks
#                     Default value of -1
#   PEBND_PE_START  - Desired logical processor to start PE tasks
#                     Default value 0
#

getmin ()
{
  xmin=$1
  # xlist is colon delimited list of MPI tasks sharing same node
  xlist=$2
  for x in `echo $xlist | sed 's/:/ /g'`; do
    if [ $x -lt $xmin ]; then
      xmin=$x
    fi
  done
  echo "$xmin"
}

# Set defaults
PEBND_PE_STRIDE=${PEBND_PE_STRIDE:--1}
PEBND_PE_START=${PEBND_PE_START:-0}

# Get number of common tasks
ncom=${MP_COMMON_TASKS%%:*}
ntasks=`expr $ncom + 1`

# Get number of logical processors on node, assumes lparstat is available
ncpu=`lparstat | grep System | awk '{ print $6 }' | awk -F= '{ print $2 }'`

# Get list of common tasks, 1st element in this list is number of common tasks
# unless it is the only task

comlist=${MP_COMMON_TASKS#*:}
if [ $ncom -eq 0 ]; then
  comlist=""
fi
mytask=$MP_CHILD

# Determine smallest task id on node
mintask=`getmin $mytask $comlist`

# local index
start_index=`expr $mytask - $mintask`

if [ "x$PEBND_PE_STRIDE" = "x-1" ]; then
  stride=`expr $ncpu / $ntasks`
else
  stride="$PEBND_PE_STRIDE"
fi

start_proc=`expr $PEBND_PE_START + $start_index \* $stride`

# Debugging
debug=0
if [ $debug = 1 ]; then
  echo "start_proc $start_proc"
  echo "stride = $stride"
  echo "PEBND_PE_STRIDE $PEBND_PE_STRIDE"
  echo "PEBND_PE_START $PEBND_PE_START"
# echo "MP_COMMON_TASKS $MP_COMMON_TASKS"
# echo "comlist $comlist"
# echo "ncom $ncom"
fi

# Do the binding.
# bindprocessor $$ $start_proc
attachrset -F -c $start_proc $$ > /dev/null 2>&1

# Execute command
# or replace below with
# execrset -c $start_proc -e "$@"
exec "$@"

Appendix E: Enabling Huge Pages on SLES11 Power 755
systems
This is a small overview on how to set up Huge Pages on a system. See also this wiki
page.

We strongly recommend performing these actions immediately after a reboot of
the system!
How to allocate Huge Pages:
#!/bin/bash
# verify the locality of the memory on the memory pools:
numactl --hardware | tee numactl_out0

# verify the number of Huge Pages allocated on the system
cat /proc/meminfo | grep Huge

# allocate Huge Pages (just after a reboot to have a clear memory)
# first reset everything
echo 0 > /proc/sys/vm/nr_hugepages
# allocate X GB of Huge Pages
export X=64
nbhp=$(echo "$X * 1024 / 16" | bc)
#echo <integer value from above command line> > /proc/sys/vm/nr_hugepages
echo $nbhp > /proc/sys/vm/nr_hugepages

# verify the amount of Huge Pages allocated
cat /proc/meminfo | grep Huge

# verify the locality of the memory on the memory pools:
numactl --hardware | tee numactl_out1

# now create the filesystem used to access these Huge Pages
mkdir /libhugetlbfs

# then mount the filesystem
mount -t hugetlbfs hugetlbfs /libhugetlbfs

# create a user group to restrict access to Huge Pages
groupadd libhuge
chmod 770 /libhugetlbfs
chgrp libhuge /libhugetlbfs/
chmod g+w /libhugetlbfs/

# add sara user id to the Huge Pages group
usermod sara -G libhuge



How to use Huge Pages with your application:

For codes using malloc (C) or ALLOCATE (Fortran) functions:
you don't need to recompile. At execution time, use the following:
LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes ./${EXE}

For codes using static arrays:
you must recompile and add the following flags at link time:
-B /usr/share/libhugetlbfs/ -tl -Wl,--hugetlbfs-link=BDT

There is no way to use Huge Pages for codes using both static arrays and malloc.


Appendix F: Flushing Linux I/O buffers
The total memory used by I/O buffers during an application run is not released by default.
OS tuning can be done to improve this behavior, but there is also a way to manually flush
these buffers to free the memory used.
This is quite important when an application "Y" is launched just after an application "X"
has been using a lot of local memory. The memory pools available to allocate data may
not be local, and then one MPI task/process of "Y" will allocate data on a memory card
in a remote location. This can dramatically impact the performance of memory-intensive
codes.

Commands
echo 1 > /proc/sys/vm/drop_caches   : to free pagecache
echo 2 > /proc/sys/vm/drop_caches   : to free dentries and inodes
echo 3 > /proc/sys/vm/drop_caches   : to free pagecache, dentries and inodes
As this is a non-destructive operation, and dirty objects are not freeable, the user should
run "sync" first in order to make sure all cached objects are freed.

Example:
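A minimal illustration of the sequence described above (run as root); this is an assumed example rather than output captured from a specific system:

# flush dirty pages to disk first, then drop pagecache, dentries and inodes
sync
echo 3 > /proc/sys/vm/drop_caches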
Appendix G: Compiler Flags and Environment Settings for
NAS Parallel Benchmarks
Following are the compiler flags and the environment settings used in the NAS
benchmark runs on the Power 755 cluster system.

Different compiler options are used for different benchmarks:

For ft and lu:
-O5 -q64 -qnohot -qarch=pwr7 -qtune=pwr7

For bt and sp:
-O3 -q64 -qnohot -qarch=pwr7 -qtune=pwr7

For cg and mg:
-O5 -q64 -qarch=pwr7 -qtune=pwr7

SYSTEM CONFIGURATION:
Node: Power 755
---------------------
MEMORY: 249.25GB (32 x 8GB DIMM) - No large pages
CPUs: 64 Clock speed 3300 MHz - SMT-2 enabled
AIX: 6.1 (6100-04-01) - 64 bit kernel

LoadLeveler: The BACKFILL scheduler is in use

Switch: Qlogic Infiniband
2 144 port SilverStorm 9120, DDR
2 links per network adapter to each Qlogic switch


Installed software:
LOADL: 4.1.0.1
LAPI: 3.1.4.1
PPE-POE: 5.2.0.1
ESSL: 5.1.0.0
PESSL: 3.3.0.2
GPFS: 3.3.0.2
VAC: 11.01.0000.0000
XLF: 13.01.0000.0000
The following MPI and other environment variables were used in the runs
export OMP_NUM_THREADS=1
export MP_PROCS=32
export MP_HOSTFILE=hf
export MP_USE_BULK_XFER=yes
export MEMORY_AFFINITY=MCM
export MP_PULSE=0
export MP_EAGER_LIMIT=65536
export MP_INFOLEVEL=4
export MP_EUILIB=us
export MP_EUIDEVICE=sn_all
export MP_SHARED_MEMORY=yes
export MP_SINGLE_THREAD=yes
export MP_INSTANCES=2
export MP_RETRANSMIT_INTERVAL=5000
export TARGET_CPU_LIST=-1

Appendix H: Example Program Listing for Using the
dscr_ctl System Call
Applications may exhibit better performance using a DSCR setting different from the
system default. AIX does support a dscr_ctl subroutine that can be used to set the DSCR
register for the application. The prototype for the routine is in the file
/usr/include/sys/machine.h. Below is an example program demonstrating how to query
and set the DSCR register for a user application.

#include <stdio.h>
#include <stdlib.h>
#include <sys/machine.h>

long long dscr;

rc = dscr_ctl(DSCR_WRITE, &dscr, sizeof(long long));

#include <stdio.h>
#include <sys/machine.h>

void main () {

    int rc;
    char *ptr;
    long long dscr;
    long long dscr_read;
    struct dscr_properties dscr_prop;

    rc = dscr_ctl(DSCR_GET_PROPERTIES, &dscr_prop, DSCR_PROP_SIZE);
    printf("Return code %d\n", rc);
    printf("Number of Streams %u\n", dscr_prop.number_of_streams);
    printf("platform_default_pd %#llx\n", dscr_prop.platform_default_pd);
    printf("os_default_pd %#llx\n", dscr_prop.os_default_pd);

    rc = dscr_ctl(DSCR_READ, &dscr_read, sizeof(long long));
    printf("DSCR_READ %#llx\n", dscr_read);

    /* User defined value of dscr */
    dscr = 0x1eLL;
    printf("User DSCR %#llx\n", dscr);

    rc = dscr_ctl(DSCR_WRITE, &dscr, sizeof(long long));

    rc = dscr_ctl(DSCR_READ, &dscr_read, sizeof(long long));
    printf("DSCR_READ is now %#llx\n", dscr_read);

    rc = dscr_ctl(DSCR_GET_PROPERTIES, &dscr_prop, DSCR_PROP_SIZE);
    printf("Number of Streams %u\n", dscr_prop.number_of_streams);
    printf("platform_default_pd %#llx\n", dscr_prop.platform_default_pd);
    printf("os_default_pd %#llx\n", dscr_prop.os_default_pd);

}
Appendix I: Scramble Program for 64K Page Creation
The sample script and program below are an example of how to run the scramble pages
program to address the 64K page coalescing issue described in section 4.1.6.

Scramble Run Script

#!/bin/sh
# runscramble.sh
export LDR_CNTRL=DATAPSIZE=64K@TEXTPSIZE=64K@STACKPSIZE=64K@SHMPSIZE=64K

free_4Kpg=`vmstat | tail -1 | awk '{ print $4 }'`

(( free_MB = free_4Kpg / 256 ))
(( scramble_MB = 3 * ( free_MB / 4 ) ))

for rep in 1 2
do
  ./scramble_64Kpg $scramble_MB
done


Scramble Program

/*
 scramble_64kpg.c

 Compile and Link with :
   cc -q64 -o scramble_64Kpg scramble_64Kpg.c -lm -bdatapsize:64k -bstackpsize:64k

 Run :
   scramble_64Kpg <scramble_MB>
*/

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

double * (allocarray(double * a));
void freearray(double *a);

int main(int argc, char **argv)
{
    double *i1;
    double * (*i2);
    int i, nn, j, npages, k, iter;

    if (argc != 2)
    {
        fprintf(stderr, "Usage: %s size_in_MB \n", argv[0]);
        return(1);
    }
    nn = atoi(argv[1]);
    /* npages = (nn << 8); */  /* 4K pages */
    npages = (nn << 4);
    i2 = malloc(sizeof(i1)*npages);
    printf("number of pages %d \n", npages);
    printf("%s will scramble %u MB of memory\n", argv[0], nn);
    for (iter=0; iter<1; iter++)
    {
        for (i=0; i<npages; i++)
        {
            i2[i] = allocarray(i1);
            /*
            for (j=0; j<8; j++, i2[i]++)
            {
                printf("i2[i] is %d\n", i2[i]);
                printf("*i2[i] is %f\n", *i2[i]);
            }
            */
        }
        for (i=0; i<npages; i++)
        {
            /* printf("i2[i] before free: %d\n", i2[i]); */
            j = (random() & (npages - 1));
            /* printf("j is %d\n", j); */
            freearray(i2[j]);
            i2[j] = NULL;
            /* printf("i2[j] after free: %d\n", i2[j]); */
        }
        for (i=0; i<npages; i++)
        {
            /* printf("i2[i] before free: %d\n", i2[i]); */
            freearray(i2[i]);
            i2[i] = NULL;
            /* printf("i2[i] after free: %d\n", i2[i]); */
        }
    }
    return(0);
}

double * (allocarray(double *a))
{
    double *b;
    int j;
    /* a = (double *) malloc(sizeof(double) * 512); */  /* 4K pages */
    a = (double *) malloc(sizeof(double) * 8192);

    b = a;
    for (j=0; j<8; j++, a++)
    {
        *a = (double) j;
        /*
        printf("a is %f\n", *a);
        printf("a addr is %d\n", a);
        */
    }
    return(b);
}

void freearray(double *a)
{
    if ( a != NULL ) free(a);
}
Appendix J: Runtime Environment for the GFS Application
These are the conditions that produced the GFS results reported in section 8.1.3.

The compiler flags (from the GFS Makefile) are:
F77      = mpxlf_r -g
FINCS    = -I/hitachi/gfs/lib/incmod/esmf.xlf11
#FINCS   = -I/vol/jabeles2/ncep/gfs.bench/lib/incmod/esmf.xlf12
#
smp_opt  = noauto
OPTS     = -qsuffix=cpp=f -O3 -qrealsize=8 -qnostrict -qxlf77=leadzero -qmaxmem=-1 -qsmp=$(smp_opt) -qnosave
OPTS90   = $(OPTS)
OPTS90A  = $(OPTS)

FFLAG90  = $(OPTS90) $(FINCS) -qfree -NS2048
FFLAG90A = $(OPTS90A) $(FINCS) -qfree -NS2048
FFLAGS   = $(OPTS) $(TRAPS)
FFLAGX   = $(OPTS) $(TRAPS)
FFLAGIO  = $(OPTS) $(TRAPS)
FFLAGY   = $(OPTS)
FFLAGM   = $(OPTS) $(FINCS) $(TRAPS) $(DEBUG) -NS2048
FFLAGSF  = -O3 -qnosave -qfree=f90 -qcheck
FFLAGSI  = -qnosave -O3 -qfree=f90
FFLAGB   = -qnosave -O3 -qfixed

ESMFLIB  = -L/hitachi/gfs/lib/xlf11 -lesmf -lnetcdf_stubs
LDR      = mpxlf_r -qsmp=$(smp_opt)
LDFLAGS  = -bdatapsize:64K -bstackpsize:64K -btextpsize:64K -L/usr/lib -lessl_r -L /hitachi/masslib -lmass -qsmp=$(smp_opt)

LIBS     = -lC ${ESMFLIB} -L /hitachi/gfs/sorc/gfs/lib -lw3_d -lbacio_4

These are the environment variables used during the runs:

export MEMORY_AFFINITY=MCM
export BIND_TASKS=no
export SAVE_ALL_TASKS=no
export MP_COREFILE_FORMAT="core.txt"
export XLSMPOPTS="stack=512000000"
export MP_SHARED_MEMORY=yes
export MP_LABELIO=yes
Appendix K: Acknowledgements
We would like to thank Dibyendu Das for providing additional information about compiler
auto-vectorization for Altivec/VSX. We thank Frank O'Connell for his valuable
suggestions on performance aspects of vectorization.
We would like to thank Francois Thomas for providing input for the Linux content.
We would like to thank the following people who provided many of the HPC benchmark
results shown in the paper.
Adhemerval Zanella
Farid Parpia
John Divirgilio
We would like to thank the following people who provided additional information about the
Linux OS on Power 755 systems:
Joel Schopp
We would like to thank Steve White for his comprehensive review of the guide contents.
We would like to thank Madhavi Valluri, who provided additional information about SPEC
CPU2006 SIMD performance on Power 755 systems.
We would like to thank Bill Buros whose Wiki provided Eric Michel with information on
SMT and memory page sizes for Linux.



Appendix L: Abbreviations Used
FMA floating point multiply add
FP/LS ratio of floating point operations to loads and stores

Appendix M: Notices

© IBM Corporation 2010
IBM Corporation
Marketing Communications, Systems Group
Route 100, Somers, New York 10589
Produced in the United States of America
March 2010, All Rights Reserved
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other
countries. Consult your local IBM representative for information on the products and
services currently available in your area. Any reference to an IBM product, program, or
service is not intended to state or imply that only that IBM product, program, or service
may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user's
responsibility to evaluate and verify the operation of any non-IBM product, program, or
service.
IBM may have patents or pending patent applications covering subject matter described
in this document. The furnishing of this document does not give you any license to these
patents. You can send license inquiries, in writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY
10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other
country where such provisions are inconsistent with local law: INTERNATIONAL
BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS"
WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING,
BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do
not allow disclaimer of express or implied warranties in certain transactions, therefore,
this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes
are periodically made to the information herein; these changes will be incorporated in
new editions of the publication. IBM may make improvements and/or changes in the
product(s) and/or the program(s) described in this publication at any time without notice.
Any references in this information to non-IBM Web sites are provided for convenience
only and do not in any manner serve as an endorsement of those Web sites. The
materials at those Web sites are not part of the materials for this IBM product and use of
those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes
appropriate without incurring any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those
products, their published announcements or other publicly available sources. IBM has not
tested those products and cannot confirm the accuracy of performance, compatibility or
any other claims related to non-IBM products. Questions on the capabilities of non-IBM
products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business
operations. To illustrate them as completely as possible, the examples include the names
of individuals, companies, brands, and products. All of these names are fictitious and
any similarity to the names and addresses used by an actual business enterprise is
entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which
illustrate programming techniques on various operating platforms. You may copy, modify,
and distribute these sample programs in any form without payment to IBM, for the
purposes of developing, using, marketing or distributing application programs conforming
to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all
conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function
of these programs.
More details can be found at the IBM Power Systems home page:
http://www.ibm.com/systems/p
Appendix N: Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International
Business Machines Corporation in the United States, other countries, or both. These and
other IBM trademarked terms are marked on their first occurrence in this information with
the appropriate symbol (® or ™), indicating US registered or common law trademarks
owned by IBM at the time this information was published. Such trademarks may also be
registered or common law trademarks in other countries. A current list of IBM trademarks
is available on the Web at http://www.ibm.com/legal/copytrade.shtml.
The following terms are trademarks of the International Business Machines Corporation
in the United States, other countries, or both:
1350, AIX 5L, AIX, alphaWorks, Ascendant, BetaWorks, BladeCenter, CICS, Cool Blue,
DB2, developerWorks, Domino, EnergyScale, Enterprise Storage Server, Enterprise
Workload Manager, eServer, Express Portfolio, FlashCopy, GDPS, General Parallel File
System, Geographically Dispersed Parallel Sysplex, Global Innovation Outlook, GPFS,
HACMP, HiperSockets, HyperSwap, i5/OS, IBM Process Reference Model for IT, IBM
Systems Director Active Energy Manager, IBM, IntelliStation, Lotus Notes, Lotus,
MQSeries, MVS, Netfinity, Notes, OS/390, Parallel Sysplex, PartnerWorld, POWER,
POWER4, POWER5, POWER6, POWER7, PowerExecutive, Power Systems, PowerPC,
PowerVM, PR/SM, pSeries, QuickPlace, RACF, Rational Summit, Rational Unified
Process, Rational, Redbooks, Redbooks (logo), RS/6000, RUP, S/390, Sametime,
Summit Ascendant, Summit, System i, System p, System Storage, System x, System z,
System z10, System/360, System/370, Tivoli, TotalStorage, VM/ESA, VSE/ESA,
WebSphere, Workplace, Workplace Messaging, X-Architecture, xSeries, z/OS, z/VM,
z10, zSeries
The following terms are trademarks of other companies:
AltiVec is a trademark of Freescale Semiconductor, Inc.
AMD, AMD Opteron, the AMD Arrow logo, and combinations
thereof, are trademarks of Advanced Micro Devices, Inc.
InfiniBand, and the InfiniBand design marks are trademarks and/or
service marks of the InfiniBand Trade Association.
ITIL is a registered trademark, and a registered community trademark
of the Office of Government Commerce, and is registered in the U.S.
Patent and Trademark Office.
IT Infrastructure Library is a registered trademark of the Central
Computer and Telecommunications Agency which is now part of the
Office of Government Commerce.
Novell, SUSE, the Novell logo, and the N logo are registered
trademarks of Novell, Inc. in the United States and other countries.
Oracle, JD Edwards, PeopleSoft, Siebel, and TopLink are
registered trademarks of Oracle Corporation and/or its affiliates.
SAP NetWeaver, SAP R/3, SAP, and SAP logos are trademarks or
registered trademarks of SAP AG in Germany and in several other
countries.
IQ, J2EE, Java, JDBC, Netra, Solaris, Sun, and all Java-based
trademarks are trademarks of Sun Microsystems, Inc. in the
United States, other countries, or both.
Microsoft, Windows, Windows NT, Outlook, SQL Server,
Windows Server, Windows, and the Windows logo are trademarks
of Microsoft Corporation in the United States, other countries, or both.
Intel Xeon, Intel, Itanium, Intel logo, Intel Inside logo, and Intel
Centrino logo are trademarks or registered trademarks of Intel
Corporation or its subsidiaries in the United States, other countries, or
both.
QLogic and the QLogic logo are trademarks or registered trademarks
of QLogic Corporation.
SilverStorm is a trademark of QLogic Corporation.
SPEC is a registered trademark of Standard Performance Evaluation
Corporation.
SPEC MPI is a registered trademark of Standard Performance
Evaluation Corporation.
UNIX is a registered trademark of The Open Group in the United States
and other countries.
Linux is a trademark of Linus Torvalds in the United States, other
countries, or both.
Other company, product, or service names may be trademarks or service marks of
others.
