Craig Lucas
November 2014
craig.lucas@nag.co.uk
What is multicore?
A very brief introduction to OpenMP
The NAG Library for SMP and Multicore
2
MULTICORE
3
Terminology
So what is multicore?
4
Terminology
The terminology of multicore can be very confusing.
There is some ambiguity and difference of opinion.
Here we make a best attempt to explain it!
Core (or processing core). The functional units that
do the actual computation, the registers that store
values for them, and the highest level of cache
memory. Cores process instructions.
Processor. Consists of multiple cores, lower-level
cache memory and a bus over which data can be
moved. Cores might be arranged into groups on an
integrated circuit die. Also called a CPU or socket.
5
Terminology
6
Terminology
Linked to memory is latency and bandwidth.
7
Terminology
Shared Memory. We use this term to describe any
environment where a collection of cores can all
access the same main memory.
SMP. This stands for Symmetric Multi-Processing.
It could describe a multicore processor, as
architecturally each core has equal access to the
memory. (Some would argue that it stands for Shared
Memory Programming.)
Shared memory node. A collection of processors
and some memory that is shared between them.
Could be one multicore processor sharing its memory.
8
Terminology
NUMA. Non-Uniform Memory Access. In a NUMA
node the access time to memory is not equal.
Perhaps:
a node consisting of two multicore processors, each with
its own memory (seen as one memory by a program)
but the memory is shared: the cores on one processor can
access the memory attached to the other
a core can access a data element in its own memory
more quickly than one in the other processor's memory
Due to cache effects, latencies can vary even on a single
multicore processor.
9
Terminology
What is important when it comes to hardware in
shared memory is the simple conceptual view.
Here "core" excludes all cache, which is grouped for
simplicity.
[Diagram: Core 1, Core 2, Core 3 and Core 4 connected
through cache and a bus to shared memory]
10
Terminology
We ignore that cache memory comes in a variety of
levels (L1, L2, ...) and that some of it may actually be
shared.
The overall principles are:
There are (in this case) 4 cores that can do some
computation independently of, and concurrently with,
the other cores
Data exists once in the shared memory, and all cores can
access this data
Data can be replicated in the separate caches. (Data is
copied through cache to the registers, where it is used by
the functional units)
11
Terminology
12
NUMA Architecture
All cores can access all memory.
But speed of data movement is not equal.
[Diagram: Core 1 to Core 4 with their caches above two
blocks of shared memory joined by a bus; data movement
from a core's local memory is faster than from the remote
memory, and cache lines are copied between them]
13
Example: AMD Interlagos Node (32 cores)
[Diagram: a node contains processors; each processor
holds dies; each die holds 2-core modules with L3 cache
and attached memory; the number of links between dies
indicates bandwidth]
14
Current and Emerging Architectures
AMD Interlagos (Opteron)
Halfway between a conventional processor and Sandy Bridge
Consists of modules of two tightly coupled cores sharing
floating point units and instruction handling
Each core has its own registers, hardware integer handling
and its own L1 cache
Intel Ivy Bridge & Haswell (Core i5, Core i7, etc)
Up to 18 physical cores running up to 36 virtual or logical
cores through hyper-threading
Pairs of logical cores have separate registers but shared
computational units
15
Shared Memory Programming
These shared memory environments are suitable for
programming with OpenMP.
Using OpenMP is an example of multi-threaded
programming.
A Thread is an entity (or context) that can execute an
independent instruction stream while sharing a global
address space.
In other words a thread processes instructions, whilst
sharing memory with other threads.
OpenMP creates a set of threads for a single program,
one for each physical core (perhaps).
16
OpenMP Parallelism: Strong Points
Easy to write
Allows incremental parallelism of existing serial code
Dynamic Load balancing
Amounts of work to be done on each processor can be
adjusted during execution
Closely linked to a dynamic view of data
Dynamic Data view
Redistribution through different patterns of data access
Portability (including GPU)
Active development of language
17
OpenMP Parallelism: Dynamic View of Data
[Diagram: successive computation stages (1, 2, 3)
redistribute the data across Core 1 to Core 4 through
different patterns of access]
18
OpenMP Parallelism: Some Weaker Points
19
Considering Parallelism
Parallelism exists even in serial code; see vectorisation (SSE, AVX)
What is most important: Throughput on multiple
tasks or turnaround on a single task?
You have 20,000 jobs to run with different parameters?
Some users can parallelise their own code that makes
many calls to NAG routines, e.g. different starting points
for E04 local optimiser
Parallelism at higher level often more efficient
Tasks may be independent, so no synchronisation
Overheads of starting tasks amortised over long run times
If the number of tasks >> number of cores, efficient load balancing
should be automatic
20
Reasons to parallelise individual task
Limited memory per core may restrict the number of
independent tasks that can run side by side
Then look to avoid wasted (idle) cores
Hardware trends may mean less RAM/core on average
e.g. compute nodes on HECToR: 2-core, 3GB/core -> 4-core,
2GB/core -> 24-core, ~1.3 GB/core -> 32-core, 1 GB/core
Reduce runtime!
Parallelise over data within a task, e.g. matrix
multiplication, operations within iterative processes
Share out tasks between threads e.g. searching different
spaces, generation of eigenvectors.
21
A BRIEF INTRO TO OPENMP
22
OpenMP
Hello world
Parallel data generation (for a NAG Multicore
routine, perhaps)
23
What is OpenMP ?
OpenMP (Open Multi-Processing) is a specification
for shared memory parallelism.
That is, it is a mechanism for writing multi-threaded
code for shared memory machines.
OpenMP uses:
Compiler Directives
Environment Variables, and
Run-time Library Routines
24
OpenMP on Linux
25
Status
26
Status
27
OpenMP Execution Model
OpenMP uses the fork-join model:
start as a single (master) thread, on one core
continue as single thread until parallel construct
create required number of threads, on remaining cores
program statements within this parallel region are
executed in parallel by each thread in this team of threads
(includes master thread and slaves)
at end of the parallel construct, threads synchronize and
only master thread continues execution, on the original
single core
Can have many parallel regions, although there is an
overhead with thread creation/destruction.
28
Execution Model
[Diagram: serial execution on a single core, then
multi-threading in the parallel region on all cores,
then serial execution on a single core again]
29
Three elements of OpenMP
30
Three elements of OpenMP
We add (1) compiler directives and (3) run-time library
routines to our existing serial code.
This makes OpenMP parallelisation much easier
than, say, MPI. But more importantly:
We do not (necessarily) destroy our serial code so we
can compile it for, and run it on, serial platforms.
We can also incrementally add parallelism to a serial
code, starting with the most computationally
expensive parts.
31
Three elements of OpenMP
So first up is Compiler Directives.
The parallel directive defines the parallel region, the
code that will be executed in parallel on different
threads.
Fortran example:
!$OMP PARALLEL [clause1, clause2, ...]
< code in here >
!$OMP END PARALLEL
In Fortran directives start with !$OMP.
C/C++ example (here the sentinel is #pragma omp,
and the directive ends with a line break):
#pragma omp parallel [clause1, clause2, ...]
{ < code in here > }
33
Hello World in Fortran
I have added the directive (shown in red on the slide) to my
existing serial Hello World code:
PROGRAM HELLO
IMPLICIT NONE
INCLUDE "omp_lib.h"
INTEGER :: iam, nthreads
!$OMP PARALLEL PRIVATE(iam, nthreads)
iam = omp_get_thread_num()
nthreads = omp_get_num_threads()
WRITE(*,*) 'Hello World from thread', iam, 'of', nthreads, 'threads'
!$OMP END PARALLEL
END PROGRAM HELLO
34
Hello World in C
I have added the directive (shown in red on the slide) to my
existing serial Hello World code:
#include <stdio.h>
#include <omp.h>
int main () {
int nthreads;
int iam;
#pragma omp parallel private(iam, nthreads)
{
iam = omp_get_thread_num();
nthreads = omp_get_num_threads();
printf("Hello World from thread %d of %d threads\n", iam, nthreads);
}
return 0;
}
35
Hello World in Fortran
OpenMP header file; or USE the omp_lib module.
PROGRAM HELLO
IMPLICIT NONE
INCLUDE "omp_lib.h"
INTEGER :: iam, nthreads
36
Hello World in C
OpenMP header file:
#include <stdio.h>
#include <omp.h>
int main () {
int nthreads;
int iam;
37
Hello World
Output (the ordering of the lines varies between runs):
Hello World from thread 2 of 4 threads
Hello World from thread 0 of 4 threads
Hello World from thread 1 of 4 threads
Hello World from thread 3 of 4 threads
38
OMP_NUM_THREADS in Visual Studio
While developing you set the value:
[Screenshot: setting the OMP_NUM_THREADS environment
variable in Visual Studio]
39
Compiling OpenMP
40
Compiling OpenMP
In Visual Studio use the project's Property Pages
dialog box:
[Screenshot: Property Pages dialog]
41
Worksharing
42
Worksharing
For this we use a worksharing construct. The most
common one is for loops: DO in Fortran, for in C.
In Fortran the syntax is:
!$OMP DO [clause1, clause2, ...]
DO ! Existing serial do loop
body_of_do_loop
END DO
!$OMP END DO
43
Worksharing
44
A Parallel Loop in Fortran
Spawn threads: we now have 4 threads working.
!$OMP PARALLEL default(none), shared(a,b,n), private(i,j)
!$OMP DO
do j = 1, n
do i = 1, n
a(i,j) = a(i,j) + b(i,j)
end do
end do
!$OMP END DO
!$OMP END PARALLEL
The outer j loop is executed in parallel, with each
thread taking a subset of the iterations.
CALL F07ADF(M, N, A, LDA, IPIV, INFO)
45
A Parallel Loop in C
Spawn threads: we now have 4 threads working.
#include <nagmk24.h>
#pragma omp parallel default(none) \
shared(a,b,n) private(i,j)
{
#pragma omp for
for (i=0; i<n; i++){
for (j=0; j<n; j++){
a[i][j] += b[i][j];
}
}
}
The outer i loop is executed in parallel, with each
thread taking a subset of the iterations.
46
A Parallel Loop
So what is happening?
Well, something like this, for n=200:
Thread 0:
do j = 1, 50
do i = 1, 200
a(i,j) = a(i,j) + b(i,j)
end do
end do
Thread 1:
do j = 51, 100
do i = 1, 200
a(i,j) = a(i,j) + b(i,j)
end do
end do
Thread 2:
do j = 101, 150
do i = 1, 200
a(i,j) = a(i,j) + b(i,j)
end do
end do
Thread 3:
do j = 151, 200
do i = 1, 200
a(i,j) = a(i,j) + b(i,j)
end do
end do
47
A Parallel Loop
So what is happening?
Well, something like this, for n=200:
Thread 0:
for (i=0; i<50; i++)
for (j=0; j<200; j++)
a[i][j] += b[i][j];
Thread 1:
for (i=50; i<100; i++)
for (j=0; j<200; j++)
a[i][j] += b[i][j];
Thread 2:
for (i=100; i<150; i++)
for (j=0; j<200; j++)
a[i][j] += b[i][j];
Thread 3:
for (i=150; i<200; i++)
for (j=0; j<200; j++)
a[i][j] += b[i][j];
48
A Parallel Loop
The four threads have each taken a share of the work
to do.
Recall that all threads can access all the shared data,
like the arrays in this example.
We need to consider something called scoping (or
data sharing) of the variables too.
This is what we have done with the shared and
private clauses in the previous example which
actually shows the usual scoping of arrays and loop
counters.
49
Variable Scoping
All threads need to access the arrays; these are in
shared memory. Every thread accesses the data, but
note that they don't write to the same array
elements. This is very important.
The threads will need private copies of the loop
variables as they independently execute the loops,
and in particular set the value of the variables.
If the threads shared these variables, they would
interfere, each trying to update the same copies.
We use default(none) to tell the compiler we are
going to explicitly scope all variables.
50
That was easy!
51
Lots more OpenMP too
52
Learn More
53
THE NAG LIBRARY FOR SMP &
MULTICORE
54
NAG Library for SMP & Multicore
Based on standard NAG Libraries
Designed to better exploit multicore architecture
Mark 24 now, Mark 25 next year
Identical interfaces to standard Fortran Library
Just re-link the application
Easy access to parallelism for non-specialists
user is shielded from details of parallelism
assists rapid migration from serial code
Can be used along with the user's own parallelism
New chapter of utilities to interrogate and control the whole
program's OpenMP environment
55
Target systems
56
NAG Library Functionality
Root Finding
Summation of Series (e.g. FFT)
Quadrature
Ordinary Differential Equations
Partial Differential Equations
Numerical Differentiation
Integral Equations
Mesh Generation
Interpolation
Curve and Surface Fitting
(Global) Optimisation
Approximations of Special Functions
Wavelet Transforms
Dense Linear Algebra
Sparse Linear Algebra
Correlation and Regression Analysis
Multivariate Analysis of Variance
Random Number Generators
Univariate Estimation
Nonparametric Statistics
Smoothing in Statistics
Contingency Table Analysis
Survival Analysis
Time Series Analysis
Operations Research
57
Coming next year in Mark 25, 80+ routines
Simplified FFTs, additional Wavelet routines
High dimensional Numerical Integration, thread safe ODEs
Additional routines for Interpolation
Improved Curve and Surface Fitting routine
More Linear Algebra including:
new and improved algorithms for matrix functions
real and complex quadratic eigenvalue problem
driver routine for selected eigenvalues/vectors of symmetric sparse matrices
Many new routines in statistics including:
additional regression analysis
nearest correlation matrices
more random number generators
New chapter controlling OpenMP environment
58
Environments
We want to bring benefits of multi-threading to
multiple environments
Currently:
Fortran 77 and 90 (with interface blocks available)
From C programs via NAG C Header files
Toolbox for Matlab (Windows and Linux)
Other environments: Python, Java, R
Future, multi-threading as standard:
C Library multithreaded at Mark 25.
No longer a specific product for SMP, so the Fortran Library too.
Excel? Other environments?
59
Compatibility issues
60
Using the Library
Read the Users' Note for details of your installation, in
/opt/nag/fsxxx24xxl/doc or on the NAG websites:
http://www.nag.co.uk/numeric/FL/FSdocumentation.asp
Documentation is installed, with an easy-to-navigate version
on the NAG websites. Read:
Essential Introduction (updated at Mark 25)
Chapter Introductions
Routine Documents
Interface blocks are included for Fortran 90+ for compile-time
checks of arguments and their types.
61
Running the Examples
Examples are included for all the routines.
Scripts are available in /opt/nag/fsxxx24xxl/doc to
run the examples: nagsmp_example (static) and
nagsmp_example_shar
Data is copied to the current directory and the compilation
echoed to the screen.
An easy way to see how the library should be linked.
Output can be checked by comparing with results given
in /opt/nag/fsxxx24xxl/examples/results/yyyyyye.r
where yyyyyy is routine name.
For example:
62
F07ABF Example Program
Running the example from home on 2 threads:
home> nagsmp_example f07abf 2
63
F07ABF Example Program
F07ABF Example Program Results
Solution(s)
1 2
1 1.0000 3.0000
2 -1.0000 2.0000
3 3.0000 4.0000
4 -5.0000 1.0000
Compare with
/opt/nag/fsxxx24xxx/examples/results/f07abfe.r
64
NAG, vendor libraries and your code
Pictorially:
65
NAG & vendor libraries (e.g. ACML, MKL)
User
application
Compilers etc
NAG Libraries
Vendor Libraries
Hardware
66
Improving and utilizing core-math routines
67
Dense Linear Algebra: BLAS / LAPACK
BLAS: Basic Linear Algebra Subprograms
BLAS1: vector-vector operations, e.g. dscal, ddot, daxpy
BLAS2: matrix-vector operations, e.g. dgemv, dtrsv
BLAS3: matrix-matrix operations, e.g. dgemm, dtrsm
Optimised for cache-based architectures
NAG Library uses a vendor library for fast BLAS,
e.g. ACML, MKL
LAPACK: Linear Algebra PACKage
matrix factorisations and solvers, e.g. LU, Cholesky, QR
eigensolvers
SVD and least-squares
68
C05NCF: Non-linear equation solver
Intel Xeon E5310 1.6 GHz, N=4000
[Chart: time (secs) against number of cores (1, 2, 4, 8),
comparing Fortran Library Mark 21 + MKL with
Fortran Library Mark 22 + MKL]
(Historical) Improvements with MKL parallelized
69
S.V.D. (DBDSQR): Improved serial performance and parallelized
[Chart: performance (Mflops) against problem size
(N = 1000, 2000, 4000) for 1, 2 and 4 CPUs, comparing
netlib DBDSQR + ACML BLAS with SMP DBDSQR + ACML BLAS]
70
G03AAF: Principal Component Analysis
[Chart: time (secs) against number of cores (1, 2, 4, 8)]
71
Parallelism with OpenMP
NAG routines are parallelised with OpenMP
Potential for parallelism is now a key criterion for
selecting future algorithms:
Accuracy and stability still most important
Not all algorithms can be parallelised, alas.
Some algorithms not computationally demanding enough
to need to be parallelised
Look at older existing routines:
Replace algorithm with parallelisable one?
Can they benefit from new OpenMP features? Certainly
with OpenMP 4.0.
72
E01TGF/E01THF: Interpolation
[Chart: time (secs) against number of cores (1, 2, 4, 8)
for E01TGF (generate 3D interpolant) and E01THF
(evaluate interpolant at given points)]
73
G13EAF: Kalman filter (1 iteration)
AMD Magny-Cours, dual socket, 2.2 GHz; N=2000, M=L=1000
[Chart: time (secs) against number of cores
(1, 2, 4, 6, 8, 12, 16, 24)]
74
C09EAF: 2-D discrete wavelet transform
AMD Magny-Cours, dual socket, 2.2 GHz; M=N=15000, DB8, Half, Forward
[Chart: time (secs) against number of cores
(1, 2, 4, 6, 8, 12, 16, 24)]
75
G02AAF: Nearest-correlation matrix
Intel Xeon E5405 2.0 GHz, N=10,000
Serial iterative algorithm, but with LAPACK/OpenMP at each step
[Chart: time (secs) against number of cores (1, 2, 4, 8)]
76
Performance considerations
77
Performance considerations
Problem size(s)
Spawning threads incurs an overhead that may be
significant for small problem sizes
In some cases we try to determine an appropriate
threshold and only parallelise problems bigger than this
Difficult to get optimal value on all systems in a single family
Performance may vary depending upon whether the
problem size fits in cache or not
We block algorithms for cache where possible, to try to minimise
this variation
78
Performance advice
79
Performance advice
Profiling tools may help but:
Not standard across all systems
Remember: the goal is not to maximise scalability, GFLOPS,
cache re-use etc, but rather to minimise runtime!
The maximum number of cores may not give optimal
performance
Experiment with different values
OpenMP allows you to change the number of threads to
use (within a max limit) for different parts of your program,
e.g. before calls to different subroutines.
See Chapter X06.
80
Performance advice
81
Summary
Multicore is everywhere!
NAG Library can exploit multicore for performance
with little effort from you:
Identical interfaces to standard NAG Fortran Library.
Toolbox for Matlab as standard; C coming soon.
Documentation and examples to get you started
Works with vendor core-math library to get best
performance on dense linear algebra and FFT routines
May need to learn OpenMP.
Not too difficult to learn this evolving language
Can incrementally add parallelism to existing serial code
Can use them both together, with care.
82