
COMPUTACIÓ D'ALTES PRESTACIONS

PAC 4
September 2015 Semester

The objectives and goals

The exercises

Evaluation criteria

Formatting

Deadline


Presentation and goals


Objectives
The main goal of this exercise is to expose the student to a more open problem where
he/she can decide what to explore and how to explore it. Three different options are
provided.
The first one is a completely open option: the student works on a project of his/her own
choosing. The second one is a specific proposal oriented to those students that do not
have a strong computer science background (e.g. programming, Linux, etc.). This option
lets the student propose a parallelization of a specific scientific problem that the student
selects (different from the previous PEC3 exercise, and providing two different kinds of
parallelization). The third one, also specific, is more practical and is oriented to those
students that have programming and systems skills. In this third option the student will
carry out a performance evaluation of the NASA Parallel Benchmarks (using the OpenMP or
MPI implementation).
What is expected
Students doing the first option have to deliver: 1) a description of the mini-project that
the student has done (what it addresses, what the goals are and what the intended outcome
was); 2) a free and open-format dissertation on the project itself.
Students doing the second option have to deliver: 1) a description and characterization of
the proposed algorithm to parallelize; 2) two pseudo-code parallelizations for the proposed
algorithm and a comparison with the sequential implementation; 3) an estimation of the
potential speedup that the parallel application may show with respect to the serial
implementation.
Students doing the third option have to deliver: 1) the scripts developed to carry out the
performance characterization; 2) one document with the answers to the proposed
questions.
For the second and third options, the maximum length of the document is 4 pages.
The environment and resources
For the practical option, it is important to emphasize that the student is expected to
carry out his/her own exploration and research to find solutions in the context of:
performance analysis, metrics collection and program execution.

Computaci dAltes Prestacions, 2015


First option: Mini-project

The first option corresponds to an open mini-project. As has already been mentioned, this is
a completely open mini-project where the student is expected to propose what he/she wants
to do and evaluate, and what the goals and outputs of the project are.
It is suggested that the student divide the presented document into the following sections:

1) Motivation of the mini-project: why the student selected the given area and topic
for the project.
2) Related work: what related work is associated with the project.
3) The organization: what the project has consisted of, the different tasks that the
student has carried out, and the environments and methodologies that the student has
followed.
4) Discussion of the project: results, analysis, or whatever the output or result of the
project is.
5) Conclusions.

The total length of the dissertation must be no more than 6 pages. The student must
therefore synthesize as much as possible.


Second option: Theoretical parallelization


1. The serial algorithm
The goal of this first part is to identify a problem that is of interest to the student and that
can be parallelized: what it solves, what inputs it needs, what it generates, its computational
cost, and what parts of it can potentially be parallelized.
1.1 Describe what the selected algorithm does (include the references that you have used):

1.1.1 Why has the algorithm been selected?

1.1.2 Inputs, outputs and pseudo-code describing the algorithm. The pseudo-code must
contain comments on what each important part of it is doing.

1.2 Describe what parts of the algorithm can potentially be parallelized.

2. Parallel implementation
The goal of this second part is to propose a pseudo-code parallel implementation for the parts
identified in 1.2.
2.1 Are there any existing solutions? Briefly describe their approaches. List the references.

2.2 Describe in pseudo-code two potential parallel implementations for this algorithm:

2.2.1 What strategy have you selected (e.g. pipeline, shared memory, message passing, etc.)?
Why?
2.2.2 What other options could you use?
2.2.3 If possible, can you compare your proposal with the existing solutions described in 2.1?
2.2.4 Describe the pseudo-code, including comments on why the different parts were
selected for parallelization.
3. Performance projection

3.1 Given the pseudo-code proposed in 2.2 and 1.1, propose a theoretical model to project
the speedup that the parallel implementation may show with respect to the serial
implementation. (It is a model, so it is not expected to be 100% accurate.)

3.1.1 Extra: can you compare your proposed solutions with one of the existing solutions
identified in 2.1?

3.2 Provide a speedup analysis using the previous model for 1, 2, 4, 16 and 32 threads.
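A common starting point for such a model (one possibility among others, not a course requirement) is Amdahl's law: if f is the fraction of the serial runtime that can be parallelized and p is the number of threads, the projected speedup is

```latex
S(p) = \frac{1}{(1 - f) + \frac{f}{p}}
```

For example, with f = 0.9 this gives S(32) = 1/(0.1 + 0.9/32) ≈ 7.8, which illustrates how the serial fraction bounds the achievable speedup regardless of the number of threads.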

3.3 Provide a description of what type of computational system would be better suited for
the proposed implementation and which components it would be more important to invest in.


Third option: NASA Parallel Benchmarks


1. The NASA Parallel Benchmarks
Benchmarks are used to understand and characterize (A) specific types of systems or
architectures (for example HPC systems, cloud computing or enterprise systems) or specific
computer technologies (for example processor architectures, storage systems or memory
architectures), (B) using a set of different proxy applications (for example databases, scientific
algorithms or networking applications), (C) with a pre-defined set of inputs (for example small,
big and large working sets) and (D) configurations (for example increasing threads or increasing
processes). Using different combinations of (C) and (D), engineers can characterize specific
instances of (A) and/or (B).
For example, the LINPACK benchmark is used to establish the list of the top 500 most powerful
high performance computing systems in the world. It is recommended to have a look at the
top500 web site and at the LINPACK benchmark:
http://www.top500.org/project/linpack/
Another well-known benchmark suite is the NASA Parallel Benchmarks (NPB). The NPB is a suite
of HPC applications implemented in a variety of HPC programming models (MPI, OpenMP, etc.)
that can be executed with different levels of parallelism (threads or processes, depending on
the programming model) and with different working set sizes (S, W, A, B, C and D).
The first part of the exercise is devoted to understanding what the NASA Parallel Benchmarks
are and how to execute them. Benchmarks are used in the area of computer architecture to
evaluate and characterize systems:
http://www.nas.nasa.gov/publications/npb.html
In order to reduce the scope of potential studies, the exercise will focus on the OpenMP
and MPI implementations of the NPB.
The MPI NPB Version
In order to facilitate the PEC, the benchmarks are already compiled on the UOC cluster. They can
be found in the following path:

/share/apps/aca/benchmarks/NPB3.2/

Inside this folder the student can find the different implementations of the NPB applications. The
NPB3.2-MPI folder contains the compilation of the MPI version (see the bin folder). The bin
folder contains a set of different executables that correspond to the different applications, with
different levels of parallelism and with different working set sizes.

The following example shows how to submit the CG application using two processes and the
working set class A.

#!/bin/bash
#$ -cwd
#$ -S /bin/bash
#$ -N cg_A_2
#$ -o cg_A_2.out.$JOB_ID
#$ -e cg_A_2.err.$JOB_ID
#$ -pe orte 2
mpirun -np 2 /share/apps/aca/benchmarks/NPB3.2/NPB3.2-MPI/bin/cg.A.2
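To sweep over several levels of parallelism, a small script can generate and submit one job per process count. This is only a sketch: it assumes qsub is available on the cluster and that the cg.A.N binaries exist in the bin folder shown above; it prints the submission commands instead of running them (drop the echo to actually submit):

```shell
#!/bin/bash
# Sketch: generate one SGE job per MPI process count for cg, class A.
# Assumes cg.A.1, cg.A.2, ... exist in the NPB bin folder and that qsub
# is available; "echo qsub" is a dry run, remove "echo" to submit.
BIN=/share/apps/aca/benchmarks/NPB3.2/NPB3.2-MPI/bin
for NP in 1 2 4 8; do
  JOB=cg_A_${NP}.sh
  cat > "$JOB" <<EOF
#!/bin/bash
#$ -cwd
#$ -S /bin/bash
#$ -N cg_A_${NP}
#$ -o cg_A_${NP}.out.\$JOB_ID
#$ -e cg_A_${NP}.err.\$JOB_ID
#$ -pe orte ${NP}
mpirun -np ${NP} ${BIN}/cg.A.${NP}
EOF
  echo qsub "$JOB"   # dry run; drop "echo" to submit for real
done
```

Keeping the generated job scripts also satisfies the requirement of attaching the launch scripts at the end of the document.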

The OpenMP NPB Version


Unlike the MPI implementation, the OpenMP compilation has one binary per application
and working set (and not per level of parallelism). As the students did in the previous PECs,
the level of parallelism is indicated through the OMP_NUM_THREADS environment variable.
The following example shows how to submit the BT application with working set S and 4
OpenMP threads:

#!/bin/bash
#$ -cwd
#$ -S /bin/bash
#$ -N omp1
#$ -o omp1.out.$JOB_ID
#$ -e omp1.err.$JOB_ID
#$ -pe openmp 4
export OMP_NUM_THREADS=$NSLOTS
/share/apps/aca/benchmarks/NPB3.2/NPB3.2-OMP/bin/bt.S
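The same generation idea works for the OpenMP version, where only the slot count changes between runs. Again a sketch under the same assumptions (qsub available, bt.S binary as shown above), printing the submission commands as a dry run:

```shell
#!/bin/bash
# Sketch: generate one SGE job per OpenMP thread count for bt, class S.
# The thread count is taken from the reserved slots via $NSLOTS, as in
# the example above; "echo qsub" is a dry run, remove "echo" to submit.
BIN=/share/apps/aca/benchmarks/NPB3.2/NPB3.2-OMP/bin
for T in 1 2 4 8; do
  JOB=bt_S_${T}.sh
  cat > "$JOB" <<EOF
#!/bin/bash
#$ -cwd
#$ -S /bin/bash
#$ -N bt_S_${T}
#$ -o bt_S_${T}.out.\$JOB_ID
#$ -e bt_S_${T}.err.\$JOB_ID
#$ -pe openmp ${T}
export OMP_NUM_THREADS=\$NSLOTS
${BIN}/bt.S
EOF
  echo qsub "$JOB"   # dry run; drop "echo" to submit for real
done
```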

Questions
1.1 Select one of the NPB applications. Briefly describe the application: what it solves, what its
input sizes are (search online to see what W, S, etc. translate to for the specific application)
and what levels of parallelism it accepts.
1.2 Propose a methodology to evaluate and characterize the performance of an OpenMP
application for different levels of parallelism. For example, the simplest approach would be to
propose a methodology that describes the performance of the application based on the runtime.
However, Linux systems provide other ways to get metrics from the system (for example, look
at: http://linux.die.net/man/1/time).
1.3 Using the previous methodology, study the impact of the level of parallelism on the
OpenMP implementation. The student needs to define the experiment parameters: working
set/s, level of parallelism, etc. The maximum number of runs is 64. Carefully select the
working set; keep in mind that some of the working sets may have really large runtimes.
The student must provide at the end of the document the scripts used to launch the
experiments. The study must present its results as plots; tables are not accepted.
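Once the runtimes have been collected, the speedup and efficiency per thread count can be derived with a short script before plotting. The sketch below uses an assumed `threads,seconds` input format and illustrative runtime values (placeholders, not real measurements), with the 1-thread run on the first line:

```shell
#!/bin/bash
# Sketch: compute speedup and parallel efficiency from measured runtimes.
# Assumed input format: one "threads,seconds" pair per line, serial
# (1-thread) run first; the values below are illustrative placeholders.
cat > runtimes.csv <<EOF
1,100.0
2,52.0
4,27.0
8,15.0
EOF
# speedup = T(1)/T(p); efficiency = speedup/p
awk -F, 'NR==1{base=$2} {printf "%s threads: speedup=%.2f efficiency=%.2f\n", $1, base/$2, base/($2*$1)}' runtimes.csv
```

With the placeholder data this prints, for instance, `4 threads: speedup=3.70 efficiency=0.93`; the same numbers can then feed the required plots.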


2. OpenMP vs MPI
The goal of this second part is to explore and understand potential ways to compare two
different implementations of the same application. In this case, the student will compare the
OpenMP and MPI implementations for the application selected in 1.1.
Questions
2.1 Extend the methodology proposed in 1.2 to compare and characterize two implementations
of the same algorithm (MPI and OpenMP based).
2.2 Similar to what has been done in section 1.3, and using the methodology described in 2.1,
perform a study for the application selected in 1.1. As in question 1, the maximum number of
runs is 64 (results obtained in 1.3 can be re-used). Carefully select the working set; keep in
mind that some of the working sets may have really large runtimes. The student must provide
at the end of the document the scripts used to launch the experiments. The study must present
its results as plots; tables are not accepted.


Evaluation criteria
The following criteria will be used in the evaluation: proper utilization of the MPI or OpenMP
models, brevity and clarity of results, experiment setup, and discussion and analysis.

Format
One PDF document containing all the answers for the selected option:

- The answers to the formulated questions (must not exceed 6 pages).
- All the codes or scripts developed, added as an annex section at the end of the
document (no limit).

Provide one tar file with the developed codes (if any):

$ tar cvf tot.tar fitxer1 fitxer2 ...

Deadline

