

Cluster Programming with MPI

Number Crunching
Linux clusters are all the rage. As this article shows, programming a cluster with MPI (Message Passing Interface) need not be too difficult. Of course, we realize that most readers will not have a cluster of Linux machines under their desks, so the sample programs will run on any normal PC.

BY HEIKO BAUKE

MPI is the undisputed leader when it comes to programming Linux clusters and other massively parallel computers. The development of parallelized programs raises a number of issues: the load needs to be spread evenly, and the work performed by the many simultaneous processes needs to be coordinated.

Message Exchanges

Message exchanges are the central concept at the heart of MPI. Many processes combine to solve a large problem and talk to each other to coordinate their work. The concept is highly generic and can be implemented on a variety of computer architectures. This makes it more or less irrelevant to the programmer whether the program will run on a large-scale SMP machine or overnight on a collection of networked, otherwise idle office machines.

MPI programs are usually designed according to the SPMD paradigm (Single Program, Multiple Data): multiple processes run identical program code, but each process handles different data. Each process is assigned a unique rank that influences its execution.

Implementations

The MPI standard defines only one API (or three, to be more precise: one each for Fortran, C, and C++). Every supercomputer manufacturer offers its own implementation, optimized for its own hardware. Besides these, there are also free implementations such as LAM-MPI [3] and MPICH [2].

How an MPI program is compiled, debugged, and launched depends on the implementation. The information in this article is based on LAM-MPI and the C++ API. The installation steps for LAM are well documented, and in case of installation issues, competent help is always at hand via the LAM mailing list [4].

Hello World!

Listing 1 shows a simple MPI program that outputs the names of the computers running the processes.

The header file included here, mpi.h, provides the MPI-specific prototypes. All MPI classes and functions reside within the MPI namespace. Each MPI program is bracketed by MPI::Init and MPI::Finalize; MPI calls are illegal if they occur before MPI::Init has initialized the MPI environment or after MPI::Finalize has closed it.

The communicator is one of MPI's fundamental concepts. It groups processes that can exchange messages. Communicators are implemented by the MPI::Comm class hierarchy. The MPI::COMM_WORLD communicator instance always exists; it contains all the MPI processes and is quite sufficient for simple programs. Libraries that need to encapsulate their communication from the application functions use their own communicators (a short sketch below shows one way to create one). The Get_size method tells you how many processes have been assigned to a communicator, and Get_rank tells you a process's rank. The rank and size variables are local, just like all variables in MPI programs, and can therefore assume different values in each process.
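The following fragment is a minimal sketch of that idea (it is not one of the article's listings): Dup creates a communicator that contains the same processes as MPI::COMM_WORLD but uses a separate communication context, so a library's messages can never be confused with the application's own traffic.

// comm_dup.cc - sketch: give a library its own communicator
#include <cstdlib>
#include <iostream>
#include "mpi.h"

using namespace std;

int main(void) {
  MPI::Init();

  // same process group as MPI::COMM_WORLD, but a private context
  MPI::Intracomm lib_comm=MPI::COMM_WORLD.Dup();

  int rank=lib_comm.Get_rank();   // rank within the duplicated communicator
  int size=lib_comm.Get_size();   // number of processes it contains
  cout << "process " << rank << " of " << size
       << " in the duplicated communicator" << endl;

  lib_comm.Free();                // release the communicator again
  MPI::Finalize();
  return EXIT_SUCCESS;
}

Split, which divides a communicator into disjoint sub-groups, is the other common way of deriving new communicators.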




Calling the Get_processor_name function places the name of the computer running the process in a buffer referenced by the proc_name pointer; proc_name_length receives the length of the name. The string is null-terminated before being sent to the process with rank zero.

This process uses a for loop to receive the processor names of the other processes and store them in a file. If the file cannot be opened, the Abort method can be used to kill all the processes.

If you are wondering why the process ranked zero (and not 42, for example) collects the data: there will always be a process with rank zero. This ensures that the data is sent to an existing process and that the program runs independently of the number of processes, which is only stipulated when the program is launched.

All non-zero processes use the Send method to send their strings. The prototype of the Send method is as follows:

    void Comm::Send(const void* buf, int count,
                    const Datatype& datatype, int dest, int tag) const

It expects a pointer to a contiguous memory area as its first argument. count stipulates how many elements of type datatype are to be sent to the process with rank dest. Table 1 lists the MPI datatype that corresponds to each standard C++ datatype. tag is used to tell messages apart: a message is uniquely identified by its communicator, a datatype, and a tag.

Table 1: An MPI datatype is available for each standard C++ datatype

    MPI Datatype                C++ Datatype
    MPI::CHAR                   char
    MPI::WCHAR                  wchar_t
    MPI::SHORT                  signed short
    MPI::INT                    signed int
    MPI::LONG                   signed long
    MPI::SIGNED_CHAR            signed char
    MPI::UNSIGNED_CHAR          unsigned char
    MPI::UNSIGNED_SHORT         unsigned short
    MPI::UNSIGNED               unsigned int
    MPI::UNSIGNED_LONG          unsigned long int
    MPI::FLOAT                  float
    MPI::DOUBLE                 double
    MPI::LONG_DOUBLE            long double
    MPI::BOOL                   bool
    MPI::COMPLEX                Complex<float>
    MPI::DOUBLE_COMPLEX         Complex<double>
    MPI::LONG_DOUBLE_COMPLEX    Complex<long double>

The receiving process must call a receive method that matches the send: the communicator, datatype, and tag of the sending and receiving calls must agree. The prototype of the receiving method, Recv, is:

    void Comm::Recv(void* buf, int count,
                    const Datatype& datatype, int source, int tag) const

The arguments mean the same as in the sending method, but here they refer to the source of the message instead of the target and specify a receive buffer rather than a send buffer. Programmers must ensure that the receive buffer is large enough.

Our Hello World program writes the processor names to a file rather than to standard output. This makes sense because MPI programs are not typically bound to a terminal. Additionally, the MPI standard does not specify what happens to data written to STDOUT or STDERR: the LAM implementation passes the processes' standard output on to mpirun, while other implementations may send it to /dev/null. Programmers who want to write portable code are well advised to avoid output of this kind.

Listing 1: Hello World Program with MPI


// hello_world.cc
//
// Hello World program

#include <cstdlib>
#include <iostream>
#include <fstream>
#include <new>
#include "mpi.h"

using namespace std;

int main(void) {

  MPI::Init();                                   // initialize MPI

  int rank=MPI::COMM_WORLD.Get_rank();           // ascertain own rank
  int size=MPI::COMM_WORLD.Get_size();           // and number of processes

  const int max_proc_name_length=MPI::MAX_PROCESSOR_NAME+1;
  int proc_name_length;
  char *proc_name;
  try {
    proc_name=new char[max_proc_name_length];
  }
  catch (bad_alloc &e) {                         // not enough memory
    MPI::COMM_WORLD.Abort(EXIT_FAILURE);         // terminate all processes
  }

  MPI::Get_processor_name(proc_name, proc_name_length);  // which hostname?
  proc_name[proc_name_length]='\0';

  if (rank==0) {                                 // rank 0 collects the data
    ofstream out("hello_world.out");             // create output file
    if (!out)                                    // error => terminate all processes
      MPI::COMM_WORLD.Abort(EXIT_FAILURE);
    // Receive the reports of the other ranks and write them to the file
    for (int i=0; i<size; ++i) {
      if (i>0)                                   // rank 0 already knows its own name
        MPI::COMM_WORLD.Recv(proc_name, max_proc_name_length, MPI::CHAR, i, 0);
      out << "Hello World! My rank is " << i << " of " << size << ". "
          << "I am running on " << proc_name << "." << endl;
    }
  } else {
    // Send the processor name string to rank 0
    MPI::COMM_WORLD.Send(proc_name, proc_name_length+1, MPI::CHAR, 0, 0);
  }

  MPI::Finalize();                               // terminate MPI

  return EXIT_SUCCESS;
}




The program can be compiled with any C++ compiler, but using the wrapper compiler that is normally supplied with the MPI implementation ensures that the required libraries and header files are found. The wrapper compiler for LAM-MPI is called mpiCC:

    mpiCC -o hello_world hello_world.cc

After compiling, simply enter:

    mpirun -np 4 hello_world

to launch four instances of the hello_world program; the -np parameter defines the number of processes. Before doing so, ensure that the LAM daemons have been launched on all the machines that will later be running MPI programs. To do so, create a nodes file that contains one hostname per line (in the simplest case just localhost) and launch the daemons by typing:

    lamboot -v nodes
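A minimal nodes file might look like the following; the hostnames are placeholders for whatever machines make up your cluster:

    localhost
    node01.example.org
    node02.example.org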
You only need to launch the daemons once; if you terminate them with lamhalt, you will need to re-launch them explicitly.

Ping Pong

The few MPI functions introduced in the previous section already allow you to write fairly useful programs. Network bandwidth plays an important role in Beowulf clusters, and a small ping-pong program (see Listing 2) can be used to measure it: a message of fixed length is passed from process 0 to process 1 and straight back from process 1 to process 0. The elapsed time is measured, and a mean value over multiple repetitions is calculated.

The MPI initialization phase should be familiar from the Hello World program. The actual measurement is performed in two nested loops: an outer loop over the packet sizes and an inner loop over the repeated messages. The Barrier method is new; it ensures that a communicator's processes are synchronized: each process waits inside Barrier until all processes in the communicator have reached the call.

Listing 2: Measuring bandwidth with Ping Pong


// ping_pong.cc
//
// Ascertain bandwidth in relation to packet size

#include <cstdlib>
#include <iostream>
#include <fstream>
#include "mpi.h"

using namespace std;

const int max_packet_size=0x1000000;  // maximum size of a message
const int count=250;                  // number of messages per measurement
char buff[max_packet_size];           // send and receive buffer

int main(void) {

  MPI::Init();                                  // initialize MPI

  int rank=MPI::COMM_WORLD.Get_rank();          // ascertain own rank
  int size=MPI::COMM_WORLD.Get_size();          // and number of processes

  if (size==2) {                                // exactly two processes needed
    ofstream out;
    if (rank==0) {                              // open output file
      out.open("ping_pong.dat");
      if (!out)
        MPI::COMM_WORLD.Abort(EXIT_FAILURE);
      out << "# Data throughput relative to packet size" << endl
          << "# Time resolution " << MPI::Wtick() << " s" << endl
          << "# Packet size\tmean time\tmaximum time" << endl;
    }
    // Loop through various packet sizes
    int packet_size=1;
    while (packet_size<=max_packet_size) {
      double t_av=0.0;
      double t_max=0.0;
      // Loop through multiple messages per packet size
      for (int i=0; i<count; ++i) {
        MPI::COMM_WORLD.Barrier();              // sync processes
        // Send and/or receive messages
        if (rank==0) {
          double t=MPI::Wtime();                // starting time
          MPI::COMM_WORLD.Send(buff, packet_size, MPI::CHAR, 1, 0);
          MPI::COMM_WORLD.Recv(buff, packet_size, MPI::CHAR, 1, 0);
          t=(MPI::Wtime()-t)/2.0;               // time for one direction
          t_av+=t;                              // accumulate mean and maximum
          if (t>t_max)
            t_max=t;
        } else {
          MPI::COMM_WORLD.Recv(buff, packet_size, MPI::CHAR, 0, 0);
          MPI::COMM_WORLD.Send(buff, packet_size, MPI::CHAR, 0, 0);
        }
      }
      if (rank==0) {                            // output results
        t_av/=count;
        out << packet_size << "\t\t" << t_av << "\t" << t_max << endl;
      }
      packet_size*=2;                           // double packet size
    }
    if (rank==0)                                // close output file
      out.close();
  }

  MPI::Finalize();                              // terminate MPI

  return EXIT_SUCCESS;
}
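Listing 2 expects exactly two processes and writes its measurements to ping_pong.dat, so a typical session with the commands introduced above would look something like this:

    mpiCC -o ping_pong ping_pong.cc
    mpirun -np 2 ping_pong

The resulting file contains one line per packet size with the mean and maximum transfer times and can then be plotted, for example with gnuplot.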




Synchronization is important to avoid misleading results. The MPI::Wtime function returns the time in seconds relative to an internal reference point; MPI::Wtick specifies the granularity of this timer.

Figure 1 shows the results. For Ethernet and Myrinet, the throughput grows roughly in proportion to the message size until the maximum bandwidth is reached at a message size of about 64 kilobytes. The transfer time for a message roughly comprises two parts: a constant latency (approx. 9 microseconds for Myrinet and approx. 75 microseconds for Ethernet) and the actual transfer time, which is proportional to the message length. The latency is significant for smaller messages.

When messages are exchanged via shared memory, a peak in the transfer rate can be observed; its position depends on the hardware and software used. Packets of more than 64 Kbytes no longer fit into the fast cache memory, and the transfer rate drops drastically. For very large messages, the bandwidth is restricted by the slower main memory.

Figure 1: Results of the bandwidth test (average bandwidth in MByte/s plotted against message size) for shared memory on a dual Athlon 1200 MHz, a dual Pentium III 800 MHz, and a dual UltraSPARC II 400 MHz, as well as for Ethernet and Myrinet. Myrinet data courtesy of Tobias Czauderna and Andreas Herzog (Univ. Magdeburg).

Collective Communication

Send and Recv are used to exchange messages between two processes; MPI also provides communication methods that involve all the processes handled by a communicator.

A process that needs to communicate a result to all the other processes could use a loop to call Send and thus transfer the data to every other process in turn; the other processes would then call Recv to receive it. The overhead of this method, however, grows with the total number of processes.

Bcast provides a simpler and more efficient alternative: it distributes a message to all processes in a time proportional to the binary logarithm of the number of processes. The prototype for Bcast is as follows:

    void Comm::Bcast(void *buffer, int count,
                     const Datatype& datatype, int root) const

The first argument is a pointer to a buffer that stores the data to be sent or received. The buffer contains count elements of the datatype specified by the third argument. The last argument specifies the root process, whose data are to be distributed to the other processes.

Bcast communication uses an extremely effective tree. With eight processes, for example, process 0 sends the data to process 4 in step 1. Now two processes have the data (0 and 4); both of them send the data simultaneously, to processes 2 and 6. Finally, the four processes that hold the data, 0, 2, 4, and 6, send it to processes 1, 3, 5, and 7.

Reduce is another useful function that does exactly the opposite of Bcast. The prototype for Reduce is as follows:

    void Comm::Reduce(const void *sendbuf, void *recvbuf, int count,
                      const Datatype &datatype, const Op &op, int root) const

Reduce uses the operation specified by the op argument to combine the data that all processes have placed in the buffer pointed to by sendbuf. The result is placed in the recvbuf of the process referenced by root. The send and receive buffers contain count elements. The count, datatype, op, and root arguments must be identical on all processes.

Table 2: MPI Operators for Reduce

    Operand        Meaning
    MPI::MAX       Maximum
    MPI::MIN       Minimum
    MPI::SUM       Sum
    MPI::PROD      Product
    MPI::LAND      logical AND
    MPI::BAND      bitwise AND
    MPI::LOR       logical OR
    MPI::BOR       bitwise OR
    MPI::LXOR      logical exclusive OR
    MPI::BXOR      bitwise exclusive OR
    MPI::MAXLOC    Maximum and its location
    MPI::MINLOC    Minimum and its location
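To show Bcast and Reduce working together (a sketch for illustration, not one of the article's listings), the following fragment broadcasts a parameter from process 0 and then uses MPI::SUM to collect one partial value from every process back onto process 0:

// bcast_reduce.cc - sketch: distribute a parameter, collect a sum
#include <cstdlib>
#include <iostream>
#include "mpi.h"

using namespace std;

int main(void) {
  MPI::Init();
  int rank=MPI::COMM_WORLD.Get_rank();
  int size=MPI::COMM_WORLD.Get_size();

  double h=0.0;
  if (rank==0)
    h=0.25;                      // parameter initially known only to rank 0
  // after Bcast every process holds the same value of h
  MPI::COMM_WORLD.Bcast(&h, 1, MPI::DOUBLE, 0);

  double local=h*rank;           // each process contributes a partial value
  double total=0.0;
  // Reduce combines the partial values with MPI::SUM on rank 0
  MPI::COMM_WORLD.Reduce(&local, &total, 1, MPI::DOUBLE, MPI::SUM, 0);

  if (rank==0)
    cout << "sum of h*rank over " << size << " processes: " << total << endl;

  MPI::Finalize();
  return EXIT_SUCCESS;
}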




The case where local data belonging to one process needs to be distributed across all processes is handled by the Scatter function. In contrast, Gather collates distributed data within a single process. The prototype for Scatter is as follows:

    void Comm::Scatter(const void *sendbuf, int sendcount,
                       const Datatype &sendtype, void *recvbuf,
                       int recvcount, const Datatype &recvtype,
                       int root) const

and for Gather:

    void Comm::Gather(const void *sendbuf, int sendcount,
                      const Datatype &sendtype, void *recvbuf,
                      int recvcount, const Datatype &recvtype,
                      int root) const

root specifies the process from which the data are scattered or at which they are gathered. sendcount specifies the amount of data each process sends, and recvcount how much each receives. The sendcount and recvcount arguments, and likewise sendtype and recvtype, are typically identical.
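Scatter does not appear in Listing 3, so here is a brief sketch of it (again not taken from the article): process 0 owns an array with one block of count integers per process and deals one block out to each rank.

// scatter.cc - sketch: rank 0 deals out one block of numbers per process
#include <cstdlib>
#include <iostream>
#include <vector>
#include "mpi.h"

using namespace std;

int main(void) {
  MPI::Init();
  int rank=MPI::COMM_WORLD.Get_rank();
  int size=MPI::COMM_WORLD.Get_size();

  const int count=4;                      // elements per process
  vector<int> all_data, my_data(count);
  const int *sendbuf=0;                   // only significant on the root
  if (rank==0) {
    all_data.resize(count*size);
    for (int i=0; i<count*size; ++i)
      all_data[i]=i;
    sendbuf=&all_data[0];
  }

  // block i of the root's array ends up in my_data on the process with rank i
  MPI::COMM_WORLD.Scatter(sendbuf, count, MPI::INT,
                          &my_data[0], count, MPI::INT, 0);

  cout << "rank " << rank << " got elements "
       << my_data[0] << " ... " << my_data[count-1] << endl;

  MPI::Finalize();
  return EXIT_SUCCESS;
}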
In Listing 3, each process generates a set of pseudo-random numbers; the element-wise maximum, minimum, and sum of these numbers over all processes are then calculated. The individual numbers and the results are collected by process zero and written to a file.

Digital Diffraction

The program on the web site [12] computes diffraction images in parallel. It calculates the pattern produced by the interference of spherical waves originating at various points on the screen. The article at [6] provides additional details.

After initializing MPI, the process ranked zero reads a configuration file that contains the geometry data. A call to Bcast distributes this data to all the other processes.

The program parallelizes the calculation by decomposing the geometry: the screen area for which the intensity distribution is to be calculated is divided into narrow horizontal bands. After the calculation, a call to Reduce determines the intensity of the brightest point, and Gather collects the image data so that process zero can output a portable graymap file.

Conclusion

MPI is extremely powerful. Among other things, MPI provides non-blocking communication, additional collective communication functions, derived datatypes, and special topologies. [9] provides a discussion of several generic aspects of parallel programming. ■

Listing 3: Demonstration of collective message exchange methods


// collective.cc
//
// Sample program for collective communication

#include <cstdlib>
#include <fstream>
#include <vector>
#include "mpi.h"

using namespace std;

int main(void) {

  const int count=8;
  vector<int> rand_nums(count), max(count), min(count), sum(count),
              all_rand_nums;

  MPI::Init();                                  // initialize MPI

  int rank=MPI::COMM_WORLD.Get_rank();          // ascertain own rank
  int size=MPI::COMM_WORLD.Get_size();          // and number of processes

  srand(7*rank);                                // initialize random number generator
  for (int i=0; i<count; ++i)                   // throw dice
    rand_nums[i]=rand()%1000;

  if (rank==0)                                  // allocate space at receiving process
    all_rand_nums.resize(count*size);
  // Collect all random numbers at rank 0
  MPI::COMM_WORLD.Gather(&rand_nums[0], count, MPI::INT,
                         &all_rand_nums[0], count, MPI::INT, 0);

  // Calculate element-wise maximum, minimum, and sum and send them to rank 0
  MPI::COMM_WORLD.Reduce(&rand_nums[0], &max[0], count, MPI::INT, MPI::MAX, 0);
  MPI::COMM_WORLD.Reduce(&rand_nums[0], &min[0], count, MPI::INT, MPI::MIN, 0);
  MPI::COMM_WORLD.Reduce(&rand_nums[0], &sum[0], count, MPI::INT, MPI::SUM, 0);

  if (rank==0) {                                // save results
    ofstream out("collective.out");
    for (int i=0; i<size; ++i) {
      out << "Process " << i << " :";
      for (int j=0; j<count; ++j)
        out << "\t" << all_rand_nums[i*count+j];
      out << endl;
    }
    out << endl << "Maxima :";
    for (int i=0; i<count; ++i)
      out << "\t" << max[i];
    out << endl << "Minima :";
    for (int i=0; i<count; ++i)
      out << "\t" << min[i];
    out << endl << "Sums   :";
    for (int i=0; i<count; ++i)
      out << "\t" << sum[i];
    out << endl;
  }

  MPI::Finalize();                              // terminate MPI

  return EXIT_SUCCESS;
}
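Listing 3 can be compiled and launched like the earlier examples, for instance with four processes:

    mpiCC -o collective collective.cc
    mpirun -np 4 collective

The file collective.out then contains one row of eight pseudo-random numbers per process, followed by the element-wise maxima, minima, and sums.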

INFO

[1] MPI Forum: http://www.mpi-forum.org
[2] MPICH: http://www-unix.mcs.anl.gov/mpi/mpich
[3] LAM-MPI: http://www.lam-mpi.org
[4] LAM mailing list: http://www.lam-mpi.org/MailArchives
[5] P. Pacheco: Parallel Programming with MPI; Morgan Kaufmann Publishers, 1996
[6] Brian Hayes: Digital Diffraction; American Scientist, No. 5, 1996: http://www.americanscientist.org/issues/comsci96/compsci96-05.html
[7] Implementations: http://www.lam-mpi.org/mpi/implementations/
[8] http://www.lam-mpi.org/tutorials/
[9] Ian Foster: Designing and Building Parallel Programs; Addison-Wesley, 1995: http://www-unix.mcs.anl.gov/dbpp/
[10] Tina: http://tina.nat.uni-magdeburg.de
[11] Author's homepage: http://tina.nat.uni-magdeburg.de/heiko/
[12] Additional information: ftp://www.linux-magazin.de/pub/listings/magazin/2003/05/MPI