The processors are characterized by their speeds, and their relative differences are stored in the perf vector. The rates of transfer with the disks, as well as the bandwidth of the network, are not captured in the model which follows.

Moreover, we are interested in the "perfect case", i.e. the case where the size of the problem can be expressed as a sum of terms proportional to the entries of the perf vector. The concept of lowest common multiple is useful in order to specify this in a mathematical way. In other words, we ask that the problem size n be expressible in the form:

n = c · (perf[1] + perf[2] + · · · + perf[p])    (1)
where c is a constant in N which represents a number of integers, perf is a vector of size p which contains the relative performances of the p processors of the cluster, and lcm(perf) denotes the smallest common multiple of the p values stored in the perf vector.
For example, with perf = (8, 5, 3), we describe a first processor that runs 8 times more quickly than the slowest, a second processor that runs 5 times more quickly than the slowest, and a third processor that runs 3 times more quickly than the slowest; we obtain perf[1] + perf[2] + perf[3] = 16 and lcm(8, 5, 3) = 120, and thus, for instance, any n = 16c is acceptable.
In an equivalent way, note that n should be a multiple of the sum of the values stored in the perf vector. Thus, with problem sizes of the form of equation 1, it is very easy for us to assign to each processor a quantity of data proportional to its speed. This is the intuitive initial idea and it characterizes the precondition of the problem. If n cannot itself be expressed as in equation 1, techniques such as those presented in [ARM95] can be used in order to ensure balancing.
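To make the precondition concrete, the following minimal C sketch (ours, for illustration; the function name and the example values are not taken from the implementations discussed below) computes the share of each processor, assuming n is a multiple of the sum of the perf entries:

#include <stdio.h>

/* Illustrative only: split n items proportionally to the perf vector,
   assuming n is a multiple of the sum of its entries (equation 1). */
static void proportional_shares(long n, const int *perf, int p, long *share)
{
    long sum = 0;
    for (int i = 0; i < p; i++) sum += perf[i];
    for (int i = 0; i < p; i++) share[i] = n / sum * perf[i];
}

int main(void)
{
    int perf[3] = {8, 5, 3};                        /* example above */
    long share[3];
    proportional_shares(16 * 120L, perf, 3, share); /* n = 16c, c = 120 */
    for (int i = 0; i < 3; i++)
        printf("processor %d gets %ld integers\n", i, share[i]);
    return 0;
}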
4 New results
In this section we introduce two new results for sorting on heterogeneous clusters. Let us first introduce, in Table 7 on page 8, a comparison of the resources used by the three algorithms. We focus on memory usage, the number of files and the load balance factor. Then we focus on the technical details of the two algorithms.

In Table 7, H-PSS means "Heterogeneous Parallel Sample Sort", H-PSOP means "Heterogeneous Parallel Sorting by Over-Partitioning" and H-PSRS means "Heterogeneous Parallel Sorting by Regular Sampling".

The reader should notice in particular that the memory (RAM) usage is very low, as is the number of files open simultaneously in the implementations. In Table 7, the constant 15 is the number of temporary files used by the polyphase merge sort we have employed, Σperf is the sum of the values stored in the performance vector, p is the number of processors and n is the input size.
4.1 H-PSS: Heterogeneous Parallel Sample Sort

4.1.1 Introduction

The key to success in sorting lies in the pivots, which must partition the initial bucket into roughly equal sizes.

The Parallel Sample Sort (PSS) algorithm [HC83] and its improvement [LS94] do not sort the portions first; instead, they use oversampling to select the pivots. They pick p − 1 pivots by randomly choosing p·s candidates from the entire input data, where s is the oversampling ratio, and then selecting p − 1 pivots from the sorted candidates. Intuitively, a larger oversampling ratio results in better load balancing but increases the cost of selecting the pivots.

We propose the following framework for external sorting, which is based on PSS (a sketch of the pivot-selection step is given after the list):

1. Pick pivots in a way proportional to the perf vector (see below for details);

2. Send the pivots to a master node that sorts them, keeps p − 1 pivots and broadcasts them to each node;

3. Each node partitions its input according to the pivots and sends the partitions to the appropriate processors;

4. Sort the received partitions (in our case we use a polyphase merge sort).
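For the homogeneous skeleton of steps 1 and 2, oversampled pivot selection can be sketched as follows in C (illustrative only; select_pivots, cmp_int and the sampling policy are names and choices made for this example, not the paper's code):

#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Illustrative oversampling: pick p*s random candidates, sort them,
   keep p-1 evenly spaced pivots (pivots must have room for p-1). */
static void select_pivots(const int *data, long n, int p, int s, int *pivots)
{
    int ncand = p * s;
    int *cand = malloc(ncand * sizeof *cand);
    for (int i = 0; i < ncand; i++)
        cand[i] = data[rand() % n];        /* random candidate */
    qsort(cand, ncand, sizeof *cand, cmp_int);
    for (int i = 1; i < p; i++)
        pivots[i - 1] = cand[i * s];       /* every s-th candidate */
    free(cand);
}

int main(void)
{
    int data[1000], pivots[3];
    for (int i = 0; i < 1000; i++) data[i] = rand();
    select_pivots(data, 1000, 4, 5, pivots);   /* p = 4, s = 5 */
    for (int i = 0; i < 3; i++) printf("pivot %d: %d\n", i, pivots[i]);
    return 0;
}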
It can be shown [LS94] that for an unsorted list of size n, kp − 1 pivots (with k ≥ 1) partition the list into kp sublists such that the size of the maximum sublist is less than or equal to ℓ·n/(kp), with a probability that grows rapidly with ℓ (the exact bound is given in [LS94]).

In the case of a heterogeneous cluster (processors at different speeds) we simulate a machine with Σperf processors, where Σperf equals the sum of the coefficients in the performance vector.

Example: for perf = (8, 5, 3, 1), we obtain that the size of the maximum sublist is less than or equal to ℓ·n/(k·Σperf) with the probability bound recalled above; with s set as below, we get a high probability. Now, from an out-of-core point of view, the increase (sustained by ℓ) in the number of pivots is acceptable because the memory usage stays low!

We set s = ⌈log₂ Σperf⌉ to mimic the framework of Li and Sevcik [LS94]. Finally, let Σperf be the sum of the values in the performance vector. Thus the total number of pivots selected in our implementation of external PSS is Σperf · ⌈log₂ Σperf⌉. Note that this number is quite low from an out-of-core point of view, so the corresponding integers fit in main memory. Moreover, Σperf is necessarily a divisor of n (the precondition of equation 1). Note also that the more the cluster is unbalanced (for instance, one processor is 1000 times more powerful than the others), the higher the probability... that is to say, we have a better chance of getting balanced sublists. The choice of Σperf · ⌈log₂ Σperf⌉ pivots is thus justified to capture the heterogeneity of the machine!
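The pivot budget is straightforward to compute; the following C fragment (an illustration under our naming) evaluates it for the example vector:

#include <math.h>
#include <stdio.h>

/* Illustrative only: pivot budget of external H-PSS,
   i.e. sum(perf) * ceil(log2(sum(perf))). */
int main(void)
{
    int perf[] = {8, 5, 3, 1};
    int p = 4, sum_perf = 0;

    for (int i = 0; i < p; i++) sum_perf += perf[i];

    int s = (int)ceil(log2((double)sum_perf)); /* s = ceil(log2(sum_perf)) */
    printf("sum(perf) = %d, s = %d, pivots = %d\n",
           sum_perf, s, sum_perf * s);         /* prints 17, 5, 85 */
    return 0;
}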
4.2 H-PSOP: Heterogeneous Parallel Sorting by Over-Partitioning

4.2.1 Introduction

Li and Sevcik in [LS94] proposed an algorithm for in-core sorting on homogeneous platforms with no sequential sort at the beginning. The choice and the number of pivots are made according to the discussion of the previous section: for an unsorted list of size n, kp − 1 pivots (with k ≥ 1) partition the list into kp sublists such that the size of the maximum sublist is less than or equal to ℓ·n/(kp), with the probability bound of [LS94].
The algorithm presented in [LS94] for sorting on homogeneous platforms with the over-partitioning technique is as follows:

Algorithm 1 (PSOP [LS94])

Step 1 (initial distribution): initially, processor i has L_i, a portion of size n/p of the unsorted list L;

Step 2 (selecting pivots): a sample of k·p·s candidates is randomly picked from the list, where s is the oversampling ratio and k the over-partitioning ratio. Each processor picks k·s candidates, and the sorted sample yields the kp − 1 pivots;

Step 3 (partitioning): each processor splits its portion into kp buckets whose keys fall between two consecutive pivots (except for the initial and final case). A sublist S_j is the union of the buckets B_{i,j} with i ranging over all processors. There are kp sublists;

Step 4 (building a task queue and sorting sublists): let T_j denote the task of sorting S_j. The size of each sublist can be computed:

|S_j| = Σ_i |B_{i,j}|.

Also, the starting position of sublist S_j in the final sorted array can be calculated:

pos_j = Σ_{j' < j} |S_{j'}|.

A task queue is built with the tasks ordered from the largest sublist size to the smallest. Each processor repeatedly takes one task T_j at a time from the queue. It processes the task by (a) copying the p parts of the sublist S_j into the final array at positions pos_j to pos_j + |S_j| − 1, and (b) applying a sequential sort to the elements in that range. The process continues until the task queue is empty. A sketch of this bookkeeping is given after the algorithm.
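The following C fragment illustrates Step 4's bookkeeping on a toy example (the bucket counts are hard-coded and the names are ours; this is not the PSOP code): sublist sizes are accumulated over processors, starting positions are prefix sums, and tasks are ordered largest first.

#include <stdio.h>

#define P  2   /* processors */
#define KP 4   /* k*p sublists */

int main(void)
{
    long bucket[P][KP] = {   /* bucket[i][j] = |B_{i,j}|, toy values */
        {3, 7, 2, 8},
        {5, 1, 6, 4},
    };
    long size[KP], pos[KP];
    int  order[KP];

    for (int j = 0; j < KP; j++) {               /* |S_j| = sum_i |B_{i,j}| */
        size[j] = 0;
        for (int i = 0; i < P; i++) size[j] += bucket[i][j];
    }
    pos[0] = 0;                                  /* pos_j: prefix sums */
    for (int j = 1; j < KP; j++) pos[j] = pos[j - 1] + size[j - 1];

    for (int j = 0; j < KP; j++) order[j] = j;   /* largest-first order */
    for (int a = 0; a < KP; a++)
        for (int b = a + 1; b < KP; b++)
            if (size[order[b]] > size[order[a]]) {
                int t = order[a]; order[a] = order[b]; order[b] = t;
            }

    for (int a = 0; a < KP; a++)
        printf("task T%d: size=%ld start=%ld\n",
               order[a], size[order[a]], pos[order[a]]);
    return 0;
}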
4.2.2 The heterogeneous case

The main difference in the heterogeneous case lies in the way we manage the partitions and in the way we select the pivots. First, the number of candidates is calculated according to k·s·Σperf, where Σperf is the sum of the values stored in the performance vector. After a sorting stage, we keep k·Σperf − 1 pivots among the k·s·Σperf candidates, since we have set s = ⌈log₂ Σperf⌉ according to the probability formula given above. Note that this number is independent of the problem size, and also that if Σperf grows (the cluster is more "unbalanced"), the number of pivots grows and we amortize the risk of unbalanced partitions.

Second, we consider, for each task, in order to decide which processor will execute it, the sum of the execution times of all previous tasks that have been allocated to processor i; a sketch of this rule follows.
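Our reading of this allocation rule can be sketched as follows (illustrative C; task costs are approximated by size divided by speed, and all names and values are chosen for the example):

#include <stdio.h>

int main(void)
{
    double speed[] = {8, 5, 3, 1};                   /* perf vector */
    long   task[]  = {900, 700, 650, 400, 300, 150}; /* sizes, largest first */
    double load[4] = {0};                            /* accumulated time */

    for (int t = 0; t < 6; t++) {
        int best = 0;                /* processor finishing this task first */
        for (int i = 1; i < 4; i++)
            if (load[i] + task[t] / speed[i] <
                load[best] + task[t] / speed[best])
                best = i;
        load[best] += task[t] / speed[best];
        printf("task %d (size %ld) -> processor %d\n", t, task[t], best);
    }
    return 0;
}

With these values, even the slowest processor eventually receives a small task, because the fast processors accumulate load on the large ones.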
""##!
5 Experiments

In this section we focus mainly on the load balance factor; in doing so, we validate our approaches. We do not yet investigate the execution times on a heterogeneous cluster. We use a small homogeneous cluster composed of one Pentium III (Katmai), 451 MHz, cache: 512 KB, RAM: 261668 kB, and 3 Celerons (Mendocino), 400 MHz, cache: 128 KB, RAM: 64 MB. The disks were FUJITSU MPD3064AT disks with 512 KB of cache.

We believe this is sufficient in order to study the load balancing factor and to isolate the main properties of our codes. We show that we obtain good load balancing factors both in the case of heterogeneous clusters and in the case of homogeneous clusters (simulated).
Tables 1, 2, 3 and 4 are divided into five columns. From left to right, we have the mean size of the data in the last step of the algorithm (column Mean), the standard deviation of the mean (column SD), the ratio of the mean over the optimal size (column Mean/opt; the values in this column should be close to 100% and represent the quality of the load balance), the ratio of the mean over the standard deviation (column Mean/SD) and, last, the maximal and minimal sizes observed over the experiments (these values can be compared to the mean size to appreciate how the algorithms capture the extreme cases).

       Mean     SD      Mean/opt   Mean/SD
PID0   115632   10847    93.88%     9.38%
PID2   615038   20479   100.10%     3.33%
PID3   986599   20970   100.36%     2.12%

Table 1: Heterogeneous sample sort; heterogeneous configuration of the perf vector.

A first result is presented in Table 1. We sort integers produced by our benchmark numbered 0 (randomly generated data using a linear congruential generator of the form x_{i+1} = a·x_i mod m). After that, we set the performance vector and we observe the load balancing factor (column Mean/opt). A performance vector filled with the same value (1) represents the "homogeneous case". If we fill the vector entirely with the value 10, we also model the "homogeneous case", but if we run the program with this setting we will generate more pivots!

Concerning execution times, a first result is presented in Table 4: we sort a first set of integers (the optimal amount of data per processor is the input size divided according to the perf vector) over 35 experiments, and we observe the mean execution time and its standard deviation. A second result is presented in Table 3: here we sort a second set of integers, again over 35 experiments, and we observe the mean execution time and its standard deviation.

Again, the results (see Tables 1 to 4) for the load expansion metric are good. All the results validate, in terms of load balancing, the approach both for the heterogeneous case and for the homogeneous case; the external parallel sample sort algorithm developed here thus behaves well in this respect.
       Mean      SD       Mean/opt   Mean/SD
PID0    930196    87822    94.62%     9.44%
PID1   2935879   157911    99.51%     5.38%
PID2   4974058   140542   101.2%      2.82%
PID3   7871546   211648   100.36%     2.69%

Table 2: Heterogeneous sample sort; heterogeneous configuration of the perf vector.

READ² moves data from the local memory of a processor to a distant disk. The expected gain of using READ² will be compared to the measured gain for our H-PSS implementation, which we are re-coding to use READ². For H-PSS, it is not hard to observe that about ((p − 1)/p)·n_i data items move from one node to another disk during the redistribution of the data (in the most favorable case for the partitioning), where n_i is the initial amount of data on the local disk of processor i. This is an important amount of information. In this case, READ² should bring us a significant speedup both in terms of io-bus usage and in terms of memory-bus usage. Since the memory-bus usage is expected to be reduced, a supplementary question arises: is it possible to use the spare memory bandwidth to start the final sequential external sort concurrently with the redistribution of the data, in order to overlap communication and computation more efficiently?
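As a rough, illustrative order of magnitude for this redistribution volume (the value of n_i below is arbitrary):

#include <stdio.h>

/* Illustrative only: volume leaving one node during redistribution,
   about (p-1)/p of its local data in the favorable case above. */
int main(void)
{
    int  p = 4;
    long n_i = 1L << 22;             /* integers on the local disk */
    long moved = n_i / p * (p - 1);  /* ~ ((p-1)/p) * n_i */
    printf("about %ld of %ld integers move to other nodes\n", moved, n_i);
    return 0;
}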
       Mean     SD      Mean/opt   Mean/SD
PID0   472349   59876    90.45%    12.67%
PID1   526403   52376   100.79%     9.95%

Table 3: Heterogeneous sample sort; homogeneous configuration of the perf vector.

       Mean      SD       Mean/opt   Mean/SD
PID0   3918862   584064    93.78%    14.90%
PID1   4150519   625341    99.34%    15.06%
PID2   3935862   579492    94.2%     14.72%
PID3   4706443   505979   112.65%    10.75%

Table 4: Heterogeneous sample sort; homogeneous configuration of the perf vector.

       Mean     SD    Mean/opt   Max, Min
PID0   929386   229   100.059%   929858, 929018
PID1   580687   225   100.027%   581129, 580170

Table 5: Heterogeneous sample sort.
       Mean      SD     Mean/opt   Max, Min
PID0   7898307   1690   100.04%    7901360, 7895160
PID1   4936858   1621   100.05%    4939629, 4933891
PID2   2956149   1414    99.84%    2959340, 2953082
PID3    985900   1115    99.89%     987923,  983598

Table 6: Heterogeneous sample sort; heterogeneous configuration of the perf vector.

Table 7 (page 8) also reports the sensitivity to duplicates: there is none, up to a bound of ℓ·n/(k·Σperf) duplicates.

References

[DNS91] David J. DeWitt, Jeffrey F. Naughton, and Donovan A. Schneider. Parallel sorting on a shared-nothing architecture using probabilistic splitting. In Proceedings of the First International Conference on Parallel and Distributed Information Systems, pages 280–291, December 1991.

[HC83] J. S. Huang and Y. C. Chow. Parallel sorting and data partitioning by sampling. In Proceedings of the 7th International Computer Software and Applications Conference, pages 627–631, November 1983.

[Kim86] Michelle Y. Kim. Synchronized disk interleaving. IEEE Transactions on Computers, C-35(11), November 1986.

[Knu98] Donald E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, Reading, MA, USA, second edition, 1998.

[LS94] Hui Li and Kenneth C. Sevcik. Parallel sorting by overpartitioning. In Proceedings of the 6th Annual Symposium on Parallel Algorithms and Architectures, pages 46–56, New York, NY, USA, June 1994. ACM Press.

[NV95] Mark H. Nodine and Jeffrey Scott Vitter. Greed sort: Optimal deterministic sorting on parallel disks. Journal of the ACM, 42(4):919–933, July 1995.

[Pea99] Matthew D. Pearson. Fast out-of-core sorting on parallel disk systems. Technical Report PCS-TR99-351, Dept. of Computer Science, Dartmouth College, Hanover, NH, June 1999.

[Raj98] S. Rajasekaran. A framework for simple sorting algorithms on parallel disk systems (extended abstract). In SPAA: Annual ACM Symposium on Parallel Algorithms and Architectures, 1998.

[SGM86] Kenneth Salem and Hector Garcia-Molina. Disk striping. In Proceedings of the 2nd International Conference on Data Engineering, 1986.