Lecture 5 Architecture of Parallel Computers
Loop-independent vs. loop-carried dependences

[3.2] Loop-carried dependence: dependence exists across iterations;
i.e., if the loop is removed, the dependence no longer exists.
Loop-independent dependence: dependence exists within an iteration;
i.e., if the loop is removed, the dependence still exists.
Example:

for (i=1; i<n; i++) {
  S1: a[i] = a[i-1] + 1;
  S2: b[i] = a[i];
}

for (i=1; i<n; i++)
  for (j=1; j<n; j++)
    S3: a[i][j] = a[i][j-1] + 1;

for (i=1; i<n; i++)
  for (j=1; j<n; j++)
    S4: a[i][j] = a[i-1][j] + 1;

In the dependence notation used in these notes, X T Y means there is a
true dependence from X to Y, and X A Y means there is an anti-dependence.
S1[i] T S1[i+1]: loop-carried
S1[i] T S2[i]: loop-independent
S3[i,j] T S3[i,j+1]: loop-carried on the for j loop; no loop-carried
dependence in the for i loop
S4[i,j] T S4[i+1,j]: no loop-carried dependence in the for j loop;
loop-carried on the for i loop

Iteration-space Traversal Graph (ITG)
[3.2.1] The ITG shows graphically the order of traversal in the
iteration space. This is sometimes called the happens-before
relationship. In an ITG,
A node represents a point in the iteration space
A directed edge indicates the next point that will be encountered
after the current point is traversed
Example:

for (i=1; i<4; i++)
  for (j=1; j<4; j++)
    S3: a[i][j] = a[i][j-1] + 1;
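
The ITG edges for a loop nest like this one can be computed rather than drawn. The following is a minimal sketch, assuming a square iteration space where both loops run from lo to hi-1. The names Point, itg_next and itg_length are hypothetical helpers introduced here for illustration; they do not come from the notes.

```c
#include <assert.h>
#include <stdbool.h>

/* A point (i, j) in a two-dimensional iteration space. */
typedef struct { int i, j; } Point;

/* Hypothetical helper: follows the ITG edge leaving point p for
 *   for (i = lo; i < hi; i++)
 *     for (j = lo; j < hi; j++)
 * Returns true and stores the successor in *next, or returns false
 * when p is the last point traversed. Assumes p lies inside the space. */
bool itg_next(Point p, int lo, int hi, Point *next)
{
    if (p.j + 1 < hi) {          /* stay in row i, advance the inner loop */
        next->i = p.i;
        next->j = p.j + 1;
        return true;
    }
    if (p.i + 1 < hi) {          /* inner loop finished: wrap to next row */
        next->i = p.i + 1;
        next->j = lo;
        return true;
    }
    return false;                /* (hi-1, hi-1) has no outgoing edge */
}

/* Counts the nodes visited when following ITG edges from (lo, lo). */
int itg_length(int lo, int hi)
{
    assert(lo < hi);             /* the iteration space must be non-empty */
    Point p = { lo, lo }, q;
    int count = 1;
    while (itg_next(p, lo, hi, &q)) {
        p = q;
        count++;
    }
    return count;
}
```

For the 3-by-3 example (lo = 1, hi = 4), the edge leaving (1,3) leads to (2,1), which is the row wrap-around edge of the ITG, and the traversal visits all nine points.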
© 2010 Edward F. Gehringer, CSC/ECE 506 Lecture Notes, Spring 2010

Loop-carried Dependence Graph (LDG)
LDG shows the true/anti/output dependence relationship
graphically.
A node is a point in the iteration space.
A directed edge represents the dependence.
Example:

for (i=1; i<4; i++)
  for (j=1; j<4; j++)
    S3: a[i][j] = a[i][j-1] + 1;

[Figure: LDG for the loop above, with axes i and j running from 1 to 3;
every edge is labeled T.]

Another example:

for (i=1; i<=n; i++)
  for (j=1; j<=n; j++)
    S1: a[i][j] = a[i][j-1] + a[i][j+1] + a[i-1][j] + a[i+1][j];

for (i=1; i<=n; i++)
  for (j=1; j<=n; j++) {
    S2: a[i][j] = b[i][j] + c[i][j];
    S3: b[i][j] = a[i][j-1] * d[i][j];
  }

Draw the ITG
List all the dependence relationships
Note that there are two loop nests in the code.
The first involves S1.
The other involves S2 and S3.
What do we know about the ITG for these nested loops?

Dependence relationships for Loop Nest 1
True dependences:
o S1[i,j] T S1[i,j+1]
o S1[i,j] T S1[i+1,j]
Output dependences:
o None
Anti-dependences:
o S1[i,j] A S1[i+1,j]
o S1[i,j] A S1[i,j+1]
Exercise: Suppose we dropped off the first half of S1, so we had
S1: a[i][j] = a[i-1][j] + a[i+1][j];
or the last half, so we had
S1: a[i][j] = a[i][j-1] + a[i][j+1];
Which of the dependences would still exist?

Draw the LDG for Loop Nest 1.

[Figure: LDG for Loop Nest 1, with axes i and j running from 1 to n.
Note: each edge represents both true and anti-dependences.]

Dependence relationships for Loop Nest 2
True dependences:
o S2[i,j] T S3[i,j+1]
Output dependences:
o None
Anti-dependences:
o S2[i,j] A S3[i,j] (loop-independent dependence)

Draw the LDG for Loop Nest 2.

[Figure: LDG for Loop Nest 2, with axes i and j running from 1 to n.
Note: each edge represents only true dependences.]

Why are there no vertical edges in this graph?

Why is the anti-dependence not shown on the graph?

Finding parallel tasks across iterations
[3.2.2] Analyze loop-carried dependences:
Dependences must be enforced (especially true dependences;
other dependences can be removed by privatization)
There are opportunities for parallelism when some dependences
are not present.
Example 1

LDG: [Figure: LDG for Example 1.]

for (i=2; i<=n; i++)
  S: a[i] = a[i-2];
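
In this loop iteration i reads only a[i-2], so the even-numbered and odd-numbered iterations form two chains that never touch each other's elements. The following is a minimal sketch that checks this, assuming a small fixed size N. The function names are hypothetical and introduced only for illustration. It runs the two chains one after the other, odd first, instead of on real threads, which is enough to show that their relative order does not matter.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define N 16   /* illustrative problem size (assumption) */

/* Original loop: iterations run in order i = 2, 3, ..., N. */
void run_original(int a[N + 1])
{
    for (int i = 2; i <= N; i++)
        a[i] = a[i - 2];
}

/* One chain of the loop: iterations start, start+2, start+4, ...
 * A chain only reads and writes indices of a single parity. */
void run_chain(int a[N + 1], int start)
{
    assert(start >= 2);                /* a[start - 2] must exist */
    for (int i = start; i <= N; i += 2)
        a[i] = a[i - 2];
}

/* Returns true if running the whole odd chain before the whole even
 * chain produces the same array as the original interleaved loop. */
bool chains_are_independent(void)
{
    int seq[N + 1], split[N + 1];
    for (int i = 0; i <= N; i++)
        seq[i] = split[i] = 100 + i;   /* arbitrary distinct start values */

    run_original(seq);
    run_chain(split, 3);               /* odd iterations: 3, 5, 7, ... */
    run_chain(split, 2);               /* even iterations: 2, 4, 6, ... */

    return memcmp(seq, split, sizeof seq) == 0;
}
```

Because the two calls to run_chain touch disjoint elements, they could be handed to two threads without any synchronization between them.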

We can divide the loop into two parallel tasks (one with odd
iterations and another with even iterations):

for (i=2; i<=n; i+=2)
  S: a[i] = a[i-2];
for (i=3; i<=n; i+=2)
  S: a[i] = a[i-2];

Example 2

for (i=0; i<n; i++)
  for (j=0; j<n; j++)
    S3: a[i][j] = a[i][j-1] + 1;

LDG: [Figure: LDG for Example 2, with axes i and j.]

How many parallel tasks are there here?

Example 3

for (i=1; i<=n; i++)
  for (j=1; j<=n; j++)
    S1: a[i][j] = a[i][j-1] + a[i][j+1] + a[i-1][j] + a[i+1][j];

LDG: [Figure: LDG for Example 3, with axes i and j running from 1 to n.
Note: each edge represents both true and anti-dependences.]
Identify which nodes are not dependent on each other
In each anti-diagonal, the nodes are independent of each other
We need to rewrite the code to iterate over anti-diagonals:
Calculate number of anti-diagonals
for each anti-diagonal do
  Calculate the number of points in the current anti-diagonal
  for each point in the current anti-diagonal do
    Compute the value of the current point in the matrix
Parallelize the inner loop (over the points in the current
anti-diagonal).
[Figure: LDG with axes i and j running from 1 to n.
Note: each edge represents both true and anti-dependences.]
for (i=1; i <= 2*n-1; i++) { // 2n-1 anti-diagonals
  if (i <= n) {
    points = i;       // number of points in anti-diag
    row = i;          // first pt (row,col) in anti-diag
    col = 1;          // note that row+col = i+1 always
  }
  else {
    points = 2*n - i; // number of points in anti-diag
    row = n;
    col = i-n+1;      // note that row+col = i+1 always
  }
  for_all (k=1; k <= points; k++) {
    a[row][col] = ...; // update a[row][col]
    row--; col++;
  }
}
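
The anti-diagonal traversal can be checked against the original row-by-row loop. The following is a minimal sketch, assuming a small fixed interior size N with one layer of border cells around it so that every neighbour access stays in bounds. The Grid type, the initial data and all function names are hypothetical. The inner loop is run sequentially and in reverse order within each anti-diagonal, to show that the order inside a diagonal does not affect the result.

```c
#include <stdbool.h>

#define N 5                      /* interior size (assumption) */
typedef long Grid[N + 2][N + 2]; /* rows and columns 0 and N+1 are borders */

/* Hypothetical initial data; any values would do for the comparison. */
void grid_init(Grid a)
{
    for (int i = 0; i <= N + 1; i++)
        for (int j = 0; j <= N + 1; j++)
            a[i][j] = (i * 7 + j * 3) % 5;
}

/* The update performed by statement S1. */
static void update(Grid a, int i, int j)
{
    a[i][j] = a[i][j - 1] + a[i][j + 1] + a[i - 1][j] + a[i + 1][j];
}

/* Original traversal: row by row. */
void sweep_rowwise(Grid a)
{
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            update(a, i, j);
}

/* Anti-diagonal traversal; d plays the role of i in the notes' code. */
void sweep_antidiagonal(Grid a)
{
    for (int d = 1; d <= 2 * N - 1; d++) {
        int points    = (d <= N) ? d : 2 * N - d;
        int first_row = (d <= N) ? d : N;
        int first_col = (d <= N) ? 1 : d - N + 1;
        /* Each iteration derives its own (row, col) from k, so the
         * iterations share no loop-carried variables. They are visited
         * in reverse here; a parallel runtime could run them at once. */
        for (int k = points - 1; k >= 0; k--)
            update(a, first_row - k, first_col + k);
    }
}

/* Returns true if both traversals produce identical grids. */
bool sweeps_agree(void)
{
    Grid x, y;
    grid_init(x);
    grid_init(y);
    sweep_rowwise(x);
    sweep_antidiagonal(y);
    for (int i = 0; i <= N + 1; i++)
        for (int j = 0; j <= N + 1; j++)
            if (x[i][j] != y[i][j])
                return false;
    return true;
}
```

The two sweeps agree because every point on anti-diagonal d reads already-updated values only from diagonal d-1 and not-yet-updated values only from diagonal d+1, exactly as in the row-by-row order.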

DOACROSS Parallelism
[3.2.3] Suppose we have this code:

for (i=1; i<=N; i++) {
  S: a[i] = a[i-1] + b[i] * c[i];
}

Can we execute anything in parallel?
Well, we can't run the iterations of the for loop in parallel, because
S[i] T S[i+1]. (There is a loop-carried dependence.)
But notice that the b[i] * c[i] part has no loop-carried dependence.
This suggests breaking up the loop into two:

for (i=1; i<=N; i++) {
  S1: temp[i] = b[i] * c[i];
}
for (i=1; i<=N; i++) {
  S2: a[i] = a[i-1] + temp[i];
}

The first loop is ||izable.
The second is not.
Execution time: $N(T_{S1} + T_{S2})$
What is a disadvantage of this approach?

Here's how to solve this problem:

post(0);
for (i=1; i<=N; i++) {
  S1: temp = b[i] * c[i];
  wait(i-1);
  S2: a[i] = a[i-1] + temp;
  post(i);
}

What is the execution time now?

Parallelism across statements in a loop
[3.2.4] Identify dependences in a loop body.
If there are independent statements, we can split/distribute the loop.
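
Loop distribution ([3.2.4]) can be illustrated with a deliberately small case. The following is a minimal sketch, assuming a loop body with two statements that never touch each other's arrays. The arrays x and y and all function names are hypothetical, chosen for illustration, and are not taken from the notes. The distributed loops are run in the opposite order from the fused loop to show that neither depends on the other.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define N 8   /* illustrative size (assumption) */

/* Fused loop: S1 and S2 share the loop but no data. */
void fused(int x[N + 1], int y[N + 1])
{
    for (int i = 1; i <= N; i++) {
        x[i] = x[i - 1] + 1;   /* S1: loop-carried dependence, on x only */
        y[i] = 2 * y[i];       /* S2: independent of S1 */
    }
}

/* After distribution, each statement has its own loop (its own task). */
void loop_s1(int x[N + 1])
{
    for (int i = 1; i <= N; i++)
        x[i] = x[i - 1] + 1;
}

void loop_s2(int y[N + 1])
{
    for (int i = 1; i <= N; i++)
        y[i] = 2 * y[i];
}

/* Returns true if the distributed loops give the same arrays as the
 * fused loop, even when the S2 loop runs before the S1 loop. */
bool distribution_preserves_result(void)
{
    int x1[N + 1], y1[N + 1], x2[N + 1], y2[N + 1];
    for (int i = 0; i <= N; i++) {
        x1[i] = x2[i] = 0;
        y1[i] = y2[i] = i;
    }
    assert(x1[0] == 0);        /* S1 builds on x[0], which is never written */

    fused(x1, y1);
    loop_s2(y2);               /* reversed task order */
    loop_s1(x2);

    return memcmp(x1, x2, sizeof x1) == 0 &&
           memcmp(y1, y2, sizeof y1) == 0;
}
```

Here the S1 loop is still sequential because of its own loop-carried dependence, but it can run as one task while the S2 loop runs as another.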
Example:

for (i=0; i<n; i++) {
  S1: a[i] = b[i+1] * a[i-1];
  S2: b[i] = b[i] * coef;
  S3: c[i] = 0.5 * (c[i] + a[i]);
  S4: d[i] = d[i-1] * d[i];
}

Loop-carried dependences:

Loop-indep. dependences:

Note that S4 has no dependences with other statements
S1[i] A S2[i+1] implies that S2 at iteration i+1 must be executed
after S1 at iteration i. Hence, the dependence is not violated if all S2s
are executed after all S1s.
After loop distribution:

for (i=0; i<n; i++) {
  S1: a[i] = b[i+1] * a[i-1];
  S2: b[i] = b[i] * coef;
  S3: c[i] = 0.5 * (c[i] + a[i]);
}

for (i=0; i<n; i++) {
  S4: d[i] = d[i-1] * d[i];
}

Each loop is a parallel task.
This is called function parallelism, and can be distinguished from data
parallelism, which we saw in DOALL and DOACROSS.
Further transformations can be performed (see p. 44 of text).
Characteristics of function parallelism:

Can use function parallelism along with data parallelism when data
parallelism is limited.
DOPIPE Parallelism
[3.2.5] Another strategy for loop-carried dependences is pipelining the
statements in the loop.

Consider this situation:

for (i=2; i<=N; i++) {
  S1: a[i] = a[i-1] + b[i];
  S2: c[i] = c[i] + a[i];
}

Loop-carried dependences:

Loop-indep. dependences:

To parallelize, we just need to make sure the two statements are
executed in sync:

for (i=2; i<=N; i++) {
  a[i] = a[i-1] + b[i];
  post(i);
}

for (i=2; i<=N; i++) {
  wait(i);
  c[i] = c[i] + a[i];
}

Question: What's the difference between DOACROSS and DOPIPE?
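
The ordering that post(i) and wait(i) impose on the two DOPIPE loops can be modelled without real threads. The following is a single-threaded sketch, not a threaded implementation. It assumes post(i) simply sets a flag and wait(i) blocks until that flag is set. The Pipe structure, the step functions and the scheduler are hypothetical constructs introduced for illustration. The scheduler always offers the next step to the second loop first, so that loop blocks in wait(i) as often as possible.

```c
#include <stdbool.h>

#define N 8   /* last loop index (assumption) */

/* Shared state: posted[i] is set once the first loop finishes iteration i. */
typedef struct {
    int  a[N + 1], b[N + 1], c[N + 1];
    bool posted[N + 1];
    int  next_s1;   /* next iteration of the first loop */
    int  next_s2;   /* next iteration of the second loop */
} Pipe;

/* One iteration of the first loop: S1, then post(i). */
static bool step_s1(Pipe *p)
{
    int i = p->next_s1;
    if (i > N)
        return false;                     /* loop finished */
    p->a[i] = p->a[i - 1] + p->b[i];      /* S1 */
    p->posted[i] = true;                  /* post(i) */
    p->next_s1++;
    return true;
}

/* One iteration of the second loop: wait(i), then S2.
 * Returns false if the loop is finished or would block in wait(i). */
static bool step_s2(Pipe *p)
{
    int i = p->next_s2;
    if (i > N || !p->posted[i])
        return false;
    p->c[i] = p->c[i] + p->a[i];          /* S2 */
    p->next_s2++;
    return true;
}

/* Runs both loops to completion, preferring the second loop. */
void run_dopipe(Pipe *p)
{
    p->next_s1 = p->next_s2 = 2;
    for (int i = 0; i <= N; i++)
        p->posted[i] = false;
    while (p->next_s1 <= N || p->next_s2 <= N)
        if (!step_s2(p))
            step_s1(p);
}

/* Returns true if the pipelined order matches the original single loop. */
bool dopipe_matches_sequential(void)
{
    Pipe p;
    int a[N + 1], b[N + 1], c[N + 1];
    for (int i = 0; i <= N; i++) {
        a[i] = p.a[i] = 1;
        b[i] = p.b[i] = i;
        c[i] = p.c[i] = 10;
    }
    for (int i = 2; i <= N; i++) {        /* original loop from the notes */
        a[i] = a[i - 1] + b[i];
        c[i] = c[i] + a[i];
    }
    run_dopipe(&p);
    for (int i = 0; i <= N; i++)
        if (a[i] != p.a[i] || c[i] != p.c[i])
            return false;
    return true;
}
```

In a real implementation the two loops would run on separate threads and wait(i) would be a blocking synchronization call; this model only demonstrates that S2 at iteration i never runs before S1 at iteration i has completed.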
