[3.2]

Loop-carried dependence: the dependence exists across iterations; i.e., if the loop is removed, the dependence no longer exists.

Loop-independent dependence: the dependence exists within an iteration; i.e., if the loop is removed, the dependence still exists.

Example:

  for (i=1; i<n; i++)
    for (j=1; j<n; j++)
      S3: a[i][j] = a[i][j-1] + 1;

  for (i=1; i<n; i++)
    for (j=1; j<n; j++)
      S4: a[i][j] = a[i-1][j] + 1;

  S1[i] T S1[i+1]: loop-carried
  S1[i] T S2[i]: loop-independent
  S3[i,j] T S3[i,j+1]: loop-carried on the for j loop; no loop-carried dependence in the for i loop
  S4[i,j] T S4[i+1,j]: no loop-carried dependence in the for j loop; loop-carried on the for i loop

Iteration-space Traversal Graph (ITG) [3.2.1]

The ITG shows graphically the order of traversal in the iteration space. This is sometimes called the happens-before relationship. In an ITG,
  - a node represents a point in the iteration space;
  - a directed edge indicates the next point that will be encountered after the current point is traversed.

Example:

  for (i=1; i<4; i++)
    for (j=1; j<4; j++)
      S3: a[i][j] = a[i][j-1] + 1;

2010 Edward F. Gehringer CSC/ECE 506 Lecture Notes, Spring 2010
Loop-carried Dependence Graph (LDG)

The LDG shows the true/anti/output dependence relationships graphically.
  - A node is a point in the iteration space.
  - A directed edge represents the dependence.

Example:

  for (i=1; i<4; i++)
    for (j=1; j<4; j++)
      S3: a[i][j] = a[i][j-1] + 1;

[Figure: LDG for this loop nest; axes i and j, each with points 1, 2, 3.]
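The edges of this LDG can also be listed mechanically. The following is a minimal sketch, not part of the lecture: it walks the S3 loop nest in program order, remembers which iteration last wrote each array element, and records a true-dependence edge whenever an iteration reads an element written earlier. The names `ldg_edge`, `s3_true_deps`, `print_edges`, and `MAXN` are made up for this illustration.

```c
#include <assert.h>
#include <stdio.h>

#define MAXN 16   /* largest loop bound this sketch supports (assumption) */

/* One LDG edge: a true dependence from iteration (src_i, src_j)
   to iteration (dst_i, dst_j). */
struct ldg_edge { int src_i, src_j, dst_i, dst_j; };

/* Walk  for (i=1; i<n; i++) for (j=1; j<n; j++) a[i][j] = a[i][j-1] + 1;
   in program order and record each true dependence as an edge.
   Returns the number of edges written to out[]. */
int s3_true_deps(int n, struct ldg_edge *out, int max_edges)
{
    /* wi/wj[r][c]: iteration that last wrote a[r][c]; 0 means "not yet
       written" (real iterations start at 1). */
    int wi[MAXN][MAXN] = {{0}};
    int wj[MAXN][MAXN] = {{0}};
    int count = 0;

    assert(n >= 2 && n <= MAXN);
    for (int i = 1; i < n; i++) {
        for (int j = 1; j < n; j++) {
            /* S3 reads a[i][j-1]; an earlier write to it is a true dependence. */
            if (wi[i][j - 1] != 0 && count < max_edges) {
                out[count].src_i = wi[i][j - 1];
                out[count].src_j = wj[i][j - 1];
                out[count].dst_i = i;
                out[count].dst_j = j;
                count++;
            }
            /* S3 writes a[i][j]. */
            wi[i][j] = i;
            wj[i][j] = j;
        }
    }
    return count;
}

/* Print the edges in the notation used in these notes. */
void print_edges(const struct ldg_edge *e, int count)
{
    for (int k = 0; k < count; k++)
        printf("S3[%d,%d] T S3[%d,%d]\n",
               e[k].src_i, e[k].src_j, e[k].dst_i, e[k].dst_j);
}
```

Calling `s3_true_deps(4, edges, 32)` with a caller-supplied `edges` array covers the 3 x 3 iteration space of the example. Every edge it reports stays in the same row i and advances j by one, which is what "loop-carried on the for j loop" means.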
Lecture 5 Architecture of Parallel Computers 3
Another example:
  for (i=1; i<=n; i++)
    for (j=1; j<=n; j++)
      S1: a[i][j] = a[i][j-1] + a[i][j+1] + a[i-1][j] + a[i+1][j];

  for (i=1; i<=n; i++)
    for (j=1; j<=n; j++) {
      S2: a[i][j] = b[i][j] + c[i][j];
      S3: b[i][j] = a[i][j-1] * d[i][j];
    }

  - Draw the ITG.
  - List all the dependence relationships.

Note that there are two loop nests in the code. The first involves S1. The other involves S2 and S3.

What do we know about the ITG for these nested loops?
Dependence relationships for Loop Nest 1

True dependences:
  o S1[i,j] T S1[i,j+1]
  o S1[i,j] T S1[i+1,j]
Output dependences:
  o None
Anti-dependences:
  o S1[i,j] A S1[i+1,j]
  o S1[i,j] A S1[i,j+1]

Exercise: Suppose we dropped off the first half of S1, so we had

  S1: a[i][j] = a[i-1][j] + a[i+1][j];

or the last half, so we had

  S1: a[i][j] = a[i][j-1] + a[i][j+1];

Which of the dependences would still exist?
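The dependence list for Loop Nest 1 can be cross-checked with a small program. This is a minimal sketch under stated assumptions: the statement writes a[i][j] and reads neighbors at offsets of at most one in each direction, the loop bounds are fixed at `NB`, and the names `dep_flags` and `classify_deps` are hypothetical. It walks the iteration space in program order. A read of an element that some earlier iteration has already written is a true dependence; a read of an element that a later iteration will write is an anti-dependence.

```c
#include <assert.h>

#define NB 4   /* loop bounds: i, j = 1..NB (assumed small for the demo) */

/* Flags indexed by distance vector: flag[di + 1][dj + 1], di, dj in -1..1. */
struct dep_flags {
    int true_dep[3][3];
    int anti_dep[3][3];
};

/* Classify the dependences of a statement that writes a[i][j] and reads
   a[i + reads[k][0]][j + reads[k][1]] inside
       for (i=1; i<=NB; i++) for (j=1; j<=NB; j++)
   by walking the iteration space in program order. */
void classify_deps(int reads[][2], int nreads, struct dep_flags *out)
{
    int written[NB + 2][NB + 2] = {{0}};   /* 1 once a[r][c] has been written */

    for (int p = 0; p < 3; p++)
        for (int q = 0; q < 3; q++)
            out->true_dep[p][q] = out->anti_dep[p][q] = 0;

    for (int i = 1; i <= NB; i++) {
        for (int j = 1; j <= NB; j++) {
            for (int k = 0; k < nreads; k++) {
                int di = reads[k][0], dj = reads[k][1];
                int r = i + di, c = j + dj;
                assert(di >= -1 && di <= 1 && dj >= -1 && dj <= 1);
                /* Skip the iteration's own element and boundary cells
                   that no iteration ever writes. */
                if ((di == 0 && dj == 0) || r < 1 || r > NB || c < 1 || c > NB)
                    continue;
                if (written[r][c]) {
                    /* Written earlier by iteration (r,c): true dependence
                       with distance (i - r, j - c) = (-di, -dj). */
                    out->true_dep[-di + 1][-dj + 1] = 1;
                } else {
                    /* Written later by iteration (r,c): anti-dependence
                       with distance (r - i, c - j) = (di, dj). */
                    out->anti_dep[di + 1][dj + 1] = 1;
                }
            }
            written[i][j] = 1;   /* the statement writes a[i][j] */
        }
    }
}
```

For the full S1, the read offsets are {0,-1}, {0,1}, {-1,0}, {1,0}. Passing only two of them lets you check your answer to the exercise.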
[Figure: iteration-space grid, points 1, 2, ..., n on each axis.]
Draw the LDG for Loop Nest 1.
[Figure: LDG on the i-j grid, points 1, 2, ..., n on each axis. Note: each edge represents both true and anti-dependences.]

Dependence relationships for Loop Nest 2

True dependences:
  o S2[i,j] T S3[i,j+1]
Output dependences:
  o None
Anti-dependences:
  o S2[i,j] A S3[i,j] (loop-independent dependence)

Draw the LDG for Loop Nest 2.
Why are there no vertical edges in this graph?
Why is the anti-dependence not shown on the graph?
Finding parallel tasks across iterations [3.2.2]

Analyze loop-carried dependences:
  - Dependences must be enforced (especially true dependences; other dependences can be removed by privatization).
  - There are opportunities for parallelism when some dependences are not present.

Example 1
LDG:
[Figure: LDG on the i-j grid, points 1, 2, ..., n on each axis. Note: each edge represents only true dependences.]

  for (i=2; i<=n; i++)
    S: a[i] = a[i-2];
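In this loop each iteration depends only on the iteration two steps earlier, so the even-numbered and odd-numbered iterations form two chains that never touch the same elements. The following minimal sketch checks that idea. The function names are made up for illustration, and the OpenMP `sections` pragmas are an assumption that the lecture does not make. Without OpenMP support the pragmas are ignored and the two tasks simply run one after the other, which gives the same result.

```c
#include <assert.h>
#include <string.h>

/* Original loop: iteration i depends on iteration i-2.
   The array a must hold n + 1 elements (indices 0..n). */
void copy_back2_sequential(int *a, int n)
{
    for (int i = 2; i <= n; i++)
        a[i] = a[i - 2];
}

/* Task 1: even iterations only (touches only even indices). */
void copy_back2_even(int *a, int n)
{
    for (int i = 2; i <= n; i += 2)
        a[i] = a[i - 2];
}

/* Task 2: odd iterations only (touches only odd indices). */
void copy_back2_odd(int *a, int n)
{
    for (int i = 3; i <= n; i += 2)
        a[i] = a[i - 2];
}

/* Run both tasks. With OpenMP enabled the two sections may execute on
   different threads; otherwise they run back to back. */
void copy_back2_two_tasks(int *a, int n)
{
    assert(n >= 0);
    #pragma omp parallel sections
    {
        #pragma omp section
        copy_back2_even(a, n);
        #pragma omp section
        copy_back2_odd(a, n);
    }
}
```

Running `copy_back2_sequential` and `copy_back2_two_tasks` on two copies of the same array and comparing them with `memcmp` is a quick way to confirm that the split preserves the result.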
We can divide the loop into two parallel tasks (one with odd iterations and another with even iterations):

  for (i=2; i<=n; i+=2)
    S: a[i] = a[i-2];

  for (i=3; i<=n; i+=2)
    S: a[i] = a[i-2];

Example 2

  for (i=0; i<n; i++)
    for (j=1; j<n; j++)
      S3: a[i][j] = a[i][j-1] + 1;

LDG

[Figure: LDG on the i-j grid, points 1, 2, ..., n on each axis.]

How many parallel tasks are there here?

Example 3

  for (i=1; i<=n; i++)
    for (j=1; j<=n; j++)
      S1: a[i][j] = a[i][j-1] + a[i][j+1] + a[i-1][j] + a[i+1][j];

LDG

[Figure: LDG on the i-j grid, points 1, 2, ..., n on each axis. Note: each edge represents both true and anti-dependences.]

Identify which nodes are not dependent on each other. In each anti-diagonal, the nodes are independent of each other.

We need to rewrite the code to iterate over anti-diagonals:

  Calculate number of anti-diagonals
  for each anti-diagonal do
    Calculate the number of points in the current anti-diagonal
    for each point in the current anti-diagonal do
      Compute the value of the current point in the matrix

Parallelize the loop over the points of an anti-diagonal (the for_all loop in the code that follows).

  for (i=1; i <= 2*n-1; i++) {   // 2n-1 anti-diagonals
    if (i <= n) {
      points = i;        // number of points in anti-diag
      row = i;           // first pt (row,col) in anti-diag
      col = 1;           // note that row+col = i+1 always
    }
    else {
      points = 2*n - i;
      row = n;
      col = i-n+1;       // note that row+col = i+1 always
    }
    for_all (k=1; k <= points; k++) {
      a[row][col] = ...  // update a[row][col]
      row--; col++;
    }
  }
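The anti-diagonal traversal can be written out in full and compared against the original row-by-row order. This is a minimal sketch under several assumptions: a small fixed size `N`, a one-cell halo around the array so that the neighbor reads stay in bounds, integer data so that the comparison is exact, and an OpenMP `parallel for` pragma standing in for `for_all` (it is ignored by compilers without OpenMP support). The position of each point is computed from `k` instead of using `row--; col++`, so that the iterations of the inner loop share no updated variables.

```c
#include <assert.h>
#include <string.h>

#define N 5                               /* interior points per axis (assumed) */
typedef long long grid_t[N + 2][N + 2];   /* rows/cols 0 and N+1 are a fixed halo */

/* Fill the grid with arbitrary but deterministic values (demo data only). */
void grid_init(grid_t a)
{
    for (int i = 0; i <= N + 1; i++)
        for (int j = 0; j <= N + 1; j++)
            a[i][j] = (3 * i + 5 * j) % 7;
}

/* Statement S1 applied to one point of the iteration space. */
static void update_point(grid_t a, int i, int j)
{
    a[i][j] = a[i][j - 1] + a[i][j + 1] + a[i - 1][j] + a[i + 1][j];
}

/* Reference: the original row-by-row traversal. */
void sweep_row_major(grid_t a)
{
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            update_point(a, i, j);
}

/* Anti-diagonal traversal: diagonals in order, points within one
   diagonal in any order (or in parallel). */
void sweep_antidiagonal(grid_t a)
{
    for (int d = 1; d <= 2 * N - 1; d++) {      /* 2N-1 anti-diagonals */
        int points, row, col;
        if (d <= N) { points = d;         row = d; col = 1; }
        else        { points = 2 * N - d; row = N; col = d - N + 1; }

        /* Points on one anti-diagonal are independent of each other. */
        #pragma omp parallel for
        for (int k = 0; k < points; k++)
            update_point(a, row - k, col + k);
    }
}
```

Initializing two grids with `grid_init`, sweeping one with each function, and comparing them with `memcmp` should show identical contents, since every point still sees updated values to its left and above and old values to its right and below.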
DOACROSS Parallelism [3.2.3]

Suppose we have this code:

  for (i=1; i<=N; i++) {
    S: a[i] = a[i-1] + b[i] * c[i];
  }

Can we execute anything in parallel?

Well, we can't run the iterations of the for loop in parallel, because S[i] T S[i+1]. (There is a loop-carried dependence.)

But notice that the b[i] * c[i] part has no loop-carried dependence.

This suggests breaking up the loop into two:

  for (i=1; i<=N; i++) {
    S1: temp[i] = b[i] * c[i];
  }
  for (i=1; i<=N; i++) {
    S2: a[i] = a[i-1] + temp[i];
  }

The first loop is parallelizable. The second is not.

Execution time: N × (T_S1 + T_S2)

What is a disadvantage of this approach?

Here's how to solve this problem:

  post(0);
  for (i=1; i<=N; i++) {
    S1: temp = b[i] * c[i];
    wait(i-1);
    S2: a[i] = a[i-1] + temp;
    post(i);
  }

What is the execution time now?

Parallelism across statements in a loop [3.2.4]

Identify dependences in a loop body. If there are independent statements, we can split/distribute the loops.

Example:

  for (i=1; i<n; i++) {
    S1: a[i] = b[i+1] * a[i-1];
    S2: b[i] = b[i] * coef;
    S3: c[i] = 0.5 * (c[i] + a[i]);
    S4: d[i] = d[i-1] * d[i];
  }

Loop-carried dependences:

Loop-indep. dependences:

Note that S4 has no dependences with other statements.

S1[i] A S2[i+1] implies that S2 at iteration i+1 must be executed after S1 at iteration i. Hence, the dependence is not violated if all S2s are executed after all S1s.

After loop distribution:

  for (i=1; i<n; i++) {
    S1: a[i] = b[i+1] * a[i-1];
    S2: b[i] = b[i] * coef;
    S3: c[i] = 0.5 * (c[i] + a[i]);
  }

  for (i=1; i<n; i++) {
    S4: d[i] = d[i-1] * d[i];
  }

Each loop is a parallel task. This is called function parallelism, and can be distinguished from data parallelism, which we saw in DOALL and DOACROSS. Further transformations can be performed (see p. 44 of text).

Characteristics of function parallelism:

Can use function parallelism along with data parallelism when data parallelism is limited.

DOPIPE Parallelism [3.2.5]

Another strategy for loop-carried dependences is pipelining the statements in the loop.
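Before looking at pipelining in detail, the loop distribution of [3.2.4] can be sanity-checked with a short sketch. It runs the original four-statement loop and the distributed version on identical data so the results can be compared. Several things here are assumptions for illustration: the function names, the loop starting at i = 1 so that a[i-1] and d[i-1] stay in bounds, arrays of n + 1 elements, and the OpenMP `sections` pragmas that express "each loop is a parallel task" (they are ignored without OpenMP support).

```c
#include <assert.h>
#include <string.h>

/* Original loop with statements S1..S4.
   Each array must hold n + 1 elements (indices 0..n). */
void run_original(double *a, double *b, double *c, double *d,
                  double coef, int n)
{
    for (int i = 1; i < n; i++) {
        a[i] = b[i + 1] * a[i - 1];     /* S1 */
        b[i] = b[i] * coef;             /* S2 */
        c[i] = 0.5 * (c[i] + a[i]);     /* S3 */
        d[i] = d[i - 1] * d[i];         /* S4 */
    }
}

/* After loop distribution: S1..S3 in one loop, S4 in another.
   The two loops touch disjoint arrays, so they can run as two tasks. */
void run_distributed(double *a, double *b, double *c, double *d,
                     double coef, int n)
{
    assert(n >= 1);
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            for (int i = 1; i < n; i++) {
                a[i] = b[i + 1] * a[i - 1];     /* S1 */
                b[i] = b[i] * coef;             /* S2 */
                c[i] = 0.5 * (c[i] + a[i]);     /* S3 */
            }
        }
        #pragma omp section
        {
            for (int i = 1; i < n; i++)
                d[i] = d[i - 1] * d[i];         /* S4 */
        }
    }
}
```

Because each array element is computed by the same operations in the same order in both versions, the two functions should leave the arrays bit-for-bit identical.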
Consider this situation:

Loop-carried dependences:
Loop-indep. dependences:
To parallelize, we just need to make sure the two statements are executed in sync:
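A minimal sketch of such synchronized execution follows. It uses a hypothetical two-statement loop (an assumption, not necessarily the loop from the lecture) in which S1 carries a dependence on itself across iterations and S2 needs the value S1 produced in the same iteration. Instead of real threads, it simulates two pipeline stages in lock step on a single thread: stage 1 "posts" each index it finishes, and stage 2 runs S2[i] only once index i was posted in an earlier time step. All names are made up for illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical loop (not taken from the notes):
       for (i = 1; i <= n; i++) {
           S1: a[i] = a[i-1] + b[i];   // S1[i] T S1[i+1] (loop-carried)
           S2: c[i] = c[i] + a[i];     // S1[i] T S2[i]   (loop-independent)
       }
   Arrays must hold n + 1 elements (indices 0..n). */
void run_loop_sequential(int *a, const int *b, int *c, int n)
{
    for (int i = 1; i <= n; i++) {
        a[i] = a[i - 1] + b[i];
        c[i] = c[i] + a[i];
    }
}

/* Lock-step simulation of a two-stage pipeline. Stage 1 executes all S1
   instances and posts each finished index; stage 2 executes S2[i] only
   after index i was posted in an earlier step. Returns the number of
   time steps used. */
int run_loop_dopipe_sim(int *a, const int *b, int *c, int n)
{
    int posted = 0;    /* highest index for which S1 has finished */
    int next_s2 = 1;   /* next index stage 2 wants to execute */
    int steps = 0;

    assert(n >= 0);
    while (next_s2 <= n) {
        int visible = posted;          /* posts made in earlier steps */
        if (posted < n) {              /* stage 1: S1[posted + 1], then post */
            int i = posted + 1;
            a[i] = a[i - 1] + b[i];
            posted = i;
        }
        if (next_s2 <= visible) {      /* stage 2: its wait is satisfied */
            c[next_s2] = c[next_s2] + a[next_s2];
            next_s2++;
        }
        steps++;
    }
    return steps;
}
```

In this model stage 2 trails stage 1 by one step, so the simulated pipeline needs n + 1 steps, compared with 2n statement executions when the loop runs on one processor. A real implementation would place the two stages on separate threads and use actual post and wait primitives.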
Question: What's the difference between DOACROSS and DOPIPE?