
Introduction to Parallel Programming
Debugging and Profiling
Introduction
● Debugging
– Many techniques same as for serial code
– Blocking can be hard to trace
– Sometimes serial code (or p=1) works, p>1 does not
● Profiling
– Knowing where your code is spending its time is
important
– Amdahl's Law
– Weak and Strong Scaling
General Debugging Approach
● Always get serial code running first!
– Much easier to debug problems
● Debugging tools cannot fix or detect flawed
algorithms
● Build your code in pieces, check each part
– e.g. if you need a numerical integrator, write a function for it and check it by itself with known data (see the sketch below)
● Break code into separate files, avoid copying
code segments
– allows code reuse, only need to fix bugs once
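For instance, a minimal stand-alone test of a numerical integrator might look like the following. This is a sketch, not from the original slides: the trapezoid rule and the function names are illustrative.

#include <stdio.h>
#include <math.h>

/* Trapezoid-rule integrator, tested on its own against a known answer. */
double trapezoid(double (*f)(double), double a, double b, int n) {
  double h = (b - a) / n;
  double sum = 0.5 * (f(a) + f(b));
  int i;
  for (i = 1; i < n; i++)
    sum += f(a + i * h);
  return h * sum;
}

double square(double x) { return x * x; }

int main(void) {
  /* integral of x^2 from 0 to 1 is exactly 1/3 */
  double result = trapezoid(square, 0.0, 1.0, 1000);
  printf("result = %.8f  exact = %.8f  error = %.2e\n",
         result, 1.0/3.0, fabs(result - 1.0/3.0));
  return 0;
}

If this check fails, the bug is in the integrator itself, not in the larger program that will later call it.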
Debugging Tools
● printf in C, write in Fortran!
● gdb: GNU debugger for use with gcc
– Shows line where program crashes
– Allows inspection of variables and call stack
– other compilers have similar debuggers
● valgrind: Heavyweight memory checker
– With dynamically allocated arrays, it's possible to try to read from or write to invalid memory (e.g. already freed, index too large, etc.) without any crash resulting
– Checks that all memory accesses in your code are safe
printf (C) and write (Fortran)
● Printing out variable values is the primary debugging tool (see the example below)
– especially useful when program does not crash but
results are wrong
● Easy to redirect output to a file for analysis
– myprog > myprog.log

● Can generate a lot of output


● Remember to remove them afterwards – they can
significantly slow your code down
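As a concrete illustration (a hypothetical program, not from the original slides): the result is wrong but nothing crashes, and printing intermediate values narrows the bug down to the final division.

#include <stdio.h>

int main(void) {
  int i, n = 6, sum = 0;
  double avg;
  for (i = 1; i <= n; i++) {
    sum += i;
    printf("DEBUG: i=%d sum=%d\n", i, sum);  /* remove once the bug is found */
    fflush(stdout);       /* make sure output appears even if we crash later */
  }
  avg = sum / n;          /* BUG: integer division; should be (double)sum / n */
  printf("average = %g (expected 3.5)\n", avg);
  return 0;
}

Redirect the output as above (./myprog > myprog.log) and search the log for the DEBUG lines; here they show the loop is correct, so the error must be in the division.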
gdb
● Use gdb when program segfaults
● Compilation
– Compile program with debugging symbols

>gcc -g crash.c -o crash

>f77 -g crash.f -o crash
Using gdb
● Running
– Start program in gdb
– No program options on command line!
– Program options are given after “run” command
>gdb ./crash
GNU gdb 6.1-debian
Copyright 2004 Free Software Foundation, Inc.
.... [some lines removed] ....
This GDB was configured as "i386-linux"...
Using host libthread_db library "/lib/libthread_db.so.1".

(gdb)run
Starting program: /path/to/file/crash

Program received signal SIGSEGV, Segmentation fault.
0x08048511 in main (argc=1, argv=0xbffff964) at crash.c:24
24            d[i][j].a = N*j+i;
(gdb)
 
Using gdb: Variables
● To examine a simple variable or expression:
(gdb) print i

● Example

Program received signal SIGSEGV, Segmentation fault.
0x08048511 in main (argc=1, argv=0xbffff964) at crash.c:24
24            d[i][j].a = N*j+i;
(gdb) print i
$8 = 4
(gdb) print j
$9 = 0
(gdb) print N*j + i
$10 = 4
(gdb) print $8
$11 = 4
(gdb)
Using gdb: Structures
● To examine a structure:
(gdb) print d

● Example

Program received signal SIGSEGV, Segmentation fault.
0x08048511 in main (argc=1, argv=0xbffff964) at crash.c:24
24            d[i][j].a = N*j+i;
(gdb) print d[1][2]
$16 = {a = 9, b = {3, 6561}}
(gdb) print d[1][2].b[1]
$17 = 6561
(gdb) print $16.a
$18 = 9
(gdb)

The data structure:
typedef struct {
  int a;
  float b[2];
} DataStruct;
Using gdb: Pointers
● To examine a pointer and the value pointed to:
(gdb) print p
(gdb) print *p

● Example

Program received signal SIGSEGV, Segmentation fault.
0x08048511 in main (argc=1, argv=0xbffff964) at crash.c:24
24            d[i][j].a = N*j+i;
(gdb) print d
$19 = (DataStruct **) 0x80498f0
(gdb) print *d
$20 = (DataStruct *) 0x8049908
(gdb) print **d
$21 = {a = 0, b = {0, 0}}
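For reference, a minimal sketch of what a program like crash.c might look like. This is hypothetical, reconstructed only to be consistent with the gdb sessions above (d is a DataStruct ** and the crash happens at i=4, j=0); the real crash.c may differ.

#include <stdio.h>
#include <stdlib.h>

#define N 10

typedef struct {
  int a;
  float b[2];
} DataStruct;

int main(int argc, char *argv[]) {
  DataStruct **d;
  int i, j;

  /* allocate N row pointers (zero-initialized)... */
  d = (DataStruct **)calloc(N, sizeof(DataStruct *));
  /* ...but only allocate the first 4 rows, so d[4] is NULL */
  for (i = 0; i < 4; i++)
    d[i] = (DataStruct *)malloc(N * sizeof(DataStruct));

  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      d[i][j].a = N*j + i;   /* dereferences the NULL pointer when i == 4 */

  printf("done\n");
  return 0;
}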
Using gdb: Backtrace
● To find out the function calls to get where you
are, use backtrace
(gdb) backtrace

● Example

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 16384 (LWP 7137)]
0x08048464 in functionC (c=228) at backtrace.c:11

11        found[d] = 1;
(gdb) backtrace
#0  0x08048464 in functionC (c=228) at backtrace.c:11
#1  0x08048483 in functionB (b=114) at backtrace.c:18
#2  0x0804849e in functionA (a=57) at backtrace.c:26
#3  0x08048500 in main (argc=1, argv=0xbffff914) at backtrace.c:38
(gdb)
Using gdb: Moving Around
● You can move up and down the call stack to
examine variables there
(gdb) up
(gdb) down
● Example
(gdb) backtrace
#0  0x08048464 in functionC (c=228) at backtrace.c:11
#1  0x08048483 in functionB (b=114) at backtrace.c:18
#2  0x0804849e in functionA (a=57) at backtrace.c:26
#3  0x08048500 in main (argc=1, argv=0xbffff914) at backtrace.c:38
(gdb) up
#1  0x08048483 in functionB (b=114) at backtrace.c:18
18        return(c);
(gdb) print b
$1 = 114
(gdb) down
#0  0x08048464 in functionC (c=228) at backtrace.c:11
11        found[d] = 1;
(gdb) print d
$2 = 25
(gdb)
Breakpoints
● Breakpoints allow you to stop your program at a
certain point and examine its data
● Breakpoints can be set in many ways, including by line number or by function name:
  Line number:                 Function name:
  (gdb) break 21               (gdb) break integrate
  (gdb) break filename:21      (gdb) break filename:integrate
● Breakpoints will stop the program every time they are reached
– use “tbreak” instead of “break” to stop only once
– use “clear” to delete a breakpoint
● Use “continue” to continue after breakpoint
Breakpoint Example

>gdb ./backtrace

(gdb) break functionB
Breakpoint 1 at 0x8048476: file backtrace.c, line 17.
(gdb) run
Starting program: /path/to/file/backtrace
Starting loop

Breakpoint 1, functionB (b=2) at backtrace.c:17
17        c = functionC(b*2);
(gdb) clear functionB
Deleted breakpoint 1
(gdb) break 10
Breakpoint 2 at 0x804842a: file backtrace.c, line 10.
(gdb) continue
Continuing.

Breakpoint 2, functionC (c=4) at backtrace.c:10
10        d = sqrt(c)+10;
(gdb)
MPI Debugging
● Program is running simultaneously on multiple
computers
– Cannot just use gdb since it is interactive
● Easiest solution is printf/write
– Would like all processes to output at the same time
– Need to synchronize output
● Need MPI_Barrier function
– Will wait until all processes reach this point
MPI_Barrier(MPI_Comm comm)
MPI_Barrier Example

  if (myRank == 0){
    N = (int)(2*atof(argv[1]));
  }
  MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);

  partialSum = 0.0;
  for (i=N-myRank;i>=0;i-=nProc){
    partialSum += pow(-1.0, i)/(2*i + 1);
  }

  printf("Partial Sum from %d is %e\n", myRank, partialSum);
  fflush(stdout);
  MPI_Barrier(MPI_COMM_WORLD);

  MPI_Reduce(&partialSum, &totalSum, 1, MPI_DOUBLE,
             MPI_SUM, 0, MPI_COMM_WORLD);

  if (myRank == 0){
    printf("pi = %e\n", totalSum * 4);
  }
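A related trick, shown here as a sketch reusing the myRank/nProc/partialSum names from the example above (the per-rank "work" is just a stand-in): loop over the ranks and let one process print per iteration, with a barrier in between, so the output normally comes out in rank order. Note that MPI does not strictly guarantee the ordering of forwarded stdout, but in practice this keeps the log readable.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
  int myRank, nProc, r;
  double partialSum;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
  MPI_Comm_size(MPI_COMM_WORLD, &nProc);

  partialSum = 1.0 / (myRank + 1);    /* stand-in for real work */

  for (r = 0; r < nProc; r++) {
    if (r == myRank) {
      printf("Partial Sum from %d is %e\n", myRank, partialSum);
      fflush(stdout);
    }
    MPI_Barrier(MPI_COMM_WORLD);  /* everyone waits before the next rank prints */
  }

  MPI_Finalize();
  return 0;
}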
Using gdb in Parallel with MPICH
● Debug using multiple processes on one machine
● Start p-1 copies of your program in gdb
● Turn off stopping on SIGUSR1 signal

>gdb myprog 
GNU gdb 6.1-debian
Copyright 2004 Free Software Foundation, Inc.
... [some lines deleted] ...
Using host libthread_db library "/lib/libthread_db.so.1".

(gdb) handle SIGUSR1 noprint
Signal        Stop      Print   Pass to program Description
SIGUSR1       No        No      Yes             User defined signal 1
(gdb)
Using gdb in Parallel with MPICH
● Start process 0 using mpirun with -dbg option
● Give -p4norem option to your program
● Turn off stopping on SIGUSR1 signal
>mpirun -np 4 -dbg=gdb myprog -p4norem
GNU gdb 6.1-debian
Copyright 2004 Free Software Foundation, Inc.
... [some lines deleted] ...
Breakpoint 1, 0x0804c32c in PMPI_Init ()
(gdb) handle SIGUSR1 noprint
Signal        Stop      Print   Pass to program Description
SIGUSR1       No        No      Yes             User defined signal 1
(gdb)continue
Continuing.
waiting for process on host keitelxx:
/path/to/prog/calc_pi3 keitelxx 47895 -p4amslave

● In each of the other debugger sessions, start the program with the options shown above:
(gdb) run keitelxx 47895 -p4amslave
gdb and LAM-MPI
● Much easier than MPICH
– a consequence of how LAM launches processes
● Technique works on local computer only
● Notes:
– No arguments to program. Give them in debugger.
– No other special arguments need to be given

mpirun -np 4 xterm -e gdb ./calc_pi3
Aside: Output in Separate Terminals
● Can use similar technique to view output in
separate terminals
– Don't forget the “read”, otherwise terminals will close
immediately
– Only works for LAM/MPI

mpirun -np 4 xterm -e "./calc_pi3 1e5; read"
Sample Screen Shot
[Screenshot omitted: program output appearing in separate xterm windows]
Memory Checking
● Tools to check memory access
– More important in C with dynamic memory
allocation/freeing
● Tools include
– Valgrind: no recompilation necessary
– ElectricFence: recompile with -lefence or use
LD_PRELOAD
● Note: both tools slow your code down significantly! Do not use them in production runs!
backtrace.c
int functionC(int c){
  ....
  found[d] = 1;
  return(d);
}

int functionB(int b){
  ....
}

int functionA(int a){
  ....
}

int main(int argc, char *argv[]){

  found = (int *)calloc(25,sizeof(int));
  printf("Starting loop\n");
  for (i=1;i<100;i++){
    functionA(i);
  }
  printf("Finished\n");

}

Output:
>./backtrace
Starting loop
Finished
>

Everything looks good.... But is it really?
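For completeness, here is a hypothetical reconstruction of the elided function bodies, inferred from the gdb and valgrind output elsewhere in these slides (line numbers will not match the originals exactly).

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int *found;

int functionC(int c){
  int d = sqrt(c) + 10;        /* truncates to int: 25 when c = 228 */
  found[d] = 1;                /* out of bounds: found has elements 0..24 */
  return(d);
}

int functionB(int b){
  int c = functionC(b*2);
  return(c);
}

int functionA(int a){
  int b = functionB(a*2);
  return(b);
}

int main(int argc, char *argv[]){
  int i;
  found = (int *)calloc(25, sizeof(int));
  printf("Starting loop\n");
  for (i=1;i<100;i++){
    functionA(i);    /* invalid writes begin at i = 57 (d reaches 25) */
  }
  printf("Finished\n");
  return 0;
}

With this reconstruction, i = 57..99 gives 43 out-of-bounds writes, matching the valgrind error count shown on the next slide.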
valgrind
>valgrind ./backtrace
==7833== Memcheck, a memory error detector for x86-linux.
==7833== Copyright (C) 2002-2004, and GNU GPL'd, by Julian Seward et al.
==7833== Using valgrind-2.2.0, a program supervision framework for x86-linux.
==7833== Copyright (C) 2000-2004, and GNU GPL'd, by Julian Seward et al.
==7833== For more details, rerun with: -v
==7833==
Starting loop
==7833== Invalid write of size 4
==7833==    at 0x8048464: functionC (backtrace.c:11)
==7833==    by 0x8048482: functionB (backtrace.c:17)
==7833==    by 0x804849D: functionA (backtrace.c:23)
==7833==    by 0x80484F4: main (backtrace.c:34)
==7833==  Address 0x1BA7408C is 0 bytes after a block of size 100 alloc'd
==7833==    at 0x1B905901: calloc (vg_replace_malloc.c:176)
==7833==    by 0x80484C9: main (backtrace.c:31)
Finished
==7833==
==7833== ERROR SUMMARY: 43 errors from 1 contexts (suppressed: 13 from 1)
==7833== malloc/free: in use at exit: 100 bytes in 1 blocks.
==7833== malloc/free: 1 allocs, 0 frees, 100 bytes allocated.
==7833== For a detailed leak analysis, rerun with: --leak-check=yes
==7833== For counts of detected errors, rerun with: -v
Electric Fence
● Will cause the program to SEGFAULT on invalid reads or writes
● Use gdb to track down locations
>gdb ./backtrace
(gdb) set environment LD_PRELOAD libefence.so.0.0
(gdb) run
Starting program: /path/to/code/backtrace

  Electric Fence 2.1 Copyright (C) 1987-1998 Bruce Perens.
Starting loop

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 16384 (LWP 7848)]
0x08048464 in functionC (c=228) at backtrace.c:11
11        found[d] = 1;
(gdb) bt
#0  0x08048464 in functionC (c=228) at backtrace.c:11
#1  0x08048483 in functionB (b=114) at backtrace.c:17
#2  0x0804849e in functionA (a=57) at backtrace.c:23
#3  0x080484f5 in main (argc=1, argv=0xbffff904) at backtrace.c:34
(gdb)
Profiling
● 90% of the runtime is spent in 10% of the code*
– Focus optimization on this part
– No point optimizing code which does not affect overall
performance
● The challenge is to identify this part of the code
● Frequently, it is not where you expect

“Premature optimization is the root of all evil.”
– Professor Sir Charles Antony Richard Hoare

● (But keep it in mind)
*75% of statistics are made up on the spot
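The “focus on the hot spots” argument can be made quantitative with Amdahl's Law (not stated on the original slide, added here for context): if a fraction f of the runtime benefits from a speedup of s (whether from optimization or parallelization), the overall speedup is

  S = 1 / ((1 - f) + f/s),  which can never exceed  1 / (1 - f)

For example, if only half of the runtime is optimized (f = 0.5), the total speedup is at most 2, no matter how fast that half becomes.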
Profiling Tool: gprof
● Get serial code working and giving correct results
first
● To use gprof,
– compile with “-pg” flag
– do not use optimization flags
– run code as normal
– a file named “gmon.out” will have been created

>gcc -pg myprog.c -o myprog
>./myprog
Examining gprof Output
● To examine profiling information, use gprof
program
– generates a lot of output so pipe it through less or
redirect to a file
>gprof ./myprog | less
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 48.53     51.36    51.36 69881880     0.00     0.00  local_calculate_dpsi
 12.04     64.11    12.75 139763760    0.00     0.00  neighbour_Psis
  6.63     71.13     7.02 52411410     0.00     0.00  local_acc_dpsi_set_source
  6.25     77.75     6.62  1747047     0.00     0.00  adjustEdgeDistances
  4.41     82.42     4.67       10     0.47     7.98  calculate_dpsi
  3.46     86.08     3.67 205201523    0.00     0.00  node_comparison
  3.29     89.57     3.49 17470470     0.00     0.00  grow_nodes
  3.29     93.05     3.49  1747047     0.00     0.00  calcEdgeDistance
  2.17     95.35     2.30 17470470     0.00     0.00  local_advance_psi
  2.04     97.51     2.16 17470470     0.00     0.00  local_accumulate_dpsi
.... [continued] ....
Flat Profile
Annotations: “% time” is the fraction of the total runtime spent in the function; “self seconds” is the total time spent in this function itself.

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 48.53     51.36    51.36 69881880     0.00     0.00  local_calculate_dpsi
 12.04     64.11    12.75 139763760    0.00     0.00  neighbour_Psis
  6.63     71.13     7.02 52411410     0.00     0.00  local_acc_dpsi_set_source
  6.25     77.75     6.62  1747047     0.00     0.00  adjustEdgeDistances
  4.41     82.42     4.67       10     0.47     7.98  calculate_dpsi
  3.46     86.08     3.67 205201523    0.00     0.00  node_comparison
  3.29     89.57     3.49 17470470     0.00     0.00  grow_nodes
  3.29     93.05     3.49  1747047     0.00     0.00  calcEdgeDistance
  2.17     95.35     2.30 17470470     0.00     0.00  local_advance_psi
  2.04     97.51     2.16 17470470     0.00     0.00  local_accumulate_dpsi
.... [continued] ....
Flat Profile
Annotations: “calls” is the number of times the function is called; “self s/call” is the average time per call spent doing the function's own work; “total s/call” is the average time per call to do its own work plus that of its children.

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 48.53     51.36    51.36 69881880     0.00     0.00  local_calculate_dpsi
 12.04     64.11    12.75 139763760    0.00     0.00  neighbour_Psis
  6.63     71.13     7.02 52411410     0.00     0.00  local_acc_dpsi_set_source
  6.25     77.75     6.62  1747047     0.00     0.00  adjustEdgeDistances
  4.41     82.42     4.67       10     0.47     7.98  calculate_dpsi
  3.46     86.08     3.67 205201523    0.00     0.00  node_comparison
  3.29     89.57     3.49 17470470     0.00     0.00  grow_nodes
  3.29     93.05     3.49  1747047     0.00     0.00  calcEdgeDistance
  2.17     95.35     2.30 17470470     0.00     0.00  local_advance_psi
  2.04     97.51     2.16 17470470     0.00     0.00  local_accumulate_dpsi
.... [continued] ....
Call Graph
● Flat Profile shows overall summary and total
number of function calls
● But, can we find where the functions are being
called from?
– Of course!
● gprof output also contains “call graph” details
Call Graph
Each numbered block summarizes one function: the line carrying the index (e.g. [1]) is the function itself, the lines above it are the parent functions that call it, and the lines below it are the child functions it calls.

                     Call graph (explanation follows)
index % time    self  children    called     name
                                                 <spontaneous>
[1]     99.9    0.08  105.63                 main [1]
                4.67   75.16      10/10          calculate_dpsi [2]
                0.15   16.87       1/1           restore_state [4]
                0.72    5.03      10/10          adjust_nodes [9]
                0.72    2.30      10/10          advance_psi [14]
                0.00    0.00       1/1           initialize_wavespace [34]
                0.00    0.00       1/1           destroy_wavespace [33]
-----------------------------------------------
                4.67   75.16      10/10          main [1]
[2]     75.4    4.67   75.16      10         calculate_dpsi [2]
               51.36   12.75 69881880/69881880   local_calculate_dpsi [3]
                7.02    0.00 52411410/52411410   local_acc_dpsi_set_source [7]
                2.16    0.00 17470470/17470470   local_accumulate_dpsi [16]
                1.81    0.00 17470470/17470470   local_set_source [17]
                0.06    0.00 17470470/17470470   local_clear_dpsi [22]
                0.01    0.00     100/141         node_index_minimum [26]
-----------------------------------------------
.... [continued] ....

Call Graph
The “self” column is the time spent in this function itself; the “children” column is the time spent in each child function called from it.

                     Call graph (explanation follows)
index % time    self  children    called     name
                                                 <spontaneous>
[1]     99.9    0.08  105.63                 main [1]
                4.67   75.16      10/10          calculate_dpsi [2]
                0.15   16.87       1/1           restore_state [4]
                0.72    5.03      10/10          adjust_nodes [9]
                0.72    2.30      10/10          advance_psi [14]
                0.00    0.00       1/1           initialize_wavespace [34]
                0.00    0.00       1/1           destroy_wavespace [33]
-----------------------------------------------
                4.67   75.16      10/10          main [1]
[2]     75.4    4.67   75.16      10         calculate_dpsi [2]
               51.36   12.75 69881880/69881880   local_calculate_dpsi [3]
                7.02    0.00 52411410/52411410   local_acc_dpsi_set_source [7]
                2.16    0.00 17470470/17470470   local_accumulate_dpsi [16]
                1.81    0.00 17470470/17470470   local_set_source [17]
                0.06    0.00 17470470/17470470   local_clear_dpsi [22]
                0.01    0.00     100/141         node_index_minimum [26]
-----------------------------------------------
.... [continued] ....
Call Graph
● For example, main takes 0.08 seconds itself and 105.63 seconds calling
other functions
● Of that, 4.67 + 75.16 = 79.83 are spend in calculate_dpsi
● In calculate_dpsi, 51.36 s are spent in local_calculate_dpsi and 12.75 s
are spend in functions called from local_calculate_dpsi
                     Call graph (explanation follows)
index % time    self  children    called     name
                                                 <spontaneous>
[1]     99.9    0.08  105.63                 main [1]
                4.67   75.16      10/10          calculate_dpsi [2]
                0.15   16.87       1/1           restore_state [4]
                0.72    5.03      10/10          adjust_nodes [9]
                0.72    2.30      10/10          advance_psi [14]
                0.00    0.00       1/1           initialize_wavespace [34]
                0.00    0.00       1/1           destroy_wavespace [33]
-----------------------------------------------
                4.67   75.16      10/10          main [1]
[2]     75.4    4.67   75.16      10         calculate_dpsi [2]
               51.36   12.75 69881880/69881880   local_calculate_dpsi [3]
                7.02    0.00 52411410/52411410   local_acc_dpsi_set_source [7]
                2.16    0.00 17470470/17470470   local_accumulate_dpsi [16]
                1.81    0.00 17470470/17470470   local_set_source [17]
                0.06    0.00 17470470/17470470   local_clear_dpsi [22]
                0.01    0.00     100/141         node_index_minimum [26]
-----------------------------------------------
.... [continued] ....
Next Time
● Some of the topics for next time:
– MPI Topologies
– MPI Data Structures
– More Collective Communications
– All about Communicators
– Non-blocking communications
– Any topics you would like to hear about?
● Remember: No talks next week. We will resume
the week after (Feb 23)
