Performance Measurement and Management
Diomidis Spinellis
Department of Management Science and Technology
Athens University of Economics and Business
Athens, Greece
dds@aueb.gr
Program States
-  Directly executing code (user time)
-  The kernel executes code on behalf of the program (system time)
-  Waiting for an external operation to complete (idle time)
Also
-  Real or clock time 
-  CPU time 
Measuring Workload
-  For terminating processes use the time command
$ time make
[...]
real    0m9.380s
user    0m2.580s
sys     0m6.640s
 
-  For servers use top
 22:14:03 up  5:22,  2 users,  load average: 0.00, 0.01, 0.00
39 processes: 38 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:   0.1% user,   0.4% system,   0.0% nice,  99.5% idle
 
Workload and Tools
| Profile | r >>u+s | s >u | u ≈ r | 
|---|
| Characterization | I/O-bound | kernel-bound | CPU-bound | 
|---|
| Tools | Disk, net, VM stats, packet dumps | System call tracing | Function profiling, basic block counting | 
|---|
System Monitoring Tools
- I/O
- iostat (under *BSD), vmstat (Linux)
- Memory
- vmstat, free (Linux)
- Network
- netstat
Example: virtual memory statistics while executing the find command
$ vmstat 1
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 1  0  0      0  52296   3476  50092   0   0   340     0  189   190   0   6  94
 0  1  0      0  51736   3588  50428   0   0   448     0  214   243   0   5  95
 1  0  0      0  51072   3716  50724   0   0   424     0  210   218   0   5  95
 0  1  0      0  50596   3848  50968   0   0   376     0  196   204   0   4  96
 1  0  0      0  49952   3976  51304   0   0   464     0  220   247   0   6  94
 0  1  0      0  49556   4068  51512   0   0   300     0  177   172   2   2  96
 1  0  0      0  49024   4132  51816   0   0   368     0  196   195   0   7  93
 1  0  0      0  48228   4200  52416   0   0   668     0  270   307   0   9  91
 1  0  0      0  47668   4352  52712   0   0   448     0  215   227   0   5  95
 1  0  0      0  47048   4412  53092   0   0   440     0  213   244   1   5  94
 0  1  0      0  44912   4668  54080   0   0   432     0  211   225   0   4  96
Profiling System Calls
-  The strace -c flag counts calls and reports time
-  The results typically follow a Pareto distribution
Example:
$ strace -c find . -name test
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 66.06    0.367503         584       629           getdents64
 26.31    0.146389          80      1826           lstat64
  2.27    0.012621          40       313           fcntl64
  1.83    0.010168          31       326        10 open
  1.78    0.009894          16       625           chdir
  1.23    0.006846          22       316           fstat64
[...]
Function Profiling
-  Statistical (prof)
-  Graph-based (gprof, Java)
-  Pinpoint hotspots
Profiling C/C++ Code with Gprof
-  Compile with gcc -pg
-  Execute program; a file gmon.out will be produced
-  The gprof program produces the report 
A Flat Profile
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
  6.35      0.33     0.33                             _Unwind_SjLj_Register
  5.77      0.63     0.30 10192020     0.00     0.00  __gnu_norm::__deque_buf_size(unsigned int)
  1.54      1.53     0.08       10     8.00   118.33  parse_parse()
  1.35      1.60     0.07  5096343     0.00     0.00  operator==(Tokid, Tokid)
A Call Graph Profile
index % time    self  children    called     name
                                  10             Pdtoken::process_pragma() <cycle 1> [360]
[4]     22.8    0.08    1.10      10         parse_parse() <cycle 1> [4]
                0.00    0.42    8348/17593       Token::unify(Token const&, Token const&) [5]
                0.00    0.26    4328/4328        Type::declare() [28]
                0.00    0.22    3511/3511        completed_typedef(Type) [33]
                0.00    0.03    3529/3540        Block::enter() [220]
                0.00    0.02    7402/27543       obj_lookup(std::string const&) [94]
                0.00    0.02  812361/834736      YYSTYPE::operator=(YYSTYPE const&) [324]
                0.01    0.01   85359/295237      Type::~Type() <cycle 3> [429]
                0.00    0.02    1690/1690        Call::register_call(Token const&, Id const*) [331]
                0.00    0.02   11362/11362       Type::set_abstract(Type) <cycle 4> [581]
Profiling Java Code with EJP
-  Add the tracer library to your library path
-  Run program under the Java VM with -Xruntracer
-  View the results with the EJP Presenter
A Java Program Profile
 
Basic Block Counting in C/C++ Code
To investigate algorithms count the number of times each line is executed.
-  Compile with gcc -pg -g -fprofile-arcs -ftest-coverage
-  Execute program; a file name.bb will be created 
-  Run gcov -l name.bb 
-  A .gcov file for each source file will be created 
Example:
           main(int argc, char *argv[])
           {
      1            int i, j, k;
      1            int a = 0;
     11            for (i = 0; i < 10; i++)
    110                    for (j = 0; j < 10; j++)
   1100                            for (k = 0; k < 10; k++)
   1000                                    a += i + j + k;
      1    }
Architectural Inefficiencies
-  Cache misses
-  Branch mispredicts
-  Front end stalls (failure to deliver enough instructions)
-  Cache pollution through address aliasing (set associativity issues)
-  Long latency instructions
-  TLB misses
Architectural Profiling Tools
- Processor event monitoring
- Oprofile (http://oprofile.sourceforge.net/) (Linux)
- Locality optimizations
-  SLO: Suggestions for Locality Optimizations (http://slo.sourceforge.net/)
Events that Oprofile can analyze on a Core 2 Intel CPU.
- Clock cycles when not halted
- number of instructions retired
- number of L2 cache requests
- L2 cache demand requests from this core
- Page table walk events
- cycles divider is busy
- L1 cacheable data read operations
- Bus cycles when data is sent on the bus
- number of instruction fetch misses
- Branch predicted taken with bubble 1
- [100 more]
Memory Performance
Memory as important as CPU cycles
-  Difference in latency between memory types
-  Large data sets 
-  Embedded applications 
-  Bandwidth requirements 
Again, we need to profileMemory Profiling C/C++ Code with Valgrind
-  Create a debug build (-g)
-  Run the program under valgrind
-  Examine valgrind's output
-  Can also detect memory leaks
Example: sed under valgrind
echo ':abcdefgh: : :' |
valgrind  --tool=massif ./sed -f TEST/hanoi.sed
