1. Profiling, Tracing and Linux Perf

a) What are the conceptual differences between profiling and tracing? What are the respective advantages and disadvantages of these techniques?

Useful Resource on Profiling vs. Tracing

While researching, I came across this helpful article:
Profiling vs. Tracing by J. Whitham

  • Profiling

    • Similar to debugging with backtracing.
    • Running code is interrupted at fixed time intervals (e.g., every millisecond). At each interruption, a sample (‘snapshot’) is taken by visiting each running thread and examining its stack to discover which functions are currently executing.
    • Each sample is aggregated into a report or graph (e.g., flame graphs).
    • Profiling cannot be used to construct a trace.
  • Tracing

    • Does not operate by sampling.
    • A trace is a log of events within the program during runtime.
    • The log is customizable and may report function calls, returns, and execution of other statements.
    • May require the program to be instrumented, i.e., modified to include log events. These might need to be added to the source code as a pre-compilation step (‘instrumented code’) or might be added dynamically to the machine code.
    • A detailed trace can be used to reconstruct a profile.
    • Useful for discovering the chain of events that led to a problem.
    • More difficult to set up than profiling.
    • Detailed traces are large.
    • Instrumented code is never as easy to compile as the original code.

b) Perf is a powerful Linux tool for profiling, tracing, and performance analysis.

  • Use perf --help to get an overview of perf’s capabilities.
  • Use perf list to list performance events and metric groups that can be measured by perf on SuperMUC-NG.
  • Use perf stat -e <event0,event1,…> <command> to measure cache misses during execution of the parallel triad or matrix multiplication.

It might be more convenient to use an interactive session on SuperMUC-NG.
The command salloc -t 30 -A pr58ci -N 1 reserves a single node for 30 minutes for an interactive session.

What I ran

    xxxxxxx@i01r01c02s08:~/scp/ex6/triad_numa> perf stat -e cache-misses ./triad

    NITERS,n,result,mflops
    10,1000000000,8.79e+17,445

     Performance counter stats for './triad':

            4999617006      cache-misses

          56.991889844 seconds time elapsed

          48.104035000 seconds user
           8.884006000 seconds sys

2. Hardware Performance Counters

Performance counters are available in the performance monitoring units (PMUs) on many CPUs. They can be configured to count performance events, e.g., retired instructions or cache misses.

A popular interface for configuring and reading performance counters is the Performance Application Programming Interface (PAPI).
PAPI enables measuring the occurrence of performance events in specific code regions.

SuperMUC-NG has a recent version of PAPI installed. You have to load it with module load papi before use.
You can list available events with papi_avail -a.
Interesting events are those related to the memory subsystem, as well as instructions and floating-point operations, e.g.:

  • PAPI_L1_DCM
  • PAPI_L2_DCM
  • PAPI_L3_TCM
  • PAPI_TOT_INS
  • PAPI_DP_OPS

A program can be compiled with PAPI like so:

    gcc test.c -o test -I$PAPI_INC -L$PAPI_LIBDIR -lpapi

See the documentation of the easier-to-use “high-level” API, or consult the more powerful “low-level” API; there you can find instructions and examples.

  • Insert measurement probes into the matrix multiplication code to measure performance events during the matrix multiplication.
  • Show the effect of matrix multiplication optimizations (loop reordering, cache blocking, …) with suitable performance event measurements.
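As a sketch of what such probes could look like, the snippet below wraps a naive matrix multiplication in PAPI’s high-level API (`PAPI_hl_region_begin` / `PAPI_hl_region_end`, available in recent PAPI versions). The matrix size and the ikj loop order are illustrative choices; the events to count are selected at run time via the `PAPI_EVENTS` environment variable, e.g. `export PAPI_EVENTS="PAPI_L1_DCM,PAPI_DP_OPS"`, and the measurements are written to a `papi_hl_output` directory:

```c
#include <papi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 512  /* illustrative matrix size */

static double A[N][N], B[N][N], C[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = 1.0; B[i][j] = 2.0; C[i][j] = 0.0;
        }

    /* Start counting the events listed in PAPI_EVENTS for this region. */
    if (PAPI_hl_region_begin("matmul") != PAPI_OK) {
        fprintf(stderr, "PAPI_hl_region_begin failed\n");
        exit(EXIT_FAILURE);
    }

    /* ikj loop order: streams over B and C row-wise (cache-friendlier
       than the textbook ijk order); swap the loops to compare counts. */
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i][j] += A[i][k] * B[k][j];

    if (PAPI_hl_region_end("matmul") != PAPI_OK) {
        fprintf(stderr, "PAPI_hl_region_end failed\n");
        exit(EXIT_FAILURE);
    }
    return 0;
}
```

Running the same binary with different loop orders (or a blocked variant) and comparing, e.g., `PAPI_L1_DCM` counts makes the effect of the optimizations visible.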

3. MPI: Hello World – Approximating π

Find an interactive Monte Carlo method to estimate π.

Write an MPI program that approximates π. You can use collective or point-to-point operations for your implementation.

Run the program using 4 × 48 processes (48 processes on each of 4 nodes, i.e., 192 in total) on SuperMUC-NG.

Hint: You will need to modify your existing batch files for that. Examples of MPI-specific batch files can be found in SuperMUC-NG’s documentation.
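One possible MPI implementation, sketched here under the assumption of 10⁷ samples per rank and a simple rank-dependent seed, lets every rank sample independently and combines the hit counts with a single `MPI_Reduce`:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long n_per_rank = 10000000;               /* samples per process (assumption) */
    unsigned int seed = 1234u + (unsigned int)rank; /* distinct stream per rank */

    long hits = 0;
    for (long i = 0; i < n_per_rank; i++) {
        double x = (double)rand_r(&seed) / RAND_MAX;
        double y = (double)rand_r(&seed) / RAND_MAX;
        if (x * x + y * y <= 1.0) hits++;
    }

    /* Sum the per-rank hit counts on rank 0. */
    long total_hits = 0;
    MPI_Reduce(&hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        double pi = 4.0 * (double)total_hits / ((double)n_per_rank * size);
        printf("pi ~ %.6f (%d processes)\n", pi, size);
    }
    MPI_Finalize();
    return 0;
}
```

Built with `mpicc` and launched via `srun` from the batch file, this scales trivially because the ranks only communicate once, at the final reduction; a point-to-point variant would replace the `MPI_Reduce` with each rank sending its count to rank 0.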


4. MPI: Communication Benchmarks

Determine the latency and bandwidth of the SuperMUC-NG nodes with the help of benchmark applications.

  • Use the OSU Micro-Benchmarks of the Ohio State University (OSU). You can download the code at http://mvapich.cse.ohio-state.edu/benchmarks/ (click on “Tarball”).

  • Copy the files to SuperMUC and extract the tarball (tar -xf <file>). Compile and install the benchmarks in the extracted directory:

    module load automake
    ./configure --prefix=$HOME/osu CC=$MPICC CXX=$MPICXX
    make -j
    make install

With these commands, the benchmarks are installed into your home directory under `osu/`. Point-to-point benchmarks can be found in `osu/libexec/osu-micro-benchmarks/mpi/pt2pt/`.

  • Determine the latency and bandwidth inside one node (MPI processes (a) on the same socket, (b) on different sockets) and between different nodes of the system. You can use the program `osu_latency` to determine the latency and `osu_bw` for the bandwidth.

  • Use the attached batch file to execute the benchmarks. You have to adapt `nodes`, `ntasks-per-node`, and the application name accordingly!
