Running code is interrupted at given time intervals (e.g., every millisecond). In each interruption, a ‘screenshot’ or sample is taken of the code by visiting each running thread and examining its stack to discover which functions are running.
Each sample is aggregated into a report or graph (e.g., flame graphs).
Profiling cannot be used to construct a trace.
Tracing
Does not operate by sampling.
A trace is a log of events within the program during runtime.
The log is customizable and may report function calls, returns, and execution of other statements.
May require the program to be instrumented, i.e., modified to include log events. These might need to be added to the source code as a pre-compilation step (‘instrumented code’) or might be added dynamically to the machine code.
A detailed trace can be used to reconstruct a profile.
Useful for discovering the chain of events that led to a problem.
More difficult to set up than profiling.
Detailed traces are large.
Instrumented code is not as easy to build and run as the original code.
b) Perf is a powerful Linux tool for profiling, tracing, and performance analysis.
Use perf --help to get an overview of perf’s capabilities.
Use perf list to list performance events and metric groups that can be measured by perf on SuperMUC-NG.
Use perf stat -e <event0,event1,…> <command> to measure cache misses during execution of the parallel triad or matrix multiplication.
It might be more convenient to use an interactive session on SuperMUC-NG.
The command salloc -t 30 -A pr58ci -N 1 reserves a single node for 30 minutes for an interactive session.
What I ran
xxxxxxx@i01r01c02s08:~/scp/ex6/triad_numa> perf stat -e cache-misses ./triad
NITERS,n,result,mflops
10,1000000000,8.79e+17,445

 Performance counter stats for './triad':

     4999617006      cache-misses

      56.991889844 seconds time elapsed
      48.104035000 seconds user
       8.884006000 seconds sys
2. Hardware Performance Counters
Performance counters are available in the performance monitoring units (PMUs) on many CPUs. They can be configured to count performance events, e.g., retired instructions or cache misses.
A popular interface for configuring and reading performance counters is the Performance Application Programming Interface (PAPI).
PAPI enables measuring the occurrence of performance events in specific code regions.
SuperMUC-NG has a recent version of PAPI installed. You have to load it with module load papi before use.
You can list available events with papi_avail -a.
Interesting events are those related to the memory subsystem, as well as instructions and floating-point operations, e.g.:
PAPI_L1_DCM
PAPI_L2_DCM
PAPI_L3_TCM
PAPI_TOT_INS
PAPI_DP_OPS
A program can be compiled with PAPI like so:
gcc test.c -o test -I$PAPI_INC -L$PAPI_LIBDIR -lpapi
For instructions and examples, see the documentation of the easier-to-use "high-level" API, or consult the more powerful "low-level" API.
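A minimal sketch of measurement probes using PAPI's low-level API, assuming the PAPI module is loaded and the two preset events are available on the machine (check with papi_avail -a); error handling is reduced to the initialization call:

```c
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void) {
    int eventset = PAPI_NULL;
    long long counts[2];

    /* Initialize the PAPI library and create an empty event set. */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI init failed\n");
        return EXIT_FAILURE;
    }
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_L1_DCM);   /* L1 data cache misses */
    PAPI_add_event(eventset, PAPI_TOT_INS);  /* retired instructions */

    PAPI_start(eventset);
    /* ... code region to measure, e.g. the matrix multiplication ... */
    PAPI_stop(eventset, counts);

    printf("PAPI_L1_DCM:  %lld\n", counts[0]);
    printf("PAPI_TOT_INS: %lld\n", counts[1]);
    return 0;
}
```

Placing the PAPI_start/PAPI_stop pair around individual loop variants makes it possible to compare event counts before and after an optimization.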
Insert measurement probes into the matrix multiplication code to measure performance events during the matrix multiplication.
Show the effect of matrix multiplication optimizations (loop reordering, cache blocking, …) with suitable performance event measurements.
3. MPI: Hello World – Approximating π
Find a Monte Carlo method to estimate π.
Write an MPI program that approximates π. You can use collective or point-to-point operations for your implementation.
Run the program using 4x48 processes on 4 nodes of SuperMUC-NG.
Hint: You will need to modify your existing batch files for that. Examples of MPI-specific batch files can be found in SuperMUC-NG's documentation.
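One possible implementation sketch using the dartboard Monte Carlo method and a collective reduction (the per-rank sample count and seeding scheme are arbitrary choices for illustration):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long samples_per_rank = 10000000;  /* arbitrary sample count */
    srand(rank + 1);                         /* different seed per process */

    /* Count random points in the unit square that fall inside the quarter circle. */
    long hits = 0;
    for (long i = 0; i < samples_per_rank; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)
            hits++;
    }

    /* Sum the hit counts of all processes on rank 0. */
    long total_hits = 0;
    MPI_Reduce(&hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        double pi = 4.0 * (double)total_hits / ((double)samples_per_rank * size);
        printf("pi ~= %.6f\n", pi);
    }

    MPI_Finalize();
    return 0;
}
```

The same aggregation could be done with point-to-point operations instead: each non-zero rank sends its hit count to rank 0 with MPI_Send, and rank 0 sums the values it receives.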
4. MPI: Communication Benchmarks
Determine the latency and bandwidth between SuperMUC-NG nodes with the help of benchmark applications.
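Established suites such as the OSU micro-benchmarks or the Intel MPI Benchmarks measure exactly these quantities; the idea can also be sketched as a hand-rolled ping-pong between two ranks (the message size and repetition count below are arbitrary choices, and a real benchmark would sweep a range of message sizes):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 1000;
    const int msg_size = 1 << 20;  /* 1 MiB, arbitrary */
    char *buf = malloc(msg_size);
    memset(buf, 0, msg_size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {          /* rank 0: ping, then wait for pong */
            MPI_Send(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {   /* rank 1: echo the message back */
            MPI_Recv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        double rtt = (t1 - t0) / reps;           /* average round-trip time */
        double bw  = 2.0 * msg_size / rtt / 1e9; /* GB/s, both directions */
        printf("avg RTT: %.3f us, bandwidth: %.2f GB/s\n", rtt * 1e6, bw);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```

Latency is usually measured the same way but with very small messages (e.g. 1 byte), so that the round-trip time is dominated by the network rather than by data transfer; to measure inter-node values, the two ranks must be placed on different nodes in the batch file.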