1. Project: Optimizing Matrix-Matrix Multiplication
Optimize a double precision matrix-matrix multiplication as much as you can for the Intel i9-9900 CPU.
Attached with this worksheet is the naive implementation of the algorithm:
Given two square n × n matrices A and B, calculate C = A · B with C_ij = Σ_k A_ik · B_kj.
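For reference, the attached naive implementation presumably looks similar to the following triple loop (the function name, signature, and row-major layout used here are assumptions, not taken from the attachment):

```c
#include <stddef.h>

/* Naive reference: C = A * B for row-major n x n matrices.
 * Signature and names are assumed; adapt to the attached mmult. */
void mmult_naive(size_t n, const double *A, const double *B, double *C) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```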
The overall goal is to implement the matrix-matrix multiplication, restricted to square matrices. We encourage you to work in small groups, and to discuss on https://matrix.lmu.de/#/room/#ws2425:matrix.lmu.de.
Notes
- As a starting point, a simple benchmark of an unoptimized implementation and a Makefile using default compiler flags are attached to this assignment.
- Run `make` to compile the benchmark application.
- You presumably only need to change the implementation of the function `mmult`.
- Try to achieve the best performance for square matrices of size n × n.
- Create a plot of the performance of your implementation with the matrix size n on the x-axis and the achieved GFLOP/s on the y-axis. Vary n from small to large in steps of 200.
- The program must be single-threaded.
- Experiment with any performance tweaks you can think of, such as:
- Loop interchange
- Blocking (tiling) to reduce cache misses
- Changing the memory layout of matrix values
- Compiler flags such as:
  - `-O2` / `-O3` / `-Ofast` …
  - `-march=native`
  - … (see the compiler man-pages)
- Loop unrolling
- Manual vectorization with AVX2 assembly / AVX2 vector intrinsics (advanced)
- Be creative, but make sure your implementation remains correct.
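Two of the tweaks above can be sketched together: reordering the loops to i-k-j streams B row-wise (cache-friendly for row-major storage), and tiling the loops keeps working blocks resident in L1/L2. This is a sketch under assumed names; the block size `BS` is a tunable guess that you should measure on the i9-9900:

```c
#include <stddef.h>
#include <string.h>

#define BS 64  /* tile edge length; tune by benchmarking */

/* Blocked i-k-j multiply: C = A * B, row-major n x n matrices.
 * Handles n that is not a multiple of BS via the min() clamps. */
void mmult_blocked(size_t n, const double *A, const double *B, double *C) {
    memset(C, 0, n * n * sizeof(double));
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t kk = 0; kk < n; kk += BS)
            for (size_t jj = 0; jj < n; jj += BS) {
                size_t imax = ii + BS < n ? ii + BS : n;
                size_t kmax = kk + BS < n ? kk + BS : n;
                size_t jmax = jj + BS < n ? jj + BS : n;
                for (size_t i = ii; i < imax; i++)
                    for (size_t k = kk; k < kmax; k++) {
                        double a = A[i * n + k]; /* reused across the j loop */
                        for (size_t j = jj; j < jmax; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}
```

The innermost loop walks C and B contiguously, which the compiler can usually vectorize on its own with `-O3 -march=native`.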
Submit your best implementation variant (including Makefiles and any other prerequisites required to compile and run it), the achieved GFLOP/s, and the plot as an archive (zip/tar/tgz/…).
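For reporting, note that a dense n × n multiplication performs n³ multiplications and n³ additions, i.e. 2·n³ floating-point operations, which is the conventional count behind the GFLOP/s figure:

```c
#include <stddef.h>

/* GFLOP/s from a measured wall-clock runtime in seconds:
 * a dense n x n matrix multiply performs 2*n^3 flops. */
double gflops(size_t n, double seconds) {
    double nn = (double)n;
    return 2.0 * nn * nn * nn / seconds / 1e9;
}
```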
Below is a recommended resource; it is best read in its PDF paper format:
Enhancing Matrix-Matrix Multiplication Performance: Techniques and Strategies
2. Scaling
Amdahl’s law, Gustafson’s law, as well as weak and strong scaling were introduced early in the lecture.
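As a reminder, the two laws can be written as simple formulas. In this sketch, f denotes the parallelizable fraction of the work and p the number of processors; the symbol names are my choice, not taken from the lecture:

```c
/* Amdahl's law: fixed problem size, speedup limited by the serial part. */
double amdahl_speedup(double f, double p) {
    return 1.0 / ((1.0 - f) + f / p);
}

/* Gustafson's law: problem size grows with p, so the parallel part scales. */
double gustafson_speedup(double f, double p) {
    return (1.0 - f) + f * p;
}
```

For f = 0.5 and p = 2, Amdahl predicts a speedup of only 1/(0.5 + 0.25) ≈ 1.33, while Gustafson predicts 1.5, which illustrates the strong- vs. weak-scaling viewpoints.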