1. Project: Optimizing Matrix-Matrix Multiplication

Optimize a double precision matrix-matrix multiplication as much as you can for the Intel i9-9900 CPU.

Attached with this worksheet is the naive implementation of the algorithm:

Given two square matrices $A, B \in \mathbb{R}^{N \times N}$, calculate $C = A \cdot B$ with $c_{ij} = \sum_{k=1}^{N} a_{ik} \, b_{kj}$.
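
For reference, the textbook triple loop below is what such a naive implementation typically looks like. This is only a sketch: the function name mmult is taken from the notes further down, while the exact signature and the row-major double array layout are assumptions, so the attached code may differ.

    #include <stddef.h>

    /* Naive matrix-matrix multiplication: c_ij = sum_k a_ik * b_kj.
     * A, B, C are n x n matrices in (assumed) row-major layout. */
    void mmult(size_t n, const double *A, const double *B, double *C)
    {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++) {
                double sum = 0.0;
                for (size_t k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
    }

Note that the inner loop walks B column-wise (stride n), which is exactly the cache-unfriendly access pattern the tweaks below address.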

The overall goal is to implement matrix-matrix multiplication, restricted to square matrices. We encourage you to work in small groups and to discuss on https://matrix.lmu.de/#/room/#ws2425:matrix.lmu.de.

Notes

  • As a starting point, a simple benchmark of an unoptimized implementation and a Makefile using default compiler flags are attached to this assignment.
  • Run make to compile the benchmark application.
  • You presumably only need to change the implementation of the function mmult.
  • Try to achieve the best performance for the given matrix size $N$.
  • Create a plot of the performance of your implementation with the matrix size $N$ on the x-axis and the GFLOP/s on the y-axis. Vary $N$ from small to large in steps of 200.
  • The program must be single-threaded.
  • Experiment with any conceivable performance tweaks (see the sketch after this list), such as:
    • Loop interchange
    • Blocking (tiling) to reduce cache misses
    • Changing the memory layout of matrix values
    • Compiler flags such as:
      • -O2/-O3/-Ofast…
      • -march=native
      • … (see the compiler man-pages)
    • Loop unrolling
    • Manual vectorization with AVX2 assembly / AVX2 vector intrinsics (advanced)
  • Be creative, but make sure your implementation remains correct.
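
As an illustration of the first two tweaks, here is a sketch of a blocked (tiled) variant with interchanged loops. It assumes the same hypothetical signature and row-major layout as the sketch above; adapt it to the interface of the attached benchmark.

    #include <stddef.h>

    /* Tile size: tune so that three BLOCK x BLOCK double tiles fit in a
     * near cache level, e.g. 3 * 64 * 64 * 8 B = 96 KiB for a 256 KiB L2. */
    #define BLOCK 64

    void mmult(size_t n, const double *A, const double *B, double *C)
    {
        /* C is assumed to be zero-initialized by the caller. */
        for (size_t ii = 0; ii < n; ii += BLOCK)
            for (size_t kk = 0; kk < n; kk += BLOCK)
                for (size_t jj = 0; jj < n; jj += BLOCK)
                    /* i-k-j order: the innermost loop is stride-1 in both
                     * B and C, which is cache-friendly and easy for the
                     * compiler to vectorize. */
                    for (size_t i = ii; i < ii + BLOCK && i < n; i++)
                        for (size_t k = kk; k < kk + BLOCK && k < n; k++) {
                            double a = A[i * n + k];
                            for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }

For the plot, note that a full multiplication performs $2N^3$ floating-point operations ($N$ multiplications and roughly $N$ additions for each of the $N^2$ output elements), so GFLOP/s $= 2N^3 / (\text{runtime in seconds} \cdot 10^9)$.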

Submit your best implementation variant (including Makefiles and any other prerequisites required to compile and run it), the achieved GFLOP/s for the given $N$, and the plot as an archive (zip/tar/tgz/…).

The resource below is recommended reading; I would advise reading it in the PDF paper format.

Enhancing Matrix-Matrix Multiplication Performance: Techniques and Strategies


2. Scaling

Amdahl’s law, Gustafson’s law, as well as weak and strong scaling were introduced early in the lecture.

(a) Describe Amdahl’s law and Gustafson’s law in your own words.

  • Amdahl’s Law: Speedup is limited by the part of a task that cannot be parallelized. Even with arbitrarily many processors, the serial portion caps the achievable speedup.

Formula: $S(p) = \dfrac{1}{s + \frac{1 - s}{p}}$, where $s$ is the serial fraction of the work and $p$ is the number of processors.
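
For example (numbers chosen purely for illustration): with $s = 0.1$ and $p = 16$, $S = 1 / (0.1 + 0.9/16) = 6.4$, and even for $p \to \infty$ the speedup can never exceed $1/s = 10$.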

  • Gustafson’s Law: If the problem size grows along with the number of processors, the achievable (scaled) speedup grows almost linearly with $p$. This is more realistic for many real-world workloads.

Formula: $S(p) = s + p\,(1 - s) = p - s\,(p - 1)$, with $s$ and $p$ as above.
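
Taking the same illustrative numbers, $s = 0.1$ and $p = 16$: $S = 16 - 0.1 \cdot 15 = 14.5$, far above the 6.4 from Amdahl’s bound, because the parallel part of the work is assumed to grow with $p$.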

(b) Why does Gustafson criticize Amdahl’s approach?

Gustafson says Amdahl’s Law is too pessimistic because it assumes a fixed problem size. In real life, when we add more processors, we usually make the problem bigger to use all the computing power. Gustafson shows that when problem size scales, we can get much better speedup than Amdahl’s Law predicts.

(c) What is the difference between weak scaling and strong scaling?

  • Weak Scaling: The workload per processor stays the same; the total problem size grows with the number of processors.
  • Strong Scaling: The problem size stays the same, and we measure how adding processors reduces runtime.

Relation to the laws:

  • Amdahl’s Law focuses on strong scaling and shows limits when the problem size doesn’t grow.
  • Gustafson’s Law fits weak scaling because it assumes the problem size increases as processors are added.
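
To make the distinction concrete (all numbers chosen purely for illustration): for strong scaling, take a fixed job that runs in 100 s on one core with serial fraction $s = 0.1$; on 8 cores Amdahl predicts $100 \cdot (0.1 + 0.9/8) = 21.25$ s, a speedup of about 4.7 rather than 8. For weak scaling, give each of the 8 cores its own 100 s worth of work; Gustafson’s scaled speedup is then $8 - 0.1 \cdot 7 = 7.3$, close to linear.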