1. A Matrix in C
(a) Implement the functions `matrix_new` and `matrix_delete` to allocate and deallocate an m × n matrix. The signatures of the functions must be:

double **matrix_new(int rows, int cols)
void matrix_delete(double **matrix)

It must be possible to use the resulting return value like a 2D array, e.g. `matrix[i][j]`.
(b) Implement the functions `matrix_print` and `matrix_vector_mul` to print an m × n matrix and to multiply an m × n matrix with an n-dimensional vector x. The signatures of the functions must be:

void matrix_print(int rows, int cols, double **matrix)
void matrix_vector_mul(int rows, int cols, double **matrix, double *x, double *y)
Notes:
- `matrix[i]` is of type `double *` and should point to the first element of row i.
- It is possible to use just a single call to `malloc` in `matrix_new`.
Solution:
In this solution, we provide implementations for the required matrix functions in C. The `matrix_new` function allocates a 2D matrix using a single call to `malloc`, and `matrix_delete` deallocates it. We also include `matrix_print` to display the matrix and `matrix_vector_mul` to perform matrix-vector multiplication.
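A minimal sketch of such an implementation is given below. It follows the required signatures and the explanation that follows; the output formatting, the `matrix_init` test values, and the `main` test driver are illustrative choices.

```c
#include <stdio.h>
#include <stdlib.h>

/* Allocate a rows x cols matrix with a single malloc call:
   the array of row pointers is followed directly by the data block. */
double **matrix_new(int rows, int cols)
{
    double **matrix = malloc((size_t)rows * sizeof(double *) +
                             (size_t)rows * (size_t)cols * sizeof(double));
    if (matrix == NULL) {
        fprintf(stderr, "matrix_new: allocation failed\n");
        exit(EXIT_FAILURE);
    }
    /* The data block starts right after the row-pointer array. */
    double *data = (double *)(matrix + rows);
    for (int i = 0; i < rows; i++)
        matrix[i] = data + (size_t)i * cols;
    return matrix;
}

/* One free suffices because everything came from a single malloc. */
void matrix_delete(double **matrix)
{
    free(matrix);
}

/* Fill the matrix with the sum of the row and column indices (for testing). */
void matrix_init(int rows, int cols, double **matrix)
{
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            matrix[i][j] = i + j;
}

void matrix_print(int rows, int cols, double **matrix)
{
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++)
            printf("%.2f ", matrix[i][j]);
        printf("\n");
    }
}

/* y = A * x for a rows x cols matrix A, x of length cols, y of length rows. */
void matrix_vector_mul(int rows, int cols, double **matrix, double *x, double *y)
{
    for (int i = 0; i < rows; i++) {
        y[i] = 0.0;
        for (int j = 0; j < cols; j++)
            y[i] += matrix[i][j] * x[j];
    }
}

int main(void)
{
    int rows = 4, cols = 3;
    double **A = matrix_new(rows, cols);
    double x[3] = {1.0, 1.0, 1.0};
    double y[4];

    matrix_init(rows, cols, A);
    printf("Matrix A:\n");
    matrix_print(rows, cols, A);

    matrix_vector_mul(rows, cols, A, x, y);
    printf("Vector y = A * x:\n");
    for (int i = 0; i < rows; i++)
        printf("y[%d] = %.2f\n", i, y[i]);

    matrix_delete(A);
    return 0;
}
```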
Explanation:
- `matrix_new` Function:
  - Allocates memory for both the array of row pointers and the data in a single `malloc` call.
  - The data block is placed immediately after the array of pointers.
  - Sets up each `matrix[i]` to point to the start of the corresponding row in the data block.
  - Note on Alignment: On most modern systems, `sizeof(double *)` and `sizeof(double)` are the same (usually 8 bytes on 64-bit systems), ensuring proper alignment. If there is a concern about misalignment, especially on systems where `sizeof(double *)` differs from `sizeof(double)`, adjustments or padding may be needed.
- `matrix_delete` Function:
  - Frees the memory allocated by `matrix_new`.
- `matrix_init` Function:
  - Initializes the matrix elements with the sum of their row and column indices. This is for testing purposes and can be modified as needed.
- `matrix_print` Function:
  - Prints the matrix in a formatted manner for easy readability.
- `matrix_vector_mul` Function:
  - Performs standard matrix-vector multiplication. Initializes each element of the result vector `y` to zero before accumulating the sum.
- Test Program (`main`):
  - Demonstrates the usage of the implemented functions.
  - Initializes a matrix and a vector, performs multiplication, and prints the results.
Sample Output:
Matrix A:
0.00 1.00 2.00
1.00 2.00 3.00
2.00 3.00 4.00
3.00 4.00 5.00
Vector y = A * x:
y[0] = 3.00
y[1] = 6.00
y[2] = 9.00
y[3] = 12.00
Notes:
- We initialized the vector `x` with all elements set to `1.0`. The result vector `y` thus contains the sums of the elements in each row of the matrix.
- The code includes error checking for memory allocation failures.
- Proper memory management is ensured by freeing the allocated memory with `matrix_delete`.
- The code can be compiled with `gcc -o matrix matrix.c` and run with `./matrix`.
2. Cache Hierarchy
Before we work further on cache optimization with a matrix multiplication, let us first observe the effect of caches with an even simpler program. Write a program that measures the throughput of each cache level in the memory hierarchy. For that, you can use the vector triad:

a[i] = b[i] + c[i] * d[i]

Allocate the vectors (`a`, `b`, `c`, and `d`) of length n dynamically. For each iteration, the triad issues two floating-point operations. You can calculate the throughput in FLOP/s with

P = 2 · n · NITERS / t

where t is the runtime in seconds and NITERS is the number of times the triad loop is repeated. In order to measure the throughput of each cache level, vary n from small (i.e., the vectors fit into the L1 cache) to large (i.e., the vectors need to be loaded from main memory).
(a) Generate measurements on your own computer, or a computer from the CIP-pool. Plot your results in a graph that shows n on the x-axis (logarithmic scale) and achieved FLOP/s on the y-axis.
Code:
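A minimal sketch of such a benchmark is shown below. The timer, the value of `NITERS`, the vector initialization, and the CSV output format are illustrative choices; depending on the compiler, stronger measures against dead code elimination (e.g., a dummy call to an external function) may be necessary.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NITERS 100000   /* number of repetitions; large enough for measurable times */

/* Wall-clock time in seconds. */
static double wtime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 1000;

    double *a = malloc((size_t)n * sizeof(double));
    double *b = malloc((size_t)n * sizeof(double));
    double *c = malloc((size_t)n * sizeof(double));
    double *d = malloc((size_t)n * sizeof(double));
    if (!a || !b || !c || !d) {
        fprintf(stderr, "allocation failed\n");
        return EXIT_FAILURE;
    }

    /* Initialize the vectors. */
    for (int i = 0; i < n; i++) {
        a[i] = 0.0;
        b[i] = 1.0;
        c[i] = 2.0;
        d[i] = 3.0;
    }

    double t0 = wtime();
    for (int iter = 0; iter < NITERS; iter++) {
        /* Vector triad: one addition and one multiplication per element. */
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i] * d[i];
        /* Use the result so the compiler cannot eliminate the loop as dead code. */
        if (a[n / 2] < 0.0)
            printf("never happens\n");
    }
    double t = wtime() - t0;

    /* Throughput: 2 * n * NITERS floating-point operations in t seconds. */
    double mflops = 2.0 * (double)n * (double)NITERS / t / 1e6;
    printf("%d,%f,%f\n", n, t, mflops);   /* CSV line: n, time (s), MFLOP/s */

    free(a); free(b); free(c); free(d);
    return 0;
}
```

In this sketch, n is passed as a command-line argument, so after compiling as described below the program can be run, e.g., as `./triad 4000 >> triad.csv`.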
Instructions:
- Compile the program with optimization flags to enable vectorization (e.g., `gcc -O3 -march=native -o triad triad.c`).
- Run the program for varying sizes of n and collect the results.
- Redirect the output to a CSV file for plotting.
(b) Look up the cache sizes of the CPU you are using (e.g., with `lscpu -C`) and compute the maximum vector length n such that the four vectors fit in cache for each of the cache levels.
Notes:
- Choose `NITERS` large enough to obtain measurable execution times.
- To see more distinct effects, enable compiler optimizations and vectorization.
- Ensure that dead code elimination does not remove the computations.
- Use plotting tools to visualize the results.
Solution:
Cache Sizes on Apple M1 Pro:
Using system commands, we obtained the following cache sizes:
- L1 Data Cache: 64 KB
- L2 Cache: 4 MB
- L3 Cache: Not present (Apple M1 Pro does not have a traditional L3 cache)
Maximum Vector Length for Each Cache Level:
Each vector consists of n elements of type `double` (8 bytes). There are four vectors (`a`, `b`, `c`, `d`).

Total size of all vectors: 4 · n · 8 bytes = 32n bytes.

To fit all vectors into a cache level, 32n bytes must not exceed the cache size:

- L1 Data Cache (64 KB): n ≤ 65536 / 32 = 2048
- L2 Cache (4 MB): n ≤ 4194304 / 32 = 131072
Results:
Collected performance data for varying n:

| n | Time (s) | MFLOP/s |
|---|---|---|
| 1000 | 0.159952 | 1250.375113 |
| 2000 | 0.304836 | 1312.180976 |
| 4000 | 0.623894 | 1282.269103 |
| 8000 | 1.418834 | 1127.686537 |
| 16000 | 2.874988 | 1113.048124 |
| 32000 | 5.694749 | 1123.842333 |
| 64000 | 11.418003 | 1121.036665 |
| 128000 | 22.955299 | 1115.210915 |
| 256000 | 45.912998 | 1115.152620 |
| 512000 | 95.052698 | 1077.297143 |
| 1024000 | 187.386242 | 1092.929757 |
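As a plausibility check of the throughput formula, the first row of the table is consistent with `NITERS` = 100000 (an assumed value; the exact number of repetitions used for these measurements is not stated):

2 · 1000 · 100000 FLOP / 0.159952 s ≈ 1.25 · 10^9 FLOP/s ≈ 1250 MFLOP/s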
Plot: achieved MFLOP/s over n (logarithmic x-axis).
Observations:
- Up to n ≈ 2048 (L1 Cache Limit):
  - High performance (~1,300 MFLOP/s) as all data fits in L1 cache.
- Between n ≈ 2048 and n ≈ 131072 (L2 Cache Limit):
  - Slight decrease in performance, but it remains relatively stable (~1,100–1,200 MFLOP/s).
- Beyond n ≈ 131072:
  - Performance remains stable with a slight decline.
  - A significant drop due to main memory access would be expected, but modern CPUs with advanced prefetching and high memory bandwidth mitigate this effect.
Discussion:
- The performance does not drop sharply beyond the L2 cache size, possibly due to:
- Efficient cache management and prefetching mechanisms.
- High memory bandwidth of the Apple M1 Pro.
- Compiler optimizations and vectorization contribute to stable performance.
- The results highlight the importance of understanding cache hierarchies for performance optimization.
3. CIP-Pool Shell Warm-up
We will work on the SuperMUC-NG in later worksheets. The system is accessible via an SSH connection that must be proxied over the CIP-pool. As a first step, we make sure that you can connect to the CIP-pool via SSH.
(a) If you do not already have an account for the IT services of the Institut für Informatik, register as described at the RBG website. Your account will be created within a few days.
✅ Account registered and confirmed.
(b) Once your account has been set up, connect to the CIP-Pool with an SSH client of your choice. You can find instructions for that on the RBG website.
Instructions:
- Open your terminal application.
- Connect to the CIP-Pool using SSH.
- After logging in, create a shell script file `hostmd.sh`, e.g., by opening it in `nano`.
- Insert the script content into the file (a possible version is sketched after this list).
- Save and exit the editor (`CTRL + X`, then `Y` to confirm).
- Make the script executable: `chmod +x hostmd.sh`
- Execute the script: `./hostmd.sh`
- The script outputs the hostname, user information, and an MD5 hash of that information.
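The script body is not reproduced above; the following is one possible `hostmd.sh` that produces output of the shape shown in the sample below. The exact fields (hostname, GECOS full name, username, home directory, working directory) and the use of `md5sum` are assumptions based on that sample output.

```bash
#!/bin/bash
# Collect hostname and user information, print it, then print an MD5 hash of it.
# The full name is taken from the GECOS field of the passwd database; adjust as needed.
INFO="$(hostname) $(getent passwd "$USER" | cut -d: -f5) $USER $HOME $(pwd)"
echo "$INFO"
echo "$INFO" | md5sum
```

The hostname in the sample output suggests an SSH command of the form `ssh <username>@remote.cip.ifi.lmu.de`, but the exact login hosts are described on the RBG website.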
Sample Output:
remote.cip.ifi.lmu.de John Doe johndoe /home/johndoe /home/johndoe
e99a18c428cb38d5f260853678922e03 -
Note: Replace `John Doe` and `johndoe` with your actual full name and username.
Attachment:
- Include the output as a text file (`hostmd_output.txt`) in your submission.