1. A Matrix in C

(a) Implement the functions `matrix_new` and `matrix_delete` to allocate and deallocate an $N \times M$ matrix. The signatures of the functions must be:

double **matrix_new(int rows, int cols)
void matrix_delete(double **matrix)

It must be possible to use the resulting return value like a 2D-array, e.g.:

double **matrix = matrix_new(10, 120);
matrix[4][7] = 5.0;

(b) Implement the functions `matrix_print` and `matrix_vector_mul` to print an $N \times M$ matrix and to multiply an $N \times M$ matrix with an $M$ -dimensional vector $y = A x$ .

The signatures of the functions must be:

void matrix_print(int rows, int cols, double **matrix)
void matrix_vector_mul(int rows, int cols, double **matrix, double *x, double *y)

Notes:

matrix[i] is of type double *, and should point to the first element of row $i$ .
It is possible to use just a single call to malloc in matrix_new.

Solution:

In this solution, we provide implementations for the required matrix functions in C. The matrix_new function allocates a 2D matrix using a single call to malloc, and matrix_delete deallocates it. We also include matrix_print to display the matrix and matrix_vector_mul to perform matrix-vector multiplication.

#include <stdio.h>
#include <stdlib.h>
 
// Allocates an N x M matrix
double **matrix_new(int rows, int cols) {
    // Calculate total size needed:
    // - Pointers to each row
    // - Actual data
    size_t total_size = rows * sizeof(double *) + rows * cols * sizeof(double);
 
    // Allocate memory for the matrix
    double **matrix = (double **)malloc(total_size);
    if (matrix == NULL) {
        perror("malloc");
        return NULL;
    }
 
    // Pointer to the data block after row pointers
    double *data = (double *)(matrix + rows);
 
    // Set up row pointers to point into data block
    for (int i = 0; i < rows; i++) {
        matrix[i] = data + i * cols;
    }
 
    return matrix;
}
 
// Deallocates the matrix
void matrix_delete(double **matrix) {
    free(matrix);
}
 
// Initializes the matrix with values (for testing purposes)
void matrix_init(int rows, int cols, double **matrix) {
    // Initialize matrix with values equal to the sum of their indices
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            matrix[i][j] = (double)(i + j);
        }
    }
}
 
// Prints the matrix
void matrix_print(int rows, int cols, double **matrix) {
    // Print the matrix elements
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            printf("%6.2f ", matrix[i][j]);
        }
        printf("\n");
    }
}
 
// Performs matrix-vector multiplication: y = A * x
void matrix_vector_mul(int rows, int cols, double **matrix, double *x, double *y) {
    // Perform matrix-vector multiplication
    for (int i = 0; i < rows; i++) {
        y[i] = 0.0;
        for (int j = 0; j < cols; j++) {
            y[i] += matrix[i][j] * x[j];
        }
    }
}
 
/* Test program */
int main(int argc, char **argv) {
    int n = 4;  // Number of rows
    int m = 3;  // Number of columns
 
    // Allocate the matrix
    double **matrix = matrix_new(n, m);
    if (matrix == NULL) {
        return EXIT_FAILURE;
    }
 
    // Initialize the matrix
    matrix_init(n, m, matrix);
 
    // Print the matrix
    printf("Matrix A:\n");
    matrix_print(n, m, matrix);
 
    // Initialize vector x
    double x[m];
    for (int i = 0; i < m; i++) {
        x[i] = 1.0;  // For simplicity, set all elements to 1.0
    }
 
    // Allocate vector y
    double y[n];
 
    // Perform matrix-vector multiplication
    matrix_vector_mul(n, m, matrix, x, y);
 
    // Print the result vector y
    printf("\nVector y = A * x:\n");
    for (int i = 0; i < n; i++) {
        printf("y[%d] = %6.2f\n", i, y[i]);
    }
 
    // Free the matrix memory
    matrix_delete(matrix);
 
    return EXIT_SUCCESS;
}

Explanation:

matrix_new Function:
- Allocates memory for both the array of row pointers and the data in a single malloc call.
- The data block is placed immediately after the array of pointers.
- Sets up each matrix[i] to point to the start of the corresponding row in the data block.
- Note on Alignment: On most modern systems, sizeof(double *) and sizeof(double) are the same (usually 8 bytes on 64-bit systems), ensuring proper alignment. If there is a concern about misalignment, especially on systems where sizeof(double *) differs from sizeof(double), adjustments or padding may be needed.
matrix_delete Function:
- Frees the memory allocated by matrix_new.
matrix_init Function:
- Initializes the matrix elements with the sum of their row and column indices. This is for testing purposes and can be modified as needed.
matrix_print Function:
- Prints the matrix in a formatted manner for easy readability.
matrix_vector_mul Function:
- Performs standard matrix-vector multiplication. Initializes each element of the result vector y to zero before accumulating the sum.
Test Program (main):
- Demonstrates the usage of the implemented functions.
- Initializes a matrix and a vector, performs multiplication, and prints the results.

Sample Output:

Matrix A:
  0.00   1.00   2.00
  1.00   2.00   3.00
  2.00   3.00   4.00
  3.00   4.00   5.00

Vector y = A * x:
y[0] =   3.00
y[1] =   6.00
y[2] =   9.00
y[3] =  12.00

Notes:

We initialized the vector x with all elements set to 1.0. The result vector y thus contains the sums of the elements in each row of the matrix.
The code includes error checking for memory allocation failures.
Proper memory management is ensured by freeing the allocated memory with matrix_delete.
The code can be compiled with gcc -o matrix matrix.c and run with ./matrix.

2. Cache Hierarchy

Before we work further on cache optimization with a matrix multiplication, let us first observe the effect of caches with an even simpler program. Write a program that measures the throughput of each cache level in the memory hierarchy. For that, you can use the vector triad:

for(size_t j = 0; j < NITERS; j++)
    for(size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i] * d[i];

Allocate the vectors (a, b, c, and d) of length $n$ dynamically. For each iteration, the triad issues two floating-point operations. You can calculate the throughput in FLOP/s with:

\frac{2 \cdot n \cdot NITERS}{t}

where $t$ is the runtime in seconds. In order to measure the throughput of each cache level, vary $n$ from small (i.e., the vectors fit into the L1 cache) to large (i.e., the vectors need to be loaded from main memory).

(a) Generate measurements on your own computer, or a computer from the CIP-pool. Plot your results in a graph that shows $n$ on the x-axis (logarithmic scale) and achieved FLOP/s on the y-axis.

Code:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
 
#define NITERS 100000
 
int main(int argc, char **argv) {
    size_t n;
    if (argc != 2) {
        printf("Usage: %s <n>\n", argv[0]);
        return EXIT_FAILURE;
    } else {
        n = atoi(argv[1]);
    }
 
    double *a = (double *)malloc(n * sizeof(double));
    double *b = (double *)malloc(n * sizeof(double));
    double *c = (double *)malloc(n * sizeof(double));
    double *d = (double *)malloc(n * sizeof(double));
 
    if (a == NULL || b == NULL || c == NULL || d == NULL) {
        perror("malloc");
        return EXIT_FAILURE;
    }
 
    struct timespec start, end;
 
    // Initialize the vectors
    for (size_t i = 0; i < n; i++) {
        b[i] = (double)i;
        c[i] = (double)i;
        d[i] = (double)i;
    }
 
    // Start timing
    clock_gettime(CLOCK_MONOTONIC, &start);
 
    // Perform the triad
    for (size_t j = 0; j < NITERS; j++) {
        for (size_t i = 0; i < n; i++) {
            a[i] = b[i] + c[i] * d[i];
        }
    }
 
    // Prevent dead code elimination by using the result
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += a[i];
    }
    // Print sum to ensure the compiler doesn't optimize it away
    fprintf(stderr, "Sum: %f\n", sum);
 
    // End timing
    clock_gettime(CLOCK_MONOTONIC, &end);
 
    // Calculate elapsed time in seconds
    double elapsed = (end.tv_sec - start.tv_sec) +
                     (end.tv_nsec - start.tv_nsec) / 1e9;
 
    // Calculate MFLOP/s
    double mflops = (2.0 * NITERS * n) / (elapsed * 1e6);
 
    printf("n=%zu, time=%f s, MFLOP/s=%f\n", n, elapsed, mflops);
 
    // Free memory
    free(a);
    free(b);
    free(c);
    free(d);
 
    return EXIT_SUCCESS;
}

Instructions:

Compile the program with optimization flags to enable vectorization (e.g., gcc -O3 -march=native -o triad triad.c).
Run the program for varying sizes of $n$ and collect the results.
Redirect the output to a CSV file for plotting.

(b) Lookup the cache sizes of your considered CPU (e.g., with `lscpu -C`) and compute the maximum vector length $n$ such that the four vectors fit in cache for each of the cache levels.

Notes:

Choose NITERS large enough to obtain measurable execution times.
To see more distinct effects, enable compiler optimizations and vectorization.
Ensure that dead code elimination does not remove the computations.
Use plotting tools to visualize the results.

Solution:

Cache Sizes on Apple M1 Pro:

Using system commands, we obtained the following cache sizes:

L1 Data Cache: 64 KB
L2 Cache: 4 MB
L3 Cache: Not present (Apple M1 Pro does not have a traditional L3 cache)

Maximum Vector Length $n$ for Each Cache Level:

Each vector consists of $n$ elements of type double (8 bytes). There are four vectors (a, b, c, d).

Total size of all vectors:

Total Size = 4 \times n \times 8 bytes = 32 n bytes

To fit all vectors into a cache level:

32 n \leq Cache Size

L1 Data Cache (64 KB):

32 n \leq 65, 536 bytes ⟹ n \leq 2, 048

L2 Cache (4 MB):

32 n \leq 4, 194, 304 bytes ⟹ n \leq 131, 072

Results:

Collected performance data for varying $n$ :

n	Time (s)	MFLOP/s
1000	0.159952	1250.375113
2000	0.304836	1312.180976
4000	0.623894	1282.269103
8000	1.418834	1127.686537
16000	2.874988	1113.048124
32000	5.694749	1123.842333
64000	11.418003	1121.036665
128000	22.955299	1115.210915
256000	45.912998	1115.152620
512000	95.052698	1077.297143
1024000	187.386242	1092.929757

Plot:

Observations:

Up to $n = 2, 048$ (L1 Cache Limit):
- High performance (~1,300 MFLOP/s) as all data fits in L1 cache.
Between $n = 2, 048$ and $n = 131, 072$ (L2 Cache Limit):
- Slight decrease in performance but remains relatively stable (~1,100–1,200 MFLOP/s).
Beyond $n = 131, 072$ :
- Performance remains stable with a slight decline.
- Expected a significant drop due to main memory access, but modern CPUs with advanced prefetching and memory bandwidth mitigate this effect.

Discussion:

The performance does not drop sharply beyond the L2 cache size, possibly due to:
- Efficient cache management and prefetching mechanisms.
- High memory bandwidth of the Apple M1 Pro.
Compiler optimizations and vectorization contribute to stable performance.
The results highlight the importance of understanding cache hierarchies for performance optimization.

3. CIP-Pool Shell Warm-up

We will work on the SuperMUC-NG in later worksheets. The system is accessible via an SSH connection that must be proxied over the CIP-pool. As a first step, we make sure that you can connect to the CIP-pool via SSH.

(a) If you do not already have an account for the IT services of the Institut für Informatik, register as described at the RBG website. Your account will be created within a few days.

✅ Account registered and confirmed.

(b) Once your account was set up, connect to the CIP-Pool with an SSH client of your choice. You can find instructions for that on the RBG website.

Instructions:

Open your terminal application.
Connect to the CIP-Pool using SSH:
```
ssh username@remote.cip.ifi.lmu.de
```
After logging in, create a shell script file hostmd.sh:
```
nano hostmd.sh
```

Insert the following content into the file:

#!/bin/bash
echo `hostname` $FULLNAME $LOGNAME $HOME $PWD
echo -n `hostname` $FULLNAME $LOGNAME $HOME $PWD | md5sum -

Save and exit the editor (CTRL + X, then Y to confirm).
Make the script executable:
```
chmod +x hostmd.sh
```
Execute the script:
```
./hostmd.sh
```
The script outputs the hostname, user information, and an MD5 hash of that information.

Sample Output:

remote.cip.ifi.lmu.de John Doe johndoe /home/johndoe /home/johndoe
e99a18c428cb38d5f260853678922e03  -

Note: Replace John Doe and johndoe with your actual full name and username.

Attachment:

Include the output as a text file (hostmd_output.txt) in your submission.

🎓 MyUniNotes

Explorer

ÜB - 3 PHPC

1. A Matrix in C

(a) Implement the functions `matrix_new` and `matrix_delete` to allocate and deallocate an $N \times M$ matrix. The signatures of the functions must be:

(b) Implement the functions `matrix_print` and `matrix_vector_mul` to print an $N \times M$ matrix and to multiply an $N \times M$ matrix with an $M$ -dimensional vector $y = A x$ .

Notes:

Solution:

Explanation:

Sample Output:

Notes:

2. Cache Hierarchy

(a) Generate measurements on your own computer, or a computer from the CIP-pool. Plot your results in a graph that shows $n$ on the x-axis (logarithmic scale) and achieved FLOP/s on the y-axis.

Code:

Instructions:

(b) Lookup the cache sizes of your considered CPU (e.g., with `lscpu -C`) and compute the maximum vector length $n$ such that the four vectors fit in cache for each of the cache levels.

Notes:

Solution:

Cache Sizes on Apple M1 Pro:

Maximum Vector Length $n$ for Each Cache Level:

L1 Data Cache (64 KB):

L2 Cache (4 MB):

Results:

Plot:

Observations:

Discussion:

3. CIP-Pool Shell Warm-up

(a) If you do not already have an account for the IT services of the Institut für Informatik, register as described at the RBG website. Your account will be created within a few days.

(b) Once your account was set up, connect to the CIP-Pool with an SSH client of your choice. You can find instructions for that on the RBG website.

Graph View

Table of Contents

Backlinks

🎓 MyUniNotes

Explorer

ÜB - 3 PHPC

1. A Matrix in C

(a) Implement the functions matrix_new and matrix_delete to allocate and deallocate an N×M matrix. The signatures of the functions must be:

(b) Implement the functions matrix_print and matrix_vector_mul to print an N×M matrix and to multiply an N×M matrix with an M-dimensional vector y​=Ax.

Notes:

Solution:

Explanation:

Sample Output:

Notes:

2. Cache Hierarchy

(a) Generate measurements on your own computer, or a computer from the CIP-pool. Plot your results in a graph that shows n on the x-axis (logarithmic scale) and achieved FLOP/s on the y-axis.

Code:

Instructions:

(b) Lookup the cache sizes of your considered CPU (e.g., with lscpu -C) and compute the maximum vector length n such that the four vectors fit in cache for each of the cache levels.

Notes:

Solution:

Cache Sizes on Apple M1 Pro:

Maximum Vector Length n for Each Cache Level:

L1 Data Cache (64 KB):

L2 Cache (4 MB):

Results:

Plot:

Observations:

Discussion:

3. CIP-Pool Shell Warm-up

(a) If you do not already have an account for the IT services of the Institut für Informatik, register as described at the RBG website. Your account will be created within a few days.

(b) Once your account was set up, connect to the CIP-Pool with an SSH client of your choice. You can find instructions for that on the RBG website.

Graph View

Table of Contents

Backlinks

(a) Implement the functions `matrix_new` and `matrix_delete` to allocate and deallocate an $N \times M$ matrix. The signatures of the functions must be:

(b) Implement the functions `matrix_print` and `matrix_vector_mul` to print an $N \times M$ matrix and to multiply an $N \times M$ matrix with an $M$ -dimensional vector $y = A x$ .

(a) Generate measurements on your own computer, or a computer from the CIP-pool. Plot your results in a graph that shows $n$ on the x-axis (logarithmic scale) and achieved FLOP/s on the y-axis.

(b) Lookup the cache sizes of your considered CPU (e.g., with `lscpu -C`) and compute the maximum vector length $n$ such that the four vectors fit in cache for each of the cache levels.

Maximum Vector Length $n$ for Each Cache Level: