2. The Top500 List

(a) Explain the terms Rmax and Rpeak in the ranking

Rmax
- → Maximal LINPACK performance achieved
- LINPACK benchmark measures how fast it can solve a large System of linear equations
Rpeak
- → Theoretical peak performance
- calculated based on its hardware specifications, such as the number of processors and their clock speeds

(b) Name 3 trends you can observe since the first release of the Top500 list in June 1996

Exponential Performance Growth: Supercomputers have experienced rapid increases in computational power, with the leading system’s performance doubling approximately every two years.
Rise of Parallel Processing: There’s been a significant increase in the number of processing cores, reflecting a shift towards parallel architectures to handle complex computations more efficiently.
Enhanced Energy Efficiency: Modern supercomputers are achieving higher performance while consuming less energy, emphasizing a focus on sustainable high-performance computing.

(c) Choose any high ranked system from the Top500 list and discuss its specification compared to SuperMUC-NG

System	Manufacturer	Cores	Processor	LINPACK Performance (Rmax)	Rpeak	Power Consumption
Frontier	HPE Cray	8,699,904	AMD EPYC 64-core CPUs with AMD MI250X GPUs	1.1 exaflops	2 exaflops	~21 MW
SuperMUC-NG	Lenovo	311,040	Intel Xeon Platinum 8174 24-core CPUs	19.48 petaflops	26.87 petaflops	~4 MW

(d) Today, mobile devices, such as smartphones and tablets, are equipped with relatively powerful CPUs. Could a mobile device achieving a Linpack performance of 2 GFLOP/sec have been ranked in past Top500 lists? What is the best ranking it could have reached?

A mobile device achieving 2 GFLOP/s could have ranked within the TOP500 lists up until June 1995. Beyond this date, the performance of the 500th-ranked system exceeded 2 GFLOP/s, making it ineligible for the list thereafter.

(e) Assume a CPU with the following specification: 20 cores @ 2 GHz base frequency, 2x floating-point units per core with support for fused multiply-add (FMA) instructions, and a 256-bit vector length. Calculate the theoretical (double precision floating-point) peak performance of this CPU

Determine Elements per Vector:

\frac{256 Bits}{64 Bits / element} = 4 elements / vector

Flops per FMA Instruction:

2 FLOPs \times 4 elements = 8 FLOPs / instruction

FLOPs per FPU

8 FLOPS/cycle \times 2 GHz = 16 GFLOPs/FPU

FLOPs per Core

8 GFLOPs/FPU \times 2 FPUs = 32 GFLOPs/core

Total Theoretical Peak Performance:

32 GFLOPs/core \times 20 cores = 640 GFLOPs

Instruction Guide to Calculating the Theoretical Peak Performance of a CPU

Given Specifications:

Cores: 20

Base Clock Frequency: 2 GHz

Floating-Point Units (FPUs) per Core: 2

Vector Length: 256 Bits

Supported Instruction: Fused Multiply-Add (FMA)

Precision: Double Precision (64 Bits)

Step-by-Step Calculation:

1. Determine the Number of Elements per Vector

Terms:

Vector Length: The number of bits that can be processed in a vector register. Here, it is 256 Bits.

Double Precision: A floating-point format that uses 64 Bits per number.

Calculation:
$\frac{256 Bits}{64 Bits / Element} = 4 Elements / Vector$
Explanation: A 256-bit vector can contain four double-precision elements since each element is 64 bits in size.

2. Calculate FLOPs per FMA Instruction

Terms:

FLOP (Floating-Point Operation): A floating-point operation, such as addition or multiplication.

FMA (Fused Multiply-Add): An instruction that performs one multiplication and one addition in a single step.

Calculation:
$2 FLOPs \times 4 Elements = 8 FLOPs / Instruction$
Explanation: An FMA instruction performs two FLOPs per element (one multiplication and one addition). With four elements per vector, this results in a total of eight FLOPs per instruction.

3. Calculate FLOPs per FPU

Terms:

GHz (Gigahertz): A unit of frequency. Here, 2 GHz equals 2 billion cycles per second.

FLOPs per Cycle: The number of FLOPs that an FPU can perform per clock cycle.

Calculation:
$8 FLOPs/cycle \times 2 GHz = 16 GFLOPs/FPU$
Explanation: Each FPU can perform 8 FLOPs per cycle. At a frequency of 2 GHz, this equates to 16 billion FLOPs per second (16 GFLOPs) per FPU.

4. Calculate FLOPs per Core

Terms:

FPUs per Core: The number of floating-point units per CPU core, here 2 FPUs.

Calculation:
$16 GFLOPs/FPU \times 2 FPUs = 32 GFLOPs/Core$
Explanation: Each core has two FPUs, each capable of 16 GFLOPs. Therefore, each core achieves a performance of 32 GFLOPs.

5. Calculate the Total Theoretical Peak Performance

Terms:

Cores: The number of CPU cores, here 20.

Calculation:
$32 GFLOPs/Core \times 20 Cores = 640 GFLOPs$
Explanation: The total theoretical peak performance of the CPU is the performance of a single core multiplied by the number of cores. Here: 32 GFLOPs per Core multiplied by 20 Cores equals 640 GFLOPs.

Summary:

Theoretical Peak Performance: 640 GFLOPs

Key Point: By accounting for 2 FPUs per core, the FLOP performance per core is doubled, resulting in a total performance of 640 GFLOPs.

3. Moore’s Law

> Moore’s Law states that the number of transistors per area unit in an integrated circuit roughly doubles every 18 months. These additional resources can be used to improve performance. How did the performance of the systems ranked in the Top500 list change? Discuss the total performance of all systems, the #1 rank, and the #500 rank with respect to Moore’s Law.

The performance of the systems closely follows the trend predicted by Moore’s Law. We can observe that the #1 ranked supercomputers experience a significant performance increase approximately every two years, creating a “staircase” effect in the graph as each new generation surpasses the previous one. Similarly, the #500 ranked systems also show steady performance growth in alignment with Moore’s prediction, although at a more gradual pace. This overall pattern highlights the consistent improvements across both the top-tier and lower-ranked systems on the Top500 list, validating the impact of Moore’s Law on supercomputing advancements.

4. C Programming

(a) If you are not familiar with C, you can follow, for example, Beej’s Guide to C. First run a hello world example, then read up on the topics arrays, pointers and dynamic memory allocation

https://beej.us/guide/bgc/

(b) Implement the following program in C: Allocate an array array_a of 100 int values dynamically using malloc. Allocate an array array_s of 100 int values statically (i.e., without using malloc). Fill array_a with values from 0 to 99. Copy values from array_a to array_s in reverse order. Print out the values from both arrays. Deallocate (free) the memory of array_a

#include <stdio.h>  /* printf */
#include <stdlib.h> /* malloc, free */
 
int main(int argc, char *argv[]) {
  int *array_d = (int *)malloc(sizeof(int) * 100);
  if (array_d == NULL) {
    return EXIT_FAILURE;
  }
  int array_s[100];
 
  for (int i = 0; i < 100; i++) {
    array_d[i] = i;
    printf("array_d[%i] = %i\n", i, i);
  }
  int zero_index = 0;
  for (int i = 99; i >= 0; i--) {
    array_s[zero_index] = array_d[i];
    printf("array_s[%i] = %i\n", zero_index, array_s[zero_index]);
    zero_index++;
  }
  free(array_d);
  return EXIT_SUCCESS;
}

(c) Why is it a mistake to try to deallocate the memory of array_s with free?

The memory of array_s should not be deallocated with free because it is a static array allocated on the stack, and stack memory is managed automatically. Only dynamically allocated memory on the heap, like array_d (allocated with malloc), requires explicit deallocation with free. Using free on array_s would lead to undefined behavior.

5. Passing a 2D Array in C

Note
Passing an N x M 2D int array as argument to a function can be implemented with several possible parameter types for the 2D array in the function signature:
int matrix[][]
int matrix[0][0]

int (*matrix)[] //array pointer to first row
int *matrix //init pointer to first matrix element

a) Implement the function void print_matrix_arr(int rows, int cols, int matrix[][cols]).

Pass matrix to print_matrix_arr in the provided test program to test your implementation.

b) Implement the function void print_matrix_ptr(int rows, int cols, int *matrix).

Pass matrix to print_matrix_ptr in the provided test program to test your implementation.

Notes:

Depending on the implementation, the number of array elements per sub-array must appear before the array in the parameter list.
The 2D array must be passed either as 2D array type, array pointer type, or as a pointer to its first element. In the latter, the number of elements in each sub-array is not part of the passed type.

Solution

#include <stdio.h>
 
void print_matrix_arr(int rows, int cols, int matrix[][cols]) {
  for (int i = 0; i < rows; i++) {
    for (int j = 0; j < cols; j++) {
      printf("%i ", matrix[i][j]);
    }
    printf("\n");
  }
}
 
void print_matrix_ptr(int rows, int cols, int *matrix) {
  for (int i = 0; i < rows; i++) {
    for (int j = 0; j < cols; j++) {
      printf("%i ", *matrix);
      matrix++;
    }
    printf("\n");
  }
}
 
int main(void) {
  int matrix[2][3] = {{1, 2, 3}, {4, 5, 6}};
 
  // expected output:
  // 1 2 3
  // 4 5 6
 
  printf("print_matrix_arr:\n");
  print_matrix_arr(2, 3, matrix);
 
  printf("print_matrix_ptr:\n");
  print_matrix_ptr(2, 3, (int *)matrix);
}

6. Linked List in C

Note

Implement the function struct element *push_back(struct element *head, int val) which appends a new list element with value val to the tail of a linked list, and returns a pointer to the new list tail on success, or NULL on error.

The provided test program shall:

Print: head --> [ 42 ] --> [ 99 ] --> [ 37 ] --> [ 13 ] --> NULL
after push_back(head, 13)

Not crash!

Note:

The linked list head is passed to push_back as the parameter head, and you need to iterate to the tail of the list before you can append the new list element.

Solution to Pushback

struct element *push_back(struct element *head, int val) {
  // check for valid input of 'head'
  if (head == NULL) {
    struct element *new = malloc(sizeof(struct element));
    if (new == NULL) {
      return NULL;
    }
    new->val = val;
    new->next = NULL;
    return new;
  }
  // set 'head' to last element
  while (head->next != NULL) {
    head = head->next;
  }
 
  struct element *new;
 
  // allocate storage for new list element
  new = malloc(sizeof(struct element));
  if (new == NULL) {
    return NULL;
  }
  // set 'val' and 'next'
  new->val = val;
  new->next = NULL;
  head->next = new;
  return new;
}

Complete Solution Code

#include <stdio.h>
#include <stdlib.h>
 
struct element {
  int val;
  struct element *next; // pointer to next list element
};
 
/* TODO: implement push_back function body */
struct element *push_back(struct element *head, int val) {
  // check for valid input of 'head'
  if (head == NULL) {
    struct element *new = malloc(sizeof(struct element));
    if (new == NULL) {
      return NULL;
    }
    new->val = val;
    new->next = NULL;
    return new;
  }
  // set 'head' to last element
  while (head->next != NULL) {
    head = head->next;
  }
 
  struct element *new;
 
  // allocate storage for new list element
  new = malloc(sizeof(struct element));
  if (new == NULL) {
    return NULL;
  }
  // set 'val' and 'next'
  new->val = val;
  new->next = NULL;
  head->next = new;
  return new;
}
 
struct element *push_front(struct element *head, int val) {
  struct element *new = malloc(sizeof(struct element));
  if (new == NULL) {
    return NULL;
  }
 
  // new->val  = val; new->next = head;
  *new = (struct element){.val = val, .next = head};
 
  return new;
}
 
void print_list(struct element *head) {
  printf("head --> ");
  while (head != NULL) {
    printf("[ %d ] --> ", head->val);
    head = head->next;
  }
  printf("NULL\n");
}
 
int main(void) {
  struct element *head = NULL; // empty list
 
  head = push_front(head, 37);
  head = push_front(head, 99);
  head = push_front(head, 42);
 
  printf("before push_back(head, 13):\n");
  print_list(head);
 
  push_back(head, 13);
 
  printf("after push_back(head, 13):\n");
  // expected output: head --> [ 42 ] --> [ 99 ] --> [ 37 ] --> [ 13 ] --> NULL
  print_list(head);
 
  // this should not segfault:
  push_back(NULL, 0);
}

🎓 MyUniNotes

Explorer

ÜB - 1 PHPC

2. The Top500 List

(a) Explain the terms Rmax and Rpeak in the ranking

(b) Name 3 trends you can observe since the first release of the Top500 list in June 1996

(c) Choose any high ranked system from the Top500 list and discuss its specification compared to SuperMUC-NG

(d) Today, mobile devices, such as smartphones and tablets, are equipped with relatively powerful CPUs. Could a mobile device achieving a Linpack performance of 2 GFLOP/sec have been ranked in past Top500 lists? What is the best ranking it could have reached?

(e) Assume a CPU with the following specification: 20 cores @ 2 GHz base frequency, 2x floating-point units per core with support for fused multiply-add (FMA) instructions, and a 256-bit vector length. Calculate the theoretical (double precision floating-point) peak performance of this CPU

Given Specifications:

Step-by-Step Calculation:

1. Determine the Number of Elements per Vector

2. Calculate FLOPs per FMA Instruction

3. Calculate FLOPs per FPU

4. Calculate FLOPs per Core

5. Calculate the Total Theoretical Peak Performance

Summary:

3. Moore’s Law

4. C Programming

(a) If you are not familiar with C, you can follow, for example, Beej’s Guide to C. First run a hello world example, then read up on the topics arrays, pointers and dynamic memory allocation

(c) Why is it a mistake to try to deallocate the memory of array_s with free?

5. Passing a 2D Array in C

Notes:

Solution

6. Linked List in C

Solution to Pushback

Complete Solution Code

Graph View

Table of Contents

Backlinks