(a) Explain the terms Rmax and Rpeak in the ranking
Rmax
→ Maximal LINPACK performance achieved
LINPACK benchmark measures how fast it can solve a large System of linear equations
Rpeak
→ Theoretical peak performance
calculated based on its hardware specifications, such as the number of processors and their clock speeds
(b) Name 3 trends you can observe since the first release of the Top500 list in June 1996
Exponential Performance Growth: Supercomputers have experienced rapid increases in computational power, with the leading system’s performance doubling approximately every two years.
Rise of Parallel Processing: There’s been a significant increase in the number of processing cores, reflecting a shift towards parallel architectures to handle complex computations more efficiently.
Enhanced Energy Efficiency: Modern supercomputers are achieving higher performance while consuming less energy, emphasizing a focus on sustainable high-performance computing.
(c) Choose any high ranked system from the Top500 list and discuss its specification compared to SuperMUC-NG
System
Manufacturer
Cores
Processor
LINPACK Performance (Rmax)
Rpeak
Power Consumption
Frontier
HPE Cray
8,699,904
AMD EPYC 64-core CPUs with AMD MI250X GPUs
1.1 exaflops
2 exaflops
~21 MW
SuperMUC-NG
Lenovo
311,040
Intel Xeon Platinum 8174 24-core CPUs
19.48 petaflops
26.87 petaflops
~4 MW
(d) Today, mobile devices, such as smartphones and tablets, are equipped with relatively powerful CPUs. Could a mobile device achieving a Linpack performance of 2 GFLOP/sec have been ranked in past Top500 lists? What is the best ranking it could have reached?
A mobile device achieving 2 GFLOP/s could have ranked within the TOP500 lists up until June 1995. Beyond this date, the performance of the 500th-ranked system exceeded 2 GFLOP/s, making it ineligible for the list thereafter.
(e) Assume a CPU with the following specification: 20 cores @ 2 GHz base frequency, 2x floating-point units per core with support for fused multiply-add (FMA) instructions, and a 256-bit vector length. Calculate the theoretical (double precision floating-point) peak performance of this CPU
Determine Elements per Vector:
64 Bits / element256 Bits=4 elements / vector
Flops per FMA Instruction:
2 FLOPs×4 elements=8 FLOPs / instruction
FLOPs per FPU
8 FLOPS/cycle×2 GHz=16 GFLOPs/FPU
FLOPs per Core
8 GFLOPs/FPU×2 FPUs=32 GFLOPs/core
Total Theoretical Peak Performance:
32 GFLOPs/core×20 cores=640 GFLOPs
Instruction Guide to Calculating the Theoretical Peak Performance of a CPU
Given Specifications:
Cores: 20
Base Clock Frequency: 2 GHz
Floating-Point Units (FPUs) per Core: 2
Vector Length: 256 Bits
Supported Instruction: Fused Multiply-Add (FMA)
Precision: Double Precision (64 Bits)
Step-by-Step Calculation:
1. Determine the Number of Elements per Vector
Terms:
Vector Length: The number of bits that can be processed in a vector register. Here, it is 256 Bits.
Double Precision: A floating-point format that uses 64 Bits per number.
Calculation:
64 Bits / Element256 Bits=4 Elements / Vector
Explanation:
A 256-bit vector can contain four double-precision elements since each element is 64 bits in size.
2. Calculate FLOPs per FMA Instruction
Terms:
FLOP (Floating-Point Operation): A floating-point operation, such as addition or multiplication.
FMA (Fused Multiply-Add): An instruction that performs one multiplication and one addition in a single step.
Calculation:
2 FLOPs×4 Elements=8 FLOPs / Instruction
Explanation:
An FMA instruction performs two FLOPs per element (one multiplication and one addition). With four elements per vector, this results in a total of eight FLOPs per instruction.
3. Calculate FLOPs per FPU
Terms:
GHz (Gigahertz): A unit of frequency. Here, 2 GHz equals 2 billion cycles per second.
FLOPs per Cycle: The number of FLOPs that an FPU can perform per clock cycle.
Calculation:
8 FLOPs/cycle×2 GHz=16 GFLOPs/FPU
Explanation:
Each FPU can perform 8 FLOPs per cycle. At a frequency of 2 GHz, this equates to 16 billion FLOPs per second (16 GFLOPs) per FPU.
4. Calculate FLOPs per Core
Terms:
FPUs per Core: The number of floating-point units per CPU core, here 2 FPUs.
Calculation:
16 GFLOPs/FPU×2 FPUs=32 GFLOPs/Core
Explanation:
Each core has two FPUs, each capable of 16 GFLOPs. Therefore, each core achieves a performance of 32 GFLOPs.
5. Calculate the Total Theoretical Peak Performance
Terms:
Cores: The number of CPU cores, here 20.
Calculation:
32 GFLOPs/Core×20 Cores=640 GFLOPs
Explanation:
The total theoretical peak performance of the CPU is the performance of a single core multiplied by the number of cores. Here: 32 GFLOPs per Core multiplied by 20 Cores equals 640 GFLOPs.
Summary:
Theoretical Peak Performance:640 GFLOPs
Key Point: By accounting for 2 FPUs per core, the FLOP performance per core is doubled, resulting in a total performance of 640 GFLOPs.
3. Moore’s Law
> Moore’s Law states that the number of transistors per area unit in an integrated circuit roughly doubles every 18 months. These additional resources can be used to improve performance. How did the performance of the systems ranked in the Top500 list change? Discuss the total performance of all systems, the #1 rank, and the #500 rank with respect to Moore’s Law.
The performance of the systems closely follows the trend predicted by Moore’s Law. We can observe that the #1 ranked supercomputers experience a significant performance increase approximately every two years, creating a “staircase” effect in the graph as each new generation surpasses the previous one. Similarly, the #500 ranked systems also show steady performance growth in alignment with Moore’s prediction, although at a more gradual pace. This overall pattern highlights the consistent improvements across both the top-tier and lower-ranked systems on the Top500 list, validating the impact of Moore’s Law on supercomputing advancements.
4. C Programming
(a) If you are not familiar with C, you can follow, for example, Beej’s Guide to C. First run a hello world example, then read up on the topics arrays, pointers and dynamic memory allocation
(b) Implement the following program in C: Allocate an array array_a of 100 int values dynamically using malloc. Allocate an array array_s of 100 int values statically (i.e., without using malloc). Fill array_a with values from 0 to 99. Copy values from array_a to array_s in reverse order. Print out the values from both arrays. Deallocate (free) the memory of array_a
#include <stdio.h> /* printf */#include <stdlib.h> /* malloc, free */int main(int argc, char *argv[]) { int *array_d = (int *)malloc(sizeof(int) * 100); if (array_d == NULL) { return EXIT_FAILURE; } int array_s[100]; for (int i = 0; i < 100; i++) { array_d[i] = i; printf("array_d[%i] = %i\n", i, i); } int zero_index = 0; for (int i = 99; i >= 0; i--) { array_s[zero_index] = array_d[i]; printf("array_s[%i] = %i\n", zero_index, array_s[zero_index]); zero_index++; } free(array_d); return EXIT_SUCCESS;}
(c) Why is it a mistake to try to deallocate the memory of array_s with free?
The memory of array_s should not be deallocated with free because it is a static array allocated on the stack, and stack memory is managed automatically. Only dynamically allocated memory on the heap, like array_d (allocated with malloc), requires explicit deallocation with free. Using free on array_s would lead to undefined behavior.
5. Passing a 2D Array in C
Note
Passing an N x M 2D int array as argument to a function can be implemented with several possible parameter types for the 2D array in the function signature:
int matrix[][]int matrix[0][0]int (*matrix)[] //array pointer to first rowint *matrix //init pointer to first matrix element
a) Implement the function void print_matrix_arr(int rows, int cols, int matrix[][cols]).
Pass matrix to print_matrix_arr in the provided test program to test your implementation.
b) Implement the function void print_matrix_ptr(int rows, int cols, int *matrix).
Pass matrix to print_matrix_ptr in the provided test program to test your implementation.
Notes:
Depending on the implementation, the number of array elements per sub-array must appear before the array in the parameter list.
The 2D array must be passed either as 2D array type, array pointer type, or as a pointer to its first element. In the latter, the number of elements in each sub-array is not part of the passed type.
Solution
#include <stdio.h>void print_matrix_arr(int rows, int cols, int matrix[][cols]) { for (int i = 0; i < rows; i++) { for (int j = 0; j < cols; j++) { printf("%i ", matrix[i][j]); } printf("\n"); }}void print_matrix_ptr(int rows, int cols, int *matrix) { for (int i = 0; i < rows; i++) { for (int j = 0; j < cols; j++) { printf("%i ", *matrix); matrix++; } printf("\n"); }}int main(void) { int matrix[2][3] = {{1, 2, 3}, {4, 5, 6}}; // expected output: // 1 2 3 // 4 5 6 printf("print_matrix_arr:\n"); print_matrix_arr(2, 3, matrix); printf("print_matrix_ptr:\n"); print_matrix_ptr(2, 3, (int *)matrix);}
6. Linked List in C
Note
Implement the function struct element *push_back(struct element *head, int val) which appends a new list element with value val to the tail of a linked list, and returns a pointer to the new list tail on success, or NULL on error.
The linked list head is passed to push_back as the parameter head, and you need to iterate to the tail of the list before you can append the new list element.
Solution to Pushback
struct element *push_back(struct element *head, int val) { // check for valid input of 'head' if (head == NULL) { struct element *new = malloc(sizeof(struct element)); if (new == NULL) { return NULL; } new->val = val; new->next = NULL; return new; } // set 'head' to last element while (head->next != NULL) { head = head->next; } struct element *new; // allocate storage for new list element new = malloc(sizeof(struct element)); if (new == NULL) { return NULL; } // set 'val' and 'next' new->val = val; new->next = NULL; head->next = new; return new;}
Complete Solution Code
#include <stdio.h>#include <stdlib.h>struct element { int val; struct element *next; // pointer to next list element};/* TODO: implement push_back function body */struct element *push_back(struct element *head, int val) { // check for valid input of 'head' if (head == NULL) { struct element *new = malloc(sizeof(struct element)); if (new == NULL) { return NULL; } new->val = val; new->next = NULL; return new; } // set 'head' to last element while (head->next != NULL) { head = head->next; } struct element *new; // allocate storage for new list element new = malloc(sizeof(struct element)); if (new == NULL) { return NULL; } // set 'val' and 'next' new->val = val; new->next = NULL; head->next = new; return new;}struct element *push_front(struct element *head, int val) { struct element *new = malloc(sizeof(struct element)); if (new == NULL) { return NULL; } // new->val = val; new->next = head; *new = (struct element){.val = val, .next = head}; return new;}void print_list(struct element *head) { printf("head --> "); while (head != NULL) { printf("[ %d ] --> ", head->val); head = head->next; } printf("NULL\n");}int main(void) { struct element *head = NULL; // empty list head = push_front(head, 37); head = push_front(head, 99); head = push_front(head, 42); printf("before push_back(head, 13):\n"); print_list(head); push_back(head, 13); printf("after push_back(head, 13):\n"); // expected output: head --> [ 42 ] --> [ 99 ] --> [ 37 ] --> [ 13 ] --> NULL print_list(head); // this should not segfault: push_back(NULL, 0);}
×
MyUniNotes is a free, non-profit project to make education accessible for everyone.
If it has helped you, consider giving back! Even a small donation makes a difference.
These are my personal notes. While I strive for accuracy, I’m still a student myself. Thanks for being part of this journey!