2. The Top500 List
(a) Explain the terms Rmax and Rpeak in the ranking
Rmax
- → Maximal LINPACK performance achieved
- LINPACK benchmark measures how fast it can solve a large System of linear equations
Rpeak
- → Theoretical peak performance
- calculated based on its hardware specifications, such as the number of processors and their clock speeds
(b) Name 3 trends you can observe since the first release of the Top500 list in June 1996
- Exponential Performance Growth: Supercomputers have experienced rapid increases in computational power, with the leading system’s performance doubling approximately every two years.
- Rise of Parallel Processing: There’s been a significant increase in the number of processing cores, reflecting a shift towards parallel architectures to handle complex computations more efficiently.
- Enhanced Energy Efficiency: Modern supercomputers are achieving higher performance while consuming less energy, emphasizing a focus on sustainable high-performance computing.
(c) Choose any high ranked system from the Top500 list and discuss its specification compared to SuperMUC-NG
System | Manufacturer | Cores | Processor | LINPACK Performance (Rmax) | Rpeak | Power Consumption |
---|---|---|---|---|---|---|
Frontier | HPE Cray | 8,699,904 | AMD EPYC 64-core CPUs with AMD MI250X GPUs | 1.1 exaflops | 2 exaflops | ~21 MW |
SuperMUC-NG | Lenovo | 311,040 | Intel Xeon Platinum 8174 24-core CPUs | 19.48 petaflops | 26.87 petaflops | ~4 MW |
(d) Today, mobile devices, such as smartphones and tablets, are equipped with relatively powerful CPUs. Could a mobile device achieving a Linpack performance of 2 GFLOP/sec have been ranked in past Top500 lists? What is the best ranking it could have reached?
A mobile device achieving 2 GFLOP/s could have ranked within the TOP500 lists up until June 1995. Beyond this date, the performance of the 500th-ranked system exceeded 2 GFLOP/s, making it ineligible for the list thereafter.
(e) Assume a CPU with the following specification: 20 cores @ 2 GHz base frequency, 2x floating-point units per core with support for fused multiply-add (FMA) instructions, and a 256-bit vector length. Calculate the theoretical (double precision floating-point) peak performance of this CPU
-
Determine Elements per Vector:
-
Flops per FMA Instruction:
-
FLOPs per FPU
-
FLOPs per Core
-
Total Theoretical Peak Performance:
Instruction Guide to Calculating the Theoretical Peak Performance of a CPU
Given Specifications:
- Cores: 20
- Base Clock Frequency: 2 GHz
- Floating-Point Units (FPUs) per Core: 2
- Vector Length: 256 Bits
- Supported Instruction: Fused Multiply-Add (FMA)
- Precision: Double Precision (64 Bits)
Step-by-Step Calculation:
1. Determine the Number of Elements per Vector
Terms:
- Vector Length: The number of bits that can be processed in a vector register. Here, it is 256 Bits.
- Double Precision: A floating-point format that uses 64 Bits per number.
Calculation:
Explanation: A 256-bit vector can contain four double-precision elements since each element is 64 bits in size.
2. Calculate FLOPs per FMA Instruction
Terms:
- FLOP (Floating-Point Operation): A floating-point operation, such as addition or multiplication.
- FMA (Fused Multiply-Add): An instruction that performs one multiplication and one addition in a single step.
Calculation:
Explanation: An FMA instruction performs two FLOPs per element (one multiplication and one addition). With four elements per vector, this results in a total of eight FLOPs per instruction.
3. Calculate FLOPs per FPU
Terms:
- GHz (Gigahertz): A unit of frequency. Here, 2 GHz equals 2 billion cycles per second.
- FLOPs per Cycle: The number of FLOPs that an FPU can perform per clock cycle.
Calculation:
Explanation: Each FPU can perform 8 FLOPs per cycle. At a frequency of 2 GHz, this equates to 16 billion FLOPs per second (16 GFLOPs) per FPU.
4. Calculate FLOPs per Core
Terms:
- FPUs per Core: The number of floating-point units per CPU core, here 2 FPUs.
Calculation:
Explanation: Each core has two FPUs, each capable of 16 GFLOPs. Therefore, each core achieves a performance of 32 GFLOPs.
5. Calculate the Total Theoretical Peak Performance
Terms:
- Cores: The number of CPU cores, here 20.
Calculation:
Explanation: The total theoretical peak performance of the CPU is the performance of a single core multiplied by the number of cores. Here: 32 GFLOPs per Core multiplied by 20 Cores equals 640 GFLOPs.
Summary:
- Theoretical Peak Performance: 640 GFLOPs
Key Point: By accounting for 2 FPUs per core, the FLOP performance per core is doubled, resulting in a total performance of 640 GFLOPs.
3. Moore’s Law
> Moore’s Law states that the number of transistors per area unit in an integrated circuit roughly doubles every 18 months. These additional resources can be used to improve performance. How did the performance of the systems ranked in the Top500 list change? Discuss the total performance of all systems, the #1 rank, and the #500 rank with respect to Moore’s Law.
The performance of the systems closely follows the trend predicted by Moore’s Law. We can observe that the #1
ranked supercomputers experience a significant performance increase approximately every two years, creating a “staircase” effect in the graph as each new generation surpasses the previous one. Similarly, the #500
ranked systems also show steady performance growth in alignment with Moore’s prediction, although at a more gradual pace. This overall pattern highlights the consistent improvements across both the top-tier and lower-ranked systems on the Top500 list, validating the impact of Moore’s Law on supercomputing advancements.
4. C Programming
(a) If you are not familiar with C, you can follow, for example, Beej’s Guide to C. First run a hello world example, then read up on the topics arrays, pointers and dynamic memory allocation
(b) Implement the following program in C: Allocate an array array_a of 100 int values dynamically using malloc. Allocate an array array_s of 100 int values statically (i.e., without using malloc). Fill array_a with values from 0 to 99. Copy values from array_a to array_s in reverse order. Print out the values from both arrays. Deallocate (free) the memory of array_a
(c) Why is it a mistake to try to deallocate the memory of array_s with free?
The memory of array_s
should not be deallocated with free
because it is a static array allocated on the stack, and stack memory is managed automatically. Only dynamically allocated memory on the heap, like array_d
(allocated with malloc
), requires explicit deallocation with free
. Using free
on array_s
would lead to undefined behavior.
5. Passing a 2D Array in C
Note
Passing an N x M 2D int array as argument to a function can be implemented with several possible parameter types for the 2D array in the function signature:
a) Implement the function void print_matrix_arr(int rows, int cols, int matrix[][cols])
.
- Pass
matrix
toprint_matrix_arr
in the provided test program to test your implementation.
b) Implement the function void print_matrix_ptr(int rows, int cols, int *matrix)
.
- Pass
matrix
toprint_matrix_ptr
in the provided test program to test your implementation.
Notes:
- Depending on the implementation, the number of array elements per sub-array must appear before the array in the parameter list.
- The 2D array must be passed either as 2D array type, array pointer type, or as a pointer to its first element. In the latter, the number of elements in each sub-array is not part of the passed type.
Solution
6. Linked List in C
Note
Implement the function
struct element *push_back(struct element *head, int val)
which appends a new list element with valueval
to the tail of a linked list, and returns a pointer to the new list tail on success, orNULL
on error.The provided test program shall:
- Print:
head --> [ 42 ] --> [ 99 ] --> [ 37 ] --> [ 13 ] --> NULL
afterpush_back(head, 13)
- Not crash!
Note:
The linked list head is passed to
push_back
as the parameterhead
, and you need to iterate to the tail of the list before you can append the new list element.