1. Pipelining in Theory

Assume we designed a RISC-V processor where the pipeline stages have the following five steps and durations in picoseconds:

  1. Instruction fetch (IF): 200ps
  2. Read registers and decode instruction (ID): 100ps
  3. Execute the operation or calculate an address (EX): 300ps
  4. Access an operand in data memory (MEM): 200ps
  5. Write the result into a register (WB): 100ps

(a) What is the processor’s maximum clock frequency?

(b) How could we achieve a performance gain in the next processor generation?

(c) A competitor developed a processor with 50 pipeline stages and advertises it to yield a 10× performance increase compared to our 5-stage RISC-V processor.

How can this statement be argued? How can we counter this argument?


2. Pipelining in Practice

Write a simple benchmark application that allows us to measure and compare the throughput of scalar operations (addition, division, multiplication) of floating-point operations (double, 8 bytes) in the following two scenarios:

Scenario I

No (or infrequent) dependencies between operations so that the CPU can pipeline instructions effectively.

Scenario II

Every operation depends on the result of a previous operation so that no operation can be pipelined.

How does the throughput change depending on the operation and scenario? Submit your 3 × 2 results including your processor model, your source code, and Makefile.

Notes:

  • Think about how to implement the two scenarios and fill in the remaining code snippets in the provided C-source template.
  • A Makefile to build the benchmark (see the Makefile for instructions) is provided as well.
  • GCC with the flag -fno-tree-vectorize, as configured in the Makefile, avoids generating SIMD instructions to restrict measurements to scalar operations.
  • Avoid loads and stores between registers and memory; otherwise, the benchmark will measure these instead of FPU performance.
    To check this, you can view the assembly code (e.g., with objdump -S <executable>, or use Compiler Explorer1). See Listing 1 and Listing 2 for the difference in x86 assembly.
  • You may need to adjust the optimization level of the GCC compiler flags (-O0, -O1, -O2, -O3, -Ofast) in the Makefile to accomplish this with C code.

Listing 1: x86 assembly code for multiplication of two scalar double values (a = a * b) with loads and stores between registers and memory.

movsd -0xa0(%rbp),%xmm0 ; load a from memory into register xmm0
mulsd -0xa8(%rbp),%xmm0 ; load b from memory, mul. with xmm0, store to xmm0
movsd %xmm0,-0xa0(%rbp) ; store from register xmm0 to memory location of a

Listing 2: x86 assembly code for multiplication of two scalar double values (a = a * b) in registers (desired).

mulsd %xmm1,%xmm0 ; multiply xmm1 with xmm0, store to xmm0
  • If you like, you can measure the average core frequency during the benchmark run, for example with perf stat <executable>, to compute the throughput in units of operations per cycle.

  • Pinning the benchmark application thread to a specific core, for example with taskset -c <core> <executable>, prevents the OS from moving the thread during the benchmark run, and increases measurement accuracy.

  • You are highly encouraged to work in teams!


3. Concepts of Parallel Processing

We discussed several concepts of parallel processing in the lecture. What are the conceptual differences between pipelining, superscalarity, vector processing and multi-threading?


Footnotes

  1. https://godbolt.org/