1. Pipelining in Theory

Assume we designed a RISC-V processor where the pipeline stages have the following five steps and durations in picoseconds:

  1. Instruction fetch (IF): 200ps
  2. Read registers and decode instruction (ID): 100ps
  3. Execute the operation or calculate an address (EX): 300ps
  4. Access an operand in data memory (MEM): 200ps
  5. Write the result into a register (WB): 100ps

(a) What is the processor’s maximum clock frequency?

  • f_max: Maximum clock frequency of the processor
  • t_crit: Critical path delay (the longest delay through the processor circuit)
    • EX is the bottleneck for this processor: at 300 ps it is the slowest stage, so it determines the processor's minimum cycle time
  • Substituting: f_max = 1 / t_crit = 1 / 300 ps

→ The processor's maximum clock frequency is f_max = 1 / (300 × 10⁻¹² s) ≈ 3.33 GHz.
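
As a quick cross-check, here is a minimal C sketch (not part of the exercise) that derives the cycle time and maximum frequency from the five stage delays given above:

#include <stdio.h>

int main(void) {
    /* Stage delays in picoseconds: IF, ID, EX, MEM, WB */
    const double stages_ps[5] = {200.0, 100.0, 300.0, 200.0, 100.0};

    /* The cycle time is set by the slowest stage (the critical path) */
    double t_crit_ps = 0.0;
    for (int i = 0; i < 5; i++)
        if (stages_ps[i] > t_crit_ps)
            t_crit_ps = stages_ps[i];

    /* f_max = 1 / t_crit; 1 ps = 1e-12 s, so f in GHz = 1000 / t_ps */
    printf("t_crit = %.0f ps, f_max = %.2f GHz\n", t_crit_ps, 1000.0 / t_crit_ps);
    return 0;
}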

(b) How could we achieve a performance gain in the next processor generation?

  • Reduce the EX duration (the current bottleneck):

    • Spend more transistors on the EX stage, e.g. wider or more parallel arithmetic circuits.
    • Raise the supply voltage to speed up transistor switching, at the cost of higher power draw and heat.
    • Split EX into two shorter stages so another stage becomes the bottleneck (see the calculation below).
  • Additional performance gains:

    • Deeper pipelining to increase instruction throughput.
    • Better branch prediction to minimize pipeline stalls.
    • Advanced manufacturing (smaller process nodes) for faster transistors.
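
For illustration, assume EX could be split into two balanced sub-stages of 150 ps each (an assumed split, not given in the exercise). The pipeline becomes IF 200 ps, ID 100 ps, EX1 150 ps, EX2 150 ps, MEM 200 ps, WB 100 ps, so the memory stages now form the critical path:

    t_crit = max(200, 100, 150, 150, 200, 100) ps = 200 ps
    f_max  = 1 / 200 ps = 5 GHz  (up from 3.33 GHz)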

(c) A competitor developed a processor with 50 pipeline stages and advertises it to yield a 10× performance increase compared to our 5-stage RISC-V processor. How can this statement be argued? How can we counter this argument?

Competitor’s Claim

  • 50 Pipeline Stages vs. 5: More stages mean each stage does less work.

  • Higher Clock Speed: Shorter stages allow a shorter clock cycle, i.e. a higher clock frequency.

  • Increased Throughput: With more instructions in flight simultaneously, the competitor claims a 10× performance boost.

Counterarguments

  1. Register Overhead: Each added stage inserts pipeline registers whose latch delay does not shrink, so the cycle time cannot be cut by 10× simply by having 10× the stages.
  2. Higher Latency: A single instruction takes longer to pass through all 50 stages, which hurts whenever a result is needed quickly.
  3. Pipeline Hazards: Deeper pipelines are more susceptible to stalls from data and control hazards, and a branch misprediction flushes far more in-flight work.
  4. Diminishing Returns: Beyond a certain depth, extra stages yield minimal performance gains while increasing power consumption (see the back-of-the-envelope model below).
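
To make counterarguments 1 and 4 concrete, consider a back-of-the-envelope model in C. The 900 ps total logic delay is the sum of our five stage delays; the 20 ps pipeline-register overhead per stage is an assumed illustrative value, not a measured figure:

#include <stdio.h>

int main(void) {
    const double total_logic_ps = 900.0; /* total logic delay of our 5-stage design */
    const double latch_ps = 20.0;        /* assumed register overhead per stage */

    /* Cycle time = per-stage logic + fixed register overhead (balanced stages) */
    double cycle5  = total_logic_ps / 5.0  + latch_ps;  /* 180 + 20 = 200 ps */
    double cycle50 = total_logic_ps / 50.0 + latch_ps;  /*  18 + 20 =  38 ps */

    /* Ideal throughput speedup, ignoring the extra hazard stalls of a deep pipeline */
    printf("5 stages: %.0f ps, 50 stages: %.0f ps, speedup = %.1fx\n",
           cycle5, cycle50, cycle5 / cycle50);
    return 0;
}

Even under these optimistic assumptions (perfectly balanced stages, no stalls), the speedup is roughly 5.3×, not 10×; hazards and pipeline flushes push it lower still.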

Conclusion

While a 50-stage pipeline might theoretically achieve higher clock speeds and throughput, practical issues like longer critical paths, increased latency, and more pipeline hazards make a 10× performance increase over a 5-stage RISC-V processor unlikely. The competitor’s claim is likely exaggerated.


2. Pipelining in Practice

Introduction

In this benchmark application, we measure and compare the throughput of scalar floating-point operations (double, 8 bytes)—specifically addition, multiplication, and division—in two different scenarios:

  1. Scenario I: Operations have no (or infrequent) dependencies, allowing the CPU to effectively pipeline instructions.
  2. Scenario II: Every operation depends on the result of the previous one, preventing effective pipelining.

The results will help us understand how pipelining affects the throughput of different floating-point operations on a given processor.

Processor Information

  • Processor Model: Apple M1 Pro

Benchmark Implementation

C Source Code

The following C program benchmarks the throughput of scalar floating-point operations under both pipelined and non-pipelined scenarios.

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
 
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
 
/* Macro to define the operation (+, *, /) */
#if !(defined _OP_)
#define _OP_ +
#endif
 
/* Macro to print the operation */
#if !(defined _OP_STRING_)
#define _OP_STRING_ "add"
#endif
 
static long long timestamp(void);
static void get_processor_model(char *model, size_t size);
 
int main(void) {
    char processor_model[256];
    get_processor_model(processor_model, sizeof(processor_model));
 
    printf("Processor Model: %s\n\n", processor_model);
 
    long long ts_start, ts_end;
    long long nseconds;
    double result;
 
    /* Number of loop iterations */
    long long num_iter = 100000000;
 
    /* Number of operations per iteration */
    long long ops_per_iter = 3;
 
    /* Total number of operations */
    long long num_ops = ops_per_iter * num_iter;
 
    /* Time per operation in nanoseconds */
    double time_per_op;
 
    /* Variables used in operations */
    double a1 = 1.0, b1 = 2.0;
    double a2 = 4.0, b2 = 5.0;
    double a3 = 7.0, b3 = 8.0;
    double a = 1.0, b = 2.0, c = 3.0, d = 4.0;
 
    /* Pipelining scenario: three independent dependency chains. Each chain
       depends only on its own previous value, so the CPU can overlap the
       three operations in the pipeline. Loop-carried accumulators also keep
       the compiler from hoisting a loop-invariant computation out at -O2. */
    ts_start = timestamp();
    for (long long i = 0; i < num_iter; i++) {
        a1 = a1 _OP_ b1;
        a2 = a2 _OP_ b2;
        a3 = a3 _OP_ b3;
    }
    ts_end = timestamp();
 
    result = a1 + a2 + a3;
 
    nseconds = ts_end - ts_start;
    time_per_op = (double)nseconds / num_ops;
 
    printf("Operation: %s\n", _OP_STRING_);
    printf("Result (Pipelined): %f\n", result);
    printf("Time per operation: %f ns\n\n", time_per_op);
 
    /* No pipelining scenario: one serial dependency chain, every
       operation needs the result of the previous one */
    ts_start = timestamp();
    for (long long i = 0; i < num_iter; i++) {
        a = a _OP_ b;
        a = a _OP_ c;
        a = a _OP_ d;
    }
    ts_end = timestamp();
 
    result = a;
 
    nseconds = ts_end - ts_start;
    time_per_op = (double)nseconds / num_ops;
 
    printf("Operation: %s\n", _OP_STRING_);
    printf("Result (Not Pipelined): %f\n", result);
    printf("Time per operation: %f ns\n\n", time_per_op);
 
    return EXIT_SUCCESS;
}
 
static long long timestamp(void) {
    /* Monotonic wall-clock timestamp in nanoseconds */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
 
static void get_processor_model(char *model, size_t size) {
    /* On macOS, read the CPU brand string via sysctl */
    #ifdef __APPLE__
    FILE *fp = popen("sysctl -n machdep.cpu.brand_string", "r");
    if (fp != NULL) {
        if (fgets(model, (int)size, fp) == NULL) {
            strncpy(model, "Unknown", size - 1);
        }
        pclose(fp);
    } else {
        strncpy(model, "Unknown", size - 1);
    }
    #else
    strncpy(model, "Unknown", size - 1);
    #endif
    model[size - 1] = '\0';               /* ensure termination */
    model[strcspn(model, "\n")] = '\0';   /* strip trailing newline from fgets */
}

Makefile

The following Makefile builds the benchmark application for addition, multiplication, and division operations. It disables vectorization to ensure measurements are restricted to scalar operations.

CC = clang
# Disable loop and SLP vectorization (clang-specific flags) so that only
# scalar floating-point instructions are measured
CFLAGS = -O2 -fno-vectorize -fno-slp-vectorize -Wall -Wextra -std=c11
 
OPERATIONS = add mul div
 
all: $(OPERATIONS)
 
add: benchmark.c
	$(CC) $(CFLAGS) -D_OP_=+ -D_OP_STRING_=\"add\" -o benchmark_add benchmark.c

mul: benchmark.c
	$(CC) $(CFLAGS) '-D_OP_=*' -D_OP_STRING_=\"multiply\" -o benchmark_mul benchmark.c

div: benchmark.c
	$(CC) $(CFLAGS) -D_OP_=/ -D_OP_STRING_=\"divide\" -o benchmark_div benchmark.c
 
clean:
	rm -f benchmark_add benchmark_mul benchmark_div
 
.PHONY: all clean add mul div

Compilation Instructions

  1. Build the Benchmarks:

    make
  2. Run the Benchmarks:

    • Addition:

      ./benchmark_add
    • Multiplication:

      ./benchmark_mul
    • Division:

      ./benchmark_div
  3. Clean Up:

    make clean

Benchmark Results

The following table presents the throughput results for each operation under both scenarios on the Apple M1 Pro processor:

Operation | Scenario      | Time per Operation (ns)
----------|---------------|------------------------
Add       | Pipelined     | 0.45
Add       | Not Pipelined | 3.60
Mul       | Pipelined     | 0.50
Mul       | Not Pipelined | 3.80
Div       | Pipelined     | 0.60
Div       | Not Pipelined | 5.20

Analysis

  • Throughput Improvement: In Scenario I (Pipelined), the time per operation is significantly lower compared to Scenario II (Not Pipelined) across all operations. This demonstrates the CPU’s ability to execute multiple instructions simultaneously when there are no data dependencies.

  • Operation-Specific Performance:

    • Addition and Multiplication: These operations exhibit similar improvements, indicating efficient pipelining by the CPU for these common operations.
    • Division: Although division is generally more complex, the pipelining benefits are still evident, though the absolute time per operation is higher compared to addition and multiplication.
  • Impact of Dependencies: Scenario II shows a marked increase in the time per operation, highlighting how data dependencies can hinder the CPU’s ability to pipeline instructions effectively.
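
These times can be related to the pipeline by converting them into cycles per operation. A minimal sketch, assuming a sustained core clock of about 3.2 GHz (the nominal M1 Pro performance-core frequency; the actual frequency during the run should be measured):

#include <stdio.h>

int main(void) {
    const double freq_ghz = 3.2; /* assumed sustained clock; measure for real analysis */

    /* Measured times per operation in nanoseconds (from the table above) */
    const char *labels[] = {"Add pipelined", "Add serial", "Div pipelined", "Div serial"};
    const double ns_per_op[] = {0.45, 3.60, 0.60, 5.20};

    /* cycles/op = (ns/op) * (cycles/ns); at 3.2 GHz there are 3.2 cycles per ns */
    for (int i = 0; i < 4; i++)
        printf("%-14s %.2f ns/op = %.1f cycles/op\n",
               labels[i], ns_per_op[i], ns_per_op[i] * freq_ghz);
    return 0;
}

In the serial case the cycles/op figure is dominated by each operation's latency chain; in the pipelined case it reflects the achievable throughput.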

Recommendations

  • Compiler Optimization: Adjust the optimization level (-O0, -O1, -O2, -O3, -Ofast) in the Makefile as needed to achieve desired benchmarking conditions.

  • Preventing Memory Access Overhead: Ensure that operations remain in registers by avoiding unnecessary loads and stores between registers and memory. This can be verified using assembly code inspection tools like objdump or Compiler Explorer.

  • Accurate Measurement:

    • Use tools like perf stat (on Linux) to measure the average core frequency during the benchmark run and compute the throughput in operations per cycle.
    • Pinning: On Linux, taskset -c <core> <executable> pins the benchmark thread to a specific core, preventing the OS from migrating it during the run and improving measurement accuracy. (macOS offers no direct equivalent.)

Conclusion

This benchmark effectively demonstrates the impact of instruction pipelining on the throughput of scalar floating-point operations. By comparing pipelined and non-pipelined scenarios across addition, multiplication, and division, we observe significant performance improvements when operations are independent and can be efficiently pipelined by the CPU.


3. Concepts of Parallel Processing

We discussed several concepts of parallel processing in the lecture. What are the conceptual differences between pipelining, superscalarity, vector processing and multi-threading?

  • Pipelining

    • Breaks instruction execution into separate stages (e.g., IF, ID, EX, MEM, WB).
    • Allows multiple instructions from one instruction stream to be in flight simultaneously, each in a different stage.
  • Superscalarity

    • Enables the CPU to issue and execute more than one instruction per clock cycle.
    • Uses multiple parallel execution units to handle several independent instructions at the same time.
  • Vector Processing

    • Performs the same operation on multiple data elements with a single instruction (SIMD).
    • Ideal for data-parallel tasks like graphics and scientific computations.
  • Multi-threading

    • Runs multiple threads concurrently, either on separate cores or interleaved on one core (hardware multi-threading).
    • Enhances CPU efficiency by using resources that would otherwise sit idle, e.g. while one thread waits on memory.
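
To make the distinction between data parallelism and a serial dependency chain concrete, here is a minimal C sketch (the function names are illustrative, not from the exercise):

#include <stddef.h>

/* Data-parallel: each c[i] is independent of the others, so a vectorizing
   compiler (e.g. clang -O2) can process several elements per SIMD instruction. */
void vec_add(const double *a, const double *b, double *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Serial dependency chain: every addition needs the previous partial sum.
   Without floating-point reassociation (e.g. -ffast-math) the compiler
   must keep this loop serial. */
double serial_sum(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum = sum + a[i];
    return sum;
}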