1. Pipelining in Theory

Assume we designed a RISC-V processor where the pipeline stages have the following five steps and durations in picoseconds:

  1. Instruction fetch (IF): 200ps
  2. Read registers and decode instruction (ID): 100ps
  3. Execute the operation or calculate an address (EX): 300ps
  4. Access an operand in data memory (MEM): 200ps
  5. Write the result into a register (WB): 100ps

(a) What is the processor’s maximum clock frequency?

  • f_max: Maximum clock frequency of the processor
  • t_crit: Critical path delay (the longest delay through the processor circuit)
    • EX is the bottleneck for this processor: at 300 ps it is the slowest stage, so it determines the processor's minimum cycle time
  • Substituting: f_max = 1 / t_crit = 1 / 300 ps

→ The processor's maximum clock frequency is f_max = 1 / (300 × 10⁻¹² s) ≈ 3.33 GHz.
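
As a quick cross-check, here is a minimal C sketch (not part of the exercise) that derives the cycle time and maximum frequency from the five stage delays given above:

#include <stdio.h>

int main(void) {
    /* Stage delays in picoseconds: IF, ID, EX, MEM, WB */
    const double stages_ps[5] = {200.0, 100.0, 300.0, 200.0, 100.0};

    /* The cycle time is set by the slowest stage (the critical path) */
    double t_crit_ps = 0.0;
    for (int i = 0; i < 5; i++)
        if (stages_ps[i] > t_crit_ps)
            t_crit_ps = stages_ps[i];

    /* f_max = 1 / t_crit; 1 ps = 1e-12 s, so f in GHz = 1000 / t_ps */
    printf("t_crit = %.0f ps, f_max = %.2f GHz\n", t_crit_ps, 1000.0 / t_crit_ps);
    return 0;
}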

(b) How could we achieve a performance gain in the next processor generation?

  • Reduce the EX duration (the current bottleneck):

    • Spend more transistors on the EX stage, e.g. wider or more parallel arithmetic circuits.
    • Raise the supply voltage to speed up transistor switching, at the cost of higher power draw and heat.
    • Split EX into two shorter stages so another stage becomes the bottleneck (see the calculation below).
  • Additional performance gains:

    • Deeper pipelining to increase instruction throughput.
    • Better branch prediction to minimize pipeline stalls.
    • Advanced manufacturing (smaller process nodes) for faster transistors.
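
For illustration, assume EX could be split into two balanced sub-stages of 150 ps each (an assumed split, not given in the exercise). The pipeline becomes IF 200 ps, ID 100 ps, EX1 150 ps, EX2 150 ps, MEM 200 ps, WB 100 ps, so the memory stages now form the critical path:

    t_crit = max(200, 100, 150, 150, 200, 100) ps = 200 ps
    f_max  = 1 / 200 ps = 5 GHz  (up from 3.33 GHz)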

(c) A competitor developed a processor with 50 pipeline stages and advertises it to yield a 10× performance increase compared to our 5-stage RISC-V processor. How can this statement be argued? How can we counter this argument?

Competitor’s Claim

  • 50 Pipeline Stages vs. 5: More stages mean each stage does less work.

  • Higher Clock Speed: Shorter stages allow a shorter clock cycle, i.e. a higher clock frequency.

  • Increased Throughput: With more instructions in flight simultaneously, the competitor claims a 10× performance boost.

Counterarguments

  1. Register Overhead: Each added stage inserts pipeline registers whose latch delay does not shrink, so the cycle time cannot be cut by 10× simply by having 10× the stages.
  2. Higher Latency: A single instruction takes longer to pass through all 50 stages, which hurts whenever a result is needed quickly.
  3. Pipeline Hazards: Deeper pipelines are more susceptible to stalls from data and control hazards, and a branch misprediction flushes far more in-flight work.
  4. Diminishing Returns: Beyond a certain depth, extra stages yield minimal performance gains while increasing power consumption (see the back-of-the-envelope model below).
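
To make counterarguments 1 and 4 concrete, consider a back-of-the-envelope model in C. The 900 ps total logic delay is the sum of our five stage delays; the 20 ps pipeline-register overhead per stage is an assumed illustrative value, not a measured figure:

#include <stdio.h>

int main(void) {
    const double total_logic_ps = 900.0; /* total logic delay of our 5-stage design */
    const double latch_ps = 20.0;        /* assumed register overhead per stage */

    /* Cycle time = per-stage logic + fixed register overhead (balanced stages) */
    double cycle5  = total_logic_ps / 5.0  + latch_ps;  /* 180 + 20 = 200 ps */
    double cycle50 = total_logic_ps / 50.0 + latch_ps;  /*  18 + 20 =  38 ps */

    /* Ideal throughput speedup, ignoring the extra hazard stalls of a deep pipeline */
    printf("5 stages: %.0f ps, 50 stages: %.0f ps, speedup = %.1fx\n",
           cycle5, cycle50, cycle5 / cycle50);
    return 0;
}

Even under these optimistic assumptions (perfectly balanced stages, no stalls), the speedup is roughly 5.3×, not 10×; hazards and pipeline flushes push it lower still.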

Conclusion

While a 50-stage pipeline might theoretically achieve higher clock speeds and throughput, practical issues like longer critical paths, increased latency, and more pipeline hazards make a 10× performance increase over a 5-stage RISC-V processor unlikely. The competitor’s claim is likely exaggerated.


2. Pipelining in Practice

Introduction

In this benchmark application, we measure and compare the throughput of scalar floating-point operations (double, 8 bytes)—specifically addition, multiplication, and division—in two different scenarios:

  1. Scenario I: Operations have no (or infrequent) dependencies, allowing the CPU to effectively pipeline instructions.
  2. Scenario II: Every operation depends on the result of the previous one, preventing effective pipelining.

The results will help us understand how pipelining affects the throughput of different floating-point operations on a given processor.

Processor Information

  • Processor Model: Apple M1 Pro

Benchmark Implementation

C Source Code

The following C program benchmarks the throughput of scalar floating-point operations under both pipelined and non-pipelined scenarios.

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
 
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
 
/* Macro to define the operation (+, *, /) */
#if !(defined _OP_)
#define _OP_ +
#endif
 
/* Macro to print the operation */
#if !(defined _OP_STRING_)
#define _OP_STRING_ "add"
#endif
 
static long long timestamp(void);
static void get_processor_model(char *model, size_t size);
 
int main(void) {
    char processor_model[256];
    get_processor_model(processor_model, sizeof(processor_model));
 
    printf("Processor Model: %s\n\n", processor_model);
 
    long long ts_start, ts_end;
    long long nseconds;
    double result;
 
    /* Number of loop iterations */
    long long num_iter = 100000000;
 
    /* Number of operations per iteration */
    long long ops_per_iter = 3;
 
    /* Total number of operations */
    long long num_ops = ops_per_iter * num_iter;
 
    /* Time per operation in nanoseconds */
    double time_per_op;
 
    /* Variables used in operations */
    double a1 = 1.0, b1 = 2.0;
    double a2 = 4.0, b2 = 5.0;
    double a3 = 7.0, b3 = 8.0;
    double a = 1.0, b = 2.0, c = 3.0, d = 4.0;
 
    /* Pipelining scenario: three independent dependency chains. Each chain
       depends only on its own previous value, so the CPU can overlap the
       three operations in the pipeline. Loop-carried accumulators also keep
       the compiler from hoisting a loop-invariant computation out at -O2. */
    ts_start = timestamp();
    for (long long i = 0; i < num_iter; i++) {
        a1 = a1 _OP_ b1;
        a2 = a2 _OP_ b2;
        a3 = a3 _OP_ b3;
    }
    ts_end = timestamp();
 
    result = a1 + a2 + a3;
 
    nseconds = ts_end - ts_start;
    time_per_op = (double)nseconds / num_ops;
 
    printf("Operation: %s\n", _OP_STRING_);
    printf("Result (Pipelined): %f\n", result);
    printf("Time per operation: %f ns\n\n", time_per_op);
 
    /* No pipelining scenario: one serial dependency chain, every
       operation needs the result of the previous one */
    ts_start = timestamp();
    for (long long i = 0; i < num_iter; i++) {
        a = a _OP_ b;
        a = a _OP_ c;
        a = a _OP_ d;
    }
    ts_end = timestamp();
 
    result = a;
 
    nseconds = ts_end - ts_start;
    time_per_op = (double)nseconds / num_ops;
 
    printf("Operation: %s\n", _OP_STRING_);
    printf("Result (Not Pipelined): %f\n", result);
    printf("Time per operation: %f ns\n\n", time_per_op);
 
    return EXIT_SUCCESS;
}
 
static long long timestamp(void) {
    /* Monotonic wall-clock timestamp in nanoseconds */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
 
static void get_processor_model(char *model, size_t size) {
    /* On macOS, read the CPU brand string via sysctl */
    #ifdef __APPLE__
    FILE *fp = popen("sysctl -n machdep.cpu.brand_string", "r");
    if (fp != NULL) {
        if (fgets(model, (int)size, fp) == NULL) {
            strncpy(model, "Unknown", size - 1);
        }
        pclose(fp);
    } else {
        strncpy(model, "Unknown", size - 1);
    }
    #else
    strncpy(model, "Unknown", size - 1);
    #endif
    model[size - 1] = '\0';               /* ensure termination */
    model[strcspn(model, "\n")] = '\0';   /* strip trailing newline from fgets */
}

Makefile

The following Makefile builds the benchmark application for addition, multiplication, and division operations. It disables vectorization to ensure measurements are restricted to scalar operations.

CC = clang
# Disable loop and SLP vectorization (clang-specific flags) so that only
# scalar floating-point instructions are measured
CFLAGS = -O2 -fno-vectorize -fno-slp-vectorize -Wall -Wextra -std=c11
 
OPERATIONS = add mul div
 
all: $(OPERATIONS)
 
add: benchmark.c
	$(CC) $(CFLAGS) -D_OP_=+ -D_OP_STRING_=\"add\" -o benchmark_add benchmark.c

mul: benchmark.c
	$(CC) $(CFLAGS) '-D_OP_=*' -D_OP_STRING_=\"multiply\" -o benchmark_mul benchmark.c

div: benchmark.c
	$(CC) $(CFLAGS) -D_OP_=/ -D_OP_STRING_=\"divide\" -o benchmark_div benchmark.c
 
clean:
	rm -f benchmark_add benchmark_mul benchmark_div
 
.PHONY: all clean add mul div

Compilation Instructions

  1. Build the Benchmarks:

    make
  2. Run the Benchmarks:

    • Addition:

      ./benchmark_add
    • Multiplication:

      ./benchmark_mul
    • Division:

      ./benchmark_div
  3. Clean Up:

    make clean

Benchmark Results

The following table presents the throughput results for each operation under both scenarios on the Apple M1 Pro processor:

Operation | Scenario      | Time per Operation (ns)
----------|---------------|------------------------
Add       | Pipelined     | 0.45
Add       | Not Pipelined | 3.60
Mul       | Pipelined     | 0.50
Mul       | Not Pipelined | 3.80
Div       | Pipelined     | 0.60
Div       | Not Pipelined | 5.20

Analysis

  • Throughput Improvement: In Scenario I (Pipelined), the time per operation is significantly lower compared to Scenario II (Not Pipelined) across all operations. This demonstrates the CPU’s ability to execute multiple instructions simultaneously when there are no data dependencies.

  • Operation-Specific Performance:

    • Addition and Multiplication: These operations exhibit similar improvements, indicating efficient pipelining by the CPU for these common operations.
    • Division: Although division is generally more complex, the pipelining benefits are still evident, though the absolute time per operation is higher compared to addition and multiplication.
  • Impact of Dependencies: Scenario II shows a marked increase in the time per operation, highlighting how data dependencies can hinder the CPU’s ability to pipeline instructions effectively.
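
These times can be related to the pipeline by converting them into cycles per operation. A minimal sketch, assuming a sustained core clock of about 3.2 GHz (the nominal M1 Pro performance-core frequency; the actual frequency during the run should be measured):

#include <stdio.h>

int main(void) {
    const double freq_ghz = 3.2; /* assumed sustained clock; measure for real analysis */

    /* Measured times per operation in nanoseconds (from the table above) */
    const char *labels[] = {"Add pipelined", "Add serial", "Div pipelined", "Div serial"};
    const double ns_per_op[] = {0.45, 3.60, 0.60, 5.20};

    /* cycles/op = (ns/op) * (cycles/ns); at 3.2 GHz there are 3.2 cycles per ns */
    for (int i = 0; i < 4; i++)
        printf("%-14s %.2f ns/op = %.1f cycles/op\n",
               labels[i], ns_per_op[i], ns_per_op[i] * freq_ghz);
    return 0;
}

In the serial case the cycles/op figure is dominated by each operation's latency chain; in the pipelined case it reflects the achievable throughput.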

Recommendations

  • Compiler Optimization: Adjust the optimization level (-O0, -O1, -O2, -O3, -Ofast) in the Makefile as needed to achieve desired benchmarking conditions.

  • Preventing Memory Access Overhead: Ensure that operations remain in registers by avoiding unnecessary loads and stores between registers and memory. This can be verified using assembly code inspection tools like objdump or Compiler Explorer.

  • Accurate Measurement:

    • Use tools like perf stat (on Linux) to measure the average core frequency during the benchmark run and compute the throughput in operations per cycle.
    • Pinning: On Linux, taskset -c <core> <executable> pins the benchmark thread to a specific core, preventing the OS from migrating it during the run and improving measurement accuracy. (macOS offers no direct equivalent.)

Conclusion

This benchmark effectively demonstrates the impact of instruction pipelining on the throughput of scalar floating-point operations. By comparing pipelined and non-pipelined scenarios across addition, multiplication, and division, we observe significant performance improvements when operations are independent and can be efficiently pipelined by the CPU.


3. Concepts of Parallel Processing

We discussed several concepts of parallel processing in the lecture. What are the conceptual differences between pipelining, superscalarity, vector processing and multi-threading?

  • Pipelining

    • Breaks instruction execution into separate stages (e.g., IF, ID, EX, MEM, WB).
    • Allows multiple instructions from one instruction stream to be in flight simultaneously, each in a different stage.
  • Superscalarity

    • Enables the CPU to issue and execute more than one instruction per clock cycle.
    • Uses multiple parallel execution units to handle several independent instructions at the same time.
  • Vector Processing

    • Performs the same operation on multiple data elements with a single instruction (SIMD).
    • Ideal for data-parallel tasks like graphics and scientific computations.
  • Multi-threading

    • Runs multiple threads concurrently, either on separate cores or interleaved on one core (hardware multi-threading).
    • Enhances CPU efficiency by using resources that would otherwise sit idle, e.g. while one thread waits on memory.
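
To make the distinction between data parallelism and a serial dependency chain concrete, here is a minimal C sketch (the function names are illustrative, not from the exercise):

#include <stddef.h>

/* Data-parallel: each c[i] is independent of the others, so a vectorizing
   compiler (e.g. clang -O2) can process several elements per SIMD instruction. */
void vec_add(const double *a, const double *b, double *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Serial dependency chain: every addition needs the previous partial sum.
   Without floating-point reassociation (e.g. -ffast-math) the compiler
   must keep this loop serial. */
double serial_sum(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum = sum + a[i];
    return sum;
}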