Parallel Computing Study Guide
Table of Contents
- Introduction and Motivation
- Matrix-Matrix Multiplication Experiment
- Peak Performance
- Performance Analysis
- Instruction Set Architecture (ISA)
- Parallelism in Single-Core Processors
- Memory System and Caches
- Case Study: Matrix Multiplication Performance
- Summary
- Course Plan
- Additional Resources
- Exercise
- Formulas Summary
Introduction and Motivation
- Matrix-Matrix Multiplication:
- Operation: C = A · B, where A, B, and C are n × n matrices.
- Naive implementation uses three nested loops (see the sketch at the end of this section).
- Experimental Testbed (SuperMUC Phase I):
- Hardware Specifications:
- Intel Xeon E5-2680, 2.3 GHz, 20 MB L3 cache
- 2 sockets with 8 cores each (16 total)
- 32 GB DDR3 RAM
- Sequential execution using one core
- Question:
What performance can we expect from the naive matrix multiplication implementation?
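A minimal sketch of the naive triple-loop kernel, assuming square n × n matrices stored in row-major order (the function and array names are illustrative):

```c
#include <stddef.h>

/* Naive matrix-matrix multiplication C = A * B for n x n matrices
 * stored in row-major order: three nested loops, 2n^3 flops. */
void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++) {
                sum += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = sum;
        }
    }
}
```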
Matrix-Matrix Multiplication Experiment
- Performance Evaluation:
- Compare the naive implementation to an optimized library routine (e.g., MKL’s DGEMM); a call sketch appears at the end of this section.
- Findings:
- Naive implementation achieves a small fraction of peak performance.
- Optimized libraries significantly bridge the performance gap.
- Key Insight:
Theoretical peak performance serves as a benchmark to assess actual implementation efficiency.
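As a point of comparison, the optimized baseline can be called through the standard CBLAS interface. A minimal sketch, assuming an MKL (or other BLAS) installation that provides cblas_dgemm; the header name and link flags depend on the setup:

```c
#include <cblas.h>   /* with Intel MKL, <mkl.h> exposes the same CBLAS interface */

/* C = 1.0 * A * B + 0.0 * C for n x n row-major matrices,
 * delegating the work to the optimized DGEMM routine. */
void matmul_dgemm(int n, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,       /* M, N, K       */
                1.0, A, n,     /* alpha, A, lda */
                B, n,          /* B, ldb        */
                0.0, C, n);    /* beta, C, ldc  */
}
```

Timing both kernels on identical inputs and dividing the measured GFlop/s by the theoretical peak gives the efficiency figures discussed above.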
Peak Performance
- Definition:
Theoretical maximum performance under ideal conditions, based on hardware specifications.
- Calculation Example for SuperMUC Phase I:
- AVX Registers: 256-bit (4 double-precision numbers)
- Floating-Point Throughput: 8 FLOPs per cycle (one 4-wide AVX add and one 4-wide AVX multiply issued per cycle)
- Peak Performance per Core: 18.4 GFlop/s
- Node Peak Performance: ≈ 294 GFlop/s (16 cores)
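The numbers above follow directly from the listed specifications; as a worked equation:

```latex
% peak = clock rate x flops per cycle (x number of cores per node)
P_{\text{core}} = 2.3\,\text{GHz} \times 8\,\tfrac{\text{FLOPs}}{\text{cycle}} = 18.4\ \text{GFlop/s}
P_{\text{node}} = 16 \times 18.4\ \text{GFlop/s} \approx 294\ \text{GFlop/s}
```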
Performance Analysis
- Optimized vs. Naive Implementation:
- Optimized (DGEMM): Approaches peak performance.
- Naive: Significantly lower due to inefficiencies in memory access and lack of optimization.
- Conclusion:
Optimized algorithms and libraries are essential to harness the full potential of hardware.
Instruction Set Architecture (ISA)
- Role of ISA:
Defines the hardware-software interface, ensuring portability and longevity.
- Key Properties of a Good ISA:
- Longevity across generations
- Generality and expressivity
- Convenience for higher-level functionalities
- Efficient low-level implementation
- RISC vs. CISC:
- RISC: Small set of simple, uniform instructions; easier to pipeline and execute efficiently.
- CISC: Larger set of complex instructions; potentially more expressive, but harder to implement efficiently.
Parallelism in Single-Core Processors
- Types of Parallelism:
- Instruction Level Parallelism (ILP): Execute multiple instructions simultaneously.
- Techniques: Pipelining, superscalar execution, out-of-order (OOO) execution.
- Data Level Parallelism (DLP): Perform the same operation on multiple data elements.
- Techniques: SIMD instructions, vector processing (see the SIMD sketch after this list).
- Thread Level Parallelism (TLP): Execute multiple threads concurrently.
- Techniques: Simultaneous Multi-Threading (SMT), Hyper-Threading.
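As an illustration of DLP, a minimal SIMD sketch using AVX intrinsics, processing four double-precision values per 256-bit register (it assumes AVX support and that n is a multiple of 4; compile with -mavx or equivalent):

```c
#include <immintrin.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i], four doubles per iteration with 256-bit AVX.
 * A real kernel would also handle the remainder when n % 4 != 0. */
void daxpy_avx(size_t n, double a, const double *x, double *y)
{
    __m256d va = _mm256_set1_pd(a);                       /* broadcast a         */
    for (size_t i = 0; i < n; i += 4) {
        __m256d vx = _mm256_loadu_pd(&x[i]);              /* load 4 doubles of x */
        __m256d vy = _mm256_loadu_pd(&y[i]);              /* load 4 doubles of y */
        vy = _mm256_add_pd(_mm256_mul_pd(va, vx), vy);    /* a * x + y           */
        _mm256_storeu_pd(&y[i], vy);                      /* store 4 results     */
    }
}
```

Compilers can often generate equivalent code automatically (auto-vectorization) from a plain scalar loop.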
- ILP Techniques (see the sketch after this list):
- Pipelining: Overlapping instruction execution stages.
- Superscalar: Multiple instructions per cycle.
- OOO Execution: Reordering instructions to avoid stalls.
- VLIW/EPIC: Compiler-managed parallelism.
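A minimal sketch of how exposing independent operations helps pipelined, superscalar, out-of-order hardware: splitting a reduction into several independent partial sums shortens the dependency chain (the choice of four accumulators and the assumption that n is a multiple of 4 are illustrative):

```c
#include <stddef.h>

/* Sum reduction with four independent accumulators. Because the four
 * additions in each iteration do not depend on one another, the core
 * can keep several of them in flight at once instead of waiting for
 * one long chain of dependent adds. */
double sum_ilp(size_t n, const double *x)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```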
Memory System and Caches
- Memory Hierarchy:
- Registers: Fastest, smallest.
- L1 Cache: Fast, limited size.
- L2 Cache: Slower than L1, larger.
- Main Memory (DRAM): Slower, much larger.
- Secondary Storage: Very slow, very large.
- Memory Wall:
- Gap between CPU speed and memory access speed.
- Implications: Memory latency becomes a significant performance bottleneck.
- Locality:
- Temporal Locality: Reuse of recently accessed data.
- Spatial Locality: Access to data located near recently accessed data (see the access-pattern sketch at the end of this section).
- Cache Miss Types:
- Compulsory Miss: First access to data.
- Capacity Miss: Data not in cache due to limited size.
- Conflict Miss: Data evicted because multiple addresses map to the same cache set, even when the cache is not full.
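A minimal sketch illustrating spatial locality, assuming a row-major n × n matrix: traversing it row by row touches consecutive addresses (stride 1, cache friendly), while traversing it column by column jumps n elements per access and misses far more often for large n:

```c
#include <stddef.h>

/* Row-major traversal: consecutive elements share cache lines,
 * so spatial locality is exploited. */
double sum_rowwise(size_t n, const double *A)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            s += A[i * n + j];      /* stride-1 accesses */
    return s;
}

/* Column-wise traversal of the same row-major array: each access is
 * n elements apart, so for large n nearly every access can miss. */
double sum_columnwise(size_t n, const double *A)
{
    double s = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            s += A[i * n + j];      /* stride-n accesses */
    return s;
}
```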
Case Study: Matrix Multiplication Performance
- Naive Matrix-Multiply:
- Flops: f = 2n³
- Memory Accesses: m = n³ + 3n² (each column of B read n times, A read once, C read and written once)
- Computational Intensity: q = f / m ≈ 2 for large n
- Blocked (Tiled) Matrix-Multiply:
- Approach: Divide matrices into b × b sub-blocks that fit into cache (see the sketch at the end of this section).
- Flops: f = 2n³
- Memory Accesses: m ≈ (2N + 2)·n², where N = n/b is the number of blocks per dimension
- Computational Intensity: q = f / m ≈ b for large n
- Benefit: A higher q leads to better performance by reducing the impact of memory latency.
- Optimization Goal:
Maximize computational intensity while fitting data into fast memory (cache).
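A minimal sketch of the blocked multiply described above, assuming row-major n × n matrices, a caller-zeroed C, and a block size b chosen so that three b × b tiles fit in cache (names and edge handling are illustrative):

```c
#include <stddef.h>

static inline size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Blocked matrix multiplication C += A * B for row-major n x n matrices.
 * Each (ii, jj, kk) step works on b x b tiles, so the working set stays
 * cache resident and every loaded element is reused about b times
 * (computational intensity q on the order of b). */
void matmul_blocked(size_t n, size_t b,
                    const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += b)
        for (size_t jj = 0; jj < n; jj += b)
            for (size_t kk = 0; kk < n; kk += b)
                for (size_t i = ii; i < min_sz(ii + b, n); i++)
                    for (size_t j = jj; j < min_sz(jj + b, n); j++) {
                        double sum = C[i * n + j];
                        for (size_t k = kk; k < min_sz(kk + b, n); k++)
                            sum += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = sum;
                    }
}
```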
Summary
- Parallelism Forms:
- Exists within single-core CPUs and across multiple cores.
- Includes ILP, DLP, and TLP.
- Performance Factors:
- Instruction count, CPI (cycles per instruction), and clock rate.
- Memory hierarchy and cache performance critically impact real-world performance.
- Optimization Techniques:
- Utilize optimized libraries (e.g., MKL) for critical operations.
- Implement algorithms with high computational intensity and good cache locality.
- Future Directions:
- Enhanced parallel programming models.
- Continued hardware advancements to bridge memory and compute performance gaps.
Course Plan
Topics and Schedule
- 17 Oct 2024: Introduction and Motivation
- 24 Oct 2024: Overview of Computer Architecture and Basics of Parallel Computing
- Feb 2025: Q&A / Review Session
- Future Topics:
- Parallel Algorithms
- Parallel Architectures
- Parallel Programming Techniques
- Shared Memory Systems, Threading, Parallel Linear Algebra
- Distributed Memory Systems, Message Passing, FFT (Fast Fourier Transform)
- Data Parallelism, GPUs (Graphics Processing Units), Map-Reduce, Parallel Sorting
Exams
- Written Exam:
Details to be announced. Comprehensive coverage of all course topics, including theoretical concepts and practical applications in parallel computing.
Important Milestones
- Introduction and Motivation: Understanding the significance and impact of parallel computing.
- Computer Architecture: Grasping hardware components and their roles in parallel processing.
- Parallel Programming: Gaining practical experience with parallel programming models and techniques.
- Application Projects: Applying concepts to real-world problems to reinforce learning.
Additional Resources
Articles and Websites
- IEEE Spectrum: Next Generation Supercomputers
- Top500 Project: www.top500.org
- ExascaleDay: exascaleday.ornl.gov
- Parallel Programming Article: Computerworld
Books
- Computer Architecture: A Quantitative Approach by John L. Hennessy and David A. Patterson
- Parallel Programming in C with MPI and OpenMP by Quinn
Online Courses
- Coursera: Parallel, Concurrent, and Distributed Programming in Java
- edX: Introduction to Parallel Programming
- YouTube: Parallel Computing Tutorials
Tools and Software
- MPI (Message Passing Interface): MPI Forum
- OpenMP: OpenMP Official Website
- CUDA: NVIDIA CUDA Toolkit
- Profiling Tools: Valgrind, Intel VTune, GNU gprof
Exercise
Question
Why does the performance of supercomputers grow faster than predicted by Moore’s Law?
Answer
Supercomputer performance surpasses Moore’s Law predictions primarily due to extensive use of parallelism. While Moore’s Law focuses on the doubling of transistors, parallel processing leverages multiple cores and processors to perform simultaneous computations, resulting in a more substantial overall performance increase.
Detailed Explanation:
- Parallelism: Supercomputers utilize millions of cores working concurrently, multiplying the effective computing power beyond single-core improvements.
- Efficient Algorithms: Advanced parallel algorithms and optimized software ensure that multiple cores are effectively utilized, maximizing performance gains.
- Specialized Hardware: Incorporation of GPUs and other accelerators enhances computational capabilities, handling specific parallel tasks more efficiently.
- Scalable Architectures: Modern supercomputers are designed to scale, so aggregate performance keeps growing as more cores and nodes are added.
Summary:
Parallelism complements transistor scaling, enabling supercomputers to achieve performance growth that outpaces Moore’s Law by effectively utilizing multiple processing units simultaneously.
Formulas Summary
1. Dynamic Power Consumption
P = C · V² · f
- P: Power consumption
- V: Supply voltage
- f: Clock frequency
- C: Capacitance (proportional to the number of transistors switching)
Explanation:
Power consumption increases with the square of the voltage, directly with frequency, and linearly with capacitance. This highlights the inefficiency of simply increasing clock speeds.
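A worked illustration with hypothetical values: scaling both supply voltage and frequency down to 85% of their original values cuts dynamic power to roughly 61% of the original, which is why several slower cores can outperform one fast core within the same power budget:

```latex
P' = C\,(0.85V)^2\,(0.85f) = 0.85^3 \cdot C V^2 f \approx 0.61\,P
```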
2. Amdahl’s Law
Speedup = 1 / ((1 − P) + P / N)
- P: Fraction of the program that can be parallelized
- N: Number of processors
- Speedup: Theoretical maximum speedup achieved by parallelizing the program
Explanation:
Amdahl’s Law calculates the maximum possible speedup for a program based on the proportion that can be parallelized. It shows that the speedup is limited by the serial portion of the program.
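A worked example with hypothetical values: a program that is 90% parallelizable (P = 0.9) running on N = 16 processors:

```latex
\text{Speedup} = \frac{1}{(1 - 0.9) + \frac{0.9}{16}}
             = \frac{1}{0.1 + 0.05625} = 6.4
```

Even with infinitely many processors, the speedup is bounded by 1 / (1 − P) = 10.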
3. Computational Intensity (q)
q = f / m
- f: Number of floating-point operations
- m: Number of memory accesses
Explanation:
Computational intensity measures the ratio of computations to memory operations. A higher q indicates better performance by reducing the impact of memory latency.
4. Performance Equation
CPU Time = Instructions × CPI × Cycle Time
- Instructions: Total number of instructions executed
- CPI: Cycles per instruction
- Cycle Time: Time per clock cycle
Explanation:
This equation relates the CPU time to the number of instructions, their execution efficiency, and the clock speed of the processor.
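A worked example with hypothetical values: 10⁹ executed instructions at an average CPI of 1.5 on a 2.3 GHz core (cycle time ≈ 0.435 ns):

```latex
\text{CPU Time} = 10^{9} \times 1.5 \times \frac{1}{2.3 \times 10^{9}\,\text{Hz}} \approx 0.65\ \text{s}
```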