High Performance Computing for Engineers David Thomas dt10@ic.ac.uk Room 903 HPCE / dt10/ 2013 / 0.1
High Performance Computing for Engineers
Research
  Testing communication protocols
  Evaluating signal-processing filters
  Simulating analogue and digital designs
Tools
  CAD tools: synthesis, place-and-route, verification
  Libraries/toolboxes: filter design, compressive sensing
Products
  Oil exploration and discovery
  Mobile-phone apps
  Financial computing
HPCE / dt10/ 2013 / 0.2
High Performance Computing for Engineers
Types of performance metrics
  Throughput
  Latency
  Power
  Design-time
  Capital and running costs
Required versus desired performance
  Subject to a throughput of X, minimise average power
  Subject to a budget of Y, maximise energy efficiency
  Subject to Z development days, maximise throughput
HPCE / dt10/ 2013 / 0.3
What is available to you
Types of compute device
  Multi-core CPUs
  GPUs (Graphics Processing Units)
  MPPAs (Massively Parallel Processor Arrays)
  FPGAs (Field Programmable Gate Arrays)
Types of compute system
  Embedded Systems
  Mobile Phones
  Tablets
  Laptops
  Grid computing
  Cloud computing
HPCE / dt10/ 2013 / 0.4
2013 : HTC Droid DNA
Snapdragon S4 Pro
  CPU : Quad-core Krait (ARM derivative)
  GPU : Adreno 320 GPU (OpenCL compatible)
Images Copyright HTC and Qualcomm
HPCE / dt10/ 2013 / 0.5
2013 : Lenovo Thinkpad Edge E525
AMD Fusion A8-3500M
  CPU : Quad-Core 2.4GHz Phenom-II
  GPU : HD 6620G 400MHz (320 cores)
Img: http://laptops-specs.blogspot.com/2011/09/lenovo-thinkpad-edge-e525-specs.html, http://www.techradar.com/images/zoom/amd-llano-965315/index1
HPCE / dt10/ 2013 / 0.6
2013 : Imperial HPC Cluster
cx2 - SGI Altix ICE 8200 EX
  Racks and racks of high-performance PCs
  3000+ x64 cores running at 3GHz
  Available to researchers and undergrads (if they ask nicely)
Grid-management system
  Run a program on 1000 PCs with one command
HPCE / dt10/ 2013 / 0.7
Performance and Efficiency
[Charts: performance and power efficiency of MPPA, FPGA and GPU relative to CPU, for Uniform, Gaussian and Exponential benchmarks plus geometric mean; power efficiency peaks at 345x]
HPCE / dt10/ 2013 / 0.8
Design tradeoffs
[Chart: performance (1x to 1000x) versus design-time (1 hour to 1 month); series: Sequential SW]
HPCE / dt10/ 2013 / 0.9
Design tradeoffs
[Chart: performance versus design-time; series: Sequential SW, Thread-based SW]
HPCE / dt10/ 2013 / 0.10
Design tradeoffs
Task-based parallelism vs threads
  Easy to program (less time coding)
  Easy to get right (less time testing)
  Many implementations and APIs
    Intel Threading Building Blocks (TBB)
    Microsoft .NET Task Parallel Library
    OpenCL
[Chart: performance versus design-time; series: Sequential SW, Task-based SW, Thread-based SW]
HPCE / dt10/ 2013 / 0.12
Design tradeoffs
[Chart: performance versus design-time; series: Sequential SW, Task-based SW, Thread-based SW]
HPCE / dt10/ 2013 / 0.13
Design tradeoffs
[Chart: performance versus design-time; series: Sequential SW, Task-based SW, Thread-based SW, GPU]
Src: NVIDIA CUDA Compute Unified Device Architecture, Programmers Guide
HPCE / dt10/ 2013 / 0.14
Design tradeoffs
[Chart: performance versus design-time; series: Sequential SW, Task-based SW, Thread-based SW, GPU, FPGA]
HPCE / dt10/ 2013 / 0.16
What will you learn
  Systems: what high-performance systems do you have
  Methods: how can these systems be programmed
  Practice: concrete experience with multi-core and GPUs
  Analysis: knowing what to use and when
HPCE / dt10/ 2013 / 0.18
What you won't learn
Multi-threaded programming
  PThreads, Windows threads, mutexes, spin-locks, ...
  We'll look at the concepts and hardware, but ignore the practice
  Not needed when using modern task-based methods
OpenMP
  API for parallelising for-loops in C/C++
  Old technology, not very user-friendly
  Doesn't map nicely to architectures such as GPUs
  We'll use modern techniques such as TBB and CUDA/OpenCL
MPI (Message Passing Interface)
  Point-to-point communication between nodes over a network
  Important, but very specialised: an entire course by itself
  This course only considers common non-specialist systems
HPCE / dt10/ 2013 / 0.19
Structure of the course
Exam (50%) + two practical courseworks (50%)
Task-based project using Intel Threading Building Blocks
  Simple and robust framework for task-level parallelism
  Highly portable source: Linux, Windows, POSIX
GPU-based project using CUDA or OpenCL
  If you have a GPU in your laptop, use that
  Lab machines have GPUs compatible with CUDA
HPCE / dt10/ 2013 / 0.20
Skills needed
Basic programming
  If you can't program in _any_ language then worry
Intel TBB uses C++ rather than C
  Some weird C++ stuff, but not scary: explained in lectures
  Working examples given and explained
  Templates given as starting point for project work
GPU programming uses CUDA or OpenCL (both C-like)
  Lets you use whatever graphics card you happen to have
  Working examples, explained in lectures
  Not expected to become a guru, just make it faster
HPCE / dt10/ 2013 / 0.21
Key Focus: Engineering
How does this apply to you?
Examples from Elec. Eng. problems
  Mathematical analysis
  Simulation of digital circuits
  VLSI circuit layout
  Communication channel evaluation
Tools and languages used in EE
  C++
  MATLAB
HPCE / dt10/ 2013 / 0.22
Simple example : Totient function
Euler's totient function: totient(n)
  Number of integers in range 1..n which are relatively prime to n
  Integers i and j are relatively prime if gcd(i,j)=1
  Totient not included in MATLAB
HPCE / dt10/ 2013 / 0.23
Version 0 : Simple loop
Euler's totient function: totient(n)
  Number of integers in range 1..n which are relatively prime to n
  Not included in MATLAB
  Integers i and j are relatively prime if gcd(i,j)=1

  function [res]=totient_v0(n)
    res=0;
    for i=1:n             % Loop over all numbers in 1..n
      if gcd(i,n)==1      % Check if relatively prime
        res=res+1;        % Count any that are
      end
    end

HPCE / dt10/ 2013 / 0.24
Version 1 : Vectorising
Convert loops into vector operations
  Standard MATLAB optimisation
  Actually a way of making parallelism explicit

  function [res]=totient_v1(n)
    numbers=1:n;                   % Generate all numbers in 1..n
    gcd_res=(gcd(numbers,n)==1);   % Perform GCD on all numbers
    res=sum(gcd_res);              % Count all relatively prime numbers

HPCE / dt10/ 2013 / 0.25
Version 2 : Parallel for loop MATLAB supports a parfor command Each loop iteration is/may be executed in parallel Can operate on multiple cores, and even multiple machines HPCE / dt10/ 2013 / 0.26
Version 2 : Parallel for loop
MATLAB supports a parfor command
  Each loop iteration is/may be executed in parallel
  Can operate on multiple cores, and even multiple machines

  function [res]=totient_v2(n)
    res=0;
    parfor i=1:n          % Loop over all numbers in 1..n
      if gcd(i,n)==1      % Check if relatively prime
        res=res+1;        % Count any that are
      end
    end

HPCE / dt10/ 2013 / 0.27
Version 3 : Agglomeration
Too much overhead with current parallel loop
  Each parallel iteration has a cost due to scheduling
  Process space in chunks, using smaller vectors

  function [res]=totient_v3(n, step)
    if nargin<2
      step=1000;              % How large each chunk should be
    end
    res=0;
    parfor i=1:ceil(n/step)   % Loop over each chunk (ceil, so the
                              % final partial chunk is not dropped)
      % Then process each chunk as a vector
      numbers=(i-1)*step+1:min(i*step,n);
      rel_prime=(gcd(numbers,n)==1);
      res=res+sum(rel_prime);
    end

HPCE / dt10/ 2013 / 0.28
Results from my dual-core laptop
[Chart: run time against n (up to 2.5x10^5) for v0: For Loop, v1: Vectorised, v2: ParFor Loop, v3: ParFor Chunked]
HPCE / dt10/ 2013 / 0.29