High Performance Computing for Engineers

David Thomas
dt10@ic.ac.uk
Room 903
HPCE / dt10 / 2012 / 0.1

High Performance Computing for Engineers
- Research
  - Testing communication protocols
  - Evaluating signal-processing filters
  - Simulating analogue and digital designs
- Tools
  - CAD tools: synthesis, place-and-route, verification
  - Libraries/toolboxes: filter design, compressive sensing
- Products
  - Oil exploration and discovery
  - Mobile-phone apps
  - Financial computing

High Performance Computing for Engineers
- Types of performance metrics
  - Throughput
  - Latency
  - Power
  - Design-time
  - Capital and running costs
- Required versus desired performance
  - Subject to a throughput of X, minimise average power
  - Subject to a budget of Y, maximise energy efficiency
  - Subject to Z development days, maximise throughput

What is available to you
- Types of compute device
  - Multi-core CPUs
  - GPUs (Graphics Processing Units)
  - MPPAs (Massively Parallel Processor Arrays)
  - FPGAs (Field Programmable Gate Arrays)
- Types of compute system
  - Embedded systems
  - Mobile phones
  - Tablets
  - Laptops
  - Grid computing
  - Cloud computing

2012 : LG Optimus 2X
NVIDIA Tegra 2
- CPU : Dual-core ARM Cortex A9
- GPU : ULP GeForce (8 cores)
Imgs: http://www.techradar.com/reviews/phones/mobile-phones/lg-optimus-2x-929388/review, http://www.anandtech.com/show/2911

2012 : Lenovo Thinkpad Edge E525
AMD Fusion A8-3500M
- CPU : Quad-Core 2.4GHz Phenom-II
- GPU : HD 6620G 400MHz (320 cores)
Imgs: http://laptops-specs.blogspot.com/2011/09/lenovo-thinkpad-edge-e525-specs.html, http://www.techradar.com/images/zoom/amd-llano-965315/index1

2012 : Imperial HPC Cluster
cx2 - SGI Altix ICE 8200 EX
- Racks and racks of high-performance PCs
- 3000+ x64 cores running at 3GHz
- Available to researchers and undergrads (if they ask nicely)
- Grid-management system
  - Run program on 1000 PCs with one command

Performance and Efficiency Relative to CPU
[Figure: two bar charts, "Performance" and "Power Efficiency", each relative to CPU. Series: GPU, MPPA, FPGA. Categories: Uniform, Gaussian, Exponential, Mean (Geo).]

Design tradeoffs

Design tradeoffs
- Task-based parallelism vs threads
  - Easy to program (less time coding)
  - Easy to get right (less time testing)
- Many implementations and APIs
  - Intel Threading Building Blocks (TBB)
  - Microsoft .NET Task Parallel Library
  - OpenCL

Design tradeoffs
Src: NVIDIA CUDA Compute Unified Device Architecture, Programming Guide

What will you learn
- Systems: what high-performance systems do you have
- Methods: how can these systems be programmed
- Practice: concrete experience with multi-core CPUs and GPUs
- Analysis: knowing what to use and when

What you won't learn
- Multi-threaded programming
  - PThreads, Windows threads, mutexes, spin-locks, ...
  - We'll look at the concepts and hardware, but ignore the practice
  - Not needed when using modern task-based methods
- OpenMP
  - API for parallelising for-loops in C/C++
  - Old technology, not very user-friendly
  - Doesn't map nicely to architectures such as GPUs
  - We'll use modern techniques such as TBB and CUDA/OpenCL
- MPI (Message Passing Interface)
  - Point-to-point communication between networks
  - Important, but very specialised: entire course by itself
  - This course only considers common non-specialist systems

Structure of the course
- Exam (50%) + two practical courseworks (50%)
- Task-based project using Intel Threading Building Blocks
  - Simple and robust framework for task-level parallelism
  - Highly portable: Linux, Windows, POSIX source
- GPU-based project using CUDA or OpenCL
  - If you have a GPU in your laptop, use that
  - Certain lab machines have GPUs compatible with CUDA
  - Will also explore using OpenCL to target both CPUs and GPUs

Skills needed
- Basic programming
  - If you can't program in _any_ language then worry
- Intel TBB uses C++ rather than C
  - Some weird C++ stuff, but not scary: explained in lectures
  - Working examples given and explained
  - Templates given as starting point for project work
- GPU programming uses CUDA or OpenCL (both C-like)
  - Lets you use whatever graphics card you happen to have
  - Working examples, explained in lectures
  - Template as starting point for project work
- Not expected to become a guru, just make it faster

Key Focus: Engineering
- How does this apply to you?
- Examples from Elec. Eng. problems
  - Mathematical analysis
  - Simulation of digital circuits
  - VLSI circuit layout
  - Communication channel evaluation
  - (Fractal zoomers)
- Tools and languages used in EE
  - C
  - MATLAB
  - qsub (Imperial HPC cluster)

Simple example : Totient function
- Euler's totient function: totient(n)
  - Number of integers in range 1..n which are relatively prime to n
  - Integers i and j are relatively prime if gcd(i,j)=1
- Totient not included in MATLAB
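To make the definition concrete, here is a minimal sketch that lists the integers relatively prime to a small n and counts them. The value n = 12 and the variable names are illustrative choices, not taken from the slides.

```matlab
% Worked instance of the definition for a small, hand-checkable n.
n = 12;                                          % illustrative value (assumption)
candidates = 1:n;                                % all integers in 1..n
coprime = candidates(gcd(candidates, n) == 1);   % keep those with gcd(i,n)==1

disp(coprime);                                   % prints 1 5 7 11
fprintf('totient(%d) = %d\n', n, numel(coprime)); % prints totient(12) = 4
```

Only 1, 5, 7 and 11 share no factor with 12, so totient(12) is 4.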

Version 0 : Simple loop

    function [res]=totient_v0(n)
    res=0;
    for i=1:n            % Loop over all numbers in 1..n
        if gcd(i,n)==1   % Check if relatively prime
            res=res+1;   % Count any that are
        end
    end

Version 1 : Vectorising
- Convert loops into vector operations
  - Standard MATLAB optimisation
  - Actually a way of making parallelism explicit

    function [res]=totient_v1(n)
    numbers=1:n;                   % Generate all numbers in 1..n
    gcd_res=(gcd(numbers,n)==1);   % Perform GCD on all numbers; true where relatively prime
    res=sum(gcd_res);              % Count all relatively prime numbers

Version 2 : Parallel for loop
- MATLAB supports a parfor command
  - Each loop iteration may be executed in parallel
  - Can operate on multiple cores, and even multiple machines

    function [res]=totient_v2(n)
    res=0;
    parfor i=1:n         % Loop over all numbers in 1..n
        if gcd(i,n)==1   % Check if relatively prime
            res=res+1;   % Count any that are
        end
    end

Version 3 : Agglomeration
- Too much overhead with current parallel loop
  - Each parallel iteration has a cost due to scheduling
- Process space in chunks, using smaller vectors

    function [res]=totient_v3(n, step)
    if nargin<2          % How large each chunk should be
        step=1000;
    end
    res=0;
    % Loop over each chunk; ceil so that a final partial chunk is not skipped
    parfor i=1:ceil(n/step)
        % Then process each chunk as a vector
        numbers=(i-1)*step+1:min(i*step,n);
        rel_prime=(gcd(numbers,n)==1);
        res=res+sum(rel_prime);
    end
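Chunking is easy to get wrong at the boundaries, so a quick check against the vectorised result is worthwhile. The sketch below is an assumption-laden illustration rather than the course's code: it uses a plain for loop instead of parfor so that it runs without the Parallel Computing Toolbox, it takes the number of chunks as ceil(n/step) so that a final partial chunk is covered, and the values n = 2500 and step = 1000 are chosen only because n is not a multiple of step.

```matlab
% Boundary check for chunked processing (plain for instead of parfor).
n = 2500;                      % illustrative size, deliberately not a multiple of step
step = 1000;                   % illustrative chunk size
num_chunks = ceil(n / step);   % 3 chunks: 1..1000, 1001..2000, 2001..2500

res = 0;
for i = 1:num_chunks
    numbers = (i-1)*step+1 : min(i*step, n);   % this chunk, clipped to n
    res = res + sum(gcd(numbers, n) == 1);     % count relatively prime values
end

expected = sum(gcd(1:n, n) == 1);              % single-vector reference result
assert(res == expected);
fprintf('chunked = %d, reference = %d\n', res, expected);
```

With these values the last chunk holds only 500 numbers, which is the case where min(i*step,n) matters.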

Results from my dual-core laptop
[Figure: plot comparing v0: For Loop, v1: Vectorised, v2: ParFor Loop and v3: ParFor Chunked. Horizontal axis runs from 0 to 2.5 x 10^5, vertical axis from 0 to 8; axis labels not recoverable.]
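A comparison like this can be reproduced roughly with tic and toc. The sketch below is only a minimal illustration under stated assumptions: it times inline equivalents of the loop and vectorised versions (not the parfor ones, which need the Parallel Computing Toolbox), and the problem size n = 50000 is an arbitrary choice. Timings will vary between machines and MATLAB versions.

```matlab
% Rough timing sketch using tic/toc; n is an arbitrary illustrative size.
n = 50000;

% Loop version (same logic as the simple for-loop approach)
tic;
res_loop = 0;
for i = 1:n
    if gcd(i, n) == 1
        res_loop = res_loop + 1;
    end
end
t_loop = toc;

% Vectorised version (one gcd call over the whole range)
tic;
res_vec = sum(gcd(1:n, n) == 1);
t_vec = toc;

assert(res_loop == res_vec);   % both approaches must agree before comparing times
fprintf('loop: %.3f s, vectorised: %.3f s\n', t_loop, t_vec);
```

Checking that the results agree before comparing times avoids benchmarking a version that is fast only because it is wrong.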

Questions?