High Performance Computing for Engineers

David Thomas
dt10@ic.ac.uk
Room 903
HPCE / dt10 / 2013 / 0.1

High Performance Computing for Engineers
Research
- Testing communication protocols
- Evaluating signal-processing filters
- Simulating analogue and digital designs
Tools
- CAD tools: synthesis, place-and-route, verification
- Libraries/toolboxes: filter design, compressive sensing
Products
- Oil exploration and discovery
- Mobile-phone apps
- Financial computing

High Performance Computing for Engineers
Types of performance metrics
- Throughput
- Latency
- Power
- Design-time
- Capital and running costs
Required versus desired performance
- Subject to a throughput of X, minimise average power
- Subject to a budget of Y, maximise energy efficiency
- Subject to Z development days, maximise throughput

What is available to you
Types of compute device
- Multi-core CPUs
- GPUs (Graphics Processing Units)
- MPPAs (Massively Parallel Processor Arrays)
- FPGAs (Field Programmable Gate Arrays)
Types of compute system
- Embedded systems
- Mobile phones
- Tablets
- Laptops
- Grid computing
- Cloud computing

2013 : HTC Droid DNA
Snapdragon S4 Pro
- CPU : Quad-core Krait (ARM derivative)
- GPU : Adreno 320 (OpenCL compatible)
Images Copyright HTC and Qualcomm

2013 : Lenovo Thinkpad Edge E525
AMD Fusion A8-3500M
- CPU : Quad-Core 2.4GHz Phenom-II
- GPU : HD 6620G 400MHz (320 cores)
Img: http://laptops-specs.blogspot.com/2011/09/lenovo-thinkpad-edge-e525-specs.html, http://www.techradar.com/images/zoom/amd-llano-965315/index1

2013 : Imperial HPC Cluster
cx2 - SGI Altix ICE 8200 EX
- Racks and racks of high-performance PCs
- 3000+ x64 cores running at 3GHz
- Available to researchers and undergrads (if they ask nicely)
Grid-management system
- Run program on 1000 PCs with one command

Performance and Efficiency Relative to CPU
[Figure: two bar charts, titled Performance and Power Efficiency, comparing MPPA, FPGA, and GPU relative to a CPU for the categories Uniform, Gaussian, Exponential, and Mean (Geo).]

Design tradeoffs
[Figure: Performance (axis ticks 1, 10, 100, 1000) against Design-time (1 hour, 1 day, 1 week, 1 month). Successive slides add one implementation style at a time: Sequential SW, Thread-based SW, Task-based SW, GPU, FPGA.]

Task-based parallelism vs threads
- Easy to program (less time coding)
- Easy to get right (less time testing)
- Many implementations and APIs
  - Intel Threaded Building Blocks (TBB)
  - Microsoft .NET Task Parallel Library
  - OpenCL

Src (GPU slide): NVIDIA CUDA Compute Unified Device Architecture, Programmers Guide

What will you learn
- Systems: which high-performance systems you have
- Methods: how these systems can be programmed
- Practice: concrete experience with multi-core and GPUs
- Analysis: knowing what to use and when

What you won't learn
Multi-threaded programming
- PThreads, Windows threads, mutexes, spin-locks, ...
- We'll look at the concepts and hardware, but ignore the practice
- Not needed when using modern task-based methods
OpenMP
- API for parallelising for-loops in C/C++
- Old technology, not very user-friendly
- Doesn't map nicely to architectures such as GPUs
- We'll use modern techniques such as TBB and CUDA/OpenCL
MPI (Message Passing Interface)
- Point-to-point communication between networks
- Important, but very specialised: entire course by itself
- This course only considers common non-specialist systems

Structure of the course
Exam (50%) + two practical courseworks (50%)
Task-based project using Intel Threaded Building Blocks
- Simple and robust framework for task-level parallelism
- Highly portable: Linux, Windows, POSIX source
GPU-based project using CUDA or OpenCL
- If you have a GPU in your laptop, use that
- Lab machines have GPUs compatible with CUDA

Skills needed
Basic programming
- If you can't program in _any_ language then worry
Intel TBB uses C++ rather than C
- Some weird C++ stuff, but not scary: explained in lectures
- Working examples given and explained
- Templates given as starting point for project work
GPU programming uses CUDA or OpenCL (both C-like)
- Lets you use whatever graphics card you happen to have
- Working examples, explained in lectures
- Not expected to become a guru, just make it faster

Key Focus: Engineering
How does this apply to you?
Examples from Elec. Eng. problems
- Mathematical analysis
- Simulation of digital circuits
- VLSI circuit layout
- Communication channel evaluation
Tools and languages used in EE
- C++
- MATLAB

Simple example : Totient function
Euler's totient function: totient(n)
- Number of integers in range 1..n which are relatively prime to n
- Integers i and j are relatively prime if gcd(i,j)=1
- Totient not included in MATLAB
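As a quick check of the definition, the following minimal MATLAB sketch lists the integers in 1..n that are relatively prime to n and counts them. The value n = 12 and the variable names are chosen purely for illustration and do not come from the slides.

```matlab
n = 12;                                          % Illustrative value (assumption)
candidates = 1:n;                                % All integers in 1..n
coprime = candidates(gcd(candidates, n) == 1);   % Keep those with gcd(i,n)==1
count = numel(coprime);                          % This count is totient(n)
disp(coprime);                                   % 1 5 7 11
disp(count);                                     % 4
```

Only 1, 5, 7 and 11 share no factor with 12, so totient(12) is 4.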

Version 0 : Simple loop
Euler's totient function: totient(n)
- Number of integers in range 1..n which are relatively prime to n
- Not included in MATLAB
- Integers i and j are relatively prime if gcd(i,j)=1

function [res]=totient_v0(n)
res=0;
for i=1:n              % Loop over all numbers in 1..n
    if gcd(i,n)==1     % Check if relatively prime
        res=res+1;     % Count any that are
    end
end

Version 1 : Vectorising
- Convert loops into vector operations
- Standard MATLAB optimisation
- Actually a way of making parallelism explicit

function [res]=totient_v1(n)
numbers=1:n;                    % Generate all numbers in 1..n
gcd_res=(gcd(numbers,n)==1);    % Perform GCD on all numbers
res=sum(gcd_res==1);            % Count all relatively prime numbers
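To see that the vectorised form counts the same thing as the loop, the two can be run side by side. This is only a sketch: the anonymous function totient_vec and the input n = 1000 are assumptions made for illustration, not part of the course material.

```matlab
n = 1000;                                   % Illustrative input (assumption)

% Loop form: test each i in 1..n individually
loop_count = 0;
for i = 1:n
    if gcd(i, n) == 1
        loop_count = loop_count + 1;
    end
end

% Vectorised form: a single gcd call over the whole vector 1..n
totient_vec = @(m) sum(gcd(1:m, m) == 1);   % Hypothetical helper name
vec_count = totient_vec(n);

assert(loop_count == vec_count);            % Both forms must agree
fprintf('totient(%d) = %d\n', n, vec_count);
```

Both forms apply the same test to the same numbers; the vectorised one simply hands all n tests to gcd at once, which is what makes the parallelism explicit.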

Version 2 : Parallel for loop
MATLAB supports a parfor command
- Each loop iteration is/may be executed in parallel
- Can operate on multiple cores, and even multiple machines

function [res]=totient_v2(n)
res=0;
parfor i=1:n           % Loop over all numbers in 1..n
    if gcd(i,n)==1     % Check if relatively prime
        res=res+1;     % Count any that are
    end
end

Version 3 : Agglomeration
Too much overhead with current parallel loop
- Each parallel iteration has a cost due to scheduling
- Process space in chunks, using smaller vectors

function [res]=totient_v3(n, step)
if nargin<2            % How large each chunk should be
    step=1000;
end
res=0;
% Loop over each chunk; ceil includes the final partial chunk
% when n is not a multiple of step
parfor i=1:ceil(n/step)
    % Then process each chunk as a vector
    numbers=(i-1)*step+1:min(i*step,n);
    rel_prime=(gcd(numbers,n)==1);
    res=res+sum(rel_prime);
end
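The chunk arithmetic is easy to get wrong at the boundaries, so here is a small sketch that prints the range each chunk covers. It uses a plain for loop so it runs without any parallel toolbox, and the values n = 2500 and step = 1000 are assumptions chosen so that the last chunk is partial. Rounding the chunk count up with ceil matters whenever n is not a multiple of step; rounding down would drop the final partial chunk.

```matlab
n = 2500;                                   % Illustrative values (assumptions)
step = 1000;
num_chunks = ceil(n / step);                % 3 chunks: the last one is partial

total = 0;
for i = 1:num_chunks                        % Plain for-loop, no toolbox needed
    lo = (i-1)*step + 1;                    % First number in chunk i
    hi = min(i*step, n);                    % Last number, clipped to n
    fprintf('chunk %d covers %d..%d\n', i, lo, hi);
    total = total + sum(gcd(lo:hi, n) == 1);
end

% The chunked count must match the single-vector count
assert(total == sum(gcd(1:n, n) == 1));
```

The three chunks cover 1..1000, 1001..2000 and 2001..2500, so every number in 1..n is tested exactly once.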

Results from my dual-core laptop
[Figure: plot comparing v0: For Loop, v1: Vectorised, v2: ParFor Loop, and v3: ParFor Chunked; horizontal axis from 0 to 2.5 x 10^5, vertical axis from 0 to 8.]