
Asymmetric Multiprocessing for High Performance Computing

Stephen Jenks, Electrical Engineering and Computer Science, UC Irvine

sjenks@uci.edu

Overview and Motivation

Conventional Parallel Computing Was:
For scientists with large problem sets
Complex and difficult
Very expensive computers with limited access

Parallel Computing Is Becoming:
Ubiquitous
Cheap
Essential
Different
Still complex and difficult

Where Is Parallel Computing Going? And What are the Research Challenges?

Computing is Changing

Parallel programming is essential
Clock speed not increasing much
Performance gains require parallelism

Parallelism is changing
Special-purpose parallel engines
CPU and parallel engine work together
Different code on CPU & parallel engine
Asymmetric Computing


Conventional Processor Architecture

Hasn't Changed Much For 40 Years
Pipelining and Superscalar Since the 1960s
But Has Become Integrated: Microprocessors
High Clock Speed, Great Performance
High Power, Cooling Issues
Various Solutions

From Hennessy & Patterson, 2nd Ed.


Parallel Computing Problem Overview


newimage[i][j] = (image[i][j] + image[i][j-1] + image[i][j+1] + image[i+1][j] + image[i-1][j]) / 5

Image Relaxation (Blur)

Stencil
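As a concrete illustration (a minimal sketch, not taken from the slides), the relaxation stencil above can be written as a pair of nested loops over the interior of the image:

/* Minimal sketch: apply the 5-point relaxation (blur) stencil to the
   interior of an n x m image; boundary pixels are left unchanged. */
void relax(int n, int m, float image[n][m], float newimage[n][m])
{
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < m - 1; j++)
            newimage[i][j] = (image[i][j] + image[i][j-1] + image[i][j+1] +
                              image[i+1][j] + image[i-1][j]) / 5.0f;
}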


Shared Memory Multiprocessors


CPUs see common address space

[Diagram: four CPUs, each with its own cache, connected to a shared memory]

Each CPU computes results for its partition
Memory is shared, so dependences are satisfied

Shared Memory Programming

Threads
POSIX Threads (see the sketch below)
Windows Threads
Not Integrated with Compiler or Language
No idea if code is in thread or not
Poor optimizations

OpenMP (supported by compilers such as Visual Studio)
User-inserted directives to compiler
Loop parallelism
Parallel regions

#pragma omp parallel for
for (i = 0; i < n; i++)
    a[i] = b[i] * c[i];
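For comparison with the OpenMP version, here is a minimal POSIX Threads sketch (illustrative only, not from the slides): the loop is partitioned by hand, one chunk per thread.

/* Hypothetical sketch: partition a loop across POSIX threads by hand.
   Each thread computes its own chunk of the array. */
#include <pthread.h>
#include <stdio.h>

#define N        1000000
#define NTHREADS 4

static float a[N], b[N], c[N];

typedef struct { int begin, end; } range_t;

static void *worker(void *arg)
{
    range_t *r = (range_t *)arg;
    for (int i = r->begin; i < r->end; i++)
        a[i] = b[i] * c[i];
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    range_t   part[NTHREADS];
    int chunk = N / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) {
        part[t].begin = t * chunk;
        part[t].end   = (t == NTHREADS - 1) ? N : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, worker, &part[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    printf("done\n");
    return 0;
}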

Multicore Processors
Several CPU Cores Per Chip
Shared Memory
Shared Caches (sometimes)
Lower Clock Speed
Lower Power & Heat, But Good Performance

Program with Threads
Single-Threaded Code Not Faster, and That Is the Majority of Code Today

[Die photos on slide: Intel Core Duo, AMD Athlon 64 X2]

Conventional Processors are Dinosaurs

So much circuitry dedicated to keeping ALUs fed:


Cache
Out-of-order execution / reorder buffer
Branch prediction
Large register sets
Simultaneous multithreading

ALU tiny by comparison
Huge power for little performance gain

With thanks to Stanford's Pat Hanrahan for the analogy



AMD Phenom X4 Floorplan

Source: AMD.com

Single Nehalem (Intel I7) Core

Nehalem - Everything You Need to Know about Intel's New Architecture http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3382


NVIDIA GPU Floorplan

Source: Dr. Sumit Gupta - NVIDIA


Asymmetric Parallel Accelerators

Current Cores are Powerful and General
Pair General CPUs with Specialized Accelerators:
Graphics Processing Unit (GPU)
Field Programmable Gate Array (FPGA)
Single Instruction, Multiple Data (SIMD) Processor

Some Applications Only Need Certain Operations
Perhaps a Simpler Processor Could Be Faster

[Diagram: possible hybrid AMD multi-core design with an Athlon 64 CPU and an ATI GPU sharing a crossbar (XBAR), HyperTransport, and memory controller]



Graphics Processing Unit (GPU)

GPUs Do Pixel Pushing and Matrix Math

From NVIDIA CUDA Compute Unified Device Architecture Programming Guide 11/29/2007


CUDA Programming Model

Data Parallel
But not Loop Parallel
Very Lightweight Threads
Write Code from the Thread's Point of View
Blocks of Threads
Hundreds of Parallel Threads (Sort-of SIMD)

No Shared Memory Between Host and Device
Host Copies Data To and From Device
Different Memories
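A minimal, hypothetical CUDA sketch of this model (kernel name, sizes, and launch parameters are illustrative, not from the slides): the host allocates device memory, copies data over, launches a grid of thread blocks, and copies the result back.

/* Hypothetical CUDA sketch: each thread scales one array element. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void scale(float *d_a, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* this thread's element */
    if (i < n)
        d_a[i] *= factor;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) h_a[i] = 1.0f;

    float *d_a;
    cudaMalloc(&d_a, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);   /* host -> device */

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_a, 2.0f, n);       /* grid of blocks */

    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);    /* device -> host */
    printf("h_a[0] = %f\n", h_a[0]);

    cudaFree(d_a);
    free(h_a);
    return 0;
}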



Finite Difference, Time Domain (FDTD) Electromagnetic Simulation


/* update magnetic field */
for (i = 1; i < nx - 1; i++)
  for (j = 1; j < ny - 1; j++) {   /* combining Y loop */
    for (k = 1; k < nz - 1; k++) {
      ELEM_TYPE invmu = 1.0f / AR_INDEX(mu,i,j,k);
      ELEM_TYPE tmpx = rx * invmu;
      ELEM_TYPE tmpy = ry * invmu;
      ELEM_TYPE tmpz = rz * invmu;
      AR_INDEX(hx,i,j,k) += tmpz * (AR_INDEX(ey,i,j,k+1) - AR_INDEX(ey,i,j,k)) -
                            tmpy * (AR_INDEX(ez,i,j+1,k) - AR_INDEX(ez,i,j,k));
      AR_INDEX(hy,i,j,k) += tmpx * (AR_INDEX(ez,i+1,j,k) - AR_INDEX(ez,i,j,k)) -
                            tmpz * (AR_INDEX(ex,i,j,k+1) - AR_INDEX(ex,i,j,k));
      AR_INDEX(hz,i,j,k) += tmpy * (AR_INDEX(ex,i,j+1,k) - AR_INDEX(ex,i,j,k)) -
                            tmpx * (AR_INDEX(ey,i+1,j,k) - AR_INDEX(ey,i,j,k));
    }
    /* now update the electric field */
    for (k = 1; k < nz - 1; k++) {
      ELEM_TYPE invep = 1.0f / AR_INDEX(ep,i,j,k);
      ELEM_TYPE tmpx = rx * invep;
      ELEM_TYPE tmpy = ry * invep;
      ELEM_TYPE tmpz = rz * invep;
      AR_INDEX(ex,i,j,k) += tmpy * (AR_INDEX(hz,i,j,k) - AR_INDEX(hz,i,j-1,k)) -
                            tmpz * (AR_INDEX(hy,i,j,k) - AR_INDEX(hy,i,j,k-1));
      AR_INDEX(ey,i,j,k) += tmpz * (AR_INDEX(hx,i,j,k) - AR_INDEX(hx,i,j,k-1)) -
                            tmpx * (AR_INDEX(hz,i,j,k) - AR_INDEX(hz,i-1,j,k));
      AR_INDEX(ez,i,j,k) += tmpx * (AR_INDEX(hy,i,j,k) - AR_INDEX(hy,i-1,j,k)) -
                            tmpy * (AR_INDEX(hx,i,j,k) - AR_INDEX(hx,i,j-1,k));
    }
  }

Fields stored in 6 3D arrays: EX, EY, EZ, HX, HY, HZ, plus 2 coefficient arrays. O(N³) algorithm.
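The AR_INDEX macro is not defined on the slides; a plausible definition, assuming the fields are flat row-major arrays with dimensions nx x ny x nz, would be:

/* Assumed 3D indexing macro for a flat row-major array (not from the slides) */
#define AR_INDEX(a, i, j, k)  ((a)[((i) * ny + (j)) * nz + (k)])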

OpenMP FDTD
Because of dependences, the magnetic (H) and electric (E) field updates must be split into two separately parallelized loops:

/* update magnetic field */
#pragma omp parallel for private(i,j,k) \
        shared(hx,hy,hz,ex,ey,ez)
for (i = 1; i < nx - 1; i++)
  for (j = 1; j < ny - 1; j++) {
    for (k = 1; k < nz - 1; k++) {
      ELEM_TYPE invmu = 1.0f / AR_INDEX(mu,i,j,k);
      ELEM_TYPE tmpx = rx * invmu;
      ELEM_TYPE tmpy = ry * invmu;
      ELEM_TYPE tmpz = rz * invmu;
      AR_INDEX(hx,i,j,k) += tmpz * (AR_INDEX(ey,i,j,k+1) - AR_INDEX(ey,i,j,k)) -
                            tmpy * (AR_INDEX(ez,i,j+1,k) - AR_INDEX(ez,i,j,k));
      AR_INDEX(hy,i,j,k) += tmpx * (AR_INDEX(ez,i+1,j,k) - AR_INDEX(ez,i,j,k)) -
                            tmpz * (AR_INDEX(ex,i,j,k+1) - AR_INDEX(ex,i,j,k));
      AR_INDEX(hz,i,j,k) += tmpy * (AR_INDEX(ex,i,j+1,k) - AR_INDEX(ex,i,j,k)) -
                            tmpx * (AR_INDEX(ey,i+1,j,k) - AR_INDEX(ey,i,j,k));
    }
  }
(the electric field update is a second, similar parallel loop, not shown on the slide)

Code Style    Machine      Time (s)   Speedup
Sequential    Core2Quad    26.7       1
OpenMP        Core2Quad    8.8        3.0
Sequential    PS3          28.6       1
OpenMP        PS3          26.5       1.1

CUDA FDTD
Predefined variables provide the thread position within the block and the block position within the grid.

// Block index
int bx = blockIdx.x;
// because CUDA does not support 3D grids, we'll use the y block value to
// fake it. Unfortunately, this needs mod and divide (or a lookup table).
int by, bz;
by = blockIdx.y / gridZ;
bz = blockIdx.y % gridZ;

// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;
int tz = threadIdx.z;

// this thread's indices in the global matrix (skip outer boundaries)
int i = bx * THREAD_COUNT_X + tx + 1;
int j = by * THREAD_COUNT_Y + ty + 1;
int k = bz * THREAD_COUNT_Z + tz + 1;

At this point, the thread knows where it maps in problem space.
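A hypothetical host-side launch configuration that matches this indexing scheme (THREAD_COUNT_X/Y/Z, the grid variables, the kernel name, and its arguments are assumptions, not from the slides): the real y and z block counts are packed into the grid's y dimension, and the kernel unpacks them with the divide and mod shown above. gridZ must also be made available to the kernel, e.g. as a constant or argument.

// Hypothetical launch sketch: pack the y and z block counts into gridDim.y.
int gridX = (nx - 2) / THREAD_COUNT_X;
int gridY = (ny - 2) / THREAD_COUNT_Y;
int gridZ = (nz - 2) / THREAD_COUNT_Z;

dim3 threads(THREAD_COUNT_X, THREAD_COUNT_Y, THREAD_COUNT_Z);
dim3 grid(gridX, gridY * gridZ);   // kernel recovers by and bz from blockIdx.y

fdtd_kernel<<<grid, threads>>>(d_ex, d_ey, d_ez, d_hx, d_hy, d_hz, d_ep, d_mu);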



CUDA FDTD (cont.)


Code looks normal, but: only one iteration per thread (no enclosing loops).

float invep = 1.0f / AR_INDEX(d_ep,i,j,k);
float tmpx = d_rx * invep;
float tmpy = d_ry * invep;
float tmpz = d_rz * invep;
float* ex = d_ex;
float* ey = d_ey;
float* ez = d_ez;

AR_INDEX(ex,i,j,k) += tmpy * (AR_INDEX(d_hz,i,j,k) - AR_INDEX(d_hz,i,j-1,k)) -
                      tmpz * (AR_INDEX(d_hy,i,j,k) - AR_INDEX(d_hy,i,j,k-1));
AR_INDEX(ey,i,j,k) += tmpz * (AR_INDEX(d_hx,i,j,k) - AR_INDEX(d_hx,i,j,k-1)) -
                      tmpx * (AR_INDEX(d_hz,i,j,k) - AR_INDEX(d_hz,i-1,j,k));
AR_INDEX(ez,i,j,k) += tmpx * (AR_INDEX(d_hy,i,j,k) - AR_INDEX(d_hy,i-1,j,k)) -
                      tmpy * (AR_INDEX(d_hx,i,j,k) - AR_INDEX(d_hx,i,j-1,k));

Arrays reside in device global memory.

Machine            Code Style    Time (s)   Speedup
Core2Quad          Sequential    26.5       1
Core2Quad          OpenMP        8.8        3.0
GeForce 9800GX2    CUDA          7.4        3.6


Better CUDA FDTD


Problem: CUDA Memory Accesses Slow
Problem: Not Much Work Done Per Thread

Idea 1: Copy Data to Fast Shared Memory (see the sketch below)
Allows Reuse By Neighbor Threads
Takes Advantage of Coalesced Memory Accesses

Idea 2: Compute Multiple Elements Per Thread
But Each Block Only Gets 16KB Shared Space
Barely Enough for 2 Points at 1x16x16 Threads
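A hedged sketch of Idea 1 (not the slide's actual kernel, and simplified to a 2D averaging stencil rather than the FDTD update): each block stages a 16x16 tile of one field, plus a one-element halo, into __shared__ memory so neighboring threads reuse cached values instead of re-reading global memory. Array dimensions are assumed to divide evenly into tiles.

/* Hypothetical shared-memory tiling sketch (assumptions noted above). */
#define TILE 16

__global__ void stage_tile(const float *g_field, float *g_out, int nz)
{
    __shared__ float tile[TILE + 2][TILE + 2];

    int k = blockIdx.x * TILE + threadIdx.x + 1;   /* global indices (skip boundary) */
    int j = blockIdx.y * TILE + threadIdx.y + 1;
    int tk = threadIdx.x + 1;                      /* indices within the shared tile */
    int tj = threadIdx.y + 1;

    /* coalesced load of the tile interior: consecutive threads read consecutive k */
    tile[tj][tk] = g_field[j * nz + k];

    /* edge threads also load the halo cells they border */
    if (threadIdx.x == 0)        tile[tj][0]        = g_field[j * nz + (k - 1)];
    if (threadIdx.x == TILE - 1) tile[tj][TILE + 1] = g_field[j * nz + (k + 1)];
    if (threadIdx.y == 0)        tile[0][tk]        = g_field[(j - 1) * nz + k];
    if (threadIdx.y == TILE - 1) tile[TILE + 1][tk] = g_field[(j + 1) * nz + k];

    __syncthreads();   /* tile is now visible to every thread in the block */

    /* neighbors come from shared memory, not global memory */
    g_out[j * nz + k] = 0.25f * (tile[tj][tk - 1] + tile[tj][tk + 1] +
                                 tile[tj - 1][tk] + tile[tj + 1][tk]);
}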
Machine            Code Style      Time (s)   Speedup
Core2Quad          Sequential      26.5       1
Core2Quad          OpenMP          8.8        3.0
GeForce 9800GX2    CUDA            7.4        3.6
GeForce 9800GX2    CUDA (faster)   6.5        4.1

Cell Broadband Engine


PowerPC Processing Element (PPE) with Simultaneous Multithreading at 3.2 GHz
8 Synergistic Processing Elements (SPEs) at 3.2 GHz
Optimized for SIMD/Vector processing (100 GFLOPS total)
256KB Local Storage, no cache

Element Interconnect Bus: 4 x 16-byte-wide rings @ 96 bytes per clock cycle

From IBM Cell Broadband Engine Programmer Handbook, 10 May 2006


Cell Programming Model

PPE & SPEs Run Separate Programs
Programmed in C
Explicitly Move Memory
Hard to Debug SPE Code
SPE Good at Arithmetic, not Branching
PPE & SPEs Communicate via Mailboxes & Memory
SPEs Communicate via Signals & Memory
New IBM Single Source Compiler Supports OpenMP

Example SPE Code:

struct dma_list_elem inList[DMA_LIST_SIZE] __attribute__ ((aligned (8)));
struct dma_list_elem outList[DMA_LIST_SIZE] __attribute__ ((aligned (8)));

void SendSignal1(program_data* destPD, int value)
{
    sig1_local.SPU_Sig_Notify_1 = value;
    mfc_sndsig(&(sig1_local.SPU_Sig_Notify_1), destPD->sig1_addr,
               3 /* tag */, 0, 0);
    mfc_sync(1 << 3);
}
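A minimal, hypothetical SPE-side sketch of the "explicitly move memory" step (buffer name, chunk size, and tag are illustrative, not from the slides): the SPE pulls a block from main memory into its 256KB local store with a DMA get, waits for the tag group to complete, and then computes on the local copy.

/* Hypothetical SPE sketch: DMA a block into local store, wait, then compute. */
#include <spu_mfcio.h>

#define CHUNK 4096

static float local_buf[CHUNK] __attribute__ ((aligned (128)));

void fetch_and_scale(unsigned long long ea /* effective address in main memory */)
{
    unsigned int tag = 1;

    /* start the DMA get: local store <- main memory */
    mfc_get(local_buf, ea, sizeof(local_buf), tag, 0, 0);

    /* wait until all DMAs in this tag group have completed */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();

    for (int i = 0; i < CHUNK; i++)
        local_buf[i] *= 2.0f;
}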

Research Agenda & Future Directions

Programming and Performance Being Pushed Back to Software Developers


Big Problem!
Software Less Well Tested/Developed Than Hardware
Architecture Affects Performance In Strange Ways
Compiler Solutions (IBM SSC) Don't Always Work Well

Extract Top Performance From Multicore CPUs
Make Multicore Processors Better to Program
Explore & Exploit Asymmetric Computing


Parallel Microprocessor Problems


[Diagram: then, a single CPU with an L2 cache and one system/memory interface; now, two cores (CPU1, CPU2) sharing the same cache and memory interface]

Memory interface too slow for even 1 core/thread
Now multiple threads access memory simultaneously, overwhelming the memory interface
Parallel programs can run as slowly as sequential ones!
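A hedged illustration of the bottleneck (not from the slides): a STREAM-style triad does so little arithmetic per byte of memory traffic that extra threads mostly add contention for the shared memory interface, so the OpenMP version may scale poorly on a chip whose cores share one memory controller.

/* Memory-bandwidth-bound triad: performance is limited by the memory
   interface, so extra threads may not help much. */
#include <omp.h>
#include <stdlib.h>
#include <stdio.h>

#define N (1 << 26)

int main(void)
{
    float *a = malloc(N * sizeof(float));
    float *b = malloc(N * sizeof(float));
    float *c = malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) { b[i] = 1.0f; c[i] = 2.0f; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = b[i] + 3.0f * c[i];      /* one multiply-add per 12 bytes of traffic */
    double t1 = omp_get_wtime();

    printf("triad took %.3f s with %d threads\n", t1 - t0, omp_get_max_threads());
    free(a); free(b); free(c);
    return 0;
}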


SPPM: Producer/Consumer Parallelism Using The Cache


[Diagram: conventional approach, two threads each doing half the work and both fetching data from memory, hitting the memory bottleneck; SPPM approach, a producer thread and a consumer thread communicating through the shared cache]
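A generic producer/consumer sketch (an illustration of the idea, not the SPPM implementation): a small bounded queue keeps recently produced blocks cache-resident, so the consumer reads them from the shared cache rather than from memory. Block count, block size, and queue depth are assumptions.

/* Hypothetical producer/consumer pipeline over a small cache-sized queue. */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define BLOCKS      64
#define BLOCK_SIZE  4096
#define QUEUE_DEPTH 4          /* small queue keeps blocks cache-resident */

static float queue[QUEUE_DEPTH][BLOCK_SIZE];
static sem_t slots_free, slots_full;

static void *producer(void *arg)
{
    (void)arg;
    for (int blk = 0; blk < BLOCKS; blk++) {
        sem_wait(&slots_free);
        float *buf = queue[blk % QUEUE_DEPTH];
        for (int i = 0; i < BLOCK_SIZE; i++)   /* produce one block */
            buf[i] = (float)(blk + i);
        sem_post(&slots_full);
    }
    return NULL;
}

static void *consumer(void *arg)
{
    double sum = 0.0;
    (void)arg;
    for (int blk = 0; blk < BLOCKS; blk++) {
        sem_wait(&slots_full);
        float *buf = queue[blk % QUEUE_DEPTH];
        for (int i = 0; i < BLOCK_SIZE; i++)   /* consume it while still cached */
            sum += buf[i];
        sem_post(&slots_free);
    }
    printf("sum = %f\n", sum);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    sem_init(&slots_free, 0, QUEUE_DEPTH);
    sem_init(&slots_full, 0, 0);
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}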


Multicore Architecture Improvements

Cores in Common Multicore Chips Not Well Connected


Communicate Through Cache & Memory
Synchronization Slow (OS-based) or Uses Spin Waits

Register-Based Synchronization
Shared Registers Between Cores
Halt Processor While Waiting To Save Power

Preemptive Communications (Prepushing)


Reduces Latency over Demand-Based Fetches

Cuts Cache Coherence Traffic/Activity

Software Controlled Eviction


Manages Shared Caches Using Explicit Operations
Move Data From the Private Cache to the Shared Cache Before It Is Needed

Synchronization Engine
Hardware-based multicore synchronization operations

Asymmetric Multicore Processors


Intel & AMD Likely to Add GPUs to CPUs in Multicore Processors
Connectivity Better than PCIe
Cache Access Possible
Shared Memory Likely
Programming Slightly Better

Multiple CPU Classes Possible
Fast & Powerful for Some Apps
Simple & Power Efficient for Most

Thoughts:
Not many multithreaded apps, so don't need many CPUs
Application accelerators speed up multimedia, etc.
Probably Cost Effective to Integrate CPU & GPU

[Diagram: possible hybrid AMD multicore design with an Athlon 64 CPU and an ATI GPU sharing a crossbar (XBAR), HyperTransport, and memory controller]


FPGA Application Accelerators

Cray and SGI Add FPGA Application Accelerators to High-End Parallel Machines
FPGA Code Created Through VHDL or Other Compilers
Can Speed Up Some Applications

My Thoughts
FPGAs Great for Prototyping, General & Flexible
But Have Slow Clock Speed
So Will Be Quickly Surpassed By GPUs & Special-Purpose CPUs


Intel Larrabee
[Diagram: many simple x86 cores, each with a slice of coherent L2 cache, connected by an interprocessor ring network to the memory & I/O interfaces]

Many simple, fast, low power, in-order x86 cores


Larrabee: A Many-Core x86 Architecture for Visual Computing. ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, August 2008.

OpenCL
New standard pushed by Apple & others (all major players)

Data Parallel
N-dimensional computational domain
Work-groups similar to CUDA blocks
Work-items similar to CUDA threads

Task Parallel
Similar to threads, but single work-item

Supports GPU, multicore, Cell, etc.
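A tiny OpenCL kernel sketch (illustrative, not from the slides): each work-item computes one element, with get_global_id playing the role of the blockIdx/threadIdx arithmetic in the CUDA examples.

/* OpenCL kernel: one work-item per output element. */
__kernel void vmul(__global const float *b,
                   __global const float *c,
                   __global float *a)
{
    size_t i = get_global_id(0);   /* position in the N-D computational domain */
    a[i] = b[i] * c[i];
}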



Summary

Parallelism Big Resurgence


Ubiquitous
Still Difficult (Programming Challenge Worries Intel, Microsoft, & IBM)

New Types Of Parallelism Becoming Widespread


Multicore Processors
But They Suffer From Bottlenecks And Are Not Well Integrated

GPU Processing
Hard to Program
Especially Hard to Get Good Performance

Asymmetric Processor Cores


Hard to Program and Manage
Also Especially Hard to Get Good Performance

Many Architecture, Hardware, and Software Research Challenges



Resources

http://www.ibm.com/developerworks/power/cell - Cell Tools & Docs
http://www.nvidia.com/object/cuda_home.html - CUDA Tools & Docs
http://spds.ece.uci.edu/~sjenks/Pages/ParallelMicros.html - SPPM & Parallel Execution Models
http://web.mac.com/sfj/iWeb/Site/Welcome.html - Old Parallel Programming Blog (source code for FDTD examples)
http://newport.eecs.uci.edu/~sjenks - My homepage, where this presentation is posted

