
HETEROGENEOUS SYSTEM COHERENCE

FOR INTEGRATED CPU-GPU SYSTEMS


JASON POWER*, ARKAPRAVA BASU*, JUNLI GU, SOORAJ PUTHOOR,
BRADFORD M BECKMANN, MARK D HILL*, STEVEN K REINHARDT, DAVID A WOOD*
*University of Wisconsin-Madison
Advanced Micro Devices, Inc.

PowerPoint version available at: http://pages.cs.wisc.edu/~powerjg/

2 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

ABSTRACT
Hardware coherence can increase the utility of
heterogeneous systems
Major bottlenecks in current coherence implementations
High bandwidth difficult to support at directory
Extreme resource requirements

We propose Heterogeneous System Coherence


Leverages spatial locality and region coherence
Reduces bandwidth by 94%
Reduces resource requirements by 95%
PHYSICAL INTEGRATION

[Figure: physical integration of CPU cores, GPU, and stacked high-bandwidth DRAM. Credit: IBM]

LOGICAL INTEGRATION
General-purpose GPU computing
OpenCL
CUDA

Heterogeneous Uniform Memory Access (hUMA)


Shared virtual address space
Cache coherence

Allows new heterogeneous apps


OUTLINE
Motivation
Background
System overview
Cache architecture reminder

Heterogeneous System Bottlenecks


Heterogeneous System Coherence Details
Results
Conclusions


SYSTEM OVERVIEW
SYSTEM LEVEL

[Figure: an Accelerated Processing Unit (APU) connected to DRAM channels through a high-bandwidth interconnect]

SYSTEM OVERVIEW
APU

[Figure: APU block diagram with a GPU cluster, a CPU cluster, a directory on the path to DRAM, and a direct-access bus (used for graphics). Arrow thickness indicates bandwidth. Labeled arrows include invalidation traffic.]

GPU compute accesses must stay coherent

SYSTEM OVERVIEW
GPU

Very high bandwidth: L2 has high miss rate

[Figure: GPU cluster made of compute units (CUs), each paired with an L1 cache, all sharing the GPU L2 cache. Inset of one CU: I-fetch/decode, register file, execution units (Ex), local scratchpad memory, and a coalescer on the path to the L1.]

SYSTEM OVERVIEW
CPU

Low bandwidth: low L2 miss rate

[Figure: CPU cluster of CPU cores, each with an L1 cache, sharing an L2 cache that connects to the directory]

CACHE ARCHITECTURE REMINDER
CPU/GPU L2 CACHE

[Figure: L2 cache pipeline. Labels: demand requests, MSHR entries, MSHRs, cache tag arrays, hit, miss, miss requests, probe requests, data responses, core data responses, coherent network interface.]

Demand requests from L1 cache
Allocates an MSHR entry
Searches cache tags for a tag match
On a hit, return data to the L1
On a miss, send request to directory
On a directory probe, check MSHRs and tags
Tag hit on probe: send data to other core
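
The L2 behavior listed on this slide can be sketched as a small Python model. This is an illustrative toy, not the simulated hardware: the class name `L2Cache`, the return strings, the merging of repeated misses into one MSHR, and the "stall" result when the MSHR array is full are all assumptions added for the example.

```python
class L2Cache:
    """Toy model of an L2 cache with a tag array and a fixed number of MSHRs."""

    def __init__(self, num_mshrs):
        self.tags = set()        # block addresses currently cached
        self.mshrs = {}          # block address -> L1 ids waiting on the miss
        self.num_mshrs = num_mshrs
        self.to_directory = []   # miss requests sent toward the directory

    def demand(self, block, l1_id):
        """Handle a demand request from an L1 cache."""
        if block in self.tags:
            return "hit"                     # return data to the L1
        if block in self.mshrs:
            self.mshrs[block].append(l1_id)  # assumption: merge with pending miss
            return "merged"
        if len(self.mshrs) >= self.num_mshrs:
            return "stall"                   # assumption: no free MSHR, back-pressure
        self.mshrs[block] = [l1_id]          # allocate an MSHR entry
        self.to_directory.append(block)      # send the miss request to the directory
        return "miss"

    def fill(self, block):
        """Data arrived: free the MSHR and install the block."""
        waiters = self.mshrs.pop(block)
        self.tags.add(block)
        return waiters

    def probe(self, block):
        """Handle a directory probe by checking the tags and the MSHRs."""
        if block in self.tags:
            return "data"      # tag hit on probe: send data to the other core
        if block in self.mshrs:
            return "pending"   # request still in flight
        return "ack"
```

With a single MSHR, a second miss to a different block returns "stall", which is the kind of back-pressure the bottleneck slides describe.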

DIRECTORY ARCHITECTURE REMINDER
DIRECTORY

[Figure: directory pipeline. Labels: coherent block requests, MSHR entries, MSHRs, block directory tag array, hit, miss, PR entries, probe request RAM, block probe requests/responses, to DRAM.]

Demand requests from L2 cache
Allocates an MSHR entry
Searches tags for a tag match
On a miss, the data comes from DRAM
Allocate and send probes to L2 caches

BACKGROUND SUMMARY
System under investigation
Heterogeneous CPU-GPU on chip
High-bandwidth DRAM

Directory pipeline complex


MSHR array is associative
Difficult to pipeline with more than 1 request per cycle
Important resources: MSHR entries


OUTLINE
Motivation
Background
Heterogeneous System Bottlenecks
Simulation overview
Directory bandwidth
MSHRs
Performance is significantly affected

Heterogeneous System Coherence Details


Results
Conclusions

SIMULATION DETAILS
gem5 simulator
Simple CPU
GPU simulator based on AMD GCN
All memory requests through gem5

Workloads
Modified to use hUMA
Rodinia & AMD APP SDK

Parameter            Value
CPU Clock            2 GHz
CPU Cores            2
CPU Shared L2        2 MB (16-way banked)
GPU Clock            1 GHz
Compute Units        32
GPU Shared L2        4 MB (64-way banked)
L3 (Memory-side)     16 MB (16-way banked)
DRAM                 DDR3, 16 channels
Peak Bandwidth       700 GB/s
Baseline Directory   256k entries (8-way banked)

GPGPU BENCHMARKS
Rodinia benchmarks
bp: trains the connection weights on a neural network
bfs: breadth-first search
hs: performs a transient 2D thermal simulation (5-point stencil)
lud: LU matrix decomposition
nw: performs a global optimization for DNA sequence alignment
km: k-means clustering
sd: speckle-reducing anisotropic diffusion

AMD APP SDK
bn: bitonic sort
dct: discrete cosine transform
hg: histogram
mm: matrix multiplication


SYSTEM BOTTLENECKS

[Figure: APU with GPU cluster, CPU cluster, and directory on the path to DRAM. Annotations: "Designed to support CPU bandwidth" and "High bandwidth".]

Difficult to scale directory bandwidth
Difficult to multi-port
Complicated pipeline

High resource usage
Must allocate MSHR for entire duration of request
MSHR array difficult to scale

DIRECTORY TRAFFIC

[Figure: bar chart of directory accesses per GPU cycle (axis 0 to 4.5) for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]

Difficult to support >1 request per cycle

RESOURCE USAGE

[Figure: bar chart of maximum MSHRs (log axis, 1 to 100,000) for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]

Very difficult to scale MSHR array
Steady state at 700 GB/s
Causes significant back-pressure on L2s

PERFORMANCE OF BASELINE
COMPARED TO UNCONSTRAINED RESOURCES

[Figure: bar chart of slowdown (axis 0 to 5) for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]

Back-pressure from limited MSHRs and bandwidth

BOTTLENECKS SUMMARY
Directory bandwidth
Must support up to 4 requests per cycle
Difficult to construct pipeline

Resource usage
MSHRs are a constraining resource
Need more than 10,000
Without resource constraints, up to 4x better performance
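
A back-of-the-envelope way to see why the MSHR count can grow past 10,000 is Little's Law: entries in flight equal the request rate times the time each request holds its entry. The sketch below assumes that every 64-byte block request at the 700 GB/s peak passes through the directory, and the latencies it tries are hypothetical values chosen for illustration, not measurements from the talk.

```python
def mshrs_in_flight(bandwidth_bytes_per_s, block_bytes, latency_s):
    """Little's Law: average occupancy = arrival rate * time each entry is held."""
    requests_per_s = bandwidth_bytes_per_s / block_bytes
    return requests_per_s * latency_s


PEAK_BANDWIDTH = 700e9   # bytes per second, peak bandwidth from the simulation details
BLOCK_BYTES = 64         # cache block size used by the region design

if __name__ == "__main__":
    # Hypothetical request latencies; the slides do not state one.
    for latency_ns in (100, 500, 1000):
        entries = mshrs_in_flight(PEAK_BANDWIDTH, BLOCK_BYTES, latency_ns * 1e-9)
        print(f"{latency_ns} ns -> about {entries:.0f} MSHRs in flight")
```

Under these assumptions, a request that holds its MSHR for around a microsecond already implies more than ten thousand entries, which is why holding an MSHR for the entire duration of a request is so costly.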


OUTLINE
Motivation
Background
Heterogeneous System Bottlenecks
Heterogeneous System Coherence Details
Overall system design
Region buffer design
Region directory design
Example
Hardware complexity

Results
Conclusions

BASELINE DIRECTORY COHERENCE

[Figure: APU with GPU cluster, CPU cluster, and directory on the path to DRAM, animated through three phases: initialization, kernel launch, read result]

HETEROGENEOUS SYSTEM COHERENCE (HSC)

[Figure: the same APU, animated through initialization and kernel launch, then redrawn with a region buffer at the GPU cluster, a region buffer at the CPU cluster, a region directory in place of the directory, the direct-access bus, and the path to DRAM]

Region buffers coordinate with region directory

HSC: EXAMPLE MEMORY REQUEST

[Figure: a memory request traced through the GPU L2 cache, the GPU region buffer, and the region directory, shown beside the APU diagram with region buffers at both clusters and the region directory on the path to DRAM]

HSC: L2 CACHE & REGION BUFFER

[Figure: the L2 cache pipeline (demand requests, MSHR entries, cache tag arrays, hit/miss paths, core data responses, probe requests, data responses, coherent network interface) with a region buffer added beside it. The region buffer has its own MSHR entries and a direct-access bus interface.]

Region buffer: region tags and permissions
Only region-level permission traffic
Interface for direct-access bus

HSC: REGION DIRECTORY

[Figure: the directory pipeline with the block directory tag array replaced by a region directory tag array, and coherent block requests replaced by region permission requests. Other labels: block probe requests/responses, PR entries, probe request RAM, MSHR entries, MSHRs, hit, miss, to DRAM.]

Region directory tag array: region tags, sharers, and permissions

HSC: HARDWARE COMPLEXITY

Region protocols reduce directory size
Region directory: 8x fewer entries

Region buffers
At each L2 cache
1-KB region (16 64-B blocks)
16-K region entries
Overprovisioned for low-locality workloads
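
To make the region geometry concrete, the sketch below splits a byte address into a region number, a block index within the region, and a byte offset within the block, using the 1-KB region and 64-B block sizes stated above. The function name `split_address` is hypothetical, and how the region number is further divided into tag and set index is not specified on the slides, so it is left out.

```python
BLOCK_BYTES = 64                                  # from the slide
BLOCKS_PER_REGION = 16                            # from the slide
REGION_BYTES = BLOCK_BYTES * BLOCKS_PER_REGION    # 1 KB


def split_address(addr):
    """Return (region number, block index within region, byte offset within block)."""
    region = addr // REGION_BYTES
    block_in_region = (addr % REGION_BYTES) // BLOCK_BYTES
    byte_in_block = addr % BLOCK_BYTES
    return region, block_in_region, byte_in_block
```

Two addresses with the same region number are covered by one region entry, so a region permission obtained for one of them also covers the other 15 blocks.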


(a) Region Directory Entry
Region Tag (18 bits) | State (2 bits) | CPU, GPU (1 valid bit per cluster)

(b) Region Buffer Entry
Region Tag (18 bits) | State (2 bits) | B0 B1 B2 ... B15 (1 valid bit per block in the region)
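
The field widths in the two entry formats give a rough storage estimate. The sketch below adds up the bits per entry and the total for a 16-K-entry region buffer. It counts only the fields shown (tag, state, valid bits), and the packing order used in `pack_buffer_entry` (tag in the high bits, then state, then block-valid bits) is an assumption for illustration, since the slides give widths but not a bit layout.

```python
REGION_TAG_BITS = 18
STATE_BITS = 2
CLUSTERS = 2                      # CPU and GPU valid bits
BLOCKS_PER_REGION = 16            # B0 .. B15 valid bits
REGION_BUFFER_ENTRIES = 16 * 1024

DIRECTORY_ENTRY_BITS = REGION_TAG_BITS + STATE_BITS + CLUSTERS        # 22
BUFFER_ENTRY_BITS = REGION_TAG_BITS + STATE_BITS + BLOCKS_PER_REGION  # 36
REGION_BUFFER_BYTES = REGION_BUFFER_ENTRIES * BUFFER_ENTRY_BITS // 8  # 73,728 (72 KiB)


def pack_buffer_entry(tag, state, valid_blocks):
    """Pack one region buffer entry into an integer (hypothetical layout)."""
    if not 0 <= tag < (1 << REGION_TAG_BITS):
        raise ValueError("tag does not fit in 18 bits")
    if not 0 <= state < (1 << STATE_BITS):
        raise ValueError("state does not fit in 2 bits")
    valid = 0
    for block in valid_blocks:
        if not 0 <= block < BLOCKS_PER_REGION:
            raise ValueError("block index out of range")
        valid |= 1 << block
    return (tag << (STATE_BITS + BLOCKS_PER_REGION)) | (state << BLOCKS_PER_REGION) | valid
```

By this count a region buffer entry needs 36 bits to cover 1 KB of data, and the full 16-K-entry buffer needs about 72 KiB for these fields.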

HSC SUMMARY
Key insight
GPU-CPU applications exhibit high spatial locality
Use direct-access bus present in systems
Offload bandwidth onto direct-access bus

Use coherence network only for permission


Add region buffer to track region information
At each L2 cache
Bypass coherence network and directory

Replace directory with region directory


Significantly reduces total size needed
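
The summary above can be sketched as a toy model of one cluster streaming through memory. It is deliberately simplified: the class and method names (`RegionDirectory`, `RegionBuffer`, `acquire`, `l2_miss`) are hypothetical, and probes, invalidations, evictions, and sharing between clusters are not modeled. It shows only the main effect: an L2 miss contacts the region directory once per region and then uses the direct-access bus.

```python
from dataclasses import dataclass, field

BLOCK_BYTES = 64
REGION_BYTES = 1024   # 16 blocks per region


@dataclass
class RegionDirectory:
    """Tracks which clusters hold permission for each region (toy model)."""
    holders: dict = field(default_factory=dict)   # region -> set of cluster names
    permission_requests: int = 0

    def acquire(self, region, cluster):
        # One coherence-network message per region, not per block.
        self.permission_requests += 1
        self.holders.setdefault(region, set()).add(cluster)


@dataclass
class RegionBuffer:
    """Sits at an L2 cache and remembers regions it already has permission for."""
    cluster: str
    directory: RegionDirectory
    held: set = field(default_factory=set)
    direct_accesses: int = 0

    def l2_miss(self, addr):
        region = addr // REGION_BYTES
        if region not in self.held:
            # Region permission missing: ask the region directory first.
            self.directory.acquire(region, self.cluster)
            self.held.add(region)
        # Permission held: fetch the block over the direct-access bus,
        # bypassing the coherence network and the directory.
        self.direct_accesses += 1


if __name__ == "__main__":
    directory = RegionDirectory()
    gpu = RegionBuffer("gpu", directory)
    for addr in range(0, 4 * REGION_BYTES, BLOCK_BYTES):   # stream 64 blocks
        gpu.l2_miss(addr)
    print(directory.permission_requests, gpu.direct_accesses)
```

In this streaming case the directory sees one request for every 16 block fetches, which is the best case that high spatial locality allows with 16-block regions.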


OUTLINE
Motivation
Background
Heterogeneous System Bottlenecks
Heterogeneous System Coherence Details
Results
Speed-up
Latency of loads
Bandwidth
MSHR usage

Conclusions

THREE CACHE-COHERENCE PROTOCOLS


Broadcast: Null-directory that broadcasts on all requests
Baseline: Block-based, mostly inclusive, directory
HSC: Region-based directory with 1-KB region size


HSC PERFORMANCE

[Figure: bar chart of normalized speed-up (axis 0 to 5) for Broadcast, Baseline, and HSC across bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]

Largest slowdowns from constrained resources

DIRECTORY TRAFFIC REDUCTION

[Figure: bar chart of normalized directory bandwidth (axis 0 to 1.2) for broadcast, baseline, and HSC across bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm, with a line marking the theoretical reduction from 16-block regions]

Average bandwidth significantly reduced
Theoretical reduction from 16-block regions
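
The theoretical reduction for 16-block regions can be worked out directly, assuming the ideal case in which every block of a region is accessed and only the first access requires a directory request. Here N is the number of blocks per region and B denotes directory bandwidth; both symbols are introduced only for this example.

```latex
\[
\frac{B_{\text{region}}}{B_{\text{block}}} = \frac{1}{N} = \frac{1}{16} = 0.0625,
\qquad
1 - \frac{1}{16} = 0.9375 \approx 94\%
\]
```

This ideal figure is close to the 94% bandwidth reduction quoted in the abstract and conclusions.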

HSC RESOURCE USAGE

[Figure: bar chart of normalized directory MSHRs required (axis 0 to 0.25) for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]

Maximum MSHRs significantly reduced

RESULTS SUMMARY
Used a detailed timing simulator for CPU and GPU
HSC significantly improves performance
Reduces the average load latency
Decreases bandwidth requirement of directory

HSC reduces the required MSHRs at the directory


RELATED WORK
Coarse-grained coherence
Region coherence
Applied to snooping systems [Cantin, ISCA 2005] [Moshovos, ISCA 2005]
[Zebchuk, MICRO 2007]

Extended to directories [Fang, PACT 2013] [Zebchuk, MICRO 2013]

Spatiotemporal coherence [Alisafaee, MICRO 2012]


Dual-grain directory coherence [Basu, UW-TR 2013]
Primarily focused on directory size

GPU coherence [Singh et al. HPCA 2013]


Intra-GPU coherence


CONCLUSIONS
Hardware coherence can increase the utility of
heterogeneous systems
Major bottlenecks in current coherence implementations
High bandwidth difficult to support at directory
Extreme resource requirements

We propose Heterogeneous System Coherence


Leverages spatial locality and region coherence
Reduces bandwidth by 94%
Reduces resource requirements by 95%

Questions?
Contact:
powerjg@cs.wisc.edu


DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and
typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to
product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences
between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or
otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to
time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR
ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO
EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM
THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of
Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance
Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.

BACKUP SLIDES

LOAD LATENCY

[Figure: bar chart of normalized load latency (axis 0 to 4.5) for broadcast, baseline, and HSC across bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]

Average load time significantly reduced

EXECUTION TIME BREAKDOWN

[Figure: stacked bar chart of execution time (%) split between GPU and CPU for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]