Heterogeneous System Coherence

HETEROGENEOUS SYSTEM COHERENCE
FOR INTEGRATED CPU-GPU SYSTEMS

JASON POWER*, ARKAPRAVA BASU*, JUNLI GU, SOORAJ PUTHOOR,
BRADFORD M BECKMANN, MARK D HILL*, STEVEN K REINHARDT, DAVID A WOOD*
*University of Wisconsin-Madison
Advanced Micro Devices, Inc.
Powerpoint version available on:

http://pages.cs.wisc.edu/~powerjg/
2 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46
ABSTRACT
Hardware coherence can increase the utility of
heterogeneous systems
Major bottlenecks in current coherence implementations
High bandwidth difficult to support at directory
Extreme resource requirements
We propose Heterogeneous System Coherence

Leverages spatial locality and region coherence
Reduces bandwidth by 94%
Reduces resource requirements by 95%
PHYSICAL INTEGRATION
Stacked High-bandwidth DRAM
GPU
CPU
Cores
Credit: IBM
LOGICAL INTEGRATION
General-purpose GPU computing
OpenCL
CUDA
Heterogeneous Uniform Memory Access (hUMA)

Shared virtual address space
Cache coherence
Allows new heterogeneous apps
OUTLINE
Motivation
Background
System overview
Cache architecture reminder
Heterogeneous System Bottlenecks

Heterogeneous System Coherence Details
Results
Conclusions
SYSTEM OVERVIEW
SYSTEM LEVEL
Highbandwidth
interconnect
Accelerated
Processing
Unit (APU)
DRAM Channels
SYSTEM OVERVIEW
APU
APU
GPU compute
accesses must
stay coherent
GPU
Cluster
Direct-access
bus
CPU
Cluster
Directory
(used for graphics)
To DRAM
Arrow thickness
bandwidth
Invalidation
traffic
SYSTEM OVERVIEW
GPU
CU
Very high bandwidth:

CU CU CU CU CULocal
CU Scratchpad
CU CU CU CU
L2
has
high
miss
rate
L1
L1
L1
L1
L1
L1Memory
L1
L1
L1
L1
GPU Cluster
I-Fetch / Decode
CU
CU
CU
CU
L1
L1
L1
L1
CU
L1
Register File
Ex
Ex
Ex
Ex
Ex
L1
L1
L1
L1Ex L1
CU
CU
CU
CU
L1
GPU L2 Cache
Ex
Ex
Coalescer
Ex
CU
L1
ExL1 ExL1 Ex
L1
L1
CU
Ex CU ExCU ExCU Ex
CU
CU
To L1
L1
L1
L1
L1
L1
L1
CU
CU
CU
CU
CU
CU
SYSTEM OVERVIEW
CPU Cluster
CPU
bandwidth:
Core
CPU
Core
Low
Low L2 miss rate
L1
L1
To Dir
L2
L1
L1
CPU
Core
CPU
Core
CACHE ARCHITECTURE REMINDER

CPU/GPU L2 CACHE
MSHR Entries
Demand
Requests
Cache Tag Arrays

Demand requests
Searches cache tags
from L1Allocates
cache anfor
a tag match
MSHR
Tag hit on probe: send
entry On a directory
MSHRs
data to other core
Miss
On a miss, send probe,Requests
check
Data
Onrequest
a hit, return
Hit
to directory
MSHRsResponses
and tags
Probe
data to the L1
Requests
Core Data
Responses
Coherent
Network
Interface
DIRECTORY ARCHITECTURE REMINDER

DIRECTORY
Miss
To DRAM
PR Entries
MSHR Entries
Demand Block
requests
Block tags
Probe
Searches
Directory
Tag Array cache
Requests/
Responses
from L2Allocates
cache anfor
a tag match
MSHR
On a miss, the entry
data
Allocate
and send
Probe
comes
from DRAM
MSHRs
Request RAM
Coherent
probes to L2 caches Hit
Block Requests
BACKGROUND SUMMARY
System under investigation
Heterogeneous CPU-GPU on chip
High-bandwidth DRAM
Directory pipeline complex

MSHR array is associative
Difficult to pipeline with more than 1 request per cycle
Important resources: MSHR entries
OUTLINE
Motivation
Background
Simulation overview
Directory bandwidth
MSHRs
Performance is significantly affected

Results
Conclusions
SIMULATION DETAILS
gem5 simulator
Workloads
Simple CPU
GPU simulator based on AMD GCN
All memory requests through gem5
CPU Clock
CPU Cores
CPU Shared L2
GPU Clock
Compute Units
GPU Shared L2
L3 (Memory-side)
DRAM
Peak Bandwidth
Baseline Directory
Modified to use hUMA

Rodinia & AMD APP SDK
2 GHz
2
2 MB (16-way banked)
1 GHz
32
DDR3, 16 channels
700 GB/s
256k entries (8-way banked)
GPGPU BENCHMARKS
Rodinia benchmarks
bp trains the connection weights on a neural network
bfs breadth-first search
hs performs a transient 2D thermal simulation (5-point stencil)
lud matrix decomposition
nw performs a global optimization for DNA sequence alignment
km does k-means clustering
sd speckle-reducing anisotropic diffusion
AMD SDK
bn bitonic sort
dct discrete cosine transform
hg histogram
mm matrix multiplication
SYSTEM BOTTLENECKS
APU
Difficult to scale directory
bandwidth
Difficult to multi-port
GPU
Complicated pipeline
Cluster
CPU
Cluster
Designed to
support CPU
High resource usage
bandwidth
Must allocate MSHR for entire duration of request
High
bandwidth
MSHR
array difficult to scale Directory
To DRAM
DIRECTORY TRAFFIC
Directory accesses per GPU cycle
4.5
4
Difficult to support >1

request per cycle
3.5
3
2.5
2
1.5
1
0.5
0
bp
bfs
hs
lud
nw
km
sd
bn
dct
hg
mm
RESOURCE USAGE
100000
Maximum MSHRs
10000
1000
100
Very difficult to
scale MSHR array
Steady state at
700 GB/s
Causes significant
back-pressure on L2s
10
1
bp
bfs
hs
lud
nw
km
sd
bn
dct
hg
mm
PERFORMANCE OF BASELINE
COMPARED TO UNCONSTRAINED RESOURCES
5
Back-pressure from limited

MSHRs and bandwidth
4.5
4
Slow down
3.5
3
2.5
2
1.5
1
0.5
0
bp
bfs
hs
lud
nw
km
sd
bn
dct
hg
mm
BOTTLENECKS SUMMARY
Directory bandwidth
Must support up to 4 requests per cycle
Difficult to construct pipeline
Resource usage
MSHRs are a constraining resource
Need more than 10,000
Without resource constraints, up to 4x better performance
OUTLINE
Motivation
Background
Overall system design
Region buffer design
Region directory design
Example
Hardware complexity
Results
Conclusions
BASELINE DIRECTORY COHERENCE
APU
GPU
Cluster
CPU
Cluster
Initialization
Kernel Launch
Directory
To DRAM
Read result
HETEROGENEOUS SYSTEM COHERENCE (HSC)
APU
GPU
Cluster
CPU
Cluster
Initialization
Kernel Launch
Directory
To DRAM
HETEROGENEOUS SYSTEM COHERENCE (HSC)
APU
GPU
Region
Cluster
Buffer
CPU
Region
Cluster
Buffer
Direct-access bus
Region
Directory
Directory
To DRAM
Region buffers
coordinate with
region directory
HSC: EXAMPLE MEMORY REQUEST

GPU L2 Cache
GPU Region Buffer
Region Directory
APU
GPU
Region
Cluster
Buffer
CPU
Region
Cluster
Buffer
Region
Directory
To DRAM
HSC: L2 CACHE & REGION BUFFER
MSHR Entries
Demand
Demand
Requests
Requests
MSHRs
MSHR Entries
MSHRs
Region
tags and
Cache Tag Arrays
Region Buffer
Cache Tag Arrays
permissions
Only region-level
permission traffic
Interface for
direct-access bus
Miss
Hit
Miss
Core Data
Responses
Hit
Core Data
Responses
Hit
Miss
Miss
Requests
Probe
Hit Data
Requests
Responses
Probe
Requests
Direct Access
Bus Interface
Coherent
Coherent
Network
Network
Interface
Interface
HSC: REGION DIRECTORY

Region tags,
sharers, and
Block Directory
Array
permissions
Region
DirectoryTag
Tag
Array
Block Probe
Requests/
BlockResponses
Probe
Requests/Responses
Hit
Miss
Miss
To DRAM
PR Entries
Hit
PR Entries
MSHR Entries
Block Requests
Probe
Probe
Request
RAM
Request
RAM
MSHRs
MSHR Entries
Region
Permission
Requests
MSHRs
Coherent
HSC: HARDWARE COMPLEXITY

Region protocols reduce
directory size
Region directory: 8x fewer entries
Region buffers
At each L2 cache
1-KB region (16 64-B blocks)
16-K region entries
Overprovisioned for low-locality
workloads
(a) Region Directory Entry

Region Tag
State CPU GPU
18 bits
2 bits 1 valid bit

per cluster
(b) Region Buffer Entry

Region Tag
18 bits
State B0 B1 B2 ... B15

2 bits
1 valid bit per

block in the region
HSC SUMMARY
Key insight
GPU-CPU applications exhibit high spatial locality
Use direct-access bus present in systems
Offload bandwidth onto direct-access bus
Use coherence network only for permission

Add region buffer to track region information
At each L2 cache
Bypass coherence network and directory
Replace directory with region directory

Significantly reduces total size needed
OUTLINE
Motivation
Background
Results
Speed-up
Latency of loads
Bandwidth
MSHR usage
Conclusions
THREE CACHE-COHERENCE PROTOCOLS

Broadcast: Null-directory that broadcasts on all requests
Baseline: Block-based, mostly inclusive, directory
HSC: Region-based directory with 1-KB region size
HSC PERFORMANCE
5
4.5
Normalized speed-up
Largest
Largest slow-downs
slowdowns
Broadcast
from constrained
resources
Baseline
HSC
3.5
3
2.5
2
1.5
1
0.5
0
bp
bfs
hs
lud
nw
km
sd
bn
dct
hg
mm
DIRECTORY TRAFFIC REDUCTION

1.2
broadcast
baseline
HSC
Normalized directory bandwidth
1
0.8
0.6
0.4
Average bandwidth
significantly reduced Theoretical
reduction from 16
block regions
0.2
0
bp
bfs
hs
lud
nw
km
sd
bn
dct
hg
mm
HSC RESOURCE USAGE
Normalized directory MSHRs required
0.25
0.2
0.15
0.1
Maximum
MSHRs
significantly
reduced
0.05
0
bp
bfs
hs
lud
nw
km
sd
bn
dct
hg
mm
RESULTS SUMMARY
Used a detailed timing simulator for CPU and GPU
HSC significantly improves performance
Reduces the average load latency
Decreases bandwidth requirement of directory
HSC reduces the required MSHRs at the directory
RELATED WORK
Coarse-grained coherence
Region coherence
Applied to snooping systems [Cantin, ISCA 2005] [Moshovos, ISCA 2005]
[Zebchuk, MICRO 2007]
Extended to directories [Fang, PACT 2013] [Zebchuk, MICRO 2013]
Spatiotemporal coherence [Alisafaee, MICRO 2012]

Dual-grain directory coherence [Basu, UW-TR 2013]
Primarily focused on directory size
GPU coherence [Singh et al. HPCA 2013]

Intra-GPU coherence
CONCLUSIONS
Hardware coherence can increase the utility of
heterogeneous systems
Major bottlenecks in current coherence implementations
High bandwidth difficult to support at directory
Extreme resource requirements
We propose Heterogeneous System Coherence

Leverages spatial locality and region coherence
Reduces bandwidth by 94%
Reduces resource requirements by 95%
Questions?
Contact:
powerjg@cs.wisc.edu
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and
typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to
product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences
between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or
otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to
time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR
ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO
EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM
THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of
Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance
Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.
Backup
Slides
LOAD LATENCY
4.5
broadcast
Normalized load latency
4
3.5
baseline
HSC
Average load time

significantly reduced
2.5
2
1.5
1
0.5
0
bp
bfs
hs
lud
nw
km
sd
bn
dct
hg
mm
EXECUTION TIME BREAKDOWN

120
GPU
CPU
hg
mm
Execution time (%)
100
80
60
40
20
0
bp
bfs
hs
lud
nw
km
sd
bn
dct

Heterogeneous System Coherence

Enviado por

Dados do documento

Título original

Direitos autorais

Formatos disponíveis

Compartilhar este documento

Compartilhar ou incorporar documento

Opções de compartilhamento

Você considera este documento útil?

Este conteúdo é inapropriado?

Direitos autorais:

Formatos disponíveis

Heterogeneous System Coherence

Enviado por

Direitos autorais:

Formatos disponíveis

HETEROGENEOUS SYSTEM COHERENCE

FOR INTEGRATED CPU-GPU SYSTEMS

Powerpoint version available on:

2 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

We propose Heterogeneous System Coherence

5 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

6 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

7 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

Stacked High-bandwidth DRAM

Heterogeneous Uniform Memory Access (hUMA)

Allows new heterogeneous apps

9 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

Heterogeneous System Bottlenecks

10 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

11 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

(used for graphics)

Very high bandwidth:

13 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

14 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

CACHE ARCHITECTURE REMINDER

Cache Tag Arrays

15 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

DIRECTORY ARCHITECTURE REMINDER

16 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

Directory pipeline complex

17 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

Heterogeneous System Coherence Details

Modified to use hUMA

19 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

20 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

Directory accesses per GPU cycle

Difficult to support >1

22 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

23 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

Back-pressure from limited

24 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

25 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

BASELINE DIRECTORY COHERENCE

HETEROGENEOUS SYSTEM COHERENCE (HSC)

HETEROGENEOUS SYSTEM COHERENCE (HSC)

HSC: EXAMPLE MEMORY REQUEST

GPU Region Buffer

HSC: L2 CACHE & REGION BUFFER

33 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

HSC: REGION DIRECTORY

34 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

HSC: HARDWARE COMPLEXITY

35 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

(a) Region Directory Entry

State CPU GPU

2 bits 1 valid bit

(b) Region Buffer Entry

State B0 B1 B2 ... B15

1 valid bit per

Use coherence network only for permission

Replace directory with region directory

36 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

THREE CACHE-COHERENCE PROTOCOLS

38 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

39 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

DIRECTORY TRAFFIC REDUCTION

Normalized directory bandwidth

40 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46