
University of Michigan

Electrical Engineering and Computer Science


Application-Specific Processing on a
General Purpose Core via Transparent
Instruction Set Customization

Nathan Clark, Manjunath Kudlur, Hyunchul Park,
Scott Mahlke, Krisztián Flautner*

Advanced Computer Architecture Lab, University of Michigan
*ARM Ltd.
A Case for Customization
General-purpose processors handle many applications fairly well, but
Each application has different requirements
Need for efficient execution

Impressive design wins through customization
Performance, power, area
Up to 3.5x speedup [Hot Chips 16]
Instruction Set Customization
Computationally demanding parts of applications run on special hardware
New instructions use the special hardware
[Figure: dataflow graph of XOR, MPY, LD, SHR, MOV, and AND operations, with part of the graph replaced by a single CUSTOM instruction]
Traditional vs. Transparent Customization
Traditional: high Non-Recurring Engineering (NRE) costs
Transparent: universal accelerator, no ISA change
[Figure: CPUs under traditional customization compared with CPUs that share a Compute Accelerator (CCA) design under transparent customization]
Design of a Compute Accelerator
Goal: support important computation subgraphs
Array of function units
Exploits subgraph parallelism
Allows natural data propagation
[Figure: pipeline with Fetch, Issue, execution units (ALU, ALU, CCA), and WB stages; the CCA is drawn as rows of FUs fed by inputs IN 1 and IN 2]
CCA Shape
164.gzip
[Figure: 164.gzip subgraph of Or, And, and Mov operations laid out on the FU array; each FU in use is labeled 1]
CCA Shape
Blowfish
[Figure: Blowfish subgraph of And, Xor, Add, and Mov operations laid out on the FU array; FUs are labeled 1 or 2]
CCA Utilization
Dynamic % of subgraphs using FU (rows: FU array row 1-8; columns: FU position within the row 1-7)

Row      1      2      3      4      5      6      7
1      100   59.0   22.9   13.1    6.5    4.2    0.3
2     91.1   50.6    9.9    4.1    0.6    0.2    0.0
3     57.4   17.8    6.3    2.9    0.1    0.0    0.0
4     18.5    8.3    1.6    0.1    0.0    0.0    0.0
5      8.7    2.1    0.1    0.0    0.0    0.0    0.0
6      2.1    1.2    0.1    0.0    0.0    0.0    0.0
7      1.2    0.1    0.1    0.0    0.0    0.0    0.0
8      0.1    0.1    0.0    0.0    0.0    0.0    0.0
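
One way to read the utilization table is as a sizing guide: keep only the FU positions that enough subgraphs actually use. The sketch below copies the table values and counts, per row, the positions at or above a cutoff. Treating each table row as one row of the FU array, and the 1% cutoff itself, are assumptions made for illustration; the slides do not state how the row widths were chosen.

```python
# Dynamic % of subgraphs using each FU, copied from the CCA Utilization table.
# Each inner list is read as one row of the FU array (an assumption).
UTILIZATION = [
    [100, 59.0, 22.9, 13.1, 6.5, 4.2, 0.3],
    [91.1, 50.6, 9.9, 4.1, 0.6, 0.2, 0.0],
    [57.4, 17.8, 6.3, 2.9, 0.1, 0.0, 0.0],
    [18.5, 8.3, 1.6, 0.1, 0.0, 0.0, 0.0],
    [8.7, 2.1, 0.1, 0.0, 0.0, 0.0, 0.0],
    [2.1, 1.2, 0.1, 0.0, 0.0, 0.0, 0.0],
    [1.2, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
]


def row_widths(table, cutoff_percent):
    """Count, per row, the FU positions used by at least cutoff_percent of subgraphs."""
    return [sum(1 for value in row if value >= cutoff_percent) for row in table]


# Hypothetical cutoff: keep an FU if at least 1% of dynamic subgraphs use it.
widths = row_widths(UTILIZATION, cutoff_percent=1.0)
print(widths)  # [6, 4, 4, 3, 2, 2, 1, 0]
```

With this cutoff the widths come out as 6, 4, 4, 3, 2, 2, 1 and an empty eighth row, which lines up with the row widths of the depth-7 configuration (6A-4L-4A-3L-2A-2L-1L) listed on the synthesis slide.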
CCA Operations
Dynamic opcodes in important subgraphs
Excluded mpy/div, load/store, branch
Two main categories: logicals, adds
Subgraphs rarely have more than 3 dependent adds

Opcode      %
Add      28.7
And      12.5
Move     11.7
Sext     10.4
Lshift    9.8
Or        8.7
Xor       5.1
Sub       4.8
Rshift    2.4
Compare   0.4
Proposed CCA Design
4 inputs/2 outputs
Two FU types: Arith/logic, Logic
Crossbar between rows
Captures > 99% of important subgraphs
[Figure: CCA block diagram with inputs I1-I4 and outputs O1, O2]
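
The constraints on this slide (4 inputs, 2 outputs, two FU types arranged in rows) can be sketched as a simple fit check. The code below is an illustrative greedy mapper, not the authors' algorithm. It assumes the row layout of the depth-7 configuration from the synthesis slide, reads "A" as an arith/logic row and "L" as a logic-only row, assumes a value may be consumed by any later row, and uses hypothetical opcode sets. Shifts are left out because the deck's own example keeps LSR outside the CUSTOM instruction.

```python
ARITH_OPS = {"add", "sub", "compare"}    # assumed to need an arith/logic (A) row
LOGIC_OPS = {"and", "or", "xor", "mov"}  # assumed to fit either row type


def parse_config(config):
    """Turn a string such as '6A-4L' into [(6, 'A'), (4, 'L')]."""
    return [(int(row[:-1]), row[-1]) for row in config.split("-")]


def fits_cca(subgraph, outputs, config="6A-4L-4A-3L-2A-2L-1L",
             max_inputs=4, max_outputs=2):
    """Greedily place each operation in the earliest legal row.

    subgraph maps a value name to (opcode, [source names]) and must be listed
    in dependence order. Returns {name: row index}, or None if it does not fit.
    """
    rows = parse_config(config)
    free = [width for width, _ in rows]
    # Sources that no operation in the subgraph produces are external inputs.
    inputs = {s for _, srcs in subgraph.values() for s in srcs if s not in subgraph}
    if len(inputs) > max_inputs or len(outputs) > max_outputs:
        return None
    placed = {}
    for name, (op, srcs) in subgraph.items():
        if op not in ARITH_OPS and op not in LOGIC_OPS:
            return None
        # An operation must sit below every producer it depends on.
        first = 1 + max((placed[s] for s in srcs if s in subgraph), default=-1)
        for r in range(first, len(rows)):
            if free[r] > 0 and (rows[r][1] == "A" or op in LOGIC_OPS):
                free[r] -= 1
                placed[name] = r
                break
        else:
            return None
    return placed


# The ADD/XOR/ADD/XOR chain from the deck's running example (immediates omitted).
example = {
    "r4": ("add", ["r1"]),
    "r5": ("xor", ["r4", "r2"]),
    "r6": ("add", ["r5", "r3"]),
    "r7": ("xor", ["r6", "r8"]),
}
print(fits_cca(example, outputs={"r7"}))  # {'r4': 0, 'r5': 1, 'r6': 2, 'r7': 3}
```

Because this layout has only three arith/logic rows, the sketch rejects a chain of four dependent adds, which is consistent with the observation that subgraphs rarely have more than 3 dependent adds. A greedy placement can also reject subgraphs that a smarter mapper would fit.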
Synthesis of CCA
Synopsys design tools, 130nm library

Depth  Configuration         Control (bits)  Delay (ns)  Cell area (mm^2)  Subgraphs supported
7      6A-4L-4A-3L-2A-2L-1L  245             5.62        0.48              99.3%
6      6A-4L-4A-3L-2A-1L     229             4.56        0.45              95.1%
5      6A-4L-4A-2L-1L        197             3.50        0.40              87.6%
4      6A-4L-3A-2L           172             3.19        0.38              81.8%
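
The synthesis results trade coverage against delay and area. As a rough worked example, the sketch below picks the lowest-area configuration that meets a coverage target and converts its delay into whole clock cycles. The table values are copied from the slide; the function names and the clock frequencies are hypothetical and chosen only to illustrate the arithmetic.

```python
import math

# (depth, control bits, delay in ns, cell area in mm^2, % of subgraphs supported)
SYNTHESIS = [
    (7, 245, 5.62, 0.48, 99.3),
    (6, 229, 4.56, 0.45, 95.1),
    (5, 197, 3.50, 0.40, 87.6),
    (4, 172, 3.19, 0.38, 81.8),
]


def smallest_cca(min_coverage):
    """Return the lowest-area table row supporting at least min_coverage percent."""
    candidates = [row for row in SYNTHESIS if row[4] >= min_coverage]
    return min(candidates, key=lambda row: row[3]) if candidates else None


def latency_cycles(delay_ns, clock_ghz):
    """Whole clock cycles needed to cover the CCA delay at an assumed clock."""
    return math.ceil(delay_ns * clock_ghz)


choice = smallest_cca(95.0)
print(choice[0])                                    # 6 (the depth-6 configuration)
print(latency_cycles(choice[2], clock_ghz=0.5))     # 3 cycles at a hypothetical 500 MHz
```

Asking for 95% coverage selects the depth-6 row, whose 4.56 ns delay spans 3 cycles of a 2 ns clock. This is the kind of multi-cycle CCA latency that the latency-sensitivity slide varies.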
CCA Utilization
Two axes: Selection (static or dynamic) and Realization (static or dynamic)

Static selection, static realization: ASIPs
- ISA change
- High NRE

Static selection, dynamic realization
+ Powerful selection
+ Simple hardware
- Some ISA change
- Recompile necessary

Dynamic selection, dynamic realization
+ No ISA change
+ No recompile
- Simple selection
- Hardware complexity
Dynamic Selection Dynamic Realization
Detect and replace subgraphs in fill unit of trace cache

[Figure: pipeline with I-Cache, Decode, Execute, and Retire stages; retired instructions pass through Trace Construction and then Subgraph Selection and Insertion before entering the Trace Cache]

Original trace:
ADD r4, r1, #1
LSR r2, r2, #4
XOR r5, r4, r2
LD r3
ADD r6, r5, r3
XOR r7, r6, r8
SHR

Trace after subgraph replacement:
LSR r2, r2, #4
LD r3
CUSTOM
SHR
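
The replacement shown on this slide can be sketched in software. The code below is a simplified illustration, not the fill-unit hardware: it grows a subgraph backward from the last eligible instruction through its producers, then collapses the chosen instructions into one CUSTOM at the position of the last one. The eligible opcode set, the helper names, and the text-based parsing are assumptions. The sketch also ignores the input/output limits of the accelerator and assumes that no instruction left between the chosen ones depends on their results.

```python
CCA_OPS = {"ADD", "SUB", "AND", "OR", "XOR", "MOV"}  # assumed CCA-eligible opcodes


def parse(inst):
    """Split 'OP dest, src, ...' into (op, dest, [register sources])."""
    op, _, rest = inst.partition(" ")
    regs = [r.strip() for r in rest.split(",")
            if r.strip() and not r.strip().startswith("#")]  # drop immediates
    return op, (regs[0] if regs else None), regs[1:]


def select_subgraph(trace):
    """Grow a subgraph backward from the last eligible instruction."""
    insts = [parse(i) for i in trace]
    seeds = [i for i, (op, _, _) in enumerate(insts) if op in CCA_OPS]
    if not seeds:
        return set()
    chosen, work = set(), [seeds[-1]]
    while work:
        i = work.pop()
        if i in chosen:
            continue
        chosen.add(i)
        for src in insts[i][2]:
            # Find the nearest earlier writer of this source register.
            for j in range(i - 1, -1, -1):
                if insts[j][1] == src:
                    if insts[j][0] in CCA_OPS:
                        work.append(j)
                    break
    return chosen


def replace_subgraph(trace, chosen):
    """Emit the trace with the chosen instructions collapsed into one CUSTOM."""
    if not chosen:
        return list(trace)
    last = max(chosen)
    out = []
    for i, inst in enumerate(trace):
        if i == last:
            out.append("CUSTOM")
        elif i not in chosen:
            out.append(inst)
    return out


trace = [
    "ADD r4, r1, #1",
    "LSR r2, r2, #4",
    "XOR r5, r4, r2",
    "LD r3",
    "ADD r6, r5, r3",
    "XOR r7, r6, r8",
    "SHR",
]
chosen = select_subgraph(trace)
print(sorted(chosen))                   # [0, 2, 4, 5]
print(replace_subgraph(trace, chosen))  # ['LSR r2, r2, #4', 'LD r3', 'CUSTOM', 'SHR']
```

On the slide's trace the sketch selects the two ADDs and two XORs and leaves LSR, LD, and SHR in place, reproducing the four-instruction trace shown on the slide.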

Simulation
SimpleScalar ARM instruction set
4-wide Execution, 1 compute accelerator
128 RUU entries
32k inst. trace cache, 256 inst. traces
5000 cycle selection/insert latency
L1 I-cache: 32k, 2 way, 2 cycle hit
L1 D-cache: 32k, 4 way, 2 cycle hit

Varying CCA Latency
[Figure: chart of speedup (y-axis, 1.00 to 1.45) for SPECint, MediaBench, and Encryption benchmarks at CCA latencies (Lat) of 1, 2, 4, and 6]
Static Selection Dynamic Realization
Compiler selects subgraphs offline
Communicated to the hardware at load time
Control bits stored in a table and inserted at decode

Original code:
ADD r4, r1, #1
LSR r2, r2, #4
XOR r5, r4, r2
LD r3
ADD r6, r5, r3
XOR r7, r6, r8
SHR

Code with the subgraph delimited by the compiler:
LSR r2, r2, #4
LD r3
CCA_Start #2
ADD r4, r1, #1
XOR r5, r4, r2
ADD r6, r5, r3
XOR r7, r6, r8
CCA_End
SHR

[Figure: pipeline with I-Cache, Decode, Execute, and Retire stages, with a Control Table supplying control bits at Decode]
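
The two halves of this scheme can be sketched as a pair of functions: a compiler-side pass that groups the selected instructions between CCA_Start and CCA_End, and a decode-side pass that swaps each delimited region for a single accelerator operation carrying control bits from a table. This is an illustrative model only. Reading the #2 operand as a control-table index, the function names, and the table contents are assumptions, and the sketch does not check whether moving instructions is legal.

```python
def mark_subgraph(insts, chosen, entry):
    """Compiler side: group the chosen instructions and delimit them.

    Instructions outside the subgraph that appear before its last member are
    hoisted above the region (assumed legal here, not verified).
    """
    last = max(chosen)
    before = [inst for i, inst in enumerate(insts) if i < last and i not in chosen]
    region = [insts[i] for i in sorted(chosen)]
    after = insts[last + 1:]
    return before + [f"CCA_Start #{entry}"] + region + ["CCA_End"] + after


def decode(stream, control_table):
    """Decode side: replace each delimited region with one CCA operation."""
    out, entry = [], None
    for inst in stream:
        if inst.startswith("CCA_Start"):
            entry = int(inst.split("#")[1])  # assumed to index the control table
        elif inst == "CCA_End":
            out.append(("CCA", control_table[entry]))
            entry = None
        elif entry is None:
            out.append((inst, None))
        # Instructions inside a region are dropped: the CCA operation covers them.
    return out


original = [
    "ADD r4, r1, #1",
    "LSR r2, r2, #4",
    "XOR r5, r4, r2",
    "LD r3",
    "ADD r6, r5, r3",
    "XOR r7, r6, r8",
    "SHR",
]
marked = mark_subgraph(original, chosen={0, 2, 4, 5}, entry=2)
for inst in marked:
    print(inst)

# Hypothetical table loaded at program load time; the value is a placeholder.
control_table = {2: "control bits for entry 2"}
print(decode(marked, control_table))
```

Applied to the slide's code, `mark_subgraph` produces the same nine-line sequence shown on the slide, and `decode` reduces it to LSR, LD, one CCA operation, and SHR.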
Dynamic vs. Static Selection
[Figure: chart of speedup (y-axis, 1.00 to 1.45) comparing Dynamic Selection and Static Selection on SPECint, MediaBench, and Encryption benchmarks]
Summary
Transparent instruction set customization
Benefits of customization without changing ISA
Presented design of a compute accelerator
Handle majority of important computation subgraphs in many benchmarks
Developed ways to utilize the accelerator
Table-based static selection dynamic realization
Trace cache based dynamic selection dynamic realization
