Application-Specific Processing On A General Purpose Core Via Transparent Instruction Set Customization

University of Michigan
Electrical Engineering and Computer Science

1

Nathan Clark, Manjunath Kudlur, Hyunchul Park,
Scott Mahlke, Krisztián Flautner*

Advanced Computer Architecture Lab, University of Michigan
*ARM Ltd.
2
A Case for Customization
General purpose processors handle many applications fairly well, but
Each application has different requirements
Need for efficient execution

Impressive design wins through customization
Performance, power, area
Up to 3.5x speedup [Hot Chips 16]
3
Instruction Set Customization
Computationally demanding parts of applications run on special hardware
New instructions use the special hardware
[Figure: instruction sequence of XOR, MPY, LD, SHR, AND, and MOV operations, with part of it collapsed into a single CUSTOM instruction]
4
Traditional vs. Transparent Customization
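Traditional: high non-recurring engineering (NRE) costs
Transparent: universal accelerator, no ISA change
[Figure: side-by-side comparison of CPUs under traditional customization and CPUs with a Compute Accelerator (CCA) under transparent customization]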
5
Design of a Compute Accelerator
Goal: support important computation subgraphs
Array of function units
Exploits subgraph parallelism
Allows natural data propagation

[Figure: CCA built from rows of function units (FU) with inputs IN 1 and IN 2; pipeline diagram showing Fetch, Issue, two ALUs alongside the CCA, and WB]
6
CCA Shape: 164.gzip
[Figure: subgraph of Or, And, and Mov operations from 164.gzip mapped onto the FU array; each FU it uses is labeled 1]
7
CCA Shape: Blowfish
[Figure: subgraph of And, Xor, Add, and Mov operations from Blowfish mapped onto the FU array; FU labels now read 1 or 2]
8
CCA Utilization
Dynamic % of subgraphs using each FU (rows: CCA row, columns: FU position within the row)

Row      1      2      3      4      5      6      7
1    100.0   59.0   22.9   13.1    6.5    4.2    0.3
2     91.1   50.6    9.9    4.1    0.6    0.2    0.0
3     57.4   17.8    6.3    2.9    0.1    0.0    0.0
4     18.5    8.3    1.6    0.1    0.0    0.0    0.0
5      8.7    2.1    0.1    0.0    0.0    0.0    0.0
6      2.1    1.2    0.1    0.0    0.0    0.0    0.0
7      1.2    0.1    0.1    0.0    0.0    0.0    0.0
8      0.1    0.1    0.0    0.0    0.0    0.0    0.0
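
One way to read the utilization numbers is as a sizing guide: keep only the FUs that enough subgraphs actually use. The sketch below counts, for each row, the FUs at or above a cutoff. The 1% cutoff and the function name `row_widths` are assumptions made for illustration; the slides do not state how the row widths were chosen. With that cutoff the result happens to match the row widths of the depth-7 configuration in the synthesis table (6-4-4-3-2-2-1).

```python
# Dynamic % of subgraphs using each FU, copied from the CCA Utilization slide.
# Outer index: CCA row (1-8); inner index: FU position within the row (1-7).
UTILIZATION = [
    [100.0, 59.0, 22.9, 13.1, 6.5, 4.2, 0.3],
    [91.1, 50.6, 9.9, 4.1, 0.6, 0.2, 0.0],
    [57.4, 17.8, 6.3, 2.9, 0.1, 0.0, 0.0],
    [18.5, 8.3, 1.6, 0.1, 0.0, 0.0, 0.0],
    [8.7, 2.1, 0.1, 0.0, 0.0, 0.0, 0.0],
    [2.1, 1.2, 0.1, 0.0, 0.0, 0.0, 0.0],
    [1.2, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
]


def row_widths(table, cutoff_percent):
    """Count, per row, the FUs used by at least cutoff_percent of subgraphs.

    Rows in which no FU reaches the cutoff are left out of the result.
    """
    widths = [sum(1 for pct in row if pct >= cutoff_percent) for row in table]
    return [w for w in widths if w > 0]


# Assumed cutoff of 1%; not a value given in the slides.
print(row_widths(UTILIZATION, cutoff_percent=1.0))
```

Raising the cutoff trims the array toward the shallower configurations, at the cost of supporting fewer subgraphs.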
9
CCA Operations
Dynamic opcodes in important subgraphs
Excluded mpy/div, load/store, branch
Two main categories: logicals, adds
Subgraphs rarely have more than 3 dependent adds

Opcode      %
Add      28.7
And      12.5
Move     11.7
Sext     10.4
Lshift    9.8
Or        8.7
Xor       5.1
Sub       4.8
Rshift    2.4
Compare   0.4
10
Proposed CCA Design
4 inputs/2 outputs
Two FU types
Arith/logic
Logic
Crossbar between rows
Captures > 99% of important subgraphs
[Figure: CCA with inputs I1, I2, I3, I4 and outputs O1, O2]
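
The constraints on this slide (4 inputs, 2 outputs, two FU types arranged in rows) can be turned into a rough feasibility check for a candidate subgraph. The sketch below is a greedy placement under several assumptions that are not in the slides: configuration strings such as `6A-4L-4A-3L-2A-2L-1L` are read as row widths with `A` meaning arith/logic and `L` meaning logic-only, adds and subtracts need an `A` FU, a value may be forwarded past rows it does not use, and immediates do not count as inputs. The names `fits_cca`, `ARITH_OPS`, and `LOGIC_OPS` are hypothetical.

```python
ARITH_OPS = {"add", "sub"}               # assumed to need an arith/logic (A) FU
LOGIC_OPS = {"and", "or", "xor", "mov"}  # assumed to run on either FU type


def parse_config(config):
    """Turn '6A-4L-3A-2L' into [(6, 'A'), (4, 'L'), (3, 'A'), (2, 'L')]."""
    return [(int(part[:-1]), part[-1]) for part in config.split("-")]


def fits_cca(nodes, outputs, config, max_inputs=4, max_outputs=2):
    """Greedy check that a subgraph can be placed on the CCA rows.

    nodes:   list of (name, op, sources) in dependence order; a source that
             is not the name of an earlier node counts as an external input.
    outputs: names of nodes whose values are needed after the subgraph.
    """
    rows = parse_config(config)
    free = [width for width, _ in rows]
    placed = {}       # node name -> index of the row it was placed in
    external = set()  # distinct external input names
    for name, op, sources in nodes:
        if op not in ARITH_OPS | LOGIC_OPS:
            return False
        external.update(s for s in sources if s not in placed)
        # A node must sit below every node that produces one of its sources.
        first_row = 1 + max((placed[s] for s in sources if s in placed), default=-1)
        for r in range(first_row, len(rows)):
            kind_ok = rows[r][1] == "A" or op in LOGIC_OPS
            if kind_ok and free[r] > 0:
                free[r] -= 1
                placed[name] = r
                break
        else:
            return False  # no free FU of a suitable type in any later row
    return len(external) <= max_inputs and len(outputs) <= max_outputs


# ADD r4,r1,#1 / XOR r5,r4,r2 / ADD r6,r5,r3 / XOR r7,r6,r8 (immediate omitted).
EXAMPLE = [
    ("r4", "add", ["r1"]),
    ("r5", "xor", ["r4", "r2"]),
    ("r6", "add", ["r5", "r3"]),
    ("r7", "xor", ["r6", "r8"]),
]
print(fits_cca(EXAMPLE, ["r7"], "6A-4L-4A-3L-2A-2L-1L"))
```

Under these assumptions a chain of four dependent adds does not fit the depth-7 array, because only three rows contain arith/logic FUs. That is in line with the earlier observation that subgraphs rarely have more than 3 dependent adds.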
11
Synthesis of CCA
Synopsys design tools, 130nm library
Depth  Configuration          Control (bits)  Delay (ns)  Cell area (mm^2)  Subgraphs supported
7      6A-4L-4A-3L-2A-2L-1L   245             5.62        0.48              99.3%
6      6A-4L-4A-3L-2A-1L      229             4.56        0.45              95.1%
5      6A-4L-4A-2L-1L         197             3.50        0.40              87.6%
4      6A-4L-3A-2L            172             3.19        0.38              81.8%
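
The delay column translates into a CCA latency in cycles once a clock is fixed. The sketch below does that arithmetic for the four configurations. The 500 MHz clock is a hypothetical value chosen only for illustration (the slides do not give a target frequency), and `latency_cycles` is a name introduced here.

```python
import math

# (depth, delay in ns, cell area in mm^2, % of subgraphs supported),
# copied from the synthesis table.
SYNTHESIS = [
    (7, 5.62, 0.48, 99.3),
    (6, 4.56, 0.45, 95.1),
    (5, 3.50, 0.40, 87.6),
    (4, 3.19, 0.38, 81.8),
]


def latency_cycles(delay_ns, clock_mhz):
    """Whole clock cycles needed to cover delay_ns at the given clock."""
    return math.ceil(delay_ns * clock_mhz / 1000.0)


CLOCK_MHZ = 500  # hypothetical clock, not taken from the slides
for depth, delay_ns, area_mm2, supported in SYNTHESIS:
    cycles = latency_cycles(delay_ns, CLOCK_MHZ)
    print(f"depth {depth}: {cycles} cycles, {area_mm2} mm^2, {supported}% of subgraphs")
```

At this assumed clock the depth-5 and depth-4 arrays save one cycle over the deeper ones, so the choice of depth trades subgraph coverage against CCA latency.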
12
CCA Utilization
Options by selection (static or dynamic) and realization (static or dynamic)

Dynamic selection, dynamic realization:
+ No ISA change
+ No recompile
Simple selection
Hardware complexity

Static selection, dynamic realization:
+ Powerful selection
+ Simple hardware
Some ISA change
Recompile necessary

Static selection, static realization (ASIPs):
ISA change
High NRE
13

Dynamic Selection Dynamic Realization
Detect and replace subgraphs in fill unit of trace cache

[Figure: pipeline (Decode ... Execute ... Retire) fed by the I-Cache and Trace Cache; Trace Construction and Subgraph Selection and Insertion fill the trace cache]

Original trace:
ADD r4, r1, #1
LSR r2, r2, #4
XOR r5, r4, r2
LD r3
ADD r6, r5, r3
XOR r7, r6, r8
SHR

Trace after subgraph selection and insertion:
LSR r2, r2, #4
LD r3
CUSTOM
SHR
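
The replacement shown on this slide (seven instructions collapsing to LSR, LD, CUSTOM, SHR) can be sketched in software. The code below is an illustration under stated assumptions, not the fill unit's actual algorithm: it grows a subgraph upward from the last CCA-supported instruction, hoists the remaining earlier instructions above it, and checks that hoisting does not break a register dependence. The supported opcode set `CCA_OPS` is assumed (shifts are left out because the slide keeps LSR and SHR outside CUSTOM), operands the slide elides for LD and SHR stay elided, and the 4-input/2-output limits are not checked here.

```python
CCA_OPS = {"ADD", "SUB", "AND", "OR", "XOR", "MOV"}  # assumed supported opcodes


def select_subgraph(trace):
    """Grow a subgraph upward from the last CCA-supported instruction.

    trace: list of (opcode, dest, sources) tuples in program order.
    Returns the sorted indices of the selected instructions.
    """
    seeds = [i for i, (op, _, _) in enumerate(trace) if op in CCA_OPS]
    if not seeds:
        return []
    selected, worklist = set(), [seeds[-1]]
    while worklist:
        i = worklist.pop()
        if i in selected:
            continue
        selected.add(i)
        for src in trace[i][2]:
            # Find the most recent earlier instruction that writes this source.
            for j in range(i - 1, -1, -1):
                if trace[j][1] == src:
                    if trace[j][0] in CCA_OPS:
                        worklist.append(j)
                    break
    return sorted(selected)


def hoist_is_legal(trace, members):
    """Check that non-members can move above the subgraph members before them."""
    for i in range(members[-1]):
        if i in members:
            continue
        _, dest, sources = trace[i]
        for m in members:
            if m > i:
                break
            _, m_dest, m_sources = trace[m]
            # Reject read-after-write, write-after-read, and write-after-write.
            if m_dest in sources or dest in m_sources or (dest and dest == m_dest):
                return False
    return True


def replace_subgraph(trace):
    """Replace the selected subgraph with one CUSTOM op at its last position."""
    members = select_subgraph(trace)
    if len(members) < 2 or not hoist_is_legal(trace, members):
        return list(trace)
    last = members[-1]
    hoisted = [ins for i, ins in enumerate(trace) if i < last and i not in members]
    return hoisted + [("CUSTOM", None, [])] + list(trace[last + 1:])


# The trace from the slide; LD and SHR operands are elided there and here.
TRACE = [
    ("ADD", "r4", ["r1", "#1"]),
    ("LSR", "r2", ["r2", "#4"]),
    ("XOR", "r5", ["r4", "r2"]),
    ("LD", "r3", []),
    ("ADD", "r6", ["r5", "r3"]),
    ("XOR", "r7", ["r6", "r8"]),
    ("SHR", None, []),
]
print([op for op, _, _ in replace_subgraph(TRACE)])
```

On the slide's trace the selected subgraph is the two ADDs and two XORs, with r1, r2, r3, and r8 as its external inputs.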

14
Simulation
SimpleScalar ARM instruction set
4-wide execution, 1 compute accelerator
128 RUU entries
32k inst. trace cache, 256 inst. traces
5000 cycle selection/insert latency
L1 I-cache: 32k, 2-way, 2 cycle hit
L1 D-cache: 32k, 4-way, 2 cycle hit

15
Varying CCA Latency
[Figure: speedup (y-axis, 1.00 to 1.45) for SPECint, MediaBench, and Encryption benchmarks at CCA latencies (Lat) of 6, 4, 2, and 1]
16
Static Selection Dynamic Realization
Compiler selects subgraphs offline
Communicated to the hardware at load time
Control bits stored in a table and inserted at decode

Original code:
ADD r4, r1, #1
LSR r2, r2, #4
XOR r5, r4, r2
LD r3
ADD r6, r5, r3
XOR r7, r6, r8
SHR

Code with compiler-marked subgraph:
LSR r2, r2, #4
LD r3
CCA_Start #2
ADD r4, r1, #1
XOR r5, r4, r2
ADD r6, r5, r3
XOR r7, r6, r8
CCA_End
SHR

[Figure: pipeline (Decode ... Execute ... Retire) fed by the I-Cache, with a Control Table supplying control bits at Decode]
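
The table-based scheme on this slide can be illustrated with a small decode-stage model. The sketch below walks an instruction stream, and when it meets a `CCA_Start` marker it looks up control bits and emits a single CCA operation in place of the marked region. Several details are assumptions made for illustration: the operand of `CCA_Start` is treated as an index into the control table, the control-bit value is a placeholder, and falling back to the original instructions when the table has no entry is a guess at how such a marker could stay harmless on hardware without a matching entry.

```python
def decode(stream, control_table):
    """Replace each CCA_Start..CCA_End region with one CCA op at decode.

    stream:        list of instruction strings in program order.
    control_table: maps a subgraph id (assumed to be the CCA_Start operand)
                   to the control bits loaded for that subgraph.
    Returns a list of (instruction, control_bits) pairs.
    """
    decoded, i = [], 0
    while i < len(stream):
        ins = stream[i]
        if ins.startswith("CCA_Start"):
            subgraph_id = int(ins.split("#")[1])
            end = stream.index("CCA_End", i)
            if subgraph_id in control_table:
                # Table hit: issue one CCA op carrying the stored control bits.
                decoded.append(("CCA", control_table[subgraph_id]))
            else:
                # Assumed fallback: execute the marked instructions unchanged.
                decoded.extend((s, None) for s in stream[i + 1:end])
            i = end + 1
        else:
            decoded.append((ins, None))
            i += 1
    return decoded


# The compiler-marked code from the slide.
STREAM = [
    "LSR r2, r2, #4",
    "LD r3",
    "CCA_Start #2",
    "ADD r4, r1, #1",
    "XOR r5, r4, r2",
    "ADD r6, r5, r3",
    "XOR r7, r6, r8",
    "CCA_End",
    "SHR",
]
CONTROL_TABLE = {2: 0b1011}  # placeholder control bits, not real CCA encoding
print([ins for ins, _ in decode(STREAM, CONTROL_TABLE)])
```

Because the compiler has already grouped the subgraph and moved LSR and LD ahead of it, the hardware only needs a table lookup at decode rather than its own subgraph search.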
17
Dynamic vs. Static Selection
[Figure: speedup (y-axis, 1.00 to 1.45) of dynamic selection and static selection for SPECint, MediaBench, and Encryption benchmarks]
18
Summary
Transparent instruction set customization
Benefits of customization without changing ISA
Presented design of a compute accelerator
Handles the majority of important computation subgraphs in many benchmarks
Developed ways to utilize the accelerator
Table-based static selection, dynamic realization
Trace-cache-based dynamic selection, dynamic realization
