Escolar Documentos
Profissional Documentos
Cultura Documentos
Explore
Discover
Understand
Discover,
Understand,
Improve
Improve
Simulation runtimes
Automotive Crash
Stent Expansion
Gasket Sealing
20
90
80
70
60
50
40
15
30
10
20
10
0
2008
2006
5
100
2009
2007
2010
2008
50
New code
allows x86
cores to
offload key
compute
work onto
one or more
Nvidia Tesla
cards to
accelerate
computation
Abaqus
3DS.COM Dassault Systmes | Confidential Information | 5/16/2012 | ref.: 3DS_Document_2012
PCI
RAM
RAM
RAM
RAM
switch
Machine
Socket
Core
GPU
PCI
RAM
PCI
RAM
PCI
RAM
RAM
RAM
Hybrid Mode
X86 cores and GPU card are used
processes
X86
GPU
PCI bus
X86
GPU
RAM
GPU As Platform
Very limited control compute on
X86; bulk of work done on GPU card
X86
GPU/MIC
Abaqus Solvers
3DS.COM Dassault Systmes | Confidential Information | 5/16/2012 | ref.: 3DS_Document_2012
Nature of Code
Finite Elements
Elements
Complex parallelism.
Abaqus/Standard
Abaqus/Explicit
Cost of Code
Abaqus/Standard
13 Million DOF Powertrain Analysis, 6 Iterations,
1.1E+14 FLOPS per factorization
480
ABAQUS/Standard
420
Direct Solver
360
300
240
180
120
60
1 host, 8 cores
10
Abaqus/Standard
3DS.COM Dassault Systmes | Confidential Information | 5/16/2012 | ref.: 3DS_Document_2012
11
RAM
Speedup
5.7X
4.1X
2.8X
1.8X
RAM
Abaqus/Standard
PCI
1400
RAM
1000
RAM
RAM
GPU
800
600
GPU
1.4
1.4
1.7
1297
1.4
1038
1.9
904
400
1 GPU
2 GPU
949
1.5
697
692
560
502
200
No GPU
375
2.4
537
348
221
Problem 1 Total
12
RAM
1200
Time (seconds)
PCI
Problem 2 Total
Problem 1 Solver
Problem 2 Solver
1.5 MDOF
5.37 teraflops per iteration
Problem 2
3.4 MDOF
10.8 teraflops per iteration
Abaqus/Explicit
3DS.COM Dassault Systmes | Confidential Information | 5/16/2012 | ref.: 3DS_Document_2012
13
Abaqus/Explicit Prototype
3DS.COM Dassault Systmes | Confidential Information | 5/16/2012 | ref.: 3DS_Document_2012
X86 Socket
14
PCI bus
GPU
PCI
RAM
RAM
GPU
Nvidia Tesla 21 ms
15
Given peak performance of GPU near 1 TFlop vs peak performance for Westmere of ~70 Gflops
why does Explicit code sped up by only 3X?
do k = 1, groupsize
temp1(k) = in1(k) * in2(k)
Computational
Unit
Cache
L1
L2
L3
do k = 1, groupsize
temp2(k) = temp1(k) * in3(k)
do k = 1, groupsize
out(k) = temp2(k) * x
High degree of data parallelism, but each
piece of data is used a small number of
times in operations
768 KB
~150 GB/sec
~25 GB/sec
RAM
GPU System
Memory
Conclusions
16
Highly data parallel code does have bandwidth issues on GPU which limits gains
Data management between X86 memory and GPU is a key topic
17