Verification of Producer-Consumer Synchronization in GPU Programs

[ACM/PLDI artifact-evaluation badges]
sisted for many years. Completeness guarantees that all violations reported by WEFT are actual bugs; there are no false positives. Finally, WEFT is efficient enough to be practical for verifying real codes; WEFT runs on complex, warp-specialized kernels consisting of thousands of lines of code and with billions of potential races within a few minutes.

The rest of this paper is organized as follows. In Section 2, we provide the necessary background for understanding named barriers and the construction of warp-specialized kernels. We also provide a concrete example of an interesting warp-specialized kernel from a real application that demonstrates the need for formal verification. Each of the remaining sections describes one of our primary technical contributions.

• We introduce a core language for warp-specialized kernels that contain named barriers. We give a formal operational semantics for this language and show how it models the necessary properties for checking correctness of warp-specialized kernels (Section 3). These semantics are a proposed formalization of the English description of named barriers in the PTX manual [1] and can be re-used by future tools that attempt to verify programs that use named barriers.

• We formally define the correctness properties for warp-specialized kernels and provide algorithms for verifying these properties. Our algorithms (Section 4) are sound (do not miss any errors), complete (all reported errors are true errors), and efficient.

• We describe our implementation of WEFT, a verification tool for checking warp-specialized code containing named barriers. We describe the important optimizations that permit WEFT to scale to kernels that execute hundreds of millions of program statements. Using WEFT we check the correctness of all warp-specialized kernels of which we are aware and discover several non-trivial bugs (Section 5).

Section 6 covers related work and Section 7 concludes.

2. Background and Motivation

In this section we briefly cover the necessary details for understanding GPU programming (Section 2.1) and warp specialization (Section 2.2). We then introduce a motivating warp-specialized kernel that illustrates the need for verification of code using named barriers (Section 2.3).

2.1 GPU Architecture and Programming

While there are several different frameworks currently available for programming GPUs, they all provide variations on the same data-parallel programming model. For the rest of this work, we use CUDA as a proxy for this model as it is the only interface that supports the named barrier primitives necessary for constructing warp-specialized kernels.

When a CUDA kernel is launched on a GPU, a collection of threadblocks or cooperative thread arrays (CTAs for the remainder of this paper) is created to execute the program. Each CTA is composed of up to 1024 threads. The hardware inside of the GPU dynamically assigns CTAs to one of the streaming multiprocessors (SMs) inside of the GPU. To communicate between threads within the CTA, CUDA provides an on-chip software-managed shared memory visible to all the threads within the same CTA. To coordinate access to shared memory, CUDA supports the __syncthreads primitive, which performs a CTA-wide blocking barrier. Synchronizing between threads within the same CTA is the only synchronization supported by CUDA because threads within the same CTA are the only ones guaranteed to be executing concurrently.

[Figure 1: a producer warp and a consumer warp synchronizing through two named barriers. On Named Barrier 0 the producer issues bar.sync (wait until begin) while the consumer issues bar.arrive (signal begin); on Named Barrier 1 the producer issues bar.arrive (signal ready) while the consumer issues bar.sync (wait until ready).]
Figure 1. Producer-consumer named barriers example.

While logically CTAs represent a flat collection of threads, in practice the threads of a CTA are organized into groups of 32, referred to as warps. The hardware within an SM makes scheduling decisions at the granularity of warps. When a warp is scheduled, all threads within the warp execute the same instruction. In the case of a branch instruction, the SM first executes the warp for the not-taken branch with the threads that took the branch masked off. The SM then executes the taken branch with the complementary set of threads in the warp masked off. The resulting serialized execution of different branches within a warp is referred to as branch divergence and is a common source of performance degradation. The crucial insight for enabling warp-specialized programs is that there is no penalty if threads in different warps diverge. As long as all threads within a warp continue to execute the same instruction stream, there is no performance degradation.

2.2 Warp Specialization and Named Barriers

A warp-specialized kernel is one in which individual warps are assigned different computations via control divergence contingent upon warp IDs. There are a number of advantages to specializing warps. For example, CudaDMA [6] specializes warps into compute and DMA warps to optimize the loading of data into shared memory. By separating memory operations from arithmetic, both instruction streams can be better optimized by the CUDA compiler. Alternatively, the Singe compiler [7] specializes warps to handle the mapping of fine-grained static dataflow graphs onto CTAs. Consequently, Singe can leverage task parallelism on GPUs in the absence of significant data-parallelism, and fit very large working sets into on-chip memories by blocking data for the GPU register file.

Similar to data-parallel CUDA kernels, warp-specialized kernels communicate through shared memory. However, unlike data-parallel kernels, communication in warp-specialized kernels is asymmetric, with one or more warps acting as producers, and one or more other warps acting as consumers. In CudaDMA, DMA warps act as producers while compute warps act as consumers. For Singe kernels, dataflow edges between operations assigned to different warps define producer-consumer relationships between pairs of warps. To handle producer-consumer synchronization, NVIDIA GPUs support hardware named barriers in PTX [1]. PTX provides two instructions for arriving at a named barrier: sync and arrive. A sync instruction causes the threads in a warp arriving at the barrier to block until the barrier completes. Alternatively, an arrive instruction allows a warp to register arrival at a barrier and immediately continue executing without blocking. Both sync and arrive instructions require the total number of threads participating in the barrier to be specified. Importantly, the number of participants can be less than the total number of threads in the CTA.

Using both the sync and arrive instructions (via inline PTX assembly), CUDA applications can encode producer-consumer synchronization. Figure 1 demonstrates the usage of two named barriers to encode a producer-consumer relationship between a pair of warps. A producer warp initially blocks on named barrier 0 using a sync instruction to wait until it is safe to write data into shared memory. At some point the consumer warp signals that it is
    __global__ void __launch_bounds__(64,1) example_deadlock(void) {
      assert(warpSize == 32);
      assert(blockDim.x == 64);
      int warp_id = threadIdx.x / 32;
      if (warp_id == 0) {
        // bar.sync <barrier name>, <participants>;
        asm volatile("bar.sync 0, 64;");
        asm volatile("bar.arrive 1, 64;");
      } else {
        asm volatile("bar.sync 1, 64;");
        asm volatile("bar.arrive 0, 64;");
      }
    }

Listing 1. Example of deadlock using named barriers.

[Figure 2: a static dataflow graph of operations A-J mapped onto Warps 0-3. Cross-warp edges are synchronized with named barriers: a first generation of Named Barrier 2 (Warp 0: arrive, Warp 3: sync), Named Barrier 1 (Warp 0: arrive, Warp 2: sync), Named Barrier 0 (Warp 1: sync, Warp 3: arrive), Named Barrier 3 (Warp 1: sync, Warp 2: arrive), and a second generation of Named Barrier 2 (Warp 1: sync, Warp 2: arrive).]
Figure 2. Motivating warp specialization example.

safe to write by performing an arrive instruction on named barrier 0. Since it issued a non-blocking arrive instruction, the consumer warp can perform additional work while waiting for the producer warp to generate the data. Eventually, the consumer warp blocks on named barrier 1 waiting for the data to be ready. The producer warp signals that the data is ready by arriving on named barrier 1 with a non-blocking arrive so that it can continue executing.

Each SM on current NVIDIA GPUs contains 16 physical named barriers numbered 0 through 15. Once a named barrier completes, it is immediately re-initialized so that it can be used again. We refer to each use of a named barrier as a generation of the named barrier. Different generations of a named barrier can have different numbers of participants.

The use of named barriers can create many interesting failure modes for kernels that will not normally arise within the standard CUDA programming model. For example, named barriers introduce the possibility of deadlock. Listing 1 shows a simple warp-specialized kernel which deadlocks for all CTAs. Each of the two warps in this example blocks waiting on the other warp to arrive on a different named barrier, resulting in a cyclic synchronization dependence. It is important to note that deadlock is a problem specific to warp-specialized code. It is impossible to create such circular waits using the traditional __syncthreads primitive because all invocations map to named barrier 0.²

2.3 Motivating Kernel

To motivate the need for verification of warp-specialized kernels, we give an example from the Heptane chemistry kernel emitted by the Singe DSL compiler [7]. This kernel is currently deployed in a production version of the combustion simulation S3D [9], which is being used for doing advanced energy research on Titan, the number two supercomputer in the world [26]. Figure 2 illustrates how part of the computation represented as a static dataflow graph is mapped onto four warps. All computations (nodes in the dataflow graph) represent warp-wide data-parallel operations consisting of straight-line code. Where dataflow edges cross warp boundaries, data must be communicated through shared memory. All cross-warp dataflow edges (labeled and shown in red) are assigned a named barrier to use when synchronizing access to shared memory for communicating data.

There are two important properties to be observed regarding this example kernel. First, notice the re-use of named barrier 2 in Figure 2. To avoid conflicting arrivals at named barrier 2 from participants of different generations, a happens-before relationship [16] must be established between the completion of the previous generation of a named barrier and all participants of the next generation. From Figure 2, we know barrier 2 completes before operation D is run. The happens-before relationships established by the paths D → E → F → I and D → E → G, using named barriers 1 and 3, ensure the participating warps in the next generation of barrier 2 (warps 1 and 2) both have a happens-before relationship with the previous generation of barrier 2; therefore barrier 2 can be safely recycled. Note that the path A → B → F → I is insufficient for establishing a happens-before relationship because warp 0 performs a non-blocking arrive on the first generation of named barrier 2 and therefore cannot guarantee that the barrier actually completed. For the Heptane kernel to run correctly, all barriers must successfully execute and be properly re-used. Informally, we refer to a kernel that meets these criteria as well-synchronized (we give a formal definition of well-synchronized in Section 4). In the case of the Heptane chemistry kernel, which consists of more than 10K lines of CUDA code, checking for the well-synchronized property by hand is impractical.

The second important property of the Singe kernel is that the synchronization pattern is completely static. While there is significant parallelism in the kernel, the use of named barriers constrains the synchronization so that it is both deterministic and known at compile-time. Furthermore, each thread executes straight-line code with no dynamic branches or loops. These properties are important as the undecidability result of [25] shows that it is impossible to obtain complete solutions in the presence of arbitrary control flow and synchronization. In practice, we found that all warp-specialized kernels we considered consisted entirely of code with only statically-analyzable branches, loops, and synchronization patterns. While this may seem unusual for general purpose code, it is common for many GPU kernels due to the high cost of dynamic branching associated with the in-order instruction pipelines of current GPUs.³ We also found that all accesses to global memory and shared memory could be completely understood statically as functions of CTA and thread ID. Consequently, the operational semantics we present in Section 3 are for kernels with statically analyzable control flow and data flow, which is sufficient for handling all of the warp-specialized code known to us.

3. Operational Semantics

Before describing our formal language, we first scope the domain of our problem. We are interested in formally checking for deadlock-freedom, proper named barrier recycling, and shared memory data race freedom within individual CTAs. We make no claim of handling inter-CTA synchronization⁴ used by persistent uber-kernels such as OptiX [24]. We also do not consider software-

² Using __syncthreads inside of warp-divergent conditional statements is illegal in the CUDA programming model and has undefined semantics, but in practice can also result in deadlock on some architectures.
³ It is also simple to extend our results to programs with loops and dynamic branches if these constructs do not contain synchronization operations.
⁴ Which is not officially supported by the CUDA programming model.
level intra-CTA synchronization which can be constructed with atomic primitives (but is at least an order of magnitude slower than named barriers). We view the verification of atomics as an orthogonal problem to the verification of named barrier usage. We rely on a symmetry assumption that all CTAs execute the same program (albeit with different input data) and therefore we only need to model and verify a single CTA to establish correctness. The validation is limited to the accesses to a GPU's shared memory, which has a well-defined semantics due to the software-managed nature of the cache. We do not attempt to reason about data passed between threads through global memory, both because of the weak semantics [3] and because global memory is not used for communication in any of the warp-specialized kernels of which we are aware.

3.1 Syntax

A GPU program has an arbitrary number of non-interfering CTAs. Each CTA has N (typically between 32 and 1024) threads that can synchronize for access to shared memory. We denote a thread by P and a CTA by T. We use P1 || P2 || ... || PN to denote a CTA with threads P1, P2, ..., PN. Variables in shared memory locations are denoted by g. For simplicity, we assume that all variables occupy 64 bits. We consider abstract thread programs for each thread, where everything has been abstracted away except the instructions required to reason about synchronization and shared memory accesses. Each thread program has a separate thread identifier denoted by id. We use i to range over thread identifiers. The hardware provides B (typically 16) named barriers and b refers to a specific barrier name.

A thread program has the following grammar:

    P ::= return | c; P
    c ::= read g | write g | arrive b n | sync b n

A thread program is a sequence of commands (straight-line code). In each command c a thread can either read/write from a shared memory location or perform a synchronization operation. Blocking synchronization operations are performed by sync and non-blocking operations by arrive. The read and write commands are treated as no-ops in this section. They are only useful for detecting data races and play no role in the semantics of named barriers. For synchronization commands (sync and arrive), the first argument b represents the name (ID) of a named barrier, and the second argument n represents the expected number of threads to register at this generation of the barrier. Successive commands are separated by a semicolon. A thread terminates by executing a return.

In this syntax, the standard barrier __syncthreads is expressed as sync 0 N: a sync across all threads in a CTA on barrier 0.

Program points are defined in the standard manner: before and after each command c. Each program point is uniquely identified by the command just before it. We say that a program point and command c correspond if the program point is the one just after c. The first program point of thread i is denoted by I_i and the last by F_i. We omit the superscripts when the thread is clear from the context.

Our semantics does not assume nor require warp-synchronous execution. Warp-synchronous execution is the assumption that all threads within a warp will execute in lock-step. This assumption has traditionally held for past GPU architectures, but it is not standardized and may be invalidated by future designs. We can easily model warp-synchronous execution in our language by inserting a sync command across all threads in a warp after every original command in a program, thereby ensuring that the threads within a warp execute in lockstep.

3.2 State

We first define a state s of a CTA. It consists of the following:

1. An enabled map E that maps thread identifiers to booleans signifying whether the thread is enabled or not. Threads are disabled when they block on a barrier.

2. A barrier map B that maps barrier names to a triple consisting of a list I of threads that have synced at the barrier, a list A of threads that have arrived at the barrier, and the thread count, describing the number of threads the barrier is expecting to register if it is configured.

The thread count is configured by the first thread that reaches a barrier, hence the need for all synchronization commands to specify the expected number of participants. The thread count of unconfigured barriers is denoted by ⊥. An empty list of thread identifiers is denoted by [] and :: adds a thread to a list. The number of elements in a list L is denoted by |L|.

For a map A, we use A[x/y] to denote the map that agrees with A on all inputs except y and maps y to x. Similarly, A[x/Y] denotes a map that agrees with A on all inputs that are not in Y and maps all y ∈ Y to x. The function ite(e1, e2, e3) stands for if e1 then e2 else e3.

The initial state I has ∀i. E(i) = true and ∀b. B(b) = ([], [], ⊥). Plainly stated: all threads are enabled and ready to execute, no thread has registered at any barrier, and all barriers are unconfigured. We use done to denote a CTA with no more commands to execute.

3.3 Semantics

We describe the small-step operational semantics of a CTA T with threads P1, ..., PN executing in parallel. Recall that a state s has two components E and B. The rules have the following form:

    E, B, T → E', B', T'
    E, B, (i, P) → E', B', (i, P')

For the two kinds of rules, a CTA/thread program executing in some state takes a single step, which results in a new CTA/thread program and a new state. Here i denotes the thread ID of the thread program (equivalent to threadIdx.x for one-dimensional CTAs, which are common for warp-specialized kernels). We now discuss the operational semantics rules for each of the commands starting with the execution of a CTA.

    ∀b. B(b) = ([], [], ⊥) ∨ (B(b) = (I, A, n) ∧ |I| + |A| < n)
    E, B, (i, Pi) → E', B', (i, Pi')
    ─────────────────────────────────────────────────────────────
    E, B, P1 || ... || Pi || ... || PN → E', B', P1 || ... || Pi' || ... || PN

The first antecedent says that for all barriers b, either b is unconfigured or needs to register more threads. In this situation, we non-deterministically choose a thread in the CTA and execute it for one step. If we have N thread programs executing concurrently, any one of these can make an evaluation step from Pi to Pi' with some side-effects on the state.

Next, if some barrier has registered the required number of threads, then wake up all the threads blocked at the barrier and recycle the barrier.

    B(b) = (I, A, n)    |I| + |A| = n
    T = P1 || ... || PN    ∀i. Pi = ite(i ∈ I, sync b n; Pi', Pi')
    T' = P1' || ... || PN'    E' = E[true/I]
    ─────────────────────────────────────────────────────────────
    E, B, T → E', B[([], [], ⊥)/b], T'

If the correct number of threads (n) have registered at b, then enable all threads that were blocked on b (contained in I), change b to an unconfigured state, and update the control of all threads in I. This rule recycles a barrier by returning it to an unconfigured state.
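To make the state components and the A[x/y] map-update notation concrete, the following is a small illustrative sketch of ours in Python (WEFT itself is not written this way): it walks barrier 0 through one configure/complete cycle, with `None` standing in for the unconfigured thread count ⊥.

```python
# Illustrative sketch (ours, not part of WEFT): the enabled map E and the
# barrier map B as Python dicts; None stands in for the unconfigured count.

def update(m, x, y):
    """The notation A[x/y]: a map agreeing with m except that y maps to x."""
    out = dict(m)
    out[y] = x
    return out

E = {0: True, 1: True}        # all threads start enabled
B = {0: ([], [], None)}       # barrier 0 starts unconfigured: ([], [], bottom)

# Thread 0 executes `sync 0 2`: the first registration configures the thread
# count, and the thread blocks (added to the synced list I, disabled in E).
B = update(B, ([0], [], 2), 0)
E = update(E, False, 0)

# Thread 1 executes `arrive 0 2`: added to the arrived list A, not blocked.
I, A, n = B[0]
B = update(B, (I, A + [1], n), 0)

# Now |I| + |A| = n, so the completion rule fires: wake every thread in I
# and recycle the barrier back to the unconfigured state.
I, A, n = B[0]
if len(I) + len(A) == n:
    for t in I:
        E = update(E, True, t)
    B = update(B, ([], [], None), 0)
```

After the completion step, both threads are enabled again and barrier 0 is ready for its next generation, exactly as the recycle rule above prescribes.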
The execution terminates when all threads execute a return.

    ─────────────────────────────────────────────────────────────
    E, B, return || ... || return → E, B, done

The hardware does not care about the state of the barriers when the CTA exits as the barriers are reset before starting a new CTA.

The first synchronization operation configures the barrier and determines the number of threads required.

    B(b) = ([], [], ⊥)    B' = B[([], [] :: id, n)/b]
    ─────────────────────────────────────────────────────────────
    E, B, (id, arrive b n; c) → E, B', (id, c)

The arriving thread configures the barrier with the thread count (n) and gets added to the list of arrives. Programs with an invalid thread count (more than the maximum threads permitted in a CTA⁵) can be rejected by a preprocessing phase before execution begins.

If the first thread to register with a barrier is a sync, the semantics are similar to arrive. The thread updates the enabled map and is added to the list of blocked threads.

    E(id) = true    E' = E[false/id]
    B(b) = ([], [], ⊥)    B' = B[([] :: id, [], n)/b]
    ─────────────────────────────────────────────────────────────
    E, B, (id, sync b n; c) → E', B', (id, sync b n; c)

hold for all possible executions, and using the relationships to prove the important correctness properties. We begin by providing formal definitions for the correctness properties that we intend to verify.

4.1 Preliminaries

Recall that a state s has two components E and B. We use (s, T) to abbreviate E, B, T. A configuration C is a pair of a state s and a CTA T. First we need to define execution traces.

Definition 1. A partial trace or a subtrace is a sequence of configurations (s0, T0), ..., (sn, Tn) such that for any two successive configurations we have (sj, Tj) → (sj+1, Tj+1).

A complete trace ends in either done, err, or a deadlock.

Definition 2. A (complete) trace starting from a configuration (s, T) is a subtrace (s, T), ..., (s', done), or (s, T), ..., (err, T'), or (s, T), ..., (s', T') where T' ≠ done and no rule is applicable (deadlock).

As is standard, C → C' denotes that C evaluates to C' in one step; C →* C' denotes that C evaluates to C' in zero or more steps, and
0
the thread programs and simulates one execution of the CTA with belongs to this barrier interval or not, i.e., whether k (1, 3).
only the synchronization commands. Next, W EFT uses the algo- The crucial insight for scalability is that W EFT only needs
rithm in Figure 4 to determine if the CTA is well-synchronized. to perform latest/earliest happens-before/after reachability analysis
Since memory accesses do not affect synchronization, they can for all the nodes in the barrier dependence graph and not for every
be safely ignored without compromising the soundness or com- program point in the thread programs. Individual program points
pleteness of W ELL S YNC. This simplification considerably reduces can later easily determine their latest/earliest happens-before/after
the number of commands to consider and makes the computation points by using the results from the dynamic barriers that imme-
tractable for all of the kernels of which we are aware. diately precede and follow within the same thread program. This
After verifying that the CTA is well-synchronized, W EFT con- greatly reduces both computation time and overall memory usage.
structs the barrier dependence graph. Each completed barrier is W EFT first computes the earliest/latest happens-after/before
converted to a single node in the graph and is assigned a gener- points reachable in every thread program from each node in the bar-
ation. We call the nodes of the graph dynamic barriers. We use rier dependence graph. Each dynamic barrier initializes its reacha-
cPi to denote the i
th
command of program P . The trace of Sec- bility points with its participants and sets the remaining values to
Q
tion 4.2 creates the following nodes for Figure 3: n1 has cP 1 , c1 , . The transitive closure of reachability points is then computed
P Q P Q over the barrier dependence graph. W EFT avoids recomputation by
generation 1; n2 has c3 , c2 , generation 1; n3 has c4 , c4 , genera-
tion 2; n4 has cP Q
5 , c6 , generation 2. Edges are added to the graph
using a topological sort of the barrier nodes to guide the computa-
by traversing each thread program both forward and backwards to tion, requiring at worst O(D2 ) time for a kernel with D dynamic
establish happens-before and happens-after relationships between barrier nodes. Since D is usually much smaller than the total num-
pairs of dynamic barriers. Figures 6 and 7 show the computed bar- ber of commands, this approach is considerably faster. Further-
rier dependence graphs for the earlier examples of Figures 2 and more, W EFT can leverage the knowledge that a fixed number of
5 respectively. Note that the barrier dependence graph of a well- dynamic barriers (usually 16) are live at a time to further improve
synchronized CTA is always a directed acyclic graph. performance. If the kernel contains N thread programs, the results
After verifying that a kernel is well-synchronized and generat- only require O(DN ) space to store, which is independent of the
ing the barrier dependence graph, W EFT proceeds to check if a ker- total number of kernel commands. This computation produces the
nel is race-free. Due to the large number of commands W EFT can- barrier intervals for the synchronization commands. For example,
Q Q
not compute the standard transitive closure of Figure 4 and it cannot the generated barrier interval of cP 1 n1 in thread Q is (c0 , c2 ),
th
traverse the graph of all commands. In order to scale to kernels re- where the 0 command is just a placeholder for the beginning of
Q Q
quiring hundreds of billions of race tests, W EFT performs constant a program. Similarly the barrier interval for cP 3 n2 is (c1 , c3 ).
time race tests by computing the latest happens-before and earliest For individual program commands, W EFT computes the ear-
happens-after points reachable in every thread program from every liest/latest happens-after/before relationships based on the results
Kernel Name Dynamic Shared Commands Race Tests Programs Memory Verification Races
Barriers Addresses (Thousands) (Millions) (threads) Usage (MB) Time (s) (B,H)
saxpy single 8192 512 3670 1878 320 3645 8.42 (no,no)
saxpy double 8192 1024 3670 939 384 4298 8.99 (no,no)
sgemv vecsingle 257 128 2142 17246 160 247 46.42 (no,no)
sgemv vecdouble 514 256 4285 34492 192 498 95.19 (no,no)
sgemv vecmanual 258 1024 8446 34490 160 814 147.81 (no,no)
sgemv bothsingle 514 4128 1287 4370 288 207 10.15 (no,no)
sgemv bothdouble 1028 8256 2573 8740 448 507 25.30 (no,no)
sgemv bothmanual 516 8256 1320 2185 416 305 7.74 (no,no)
rtm 1phase single 215 633 376 130 160 67 0.43 (Yes,no)
rtm 1phase manual 214 1265 372 74 160 67 0.26 (Yes,no)
rtm 2phase single 215 633 387 136 160 70 0.39 (Yes,Yes)
rtm 2phase manual 214 1265 382 77 160 69 0.32 (Yes,Yes)
rtm 2phase quad 208 1265 371 71 160 68 0.28 (Yes,Yes)
Figure 9. Verification results for the CudaDMA kernels. Races are classified as (benign,harmful).
Kernel Name Dynamic Shared Commands Race Tests Programs Memory Verification Races
Barriers Addresses (Thousands) (Millions) (threads) Usage (MB) Time (s) (B,H)
dme diff fermi 352 1930 2912 169264 320 600 398.00 (no,no)
dme diff kepler 352 1920 1075 255 320 411 1.40 (no,no)
dme visc kepler 128 1920 2058 1038 480 385 3.24 (no,no)
dme chem fermi 864 5536 5156 40463 512 1917 53.74 (no,no)
dme chem kepler 1760 3072 3866 2486 256 1143 11.91 (no,Yes)
hept diff kepler 416 1664 1490 633 416 723 2.60 (no,no)
hept visc fermi 128 1690 5897 168972 416 693 416.92 (no,no)
hept visc kepler 128 1664 2955 2530 416 419 5.22 (no,no)
hept chem fermi 992 5192 7572 83762 640 3330 135.03 (no,no)
hept chem kepler 928 6144 5241 2375 512 2153 11.82 (no,Yes)
prf diff kepler 672 3712 6414 5729 928 5099 22.18 (no,no)
prf visc fermi 4 3776 901 1737 1024 114 2.64 (no,no)
prf visc kepler 128 3712 14213 26706 1024 2123 68.43 (no,no)
Figure 10. Verification results for the Singe kernels. Races are classified as (benign,harmful).
stored in the barrier dependence graph. For each command, W EFT nels ranged from 1684-13245 lines of CUDA. All our experiments
consults the barriers in the same thread immediately before/after were run on an Intel Xeon X5680 Sandy Bridge processor clocked
the command with different physical barrier names to determine at 3.33 GHz with 48 GB of memory divided across two NUMA
the latest/earliest happens before/after for each thread program. For domains. The performance of W EFT is primarily limited either by
example, the barrier interval of write cP 2 is computed using the memory latency or memory bandwidth depending on both the pro-
barrier intervals of nodes n1 and n2 . If there are S commands in cessor on which it is being run and the kernel being verified. W EFT
the program, this requires O(S) time to compute because there are is multi-threaded to take advantage of the considerable amount of
a constant number of dynamic barriers to consider (the maximum memory-level parallelism available in our verification algorithm.
number of live barriers). For a kernel with N thread programs, the All our experiments using W EFT were run with four threads (two
result of this computation requires O(SN ) memory to store. While per NUMA domain) as we found this to be the ideal settings for
this is large, for most kernels it requires at most tens of gigabytes of data, which is within the range of many machines today.

After establishing the earliest happens-after and latest happens-before relationships for each program command, WEFT examines all pairs of shared memory accesses to a common address in which at least one access is a write and determines if there is a race. Each race test is a constant-time look-up, since each command knows the earliest happens-after and latest happens-before command that can be reached (if any) in every other thread program. As we will show in Section 5.3, this approach allows WEFT to validate very large and complex warp-specialized kernels in a reasonable amount of time and memory.

5.3 Experiments

We evaluated WEFT by using it to validate 13 warp-specialized kernels that use the CudaDMA library and 13 warp-specialized kernels that were emitted by the Singe compiler. The CudaDMA kernels ranged from 65-1561 lines of CUDA, while the Singe kernels [...] maximizing memory system performance on our target machine.

Figures 9 and 10 show the verification results for the CudaDMA and Singe kernels respectively. WEFT reports statistics on each of the kernels, including the number of dynamic barriers, the total shared memory addresses that are used, the total commands in the formal language from Section 3 across all threads, the number of thread programs, and the total race tests that were performed. Performance is measured by the time required to verify each kernel and the total memory required to perform the verification. WEFT reports whether the kernel is well-synchronized and race-free. Not surprisingly, all the kernels we examined were well-synchronized, and we therefore omit these results from the tables. Kernels that are not well-synchronized tend to hang or crash frequently. While it is easy to cause such bugs to manifest, it is much more difficult to diagnose their root cause. In the future, we anticipate WEFT will be especially valuable in giving feedback for debugging warp-specialized kernels that either deadlock or crash due to a violation of the well-synchronized property during development.
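The constant-time race test described above, in which each command carries the index of the earliest happens-after and latest happens-before command in every other thread program, can be sketched in a few lines of C++. This is a simplified model, not WEFT's actual implementation: the `Command` struct and its precomputed index fields are hypothetical stand-ins for the tables WEFT builds, shown here for a pair of thread programs.

```cpp
// A command in one thread program, reduced to what the race check needs.
struct Command {
    int addr;           // shared memory address touched
    bool is_write;      // true for a write access
    // Precomputed by the happens-before analysis, relative to the other
    // thread program: earliest command index there that happens-after this
    // command, and latest index there that happens-before it (-1 if none).
    int earliest_after;
    int latest_before;
};

// Constant-time race test: command a (in one thread program) races with the
// command at index j in the other thread program iff they touch the same
// address, at least one is a write, and neither ordering holds.
bool races(const Command& a, int j, const Command& b) {
    if (a.addr != b.addr || !(a.is_write || b.is_write)) return false;
    bool ordered = (j <= a.latest_before) || (j >= a.earliest_after);
    return !ordered;
}
```

Because the two index comparisons replace any traversal of the happens-before graph, checking all pairs of conflicting accesses costs constant time per pair, which is what makes the billions of race tests reported below feasible.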
 1  __global__ void two_phase(...) {
 2    __shared__ float2 s_PQ[...];
 3    float2 *PQy_buf = s_PQ + PQY_BUF_OFFSET;
 4    ...
 5    if (tid < COMPUTE_THREADS_PER_CTA) {
 6      // Compute warps, main loop
 7      for (int iz = 0; iz < NZ; iz++) {
 8        ...
 9        // Start next transfer to shared memory
10        dma_ld_pq.start_async_dma(); // barrier arrive
11        // Still reading old version in PQy_buf
12        for (int j = 1; j < R; j++) {
13          const float2 v1 = PQy_buf[PQ_index-j], v2 = PQy_buf[PQ_index+j];
14          q_Pxyl += c_x[j] * (v1.x + v2.x); q_Qxyl += c_x[j] * (v1.y + v2.y);
15        }
16      }
17    } else { /* DMA warps */ }
18  }

Listing 3. Write-after-read bug in two-phase RTM.

The performance results show that WEFT is capable of scaling to very large kernels containing up to 8K dynamic barriers and 14 million commands. WEFT is able to verify most kernels in a few minutes or less (within the time of a standard coffee break), making it practical for use by real developers. WEFT also saves the time developers would normally invest in writing tests that uncover deadlocks and data races; since WEFT is sound, there is no need to test for the properties that it checks, and the testing effort can be focused on other properties such as functional correctness. Moreover, since WEFT is complete, all errors reported by WEFT are real errors and worth investigating. Finally, since the time complexity is a small polynomial function, the tool's execution time is predictable. In the worst case, WEFT required seven minutes to verify the hept_visc_fermi kernel. This kernel makes extensive use of warp-synchronous programming to exchange thousands of constants through the same shared memory locations, requiring more than 168 billion race tests. Note that the use of the new shuffle instructions available on the Kepler architecture by the dme_diff_kepler kernel reduces the number of tests by three orders of magnitude. The prf_diff_kepler kernel uses the most memory (5.1 GB), but this usage is well within the memory limits of most modern laptop and desktop machines.

WEFT found several different data races in seven of the kernels from both the CudaDMA and Singe suites. These races can be classified into two categories: benign and harmful. All of the reverse time migration (RTM) kernels from the CudaDMA test suite had an instance of a benign data race where the developer intentionally had multiple threads load the same value from global memory and write it to the same location in shared memory during the initialization loops of the kernel. The resulting code was simpler to understand and maintain and caused negligible performance degradation. The WEFT tool does not distinguish between benign and harmful races and correctly reports all the competing writes to the same shared memory locations as races. Since benign data races can cause performance degradation, it is still important to report them to the developer even if they do not break correctness.

The two-phase RTM kernels from the CudaDMA suite also contained a bug which resulted in a number of harmful write-after-read data races. Listing 3 shows the pertinent parts of the code that cause the data race in all of the two-phase RTM kernels. Initially, a shared memory buffer called s_PQ is allocated, and an alias called PQy_buf is created as a pointer to the middle of the buffer (lines 2-3). Inside the main loop of the compute warps, the placement of the start_async_dma call (line 10) comes before the last use of the shared memory buffer through the PQy_buf pointer (line 13), resulting in a write-after-read race as the DMA warps begin writing into shared memory before the last read. Interestingly, this resolved a two-year-old non-determinism bug filed against the two-phase kernels, which persisted because the single-phase kernels achieved higher performance.

Two Singe kernels also contained write-after-read data races. Both the DME and Heptane chemistry kernels for the Kepler architecture contained instances where writes could occur before previous reads had completed. The cause of these races was a bug in the heuristic that the Singe compiler uses for solving the NP-complete bin-packing problem for managing dataflow transfers through the limited space in shared memory. Interestingly, the bug only manifested on the Kepler architecture, where the additional register file space permitted a more aggressive mapping strategy that stressed the bin-packing solver. Despite extensive testing, as far as we know, these write-after-read races have never manifested in practice on current GPU architectures due to the large number of intermediate instructions. However, there is no guarantee that the bug in the Singe compiler would not have manifested on future GPU architectures, further underscoring the need for verification of complex warp-specialized kernels.

6. Discussion and Related Work

There is considerable literature on the verification of concurrent programs and some recent work on GPU verification. Existing work on GPU verification has focused on general-purpose verification of GPU kernels and the need to handle the various programming constructs provided by GPU models. Our work is largely orthogonal: we focus on obtaining complete and efficient solutions for handling the verification of named barriers by sacrificing expressiveness. We do not operate at the source level, and we do not aim to provide a general-purpose GPU verification tool. Instead, we focus specifically on the problems associated with producer-consumer named barriers, which no previous work addresses. Therefore, we believe this work is largely complementary to the existing work on verification of GPU programs. Furthermore, we feel that this work provides a complete solution for the problems associated with named barriers and could easily be integrated into a more general-purpose GPU verification framework.

There are many other verification tools which have attempted to reason about the synchronization schemes of GPU kernels. GPUVerify [5, 8, 11, 12] is a general-purpose verifier for GPU kernels. In the GPUVerify model, a CTA-wide barrier (e.g., __syncthreads) is the only synchronization mechanism. This restriction permits GPUVerify to perform a two-thread reduction: a round-robin scheduling of threads is sufficient to detect all data races. A similar reduction is also used by PUG [19]. While this reduction can considerably reduce the cost of the analysis and simplifies the implementation of the verifier, it does not have sufficient power to handle the full generality of named barriers. GPUVerify can also reason about warp-synchronous programming (as can WEFT) and atomics (which WEFT does not) [4].

There are many different tools that use symbolic execution [10, 13, 20, 21] to search for bugs in GPU kernels. To the best of our knowledge, none of these tools has any support for reasoning about named barriers. However, we see no fundamental reason that they could not be modified to understand named barrier semantics.

GPU Amplify [18] relies on a combination of static and dynamic analysis to search for data races in GPU kernels. We believe that our work subsumes and generalizes their approach for searching for shared memory races. In particular, their approach can be seen as a special-case instance of WEFT that requires that all kernels use only a single named barrier for CTA-wide synchronization.

In some ways, named barrier semantics are related to other kinds of barriers. Java Phasers have similar semantics to named barriers, but are implemented in software [23]. Other (non-exhaustive) work on simple barriers, such as those used for distributed programming, includes [2, 17]. While there are some similarities, the fundamental implications of named barriers being a hardware resource instead of a software object give them a much stricter semantics, which is therefore important to verify.
7. Conclusion

As GPU usage continues to expand into new application domains with more task parallelism and larger working sets, warp specialization will become increasingly important for handling challenging kernels that do not fit the standard data-parallel paradigm. The named barriers available on NVIDIA GPUs make warp specialization possible, but require more complex code with potentially difficult-to-discern bugs.

We have presented WEFT, a verifier for kernels using named barriers that is sound, complete, and efficient. Soundness ensures that developers do not have to write any tests to check for synchronization errors. Completeness ensures that all violations reported by WEFT are actual bugs; there are no false alarms. WEFT runs on large, complex, warp-specialized kernels requiring up to billions of checks for race conditions and completes in a few minutes. Using WEFT we validated 26 kernels and found several non-trivial bugs.

Acknowledgements

We wish to thank Michael Garland, Vinod Grover, Divya Gupta, Manolis Papadakis, Sean Treichler, and the anonymous reviewers for their valuable comments and feedback. This material is based on research sponsored by a fellowship from Microsoft, Los Alamos National Laboratories Subcontract No. 173315-1 through the U.S. Department of Energy under Contract No. DE-AC52-06NA25396, and NSF grant CCF-1160904.

References

[1] Parallel thread execution ISA version 4.1. http://docs.nvidia.com/cuda/parallel-thread-execution/index.html, 2014.
[2] A. Aiken and D. Gay. Barrier inference. In POPL, pages 342-354, 1998.
[3] J. Alglave, M. Batty, A. F. Donaldson, G. Gopalakrishnan, J. Ketema, D. Poetzl, T. Sorensen, and J. Wickerson. GPU concurrency: Weak behaviours and programming assumptions. In ASPLOS, pages 577-591, 2015.
[4] E. Bardsley and A. F. Donaldson. Warps and atomics: Beyond barrier synchronization in the verification of GPU kernels. In NFM, pages 230-245, 2014.
[5] E. Bardsley, A. Betts, N. Chong, P. Collingbourne, P. Deligiannis, A. F. Donaldson, J. Ketema, D. Liew, and S. Qadeer. Engineering a static verification tool for GPU kernels. In CAV, pages 226-242, 2014.
[6] M. Bauer, H. Cook, and B. Khailany. CudaDMA: optimizing GPU memory bandwidth via warp specialization. In SC, 2011.
[7] M. Bauer, S. Treichler, and A. Aiken. Singe: Leveraging warp specialization for high performance on GPUs. In PPoPP, 2014.
[8] A. Betts, N. Chong, A. F. Donaldson, S. Qadeer, and P. Thomson. GPUVerify: a verifier for GPU kernels. In OOPSLA, pages 113-132, 2012.
[9] J. H. Chen, A. Choudhary, B. de Supinski, M. DeVries, E. R. Hawkes, S. Klasky, W. K. Liao, K. L. Ma, J. Mellor-Crummey, N. Podhorszki, R. Sankaran, S. Shende, and C. S. Yoo. Terascale direct numerical simulations of turbulent combustion using S3D. Computational Science and Discovery, page 015001, 2009.
[10] W. Chiang, G. Gopalakrishnan, G. Li, and Z. Rakamaric. Formal analysis of GPU programs with atomics via conflict-directed delay-bounding. In NFM, pages 213-228, 2013.
[11] N. Chong, A. F. Donaldson, P. H. J. Kelly, J. Ketema, and S. Qadeer. Barrier invariants: a shared state abstraction for the analysis of data-dependent GPU kernels. In OOPSLA, pages 605-622, 2013.
[12] N. Chong, A. F. Donaldson, and J. Ketema. A sound and complete abstraction for reasoning about parallel prefix sums. In POPL, pages 397-410, 2014.
[13] P. Collingbourne, C. Cadar, and P. H. J. Kelly. Symbolic crosschecking of floating-point and SIMD code. In EuroSys, pages 315-328, 2011.
[14] S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco. GPUs and the future of parallel computing. IEEE Micro, 31(5):7-17, 2011.
[15] Khronos. The OpenCL Specification, Version 1.0. The Khronos OpenCL Working Group, December 2008.
[16] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Commun. ACM, 21(7):558-565, 1978.
[17] D.-K. Le, W.-N. Chin, and Y. M. Teo. Verification of static and dynamic barrier synchronization using bounded permissions. In ICFEM, pages 231-248, 2013.
[18] A. Leung, M. Gupta, Y. Agarwal, R. Gupta, R. Jhala, and S. Lerner. Verifying GPU kernels by test amplification. In PLDI, pages 383-394, 2012.
[19] G. Li and G. Gopalakrishnan. Scalable SMT-based verification of GPU kernel functions. In FSE, pages 187-196, 2010.
[20] G. Li, P. Li, G. Sawaya, G. Gopalakrishnan, I. Ghosh, and S. P. Rajan. GKLEE: concolic verification and test generation for GPUs. In PPoPP, pages 215-224, 2012.
[21] P. Li, G. Li, and G. Gopalakrishnan. Parametric flows: automated behavior equivalencing for symbolic analysis of races in CUDA programs. In SC, page 29, 2012.
[22] NVIDIA. CUDA programming guide 6.0. http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf, August 2014.
[23] Oracle. Class Phaser. https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/Phaser.html, 2014.
[24] S. G. Parker, J. Bigler, A. Dietrich, H. Friedrich, J. Hoberock, D. Luebke, D. McAllister, M. McGuire, K. Morley, A. Robison, and M. Stich. OptiX: A general purpose ray tracing engine. ACM Transactions on Graphics, August 2010.
[25] G. Ramalingam. Context-sensitive synchronization-sensitive analysis is undecidable. ACM Trans. Program. Lang. Syst., 22(2):416-430, 2000.
[26] Top500. Top 500 supercomputers. http://www.top500.org, 2014.