
CUDA GPU Occupancy Calculator

Just follow steps 1, 2, and 3 below! (or click here for help)

1.)  Select Compute Capability (click):                  3.5
1.b) Select Shared Memory Size Config (bytes):           49152

2.)  Enter your resource usage:
     Threads Per Block:                                  256
     Registers Per Thread:                               32
     Shared Memory Per Block (bytes):                    4096
     (Don't edit anything below this line)

3.)  GPU Occupancy Data is displayed here and in the graphs:
     Active Threads per Multiprocessor:                  2048
     Active Warps per Multiprocessor:                    64
     Active Thread Blocks per Multiprocessor:            8
     Occupancy of each Multiprocessor:                   100%

Physical Limits for GPU Compute Capability:              3.5
Threads per Warp                                         32
Warps per Multiprocessor                                 64
Threads per Multiprocessor                               2048
Thread Blocks per Multiprocessor                         16
Total # of 32-bit registers per Multiprocessor           65536
Register allocation unit size                            256
Register allocation granularity                          warp
Registers per Thread                                     255
Shared Memory per Multiprocessor (bytes)                 49152
Shared Memory Allocation unit size                       256
Warp allocation granularity                              4
Maximum Thread Block Size                                1024

Allocated Resources                                        Per Block   Limit Per SM   = Allocatable Blocks Per SM
Warps (Threads Per Block / Threads Per Warp)               8           64             8
Registers (Warp limit per SM due to per-warp reg count)    8           64             8
Shared Memory (Bytes)                                      4096        49152          12
Note: SM is an abbreviation for (Streaming) Multiprocessor

Maximum Thread Blocks Per Multiprocessor                   Blocks/SM  *  Warps/Block  =  Warps/SM
Limited by Max Warps or Max Blocks per Multiprocessor      8             8               64
Limited by Registers per Multiprocessor                    8             8               64
Limited by Shared Memory per Multiprocessor                12            8               0
Note: Occupancy limiter is shown in orange

Physical Max Warps/SM = 64
Occupancy = 64 / 64 = 100%
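To make the arithmetic above concrete, the following is a minimal sketch in plain C++ (compilable as host code with nvcc or any C++ compiler) that reproduces the three per-SM block limits and the 100% occupancy result for this example configuration. The constants come from the tables above; the helper name roundUp and the rounding rules are simplifications of the spreadsheet's formulas, not its actual code.

    // occupancy_sketch.cpp -- a minimal sketch, not the spreadsheet's exact formulas.
    #include <algorithm>
    #include <cstdio>

    int roundUp(int x, int multiple) { return ((x + multiple - 1) / multiple) * multiple; }

    int main() {
        // Physical limits for compute capability 3.5 (from the table above)
        const int threadsPerWarp  = 32;
        const int maxWarpsPerSM   = 64;
        const int maxBlocksPerSM  = 16;
        const int regFileSize     = 65536;   // 32-bit registers per SM
        const int regAllocUnit    = 256;     // register allocation unit size
        const int smemPerSM       = 49152;   // bytes
        const int smemAllocUnit   = 256;     // bytes
        const int warpAllocGran   = 4;

        // Kernel resource usage (the example entered in step 2)
        const int threadsPerBlock = 256;
        const int regsPerThread   = 32;
        const int smemPerBlock    = 4096;

        // Warps per block, rounded up to the warp allocation granularity
        int warpsPerBlock = roundUp((threadsPerBlock + threadsPerWarp - 1) / threadsPerWarp,
                                    warpAllocGran);

        // Limit 1: max warps or max blocks per multiprocessor
        int limitWarps = std::min(maxBlocksPerSM, maxWarpsPerSM / warpsPerBlock);

        // Limit 2: registers (allocated per warp, in units of regAllocUnit)
        int regsPerWarp = roundUp(regsPerThread * threadsPerWarp, regAllocUnit);
        int limitRegs   = (regFileSize / regsPerWarp) / warpsPerBlock;

        // Limit 3: shared memory (allocated in units of smemAllocUnit)
        int limitSmem   = smemPerSM / roundUp(smemPerBlock, smemAllocUnit);

        int activeBlocks = std::min({limitWarps, limitRegs, limitSmem});   // 8
        int activeWarps  = activeBlocks * warpsPerBlock;                   // 64
        printf("blocks/SM = %d, warps/SM = %d, occupancy = %.0f%%\n",
               activeBlocks, activeWarps, 100.0 * activeWarps / maxWarpsPerSM);
        return 0;
    }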

CUDA Occupancy Calculator Version: 5.1
Copyright and License (see below)

[Chart data: Warps per Multiprocessor tabulated for each candidate Threads Per Block value; this series drives the occupancy graphs below.]
Click Here for detailed instructions on how to use this occupancy calculator.
For more information on NVIDIA CUDA, visit http://developer.nvidia.com/cuda

Your chosen resource usage is indicated by the red triangle on the graphs. The other data points represent the range of possible block sizes, register counts, and shared memory allocations.

Impact of Varying Block Size
[Graph: Multiprocessor Warp Occupancy (# warps, 0-64) versus Threads Per Block (0-1024). My Block Size = 256.]

Impact of Varying Register Count Per Thread
[Graph: Multiprocessor Warp Occupancy (# warps, 0-64) versus Registers Per Thread (0-256). My Register Count = 32.]

Impact of Varying Shared Memory Usage Per Block
[Graph: Multiprocessor Warp Occupancy (# warps, 0-64) versus Shared Memory Per Block (0-49152 bytes). My Shared Memory = 4096.]

IMPORTANT
This spreadsheet requires Excel macros for full functionality. When you load this file, make sure you enable macros, because Excel often disables them by default.

Overview
The CUDA Occupancy Calculator allows you to compute the multiprocessor occupancy of a GPU by a given CUDA kernel. The multiprocessor occupancy is the ratio of active warps to the maximum number of warps supported on a multiprocessor of the GPU.

Each multiprocessor on the device has a set of N registers available for use by CUDA program threads. These registers are a shared resource that is allocated among the thread blocks executing on a multiprocessor. The CUDA compiler attempts to minimize register usage to maximize the number of thread blocks that can be active in the machine simultaneously. If a program tries to launch a kernel for which the registers used per thread times the thread block size is greater than N, the launch will fail. The size of N on GPUs with compute capability 1.0-1.1 is 8192 32-bit registers per multiprocessor. On GPUs with compute capability 1.2-1.3, N = 16384. On GPUs with compute capability 2.0-2.1, N = 32768. On GPUs with compute capability 3.0, N = 65536.

Maximizing the occupancy can help to cover latency during global memory loads that are followed by a __syncthreads(). The occupancy is determined by the amount of shared memory and registers used by each thread block. Because of this, programmers need to choose the size of thread blocks with care in order to maximize occupancy. This GPU Occupancy Calculator can assist in choosing thread block sizes based on shared memory and register requirements.
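As a rough illustration of the register constraint described above, the sketch below applies the rule "registers per thread times thread block size must not exceed N", using the N values quoted in the text. The function names are hypothetical and the check ignores allocation granularity, so it is only a feasibility sketch, not the calculator's full model.

    // register_limit_sketch.cpp -- rough check of the register constraint described above.
    #include <cstdio>

    // N = total 32-bit registers per multiprocessor, by compute capability (values from the text).
    int registersPerSM(int major, int minor) {
        if (major == 1) return (minor <= 1) ? 8192 : 16384;   // 1.0-1.1 vs 1.2-1.3
        if (major == 2) return 32768;                          // 2.0-2.1
        if (major == 3) return 65536;                          // 3.0 (and 3.5, per the table below)
        return -1;                                             // not covered here
    }

    // A single block cannot use more registers than the multiprocessor provides.
    bool blockFitsRegisterFile(int regsPerThread, int threadsPerBlock, int major, int minor) {
        return regsPerThread * threadsPerBlock <= registersPerSM(major, minor);
    }

    int main() {
        // Example: 32 registers/thread, 256 threads/block on compute capability 3.5
        printf("fits: %s\n",
               blockFitsRegisterFile(32, 256, 3, 5) ? "yes" : "no");  // 8192 <= 65536 -> yes
        return 0;
    }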

Instructions
Using the CUDA Occupancy Calculator is as easy as 1-2-3. Change to the calculator sheet and follow these three steps.

1.)  First select your device's compute capability in the green box. (Click to go there)
1.b) If your compute capability supports it, you will be shown a second green box in which you can select the size in bytes of the shared memory (configurable at run time in CUDA). (Click to go there)
2.)  For the kernel you are profiling, enter the number of threads per thread block, the registers used per thread, and the total shared memory used per thread block in bytes in the orange block. See below for how to find the registers used per thread. (Click to go there)
3.)  Examine the blue box and the graph to the right. This will tell you the occupancy, as well as the number of active threads, warps, and thread blocks per multiprocessor, and the maximum number of active blocks on the GPU. The graph will show you the occupancy for your chosen block size as a red triangle, and for all other possible block sizes as a line graph. (Click to go there)

You can now experiment with how different thread block sizes, register counts, and shared memory usages affect your GPU occupancy.

Determining Registers Per Thread and Shared Memory Per Thread Block
To determine the number of registers used per thread in your kernel, simply compile the kernel code passing the option --ptxas-options=-v to nvcc. This will output information about register, local memory, shared memory, and constant memory usage for each kernel in the .cu file. However, if your kernel declares any external shared memory that is allocated dynamically, you will need to add the (statically allocated) shared memory reported by ptxas to the amount you dynamically allocate at run time to get the correct shared memory usage. An example of the verbose ptxas output is as follows:

    ptxas info : Compiling entry function '_Z8my_kernelPf' for 'sm_10'
    ptxas info : Used 5 registers, 8+16 bytes smem
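For reference, a kernel of the kind that could produce output like the above might look as sketched below. This is illustrative only: the kernel body and the architecture flag are made up, and the register and shared memory counts reported by ptxas will vary with compiler version and target architecture. Only the kernel name my_kernel(float*) and the --ptxas-options=-v flag come from the example above.

    // my_kernel.cu -- illustrative sketch, not the kernel that produced the output above.
    // Compile with verbose ptxas output, e.g.:
    //   nvcc --ptxas-options=-v -arch=sm_35 -c my_kernel.cu
    #include <cuda_runtime.h>

    __global__ void my_kernel(float *data)
    {
        // External shared memory array, sized dynamically at launch time
        extern __shared__ float dynamicSmem[];

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        dynamicSmem[threadIdx.x] = data[i];
        __syncthreads();
        data[i] = dynamicSmem[threadIdx.x] * 2.0f;
    }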

Let's say "my_kernel" contains an external shared memory array which is allocated to be 2048 bytes at run time. Then our total shared memory usage per block is 2048+8+16 = 2072 bytes. We enter this into the box labeled "shared memory per block (bytes)" in this occupancy calculator, and we also enter the number of registers used by my_kernel, 5, in the box labeled registers per thread. We then enter our thread block size and the calculator Notes about Occupancy Higher occupancy does not necessarily mean higher performance. If a kernel is not bandwidth-limited or latencylimited, then increasing occupancy will not necessarily increase performance. If a kernel grid is already running at least one thread block per multiprocessor in the GPU, and it is bottlenecked by computation and not by global memory accesses, then increasing occupancy may have no effect. In fact, making changes just to increase occupancy can have other effects, such as additional instructions, more register spills to local memory (which is off-chip), more divergent branches, etc. As with any optimization, you should experiment to see how changes affect the *wall clock time* of the kernel execution. For bandwidth-bound applications, on the other hand, increasing occupancy can help better hide the latency of memory accesses, and therefore improve performance. For more information on NVIDIA CUDA, visit http://developer.nvidia.com/cuda


Compute Capability                          1.0     1.1     1.2     1.3     2.0     2.1     3.0     3.5
SM Version                                  sm_10   sm_11   sm_12   sm_13   sm_20   sm_21   sm_30   sm_35
Threads / Warp                              32      32      32      32      32      32      32      32
Warps / Multiprocessor                      24      24      32      32      48      48      64      64
Threads / Multiprocessor                    768     768     1024    1024    1536    1536    2048    2048
Thread Blocks / Multiprocessor              8       8       8       8       8       8       16      16
Max Shared Memory / Multiprocessor (bytes)  16384   16384   16384   16384   49152   49152   49152   49152
Register File Size                          8192    8192    16384   16384   32768   32768   65536   65536
Register Allocation Unit Size               256     256     512     512     64      64      256     256
Allocation Granularity                      block   block   block   block   warp    warp    warp    warp
Max Registers / Thread                      124     124     124     124     63      63      63      255
Shared Memory Allocation Unit Size          512     512     512     512     128     128     256     256
Warp allocation granularity                 2       2       2       2       2       2       4       4
Max Thread Block Size                       512     512     512     512     1024    1024    1024    1024
Shared Memory Size Configurations (bytes)   16384   16384   16384   16384   49152   49152   49152   49152
  [note: default at top of list]                                            16384   16384   16384   16384
                                                                                            32768   32768
Warp register allocation granularities                                      64      64      256     256
  [note: default at top of list]                                            128     128


Copyright 1993-2012 NVIDIA Corporation. All rights reserved.

NOTICE TO USER: This spreadsheet and data is subject to NVIDIA ownership rights under U.S. and international Copyright laws. Users and possessors of this spreadsheet and data are hereby granted a nonexclusive, royalty-free license to use it in individual and commercial software.

NVIDIA MAKES NO REPRESENTATION ABOUT THE SUITABILITY OF THIS SPREADSHEET AND DATA FOR ANY PURPOSE. IT IS PROVIDED "AS IS" WITHOUT EXPRESS OR IMPLIED WARRANTY OF ANY KIND. NVIDIA DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SPREADSHEET AND DATA, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE. IN NO EVENT SHALL NVIDIA BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SPREADSHEET AND DATA.

U.S. Government End Users. This spreadsheet and data are a "commercial item" as that term is defined at 48 C.F.R. 2.101 (OCT 1995), consisting of "commercial computer software" and "commercial computer software documentation" as such terms are used in 48 C.F.R. 12.212 (SEPT 1995) and is provided to the U.S. Government only as a commercial end item. Consistent with 48 C.F.R. 12.212 and 48 C.F.R. 227.7202-1 through 227.7202-4 (JUNE 1995), all U.S. Government End Users acquire the spreadsheet and data with only those rights set forth herein.

Any use of this spreadsheet and data in individual and commercial software must include, in the user documentation and internal comments to the code, the above Disclaimer and U.S. Government End Users Notice.

For more information on NVIDIA CUDA, visit http://www.nvidia.com/cuda
