
COMPUTE UNIFIED DEVICE ARCHITECTURE

ABSTRACT
CUDA (Compute Unified Device Architecture) is a compiler and set of development tools that enable programmers to use a variation of C to code algorithms for execution on the graphics processing unit (GPU). CUDA has been developed by NVIDIA, and using this architecture requires an Nvidia GPU and drivers.

1. Introduction

CUDA is a scalable parallel programming model and a software environment for parallel computing. The latest drivers all contain the necessary CUDA components. CUDA works with all NVIDIA GPUs from the G8X series onwards, including GeForce, Quadro and the Tesla line. NVIDIA states that programs developed for the GeForce 8 series will also work without modification on all future Nvidia video cards, due to binary compatibility. CUDA gives developers access to the native instruction set and memory of the massively parallel computational elements in CUDA GPUs. Using CUDA, the latest NVIDIA GPUs effectively become open architectures like CPUs (Central Processing Units). Unlike CPUs, however, GPUs have a parallel "many-core" architecture, each core capable of running thousands of threads simultaneously; if an application is suited to this kind of architecture, the GPU can offer large performance benefits. CUDA provides both a deterministic low-level API and a higher-level API. The initial CUDA SDK was made public on 15th February 2007. The compiler in CUDA is based on the PathScale C compiler, which is in turn based on Open64. NVIDIA released a CUDA API beta for Mac OS X on August 19, 2008.

2. GPU (Graphical Processing Unit)

A graphics processing unit or GPU (also occasionally called visual processing unit or VPU) is a dedicated graphics rendering device for a personal computer, workstation, or game console. Modern GPUs are very efficient at manipulating and displaying computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for a range of complex algorithms. A GPU can sit on top of a video card, or it can be integrated directly into the motherboard. More than 90% of new desktop and notebook computers have integrated GPUs, which are usually far less powerful than those on a video card.

Graphics processors or GPUs have evolved much in the past few years. Today, they are capable of calculating things other than pixels in video games; however, it is important to know how to use them efficiently for other tasks.

3. Difference Between CPU and GPU

During the last couple of years, GPU calculation power has improved exponentially, and much faster than that of the CPU. However, this doesn't mean that GPUs have evolved faster.

GPU computing with CUDA brings parallel computing to the masses: over 85,000,000 CUDA-capable GPUs have been sold, data-parallel supercomputers are everywhere, and we are already seeing innovations in data-parallel computing. Massively parallel computing has become a commodity technology.

Due to the nature of graphics computations, GPU chips are customized to handle streaming data. This means that the data is already sequential, or cache-coherent, and thus GPU chips do not need the significant amount of cache space that dominates CPU chips. The GPU die real estate can then be re-targeted to produce more processing power.

A CPU is expected to process a task as fast as possible, whereas a GPU must be capable of processing a maximum of tasks, or to be more accurate, one task for a maximum of data in a minimum period of time. Of course, a GPU also has to be fast and a CPU must be able to process several tasks, but up to this date the development of their respective architectures has shown the above priority. This has meant multiplying processing units for GPUs and, for CPUs, making control units more complex and increasing embedded cache memory.

4. Cache Memory

Cache (pronounced cash) memory is extremely fast memory that is built into a computer's central processing unit (CPU), or located next to it on a separate chip. The CPU uses cache memory to store instructions that are repeatedly required to run programs, improving overall system speed. The advantage of cache memory is that the CPU does not have to use the motherboard's system bus for data transfer. Whenever data must be passed through the system bus, the data transfer speed slows to the motherboard's capability. The CPU can process data much faster by avoiding the bottleneck created by the system bus.

As it happens, once most programs are open and running, they use very few resources. When these resources are kept in cache, programs can operate more quickly and efficiently. All else being equal, cache is so effective in system performance that a computer running a fast CPU with little cache can have lower benchmarks than a system running a somewhat slower CPU with more cache. Cache built into the CPU itself is referred to as Level 1 (L1) cache. Cache that resides on a separate chip next to the CPU is called Level 2 (L2) cache. Some CPUs have both L1 and L2 cache built in and designate the separate cache chip as Level 3 (L3) cache.

Cache that is built into the CPU is faster than separate cache, running at the speed of the microprocessor itself. However, separate cache is still roughly twice as fast as Random Access Memory (RAM). Cache is more expensive than RAM, but it is well worth getting a CPU and motherboard with built-in cache in order to maximize system performance.

Disk caching applies the same principle to the hard disk that memory caching applies to the CPU. Frequently accessed hard disk data is stored in a separate segment of RAM in order to avoid having to retrieve it from the hard disk over and over. In this case, RAM is faster than the platter technology used in conventional hard disks. This situation will change, however, as hybrid hard disks become ubiquitous. These disks have built-in flash memory caches. Eventually, hard drives will be 100% flash drives, eliminating the need for RAM disk caching, as flash memory is faster than RAM.

5. CUDA Architecture

The GPU has some number of MultiProcessors (MPs), depending on the model:
• The NVIDIA 8800 comes in 2 models: either 12 or 16 MPs
• The NVIDIA 8600 has 4 MPs
• Each MP has 8 independent processors
• There are 16 KB of Shared Memory per MP, arranged in 16 banks
• There are 64 KB of Constant Memory
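These figures can be read back at runtime through the CUDA runtime API. The following is a minimal sketch, assuming a CUDA-capable device and using the standard cudaGetDeviceProperties() call; the printed labels are ours, not part of the API.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        /* multiProcessorCount gives the number of MPs; sharedMemPerBlock
           and totalConstMem report the 16 KB and 64 KB figures above.   */
        printf("Device %d: %s\n", dev, prop.name);
        printf("  MultiProcessors : %d\n", prop.multiProcessorCount);
        printf("  Shared memory   : %u bytes\n", (unsigned)prop.sharedMemPerBlock);
        printf("  Constant memory : %u bytes\n", (unsigned)prop.totalConstMem);
    }
    return 0;
}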
6. CUDA API

CUDA, or Compute Unified Device Architecture, is the architecture that allows the exploitation of GeForce 8 GPU calculation power by letting it process kernels (programs) on a certain number of threads. Even if CUDA also partly concerns the GPU, since the hardware gains more and more optimizations to facilitate non-graphic calculations, in practice it mainly concerns software. CUDA is consequently a driver, a runtime, libraries (an implementation of BLAS amongst other things), an API based on an extension of the C programming language, and an accompanying compiler (redirecting the part not executed by the GPU to the system's default compiler).

CUDA is a high-level API, meaning that it globally disregards the hardware, even though taking the specifications into account is required to obtain high performance. AMD, however, with CTM has a low-level API. This roughly means that it is easier to program with CUDA, whereas it is easier to fully optimize the code with CTM.

The CUDA driver acts as an intermediate element between the compiled code and the GPU. The CUDA runtime is an intermediate between the developer and the driver, facilitating programming by masking some of the details. With CUDA it is possible either to use the runtime API or to directly access the driver API. It is possible to see the runtime API as a high-level language and the driver API as an intermediate between high and low level, allowing a manual and deeper optimization of the code. In the opposite direction, AMD gives the possibility of writing kernels in HLSL instead of machine language to facilitate programming. While they both stick to their initial choices, Nvidia and AMD try to go a little bit in the opposite way.

For this first look at CUDA, we focused on the runtime API. The driver mode, however, isn't that different; it only has more options and less automation.

This particular API consists of a couple of extensions of the C language: a component intended for the host system that makes it possible to control the GPU(s), another that runs on the GPU, and a common component that includes vector types and a group of functions from the standard C library which can be executed on the host system as well as on the GPU.

Without going into all the details of the added extensions, we are going to give the main ones that allow an understanding of the functioning of CUDA; a short sketch follows.
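As a hedged illustration of these components (the kernel name scaleAdd, the problem size and the launch configuration are our own choices, not taken from the article), the sketch below shows the device component (a __global__ kernel), the host component (memory management and the <<< >>> launch syntax) and ordinary C code running on both sides.

#include <stdio.h>
#include <cuda_runtime.h>

/* Device component: a kernel executed on the GPU by many threads. */
__global__ void scaleAdd(float *out, const float *in, float factor, int n)
{
    /* Each thread handles one element, identified by its global index. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = factor * in[i] + 1.0f;
}

int main(void)
{
    const int n = 1 << 20;                 /* illustrative problem size */
    size_t bytes = n * sizeof(float);

    float *h_in  = (float *)malloc(bytes);
    float *h_out = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    /* Host component: the CPU controls GPU memory and kernel launches. */
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleAdd<<<blocks, threadsPerBlock>>>(d_out, d_in, 2.0f, n);

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    printf("out[42] = %f\n", h_out[42]);

    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return 0;
}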
7. CUDA Efficiency

CUDA expects to have not just a few threads, but thousands of them! All threads execute the same code (called the kernel), but operate on different data. Each thread can determine which one it is. Think of all the threads as living in a "pool", waiting to be executed. All processors start by grabbing a thread from the pool. When a thread gets blocked somehow (a memory access, waiting for information from another thread, etc.), the processor quickly returns the thread to the pool and grabs another one to work on. This thread-swap happens within a single cycle, whereas a full memory access requires around 200 instruction cycles to complete.
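The phrase "each thread can determine which one it is" refers to the built-in index variables available inside a kernel. A minimal sketch, with an illustrative kernel name and launch shape:

__global__ void whoAmI(int *ids)
{
    /* Built-in variables: threadIdx, blockIdx, blockDim, gridDim.
       ids must hold gridDim.x * blockDim.x integers.              */
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    ids[globalId] = globalId;   /* every thread fills its own slot */
}

/* A single launch can create far more threads than there are processors,
   which is what lets the hardware hide memory latency by thread-swapping: */
/* whoAmI<<<4096, 256>>>(d_ids);   // roughly one million threads          */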
8. Hardware Support

The 8-Series (G8X) GPU from NVIDIA, found in the GeForce, Quadro and Tesla lines, is the first series of GPUs to support the CUDA SDK. The 8-Series (G8X) GPUs feature hardware support for 32-bit (single precision) floating point vector processors, using the CUDA SDK as API. (CUDA supports the 64-bit C "double" data type; however, on G8X series GPUs these types get demoted to 32-bit floats.) NVIDIA recently announced its new GT200 architecture, which now supports 64-bit (double precision) floating point. Due to the highly parallel nature of vector processors, GPU-assisted hardware stream processing can have a huge impact in specific data processing applications. It is anticipated in the computer gaming industry that graphics cards may be used in future game physics calculations (physical effects like debris, smoke, fire, fluids). CUDA has also been used to accelerate non-graphical applications in computational biology and other fields by an order of magnitude or more.

9. Advantages

CUDA has several advantages over traditional general-purpose computation on GPUs (GPGPU) using graphics APIs.

• It uses the standard C language, with some simple extensions.
• Scattered writes – code can write to arbitrary addresses in memory.
• Shared memory – CUDA exposes a fast shared memory region (16 KB in size) that can be shared amongst threads. This can be used as a user-managed cache, enabling higher bandwidth than is possible using texture lookups.[6] (See the sketch after this list.)
• Faster downloads and readbacks to and from the GPU.
• Full support for integer and bitwise operations.
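To show what the shared-memory point looks like in code, here is a hedged sketch of a block-wide sum that uses the __shared__ region as a user-managed cache; the kernel name and the block size of 256 are illustrative assumptions.

#define BLOCK 256   /* illustrative block size; launch with <<<numBlocks, BLOCK>>> */

__global__ void blockSum(const float *in, float *blockResults, int n)
{
    /* Shared memory: visible to all threads of one block and much faster
       than global memory, used here as a user-managed cache.             */
    __shared__ float cache[BLOCK];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    /* Tree reduction inside the block, entirely in shared memory. */
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        blockResults[blockIdx.x] = cache[0];
}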

10. Limitations

• Texture rendering is not supported.
• Recursive functions are not supported and must be converted to loops.
• Various deviations from the IEEE 754 standard: denormals and signalling NaNs are not supported; only two IEEE rounding modes are supported (chop and round-to-nearest even), and those are specified on a per-instruction basis rather than in a control word (whether this is a limitation is arguable); and the precision of division/square root is slightly lower than single precision.
• The bus bandwidth and latency between the CPU and the GPU may be a bottleneck.
• Threads should be run in groups of at least 32 for best performance. Branches in the program code do not impact performance significantly, provided that each of the 32 threads takes the same execution path; the SIMD execution model becomes a significant limitation for any inherently divergent task (e.g., traversing a ray tracing acceleration data structure). See the sketch after this list.
• CUDA-enabled GPUs are only available from Nvidia (GeForce 8 series and above, Quadro and Tesla[1]).
• Currently CUDA's compiler support under Windows is only for Visual Studio 7 and 8, not Visual Studio 9 (2008). Support for VS2008 is expected by the end of 2008.
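As a hedged illustration of the 32-thread (warp) point above (these two kernels are our own examples, not from the article): the first branches per thread and forces each warp to execute both paths, while the second branches uniformly per block and pays no divergence penalty.

/* Divergent: odd and even threads of the same warp take different
   paths, so the warp executes both branches one after the other.  */
__global__ void divergent(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}

/* Uniform: all 32 threads of a warp (indeed the whole block) take
   the same path, so there is no divergence penalty.               */
__global__ void uniform(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (blockIdx.x % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}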
11. CUDA Evolves

It is now clear that CUDA wasn't simply an Nvidia marketing ploy and/or a way to see if the market is ready for such a thing. Rather, it's a long-term strategy based on the feeling that an accelerator market is starting to form and is expected to grow quickly in the coming years.

CUDA's team is therefore working hard to evolve the language, improve the compiler, make use more flexible, etc. Since the 0.8 beta version came out in February, the versions 0.9 beta and finally 1.0 have allowed CUDA to viably make use of GPUs as coprocessors. More flexibility and robustness were necessary, although version 0.8 was already very promising. These regular evolutions have also increased the feeling of confidence from which CUDA is starting to benefit.

Two main evolutions stand out. The first is the asynchronous functioning of CUDA. As we explained in our previous article, version 0.8 suffered from a large limitation, because once the CPU sent the work to the GPU, it was blocked until the GPU sent the results back. The CPU and GPU therefore couldn't work at the same time, and this was a big brake on performance. Another problem was found in the case where a calculation system was equipped with several GPUs: a CPU core per GPU was needed, which isn't too efficient in practice.

Nvidia of course knew this, and the synchronous functioning of the first CUDA versions was probably used to facilitate a rapid release of a functional version without focusing on the more delicate details. With CUDA 0.9 and then 1.0, this problem disappeared and the CPU is free once it has sent the program to be executed to the GPU (except when access to textures is used). In the case where a number of GPUs are used, it is however necessary to create a CPU thread per GPU, because CUDA does not authorize the piloting of two GPUs from the same thread. This is not a big problem. Note that there is a function that can force synchronous functioning if this is necessary.
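A hedged sketch of what this asynchronous behaviour looks like in host code. The kernel heavyKernel and the helper prepareNextBatchOnCpu are hypothetical names, and the buffers are assumed to have been allocated as in the earlier sketch; the point is only that the launch returns immediately and the CPU blocks only when it explicitly synchronizes.

/* The <<< >>> launch returns control to the CPU immediately. */
heavyKernel<<<blocks, threadsPerBlock>>>(d_data, n);

/* The CPU is free to do useful work while the GPU computes. */
prepareNextBatchOnCpu(h_nextBatch);

/* Block only when the results are actually needed
   (cudaThreadSynchronize() is the CUDA 1.x-era call; later
   versions use cudaDeviceSynchronize()).                     */
cudaThreadSynchronize();
cudaMemcpy(h_result, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);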
The second main innovation on the functional level is the appearance of atomic functions, which means reading data in memory, using it in an operation and writing the result without any other access to this memory space until the operation is fully completed. This allows avoiding (or at least reducing) certain current problems, such as a thread trying to read a value that may or may not have been modified by another thread.
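A hedged sketch of an atomic function in use; the histogram use case and kernel name are our own illustration. atomicAdd() performs the read, the addition and the write as one indivisible operation, so no other thread can touch that memory location in between.

/* Many threads may hit the same bin; without atomicAdd, concurrent
   read-modify-write sequences could lose updates.                   */
__global__ void histogram256(const unsigned char *values, int n,
                             unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[values[i]], 1u);
}

/* Atomic functions on global memory require a compute capability 1.1
   GPU (e.g. compile with nvcc -arch=sm_11 in that era).              */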

12. Conclusion

Two points stand out when comparing GPUs with CPUs. The first is that a high-end GPU comes with a memory bandwidth of 100 GB/s, whereas the four cores of a CPU have to share 10 GB/s. This is a significant difference which limits CPUs in certain cases. The second point is that a GPU is designed to maximize its throughput. It will therefore automatically use a very high number of threads to maximize that throughput, whereas a CPU will more often end up waiting many cycles. These two reasons allow GPUs to have an enormous advantage over CPUs in certain applications, such as the one we tested.

Using a GPU with CUDA may seem very difficult, even adventurous, at first, but it is much simpler than most people think. The reason is that the GPU isn't destined to replace the CPU, but rather to help it in certain specific tasks. This doesn't involve making a task parallel in order to use several cores (as is currently the case with CPUs), but implementing a task which is naturally massively parallel, and such tasks are numerous. A race car isn't used to transport cattle and we don't drive a tractor in an F1 race. It is the same thing for GPUs. Therefore, it's all about making a massively parallel algorithm efficient on a given architecture.

It would be a mistake to limit our view of a GPU such as the GeForce 8800 to a cluster of 128 cores which we should try to take advantage of by segmenting an algorithm. A GeForce 8800 isn't just 128 processors, but rather, first of all, up to 25,000 threads in flight! We therefore have to provide the GPU with an enormous number of threads, keeping them within the hardware limits for maximum productivity, and let the GPU execute them efficiently.

13. References

1. CUDA for GPU Computing
2. NVIDIA on DailyTech
3. NVIDIA CUDA 2.0
4. Schatz, M.C., Trapnell, C., Delcher, A.L., Varshney, A. (2007). "High-throughput sequence alignment using Graphics Processing Units". BMC Bioinformatics 8:474. doi:10.1186/1471-2105-8-474.
5. Manavski, Svetlin A.; Valle, Giorgio (2008). "CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment". BMC Bioinformatics 9(Suppl 2):S10. doi:10.1186/1471-2105-9-S2-S10.
6. Silberstein, Mark (2007). "Efficient computation of Sum-products on GPUs".
7. "PyCUDA".

