GPU and CPU Architectures:
And the Common Convergence of Both
Jonathan Palacios and Josh Triska
March 15, 2011
1 Introduction
Graphics Processing Units (GPUs) have been evolving at a rapid rate in recent years, partly due to the increasing needs of the very active computer graphics development community. The visual computing demands of modern computer games and scientific visualization tools have steadily escalated over the past two decades (Figure 1). But the speed and breadth of evolution in recent years has also been affected by the increased demand for these chips to be suitable for general purpose parallel computing as well as graphics processing.
In a campaign that has been perhaps most aggressively pushed by the company NVIDIA (one of the leading chip designers), GPUs have moved closer and closer to being general purpose parallel computing devices. This movement began in the computer graphics software research community around 2003 [15], and at the time was called General Purpose GPU (GPGPU) computing [16, 20, 8]. Using graphics APIs not originally intended or designed for non-graphical applications, many data parallel workloads, such as protein folding, stock options pricing, Magnetic Resonance Image (MRI) reconstruction, and database queries, were ported to the GPU.
This prompted efforts by chip designers such as NVIDIA, AMD and Intel to produce architectures that were more flexible, with more general purpose components (perhaps the most notable change has been the unified shader model, discussed below).
This blurring of roles between the CPU (which, in the past, has been considered the primary general purpose processor) and the GPU has caused some interesting dynamics, the full ramifications of which are not yet clear. GPUs are becoming much more capable processors, and unlike CPUs, which are struggling to find ways of improving speed, their raw computational power increases dramatically every generation, as they add more and more functional units and processing cores.
CPUs are also adding cores (most CPUs now have at least two), but they remain better suited to certain tasks where there is less potential for parallelism. In any case, GPUs and CPUs seem to be engaged in some sort of co-evolution.
Figure 1: On the left is a screenshot of the game System Shock 2, released in 1999. The improvement in graphical quality between this and the 2007 game Crysis (right) illustrates the increasing demands of the computer graphics development community.
In this work, we hope to explore how GPUs differ from CPUs, how they are evolving to be more general purpose, and what this means for both classes of processors. In Section 2, we briefly cover the history of the GPU to give some context for its role in computing. We then examine what the chip structure looks like for both GPUs and CPUs, and how the former has evolved, in Section 3. In Section 4 we cover the data pipeline in the GPU, and how it works for graphics and general purpose computing. Section 5 examines the GPU memory hierarchy, and Section 6, the instruction set.
2 History
In 1999, NVIDIA released the GeForce 256, marketing it as the world's first GPU. There were other graphics acceleration products on the market at the time (by NVIDIA itself, as well as other companies such as ATI and 3dfx), but this represented the first time the term GPU was used.
The first GPUs were fixed-function throughput devices, essentially designed to take a set of triangle coordinates, and color, texture, and lighting specifications, in order to produce an image efficiently. Over time, GPUs have become more and more programmable, so that small programs (called shaders) can be run for each triangle, vertex and pixel processed, vastly expanding the kinds of visual effects that can be generated quickly (reflections, refractions, lighting detail). This is in addition to the huge increase in raw parallel computing power (Figure 2), dwarfing the CPU's gains in this area, although the two technologies are clearly optimized for different applications.
Increasingly, logic units dedicated to special purposes, such as vertex processing or pixel processing, have given way to more general purpose units that can be used for either task (the unified shader model standardized the instruction set used across vertex and pixel shaders, so that only one type of logical unit was needed). Computation has also become more precise over time, moving from indexed arithmetic, to integer and fixed point, to single precision floating point; most recently, double precision floating point storage and operations have been added. GPUs have essentially become massively parallel computing devices with hundreds of cores (ALUs) and many more threads.
In the past five years, the instruction set and memory hardware have also been expanded to support general purpose programming languages such as C and C++. So it is clear that, although they may still have some limitations relative to CPUs (more restrictive memory access, etc.), GPUs are becoming applicable to a great many more purposes than those for which they were originally intended, and this trend is likely to continue [5].
It is worth noting (since it has affected the development of, and the literature on, the GPU so dramatically) that although when the market started there were upwards of twenty companies competing with each other, the field has since narrowed significantly. Now only NVIDIA, AMD and Intel (before AMD acquired it in 2006, AMD's GPU division was the company ATI) are serious competitors in the GPU market, and only NVIDIA and AMD develop high-end graphics cards [10].
3 Chip Structure
3.1
Early GPU Chip Structure
In this section, we cover an abridged history of the GPU chip structure (focusing on NVIDIA chips), to give some context for more recent developments. The earliest GPUs were very special purpose devices that performed fixed function tasks like transformation and lighting, texture mapping, and rasterization of triangles into pixels. From the programmer's perspective, one could set a few options, and pass in lists of triangles, vertices, textures and virtual lights and cameras to be processed. The story became much more interesting when GPUs started to become programmable (2000-2004) [19].
NVIDIA's 7800 (Figure 3) was one of the earlier GPUs to have both pixel and vertex shaders. However, the processor cores (which contained the functional units) were still divided up into vertex processors and fragment (pixel) processors. This limited the flexibility of the chip somewhat, because there were a fixed number of components that could be dedicated to particular tasks.
The unified shader model was introduced in 2006, after which vertex and fragment shaders relied on the same instruction set, and thus the same hardware; there were no longer separate vertex and fragment processors. This can be seen in NVIDIA's 8800 chip (Figure 4). The processing power is essentially divided up among eight Thread Processing Clusters (TPCs), each of which contains two Streaming Multiprocessors (SMs). Each of these contains eight Scalar Processors (SPs), which each contain their own integer ALUs and FPUs and can run multiple threads. These are where most of the actual computation, for both vertices and fragments (and the more recent geometry shaders, which process triangles), takes place. Much of the rest of the chip is dedicated to the distribution of work to the processor hierarchy, and to making sure the processors have access to the right data.
3.2
Fermi
NVIDIA's more recent chip, the Fermi architecture (Figure 5), has gone towards a more general purpose design. Each of the 16 SMs is larger (Figure 6), and now contains two warp instruction schedulers, which issue instructions to two clusters of SP cores (or CUDA Cores) respectively. Each SM also has its own register file, 16 load/store units, and 4 Special Function Units (SFUs), which perform functions such as sine and cosine. This larger, more capable SM is aimed at being able to handle more general data intensive parallel programs, such as those found in scientific computing, while still maintaining the ability to efficiently process graphical data.
Figure 3: NVIDIA's 7800 chip, released in 2005. Something to note here is how the computing functionality is divided among different kinds of processing components, as this chip design predates the unified shader model. One of the vertex processors is highlighted in red near the top, near the middle a fragment processor is highlighted in blue, and a raster operator is indicated by a yellow box near the bottom.
Figure 4: NVIDIA's 8800 chip. With the unified shader model, the division between vertex and fragment processors has been removed, and now there is only one kind of streaming multiprocessor. This processor contains instruction cache and dispatch logic, and 8 scalar processors which each contain an ALU and FPU.
Figure 5: NVIDIA's Fermi architecture. A close-up of one of the SMs (highlighted in red) is shown in Figure 6.
This architecture has also added better and faster floating point calculations, a unified address space, error-correcting code logic for memory, and many other features that make it all the more suitable for general purpose computing.
Something to note here is how large the difference in total SPs is across just one generation of NVIDIA's products: the number of SMs has remained the same (sixteen), but the number of SPs has quadrupled (the 8800's 8 TPCs with 2 SMs of 8 SPs each give 128 SPs, versus Fermi's 16 SMs of 32 SPs each, or 512). This trend is seen throughout the graphics architecture development community, and it is one of the primary differences between how GPUs and CPUs have been evolving: GPUs are always using smaller transistor sizes to dramatically increase the number of processors, aiming at ever-larger data throughput. CPUs, rather, focus on instruction level parallelism and reducing latency.
Figure 6: A close-up of one of Fermi's Streaming Multiprocessors.
3.3
Intel Core i7
Modern CPUs have a very different distribution for their chip layout. One of Intel's latest chips is the Core i7 (Figure 7), and it is apparent from looking at the die micrograph how much less emphasis is placed on raw computing power. As CPUs add more cores, each core comes with all of the logic that has been won through years of research, at the expense of space for functional units (Figure 8). This has both benefits and drawbacks; CPUs are much better at performing a much larger array of tasks than GPUs, and are able to much more effectively reduce latency for individual tasks.
It is clear that CPUs and GPUs have very different priorities when it comes to distributing silicon real estate to different functionalities. Even so, the two classes of processor are moving closer together and cooperating more and more. Indeed, the NVIDIA Tegra chip (designed for mobile phones) already has a GPU and a CPU on the same piece of silicon (Figure 9).
4 Data-flow
GPU hardware is designed to compute operations on millions of pixels using thousands of vertices and large textures as inputs. The pixels must be delivered at a frame rate of approximately 60 Hz to appear smooth to the human eye. These requirements lead to two properties of GPUs in general: very high throughput and very high latency compared to CPUs. The requirement of large throughput is self-explanatory. The tolerance of high latency exists because delays on the order of milliseconds are not detectable by the human eye, especially if the delays are within the computation time of a single frame (roughly 16.7 ms at 60 Hz), in which case they are completely hidden.
Between the years 2000 and 2001, shaders became able to run custom pixel-generation code [13]. This led to interest from the scientific community in using the inherently parallel and powerful GPUs for general-purpose computing. The vertex, geometry, and fragment shaders (processors, cores, etc.) are the programmable elements in the traditional graphics pipeline and are the processors used in general-purpose GPU programming.
Figure 7: A die micrograph of Intel's Core i7, which uses the Nehalem architecture for each core (Figure 8). The cores (on this chip there are four) are shown in the micrograph.
Starting in 2006, the different programmable shaders unified their core instruction set [1], which allowed code to run in any portion of the graphics pipeline independent of the shader type. With the introduction of the unified shader architecture (now known from the software perspective as Shader Model 4.0), the GPU becomes essentially a many-core streaming multiprocessor with raw mathematical performance many times that of a CPU. As of 2010, the single-precision FP calculation performance of Nvidia's Fermi (1.5 TFLOPS) and AMD's 5870 (2.7 TFLOPS) vastly surpasses that of CPUs. The transition to a unified shader model occurred primarily to ease load balancing [14], as present-day games vary widely in the proportion of vertex or pixel shader usage. With a unified architecture, any shader can process data, leading to better utilization of GPU resources.
5 Memory Hierarchy
There are significant differences between the memory structure of GPUs and CPUs. One reality that GPU manufacturers must deal with is the small amount of cache available to each processor, due to the large number of processors and limited space on the die. Cache hit rates are frequently much lower for GPUs than CPUs (90% for the GeForce 6800 versus close to 100% for modern CPUs [12]).
Figure 8: Intel's Nehalem architecture. Along with functional units for doing actual computation, each core also contains reservation stations, reorder buffers, and complex branch prediction.
Figure 9: NVIDIA's Tegra chip (designed for mobile phones) has a CPU, a GPU, and image and HD video processing hardware all on the same piece of silicon.
Figure 10: Simple sequential model of a graphics pipeline showing the three main types of shader processors: vertex, geometry, and fragment.
Figure 11: Typical CPU (top) versus GPU (bottom) cache hierarchy. Note that for GPUs the register array is larger than both the L1 and L2 caches.
This may seem like a crippling situation, but the high-latency nature of the GPU pipeline can allow memory to be retrieved after a miss within acceptable delay times. When one thread stalls on a miss, other threads quickly take over the idle processor while it waits for the missing data to be transferred from memory [4].
The memory pre-fetching hardware is likewise specialized for graphics workloads, which exhibit two-dimensional locality: sequential elements in a 2D array are pre-fetched, rather than sequential elements in linear memory. If one were to use a large 1D array in a GPU program, it would be wrapped into a 2D block for quick access by the cache (Figure 13). There is a growing body of research literature dealing with the modification of various scientific computing programs toward optimizing memory access based upon the existing pre-fetching hardware.
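To make the idea concrete, the following sketch (our own illustration, not code from the cited literature; all identifiers are hypothetical) stores a logically one-dimensional data set as a pitched 2D allocation in CUDA, so that accesses exhibit the two-dimensional locality the pre-fetching hardware favors.

```cuda
#include <cuda_runtime.h>

// Each thread touches one element of the 2D block; neighboring threads in x
// and y hit nearby addresses, matching the GPU's 2D pre-fetch pattern.
__global__ void increment2D(float *data, size_t pitch, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // Rows are separated by `pitch` bytes, not by `width` elements.
        float *row = (float *)((char *)data + y * pitch);
        row[x] += 1.0f;
    }
}

int main(void)
{
    // A 1,048,576-element data set wrapped into a 1024x1024 block.
    const int width = 1024, height = 1024;
    float *d_data;
    size_t pitch;
    cudaMallocPitch((void **)&d_data, &pitch, width * sizeof(float), height);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    increment2D<<<grid, block>>>(d_data, pitch, width, height);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

The pitched allocation lets the driver pad each row so rows begin on cache-friendly boundaries, which is one way programs have been restructured to match the hardware described above.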
The nature of typical graphics demands on GPUs, such as gaming, is that a well-defined set of inputs (vertices, textures, lighting) is heavily processed through thousands of cycles, and then a result (pixel colors) is finally written at the end. Because GPU workloads lack the interleaved threads randomly accessing memory that are typical of CPU programs, caches in GPUs are generally read-only. They function primarily as a way to limit the number of requests to the memory bus, use of which is a precious and scarce resource, rather than as a way to reduce the latency of read/write misses [6].
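One common way to route reads through these read-only caches in CUDA of this era was the texture path; the snippet below is a minimal sketch under that assumption (identifiers are ours), binding an input buffer to a 1D texture so that fetches are cached while writes go straight to memory.

```cuda
#include <cuda_runtime.h>

// Input bound to a 1D texture: reads go through the read-only texture cache,
// reducing traffic on the memory bus rather than hiding write latency.
texture<float, cudaTextureType1D, cudaReadModeElementType> inputTex;

__global__ void scale(float *out, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = factor * tex1Dfetch(inputTex, i);  // cached, read-only fetch
}

int main(void)
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    cudaBindTexture(NULL, inputTex, d_in, n * sizeof(float));

    scale<<<(n + 255) / 256, 256>>>(d_out, 2.0f, n);
    cudaDeviceSynchronize();

    cudaUnbindTexture(inputTex);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```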
6 Instruction Set
In comparing GPU instruction sets to modern CPU instruction sets, it is important to consider the history of the GPU. Prior to the recent evolution of GPUs toward general-purpose computing capabilities, the GPU was a fixed-function resource, highly optimized for those tasks relevant to producing 3D graphics, such as vertex operations (assembling and shading vertices into triangles) and fragment operations (texturing and coloring pixels). Dedicated hardware performed each operation, and the instruction set was specific to each hardware section, depending on its tasks. The recent shift to general-purpose capabilities has resulted in a common set of hardware among the shader types.
Figure 14: Comparison of an array-increment operation on GPUs (left) and CPUs (right).
Some special-purpose hardware has remained, but all computing elements are capable of executing code within the unified instruction set, which brings some important capabilities to the previously limited set of common instructions [17].
With the common instruction set, as well as program space for custom vertex and fragment code, a wide variety of programs can be implemented. In recent years, double-precision FP units have become more and more common in computing elements, increasing the range of options available for GPU programs. The instruction sets have also been increasingly optimized for running C code. From the early academic compilers like BrookGPU [2] to modern C-like languages such as Nvidia's CUDA and AMD's Stream, there has been constant improvement in software tools year after year.
GPU instructions are inherently single-instruction, multiple-data (SIMD). In producing graphics, shaders will run the same program on thousands of pieces of data to do such tasks as applying filters and calculating lighting. As an example of how this would affect a CPU-to-GPU code transition, consider incrementing all the elements in an array (Figure 14). On a CPU, the standard doubly-nested loop would traverse the array and perform an operation on each element. On a GPU, however, a single SIMD operation is able to operate on all elements in parallel. A section of memory to which the operation will be applied is mapped, and the operation simply executes if it is within the prescribed bounds. In the left example of Figure 14, the code would be run on 16 processors, each operating on a different element.
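This contrast can be sketched in CUDA (our own rendering, not the actual code behind Figure 14): the CPU version walks a 4x4 array with the standard doubly-nested loop, while the GPU version launches sixteen threads, one per element, and each simply executes within the prescribed bounds.

```cuda
#include <cuda_runtime.h>

#define N 4  // a 4x4 array: 16 elements, one per processor as in Figure 14

// GPU version: one thread per element; the loop disappears entirely.
__global__ void incrementGPU(float *a)
{
    int x = threadIdx.x, y = threadIdx.y;
    if (x < N && y < N)              // execute only within the bounds
        a[y * N + x] += 1.0f;
}

// CPU version: the standard doubly-nested loop.
void incrementCPU(float *a)
{
    for (int y = 0; y < N; ++y)
        for (int x = 0; x < N; ++x)
            a[y * N + x] += 1.0f;
}

int main(void)
{
    float h_a[N * N] = {0}, *d_a;
    cudaMalloc((void **)&d_a, sizeof(h_a));
    cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);

    incrementGPU<<<1, dim3(N, N)>>>(d_a);   // 16 threads run in parallel
    cudaMemcpy(h_a, d_a, sizeof(h_a), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    return 0;
}
```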
There are other significant differences between programs compiled for CPUs and programs compiled for GPUs. Branches, for example, incur a large penalty due to branch divergence: when the threads in a SIMD group take different paths, the paths must be executed serially.
7 Applications
There are specific computing applications which are particularly suited to GPU implementation, such as finite-element models. Early work in GPU computing produced very fast programs for solving large sets of differential equations, such as the Navier-Stokes fluid flow equations.
Applications which have found particular utility in the computing power of
GPUs include:
Solving dense linear systems: large linear systems comprise a huge portion of scientific computing, and often result in the largest computational penalties.
Sorting algorithms such as the bitonic merge sort: sorting algorithms are not typically thought of as parallel operations, except for bitonic sorts, at which GPUs perform excellently.
Search algorithms such as binary search: per unit of energy spent, GPUs are quite efficient at searching.
Complex physical simulation such as protein folding: there are many unsolved problems in the medical community with regard to simulation of protein behavior. Access to very low cost-per-TFLOP computing power such as GPU arrays could possibly increase the pace of solutions.
Real-time physics in games and rendering: throughput dominates the computational demand in real-time applications involving many elements such
as in games.
N-body simulations: every body interacts with every other body, producing a large set of identical, independent force calculations that map naturally onto parallel hardware.
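To illustrate why the bitonic sort named above suits SIMD hardware so well: each stage of the sorting network is an identical, independent compare-and-swap applied across the whole array. A CUDA sketch (our own, assuming the array length is a power of two and a multiple of the block size) might look like the following.

```cuda
#include <cuda_runtime.h>

// One stage of a bitonic sorting network: every thread performs the same
// independent compare-and-swap, a perfect fit for SIMD execution.
__global__ void bitonicStage(float *a, unsigned j, unsigned k)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned partner = i ^ j;            // element this thread compares with
    if (partner > i) {
        bool ascending = ((i & k) == 0); // direction of this subsequence
        if ((ascending && a[i] > a[partner]) ||
            (!ascending && a[i] < a[partner])) {
            float t = a[i]; a[i] = a[partner]; a[partner] = t;
        }
    }
}

// Host-side driver: O(log^2 n) stages, each one fully parallel.
void bitonicSort(float *d_a, unsigned n)  // n: power of two, >= 256
{
    const unsigned threads = 256;
    for (unsigned k = 2; k <= n; k <<= 1)
        for (unsigned j = k >> 1; j > 0; j >>= 1)
            bitonicStage<<<n / threads, threads>>>(d_a, j, k);
    cudaDeviceSynchronize();
}
```

Every kernel launch runs the same tiny program over the entire array, which is why GPUs handle this sort so well while data-dependent sorts like quicksort map poorly.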
8 Conclusion
The recent additions which have made general-purpose computation possible have been largely toward CPU-like elements, such as universal floating-point capability, loops, conditionals, and instruction set compatibility with C-like compilers.
GPUs and CPUs still fill different niches in the market for high performance architecture. GPUs are built for large throughput, running fairly simple but costly programs, and despite their general purpose aspirations, are still designed to be optimal for a very specific purpose. CPUs, on the other hand, are designed to lower the latency of a single task as much as possible and can handle the most complex programs with ease, but still lag behind GPUs when large numbers of simple calculations can be done in parallel.
Both will likely always be needed; it is unlikely that the need for either approach will ever disappear, though the line between the two will continue to blur. The variety of demand for computational power will likely produce more innovations that bring computation paradigms across the GPU-CPU border.
Having a highly parallel, high throughput computation resource like the GPU
easily available and accessible in addition to the ubiquitous CPU on modern
PCs will allow the best of both worlds to be realized.
References
[1] D. Blythe. The Direct3D 10 system. Commun. ACM.
IEEE/ACM International
[5] P. Glaskowsky. Nvidia's Fermi: the first complete GPU computing architecture. Technical report, NVIDIA, 2009.
[6] N. Govindaraju, S. Larsen, J. Gray, and D. Manocha. A memory model for scientific algorithms on graphics processors.
[7] Z. Hakura and A. Gupta. The design and analysis of a cache architecture
for texture mapping.
Presenta-
Proceedings of
CNET News,
GP2, 2004.
2000.
2008.
[12] J. Montrym and H. Moreton. The GeForce 6800.
2010.
[14] Nvidia. GeForce 8800 architecture overview, TB-02787-001. Technical report, Nvidia, 2006.
[15] NVIDIA.
Technical
[16] J. Owens.
[18] D. Patterson. The top 10 innovations in the new Nvidia Fermi architecture, and the top 3 next challenges. Technical report, U.C. Berkeley, 2009.
[19] D. Patterson and J. Hennessy. Elsevier, 2008.
[20] Y. Yang, P. Xiang, J. Kong, and H. Zhou.