
2012 IEEE 4th International Conference on Cloud Computing Technology and Science

On Implementation of GPU Virtualization Using PCI Pass-Through

Chao-Tung Yang, Hsien-Yi Wang, Wei-Shen Ou, Yu-Tso Liu
Department of Computer Science, Tunghai University
Taichung City 40704, Taiwan
e-mail: ctyang@thu.edu.tw

Ching-Hsien Hsu
Department of Computer Science and Information Engineering, Chunghua University, Hsinchu, Taiwan
e-mail: chh@chu.edu.tw

Abstract—In this paper, we use PCI pass-through technology to let the virtual machines in a virtualized environment use an NVIDIA graphics card programmed with CUDA for parallel computing. A virtual machine then has not only virtual CPUs but also a real GPU for computing, so its performance is expected to increase dramatically. This paper measures the performance differences between virtual machines and physical machines when using CUDA, and examines whether the number of virtual CPUs assigned to a virtual machine influences CUDA performance. Finally, we compare two open-source hypervisors to see whether their CUDA performance differs after PCI pass-through. Through these experiments, we determine which environment reaches the best efficiency when using CUDA in a virtualized environment.

Keywords- CUDA; GPU virtualization; Cloud Computing; PCI pass-through

I. INTRODUCTION

GPUs are true "many-core" processors with hundreds of processing elements. A graphics processing unit (GPU) is a specialized microprocessor that offloads and accelerates graphics rendering from the central processor. Modern GPUs are very efficient at manipulating computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for a range of complex algorithms. A current CPU has only 8 cores on a single chip, whereas a GPU has grown to 448 cores. Judging from the number of cores, the GPU is well suited to programs amenable to massively parallel processing. Although the frequency of a GPU core is lower than that of a CPU core, we believe that massive parallelism can overcome the lower frequency. GPUs have already been adopted in supercomputers: on the TOP500 list of November 2010 [1], three of the first five supercomputers were built with NVIDIA GPUs [2].

In this paper, we introduce several hypervisor environments for virtualization and different virtualization types for cloud systems. Several types of hardware virtualization are introduced as well. This paper focuses on GPU virtualization: we implement a system in a virtualized environment and use PCI pass-through [4] technology to let the virtual machines on the system use a GPU accelerator to increase their computing power. We conduct several experiments to compare a virtual machine with GPU virtualization (PCI pass-through) against a native machine with a GPU.

The rest of this paper is organized as follows. Section 2 gives a background review of cloud computing, virtualization technology, and CUDA (Compute Unified Device Architecture) [5]. Section 3 describes the system implementation, the architecture and specification of the Tesla C1060, and the end user's interface. Section 4 presents the experimental environment, the methods we use, and the results of GPU virtualization, and shows that the proposed approach improves performance. Finally, conclusions are drawn in Section 5.

II. BACKGROUND REVIEW

A. Cloud Computing

Cloud computing [3] is an Internet-based computing approach that has been a topic of lively discussion worldwide in recent years. It means using software services and storing data on remote servers. It is a new service architecture that brings users new choices of software and data-storage services. Users no longer need to know the details of the infrastructure in the "cloud", possess specialized knowledge, or directly control the real machines.

B. Virtualization

Virtualization technology [6] is the creation of a virtual version of something, such as a hardware platform, an operating system, a storage device, or network resources. The goal of virtualization is to centralize administrative tasks while improving scalability and overall hardware-resource utilization. With virtualization, several operating systems can run in parallel on a single powerful server without interfering with one another.

The original operating system becomes a Virtual Machine Monitor (VMM). A guest OS is no longer executed directly by the CPU; instead, the VMM translates its operations for the CPU and other hardware to carry out.

1) Full-Virtualization: Unlike the traditional arrangement, in which the operating-system kernel runs at Ring 0, full virtualization places the hypervisor there instead. The hypervisor manages all Ring 0 instructions issued by the guest OSes.

2) Para-Virtualization: Para-virtualization [7] does not virtualize all of the hardware. There is a privileged host OS called Domain0, which runs alongside the other guest OSes and uses native operating-system drivers to manage the hardware. A guest OS accesses the real hardware by calling the drivers in Domain0 through the hypervisor; a request sent by a guest OS to the hypervisor is called a hypercall.
3) Xen: Xen supports two deployment types, the host-OS type and the hypervisor type. In the host-OS type, the VM layer is deployed on a host OS such as Windows or Linux, and other operating systems are then installed on top of the VM layer. The operating systems running on top of the VM layer are called guest OSes.

C. CUDA

CUDA (Compute Unified Device Architecture) [5][8][9][10][11][12][13][14] is a parallel computing architecture developed by NVIDIA; it is the official name of NVIDIA's approach to GPGPU. It was the first environment to offer a C compiler for developing GPU programs. CUDA's programming model maintains a low learning curve for programmers familiar with standard programming languages such as C and FORTRAN, as shown in Figure 1. The CUDA architecture is compatible with OpenCL [15][16][17] as well as NVIDIA's own C compiler. Whether the instructions come from CUDA C or OpenCL, they are translated into PTX code by the driver and then executed by the graphics cores.

Figure 1. CUDA Programming Model, from NVIDIA [2].
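To make the programming model described above concrete, the following is a minimal sketch of a CUDA C vector addition in the spirit of the VecAdd benchmark used in Section 4. The kernel name, problem size, and launch configuration are illustrative choices, not code taken from the benchmark suite.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread adds one element; the grid is sized to cover all n elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;                       // illustrative problem size
    size_t bytes = n * sizeof(float);
    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    int threads = 256;                           // threads per block
    int blocks = (n + threads - 1) / threads;    // enough blocks to cover n
    vecAdd<<<blocks, threads>>>(da, db, dc, n);
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", hc[0]);                // expect 3.0
    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```

The same binary can be run unchanged inside a virtual machine once the GPU has been passed through, which is the property the experiments in Section 4 rely on.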

D. Virtualization on GPU

Virtualization has become more and more popular in recent years, and the demand for it keeps increasing. Ordinary virtual machines are inadequate for our purposes, because the environment of a virtual machine is, after all, reached only through virtualization.

Figure 2. Pass-through within the Hypervisor.

Unlike the two kinds of device emulation described before, device pass-through provides an isolated device to a given guest operating system, as shown in Figure 2. Assigning devices to specific guests is useful when those devices cannot be shared. In terms of performance, near-native performance can be achieved using device pass-through.

E. Related Work

In recent years, virtualized environments in the cloud have become more and more popular. The balance between performance and cost is the point on which everybody focuses. To use server resources more effectively, virtualization technology is the solution: by running many virtual machines on one server, the resources can be used more efficiently. However, virtual machines have their own performance limits, and users who run heavy computations in a virtual machine will be constrained by them.

There are several approaches that pursue virtualization of the CUDA Runtime API for VMs, such as rCUDA [18][19][20], vCUDA [21], GViM [22], and gVirtuS [23]. These solutions feature a distributed middleware comprised of two parts, a front-end and a back-end [24].

Figure 3. Front-End and Back-End.

Figure 3 shows that the front-end middleware is installed in the virtual machine, while the back-end middleware, which has direct access to the acceleration hardware, runs in the host OS that executes the VMM.

rCUDA uses the Sockets API to let a client and a server communicate with each other, so that the client can use the GPU on the server. It is a production-ready framework to run CUDA applications from VMs, based on a recent CUDA API version, with a customized communication protocol, and it is VMM-independent [18]. Its architecture is shown in Figure 4. Unlike rCUDA, GViM and vCUDA come at the expense of losing VMM independence.

Figure 4. Architecture of rCUDA.
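To illustrate the split-driver idea behind these middleware approaches, the toy front-end stub below forwards a single "allocate device memory" request from a guest to a back-end daemon over TCP. The address, port, opcode encoding, and wire format are invented for illustration only; real systems such as rCUDA and vCUDA define their own protocols and forward the full CUDA API.

```c
/* Toy front-end stub: marshals one request and ships it to a host-side
 * back-end over a socket.  BACKEND_IP and BACKEND_PORT are placeholders. */
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#define BACKEND_IP   "192.168.122.1"   /* host address seen from the guest (placeholder) */
#define BACKEND_PORT 9999              /* placeholder port for the back-end daemon       */

/* One fixed-size request: opcode plus payload size (toy encoding). */
struct gpu_request {
    uint32_t opcode;   /* 1 = allocate device memory */
    uint64_t bytes;    /* requested allocation size  */
};

static int forward_alloc(uint64_t bytes) {
    struct sockaddr_in addr = {0};
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    addr.sin_family = AF_INET;
    addr.sin_port   = htons(BACKEND_PORT);
    inet_pton(AF_INET, BACKEND_IP, &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;                     /* no back-end daemon reachable */
    }
    struct gpu_request req = { .opcode = 1, .bytes = bytes };
    ssize_t sent = write(fd, &req, sizeof(req));
    close(fd);
    return sent == (ssize_t)sizeof(req) ? 0 : -1;
}

int main(void) {
    if (forward_alloc(1 << 20) == 0)
        puts("request forwarded to back-end");
    else
        puts("no back-end reachable (expected when run stand-alone)");
    return 0;
}
```

In a complete system, the back-end would execute the corresponding CUDA Runtime call on the physical GPU and return the result over the same connection; PCI pass-through, by contrast, removes this forwarding layer entirely.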
III. SYSTEM IMPLEMENTATION

A. System Architecture

To use a GPU accelerator in virtual machines, this work uses PCI pass-through to implement the system and obtain better performance. With device pass-through, near-native performance can be achieved. This technique is well suited to networking applications, workloads with high disk I/O, or workloads that use hardware accelerators and that have avoided virtualization because of contention and performance degradation in the hypervisor. Assigning devices to specific guests is also useful when those devices cannot be shared; for example, if a system includes multiple video adapters, each adapter can be passed through to a distinct guest domain.

VT-d pass-through is a technique that gives a DomU exclusive access to a PCI function using the IOMMU [25] provided by VT-d. It is primarily targeted at HVM (fully virtualized) guests, because para-virtualized pass-through does not require VT-d. An important requirement is hardware support: in addition to the motherboard chipset and BIOS, the CPU must support IOMMU I/O virtualization (VT-d). VT-d is disabled by default; to enable it, the 'iommu' boot parameter must be set.

Figure 5. IOMMU On.
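As a rough outline of the steps described above on a Xen host, the sketch below enables the IOMMU and assigns the GPU to an HVM guest. The PCI address, guest configuration path, and exact toolstack commands are assumptions; they vary with the distribution and Xen version.

```bash
# 1. Enable the IOMMU in the boot parameters and reboot.
#    On the Xen hypervisor command line this is typically 'iommu=1';
#    the equivalent option for a bare Linux/KVM host is 'intel_iommu=on'.

# 2. Identify the GPU and detach it from Domain0 so it can be assigned
#    (0000:03:00.0 is a placeholder address; find yours with lspci).
lspci | grep -i nvidia
modprobe xen-pciback
xl pci-assignable-add 0000:03:00.0     # recent xl toolstacks; older setups hide the device via pciback instead

# 3. Pass the device through in the HVM guest configuration file, e.g.:
#    pci = [ '0000:03:00.0' ]
#    then start the guest.
xl create /etc/xen/gpu-vm.cfg          # placeholder configuration path
```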

B. Tesla C1060 Computing Processor Board

The NVIDIA Tesla™ C1060 [26] transforms a workstation into a high-performance computer that outperforms a small cluster. It gives technical professionals a dedicated computing resource at their desk side that is much faster and more energy-efficient than a shared cluster in the data center.

C. Tesla C2050 Computing Processor Board

The NVIDIA Tesla™ C2050 [28] is based on the next-generation CUDA™ architecture code-named "Fermi". The 20-series family of Tesla GPUs supports many "must have" features for technical and enterprise computing, including C++ support, ECC memory for uncompromised accuracy and scalability, and a 7x increase in double-precision performance compared with Tesla 10-series GPUs. Compared to the latest quad-core CPUs, Tesla C2050 computing processors deliver equivalent supercomputing performance at 1/10th the cost and 1/20th the power consumption.

D. End User's Operating Interface

When a user creates a virtual machine and the GPU is passed through to it successfully, the user can view the result through an application called "Virtual Machine Manager" in Linux. In Figure 6, the GPU pass-through has succeeded and the virtual machine is running.

Figure 6. PCI Pass-through Successful.

Alternatively, users can use PieTTY or VNC. With a network connection and a VNC connection, users set the IP address and port and can then connect to the virtual machine. In the console, the command "lspci" shows whether PCI pass-through is working or not. The setup steps are shown in Figure 7, Figure 8, and Figure 9, respectively.

Figure 7. View in PieTTY.

Figure 8. View in VNC.

Figure 9. Using lspci.
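Inside the guest, the check mentioned above can be performed from the console. A minimal sketch follows; the exact device string in the output will differ, and nvidia-smi is only available after the NVIDIA driver has been installed in the guest.

```bash
# List the PCI devices visible to the guest and filter for the passed-through GPU.
lspci | grep -i nvidia

# Once the NVIDIA driver and CUDA toolkit are installed in the guest,
# the device can also be queried directly.
nvidia-smi
```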

E. System Environment

Having described the design principles and implementation methods above, we present here several experiments conducted on two machines. The nodes' hardware and software specification is listed in Table I.

TABLE I. HARDWARE/SOFTWARE SPECIFICATION

Node    CPU          Memory   Disk   OS           Hypervisor   GPU
Node1   Xeon E5506   12 GB    1 TB   CentOS 6.2   Xen          Quadro NVS 295 / Tesla C1060 / Tesla C2050

In Table I, we use two machines with the same hardware specification; one runs the Xen hypervisor and the other runs KVM. The purpose is to compare the performance of these two hypervisors when using PCI pass-through with the same GPU. The Quadro NVS 295 [27] is used as the primary graphics card, and the Tesla C1060 is the card we use for computing and pass through to the virtual machine.

TABLE II. HARDWARE/SOFTWARE SPECIFICATION OF VIRTUAL MACHINES

VM    CPUs      RAM    Disk    OS           Hypervisor   GPU              Virtualization
VM1   1, 2, 4   1 GB   12 GB   CentOS 6.2   Xen          Quadro NVS 295   Full
VM2   1, 2, 4   1 GB   12 GB   CentOS 6.2   Xen          Tesla C1060      Full
VM3   1, 2, 4   1 GB   12 GB   CentOS 6.2   Xen          Tesla C2050      Full

Table II gives the hardware/software specification of the virtual machines. We create three virtual machines with the same specification but with different CPU counts and different GPUs, because we want to examine whether the number of CPUs affects the performance of a virtual machine under PCI pass-through; we therefore use 1, 2, or 4 CPUs in the virtual machines and compare the results. The virtualization type of these virtual machines is full virtualization, because we found in our research that PCI pass-through does not work with para-virtualization. Table III shows the GPU software environment.

TABLE III. GPU SOFTWARE ENVIRONMENT

Driver         285.05.33
CUDA toolkit   4.1.28
CUDA SDK       4.1.28

IV. EXPERIMENTAL METHODS AND RESULTS

A. Experimental Methods

We set up ten comparison benchmarks: alignedTypes, asyncAPI, BlackScholes, clock, convolutionSeparable, fastWalshTransform, matrixMul, bandwidthTest, matrixmul-sizeable, and VecAdd. The first seven benchmarks are taken from the CUDA SDK [8]; from the suite we selected seven representative SDK benchmarks with varying computational loads and data sizes that exercise different CUDA features, and they are executed with their default settings. The other two benchmarks, matrixmul-sizeable and VecAdd, allow the problem size to be set, so they can be run with a high computational load. The execution time of all SDK benchmarks is measured with the command 'time' in CentOS [29]. Table IV shows the amount of data transferred by each benchmark.

TABLE IV. DATA TRANSFERS OF BENCHMARKS

SDK name                Data transfers
Aligned Types           413.26 MB
Async API               128.00 MB
Black Scholes           76.29 MB
Clock                   2.50 KB
Convolution Separable   36.00 MB
Fast Walsh Transform    64.00 MB
Matrix Mul              79.00 KB
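As noted above, wall-clock execution times are collected with the shell's time command. A typical invocation might look like the following; the binary names come from the SDK, while the command-line size argument for the sizeable benchmark is only an assumed interface.

```bash
# Time an SDK benchmark with its default workload.
time ./matrixMul

# Time a sizeable benchmark with an explicit problem size
# (argument form shown here is illustrative, not the benchmark's documented interface).
time ./matrixmul-sizeable 4096
```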
The first experiment compares native and virtual machine performance, showing how much GPU performance is lost with PCI pass-through. The second experiment compares virtual machines with 1, 2, and 4 CPUs, to determine whether the number of CPUs presented to a virtual machine affects GPU performance. The final experiment compares the performance of the two common virtualization hypervisors, showing which one delivers better GPU performance with PCI pass-through.

B. Experimental Results

We first analyze the performance of the CUDA SDK benchmarks running in a VM using PCI pass-through, and compare their execution times with those in a native environment, i.e., using the regular CUDA Runtime library in a non-virtualized environment. The results of these experiments are reported below.
Figure 10. Execution Time between Native and VM with C1060.

Figure 11. Execution Time between Native and VM with C2050.

Figure 12. Execution Time between Native and VM with NVS295.

Figure 13. Execution time of MatrixMul with C1060.

Figure 14. Execution time of MatrixMul with C2050.

Figure 15. Execution time of MatrixMul with NVS295.

Figure 10, Figure 11, and Figure 12 show the execution times of the SDK benchmarks on the native machine and on a virtual machine using one CPU on Xen. We can see that the real time of these benchmarks on the virtual machine is less than on the native machine. Figure 13, Figure 14, and Figure 15 show the execution time of MatrixMul, where we see a similar result: the execution times of the four environments are very close, and the execution time on the virtual machine is again shorter than on the real machine. The execution time for problem size 256 is shorter than 0.1 second, so it is difficult to see it clearly in these figures.

Figure 16. Comparison with rCUDA.
Figure 17. Comparison with vCUDA.

We also compare our approach with rCUDA and vCUDA; Figure 16 and Figure 17 show the comparison. The time shown in the figures is the difference obtained by subtracting the execution time before GPU virtualization from the execution time after GPU virtualization; the execution times for rCUDA and vCUDA are taken from [18] and [21]. From these two figures we can see that using PCI pass-through does not add much time. Compared with these two technologies, PCI pass-through is more efficient.

V. CONCLUSIONS

In our work, we see that GPU performance is essentially the same in the native and the virtual machine. No matter how many CPUs the virtual machine has, the GPU provides the same performance through PCI pass-through. The data transfer time is shorter than with rCUDA, because rCUDA relies on the network and network speed is the key factor for rCUDA; moreover, code must be rewritten to use rCUDA, but not for PCI pass-through. On the other hand, rCUDA lets a virtual machine use not only a local GPU but also a remote GPU over the network. Compared with vCUDA, PCI pass-through is more direct: vCUDA uses middleware as the connection point, which costs more time than PCI pass-through. Overall, using PCI pass-through to provide GPU-accelerated computing in virtual machines saves resources while delivering the same high performance as a real machine.

ACKNOWLEDGEMENTS

This work is supported in part by the National Science Council, Taiwan, under grants no. NSC 101-2221-E-029-014 and NSC 101-2622-E-029-008-CC3.

REFERENCES

[1] TOP500, http://www.top500.org/
[2] NVIDIA, http://www.nvidia.com
[3] Cloud computing, http://en.wikipedia.org/wiki/Cloud_computing
[4] PCI pass-through, http://www.ibm.com/developerworks/linux/library/l-pci-passthrough
[5] CUDA, http://www.nvidia.com.tw/object/cuda_home_new_tw.html
[6] Virtualization, http://en.wikipedia.org/wiki/Virtualization
[7] Para-virtualization, http://en.wikipedia.org/wiki/Paravirtualization
[8] NVIDIA CUDA SDK, http://developer.nvidia.com/cuda-cc-sdk-code-samples
[9] Download CUDA, http://developer.nvidia.com/object/cuda.htm
[10] NVIDIA CUDA Programming Guide, http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
[11] CUDA (Wikipedia), http://en.wikipedia.org/wiki/CUDA
[12] F. V. Lionetti, A. D. McCulloch, and S. B. Baden, "Source-to-Source Optimization of CUDA C for GPU Accelerated Cardiac Cell Modeling," Lecture Notes in Computer Science, vol. 6271, Euro-Par 2010 - Parallel Processing, 2010, pp. 38-49.
[13] S. Jung, "Parallelized pairwise sequence alignment using CUDA on multiple GPUs," BMC Bioinformatics, vol. 10, suppl. 7, A3, 2009.
[14] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron, "A Performance Study of General-Purpose Applications on Graphics Processors Using CUDA," Journal of Parallel and Distributed Computing, vol. 68, no. 10, Oct. 2008, pp. 1370-1380.
[15] OpenCL, http://www.khronos.org/opencl/
[16] OpenCL (Wikipedia), http://en.wikipedia.org/wiki/OpenCL
[17] M. J. Harvey and G. De Fabritiis, "Swan: A tool for porting CUDA programs to OpenCL," Computer Physics Communications, vol. 182, no. 4, Apr. 2011, pp. 1093-1099.
[18] J. Duato, A. J. Peña, F. Silla, R. Mayo, and E. S. Quintana-Ortí, "rCUDA: Reducing the number of GPU-based accelerators in high performance clusters," Proceedings of the 2010 International Conference on High Performance Computing & Simulation (HPCS 2010), Jun. 2010, pp. 224-231.
[19] J. Duato, A. J. Peña, F. Silla, J. C. Fernandez, R. Mayo, and E. S. Quintana-Ortí, "Enabling CUDA acceleration within virtual machines using rCUDA," Proceedings of the 18th International Conference on High Performance Computing (HiPC 2011), 2011, pp. 1-10.
[20] J. Duato, A. J. Peña, F. Silla, R. Mayo, and E. S. Quintana-Ortí, "Performance of CUDA virtualized remote GPUs in high performance clusters," Proceedings of the International Conference on Parallel Processing (ICPP), Sep. 2011, pp. 365-374.
[21] L. Shi, H. Chen, and J. Sun, "vCUDA: GPU accelerated high performance computing in virtual machines," Proceedings of the IEEE International Symposium on Parallel & Distributed Processing (IPDPS'09), 2009, pp. 1-11.
[22] V. Gupta, A. Gavrilovska, K. Schwan, H. Kharche, N. Tolia, V. Talwar, and P. Ranganathan, "GViM: GPU-accelerated virtual machines," Proceedings of the 3rd Workshop on System-level Virtualization for High Performance Computing, ACM, NY, USA, 2009, pp. 17-24.
[23] G. Giunta, R. Montella, G. Agrillo, and G. Coviello, "A GPGPU transparent virtualization component for high performance computing clouds," Euro-Par 2010 - Parallel Processing, ser. LNCS, vol. 6271, P. D'Ambra, M. Guarracino, and D. Talia, Eds., Springer Berlin / Heidelberg, 2010, pp. 379-391.
[24] Front and back ends, http://en.wikipedia.org/wiki/Front_and_back_ends
[25] N. Amit, M. Ben-Yehuda, and B.-A. Yassour, "IOMMU: Strategies for Mitigating the IOTLB Bottleneck," Lecture Notes in Computer Science, vol. 6161, Computer Architecture, 2012, pp. 256-274.
[26] NVIDIA Tesla C1060 Computing Processor, http://www.nvidia.com/object/product_tesla_c1060_us.html
[27] NVIDIA Quadro NVS 295, http://www.nvidia.com.tw/object/product_quadro_nvs_295_tw.html
[28] NVIDIA Tesla C2050 Computing Processor, http://www.nvidia.com.tw/object/product_tesla_C2050_C2070_tw.html
[29] CentOS, http://www.centos.org/
