
SOLARIS™ OPERATING SYSTEM HARDWARE

VIRTUALIZATION PRODUCT ARCHITECTURE

Chien-Hua Yen, ISV Engineering


chien.yen@sun.com

Sun BluePrints™ On-Line — November 2007

Part No 820-3703-10
Revision 1.0, 11/27/07
Edition: November 2007
Sun Microsystems, Inc.

Table of Contents

Introduction ........................................................... 1
  Hardware Level Virtualization ............................... 2
  Scope ................................................................... 4
Section 1: Background Information ........................ 7
  Virtual Machine Monitor Basics ............................ 9
    VMM Requirements ........................................... 9
    VMM Architecture ............................................ 11
  The x86 Processor Architecture ......................... 21
  SPARC Processor Architecture .......................... 29
Section 2: Hardware Virtualization Implementations ... 37
  Sun xVM Server ................................................. 39
    Sun xVM Server Architecture Overview ........... 40
    Sun xVM Server CPU Virtualization ................. 45
    Sun xVM Server Memory Virtualization ........... 52
    Sun xVM Server I/O Virtualization ................... 56
  Sun xVM Server with Hardware VM (HVM) ........ 63
    HVM Operations and Data Structure ............... 64
    Sun xVM Server with HVM Architecture Overview ... 68
  Logical Domains ................................................ 79
    Logical Domains (LDoms) Architecture Overview ... 80
    CPU Virtualization in LDoms ........................... 84
    Memory Virtualization in LDoms ...................... 88
    I/O Virtualization in LDoms .............................. 91
  VMware ............................................................. 97
    VMware Infrastructure Overview ..................... 98
    VMware CPU Virtualization ............................. 98
    VMware Memory Virtualization ...................... 103
    VMware I/O Virtualization .............................. 103
Section 3: Additional Information ....................... 107
  VMM Comparison ............................................ 109
  References ...................................................... 111
  Terms and Definitions ...................................... 113
  Author Biography ............................................ 117

Chapter 1
Introduction

In the IT industry, virtualization is a mechanism of presenting a set of logical computing resources over a fixed hardware configuration so that these logical resources can be accessed in the same manner as the original hardware configuration. The concept of virtualization is not new. First introduced in the late 1960s on mainframe computers, virtualization has recently become popular as a means to consolidate servers and reduce the costs of hardware acquisition, energy consumption, and space utilization. The hardware resources that can be virtualized include computer systems, storage, and the network.

Server virtualization can be implemented at different levels on the computing stack, including the application level, operating system level, and hardware level:

• An example of application level virtualization is the Virtual Machine for the Java™ platform (Java Virtual Machine or JVM™ machine)¹. The JVM implementation provides an application execution environment as a layer between the application and the OS, removing application dependency on OS-specific APIs and hardware-specific characteristics.

¹ The terms "Java Virtual Machine" and "JVM" mean a Virtual Machine for the Java™ platform.

• OS level virtualization abstracts OS services such as file systems, devices, networking, and security, and provides a virtualized operating environment to applications. Typically, OS level virtualization is implemented by the OS kernel. Only one instance of the kernel runs on the system, and it provides multiple virtualized operating environments to applications. Examples of OS level virtualization include Solaris™ Containers technology, Linux VServers, and FreeBSD Jails. OS level virtualization has less performance overhead and better system resource utilization than hardware level virtualization. Since one OS kernel is shared among all virtual operating environments, isolation among all virtualized operating environments is as good as the OS provides.
• Hardware level virtualization, discussed in detail in this paper, has become popular recently because of increasing CPU power and low utilization of CPU resources in the IT data center. Hardware level virtualization allows a system to run multiple OS instances. With less sharing of system resources than OS level virtualization, hardware virtualization provides stronger isolation of operating environments.
The Solaris OS includes bundled support for application and OS level virtualization with its JVM software and Solaris Containers offerings. Sun first added support for hardware virtualization in the Solaris 10 11/06 release with Sun Logical Domains (LDoms) technology, supported on Sun servers which utilize UltraSPARC T1 or UltraSPARC T2 processors. VMware also supports the Solaris OS as a guest OS in its VMware Server and Virtual Infrastructure products starting with the Solaris 10 1/06 release. In October 2007, Sun announced the Sun xVM family of products that includes the Sun xVM Server and the Sun xVM Ops Center management system:
• Sun xVM Server — includes support for the Xen open source community work [6] on the x86 platform and support for LDoms on the UltraSPARC T1/T2 platform
• Sun xVM Ops Center — a management suite for the Sun xVM Server

Note - In this paper, in order to distinguish the discussion of x86 and UltraSPARC T1/T2 processors, Sun xVM Server is specifically used to refer to the Sun hardware virtualization product for the x86 platform, and LDoms is used to refer to the Sun hardware virtualization product for the UltraSPARC T1 and T2 platforms.

The hardware virtualization technology and the new products built around it have expanded the options and opportunities for deploying servers with better utilization, more flexibility, and enhanced functionality. In reaping the benefits of hardware virtualization, IT professionals also face the challenge of operating within the limitations of a virtualized environment while delivering the same service levels as a physical operating environment. Meeting this requirement calls for a good understanding of virtualization technologies, CPU architecture, and software implementations, and an awareness of their strengths and limitations.

Hardware Level Virtualization

Hardware level virtualization is a mechanism of virtualizing system hardware resources such as CPU, memory, and I/O, and creating multiple execution environments on a single system. Each of these execution environments runs an instance of the operating system.
A hardware level virtualization implementation typically consists of several virtual machines (VMs), as shown in Figure 1. A layer of software, the virtual machine monitor (VMM), manages system hardware resources and presents an abstraction of these resources to each VM. The VMM runs in privileged mode and has full control of system hardware. A guest operating system (GOS) runs in each VM. The relationship of a GOS to its VM is analogous to that of a program to a process, with the VMM playing the role that the OS plays for processes.
Figure 1. In hardware level virtualization, the VMM software manages hardware resources and presents an abstraction of these resources to one or more virtual machines.
[Figure: several VMs, each running a guest OS (GOS), sit above the Virtual Machine Monitor (VMM), which runs directly on the platform hardware.]
Hardware resource virtualization can take the form of sharing,
partitioning, or delegating:
• Sharing — Resources are shared among VMs. The VMM
coordinates the use of resources by VMs. For example, the
VMM may include a CPU scheduler to run threads of VMs
based on a pre-determined scheduling policy and VM priority.
• Partitioning — Resources are partitioned so that each VM gets
the portion of resources allocated to it. Partitioning can be
dynamically adjusted by the VMM based on the utilization of
each VM. Examples of resource partitioning include the
ballooning memory technique employed in Sun xVM Server
and VMware, and the allocation of CPU resources in Logical
Domains technology.
• Delegating — With delegation, resources are not directly accessible by a VM. Instead, all resource accesses are made through a control VM that has direct access to the resource. I/O device virtualization is normally implemented through delegation.
The distinctions and boundaries between these virtualization methods are often not clear-cut. For example, sharing may be used for one component and partitioning for others, with the two together making up an integral functional module.

Benefits of Hardware Level Virtualization


Hardware level virtualization allows multiple operating systems
to run on a single server system. This ability offers many
benefits that are not available in a single OS server. These
benefits can be summarized in three functional categories:

• Workload Consolidation
According to Gartner [17] "Intel servers running at 10 percent
to 15 percent utilization are common." Many IT organizations
run out and buy a new server every time they deploy a new
application. With virtualization, computers no longer have to be
dedicated to a particular task. Applications and users can
share computing resources, remaining blissfully unaware that
they are doing so. Companies can shift computing resources
around to meet demand at a given time, and get by with less
infrastructure overall. When used for consolidation,
virtualization can also save
hardware and maintenance expenses, floor space, cooling
costs, and power consumption.
• Workload Migration
Hardware level virtualization decouples the OS from the underlying physical platform resources. A guest OS state, along with the user applications running on top of it, can be encapsulated into an entity and moved to another system. This capability is useful for migrating a legacy OS system from an old under-powered server to a more powerful server while preserving the investment in software. When a server needs to be maintained, a VM can be dynamically migrated to a new server with no down time, further enhancing availability. Changes in workload intensity levels can be addressed by dynamically shifting underlying resources to the starving VMs. Legacy applications that ran natively on a server continue to run on the same OS running inside a VM, leveraging the existing investment in applications and tools.
• Workload Isolation
Workload isolation includes fault and security isolation. Multiple guest OSes run independently, and thus a software failure in one VM does not affect other VMs. However, the VMM layer introduces a single point of failure that can bring down all VMs on the system. A VMM failure, although potentially catastrophic, is less probable than a failure in the OS because the complexity of the VMM is much less than that of an OS.
Multiple VMs also provide strong security isolation among themselves, with each VM running an independent OS. Security intrusions are confined to the VM in which they occur. The boundary around each VM is enforced by the VMM, and inter-domain communication, if provided by the VMM, is restricted to specific kernel modules only.
One distinct feature of hardware level virtualization is the ability
to run multiple instances of heterogeneous operating systems on
a single hardware platform. This feature is important for the
following reasons:
• Better security and fault containment among application
services can be achieved through OS isolation.
• Applications written for one OS can run on a system
that supports a different OS.
• Better management of system resource utilization is possible
among the virtualized environments.

Scope
This paper explores the underlying hardware architecture and
software implementation for enabling hardware virtualization.
Great emphasis has been placed on the CPU hardware
architecture limitations for virtualizing CPU services and their
software workarounds. In addition, this paper discusses in detail
the software architecture for implementing the following types of
virtualization:
• CPU virtualization — uses processor privileged mode to
control resource usage by the VM, and relays hardware traps
and interrupts to VMs
• Memory virtualization — partitions physical memory among
multiple VMs and handles page translations for each VM
• I/O virtualization — uses a dedicated VM with direct access to
I/O devices to provide device services

The paper is organized into three sections. Section I, Background Information, contains information on VMMs and provides details on the x86 and SPARC processors:
• "Virtual Machine Monitor Basics" on page 9 discusses the core
of hardware virtualization, the VMM, as well as requirements
for the VMM and several types of VMM implementations.
• "The x86 Processor Architecture" on page 21 describes
features of the x86 processor architecture that are pertinent to
virtualization.
• "SPARC Processor Architecture" on page 29 describes
features of the SPARC processor that affect virtualization
implementations.
Section II, Hardware Virtualization Implementations, provides
details on the Sun xVM Server, Logical Domains, and VMware
implementations:
• "Sun xVM Server" on page 39 discusses a paravirtualized
Solaris OS that is based on an open source VMM
implementation for x86[6] processors and is planned for
inclusion in a future Solaris release.
• "Sun xVM Server with Hardware VM (HVM)" on page 63
continues the discussion of Sun xVM Server for the x86
processors that support hardware virtual machines: Intel-VT
and AMD-V.
• "Logical Domains" on page 79 discusses Logical Domains
(LDoms), supported on Sun servers that utilize UltraSPARC T1
or T2 processors, and describes Solaris OS support for this
feature.
• "VMware" on page 97 discusses the VMware
implementation for the VMM.
Section III, Additional Information, contains a concluding
comparison, references, and appendices:
• "VMM Comparison" on page 109 presents a summary of the
VMM implementations discussed in this paper.
• "References" on page 111 provides a comprehensive
listing of related references.
• "Terms and Definitions" on page 113 contains a
glossary of terms.
• "Author Biography" on page 117 provides information
on the author.


Section I
Background Information

• Chapter 2: Virtual Machine Monitor Basics (page 9)
• Chapter 3: The x86 Processor Architecture (page 21)
• Chapter 4: SPARC Processor Architecture (page 29)


Chapter 2
Virtual Machine Monitor Basics

At the heart of hardware level virtualization is the VMM. The VMM is a software layer that abstracts computer hardware resources so that multiple OS instances can run on a physical system. Hardware resources are normally controlled and managed by the OS. In a virtualized environment the VMM takes over this role, managing and coordinating hardware resources. There is no clear definitional boundary between an OS and the VMM. The division of functions between the OS and the VMM can be influenced by factors such as processor architecture, performance, the OS itself, and nontechnical requirements such as ease of installation and migration.
Certain VMM requirements exist for running multiple OS
instances on a system. These requirements, discussed in detail
in the next section, stem primarily from processor architecture
design that is inherently an impediment to hardware
virtualization. Based on these requirements, two types of VMMs
have emerged, each with distinct characteristics in defining the
relationship between the VMM and an OS. This relationship
determines the privilege level of the VMM and an OS, and the
control and sharing of hardware resources.

VMM Requirements
A software program communicates with the computer hardware
through instructions. Instructions, in turn, operate on registers
and memory. If any of the instructions, registers, or memory
involved in an action is privileged, that instruction results in a
privileged action. Sometimes an action that is not itself privileged attempts to change the configuration of resources in the system; such an action can then affect other actions whose behavior or result depends on that configuration. The instructions that result in such operations are called sensitive instructions.
In the context of the virtualization discussion, a processor's
instructions can be classified into three groups:
• Privileged instructions are those that trap if the processor is in
non-privileged mode and do not trap if it is in privileged mode.
• Sensitive instructions are those that change or reference the
configuration of resources (memory), affect the processor
mode without going through the memory trap sequence (page
fault), or reference the sensitive registers whose contents
change when the processor switches to run another VM.
• Non-privileged and non-sensitive instructions are those that do
not fall into either the privileged or sensitive categories
described above.
Sensitive instructions have "a major bearing on the virtualizability of a machine" [1] because of their system-wide impact. In a virtualized environment, a GOS should only contain non-privileged and non-sensitive instructions.
If sensitive instructions are a subset of privileged instructions, it
is relatively easy to build a VM because all sensitive instructions
will result in a trap. In this case a VMM can be constructed to
catch all traps that result from execution of sensitive instructions
by a GOS. All privileged and sensitive actions from VMs would
be caught by the VMM, and resources could be allocated and
managed accordingly (a technique called trap-and-emulate). A
GOS's trap handler could then be called by the VMM trap
handler to perform the GOS-specific actions for the trap.
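The trap-and-emulate flow described above can be pictured as a small dispatch routine in the VMM. The sketch below is illustrative only and is not taken from any of the products discussed in this paper; the vcpu structure, trap encoding, and helper names are invented, and the hardware traps are simulated with ordinary function calls.

/* Illustrative trap-and-emulate dispatch in a VMM (hypothetical names). */
#include <stdio.h>

enum trap { TRAP_PRIV_INSN, TRAP_OTHER };

struct vcpu {
    int           id;
    unsigned long priv_reg;   /* virtual copy of a privileged register */
};

/* Emulate the trapping privileged instruction against the virtual CPU
 * state instead of letting it touch the real hardware.                */
static void emulate_priv_insn(struct vcpu *v, unsigned long operand)
{
    v->priv_reg = operand;
    printf("vcpu%d: emulated privileged write, priv_reg=0x%lx\n",
           v->id, v->priv_reg);
}

/* Traps the VMM does not handle itself are reflected to the GOS's own
 * trap handler, as described in the text.                             */
static void deliver_to_guest(struct vcpu *v, enum trap t)
{
    printf("vcpu%d: reflecting trap %d to the guest trap handler\n", v->id, t);
}

static void vmm_trap_handler(struct vcpu *v, enum trap t, unsigned long operand)
{
    if (t == TRAP_PRIV_INSN)
        emulate_priv_insn(v, operand);
    else
        deliver_to_guest(v, t);
    /* ...the VMM would then resume the guest past the trapping instruction. */
}

int main(void)
{
    struct vcpu v = { 0, 0 };
    vmm_trap_handler(&v, TRAP_PRIV_INSN, 0xbeef); /* sensitive instruction trapped */
    vmm_trap_handler(&v, TRAP_OTHER, 0);          /* e.g., a fault meant for the GOS */
    return 0;
}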
If a sensitive instruction is a non-privileged instruction, its execution by a VM will go unnoticed by the VMM. Robin and
Irvine [3] identified several x86 instructions in this category.
These instructions cannot be safely executed by a GOS as they
can impact the operations of other VMs or adversely affect the
operation of its own GOS. Instead, these instructions must be
substituted by the VMM service. The substitution can be in the
form of an API for the GOS to call, or a dynamic conversion of
these instructions to explicit processor traps.

Types of VMM
In a virtualized environment, the VMM controls the hardware
resources. VMMs can be categorized into two types, based on
this control of resources:

• Type I — maintains exclusive control of hardware resources
• Type II — leverages the host OS by running inside the OS kernel
The Type I VMM [3] has several distinct characteristics: it is the
first software to run (besides BIOS and the boot loader), it has
full and exclusive control of system hardware, and it runs in
privileged mode directly on the physical processor. The GOS on
a Type I VMM implementation runs in a less privileged mode
than the VMM to avoid conflicts managing the hardware
resources.
An example of a Type I VMM is Sun xVM Server. Sun xVM Server includes a bundled VMM, the Sun xVM Hypervisor for x86. The Sun xVM Hypervisor for x86 is the first software, besides the BIOS and boot loader, to run during boot, as shown in the GRUB menu.lst file:
title Sun xVM Server
   kernel$ /boot/$ISADIR/xen.gz
   module$ /platform/i86xpv/kernel/$ISADIR/unix /platform/i86xpv/kernel/$ISADIR/unix
   module$ /platform/i86pc/$ISADIR/boot_archive

The GRUB bootloader first loads the Sun xVM Hypervisor for
x86, xen.gz. After the VMM gains control of the hardware, it
loads the Solaris kernel,
/platform/i86xpv/kernel/$ISADIR/unix, to run as a GOS.
Sun's Logical Domains and VMware's Virtual Infrastructure 3 [4]
(formerly known as VMware ESX Server), described in detail in
Chapter 7 "Logical Domains" on page 79 and Chapter 8
"VMware" on page 97, are also Type I VMMs.
A Type II VMM typically runs inside a host OS kernel as an add-
on module, and the host OS maintains control of the hardware
resources. The GOS in a Type II VMM is a process of the host
OS. A Type II VMM leverages the kernel services of the host OS
to access hardware, and intercepts a GOS's privileged
operations and performs these operations in the context of the
host OS. Type II VMMs have the advantage of preserving the existing installation by allowing a new GOS to be added to a running OS.
An example of type II VMM is VMware's VMware Server
(formerly known as VMware GSX Server).
Figure 2 illustrates the relationships among hardware, VMM,
GOS, host OS, and user application in virtualized environments.

[Figure: in a Type I VMM configuration, the VMM runs in privileged mode directly on the platform hardware, with guest OSes and their applications above it in unprivileged mode; in a Type II VMM configuration, the VMM runs inside a host OS, which controls the platform hardware, and guest OSes run as processes of the host OS.]
Figure 2. Virtual machine monitors vary in how they support guest OS, host OS, and user applications in virtualized environments.

VMM Architecture
As discussed in "VMM Requirements" on page 9, the VMM
performs some of the functions that an OS normally does:
namely, it controls and arbitrates CPU and memory resources,
and provides services to upper layer software for sensitive and
privileged operations. These functions require the VMM to run in
privileged mode and the OS to relinquish the privileged and
sensitive operations to the VMM. In addition to processor and
memory operation, I/O device support also has a large impact on
VMM architecture.
VMM in Privileged Mode
A processor typically has two or more privileged modes. The
operating system kernel runs in the privileged mode. The user
applications run in a non-privileged mode and trap to the kernel
when they need to access system resources or services from
the kernel.
The GOS normally assumes it runs in the most privileged mode
of the processor. Running a VMM in a privileged mode can be
accomplished with one of the following three methods:
• Deprivileging the GOS — This method usually requires a
modification to the OS to run at a lower privilege level. For x86
systems, the OS normally runs at protected ring 0, the most
privileged level. In Sun xVM Server, ring 0 is reserved to run
the VMM. This requires the GOS to be modified, or
paravirtualized, to run outside of ring 0 at a lower privilege
level.
• Hyperprivileging the VMM — Instead of changing the GOS to
run at lower privilege, another approach taken by the chip
vendors is to create a hyperprivileged processor mode for the
VMM. The Sun UltraSPARC T1 and T2 processors' hyperprivileged mode [2], Intel-VT's VMX-root operation (see [7] Volume 3B, Chapter 19), and AMD-V's VMRUN-Exit state (see [9] Chapter 15) are examples of a hyperprivileged processor mode for VMM operations.
• Both VMM and GOS run in the same privileged mode — It is possible to have both the VMM and the GOS run in the same privileged mode. In this case, the VMM intercepts all privileged and sensitive operations of a GOS before passing them to the processor. For example, VMware allows both the GOS and the VMM to run in privileged mode. VMware dynamically examines each instruction to decide whether the processor state and the segment reversibility (see "Segmented Architecture" on page 23) allow the instruction to be executed directly without the involvement of the VMM. If the GOS is in privileged mode or the code segment is non-reversible, the VMM performs the necessary conversions of the core execution path.

Removing Sensitive Instructions in the GOS


Privileged and sensitive operations are normally executed by the
OS kernel. In a virtualized environment, the GOS has to
relinquish the privileged and sensitive operations to the VMM.
This is accomplished by one of the following approaches:
• Modifying the GOS source code to use the VMM services for
handling sensitive operations (paravirtualization)
This method is used by Sun xVM Server and Sun's Logical
Domains (LDoms). Sun xVM Server and LDoms provide a set
of hypercalls for an OS to request VMM services. The VMM-
aware Solaris OS uses these hypercalls to replace its sensitive
instructions.

• Dynamically translating the GOS sensitive instructions by software
As described in a previous section, VMware uses binary translation to replace the GOS sensitive instructions with VMM instructions.
• Dynamically translating the GOS sensitive instructions by hardware
This method requires the processor to provide a special mode of operation that is entered when a sensitive instruction is executed in reduced privileged mode.
The first approach, which involves modifying the GOS source code, is called paravirtualization, because the VMM provides only partial virtualization of the processor. The GOS must replace its sensitive and privileged operations with VMM services. The remaining two approaches provide full virtualization to the VM, enabling the GOS to run without modification.
In addition to OS modification, performance requirements,
processor architecture design, tolerance of a single point of
failure, and support for legacy OS installations have an impact
on the design of VMM architecture.
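To make the paravirtualization approach concrete, the sketch below shows how a guest kernel routine might replace a privileged page-table write with a call into the VMM. It is only a schematic illustration under stated assumptions: the hypercall numbers and names are invented, a printed message stands in for the real trap instruction, and none of it corresponds to the actual Sun xVM Server or LDoms hypercall interfaces.

#include <stdio.h>
#include <stdint.h>

/* Invented hypercall numbers, for illustration only. */
enum hcall { HCALL_SET_TRAP_TABLE = 1, HCALL_MMU_UPDATE = 2 };

/* Stand-in for the real hypercall trap: a native kernel would execute the
 * privileged instruction directly; a paravirtualized kernel traps into the
 * VMM and passes the request as arguments.                                */
static long hypercall(enum hcall op, uint64_t arg0, uint64_t arg1)
{
    printf("hypercall op=%d arg0=0x%llx arg1=0x%llx\n",
           op, (unsigned long long)arg0, (unsigned long long)arg1);
    return 0;   /* pretend the VMM performed the operation */
}

/* Paravirtualized replacement for a privileged page-table update: instead
 * of writing the page-table entry itself, the GOS asks the VMM to do it. */
static void guest_set_pte(uint64_t pte_machine_addr, uint64_t new_pte)
{
    hypercall(HCALL_MMU_UPDATE, pte_machine_addr, new_pte);
}

int main(void)
{
    guest_set_pte(0x1000, 0x2000 | 0x1 /* present bit */);
    return 0;
}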

Physical Memory Virtualization


Memory management by the VMM involves two tasks:
partitioning physical memory for VMs, and supporting page
translations in a VM.
Each OS assumes physical memory starts from page frame
number (PFN) 0 and is contiguous to the size configured for that
VM. An OS uses physical addresses in operations like page
table updates and Direct Memory Access (DMA). In reality, the
starting PFN of the memory exported to a VM may not start from
PFN 0 and may not be contiguous.
The virtualization of physical address is provided in the VMM by
creating another layer of addressing scheme, namely machine
address (MA). Within a GOS, a virtual address (VA) is used by
applications, and a physical address (PA) is used by the OS in
DMA and page tables. The VMM maps a PA from a VM to a MA,
which is used on hardware. The VMM maintains translation
tables, one for each VM, for mapping PAs to MAs.

Figure 3 depicts the scheme used to partition machine memory into physical memory for each VM.
[Figure: each VM (VM0, VM1, ...) sees its own physical memory beginning at PFN 0; the VMM maps each VM's physical page frames onto distinct regions of the underlying machine memory.]
Figure 3. Example physical-to-machine memory mapping.
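As a minimal sketch of this mapping, the following program models the per-VM translation table a VMM might keep. The table contents, 4 KB page size, and function names are assumptions made for illustration and do not reflect any particular implementation.

#include <stdint.h>
#include <stdio.h>

#define GUEST_PAGES 4   /* size of this VM's physical address space, in pages */

/* Per-VM table kept by the VMM: guest physical frame number (PFN, starting
 * at 0) maps to a machine frame number (MFN). The MFN values are made up. */
static const uint64_t pfn_to_mfn[GUEST_PAGES] = { 0x8a0, 0x8a1, 0x133, 0x400 };

/* Translate a guest physical address to a machine address (4 KB pages). */
static uint64_t pa_to_ma(uint64_t pa)
{
    uint64_t pfn = pa >> 12, offset = pa & 0xfff;
    return (pfn_to_mfn[pfn] << 12) | offset;   /* no bounds check: a sketch */
}

int main(void)
{
    uint64_t pa = 0x2010;   /* PFN 2, offset 0x10, in the guest's view */
    printf("guest PA 0x%llx maps to machine address 0x%llx\n",
           (unsigned long long)pa, (unsigned long long)pa_to_ma(pa));
    return 0;
}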
A ballooning technique [5] has been used in some virtualization
products to achieve better utilization of physical memory among
VMs. The idea behind the ballooning technique is simple. The
VMM controls a balloon module in a GOS. When the VMM
wants to reclaim memory, it inflates the balloon to increase
pressure on memory, forcing the GOS to page out memory to
disk. If the demand for physical memory decreases, the VMM
deflates the balloon in a VM, enabling the GOS to claim more
memory.
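The sketch below reduces the ballooning idea to a few lines. It is a simplified model, not the balloon module used by Sun xVM Server or VMware: the page counts are arbitrary, and the allocate and return steps that a real module performs against the GOS memory allocator and the VMM are collapsed into counter updates with comments.

#include <stdio.h>

static long balloon_pages = 0;     /* pages currently held by the balloon */

/* The VMM raises or lowers the target; the balloon module in the GOS follows. */
static void balloon_set_target(long target)
{
    while (balloon_pages < target) {
        /* Inflate: allocate a guest page and hand it back to the VMM.
         * The GOS sees less free memory and may start paging to disk.  */
        balloon_pages++;
    }
    while (balloon_pages > target) {
        /* Deflate: reclaim a page from the VMM and free it to the GOS. */
        balloon_pages--;
    }
    printf("balloon now holds %ld pages\n", balloon_pages);
}

int main(void)
{
    balloon_set_target(1024);   /* VMM wants memory back from this VM */
    balloon_set_target(256);    /* memory pressure eased, so deflate  */
    return 0;
}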

Page Translations Virtualization


Access to the processor's page translation hardware is a privileged operation, and this operation is performed by the privileged VMM. Exactly what the VMM needs to perform depends on the
processor architecture. For example, x86 hardware
automatically loads translations from the page table to the
Translation Lookaside Buffer (TLB). The software has no control
of loading page translations to the TLB. Therefore, the VMM is
responsible for updating the page table that is seen by the
hardware. The SPARC processor uses software through traps to
load page translations to the TLB. A GOS maintains its page
tables in its own memory, and the VMM gets page translations
from the VM and loads them to the TLB.

VMMs typically use one of the following two methods to support page translations:

• Hypervisor calls — The GOS makes a call to the VMM for page translation operations. This method is commonly used by paravirtualized OSes, as it provides better performance.
• Shadow page table — The VMM maintains an independent
copy of page tables, called shadow page tables, from the
guest page tables. When a page fault occurs, the VMM
propagates changes made by the GOS's page table to the
shadow page table. This method is commonly used by VMMs
that support full virtualization, as the GOS continues to update
its own page table and the synchronization of the guest
page table and the shadow page table is handled by the VMM when page faults occur, as sketched below.
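This minimal sketch models the synchronization step with two small arrays standing in for the guest and shadow page tables. The entry format, the PA-to-MA helper, and the fault entry point are invented for illustration and are far simpler than what any real VMM does.

#include <stdint.h>
#include <stdio.h>

#define NPTE 8   /* toy page-table size for illustration */

static uint64_t guest_pt[NPTE];    /* written freely by the GOS           */
static uint64_t shadow_pt[NPTE];   /* owned by the VMM, walked by the MMU */

/* Invented PA-to-MA step; a real VMM consults its per-VM translation table. */
static uint64_t pa_to_ma(uint64_t pa) { return pa + 0x100000; }

/* Page-fault path in the VMM: propagate the guest's entry into the shadow
 * table, rewriting the guest physical page address into a machine address. */
static void vmm_shadow_fault(unsigned idx)
{
    uint64_t gpte = guest_pt[idx];
    if (gpte & 0x1)                        /* present in the guest table?  */
        shadow_pt[idx] = pa_to_ma(gpte & ~0xfffULL) | (gpte & 0xfff);
    else
        shadow_pt[idx] = 0;                /* reflect the fault to the GOS */
}

int main(void)
{
    guest_pt[3] = 0x5000 | 0x3;            /* GOS maps PA 0x5000, present+RW */
    vmm_shadow_fault(3);                   /* hardware faulted on entry 3    */
    printf("shadow PTE[3] = 0x%llx\n", (unsigned long long)shadow_pt[3]);
    return 0;
}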
Figure 4 shows three different page translation implementations
in the Solaris OS on x86 and SPARC platforms.

1. The paravirtualized Sun xVM Server uses the following approach on x86 platforms:
[1] The GOS uses the hypervisor call method to update the
page tables maintained by the VMM.
2. The Sun xVM Server with HVM and VMware use the
following approach:
[2a] The GOS maintains its own guest page table. The
synchronization between the guest page table and the
hardware page table (shadow page table) is handled by
the VMM when page faults occur.
[2b] The x86 CPU loads the page translation from the
hardware page table to the TLB.
3. On SPARC systems, the Solaris OS uses the following
approach for Logical Domains:
[3a] The GOS maintains its own page table. The GOS takes
an entry from the page table as an argument to the
hypervisor call that loads the translations to the TLB.
[3b] The VMM gets the page translation from the GOS and
loads the translation to the TLB.

Figure 4. Page translation schemes used on x86 and SPARC architectures.
[Figure: on x86, the GOS either updates the VMM-maintained page table through hypervisor calls (1), or maintains its own guest page table that the VMM synchronizes with the hardware (shadow) page table (2a), from which the processor loads translations into the TLB (2b); on SPARC, the GOS keeps its own page table and passes translations to the VMM through hypervisor calls (3a), and the VMM loads them into the TLB (3b).]
The memory management implementation for Sun xVM Server, Sun xVM Server with HVM, VMware, and Logical Domains using these mechanisms is discussed in detail in later sections of this paper.

I/O Virtualization
I/O devices are typically managed by a special software module, the device driver, which runs in the kernel context. Because of the vast variety of device types and device drivers, the VMM either includes few device drivers or leaves device management entirely to the GOS. In the latter case, because of existing device architecture limitations (discussed later in this section), devices can only be exclusively managed by one VM.
This constraint creates some challenges for I/O access by a VM,
and limits the following:
• What devices are exported to a VM
• How devices are exported to a VM
• How each I/O transaction is handled by a VM and the
VMM
Consequently, I/O has the most challenges in the areas of
compatibility and performance for virtual machines. In order to
explain what devices are exported and how they are exported, it
is first necessary to understand the options available to handle
I/O transactions in a VM.

There are, in general, three approaches for I/O virtualization, as illustrated in Figure 5:
• Direct I/O (VM1 and VM3)
• Virtual I/O using I/O transaction emulation (VM2)
• Virtual I/O using device emulation (VM4)

[Figure: four VMs illustrate the approaches. VM1 is an I/O VM with direct I/O access through native drivers and also services virtual I/O for other VMs. VM2 performs virtual I/O through I/O transaction emulation, with a virtual driver forwarding requests through the I/O VM. VM3 has direct I/O access of its own. VM4 performs virtual I/O through device emulation in the VMM, using native drivers against emulated devices. The physical devices, such as a network chip and a SCSI controller, are attached to the underlying Sun x64 server.]
Figure 5. Different I/O virtualization techniques used by virtual machine monitors.
For direct I/O, the VMM exports all or a portion of the physical devices attached to the system to a VM, and relies on VMs to manage devices. The VM that has direct I/O access uses the existing driver in the GOS to communicate directly with the device. VM1 and VM3 in Figure 5 have direct I/O access to
devices. VM1 is also a special I/O VM that provides virtual I/O
for other VMs, such as VM2, to access devices.
Virtual I/O is made possible by controlling the device types
exported to a VM. There are two different methods of
implementing virtual I/O: I/O transaction emulation (shown in
VM2 in Figure 5) and device emulation (shown in VM4).
• I/O transaction emulation requires virtual drivers on both ends
for each type of I/O transaction (data and control functions). As
shown in Figure 5, the virtual driver on the client side (VM2)
receives I/O requests from applications and forwards requests
through the VMM to the virtual driver on the server side (VM1);
the virtual driver on the server side then sends out the request
to the device.
I/O transaction emulation is typically used in paravirtualization
because the OS on the client side needs to include the special
drivers to communicate with its corresponding driver in the OS
on the server side, and needs to add kernel interfaces for
inter-domain communication using the VMM services.
However, it is possible to have PV drivers in an un-
paravirtualized OS (full virtualization) for better I/O
performance. For example, Solaris 10, which is not
paravirtualized, can include PV drivers on an HVM-capable
system to get better performance than that achieved using
device emulation drivers such as QEMU. (See "Sun xVM
Server with HVM I/O Virtualization (QEMU)" on page 71.)
I/O transaction emulation may cause application compatibility
issues if the virtual driver does not provide all data and control
functions (for example, ioctl (2)) that the existing driver does.
• Device emulation provides an emulation of a device type,
enabling the existing driver for the emulated device in a GOS
to be used. The VMM exports emulated device nodes to a VM
so that the existing drivers for the emulated devices in a GOS
are used. By doing this, the VMM controls the driver used by a
GOS for a particular device type; for example, using the
e1000g driver for all network devices. Thus, the VMM can
focus on the emulation of underlying hardware using one
driver interface. Driver accesses to the I/O register and port in
a GOS, which will result in a trap due to invalid address, are
caught and converted to access the real device hardware.
VM4 in Figure 5 uses native OS drivers to access emulated
devices exported by the VMM.
Device emulation is in general less efficient than I/O transaction emulation and more limited in the platforms supported. Device emulation does not require changes in the GOS and, therefore, is typically used to provide full virtualization to a VM.

Virtual I/O, unlike direct I/O, requires additional drivers in either the I/O VM or the VMM to provide I/O virtualization. This constraint:
• Limits the type of devices that are made available to a
VM
• Limits device functionality
• Causes significant I/O performance overhead
While virtualization provides full application binary compatibility,
I/O becomes a trouble area in terms of application compatibility
and performance in a VM. One solution to the I/O virtualization
issues is to allow VMs to directly access I/O, as shown by VM3
in Figure 5.
Direct I/O access by VMs requires additional hardware support
to ensure device accesses by a VM are isolated and restricted to
resources owned by the assigned VM. In order to understand
the industry effort to allow an I/O device to be shared among
VMs, it is necessary to examine device operations from an OS
point of view.

The interactions between an OS and a device consist, in general, of three operations (a sketch of the first two, from a driver's point of view, follows the list):

1. Programmed I/O (PIO) — host-initiated data transfer. In PIO, a host OS maps a virtual address to a piece of device memory and accesses the device memory using CPU load/store instructions.
2. Direct Memory Access (DMA) — device-initiated data transfer without CPU involvement. In DMA, a host OS writes an address of its memory and the transfer size to a device's DMA descriptor. After receiving an enable DMA instruction from the host driver, the device performs the data transfer at a time it chooses and uses interrupts to notify the host OS of DMA completion.
3. Interrupt — a device-generated asynchronous event notification.
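The sketch below illustrates the first two operations from a device driver's point of view. The register layout, addresses, and sizes are invented for illustration; a real driver obtains the register block by mapping a device BAR, uses a bus address prepared for DMA, and relies on the device raising an interrupt at completion.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical register layout of a DMA-capable device. */
struct dma_regs {
    volatile uint64_t dma_addr;   /* bus address of the host DMA buffer   */
    volatile uint32_t dma_len;    /* transfer size in bytes               */
    volatile uint32_t control;    /* bit 0: start DMA                     */
    volatile uint32_t status;     /* bit 0: DMA complete (raises an intr) */
};

int main(void)
{
    /* A plain allocation stands in for the mapped device memory. */
    struct dma_regs *regs = calloc(1, sizeof(*regs));
    uint64_t buf_bus_addr = 0x12340000;      /* assumed DMA-able address */

    if (regs == NULL)
        return 1;

    /* PIO: CPU load/store instructions to device registers. */
    regs->dma_addr = buf_bus_addr;
    regs->dma_len  = 4096;
    regs->control |= 1;     /* kick off the device-initiated transfer */

    /* DMA and interrupt: the device moves the data on its own and then
     * raises an interrupt; the handler would read 'status' to confirm. */
    printf("programmed DMA of %u bytes at 0x%llx\n",
           (unsigned)regs->dma_len, (unsigned long long)regs->dma_addr);
    free(regs);
    return 0;
}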
Interrupts are already virtualized by all VMM implementations as
is shown in the later discussions for Sun xVM Server, Logical
Domains, and VMware. The challenge of I/O sharing among
VMs therefore lies in the device handling for PIO and DMA. To
meet the challenges, PCI SIG has released a suite of IOV
specifications for PCI Express (PCIe) devices, in particular the
"Single Root I/O Virtualization and Sharing Specification"
(SRIOV) specification [35] for device sharing and PIO operation,
and the "Address Translation Services (ATS)" specification [30]
for DMA operation.

Device Configuration and PIO


A PCI device exports its memory to the host through Base
Address Registers (BARs) in its configuration space. A device's
configuration space is identified in the PCI configuration address
space as shown in Figure 6.

[Figure: bit layout of the PCI configuration address: Reserved (bits 31:24), Bus Number (23:16), Device Number (15:11), Function Number (10:8), Register Number (7:2), and the fixed low bits 00 (1:0).]

Figure 6. PCI configuration address space.


A PCI device can have up to 8 physical functions (PF). Each PF
has its own 256 byte configuration header. The BARs of a PCI
function, which are 32-bit wide, are located at offset 0x10-0x24
in the configuration header. The host gets the size of the
memory region mapped by a BAR by writing a value of all 1's to
the BAR and then reading the value back. The address written to
a BAR is the assigned starting address of the memory region
mapped to the BAR.
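The following minimal sketch simulates the size-probing sequence just described for a single 32-bit memory BAR. The 64 KB aperture and the configuration-access helpers are stand-ins; real code would use the platform's PCI configuration access mechanism instead of a variable.

#include <stdint.h>
#include <stdio.h>

/* Simulated memory BAR of a device that decodes 64 KB: hardware keeps the
 * low size bits wired to zero, so writing all 1s and reading back exposes
 * the aperture size.                                                      */
#define BAR_APERTURE 0x10000u              /* 64 KB, assumed for this sketch */
static uint32_t bar_reg;

static void     cfg_write32(uint32_t val) { bar_reg = val & ~(BAR_APERTURE - 1); }
static uint32_t cfg_read32(void)          { return bar_reg; }

int main(void)
{
    cfg_write32(0xFFFFFFFFu);                   /* write a value of all 1s    */
    uint32_t v    = cfg_read32() & ~0xFu;       /* mask off the BAR type bits */
    uint32_t size = ~v + 1;                     /* two's complement gives it  */
    printf("BAR aperture size: 0x%x bytes\n", size);

    cfg_write32(0xF2000000u);                   /* assign a starting address  */
    printf("BAR base address: 0x%x\n", cfg_read32());
    return 0;
}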

To allow multiple VMs to share a PF, the SRIOV specification introduces the notion of a Virtual Function (VF). Each VF shares
some common configuration header fields with the PF and other
VFs. The VF BARs are defined in the PCIe's SRIOV extended
capabilities structure. A VF contains a set of non-shared
physical resources, such as work queue and data buffer, which
are required to deliver function specific services. These
resources are exported through the VF BARs and are directly
accessible by a VM.
The starting address of a VF's memory space is derived from the
first VF's memory space address and the size of VF's BAR. For
any given VFx, the starting address of its memory space
mapped to BARa is calculated according to the following
formula:

addr(VFx, BARa) = addr(VF1, BARa) + (x − 1) × (VF BARa aperture size)

where addr(VF1, BARa) is the starting address of BARa for the first VF and (VF BARa aperture size) is the size of the VF BARa, as determined by writing a value of all 1s to BARa and reading the value back. Using this mechanism, a GOS in a VM is able to
share the device with other VMs while performing device
operations that pertain only to the VM.
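A few lines of arithmetic make the formula concrete. The base address and per-VF aperture size below are invented example values, not taken from any real device.

#include <stdint.h>
#include <stdio.h>

/* Base of BARa for virtual function x (1-based), per the formula above. */
static uint64_t vf_bar_addr(uint64_t vf1_bar_base, uint64_t vf_bar_aperture,
                            unsigned x)
{
    return vf1_bar_base + (uint64_t)(x - 1) * vf_bar_aperture;
}

int main(void)
{
    uint64_t vf1_base = 0xF8000000;   /* addr(VF1, BARa), assumed value   */
    uint64_t aperture = 0x4000;       /* per-VF BAR size (16 KB), assumed */

    for (unsigned x = 1; x <= 4; x++)
        printf("VF%u BARa base = 0x%llx\n", x,
               (unsigned long long)vf_bar_addr(vf1_base, aperture, x));
    return 0;
}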

DMA
In many current implementations (especially in most x86
platforms), physical addresses are used in DMA. Since a VM
shares the same physical address space on the system with
other VMs, a VM might read/write to another VM's memory
through DMA. For example, a device driver in a VM might write
the memory contents that belong to other VMs to a disk and
read the data back into the VM's memory. This causes a
potential breach in security and fault isolation among VMs.
To provide isolation during DMA operation, the ATS specification
defines a scheme for a VM to use the address mapped to its
own physical memory for DMA operation. (This approach is used in similar designs such as the IOMMU Specification [31] and DMA Remapping [28].) DMA ATS enables DMA memory to be partitioned into multiple domains, and keeps DMA transactions in one domain isolated from other domains.
Figure 7 shows device DMA with and without ATS. With DMA
ATS, the DMA address is like a virtual address that is associated
with a context (VM). DMA transactions initiated by a VM can
only be associated with the memory owned by the VM. DMA
ATS is a chipset function that resides outside of the processor.

[Figure: two configurations are compared. Without ATS, a PCI device behind the south bridge performs DMA into system memory using physical addresses (PA) directly. With ATS, the device issues DVA/GPA addresses, and an IOMMU in the north bridge translates them to host physical addresses (HPA) before the data reaches each VM's DMA buffer in system memory. PA = Physical Address, HPA = Host Physical Address, DVA = Device Virtual Address, GPA = Guest Physical Address.]
Figure 7. DMA with and without address translation service (ATS).
As shown in Figure 7, the physical address (PA) is used on the
hardware platform without hardware support for ATS. For
platforms with hardware support for ATS, a GOS in a VM writes
either a device virtual address (DVA) or a guest physical
address (GPA) to the device's DMA engine. The device driver in
the GOS loads the mappings of either the DVA or GPA to the
host physical address (HPA) in the hardware IOMMU. The HPA
is the address understood by the memory controller.

Note - The distinction between the HPA and GPA is described in detail in later sections for Sun xVM Server (see "Physical Memory Management" on page 52), for UltraSPARC LDoms (see "Physical Memory Allocation" on page 88), and for VMware (see "Physical Memory Management" on page 103).

When the device performs a DMA operation, a DVA/GPA address appears on the PCI bus and is intercepted by the
hardware IOMMU. The hardware IOMMU looks up the mapping
for the DVA/GPA, finds the corresponding HPA, and moves the
PCI data to system memory pointed to by the HPA. Since either
DVA or GPA of a VM has its own address space, ATS allows
system memory for DMA to be partitioned and, thus, prevents a
VM from accessing another VM's DMA buffer.
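The lookup performed by the IOMMU can be modeled with a small per-VM table, as in the sketch below. The table contents, page size, and function names are assumptions for illustration; programming a real IOMMU goes through chipset-specific interfaces.

#include <stdint.h>
#include <stdio.h>

#define IOMMU_ENTRIES 4   /* toy per-VM IOMMU translation table */

/* DVA/GPA page number -> host physical page address for one VM; a zero
 * entry means the page is not mapped for this VM. Values are invented. */
static const uint64_t iommu_map[IOMMU_ENTRIES] = { 0x8a000, 0x8b000, 0, 0x13000 };

/* What the IOMMU does when a DVA/GPA appears on the bus during DMA. */
static int iommu_translate(uint64_t dva, uint64_t *hpa)
{
    uint64_t idx = dva >> 12;
    if (idx >= IOMMU_ENTRIES || iommu_map[idx] == 0)
        return -1;                      /* not owned by this VM: block it */
    *hpa = iommu_map[idx] | (dva & 0xfff);
    return 0;
}

int main(void)
{
    uint64_t hpa;
    if (iommu_translate(0x1040, &hpa) == 0)    /* DVA page 1, offset 0x40 */
        printf("DVA 0x1040 translates to HPA 0x%llx\n", (unsigned long long)hpa);
    if (iommu_translate(0x2000, &hpa) != 0)    /* page 2 is unmapped      */
        printf("DVA 0x2000 blocked: not mapped for this VM\n");
    return 0;
}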
Chapter 3
The x86 Processor Architecture

This chapter provides background information on the x86 processor architecture that is relevant to later discussions on
Sun xVM Server (Chapter 5 on page 39), Sun xVM Server with
HVM (Chapter 6 on page 63), and VMware (Chapter 8 on page
97).

The x86 processor was not designed to run in a virtualized environment, and the x86 architecture presents some challenges for CPU and memory virtualization. This chapter discusses the following x86 architecture features that are pertinent to virtualization:
• Protected Mode
The protected mode in the x86 processor utilizes two
mechanisms, segmentation and paging, to prevent a program
from accessing a segment or a page with a higher privilege
level. Privilege level controls how the VMM and a GOS work
together to provide CPU virtualization.
• Segmented Architecture
The x86 segmented architecture converts a program's virtual
addresses into linear addresses that are used by the paging
mechanism to map into physical memory. During the
conversion, the processor's privilege level is checked against
the privilege level of the segment for the address. Because of
the segment cache technique employed by the x86 processor,
the VMM must ensure segment cache consistency with the
VM descriptor table updates. This x86 feature results in a
significant amount of work for the VMM of full virtualization
products such as VMware.
• Paging Architecture
The x86 paging architecture provides
page translations to the TLB and page tables. Because the
loading of page translations from page table to TLB is done
automatically by hardware on the x86 platform, page table
updates have to be performed by the privileged VMM. Several
mechanisms are available for updating this "hardware" page
table by a VM.
• I/O and Interrupts
A device interacts with a host processor through PIO, DMA,
and interrupts. PIO in the x86 processor can be performed
through either I/O ports using special I/O instructions or
through memory-mapped addresses with general purpose
MOVE and String instructions. DMA in most x86 platforms is
performed with physical addresses. This can cause a security
and isolation breach in a virtualized environment because a
VM may read/write other VMs memory contents. Interrupts
and exceptions are handled through the Interrupt Descriptor
Table (IDT). There is only one IDT on the system and access
to the IDT is privileged. Therefore, interrupts have to be
handled by the VMM and virtualized to be delivered to a VM.
• Timer Devices
The x86 platform includes several timer devices for time
keeping purposes. Knowledge of the characteristics of these
devices is important to fully understand time keeping in a VM:
Some timer devices are interrupt driven (which is virtualized
and delayed) and some require privileged access to update
the device counter.

Protected Mode
The x86 architecture protected mode provides a protection
mechanism to limit access to certain segments or pages and
prevent unprivileged access. The processor's segment-
protection mechanism recognizes 4 privilege levels, numbered
from 0 to 3 (Figure 8). The greater the level number, the lesser
the privileges provided.

The page-level protection mechanism restricts access to pages based on two privilege levels: supervisor mode and user mode.
If the processor is operating at a current privilege level (CPL) 0,
1, or 2, it is in a supervisor mode and the processor can access
all pages. If the processor is operating at a CPL 3, it is in a user
mode and the processor can access only user level pages.

[Figure: concentric protection rings, from Level 0 (OS kernel) through Level 1 and Level 2 to Level 3 (applications).]
Figure 8. Privilege levels in the x86 architecture.


When the processor detects a privilege level violation, it
generates a general-protection exception (#GP). The x86 has
more than 20 privileged instructions. These instructions can be
executed only when the current privilege level (CPL) is 0 (most
privileged).
In addition to the CPL, the x86 has an I/O privilege level (IOPL)
field in the EFLAGS register that indicates the I/O privilege level of
the currently running program. Some instructions, while allowed
to execute when the CPL is not 0, might generate a #GP
exception if the CPL value is higher than IOPL. These
instructions include CLI (clear interrupt), STI (set interrupt flag), IN/INS
(input from port), and OUT/OUTS (output to port).
In addition to the above instructions, there are many instructions
[3] that, while not privileged, reference registers or memory
locations that would allow a VM to access a memory region not
assigned to that VM. These sensitive instructions will not cause
a #GP exception. The trap-and-emulate method for virtualization
of a GOS, as stated in "VMM Requirements" on page 9, does
not apply to these instructions. However, these instructions may
impact other VMs.
Segmented Architecture
In protected mode, all memory accesses must go through a logical address → linear address (LA) → physical address (PA) translation scheme. The logical address to LA translation is
managed by the x86 segmentation architecture which divides a
process's address space into multiple protected segments.
A logical address, which is used as the address of an operand or
of an instruction, consists of a 16-bit segment selector and a 32-
bit offset. A segment selector points to a segment descriptor that
defines the segment (see Figure 11 on page 24). The segment
base address is contained in the segment descriptor. The sum
of the offset in a logical address and the segment base address
gives the LA. The Solaris OS directly maps an LA to a process's
Virtual Address (VA) by setting the segment base address to
NULL.

Segmentation: VA + Segment Base Address (always 0 in Solaris) → Linear Address
Paging: Linear Address → Physical Address

For each memory reference, a VA and a segment selector are provided to the processor (Figure 9). The segment selector, which is loaded to the segment register, is used to identify a segment descriptor for the address.

[Figure: segment selector layout: Index (bits 15-3, selecting up to 8K descriptors), TI (bit 2, Table Indicator; 0=GDT, 1=LDT), and RPL (bits 1-0, Requested Privilege Level).]
Figure 9. Segment selector.
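Decoding these fields is simple bit manipulation, as the sketch below shows; the selector value used is just an arbitrary example.

#include <stdint.h>
#include <stdio.h>

/* Decode the fields of an x86 segment selector (see Figure 9). */
int main(void)
{
    uint16_t sel   = 0x002B;            /* arbitrary example selector        */
    unsigned rpl   = sel & 0x3;         /* bits 1-0: requested priv. level   */
    unsigned ti    = (sel >> 2) & 0x1;  /* bit 2: 0 = GDT, 1 = LDT           */
    unsigned index = sel >> 3;          /* bits 15-3: descriptor table index */

    printf("selector 0x%04x: index=%u table=%s RPL=%u\n",
           sel, index, ti ? "LDT" : "GDT", rpl);
    return 0;
}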

Every segment register has a visible part and a hidden part, as illustrated in Figure 10 (see also [7], Volume 3A Section
3.4.3). The visible part is the segment selector, an index that
points into either the global descriptor table (GDT) or the local
descriptor table (LDT) to identify from which descriptor the
hidden part of the segment register is to be loaded. The hidden
part includes portions containing segment descriptor information
loaded from the descriptor table.

[Figure: a segment register consists of a visible part, the segment selector, and a hidden part holding the cached descriptor information: type, base address, limit, and CPL.]
Figure 10. Each segment register has a visible and a hidden part.
The hidden fields of a segment register are loaded to the
processor from a descriptor table and are stored in the
descriptor cache registers. The descriptor cache registers, like
the TLB, allow the hardware processor to refer to the contents of
the segment register's hidden part without further reference to
the descriptor table. Each time a segment register is loaded, the
descriptor cache register gets fully loaded from the descriptor
table. Since each VM has its own descriptor table (for example,
the GDT), the VMM has to maintain a shadow copy of each
VM's descriptor table. A context switch to a VM will cause the
VM's shadow descriptor table to be loaded to the hardware
descriptor table. If the content of the descriptor table is changed
by the VMM because of a context switch to another VM, the
segment is non-reversible, which means the segment cannot be
restored if an event such as a trap causes the segment to be
saved and replaced.
The Current Privilege Level (CPL) is stored in the hidden portion
of the segment register. The CPL is initially equal to the privilege
level of the code segment from which it is being loaded. The
processor changes the CPL when program control is transferred
to a code segment with a different privilege level.
The segment descriptor contains the size, location, access
control, and status information of the segment that is stored in
either the LDT or GDT. The OS sets segment descriptors in the
descriptor table and controls which descriptor entry to use for a
segment (Figure 11). See "CPU Privilege Mode" on page 45 for
a discussion of setting the segment descriptor in the Solaris OS.

[Figure: segment descriptor layout. Upper word: Base 31:24, G, D/B, L, AVL, Segment Limit 19:16, P, DPL, S, Type, Base 23:16. Lower word: Base 15:00, Segment Limit 15:00.]
L: 64-bit code segment
AVL: Available for use by system software
Base: Segment base address
D/B: Default operation size (0=64-bit segment, 1=32-bit segment)
DPL: Descriptor Privilege Level
G: Granularity
SL: Segment Limit 19:16
P: Segment present
S: Descriptor type (0=system, 1=code or data)
Type: Segment type
Figure 11. Segment descriptor.
The privilege check performed by the processor recognizes
three types of privilege levels: requested privilege level (RPL),
current privilege level (CPL), and descriptor privilege level
(DPL). A segment can be loaded if the DPL of the segment is
numerically greater than or equal to both the CPL and the RPL.
In other words, a segment can be
accessed only by code that has equal or higher privilege level.
Otherwise, a general-protection fault exception, #GP, is generated
and the segment register is not loaded.
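The check just described reduces to two comparisons, as sketched below for the data-segment case covered in the text (conforming code segments follow different rules); the CPL/RPL/DPL combinations in main() are arbitrary examples.

#include <stdio.h>

/* Segment-load privilege check from the text: the load succeeds only if the
 * DPL is numerically >= both the CPL and the RPL (larger = less privileged). */
static int segment_load_allowed(unsigned cpl, unsigned rpl, unsigned dpl)
{
    return dpl >= cpl && dpl >= rpl;    /* otherwise the processor raises #GP */
}

int main(void)
{
    /* Kernel code (CPL 0, RPL 0) loading a DPL 0 segment: allowed. */
    printf("CPL0/RPL0 -> DPL0 segment: %s\n",
           segment_load_allowed(0, 0, 0) ? "loaded" : "#GP");
    /* User code (CPL 3, RPL 3) loading a DPL 0 segment: protection fault. */
    printf("CPL3/RPL3 -> DPL0 segment: %s\n",
           segment_load_allowed(3, 3, 0) ? "loaded" : "#GP");
    return 0;
}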
On 64-bit systems, linear address space (flat memory model) is
used to create a continuous, unsegmented address space for
both kernel and application programs. Segmentation is disabled
in the sense that privilege checking can not apply to VA to LA
translations as it doesn't exist. The only protection left to prevent
a user application from accessing kernel memory is through the
page protection mechanism. This is why the kernel of a GOS
has to run in ring 3 (user mode in page level protection) on a 64-
bit system.

Paging Architecture
When operating in the protected mode, the LA → PA translation is performed by the paging hardware of the x86 processor. To access data in memory, the processor requires the presence of a VA → PA translation in the TLB (in Solaris, LA is equal to VA), the page table backing up the TLB entry, and a page of physical memory. For the x86 processor, loading the VA → PA page translation from the page table to the TLB is performed automatically by the processor. The OS is responsible for allocating physical memory and loading the VA → PA translation to the page table.
When the processor cannot load a translation from the page
table, it generates a page fault exception, #PF. A #PF exception on
x86 processors usually means a physical page has not been
allocated, because the loading of the translation from the page
table to the TLB is handled by the processor (Figure 12).

[Figure: TLB Entry, Page Table, and Physical Memory. Loading translations from the page table into the TLB is performed by the processor; filling the page table and allocating physical memory are performed by the OS.]
Figure 12. Translations through the TLB are accomplished in the processor itself, while translations through page tables are performed by the OS.
The x86 processor uses a control register, %cr3, to manage the
loading of address translations from the page table to the TLB.
The base address of a process's page table is kept by the OS
and loaded to %cr3 when the process is context-switched in to run. On
the Solaris OS, %cr3 is kept in the kernel hat structure. Each
address space, as, has one hat structure. The mdb(1) command
can be used to find the value of the %cr3 register of a process:
% mdb -k
> ::ps
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R   9352   9351   9352   9352  28155 0x4a014000 fffffffec2ae78c0 bash
> fffffffec2ae78c0::print -t 'struct proc' ! grep p_as
struct as *p_as = 0xfffffffed15ba7e0
> 0xfffffffed15ba7e0::print -t 'struct as' ! grep a_hat
struct hat *a_hat = 0xfffffffed1718e98
> 0xfffffffed1718e98::print -t 'struct hat' ! grep hat_htable
htable_t *hat_htable = 0xfffffffed0f67678
> 0xfffffffed0f67678::print -t 'struct htable' ! grep ht_pfn
pfn_t ht_pfn = 0x16d37     // %cr3

When multiple VMs are running, the automatic loading of page translations from the page table to the TLB actually makes virtualization more difficult because all page tables have to be accessible by the processor. As a result, page table updates can only be performed by the VMM to enforce consistent memory usage on the system. "Page Translations Virtualization" on page 14 discusses two mechanisms for managing page tables by the VMM.
Another issue of the x86 paging architecture is related to the
flushing of TLB entries. Unlike many RISC processors which
support a tagged TLB, the x86 TLB is not tagged. A TLB miss
results in a walk of the page table by the processor to find and
load the translation to the TLB. Since the TLB is not tagged, a
change in the %cr3 register due to a virtual memory context
switch will result in invalidating all TLB entries. This adversely
affects performance if the VMM and VM are not in the same
address space.
A typical solution to address the performance impact of TLB
flushing is to reserve a region of the VM address space for the
VMM. With this solution, the VMM and VM can run from the
same address space and thus avoid a TLB flush when a VM
memory operation traps to the VMM. The latest CPUs from Intel
and AMD with hardware virtualization support include tagged
TLBs, and consequently the translation of different address
spaces can co-exist in the TLB.

I/O and Interrupts


In general, x86 support for exceptions and I/O interrupts does
not impose any particular challenge to the implementation of a
VMM. The x86 processor uses the interrupt descriptor table
(IDT) to provide a handler for a particular interrupt or exception.
Access to the IDT functions is privileged and, therefore, can only
be performed by the VMM. The Sun xVM Hypervisor for x86
provides a mechanism to relay hardware interrupts to a VM
through its event channel hypervisor calls (see "Event Channels"
on page 43).
The x86 processor allows device memory and registers to be
accessed through either an I/O address space or memory-
mapped I/O. An I/O address space access is performed using
special I/O instructions such as IN and OUT. These instructions,
while allowed to execute when the CPL is not 0, will result in a
#GP exception if the processor's CPL value is higher than the I/O
privilege level (IOPL). The Sun xVM Hypervisor for x86 provides
a hypervisor call to set the IOPL, enabling a GOS to directly
access I/O ports by setting the IOPL to its privilege level.
When using memory-mapped I/O, any of the processor's
instructions that reference memory can be used to access an I/O
location with protection provided through segmentation and
paging. PIO, whether through the I/O address space or memory-
mapped I/O, is normally uncacheable because device registers
usually must be accessed in precise program order. PIO uses
addresses in a VM's address space and therefore doesn't cause
any security or isolation issues.
The x86 processor uses physical addresses for DMA. DMA in a
virtualized x86 system has certain issues:
• A 32-bit, non-dual-address-cycle (DAC) PCI device cannot
address beyond 4 GB of memory.
• It is possible for one domain's DMA to intrude into another
domain's physical memory, thus causing the risk of security
violation.
The solution to the above issues is to have an I/O memory
management unit (IOMMU) as a part of an I/O bridge or north
bridge that performs a translation of I/O addresses (for example,
an address that appears on the PCI bus) to machine memory
addresses. The I/O address can be any address that is
recognized by the IOMMU. An IOMMU can also improve the
performance of large chunk data transfers by mapping a
contiguous I/O address to multiple physical pages in one DMA
transaction. However, the IOMMU may hurt the I/O performance
for small data transfers because the DMA setup cost is higher
than that of DMA without an IOMMU.
For more details on the IOMMU, also known as hardware
address translation service (hardware ATS), see "I/O
Virtualization" on page 16.

Timer Devices
An OS typically uses several timer devices for different
purposes. Timer devices are characterized by their frequency
granularity, frequency reliability, and ability to generate interrupts
and receive counter input. Understanding the characteristics of
timer devices is important for the discussion of timekeeping in a
virtualized environment, as the VMM provides virtualized
timekeeping of some timers to its overlaying VMs. Virtualized
timekeeping has a significant impact on the accuracy of
time-related functions in the GOS and, thus, on the performance
and results of time-sensitive applications.

An x86 system typically includes the following timer devices:

• Programmable Interrupt Timer (PIT)
  PITs use a 1.193182 MHz crystal oscillator and have a 16-bit
  counter and counter input register. The PIT contains three
  timers. Timer 0 can generate interrupts and is used by the
  Solaris OS as the system timer. Timer 1 was historically used
  for RAM refresh and timer 2 for the PC speaker.
• Time Stamp Counter (TSC)
  The TSC is a feature of the x86 architecture that is accessed
  via the RDTSC instruction. The TSC, a 64-bit counter, counts
  processor clock cycles, so its rate changes with the processor
  speed. The TSC cannot generate interrupts and has no counter
  input register. The TSC is the finest-grained of all timers and is
  used in the Solaris OS as the high-resolution timer. For example,
  the gethrtime(3C) function uses the TSC to return the current
  high-resolution real time (see the short example after this list).
• Real Time Clock (RTC)
  The RTC is used as the time-of-day (TOD) clock in the Solaris
  OS. The RTC uses a battery as an alternate power source,
  enabling it to continue to keep time while the primary source of
  power is unavailable. The RTC can generate interrupts and has
  a counter input register. It is the lowest-grained timer on the
  system.
• Local Advanced Programmable Interrupt Controller (APIC)
Timer
The local APIC timer, which is a part of the local APIC, has a
32-bit counter and counter input register. It can generate
interrupts and has the same frequency as the front side bus.
The Solaris OS supports the use of the local APIC timer as
one of the cyclic timers.
• High Precision Event Timer (HPET)
The HPET is a relatively new timer available in some new x86
systems. The HPET is intended to replace the PIT and the
RTC for generating periodic interrupts. The HPET can generate
interrupts, is 64 bits wide, and has a counter input register. The
Solaris OS currently does not use the HPET.
• Advanced Configuration and Power Interface (ACPI) Timer
The ACPI timer has a 24-bit counter, can generate interrupts,
and has no input counter register. The Solaris OS does not
use the ACPI timer.
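Because the TSC-backed gethrtime(3C) interface mentioned above is
available to unprivileged code, a short, self-contained usage sketch
looks as follows (the work being timed is a placeholder):

#include <sys/time.h>
#include <stdio.h>

int
main(void)
{
        hrtime_t start, end;

        start = gethrtime();            /* nanoseconds, monotonic origin */
        /* ... code being measured ... */
        end = gethrtime();

        (void) printf("elapsed: %lld ns\n", (long long)(end - start));
        return (0);
}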
Chapter 4
SPARC Processor Architecture

This chapter provides background information on the SPARC


processor architecture that is relevant to later discussions on
Logical Domains (Chapter 7 on page 79).
The SPARC (Scalable Processor Architecture) processor, first
introduced in 1987, is a big-endian RISC processor ISA. SPARC
International (SI), an industry organization, was established in
1989 to promote the open SPARC architecture. In 1994, SI
introduced a 64-bit version of the SPARC processor as SPARC
v9. The UltraSPARC processor, which is a Sun-specific
implementation of SPARC v9, was introduced in 1996 and has
been incorporated into all Sun SPARC platforms shipping today.
In 2005, Sun's UltraSPARC architecture was open sourced as
the UltraSPARC Architecture 2005 Specification [2]. Included in
this enhanced UltraSPARC 2005 specification is support for
Chip-level Multithreading (CMT) for a highly threaded processor
architecture and a hyperprivileged mode that allows the
hypervisor to virtualize the processor to run multiple domains.
The design of the UltraSPARC T1 processor, which is the first
implementation of the UltraSPARC Architecture 2005
Specification, is also open sourced. The UltraSPARC T1
processor includes 8 cores with 4 strands in each core, providing
a total of 32 strands per processor.
In August 2007 Sun announced the UltraSPARC T2 processor,
the follow-up CMT processor to the UltraSPARC T1 processor,
and the OpenSPARC T2 architecture [33] which is the open
source version of the UltraSPARC T2 processor. Sun also
released the UltraSPARC Architecture 2007 specification [34]
which adds a section for error handling and expands the
discussion for memory management. The UltraSPARC T2
processor has several enhancements over the UltraSPARC T1
processor. These enhancements include 64 strands, per-core
floating-point and graphics units, and integrated PCIe and 10 Gb
Ethernet (for more details see "Processor Components" on page
31).
The remainder of this chapter discusses the following features of
the UltraSPARC T1/T2 processor architecture, and describes
their effect on virtualization implementations:
• Processor privilege mode — The UltraSPARC 2005
specification defines a hyperprivileged mode for the hypervisor
operations.
• Sun4v Chip Multithreaded architecture — This feature enables
the creation of up to 32 domains, each with its own dedicated
strands, on an UltraSPARC T1 processor, and up to 64
domains on an UltraSPARC T2 processor.
• Address Space Identifier (ASI) — The ASI provides
functionality to control access to a range of address spaces,
similar to the segmentation used by x86 processors.
• Memory Management Unit (MMU) — The software-controlled
MMU allows an efficient redirection of page faults to the
intended domain for loading translations.
• Trap and interrupt handling — Each strand (virtual processor)
has its own trap and interrupt priority registers. This
functionality allows the hypervisor to re-direct traps to the
target CPU and enables the trap to be taken by the GOS's trap
handler.

Note - The terms strand, hardware thread, logical processor,


virtual CPU and virtual processor are used by various
documents to refer to the same concept. For consistency, the
term strand is used in this chapter.

Processor Mode of Operation


The UltraSPARC 2005 specification defines three privilege
modes: non-privileged, privileged, and hyperprivileged. In
hyperprivileged mode, the processor can access all registers
and address spaces, and can execute all instructions.
Instructions, registers, and address spaces for privileged and
non-privileged modes are restricted.
The processor operates in privileged mode when PSTATE.priv is
set to 1 and HPSTATE.hpriv is set to 0. The processor operates
in hyperprivileged mode when HPSTATE.hpriv is set to 1
(PSTATE.priv is ignored).
Table 1 lists the availability of instructions, registers, and
address spaces for each of the privilege modes, and includes
information on where further details can be found in the
UltraSPARC Architecture 2005 Specification [2].

Table 1. Documentation describing the availability of components
in the UltraSPARC processor.

Component       Location(a)            Comments
Instruction     Table 7-2              All instructions except SIR, RDHPR, and WRHPR (which
                                       require hyperprivilege to execute) can be executed from
                                       the privileged mode.
Register        Chapter 5              There are seven hyperprivileged registers: HPSTATE,
                                       HTSTATE, HINTP, HTBA, HVER, HSTICK_CMPR, and
                                       STRAND_STS. These registers are used by the hypervisor
                                       in the hyperprivileged mode.
Address Space   Tables 9-1 and 10-1    ASIs 0x30-0x7F are for hyperprivileged access only.
                                       These ASIs are mainly for CMT control, MMU, TLB, and
                                       hyperprivileged scratch registers.

a. Location in the UltraSPARC Architecture 2005 Specification [2].
Based on the availability of instructions, registers, and the ASI in
hyperprivileged mode, the following functions of the hypervisor
can be deduced:
• Reset the processor: SIR instruction
• Control hyperprivileged traps and interrupts: HTSTATE, HTBA,
HINTP registers
• Control strand operation: ASI 0x41, and HSTICK_CMPR and
STRAND_STS registers
• Manage MMU: ASI 0x50-0x5F
Processor Components
The UltraSPARC T1 processor [10] contains eight cores, and each
core has hardware support for four strands. One FPU and one
L2 cache are shared among all cores in the processor. Each
core has its own Level 1 instruction and data cache (L1 Icache
and Dcache) and TLB that are shared among all strands in the
core. In addition, each strand contains the following:
• A full register file with eight register windows and four sets of
global registers (a total of 160 registers: 8 × 16 registers per
window, plus 4 × 8 global registers)
• Most of the ASIs
• Ancillary privileged registers
• Trap queue with up to 16 entries
This hardware support in each strand allows the hypervisor to
partition the processor into 32 domains, with one strand for each
domain. Each strand can execute instructions separately without
requiring a software scheduler in the hypervisor to coordinate
the processor resources.
Table 2 summarizes the association of processor components to
their location in the processor, core and strand.
Table 2. Location of key processor components in the UltraSPARC T1 processor.

Processor                 Core                             Strand
• Floating Point Unit     • 6-stage instruction pipeline   • Register file with 160 registers
• L2 cache crossbar       • L1 Icache and Dcache           • Most of the ASIs
• L2 cache                • TLB                            • Ancillary state register (ASR)
                                                           • Trap registers
                                                           • Privileged registers
The UltraSPARC T2 processor [33] is built upon the UltraSPARC
T1 architecture. It has the following enhancements over the
UltraSPARC T1 processor:
• Eight strands per core (for a total of 64 strands)
• Two integer pipelines per core, with each integer pipeline
supporting 4 strands
• Eight banks of 4 MB L2 cache
• A floating-point and graphics unit (FGU) per core
The UltraSPARC T2 has a total of 64 strands in 8 cores, and each
core has its own floating-point and graphics unit (FGU). This allows
up to 64 domains to be created on the UltraSPARC T2 processor.
This design also adds integrated support for industry standard I/O
interfaces such as PCI-Express and 10 Gb Ethernet.
Table 3 summarizes the association of processor components to
physical processor, core and strand.
Table 3. Location of key processor components in the UltraSPARC T2 processor.

Processor                    Core                                   Strand
• 8 banks of 4 MB L2 cache   • 2 instruction pipelines (8 stages)   • Full register file with 8 windows
• L2 cache crossbar          • L1 Icache and Dcache                 • Most of the ASIs
• Memory controller          • TLB                                  • Ancillary state register (ASR)
• PCI-E                      • FGU (12 stages)                      • Privileged registers
• 10 Gb/Gb Ethernet

Address Space Identifier


Unlike x86 processors in 32-bit mode, which use segmentation
to divide a process's address space into several segments of
protected address spaces, the SPARC v9 processor has a flat
64-bit address space. An address in the SPARC V9 processor is
a tuple consisting of an 8-bit address space identifier (ASI) and a
64-bit byte-address offset within the specified address space.
The ASI provides attributes of an address space, including the
following:
• Privileged or non-privileged
• Register or memory
• Endianness (for example, little-endian or big-endian)
• Physical or virtual address
• Cacheable or non-cacheable
The SPARC processor's ASI allows different types of address
spaces (user virtual address space, kernel virtual address
space, processor control and status registers, etc.) to coexist as
separate and independent address spaces for a given context.
Unlike x86 processors in which user processes and the kernel
share the same address space, user processes and the kernel
have their own address space on SPARC processors.
Access to these address spaces is protected by the ASI
associated with each address space. ASIs in the range 0x00-
0x2F may be accessed only by software running in privileged or
hyperprivileged mode; ASIs in the range 0x30-0x7F may be
accessed only by software running in hyperprivileged mode. An
access to a restricted (privileged or hyperprivileged) ASI (0x00-
0x7F) by non-privileged software will result in a privileged_action
trap.

Table 9-1 and Table 10-1 of [2] provide a summary and
description for each ASI.

Memory Management Unit


The traditional UltraSPARC architecture supports two types of
memory addressing:
• Virtual Address (VA) — managed by the GOS and used by user
programs
• Physical address (PA) — passed by the processor to the
system bus when accessing physical memory
The Memory Management Unit (MMU) of the UltraSPARC
processor provides the translation of VAs to PAs. This
translation enables user programs to use a VA to locate data in
physical memory.
The SpitFire Memory Management Unit (sfmmu) is Sun's
implementation of the UltraSPARC MMU. The sfmmu hardware
consists of Translation Lookaside Buffers (TLBs) and a number
of MMU registers:
• Translation Lookaside Buffer (TLB)
The TLB provides virtual to physical address translations.
Each entry of the TLB is a Translation Table Entry (TTE) that
holds information for a single page mapping of virtual to
physical addresses. The format of the TTE is shown in Figure
13. The TTE consists of two 64-bit words, representing the tag
and data of the translation. The privileged field, P, controls
whether or not the page can be accessed by non-privileged
software.
• MMU registers

A number of MMU registers are used for accessing TLB


entries, removing TLB entries (demap), context management,
handling TLB misses, and support for Translation Storage
Buffer (TSB) access. The TSB, an array of TTE entries, is a
cache of translation tables used to quickly reload the TLB. The
TSB resides in the system memory and is managed entirely by
the OS. The UltraSPARC processors include some MMU
hardware registers for speeding up TSB access. The TLB miss
handler first searches the TSB for the translation. If the
translation is not found in the TSB, the TLB miss handler calls a
more sophisticated (and slower) TSB miss handler to load the
translation into the TSB.
(The TTE tag word holds the context ID and the virtual address;
the TTE data word holds the target physical address together with
valid, privileged, writable, cacheable, side-effect, and page-size
fields.)

Figure 13. The translation lookaside buffer (TLB) is an array of
translation table entries containing tag and data portions.
A TLB hit occurs if both the context and virtual address match an
entry in the TLB. Address aliasing (multiple TLB entries with the
same physical address) is permitted.

Unlike the x86 processor, the loading of page translations into the
TLB is manually managed by software through traps. In the event
of a TLB miss, a trap is generated that first tries to get the
translation from the Translation Storage Buffer (TSB) (Figure 14).
The TSB, an in-memory array of translations, acts like a
direct-mapped cache for the TLB. If the translation is not present
in the TSB, a TSB miss trap is generated. The TSB miss trap
handler uses a software lookup mechanism based on the hash
memory entry block structure, hme_blk, to obtain the TTE. If a
translation is still


not found in hme_blk, the kernel generic trap handler is invoked
to call the kernel function pagefault() to allocate physical memory
for the virtual address and load the translation into the hme_blk
hash structure.

Figure 14 depicts the mechanism for handling TLB misses in an
unvirtualized domain: a TLB miss is first satisfied from the TSB; a
TSB miss is satisfied from the hme_blk structure; and if no
translation exists there, pagefault() allocates physical memory and
hat_memload() installs the new TTE.

Figure 14. Handling a TLB miss in an unvirtualized domain,
UltraSPARC T1/T2 processor architecture.
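The lookup order just described can be summarized by the following
sketch; the helper names (tsb_lookup(), hmeblk_lookup(), tsb_load(),
tlb_load()) are illustrative stand-ins for the real sfmmu trap
handlers, not the actual kernel interfaces:

#include <stdint.h>

typedef struct tte tte_t;                              /* translation table entry  */
extern tte_t *tsb_lookup(uintptr_t va, int ctx);       /* search the in-memory TSB */
extern tte_t *hmeblk_lookup(uintptr_t va, int ctx);    /* search the hme_blk hash  */
extern void   tsb_load(uintptr_t va, int ctx, tte_t *tte);
extern void   tlb_load(uintptr_t va, int ctx, tte_t *tte);
extern void   pagefault(uintptr_t va);                 /* allocate and map memory  */

tte_t *
resolve_tlb_miss(uintptr_t va, int ctx)
{
        tte_t *tte = tsb_lookup(va, ctx);
        if (tte == NULL) {
                tte = hmeblk_lookup(va, ctx);
                if (tte == NULL) {
                        pagefault(va);                 /* kernel allocates a page   */
                        tte = hmeblk_lookup(va, ctx);
                }
                tsb_load(va, ctx, tte);                /* cache the TTE in the TSB  */
        }
        tlb_load(va, ctx, tte);                        /* load the TTE into the TLB */
        return (tte);
}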
Similarly, Figure 15 depicts how TLB misses are handled in a
virtualized domain. In a virtualized environment, the UltraSPARC
T1/T2 processor adds a Real Address type, in addition to the VA
and PA, into the types of memory addressing (Figure 15). Real
addresses (RA), which are equivalent to the physical memory in
Sun xVM Server (see "Physical Memory Management" on page
52) are provided to the GOS as the underlying physical memory
allocated to it. The GOS-maintained TSBs are used to translate
VAs into RAs. The hypervisor manages the translation from RA
to PA.

(In the virtualized case, the GOS-managed TSB holds VA-to-RA
translations while the TLB holds VA-to-PA translations; the
hypervisor performs the RA-to-PA step when the TTE is loaded into
the TLB. TSB misses are still resolved through the hme_blk
structure, pagefault(), and hat_memload(), as in the unvirtualized
case.)

Figure 15. Handling a TLB miss in a virtualized domain, UltraSPARC
T1/T2 processor architecture.

Applications, which are non-privileged software, use only VAs.


The OS kernel, which is privileged software, uses both VAs and
RAs. The hypervisor, which is hyperprivileged software, normally
uses PAs. "Physical Memory Allocation" on page 88 discusses in
detail the types of memory addressing used in LDoms.
The UltraSPARC T2 processor adds a hardware table walk for
loading TLB entries. The hardware table walk accesses the
TSBs to find TTEs that match the virtual address and context ID
of the request. Since a GOS cannot access or control physical
memory, the TTEs in the TSBs controlled by a GOS contain real
page numbers, not physical page numbers (see "Physical
Memory Allocation" on page 88). TTEs in the TSBs controlled by
the hypervisor can contain real page numbers or physical page
numbers. The hypervisor performs the RA-to-PA translation
within the hardware table walk to permit the hardware table walk
to load a GOS's TTEs into the TLB for VA-to-PA translation.
Traps
In the SPARC processor, a trap transfers software execution from one
privilege mode to a mode at the same or higher privilege level. The
only exception is that non-privileged mode cannot trap to
non-privileged mode. A trap can be generated by the following methods:
• Internally by the processor (memory faults, privileged
exceptions, etc.)
• Externally generated by I/O devices (interrupts)
• Externally generated by another processor (cross calls)
• Software generated (for example, the Tcc instruction)
A trap is associated with a Trap Type (TT), a 9-bit value. (TT values
0x180-0x1FF are reserved for future use.) The transfer of software
execution occurs through a trap table that contains an array of TT
handlers indexed by the TT value. Each trap table entry is 32 bytes in
length and contains the first eight instructions of the TT handler. When
a trap occurs, the processor gets the TT from the TT register and the
trap table base address (TBA) from the TBA register. After saving the
current executing state and updating some registers, the processor
starts to execute the instructions in the trap table handler.
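As a small illustration of the dispatch arithmetic (ignoring the
separate half of the trap table used when TL > 0), the handler address
for a given trap can be computed as follows:

#include <stdint.h>

/* Each trap table entry holds eight 4-byte instructions, i.e., 32 bytes. */
static inline uint64_t
trap_handler_addr(uint64_t tba, unsigned int tt)
{
        return (tba + (uint64_t)tt * 32);      /* TBA + (TT x 32 bytes) */
}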
The SPARC processors support nesting traps using a trap level (TL).
The maximum TL (MAXTL) value is typically in the range of 2-6, and
depends on the processor; in UltraSPARC T1/T2 processors, MAXTL is 6.
Each trap level has one set of trap stack control registers: trap type (TT),
trap program counter (TPC), trap next program counter (TNPC), and trap
state (TSTATE). These registers provide trap software execution state and
control for the current TL. The ability to support nested traps in SPARC
processors makes the implementation of an OS trap handler easier
and more efficient, as the OS doesn't need to explicitly save the current
trap stack information.
On UltraSPARC T1/T2 processors, each strand has a full set of trap
control and stack registers which include TT, TL, TPC, TNPC, TSTATE, HTSTATE
(hyperprivileged trap state), TBA, HTBA (hyperprivileged trap base
address), and PIL (priority interrupt level). This design feature allows
each strand to receive traps independently of other strands. This
capability significantly helps trap handling and management by the
hypervisor, as traps are delivered to a strand without being queued up
in the hypervisor.

Interrupts
On SPARC platforms, interrupt requests are delivered to the CPU as
traps. Traps 0x041 through 0x04F are used for Priority Interrupt Level
(PIL) interrupts, and trap 0x60 is used for the vector interrupt. There
are 15 interrupt levels for PIL interrupts. Interrupts are serviced in
accordance with their PIL, with higher PILs having higher priority. The
vector interrupt is used to support the data-bearing vector interrupt,
which allows a device to include its private data in the interrupt packet
(also known as the mondo vector). With the vector interrupt, device
CSR access can be


eliminated and the complexity of device hardware can be
reduced.
PIL interrupts are delivered to the processor through the SOFTINT_REG
ancillary state register (ASR). The SOFTINT_REG register contains a
15-bit int_level field. When a bit in this field is set, a trap is
generated, and the PIL of the trap corresponds to the position of the
bit in that field. There is one SOFTINT_REG for each strand.

In LDoms, the interrupt delivery from an I/O device to a GOS is a
two-step process:

• An I/O device sends an interrupt request using the vector
interrupt (trap 0x60) to the hypervisor. The hypervisor inserts
the interrupt request into the interrupt queue of the target
virtual processor.
• The target processor receives the interrupt request on its
interrupt queue through trap 0x7D (for device) or 0x7C (for
cross calls), and schedules an interrupt to itself to be
processed at a later time by setting bits in the privileged SOFTINT
register which causes a PIL interrupt (trap 0x41-0x4F). For
more details on interrupt delivery, see "Trap and Interrupt
Handling" on page 85.


Section II
Hardware Virtualization Implementations

• Chapter 5: Sun xVM Server (page 39)
• Chapter 6: Sun xVM Server with Hardware VM (HVM) (page 63)
• Chapter 7: Logical Domains (page 79)
• Chapter 8: VMware (page 97)


Chapter 5
Sun xVM Server

Sun xVM Server is a paravirtualized Solaris OS that


incorporates the Xen open source community work. The open
source VMM, Xen, was originally developed by the Systems
Research Group of the University of Cambridge Computer
Laboratory, as part of the UK-EPSRC funded XenoServers
project. The first versions of Xen, targeted at the Linux
community for the x86 processor, required the Linux kernel to be
specifically modified to run on the Xen VMM. This OS
paravirtualization made it impossible to run Windows on early
versions of Xen, because Microsoft did not permit the Windows
software to be modified.
In December 2005 the Xen development team released Xen 3.0,
the first version of its VMM that supported hardware-assisted
virtual machines (HVM). With this new version, an unmodified
OS could be hosted on the Intel VT-x and AMD-V (Pacifica)
processors. Xen 3.0 eliminated the need for paravirtualization
and enabled Microsoft Windows to run in a Xen environment
side-by-side with Linux and the Solaris OS.
Xen 3.0 supports the x86 CPU both with HVM and without HVM.
Xen 3.0 also extends support for symmetric multiprocessing, 64-
bit operating systems, and up to 64 GB RAM allowed by the x86
physical address extension (PAE) in 32-bit mode.
HVM technology affects the Xen implementation in many ways.
This chapter discusses the architecture and design of Sun xVM
Server, which does not leverage the processor HVM feature.
Chapter 6 discusses Sun xVM Server for x86 processors with
HVM support (Sun xVM Server with HVM).

Note - Sun xVM Server includes support for the Xen open
source community work on the x86 platform and support for
LDoms on the UltraSPARC T1/T2 platform. In this paper, in
order to distinguish the discussion of x86 and UltraSPARC
T1/T2 processors, Sun xVM Server is specifically used to refer
to the Sun hardware virtualization product for the x86 platform,
and LDoms is used to refer to the Sun hardware virtualization
product for the UltraSPARC T1 and T2 platforms.

This chapter is organized as follows:
• "Sun xVM Server Architecture Overview" on page 40 provides an
overview of the Sun xVM Server architecture.
• "Sun xVM Server CPU Virtualization" on page 45 discusses the
CPU virtualization employed by Sun xVM Server.
• "Sun xVM Server Memory Virtualization" on page 52 describes
memory management issues.
• "Sun xVM Server I/O Virtualization" on page 56 discusses the
I/O virtualization used in Sun xVM Server.

Sun xVM Server Architecture Overview


A Sun xVM Server virtualized system consists of an x86 system,
a VMM, a control VM running Sun xVM Server (Dom0), and zero
or more VMs (DomU), as shown in Figure 16. The Sun xVM
Hypervisor for x86, the VMM of the Sun xVM Server system,
manages hardware resources and provides services to the VMs.
Each VM, including Dom0, runs an instance of a guest operating
system (GOS) and is capable of communicating with the VMM
through a set of hypervisor calls.

(The figure shows guest applications running in DomU VMs above the
Sun xVM Hypervisor for x86, which provides the scheduler, event
channels, console interface, XenStore, hypercalls, and grant tables,
all running on a Sun x64 server.)

Figure 16. A Sun xVM Server virtualized system consists of a VMM,
a control VM (Dom0), and zero or more VMs (DomU).

The Dom0 VM has some unique characteristics not available in
other VMs:

• First VM started by the VMM


• Able to directly access I/O devices
• Runs domain manager to create, start, stop, and
configure other VMs
• Provides I/O access service to other VMs (DomU)
Each DomU VM runs an instance of a paravirtualized GOS, and
gets VMM services through a set of hypercalls. Access to I/O
devices from each DomU VM is provided by drivers in Dom0.

Sun xVM Hypervisor for x86 Services


The Sun xVM Hypervisor for x86, the VMM of the Sun xVM
Server, provides several communication channels between itself
and overlying domains:
• Hypercalls — synchronous calls from a GOS to the
VMM
• Event Channel — asynchronous notifications from the
VMM to VMs
• Grant Table — shared memory communication between the
VMM and VMs, and among VMs
• XenStore — a hierarchical repository of control and
status information

Each of these mechanisms is described in more detail in the


following sections.

Hypercalls
The Sun xVM Server hypercalls are a set of interfaces used by a
GOS to request service from the VMM. The hypercalls are
invoked in a manner similar to OS system calls: a software
interrupt is issued which vectors to an entry point within the
VMM. Hypercalls use INT $0x82 on a 32-bit system and SYSCALL on
a 64-bit system, with the particular hypercall contained in the
%eax register.
For example, the common routine for hypercalls with four
arguments on a 64-bit Solaris kernel is:

long
__hypercall4(ulong_t callnum, ulong_t a1, ulong_t a2, ulong_t a3,
ulong_t a4);

The function in assembly is as follows:

[0]> __hypercall4,7/ai
__hypercall4:
__hypercall4:           movl   %edi,%eax      /* %edi is the first argument */
__hypercall4+2:         movq   %rsi,%rdi
__hypercall4+5:         movq   %rdx,%rsi
__hypercall4+8:         movq   %rcx,%rdx
__hypercall4+0xb:       movq   %r8,%r10
__hypercall4+0xe:       syscall
__hypercall4+0x10:      ret

The calling convention is compliant with the AMD64 ABI [8].
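For illustration, a two-argument hypercall wrapper written directly in
C with GCC inline assembly might look like the following. This is only
a sketch of the convention shown above (hypercall number in %rax,
arguments in %rdi and %rsi), not the actual Solaris source:

/* Sketch only: issue a hypercall through SYSCALL, following the register
 * convention shown in the disassembly above.  SYSCALL clobbers %rcx and %r11. */
static inline long
hypercall2(unsigned long callnum, unsigned long a1, unsigned long a2)
{
        long ret;
        register unsigned long r_a1 asm("rdi") = a1;
        register unsigned long r_a2 asm("rsi") = a2;

        asm volatile("syscall"
            : "=a" (ret), "+r" (r_a1), "+r" (r_a2)
            : "0" (callnum)
            : "rcx", "r11", "memory");
        return (ret);
}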

The SYSCALL instruction is intended to enable unprivileged software


(ring 3) to access services from privileged software (ring 0).
Solaris system calls also use SYSCALL to allow user applications to
access Solaris kernel services. Having SYSCALL used by both
Solaris system calls and the hypercalls means that the SYSCALL
made by the user process in Solaris is delivered indirectly by the
VMM to the Solaris kernel. This causes a slight overhead for
each Solaris system call.
A complete list of Sun xVM Server hypercalls is provided in Table 4.

Table 4. Sun xVM Server hypercalls.

Privilege Operations:
long set_trap_table(trap_info_t *table);
long mmu_update(mmu_update_t *req, int count, int *success_count, domid_t domid);
long set_gdt(ulong_t *frame_list, int entries);
long stack_switch(ulong_t ss, ulong_t esp);
long fpu_taskswitch(int set);
long mmuext_op(struct mmuext_op *req, int count, int *success_count, domid_t domain_id);
long update_descriptor(maddr_t ma, uint64_t desc);
long update_va_mapping(ulong_t va, uint64_t new_pte, ulong_t flags);
long set_timer_op(uint64_t timeout);
long physdev_op(void *physdev_op);
long vm_assist(uint_t cmd, uint_t type);
long update_va_mapping_otherdomain(ulong_t va, uint64_t new_pte, ulong_t flags, domid_t domain_id);
long iret();
long set_segment_base(int reg, ulong_t value);
long nmi_op(ulong_t op, void *arg);
long hvm_op(int cmd, void *arg);

VMM Services:
long set_callbacks(ulong_t event_address, ulong_t failsafe_address, ulong_t syscall_address);
long grant_table_op(uint_t cmd, void *uop, uint_t count);
long event_channel_op(void *op);
long xen_version(int cmd, void *arg);
long set_debugreg(int reg, ulong_t value);
long get_debugreg(int reg);
long multicall(void *call_list, int nr_calls);
long console_io(int cmd, int count, char *str);
long sched_op(int cmd, void *arg);
long do_kexec_op(unsigned long op, int arg1, void *arg);

VM Control Operations:
long sched_op_compat(int cmd, ulong_t arg);
long platform_op(xen_platform_op_t *platform_op);
long memory_op(int cmd, void *arg);
long vcpu_op(int cmd, int vcpuid, void *extra_args);
long sysctl(xen_sysctl_t *sysctl);
long domctl(xen_domctl_t *domctl);
long acm_op();

As Table 4 shows, the hypercalls provide a variety of functions
for a GOS:
• Perform privileged operations such as setting the trap table,
updating the page table, loading the GDT, and setting the GS
and FS segment registers
• Get services from the VMM, such as using the event channel
and the grant table
• Control VM operations such as platform_op, domain control, and
virtual CPU control

An example use of a hypercall is to request a set of page table
updates. For example, a new process created by the fork(2) call
requires the creation of page tables. The hypercall
HYPERVISOR_mmu_update(), which validates and applies a list of
updates, is called by the Solaris kernel to perform the page table
updates. This routine returns control to the calling domain when
the operation is completed.
In the following example, a kmdb(1M) breakpoint is set at the
mmu_update() call. The stack trace illustrates how the
mmu_update() function is called after a new process is created
by fork():

[1]> set_pteval+0x4f:b              // set breakpoint at HYPERVISOR_mmu_update
[1]> :c                             // continue
kmdb: stop at set_pteval+0x4f       // the breakpoint reached
kmdb: target stopped at:
set_pteval+0x4f:    call    -0x5a34 <HYPERVISOR_mmu_update>
[1]> $c                             // display the stack trace
set_pteval+0x4f(c753000, 1fb, 3, f9c29027)
x86pte_copy+0x73(fffffffec08115a8, fffffffec2a8a0d8, 1fb, 5)
hat_alloc+0x228(fffffffec2fa88c0)
as_alloc+0x99()
as_dup+0x3f(fffffffec27b1d28, fffffffec2a11168)
cfork+0x102(0, 1, 0)
forksys+0x25(0, 0)
sys_syscall32+0x13e()
[1]>

The above example shows that the kernel does not maintain its
own copy of the page table. It uses the mmu_update() hypercall
to request that the VMM update the page table.

Event Channels
To a GOS, a VMM event is the equivalent of a hardware
interrupt. Communication from the VMM to a VM is provided
through an asynchronous event mechanism, called an event
channel, which replaces the usual delivery mechanisms for
device interrupts. A VM creates an event channel to send and
receive asynchronous event notifications.

Three classes of events are delivered by this event channel


mechanism:

• Bi-directional inter- and intra-VM connections


A VM can bind an event-channel port to another domain or to
another virtual CPU within the VM.
• Physical interrupts
A VM with direct access to hardware (Dom0) can bind an
event-channel port to a physical interrupt source.
• Virtual interrupts
A VM can bind an event-channel port to a virtual interrupt
source, such as the virtual-timer device.
Event channels are addressed by a port. Each channel is
associated with two bits of information (a sketch of scanning these
bitmaps follows the list):
• unsigned long evtchn_pending[sizeof(unsigned long) * 8] — This
notifies the domain that there is a pending notification to be
processed. The bit is cleared by the GOS.
• unsigned long evtchn_mask[sizeof(unsigned long) * 8] — This
specifies whether the event channel is masked. If this bit is clear
and PENDING is set, an asynchronous upcall will be scheduled. This
bit is only updated by the GOS; it is read-only within the VMM.
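A minimal sketch of how a GOS might scan these two bitmaps to find the
next deliverable event follows; the array names mirror the declarations
above, but the helper itself is hypothetical, and real guests use atomic
bit operations on the shared page:

#include <limits.h>

#define NR_EVENT_WORDS  (sizeof (unsigned long) * 8)

extern unsigned long evtchn_pending[NR_EVENT_WORDS];    /* set by the VMM     */
extern unsigned long evtchn_mask[NR_EVENT_WORDS];       /* managed by the GOS */

int
next_pending_event(void)
{
        unsigned int w;

        for (w = 0; w < NR_EVENT_WORDS; w++) {
                unsigned long ready = evtchn_pending[w] & ~evtchn_mask[w];
                if (ready != 0) {
                        int bit = __builtin_ctzl(ready);      /* GCC: lowest set bit */
                        evtchn_pending[w] &= ~(1UL << bit);   /* clear pending (GOS) */
                        return ((int)(w * sizeof (unsigned long) * CHAR_BIT + bit));
                }
        }
        return (-1);    /* no event pending */
}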
Interrupts to a VM are virtualized by mapping them to event
channels. These interrupts are delivered asynchronously to the
target domain using a callback supplied via the set_callbacks
hypercall. A guest OS can map these events onto its standard
interrupt dispatch mechanisms. The VMM is responsible for
determining the target domain that will handle each physical
interrupt source.
"Interrupts and Exceptions" on page 49 provides a detailed
discussion of how an interrupt is handled by the VMM and
delivered to a VM using an event channel.

Grant Tables
The Sun xVM Hypervisor for x86 allows sharing memory among
VMs, and between the VMM and a VM, through a grant table
mechanism. Each VM makes some of its pages available to other
VMs by granting access to its pages. The grant table is a data
structure that a VM uses to expose some of its pages, specifying
what permissions other VMs have on its pages. The following
example shows the information stored in a grant table entry:

struct grant_entry {
        uint16_t flags;     /* GTF_xxx: various type and flag information. [XEN,GST] */
        domid_t  domid;     /* The domain being granted foreign privileges. [GST]    */
        uint32_t frame;     /* page frame number (PFN)                               */
};

The flags field stores the type and various flag information of the
grant table entry. There are three types of grant table entries:
• GTF_invalid — Grants no privileges.
• GTF_permit_access — Allows the domain domid to map/access the
specified frame.
• GTF_accept_transfer — Allows domid to transfer ownership of one
page frame to this guest; the VMM writes the page number to frame.
The type information acts as a capability which the grantee can
use to perform operations on the granter's memory. A grant
reference also encapsulates the details of a shared page,
removing the need for a domain to know the real machine
address of a page it is sharing. This makes it possible to share
memory correctly with domains running in fully virtualized
memory.
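As a sketch of how a guest might populate such an entry to let another
domain access one of its pages, consider the following; the
GTF_permit_access value, the grant_table pointer, and the write
ordering are simplified and illustrative rather than taken from the
actual sources:

#include <stdint.h>

typedef uint16_t domid_t;

struct grant_entry {
        uint16_t flags;         /* GTF_xxx type and flag information */
        domid_t  domid;         /* domain being granted access       */
        uint32_t frame;         /* page frame number (PFN)           */
};

#define GTF_permit_access       1               /* illustrative value */

extern struct grant_entry grant_table[];        /* page shared with the VMM */

void
grant_page(int ref, domid_t domid, uint32_t pfn)
{
        grant_table[ref].domid = domid;
        grant_table[ref].frame = pfn;
        /* The flags word is written last so the entry only becomes
         * usable once the other fields are valid. */
        grant_table[ref].flags = GTF_permit_access;
}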
Device drivers in the Sun xVM Server (see "Sun xVM Server I/O
Visualization" on page 56) use grant tables to send data between
drivers of different domains, and use event channels and
callback services for asynchronous notification of data
availability.

XenStore
XenStore [22] is a shared storage space used by domains to
communicate and store configuration information. XenStore is
the mechanism by which control-plane activities, including the
following, occur:
• Setting up shared memory regions and event channels for use
with split device drivers
• Notifying the guest of control events (for example, balloon
driver requests)
• Reporting status information from the guest (for example,
performance-related statistics)
The store is arranged as a hierarchical collection of key-value
pairs. Each domain has a directory hierarchy containing data
related to its configuration. Domains are permitted to register for
notifications about changes in a subtree of the store, and to apply
changes to the store transactionally.
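For illustration, a fragment of such a hierarchy for a guest domain
might look like the listing below; the exact keys vary by release, so
the paths shown are representative rather than definitive:

/local/domain/3/name = "domU-1"
/local/domain/3/memory/target = "1048576"
/local/domain/3/device/vbd/768/backend = "/local/domain/0/backend/vbd/3/768"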

Sun xVM Server CPU Virtualization


The Sun xVM Hypervisor for x86 provides a paravirtualized
environment to a VM. Full CPU virtualization to a VM is achieved
by a concerted coordination of CPU management by the VMM,
and CPU usage by the GOS within a VM.
The next sections discuss CPU virtualization employed by the
Sun xVM Server for these tasks:
• Deprivileging CPUs to run a VM
• Scheduling CPUs for VMs
• Handling and delivery of interrupts to a VM
• Providing timer services to a VM

CPU Privilege Mode


The Sun xVM Hypervisor for x86 operates at a higher privilege
level than the GOS. On 32-bit x86 processors with protection
mode enabled, a GOS may use rings 1, 2 and 3 as it sees fit.


The Sun xVM Server kernel uses ring 1 for its own operation and
places applications in ring 3.
On 64-bit systems, linear address space (flat memory model) is
used to create a continuous, unsegmented address space for
both the kernel and application programs. Segmentation is
disabled, and rings 1 and 2, which effectively no longer exist,
have the same page-level privilege as ring 0 (see "Protected
Mode" and following sections beginning on page 22). To protect
the VMM, the Sun xVM Server kernel is therefore restricted to
run in ring 3 in 64-bit mode and in ring 1 in 32-bit mode, as seen
in the definitions in segments.h:

% cat intel/sys/segments.h
#if defined(__amd64)
#define SEL_XPL  0      /* xen privilege level */
#define SEL_KPL  3      /* both kernel and user in ring 3 */
#elif defined(__i386)
#define SEL_XPL  0      /* xen privilege level */
#define SEL_KPL  1      /* kernel privilege level under xen */
#endif  /* __i386 */

If both the kernel and user applications run with the same privilege
level, how does Sun xVM Server protect the kernel from user
applications? The answer is as follows [32]:

1. The VMM performs context switching between kernel mode and
the currently running application in user mode. The VMM tracks
which mode, kernel or user, the GOS is running in.

2. The GOS maintains two top-level (PML4) page tables per
process, one each for the kernel and the user. The GOS registers
the two page tables with the VMM. The kernel page table
contains translations for both the kernel and user addresses, and
the user page table contains translations only for the user
addresses. During the context switch, the VMM switches the
top-level page table so the kernel addresses are not visible to the
user process. The linear address mapping to the paging data
structures for the 64-bit x86 processor is shown below in
Figure 17:

(The figure shows a 64-bit linear address divided into a sign-extended
field (bits 63-48), PML4 (47-39), PDP (38-30), PDE (29-21), PTE (20-12),
and page offset (11-0).)

Figure 17. Linear address mapping to paging data structure for
64-bit x86 processor.
Switching the PML4 page tables between kernel and user mode
enables a 64-bit address space to be split into two logically
separate address spaces. In this logical separation of a 64-bit
address space, the kernel can access both its address space
and a user address space while a user process can access only
its own address space. The user address space in this
addressing scheme is therefore restricted to use the lower 48 bits
of the 64-bit address space. The resulting address space
partition in the 64-bit Sun xVM Server is shown as follows, in
Figure 18:

(The figure shows the 64-bit address space partitioned, from high to
low, into the kernel region (ring 3) above 0xFFFF8800 00000000, the
VMM region (ring 0) starting at 0xFFFF8000 00000000, a reserved hole,
and the user region (ring 3) from 0 up to 0x7FFF FFFFFFFF (2^47).)

Figure 18. Address space partitioning in the 64-bit Sun xVM Server.
As discussed previously (see "Segmented Architecture" on page
23), the processor privilege level is set when a segment is
loaded. The Solaris OS uses the GDT for user and kernel
segments. The segment index of each segment type is assigned
as shown in Table 5 on page 54.
The command kmdb(1M) can be used to examine the segment
descriptor of kernel code:

[0]> gdt0+30::print -t 'struct user_desc'    // 64-bit kernel code segment
{
    unsigned long usd_lolimit :16 = 0x7000
    unsigned long usd_lobase :16 = 0xe030
    unsigned long usd_midbase :8 = 0
    unsigned long usd_type :5 = 0xe
    unsigned long usd_dpl :2 = 0x3
    unsigned long usd_p :1 = 0x1
    unsigned long usd_hilimit :4 = 0x4
    unsigned long usd_avl :1 = 0
    unsigned long usd_long :1 = 0
    unsigned long usd_def32 :1 = 0
    unsigned long usd_gran :1 = 0x1
    unsigned long usd_hibase :8 = 0xfb
}
> gdt0+40::print -t 'struct user_desc'    // 32-bit user code segment
{
    unsigned long usd_lolimit :16 = 0xc450
    unsigned long usd_lobase :16 = 0xe030
    unsigned long usd_midbase :8 = 0xf8
    unsigned long usd_type :5 = 0xe
    unsigned long usd_dpl :2 = 0x3
    unsigned long usd_p :1 = 0x1
    unsigned long usd_hilimit :4 = 0x1
    unsigned long usd_avl :1 = 0
    unsigned long usd_long :1 = 0
    unsigned long usd_def32 :1 = 0
    unsigned long usd_gran :1 = 0x1
    unsigned long usd_hibase :8 = 0xfb
}
The descriptor privilege level (DPL) of both the kernel and 32-bit
user code segments is set to 3. At boot time, the Sun xVM
Hypervisor for x86 is loaded into memory in ring 0. After
initialization, it loads the Solaris kernel to run as Dom0 in ring 3.
The Dom0 domain is permitted to use the VM control hypercall
interfaces (see Table 4 on page 42), and is responsible for
hosting the application-level management software.

CPU Scheduling
The Sun xVM Hypervisor for x86 provides two schedulers for the
user to choose between: Credit and simple Earliest Deadline
First (sEDF). The Credit scheduler is the default scheduler; sEDF
might be phased out and removed from the Sun xVM Server
implementation.
The Credit scheduler is a proportional fair share CPU scheduler.
Each physical CPU (PCPU) manages a queue of runnable virtual
CPUs (VCPUs). This queue is sorted by VCPU priority. A
VCPU's priority can be either over or under, representing
whether this VCPU has exceeded its share of the PCPU or not.
A VCPU's share is determined by weight assigned to the VM and
credit accumulated by the VCPU in each accounting period.
Credit_VMi = ((Credit_total × Weight_VMi) + (Weight_total − 1)) / Weight_total

Credit_VCPU = Credit_VMi / TotalVCPU_VMi

The first equation determines the total credit of a VM and the
second equation determines the credit of a VCPU in a VM.
Credit_total is a constant; Weight_total is the sum of the weights of
all domains. A VM's weight is assigned using xm(1M) (for
example, xm sched-credit -w weight). In each accounting period,
a fixed amount of credit is added to idle VCPUs and subtracted
from running VCPUs.
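Read as code, the two formulas amount to the following small
computation (a sketch using integer arithmetic, which the rounding
term Weight_total − 1 suggests; the names and example weights are
illustrative):

#include <stdio.h>

static long
vm_credit(long credit_total, long weight_vm, long weight_total)
{
        /* Round the VM's proportional share of the total credit upward. */
        return ((credit_total * weight_vm + (weight_total - 1)) / weight_total);
}

static long
vcpu_credit(long credit_vm, long nvcpus)
{
        /* Each VCPU of the VM receives an equal portion of the VM's credit. */
        return (credit_vm / nvcpus);
}

int
main(void)
{
        long cvm = vm_credit(300, 256, 512);    /* VM holds half the total weight */

        (void) printf("VM credit = %ld, per-VCPU credit = %ld\n",
            cvm, vcpu_credit(cvm, 2));
        return (0);
}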
The VCPU has the priority under if the VCPU has not consumed
all credits it possesses. On each PCPU, at every scheduling
decision (when a VCPU blocks, yields, completes its time slice,
or is awakened), the next VCPU to run is picked off the head of
the run queue of priority under. When a VM runs, it consumes
credits of its VCPU[s]. When a VCPU uses all its allocated
credits, the VCPU's priority is changed from under to over. When
a CPU doesn't find a VCPU of priority under on its local run
queue, it will look on other PCPUs for one. This load balancing
guarantees each VM receives its fair share of PCPU resources
system-wide. Before a PCPU goes idle, it will look on other
PCPUs to find any runnable VCPU. This guarantees that no
PCPU idles when there is runnable work in the system.
Earliest Deadline First (EDF) scheduling provides weighted CPU
sharing by comparing the deadline of scheduled periodic
processes (or domains, in the case of Sun xVM
Server). This scheduler places domains in a priority queue. Each
domain is associated with two parameters: time requested to run,
and an interval or deadline. Whenever a scheduling event
occurs, the queue is searched for the domain closest to its
deadline. This domain is then scheduled for execution next with
the time requested. The EDF scheduler gives a better CPU
utilization when a system is underloaded. When the system is
overloaded, the set of domains that will miss deadlines is largely
unpredictable (it is a function of the exact deadlines and time at
which the overload occurs).

Interrupts and Exceptions


The x86 processor uses 256 interrupt vectors to identify
exceptions and interrupts. The vector number is an index into the
interrupt descriptor table (IDT). The IDT associates each vector
with a gate descriptor for the procedure for handling the interrupt
or exception. The IDT register (IDTR) contains the base address
of the IDT.
When Sun xVM Server is booting up, it registers its own IDT with
the VMM. During system initialization, an early stage of Solaris
boot, the Solaris kernel function init_desctbls() is called to
initialize the GDT and IDT:

void
init_desctbls(void)
{
        init_idt(&idt0[0]);
        for (vec = 0; vec < NIDT; vec++)
                xen_idt_write(&idt0[vec], vec);
}
The Solaris kernel function init_desctbls() passes each of its


exception and interrupt vectors to the VMM using the
set_trap_table() hypercall:

void
xen_idt_write(gate_desc_t *sgd, uint_t vec)
{
        trap_info_t trapinfo[2];

        bzero(trapinfo, sizeof (trapinfo));
        if (xen_idt_to_trap_info(vec, sgd, &trapinfo[0]) == 0)
                return;
        if (xen_set_trap_table(trapinfo) != 0)
                panic("xen_idt_write: xen_set_trap_table() failed");
}

The set_trap_table() hypercall has one argument, trap_info,
which contains the privilege level of the GOS code segment, the
code segment selector, and the address of the handler that will
be used to set the instruction pointer when the VMM passes
control back to the GOS (see the following code segment). The
value of trap_info is set in the function xen_idt_to_trap_info()
using the settings in the kernel global variable array, idt0.

typedef struct trap_info {
        uint8_t       vector;    /* exception vector     */
        uint8_t       flags;     /* 0-3: privilege level */
        uint16_t      cs;        /* code selector        */
        unsigned long address;   /* code offset          */
} trap_info_t;

On a 64-bit system, the interrupt descriptor has the descriptor


privilege level (DPL) 3, similar to the segment descriptor:

[0]> idt0::print 'struct gate_desc'
{
    sgd_looffset = 0x4bf0
    sgd_selector = 0xe030
    sgd_ist = 0
    sgd_resv1 = 0
    sgd_type = 0xe
    sgd_dpl = 0x3
    sgd_p = 0x1
    sgd_hioffset = 0xfb84
    sgd_hi64offset = 0xffffffff
    sgd_resv2 = 0
    sgd_zero = 0
    sgd_resv3 = 0
}

When an interrupt or exception occurs, the VMM's trap handler is


invoked to handle the interrupt or exception. If this is an
exception caused by a GOS, the VMM's trap handler sets the
pending bit (see "Event Channels" on page 43) and calls the
GOS's exception handler. Interrupts for the GOS are virtualized
by mapping them to event channels, which are delivered
asynchronously to the target GOS via the set_callbacks()
hypercall.
In the following example, a kmdb(1M) breakpoint is set at the
interrupt service routine of the sd driver, sdintr(). The function
xen_callback_handler(), the callback function used for
processing events from the VMM, is registered in the VMM by
the hypercall set_callbacks(). When an interrupt intended for sd
arrives, the
hypercall HYPERVISOR_block() detects that an event is available and


then invokes the callback function:

sd`sdintr:      ec8b4855 = pushq %rbp
[0]> $c
sd`sdintr(fffffffec0670000)
mpt`mpt_intr+0xdb(fffffffec0670000, 0)
av_dispatch_autovect+0x78(1b)
dispatch_hardint+0x33(1b, 0)
switch_sp_and_call+0x13()
do_interrupt+0x9b(ffffff0001005ae0, 1)
xen_callback_handler+0x36c(ffffff0001005ae0, 1)
xen_callback+0xd9()
HYPERVISOR_sched_op+0x29(1, 0)
HYPERVISOR_block+0x11()
mach_cpu_idle+0x52()
cpu_idle+0xcc()
idle+0x10e()
thread_start+8()
[0]>

Pending events are stored in a per-domain bitmask (see "Event
Channels" on page 43) that is updated by the VMM before it
invokes the event-callback handler specified by the GOS. The
function xen_callback_handler() is responsible for resetting the
set of pending events and responding to the notifications in an
appropriate manner. A VM may explicitly defer event handling by
setting a VMM-readable software flag; this is analogous to
disabling interrupts on a real processor.
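As a rough illustration of how such a callback might scan the pending-event bitmask, consider the sketch below. It is not the Solaris xen_callback_handler() implementation; the shared-info layout is simplified (the real structure uses two-level selector and pending words), and the per-port handler table is an assumption.

/*
 * Simplified event-dispatch sketch (assumed, simplified layout; not the
 * Solaris or Xen source).
 */
#define NR_EVENT_BITS   64

struct fake_shared_info {
    volatile unsigned long evtchn_pending;  /* one bit per event channel */
    volatile unsigned long evtchn_mask;     /* 1 = delivery deferred */
};

typedef void (*evt_handler_t)(int port);
extern evt_handler_t evt_handlers[NR_EVENT_BITS]; /* assumed per-port handlers */

void
dispatch_pending_events(struct fake_shared_info *si)
{
    unsigned long pending;
    int port;

    /* Snapshot the pending bits, honoring the mask, then clear them. */
    pending = si->evtchn_pending & ~si->evtchn_mask;
    si->evtchn_pending &= ~pending;

    for (port = 0; port < NR_EVENT_BITS; port++) {
        if ((pending & (1UL << port)) && evt_handlers[port] != NULL)
            evt_handlers[port](port);
    }
}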

Timer Services
"Timer Devices" on page 27 discusses several hardware timers
available on x86 systems. These hardware devices vary in their
frequency, reliability, granularity, counter size, and ability to
generate interrupts. The Solaris OS employs some of these timer
devices for running the OS clock and high resolution timer:
• OS system clock — The Solaris OS uses the local APIC timer
on multiprocessor systems to generate ticks for the system
clock. On uniprocessor systems, the Solaris OS uses the PIT to
generate ticks for the system clock.
• High resolution timer — The Solaris OS uses the TSC timer for
a high resolution timer. The PIT counter is used to calibrate the
TSC counter.
• Time-of-day clock — The time-of-day (TOD) clock is based on
the RTC. Only Dom0 can set the TOD clock. DomU VMs
do not have permission to update the machine's physical
RTC; therefore, any attempt by the date(1) command to set
the date and time on a DomU is quietly ignored.
In Sun xVM Server, the VMM provides the system time to each
VCPU when it is scheduled to run. The high resolution timer,
gethrtime(), still runs through the unprivileged RDTSC instruction,
so the high resolution timer is not virtualized. The virtualized
system time relies on the current TSC to calculate the time in
nanoseconds since the VCPU was scheduled.
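Conceptually, the virtualized system time is extrapolated from the TSC delta since the VCPU was last scheduled. The sketch below shows only the idea; the snapshot structure is an assumption, and the scaling arithmetic is simplified (the actual implementation uses a fixed-point multiplier and shift supplied by the VMM rather than a raw frequency divide).

/*
 * Sketch of virtualized system time (simplified; not the actual
 * Solaris or Xen time code).
 */
#include <stdint.h>

struct vcpu_time_snapshot {
    uint64_t system_time_ns;   /* system time when the VCPU was scheduled */
    uint64_t tsc_at_schedule;  /* TSC value at that moment */
    uint64_t tsc_hz;           /* assumed calibrated TSC frequency */
};

static inline uint64_t
rdtsc_now(void)
{
    uint32_t lo, hi;

    __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
    return (((uint64_t)hi << 32) | lo);
}

uint64_t
virtual_system_time_ns(const struct vcpu_time_snapshot *ts)
{
    uint64_t delta = rdtsc_now() - ts->tsc_at_schedule;

    /* Nanoseconds elapsed since scheduling, added to the VMM-supplied base. */
    return (ts->system_time_ns + (delta * 1000000000ULL) / ts->tsc_hz);
}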

Sun xVM Server Memory Virtualization

Memory virtualization in Sun xVM Server deals with the following
two memory management issues:
• Physical memory sharing and partitioning
• Page table access

Physical Memory Management


Sun xVM Server introduces a distinction between machine
memory and physical memory. Machine memory refers to the
entire amount of memory installed in the machine. Physical
memory is a per-VM abstraction that allows a GOS to envision its
memory as a contiguous range of physical pages starting at
physical page frame number (PFN) 0, despite the fact that the
underlying machine PFN may be sparsely allocated and in any
order (see "Page Translations Virtualization" on page 14).
The VMM maintains a table of machine-to-physical memory
mappings. The GOS performs all page allocations and
management based on physical memory. During page table
updates, a conversion from physical memory to machine memory
is performed before making the mmu_update() hypercall to
update the page tables.
Since VMs are created and deleted over time, the VMM employs
memory hotplug and ballooning schemes to optimize memory
usage in a machine. Memory hotplug allows a GOS to dynamically
add or remove physical memory from its inventory. The memory
ballooning technique allows the VMM to dynamically adjust the use
of physical memory among VMs. For example, consider a machine
that has 8 GB of memory. Two VMs, VM-A and VM-B, are initially
created with 5 GB of memory each. Memory hotplug adds 5 GB of
memory to both VMs after they are booted. The total memory
committed to both VMs is greater than the physical memory actually
available. When VM-A needs more physical memory, the memory
ballooning technique increases memory pressure in VM-B by
inflating the balloon driver. This results in memory being paged out
to free up the memory consumed by VM-B, and thus more memory
becoming available to VM-A.
The GOS requests physical memory management services from
the VMM through the memory_op(cmd, ...) hypercall (a sketch of a
ballooning call follows the list below). The operations supported by
the memory_op() hypercall include the following:

• XENMEM_increase_reservation
• XENMEM_decrease_reservation
• XENMEM_populate_physmap
• XENMEM_maximum_ram_page
• XENMEM_current_reservation
• XENMEM_maximum_reservation
• XENMEM_machphys_mfn_list
• XENMEM_add_to_physmap
• XENMEM_translate_gpfn_list
• XENMEM_memory_map
• XENMEM_machine_memory_map
• XENMEM_set_memory_map
• XENMEM_machphys_mapping
• XENMEM_exchange
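As referenced above, the sketch below illustrates how a balloon-style release of pages might be requested through the memory_op() hypercall. It is a hedged sketch only: the structure layout is simplified relative to the real Xen headers, the command number and typedefs are assumptions, and the hypercall wrapper is shown as a bare prototype. This is not the Solaris balloon driver source.

/*
 * Sketch of a balloon-style page release via XENMEM_decrease_reservation
 * (simplified field layout and assumed constants).
 */
typedef unsigned short domid_t;              /* simplified */
#define DOMID_SELF                  0x7FF0   /* caller's own domain */
#define XENMEM_decrease_reservation 1        /* assumed command number */

struct xen_memory_reservation {              /* simplified field layout */
    unsigned long *extent_start;             /* list of page frame numbers */
    unsigned long  nr_extents;               /* number of entries in the list */
    unsigned int   extent_order;             /* 0 = single 4 KB pages */
    domid_t        domid;                    /* DOMID_SELF for the caller */
};

extern long HYPERVISOR_memory_op(int cmd, void *arg);  /* hypercall wrapper */

long
balloon_give_back_pages(unsigned long *pfns, unsigned long npages)
{
    struct xen_memory_reservation res;

    res.extent_start = pfns;                 /* pages the GOS releases */
    res.nr_extents   = npages;
    res.extent_order = 0;
    res.domid        = DOMID_SELF;

    /* Returns the number of extents actually handed back to the VMM. */
    return (HYPERVISOR_memory_op(XENMEM_decrease_reservation, &res));
}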

Page Translations
"Segmented Architecture" on page 23 describes two stages of
address translation to arrive at a physical address: virtual
address (VA) to linear address (LA) translation using
segmentation, and LA to physical address (PA) translation using
paging. Solaris x64 uses a flat address space in which the VA
and LA are equivalent, which means the base address of the
segment is 0. In Solaris 10, the Global Descriptor Table (GDT)
contains the segment descriptor for the code and data segments
of both kernel and user processes, as shown in Table 5 on page
54.
Since there is only one GDT in a system, the VMM maintains the
GDT in its memory. If a GOS wishes to use something other than
the default segment mapping that the VMM GDT provides, it
must register a custom GDT with the VMM using the set_gdt()
hypercall. In the following code sample, frame_list is the physical
address of the page that contains the GDT and entries is the
number of entries in the GDT.

int
xen_set_gdt(ulong_t *frame_list, int entries)
{
	int err;

	if ((err = HYPERVISOR_set_gdt(frame_list, entries)) != 0) {
		/* error handling elided */
	}
	return (err);
}

The Solaris 32-bit thread library uses %gs to refer to the LWP
state manipulated by the internals of the thread library. The 64-bit
thread library uses %fs to refer to the LWP state as specified by
the AMD64 ABI [8]. The 64-bit kernel still uses %gs for its CPU
state (%fs is never used in the kernel). The KernelGSBase MSR
is used to store the kernel %gs content while the CPU runs a
32-bit user LWP. The privileged instruction SWAPGS is used to
restore the kernel %gs during the context switch back to kernel
context. So when the VMM performs a context switch between
guest kernel mode and guest user mode, it executes SWAPGS as
part of the context switch (see "CPU Privilege Mode" on page 45).

The GDT segment is given in Table 5 below:

% cat intel/sys/segments.h
#define GDT_NULL        0       /* null */
#define GDT_B32DATA     1       /* dboot 32 bit data descriptor */
#define GDT_B32CODE     2       /* dboot 32 bit code descriptor */
#define GDT_B16CODE     3       /* bios call 16 bit code descriptor */
#define GDT_B16DATA     4       /* bios call 16 bit data descriptor */
#define GDT_B64CODE     5       /* dboot 64 bit code descriptor */
#define GDT_BGSTMP      7       /* kmdb descriptor only used in boot */
#if defined(__amd64)
#define GDT_KCODE       6       /* kernel code seg %cs */
#define GDT_KDATA       7       /* kernel data seg %ds */
#define GDT_U32CODE     8       /* 32-bit process on 64-bit kernel %cs */
#define GDT_UDATA       9       /* user data seg %ds (32 and 64 bit) */
#define GDT_UCODE       10      /* native user code seg %cs */
#define GDT_LDT         12      /* LDT for current process */
#define GDT_KTSS        14      /* kernel tss */
#define GDT_FS          GDT_NULL /* kernel %fs segment selector */
#define GDT_GS          GDT_NULL /* kernel %gs segment selector */
#define GDT_LWPFS       55      /* lwp private %fs segment selector (32-bit) */
#define GDT_LWPGS       56      /* lwp private %gs segment selector (32-bit) */
#define GDT_BRANDMIN    57      /* first entry in GDT for brand usage */
#define GDT_BRANDMAX    61      /* last entry in GDT for brand usage */
#define NGDT            62      /* number of entries in GDT */

Table 5. The GDT segment.

Every LWP context switch requires an update to the GDT for the
new LWP. The GOS uses the update_descriptor() hypercall for the
task, as shown in update_gdt_usegd():

intel/ia32/os/desctbls.c:

void
update_gdt_usegd(uint_t sidx, user_desc_t *udp)
{
	/* dpa is the physical address of the GDT entry being updated */
	if (HYPERVISOR_update_descriptor(pa_to_ma(dpa),
	    *(uint64_t *)udp))
		panic("xen_update_gdt_usegd: HYPERVISOR_update_descriptor");
}

On an x86 system, the base physical address of the page
directory is contained in the control register %cr3. In the Solaris
OS, the value of %cr3 is stored in the process's hat structure,
proc->p_as->a_hat->hat_htable->ht_pfn, as shown in "Paging
Architecture" on page 25. The loading of %cr3 is performed by
the VMM for security and coherency reasons.
"Page Translations Virtualization" on page 14 discusses two
alternatives for updating page tables in a virtualized environment:
hypervisor calls to a read-only page table and shadow page
tables. The Sun xVM Hypervisor for x86 provides an additional
alternative, a writable page table, for the GOS to implement page
translations. In the default mode of operation, the VMM uses
both read-only page tables and writable page tables to manage
page tables. The VMM allows the GOS to use a writable page
table to update the lowest level page tables (for example, the
PTE). The higher levels, such as the PDE, PDP, and PML4, use a
read-only page table and are updated using the mmu_update()
hypercall. Updates to higher level page tables are much less
frequent than PTE page table updates.
• Read-only page table
The GOS has read-only access to page tables and uses the
mmu_update() hypercall to update page tables. As described in
the previous section "Physical Memory Management" on page 52,
the GOS has a view of pseudo-physical memory, and a translation
from physical address to machine address is performed before
the mmu_update() call.

void
set_pteval(paddr_t table, uint_t index, uint_t level, x86pte_t pteval)
{
	ma = pa_to_ma(PT_INDEX_PHYSADDR(pfn_to_pa(ht->ht_pfn), entry));
	t[0].ptr = ma | MMU_NORMAL_PT_UPDATE;
	t[0].val = new;
	if (HYPERVISOR_mmu_update(t, cnt, &count, DOMID_SELF))
		panic("HYPERVISOR_mmu_update() failed");
	...
}

• Writable page table
If a GOS attempts to write to a page table that is maintained by
the VMM, the attempt results in a #PF fault to the VMM. In the
VMM fault handling routine, the following tasks are performed
(a sketch of this fault path follows the list):
– Hold the lock for all further page table updates
– Disconnect the page that contains the updated page table by
clearing the page-present bit of the page table entry in the parent
page table
– Make the page writable by the GOS
The page is reconnected to the paging hierarchy automatically in
a number of situations, including when the guest modifies a
different page-table page, when the domain is preempted, and
whenever the guest uses the VMM's explicit page-table update
interfaces.
• Shadow page table
The VMM maintains an independent copy of the page tables,
called the shadow page table, which is pointed to by the %cr3
register. If a page fault occurs when a GOS's page table is
accessed, the VMM propagates changes made to the GOS's page
table to the shadow page table. Shadow page mode can be set in
the GOS by calling dom0_op(DOM0_SHADOW_CONTROL).
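Returning to the writable page table case above, the fault path can be outlined as follows. The helper names are invented for illustration; this is a sketch of the described behavior, not actual Xen code.

/*
 * Illustrative outline of the writable page-table fault path described
 * in the bullet above (invented helper names; not actual Xen code).
 */
struct vmm_domain;   /* opaque here */

extern void lock_page_table_updates(struct vmm_domain *d);     /* assumed */
extern void unlock_page_table_updates(struct vmm_domain *d);
extern void clear_parent_present_bit(struct vmm_domain *d, unsigned long mfn);
extern void make_page_writable(struct vmm_domain *d, unsigned long mfn);

void
vmm_handle_pt_write_fault(struct vmm_domain *d, unsigned long fault_mfn)
{
    /* 1. Serialize against other page-table updates. */
    lock_page_table_updates(d);

    /*
     * 2. Disconnect the faulting page-table page by clearing the present
     *    bit of its entry in the parent page table.
     */
    clear_parent_present_bit(d, fault_mfn);

    /* 3. Let the GOS write to the now-detached page directly. */
    make_page_writable(d, fault_mfn);

    unlock_page_table_updates(d);

    /*
     * The page is validated and reconnected later: when another
     * page-table page is touched, when the domain is preempted, or when
     * the GOS issues an explicit mmu_update()/mmuext_op() call.
     */
}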
In addition to creating a translation entry, the VMM also provides
the mmuext_op() hypercall for the GOS to flush, to invalidate, or
to lock a page translation. For example, it is necessary to lock
the translations of a process when it is being created. The
mmuext_op() is invoked by the kernel during the fork(2) system
call:

[3]> :c
kmdb: stop at xen_pin+0x3a
kmdb: target stopped at:
xen_pin+0x3a:   call   +0x208b1  <HYPERVISOR_mmuext_op>
[3]> $c
xen_pin+0x3a(ff2c, 3)
hat_alloc+0x285(fffffffec381b7e8)
as_alloc+0x99()
as_dup+0x3f(fffffffec381ba88, fffffffec3d0f8d0)
cfork+0x102(0, 1, 0)
forksys+0x25(0, 0)
sys_syscall32+0x13e()

Sun xVM Server I/O Virtualization

Sun xVM Server uses a split device driver architecture to provide
device services to DomU domains. The device services are
provided by two co-operating drivers: the front-end driver, which
runs in a DomU, and the back-end driver, which runs in Dom0
(Figure 19). Sun xVM Server doesn't export any real devices to
DomU domains. All device access made by DomU domains must
go through the back-end driver located in Dom0.
The Sun xVM Server-related driver modules for Dom0 and DomU
respectively are shown below:

Sun xVM Server related device modules on Dom0:
    xpvtod   (TOD module for Xen)
    xpvd     (virtual device nexus driver)
    xencons  (virtual console driver)
    privcmd  (privcmd driver)
    evtchn   (evtchn driver)
    xenbus   (virtual bus driver)
    xdb      (vbd backend driver)
    xnb      (xnb module)
    xsvc     (xsvc driver)
    balloon  (balloon driver)

Sun xVM Server related device modules on DomU:
    xenbus   (virtual bus driver)
    xpvtod   (TOD module for i86xpv)
    xpvd     (virtual device nexus driver)
    xencons  (virtual console driver)
    xdf      (Xen virtual block driver)
    xnf      (virtual Ethernet driver)

• The xpvtod driver provides setting and getting the time-of-day
for the VM. TOD service is provided by the RTC timer. If the
request to set the TOD comes from a DomU domain, the
request is silently ignored, as DomU doesn't have permission
to set the RTC timer.
• The nexus driver in Solaris provides bus mapping and
translation services to subordinate devices in the device tree.
The xpvd driver is the nexus driver for all virtual I/O drivers
that don't directly access physical devices. This driver's
primary functions are to provide interrupt mapping and to
invoke the initialization routines of its children devices.
• The xenbus driver provides a bus abstraction that drivers can
use to communicate between VMs. The bus is mainly used for
configuration negotiation, leaving most data transfer to be done
via an interdomain channel composed of a grant table and an
event channel. The xenbus driver also makes the configuration
data available to the XenStore shared storage repository (see
"XenStore" on page 45).
• The evtchn driver is used for receiving and demultiplexing
event-channel signals to user land.
• The balloon driver is controlled by the VMM to manage
physical memory usage by a VM (see "Physical Memory
Virtualization" on page 13 and "Physical Memory Management"
on page 52).
• The privcmd driver is used by the domain manager on Dom0
to obtain VMM services for VM management.
• The drivers xdf and xdb, the front-end and back-end block
device drivers respectively, are discussed in "Disk Driver" on
page 60. The xnf and xnb drivers, the front-end and back-end
network drivers respectively, are discussed in "Network Driver"
on page 61.
Data transfer between interdomain drivers is mainly provided by
the VMM grant table and event-channel services. Most of the
data transfer is handled in a similar fashion to DMA transfer
between host and device. Data is put in the grant table by the
sending VM, and notification is sent to the receiving VM through
the event channel. Then, the callback routine in the receiving VM
is invoked to process the data.
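A front-end transfer of this kind might look roughly like the sketch below. The wrapper functions and types are assumptions chosen to mirror the grant-table and event-channel interfaces in spirit; the sketch is not taken from the xdf or xnf sources.

/*
 * Illustrative front-end send path over a grant table and event channel
 * (assumed wrappers and simplified types; not the xdf/xnf source).
 */
typedef int grant_ref_t;                         /* simplified */
typedef unsigned short domid_t;                  /* simplified */

extern grant_ref_t gnttab_grant_foreign_access(domid_t dom,
    unsigned long gfn, int readonly);            /* assumed wrapper */
extern void notify_remote_via_evtchn(int port);  /* assumed wrapper */
extern unsigned long virt_to_gfn(void *va);      /* assumed helper */

int
frontend_send_buffer(domid_t backend_dom, void *buf, int evtchn_port)
{
    grant_ref_t ref;

    /* 1. Grant the back-end domain access to the page holding the buffer. */
    ref = gnttab_grant_foreign_access(backend_dom, virt_to_gfn(buf), 1);
    if (ref < 0)
        return (-1);

    /* 2. A request carrying 'ref' would be placed on the shared ring here. */

    /* 3. Notify the back end through the event channel. */
    notify_remote_via_evtchn(evtchn_port);

    return (0);
}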

Disk Driver
The xdb driver, the back-end driver on Dom0, is used to provide
services for block device management. This driver receives I/O
requests from DomU domains and sends them on to the native
driver. On DomU, xdf is the pseudo block driver that gets the I/O
requests from applications and sends them to the xdb driver in
Dom0. The xdf driver provides functions similar to those of the
SCSI target disk driver, sd, on an unvirtualized Solaris system.
On Solaris systems, the main interface between a file system and
a storage device is the strategy(9E) driver entry point. The
strategy(9E) entry point takes only one argument, buf(9S), which
is the basic data structure for block I/O transfer. An I/O request
made by a file system to the strategy(9E) entry point is called
PAGEIO, as the memory buffer for the I/O is allocated from the
kernel page pool. An application can also open the storage
device as a raw device and perform read(2) and write(2)
operations directly on the raw device. Such an I/O request is
called PHYSIO, physio(9F), as the memory buffer for the I/O is
allocated by the application.
In addition to the strategy(9E) driver entry point for supporting file
system and raw device access, a disk driver also supports a set
of ioctl(2) operations for disk control and management. The
dkio(7I) disk control operations define a standard set of ioctl(2)
commands. Normally, support for dkio(7I) operations requires
direct access to the device. In DomU, xdf supports most ioctl(2)
commands as defined in dkio(7I) by emulating the disk control
inside xdf. No communication is made by xdf to the back-end
driver for ioctl(2) operations.
The sequence of events for disk I/O data transfer is illustrated in
Figure 20. The disk control path, ioctl(2), is similar to the data
path.

When a disk I/O request is issued by a DomU domain, the
sequence is as follows:

1. The file system calls the xdf driver's strategy(9E) entry point
as a result of a read(2) or write(2) system call.
2. The xdf driver puts the I/O buffer, buf(9S), on the grant table.
This buffer is allocated from the DomU memory. Permission
for other-domain access is granted to this memory.
3. The xdf driver notifies Dom0 of an event through the event
channel.
4. The VMM event channel generates an interrupt to the xdb
driver in Dom0.
5. The xdb driver in Dom0 gets the DomU I/O buffer through
the grant table.
6. The xdb driver in Dom0 calls the native driver's
strategy(9E) entry point.
7. The native driver performs DMA.
8. The VMM receives the device interrupt.
9. The VMM generates an event to Dom0.
10. The xdb driver's iodone() routine is called by biodone(9F).
11. The xdb driver's iodone() routine generates an event to
DomU.
12. The xdf driver in DomU receives an interrupt to free up the
grant table and DMA resources, and calls biodone(9F) to
wake up anyone waiting for it.

When a disk I/O request is issued by the control domain Dom0,
the sequence is as follows:
13. Block I/O requests are sent directly to the native driver.

Figure 20. Sequence of events for an I/O request from a Sun
xVM Server virtual machine.

Network Driver
The Sun xVM Server network drivers use a similar approach to
the disk block drivers for handling network packets. On DomU, the
pseudo network driver xnf gets the I/O requests from the network
stack and sends them to xnb on Dom0. The back-end network
driver xnb on Dom0 forwards packets sent by xnf to the native
network driver.
The buffer management for packet receiving has more impact on
network performance than packet transmitting does. On the
packet receiving end, the data is transferred via DMA into the
native driver's receive buffer on Dom0. Then, the packet is
copied from the native driver buffer to the VMM buffer. The VMM
buffer is then mapped into the DomU kernel address space without
another copy of the data.

The sequence of operations for packet receiving is as follows:

1. Data is transferred via DMA into the native driver's (bge)
receive buffer ring.
2. The xnb driver gets a new buffer from the VMM and copies data
from the bge receive ring to the new buffer.
3. The xnb driver sends DomU an event through the event
channel.
4. The xnf driver in DomU receives an interrupt.
5. The xnf driver maps an mblk(9S) to the VMM buffer and sends
the mblk(9S) to the upper stack.

Figure 21. Sequence of events for a network request from a Sun
xVM Server virtual machine.
Chapter 6
Sun xVM Server with Hardware VM (HVM)

Intel and AMD have independently developed extensions to the
x86 architecture that provide hardware support for virtualization.
These extensions enable a VMM to provide full virtualization to a
VM, and support the running of unmodified guest operating
systems on a VM. This approach is in contrast to Sun xVM
Server PV, which requires modifications to the guest operating
system.

Virtual machines that are supported by virtualization-capable
processors are called Hardware Virtual Machines (HVMs). An
HVM environment includes the following requirements:
• A processor that allows an OS with reduced privilege to
execute sensitive instructions
• A memory management scheme for a VM to update its page
tables without accessing MMU hardware
• An I/O emulation scheme that enables a VM to use its native
driver to access devices through an I/O VM (see "I/O
Virtualization" on page 16)
• An emulated BIOS to bootstrap the OS
The x86 processor for HVM meets the first requirement, allowing
an OS with reduced privilege to execute sensitive instructions.
However, a processor alone is not enough to provide full
virtualization. The memory management, I/O emulation, and
emulated BIOS requirements necessitate enhancements in the
VMM.
This chapter begins with a discussion of HVM operations that are
applicable to both Intel and AMD virtualization extensions,
followed by Intel and AMD specific enhancements for HVM.
After the introduction of processor extensions, Sun xVM Server
enhancements in the areas of BIOS emulation, memory
management, and I/O virtualization for full virtualization are
discussed in detail.

Note - Intel's virtualization extension is called Virtual Machine
Extensions (VMX), and is documented in the IA-32 Intel
Architecture Software Developer's Manual (see [7] Volume 3B,
Chapters 19-23). AMD's extension is called Secure Virtual
Machine (SVM), and is documented in the AMD64 Architecture
Programmer's Manual Volume 2: System Programming (see [9]
Chapter 15).
HVM Operations and Data Structure
Intel's and AMD's extensions for HVM, though not compatible
with each other, are similar in basic concepts. Both create a special
mode of operation that allows system software running in a
reduced-privilege mode to execute sensitive instructions. In
addition, both implementations define state and control data
structures that enable the transition between modes of operation.
A processor with HVM support has two operating modes: privileged
mode and reduced-privilege mode. Processor behavior in
privileged mode is very much the same as that of a processor
running without the virtualization extension. Processor behavior in
reduced-privilege mode is restricted and modified to facilitate
virtualization.
Table 6 summarizes the terms used by Intel and AMD for HVM.
The extensions create new instructions, and an HVM control and
state data structure (HVMCSDS) for the VMM to manage the
transition from one mode to another. The HVMCSDS is called the
VMCS on Intel processors and the VMCB on AMD processors.
The VMM associates an HVMCSDS with each VM. For a VM with
multiple VCPUs, the VMM can associate an HVMCSDS with each
VCPU in the VM.

Table 6. Comparison of Intel and AMD processor support for
virtualization.

                                             Intel               AMD
Virtualization Operation                     VMX                 SVM
Privileged Mode                              VMX Root            Host Mode
Reduced-privilege Mode                       VMX non-Root        Guest Mode
HVM Control and State Data Structure         VMCS                VMCB
(HVMCSDS)
Entering non-privileged mode                 VMLAUNCH/VMRESUME   VMRUN
Exiting non-privileged mode                  Implicit            Implicit

After HVM is enabled, the processor operates in privileged
mode. Transitions from privileged mode to reduced-privilege mode
are called VM entries. Transitions from reduced-privilege mode to
privileged mode are called VM exits. Figure 22 illustrates entry and
exit with the HVMCSDS.
Figure 22. Virtual machine entry and exit with hardware support
on AMD and Intel processors.
VM entry is explicitly initiated by the VMM using an instruction
(VMLAUNCH and VMRESUME on Intel; VMRUN on AMD). The
processor performs checks on the processor state, VMM state,
control fields, and the VM state before loading the VM state from
the HVMCSDS to launch the VM entry. As part of VM entry, the
VMM can inject an event into the VM. The event injection process
is used to deliver virtualized external interrupts to a VM. A VM
normally doesn't get interrupts from I/O devices, because I/O
devices are not exposed to VMs (with the exception of Dom0). As
will be shown in "Sun xVM Server with HVM I/O Virtualization
(QEMU)" on page 71, a VM's I/O is handled by a special domain
(Dom0) that runs a paravirtualized OS and has direct access to
I/O devices. When an I/O operation completes, Dom0 informs the
VMM to send an interrupt through an hvm_op hypercall. The VMM
prepares the HVMCSDS for event injection, and the VM's return
instruction pointer (RIP) is pushed on the stack.
VM exit occurs implicitly in response to certain instructions and
events in a VM. The VMM governs the conditions causing a VM
exit by manipulating the control fields in the HVMCSDS. The
events that can be controlled to result in a VM exit include the
following (see [9] Chapter 20):
• External interrupts, non-maskable interrupts, and system
management interrupts
• Executing certain instructions (such as RDPMC, RDTSC, or
instructions that access the control registers)
• Exceptions
The exact conditions that cause a VM exit are defined in the
HVMCSDS control fields. Certain conditions may cause a VM exit
for one VM but not for other VMs.
VM exits behave like a fault, meaning that the instruction causing
the VM exit does not execute and no processor state is updated
by the instruction. The VM exit handler in the VMM is responsible
for taking appropriate actions for the VM exit. Unlike exceptions,
the VM exit handler is specified in the HVMCSDS host RIP field
rather than through the IDT:

static void
construct_vmcs(struct vcpu *v)
{
	/* Host CS:RIP. */
	__vmwrite(HOST_CS_SELECTOR, HYPERVISOR_CS);
	__vmwrite(HOST_RIP, (unsigned long)vmx_asm_vmexit_handler);
	...
}

Intel Virtualization Technology Specifics

Intel Virtualization Technology (Intel VT), code name Vanderpool,
comprises the Intel virtual machine extensions (VMX) to run
unmodified guest OSes. Intel VT has two implementations: VT-x
defines the extensions to the IA-32 Intel architecture, and VT-i
defines the extensions to the Intel Itanium architecture. This
paper focuses on the Intel VT-x implementation.
Table 6 on page 64 summarizes the terms used in Intel
documents [7] for HVM. Intel VT-x adds several new instructions
to the existing IA-32 instruction set to facilitate HVM operations
(see Table 7):
Table 7. Intel VT-x instructions that facilitate HVM operations.

Instruction          Description
VMLAUNCH/VMRESUME    launch/resume VM
VMCLEAR              clear VMCS
VMPTRLD/VMPTRST      load/store VMCS
VMREAD/VMWRITE       read/write VMCS
VMXON/VMXOFF         enable/disable VMX operation
VMCALL               call to the VMM

In addition to the new VMX instructions and the VMCS, VT-x
introduces a direct I/O architecture for Intel VT [28] to improve VM
security, reliability, and performance through I/O enhancements.
As will be shown in "Sun xVM Server with HVM I/O Virtualization
(QEMU)" on page 71, the current I/O virtualization implementation
for Sun xVM Server with HVM, which is based on the QEMU
project, is inefficient because all I/O transactions have to go
through Dom0, unreliable because the I/O virtualization layer on
Dom0 becomes a single point of failure, and insecure because a
VM may access another VM's DMA memory by manipulating the
values written to I/O ports.
The Intel VT direct I/O architecture specifies the following
hardware capabilities for the VMM:
• DMA remapping — This feature provides IOMMU support for
I/O address translation and caching capabilities. The IOMMU
as specified in the architecture includes a page table hierarchy
similar to the processor page tables, and an IOTLB for
frequently accessed I/O pages. Addresses used in DMA
transactions are allocated from the IOMMU address space, and
the IOMMU hardware provides address translation from the
IOMMU address space to the system memory address space.
• I/O device assignment across VMs — This feature allows a
PCI/PCI-X device that is behind a PCI-E to PCI/PCI-X bridge,
or a PCI-E device, to be assigned to a VM, regardless of how
the PCI bus is bound to a VM.

AMD Secure Virtual Machine Specifics

AMD Secure Virtual Machine (SVM), code name Pacifica, is
similar to Intel VT-x in technology and design. AMD SVM
uses the VMRUN instruction to switch between a GOS and the
VMM. The VMRUN instruction takes, as a single argument, the
physical address of a 4 KB-aligned page, the virtual machine
control block (VMCB), which describes the virtual machine (guest)
to be executed.
In addition to functions that are equivalent to those in Intel VT-x,
AMD SVM provides additional features that are not available in
Intel VT-x to improve HVM operations:
• Nested page table (NPT)
As an alternative to using a shadow page table for address
translation (see "Shadow Page Table" on page 69), AMD SVM
uses two %cr3 registers, gCR3 and nCR3, to point to guest
page tables and nested page tables respectively. Guest page
tables map guest linear addresses to guest physical
addresses. Nested page tables map guest physical addresses
to system physical addresses. When walking a guest page table,
the table walker first translates each guest page-table entry's
guest physical address into a system physical address.
Translations from guest linear addresses to system physical
addresses are then cached in the TLB for subsequent guest
accesses.
• Tagged TLB
To avoid a TLB flush during context switch (see "Paging
Architecture" on page 25), AMD SVM provides a tagged TLB
with Address Space Identifier (ASID) bits to distinguish
different address spaces. A tagged TLB allows the VMM to
use shadow page tables or multiple nested page tables for
address translation during a context switch without flushing the
TLBs.
• IOMMU
The AMD64 IOMMU enables secure virtual machine guest
operating system access to selected I/O devices by providing
address translation and access protection on DMA transfers by
peripheral devices. The IOMMU can be thought of as a
combination and generalization of two facilities included in the
AMD64 architecture: the Graphics Aperture Remapping Table
(GART) and the Device Exclusion Vector (DEV). The GART
provides address translation of I/O device accesses to a small
range of the system physical address space, and the DEV
provides a limited degree of I/O device classification and memory
protection.

Sun xVM Server with HVM Architecture Overview

Sun xVM Server with HVM supports the running of unmodified
operating systems in DomU. However, Dom0 still requires a
paravirtualized OS in order to provide full I/O virtualization
support for DomUs.
To support full virtualization, the Sun xVM Hypervisor for x86
has extended its paravirtualized architecture with the following
enhancements:
• A set of HVM functions (struct hvm_function_table) for the
processor-dependent implementation of HVM, and an hvm_op
hypercall
• A shadow page table to virtualize memory management
• Device emulation based on the QEMU project for I/O
virtualization
• An emulated BIOS, hvmloader, to bootstrap the GOS

These enhancements are discussed in more detail in the
following sections.
Processor Dependent HVM Functions

The Sun xVM Hypervisor for x86 defines a set of foundational
interfaces, struct hvm_function_table, to abstract processor
HVM specifics. The struct hvm_function_table entries are:

struct hvm_function_table {
	void (*disable)(void);
	int  (*vcpu_initialise)(struct vcpu *v);
	void (*vcpu_destroy)(struct vcpu *v);
	void (*store_cpu_guest_regs)(struct vcpu *v,
	    struct cpu_user_regs *r, unsigned long *crs);
	void (*load_cpu_guest_regs)(struct vcpu *v,
	    struct cpu_user_regs *r);
	int  (*paging_enabled)(struct vcpu *v);
	int  (*long_mode_enabled)(struct vcpu *v);
	int  (*pae_enabled)(struct vcpu *v);
	int  (*guest_x86_mode)(struct vcpu *v);
	unsigned long (*get_guest_ctrl_reg)(struct vcpu *v,
	    unsigned int num);
	unsigned long (*get_segment_base)(struct vcpu *v,
	    enum x86_segment seg);
	void (*get_segment_register)(struct vcpu *v,
	    enum x86_segment seg, struct segment_register *reg);
	void (*update_host_cr3)(struct vcpu *v);
	void (*update_guest_cr3)(struct vcpu *v);
	void (*stts)(struct vcpu *v);
	void (*set_tsc_offset)(struct vcpu *v, u64 offset);
	void (*inject_exception)(unsigned int trapnr, int errcode,
	    unsigned long cr2);
	void (*init_ap_context)(struct vcpu_guest_context *ctxt,
	    int vcpuid, int trampoline_vector);
	void (*init_hypercall_page)(struct domain *d,
	    void *hypercall_page);
};

The VMM uses hvm_function_table to provide a VCPU to a VM.
The entry points in hvm_function_table fall into two categories:
setup and runtime. The setup entry points are called when a VM
is being created. The runtime entry points are called before VM
entry or after VM exit. Since the HVMCSDS data structure
abstracts the state and controls of a VCPU, the entry points in
hvm_function_table are primarily used to manipulate that data
structure.
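Conceptually, the VMM dispatches through this table much like any other operations vector. The fragment below is a minimal sketch of such a dispatch; the hvm_funcs instance and the wrapper are shown for illustration and are not quoted from the Xen sources.

/*
 * Sketch of dispatching through hvm_function_table (illustrative; the
 * wrapper shown here is not the actual Xen wrapper).
 */
extern struct hvm_function_table hvm_funcs;   /* filled in by VMX or SVM code */

static inline void
hvm_inject_exception(unsigned int trapnr, int errcode, unsigned long cr2)
{
	/* Delegate to the processor-specific (VMX or SVM) implementation. */
	hvm_funcs.inject_exception(trapnr, errcode, cr2);
}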

Shadow Page Table

Because the GOS is unmodified, the read-only page table
scheme for page translation, as used in Sun xVM Server PV, is
no longer applicable. The read-only page table scheme requires
the OS to make hypercalls into the VMM to update page tables.
To support an unmodified OS, the shadow page table scheme
becomes the only option available. In this scheme, the shadow
page table (also known as the active page table hierarchy) is the
actual page table used by the processor.
In supporting shadow page tables [29], the Sun xVM Hypervisor
for x86 attempts to intercept all updates to a guest page table, and
updates both the VM's page table and the shadow page table
maintained by the VMM, keeping both page tables synchronized
at all times. This implementation results in two page faults: one
due to faulting the actual page, and a second one due to the page
table access.
This shadow page table scheme has a significant impact on VM
performance. An alternative, such as the nested page table (see
"AMD Secure Virtual Machine Specifics" on page 67), has been
proposed to improve memory virtualization performance.
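The synchronization described above can be outlined as follows. The helper names are invented for illustration; the real shadow-mode code is considerably more involved.

/*
 * Illustrative outline of keeping a shadow page table in sync with a
 * guest page table (invented helpers; not the Xen shadow code).
 */
struct vmm_domain;   /* opaque here */

extern void write_guest_pte(struct vmm_domain *d, unsigned long gva,
    unsigned long gpte);                              /* assumed helpers */
extern unsigned long guest_pte_to_machine_pte(struct vmm_domain *d,
    unsigned long gpte);
extern void write_shadow_pte(struct vmm_domain *d, unsigned long gva,
    unsigned long mpte);

void
shadow_sync_pte(struct vmm_domain *d, unsigned long pte_va,
    unsigned long new_guest_pte)
{
	unsigned long machine_pte;

	/* 1. Apply the update to the guest's own page table. */
	write_guest_pte(d, pte_va, new_guest_pte);

	/* 2. Convert the guest-physical frame in the PTE to a machine frame. */
	machine_pte = guest_pte_to_machine_pte(d, new_guest_pte);

	/* 3. Mirror it into the shadow table that %cr3 actually points to. */
	write_shadow_pte(d, pte_va, machine_pte);
}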

Sun xVM Server Interrupt and Exception Handling for HVM

The VMM can specify processor behavior for specific exceptions
and interrupts by setting the appropriate control fields in the
HVMCSDS. When a physical interrupt occurs, the processor
uses the settings in the HVMCSDS to determine whether the
interrupt results in a VM exit of the running VM. Upon VM exit,
the VMM gets the interrupt vector from the HVMCSDS, sets the
control fields for event injection, and launches the VM entry of
the target VM.
Interrupt handling by the VMM is a two-stage process: from the
physical device to the VMM, and from a virtual device in Dom0
to the target VM. The VMM controls the IDT for interrupts from
physical devices. Each VM registers its own IDT with the VMM.
When a physical interrupt arrives, the VMM delivers the interrupt
to a virtual device in Dom0. The virtual device then generates a
virtual interrupt to a VM.
A virtual interrupt is delivered to a VM through event injection by
setting the VM entry control field in the HVMCSDS for event
injection. The VMM uses the inject_exception entry point in
hvm_function_table (see "Processor Dependent HVM
Functions" on page 69) to set the HVMCSDS event injection
control field. The event is delivered when the VM is entered.
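As a rough sketch, delivering such a virtual interrupt amounts to programming the event-injection control field before the next VM entry. The constants below approximate the Intel VM-entry interruption-information encoding, and the vmcs_write() wrapper and field index are assumptions for illustration; this is not the Xen implementation.

/*
 * Simplified sketch of virtual interrupt injection before VM entry
 * (approximate constants and an assumed vmcs_write() wrapper).
 */
#define VM_ENTRY_INTR_INFO_FIELD  0x4016      /* VMCS encoding (illustrative) */
#define INTR_INFO_VALID           (1u << 31)  /* event injection valid bit */
#define INTR_TYPE_EXT_INTR        (0u << 8)   /* type = external interrupt */

struct vcpu;                                          /* opaque here */
extern void vmcs_write(struct vcpu *v, unsigned long field,
    unsigned long value);                             /* assumed wrapper */

void
vmm_inject_virtual_intr(struct vcpu *v, unsigned int vector)
{
	unsigned long intr_info = INTR_INFO_VALID | INTR_TYPE_EXT_INTR |
	    (vector & 0xff);

	/*
	 * Program the VM-entry interruption-information field; the guest
	 * receives the interrupt when the VMM next launches a VM entry.
	 */
	vmcs_write(v, VM_ENTRY_INTR_INFO_FIELD, intr_info);
}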

Emulated BIOS
The PC BIOS provides hardware initialization, boot services,
and runtime services to the OS. There are some restrictions on
VMX operation; for example, an OS in HVM cannot operate in
real mode. Unlike a paravirtualized OS that can change its
bring-up sequence for an environment without a BIOS, an
unmodified OS requires an emulated BIOS to perform some
real-mode operations before control is passed to the OS. Sun
xVM Server includes a BIOS emulator, hvmloader, as a surrogate
for the real BIOS.
The hvmloader BIOS emulation contains three components:
ROMBIOS, VGABIOS, and VMXAssist. Both ROMBIOS and VGABIOS are based
on the open source Bochs BIOS [23]. The VMXAssist component
is included in hvmloader to emulate real mode, which is required
by hvmloader and bootstrap loaders. The hvmloader BIOS
emulator is bootstrapped like any other 32-bit OS. After it is
loaded, hvmloader copies its three components to pre-assigned
addresses (VGABIOS at C000:0000, VMXAssist at D000:0000, and
ROMBIOS at F000:0000) and transfers control to VMXAssist.
The hvmloader BIOS emulator does not directly interface with
physical devices. It communicates with virtual devices, as
discussed in the following section, "Sun xVM Server with HVM
I/O Virtualization (QEMU)".

Sun xVM Server with HVM I/O Virtualization (QEMU)

Sun xVM Server I/O virtualization in an HVM-enabled
environment is based on the open source QEMU project [24].
QEMU is a machine emulator that uses dynamic binary
translation to run an unmodified OS and its applications in a
virtual machine. QEMU includes several components: CPU
emulators, emulated devices, generic devices, machine
descriptions, a user interface, and a debugger. The emulated
devices and generic devices in QEMU make up its device
models for I/O virtualization. Sun xVM Server uses QEMU's
device models to provide full I/O virtualization to VMs.
For example, QEMU supports several emulated network
interfaces, including ne2000, PCNet, and Realtek 8139. The
Solaris OS has the pcn driver for the PCNet NIC. The Solaris
OS running in DomU can use pcn and communicate with QEMU
on a Solaris Dom0 that has an e1000g NIC. The PCNet emulation
in QEMU converts Solaris pcn transactions to a generic virtual
network interface (such as TAP), which forwards the packet to
the driver for the native network interface (such as e1000g).
QEMU I/O emulation is illustrated in Figure 23. The principle of
operation for sending out an I/O request is outlined as follows:
1. An OS interfaces with a device through I/O ports and/or
memory-mapped device memory. The device performs
certain operations, such as DMA, in response to I/O
port/memory access by the OS. At the completion of the
operation, the device generates an interrupt to notify the OS
(Steps 1 and 2 in Figure 23).
2. The VMM monitors and intercepts the device I/O port and
memory accesses (Step 3 in Figure 23).
3. The VMM forwards the I/O port/memory data to an I/O
virtualization layer such as QEMU (Step 4 in Figure 23).
4. QEMU decodes the I/O port/memory data and performs the
necessary emulation for the I/O request (Step 5 in Figure 23).
5. QEMU delivers the emulated I/O request to the OS native
device interface (Steps 6 and 7 in Figure 23).
Figure 23. I/O emulation in Sun xVM Server using QEMU for
dynamic binary translation.
Using the AMD PCNet LANCE PCI Ethernet controller as an
example, the vendor ID and device ID of the PCNet chip are
1022 and 2000, respectively. From prtconf(1M) output, the PCI
registers exported by the device are:

% prtconf -v
pci1022,2000, instance #0
    Hardware properties:
        name='assigned-addresses' type=int items=5
            value=81008810.00000000.00001400.00000000.00000080
        name='reg' type=int items=10
            value=00008800.00000000.00000000.00000000.00000000.01008810.00000000.00000000.00000000.00000080

According to the IEEE 1275 OpenBoot Firmware standard [25],
the reg property is generated by reading the base address
registers in the configuration address space. Each entry in the
reg property consists of one 32-bit cell for register configuration,
a 64-bit address cell, and a 64-bit size cell [26]. As the
prtconf(1M) output shows, the PCNet chip has a 128-byte
(0x00000080) register in the I/O address space (the 01 in the
first byte of 0x01008810 denotes I/O address space). QEMU
emulation for PCNet simply monitors the Solaris driver's access
to the 128-byte register using the x86 IN/OUT instructions.
The QEMU virtualization for transmitting and receiving a packet
using the PCNet emulation is illustrated in Figure 23 on page
72. The sequence of events corresponding to the numbered
dots in the figure is described below:

1. Applications make an I/O request to the driver through
system calls.

2. The pcn driver writes to the DMA descriptor using the OUT
instruction. In pcn, pcn_send() calls pcn_OutCSR() to start the
DMA transaction. Then, pcn_OutCSR() calls ddi_put16() to write
a value to an I/O address. Next, ddi_put16() checks whether the
mapping (io_handle) is for I/O space or memory space. If the
mapping is for I/O space, it moves its third argument to %rax
and the port ID to %rdx, and issues the OUTW instruction to the
port referenced by %dx.

pcn_send()
{
	pcn_OutCSR(pcnp, CSR0, CSR0_INEA | CSR0_TDMD);
}

static void
pcn_OutCSR(struct pcninstance *pcnp, uintptr_t reg, ushort_t value)
{
	ddi_put16(pcnp->io_handle, REG16(pcnp->io_reg, PCN_IO_RAP), reg);
	ddi_put16(pcnp->io_handle, REG16(pcnp->io_reg, PCN_IO_RDP), value);
}

	ENTRY(ddi_put16)
	movl	ACC_ATTR(%rdi), %ecx
	cmpl	$_CONST(DDI_ACCATTR_IO_SPACE|DDI_ACCATTR_DIRECT), %ecx
	jne	8f
	movq	%rdx, %rax
	movq	%rsi, %rdx
	outw	(%dx)
	ret

The OUT instruction causes a VM exit. The CPU is set up by
the VMM to take an unconditional VM exit if the VM executes
IN/OUT/INS/OUTS, as shown by the setting of the
CPU_BASED_UNCOND_IO_EXITING bit in the CPU-based VM execution
controls (see Table 20-6 in [7]).
#define MONITOR_CPU_BASED_EXEC_CONTROLS \
	( MONITOR_CPU_BASED_EXEC_CONTROLS_SUBARCH | \
	  CPU_BASED_HLT_EXITING | \
	  CPU_BASED_INVDPG_EXITING | \
	  CPU_BASED_MWAIT_EXITING | \
	  CPU_BASED_MOV_DR_EXITING | \
	  CPU_BASED_UNCOND_IO_EXITING | \
	  CPU_BASED_USE_TSC_OFFSETING )

void
vmx_init_vmcs_config(void)
{
	_vmx_vmexit_control = adjust_vmx_controls(
	    MONITOR_VM_EXIT_CONTROLS, MSR_IA32_VMX_EXIT_CTLS_MSR);
}

The VM exit handler is set in the host RIP field of the HVMCSDS
(see "HVM Operations and Data Structure" on page 64). The
VM exit handler examines the exit reason and calls the I/O
instruction function, vmx_io_instruction(), to handle the VM
exit.
asmlinkage void
vmx_vmexit_handler(struct cpu_user_regs *regs)
{
	...
	case EXIT_REASON_IO_INSTRUCTION:
		exit_qualification = __vmread(EXIT_QUALIFICATION);
		inst_len = __get_instruction_length();
		vmx_io_instruction(exit_qualification, inst_len);
		break;

3. The VM exit handler for I/O instructions in the VMM examines
the exit qualification and gets the OUT information from the
HVMCSDS. This information includes:
– Size of the access (1 byte, 2 bytes, or 4 bytes)
– Direction of the access (IN or OUT)
– Port number
– Whether a double fault exception occurred
– Size and address of the string buffer, if this is an I/O string
operation
The VM exit handler then fills in the struct ioreq fields and sends
the I/O request to its client by calling send_pio_req().

static void
vmx_io_instruction(unsigned long exit_qualification,
    unsigned long inst_len)
{
	...
	send_pio_req(port, count, size, addr, dir, df, 1);
}

4. The client of the I/O request (qemu-dm) is blocked on the
event channel device node created by the evtchn module
(see "Event Channels" on page 43). In the VMM,
hvm_send_assist_req() is called by send_pio_req() to set
the event pending bit of the event channel and wake up the
qemu-dm client waiting on the event.

void
hvm_send_assist_req(struct vcpu *v)
{
	...
	p->state = STATE_IOREQ_READY;
	notify_via_xen_event_channel(v->arch.hvm_vcpu.xen_port);
}

5. The QEMU emulator, qemu-dm, is a user process that
contains the ioemu module for I/O emulation. The ioemu
module waits on one end of the event channel for I/O
requests from the VMM.

int main_loop(void)
{
	qemu_set_fd_handler(evtchn_fd, cpu_handle_ioreq, NULL, env);
	while (1) {
		main_loop_wait(10);
	}

When an I/O request arrives, ioemu is unblocked and
cpu_handle_ioreq() is called to get the ioreq structure from
the event channel. Based on the information in ioreq, the
appropriate pcnet functions are invoked to handle the I/O
request.
6. After pcnet decodes the ioreq structure, ioemu sends the
packet to the TAP network interface. The TAP network
interface [27] is a virtual Ethernet network device that
provides two interfaces to applications:
– Character device — /dev/tapX
– Virtual network interface — tapX
where X is the instance number of the TAP interface.
Applications can write Ethernet frames to the /dev/tapX
character interface, and the kernel receives these frames from
the tapX network interface. In the same manner, a packet that
the kernel writes to the tapX network interface can be read by
an application from the /dev/tapX character device node.
To continue the packet flow, pcnet_transmit() is called to send
out the ioreq. In pcnet_transmit(), qemu_send_packet() invokes
tap_receive() to write the packet to the TAP character
interface, which forwards the packet to the native driver
interface.

static void pcnet_transmit(PCNetState *s)
{
	...
	qemu_send_packet(s->vc, s->buffer, s->xmit_pos);
	...
}

static void tap_receive(void *opaque, const uint8_t *buf, int size)
{
	...
	for (;;) {
		ret = write(s->fd, buf, size);
		...
	}
}

7. The Dom0 native driver sends the packet to the network
hardware. This marks the end of transmitting a packet from
DomU to the real network.

8. Dom0 receives an interrupt indicating that a packet intended
for DomU has arrived. This marks the beginning of receiving a
packet targeted to a DomU from the real network. The native
network driver forwards the packet through a bridge to the
TAP network interface, tapX.
9. Next, tap_send() is invoked when data is written to the TAP
interface. The packet is read from the /dev/tapX character
interface, and qemu_send_packet() calls pcnet_receive() to
send out the buffer.

static void tap_send(void *opaque)
{
	size = read(s->fd, buf, sizeof(buf));
	if (size > 0) {
		qemu_send_packet(s->vc, buf, size);
	}
}

10. The pcnet_receive() function in ioemu copies the data read
from the TAP character device to the VMM memory. The data
can be either an I/O port value from the IN instruction or a
network packet. At the end of the data transfer, pcnet informs
the VMM to generate an interrupt.

static void pcnet_receive(void *opaque, const uint8_t *buf, int size)
{
	...
	cpu_physical_memory_write(rbadr, src, count);
	pcnet_update_irq(s);
	...
}

11. The ioemu module makes an hvm_op (HVMOP_set_pci_intx_level)
hypercall to the VMM to generate an interrupt to the target
domain.

int xc_hvm_set_pci_intx_level(int xc_handle, domid_t dom,
    uint8_t domain, uint8_t bus, uint8_t device,
    uint8_t intx, unsigned int level)
{
	...
	hypercall.op = __HYPERVISOR_hvm_op;
	hypercall.arg[0] = HVMOP_set_pci_intx_level;
	hypercall.arg[1] = (unsigned long)&arg;
	...
	rc = do_xen_hypercall(xc_handle, &hypercall);
	...
}

The VMM sets the guest HVMCSDS area to inject an event with
the next VM entry. The target VM gets an interrupt when the
VMM launches a VM entry to the target domain (see "Sun xVM
Server Interrupt and Exception Handling for HVM" on page 70).

Sun xVM Server with HVM I/O Virtualization (PV Drivers)

As shown in the previous section, the QEMU I/O emulation used
in Sun xVM Server with HVM suffers significant performance
overhead. An I/O packet has to go through several context
switches, including a switch to the user level in Dom0, to reach
its destination. One alternative for improving the performance is
to use an I/O virtualization model similar to the Sun xVM Server
PV architecture (see "Sun xVM Server I/O Virtualization" on
page 56). Paravirtualized drivers (PV drivers) such as xdf and
xnf are included in the OS distribution. When a VM is created,
Dom0 exports virtual I/O devices (for example, xnf and xdf)
instead of emulated I/O devices (for example, pcn and mpt) to
the GOS. PV drivers are subsequently bound to these virtual
devices and used for handling I/O. The I/O transactions follow
the same path as described in Chapter 5, "Sun xVM Server". PV
drivers will be provided for Solaris 10 and Windows so they can
run unmodified in Sun xVM Server with better I/O performance.

Chapter 7
Logical Domains

The Logical Domains (LDoms) technology from Sun
Microsystems allows a system's resources, such as memory,
CPUs, and I/O devices, to be allocated into logical groupings.
Multiple isolated systems, each with its own operating system,
resources, and identity within a single computer system, can then
be created using these partitioned resources.
Unlike Sun xVM Server, LDoms technology partitions a
processor into multiple strands, and assigns each strand its own
hardware resources. (See "Terms and Definitions" on page 113.)
Each virtual machine, called a domain in LDoms terminology, is
associated with one or more dedicated strands. A thin layer of
firmware, called the hypervisor, is interposed between the
hardware and the operating system (Figure 24). The hypervisor
abstracts the hardware resources and provides an interface to
the operating system software.

Figure 24. The hypervisor, a thin layer of firmware, abstracts
hardware resources and presents them to the OS.

The LDoms implementation includes four components:
• UltraSPARC T1/T2 processor
• UltraSPARC hypervisor
• Logical Domain Manager (LDM)
• Paravirtualized Solaris OS

Note - The terms strand, hardware thread, logical processor,
virtual CPU, and virtual processor are used by various documents
to refer to the same concept. For consistency, the term strand is
used in this chapter.
Note - In Sun documents, the term hypervisor is used to refer to
the hyperprivileged software that performs the functions of the
VMM and the term domain is used to refer to a VM. To
accommodate Sun's terminologies, hypervisor and domain
(instead of VMM and VM) are used in this chapter.

This chapter assumes a basic understanding of the UltraSPARC
T1/T2 processor, which plays a major role in the implementation
of LDoms. (See Chapter 4, "SPARC Processor Architecture" on
page 29.) The remainder of the chapter is organized as follows:
• "Logical Domains (LDoms) Architecture Overview" on page 80
provides an overview of the LDoms architecture and the other
three components of LDoms: paravirtualized Solaris, the
UltraSPARC hypervisor, and the Logical Domain manager.
• "CPU Virtualization in LDoms" on page 84 discusses CPU
virtualization including trap and interrupt handling.
• "Memory Virtualization in LDoms" on page 88 discusses
memory virtualization including physical memory allocation and
page translations.
• "I/O Virtualization in LDoms" on page 91 discusses I/O
virtualization and describes the operation of the disk block and
network drivers.

Logical Domains (LDoms) Architecture Overview


Logical Domains (LDoms) technology supports CPU partitioning
and enables multiple OS instances to run on a single
UltraSPARC T1/T2 system. The UltraSPARC T1/T2 architecture
has been enhanced from the original UltraSPARC specification
to incorporate hypervisor technology that supports hardware
level virtualization.
The hypervisor is delivered with the UltraSPARC T1/T2 platform,
not with the OS. During a boot, the OpenBoot PROM (OBP)
loads the Solaris OS directly from the disk. After the boot, a
logical domain manager is enabled and initializes the first domain
as the control domain. From a control domain, the administrator
can create, shutdown, configure, and destroy other domains. The
control domain can also be configured as an I/O domain, which
has direct access to I/O devices and provides services for other
domains to access I/O devices (Figure 25).
Figure 25. A control domain, Solaris OS, and Linux guest
domains running in logical domains on an UltraSPARC T1/T2
processor-powered server.

The UltraSPARC T1/T2 processor architecture is described
earlier in Chapter 4, "SPARC Processor Architecture" on page
29. In this section, the other three components of the LDoms
technology — paravirtualized Solaris OS, hypervisor, and logical
domain manager — are discussed.

Paravirtualized Solaris OS
The Solaris kernel implementation for the UltraSPARC T1/T2
hardware class (uname -m) is referred to as the Solaris sun4v
architecture. In this implementation, the Solaris OS is
paravirtualized to replace operations that require hyperprivileged
mode with hypervisor calls. The Solaris OS communicates with
the hypervisor through a set of hypervisor APIs, and uses these
APIs to request that the hypervisor perform hyperprivileged
operations.
Sun4v support for LDoms is a combination of partitioning the
UltraSPARC T1/T2 processor into strands and virtualization of
memory and I/O services. Unlike Sun xVM Server and VMware,
an LDoms domain does not share strands with other domains.
Each domain has one or more strands assigned to it, and each
strand has its own hardware resources so that it can execute
instructions independently of other strands. The virtualization of
CPU functions to support CMT is implemented at the processor
rather than at the software level (that is, there is no software
scheduler). A Solaris guest OS can directly access strand-
specific registers in a domain and can, for example, perform
operations such as setting an OS trap table to the trap base
address register (TBA).
The Solaris sun4v architecture assumes that the platform
includes the hypervisor as part of its firmware. The hypervisor
runs in the hyperprivileged mode, and the Solaris
OS runs in the privileged mode of the processor. The Solaris
kernel uses hypercalls to request that the hypervisor perform
hyperprivileged functions of the processor.
Like Intel's VT and AMD's Pacifica architectures, the sun4v
architecture leverages CPU support (hyperprivileged mode) for
the implementation of the hypervisor. Unlike Intel's VT and
AMD's Pacifica architectures, which provide a special mode of
execution for the hypervisor and thus make the hypervisor
transparent to the GOS, the support for the hypervisor in
UltraSPARC T1/T2 is not transparent to the GOS. The
UltraSPARC T1/T2 processors provide a set of hypervisor APIs
for the GOS to delegate hyperprivileged operations to the
hypervisor.

Hypervisor Services
The hypervisor layer is a component of the UltraSPARC T1/T2
system's firmware. An UltraSPARC system's firmware consists
of Open Boot PROM (OBP), Advanced Lights Out Management
(ALOM), Power-on Self Test (POST), and the hypervisor.
The hypervisor leverages the UltraSPARC T1/T2 hyperprivileged
extensions to provide a protection mechanism for running
multiple guest domains on the system. The hypervisor provides a
number of services to its overlying domains. These services
include hypervisor APIs, which are the interfaces for a GOS to
request hypervisor services, and Logical Domain Channel (LDC)
services, which are used by virtual device drivers for
inter-domain communications.

Hypervisor API
The Sun4v hypervisor API [11] uses the Tcc instruction to cause
the GOS to trap into hyperprivileged mode, in a similar fashion to
how OS system calls are implemented. The function of the
hypervisor API is equivalent to system calls in the OS that
enable user applications to request services from the OS. The
Sun4v hypervisor API allows a GOS to perform the following
actions:
• Request services from the hypervisor
• Get and set CPU information through the hypervisor
The UltraSPARC Virtual Machine Specification [11] lists the
complete set of services and APIs for:
• API versioning — request and check for a version of the
hypervisor APIs with which it may be compatible
• Domain services — enable a control domain to request
information about or to affect other domains
• CPU services — control and configure a strand; includes
operations such as start/stop/suspend a strand, set/get the trap
base address register, and configure the interrupt queue
• MMU services — perform MMU related operations such as
configure the TSB, map/demap the TLB, and configure the
fault status register
• Memory services — zero and flush data from cache to
memory
• Interrupt services — get/set interrupt enabled, target strand,
and state of the interrupt
• Time-of-Day services — get/set time-of-day
• Console services — get/put a character to the console
• Channel Services — provide communication channels
between domains (see "Logical Domain Channel (LDC)
Services" on page 83)
The following two examples of hv_mem_sync() and
hv_api_set_version() show the implementation for hypervisor
calls:

% mdb -k
> hv_mem_sync,6/ai
hv_mem_sync:
hv_mem_sync:              mov     %o2, %o4
hv_mem_sync+4:            mov     0x32, %o5
hv_mem_sync+8:            ta      %icc, %g0 + 0
hv_mem_sync+0xc:          retl
hv_mem_sync+0x10:         stx     %o1, [%o4]
> hv_api_set_version,6/ai
hv_api_set_version:
hv_api_set_version:       mov     %o3, %o4
hv_api_set_version+4:     clr     %o5
hv_api_set_version+8:     ta      %icc, %g0 + 0x7f
hv_api_set_version+0xc:   retl
hv_api_set_version+0x10:  stx     %o1, [%o4]

Trap types in the range 0x180-0x1FF are used to transition from
privileged mode to hyperprivileged mode. In the two preceding
examples, a TT value of 0x180 (offset of 0) is used for
hv_mem_sync(), and a TT value of 0x1FF (offset of 0x7f) is used
for hv_api_set_version().
Hypervisor calls are normally invoked during the startup of the
kernel to set up strands for the domain. Only a few hypercall
functions are called during the runtime of the kernel, including:
hv_tod_set(), hv_tod_get(), hv_set_ctx0(),
hv_mmu_map_perm_addr(), hv_mmu_unmap_perm_addr(),
hv_set_ctxnon0(), and hv_mmu_set_stat_area().

Logical Domain Channel (LDC) Services


The hypervisor provides communication channels between
domains. These channels are accessed within a domain as an
endpoint. Two endpoints are connected together forming a bi-
directional point-to-point LDC.
All traffic sent to a local endpoint arrives at the corresponding
endpoint at the other end of the channel in the form of short
fixed-length (64-byte) message packets. Each endpoint is
associated with one receive queue and one transmit queue.
Messages from a channel are deposited by the hypervisor at the
tail of a queue, and the receiving
domain indicates receipt by moving the corresponding head
pointer for the queue. To send a packet down an LDC, a domain
inserts the packet into its transmit queue, and then uses a
hypervisor API call to update the tail pointer for the transmit
queue.
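The following C sketch illustrates the producer side of this protocol
under simplified assumptions; the ldc_txq layout and the
hv_ldc_tx_set_qtail() stand-in are illustrative names, not the actual
Solaris LDC implementation.

    #include <stdint.h>
    #include <string.h>

    #define LDC_PKT_SIZE   64      /* fixed-length LDC message packets */
    #define LDC_QUEUE_LEN  128     /* entries in the transmit queue */

    /* Hypothetical view of a transmit queue shared with the hypervisor. */
    struct ldc_txq {
        uint8_t  pkt[LDC_QUEUE_LEN][LDC_PKT_SIZE];
        uint64_t head;             /* consumed by the hypervisor */
        uint64_t tail;             /* advanced by the sending domain */
    };

    /* Stand-in for the hypervisor call that publishes the new tail pointer. */
    extern int hv_ldc_tx_set_qtail(uint64_t channel, uint64_t new_tail);

    /* Queue one packet on an LDC endpoint; returns 0 on success, -1 if full. */
    int
    ldc_send_pkt(uint64_t channel, struct ldc_txq *q, const void *msg, size_t len)
    {
        uint64_t next = (q->tail + 1) % LDC_QUEUE_LEN;

        if (len > LDC_PKT_SIZE || next == q->head)
            return (-1);           /* oversized message or queue full */

        memcpy(q->pkt[q->tail], msg, len);  /* place packet at the current tail */
        q->tail = next;

        /* Tell the hypervisor the tail moved so it can deliver the packet. */
        return (hv_ldc_tx_set_qtail(channel, q->tail));
    }

The receiving domain performs the mirror-image operation, consuming
packets at the head of its receive queue and advancing the head pointer
to acknowledge them.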
In the Solaris OS, the hypervisor LDC service is used as a
simulated I/O bus interface, enabling a virtual device to
communicate with a real device on the I/O domain. All virtual
devices that communicate with the I/O domain for device access
are a leaf nodes on the LDC bus. For example, a virtual disk
driver, vdc, uses the LDC service to communicate with the virtual
disk driver, vds, on the other side of the channel. Both vdc and
vds are leaf nodes on the channel bus (see "I/O Virtualization in
LDoms" on page 91).

Logical Domain Manager


The Logical Domain Manager (LDM) provides the following
functionality:

• Provides a control point for managing a domain's


configuration and operation
• Binds a domain to the resources of the underlying local
physical machine
• Manages the integrity of the configuration in a persistent
and consistent manner
The LDM is a software module that runs on a control domain
(see "Logical Domains (LDoms) Architecture Overview" on page
80). The LDM uses the LDC to communicate with the hypervisor
when binding a domain to hardware resources, and stores the
configuration in the service processor. The LDM is only required
when a domain reconfiguration operation is needed, such as
during the creation, shutdown, or deletion of a domain.
The LDM maintains two persistent databases: one for the
currently defined domains, and one for active domains. The
active domain database is stored with the service processor, and
the currently defined database is held in LDM's own persistent
storage. The command line interface to the LDM is ldm(1M).

CPU Virtualization in LDoms


The hypervisor exposes strands to a domain. Each strand has its
own registers and trap queues; shares L1 caches, the TLB, and
the instruction pipeline with other strands in the same core; and
shares the L2 cache with other strands in the socket. Strands on
the UltraSPARC T1 processor share the FPU with other strands,
while each core in the UltraSPARC T2 processor has its own
floating-point and graphics unit (FGU). Each domain has its own
strands that are not shared with other domains. The software
threads (also known as kernel threads) are executed on the
strands that are bound to that domain. Unlike the VMM in Sun
xVM Server or VMware, there is no software scheduler in the
hypervisor.
CPU virtualization in LDoms, from a software perspective,
involves trap and interrupt handling and timer services.

Trap and Interrupt Handling


Each strand has two trap tables for handling traps: the
hyperprivileged trap table and the privileged trap table. The trap
table used for handling a trap depends on the following criteria:
• Trap type (TT)
• Trap level at the time when the trap is taken
• Privilege mode at the time when the trap is taken
The UltraSPARC Architecture 2005 specification (see Table 12-4
in [2]) lists the mode in which a trap is delivered based on a given
TT and current privileged mode.
The hyperprivileged trap table and the privileged trap table are
installed, respectively, by the hypervisor and the GOS. For
example, the Solaris OS installs the trap table for sun4v in
mach_cpu_startup():
ENTRY_NP(mach_cpu_startup)
        set     trap_table, %g1
        wrpr    %g1, %tba               ! write trap_table to %tba

And the hypervisor installs its trap table in start_master():

ENTRY_NP(start_master)
        setx    htraptable, %g3, %g1
        wrhpr   %g1, %htba

Each strand has two interrupt queues: cpu_mondo and
dev_mondo. The cpu_mondo queue is used for CPU-to-CPU
cross-call interrupts; the dev_mondo queue is used for I/O-to-
CPU interrupts. The Solaris kernel allocates memory for each
queue, and registers these queues with the hv_cpu_qconf()
hypercall. When the queue is nonempty (that is, the queue
header is not equal to the queue tail), a trap is generated to the
target CPU. The data of the interrupt received (mondo data) is
stored in the queue.
The Solaris kernel function for registering the interrupt queues is
cpu_intrq_register() as shown below:

void
cpu_intrq_register(struct cpu *cpu)
{
        struct machcpu *mcpup = &cpu->cpu_m;
        uint64_t ret;

        ret = hv_cpu_qconf(INTR_CPU_Q, mcpup->cpu_q_base_pa,
            cpu_q_entries);

        ret = hv_cpu_qconf(INTR_DEV_Q, mcpup->dev_q_base_pa,
            dev_q_entries);

The I/O and CPU cross-call interrupt delivery mechanism is as follows:
1. An I/O device asserts its interrupt line to generate an interrupt
to the processor. The I/O bridge chip receives the interrupt
request and prepares a mondo packet to be sent to the target
processor whose CPU number is stored in the bridge chip
register by the OS. The mondo packet contains an interrupt
number that uniquely identifies the source of the interrupt.
2. The hypervisor receives an interrupt request from the
hardware through the interrupt vector trap (0x60). For
example, the trap table for the T2000 firmware has the
following entries:

ENTRY(htraptable)
        ...
        TRAP(tt0_05e, HSTICK_INTR)      /* HV: hstick match */
        TRAP(tt0_05f, NOT)              /* reserved */
        TRAP(tt0_060, VECINTR)          /* interrupt vector */

The CPU number and interrupt number are also delivered along
with the interrupt trap. The interrupt vector trap handler,
VECINTR, uses the interrupt number to determine the source of
the interrupt. If the interrupt is coming from I/O, the trap
handler uses the CPU number to find the dev_mondo queue
associated with the CPU and adds the interrupt to the tail of
the dev_mondo queue. When the head of the queue is not equal to
the tail, a trap (0x7C for CPU cross calls and 0x7D for I/O) is
generated to the CPU that owns the queue.
3. Traps 0x7C and 0x7D are taken via the GOS trap table. For
I/O interrupts, dev_mondo() is the trap handler for 0x7D.
# mdb -k
> trap_table+0x20*0x7c/ai
0x1000f80:
0x1000f80:      ba,a,pt   %xcc, +0xc784   <cpu_mondo>
> trap_table+0x20*0x7d/ai
0x1000fa0:
0x1000fa0:      ba,a,pt   %xcc, +0xc800   <dev_mondo>
>

The dev_mondo() handler takes the interrupt out of the queue by
incrementing the queue header. It also finds the interrupt
vector data, struct intr_vec, from the system's interrupt vector
table. The struct intr_vec data contains the priority interrupt
level (PIL) and the driver's interrupt service routine (ISR) for
the interrupt. The dev_mondo() handler then sets the SOFTINT
register with the PIL of the interrupt.
4. Setting the SOFTINT register causes an interrupt_level_n trap,
0x41-0x4f, to be generated where n is the PIL of the interrupt.
The GOS's trap handler for the interrupt_level_1 interrupt, for
example, is shown below:

> trap_table+0x20*0x41,2/ai
tt_pil1:
tt_pil1:        ba,pt     %xcc, +0xc33c   <pil_interrupt>
0x1000824:      mov       1, %g4
>

If the PIL of the interrupt is below the clock PIL, an interrupt
thread is allocated to handle the interrupt. Otherwise, the
high-level interrupt is handled by the currently executing thread.

In summary, the interrupt delivery mechanism is a two-stage
process. First, an interrupt is delivered to the hypervisor as the
interrupt vector trap, 0x60. Then the interrupt is added to an
interrupt queue, which causes another trap to the GOS.
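The guest-side half of this process can be pictured with the following
simplified C sketch; the mondo_queue and intr_vec_tab structures and
the post_softint() helper are illustrative stand-ins rather than the
actual Solaris data structures.

    #include <stdint.h>

    #define DEV_MONDO_ENTRIES 64

    /* Simplified mondo queue: the hypervisor appends at the tail and the
     * guest consumes at the head, as described for the dev_mondo queue. */
    struct mondo_queue {
        uint64_t data[DEV_MONDO_ENTRIES];  /* interrupt number per entry */
        uint64_t head;
        uint64_t tail;
    };

    /* Simplified per-interrupt state, loosely modeled on struct intr_vec. */
    struct intr_vec {
        int   pil;                 /* priority interrupt level */
        void (*isr)(void *arg);    /* driver ISR, run later by the PIL handler */
        void *arg;
    };

    extern struct intr_vec intr_vec_tab[]; /* indexed by interrupt number */
    extern void post_softint(int pil);     /* stand-in for writing SOFTINT */

    /* Drain the dev_mondo queue, as the trap-0x7D handler would. */
    void
    dev_mondo_drain(struct mondo_queue *q)
    {
        while (q->head != q->tail) {
            uint64_t inum = q->data[q->head % DEV_MONDO_ENTRIES];

            q->head++;                            /* acknowledge receipt */
            post_softint(intr_vec_tab[inum].pil); /* raise interrupt_level_n trap */
        }
    }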

LDoms Timer Service


The system time is provided by the programmable interrupt
generator. Clock interrupts are sent directly from the hardware to
the domain, without being queued in the hypervisor. Therefore,
unlike Sun xVM Server domains, LDoms exhibit no "lost ticks"
issues.
The time of day (TOD) is maintained by the hypervisor on a per-
domain basis. The Solaris OS uses the tod_get() and tod_set()
hypercalls to get and set the TOD, respectively. Setting the TOD
in one domain does not affect any other domain.
The high resolution timer is provided by the rdtick instruction,
which reads the counter field of the TICK register. The rdtick
instruction is a privileged instruction that can be executed by the
Solaris OS without the hypervisor involvement.
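For illustration, the following GCC-style inline assembly sketch
(assuming a SPARC target) shows how the free-running tick counter can be
read with a single rd %tick instruction and converted to nanoseconds; no
hypervisor call is involved.

    #include <stdint.h>

    /*
     * Sketch only: read the counter field of the TICK register with a
     * single rd %tick instruction (no hypervisor trap).  Assumes a SPARC
     * target, a GCC-compatible compiler, and privileged execution as
     * noted above.
     */
    static inline uint64_t
    read_tick(void)
    {
        uint64_t ticks;

        __asm__ __volatile__("rd %%tick, %0" : "=r" (ticks));
        return (ticks);
    }

    /* Convert ticks to nanoseconds for a given strand frequency in Hz. */
    static inline uint64_t
    ticks_to_ns(uint64_t ticks, uint64_t freq_hz)
    {
        return ((ticks / freq_hz) * 1000000000ULL +
            (ticks % freq_hz) * 1000000000ULL / freq_hz);
    }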

Memory Virtualization in LDoms


Similar to Sun xVM Server, memory virtualization in LDoms deals
with two memory management issues:

• Physical memory sharing and partitioning


• Page translations

Physical Memory Allocation


The UltraSPARC T1/T2 processors support three types of memory addressing:

• Virtual Address (VA) — utilized by user programs


• Real Address (RA) — describes the underlying memory
allocated to a GOS
• Physical Address (PA) — appears in the system bus for
accessing physical memory
Multiple virtual address spaces within the same real address
space are distinguished by a context identifier (context ID). The
context ID is included as a field in the TTE for VA to PA
translation (see "Memory Management Unit" on page 32). The
GOS can create multiple virtual address spaces, using the
primary and secondary context registers to associate a context
ID with every virtual address. The GOS manages the allocation
of context IDs among the processes within the domain.
Multiple real address spaces within the same physical address
space are distinguished by a partition identifier (partition ID). The
hypervisor can create multiple real address spaces, using the
partition register to associate a partition ID with every real
address. The hypervisor manages the allocation of partition IDs.
Because of the new addressing scheme, a number of new ASIs are
defined for RA and PA addressing, as described in Table 8.

Table 8. New ASIs defined for real and physical addresses.

ASI #   ASI Name              Description
0x14    ASI_REAL              Real Address (memory)
0x15    ASI_REAL_IO           Noncacheable Real Address
0x1C    ASI_REAL_LITTLE       Real Address Little-endian
0x1D    ASI_REAL_IO_LITTLE    Noncacheable Real Address Little-endian
0x21    ASI_MMU_CONTEXTID     MMU context register
0x52    ASI_MMU_REAL          MMU Register
The partition ID register is defined in ASI 0x58, VA 0x80 [2] with
an 8-bit field for the partition ID.

The full representation of each type of address is as follows:

    real_address     = context_ID :: virtual_address
    physical_address = partition_ID :: real_address

or:

    physical_address = partition_ID :: context_ID :: virtual_address

Figure 26 illustrates the type of addressing used in each mode of
operation: virtual addressing in unprivileged mode, real addressing in
privileged mode, and physical addressing (64-bit VA + context ID +
partition ID) in hyperprivileged mode.

Figure 26. Different types of addressing are used in different modes
of operation.

Page translations in the UltraSPARC architecture are managed by
software through several different types of traps (see "Memory
Management Unit" on page 32). Depending on the trap type, traps may be
handled by the hypervisor or the GOS. Table 9 summarizes the
MMU-related trap types (see also Table 12-4 in [2]).

Table 9. MMU-related trap types in the UltraSPARC T1/T2 processor.

Trap name                           Trap Cause                     TT         Handled by
fast_instruction_access_MMU_miss    iTLB Miss                      0x64       Hypervisor
fast_data_access_MMU_miss           dTLB Miss                      0x68       Hypervisor
fast_data_access_protection         Protection Violation           0x6c       Hypervisor
instruction_access_exception        Several                        0x08       Hypervisor
data_access_exception               Several                        0x30       Hypervisor
instruction_access_MMU_miss         iTSB Miss                      0x09       GOS
data_access_MMU_miss                dTSB Miss                      0x31       GOS
*mem_address_not_aligned            Misaligned memory operation    0x34-0x39  Hypervisor

In the hypervisor trap table, htraptable, the instructions for
handling a dTLB miss, trap 0x68, are:
% mdb ./ontario/release/q
> htraptable+0x20*0x68,8/ai
htraptable+0xd00:
htraptable+0xd00:       rdpr    %priv_16, %g1
htraptable+0xd04:       cmp     %g1, 3
htraptable+0xd08:       bgu,pn  %xcc, +0x73b8   <watchdog_guest>
htraptable+0xd0c:       mov     0x28, %g1
htraptable+0xd10:       ba,pt   %xcc, +0x97a0   <dmmu_miss>
htraptable+0xd14:       ldxa    [%g1] 0x4f, %g1
htraptable+0xd18:       illtrap 0
htraptable+0xd1c:       illtrap 0

The trap table transfers control to dmmu_miss() to load the page
translation from the TSB. If the translation doesn't exist in
the TSB, dmmu_miss() calls dtsb_miss(). The handler dtsb_miss()
sets the TT register to trap type 0x31 (data_access_MMU_miss),
changes the PSTATE register to the privileged mode, and transfers
control to the GOS's trap handler for trap 0x31. The portion of
dtsb_miss() that performs this functionality is shown in the
following example:

> dtsb_miss,80/ai
        wrpr    %g0, 0x31, %tt          ! write 0x31 to %tt
        rdpr    %pstate, %g3            ! read %pstate to %g3
        or      %g3, 4, %g3
        wrpr    %g3, %pstate            ! write %g3 to %pstate
        rdpr    %tba, %g3               ! get privileged mode's trap
                                        ! table base address
        add     %g3, 0x620, %g3         ! set %g3 to the address of
                                        ! trap type 0x31
        jmp     %g3                     ! jump to 0x31 trap handler

In the Solaris OS, the trap handler for trap type 0x31 calls the
handler sfmmu_slow_dmmu_miss() to load the page translation
from hme_blk. If no entry is found in hme_blk for the virtual
address, sfmmu_slow_dmmu_miss() calls sfmmu_pagefault() to
transfer control to Solaris's trap() handler.
% mdb -k
> trap_table+0x20*0x31,2/ai
scb+0x620:
scb+0x620:      ba,a      +0xc1b4         <sfmmu_slow_dmmu_miss>
scb+0x624:      illtrap   0
> sfmmu_pagefault,80/ai
sfmmu_pagefault+0x78:   sethi   %hi(0x101d400), %g1
sfmmu_pagefault+0x7c:   or      %g1, 0x364, %g1
sfmmu_pagefault+0x80:   ba,pt   %xcc, -0x13f0   <sys_trap>

I/O Virtualization in LDoms


LDoms provide the ability to partition system PCI buses so that
more than one domain can directly access devices. (Currently,
access by up to two domains is supported.) A domain that has
direct access to devices is called an I/O domain or service
domain. A domain that doesn't have direct access to devices
uses the virtual I/O (VIO) framework and goes through an I/O
domain for access.
The device tree of a domain is determined by the OBP of that
domain. The OBP device tree of a typical non-I/O domain is
shown in the following example:

{0} ok show-devs
/cpu@3
/cpu@2
/cpu@1
/cpu@0
/virtual-devices@100
/virtual-memory
/memory@m0,4000000
/aliases
/options
/openprom
/chosen
/packages
/virtual-devices@100/channel-devices@200
/virtual-devices@100/console@1
/virtual-devices@100/ncp@4
/virtual-devices@100/channel-devices@200/disk@0
/virtual-devices@100/channel-devices@200/network@0
/openprom/client-services
/packages/obp-tftp
/packages/kbd-translator
/packages/SUNW,asr
/packages/dropins
/packages/terminal-emulator
/packages/disk-label
/packages/deblocker
/packages/SUNW,builtin-drivers
During the system boot, the OBP device tree information is
passed to the Solaris OS and used to create the system device
nodes. Output from the following prtconf(1M) command shows the
system configuration of a typical non-I/O domain:
# prtconf
System Configuration: Sun Microsystems  sun4v
Memory size: 1024 Megabytes
System Peripherals (Software Nodes):

SUNW,Sun-Fire-T200
    scsi_vhci, instance #0
    packages (driver not attached)
        SUNW,builtin-drivers (driver not attached)
        deblocker (driver not attached)
        disk-label (driver not attached)
        terminal-emulator (driver not attached)
        dropins (driver not attached)
        SUNW,asr (driver not attached)
        kbd-translator (driver not attached)
        obp-tftp (driver not attached)
        ufs-file-system (driver not attached)
    chosen (driver not attached)
    openprom (driver not attached)
        client-services (driver not attached)
    options, instance #0
    aliases (driver not attached)
    memory (driver not attached)
    virtual-memory (driver not attached)
    virtual-devices, instance #0
        ncp (driver not attached)
        console, instance #0
        channel-devices, instance #0
            disk, instance #0
            network, instance #0
    cpu (driver not attached)
    cpu (driver not attached)
    cpu (driver not attached)
    cpu (driver not attached)
    iscsi, instance #0
    pseudo, instance #0

As this system configuration shows, no physical devices are
exported to the domain. The virtual-devices entry is the nexus
node of all virtual devices. The channel-devices entry is the bus
node for the virtual devices that require communication with the
I/O domain. The disk and network entries are leaf nodes on the
channel-devices bus.
The Solaris drivers that are specific to the LDom configuration
are listed below:

LDOM drivers:
    vdc     (virtual disk client 1.4)               non-I/O domain only
    ldc     (sun4v LDC module v1.5)
    ds      (Domain Services 1.3)
    cnex    (sun4v channel-devices nexus dri)
    vnex    (sun4v virtual-devices nexus dri)
    dr_cpu  (sun4v CPU DR 1.2)
    drctl   (DR Control pseudo driver v1.1)
    qcn     (sun4v console driver v1.5)
    vnet    (vnet driver v1.4)                      non-I/O domain only
    vds     (virtual disk server v1.6)              I/O domain only
    vsw     (sun4v Virtual Switch Driver 1.5)       I/O domain only

Similar to Sun xVM Server, the LDoms VIO on a non-I/O domain
uses a split device driver architecture for virtual disk and network
devices. The vdc and vnet client drivers are used in non-I/O
domains. The vds and vsw server drivers are used in the I/O
domain to support the vdc and vnet drivers. The vnex nexus
driver, the driver for the virtual-devices nexus node, provides
bus services to its child nodes, vnet and vdc.
The VIO framework uses the hypervisor's Logical Domain Channel
(LDC) service for driver communication between domains, using the
bi-directional point-to-point channels, transmit and receive queues,
and fixed-length (64-byte) message packets described in "Logical
Domain Channel (LDC) Services" on page 83.

Disk Block Driver


On non-I/O domains, the vdc client driver provides the disk
interface. The vdc driver receives I/O requests from the file
system or raw device access, and sends these requests to the
hypervisor LDC. The vds server driver, located in the I/O
domain, sits on the other end of the LDC. The vds driver receives
requests from the vdc driver and then forwards these requests to
the disk service to which the disk device on the client is mapped.

The sequence of events for disk I/O is illustrated in Figure 27.


Figure 27. Sequence of events for disk I/O from a non-I/O domain to an
I/O domain.

For non-I/O domains, the following events occur when applications use
read(2) and write(2) system calls to access a file:

1. The file system calls the vdc driver's strategy(9E) entry point.
2. The vdc driver sends the I/O buf, buf(9S), to the LDC. The vdc
driver returns after all data is successfully sent to the LDC.
3. The vds driver is notified by the hypervisor that messages are
available on its queue.
4. The vds driver retrieves data from the LDC and sends it to the
device service that is mapped to the client virtual disk.
5. The vds driver starts the block I/O by sending the I/O request to
the native driver and then dispatching a task queue to await I/O
completion.
6. The native SCSI driver receives the device interrupt.
7. The vds driver's I/O completion task is woken up by biodone(9F).
8. The vds driver sends a message to vdc indicating I/O
completion.
9. The vdc driver receives the message from vds, and calls
biodone(9F) to wake up anyone waiting for it.

For I/O domains, the I/O path of data requests is simpler:

10. Block I/O requests are sent directly from the file system to
the native driver.
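A minimal sketch of the client-side submit path is shown below; the
vdisk_req message layout and the ldc_send() and io_done() helpers are
hypothetical simplifications of the real vdc/vds protocol.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical on-channel request: just enough to describe one block I/O. */
    struct vdisk_req {
        uint64_t req_id;           /* matched against the server's reply */
        uint64_t blkno;            /* starting disk block */
        uint64_t nbytes;           /* transfer length */
        uint8_t  write;            /* 1 = write, 0 = read */
        uint64_t cookie;           /* shared-memory handle for the data */
    };

    /* Stand-ins for the LDC transport and a biodone()-style completion. */
    extern int  ldc_send(uint64_t channel, const void *msg, size_t len);
    extern void io_done(struct vdisk_req *req, int error);

    /*
     * Client-side submit path in the spirit of vdc's strategy(9E) entry
     * point: describe the buffer, hand the request to the channel, and
     * return.  Completion is reported later, when the server's reply
     * message arrives.
     */
    int
    vdisk_client_submit(uint64_t channel, struct vdisk_req *req)
    {
        int err = ldc_send(channel, req, sizeof (*req));

        if (err != 0)
            io_done(req, err);     /* could not queue the request */
        return (err);
    }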

In addition to the strategy(9E) driver entry point for supporting
file system and raw device access, the vdc driver also supports
most of the ioctl(2) commands as defined in dkio(7I) for disk
control. The Solaris kernel variable dk_ioctl defines the exact
disk ioctl commands supported by the vdc driver.

Network Driver
The Solaris LDoms network drivers include a client network
driver, vnet, and a virtual switch, vsw, on the server side. To
transmit a packet, vnet sends a packet over the LDC to vsw. The
binding of vnet to vsw is defined in the vnet resource of the
domain when the domain is created. The vsw forwards the
packet to the native driver, and includes the IP address of vnet
as the source address. The vnet driver returns as soon as the
packet has been put on a buffer and the buffer has been added
to the tail of the LDC queue.
When receiving packets from the network, if the native driver is
configured as a virtual switch in the vswitch resource of the
domain, the packet is passed up from the native driver to vsw.
The vsw finds the MAC address associated with the destination
IP address from its ARP table. The vsw gets the target domain
from the MAC address, and gets the vnet interface from the vnet
resource. The packet is then sent to the LDC of the designated
vnet driver.
The vnet driver uses Solaris GLD v3 interfaces and is fully
compatible with the native driver using the same GLD v3
interface.
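The forwarding decision made by the virtual switch can be sketched as
follows; the vsw_port structure and the lookup and transmit helpers are
invented for illustration and do not reflect the actual vsw
implementation.

    #include <stdint.h>
    #include <stddef.h>

    /* One switch port: either an LDC to a guest's vnet or the physical NIC. */
    struct vsw_port {
        uint8_t  mac[6];
        uint64_t ldc_channel;      /* valid when is_physical == 0 */
        int      is_physical;
    };

    /* Stand-ins: MAC-table lookup, LDC transmit, and native-driver transmit. */
    extern struct vsw_port *vsw_lookup_port(const uint8_t *dst_mac);
    extern int ldc_send(uint64_t channel, const void *frame, size_t len);
    extern int phys_tx(const void *frame, size_t len);

    /* Forward one Ethernet frame; the destination MAC is the first 6 bytes. */
    int
    vsw_forward(const void *frame, size_t len)
    {
        struct vsw_port *port = vsw_lookup_port((const uint8_t *)frame);

        if (port == NULL || port->is_physical)
            return (phys_tx(frame, len));               /* unknown or external */

        return (ldc_send(port->ldc_channel, frame, len)); /* deliver to vnet */
    }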
Figure 28 depicts the flow of receiving a packet from the network
through an I/O domain to a guest domain. The sequence of
operations for receiving packets is as follows:
1. Data is stored via DMA into the native driver's (e1000g)
receive buffer ring.
2. The vsw driver sends the packet to the client driver, vnet,
through the LDC.
3. The LDC receiving worker thread gets the packet and sends
it to the vnet driver.

1. Information on the Solaris kernel variable dk_ioctl can be looked
up at the Web site: http://www.opensolaris.org/.

Figure 28. Flow of control for receiving a network packet from an I/O
domain to a guest domain. The figure shows the call stack in the I/O
domain (e1000g_intr_pciexpress -> mac_rx -> vsw_switch_l2_frame ->
vsw_forward_all -> vsw_portsend -> vsw_dringsend -> ldc_write) and in
the guest domain (cnex_intr_wrapper -> i_ldc_rx_hdlr ->
vgen_handle_datamsg), connected by a Logical Domain Channel through
the LDoms hypervisor.
Chapter 8
VMware

VMware, the current market leader in virtualization software for
x86 platforms, offers three virtual machine systems: VMware
Workstation; the no-cost VMware Server, formerly known as VMware
GSX Server; and VMware Infrastructure 3, a suite of virtualization
products based on VMware ESX Server Version 3.
The VMware Workstation and VMware Server products are add-
on software modules that run on a host OS such as Windows,
Linux, or BSD variants (Figure 29). In these
implementations the VMM is a part of, and has the same
privilege as, the host OS kernel. The guest OS runs as an
application on the host OS. The Solaris OS can only run as a
guest OS on VMware Workstation and VMware Server.
The VMware Infrastructure suite of products is built around the
VMware ESX Server. VMware ESX Server runs on bare metal
and uses a derived version of SimOS [18] as the kernel for
running the VMM and I/O services. All other operating systems
run as a guest OS. VMware Infrastructure supports Windows,
Linux, and Solaris as guest OS. VMware ESX Server provides
lower overhead and better control of system resources than
VMware Workstation and VMware Server. However, because ESX Server
provides its own device drivers, it supports fewer devices than
VMware Workstation and VMware Server.

Figure 29 shows the configuration of VMware ESX Server and GSX Server.

Figure 29. VMware GSX Server (VMware Workstation and VMware Server
products) runs within a host operating system, while VMware ESX
Server runs on the bare metal.
VMware ESX Server is a Type I VMM, and has exclusive control
of hardware resources (see "Types of VMM" on page 10). In
contrast, VMware Workstation and VMware Server are Type II
VMMs, and leverage the host OS by running inside the OS
kernel.

VMware Infrastructure Overview


VMware ESX Server, VMware's product for running enterprise
applications in data centers, serves as the foundation of the
VMware Infrastructure solution. VMware ESX Server includes the
following components:
• Virtualization layer — abstracts the hardware resources
including CPU, memory, and I/O
• I/O interface — enables the delivery of file system
services to VMs
• Service Console — provides an interface to manage
resources and administer VMs

Manage
VMkern
Netw
Stor
Stora
el
ork
age
ment the functional components of the VMware ESX Server
shows
ge
Stac
Interfa
Emula
k ce
product. The VMkernel, the core of the ESX Server, abstracts
tion
the underlying hardware resources and implements the

virtualization layer. The VMkernel includes multiple VMMs, one

for each VM. The VMM implements the virtual CPUs for each

VM. The VMkernel also includes modules for I/O driver

emulation, the I/O stack, and device drivers for network and

storage devices. The service console, a RedHat Linux-based

component, serves as a boot loader and provides a


Network
Emulati
management interface to the VMkernel.


Execution H ardware Network Storage
Mode VMM Interface Driver Driver
Layer

CPU ~J Sun X64 Server Network Storage

Figure 30. VMware ESX Server functional components.


The following sections discuss the functional components of
VMware Infrastructure, with particular emphasis on the
virtualization layer which forms the core of all VMware
virtualization products.

VMware CPU Virtualization



ESX Server provides full virtualization, enabling an unmodified
GOS to run on the underlying x86 hardware. The full virtualization
is achieved by the ESX virtualization layer. The core of the ESX
virtualization layer is the VMM, which includes three modules
(Figure 31) [12]:
• Execution decision module — decides whether VM instructions
should be sent to the direct execution module or the binary
translation module
• Binary translation module — used to execute the VM whenever
the hardware processor is in a state in which direct execution
cannot be used
• Direct execution module — enables the VM to directly execute
its instruction sequences on the underlying hardware processor

Figure 31. VMware ESX Server virtualizes the CPU hardware through
binary translation whenever the processor itself cannot directly
execute an instruction.
The decision to use binary translation or direct execution
depends on the state of the processor and whether the segment
is reversible or not (see "Segmented Architecture" on page 23). If
the content of the descriptor table, for example the GDT, is
changed by the VMM because of a context switch to another VM,
the segment is non-reversible. Direct execution can be used only
if the VM is running in an unprivileged mode and the hidden
descriptors of the segment register are reversible. In all other
cases, the VMM will switch to the binary translation module.

Binary Translation
The binary translation (BT) module is believed to have been influenced
by the machine simulators Shade [13] and Embra [14]. Embra is part of
SimOS [18] which was developed by a Stanford team led by
Mendel Rosenblum, one of the founders of VMware. While
extensive details of the BT module implementation have not
been published, Agesen [15], Embra [14], and Shade [13]
provide some information on its implementation.
The BT module translates GOS instructions, which are running in
a deprivileged VM, into instructions that can run in the privileged
VMM segment. The BT module receives x86 binary instructions,
including privileged instructions, as input. The output of the
module is a set of instructions that can be safely executed in the
non-privileged mode. Agesen [15] gives an example of how
control flow is handled in the BT module.
To avoid frequently retranslating blocks of instructions, translated
blocks are kept in a Translation Cache (TC). The execution of a
block of instructions is simulated by locating the block's
translation in the TC and jumping to it. A hash table maintains the
mappings from a program counter to the address of the
translated code in the TC.
The main loop of the dynamic binary translation simulator is
shown in Figure 32. The loop checks to see if the current
simulated program counter address is present in the TC. If it is
present in the TC, the translated block is executed. If it is not, the
translator is called to add the block to the TC. Each block of
translated code ends by loading the new simulated program
counter and jumping back to the main loop for dispatching.

    /* Main dispatch loop */
    main() {
        if (PC_not_in_TC(pc))
            tc = translate(pc);
        newpc = pc_to_tc(pc);
        jump_to_pc(newpc);
    }

    /* Translator: add a block to the Translation Cache (TC) */
    translate(pc) {
        blk = read_instructions(pc);
        perform_translation(blk);
        write_into_TC(blk);
    }

Figure 32. Binary translation manages a translation cache to reduce the
need to re-translate frequently executed blocks of instructions. Code
fragments in the translation cache end with a jump back to the dispatch
loop.
A more detailed description of binary translation is beyond the
scope of this paper. Readers should refer to Shade [13] and
Embra [14] for more details about dynamic binary translation.
Some privileged instructions that have simple operations use in-
TC sequences. For example, a clear interrupt instruction (cli) can
be replaced by setting a virtual processor flag. Privileged
instructions that have more complex operations (such as setting
cr3 during a context switch), require a call out of the TC to
perform the emulation work.
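As a toy illustration of an in-TC replacement (not VMware's actual
translator), the privileged cli instruction can be rewritten into code
that merely clears a virtualized interrupt-enable flag:

    #include <stdint.h>

    /* Minimal virtual CPU state touched by the translated code. */
    struct vcpu {
        uint32_t iflag;            /* virtualized interrupt-enable flag */
    };

    /*
     * Conceptual in-TC replacement for cli: instead of executing the
     * privileged instruction, the translated block clears the virtual
     * flag.  sti would set it again.  More complex privileged operations
     * (such as a mov to cr3) are instead translated into a call out of
     * the translation cache to a full emulation routine.
     */
    static inline void
    emulate_cli(struct vcpu *v)
    {
        v->iflag = 0;
    }

    static inline void
    emulate_sti(struct vcpu *v)
    {
        v->iflag = 1;
    }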
In addition to binary translation and logic for determining the
code execution, the virtualization layer employs other techniques
to overcome x86 virtualization issues:
• Memory Tracing
The virtualization layer traces modifications on any given
physical page of the virtual machine, and is notified of all read
and write accesses made to that page in a transparent manner.
This memory tracing ability in the VMM is enabled by page
faults and the ability to single-step the virtual machine via
binary translation.
• Shadow Descriptor Tables
The x86's segmented architecture (see "Segmented Architecture" on
page 23) has a segment caching mechanism that allows the segment
register's hidden fields to be re-used. However, this approach can
cause difficulty if the descriptor table is modified in a non-coherent
way. The virtualization layer supports the GOS system descriptor
tables using VMM shadow descriptor tables.
The VMM descriptor tables include shadow descriptors that
correspond to predetermined descriptors of the VM descriptor
tables. The VMM also includes a segment tracking mechanism
that compares the shadow descriptors with their corresponding
VM segment descriptors. This mechanism indicates any lack of
correspondence between shadow descriptor tables with their
corresponding VM descriptor tables, and updates the shadow
descriptors so that they correspond to their respective
corresponding VM segment descriptors.
The ESX Server's VMM implementation is unique in that each
GOS has an associated VMM. The ESX Server may include any
number of VMMs in a given physical system, each supporting a
corresponding VM; the number of VMMs is limited only by
available memory and speed requirements. The features in the
virtualization layer mentioned in the previous discussion allow
multiple concurrent VMMs, with each VMM supporting an
unmodified GOS in the virtualization layer.

CPU Scheduling
The ESX Server implements a rate-based proportional-share
scheduler [19] that is similar to the fair-share-scheduler scheme
used by the Solaris OS (see [21] Chapter 8) in which each virtual
machine is given a number of shares. The amount of CPU time
given to each VM is based on its fractional share of the total
number of shares of active VMs in the whole system.
The term share is used to define a portion of the system's CPU
resources that is allocated to a VM. If a greater number of CPU
shares is assigned to a VM, relative to other VMs, then that VM
receives more CPU resources from the scheduler. CPU shares
are not equivalent to percentages of CPU resources. Rather,
shares are used to define the relative weight of a CPU load in a
VM in relation to CPU loads of other VMs.
The following formula shows how the scheduler calculates per-domain
allocation of CPU resources:

    Allocation(domain_i) = Shares(domain_i) / TotalShares

The ESX scheduler allows specifying minimum (reservation) and
maximum (limit) CPU utilization for each virtual machine. A
minimum CPU reservation guarantees that a virtual machine always
has this minimum percentage of a physical CPU's time allocated to
it, regardless of the total number of shares. A maximum CPU limit
ensures that the virtual machine never uses more than this maximum
percentage of a physical CPU's time, even if extra idle time is
available. The proportional-share algorithm is only applied if the VM
CPU utilization falls within the range of reservation and limit CPU
utilization. Figure 33 shows how CPU resource allocation is
calculated.

Figure 33. Calculation of CPU resources in VMware: the allocation lies
on a scale from 0 MHz to the total MHz available, bounded below by the
reservation and above by the limit.
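The following C sketch combines the share-based fraction with the
reservation and limit bounds just described; the structure and the MHz
units are illustrative.

    #include <stdint.h>

    /* Per-VM scheduling parameters (illustrative units: shares and MHz). */
    struct vm_alloc {
        uint64_t shares;           /* relative weight of this VM */
        uint64_t reservation_mhz;  /* guaranteed minimum */
        uint64_t limit_mhz;        /* hard maximum */
    };

    /*
     * Proportional-share allocation, clamped to [reservation, limit]:
     *   allocation_i = total_mhz * shares_i / total_shares, then bounded.
     */
    uint64_t
    cpu_allocation_mhz(const struct vm_alloc *vm, uint64_t total_shares,
        uint64_t total_mhz)
    {
        uint64_t alloc = total_mhz * vm->shares / total_shares;

        if (alloc < vm->reservation_mhz)
            alloc = vm->reservation_mhz;
        if (alloc > vm->limit_mhz)
            alloc = vm->limit_mhz;
        return (alloc);
    }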
In an SMP environment in which a VM could have more than one
virtual CPU (VCPU), a scalability issue arises when one VCPU is
spinning on a lock held by another VCPU that gets de-scheduled.
The spinning VCPU wastes CPU cycles spinning on the lock until
the lock owner VCPU is finally scheduled again and releases the
lock.
ESX implements co-scheduling to work around this problem. In
co-scheduling (also called gang scheduling), all virtual processors of
a VM are mapped one-to-one onto the underlying processors
and simultaneously scheduled for an equal time slice. The ESX
scheduler guarantees that no VCPUs are spinning on a lock held
by a VCPU that has been preempted.
However, co-scheduling does introduce other problems. Because
all VCPUs are scheduled at the same time, co-scheduling
activates a VCPU regardless of whether there are jobs in the
VCPU's run queue. Co-scheduling also precludes multiplexing
multiple VCPUs on the same physical processor.

Timer Services
Similar to Sun xVM Server, ESX Server faces the same issue of
getting clock interrupts delivered to VMs at the configured
interval [16]. This issue arises because the VM may not get
scheduled when interrupts are due to deliver. ESX Server keeps
track of the clock interrupt backlog and tries to deliver clock
interrupts at a higher rate when the backlog gets large. However,
the backlog can get so large that it is not possible for the GOS to
catch up with the real time. In such cases, ESX Server stops
attempting to catch up if the clock interrupt backlog grows beyond
60 seconds. Instead, ESX Server sets its record of the clock interrupt
backlog to zero and synchronizes the GOS clock with the host
machine clock.
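That policy can be summarized with the sketch below; the 60-second
threshold comes from the description above, while the structure and
helper functions are illustrative rather than ESX internals.

    #include <stdint.h>

    #define MAX_BACKLOG_US  (60ULL * 1000000ULL)   /* 60-second give-up point */

    struct vtimer {
        uint64_t backlog_us;       /* clock-interrupt time owed to the guest */
        uint64_t period_us;        /* configured tick interval */
    };

    extern void deliver_clock_interrupt(void);
    extern void sync_guest_clock_to_host(struct vtimer *t);

    /* Called when the VM runs again and real time has advanced by elapsed_us. */
    void
    vtimer_catch_up(struct vtimer *t, uint64_t elapsed_us)
    {
        t->backlog_us += elapsed_us;

        if (t->backlog_us > MAX_BACKLOG_US) {
            t->backlog_us = 0;                 /* too far behind: stop trying */
            sync_guest_clock_to_host(t);
            return;
        }

        /* Deliver ticks at a higher rate until the backlog is worked off. */
        while (t->backlog_us >= t->period_us) {
            deliver_clock_interrupt();
            t->backlog_us -= t->period_us;
        }
    }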

ESX Server virtualizes the Time Stamp Counter (TSC) so that
the virtualized TSC counter matches with the GOS clock (see
"Time Stamp Counter (TSC)" on page 28). When the clock
interrupt backlog is cleared due to catching up or due to reset
when the backlog is too large, the virtualized TSC catches up
with the adjusted clock.

VMware Memory Virtualization


Similar to Sun xVM Server, memory virtualization in VMware
ESX Server deals with two memory management issues:
physical memory management and page translations.

Physical Memory Management


Similar to Sun xVM Server, ESX Server virtualizes a VM's
physical memory by adding an extra level of address translation
when mapping a VM's physical memory pages to the physical
memory pages on the underlying machine. Also like Sun xVM
Server, the underlying physical pages are referred to as machine
pages, and the VM's physical pages as physical pages. Each VM
sees a contiguous, zero-based, addressable physical memory
space whereas the underlying machine memory used by each
virtual machine may not be contiguous.
ESX Server manages physical memory allocation and
reclamation, similar to Sun xVM Server, by using the memory
ballooning technique. More detailed information on how the
memory ballooning technique manages the physical memory
allocation and reclamation is included in [5].

Page Translations
Each GOS in the ESX Server maintains page tables for virtual-to-
physical address mappings. The VMM also maintains shadow
page tables for the virtual-to-machine page mappings along with
physical-to-machine mappings in its memory. The processor's
MMU uses the VMM's shadow page table. When a GOS updates
its page tables with a virtual-to-physical translation, the VMM
intercepts the instruction, gets the physical-to-machine mapping
from its memory, and loads the shadow page table with the
virtual-to-machine mapping. This mechanism allows normal
memory accesses in the VM to execute without adding address
translation overhead if the shadow page tables are set up for that
access.
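A simplified model of this shadow page table maintenance is sketched
below; the vmm_mmu structure and the flat page arrays are illustrative,
not VMware's actual data structures.

    #include <stdint.h>

    #define PFN_INVALID  ((uint64_t)-1)

    /* Illustrative per-guest mapping tables kept by the VMM. */
    struct vmm_mmu {
        uint64_t *p2m;             /* guest physical page -> machine page */
        uint64_t *shadow_pt;       /* virtual page -> machine page (used by MMU) */
        uint64_t  npages;          /* size of the guest's physical memory */
    };

    /*
     * Intercepted guest PTE update: the guest installed virt_pfn -> phys_pfn,
     * so the VMM installs virt_pfn -> machine_pfn in the shadow page table.
     */
    int
    shadow_pte_update(struct vmm_mmu *m, uint64_t virt_pfn, uint64_t phys_pfn)
    {
        uint64_t machine_pfn;

        if (phys_pfn >= m->npages)
            return (-1);           /* guest referenced a nonexistent page */

        machine_pfn = m->p2m[phys_pfn];
        if (machine_pfn == PFN_INVALID)
            return (-1);           /* no machine page backs this entry */

        m->shadow_pt[virt_pfn] = machine_pfn;
        return (0);
    }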

VMware I/O Virtualization


Every VM is configured with a set of standard PC virtual devices:
PS/2 keyboard and mouse, IDE controller with ATA disk and ATAPI
CDROM, serial port, parallel port, and sound chip [20]. In addition,
ESX Server also provides virtual PCI emulation for PCI add-on devices
such as SCSI, Ethernet, and SVGA graphics (see Figure 30 on page 98).
The device tree as exported by the VMM to a GOS is shown in
the following prtconf(1M) output.
% prtconf
System Configuration: Sun Microsystems  i86pc
Memory size: 1648 Megabytes
System Peripherals (Software Nodes):

i86pc
    scsi_vhci, instance #0
    isa, instance #0
        i8042, instance #0
            keyboard, instance #0
            mouse, instance #0
        lp (driver not attached)
        asy, instance #0 (driver not attached)
        asy, instance #1 (driver not attached)
        fdc, instance #0
            fd, instance #0
    pci, instance #0
        pci15ad,1976 (driver not attached)
        pci8086,7191, instance #0
            pci15ad,1976 (driver not attached)
            pci-ide, instance #0
                ide, instance #0
                    sd, instance #16
                ide (driver not attached)
            pci15ad,1976 (driver not attached)
            display, instance #0
            pci1000,30, instance #0
                sd, instance #0
            pci15ad,750, instance #0
    iscsi, instance #0
    pseudo, instance #0
    options, instance #0
    agpgart, instance #0 (driver not attached)
    xsvc, instance #0
    objmgr, instance #0
    acpi (driver not attached)
    used-resources (driver not attached)
    cpus (driver not attached)
        cpu, instance #0 (driver not attached)
        cpu, instance #1 (driver not attached)
The PCI vendor ID of VMware is 15ad. The following entries are
relevant to VMware I/O virtualization:

Device Entry    Description
pci15ad,750     VMware emulation of Intel's 100FX Gigabit Ethernet
pci15ad,1976    VMware emulation of the Intel 440BX/ZX PCI bridge chip
pci1000,30      The LSI Logic 53C1020/1030 SCSI controller
display         VMware virtual SVGA
For the example device tree shown here, the Solaris OS binds
the e1000g driver to pci15ad,750 and uses e1000g as the
network driver. The actual network hardware used on the system
is a Broadcom NetXtreme Dual Gigabit Adapter with the PCI ID
pci14e4,1468. VMware translates the e1000g device interfaces
passed by the Solaris e1000g driver, and sends them to the
Broadcom NetXtreme device.
For storage, unlike Sun xVM Server, ESX Server continues to
use sd as the interface to file systems. The emulation of the disk
interface is provided at the SCSI bus adapter interface (LSI Logic
SCSI controller) instead of at the SCSI target interface (SCSI
disk sd).

Device Emulation
Each storage device, regardless of the specific adapter, appears as a
SCSI drive connected to an LSI Logic SCSI adapter within the VM. For
network I/O, ESX Server emulates an AMD Lance/PCNet or Intel E1000
device, or uses a custom interface called vmxnet for the physical
network adapter. VMware provides device emulation rather than the I/O
emulation as used by Sun xVM Server and UltraSPARC LDoms (see "I/O
Virtualization" on page 16). In a simple scenario, consider an
application within the VM making an I/O request to the GOS, as
illustrated in Figure 34:
Figure 34. Sequence of events for applications making an I/O request.
1. Applications perform I/O operations through the interface to the
device as exported by the VMware VMM (see "VMware I/O
Virtualization" on page 103). The virtual device interface uses the
native drivers (for example, the e1000g for network and mpt for the
LSI SCSI HBA) in the Solaris kernel.
2. The Solaris native driver attempts to access the device via the
IN/OUT instructions (for example, by writing a DMA descriptor to the
device's DMA engine).
3. The VMM intercepts the I/O instructions and then transfers control
to the device-independent module in the VMkernel for handling the I/O
request.
4. The VMkernel converts the I/O request from the emulated device to
one for the real device, and sends the converted I/O request to the
driver for the real device.
5. The VMware driver sends the I/O request to the real I/O device.
6. When an I/O request completion interrupt (for example, a DMA
completion interrupt) arrives, the VMkernel device driver receives and
processes the interrupt.
7. The VMkernel then notifies the VMM of the target virtual machine,
which copies data to the VM memory and then raises the interrupt to
the GOS.
8. The Solaris driver's interrupt service routine (ISR) is called.
9. The Solaris driver performs a sequence of I/O accesses (for
example, reads the transaction status, acknowledges receipt) to the
I/O ports before passing the data to its applications.

The VMkernel ensures that data intended for each virtual machine is
isolated from other VMs.
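The handoff in steps 2 through 4 can be pictured as a small dispatch
table: the VMM intercepts an OUT to a port owned by an emulated device
and forwards the access to that device's emulation module. The types
and table below are illustrative only.

    #include <stdint.h>
    #include <stddef.h>

    /* One emulated device claims a range of x86 I/O ports. */
    struct emul_dev {
        uint16_t port_base;
        uint16_t port_len;
        /* Emulation callback invoked on an intercepted OUT to this range. */
        void   (*io_write)(struct emul_dev *dev, uint16_t port, uint32_t value);
    };

    /* Illustrative table of emulated devices registered for one VM. */
    extern struct emul_dev *emul_devs[];
    extern size_t           emul_dev_count;

    /* Called when the VMM intercepts a guest OUT instruction. */
    int
    vm_port_write(uint16_t port, uint32_t value)
    {
        size_t i;

        for (i = 0; i < emul_dev_count; i++) {
            struct emul_dev *dev = emul_devs[i];

            if (port >= dev->port_base &&
                port < (uint32_t)dev->port_base + dev->port_len) {
                dev->io_write(dev, port, value);   /* hand off to the emulation */
                return (0);
            }
        }
        return (-1);               /* no emulated device owns this port */
    }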

Section III
Additional Information

• Appendix A: VMM Comparison (page 109)


• Appendix B: References (page 111)
• Appendix C: Terms and Definitions (page 113)
• Appendix D: Author Biography (page 117)


Appendix A
VMM Comparison

This appendix presents a summary comparison of the four virtual
machine monitors discussed in this paper: Sun xVM Server without HVM,
Sun xVM Server with HVM, VMware, and Logical Domains (LDoms). Table 10
summarizes their general characteristics; provides information on
their CPU, Memory, and I/O virtualization implementation; and lists
the management options available for each.

Table 10. Comparison of virtual machine monitors discussed in this paper.

General                      Sun xVM Server w/o HVM | Sun xVM Server w/HVM | VMware | LDoms
VMM version                  3.0.4 | 3.0.4 | ESX 3.0.1 | LDoms 1.0.1
Supported ISA                x86 and IA-64 | x86 and IA-64 | x86 | UltraSPARC T1/T2
VMM Layer                    Run on bare metal | Run on bare metal | Run on bare metal | Firmware
Virtualization Scheme        Paravirtualization | Full | Full | Paravirtualization
Supported GOS                Linux, NetBSD, FreeBSD, OpenBSD, Solaris | Linux, NetBSD, FreeBSD, OpenBSD, Windows | Windows, Linux, Netware, Solaris | Solaris, Linux
SMP GOS                      Yes | Yes | Yes | Yes
64-bit GOS                   Yes | Yes | Yes | Yes
Max VMs                      Limited by memory | Limited by memory | Limited by memory | 32 on UltraSPARC T1; 64 on UltraSPARC T2
Method of operation          Modified GOS | Hardware Virtualization | Binary Translation | Modified OS
License                      GPL (free) | GPL (free) | Proprietary | CDDL (free)

CPU                          Sun xVM Server w/o HVM | Sun xVM Server w/HVM | VMware | LDoms
CPU scheduling               Credit | Credit | Fair Share | N/A
VMM Privilege Mode           Privileged (ring 0) | Privileged (ring 0) | Privileged | Hyperprivileged
GOS Privilege Mode           Unprivileged (ring 3 for 64-bit kernel; ring 1 for 32-bit kernel) | Reduced privilege | Deprivileged | Privileged
CPU Granularity              Fractional | Fractional | Fractional | 1 strand
Interrupt                    Queued and delivered when the VM is scheduled to run | Queued and delivered when the VM is scheduled to run | Queued and delivered when the VM is scheduled to run | Delivered directly to the VM

Memory                       Sun xVM Server w/o HVM | Sun xVM Server w/HVM | VMware | LDoms
Page Translation             Hypercall to VMM | Shadow page | Shadow page | Hypercall to VMM
Physical Memory Allocation   Balloon driver | Balloon driver | Balloon driver | Hard partition
Page Tables                  Managed by VMM | Managed by VMM | Managed by VMM | Managed by GOS

I/O                          Sun xVM Server w/o HVM | Sun xVM Server w/HVM | VMware | LDoms
I/O Granularity              Shared | Shared | Shared | PCI bus
I/O Virtualization           I/O emulation by Dom0 | Device emulation by QEMU or I/O emulation by Dom0 | Device emulation by vmkernel | I/O emulation by I/O domain
Device drivers               Virtual driver on DomU, native driver on Dom0 | Native driver on DomU and Dom0 (QEMU) | Native driver on guest supported by the VMM | Virtual driver on non-I/O domain and native driver on I/O domain

Management                   Sun xVM Server w/o HVM | Sun xVM Server w/HVM | VMware | LDoms
Management Model             Dom0 - SPOF | Dom0 - SPOF | Service console - SPOF | Control domains
Interface                    CLI: xm(1M); GUI: virt-manager | CLI: xm(1M); GUI: virt-manager | GUI: Virtual Center | CLI: ldm(1M), XML, and SNMP MIBs


Appendix B
References

1. Popek, Gerald J. and Goldberg, Robert P. "Formal Requirements for
Virtualizable Third Generation Architectures," Communications of the
ACM 17 (7), pages 412-421, July 1974.
2. UltraSPARC Architecture 2005: One Architecture ... Multiple
Innovative Implementations, Draft D0.9, 15 May 2007.
3. Robin, John Scott and Irvine, Cynthia E. "Analysis of the Intel
Pentium's Ability to Support a Secure Virtual Machine Monitor,"
Proceedings of the 9th USENIX Security Symposium, August 2000.
4. VMware: http://www.vmware.com/vinfrastructure/
5. Waldspurger, Carl A. "Memory Resource Management in VMware ESX
Server," Proceedings of the 5th Symposium on Operating Systems Design
and Implementation, December 2002.
6. Xen, "The Xen virtual machine monitor," University of Cambridge
Computer Laboratory: http://www.cl.cam.ac.uk/research/srg/netos/xen/
7. IA-32 Intel Architecture Software Developer's Manual, March 2006.
8. System V Application Binary Interface AMD64 Architecture Processor
Supplement, Draft Version 0.98, September 27, 2006.
http://www.x86-64.org/documentation/abi.pdf
9. AMD64 Architecture Programmer's Manual, Volume 2: System
Programming, Rev. 3.12, September 2006.
10. OpenSPARC T1 Microarchitecture Specification, Revision A, August
2006.
11. UltraSPARC Virtual Machine Specification (The sun4v architecture
and Hypervisor API specification), Revision 1.0, January 24, 2006.
12. Devine, Scott W.; Bugnion, Edouard; and Rosenblum, Mendel.
"Virtualization system including a virtual machine monitor for a
computer with a segmented architecture," U.S. Patent 6,397,242,
October 26, 1998.
13. Cmelik, Robert F. and Keppel, David. "Shade: A Fast Instruction
Set Simulator for Execution Profiling," ACM SIGMETRICS Performance
Evaluation Review, pages 128-137, May 1994.
14. Witchel, Emmett and Rosenblum, Mendel. "Embra: Fast and Flexible
Machine Simulation," Proceedings of ACM SIGMETRICS '96: Conference on
Measurement and Modeling of Computer Systems, 1996.
15. Adams, Keith and Agesen, Ole. "A Comparison of Software and
Hardware Techniques for x86 Virtualization," ASPLOS 2006, San Jose,
CA, USA, October 21-25, 2006.
16."Timekeeping in VMware Virtual Machines," VMware white


paper, August 2005.
17.Bittman, T. "Gartner RAS Core Strategic Planning SPA-21-5502,
Research Note 14," November 2003.
18.Rosenblum, Mendel; Herrod, Stephen A.; Witchel, Emmett; and
Gupta, Anoop. "Complete Computer Simulation: The SimOS
Approach," IEEE Parallel and Distributed Technology, pages 34-
43, Winter 1995.
19."VMware ESX Server 2 Architecture and Performance
Implication," VMware white paper, 2005.
20.Sugerman, Jeremy; Venkitachalam, Ganesh; and Lim, Beng-
Hong. "Virtualizing I/O Devices on VMware Workstation's Hosted
Virtual Machine Monitor," Proceedings of the 2001 USENIX
Annual Technical Conference, Boston, Massachusetts, USA,
June 25-30, 2001.
21.System Administration Guide: Solaris Containers-Resource
Management and Solaris Zones, Part No: 817-1592 -14, June
2007
22.Drakos, Nikos; Hennecke, Marcus; Moore, Ross; and Swan,
Herb. XenInterface manual:Xen v3.0 forx86.
23.Bochs IA-32 Emulator Project: http: //bochs .sourceforge.net/
24.QEMU, Open Source Processor Emulator:
http://fabrice.bellard.free.fr/qemu/
25.IEEE 1275-1994 Open Firmware:
http://playground.sun.com/1275/
26.PCI Bus Binding to IEEE std. 1275-1994, Rev 2.1 August 29,
1998.
27.TAP — a Virtual Ethernet network device:
http://vtun.sourceforge.net/tun/
28.Intel Virtualization Technology for Directed I/O Architecture
Specification, May 2007, Order Number: D51397-002.
29.Shadow2 presentation at Xen Technical Summit, Summer
2006:
http://www.xensource.com/files/summit_3/XenSummit_Shado
w2.pdf
30.PCI SIG, "Address Translation Services," Revision 1.0,
March 8, 2007.
31.AMD I/O Virtualization Technology (IOMMU) Specification,
Revision 1.20, Publication# 34434, February 2007.
32.Jun Nakajima, Asit Mallick, Ian Pratt, Keir Fraser, "x86-64
XenLinux: Architecture, Implementation, and Optimizations,"
Proceedings of the Linux Symposium, July 19-22 2006. Ontario,
Canada.
33.OpenSPARC T2 Core Microarchitecture Specification, July
2007, Revision 5.
34.UltraSPARC Architecture 2007, Hyperprivileged, Privileged, and
Nonprivileged, Draft D0.91, Aug 2007.
R
e
1

Sun Microsystems, Inc.

35.PCI SIG, "Single Root I/O Virtualization and Sharing


Specification," Revision 1.0, September 11, 2007.
Appendix C
Terms and Definitions

Hardware level virtualization introduces several terms that are used
throughout this document. The following terms are defined in the
context of hardware-level virtualization.

Balloon driver
A method for dynamic sharing of physical memory among VMs [5].

Binary Translation
In computing, binary translation [13] [14] usually refers to the
emulation of one instruction set by another through translation of
instructions to allow software programs (e.g., operating systems
and applications) written for a particular processor architecture to
run on another. In the context of VMware products, binary
translation refers to the conversion of one set of instruction
sequences that belongs to a VM and has been deprivileged, to
another set of instruction sequences that can run in a privileged
VMM segment. VMware uses binary translation [12] [15] to provide
full virtualization of the x86 processor.
Domain
A running virtual machine within which a guest OS runs. Domain
and virtual machine are used interchangeably in this document.
Full Virtualization
Full virtualization is an implementation of a virtual machine that
doesn't require the guest OS to be modified to run in the VM. The
techniques used for full virtualization can be a dynamic
translation of software programs running in a VM (e.g., VMware
products), or providing a complete emulation of the underlying
processor (e.g., Xen with Intel-VT or AMD-V).
Guest Operating Systems (GOS)
A GOS is one of the OSes that the VMM can host in a VM. The
relationship between VMM, VM, and GOS is analogous to the
relationship between, respectively, OS, process, and program.
Hardware Level Virtualization
Hardware Level Virtualization is the technique of using a thin
layer of software to abstract the system hardware resources for
creating multiple instances of virtual execution environments, each
of which runs a separate instance of an operating system.
Hardware Thread
See strand.

HVM
Hardware Virtual Machine, also known as hardware-assisted
virtualization.

Hypervisor
Hypervisor is another term for VMM. Hypervisor is an extension of the
term supervisor, which was commonly applied to an operating system
kernel.
Logical Domains (LDoms)
Logical domains are Sun's implementation for hardware level
virtualization based on the UltraSPARC T1 processor
technology. LDom technology allows multiple domains to be
created on one processor; each domain runs an instance of OS
supported by one or more strands.
Operating System Level Virtualization
OS Level Virtualization is provided by an OS by virtualizing its
services to allow multiple and separate operating environments to be
created for applications. The services virtualized by the OS include:
file system, devices, networking, security, and Inter Process
Communication (IPC).

Pacifica
AMD's implementation for Hardware Virtualization, also known as AMD-V
or AMD SVM.

Paravirtualization
Paravirtualization is an implementation of a virtual machine that
requires the guest OS to be modified to run in the VM.
Paravirtualization provides partial emulation of the underlying
hardware to a VM and requires the guest OS to replace all sensitive
instructions and pass control to the VMM for handling these
operations.
Privileged Instructions
Privileged instructions are those that result in a trap if the
processor is running in user mode and do not result in a trap if the
processor is running in supervisor mode.
Secure Virtual Machine (SVM)
AMD's implementation for Hardware Virtualization, also known as
Pacifica or AMD-V (see [9] Chapter 15).
Sensitive Instructions
Sensitive instructions [1] [12] are those that change the
configuration of resources (memory), affect the processor mode
without going through the memory trap sequence (page fault), or
whose behavior changes with the processor mode or the
contents of relocation register. If sensitive instructions are a
subset of privileged instructions, it is relatively easy to build a VM
because all sensitive instructions will result in a trap and the
underlying VMM can process the trap and emulate the behavior
of these sensitive instructions. If some sensitive instructions are
not privileged instructions, special measures have to be taken to
handle these sensitive instructions.
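As a concrete illustration (not drawn from the references), the
short C program below executes SGDT, one of the classic x86
instructions that is sensitive but not privileged: it reveals
privileged machine state, yet it completes in user mode without
trapping, so a trap-and-emulate VMM never sees it. The example
assumes an x86 processor and GCC-style inline assembly.

    #include <stdio.h>
    #include <stdint.h>

    /* Illustrative only: SGDT stores the Global Descriptor Table
     * register, which is privileged machine state, yet on classic x86
     * the instruction is not privileged -- it completes in user mode
     * without trapping. Requires an x86 processor and GCC-style inline
     * assembly; some recent processors can be configured to fault on
     * this instruction in user mode. */
    int main(void)
    {
        struct __attribute__((packed)) {
            uint16_t limit;
            uint64_t base;       /* only 4 bytes are filled in 32-bit mode */
        } gdtr = { 0, 0 };

        __asm__ volatile ("sgdt %0" : "=m" (gdtr));

        printf("GDT limit = 0x%x, base = 0x%llx\n",
               gdtr.limit, (unsigned long long)gdtr.base);
        return 0;
    }

Binary translation and hardware extensions such as Intel-VT and
AMD-V are two ways of closing this gap.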
Shadow Page
A technique for hiding the layout of machine memory from a
virtual machine's operating system. A virtual page table is
presented to the guest OS by the VMM, but not connected to the
processor's memory management unit. The VMM is responsible
for trapping accesses to the table, validating updates and
maintaining consistency with the real page table that is visible to
the processor MMU. Shadow paging is typically used to provide
full virtualization to a VM.
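The C sketch below is purely illustrative (hypothetical names,
single-level tables) of the bookkeeping this definition
describes: when the VMM traps a guest update to the guest's own
page table, it records the guest's view and derives the
corresponding entry in the shadow table, the only table the
processor MMU actually walks. The guest-physical-to-machine map
(p2m) is known only to the VMM.

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_SHIFT 12
    #define FLAG_MASK  0xfffULL   /* low PTE bits: present, writable, etc. */
    #define NPAGES     16         /* toy single-level tables               */

    static uint64_t guest_pt[NPAGES];  /* guest's view: virtual page -> guest-"physical" PTE */
    static uint64_t p2m[NPAGES];       /* VMM only: guest-"physical" page -> machine page    */
    static uint64_t shadow_pt[NPAGES]; /* walked by the MMU: virtual page -> machine PTE     */

    /* Invoked by the VMM after it traps a guest write to guest_pt[vpn]. */
    static void shadow_update(unsigned vpn, uint64_t guest_pte)
    {
        uint64_t gpfn = guest_pte >> PAGE_SHIFT;      /* guest-"physical" frame */
        uint64_t mfn  = p2m[gpfn];                    /* real machine frame     */

        guest_pt[vpn]  = guest_pte;                   /* what the guest sees    */
        shadow_pt[vpn] = (mfn << PAGE_SHIFT)          /* what the MMU walks     */
                       | (guest_pte & FLAG_MASK);     /* keep the guest's flags */
    }

    int main(void)
    {
        p2m[2] = 7;                                   /* guest page 2 resides in machine page 7 */
        shadow_update(0, (2ULL << PAGE_SHIFT) | 0x3); /* guest maps vpage 0 -> gpage 2, RW      */

        printf("guest PTE  = 0x%llx\n", (unsigned long long)guest_pt[0]);
        printf("shadow PTE = 0x%llx\n", (unsigned long long)shadow_pt[0]);
        return 0;
    }

In this toy version the shadow entry is derived eagerly on every
trapped update; real VMMs typically build shadow entries lazily
on page faults and must keep them consistent whenever the p2m
map changes.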
Simple Earliest Deadline First (sEDF)
One of the scheduling algorithms used in Sun xVM Hypervisor
for x86 for scheduling domains. See section "CPU Scheduling"
on page 48 for a detailed description of sEDF.
Strand
Strand [2] refers to the state that hardware must maintain in
order to execute a software thread. Specifically, a strand is the
software-visible state (PC, NPC, general-purpose registers,
floating-point registers, condition codes, status registers, ASRs,
etc.) of a thread plus any microarchitecture state required by
hardware for its execution. Strand replaces the ambiguous term
hardware thread. The number of strands in a processor defines
the number of threads that an operating system can schedule on
that processor at any given time.
Sun xVM Hypervisor for x86
Sun xVM Hypervisor for x86 is the VMM of the Sun xVM Server.
Sun xVM Infrastructure
Sun Cross Virtualization and Management Infrastructure is a
complete solution offering for virtualizing and managing the data
center. Sun xVM Infrastructure = Sun xVM Server + xVM Ops
Center.
Sun xVM Ops Center
Sun xVM Ops Center is the management suite for the Sun xVM
Server.
Sun xVM Server
Sun xVM Server is a paravirtualized Solaris OS that includes
support for the Xen open source community work on the x86
platform and support for LDoms on the UltraSPARC T1/T2
platform. In this paper, Sun xVM Server specifically refers to the
Sun xVM Server for the x86 platform.
Vanderpool
Intel's implementation for Hardware Virtualization, also
known as Intel-VT.
Virtual CPU (VCPU)
VCPU is an entity that can be dispatched by the scheduler of a
guest OS. For LDoms on UltraSPARC processors, a VCPU is also
known as a strand, hardware thread, or logical processor.
Virtual Machine (VM)
Virtual machine is a discrete execution environment that
abstracts computer platform resources to an operating system.
Each virtual machine runs an independent and separate instance
of an operating system. Popek and Goldberg [1] also define a VM as
an "efficient, isolated duplicate of a real machine."
Virtual Machine Monitor (VMM)
The VMM is a software layer that runs directly on top of the
hardware and virtualizes all resources of the computer system.
The VMM layer is situated between VMs and hardware
resources. The VMM abstracts hardware resources to VMs and
performs privileged and sensitive actions on behalf of the VMs.
Virtualization Technology (VT)
Intel's implementation for Hardware Virtualization, also
known as Vanderpool.
Xen
Xen is an open source VMM for x86, IA-64, and PPC [6].
Appendix D Author Biography

Chien-Hua Yen is currently a senior staff engineer in the ISV
engineering group at Sun. Before joining Sun more than 12 years
ago, he had been with several Silicon Valley companies working
as a software development engineer on Unix file systems, real-
time embedded systems, and device drivers. His first job at Sun
was with the kernel I/O group developing a kernel virtual memory
segment driver for device memory mapping. After the kernel
group, he worked with third party hardware vendors on
developing PCI drivers for the Solaris OS and high availability
products for the Sun CompactPCI board. In the last two years,
Chien-Hua has been working with ISVs on application
performance tuning, Solaris 10 adoption, and Solaris
virtualization.

Acknowledgements
The author would like to thank Honlin Su, Lodewijk Bonebakker,
Thomas Bastian, Ray Voight, and Joost Pronk for their invaluable
comments; Patric Change for his encouragement and support;
Suzanne Zorn for her editorial work; and Kemer Thompson for
his constructive comments and his coordination of the reviews.
Solaris Operating System Hardware Virtualization Product Architecture
On the Web: sun.com

Sun Microsystems, Inc. 4150 Network Circle, Santa Clara, CA 95054 USA
Phone 1-650-960-1300 or 1-800-555-9SUN (9786) Web sun.com
© 2007 Sun Microsystems, Inc. All rights reserved. Sun, Sun Microsystems, the Sun logo, Java,
JVM, Solaris, and Sun BluePrints are trademarks or registered trademarks of Sun Microsystems,
Inc. in the United
States and other countries. All SPARC trademarks are used under license and are trademarks or
registered trademarks of SPARC International, Inc. in the U.S. and other countries. Products
bearing SPARC trademarks
are based upon architecture developed by Sun Microsystems, Inc. Information subject to change
without notice. Printed in
USA W07
