Part No 820-3703-10, Revision 1.0, 11/27/07
Edition: November 2007
Sun Microsystems, Inc.
Table of Contents
Introduction..............................................................................1
Hardware Level Virtualization.................................................2
Scope......................................................................................4
Chapter 1
Introduction
1. The terms "Java Virtual Machine" and "JVM" mean a Virtual Machine for the Java(TM) platform.
[Figure 1: Virtual machines (VMs), each running a guest OS (GOS), hosted on the Virtual Machine Monitor (VMM) above the platform hardware.]
Hardware resource virtualization can take the form of sharing,
partitioning, or delegating:
• Sharing — Resources are shared among VMs. The VMM
coordinates the use of resources by VMs. For example, the
VMM may include a CPU scheduler to run threads of VMs
based on a pre-determined scheduling policy and VM priority.
• Partitioning — Resources are partitioned so that each VM gets
the portion of resources allocated to it. Partitioning can be
dynamically adjusted by the VMM based on the utilization of
each VM. Examples of resource partitioning include the
ballooning memory technique employed in Sun xVM Server
and VMware, and the allocation of CPU resources in Logical
Domains technology.
• Delegating — With delegating, resources are not directly
accessible by a VM. Instead, all resource accesses are made
through a control VM that has direct access to the resource.
I/O device virtualization is normally implemented via delegation.
The distinction and boundaries between the virtualization
methods are often not clear. For example, sharing may be used
for one component and partitioning used in others, and together
they make up an integral functional module.
• Workload Consolidation
According to Gartner [17] "Intel servers running at 10 percent
to 15 percent utilization are common." Many IT organizations
run out and buy a new server every time they deploy a new
application. With virtualization, computers no longer have to be
dedicated to a particular task. Applications and users can
share computing resources, remaining blissfully unaware that
they are doing so. Companies can shift computing resources
around to meet demand at a given time, and get by with less
infrastructure overall. When used for consolidation,
virtualization can also save
hardware and maintenance expenses, floor space, cooling
costs, and power consumption.
• Workload Migration
Hardware level virtualization decouples the OS from the
underlying physical platform resources. A guest OS state,
along with the user applications running on top of it, can be
encapsulated into an entity and moved to another system. This
capability is useful for migrating a legacy OS system from an
old under-powered server to a more powerful server while
preserving the investment in software. When a server needs to
be taken offline for maintenance, its workload can be migrated
to another system with minimal disruption.
Scope
This paper explores the underlying hardware architecture and
software implementation for enabling hardware virtualization.
Particular emphasis is placed on the limitations of CPU hardware
architectures for virtualizing CPU services and on their
software workarounds. In addition, this paper discusses in detail
the software architecture for implementing the following types of
virtualization:
• CPU virtualization — uses processor privileged mode to
control resource usage by the VM, and relays hardware traps
and interrupts to VMs
• Memory virtualization — partitions physical memory among
multiple VMs and handles page translations for each VM
• I/O virtualization — uses a dedicated VM with direct access to
I/O devices to provide device services
Section I
Background Information
Chapter 2
Virtual Machine Monitor Basics
VMM Requirements
A software program communicates with the computer hardware
through instructions. Instructions, in turn, operate on registers
and memory. If any of the instructions, registers, or memory
involved in an action is privileged, that instruction results in a
privileged action. Sometimes an action that is not necessarily
privileged attempts to change the configuration of resources in
the system. Such an action can affect other actions whose
behavior or result depends on the configuration of resources.
The instructions that result in such operations are called
sensitive instructions.
In the context of the virtualization discussion, a processor's
instructions can be classified into three groups:
• Privileged instructions are those that trap if the processor is in
non-privileged mode and do not trap if it is in privileged mode.
• Sensitive instructions are those that change or reference the
configuration of resources (memory), affect the processor
mode without going through the memory trap sequence (page
fault), or reference the sensitive registers whose contents
change when the processor switches to run another VM.
• Non-privileged and non-sensitive instructions are those that do
not fall into either the privileged or sensitive categories
described above.
Sensitive instructions have "a major bearing on the
virtualizability of a machine" [1] because of their system-wide
effects.
Types of VMM
In a virtualized environment, the VMM controls the hardware
resources. VMMs can be categorized into two types, based on
this control of resources:
A Type I VMM runs directly on the platform hardware and retains
control of hardware resources; guest operating systems run on top
of the VMM. Sun xVM Server is one example: the GRUB bootloader
first loads the Sun xVM Hypervisor for x86, xen.gz. After the VMM
gains control of the hardware, it loads the Solaris kernel,
/platform/i86xpv/kernel/$ISADIR/unix, to run as a GOS.
Sun's Logical Domains and VMware's Virtual Infrastructure 3 [4]
(formerly known as VMware ESX Server), described in detail in
Chapter 7 "Logical Domains" on page 79 and Chapter 8
"VMware" on page 97, are also Type I VMMs.
A Type II VMM typically runs inside a host OS kernel as an add-
on module, and the host OS maintains control of the hardware
resources. The GOS in a Type II VMM is a process of the host
OS. A Type II VMM leverages the kernel services of the host OS
to access hardware, and intercepts a GOS's privileged
operations and performs these operations in the context of the
host OS. Type II VMMs have the advantage of preserving the
existing installation by allowing a new GOS to be added to a
running OS.
An example of a Type II VMM is VMware's VMware Server
(formerly known as VMware GSX Server).
Figure 2 illustrates the relationships among hardware, VMM,
GOS, host OS, and user application in virtualized environments.
Figure 2. Virtual machine monitors vary in how they support
guest OS, host OS, and user applications in virtualized
environments.
VMM Architecture
As discussed in "VMM Requirements" on page 9, the VMM
performs some of the functions that an OS normally does:
namely, it controls and arbitrates CPU and memory resources,
and provides services to upper layer software for sensitive and
privileged operations. These functions require the VMM to run in
privileged mode and the OS to relinquish the privileged and
sensitive operations to the VMM. In addition to processor and
memory operations, I/O device support also has a large impact on
VMM architecture.
VMM in Privileged Mode
A processor typically has two or more privileged modes. The
operating system kernel runs in the privileged mode. The user
applications run in a non-privileged mode and trap to the kernel
when they need to access system resources or services from
the kernel.
The GOS normally assumes it runs in the most privileged mode
of the processor. Running a VMM in a privileged mode can be
accomplished with one of the following three methods:
• Deprivileging the GOS — This method usually requires a
modification to the OS to run at a lower privilege level. For x86
systems, the OS normally runs at protected ring 0, the most
privileged level. In Sun xVM Server, ring 0 is reserved to run
the VMM. This requires the GOS to be modified, or
paravirtualized, to run outside of ring 0 at a lower privilege
level.
• Hyperprivileging the VMM — Instead of changing the GOS to
run at lower privilege, another approach taken by the chip
vendors is to create a hyperprivileged processor mode for the
VMM. The Sun UltraSPARC T1 and T2 processors'
hyperprivileged mode [2], Intel VT's VMX root operation (see
[7] Volume 3B, Chapter 19), and AMD-V's VMRUN/#VMEXIT state
(see [9] Chapter 15) are examples of hyperprivileged
processor support for VMM operations.
• Both VMM and GOS run in the same privileged mode — It is
possible to have both the VMM and GOS run in the same
privileged mode; the VMM must then protect itself by other
means, such as intercepting and rewriting the GOS's sensitive
instructions.
Figure 3. Example physical-to-machine memory mapping.
A ballooning technique [5] has been used in some virtualization
products to achieve better utilization of physical memory among
VMs. The idea behind the ballooning technique is simple. The
VMM controls a balloon module in a GOS. When the VMM
wants to reclaim memory, it inflates the balloon to increase
pressure on memory, forcing the GOS to page out memory to
disk. If the demand for physical memory decreases, the VMM
deflates the balloon in a VM, enabling the GOS to claim more
memory.
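The following C sketch illustrates the ballooning policy described above. It is only a sketch: every name in it (vm_t, balloon_inflate, balloon_deflate, the watermark test) is hypothetical rather than an interface of Sun xVM Server or VMware.

typedef struct vm {
        unsigned long pages_allocated;  /* machine pages currently held by the VM */
        unsigned long pages_demanded;   /* estimated working set of the VM */
} vm_t;

extern void balloon_inflate(vm_t *, unsigned long);  /* GOS pages out, returns memory */
extern void balloon_deflate(vm_t *, unsigned long);  /* GOS may claim memory again */

/* Rebalance one VM against a host free-memory watermark (illustrative). */
static void
balloon_rebalance(vm_t *vm, unsigned long host_free, unsigned long watermark)
{
        if (host_free < watermark && vm->pages_allocated > vm->pages_demanded)
                balloon_inflate(vm, vm->pages_allocated - vm->pages_demanded);
        else if (host_free > watermark && vm->pages_demanded > vm->pages_allocated)
                balloon_deflate(vm, vm->pages_demanded - vm->pages_allocated);
}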
Synchronization between the GOS page table and the shadow
page table is handled by the VMM when page faults occur.
Figure 4 shows three different page translation implementations
in the Solaris OS on x86 and SPARC platforms.
I/O Virtualization
I/O devices are typically managed by a special software module
called a device driver, which runs in the kernel context. Due to
the wide variety of device types and device
drivers, the VMM either includes few device drivers or leaves
device management entirely to the GOS. In the latter case,
because of existing device architecture limitations (discussed
later in the section), devices can only be exclusively managed by
one VM.
This constraint creates some challenges for I/O access by a VM,
and limits the following:
• What devices are exported to a VM
• How devices are exported to a VM
• How each I/O transaction is handled by a VM and the
VMM
Consequently, I/O has the most challenges in the areas of
compatibility and performance for virtual machines. In order to
explain what devices are exported and how they are exported, it
is first necessary to understand the options available to handle
I/O transactions in a VM.
[Figure 5: I/O access options for VMs — native drivers with direct device access, I/O transaction emulation with virtual drivers, and device emulation with native drivers on emulated devices (network chip and SCSI controller on a Sun x64 server).]
Figure 5 shows the options available for VMs to manage devices.
A VM that has direct I/O access uses the existing driver in the
GOS to communicate directly with the device. VM1 and VM3 in
Figure 5 have direct I/O access to
devices. VM1 is also a special I/O VM that provides virtual I/O
for other VMs, such as VM2, to access devices.
Virtual I/O is made possible by controlling the device types
exported to a VM. There are two different methods of
implementing virtual I/O: I/O transaction emulation (shown in
VM2 in Figure 5) and device emulation (shown in VM4).
• I/O transaction emulation requires virtual drivers on both ends
for each type of I/O transaction (data and control functions). As
shown in Figure 5, the virtual driver on the client side (VM2)
receives I/O requests from applications and forwards requests
through the VMM to the virtual driver on the server side (VM1);
the virtual driver on the server side then sends out the request
to the device.
I/O transaction emulation is typically used in paravirtualization
because the OS on the client side needs to include the special
drivers to communicate with its corresponding driver in the OS
on the server side, and needs to add kernel interfaces for
inter-domain communication using the VMM services.
However, it is possible to have PV drivers in an un-
paravirtualized OS (full virtualization) for better I/O
performance. For example, Solaris 10, which is not
paravirtualized, can include PV drivers on a HVM-capable
system to get better performance than that achieved using
device emulation drivers such as QEMU. (See "Sun xVM
Server with HVM I/O Virtualization (QEMU)" on page 71.)
I/O transaction emulation may cause application compatibility
issues if the virtual driver does not provide all data and control
functions (for example, ioctl(2)) that the existing driver does.
• Device emulation provides an emulation of a device type,
enabling the existing driver for the emulated device in a GOS
to be used. The VMM exports emulated device nodes to a VM
so that the existing drivers for the emulated devices in a GOS
are used. By doing this, the VMM controls the driver used by a
GOS for a particular device type; for example, using the
e1000g driver for all network devices. Thus, the VMM can
focus on the emulation of underlying hardware using one
driver interface. Driver accesses to the I/O register and port in
a GOS, which will result in a trap due to invalid address, are
caught and converted to access the real device hardware.
VM4 in Figure 5 uses native OS drivers to access emulated
devices exported by the VMM.
Device emulation is in general less efficient than I/O
transaction emulation and more limited in the platforms
supported. However, device emulation does not require changes in
the GOS and, therefore, is typically used to provide full
virtualization to a VM.
PCI configuration address layout:
31:24 Reserved | 23:16 Bus Number | 15:11 Device Number | 10:8 Function Number | 7:2 Register Number | 1:0 00
addr(VFn, BARa) = addr(VF1, BARa) + (n - 1) x (VF BARa aperture size)

where addr(VF1, BARa) is the starting address of BARa for the
first VF and (VF BARa aperture size) is the size of the VF BARa
as determined by writing a value of all 1's to BARa and reading
the value back. Using this mechanism, a GOS in a VM is able to
share the device with other VMs while performing device
operations that pertain only to the VM.
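As a small illustration of the address computation above (the types and helper name are assumptions, not from the SR-IOV specification):

#include <stdint.h>

/* Starting address of BARa for the n-th VF (n is 1-based). */
static uint64_t
vf_bar_addr(uint64_t vf1_bar_addr, uint64_t vf_bar_aperture_size, unsigned n)
{
        return vf1_bar_addr + (uint64_t)(n - 1) * vf_bar_aperture_size;
}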
DMA
In many current implementations (especially in most x86
platforms), physical addresses are used in DMA. Since a VM
shares the same physical address space on the system with
other VMs, a VM might read/write to another VM's memory
through DMA. For example, a device driver in a VM might write
the memory contents that belong to other VMs to a disk and
read the data back into the VM's memory. This causes a
potential breach in security and fault isolation among VMs.
To provide isolation during DMA operation, the ATS specification
defines a scheme for a VM to use the address mapped to its
own physical memory for DMA operation. (This approach is
used in similar designs such as the IOMMU Specification [31] and
DMA Remapping [28].) This DMA ATS enables DMA memory to
be partitioned into multiple domains, and keeps DMA
transactions on one domain isolated from other domains.
Figure 7 shows device DMA with and without ATS. With DMA
ATS, the DMA address is like a virtual address that is associated
with a context (VM). DMA transactions initiated by a VM can
only be associated with the memory owned by the VM. DMA
ATS is a chipset function that resides outside of the processor.
[Figure 7: Device DMA with and without ATS/IOMMU — a PCI device issues DMA through the south bridge using PA directly, or using DVA/GPA translated by the IOMMU. PA = Physical Address; HPA = Host Physical Address; DVA = Device Virtual Address; GPA = Guest Physical Address.]
Protected Mode
The x86 architecture protected mode provides a protection
mechanism to limit access to certain segments or pages and
prevent unprivileged access. The processor's segment-
protection mechanism recognizes 4 privilege levels, numbered
from 0 to 3 (Figure 8). The greater the level number, the lesser
the privileges provided.
Segment selector layout (15:3 Index | 2 TI | 1:0 RPL):
Index: selects one of up to 8K descriptors (bits 3-15)
TI: Table Indicator; 0=GDT, 1=LDT
RPL: Requested Privilege Level
Figure 10. Each segment register has a visible and a hidden
part.
The hidden fields of a segment register are loaded to the
processor from a descriptor table and are stored in the
descriptor cache registers. The descriptor cache registers, like
the TLB, allow the hardware processor to refer to the contents of
the segment register's hidden part without further reference to
the descriptor table. Each time a segment register is loaded, the
descriptor cache register gets fully loaded from the descriptor
table. Since each VM has its own descriptor table (for example,
the GDT), the VMM has to maintain a shadow copy of each
VM's descriptor table. A context switch to a VM will cause the
VM's shadow descriptor table to be loaded to the hardware
descriptor table. If the content of the descriptor table is changed
by the VMM because of a context switch to another VM, the
segment is non-reversible, which means the segment cannot be
restored if an event such as a trap causes the segment to be
saved and replaced.
The Current Privilege Level (CPL) is stored in the hidden portion
of the segment register. The CPL is initially equal to the privilege
level of the code segment from which it is being loaded. The
processor changes the CPL when program control is transferred
to a code segment with a different privilege level.
The segment descriptor contains the size, location, access
control, and status information of the segment that is stored in
either the LDT or GDT. The OS sets segment descriptors in the
descriptor table and controls which descriptor entry to use for a
segment (Figure 11). See "CPU Privilege Mode" on page 45 for
a discussion of setting the segment descriptor in the Solaris OS.
Segment descriptor layout:
High doubleword: Base 31:24 (bits 31:24) | G (23) | D/B (22) | L (21) | AVL (20) | Segment Limit 19:16 (19:16) | P (15) | DPL (14:13) | S (12) | Type (11:8) | Base 23:16 (7:0)
Low doubleword: Base 15:00 (bits 31:16) | Segment Limit 15:00 (15:0)
L: 64-bit code segment
AVL: Available for use by system software
Base: Segment base address
D/B: Default operation size (0=16-bit segment, 1=32-bit segment)
DPL: Descriptor Privilege Level
G: Granularity
P: Segment present
S: Descriptor type (0=system, 1=code or data)
Type: Segment type
Figure 11. Segment descriptor.
The privilege check performed by the processor recognizes
three types of privilege levels: requested privilege level (RPL),
current privilege level (CPL), and descriptor privilege level
(DPL). A segment can be loaded if the DPL of the segment is
numerically greater than or equal to both the CPL and the RPL.
In other words, a segment can be
accessed only by code that has equal or higher privilege level.
Otherwise, a general-protection fault exception, #GP, is generated
and the segment register is not loaded.
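The check can be restated compactly in C. This is an illustrative restatement of the rule above, not processor microcode:

#include <stdbool.h>

/* Privilege levels run from 0 (most privileged) to 3 (least privileged). */
typedef unsigned int plevel_t;

/*
 * A segment may be loaded only if its DPL is numerically greater than
 * or equal to both the CPL and the RPL; otherwise the processor raises
 * #GP and leaves the segment register unchanged.
 */
static bool
segment_load_permitted(plevel_t cpl, plevel_t rpl, plevel_t dpl)
{
        return dpl >= cpl && dpl >= rpl;
}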
On 64-bit systems, linear address space (flat memory model) is
used to create a continuous, unsegmented address space for
both kernel and application programs. Segmentation is disabled
in the sense that privilege checking cannot be applied to the
VA-to-LA translation, because that translation no longer exists.
The only protection left to prevent a user application from
accessing kernel memory is the page protection mechanism. This
is why the kernel of a GOS has to run in ring 3 (user mode in
page-level protection) on a 64-bit system.
Paging Architecture
When operating in protected mode, the LA-to-PA translation is
performed by the paging hardware of the x86 processor. To
access data in memory, the processor requires the presence of
a VA-to-PA translation in the TLB (in Solaris, LA is equal to VA),
the page table backing up the TLB entry, and a page of physical
memory. For the x86 processor, loading the VA-to-PA page
translation from the page table to the TLB is performed
automatically by the processor. The OS is responsible for
allocating physical memory and loading the VA-to-PA translation
into the page table.
When the processor cannot load a translation from the page
table, it generates a page fault exception, #PF. A #PF exception on
x86 processors usually means a physical page has not been
allocated, because the loading of the translation from the page
table to the TLB is handled by the processor (Figure 12).
Timer Devices
An OS typically uses several timer devices for different
purposes. Timer devices are characterized by their frequency
granularity, frequency reliability, and ability to generate interrupts
and receive counter input. Understanding the characteristics of
timer devices is important for the discussion of timekeeping in a
virtualized environment, as the VMM provides virtualized
timekeeping of some timers to its overlying VMs. Virtualized
timekeeping has a significant impact on the accuracy of time in
the GOS.
Chapter 4
SPARC Processor Architecture
• Privileged registers
The UltraSPARC T2 processor [33] is built upon the UltraSPARC
T1 architecture. It has the following enhancements over the
UltraSPARC T1 processor:
• Eight strands per core (for a total of 64 strands)
• Two integer pipelines per core, with each integer pipeline
supporting 4 strands
• Eight banks of 4 MB L2 cache
• A floating-point and graphics unit (FGU) per core
The UltraSPARC T2 has a total of 64 strands in 8 cores, and each core has its own
floating-point and graphics unit (FGU). This allows up to 64 domains to be created
on the UltraSPARC T2 processor. This design also adds integrated support for
industry-standard I/O interfaces such as PCI Express and 10 Gb Ethernet.
Table 3 summarizes the association of processor components to physical processor,
core, and strand.
[Figure: sun4v MMU translation — the TSB, managed by the GOS, holds VA-to-RA translations; on a TLB miss the hypervisor resolves RA-to-PA and loads the VA-to-PA translation into the TLB.]
control for the current TL. The ability to support nested traps in SPARC
processors makes the implementation of an OS trap handler easier
and more efficient, as the OS doesn't need to explicitly save the current
trap stack information.
On UltraSPARC T1/T2 processors, each strand has a full set of trap
control and stack registers which include TT, TL, TPC, TNPC, TSTATE, HTSTATE
(hyperprivileged trap state), TBA, HTBA (hyperprivileged trap base
address), and PIL (priority interrupt level). This design feature allows
each strand to receive traps independently of other strands. This
capability significantly helps trap handling and management by the
hypervisor, as traps are delivered to a strand without being queued up
in the hypervisor.
Interrupts
On SPARC platforms, interrupt requests are delivered to the CPU as
traps. Traps 0x041 through 0x04F are used for Priority Interrupt Level
(PIL) interrupts, and trap 0x60 is used for the vector interrupt. There
are 15 interrupt levels for PIL interrupts. Interrupts are serviced in
accordance to their PIL, with higher PILs having higher priority. The
vector interrupt is used to support the data bearing vector interrupt
which allows a device to include its private data in the interrupt packet
(also known as the mondo).
Section II
Hardware Virtualization Implementations
Chapter 5
Sun xVM Server
Note - Sun xVM Server includes support for the Xen open
source community work on the x86 platform and support for
LDoms on the UltraSPARC T1/T2 platform. In this paper, in
order to distinguish the discussion of x86 and UltraSPARC
T1/T2 processors, Sun xVM Server is specifically used to refer
to the Sun hardware virtualization product for the x86 platform,
and LDoms is used to refer to the Sun hardware virtualization
product for the UltraSPARC T1 and T2 platforms.
This chapter is organized as follows:
"Sun xVM Server Architecture Overview" on page 40 provides an
overview of the Sun xVM Server architecture.
"Sun xVM Server CPU Virtualization" on page 45 discusses the
CPU virtualization employed by Sun xVM Server.
"Sun xVM Server Memory Virtualization" on page 52 describes
memory management issues.
Hypercalls
The Sun xVM Server hypercalls are a set of interfaces used by a
GOS to request service from the VMM. The hypercalls are
invoked in a manner similar to OS system calls: a software
interrupt is issued which vectors to an entry point within the
VMM. Hypercalls use INT $0x82 on a 32-bit system and SYSCALL on
a 64-bit system, with the particular hypercall contained in the
%eax register.
For example, the common routine for hypercalls with four
arguments on a 64-bit Solaris kernel is:
long
__hypercall4(ulong_t callnum, ulong_t a1, ulong_t a2, ulong_t a3,
ulong_t a4);
The sharing of SYSCALL between Solaris system calls and the
hypercalls means that a SYSCALL made by a user process in
Solaris is delivered indirectly by the VMM to the Solaris kernel.
This causes a slight overhead for each Solaris system call.
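For example, the mmu_update hypercall listed in Table 4 can be reached through this stub. The wrapper below is a sketch of how such a call might be expressed, with __HYPERVISOR_mmu_update standing for the hypercall number passed in %eax; the exact Solaris wrapper may differ.

long
HYPERVISOR_mmu_update(mmu_update_t *req, int count, int *success_count,
    domid_t domid)
{
        /* callnum selects the hypercall; the remaining arguments are
         * passed through to the VMM. */
        return (__hypercall4(__HYPERVISOR_mmu_update, (ulong_t)req,
            (ulong_t)count, (ulong_t)success_count, (ulong_t)domid));
}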
A complete list of Sun xVM Server hypercalls is provided in Table 4.
Table 4. Sun xVM Server hypercalls.
Privilege Operations:
long set_trap_table(trap_info_t *table);
long mmu_update(mmu_update_t *req, int count, int *success_count, domid_t domid);
long set_gdt(ulong_t *frame_list, int entries);
long stack_switch(ulong_t ss, ulong_t esp);
long fpu_taskswitch(int set);
long mmuext_op(struct mmuext_op *req, int count, int *success_count, domid_t domain_id);
long update_descriptor(maddr_t ma, uint64_t desc);
long update_va_mapping(ulong_t va, uint64_t new_pte, ulong_t flags);
long set_timer_op(uint64_t timeout);
long physdev_op(void *physdev_op);
long vm_assist(uint_t cmd, uint_t type);
long update_va_mapping_otherdomain(ulong_t va, uint64_t new_pte, ulong_t flags, domid_t domain_id);
long iret();
long set_segment_base(int reg, ulong_t value);
long nmi_op(ulong_t op, void *arg);
long hvm_op(int cmd, void *arg);
VMM Services:
Control VM operations such as platform_op, domain control, and virtual CPU control.
An example use of a hypercall is to request a set of page table updates. For example,
a new process created by the fork(2) call requires the creation of page tables. The
hypercall HYPERVISOR_mmu_update(), which validates and applies a list of page
table updates, is used for this purpose.
Event Channels
To a GOS, a VMM event is the equivalent of a hardware
interrupt. Communication from the VMM to a VM is provided
through an asynchronous event mechanism, called an event
channel, which replaces the usual delivery mechanisms for
device interrupts. A VM creates an event channel to send and
receive asynchronous event notifications.
• Virtual interrupts
A VM can bind an event-channel port to a virtual interrupt
source, such as the virtual-timer device.
Event channels are addressed by a port. Each channel is
associated with two bits of information:
• unsigned long evtchn_pending[sizeof(unsigned long) * 8] —
This bit notifies the domain that there is a pending notification
to be processed. This bit is cleared by the GOS.
• unsigned long evtchn_mask[sizeof(unsigned long) * 8] —
This bit specifies whether the event channel is masked. If this bit
is clear and PENDING is set, an asynchronous upcall will be
scheduled, as sketched below. This bit is only updated by the GOS;
it is read-only within the VMM.
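A minimal sketch of that delivery rule (the helper do_upcall is illustrative, not a Xen interface):

extern void do_upcall(unsigned int port);   /* illustrative upcall hook */

/* Deliver an upcall for event-channel `port` only if it is pending
 * and not masked. */
static void
maybe_deliver(unsigned long *pending, unsigned long *mask, unsigned int port)
{
        unsigned int word = port / (sizeof (unsigned long) * 8);
        unsigned long bit = 1UL << (port % (sizeof (unsigned long) * 8));

        if ((pending[word] & bit) && !(mask[word] & bit))
                do_upcall(port);
}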
Interrupts to a VM are virtualized by mapping them to event
channels. These interrupts are delivered asynchronously to the
target domain using a callback supplied via the set_callbacks
hypercall. A guest OS can map these events onto its standard
interrupt dispatch mechanisms. The VMM is responsible for
determining the target domain that will handle each physical
interrupt source.
"Interrupts and Exceptions" on page 49 provides a detailed
discussion of how an interrupt is handled by the VMM and
delivered to a VM using an event channel.
Grant Tables
The Sun xVM Hypervisor for x86 allows sharing memory among
VMs, and between the VMM and a VM, through a grant table
mechanism. Each VM makes some of its pages available to other
VMs by granting access to its pages. The grant table is a data
structure that a VM uses to expose some of its pages, specifying
what permissions other VMs have on its pages. The following
example shows the information stored in a grant table entry:
struct grant_entry {
        /* GTF_xxx: various type and flag information. [XEN,GST] */
        uint16_t flags;
        /* The domain being granted foreign privileges. [GST] */
        domid_t domid;
        /* Page frame number (PFN). */
        uint32_t frame;
};
The flags field stores the type and various flag information of the
grant table. There are three types of grant table entries:
• GTF_invalid — Grants no privileges.
• GTF_permit_access — Allows the domain domid to map/access
the specified frame.
• GTF_accept_transfer — Allows the domain domid to transfer
ownership of a page frame to this VM.
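As a hedged illustration of filling in a grant entry to share a page (the function and its callers are assumed, not Xen interfaces):

/* Grant domain `peer` access to the local page frame `pfn`. */
static void
grant_page(struct grant_entry *gnttab, int ref, domid_t peer, uint32_t pfn)
{
        gnttab[ref].domid = peer;
        gnttab[ref].frame = pfn;
        /* Write flags last so the entry becomes valid only when complete. */
        gnttab[ref].flags = GTF_permit_access;
}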
XenStore
XenStore [22] is a shared storage space used by domains to
communicate and store configuration information. XenStore is
the mechanism by which control-plane activities, including the
following, occur:
• Setting up shared memory regions and event channels for use
with split device drivers
• Notifying the guest of control events (for example, balloon
driver requests)
• Reporting status information from the guest (for example,
performance-related statistics)
The store is arranged as a hierarchical collection of key-value
pairs. Each domain has a directory hierarchy containing data
related to its configuration. Domains are permitted to register for
notifications about changes in a subtree of the store, and to apply
changes to the store transactionally.
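For illustration, keys follow hierarchical paths of this general shape (the exact layout is release-dependent; treat these paths as an assumed example):

/local/domain/<domid>/name
/local/domain/<domid>/device/vbd/<devid>/state
/local/domain/0/backend/vbd/<domid>/<devid>/state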
% cat intel/sys/segments.h
#if defined(__amd64)
#define SEL_XPL 0       /* xen privilege level */
#define SEL_KPL 3       /* both kernel and user in ring 3 */
#elif defined(__i386)
#define SEL_XPL 0       /* xen privilege level */
#define SEL_KPL 1       /* kernel privilege level under xen */
#endif  /* __i386 */
If both kernel and user application run with the same privilege
level, how does Sun xVM Server protect the kernel from user
applications? The answer is given as follows [32]:
[Figure 18 layout: a 64-bit virtual address is decoded as Sign Extended (63:48) | PML4 (47:39) | PDP (38:30) | PDE (29:21) | PTE (20:12) | Offset (11:0); the address space is partitioned into kernel (ring 3), reserved, and user (ring 3) regions.]
Figure 18. Address space partitioning in the 64-bit Sun xVM
Server.
As discussed previously (see "Segmented Architecture" on page
23), the processor privilege level is set when a segment is
loaded. The Solaris OS uses the GDT for user and kernel
segments. The segment index of each segment type is assigned
as shown in Table 5 on page 54.
The command kmdb(1M) can be used to examine the segment
descriptor of kernel code:
CPU Scheduling
The Sun xVM Hypervisor for x86 provides two schedulers for the
user to choose between: Credit and simple Earliest Deadline
First (sEDF). The Credit scheduler is the default scheduler; sEDF
might be phased out and removed from the Sun xVM Server
implementation.
The Credit scheduler is a proportional fair share CPU scheduler.
Each physical CPU (PCPU) manages a queue of runnable virtual
CPUs (VCPUs). This queue is sorted by VCPU priority. A
VCPU's priority can be either over or under, representing
whether this VCPU has exceeded its share of the PCPU or not.
A VCPU's share is determined by weight assigned to the VM and
credit accumulated by the VCPU in each accounting period.
Credit_VMi = (Credit_total x Weight_VMi + (Weight_total - 1)) / Weight_total

Credit_VCPU = Credit_VMi / TotalVCPU_VMi
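A C sketch of the reconstructed computation (the rounding via the added (Weight_total - 1) term is inferred from the formula, not taken from the scheduler source):

/* Per-VM credit, rounded up, then split evenly among the VM's VCPUs. */
static long
vcpu_credit(long credit_total, long weight_vm, long weight_total, int nvcpu)
{
        long credit_vm = (credit_total * weight_vm + (weight_total - 1))
            / weight_total;
        return credit_vm / nvcpu;
}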
void
init_desctbls(void)
{
        uint_t vec;

        init_idt(&idt0[0]);
        for (vec = 0; vec < NIDT; vec++)
                xen_idt_write(&idt0[vec], vec);
}

void
xen_idt_write(gate_desc_t *sgd, uint_t vec)
{
        trap_info_t trapinfo[2];

        bzero(trapinfo, sizeof (trapinfo));
        if (xen_idt_to_trap_info(vec, sgd, &trapinfo[0]) == 0)
                return;
        if (xen_set_trap_table(trapinfo) != 0)
                panic("xen_idt_write: xen_set_trap_table() failed");
}
sd`sdintr:
sd`sdintr:      ec8b4855 = pushq %rbp
[0]> $c
sd`sdintr(fffffffec0670000)
mpt`mpt_intr+0xdb(fffffffec0670000, 0)
av_dispatch_autovect+0x78(1b)
dispatch_hardint+0x33(1b, 0)
switch_sp_and_call+0x13()
do_interrupt+0x9b(ffffff0001005ae0, 1)
xen_callback_handler+0x36c(ffffff0001005ae0, 1)
xen_callback+0xd9()
HYPERVISOR_sched_op+0x29(1, 0)
HYPERVISOR_block+0x11()
mach_cpu_idle+0x52()
cpu_idle+0xcc()
idle+0x10e()
thread_start+8()
[0]>
Timer Services
"Timer Devices" on page 27 discusses several hardware timers
available on x86 systems. These hardware devices vary in their
frequency reliability, granularity, counter size, and ability to
generate interrupts. The Solaris OS employs some of these timer
devices for running the OS clock and high resolution timer:
• OS system clock — The Solaris OS uses the local APIC timer
on multiprocessor systems to generate ticks for the system
clock. On uniprocessor systems, the Solaris OS uses the PIT to
generate ticks for the system clock.
• High resolution timer — The Solaris OS uses the TSC timer for
a high resolution timer. The PIT counter is used to calibrate the
TSC counter.
• Time-of-day clock — The time-of-day (TOD) clock is based on
the RTC. Only Dom0 can set the TOD clock. The DomU VMs
don't have permission to update the machine's physical
RTC. Therefore, any attempt by the date(1) command to set
the date and time on a DomU will be quietly ignored.
In Sun xVM Server, the VMM provides the system time to each
VCPU when it is scheduled to run. The high resolution timer,
gethrtime(), is still run through the
unprivileged RDTSC instruction, so the high resolution timer is not
virtualized. The virtualized system time relies on the current TSC
value.
• XENMEM_increase_reservation
• XENMEM_decrease_reservation
• XENMEM_populate_physmap
Page Translations
"Segmented Architecture" on page 23 describes two stages of
address translation to arrive at a physical address: virtual
address (VA) to linear address (LA) translation using
segmentation, and LA to physical address (PA) translation using
paging. Solaris x64 uses a flat address space in which the VA
and LA are equivalent, which means the base address of the
segment is 0. In Solaris 10, the Global Descriptor Table (GDT)
contains the segment descriptor for the code and data segments
of both kernel and user processes, as shown in Table 5 on page
54.
Since there is only one GDT in a system, the VMM maintains the
GDT in its memory. If a GOS wishes to use something other than
the default segment mapping that the VMM GDT provides, it
must register a custom GDT with the VMM using the set_gdt()
hypercall. In the following code sample, frame_list is the physical
address of the page that contains the GDT and entries is the
number of entries in the GDT.
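A minimal sketch of such a registration call, consistent with the set_gdt() prototype in Table 4 (gdt_mfn and nentries are assumed variables, and the wrapper name is illustrative):

ulong_t frame_list[1];

frame_list[0] = gdt_mfn;        /* machine frame holding the GDT page */
if (HYPERVISOR_set_gdt(frame_list, nentries) != 0)
        panic("HYPERVISOR_set_gdt() failed");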
The Solaris 32-bit thread library uses %gs to refer to the LWP
state manipulated by the internals of the thread library. The 64-bit
thread library uses %fs to refer to the LWP state as specified by
the AMD64 ABI [8]. The 64-bit kernel still uses %gs for its CPU
state (%fs is never used in the kernel). The KernelGSBase MSR
is used to store the kernel %gs base while the processor switches
to run the 32-bit user LWP. The privileged instruction SWAPGS is
used to restore the kernel %gs during the context switch to the
kernel context. So when the VMM performs a context switch
between the guest kernel mode and the guest user mode, it
executes SWAPGS as part of the context switch (see "CPU Privilege
Mode" on page 45).
Every LWP context switch requires an update to the GDT for the
new LWP. The GOS uses
update_descriptor() for the task:
intel/ia32/os/desctbls.c
void
set_pteval(paddr_t table, uint_t index, uint_t level, x86pte_t pteval)
{
        ...
        ma = pa_to_ma(PT_INDEX_PHYSADDR(pfn_to_pa(ht->ht_pfn), entry));
        t[0].ptr = ma | MMU_NORMAL_PT_UPDATE;
        t[0].val = new;
        if (HYPERVISOR_mmu_update(t, cnt, &count, DOMID_SELF))
                panic("HYPERVISOR_mmu_update() failed");
        ...
}
[3]> :c
kmdb: stop at xen_pin+0x3a
kmdb: target stopped at:
xen_pin+0x3a:   call +0x208b1 <HYPERVISOR_mmuext_op>
[3]> $c
xen_pin+0x3a(ff2c, 3)
hat_alloc+0x285(fffffffec381b7e8)
as_alloc+0x99()
as_dup+0x3f(fffffffec381ba88, fffffffec3d0f8d0)
cfork+0x102(0, 1, 0)
forksys+0x25(0, 0)
sys_syscall32+0x13e()
The Sun xVM Server-related driver modules for Dom0 and DomU
respectively are shown below:
Disk Driver
The xdb driver, the back-end driver on Dom0, is used to provide
services for block device management. This driver receives I/O
requests from DomU domains and sends them on to the native
driver. On DomU, xdf is the pseudo block driver that gets the I/O
requests from applications and sends them to the xdb driver in
Dom0. The xdf driver provides functions similar to those of the
SCSI target disk driver, sd, on an unvirtualized Solaris system.
On Solaris systems, the main interface between a file system and
a storage device is the strategy(9E) driver entry point. The
strategy(9E) entry point takes only one argument, buf(9S), which
is the basic data structure for block I/O transfer. The I/O request
made by a file system to the strategy(9E) entry point is called
PAGEIO, as the memory buffer for the I/O is allocated from the
kernel page pool. An application can also open the storage
device as a raw device and perform read(2) and write(2)
operations directly on the raw device. Such an I/O request is
called PHYSIO, physio(9F), as the memory buffer for the I/O is
allocated by the application.
In addition to the strategy(9E) driver entry point for supporting file
system and raw device access, a disk driver also supports a set
of ioctl(2) operations for disk control and management. The
dkio(7I) disk control operations define a standard set of ioctl(2)
commands. Normally, support for dkio(7I) operations requires
direct access to the device. In DomU, xdf supports most ioctl(2)
commands as defined in dkio(7I) by emulating the disk control
inside xdf. No communication is made by xdf to the back-end
driver for ioctl(2) operations.
The sequence of events for disk I/O data transfer is illustrated in
Figure 20. The disk control path, ioctl(2), is similar to the data
path.
1. The file system calls the xdf driver's strategy(9E) entry point
as a result of a read(2) or write(2) system call.
2. The xdf driver puts the I/O buffer, buf(9S), on the grant table.
This buffer is allocated from the DomU memory. Permission
for other domain access is granted to this memory.
3. The xdf driver notifies Dom0 of an event through the event
channel.
4. The VMM event channel generates an interrupt to the xdb
driver in Dom0.
5. The xdb driver in Dom0 gets the DomU I/O buffer through
the grant table.
6. The xdb driver in Dom0 calls the native driver's
strategy(9E) entry point.
7. The native driver performs DMA.
8. The VMM receives the device interrupt.
9. The VMM generates an event to Domo.
10. The xdb driver's iodone() routine is called by biodone(9F).
11. The xdb driver's iodone() routine generates an event to
DomU.
12. The xdf driver in DomU receives an interrupt to free up the
grant table and DMA resources, and calls biodone(9F) to
wake up anyone waiting for it.
[Figure 20: Disk I/O data path — a read(2)/write(2) from a DomU user process flows through the file system to xdf, across the grant table and event channel to xdb in Dom0, and back via the Xen callback.]
Network Driver
The Sun xVM Server network drivers use an approach similar to
the disk block driver for handling network packets. On DomU, the
pseudo network driver xnf gets the I/O requests from the
networking stack and sends them to the back-end driver in Dom0.
[Figure: DomU and Dom0 drivers communicate through grant tables, event channels, and the Xen callback provided by the Sun xVM Hypervisor for x86.]
[Figure: VMM/VM transitions — VM entry (VM ENTER / VMRUN) and VM exit (VM EXIT / #VMEXIT).]
A VM exit can be caused by conditions such as:
• Management interrupts
• Executing certain instructions (such as RDPMC, RDTSC, or
instructions that access the control registers)
• Exceptions
The exact conditions that cause a VM exit are defined in the
VMCS control fields. Certain conditions may cause a VM exit for
one VM but not for other VMs.
VM exits behave like a fault, meaning that the instruction
causing the VM exit does not execute and no processor state is
updated by the instruction. The VM exit handler then emulates
the operation on behalf of the VM before resuming it.
struct hvm_function_table {
        void (*disable)(void);
        int  (*vcpu_initialise)(struct vcpu *v);
        void (*vcpu_destroy)(struct vcpu *v);
        void (*store_cpu_guest_regs)(struct vcpu *v,
            struct cpu_user_regs *r, unsigned long *crs);
        void (*load_cpu_guest_regs)(struct vcpu *v,
            struct cpu_user_regs *r);
        int  (*paging_enabled)(struct vcpu *v);
        int  (*long_mode_enabled)(struct vcpu *v);
        int  (*pae_enabled)(struct vcpu *v);
        int  (*guest_x86_mode)(struct vcpu *v);
        unsigned long (*get_guest_ctrl_reg)(struct vcpu *v,
            unsigned int num);
        unsigned long (*get_segment_base)(struct vcpu *v,
            enum x86_segment seg);
        void (*get_segment_register)(struct vcpu *v,
            enum x86_segment seg, struct segment_register *reg);
        void (*update_host_cr3)(struct vcpu *v);
        void (*update_guest_cr3)(struct vcpu *v);
        void (*stts)(struct vcpu *v);
        void (*set_tsc_offset)(struct vcpu *v, u64 offset);
        void (*inject_exception)(unsigned int trapnr, int errcode,
            unsigned long cr2);
        void (*init_ap_context)(struct vcpu_guest_context *ctxt,
            int vcpuid, int trampoline_vector);
        void (*init_hypercall_page)(struct domain *d,
            void *hypercall_page);
};
Emulated BIOS
The PC BIOS provides hardware initialization, boot services,
and runtime services to the OS. There are some restrictions on
VMX operation. An OS in HVM cannot operate in real mode.
Unlike a paravirtualized OS that can change its bring up
sequence for an environment without BIOS, an unmodified OS
requires an emulated BIOS to perform some real mode
operations before control is passed to the OS. Sun xVM Server
includes a BIOS emulator, hvmloader, as a surrogate for the real
BIOS.
The hvmloader BIOS emulation contains three components:
ROMBIOS, VGABIOS, and VMXAssist. Both ROMBIOS and VGABIOS are based on
the open source Bochs BIOS [23]. The VMXAssist component is
included in hvmloader to emulate real mode, which is required
by hvmloader and bootstrap loaders. The hvmloader BIOS
emulator is bootstrapped as any other 32-bit OS. After it is
loaded, hvmloader copies
its three components to pre-assigned addresses (VGABIOS at
C000:0000, VMXAssist at D000:0000, and ROMBIOS at F000:0000)
and transfers control to VMXAssist.
The hvmloader BIOS emulator does not directly interface with
physical devices. It communicates with virtual devices emulated
by QEMU.
Figure 23. I/O emulation in Sun xVM Server using QEMU for
dynamic binary translation.
Using the AMD PCNet LANCE PCI Ethernet controller as an
example, the vendor ID and device ID of the PCNet chip are
1022 and 2000, respectively. From prtconf(1M) output, the PCI
registers exported by the device are:
% prtconf -v
pci1022,2000, instance #0
    Hardware properties:
        name='assigned-addresses' type=int items=5
            value=81008810.00000000.00001400.00000000.00000080
        name='reg' type=int items=10
            value=00008800.00000000.00000000.00000000.00000000.01008810.00000000.00000000.00000000.00000080
2. The pcn driver writes to the DMA descriptor using the OUT
instruction. In pcn, pcn_send() calls pcn_OutCSR() to start the
DMA transaction. Then, pcn_OutCSR() calls ddi_put16() to write
a value to an I/O address. Next, ddi_put16() checks whether the
mapping (io_handle) is for I/O space or memory space. If the
mapping is for the I/O space, it moves its third argument to %rax
and the port ID to %rdx, and issues the OUTW instruction to the
port referenced by %dx.
pcn_send()
{
        pcn_OutCSR(pcnp, CSR0, CSR0_INEA | CSR0_TDMD);
}

static void
pcn_OutCSR(struct pcninstance *pcnp, uintptr_t reg, ushort_t value)
{
        ddi_put16(pcnp->io_handle, REG16(pcnp->io_reg, PCN_IO_RAP), reg);
        ddi_put16(pcnp->io_handle, REG16(pcnp->io_reg, PCN_IO_RDP), value);
}
ENTRY(ddi_put16)
movl ACC_ATTR(%rdi), %ecx
cmpl $_CONST(DDI_ACCATTR_IO_SPACE|
DDI_ACCATTR_DIRECT), %ecx
jne 8f
movq %rdx, %rax
movq %rsi, %rdx
outw (%dx)
ret
void
hvm_send_assist_req(struct vcpu *v)
{
        /* excerpt: p points to the vcpu's pending ioreq_t */
        p->state = STATE_IOREQ_READY;
        notify_via_xen_event_channel(v->arch.hvm_vcpu.xen_port);
}
int main_loop(void)
{
        qemu_set_fd_handler(evtchn_fd, cpu_handle_ioreq, NULL, env);
        while (1) {
                main_loop_wait(10);
        }
}

/* excerpt from the QEMU network transmit path */
qemu_send_packet(s->vc, s->buffer, s->xmit_pos);
static void tap_receive(void *opaque, const uint8_t *buf, int size)
{
        TAPState *s = opaque;
        int ret;

        /* retry the write until the packet is accepted */
        for (;;) {
                ret = write(s->fd, buf, size);
                if (ret < 0 && (errno == EINTR || errno == EAGAIN))
                        continue;
                break;
        }
}
rc = do_xen_hypercall(xc_handle, &hypercall);
The VMM sets the guest VMCS area to inject an event with
the next VM entry. The target VM will get an interrupt when the
VMM launches a VM entry to the target domain (see "Sun xVM
Server Interrupt and Exception Handling for HVM" on page 70).
Chapter 7
Logical Domains
Figure 24. The hypervisor, a thin layer of firmware, abstracts
hardware resources and presents them to the OS.
The LDoms implementation includes four components:
• UltraSPARC T1/T2 processor
• UltraSPARC hypervisor
• Logical Domain Manager (LDM)
• Paravirtualized Solaris OS
Paravirtualized Solaris OS
The Solaris kernel implementation for the UltraSPARC T1/T2
hardware class (uname -m) is referred to as the Solaris sun4v
architecture. In this implementation, the Solaris OS is
paravirtualized to replace operations that require hyperprivileged
mode with hypervisor calls. The Solaris OS communicates with
the hypervisor through a set of hypervisor APIs, and uses these
APIs to request that the hypervisor perform hyperprivileged
operations.
Sun4v support for LDoms is a combination of partitioning the
UltraSPARC T1/T2 processor into strands and virtualization of
memory and I/O services. Unlike Sun xVM Server and VMware,
an LDoms domain does not share strands with other domains.
Each domain has one or more strands assigned to it, and each
strand has its own hardware resources so that it can execute
instructions independently of other strands. The virtualization of
CPU functions to support CMT is implemented at the processor
rather than at the software level (that is, there is no software
scheduler). A Solaris guest OS can directly access strand-
specific registers in a domain and can, for example, perform
operations such as setting an OS trap table to the trap base
address register (TBA).
The Solaris sun4v architecture assumes that the platform
includes the hypervisor as part of its firmware. The hypervisor
runs in the hyperprivileged mode, and the Solaris
OS runs in the privileged mode of the processor. The Solaris
kernel uses hypercalls to request that the hypervisor perform
hyperprivileged functions of the processor.
Like Intel's VT and AMD's Pacifica architectures, the sun4v
architecture leverages CPU support (hyperprivileged mode) for
virtualization.
Hypervisor Services
The hypervisor layer is a component of the UltraSPARC T1/T2
system's firmware. An UltraSPARC system's firmware consists
of Open Boot PROM (OBP), Advanced Lights Out Management
(ALOM), Power-on Self Test (POST), and the hypervisor.
The hypervisor leverages the UltraSPARC T1/T2 hyperprivileged
extensions to provide a protection mechanism for running
multiple guest domains on the system. The hypervisor provides a
number of services to its overlying domains. These
services include hypervisor APIs that are the interfaces for a
GOS to request hypervisor services, and Logical Domain
Channel (LDC) services which are used by virtual device drivers
for inter-domain communications.
Hypervisor API
The Sun4v hypervisor API [11] uses the Tcc instruction to cause
the GOS to trap into hyperprivileged mode, in a similar fashion to
how OS system calls are implemented. The function of the
hypervisor API is equivalent to system calls in the OS that
enable user applications to request services from the OS. The
Sun4v hypervisor API allows a GOS to perform the following
actions:
• Request services from the hypervisor
• Get and set CPU information through the hypervisor
The UltraSPARC Virtual Machine Specification [11] lists the
complete set of services and APIs for:
• API versioning — request and check for a version of the
hypervisor APIs with which it may be compatible
• Domain services — enable a control domain to request
information about or to affect other domains
• CPU services — control and configure a strand; includes
operations such as start/stop/suspend a strand, set/get the trap
base address register, and configure the interrupt queue
• MMU services — perform MMU related operations such as
configure the TSB, map/demap the TLB, and configure the
fault status register
• Memory services — zero and flush data from cache to
memory
• Interrupt services — get/set interrupt enabled, target strand,
and state of the interrupt
• Time-of-Day services — get/set time-of-day
% mdb -k
> hv_mem_sync,6/ai
hv_mem_sync:
hv_mem_sync:            mov %o2, %o4
hv_mem_sync+4:          mov 0x32, %o5
hv_mem_sync+8:          ta %icc, %g0 + 0xc0
hv_mem_sync+0xc:        retl
hv_mem_sync+0x10:       stx %o1, [%o4]
> hv_api_set_version,6/ai
hv_api_set_version:
hv_api_set_version:     mov %o3, %o4
hv_api_set_version+4:   clr %o5
hv_api_set_version+8:   ta %icc, %g0 + 0x7f
hv_api_set_version+0xc: retl
hv_api_set_version+0x10: stx %o1, [%o4]
>
ENTRY_NP(start_master)
        setx    htraptable, %g3, %g1
        wrhpr   %g1, %htba
void
cpu_intrq_register(struct cpu *cpu)
{
        struct machcpu *mcpup = &cpu->cpu_m;
        uint64_t ret;

        ret = hv_cpu_qconf(INTR_CPU_Q, mcpup->cpu_q_base_pa,
            cpu_q_entries);
        ...
        ret = hv_cpu_qconf(INTR_DEV_Q, mcpup->dev_q_base_pa,
            dev_q_entries);
        ...
}
ENTRY(htraptable)
> trap_table+0x20*0x41,2/ai
tt_pil1:
tt_pil1:        ba,pt %xcc, +0xc33c <pil_interrupt>
0x1000824:      mov 1, %g4
>
ASI #   ASI Name                Description
0x14    ASI_REAL
0x15    ASI_REAL_IO
0x1C    ASI_REAL_LITTLE
0x1D    ASI_REAL_IO_LITTLE
0x21
0x52    ASI_MMU_REAL            MMU Register
Figure 26 illustrates the type of addressing used in each mode of
operation: virtual addressing (64-bit VA + context ID + partition
ID) in unprivileged mode, real addressing in privileged mode, and
physical addressing in hyperprivileged mode.
Figure 26. Different types of addressing are used in different
modes of operation.
Page translations in the UltraSPARC architecture are managed
by software through several different types of traps (see "Memory
Management Unit" on page 32). Depending on the trap type,
traps may be handled by the hypervisor or the GOS. Table 9
summarizes the MMU-related trap types (see also Table 12-4 in
[2]).

Trap name                          Trap Cause            TT         Handled by
fast_instruction_access_MMU_miss   iTLB Miss             0x64       Hypervisor
fast_data_access_MMU_miss          dTLB Miss             0x68       Hypervisor
fast_data_access_protection        Protection Violation  0x6c       Hypervisor
instruction_access_exception       Several               0x08       Hypervisor
data_access_exception              Several               0x30       Hypervisor
instruction_access_MMU_miss        iTSB Miss             0x09       GOS
data_access_MMU_miss               dTSB Miss             0x31       GOS
*mem_address_not_aligned           Misaligned memory op  0x34-0x39  Hypervisor
> dtsb_miss,80/ai
        wrpr %g0, 0x31, %tt     ! write 0x31 to %tt
        rdpr %pstate, %g3       ! read %pstate to %g3
        or %g3, 4, %g3
        wrpr %g3, %pstate       ! write %g3 to %pstate
In the Solaris OS, the trap handler for trap type 0x31 calls the
handler sfmmu_slow_dmmu_miss() to load the page translation
from hme_blk. If no entry is found in hme_blk for the virtual
address, sfmmu_slow_dmmu_miss() calls sfmmu_pagefault() to
transfer control to the Solaris trap() handler.
% mdb -k
> trap_table+0x20*0x31,2/ai
scb+0x620:
scb+0x620:      ba,a +0xc1b4 <sfmmu_slow_dmmu_miss>
scb+0x624:      illtrap 0
> sfmmu_pagefault,80/ai
sfmmu_pagefault+0x78:   sethi %hi(0x101d400), %g1
sfmmu_pagefault+0x7c:   or %g1, 0x364, %g1
sfmmu_pagefault+0x80:   ba,pt %xcc, -0x13f0 <sys_trap>
{0} ok show-devs
/cpu@3
/cpu@2
/cpu@1
/cpu@0
/virtual-devices@100
/virtual-memory
/memory@m0,4000000
/aliases
/options
/openprom
/chosen
/packages
/virtual-devices@100/channel-devices@200
/virtual-devices@100/console@1
/virtual-devices@100/ncp@4
/virtual-devices@100/channel-devices@200/disk@0
/virtual-devices@100/channel-devices@200/network@0
/openprom/client-services
/packages/obp-tftp
/packages/kbd-translator
/packages/SUNW,asr
/packages/dropins
/packages/terminal-emulator
/packages/disk-label
/packages/deblocker
/packages/SUNW,builtin-drivers
During the system boot, the OBP device tree information is
passed to the Solaris OS and used to create the system device
nodes. Output from the following prtconf(1M) command shows
the system configuration of a typical non-I/O domain:
# prtconf
System Configuration: Sun Microsystems sun4v
Memory size: 1024 Megabytes
System Peripherals (Software Nodes):
SUNW,Sun-Fire-T200
    scsi_vhci, instance #0
    packages (driver not attached)
        SUNW,builtin-drivers (driver not attached)
        deblocker (driver not attached)
        disk-label (driver not attached)
        terminal-emulator (driver not attached)
        dropins (driver not attached)
        SUNW,asr (driver not attached)
        kbd-translator (driver not attached)
        obp-tftp (driver not attached)
        ufs-file-system (driver not attached)
    chosen (driver not attached)
    openprom (driver not attached)
        client-services (driver not attached)
    options, instance #0
    aliases (driver not attached)
    memory (driver not attached)
    virtual-memory (driver not attached)
    virtual-devices, instance #0
        ncp (driver not attached)
        console, instance #0
        channel-devices, instance #0
            disk, instance #0
            network, instance #0
    cpu (driver not attached)
    cpu (driver not attached)
    cpu (driver not attached)
    cpu (driver not attached)
    iscsi, instance #0
    pseudo, instance #0
LDOM drivers:
vdc (virtual disk client 1.4) non I/O domain only
ldc (sun4v LDC module v1.5)
ds (Domain Services 1.3)
cnex (sun4v channel-devices nexus dri)
vnex (sun4v virtual-devices nexus dri)
dr_cpu (sun4v CPU DR 1.2)
drctl (DR Control pseudo driver v1.1)
1. The file system calls the vdc driver's strategy(9E) entry point.
2. The vdc driver sends the I/O buf, buf(9S), to the LDC. The vdc
driver returns after all data is successfully sent to the LDC.
3. The vds driver is notified by the hypervisor that messages are
available on its queue.
4. The vds driver retrieves data from the LDC and sends it to the
device service that is mapped to the client virtual disk.
5. The vds driver starts the block I/O by sending the I/O request to
the native driver and then dispatching a task queue to await I/O
completion.
6. The native SCSI driver receives the device interrupt.
7. The vds driver's I/O completion is woken up by biodone(9F).
8. The vds driver sends a message to vdc indicating I/O
completion.
9. The vdc driver receives the message from vds, and calls
biodone(9F) to wake up anyone waiting for it.
10. Block I/O requests are sent directly from the file system to
the native driver.
Network Driver
The Solaris LDoms network drivers include a client network
driver, vnet, and a virtual switch, vsw, on the server side. To
transmit a packet, vnet sends the packet over the LDC to vsw,
which then forwards it to the physical network device.
[Figure: LDoms virtual I/O — guest domains and the I/O domain communicate through the LDoms hypervisor.]
Chapter 8
VMware
[Figure: Functional components of the VMware ESX Server — the VMkernel (management interface, network and storage stacks, hardware interface layer, network and storage drivers) hosting a VMM per VM, with binary translation and direct execution modes.]
The figure shows the functional components of the VMware ESX
Server product. The VMkernel, the core of the ESX Server,
abstracts the underlying hardware resources and implements the
VMM for each VM. The VMM implements the virtual CPUs for
each VM, along with the hardware emulation, the I/O stack, and
device drivers for network and storage devices.
Binary Translation
The binary translation (BT) module is believed to be influenced by
the machine simulators Shade [13] and Embra [14]. Embra is part of
SimOS [18], which was developed by a Stanford team led by
Mendel Rosenblum, one of the founders of VMware. While
extensive details of the BT module implementation have not
been published, Agesen [15], Embra [14], and Shade [13]
provide some insight into its implementation.
The BT module translates GOS instructions, which run in a
deprivileged VM, into instructions that can run in the privileged
VMM segment. The BT module receives x86 binary instructions,
including privileged instructions, as input. The output of the
module is a set of instructions that can be safely executed in the
VMM.
Translator:

    main() {    /* dispatch loop */
        if (PC_not_in_TC(pc))
            tc = translate(pc);
        newpc = pc_to_tc(pc);
        jump_to_pc(newpc);
    }

    translate(pc) {
        blk = read_instructions(pc);
        perform_translation(blk);
        write_into_TC(blk);
    }

Translation Cache: holds code fragments, each ending with a jump
back to the dispatch loop.
Figure 32. Binary translation manages a translation cache to
reduce the need to re-translate frequently executed blocks of
instructions.
A more detailed description of binary translation is beyond the
scope of this paper. Readers should refer to Shade [13] and
Embra [14] for more details about dynamic binary translation.
Some privileged instructions that have simple operations use in-
TC sequences. For example, a clear interrupt instruction (cli) can
be replaced by setting a virtual processor flag. Privileged
instructions that have more complex operations (such as setting
cr3 during a context switch) require a call out of the TC to
perform the emulation work, as sketched below.
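The following fragment illustrates this split in a hypothetical translator. None of the names are VMware interfaces; the vcpu state, the emit_* primitives, and emulate_set_cr3() are assumptions made for the sketch.

#include <stdint.h>

/*
 * Illustrative only: the vcpu state and the emit_* primitives are
 * assumed helpers of a hypothetical translator, not VMware code.
 */
struct vcpu {
        int iflag;              /* virtual interrupt-enable flag */
};

extern void emit_store_zero(int *dst);          /* emit in-TC store */
extern void emit_callout(void (*fn)(uint64_t), uint64_t arg);
extern void emulate_set_cr3(uint64_t cr3);      /* VMM callout */

enum x86_op { OP_CLI, OP_MOV_CR3 };

void
translate_privileged(struct vcpu *vcpu, enum x86_op op, uint64_t operand)
{
        switch (op) {
        case OP_CLI:
                /* Simple operation: replace cli with an in-TC
                 * sequence that clears the virtual interrupt flag. */
                emit_store_zero(&vcpu->iflag);
                break;
        case OP_MOV_CR3:
                /* Complex operation: emit a call out of the TC so the
                 * VMM can emulate the context switch (for example,
                 * rebuilding shadow page tables). */
                emit_callout(emulate_set_cr3, operand);
                break;
        }
}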
In addition to binary translation and the logic for selecting the
code execution mode, the virtualization layer employs other
techniques to overcome x86 virtualization issues:
• Memory Tracing
The virtualization layer traces modifications on any given
physical page of the virtual machine, and is notified of all read
and write accesses made to that page in a transparent manner.
CPU Scheduling
The ESX Server implements a rate-based proportional-share
scheduler [19] that is similar to the fair-share-scheduler scheme
used by the Solaris OS (see [21] Chapter 8) in which each virtual
machine is given a number of shares. The amount of CPU time
given to each VM is based on its fractional share of the total
number of shares of active VMs in the whole system.
The term share is used to define a portion of the system's CPU
resources that is allocated to a VM. If a greater number of CPU
shares is assigned to a VM, relative to other VMs, then that VM
receives more CPU resources from the scheduler. CPU shares
are not equivalent to percentages of CPU resources. Rather,
shares are used to define the relative weight of a CPU load in a
VM in relation to CPU loads of other VMs.
The following formula shows how the scheduler calculates the per-
domain allocation of CPU resources:

    Allocation(domain_i) = Shares(domain_i) / TotalShares    (Reservation = 0 MHz)

Figure 33. Calculation of CPU resources in VMware.
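As a worked illustration of this formula (the share values are invented for the example), the following small C program computes each VM's fraction of the CPU:

#include <stdio.h>

/*
 * Worked example of share-based allocation: VM i receives
 * shares[i] / total_shares of the CPU.  Share values are invented
 * for illustration.
 */
int
main(void)
{
        int shares[] = { 2000, 1000, 1000 };    /* VMs A, B, C */
        int n = sizeof (shares) / sizeof (shares[0]);
        int total = 0;

        for (int i = 0; i < n; i++)
                total += shares[i];

        for (int i = 0; i < n; i++)
                printf("VM %c: %.0f%% of the CPU\n", 'A' + i,
                    100.0 * shares[i] / total);
        return (0);
}

With 2000, 1000, and 1000 shares, the VMs receive 50, 25, and 25 percent of the CPU, respectively. Note that raising a VM's shares raises its relative weight, not an absolute percentage.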
In an SMP environment in which a VM could have more than one
virtual CPU (VCPU), a scalability issue arises when one VCPU is
spinning on a lock held by another VCPU that gets de-scheduled.
The spinning VCPU wastes CPU cycles spinning on the lock until
the lock owner VCPU is finally scheduled again and releases the
lock.
ESX implements co-scheduling to work around this problem. In co-
scheduling (also called gang scheduling), all virtual processors of
a VM are mapped one-to-one onto the underlying processors
and simultaneously scheduled for an equal time slice. The ESX
scheduler guarantees that no VCPU spins on a lock held
by a VCPU that has been preempted.
However, co-scheduling does introduce other problems. Because
all VCPUs are scheduled at the same time, co-scheduling
activates a VCPU regardless of whether there are jobs in the
VCPU's run queue. Co-scheduling also precludes multiplexing
multiple VCPUs on the same physical processor.
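A minimal sketch of the admission decision implied above, with illustrative structure and field names: a VM is dispatched only when every one of its VCPUs can be given its own physical CPU for the time slice.

/*
 * Sketch of a gang-scheduling admission check.  Structure and field
 * names are illustrative, not ESX interfaces.
 */
struct vm {
        int nvcpus;     /* number of virtual CPUs in this VM */
};

int
can_coschedule(const struct vm *vm, int idle_pcpus)
{
        /* Every VCPU needs its own physical CPU for the time slice,
         * even a VCPU with an empty run queue -- one source of the
         * waste noted above. */
        return (vm->nvcpus <= idle_pcpus);
}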
Timer Services
Similar to Sun xVM Server, ESX Server faces the issue of
delivering clock interrupts to VMs at the configured interval [16].
This issue arises because a VM may not be scheduled at the time
an interrupt is due to be delivered. ESX Server keeps track of the
clock interrupt backlog and tries to deliver clock interrupts at a
higher rate when the backlog grows large. However, the backlog
can become so large that it is not possible for the GOS to catch up
with real time. ESX Server therefore stops attempting to catch up
if the clock interrupt backlog grows beyond 60 seconds. Instead,
ESX Server sets its record of the clock interrupt backlog to zero
and synchronizes the GOS clock with the host machine clock.
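The catch-up policy can be pictured with a short sketch. The structure, field names, and helpers below are modeled on the description above, not on ESX Server source code:

/*
 * Sketch of the clock catch-up policy described above, assuming the
 * VMM tracks a per-VM clock-interrupt backlog.  All names are
 * illustrative.
 */
typedef long long ns_t;

#define MAX_BACKLOG_NS  (60LL * 1000000000LL)   /* 60 seconds */

struct vm {
        ns_t next_tick_ns;      /* when the next tick was due */
        ns_t tick_period_ns;    /* configured tick interval */
};

extern void post_clock_interrupt(struct vm *);
extern void sync_guest_clock(struct vm *, ns_t now);

void
deliver_clock_interrupts(struct vm *vm, ns_t now)
{
        if (now - vm->next_tick_ns > MAX_BACKLOG_NS) {
                /* Too far behind: give up, zero the backlog, and
                 * resynchronize the guest clock with the host clock. */
                vm->next_tick_ns = now;
                sync_guest_clock(vm, now);
                return;
        }
        /* Otherwise deliver ticks at a higher-than-configured rate
         * until the backlog is drained. */
        while (vm->next_tick_ns <= now) {
                post_clock_interrupt(vm);
                vm->next_tick_ns += vm->tick_period_ns;
        }
}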
Page Translations
Each GOS in the ESX Server maintains page tables for virtual-to-
physical address mappings. The VMM also maintains shadow
page tables for the virtual-to-machine page mappings along with
physical-to-machine mappings in its memory. The processor's
MMU uses the VMM's shadow page table. When a GOS updates
its page tables with a virtual-to-physical translation, the VMM
intercepts the instruction, gets the physical-to-machine mapping
from its memory, and loads the shadow page table with the
virtual-to-machine mapping. This mechanism allows normal
memory accesses in the VM to execute without adding address
translation overhead if the shadow page tables are set up for that
access.
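A minimal sketch of this interception path follows, with illustrative names: p2m_lookup() stands in for the VMM's in-memory physical-to-machine map, and shadow_pt_set() for the update of the shadow table the hardware MMU walks.

#include <stdint.h>

/*
 * Sketch of the shadow-page-table update described above.  All names
 * are illustrative, not VMware interfaces.
 */
struct vmm;
extern uint64_t p2m_lookup(struct vmm *, uint64_t guest_pa);
extern void shadow_pt_set(struct vmm *, uint64_t va, uint64_t ma);

/* Called when the VMM intercepts a GOS write of a virtual-to-physical
 * page-table entry. */
void
on_guest_pte_write(struct vmm *vmm, uint64_t va, uint64_t guest_pa)
{
        uint64_t ma = p2m_lookup(vmm, guest_pa);  /* physical -> machine */

        /* Install the virtual-to-machine mapping in the shadow table
         * that the hardware MMU actually walks, so later accesses to
         * va run without extra translation overhead. */
        shadow_pt_set(vmm, va, ma);
}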
sound chip [20]. In addition, ESX Server also provides virtual PCI
emulation for PCI add-on devices such as SCSI, Ethernet, and
SVGA graphics (see Figure 30 on page 98).
The device tree as exported by the VMM to a GOS is shown in
the following prtconf(1M) output.
% prtconf
System Configuration: Sun Microsystems i86pc
Memory size: 1648 Megabytes
System Peripherals (Software Nodes):

i86pc
    scsi_vhci, instance #0
    isa, instance #0
        i8042, instance #0
            keyboard, instance #0
            mouse, instance #0
        lp (driver not attached)
        asy, instance #0 (driver not attached)
        asy, instance #1 (driver not attached)
        fdc, instance #0
            fd, instance #0
    pci, instance #0
        pci15ad,1976 (driver not attached)
        pci8086,7191, instance #0
            pci15ad,1976 (driver not attached)
            pci-ide, instance #0
                ide, instance #0
                    sd, instance #16
                ide (driver not attached)
            pci15ad,1976 (driver not attached)
            display, instance #0
            pci1000,30, instance #0
                sd, instance #0
            pci15ad,750, instance #0
    iscsi, instance #0
    pseudo, instance #0
    options, instance #0
    agpgart, instance #0 (driver not attached)
    xsvc, instance #0
    objmgr, instance #0
    acpi (driver not attached)
    used-resources (driver not attached)
    cpus (driver not attached)
        cpu, instance #0 (driver not attached)
        cpu, instance #1 (driver not attached)
The PCI vendor ID of VMware is 15ad. The following entries are
relevant to VMware I/O virtualization:

Device Entry    Description
pci15ad,750     VMware emulation of Intel's 100FX Gigabit Ethernet
pci15ad,1976    VMware emulation of the Intel 440BX/ZX PCI bridge chip
pci1000,30      The LSI Logic 53C1020/1030 SCSI controller
display         VMware virtual SVGA
For the example device tree shown here, the Solaris OS binds
the e1000g driver to pci15ad,750 and uses e1000g as the
network driver. The actual network hardware on the system is a
Broadcom NetXtreme Dual Gigabit Adapter with the PCI ID
pci14e4,1468. VMware translates the e1000g device interfaces
passed by the Solaris e1000g driver and sends them to the
Broadcom NetXtreme device.
For storage, unlike Sun xVM Server, ESX Server continues to
use sd as the interface to file systems. The emulation of the disk
interface is provided at the SCSI bus adapter interface (the LSI
Logic SCSI controller) instead of at the SCSI target interface (the
SCSI disk, sd).
Device Emulation
Each storage device, regardless of the specific adapter, appears
as a SCSI drive connected to an LSI Logic SCSI adapter within
the VM. For network I/O, ESX Server emulates an AMD
Lance/PCNet or Intel E1000 device, or uses a custom interface
called vmxnet for the physical network adapter.
VMware provides device emulation rather than the I/O emulation
used by Sun xVM Server and UltraSPARC LDoms (see "I/O
Virtualization" on page 16). In a simple scenario, consider an
application within the VM making an I/O request to the GOS, as
illustrated in Figure 34:
[Figure 34: I/O request path from the VM through the VMM to the
VMkernel's device-independent I/O access handler and device
emulation module, then through the hardware interface layer and
the real device driver to the I/O device on a Sun x64 server.
Numbered arrows mark the sequence of events.]

Figure 34. Sequence of events for applications making an I/O
request.
1. Applications perform I/O operations through the interface to the
device as exported by the VMware VMM (see "VMware I/O
Virtualization" on page 103). The virtual device interface uses the
native drivers (for example, e1000g for network and mpt for the
LSI SCSI HBA) in the Solaris kernel.
2. The Solaris native driver attempts to access the device via
IN/OUT instructions (for example, by writing a DMA descriptor to
the device's DMA engine).
3. The VMM intercepts the I/O instructions and then transfers
control to the device-independent module in the VMkernel for
handling the I/O request.
4. The VMkernel converts the I/O request from the
Section III
Additional Information
Appendix A
VMM Comparison

General

Feature                 Sun xVM Server w/o HVM    Sun xVM Server w/ HVM     VMware                    LDoms
VMM version             3.0.4                     3.0.4                     ESX 3.0.1                 LDoms 1.0.1
Supported ISA           x86 and IA-64             x86 and IA-64             x86                       UltraSPARC T1/T2
VMM layer               Run on bare metal         Run on bare metal         Run on bare metal         Firmware
Virtualization scheme   Paravirtualization        Full                      Full                      Paravirtualization
Supported GOS           Linux, NetBSD, FreeBSD,   Linux, NetBSD, FreeBSD,   Windows, Linux,           Solaris, Linux
                        OpenBSD, Solaris          OpenBSD, Windows          Netware, Solaris
SMP GOS                 Yes                       Yes                       Yes                       Yes
64-bit GOS              Yes                       Yes                       Yes                       Yes
Max VMs                 Limited by memory         Limited by memory         Limited by memory         32 on UltraSPARC T1;
                                                                                                      64 on UltraSPARC T2
Method of operation     Modified GOS              Hardware virtualization   Binary translation        Modified OS
License                 GPL (free)                GPL (free)                Proprietary               CDDL (free)

CPU

Feature                 Sun xVM Server w/o HVM    Sun xVM Server w/ HVM     VMware                    LDoms
CPU scheduling          Credit                    Credit                    Fair share                N/A
VMM privilege mode      Privileged (ring 0)       Privileged (ring 0)       Privileged                Hyperprivileged
GOS privilege mode      Unprivileged (ring 3      Reduced privilege         Deprivileged              Privileged
                        for 64-bit kernel;
                        ring 1 for 32-bit
                        kernel)
CPU granularity         Fractional                Fractional                Fractional                1 strand
Interrupt               Queued and delivered      Queued and delivered      Queued and delivered      Delivered directly
                        to the running VM         to the running VM         to the running VM         to the VM

Memory

Feature                     Sun xVM Server w/o HVM   Sun xVM Server w/ HVM   VMware            LDoms
Page translation            Hypercall to VMM         Shadow page             Shadow page       Hypercall to VMM
Physical memory allocation  Balloon driver           Balloon driver          Balloon driver    Hard partition
Page tables                 Managed by VMM           Managed by VMM          Managed by VMM    Managed by GOS
Appendix B
Glossary
Balloon driver
A method for dynamic sharing of physical memory among VMs [5].

Binary Translation
In computing, binary translation [13][14] usually refers to the
emulation of one instruction set by another through translation of
instructions, allowing software programs (e.g., operating systems
and applications) written for a particular processor architecture to
run on another. In the context of VMware products, binary
translation refers to the conversion of one set of instruction
sequences that belongs to a VM and has been deprivileged into
another set of instruction sequences that can run in a privileged
VMM segment. VMware uses binary translation [12][15] to
provide full virtualization of the x86 processor.

Domain
A running virtual machine within which a guest OS runs. Domain
and virtual machine are used interchangeably in this document.

Full Virtualization
Full virtualization is an implementation of a virtual machine that
does not require the guest OS to be modified to run in the VM. The
techniques used for full virtualization can be dynamic translation
of software programs running in a VM (e.g., VMware products), or
a complete emulation of the underlying processor (e.g., Xen with
Intel VT or AMD-V).

Guest Operating System (GOS)
A GOS is one of the OSes that the VMM can host in a VM. The
relationship between VMM, VM, and GOS is analogous to the
relationship between, respectively, OS, process, and program.

Hardware Level Virtualization
Hardware level virtualization is the technique of using a thin
layer of software to abstract the system hardware resources in
order to create multiple instances of virtual execution
environments, each of which runs a separate instance of an
operating system.

Hardware Thread
See strand.

HVM
Hardware Virtual Machine, also known as hardware-assisted
virtualization.

Hypervisor
Hypervisor is another term for VMM. Hypervisor is an extension
of the term supervisor, which was commonly applied to operating
system kernels.

Logical Domains (LDoms)
Logical domains are Sun's implementation of hardware level
virtualization, based on UltraSPARC T1 processor technology.
LDoms technology allows multiple domains to be created on one
processor; each domain runs an instance of an OS supported by
one or more strands.

Operating System Level Virtualization
Appendix D
Author Biography

Acknowledgements
The author would like to thank Honlin Su, Lodewijk Bonebakker,
Thomas Bastian, Ray Voight, and Joost Pronk for their invaluable
comments; Patric Change for his encouragement and support;
Suzanne Zorn for her editorial work; and Kemer Thompson for
his constructive comments and his coordination of the reviews.
Solaris Operating System Hardware Virtualization Product Architecture
On the Web: sun.com

Sun Microsystems, Inc.  4150 Network Circle, Santa Clara, CA 95054 USA
Phone 1-650-960-1300 or 1-800-555-9SUN (9786)  Web sun.com

© 2007 Sun Microsystems, Inc. All rights reserved. Sun, Sun Microsystems, the Sun logo, Java,
JVM, Solaris, and Sun BluePrints are trademarks or registered trademarks of Sun Microsystems,
Inc. in the United States and other countries. All SPARC trademarks are used under license and
are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and other
countries. Products bearing SPARC trademarks are based upon architecture developed by Sun
Microsystems, Inc. Information subject to change without notice. Printed in USA W07