
OpenStack With VMware.

Reference: https://cloudarchitectmusings.com/

One of the most common questions I receive is “How does OpenStack compare with
VMware?” That question comes, not only from customers, but from my peers in the vendor
community, many of whom have lived and breathed VMware technology for a number of
years. Related questions include “Which is a better cloud platform?”, “Who has the better
hypervisor?”, and “Why would I use one vs. another?” Many of these are important and sound
questions; others, however, reveal a common misunderstanding of how OpenStack matches up
with VMware.

The misunderstanding is compounded by the tech media and, frankly, IT folks who often fail to
differentiate between vSphere and other VMware technologies such as vCloud Director. It is not
enough to just report that a company is running VMware; are they just running vSphere, or are
they also using other VMware products such as vCloud Director, vCenter Operations Manager,
or vCloud Automation Center? If a business announces they are building an OpenStack Private
Cloud, instead of using the VMware vCloud Suite, does it necessarily imply that they are
throwing out vSphere as well? The questions to be asked in such a scenario are “which
hypervisors will be used with OpenStack” and “will the OpenStack Cloud be deployed alongside
a legacy vSphere environment?”

What makes the first question particularly noteworthy is the amount of code that VMware has
contributed to the recent releases of OpenStack. Ironically, the OpenStack drivers for ESX were
initially written, not by VMware, but by Citrix. These drivers provided limited integration with
OpenStack Compute and were, understandably, not well maintained. However, with VMware
now a member of the OpenStack foundation, their OpenStack team has been busy updating and
contributing new code. I believe deeper integration between vSphere and OpenStack is critical
for users who want to move to an Open Cloud platform, such as OpenStack, but have existing
investments in VMware and legacy applications that run on the vSphere platform.

In this and upcoming blog posts, I will break down how VMware compares to and integrates
with OpenStack, starting with the most visible and well-known component of OpenStack – Nova
compute. I’ll start by providing an overview of the Nova architecture and how vSphere
integrates with that project. In subsequent posts, I will provide design and deployment details
for an OpenStack Cloud that deploys multiple hypervisors, including vSphere.

OpenStack Architecture

OpenStack is built on a shared-nothing, messaging-based architecture with modular components
that each manage a different service; these services work together to instantiate an IaaS Cloud.
A full discussion of the entire OpenStack Architecture is beyond the scope of this post. For
those who are unfamiliar with OpenStack or need a refresher, I recommend reading Ken
Pepple’s excellent OpenStack overview blog post and my “Getting Started With OpenStack”
slide deck. I will focus, in this post, on the compute component of OpenStack, called Nova.

Nova Compute

Nova compute is the OpenStack component that orchestrates the creation and deletion of
compute/VM instances. To gain a better understanding of how Nova performs this orchestration,
you can read the relevant section of the “OpenStack Cloud Administrator Guide.” Similar to
other OpenStack components, Nova is based on a modular architectural design where services
can be co-resident on a single host or, more commonly, spread across multiple hosts. The core
components of Nova include the following:

 The nova-api accepts and responds to end-user compute API calls. It also initiates most of
the orchestration activities (such as running an instance) as well as enforcing some policies.
 The nova-compute process is primarily a worker daemon that creates and terminates virtual
machine instances via hypervisor APIs (XenAPI for XenServer/XCP, libvirt for KVM or QEMU,
VMwareAPI for vSphere, etc.).
 The nova-scheduler process is conceptually the simplest piece of code in OpenStack Nova: it
takes a virtual machine instance request from the queue and determines where it should run
(specifically, which compute node it should run on).
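
To see how these services are distributed across hosts in a running deployment, you can list them
from the Cloud Controller. The command below is only a minimal illustration; the exact output
columns vary by release.

nova-manage service list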

Although many have mistakenly made direct comparisons between OpenStack Nova and
vSphere, that comparison is quite inaccurate since Nova actually sits at a layer above the hypervisor
layer. OpenStack in general, and Nova in particular, is most analogous to vCloud Director (vCD)
and vCloud Automation Center (vCAC), and not ESXi or even vCenter. In fact, it is very
important to remember that Nova itself does NOT come with a hypervisor but manages multiple
hypervisors, such as KVM or ESXi. Nova orchestrates these hypervisors via APIs and
drivers. The list of supported hypervisors includes KVM, vSphere, Xen, and others; a detailed list
of what is supported can be found on the OpenStack Hypervisor Support Matrix.
Nova manages its supported hypervisors through APIs and native management tools. For example,
Hyper-V is managed directly by Nova, KVM is managed via a virtualization management tool
called libvirt, while Xen and vSphere can be managed either directly or through a management
tool (libvirt for Xen and vCenter for vSphere, respectively).

vSphere Integration with OpenStack Nova

OpenStack Nova compute manages vSphere 4.1 and higher through two compute driver options
provided by VMware – vmwareapi.VMwareESXDriver and vmwareapi.VMwareVCDriver:

 The vmwareapi.VMwareESXDriver driver, originally written by Citrix and subsequently updated
by VMware, allows Nova to communicate directly with an ESXi host via the vSphere SDK.
 The vmwareapi.VMwareVCDriver driver, developed by VMware initially for the Grizzly release,
allows Nova to communicate with a VMware vCenter server managing a cluster of ESXi
hosts. [Havana] In the Havana release, a single nova-compute service can manage multiple
clusters through one vCenter server.

Let’s talk more about these drivers and how Nova leverages them to manage vSphere. Note that
I am focusing specifically on compute and deferring discussion of vSphere networking and
storage to other posts.
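
To make the discussion concrete, here is roughly what the relevant nova.conf settings looked like
when pointing Nova at vCenter in the Grizzly time frame; treat this as an illustrative sketch, since
the option names were later regrouped under a [vmware] section, and the address and credential
values are placeholders.

compute_driver=vmwareapi.VMwareVCDriver
vmwareapi_host_ip=192.168.10.20
vmwareapi_host_username=administrator
vmwareapi_host_password=secret
vmwareapi_cluster_name=Cluster01

Swapping compute_driver to vmwareapi.VMwareESXDriver and pointing vmwareapi_host_ip at an
individual ESXi host gives the standalone configuration described in the next section.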

ESXi Integration with Nova (vmwareapi.VMwareESXDriver)


Logically, the nova-compute service communicates directly with the ESXi host; vCenter is not in
the picture. Since vCenter is not involved, using the ESXDriver means advanced capabilities, such
as vMotion, High Availability, and Distributed Resource Scheduler (DRS), are not available. Also,
in terms of physical topology, you should note the following:

 Unlike Linux kernel-based hypervisors, such as KVM, vSphere with OpenStack requires the VM
instances to be hosted on an ESXi server distinct from a Nova compute node, which must run on
some flavor of Linux. In contrast, VM instances running on KVM can be hosted directly on a
Nova compute node.
 Although a single OpenStack installation can support multiple hypervisors, each compute node
will support only one hypervisor. So any multi-hypervisor OpenStack Cloud requires at least one
compute node for each hypervisor type.
 Currently, the ESXDriver has a limit of one ESXi host per Nova compute service.

ESXi Integration with Nova (vmwareapi.VMwareVCDriver)

[Havana] Logically, the nova-compute service communicates with a vCenter Server, which
handles management of one or more ESXi clusters. With vCenter in the picture, using the
VCDriver means advanced capabilities, such as vMotion, High Availability, and Distributed
Resource Scheduler (DRS), are now supported. However, since vCenter abstracts the ESXi
hosts from the nova-compute service, the nova-scheduler views each cluster as a single
compute/hypervisor node with resources amounting to the aggregate resources of all ESXi hosts
in that cluster. This can cause issues with how VMs are scheduled/distributed across a
multi-compute node OpenStack environment.

[Havana] For now, look at the diagram below, courtesy of VMware. Note that nova-scheduler
sees each cluster as a hypervisor node; we’ll discuss in another post how this impacts VM
scheduling and distribution in a multi-hypervisor Cloud with vSphere in the mix. I also want to
highlight that the VCDriver integrates with the OpenStack Image service, aka Glance, to
instantiate VMs with full operating systems installed and not just bare VMs. I will be doing a
write-up on that integration in a later post.

[Havana] Pulling back a bit, you can begin to see below how vSphere integrates architecturally
with OpenStack alongside a KVM environment. We’ll talk more about that as well in another post.
Also, in terms of physical topology, you should note the following:

 Unlike Linux kernel-based hypervisors, such as KVM, vSphere with vCenter on OpenStack
requires a separate vCenter Server host, and the VM instances must be hosted in an ESXi
cluster on ESXi hosts distinct from the Nova compute node. In contrast, VM instances running
on KVM can be hosted directly on a Nova compute node.
 Although a single OpenStack installation can support multiple hypervisors, each compute node
will support only one hypervisor. So any multi-hypervisor OpenStack Cloud requires at least one
compute node for each hypervisor type.
 To utilize the advanced vSphere capabilities mentioned above, each Cluster must be able to
access datastores sitting on shared storage.
 [Havana] In Grizzly, the VCDriver had a limit of one vSphere cluster per Nova compute
node. With Havana, a single compute node can now manage multiple vSphere clusters.
 Currently, the VCDriver allows only one datastore to be configured and used per cluster.
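
As a rough sketch of the Havana-style configuration (assuming the option names under a [vmware]
section, with placeholder values), a single nova-compute service managing two clusters through
one vCenter might look something like this:

[DEFAULT]
compute_driver=vmwareapi.VMwareVCDriver

[vmware]
host_ip=192.168.10.20
host_username=administrator
host_password=secret
cluster_name=Cluster-A
cluster_name=Cluster-B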

Hopefully, this post has helped shed some light on where vSphere integration stands with
OpenStack. In upcoming posts, I will provide more technical and implementation details.

See part 2 for DRS, part 3 for HA and VM Migration, part 4 for Resource Overcommitment, and
part 5 for Designing a Multi-Hypervisor Cloud.

In part 1, I gave an overview of the OpenStack Nova Compute project and how VMware
vSphere integrates with that particular project. Before we get into actual design and
implementation details, I want to review an important component of Nova Compute – resource
scheduling, aka. nova-scheduler in OpenStack and Distributed Resource Scheduler (DRS) in
vSphere. I want to show how DRS compares with nova-scheduler and the impact of running
both as part of an OpenStack implementation. Note that I can’t be completely exhaustive in this
blog post, but I would recommend that everyone read the following:

 For DRS, the “bible” is “VMware vSphere 5 Clustering Technical Deepdive,” by Frank Denneman
and Duncan Epping.
 For nova-scheduler, the most comprehensive treatment I’ve read is Yong Sheng Bong’s excellent
blog post on “OpenStack nova-scheduler and its algorithm.” It’s a must read if you want to dive
deep into the internals of the nova-scheduler.

OpenStack nova-scheduler

Nova Compute uses the nova-scheduler service to determine which compute node should be
used when a VM instance is to be launched. The nova-scheduler, like vSphere DRS, makes this
automatic initial placement decision using a number of metrics and policy considerations, such as
available compute resources and VM affinity rules. Note that unlike DRS, the nova-scheduler
does not do periodic load-balancing or power-management of VMs after the initial
placement. The scheduler has a number of configurable options that can be accessed and
modified in the nova.conf file, the primary file used to configure nova-compute services in an
OpenStack Cloud.

scheduler_driver=nova.scheduler.multi.MultiScheduler
compute_scheduler_driver=nova.scheduler.filter_scheduler.FilterScheduler
scheduler_available_filters=nova.scheduler.filters.all_filters
scheduler_default_filters=AvailabilityZoneFilter,RamFilter,ComputeFilter
least_cost_functions=nova.scheduler.least_cost.compute_fill_first_cost_fn
compute_fill_first_cost_fn_weight=-1.0

The two variables I want to focus on are “scheduler_default_filters” and “least_cost_functions.”
They represent two algorithmic processes used by the Filter Scheduler to determine initial
placement of VMs. (There are two other schedulers, the Chance Scheduler and the Multi
Scheduler, that can be used in place of the Filter Scheduler; however, the Filter Scheduler is the
default and should be used in most cases.) These two processes work together to balance workload
across all existing compute nodes at VM launch, much in the same way Dynamic
Entitlements and Resource Allocation settings are used by DRS to balance VMs across an ESXi
cluster.

Filtering (scheduler_default_filters)

When a request is made to launch a new VM, the nova-compute service contacts the nova-
scheduler to request placement of the new instance. The nova-scheduler uses the scheduler named
in the nova.conf file, by default the Filter Scheduler, to determine that placement. First, a
filtering process determines which hosts are eligible for consideration and builds an eligible
host list; then a second algorithm, Costs and Weights (described later), is applied against the list
to determine which compute node is optimal for fulfilling the request.

The Filtering process uses the scheduler_default_filters configuration option in nova.conf
(drawn from the filter classes made available by scheduler_available_filters) to determine
which filters will be used for filtering out ineligible compute nodes and to create the eligible
hosts list. By default, there are three filters that are used:

 The AvailabilityZoneFilter filters out hosts that do not belong to the Availability Zone specified
when a new VM launch request is made via the Horizon dashboard or from the nova CLI client.
 The RamFilter ensures that only nodes with sufficient RAM make it on to the eligible host list. If
the RamFilter is not used, the nova-scheduler may over-provision a node with insufficient RAM
resources. By default, the filter is set to allow overcommitment on RAM by 50%, i.e. the
scheduler will allow a VM requiring 1.5 GB of RAM to be launched on a node with only 1 GB of
available RAM. This setting is configurable by changing the “ram_allocation_ratio=1.5” setting
in nova.conf.
 The ComputeFilter filters out any nodes that do not have sufficient capabilities to launch the
requested VM as it corresponds to the requested instance type. It also filters out any nodes that
cannot support properties defined by the image that is being used to launch the VM. These
image properties include architecture, hypervisor, and VM mode.
An example of how the Filtering process may work is that host 2 and host 4 above may have been
initially filtered out for any combination of the following reasons, assuming the default filters
were applied:

1. The requested VM is to be in Availability Zone 1 while nodes 2 and 4 are in Availability Zone 2.
2. The requested VM requires 4 GB of RAM and nodes 2 and 4 each have only 2 GB of available
RAM.
3. The requested VM has to run on vSphere and nodes 2 and 4 support KVM.

There are a number of other filters that can be used along with or in place of the default filters;
some of them include:

 The CoreFilter ensures that only nodes with sufficient CPU cores make it on to the eligible host
list. If the CoreFilter is not used, the nova-scheduler may over-provision a node with insufficient
physical cores. By default, the filter is set to allow overcommitment based on a ratio of 16
virtual cores to 1 physical core. This setting is configurable by changing the
“cpu_allocation_ratio=16.0” setting in nova.conf.
 The DifferentHostFilter ensures that the VM instance is launched on a different compute node
from a given set of instances, as defined in a scheduler hint list. This filter is analogous to the
anti-affinity rules in vSphere DRS.
 The SameHostFilter ensures that the VM instance is launched on the same compute node as a
given set of instances, as defined in a scheduler hint list. This filter is analogous to the affinity
rules in vSphere DRS.
The full list of filters is available in the Filters section of the “OpenStack Compute
Administration Guide.” The nova-scheduler is flexible enough that custom filters can be
created and multiple filters can be applied simultaneously.
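
As an illustration (the values, the image and flavor names, and the instance UUID below are
placeholders), overriding the overcommitment ratios and enabling the affinity filters in nova.conf,
and then booting an instance with a scheduler hint that keeps it off the node hosting an existing
instance, might look roughly like this:

scheduler_default_filters=AvailabilityZoneFilter,RamFilter,CoreFilter,ComputeFilter,DifferentHostFilter,SameHostFilter
ram_allocation_ratio=1.0
cpu_allocation_ratio=4.0

nova boot --image precise-server --flavor m1.medium --hint different_host=<instance-uuid> web02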

Costs and Weights

Next, the Filter Scheduler takes the hosts that remain after the filters have been applied and
applies one or more cost functions to each host to get a numerical score for each host. Each cost
score is multiplied by a weighting constant specified in the nova.conf config file. The algorithm
is described in detail in the “OpenStack nova-scheduler and its algorithm” post I referenced
previously. The weighting constant configuration option is the name of the cost function, with
the _weight string appended. Here is an example of specifying a cost function and its
corresponding weight:

least_cost_functions=nova.scheduler.least_cost.compute_fill_first_cost_fn,nova.scheduler.least_cost.noop_cost_fn
compute_fill_first_cost_fn_weight=-1.0
noop_cost_fn_weight=1.0

There are three cost functions available; any of the functions can be used alone or in any
combination with the other functions.

1. The nova.scheduler.least_cost.compute_fill_first_cost_fn function calculates the amount of
available RAM on the compute nodes and chooses which node is best suited for fulfilling a VM
launch request based on the weight assigned to the function. A weight of 1.0 will configure the
scheduler to “fill up” a node until there is insufficient RAM available. A weight of -1.0 will
configure the scheduler to favor the node with the most available RAM for each VM launch
request.
2. The nova.scheduler.least_cost.retry_host_cost_fn function adds additional cost for retrying a
node that was already used for a previous attempt. The intent of this function is to ensure that
nodes which consistently encounter failures are used less frequently.
3. The nova.scheduler.least_cost.noop_cost_fn function will cause the scheduler not to
discriminate between any nodes. In practice this function is never used.

The Cost and Weight function is analogous to the Shares concept in vSphere DRS.
In the example above, if we choose to only use the nova.scheduler.least_cost.compute_fill_first_cost_fn
function and set the weight to compute_fill_first_cost_fn_weight=1.0, we would expect the
following results:

 All nodes would be ranked according to the amount of available RAM, starting with host 4.
 The nova-scheduler would favor launching VMs on host 4 until there is insufficient RAM
available to launch a new instance.

DRS with the Filter Scheduler

Now that we’ve reviewed how Nova compute schedules resources and how it works compared
with vSphere DRS, let’s look at how DRS and the nova-scheduler work together when
integrating vSphere into OpenStack. Note that I will not be discussing Nova with the ESXDriver
and standalone ESXi hosts; in that configuration, the filtering works the same as it does with
other hypervisors. I will, instead, focus on Nova with the VCDriver and ESXi clusters managed
by vCenter.
[Havana] As mentioned in my previous post, since vCenter abstracts the ESXi hosts from the
nova-compute service, the nova-scheduler views each cluster as a single compute node/hypervisor
host with resources amounting to the aggregate resources of all ESXi hosts in a given cluster. This
has two effects:

1. While the VCDriver interacts with the nova-scheduler to determine which cluster will host a new
VM, the nova-scheduler plays NO part in where VMs are hosted within the chosen cluster. When
the appropriate vSphere cluster is chosen by the scheduler, nova-compute simply makes an API call
to vCenter and hands off the request to launch a VM. vCenter then selects which ESXi host in
that cluster will host the launched VM, based on Dynamic Entitlements and Resource Allocation
settings (DRS should be enabled with “Fully automated” placement turned on). Any automatic
load balancing or power management by DRS is allowed but will not be known by Nova
compute.
2. In the example above, the KVM compute node has 8 GB of available RAM while the nova-
scheduler sees the “vSphere compute node” as having 12 GB RAM. Again, the nova-scheduler
does not take into account the RAM in the vCenter Server or the actual available RAM in each
individual ESXi host; it only considers the aggregate RAM available in all ESXi hosts in a given
cluster.

As I also mentioned in my previous post, the latter effect can cause issues with how VMs are
scheduled/distributed across a multi-compute node OpenStack environment. Let’s take a look at
two use cases where the nova-scheduler may be impeded in how it attempts to schedule
resources; we’ll focus on RAM resources, using the environment shown above, and assume we
are allowing the default of 50% overcommitment on physical RAM resources.

1. A user requests a VM with 10 GB of RAM. The nova-scheduler applies the Filtering algorithm
and adds the vSphere compute node to the eligible host list even though neither ESXi host in
the cluster has sufficient RAM resources, as defined by the RamFilter. If the vSphere compute
node is chosen to launch the new instance, vCenter will then make the determination that there
aren’t enough resources and will respond to the request based on DRS-defined rules.
2. A user requests a VM with 4 GB of RAM. The nova-scheduler applies the Filtering Algorithm and
correctly adds all three compute nodes to the eligible host list. The scheduler then applies the
Cost and Weights algorithm and favors the vSphere compute node, which it believes has more
available RAM. This creates an imbalanced system where the compute node whose individual
hosts actually have less available RAM may be incorrectly assigned the lower cost and favored for
launching new VMs.

Another use case that is impacted by the fact that the nova-scheduler does not have insight into
individual ESXi servers has to do with the placement of legacy workloads in a multi-hypervisor
Cloud. For example, due to the stateful nature and the lack of application resiliency in most legacy
applications, they often are not suitable for running on a VM that is hosted on a hypervisor such
as KVM with libvirt, which does not have vSphere-style HA capabilities; I talk about why that is
the case in part 3 of this series.
In the use case above, a user requests a new instance to be spun up that will be used to house an
Oracle application. The nova-scheduler chooses to instantiate this new VM on a KVM compute
node/hypervisor host since it has the most available resources. As stated above, however, a vSphere
cluster is probably a more appropriate solution.

So how do we design around this architecture? Unfortunately, the VCDriver is new to
OpenStack and there does not seem to be a great deal of documentation or recommended
practices available. I am hoping this changes shortly and I plan to help as best as I can. I’ve
been speaking with the VMware team working on OpenStack integration and I am hoping we’ll
be able to collaborate on putting out documentation and recommended practices. As well,
VMware is continuing to improve on the code they’ve already contributed and I expect
some of these issues will be addressed in future OpenStack releases; I also expect that additional
functionality will be added to Nova compute as OpenStack continues to mature.

After walking through some more architectural details behind vSphere integration with
OpenStack, I will begin posting, in the near future, how to design, deploy, and configure vSphere
with OpenStack Nova compute. Please also see part 3 for information on HA and VM Migration
in OpenStack, part 4 on Resource Overcommitment, and part 5 on Designing a Multi-Hypervisor
Cloud.

I am often asked by customers to compare the capabilities of OpenStack to the vCloud Suite and
KVM to vSphere. Specifically, the questions revolve around features in vSphere, such as High
Availability (HA), Distributed Resource Scheduler (DRS), and vMotion. These customers want
to know whether, if they choose to use KVM with OpenStack, they will be able to use features
comparable to what is available with vSphere. This is a common question since KVM, like
OpenStack, is open source and is the default hypervisor used when installing
OpenStack. Continuing on from part 1 and part 2 of this series, where I reviewed the
architecture of vSphere with OpenStack Nova and DRS, I will be spending some time on HA
and vMotion before moving on to design and implementation details of vSphere with OpenStack
in upcoming posts. Please also see part 4 for my post on Resource Overcommitment and part 5
for Designing a Multi-Hypervisor Cloud.

High Availability

As most readers of this post will know, High-Availability (HA) is one of the most important
features of vSphere, providing resiliency at the compute layer by automatically restarting VMs
on a surviving ESXi host when the original host suffers a failure. HA is such a critical feature,
especially when hosting applications that do not have application-level resiliency but assume a
bulletproof infrastructure, that many enterprises consider it non-negotiable when considering
moving to another hypervisor such as KVM. So, it’s often a surprise when customers hear that
OpenStack does not natively have the ability to auto-restart VMs on another compute node when
the original node fails.

In lieu of vSphere HA, OpenStack uses a feature called “Instance Evacuation” in the event of a
compute node failure (keep in mind that outside of vSphere, a Nova Compute node also
functions as the hypervisor and hypervisor management node). Instance Evacuation is a manual
process that essentially leverages cold migration to make instances available again on a new
compute node.

At a high level, the steps for Instance Evacuation are as follows (performed by a Cloud
Administrator):

1. Use the nova host-list command to list all compute nodes under management.
2. Choose an appropriate compute node (If user data needs to be preserved, the target compute
node must have access to the same shared storage used by the failed compute node).
3. When the nova evacuate command is invoked, Nova Compute reads the database that stores
the configuration data for the downed instances and essentially “rebuilds” the instances on the
chosen compute node, using that stored configuration data.
4. With Instance Evacuation, the recovered instance behaves differently when it is booted on the
target compute node, depending on whether it was deployed with or without shared storage:
o With Shared Storage – User data is preserved and the server password is retained
o Without Shared Storage – The new instance will be booted from a new disk but will
preserve configuration data, e.g. hostname, ID, IP address, UID, etc. User data,
however, is not preserved and a new server password is regenerated.
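
As a minimal sketch of that workflow (the instance and host names are placeholders, and the exact
flags depend on the client version), the commands look roughly like this:

nova host-list
nova evacuate --on-shared-storage <instance-uuid> <target-compute-host>
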
As mentioned in previous posts, vCenter essentially proxies the ESXi cluster under its
management and abstracts all the member ESXi hosts from the Nova Compute node. When an
ESXi host fails, vSphere HA kicks in and restarts all the VMs, previously hosted on the failed
server, on the surviving servers; DRS then takes care of balancing the VMs appropriately across
the surviving servers. Nova Compute is unaware of any VM movement in the cluster or that an
ESXi host has failed; vCenter, however, reports back on the lowered available resources in the
cluster and Nova takes that into account in its scheduling the next time a request to spin up an
instance is made.

So the question that users may ask, particularly those with a VMware background, is “Why
would OpenStack be missing such a ‘basic’ feature as High Availability?” Speaking for
myself, there are a few factors to consider:

 Since OpenStack itself does not have its own specific hypervisor, it chooses to expose the
functionality that comes with each hypervisor. So, for example, HA is available
with Hyper-V and vSphere since those hypervisors natively support HA.
 Since OpenStack was designed from the beginning to be a Cloud platform, it follows certain
design principles that differ from a “traditional” virtualized infrastructure. Some of these
differences are detailed in Massimo Re Ferre’s excellent post on pets vs. cattle.

Some of these “Cloud vs. virtualized infrastructure” principles include the following:

 A VM instance and a compute node are commodity components that provide services. If a VM
or a compute node dies, just shoot it and restart.
 Resiliency is multi-layered and is required to be built into both the application and infrastructure
layers.
 Scale-out is preferred over scale-up, not only for performance, but for resiliency. By distributing
workloads across multiple instances and compute nodes, the impact of a failed instance is
minimized, allowing time to evacuate instances.

[Havana] It’s also noteworthy that some vendors, such as Piston Computing, have built HA into
their OpenStack distribution. Also, with the new Heat project that is now part of OpenStack,
instance-level HA could be written into an OpenStack Cloud using auto-scaling.

Still, enterprises may wonder how their legacy applications, which require more resiliency at the
infrastructure layer, fit in with OpenStack. This is an area where vSphere can provide both a
competitive advantage over other hypervisors supported by OpenStack and a more robust option
in a multi-hypervisor Cloud. By deploying vSphere with OpenStack, users have the option of
creating a tiered Cloud architecture where new apps can be hosted on KVM or Xen instances
while legacy apps that require vSphere HA functionality can be hosted on vSphere.

VM Migration

OpenStack Nova supports VM migrations by leveraging the native migration features inherent
within the individual hypervisors under its management. The table below outlines what
migration features are supported by OpenStack with each hypervisor:

The biggest difference between vSphere and the other hypervisors in the table above is that cold
migration and vMotion cannot be initiated through Nova; they have to be initiated via the vSphere
Client or the vSphere CLI. So, other than vSphere HA, the hypervisors above supported by
OpenStack are mostly at parity with one another, which makes HA and Storage vMotion the
most compelling reasons to use vSphere. What will be worth watching in upcoming OpenStack
releases are both the development of new features by other hypervisors to match vSphere
functionality and VMware’s continued commitment to make vSphere a first class hypervisor
solution in OpenStack.

One of the areas in OpenStack that seems to be lacking is solid information on how to design an
actual OpenStack Cloud deployment. Almost all of the available documentation focuses on the
installation and configuration of OpenStack, with little in the way of guidance on how to design
for high availability, performance, redundancy, and scalability. One of the gaps in
documentation is around the area of CPU and RAM overcommitment, aka oversubscription,
when designing for Nova Compute. The current OpenStack documentation points out that the
default overcommitment ratio settings, as configured in the Nova Scheduler, are 16:1 for
CPU and 1.5:1 for RAM, but does not provide the rationale for these settings or give much
guidance on how to customize these default ratios. The current documentation also does not
provide guidance for each hypervisor supported within OpenStack.

In contrast, there is an abundance of guidance to help VMware administrators with sizing
vSphere and determining the correct CPU and RAM overcommitment ratios. Perhaps that
should not be a surprise since Public Cloud providers, who have been the early adopters of
OpenStack, typically maintain enough physical compute resources in reserve to handle
unexpected resource spikes. However, with a Private Cloud, more attention must be paid to
ensure there are enough resources to handle current and future workloads. One way to consider
this is to view Public Cloud providers as essentially cattle ranch conglomerates, such as Koch
Industries, while a Private Cloud is more like a privately-owned cattle ranch that can range from
a few dozen head of cattle to thousands of head of cattle.

Typical Compute Sizing Guidelines

As previously mentioned, OpenStack sets a default ratio of 16:1 virtual CPU (vCPU) to physical
CPU (pCPU) and 1.5:1 virtual RAM (vRAM) to physical RAM (pRAM). Coming from the
VMware world, I’ve often heard rule-of-thumb guidelines such as assuming 6:1 CPU and 1.5:1
RAM overcommitment ratios for general workloads. It seems to me that many of these
guidelines for general workloads were developed at a time when most shops were consolidating
Windows servers with only 1 pCPU, which were often only 10% to 20% utilized, and using 2 GB
pRAM which was often only 50% utilized; under those assumptions, a 6:1 ratio made
sense. For example, 6 Windows servers, each running on a single 2 GHz CPU that is only 20%
utilized (400 MHz) and using 2 GB RAM that is only 50% utilized (1 GB), would require
only an aggregate of 2.4 GHz of CPU cycles and 6 GB RAM; you could virtualize those 6
servers and put them on a 3 GHz CPU core with 8 GB RAM and still have 20% headroom for
spikes; this would allow you to host ~50 VMs on an ESXi server with 8 cores and 64 GB of
RAM. But how feasible are these assumptions today when applications are being written to take
better advantage of the higher number of CPU cores and RAM? In those cases, the rule-of-thumb
guidelines will likely not work. This is particularly true for business critical workloads, where a
much more conservative approach with NO overcommitment is generally considered a
recommended practice.

Compute Sizing Methodologies

So, what is the best way to determine compute sizing for OpenStack with vSphere or another
hypervisor, such as KVM? Over the years, I’ve used different methodologies that range from
data collection and analysis, to best-guess estimates, to following general rules-of-thumb,
depending on what my customers can and are willing to provide:

 Using current utilization data – I get this from very few customers but they provide the best
input for the most accurate compute design. What I am looking for is the aggregate CPU cycles
and RAM usage rate for all servers that are or will be virtualized and be running in an OpenStack
Cloud.

 Using their current inventory – More common is to get an inventory without any utilization
data. In which case, I use the same methodology above but make assumptions about pCPU and
pRAM utilization rate. I recommend making sure those assumptions are agreed on by the
customer before I deliver my design.

 Using Rule-of-Thumb guidelines – In reality, this is the most common scenario because I often
get insufficient data from customers. At that point I take extra care to confirm I have an
agreement with the customer as to what our assumptions will be regarding CPU and RAM
overcommitment.

Rule-of-Thumb For Sizing CPU Resources

For all hypervisors, including KVM and vSphere, the guidelines are the same,
since overcommitment in each case depends on the amount of CPU cycles that the OpenStack
architect assumes are being used at any one slice of time. For business critical applications, I
would start with NO overcommitment and adjust if an instance turns out not to require all the
physical CPU cycles it has been given. For today’s general workloads, I start by
assuming a 3:1 CPU overcommitment ratio and adjust as real workload data demands.

For example, let’s assume a dual octa-core server with 3 GHz CPUs as our Nova Compute
node; that works out to 16 physical cores and 48 GHz of processing resources. Using the 3:1
overcommitment ratio, it would work out to 48 vCPUs to be hosted, with the assumption that
each vCPU would be utilizing 1 GHz of actual processing resources. If each VM instance has 2
vCPUs, that would equate to 24 instances per compute node. Again, that ratio could change
based on workload requirements.
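
Laid out as arithmetic, that sizing works out as follows:

2 sockets * 8 cores * 3 GHz = 48 GHz of physical processing resources
16 pCPUs * 3 (overcommitment ratio) = 48 vCPUs, or roughly 1 GHz per vCPU
48 vCPUs / 2 vCPUs per instance = 24 instances per compute node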

This is obviously much more conservative than the 16:1 default overcommitment ratio for
OpenStack Nova Compute. My opinion, however, is that experience and the math do not
support such a high overcommitment ratio for general use cases. For example, to pack 16 vCPUs
into a single 3 GHz physical core with 80% utilization would mean assuming each vCPU would
only require an average of 150 MHz of CPU cycles. Or we would have to assume significant
idle time for those vCPUs. That’s certainly possible, but I would not be comfortable assuming
such low CPU cycle or utilization requirements, without some solid data to back that assumption.

Rule-of-Thumb For Sizing RAM Resources

This is where guidelines may differ depending on the hypervisor used with OpenStack. For
hypervisors with very advanced native memory management capabilities, such as ESXi, I use a
ratio of 1.25:1 RAM overcommitment when designing for a production OpenStack
Cloud. These advanced memory management techniques include Transparent Page Sharing,
guest ballooning, and memory compression. In a non-production environment, I would
consider using the 1.5:1 overcommitment ratio that is the default in Nova Compute.

For example, let’s assume a server with 128 GB pRAM as our Nova Compute node. Using the
1.25:1 overcommitment ratio and assuming 96 GB RAM available after accounting for overhead
and restricting to 80% utilization, it would work out to 120 GB vRAM to be hosted. If each VM
instance requires 4 GB vRAM, that would equate to 30 instances per compute node. Note that
you should use the more conservative of the CPU or RAM numbers.
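
Spelled out as arithmetic:

96 GB usable pRAM * 1.25 (overcommitment ratio) = 120 GB vRAM
120 GB vRAM / 4 GB per instance = 30 instances per compute node
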
For hypervisors, such as KVM, that do not have the same degree of advanced memory
management capabilities, I would assume NO overcommitment at all. This is again in contrast
to the default 1.5:1 ratio in OpenStack for all environments. I would not recommend adopting
that aggressive a ratio without some data showing it can be justified.

For example, let’s assume a server with 128 GB pRAM as our Nova Compute node. Using 1:1
overcommitment ratio (No overcommitment) and assuming 96 GB RAM available after
accounting for overhead and restricting to 80% utilization, it would work out to 96 GB vRAM to
be hosted. If each VM instance requires 4 GB vRAM, that would equate to 24 instances per
compute node. Note that you should use the more conservative of the CPU or RAM numbers.

Concluding Thoughts

It is also important to factor in the number of Compute Node or ESXi host failures a customer
can tolerate. For example, if I have four compute nodes that are 80% utilized across each node
and one node fails, then I am out of resources and am either unable to create new instances or will
experience significant performance drops across all instances. In the case of vSphere, depending
on how Admission Control is configured, you may not be able to spin up new VMs. In that
case, you may need to design, for example, a 5 or 6 node configuration that can tolerate a single
node failure.

Given that resource overcommitment guidelines can differ across hypervisors, I would
recommend separating out the compute nodes managing these different hypervisors, using Cells or
Host Aggregates. This avoids the Nova Compute scheduling issues that I referenced in part 2 of
this series which can occur when you have multiple hypervisors with different capabilities being
managed in the same OpenStack Cloud. I will provide more details on how to design for this in
the future.

In the next post, I will start putting everything together and walking through what a multi-
hypervisor OpenStack Cloud design, with vSphere integrated, may look like.

Having walked through several aspects of vSphere with OpenStack, I want to start putting some
of the pieces together to demonstrate how an architect might begin to design a multi-hypervisor
OpenStack Cloud. Our use case will be for a deployment with KVM and vSphere; note that for
the sake of simplicity, I am deferring discussion about networking and storage for future blog
posts. I encourage readers to review my previous posts on Architecture, Resource Scheduling,
VM Migration, and Resource Overcommitment.

Cloud Segregation

Before delving into our sample use case and OpenStack Cloud design, I want to review the
different ways we can segregate resources in OpenStack; one of these methods, in particular, will
be relevant in our multi-hypervisor Cloud architecture.

The table above and the description of each method can be found in the OpenStack Operations
Guide. Note that for the purposes of this post, we’ll be leveraging Host Aggregates.

User Requirements

ACME Corporation has elected to deploy an OpenStack Cloud. As part of that deployment,
ACME’s IT group has decided to pursue a dual-hypervisor strategy using ESXi and KVM; the
plan is to create a tiered offering where applications would be deployed on the most appropriate
hypervisor, with a bias towards using KVM whenever possible since it is free. Their list of
requirements for this new Cloud includes the following:

 Virtualize 6 Oracle servers that are currently running on bare-metal Linux. Each server has dual
quad-core processors with 256 GB of RAM and is heavily used. ACME needs these Oracle
servers to all be highly available but have elected to not use Oracle Real Application Clusters
(RAC).
 Virtualize 16 application servers that are currently running on bare-metal Linux. Each server has
a single quad-core processor with 64 GB of RAM, although performance data indicates the
server CPUs are typically only 20% utilized and could run on just 1 core. The application is
home-grown and was written to be stateful and is sensitive to any type of outage or service
disruption.
 Virtualize 32 Apache servers that are currently running on bare-metal Linux. Each server has a
single quad-core processor with 8 GB of RAM. It has been determined that each server can be
easily consolidated based on moderate utilization. The application is stateless and new VMs can
be spun up and added to the web farm as needed.
 Integrate 90 miscellaneous Virtual Machines running a mix of Linux and Windows. These VMs
use an average of 1 vCPU and 4 GB of RAM.
 Maintain 20% free resource capacity across all hypervisor hosts for bursting and peaks.
 Utilize an N+1 design in case of a hypervisor host failure to maintain acceptable performance
levels.

Cloud Controller

The place to start is with the Cloud Controller. A Cloud Controller is a collection of services
that work together to manage the various OpenStack projects, such as Nova, Neutron, Cinder,
etc. All services can reside on the same physical or virtual host or, since OpenStack utilizes a
shared-nothing message-based architecture, services can be distributed across multiple hosts to
provide scalability and HA. In the case of Nova, all services except nova-compute can run on
the Controller while the nova-compute service runs on the Compute nodes.

For production, I would always recommend running two Cloud Controllers for redundancy and
using a number of technologies such as Pacemaker, HAProxy, MySQL with Galera, etc. A full
discussion of Controller HA is beyond the scope of this post. You can find more information in
the OpenStack High Availability Guide and I also address it in my slide deck on HA in
OpenStack. For scalability, a large Cloud implementation can leverage the modular nature of
OpenStack to distribute various services across multiple Controller nodes.

Compute Nodes

The first step in designing the Compute Node layer is to take a look at the customer requirements
and map them to current capabilities within Nova Compute, beginning with the choice of which
hypervisors to use for which workloads.

KVM

As discussed in previous posts, KVM with libvirt is best suited for workloads that are stateless in
nature and do not require some of the more advanced HA capabilities found in hypervisors
such as Hyper-V or ESXi with vCenter. For this customer use case, it would be appropriate to
virtualize the Apache servers and the miscellaneous servers using KVM.

From a sizing perspective we will use a 4:1 vCPU to pCPU overcommitment ratio for the
Apache servers, which is slightly more aggressive than the 3:1 vCPU to pCPU overcommitment
I advocated in my previous Resource Overcommitment post; however, I am comfortable with
being a bit more aggressive since these web servers see only moderate resource
utilization. For the miscellaneous servers, we will use the standard 3:1 overcommitment
ratio. From a RAM perspective we will maintain a 1:1 vRAM to pRAM overcommitment ratio
across both types of servers. The formula looks something like this:

((32 Apache * 4 pCPUs) / 4) + ((90 Misc VMs * 1 vCPU) / 3) = 62 pCPUs

((32 Apache * 8 pRAM) / 1) + ((90 Misc VMs * 4 vRAM) / 1) = 616 GB pRAM

Using a standard dual hex-core server with 128 GB RAM for our Compute Node hardware
standard, that works out to a minimum of 6 Compute Nodes, based on CPU, and 5 Compute
Nodes, based on RAM. Please note the following:

 The quantity of servers would change depending on the server specifications.
 It is recommended that all the Compute Nodes for a common hypervisor have the same
hardware specifications in order to maintain compatibility and predictable performance in the
event of a Node failure (The exception is for Compute Nodes managing vSphere, which we will
discuss later).
 Use the greater quantity of Compute Nodes determined by either the CPU or RAM
requirements. In this use case, we want to go with 6 Compute Nodes.
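
To make the base node counts above explicit:

62 pCPU / 12 cores per node = 5.2, rounded up to 6 Compute Nodes
616 GB pRAM / 128 GB per node = 4.8, rounded up to 5 Compute Nodes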

Starting with a base of 6 nodes, we then factor in the additional requirements for free resource
capacity and N+1 design. To allow for the 20% free resource requirements, the formula looks
something like this:

62 pCPU * 125% = 78 pCPU

616 GB pRAM * 125% = 770 GB pRAM

That works out to an additional Compute Node for a total of 7 overall. To factor in the N+1
design, we add another node for a total of 8 Compute Nodes.

vSphere

Given the use case requirements for a highly available infrastructure underneath Oracle, this
becomes a great use case for vSphere HA. For the purpose of this exercise, we will host Oracle
and application servers on VMs running on ESXi hypervisors. For vSphere, the design is a bit
more complicated since we have to design at 3 layers – the ESXi layer, the vCenter layer, and the
Compute Node layer.

ESXi Layer

Starting at the ESXi hypervisor layer we can calculate the following, starting with the Oracle
servers:

((6 Oracle * 8 pCPUs) / 1) = 48 pCPUs

((6 Oracle * 256 GB pRAM) / 1) = 1536 GB pRAM


Note that because Oracle falls into the category of a Business Critical Application (BCA), we
will follow the recommended practice of no overcommitment on CPUs or RAM, as outlined in
my blog post on Virtualizing BCAs. For Oracle licensing purposes, I would recommend, if
possible, the largest ESXi host you can feasibly use; you can find details on why in my blog post
on virtualizing Oracle. If we use ESXi servers with dual octa-core processors and 512 GB RAM,
we could start with 3 ESXi servers in a cluster. To allow for the 20% free resource requirements,
the formula looks something like this:

48 pCPU * 125% = 60 pCPU

1536 GB pRAM * 125% = 1920 GB pRAM
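
In terms of hosts, that is:

60 pCPU / 16 cores per host = 3.75, rounded up to 4 ESXi hosts
1920 GB pRAM / 512 GB per host = 3.75, rounded up to 4 ESXi hosts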

That works out to an additional ESXi host for a total of 4 overall. To factor in the N+1 design,
we add another node for a total of 5 hosts in a cluster. For performance and Oracle licensing
reasons, you should place these hosts into a dedicated cluster and create one or more additional
clusters for non-Oracle workloads.

Due to the sensitive nature of the application servers and their statefulness, they are also good
candidates for being hosted on vSphere. We can calculate the number of ESXi hosts required by
using the following:

((16 App * 4 pCPUs) / 4) = 16 pCPUs

((16 App * 64 GB pRAM) / 1.25) = 819 GB pRAM

Using dual quad-core servers with 256 GB RAM for our ESXi servers, that works out to a
minimum of 2 ESXi hosts, based on CPU, and 4 ESXi hosts, based on RAM. Again, it is
always recommended to go with the more conservative design and start with 4 ESXi hosts. To
allow for the 20% free resource requirements, the formula looks something like this:

16 pCPU * 125% = 20 pCPU

819 GB pRAM * 125% = 1024 GB pRAM

That works out to an additional ESXi host for a total of 5 overall. To factor in the N+1 design,
we add another node for a total of 6 hosts in a cluster. Typically, you would place all 6 hosts
into a single vSphere cluster.

vCenter Layer

I want to briefly discuss the issue of vCenter redundancy. In discussions with the VMware team,
they have recommended creating a separate management vSphere cluster, outside of OpenStack,
which would host the vCenter servers being managed by the OpenStack Compute nodes. This
would enable HA and DRS of the vCenter servers and maintain separation of the management
plane from the objects under management. However, for simplicity’s sake, I will assume here
that we do not have vCenter redundancy and that the vCenter server will not be running
virtualized.

Compute Node Layer

[Havana] In the Grizzly release, a compute node could only manage a single vSphere
cluster. With the ability, in Havana, to use one Compute Node to manage multiple ESXi
clusters, we have the option of deploying just one compute node or choosing to deploy 2
Compute Nodes for our use case. Note that there is currently no way to migrate a vSphere
cluster from one Compute Node to another. So in the event of a Node failure, vSphere will
continue running and vCenter could still be used for management; however, the cluster cannot be
handed off to another Compute Node. Once the managing Compute Node is
restored, the connection between it and vCenter can be re-established.

In terms of redundancy, there is some question as to a recommended practice. The Nova
Compute guide recommends virtualizing the Compute Node and hosting it on the same cluster it
is managing; in talks with VMware’s OpenStack team, they recommend placing the Compute
Node in a separate management vSphere cluster so it can take advantage of vSphere HA and
DRS. The rationale again would be to design around the fact that a vSphere cluster cannot be
migrated to another Compute Node. For simplicity’s sake, I will assume that the Compute
Nodes will not be virtualized.

[Havana] Putting the three layers together, we have a design that looks something like this if we
choose to consolidate management of our 2 vSphere clusters on to 1 compute node; as mentioned
earlier, we also have the option in Havana to use 2 compute nodes, 1 per cluster, in order to
separate failure zones at that layer.
The use of host aggregates above is to ensure that Oracle VMs will be spun up in the Oracle
vSphere cluster and application servers in the app server vSphere cluster when deployed through
OpenStack. More details about the use of host aggregates will be provided in the next section.

Putting It All Together

[Havana] A multi-hypervisor OpenStack Cloud with KVM and vSphere may look something
like this, using 1 compute node for vSphere; as mentioned earlier, we also have the option in
Havana to use 3 compute nodes, 1 per cluster, in order to separate failure zones at that layer.
Note the use of Host Aggregates in order to partition Compute Nodes according to VM types. As
discussed in my previous post on the nova-scheduler, a vSphere cluster appears to the
nova-scheduler as a single host. This can create unexpected decisions regarding VM
placement. Also, it is important to be able to guarantee that Oracle VMs are spun up on the Oracle
vSphere cluster, which utilizes more robust hardware. To design for this and to ensure correct VM
placement, we can leverage a feature called Host Aggregates.

Host Aggregates are used with the nova-scheduler to support automated scheduling of VM
instances to be instantiated on a subset of Compute Nodes, based on server capabilities. For
example, to use Host Aggregates to ensure specific VMs are spun up on Compute Nodes managing
KVM, you would do the following (see the CLI sketch after the list):

1. Create a new Host Aggregate called, for example, KVM.
2. Add the Compute Nodes managing KVM into the new KVM Host Aggregate.
3. Create a new instance flavor called, for example, KVM.medium with the correct VM resource
specifications.
4. Map the new flavor to the correct Host Aggregate.
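
A rough sketch of those steps with the nova CLI might look like the following. The aggregate,
host, and flavor names and the metadata key are placeholders, older clients may expect the
aggregate ID rather than its name, and the AggregateInstanceExtraSpecsFilter must be included
in scheduler_default_filters for the flavor-to-aggregate mapping to take effect.

nova aggregate-create kvm-aggregate
nova aggregate-set-metadata kvm-aggregate hypervisor_pool=kvm
nova aggregate-add-host kvm-aggregate kvm-compute-01
nova flavor-create KVM.medium auto 4096 40 2
nova flavor-key KVM.medium set hypervisor_pool=kvm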

Now when users request an instance with the KVM.medium flavor, the nova-scheduler will only
consider compute nodes in the KVM Host Aggregate. We would follow the same steps to create
an Oracle Host Aggregate and Oracle flavor. Any new instance requests with the Oracle flavor
would be fulfilled by the Compute Node managing the Oracle vSphere Cluster; the Compute
Node would hand the new instance request to vCenter, which would then decide which ESXi
host should be used.
[Havana] Of course there are other infrastructure requirements to consider, such as networking
and storage. As mentioned earlier, I hope to address these concerns in future blog posts over
time. Also, you can freely access and download my “Cattle Wrangling For Pet Whisperers:
Building A Multi-hypervisor OpenStack Cloud with KVM and vSphere” slide deck.

As always, all feedback is welcome.
