Introduction
vCenter Server Heartbeat, from here on shortened to vCSHB, is a Windows application
that protects vCenter Server and its associated components and applications with a true
high availability solution. Many technologies offer some type of protection
for vCenter Server, but today vCSHB is unique in its level of protection. The focus of
this chapter is to understand, at a high level, how vCSHB works to protect vCenter
Server. Before we can talk about how vCSHB is built and how it works, though, we
should understand why we need it in the first place. We'll do that by first exploring the
effects of an unavailable vCenter Server, what options exist to protect vCenter Server,
their shortcomings, and how vCSHB provides high availability in a way they can't.
We'll also look at how vCSHB is commonly deployed and at the interfaces
used to manage it. These topics are outlined in the following sections.
The next section talks about why an environment may need vCenter Server to be highly
available. In other words, it helps define a use case for vCSHB, because not every
vCenter Server needs to be highly available. It also discusses the implications of
vCenter Server being down.
be losing performance data and backups are on a reliable, rotating schedule and are
copied offsite.
Finally, an RTO defines how long it takes to restore all or individual services. If vCenter
Server does not need to be highly available, perhaps an RTO of 4 or 8 hours is sufficient because
a complete rebuild is planned if vCenter Server is lost. For others, loss of vCenter
services would mean millions of dollars in lost revenue, so it must not incur any
downtime whatsoever. An RTO of seconds, then, is required, and even then, those
seconds are used to determine whether the service is available. If the service is not available,
vCenter Server may fail over to another instance.
Once a company has identified its SLA, RPO, and RTO requirements, it can determine
how best to achieve them by tying them together into a use case. While this book focuses
on vCSHB, it will help to understand the use cases for other vCenter Server
availability solutions in addition to vCSHB. To begin, let's look at some of the services
vCenter Server offers and how they're affected when vCenter Server is not available.
First and foremost, vCenter Server offers the main management interface into a VMware
vSphere environment. It can manage hundreds of ESXi hosts and thousands of virtual
machines that otherwise would have to be managed on an individual ESXi host basis.
Right away, we see the benefit of the centralized management vCenter Server affords.
vCenter Server is the default, main collector of performance metrics pulled from the
environment. It also provides the default alarm system for performance, capacity, and
configurations. If vCenter Server is not available, it's not gathering performance data
and it's not evaluating thresholds or sending emails or SNMP traps for alarms.
There are several high-profile functions that don't work if vCenter Server is offline.
These include the Distributed Resource Scheduler (DRS), Storage Distributed
Resource Scheduler (SDRS), vMotion, and Storage vMotion. DRS and SDRS
calculations that detect imbalances in host memory and CPU utilization, datastore
capacity, and datastore I/O latency are performed by vCenter Server based on input from
the hosts. If vCenter Server is down, these functions are unavailable.
vMotion and Storage vMotion perform live migrations of virtual machines between hosts
and datastores, respectively. DRS and SDRS rely on these functions in order to complete
their tasks. Without vCenter Server, none of these functions work. Together, these
functions are ultimately responsible for load-balancing workloads to avoid performance
problems or to aid in maintenance. Without vCenter Server, workloads cannot be
balanced and performance may suffer. If virtual machines can't be moved around,
maintenance may become more difficult.
Many vSphere administrators know that vSphere High Availability (HA) still functions
without vCenter Server because, once configured, it's only communication between ESXi
hosts that determines which virtual machines are restarted and where. But one HA feature
that is managed by vCenter Server is the Admission Control Policy. This policy ensures
enough resources are reserved in a vSphere Cluster to boot virtual machines in the case of
an HA event. If vCenter Server is down, this policy cannot be enforced.
Finally, there are many additional servers from VMware and third parties that rely on
vCenter Server being operational in order to perform their tasks. These include vCenter
Operations Manager, VMware Horizon View, vCloud Director, vCloud Automation
Center, vCenter Orchestrator, and third-party products such as Veeam Backup and Replication
for business continuity and disaster recovery or Quest vFoglight for performance and
capacity management.
One further downside to relying on HA to keep vCenter Server available is its lack of
restart priorities. If HA restarts vCenter Server and its database server when they're
installed on separate VMs, it's only coincidence if they come back online in the correct
order: database first, then vCenter. For environments that can't withstand this amount of
downtime or risk to their vCenter Server, HA is of little help.
VM and Application Monitoring
VM and Application Monitoring are closely tied to vSphere HA and operate in a similar
fashion, relying on heartbeats to determine if a service is available and restarting it if it
fails. VM Monitoring is disabled by default in a vSphere HA cluster. When enabled, the
HA agent on the ESXi host receives heartbeats from the VMware Tools installed on the
virtual machine. If the heartbeats stop, the HA agent checks whether any disk or network
activity has occurred from the virtual machine in the last two minutes. If none has,
the virtual machine is restarted. If I/O has occurred even though VMware Tools is not
responding, the virtual machine is considered active and is not restarted. VM Monitoring
can protect vCenter Server in this way but, again, the virtual machine is restarted and
downtime is measured in minutes. The drawback is that VM Monitoring offers no application
awareness; it protects against operating system faults only. So if
vCenter Server suffers from application or service issues, VM Monitoring can't help.
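The VM Monitoring decision described above can be sketched as a small function. This is only an illustrative model of the logic in the text, not VMware's implementation; the names `VmObservation` and `should_restart` are hypothetical, and the 120-second window simply encodes the "last two minutes" mentioned above.

```python
from dataclasses import dataclass

# The "last two minutes" I/O check window described in the text.
IO_CHECK_WINDOW_SECONDS = 120.0


@dataclass
class VmObservation:
    """What the HA agent is assumed to know about one virtual machine."""
    heartbeat_received: bool      # did VMware Tools send a heartbeat?
    seconds_since_last_io: float  # time since the last disk or network activity


def should_restart(obs: VmObservation,
                   io_window: float = IO_CHECK_WINDOW_SECONDS) -> bool:
    """Return True when the logic described above would restart the VM."""
    if obs.heartbeat_received:
        # Heartbeats are arriving, so the guest is considered healthy.
        return False
    # Heartbeats stopped: restart only if there was no recent I/O either.
    return obs.seconds_since_last_io > io_window


if __name__ == "__main__":
    print(should_restart(VmObservation(False, 300.0)))  # no heartbeat, no recent I/O
    print(should_restart(VmObservation(False, 30.0)))   # no heartbeat, but recent I/O
```

Note that nothing in this check looks at the vCenter services themselves, which is the lack of application awareness the text points out.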
Application Monitoring fills the gap in VM Monitoring by allowing third parties like
Veritas/Symantec to write applications that can monitor vCenter Server and database
services. If services become unresponsive to the application monitor, those services can
be restarted first, before restarting the virtual machine. Here again, downtime to vCenter
Server can still be measured in minutes if a virtual machine has to be restarted, possibly
less if only services are restarted. Even then, though, if the vCenter management
services or database services are restarted, vCenter is not available during that time.
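The escalation described above, restarting services first and the virtual machine only as a last resort, might look like the following sketch. It is a generic model rather than any vendor's monitoring agent: the `remediate` function, its callback parameters, and the retry count of two are all assumptions made for illustration.

```python
from typing import Callable, List


def remediate(service_healthy: Callable[[], bool],
              restart_service: Callable[[], None],
              restart_vm: Callable[[], None],
              max_service_restarts: int = 2) -> List[str]:
    """Try service restarts first; restart the VM only if they do not help.

    Returns the list of actions taken, in order.
    """
    actions: List[str] = []
    for _ in range(max_service_restarts):
        if service_healthy():
            return actions
        restart_service()
        actions.append("restart_service")
    if not service_healthy():
        # Service restarts did not recover the application; escalate.
        restart_vm()
        actions.append("restart_vm")
    return actions


if __name__ == "__main__":
    state = {"healthy": False}
    taken = remediate(
        service_healthy=lambda: state["healthy"],
        restart_service=lambda: state.update(healthy=True),  # restart fixes it
        restart_vm=lambda: None,
    )
    print(taken)
```

Either action still interrupts vCenter while it runs, which is why this approach reduces downtime without eliminating it.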
As part of vSphere 5.5, VMware released its own application monitoring tool called App
HA. In its initial release, it can only protect the vCenter Server database services, not
vCenter Server itself.
VMware Fault Tolerance
I'll say very little about the possibility of Fault Tolerance (FT) protecting vCenter Server
because, as of vSphere 5.5, it only supports single-vCPU virtual machines. You rarely
find vCenter Servers with only one vCPU. They're usually in test labs or small
deployments that don't require high availability in the first place.
Lack of support for multiple vCPUs is likely to change in an upcoming release,
though (VMware shared a technical preview of Symmetric Multi-Processor
Fault Tolerance (SMP-FT) at VMworld 2012).
Cluster failover times can be measured in seconds to minutes. The major drawback to
this, though, is its complexity to set up and maintain.
For an excellent blog series on configuring WSFC in SQL Server 2012 on Windows
Server 2012, which can be used to protect a vCenter Server database, see Derek
Seaman's blog: http://www.derekseaman.com/2013/09/sql-2012-failovercluster-pt-1-introduction.html
While not a database clustering solution, SQL Server Log Shipping can be used to
replicate vCenter Server database changes to a stand-by database server. If the active
database server fails, vCenter can be pointed to the stand-by database server through a
manual process. The downtime associated with this is easily measured in 30-minute increments
because of the manual steps involved. Like other database-only protection schemes,
Log Shipping does not protect the vCenter Server itself.
SQL Server Mirroring
SQL Mirroring allows for quicker, automatic failover and much less downtime than Log
Shipping. Mirroring has the advantage over clustering solutions that require shared
storage because virtual machines participating in SQL Server Mirroring can be
vMotioned and DRS can move them around a vSphere cluster automatically. The
drawback with Mirroring is that the vCenter Server itself is still not protected.
Backups and clones
The final way to protect vCenter Server is to perform regular backups. This type of
protection should already be in place in most environments but is probably the least
effective way to provide high availability. In the worst case, restore operations can be
measured in hours or days. The key component to back up is, of course, the vCenter
Server database. This is easily accomplished through native SQL Maintenance Plans, but
there are third-party tools available as well. Note that this discussion is from the perspective of
the vSphere Administrator, not a Database Administrator. If the reader wears both hats,
Maintenance Plans will likely suffice. If a Database Administrator is responsible for
backing up the vCenter Server database, he or she will likely perform some tasks differently. The
important thing is to ensure the backups are performed and verified.
For an introduction to SQL Maintenance Plans for the vSphere Administrator,
see my blog post on the subject
http://virtuallymikebrown.com/2011/10/14/sql-server2008-backups-for-vmware-databases/
As with other SQL-centric approaches to protection, the vCenter Server itself is not
protected. That's where cloning and snapshots come in. A clone or snapshot can be
taken on a schedule and reverted to in case a restore is needed. Cloning works with
physical vCenter Servers, as well. Clones can be created of a physical vCenter Server by
using VMware vCenter Converter. This process will non-disruptively clone the physical
server into a virtual machine. Once in virtual machine format, it can be backed up or
moved like any other virtual machine. If the physical vCenter Server ever fails, the clone
virtual machine can be powered on and immediately take the place of the original.
If vCenter Server is already a virtual machine, VMware Converter can clone it as well.
But an easier way to protect a virtual vCenter Server is to use the built-in cloning
function of vSphere itself. Just right-click the vCenter Server virtual machine and choose
Clone. This can serve as a backup in case the original is lost. The cloning process for
both VMware Converter and vSphere cloning is substantial enough, in terms of
resources required, that it might only be reasonable to clone the server once or twice a day.
Of course, this means that any changes made after the last clone operation and before the
next clone will be lost if there's a failure.
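To put a number on that exposure, the worst case is a failure just before the next clone, so the loss window equals the interval between clones. The helper below is a hypothetical illustration of that arithmetic and assumes clones are evenly spaced through the day.

```python
def worst_case_loss_hours(clones_per_day: int) -> float:
    """Hours of changes that could be lost if a failure hits just before the next clone.

    Assumes clones are evenly spaced across a 24-hour day.
    """
    if clones_per_day <= 0:
        raise ValueError("clones_per_day must be a positive integer")
    return 24 / clones_per_day


if __name__ == "__main__":
    for n in (1, 2):
        print(f"{n} clone(s) per day: up to {worst_case_loss_hours(n):.0f} hours of changes lost")
```

Cloning once or twice a day therefore corresponds to a worst-case loss of 24 or 12 hours of changes, which is worth comparing against the RPO an environment has defined.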
The final way to back up a vCenter Server is to take snapshots. There are two kinds of
snapshots: VMware Snapshots and storage array snapshots. VMware
Snapshots alone aren't true backups. They should be used for short periods of time
only and deleted when no longer needed. Leaving VMware Snapshots in place for long periods
of time can cause the Snapshots to grow very large and cause problems during their
deletion or when committing them. Performance problems can also arise when running
off large Snapshot delta files. As a long-term protection strategy, VMware Snapshots
should not be used. They're really only useful when changes are made to the system, such as
before Microsoft updates are applied or before an upgrade is performed.
For an overview of best practices for VMware Snapshots, see VMware KB
article 1025279:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1025279
The second type of snapshot is taken at the storage array level. These snapshots are
managed by the storage array itself so the vSphere layer is usually unaware of them.
Storage array vendor tools can help automate the work required to take advantage of
these snapshots in a restore situation. Otherwise, the datastores in which the virtual
machine snapshots exist can be mounted on an ESXi host and the snapshot restored.
These snapshots are often replicated to long-term secondary storage for archival and
disaster recovery purposes. Unlike VMware Snapshots, storage snapshots can indeed be
part of a backup strategy. VMware Snapshots can even be included in storage array
snapshots so virtual machines have file-system-consistent backups.
vCSHB networks
Part of the magic behind how vCSHB provides high availability with live fail over hosts
is how it utilizes the networking between them. This section explains what those
networks are and how they're implemented.
Management network
Another network is used for day-to-day management of the nodes themselves, for
instance, to apply patches, connect via the Remote Desktop Protocol (RDP), or perform
other maintenance. This is called the Management Network and each node will have a
unique Management IP address. This is usually the standard network used for the
server being protected.
Finally, the Public IP address is also assigned to the Management Network. This is a
single IP address that is shared between nodes. Only the Active node, though, is
accessible via the Public IP address. The Passive node is not visible on the network using
this IP address. vCSHB accomplishes this through the use of a packet filter installed on
each node. Both nodes are actually assigned the same Public IP address but the packet
filter on the Passive node will block all traffic that uses the Public IP address. When a
fail over or switch over occurs, the following process takes place.
be dedicated for VMware Channel replication traffic and the other will service client and
administrative traffic through the Public and Management IP addresses, respectively.
As shown in the figure above, the VMware Channel IP addresses share a subnet as do the
Public and Management IP addresses. This configuration is specific to a LAN
deployment. If deployed across a WAN, each server would have its own subnets for VMware
Channel and management traffic. Looking at the figure, client traffic will only access
services through the Public IP address and so, client traffic is directed to the server on the
left. The packet filter installed on the server on the right actively blocks traffic associated
with the Public IP address, essentially hiding the server from clients on the network.
Administrative traffic, however, does not use the Public IP address; it uses the
Management IP addresses and so can flow to each server simultaneously. This is the
steady state of a vCSHB installation. The process involved when a fail over or switch
over occurs is discussed next.
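The LAN addressing rules described above (Channel addresses sharing one subnet, Public and Management addresses sharing another, and a unique Management address per node) can be checked with a short script before installation. This is a planning sketch only: the `check_lan_plan` helper, the dictionary keys, the /24 prefix length, and all addresses are hypothetical and are not part of vCSHB.

```python
import ipaddress


def same_subnet(addresses, prefix_len):
    """True when every address falls in one network of the given prefix length."""
    networks = {
        ipaddress.ip_interface(f"{addr}/{prefix_len}").network for addr in addresses
    }
    return len(networks) == 1


def check_lan_plan(plan, prefix_len=24):
    """Return a list of problems found in a two-NIC LAN address plan."""
    problems = []
    if not same_subnet([plan["primary_channel"], plan["secondary_channel"]], prefix_len):
        problems.append("Channel IP addresses are not on one subnet")
    if not same_subnet(
        [plan["public"], plan["primary_mgmt"], plan["secondary_mgmt"]], prefix_len
    ):
        problems.append("Public and Management IP addresses are not on one subnet")
    if plan["primary_mgmt"] == plan["secondary_mgmt"]:
        problems.append("Management IP addresses must be unique per node")
    return problems


# Hypothetical addresses, chosen only for illustration.
plan = {
    "public": "10.0.0.10",
    "primary_mgmt": "10.0.0.11",
    "secondary_mgmt": "10.0.0.12",
    "primary_channel": "10.0.1.11",
    "secondary_channel": "10.0.1.12",
}

if __name__ == "__main__":
    print(check_lan_plan(plan) or "plan looks consistent for a LAN deployment")
```

A WAN deployment would not pass these checks by design, since each site has its own subnets there.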
interface with the Public IP address while the newly active node activates the Public IP
address and starts accepting traffic directed to it. This is shown in the figure below. In
this case, DNS records do not have to be updated because the nodes themselves are
responsible for blocking and activating their Public IP address interfaces. The newly
Active node is also responsible for sending a gratuitous ARP broadcast to let upstream
switches know it is now servicing all requests for the Public IP address.
A stretched VLAN deployment operates in exactly the same way. The only difference is
that the nodes are geographically dispersed.
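The role swap on a LAN can be modeled in a few lines to make the packet filter behavior concrete. This is a toy model built from the description above, not vCSHB code; the `Node` class, the `accepts` and `switch_over` functions, and the addresses are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

PUBLIC_IP = "10.0.0.10"  # hypothetical shared Public IP address


@dataclass
class Node:
    name: str
    management_ip: str
    role: str  # "active" or "passive"
    events: List[str] = field(default_factory=list)

    @property
    def public_ip_filtered(self) -> bool:
        # The packet filter blocks the Public IP address on the Passive node only.
        return self.role == "passive"


def accepts(node: Node, dest_ip: str) -> bool:
    """Would this node answer traffic sent to dest_ip?"""
    if dest_ip == PUBLIC_IP:
        return not node.public_ip_filtered
    return dest_ip == node.management_ip


def switch_over(active: Node, passive: Node) -> None:
    """Swap roles: the old Active node blocks the Public IP, the new one opens it."""
    active.role = "passive"
    active.events.append("block public IP")
    passive.role = "active"
    passive.events.append("unblock public IP")
    # Tell upstream switches which node now services the Public IP address.
    passive.events.append("send gratuitous ARP")


if __name__ == "__main__":
    primary = Node("primary", "10.0.0.11", "active")
    secondary = Node("secondary", "10.0.0.12", "passive")
    switch_over(primary, secondary)
    print(accepts(primary, PUBLIC_IP), accepts(secondary, PUBLIC_IP))
    print(secondary.events)
```

Because the Public IP address itself never changes in this model, no DNS update appears anywhere in the sequence, which matches the LAN behavior described above.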
Backups and clones, while being important in the overall protection scheme,
are far from being able to provide high availability mainly due to the time
required to recover.
vCSHB addresses every one of these to provide a true high availability solution. It
protects vCenter Server, its database, and certain other services from hardware
failures, operating system failures, application and service failures, network failures, and
even performance problems that might cause these services to become unavailable. How
vCSHB accomplishes this is described below.
vCSHB operates across all hardware on the VMware Hardware Compatibility List
(HCL). If you run vCenter or its database on physical servers, this is advantageous
because vCSHB can still offer protection. Being a Windows application, it can only
protect a vCenter Server installed on Windows, though. It cannot protect the vCenter
Server Virtual Appliance (vCSA). The best way to protect the vCSA today is to use
vSphere HA.
DiskAvgSecsPerWrite
DiskIO
DiskQueueLength
DiskReadsPerSec
DiskWritesPerSec
DiskWritable
FreeDiskSpace
FreeDiskSpaceOnDrive
MemoryCommittedBytes
MemoryCommittedBytesPercent
MemoryFreePTEs
MemoryPageReadsPerSec
MemoryPageWritesPerSec
MemoryPagesPerSec
MemoryPagingFileUseage
PageFaultsPerSec
ProcessorIntsPerSec
ProcessorLoad
ProcessorQueueLength
RedirectorBytesTotalPerSec
RedirectorNetworkErrorsPerSec
ServerBytesTotalPerSec
ServerWorkItemShortages
ServerWorkQueueLength
SystemContextSwitches
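One way to picture how counters like those listed above could feed a performance check is a simple threshold comparison. The sketch below is an assumption-laden illustration, not vCSHB's rule engine: the `evaluate_rules` function, the "value above limit" semantics, and every threshold and sample value are hypothetical.

```python
from typing import Dict, List


def evaluate_rules(samples: Dict[str, float],
                   upper_limits: Dict[str, float]) -> List[str]:
    """Return the names of counters whose sampled value exceeds its upper limit.

    Counters without a sample are skipped. Only "too high" conditions are modeled;
    a counter such as FreeDiskSpace would need a lower limit instead.
    """
    return sorted(
        name
        for name, limit in upper_limits.items()
        if name in samples and samples[name] > limit
    )


if __name__ == "__main__":
    # Hypothetical limits and samples, for illustration only.
    limits = {"ProcessorLoad": 90.0, "DiskQueueLength": 5.0}
    samples = {"ProcessorLoad": 97.0, "DiskQueueLength": 2.0}
    print(evaluate_rules(samples, limits))
```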
these services use. It installs service-specific file filters which capture all data associated
with each service in order to replicate it. Now let's assume that Update Manager is
installed on this machine after vCSHB is installed. vCSHB will not recognize the Update
Manager installation automatically and will not replicate any data associated with it. A
vCSHB administrator must install the Update Manager file filter manually. This tells
vCSHB to protect the Update Manager application by replicating its data.
The process of replicating data is captured in the steps below.
1.
2.
3.
4.
5.
Server architecture
vCSHB supports three different server deployment architectures: physical to physical (P-P), physical to virtual (P-V), and virtual to virtual (V-V). Most deployments will be V-V
because most vCenters are virtualized. This is beneficial because working with virtual
machines is generally much easier than working with physical machines. In the case of
vCSHB, deployments that involve physical servers are different in a number of ways.
The first is the way in which a clone is made of the Primary server. There are only three
cloning options supported by VMware to do this. The options are VMware vCenter
Converter, VMware vCenter cloning through the vSphere Client or Web Client, and
vCSHB native cloning which is bundled with the vCSHB media. Which one is used
depends on the architecture chosen.
If the Primary server is physical and the secondary is virtual, the cloning
process used will be VMware vCenter Converter.
Number of NICs
Another important option when choosing an architecture is whether to use one, two, or
more NICs. Whichever is chosen, it must be the same on both nodes. For redundancy, two
NICs are recommended. Two NICs allow for a dedicated Public NIC and a VMware
Channel NIC. The Public NIC would share the Management and Public IP addresses and
the VMware Channel NIC would host the Channel IP address. If a single NIC is used, all
IP addresses will be hosted by the same NIC and in this case, can all share the same
subnet. Single NIC deployments should be avoided because they are a single point of
failure.
Teaming Intel NICs is supported with the workaround in VMware KB article
1027288:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1027288
When vCSHB protected servers are virtualized, each virtual NIC should be placed on a
separate vSwitch. This prevents all vCSHB traffic from traversing a
single physical NIC whose failure would take down all vCSHB traffic. Ensure
redundancy is employed on whichever vSwitches vCSHB uses to avoid single points of
failure.
The most common use of more than two NICs is when two NICs are dedicated to the
VMware Channel in addition to the one used for Public traffic. These are not teamed
NICs but two independent NICs with unique IP addresses. This is usually done when
multiple distinct paths, including different physical links and upstream switches and
routers, can be used for each VMware Channel network. This increases the network
redundancy between vCSHB nodes. Such redundancy is usually seen across a WAN
when two different providers allow for two separate paths between sites. This
redundancy is illustrated below.
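For WAN deployments, a static route is required on the VMware Channel network
because the NIC used for this traffic is not configured with a default gateway. An example
of adding a static route using PowerShell in Windows Server 2012 is shown below. The
destination prefix and interface alias are examples; substitute the remote Channel subnet
and the name of the local Channel NIC, and typically also specify -NextHop with the
local Channel gateway address.
New-NetRoute -DestinationPrefix 172.20.20.0/24 -InterfaceAlias Channel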
Or services may be distributed among many machines. In this case, each machine can
fail over independently and the surviving nodes will reconnect once the fail over is
complete. If a node is running several protected services, it's important to note that all
services fail over at once. It's not possible to fail over some services on a single machine
and leave others running on that same machine. It's all or nothing, so to speak.
Finally, vCSHB is capable of being deployed in environments that use features such as
Linked Mode, Site Recovery Manager, and multi-site SSO. In large deployments that
include DR provided by SRM, vCSHB can be deployed in an architecture similar to that
shown in the diagram below.
Each site has distributed services which it protects local to the site only using vCSHB.
Linked Mode is used between sites to benefit from single pane of glass management and
shared vCenter inventory and licenses. Site protection is provided by SRM, but recall that
SRM itself, including its database, is not supported by vCSHB.
One other interesting note is that only two vCSHB licenses are used in this deployment,
one at each site. Even though six servers are protected by vCSHB, only two vCenter
instances are, and recall that vCSHB is licensed per vCenter Server instance.
Here, we can note several items. The first is the overall status of the vCSHB pair. Green
check marks mean everything is healthy. If something were wrong, yellow triangles
would appear and messages would be seen in the lower pane.
Notice the tabs along the top of the page. They include the various levels at which
vCSHB offers protection: server, network, application, and data replication. Choosing
each tab offers sub-menus for those categories. On the first tab in the client, under Server
> Summary, we have the option to perform a manual switch over by choosing the Make
Active button in the top pane. This starts a manual switch over whose progress we can
watch in the client.
During the switch over, we can view a progress bar as well as the numbered steps being
executed. Details of the switch over can be viewed from the Logs tab, which can also be
useful in troubleshooting.
Staying on the Server tab, when the Passive server is selected, we can see the status of the
Send and Receive Queues by viewing the Recovery Point in the bottom pane. Most of
the time, you'll want to see 0 milliseconds, which tells you data is being quickly
replicated and committed on the passive node. If this number grows, it's an indication
that there's a problem with the VMware Channel network, perhaps bandwidth contention,
latency, or connectivity problems.
Finally, both the vSphere Client and Web Client have vCSHB plug-ins installed when
vCSHB itself is installed. The plug-ins do not have nearly the features the standalone
client does, but for the purposes of executing common workflows, the plug-ins work
well.
Common workflows that can be executed include those listed in the Actions pane in the
upper right. Similar to the standalone client, we can see the overall status of the
replication relationship here as well as perform manual switch overs.
Summary
In this chapter, we introduced vCSHB, looked at why vCenter Server may need to be
highly available and what services are affected when it's down. Recall that vCenter
allows for centralized management of your vSphere infrastructure, in addition to DRS,
vMotion, deploying from templates, and more. Also, important services like Auto
Deploy, Update Manager, vCloud Director and vCAC, and Horizon View environments,
among others, rely on an operational vCenter Server. By identifying the services you
lose in concert with defined SLAs, RPOs, and RTOs, a use case can be built for vCSHB.
The architecture of vCSHB was discussed next starting with the two types of vCSHB
deployments, LAN and WAN. vCSHB identities, Primary and Secondary, stay with the
server throughout its lifecycle, while the roles, Active and Passive, move between them.
A key part of the vCSHB architecture is the VMware Channel network over which
heartbeats and data replication take place. The Management Network is the normal
network over which day-to-day management takes place. Both servers' Management IP
addresses and the shared, client-facing Public IP address are on the Management Network.
LAN and WAN fail overs are much the same with the biggest difference being a change
in the Public IP address DNS record during a WAN fail over because the Public IP
address will be different across the WAN.
We also looked at each vCSHB protection mechanism in turn to see how it provides a
spectrum of protection for vCenter Server and its database. Recall the following failures
against which vCSHB protects:
Network
Application and service
Performance degradation
Application data
Finally, the vCSHB management interfaces were introduced along with their major
features.
With an understanding of what vCSHB is, how it works, and how to interface with it, the next
chapter will explore in detail how to install and, more importantly, how to configure
vCSHB to protect vCenter Server and its SQL Server when they're installed on the same
machine.