Multiple vCPU Fault Tolerance on vSphere 6.0

It’s been a long wait, but it is finally over. VMware’s Fault Tolerance (FT) now officially supports guests with multiple vCPUs. This feature provides zero downtime, zero data loss and continuous availability for any application.

A little background on the history of FT with VMware.

VMware officially introduced this feature in vSphere 4.0 back in 2009, and it was instantly hailed as a breakthrough in zero downtime for guest operating systems and applications. This did, however, come with a price. There were only so many VMs that you could protect per host, and the overhead on the network and hosts was demanding.

I think back to when this feature became available and I thought to myself: wow, there are going to be a lot of business lines that will want in on that magic. I was always sad when I had to tell them about the memory and vCPU limitations, and even those that did meet the guidelines were turned away because we had exceeded the number of guest protections per cluster.

The number of virtual machines that can be protected is now based on how many vCPUs are protected per host. The maximum remains at 4 guests or 8 FT-protected vCPUs per host (whichever comes first). These values count both primary and secondary virtual machines and their vCPUs. There is some overhead involved, depending on the workload and the number of FT-protected virtual machines. Generally, you can expect a 10-30% overhead increase. This overhead will primarily be on the network, with a minimal CPU hit on each cluster node.
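To make the "whichever comes first" rule concrete, here is a minimal sketch that checks a proposed placement against the two per-host limits quoted above. The function name and constants are hypothetical helpers for illustration, not part of any VMware API, and the limits are simply the values stated in this post.

```python
# Per-host FT limits as quoted in this post for vSphere 6.0 (assumed defaults).
MAX_FT_VMS_PER_HOST = 4
MAX_FT_VCPUS_PER_HOST = 8


def fits_ft_limits(vcpus_per_vm):
    """Return True if the given FT VMs fit on a single host.

    vcpus_per_vm: list holding the vCPU count of every FT primary or
    secondary VM placed on the host (both kinds count toward the limits).
    """
    return (len(vcpus_per_vm) <= MAX_FT_VMS_PER_HOST
            and sum(vcpus_per_vm) <= MAX_FT_VCPUS_PER_HOST)


print(fits_ft_limits([4, 4]))           # 2 VMs, 8 vCPUs: at the vCPU limit
print(fits_ft_limits([4, 4, 1]))        # 9 vCPUs: over the vCPU limit
print(fits_ft_limits([1, 1, 1, 1, 1]))  # 5 VMs: over the VM limit
```

Note how two 4-vCPU machines already exhaust a host, even though the VM count is only half of the allowed four.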

Reasons why it’s been at one vCPU for a while now

The limit on how many vCPUs could be protected lay in the lockstep mechanism that was used to keep the dormant node up to date and ready for an immediate takeover. This was known as the “Record-Replay” method, and it has been replaced with a new technology known as “Fast Checkpointing”. The new mechanism allows for multiple vCPU protection through continuous copying/checkpointing of the virtual machine.

Some of the same rules apply:

To ensure successful protection of virtual machines, you still need to abide by some of the basic rules for vSphere Fault Tolerance.

  • You still need to ensure that all protected machines are on hosts running the same version of vSphere. In this case, version 6.0 (of course).
  • Dedicated VMkernel portgroups must be configured for FT logging.
  • 10 Gb network links must be used.
  • As mentioned above, you will see a 10-30% network overhead increase (based on the workload and the number of FT-protected machines).
  • vMotion is supported for both the primary and secondary nodes of a protected VM.

Additional Protection and options at the Storage Layer

In addition to the raised vCPU limit, the new version of Fault Tolerance also moves away from the old approach of a single storage location shared by both the primary and secondary virtual machines. The new version places the primary and secondary virtual machine files on different storage volumes, which further protects the machine from storage failures.

Fault Tolerance on vSphere 6.0 now supports the use of thick and thin disk types. The previous version only supported eager-zeroed thick disks.

Another great feature is the re-protect mechanism for those storage volumes that run out of space. vCenter will monitor the FT replica and spin up a new secondary VM on a new datastore.

Remaining Points

  • As before, svMotion is not possible with VMware Fault Tolerance running multiple vCPUs.
  • vCloud Director, VSAN/VVOLs and VMware Replication are not supported with SMP-FT machines.
  • VADP (vStorage APIs for Data Protection) and snapshots are now supported on vSphere 6.0 FT!

Understanding COW.COWMaxHeapSizeMB

One of the most amusingly named advanced settings in ESX is COW.COWMaxHeapSizeMB. This setting is configured on a per-host basis. Only about 20% of the heap (on average) is set aside for internal data structures; the root entries are the primary occupant of the memory and consume about 75% of it. The default size is 192 MB as of this post.


COW stands for ‘Copy on Write’ and is the delta disk process that snapshots use on virtual machines running on a vSphere host. Let’s take a quick look at the math behind this value in order to ascertain the COW requirements and understand your limitations based on the values.

X will be the number of virtual machines that can be powered on
Y will be the number of virtual disks per virtual machine
S will be the number of snapshots on each of the virtual disks
B will be the size of each virtual disk in bytes

X = (75 / 100 * COW_HEAP_SIZE) / ((B / (2 * 1048576) * 4 * S) * Y)

2 * 1048576 is GDE Coverage and constant 4 is the bytes per Root Entry
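For readers who would rather plug numbers in than do the arithmetic by hand, here is a minimal sketch of the formula as a Python function. The function and constant names are my own; the 75% root entry share, the 2 MB GDE coverage and the 4 bytes per root entry are taken straight from the formula above.

```python
GDE_COVERAGE_BYTES = 2 * 1048576   # bytes of virtual disk covered per root entry
BYTES_PER_ROOT_ENTRY = 4
ROOT_ENTRY_SHARE = 75 / 100        # share of the COW heap used by root entries


def max_powered_on_vms(heap_mb, disk_bytes, snapshots_per_disk, disks_per_vm):
    """Estimate X, the number of identical VMs the COW heap can hold.

    heap_mb            -- COW.COWMaxHeapSizeMB value in MB
    disk_bytes         -- B, size of each virtual disk in bytes
    snapshots_per_disk -- S, snapshots on each virtual disk
    disks_per_vm       -- Y, virtual disks per VM
    """
    heap_bytes = heap_mb * 1048576
    # Heap bytes consumed by the root entries of one disk and its snapshots.
    root_bytes_per_disk = (disk_bytes / GDE_COVERAGE_BYTES
                           * BYTES_PER_ROOT_ENTRY * snapshots_per_disk)
    return ROOT_ENTRY_SHARE * heap_bytes / (root_bytes_per_disk * disks_per_vm)


# Default 192 MB heap, two 100 GB disks per VM, 3 snapshots on each disk.
print(max_powered_on_vms(192, 100 * 1024**3, 3, 2))
```

The result is fractional, so round down to get the number of whole virtual machines.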

Let’s take an example of a virtual machine that has 2 vDisks, where each disk is 100 GB in size (107374182400 bytes to fit it into the formula) with 3 snapshots on each vDisk, and the COW heap size is configured at 256 MB.

So to get our number of vm’s that can be powered on in this case, we take:

(75 /100 * 268435456) / ((107374182400 / (2 * 1048576) * 4 * 3) * 2)

(75 /100 * 268435456) / ((107374182400 / 2097152 * 4 * 3) * 2)

201326592 / (614400 * 2)

201326592 / 1228800

For a total of: 163.84 or roughly 163 such virtual machines that can be powered up with these settings and current snapshots.

What’s interesting is that with each disk that is added to this formula, the number of available virtual machines that you can power up goes down considerably. Let’s just put one more disk in this formula to take a look:

(75 /100 * 268435456) / ((107374182400 / (2 * 1048576) * 4 * 3) * 3)

201326592 / ((107374182400 / 2097152 * 4 * 3) * 3)

201326592 / (614400 * 3)

201326592 / 1843200

Now we can only power up 109 of these machines with just one more disk! Imagine what your figures will look like with VMs that have 5 or 6 virtual disks on them. Yet another reason why snapshots should be temporary!
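You don’t have to imagine it. Here is a rough sketch that repeats the calculation above for 1 through 6 disks, using the same example values (256 MB heap, 100 GB disks, 3 snapshots per disk). The helper name is hypothetical, and integer math is used so the results are rounded down to whole virtual machines.

```python
HEAP_BYTES = 256 * 1048576          # 256 MB COW heap from the example
DISK_BYTES = 107374182400           # 100 GB per virtual disk
SNAPSHOTS_PER_DISK = 3
GDE_COVERAGE_BYTES = 2 * 1048576
BYTES_PER_ROOT_ENTRY = 4


def whole_vms(disks_per_vm):
    """Whole VMs that fit in the root entry share (75%) of the heap."""
    usable_bytes = HEAP_BYTES * 75 // 100
    per_disk = (DISK_BYTES // GDE_COVERAGE_BYTES
                * BYTES_PER_ROOT_ENTRY * SNAPSHOTS_PER_DISK)
    return usable_bytes // (per_disk * disks_per_vm)


for disks in range(1, 7):
    print(disks, "disk(s):", whole_vms(disks), "VMs")
# 1 disk(s): 327 VMs
# 2 disk(s): 163 VMs
# 3 disk(s): 109 VMs
# 4 disk(s): 81 VMs
# 5 disk(s): 65 VMs
# 6 disk(s): 54 VMs
```

Going from 2 disks to 6 disks cuts the number of machines to roughly a third, which is exactly what you would expect since the disk count sits in the denominator.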


On to the adjustments

If you find yourself in the unwanted predicament that your machines will not power on due to low heap, you may want to increase it slightly, but you should never just crank it to the maximum allowed.

Turn it up to 11

Cranking it to the maximum may have an adverse effect on your host(s) and may make matters worse! Start off by just doubling what is currently allocated.
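If you would rather size the increase than guess, the formula from earlier in this post can be turned around to estimate how much heap a given number of machines would need. This is only a sketch under the assumption that the formula and its 75% root entry share hold for your hosts; the function name is hypothetical and the result is an estimate, not a VMware recommendation.

```python
GDE_COVERAGE_BYTES = 2 * 1048576
BYTES_PER_ROOT_ENTRY = 4


def required_heap_mb(vm_count, disk_bytes, snapshots_per_disk, disks_per_vm):
    """Estimate the COW heap (in MB, rounded up) needed for vm_count VMs."""
    # Root entries per disk, rounded up so partially covered regions count.
    entries = -(-disk_bytes // GDE_COVERAGE_BYTES)
    root_bytes = (entries * BYTES_PER_ROOT_ENTRY * snapshots_per_disk
                  * disks_per_vm * vm_count)
    # Root entries may only use 75% of the heap, so scale up by 100/75.
    heap_bytes = -(-root_bytes * 100 // 75)
    return -(-heap_bytes // 1048576)


# 200 VMs, each with two 100 GB disks and 3 snapshots per disk.
print(required_heap_mb(200, 107374182400, 3, 2))  # 313
```

Compare the estimate with your current value before changing anything, and keep any increase modest.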

Increasing this through the Virtual Infrastructure Client can be done here:

COW Heap Size VIC

You can also do this via command line by executing these commands.

To see the current value:

~ # esxcfg-advcfg -g /COW/COWMaxHeapSizeMB
Value of COWMaxHeapSizeMB is 192

To change the value:

~ # esxcfg-advcfg -s 256 /COW/COWMaxHeapSizeMB
Value of COWMaxHeapSizeMB is 256MB

*The vSphere host must be restarted to take advantage of the new heap size!


Shellshock / BASH VMware Patches

I’m sure everyone is aware of yet another vulnerability that was discovered just the other week, and this time it targets the BASH shell that is on almost every Unix system. As this pertains to our VMware environments, there is the good, the bad and the ugly.

The Good:

Ever since VMware went to ESXi for the vSphere platform, they have removed the underlying dependency on a full-blown Unix environment, and vSphere hosts now run the Ash shell (BusyBox) for commands. This encompasses all versions of ESXi (vSphere). Another good part about this is that many of these patches are very easy to apply and don’t take a lot of work. I patched a number of test virtual appliances within 20-30 minutes.

Since this issue is primarily Unix based, VMware products that run on Windows systems are not affected.

The Bad:

Older ESX (non-integrated) environments are susceptible to this vulnerability and need to be patched. VMware has released an update to fix both ESX 4.0 and 4.1 systems. Two KB articles have been released to address these releases: 2090853 for ESX 4.0 and 2090859 for ESX 4.1. Both contain a zip file for download that is roughly 2 MB in size, and both require a reboot to implement. The patch process is done through the ‘esxupdate’ command and has the following syntax:

# esxupdate --bundle=<patch-bundle>.zip update

Alternatively, you can use VUM (VMware Update Manager) to deliver the patch to the host if it is managed by vCenter. This is done in the traditional manner.

*I would seriously consider upgrading to ESXi if your hardware supports it to eliminate these types of issues in the future. I love vSphere 4.x as much as the next person, but there are so many advancements in 5.x that are worth the upgrade – especially if you’re paying for SnS on those systems! Check the VMware support matrix for compatibility.

The Ugly:

Nearly all virtual appliances are affected. VMware published security advisory VMSA-2014-0010.5, which lists all known affected products and their patches. Many of the patches are delivered in a .pak format and can be easily uploaded and applied to those VAs through the web management console.

Virtual appliances like vCloud Automation Center need to be patched in a more traditional manner. I’m picking on this one since it requires a lengthier process due to the nature of how the appliance runs. The process for these types of VAs is:

  1. Take a snapshot of all VMs associated with the virtual appliance (whether or not they are in a vApp delivery model)
  2. Using your favorite SCP program, upload the Zip file to the VA.
  3. Extract the contents to a temp folder on the appliance.
  4. Install the patch through the RPM installer: rpm -Uvh <patch>.rpm
  5. Restart the appliance
  6. Verify that the patch has been installed and that the appliance is functioning correctly.
  7. Remove the snapshot!
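For the verification in step 6, one quick sanity check is the widely circulated test for the original Shellshock bug (CVE-2014-6271): export a crafted function definition in an environment variable and see whether bash executes the trailing command. Below is a minimal Python sketch of that test. The function name is hypothetical, it assumes Python is available wherever you run it, and it only covers the original bug, not the follow-up CVEs, so treat a clean result as a hint rather than proof.

```python
import shutil
import subprocess


def bash_is_vulnerable(bash_path=None):
    """Return True/False for the CVE-2014-6271 test, or None if bash is absent."""
    bash_path = bash_path or shutil.which("bash")
    if bash_path is None:
        return None
    # A vulnerable bash runs the command that trails the function definition.
    crafted_env = {"x": "() { :;}; echo VULNERABLE"}
    result = subprocess.run(
        [bash_path, "-c", "echo test"],
        env=crafted_env,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
    )
    return "VULNERABLE" in result.stdout


print(bash_is_vulnerable())
```

A patched appliance should report False. If it reports True, keep the snapshot around and revisit the patch before moving on.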

Some VAs are really easy, and the patch can be applied simply by logging into the Web UI, navigating to the update tab, clicking on the “check updates” radio button and then selecting “install”. That was easy!