Multiple vCPU Fault Tolerance on vSphere 6.0

It’s been a long wait, but it is finally over. VMware’s Fault Tolerance (FT) now officially supports guests with multiple vCPUs. This feature provides zero downtime, zero data loss and continuous availability for any application.

A little background on the history of FT with VMware.

VMware officially introduced this feature in vSphere 4.0 back in 2009, and it was instantly hailed as a breakthrough in zero downtime for guest operating systems and applications. This did, however, come with a price. There were only so many VMs that you could protect per host, and the overhead on the network and hosts was demanding.

I think back to when this feature became available and remember thinking to myself: wow, there are going to be a lot of business lines that will want in on that magic. I was always sad when I had to tell them about the memory and vCPU limitations, and even those that did meet the guidelines were turned away because we had already exceeded the number of guest protections per cluster.

The number of virtual machines that can be protected is now based on how many vCPUs are protected per host. The maximum remains at 4 guests or 8 FT-protected vCPUs per host (whichever comes first). These values count both primary and secondary virtual machines and their vCPUs. There is some overhead involved, depending on the workload and the number of FT-protected virtual machines. Generally, you can expect a 10-30% overhead increase. This overhead will primarily be on the network, with a minimal CPU hit on each cluster node.
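The per-host limit above can be expressed as a small check. This is only a sketch in Python, not a VMware API: the function name `can_place_ft_vm` and the list-of-vCPU-counts representation are assumptions made for illustration. It counts every FT-protected VM on the host, primary or secondary, against both caps.

```python
# Hypothetical helper, not part of any VMware SDK.
MAX_FT_VMS_PER_HOST = 4     # FT-protected VMs per host, as described above
MAX_FT_VCPUS_PER_HOST = 8   # FT-protected vCPUs per host, as described above


def can_place_ft_vm(ft_vcpus_on_host, new_vm_vcpus):
    """Return True if one more FT VM (primary or secondary) fits on the host.

    ft_vcpus_on_host: vCPU counts of the FT VMs already on the host,
                      primaries and secondaries alike.
    new_vm_vcpus:     vCPU count of the FT VM to be placed.
    """
    vm_count_ok = len(ft_vcpus_on_host) + 1 <= MAX_FT_VMS_PER_HOST
    vcpu_count_ok = sum(ft_vcpus_on_host) + new_vm_vcpus <= MAX_FT_VCPUS_PER_HOST
    return vm_count_ok and vcpu_count_ok


print(can_place_ft_vm([2, 2], 4))        # True: 3 VMs and 8 vCPUs
print(can_place_ft_vm([4, 2], 4))        # False: 10 vCPUs exceeds the cap
print(can_place_ft_vm([1, 1, 1, 1], 1))  # False: a fifth VM exceeds the cap
```

Whichever cap is reached first blocks the placement, which is why a host with four single-vCPU FT guests is already full.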

Reasons why it’s been at one vCPU for a while now

The limit on how many vCPUs could be protected lay in the lockstep mechanism that was used to keep the dormant node up to date and ready for an immediate takeover. This was known as the “Record-Replay” method, and it has been replaced with a new technology known as “Fast Checkpointing”. This new mechanism allows for multiple vCPU protection through the continuous copying/checkpointing of the virtual machine.
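To make the idea of continuous checkpointing more concrete, here is a toy model in Python. It is a conceptual illustration only and does not reflect how VMware actually implements Fast Checkpointing; the class names `ToyPrimary` and `ToySecondary` and the page-dictionary layout are invented for this example. The point it tries to show is that the secondary is kept current by copying changed state, rather than by replaying the primary’s instructions.

```python
# Toy model only: real FT checkpointing happens inside the hypervisor.
class ToySecondary:
    def __init__(self):
        self.pages = {}       # page number -> contents


class ToyPrimary:
    def __init__(self):
        self.pages = {}       # page number -> contents
        self.dirty = set()    # pages written since the last checkpoint

    def write(self, page, data):
        self.pages[page] = data
        self.dirty.add(page)

    def checkpoint(self, secondary):
        """Copy only the pages changed since the last checkpoint."""
        for page in self.dirty:
            secondary.pages[page] = self.pages[page]
        sent = len(self.dirty)
        self.dirty.clear()
        return sent


primary, secondary = ToyPrimary(), ToySecondary()
primary.write(0, "boot")
primary.write(1, "app")
primary.checkpoint(secondary)          # ships pages 0 and 1
primary.write(1, "app v2")
sent = primary.checkpoint(secondary)   # ships only page 1
print(sent, secondary.pages == primary.pages)  # 1 True
```

After each checkpoint the toy secondary holds the same state as the primary, regardless of how that state was produced.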

Some of the same rules apply:

To ensure a successful protection of virtual machines, you still need to abide by some of the basic rules for vSphere Fault Tolerance.

  • You still need to ensure that all protected machines are on hosts running the same version of vSphere. In this case, version 6.0 (of course).
  • Dedicated virtual network VMkernel portgroups must be configured for FT logging.
  • 10 Gb network links must be used.
  • As mentioned above, you will see a 10-30% network overhead increase (based on demand from the number of FT-protected machines and workload).
  • vMotion is supported for both the primary and secondary nodes of a protected VM.
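The first three rules lend themselves to a simple pre-flight check. The sketch below is written in Python against a made-up host description: the dictionary keys (`vsphere_version`, `ft_logging_portgroup`, `ft_link_gbps`) and the function name `ft_host_problems` are assumptions for illustration and do not correspond to any vSphere API object.

```python
# Hypothetical pre-flight check; keys and names are invented for this example.
REQUIRED_VERSION = "6.0"
MIN_FT_LINK_GBPS = 10


def ft_host_problems(host):
    """Return the reasons a host fails the rules above (empty list if none)."""
    problems = []
    if host.get("vsphere_version") != REQUIRED_VERSION:
        problems.append("host is not running vSphere " + REQUIRED_VERSION)
    if not host.get("ft_logging_portgroup"):
        problems.append("no dedicated VMkernel portgroup for FT logging")
    if host.get("ft_link_gbps", 0) < MIN_FT_LINK_GBPS:
        problems.append("FT logging link is slower than 10 Gb")
    return problems


good = {"vsphere_version": "6.0", "ft_logging_portgroup": "FT-Logging", "ft_link_gbps": 10}
bad = {"vsphere_version": "5.5", "ft_logging_portgroup": None, "ft_link_gbps": 1}

print(ft_host_problems(good))  # []
for reason in ft_host_problems(bad):
    print("-", reason)
```

Returning a list of reasons rather than a single yes/no makes it easier to see everything that needs fixing on a host in one pass.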

Additional Protection and options at the Storage Layer

In addition to the higher vCPU limit, the new version of Fault Tolerance also moves away from the old approach of a single storage location shared by the primary and secondary virtual machines. The new version places the files of the primary and secondary virtual machines on different storage volumes, which further protects the machine from storage failures.

Fault Tolerance on vSphere 6.0 now supports both thick and thin disk types. The previous version only supported eager-zeroed thick disks.

Another great feature is the re-protect mechanism for storage volumes that run out of space. vCenter will monitor the FT replica and spin up a new secondary VM on a new datastore.

Remaining Points

  • As before, Storage vMotion (svMotion) is not possible with VMware Fault Tolerance running multiple vCPUs.
  • vCloud Director, VSAN/VVOLs and vSphere Replication are not supported with SMP-FT virtual machines.
  • VADP (vStorage APIs for Data Protection) and its snapshot-based backups are now supported on vSphere 6.0 FT!