Hedvig Overview

I was part of the Storage Field Day 10 group last week and had a chance to visit Hedvig at their new offices in Santa Clara, CA. There is plenty of space to grow into, and they have the same friendly atmosphere as most of the places we visited.

The founder and CEO, Avinash Lakshman, spent 6-8 years building large-scale, distributed systems. He was one of the co-inventors of Amazon Dynamo and was part of the Apache Cassandra project at Facebook. He believes that traditional storage as we know it will disappear, and from what I saw at this presentation, they are building that next-generation storage platform for tomorrow's workloads.

SFD10 Hedvig Welcome

Founded in 2012, with a product launch in April of this year, they have had some time to tune the product to what the market is demanding. The operational model centers on a policy-based engine defined at the infrastructure level.

Hedvig is software decoupled from the hardware: it resides on commodity servers, which together form their distributed storage platform.

One thing mentioned early in the presentation was that most of their customers don't even use the user interface, since Hedvig's platform is architected to be API driven. That should give you a good idea of what type of company is looking at this deployment model.

If you look at the way they are scaling out their storage architecture (through the multi-site architecture), you can see that they have regional protection in mind from the start. This is accomplished through their "container" based storage model, and it's not the containers that you're thinking of (read part two).

The software can be deployed within a private datacenter, in a public cloud location, or in both together, which would classify it as a hybrid architecture.

High Level Overview:

  1. I found it very interesting that they have prepared the platform for both x86 and ARM-based processors. They noted interest from some large customers in low-power ARM-based deployments.
  2. They have support for any hypervisor that is out on the market today as well as native storage provisioning to containers.
  3. Block (iSCSI), file (NFSv3 and v4) and object (S3 & Swift) protocol support.
  4. Deduplication, compression, tiering, caching and snapshots/clones.
  5. Policy driven storage that provides HA or DR on a per-application basis.
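To make the per-application policy idea concrete, here is a minimal sketch of what defining a virtual disk policy might look like. The field names and value ranges are hypothetical; Hedvig's actual policy schema is not described in this post.

```python
# Hypothetical sketch of a per-application storage policy document.
# Field names and the replication-factor range are illustrative only.

def make_vdisk_policy(name, replication_factor=3, dedup=False,
                      compressed=False, residence="HDD"):
    """Build a per-application virtual disk policy document."""
    if not 1 <= replication_factor <= 6:
        raise ValueError("replication factor out of range")
    return {
        "name": name,
        "replicationFactor": replication_factor,  # copies kept across nodes
        "dedup": dedup,                           # per-disk deduplication
        "compressed": compressed,                 # per-disk compression
        "residence": residence,                   # e.g. "HDD" or "Flash" tier
    }

# A database workload might get three replicas with dedup enabled:
policy = make_vdisk_policy("sql-prod", replication_factor=3, dedup=True)
```

The point is that HA/DR characteristics travel with the application's policy rather than being baked into an array-wide configuration.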

How it's Deployed:

  • The storage service itself is deployed on bare-metal servers or cloud based infrastructure (as mentioned above).
  • It is then presented as file and block storage through a VM, container, or bare-metal mechanism called a storage proxy, which uses a proprietary "network optimized" protocol to talk to the underlying storage service.
  • For object-based storage, clients talk natively to the service through the RESTful APIs (S3 or Swift) and do not go through the storage proxy.
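Because the object path speaks standard S3, any S3-compatible client can target the cluster directly. As a small illustration, here is how a path-style object URL is formed against a storage endpoint; the endpoint, bucket, and key below are made up for this example.

```python
# Illustrative only: the endpoint and bucket names are invented.
# A real client (boto3, awscli, s3cmd) would sign and send the request;
# note that no storage proxy sits in the object path.
from urllib.parse import quote

def object_url(endpoint, bucket, key):
    """Build the path-style URL an S3 client would PUT/GET against."""
    return f"{endpoint}/{bucket}/{quote(key)}"

url = object_url("https://hedvig.example.com:9000", "backups", "vm01/disk.img")
```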

What happens at the Storage Service Layer:

  • When writes reach the cluster, the data is distributed based on a policy that is pre-configured for that application (this policy also contains a replication element).
  • In addition, there are background tasks that balance the data across all nodes in the cluster and cache the data for reads.
  • The data is then replicated to multiple datacenters or cloud nodes for DR purposes through synchronous and asynchronous replication.
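The post doesn't detail Hedvig's placement algorithm, but given the founder's Dynamo and Cassandra background, a consistent-hash ring is a reasonable way to picture how a write can be deterministically distributed to N replica nodes per the policy. This is an illustrative sketch, not Hedvig's actual implementation.

```python
# Illustrative consistent-hashing sketch of policy-driven placement
# with replication; not Hedvig's actual algorithm.
import hashlib
from bisect import bisect

class Ring:
    def __init__(self, nodes, vnodes=64):
        # Each node owns many virtual points on the ring for balance.
        self._points = sorted(
            (self._h(f"{n}:{i}"), n) for n in nodes for i in range(vnodes)
        )
        self._keys = [p for p, _ in self._points]

    @staticmethod
    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def replicas(self, key, n):
        """First n distinct nodes clockwise from the key's hash."""
        idx = bisect(self._keys, self._h(key)) % len(self._points)
        out = []
        while len(out) < n:
            node = self._points[idx % len(self._points)][1]
            if node not in out:
                out.append(node)
            idx += 1
        return out

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
targets = ring.replicas("vdisk-42/container-7", n=3)  # 3 replicas per policy
```

The same key always maps to the same replica set, which is what lets background rebalancing and reads find the data without a central lookup.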

Look for part two that goes a bit deeper on the intricacies of the Hedvig platform.


Post Disclaimer: I was invited to attend Storage Field Day 10 as an independent participant. My accommodations, travel and meals were covered by the Tech Field Day group but I was not compensated for my time spent attending this presentation. My post is not influenced in any way by Gestalt IT or Hedvig and I am under no obligation to write this article. The companies mentioned above did not review or edit this content and it is written from purely an independent perspective.


Multiple vCPU Fault Tolerance on vSphere 6.0

It's been a long wait, but it is finally over: official support for multiple vCPUs on a guest has arrived for VMware's Fault Tolerance (FT). This feature provides zero downtime, zero data loss and continuous availability for any application.

A little background on the history of FT with VMware.

They officially introduced this feature in vSphere version 4.0 back in 2009, and it was instantly hailed as a breakthrough in zero downtime for guest operating systems and applications. This did, however, come with a price: there were only so many VMs that you could protect per host, and the overhead on the network and hosts was demanding.

I think back to when this feature became available and thought to myself: wow, there are going to be a lot of business lines that will want in on that magic. I was always sad to explain the memory and vCPU limitations, and even those that did meet the guidelines were turned away once we exceeded the number of guest protections per cluster.

The number of virtual machines that can be protected is now based on how many vCPUs are protected per host. The maximum remains 4 guests or 8 FT-protected vCPUs per host (whichever comes first), and these values include both primary and secondary virtual machines and vCPUs. There is some overhead involved, based on the workload and the number of FT-protected virtual machines; generally, you can expect a 10-30% overhead increase, primarily on the network, with a minimal CPU hit on each cluster node.
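The per-host limits above are easy to express as a quick admission check. This is a sketch of the rule as described here (4 FT guests or 8 FT vCPUs per host, whichever comes first), not a VMware API.

```python
# Sketch of the vSphere 6.0 per-host FT admission limits: at most 4
# FT-protected guests or 8 FT-protected vCPUs per host, whichever is
# reached first. Primaries and secondaries both count toward the totals.

def can_protect(host_ft_vms, host_ft_vcpus, new_vm_vcpus,
                max_vms=4, max_vcpus=8):
    """True if one more FT VM with new_vm_vcpus fits on this host."""
    return (host_ft_vms + 1 <= max_vms and
            host_ft_vcpus + new_vm_vcpus <= max_vcpus)

can_protect(3, 4, 4)   # 4th guest, total 8 vCPUs: allowed
can_protect(2, 6, 4)   # would be 10 vCPUs: rejected
```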

Why it's been limited to one vCPU for a while now

The limitation on how many vCPUs can be protected lay in the lockstep mechanism used to keep the dormant node up to date and ready for an immediate takeover. This was known as the "Record-Replay" method, and it has been replaced with a new technology known as "Fast Checkpointing". The new mechanism allows multiple-vCPU protection through continuous copying/checkpointing of the virtual machine.

Some of the same rules apply:

To ensure a successful protection of virtual machines, you still need to abide by some of the basic rules for vSphere Fault Tolerance.

  • You still need to ensure that all protected machines are on hosts running the same version of vSphere, in this case version 6.0 (of course).
  • Dedicated virtual network VMkernel portgroups must be configured for FT logging.
  • 10Gb network links must be used.
  • As mentioned above, you will see a 10-30% network overhead increase (based on demand from the number of FT protected machines and workload).
  • vMotion is supported for both the primary and secondary nodes of a protected vm.
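The checklist above can be expressed as a quick pre-flight validation. The dictionary fields here are hypothetical; a real check would query the hosts through an API such as pyVmomi.

```python
# Hedged sketch: field names are invented for illustration; a real
# check would pull these values from vCenter rather than a dict.

def ft_preflight(hosts):
    """Return a list of problems; empty means the hosts look FT-ready."""
    problems = []
    versions = {h["vsphere_version"] for h in hosts}
    if versions != {"6.0"}:
        problems.append(f"mixed or wrong vSphere versions: {sorted(versions)}")
    for h in hosts:
        if not h.get("ft_logging_vmkernel"):
            problems.append(f"{h['name']}: no dedicated FT-logging VMkernel portgroup")
        if h.get("nic_speed_gbit", 0) < 10:
            problems.append(f"{h['name']}: FT requires a 10Gb network link")
    return problems

ready = ft_preflight([{"name": "esx1", "vsphere_version": "6.0",
                       "ft_logging_vmkernel": True, "nic_speed_gbit": 10}])
```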

Additional Protection and options at the Storage Layer

In addition to the increased processor limit, the new Fault Tolerance version also moves away from its old method of a single storage location for both primary and secondary virtual machines. The new version places the virtual machine files on different storage volumes, which further protects the machine from storage failures.

Fault Tolerance on vSphere 6.0 now supports the use of thick and thin disk types; the previous version only supported eager zeroed thick.

Another great feature is the re-protect mechanism for storage volumes that run out of space: vCenter monitors the FT replica and spins up a new secondary VM on a new datastore.

Remaining Points

  • As before, svMotion is not possible with VMware Fault Tolerance running multiple vCPUs.
  • Virtual machines in vCloud Director, VSAN/VVOLs and VMware Replication are not supported on SMP-FT machines.
  • VADP (vStorage APIs for Data Protection) and snapshots are now supported on vSphere 6.0 FT!

CloudPhysics Review

March 5th 2014 at Virtualization Field Day 3

This is one company that I've been anxious to meet for a while now, and I am really glad that I got the chance at Virtualization Field Day 3. They have a unique offering in that they provide collective intelligence for IT data sets.

First up in the presentation was John Blumenthal (who happens to be the ex-director of storage at VMware). During his short introduction to the company, the slide deck had an interesting yet straight-to-the-point phrase, "Answers to primitive questions", which speaks to how the physics of things actually operate.

Progress for ROI

They believe that this progress must happen above the automation level, and that quality of service (QoS) and service level agreements (SLAs) need to be ingested into the product's analytics to determine the next course of action.

They also posed the age-old question: "Can a private cloud match the operations of a large-scale public offering?" Their answer is that all companies must be able to use the same techniques and methodologies. With CloudPhysics, this is done through an aggregation of data from all aspects of the private cloud.

How is it deployed

The product is delivered as a SaaS model through a vApp virtual appliance and is deployed into a customer's vCenter through standard techniques. The appliance is lightweight and consumes minimal resources.

How it works

The single vApp collects and scrubs the data from vCenter. Once that process is complete, the information is pushed to CloudPhysics for analysis. The information is stored in an anonymized format to meet regulatory compliance requirements such as PCI. They mentioned in the presentation that even if the information were looked at, nothing would tie the data points to any particular company.
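The post doesn't say exactly how the scrubbing works, but one common anonymization approach is a keyed hash over identifying fields before upload, so the tokens stay stable for analysis while being non-reversible. This is an illustrative sketch only, not CloudPhysics' method.

```python
# Illustrative anonymization sketch: replace identifying fields with
# stable, non-reversible tokens via a site-local keyed hash. This is
# NOT CloudPhysics' actual scrubbing implementation.
import hashlib
import hmac

def scrub(record, secret=b"site-local-key"):
    """Return a copy of record with identifying fields tokenized."""
    out = dict(record)
    for field in ("vm_name", "host", "datastore"):
        if field in out:
            out[field] = hmac.new(secret, out[field].encode(),
                                  hashlib.sha256).hexdigest()[:16]
    return out

clean = scrub({"vm_name": "finance-db01", "cpu_mhz": 2400})
# Metrics survive untouched; names become opaque but consistent tokens.
```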

5 Minutes to Analytics Delivery

One of the interesting points that they made was that you can start analyzing your collections within 5 minutes, which is something I would like to test in the lab since many products out on the market take weeks to deliver tangible results.

CloudPhysics has a datacenter simulator that they run customer data sets through to analyze and recommend changes in the environment. This service is included in the subscription pricing.

Datacenter simulator analysis can be done on a per-VM basis, and the cache performance analysis can determine the right amount of cache (and the tweaks the customer will need to make) to maximize the configuration.

The Datastore analysis tool has two primary functions:
1) It highlights the contention periods.
2) It determines which ones were affected and which ones caused the contention.
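Those two functions can be pictured with a small sketch: flag the time slices where latency spikes, then split a contended slice into the VMs that suffered and the likely culprit driving the I/O. The thresholds and heuristic here are invented for illustration; CloudPhysics' real analytics are far richer.

```python
# Illustrative sketch of the two datastore-analysis functions described
# above. Each time slice is a list of (vm_name, latency_ms, iops) readings.

def contention_periods(slices, latency_threshold_ms=30):
    """Indices of time slices where any VM saw high latency."""
    return [i for i, s in enumerate(slices)
            if any(lat > latency_threshold_ms for _, lat, _ in s)]

def victims_and_culprits(time_slice, latency_threshold_ms=30):
    """Split a contended slice into affected VMs and the likely cause
    (heuristic: the VM driving the most IOPS while others suffered)."""
    victims = [vm for vm, lat, _ in time_slice if lat > latency_threshold_ms]
    culprit = max(time_slice, key=lambda s: s[2])[0]
    return victims, culprit

slices = [
    [("web01", 5, 200), ("db01", 4, 300)],
    [("web01", 45, 150), ("db01", 6, 4000)],  # db01 flooding the datastore
]
periods = contention_periods(slices)
victims, culprit = victims_and_culprits(slices[periods[0]])
```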

Predicting Potential Outages

The product identifies problem points through hardware analysis of the compute side as well as other data points that adversely affect the virtualization environment.

We were then shown a demo that was delivered by Raj Raja from product management.

Finding vSphere Operations Hazards

Applications are called "cards" and are delivered in segments such as datastore performance, datastore space, memory, etc. Custom "decks" can be created, which are simply a collection of cards and metrics to review and analyze.

Another nice function is the ability to simulate what will occur if changes are made to the environment before you implement. This could result in reduced lab time to validate configurations for change controls.

Root cause analysis with CloudPhysics

The datastore focus is to correlate information from datastore activities (pulling in data from backups, sDRS, etc.) and then form a relationship-management structure to determine performance metrics.

I plan to have a follow up conversation with them to find out more detailed information and hopefully get this stood up in the VMbulletin lab for further analysis.