Microsoft SQL Server High Availability Options on Nutanix


Microsoft SQL Server (MSSQL) supports several High Availability (HA) options at both the host and storage level.  For the purposes of this post I will only be addressing the HA options which leverage native Windows Server Failover Clustering (WSFC)  in some form.  SQL Server also provides transactional replication through the use of a publisher and subscriber model, which some consider an HA option, but that’s a topic (and debate) for another post.

Starting with MSSQL 2012, Microsoft introduced AlwaysOn which is a combination of some existing and new functionality.  Under the AlwaysOn umbrella falls two main options, Failover Cluster Instances (FCI) and Availability Groups (AAG).

Nutanix has long supported and recommended the use of AlwaysOn Availability Groups.  AAG leverages a combination of WSFC and native database level replication to create either an HA or disaster recovery solution between instances of MSSQL.  The instances of MSSQL leveraged to support the AAG can be either standalone or clustered (in the case of Nutanix these would be standalone instances today).   The following figure provides a logical overview of an AlwaysOn Availability Group.

SQLAAG

An AAG performs replication at the database level creating “primary” and one or more “secondary” database copies.  The secondary copies are replicated using  either synchronous or asynchronous commit mode, as specified by an administrator.  Asynchronous commit is intended more as a disaster recovery or reporting solution as it implies the potential for data loss.  So for HA scenarios as we’re discussing them here, we should assume synchronous commit.  Because database replication is used, shared storage is not required and each MSSQL instance within the AAG can use its own local devices.  Additional details on AlwaysOn Availability Groups can be found here: https://msdn.microsoft.com/en-us/library/hh510230.aspx

AAGs can take advantage of the secondary databases for the purpose of read-only transactions or for backup operations.  In the context of a scale-out architecture like Nutanix, leveraging multiple copies across hypervisor hosts for distributing these kinds of operations creates an excellent solution.

While AAGs are a great solution and fit nicely with the Nutanix architecture, they may not be a good fit or even possible for certain environments.  Some of the limiting factors for adopting AAGs can include:

  • Space utilization:  Because a secondary database copy is created additional storage space will be consumed.  Some administrators may prefer a single database copy where server HA is the primary use case.
  • Synchronous commit performance:  The synchronous replication of transactions (Insert/Update/Delete…) needed for AAG replication (in the context of an HA solution) do have a performance overhead.  Administrators of latency sensitive applications may prefer not to have the additional response time of waiting for transactions to be committed to multiple SQL instances.
  • Distributed Transactions:  Some applications perform distributed transactions across databases and MSSQL instances.  Microsoft does not support the use of distributed transactions with AAGs, and by extension application vendors will not support their software which utilize distributed transactions where AAGs are present.
  • SQL Server versions:  Some environments can simply not yet upgrade to SQL 2012 or higher.  Whether it be due to current business requirements or application requirements based on qualification, many administrators have to stick with SQL 2008 (and I hope not, but maybe even earlier versions) for the time being.

In the above cases MSSQL Failover Cluster Instances are likely the better solution.  FCI have long been used as the primary means for HA with MSSQL.  FCI can be leveraged with all current versions of MSSQL and relies on shared storage to support the MSSQL instances.  The following figure provides a logical overview of Failover Cluster Instances.

SQLFCI

The shared storage used can be block (LUN) based or, starting with MSSQL 2012, SMB (file) based.  In the case of LUN based shared storage, SCSI-3 persistent reservations are used to arbitrate ownership of the shared disk resources between nodes.  The MSSQL instance utilizing specific LUNs is made dependent against those disk resources.  Additional details on AlwaysOn Failover Cluster Instances can be found here:  https://msdn.microsoft.com/en-us/library/ms189134.aspx

Until very recently Nutanix has not supported MSSQL FCI within virtual machines, whether they reside on ESXi, Hyper-V or the Nutanix Acropolis Hypervisor (AHV).  But starting with the Nutanix 4.5 release (with technical preview support in the recently posted 4.1.5 release), MSSQL FCI will be supported (note: as of the 4.5.1 release Hyper-V and ESXi are officially supported).  Nutanix will support this form of clustering using iSCSI from within the virtual machines.  In essence Nutanix virtual disks (vdisks) which support SCSI-3 persistent reservations are created within a Nutanix container.  These vdisks will be presented directly to virtual machines as LUNs, leveraging the Nutanix Controller Virtual Machines (CVM) as iSCSI targets.  The virtual machines will utilize the Microsoft iSCSI initiator service and the Multipath I/O (MPIO) capabilities native to the Windows Operating System for connectivity and path failover.  An overview of this configuration can be seen in the following diagram.

Nutanix and iSCSI

The association between virtual machine iSCSI initiators and the vdisks is managed via the concept of a Volume Group.  A volume group acts as a mapping to determine the virtual disks which can be accessed by one or multiple (in the case of clustering) iSCSI initiators.   Additional information on volume groups can be found under the Volumes API section of the Nutanix Bible: http://stevenpoitras.com/the-nutanix-bible/

Like AAG’s, MSSQL FCI may not be best suited for all environments.  Some of its drawback can include:

  • Shared storage complexity:  The configuration and maintenance of shared storage is often more complex to manage than standalone environments
  • Planned or unplanned downtime:  FCI can generally take more time to transition operation between cluster nodes than a similar AAG configuration.  Part of this downtime is to commit transactions which may have been in-flight prior to failover.  This can be somewhat mitigated with the recovery interval setting or using indirect checkpoints (https://msdn.microsoft.com/en-us/library/ms189573.aspx).
  • Separation of workloads:  AAG configurations can create multiple database copies across SQL instances for the purposes of distributed reporting or for backup offload.  An FCI cannot offer this functionality natively, although such configurations are possible via intelligent cloning methodologies that the Nutanix platform can offer.

As mentioned earlier it’s possible to configure both FCI and AAG as a part of the same solution.  So for example, if the HA capabilities of FCI are preferred, but the  replication capabilities of AAG are desired for the purposes of reporting, backup offload or disaster recovery, a blended configuration can be deployed.

With the support of shared storage clustering in 4.5, Nutanix can provide the full range of options necessary to support the broad number of use cases SQL Server can require.  I’ll have follow-on posts to detail how to configure volume group based clustering for Microsoft SQL Server.   Thanks for reading.