Hyper-V Virtual Machine Converged Networking Performance

A converged networking architecture is a great way to consolidate resources and allow for more efficient server operations. With Windows Server 2012 R2 and Hyper-V it is becoming more common to see the physical NICs within a host combined into a single load balancing and failover (LBFO) team to support the network traffic between servers. Nutanix leverages such a configuration with Hyper-V by creating an LBFO team in combination with an external virtual switch. Virtual NICs are then created on the virtual switch to support the host connections and virtual machines. The following figure depicts the default Nutanix configuration with Hyper-V, where a pair of 10Gb network adapters are used.

[Figure: default Nutanix converged networking configuration for Hyper-V with a pair of 10Gb network adapters]
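For reference, the sketch below shows how a similar converged configuration can be stood up with PowerShell. The adapter, team, and switch names are placeholders, and the teaming mode and load balancing algorithm shown are common choices for this kind of design rather than a statement of the exact Nutanix defaults.

```powershell
# Create an LBFO team from the two 10Gb physical NICs.
# Teaming mode and load balancing algorithm are illustrative choices only.
New-NetLbfoTeam -Name "NetAdapterTeam" `
                -TeamMembers "10Gb-NIC1", "10Gb-NIC2" `
                -TeamingMode SwitchIndependent `
                -LoadBalancingAlgorithm HyperVPort

# Bind an external virtual switch to the new team.
New-VMSwitch -Name "ExternalSwitch" `
             -NetAdapterName "NetAdapterTeam" `
             -AllowManagementOS $false

# Add a host virtual NIC on the switch for management traffic.
Add-VMNetworkAdapter -ManagementOS -Name "Management" -SwitchName "ExternalSwitch"
```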

Microsoft supports a variety of features that impact network performance, including Jumbo Frames, Virtual Receive Side Scaling (vRSS), Dynamic Virtual Machine Queue (DVMQ) and Large Send Offload (LSO), to name a few. There are plenty of resources online that discuss these features in detail, so I’m not going to rehash them here. Nutanix also has a networking best practices document for Windows Server 2012 R2 which touches on these features, available at the following link: http://go.nutanix.com/Microsoft-Windows-Server-Virtual-Networking-Best-Practices.html
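If you want to see how these features are currently configured on a host, the standard NetAdapter cmdlets expose them. A quick, hedged example (the adapter name is a placeholder for one of the physical team members):

```powershell
# Check queueing and offload features on a physical team member
# ("10Gb-NIC1" is a placeholder adapter name).
Get-NetAdapterVmq -Name "10Gb-NIC1"     # VMQ / Dynamic VMQ state
Get-NetAdapterRss -Name "10Gb-NIC1"     # RSS state (vRSS itself is enabled in the guest)
Get-NetAdapterLso -Name "10Gb-NIC1"     # Large Send Offload state
Get-NetAdapterAdvancedProperty -Name "10Gb-NIC1" -RegistryKeyword "*JumboPacket"
```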

It’s common for administrators to test maximum networking throughput between virtual machines (which are typically configured with a single virtual NIC, or vNIC) and subsequently be disappointed in the results. What I wanted to do with this post is review the bandwidth limitation of a single vNIC and see whether it was possible to saturate the throughput of a single 10Gb physical NIC.

There are several reputable articles that discuss the expected throughput of a single vNIC. One excellent article I’ve read recently (http://blogs.technet.com/b/networking/archive/2014/05/28/debugging-performance-issues-with-vmq.aspx) mentions how a single vNIC will utilize a single CPU core and be limited in throughput by the frequency of that core, specifically:

“You can expect anywhere from 3.5 to 4.5Gbps from a single processor but it will vary based on the workload and CPU.”

As with most things related to performance, the results will depend on the specific configuration. But these numbers seemed conservative to me, especially since vRSS can spread networking traffic across multiple CPU cores. So I decided to test what kind of networking throughput I could get between two virtual machines on a Nutanix NX3060 while disabling and enabling certain features.

To cut to the chase, the biggest factor in total throughput was whether jumbo frames were used. Without jumbo frames, CPU utilization limited total throughput to around 4.8Gbps (600 MB/s). With jumbo frames (9014 bytes) enabled, throughput maxed out at nearly 9Gbps with CPU utilization well below 100%. This makes perfect sense, as the CPU was able to send more traffic with fewer cycles thanks to the larger packet size. For this test I used IOmeter to send traffic in one direction only. I also tested with VMQ and vRSS enabled or disabled, and the results are in the “half duplex” side of the chart below.
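For anyone who wants to reproduce the jumbo frames portion of the test, the sketch below shows the general approach with PowerShell. The adapter name is a placeholder, the registry keyword and accepted values vary by NIC driver, and the setting has to be applied end to end: on the physical switch ports, on each team member, and inside the guest on its virtual NIC.

```powershell
# Enable jumbo frames (9014 bytes) on a physical team member; repeat for each
# member NIC and inside the guest OS on its NIC. Check the driver's accepted
# values with Get-NetAdapterAdvancedProperty first, as they vary by vendor.
Set-NetAdapterAdvancedProperty -Name "10Gb-NIC1" `
                               -RegistryKeyword "*JumboPacket" `
                               -RegistryValue 9014

# Verify the jumbo path end to end with a non-fragmenting ping:
# 8972 bytes of payload = 9000-byte MTU minus 20-byte IP and 8-byte ICMP headers.
ping -f -l 8972 <IP of the other VM>
```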

Because the CPU was not fully utilized when jumbo frames were enabled, and because I was only sending traffic in one direction, I decided to send and receive traffic in both directions to see when the CPU would max out. These results are in the “full duplex” section of the chart. Once the CPU was maxed out, the benefit of vRSS could be seen: multiple CPU cores were then utilized for the send-side traffic.

[Chart: network throughput and CPU utilization results, half duplex and full duplex tests]
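As a rough sketch of how vRSS comes into play: on Windows Server 2012 R2, vRSS is turned on inside the guest by enabling RSS on the virtual NIC, and it relies on VMQ being enabled on the host’s physical adapters. The adapter name below is a placeholder.

```powershell
# Inside the guest OS: enable RSS on the virtual NIC so receive processing
# can be spread across multiple vCPUs (vRSS). VMQ should remain enabled on
# the host's physical adapters for this to take effect on 2012 R2.
Enable-NetAdapterRss -Name "Ethernet"
Get-NetAdapterRss -Name "Ethernet"    # confirm the RSS state
```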

For most environments, I’d expect 4.8Gbps of network throughput for a single VM to be plenty, so I wouldn’t go enabling jumbo frames just for the sake of hitting maximums. But at the very least, it’s important to understand the relationship between network performance and the host CPU while benchmarking.