NOTE: This post was originally published on djgoosen.blogspot.com Saturday, September 13, 2014.
Overview
When it comes to systems performance, there are four classic utilization metric types we can look at pretty trivially: CPU, memory, disk and network. In the cloud, we also have to consider shared resource contention among VMs on cloud hypervisors, within pools/clusters and within tenants; this can often complicate attempts to diagnose the root cause of slowness and application timeouts. At cloud scale, especially when consuming or managing private cloud KVM/OpenStack, Xen or VMware hypervisor resources, and/or when we have visibility into the aggregate metrics of our clouds, it’s useful to keep our front sights focused on these building block concepts, because they help us simplify the problem domain and rule out components in capacity planning and troubleshooting exercises.
Metrics like CPU idle and physical memory utilization tend to be fairly straightforward to interpret: both are typically reported as percentages. Yes, caveats apply, but just by eyeballing top or a similar diagnostic tool, we usually know if these numbers are “low” or “high”, respectively.
In contrast to the first two, many disk and network metrics tend to be expressed as raw numbers per second, making it a bit harder to tell if we’re approaching the upper limits of what our system can actually handle. A number without context is just a number. What we want to know is: How many is too many?
I’ll save disk performance for a future post. Today we’re going to zoom in on network utilization, specifically packets per second (pps).
Is 40,000 pps a lot? How do we know?
Before we clarify whether we actually mean “Is 40,000 pps a lot for a VM?”, or for a hypervisor, or for some other type of network device in the cloud, and before we think through all the exceptions and intermediary network devices we might have to traverse, we should stay disciplined and first ask ourselves: What are we always going to be limited by?
There are two basic constraints here, effectively no matter what:
- Maximum packet size: 1538 bytes on the wire for standard Ethernet (a 1500-byte MTU plus 38 bytes of framing overhead)
- Network interface speed (e.g., 1Gbps, 10Gbps)
So if every packet is a full 1538 bytes, and a 1Gbps NIC on a system is handling 40,000 pps, that means that the system is pushing (1538 * 40000) * 8 = 492160000 bits per second, or ~492Mbps. Thus I think that we’ll agree the system is pushing about half of its theoretical maximum speed. (Though in reality, not every packet is created equal. Some are going to be smaller, so 40,000 pps might actually be less than 492Mbps. Maybe a lot less. But it can’t really be more.)
(packet size in bytes * pps * 8) / NIC speed in bps * 100 = % we’re interested in
A good rule of thumb is that a system with sustained network throughput utilization of > 75% is probably too busy.
So on that basis, 40,000 pps really isn’t too many pps for a single 1Gbps system to handle. It’s even less of a concern for a 10Gbps system.
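To make the arithmetic above easy to repeat, here is a minimal Python sketch of the same calculation. The function and constant names are my own for illustration; the 1538-byte packet size and the 75% rule of thumb come from the text above, and the result is an upper bound, since real packets are often smaller.

```python
# Assumption: every packet is a full-size 1538-byte packet (worst case).
MAX_PACKET_BYTES = 1538
BUSY_THRESHOLD = 0.75  # rule of thumb: sustained > 75% is probably too busy


def utilization(pps, nic_bps, packet_bytes=MAX_PACKET_BYTES):
    """Upper-bound fraction of NIC capacity used at a given packet rate."""
    return (packet_bytes * pps * 8) / nic_bps


def is_too_busy(pps, nic_bps, packet_bytes=MAX_PACKET_BYTES):
    """True when the upper-bound utilization exceeds the rule-of-thumb threshold."""
    return utilization(pps, nic_bps, packet_bytes) > BUSY_THRESHOLD


if __name__ == "__main__":
    for label, nic_bps in (("1Gbps", 1_000_000_000), ("10Gbps", 10_000_000_000)):
        u = utilization(40_000, nic_bps)
        print(f"{label}: {u:.1%} utilized, too busy: {is_too_busy(40_000, nic_bps)}")
```

Feeding in a sustained pps figure from our monitoring tool of choice gives a quick "how close to the ceiling are we?" answer for a single NIC.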
Some Gotchas in the Cloud
Many VMs running on the same hypervisor
However… 40,000 pps might actually be too many if it’s a single VM running on the same hypervisor as a bunch of other VMs, especially if they’re all pushing the same sort of pps. We might want to isolate this VM on its own hypervisor. Or scale its role horizontally onto other hypervisors. Along those lines, the business might tell us that 40,000 pps’ worth of revenue is too much for a single VM to take with it at the instant it goes down. But then we’re no longer talking about maximum pps, we’re talking about something else.
The underlying hypervisor NICs and EtherChannel
At the hypervisor level, the maximum pps really depends on the aggregate of all traffic on it, as well as its NIC speed.
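As a rough illustration of that aggregate view, the sketch below sums per-VM pps and compares the total against the hypervisor's NIC speed. The VM names and pps figures are hypothetical, and it assumes a single uplink and full-size 1538-byte packets, so it ignores link aggregation and any hypervisor-side overhead.

```python
PACKET_BYTES = 1538  # assumption: worst-case full-size packets


def hypervisor_utilization(vm_pps, nic_bps, packet_bytes=PACKET_BYTES):
    """Upper-bound fraction of the hypervisor NIC used by all VMs combined."""
    total_pps = sum(vm_pps.values())
    return (packet_bytes * total_pps * 8) / nic_bps


# Hypothetical VMs, each pushing the 40,000 pps from the example above.
vm_pps = {"vm-a": 40_000, "vm-b": 40_000, "vm-c": 40_000}

if __name__ == "__main__":
    for label, nic_bps in (("1Gbps", 1_000_000_000), ("10Gbps", 10_000_000_000)):
        u = hypervisor_utilization(vm_pps, nic_bps)
        print(f"{label} hypervisor NIC: {u:.1%} (upper bound)")
```

Three such VMs would, in the worst case, ask for more than a single 1Gbps uplink can carry, while barely registering on a 10Gbps one.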
Also relevant is the hypervisor’s EtherChannel configuration (we wouldn’t run our production VM on a hypervisor without some type of redundant links), especially in terms of actual maximum throughput (for instance, actual LACP maximum throughput can start to fall when we take into account factors like MAC-based load balancing between the links).
Additionally, implementation choices like Open vSwitch vs. Linux bridging can have an impact on effective pps. On Citrix Xen hypervisors, OVS is a common design choice. My understanding is that the default OVS flow eviction threshold is 2,500. The maximum recommended value appears to be 10,000. Reasonable people appear to disagree about whether this metric is leading, coincident or lagging to the root causes of packet loss, but in my own experience, with higher hypervisor pps we can expect to see dropped packets relative to this metric for our VM.
What else do we have to think about in the cloud?
Assuming every networking component in our cloud is configured correctly, here are some other factors that can come into play, in no particular order:
- Firewalls: Obviously these have a maximum pps too. In aggregate, we might be constrained by firewall pps at our tenant’s boundary, or even closer.
- Load balancers: Things like licensing limits, request throttling, and their own NIC/EtherChannel pps can all influence our cloud VM’s effective pps.
- Broadcast storms: Other cloud systems in the same subnet as our VM or hypervisor can saturate the network.
- UDP vs. TCP: if we’re testing maximum pps using a UDP tool, we’re likely to experience more packet loss and thus a smaller perceived maximum.
- Outbound internet bandwidth: guaranteed and burst rates will apply here.
- Jumbo frames: These have 9000-byte MTUs, so especially in our storage and database tiers, we need to remember which packet size to use in our calculations.
- sysctl network parameter tuning: These are beyond the scope of this post, but they can definitely impact the VM’s network performance.
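To illustrate the jumbo frames point from the list above, here is a small sketch that turns the earlier formula around and asks how many full-size packets per second fit on a link at line rate. It assumes that jumbo frames carry the same 38 bytes of per-packet overhead as the standard 1538-byte case (1538 minus a 1500-byte MTU); the names are mine, not from any tool.

```python
STANDARD_MTU = 1500
JUMBO_MTU = 9000
WIRE_OVERHEAD_BYTES = 1538 - STANDARD_MTU  # 38 bytes; assumed identical for jumbo frames


def max_full_size_pps(nic_bps, mtu):
    """Whole full-size packets per second that fit on the link at line rate."""
    packet_bits = (mtu + WIRE_OVERHEAD_BYTES) * 8
    return nic_bps // packet_bits


if __name__ == "__main__":
    for label, nic_bps in (("1Gbps", 1_000_000_000), ("10Gbps", 10_000_000_000)):
        std = max_full_size_pps(nic_bps, STANDARD_MTU)
        jumbo = max_full_size_pps(nic_bps, JUMBO_MTU)
        print(f"{label}: {std} pps standard, {jumbo} pps jumbo")
```

The takeaway is that the same pps figure means very different things depending on packet size: a number that is comfortable with standard packets can already be at line rate with jumbo frames.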